HTML-HTML5-Parser

lib/HTML/HTML5/Parser.pm

      </tbody></table>
    </body>
  </html>

Yes, C<< <i>World</i> >> gets hoisted up before the C<< <table> >>. This
is weird, I know, but it's how browsers do it in real life.

So what should:

  $string   = q{<b>Hello</b></td></tr> <i>World</i>};
  $fragment = $parser->parse_balanced_chunk($string);

actually return? Well, you can choose...

  $string = q{<b>Hello</b></td></tr> <i>World</i>};
  
  $frag1  = $parser->parse_balanced_chunk($string, {within=>'div'});
  say $frag1->toString; # <b>Hello</b> <i>World</i>
  
  $frag2  = $parser->parse_balanced_chunk($string, {within=>'td'});
  say $frag2->toString; # <i>World</i><b>Hello</b>

If you don't pass a "within" option, then the chunk is parsed as if it
were within a C<< <div> >> element. This is often the most sensible
option. This method will croak in three cases: if you pass something
like C<< { within => "foobar" } >> where "foobar" is not a real HTML
element name (as found in the HTML5 spec); if you pass the name of a
void element (e.g. C<< "br" >> or C<< "meta" >>); or if you pass one of
a handful of other unsupported elements (namely: C<< "noscript" >>,
C<< "noembed" >>, C<< "noframes" >>).
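The croaking behaviour described above can be guarded against with
C<eval>. The following is a minimal sketch, not part of the module; the
C<%outcome> hash and the sample chunk are illustrative names only.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $chunk  = q{<b>Hello</b>};
my %outcome;    # illustrative: records what happened for each "within" value

for my $within (qw(div td br foobar noscript)) {
    # eval traps the croak for void, unknown and unsupported element names
    my $frag = eval {
        $parser->parse_balanced_chunk($chunk, { within => $within });
    };
    $outcome{$within} = defined $frag ? 'parsed' : 'croaked';
    say sprintf '%-8s %s', $within, $outcome{$within};
}
```

Checking the return value of C<eval> rather than C<$@> keeps the sketch
independent of the exact wording of the error message.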

Note that the second time around, although we parsed the string "as
if it were within a C<< <td> >> element", the C<< <i>World</i> >>
bit did not strictly end up within the C<< <td> >> element (not
even within the C<< <table> >> element!) yet it still gets returned.
We'll call things such as this "outliers". There is a "force_within"
option which tells parse_balanced_chunk to ignore outliers:

  $frag3  = $parser->parse_balanced_chunk($string,
                                          {force_within=>'td'});
  say $frag3->toString; # <b>Hello</b>

There is a boolean option "mark_outliers" which marks each outlier
with an attribute (C<< data-perl-html-html5-parser-outlier >>) to
indicate its outlier status. Clearly, this is ignored when you use
"force_within" because no outliers are returned. Some outliers may
be XML::LibXML::Text nodes; text nodes don't have attributes, so
these will not be marked.
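As a rough sketch of how the marker attribute might be inspected, the
example below calls the method in list context and checks each
top-level node. It only looks at the returned nodes themselves, not
their descendants, and the "marked"/"unmarked" labels are illustrative.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};
my $attr   = 'data-perl-html-html5-parser-outlier';

# List context: the individual nodes are returned
my @nodes = $parser->parse_balanced_chunk(
    $string, { within => 'td', mark_outliers => 1 },
);

for my $node (@nodes) {
    # Only elements can carry the attribute; an unmarked text node
    # may still be an outlier.
    my $marked = $node->isa('XML::LibXML::Element')
              && $node->hasAttribute($attr);
    say sprintf '%-9s %s', ($marked ? 'marked' : 'unmarked'), $node->toString;
}
```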

Finally, a note on what this method returns. Normally it's an
L<XML::LibXML::DocumentFragment> object, but if you call the
method in list context, a list of the individual nodes is
returned. Alternatively you can request the data to be returned as an
L<XML::LibXML::NodeList> object:

 # Get an XML::LibXML::NodeList
 my $list = $parser->parse_balanced_chunk($str, {as=>'list'});
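The three return styles can be compared side by side. This is a small
sketch using an arbitrary sample string; the variable names are
illustrative.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $str    = q{<b>Hello</b> <i>World</i>};

# Scalar context: a single document fragment
my $frag  = $parser->parse_balanced_chunk($str);

# List context: the individual nodes
my @nodes = $parser->parse_balanced_chunk($str);

# Explicit request for a node list object
my $list  = $parser->parse_balanced_chunk($str, { as => 'list' });

say 'scalar: ', ref($frag);
say 'list:   ', scalar(@nodes), ' node(s)';
say 'as list: ', ref($list), ' holding ', $list->size, ' node(s)';
```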

The exact implementation of this method may change from version to
version, but the long-term goal will be to approach how common
desktop browsers parse HTML fragments when implementing the setter 
for DOM's C<innerHTML> attribute.

=back

The push parser and SAX-based parser are not supported. Trying
to change an option (such as recover_silently) will make
HTML::HTML5::Parser carp a warning. (But you can inspect the
options.)

=head2 Error Handling

Error handling differs from XML::LibXML's, as errors are
(bugs notwithstanding) non-fatal.

=over

=item C<error_handler>

Get/set an error handling function. Must be set to a coderef or undef.

The error handling function will be called with a single parameter, a
L<HTML::HTML5::Parser::Error> object.

=item C<errors>

Returns a list of errors that occurred during the last parse.

See L<HTML::HTML5::Parser::Error>.

=back
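A minimal sketch of both error-reporting styles follows. The sample
markup and the C<$seen> counter are illustrative; the sketch makes no
claim about which errors, if any, a given input produces.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $seen   = 0;    # illustrative counter

# The handler receives a single HTML::HTML5::Parser::Error object
$parser->error_handler(sub {
    my ($error) = @_;
    $seen++;
    say 'handler got a ', ref($error), ' object';
});

# Deliberately sloppy markup; parsing continues because errors are non-fatal
my $doc = $parser->parse_string('<p>Unclosed <b>bold</p>');

# Errors from the last parse can also be collected afterwards
my @errors = $parser->errors;
say 'handler calls: ', $seen;
say 'errors listed: ', scalar @errors;
```

Setting the handler back to C<undef> removes it, after which C<errors>
remains the only way to inspect problems.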

=head2 Additional Methods

The module provides a few methods to obtain additional, non-DOM data from
DOM nodes.

=over

=item C<dtd_public_id>

  $pubid = $parser->dtd_public_id( $doc );
  
For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the
Public Identifier of the DTD used (if any).

=item C<dtd_system_id>

  $sysid = $parser->dtd_system_id( $doc );
  
For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the
System Identifier of the DTD used (if any).
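As a sketch, the two identifier methods might be used together like
this. The HTML 4.01 doctype in the input is just sample data, and the
fallback string C<(none)> is an illustrative choice for the "if any"
case.

```perl
use strict;
use warnings;
use feature 'say';
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $html   = join '',
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" ',
    '"http://www.w3.org/TR/html4/strict.dtd">',
    '<title>Sample</title><p>Hello</p>';

my $doc = $parser->parse_string($html);

# Either identifier may be missing, so fall back to a placeholder
my $pubid = $parser->dtd_public_id($doc);
my $sysid = $parser->dtd_system_id($doc);

say 'public id: ', defined $pubid ? $pubid : '(none)';
say 'system id: ', defined $sysid ? $sysid : '(none)';
```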

=item C<dtd_element>

  $element = $parser->dtd_element( $doc );

For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the


