HTML-HTML5-Parser
lib/HTML/HTML5/Parser.pm
</tbody></table>
</body>
</html>
Yes, C<< <i>World</i> >> gets hoisted up before the C<< <table> >>. This
is weird, I know, but it's how browsers do it in real life.
So what should:
  $string   = q{<b>Hello</b></td></tr> <i>World</i>};
  $fragment = $parser->parse_balanced_chunk($string);
actually return? Well, you can choose...
  $string = q{<b>Hello</b></td></tr> <i>World</i>};
  $frag1  = $parser->parse_balanced_chunk($string, {within=>'div'});
  say $frag1->toString;   # <b>Hello</b> <i>World</i>
  $frag2  = $parser->parse_balanced_chunk($string, {within=>'td'});
  say $frag2->toString;   # <i>World</i><b>Hello</b>
If you don't pass a "within" option, then the chunk is parsed as if it
were within a C<< <div> >> element. This is often the most sensible
option. If you pass something like C<< { within => "foobar" } >>
where "foobar" is not a real HTML element name (as found in the HTML5
spec), then this method will croak; if you pass the name of a void
element (e.g. C<< "br" >> or C<< "meta" >>) then this method will
croak; there are a handful of other unsupported elements which will
croak (namely: C<< "noscript" >>, C<< "noembed" >>, C<< "noframes" >>).
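As a rough sketch of how the "within" option behaves, the following tries a few context elements and traps any croak with C<eval>. The variable names and the particular contexts chosen are only illustrative; the exact wording of the croak message is not something to rely on.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# %fragment_for is a hypothetical helper hash: context name => fragment,
# or undef when parse_balanced_chunk croaked for that context.
my %fragment_for;
for my $context (qw(div td br foobar)) {
    $fragment_for{$context} = eval {
        $parser->parse_balanced_chunk($string, { within => $context });
    };
    if (defined $fragment_for{$context}) {
        print "within => '$context' gave: ",
            $fragment_for{$context}->toString, "\n";
    }
    else {
        # 'br' is a void element and 'foobar' is not an HTML element,
        # so both are expected to end up here.
        print "within => '$context' was rejected: $@";
    }
}
```

Wrapping the call in C<eval> is mostly useful when the context name comes from user input or configuration rather than being hard-coded.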
Note that the second time around, although we parsed the string "as
if it were within a C<< <td> >> element", the C<< <i>World</i> >>
bit did not strictly end up within the C<< <td> >> element (not
even within the C<< <table> >> element!) yet it still gets returned.
We'll call things such as this "outliers". There is a "force_within"
option which tells parse_balanced_chunk to ignore outliers:
  $frag3 = $parser->parse_balanced_chunk($string,
                                         {force_within=>'td'});
  say $frag3->toString;   # <b>Hello</b>
There is a boolean option "mark_outliers" which marks each outlier
with an attribute (C<< data-perl-html-html5-parser-outlier >>) to
indicate its outlier status. Clearly, this is ignored when you use
"force_within" because no outliers are returned. Some outliers may
be XML::LibXML::Text nodes; text nodes don't have attributes, so
these will not be marked.
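Here is a minimal sketch of how "mark_outliers" might be used to separate outliers from the rest of the result. It assumes the option can be combined with "within" as described above; the array names are just for illustration.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};
my $marker = 'data-perl-html-html5-parser-outlier';

# List context: we get the individual nodes rather than a fragment.
my @nodes = $parser->parse_balanced_chunk(
    $string,
    { within => 'td', mark_outliers => 1 },
);

my (@outliers, @inside, @text);
for my $node (@nodes) {
    if ($node->isa('XML::LibXML::Element')) {
        # Only elements can carry the marker attribute.
        if ($node->hasAttribute($marker)) { push @outliers, $node }
        else                              { push @inside,   $node }
    }
    else {
        # Text (and other non-element) nodes are never marked, so we
        # cannot tell from the node alone whether they are outliers.
        push @text, $node;
    }
}

print "outlier: ", $_->toString, "\n" for @outliers;
print "inside:  ", $_->toString, "\n" for @inside;
```

Since text nodes are never marked, code that needs to discard every outlier is probably better served by "force_within".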
Finally, a note on what this method returns. Normally
it's an L<XML::LibXML::DocumentFragment> object, but if you call the
method in list context, a list of the individual nodes is
returned. Alternatively you can request the data to be returned as an
L<XML::LibXML::NodeList> object:
  # Get an XML::LibXML::NodeList
  my $list = $parser->parse_balanced_chunk($str, {as=>'list'});
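To put the three return styles side by side, here is a short sketch. The input string and variable names are made up for the example; C<size> is a standard XML::LibXML::NodeList method.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $str    = q{<b>Hello</b> <i>World</i>};

# Scalar context: a single XML::LibXML::DocumentFragment.
my $fragment = $parser->parse_balanced_chunk($str);

# List context: the individual nodes.
my @nodes = $parser->parse_balanced_chunk($str);

# Explicit request: an XML::LibXML::NodeList, even in scalar context.
my $list = $parser->parse_balanced_chunk($str, { as => 'list' });

print "scalar context: ", ref($fragment), "\n";
print "list context:   ", scalar(@nodes), " node(s)\n";
print "as => 'list':   ", ref($list), " with ", $list->size, " node(s)\n";
```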
The exact implementation of this method may change from version to
version, but the long-term goal will be to approach how common
desktop browsers parse HTML fragments when implementing the setter
for DOM's C<innerHTML> attribute.
=back
The push parser and SAX-based parser are not supported. Trying
to change an option (such as recover_silently) will make
HTML::HTML5::Parser carp a warning. (But you can inspect the
options.)
=head2 Error Handling
Error handling differs from XML::LibXML, as parse errors are
(bugs notwithstanding) non-fatal.
=over
=item C<error_handler>
Get/set an error handling function. Must be set to a coderef or undef.
The error handling function will be called with a single parameter, a
L<HTML::HTML5::Parser::Error> object.
=item C<errors>
Returns a list of errors that occurred during the last parse.
See L<HTML::HTML5::Parser::Error>.
=back
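A small sketch of both error-handling methods together. The malformed input is invented for the example, and C<to_string> is assumed to be available on L<HTML::HTML5::Parser::Error> objects; check that class's documentation before relying on it.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Collect errors as they are reported. The handler receives a single
# HTML::HTML5::Parser::Error object.
my @seen;
$parser->error_handler(sub {
    my ($error) = @_;
    push @seen, $error;
});

# Deliberately sloppy markup: no DOCTYPE, unclosed elements.
my $doc = $parser->parse_string('<p>Unclosed <b>bold<p>Next');

# Errors are non-fatal, so we still get a document back and can
# inspect what went wrong afterwards.
my @errors = $parser->errors;
printf "handler saw %d error(s), errors() returned %d\n",
    scalar @seen, scalar @errors;

# Assumption: to_string gives a human-readable message.
print $_->to_string, "\n" for @errors;
```

Passing C<undef> to C<error_handler> removes the handler again.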
=head2 Additional Methods
The module provides a few methods to obtain additional, non-DOM data from
DOM nodes.
=over
=item C<dtd_public_id>
  $pubid = $parser->dtd_public_id( $doc );
For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the
Public Identifier of the DTD used (if any).
=item C<dtd_system_id>
  $sysid = $parser->dtd_system_id( $doc );
For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the
System Identifier of the DTD used (if any).
=item C<dtd_element>
  $element = $parser->dtd_element( $doc );
For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the