HTML-HTML5-Parser
<b>Hello</b></td></tr>
</tbody></table>
</body>
</html>
Yes, `<i>World</i>` gets hoisted up before the `<table>`. This is
weird, I know, but it's how browsers do it in real life.
So what should:
$string = q{<b>Hello</b></td></tr> <i>World</i>};
$fragment = $parser->parse_balanced_chunk($string);
actually return? Well, you can choose...
$string = q{<b>Hello</b></td></tr> <i>World</i>};
$frag1 = $parser->parse_balanced_chunk($string, {within=>'div'});
say $frag1->toString; # <b>Hello</b> <i>World</i>
$frag2 = $parser->parse_balanced_chunk($string, {within=>'td'});
say $frag2->toString; # <i>World</i><b>Hello</b>
If you don't pass a "within" option, then the chunk is parsed as if it
were within a `<div>` element. This is often the most sensible option.
If you pass something like `{ within => "foobar" }` where "foobar" is
not a real HTML element name (as found in the HTML5 spec), then this
method will croak; if you pass the name of a void element (e.g. "br"
or "meta") then this method will croak; there are a handful of other
unsupported elements which will croak (namely: "noscript", "noembed",
"noframes").
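Because an unsuitable "within" value raises an exception, callers that take the element name from elsewhere may want to trap it. The following is only a sketch, assuming HTML::HTML5::Parser is installed; the names tried and the messages printed are illustrative choices, not part of the module.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b>};

# Illustrative names: two ordinary elements, a void element, an
# unsupported element, and a name that is not in the HTML5 spec.
for my $name (qw(div td br noscript foobar)) {
    # eval traps the exception so one bad name does not end the program.
    my $frag = eval {
        $parser->parse_balanced_chunk($string, { within => $name });
    };
    if (defined $frag) {
        print "$name: ok\n";
    }
    else {
        print "$name: rejected\n";
    }
}
```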
Note that the second time around, although we parsed the string "as if
it were within a `<td>` element", the `<i>World</i>` bit did not
strictly end up within the `<td>` element (not even within the
`<table>` element!) yet it still gets returned. We'll call things such
as this "outliers". There is a "force_within" option which tells
parse_balanced_chunk to ignore outliers:
$frag3 = $parser->parse_balanced_chunk($string,
{force_within=>'td'});
say $frag3->toString; # <b>Hello</b>
There is a boolean option "mark_outliers" which marks each outlier
with an attribute (`data-perl-html-html5-parser-outlier`) to indicate
its outlier status. Clearly, this is ignored when you use
"force_within" because no outliers are returned. Some outliers may be
XML::LibXML::Text nodes; text nodes don't have attributes, so these
will not be marked.
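As a rough illustration of "mark_outliers", the sketch below parses the same string within a `<td>` and reports which returned elements carry the marker attribute. It assumes HTML::HTML5::Parser is installed; the report format is made up for this example.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};
my $attr   = 'data-perl-html-html5-parser-outlier';

# List context: the individual nodes are returned.
my @nodes = $parser->parse_balanced_chunk(
    $string, { within => 'td', mark_outliers => 1 },
);

for my $node (@nodes) {
    if ($node->isa('XML::LibXML::Element')) {
        my $status = $node->hasAttribute($attr) ? 'outlier' : 'not an outlier';
        printf "<%s>: %s\n", $node->nodeName, $status;
    }
    else {
        # Text nodes cannot carry the marker attribute.
        printf "%s: cannot be marked\n", $node->nodeName;
    }
}
```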
Finally, a note on what this method returns. Normally it's an
XML::LibXML::DocumentFragment object, but if you call the method in
list context, a list of the individual nodes is returned.
Alternatively you can request the data to be returned as an
XML::LibXML::NodeList object:
# Get an XML::LibXML::NodeList
my $list = $parser->parse_balanced_chunk($str, {as=>'list'});
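The three return styles can be compared side by side. This is a sketch under the assumption that HTML::HTML5::Parser and XML::LibXML are installed; the input string is an arbitrary example.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $str    = q{<p>One</p><p>Two</p>};

# Scalar context: a single fragment object.
my $frag = $parser->parse_balanced_chunk($str);

# List context: the individual nodes.
my @nodes = $parser->parse_balanced_chunk($str);

# Explicit request for a node list object.
my $list = $parser->parse_balanced_chunk($str, { as => 'list' });

print ref($frag), "\n";
printf "%d node(s) in list context\n", scalar @nodes;
printf "%s holding %d node(s)\n", ref($list), $list->size;
```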
The exact implementation of this method may change from version to
version, but the long-term goal will be to approach how common desktop
browsers parse HTML fragments when implementing the setter for DOM's
`innerHTML` attribute.
The push parser and SAX-based parser are not supported. Trying to change
an option (such as recover_silently) will make HTML::HTML5::Parser carp a
warning. (But you can inspect the options.)
Error Handling
Error handling is obviously different to XML::LibXML, as errors are (bugs
notwithstanding) non-fatal.
`error_handler`
Get/set an error handling function. Must be set to a coderef or undef.
The error handling function will be called with a single parameter, an
HTML::HTML5::Parser::Error object.
`errors`
Returns a list of errors that occurred during the last parse.
See HTML::HTML5::Parser::Error.
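A short sketch of both approaches follows, assuming HTML::HTML5::Parser is installed. The input is deliberately sloppy markup chosen for this example, and `to_string` is taken from the HTML::HTML5::Parser::Error documentation; check that documentation for the accessors your installed version provides.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Collect each error as it is reported during parsing.
my @seen;
$parser->error_handler(sub {
    my ($error) = @_;    # an HTML::HTML5::Parser::Error object
    push @seen, $error;
});

# No doctype and a mis-nested <b>: parsing still yields a document.
my $doc = $parser->parse_string('<p>Unclosed <b>tag</p>');

printf "handler was called %d time(s)\n", scalar @seen;

# Alternatively, ask for the errors after the parse has finished.
print $_->to_string, "\n" for $parser->errors;
```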
Additional Methods
The module provides a few methods to obtain additional, non-DOM data from
DOM nodes.
`dtd_public_id`
$pubid = $parser->dtd_public_id( $doc );
For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the Public
Identifier of the DTD used (if any).
`dtd_system_id`
$sysid = $parser->dtd_system_id( $doc );
For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the System
Identifier of the DTD used (if any).
`dtd_element`
$element = $parser->dtd_element( $doc );
For an XML::LibXML::Document which has been returned by
HTML::HTML5::Parser, using this method will tell you the root element
declared in the DTD used (if any). That is, if the document has this
doctype:
<!doctype html>
... it will return "html".
This may return the empty string if a DTD was present but did not
contain a root element; or undef if no DTD was present.
`compat_mode`
$mode = $parser->compat_mode( $doc );
Returns 'quirks', 'limited quirks' or undef (standards mode).
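The four methods above can be exercised together on one parsed document. This sketch assumes HTML::HTML5::Parser is installed; the sample markup and the output layout are illustrative only.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(
    '<!doctype html><title>Demo</title><p>Hi</p>'
);

# Each method takes the document returned by this parser.
for my $method (qw(dtd_element dtd_public_id dtd_system_id compat_mode)) {
    my $value = $parser->$method($doc);
    printf "%-14s %s\n", $method, defined $value ? "'$value'" : 'undef';
}
```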