HTML-HTML5-Parser


                <b>Hello</b></td></tr>
              </tbody></table>
            </body>
          </html>

        Yes, `<i>World</i>` gets hoisted up before the `<table>`. This is
        weird, I know, but it's how browsers do it in real life.

        So what should:

          $string   = q{<b>Hello</b></td></tr> <i>World</i>};
          $fragment = $parser->parse_balanced_chunk($string);

        actually return? Well, you can choose...

          $string = q{<b>Hello</b></td></tr> <i>World</i>};
  
          $frag1  = $parser->parse_balanced_chunk($string, {within=>'div'});
          say $frag1->toString; # <b>Hello</b> <i>World</i>
  
          $frag2  = $parser->parse_balanced_chunk($string, {within=>'td'});
          say $frag2->toString; # <i>World</i><b>Hello</b>

        If you don't pass a "within" option, then the chunk is parsed as if it
        were within a `<div>` element. This is often the most sensible option.
        The method will croak if "within" names something that is not a real
        HTML element (as found in the HTML5 spec), such as
        `{ within => "foobar" }`; if it names a void element (e.g. "br" or
        "meta"); or if it names one of a handful of other unsupported
        elements (namely: "noscript", "noembed", "noframes").

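        The croaking behaviour described above can be probed with a small
        loop. This is only a sketch: the hash %outcome and the choice of
        element names to try are illustrative, and each call is wrapped in
        eval so that a croak is recorded rather than ending the script.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# %outcome is a hypothetical helper structure, not part of the module.
my %outcome;

for my $within (qw(div td br foobar noscript)) {
    # Unsupported "within" values are documented to croak, so trap that.
    my $frag = eval {
        $parser->parse_balanced_chunk($string, { within => $within });
    };
    $outcome{$within} = defined $frag ? 'parsed' : 'croaked';
    print "$within: $outcome{$within}\n";
}
```

        Going by the description above, "div" and "td" should parse, while
        "br", "foobar" and "noscript" should be reported as croaked.
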
        Note that the second time around, although we parsed the string "as if
        it were within a `<td>` element", the `<i>World</i>` bit did not
        strictly end up within the `<td>` element (not even within the
        `<table>` element!) yet it still gets returned. We'll call things such
        as this "outliers". There is a "force_within" option which tells
        parse_balanced_chunk to ignore outliers:

          $frag3  = $parser->parse_balanced_chunk($string,
                                                  {force_within=>'td'});
          say $frag3->toString; # <b>Hello</b>

        There is a boolean option "mark_outliers" which marks each outlier
        with an attribute (`data-perl-html-html5-parser-outlier`) to indicate
        its outlier status. This option has no effect when you use
        "force_within", because no outliers are returned. Some outliers may be
        XML::LibXML::Text nodes; text nodes don't have attributes, so these
        will not be marked.
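
        As a rough sketch of how "mark_outliers" might be used, the
        following calls the method in list context and checks each element
        node for the marker attribute. The variable names are illustrative;
        hasAttribute and nodeName are ordinary XML::LibXML node methods.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $string = q{<b>Hello</b></td></tr> <i>World</i>};

# List context: the individual nodes are returned rather than a fragment.
my @nodes = $parser->parse_balanced_chunk(
    $string,
    { within => 'td', mark_outliers => 1 },
);

for my $node (@nodes) {
    if ($node->isa('XML::LibXML::Element')) {
        my $status =
            $node->hasAttribute('data-perl-html-html5-parser-outlier')
            ? 'outlier'
            : 'in context';
        printf "<%s>: %s\n", $node->nodeName, $status;
    }
    else {
        # Text nodes cannot carry the marker attribute.
        printf "%s: cannot be marked\n", ref $node;
    }
}
```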

        A last note on what gets returned by this method. Normally it's an
        XML::LibXML::DocumentFragment object, but if you call the method in
        list context, a list of the individual nodes is returned.
        Alternatively you can request the data to be returned as an
        XML::LibXML::NodeList object:

         # Get an XML::LibXML::NodeList
         my $list = $parser->parse_balanced_chunk($str, {as=>'list'});

        The exact implementation of this method may change from version to
        version, but the long-term goal will be to approach how common desktop
        browsers parse HTML fragments when implementing the setter for DOM's
        `innerHTML` attribute.

    The push parser and SAX-based parser are not supported. Trying to change
    an option (such as recover_silently) will make HTML::HTML5::Parser carp a
    warning. (But you can inspect the options.)

  Error Handling
    Error handling differs from XML::LibXML, as errors are (bugs
    notwithstanding) non-fatal.

    `error_handler`
        Get/set an error handling function. Must be set to a coderef or undef.

        The error handling function will be called with a single parameter, an
        HTML::HTML5::Parser::Error object.

    `errors`
        Returns a list of errors that occurred during the last parse.

        See HTML::HTML5::Parser::Error.
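
        A minimal sketch of both methods together, assuming a deliberately
        sloppy input string. The @seen array is a hypothetical collector,
        and printing an error object directly assumes that
        HTML::HTML5::Parser::Error stringifies usefully; check that
        module's documentation before relying on it.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;

# Collect every error object passed to the handler.
my @seen;
$parser->error_handler(sub {
    my ($error) = @_;
    push @seen, $error;
});

# No doctype and unclosed elements: parsing still yields a document.
my $doc = $parser->parse_string('<title>Broken</title><p>Unclosed <b>bold');

printf "handler was called %d time(s)\n", scalar @seen;

# The same information is available after the parse via errors().
my @errors = $parser->errors;
print "$_\n" for @errors;   # assumes error objects stringify
```

        Because errors are non-fatal, $doc is still an
        XML::LibXML::Document even though the input was malformed.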

  Additional Methods
    The module provides a few methods to obtain additional, non-DOM data from
    DOM nodes.

    `dtd_public_id`
          $pubid = $parser->dtd_public_id( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the Public
        Identifier of the DTD used (if any).

    `dtd_system_id`
          $sysid = $parser->dtd_system_id( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the System
        Identifier of the DTD used (if any).

    `dtd_element`
          $element = $parser->dtd_element( $doc );

        For an XML::LibXML::Document which has been returned by
        HTML::HTML5::Parser, using this method will tell you the root element
        declared in the DTD used (if any). That is, if the document has this
        doctype:

          <!doctype html>

        ... it will return "html".

        This may return the empty string if a DTD was present but did not
        contain a root element; or undef if no DTD was present.

    `compat_mode`
          $mode = $parser->compat_mode( $doc );

        Returns 'quirks', 'limited quirks' or undef (standards mode).
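
        The four methods above can be exercised together, as in this
        sketch. The input string and variable names are illustrative only;
        undefined return values are printed as placeholders.

```perl
use strict;
use warnings;
use HTML::HTML5::Parser;

my $parser = HTML::HTML5::Parser->new;
my $doc    = $parser->parse_string(
    '<!doctype html><title>Example</title><p>Hi</p>'
);

my $element = $parser->dtd_element($doc);
my $pubid   = $parser->dtd_public_id($doc);
my $sysid   = $parser->dtd_system_id($doc);
my $mode    = $parser->compat_mode($doc);

# Any of these may be undef, so substitute a label before printing.
printf "root element: %s\n", defined $element ? $element : '(none)';
printf "public id:    %s\n", defined $pubid   ? $pubid   : '(none)';
printf "system id:    %s\n", defined $sysid   ? $sysid   : '(none)';
printf "compat mode:  %s\n", defined $mode    ? $mode    : 'standards';
```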


