HTML-Gumbo


lib/HTML/Gumbo.pm

            die "Unknown event";
        }
    } );

Note that 'end' events are not generated for
L<void elements|http://www.w3.org/TR/html5/syntax.html#void-elements>,
for example C<hr>, C<br> and C<img>.

Accepts no additional arguments apart from the C<callback> mentioned above.

Fragment parsing still generates 'document start' and 'document end' events, which
can be handy for initializing your parsing callback.

=head2 tree

Alpha stage.

Produces a tree based on L<HTML::Element>s, like L<HTML::TreeBuilder>.

There is a major difference from HTML::TreeBuilder: this method produces
a top level element with the tag name 'document', which may have doctype, comments
and html tags as children.

Fragment parsing still produces a top level 'document' element, as a fragment
can be a list of tags, for example: '<p>hello</p><p>world</p>'.

Yes, it's not ready to use as a drop-in replacement for HTML::TreeBuilder. Patches
are welcome as I don't use this formatter at the moment. Note that it's hard
to get rid of the top level element because of the situations described above.
So a reasonable idea is to write an HTML::Gumbo::Document class that either subclasses
L<HTML::Element> or implements a small subset of its methods.

=head1 CHARACTER ENCODING OF THE INPUT

The C parser works only with UTF-8, so you have several options to make
sure the input is UTF-8. First of all, define the C<input_is> argument:

=over 4

=item string

Input is a Perl string, for example one obtained from L<HTTP::Response/decoded_content>.
This is the default value.

    $gumbo->parse( decode_utf8($octets) );

=item octets

Input is octets. A partial implementation of the
L<encoding sniffing algorithm|http://www.w3.org/TR/html5/syntax.html#encoding-sniffing-algorithm>
is used. The first of the following that applies wins:

=over 4

=item C<encoding> argument

Use it to hardcode a specific encoding.

    $gumbo->parse( $octets, input_is => 'octets', encoding => 'latin-1' );

=item BOM

UTF-8/UTF-16 BOMs are checked.

=item C<encoding_content_type> argument

Encoding from the transport layer, i.e. the charset in the Content-Type header.

    $gumbo->parse( $octets, input_is => 'octets', encoding_content_type => 'latin-1' );

=item Prescan

Not implemented, follow L<issue 58|https://github.com/google/gumbo-parser/issues/58>.

HTML5 defines a L<prescan algorithm|http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding>
that extracts the encoding from meta tags in the head.

It would be cool to get it in the C library, but I will accept a patch that implements it in pure Perl.

=item C<encoding_tentative> argument

The likely encoding for this page, e.g. based on the encoding of the
page when it was last visited.

    $gumbo->parse( $octets, input_is => 'octets', encoding_tentative => 'latin-1' );

=item nested browsing context

Not implemented. Fragment parsing with or without context is not implemented. The parser
also has no origin information, so this won't be implemented.

=item autodetection

Not implemented.

Can be implemented using L<Encode::Detect::Detector>. Patches are welcome.

=item otherwise

It B<dies>.

=back

=item C<utf8>

Use C<utf8> as C<input_is> when you are sure the input is UTF-8 encoded octets.
No pre-processing is done at all. This should only be used on trusted input or
on input that has already been preprocessed.

=back
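
The "first thing wins" order described for C<octets> can be sketched in pure
Perl as follows. This is only an illustration of the documented precedence,
not the module's actual code: C<resolve_encoding> is a hypothetical helper
name, and the exact BOM byte sequences checked here are an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of HTML::Gumbo. It mirrors the documented
# precedence for input_is => 'octets' and returns an encoding name.
sub resolve_encoding {
    my ($octets, %args) = @_;

    # 1. An explicit 'encoding' argument always wins.
    return $args{encoding} if defined $args{encoding};

    # 2. Byte order marks (assumed set: UTF-8, UTF-16BE, UTF-16LE).
    return 'UTF-8'    if $octets =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $octets =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $octets =~ /\A\xFF\xFE/;

    # 3. Charset reported by the transport layer (Content-Type header).
    return $args{encoding_content_type}
        if defined $args{encoding_content_type};

    # 4. Prescan of meta tags: documented as not implemented, so skipped.

    # 5. Tentative encoding, e.g. remembered from a previous visit.
    return $args{encoding_tentative} if defined $args{encoding_tentative};

    # 6. Nested browsing context and autodetection are not implemented,
    #    so there is nothing left to try.
    die "Unable to determine the input encoding\n";
}

# A BOM outranks the tentative encoding but not an explicit one.
my $with_bom = "\xEF\xBB\xBF<p>hello</p>";
my $sniffed  = resolve_encoding( $with_bom, encoding_tentative => 'latin-1' );
my $forced   = resolve_encoding( $with_bom, encoding => 'latin-1' );
```

Here C<$sniffed> ends up as C<UTF-8> and C<$forced> as C<latin-1>; with no
BOM and no encoding arguments at all the helper dies, matching the
"otherwise" case above.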

=cut

sub new {
    my $proto = shift;
    return bless {@_}, ref($proto) || $proto;
}

sub parse {
    my $self = shift;
    my $what = shift;
    my %args = @_;


