HTML-Gumbo
lib/HTML/Gumbo.pm
die "Unknown event";
}
} );
Note that 'end' events are not generated for
L<void elements|http://www.w3.org/TR/html5/syntax.html#void-elements>,
for example C<hr>, C<br> and C<img>.
No additional arguments are accepted besides the C<callback> mentioned above.
Fragment parsing still generates 'document start' and 'document end' events, which
can be handy for initializing your parsing callback.
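The following is a minimal sketch of a callback that relies on both points above: it resets its state on 'document start' and never waits for an 'end' event after a void element. The helper name C<make_outline_callback>, the list of void tags, and the assumed event argument shapes (tag name as the first argument after the event name) are illustrative assumptions, and the event stream is simulated by hand rather than produced by the parser.

```perl
use strict;
use warnings;

# Void elements never produce an 'end' event (list taken from the HTML5 spec).
my %VOID = map { $_ => 1 }
    qw(area base br col embed hr img input link meta source track wbr);

# Hypothetical helper: returns a callback that records an indented outline
# of start tags into the array referenced by $lines.
sub make_outline_callback {
    my ($lines) = @_;
    my $depth = 0;
    return sub {
        my $event = shift;
        if ( $event eq 'document start' ) {
            # Fires for fragments too, so it is a safe place to reset state.
            $depth  = 0;
            @$lines = ();
        }
        elsif ( $event eq 'start' ) {
            my ($tag) = @_;    # assumed: tag name is the first argument
            push @$lines, ( '  ' x $depth ) . $tag;
            # No 'end' will follow a void element, so do not descend.
            $depth++ unless $VOID{ lc $tag };
        }
        elsif ( $event eq 'end' ) {
            $depth--;
        }
        # Other events (text, comments, 'document end', ...) are ignored here.
    };
}

# Hand-written, simplified event stream for '<p>a<br>b</p><hr>';
# the empty array refs are placeholders for attributes.
my @outline;
my $cb = make_outline_callback( \@outline );
$cb->(@$_) for (
    [ 'document start' ],
    [ 'start', 'p', [] ],
    [ 'text',  'a' ],
    [ 'start', 'br', [] ],
    [ 'text',  'b' ],
    [ 'end',   'p' ],
    [ 'start', 'hr', [] ],
    [ 'document end' ],
);
print "$_\n" for @outline;
```

With a real parser the same closure would be passed as the C<callback> argument; the stream it receives will likely contain more events than the simplified one shown here.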
=head2 tree
Alpha stage.
Produces a tree based on L<HTML::Element>s, like L<HTML::TreeBuilder>.
There is a major difference from HTML::TreeBuilder: this method produces
a top level element with tag name 'document', which may have doctype, comments
and html tags as children.
Fragment parsing still produces a top level 'document' element, as a fragment
can be a list of tags, for example: '<p>hello</p><p>world</p>'.
It is not ready to be used as a drop-in replacement for HTML::TreeBuilder. Patches
are welcome, as I don't use this formatter at the moment. Note that it's hard
to get rid of the top level element because of the situations described above.
A reasonable approach would be to write an HTML::Gumbo::Document class that is either a subclass
of L<HTML::Element> or implements a small subset of the methods of HTML::Element.
=head1 CHARACTER ENCODING OF THE INPUT
The C parser works only with UTF-8, so there are several options to make
sure the input is UTF-8. First of all, define the C<input_is> argument:
=over 4
=item string
Input is a Perl string, for example one obtained from L<HTTP::Response/decoded_content>.
This is the default value.
$gumbo->parse( decode_utf8($octets) );
=item octets
Input is octets. A partial implementation of the
L<encoding sniffing algorithm|http://www.w3.org/TR/html5/syntax.html#encoding-sniffing-algorithm>
is used. The first source in the list below that yields an encoding wins:
=over 4
=item C<encoding> argument
Use it to hardcode a specific encoding.
$gumbo->parse( $octets, input_is => 'octets', encoding => 'latin-1' );
=item BOM
UTF-8/UTF-16 BOMs are checked.
=item C<encoding_content_type> argument
Encoding from the transport layer, i.e. the charset in the Content-Type header.
$gumbo->parse( $octets, input_is => 'octets', encoding_content_type => 'latin-1' );
=item Prescan
Not implemented, follow L<issue 58|https://github.com/google/gumbo-parser/issues/58>.
HTML5 defines a L<prescan algorithm|http://www.w3.org/TR/html5/syntax.html#prescan-a-byte-stream-to-determine-its-encoding>
that extracts the encoding from meta tags in the head.
It would be nice to have it in the C library, but I will accept a patch that implements it in pure Perl.
=item C<encoding_tentative> argument
The likely encoding for this page, e.g. based on the encoding of the
page when it was last visited.
$gumbo->parse( $octets, input_is => 'octets', encoding_tentative => 'latin-1' );
=item nested browsing context
Not implemented. Fragment parsing with or without context is not implemented. The parser
also has no origin information, so this will not be implemented.
=item autodetection
Not implemented.
Can be implemented using L<Encode::Detect::Detector>. Patches are welcome.
=item otherwise
It B<dies>.
=back
=item C<utf8>
Use C<utf8> as C<input_is> when you're sure the input is UTF-8 encoded octets.
No pre-processing is done at all. This should only be used on trusted input or
when the input has already been preprocessed.
=back
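The order of precedence described for C<octets> input can be sketched in pure Perl as follows. This is an illustration only and not the module's actual implementation: the helper names C<guess_encoding> and C<octets_to_string> are hypothetical, and the steps documented as not implemented (prescan, nested browsing context, autodetection) are simply skipped.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Hypothetical helper, not part of HTML::Gumbo: returns the name of the
# encoding chosen by the documented precedence, or dies if none applies.
sub guess_encoding {
    my ( $octets, %args ) = @_;

    # 1. An explicit 'encoding' argument always wins.
    return $args{encoding} if defined $args{encoding};

    # 2. UTF-8 / UTF-16 byte order marks.
    return 'UTF-8'    if $octets =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $octets =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $octets =~ /\A\xFF\xFE/;

    # 3. Charset reported by the transport layer.
    return $args{encoding_content_type}
        if defined $args{encoding_content_type};

    # 4. Tentative encoding, e.g. remembered from a previous visit.
    return $args{encoding_tentative} if defined $args{encoding_tentative};

    # 5. Nothing matched.
    die "Unable to determine input encoding\n";
}

# Hypothetical helper: decodes octets into a Perl string using the guess.
sub octets_to_string {
    my ( $octets, %args ) = @_;
    my $name = guess_encoding( $octets, %args );
    my $enc  = find_encoding($name)
        or die "Unknown encoding '$name'\n";
    my $string = $enc->decode($octets);
    $string =~ s/\A\x{FEFF}//;    # drop a decoded BOM character, if any
    return $string;
}

# Usage: a BOM beats the transport layer charset.
my $string = octets_to_string(
    "\xEF\xBB\xBFcaf\xC3\xA9",
    encoding_content_type => 'iso-8859-1',
);
```

A string decoded this way could then be handed to the parser with the default C<input_is> of C<string>.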
=cut
sub new {
    my $proto = shift;
    return bless {@_}, ref($proto) || $proto;
}
sub parse {
    my $self = shift;
    my $what = shift;
    my %args = @_;