HTML5-DOM

lib/HTML5/DOM.pod  view on Meta::CPAN

 my $entry = $selector->entry(0);
 print Dumper($entry->specificity); # {a => 0, b => 1, c => 2}

=head3 specificityArray

 my $specificity = $entry->specificityArray;

Get specificity as an array C<[a, b, c]> (ordered by weight).

 my $css = HTML5::DOM::CSS->new;
 my $selector = $css->parseSelector('body div.red, body span.blue');
 my $entry = $selector->entry(0);
 print Dumper($entry->specificityArray); # [0, 1, 2]
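
Because the array is ordered by weight, two entries can be compared element by element. The following is a minimal sketch: C<compare_specificity> is a hypothetical helper, not part of HTML5::DOM. Both entries below consist of one class and two type selectors, so they are expected to compare as equal.

 use strict;
 use warnings;
 use HTML5::DOM;

 # Hypothetical helper: compare two [a, b, c] arrays, most significant weight first.
 # Returns -1, 0 or 1, like the <=> operator.
 sub compare_specificity {
     my ($x, $y) = @_;
     for my $i (0 .. 2) {
         my $cmp = $x->[$i] <=> $y->[$i];
         return $cmp if $cmp;
     }
     return 0;
 }

 my $css = HTML5::DOM::CSS->new;
 my $selector = $css->parseSelector('body div.red, body span.blue');
 my $first = $selector->entry(0)->specificityArray;
 my $second = $selector->entry(1)->specificityArray;
 print compare_specificity($first, $second), "\n";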


=head1 HTML5::DOM::Encoding

Encoding detection.

For the list of available encodings, see L</ENCODINGS>.

=head3 id2name

 my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);

Get encoding name by id.

 print HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding->UTF_8); # UTF-8

=head3 name2id

 my $encoding_id = HTML5::DOM::Encoding::name2id($encoding);

Get id by name.
 
 print HTML5::DOM::Encoding->UTF_8;             # 0
 print HTML5::DOM::Encoding::name2id("UTF-8");  # 0
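
Several parser options accept either an encoding name or an id, so it can be handy to normalize both forms to an id. The following is a minimal sketch: C<to_encoding_id> is a hypothetical helper, not part of HTML5::DOM, and it assumes that any numeric-looking value is already an id.

 use strict;
 use warnings;
 use Scalar::Util qw(looks_like_number);
 use HTML5::DOM;

 # Hypothetical helper: accept an encoding name or id, return an id.
 sub to_encoding_id {
     my ($encoding) = @_;
     # Assumption: numeric values are already encoding ids.
     return $encoding if looks_like_number($encoding);
     return HTML5::DOM::Encoding::name2id($encoding);
 }

 print to_encoding_id("UTF-8"), "\n";
 print to_encoding_id(HTML5::DOM::Encoding->UTF_8), "\n";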

=head3 detectAuto

 my ($encoding_id, $new_text) = HTML5::DOM::Encoding::detectAuto($text, $max_length = 0);

Automatically detect the text encoding, trying the following methods in this order:

=over

=item *

L<detectByPrescanStream|/detectByPrescanStream>

=item *

L<detectBomAndCut|/detectBomAndCut>

=item *

L<detect|/detect>

=back

On success, returns a list with the encoding id and the new text without the BOM.

On failure, the encoding id equals HTML5::DOM::Encoding->NOT_DETERMINED.

 my ($encoding_id, $new_text) = HTML5::DOM::Encoding::detectAuto("ололо");
 my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
 print $encoding; # UTF-8
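
A typical use of the detected id is to decode raw bytes into characters. The following is a minimal sketch under stated assumptions: C<decode_detected> is a hypothetical helper, the core C<Encode> module is assumed to recognize the name returned by C<id2name> (it dies otherwise), and the original bytes are used if no new text is returned.

 use strict;
 use warnings;
 use Encode ();
 use HTML5::DOM;

 # Hypothetical helper: detect the encoding of raw bytes and decode them.
 sub decode_detected {
     my ($bytes, $fallback) = @_;
     my ($encoding_id, $new_text) = HTML5::DOM::Encoding::detectAuto($bytes);
     # Use the caller-supplied fallback name when detection fails.
     my $name = $encoding_id == HTML5::DOM::Encoding->NOT_DETERMINED
         ? $fallback
         : HTML5::DOM::Encoding::id2name($encoding_id);
     # Assumption: fall back to the original bytes if no new text was returned.
     $new_text = $bytes unless defined $new_text;
     return Encode::decode($name, $new_text);
 }

 my $chars = decode_detected("\xEF\xBB\xBFhello", "UTF-8");
 print $chars, "\n";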

=head3 detect

 my $encoding_id = HTML5::DOM::Encoding::detect($text, $max_length = 0);

Detect text encoding. A single method that combines L<detectCyrillic|/detectCyrillic> and L<detectUnicode|/detectUnicode>.

Returns the encoding id on success, or HTML5::DOM::Encoding->NOT_DETERMINED on failure.

 my $encoding_id = HTML5::DOM::Encoding::detect("ололо");
 my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
 print $encoding; # UTF-8

=head3 detectCyrillic

 my $encoding_id = HTML5::DOM::Encoding::detectCyrillic($text, $max_length = 0);

Detect the encoding of Cyrillic text (using lowercase B<trigrams>), such as C<windows-1251>, C<koi8-r>, C<iso-8859-5>, C<x-mac-cyrillic>, C<ibm866>.

Returns the encoding id on success, or HTML5::DOM::Encoding->NOT_DETERMINED on failure.

For compatibility reasons, this method also has the aliases C<detectUkrainian> and C<detectRussian>.
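
The following sketch encodes a Russian sentence to C<windows-1251> bytes and passes them to C<detectCyrillic>. The sample text and the use of the core C<Encode> module (where this encoding is named C<cp1251>) are assumptions made for illustration. Since detection relies on trigram statistics, very short inputs may not be determined, so the result is checked against C<NOT_DETERMINED>.

 use strict;
 use warnings;
 use utf8;
 use Encode ();
 use HTML5::DOM;

 # Sample text chosen for illustration, encoded to windows-1251 bytes.
 my $bytes = Encode::encode("cp1251", "Съешь ещё этих мягких французских булок, да выпей чаю");

 my $encoding_id = HTML5::DOM::Encoding::detectCyrillic($bytes);
 if ($encoding_id == HTML5::DOM::Encoding->NOT_DETERMINED) {
     print "encoding not determined\n";
 } else {
     print HTML5::DOM::Encoding::id2name($encoding_id), "\n";
 }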

=head3 detectUnicode

 my $encoding_id = HTML5::DOM::Encoding::detectUnicode($text, $max_length = 0);

Detect Unicode-family text encodings, such as C<UTF-8>, C<UTF-16LE>, C<UTF-16BE>.

Returns the encoding id on success, or HTML5::DOM::Encoding->NOT_DETERMINED on failure.

 # get UTF-16LE data for test
 my $str = "ололо";
 Encode::from_to($str, "UTF-8", "UTF-16LE");
 
 my $encoding_id = HTML5::DOM::Encoding::detectUnicode($str);
 my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
 print $encoding; # UTF-16LE

=head3 detectByPrescanStream

 my $encoding_id = HTML5::DOM::Encoding::detectByPrescanStream($text, $max_length = 0);

Detect encoding by parsing C<E<lt>metaE<gt>> tags in HTML.

Returns the encoding id on success, or HTML5::DOM::Encoding->NOT_DETERMINED on failure.

For more information, see L<https://html.spec.whatwg.org/multipage/syntax.html#prescan-a-byte-stream-to-determine-its-encoding>

 my $encoding_id = HTML5::DOM::Encoding::detectByPrescanStream('
    <meta http-equiv="content-type" content="text/html; charset=windows-1251">
 ');
 my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
 print $encoding; # WINDOWS-1251

=head3 detectByCharset

 my $encoding_id = HTML5::DOM::Encoding::detectByCharset($text, $max_length = 0);

Extract the character encoding from a string: looks for "charset=" and reads the encoding name that follows it. Returns the id of the encoding found.

For example, "text/html; charset=windows-1251" returns HTML5::DOM::Encoding->WINDOWS_1251.

Returns HTML5::DOM::Encoding->NOT_DETERMINED on failure.

For more information, see L<https://html.spec.whatwg.org/multipage/infrastructure.html#algorithm-for-extracting-a-character-encoding-from-a-meta-element>

 my $encoding_id = HTML5::DOM::Encoding::detectByCharset('text/html; charset=windows-1251');
 my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
 print $encoding; # WINDOWS-1251
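
When HTML arrives over HTTP, the C<Content-Type> header is a natural first source for the charset, with content-based detection as a fallback. The following is a minimal sketch of that chain: C<resolve_encoding_id> is a hypothetical helper, and the order of precedence is an assumption made for illustration, not something this library prescribes.

 use strict;
 use warnings;
 use HTML5::DOM;

 # Hypothetical helper: prefer the charset from a Content-Type value,
 # then fall back to detection on the document body.
 sub resolve_encoding_id {
     my ($content_type, $body) = @_;
     my $not_determined = HTML5::DOM::Encoding->NOT_DETERMINED;
     if (defined $content_type) {
         my $id = HTML5::DOM::Encoding::detectByCharset($content_type);
         return $id if $id != $not_determined;
     }
     # detectAuto returns (encoding id, new text); only the id is needed here.
     my ($id) = HTML5::DOM::Encoding::detectAuto($body);
     return $id;
 }

 my $id = resolve_encoding_id('text/html; charset=windows-1251', '<p>test</p>');
 print HTML5::DOM::Encoding::id2name($id), "\n";

The caller still has to handle a C<NOT_DETERMINED> result, for example by applying a default encoding.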

=head3 detectBomAndCut

 my ($encoding_id, $new_text) = HTML5::DOM::Encoding::detectBomAndCut($text, $max_length = 0);

Returns a list with the encoding id and the new text without the BOM.

On failure, the encoding id equals HTML5::DOM::Encoding->NOT_DETERMINED.

 my ($encoding_id, $new_text) = HTML5::DOM::Encoding::detectBomAndCut("\xEF\xBB\xBFололо");
 my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
 print $encoding; # UTF-8
 print $new_text; # ололо

=head1 NAMESPACES

=head3 Supported namespace names

 html, mathml, svg, xlink, xml, xmlns

=head3 Supported namespace id constants

 HTML5::DOM->NS_UNDEF
 HTML5::DOM->NS_HTML
 HTML5::DOM->NS_MATHML
 HTML5::DOM->NS_SVG
 HTML5::DOM->NS_XLINK
 HTML5::DOM->NS_XML
 HTML5::DOM->NS_XMLNS
 HTML5::DOM->NS_ANY
 HTML5::DOM->NS_LAST_ENTRY

=head1 TAGS

 HTML5::DOM->TAG__UNDEF
 HTML5::DOM->TAG__TEXT
 HTML5::DOM->TAG__COMMENT
 HTML5::DOM->TAG__DOCTYPE
 HTML5::DOM->TAG_A
 HTML5::DOM->TAG_ABBR
 HTML5::DOM->TAG_ACRONYM
 HTML5::DOM->TAG_ADDRESS
 HTML5::DOM->TAG_ANNOTATION_XML
 HTML5::DOM->TAG_APPLET
 HTML5::DOM->TAG_AREA
 HTML5::DOM->TAG_ARTICLE
 HTML5::DOM->TAG_ASIDE
 HTML5::DOM->TAG_AUDIO
 HTML5::DOM->TAG_B
 HTML5::DOM->TAG_BASE
 HTML5::DOM->TAG_BASEFONT
 HTML5::DOM->TAG_BDI
 HTML5::DOM->TAG_BDO
 HTML5::DOM->TAG_BGSOUND
 HTML5::DOM->TAG_BIG
 HTML5::DOM->TAG_BLINK
 HTML5::DOM->TAG_BLOCKQUOTE
 HTML5::DOM->TAG_BODY
 HTML5::DOM->TAG_BR
 HTML5::DOM->TAG_BUTTON
 HTML5::DOM->TAG_CANVAS
 HTML5::DOM->TAG_CAPTION
 HTML5::DOM->TAG_CENTER
 HTML5::DOM->TAG_CITE
 HTML5::DOM->TAG_CODE
 HTML5::DOM->TAG_COL


=item *

L<HTML5::DOM::Tree::parseFragment|/parseFragment>

=back

=head4 threads

Number of threads. If less than 2, parsing runs in single mode without threads (default 0).

This option only affects L<HTML5::DOM::new|/new>.

L<MyHTML|https://github.com/lexborisov/myhtml/blob/master/LICENSE> natively supports multithreaded parsing.

In practice, however, this mode is slower than single mode (threads=0). The resulting speed is very OS-specific and depends on the input HTML.

Not recommended unless you know what you are doing. B<Single mode is faster in 99.9% of cases.>

=head4 ignore_whitespace

Ignore whitespace tokens (default 0)

=head4 ignore_doctype

Do not parse DOCTYPE (default 0)

=head4 scripts

If 1, C<E<lt>noscriptE<gt>> contents are parsed into a single text node (default).

If 0, C<E<lt>noscriptE<gt>> contents are parsed into child nodes.

=head4 encoding

Encoding of the input HTML. If C<auto>, the library tries to determine the encoding automatically. (default "auto")

Either an encoding name or an id is allowed.

=head4 default_encoding

Default encoding. It is used only if C<encoding> is set to C<auto> and the encoding could not be determined. (default "UTF-8")

Either an encoding name or an id is allowed.

For the list of available encodings, see L</ENCODINGS>.

=head4 encoding_use_meta

Allow using C<E<lt>metaE<gt>> tags to determine the input HTML encoding. (default 1)

See L<detectByPrescanStream|/detectByPrescanStream>.

=head4 encoding_prescan_limit

Maximum length of the input string that is scanned for C<E<lt>metaE<gt>> tags when determining the encoding. (default 1024, from spec)

See L<detectByPrescanStream|/detectByPrescanStream>.

=head4 encoding_use_bom

Allow using BOM detection to determine the input HTML encoding. (default 1)

See L<detectBomAndCut|/detectBomAndCut>.

=head4 utf8

Default: C<"auto">

If 1, all returned strings have the utf8 flag set (characters).

If 0, all returned strings do not have the utf8 flag set (bytes).

If C<"auto">, the utf8 flag is determined from the input string: C<utf8=1> is enabled automatically if the input string has the utf8 flag.

C<"auto"> works only in the L<parse|/parse>, L<parseChunk|/parseChunk> and L<parseAsync|/parseAsync> methods.
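
The options above can be combined in a single hash reference. The following is a minimal sketch that spells out the documented defaults explicitly, except for C<ignore_whitespace>, which is enabled for illustration. It assumes the constructor accepts the options as a hash reference, and it uses the C<at> and C<text> methods, which are not described in this section.

 use strict;
 use warnings;
 use HTML5::DOM;

 # Assumption: options are passed to new() as a hash reference.
 my $parser = HTML5::DOM->new({
     threads                => 0,        # single mode, no threads
     ignore_whitespace      => 1,        # drop whitespace tokens
     encoding               => 'auto',   # try to detect the input encoding
     default_encoding       => 'UTF-8',  # used when detection fails
     encoding_use_meta      => 1,        # look at <meta> tags
     encoding_prescan_limit => 1024,     # how far to scan for <meta>
     encoding_use_bom       => 1,        # look at the BOM
     utf8                   => 'auto',   # follow the utf8 flag of the input
 });

 my $tree = $parser->parse('<meta charset="UTF-8"><p>test</p>');
 print $tree->at('p')->text, "\n";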


=head1 CSS PARSER OPTIONS

Options for:

=over

=item *

L<HTML5::DOM::CSS::new|/new>

=item *

L<HTML5::DOM::CSS::parseSelector|/parseSelector>

=back

=head4 utf8

Default: C<"auto">

If 1, all returned strings have the utf8 flag set (characters).

If 0, all returned strings do not have the utf8 flag set (bytes).

If C<"auto">, the utf8 flag is determined from the input string: C<utf8=1> is enabled automatically if the input string has the utf8 flag.


=head1 HTML5 SUPPORT

Tested with L<html5lib-tests|https://github.com/html5lib/html5lib-tests> (as of 2021-06-26)

 -------------------------------------------------------------
 test                        total    ok      fail    skip
 -------------------------------------------------------------
 foreign-fragment.dat        66       54      12      0
 tests26.dat                 19       16      3       0
 menuitem-element.dat        19       16      3       0
 tests11.dat                 12       11      1       0
 tests1.dat                  112      112     0       0
 tests4.dat                  6        6       0       0
 tests6.dat                  51       51      0       0
 ruby.dat                    20       20      0       0
 adoption01.dat              17       17      0       0
 tests14.dat                 6        6       0       0


