HTML5-DOM
lib/HTML5/DOM.pod view on Meta::CPAN
my $entry = $selector->entry(0);
print Dumper($entry->specificity); # {a => 0, b => 1, c => 2}
=head3 specificityArray
my $specificity = $entry->specificityArray;
Returns the specificity as an array C<[a, b, c]>, ordered by weight.
my $css = HTML5::DOM::CSS->new;
my $selector = $css->parseSelector('body div.red, body span.blue');
my $entry = $selector->entry(0);
print Dumper($entry->specificityArray); # [0, 1, 2]
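Because the array is ordered by weight, two specificities can be compared element by element. The following is a minimal sketch in plain Perl; C<compare_specificity> is a hypothetical helper written for this illustration and is not part of HTML5::DOM.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML5::DOM).
# Compares two specificity arrays [a, b, c], highest weight first.
# Returns -1, 0 or 1, like the <=> operator.
sub compare_specificity {
    my ($left, $right) = @_;
    for my $i (0 .. 2) {
        my $cmp = $left->[$i] <=> $right->[$i];
        return $cmp if $cmp;
    }
    return 0;
}

# Sort a few sample specificities from most to least specific.
my @entries = ([0, 1, 2], [1, 0, 0], [0, 0, 3]);
my @sorted  = sort { compare_specificity($b, $a) } @entries;
print join(' > ', map { '[' . join(', ', @$_) . ']' } @sorted), "\n";
# [1, 0, 0] > [0, 1, 2] > [0, 0, 3]
```

The arrays passed in could come from C<specificityArray> on each selector entry.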
=head1 HTML5::DOM::Encoding
Encoding detection.
For the list of available encodings, see L</ENCODINGS>.
=head3 id2name
my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
Get encoding name by id.
print HTML5::DOM::Encoding::id2name(HTML5::DOM::Encoding->UTF_8); # UTF-8
=head3 name2id
my $encoding_id = HTML5::DOM::Encoding::name2id($encoding);
Get id by name.
print HTML5::DOM::Encoding->UTF_8; # 0
print HTML5::DOM::Encoding::name2id("UTF-8"); # 0
=head3 detectAuto
my ($encoding_id, $new_text) = HTML5::DOM::Encoding::detectAuto($text, $max_length = 0);
Auto detect text encoding using (in this order):
=over
=item *
L<detectByPrescanStream|/detectByPrescanStream>
=item *
L<detectBomAndCut|/detectBomAndCut>
=item *
L<detect|/detect>
=back
On success, returns an array with the encoding id and the new text with the BOM removed.
On failure, the encoding id equals HTML5::DOM::Encoding->NOT_DETERMINED.
my ($encoding_id, $new_text) = HTML5::DOM::Encoding::detectAuto("ололо");
my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
print $encoding; # UTF-8
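The "in this order" behaviour can be pictured as a chain of detectors in which the first determined result wins. The sketch below is plain Perl and only illustrates that idea: C<detect_in_order>, the toy detectors, the numeric ids and the C<NOT_DETERMINED> value of -1 are all assumptions made for this example, not the library's implementation or real constants.

```perl
use strict;
use warnings;

# Hypothetical stand-in for HTML5::DOM::Encoding->NOT_DETERMINED.
use constant NOT_DETERMINED => -1;

# Run detectors in order and return the first determined result.
# Each detector returns ($encoding_id) or ($encoding_id, $new_text).
sub detect_in_order {
    my ($text, @detectors) = @_;
    for my $detector (@detectors) {
        my ($id, $new_text) = $detector->($text);
        next if $id == NOT_DETERMINED;
        return ($id, defined $new_text ? $new_text : $text);
    }
    return (NOT_DETERMINED, $text);
}

# Toy detectors with made-up ids, for illustration only.
my $by_meta = sub { $_[0] =~ /charset=/i ? (1) : (NOT_DETERMINED) };
my $by_bom  = sub {
    my $text = shift;
    return $text =~ s/\A\xEF\xBB\xBF// ? (2, $text) : (NOT_DETERMINED);
};
my $by_content = sub { (3) };

my ($id, $text) = detect_in_order("\xEF\xBB\xBFhello", $by_meta, $by_bom, $by_content);
print "$id $text\n"; # 2 hello
```

Here the meta-based detector finds nothing, so the BOM detector decides and also returns the text without the BOM.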
=head3 detect
my $encoding_id = HTML5::DOM::Encoding::detect($text, $max_length = 0);
Detects the text encoding. A single method that combines L<detectCyrillic|/detectCyrillic> and L<detectUnicode|/detectUnicode>.
Returns the encoding id on success, or HTML5::DOM::Encoding->NOT_DETERMINED on failure.
my $encoding_id = HTML5::DOM::Encoding::detect("ололо");
my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
print $encoding; # UTF-8
=head3 detectCyrillic
my $encoding_id = HTML5::DOM::Encoding::detectCyrillic($text, $max_length = 0);
Detects Cyrillic text encodings (using lowercase B<trigrams>), such as C<windows-1251>, C<koi8-r>, C<iso-8859-5>, C<x-mac-cyrillic>, C<ibm866>.
Returns the encoding id on success, or HTML5::DOM::Encoding->NOT_DETERMINED on failure.
For compatibility reasons, this method also has the aliases C<detectUkrainian> and C<detectRussian>.
=head3 detectUnicode
my $encoding_id = HTML5::DOM::Encoding::detectUnicode($text, $max_length = 0);
Detects Unicode-family text encodings, such as C<UTF-8>, C<UTF-16LE>, C<UTF-16BE>.
Returns the encoding id on success, or HTML5::DOM::Encoding->NOT_DETERMINED on failure.
use Encode;
# build UTF-16LE test data
my $str = "ололо";
Encode::from_to($str, "UTF-8", "UTF-16LE");
my $encoding_id = HTML5::DOM::Encoding::detectUnicode($str);
my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
print $encoding; # UTF-16LE
=head3 detectByPrescanStream
my $encoding_id = HTML5::DOM::Encoding::detectByPrescanStream($text, $max_length = 0);
Detects the encoding by parsing C<E<lt>metaE<gt>> tags in the HTML.
Returns the encoding id on success, or HTML5::DOM::Encoding->NOT_DETERMINED on failure.
For more information, see L<https://html.spec.whatwg.org/multipage/syntax.html#prescan-a-byte-stream-to-determine-its-encoding>
my $encoding_id = HTML5::DOM::Encoding::detectByPrescanStream('
<meta http-equiv="content-type" content="text/html; charset=windows-1251">
');
my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
print $encoding; # WINDOWS-1251
=head3 detectByCharset
my $encoding_id = HTML5::DOM::Encoding::detectByCharset($text, $max_length = 0);
Extracts a character encoding from a string: finds C<charset=> and reads the encoding name that follows it.
For example, C<"text/html; charset=windows-1251"> returns HTML5::DOM::Encoding->WINDOWS_1251.
Returns HTML5::DOM::Encoding->NOT_DETERMINED on failure.
For more information, see L<https://html.spec.whatwg.org/multipage/infrastructure.html#algorithm-for-extracting-a-character-encoding-from-a-meta-element>
my $encoding_id = HTML5::DOM::Encoding::detectByCharset("text/html; charset=windows-1251");
my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
print $encoding; # WINDOWS-1251
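To make the "find C<charset=> and read the encoding" step concrete, here is a deliberately simplified sketch in plain Perl. C<extract_charset> is a hypothetical helper for illustration only; the regular expression is an approximation and does not implement the full WHATWG algorithm linked above, nor is it how the library does the work.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML5::DOM).
# Simplified: returns the lowercased value that follows "charset=",
# or nothing if no such value is found.
sub extract_charset {
    my ($text) = @_;
    return lc $1 if $text =~ /charset\s*=\s*["']?([^"'\s;]+)/i;
    return;
}

print extract_charset('text/html; charset=windows-1251'), "\n"; # windows-1251
```

Unlike C<detectByCharset>, this sketch returns a name rather than an encoding id; the name could then be passed to L<name2id|/name2id>.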
=head3 detectBomAndCut
my ($encoding_id, $new_text) = HTML5::DOM::Encoding::detectBomAndCut($text, $max_length = 0);
Returns an array with the encoding id and the new text with the BOM removed.
On failure, the encoding id equals HTML5::DOM::Encoding->NOT_DETERMINED.
my ($encoding_id, $new_text) = HTML5::DOM::Encoding::detectBomAndCut("\xEF\xBB\xBFололо");
my $encoding = HTML5::DOM::Encoding::id2name($encoding_id);
print $encoding; # UTF-8
print $new_text; # ололо
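For readers unfamiliar with byte order marks, the idea behind "detect BOM and cut" can be sketched in plain Perl as follows. C<bom_and_cut> and its small BOM table are assumptions made for this illustration (only UTF-8 and UTF-16 are covered); the sketch returns an encoding name instead of an id and is not the library's implementation.

```perl
use strict;
use warnings;

# BOM byte sequences and the encodings they signal (illustrative subset).
my @BOMS = (
    [ "\xEF\xBB\xBF", 'UTF-8'    ],
    [ "\xFF\xFE",     'UTF-16LE' ],
    [ "\xFE\xFF",     'UTF-16BE' ],
);

# Hypothetical helper (not part of HTML5::DOM).
# Returns (encoding name or undef, text with any leading BOM removed).
sub bom_and_cut {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        if (substr($bytes, 0, length $bom) eq $bom) {
            return ($name, substr($bytes, length $bom));
        }
    }
    return (undef, $bytes);
}

my ($name, $rest) = bom_and_cut("\xFF\xFEh\x00i\x00");
print "$name ", length($rest), "\n"; # UTF-16LE 4
```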
=head1 NAMESPACES
=head3 Supported namespace names
html, mathml, svg, xlink, xml, xmlns
=head3 Supported namespace id constants
HTML5::DOM->NS_UNDEF
HTML5::DOM->NS_HTML
HTML5::DOM->NS_MATHML
HTML5::DOM->NS_SVG
HTML5::DOM->NS_XLINK
HTML5::DOM->NS_XML
HTML5::DOM->NS_XMLNS
HTML5::DOM->NS_ANY
HTML5::DOM->NS_LAST_ENTRY
=head1 TAGS
HTML5::DOM->TAG__UNDEF
HTML5::DOM->TAG__TEXT
HTML5::DOM->TAG__COMMENT
HTML5::DOM->TAG__DOCTYPE
HTML5::DOM->TAG_A
HTML5::DOM->TAG_ABBR
HTML5::DOM->TAG_ACRONYM
HTML5::DOM->TAG_ADDRESS
HTML5::DOM->TAG_ANNOTATION_XML
HTML5::DOM->TAG_APPLET
HTML5::DOM->TAG_AREA
HTML5::DOM->TAG_ARTICLE
HTML5::DOM->TAG_ASIDE
HTML5::DOM->TAG_AUDIO
HTML5::DOM->TAG_B
HTML5::DOM->TAG_BASE
HTML5::DOM->TAG_BASEFONT
HTML5::DOM->TAG_BDI
HTML5::DOM->TAG_BDO
HTML5::DOM->TAG_BGSOUND
HTML5::DOM->TAG_BIG
HTML5::DOM->TAG_BLINK
HTML5::DOM->TAG_BLOCKQUOTE
HTML5::DOM->TAG_BODY
HTML5::DOM->TAG_BR
HTML5::DOM->TAG_BUTTON
HTML5::DOM->TAG_CANVAS
HTML5::DOM->TAG_CAPTION
HTML5::DOM->TAG_CENTER
HTML5::DOM->TAG_CITE
HTML5::DOM->TAG_CODE
HTML5::DOM->TAG_COL
=item *
L<HTML5::DOM::Tree::parseFragment|/parseFragment>
=back
=head4 threads
Number of threads. If less than 2, parsing runs in single mode without threads (default 0).
This option only affects L<HTML5::DOM::new|/new>.
L<MyHTML|https://github.com/lexborisov/myhtml/blob/master/LICENSE> itself supports multithreaded parsing.
In real-world cases, however, this mode is slower than single mode (threads=0). The resulting speed is very OS-specific and depends on the input HTML.
Not recommended unless you know what you are doing. B<Single mode is faster in 99.9% of cases.>
=head4 ignore_whitespace
Ignore whitespace tokens (default 0)
=head4 ignore_doctype
Do not parse DOCTYPE (default 0)
=head4 scripts
If 1, the contents of C<E<lt>noscriptE<gt>> are parsed into a single text node (default).
If 0, the contents of C<E<lt>noscriptE<gt>> are parsed into child nodes.
=head4 encoding
Encoding of the input HTML. If C<auto>, the library tries to determine the encoding automatically. (default "auto")
Both encoding names and ids are allowed.
=head4 default_encoding
Default encoding. It is used only if C<encoding> is set to C<auto> and the encoding could not be determined. (default "UTF-8")
Both encoding names and ids are allowed.
For the list of available encodings, see L</ENCODINGS>.
=head4 encoding_use_meta
Allow using C<E<lt>metaE<gt>> tags to determine the input HTML encoding. (default 1)
See L<detectByPrescanStream|/detectByPrescanStream>.
=head4 encoding_prescan_limit
Maximum length of input scanned when determining the encoding from C<E<lt>metaE<gt>> tags. (default 1024, from the spec)
See L<detectByPrescanStream|/detectByPrescanStream>.
=head4 encoding_use_bom
Allow using BOM detection to determine the input HTML encoding. (default 1)
See L<detectBomAndCut|/detectBomAndCut>.
=head4 utf8
Default: C<"auto">
If 1, all returned strings have the utf8 flag set (chars).
If 0, all returned strings do not have the utf8 flag set (bytes).
If C<"auto">, the utf8 flag is determined from the input string: C<utf8=1> is enabled automatically if the input string has the utf8 flag.
C<"auto"> works only in the L<parse|/parse>, L<parseChunk|/parseChunk> and L<parseAsync|/parseAsync> methods.
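As a quick reference, the options described above can be collected into a single hash using their documented defaults. This is only a sketch: the variable name C<%options> is arbitrary, and passing the options as a hash reference to the constructor is an assumption noted in the comment.

```perl
use strict;
use warnings;

# Parser options from this section, set to their documented defaults.
my %options = (
    threads                => 0,
    ignore_whitespace      => 0,
    ignore_doctype         => 0,
    scripts                => 1,
    encoding               => 'auto',
    default_encoding       => 'UTF-8',
    encoding_use_meta      => 1,
    encoding_prescan_limit => 1024,
    encoding_use_bom       => 1,
    utf8                   => 'auto',
);

# Assumed calling convention (options passed as a hash reference):
#   my $parser = HTML5::DOM->new(\%options);
print "$_ = $options{$_}\n" for sort keys %options;
```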
=head1 CSS PARSER OPTIONS
Options for:
=over
=item *
L<HTML5::DOM::CSS::new|/new>
=item *
L<HTML5::DOM::CSS::parseSelector|/parseSelector>
=back
=head4 utf8
Default: C<"auto">
If 1, all returned strings have the utf8 flag set (chars).
If 0, all returned strings do not have the utf8 flag set (bytes).
If C<"auto">, the utf8 flag is determined from the input string: C<utf8=1> is enabled automatically if the input string has the utf8 flag.
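The difference between "chars" and "bytes" can be seen with core Perl alone. The sketch below uses only the core C<Encode> module and C<utf8::is_utf8>; it does not call HTML5::DOM and merely shows what the utf8 flag looks like on a string that might be passed as input.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xD0\xBE\xD0\xBB\xD0\xBE";   # UTF-8 encoded bytes of three Cyrillic letters
my $chars = decode('UTF-8', $bytes);       # the same text as a character string

# A byte string has no utf8 flag and counts bytes;
# a decoded string has the flag and counts characters.
printf "bytes: flag=%d length=%d\n", utf8::is_utf8($bytes) ? 1 : 0, length $bytes;
printf "chars: flag=%d length=%d\n", utf8::is_utf8($chars) ? 1 : 0, length $chars;
# bytes: flag=0 length=6
# chars: flag=1 length=3
```

With C<utf8 =E<gt> "auto">, input like C<$chars> would enable C<utf8=1>, while input like C<$bytes> would not.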
=head1 HTML5 SUPPORT
Tested with L<html5lib-tests|https://github.com/html5lib/html5lib-tests> (at 2021-06-26)
    -------------------------------------------------------------
    test                      total      ok    fail    skip
    -------------------------------------------------------------
    foreign-fragment.dat         66      54      12       0
    tests26.dat                  19      16       3       0
    menuitem-element.dat         19      16       3       0
    tests11.dat                  12      11       1       0
    tests1.dat                  112     112       0       0
    tests4.dat                    6       6       0       0
    tests6.dat                   51      51       0       0
    ruby.dat                     20      20       0       0
    adoption01.dat               17      17       0       0
    tests14.dat                   6       6       0       0