HTML-Encoding
lib/HTML/Encoding.pm
    my $xhtml   = exists $o{xhtml}   ? $o{xhtml}   : 1;
    my $default = exists $o{default} ? $o{default} : 1;
    my $type    = $mess->header('Content-Type');
    my $charset = encoding_from_content_type($type);
    if ($mess->content_type =~ $is_xml)
    {
        # a charset parameter in the Content-Type header wins
        return wantarray ? (protocol => $charset) : $charset
            if defined $charset;
        # special case for text/xml at user option
        return wantarray ? (protocol_default => $txml) : $txml
            if defined $txml and $mess->content_type =~ $is_t_xml;
        # otherwise inspect the entity body
        if (wantarray)
        {
            my @xml = encoding_from_xml_document($mess->content, encodings => $encodings);
            return @xml if @xml;
        }
        else
        {
            my $xml = scalar encoding_from_xml_document($mess->content, encodings => $encodings);
            return $xml if defined $xml;
        }
        # fall back to the XML default; note that this tests definedness,
        # not truth, so it also fires for default => 0
        return wantarray ? (default => $xml_d) : $xml_d if defined $default;
    }
    if ($mess->content_type =~ $is_html)
    {
        # a charset parameter in the Content-Type header wins
        return wantarray ? (protocol => $charset) : $charset
            if defined $charset;
        # otherwise inspect the entity body
        if (wantarray)
        {
            my @html = encoding_from_html_document($mess->content, encodings => $encodings, xhtml => $xhtml);
            return @html if @html;
        }
        else
        {
            my $html = scalar encoding_from_html_document($mess->content, encodings => $encodings, xhtml => $xhtml);
            return $html if defined $html;
        }
        # fall back to the HTML default (same definedness test as above)
        return wantarray ? (default => $html_d) : $html_d if defined $default;
    }
    # neither an XML nor an HTML content type: return nothing
    return;
}
1;
__END__
=pod
=head1 NAME
HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents
=head1 SYNOPSIS
use HTML::Encoding 'encoding_from_http_message';
use LWP::UserAgent;
use Encode;
my $resp = LWP::UserAgent->new->get('http://www.example.org');
my $enco = encoding_from_http_message($resp);
my $utf8 = decode($enco => $resp->content);
=head1 WARNING
The interface and implementation are guaranteed to change before this
module reaches version 1.00! Please send feedback to the author of
this module.
=head1 DESCRIPTION
HTML::Encoding helps to determine the encoding of HTML and XML/XHTML
documents...
=head1 DEFAULT ENCODINGS
Most routines need to know some suspected character encodings which
can be provided through the C<encodings> option. This option always
defaults to the $HTML::Encoding::DEFAULT_ENCODINGS array reference
which means the following encodings are considered by default:
* ISO-8859-1
* UTF-16LE
* UTF-16BE
* UTF-32LE
* UTF-32BE
* UTF-8
If you change the values or pass custom values to the routines, note
that L<Encode> must support them in order for this module to work
correctly.
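A minimal sketch of how a caller might check a custom list against L<Encode> before passing it in. It assumes only the core C<Encode::find_encoding> function; the helper name C<unsupported_encodings> and the name C<x-no-such-encoding> are hypothetical and not part of this module.

```perl
use strict;
use warnings;
use Encode ();

# Names mirror the default list documented above, plus one made-up
# name (x-no-such-encoding) that Encode is not expected to know.
my @suspected = qw(ISO-8859-1 UTF-16LE UTF-16BE UTF-32LE UTF-32BE UTF-8
                   x-no-such-encoding);

# Hypothetical helper: return the names Encode cannot resolve.
sub unsupported_encodings {
    my @names = @_;
    return grep { !defined Encode::find_encoding($_) } @names;
}

my @missing = unsupported_encodings(@suspected);
print "Not supported by Encode: @missing\n" if @missing;
```

Filtering the list up front keeps unsupported names from silently never matching during detection.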
=head1 ENCODING SOURCES
C<encoding_from_xml_document>, C<encoding_from_html_document>, and
C<encoding_from_http_message> return the encoding source and the
encoding name in list context. Possible encoding sources are:
* protocol (Content-Type: text/html;charset=encoding)
* bom (leading U+FEFF)
* xml (<?xml version='1.0' encoding='encoding'?>)
* meta (<meta http-equiv=...)
* default (default fallback value)
* protocol_default (protocol default)
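The list-context versus scalar-context convention can be illustrated with a small stand-in routine. This is only a sketch: C<charset_from_header> is a hypothetical function that looks at a Content-Type value and nothing else, and it is not how this module parses headers.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a detection routine: it only looks for a
# charset parameter in a Content-Type header value.
sub charset_from_header {
    my ($header) = @_;
    return unless defined $header;
    my ($charset) = $header =~ /;\s*charset\s*=\s*"?([^\s";]+)/i;
    return unless defined $charset;
    # List context: (source => name); scalar context: just the name.
    return wantarray ? (protocol => $charset) : $charset;
}

my ($source, $name) = charset_from_header('text/html;charset=utf-8');
my $only_name       = charset_from_header('text/html;charset=utf-8');
print "source=$source name=$name\n";
print "name=$only_name\n";
```

Callers that care where a declaration came from use list context; callers that only want a name to hand to C<decode> use scalar context.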
=head1 ROUTINES
Routines exported by this module at user option. By default, nothing
is exported.
=over 2
<p>...</p>
This would likely not detect the C<utf-8> value if HTML::Parser
does not resolve the entity. This should however only be a concern
for documents specifically crafted to break the encoding detection.
=item encoding_from_xml_document($octets [, %options])
Uses encoding_from_byte_order_mark to detect the encoding using a
byte order mark in the byte string and returns the return value of
that routine if it succeeds. Otherwise uses xml_declaration_from_octets
and encoding_from_xml_declaration and returns, in scalar context, the
encoding for which the latter routine found the most matches and, in
list context, all encodings ordered by number of occurrences. It
does not return a value if neither a byte order mark nor in-band
declarations declare a character encoding.
Examples:
+----------------------------+----------+-----------+----------+
| Input                      | Encoding | Encodings | Result   |
+----------------------------+----------+-----------+----------+
| "<?xml?>"                  | UTF-16   | default   | UTF-16BE |
| "<?xml?>"                  | UTF-16LE | default   | undef    |
| "<?xml encoding='utf-8'?>" | UTF-16LE | default   | utf-8    |
| "<?xml encoding='utf-8'?>" | UTF-16   | default   | UTF-16BE |
| "<?xml encoding='cp37'?>"  | CP37     | default   | undef    |
| "<?xml encoding='cp37'?>"  | CP37     | CP37      | cp37     |
+----------------------------+----------+-----------+----------+
Lacking a return value from this routine and higher-level protocol
information (such as protocol encoding defaults), processors would
be required to assume that the document is UTF-8 encoded.
Note however that the return value depends on the set of suspected
encodings you pass to it. For example, by default, EBCDIC encodings
would not be considered and thus for
<?xml version='1.0' encoding='cp37'?>
this routine would return the undefined value. You can modify the
list of suspected encodings using $options{encodings}.
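The first step of this routine, checking for a leading byte order mark, can be sketched in plain Perl as follows. C<bom_encoding> is a hypothetical helper written for illustration, not the module's encoding_from_byte_order_mark, and it ignores the C<encodings> option entirely.

```perl
use strict;
use warnings;

# Byte order marks, longest first so that the UTF-32LE mark is not
# mistaken for the UTF-16LE mark it begins with.
my @BOMS = (
    [ "\x00\x00\xFE\xFF" => 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00" => 'UTF-32LE' ],
    [ "\xEF\xBB\xBF"     => 'UTF-8'    ],
    [ "\xFE\xFF"         => 'UTF-16BE' ],
    [ "\xFF\xFE"         => 'UTF-16LE' ],
);

# Hypothetical helper: return the encoding name implied by a leading
# byte order mark, or nothing if the octets start with none.
sub bom_encoding {
    my ($octets) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return $name if index($octets, $bom) == 0;
    }
    return;
}

my $name = bom_encoding("\xEF\xBB\xBF<?xml version='1.0'?>");
print defined $name ? "BOM says $name\n" : "no BOM\n";
```

Note that the octets FF FE 00 00 are ambiguous in principle (a UTF-16LE mark followed by U+0000); the sketch simply prefers the longer match.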
=item encoding_from_html_document($octets [, %options])
Uses encoding_from_xml_document and encoding_from_meta_element to
determine the encoding of HTML documents. If $options{xhtml} is
set to a false value, it uses encoding_from_byte_order_mark and
encoding_from_meta_element instead. The xhtml option is on by
default. $options{encodings} can be used to modify the suspected
encodings, and $options{parser_options} can be used to modify the
HTML::Parser options in encoding_from_meta_element (see the
relevant documentation).
Returns nothing if no declaration could be found. Otherwise returns
the winning declaration in scalar context, and a list of encoding
source and encoding name in list context; see ENCODING SOURCES.
...
Other problems arise from differences between HTML and XHTML syntax
and encoding detection rules, for example, the input could be
Content-Type: text/html
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv = "Content-Type"
content = "text/html;charset=iso-8859-2">
<title></title>
<p>...</p>
This is a perfectly legal HTML 4.01 document, and implementations
might be expected to consider the document ISO-8859-2 encoded, as
XML rules for encoding detection do not apply to HTML documents.
This module attempts to avoid deciding which rules apply to a
specific document and would thus by default return 'utf-8' for
this input.
On the other hand, if the input omits the encoding declaration,
Content-Type: text/html
<?xml version='1.0'?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv = "Content-Type"
content = "text/html;charset=iso-8859-2">
<title></title>
<p>...</p>
the routine would return 'iso-8859-2'. Similar problems arise from
other differences between HTML and XHTML; for example, consider
Content-Type: text/html
<?foo >
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html ...
?>
...
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
...
If this is processed using HTML rules, the first > will end the
processing instruction and the XHTML document type declaration
would be the relevant declaration for the document. If it is
processed using XHTML rules, the ?> will end the processing
instruction and the HTML document type declaration would be the
relevant declaration.
In other words, an application would need to assume a certain character
encoding (family) to process enough of the document to determine
whether it is XHTML or HTML and the result of this detection would
depend on which processing rules are assumed in order to process it.
It is thus in essence not possible to write a "perfect" detection
algorithm, which is why this routine attempts to avoid making any
decisions on this matter.
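The processing instruction ambiguity can be reproduced with a few lines of Perl. This is a sketch under simplifying assumptions: the sample document is a shortened version of the one above, and C<pi_end> and C<relevant_doctype> are hypothetical helpers that only model the two terminator rules, not a real HTML or XML parser.

```perl
use strict;
use warnings;

# Offset just past the end of a processing instruction that starts at
# the beginning of $text: HTML rules end it at '>', XHTML rules at '?>'.
sub pi_end {
    my ($text, $rules) = @_;
    my $terminator = $rules eq 'xhtml' ? '?>' : '>';
    my $pos = index($text, $terminator);
    return $pos < 0 ? undef : $pos + length $terminator;
}

# Public identifier of the first document type declaration that
# follows the processing instruction under the given rules.
sub relevant_doctype {
    my ($text, $rules) = @_;
    my $end = pi_end($text, $rules);
    return unless defined $end;
    my ($public_id) = substr($text, $end) =~ /<!DOCTYPE[^">]*"([^">]+)"/;
    return $public_id;
}

my $doc = join "\n",
    '<?foo >',
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN">',
    '?>',
    '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">',
    '';

for my $rules (qw(html xhtml)) {
    print "$rules rules: ", relevant_doctype($doc, $rules), "\n";
}
```

The same octets yield a different document type declaration depending on which rule set is assumed, which is the circularity described above.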
=item encoding_from_http_message($message [, %options])
Determines the encoding of HTML / XML / XHTML documents enclosed
in an HTTP message. $message is an object compatible with L<HTTP::Message>,
e.g. an L<HTTP::Response> object. %options is a hash with the following
possible entries:
=over 2
=item encodings
Array reference of suspected character encodings, defaults to
C<$HTML::Encoding::DEFAULT_ENCODINGS>.
=item is_html
Regular expression matched against the content_type of the message
to determine whether to use HTML rules for the entity body, defaults
to C<qr{^text/html$}i>.
=item is_xml
Regular expression matched against the content_type of the message
to determine whether to use XML rules for the entity body, defaults
to C<qr{^.+/(?:.+\+)?xml$}i>.
=item is_text_xml
Regular expression matched against the content_type of the message
to determine whether to use text/xml rules for the message, defaults
to C<qr{^text/(?:.+\+)?xml$}i>. This will only be checked if is_xml
matches as well.
=item html_default
Default encoding for documents determined (by is_html) as HTML,
defaults to C<ISO-8859-1>.
=item xml_default
Default encoding for documents determined (by is_xml) as XML,
defaults to C<UTF-8>.
=item text_xml_default
Default encoding for documents determined (by is_text_xml) as text/xml,
defaults to C<undef>, in which case the default is ignored. Set this to
C<US-ASCII> if you want behavior consistent with RFC 3023, which requires
that C<US-ASCII> is assumed for text/xml documents without a charset
parameter in the HTTP header. That requirement is inconsistent with
RFC 2616 (HTTP/1.1), which requires assuming C<ISO-8859-1>, and has been
widely ignored; it is thus disabled by default.
=item xhtml
Whether the routine should look for an encoding declaration in the
XML declaration of the document (if any), defaults to C<1>.
=item default
Whether the relevant default value should be returned when no other