XHTML results from the CPAN

XHTML

HTML-Encoding


--- #YAML:1.0
name:               HTML-Encoding
version:            0.61
abstract:           Determine the encoding of HTML/XML/XHTML documents
author:
    - Bjoern Hoehrmann <bjoern@hoehrmann.de>
license:            perl
distribution_type:  module
configure_requires:
    ExtUtils::MakeMaker:  0
build_requires:
    ExtUtils::MakeMaker:  0
requires:
    Encode:               0

lib/HTML/Encoding.pm view on Meta::CPAN

}

1;

__END__

=pod

=head1 NAME

HTML::Encoding - Determine the encoding of HTML/XML/XHTML documents

=head1 SYNOPSIS

  use HTML::Encoding 'encoding_from_http_message';
  use LWP::UserAgent;
  use Encode;
  
  my $resp = LWP::UserAgent->new->get('http://www.example.org');
  my $enco = encoding_from_http_message($resp);
  my $utf8 = decode($enco => $resp->content);

...

modify the suspected encodings and $options{parser_options} can
be used to modify the HTML::Parser options in
encoding_from_meta_element (see the relevant documentation).

Returns nothing if no declaration could be found. Otherwise returns
the winning declaration in scalar context, or a list of encoding
source and encoding name in list context; see ENCODING SOURCES.
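
The return convention described above can be sketched with C<wantarray>. This is a minimal illustration only: C<detect_stub> is a hypothetical stand-in written for this example, not a function of the module, and the source label 'meta' is a placeholder value.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a detection routine; NOT the module's code.
# It only mimics the documented return convention.
sub detect_stub {
    my ($source, $name) = @_;
    return unless defined $name;                  # nothing was found
    return wantarray ? ($source, $name) : $name;  # list vs. scalar context
}

# Scalar context: only the encoding name comes back.
my $name = detect_stub(meta => 'iso-8859-2');

# List context: the encoding source followed by the encoding name.
my ($src, $enc) = detect_stub(meta => 'iso-8859-2');

# No declaration: an empty list in list context.
my @none = detect_stub('meta', undef);

print "scalar: $name\n";
print "list:   $src, $enc\n";
print "none:   ", scalar(@none), " values\n";
```

Callers that need to know where a declaration came from would use list context; callers that only want a name to pass on to C<decode> can stay in scalar context and check for definedness first.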

...

Other problems arise from differences between HTML and XHTML syntax
and encoding detection rules, for example, the input could be

  Content-Type: text/html

  <?xml version='1.0' encoding='utf-8'?>
  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd">
  <meta http-equiv = "Content-Type"
           content = "text/html;charset=iso-8859-2">
  <title></title>
  <p>...</p>

It would return 'iso-8859-2'. Similar problems would arise from
other differences between HTML and XHTML, for example consider

  Content-Type: text/html

  <?foo >
  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
      "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html ...
  ?>
  ...
  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
  ...
  
If this is processed using HTML rules, the first > will end the
processing instruction and the XHTML document type declaration
would be the relevant declaration for the document. If it is
processed using XHTML rules, the ?> will end the processing
instruction and the HTML document type declaration would be the
relevant declaration.
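
The ambiguity can be reproduced with a few lines of core Perl. The sketch below is not how the module works internally; C<first_doctype_after_pi> is a hypothetical helper that simply cuts the input at an assumed processing instruction terminator and reports the public identifier of the next document type declaration.

```perl
use strict;
use warnings;

my $doc = <<'END';
<?foo >
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html ...
?>
...
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
...
END

# Hypothetical helper: skip past the first occurrence of $terminator and
# return the public identifier of the first DOCTYPE that follows it.
sub first_doctype_after_pi {
    my ($text, $terminator) = @_;
    my $end = index($text, $terminator);
    return if $end < 0;                       # terminator not present
    my $rest = substr($text, $end + length $terminator);
    return $rest =~ /<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"/ ? $1 : undef;
}

my $as_html  = first_doctype_after_pi($doc, '>');   # HTML rules: PI ends at ">"
my $as_xhtml = first_doctype_after_pi($doc, '?>');  # XHTML rules: PI ends at "?>"

print "HTML rules:  $as_html\n";   # -//W3C//DTD XHTML 1.0 Strict//EN
print "XHTML rules: $as_xhtml\n";  # -//W3C//DTD HTML 4.01//EN
```

The same octets yield two different relevant declarations, depending only on which terminator the reader assumes.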

In other words, an application would need to assume a certain character
encoding (family) to process enough of the document to determine
whether it is XHTML or HTML and the result of this detection would
depend on which processing rules are assumed in order to process it.
It is thus in essence not possible to write a "perfect" detection
algorithm, which is why this routine attempts to avoid making any
decisions on this matter.

=item encoding_from_http_message($message [, %options])

Determines the encoding of HTML / XML / XHTML documents enclosed
in an HTTP message. $message is an object compatible with L<HTTP::Message>,
e.g. an L<HTTP::Response> object. %options is a hash with the following
possible entries:

=over 2

=item encodings

An array reference of suspected character encodings; defaults to
C<$HTML::Encoding::DEFAULT_ENCODINGS>.
