Plack-App-MCCS
local/lib/perl5/IO/HTML.pm view on Meta::CPAN
  $options ||= {};
  open(my $in, '<:raw', $filename) or croak "Failed to open $filename: $!";
  my ($encoding, $bom) = sniff_encoding($in, $filename, $options);
  if (not defined $encoding) {
    croak "No default encoding specified"
        unless defined($encoding = $default_encoding);
    $encoding = find_encoding($encoding) if $options->{encoding};
  } # end if we didn't find an encoding
  binmode $in, sprintf(":encoding(%s):crlf",
                       $options->{encoding} ? $encoding->name : $encoding);
  return ($in, $encoding, $bom);
} # end html_file_and_encoding
#---------------------------------------------------------------------
sub html_outfile
{
  my ($filename, $encoding, $bom) = @_;
  if (not defined $encoding) {
    croak "No default encoding specified"
        unless defined($encoding = $default_encoding);
  } # end if we didn't find an encoding
  elsif (ref $encoding) {
    $encoding = $encoding->name;
  }
  open(my $out, ">:encoding($encoding)", $filename)
      or croak "Failed to open $filename: $!";
  print $out "\x{FeFF}" if $bom;
  return $out;
} # end html_outfile
#---------------------------------------------------------------------
sub sniff_encoding
{
  my ($in, $filename, $options) = @_;
  $filename = 'file' unless defined $filename;
  $options ||= {};
  my $pos = tell $in;
  croak "Could not seek $filename: $!" if $pos < 0;
  croak "Could not read $filename: $!"
      unless defined read $in, my($buf), $bytes_to_check;
  seek $in, $pos, 0 or croak "Could not seek $filename: $!";
  # Check for BOM:
  my $bom;
  my $encoding = do {
    if ($buf =~ /^\xFe\xFF/) {
      $bom = 2;
      'UTF-16BE';
    } elsif ($buf =~ /^\xFF\xFe/) {
      $bom = 2;
      'UTF-16LE';
    } elsif ($buf =~ /^\xEF\xBB\xBF/) {
      $bom = 3;
      'utf-8-strict';
    } else {
      find_charset_in($buf, $options); # check for <meta charset>
    }
  }; # end $encoding
  if ($bom) {
    seek $in, $bom, 1 or croak "Could not seek $filename: $!";
    $bom = 1;
  }
  elsif (not defined $encoding) { # try decoding as UTF-8
    my $test = decode('utf-8-strict', $buf, Encode::FB_QUIET);
    if ($buf =~ /^(?:                   # nothing left over
         | [\xC2-\xDF]                  # incomplete 2-byte char
         | [\xE0-\xEF] [\x80-\xBF]?     # incomplete 3-byte char
         | [\xF0-\xF4] [\x80-\xBF]{0,2} # incomplete 4-byte char
        )\z/x and $test =~ /[^\x00-\x7F]/) {
      $encoding = 'utf-8-strict';
    } # end if valid UTF-8 with at least one multi-byte character:
  } # end if testing for UTF-8
  if (defined $encoding and $options->{encoding} and not ref $encoding) {
    $encoding = find_encoding($encoding);
  } # end if $encoding is a string and we want an object
  return wantarray ? ($encoding, $bom) : $encoding;
} # end sniff_encoding
#=====================================================================
# Based on HTML5 8.2.2.2 Determining the character encoding:
# Get attribute from current position of $_
sub _get_attribute
{
  m!\G[\x09\x0A\x0C\x0D /]+!gc; # skip whitespace or /
  return if /\G>/gc or not /\G(=?[^\x09\x0A\x0C\x0D =]*)/gc;
  my ($name, $value) = (lc $1, '');
  if (/\G[\x09\x0A\x0C\x0D ]*=[\x09\x0A\x0C\x0D ]*/gc) {
    if (/\G"/gc) {
      # Double-quoted attribute value
      /\G([^"]*)("?)/gc;
      return unless $2; # Incomplete attribute (missing closing quote)
      $value = lc $1;
    } elsif (/\G'/gc) {
      # Single-quoted attribute value
      /\G([^']*)('?)/gc;
      return unless $2; # Incomplete attribute (missing closing quote)
  # Alternative interface:
  open(my $in, '<:raw', 'bar.html');
  my $encoding = IO::HTML::sniff_encoding($in, 'bar.html');
=head1 DESCRIPTION
IO::HTML provides an easy way to open a file containing HTML while
automatically determining its encoding. It uses the HTML5 encoding
sniffing algorithm specified in section 8.2.2.2 of the draft standard.
The algorithm as implemented here is:
=over
=item 1.
If the file begins with a byte order mark indicating UTF-16LE,
UTF-16BE, or UTF-8, then that is the encoding.
=item 2.
If the first C<$bytes_to_check> bytes of the file contain a C<< <meta> >> tag that
indicates the charset, and Encode recognizes the specified charset
name, then that is the encoding. (This portion of the algorithm is
implemented by C<find_charset_in>.)
The C<< <meta> >> tag can be in one of two formats:
  <meta charset="...">
  <meta http-equiv="Content-Type" content="...charset=...">
The search is case-insensitive, and the order of attributes within the
tag is irrelevant. Any additional attributes of the tag are ignored.
The first matching tag with a recognized encoding ends the search.
=item 3.
If the first C<$bytes_to_check> bytes of the file are valid UTF-8 (with at least 1
non-ASCII character), then the encoding is UTF-8.
=item 4.
If all else fails, use the default character encoding. The HTML5
standard suggests the default encoding should be locale dependent, but
currently it is always C<cp1252> unless you set
C<$IO::HTML::default_encoding> to a different value. Note:
C<sniff_encoding> does not apply this step; only C<html_file> and
C<html_file_and_encoding> do.
=back
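Steps 1, 3, and 4 above can be sketched with core modules only. This is a minimal illustration, not the module's implementation: the helper name C<guess_encoding> is hypothetical, the C<< <meta> >> scan of step 2 is omitted, and the UTF-8 check is stricter than the module's because it does not tolerate a multi-byte character cut off at the end of the buffer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of IO::HTML): guess an encoding from the
# leading bytes of a document. Returns (encoding name, BOM length in bytes).
sub guess_encoding {
    my ($bytes, $default) = @_;
    $default = 'cp1252' unless defined $default;

    # Step 1: a byte order mark decides the encoding outright.
    return ('UTF-16BE',     2) if $bytes =~ /^\xFE\xFF/;
    return ('UTF-16LE',     2) if $bytes =~ /^\xFF\xFE/;
    return ('utf-8-strict', 3) if $bytes =~ /^\xEF\xBB\xBF/;

    # Step 3: valid UTF-8 containing at least one non-ASCII character.
    # decode() may modify its input when given a CHECK value, so use a copy.
    my $copy = $bytes;
    my $text = eval { decode('utf-8-strict', $copy, Encode::FB_CROAK) };
    return ('utf-8-strict', 0)
        if defined $text and $text =~ /[^\x00-\x7F]/;

    # Step 4: fall back to the default encoding.
    return ($default, 0);
}

my ($enc, $bom_len) = guess_encoding("\xEF\xBB\xBF<html></html>");
print "$enc (BOM length $bom_len)\n";
```

Pure ASCII input deliberately falls through to the default, matching the "at least 1 non-ASCII character" condition in step 3.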
=head1 SUBROUTINES
=head2 html_file
  $filehandle = html_file($filename, \%options);
This function (exported by default) is the primary entry point. It
opens the file specified by C<$filename> for reading, uses
C<sniff_encoding> to find a suitable encoding layer, and applies it.
It also applies the C<:crlf> layer. If the file begins with a BOM,
the filehandle is positioned just after the BOM.
The optional second argument is a hashref containing options. The
possible keys are described under C<find_charset_in>.
If C<sniff_encoding> is unable to determine the encoding, C<html_file>
falls back to C<$IO::HTML::default_encoding>, which is set to C<cp1252>
(a.k.a. Windows-1252) by default. According to the standard, the
default should be locale dependent, but that is not currently
implemented.
It dies if the file cannot be opened, or if C<sniff_encoding> cannot
determine the encoding and C<$IO::HTML::default_encoding> has been set
to C<undef>.
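A short usage sketch for C<html_file>, assuming IO::HTML is installed. The temporary file and its contents are made up for illustration; the file is written as raw bytes so that the UTF-8 BOM is really present on disk.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::HTML;                 # exports html_file by default

# Write a small UTF-8 document that starts with a BOM (raw bytes).
my ($tmp, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBF<title>caf\xC3\xA9</title>\n";
close $tmp or die "Cannot close $path: $!";

# html_file sniffs the encoding, applies it as a layer,
# and leaves the handle positioned just after the BOM.
my $in   = html_file($path);
my $line = <$in>;
close $in;

# $line holds decoded characters and does not start with the BOM.
print "BOM skipped, text decoded\n" if $line =~ /^<title>caf\x{E9}/;
```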
=head2 html_file_and_encoding
  ($filehandle, $encoding, $bom)
    = html_file_and_encoding($filename, \%options);
This function (exported only by request) is just like C<html_file>,
but returns more information. In addition to the filehandle, it
returns the name of the encoding used, and a flag indicating whether a
byte order mark was found (if C<$bom> is true, the file began with a
BOM). This may be useful if you want to write the file out again
(especially in conjunction with the C<html_outfile> function).
The optional second argument is a hashref containing options. The
possible keys are described under C<find_charset_in>.
It dies if the file cannot be opened, or if C<sniff_encoding> cannot
determine the encoding and C<$IO::HTML::default_encoding> has been set
to C<undef>.
The result of calling C<html_file_and_encoding> in scalar context is undefined
(in the C sense: there is no guarantee of what you will get).
=head2 html_outfile
  $filehandle = html_outfile($filename, $encoding, $bom);
This function (exported only by request) opens C<$filename> for output
using C<$encoding>, and writes a BOM to it if C<$bom> is true.
If C<$encoding> is C<undef>, it defaults to C<$IO::HTML::default_encoding>.
C<$encoding> may be either an encoding name or an Encode::Encoding object.
It dies if the file cannot be opened, or if both C<$encoding> and
C<$IO::HTML::default_encoding> are C<undef>.
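The pairing of C<html_file_and_encoding> with C<html_outfile> can be sketched as a round trip that preserves both the encoding and the BOM. This assumes IO::HTML is installed; the file names and the tiny UTF-16LE sample are invented for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use IO::HTML qw(html_file_and_encoding html_outfile);

my $dir = tempdir(CLEANUP => 1);
my ($src, $dst) = ("$dir/in.html", "$dir/out.html");

# Create a UTF-16LE input file with a BOM (raw bytes spelling "<p>").
open(my $raw, '>:raw', $src) or die "Cannot write $src: $!";
print {$raw} "\xFF\xFE<\x00p\x00>\x00";
close $raw or die "Cannot close $src: $!";

# Read: the handle decodes for us, and we learn the encoding and BOM flag.
my ($in, $encoding, $bom) = html_file_and_encoding($src);
my $html = do { local $/; <$in> };
close $in;

# Write: reuse the same encoding and BOM flag so the output matches the input.
my $out = html_outfile($dst, $encoding, $bom);
print {$out} $html;
close $out or die "Cannot close $dst: $!";

print "copied as $encoding, BOM ", ($bom ? 'kept' : 'absent'), "\n";
```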
=head2 sniff_encoding
  ($encoding, $bom) = sniff_encoding($filehandle, $filename, \%options);
This function (exported only by request) runs the HTML5 encoding
sniffing algorithm on C<$filehandle> (which must be seekable, and
should have been opened in C<:raw> mode). C<$filename> is used only
for error messages (if there's a problem using the filehandle), and
defaults to "file" if omitted. The optional third argument is a
hashref containing options. The possible keys are described under
C<find_charset_in>.
It returns Perl's canonical name for the encoding, which is not
necessarily the same as the MIME or IANA charset name. It returns
C<undef> if the encoding cannot be determined. C<$bom> is true if the
file began with a byte order mark. In scalar context, it returns only
C<$encoding>.
The filehandle's position is restored to its original position
(normally the beginning of the file) unless C<$bom> is true. In that
case, the position is immediately after the BOM.
Tip: If you want to run C<sniff_encoding> on a file you've already
loaded into a string, open an in-memory file on the string, and pass
that handle:
  ($encoding, $bom) = do {
    open(my $fh, '<', \$string);  sniff_encoding($fh)
  };
(This only makes sense if C<$string> contains bytes, not characters.)
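The positioning rule described above can be checked with an in-memory file. A minimal sketch, assuming IO::HTML is installed; the sample bytes and the label passed as C<$filename> are arbitrary.

```perl
use strict;
use warnings;
use IO::HTML qw(sniff_encoding);

# Raw bytes: a UTF-8 BOM followed by markup.
my $bytes = "\xEF\xBB\xBF<!DOCTYPE html><title>x</title>";

open(my $fh, '<:raw', \$bytes) or die "Cannot open in-memory file: $!";
my ($encoding, $bom) = sniff_encoding($fh, 'in-memory document');

# With a BOM, the handle now sits just past it (3 bytes for UTF-8).
my $offset = tell $fh;
print "encoding=$encoding bom=$bom offset=$offset\n";
```

Without a BOM, C<tell> would report the position the handle had before the call.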
=head2 find_charset_in
  $encoding = find_charset_in($string_containing_HTML, \%options);
This function (exported only by request) looks for charset information
in a C<< <meta> >> tag in a possibly-incomplete HTML document using
the "two step" algorithm specified by HTML5. It does not look for a BOM.
The C<< <meta> >> tag must begin within the first C<$IO::HTML::bytes_to_check>
bytes of the string.
It returns Perl's canonical name for the encoding, which is not
necessarily the same as the MIME or IANA charset name. It returns
C<undef> if no charset is specified or if the specified charset is not
recognized by the Encode module.
The optional second argument is a hashref containing options. The
following keys are recognized:
=over
=item C<encoding>
If true, return the L<Encode::Encoding> object instead of its name.
Defaults to false.
=item C<need_pragma>
If true (the default), follow the HTML5 spec and examine the
C<content> attribute only of C<< <meta http-equiv="Content-Type" >>.
If set to 0, relax the HTML5 spec, and look for "charset=" in the
C<content> attribute of I<every> meta tag.
=back
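The effect of the two options can be sketched as follows, assuming IO::HTML is installed. The C<< <meta> >> tag is a made-up example that names a charset in C<content> but has no C<http-equiv="Content-Type"> attribute.

```perl
use strict;
use warnings;
use IO::HTML qw(find_charset_in);

# A <meta> tag that mentions a charset but lacks http-equiv="Content-Type".
my $html = '<meta name="note" content="text/html; charset=iso-8859-15">';

# Default (need_pragma true): the content attribute is ignored here.
my $strict = find_charset_in($html);

# Relaxed: accept "charset=" in the content attribute of any <meta> tag.
my $relaxed = find_charset_in($html, { need_pragma => 0 });

# Ask for the Encode::Encoding object rather than its name.
my $object = find_charset_in($html, { need_pragma => 0, encoding => 1 });

printf "strict: %s\n",      defined $strict ? $strict : 'undef';
printf "relaxed: %s\n",     $relaxed;
printf "object name: %s\n", $object->name;
```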
=head1 EXPORTS
By default, only C<html_file> is exported. Other functions may be
exported on request.
For people who prefer not to export functions, all functions beginning
with C<html_> have an alias without that prefix (e.g. you can call
C<IO::HTML::file(...)> instead of C<IO::HTML::html_file(...)>). These
aliases are not exportable.
=for Pod::Coverage
file
file_and_encoding
outfile
The following export tags are available:
=over
=item C<:all>
All exportable functions.
=item C<:rw>
C<html_file>, C<html_file_and_encoding>, C<html_outfile>.
=back
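A brief sketch of the C<:rw> tag and the unprefixed aliases, assuming IO::HTML is installed. It only inspects which names are available and opens no files.

```perl
use strict;
use warnings;
use IO::HTML qw(:rw);   # html_file, html_file_and_encoding, html_outfile

# The three :rw functions are now callable without a package prefix.
my @imported = grep { main->can($_) }
               qw(html_file html_file_and_encoding html_outfile);
print "imported: @imported\n";

# sniff_encoding is not part of :rw, so it was not imported.
print "sniff_encoding not imported\n" unless main->can('sniff_encoding');

# The unprefixed aliases are not exportable, but exist in the package
# and can be called fully qualified, e.g. IO::HTML::file($filename).
my @aliases = grep { IO::HTML->can($_) } qw(file file_and_encoding outfile);
print "aliases: @aliases\n";
```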
=head1 SEE ALSO
The HTML5 specification, section 8.2.2.2 Determining the character encoding:
L<http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding>