IO-HTML
examples/detect-encoding.pl
use strict;
use warnings;
use IO::HTML qw(html_file_and_encoding);
for my $filename (@ARGV) {
  my ($filehandle, $encoding, $bom) = html_file_and_encoding($filename);
  close $filehandle;
  $encoding .= " BOM=$bom" if defined $bom;
  print "$filename: $encoding\n";
}
lib/IO/HTML.pm
my $pos = tell $in;
croak "Could not seek $filename: $!" if $pos < 0;
croak "Could not read $filename: $!"
    unless defined read $in, my($buf), $bytes_to_check;
seek $in, $pos, 0 or croak "Could not seek $filename: $!";
# Check for BOM:
my $bom;
my $encoding = do {
  if ($buf =~ /^\xFe\xFF/) {
    $bom = 2;
    'UTF-16BE';
  } elsif ($buf =~ /^\xFF\xFe/) {
    $bom = 2;
    'UTF-16LE';
  } elsif ($buf =~ /^\xEF\xBB\xBF/) {
    $bom = 3;
lib/IO/HTML.pm
=head1 SUBROUTINES
=head2 html_file
  $filehandle = html_file($filename, \%options);
This function (exported by default) is the primary entry point. It
opens the file specified by C<$filename> for reading, uses
C<sniff_encoding> to find a suitable encoding layer, and applies it.
It also applies the C<:crlf> layer. If the file begins with a BOM,
the filehandle is positioned just after the BOM.
The optional second argument is a hashref containing options. The
possible keys are described under C<find_charset_in>.
If C<sniff_encoding> is unable to determine the encoding, it defaults
to C<$IO::HTML::default_encoding>, which is set to C<cp1252>
(a.k.a. Windows-1252) by default. According to the standard, the
default should be locale dependent, but that is not currently
implemented.
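A minimal usage sketch of C<html_file>, assuming IO::HTML is installed. The temporary file and its contents are made up for illustration and are not part of the distribution:

```perl
use strict;
use warnings;
use File::Temp;
use IO::HTML;    # html_file is exported by default

# Illustrative input: a temporary file holding UTF-8 bytes and a <meta> charset.
my $tmp = File::Temp->new(SUFFIX => '.html');
binmode $tmp;
print {$tmp} qq{<meta charset="UTF-8">\n<p>caf\xC3\xA9</p>\n};
$tmp->close;

# html_file applies the encoding layer it sniffed, so reads return
# decoded characters rather than raw bytes.
my $fh   = html_file("$tmp");
my $html = do { local $/; <$fh> };
close $fh;

printf "%d characters read\n", length $html;
```

Because the handle already carries an encoding layer, the caller should not decode the data a second time.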
lib/IO/HTML.pm
=head2 html_file_and_encoding
  ($filehandle, $encoding, $bom)
    = html_file_and_encoding($filename, \%options);
This function (exported only by request) is just like C<html_file>,
but returns more information. In addition to the filehandle, it
returns the name of the encoding used, and a flag indicating whether a
byte order mark was found (if C<$bom> is true, the file began with a
BOM). This may be useful if you want to write the file out again
(especially in conjunction with the C<html_outfile> function).
The optional second argument is a hashref containing options. The
possible keys are described under C<find_charset_in>.
It dies if the file cannot be opened, or if C<sniff_encoding> cannot
determine the encoding and C<$IO::HTML::default_encoding> has been set
to C<undef>.
The result of calling C<html_file_and_encoding> in scalar context is undefined
(in the C sense: there is no guarantee about what you'll get).
=head2 html_outfile
  $filehandle = html_outfile($filename, $encoding, $bom);
This function (exported only by request) opens C<$filename> for output
using C<$encoding>, and writes a BOM to it if C<$bom> is true.
If C<$encoding> is C<undef>, it defaults to C<$IO::HTML::default_encoding>.
C<$encoding> may be either an encoding name or an Encode::Encoding object.
It dies if the file cannot be opened, or if both C<$encoding> and
C<$IO::HTML::default_encoding> are C<undef>.
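The round trip described above (read with C<html_file_and_encoding>, write with C<html_outfile>) might look like the following sketch. It assumes IO::HTML is installed; the temporary files and sample markup are hypothetical:

```perl
use strict;
use warnings;
use File::Temp;
use IO::HTML qw(html_file_and_encoding html_outfile);

# Illustrative source file: UTF-8 bytes preceded by a UTF-8 BOM.
my $src = File::Temp->new(SUFFIX => '.html');
binmode $src;
print {$src} "\xEF\xBB\xBF<title>caf\xC3\xA9</title>\n";
$src->close;

# Read the document, remembering the encoding and whether a BOM was present.
my ($in, $encoding, $bom) = html_file_and_encoding("$src");
my $html = do { local $/; <$in> };
close $in;

# Write it back out with the same encoding and BOM setting.
my $dst = File::Temp->new(SUFFIX => '.html');
$dst->close;
my $out = html_outfile("$dst", $encoding, $bom);
print {$out} $html;
close $out or die "Could not close $dst: $!";

# Inspect the raw bytes of the copy.
open my $check, '<:raw', "$dst" or die "Could not reopen $dst: $!";
my $bytes = do { local $/; <$check> };
close $check;
print "Copy starts with a UTF-8 BOM\n" if $bytes =~ /^\xEF\xBB\xBF/;
```

Passing C<$encoding> and C<$bom> straight through keeps the output consistent with the input without the caller having to inspect either value.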
=head2 sniff_encoding
  ($encoding, $bom) = sniff_encoding($filehandle, $filename, \%options);
lib/IO/HTML.pm
C<find_charset_in>.
It returns Perl's canonical name for the encoding, which is not
necessarily the same as the MIME or IANA charset name. It returns
C<undef> if the encoding cannot be determined. C<$bom> is true if the
file began with a byte order mark. In scalar context, it returns only
C<$encoding>.
The filehandle's position is restored to its original position
(normally the beginning of the file) unless C<$bom> is true. In that
case, the position is immediately after the BOM.
Tip: If you want to run C<sniff_encoding> on a file you've already
loaded into a string, open an in-memory file on the string, and pass
that handle:
  ($encoding, $bom) = do {
    open(my $fh, '<', \$string);  sniff_encoding($fh)
  };
(This only makes sense if C<$string> contains bytes, not characters.)
=head2 find_charset_in
  $encoding = find_charset_in($string_containing_HTML, \%options);
This function (exported only by request) looks for charset information
in a C<< <meta> >> tag in a possibly-incomplete HTML document using
the "two step" algorithm specified by HTML5. It does not look for a BOM.
The C<< <meta> >> tag must begin within the first C<$IO::HTML::bytes_to_check>
bytes of the string.
It returns Perl's canonical name for the encoding, which is not
necessarily the same as the MIME or IANA charset name. It returns
C<undef> if no charset is specified or if the specified charset is not
recognized by the Encode module.
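A short sketch of calling C<find_charset_in> in scalar context, assuming IO::HTML is installed. The first two inputs and their expected names mirror cases in the distribution's t/10-find.t; the third input is made up to show the C<undef> case:

```perl
use strict;
use warnings;
use IO::HTML qw(find_charset_in);

# Scalar assignment puts each call in scalar context.
my $utf8   = find_charset_in(qq{<meta charset="UTF-8">\n});
my $latin9 = find_charset_in(qq{<meta charset ="ISO-8859-15">\n});
my $none   = find_charset_in(qq{<p>No charset declared</p>\n});

# The results are Perl's canonical names, not the names used in the markup.
for my $encoding ($utf8, $latin9, $none) {
  print defined $encoding ? $encoding : 'undef', "\n";
}
```

Note that the returned names (for example C<utf-8-strict>) are suitable for an C<:encoding(...)> layer but may differ from the charset label written in the document.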
The optional second argument is a hashref containing options. The
following keys are recognized:
t/10-find.t
  local $Test::Builder::Level = $Test::Builder::Level + 1;
  is(scalar find_charset_in(@data), $charset, $name);
} # end test
#---------------------------------------------------------------------
test 'utf-8-strict' => <<'';
<meta charset="UTF-8">
test 'utf-8-strict' => <<'';
<!-- UTF-16 is recognized only with a BOM -->
<meta charset="UTF-16BE">
test 'iso-8859-15' => <<'';
<meta charset ="ISO-8859-15">
test 'iso-8859-15' => <<'';
<meta charset= "ISO-8859-15">
test 'iso-8859-15' => <<'';
<meta charset =
t/30-outfile.t
use IO::HTML ':rw';
use Encode 'find_encoding';
use File::Temp;
#---------------------------------------------------------------------
sub test
{
  my ($encoding, $bom, $expected) = @_;
  my $name = ref $encoding ? $encoding->name . " object" : $encoding;
  $name .= ($bom ? ' with BOM' : ' without BOM') if defined $bom;
  local $Test::Builder::Level = $Test::Builder::Level + 1;
  my $tmp = File::Temp->new(UNLINK => 1);
  $tmp->close;
  my $fh = html_outfile("$tmp", $encoding, $bom);
  print $fh "\xA0\x{2014}";