BOM results from the CPAN

IO-HTML


use strict;
use warnings;

use IO::HTML qw(html_file_and_encoding);

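# Report the detected encoding (and BOM flag) of each file named on the command line.
for my $filename (@ARGV) {
  my ($filehandle, $encoding, $bom) = html_file_and_encoding($filename);

  close $filehandle;    # only the detection result is needed here
  $encoding .= " BOM=$bom" if defined $bom;
  print "$filename: $encoding\n";
}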

lib/IO/HTML.pm view on Meta::CPAN


  my $pos = tell $in;
  croak "Could not seek $filename: $!" if $pos < 0;

  croak "Could not read $filename: $!"
      unless defined read $in, my($buf), $bytes_to_check;

  seek $in, $pos, 0 or croak "Could not seek $filename: $!";


  # Check for BOM:
  my $bom;
  my $encoding = do {
    if ($buf =~ /^\xFe\xFF/) {
      $bom = 2;
      'UTF-16BE';
    } elsif ($buf =~ /^\xFF\xFe/) {
      $bom = 2;
      'UTF-16LE';
    } elsif ($buf =~ /^\xEF\xBB\xBF/) {
      $bom = 3;

lib/IO/HTML.pm view on Meta::CPAN


=head1 SUBROUTINES

=head2 html_file

  $filehandle = html_file($filename, \%options);

This function (exported by default) is the primary entry point.  It
opens the file specified by C<$filename> for reading, uses
C<sniff_encoding> to find a suitable encoding layer, and applies it.
It also applies the C<:crlf> layer.  If the file begins with a BOM,
the filehandle is positioned just after the BOM.

The optional second argument is a hashref containing options.  The
possible keys are described under C<find_charset_in>.

If C<sniff_encoding> is unable to determine the encoding, it defaults
to C<$IO::HTML::default_encoding>, which is set to C<cp1252>
(a.k.a. Windows-1252) by default.  According to the standard, the
default should be locale dependent, but that is not currently
implemented.
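
As a rough illustration of the behavior described above, the following sketch writes a small temporary file and reads it back through C<html_file>. The sample content and the temporary file are assumptions made for the example; the only IO::HTML call used is C<html_file>. Byte 0xA4 is the euro sign in ISO-8859-15, so finding U+20AC in the decoded text suggests the declared charset was applied.

```perl
use strict;
use warnings;
use File::Temp ();
use IO::HTML;   # html_file is exported by default

# Hypothetical sample: raw ISO-8859-15 bytes, where 0xA4 is the euro sign.
my $tmp = File::Temp->new(SUFFIX => '.html');
binmode $tmp;   # write the bytes exactly as given
print {$tmp} qq{<meta charset="ISO-8859-15">\n<p>\xA4</p>\n};
close $tmp;     # flush; the file stays on disk while $tmp is in scope

# html_file opens the file, sniffs the encoding, and applies the layer.
my $fh   = html_file($tmp->filename);
my $text = do { local $/; <$fh> };   # slurp the decoded characters
close $fh;

printf "euro sign decoded: %s\n", ($text =~ /\x{20AC}/ ? 'yes' : 'no');
```

If the file carried no BOM and no usable C<< <meta> >> tag, the fallback to C<$IO::HTML::default_encoding> described above would apply instead.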



=head2 html_file_and_encoding

  ($filehandle, $encoding, $bom)
    = html_file_and_encoding($filename, \%options);

This function (exported only by request) is just like C<html_file>,
but returns more information.  In addition to the filehandle, it
returns the name of the encoding used, and a flag indicating whether a
byte order mark was found (if C<$bom> is true, the file began with a
BOM).  This may be useful if you want to write the file out again
(especially in conjunction with the C<html_outfile> function).

The optional second argument is a hashref containing options.  The
possible keys are described under C<find_charset_in>.

It dies if the file cannot be opened, or if C<sniff_encoding> cannot
determine the encoding and C<$IO::HTML::default_encoding> has been set
to C<undef>.

The result of calling C<html_file_and_encoding> in scalar context is undefined
(in the C sense: there is no guarantee what you'll get).


=head2 html_outfile

  $filehandle = html_outfile($filename, $encoding, $bom);

This function (exported only by request) opens C<$filename> for output
using C<$encoding>, and writes a BOM to it if C<$bom> is true.
If C<$encoding> is C<undef>, it defaults to C<$IO::HTML::default_encoding>.
C<$encoding> may be either an encoding name or an Encode::Encoding object.

It dies if the file cannot be opened, or if both C<$encoding> and
C<$IO::HTML::default_encoding> are C<undef>.
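
A minimal round-trip sketch, assuming a throwaway temporary directory and made-up file names (C<in.html>, C<out.html>): it creates a file with C<html_outfile>, reads it back with C<html_file_and_encoding>, and passes the returned C<$encoding> and C<$bom> straight to a second C<html_outfile> call so that the copy keeps the same encoding and BOM setting.

```perl
use strict;
use warnings;
use File::Temp ();
use IO::HTML qw(html_file_and_encoding html_outfile);

my $dir = File::Temp->newdir;   # removed when $dir goes out of scope
my $src = "$dir/in.html";       # hypothetical file names
my $dst = "$dir/out.html";

# Create a UTF-8 file that starts with a BOM.
my $out = html_outfile($src, 'UTF-8', 1);
print {$out} "<title>caf\x{E9}</title>";
close $out or die "close failed: $!";

# Read it back; the handle is already positioned just after the BOM.
my ($in, $encoding, $bom) = html_file_and_encoding($src);
my $html = do { local $/; <$in> };
close $in;

# Write a copy using the encoding and BOM flag that were detected.
my $copy = html_outfile($dst, $encoding, $bom);
print {$copy} $html;
close $copy or die "close failed: $!";

# Inspect the first three raw bytes of the copy.
open my $raw, '<:raw', $dst or die "Cannot open $dst: $!";
read $raw, my $head, 3;
close $raw;

printf "encoding=%s bom=%s head=%s\n",
  $encoding, ($bom ? 1 : 0), uc(unpack 'H*', $head);
```

Checking the raw bytes rather than the decoded text is deliberate: the BOM is consumed on input, so it never appears in C<$html>.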


=head2 sniff_encoding

  ($encoding, $bom) = sniff_encoding($filehandle, $filename, \%options);

lib/IO/HTML.pm view on Meta::CPAN

C<find_charset_in>.

It returns Perl's canonical name for the encoding, which is not
necessarily the same as the MIME or IANA charset name.  It returns
C<undef> if the encoding cannot be determined.  C<$bom> is true if the
file began with a byte order mark.  In scalar context, it returns only
C<$encoding>.

The filehandle's position is restored to its original position
(normally the beginning of the file) unless C<$bom> is true.  In that
case, the position is immediately after the BOM.

Tip: If you want to run C<sniff_encoding> on a file you've already
loaded into a string, open an in-memory file on the string, and pass
that handle:

  ($encoding, $bom) = do {
    open(my $fh, '<', \$string);  sniff_encoding($fh)
  };

(This only makes sense if C<$string> contains bytes, not characters.)
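
Building on the tip above, this sketch runs C<sniff_encoding> on three in-memory byte strings, one per BOM checked in the source excerpt, and records where the handle ends up. The sample strings and the C<%result> structure are assumptions made for illustration.

```perl
use strict;
use warnings;
use IO::HTML qw(sniff_encoding);

# Hypothetical byte strings, each beginning with a different BOM.
my %sample = (
  'UTF-16BE BOM' => "\xFE\xFF\x00<\x00p\x00>",
  'UTF-16LE BOM' => "\xFF\xFE<\x00p\x00>\x00",
  'UTF-8 BOM'    => "\xEF\xBB\xBF<p>",
);

my %result;
for my $label (sort keys %sample) {
  # Open an in-memory file on the bytes and sniff it.
  open(my $fh, '<', \$sample{$label}) or die "in-memory open failed: $!";
  my ($encoding, $bom) = sniff_encoding($fh);

  # Record the handle position left behind by sniff_encoding.
  $result{$label} = { encoding => $encoding, bom => $bom, pos => tell($fh) };
  close $fh;

  printf "%-12s => encoding=%s bom=%s pos=%d\n",
    $label, $encoding, ($bom ? 'true' : 'false'), $result{$label}{pos};
}
```

The recorded position is the length of the BOM in bytes, which matches the statement that the handle is left immediately after the BOM.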


=head2 find_charset_in

  $encoding = find_charset_in($string_containing_HTML, \%options);

This function (exported only by request) looks for charset information
in a C<< <meta> >> tag in a possibly-incomplete HTML document using
the "two step" algorithm specified by HTML5.  It does not look for a BOM.
The C<< <meta> >> tag must begin within the first C<$IO::HTML::bytes_to_check>
bytes of the string.

It returns Perl's canonical name for the encoding, which is not
necessarily the same as the MIME or IANA charset name.  It returns
C<undef> if no charset is specified or if the specified charset is not
recognized by the Encode module.
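
A short sketch of the return values, using snippets modeled on the cases in C<t/10-find.t>; the last snippet and the variable names are made up for the example. The function is called in scalar context, as the test file does.

```perl
use strict;
use warnings;
use IO::HTML qw(find_charset_in);

my @snippets = (
  '<meta charset="UTF-8">',
  '<meta charset ="ISO-8859-15">',
  '<meta charset="UTF-16BE">',         # UTF-16 is recognized only with a BOM
  '<p>No charset declared here</p>',   # hypothetical: nothing to find
);

# Force scalar context so each call yields exactly one value.
my @found = map { scalar find_charset_in($_) } @snippets;

for my $i (0 .. $#snippets) {
  printf "%-35s => %s\n", $snippets[$i], $found[$i] // '(undef)';
}
```

Note that the names returned are Perl's canonical names (for example C<utf-8-strict>), not the labels written in the markup.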

The optional second argument is a hashref containing options.  The
following keys are recognized:

t/10-find.t view on Meta::CPAN

  local $Test::Builder::Level = $Test::Builder::Level + 1;

  is(scalar find_charset_in(@data), $charset, $name);
} # end test

#---------------------------------------------------------------------
test 'utf-8-strict' => <<'';
<meta charset="UTF-8">

test 'utf-8-strict' => <<'';
<!-- UTF-16 is recognized only with a BOM -->
<meta charset="UTF-16BE">

test 'iso-8859-15' => <<'';
<meta charset ="ISO-8859-15">

test 'iso-8859-15' => <<'';
<meta charset= "ISO-8859-15">

test 'iso-8859-15' => <<'';
<meta charset =

t/30-outfile.t view on Meta::CPAN

use IO::HTML ':rw';
use Encode 'find_encoding';
use File::Temp;

#---------------------------------------------------------------------
sub test
{
  my ($encoding, $bom, $expected) = @_;

  my $name = ref $encoding ? $encoding->name . " object" : $encoding;
  $name .= ($bom ? ' with BOM' : ' without BOM') if defined $bom;

  local $Test::Builder::Level = $Test::Builder::Level + 1;

  my $tmp = File::Temp->new(UNLINK => 1);
  $tmp->close;

  my $fh = html_outfile("$tmp", $encoding, $bom);

  print $fh "\xA0\x{2014}";
