BOM results from the CPAN

Plack-App-MCCS


  $options ||= {};

  open(my $in, '<:raw', $filename) or croak "Failed to open $filename: $!";


  my ($encoding, $bom) = sniff_encoding($in, $filename, $options);

  if (not defined $encoding) {
    croak "No default encoding specified"
        unless defined($encoding = $default_encoding);
    $encoding = find_encoding($encoding) if $options->{encoding};
  } # end if we didn't find an encoding

  binmode $in, sprintf(":encoding(%s):crlf",
                       $options->{encoding} ? $encoding->name : $encoding);

  return ($in, $encoding, $bom);
} # end html_file_and_encoding
#---------------------------------------------------------------------


sub html_outfile
{
  my ($filename, $encoding, $bom) = @_;

  if (not defined $encoding) {
    croak "No default encoding specified"
        unless defined($encoding = $default_encoding);
  } # end if we didn't find an encoding
  elsif (ref $encoding) {
    $encoding = $encoding->name;
  }

  open(my $out, ">:encoding($encoding)", $filename)
      or croak "Failed to open $filename: $!";

  print $out "\x{FeFF}" if $bom;

  return $out;
} # end html_outfile
#---------------------------------------------------------------------


sub sniff_encoding
{
  my ($in, $filename, $options) = @_;

  $filename = 'file' unless defined $filename;
  $options ||= {};

  my $pos = tell $in;
  croak "Could not seek $filename: $!" if $pos < 0;

  croak "Could not read $filename: $!"
      unless defined read $in, my($buf), $bytes_to_check;

  seek $in, $pos, 0 or croak "Could not seek $filename: $!";


  # Check for BOM:
  my $bom;
  my $encoding = do {
    if ($buf =~ /^\xFe\xFF/) {
      $bom = 2;
      'UTF-16BE';
    } elsif ($buf =~ /^\xFF\xFe/) {
      $bom = 2;
      'UTF-16LE';
    } elsif ($buf =~ /^\xEF\xBB\xBF/) {
      $bom = 3;
      'utf-8-strict';
    } else {
      find_charset_in($buf, $options); # check for <meta charset>
    }
  }; # end $encoding

  if ($bom) {
    seek $in, $bom, 1 or croak "Could not seek $filename: $!";
    $bom = 1;
  }
  elsif (not defined $encoding) { # try decoding as UTF-8
    my $test = decode('utf-8-strict', $buf, Encode::FB_QUIET);
    if ($buf =~ /^(?:                   # nothing left over
         | [\xC2-\xDF]                  # incomplete 2-byte char
         | [\xE0-\xEF] [\x80-\xBF]?     # incomplete 3-byte char
         | [\xF0-\xF4] [\x80-\xBF]{0,2} # incomplete 4-byte char
        )\z/x and $test =~ /[^\x00-\x7F]/) {
      $encoding = 'utf-8-strict';
    } # end if valid UTF-8 with at least one multi-byte character:
  } # end if testing for UTF-8

  if (defined $encoding and $options->{encoding} and not ref $encoding) {
    $encoding = find_encoding($encoding);
  } # end if $encoding is a string and we want an object

  return wantarray ? ($encoding, $bom) : $encoding;
} # end sniff_encoding

#=====================================================================
# Based on HTML5 8.2.2.2 Determining the character encoding:

# Get attribute from current position of $_
sub _get_attribute
{
  m!\G[\x09\x0A\x0C\x0D /]+!gc; # skip whitespace or /

  return if /\G>/gc or not /\G(=?[^\x09\x0A\x0C\x0D =]*)/gc;

  my ($name, $value) = (lc $1, '');

  if (/\G[\x09\x0A\x0C\x0D ]*=[\x09\x0A\x0C\x0D ]*/gc) {
    if (/\G"/gc) {
      # Double-quoted attribute value
      /\G([^"]*)("?)/gc;
      return unless $2; # Incomplete attribute (missing closing quote)
      $value = lc $1;
    } elsif (/\G'/gc) {
      # Single-quoted attribute value
      /\G([^']*)('?)/gc;
      return unless $2; # Incomplete attribute (missing closing quote)

local/lib/perl5/IO/HTML.pm

  # Alternative interface:
  open(my $in, '<:raw', 'bar.html');
  my $encoding = IO::HTML::sniff_encoding($in, 'bar.html');

=head1 DESCRIPTION

IO::HTML provides an easy way to open a file containing HTML while
automatically determining its encoding.  It uses the HTML5 encoding
sniffing algorithm specified in section 8.2.2.2 of the draft standard.

The algorithm as implemented here is:

=over

=item 1.

If the file begins with a byte order mark indicating UTF-16LE,
UTF-16BE, or UTF-8, then that is the encoding.

=item 2.

If the first C<$bytes_to_check> bytes of the file contain a C<< <meta> >> tag that
indicates the charset, and Encode recognizes the specified charset
name, then that is the encoding.  (This portion of the algorithm is
implemented by C<find_charset_in>.)

The C<< <meta> >> tag can be in one of two formats:

  <meta charset="...">
  <meta http-equiv="Content-Type" content="...charset=...">

The search is case-insensitive, and the order of attributes within the
tag is irrelevant.  Any additional attributes of the tag are ignored.
The first matching tag with a recognized encoding ends the search.

=item 3.

If the first C<$bytes_to_check> bytes of the file are valid UTF-8 (with at least 1
non-ASCII character), then the encoding is UTF-8.

=item 4.

If all else fails, use the default character encoding.  The HTML5
standard suggests the default encoding should be locale dependent, but
currently it is always C<cp1252> unless you set
C<$IO::HTML::default_encoding> to a different value.  Note:
C<sniff_encoding> does not apply this step; it returns C<undef>, and
C<html_file> and C<html_file_and_encoding> apply the default.

=back
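
Steps 1 and 3 of the list above can be sketched with core modules
alone.  This is a minimal illustration, not the module's own code:
C<guess_encoding_from_bytes> is a hypothetical helper name, it works on
a byte string rather than a filehandle, and it skips the C<< <meta> >>
scan of step 2 entirely.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of IO::HTML): returns the encoding name
# (or undef) and the length of the BOM in bytes.
sub guess_encoding_from_bytes {
  my ($bytes) = @_;

  # Step 1: a byte order mark settles the question.
  return ('UTF-16BE', 2)     if $bytes =~ /^\xFE\xFF/;
  return ('UTF-16LE', 2)     if $bytes =~ /^\xFF\xFE/;
  return ('utf-8-strict', 3) if $bytes =~ /^\xEF\xBB\xBF/;

  # Step 3: with FB_QUIET, decode() stops at the first malformed byte
  # and leaves the undecoded remainder in $rest.
  my $rest    = $bytes;
  my $decoded = decode('utf-8-strict', $rest, Encode::FB_QUIET);

  # Accept only if nothing is left over, or only a multi-byte sequence
  # cut short by the end of the buffer.
  my $clean_tail = $rest =~ /^(?:
        | [\xC2-\xDF]                   # incomplete 2-byte char
        | [\xE0-\xEF] [\x80-\xBF]?      # incomplete 3-byte char
        | [\xF0-\xF4] [\x80-\xBF]{0,2}  # incomplete 4-byte char
      )\z/x;

  # Pure ASCII is deliberately not reported as UTF-8.
  return ('utf-8-strict', 0) if $clean_tail and $decoded =~ /[^\x00-\x7F]/;

  return (undef, 0);
}

for my $sample ("\xEF\xBB\xBFhello", "caf\xC3\xA9", "plain ascii") {
  my ($encoding, $bom_length) = guess_encoding_from_bytes($sample);
  printf "%s (BOM bytes: %d)\n", $encoding // 'undetermined', $bom_length;
}
```

Requiring at least one non-ASCII character keeps plain ASCII input
undetermined, so that the caller's default encoding still applies.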

=head1 SUBROUTINES

=head2 html_file

  $filehandle = html_file($filename, \%options);

This function (exported by default) is the primary entry point.  It
opens the file specified by C<$filename> for reading, uses
C<sniff_encoding> to find a suitable encoding layer, and applies it.
It also applies the C<:crlf> layer.  If the file begins with a BOM,
the filehandle is positioned just after the BOM.

The optional second argument is a hashref containing options.  The
possible keys are described under C<find_charset_in>.

If C<sniff_encoding> is unable to determine the encoding, C<html_file>
falls back to C<$IO::HTML::default_encoding>, which is set to C<cp1252>
(a.k.a. Windows-1252) by default.  According to the standard, the
default should be locale dependent, but that is not currently
implemented.

It dies if the file cannot be opened, or if C<sniff_encoding> cannot
determine the encoding and C<$IO::HTML::default_encoding> has been set
to C<undef>.
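
A short usage sketch of C<html_file>, assuming IO::HTML is installed.
The temporary file and its contents are made up for illustration; the
example writes raw bytes first so that there is something to sniff.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use IO::HTML;                 # html_file is exported by default

# Create a small sample document so the example is self-contained.
my ($tmp, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
binmode $tmp, ':raw';
print $tmp qq{<!DOCTYPE html><meta charset="utf-8"><title>caf\xC3\xA9</title>\r\n};
close $tmp or die "Could not write $path: $!";

# html_file dies if the file cannot be opened.
my $in    = html_file($path);
my $first = <$in>;            # already decoded into characters
close $in;

printf "read %d characters\n", length $first;
```

Because the handle carries both an encoding layer and C<:crlf>, the
line read back contains characters rather than bytes, and its line
ending is a plain newline.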


=head2 html_file_and_encoding

  ($filehandle, $encoding, $bom)
    = html_file_and_encoding($filename, \%options);

This function (exported only by request) is just like C<html_file>,
but returns more information.  In addition to the filehandle, it
returns the name of the encoding used, and a flag indicating whether a
byte order mark was found (if C<$bom> is true, the file began with a
BOM).  This may be useful if you want to write the file out again
(especially in conjunction with the C<html_outfile> function).

The optional second argument is a hashref containing options.  The
possible keys are described under C<find_charset_in>.

It dies if the file cannot be opened, or if C<sniff_encoding> cannot
determine the encoding and C<$IO::HTML::default_encoding> has been set
to C<undef>.

The result of calling C<html_file_and_encoding> in scalar context is
undefined (in the C sense: there is no guarantee what you'll get).


=head2 html_outfile

  $filehandle = html_outfile($filename, $encoding, $bom);

This function (exported only by request) opens C<$filename> for output
using C<$encoding>, and writes a BOM to it if C<$bom> is true.
If C<$encoding> is C<undef>, it defaults to C<$IO::HTML::default_encoding>.
C<$encoding> may be either an encoding name or an Encode::Encoding object.

It dies if the file cannot be opened, or if both C<$encoding> and
C<$IO::HTML::default_encoding> are C<undef>.
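
The read and write functions are designed to be used together.  The
following sketch, assuming IO::HTML is installed, copies a document
while keeping its encoding and BOM.  The file names C<in.html> and
C<out.html> and the sample content are placeholders.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use IO::HTML qw(html_file_and_encoding html_outfile);

my $dir = tempdir(CLEANUP => 1);
my $src = "$dir/in.html";     # placeholder names
my $dst = "$dir/out.html";

# Sample input: UTF-8 with a byte order mark.
open(my $raw, '>:raw', $src) or die "Could not write $src: $!";
print $raw "\xEF\xBB\xBF<!DOCTYPE html><title>caf\xC3\xA9</title>\n";
close $raw or die "Could not close $src: $!";

# List context is required; scalar context is undefined.
my ($in, $encoding, $bom) = html_file_and_encoding($src);

# Reuse the detected encoding and BOM flag for the output file.
my $out = html_outfile($dst, $encoding, $bom);
print $out $_ while <$in>;

close $in;
close $out or die "Could not close $dst: $!";
```

Note that the input handle has the C<:crlf> layer applied and the
output handle does not, so a file with CRLF line endings would not be
reproduced byte for byte.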


=head2 sniff_encoding

  ($encoding, $bom) = sniff_encoding($filehandle, $filename, \%options);

This function (exported only by request) runs the HTML5 encoding
sniffing algorithm on C<$filehandle> (which must be seekable, and
should have been opened in C<:raw> mode).  C<$filename> is used only
for error messages (if there's a problem using the filehandle), and
defaults to "file" if omitted.  The optional third argument is a
hashref containing options.  The possible keys are described under
C<find_charset_in>.

It returns Perl's canonical name for the encoding, which is not
necessarily the same as the MIME or IANA charset name.  It returns
C<undef> if the encoding cannot be determined.  C<$bom> is true if the
file began with a byte order mark.  In scalar context, it returns only
C<$encoding>.

The filehandle's position is restored to its original position
(normally the beginning of the file) unless C<$bom> is true.  In that
case, the position is immediately after the BOM.

Tip: If you want to run C<sniff_encoding> on a file you've already
loaded into a string, open an in-memory file on the string, and pass
that handle:

  ($encoding, $bom) = do {
    open(my $fh, '<', \$string);  sniff_encoding($fh)
  };

(This only makes sense if C<$string> contains bytes, not characters.)


=head2 find_charset_in

  $encoding = find_charset_in($string_containing_HTML, \%options);

This function (exported only by request) looks for charset information
in a C<< <meta> >> tag in a possibly-incomplete HTML document using
the "two step" algorithm specified by HTML5.  It does not look for a BOM.
The C<< <meta> >> tag must begin within the first C<$IO::HTML::bytes_to_check>
bytes of the string.

It returns Perl's canonical name for the encoding, which is not
necessarily the same as the MIME or IANA charset name.  It returns
C<undef> if no charset is specified or if the specified charset is not
recognized by the Encode module.

The optional second argument is a hashref containing options.  The
following keys are recognized:

=over

=item C<encoding>

If true, return the L<Encode::Encoding> object instead of its name.
Defaults to false.

=item C<need_pragma>

If true (the default), follow the HTML5 spec and examine the
C<content> attribute only of C<< <meta http-equiv="Content-Type" >>.
If set to 0, relax the HTML5 spec, and look for "charset=" in the
C<content> attribute of I<every> meta tag.

=back
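
The effect of C<need_pragma> can be seen with a C<< <meta> >> tag that
has a C<content> attribute but no C<http-equiv>.  This is a small
sketch, assuming IO::HTML is installed; the HTML fragments are made up
for illustration.

```perl
use strict;
use warnings;
use IO::HTML qw(find_charset_in);

# A conventional declaration is found with the default options.
my $html = '<html><head><meta charset="utf-8"></head>';
my $name = find_charset_in($html);
print defined $name ? "found $name\n" : "no charset found\n";

# This tag carries charset= in content, but is not a Content-Type pragma.
my $loose = '<meta name="example" content="text/html; charset=utf-8">';

my $strict  = find_charset_in($loose);                       # need_pragma on
my $relaxed = find_charset_in($loose, { need_pragma => 0 }); # any meta tag

print 'strict:  ', (defined $strict  ? $strict  : 'undef'), "\n";
print 'relaxed: ', (defined $relaxed ? $relaxed : 'undef'), "\n";
```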

=head1 EXPORTS

By default, only C<html_file> is exported.  Other functions may be
exported on request.

For people who prefer not to export functions, all functions beginning
with C<html_> have an alias without that prefix (e.g. you can call
C<IO::HTML::file(...)> instead of C<IO::HTML::html_file(...)>).  These
aliases are not exportable.

=for Pod::Coverage
file
file_and_encoding
outfile

The following export tags are available:

=over

=item C<:all>

All exportable functions.

=item C<:rw>

C<html_file>, C<html_file_and_encoding>, C<html_outfile>.

=back
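
As a brief illustration of the tags and aliases described above
(assuming IO::HTML is installed), the following imports the C<:rw>
group and checks that an unprefixed alias is reachable by its fully
qualified name:

```perl
use strict;
use warnings;
use IO::HTML qw(:rw);   # html_file, html_file_and_encoding, html_outfile

# Imported names are callable without a package prefix.
print "html_outfile imported\n" if defined(&html_outfile);

# The unprefixed aliases are not exportable; use the full package name.
print "IO::HTML::file exists\n" if defined(&IO::HTML::file);
```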

=head1 SEE ALSO

The HTML5 specification, section 8.2.2.2 Determining the character encoding:
L<http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding>
