BOM results from the CPAN

Encode



2.78 2015/09/24 02:19:21
! Makefile.PL
  Mend pull/42 again.  This time correctly.
! lib/Encode/Supported.pod
  Applied: RT#107146: [PATCH] fix a spelling mistake
  https://rt.cpan.org/Public/Bug/Display.html?id=107146

2.77 2015/09/15 13:53:27
! Unicode/Unicode.xs Unicode/Unicode.pm
  Address RT#107043: If no BOM is found, the routine dies.
  When you decode from UTF-(16|32) without -BE or -LE and without a BOM,
  Encode now assumes BE, according to RFC 2781 and the Unicode
  Standard version 8.0.
  https://rt.cpan.org/Public/Bug/Display.html?id=107043
! Makefile.PL encoding.t
  Mend pull/42
! Encode.xs Makefile.PL encoding.pm encoding.t
  Pulled: precompile 1252 table as that is now the Pod::Simple default
  https://github.com/dankogai/p5-encode/pull/42

2.76 2015/07/31 02:18:28

Changes view on Meta::CPAN

  Addressed RT #40027:
   decode of MIME-Header removes too much whitespace
  http://rt.cpan.org/Ticket/Display.html?id=40027
  http://rt.cpan.org/Ticket/Display.html?id=42902
! t/piconv.t
  Addressed by CSJEWELL: t/piconv.t loops infinitely on Win32
  http://rt.cpan.org/Ticket/Display.html?id=47760

2.34 2009/07/08 13:34:15
! bin/piconv
  duplicate-BOM problem now fixed.
  Message-Id: <10ECB9B7-006E-4570-9EB6-51C49F04ADCF@dan.co.jp>
! bin/piconv
+ t/piconv.t
  patches and tests by SREZIC
  Message-Id: <4A5366DA.8050801@iconmobile.com>
! Makefile.PL
  man* removed on behalf of blead
  Message-Id: <20090326135219.GU18164@plum.flirble.org>

2.33 2009/03/25 07:55:57

Changes view on Meta::CPAN

  "If someone thinks utf8::upgrade($1) should be croaked like 
  chom?p($1),please try the following patch for Encode.pm."
  -- sadahiro-san
  <20040522212704.C068.BQW10602@nifty.com>

2.0 2004/05/16 20:55:15
* version updated to 2.00
   -- sorry, no big feature change.  I just hate version 1.100 :)
! lib/Encode/Guess.pm
  Unicode/Unicode.pm
  addressed  UTF-(8|32LE) + BOM misguessing
  https://rt.cpan.org/Ticket/Display.html?id=6279
! Encode.pm
  s/is_utif8/is_utf8/ in POD
! Encode/lib/Encode/CN/HZ.pm 
  Fixes "make test" failure after the patch to pp_hot.c
  by Sadahiro-san
  Message-Id: <20040222182357.6B39.BQW10602@nifty.com>
! bin/piconv
  From:   autrijus@autrijus.org
  Subject: [PATCH] "piconv -C 512" badly broken

Changes view on Meta::CPAN

  Message-Id: <3ED79E01.8050401@mac.com>
! bin/piconv
  Found and fixed the bug that -p,--perlqq does not work.
  Induced by the change from Getopt::Std to Getopt::Long.
! encoding.pm
  Addressed [cpan #2629] Wrong assumption in numeric comparison
  Message-Id: <rt-2629-7326.19.5700583232515@cpan.org>
! Encode.pm Encode.xs Unicode/Unicode.pm Unicode/Unicode.xs
 lib/Encode/Encoding.pm t/perlio.t
 ! API Change: ->new_sequence() => ->renew()
 + Encode::Unicode makes use of it so it can handle BOM on PerlIO
 + Encode::XS and Encode::utf8 now supports ->renew()
 + Encode::Encoding now documents this with examples
 - Non-XS (en|de)code stripped out of Encode::Unicode
 Message-Id: <146957DB-8C39-11D7-9C91-000393AE4244@dan.co.jp>

1.95 2003/05/21 08:41:11
! ucm/8859-*.ucm
  Since bogus entries were found in iso-8859-6, all entries are
  re-generated once again out of
  http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT

Changes view on Meta::CPAN

  Encode::Encoder, once just a placeholder of an idea, is now much more 
  practical.  See t/Encode.t to find how practical it can be.
+ lib/Encode/Config.pm
! Encode.pm
  my false laziness at Encode.pm is fixed.  Now %ExtModules is set
  in Encode::Config, and all entries are set literally, not
  programmatically.  My false laziness was resulting in many encodings
  missing from %ExtModules.
! lib/Encode/Unicode.pm
! t/Unicode.t
  BOM for 32LE was bogus as noted by Anton.  t/Unicode.t is fixed
  so that it does not rely on Encode::Unicode for BOM values
  Message-Id: <FFEC33E9-4AFB-11D6-B415-00039301D480@dan.co.jp>

1.30 2002/04/08 02:34:51
+ lib/Encode/Encoder.pm
  Object Oriented Encoder.  I reckon something like this is in need.
! Encode.pm
! t/Unicode.pm
! lib/Encode/Supported.pod
  * autoloading bug that prevented upper-case canonicals such as UTF-16
    is fixed.  Now even UTF/UCS are autoloaded!
  * encodings() is now more intuitive.
  * t/Unicode.t fixed to explicitly use Unicode.pm -- BOM values are
    stored therein.
  * Obligatory fixes to the POD.
! lib/Encode/Supported.pod
  Patch from Anton applied.
  Message-Id: <66641479.20020408033300@motor.ru>
! Encode.pm
! lib/Encode/Unicode.pm
  Cosmetic changes: "bless $obj, $class" => "bless $obj => class"

1.28 2002/04/07 18:58:42

Unicode/Unicode.pm view on Meta::CPAN


use XSLoader;
XSLoader::load( __PACKAGE__, $VERSION );

#
# Object Generator 8 transcoders all at once!
#

use Encode ();

our %BOM_Unknown = map { $_ => 1 } qw(UTF-16 UTF-32);

for my $name (
    qw(UTF-16 UTF-16BE UTF-16LE
    UTF-32 UTF-32BE UTF-32LE
    UCS-2BE  UCS-2LE)
  )
{
    my ( $size, $endian, $ucs2, $mask );
    $name =~ /^(\w+)-(\d+)(\w*)$/o;
    if ( $ucs2 = ( $1 eq 'UCS' ) ) {

Unicode/Unicode.pm view on Meta::CPAN

        endian => $endian,
        ucs2   => $ucs2,
    } => __PACKAGE__;
    Encode::define_encoding($obj, $name);
}

use parent qw(Encode::Encoding);

sub renew {
    my $self = shift;
    $BOM_Unknown{ $self->name } or return $self;
    my $clone = bless {%$self} => ref($self);
    $clone->{renewed}++;    # so the caller knows it is renewed.
    return $clone;
}

1;
__END__

=head1 NAME

Unicode/Unicode.pm view on Meta::CPAN

UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32 (UCS-4), UTF-32BE (UCS-4BE) and
UTF-32LE (UCS-4LE), and UTF-7.

Since UTF-7 is a 7-bit (re)encoded version of UTF-16BE, it is not part of
Unicode's Character Encoding Scheme.  It is separately implemented in
Encode::Unicode::UTF7.  For details see L<Encode::Unicode::UTF7>.

=item Quick Reference

                Decodes from ord(N)           Encodes chr(N) to...
       octet/char BOM S.P d800-dfff  ord > 0xffff     \x{1abcd} ==
  ---------------+-----------------+------------------------------
  UCS-2BE       2   N   N  is bogus                  Not Available
  UCS-2LE       2   N   N     bogus                  Not Available
  UTF-16      2/4   Y   Y  is   S.P           S.P            BE/LE
  UTF-16BE    2/4   N   Y       S.P           S.P    0xd82a,0xdfcd
  UTF-16LE    2/4   N   Y       S.P           S.P    0x2ad8,0xcddf
  UTF-32        4   Y   -  is bogus         As is            BE/LE
  UTF-32BE      4   N   -     bogus         As is       0x0001abcd
  UTF-32LE      4   N   -     bogus         As is       0xcdab0100
  UTF-8       1-4   -   -     bogus   >= 4 octets \xf0\x9a\xaf\x8d
  ---------------+-----------------+------------------------------

=back
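
The right-hand column of the table can be checked with a short script.
This is only a sketch: it assumes an Encode version that provides all of
the listed encodings, and the C<%hex> hash is a name chosen for this
illustration, not part of the module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The sample code point used in the Quick Reference table.
my $char = "\x{1abcd}";

# Hex-dump the octets produced by each fixed-endian encoding and by UTF-8.
my %hex = map { $_ => unpack( "H*", encode( $_, $char ) ) }
  qw(UTF-16BE UTF-16LE UTF-32BE UTF-32LE UTF-8);

# UTF-16BE yields the surrogate pair d82a dfcd; UTF-16LE swaps each octet pair.
printf "%-9s %s\n", $_, $hex{$_} for sort keys %hex;
```

The BOM-dependent encodings (UTF-16 and UTF-32 without a suffix) are left
out because their output starts with a BOM, which the table summarises as
BE/LE.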

=head1 Size, Endianness, and BOM

You can categorize these CES by 3 criteria:  size of each character,
endianness, and Byte Order Mark.

=head2 by size

UCS-2 is a fixed-length encoding with each character taking 16 bits.
It B<does not> support I<surrogate pairs>.  When a surrogate pair
is encountered during decode(), its place is filled with \x{FFFD}
if I<CHECK> is 0, or the routine croaks if I<CHECK> is 1.  When a

Unicode/Unicode.pm view on Meta::CPAN

=head2 by endianness

The first (and now failed) goal of Unicode was to map all character
repertoires into a fixed-length integer so that programmers are happy.
Since each character is either a I<short> or I<long> in C, you have to
pay attention to the endianness of each platform when you pass data
to one another.

Anything marked as BE is Big Endian (or network byte order) and LE is
Little Endian (aka VAX byte order).  For anything not marked either
BE or LE, a character called Byte Order Mark (BOM) indicating the
endianness is prepended to the string.

CAVEAT: Though BOM in utf8 (\xEF\xBB\xBF) is valid, it is meaningless
and as of this writing the Encode suite just leaves it as is (\x{FeFF}).

=over 4

=item BOM as integer when fetched in network byte order

              16         32 bits/char
  -------------------------
  BE      0xFeFF 0x0000FeFF
  LE      0xFFFe 0xFFFe0000
  -------------------------

=back
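
As an illustration of the table above, the following sketch encodes
U+FEFF with each explicit byte order and reads the octets back as a
big-endian ("network order") integer.  The C<%as_int> hash is a
hypothetical name used only for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+FEFF is the character that serves as the BOM.
my $bom = "\x{FEFF}";

# 'n' and 'N' unpack 16-bit and 32-bit integers in network byte order,
# so a little-endian BOM shows up with its octets reversed.
my %as_int = (
    'UTF-16BE' => unpack( 'n', encode( 'UTF-16BE', $bom ) ),
    'UTF-16LE' => unpack( 'n', encode( 'UTF-16LE', $bom ) ),
    'UTF-32BE' => unpack( 'N', encode( 'UTF-32BE', $bom ) ),
    'UTF-32LE' => unpack( 'N', encode( 'UTF-32LE', $bom ) ),
);

printf "%-9s 0x%X\n", $_, $as_int{$_} for sort keys %as_int;
```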

This module handles the BOM as follows.

=over 4

=item *

When BE or LE is explicitly stated as the name of encoding, BOM is
simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).

=item *

When BE or LE is omitted during decode(), it checks if BOM is at the
beginning of the string; if one is found, the endianness is set to
what the BOM says.

=item *

Default Byte Order

When no BOM is found, Encode 2.76 and below croaked.  Since Encode
2.77, it falls back to BE according to RFC 2781 and the Unicode
Standard version 8.0.

=item *

When BE or LE is omitted during encode(), it returns a BE-encoded
string with BOM prepended.  So when you want to encode a whole text
file, make sure you encode() the whole text at once, not line by line
or each line, not file, will have a BOM prepended.

=item *

C<UCS-2> is an exception.  Unlike others, this is an alias of UCS-2BE.
UCS-2 is already registered by IANA and others that way.

=back
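
The rules above can be sketched as follows.  The variable names are
chosen for this illustration only, and the third case is wrapped in
C<eval> because its outcome depends on the installed Encode version
(big-endian fallback since 2.77, an exception before that).

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# 1. Explicit endianness: a leading BOM is kept as an ordinary character.
my $kept = decode( 'UTF-16BE', "\xFE\xFF" . "\x00A" );    # "\x{FEFF}A"

# 2. Endianness omitted: the BOM selects the byte order and is consumed.
my $le = decode( 'UTF-16', "\xFF\xFE" . "A\x00" );        # "A"

# 3. Endianness omitted and no BOM: big-endian fallback on Encode 2.77+,
#    an exception (caught here, leaving undef) on older versions.
my $fallback = eval { decode( 'UTF-16', "\x00A" ) };

# 4. encode() without BE/LE prepends a big-endian BOM on every call,
#    so encoding line by line yields one BOM per line.
my $whole   = encode( 'UTF-16', "a\nb\n" );
my $by_line = join '', map { encode( 'UTF-16', $_ ) } "a\n", "b\n";

printf "whole: %d octets, line by line: %d octets\n",
  length $whole, length $by_line;
```

When encoding line by line cannot be avoided, naming the byte order
explicitly (for example C<UTF-16BE>) avoids the repeated BOM, at the cost
of having to write any leading BOM yourself.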

=head1 Surrogate Pairs

Unicode/Unicode.pm view on Meta::CPAN

  integer support!

=head1 Error Checking

Unlike most encodings which accept various ways to handle errors,
Unicode encodings simply croak.

  % perl -MEncode -e'$_ = "\xfe\xff\xd8\xd9\xda\xdb\0\n"' \
         -e'Encode::from_to($_, "utf16","shift_jis", 0); print'
  UTF-16:Malformed LO surrogate d8d9 at /path/to/Encode.pm line 184.
  % perl -MEncode -e'$a = "BOM missing"' \
         -e' Encode::from_to($a, "utf16", "shift_jis", 0); print'
  UTF-16:Unrecognised BOM 424f at /path/to/Encode.pm line 184.

Unlike other encodings where mappings are not one-to-one against
Unicode, UTFs are supposed to map 100% against one another.  So Encode
is more strict on UTFs.

Consider that "division by zero" of Encode :)
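
A caller that wants to survive such input can trap the exception.  The
sketch below reuses the malformed octets from the first one-liner above
and passes C<FB_CROAK> explicitly, on the assumption that behaviour with
I<CHECK> set to 0 may differ between Encode versions.  The exact wording
of the error message is not guaranteed either.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# A big-endian BOM, then a high surrogate (d8d9) that is followed by
# another high surrogate (dadb) instead of the required low surrogate.
my $octets = "\xfe\xff\xd8\xd9\xda\xdb\x00\x0a";

# decode() croaks on the malformed pair; eval turns that into undef + $@.
my $text  = eval { decode( 'UTF-16', $octets, FB_CROAK ) };
my $error = $@;

print defined $text ? "decoded\n" : "decode failed: $error";
```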

=head1 SEE ALSO

L<Encode>, L<Encode::Unicode::UTF7>, L<https://www.unicode.org/glossary/>,

Unicode/Unicode.xs view on Meta::CPAN


#define IN_UNICODE_XS

#define PERL_NO_GET_CONTEXT
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
#include "../Encode/encode.h"

#define FBCHAR			0xFFFd
#define BOM_BE			0xFeFF
#define BOM16LE			0xFFFe
#define BOM32LE			0xFFFe0000
#define issurrogate(x)		(0xD800 <= (x)  && (x) <= 0xDFFF )
#define isHiSurrogate(x)	(0xD800 <= (x)  && (x) <  0xDC00 )
#define isLoSurrogate(x)	(0xDC00 <= (x)  && (x) <= 0xDFFF )
#define invalid_ucs2(x)         ( issurrogate(x) || 0xFFFF < (x) )

#ifndef SVfARG
#define SVfARG(p) ((void*)(p))
#endif

#define PERLIO_BUFSIZ 1024 /* XXX value comes from PerlIOEncode_get_base */

Unicode/Unicode.xs view on Meta::CPAN

    temp_result = (ulen == PERLIO_BUFSIZ);

    ST(0) = sv_2mortal(result);
    SvUTF8_on(result);

    if (!endian && s+size <= e) {
	SV *sv;
	UV bom;
	endian = (size == 4) ? 'N' : 'n';
	bom = enc_unpack(aTHX_ &s,e,size,endian);
	if (bom != BOM_BE) {
	    if (bom == BOM16LE) {
		endian = 'v';
	    }
	    else if (bom == BOM32LE) {
		endian = 'V';
	    }
	    else {
               /* No BOM found, use big-endian fallback as specified in
                * RFC2781 and the Unicode Standard version 8.0:
                *
                *  The UTF-16 encoding scheme may or may not begin with
                *  a BOM. However, when there is no BOM, and in the
                *  absence of a higher-level protocol, the byte order
                *  of the UTF-16 encoding scheme is big-endian.
                *
                *  If the first two octets of the text is not 0xFE
                *  followed by 0xFF, and is not 0xFF followed by 0xFE,
                *  then the text SHOULD be interpreted as big-endian.
                */
                s -= size;
	    }
	}

Unicode/Unicode.xs view on Meta::CPAN

    ST(0) = sv_2mortal(result);

    /* Preallocate the result buffer to the maximum possible size.
       ie. assume each UTF8 byte is 1 character.
       Then shrink the result's buffer if necessary at the end. */
    SvGROW(result, ((ulen+1) * usize));

    if (!endian) {
	SV *sv;
	endian = (size == 4) ? 'N' : 'n';
	enc_pack(aTHX_ result,size,endian,BOM_BE);
#if 1
	/* Update endian for next sequence */
	sv = attr("renewed");
	if (SvTRUE(sv)) {
	    (void)hv_store((HV *)SvRV(obj),"endian",6,newSVpv((char *)&endian,1),0);
	}
#endif
    }
    while (s < e && s+UTF8SKIP(s) <= e) {
        STRLEN len;

bin/encguess view on Meta::CPAN


   encguess -us euc-jp,shiftjis,7bit-jis test*.txt

=back

=head1 DESCRIPTION

The encoding identification is done by checking one encoding type at a
time until all but the right type are eliminated. The set of encoding
types to try is defined by the -s parameter and defaults to ascii,
utf8 and UTF-16/32 with BOM. This can be overridden by passing one or
more encoding types via the -s parameter. If you need to pass in
multiple suspect encoding types, use a quoted string with a space
separating each value.

=head1 SEE ALSO

L<Encode::Guess>, L<Encode::Detect>

=head1 LICENSE AND COPYRIGHT

lib/Encode/Guess.pm view on Meta::CPAN

    Suspects => {%DEF_SUSPECTS},
} => __PACKAGE__;
Encode::define_encoding($obj, $Canon);

use parent qw(Encode::Encoding);
sub needs_lines { 1 }
sub perlio_ok   { 0 }

our @EXPORT         = qw(guess_encoding);
our $NoUTFAutoGuess = 0;
our $UTF8_BOM       = pack( "C3", 0xef, 0xbb, 0xbf );

sub import {    # Exporter not used so we do it on our own
    my $callpkg = caller;
    for my $item (@EXPORT) {
        no strict 'refs';
        *{"$callpkg\::$item"} = \&{"$item"};
    }
    set_suspects(@_);
}

lib/Encode/Guess.pm view on Meta::CPAN


    # sanity check
    return "Empty string, empty guess" unless defined $octet and length $octet;

    # cheat 0: utf8 flag;
    if ( Encode::is_utf8($octet) ) {
        return find_encoding('utf8') unless $NoUTFAutoGuess;
        Encode::_utf8_off($octet);
    }

    # cheat 1: BOM
    use Encode::Unicode;
    unless ($NoUTFAutoGuess) {
        my $BOM = pack( 'C3', unpack( "C3", $octet ) );
        return find_encoding('utf8')
          if ( defined $BOM and $BOM eq $UTF8_BOM );
        $BOM = unpack( 'N', $octet );
        return find_encoding('UTF-32')
          if ( defined $BOM and ( $BOM == 0xFeFF or $BOM == 0xFFFe0000 ) );
        $BOM = unpack( 'n', $octet );
        return find_encoding('UTF-16')
          if ( defined $BOM and ( $BOM == 0xFeFF or $BOM == 0xFFFe ) );
        if ( $octet =~ /\x00/o )
        {    # if \x00 found, we assume UTF-(16|32)(BE|LE)
            my $utf;
            my ( $be, $le ) = ( 0, 0 );
            if ( $octet =~ /\x00\x00/o ) {    # UTF-32(BE|LE) assumed
                $utf = "UTF-32";
                for my $char ( unpack( 'N*', $octet ) ) {
                    $char & 0x0000ffff and $be++;
                    $char & 0xffff0000 and $le++;
                }

lib/Encode/Guess.pm view on Meta::CPAN

  # or
  $utf8 = decode($enc->name, $data)

=head1 ABSTRACT

Encode::Guess lets you guess which encoding a given piece of data is
in, or at least tries to.

=head1 DESCRIPTION

By default, it checks only ascii, utf8 and UTF-16/32 with BOM.

  use Encode::Guess; # ascii/utf8/BOMed UTF

To use it more practically, you have to give the names of encodings to
check (I<suspects> as follows).  The name of suspects can either be
canonical names or aliases.

CAVEAT: Unlike UTF-(16|32), BOM in utf8 is NOT AUTOMATICALLY STRIPPED.

  # tries all major Japanese Encodings as well
  use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;

If the C<$Encode::Guess::NoUTFAutoGuess> variable is set to a true
value, no heuristics will be applied to UTF8/16/32, and the result
will be limited to the suspects and C<ascii>.
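
The default behaviour and the utf8 BOM caveat above can be seen in a
small sketch.  It assumes the default suspects only; the variable names
are illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # default suspects: ascii, utf8, BOMed UTF-16/32

# encode() to plain "UTF-16" prepends a big-endian BOM, which the guesser
# uses to identify the data.
my $utf16   = encode( 'UTF-16', "Hello" );
my $guess16 = guess_encoding($utf16);
die $guess16 unless ref $guess16;
my $text16 = $guess16->decode($utf16);    # BOM consumed: "Hello"

# UTF-8 data with a leading BOM is recognised as utf8, but the BOM
# survives decoding as the character U+FEFF.
my $utf8   = "\xEF\xBB\xBF" . "Hello";
my $guess8 = guess_encoding($utf8);
die $guess8 unless ref $guess8;
my $text8 = $guess8->decode($utf8);       # "\x{FEFF}Hello"

printf "%s / %s\n", $guess16->name, $guess8->name;
```

If the leading U+FEFF is unwanted after a utf8 guess, the caller has to
strip it, for example with C<s/^\x{FEFF}//>.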

=over 4

lib/Encode/Guess.pm view on Meta::CPAN

=item guess_encoding($data, [, I<list of suspects>])

You can also use the C<guess_encoding> function, which is exported by
default.  It takes $data to check and, optionally, a list of
suspects.  The optional suspect list is I<not> added to
the internal suspects list.

  my $decoder = guess_encoding($data, qw/euc-jp euc-kr euc-cn/);
  die $decoder unless ref($decoder);
  my $utf8 = $decoder->decode($data);
  # check only ascii, utf8 and UTF-(16|32) with BOM
  my $decoder = guess_encoding($data);

=back
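
Because C<guess_encoding> returns an object on success and a plain error
string on failure, both outcomes are worth handling.  The following
sketch assumes that Encode::JP is installed (it ships with Encode) and
uses a single suspect to keep the guess unambiguous; C<$hit> and C<$miss>
are names chosen for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;

# Two hiragana characters encoded as EUC-JP: neither ascii nor valid utf8.
my $eucjp = encode( 'euc-jp', "\x{3042}\x{3044}" );
my $hit   = guess_encoding( $eucjp, qw/euc-jp/ );
print ref $hit ? "guessed " . $hit->name . "\n" : "failed: $hit\n";

# Octets that match none of the default suspects: the return value is an
# error string rather than an encoding object.
my $miss = guess_encoding("\xFF\xFF\xFF");
print ref $miss ? "guessed " . $miss->name . "\n" : "failed: $miss\n";
```

With several closely related suspects (for example euc-jp and shiftjis
together) short inputs may be ambiguous, in which case the error string
lists the candidates.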

=head1 CAVEATS

=over 4

=item *

lib/Encode/Supported.pod view on Meta::CPAN


is heavily misused.
See L<Microsoft-related naming mess> for details.

C<KS_C_5601-1987> I<raw> encoding is available as C<kcs5601-raw>
with Encode. See L<Encode::KR> for details.

  UTF-16 UTF-16BE UTF-16LE

are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that

=over 2

=item *

C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support

t/perlio.t view on Meta::CPAN

        dump2file("$pfile.$seq", $dtext);
        }
    }
     if ( ! $DEBUG ) {
            1 while unlink ($sfile);
            1 while unlink ($pfile);
        }
    }
}

# BOM Test

SKIP:{
    my $pev = PerlIO::encoding->VERSION;
    skip "PerlIO::encoding->VERSION = $pev <= 0.07 ", 6
    unless ($pev >= 0.07 or $DEBUG);

    my $file = File::Spec->catfile($dir,"jisx0208.utf");
    open my $fh, "<:utf8", $file or die "$file : $!";
    my $str = join('' => <$fh>);
    close $fh;
