2.78 2015/09/24 02:19:21
! Makefile.PL
Mend pull/42 again. This time correctly.
! lib/Encode/Supported.pod
Applied: RT#107146: [PATCH] fix a spelling mistake
https://rt.cpan.org/Public/Bug/Display.html?id=107146
2.77 2015/09/15 13:53:27
! Unicode/Unicode.xs Unicode/Unicode.pm
Address RT#107043: If no BOM is found, the routine dies.
When you decode from UTF-(16|32) without -BE or -LE and without a BOM,
Encode now assumes BE, according to RFC 2781 and the Unicode
Standard version 8.0.
https://rt.cpan.org/Public/Bug/Display.html?id=107043
! Makefile.PL encoding.t
Mend pull/42
! Encode.xs Makefile.PL encoding.pm encoding.t
Pulled: precompile 1252 table as that is now the Pod::Simple default
https://github.com/dankogai/p5-encode/pull/42
2.76 2015/07/31 02:18:28
Addressed RT #40027:
decode of MIME-Header removes too much whitespace
http://rt.cpan.org/Ticket/Display.html?id=40027
http://rt.cpan.org/Ticket/Display.html?id=42902
! t/piconv.t
Addressed by CSJEWELL: t/piconv.t loops infinitely on Win32
http://rt.cpan.org/Ticket/Display.html?id=47760
2.34 2009/07/08 13:34:15
! bin/piconv
duplicate-BOM problem now fixed.
Message-Id: <10ECB9B7-006E-4570-9EB6-51C49F04ADCF@dan.co.jp>
! bin/piconv
+ t/piconv.t
patches and tests by SREZIC
Message-Id: <4A5366DA.8050801@iconmobile.com>
! Makefile.PL
man* removed on behalf of blead
Message-Id: <20090326135219.GU18164@plum.flirble.org>
2.33 2009/03/25 07:55:57
"If someone thinks utf8::upgrade($1) should be croaked like
chom?p($1),please try the following patch for Encode.pm."
-- sadahiro-san
<20040522212704.C068.BQW10602@nifty.com>
2.0 2004/05/16 20:55:15
* version updated to 2.00
-- sorry, no big feature change. I just hate version 1.100 :)
! lib/Encode/Guess.pm
Unicode/Unicode.pm
addressed UTF-(8|32LE) + BOM misguessing
https://rt.cpan.org/Ticket/Display.html?id=6279
! Encode.pm
s/is_utif8/is_utf8/ in POD
! Encode/lib/Encode/CN/HZ.pm
Fixes "make test" failure after the patch to pp_hot.c
by Sadahiro-san
Message-Id: <20040222182357.6B39.BQW10602@nifty.com>
! bin/piconv
From: autrijus@autrijus.org
Subject: [PATCH] "piconv -C 512" badly broken
Message-Id: <3ED79E01.8050401@mac.com>
! bin/piconv
Found and fixed the bug that -p,--perlqq does not work.
Induced by the change from Getopt::Std to Getopt::Long.
! encoding.pm
Addressed [cpan #2629] Wrong assumption in numeric comparison
Message-Id: <rt-2629-7326.19.5700583232515@cpan.org>
! Encode.pm Encode.xs Unicode/Unicode.pm Unicode/Unicode.xs
lib/Encode/Encoding.pm t/perlio.t
! API Change: ->new_sequence() => ->renew()
+ Encode::Unicode makes use of it so it can handle BOM on PerlIO
+ Encode::XS and Encode::utf8 now supports ->renew()
+ Encode::Encoding now documents this with examples
- Non-XS (en|de)code stripped out of Encode::Unicode
Message-Id: <146957DB-8C39-11D7-9C91-000393AE4244@dan.co.jp>
1.95 2003/05/21 08:41:11
! ucm/8859-*.ucm
Since bogus entries were found in iso-8859-6, all entries are
re-generated once again out of
http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-*.TXT
Encode::Encoder, once just a placeholder of an idea, is now much more
practical. See t/Encode.t to find how practical it can be.
+ lib/Encode/Config.pm
! Encode.pm
My false laziness in Encode.pm is fixed. %ExtModules is now set
in Encode::Config, and its entries are all set literally, not
programmatically. My false laziness had left many encodings missing
from %ExtModules.
! lib/Encode/Unicode.pm
! t/Unicode.t
BOM for 32LE was bogus as noted by Anton. t/Unicode.t is fixed
so that it does not rely on Encode::Unicode for BOM values.
Message-Id: <FFEC33E9-4AFB-11D6-B415-00039301D480@dan.co.jp>
1.30 2002/04/08 02:34:51
+ lib/Encode/Encoder.pm
Object Oriented Encoder. I reckon something like this is in need.
! Encode.pm
! t/Unicode.pm
! lib/Encode/Supported.pod
* autoloading bug that prevented upper-case canonicals such as UTF-16
is fixed. Now even UTF/UCS are autoloaded!
* encodings() is now more intuitive.
* t/Unicode.t fixed to explicitly use Unicode.pm -- BOM values are
stored therein.
* Obligatory fixes to the POD.
! lib/Encode/Supported.pod
Patch from Anton applied.
Message-Id: <66641479.20020408033300@motor.ru>
! Encode.pm
! lib/Encode/Unicode.pm
Cosmetic changes: "bless $obj, $class" => "bless $obj => $class"
1.28 2002/04/07 18:58:42
Unicode/Unicode.pm view on Meta::CPAN
use XSLoader;
XSLoader::load( __PACKAGE__, $VERSION );
#
# Object Generator 8 transcoders all at once!
#
use Encode ();
our %BOM_Unknown = map { $_ => 1 } qw(UTF-16 UTF-32);
for my $name (
qw(UTF-16 UTF-16BE UTF-16LE
UTF-32 UTF-32BE UTF-32LE
UCS-2BE UCS-2LE)
)
{
my ( $size, $endian, $ucs2, $mask );
$name =~ /^(\w+)-(\d+)(\w*)$/o;
if ( $ucs2 = ( $1 eq 'UCS' ) ) {
Unicode/Unicode.pm view on Meta::CPAN
endian => $endian,
ucs2 => $ucs2,
} => __PACKAGE__;
Encode::define_encoding($obj, $name);
}
use parent qw(Encode::Encoding);
sub renew {
my $self = shift;
$BOM_Unknown{ $self->name } or return $self;
my $clone = bless {%$self} => ref($self);
$clone->{renewed}++; # so the caller knows it is renewed.
return $clone;
}
1;
__END__
=head1 NAME
Unicode/Unicode.pm view on Meta::CPAN
UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32 (UCS-4), UTF-32BE (UCS-4BE) and
UTF-32LE (UCS-4LE), and UTF-7.
Since UTF-7 is a 7-bit (re)encoded version of UTF-16BE, it is not part of
Unicode's Character Encoding Scheme. It is separately implemented in
Encode::Unicode::UTF7. For details see L<Encode::Unicode::UTF7>.
=item Quick Reference
                Decodes from ord(N)           Encodes chr(N) to...
       octet/char BOM S.P d800-dfff  ord > 0xffff     \x{1abcd} ==
  ---------------+-----------------+------------------------------
  UCS-2BE       2   N   N  is bogus                  Not Available
  UCS-2LE       2   N   N     bogus                  Not Available
  UTF-16      2/4   Y   Y  is   S.P           S.P            BE/LE
  UTF-16BE    2/4   N   Y       S.P           S.P    0xd82a,0xdfcd
  UTF-16LE    2/4   N   Y       S.P           S.P    0x2ad8,0xcddf
  UTF-32        4   Y   -  is bogus         As is            BE/LE
  UTF-32BE      4   N   -     bogus         As is       0x0001abcd
  UTF-32LE      4   N   -     bogus         As is       0xcdab0100
  UTF-8       1-4   -   -     bogus   >= 4 octets  \xf0\x9a\xaf\x8d
  ---------------+-----------------+------------------------------
=back
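The last column of the table can be checked directly. The following is a minimal sketch, assuming only the C<encode> function exported by Encode; the variable names are illustrative and not part of the module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+1ABCD lies outside the BMP, so UTF-16 needs a surrogate pair for it.
my $char = "\x{1abcd}";

# Encode the same character with each explicit encoding and keep the
# resulting octets as a hex string.
my %octets = map { $_ => unpack( "H*", encode( $_, $char ) ) }
  qw(UTF-16BE UTF-16LE UTF-32BE UTF-32LE UTF-8);

printf "%-8s %s\n", $_, $octets{$_} for sort keys %octets;
```

The UCS-2 encodings are left out on purpose: as the table says, they cannot represent a character above 0xffff.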
=head1 Size, Endianness, and BOM
You can categorize these CES by 3 criteria: size of each character,
endianness, and Byte Order Mark.
=head2 by size
UCS-2 is a fixed-length encoding with each character taking 16 bits.
It B<does not> support I<surrogate pairs>. When a surrogate pair
is encountered during decode(), its place is filled with \x{FFFD}
if I<CHECK> is 0, or the routine croaks if I<CHECK> is 1. When a
Unicode/Unicode.pm view on Meta::CPAN
=head2 by endianness
The first (and now failed) goal of Unicode was to map all character
repertoires into a fixed-length integer so that programmers are happy.
Since each character is either a I<short> or I<long> in C, you have to
pay attention to the endianness of each platform when you pass data
to one another.
Anything marked as BE is Big Endian (or network byte order) and LE is
Little Endian (aka VAX byte order). For anything not marked either
BE or LE, a character called Byte Order Mark (BOM) indicating the
endianness is prepended to the string.
CAVEAT: Though a BOM in utf8 (\xEF\xBB\xBF) is valid, it is meaningless,
and as of this writing the Encode suite just leaves it as is (\x{FeFF}).
=over 4
=item BOM as integer when fetched in network byte order
16 32 bits/char
-------------------------
BE 0xFeFF 0x0000FeFF
LE 0xFFFe 0xFFFe0000
-------------------------
=back
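The values in the table above can be reproduced by encoding U+FEFF with an explicit byte order and reading the octets back in network byte order. This is a small sketch under that assumption; C<%as_int> is a hypothetical name used only here.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $bom = "\x{FEFF}";

# With an explicit BE/LE encoding, U+FEFF is encoded like any other
# character. unpack 'n' and 'N' read 16 and 32 bits in network order.
my %as_int = (
    'UTF-16BE' => unpack( 'n', encode( 'UTF-16BE', $bom ) ),
    'UTF-16LE' => unpack( 'n', encode( 'UTF-16LE', $bom ) ),
    'UTF-32BE' => unpack( 'N', encode( 'UTF-32BE', $bom ) ),
    'UTF-32LE' => unpack( 'N', encode( 'UTF-32LE', $bom ) ),
);

printf "%-8s 0x%X\n", $_, $as_int{$_} for sort keys %as_int;
```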
This module handles the BOM as follows.
=over 4
=item *
When BE or LE is explicitly stated as the name of encoding, BOM is
simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).
=item *
When BE or LE is omitted during decode(), it checks if BOM is at the
beginning of the string; if one is found, the endianness is set to
what the BOM says.
=item *
Default Byte Order
When no BOM is found, Encode 2.76 and below croaked. Since Encode
2.77, it falls back to BE, according to RFC 2781 and the Unicode
Standard version 8.0.
=item *
When BE or LE is omitted during encode(), it returns a BE-encoded
string with BOM prepended. So when you want to encode a whole text
file, make sure you encode() the whole text at once, not line by line
or each line, not file, will have a BOM prepended.
=item *
C<UCS-2> is an exception. Unlike others, this is an alias of UCS-2BE.
UCS-2 is already registered by IANA and others that way.
=back
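The rules listed above can be sketched as follows. This is an illustration rather than a test of the module: the variable names are made up for the example, and the last case is wrapped in C<eval> because its outcome depends on whether the installed Encode is 2.77 or later.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# encode() without BE/LE: a big-endian BOM is prepended.
my $with_bom = encode( "UTF-16", "A" );    # octets FE FF 00 41

# decode() without BE/LE: the BOM selects the byte order and is consumed.
my $le_octets = "\xFF\xFE" . "A\x00";
my $from_le   = decode( "UTF-16", $le_octets );

# Explicit BE: the BOM is an ordinary character and stays in the result.
my $be_octets = "\xFE\xFF\x00A";
my $kept      = decode( "UTF-16BE", $be_octets );

# Encoding line by line prepends a BOM to every line, not just the first.
my $per_line  = join "", map { encode( "UTF-16", $_ ) } "a\n", "b\n";
my $bom_count = () = $per_line =~ /\xFE\xFF/g;

# No BOM at all: croaks on old versions, falls back to BE on newer ones.
my $no_bom_octets = "\x00A";
my $no_bom = eval { decode( "UTF-16", $no_bom_octets ) };

printf "encode UTF-16:   %s\n", unpack( "H*", $with_bom );
printf "decode LE BOM:   %s\n", $from_le;
printf "decode UTF-16BE: %s\n",
  join( " ", map { sprintf "U+%04X", ord } split //, $kept );
printf "BOMs per 2 lines: %d\n", $bom_count;
printf "no BOM:          %s\n", defined $no_bom ? $no_bom : "croaked";
```

Encoding the whole text in one call, as recommended above, yields a single BOM at the start of the output.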
=head1 Surrogate Pairs
Unicode/Unicode.pm view on Meta::CPAN
integer support!
=head1 Error Checking
Unlike most encodings, which accept various ways to handle errors,
Unicode encodings simply croak.
% perl -MEncode -e'$_ = "\xfe\xff\xd8\xd9\xda\xdb\0\n"' \
-e'Encode::from_to($_, "utf16","shift_jis", 0); print'
UTF-16:Malformed LO surrogate d8d9 at /path/to/Encode.pm line 184.
% perl -MEncode -e'$a = "BOM missing"' \
-e' Encode::from_to($a, "utf16", "shift_jis", 0); print'
UTF-16:Unrecognised BOM 424f at /path/to/Encode.pm line 184.
Unlike other encodings where mappings are not one-to-one against
Unicode, UTFs are supposed to map 100% against one another. So Encode
is more strict on UTFs.
Consider that "division by zero" of Encode :)
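Inside a program, the croak can be trapped with C<eval>. The sketch below reuses the malformed octets from the first example above and passes C<FB_CROAK> explicitly, on the assumption that behaviour with I<CHECK> set to 0 may differ between Encode versions; C<$octets>, C<$text> and C<$error> are illustrative names.

```perl
use strict;
use warnings;
use Encode qw(decode);

# BE BOM, then a high surrogate (d8d9) followed by another high
# surrogate (dadb) where a low surrogate is required.
my $octets = "\xfe\xff\xd8\xd9\xda\xdb\0\n";

# With FB_CROAK the decoder dies on malformed input; eval traps it.
my $text  = eval { decode( "UTF-16", $octets, Encode::FB_CROAK() ) };
my $error = $@;

print defined $text ? "decoded\n" : "failed: $error";
```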
=head1 SEE ALSO
L<Encode>, L<Encode::Unicode::UTF7>, L<https://www.unicode.org/glossary/>,
Unicode/Unicode.xs view on Meta::CPAN
#define IN_UNICODE_XS
#define PERL_NO_GET_CONTEXT
#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"
#include "../Encode/encode.h"
#define FBCHAR 0xFFFd
#define BOM_BE 0xFeFF
#define BOM16LE 0xFFFe
#define BOM32LE 0xFFFe0000
#define issurrogate(x) (0xD800 <= (x) && (x) <= 0xDFFF )
#define isHiSurrogate(x) (0xD800 <= (x) && (x) < 0xDC00 )
#define isLoSurrogate(x) (0xDC00 <= (x) && (x) <= 0xDFFF )
#define invalid_ucs2(x) ( issurrogate(x) || 0xFFFF < (x) )
#ifndef SVfARG
#define SVfARG(p) ((void*)(p))
#endif
#define PERLIO_BUFSIZ 1024 /* XXX value comes from PerlIOEncode_get_base */
Unicode/Unicode.xs view on Meta::CPAN
temp_result = (ulen == PERLIO_BUFSIZ);
ST(0) = sv_2mortal(result);
SvUTF8_on(result);
if (!endian && s+size <= e) {
SV *sv;
UV bom;
endian = (size == 4) ? 'N' : 'n';
bom = enc_unpack(aTHX_ &s,e,size,endian);
if (bom != BOM_BE) {
if (bom == BOM16LE) {
endian = 'v';
}
else if (bom == BOM32LE) {
endian = 'V';
}
else {
/* No BOM found, use big-endian fallback as specified in
* RFC2781 and the Unicode Standard version 8.0:
*
* The UTF-16 encoding scheme may or may not begin with
* a BOM. However, when there is no BOM, and in the
* absence of a higher-level protocol, the byte order
* of the UTF-16 encoding scheme is big-endian.
*
* If the first two octets of the text is not 0xFE
* followed by 0xFF, and is not 0xFF followed by 0xFE,
* then the text SHOULD be interpreted as big-endian.
*/
s -= size;
}
}
Unicode/Unicode.xs view on Meta::CPAN
ST(0) = sv_2mortal(result);
/* Preallocate the result buffer to the maximum possible size.
ie. assume each UTF8 byte is 1 character.
Then shrink the result's buffer if necessary at the end. */
SvGROW(result, ((ulen+1) * usize));
if (!endian) {
SV *sv;
endian = (size == 4) ? 'N' : 'n';
enc_pack(aTHX_ result,size,endian,BOM_BE);
#if 1
/* Update endian for next sequence */
sv = attr("renewed");
if (SvTRUE(sv)) {
(void)hv_store((HV *)SvRV(obj),"endian",6,newSVpv((char *)&endian,1),0);
}
#endif
}
while (s < e && s+UTF8SKIP(s) <= e) {
STRLEN len;
bin/encguess view on Meta::CPAN
encguess -us euc-jp,shiftjis,7bit-jis test*.txt
=back
=head1 DESCRIPTION
The encoding identification is done by checking one encoding type at a
time until all but the right type are eliminated. The set of encoding
types to try is defined by the -s parameter and defaults to ascii,
utf8 and UTF-16/32 with BOM. This can be overridden by passing one or
more encoding types via the -s parameter. If you need to pass in
multiple suspect encoding types, use a quoted string with a space
separating each value.
=head1 SEE ALSO
L<Encode::Guess>, L<Encode::Detect>
=head1 LICENSE AND COPYRIGHT
lib/Encode/Guess.pm view on Meta::CPAN
Suspects => {%DEF_SUSPECTS},
} => __PACKAGE__;
Encode::define_encoding($obj, $Canon);
use parent qw(Encode::Encoding);
sub needs_lines { 1 }
sub perlio_ok { 0 }
our @EXPORT = qw(guess_encoding);
our $NoUTFAutoGuess = 0;
our $UTF8_BOM = pack( "C3", 0xef, 0xbb, 0xbf );
sub import { # Exporter not used so we do it on our own
my $callpkg = caller;
for my $item (@EXPORT) {
no strict 'refs';
*{"$callpkg\::$item"} = \&{"$item"};
}
set_suspects(@_);
}
lib/Encode/Guess.pm view on Meta::CPAN
# sanity check
return "Empty string, empty guess" unless defined $octet and length $octet;
# cheat 0: utf8 flag;
if ( Encode::is_utf8($octet) ) {
return find_encoding('utf8') unless $NoUTFAutoGuess;
Encode::_utf8_off($octet);
}
# cheat 1: BOM
use Encode::Unicode;
unless ($NoUTFAutoGuess) {
my $BOM = pack( 'C3', unpack( "C3", $octet ) );
return find_encoding('utf8')
if ( defined $BOM and $BOM eq $UTF8_BOM );
$BOM = unpack( 'N', $octet );
return find_encoding('UTF-32')
if ( defined $BOM and ( $BOM == 0xFeFF or $BOM == 0xFFFe0000 ) );
$BOM = unpack( 'n', $octet );
return find_encoding('UTF-16')
if ( defined $BOM and ( $BOM == 0xFeFF or $BOM == 0xFFFe ) );
if ( $octet =~ /\x00/o )
{ # if \x00 found, we assume UTF-(16|32)(BE|LE)
my $utf;
my ( $be, $le ) = ( 0, 0 );
if ( $octet =~ /\x00\x00/o ) { # UTF-32(BE|LE) assumed
$utf = "UTF-32";
for my $char ( unpack( 'N*', $octet ) ) {
$char & 0x0000ffff and $be++;
$char & 0xffff0000 and $le++;
}
lib/Encode/Guess.pm view on Meta::CPAN
# or
$utf8 = decode($enc->name, $data)
=head1 ABSTRACT
Encode::Guess enables you to guess the encoding in which given data is
encoded, or at least tries to.
=head1 DESCRIPTION
By default, it checks only ascii, utf8 and UTF-16/32 with BOM.
use Encode::Guess; # ascii/utf8/BOMed UTF
To use it more practically, you have to give the names of encodings to
check (I<suspects> as follows). The name of suspects can either be
canonical names or aliases.
CAVEAT: Unlike UTF-(16|32), BOM in utf8 is NOT AUTOMATICALLY STRIPPED.
# tries all major Japanese Encodings as well
use Encode::Guess qw/euc-jp shiftjis 7bit-jis/;
If the C<$Encode::Guess::NoUTFAutoGuess> variable is set to a true
value, no heuristics will be applied to UTF8/16/32, and the result
will be limited to the suspects and C<ascii>.
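The caveat about the utf8 BOM can be seen in a short sketch. It assumes the default suspects and C<$NoUTFAutoGuess> left at 0; the substitution at the end is one possible way to drop the BOM by hand, not something the module does for you.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# UTF-8 BOM (EF BB BF) followed by plain ASCII text.
my $octets  = "\xEF\xBB\xBF" . "Hi";
my $decoder = guess_encoding($octets);
die $decoder unless ref $decoder;    # a plain string is an error message

my $name = $decoder->name;
my $text = $decoder->decode($octets);

# The BOM survives decoding as U+FEFF, so the text is one character longer.
my $length_with_bom = length $text;

# Strip it manually if it is not wanted.
( my $stripped = $text ) =~ s/\A\x{FEFF}//;

print "$name: $length_with_bom characters, '$stripped' after stripping\n";
```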
=over 4
lib/Encode/Guess.pm view on Meta::CPAN
=item guess_encoding($data [, I<list of suspects>])
You can also try C<guess_encoding> function which is exported by
default. It takes $data to check and, optionally, a list of
suspects. The optional suspect list is I<not reflected> in
the internal suspects list.
my $decoder = guess_encoding($data, qw/euc-jp euc-kr euc-cn/);
die $decoder unless ref($decoder);
my $utf8 = $decoder->decode($data);
# check only ascii, utf8 and UTF-(16|32) with BOM
my $decoder = guess_encoding($data);
=back
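For BOMed UTF-16 input, the default call needs no suspects at all. The following is a minimal sketch with made-up sample data; it relies only on C<guess_encoding> returning an encoding object on success and a plain error string otherwise.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Big-endian BOM followed by "Hi" in UTF-16.
my $octets  = "\xFE\xFF\x00H\x00i";
my $decoder = guess_encoding($octets);

# On failure the return value is an error string, not an object.
die $decoder unless ref $decoder;

my $name = $decoder->name;
my $text = $decoder->decode($octets);    # the UTF-16 BOM is consumed

print "$name: $text\n";
```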
=head1 CAVEATS
=over 4
=item *
lib/Encode/Supported.pod view on Meta::CPAN
is heavily misused.
See L<Microsoft-related naming mess> for details.
C<KS_C_5601-1987> I<raw> encoding is available as C<kcs5601-raw>
with Encode. See L<Encode::KR> for details.
UTF-16 UTF-16BE UTF-16LE
are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that
=over 2
=item *
C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support
dump2file("$pfile.$seq", $dtext);
}
}
if ( ! $DEBUG ) {
1 while unlink ($sfile);
1 while unlink ($pfile);
}
}
}
# BOM Test
SKIP:{
my $pev = PerlIO::encoding->VERSION;
skip "PerlIO::encoding->VERSION = $pev <= 0.07 ", 6
unless ($pev >= 0.07 or $DEBUG);
my $file = File::Spec->catfile($dir,"jisx0208.utf");
open my $fh, "<:utf8", $file or die "$file : $!";
my $str = join('' => <$fh>);
close $fh;