Encode
=item *
When BE or LE is explicitly stated as the name of encoding, BOM is
simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).
=item *
When BE or LE is omitted during decode(), it checks if BOM is at the
beginning of the string; if one is found, the endianness is set to
what the BOM says.
=item *
Default Byte Order
When no BOM is found, Encode 2.76 and below croaked. Since Encode
2.77, it falls back to BE, in accordance with RFC 2781 and the Unicode
Standard version 8.0.
=item *
When BE or LE is omitted during encode(), it returns a BE-encoded
string with BOM prepended. So when you want to encode a whole text
file, make sure you encode() the whole text at once, not line by line;
otherwise each line, rather than the file, will have a BOM prepended.
=item *
C<UCS-2> is an exception. Unlike others, this is an alias of UCS-2BE.
UCS-2 is already registered by IANA and others that way.
=back
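
The rules above can be tried out with a short script. This is only a sketch: the sample strings and the C<hex_octets> helper are made up for illustration and are not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Illustrative helper (not part of Encode): show a byte string as hex octets.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

my $text = "A";

# Byte order stated explicitly: no BOM is written.
my $be = encode('UTF-16BE', $text);
my $le = encode('UTF-16LE', $text);

# Byte order omitted: big-endian output with a BOM (FE FF) prepended.
my $with_bom = encode('UTF-16', $text);

# decode() with the byte order omitted follows the BOM it finds.
my $from_le = decode('UTF-16', "\xFF\xFE\x41\x00");

# With the byte order stated, a leading BOM survives as the character U+FEFF.
my $kept_bom = decode('UTF-16BE', "\xFE\xFF\x00\x41");

# Encoding line by line prepends a BOM to every line ...
my @lines    = ("a\n", "b\n");
my $per_line = join '', map { encode('UTF-16', $_) } @lines;
# ... while encoding the whole text at once yields a single BOM.
my $at_once  = encode('UTF-16', join('', @lines));

print "UTF-16BE     : ", hex_octets($be),       "\n";
print "UTF-16LE     : ", hex_octets($le),       "\n";
print "UTF-16       : ", hex_octets($with_bom), "\n";
print "line by line : ", hex_octets($per_line), "\n";
print "all at once  : ", hex_octets($at_once),  "\n";
```

The last two lines of output make the difference visible: the line-by-line result contains C<FE FF> twice, the all-at-once result only at the start.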
=head1 Surrogate Pairs
To say the least, surrogate pairs were the biggest mistake of the
Unicode Consortium. But according to the late Douglas Adams in I<The
Hitchhiker's Guide to the Galaxy> Trilogy, C<In the beginning the
Universe was created. This has made a lot of people very angry and
been widely regarded as a bad move>. Their mistake was not of this
magnitude so let's forgive them.
(I don't dare make any comparison between the Unicode Consortium and
the Vogons here ;) Or, comparing Encode to Babel Fish is completely
appropriate -- if you can only stick this into your ear :)
Surrogate pairs were born when the Unicode Consortium finally
admitted that 16 bits were not big enough to hold all the world's
character repertoires. But they had already made UCS-2 16-bit. What
do we do?
Back then, the range 0xD800-0xDFFF was not allocated. Let's split
that range in half and use the first half to represent the C<upper
half of a character> and the second half to represent the C<lower
half of a character>. That way, you can represent 1024 * 1024 =
1048576 more characters. Now we can store character ranges up to
\x{10ffff} even with 16-bit encodings. This pair of half-characters is
now called a I<surrogate pair> and UTF-16 is the name of the encoding
that embraces them.
Here is a formula to ensurrogate a Unicode character \x{10000} and
above (int() is needed because perl's C</> is not integer division):
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
And to desurrogate:
    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
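
As a worked example, the two formulas can be wrapped in a pair of small subs and checked by a round trip. The sub names C<ensurrogate> and C<desurrogate> and the range check are choices made for this sketch, not a documented interface.

```perl
use strict;
use warnings;

# Sketch: split a code point in U+10000..U+10FFFF into (high, low) surrogates.
sub ensurrogate {
    my ($uni) = @_;
    die sprintf("U+%04X is outside U+10000..U+10FFFF\n", $uni)
        if $uni < 0x10000 || $uni > 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = int($offset / 0x400) + 0xD800;   # top 10 bits of the offset
    my $lo = $offset % 0x400 + 0xDC00;        # bottom 10 bits of the offset
    return ($hi, $lo);
}

# Sketch: combine a (high, low) surrogate pair back into one code point.
sub desurrogate {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = ensurrogate(0x1F600);
printf "U+1F600 -> %04X %04X\n", $hi, $lo;                 # D83D DE00
printf "and back -> U+%04X\n",   desurrogate($hi, $lo);    # U+1F600
```

For U+1F600 the offset is 0xF600, whose top ten bits are 0x3D and bottom ten bits are 0x200, which gives the pair D83D DE00.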
Note this move has made \x{D800}-\x{DFFF} into a forbidden zone but
perl does not prohibit the use of characters within this range. To perl,
every one of \x{0000_0000} up to \x{ffff_ffff} (*) is I<a character>.
(*) or \x{ffff_ffff_ffff_ffff} if your perl is compiled with 64-bit
integer support!
=head1 Error Checking
Unlike most encodings which accept various ways to handle errors,
Unicode encodings simply croak.
    % perl -MEncode -e'$_ = "\xfe\xff\xd8\xd9\xda\xdb\0\n"' \
           -e'Encode::from_to($_, "utf16","shift_jis", 0); print'
    UTF-16:Malformed LO surrogate d8d9 at /path/to/Encode.pm line 184.
    % perl -MEncode -e'$a = "BOM missing"' \
           -e'Encode::from_to($a, "utf16", "shift_jis", 0); print'
    UTF-16:Unrecognised BOM 424f at /path/to/Encode.pm line 184.
Unlike other encodings where mappings are not one-to-one against
Unicode, UTFs are supposed to map 100% against one another. So Encode
is more strict on UTFs.
Consider that the "division by zero" of Encode :)
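
Since a croak is just a perl exception, a caller who would rather not die can trap it with C<eval>. The following is a minimal sketch: the helper name C<try_decode_utf16be> is hypothetical, and C<FB_CROAK> is passed explicitly so that the example does not depend on how a particular Encode version treats a check value of 0.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns (string, undef) on success
# or (undef, error message) when decode() croaks.
sub try_decode_utf16be {
    my ($octets) = @_;
    my $str = eval {
        # LEAVE_SRC keeps decode() from modifying $octets in place.
        decode('UTF-16BE', $octets, FB_CROAK() | LEAVE_SRC());
    };
    return defined $str ? ($str, undef) : (undef, $@);
}

# A high surrogate (D8D9) followed by something that is not a low surrogate.
my ($bad, $error) = try_decode_utf16be("\xD8\xD9\xDA\xDB");
print "malformed input rejected: $error" if defined $error;

# A well-formed surrogate pair (D83D DE00) decodes to a single character.
my ($good) = try_decode_utf16be("\xD8\x3D\xDE\x00");
printf "decoded U+%04X\n", ord($good) if defined $good;
```

Wrapping the call this way keeps the strictness of the UTF codecs while letting the program decide what to do with bad input.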
=head1 SEE ALSO
L<Encode>, L<Encode::Unicode::UTF7>, L<https://www.unicode.org/glossary/>,
L<https://www.unicode.org/faq/utf_bom.html>,
RFC 2781 L<http://www.ietf.org/rfc/rfc2781.txt>,
The whole Unicode standard L<https://www.unicode.org/standard/standard.html>
Ch. 6, p. 275 of C<Programming Perl (4th Edition)>
by Tom Christiansen, brian d foy & Larry Wall;
O'Reilly & Associates; ISBN 978-0-596-00492-7
=cut