=item *

When BE or LE is explicitly stated in the name of the encoding, a BOM
is simply treated as a normal character (ZERO WIDTH NO-BREAK SPACE).

=item *

When BE or LE is omitted during decode(), it checks whether a BOM is
at the beginning of the string; if one is found, the endianness is set
to what the BOM says.

=item *

Default Byte Order

When no BOM is found, Encode 2.76 and below croaked.  Since Encode
2.77, it falls back to BE, in accordance with RFC 2781 and the Unicode
Standard version 8.0.

=item *

When BE or LE is omitted during encode(), it returns a BE-encoded
string with BOM prepended.  So when you want to encode a whole text
file, make sure you encode() the whole text at once, not line by
line; otherwise every line, not just the file, will have a BOM
prepended.

=item *

C<UCS-2> is an exception.  Unlike others, this is an alias of UCS-2BE.
UCS-2 is already registered by IANA and others that way.

=back
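
The BOM rules listed above can be checked with a few lines of Perl.
This is only a minimal sketch; the sample strings and variable names
are illustrative and not part of Encode itself.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "A";

# Endianness omitted: encode() returns big-endian bytes with a BOM (FE FF).
my $with_bom = encode("UTF-16", $text);

# Endianness stated: no BOM is written.
my $be = encode("UTF-16BE", $text);
my $le = encode("UTF-16LE", $text);

printf "UTF-16   %s\n", unpack("H*", $with_bom);
printf "UTF-16BE %s\n", unpack("H*", $be);
printf "UTF-16LE %s\n", unpack("H*", $le);

# Endianness omitted on decode(): the BOM (FF FE here) selects
# little-endian and is not part of the result.
my $detected = decode("UTF-16", "\xff\xfe" . $le);

# Endianness stated on decode(): the BOM is kept as the ordinary
# character U+FEFF in front of the text.
my $kept = decode("UTF-16LE", "\xff\xfe" . $le);
printf "explicit LE keeps %d characters, first is U+%04X\n",
    length($kept), ord($kept);

# Encoding line by line prepends a BOM to every line;
# encoding the whole text at once prepends only one.
my @lines    = ("a\n", "b\n");
my $per_line = join "", map { encode("UTF-16", $_) } @lines;
my $at_once  = encode("UTF-16", join "", @lines);

# Count the 16-bit big-endian units equal to the BOM value.
my $boms_per_line = grep { $_ == 0xFEFF } unpack("n*", $per_line);
my $boms_at_once  = grep { $_ == 0xFEFF } unpack("n*", $at_once);
print "BOMs when encoding per line: $boms_per_line\n";
print "BOMs when encoding at once:  $boms_at_once\n";
```

When writing a file, encoding the joined text once (or using an
explicit BE or LE name for every line after the first) avoids the
stray BOMs counted in the last step.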

=head1 Surrogate Pairs

To say the least, surrogate pairs were the biggest mistake of the
Unicode Consortium.  But according to the late Douglas Adams in I<The
Hitchhiker's Guide to the Galaxy> Trilogy, C<In the beginning the
Universe was created. This has made a lot of people very angry and
been widely regarded as a bad move>.  Their mistake was not of this
magnitude so let's forgive them.

(I don't dare make any comparison between the Unicode Consortium and
the Vogons here ;)  Or, comparing Encode to Babel Fish is completely
appropriate -- if only you could stick this into your ear :)

Surrogate pairs were born when the Unicode Consortium finally
admitted that 16 bits were not big enough to hold all the world's
character repertoires.  But they had already made UCS-2 16-bit.  What
do we do?

Back then, the range 0xD800-0xDFFF was not allocated.  Let's split
that range in half and use the first half to represent the C<upper
half of a character> and the second half to represent the C<lower
half of a character>.  That way, you can represent 1024 * 1024 =
1048576 more characters.  Now we can store character ranges up to
\x{10ffff} even with 16-bit encodings.  This pair of half-characters is
now called a I<surrogate pair> and UTF-16 is the name of the encoding
that embraces them.

Here is a formula to ensurrogate a Unicode character \x{10000} and
above:

  $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
  $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

And to desurrogate:

  $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

Note this move has made \x{D800}-\x{DFFF} into a forbidden zone but
perl does not prohibit the use of characters within this range.  To perl,
every one of \x{0000_0000} up to \x{ffff_ffff} (*) is I<a character>.

  (*) or \x{ffff_ffff_ffff_ffff} if your perl is compiled with 64-bit
  integer support!
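
The ensurrogate and desurrogate formulas in this section can be tried
out as a small round trip.  The sketch below is only an illustration:
the helper names C<ensurrogate> and C<desurrogate> are hypothetical
Perl subs written for this example, and the result is cross-checked
against what encode() produces for UTF-16BE.  Note the int(), since
Perl's C</> is a floating-point division.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: split a code point above U+FFFF into (hi, lo).
sub ensurrogate {
    my ($uni) = @_;
    die sprintf("U+%04X needs no surrogate pair\n", $uni)
        unless $uni >= 0x10000 && $uni <= 0x10FFFF;
    my $offset = $uni - 0x10000;
    my $hi = int($offset / 0x400) + 0xD800;   # upper 10 bits
    my $lo = $offset % 0x400 + 0xDC00;        # lower 10 bits
    return ($hi, $lo);
}

# Hypothetical helper: join a (hi, lo) pair back into one code point.
sub desurrogate {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my $uni = 0x1F600;
my ($hi, $lo) = ensurrogate($uni);
printf "U+%04X -> %04X %04X\n", $uni, $hi, $lo;

my $back = desurrogate($hi, $lo);
printf "%04X %04X -> U+%04X\n", $hi, $lo, $back;

# Cross-check: the pair packed as two big-endian 16-bit units should
# equal the bytes Encode produces for the same character.
my $same = encode("UTF-16BE", chr($uni)) eq pack("n2", $hi, $lo);
print "matches Encode's UTF-16BE output: ", ($same ? "yes" : "no"), "\n";
```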

=head1 Error Checking

Unlike most encodings, which accept various ways to handle errors,
Unicode encodings simply croak.

  % perl -MEncode -e'$_ = "\xfe\xff\xd8\xd9\xda\xdb\0\n"' \
         -e'Encode::from_to($_, "utf16","shift_jis", 0); print'
  UTF-16:Malformed LO surrogate d8d9 at /path/to/Encode.pm line 184.
  % perl -MEncode -e'$a = "BOM missing"' \
         -e' Encode::from_to($a, "utf16", "shift_jis", 0); print'
  UTF-16:Unrecognised BOM 424f at /path/to/Encode.pm line 184.

Unlike other encodings where mappings are not one-to-one against
Unicode, UTFs are supposed to map 100% against one another.  So Encode
is more strict on UTFs.

Consider that "division by zero" of Encode :)
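
Because malformed input can croak, a caller that cannot afford to die
may want to trap the error.  The following is a minimal sketch, not a
recommended idiom: it assumes the check value C<Encode::FB_CROAK> is
passed explicitly, since how a check value of 0 is treated may differ
between Encode versions.  The byte strings are made up for the
example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# BOM (FE FF, big-endian), then D8D9 followed by DADB.  DADB is not in
# the low surrogate range DC00-DFFF, so the pair is malformed.
my $bad = "\xfe\xff\xd8\xd9\xda\xdb\x00\x0a";

# BOM, then a well-formed pair D83D DE00, which stands for U+1F600.
my $good = "\xfe\xff\xd8\x3d\xde\x00";

# decode() may modify its input when a check value is given,
# so hand it a copy.
my $copy  = $bad;
my $text  = eval { decode("UTF-16", $copy, Encode::FB_CROAK) };
my $error = $@;

if (defined $text) {
    print "decoded ", length($text), " characters\n";
}
else {
    print "decode failed: $error";
}

# The well-formed input decodes to a single character.
my $ok = decode("UTF-16", $good);
printf "good input: %d character, U+%04X\n", length($ok), ord($ok);
```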

=head1 SEE ALSO

L<Encode>, L<Encode::Unicode::UTF7>, L<https://www.unicode.org/glossary/>,
L<https://www.unicode.org/faq/utf_bom.html>,

RFC 2781 L<http://www.ietf.org/rfc/rfc2781.txt>,

The whole Unicode standard L<https://www.unicode.org/standard/standard.html>

Ch. 6 pp. 275 of C<Programming Perl (3rd Edition)>
by Tom Christiansen, brian d foy & Larry Wall;
O'Reilly & Associates; ISBN 978-0-596-00492-7

=cut
