BOM results from the CPAN

POD2-RU

should emit a warning and may abort parsing the document
altogether.

A document having more than one "=encoding" line should be
considered an error.  Pod processors may silently tolerate this if
the later "=encoding" lines merely duplicate the
first one (e.g., if there's a "=encoding utf8" line, and later on
another "=encoding utf8" line).  But Pod processors should complain if
there are contradictory "=encoding" lines in the same document
(e.g., if there is a "=encoding utf8" early in the document and
"=encoding big5" later).  Pod processors that recognize BOMs
may also complain if they see an "=encoding" line
that contradicts the BOM (e.g., if a document with a UTF-16LE
BOM has an "=encoding shiftjis" line).
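
The duplicate-versus-contradictory rule above can be sketched roughly as follows.  This is only an illustration: C<check_encoding_lines> is a hypothetical helper, not part of any Pod module, and it compares encoding names as plain lowercased strings, so it would wrongly treat aliases such as "utf8" and "UTF-8" as contradictory.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration, not a real Pod API):
# classify a document by the "=encoding" lines it contains.
sub check_encoding_lines {
    my ($pod) = @_;

    # Collect every declared encoding name, lowercased for comparison.
    my @seen = map { lc $_ } ($pod =~ /^=encoding[ \t]+(\S+)/mg);

    return 'none' unless @seen;

    my $first = $seen[0];
    return 'contradictory' if grep { $_ ne $first } @seen;

    # More than one line, but all identical: tolerable duplicates.
    return @seen > 1 ? 'duplicate' : 'ok';
}

print check_encoding_lines("=encoding utf8\n\n=head1 A\n\n=encoding utf8\n"), "\n";
print check_encoding_lines("=encoding utf8\n\n=head1 A\n\n=encoding big5\n"), "\n";
```

A processor following the text above might stay silent on 'duplicate' and complain on 'contradictory'.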

=back

If a Pod processor sees any command other than the ones listed
above (like "=head", or "=haed1", or "=stuff", or "=cuttlefish",
or "=w123"), that processor must by default treat this as an
error.  It must not process the paragraph beginning with that
command, must by default warn of this as an error, and may
abort the parse.  A Pod parser may allow a way for particular
applications to add to the above list of known commands, and to

lib/POD2/RU/perlpodspec.pod view on Meta::CPAN

Future versions of this specification may specify
how Pod can accept other encodings.  Presumably treatment of other
encodings in Pod parsing would be as in XML parsing: whatever the
encoding declared by a particular Pod file, content is to be
stored in memory as Unicode characters.

=item *

The well-known Unicode Byte Order Marks are as follows:  if the
file begins with the two literal byte values 0xFE 0xFF, this is
the BOM for big-endian UTF-16.  If the file begins with the two
literal byte values 0xFF 0xFE, this is the BOM for little-endian
UTF-16.  If the file begins with the three literal byte values
0xEF 0xBB 0xBF, this is the BOM for UTF-8.

=for comment
 use bytes; print map sprintf(" 0x%02X", ord $_), split '', "\x{feff}";
 0xEF 0xBB 0xBF

=for comment
 If toke.c is modified to support UTF-32, add mention of those here.
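
As a minimal sketch of the three BOM checks just listed, a parser could inspect the first bytes of the raw file contents like this.  The name C<sniff_bom> is hypothetical, and the input is assumed to be an undecoded byte string (for example, read from a filehandle opened with the C<:raw> layer).

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): return the encoding
# implied by a leading BOM, or undef if none of the three BOMs is present.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;   # three-byte UTF-8 BOM
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;       # big-endian UTF-16
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;       # little-endian UTF-16
    return;
}

for my $sample ("\xEF\xBB\xBF=pod\n", "\xFE\xFF\x00=", "\xFF\xFE=\x00", "=pod\n") {
    my $enc = sniff_bom($sample);
    print defined $enc ? $enc : 'no BOM', "\n";
}
```

The sketch deliberately ignores UTF-32, in line with the text above.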

=item *

A naive but sufficient heuristic for testing the first highbit
byte-sequence in a BOM-less file (whether in code or in Pod!), to see
whether that sequence is valid as UTF-8 (RFC 2279) is to check
whether the first byte in the sequence is in the range 0xC0 - 0xFD
I<and> whether the next byte is in the range
0x80 - 0xBF.  If so, the parser may conclude that this file is in
UTF-8, and all highbit sequences in the file should be assumed to
be UTF-8.  Otherwise the parser should treat the file as being
in Latin-1.  In the unlikely circumstance that the first highbit
sequence in a truly non-UTF-8 file happens to appear to be UTF-8, one
can cater to our heuristic (as well as any more intelligent heuristic)
by prefacing that line with a comment line containing a highbit

lib/POD2/RU/perlunicode.pod view on Meta::CPAN


=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines.  B<These are the only times when an explicit C<use utf8>
is needed.>  See L<utf8>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins with a Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding.  This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

lib/POD2/RU/perlunicode.pod view on Meta::CPAN


=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation.  The Unicode code for the desired character, in hexadecimal,
should be placed in the braces, after the C<U>. For instance, a smiley face is
C<\N{U+263A}>.
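
A short illustration of the C<\N{U+...}> notation, assuming a reasonably modern Perl; the variable name is arbitrary.

```perl
use strict;
use warnings;

# The smiley face from the text, written with the \N{U+...} notation.
my $smiley = "\N{U+263A}";

# It is a single character whose ordinal is the code point given in braces.
printf "length=%d ord=U+%04X\n", length($smiley), ord($smiley);

# chr() with the same code point yields an identical string.
print "matches chr(0x263A)\n" if $smiley eq chr(0x263A);
```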

Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
above.  For characters below 0x100 you may get byte semantics instead of
character semantics;  see L</The "Unicode Bug">.  On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC

lib/POD2/RU/perlunicode.pod view on Meta::CPAN

"open interchange".  See L</Non-character code points>.

=item *

UTF-EBCDIC

Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.

=item *

UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)

The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.

Like UTF-8, UTF-16 is a variable-width encoding, but where
UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
All code points occupy either 2 or 4 bytes in UTF-16: code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
points C<U+10000..U+10FFFF> in two 16-bit units.  The latter case is
using I<surrogates>, the first 16-bit unit being the I<high

lib/POD2/RU/perlunicode.pod view on Meta::CPAN


    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
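
To make the decoding formula above concrete, here is a small round-trip sketch.  The subroutine names are hypothetical, and the encoding direction is simply the arithmetic inverse of the formula shown above rather than something quoted from this document.

```perl
use strict;
use warnings;

# Hypothetical helper: split a code point in U+10000..U+10FFFF into a
# (high, low) surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    my $v  = $uni - 0x10000;           # 20-bit value
    my $hi = 0xD800 + ($v >> 10);      # top 10 bits
    my $lo = 0xDC00 + ($v & 0x3FF);    # bottom 10 bits
    return ($hi, $lo);
}

# The decoding formula from the text, wrapped in a subroutine.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);
printf "U+1F600 -> 0x%04X 0x%04X -> U+%X\n", $hi, $lo, from_surrogates($hi, $lo);
# prints "U+1F600 -> 0xD83D 0xDE00 -> U+1F600"
```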

Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
itself can be used for in-memory computations, but if storage or
transfer is required either UTF-16BE (big-endian) or UTF-16LE
(little-endian) encodings must be chosen.

This introduces another problem: what if you just know that your data
is UTF-16, but you don't know which endianness?  Byte Order Marks, or
BOMs, are a solution to this.  A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point C<U+FEFF> is the BOM.

The trick is that if you read a BOM, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
you will read the bytes C<0xFF 0xFE>.  (And if the originating platform
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)

The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in
big-endian format".
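
The byte sequences mentioned above can be reproduced with the core L<Encode> module, as in this small sketch.  It encodes the single character C<U+FEFF> under each encoding and prints the resulting bytes in hexadecimal.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode U+FEFF under each encoding and show the bytes that result.
for my $enc (qw(UTF-16BE UTF-16LE UTF-8)) {
    my $bytes = encode($enc, "\x{FEFF}");
    my $hex   = join ' ', map { sprintf('0x%02X', ord($_)) } split //, $bytes;
    print "$enc: $hex\n";
}
```

The explicit C<UTF-16BE> and C<UTF-16LE> encodings are used here because they fix the byte order themselves and do not prepend a BOM of their own.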

Surrogates have no meaning in Unicode outside their use in pairs to
represent other code points.  However, Perl allows them to be
represented individually internally, for example by saying
C<chr(0xD801)>, so that all code points, not just those valid for open
interchange, are representable.  Unicode does define semantics for
them, such as their General Category being "Cs".  But because their
use is somewhat dangerous,

lib/POD2/RU/perlunicode.pod view on Meta::CPAN

to do things like take the lower case of one, or match
case-insensitively, or to output them.  (But don't try this on Perls
before 5.14.)

=item *

UTF-32, UTF-32BE, UTF-32LE

The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  UTF-32 is a fixed-width encoding.  The BOM signatures are
C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.

=item *

UCS-2, UCS-4

Legacy, fixed-width encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
encoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates.  UCS-4 is a 32-bit encoding,
functionally identical to UTF-32 (the difference being that

lib/POD2/RU/perluniintro.pod view on Meta::CPAN

output these abstract numbers, the numbers must be I<encoded> or
I<serialised> somehow.  Unicode defines several I<character encoding
forms>, of which I<UTF-8> is perhaps the most popular.  UTF-8 is a
variable-length encoding that encodes Unicode characters as 1 to 4
bytes.  Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent).  ISO/IEC 10646 defines the UCS-2
and UCS-4 encoding forms.

For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.

=head2 Perl's Unicode Support

Starting from Perl v5.6.0, Perl has had the capacity to handle Unicode
natively.  Perl v5.8.0, however, is the first recommended release for
serious Unicode work.  The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
Perl v5.14.0 is the first release where Unicode support is
(almost) seamlessly integrable without some gotchas (the exception being

xt/04_podspell.t view on Meta::CPAN

readonly
recursed
recursing
reentrancy
reimplementation
righthand
shouldn
th
unclosed
xFF
BOM
BOMless
BOMs
Cn
DeMorgan
HIRAGANA
Hiragana
Kana
Linebreaking
NEL
Posix
Standardese
Unforcing
