POD2-RU

lib/POD2/RU/perlunicode.pod  view on Meta::CPAN


=encoding utf8

=head1 NAME

perlunicode - Unicode support in Perl

=head1 DESCRIPTION

=head2 Important Caveats

Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.

People who want to learn to use Unicode in Perl should probably read the
L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading this reference document.

Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.

=over 4

=item Safest if you "use feature 'unicode_strings'"

In order to preserve backward compatibility, Perl does not turn on
full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (It is automatically
selected if you use C<use 5.012> or higher.)
Failure to do this can
trigger unexpected surprises. See L</The "Unicode Bug"> below.

This pragma doesn't affect I/O. Nor does it change the internal
representation of strings, only their interpretation. There are still
several places where Unicode isn't fully supported, such as in
filenames.

=item Input and Output Layers

Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":encoding(utf8)" layer.  Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)"  layer.  See L<open>.

To indicate that Perl source itself is in UTF-8, use C<use utf8;>.

=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts

As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines.  B<These are the only times when an explicit C<use utf8>
is needed.>  See L<utf8>.

=item BOM-marked scripts and UTF-16 scripts autodetected

If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)

=item C<use encoding> needed to upgrade non-Latin-1 byte strings

By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding.  This happens because the first 256
codepoints in Unicode happen to agree with Latin-1.

See L</"Byte and Character Semantics"> for more details.

=back
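The first caveat in the list above can be illustrated with a minimal sketch.
It assumes an ASCII-based platform, no C<use locale> in scope, and a string
that Perl holds as native bytes; the variable names are only illustrative.

```perl
use strict;
use warnings;

# "\xE0" is LATIN SMALL LETTER A WITH GRAVE, held here as a single native byte.
my $s = "\xE0";

# Without the feature, code points 128..255 in a byte string have no case.
my $without = do { no feature 'unicode_strings'; uc $s };

# With the feature, the same string is interpreted with Unicode rules.
my $with = do { use feature 'unicode_strings'; uc $s };

printf "without unicode_strings: %02X\n", ord $without;
printf "with unicode_strings:    %02X\n", ord $with;
```

Only the interpretation of the string differs between the two blocks; the
string itself is not re-encoded.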

=head2 Byte and Character Semantics

Perl uses logically-wide characters to represent strings internally.

Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher).  (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)

For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs.  For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics.  For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.

When C<use locale> (but not C<use locale ':not_characters'>) is in
effect, Perl uses the semantics associated with the current locale.
(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope;
while C<use locale ':not_characters'> effectively also selects
C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode semantics for those greater than 255.  That means that non-ASCII
characters are undefined except for their
ordinal numbers.  This means that none have case (upper and lower), nor are any
a member of character classes, like C<[:alpha:]> or C<\w>.  (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)

This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data.  Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.

The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op.  See L<utf8>.

If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics.  This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens.  See L<encoding::warnings>.

Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden for Perl code.
See L<perluniintro> for more.
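As a rough illustration of the difference between the two semantics, the
following sketch counts the same string once in characters and once in
internal bytes, and then concatenates a byte string with a wide string.
The byte count assumes an ASCII-based platform where the internal encoding
is UTF-8; C<use bytes> appears here only to expose that internal detail.

```perl
use strict;
use warnings;

my $smiley = "\x{263A}";                         # one character above 255
my $chars  = length $smiley;                     # character semantics
my $bytes  = do { use bytes; length $smiley };   # internal byte count

# A byte string is implicitly upgraded as Latin-1 when it is joined
# to a string containing wide characters.
my $mixed = "\xE9" . $smiley;
my $first = ord substr($mixed, 0, 1);

print "characters: $chars, internal bytes: $bytes\n";
printf "mixed length: %d, first code point: U+%04X\n", length($mixed), $first;
```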

=head2 Effects of Character Semantics

Character semantics have the following effects:

=over 4

=item *

Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.

If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)

Unicode characters can also be added to a string by using the C<\N{U+...}>
notation.  The Unicode code for the desired character, in hexadecimal,
should be placed in the braces, after the C<U>. For instance, a smiley face is
C<\N{U+263A}>.

Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
above.  For characters below 0x100 you may get byte semantics instead of
character semantics;  see L</The "Unicode Bug">.  On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one, thus it is more portable to use
C<\N{U+...}> instead.

Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>.  This automatically loads the L<charnames>
module with the C<:full> and C<:short> options.  If you prefer different
options for this module, you can instead, before the C<\N{...}>,
explicitly load it with your desired options; for example,

   use charnames ':loose';

=item *

If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs.  Perl does not currently attempt to canonicalize variable
names.

=item *

Regular expressions match characters instead of bytes.  "." matches
a character instead of a byte.

=item *

Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database.  C<\w> can be used to match a Japanese
ideograph, for instance.

=item *

Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.

You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.

=item *

The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese.  In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character.  C<\X>
will match the entire sequence.
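Several of the effects described so far can be sketched together. This is
only an illustration: the variable names are hypothetical, and C<charnames>
is loaded explicitly so that the example does not rely on automatic loading.

```perl
use strict;
use warnings;
use charnames ':full';

# Three ways to write the same character.
my $by_code = "\N{U+263A}";
my $by_hex  = "\x{263A}";
my $by_name = "\N{WHITE SMILING FACE}";

# \w and \p{...} match by Unicode property, not by byte value.
my $ideograph = "\x{4E2D}";
my $is_word   = $ideograph =~ /\w/       ? 1 : 0;
my $is_han    = $ideograph =~ /\p{Han}/  ? 1 : 0;

# "G" followed by COMBINING ACUTE ACCENT, then "o": three characters,
# but only two extended grapheme clusters.
my $accented = "G\x{301}o";
my @clusters = $accented =~ /(\X)/g;

printf "same character: %s\n", ($by_code eq $by_hex && $by_hex eq $by_name) ? "yes" : "no";
printf "word: %d, Han: %d\n", $is_word, $is_han;
printf "characters: %d, clusters: %d\n", length($accented), scalar @clusters;
```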

=item *

UTF-8

UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
encoding. For ASCII (and we really do mean 7-bit ASCII, not another
8-bit encoding), UTF-8 is transparent.

The following table is from Unicode 3.2.

 Code Points            1st Byte  2nd Byte  3rd Byte 4th Byte

   U+0000..U+007F       00..7F
   U+0080..U+07FF     * C2..DF    80..BF
   U+0800..U+0FFF       E0      * A0..BF    80..BF
   U+1000..U+CFFF       E1..EC    80..BF    80..BF
   U+D000..U+D7FF       ED        80..9F    80..BF
   U+D800..U+DFFF       +++++ utf16 surrogates, not legal utf8 +++++
   U+E000..U+FFFF       EE..EF    80..BF    80..BF
  U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
  U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
 U+100000..U+10FFFF     F4        80..8F    80..BF    80..BF

Note the gaps marked by "*" before several of the byte entries above.  These are
caused by legal UTF-8 avoiding non-shortest encodings: it is technically
possible to UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always be used
(and that is what Perl does).

Another way to look at it is via bits:

                Code Points  1st Byte  2nd Byte  3rd Byte  4th Byte

                   0aaaaaaa  0aaaaaaa
           00000bbbbbaaaaaa  110bbbbb  10aaaaaa
           ccccbbbbbbaaaaaa  1110cccc  10bbbbbb  10aaaaaa
 00000dddccccccbbbbbbaaaaaa  11110ddd  10cccccc  10bbbbbb  10aaaaaa

As you can see, the continuation bytes all begin with "10", and the
leading bits of the start byte tell how many bytes there are in the
encoded character.

The original UTF-8 specification allowed up to 6 bytes, to allow
encoding of numbers up to 0x7FFF_FFFF.  Perl continues to allow those,
and has extended that up to 13 bytes to encode code points up to what
can fit in a 64-bit word.  However, Perl will warn if you output any of
these as being non-portable; and under strict UTF-8 input protocols,
they are forbidden.

The Unicode non-character code points are also disallowed in UTF-8 in
"open interchange".  See L</Non-character code points>.

=item *

UTF-EBCDIC

Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.

=item *

UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)

The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.

Like UTF-8, UTF-16 is a variable-width encoding, but where
UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
All code points occupy either 2 or 4 bytes in UTF-16: code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
points C<U+10000..U+10FFFF> in two 16-bit units.  The latter case is
using I<surrogates>, the first 16-bit unit being the I<high
surrogate>, and the second being the I<low surrogate>.

Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units.  The I<high
surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
are the range C<U+DC00..U+DFFF>.  The surrogate encoding is

    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;

and the decoding is

    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);

Because of the 16-bitness, UTF-16 is byte-order dependent.  UTF-16
itself can be used for in-memory computations, but if storage or
transfer is required either UTF-16BE (big-endian) or UTF-16LE
(little-endian) encodings must be chosen.

This introduces another problem: what if you just know that your data
is UTF-16, but you don't know which endianness?  Byte Order Marks, or
BOMs, are a solution to this.  A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point C<U+FEFF> is the BOM.

The trick is that if you read a BOM, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
you will read the bytes C<0xFF 0xFE>.  (And if the originating platform
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)

The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".

Surrogates have no meaning in Unicode outside their use in pairs to
represent other code points.  However, Perl allows them to be
represented individually internally, for example by saying
C<chr(0xD801)>, so that all code points, not just those valid for open
interchange, are
representable.  Unicode does define semantics for them, such as their
General Category being "Cs".  But because their use is somewhat dangerous,
Perl will warn (using the warning category "surrogate", which is a
sub-category of "utf8") if an attempt is made
to do things like take the lower case of one, or match
case-insensitively, or to output them.  (But don't try this on Perls
before 5.14.)

=item *

UTF-32, UTF-32BE, UTF-32LE

The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed.  UTF-32 is a fixed-width encoding.  The BOM signatures are
C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.

=item *

UCS-2, UCS-4

Legacy, fixed-width encodings defined by the ISO 10646 standard.  UCS-2 is a 16-bit
encoding.  Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates.  UCS-4 is a 32-bit encoding,
functionally identical to UTF-32 (the difference being that
UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).

=item *

UTF-7

A seven-bit safe (non-eight-bit) encoding, which is useful if the
transport or storage is not eight-bit safe.  Defined by RFC 2152.

=back
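The UTF-8 byte counts and the surrogate arithmetic from the list above can
be checked with a short sketch. It uses the core L<Encode> module; the
helper names C<to_surrogates> and C<from_surrogates> are hypothetical and
exist only for this illustration. Note the C<int()>: Perl's C</> is
floating-point division, so the quotient must be truncated.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of UTF-8 bytes needed for sample code points from each row group.
my %utf8_len;
for my $cp (0x41, 0xE9, 0x263A, 0x1F600) {
    my $octets = encode('UTF-8', chr $cp);
    $utf8_len{$cp} = length($octets);
    printf "U+%04X -> %d byte(s): %s\n", $cp, length($octets),
        join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

# Surrogate encoding and decoding, following the formulas in the text.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    return (int($offset / 0x400) + 0xD800, $offset % 0x400 + 0xDC00);
}

sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my ($hi, $lo) = to_surrogates(0x1F600);

# Cross-check against the 16-bit units that Encode produces for UTF-16BE.
my @units = unpack 'n*', encode('UTF-16BE', chr 0x1F600);

printf "computed: %04X %04X, encoded: %04X %04X\n", $hi, $lo, @units;
printf "decoded back: U+%04X\n", from_surrogates($hi, $lo);
```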

=head2 Non-character code points

66 code points are set aside in Unicode as "non-character code points".
These all have the Unassigned (Cn) General Category, and they never will
be assigned.  These are never supposed to be in legal Unicode input
streams, so that code can use them as sentinels that can be mixed in
with character data, and they always will be distinguishable from that data.
To keep them out of Perl input streams, strict UTF-8 should be
specified, such as by using the layer C<:encoding('UTF-8')>.  The
non-character code points are the 32 between U+FDD0 and U+FDEF, and the
34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
Some people are under the mistaken impression that these are "illegal",
but that is not true.  An application or cooperating set of applications
can legally use them at will internally; but these code points are
"illegal for open interchange".  Therefore, Perl will not accept these
from input streams unless lax rules are being used, and will warn
(using the warning category "nonchar", which is a sub-category of "utf8") if
an attempt is made to output them.
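The count of 66 follows directly from the two ranges described above. A
minimal sketch that builds the list numerically, without creating the
characters themselves (the variable names are illustrative):

```perl
use strict;
use warnings;

# The contiguous block U+FDD0..U+FDEF contributes 32 code points.
my @block = (0xFDD0 .. 0xFDEF);

# The last two code points of each of the 17 planes contribute 34 more.
my @plane_ends = map {
    my $plane = $_ << 16;
    ($plane | 0xFFFE, $plane | 0xFFFF);
} 0 .. 16;

my @nonchars = (@block, @plane_ends);

printf "block: %d, plane ends: %d, total: %d\n",
    scalar @block, scalar @plane_ends, scalar @nonchars;
printf "first: U+%04X, last: U+%04X\n", $nonchars[0], $nonchars[-1];
```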

=head2 Beyond Unicode code points

The maximum Unicode code point is U+10FFFF.  But Perl accepts code
points up to the maximum permissible unsigned number available on the
platform.  However, Perl will not accept these from input streams unless
lax rules are being used, and will warn (using the warning category
"non_unicode", which is a sub-category of "utf8") if an attempt is made to
operate on or output them.  For example, C<uc(chr(0x11_0000))> will generate
this warning, returning the input parameter as its result, as the upper
case of every non-Unicode code point is the code point itself.
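
A small sketch of this behavior, assuming Perl 5.14 or later. Warnings are
collected through a C<__WARN__> handler so that they can be inspected; the
exact wording of the warning is not assumed here.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go straight to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $beyond = chr(0x11_0000);   # one past the last Unicode code point
my $upper  = uc $beyond;       # case mapping leaves the character unchanged

print "uc() returned its input unchanged\n" if $upper eq $beyond;
print "warning seen: $_" for @warnings;
```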

=head2 Security Implications of Unicode

Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
Also, note the following:

=over 4

=item *


