POD2-RU
lib/POD2/RU/perlunicode.pod view on Meta::CPAN
=encoding utf8
=head1 NAME
perlunicode - Unicode support in Perl
=head1 DESCRIPTION
=head2 Important Caveats
Unicode support is an extensive requirement. While Perl does not
implement the Unicode standard or the accompanying technical reports
from cover to cover, Perl does support many Unicode features.
People who want to learn to use Unicode in Perl should probably read the
L<Perl Unicode tutorial, perlunitut|perlunitut> and
L<perluniintro> before reading this reference document.
Also, the use of Unicode may present security issues that aren't obvious.
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
=over 4
=item Safest if you "use feature 'unicode_strings'"
In order to preserve backward compatibility, Perl does not turn
on full internal Unicode support unless the pragma
C<use feature 'unicode_strings'> is specified. (This is automatically
selected if you use C<use 5.012> or higher.) Failure to do this can
trigger unexpected surprises. See L</The "Unicode Bug"> below.
This pragma doesn't affect I/O, and it doesn't change the internal
representation of strings, only their interpretation. There are still
several places where Unicode isn't fully supported, such as in
filenames.
=item Input and Output Layers
Perl knows when a filehandle uses Perl's internal Unicode encodings
(UTF-8, or UTF-EBCDIC if in EBCDIC) if the filehandle is opened with
the ":encoding(utf8)" layer. Other encodings can be converted to Perl's
encoding on input or from Perl's encoding on output by use of the
":encoding(...)" layer. See L<open>.
To indicate that Perl source itself is in UTF-8, use C<use utf8;>.
=item C<use utf8> still needed to enable UTF-8/UTF-EBCDIC in scripts
As a compatibility measure, the C<use utf8> pragma must be explicitly
included to enable recognition of UTF-8 in the Perl scripts themselves
(in string or regular expression literals, or in identifier names) on
ASCII-based machines or to recognize UTF-EBCDIC on EBCDIC-based
machines. B<These are the only times when an explicit C<use utf8>
is needed.> See L<utf8>.
=item BOM-marked scripts and UTF-16 scripts autodetected
If a Perl script begins marked with the Unicode BOM (UTF-16LE, UTF-16BE,
or UTF-8), or if the script looks like non-BOM-marked UTF-16 of either
endianness, Perl will correctly read in the script as Unicode.
(BOMless UTF-8 cannot be effectively recognized or differentiated from
ISO 8859-1 or other eight-bit encodings.)
=item C<use encoding> needed to upgrade non-Latin-1 byte strings
By default, there is a fundamental asymmetry in Perl's Unicode model:
implicit upgrading from byte strings to Unicode strings assumes that
they were encoded in I<ISO 8859-1 (Latin-1)>, but Unicode strings are
downgraded with UTF-8 encoding. This happens because the first 256
code points in Unicode happen to agree with Latin-1.
See L</"Byte and Character Semantics"> for more details.
=back
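As an illustration of the C<:encoding(...)> layer described in the list above, here is a minimal sketch that reads the same three bytes twice, once without a layer and once through C<:encoding(UTF-8)>. It uses an in-memory filehandle so that it is self-contained; the variable names are made up for the example, and a real program would normally open a file instead.

```perl
use strict;
use warnings;

# The three bytes E2 98 BA are the UTF-8 encoding of U+263A.
my $bytes = "\xE2\x98\xBA\n";

# Without a layer, each byte is read as one character.
open(my $raw, '<', \$bytes) or die "open failed: $!";
my $raw_line = <$raw>;
close $raw;

# Through :encoding(UTF-8), the bytes are decoded into characters.
open(my $dec, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my $dec_line = <$dec>;
close $dec;

chomp($raw_line, $dec_line);
printf "raw: %d characters, decoded: %d character (U+%04X)\n",
    length($raw_line), length($dec_line), ord($dec_line);
```

The same layer syntax can be given to C<binmode> on a handle that is already open.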
=head2 Byte and Character Semantics
Perl uses logically-wide characters to represent strings internally.
Starting in Perl 5.14, Perl-level operations work with
characters rather than bytes within the scope of a
C<L<use feature 'unicode_strings'|feature>> (or equivalently
C<use 5.012> or higher). (This is not true if bytes have been
explicitly requested by C<L<use bytes|bytes>>, nor necessarily true
for interactions with the platform's operating system.)
For earlier Perls, and when C<unicode_strings> is not in effect, Perl
provides a fairly safe environment that can handle both types of
semantics in programs. For operations where Perl can unambiguously
decide that the input data are characters, Perl switches to character
semantics. For operations where this determination cannot be made
without additional information from the user, Perl decides in favor of
compatibility and chooses to use byte semantics.
When C<use locale> (but not C<use locale ':not_characters'>) is in
effect, Perl uses the semantics associated with the current locale.
(C<use locale> overrides C<use feature 'unicode_strings'> in the same scope;
while C<use locale ':not_characters'> effectively also selects
C<use feature 'unicode_strings'> in its scope; see L<perllocale>.)
Otherwise, Perl uses the platform's native
byte semantics for characters whose code points are less than 256, and
Unicode semantics for those greater than 255. That means that non-ASCII
characters below 256 have no defined properties other than their
ordinal numbers: none has case (upper or lower), and none is
a member of character classes like C<[:alpha:]> or C<\w>. (But all do belong
to the C<\W> class or the Perl regular expression extension C<[:^alpha:]>.)
This behavior preserves compatibility with earlier versions of Perl,
which allowed byte semantics in Perl operations only if
none of the program's inputs were marked as being a source of Unicode
character data. Such data may come from filehandles, from calls to
external programs, from information provided by the system (such as %ENV),
or from literals and constants in the source text.
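The difference between native byte semantics and Unicode semantics for code points below 256 can be seen with a single character. The sketch below assumes an ASCII-based platform, Perl 5.14 or later, and no C<use locale> in scope; the variable names are illustrative only.

```perl
use strict;
use warnings;

# U+00E9 (e with acute accent): a code point below 256, held in a byte string.
my $e_acute = "\xE9";

my ($plain_word, $plain_uc);
{
    no feature 'unicode_strings';       # native byte semantics
    $plain_word = $e_acute =~ /\w/ ? 1 : 0;
    $plain_uc   = uc $e_acute;
}

my ($uni_word, $uni_uc);
{
    use feature 'unicode_strings';      # Unicode semantics
    $uni_word = $e_acute =~ /\w/ ? 1 : 0;
    $uni_uc   = uc $e_acute;
}

printf "without unicode_strings: \\w=%d uc=0x%02X\n", $plain_word, ord($plain_uc);
printf "with unicode_strings:    \\w=%d uc=0x%02X\n", $uni_word,   ord($uni_uc);
```

Under byte semantics the character matches C<\W> and has no upper case; under C<unicode_strings> it is a word character whose upper case is U+00C9.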
The C<utf8> pragma is primarily a compatibility device that enables
recognition of UTF-(8|EBCDIC) in literals encountered by the parser.
Note that this pragma is only required while Perl defaults to byte
semantics; when character semantics become the default, this pragma
may become a no-op. See L<utf8>.
If strings operating under byte semantics and strings with Unicode
character data are concatenated, the new string will have
character semantics. This can cause surprises: See L</BUGS>, below.
You can choose to be warned when this happens. See L<encoding::warnings>.
Under character semantics, many operations that formerly operated on
bytes now operate on characters. A character in Perl is
logically just a number ranging from 0 to 2**31 or so. Larger
characters may encode into longer sequences of bytes internally, but
this internal detail is mostly hidden for Perl code.
See L<perluniintro> for more.
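A short sketch of the distinction between characters and their encoded bytes, and of what concatenation does to a byte string. It relies on the core L<Encode> module; the variable names are chosen for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $smiley = "\x{263A}";                 # one character, U+263A
my $octets = encode('UTF-8', $smiley);   # its UTF-8 encoding as a byte string

my $latin1 = "\xE9";                     # byte string holding code point 0xE9
my $mixed  = $latin1 . "\x{100}";        # concatenated with a wide character

printf "characters: %d, UTF-8 bytes: %d\n", length($smiley), length($octets);
printf "mixed string: %d characters, first is U+%04X\n",
    length($mixed), ord($mixed);
```

The concatenated string has character semantics, and the code point that came from the byte string is unchanged even though its internal storage may have grown.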
=head2 Effects of Character Semantics
Character semantics have the following effects:
=over 4
=item *
Strings--including hash keys--and regular expression patterns may
contain characters that have an ordinal value larger than 255.
If you use a Unicode editor to edit your program, Unicode characters may
occur directly within the literal strings in UTF-8 encoding, or UTF-16.
(The former requires a BOM or C<use utf8>, the latter requires a BOM.)
Unicode characters can also be added to a string by using the C<\N{U+...}>
notation. The Unicode code for the desired character, in hexadecimal,
should be placed in the braces, after the C<U>. For instance, a smiley face is
C<\N{U+263A}>.
Alternatively, you can use the C<\x{...}> notation for characters 0x100 and
above. For characters below 0x100 you may get byte semantics instead of
character semantics; see L</The "Unicode Bug">. On EBCDIC machines there is
the additional problem that the value for such characters gives the EBCDIC
character rather than the Unicode one, thus it is more portable to use
C<\N{U+...}> instead.
Additionally, you can use the C<\N{...}> notation and put the official
Unicode character name within the braces, such as
C<\N{WHITE SMILING FACE}>. This automatically loads the L<charnames>
module with the C<:full> and C<:short> options. If you prefer different
options for this module, you can instead, before the C<\N{...}>,
explicitly load it with your desired options; for example,
    use charnames ':loose';
=item *
If an appropriate L<encoding> is specified, identifiers within the
Perl script may contain Unicode alphanumeric characters, including
ideographs. Perl does not currently attempt to canonicalize variable
names.
=item *
Regular expressions match characters instead of bytes. "." matches
a character instead of a byte.
=item *
Bracketed character classes in regular expressions match characters instead of
bytes and match against the character properties specified in the
Unicode properties database. C<\w> can be used to match a Japanese
ideograph, for instance.
=item *
Named Unicode properties, scripts, and block ranges may be used (like bracketed
character classes) by using the C<\p{}> "matches property" construct and
the C<\P{}> negation, "doesn't match property".
See L</"Unicode Character Properties"> for more details.
You can define your own character properties and use them
in the regular expression with the C<\p{}> or C<\P{}> construct.
See L</"User-Defined Character Properties"> for more details.
=item *
The special pattern C<\X> matches a logical character, an "extended grapheme
cluster" in Standardese. In Unicode what appears to the user to be a single
character, for example an accented C<G>, may in fact be composed of a sequence
of characters, in this case a C<G> followed by an accent character. C<\X>
will match the entire sequence.
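Several of the effects listed above can be checked with a few lines. This is a minimal sketch with illustrative variable names: it builds an accented C<G> from two code points with C<\N{U+...}>, counts characters and grapheme clusters, and tests an ideograph against C<\w> and C<\p{Han}>.

```perl
use strict;
use warnings;

# "G" followed by U+0301 COMBINING ACUTE ACCENT: two characters, one cluster.
my $accented_g = "G\N{U+0301}";
# U+65E5, a CJK ideograph.
my $ideograph  = "\N{U+65E5}";

my $char_count = length($accented_g);
my @clusters   = $accented_g =~ /(\X)/g;
my $is_word    = $ideograph =~ /^\w$/      ? 1 : 0;
my $is_han     = $ideograph =~ /^\p{Han}$/ ? 1 : 0;

printf "%d characters, %d grapheme cluster\n", $char_count, scalar @clusters;
printf "ideograph: \\w=%d \\p{Han}=%d\n", $is_word, $is_han;
```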
=item *
UTF-8
UTF-8 is a variable-length (1 to 4 bytes), byte-order independent
encoding. For ASCII (and we really do mean 7-bit ASCII, not another
8-bit encoding), UTF-8 is transparent.
The following table is from Unicode 3.2.
    Code Points           1st Byte  2nd Byte  3rd Byte  4th Byte
    U+0000..U+007F        00..7F
    U+0080..U+07FF      * C2..DF    80..BF
    U+0800..U+0FFF        E0      * A0..BF    80..BF
    U+1000..U+CFFF        E1..EC    80..BF    80..BF
    U+D000..U+D7FF        ED        80..9F    80..BF
    U+D800..U+DFFF        +++++ utf16 surrogates, not legal utf8 +++++
    U+E000..U+FFFF        EE..EF    80..BF    80..BF
    U+10000..U+3FFFF      F0      * 90..BF    80..BF    80..BF
    U+40000..U+FFFFF      F1..F3    80..BF    80..BF    80..BF
    U+100000..U+10FFFF    F4        80..8F    80..BF    80..BF
Note the gaps marked by "*" before several of the byte entries above. These are
caused by legal UTF-8 avoiding non-shortest encodings: it is technically
possible to UTF-8-encode a single code point in different ways, but that is
explicitly forbidden, and the shortest possible encoding should always be used
(and that is what Perl does).
Another way to look at it is via bits:
    Code Points                   1st Byte  2nd Byte  3rd Byte  4th Byte
                      0aaaaaaa    0aaaaaaa
              00000bbbbbaaaaaa    110bbbbb  10aaaaaa
              ccccbbbbbbaaaaaa    1110cccc  10bbbbbb  10aaaaaa
    00000dddccccccbbbbbbaaaaaa    11110ddd  10cccccc  10bbbbbb  10aaaaaa
As you can see, the continuation bytes all begin with "10", and the
leading bits of the start byte tell how many bytes there are in the
encoded character.
The original UTF-8 specification allowed up to 6 bytes, to allow
encoding of numbers up to 0x7FFF_FFFF. Perl continues to allow those,
and has extended that up to 13 bytes to encode code points up to what
can fit in a 64-bit word. However, Perl will warn if you output any of
these as being non-portable; and under strict UTF-8 input protocols,
they are forbidden.
The Unicode non-character code points are also disallowed in UTF-8 in
"open interchange". See L</Non-character code points>.
=item *
UTF-EBCDIC
Like UTF-8 but EBCDIC-safe, in the way that UTF-8 is ASCII-safe.
=item *
UTF-16, UTF-16BE, UTF-16LE, Surrogates, and BOMs (Byte Order Marks)
The following items are mostly for reference and general Unicode
knowledge; Perl doesn't use these constructs internally.
Like UTF-8, UTF-16 is a variable-width encoding, but where
UTF-8 uses 8-bit code units, UTF-16 uses 16-bit code units.
All code points occupy either 2 or 4 bytes in UTF-16: code points
C<U+0000..U+FFFF> are stored in a single 16-bit unit, and code
points C<U+10000..U+10FFFF> in two 16-bit units. The latter case
uses I<surrogates>, the first 16-bit unit being the I<high
surrogate>, and the second being the I<low surrogate>.
Surrogates are code points set aside to encode the C<U+10000..U+10FFFF>
range of Unicode code points in pairs of 16-bit units. The I<high
surrogates> are the range C<U+D800..U+DBFF> and the I<low surrogates>
are the range C<U+DC00..U+DFFF>. The surrogate encoding is
    # int() is needed because "/" is floating-point division in Perl
    $hi = int(($uni - 0x10000) / 0x400) + 0xD800;
    $lo = ($uni - 0x10000) % 0x400 + 0xDC00;
and the decoding is
    $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
Because of the 16-bitness, UTF-16 is byte-order dependent. UTF-16
itself can be used for in-memory computations, but if storage or
transfer is required either UTF-16BE (big-endian) or UTF-16LE
(little-endian) encodings must be chosen.
This introduces another problem: what if you just know that your data
is UTF-16, but you don't know which endianness? Byte Order Marks, or
BOMs, are a solution to this. A special character has been reserved
in Unicode to function as a byte order marker: the character with the
code point C<U+FEFF> is the BOM.
The trick is that if you read a BOM, you will know the byte order,
since if it was written on a big-endian platform, you will read the
bytes C<0xFE 0xFF>, but if it was written on a little-endian platform,
you will read the bytes C<0xFF 0xFE>. (And if the originating platform
was writing in UTF-8, you will read the bytes C<0xEF 0xBB 0xBF>.)
The way this trick works is that the character with the code point
C<U+FFFE> is not supposed to be in input streams, so the
sequence of bytes C<0xFF 0xFE> is unambiguously "BOM, represented in
little-endian format" and cannot be "C<U+FFFE>, represented in big-endian
format".
Surrogates have no meaning in Unicode outside their use in pairs to
represent other code points. However, Perl allows them to be
represented individually internally, for example by saying
C<chr(0xD801)>, so that all code points, not just those valid for open
interchange, are
representable. Unicode does define semantics for them, such as their
General Category being "Cs". But because their use is somewhat dangerous,
Perl will warn (using the warning category "surrogate", which is a
sub-category of "utf8") if an attempt is made
to do things like take the lower case of one, or match
case-insensitively, or to output them. (But don't try this on Perls
before 5.14.)
=item *
UTF-32, UTF-32BE, UTF-32LE
The UTF-32 family is pretty much like the UTF-16 family, except that
the units are 32-bit, and therefore the surrogate scheme is not
needed. UTF-32 is a fixed-width encoding. The BOM signatures are
C<0x00 0x00 0xFE 0xFF> for BE and C<0xFF 0xFE 0x00 0x00> for LE.
=item *
UCS-2, UCS-4
Legacy, fixed-width encodings defined by the ISO 10646 standard. UCS-2 is a 16-bit
encoding. Unlike UTF-16, UCS-2 is not extensible beyond C<U+FFFF>,
because it does not use surrogates. UCS-4 is a 32-bit encoding,
functionally identical to UTF-32 (the difference being that
UCS-4 forbids neither surrogates nor code points larger than 0x10_FFFF).
=item *
UTF-7
A seven-bit safe (non-eight-bit) encoding, which is useful if the
transport or storage is not eight-bit safe. Defined by RFC 2152.
=back
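The surrogate arithmetic from the UTF-16 item above can be worked through for one supplementary code point. The sketch below uses shifts and masks, which are equivalent to integer division and remainder by 0x400, and cross-checks the result against the core L<Encode> module. The helper names C<to_surrogates> and C<from_surrogates> are hypothetical, not part of any module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
sub to_surrogates {
    my ($uni) = @_;
    my $offset = $uni - 0x10000;
    return (($offset >> 10) + 0xD800, ($offset & 0x3FF) + 0xDC00);
}

# Recombine a high and a low surrogate into the original code point.
sub from_surrogates {
    my ($hi, $lo) = @_;
    return 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
}

my $code_point = 0x1F600;
my ($hi, $lo)  = to_surrogates($code_point);
my $round_trip = from_surrogates($hi, $lo);

# UTF-16BE stores the pair as two big-endian 16-bit units.
my @units = unpack 'n*', encode('UTF-16BE', chr($code_point));

printf "U+%X -> %04X %04X -> U+%X\n", $code_point, $hi, $lo, $round_trip;
printf "Encode agrees: %s\n", ("@units" eq "$hi $lo" ? 'yes' : 'no');
```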
=head2 Non-character code points
66 code points are set aside in Unicode as "non-character code points".
These all have the Unassigned (Cn) General Category, and they never will
be assigned. These are never supposed to be in legal Unicode input
streams, so that code can use them as sentinels that can be mixed in
with character data, and they always will be distinguishable from that data.
To keep them out of Perl input streams, strict UTF-8 should be
specified, such as by using the layer C<:encoding('UTF-8')>. The
non-character code points are the 32 between U+FDD0 and U+FDEF, and the
34 code points U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF.
Some people are under the mistaken impression that these are "illegal",
but that is not true. An application or cooperating set of applications
can legally use them at will internally; but these code points are
"illegal for open interchange". Therefore, Perl will not accept these
from input streams unless lax rules are being used, and will warn
(using the warning category "nonchar", which is a sub-category of "utf8") if
an attempt is made to output them.
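The set of 66 non-character code points follows a simple pattern, which the following sketch tests with plain integer arithmetic. The function name C<is_nonchar> is made up for this example; it only classifies numbers and does not touch Perl's own input checking.

```perl
use strict;
use warnings;

# True for the 32 code points U+FDD0..U+FDEF and for the last two code
# points of each of the 17 planes (U+FFFE, U+FFFF, ... U+10FFFE, U+10FFFF).
sub is_nonchar {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;
    return 1 if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return ($cp & 0xFFFE) == 0xFFFE ? 1 : 0;
}

# Count them across the whole Unicode range.
my $count = 0;
for my $cp (0 .. 0x10FFFF) {
    $count++ if is_nonchar($cp);
}

printf "non-character code points: %d\n", $count;
```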
=head2 Beyond Unicode code points
The maximum Unicode code point is U+10FFFF. But Perl accepts code
points up to the maximum permissible unsigned number available on the
platform. However, Perl will not accept these from input streams unless
lax rules are being used, and will warn (using the warning category
"non_unicode", which is a sub-category of "utf8") if an attempt is made to
operate on or output them. For example, C<uc(chr(0x11_0000))> will generate
this warning, returning the input parameter as its result, as the upper
case of every non-Unicode code point is the code point itself.
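A minimal sketch of a code point above the Unicode maximum, assuming Perl 5.14 or later (where the C<non_unicode> warning category exists). The warning category is switched off here so that the example shows only the values involved; the variable names are illustrative.

```perl
use strict;
use warnings;
no warnings 'non_unicode';

# One past the maximum Unicode code point U+10FFFF.
my $beyond = chr(0x11_0000);
my $upper  = uc $beyond;

printf "ord = 0x%X, uc unchanged: %s\n",
    ord($beyond), ($upper eq $beyond ? 'yes' : 'no');
```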
=head2 Security Implications of Unicode
Read L<Unicode Security Considerations|http://www.unicode.org/reports/tr36>.
Also, note the following:
=over 4
=item *