POD2-RU

lib/POD2/RU/perluniintro.pod

C<LATIN CAPITAL LETTER A> followed by C<COMBINING ACUTE ACCENT>
represents the same character in I<Normalization Form Decomposed> (NFD).
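
As a minimal sketch of the two forms, the core C<Unicode::Normalize> module can convert between them.  The precomposed code point C<U+00C1> (C<LATIN CAPITAL LETTER A WITH ACUTE>) and the variable names are chosen here purely for illustration:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# One precomposed code point versus a base letter plus a combining mark.
my $composed   = "\x{C1}";     # LATIN CAPITAL LETTER A WITH ACUTE
my $decomposed = "A\x{301}";   # LATIN CAPITAL LETTER A, COMBINING ACUTE ACCENT

# As raw code point sequences the two strings are different...
my $differ_raw = $composed ne $decomposed;

# ...but they normalize to each other.
my $same_nfd   = NFD($composed) eq $decomposed;
my $same_nfc   = NFC($decomposed) eq $composed;
my $nfd_length = length NFD($composed);

print "raw strings differ: ", ($differ_raw ? "yes" : "no"), "\n";
print "equal after NFD:    ", ($same_nfd   ? "yes" : "no"), "\n";
print "equal after NFC:    ", ($same_nfc   ? "yes" : "no"), "\n";
print "code points in NFD: $nfd_length\n";
```

Comparing strings only after normalizing both to the same form avoids treating the two spellings as different text.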

Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character".  The same character could
be represented differently in several legacy encodings.  The
converse is not true either: some code points do not have an assigned
character.  Firstly, there are unallocated code points within
otherwise used blocks.  Secondly, there are special Unicode control
characters that do not represent true characters.

When Unicode was first conceived, it was thought that all the world's
characters could be represented using a 16-bit word; that is a maximum of
C<0x10000> (or 65536) characters from C<0x0000> to C<0xFFFF> would be
needed.  This soon proved to be false, and since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and Unicode 3.1 (March 2001) defined the first characters above C<0xFFFF>.
The first C<0x10000> characters are called I<Plane 0>, or the
I<Basic Multilingual Plane> (BMP).  With Unicode 3.1, 17 (yes,
seventeen) planes in all were defined--but they are nowhere near full of
defined characters, yet.
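
The arithmetic behind the planes can be sketched as follows.  Each plane holds C<0x10000> code points, so the plane number is the code point shifted right by 16 bits.  The helper name C<plane_of> is hypothetical, not part of Perl:

```perl
use strict;
use warnings;

# Hypothetical helper: return the plane number (0..16) of a code point.
sub plane_of {
    my ($code_point) = @_;
    die "not a Unicode code point\n"
        if $code_point < 0 || $code_point > 0x10FFFF;
    return $code_point >> 16;    # each plane spans 0x10000 code points
}

# The highest code point sits in the last plane, so the count is one more.
my $plane_count = plane_of(0x10FFFF) + 1;

printf "U+%04X is in plane %d\n", $_, plane_of($_)
    for 0x0041, 0xFFFF, 0x10000, 0x1F600, 0x10FFFF;
print "planes in total: $plane_count\n";
```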

When a new language is being encoded, Unicode generally will choose a
C<block> of consecutive unallocated code points for its characters.  So
far, the number of code points in these blocks has always been evenly
divisible by 16.  Extras in a block, not currently needed, are left
unallocated, for future growth.  But there have been occasions when
a later release needed more code points than the available extras, and a
new block had to be allocated somewhere else, not contiguous to the initial
one, to handle the overflow.  Thus, it became apparent early on that
"block" wasn't an adequate organizing principle, and so the C<Script>
property was created.  (Later an improved script property was added as
well, the C<Script_Extensions> property.)  Those code points that are in
overflow blocks can still have the same script as the original ones.
The script concept fits more
closely with natural language: there is C<Latin> script, C<Greek>
script, and so on; and there are several artificial scripts, like
C<Common> for characters that are used in multiple scripts, such as
mathematical symbols.  Scripts usually span varied parts of several
blocks.  For more information about scripts, see L<perlunicode/Scripts>.
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and still are
allocated.  (Note that this paragraph has oversimplified things for the
sake of this being an introduction.  Unicode doesn't really encode
languages, but the writing systems for them--their scripts; and one
script can be used by many languages.  Unicode also encodes things that
aren't really about languages, such as symbols like C<BAGGAGE CLAIM>.)
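
The difference between blocks and scripts can be observed with regular expression properties.  This is only a sketch: the two Greek letters are picked as an example of one script spread over two non-contiguous blocks, and the variable names are illustrative:

```perl
use strict;
use warnings;

my $alpha       = "\x{3B1}";    # GREEK SMALL LETTER ALPHA
my $alpha_psili = "\x{1F00}";   # GREEK SMALL LETTER ALPHA WITH PSILI

# The two letters live in different blocks...
my $alpha_in_basic_block  = $alpha       =~ /\A\p{Block=Greek_And_Coptic}\z/ ? 1 : 0;
my $psili_in_basic_block  = $alpha_psili =~ /\A\p{Block=Greek_And_Coptic}\z/ ? 1 : 0;
my $psili_in_extend_block = $alpha_psili =~ /\A\p{Block=Greek_Extended}\z/   ? 1 : 0;

# ...but share one script.
my $both_greek = "$alpha$alpha_psili" =~ /\A\p{Script=Greek}+\z/ ? 1 : 0;

# Symbols shared between scripts belong to the artificial Common script.
my $plus_is_common = "+" =~ /\A\p{Script=Common}\z/ ? 1 : 0;

print "alpha in Greek and Coptic block:     $alpha_in_basic_block\n";
print "alpha+psili in Greek and Coptic:     $psili_in_basic_block\n";
print "alpha+psili in Greek Extended block: $psili_in_extend_block\n";
print "both letters in Greek script:        $both_greek\n";
print "plus sign in Common script:          $plus_is_common\n";
```

Matching on C<Script> rather than on C<Block> is usually what is wanted when the goal is "text written in Greek".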

The Unicode code points are just abstract numbers.  To input and
output these abstract numbers, the numbers must be I<encoded> or
I<serialised> somehow.  Unicode defines several I<character encoding
forms>, of which I<UTF-8> is perhaps the most popular.  UTF-8 is a
variable-length encoding that encodes Unicode characters as 1 to 4
bytes.  Other encodings include UTF-16 and UTF-32 and their big- and
little-endian variants (UTF-8 is byte-order independent).  ISO/IEC 10646
defines the UCS-2 and UCS-4 encoding forms.
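
To make the variable length concrete, the following sketch uses the core C<Encode> module to serialise a few code points and count the resulting bytes.  The sample code points and the helper name C<encoded_length> are chosen here for illustration only:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes a code point occupies in an encoding.
sub encoded_length {
    my ($encoding, $code_point) = @_;
    return length encode($encoding, chr $code_point);
}

for my $code_point (0x0041, 0x00E9, 0x20AC, 0x1F600) {
    printf "U+%04X: UTF-8 %d, UTF-16BE %d, UTF-32BE %d bytes\n",
        $code_point,
        map { encoded_length($_, $code_point) } qw(UTF-8 UTF-16BE UTF-32BE);
}

# Byte order matters for UTF-16 and UTF-32, but not for UTF-8.
my $utf16_be = encode('UTF-16BE', 'A');   # bytes 00 41
my $utf16_le = encode('UTF-16LE', 'A');   # bytes 41 00
print "UTF-16 byte orders differ\n" if $utf16_be ne $utf16_le;
```

Code points above C<0xFFFF> take four bytes in UTF-16 because they are stored as a pair of surrogates.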

For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.

=head2 Perl's Unicode Support

Starting from Perl v5.6.0, Perl has had the capacity to handle Unicode
natively.  Perl v5.8.0, however, is the first recommended release for
serious Unicode work.  The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
Perl v5.14.0 is the first release where Unicode support is
(almost) seamlessly integrable without gotchas (the exception being
some differences in L<quotemeta|perlfunc/quotemeta>, which were fixed
in Perl 5.16.0).  To enable this
seamless support, you should C<use feature 'unicode_strings'> (which is
automatically selected if you C<use 5.012> or higher).  See L<feature>.
(5.14 also fixes a number of bugs and departures from the Unicode
standard.)
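
A small sketch of what the feature changes, assuming a platform whose native eight-bit character set is Latin-1 and no C<use locale> in effect.  The string is held internally as native bytes, so without the feature C<uc> applies only ASCII rules to it:

```perl
use strict;
use warnings;

my $e_acute = "\xE9";    # LATIN SMALL LETTER E WITH ACUTE, stored as one native byte

my ($without_feature, $with_feature);
{
    no feature 'unicode_strings';
    $without_feature = uc $e_acute;    # left unchanged: ASCII-only case rules
}
{
    use feature 'unicode_strings';
    $with_feature = uc $e_acute;       # upper-cased using Unicode rules
}

printf "without unicode_strings: U+%04X\n", ord $without_feature;
printf "with unicode_strings:    U+%04X\n", ord $with_feature;
```

The feature is lexically scoped, which is why the two blocks can behave differently within one file.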

Before Perl v5.8.0, the C<use utf8> pragma was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"
is now carried with the data, instead of being attached to the
operations.
Starting with Perl v5.8.0, only one case remains where an explicit C<use
utf8> is needed: if your Perl script itself is encoded in UTF-8, you can
use UTF-8 in your identifier names, and in string and regular expression
literals, by saying C<use utf8>.  This is not the default because
scripts with legacy 8-bit data in them would break.  See L<utf8>.
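
The effect of C<use utf8> on literals can be sketched as follows.  The example assumes that the file itself is saved as UTF-8 and that the letter in the literals is the single precomposed code point C<U+00E9>:

```perl
use strict;
use warnings;

my ($chars_with_pragma, $chars_without_pragma);
{
    use utf8;     # the source text is decoded as UTF-8
    $chars_with_pragma = length "é";       # one character
}
{
    no utf8;      # the source text is taken as raw bytes
    $chars_without_pragma = length "é";    # the two bytes of the UTF-8 sequence
}

print "with use utf8:    $chars_with_pragma\n";
print "without use utf8: $chars_without_pragma\n";
```

The pragma affects only how the source file is read; it does nothing for data read from files or sockets.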

=head2 Perl's Unicode Model

Perl supports both pre-5.6 strings of eight-bit native bytes, and
strings of Unicode characters.  The general principle is that Perl tries
to keep its data as eight-bit bytes for as long as possible, but as soon
as Unicodeness cannot be avoided, the data is transparently upgraded
to Unicode.  Prior to Perl v5.14.0, the upgrade was not completely
transparent (see L<perlunicode/The "Unicode Bug">), and for backwards
compatibility, full transparency is not gained unless C<use feature
'unicode_strings'> (see L<feature>) or C<use 5.012> (or higher) is
selected.

Internally, Perl currently encodes Unicode strings using either the
native eight-bit character set of the platform (for example Latin-1)
or UTF-8.  Specifically, if all code points in
the string are C<0xFF> or less, Perl uses the native eight-bit
character set.  Otherwise, it uses UTF-8.
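
The internal choice can be inspected with C<utf8::is_utf8>, as in the sketch below.  This is shown only to illustrate the rule; the flag is an implementation detail that programs should not normally depend on, and the variable names are illustrative:

```perl
use strict;
use warnings;

my $narrow = "\x{DF}";             # every code point is 0xFF or less
my $wide   = "\x{0100}\x{DF}";     # contains a code point above 0xFF

my $narrow_is_utf8 = utf8::is_utf8($narrow) ? 1 : 0;
my $wide_is_utf8   = utf8::is_utf8($wide)   ? 1 : 0;

# Forcing the UTF-8 representation changes the storage, not the string.
my $upgraded = $narrow;
utf8::upgrade($upgraded);
my $upgraded_is_utf8 = utf8::is_utf8($upgraded) ? 1 : 0;
my $still_equal      = $upgraded eq $narrow     ? 1 : 0;

print "narrow string stored as UTF-8:   $narrow_is_utf8\n";
print "wide string stored as UTF-8:     $wide_is_utf8\n";
print "upgraded string stored as UTF-8: $upgraded_is_utf8\n";
print "upgraded string still equal:     $still_equal\n";
```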

A user of Perl does not normally need to know or care how Perl
happens to encode its internal strings, but it becomes relevant when
outputting Unicode strings to a stream without a PerlIO layer (one with
the "default" encoding).  In such a case, the raw bytes used internally
(the native character set or UTF-8, as appropriate for each string)
will be used, and a "Wide character" warning will be issued if those
strings contain a character beyond 0x00FF.

For example,

      perl -e 'print "\x{DF}\n", "\x{0100}\x{DF}\n"'

produces a fairly useless mixture of native bytes and UTF-8, as well
as a "Wide character in print" warning.
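
The same behaviour can be examined without a terminal by printing to in-memory filehandles and looking at the bytes that arrive.  This is a sketch under the assumption of a Latin-1 platform; the second handle adds an explicit C<:encoding(UTF-8)> layer for comparison, and all variable names are illustrative:

```perl
use strict;
use warnings;

# Collect warnings instead of letting them reach STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

# No encoding layer: each string is written in its internal representation.
open my $raw_fh, '>', \my $raw or die "open: $!";
print {$raw_fh} "\x{DF}\n", "\x{0100}\x{DF}\n";
close $raw_fh;
my $warnings_without_layer = @warnings;

# With an explicit layer, every character is encoded consistently.
open my $enc_fh, '>:encoding(UTF-8)', \my $encoded or die "open: $!";
print {$enc_fh} "\x{DF}\n", "\x{0100}\x{DF}\n";
close $enc_fh;
my $warnings_with_layer = @warnings - $warnings_without_layer;

my $hex = sub { join ' ', map { sprintf '%02X', ord } split //, $_[0] };
print "without layer: ", $hex->($raw),     "\n";
print "with layer:    ", $hex->($encoded), "\n";
print "warnings without layer: $warnings_without_layer\n";
print "warnings with layer:    $warnings_with_layer\n";
```

Without the layer, the first C<0xDF> is written as a single native byte while the second, which shares a string with C<0x0100>, is written as two UTF-8 bytes.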


