surrogates, which are not Unicode code points valid for interchange.

=item *

Regular expression pattern matching may surprise you if you're not
accustomed to Unicode.  Starting in Perl 5.14, several pattern
modifiers are available to control this, called the character set
modifiers.  Details are given in L<perlre/Character set modifiers>.

=back

As discussed elsewhere, Perl has one foot (two hooves?) planted in
each of two worlds: the old world of bytes and the new world of
characters, upgrading from bytes to characters when necessary.
If your legacy code does not explicitly use Unicode, no automatic
switch-over to characters should happen.  Characters shouldn't get
downgraded to bytes, either.  It is possible to accidentally mix bytes
and characters, however (see L<perluniintro>), in which case C<\w> in
regular expressions might start behaving differently (unless the C</a>
modifier is in effect).  Review your code, and enable the C<warnings> and C<strict> pragmas.
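
As a minimal sketch of the C<\w> surprise mentioned above (assuming an
ASCII platform, Perl 5.14 or later, and no C<unicode_strings> feature or
locale in effect), the following compares a native 8-bit string with an
upgraded copy of it.  The variable names are illustrative only.

```perl
use strict;
use warnings;

# Deliberately no "use utf8" and no "use feature 'unicode_strings'",
# so the legacy rules apply in this file.
my $bytes = "\xE9";        # code point 0xE9, stored as one native byte
my $chars = $bytes;
utf8::upgrade($chars);     # same code point, now stored internally as UTF-8

# The two strings still compare equal ...
my $equal = $bytes eq $chars ? 1 : 0;

# ... but \w can disagree about them under the legacy rules.
my $w_bytes = $bytes =~ /\w/ ? 1 : 0;    # byte semantics
my $w_chars = $chars =~ /\w/ ? 1 : 0;    # character semantics

# With /a, \w matches only ASCII word characters, so both agree.
my $a_bytes = $bytes =~ /\w/a ? 1 : 0;
my $a_chars = $chars =~ /\w/a ? 1 : 0;

print "equal=$equal w_bytes=$w_bytes w_chars=$w_chars ",
      "a_bytes=$a_bytes a_chars=$a_chars\n";
```

Under those assumptions the two strings are equal, C<\w> matches only the
upgraded copy, and C</a> makes both fail to match.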

=head2 Unicode in Perl on EBCDIC

The way Unicode is handled on EBCDIC platforms is still
experimental.  On such platforms, references to UTF-8 encoding in this
document and elsewhere should be read as meaning the UTF-EBCDIC
specified in Unicode Technical Report 16, unless ASCII vs. EBCDIC issues
are specifically discussed. There is no C<utfebcdic> pragma or
":utfebcdic" layer; rather, "utf8" and ":utf8" are reused to mean
the platform's "natural" 8-bit encoding of Unicode. See L<perlebcdic>
for more discussion of the issues.

=head2 Locales

See L<perllocale/Unicode and UTF-8>.

=head2 When Unicode Does Not Happen

While Perl does have extensive ways to input and output in Unicode,
and a few other "entry points" like the @ARGV array (which can sometimes be
interpreted as UTF-8), there are still many places where Unicode
(in some encoding or another) could be given as arguments or received as
results, or both, but it is not.

The following are such interfaces.  Also, see L</The "Unicode Bug">.
For all of these interfaces Perl
currently (as of v5.16.0) simply assumes byte strings both as arguments
and results, or UTF-8 strings if the (problematic) C<encoding> pragma has been used.

One reason that Perl does not attempt to resolve the role of Unicode in
these situations is that the answers are highly dependent on the operating
system and the file system(s).  For example, whether filenames can be
in Unicode, and in exactly what kind of encoding, is not a
portable concept.  The same goes for C<qx> and C<system>: how well will the
"command-line interface" (and which of them?) handle Unicode?

=over 4

=item *

chdir, chmod, chown, chroot, exec, link, lstat, mkdir,
rename, rmdir, stat, symlink, truncate, unlink, utime, -X

=item *

%ENV

=item *

glob (aka the <*>)

=item *

open, opendir, sysopen

=item *

qx (aka the backtick operator), system

=item *

readdir, readlink

=back
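
Because these interfaces take and return byte strings, one common
workaround is to encode and decode names yourself.  The following is
only a sketch: it assumes that the file system stores names as UTF-8
bytes, which is not true everywhere, and C<$fs_encoding> is a
hypothetical variable introduced for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use File::Temp qw(tempdir);

# Assumption: the file system stores names as UTF-8 bytes.
my $fs_encoding = 'UTF-8';

my $dir = tempdir(CLEANUP => 1);

# A character string: Cyrillic "tekst" followed by ".txt".
my $name = "\x{442}\x{435}\x{43A}\x{441}\x{442}.txt";

# Encode explicitly before handing the name to open().
my $path = "$dir/" . encode($fs_encoding, $name);
open(my $fh, '>', $path) or die "Cannot create file: $!";
close($fh) or die "Cannot close file: $!";

# readdir returns byte strings; decode them to get characters back.
opendir(my $dh, $dir) or die "Cannot open directory: $!";
my @names = map  { decode($fs_encoding, $_) }
            grep { $_ ne '.' && $_ ne '..' } readdir($dh);
closedir($dh);

print "round trip ok\n" if @names == 1 && $names[0] eq $name;
```

Some file systems normalize names, so a name read back may differ from
the one written even when the encoding assumption holds.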

=head2 The "Unicode Bug"

The term "Unicode bug" has been applied to an inconsistency
on ASCII platforms with the
Unicode code points in the Latin-1 Supplement block, that
is, between 128 and 255.  Without a locale specified, unlike all other
characters or code points, these characters behave very differently under
byte semantics than under character semantics, unless
C<use feature 'unicode_strings'> is specified, directly or indirectly.
(It is indirectly specified by a C<use v5.12> or higher.)

In character semantics these upper-Latin1 characters are interpreted as
Unicode code points, which means
they have the same semantics as Latin-1 (ISO-8859-1).

In byte semantics (without C<unicode_strings>), they are considered to
be unassigned characters, meaning that the only semantics they have is
their ordinal numbers, and that they are
not members of various character classes.  None are considered to match C<\w>
for example, but all match C<\W>.

Perl 5.12.0 added C<unicode_strings> to force character semantics on
these code points in some circumstances, which fixed portions of the
bug; Perl 5.14.0 fixed almost all of it; and Perl 5.16.0 fixed the
remainder (so far as we know, anyway).  The lesson here is to enable
C<unicode_strings> to avoid the headaches described below.
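
The difference can be seen in a short sketch, assuming an ASCII platform,
Perl 5.14 or later, and no locale in effect.  The same native 8-bit
string is tested with the feature off and then on; the variable names
are illustrative only.

```perl
use strict;
use warnings;

my $e_acute = "\xE9";    # U+00E9, stored as a single native byte

my ($legacy_w, $legacy_uc, $fixed_w, $fixed_uc);

{
    no feature 'unicode_strings';     # legacy byte semantics
    $legacy_w  = $e_acute =~ /\w/ ? 1 : 0;
    $legacy_uc = uc($e_acute) eq "\xC9" ? 1 : 0;
}

{
    use feature 'unicode_strings';    # character semantics
    $fixed_w  = $e_acute =~ /\w/ ? 1 : 0;
    $fixed_uc = uc($e_acute) eq "\xC9" ? 1 : 0;
}

print "legacy: w=$legacy_w uc=$legacy_uc; ",
      "unicode_strings: w=$fixed_w uc=$fixed_uc\n";
```

Without the feature, the character neither matches C<\w> nor changes
case; with it, both operations treat the character as Latin-1 E<eacute>.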

The old, problematic behavior affects these areas:

=over 4

=item *

Changing the case of a scalar, that is, using C<uc()>, C<ucfirst()>, C<lc()>,
and C<lcfirst()>, or C<\L>, C<\U>, C<\u> and C<\l> in double-quotish
contexts, such as regular expression substitutions.


