Char-UTF2
Unicode properties (also known as character properties) in regexps are not
available. The (?[]) construct added to regexps in Perl 5.18 is not available
either. There are currently no plans to support these.
=item * Delimiter of String and Regexp
qq//, q//, qw//, qx//, qr//, m//, s///, tr///, and y/// can't use a wide character
as the delimiter.
=item * \b{...} Boundaries in Regular Expressions
The following \b{...} boundaries, available starting in Perl v5.22, are not
supported.

  \b{gcb} or \b{g}   Unicode "Grapheme Cluster Boundary"
  \b{sb}             Unicode "Sentence Boundary"
  \b{wb}             Unicode "Word Boundary"
  \B{gcb} or \B{g}   Unicode "Grapheme Cluster Boundary" doesn't match
  \B{sb}             Unicode "Sentence Boundary" doesn't match
  \B{wb}             Unicode "Word Boundary" doesn't match
=back
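For orientation, here is a minimal sketch of what two of the unsupported constructs do in stock Perl 5.22 or later, run without this software loaded. The sub names are hypothetical and exist only for illustration; the sketch says nothing about how this software handles the same patterns beyond the limitations listed above.

```perl
use v5.22;
use warnings;

# Plain \b sees a word boundary between "n" and the apostrophe in "don't".
sub plain_boundary_after_n { return "don't" =~ /n\b/ ? 1 : 0 }

# \b{wb} applies the Unicode word-boundary rules, which keep "don't" together,
# so no boundary is found directly after "n".
sub unicode_boundary_after_n { return "don't" =~ /n\b{wb}/ ? 1 : 0 }

# A Unicode property test: U+3042 is HIRAGANA LETTER A.
sub is_hiragana { my ($char) = @_; return $char =~ /\A\p{Hiragana}\z/ ? 1 : 0 }

say "plain \\b after n    : ", plain_boundary_after_n();
say "\\b{wb} after n      : ", unicode_boundary_after_n();
say "U+3042 is Hiragana  : ", is_hiragana("\x{3042}");
```

Scripts that rely on any of these constructs need stock Perl's character semantics rather than this software.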
=head1 AUTHOR
INABA Hitoshi E<lt>ina@cpan.orgE<gt>
This project was originated by INABA Hitoshi.
=head1 LICENSE AND COPYRIGHT
This software is free software; you can redistribute it and/or
modify it under the same terms as Perl itself. See L<perlartistic>.
This software is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
=head1 My Goal
See chapter 15, "Unicode" (p. 401), of Programming Perl, Third Edition
(ISBN 0-596-00027-8).
Before the introduction of Unicode support in perl, the eq operator
simply compared the byte strings held by two scalars. Beginning
with perl 5.8, eq compares the two strings while also taking each
scalar's UTF8 flag into account.
/* You are not expected to understand this */
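The effect of the UTF8 flag on eq can be seen in stock Perl 5.8 or later. The following is a small sketch, not code from this software; the variable names are illustrative. It compares the two UTF-8 octets of U+00E9 with the decoded one-character string and with the Latin-1 single byte.

```perl
use strict;
use warnings;

my $octets = "\xC3\xA9";     # UTF-8 encoding of U+00E9, two octets, flag off
my $latin1 = "\xE9";         # Latin-1 encoding of U+00E9, one octet, flag off

my $text = $octets;          # work on a copy so $octets stays untouched
utf8::decode($text);         # now one character, UTF8 flag on

printf "octets eq text : %s\n", ($octets eq $text ? 'yes' : 'no');
printf "latin1 eq text : %s\n", ($latin1 eq $text ? 'yes' : 'no');
printf "lengths        : %d %d %d\n",
    length($octets), length($text), length($latin1);
printf "UTF8 flag      : %s\n", (utf8::is_utf8($text) ? 'on' : 'off');
```

The same octets compare unequal once one side has been decoded, while the flagged string compares equal to its Latin-1 form, because eq compares characters rather than raw storage.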
Information processing model beginning with perl 5.8
  +----------------------+---------------------+
  |     Text strings     |                     |
  +----------+-----------|    Binary strings   |
  |  UTF-8   |  Latin-1  |                     |
  +----------+-----------+---------------------+
  | UTF8     |            Not UTF8             |
  | Flagged  |            Flagged              |
  +--------------------------------------------+
http://perl-users.jp/articles/advent-calendar/2010/casual/4
The confusion around Perl's string model comes from the two meanings of
"binary string."
Meanings of "binary string":
1. Non-text string
2. Digital octet string
Let's draw the model again using those terms.
  +----------------------+---------------------+
  |     Text strings     |                     |
  +----------+-----------|   Non-Text strings  |
  |  UTF-8   |  Latin-1  |                     |
  +----------+-----------+---------------------+
  | UTF8     |            Not UTF8             |
  | Flagged  |            Flagged              |
  +--------------------------------------------+
  |            Digital octet string            |
  +--------------------------------------------+
Some people do not accept the change that Perl 5.8 made to the character
string processing model. It is impossible to win the agreement of the
majority of Perl users, who hardly ever use Perl.
To see how to solve this by returning to the original method, let's drag out
page 402 of the old, dusty Programming Perl, 3rd ed. again.
Information processing model beginning with perl3, or of this software,
in the UNIX/C tradition.
  +--------------------------------------------+
  |     Text string as Digital octet string    |
  |     Digital octet string as Text string    |
  +--------------------------------------------+
  |       Not UTF8 Flagged, No Mojibake        |
  +--------------------------------------------+
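The octet-oriented model in the diagram above can be sketched in stock Perl without any decoding step. This is an illustration only, with hypothetical variable names, and it does not use this software's own interface. Text is kept as the octets it arrived in, it never gains the UTF8 flag, and it is written back byte for byte.

```perl
use strict;
use warnings;

# UTF-8 octets of U+3042 (HIRAGANA LETTER A), kept as plain bytes.
my $octets = "\xE3\x81\x82";

# Write through an in-memory file with no encoding layer.
my $buffer = '';
open my $out, '>:raw', \$buffer or die "cannot open buffer for writing: $!";
print {$out} $octets;
close $out or die "cannot close buffer: $!";

# Read the same bytes back, again with no encoding layer.
open my $in, '<:raw', \$buffer or die "cannot open buffer for reading: $!";
my $round_trip = do { local $/; <$in> };
close $in;

printf "length in octets : %d\n", length($round_trip);
printf "UTF8 flag        : %s\n", (utf8::is_utf8($round_trip) ? 'on' : 'off');
printf "identical bytes  : %s\n", ($round_trip eq $octets ? 'yes' : 'no');
```

Because nothing is ever decoded or re-encoded, there is no point at which the data could turn into mojibake; the cost is that length and similar built-ins count octets, not characters.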
In UNIX Everything is a File
- In UNIX everything is a stream of bytes
- In UNIX the filesystem is used as a universal name space
Native Encoding Scripting
- native encoding of file contents
- native encoding of file name on filesystem
- native encoding of command line
- native encoding of environment variable
- native encoding of API
- native encoding of network packet
- native encoding of database
Ideally, I'd like to achieve these five Goals:
=over 2
=item * Goal #1:
Old byte-oriented programs should not spontaneously break on the old
byte-oriented data they used to work on.
This goal has been achieved because this software is additional code
for perl, like the utf8 pragma. Perl should work the same as past Perl if added