Unicode-CharWidth

lib/Unicode/CharWidth.pm


__PACKAGE__
__END__

=head1 SYNOPSIS

    use Unicode::CharWidth;

    if ( $string =~ /\p{InDoublewidth}/ ) {
        # string contains double width (two-column) characters
    }

    if ( $string !~ /\p{InNowidth}/ ) {
        # all string characters have a defined column width
    }

    # use capital P for negation

    if ( $string =~ /\P{InSinglewidth}/ ) {
        # string contains characters that aren't single width
    }

=head1 DESCRIPTION

=head2 Export

C<Unicode::CharWidth> exports four functions: C<InZerowidth>,
C<InSinglewidth>, C<InDoublewidth> and C<InNowidth>.

These functions enable the use of like-named (unofficial) Unicode
properties in regular expressions. Thus C</\p{InSinglewidth}/> matches
all characters that occupy a single screen column.

The functions are not meant to be called directly (they return
strings that describe character properties, some of them lengthy);
Perl's Unicode matching system calls them automatically.
They must be present in your current package for the L</Unicode Properties>
to work as described below.
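
A minimal sketch of this scoping rule, assuming the default import shown
in the synopsis. The package name C<My::Report> and the helper
C<has_wide> are hypothetical and serve only as illustration:

```perl
use strict;
use warnings;

package My::Report;            # hypothetical package name

# Importing here puts the In* subs into My::Report, which is where
# Perl looks them up when it compiles the regex below.
use Unicode::CharWidth;

# Hypothetical helper: 1 if $text contains a two-column character, else 0.
sub has_wide {
    my ($text) = @_;
    return $text =~ /\p{InDoublewidth}/ ? 1 : 0;
}

package main;

print My::Report::has_wide("abc"), "\n";          # plain ASCII only
print My::Report::has_wide("abc\x{4E2D}"), "\n";  # contains U+4E2D, a CJK ideograph
```

According to L<perlunicode>, code in a package that has not imported the
functions has to name the defining package inside the property, as in
C<\p{My::Report::InDoublewidth}>. Importing the module into each package
that needs the properties is usually the simpler choice.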

C<Unicode::CharWidth> normally ignores arguments in the C<use>-statement.
There is one exception:

    use Unicode::CharWidth -gen;

You don't ever I<need> to run this on an installed copy of this module.
See L</The -gen Option> for more.

=head2 Unicode Properties

The enabled Unicode properties are InZerowidth, InSinglewidth,
InDoublewidth, and InNowidth.

They are not derived from Unicode documents directly, but rely on
the implementation of the C library function C<wcwidth(3)>.

=over 4

=item InZerowidth

C</\p{InZerowidth}/> matches the characters that don't occupy 
column space of their own. Most of these are modifying or overlay
characters that add accents or overstrokes to the preceding character.
C<"\0"> also has zero width. It is the only zero width character in
the ASCII range.

=item InSinglewidth

C</\p{InSinglewidth}/> matches what most westerners would consider
"normal" characters that occupy a single screen column. All printing
(non-control) ASCII characters are in this class, as well as most
characters in other alphabetic scripts.

=item InDoublewidth

C</\p{InDoublewidth}/> matches characters (in East Asian scripts) that
occupy two adjacent screen columns. There are no ASCII characters in this
class.

=item InNowidth

C</\p{InNowidth}/> matches characters that don't have a (fixed) column
width assigned at all. All ASCII control characters except C<"\0"> are in
this class; C<"\t">, C<"\n">, and C<"\r"> are examples. Outside ASCII,
vast ranges of unassigned and reserved Unicode characters fall in
this class.

=back
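
As a rough illustration of how these classes can be combined, the
sketch below checks whether a string is suitable for fixed-column output
and counts its two-column characters. The helper names C<is_layoutable>
and C<count_wide> are hypothetical and not part of this module:

```perl
use strict;
use warnings;
use Unicode::CharWidth;

# Hypothetical helper: 1 if every character has a defined column width
# (no tabs, newlines, other control characters or unassigned code points).
sub is_layoutable {
    my ($text) = @_;
    return $text !~ /\p{InNowidth}/ ? 1 : 0;
}

# Hypothetical helper: number of two-column characters in $text.
sub count_wide {
    my ($text) = @_;
    my $n = () = $text =~ /\p{InDoublewidth}/g;   # count all matches
    return $n;
}

my $line = "abc \x{65E5}\x{672C}";   # ASCII text followed by two CJK ideographs
printf "layoutable: %d, wide: %d\n", is_layoutable($line), count_wide($line);
printf "layoutable: %d\n", is_layoutable("a\tb");   # the tab has no width
```

The exact classification of any given character comes from the shipped
property definitions, so results for less common characters may differ
from what other width tables report.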

Every Unicode character matches exactly one of these four
character properties. Thus the column width (if any) of a
character can in principle be recovered by trying it against
the four regexes and noting which one matched. In practice,
use the function C<Text::CharWidth::mbwidth> for that (under a
Unicode locale): it is much faster, and it is what the character
properties are based on in the first place.
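
For illustration only, the regex-based recovery could be sketched as
follows. The helpers C<char_width> and C<string_width> are hypothetical
names, not functions provided by this module:

```perl
use strict;
use warnings;
use Unicode::CharWidth;

# Hypothetical helper: column width of the first character of its argument.
# Returns 0, 1 or 2, or undef if the character has no defined width.
sub char_width {
    my ($ch) = @_;
    $ch = substr $ch, 0, 1;
    return 0 if $ch =~ /\p{InZerowidth}/;
    return 1 if $ch =~ /\p{InSinglewidth}/;
    return 2 if $ch =~ /\p{InDoublewidth}/;
    return;    # InNowidth is the only remaining class
}

# Hypothetical helper: total columns occupied by $text, or undef if any
# character in it has no defined width.
sub string_width {
    my ($text) = @_;
    my $total = 0;
    for my $ch (split //, $text) {
        my $w = char_width($ch);
        return unless defined $w;
        $total += $w;
    }
    return $total;
}

# Print the class of a few sample characters.
for my $ch ("A", "\0", "\x{0301}", "\x{4E2D}", "\t") {
    my $w = char_width($ch);
    printf "U+%04X: %s\n", ord $ch, defined $w ? $w : 'none';
}
```

This runs up to three regex matches per character, which is why the
faster C<Text::CharWidth::mbwidth> is the better tool for real width
calculations.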

=head2 The -gen Option

As mentioned, C<use Unicode::CharWidth -gen> is handled as a special
case. Its purpose is to generate a file that holds the definitions
of the character properties exported by this module. The file (called
F<UCW_startup>) is distributed with the module, so there is no need
to generate it again. If it gets lost or corrupted (which should be rare),
you can force a re-install, as with any other damaged module.

The -gen mechanism is not separated from the distribution,
though technically it could be. That is mostly for simplicity, but also...
we're supposed to be open software, aren't we? Generating
files in private and shipping them to an unsuspecting public
isn't the done thing.

If you want to run with -gen for any reason, you must be able to do
a few things:

=over 4

=item Overwrite the shipped UCW_startup file

The shipped file is installed directly next to the file 
F<.../Unicode/CharWidth.pm>, as F<.../Unicode/UCW_startup>.
(Consult C<$INC{'Unicode/CharWidth.pm'}> if in doubt.) You must


