Lingua-ZH-HanConvert
1;
__END__
=head1 NAME
Lingua::ZH::HanConvert - convert between Traditional and Simplified Chinese characters
=head1 SYNOPSIS
    #!perl -lw
    use Lingua::ZH::HanConvert qw(simple trad);
    use utf8;
    my $t = "國";        # Traditional character for "country", Unicode 22283
    # or: my $t = v22283;
    print simple($t);    # Simplified "country", 国 (Unicode 22269)
    my $s = "鱼";        # Simplified character for "fish", Unicode 40060
    # or: my $s = v40060;
    print trad($s);      # Traditional "fish", 魚 (Unicode 39770)
=head1 REQUIRES
Perl 5.6
=head1 DESCRIPTION
In the 1950s, the Chinese government simplified over 2000 Chinese
characters. Taiwan and Hong Kong still use the traditional characters.
The simplified characters are hard to read if you only know the traditional
ones, and vice versa. This module attempts to convert Chinese text between
the two forms, using character-by-character transliteration.
Note that this module only handles Unicode text encoded as UTF-8.
If you need to convert between the Big5 and GB character sets, then please
look at L<Text::Iconv>, or use the C<HanConvert> Perl script which comes
with this module.
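As a rough sketch of the legacy-encoding case, the core C<Encode> module (bundled with Perl 5.8 and later, so newer than the Perl 5.6 minimum stated above) can decode Big5 bytes into a Perl string and encode the converted result as GB2312. This is only one workable approach and is not necessarily what the bundled script does. The conversion step in the middle is done by hand here so that the example runs without this module installed.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Build sample Big5 input: the traditional character for "country" (U+570B).
my $big5_bytes = encode('big5-eten', "\x{570B}");

# Decode the legacy bytes into a Perl character string.
my $text = decode('big5-eten', $big5_bytes);

# A call such as simple($text) would go here; this sketch substitutes the
# simplified form (U+56FD) by hand to stay self-contained.
(my $converted = $text) =~ s/\x{570B}/\x{56FD}/g;

# Encode the simplified text as GB2312 (Encode calls this "euc-cn").
my $gb_bytes = encode('euc-cn', $converted);

printf "Big5 input: %d bytes, GB output: %d bytes\n",
    length $big5_bytes, length $gb_bytes;
```

The same pattern works in the other direction by swapping the two encoding names and calling C<trad> instead of C<simple>.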
C<simple> takes a string, converts any traditional Chinese characters (such
as E<22283>, Unicode U+570B, meaning "country") to the corresponding
simplified characters (like E<22269>, Unicode U+56FD, also meaning
"country"), and returns the result. Characters which are not traditional
Chinese do not change.
C<trad> does the reverse; it converts any simplified Chinese characters to
the corresponding traditional characters. Characters which are not
simplified Chinese do not change.
If a simplified character has two or more corresponding traditional
characters, then it will be replaced by all of them, enclosed in square
brackets. To use different characters instead of the square brackets, give
them as the second and third arguments to C<trad>. The same applies where
a traditional character has two or more corresponding simplified forms,
but this happens much more rarely.
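The bracketing behaviour described above can be illustrated with a small self-contained sketch. The two-entry table and the C<toy_trad> helper are hypothetical and exist only for this example; the module's real tables are far larger, and the order in which it lists candidates inside the brackets is an assumption here.

```perl
use strict;
use warnings;

# Toy table for illustration only. Each simplified character maps to a
# list of traditional candidates.
my %to_trad = (
    "\x{56FD}" => ["\x{570B}"],              # "country": one candidate
    "\x{9762}" => ["\x{9762}", "\x{9EB5}"],  # "face" or "noodles": two
);

# Hypothetical helper mirroring the behaviour described in the text.
sub toy_trad {
    my ($text, $open, $close) = @_;
    $open  = '[' unless defined $open;
    $close = ']' unless defined $close;

    my $out = '';
    for my $char (split //, $text) {
        my $candidates = $to_trad{$char};
        if (!$candidates) {
            $out .= $char;                   # not in the table: unchanged
        }
        elsif (@$candidates == 1) {
            $out .= $candidates->[0];        # unambiguous replacement
        }
        else {
            # Ambiguous: emit every candidate between the delimiters.
            $out .= $open . join('', @$candidates) . $close;
        }
    }
    return $out;
}

binmode STDOUT, ':encoding(UTF-8)';
print toy_trad("\x{56FD}\x{9762}"), "\n";            # default brackets
print toy_trad("\x{56FD}\x{9762}", '<', '>'), "\n";  # custom delimiters
```

Choosing delimiters that do not otherwise occur in the text makes it easier to find the ambiguous spots afterwards and resolve them by hand.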
=head1 BUGS, LIMITATIONS
B<There may be mistakes in the transliterations>. A number of data sources
were used to build the transliteration tables, including dictionaries and
the Unicode Consortium's Unihan database, but some mappings may be
incorrect or missing.
Some characters which are simplified forms are also traditional forms. For
example, E<38754>, Unicode U+9762, is the simplified form of E<40629>,
Unicode U+9EB5, meaning "noodles"; but it is also the character for "face"
in both traditional and simplified writing. Most character mapping lists
say that simplified E<38754> (U+9762) can correspond to traditional
E<40629> (U+9EB5), but do not mention that simplified E<38754> (U+9762) can
map to traditional E<38754> (U+9762); common sense makes this obvious to
a human who comes across this character in a text, but not to a computer
program. To provide this module with that extra information, it has been
assumed that any simplified form which appears in the Big5 character set is
also a traditional form. In some cases, this assumption may be incorrect.
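The Big5 heuristic can be sketched as follows. This is not the code used to build the module's tables; it assumes the core C<Encode> module (Perl 5.8 or later) as the test for Big5 membership, and the table and the C<in_big5> helper are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);

# True if $char can be represented in Big5. FB_CROAK makes encode() die on
# an unmappable character; LEAVE_SRC stops it from modifying its argument.
sub in_big5 {
    my ($char) = @_;
    my $ok = eval {
        encode('big5-eten', $char, Encode::FB_CROAK | Encode::LEAVE_SRC);
        1;
    };
    return $ok ? 1 : 0;
}

# Toy simplified-to-traditional table, as a mapping list might supply it.
my %to_trad = (
    "\x{9762}" => ["\x{9EB5}"],   # "noodles" only; "face" is missing
    "\x{9C7C}" => ["\x{9B5A}"],   # "fish"
);

# Apply the heuristic: a simplified form found in Big5 is assumed to be a
# traditional form too, so it becomes a candidate for itself.
for my $simp (keys %to_trad) {
    my $candidates = $to_trad{$simp};
    next if grep { $_ eq $simp } @$candidates;
    unshift @$candidates, $simp if in_big5($simp);
}

for my $simp (sort keys %to_trad) {
    printf "U+%04X has %d candidate(s)\n",
        ord($simp), scalar @{ $to_trad{$simp} };
}
```

In this sketch U+9762 gains itself as a second candidate because it is encodable in Big5, while the simplified "fish" character is not and keeps its single mapping.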
The transliteration mappings could be improved. Ideally, I'd like to see
the module performing intelligent transliteration of ambiguous characters
based on context, if suitable data sources were available. See
C<http://www.basistech.com/articles/C2C.html> for a discussion of
transliteration issues.
Some differences in styles of Chinese writing are not related to simplified
characters. For instance, the mainland Chinese word for "computer" differs
from the word used in Taiwan. Colloquial Cantonese writing is different
from Mandarin writing, and everyday Cantonese text such as
"E<20322>E<20418>E<21780>E<20418>E<25105>E<13774>" ("is it mine?") contains
characters and phrases which may be unfamiliar to a Mandarin-speaking
reader. These issues are beyond the scope of this module; analogously, a
module which converted American English spelling into British English
spelling would not change the word "gasoline" into the word "petrol".
The characters in this documentation may not display correctly unless the
program you are reading it with is Unicode-aware.
=head1 SEE ALSO
If you just want to convert some text, you might want to use trad2simp and
simp2trad, the Perl scripts which come with this module.
=head1 ACKNOWLEDGEMENTS
Much of the data used by this module is taken from the Unicode Consortium's
Unihan database, available from C<ftp://ftp.unicode.org>. Thanks to them
for compiling the data and making it freely available.
=head1 AUTHOR
David Chan <david@sheetmusic.org.uk>
=head1 COPYRIGHT
Copyright (C) 2001, David Chan. All rights reserved. This program is free
software; you can redistribute it and/or modify it under the same terms as
Perl itself.