ShiftJIS-CP932-MapUTF


Currently, only coderefs are allowed as C<UNICODE_CALLBACK>.
A string returned from the coderef is inserted
in place of the unmapped character.

A coderef as C<UNICODE_CALLBACK> is called with one or more arguments.
If the unmapped character is a partial character (an illegal byte),
the first argument is C<undef>
and the second argument is an unsigned integer representing the byte.
If not partial, the first argument is an unsigned integer
representing a Unicode code point.

For example, the following callback converts characters that have no
CP-932 mapping to numeric character references for HTML 4.01,
and dies on an illegal byte.

    sub toHexNCR {
        my ($char, $byte) = @_;
        return sprintf("&#x%x;", $char) if defined $char;
        die sprintf "illegal byte 0x%02x was found", $byte;
    }

    $cp932 = utf8_to_cp932   (\&toHexNCR, $utf8_string);
    $cp932 = unicode_to_cp932(\&toHexNCR, $unicode_string);
    $cp932 = utf16le_to_cp932(\&toHexNCR, $utf16le_string);

The return value of C<UNICODE_CALLBACK> must be legal in CP-932.
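
As a further illustration of the calling convention described above, here is
a minimal sketch of a callback that never dies. The name C<to_dec_ncr_or_qm>
is hypothetical and not part of this module. It returns a decimal character
reference for an unmapped code point and a plain C<?> for an illegal byte;
both results are pure ASCII and therefore legal in CP-932.

    use strict;
    use warnings;

    # Hypothetical callback (not provided by the module).
    # $char is a code point, or undef when an illegal byte was found;
    # in that case $byte holds the byte value.
    sub to_dec_ncr_or_qm {
        my ($char, $byte) = @_;
        return sprintf("&#%d;", $char) if defined $char;
        return '?';   # replace the illegal byte instead of dying
    }

    print to_dec_ncr_or_qm(0x20AC), "\n";        # code point case
    print to_dec_ncr_or_qm(undef, 0xFF), "\n";   # illegal byte case

Such a callback can be passed in the same way as C<\&toHexNCR> above,
for instance as the first argument of C<utf8_to_cp932>.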

C<UNICODE_OPTION> may be specified after C<STRING>. Options can be combined,
as in C<'fg'> and C<'gsf'> (the order does not matter).

    'g'    add mappings of Gaiji (user defined characters)
           [0xF040 to 0xF9FC (rows 95 to 114) in CP-932]
           from Unicode's PUA [0xE000 to 0xE757] (1880 characters).

    's'    add mappings of undefined Single-byte characters:
           U+0080 => 0x80,  U+F8F0 => 0xA0,
           U+F8F1 => 0xFD,  U+F8F2 => 0xFE,  U+F8F3 => 0xFF.

    'f'    add some Fallback mappings from Unicode to CP-932.
           The characters additionally mapped are
           some characters in latin-1 region [U+00A0..U+00FF], and
           HIRAGANA LETTER VU [U+3094] (to KATAKANA LETTER VU, 0x8394).
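
The arithmetic behind the C<'g'> range can be checked with a small sketch.
It assumes that the Gaiji mapping is linear, that is, that U+E000 corresponds
to 0xF040 and that each lead byte from 0xF0 to 0xF9 covers 188 trail bytes
(0x40..0x7E and 0x80..0xFC). The function name C<pua_to_gaiji> is hypothetical
and this is not the module's own implementation.

    use strict;
    use warnings;

    # Hypothetical helper; assumes a linear PUA-to-Gaiji mapping.
    # 10 lead bytes x 188 trail bytes = 1880 characters.
    sub pua_to_gaiji {
        my ($cp) = @_;
        return if $cp < 0xE000 || $cp > 0xE757;       # outside the range
        my $index = $cp - 0xE000;
        my $lead  = 0xF0 + int($index / 188);
        my $t     = $index % 188;
        my $trail = 0x40 + $t + ($t >= 63 ? 1 : 0);   # skip 0x7F
        return ($lead << 8) | $trail;
    }

    printf "U+%04X => 0x%04X\n", $_, pua_to_gaiji($_)
        for 0xE000, 0xE03E, 0xE03F, 0xE0BC, 0xE757;

Under this assumption the first and last code points give 0xF040 and 0xF9FC,
which agrees with the range stated for C<'g'>.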

=over 4

=item C<utf8_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])>

Converts UTF-8 to CP-932.

=item C<unicode_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])>

Converts Unicode to CP-932.

This B<Unicode> is coded in Perl's internal format (see F<perlunicode>).
If the string is not flagged with C<SVf_UTF8>,
it is upgraded as an ISO 8859-1 (latin1) string.

B<This function is provided only for Perl 5.6.1 or later, and via XS.>

=item C<utf16_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])>

Converts UTF-16 (with or without C<BOM>) to CP-932.

=item C<utf16le_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])>

Converts UTF-16LE to CP-932.

=item C<utf16be_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])>

Converts UTF-16BE to CP-932.

=item C<utf32_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])>

Converts UTF-32 (with or without C<BOM>) to CP-932.

=item C<utf32le_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])>

Converts UTF-32LE to CP-932.

=item C<utf32be_to_cp932([UNICODE_CALLBACK,] STRING [, UNICODE_OPTION])>

Converts UTF-32BE to CP-932.

=back
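
The following sketch shows how several of the functions listed above might be
called on the same text. It assumes that the core C<Encode> module is used
only to build the input byte strings, and that the function names can be
requested explicitly on the C<use> line. The sample text is arbitrary.

    use strict;
    use warnings;
    use Encode qw(encode);
    use ShiftJIS::CP932::MapUTF
        qw(utf8_to_cp932 utf16le_to_cp932 utf16_to_cp932 utf32be_to_cp932);

    # HIRAGANA LETTER A and HIRAGANA LETTER I
    my $text = "\x{3042}\x{3044}";

    my $from_utf8    = utf8_to_cp932(encode('UTF-8', $text));
    my $from_utf16le = utf16le_to_cp932(encode('UTF-16LE', $text));
    my $from_utf16   = utf16_to_cp932(encode('UTF-16', $text));  # with BOM
    my $from_utf32be = utf32be_to_cp932(encode('UTF-32BE', $text));

    # All four results are expected to hold the same CP-932 bytes.
    print unpack('H*', $_), "\n"
        for $from_utf8, $from_utf16le, $from_utf16, $from_utf32be;

    # With the 'f' option, HIRAGANA LETTER VU falls back to 0x8394.
    my $vu = utf8_to_cp932(encode('UTF-8', "\x{3094}"), 'f');
    print unpack('H*', $vu), "\n";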

=head2 Export

B<By default:>

    cp932_to_utf8     utf8_to_cp932
    cp932_to_utf16le  utf16le_to_cp932
    cp932_to_utf16be  utf16be_to_cp932

    cp932_to_unicode  unicode_to_cp932 (only for XS)

B<On request:>

    cp932_to_utf32le  utf32le_to_cp932
    cp932_to_utf32be  utf32be_to_cp932
                      utf16_to_cp932 [*]
                      utf32_to_cp932 [*]

[*] Their counterparts C<cp932_to_utf16()> and C<cp932_to_utf32()>
are not implemented yet. The handling of return values
from C<SJIS_CALLBACK> needs more investigation,
because concatenating them requires recognizing and coping with a C<BOM>.
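
A minimal sketch of importing the on-request functions alongside the
defaults follows. It assumes that the module exports through the standard
C<Exporter> mechanism, so that the C<:DEFAULT> tag is available; if that
assumption does not hold, list every needed function name explicitly.

    use strict;
    use warnings;

    # :DEFAULT keeps the default exports; the remaining names
    # are imported on request.
    use ShiftJIS::CP932::MapUTF
        qw(:DEFAULT cp932_to_utf32le utf32le_to_cp932 utf16_to_cp932);

    print "utf8_to_cp932 imported\n"    if defined &utf8_to_cp932;
    print "utf32le_to_cp932 imported\n" if defined &utf32le_to_cp932;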

=head1 CAVEAT

The pure Perl edition of this module does not understand
any logically wide characters (see F<perlunicode>).
Use C<utf8::decode>/C<utf8::encode> (see F<utf8>) on Perl 5.7 or later
if necessary.
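
As a sketch of the workaround mentioned above, a string holding a wide
character can be turned into UTF-8 bytes with C<utf8::encode> before it is
handed to a function such as C<utf8_to_cp932>. The variable names are
illustrative only.

    use strict;
    use warnings;

    my $chars = "\x{3042}";    # one wide character (U+3042)
    my $bytes = $chars;
    utf8::encode($bytes);      # in place: now a string of UTF-8 bytes

    printf "%d character(s) -> %d byte(s)\n",
        length($chars), length($bytes);

    # $bytes contains no wide characters and could now be passed
    # to utf8_to_cp932().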

=head1 AUTHOR

SADAHIRO Tomoyuki <SADAHIRO@cpan.org>

Copyright(C) 2001-2007, SADAHIRO Tomoyuki. Japan. All rights reserved.

This module is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

=head1 SEE ALSO

=over 4

=item Microsoft PRB, Article ID: Q170559

Conversion Problem Between Shift-JIS and Unicode

=item cp932 to Unicode table

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit932.txt

http://www.microsoft.com/globaldev/reference/dbcs/932.htm

=back

=cut


