Encode

lib/Encode/Supported.pod

                                                MacRomanian
                                                MacRumanian
  Latin3[1]     iso-8859-3      
  Latin4[2]     iso-8859-4              
  Cyrillics     iso-8859-5      cp855   cp1251  MacCyrillic
    (See also next section)     cp866           MacUkrainian
  Arabic        iso-8859-6      cp864   cp1256  MacArabic
                                cp1006          MacFarsi
  Greek         iso-8859-7      cp737   cp1253  MacGreek
                                cp869 (DOSGreek2)
  Hebrew        iso-8859-8      cp862   cp1255  MacHebrew
  Turkish       iso-8859-9      cp857   cp1254  MacTurkish
  Nordics       iso-8859-10     cp865
                                cp861           MacIcelandic
                                                MacSami
  Thai          iso-8859-11[3]  cp874           MacThai
  (iso-8859-12 is nonexistent. Reserved for Indics?)
  Baltics       iso-8859-13     cp775   cp1257
  Celtics       iso-8859-14
  Latin9 [4]    iso-8859-15
  Latin10       iso-8859-16
  Vietnamese    viscii                  cp1258  MacVietnamese
  ----------------------------------------------------------------

  [1] Esperanto, Maltese, and Turkish. Turkish is now on 8859-9.
  [2] Baltics.  Now on 8859-10, except for Latvian.
  [3] TIS 620 +  Non-Breaking Space (0xA0 / U+00A0)
  [4] Nicknamed Latin0; the Euro sign as well as French and Finnish
      letters that are missing from 8859-1 were added.

All cp* encodings are also available as ibm-*, ms-*, and windows-*.
See also L<http://czyborra.com/charsets/codepages.html>.
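
The alias rule above can be checked with C<find_encoding> from
L<Encode>.  The following is a minimal sketch, not an authoritative
test: the list of spellings is chosen only for illustration, and which
of them resolve depends on the alias table of the installed Encode.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Illustrative spellings only; any cp*/ibm-*/ms-*/windows-* name works here.
my @aliases = qw(cp1251 windows-1251 ms-1251 ibm-866 cp866);

# Map each spelling to the canonical name Encode reports (undef if unknown).
my %canonical;
for my $alias (@aliases) {
    my $enc = find_encoding($alias);
    $canonical{$alias} = $enc ? $enc->name : undef;
    printf "%-14s => %s\n", $alias, $canonical{$alias} // '(not found)';
}
```

An alias that cannot be resolved makes C<find_encoding> return a false
value, which is why the sketch checks the result before calling
C<name>.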

Macintosh encodings do not appear to be registered with bodies such as
IANA.  The "canonical" names in Encode are based on Apple's Tech Note
1150.  See L<http://developer.apple.com/technotes/tn/tn1150.html>
for details.

=item KOI8 - De Facto Standard for the Cyrillic world

Though ISO-8859 does have ISO-8859-5, the KOI8 series is far more
popular on the Net.  L<Encode> comes with the following KOI charsets.
For gory details, see L<http://czyborra.com/charsets/cyrillic.html>.

  ----------------------------------------------------------------
  koi8-f                                        
  koi8-r cp878                                           [RFC1489]
  koi8-u                                                 [RFC2319]
  ----------------------------------------------------------------

=back

=head2 gsm0338 - Hentai Latin 1

GSM0338 is for GSM handsets.  Though it shares alphanumerics with
ASCII, the control character ranges and other parts are mapped very
differently, mainly to store Greek characters.  There are also escape
sequences (starting with 0x1B) to cover characters such as the Euro sign.

This was once handled by L<Encode::Byte>, but because of these
unusual specifications, Encode 2.20 relocated the support to
L<Encode::GSM0338>.  See L<Encode::GSM0338> for details.

=over 2

=item gsm0338 support before 2.19

Some special cases like a trailing 0x00 byte or a lone 0x1B byte are not
well-defined and decode() will return an empty string for them.
One possible workaround is

   $gsm =~ s/\x00\z/\x00\x00/;
   $uni = decode("gsm0338", $gsm);
   $uni .= "\xA0" if $gsm =~ /\x1B\z/;

Note that the Encode implementation of GSM0338 does not implement the
reuse of Latin capital letters as Greek capital letters.  For example,
0x5A is U+005A (LATIN CAPITAL LETTER Z), not U+0396 (GREEK CAPITAL
LETTER ZETA).

The GSM0338 is also covered in Encode::Byte even though it is not
an "extended ASCII" encoding.
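
The behaviour described in the note on Latin and Greek capitals can be
observed directly.  This is a small sketch that assumes an Encode
installation in which the C<gsm0338> encoding is available; the sample
string C<HELLO> is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode the single byte 0x5A and report the resulting code point.
my $gsm = "\x5A";
my $z   = decode("gsm0338", $gsm);
printf "0x5A decodes to U+%04X\n", ord $z;

# Round-trip an arbitrary string of Latin capitals through gsm0338.
my $text  = "HELLO";
my $bytes = encode("gsm0338", $text);
my $back  = decode("gsm0338", $bytes);
print "round trip ", ($back eq $text ? "preserved" : "changed"), " the text\n";
```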

=back

=head2 CJK: Chinese, Japanese, Korean (Multibyte)

Note that Vietnamese is listed above.  Also read "Encoding vs Charset"
below.  These encodings are implemented in separate modules, one per
country, because of size concerns (simplified Chinese is mapped
to 'CN', continental China, while traditional Chinese is mapped to
'TW', Taiwan).  Please refer to their respective documentation pages.

=over 2

=item Encode::CN -- Continental China

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-cn [1]            MacChineseSimp
  (gbk)         cp936 [2]
  gb12345-raw                      { GB12345 without CES }
  gb2312-raw                       { GB2312  without CES }
  hz
  iso-ir-165
  ----------------------------------------------------------------

  [1] GB2312 is aliased to this.  See L<Microsoft-related naming mess>
  [2] gbk is aliased to this.  See L<Microsoft-related naming mess>

=item Encode::JP -- Japan

  Standard      DOS/Win Macintosh                Comment/Reference
  ----------------------------------------------------------------
  euc-jp
  shiftjis      cp932   MacJapanese
  7bit-jis
  iso-2022-jp                                            [RFC1468]
  iso-2022-jp-1                                          [RFC2237]
  jis0201-raw  { JIS X 0201 (roman + halfwidth kana) without CES }
  jis0208-raw  { JIS X 0208 (Kanji + fullwidth kana) without CES }


