falls into this category. See L<perlunicode/"Unicode Encodings"> to
find out how UTF-8 maps Unicode to a byte sequence.
You may also have found out by now why 7bit ISO-2022 cannot comprise
a CCS. If you look at a byte sequence \x21\x21, you can't tell if
it is two !'s or IDEOGRAPHIC SPACE. EUC maps the latter to \xA1\xA1
so you have no trouble differentiating between "!!" and S<" ">.
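The ambiguity described above can be observed directly. The following is a minimal sketch, assuming the C<iso-2022-jp> and C<euc-jp> encodings bundled with Encode are available; the helper C<hex_bytes> is a hypothetical name introduced only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes);
}

my $ideographic_space = "\x{3000}";    # IDEOGRAPHIC SPACE
my $two_bangs         = "!!";          # two EXCLAMATION MARKs

# 7bit ISO-2022-JP: the meaning of the payload bytes depends on the
# escape sequence that precedes them.
my $jis_space = encode('iso-2022-jp', $ideographic_space);
my $jis_bangs = encode('iso-2022-jp', $two_bangs);

# EUC-JP: the two characters live in disjoint byte ranges.
my $euc_space = encode('euc-jp', $ideographic_space);
my $euc_bangs = encode('euc-jp', $two_bangs);

print "ISO-2022-JP, ideographic space: ", hex_bytes($jis_space), "\n";
print "ISO-2022-JP, two bangs:         ", hex_bytes($jis_bangs), "\n";
print "EUC-JP, ideographic space:      ", hex_bytes($euc_space), "\n";
print "EUC-JP, two bangs:              ", hex_bytes($euc_bangs), "\n";
```

In the ISO-2022-JP output the pair C<21 21> appears in both cases, and only the surrounding escape sequences tell the two apart, whereas the EUC-JP byte pairs differ by themselves.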
=head1 Encoding Classification (by Anton Tagunov and Dan Kogai)
This section tries to classify the supported encodings by their
applicability for information exchange over the Internet and to
choose the most suitable aliases to name them in the context of
such communication.
=over 2
=item *
To (en|de)code encodings marked by C<(**)>, you need
C<Encode::HanExtra>, available from CPAN.
=back
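Whether the C<(**)> encodings are usable on a given installation can be checked at run time. This is a minimal sketch; the lowercase names C<gb18030>, C<euc-tw> and C<big5plus> are assumed to be the names under which C<Encode::HanExtra> registers these encodings, so adjust them if your version differs.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Try to load the optional module; this yields 0 if it is not installed.
my $have_hanextra = eval { require Encode::HanExtra; 1 } ? 1 : 0;
print "Encode::HanExtra is ",
    ($have_hanextra ? "installed" : "not installed"), "\n";

# Assumed encoding names; find_encoding() returns undef for unknown names.
for my $name (qw(gb18030 euc-tw big5plus)) {
    my $enc = find_encoding($name);
    printf "%-10s %s\n", $name, ($enc ? "available" : "not available");
}
```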
Encoding names
  US-ASCII     UTF-8        ISO-8859-*   KOI8-R
  Shift_JIS    EUC-JP       ISO-2022-JP  ISO-2022-JP-1
  EUC-KR       Big5         GB2312
are registered with IANA as preferred MIME names and may
be used over the Internet.
C<Shift_JIS> has been made official by JIS X 0208:1997.
L<Microsoft-related naming mess> gives details.
C<GB2312> is the IANA name for C<EUC-CN>.
See L<Microsoft-related naming mess> for details.
C<GB_2312-80> I<raw> encoding is available as C<gb2312-raw>
with Encode. See L<Encode::CN> for details.
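To see how Encode itself treats these names, you can ask C<find_encoding> for the canonical name behind each one. This is a minimal sketch, assuming the default alias table shipped with Encode is in effect.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map each name used on the wire to the canonical name Encode uses.
my %canonical;
for my $name ('Shift_JIS', 'GB2312', 'EUC-CN', 'gb2312-raw') {
    my $enc = find_encoding($name);
    $canonical{$name} = $enc ? $enc->name : '(not found)';
    printf "%-12s => %s\n", $name, $canonical{$name};
}
```

With the default aliases, C<GB2312> and C<EUC-CN> should resolve to the same encoding object, while C<gb2312-raw> stays distinct.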
  EUC-CN
  KOI8-U        [RFC2319]
have not been registered with IANA (as of March 2002) but
seem to be supported by major web browsers.
The IANA name for C<EUC-CN> is C<GB2312>.
  KS_C_5601-1987
is heavily misused.
See L<Microsoft-related naming mess> for details.
C<KS_C_5601-1987> I<raw> encoding is available as C<ksc5601-raw>
with Encode. See L<Encode::KR> for details.
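The practical effect of the misuse can be sketched as follows, assuming Encode's default alias table and its bundled C<cp949> and C<euc-kr> encodings. The sample character U+AC02 is chosen here as one Hangul syllable that lies outside plain C<EUC-KR>.

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Ask Encode what the misused name actually resolves to.
my $enc      = find_encoding('KS_C_5601-1987');
my $resolved = $enc ? $enc->name : '(not found)';
print "KS_C_5601-1987 resolves to: $resolved\n";

# U+AC02 is a Hangul syllable outside the plain EUC-KR repertoire.
my $syllable = "\x{AC02}";
my $as_cp949 = encode('cp949', $syllable);

# With FB_CROAK, encode() dies on unmappable characters; work on a
# copy because a CHECK value may modify the source string.
my $fits_euc_kr = eval {
    my $copy = $syllable;
    encode('euc-kr', $copy, Encode::FB_CROAK());
    1;
} ? 1 : 0;

printf "cp949 bytes: %s\n", join ' ', map { sprintf('%02X', ord($_)) } split(//, $as_cp949);
print "fits in euc-kr: ", ($fits_euc_kr ? "yes" : "no"), "\n";
```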
  UTF-16 UTF-16BE UTF-16LE
are IANA-registered C<charset>s. See [RFC 2781] for details.
Jungshik Shin reports that UTF-16 with a BOM is well accepted
by MS IE 5/6 and NS 4/6. Beware however that
=over 2
=item *
C<UTF-16> support in any software you're going to be
using/interoperating with has probably been less tested
than C<UTF-8> support
=item *
C<UTF-8> coded data seamlessly passes traditional
command piping (C<cat>, C<more>, etc.) while C<UTF-16> coded
data is likely to cause confusion (with its zero bytes,
for example)
=item *
it is beyond the power of words to describe the way HTML browsers
encode non-C<ASCII> form data. To get a general impression, visit
L<http://www.alanflavell.org.uk/charset/form-i18n.html>.
While encoding of form data has stabilized for C<UTF-8> encoded pages
(at least IE 5/6, NS 6, and Opera 6 behave consistently), be sure to
expect fun (and cross-browser discrepancies) with C<UTF-16> encoded
pages!
=back
The rule of thumb is to use C<UTF-8> unless you know what
you're doing and unless you really benefit from using C<UTF-16>.
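The zero bytes and the BOM mentioned above are easy to see by encoding a short ASCII string in each form. This is a minimal sketch using only C<Encode::encode>; the sample text C<Hi> is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "Hi";

# Encode the same text under each name and keep the raw bytes.
my %bytes;
for my $name (qw(UTF-8 UTF-16 UTF-16BE UTF-16LE)) {
    $bytes{$name} = encode($name, $text);
    printf "%-9s %s\n", $name,
        join ' ', map { sprintf('%02X', ord($_)) } split(//, $bytes{$name});
}

# Count NUL bytes, which are what tends to confuse byte-oriented tools.
my $nul_count = () = $bytes{'UTF-16'} =~ /\x00/g;
print "NUL bytes in the UTF-16 form: $nul_count\n";
```

Plain C<UTF-16> is written with a leading BOM, while C<UTF-16BE> and C<UTF-16LE> fix the byte order and carry no BOM.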
  ISO-IR-165    [RFC1345]
  VISCII
  GB 12345
  GB 18030 (**)  (see links below)
  EUC-TW   (**)
are perfectly valid encodings but are not registered with IANA.
The names under which they are listed here are probably the
most widely known names for these encodings and are the
recommended ones.
  BIG5PLUS (**)
is a proprietary name.
=head2 Microsoft-related naming mess
Microsoft products misuse the following names:
=over 2
=item KS_C_5601-1987
Microsoft extension to C<EUC-KR>.
Proper names: C<CP949>, C<UHC>, C<x-windows-949> (as used by Mozilla).
See L<http://lists.w3.org/Archives/Public/ietf-charsets/2001AprJun/0033.html>