Encode-Arabic
lib/Encode/Arabic/ArabTeX.pm
map {
"\x{0644}" . $double . $vowel . "\x{0627}" . $_,
"\x{0644}" . "\x{0671}" . $double . $vowel,
} "\x{064E}", "\x{064F}", "\x{0650}"
} "\x{064E}", "\x{064F}", "\x{0650}"
} "", "\x{0651}"
) : () ),
),
# optional ligatures to enforce here
];
no strict 'refs';
${ $cls . '::decoder' }->[$mode + $delevel] = Encode::Mapper->compile(@{$demoder->[$mode]});
${ $cls . '::decoder' }->[$mode + $delevel]->describe('') if $option{'describe'};
return ${ $cls . '::decoder' }->[$mode + $delevel];
}
1;
__END__
=head1 NAME
Encode::Arabic::ArabTeX - Interpreter of the ArabTeX notation of Arabic
=head1 SYNOPSIS
use Encode::Arabic::ArabTeX; # imports just like 'use Encode' would, plus extended options
while ($line = <>) { # maps the ArabTeX notation for Arabic into the Arabic script
    print encode 'utf8', decode 'arabtex', $line; # 'ArabTeX' alias 'Lagally' alias 'TeX'
}
# ArabTeX lower ASCII transliteration <--> Arabic script in Perl's internal format
$string = decode 'ArabTeX', $octets;
$octets = encode 'ArabTeX', $string;
Encode::Arabic::ArabTeX->encoder('dump' => '!./encoder.code'); # dump the encoder engine to file
Encode::Arabic::ArabTeX->decoder('load'); # load the decoder engine from module's extra sources
=head1 DESCRIPTION
ArabTeX is an excellent extension to TeX/LaTeX designed for typesetting the right-to-left scripts of
the Orient. It provides very intuitive and comprehensible lower ASCII transliterations, whose
expressive power exceeds even that of the scripts themselves.
L<Encode::Arabic::ArabTeX|Encode::Arabic::ArabTeX> implements the rules needed for proper interpretation
of the ArabTeX notation of Arabic. The conversion itself is done by L<Encode::Mapper|Encode::Mapper>, and
the user interface is built on the L<Encode::Encoding|Encode::Encoding> module.
=head2 ENCODING BUSINESS
Since the ArabTeX notation is not a simple mapping to the graphemes of the Arabic script, encoding the script
into the notation is ambiguous. Two different strings in the notation may correspond to identical strings in
the script. Heuristics must be engaged to decide which of the representations is more appropriate.
Beyond this bottleneck, the result of encoding may not decode back to exactly the original script, due to
over-generation or approximations in the encoding algorithm.
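The ambiguity can be seen in miniature with a few of the letter pairs this documentation lists under grapheme generation, where both C<< <T> >> and C<< <H> >> map to U+0629. The following is only a sketch under stated assumptions: C<%toy_table>, C<toy_decode> and C<%toy_encode> are hypothetical names invented for illustration, and the per-character lookup is a stand-in, not the L<Encode::Mapper|Encode::Mapper> engine the module actually compiles.

```perl
use strict;
use warnings;

# Toy table: four of the pairs listed in this documentation.
# This is NOT the module's decoder, only an illustration.
my %toy_table = (
    'T' => "\x{0629}",    # ta' marbu.ta
    'H' => "\x{0629}",    # ta' marbu.ta silent
    'p' => "\x{067E}",    # pa'
    'g' => "\x{06AF}",    # gaf
);

# Hypothetical helper: map each known character, pass the others through.
sub toy_decode {
    my ($notation) = @_;
    return join '',
        map { exists $toy_table{$_} ? $toy_table{$_} : $_ }
        split //, $notation;
}

# Invert the table. A grapheme with more than one candidate notation
# is a point where an encoder has to pick one by some heuristic.
my %toy_encode;
push @{ $toy_encode{ $toy_table{$_} } }, $_ for sort keys %toy_table;

my @candidates = @{ $toy_encode{"\x{0629}"} };
print scalar(@candidates), " candidate notations for U+0629: @candidates\n";
```

Decoding is a function here, but its inverse is not: whichever of the two candidates an encoder picks for U+0629, decoding the choice still yields the same grapheme, so the original notation cannot be recovered from the script alone.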
There are situations where conversion from the Arabic script to the ArabTeX notation is still convenient and
useful. Imagine you need to edit the data, enhance it with vowels or other diacritical marks, produce phonetic
transcripts, or adjust the typography of the script. Do it in the ArabTeX notation, and you keep unrivalled
control over the result!
Nonetheless, encoding is not the main purpose of this module ;)
=head2 DECODING BUSINESS
The module decodes the ArabTeX notation as defined in the User Manual Version 4.00 of March 11, 2004,
L<ftp://ftp.informatik.uni-stuttgart.de/pub/arabtex/doc/arabdoc.pdf>. The implementation uses three levels
of L<Encode::Mapper|Encode::Mapper> engines to solve the problem:
=over
=item I<Hamza> writing
I<Hamza> carriers are determined from the context in accordance with the Arabic orthographical conventions.
The first level of mapping expands every C<< <'> >> into the verbatim encoding of the relevant carrier.
This level of processing can become optional, if people ever need to encode the I<hamza> carriers explicitly.
Interpretation of geminated I<hamza> C<< <''> >> is B<correct> here, as opposed to ArabTeX itself. In order to
deduce the proper spelling rules, we resorted to L<http://www.arabic-morphology.com/> and experimented with words
like C<< <ra''asa> >>, C<< <ru''isa> >>, C<< <tara''usuN> >>, etc.
On this level, word-internal occurrences of C<< <T> >> get translated into C<< <t> >>, which is an extension
to the notation that simplifies some requirements in modeling Arabic morphology.
=item Grapheme generation
The core level includes most of the rules needed, and converts the ArabTeX notation to Arabic graphemes in
Unicode. The engine recognizes all the consonants of Modern Standard Arabic, plus the following letters:
[ "|", "" ], # invisible consonant
[ "B", "\x{0640}" ], # consonantal ta.twil
[ "T", "\x{0629}" ], # ta' marbu.ta
[ "H", "\x{0629}" ], # ta' marbu.ta silent
[ "p", "\x{067E}" ], # pa'
[ "v", "\x{06A4}" ], # va'
[ "g", "\x{06AF}" ], # gaf