DTA-CAB
CAB/WebServiceHowto.pod
=head3 Analysis Components
This section describes the atomic analysis components
provided by the default L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA> configuration.
=over 4
=item L<static|DTA::CAB::Analyzer::Cache::Static>
Static type-wise analysis cache for
the L<attributes|/Analysis Attributes>
C<eqphox>, C<errid>, C<exlex>, C<f>, C<lts>, C<mlatin>, C<morph>, C<msafe>, C<rw>, C<xlit>, C<lang>, and C<pnd>
based on the most recent release of the L<I<Deutsches Textarchiv>|http://www.deutschestextarchiv.de> corpus,
typically less than one week old.
=item L<exlex|DTA::CAB::Analyzer::ExLex>
Type-wise exception lexicon extracted from the
L<DTA EvalCorpus|http://odo.dwds.de/~jurish/software/dtaec/>
and the DTA::CAB error database (demo L<here|http://kaskade.dwds.de/caberr/>),
typically updated weekly.
=item L<tokpp|DTA::CAB::Analyzer::TokPP>
Type-wise heuristic token preprocessor used to identify punctuation, numbers, quotes, etc.
=item L<xlit|DTA::CAB::Analyzer::Unicruft>
Deterministic type-wise character transliterator based on L<libunicruft|http://odo.dwds.de/~moocow/software/unicruft/>,
mostly useful for handling extinct characters and diacritics.
=item L<lts|DTA::CAB::Analyzer::LTS>
Deterministic type-wise phonetization ("letter-to-sound" mapping)
via L<Gfsm|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/> transducer
as described in L<Jurish (2012), Ch. 1|http://./#jurish2012>.
=item L<morph|DTA::CAB::Analyzer::Morph>
Type-wise morphological analysis of the (transliterated) surface form
via L<Gfsm|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/> transducer.
The default DTA analysis chain uses a modified version of the
L<TAGH|https://www.dwds.de/static/publications/text/Geyken_Hanneforth_fsmnlp.pdf>
morphology FST.
=item L<mlatin|DTA::CAB::Analyzer::Morph::Latin>
Type-wise Latin pseudo-morphology for (transliterated) surface forms
based on the finite word-list distributed with
the
"L<William Whitaker's Words|https://sourceforge.net/projects/wwwords/>"
Latin dictionary.
=item L<msafe|DTA::CAB::Analyzer::MorphSafe>
Heuristics for detecting "suspicious" analyses supplied
by the L</morph> component (L<TAGH|https://www.dwds.de/static/publications/text/Geyken_Hanneforth_fsmnlp.pdf>),
as described in L<Jurish (2012), App. A.4|http://./#jurish2012>.
=item L<langid|DTA::CAB::Analyzer::LangId::Simple>
Simple sentence-wise language guesser based on stopword lists
extracted from the Python L<NLTK project|http://www.nltk.org/>.
Also supports the pseudo-language C<XY>, which is typically assigned
for mathematical notation, abbreviations, or other extra-lexical material.
=item L<rw|DTA::CAB::Analyzer::Rewrite>
Type-wise I<k>-best weighted finite-state rewrite cascade conflator ("nearest neighbors")
via L<GfsmXL|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/#gfsmxl> transducer cascade,
as described in L<Jurish (2012), Ch. 2|http://./#jurish2012>.
=item L<eqphox|DTA::CAB::Analyzer::EqPhoX>
Type-wise phonetic equivalence conflator using a
L<GfsmXL|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/#gfsmxl> transducer cascade;
requires prior L</lts> analysis.
Unlike the presentation in L<Jurish (2012), Ch. 3|http://./#jurish2012>,
the current implementation uses a I<k>-best search strategy
over an infinite target language derived from the
L<TAGH|https://www.dwds.de/static/publications/text/Geyken_Hanneforth_fsmnlp.pdf>
morphology for improved recall.
=item L<dmoot|DTA::CAB::Analyzer::Moot::Boltzmann>
Sentence-wise conflation candidate disambiguator as described in
L<Jurish (2012), Ch. 4|http://./#jurish2012>. Attempts to determine
the "best" modern form from the candidate conflations provided by
the L</exlex>, L</xlit>, L</eqphox>, and L</rw> components,
after consideration of the properties provided by the
L</morph>, L</msafe>, L</mlatin>, and L</langid> components
(e.g. sentences already identified as consisting primarily of
foreign-language material will B<not> be "forced" onto contemporary
German).
=item L<dmootsub|DTA::CAB::Analyzer::MootSub>
Sentence-wise post-processing for the L</dmoot> HMM.
Mostly useful for performing L<morphological analysis|/morph> on
non-trivial canonicalizations supplied by L</dmoot>.
=item L<moot|DTA::CAB::Analyzer::Moot>
Sentence-wise part-of-speech (PoS) tagging using
the L<moot|http://kaskade.dwds.de/~moocow/mirror/projects/moot/> tagger
on the observations (word forms) provided by L</dmoot> or the raw
input token text and the morphological ambiguity classes supplied
by L</dmootsub> or L</morph>.
=item L<mootsub|DTA::CAB::Analyzer::MootSub>
Sentence-wise post-processing for the L</moot> tagger.
Mostly useful for determining the "best" lemma for the canonical
word form (L</dmoot> or token text) and PoS-tag selected
by L</moot> from the set of canonical morphological analyses
(L</dmootsub> or L</morph>).
=back
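The dependencies mentioned in the component descriptions above can be made
explicit with a small stand-alone Perl sketch. It does not use the DTA::CAB API
at all: the C<%requires> table and the C<visit()> and C<analysis_order()> helpers
are hypothetical names introduced here for illustration only. The table encodes
one reading of the relations stated in the text (e.g. L</eqphox> requires
L</lts>, and L</morph> operates on the transliterated form supplied by
L</xlit>); it is not the actual configuration of DTA::CAB::Chain::DTA.

```perl
use strict;
use warnings;

# Hypothetical dependency table: component => components whose output it reads.
# Derived from the prose descriptions only; NOT taken from the DTA::CAB sources.
my %requires = (
  static   => [],
  exlex    => [],
  tokpp    => [],
  xlit     => [],
  lts      => [],
  morph    => [qw(xlit)],
  mlatin   => [qw(xlit)],
  msafe    => [qw(morph)],
  langid   => [],
  rw       => [],
  eqphox   => [qw(lts)],
  dmoot    => [qw(exlex xlit eqphox rw morph msafe mlatin langid)],
  dmootsub => [qw(dmoot)],
  moot     => [qw(dmoot dmootsub morph)],
  mootsub  => [qw(moot)],
);

# Depth-first visit: emit a component only after everything it requires.
sub visit {
  my ($name, $deps, $state, $order) = @_;
  my $st = $state->{$name} // 0;    # 0 = unseen, 1 = in progress, 2 = done
  return if $st == 2;
  die "cyclic dependency at '$name'\n" if $st == 1;
  die "unknown component '$name'\n" unless exists $deps->{$name};
  $state->{$name} = 1;
  visit($_, $deps, $state, $order) for @{ $deps->{$name} };
  $state->{$name} = 2;
  push @$order, $name;
  return;
}

# Return one valid processing order (a topological sort of the table).
sub analysis_order {
  my ($deps) = @_;
  my (%state, @order);
  visit($_, $deps, \%state, \@order) for sort keys %$deps;
  return @order;
}

my @order = analysis_order(\%requires);
print join(' ', @order), "\n";
```

Any order produced this way places each component after the components it
draws on, which is the only property the sketch is meant to show; the real
chain may run its analyzers in a different but equally valid sequence.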
=cut
##------------------------------------------------------
## Analysis Attributes
=pod
=head3 Analysis Attributes
This section describes the most common analysis attributes
used by the default L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA> configuration.
Each attribute is described by a template such as:
 data: $OBJ->{ATTR} = CODE
 text: +[LABEL] TEXT
 hidden: HIDDEN
where:
=over 4
=item *
C<$OBJ-E<gt>{ATTR} = CODE> is Perl notation for the underlying data-structure of the attribute.
C<$OBJ> is one of C<$w>, C<$s>, or C<$doc> to indicate
a
L<token-|DTA::CAB::Token>,
L<sentence-|DTA::CAB::Sentence>,
or L<document-|DTA::CAB::Document>-level attribute, respectively.
If unspecified, C<ATTR> is identical to the attribute name itself,
and C<CODE> is a simple string containing the atomic attribute value.
=item *