DTA-CAB

CAB/WebServiceHowto.pod  view on Meta::CPAN

The final particle "nach" is mis-tagged as a preposition (APPR vs. PTKVZ) by the unigram-based model, but this has
no effect on the lemma assigned.
Although use of the "norm1" analyzer does not alter any canonical modern forms in this example,
such cases are possible.

=cut

##========================================================================
## Analysis Chains: Expansion
=pod

=head3 Term Expansion

It is sometimes useful to have a list of all known orthographic variants of a given input form, e.g.
for runtime queries of a database which indexes only surface forms.  For such tasks, the analysis
chain "expand" can be used.  To see all the variants of the surface form "Elephant" in the
L<I<Deutsches Textarchiv>|http://www.deutschestextarchiv.de> corpus, for example, one could query
L<http://deutschestextarchiv.de/cab?a=expand&q=Elephant&tokenize=0>,
and expect a response something like:

 Elephant
 	+[moot/word] Elefant
 	+[moot/tag] NN
 	+[moot/lemma] Elefant
 	+[eqpho] Elephant <0>
 	+[eqpho] Elefant <14>
 	+[eqpho] elephant <17>
 	+[eqpho] elevant <17>
 	+[eqpho] Elephand <18>
 	+[eqpho] Elevant <18>
 	+[eqpho] elefant <18>
 	+[eqpho] Elephandt <19>
 	+[eqpho] Elephanth <19>
 	+[eqrw] Elefant <0>
 	+[eqrw] Elephant <0>
 	+[eqrw] Elephandt <8.44527626037598>
 	+[eqrw] elefant <8.44683265686035>
 	+[eqrw] Elephanth <8.70806312561035>
 	+[eqrw] elephant <9.01417255401611>
 	+[eqrw] Elephand <18.6624526977539>
 	+[eqrw] Eliphant <18.7045001983643>
 	+[eqrw] Elephants <21.1982593536377>
 	+[eqrw] elevant <21.3945064544678>
 	+[eqrw] Elphant <23.2134704589844>
 	+[eqrw] Elesant <27.7278366088867>
 	+[eqrw] Elephanta <30.2710800170898>
 	+[eqlemma] Elefannten <0>
 	+[eqlemma] Elefant <0>
 	+[eqlemma] Elefanten <0>
 	+[eqlemma] Elefantin <0>
 	+[eqlemma] Elefantine <0>
 	+[eqlemma] Elephandten <0>
 	+[eqlemma] Elephant <0>
 	+[eqlemma] Elephanten <0>
 	+[eqlemma] Elesant <0>
 	+[eqlemma] elefant <0>
 	+[eqlemma] elephanten <0>

Here, the "eqpho" attribute contains all surface forms recognized as phonetic variants of the query term,
"eqrw" contains those surface forms recognized as variants by the heuristic rewrite cascade,
and "eqlemma" contains the surface forms most likely to be mapped to the same modern lemma as the query term.
This online expansion strategy
is used by the L<DTA Query Lizard|http://kaskade.dwds.de/dstar/dta/lizard?q=Elephant>,
and was also used by an earlier version of the L<DTA corpus index|http://kaskade.dwds.de/dstar/dta/>
as described in
L<Jurish et al. (2014)|http://./#jtw2014>,
but has since been replaced there by an online lemmatization query using the "lemma" expander,
in conjunction with a direct query of the underlying corpus $Lemma index.
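
The text-format response shown above can be post-processed on the client side.
The following Python sketch groups the attribute lines by label for each token.
It assumes only the line layout visible in the sample (C<+[LABEL] TEXT> with an optional
trailing C<E<lt>WEIGHTE<gt>>); the helper name C<parse_expansion> is hypothetical and is not
part of DTA::CAB.

```python
import re

# Attribute lines look like "+[eqrw] Elephandt <8.44527626037598>";
# the trailing "<WEIGHT>" part is optional (e.g. "+[moot/tag] NN").
ATTR_LINE = re.compile(
    r"^\+\[(?P<label>[^\]]+)\]\s+(?P<text>.*?)(?:\s+<(?P<weight>[^<>]+)>)?$"
)


def parse_expansion(response_text):
    """Group attribute lines by label for each token (hypothetical helper)."""
    tokens = []
    for raw in response_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ATTR_LINE.match(line)
        if match is None:
            # Assumption: any other non-empty line starts a new token.
            tokens.append({"text": line, "attrs": {}})
            continue
        if not tokens:
            continue  # attribute line without a preceding token: ignored
        weight = match.group("weight")
        # Assumption: weights are numeric, as in the sample output.
        entry = (match.group("text"), float(weight) if weight is not None else None)
        tokens[-1]["attrs"].setdefault(match.group("label"), []).append(entry)
    return tokens


# A shortened excerpt of the sample response above.
sample = "\n".join([
    "Elephant",
    "\t+[moot/word] Elefant",
    "\t+[moot/tag] NN",
    "\t+[eqpho] Elephant <0>",
    "\t+[eqpho] Elefant <14>",
    "\t+[eqrw] Elephandt <8.44527626037598>",
])
parsed = parse_expansion(sample)

# Collect the distinct surface variants proposed by "eqpho" and "eqrw".
variants = sorted(
    {text for label in ("eqpho", "eqrw")
     for text, _ in parsed[0]["attrs"].get(label, [])}
)
```

The resulting C<variants> list could then be used to build a disjunctive query
against an index of surface forms.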

The request includes the C<tokenize=0> option,
which informs the CAB server that the query does not need to be tokenized, effectively forcing use of the
L<C<qd> parameter|DTA::CAB::HttpProtocol/Query Parameters> to the low-level service.  This is generally
a good idea when using single-token queries or pre-tokenized documents, since it speeds up processing.
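
As a minimal sketch, such a request URL can be assembled in Python as shown below.
Only the parameters C<a>, C<q>, and C<tokenize=0> are taken from the example query above;
the function name C<cab_query_url> and its C<pretokenized> argument are hypothetical.

```python
from urllib.parse import urlencode

# Base URL as it appears in the example query above.
CAB_BASE = "http://deutschestextarchiv.de/cab"


def cab_query_url(query, analyzer, pretokenized=False, base=CAB_BASE):
    """Build a CAB query URL (hypothetical helper, not part of DTA::CAB)."""
    params = [("a", analyzer), ("q", query)]
    if pretokenized:
        # Tell the server that the query does not need to be tokenized.
        params.append(("tokenize", "0"))
    return base + "?" + urlencode(params)


url = cab_query_url("Elephant", "expand", pretokenized=True)

# Fetching the response is left commented out, since it needs network access:
# from urllib.request import urlopen
# with urlopen(url) as response:
#     body = response.read().decode("utf-8")
```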

=cut

##========================================================================
## Analysis Chains: Date-optimized
=pod

=head3 Date-optimized Analysis

As of DTA::CAB v1.78, the L<DTA Dispatcher|DTA::CAB::Chain::DTA>
includes specialized L<rewrite models|DTA::CAB::Chain::DTA/rw>
(C<rw.1600-1700>, C<rw.1700-1800>, C<rw.1800-1900>),
and provides a number of L<high-level convenience chains|DTA::CAB::Chain::DTA/setupChains>
(C<norm.1600-1700>, C<norm1.1600-1700>, etc.) using these models
instead of the default "generic" rewrite cascade (C<rw>)
to provide canonicalization hypotheses for unknown words.
The weights for the specialized rewrite models
were trained on a modest number of manually assigned canonicalization pairs
from the period in question
extracted from the L<CabErrorDb|http://kaskade.dwds.de/caberr/>
error database (4,000-8,000 pair types per model), and may provide a slight
improvement in canonicalization accuracy with respect to the generic model,
provided that you specify the appropriate analysis chain ("Analyzer") in your request.
Compare for example the outputs of the various chains for the input forms
I<avf>, I<Auffichten>, and I<Büberchens>:

=over 4

=item *

L<generic|http://www.deutschestextarchiv.de/cab/?a=default1&q=avf%20Auffichten%20B%C3%BCberchens>

=item *

L<1600-1700|http://www.deutschestextarchiv.de/cab/?a=default1.1600-1700&q=avf%20Auffichten%20B%C3%BCberchens>

=item *

L<1700-1800|http://www.deutschestextarchiv.de/cab/?a=default1.1700-1800&q=avf%20Auffichten%20B%C3%BCberchens>

=item *

L<1800-1900|http://www.deutschestextarchiv.de/cab/?a=default1.1800-1900&q=avf%20Auffichten%20B%C3%BCberchens>

=back
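
When the publication year of a text is known, the appropriate chain can be selected programmatically.
The Python sketch below maps a year to one of the chain names used in the links above and builds the
corresponding request URL.  The helper names are hypothetical, and the treatment of boundary years
(e.g. assigning 1700 to C<1700-1800>) is an assumption, not documented behavior.

```python
from urllib.parse import quote, urlencode

# Date ranges for which the text above names specialized rewrite models.
DATED_RANGES = [(1600, 1700), (1700, 1800), (1800, 1900)]


def chain_for_year(year, base_chain="default1"):
    """Return a date-specific chain name, or the generic chain as a fallback."""
    for start, end in DATED_RANGES:
        # Assumption: ranges are treated as half-open intervals [start, end).
        if start <= year < end:
            return f"{base_chain}.{start}-{end}"
    return base_chain


def chain_url(year, query, base="http://www.deutschestextarchiv.de/cab/"):
    """Build a request URL for the chain matching the given year."""
    params = {"a": chain_for_year(year), "q": query}
    # quote_via=quote encodes spaces as %20, as in the links above.
    return base + "?" + urlencode(params, quote_via=quote)


url_1650 = chain_url(1650, "avf Auffichten Büberchens")
```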

=cut


##========================================================================
## Analysis Chains: Format Conversion
=pod

=head3 Format Conversion

The CAB server can be used to convert between various
supported L<IE<sol>O Formats|/IE<sol>O Formats>.  In this mode,
no analysis is performed on the input data
(with the exception of tokenization for raw untokenized input),
but the input document is parsed and re-formatted according to
the selected output format.  The analysis chain "null" can be
selected for such tasks.  To tokenize a simple text
string for instance, you can select the "null" analyzer and the
"text" format, and expect output such as
L<this|http://www.deutschestextarchiv.de/cab?a=null&q=EJn+zamer+Elephant+gillt+ohngef%C3%A4hr+zweyhundert+Thaler.+Ceterum+censeo+Carthaginem+esse+delendam.>.

This mode of operation is mostly useful in conjunction with
L<file upload queries|/A File Query> to convert analyzed files.
If you only need to tokenize raw text files, consider using
the more efficient L<WASTE tokenizer web-service|http://www.dwds.de/waste/>
directly,
or the CLARIN-D L<WebLicht|http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/> tool-chainer,
which offers a number of different tokenizer components.

=cut

Static type-wise analysis cache for
the L<attributes|/Analysis Attributes>
C<eqphox>, C<errid>, C<exlex>, C<f>, C<lts>, C<mlatin>, C<morph>, C<msafe>, C<rw>, C<xlit>, C<lang>, and C<pnd>
based on the most recent release of the L<I<Deutsches Textarchiv>|http://www.deutschestextarchiv.de> corpus,
typically less than one week old.

=item L<exlex|DTA::CAB::Analyzer::ExLex>

Type-wise exception lexicon extracted from the
L<DTA EvalCorpus|http://odo.dwds.de/~jurish/software/dtaec/>
and the DTA::CAB error database (demo L<here|http://kaskade.dwds.de/caberr/>),
typically updated weekly.

=item L<tokpp|DTA::CAB::Analyzer::TokPP>

Type-wise heuristic token preprocessor used to identify punctuation, numbers, quotes, etc.

=item L<xlit|DTA::CAB::Analyzer::Unicruft>

Deterministic type-wise character transliterator based on L<libunicruft|http://odo.dwds.de/~moocow/software/unicruft/>,
mostly useful for handling extinct characters and diacritics.

=item L<lts|DTA::CAB::Analyzer::LTS>

Deterministic type-wise phonetization ("letter-to-sound" mapping)
via L<Gfsm|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/> transducer
as described in L<Jurish (2012), Ch. 1|http://./#jurish2012>.


=item L<morph|DTA::CAB::Analyzer::Morph>

Type-wise morphological analysis of the (transliterated) surface form
via L<Gfsm|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/> transducer.
The default DTA analysis chain uses a modified version of the
L<TAGH|https://www.dwds.de/static/publications/text/Geyken_Hanneforth_fsmnlp.pdf>
morphology FST.

=item L<mlatin|DTA::CAB::Analyzer::Morph::Latin>

Type-wise Latin pseudo-morphology for (transliterated) surface forms
based on the finite word-list distributed with
the
"L<William Whitaker's Words|https://sourceforge.net/projects/wwwords/>"
Latin dictionary.

=item L<msafe|DTA::CAB::Analyzer::MorphSafe>

Heuristics for detecting "suspicious" analyses supplied
by the L</morph> component (L<TAGH|https://www.dwds.de/static/publications/text/Geyken_Hanneforth_fsmnlp.pdf>),
as described in L<Jurish (2012), App. A.4|http://./#jurish2012>.

=item L<langid|DTA::CAB::Analyzer::LangId::Simple>

Simple sentence-wise language guesser based on stopword lists
extracted from the Python L<NLTK project|http://www.nltk.org/>.
Also supports the pseudo-language C<XY>, which is typically assigned
for mathematical notation, abbreviations, or other extra-lexical material.

=item L<rw|DTA::CAB::Analyzer::Rewrite>

Type-wise I<k>-best weighted finite-state rewrite cascade conflator ("nearest neighbors")
via L<GfsmXL|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/#gfsmxl> transducer cascade,
as described in L<Jurish (2012), Ch. 2|http://./#jurish2012>.

=item L<eqphox|DTA::CAB::Analyzer::EqPhoX>

Type-wise phonetic equivalence conflator using a
L<GfsmXL|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/#gfsmxl> transducer cascade;
requires prior L</lts> analysis.
Unlike the presentation in L<Jurish (2012), Ch. 3|http://./#jurish2012>,
the current implementation uses a I<k>-best search strategy
over an infinite target language derived from the
L<TAGH|https://www.dwds.de/static/publications/text/Geyken_Hanneforth_fsmnlp.pdf>
morphology for improved recall.

=item L<dmoot|DTA::CAB::Analyzer::Moot::Boltzmann>

Sentence-wise conflation candidate disambiguator as described in
L<Jurish (2012), Ch. 4|http://./#jurish2012>.  Attempts to determine
the "best" modern form from the candidate conflations provided by
the L</exlex>, L</xlit>, L</eqphox>, and L</rw> components,
after consideration of the properties provided by the
L</morph>, L</msafe>, L</mlatin>, and L</langid> components
(e.g. sentences already identified as consisting primarily of
foreign-language material will B<not> be "forced" onto contemporary
German).

=item L<dmootsub|DTA::CAB::Analyzer::MootSub>

Sentence-wise post-processing for the L</dmoot> HMM.
Mostly useful for performing L<morphological analysis|/morph> on
non-trivial canonicalizations supplied by L</dmoot>.

=item L<moot|DTA::CAB::Analyzer::Moot>

Sentence-wise part-of-speech (PoS) tagging using
the L<moot|http://kaskade.dwds.de/~moocow/mirror/projects/moot/> tagger
on the observations (word forms) provided by L</dmoot> or the raw
input token text and the morphological ambiguity classes supplied
by L</dmootsub> or L</morph>.

=item L<mootsub|DTA::CAB::Analyzer::MootSub>

Sentence-wise post-processing for the L</moot> tagger.
Mostly useful for determining the "best" lemma for the canonical
word form (L</dmoot> or token text) and PoS-tag selected
by L</moot> from the set of canonical morphological analyses
(L</dmootsub> or L</morph>).

=back
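
The dependencies among these components can be summarized as a directed graph.
The Python sketch below encodes one reading of the prose descriptions above
(e.g. "requires prior lts analysis") and derives an execution order that respects them.
It is an illustration only: the graph is an assumption drawn from this list, not the
actual configuration of L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA>.

```python
from graphlib import TopologicalSorter  # Python 3.9 or later

# Component -> components whose output it consumes, as read from the
# descriptions above (an interpretation, not the real chain definition).
DEPENDS_ON = {
    "tokpp": set(),
    "exlex": set(),
    "xlit": set(),
    "lts": set(),
    "langid": set(),
    "rw": set(),
    "morph": {"xlit"},       # analyzes the transliterated surface form
    "mlatin": {"xlit"},      # likewise uses transliterated surface forms
    "msafe": {"morph"},      # checks analyses supplied by morph
    "eqphox": {"lts"},       # requires prior lts analysis
    "dmoot": {"exlex", "xlit", "eqphox", "rw",
              "morph", "msafe", "mlatin", "langid"},
    "dmootsub": {"dmoot"},   # post-processing for dmoot
    "moot": {"dmoot", "dmootsub", "morph"},
    "mootsub": {"moot", "dmoot", "dmootsub", "morph"},
}

# static_order() yields each component after all of its prerequisites.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
position = {name: index for index, name in enumerate(order)}
```

Several valid orders exist, so only the relative positions are meaningful:
for instance, C<lts> always precedes C<eqphox>, and C<mootsub> always comes last.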

=cut

##------------------------------------------------------
## Analysis Attributes
=pod

=head3 Analysis Attributes

This section describes the most common analysis attributes
used by the default L<DTA::CAB::Chain::DTA|DTA::CAB::Chain::DTA> configuration.
Each attribute is described by a template such as:

 data: $OBJ->{ATTR} = CODE
 text: +[LABEL] TEXT
 hidden: HIDDEN

where:


