DTA-CAB


CAB/Analyzer/Automaton/Gfsm/XL.pm  view on Meta::CPAN

			     );
  $aut->setLookupOptions($aut);
  return $aut;
}


## $aut = $aut->clear()
sub clear {
  my $aut = shift;

  $aut->{fst}->_cascade_set(undef) if ($aut->{fst});

  ##-- inherited
  $aut->SUPER::clear();
}

## $aut = $aut->resetProfilingData()
## - inherited

##--------------------------------------------------------------
## Methods: Lookup Options

CAB/Analyzer/Automaton/Gfsm/XL.pm  view on Meta::CPAN

## $class = $aut->fstClass()
##  + default FST class for loadFst() method
sub fstClass { return 'Gfsm::XL::Cascade'; }

## $class = $aut->labClass()
##  + default labels class for loadLabels() method
sub labClass { return 'Gfsm::Alphabet'; }

## $bool = $aut->fstOk()
##  + should return false iff fst is undefined or "empty"
sub fstOk { return defined($_[0]{fst}) && defined($_[0]{fst}->cascade) && $_[0]{fst}->cascade->depth>0; }

## $bool = $aut->labOk()
##  + should return false iff label-set is undefined or "empty"
##(inherited)

## $bool = $aut->dictOk()
##  + should return false iff dict is undefined or "empty"
##(inherited)


CAB/Analyzer/Automaton/Gfsm/XL.pm  view on Meta::CPAN



##--------------------------------------------------------------
## Methods: I/O: Input: Transducer

## $aut = $aut->loadCascade($cscfile)
## $aut = $aut->loadFst    ($cscfile)
*loadFst = \&loadCascade;
sub loadCascade {
  my ($aut,$cscfile) = @_;
  $aut->info("loading cascade file '$cscfile'");
  my $csc = Gfsm::XL::Cascade->new();
  if (!$csc->load($cscfile)) {
    $aut->logconfess("loadCascade(): load failed for '$cscfile': $!");
    return undef;
  }
  $aut->{fst} = Gfsm::XL::Cascade::Lookup->new($csc);
  $aut->setLookupOptions($aut);
  #$aut->{result} = Gfsm::Automaton->new($csc->semiring_type);  ##-- reset result automaton
  delete($aut->{_analyze});
  #print STDERR sprintf("loadCascade(): csc=0x%0.8x, cl=0x%0.8x\n", $$csc, ${$aut->{fst}}); ##-- DEBUG
  return $aut;
}

## $result = $aut->resultFst()
##  + returns empty result FST
sub resultFst {
  return Gfsm::Automaton->new($_[0]{fst}->cascade->semiring_type);
}


##--------------------------------------------------------------
## Methods: I/O: Input: Labels

## $aut = $aut->loadLabels($labfile)
## + inherited

## $aut = $aut->parseLabels()

CAB/Analyzer/Common.pm  view on Meta::CPAN

=item L<DTA::CAB::Analyzer::Automaton|DTA::CAB::Analyzer::Automaton>

Generic API for finite-state automaton analyzers.

=item L<DTA::CAB::Analyzer::Automaton::Gfsm|DTA::CAB::Analyzer::Automaton::Gfsm>

Finite-state analyzer base class using Gfsm for low-level automaton operations (lookup).

=item L<DTA::CAB::Analyzer::Automaton::Gfsm::XL|DTA::CAB::Analyzer::Automaton::Gfsm::XL>

Finite-state analyzer base class using Gfsm::XL for low-level automaton operations (k-best cascade lookup).



=item L<DTA::CAB::Analyzer::Dict|DTA::CAB::Analyzer::Dict>

Full-form dictionary-based analyzer (aka "cache") using a flat hash.

=item L<DTA::CAB::Analyzer::Dict::BDB|DTA::CAB::Analyzer::Dict::BDB>

Full-form dictionary-based analyzer (aka "cache") using Berkeley DB.

CAB/Analyzer/Common.pm  view on Meta::CPAN



=item L<DTA::CAB::Analyzer::Null|DTA::CAB::Analyzer::Null>

Null analyzer, for testing purposes.



=item L<DTA::CAB::Analyzer::Rewrite|DTA::CAB::Analyzer::Rewrite>

Error-correction (rewrite) analyzer using a Gfsm::XL cascade.

=item L<DTA::CAB::Analyzer::RewriteSub|DTA::CAB::Analyzer::RewriteSub>

Sub-analyzer for rewrite output.



=item L<DTA::CAB::Analyzer::TokPP|DTA::CAB::Analyzer::TokPP>

Type-level heuristic token preprocessor (for punctuation etc.)

CAB/Analyzer/EqPho/Cascade.pm  view on Meta::CPAN

## -*- Mode: CPerl -*-
##
## File: DTA::CAB::Analyzer::EqPho::Cascade.pm
## Author: Bryan Jurish <moocow@cpan.org>
## Description: phonetic equivalence via Gfsm::XL cascade

##==============================================================================
## Package: Analyzer::EqPho::Cascade
##==============================================================================
package DTA::CAB::Analyzer::EqPho::Cascade;
use DTA::CAB::Analyzer::Automaton::Gfsm::XL;
use Carp;
use strict;
our @ISA = qw(DTA::CAB::Analyzer::Automaton::Gfsm::XL);

CAB/Analyzer/EqPho/Cascade.pm  view on Meta::CPAN

__END__
##========================================================================
## POD DOCUMENTATION, auto-generated by podextract.perl, edited

##========================================================================
## NAME
=pod

=head1 NAME

DTA::CAB::Analyzer::EqPho::Cascade - phonetic equivalence expander via Gfsm::XL cascade

=cut

##========================================================================
## SYNOPSIS
=pod

=head1 SYNOPSIS

 ##========================================================================

CAB/Analyzer/EqPho/Cascade.pm  view on Meta::CPAN

=cut

##========================================================================
## DESCRIPTION
=pod

=head1 DESCRIPTION

DTA::CAB::Analyzer::EqPho::Cascade is a phonetic equivalence expander
conforming to the L<DTA::CAB::Analyzer|DTA::CAB::Analyzer> API which uses
a L<Gfsm::XL|Gfsm::XL> cascade to perform the actual expansion.
It inherits from
L<DTA::CAB::Analyzer::Automaton::Gfsm::XL|DTA::CAB::Analyzer::Automaton::Gfsm::XL>
and sets the following default parameters:

 analyzeDst => 'eqpho',
 wantAnalysisLo => 0,
 tolower => 1,
 ##
 ##-- analysis parameters
 max_weight => 1e38,

CAB/Analyzer/EqPhoX.pm  view on Meta::CPAN

## -*- Mode: CPerl -*-
##
## File: DTA::CAB::Analyzer::EqPhoX
## Author: Bryan Jurish <moocow@cpan.org>
## Description: phonetic-equivalence class expansion: intensional via gfsmxl cascade

##==============================================================================
## Package: Analyzer::EqPhoX
##==============================================================================
package DTA::CAB::Analyzer::EqPhoX;
use DTA::CAB::Analyzer::EqPho::Cascade;
use strict;
our @ISA = qw(DTA::CAB::Analyzer::EqPho::Cascade);

## $obj = CLASS_OR_OBJ->new(%args)

CAB/Analyzer/EqPhoX.pm  view on Meta::CPAN


##========================================================================
## POD DOCUMENTATION, auto-generated by podextract.perl, edited

##========================================================================
## NAME
=pod

=head1 NAME

DTA::CAB::Analyzer::EqPhoX - phonetic equivalence class expansion: intensional, via gfsmxl cascade

=cut

##========================================================================
## SYNOPSIS
=pod

=head1 SYNOPSIS

 ##========================================================================

CAB/Analyzer/Rewrite.pm  view on Meta::CPAN

## -*- Mode: CPerl -*-
##
## File: DTA::CAB::Analyzer::Rewrite.pm
## Author: Bryan Jurish <moocow@cpan.org>
## Description: rewrite analysis via Gfsm::XL cascade

##==============================================================================
## Package: Analyzer::Rewrite
##==============================================================================
package DTA::CAB::Analyzer::Rewrite;
use DTA::CAB::Analyzer ':child';
use DTA::CAB::Analyzer::Automaton::Gfsm::XL;
use Carp;
use strict;
our @ISA = qw(DTA::CAB::Analyzer::Automaton::Gfsm::XL);

CAB/Analyzer/Rewrite.pm  view on Meta::CPAN

__END__
##========================================================================
## POD DOCUMENTATION, auto-generated by podextract.perl, edited

##========================================================================
## NAME
=pod

=head1 NAME

DTA::CAB::Analyzer::Rewrite - rewrite analysis via Gfsm::XL cascade

=cut

##========================================================================
## SYNOPSIS
=pod

=head1 SYNOPSIS

 use DTA::CAB::Analyzer::Rewrite;

CAB/Chain.pm  view on Meta::CPAN

## -*- Mode: CPerl -*-
##
## File: DTA::CAB::Chain.pm
## Author: Bryan Jurish <moocow@cpan.org>
## Description: generic analyzer API: analyzer "chains" / "cascades" / "pipelines" / ...

package DTA::CAB::Chain;
use DTA::CAB::Analyzer;
use DTA::CAB::Datum ':all';
use Carp;
use strict;

##==============================================================================
## Globals
##==============================================================================

CAB/Chain.pm  view on Meta::CPAN

=cut

##========================================================================
## DESCRIPTION
=pod

=head1 DESCRIPTION

DTA::CAB::Chain
is an abstract L<DTA::CAB::Analyzer|DTA::CAB::Analyzer> subclass
for implementing serial document processing "pipelines" or "cascades"
in terms of a flat list of L<DTA::CAB::Analyzer|DTA::CAB::Analyzer> objects.
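
The "flat list of analyzers" idea can be illustrated with a minimal, self-contained sketch. It does not use the real L<DTA::CAB::Analyzer|DTA::CAB::Analyzer> API: C<make_analyzer()>, C<run_chain()>, and the token fields C<lower> and C<len> are hypothetical names introduced only for illustration.

```perl
use strict;
use warnings;

## Hypothetical stand-in for an analyzer: a name plus a code reference
## which annotates a token hash in place (NOT the real DTA::CAB::Analyzer API).
sub make_analyzer {
  my ($name,$code) = @_;
  return { name=>$name, code=>$code };
}

## A "chain" here is just a flat list of analyzers, applied in order.
sub run_chain {
  my ($chain,$tok) = @_;
  $_->{code}->($tok) foreach (@$chain);
  return $tok;
}

## The second analyzer depends on the output of the first,
## so the order of the list matters.
my @chain = (
  make_analyzer(lower => sub { $_[0]{lower} = lc($_[0]{text}) }),
  make_analyzer(len   => sub { $_[0]{len}   = length($_[0]{lower}) }),
);

my $tok = run_chain(\@chain, { text=>'Elephant' });
print join(' ', map {"$_=$tok->{$_}"} sort keys %$tok), "\n";
```

Because each analyzer only sees the token as left by its predecessors, reordering the list changes the result; this is the property that makes a serial chain a "cascade".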

=cut

##----------------------------------------------------------------
## DESCRIPTION: DTA::CAB::Chain: Constructors etc.
=pod

=head2 Constructors etc.

CAB/Chain/DTA.pm  view on Meta::CPAN

  $chains->{"norm1.hlgl.geo"} = [map {($_,$_ eq $ach->{mlatin} ? $ach->{mhessengeo} : qw())} @{$chains->{norm1}}];
  ##-- END TEMPORARY custom chain(s)

  ##-- date-dependent chains
  foreach my $rng (@RW_RANGES) {
    if ($ach->{"rw.$rng"} && ($ach->{"rw.$rng"}{enabled}//1)) {
      foreach my $key (qw(norm norm1 lemma lemma1 default default1 expand)) {
	$chains->{"$key.$rng"} = [map {$_ eq $ach->{rw} ? $ach->{"rw.$rng"} : $_} @{$chains->{$key}}];
      }
    } else {
      $ach->warn("optimized rewrite cascade rw.$rng not available: disabling derived chains for range $rng");
      delete $ach->{"rw.$rng"};
      delete $chains->{"sub.rw.$rng"};
    }
  }

  ##-- sanitize chains
  foreach (values %{$ach->{chains}}) {
    @$_ = grep {ref($_)} @$_;
  }

CAB/Chain/DTA.pm  view on Meta::CPAN

=cut

##========================================================================
## DESCRIPTION
=pod

=head1 DESCRIPTION

DTA::CAB::Chain::DTA
is the L<DTA::CAB::Analyzer|DTA::CAB::Analyzer> subclass implementing
the robust orthographic canonicalization cascade used in the
I<Deutsches Textarchiv> project.  This class inherits from
L<DTA::CAB::Chain::Multi|DTA::CAB::Chain::Multi>.
See the L</setupChains> method for a list of supported sub-chains
and the corresponding analyzers.

=cut

##----------------------------------------------------------------
## DESCRIPTION: DTA::CAB::Chain::DTA: Methods
=pod

CAB/Chain/DTA.pm  view on Meta::CPAN

Latin pseudo-morphology,
a L<DTA::CAB::Analyzer::Morph::Latin|DTA::CAB::Analyzer::Morph::Latin> object.

=item msafe

Morphological security heuristics,
a L<DTA::CAB::Analyzer::MorphSafe|DTA::CAB::Analyzer::MorphSafe> object.

=item rw

Weighted finite-state rewrite cascade,
a L<DTA::CAB::Analyzer::Rewrite|DTA::CAB::Analyzer::Rewrite> object.

Date-optimized variants C<rw.1600-1700>, C<rw.1700-1800>, and C<rw.1800-1900> may also be included.

=item rwsub

Post-processing for rewrite cascade,
a L<DTA::CAB::Analyzer::RewriteSub|DTA::CAB::Analyzer::RewriteSub> object.

=item eqphox

Intensional (TAGH-based) phonetic equivalence expander,
a L<DTA::CAB::Analyzer::EqPhoX|DTA::CAB::Analyzer::EqPhoX> object.

=item eqpho

Extensional (corpus-based) phonetic equivalence expander,

CAB/Chain/DTA.pm  view on Meta::CPAN

 'norm1'          =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw                  eqphox dmoot1 dmootsub moot1 mootsub)}],
 'ner'            =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw                  eqphox dmoot  dmootsub moot  mootsub ner)}],
 'caberr'         =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw                  eqphox dmoot  dmootsub moot  mootsub mapclass)}],
 'caberr1'        =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw                  eqphox dmoot1 dmootsub moot1 mootsub mapclass)}],
 'all'            =>[@$ach{qw(static exlex tokpp xlit lts morph mlatin msafe langid rw rwsub eqpho eqrw eqphox dmoot  dmootsub moot  mootsub eqlemma)}],
 'clean'          =>[@$ach{qw(clean)}],
 ##
 'null'           =>[$ach->{null}],

High-level date-optimized chains C<norm.RNG>, C<norm1.RNG>, C<lemma.RNG>, C<lemma1.RNG>, C<default.RNG>, C<default1.RNG>, and C<expand.RNG>
are also defined using the date-optimized rewrite cascade C<rw.RNG> in place of the default "generic" cascade C<rw>
for each range I<RNG> in C<1600-1700>, C<1700-1800>, and C<1800-1900>.
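
The substitution described above can be sketched in isolation as follows. This is a simplified illustration, not the actual setupChains() code: the analyzer objects are replaced by hypothetical plain hashes, only a single C<norm> chain is derived, and the C<enabled> check is omitted.

```perl
use strict;
use warnings;

## Hypothetical stand-ins for analyzer objects; real code would hold
## DTA::CAB::Analyzer instances here.
my %ach = (
  xlit           => { name=>'xlit' },
  rw             => { name=>'rw' },
  'rw.1600-1700' => { name=>'rw.1600-1700' },
  moot           => { name=>'moot' },
);
my %chains = ( norm => [ @ach{qw(xlit rw moot)} ] );

## Derive "norm.RNG" by swapping the generic rw analyzer for rw.RNG.
## References are compared by their stringified address (eq).
foreach my $rng (qw(1600-1700 1700-1800)) {
  my $rw_rng = $ach{"rw.$rng"};
  if ($rw_rng) {
    $chains{"norm.$rng"} = [ map { $_ eq $ach{rw} ? $rw_rng : $_ } @{$chains{norm}} ];
  } else {
    ## no optimized cascade for this range: no derived chain is created
    warn("rw.$rng not available: no derived chain for range $rng\n");
  }
}

foreach my $key (sort keys %chains) {
  print "$key: ", join(' ', map { $_->{name} } @{$chains{$key}}), "\n";
}
```

In this sketch the derived chain shares every analyzer except C<rw> with the generic chain, so only the rewrite model differs between C<norm> and C<norm.1600-1700>.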

=item ensureLoaded

 $bool = $ach->ensureLoaded();

Ensures analysis data is loaded from default files.
Inherited DTA::CAB::Chain::Multi override calls ensureChain() before inherited method.
Hack copies chain sub-analyzers (rwsub, dmootsub) AFTER loading their own sub-analyzers,
setting 'enabled' only then if appropriate.

CAB/WebServiceHowto.pod  view on Meta::CPAN

 	+[eqlemma] Elefantin <0>
 	+[eqlemma] Elefantine <0>
 	+[eqlemma] Elephandten <0>
 	+[eqlemma] Elephant <0>
 	+[eqlemma] Elephanten <0>
 	+[eqlemma] Elesant <0>
 	+[eqlemma] elefant <0>
 	+[eqlemma] elephanten <0>

Here, the "eqpho" attribute contains all surface forms recognized as phonetic variants of the query term,
"eqrw" contains those surface forms recognized as variants by the heuristic rewrite cascade,
and "eqlemma" contains the surface forms most likely to be mapped to the same modern lemma as the query term.
This online expansion strategy
is used by the L<DTA Query Lizard|http://kaskade.dwds.de/dstar/dta/lizard?q=Elephant>,
and was also used by an earlier version of the L<DTA corpus index|http://kaskade.dwds.de/dstar/dta/>
as described in
L<Jurish et al. (2014)|http://./#jtw2014>,
but has since been replaced there by an online lemmatization query using the "lemma" expander,
in conjunction with a direct query of the underlying corpus $Lemma index.

The request includes the C<tokenize=0> option,

CAB/WebServiceHowto.pod  view on Meta::CPAN

## Analysis Chains: Date-optimized
=pod

=head3 Date-optimized Analysis

As of DTA::CAB v1.78, the L<DTA Dispatcher|DTA::CAB::Chain::DTA>
includes specialized L<rewrite models|DTA::CAB::Chain::DTA/rw>
(C<rw.1600-1700>, C<rw.1700-1800>, C<rw.1800-1900>),
and provides a number of L<high-level convenience chains|DTA::CAB::Chain::DTA/setupChains>
(C<norm.1600-1700>, C<norm1.1600-1700>, etc.) using these models
instead of the default "generic" rewrite cascade (C<rw>)
to provide canonicalization hypotheses for unknown words.
The weights for the specialized rewrite models
were trained on a modest number of manually assigned canonicalization pairs
from the period in question
extracted from the L<CabErrorDb|http://kaskade.dwds.de/caberr/>
error database (4,000-8,000 pair types per model), and may provide a slight
improvement in canonicalization accuracy with respect to the generic model,
provided that you specify the appropriate analysis chain ("Analyzer") in your request.
Compare for example the outputs of the various chains for the input forms
I<avf>, I<Auffichten>, and I<Büberchens>:

CAB/WebServiceHowto.pod  view on Meta::CPAN


=item L<langid|DTA::CAB::Analyzer::LangId::Simple>

Simple sentence-wise language guesser based on stopword lists
extracted from the python L<NLTK project|http://www.nltk.org/>.
Also supports the pseudo-language C<XY>, which is typically assigned
for mathematical notation, abbreviations, or other extra-lexical material.

=item L<rw|DTA::CAB::Analyzer::Rewrite>

Type-wise I<k>-best weighted finite-state rewrite cascade conflator ("nearest neighbors")
via L<GfsmXL|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/#gfsmxl> transducer cascade,
as described in L<Jurish (2012), Ch. 2|http://./#jurish2012>.

=item L<eqphox|DTA::CAB::Analyzer::EqPhoX>

Type-wise phonetic equivalence conflator using a
L<GfsmXL|http://kaskade.dwds.de/~moocow/mirror/projects/gfsm/#gfsmxl> transducer cascade;
requires prior L</lts> analysis.
Unlike the presentation in L<Jurish (2012), Ch. 3|http://./#jurish2012>,
the current implementation uses a I<k>-best search strategy
over an infinite target language derived from the
L<TAGH|https://www.dwds.de/static/publications/text/Geyken_Hanneforth_fsmnlp.pdf>
morphology for improved recall.

=item L<dmoot|DTA::CAB::Analyzer::Moot::Boltzmann>

Sentence-wise conflation candidate disambiguator as described in

CAB/XmlRpcProtocol.pod  view on Meta::CPAN


=item token.eqphox

Array of strings representing the k-best phonetically equivalent word types known to
the underlying (intensional) lexicon.  Differs from L</token.eqpho> in the set from
which the phonetic equivalents are drawn.

=item token.rw

Array of analysis structs a la L</token.morph>, where weights
and analyses are determined by a (canonicalizing) rewrite cascade.
Each analysis struct may additionally have "lts" and/or "morph"
fields of its own, representing the respective analyses of the
rewrite I<target>.

=item token.eqrw

Array of analysis structs a la L</token.morph> representing the indexed word types
which are "rewrite-equivalent"
to the current token; i.e. which were rewritten to the same string as the
current token.
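
The notion of "rewrite-equivalence" can be sketched as a simple inversion of the type-to-target mapping. The word pairs below are illustrative only (not the output of a real cascade), the helper C<eqrw()> is a hypothetical name, and the sketch returns plain strings rather than the analysis structs described above.

```perl
use strict;
use warnings;

## Hypothetical rewrite results: surface word type => rewrite target.
my %rw_target = (
  Elephant   => 'Elefant',
  Elefant    => 'Elefant',
  Elephanten => 'Elefanten',
  Elefanten  => 'Elefanten',
);

## Invert the mapping: rewrite target => list of word types sharing it.
my %by_target;
push(@{$by_target{$rw_target{$_}}}, $_) foreach (sort keys %rw_target);

## Rewrite-equivalents of a type: all indexed types with the same target
## (including the type itself); empty list for unknown types.
sub eqrw {
  my $type   = shift;
  my $target = $rw_target{$type};
  return defined($target) ? @{$by_target{$target}} : ();
}

print join(' ', eqrw('Elephant')), "\n";
```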

CAB/index.pod  view on Meta::CPAN

=item L<DTA::CAB::Analyzer::EqPho::BDB|DTA::CAB::Analyzer::EqPho::BDB>

DB dictionary-based phonetic equivalence expander

=item L<DTA::CAB::Analyzer::EqPho::CDB|DTA::CAB::Analyzer::EqPho::CDB>

DB dictionary-based phonetic equivalence expander

=item L<DTA::CAB::Analyzer::EqPho::Cascade|DTA::CAB::Analyzer::EqPho::Cascade>

phonetic equivalence expander via Gfsm::XL cascade

=item L<DTA::CAB::Analyzer::EqPho::Dict|DTA::CAB::Analyzer::EqPho::Dict>

dictionary-based phonetic form expander

=item L<DTA::CAB::Analyzer::EqPho::FST|DTA::CAB::Analyzer::EqPho::FST>

FST-based phonetic form expander

=item L<DTA::CAB::Analyzer::EqPho::JsonCDB|DTA::CAB::Analyzer::EqPho::JsonCDB>

Json-valued CDB dictionary-based phonetic equivalence expander

=item L<DTA::CAB::Analyzer::EqPhoX|DTA::CAB::Analyzer::EqPhoX>

phonetic equivalence class expansion: intensional, via gfsmxl cascade

=item L<DTA::CAB::Analyzer::EqRW|DTA::CAB::Analyzer::EqRW>

rewrite-equivalence class expander: default

=item L<DTA::CAB::Analyzer::EqRW::BDB|DTA::CAB::Analyzer::EqRW::BDB>

DB dictionary-based rewrite-equivalence expander

=item L<DTA::CAB::Analyzer::EqRW::CDB|DTA::CAB::Analyzer::EqRW::CDB>

CAB/index.pod  view on Meta::CPAN

=item L<DTA::CAB::Analyzer::Phonem|DTA::CAB::Analyzer::Phonem>

phonetic digest analysis using Text::Phonetic::Phonem

=item L<DTA::CAB::Analyzer::Phonix|DTA::CAB::Analyzer::Phonix>

phonetic digest analysis using Text::Phonetic::Phonix

=item L<DTA::CAB::Analyzer::Rewrite|DTA::CAB::Analyzer::Rewrite>

rewrite analysis via Gfsm::XL cascade

=item L<DTA::CAB::Analyzer::RewriteSub|DTA::CAB::Analyzer::RewriteSub>

sub-analysis (LTS, Morph) of rewrite targets

=item L<DTA::CAB::Analyzer::Soundex|DTA::CAB::Analyzer::Soundex>

phonetic digest analysis using Text::Phonetic::Soundex

=item L<DTA::CAB::Analyzer::SynCoPe|DTA::CAB::Analyzer::SynCoPe>

Changes  view on Meta::CPAN

	* added EqRW.pm, EqRW/Dict.pm
	* moved Dict::EqRW -> EqRW::Dict
	* fixed latin-1/utf-8 bug in CAB::Analyzer::Automaton

v0.12 2009-08-06 11:29  moocow
	* equiv-expander work
	  - TODO: get eqrw working via FST

v0.11 2009-08-03 14:26  moocow
	* removed eqpho-dict
	  - TODO: get eqrw working with 1-sided FST (explicit cascade direct from token-stored rw output)
	* added EqPho/FST.pm
	  - updated Analyzer::Automaton for non-deterministic analysis
	  - e.g. split Text->Pho and Pho->EqText into 2 FST analyzers
	* updated dta-eqrw.dict (after additional punishments for 'hülfe' in target lg)
	* more rewrite-equivalence class testing
	  + got integrated in DTA::CAB class, server config, etc.
	  + got dictionary building
	  + found some more data-type bugs (tagh, rewrite, msafe, ...):
	    - hülfe -> helf~en ... [subjII] : see misc/notes/*
	  + found more tokenizer problems/bugs: see misc/notes/tokenizer.txt

Changes  view on Meta::CPAN

	* added system/resources/Makefile rules to generate rewrite-equivalence dictionary for use with Dict::EqClass
	* initial tests seem to work well

v0.09 2009-07-24 14:34  moocow
	* dictionary/cache updates

v0.08 2009-07-23 14:34  moocow
	* removed stale old-format cache files
	* added cache-generation to resources Makefile
	* moved EqClass, LatinDict to Dict:: namespace
	* added EqPho analyzer via Gfsm::XL cascade
	  - loads quicker, runs slower, still maybe some buglets
	* updated rewrite dict with better upper/lower case heuristics

v0.07 2009-07-03 13:42  moocow
	* added linear-function max_weight computation for Gfsm::XL (rewrite) cascades

v0.0602 2009-07-03 13:39  moocow
	* updated system/cab.plm to use new rewrite FST, dict
	* updated dta-rw.dict
	* added -log-config option to dta-cab-analyze.perl
	* added cab-server-nodict.plm: useful for testing e.g. rewrite cascade w/o exception lexicon
	* MorphSafe back-changes: ITJ is unsafe
	* minor MorphSafe changes, new rw dict

v0.0601 2009-06-26 14:28  moocow
	* added dta-rw.dict, updated MorphSafe
	* added dta-rw.dict: extracted from grimm/wm-eval data
	* updated resource makefile
	* added symlink taxi-resources
	* Morph/Latin uses tolower=>1


