Lingua-NATools


lib/Lingua/NATools.pm


=head2 C<index_ngrams>

This method calculates ngrams (bigrams, trigrams and tetragrams) for
both languages and ALL chunks.

  $pcorpus->index_ngrams;


=head2 C<split_corpus_simple>

This method is called by the C<codify> method to split the corpora
into chunks. Note that it should be called whatever the number of
chunks, even when there is only one.

The method receives a hash reference with configuration values, and
the two text files with the text to be tokenized. The hash should
include, at least, the number of chunks and the chunk currently being
processed.

  $pcorpus->split_corpus_simple({tokenize => 0,
                                 verbose => 1,
                                 chunk => 1, nrchunks => 16},
                                    "/var/corpora/EuroParl.PT",
                                    "/var/corpora/EuroParl.EN");
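
Since the configuration hash carries both the total number of chunks
and the chunk being processed, a script that drives the split itself
needs one hash per chunk. The following is only a sketch:
C<chunk_configs> is a hypothetical helper, not part of this module,
and making one call per chunk is an assumption drawn from the
description above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::NATools): build one
# configuration hash per chunk, sharing the remaining options.
sub chunk_configs {
    my ($nrchunks, %common) = @_;
    die "nrchunks must be a positive integer\n"
        unless defined $nrchunks && $nrchunks =~ /^[1-9][0-9]*$/;
    # The leading '+' makes Perl parse the braces as an anonymous hash.
    return map { +{ %common, chunk => $_, nrchunks => $nrchunks } }
           1 .. $nrchunks;
}

my @configs = chunk_configs(16, tokenize => 0, verbose => 1);

# With a real corpus object, each hash could then be passed on
# (assumption: one call per chunk):
#   $pcorpus->split_corpus_simple($_, $pt_file, $en_file) for @configs;
printf "%d configurations, last chunk is %d of %d\n",
    scalar @configs, $configs[-1]{chunk}, $configs[-1]{nrchunks};
```

Building the hashes separately keeps the shared options in one place
and makes the chunk numbering easy to check before anything is run.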


=head2 C<run_initmat>

This method invokes the C program C<nat-initmat> for a specific
chunk. You must supply the number of an existing chunk. It
returns the time used to run the command.

  $pcorpus->run_initmat(3);


=head2 C<run_mat2dic>

This method invokes the C program C<nat-mat2dic> for a specific
chunk. You must supply the number of an existing chunk. It
returns the time used to run the command.

  $pcorpus->run_mat2dic(4);


=head2 C<run_post>

This method invokes the C program C<nat-postbin> for a specific
chunk. You must supply the number of an existing chunk. It
returns the time used to run the command.

  $pcorpus->run_post(5);
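
Because C<run_initmat>, C<run_mat2dic> and C<run_post> each return the
time used by the command, a caller can add those values up per step.
The sketch below assumes nothing about the real object: C<FakeCorpus>
is a stand-in that returns made-up timings, C<timed_step> is a
hypothetical helper, and the order of the calls is for illustration
only and is not meant to describe the real alignment order.

```perl
use strict;
use warnings;

# Stand-in for illustration only: a real script would use the
# Lingua::NATools object instead. The timings are made-up constants.
{
    package FakeCorpus;
    sub new         { return bless { calls => [] }, shift }
    sub run_initmat { my ($s, $c) = @_; push @{ $s->{calls} }, "initmat:$c"; return 2 }
    sub run_mat2dic { my ($s, $c) = @_; push @{ $s->{calls} }, "mat2dic:$c"; return 1 }
    sub run_post    { my ($s, $c) = @_; push @{ $s->{calls} }, "post:$c";    return 3 }
}

# Hypothetical helper: call one run_* method by name and add the time
# it returns to a per-step total.
sub timed_step {
    my ($corpus, $totals, $method, @args) = @_;
    my $elapsed = $corpus->$method(@args);
    $totals->{$method} += $elapsed;
    return $elapsed;
}

my $pcorpus = FakeCorpus->new;
my %totals;
for my $chunk (1 .. 3) {
    timed_step($pcorpus, \%totals, $_, $chunk)
        for qw(run_initmat run_mat2dic run_post);
}

# Report the accumulated time for each step.
printf "%-12s %s\n", $_, $totals{$_} for sort keys %totals;
```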

=head2 C<run_generic_EM>

This method invokes one of the three algorithms for Entropy
Maximization of the alignment matrix: C<nat-sampleA>, C<nat-sampleB>
and C<nat-ipfp>.

You should call the method with the name of the algorithm ("sampleA",
"sampleB" or "ipfp"), the number of iterations to be done, and the
chunk to be processed.

Returns the time used to run the command.

  $pcorpus->run_generic_EM("ipfp", 5, 3);
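
As the algorithm is selected by a plain string, a misspelled name is
easy to pass by accident. A minimal sketch of a guard follows:
C<run_em_checked> is a hypothetical wrapper and C<FakeCorpus> is a
stand-in object, neither being part of this module. Only the three
algorithm names come from the text above.

```perl
use strict;
use warnings;

# Algorithm names as listed in the documentation above.
my %KNOWN_ALGORITHM = map { $_ => 1 } qw(sampleA sampleB ipfp);

# Hypothetical wrapper: validate the arguments, then delegate to the
# object's run_generic_EM method and return whatever it returns.
sub run_em_checked {
    my ($corpus, $algorithm, $iterations, $chunk) = @_;
    die "unknown algorithm '$algorithm'\n"
        unless $KNOWN_ALGORITHM{$algorithm};
    for my $n ($iterations, $chunk) {
        die "expected a positive integer, got '$n'\n"
            unless $n =~ /^[1-9][0-9]*$/;
    }
    return $corpus->run_generic_EM($algorithm, $iterations, $chunk);
}

# Stand-in for illustration only; it just records its arguments.
{
    package FakeCorpus;
    sub new            { return bless {}, shift }
    sub run_generic_EM { my ($self, @args) = @_; $self->{last} = [@args]; return 0 }
}

my $pcorpus = FakeCorpus->new;
run_em_checked($pcorpus, "ipfp", 5, 3);

# A misspelled name is rejected before anything is run.
my $ok = eval { run_em_checked($pcorpus, "ipf", 5, 3); 1 };
print "rejected: $@" unless $ok;
```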

=head2 C<align_all>

This method will re-align all chunks in the corpora repository. It
will not re-encode them, just re-align.

  $pcorpus->align_all;


=head2 C<align_chunk>

This method will re-align a specific chunk in the corpora repository. It
will not re-encode it, just re-align.

You need to give a first argument with the chunk number to be aligned,
and an optional second argument stating whether you want verbose output.

  $pcorpus->align_chunk(3,0);


=head2 C<run_dict_add>

This method appends a chunk to the dictionaries of both languages (not
NATDicts). You must supply the number of an existing chunk. The
method should not be called directly. If you really need to, call it
for all chunks, one at a time, starting with the first.

  for (1..10) {
    $pcorpus->run_dict_add($_)
  }



=head2 C<make_dict>

This method creates the corpora dictionaries (not NATDicts). Call it
directly on the object, optionally passing an argument to force
verbose output. This method will call C<run_dict_add> for each chunk.

  $pcorpus->make_dict;


=head2 C<pre_chunk>

This function does the encoding for each created chunk. It is called
internally by the C<codify> method. You should call it with a hash
reference of options, the home directory for the parallel corpora
repository and the chunk identifier.

   pre_chunk({ ignore_case => 1}, "/var/corpora/EuroParl", 4);


=head2 C<dump_ptd>


