- add the possibility of case-sensitive matching, optionally
depending on the length of the term
0.7 - tagging of the term with the lemmatised form of its words
0.6 - Correction in Build.PL
- * was forgotten in the quote meta character function
- Correction in the documentation
- A few meta characters are now unspecialised (escaped) in the terms
- the returned term form is the matching form according to the regex
- Optimization of the term selection
0.5 - Add '/' character to the regex frontier during final term tagging
- Addition of the method print_corpus_index to print on STDERR
the corpus index
- Correction in the step for selecting potentially appearing
terms in the corpus (some terms were missing)
- Additional information such as semantic tags associated with terms
(set from the third column) is now managed
- Add the LICENSE file
0.4 - Improvement in displaying the tagging process
- Documentation of the modules and scripts is gathered at the
end of each file.
- Addition of a Build.PL file
- Addition of examples
bin/TermTagger-brat.pl view on Meta::CPAN
# URL : https://perso.limsi.fr/hamon/
#
########################################################################
=head1 NAME
TermTagger-brat.pl -- A Perl script for tagging text with terms (Brat format output)
=head1 SYNOPSIS
TermTagger-brat.pl [options] corpus termlist selected_term_list lemmatised_corpus
=head1 OPTIONS
=over 4
=item B<--help> brief help message
=back
=head1 DESCRIPTION
This script tags a corpus with terms and provides an output compatible with Brat (<http://brat.nlplab.org/>). Corpus (C<corpus>) is a file
with one sentence per line. Term list (C<termlist>) is a file
containing one term per line. For each term, additional information
(such as the canonical form) can be given after a column. Each line of the
output file (C<selected_term_list>) contains the sentence number, the
term and additional information, all separated by a tabulation character.
=head1 EXAMPLES
Tag the textual corpus in C<corpus-test.txt> with terms in the file
C<termlist-test.lst> and record the results in the file
C<corpus-test.ann> according to the Brat input format:
TermTagger-brat.pl corpus-test.txt termlist-test.lst corpus-test.ann
bin/TermTagger.pl view on Meta::CPAN
# URL : https://perso.limsi.fr/hamon/
#
########################################################################
=head1 NAME
TermTagger.pl -- A Perl script for tagging a corpus with terms
=head1 SYNOPSIS
TermTagger.pl [options] corpus termlist selected_term_list lemmatised_corpus
=head1 OPTIONS
=over 4
=item B<--help> brief help message
=back
=head1 DESCRIPTION
This script tags a corpus with terms. Corpus (C<corpus>) is a file
with one sentence per line. Term list (C<termlist>) is a file
containing one term per line. For each term, additional information
(such as the canonical form) can be given after a column. Each line of the
output file (C<selected_term_list>) contains the sentence number, the
term, additional information, all separated by a tabulation character.
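As a rough illustration of the output layout described above, the following sketch splits such tab-separated lines back into their fields. The sample lines and the C<parse_output_line> helper are made up for this example; they are not part of the distribution.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Alvis::TermTagger): split one output
# line into the sentence number, the term, and any additional fields.
sub parse_output_line {
    my ($line) = @_;
    chomp $line;
    my ($sent_id, $term, @info) = split /\t/, $line;
    return { sent_id => $sent_id, term => $term, info => \@info };
}

# Made-up sample lines in the documented tab-separated layout.
my @sample = (
    "1\tBacillus subtilis\tBacillus subtilis\n",
    "1\tgenes\tgene\n",
);

my @records = map { parse_output_line($_) } @sample;
foreach my $rec (@records) {
    print "sentence $rec->{sent_id}: $rec->{term} (",
        join(", ", @{ $rec->{info} }), ")\n";
}
```

Keeping the additional fields in an array avoids assuming how many columns the term list provides.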
=head1 SEE ALSO
Alvis web site: http://www.alvis.info
=head1 AUTHORS
Thierry Hamon <thierry.hamon@limsi.fr>
examples/load_and_term-matching view on Meta::CPAN
#!/usr/bin/perl -w
use strict;
use warnings;
require Alvis::TermTagger;
my $corpus = "corpus";
my $termlist = "term+lem+semtaglist";
my $selected_term_list = "selected-terms";
Alvis::TermTagger::termtagging($corpus, $termlist, $selected_term_list);
warn "List of the selected terms in $selected_term_list\n";
examples/load_and_term-matching_hash view on Meta::CPAN
use strict;
use warnings;
require Alvis::TermTagger;
my $corpus = "corpus";
my $lemmatised_corpus = "lemmatised-corpus";
my $termlist = "term+lem+semtaglist+wordlemma";
my %selected_term_list;
my $term;
my @term_list;
my %term_listIdx;
my @regex_term_list;
my @regex_lemmawordterm_list;
my %corpus;
my %lc_corpus;
my %lemmatised_corpus;
my %lc_lemmatised_corpus;
my %corpus_index;
my %lemmatised_corpus_index;
my %idtrm_select;
my %idlemtrm_select;
my $CS = undef;
Alvis::TermTagger::load_TermList($termlist,\@term_list,\%term_listIdx);
Alvis::TermTagger::get_Regex_TermList(\@term_list, \@regex_term_list, \@regex_lemmawordterm_list);
Alvis::TermTagger::load_Corpus($corpus,\%corpus, \%lc_corpus);
Alvis::TermTagger::load_Corpus($lemmatised_corpus,\%lemmatised_corpus, \%lc_lemmatised_corpus);
Alvis::TermTagger::corpus_Indexing(\%lc_corpus, \%corpus, \%corpus_index, $CS);
Alvis::TermTagger::corpus_Indexing(\%lc_lemmatised_corpus, \%lemmatised_corpus, \%lemmatised_corpus_index, $CS);
Alvis::TermTagger::term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $CS);
Alvis::TermTagger::term_Selection(\%lemmatised_corpus_index, \@term_list, \%idlemtrm_select, $CS);
Alvis::TermTagger::term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \%selected_term_list, $CS);
Alvis::TermTagger::term_tagging_offset_tab(\@term_list, \@regex_lemmawordterm_list, \%idlemtrm_select, \%lemmatised_corpus, \%selected_term_list, $CS);
foreach $term (keys %selected_term_list) {
# print "$term\t" . join("\t", @{$selected_term_list{$term}}) . "\n";
print join("\t", @{$selected_term_list{$term}}) . "\n";
}
examples/load_and_term-matching_tab view on Meta::CPAN
use strict;
use warnings;
require Alvis::TermTagger;
my $corpus = "corpus";
my $lemmatised_corpus = "lemmatised-corpus";
my $termlist = "term+lem+semtaglist+wordlemma";
my @selected_term_list;
my $term;
my %term_listIdx;
my @term_list;
my @regex_term_list;
my @regex_lemmawordterm_list;
my %corpus;
my %lc_corpus;
my %lemmatised_corpus;
my %lc_lemmatised_corpus;
my %corpus_index;
my %lemmatised_corpus_index;
my %idtrm_select;
my %idlemtrm_select;
my $CS = 3;
Alvis::TermTagger::load_TermList($termlist, \@term_list, \%term_listIdx);
Alvis::TermTagger::get_Regex_TermList(\@term_list, \@regex_term_list, \@regex_lemmawordterm_list);
Alvis::TermTagger::load_Corpus($corpus, \%corpus, \%lc_corpus);
Alvis::TermTagger::load_Corpus($lemmatised_corpus, \%lemmatised_corpus, \%lc_lemmatised_corpus);
Alvis::TermTagger::corpus_Indexing(\%lc_corpus, \%corpus, \%corpus_index, $CS);
Alvis::TermTagger::corpus_Indexing(\%lc_lemmatised_corpus, \%lemmatised_corpus, \%lemmatised_corpus_index, $CS);
Alvis::TermTagger::term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $CS);
# Alvis::TermTagger::term_Selection(\%lemmatised_corpus_index, \@term_list, \%idtrm_select, $CS);
Alvis::TermTagger::term_Selection(\%lemmatised_corpus_index, \@term_list, \%idlemtrm_select, $CS);
Alvis::TermTagger::term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \@selected_term_list, $CS);
Alvis::TermTagger::term_tagging_offset_tab(\@term_list, \@regex_lemmawordterm_list, \%idlemtrm_select, \%lemmatised_corpus, \@selected_term_list, $CS);
foreach $term (@selected_term_list) {
print "$term\n";
}
examples/term-matching view on Meta::CPAN
require Alvis::TermTagger;
my %corpus = (1, {'line' => "During sporulation of Bacillus subtilis, spore coat proteins encoded by cot genes are expressed in the mother cell and deposited on the forespore.", 'offset' => 0});
my %lc_corpus;
my @termlist_input = ("sporulation", "sporulation", "Bacillus subtilis", "Bacillus subtilis" , "proteins", "protein", "genes", "gene" , "mother cell", "mother cell" );
my @termlist;
my @selected_term_list;
my $term;
my @regex_term_list;
my %corpus_index;
my %idtrm_select;
my $key;
foreach $key (keys %corpus) {
$lc_corpus{$key}->{'line'} = lc($corpus{$key}->{'line'});
$lc_corpus{$key}->{'offset'} = $corpus{$key}->{'offset'};
}
my $i = 0;
for($i = 0;$i< scalar (@termlist_input); $i+=2) {
my @tmp = ($termlist_input[$i], $termlist_input[$i+1]);
push @termlist, \@tmp;
}
Alvis::TermTagger::get_Regex_TermList(\@termlist, \@regex_term_list);
Alvis::TermTagger::corpus_Indexing(\%lc_corpus, \%corpus_index);
Alvis::TermTagger::term_Selection(\%corpus_index, \@termlist, \%idtrm_select, 1);
Alvis::TermTagger::term_tagging_offset_tab(\@termlist, \@regex_term_list, \%idtrm_select, \%corpus, \@selected_term_list);
foreach $term (@selected_term_list) {
print "$term\n";
}
lib/Alvis/TermTagger.pm view on Meta::CPAN
# URL : https://perso.limsi.fr/hamon/
#
########################################################################
use strict;
use warnings;
use utf8;
# TODO : write functions for term tagging and term selection, with and
# without offset in the corpus
sub termtagging {
my ($corpus_filename, $term_list_filename, $output_filename, $lemmatised_corpus_filename, $caseSensitive) = @_;
my @term_list;
my %term_listIdx;
my @regex_term_list;
my @regex_lemmawordterm_list;
my %corpus;
my %lc_corpus;
my %lemmatised_corpus;
my %lc_lemmatised_corpus;
my %corpus_index;
my %lemmatised_corpus_index;
my %idtrm_select;
my %idlemtrm_select;
if (!defined $caseSensitive) {
$caseSensitive = -1;
}
&load_TermList($term_list_filename,\@term_list, \%term_listIdx);
&get_Regex_TermList(\@term_list, \@regex_term_list, \@regex_lemmawordterm_list);
&load_Corpus($corpus_filename, \%corpus, \%lc_corpus);
if (defined $lemmatised_corpus_filename) {
&load_Corpus($lemmatised_corpus_filename, \%lemmatised_corpus, \%lc_lemmatised_corpus);
}
&corpus_Indexing(\%lc_corpus, \%corpus, \%corpus_index, $caseSensitive);
if (defined $lemmatised_corpus_filename) {
&corpus_Indexing(\%lc_lemmatised_corpus, \%lemmatised_corpus, \%lemmatised_corpus_index, $caseSensitive);
}
&term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $caseSensitive);
if (defined $lemmatised_corpus_filename) {
&term_Selection(\%lemmatised_corpus_index, \@term_list, \%idlemtrm_select, $caseSensitive);
}
&term_tagging_offset(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename, $caseSensitive);
if (defined $lemmatised_corpus_filename) {
&term_tagging_offset(\@term_list, \@regex_lemmawordterm_list, \%idlemtrm_select, \%lemmatised_corpus, $output_filename, $caseSensitive);
}
return(0);
}
sub termtagging_brat {
my ($corpus_filename, $term_list_filename, $output_filename, $lemmatised_corpus_filename, $caseSensitive) = @_;
my @term_list;
my %term_listIdx;
my @regex_term_list;
my @regex_lemmawordterm_list;
my %corpus;
my %lc_corpus;
my %lemmatised_corpus;
my %lc_lemmatised_corpus;
my %corpus_index;
my %lemmatised_corpus_index;
my %idtrm_select;
my %idlemtrm_select;
if (!defined $caseSensitive) {
$caseSensitive = -1;
}
&load_TermList($term_list_filename,\@term_list, \%term_listIdx);
&get_Regex_TermList(\@term_list, \@regex_term_list, \@regex_lemmawordterm_list);
&load_Corpus($corpus_filename, \%corpus, \%lc_corpus);
if (defined $lemmatised_corpus_filename) {
&load_Corpus($lemmatised_corpus_filename, \%lemmatised_corpus, \%lc_lemmatised_corpus);
}
&corpus_Indexing(\%lc_corpus, \%corpus, \%corpus_index, $caseSensitive);
if (defined $lemmatised_corpus_filename) {
&corpus_Indexing(\%lc_lemmatised_corpus, \%lemmatised_corpus, \%lemmatised_corpus_index, $caseSensitive);
}
&term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $caseSensitive);
if (defined $lemmatised_corpus_filename) {
&term_Selection(\%lemmatised_corpus_index, \@term_list, \%idlemtrm_select, $caseSensitive);
}
&term_tagging_offset_brat(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename, $caseSensitive);
if (defined $lemmatised_corpus_filename) {
&term_tagging_offset_brat(\@term_list, \@regex_lemmawordterm_list, \%idlemtrm_select, \%lemmatised_corpus, $output_filename, $caseSensitive);
}
return(0);
}
sub load_TermList {
my ($termlist_name, $ref_termlist, $ref_termlistIdx) = @_;
my $line;
my $line1;
lib/Alvis/TermTagger.pm view on Meta::CPAN
if (!exists $ref_corpus_index->{$word}) {
my @tabtmp;
$ref_corpus_index->{$word} = \@tabtmp;
}
push @{$ref_corpus_index->{$word}}, $sent_id;
}
}
}
# print STDERR join(" : ", keys(%$ref_corpus_index)) . "\n";
print STDERR "\n\tSize of the first selected term list: " . scalar(keys %$ref_corpus_index) . "\n\n";
}
sub print_corpus_index {
my ($ref_corpus_index) = @_;
my $word;
foreach $word (sort keys %$ref_corpus_index) {
print STDERR "$word\t";
print STDERR join(", ", @{$ref_corpus_index->{$word}});
print STDERR "\n";
}
}
sub _term_Selection2 {
my ($ref_corpus_index, $ref_termlist, $ref_tabh_idtrm_select) = @_;
my $counter;
my $term;
my @tab_termlex;
my $i;
my $word;
my $sent_id;
my $word_found = 0;
warn "Selecting the terms potentially appearing in the corpus\n";
my %tabh_numtrm_select;
for($counter = 0;$counter < scalar @$ref_termlist;$counter++) {
$term = lc $ref_termlist->[$counter]->[0];
# XXX - ABREVIATION - XXX
@tab_termlex = split /[ \-]+/, $term;
$word_found = 0;
$i=0;
do {
$word = $tab_termlex[$i];
if (($word ne "") && ((length($word) > 2) || (scalar(@tab_termlex)==1)) &&
((exists $ref_corpus_index->{$word}))) { # || (exists $ref_corpus_index->{$word . "s"})
$word_found = 1;
if (!exists $ref_tabh_idtrm_select->{$counter}) {
my %tabhtmp2;
$ref_tabh_idtrm_select->{$counter} = \%tabhtmp2;
}
foreach $sent_id (@{$ref_corpus_index->{$word}}) {
${$ref_tabh_idtrm_select->{$counter}}{$sent_id} = 1;
}
}
$i++;
} while((!$word_found) && ($i < scalar @tab_termlex));
}
warn "\nEnd of selecting the terms potentially appearing in the corpus\n";
}
sub term_Selection {
my ($ref_corpus_index, $ref_termlist, $ref_tabh_idtrm_select, $caseSensitive, $termField) = @_;
my $counter;
my $term;
my @tab_termlex;
my $termCap;
my @tab_termlexCap;
my $i;
my $word;
my $sent_id;
my $word_found = 0;
my @recordedWords;
if (!defined $termField) {
$termField = 0;
}
warn "Selecting the terms potentially appearing in the corpus ($termField)\n";
my %tabh_numtrm_select;
# warn "caseSensitive: $caseSensitive\n";
for($counter = 0;$counter < scalar @$ref_termlist;$counter++) {
if (defined $ref_termlist->[$counter]->[$termField]) {
# warn "==> " . $ref_termlist->[$counter]->[0] . " / " . $ref_termlist->[$counter]->[$termField] . "\n";
if ((defined $caseSensitive) && (($caseSensitive == 0) || (length($ref_termlist->[$counter]->[$termField]) <= $caseSensitive))) {
$term = $ref_termlist->[$counter]->[$termField];
$termCap = $ref_termlist->[$counter]->[$termField];
# warn "passe\n";
} else {
lib/Alvis/TermTagger.pm view on Meta::CPAN
# } else {
# warn "--------------------------> $term\n";
}
$i++;
$word = $tab_termlex[$i];
# warn "i: $i\n";
}
if ($i == scalar(@tab_termlex)) {
foreach $word (@recordedWords) {
# print STDERR "$word : ";
if (!exists $ref_tabh_idtrm_select->{$counter}) {
my %tabhtmp2;
$ref_tabh_idtrm_select->{$counter} = \%tabhtmp2;
}
foreach $sent_id (@{$ref_corpus_index->{$word}}) {
${$ref_tabh_idtrm_select->{$counter}}{$sent_id} = 1;
}
}
}
# }
}
# print STDERR "\n";
# print STDERR join(" : ", keys(%$ref_tabh_idtrm_select)) . "\n";
warn "Size of the selected list: " . scalar (keys %$ref_tabh_idtrm_select) . "\n";
# foreach $counter (keys %$ref_tabh_idtrm_select) {
# warn $ref_termlist->[$counter]->[0] . "\n";
# }
warn "\nEnd of selecting the terms potentially appearing in the corpus\n";
}
sub term_tagging_offset {
my ($ref_termlist, $ref_regex_termlist, $ref_tabh_idtrm_select, $ref_tabh_corpus, $offset_tagged_corpus_name, $caseSensitive, $termField) = @_;
my $counter;
my $term_regex;
my $sent_id;
my $line;
my $termField2;
if (!defined $termField) {
$termField = 0;
}
# XXX - ABREVIATION - XXX => regex
warn "Term tagging\n";
# The file is opened for appending; report the actual system error on failure
open TAGGEDCORPUS, ">>$offset_tagged_corpus_name" or die "$0: cannot open $offset_tagged_corpus_name: $!\n";
binmode(TAGGEDCORPUS, ":utf8");
foreach $counter (keys %$ref_tabh_idtrm_select) {
$term_regex = $ref_regex_termlist->[$counter];
$termField2 = 0;
if (defined $ref_termlist->[$counter]->[$termField]) {
$termField2 = $termField;
}
foreach $sent_id (keys %{$ref_tabh_idtrm_select->{$counter}}){
$line = $ref_tabh_corpus->{$sent_id}->{'line'};
print STDERR ".";
if ((((defined $caseSensitive) && (($caseSensitive == 0) || (length($ref_termlist->[$counter]->[$termField2]) <= $caseSensitive))) &&
($line =~ /[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+]($term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/)) ||
(((!defined $caseSensitive) || ($caseSensitive < 0) || (length($ref_termlist->[$counter]->[$termField2]) > $caseSensitive)) &&
($line =~ /[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+]($term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/i))) {
printMatchingTerm(\*TAGGEDCORPUS, $ref_termlist->[$counter], $sent_id);
}
if ((((defined $caseSensitive) && (($caseSensitive == 0) || (length($ref_termlist->[$counter]->[$termField2]) <= $caseSensitive))) &&
lib/Alvis/TermTagger.pm view on Meta::CPAN
my ($descriptor, $ref_matching_term, $sent_id) = @_;
print $descriptor "$sent_id\t";
print $descriptor join("\t", @$ref_matching_term);
print $descriptor "\n";
}
sub term_tagging_offset_tab {
my ($ref_termlist, $ref_regex_termlist, $ref_tabh_idtrm_select, $ref_tabh_corpus, $ref_tab_results, $caseSensitive, $termField) = @_;
my $counter;
my $term_regex;
my $sent_id;
my $line;
my $i;
my $size_termselect = scalar(keys %$ref_tabh_idtrm_select);
my $termField2;
$i = 0;
if (!defined $termField) {
$termField = 0;
}
# XXX - ABREVIATION - XXX => regex
# warn "====> $caseSensitive\n";
foreach $counter (keys %$ref_tabh_idtrm_select) {
# printf STDERR "Term tagging... %0.1f%%\r", ($i/$size_termselect)*100 ;
$term_regex = $ref_regex_termlist->[$counter];
# warn "counter: $counter ($term_regex)\n";
$termField2 = 0;
if (defined $ref_termlist->[$counter]->[$termField]) {
$termField2 = $termField;
}
foreach $sent_id (keys %{$ref_tabh_idtrm_select->{$counter}}){
$line = $ref_tabh_corpus->{$sent_id}->{'line'};
# warn "$line\n$term_regex\n";
if ((((defined $caseSensitive) && (($caseSensitive == 0) || (length($ref_termlist->[$counter]->[$termField2]) <= $caseSensitive))) &&
($line =~ /[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+](?<term>$term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/s)) ||
(((!defined $caseSensitive) || ($caseSensitive < 0) || (length($ref_termlist->[$counter]->[$termField2]) > $caseSensitive)) &&
($line =~ /[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+](?<term>$term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/is))) {
printMatchingTerm_tab($ref_termlist->[$counter], $+{term}, $sent_id, $ref_tab_results);
}
lib/Alvis/TermTagger.pm view on Meta::CPAN
}
$i++;
}
print STDERR "\n";
#########################################################################################################
warn "\nEnd of term tagging\n";
}
sub term_tagging_offset_brat {
my ($ref_termlist, $ref_regex_termlist, $ref_tabh_idtrm_select, $ref_tabh_corpus, $offset_tagged_corpus_name, $caseSensitive, $termField) = @_;
my $counter;
my $term_regex;
my $sent_id;
my $line;
my $i;
my $size_termselect = scalar(keys %$ref_tabh_idtrm_select);
my $termField2;
my $termId = 1;
my $offset;
my $currOffset;
$i = 0;
warn "Term tagging ($offset_tagged_corpus_name)\n";
# The file is opened for writing; report the actual system error on failure
open TAGGEDCORPUS, ">$offset_tagged_corpus_name" or die "$0: cannot open $offset_tagged_corpus_name: $!\n";
binmode(TAGGEDCORPUS, ":utf8");
if (!defined $termField) {
$termField = 0;
}
# XXX - ABREVIATION - XXX => regex
# warn "====> $caseSensitive\n";
foreach $counter (keys %$ref_tabh_idtrm_select) {
# printf STDERR "Term tagging... %0.1f%%\r", ($i/$size_termselect)*100 ;
$term_regex = $ref_regex_termlist->[$counter];
# warn "counter: $counter ($term_regex)\n";
$termField2 = 0;
if (defined $ref_termlist->[$counter]->[$termField]) {
$termField2 = $termField;
}
foreach $sent_id (keys %{$ref_tabh_idtrm_select->{$counter}}){
$line = $ref_tabh_corpus->{$sent_id}->{'line'};
$offset = $ref_tabh_corpus->{$sent_id}->{'offset'};
# warn "$line\n$term_regex\n";
# warn "$line\n$offset\n";
if ((((defined $caseSensitive) && (($caseSensitive == 0) || (length($ref_termlist->[$counter]->[$termField2]) <= $caseSensitive))) &&
($line =~ /(?<before>[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+])(?<term>$term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/s)) ||
(((!defined $caseSensitive) || ($caseSensitive < 0) || (length($ref_termlist->[$counter]->[$termField2]) > $caseSensitive)) &&
($line =~ /(?<before>[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+])(?<term>$term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/is))) {
lib/Alvis/TermTagger.pm view on Meta::CPAN
hashtable given by reference).
=head2 print_corpus_index()
print_corpus_index(\%corpus_index);
This method prints the corpus index C<\%corpus_index> on STDERR.
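To make the shape of the index concrete, here is a minimal sketch that builds a word-to-sentence-ids index for a toy corpus and lists it in the same "word, TAB, comma-separated ids" layout. The corpus data is invented, and splitting on whitespace is an assumption made for the sketch; the module's own tokenisation may differ.

```perl
use strict;
use warnings;

# Toy corpus in the same shape as the module's corpus hash:
# sentence id => { line => lower-cased sentence, offset => position }.
my %lc_corpus = (
    1 => { line => "spore coat proteins are expressed", offset => 0 },
    2 => { line => "coat proteins are deposited",       offset => 34 },
);

# Build word => [sentence ids]. Whitespace splitting is an assumption.
my %corpus_index;
foreach my $sent_id (sort { $a <=> $b } keys %lc_corpus) {
    my %seen;
    foreach my $word (split /\s+/, $lc_corpus{$sent_id}{line}) {
        next if $word eq "" || $seen{$word}++;
        push @{ $corpus_index{$word} }, $sent_id;
    }
}

# Same layout as print_corpus_index(), but built as strings so the
# result can be inspected; the module prints to STDERR instead.
my @index_lines = map { "$_\t" . join(", ", @{ $corpus_index{$_} }) }
                  sort keys %corpus_index;
print "$_\n" foreach @index_lines;
```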
=head2 term_Selection()
term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $caseSensitive);
This method selects the terms from the term list (C<\@term_list>)
potentially appearing in the corpus (that is the indexed corpus,
C<\%corpus_index>). Results are recorded in the hash table
C<\%idtrm_select>.
The parameter C<$caseSensitive> indicates whether the term matching is
case sensitive (value greater than or equal to 0) or case insensitive
(value strictly less than 0). If C<$caseSensitive> is equal to 0, the
case-sensitive match is carried out for all terms. If
C<$caseSensitive> is strictly greater than 0, the case-sensitive match
is carried out only for terms whose number of characters is less than
or equal to C<$caseSensitive>.
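The rule for C<$caseSensitive> can be summarised as a small predicate. The sketch below is only an illustration: C<is_case_sensitive_match> is a hypothetical name, and the module inlines the equivalent test rather than exposing such a function.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented rule for $caseSensitive.
sub is_case_sensitive_match {
    my ($term, $caseSensitive) = @_;
    return 0 if !defined $caseSensitive;    # undefined: case insensitive
    return 1 if $caseSensitive == 0;        # 0: every term is case sensitive
    # Positive value: only short terms are matched case-sensitively.
    # A negative value never satisfies the comparison, so it yields 0.
    return length($term) <= $caseSensitive ? 1 : 0;
}

foreach my $setting (-1, 0, 3) {
    foreach my $term ("DNA", "protein") {
        printf "%-8s caseSensitive=%2d => %s\n", $term, $setting,
            is_case_sensitive_match($term, $setting) ? "sensitive" : "insensitive";
    }
}
```

A positive threshold is useful for short terms such as acronyms, where ignoring case tends to produce spurious matches.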
=head2 term_tagging_offset()
term_tagging_offset(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename, $caseSensitive);
This method tags the corpus C<\%corpus> with the terms (taken from
the term list C<\@term_list>; C<\@regex_term_list> is the term list
as regular expressions) selected in a previous step
(C<\%idtrm_select>). The selected terms are recorded with their
offset and additional information in the file C<$output_filename>.
The parameter C<$caseSensitive> indicates whether the term matching is
case sensitive (value greater than or equal to 0) or case insensitive
(value strictly less than 0). If C<$caseSensitive> is equal to 0, the
case-sensitive match is carried out for all terms. If
C<$caseSensitive> is strictly greater than 0, the case-sensitive match
is carried out only for terms whose number of characters is less than
or equal to C<$caseSensitive>.
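The tagging step only accepts a term when it is surrounded by delimiter ("frontier") characters. The sketch below reuses the delimiter class from the matching code to show the effect. Two things are assumptions of the sketch rather than facts about the module: the term is turned into a regex with C<quotemeta>, and the line is padded with spaces so that terms at either end can match. C<find_term> is a hypothetical helper.

```perl
use strict;
use warnings;

# Delimiter class copied from the matching code in term_tagging_offset().
my $frontier = qr/[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+]/;

# Hypothetical helper: return the matched surface form, or undef.
sub find_term {
    my ($line, $term, $case_sensitive) = @_;
    # Assumption: escaping metacharacters is enough to build the term
    # regex; get_Regex_TermList() may do more than this.
    my $term_regex = quotemeta $term;
    my $re = $case_sensitive
        ? qr/$frontier($term_regex)$frontier/
        : qr/$frontier($term_regex)$frontier/i;
    # Space padding is part of this sketch, not necessarily of the module.
    return " $line " =~ $re ? $1 : undef;
}

my $line = "Expression of cot genes in Bacillus subtilis.";
my $hit_insensitive = find_term($line, "bacillus subtilis", 0);
my $hit_sensitive   = find_term($line, "bacillus subtilis", 1);
my $hit_partial     = find_term($line, "gene", 0);

print defined $hit_insensitive ? "found: $hit_insensitive\n" : "not found\n";
print defined $hit_sensitive   ? "found: $hit_sensitive\n"   : "not found\n";
print defined $hit_partial     ? "found: $hit_partial\n"     : "not found\n";
```

The captured group holds the form found in the text rather than the form from the term list, and "gene" is rejected inside "genes" because the following "s" is not a frontier character.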
=head2 term_tagging_offset_brat()
term_tagging_offset_brat(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename, $caseSensitive);
This method tags the corpus C<\%corpus> with the terms (taken from
the term list C<\@term_list>; C<\@regex_term_list> is the term list
as regular expressions) selected in a previous step
(C<\%idtrm_select>). The selected terms are recorded with their
offset and additional information in the file C<$output_filename> in the Brat input format (<http://brat.nlplab.org/>).
The parameter C<$caseSensitive> indicates whether the term matching is
case sensitive (value greater than or equal to 0) or case insensitive
(value strictly less than 0). If C<$caseSensitive> is equal to 0, the
case-sensitive match is carried out for all terms. If
C<$caseSensitive> is strictly greater than 0, the case-sensitive match
is carried out only for terms whose number of characters is less than
or equal to C<$caseSensitive>.
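For orientation, a Brat text-bound annotation is a line of the form "ID, TAB, type start end, TAB, covered text", with offsets counted in characters from the start of the document. The sketch below formats one such line from a sentence offset and a position within the sentence. The C<brat_line> helper, the C<Term> type label and the sample data are assumptions of the sketch; the exact lines written by the module are not shown here.

```perl
use strict;
use warnings;

# Hypothetical helper: format one text-bound annotation in Brat
# standoff notation (ID, TAB, "Type start end", TAB, covered text).
sub brat_line {
    my ($term_id, $type, $start, $text) = @_;
    my $end = $start + length $text;    # end offset is exclusive
    return sprintf "T%d\t%s %d %d\t%s", $term_id, $type, $start, $end, $text;
}

# Made-up sentence record in the same shape as the corpus hash:
# 'line' plus the character 'offset' of the sentence in the document.
my %sentence = (line => "cot genes are expressed", offset => 100);

my $term = "genes";
my $pos  = index $sentence{line}, $term;    # position within the sentence
my $annotation = brat_line(1, "Term", $sentence{offset} + $pos, $term);
print "$annotation\n";
```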
=head2 term_tagging_offset_tab()
term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \@tab_results, $caseSensitive);
or
term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \%tabh_results, $caseSensitive);
This method tags the corpus C<\%corpus> with the terms (taken from
the term list C<\@term_list>; C<\@regex_term_list> is the term list
as regular expressions) selected in a previous step
(C<\%idtrm_select>). The selected terms are recorded with their
offset and additional information in the array C<@tab_results>
(values are the sentence id, the selected term and additional information,
separated by tabulations) or in the hash table C<%tabh_results> (keys
have the form "sentenceid_selectedterm"; values are array references
containing the sentence id, the selected term and additional information).
The parameter C<$caseSensitive> indicates whether the term matching is
case sensitive (value greater than or equal to 0) or case insensitive
(value strictly less than 0). If C<$caseSensitive> is equal to 0, the
case-sensitive match is carried out for all terms. If
C<$caseSensitive> is strictly greater than 0, the case-sensitive match
is carried out only for terms whose number of characters is less than
or equal to C<$caseSensitive>.
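The two result containers described above differ only in how a match is stored. The following sketch shows both shapes side by side. C<record_match> is a hypothetical stand-in written for this illustration, not the module's own recording routine, and the sample match is invented.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the module's result recording: store one
# match either in an array (tab-joined string) or in a hash keyed by
# "sentenceid_selectedterm", depending on the reference passed in.
sub record_match {
    my ($ref_results, $sent_id, $term, @info) = @_;
    if (ref($ref_results) eq 'ARRAY') {
        push @$ref_results, join("\t", $sent_id, $term, @info);
    } elsif (ref($ref_results) eq 'HASH') {
        $ref_results->{"${sent_id}_$term"} = [ $sent_id, $term, @info ];
    } else {
        die "results must be an array or hash reference\n";
    }
    return;
}

my (@tab_results, %tabh_results);
record_match(\@tab_results,  1, "genes", "gene");
record_match(\%tabh_results, 1, "genes", "gene");

print "$tab_results[0]\n";
print join("\t", @{ $tabh_results{"1_genes"} }), "\n";
```

With the hash form, a term found by both the surface and the lemmatised pass in the same sentence ends up under a single key, whereas the array form keeps one entry per recorded match.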
=head2 printMatchingTerm