Alvis-TermTagger

Changes  view on Meta::CPAN

        - add the possibility of case sensitive matching, optionally
          depending on the length of the term

0.7     - tagging of the term with the lemmatised form of its words

0.6     - Correction in Build.PL
        - '*' was forgotten in the quote meta character function
        - Correction in the documentation
        - A few meta characters are no longer treated as special in the terms
        - The returned term form is the matching form according to the regex
        - Optimization of the term selection

0.5     - Add the '/' character to the regex frontier during final term tagging
        - Addition of the method print_corpus_index to print on STDERR
          the corpus index
        - Correction in the step for selecting potentially appearing
          terms in the corpus (some terms were missing)
        - Additional information, such as semantic tags associated with
          terms (given from the third column on), is managed
        - Add the LICENSE file

0.4     - Improvement in displaying the tagging process
        - Documentation of the modules and scripts is gathered at the
          end of each file.
        - Addition of a Build.PL file
        - Addition of examples

bin/TermTagger-brat.pl  view on Meta::CPAN

# URL : https://perso.limsi.fr/hamon/
#
########################################################################

=head1 NAME

TermTagger-brat.pl -- A Perl script for tagging text with terms (Brat format output)

=head1 SYNOPSIS

TermTagger-brat.pl [options] corpus termlist selected_term_list lemmatised_corpus

=head1 OPTIONS

=over 4

=item    B<--help>            brief help message

=back

=head1 DESCRIPTION

This script tags a corpus with terms and provides an output compatible
with Brat (<http://brat.nlplab.org/>). Corpus (C<corpus>) is a file
with one sentence per line. Term list (C<termlist>) is a file
containing one term per line. For each term, additional information
(such as the canonical form) can be given in the following columns.
Each line of the output file (C<selected_term_list>) contains the
sentence number, the term and the additional information, all
separated by a tabulation character.

=head1 EXAMPLES

Tag the textual corpus in C<corpus-test.txt> with terms in the file
C<termlist-test.lst> and record the results in the file
C<corpus-test.ann>, according to the Brat input format:

    TermTagger-brat.pl corpus-test.txt termlist-test.lst corpus-test.ann

bin/TermTagger.pl  view on Meta::CPAN

# URL : https://perso.limsi.fr/hamon/
#
########################################################################

=head1 NAME

TermTagger.pl -- A Perl script for tagging corpus with terms

=head1 SYNOPSIS

TermTagger.pl [options] corpus termlist selected_term_list lemmatised_corpus

=head1 OPTIONS

=over 4

=item    B<--help>            brief help message

=back

=head1 DESCRIPTION

This script tags a corpus with terms. Corpus (C<corpus>) is a file
with one sentence per line. Term list (C<termlist>) is a file
containing one term per line. For each term, additional information
(such as the canonical form) can be given in the following columns.
Each line of the output file (C<selected_term_list>) contains the
sentence number, the term and the additional information, all
separated by a tabulation character.
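
The output format described above can be read back with a few lines of
Perl. This is a minimal sketch, not part of the distribution: the
function name C<parse_selected_term_line> and the sample line are
hypothetical, and the only assumption is the tab-separated layout
(sentence number, term, then any additional columns).

```perl
use strict;
use warnings;

# Split one output line into sentence number, term, and any extra columns.
# (Hypothetical helper, not provided by Alvis::TermTagger.)
sub parse_selected_term_line {
    my ($line) = @_;
    chomp $line;
    # Limit -1 keeps trailing empty columns instead of dropping them.
    my ($sent_id, $term, @info) = split /\t/, $line, -1;
    return { sent_id => $sent_id, term => $term, info => \@info };
}

# Hypothetical sample line; real content depends on the term list.
my $rec = parse_selected_term_line("1\tBacillus subtilis\tbacillus subtilis\n");
print "sentence $rec->{sent_id}: $rec->{term} (@{ $rec->{info} })\n";
```

Keeping the additional columns in an array avoids assuming how many of
them a given term list provides.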

=head1 SEE ALSO

Alvis web site: http://www.alvis.info

=head1 AUTHORS

Thierry Hamon <thierry.hamon@limsi.fr>

examples/load_and_term-matching  view on Meta::CPAN

#!/usr/bin/perl -w


use strict;
use warnings;

require Alvis::TermTagger;

my $corpus = "corpus";
my $termlist = "term+lem+semtaglist";
my $selected_term_list = "selected-terms";

Alvis::TermTagger::termtagging($corpus, $termlist, $selected_term_list);


warn "List of the selected terms in $selected_term_list\n";

examples/load_and_term-matching_hash  view on Meta::CPAN



use strict;
use warnings;

require Alvis::TermTagger;

my $corpus = "corpus";
my $lemmatised_corpus = "lemmatised-corpus";
my $termlist = "term+lem+semtaglist+wordlemma";
my %selected_term_list;
my $term;

my @term_list;
my %term_listIdx;
my @regex_term_list;
my @regex_lemmawordterm_list;
my %corpus;
my %lc_corpus;
my %lemmatised_corpus;
my %lc_lemmatised_corpus;
my %corpus_index;
my %lemmatised_corpus_index;
my %idtrm_select;
my %idlemtrm_select;
my $CS = undef;


Alvis::TermTagger::load_TermList($termlist,\@term_list,\%term_listIdx);
Alvis::TermTagger::get_Regex_TermList(\@term_list, \@regex_term_list, \@regex_lemmawordterm_list);
Alvis::TermTagger::load_Corpus($corpus,\%corpus, \%lc_corpus);
Alvis::TermTagger::load_Corpus($lemmatised_corpus,\%lemmatised_corpus, \%lc_lemmatised_corpus);
Alvis::TermTagger::corpus_Indexing(\%lc_corpus, \%corpus, \%corpus_index, $CS);
Alvis::TermTagger::corpus_Indexing(\%lc_lemmatised_corpus, \%lemmatised_corpus, \%lemmatised_corpus_index, $CS);
Alvis::TermTagger::term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $CS);
Alvis::TermTagger::term_Selection(\%lemmatised_corpus_index, \@term_list, \%idlemtrm_select, $CS);
Alvis::TermTagger::term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \%selected_term_list, $CS);
Alvis::TermTagger::term_tagging_offset_tab(\@term_list, \@regex_lemmawordterm_list, \%idlemtrm_select, \%lemmatised_corpus, \%selected_term_list, $CS);
 
foreach $term (keys %selected_term_list) {
#   print "$term\t" .  join("\t", @{$selected_term_list{$term}}) . "\n";
  print join("\t", @{$selected_term_list{$term}}) . "\n";
}


examples/load_and_term-matching_tab  view on Meta::CPAN



use strict;
use warnings;

require Alvis::TermTagger;

my $corpus = "corpus";
my $lemmatised_corpus = "lemmatised-corpus";
my $termlist = "term+lem+semtaglist+wordlemma";
my @selected_term_list;
my $term;

my %term_listIdx;
my @term_list;
my @regex_term_list;
my @regex_lemmawordterm_list;
my %corpus;
my %lc_corpus;
my %lemmatised_corpus;
my %lc_lemmatised_corpus;
my %corpus_index;
my %lemmatised_corpus_index;
my %idtrm_select;
my %idlemtrm_select;
my $CS = 3;


Alvis::TermTagger::load_TermList($termlist, \@term_list, \%term_listIdx);
Alvis::TermTagger::get_Regex_TermList(\@term_list, \@regex_term_list, \@regex_lemmawordterm_list);
Alvis::TermTagger::load_Corpus($corpus, \%corpus, \%lc_corpus);
Alvis::TermTagger::load_Corpus($lemmatised_corpus, \%lemmatised_corpus, \%lc_lemmatised_corpus);
Alvis::TermTagger::corpus_Indexing(\%lc_corpus, \%corpus, \%corpus_index, $CS);
Alvis::TermTagger::corpus_Indexing(\%lc_lemmatised_corpus, \%lemmatised_corpus, \%lemmatised_corpus_index, $CS);
Alvis::TermTagger::term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $CS);
# Alvis::TermTagger::term_Selection(\%lemmatised_corpus_index, \@term_list, \%idtrm_select, $CS);
Alvis::TermTagger::term_Selection(\%lemmatised_corpus_index, \@term_list, \%idlemtrm_select, $CS);
Alvis::TermTagger::term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \@selected_term_list, $CS);
Alvis::TermTagger::term_tagging_offset_tab(\@term_list, \@regex_lemmawordterm_list, \%idlemtrm_select, \%lemmatised_corpus, \@selected_term_list, $CS);

foreach $term (@selected_term_list) {
  print "$term\n";
}


examples/term-matching  view on Meta::CPAN


require Alvis::TermTagger;

my %corpus = (1, {'line' => "During sporulation of Bacillus subtilis, spore coat proteins encoded by cot genes are expressed in the mother cell and deposited on the forespore.", 'offset' => 0});

my %lc_corpus;

my @termlist_input = ("sporulation", "sporulation", "Bacillus subtilis", "Bacillus subtilis" , "proteins", "protein", "genes", "gene" , "mother cell", "mother cell" );

my @termlist;
my @selected_term_list;
my $term;

my @regex_term_list;
my %corpus_index;
my %idtrm_select;

my $key;

foreach $key (keys %corpus) {
    $lc_corpus{$key}->{'line'} = lc($corpus{$key}->{'line'});
    $lc_corpus{$key}->{'offset'} = $corpus{$key}->{'offset'};
}

my $i = 0;

for($i = 0;$i< scalar (@termlist_input); $i+=2) {
    my @tmp = ($termlist_input[$i], $termlist_input[$i+1]);
    push @termlist, \@tmp;
}


Alvis::TermTagger::get_Regex_TermList(\@termlist, \@regex_term_list);
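# corpus_Indexing() is called in the module with the lower-cased corpus, the
# original corpus, the index and the case-sensitivity setting, in that order.
Alvis::TermTagger::corpus_Indexing(\%lc_corpus, \%corpus, \%corpus_index, 1);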
Alvis::TermTagger::term_Selection(\%corpus_index, \@termlist, \%idtrm_select, 1);
Alvis::TermTagger::term_tagging_offset_tab(\@termlist, \@regex_term_list, \%idtrm_select, \%corpus, \@selected_term_list);

foreach $term (@selected_term_list) {
  print "$term\n";
}




lib/Alvis/TermTagger.pm  view on Meta::CPAN

# URL : https://perso.limsi.fr/hamon/
#
########################################################################


use strict;
use warnings;

use utf8;

# TODO : write functions for term tagging, term selection with and
# without offset in the corpus

sub termtagging {

    my ($corpus_filename, $term_list_filename, $output_filename, $lemmatised_corpus_filename, $caseSensitive) = @_;

    my @term_list;
    my %term_listIdx;
    my @regex_term_list;
    my @regex_lemmawordterm_list;
    my %corpus;
    my %lc_corpus;
    my %lemmatised_corpus;
    my %lc_lemmatised_corpus;
    my %corpus_index;
    my %lemmatised_corpus_index;
    my %idtrm_select;
    my %idlemtrm_select;

    if (!defined $caseSensitive) {
	$caseSensitive = -1;
    }

    &load_TermList($term_list_filename,\@term_list, \%term_listIdx);
    &get_Regex_TermList(\@term_list, \@regex_term_list, \@regex_lemmawordterm_list);

    &load_Corpus($corpus_filename, \%corpus, \%lc_corpus);
    if (defined $lemmatised_corpus_filename) {
	&load_Corpus($lemmatised_corpus_filename, \%lemmatised_corpus, \%lc_lemmatised_corpus);
    }
    &corpus_Indexing(\%lc_corpus, \%corpus, \%corpus_index, $caseSensitive);
    if (defined $lemmatised_corpus_filename) {
	&corpus_Indexing(\%lc_lemmatised_corpus, \%lemmatised_corpus, \%lemmatised_corpus_index, $caseSensitive);
    }
    &term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $caseSensitive);
    if (defined $lemmatised_corpus_filename) {
	&term_Selection(\%lemmatised_corpus_index, \@term_list, \%idlemtrm_select, $caseSensitive);
    }
    &term_tagging_offset(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename, $caseSensitive);
    if (defined $lemmatised_corpus_filename) {
	&term_tagging_offset(\@term_list, \@regex_lemmawordterm_list, \%idlemtrm_select, \%lemmatised_corpus, $output_filename, $caseSensitive);
    }
    return(0);
}

sub termtagging_brat {

    my ($corpus_filename, $term_list_filename, $output_filename, $lemmatised_corpus_filename, $caseSensitive) = @_;

    my @term_list;
    my %term_listIdx;
    my @regex_term_list;
    my @regex_lemmawordterm_list;
    my %corpus;
    my %lc_corpus;
    my %lemmatised_corpus;
    my %lc_lemmatised_corpus;
    my %corpus_index;
    my %lemmatised_corpus_index;
    my %idtrm_select;
    my %idlemtrm_select;

    if (!defined $caseSensitive) {
	$caseSensitive = -1;
    }

    &load_TermList($term_list_filename,\@term_list, \%term_listIdx);
    &get_Regex_TermList(\@term_list, \@regex_term_list, \@regex_lemmawordterm_list);

    &load_Corpus($corpus_filename, \%corpus, \%lc_corpus);
    if (defined $lemmatised_corpus_filename) {
	&load_Corpus($lemmatised_corpus_filename, \%lemmatised_corpus, \%lc_lemmatised_corpus);
    }
    &corpus_Indexing(\%lc_corpus, \%corpus, \%corpus_index, $caseSensitive);
    if (defined $lemmatised_corpus_filename) {
	&corpus_Indexing(\%lc_lemmatised_corpus, \%lemmatised_corpus, \%lemmatised_corpus_index, $caseSensitive);
    }
    &term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $caseSensitive);
    if (defined $lemmatised_corpus_filename) {
	&term_Selection(\%lemmatised_corpus_index, \@term_list, \%idlemtrm_select, $caseSensitive);
    }
    &term_tagging_offset_brat(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename, $caseSensitive);
    if (defined $lemmatised_corpus_filename) {
	&term_tagging_offset_brat(\@term_list, \@regex_lemmawordterm_list, \%idlemtrm_select, \%lemmatised_corpus, $output_filename, $caseSensitive);
    }
    return(0);
}


sub load_TermList {
    my ($termlist_name, $ref_termlist, $ref_termlistIdx) = @_;

    my $line;
    my $line1;

lib/Alvis/TermTagger.pm  view on Meta::CPAN

		if (!exists $ref_corpus_index->{$word}) {
		    my @tabtmp;
		    $ref_corpus_index->{$word} = \@tabtmp;
		}
		push @{$ref_corpus_index->{$word}}, $sent_id;
	    }
	}
    }
    # print STDERR join(" : ", keys(%$ref_corpus_index)) . "\n";

    print STDERR "\n\tSize of the first selected term list: " . scalar(keys %$ref_corpus_index) . "\n\n";
}

sub print_corpus_index {
    my ($ref_corpus_index) = @_;

    my $word;

    foreach $word (sort keys %$ref_corpus_index) {
	print STDERR "$word\t";
	print STDERR join(", ", @{$ref_corpus_index->{$word}});
	print STDERR "\n";
    }
}

sub _term_Selection2 {
    my ($ref_corpus_index, $ref_termlist, $ref_tabh_idtrm_select) = @_;
    my $counter;
    my $term;
    my @tab_termlex;
    my $i;
    my $word;
    my $sent_id;
    my $word_found = 0;

    warn "Selecting the terms potentially appearing in the corpus\n";

    my %tabh_numtrm_select;
  
    for($counter  = 0;$counter < scalar @$ref_termlist;$counter++) {
	$term = lc $ref_termlist->[$counter]->[0];
        # XXX - ABREVIATION - XXX
	@tab_termlex = split /[ \-]+/, $term;
	$word_found = 0;
	$i=0; 
	do {
	    $word = $tab_termlex[$i];
	    if (($word ne "") && ((length($word) > 2) || (scalar(@tab_termlex)==1)) && 
		((exists $ref_corpus_index->{$word}))) { #  || (exists $ref_corpus_index->{$word . "s"})
		$word_found = 1;
		if (!exists $ref_tabh_idtrm_select->{$counter}) {
		    my %tabhtmp2;
		    $ref_tabh_idtrm_select->{$counter} = \%tabhtmp2;
		}
		foreach $sent_id (@{$ref_corpus_index->{$word}}) {
		    ${$ref_tabh_idtrm_select->{$counter}}{$sent_id} = 1;
		}
	    }
	    $i++;
	} while((!$word_found) && ($i < scalar @tab_termlex));
    }

    warn "\nEnd of selecting the terms potentially appearing in the corpus\n";
}

sub term_Selection {
    my ($ref_corpus_index, $ref_termlist, $ref_tabh_idtrm_select, $caseSensitive, $termField) = @_;
    my $counter;
    my $term;
    my @tab_termlex;
    my $termCap;
    my @tab_termlexCap;
    my $i;
    my $word;
    my $sent_id;
    my $word_found = 0;

    my @recordedWords;

    if (!defined $termField) {
	$termField = 0;
    }

    warn "Selecting the terms potentially appearing in the corpus ($termField)\n";

    my %tabh_numtrm_select;
    
    # warn "caseSensitive: $caseSensitive\n";
    for($counter  = 0;$counter < scalar @$ref_termlist;$counter++) {
	if (defined $ref_termlist->[$counter]->[$termField]) {
	    # warn "==> " . $ref_termlist->[$counter]->[0] . " / " . $ref_termlist->[$counter]->[$termField] . "\n";
	    if ((defined $caseSensitive) && (($caseSensitive == 0) || (length($ref_termlist->[$counter]->[$termField]) <= $caseSensitive))) {
		$term = $ref_termlist->[$counter]->[$termField];
		$termCap = $ref_termlist->[$counter]->[$termField];
		# warn "passe\n";
	    } else {

lib/Alvis/TermTagger.pm  view on Meta::CPAN

		    # } else {
		    # 	warn "--------------------------> $term\n";
		}
		$i++;
		$word = $tab_termlex[$i];
		# warn "i: $i\n";
	    }
	    if ($i == scalar(@tab_termlex)) {
		foreach $word (@recordedWords) {
		    # print STDERR "$word : ";
		    if (!exists $ref_tabh_idtrm_select->{$counter}) {
			my %tabhtmp2;
			$ref_tabh_idtrm_select->{$counter} = \%tabhtmp2;
		    }
		    foreach $sent_id (@{$ref_corpus_index->{$word}}) {
			${$ref_tabh_idtrm_select->{$counter}}{$sent_id} = 1;
		    }
		}
	    }
#	}
    }
    # print STDERR "\n";

    # print STDERR join(" : ", keys(%$ref_tabh_idtrm_select)) . "\n";

    warn "Size of the selected list: " . scalar (keys %$ref_tabh_idtrm_select) . "\n";
    # foreach $counter (keys %$ref_tabh_idtrm_select) {
    # 	warn $ref_termlist->[$counter]->[0] . "\n";
    # }

    warn "\nEnd of selecting the terms potentially appearing in the corpus\n";
}

sub term_tagging_offset {
    my ($ref_termlist, $ref_regex_termlist, $ref_tabh_idtrm_select, $ref_tabh_corpus, $offset_tagged_corpus_name, $caseSensitive, $termField) = @_;
    my $counter;
    my $term_regex;
    my $sent_id;
    my $line;
    my $termField2;

    if (!defined $termField) {
	$termField = 0;
    }
    # XXX - ABREVIATION - XXX => regex

    warn "Term tagging\n";

    # Append to the output file; report the real system error on failure.
    open(TAGGEDCORPUS, '>>', $offset_tagged_corpus_name) or die "$0: cannot open $offset_tagged_corpus_name: $!\n";

    binmode(TAGGEDCORPUS, ":utf8");

    foreach $counter (keys %$ref_tabh_idtrm_select) {
	$term_regex = $ref_regex_termlist->[$counter];
	$termField2 = 0;
	if (defined $ref_termlist->[$counter]->[$termField]) {
	    $termField2 = $termField;
	}
	foreach $sent_id (keys %{$ref_tabh_idtrm_select->{$counter}}){
	    $line = $ref_tabh_corpus->{$sent_id}->{'line'};
	    print STDERR ".";
	    
	    if ((((defined $caseSensitive) && (($caseSensitive == 0) || (length($ref_termlist->[$counter]->[$termField2]) <= $caseSensitive))) &&
		 ($line =~ /[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+]($term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/)) || 
		(((!defined $caseSensitive) || ($caseSensitive < 0) || (length($ref_termlist->[$counter]->[$termField2]) > $caseSensitive)) && 
		 ($line =~ /[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+]($term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/i))) {
		printMatchingTerm(\*TAGGEDCORPUS, $ref_termlist->[$counter], $sent_id);
	    }
	    if ((((defined $caseSensitive) && (($caseSensitive == 0) || (length($ref_termlist->[$counter]->[$termField2]) <= $caseSensitive))) &&

lib/Alvis/TermTagger.pm  view on Meta::CPAN

    my ($descriptor, $ref_matching_term, $sent_id) = @_;

    print $descriptor "$sent_id\t";
    print $descriptor join("\t", @$ref_matching_term);
    print $descriptor "\n";

}


sub term_tagging_offset_tab {
    my ($ref_termlist, $ref_regex_termlist, $ref_tabh_idtrm_select, $ref_tabh_corpus, $ref_tab_results, $caseSensitive, $termField) = @_;
    my $counter;
    my $term_regex;
    my $sent_id;
    my $line;
    my $i;
    my $size_termselect = scalar(keys %$ref_tabh_idtrm_select);
    my $termField2;

    $i = 0;

    if (!defined $termField) {
	$termField = 0;
    }

    # XXX - ABREVIATION - XXX => regex
    # warn "====> $caseSensitive\n";
    
    foreach $counter (keys %$ref_tabh_idtrm_select) {
#  	printf STDERR "Term tagging... %0.1f%%\r", ($i/$size_termselect)*100 ;
	$term_regex = $ref_regex_termlist->[$counter];
	# warn "counter: $counter ($term_regex)\n";

	$termField2 = 0;
	if (defined $ref_termlist->[$counter]->[$termField]) {
	    $termField2 = $termField;
	}

	foreach $sent_id (keys %{$ref_tabh_idtrm_select->{$counter}}){
	    $line = $ref_tabh_corpus->{$sent_id}->{'line'};

	    # warn "$line\n$term_regex\n";

	    if ((((defined $caseSensitive) && (($caseSensitive == 0) || (length($ref_termlist->[$counter]->[$termField2]) <= $caseSensitive))) &&
		 ($line =~ /[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+](?<term>$term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/s)) ||
		(((!defined $caseSensitive) || ($caseSensitive < 0) || (length($ref_termlist->[$counter]->[$termField2]) > $caseSensitive)) && 
		 ($line =~ /[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+](?<term>$term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/is))) {
 		printMatchingTerm_tab($ref_termlist->[$counter], $+{term},  $sent_id, $ref_tab_results);
	    }

lib/Alvis/TermTagger.pm  view on Meta::CPAN

	}
	$i++;
    }
    print STDERR "\n";

#########################################################################################################
    warn "\nEnd of term tagging\n";
}

sub term_tagging_offset_brat {
    my ($ref_termlist, $ref_regex_termlist, $ref_tabh_idtrm_select, $ref_tabh_corpus, $offset_tagged_corpus_name, $caseSensitive, $termField) = @_;
    my $counter;
    my $term_regex;
    my $sent_id;
    my $line;
    my $i;
    my $size_termselect = scalar(keys %$ref_tabh_idtrm_select);
    my $termField2;
    my $termId = 1;
    my $offset;
    my $currOffset;

    $i = 0;

    warn "Term tagging ($offset_tagged_corpus_name)\n";

    # Overwrite the output file; report the real system error on failure.
    open(TAGGEDCORPUS, '>', $offset_tagged_corpus_name) or die "$0: cannot open $offset_tagged_corpus_name: $!\n";

lib/Alvis/TermTagger.pm  view on Meta::CPAN

    binmode(TAGGEDCORPUS, ":utf8");


    if (!defined $termField) {
	$termField = 0;
    }

    # XXX - ABREVIATION - XXX => regex
    # warn "====> $caseSensitive\n";
    
    foreach $counter (keys %$ref_tabh_idtrm_select) {
#  	printf STDERR "Term tagging... %0.1f%%\r", ($i/$size_termselect)*100 ;
	$term_regex = $ref_regex_termlist->[$counter];
	# warn "counter: $counter ($term_regex)\n";

	$termField2 = 0;
	if (defined $ref_termlist->[$counter]->[$termField]) {
	    $termField2 = $termField;
	}

	foreach $sent_id (keys %{$ref_tabh_idtrm_select->{$counter}}){
	    $line = $ref_tabh_corpus->{$sent_id}->{'line'};
	    $offset = $ref_tabh_corpus->{$sent_id}->{'offset'};

	    # warn "$line\n$term_regex\n";
	    # warn "$line\n$offset\n";

	    if ((((defined $caseSensitive) && (($caseSensitive == 0) || (length($ref_termlist->[$counter]->[$termField2]) <= $caseSensitive))) &&
		 ($line =~ /(?<before>[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+])(?<term>$term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/s)) ||
		(((!defined $caseSensitive) || ($caseSensitive < 0) || (length($ref_termlist->[$counter]->[$termField2]) > $caseSensitive)) && 
		 ($line =~ /(?<before>[,.?!:;\/ \n\-\/\*'\#\{\}\(\)\[\]\+])(?<term>$term_regex)[,.?!:;\/ \n\-\/\*'\#\(\)\[\]\{\}\+]/is))) {

lib/Alvis/TermTagger.pm  view on Meta::CPAN

hashtable given by reference).

=head2 print_corpus_index()

    print_corpus_index(\%corpus_index);

This method prints on STDERR the corpus index C<\%corpus_index>.
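
The index printed by this method maps each word to the list of
sentence identifiers where it occurs. The sketch below builds such a
structure by hand and prints it in the same layout (word, tabulation,
comma-separated sentence identifiers). The function C<build_index> is
hypothetical, and its whitespace and punctuation splitting is an
assumption; the tokenisation actually done by C<corpus_Indexing()> may
differ.

```perl
use strict;
use warnings;

# Build a word => [sentence ids] index from a hash of sentences.
# Assumption: words are split on whitespace and simple punctuation.
sub build_index {
    my ($ref_corpus) = @_;
    my %index;
    for my $sent_id (sort { $a <=> $b } keys %$ref_corpus) {
        my %seen;    # record each word once per sentence
        for my $word (split /[\s,.;:!?()]+/, lc $ref_corpus->{$sent_id}{line}) {
            next if $word eq '' || $seen{$word}++;
            push @{ $index{$word} }, $sent_id;
        }
    }
    return \%index;
}

my %corpus = (
    1 => { line => 'Spore coat proteins are expressed.' },
    2 => { line => 'Proteins are deposited on the forespore.' },
);
my $index = build_index(\%corpus);

# Same layout as print_corpus_index(): word, a tab, then the sentence ids.
for my $word (sort keys %$index) {
    print STDERR "$word\t", join(', ', @{ $index->{$word} }), "\n";
}
```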

=head2 term_Selection()

    term_Selection(\%corpus_index, \@term_list, \%idtrm_select, $caseSensitive);

This method selects, from the term list (C<\@term_list>), the terms
potentially appearing in the corpus (that is, in the indexed corpus
C<\%corpus_index>). Results are recorded in the hash table
C<\%idtrm_select>.

The parameter C<$caseSensitive> controls whether the term matching is
case sensitive. If its value is strictly less than 0, matching is case
insensitive for all terms. If it is equal to 0, matching is case
sensitive for all terms. If it is strictly greater than 0, matching is
case sensitive only for the terms whose length is at most
C<$caseSensitive> characters; longer terms are matched case
insensitively.
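
The rule for C<$caseSensitive> can be summarised as a small predicate.
This is a sketch written from the description above and from the
conditions used in the tagging functions; the name
C<is_case_sensitive> is hypothetical and is not exported by the
module.

```perl
use strict;
use warnings;

# Returns 1 when a term should be matched case sensitively, 0 otherwise.
# An undefined or negative setting means case-insensitive matching.
sub is_case_sensitive {
    my ($caseSensitive, $term) = @_;
    return 0 if !defined $caseSensitive || $caseSensitive < 0;
    return 1 if $caseSensitive == 0;
    # Positive setting: only short terms are matched case sensitively.
    return length($term) <= $caseSensitive ? 1 : 0;
}

for my $setting (-1, 0, 3) {
    for my $term ('DNA', 'protein') {
        printf "%2d  %-8s %s\n", $setting, $term,
            is_case_sensitive($setting, $term) ? 'sensitive' : 'insensitive';
    }
}
```

With a setting of 3, a short acronym such as C<DNA> is matched case
sensitively while C<protein> is not, which is the typical use of a
positive value.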


=head2 term_tagging_offset()

    term_tagging_offset(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename, $caseSensitive);

This method tags the corpus C<\%corpus> with the terms taken from the
term list C<\@term_list> (C<\@regex_term_list> is the same term list
as regular expressions) and selected in a previous step
(C<\%idtrm_select>). The selected terms are recorded, with their
offset and additional information, in the file C<$output_filename>.

The parameter C<$caseSensitive> controls whether the term matching is
case sensitive. If its value is strictly less than 0, matching is case
insensitive for all terms. If it is equal to 0, matching is case
sensitive for all terms. If it is strictly greater than 0, matching is
case sensitive only for the terms whose length is at most
C<$caseSensitive> characters; longer terms are matched case
insensitively.
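
During tagging, a term only matches when it is delimited on both sides
by a boundary ("frontier") character. The sketch below illustrates
that idea with the same set of boundary characters as the module's
regular expressions. The function C<match_term> is hypothetical, and
C<quotemeta> is used here as a stand-in for the regular expressions
built by C<get_Regex_TermList()>, which may be more elaborate.

```perl
use strict;
use warnings;

# Boundary characters, taken from the character class used in the module.
my $frontier = qr/[,.?!:;\/ \n\-\*'\#\{\}\(\)\[\]\+]/;

# Return the matched surface form, or undef when the term is not found
# between two boundary characters. (Hypothetical helper.)
sub match_term {
    my ($line, $term, $case_sensitive) = @_;
    my $term_regex = quotemeta $term;    # assumption: literal term only
    my $re = $case_sensitive
        ? qr/$frontier($term_regex)$frontier/
        : qr/$frontier($term_regex)$frontier/i;
    return $line =~ $re ? $1 : undef;
}

# The sentence is padded with spaces so that terms at either end of the
# line also have a boundary character next to them.
my $line = " During sporulation of Bacillus subtilis, spore coat proteins are expressed. ";

for my $case_sensitive (0, 1) {
    my $hit = match_term($line, 'bacillus subtilis', $case_sensitive);
    print defined $hit ? $hit : 'no match', "\n";
}
```

The case-insensitive call returns the form found in the text
(C<Bacillus subtilis>), while the case-sensitive call finds nothing
because the term is given in lower case.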

=head2 term_tagging_offset_brat()

    term_tagging_offset_brat(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, $output_filename, $caseSensitive);

This method tags the corpus C<\%corpus> with the terms taken from the
term list C<\@term_list> (C<\@regex_term_list> is the same term list
as regular expressions) and selected in a previous step
(C<\%idtrm_select>). The selected terms are recorded, with their
offset and additional information, in the file C<$output_filename> in
the Brat input format (<http://brat.nlplab.org/>).

The parameter C<$caseSensitive> controls whether the term matching is
case sensitive. If its value is strictly less than 0, matching is case
insensitive for all terms. If it is equal to 0, matching is case
sensitive for all terms. If it is strictly greater than 0, matching is
case sensitive only for the terms whose length is at most
C<$caseSensitive> characters; longer terms are matched case
insensitively.
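
For readers unfamiliar with Brat, a text-bound annotation in its
standoff format is a line made of an identifier, a type with start and
end character offsets, and the annotated text, separated by
tabulations. The sketch below only illustrates that layout. The
function C<brat_line>, the type label C<Term> and the way the start
offset is computed (sentence offset plus position in the line) are
assumptions; the exact lines written by C<term_tagging_offset_brat()>
are not shown here.

```perl
use strict;
use warnings;

# Format one Brat text-bound annotation: ID, "Type start end", text.
# The end offset is the position just after the last character.
sub brat_line {
    my ($id, $type, $start, $text) = @_;
    return sprintf "T%d\t%s %d %d\t%s",
        $id, $type, $start, $start + length($text), $text;
}

# Assumptions: 'offset' is the document position of the first character
# of the sentence, and 'Term' is a placeholder annotation type.
my %sentence = (line => 'Bacillus subtilis forms spores.', offset => 120);

if ($sentence{line} =~ /(?<term>spores)/) {
    my $start = $sentence{offset} + $-[1];    # $-[1]: start of the first group
    print brat_line(1, 'Term', $start, $+{term}), "\n";
}
```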

=head2 term_tagging_offset_tab()

    term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \@tab_results, $caseSensitive);

or 

    term_tagging_offset_tab(\@term_list, \@regex_term_list, \%idtrm_select, \%corpus, \%tabh_results, $caseSensitive);

This method tags the corpus C<\%corpus> with the terms taken from the
term list C<\@term_list> (C<\@regex_term_list> is the same term list
as regular expressions) and selected in a previous step
(C<\%idtrm_select>). The selected terms are recorded, with their
offset and additional information, either in the array
C<@tab_results> (each value contains the sentence id, the selected
term and the additional information, separated by tabulations) or in
the hash table C<%tabh_results> (keys have the form
"sentenceid_selectedterm", values are array references containing the
sentence id, the selected term and the additional information).

The parameter C<$caseSensitive> controls whether the term matching is
case sensitive. If its value is strictly less than 0, matching is case
insensitive for all terms. If it is equal to 0, matching is case
sensitive for all terms. If it is strictly greater than 0, matching is
case sensitive only for the terms whose length is at most
C<$caseSensitive> characters; longer terms are matched case
insensitively.
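
Both result containers can be processed in the same way once their
rows are normalised. The following sketch uses hand-written,
hypothetical results shaped as described above; the function
C<terms_by_sentence> is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical results, shaped as the documentation describes.
my @tab_results = ("1\tsporulation\tsporulation", "1\tgenes\tgene");
my %tabh_results = (
    '1_sporulation' => [1, 'sporulation', 'sporulation'],
    '1_genes'       => [1, 'genes', 'gene'],
);

# Group the matched terms by sentence id, whichever container is given.
sub terms_by_sentence {
    my ($results) = @_;
    my @rows;
    if (ref($results) eq 'HASH') {
        @rows = values %$results;    # values are already array references
    } else {
        @rows = map { my @f = split /\t/, $_, -1; \@f } @$results;
    }
    my %by_sentence;
    push @{ $by_sentence{ $_->[0] } }, $_->[1] for @rows;
    return \%by_sentence;
}

for my $results (\@tab_results, \%tabh_results) {
    my $grouped = terms_by_sentence($results);
    for my $sent_id (sort keys %$grouped) {
        print "$sent_id: ", join(', ', sort @{ $grouped->{$sent_id} }), "\n";
    }
}
```

The hash form is convenient when the same term may be found twice for
a sentence (for instance in the corpus and in its lemmatised version),
since the key "sentenceid_selectedterm" keeps a single entry.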

=head2 printMatchingTerm


