Algorithm-VSM
lib/Algorithm/VSM.pm
    my $min = $self->{_min_word_length};
    if ($self->{_break_camelcased_and_underscored}) {
        # Split the query string on non-word characters, underscores, and whitespace,
        # then extract word pieces from each token with the module's word regex:
        my @brokenup = grep $_, split /\W|_|\s+/, "@$query";
        @clean_words = map {$_ =~ /$_regex/g} @brokenup;
        # Keep only pieces with at least $min alphanumeric characters, lower-cased:
        @clean_words = grep $_, map {$_ =~ /([[:lower:]0-9]{$min,})/i; $1 ? "\L$1" : ''} @clean_words;
    } else {
        # Split only on quotes, periods, parentheses, brackets, slashes, and whitespace:
        my @brokenup = split /\"|\'|\.|\(|\)|\[|\]|\\|\/|\s+/, "@$query";
        @clean_words = grep $_, map { /([a-z0-9_]{$min,})/i; $1 } @brokenup;
    }
    $query = \@clean_words;
    print "\nYour processed query words are: @$query\n" if $self->{_debug};
    die "Your vocabulary histogram is empty"
        unless scalar(keys %{$self->{_vocab_hist}});
    die "You must first construct an LSA model"
        unless scalar(keys %{$self->{_doc_vecs_trunc_lsa}});
    # Initialize the query vector over the full vocabulary, then count each
    # query word that actually occurs in the vocabulary:
    foreach ( keys %{$self->{_vocab_hist}} ) {
        $self->{_query_vector}->{$_} = 0;
    }
    foreach (@$query) {
        $self->{_query_vector}->{"\L$_"}++
            if exists $self->{_vocab_hist}->{"\L$_"};
    }
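As a standalone sketch of what the word-breaking branch above does (this is not
part of the module; it assumes a minimum word length of 4 and omits the module's
internal camelcase-splitting regex, so it mirrors only the intent of the length
filter):

    my @sample = ('getAllChars', 'IO_Handler', 'x');
    # Split on non-word characters, underscores, and whitespace:
    my @pieces = grep $_, split /\W|_|\s+/, "@sample";      # getAllChars, IO, Handler, x
    # Keep only pieces with at least 4 alphanumeric characters, lower-cased:
    my @words  = grep { length }
                 map  { my ($w) = /([[:lower:]0-9]{4,})/i; defined $w ? lc $w : '' }
                 @pieces;
    # @words now contains 'getallchars' and 'handler'; 'IO' and 'x' are dropped.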
lib/Algorithm/VSM.pm
print "\n\nFor query $query, precision values: @Precision_values\n"
if $self->{_debug};
print "\nFor query $query, recall values: @Recall_values\n"
if $self->{_debug};
$self->{_precision_for_queries}->{$query} = \@Precision_values;
my $avg_precision;
$avg_precision += $_ for @Precision_values;
$self->{_avg_precision_for_queries}->{$query} += $avg_precision / (1.0 * @Precision_values);
$self->{_recall_for_queries}->{$query} = \@Recall_values;
}
print "\n\n========= query by query processing for Precision vs. Recall calculations finished ========\n\n"
if $self->{_debug};
my @avg_precisions;
foreach (keys %{$self->{_avg_precision_for_queries}}) {
push @avg_precisions, $self->{_avg_precision_for_queries}->{$_};
}
$self->{_map} += $_ for @avg_precisions;
$self->{_map} /= scalar keys %{$self->{_queries_for_relevancy}};
}
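The last two statements above compute the mean average precision (MAP) over all
the queries. A quick, hypothetical illustration of that arithmetic (the
per-query average precisions below are made up):

    # Made-up average precisions for three queries:
    my %avg_precision_for_query = ( q1 => 0.80, q2 => 0.50, q3 => 0.65 );
    my $map = 0;
    $map += $_ for values %avg_precision_for_query;
    $map /= scalar keys %avg_precision_for_query;   # MAP = (0.80 + 0.50 + 0.65) / 3 = 0.65
    print "MAP: $map\n";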
sub display_average_precision_for_queries_and_map {
lib/Algorithm/VSM.pm
the same size, the size of the vocabulary. An element of a vector is the frequency
of occurrence of the word corresponding to that position in the vector.
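As a toy illustration of that representation (the vocabulary and the counts
below are made up), each document vector can be pictured as a word-to-frequency
map over the shared vocabulary:

    # A made-up five-word vocabulary shared by all documents and queries:
    my @vocabulary = qw(array class memory pointer string);
    # A document vector records a frequency for every vocabulary word,
    # with zero for the words the document does not contain:
    my %doc_vector = (array => 3, class => 0, memory => 1, pointer => 0, string => 7);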
LSA modeling is a small variation on VSM modeling. It takes VSM modeling one
step further by subjecting the term-frequency matrix for the corpus to singular
value decomposition (SVD). By retaining only a subset of the singular values
(usually the N largest for some value of N), you can construct
reduced-dimensionality vectors for the documents and the queries. In VSM, as
mentioned above, the size of the document and query vectors equals the size of
the vocabulary. For large corpora, that size can run to tens of thousands of
words, which slows down both VSM model construction and retrieval. So you are
likely to get faster performance with retrieval based on LSA modeling,
especially if you store the model, once constructed, in a database file on disk
and carry out retrievals from the disk-based model.
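Here is a minimal sketch of LSA-based retrieval with this module. The
constructor parameters and method calls follow the module's synopsis; the
corpus directory, the file types, and the query are made up for illustration:

    use Algorithm::VSM;

    my $lsa = Algorithm::VSM->new(
                   corpus_directory      => "corpus",          # made-up directory
                   file_types            => ['.txt', '.java'],
                   lsa_svd_threshold     => 0.01,   # controls how many singular values are kept
                   max_number_retrievals => 10,
                   min_word_length       => 4,
                   want_stemming         => 1,
              );
    $lsa->get_corpus_vocabulary_and_word_counts();
    $lsa->generate_document_vectors();
    $lsa->construct_lsa_model();

    my @query      = qw( string getAllChars IOException );
    my $retrievals = $lsa->retrieve_with_lsa( \@query );
    $lsa->display_retrievals( $retrievals );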
=head1 CAN THIS MODULE BE USED FOR GENERAL TEXT RETRIEVAL?
This module has only been tested for software retrieval. For more general text
retrieval, you would need to replace the simple stemmer used in the module with
one based on, say, the Porter Stemming Algorithm. You would also need to vastly expand the
lib/Algorithm/VSM.pm
'examples' directory.
=item I<relevancy_file:>
This option names the disk file for storing the relevancy judgments.
=item I<relevancy_threshold:>
The constructor parameter B<relevancy_threshold> controls the automatic
determination of document relevancies to queries on the basis of the number of
query words that occur in a document. A document is considered relevant to a
query only when it contains at least B<relevancy_threshold> query words, as
shown in the sketch following these parameter descriptions.
=item I<save_model_on_disk:>
The constructor parameter B<save_model_on_disk> causes the basic information
about the VSM and the LSA models to be stored on disk. Subsequently, any
retrievals can be carried out from the disk-based model.
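A sketch of how the relevancy-related parameters described above fit together
for a precision-versus-recall evaluation. The method names are the ones used in
the module's example scripts; the corpus directory and file names are made up:

    use Algorithm::VSM;

    my $vsm = Algorithm::VSM->new(
                   corpus_directory    => "corpus",            # made-up paths
                   query_file          => "test_queries.txt",
                   relevancy_file      => "relevancy.txt",
                   relevancy_threshold => 5,    # a doc needs at least 5 query words to be relevant
                   min_word_length     => 4,
                   want_stemming       => 1,
              );
    $vsm->get_corpus_vocabulary_and_word_counts();
    $vsm->generate_document_vectors();
    $vsm->estimate_doc_relevancies();                 # stores the judgments in relevancy_file
    $vsm->precision_and_recall_calculator('vsm');
    $vsm->display_average_precision_for_queries_and_map();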