Algorithm-VSM
lib/Algorithm/VSM.pm view on Meta::CPAN
file_types => ['.txt', '.java'],
lsa_svd_threshold => 0.01,
max_number_retrievals => 10,
min_word_length => 4,
stop_words_file => $stop_words_file,
use_idf_filter => 1,
want_stemming => 1,
);
$lsa->get_corpus_vocabulary_and_word_counts();
$lsa->display_corpus_vocab();
$lsa->display_corpus_vocab_size();
$lsa->write_corpus_vocab_to_file("vocabulary_dump.txt");
$lsa->generate_document_vectors();
$lsa->construct_lsa_model();
my $retrievals = $lsa->retrieve_for_query_with_lsa( \@query );
$lsa->display_retrievals( $retrievals );
The initialization code before the constructor call and the calls for displaying
the vocabulary and the vectors after the call remain the same as for the VSM case
shown previously in this Synopsis. In the call above, the constructor parameter
'lsa_svd_threshold' determines how many of the singular values are retained
after we have carried out an SVD of the term-frequency matrix for the documents
in the corpus. Singular values smaller than this threshold fraction of the
largest singular value are rejected.
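The thresholding rule just described can be illustrated with a minimal sketch.
The helper 'retain_singular_values' and the sample values are hypothetical and
are not part of Algorithm::VSM; the sketch assumes that a value exactly equal to
the threshold fraction of the largest value is kept, which the module may treat
differently.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper (not part of Algorithm::VSM): keep the singular values
# that are at least $threshold times the largest singular value.
sub retain_singular_values {
    my ($threshold, @singular_values) = @_;
    return () unless @singular_values;
    my $largest = max(@singular_values);
    return grep { $_ >= $threshold * $largest } @singular_values;
}

# Made-up singular values, with the same threshold as in the call above.
my @kept = retain_singular_values(0.01, 12.0, 3.5, 0.4, 0.1, 0.02);
print "retained: @kept\n";
```

With a threshold of 0.01 and a largest value of 12.0, the cutoff is 0.12, so
only the first three values survive.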
# FOR MEASURING PRECISION VERSUS RECALL FOR VSM:
my $corpus_dir = "corpus";
my $stop_words_file = "stop_words.txt";
my $query_file = "test_queries.txt";
my $relevancy_file = "relevancy.txt"; # All relevancy judgments
# will be stored in this file
my $vsm = Algorithm::VSM->new(
break_camelcased_and_underscored => 1,
case_sensitive => 0,
corpus_directory => $corpus_dir,
file_types => ['.txt', '.java'],
min_word_length => 4,
query_file => $query_file,
relevancy_file => $relevancy_file,
relevancy_threshold => 5,
stop_words_file => $stop_words_file,
want_stemming => 1,
);
$vsm->get_corpus_vocabulary_and_word_counts();
$vsm->generate_document_vectors();
$vsm->estimate_doc_relevancies();
$vsm->display_doc_relevancies(); # used only for testing
$vsm->precision_and_recall_calculator('vsm');
$vsm->display_precision_vs_recall_for_queries();
$vsm->display_average_precision_for_queries_and_map();
Measuring precision and recall requires a set of queries. These are supplied
through the constructor parameter 'query_file'. The format of this file must
follow the sample file 'test_queries.txt' in the 'examples' directory. The
module estimates the relevancies of the documents to the queries and dumps the
relevancies in the file named by the 'relevancy_file' constructor parameter.
The constructor parameter 'relevancy_threshold' is used to decide which of the
documents are considered relevant to a query. A document must contain at least
'relevancy_threshold' occurrences of query words in order to be considered
relevant to a query.
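The relevancy rule above can be sketched in a few lines. This is only an
illustration under stated assumptions: the helper 'count_query_word_occurrences'
and the toy documents are hypothetical, matching is case-insensitive, and no
stemming, stop-word removal, or camelcase splitting is applied, so the module's
own estimate may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Algorithm::VSM): count how many times any
# of the query words occurs in a piece of text, ignoring case.
sub count_query_word_occurrences {
    my ($text, @query_words) = @_;
    my %is_query_word = map { lc($_) => 1 } @query_words;
    my $count = 0;
    for my $word (split /\W+/, lc $text) {
        $count++ if $is_query_word{$word};
    }
    return $count;
}

my $relevancy_threshold = 5;
my @query = qw(string buffer);

# Toy document contents standing in for files in the corpus directory.
my %docs = (
    'A.java' => 'string buffer append string buffer flush string',
    'B.java' => 'file reader string close',
);

my %occurrences;
for my $doc (sort keys %docs) {
    $occurrences{$doc} = count_query_word_occurrences($docs{$doc}, @query);
    my $verdict = $occurrences{$doc} >= $relevancy_threshold
                ? 'relevant' : 'not relevant';
    print "$doc: $occurrences{$doc} occurrences, $verdict\n";
}
```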
# FOR MEASURING PRECISION VERSUS RECALL FOR LSA:
my $lsa = Algorithm::VSM->new(
break_camelcased_and_underscored => 1,
case_sensitive => 0,
corpus_directory => $corpus_dir,
file_types => ['.txt', '.java'],
lsa_svd_threshold => 0.01,
min_word_length => 4,
query_file => $query_file,
relevancy_file => $relevancy_file,
relevancy_threshold => 5,
stop_words_file => $stop_words_file,
want_stemming => 1,
);
$lsa->get_corpus_vocabulary_and_word_counts();
$lsa->generate_document_vectors();
$lsa->construct_lsa_model();
$lsa->estimate_doc_relevancies();
$lsa->display_doc_relevancies();
$lsa->precision_and_recall_calculator('lsa');
$lsa->display_precision_vs_recall_for_queries();
$lsa->display_average_precision_for_queries_and_map();
We have already explained the purpose of the constructor parameter 'query_file'
and the constraints on the format of the queries in the file named through this
parameter. As mentioned earlier, the module estimates the relevancies of the
documents to the queries and dumps the relevancies in the file named by the
'relevancy_file' constructor parameter. The constructor parameter
'relevancy_threshold' is used in deciding which of the documents are considered
relevant to a query. A document must contain at least 'relevancy_threshold'
occurrences of query words in order to be considered relevant to a query. We
have previously explained the role of the constructor parameter
'lsa_svd_threshold'.
# FOR MEASURING PRECISION VERSUS RECALL FOR VSM USING FILE-BASED RELEVANCE JUDGMENTS:
my $corpus_dir = "corpus";
my $stop_words_file = "stop_words.txt";
my $query_file = "test_queries.txt";
my $relevancy_file = "relevancy.txt";
my $vsm = Algorithm::VSM->new(
break_camelcased_and_underscored => 1,
case_sensitive => 0,
corpus_directory => $corpus_dir,
file_types => ['.txt', '.java'],
min_word_length => 4,
query_file => $query_file,
relevancy_file => $relevancy_file,
stop_words_file => $stop_words_file,
want_stemming => 1,
);
$vsm->get_corpus_vocabulary_and_word_counts();
$vsm->generate_document_vectors();
$vsm->upload_document_relevancies_from_file();
$vsm->display_doc_relevancies();
$vsm->precision_and_recall_calculator('vsm');
$vsm->display_precision_vs_recall_for_queries();
$vsm->display_average_precision_for_queries_and_map();
Now the file named by the constructor parameter 'relevancy_file' must contain
relevance judgments for the queries listed in the file named by the parameter
'query_file'. The format of these two files must follow the sample files
'test_queries.txt' and 'relevancy.txt' in the 'examples' directory.
# FOR MEASURING PRECISION VERSUS RECALL FOR LSA USING FILE-BASED RELEVANCE JUDGMENTS:
my $corpus_dir = "corpus";
my $stop_words_file = "stop_words.txt";
my $query_file = "test_queries.txt";
my $relevancy_file = "relevancy.txt";
my $lsa = Algorithm::VSM->new(
break_camelcased_and_underscored => 1,
case_sensitive => 0,
corpus_directory => $corpus_dir,
file_types => ['.txt', '.java'],
lsa_svd_threshold => 0.01,
min_word_length => 4,
query_file => $query_file,
relevancy_file => $relevancy_file,
stop_words_file => $stop_words_file,
want_stemming => 1,
);
$lsa->get_corpus_vocabulary_and_word_counts();
$lsa->generate_document_vectors();
calculating retrieval performance with C<Precision> and C<Recall> numbers. The format
of the query file must be as shown in the sample file C<test_queries.txt> in the
'examples' directory.
=item I<relevancy_file:>
This option names the disk file for storing the relevancy judgments.
=item I<relevancy_threshold:>
The constructor parameter B<relevancy_threshold> is used for automatic determination
of document relevancies to queries on the basis of the number of occurrences of query
words in a document. You control how strict this determination is through the value
you give to the parameter. A document is considered relevant to a query only when it
contains at least B<relevancy_threshold> occurrences of query words.
=item I<save_model_on_disk:>
The constructor parameter B<save_model_on_disk> will cause the basic
information about the VSM and the LSA models to be stored on the disk.
Subsequently, any retrievals can be carried out from the disk-based model.
=item I<stop_words_file:>
The parameter B<stop_words_file> is for naming the file that contains the stop words
that you do not wish to include in the corpus vocabulary. The format of this file
must be as shown in the sample file C<stop_words.txt> in the 'examples' directory.
=item I<use_idf_filter:>
The constructor parameter B<use_idf_filter> is set by default. If you want
to turn off the normalization of the document vectors, including turning
off the weighting of the term frequencies of the words by their idf values,
you must set this parameter explicitly to 0.
=item I<want_stemming:>
The boolean parameter B<want_stemming> determines whether or not the words extracted
from the documents are subjected to stemming. As mentioned elsewhere, stemming
means that related words like 'programming' and 'programs' are both reduced to
the root word 'program'.
=back
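To make the effect of C<use_idf_filter> more concrete, here is a minimal sketch of
idf weighting followed by normalization of document vectors. It is an illustration
only: the toy term frequencies and the helper C<weight_and_normalize> are
hypothetical, and the sketch assumes the textbook idf, log(N/df), and unit-length
normalization. The formula and normalization actually used by the module may differ.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Toy term-frequency vectors over a shared three-word vocabulary.
my @vocab = qw(buffer file string);
my %tf = (
    'A.java' => [3, 0, 2],
    'B.java' => [0, 4, 1],
    'C.java' => [0, 1, 1],
);
my $num_docs = keys %tf;

# Assumed textbook idf: log(N / df), where df is the number of documents
# that contain the word at least once.
my @idf;
for my $i (0 .. $#vocab) {
    my $df = grep { $_->[$i] > 0 } values %tf;
    $idf[$i] = $df ? log($num_docs / $df) : 0;
}

# Hypothetical helper: weight each term frequency by its idf value, then
# scale the vector to unit length (all-zero vectors are left as they are).
sub weight_and_normalize {
    my ($tf_vec, $idf_vec) = @_;
    my @weighted = map { $tf_vec->[$_] * $idf_vec->[$_] } 0 .. $#$tf_vec;
    my $norm = sqrt(sum(0, map { $_ ** 2 } @weighted));
    return $norm ? [ map { $_ / $norm } @weighted ] : \@weighted;
}

my %normalized;
for my $doc (sort keys %tf) {
    $normalized{$doc} = weight_and_normalize($tf{$doc}, \@idf);
    printf "%-8s %s\n", $doc,
        join(' ', map { sprintf '%.3f', $_ } @{ $normalized{$doc} });
}
```

In this toy corpus the word 'string' occurs in every document, so its idf is zero
and it drops out of every weighted vector.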
=begin html
<br>
=end html
=item B<construct_lsa_model():>
You call this subroutine for constructing an LSA model for your corpus
after you have extracted the corpus vocabulary and constructed document
vectors:
$vsm->construct_lsa_model();
The SVD that is carried out in LSA model construction uses the constructor
parameter C<lsa_svd_threshold> to decide how many of the singular values to
retain for the LSA model. A singular value is retained only if it is larger than
the C<lsa_svd_threshold> fraction of the largest singular value.
=item B<display_average_precision_for_queries_and_map():>
The Average Precision for a query is the average of the Precision-at-rank values
associated with each of the corpus documents relevant to the query. The mean of the
Average Precision values for all the queries is the Mean Average Precision (MAP).
The C<Average Precision> values for the queries and the overall C<MAP> can be printed
out by calling
$vsm->display_average_precision_for_queries_and_map();
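The two definitions above can be worked through with a small sketch. The helper
C<average_precision>, the rankings, and the relevance sets are hypothetical and are
not taken from the module. The sketch assumes that a relevant document missing from
the ranked list contributes a precision of zero.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper (not part of Algorithm::VSM).
# $ranked:   array ref of document names in retrieval order
# $relevant: hash ref whose keys are the documents relevant to the query
sub average_precision {
    my ($ranked, $relevant) = @_;
    my $num_relevant = keys %$relevant;
    return 0 unless $num_relevant;
    my ($hits, $precision_sum) = (0, 0);
    for my $rank (1 .. scalar @$ranked) {
        next unless $relevant->{ $ranked->[$rank - 1] };
        $hits++;
        $precision_sum += $hits / $rank;    # Precision-at-rank for this hit
    }
    return $precision_sum / $num_relevant;
}

# Made-up retrieval results and relevance judgments for two queries.
my %ranked   = ( q1 => [qw(A B C D)],    q2 => [qw(B A D)] );
my %relevant = ( q1 => { A => 1, C => 1 }, q2 => { A => 1, E => 1 } );

my %avg_precision =
    map { $_ => average_precision($ranked{$_}, $relevant{$_}) } keys %ranked;
my $map = sum(values %avg_precision) / keys %avg_precision;

printf "%s: Average Precision = %.4f\n", $_, $avg_precision{$_}
    for sort keys %avg_precision;
printf "MAP = %.4f\n", $map;
```

For q1 the relevant documents appear at ranks 1 and 3, so the Average Precision is
(1/1 + 2/3) / 2. For q2 only one of the two relevant documents is retrieved, at
rank 2, so the Average Precision is (1/2) / 2.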
=item B<display_corpus_vocab():>
If you would like to see the corpus vocabulary as constructed by the previous call,
make the call
$vsm->display_corpus_vocab();
Note that this is a useful thing to do only on small test corpora. If you need
to examine the vocabulary for a large corpus, call the two methods listed below.
=item B<display_corpus_vocab_size():>
If you would like the module to print the size of the vocabulary in your terminal
window, make the call
$vsm->display_corpus_vocab_size();
=item B<display_doc_relevancies():>
If you would like to see the document relevancies generated by the previous method,
you can call
$vsm->display_doc_relevancies();
=item B<display_doc_vectors():>
If you would like to see the document vectors constructed by the previous call, make
the call:
$vsm->display_doc_vectors();
Note that this is a useful thing to do only on small test corpora. If you must call
this method on a large corpus, you might wish to direct the output to a file.
=item B<display_inverse_document_frequencies():>
You can display the idf value associated with each word in the corpus by calling
$vsm->display_inverse_document_frequencies();
The idf of a word in the corpus is calculated typically as the logarithm of the ratio