Algorithm-VSM


  # FOR MEASURING PRECISION VERSUS RECALL FOR LSA:

        my $corpus_dir      = "corpus";
        my $stop_words_file = "stop_words.txt";
        my $query_file      = "test_queries.txt";
        my $relevancy_file  = "relevancy.txt";
        my $lsa = Algorithm::VSM->new(
                            break_camelcased_and_underscored => 1,
                            case_sensitive      => 0,
                            corpus_directory    => $corpus_dir,
                            file_types          => ['.txt', '.java'],
                            lsa_svd_threshold   => 0.01,
                            min_word_length     => 4,
                            query_file          => $query_file,
                            relevancy_file      => $relevancy_file,
                            stop_words_file     => $stop_words_file,
                            want_stemming       => 1,
        );
        $lsa->get_corpus_vocabulary_and_word_counts();
        $lsa->generate_document_vectors();
        $lsa->upload_document_relevancies_from_file();  
        $lsa->display_doc_relevancies();
        $lsa->precision_and_recall_calculator('vsm');
        $lsa->display_precision_vs_recall_for_queries();
        $lsa->display_average_precision_for_queries_and_map();

    As mentioned for the previous code block, the file supplied through the
    constructor parameter 'relevancy_file' must contain relevance judgments for
    the queries named in the file supplied through the parameter 'query_file'.
    The format of this file must match what is shown in the sample file
    'relevancy.txt' in the 'examples' directory.  We have already explained the
    roles played by constructor parameters such as 'lsa_svd_threshold'.
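The precision-and-recall machinery invoked by the calls above can be summarized in a short sketch.  The following Python code (illustrative only, not the module's Perl implementation; all names are made up for this example) shows how average precision for one query is computed from a ranked retrieval list and a set of relevance judgments, and how MAP is the mean of those values over the queries:

```python
# Sketch of average precision and MAP.  Average precision for a query is the
# mean of the precision values at each rank where a relevant document appears.

def average_precision(ranked_docs, relevant):
    hits, precisions = 0, []
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)   # precision at this rank
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(results_per_query, judgments):
    aps = [average_precision(results_per_query[q], judgments[q])
           for q in results_per_query]
    return sum(aps) / len(aps)

ranked = ["d3", "d1", "d7", "d2"]           # retrieval order for one query
relevant = {"d1", "d2"}                     # relevance judgments for that query
print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 2 = 0.5
```

Relevant documents found early in the ranking contribute large precision values, which is why average precision rewards retrieval algorithms that rank relevant documents near the top.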



  # FOR MEASURING THE SIMILARITY MATRIX FOR A SET OF DOCUMENTS:

        my $corpus_dir = "corpus";
        my $stop_words_file = "stop_words.txt";
        my $vsm = Algorithm::VSM->new(
                   break_camelcased_and_underscored => 1,
                   case_sensitive           => 0,
                   corpus_directory         => $corpus_dir,
                   file_types               => ['.txt', '.java'],
                   min_word_length          => 4,
                   stop_words_file          => $stop_words_file,
                   want_stemming            => 1,
        );
        $vsm->get_corpus_vocabulary_and_word_counts();
        $vsm->generate_document_vectors();
        # code for calculating pairwise similarities as shown in the
        # script calculate_similarity_matrix_for_all_docs.pl in the
        # examples directory.  This script makes calls to
        #
        #   $vsm->pairwise_similarity_for_docs($docs[$i], $docs[$j]);        
        #
        # for every pair of documents.
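What C<pairwise_similarity_for_docs()> computes for each pair is the cosine of the angle between two document vectors: their dot product divided by the product of their magnitudes.  The following Python sketch (illustrative only, not the module's code; the toy vectors are made up) builds a full similarity matrix that way:

```python
# Cosine similarity between term-frequency vectors, collected into a matrix.
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

docs = {"doc1": [2, 0, 1], "doc2": [1, 1, 0], "doc3": [0, 3, 1]}
names = sorted(docs)
matrix = [[cosine_similarity(docs[a], docs[b]) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(name, [round(x, 3) for x in row])
```

The matrix is symmetric with ones on the diagonal, since every document has cosine similarity 1 with itself.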

=head1 CHANGES

Version 1.70: All of the changes made in this version affect only that part of
the module that is used for calculating the precision-vs.-recall curve for the
estimation of
MAP (Mean Average Precision).  The new formulas that go into estimating MAP are
presented in the author's tutorial on significance testing.  Additionally, when
estimating the average retrieval precision for a query, this version explicitly
disregards all documents that have zero similarity with the query.

Version 1.62 removes the Perl version restriction on the module. This version also
fixes two bugs, one in the file scanner code and the other in the
precision-and-recall calculator.  The file scanner bug was related to the new
constructor parameter C<case_sensitive> that was introduced in Version 1.61.  And the
precision-and-recall calculator bug was triggered if a query consisted solely of
non-vocabulary words.

Version 1.61 improves the implementation of the directory scanner to make it more
platform independent.  Additionally, you are now required to specify in the
constructor call the file types to be considered for computing the database model.
If, say, you have a large software library and you want only Java and text files to
be scanned for creating the VSM (or the LSA) model, you must supply that information
to the module by setting the constructor parameter C<file_types> to the anonymous
list C<['.java', '.txt']>.  An additional constructor parameter introduced in this
version is C<case_sensitive>.  If you set it to 1, that will force the database model
and query matching to become case sensitive.

Version 1.60 reflects the fact that people are now more likely to use this
module by keeping the model constructed for a corpus in fast memory (as opposed
to storing the model in disk-based hash tables) so that it can be invoked
repeatedly for different queries.  As a result, the default value for the
constructor option
C<save_model_on_disk> was changed from 1 to 0.  For those who still wish to store on
a disk the model that is constructed, the script
C<retrieve_with_VSM_and_also_create_disk_based_model.pl> shows how you can do that.
Other changes in 1.60 include a slight reorganization of the scripts in the
C<examples> directory.  Most scripts now do not by default store their models in
disk-based hash tables.  This reorganization is reflected in the description of the
C<examples> directory in this documentation.  The basic logic of constructing VSM and
LSA models and how these are used for retrievals remains unchanged.

Version 1.50 incorporates a couple of new features: (1) You now have the option to
split camel-cased and underscored words for constructing your vocabulary set; and (2)
Storing the VSM and LSA models in database files on the disk is now optional.  The
second feature, in particular, should prove useful to those who are using this module
for large collections of documents.

Version 1.42 includes two new methods, C<display_corpus_vocab_size()> and
C<write_corpus_vocab_to_file()>, for those folks who deal with very large datasets.
You can get a better sense of the overall vocabulary being used by the module for
file retrieval by examining the contents of a dump file whose name is supplied as an
argument to C<write_corpus_vocab_to_file()>.

Version 1.41 lowers the required version of the PDL module. Also cleaned up are
the dependencies between this module and the submodules of PDL.

Version 1.4 makes it easier for a user to calculate a similarity matrix over all the
documents in the corpus. The elements of such a matrix express pairwise similarities
between the documents.  The pairwise similarities are based on the dot product of two
document vectors divided by the product of the vector magnitudes.  The 'examples'
directory contains two scripts to illustrate how such matrices can be calculated by
the user.  The similarity matrix is output as a CSV file.

Version 1.3 incorporates IDF (Inverse Document Frequency) weighting of the words in a
document file. What that means is that the words that appear in most of the documents
get reduced weighting since such words are non-discriminatory with respect to the
retrieval of the documents. A typical formula that is used to calculate the IDF
weight for a word is the logarithm of the ratio of the total number of documents to
the number of documents in which the word appears.  So if a word were to appear in
all the documents, its IDF multiplier would be zero in the vector representation of a
document.  If so desired, you can turn off the IDF weighting of the words by
explicitly setting the constructor parameter C<use_idf_filter> to zero.
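The IDF formula described above can be stated in a few lines of Python (a sketch with a made-up toy corpus, not the module's implementation): the weight of a word is the logarithm of the ratio of the total number of documents to the number of documents containing the word.

```python
# idf(w) = log(N / df(w)); a word present in every document gets weight 0.
import math

docs = [{"program", "java"}, {"program", "text"}, {"program", "query"}]
N = len(docs)

def idf(word):
    df = sum(word in d for d in docs)   # number of docs containing the word
    return math.log(N / df) if df else 0.0

print(idf("program"))           # appears in all 3 docs -> log(3/3) = 0.0
print(round(idf("java"), 4))    # appears in 1 of 3 docs -> log(3) = 1.0986
```

A word that occurs in every document thus contributes nothing to any document vector, which is exactly the non-discriminatory behavior the weighting is meant to suppress.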

Version 1.2 includes a code correction and some general code and documentation
cleanup.

With Version 1.1, you can access the retrieval precision results so that you can
compare two different retrieval algorithms (VSM or LSA with different choices for
some of the constructor parameters) with significance testing. (Version 1.0 merely
sent those results to standard output, typically your terminal window.)  In Version
1.1, the new script B<significance_testing.pl> in the 'examples' directory
illustrates significance testing with Randomization and with Student's Paired t-Test.

=head1 DESCRIPTION

B<Algorithm::VSM> is a I<perl5> module for constructing a Vector Space Model (VSM) or
a Latent Semantic Analysis Model (LSA) of a collection of documents, usually referred
to as a corpus, and then retrieving the documents in response to search words in a
query.

VSM and LSA models have been around for a long time in the Information Retrieval (IR)
community.  More recently such models have been shown to be effective in retrieving
files/documents from software libraries. For an account of this research that was
presented by Shivani Rao and the author of this module at the 2011 Mining Software
Repositories conference, see L<http://portal.acm.org/citation.cfm?id=1985451>.

VSM modeling consists of: (1) Extracting the vocabulary used in a corpus.  (2)
Stemming the words so extracted and eliminating the designated stop words from the
vocabulary.  Stemming means that closely related words like 'programming' and
'programs' are reduced to the common root word 'program' and the stop words are the
non-discriminating words that can be expected to exist in virtually all the
documents. (3) Constructing document vectors for the individual files in the corpus
--- the document vectors taken together constitute what is usually referred to as a
'term-frequency' matrix for the corpus. (4) Normalizing the document vectors to
factor out the effect of document size and, if desired, multiplying the term
frequencies by the IDF (Inverse Document Frequency) values for the words to reduce
the weight of the words that appear in a large number of documents. (5)
Constructing a query vector for the search query after the query has been
subjected to the same stemming and stop-word elimination rules that were applied
to the corpus. And, lastly, (6) Using a similarity metric to return the set of
documents that are most similar to the query vector.  The most commonly used
similarity metric is based on the cosine of the angle between the two vectors.
Also note that all the vectors mentioned here are of the same size, the size of
the vocabulary.  An element of a vector is the frequency of occurrence of the
word corresponding to that position in the vector.
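The six steps above can be sketched in miniature.  The Python below is illustrative only (stemming is omitted, and the corpus, stop-word list, and function names are all made up for this example), but it follows the same pipeline: tokenize and drop stop words, build normalized term-frequency vectors over the vocabulary, treat the query the same way, and rank by cosine similarity.

```python
# Minimal VSM pipeline: vocabulary -> document vectors -> query vector -> ranking.
import math
from collections import Counter

corpus = {"doc1": "the program reads the query file",
          "doc2": "stop words are removed from the query"}
stop_words = {"the", "are", "from"}

def tokens(text):                  # steps (1)-(2): tokenize, drop stop words
    return [w for w in text.lower().split() if w not in stop_words]

vocab = sorted({w for text in corpus.values() for w in tokens(text)})

def doc_vector(text):              # steps (3)-(4): term frequencies, normalized
    counts = Counter(tokens(text))
    vec = [counts[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def cosine(u, v):                  # step (6): cosine similarity
    return sum(a * b for a, b in zip(u, v))   # vectors already unit-length

vectors = {name: doc_vector(text) for name, text in corpus.items()}
query_vec = doc_vector("query file")   # step (5): same rules for the query
ranked = sorted(vectors, key=lambda d: cosine(query_vec, vectors[d]),
                reverse=True)
print(ranked[0])   # doc1 shares both 'query' and 'file' with the query
```

Because every vector is indexed by the same vocabulary, the query and the documents live in the same space and can be compared directly.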

LSA modeling is a small variation on VSM modeling.  It takes VSM modeling one
step further by subjecting the term-frequency matrix for the corpus to singular
value decomposition (SVD).  By retaining only a subset of the singular values
(usually the N largest for some value of N), you can construct
reduced-dimensionality vectors for
N largest for some value of N), you can construct reduced-dimensionality vectors for
the documents and the queries.  In VSM, as mentioned above, the size of the document
and the query vectors is equal to the size of the vocabulary.  For large corpora,
this size may involve tens of thousands of words --- this can slow down the VSM
modeling and retrieval process.  So you are very likely to get faster performance
with retrieval based on LSA modeling, especially if you store the model once
constructed in a database file on the disk and carry out retrievals using the
disk-based model.
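The dimensionality reduction at the heart of LSA can be sketched with NumPy (an illustration of the idea only, not the module's PDL-based implementation; the toy term-frequency matrix is made up): decompose the term-frequency matrix, keep only the largest singular values, and represent each document by a short vector in the reduced space.

```python
# LSA sketch: truncated SVD of a term-frequency matrix.
import numpy as np

# rows = vocabulary words, columns = documents
A = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 2.0, 3.0],
              [1.0, 0.0, 2.0]])

U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                        # keep the k largest singular values
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # rank-k approximation of A

# each document is now a k-dimensional vector instead of a vocabulary-sized one
doc_vectors_reduced = np.diag(S[:k]) @ Vt[:k, :]
print(doc_vectors_reduced.shape)             # (2, 3): k dims, 3 documents
```

Queries are projected into the same k-dimensional space before similarity is computed, which is why retrieval against the reduced vectors can be much faster than against full vocabulary-sized vectors.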


