Algorithm-VSM

        # examples directory.  This script makes calls to
        #
        #   $vsm->pairwise_similarity_for_docs($docs[$i], $docs[$j]);        
        #
        # for every pair of documents.

=head1 CHANGES

Version 1.70: All of the changes made in this version affect only the part of the
module that calculates the precision-vs.-recall curve for the estimation of
MAP (Mean Average Precision).  The new formulas that go into estimating MAP are
presented in the author's tutorial on significance testing.  Additionally, when
estimating the average retrieval precision for a query, this version explicitly
disregards all documents that have zero similarity with the query.
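To illustrate the effect of disregarding zero-similarity documents, here is a minimal sketch of the textbook average-precision calculation for a single query.  It does not reproduce the formulas from the author's tutorial or the module's own code; the function name C<average_precision>, the data layout, and the document names are assumptions made for this example.

```perl
use strict;
use warnings;

# Textbook average precision for one query (an illustrative assumption,
# not the module's implementation).
#   $ranked   : array ref of [doc_name, similarity] pairs, sorted by
#               decreasing similarity to the query
#   $relevant : hash ref whose keys are the relevant document names
sub average_precision {
    my ($ranked, $relevant) = @_;

    # Documents with zero similarity are treated as not retrieved.
    my @retrieved = grep { $_->[1] > 0 } @$ranked;

    my $num_relevant = scalar keys %$relevant;
    return 0 if $num_relevant == 0;

    my ($hits, $precision_sum) = (0, 0);
    for my $rank (1 .. scalar @retrieved) {
        my $doc = $retrieved[$rank - 1][0];
        next unless $relevant->{$doc};
        $hits++;
        $precision_sum += $hits / $rank;   # precision at this rank
    }
    return $precision_sum / $num_relevant;
}

# Hypothetical ranked list; d4.txt is relevant but has zero similarity.
my @ranked = (
    ['d1.txt', 0.9],
    ['d2.txt', 0.5],
    ['d3.txt', 0.2],
    ['d4.txt', 0.0],
);
my %relevant = map { $_ => 1 } qw(d1.txt d3.txt d4.txt);

my $ap = average_precision(\@ranked, \%relevant);
printf "average precision: %.4f\n", $ap;
```

In this sketch, d4.txt never contributes a precision term because it is filtered out before ranking, which is the behavior the paragraph above describes.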

Version 1.62 removes the Perl version restriction on the module. This version also
fixes two bugs, one in the file scanner code and the other in the
precision-and-recall calculator.  The file scanner bug was related to the new
constructor parameter C<case_sensitive> that was introduced in Version 1.61.  The
precision-and-recall calculator bug was triggered when a query consisted solely of
non-vocabulary words.

Version 1.4 makes it easier for a user to calculate a similarity matrix over all the
documents in the corpus. The elements of such a matrix express pairwise similarities
between the documents.  The pairwise similarities are based on the dot product of two
document vectors divided by the product of the vector magnitudes, that is, on the
cosine of the angle between the two vectors.  The 'examples'
directory contains two scripts to illustrate how such matrices can be calculated by
the user.  The similarity matrix is output as a CSV file.
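The similarity measure described above can be sketched in a few lines of Perl.  This is only an illustration under stated assumptions, not the code in the 'examples' directory: the function name C<cosine_similarity>, the toy term-frequency vectors, and the CSV layout are all made up for this example.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Dot product of two equal-length vectors divided by the product of
# their magnitudes.  Returns 0 if either vector has zero magnitude.
sub cosine_similarity {
    my ($u, $v) = @_;
    my $dot   = sum(0, map { $u->[$_] * $v->[$_] } 0 .. $#$u);
    my $mag_u = sqrt(sum(0, map { $_ ** 2 } @$u));
    my $mag_v = sqrt(sum(0, map { $_ ** 2 } @$v));
    return 0 if $mag_u == 0 || $mag_v == 0;
    return $dot / ($mag_u * $mag_v);
}

# Hypothetical term-frequency vectors over a shared four-word vocabulary.
my %doc_vectors = (
    'a.txt' => [1, 2, 0, 0],
    'b.txt' => [2, 4, 0, 0],
    'c.txt' => [0, 0, 3, 1],
);

# Build the similarity matrix as CSV lines: a header row of document
# names, then one row per document.
my @names     = sort keys %doc_vectors;
my @csv_lines = (join(',', '', @names));
for my $row (@names) {
    my @cells;
    for my $col (@names) {
        my $sim = cosine_similarity($doc_vectors{$row}, $doc_vectors{$col});
        push @cells, sprintf('%.4f', $sim);
    }
    push @csv_lines, join(',', $row, @cells);
}
print "$_\n" for @csv_lines;
```

Because a.txt and b.txt point in the same direction, their similarity is 1 even though their magnitudes differ; a.txt and c.txt share no words, so their similarity is 0.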

Version 1.3 incorporates IDF (Inverse Document Frequency) weighting of the words in a
document file. This means that words that appear in most of the documents
get reduced weighting, since such words do little to discriminate between documents
during retrieval. A typical formula that is used to calculate the IDF
weight for a word is the logarithm of the ratio of the total number of documents to
the number of documents in which the word appears.  So if a word were to appear in
all the documents, its IDF multiplier would be zero in the vector representation of a
document.  If so desired, you can turn off the IDF weighting of the words by
explicitly setting the constructor parameter C<use_idf_filter> to zero.
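The typical IDF formula mentioned above, the logarithm of the total number of documents divided by the number of documents containing the word, can be sketched as follows.  This is a minimal illustration, not the module's implementation; the function name C<idf_weights> and the toy corpus are assumptions made for this example.

```perl
use strict;
use warnings;

# Hypothetical corpus: each document is reduced to its list of words.
my %corpus = (
    'a.txt' => [qw(the parser reads tokens)],
    'b.txt' => [qw(the lexer emits tokens)],
    'c.txt' => [qw(the cache stores results)],
);

# Returns a hash ref mapping each word to
#   log( total number of documents / number of documents containing it )
sub idf_weights {
    my ($corpus) = @_;
    my $num_docs = scalar keys %$corpus;

    # Count, for each word, the number of documents in which it appears.
    my %doc_freq;
    for my $words (values %$corpus) {
        my %seen;
        for my $word (@$words) {
            next if $seen{$word}++;   # count each word once per document
            $doc_freq{$word}++;
        }
    }

    my %idf;
    $idf{$_} = log($num_docs / $doc_freq{$_}) for keys %doc_freq;
    return \%idf;
}

my $idf = idf_weights(\%corpus);
printf "%-8s %.4f\n", $_, $idf->{$_} for sort keys %$idf;
```

Here the word "the" appears in all three documents, so its weight is log(3/3) = 0, while "cache" appears in only one and gets the largest weight, log(3).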

Version 1.2 includes a code correction and some general code and documentation
cleanup.

With Version 1.1, you can access the retrieval precision results so that you can
