Algorithm-VSM


lib/Algorithm/VSM.pm


## IMPORTANT: This estimation of document relevancies to queries is NOT for
##            serious work.  A document is considered relevant to a query if
##            it contains several of the query words.  The minimum number of
##            query words that must exist in a document in order for it to
##            be considered relevant is determined by the
##            relevancy_threshold parameter in the VSM constructor.  (See
##            the relevancy and precision-recall related scripts in the
##            'examples' directory.)  The function shown below is not for
##            serious work because ultimately it is humans who are the best
##            judges of the relevancies of documents to queries; they bring
##            to bear semantic considerations on the relevancy
##            determination problem that are beyond the scope of this
##            module.

sub estimate_doc_relevancies {
    my $self = shift;
    die "You did not set the 'query_file' parameter in the constructor"
        unless $self->{_query_file};
    # Use the three-argument form of open so that special characters in
    # the filename cannot change the open mode:
    open( IN, '<', $self->{_query_file} )
               or die "unable to open the query file $self->{_query_file}: $!";


In one set of bug-localization experiments, a corpus of 6546 documents yielded a
vocabulary size of 7553 unique words. (Here is a link to this work:
L<http://portal.acm.org/citation.cfm?id=1985451>.  Also note that the iBUGS dataset
was originally put together by V. Dallmeier and T. Zimmermann for the evaluation of
automated bug detection and localization tools.)  If C<V> is the size of the
vocabulary and C<M> the number of documents in the corpus, the size of each document
vector will be C<V> and the size of the term-frequency matrix for the entire corpus
will be C<V>xC<M>.  So if you were to duplicate the bug localization experiments in
L<http://portal.acm.org/citation.cfm?id=1985451>, you would be dealing with vectors of
size 7553 and a term-frequency matrix of size 7553x6546.  Extrapolate these numbers
to really large libraries/corpora and you are obviously looking at very large matrices
for singular value decomposition (SVD).  For large libraries/corpora, it would be best to store away
the model in a disk file and to base all subsequent retrievals on the disk-stored
models.  The 'examples' directory contains scripts that carry out retrievals on the
basis of disk-based models.  Further speedup in retrieval can be achieved by using
LSA to create reduced-dimensionality representations for the documents and by basing
retrievals on the stored versions of such reduced-dimensionality representations.
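
As a rough illustration of the disk-based workflow, consider the sketch below, which
is modeled on the scripts in the 'examples' directory.  All parameter and method
names shown here (C<save_model_on_disk>, C<corpus_vocab_db>, C<doc_vectors_db>,
C<upload_vsm_model_from_disk()>, and so on) are assumptions to be verified against
those scripts before you rely on them:

    ## First run: build the model and save it to disk (names assumed
    ## from the 'examples' scripts):
    use Algorithm::VSM;

    my $vsm = Algorithm::VSM->new(
                  corpus_directory   => "corpus",
                  file_types         => ['.txt', '.java'],
                  save_model_on_disk => 1,
                  corpus_vocab_db    => "corpus_vocab_db",
                  doc_vectors_db     => "doc_vectors_db",
              );
    $vsm->get_corpus_vocabulary_and_word_counts();
    $vsm->generate_document_vectors();

    ## Later runs: skip the corpus scan and retrieve against the
    ## disk-stored model:
    my $restored = Algorithm::VSM->new(
                       corpus_vocab_db => "corpus_vocab_db",
                       doc_vectors_db  => "doc_vectors_db",
                   );
    $restored->upload_vsm_model_from_disk();
    my @query      = qw( string buffer allocation );
    my $retrievals = $restored->retrieve_with_vsm( \@query );
    $restored->display_retrievals( $retrievals );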


=head1 ESTIMATING RETRIEVAL PERFORMANCE WITH PRECISION VS. RECALL CALCULATIONS

The performance of a retrieval algorithm is typically measured by two properties:
precision and recall.


For a purely random retrieval from a corpus, the value of the Mean Average Precision
(MAP) metric will be inversely proportional to the size of the corpus.  (See the
discussion in
L<https://engineering.purdue.edu/kak/Tutorials/SignificanceTesting.pdf> for further
explanation of these retrieval precision evaluators.)
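
To make the MAP metric concrete, here is a small self-contained Perl sketch, not
part of this module's API, that computes the average precision for a single ranked
retrieval list against a set of human-supplied relevant documents.  MAP is then
simply the mean of these per-query values:

    ## A toy illustration of Average Precision: for a ranked list of
    ## retrieved documents, sum the precision values at the ranks where
    ## relevant documents occur and divide by the number of relevant docs.
    use strict;
    use warnings;

    sub average_precision {
        my ($ranked, $relevant) = @_;
        my %is_relevant = map { $_ => 1 } @$relevant;
        my ($hits, $sum_precision) = (0, 0);
        for my $rank (1 .. @$ranked) {
            if ( $is_relevant{ $ranked->[$rank - 1] } ) {
                $hits++;
                $sum_precision += $hits / $rank;
            }
        }
        return $hits ? $sum_precision / @$relevant : 0;
    }

    ## Relevant docs at ranks 1 and 3:  AP = (1/1 + 2/3) / 2 = 0.833
    print average_precision( ['d1','d2','d3','d4'], ['d1','d3'] ), "\n";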

This module includes methods that allow you to carry out these retrieval accuracy
measurements using the relevancy judgments supplied through a disk file.  If
human-supplied relevancy judgments are not available, the module will be happy to
estimate relevancies for you just by determining the number of query words that exist
in a document.  Note, however, that relevancy judgments estimated in this manner
cannot be trusted.  That is because ultimately it is humans who are the best judges
of the relevancies of documents to queries; they bring to bear semantic
considerations on the relevancy determination problem that are beyond the scope of
this module.


=head1 METHODS

The module provides the following methods for constructing VSM and LSA models of a
corpus, for using the models thus constructed for retrieval, and for carrying out
precision versus recall calculations for the determination of retrieval accuracy on
a set of test queries.


If human-supplied relevancy judgments are not available, you can invoke the
following method to estimate them:

    $vsm->estimate_doc_relevancies();

For the above method call, a document is considered to be relevant to a query if it
contains several of the query words.  The minimum number of query words that must
exist in a document in order for it to be considered relevant is determined by the
C<relevancy_threshold> parameter in the VSM constructor.

But note that this estimation of document relevancies to queries is NOT for serious
work.  That is because ultimately it is humans who are the best judges of the
relevancies of documents to queries; they bring to bear semantic considerations on
the relevancy determination problem that are beyond the scope of this module.

The generated relevancies are deposited in a file named by the constructor parameter
C<relevancy_file>.
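
As a concrete illustration, a relevancy-estimation run might look like the sketch
below.  The C<relevancy_threshold>, C<query_file>, and C<relevancy_file> parameters
are the ones described above; the C<corpus_directory> parameter and the
model-construction and display calls are assumptions modeled on the scripts in the
'examples' directory:

    ## A sketch of relevancy estimation (check the 'examples' directory
    ## for the authoritative versions of these names):
    my $vsm = Algorithm::VSM->new(
                  corpus_directory    => "corpus",
                  query_file          => "test_queries.txt",
                  relevancy_file      => "estimated_relevancies.txt",
                  relevancy_threshold => 5,
              );
    $vsm->get_corpus_vocabulary_and_word_counts();
    $vsm->generate_document_vectors();
    ## A document needs at least 5 of a query's words to count as relevant:
    $vsm->estimate_doc_relevancies();
    ## Show the estimated relevancies, which are also deposited in the
    ## file named by the relevancy_file parameter:
    $vsm->display_doc_relevancies();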


=item B<get_all_document_names():>


=item B<For Precision and Recall Calculations for LSA with Estimated Relevancies:>

To carry out precision and recall calculations for LSA with the module's own
estimated relevancies, run the script:

    calculate_precision_and_recall_for_LSA.pl

Note that this script carries out its own estimation of relevancy judgments, which
in most cases would not be a safe thing to do.
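
For orientation, the core of such a script is presumably a sequence along the
following lines.  This is only a sketch; the LSA-related parameter and method names
(C<lsa_svd_threshold>, C<construct_lsa_model()>, C<precision_and_recall_calculator()>,
C<display_average_precision_for_queries_and_map()>) are assumptions to be checked
against the script itself:

    ## Approximate skeleton of an LSA precision-and-recall run that
    ## relies on estimated (not human-supplied) relevancy judgments:
    my $lsa = Algorithm::VSM->new(
                  corpus_directory    => "corpus",
                  query_file          => "test_queries.txt",
                  relevancy_file      => "estimated_relevancies.txt",
                  relevancy_threshold => 5,
                  lsa_svd_threshold   => 0.01,
              );
    $lsa->get_corpus_vocabulary_and_word_counts();
    $lsa->generate_document_vectors();
    $lsa->construct_lsa_model();
    $lsa->estimate_doc_relevancies();
    $lsa->precision_and_recall_calculator('lsa');
    $lsa->display_average_precision_for_queries_and_map();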

=item B<For Precision and Recall Calculations for VSM with
Human-Supplied Relevancies:>

Precision and recall calculations for retrieval accuracy determination are best
carried out with human-supplied judgments of relevancies of the documents to queries.
If such judgments are available, run the script:

    calculate_precision_and_recall_from_file_based_relevancies_for_VSM.pl

This script will print out the average precisions for the different test queries and
calculate the MAP metric of retrieval accuracy.
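
The heart of that script is presumably a sequence along these lines, again as a
sketch whose method names, in particular
C<upload_document_relevancies_from_file()>, should be confirmed against the script:

    ## Approximate skeleton of a VSM precision-and-recall run based on
    ## human-supplied relevancy judgments read from a disk file:
    my $vsm = Algorithm::VSM->new(
                  corpus_directory => "corpus",
                  query_file       => "test_queries.txt",
                  relevancy_file   => "human_relevancies.txt",
              );
    $vsm->get_corpus_vocabulary_and_word_counts();
    $vsm->generate_document_vectors();
    ## Read the human judgments instead of estimating them:
    $vsm->upload_document_relevancies_from_file();
    $vsm->precision_and_recall_calculator('vsm');
    ## Prints per-query average precisions and the overall MAP:
    $vsm->display_average_precision_for_queries_and_map();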

=item B<For Precision and Recall Calculations for LSA with
Human-Supplied Relevancies:>


