Algorithm-VSM

lib/Algorithm/VSM.pm

Precision and recall calculations for determining retrieval accuracy are best
carried out with human-supplied judgments of how relevant the documents are to the
queries.  If such judgments are available, run the script:

    calculate_precision_and_recall_from_file_based_relevancies_for_VSM.pl

This script will print out the average precisions for the different test queries and
calculate the MAP metric of retrieval accuracy.

=item B<For Precision and Recall Calculations for LSA with
Human-Supplied Relevancies:>

If human-supplied relevancy judgments are available and you wish to experiment with
precision and recall calculations for LSA-based retrieval, run the script:

    calculate_precision_and_recall_from_file_based_relevancies_for_LSA.pl

This script will print out the average precisions for the different test queries and
calculate the MAP metric of retrieval accuracy.

=item B<To carry out significance tests on the retrieval precision results with
Randomization or with Student's Paired t-Test:>

    significance_testing.pl  randomization

or

    significance_testing.pl  t-test

Significance testing consists of forming a null hypothesis that the two retrieval
algorithms you are considering are the same from a black-box perspective and then
calculating what is known as a C<p-value>.  If the C<p-value> is less than, say,
0.05, you reject the null hypothesis.
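
As a rough illustration of how such a C<p-value> can be obtained, the following is a
minimal sketch of an exact paired randomization (sign-flip) test.  It is not taken
from C<significance_testing.pl>; the per-query average precisions in C<@ap_a> and
C<@ap_b> and the helper C<paired_randomization_p_value()> are made up for this
example.

```perl
use strict;
use warnings;

# Hypothetical per-query average precisions for two retrieval algorithms.
my @ap_a = (0.62, 0.48, 0.71, 0.55, 0.80, 0.66);
my @ap_b = (0.50, 0.45, 0.60, 0.57, 0.70, 0.58);

# Exact paired randomization test.  Under the null hypothesis the sign of each
# per-query difference is arbitrary, so we enumerate all 2**n sign assignments
# and count those whose summed difference is at least as extreme as observed.
sub paired_randomization_p_value {
    my ($x, $y) = @_;
    die "need two non-empty lists of equal length\n" unless @$x && @$x == @$y;
    my @diff = map { $x->[$_] - $y->[$_] } 0 .. $#$x;
    my $n = @diff;
    my $observed = 0;
    $observed += $_ for @diff;
    $observed = abs($observed);
    my ($extreme, $total) = (0, 2 ** $n);
    for my $mask (0 .. $total - 1) {
        my $sum = 0;
        for my $i (0 .. $n - 1) {
            # Bit $i of $mask decides whether difference $i has its sign flipped.
            $sum += (($mask >> $i) & 1) ? -$diff[$i] : $diff[$i];
        }
        $extreme++ if abs($sum) >= $observed - 1e-12;
    }
    return $extreme / $total;
}

my $p = paired_randomization_p_value(\@ap_a, \@ap_b);
printf "p-value = %.4f\n", $p;
print $p < 0.05 ? "reject the null hypothesis\n"
                : "cannot reject the null hypothesis\n";
```

Exhaustive enumeration is practical only for a small number of queries; with more
queries one would typically sample random sign assignments instead.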

=item B<To calculate a similarity matrix for all the documents in your corpus:>

    calculate_similarity_matrix_for_all_docs.pl

or

    calculate_similarity_matrix_for_all_normalized_docs.pl

The former uses regular document vectors for calculating the similarity between every
pair of documents in the corpus; the latter uses normalized document vectors for the
same purpose.  The document order used for row and column indexing of the matrix
corresponds to the alphabetical ordering of the document names in the corpus directory.
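
To make the row and column indexing concrete, here is a minimal sketch that builds a
similarity matrix over alphabetically sorted document names.  It assumes cosine
similarity and uses made-up term-frequency vectors in C<%doc_vectors>; neither the
data nor the helper C<cosine()> comes from the module, whose scripts may compute
similarity differently.

```perl
use strict;
use warnings;

# Hypothetical term-frequency vectors keyed by document name.
my %doc_vectors = (
    'b.txt' => { apple => 1, pear => 1 },
    'a.txt' => { apple => 2, plum => 1 },
    'c.txt' => { plum  => 3 },
);

# Cosine similarity between two sparse vectors stored as hash references.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    $dot += $u->{$_} * $v->{$_} for grep { exists $v->{$_} } keys %$u;
    $nu  += $_ ** 2 for values %$u;
    $nv  += $_ ** 2 for values %$v;
    return 0 unless $nu && $nv;
    return $dot / sqrt($nu * $nv);
}

# Rows and columns follow the alphabetical order of the document names.
my @names = sort keys %doc_vectors;
my @matrix;
for my $i (0 .. $#names) {
    for my $j (0 .. $#names) {
        $matrix[$i][$j] =
            cosine($doc_vectors{ $names[$i] }, $doc_vectors{ $names[$j] });
    }
}

for my $i (0 .. $#names) {
    my @row = map { sprintf '%.3f', $_ } @{ $matrix[$i] };
    print "$names[$i]  @row\n";
}
```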

=back


=head1 EXPORT

None by design.

=head1 SO THAT YOU DO NOT LOSE RELEVANCY JUDGMENTS

You have to be careful when carrying out Precision versus Recall calculations if you
do not wish to lose previously created relevancy judgments. Invoking the method
C<estimate_doc_relevancies()> in your own script will cause the file C<relevancy.txt>
to be overwritten.  If you have created a relevancy database and stored it in a file
called, say, C<relevancy.txt>, you should make a backup copy of this file before
executing a script that calls C<estimate_doc_relevancies()>.
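
One way to make such a backup from inside your own script is sketched below.  The
helper C<backup_file()> and the C<.bak.E<lt>timestampE<gt>> naming scheme are
assumptions made for this example and are not part of the module; C<copy()> is from
the core C<File::Copy> module.

```perl
use strict;
use warnings;
use File::Copy qw(copy);

# Hypothetical helper: copy $path to "$path.bak.<epoch seconds>" if it exists.
# Returns the name of the backup, or nothing if there was no file to back up.
sub backup_file {
    my ($path) = @_;
    return unless -e $path;
    my $backup = "$path.bak." . time;
    copy($path, $backup) or die "Cannot back up $path to $backup: $!\n";
    return $backup;
}

my $saved = backup_file('relevancy.txt');
print defined $saved ? "Saved $saved\n" : "No relevancy.txt to back up\n";

# Only after this point would the script go on to call
# estimate_doc_relevancies() on its Algorithm::VSM object.
```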

=head1 BUGS

Please notify the author if you encounter any bugs.  When sending email, please place
the string 'VSM' in the subject line to get past my spam filter.

=head1 INSTALLATION

Download the archive from CPAN into any directory of your choice.  Unpack the archive
with a command that on a Linux machine would look like:

    tar zxvf Algorithm-VSM-1.70.tar.gz

This will create an installation directory for you whose name will be
C<Algorithm-VSM-1.70>.  Enter this directory and execute the following commands for a
standard install of the module if you have root privileges:

    perl Makefile.PL
    make
    make test
    sudo make install

If you do not have root privileges, you can carry out a non-standard install of the
module in any directory of your choice with:

    perl Makefile.PL prefix=/some/other/directory/
    make
    make test
    make install

With a non-standard install, you may also have to set your PERL5LIB environment
variable so that Perl can find this module and the other modules it requires. How you
do that depends on the platform you are working on.  To install this module on a
Linux machine on which I use tcsh for the shell, I set the PERL5LIB environment
variable by

    setenv PERL5LIB /some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

If I used bash, I'd need to declare:

    export PERL5LIB=/some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/
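
If you would rather not change your shell environment, a single script can add the
same directories to its module search path with the core C<lib> pragma.  The sketch
below reuses the placeholder paths shown above; substitute the directory you actually
installed into.

```perl
use strict;
use warnings;

# Prepend the (placeholder) install directories to @INC for this script only.
use lib '/some/other/directory/lib64/perl5',
        '/some/other/directory/share/perl5';

# With the module installed under one of those directories, this would now load:
# use Algorithm::VSM;

# The two directories now sit at the front of the search path.
print "$_\n" for @INC[0, 1];
```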


=head1 THANKS

Many thanks are owed to Shivani Rao and Bunyamin Sisman for sharing with me their
deep insights in IR.  Version 1.4 was prompted by Zahn Bozanic's interest in
similarity matrix characterization of a corpus. Thanks, Zahn!  

Several of the recent changes to the module are a result of the feedback I have
received from Naveen Kulkarni of Infosys Labs. Thanks, Naveen!

Version 1.62 was a result of Slaven Rezic's recommendation that I remove the Perl
version restriction on the module since he was able to run it with Perl version
5.8.9.  Another important reason for v. 1.62 was the discovery of the two bugs
mentioned in Changes, one of them brought to my attention by Naveen Kulkarni.

=head1 AUTHOR


