expand results from the CPAN

expand
Algorithm-VSM
view release on metacpan or search on metacpan

Version 1.2 includes a code correction and some general code and documentation
cleanup.

With Version 1.1, you can access the retrieval precision results so that you can
compare two different retrieval algorithms (VSM or LSA with different choices for
some of the constructor parameters) with significance testing. (Version 1.0 merely
sent those results to standard output, typically your terminal window.)  In Version
1.1, the new script B<significance_testing.pl> in the 'examples' directory
illustrates significance testing with Randomization and with Student's Paired t-Test.

=head1 DESCRIPTION

B<Algorithm::VSM> is a I<perl5> module for constructing a Vector Space Model (VSM) or
a Latent Semantic Analysis Model (LSA) of a collection of documents, usually referred
to as a corpus, and then retrieving the documents in response to search words in a
query.

VSM and LSA models have been around for a long time in the Information Retrieval (IR)
community.  More recently such models have been shown to be effective in retrieving
files/documents from software libraries. For an account of this research that was
presented by Shivani Rao and the author of this module at the 2011 Mining Software
Repositories conference, see L<http://portal.acm.org/citation.cfm?id=1985451>.

VSM modeling consists of: (1) Extracting the vocabulary used in a corpus.  (2)
Stemming the words so extracted and eliminating the designated stop words from the
vocabulary.  Stemming means that closely related words like 'programming' and
'programs' are reduced to the common root word 'program' and the stop words are the
non-discriminating words that can be expected to exist in virtually all the
documents. (3) Constructing document vectors for the individual files in the corpus
--- the document vectors taken together constitute what is usually referred to as a
'term-frequency' matrix for the corpus. (4) Normalizing the document vectors to
factor out the effect of document size and, if desired, multiplying the term
frequencies by the IDF (Inverse Document Frequency) values for the words to reduce
the weight of the words that appear in a large number of documents. (5) Constructing
a query vector for the search query after the query is subject to the same stemming
and stop-word elimination rules that were applied to the corpus. And, lastly, (6)
Using a similarity metric to return the set of documents that are most similar to the
query vector.  The commonly used similarity metric is one based on the cosine
distance between two vectors.  Also note that all the vectors mentioned here are of
the same size, the size of the vocabulary.  An element of a vector is the frequency
of occurrence of the word corresponding to that position in the vector.

LSA modeling is a small variation on VSM modeling.  Now you take VSM modeling one
step further by subjecting the term-frequency matrix for the corpus to singular value
decomposition (SVD).  By retaining only a subset of the singular values (usually the
N largest for some value of N), you can construct reduced-dimensionality vectors for
the documents and the queries.  In VSM, as mentioned above, the size of the document
and the query vectors is equal to the size of the vocabulary.  For large corpora,
this size may involve tens of thousands of words --- this can slow down the VSM
modeling and retrieval process.  So you are very likely to get faster performance
with retrieval based on LSA modeling, especially if you store the model once
constructed in a database file on the disk and carry out retrievals using the
disk-based model.


=head1 CAN THIS MODULE BE USED FOR GENERAL TEXT RETRIEVAL?

This module has only been tested for software retrieval.  For more general text
retrieval, you would need to replace the simple stemmer used in the module by one
based on, say, Porter's Stemming Algorithm.  You would also need to vastly expand the
list of stop words appropriate to the text corpora of interest to you. As previously
mentioned, the stop words are the commonly occurring words that do not carry much
discriminatory power from the standpoint of distinguishing between the documents.
See the file 'stop_words.txt' in the 'examples' directory for how such a file must be
formatted.


=head1 HOW DOES ONE DEAL WITH VERY LARGE LIBRARIES/CORPORA?

It is not uncommon for large software libraries to consist of tens of thousands of
documents that include source-code files, documentation files, README files,
configuration files, etc.  The bug-localization work presented recently by Shivani
Rao and this author at the 2011 Mining Software Repository conference (MSR11) was
based on a relatively small iBUGS dataset involving 6546 documents and a vocabulary
size of 7553 unique words. (Here is a link to this work:
L<http://portal.acm.org/citation.cfm?id=1985451>.  Also note that the iBUGS dataset
was originally put together by V. Dallmeier and T. Zimmermann for the evaluation of
automated bug detection and localization tools.)  If C<V> is the size of the
vocabulary and C<M> the number of the documents in the corpus, the size of each
vector will be C<V> and size of the term-frequency matrix for the entire corpus will
be C<V>xC<M>.  So if you were to duplicate the bug localization experiments in
L<http://portal.acm.org/citation.cfm?id=1985451> you would be dealing with vectors of
size 7553 and a term-frequency matrix of size 7553x6546.  Extrapolating these numbers
to really large libraries/corpora, we are obviously talking about very large matrices
for SVD decomposition.  For large libraries/corpora, it would be best to store away
the model in a disk file and to base all subsequent retrievals on the disk-stored
models.  The 'examples' directory contains scripts that carry out retrievals on the
basis of disk-based models.  Further speedup in retrieval can be achieved by using
LSA to create reduced-dimensionality representations for the documents and by basing
retrievals on the stored versions of such reduced-dimensionality representations.


=head1 ESTIMATING RETRIEVAL PERFORMANCE WITH PRECISION VS. RECALL CALCULATIONS

The performance of a retrieval algorithm is typically measured by two properties:
C<Precision at rank> and C<Recall at rank>.  As mentioned in my tutorial
L<https://engineering.purdue.edu/kak/Tutorials/SignificanceTesting.pdf>, at a given
rank C<r>, C<Precision> is the ratio of the number of retrieved documents that are
relevant to the total number of retrieved documents up to that rank.  And, along the
same lines, C<Recall> at a given rank C<r> is the ratio of the number of retrieved
documents that are relevant to the total number of relevant documents.  The Average
Precision associated with a query is the average of all the Precision-at-rank values
for all the documents relevant to that query.  When query specific Average Precision
is averaged over all the queries, you get Mean Average Precision (MAP) as a
single-number characterizer of the retrieval power of an algorithm for a given
corpus.  For an oracle, the value of MAP should be 1.0.  On the other hand, for
purely random retrieval from a corpus, the value of MAP will be inversely
proportional to the size of the corpus.  (See the discussion in
L<https://engineering.purdue.edu/kak/Tutorials/SignificanceTesting.pdf> for further
explanation on these retrieval precision evaluators.)  

This module includes methods that allow you to carry out these retrieval accuracy
measurements using the relevancy judgments supplied through a disk file.  If
human-supplied relevancy judgments are not available, the module will be happy to
estimate relevancies for you just by determining the number of query words that exist
in a document.  Note, however, that relevancy judgments estimated in this manner
cannot be trusted. That is because ultimately it is the humans who are the best
judges of the relevancies of documents to queries.  The humans bring to bear semantic
considerations on the relevancy determination problem that are beyond the scope of
this module.
( run in 1.211 second using v1.01-cache-2.11-cpan-a9496e3eb41 )