best results from the CPAN

Algorithm-VSM

view release on metacpan or search on metacpan


sub _doc_vec_comparator {
    my $self = shift;
    my %query_vector = %{$self->{_query_vector}};
    my $vec1_hash_ref = $self->{_idf_filter_option} ?
                                $self->{_normalized_doc_vecs}->{$a} :
                                $self->{_corpus_doc_vectors}->{$a};
    my $vec2_hash_ref = $self->{_idf_filter_option} ?
                                $self->{_normalized_doc_vecs}->{$b} :
                                $self->{_corpus_doc_vectors}->{$b};
    my @vec1 = ();
    my @vec2 = ();
    my @qvec = ();
    foreach my $word (sort keys %{$self->{_vocab_hist}}) {
        push @vec1, $vec1_hash_ref->{$word};
        push @vec2, $vec2_hash_ref->{$word};
        push @qvec, $query_vector{$word};
    }
    my $vec1_mag = vec_magnitude(\@vec1);
    my $vec2_mag = vec_magnitude(\@vec2);
    my $qvec_mag = vec_magnitude(\@qvec);
    my $product1 = vec_scalar_product(\@vec1, \@qvec);
    $product1 /= $vec1_mag * $qvec_mag;
    my $product2 = vec_scalar_product(\@vec2, \@qvec);
    $product2 /= $vec2_mag * $qvec_mag;
    return 1 if $product1 < $product2;
    return 0  if $product1 == $product2;
    return -1  if $product1 > $product2;
}

sub _similarity_to_query {
    my $self = shift;
    my $doc_name = shift;
    my $vec_hash_ref = $self->{_idf_filter_option} ?
                          $self->{_normalized_doc_vecs}->{$doc_name} :
                          $self->{_corpus_doc_vectors}->{$doc_name};
    my @vec = ();
    my @qvec = ();
    foreach my $word (sort keys %$vec_hash_ref) {
        push @vec, $vec_hash_ref->{$word};
        push @qvec, $self->{_query_vector}->{$word};
    }
    my $vec_mag = vec_magnitude(\@vec);
    my $qvec_mag = vec_magnitude(\@qvec);
    my $product = vec_scalar_product(\@vec, \@qvec);
    $product /= $vec_mag * $qvec_mag;
    return $product;
}

######################  Relevance Judgments for Testing Purposes   #######################

## IMPORTANT: This estimation of document relevancies to queries is NOT for
##            serious work.  A document is considered to be relevant to a
##            query if it contains several of the query words.  As to the
##            minimum number of query words that must exist in a document
##            in order for the latter to be considered relevant is
##            determined by the relevancy_threshold parameter in the VSM
##            constructor.  (See the relevancy and precision-recall related
##            scripts in the 'examples' directory.)  The reason for why the
##            function shown below is not for serious work is because
##            ultimately it is the humans who are the best judges of the
##            relevancies of documents to queries.  The humans bring to
##            bear semantic considerations on the relevancy determination
##            problem that are beyond the scope of this module.

sub estimate_doc_relevancies {
    my $self = shift;
    die "You did not set the 'query_file' parameter in the constructor"
        unless $self->{_query_file};
    open( IN, $self->{_query_file} )
               or die "unable to open the query file $self->{_query_file}: $!";
    croak "\n\nYou need to specify a name for the relevancy file in \n" .
        " in which the relevancy judgments will be dumped." 
                                 unless  $self->{_relevancy_file};
    while (<IN>) {
        next if /^#/;
        next if /^[ ]*\r?\n?$/;
        $_ =~ s/\r?\n?$//;
        die "Format of query file is not correct" unless /^[ ]*q[0-9]+:/;
        /^[ ]*(q[0-9]+):[ ]*(.*)/;
        my $query_label = $1;
        my $query = $2;
        next unless $query;
        $self->{_queries_for_relevancy}->{$query_label} =  $query;
    }
    if ($self->{_debug}) {
        foreach (sort keys %{$self->{_queries_for_relevancy}}) {
            print "$_   =>   $self->{_queries_for_relevancy}->{$_}\n"; 
        }
    }
    $self->{_scan_dir_for_rels} = 1;
    $self->_scan_directory($self->{_corpus_directory});
    $self->{_scan_dir_for_rels} = 0;
    chdir $self->{_working_directory};
    open(OUT, ">$self->{_relevancy_file}") 
       or die "unable to open the relevancy file $self->{_relevancy_file}: $!";
    my @relevancy_list_for_query;
    foreach (sort 
               {get_integer_suffix($a) <=> get_integer_suffix($b)} 
               keys %{$self->{_relevancy_estimates}}) {    
        @relevancy_list_for_query = 
                        keys %{$self->{_relevancy_estimates}->{$_}};
        print OUT "$_   =>   @relevancy_list_for_query\n\n"; 
        print "Number of relevant docs for query $_: " . 
                         scalar(@relevancy_list_for_query) . "\n";
    }
}

#   If there are available human-supplied relevancy judgments in a disk
#   file, use this script to upload that information.  One of the scripts
#   in the 'examples' directory carries out the precision-recall analysis 
#   by using this approach.  IMPORTANT:  The human-supplied relevancy
#   judgments must be in a format that is shown in the sample file
#   relevancy.txt in the 'examples' directory.
sub upload_document_relevancies_from_file {
    my $self = shift;
    chdir $self->{_working_directory};
    open( IN, $self->{_relevancy_file} )
       or die "unable to open the relevancy file $self->{_relevancy_file}: $!";
    while (<IN>) {
        next if /^#/;

lib/Algorithm/VSM.pm view on Meta::CPAN

Stemming the words so extracted and eliminating the designated stop words from the
vocabulary.  Stemming means that closely related words like 'programming' and
'programs' are reduced to the common root word 'program' and the stop words are the
non-discriminating words that can be expected to exist in virtually all the
documents. (3) Constructing document vectors for the individual files in the corpus
--- the document vectors taken together constitute what is usually referred to as a
'term-frequency' matrix for the corpus. (4) Normalizing the document vectors to
factor out the effect of document size and, if desired, multiplying the term
frequencies by the IDF (Inverse Document Frequency) values for the words to reduce
the weight of the words that appear in a large number of documents. (5) Constructing
a query vector for the search query after the query is subject to the same stemming
and stop-word elimination rules that were applied to the corpus. And, lastly, (6)
Using a similarity metric to return the set of documents that are most similar to the
query vector.  The commonly used similarity metric is one based on the cosine
distance between two vectors.  Also note that all the vectors mentioned here are of
the same size, the size of the vocabulary.  An element of a vector is the frequency
of occurrence of the word corresponding to that position in the vector.

LSA modeling is a small variation on VSM modeling.  Now you take VSM modeling one
step further by subjecting the term-frequency matrix for the corpus to singular value
decomposition (SVD).  By retaining only a subset of the singular values (usually the
N largest for some value of N), you can construct reduced-dimensionality vectors for
the documents and the queries.  In VSM, as mentioned above, the size of the document
and the query vectors is equal to the size of the vocabulary.  For large corpora,
this size may involve tens of thousands of words --- this can slow down the VSM
modeling and retrieval process.  So you are very likely to get faster performance
with retrieval based on LSA modeling, especially if you store the model once
constructed in a database file on the disk and carry out retrievals using the
disk-based model.


=head1 CAN THIS MODULE BE USED FOR GENERAL TEXT RETRIEVAL?

This module has only been tested for software retrieval.  For more general text
retrieval, you would need to replace the simple stemmer used in the module by one
based on, say, Porter's Stemming Algorithm.  You would also need to vastly expand the
list of stop words appropriate to the text corpora of interest to you. As previously
mentioned, the stop words are the commonly occurring words that do not carry much
discriminatory power from the standpoint of distinguishing between the documents.
See the file 'stop_words.txt' in the 'examples' directory for how such a file must be
formatted.


=head1 HOW DOES ONE DEAL WITH VERY LARGE LIBRARIES/CORPORA?

It is not uncommon for large software libraries to consist of tens of thousands of
documents that include source-code files, documentation files, README files,
configuration files, etc.  The bug-localization work presented recently by Shivani
Rao and this author at the 2011 Mining Software Repository conference (MSR11) was
based on a relatively small iBUGS dataset involving 6546 documents and a vocabulary
size of 7553 unique words. (Here is a link to this work:
L<http://portal.acm.org/citation.cfm?id=1985451>.  Also note that the iBUGS dataset
was originally put together by V. Dallmeier and T. Zimmermann for the evaluation of
automated bug detection and localization tools.)  If C<V> is the size of the
vocabulary and C<M> the number of the documents in the corpus, the size of each
vector will be C<V> and size of the term-frequency matrix for the entire corpus will
be C<V>xC<M>.  So if you were to duplicate the bug localization experiments in
L<http://portal.acm.org/citation.cfm?id=1985451> you would be dealing with vectors of
size 7553 and a term-frequency matrix of size 7553x6546.  Extrapolating these numbers
to really large libraries/corpora, we are obviously talking about very large matrices
for SVD decomposition.  For large libraries/corpora, it would be best to store away
the model in a disk file and to base all subsequent retrievals on the disk-stored
models.  The 'examples' directory contains scripts that carry out retrievals on the
basis of disk-based models.  Further speedup in retrieval can be achieved by using
LSA to create reduced-dimensionality representations for the documents and by basing
retrievals on the stored versions of such reduced-dimensionality representations.


=head1 ESTIMATING RETRIEVAL PERFORMANCE WITH PRECISION VS. RECALL CALCULATIONS

The performance of a retrieval algorithm is typically measured by two properties:
C<Precision at rank> and C<Recall at rank>.  As mentioned in my tutorial
L<https://engineering.purdue.edu/kak/Tutorials/SignificanceTesting.pdf>, at a given
rank C<r>, C<Precision> is the ratio of the number of retrieved documents that are
relevant to the total number of retrieved documents up to that rank.  And, along the
same lines, C<Recall> at a given rank C<r> is the ratio of the number of retrieved
documents that are relevant to the total number of relevant documents.  The Average
Precision associated with a query is the average of all the Precision-at-rank values
for all the documents relevant to that query.  When query specific Average Precision
is averaged over all the queries, you get Mean Average Precision (MAP) as a
single-number characterizer of the retrieval power of an algorithm for a given
corpus.  For an oracle, the value of MAP should be 1.0.  On the other hand, for
purely random retrieval from a corpus, the value of MAP will be inversely
proportional to the size of the corpus.  (See the discussion in
L<https://engineering.purdue.edu/kak/Tutorials/SignificanceTesting.pdf> for further
explanation on these retrieval precision evaluators.)  

This module includes methods that allow you to carry out these retrieval accuracy
measurements using the relevancy judgments supplied through a disk file.  If
human-supplied relevancy judgments are not available, the module will be happy to
estimate relevancies for you just by determining the number of query words that exist
in a document.  Note, however, that relevancy judgments estimated in this manner
cannot be trusted. That is because ultimately it is the humans who are the best
judges of the relevancies of documents to queries.  The humans bring to bear semantic
considerations on the relevancy determination problem that are beyond the scope of
this module.


=head1 METHODS

The module provides the following methods for constructing VSM and LSA models of a
corpus, for using the models thus constructed for retrieval, and for carrying out
precision versus recall calculations for the determination of retrieval accuracy on
the corpora of interest to you.

=over

=item B<new():>

A call to C<new()> constructs a new instance of the C<Algorithm::VSM> class:

    my $vsm = Algorithm::VSM->new( 
                     break_camelcased_and_underscored  => 1, 
                     case_sensitive         => 0,
                     corpus_directory       => "",
                     corpus_vocab_db        => "corpus_vocab_db",
                     doc_vectors_db         => "doc_vectors_db",
                     file_types             => $my_file_types,
                     lsa_svd_threshold      => 0.01, 
                     max_number_retrievals  => 10,
                     min_word_length        => 4,
                     normalized_doc_vecs_db => "normalized_doc_vecs_db",
                     query_file             => "",  
                     relevancy_file         => $relevancy_file,
                     relevancy_threshold    => 5, 
                     save_model_on_disk     => 0,  
                     stop_words_file        => "", 
                     use_idf_filter         => 1,
                     want_stemming          => 1,
              );       

The values shown on the right side of the big arrows are the B<default values for the
parameters>.  The value supplied through the variable C<$my_file_types> would be
something like C<['.java', '.txt']> if, say, you wanted only Java and text files to
be included in creating the database model.  The following nested list will now
describe each of the constructor parameters shown above:

=over 16

=item I<break_camelcased_and_underscored:>

The parameter B<break_camelcased_and_underscored> when set causes the
underscored and camel-cased words to be split.  By default the parameter is
set.  So if you don't want such words to be split, you must set it
explicitly to 0.

=item I<case_sensitive:>

When set to 1, this parameter forces the module to maintain the case of the terms in
the corpus files when creating the vocabulary and the document vectors.  Setting
C<case_sensitive> to 1 also causes the query matching to become case sensitive.
(This constructor parameter was introduced in Version 1.61.)

lib/Algorithm/VSM.pm view on Meta::CPAN

You can display the idf value associated with each word in the corpus by

    $vsm->display_inverse_document_frequencies();

The idf of a word in the corpus is calculated typically as the logarithm of the ratio
of the total number of documents in the corpus to the number of documents in which
the word appears (with protection built in to prevent division by zero).  Ideally, if
a word appears in all the documents, its idf would be small, close to zero. Words
with small idf values are non-discriminatory and should get reduced weighting in
document retrieval.


=item B<display_normalized_doc_vectors():>

If you would like to see the normalized document vectors, make the call:

    $vsm->display_normalized_doc_vectors();

See the comment made previously as to what is meant by the normalization of a
document vector.


=item B<display_precision_vs_recall_for_queries():>

A call to C<precision_and_recall_calculator()> will normally be followed by the
following call

    $vsm->display_precision_vs_recall_for_queries();

for displaying the C<Precision@rank> and C<Recall@rank> values.


=item B<display_retrievals( $retrievals ):>

You can display the retrieved document names by calling this method using the syntax:

    $vsm->display_retrievals( $retrievals );

where C<$retrievals> is a reference to the hash returned by a call to one of the
C<retrieve> methods.  The display method shown here respects the retrieval size
constraints expressed by the constructor parameter C<max_number_retrievals>.


=item B<estimate_doc_relevancies():>

Before you can carry out precision and recall calculations to test the accuracy of
VSM and LSA based retrievals from a corpus, you need to have available the relevancy
judgments for the queries.  (A relevancy judgment for a query is simply the list of
documents relevant to that query.)  Relevancy judgments are commonly supplied by the
humans who are familiar with the corpus.  But if such human-supplied relevance
judgments are not available, you can invoke the following method to estimate them:

    $vsm->estimate_doc_relevancies();

For the above method call, a document is considered to be relevant to a query if it
contains several of the query words.  As to the minimum number of query words that
must exist in a document in order for the latter to be considered relevant, that is
determined by the C<relevancy_threshold> parameter in the VSM constructor.

But note that this estimation of document relevancies to queries is NOT for serious
work.  The reason for that is because ultimately it is the humans who are the best
judges of the relevancies of documents to queries.  The humans bring to bear semantic
considerations on the relevancy determination problem that are beyond the scope of
this module.

The generated relevancies are deposited in a file named by the constructor parameter
C<relevancy_file>.


=item B<get_all_document_names():>

If you want to get hold of all the filenames in the corpus in your own script, you
can call

    my @docs = @{$vsm->get_all_document_names()};

The array on the left will contain an alphabetized list of the files.


=item B<generate_document_vectors():>

This is a necessary step after the vocabulary used by a corpus is constructed. (Of
course, if you will be doing document retrieval through a disk-stored VSM or LSA
model, then you do not need to call this method.  You construct document vectors
through the following call:

    $vsm->generate_document_vectors();


=item B<get_corpus_vocabulary_and_word_counts():>

After you have constructed a new instance of the C<Algorithm::VSM> class, you must
now scan the corpus documents for constructing the corpus vocabulary. This you do by:

    $vsm->get_corpus_vocabulary_and_word_counts();

The only time you do NOT need to call this method is when you are using a previously
constructed disk-stored VSM model for retrieval.


=item B<get_query_sorted_average_precision_for_queries():>

If you want to run significance tests on the retrieval accuracies you obtain on a
given corpus and with different algorithms (VSM or LSA with different choices for the
constructor parameters), your own script would need access to the average precision
data for a set of queries. You can get hold of this data by calling

    $vsm->get_query_sorted_average_precision_for_queries();

The script C<significance_testing.pl> in the 'examples' directory shows how you can
use this method for significance testing.


=item B<pairwise_similarity_for_docs():>

=item B<pairwise_similarity_for_normalized_docs():>

If you would like to compare in your own script any two documents in the corpus, you
can call

    my $similarity = $vsm->pairwise_similarity_for_docs("filename_1", "filename_2");

lib/Algorithm/VSM.pm view on Meta::CPAN

You can create a script similar to this for doing the same with LSA models.

=item B<For Storing the Model Information in Disk-Based Hash Tables:>

For storing the model information in disk-based DBM files that can subsequently be
used for both VSM and LSA retrieval, run the script:

    retrieve_with_VSM_and_also_create_disk_based_model.pl

=item B<For Basic LSA-Based Retrieval:>

For basic LSA-based model construction and retrieval, run the script:

    retrieve_with_LSA.pl

Starting with version 1.60, this script does not store away the model information in
disk-based hash tables.  If you want your model to be stored on the disk, you must
run the script C<retrieve_with_VSM_and_also_create_disk_based_model.pl> for that.

=item B<For VSM-Based Retrieval with a Disk-Stored Model:>

If you have previously run a script like
C<retrieve_with_VSM_and_also_create_disk_based_model.pl>, you can run the script

    retrieve_with_disk_based_VSM.pl

for repeated VSM-based retrievals from a disk-based model.

=item B<For LSA-Based Retrieval with a Disk-Stored Model:>

If you have previously run a script like
C<retrieve_with_VSM_and_also_create_disk_based_model.pl>, you can run the script

    retrieve_with_disk_based_LSA.pl

for repeated LSA-based retrievals from a disk-based model.

=item B<For Precision and Recall Calculations with VSM:>

To experiment with precision and recall calculations for VSM retrieval, run the
script:

    calculate_precision_and_recall_for_VSM.pl

Note that this script will carry out its own estimation of relevancy judgments ---
which in most cases would not be a safe thing to do.

=item B<For Precision and Recall Calculations with LSA:>

To experiment with precision and recall calculations for LSA retrieval, run the
script:

    calculate_precision_and_recall_for_LSA.pl

Note that this script will carry out its own estimation of relevancy judgments ---
which in most cases would not be a safe thing to do.

=item B<For Precision and Recall Calculations for VSM with
Human-Supplied Relevancies:>

Precision and recall calculations for retrieval accuracy determination are best
carried out with human-supplied judgments of relevancies of the documents to queries.
If such judgments are available, run the script:

    calculate_precision_and_recall_from_file_based_relevancies_for_VSM.pl

This script will print out the average precisions for the different test queries and
calculate the MAP metric of retrieval accuracy.

=item B<For Precision and Recall Calculations for LSA with
Human-Supplied Relevancies:>

If human-supplied relevancy judgments are available and you wish to experiment with
precision and recall calculations for LSA-based retrieval, run the script:

    calculate_precision_and_recall_from_file_based_relevancies_for_LSA.pl

This script will print out the average precisions for the different test queries and
calculate the MAP metric of retrieval accuracy.

=item B<To carry out significance tests on the retrieval precision results with
Randomization or with Student's Paired t-Test:>

    significance_testing.pl  randomization

or

    significance_testing.pl  t-test

Significance testing consists of forming a null hypothesis that the two retrieval
algorithms you are considering are the same from a black-box perspective and then
calculating what is known as a C<p-value>.  If the C<p-value> is less than, say,
0.05, you reject the null hypothesis.

=item B<To calculate a similarity matrix for all the documents in your corpus:>

    calculate_similarity_matrix_for_all_docs.pl

or

    calculate_similarity_matrix_for_all_normalized_docs.pl

The former uses regular document vectors for calculating the similarity between every
pair of documents in the corpus. And the latter uses normalized document vectors for
the same purpose.  The document order used for row and column indexing of the matrix
corresponds to the alphabetic ordering of the document names in the corpus directory.

=back


=head1 EXPORT

None by design.

=head1 SO THAT YOU DO NOT LOSE RELEVANCY JUDGMENTS

You have to be careful when carrying out Precision verses Recall calculations if you
do not wish to lose the previously created relevancy judgments. Invoking the method
C<estimate_doc_relevancies()> in your own script will cause the file C<relevancy.txt>
to be overwritten.  If you have created a relevancy database and stored it in a file
called, say, C<relevancy.txt>, you should make a backup copy of this file before

( run in 0.334 second using v1.01-cache-2.11-cpan-13bb782fe5a )