Algorithm-VSM

lib/Algorithm/VSM.pm

                            file_types             => ['.txt', '.java'],
                            lsa_svd_threshold      => 0.01, 
                            max_number_retrievals  => 10,
                            min_word_length        => 4,
                            stop_words_file        => $stop_words_file,
                            use_idf_filter         => 1,
                            want_stemming          => 1,
        );
        $lsa->get_corpus_vocabulary_and_word_counts();
        $lsa->display_corpus_vocab();
        $lsa->display_corpus_vocab_size();
        $lsa->write_corpus_vocab_to_file("vocabulary_dump.txt");
        $lsa->generate_document_vectors();
        $lsa->construct_lsa_model();
        my $retrievals = $lsa->retrieve_for_query_with_lsa( \@query );
        $lsa->display_retrievals( $retrievals );

    The initialization code before the constructor call and the calls for displaying
    the vocabulary and the vectors after the call remain the same as for the VSM case
    shown previously in this Synopsis.  In the call above, the constructor parameter
    'lsa_svd_threshold' determines how many of the singular values are retained
    after a singular value decomposition (SVD) of the term-frequency matrix for
    the documents in the corpus.  Singular values smaller than this threshold
    fraction of the largest singular value are rejected.



  # FOR MEASURING PRECISION VERSUS RECALL FOR VSM:

        my $corpus_dir = "corpus";   
        my $stop_words_file = "stop_words.txt";  
        my $query_file      = "test_queries.txt";  
        my $relevancy_file   = "relevancy.txt";   # All relevancy judgments
                                                  # will be stored in this file
        my $vsm = Algorithm::VSM->new( 
                            break_camelcased_and_underscored  => 1, 
                            case_sensitive         => 0,
                            corpus_directory       => $corpus_dir,
                            file_types             => ['.txt', '.java'],
                            min_word_length        => 4,
                            query_file             => $query_file,
                            relevancy_file         => $relevancy_file,
                            relevancy_threshold    => 5, 
                            stop_words_file        => $stop_words_file, 
                            want_stemming          => 1,
        );
        $vsm->get_corpus_vocabulary_and_word_counts();
        $vsm->generate_document_vectors();
        $vsm->estimate_doc_relevancies();
        $vsm->display_doc_relevancies();               # used only for testing
        $vsm->precision_and_recall_calculator('vsm');
        $vsm->display_precision_vs_recall_for_queries();
        $vsm->display_average_precision_for_queries_and_map();

      Measuring precision and recall requires a set of queries.  These are supplied
      through the constructor parameter 'query_file'.  The format of this file
      must follow the sample file 'test_queries.txt' in the 'examples'
      directory.  The module estimates the relevancies of the documents to the
      queries and dumps the relevancies in the file named by the 'relevancy_file'
      constructor parameter.  The constructor parameter 'relevancy_threshold' is used
      to decide which of the documents are considered relevant to a query.  A
      document must contain at least 'relevancy_threshold' occurrences of query
      words in order to be considered relevant to a query.



  # FOR MEASURING PRECISION VERSUS RECALL FOR LSA:

        my $lsa = Algorithm::VSM->new( 
                            break_camelcased_and_underscored  => 1, 
                            case_sensitive         => 0,
                            corpus_directory       => $corpus_dir,
                            file_types             => ['.txt', '.java'],
                            lsa_svd_threshold      => 0.01,
                            min_word_length        => 4,
                            query_file             => $query_file,
                            relevancy_file         => $relevancy_file,
                            relevancy_threshold    => 5, 
                            stop_words_file        => $stop_words_file, 
                            want_stemming          => 1,
        );
        $lsa->get_corpus_vocabulary_and_word_counts();
        $lsa->generate_document_vectors();
        $lsa->construct_lsa_model();
        $lsa->estimate_doc_relevancies();
        $lsa->display_doc_relevancies();
        $lsa->precision_and_recall_calculator('lsa');
        $lsa->display_precision_vs_recall_for_queries();
        $lsa->display_average_precision_for_queries_and_map();

      The purpose of the constructor parameter 'query_file' and the constraints on
      the format of the queries in that file are explained above.  As mentioned
      earlier, the module estimates the relevancies of the documents to the queries
      and dumps the relevancies in the file named by the 'relevancy_file'
      constructor parameter.  The constructor parameter 'relevancy_threshold' is
      used to decide which of the documents are considered relevant to a query.  A
      document must contain at least 'relevancy_threshold' occurrences of query
      words in order to be considered relevant to a query.  The role of the
      constructor parameter 'lsa_svd_threshold' is explained earlier in this
      Synopsis.



  # FOR MEASURING PRECISION VERSUS RECALL FOR VSM USING FILE-BASED RELEVANCE JUDGMENTS:

        my $corpus_dir = "corpus";  
        my $stop_words_file = "stop_words.txt";
        my $query_file      = "test_queries.txt";
        my $relevancy_file   = "relevancy.txt";  
        my $vsm = Algorithm::VSM->new( 
                            break_camelcased_and_underscored  => 1, 
                            case_sensitive         => 0,
                            corpus_directory       => $corpus_dir,
                            file_types             => ['.txt', '.java'],
                            min_word_length        => 4,
                            query_file             => $query_file,
                            relevancy_file         => $relevancy_file,
                            stop_words_file        => $stop_words_file, 
                            want_stemming          => 1,
        );
        $vsm->get_corpus_vocabulary_and_word_counts();
        $vsm->generate_document_vectors();
        $vsm->upload_document_relevancies_from_file();  
        $vsm->display_doc_relevancies();
        $vsm->precision_and_recall_calculator('vsm');
        $vsm->display_precision_vs_recall_for_queries();
        $vsm->display_average_precision_for_queries_and_map();

    Now the file named by the constructor parameter 'relevancy_file' must contain
    relevance judgments for the queries listed in the file named by the parameter
    'query_file'.  The format of these two files must follow the sample files
    'test_queries.txt' and 'relevancy.txt' in the 'examples' directory.



  # FOR MEASURING PRECISION VERSUS RECALL FOR LSA USING FILE-BASED RELEVANCE JUDGMENTS:

        my $corpus_dir = "corpus";  
        my $stop_words_file = "stop_words.txt";
        my $query_file      = "test_queries.txt";
        my $relevancy_file   = "relevancy.txt";  
        my $lsa = Algorithm::VSM->new( 
                            break_camelcased_and_underscored  => 1,  
                            case_sensitive      => 0,                
                            corpus_directory    => $corpus_dir,
                            file_types          => ['.txt', '.java'],
                            lsa_svd_threshold   => 0.01,
                            min_word_length     => 4,
                            query_file          => $query_file,
                            relevancy_file      => $relevancy_file,
                            stop_words_file     => $stop_words_file,
                            want_stemming       => 1,                
        );
        $lsa->get_corpus_vocabulary_and_word_counts();
        $lsa->generate_document_vectors();

calculating retrieval performance with C<Precision> and C<Recall> numbers. The format
of the query file must be as shown in the sample file C<test_queries.txt> in the
'examples' directory.

=item I<relevancy_file:> 

This option names the disk file for storing the relevancy judgments.

=item I<relevancy_threshold:> 

The constructor parameter B<relevancy_threshold> controls the automatic determination
of document relevancies to queries, which is based on the number of occurrences of
query words in a document.  A document is considered relevant to a query only when
it contains at least B<relevancy_threshold> occurrences of query words.
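
As a rough illustration of this rule, the following minimal sketch counts the
occurrences of query words in a document's text and compares the count against a
threshold.  The helper names C<count_query_word_occurrences> and C<is_relevant>
and the sample text are hypothetical and not part of the module's API.  The sketch
also ignores stemming, stop words, and minimum word length, which the module may
apply before counting.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many word tokens in $text are query words.
# Matching is case-insensitive; no stemming or stop-word removal is applied.
sub count_query_word_occurrences {
    my ($text, @query_words) = @_;
    my %is_query_word = map { (lc($_) => 1) } @query_words;
    my @doc_words = map { lc($_) } ($text =~ /[A-Za-z]+/g);
    return scalar(grep { $is_query_word{$_} } @doc_words);
}

# Hypothetical helper: a document is relevant when the count reaches the threshold.
sub is_relevant {
    my ($threshold, $text, @query_words) = @_;
    return count_query_word_occurrences($text, @query_words) >= $threshold ? 1 : 0;
}

# Made-up document text and query, for illustration only.
my $doc_text = "The parser reads a string and the parser builds a string "
             . "buffer from the string";
my @query    = qw(parser string);

my $count = count_query_word_occurrences($doc_text, @query);
print "Query-word occurrences: $count\n";
print "Relevant at threshold 5? ", (is_relevant(5, $doc_text, @query) ? "yes" : "no"), "\n";
```

Raising the threshold makes the automatic judgments stricter: the same document
would not count as relevant at a threshold of 6.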

=item I<save_model_on_disk:>

The constructor parameter B<save_model_on_disk> causes the basic
information about the VSM and the LSA models to be stored on disk.
Subsequently, any retrievals can be carried out from the disk-based model.

=item I<stop_words_file:>

The parameter B<stop_words_file> is for naming the file that contains the stop words
that you do not wish to include in the corpus vocabulary.  The format of this file
must be as shown in the sample file C<stop_words.txt> in the 'examples' directory.

=item I<use_idf_filter:>

The constructor parameter B<use_idf_filter> is set by default.  If you want
to turn off the normalization of the document vectors, including turning
off the weighting of the term frequencies of the words by their idf values,
you must set this parameter explicitly to 0.
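
To make the idf weighting concrete, here is a minimal sketch that assumes the
common formulation in which the idf of a word is the natural logarithm of the
number of documents divided by the number of documents containing the word.  The
module's exact formula and its vector normalization may differ.  The helper names
and the toy corpus are hypothetical, and the sketch shows only the weighting step,
not the normalization.

```perl
use strict;
use warnings;

# Toy corpus (made up): document name => list of already-tokenized words.
my %corpus = (
    'a.txt' => [qw(stack queue stack)],
    'b.txt' => [qw(queue tree)],
    'c.txt' => [qw(queue graph graph)],
);

# Assumed formulation: idf(word) = log( N / df(word) ), where N is the number
# of documents and df(word) is the number of documents containing the word.
sub inverse_document_frequencies {
    my ($corpus_ref) = @_;
    my $num_docs = scalar(keys %$corpus_ref);
    my %doc_freq;
    for my $words (values %$corpus_ref) {
        my %seen;
        $doc_freq{$_}++ for grep { !$seen{$_}++ } @$words;
    }
    my %idf = map { ($_ => log($num_docs / $doc_freq{$_})) } keys %doc_freq;
    return \%idf;
}

# Weight each raw term frequency in one document by the word's idf value.
sub tf_idf_vector {
    my ($words, $idf) = @_;
    my %tf;
    $tf{$_}++ for @$words;
    my %weighted = map { ($_ => $tf{$_} * $idf->{$_}) } keys %tf;
    return \%weighted;
}

my $idf    = inverse_document_frequencies(\%corpus);
my $vector = tf_idf_vector($corpus{'a.txt'}, $idf);
printf "%-6s tf-idf = %.4f\n", $_, $vector->{$_} for sort keys %$vector;
```

Under this formulation, a word that appears in every document (C<queue> here)
gets an idf of zero, so it contributes nothing to the weighted vector.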

=item I<want_stemming:>

The boolean parameter B<want_stemming> determines whether or not the words extracted
from the documents are subjected to stemming.  As mentioned elsewhere, stemming
means that related words like 'programming' and 'programs' are both reduced to
the root word 'program'.

=back

=begin html

<br>

=end html

=item B<construct_lsa_model():>

You call this subroutine for constructing an LSA model for your corpus
after you have extracted the corpus vocabulary and constructed document
vectors:

    $vsm->construct_lsa_model();

The singular value decomposition (SVD) that is carried out in LSA model construction
uses the constructor parameter C<lsa_svd_threshold> to decide how many of the
singular values to retain for the LSA model.  A singular value is retained only if
it is larger than the C<lsa_svd_threshold> fraction of the largest singular value.
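
The retention rule can be sketched in a few lines of plain Perl.  This is only an
illustration of the thresholding step, not the module's implementation.  The helper
name C<retain_singular_values> and the sample singular values are hypothetical, and
the sketch assumes a strict "larger than" comparison.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the singular values that are larger than
# $threshold times the largest singular value.
sub retain_singular_values {
    my ($threshold, @singular_values) = @_;
    return () unless @singular_values;
    my ($largest) = sort { $b <=> $a } @singular_values;
    return grep { $_ > $threshold * $largest } @singular_values;
}

# Made-up singular values, for illustration only.
my @singular_values = (12.0, 5.5, 0.9, 0.1, 0.05);
my @kept = retain_singular_values(0.01, @singular_values);

print "Retained ", scalar(@kept), " of ", scalar(@singular_values),
      " singular values: @kept\n";
```

With a threshold of 0.01 and a largest value of 12.0, the cutoff is 0.12, so the
two smallest values in this sample are dropped.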


=item B<display_average_precision_for_queries_and_map():>

The Average Precision for a query is the average of the Precision-at-rank values
associated with each of the corpus documents relevant to the query.  The mean of the
Average Precision values for all the queries is the Mean Average Precision (MAP).
The C<Average Precision> values for the queries and the overall C<MAP> can be printed
out by calling

    $vsm->display_average_precision_for_queries_and_map();
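
The definitions above can be sketched as follows.  This is a minimal sketch, not
the module's code.  The helper names C<average_precision> and
C<mean_average_precision> and the document names are hypothetical.  The sketch
assumes that a relevant document which is never retrieved contributes a precision
of zero, which is a common convention.

```perl
use strict;
use warnings;

# Average Precision for one query.
#   $ranked_docs   : array ref of retrieved document names, best match first
#   $relevant_docs : array ref of document names judged relevant to the query
sub average_precision {
    my ($ranked_docs, $relevant_docs) = @_;
    my %is_relevant = map { ($_ => 1) } @$relevant_docs;
    return 0 unless %is_relevant;
    my ($hits, $sum) = (0, 0);
    for my $i (0 .. $#$ranked_docs) {
        next unless $is_relevant{ $ranked_docs->[$i] };
        $hits++;
        $sum += $hits / ($i + 1);    # Precision at the rank of this relevant doc
    }
    return $sum / scalar(keys %is_relevant);
}

# Mean Average Precision over a list of [ranked_docs, relevant_docs] pairs.
sub mean_average_precision {
    my (@queries) = @_;
    return 0 unless @queries;
    my $total = 0;
    $total += average_precision(@$_) for @queries;
    return $total / scalar(@queries);
}

# Made-up retrieval results for two queries.
my @queries = (
    [ [qw(d1 d2 d3 d4)], [qw(d1 d3)] ],
    [ [qw(d5 d2)],       [qw(d2)]    ],
);
printf "Average Precision, query %d: %.4f\n", $_ + 1, average_precision(@{ $queries[$_] })
    for 0 .. $#queries;
printf "MAP: %.4f\n", mean_average_precision(@queries);
```

For the first query, the relevant documents appear at ranks 1 and 3, giving
precisions of 1/1 and 2/3 and an Average Precision of 5/6.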


=item B<display_corpus_vocab():>

If you would like to see the corpus vocabulary as constructed by the previous call,
make the call

    $vsm->display_corpus_vocab();

Note that this is a useful thing to do only on small test corpora. If you need
to examine the vocabulary for a large corpus, call the two methods listed below.


=item B<display_corpus_vocab_size():>

If you would like the module to print the size of the vocabulary in your terminal
window, make the call

    $vsm->display_corpus_vocab_size();


=item B<display_doc_relevancies():>

If you would like to see the document relevancies generated by the previous method,
you can call

    $vsm->display_doc_relevancies();


=item B<display_doc_vectors():>

If you would like to see the document vectors constructed by the previous call, make
the call:

    $vsm->display_doc_vectors();

Note that this is a useful thing to do only on small test corpora. If you must call
this method on a large corpus, you might wish to direct the output to a file.  
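
Redirecting the script's output at the shell is usually the simplest route.  If you
prefer to do it inside the script, one option is to point C<STDOUT> at a file
temporarily, as in the sketch below.  It assumes that the display method prints to
C<STDOUT>.  The helper name C<run_with_stdout_to_file>, the output filename, and
the stand-in C<print> are hypothetical.  In real use, the code reference would wrap
the call to C<display_doc_vectors()>.

```perl
use strict;
use warnings;

# Hypothetical helper: run $code with STDOUT redirected to the file $path,
# then restore the original STDOUT.
sub run_with_stdout_to_file {
    my ($path, $code) = @_;
    open(my $saved_stdout, '>&', \*STDOUT) or die "Cannot duplicate STDOUT: $!";
    open(STDOUT, '>', $path)               or die "Cannot open $path: $!";
    $code->();
    close(STDOUT)                          or die "Cannot close $path: $!";
    open(STDOUT, '>&', $saved_stdout)      or die "Cannot restore STDOUT: $!";
    return;
}

run_with_stdout_to_file('doc_vectors_dump.txt', sub {
    # Stand-in for: $vsm->display_doc_vectors();
    print "doc1.txt => 0.12 0.00 0.53\n";
});
print "Output written to doc_vectors_dump.txt\n";
```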


=item B<display_inverse_document_frequencies():>

You can display the idf value associated with each word in the corpus by calling

    $vsm->display_inverse_document_frequencies();

The idf of a word in the corpus is calculated typically as the logarithm of the ratio


