origin results from the CPAN

origin
Algorithm-VSM
view release on metacpan or search on metacpan

=head1 HOW DOES ONE DEAL WITH VERY LARGE LIBRARIES/CORPORA?

It is not uncommon for large software libraries to consist of tens of thousands of
documents that include source-code files, documentation files, README files,
configuration files, etc.  The bug-localization work presented recently by Shivani
Rao and this author at the 2011 Mining Software Repository conference (MSR11) was
based on a relatively small iBUGS dataset involving 6546 documents and a vocabulary
size of 7553 unique words. (Here is a link to this work:
L<http://portal.acm.org/citation.cfm?id=1985451>.  Also note that the iBUGS dataset
was originally put together by V. Dallmeier and T. Zimmermann for the evaluation of
automated bug detection and localization tools.)  If C<V> is the size of the
vocabulary and C<M> the number of the documents in the corpus, the size of each
vector will be C<V> and size of the term-frequency matrix for the entire corpus will
be C<V>xC<M>.  So if you were to duplicate the bug localization experiments in
L<http://portal.acm.org/citation.cfm?id=1985451> you would be dealing with vectors of
size 7553 and a term-frequency matrix of size 7553x6546.  Extrapolating these numbers
to really large libraries/corpora, we are obviously talking about very large matrices
for SVD decomposition.  For large libraries/corpora, it would be best to store away
the model in a disk file and to base all subsequent retrievals on the disk-stored
models.  The 'examples' directory contains scripts that carry out retrievals on the
( run in 1.144 second using v1.01-cache-2.11-cpan-1c8d708658b )