Algorithm-TicketClusterer
lib/Algorithm/TicketClusterer.pm
=item I<debug2:>
When this parameter is set, you will see how WordNet is being utilized to generate
word synonyms. This debugging output is also useful to see the extent of misspellings
in the tickets. If WordNet is unable to find the synonyms for a word, chances are
that the word is not spelled correctly (or that it is a jargon word or a jargon
acronym).
=item I<debug3:>
This debug flag applies to the calculations carried out during the retrieval of
similar tickets. When this flag is set, the module will display the candidate set of
tickets to be considered for matching with the query ticket. This candidate set is
chosen by using the inverted index to collect all the tickets that share words with
the query ticket, provided the IDF value of each such shared word exceeds the
threshold set by the constructor parameter C<min_idf_threshold>.
=item I<excel_filename:>
This is the name of the Excel file that contains the tickets you want to process.
=item I<how_many_retrievals:>
The integer value supplied for this parameter sets how many of the tickets most
similar to a query ticket will be returned.
=item I<idf_db:>
You store the inverse document frequencies for the vocabulary words in a database
file whose name is supplied through this constructor parameter. As mentioned
earlier, the IDF for a word is, in principle, the logarithm of the ratio of the total
number of tickets to the DF (Document Frequency) for the word. The DF of a word is
the number of tickets in which the word appears.
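The definition above can be sketched in a few lines of Perl. This is only an illustration of the formula, not the module's own code: the toy C<%tickets> hash and the subroutine name C<compute_idf> are hypothetical, and base-10 logarithms are assumed.

```perl
use strict;
use warnings;

# Hypothetical toy corpus: ticket ID => list of (already preprocessed) words.
my %tickets = (
    t1 => [qw(printer jam paper)],
    t2 => [qw(printer offline)],
    t3 => [qw(paper tray empty)],
    t4 => [qw(printer driver missing)],
);

# Return a hash reference mapping each vocabulary word to log10(N / DF).
sub compute_idf {
    my ($tickets) = @_;
    my $num_tickets = scalar keys %$tickets;
    my %df;
    for my $words (values %$tickets) {
        my %seen;
        $seen{$_} = 1 for @$words;     # count each word once per ticket
        $df{$_}++ for keys %seen;
    }
    my %idf;
    $idf{$_} = log($num_tickets / $df{$_}) / log(10) for keys %df;
    return \%idf;
}

my $idf = compute_idf(\%tickets);
printf "%-8s %.3f\n", $_, $idf->{$_} for sort keys %$idf;
```

Words that appear in many tickets (here C<printer>) end up with a small IDF, while words confined to a single ticket get the largest value.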
=item I<inverted_index_db:>
If you plan to create separate scripts for the three stages of processing described
earlier, you must store the inverted index in a database file so that it can be used
by the script whose job is to carry out similarity based ticket retrieval. The
inverted index is stored in a database file whose name is supplied through this
constructor parameter.
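For readers unfamiliar with the term, an inverted index maps each word to the tickets that contain it. The following is a minimal sketch with made-up data; the name C<build_inverted_index> is hypothetical and the code does not reproduce how the module builds or stores its index.

```perl
use strict;
use warnings;

# Hypothetical stemmed tickets: ticket ID => list of words.
my %tickets = (
    t1 => [qw(printer jam paper)],
    t2 => [qw(printer offline)],
    t3 => [qw(paper tray empty)],
);

# Map each word to the sorted list of ticket IDs that contain it.
sub build_inverted_index {
    my ($tickets) = @_;
    my %members;
    for my $id (keys %$tickets) {
        $members{$_}{$id} = 1 for @{ $tickets->{$id} };
    }
    my %index;
    $index{$_} = [ sort keys %{ $members{$_} } ] for keys %members;
    return \%index;
}

my $inverted_index = build_inverted_index(\%tickets);
print "$_: @{ $inverted_index->{$_} }\n" for sort keys %$inverted_index;
```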
=item I<max_num_syn_words:>
As mentioned in B<DESCRIPTION>, some words can have a very large number of synonyms,
far more than the number of words that may exist in a typical ticket. If you were to
add all such synonyms to a ticket, you would run the risk of altering the sense of
the ticket, besides unnecessarily increasing the size of the vocabulary. This
parameter caps the number of synonyms added for any one word. When WordNet returns
more synonyms than the value set for this parameter, the synonyms retained are chosen
randomly from the list returned by WordNet.
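The capping behavior described above might look roughly like this. It is a sketch only: the subroutine name C<cap_synonyms> and the sample synonym list are hypothetical, and the module's own selection code may differ.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Keep at most $max synonyms, chosen at random when the list is longer.
sub cap_synonyms {
    my ($synonyms, $max) = @_;
    return @$synonyms if @$synonyms <= $max;
    my @shuffled = shuffle @$synonyms;
    return @shuffled[0 .. $max - 1];
}

# Hypothetical synonym list standing in for what WordNet might return.
my @synonyms = qw(fault defect flaw error glitch bug);
my @kept = cap_synonyms(\@synonyms, 3);
print "kept: @kept\n";
```

Because the selection is random, repeated runs can keep different synonyms for the same word.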
=item I<min_idf_threshold:>
First recall that IDF stands for Inverse Document Frequency. It is calculated during
the second of the three stages of ticket processing described in the section
B<THE THREE STAGES OF PROCESSING TICKETS>. The IDF value of a word gives us a
measure of the discriminatory power of the word. Let's say you have a word that
occurs in only one out of 1000 tickets. Such a word is highly discriminatory and its
IDF would be the logarithm (to base 10) of the ratio of 1000 to 1, which is 3. On
the other hand, for a word that occurs in every one of 1000 tickets, the IDF value
would be the logarithm of the ratio of 1000 to 1000, which is 0. So, for the case
when you have 1000 tickets, the upper bound on IDF is 3 and the lower bound is 0.
This constructor parameter controls which of the query words are used for
constructing the initial pool of tickets that will be used for matching: only query
words whose IDF exceeds the threshold contribute tickets to the pool. The larger the
value of this threshold, the smaller the pool.
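To make the effect of the threshold concrete, here is a small sketch of how a candidate pool could be assembled from an inverted index. The IDF values, the index contents, and the name C<candidate_pool> are all made up for illustration; this is not the module's retrieval code.

```perl
use strict;
use warnings;

# Hypothetical IDF values and inverted index for a handful of words.
my %idf = (printer => 0.3, toner => 2.1, smudge => 2.7, the => 0.0);
my %inverted_index = (
    printer => [qw(t1 t2 t3 t4)],
    toner   => [qw(t2 t5)],
    smudge  => [qw(t5)],
    the     => [qw(t1 t2 t3 t4 t5)],
);

# Collect tickets that share a sufficiently discriminatory word with the query.
sub candidate_pool {
    my ($query_words, $idf, $index, $min_idf_threshold) = @_;
    my %pool;
    for my $word (@$query_words) {
        next unless exists $idf->{$word};                # word not in vocabulary
        next unless $idf->{$word} > $min_idf_threshold;   # too common to be useful
        $pool{$_} = 1 for @{ $index->{$word} || [] };
    }
    my @ids = sort keys %pool;
    return @ids;
}

my @query = qw(the printer toner smudge);
for my $threshold (0.1, 2.0, 2.5) {
    my @pool = candidate_pool(\@query, \%idf, \%inverted_index, $threshold);
    print "threshold $threshold: @pool\n";
}
```

With these numbers the pool shrinks from five tickets to two and then to one as the threshold rises.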
=item I<min_word_length:>
This parameter sets the minimum number of characters a word must have in order to be
included in ticket processing.
=item I<misspelled_words_file:>
The extent to which you can improve ticket retrieval precision by adding synonyms
depends on how well you can correct, on the fly, the spelling errors that occur
frequently in tickets. That makes the file you supply through this constructor
parameter very important. For the current version of the module, this file must
contain exactly two columns, with the first entry in each row being the misspelled
word and the second entry the correctly spelled word. See the sample file in the
C<examples> directory for how to format it.
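As an illustration of the two-column layout, the sketch below parses such rows into a lookup table and applies it to the words of a ticket. It assumes the columns are separated by whitespace, which the sample file in C<examples> should confirm; the sample rows and the name C<load_corrections> are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical file contents: misspelled word, then the correct spelling.
my $sample = <<'END';
recieve   receive
pasword   password
moniter   monitor
END

# Parse two-column rows into a hash of misspelling => correction.
sub load_corrections {
    my ($fh) = @_;
    my %fix;
    while (my $line = <$fh>) {
        next if $line =~ /^\s*$/;               # skip blank lines
        my ($wrong, $right) = split ' ', $line;
        next unless defined $right;             # ignore malformed rows
        $fix{ lc $wrong } = lc $right;
    }
    return \%fix;
}

# An in-memory filehandle keeps the example self-contained.
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $fix = load_corrections($fh);
close $fh;

my @ticket = qw(cannot recieve email after pasword reset);
my @clean  = map { exists $fix->{$_} ? $fix->{$_} : $_ } @ticket;
print "@clean\n";
```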
=item I<processed_tickets_db:>
As mentioned earlier in B<DESCRIPTION>, the tickets must be subjected to various
preprocessing steps before they can be used for document modeling for the purpose of
retrieval. Preprocessing consists of stop-word removal, spelling corrections,
antonym detection, synonym addition, etc. The tickets resulting from preprocessing
are stored in a database file whose name you supply through this constructor
parameter.
=item I<raw_tickets_db:>
The raw tickets extracted from the Excel spreadsheet are stored in a database file
whose name you supply through this constructor parameter. The idea here is that we
do not want to process an Excel spreadsheet for each new attempt at matching a query
ticket with the previously recorded tickets in the same spreadsheet. It is much
faster to load the database back into the runtime environment than to process a large
spreadsheet.
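The caching idea can be sketched with the core C<Storable> module. Whether the module itself uses C<Storable> for its database files is an assumption here, and the subroutines C<extract_tickets_from_spreadsheet> and C<load_raw_tickets> are hypothetical stand-ins.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Stand-in for the slow step of reading the spreadsheet (hypothetical data).
sub extract_tickets_from_spreadsheet {
    my %tickets = (
        101 => 'printer jams on tray two',
        102 => 'cannot log in to vpn',
    );
    return \%tickets;
}

# Load tickets from the database file if present; otherwise extract and save.
sub load_raw_tickets {
    my ($db_file) = @_;
    return retrieve($db_file) if -s $db_file;
    my $tickets = extract_tickets_from_spreadsheet();
    store($tickets, $db_file);
    return $tickets;
}

my $db_file = tempdir(CLEANUP => 1) . '/raw_tickets.db';
my $first   = load_raw_tickets($db_file);   # extracts and writes the file
my $second  = load_raw_tickets($db_file);   # reads the file back
print "$_: $second->{$_}\n" for sort keys %$second;
```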
=item I<stemmed_tickets_db:>
As mentioned in the section B<THE THREE STAGES OF PROCESSING>, one of the first
things you do in the second stage of processing is to stem the words in the tickets.
Stemming is important because it reduces the size of the vocabulary. To illustrate,
stemming would reduce both the words 'programming' and 'programmed' to the common
root 'program'. This module uses a very simple stemmer whose rules can be found in
the utility subroutine C<_simple_stemmer()>. It would be trivial to expand on these
rules, or, for that matter, to use the Perl module C<Lingua::Stem::En> for a full
application of the Porter Stemming Algorithm. The stemmed tickets are saved in a
database file whose name is supplied through this constructor parameter.
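To show what suffix-stripping looks like in practice, here is a toy stemmer that handles the 'programming' and 'programmed' example. Its two rules are invented for this illustration and are not the rules in C<_simple_stemmer()>; the name C<toy_stem> is hypothetical, and a rule set this small will over-stem some words.

```perl
use strict;
use warnings;

# Toy suffix-stripping stemmer; these rules are illustrative only.
sub toy_stem {
    my ($word) = @_;
    $word = lc $word;
    # Strip "ing" or "ed", but only if at least three characters remain.
    if ($word =~ s/(?<=\w{3})(?:ing|ed)$//) {
        # Collapse a doubled final consonant left by the strip (programm -> program).
        $word =~ s/([b-df-hj-np-tv-z])\1$/$1/;
    }
    return $word;
}

print "$_ -> ", toy_stem($_), "\n" for qw(programming programmed program);
```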
=item I<stop_words_file:>
This constructor parameter is for naming the file that contains the stop words, these
being words you do not wish to be included in the vocabulary. The format of this
file must be as shown in the sample file C<stop_words.txt> in the C<examples>
directory.
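The combined effect of a stop-word list and C<min_word_length> can be sketched as a simple filter. The one-word-per-line layout assumed below is a guess, so check C<stop_words.txt> for the actual format; the word list and the name C<filter_words> are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical stop-word list; the one-word-per-line layout is an assumption.
my $stop_words_text = "the\nand\nwith\nthis\n";
my %stop;
$stop{ lc $_ } = 1 for split ' ', $stop_words_text;

# Drop stop words and words shorter than $min_word_length.
sub filter_words {
    my ($words, $stop, $min_word_length) = @_;
    my @kept = grep { length($_) >= $min_word_length && !$stop->{ lc $_ } } @$words;
    return @kept;
}

my @ticket = qw(the fan in this laptop rattles with load);
my @kept   = filter_words(\@ticket, \%stop, 4);
print "@kept\n";
```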