best results from the CPAN

Algorithm-TicketClusterer

view release on metacpan or search on metacpan

lib/Algorithm/TicketClusterer.pm view on Meta::CPAN

=item B<new()>

A call to C<new()> constructs a new instance of the C<Algorithm::TicketClusterer>
class:

    my $clusterer = Algorithm::TicketClusterer->new( 

                     excel_filename            => $excel_filename,
                     clustering_fieldname      => $fieldname_for_clustering,
                     unique_id_fieldname       => $unique_id_fieldname,
                     which_worksheet           => $which_worksheet,
                     raw_tickets_db            => $raw_tickets_db,
                     processed_tickets_db      => $processed_tickets_db,
                     stemmed_tickets_db        => $stemmed_tickets_db,
                     inverse_index_db          => $inverse_index_db,
                     tickets_vocab_db          => $tickets_vocab_db,
                     idf_db                    => $idf_db,
                     tkt_doc_vecs_db           => $tkt_doc_vecs_db,
                     tkt_doc_vecs_normed_db    => $tkt_doc_vecs_normed_db,
                     synset_cache_db           => $synset_cache_db,
                     stop_words_file           => $stop_words_file,
                     misspelled_words_file     => $misspelled_words_file,
                     add_synsets_to_tickets    => 1,
                     want_synset_caching       => 1,
                     min_idf_threshold         => 2.0,
                     max_num_syn_words         => 3,
                     min_word_length           => 4,
                     want_stemming             => 1,
                     how_many_retrievals       => 5,
                     debug1                    => 1,  # for processing, filtering Excel
                     debug2                    => 1,  # for doc modeling
                     debug3                    => 1,  # for retrieving similar tickets

                   );

Obviously, before you can invoke the constructor, you must provide values for the
variables shown to the right of the big arrows.  As to what these values should be is
made clear by the following alphabetized list that describes each of the constructor
parameters shown above:

=over 24

=item I<add_synsets_to_tickets:>

You can turn off the addition of synonyms to the tickets by setting this boolean
parameter to 0.

=item I<clustering_fieldname:>

This is for supplying to the constructor the heading of the column in your Excel
spreadsheet that contains the textual data for the tickets.  For example, if the
column heading for the textual content of the tickets is `Description', you must
supply this string as the value for the parameter C<clustering_fieldname>.

=item I<debug1:>

When this parameter is set, the module prints out information regarding what columns
of the spreadsheet it is extracting information from, the headers for those columns,
the index of the column that contains the textual content of the tickets, and of the
column that contains the unique integer identifier for each ticket.  If you are
dealing with spreadsheets with a large number of tickets, it is best to pipe the
output of the module into a file to see the debugging information.

=item I<debug2:>

When this parameter is set, you will see how WordNet is being utilized to generate
word synonyms. This debugging output is also useful to see the extent of misspellings
in the tickets.  If WordNet is unable to find the synonyms for a word, chances are
that the word is not spelled correctly (or that it is a jargon word or a jargon
acronym).

=item I<debug3:>

This debug flag applies to the calculations carried out during the retrieval of
similar tickets.  When this flag is set, the module will display the candidate set of
tickets to be considered for matching with the query ticket.  This candidate set is
chosen by using the inverted index to collect all the tickets that share words with
the query word provided the IDF value for each such word exceeds the threshold set by
the constructor parameter C<min_idf_threshold>.

=item I<excel_filename:>

This is obviously the name of the Excel file that contains the tickets you want to
process.

=item I<how_many_retrievals:>

The integer value supplied for this parameter determines how many tickets that are
most similar to a query ticket will be returned.

=item I<idf_db:>

You store the inverse document frequencies for the vocabulary words in a database
file whose name is supplied through this constructor parameter.  As mentioned
earlier, the IDF for a word is, in principle, the logarithm of the ratio of the total
number of tickets to the DF (Document Frequency) for the word.  The DF of a word is
the number of tickets in which the word appears.

=item I<inverted_index_db:>

If you plan to create separate scripts for the three stages of processing described
earlier, you must store the inverted index in a database file so that it can be used
by the script whose job is to carry out similarity based ticket retrieval. The
inverted index is stored in a database file whose name is supplied through this
constructor parameter.

=item I<max_num_syn_words:>

As mentioned in B<DESCRIPTION>, some words can have a very large number of synonyms
--- much larger than the number of words that may exist in a typical ticket.  If you
were to add all such synonyms to a ticket, you run the danger of altering the sense
of the ticket, besides unnecessarily increasing the size of the vocabulary. This
parameter limits the number of synonyms chosen to the value used for the parameter.
When the number of synonyms returned by WordNet is greater than the value set for
this parameter, the synonyms retained are chosen randomly from the list returned by
WordNet.

=item I<min_idf_threshold:>

First recall that IDF stands for Inverse Document Frequency.  It is calculated during
the second of the three-stage processing of the tickets as described in the section

lib/Algorithm/TicketClusterer.pm view on Meta::CPAN


The argument to the method is the unique integer ID of a ticket for which
you want to see all the fields as stored in the Excel spreadsheet.

=item B<show_raw_ticket_clustering_data_for_given_id()>

While the previous method shows all the fields for a ticket, this method
shows only the textual content --- the content you want to use for
establishing similarity between a query ticket and the other tickets.

=item B<show_processed_ticket_clustering_data_for_given_id()>

    $clusterer->show_processed_ticket_clustering_data_for_given_id( $ticket_num );

This is the method to call if you wish to examine the textual content of a ticket
after it goes through the preprocessing steps.  In particular, you will see the
corrections made, the synonyms added, etc.  You would need to set the argument
C<$ticket_num> to the unique integer ID of the ticket you are interested in.

=item  B<store_processed_tickets_on_disk()>

    $clusterer->store_processed_tickets_on_disk();

This stores in a database file the preprocessed textual content of the
tickets.

=item B<store_raw_tickets_on_disk()>

    $clusterer->store_raw_tickets_on_disk();

This method is called by the C<get_tickets_from_excel()> method to store on the disk
the tickets extracted from the Excel spreadsheet.  Obviously, you can also call it in
your own script for doing the same.

=item  B<store_stemmed_tickets_and_inverted_index_on_disk()>

    $clusterer->store_stemmed_tickets_and_inverted_index_on_disk()

This method stores in a database file the stemmed tickets and the inverted index that
are produced at the end of the second stage of processing.

=item B<show_stemmed_ticket_clustering_data_for_given_id()>

    $clusterer->show_stemmed_ticket_clustering_data_for_given_id( $ticket_num );

If you want to see what sort of a job the stemmer is doing for a ticket, this is the
method to call.  You would need to set the argument C<$ticket_num> to the unique
integer ID of the ticket you are interested in.

=item  B<store_ticket_vectors()>

    $clusterer->store_ticket_vectors()

As the name implies, this call stores the vectors, both regular and normalized, in a
database file on the disk.

=back

=head1 HOW THE MATCHING TICKETS ARE RETRIEVED

It is the method C<retrieve_similar_tickets_with_vsm()> that returns the best ticket
matches for a given query ticket.  What this method returns is a hash reference; the
keys in this hash are the integer IDs of the matching tickets and the values the
cosine similarity distance between the query ticket and the matching tickets.  The
number of matching tickets returned by C<retrieve_similar_tickets_with_vsm()> is set
by the constructor parameter C<how_many_retrievals>.  Note that
C<retrieve_similar_tickets_with_vsm()> takes a single argument, which is the integer
ID of the query ticket.


=head1 THE C<examples> DIRECTORY

The C<examples> directory contains the following two scripts that would be your
quickest way to become familiar with this module:

=over

=item B<For ticket preprocessing and doc modeling:>

Run the script

    ticket_preprocessor_and_doc_modeler.pl

This will carry out preprocessing and doc modeling of the tickets that are stored in
the Excel file C<ExampleExcelFile.xls> that you will find in the same directory.

=item B<For retrieving similar tickets:>

Next, run the script

    retrieve_similar_tickets.pl

to retrieve five tickets that are closest to the query ticket whose integer ID is
supplied to the C<retrieve_similar_tickets_with_vsm()> method in the script.

=back

Note that the tickets in the C<ExampleExcelFil.xls> file are contrived.  The sole
purpose of executing the above two scripts is just to get you started with the use of
this module.


=head1 HOW YOU CAN TURN THIS MODULE INTO A PRODUCTION-QUALITY TOOL

By a production-quality tool, I mean a software package that you can I<actually> use
in a production environment for automated or semi-automated ticket routing in your
organization.  I am assuming you already have the tools in place that insert in
real-time the new tickets in an Excel spreadsheet.

Turning this module into a production tool will require that you find the best values
to use for the following three parameters that are needed by the constructor: (1)
C<min_idf_threshold> for the minimum C<idf> value for the words in a query ticket in
order for them to be considered for matching with the other tickets; (2)
C<min_word_length> for discarding words that are too short; and (3)
C<max_num_syn_words> for how many synonyms to retain for a word if the number of
synonyms returned by WordNet is too large.  In addition, you must also come up with a
misspelled-words file that is appropriate to your application domain and a stop-words
file.

In order to find the best values to use for the parameters that are mentioned above,
I suggest creating a graphical front-end for this module that would allow for
altering the values of the three parameters listed above in response to the
prevailing mis-routing rates for the tickets.  The front-end will display to an
operator the latest ticket that needs to be routed and a small set of the
best-matching previously routed tickets as returned by this module.  Used either in a
fully-automated mode or a semi-automated mode, this front-end would contain a
feedback recorder that would keep track of mis-routed tickets --- the mis-routed
tickets would presumably bounce back to the central operator monitoring the
front-end. The front-end display could be equipped with slider controls for altering
the values used for the three parameters. Obviously, as a parameter is changed, some
of the database files stored on the disk would need to be recomputed.  The same would
be the case if you make changes to the misspelled-words file or to the stop-words
file.

=head1 REQUIRED

This module requires the following five modules:

    Spreadsheet::ParseExcel
    Spreadsheet::XLSX
    WordNet::QueryData
    Storable
    SDBM_File

the first for extracting information from the old-style Excel sheets that are
commonly used for storing tickets, the second for extracting the same information
from the new-style Excel sheets, the third for interfacing with WordNet for
extracting the synonyms and antonyms, the fourth for creating the various disk-based
database files needed by the module, and the last for disk-based hashes used to lend
persistence to the extraction of the alphabet used by the tickets and the inverse
document frequencies of the words.

=head1 EXPORT

None by design.

=head1 CAVEATS

An automated or semi-automated ticket router based on the concepts incorporated in
this module may not be appropriate for all applications, especially in domains where
highly jargonified expressions are used to describe faults and problems associated
with an application.

=head1 BUGS

Please notify the author if you encounter any bugs.  When sending email, please place
the string 'TicketClusterer' in the subject line to get past my spam filter.

=head1 INSTALLATION

Download the archive from CPAN in any directory of your choice.  Unpack the archive
with a command that on a Linux machine would look like:

    tar zxvf Algorithm-TicketClusterer-1.01.tar.gz

This will create an installation directory for you whose name will be
C<Algorithm-TicketClusterer-1.01>.  Enter this directory and execute the following
commands for a standard install of the module if you have root privileges:

    perl Makefile.PL
    make
    make test
    sudo make install

If you do not have root privileges, you can carry out a non-standard install the

( run in 0.710 second using v1.01-cache-2.11-cpan-39bf76dae61 )