Algorithm-TicketClusterer


README  view on Meta::CPAN


This module requires the following three modules:

    Spreadsheet::ParseExcel                                                             
    Spreadsheet::XLSX                                                                   
    WordNet::QueryData                                                                  

the first for extracting information from the old-style
Excel sheets that are commonly used for storing tickets, the
second for extracting the same information from the
new-style Excel sheets, and the third for interfacing with
WordNet for extracting the synonyms and antonyms for the
words in the tickets.

For installation, do the usual

    perl Makefile.PL
    make
    make test
    make install

examples/README  view on Meta::CPAN


1.    ticket_preprocessor_and_doc_modeler.pl

2.    retrieve_similar_tickets.pl


Run the first script to see ticket preprocessing and doc
modeling being carried out on the (fake) tickets stored in
the Excel file ExampleExcelFile.xls in this directory.

Next, run the second script to retrieve the five tickets that
are closest to the query ticket whose integer ID is supplied
to the retrieve_similar_tickets_with_vsm() method in the
script.

If both scripts run fine, go through the statements in the
scripts to see how you need to sequence the different
preprocessing, doc modeling, and retrieval steps for your
tickets.

IMPORTANT1: The spreadsheet 

lib/Algorithm/TicketClusterer.pm  view on Meta::CPAN

    $clusterer->expand_all_tickets_with_synonyms();
    
    ## Construct the VSM doc model for the tickets:
    $clusterer->get_ticket_vocabulary_and_construct_inverted_index();
    $clusterer->construct_doc_vectors_for_all_tickets();

    #  Of the various constructor parameters shown above, the following two
    #  are critical to how information is extracted from an Excel
    #  spreadsheet: `clustering_fieldname' and `unique_id_fieldname'.  The
    #  first is the heading of the column that contains the textual content
    #  of the tickets.  The second is the heading of the column that
    #  contains a unique integer identifier for each ticket.

    #  The nine database related constructor parameters (these end in the
    #  suffix `_db') are there in order to avoid repeated parsing of the
    #  spreadsheet and preprocessing of the tickets every time you need to
    #  make a retrieval for a new ticket.  The goal here is that after the
    #  ticket information has been ingested from a spreadsheet, you would
    #  want to carry out similar-ticket retrieval in real time.  (Whether
    #  or not real-time retrieval would be feasible in actual practice
    #  would also depend on what hardware you are using, obviously.)


the negated phrases (that is, the words preceded by 'no' or 'not') are
replaced by their antonyms.

Obviously, expanding a ticket by synonyms makes sense only after it is
corrected for spelling and other errors.  What sort of errors one looks for
and corrects would, in general, depend on the application domain of the
tickets.  (It is not uncommon for engineering services to use jargon words
and acronyms that look like spelling errors to those not familiar with the
services.)  The module expects to see a file that is supplied through the
constructor parameter C<misspelled_words_file> that contains misspelled
words in the first column and their corrected versions in the second
column.  An example of such a file is included in the C<examples>
directory.  You would need to create your own version of such a file for
your application domain. Since conjuring up all the misspellings that your
ticket submitters are likely to throw at you is futile, you might consider
the following approach, which I prefer over actually reading the tickets
for such errors: turn on the debugging options in the constructor for some
initially collected spreadsheets and watch which words WordNet is unable
to supply any synonyms for.  In a large majority of cases, these will be
the misspelled words.


were to add all such synonyms to a ticket, you run the danger of altering the sense
of the ticket, besides unnecessarily increasing the size of the vocabulary. This
parameter caps the number of synonyms retained for any one word.  When the number
of synonyms returned by WordNet is greater than the value set for this parameter,
the synonyms retained are chosen randomly from the list returned by WordNet.
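The random selection described above can be sketched as follows.  This is an
illustrative assumption, not the module's own code; the C<cap_synonyms> name
and the sample synonym list are invented for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Keep at most $max synonyms, chosen at random when WordNet
# returns more than that (a sketch of the behavior described above).
sub cap_synonyms {
    my ( $max, @synonyms ) = @_;
    return @synonyms if @synonyms <= $max;
    return ( shuffle @synonyms )[ 0 .. $max - 1 ];
}

my @syns = qw(fault defect flaw glitch bug error);
my @kept = cap_synonyms( 3, @syns );
print scalar(@kept), "\n";    # 3
```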

=item I<min_idf_threshold:>

First recall that IDF stands for Inverse Document Frequency.  It is calculated during
the second of the three-stage processing of the tickets as described in the section
B<THE THREE STAGES OF PROCESSING TICKETS>.  The IDF value of a word gives us a
measure of the discriminatory power of the word.  Let's say you have a word that
occurs in only one out of 1000 tickets.  Such a word is obviously highly
discriminatory and its IDF would be the logarithm (to base 10) of the ratio of 1000
to 1, which is 3.  On the other hand, for a word that occurs in every one of 1000
tickets, its IDF value would be the logarithm of the ratio of 1000 to 1000, which is
0.  So, for the case when you have 1000 tickets, the upper bound on IDF is 3 and the
lower bound 0. This constructor parameter controls which of the query words you will
use for constructing the initial pool of tickets that will be used for matching.  The
larger the value of this threshold, the smaller the pool obviously.
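The two IDF values worked out above can be reproduced directly.  The
C<idf()> helper below is an illustration of the base-10 formula in the text,
not a subroutine of the module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# IDF = log10( total tickets / tickets containing the word )
sub idf {
    my ( $total, $with_word ) = @_;
    return log( $total / $with_word ) / log(10);
}

printf "%.1f\n", idf( 1000, 1 );       # 3.0  (highly discriminatory word)
printf "%.1f\n", idf( 1000, 1000 );    # 0.0  (word in every ticket)
```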


This parameter sets the minimum number of characters in a word in order for it to be
included for ticket processing.

=item I<misspelled_words_file:>

The extent to which you can improve ticket retrieval precision with the addition of
synonyms depends on the degree to which you can make corrections on the fly for the
spelling errors that occur frequently in tickets.  That fact makes the file you
supply through this constructor parameter very important.  For the current version of
the module, this file must contain exactly two columns, with the first entry in each
row the misspelled word and the second entry the correctly spelled word.  See this
file in the C<examples> directory for how to format it.
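As an illustration, such a file might contain entries like the following.  The
misspellings are hypothetical, and the whitespace-separated layout is an
assumption; consult the sample file in the C<examples> directory for the
format the module actually expects.

```
pasword       password
laetncy       latency
conectivty    connectivity
```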

=item I<processed_tickets_db:>

As mentioned earlier in B<DESCRIPTION>, the tickets must be subject to various
preprocessing steps before they can be used for document modeling for the purpose of
retrieval. Preprocessing consists of stop words removal, spelling corrections,
antonym detection, synonym addition, etc.  The tickets resulting from preprocessing
are stored in a database file whose name you supply through this constructor
parameter.



The raw tickets extracted from the Excel spreadsheet are stored in a database file
whose name you supply through this constructor parameter.  The idea here is that we
do not want to process an Excel spreadsheet for each new attempt at matching a query
ticket with the previously recorded tickets in the same spreadsheet.  It is much
faster to load the database back into the runtime environment than to process a large
spreadsheet.
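The parse-once-then-reload idea can be sketched with C<Storable>, one of the
module's listed dependencies.  This is not the module's own code; the file
name and the ticket data are placeholders.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

my $dir = tempdir( CLEANUP => 1 );

# Pretend these were just extracted from an Excel spreadsheet:
my %raw_tickets = (
    101 => 'printer does not respond',
    102 => 'cannot connect to vpn',
);

# Store once on disk ...
store \%raw_tickets, "$dir/raw_tickets.db";

# ... and on later runs reload the hash instead of re-parsing Excel.
my $restored = retrieve("$dir/raw_tickets.db");
print $restored->{101}, "\n";    # printer does not respond
```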

=item I<stemmed_tickets_db:>

As mentioned in the section B<THE THREE STAGES OF PROCESSING>, one of the first
things you do in the second stage of processing is to stem the words in the tickets.
Stemming is important because it reduces the size of the vocabulary.  To illustrate,
stemming would reduce both the words `programming' and `programmed' to the common
root `program'.  This module uses a very simple stemmer whose rules can be found in
the utility subroutine C<_simple_stemmer()>.  It would be trivial to expand on these
rules, or, for that matter, to use the Perl module C<Lingua::Stem::En> for a full
application of the Porter Stemming Algorithm.  The stemmed tickets are saved in a
database file whose name is supplied through this constructor parameter.
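A suffix-stripping stemmer in the spirit described above can be sketched as
follows.  The two rules shown are illustrative assumptions, not the actual
rules in C<_simple_stemmer()>.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Strip a trailing `ing' or `ed', then undouble a doubled consonant:
# programming -> programm -> program; programmed -> programm -> program.
sub simple_stem {
    my ($word) = @_;
    if ( length($word) > 5 && $word =~ s/(?:ing|ed)$// ) {
        $word =~ s/([a-z])\1$/$1/;
    }
    return $word;
}

print simple_stem('programming'), "\n";    # program
print simple_stem('programmed'),  "\n";    # program
print simple_stem('printer'),     "\n";    # printer (unchanged)
```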

=item I<stop_words_file:>


computational speed of the module.  As mentioned earlier, it is important to ground
the tickets in a common vocabulary and this module does that by adding to the tickets
a designated number of the synonyms for the words in the tickets.  However, the calls
to WordNet for the synonyms through the Perl interface C<WordNet::QueryData> can be
expensive. Caching means that only one call would need to be made to WordNet for any
given word regardless of how many times the word appears in all of the tickets.
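The one-call-per-word caching idea can be sketched as below.  The
C<lookup_synonyms()> subroutine is a hypothetical stand-in for a real call
through C<WordNet::QueryData>; only the caching pattern is the point here.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

my %synset_cache;
my $lookup_count = 0;

# Hypothetical stand-in for an expensive WordNet::QueryData call.
sub lookup_synonyms {
    my ($word) = @_;
    $lookup_count++;
    return [ "${word}_syn1", "${word}_syn2" ];
}

# Only the first request for a word reaches WordNet; repeats hit the cache.
sub cached_synonyms {
    my ($word) = @_;
    $synset_cache{$word} //= lookup_synonyms($word);
    return @{ $synset_cache{$word} };
}

cached_synonyms('printer') for 1 .. 5;
print $lookup_count, "\n";    # 1
```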

=item I<which_worksheet:>

This specifies the Excel worksheet that contains the tickets.  Its value should be 1
for the first sheet, 2 for the second, and so on.

=back

=begin html

<br>

=end html

=item  B<apply_filter_to_all_tickets()>


IDF values for the words in the vocabulary.

=item  B<display_inverted_index()>

=item  B<display_inverted_index_for_given_word( $word )>

=item  B<display_inverted_index_for_given_query( $ticket_id )>

The above three methods are useful for troubleshooting the issues that are related to
the generation of the inverted index.  The first method shows the entire inverted
index, the second the inverted index for a single specified word, and the third for
all the words in a query ticket.

=item  B<display_tickets_vocab()>

    $clusterer->display_tickets_vocab()

This method displays the ticket vocabulary constructed by a call to
C<get_ticket_vocabulary_and_construct_inverted_index()>.  The vocabulary display
consists of an alphabetized list of the words in all the tickets along with the
frequency of each word.


    $clusterer->get_tickets_from_excel()

This method calls on the C<Spreadsheet::ParseExcel> module to extract the tickets
from the old-style Excel spreadsheets and the C<Spreadsheet::XLSX> module for doing
the same from the new-style Excel spreadsheets.

=item  B<get_ticket_vocabulary_and_construct_inverted_index()>

    $clusterer->get_ticket_vocabulary_and_construct_inverted_index()

As mentioned in B<THE THREE STAGES OF PROCESSING>, the second stage of processing ---
doc modeling of the tickets --- starts with the stemming of the words in the tickets,
constructing a vocabulary of all the stemmed words in all the tickets, and
constructing an inverted index for the vocabulary words.  All of these things are
accomplished by this method.
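The vocabulary and inverted-index construction can be sketched as follows.
This is a hedged illustration, not the module's code; the ticket IDs and
stemmed words are invented, and the data structures are one plausible shape.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Stemmed tickets keyed by their unique integer IDs (placeholder data).
my %stemmed_tickets = (
    101 => [qw(printer not respond)],
    102 => [qw(printer jam)],
    103 => [qw(vpn connect fail)],
);

my ( %vocab_freq, %inverted_index );
for my $id ( sort keys %stemmed_tickets ) {
    my %seen;
    for my $word ( @{ $stemmed_tickets{$id} } ) {
        $vocab_freq{$word}++;    # vocabulary with word frequencies
        # inverted index: word => list of ticket IDs containing it
        push @{ $inverted_index{$word} }, $id unless $seen{$word}++;
    }
}

print scalar @{ $inverted_index{printer} }, "\n";    # 2
print $vocab_freq{printer}, "\n";                    # 2
```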

=item  B<restore_processed_tickets_from_disk()>

    $clusterer->restore_processed_tickets_from_disk()

This loads into your script the output of the ticket preprocessing stage.  This



This method is called by the C<get_tickets_from_excel()> method to store on the disk
the tickets extracted from the Excel spreadsheet.  Obviously, you can also call it in
your own script for doing the same.

=item  B<store_stemmed_tickets_and_inverted_index_on_disk()>

    $clusterer->store_stemmed_tickets_and_inverted_index_on_disk()

This method stores in a database file the stemmed tickets and the inverted index that
are produced at the end of the second stage of processing.

=item B<show_stemmed_ticket_clustering_data_for_given_id()>

    $clusterer->show_stemmed_ticket_clustering_data_for_given_id( $ticket_num );

If you want to see what sort of a job the stemmer is doing for a ticket, this is the
method to call.  You would need to set the argument C<$ticket_num> to the unique
integer ID of the ticket you are interested in.

=item  B<store_ticket_vectors()>



This module requires the following five modules:

    Spreadsheet::ParseExcel
    Spreadsheet::XLSX
    WordNet::QueryData
    Storable
    SDBM_File

the first for extracting information from the old-style Excel sheets that are
commonly used for storing tickets, the second for extracting the same information
from the new-style Excel sheets, the third for interfacing with WordNet for
extracting the synonyms and antonyms, the fourth for creating the various disk-based
database files needed by the module, and the last for disk-based hashes used to lend
persistence to the extraction of the alphabet used by the tickets and the inverse
document frequencies of the words.
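The disk-based-hash idea mentioned for C<SDBM_File> can be sketched as
below: a tied hash persists values (here, an IDF entry) between runs.  The
file name and the stored value are placeholders, not the module's own.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Fcntl;          # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;
use File::Temp qw(tempdir);

my $dir = tempdir( CLEANUP => 1 );

# Write an entry through a tied hash backed by disk files.
tie my %idf_db, 'SDBM_File', "$dir/idf", O_RDWR | O_CREAT, 0644
    or die "cannot tie: $!";
$idf_db{printer} = 1.8;
untie %idf_db;

# A later run can tie the same files and read the value back.
tie my %idf_again, 'SDBM_File', "$dir/idf", O_RDONLY, 0644
    or die "cannot tie: $!";
print $idf_again{printer}, "\n";    # 1.8
```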

=head1 EXPORT

None by design.


