                         clustering_fieldname      => $fieldname_for_clustering,
                         unique_id_fieldname       => $unique_id_fieldname,
                         raw_tickets_db            => $raw_tickets_db,
                         processed_tickets_db      => $processed_tickets_db,
                         stemmed_tickets_db        => $stemmed_tickets_db,
                         inverted_index_db         => $inverted_index_db,
                         tickets_vocab_db          => $tickets_vocab_db,
                         idf_db                    => $idf_db,
                         tkt_doc_vecs_db           => $tkt_doc_vecs_db,
                         tkt_doc_vecs_normed_db    => $tkt_doc_vecs_normed_db,
                         min_idf_threshold         => 1.8,
                         how_many_retrievals       => 5,
                    );
    
    my $query_tkt = 1393548;
    $clusterer->restore_ticket_vectors_and_inverted_index();
    my %retrieved = %{$clusterer->retrieve_similar_tickets_with_vsm($query_tkt)};
    foreach my $tkt_id (sort {$retrieved{$b} <=> $retrieved{$a}} keys %retrieved) {
        $clusterer->show_original_ticket_for_given_id( $tkt_id );
    }

    #  Of all the parameters shown above in the constructor call, the
    #  parameter min_idf_threshold plays a large role in what tickets are
    #  returned by the retrieval function. The value of this parameter
    #  depends on the number of tickets in your Excel spreadsheet.  If the
    #  number of tickets is in the low hundreds, this parameter is likely to
    #  require a value of 1.5 to 1.8.  If the number of tickets is in the
    #  thousands, the value of this parameter is likely to be between 2 and
    #  3. See the writeup on this parameter in the API description in the
    #  rest of this documentation.


=head1 CHANGES

Version 1.01 of the module removes the platform dependency of the functions used for
reading the text files for stop words, misspelled words, etc.


=head1 DESCRIPTION

B<Algorithm::TicketClusterer> is a I<perl5> module for retrieving
previously processed Excel-stored tickets similar to a new ticket.  Routing
decisions made for the past similar tickets can be useful in expediting the
routing of a new ticket.

Tickets are commonly used in the software services industry and in customer
support businesses to record requests for service, product complaints,
user feedback, and so on.

With regard to the routing of a ticket, you would want each new ticket to
be handled by the tech support individual who is most qualified to address
the issue raised in the ticket.  Identifying the right individual for each
new ticket in real-time is no easy task for organizations that man large
service centers and helpdesks.  So if it were possible to quickly identify
the previously processed tickets that are most similar to a new ticket, one
could think of constructing semi-automated (or, perhaps, even fully
automated) ticket routers.

Identifying old tickets similar to a new ticket is made challenging by the
fact that folks who submit tickets often write them quickly and informally.
The informal style of writing means that different people may use different
colloquial terms to describe the same thing. And the quickness associated
with their submission causes the tickets to frequently contain spelling and
other errors such as conjoined words, fragmentation of long words, and so
on.

This module is an attempt at dealing with these challenges.

The problem of different people using different words to describe the same
thing is taken care of by using WordNet to add to each ticket a designated
number of synonyms for each word in the ticket.  The idea is that after all
the tickets are expanded in this manner, they would become grounded in a
common vocabulary. The synonym expansion of a ticket takes place only after
the negated phrases (that is, the words preceded by 'no' or 'not') are
replaced by their antonyms.
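
As a rough illustration of that ordering, the sketch below replaces a negated word by
an antonym before any synonym expansion would take place.  The C<%antonym_for> table
and the function name C<replace_negated_phrases> are hypothetical stand-ins; the
module itself obtains antonyms from WordNet, and its internal logic may differ.

```perl
use strict;
use warnings;

# Hypothetical antonym table standing in for a WordNet lookup.
my %antonym_for = (
    working   => 'broken',
    available => 'unavailable',
);

# Replace "no X" / "not X" by the antonym of X when one is known;
# leave the phrase untouched otherwise.
sub replace_negated_phrases {
    my ($text) = @_;
    $text =~ s{\b(no|not)\s+(\w+)}
              {exists $antonym_for{lc $2} ? $antonym_for{lc $2} : "$1 $2"}gie;
    return $text;
}

print replace_negated_phrases("printer not working since Monday"), "\n";
```

A negated word with no known antonym is left as it is, so the later synonym
expansion still sees the original text.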

Obviously, expanding a ticket by synonyms makes sense only after it is
corrected for spelling and other errors.  What sort of errors one looks for
and corrects would, in general, depend on the application domain of the
tickets.  (It is not uncommon for engineering services to use jargon words
and acronyms that look like spelling errors to those not familiar with the
services.)  The module expects to see a file that is supplied through the
constructor parameter C<misspelled_words_file> that contains misspelled
words in the first column and their corrected versions in the second
column.  An example of such a file is included in the C<examples>
directory.  You would need to create your own version of such a file for
your application domain. Since conjuring up the misspellings that your
ticket submitters are likely to throw at you is futile, you might consider
the following approach, which I prefer to actually reading the tickets
for such errors: turn on the debugging options in the constructor for some
initially collected spreadsheets and watch which words WordNet is unable
to supply any synonyms for.  In a large majority of cases, these would be
the misspelled words.
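
The following is a minimal sketch of how a two-column corrections file of this kind
could be read and applied.  It assumes whitespace-separated columns and treats lines
starting with C<#> as comments; both are assumptions made for the illustration, and
the function names are hypothetical.  Check the file in the C<examples> directory for
the format the module actually expects.

```perl
use strict;
use warnings;

# Assumed layout: one "misspelled corrected" pair per line, separated
# by whitespace; blank lines and '#' comment lines are skipped.
my $file_contents = <<'END';
pasword     password
recieve     receive
# a comment line
loginn      login
END

sub load_corrections {
    my ($fh) = @_;
    my %correction_for;
    while (my $line = <$fh>) {
        next if $line =~ /^\s*(?:#|$)/;
        my ($wrong, $right) = split ' ', $line;
        next unless defined $right;          # ignore malformed lines
        $correction_for{lc $wrong} = $right;
    }
    return \%correction_for;
}

sub apply_corrections {
    my ($text, $correction_for) = @_;
    return join ' ',
        map { exists $correction_for->{lc $_} ? $correction_for->{lc $_} : $_ }
        split ' ', $text;
}

# Read from an in-memory filehandle so the sketch needs no external file.
open my $fh, '<', \$file_contents or die "cannot open in-memory file: $!";
my $corrections = load_corrections($fh);
close $fh;

print apply_corrections("cannot recieve pasword reset", $corrections), "\n";
```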

Expanding a ticket with synonyms is complicated by the fact that some
common words have so many synonyms that they can overwhelm the relatively
small number of words in a ticket.  Adding too many synonyms in relation to
the size of a ticket can not only distort the sense of the ticket but also
increase the computational cost of processing all the tickets.

In order to deal with the pros and the cons of using synonyms, the present
module strikes a middle ground: You can specify how many synonyms to use
for a word (assuming that the number of synonyms supplied by WordNet is
larger than the number specified).  This allows you to experiment with
retrieval precision by altering the number of synonyms used.  The retained
synonyms are selected randomly from those supplied by WordNet.  (A smarter
way to select synonyms would be to base them on the context.  For example,
you would not want to use the synonym `programmer' for the noun `developer'
if your application domain is real-estate.  However, such context-dependent
selection of synonyms would take us into the realm of ontologies that I
have chosen to stay away from in this first version of the module.)
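
The random selection described above can be sketched as follows.  The function name
C<limit_synonyms> and the synonym list are hypothetical; in the module the limit
comes from the constructor parameter C<max_num_syn_words> and the candidates come
from WordNet.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Keep at most $max synonyms, chosen at random.  If fewer than $max
# are available, all of them are kept.
sub limit_synonyms {
    my ($synonyms, $max) = @_;
    return @$synonyms if @$synonyms <= $max;
    my @shuffled = shuffle @$synonyms;
    return @shuffled[0 .. $max - 1];
}

# Hypothetical synonym list standing in for what WordNet might return.
my @from_wordnet = qw(fault defect flaw error glitch bug);
my @kept = limit_synonyms(\@from_wordnet, 3);
print "retained: @kept\n";
```

Because the selection is random, repeated runs over the same tickets can retain
different synonyms, which is worth keeping in mind when comparing retrieval results.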

Another issue related to the overall run-time performance of this module is
the computational cost of the calls to WordNet through its Perl interface
C<WordNet::QueryData>.  This module uses what I have referred to as
I<synset caching> to make this process as efficient as possible.  The
result of each WordNet lookup is cached in a database file whose name you
supply through the constructor option C<synset_cache_db>.  If you are doing
a good job of catching spelling errors, the module will carry out a
decreasing number of WordNet lookups as the tickets are scanned for

lib/Algorithm/TicketClusterer.pm view on Meta::CPAN

    $clusterer->delete_markup_from_all_tickets()

It is not uncommon for the textual content of a ticket to contain HTML markup. This
method deletes such strings.  Note that this method cannot handle complex markup,
such as HTML comment blocks or tags that cross line boundaries, nor can it handle
textual content in which angle brackets denote "less than" or "greater than".  If
your tickets require more sophisticated processing for the removal of markup, you
might consider using the C<HTML::Restrict> module.
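
To see why stray angle brackets are a problem, consider the simple tag stripper
sketched below.  It is an assumed approach written for this illustration, not the
module's own code, and C<strip_simple_markup> is a hypothetical name.

```perl
use strict;
use warnings;

# A simple tag stripper: delete everything from '<' through the next '>'.
sub strip_simple_markup {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}

# Ordinary tags are removed as intended.
print strip_simple_markup("<p>VPN <b>fails</b> on login</p>"), "\n";

# Angle brackets used as comparison operators are mistaken for a tag,
# so the text between them is lost.
print strip_simple_markup("load < 5 and latency > 200"), "\n";
```

The second call prints C<load  200>, which shows how comparison operators in the
ticket text can cause genuine content to be deleted.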


=item  B<display_all_doc_vectors()>

=item  B<display_all_normalized_doc_vectors()>

These two methods are useful for troubleshooting if things don't look right with
regard to retrieval.

=item  B<display_inverse_document_frequencies()>

    $clusterer->display_inverse_document_frequencies()

As mentioned earlier, the document frequency (DF) of a word is the number of tickets
in which the word appears.  The IDF of a word is the logarithm of the ratio of the
total number of tickets to the DF of the word.  A call to this method displays the
IDF values for the words in the vocabulary.
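
The definition above can be worked through with a few numbers.  The text does not
say which logarithm base is used, so base 10 in this sketch is an assumption, and the
document frequencies are made up for the example.

```perl
use strict;
use warnings;

# IDF = log( total_tickets / document_frequency ).
# Base 10 is an assumption; the definition above does not fix the base.
sub idf {
    my ($total_tickets, $df) = @_;
    return log($total_tickets / $df) / log(10);
}

# Hypothetical document frequencies for a collection of 1000 tickets.
my %df = (printer => 10, password => 100, the => 1000);
for my $word (sort keys %df) {
    printf "%-10s df=%4d  idf=%.2f\n", $word, $df{$word}, idf(1000, $df{$word});
}
```

A word that appears in every ticket gets an IDF of zero, so it falls below any
positive C<min_idf_threshold> and plays no role in candidate selection.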

=item  B<display_inverted_index()>

=item  B<display_inverted_index_for_given_word( $word )>

=item  B<display_inverted_index_for_given_query( $ticket_id )>

The above three methods are useful for troubleshooting the issues that are related to
the generation of the inverted index.  The first method shows the entire inverted
index, the second the inverted index for a single specified word, and the third for
all the words in a query ticket.

=item  B<display_tickets_vocab()>

    $clusterer->display_tickets_vocab()

This method displays the ticket vocabulary constructed by a call to
C<get_ticket_vocabulary_and_construct_inverted_index()>.  The vocabulary display
consists of an alphabetized list of the words in all the tickets along with the
frequency of each word.

=item  B<expand_all_tickets_with_synonyms()>

    $clusterer->expand_all_tickets_with_synonyms();

This is the final step in the preprocessing of the tickets before they are ready for
the doc modeling stage.  This method calls other functions internal to the module
that ultimately make calls to WordNet through the Perl interface provided by the
C<WordNet::QueryData> module.

=item  B<get_tickets_from_excel()>

    $clusterer->get_tickets_from_excel()

This method calls on the C<Spreadsheet::ParseExcel> module to extract the tickets
from the old-style Excel spreadsheets and the C<Spreadsheet::XLSX> module for doing
the same from the new-style Excel spreadsheets.

=item  B<get_ticket_vocabulary_and_construct_inverted_index()>

    $clusterer->get_ticket_vocabulary_and_construct_inverted_index()

As mentioned in B<THE THREE STAGES OF PROCESSING>, the second stage of processing ---
doc modeling of the tickets --- starts with the stemming of the words in the tickets,
constructing a vocabulary of all the stemmed words in all the tickets, and
constructing an inverted index for the vocabulary words.  All of these things are
accomplished by this method.
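
The two data structures mentioned above can be sketched as follows.  Stemming is
left out to keep the example short, and the tickets and variable names are
hypothetical; the module's internal representation may differ.

```perl
use strict;
use warnings;

# Hypothetical, already-preprocessed tickets keyed by ticket ID.
my %tickets = (
    101 => 'printer offline after driver update',
    102 => 'password reset link expired',
    103 => 'printer driver install fails',
);

my (%vocab_freq, %inverted_index);
for my $id (sort keys %tickets) {
    my %seen;
    for my $word (split ' ', lc $tickets{$id}) {
        $vocab_freq{$word}++;    # total occurrences across all tickets
        # Record each ticket at most once per word.
        push @{ $inverted_index{$word} }, $id unless $seen{$word}++;
    }
}

for my $word (sort keys %vocab_freq) {
    printf "%-10s freq=%d tickets=%s\n",
        $word, $vocab_freq{$word}, join(',', @{ $inverted_index{$word} });
}
```

The inverted index maps each vocabulary word to the tickets that contain it, which
is what makes it possible to find candidate tickets without scanning the whole
collection.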

=item  B<restore_processed_tickets_from_disk()>

    $clusterer->restore_processed_tickets_from_disk()

This loads into your script the output of the ticket preprocessing stage.  This
method is called internally by C<restore_ticket_vectors_and_inverted_index()>, which
you would use in your ticket retrieval script, assuming it is separate from the
ticket preprocessing script.

=item B<restore_raw_tickets_from_disk()>

    $clusterer->restore_raw_tickets_from_disk()

With this method, you are spared the trouble of having to repeatedly parse the same
Excel spreadsheet during the development phase as you are testing the module with
different query tickets.  This method is called internally by
C<restore_ticket_vectors_and_inverted_index()>.

=item  B<restore_stemmed_tickets_from_disk()>

    $clusterer->restore_stemmed_tickets_from_disk();

This method is called internally by
C<restore_ticket_vectors_and_inverted_index()>.

=item  B<restore_ticket_vectors_and_inverted_index()>

    $clusterer->restore_ticket_vectors_and_inverted_index()

If you are going to be doing ticket preprocessing and doc modeling in one script and
ticket retrieval in another, then this is the first method you would need to call in
the latter for the restoration of the VSM model for the tickets and the inverted
index.

=item B<retrieve_similar_tickets_with_vsm()>

    my $retrieved_hash_ref = $clusterer->retrieve_similar_tickets_with_vsm( $ticket_num )

It is this method that retrieves tickets that are most similar to a query ticket.
The method first utilizes the inverted index to construct a candidate list of the
tickets that share words with the query ticket.  Only those words play a role here
whose IDF values exceed C<min_idf_threshold>.  Subsequently, the query ticket vector
is matched with each of the ticket vectors in the candidate list.  The method returns
a reference to a hash whose keys are the IDs of the tickets that match the query
ticket and whose values are the corresponding cosine similarity scores.
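
The two-step retrieval described above can be sketched in plain Perl as shown below.
This is an illustration under stated assumptions, not the module's implementation:
the vectors use raw word counts rather than whatever weighting the module applies,
the IDF values are made up, the query ticket is excluded from its own results, and
C<retrieve_similar> and C<cosine> are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical inputs: sparse ticket vectors (word => weight), per-word
# IDF values, and an inverted index (word => list of ticket IDs).
my %vectors = (
    101 => { printer => 1, driver => 1, offline => 1 },
    102 => { password => 1, reset => 1 },
    103 => { printer => 1, driver => 1, install => 1, fails => 1 },
);
my %idf = (printer => 2.1, driver => 2.1, offline => 2.6, password => 2.6,
           reset => 2.6, install => 2.6, fails => 1.2);
my %inverted_index;
for my $id (keys %vectors) {
    push @{ $inverted_index{$_} }, $id for keys %{ $vectors{$id} };
}

# Cosine similarity between two sparse vectors.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    $dot += $u->{$_} * ($v->{$_} // 0) for keys %$u;
    $nu  += $_ ** 2 for values %$u;
    $nv  += $_ ** 2 for values %$v;
    return ($nu && $nv) ? $dot / sqrt($nu * $nv) : 0;
}

sub retrieve_similar {
    my ($query_id, $min_idf, $how_many) = @_;
    my $query = $vectors{$query_id};
    # Step 1: candidates share a word whose IDF exceeds the threshold.
    my %candidates;
    for my $word (grep { ($idf{$_} // 0) > $min_idf } keys %$query) {
        $candidates{$_} = 1
            for grep { $_ != $query_id } @{ $inverted_index{$word} };
    }
    # Step 2: score each candidate against the query and keep the best.
    my %score = map { $_ => cosine($query, $vectors{$_}) } keys %candidates;
    my @best = sort { $score{$b} <=> $score{$a} } keys %score;
    splice @best, $how_many if @best > $how_many;
    return { map { $_ => $score{$_} } @best };
}

my $retrieved = retrieve_similar(103, 1.8, 5);
printf "%s => %.3f\n", $_, $retrieved->{$_}
    for sort { $retrieved->{$b} <=> $retrieved->{$a} } keys %$retrieved;
```

In this toy data the word C<fails> has an IDF below the threshold of 1.8, so it does
not contribute any candidates, and ticket 102 is never scored because it shares no
word with the query.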

=item B<show_original_ticket_for_given_id()>

    $clusterer->show_original_ticket_for_given_id( $ticket_num )


the Excel file C<ExampleExcelFile.xls> that you will find in the same directory.

=item B<For retrieving similar tickets:>

Next, run the script

    retrieve_similar_tickets.pl

to retrieve five tickets that are closest to the query ticket whose integer ID is
supplied to the C<retrieve_similar_tickets_with_vsm()> method in the script.

=back

Note that the tickets in the C<ExampleExcelFile.xls> file are contrived.  The sole
purpose of executing the above two scripts is just to get you started with the use of
this module.


=head1 HOW YOU CAN TURN THIS MODULE INTO A PRODUCTION-QUALITY TOOL

By a production-quality tool, I mean a software package that you can I<actually> use
in a production environment for automated or semi-automated ticket routing in your
organization.  I am assuming you already have the tools in place that insert new
tickets into an Excel spreadsheet in real time.

Turning this module into a production tool will require that you find the best values
to use for the following three parameters that are needed by the constructor: (1)
C<min_idf_threshold> for the minimum C<idf> value for the words in a query ticket in
order for them to be considered for matching with the other tickets; (2)
C<min_word_length> for discarding words that are too short; and (3)
C<max_num_syn_words> for how many synonyms to retain for a word if the number of
synonyms returned by WordNet is too large.  In addition, you must also come up with a
misspelled-words file that is appropriate to your application domain and a stop-words
file.

In order to find the best values to use for the parameters that are mentioned above,
I suggest creating a graphical front-end for this module that would allow for
altering the values of the three parameters listed above in response to the
prevailing mis-routing rates for the tickets.  The front-end will display to an
operator the latest ticket that needs to be routed and a small set of the
best-matching previously routed tickets as returned by this module.  Used either in a
fully-automated mode or a semi-automated mode, this front-end would contain a
feedback recorder that would keep track of mis-routed tickets --- the mis-routed
tickets would presumably bounce back to the central operator monitoring the
front-end. The front-end display could be equipped with slider controls for altering
the values used for the three parameters. Obviously, as a parameter is changed, some
of the database files stored on the disk would need to be recomputed.  The same would
be the case if you make changes to the misspelled-words file or to the stop-words
file.

=head1 REQUIRED

This module requires the following five modules:

    Spreadsheet::ParseExcel
    Spreadsheet::XLSX
    WordNet::QueryData
    Storable
    SDBM_File

the first for extracting information from the old-style Excel sheets that are
commonly used for storing tickets, the second for extracting the same information
from the new-style Excel sheets, the third for interfacing with WordNet for
extracting the synonyms and antonyms, the fourth for creating the various disk-based
database files needed by the module, and the last for disk-based hashes used to lend
persistence to the extraction of the alphabet used by the tickets and the inverse
document frequencies of the words.
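
For readers unfamiliar with the last two modules, the sketch below shows the general
pattern each one supports.  The file names and the data are hypothetical, and the
sketch does not claim to reproduce how the module lays out its own database files.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;
use Storable qw(store retrieve);

my $dir = tempdir(CLEANUP => 1);    # scratch directory for this demo

# Storable: serialize a nested structure to disk and read it back.
my %doc_vectors = (101 => { printer => 1, driver => 1 });
store(\%doc_vectors, "$dir/doc_vecs.db");
my $restored = retrieve("$dir/doc_vecs.db");
print "restored weight: $restored->{101}{printer}\n";

# SDBM_File: a flat string-to-string hash tied to a disk file.
tie my %idf_on_disk, 'SDBM_File', "$dir/idf", O_RDWR | O_CREAT, 0666
    or die "cannot tie SDBM file: $!";
$idf_on_disk{printer} = 2.1;
untie %idf_on_disk;

# Re-open the same file to confirm that the value persisted.
tie my %idf_again, 'SDBM_File', "$dir/idf", O_RDWR | O_CREAT, 0666
    or die "cannot re-tie SDBM file: $!";
print "idf(printer) = $idf_again{printer}\n";
untie %idf_again;
```

C<Storable> handles nested data such as per-ticket vectors, whereas an SDBM file
stores only flat key-value strings, which suits simple word-to-number tables.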

=head1 EXPORT

None by design.

=head1 CAVEATS

An automated or semi-automated ticket router based on the concepts incorporated in
this module may not be appropriate for all applications, especially in domains where
highly jargonified expressions are used to describe faults and problems associated
with an application.

=head1 BUGS

Please notify the author if you encounter any bugs.  When sending email, please place
the string 'TicketClusterer' in the subject line to get past my spam filter.

=head1 INSTALLATION

Download the archive from CPAN into any directory of your choice.  Unpack the archive
with a command that on a Linux machine would look like:

    tar zxvf Algorithm-TicketClusterer-1.01.tar.gz

This will create an installation directory for you whose name will be
C<Algorithm-TicketClusterer-1.01>.  Enter this directory and execute the following
commands for a standard install of the module if you have root privileges:

    perl Makefile.PL
    make
    make test
    sudo make install

If you do not have root privileges, you can carry out a non-standard install of the
module in any directory of your choice by:

    perl Makefile.PL prefix=/some/other/directory/
    make
    make test
    make install

With a non-standard install, you may also have to set your PERL5LIB environment
variable so that this module can find the required other modules. How you do that
would depend on what platform you are working on.  In order to install this module on
a Linux machine on which I use tcsh for the shell, I set the PERL5LIB environment
variable by

    setenv PERL5LIB /some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/

If I used bash, I'd need to declare:

    export PERL5LIB=/some/other/directory/lib64/perl5/:/some/other/directory/share/perl5/


=head1 THANKS
