Algorithm-TicketClusterer

 view release on metacpan or  search on metacpan

lib/Algorithm/TicketClusterer.pm  view on Meta::CPAN

    #  of the tickets.  The second is the heading of the column that
    #  contains a unique integer identifier for each ticket.

    #  The nine database related constructor parameters (these end in the
    #  suffix `_db') are there in order to avoid repeated parsing of the
    #  spreadsheet and preprocessing of the tickets every time you need to
    #  make a retrieval for a new ticket.  The goal here is that after the
    #  ticket information has been ingested from a spreadsheet, you would
    #  want to carry out similar-ticket retrieval in real time.  (Whether
    #  or not real-time retrieval would be feasible in actual practice
    #  would also depend on what hardware you are using, obviously.)

    #  After the above preprocessing and doc modeling steps, you can
    #  extract the most similar past tickets for a new query ticket with a
    #  script in which the constructor call would look like:

    my $clusterer = Algorithm::TicketClusterer->new( 
    
                         clustering_fieldname      => $fieldname_for_clustering,
                         unique_id_fieldname       => $unique_id_fieldname,
                         raw_tickets_db            => $raw_tickets_db,
                         processed_tickets_db      => $processed_tickets_db,
                         stemmed_tickets_db        => $stemmed_tickets_db,
                         inverted_index_db         => $inverted_index_db,
                         tickets_vocab_db          => $tickets_vocab_db,
                         idf_db                    => $idf_db,
                         tkt_doc_vecs_db           => $tkt_doc_vecs_db,
                         tkt_doc_vecs_normed_db    => $tkt_doc_vecs_normed_db,
                         min_idf_threshold         => 1.8,
                         how_many_retrievals       => 5,
                    );
    
    my $query_tkt = 1393548;
    $clusterer->restore_ticket_vectors_and_inverted_index();
    my %retrieved = %{$clusterer->retrieve_similar_tickets_with_vsm($query_tkt)};
    foreach my $tkt_id (sort {$retrieved{$b} <=> $retrieved{$a}} keys %retrieved) {
        $clusterer->show_original_ticket_for_given_id( $tkt_id );
    }

    #  Of all the parameters shown above in the constructor call, the
    #  parameter min_idf_threshold plays a large role in what tickets are
    #  returned by the retrieval function. The value of this parameter
    #  depends on the number of tickets in your Excel spreadsheet.  If the
    #  number of tickets is in the low hundreds, this parameter is likely to
    #  require a value of 1.5 to 1.8.  If the number of tickets is in the
    #  thousands, the value of this parameter is likely to be between 2 and
    #  3. See the writeup on this parameter in the API description in the
    #  rest of this documentation.


=head1 CHANGES

Version 1.01 of the module removes the platform dependency of the functions used for
reading the text files for stop words, misspelled words, etc.


=head1 DESCRIPTION

B<Algorithm::TicketClusterer> is a I<perl5> module for retrieving
previously processed Excel-stored tickets similar to a new ticket.  Routing
decisions made for the past similar tickets can be useful in expediting the
routing of a new ticket.

Tickets are commonly used in software services industry and customer
support businesses to record requests for service, product complaints,
user feedback, and so on.

With regard to the routing of a ticket, you would want each new ticket to
be handled by the tech support individual who is most qualified to address
the issue raised in the ticket.  Identifying the right individual for each
new ticket in real-time is no easy task for organizations that man large
service centers and helpdesks.  So if it were possible to quickly identify
the previously processed tickets that are most similar to a new ticket, one
could think of constructing semi-automated (or, perhaps, even fully
automated) ticket routers.

Identifying old tickets similar to a new ticket is made challenging by the
fact that folks who submit tickets often write them quickly and informally.
The informal style of writing means that different people may use different
colloquial terms to describe the same thing. And the quickness associated
with their submission causes the tickets to frequently contain spelling and
other errors such as conjoined words, fragmentation of long words, and so
on.

This module is an attempt at dealing with these challenges.

The problem of different people using different words to describe the same
thing is taken care of by using WordNet to add to each ticket a designated
number of synonyms for each word in the ticket.  The idea is that after all
the tickets are expanded in this manner, they would become grounded in a
common vocabulary. The synonym expansion of a ticket takes place only after
the negated phrases (that is, the words preceded by 'no' or 'not') are
replaced by their antonyms.

Obviously, expanding a ticket by synonyms makes sense only after it is
corrected for spelling and other errors.  What sort of errors one looks for
and corrects would, in general, depend on the application domain of the
tickets.  (It is not uncommon for engineering services to use jargon words
and acronyms that look like spelling errors to those not familiar with the
services.)  The module expects to see a file that is supplied through the
constructor parameter C<misspelled_words_file> that contains misspelled
words in the first column and their corrected versions in the second
column.  An example of such a file is included in the C<examples>
directory.  You would need to create your own version of such a file for
your application domain. Since conjuring up the misspellings that your
ticket submitters are likely to throw at you is futile, you might consider
using the following approach which I prefer to actually reading the tickets
for such errors: Turn on the debugging options in the constructor for some
initially collected spreadsheets and watch what sort of words the WordNet
is not able to supply any synonyms for.  In a large majority of cases,
these would be the misspelled words.

Expanding a ticket with synonyms is made complicated by the fact that some
common words have such a large number of synonyms that they can overwhelm
the relatively small number of words in a ticket.  Adding too many synonyms
in relation to the size of a ticket can not only distort the sense of the
ticket but it can also increase the computational cost of processing all
the tickets.

In order to deal with the pros and the cons of using synonyms, the present
module strikes a middle ground: You can specify how many synonyms to use

lib/Algorithm/TicketClusterer.pm  view on Meta::CPAN

My decision to use the simplest of the text models --- the Vector Space
Model --- was based of the work carried out by Shivani Rao at Purdue who
has demonstrated that the simpler models are more effective at retrieval
from software libraries than the more complex models. (See the paper by
Shivani Rao and Avinash Kak at the MSR'11 Conference.) Although tickets, in
general, are not the same as software libraries, I have a strong feeling
that Shivani's conclusions would extend to other domains as well.  Having
said that, it is important to mention that there remains the possibility
that automated ticket routing for some applications may respond better to
more elaborate text models.

The module uses three mechanisms to speed up the retrieval of tickets
similar to a query ticket: (1) It uses the inverted index for all the words
to construct for each query ticket a candidate pool of only those tickets
in the database that have words in common with the query ticket; (2) Only
those query-ticket words are used for retrieval whose
inverse-document-frequency values exceed a user-specified threshold; and
(3) The module uses stemming to reduce the variants of the same word to a
common root in order to limit the size of the vocabulary.  The stemming
used in the current module is rudimentary.  However, it would be easy to
plug into the module more powerful stemmers through their Perl interfaces.
Future versions of this module may do exactly that.



=head1 THE THREE STAGES OF PROCESSING

The tickets are processed in the following three stages:

=over

=item B<Ticket Preprocessing:>

This stage involves extracting the textual content of each ticket from the
Excel spreadsheet and subjecting it to the following steps: (1) deleting
markup; (2) dropping the stop words supplied through a file whose name is
provided as a value for the constructor parameter C<stop_words_file>; (3)
correcting spelling errors through the `bad-word good-word' entries in a
file whose name is supplied as a value for the constructor parameter
C<misspelled_words_file>; (4) replacing negated words with their antonyms;
and, finally, (5) adding synonyms.

=item B<Doc Modeling:>

Doc modeling consists of creating a Vector Space Model for the tickets
after they have been processed as described above.  VSM modeling involves
scanning the preprocessed tickets, stemming the words, and constructing a
vocabulary for all of the stemmed words in all the tickets.  Subsequently,
the alphabetized list of the vocabulary serves as a vector template for the
tickets. Each ticket is represented by a vector whose dimensionality equals
the size of the vocabulary; each element of this vector is an integer that
is the frequency of the vocabulary word corresponding to the index of the
element.  Doc modeling also involves calculating the inverse document
frequencies (IDF) values for the words and the inverted index for the
words.  The IDF values are used to diminish the importance of the words
that carry little discriminatory power vis-a-vis the tickets.  IDF for a
word is the logarithm of the ratio of the total number of tickets to the
number of tickets in which the word appears.  Obviously, if a word were to
appear in all the tickets, its IDF value would be zero.  The inverted index
entry for a word is the list of all the tickets that contain that word.
The inverted index greatly expedites the retrieval of tickets similar to a
given query ticket.

=item B<Similarity Based Retrieval:>

A query ticket is subject to the same preprocessing steps as all other
tickets.  Subsequently, it is also represented by a vector in the same
manner as the other tickets.  Using the stemmed words in the query ticket,
the inverted index is used to create a candidate list of ticket vectors for
matching with the query ticket vector.  For this, only those query words
are chosen whose IDF values exceed a threshold.  Finally, we compute the
cosine similarity distance between the query ticket vector and the ticket
vectors in the candidate list.  The matching ticket vectors are returned in
the order of decreasing similarity.

=back

=begin html

<br>

=end html

=head1 METHODS

The module provides the following methods for ticket preprocessing and for the
retrieval of tickets most similar to a given ticket:

=over

=item B<new()>

A call to C<new()> constructs a new instance of the C<Algorithm::TicketClusterer>
class:

    my $clusterer = Algorithm::TicketClusterer->new( 

                     excel_filename            => $excel_filename,
                     clustering_fieldname      => $fieldname_for_clustering,
                     unique_id_fieldname       => $unique_id_fieldname,
                     which_worksheet           => $which_worksheet,
                     raw_tickets_db            => $raw_tickets_db,
                     processed_tickets_db      => $processed_tickets_db,
                     stemmed_tickets_db        => $stemmed_tickets_db,
                     inverse_index_db          => $inverse_index_db,
                     tickets_vocab_db          => $tickets_vocab_db,
                     idf_db                    => $idf_db,
                     tkt_doc_vecs_db           => $tkt_doc_vecs_db,
                     tkt_doc_vecs_normed_db    => $tkt_doc_vecs_normed_db,
                     synset_cache_db           => $synset_cache_db,
                     stop_words_file           => $stop_words_file,
                     misspelled_words_file     => $misspelled_words_file,
                     add_synsets_to_tickets    => 1,
                     want_synset_caching       => 1,
                     min_idf_threshold         => 2.0,
                     max_num_syn_words         => 3,
                     min_word_length           => 4,
                     want_stemming             => 1,
                     how_many_retrievals       => 5,
                     debug1                    => 1,  # for processing, filtering Excel
                     debug2                    => 1,  # for doc modeling



( run in 2.720 seconds using v1.01-cache-2.11-cpan-97f6503c9c8 )