Algorithm-TicketClusterer

README

Tickets are commonly used in the software services industry and in
customer support businesses to record requests for service, product
complaints, user feedback, and so on.

Identifying old tickets similar to a new ticket is made
challenging by the fact that folks who submit tickets often
write them quickly and informally.  The informal style of
writing means that different people may use different
colloquial terms to describe the same thing. And the
quickness associated with their submission causes the
tickets to frequently contain spelling and other errors such
as conjoined words, fragmentation of long words, and so on.
This module is an attempt at dealing with these challenges.
The fact that different people may use different words for the same
thing is dealt with by using WordNet to expand the tickets with
synonyms, in order to ground the tickets in a common vocabulary.

This module requires the following three modules:

    Spreadsheet::ParseExcel
    Spreadsheet::XLSX
    WordNet::QueryData

examples/misspelled_words.txt

disabl disable
discusse discussed
distruption disruption
earliist earliest
effor effort
efforet effort
employess employees
enabl enable
enabe enable
erquired required
erro error
execut execute
exteremely extremely
extemely extremely
extremity extremely
extremly extremely
finical financial
finis finish
followng following
generateed generated
handicapp handicap

lib/Algorithm/TicketClusterer.pm

    unlink glob "$self->{_idf_db}.*";
    my $filename = $self->{_excel_filename} || die("Excel file required");
    my $clustering_fieldname = $self->{_clustering_fieldname} 
      || die("\nYou forgot to specify a value for the constructor parameter clustering_fieldname that points to the data to be clustered in your Excel sheet -- ");
    my $unique_id_fieldname = $self->{_unique_id_fieldname} 
      || die("\nYou forgot to specify a value for the constructor parameter unique_id_fieldname that is a unique integer identifier for the rows of your Excel sheet -- ");
    my $workbook;
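    # Pick the parser from the file suffix: Spreadsheet::ParseExcel
    # reads the older .xls format, Spreadsheet::XLSX the newer .xlsx.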
    if ($filename =~ /\.xls$/) {
        my $parser = Spreadsheet::ParseExcel->new();
        $workbook = $parser->parse($filename);
        die $parser->error() unless defined $workbook;
    } elsif ($filename =~ /\.xlsx$/) {
        use Text::Iconv;
        my $converter = Text::Iconv->new("utf-8", "windows-1251");
        $workbook = Spreadsheet::XLSX->new($filename, $converter);
    } else {
        die "File suffix on the Excel file not recognized";
    }
    my @worksheets = $workbook->worksheets();
    my $which_worksheet = $self->{_which_worksheet} || 
        die "\nYou have not specified which Excel worksheet contains the tickets\n";

lib/Algorithm/TicketClusterer.pm

Tickets are commonly used in the software services industry and customer
support businesses to record requests for service, product complaints,
user feedback, and so on.

=head1 SYNOPSIS

    use Algorithm::TicketClusterer;

    #  Extract the tickets from the Excel spreadsheet and subject the
    #  textual content of the tickets to various preprocessing and doc
    #  modeling steps.  The preprocessing steps consist of removing markup,
    #  dropping the words in a stop list, correcting spelling errors,
    #  detecting the need for antonyms, and, finally, adding word synonyms
    #  to the tickets in order to ground the tickets in a common
    #  vocabulary. The doc modeling steps consist of fitting a standard
    #  vector space model to the tickets.

    my $clusterer = Algorithm::TicketClusterer->new( 
    
                         excel_filename            => $excel_filename,
                         clustering_fieldname      => $fieldname_for_clustering,
                         which_worksheet           => $which_worksheet,

lib/Algorithm/TicketClusterer.pm

service centers and helpdesks.  So if it were possible to quickly identify
the previously processed tickets that are most similar to a new ticket, one
could think of constructing semi-automated (or, perhaps, even fully
automated) ticket routers.

Identifying old tickets similar to a new ticket is made challenging by the
fact that folks who submit tickets often write them quickly and informally.
The informal style of writing means that different people may use different
colloquial terms to describe the same thing. And the quickness associated
with their submission causes the tickets to frequently contain spelling and
other errors such as conjoined words, fragmentation of long words, and so
on.

This module is an attempt at dealing with these challenges.

The problem of different people using different words to describe the same
thing is taken care of by using WordNet to add to each ticket a designated
number of synonyms for each word in the ticket.  The idea is that after all
the tickets are expanded in this manner, they would become grounded in a
common vocabulary. The synonym expansion of a ticket takes place only after
the negated phrases (that is, the words preceded by 'no' or 'not') are
replaced by their antonyms.
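
Here is a minimal sketch of what these lookups can look like through
C<WordNet::QueryData>; the helper C<get_synonyms> and the cap on the
number of synonyms returned are illustrative assumptions, not this
module's internals:

    use strict;
    use warnings;
    use WordNet::QueryData;

    my $wn = WordNet::QueryData->new;    # assumes WordNet is installed

    # Illustrative helper, not part of this module: collect up to $max
    # synonyms for a word from the first sense of each part of speech.
    sub get_synonyms {
        my ($word, $max) = @_;
        my %syns;
        for my $pos_form ($wn->querySense($word)) {      # e.g. "error#n"
            my @senses = $wn->querySense($pos_form);     # "error#n#1", ...
            next unless @senses;
            for my $syn ($wn->querySense($senses[0], "syns")) {
                (my $w = $syn) =~ s/#.*//;               # strip "#pos#sense"
                $syns{lc $w} = 1 unless lc $w eq lc $word;
            }
        }
        my @list = sort keys %syns;
        splice @list, $max if @list > $max;
        return @list;
    }

    # Antonyms, needed for the negated phrases, come from a word relation:
    my @antonyms = $wn->queryWord("good#a#1", "ants");   # e.g. "bad#a#1"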

Obviously, expanding a ticket by synonyms makes sense only after it is
corrected for spelling and other errors.  What sort of errors one looks for
and corrects would, in general, depend on the application domain of the
tickets.  (It is not uncommon for engineering services to use jargon words
and acronyms that look like spelling errors to those not familiar with the
services.)  The module expects to see a file that is supplied through the
constructor parameter C<misspelled_words_file> that contains misspelled
words in the first column and their corrected versions in the second
column.  An example of such a file is included in the C<examples>
directory.  You would need to create your own version of such a file for
your application domain. Since conjuring up the misspellings that your
ticket submitters are likely to throw at you is futile, you might consider
using the following approach which I prefer to actually reading the tickets
for such errors: Turn on the debugging options in the constructor for some
initially collected spreadsheets and watch what sort of words WordNet
is not able to supply any synonyms for.  In a large majority of cases,
these would be the misspelled words.

Expanding a ticket with synonyms is made complicated by the fact that some
common words have such a large number of synonyms that they can overwhelm
the relatively small number of words in a ticket.  Adding too many synonyms
in relation to the size of a ticket can not only distort the sense of the
ticket but it can also increase the computational cost of processing all
the tickets.

lib/Algorithm/TicketClusterer.pm

if your application domain is real-estate.  However, such context-dependent
selection of synonyms would take us into the realm of ontologies that I
have chosen to stay away from in this first version of the module.)

Another issue related to the overall run-time performance of this module is
the computational cost of the calls to WordNet through its Perl interface
C<WordNet::QueryData>.  This module uses what I have referred to as
I<synset caching> to make this process as efficient as possible.  The
result of each WordNet lookup is cached in a database file whose name you
supply through the constructor option C<synset_cache_db>.  If you are doing
a good job of catching spelling errors, the module will carry out a
decreasing number of WordNet lookups as the tickets are scanned for
expansion with synonyms.  In an experiment with a spreadsheet that
contained over 1400 real tickets, the last several hundred resulted in
hardly any calls to WordNet.
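
A minimal sketch of this kind of caching, assuming a tied DBM file (the
module's actual on-disk format for C<synset_cache_db> may differ):

    use strict;
    use warnings;
    use Fcntl;
    use SDBM_File;

    my $cache_file = "my_synset_cache";   # stand-in for synset_cache_db
    tie my %synset_cache, 'SDBM_File', $cache_file, O_RDWR | O_CREAT, 0640
        or die "Cannot tie synset cache: $!";

    sub cached_synonyms {
        my ($word) = @_;
        # A hit avoids a WordNet lookup; empty results are cached too,
        # so a frequently seen misspelling costs only one lookup.
        return split ' ', $synset_cache{$word} if exists $synset_cache{$word};
        my @syns = get_synonyms($word, 5);    # as in the earlier sketch
        $synset_cache{$word} = "@syns";
        return @syns;
    }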

As currently programmed, the synset cache is deleted and then created
afresh at every call to the function that extracts information from an
Excel spreadsheet. You would want to change this behavior of the module if
you are planning to use it in a production environment where the different
spreadsheets are likely to deal with the same application domain.  To give

lib/Algorithm/TicketClusterer.pm

The tickets are processed in the following three stages:

=over

=item B<Ticket Preprocessing:>

This stage involves extracting the textual content of each ticket from the
Excel spreadsheet and subjecting it to the following steps: (1) deleting
markup; (2) dropping the stop words supplied through a file whose name is
provided as a value for the constructor parameter C<stop_words_file>; (3)
correcting spelling errors through the `bad-word good-word' entries in a
file whose name is supplied as a value for the constructor parameter
C<misspelled_words_file>; (4) replacing negated words with their antonyms;
and, finally, (5) adding synonyms.

=item B<Doc Modeling:>

Doc modeling consists of creating a Vector Space Model for the tickets
after they have been processed as described above.  VSM modeling involves
scanning the preprocessed tickets, stemming the words, and constructing a
vocabulary for all of the stemmed words in all the tickets.  Subsequently,

lib/Algorithm/TicketClusterer.pm


=item I<min_word_length:> 

This parameter sets the minimum number of characters a word must contain in order
to be included in ticket processing.
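
For example, with a minimum word length of 4, a filtering step along the following
lines (a sketch, not the module's code) drops all the shorter words:

    my @ticket_words = qw( the os fails to boot );
    my $min_word_length = 4;
    @ticket_words = grep { length($_) >= $min_word_length } @ticket_words;
    # leaves: fails boot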

=item I<misspelled_words_file:>

The extent to which you can improve ticket retrieval precision with the addition
of synonyms depends on the degree to which you can make corrections on the fly for
the spelling errors that occur frequently in tickets.  That fact makes the file you
supply through this constructor parameter very important.  For the current version of
the module, this file must contain exactly two columns, with the first entry in each
row the misspelled word and the second entry the correctly spelled word.  See this
file in the C<examples> directory for how to format it.
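
A file in this format can be read into a correction lookup with a few lines of
Perl (a sketch; the module's own parsing may differ):

    my $misspelled_words_file = "misspelled_words.txt";
    open my $fh, "<", $misspelled_words_file
        or die "Cannot open $misspelled_words_file: $!";
    my %spelling_fix;
    while (<$fh>) {
        next if /^\s*$/;               # skip blank lines
        my ($bad, $good) = split;      # two whitespace-separated columns
        $spelling_fix{$bad} = $good;
    }
    close $fh;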

=item I<processed_tickets_db:>

As mentioned earlier in B<DESCRIPTION>, the tickets must be subject to various
preprocessing steps before they can be used for document modeling for the purpose of
retrieval. Preprocessing consists of stop words removal, spelling corrections,

lib/Algorithm/TicketClusterer.pm



=item  B<apply_filter_to_all_tickets()>

    $clusterer->apply_filter_to_all_tickets()

The filtering consists of dropping words from the tickets that are in your stop-list
file, fixing spelling errors using the `bad-word good-word' pairs in your spelling
errors file, and deleting short words.
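
Here is an illustrative sketch of the three steps on one ticket's words, with
tiny stand-ins for the contents of the stop-list and spelling-errors files:

    my %stop_words   = map { $_ => 1 } qw( the is a of );  # from stop_words_file
    my %spelling_fix = ( erro => "error", disabl => "disable" );
    my $min_word_length = 4;

    my @ticket_words = qw( the printer erro is disabl );
    my @filtered;
    for my $word (@ticket_words) {
        next if $stop_words{$word};                 # drop stop words
        $word = $spelling_fix{$word} // $word;      # fix known misspellings
        next if length($word) < $min_word_length;   # drop short words
        push @filtered, $word;
    }
    # @filtered is now: printer error disable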

=item  B<construct_doc_vectors_for_all_tickets()>

    $clusterer->construct_doc_vectors_for_all_tickets()

This method is used in the doc modeling stage of the computations.  As stated
earlier, doc modeling of the tickets consists of representing each ticket by a vector
whose size equals that of the vocabulary and whose elements represent the frequencies
of the corresponding words in the ticket.  In addition to calculating the doc
vectors, this method also constructs a normalized version of the doc vectors.  The


