Algorithm-TicketClusterer
# thousands, the value of this parameter is likely to be between 2 and
# 3. See the writeup on this parameter in the API description in the
# rest of this documentation.
=head1 CHANGES
Version 1.01 of the module removes the platform dependency of the functions used for
reading the text files for stop words, misspelled words, etc.
=head1 DESCRIPTION
B<Algorithm::TicketClusterer> is a I<perl5> module for retrieving
previously processed Excel-stored tickets similar to a new ticket. Routing
decisions made for the past similar tickets can be useful in expediting the
routing of a new ticket.
Tickets are commonly used in the software services industry and in customer
support businesses to record requests for service, product complaints,
user feedback, and so on.
With regard to the routing of a ticket, you would want each new ticket to
be handled by the tech support individual who is most qualified to address
the issue raised in the ticket. Identifying the right individual for each
new ticket in real time is no easy task for organizations that man large
service centers and helpdesks. So if it were possible to quickly identify
the previously processed tickets that are most similar to a new ticket, one
could think of constructing semi-automated (or, perhaps, even fully
automated) ticket routers.
Identifying old tickets similar to a new ticket is made challenging by the
fact that folks who submit tickets often write them quickly and informally.
The informal style of writing means that different people may use different
colloquial terms to describe the same thing. And the quickness associated
with their submission causes the tickets to frequently contain spelling and
other errors such as conjoined words, fragmentation of long words, and so
on.
This module is an attempt at dealing with these challenges.
The problem of different people using different words to describe the same
thing is taken care of by using WordNet to add to each ticket a designated
number of synonyms for each word in the ticket. The idea is that after all
the tickets are expanded in this manner, they would become grounded in a
common vocabulary. The synonym expansion of a ticket takes place only after
the negated phrases (that is, the words preceded by 'no' or 'not') are
replaced by their antonyms.
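The ordering described above (negations first, synonyms second) can be sketched as follows. This is only an illustration: the two lookup tables are hypothetical stand-ins for WordNet, and the function names C<replace_negations> and C<expand_with_synonyms> are made up for the example and are not part of the module's API.

```perl
use strict;
use warnings;

# Hypothetical toy tables standing in for WordNet lookups.
my %antonym_of  = ( working => 'broken', connected => 'disconnected' );
my %synonyms_of = (
    broken  => [ 'faulty', 'defective' ],
    printer => ['printing_machine'],
);

# Replace "no X" / "not X" by the antonym of X when one is known;
# otherwise leave the words untouched.
sub replace_negations {
    my @words = @_;
    my @out;
    while (@words) {
        my $w = shift @words;
        if (   ( $w eq 'no' || $w eq 'not' )
            && @words
            && exists $antonym_of{ $words[0] } )
        {
            my $negated = shift @words;
            push @out, $antonym_of{$negated};
        }
        else {
            push @out, $w;
        }
    }
    return @out;
}

# Follow each word by whatever synonyms the table has for it.
sub expand_with_synonyms {
    my @words = @_;
    return map { ( $_, @{ $synonyms_of{$_} || [] } ) } @words;
}

my @ticket   = qw(printer not working);
my @expanded = expand_with_synonyms( replace_negations(@ticket) );
print "@expanded\n";
```

Running the expansion before the negation step would instead add synonyms of `working' to a ticket that reports something as not working, which is why the order matters.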
Obviously, expanding a ticket by synonyms makes sense only after it is
corrected for spelling and other errors. What sort of errors one looks for
and corrects would, in general, depend on the application domain of the
tickets. (It is not uncommon for engineering services to use jargon words
and acronyms that look like spelling errors to those not familiar with the
services.) The module expects to see a file that is supplied through the
constructor parameter C<misspelled_words_file> that contains misspelled
words in the first column and their corrected versions in the second
column. An example of such a file is included in the C<examples>
directory. You would need to create your own version of such a file for
your application domain. Trying to guess in advance the misspellings that
your ticket submitters are likely to throw at you is futile. The approach I
prefer, over actually reading the tickets for such errors, is to turn on the
debugging options in the constructor for some initially collected
spreadsheets and watch for the words for which WordNet is not able to
supply any synonyms. In a large majority of cases, these would be the
misspelled words.
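To make the two-column idea concrete, here is a minimal sketch of reading such a file into a hash and applying it to the words of a ticket. It assumes the columns are separated by whitespace, which may not match the exact layout of the file in the C<examples> directory; the function name C<read_corrections> and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Read "misspelled corrected" pairs, one pair per line, from a filehandle.
# Lines that do not have at least two fields are skipped.
sub read_corrections {
    my ($fh) = @_;
    my %correction_for;
    while ( my $line = <$fh> ) {
        my ( $wrong, $right ) = split ' ', $line;
        next unless defined $right;
        $correction_for{ lc $wrong } = $right;
    }
    return \%correction_for;
}

# An in-memory file keeps the sketch self-contained; in practice you
# would open the file named by misspelled_words_file.
my $sample = "recieve  receive\nservr    server\n\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $fix = read_corrections($fh);
close $fh;

my @ticket = qw(servr did not recieve update);
my @clean  = map { exists $fix->{ lc $_ } ? $fix->{ lc $_ } : $_ } @ticket;
print "@clean\n";
```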
Expanding a ticket with synonyms is made complicated by the fact that some
common words have such a large number of synonyms that they can overwhelm
the relatively small number of words in a ticket. Adding too many synonyms
in relation to the size of a ticket can not only distort the sense of the
ticket but it can also increase the computational cost of processing all
the tickets.
In order to deal with the pros and the cons of using synonyms, the present
module strikes a middle ground: You can specify how many synonyms to use
for a word (assuming that the number of synonyms supplied by WordNet is
larger than the number specified). This allows you to experiment with
retrieval precision by altering the number of synonyms used. The retained
synonyms are selected randomly from those supplied by WordNet. (A smarter
way to select synonyms would be to base them on the context. For example,
you would not want to use the synonym `programmer' for the noun `developer'
if your application domain is real-estate. However, such context-dependent
selection of synonyms would take us into the realm of ontologies that I
have chosen to stay away from in this first version of the module.)
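The random selection of a bounded number of synonyms can be sketched with C<shuffle> from the core C<List::Util> module. The function name C<pick_synonyms> and the candidate list are hypothetical; the module's own selection code may differ.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Return at most $max synonyms chosen at random from @candidates.
# If there are no more than $max candidates, all of them are returned.
sub pick_synonyms {
    my ( $max, @candidates ) = @_;
    return @candidates if @candidates <= $max;
    my @shuffled = shuffle @candidates;
    return @shuffled[ 0 .. $max - 1 ];
}

# Made-up synonym list for the noun 'developer'.
my @all    = qw(programmer coder builder contractor architect);
my @chosen = pick_synonyms( 2, @all );
print "@chosen\n";
```

Because the choice is random, two runs over the same tickets can retain different synonyms, which is worth keeping in mind when comparing retrieval precision across runs.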
Another issue related to the overall run-time performance of this module is
the computational cost of the calls to WordNet through its Perl interface
C<WordNet::QueryData>. This module uses what I have referred to as
I<synset caching> to make this process as efficient as possible. The
result of each WordNet lookup is cached in a database file whose name you
supply through the constructor option C<synset_cache_db>. If you are doing
a good job of catching spelling errors, the module will carry out a
decreasing number of WordNet lookups as the tickets are scanned for
expansion with synonyms. In an experiment with a spreadsheet that
contained over 1400 real tickets, the last several hundred resulted in
hardly any calls to WordNet.
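The effect of synset caching can be illustrated with a hash tied to a disk file. The sketch below uses the core C<SDBM_File> module as one possible backing store, which is an assumption and not necessarily what the module uses internally; C<slow_lookup> is a hypothetical stand-in for a call to WordNet, and the cache file name is made up.

```perl
use strict;
use warnings;
use Fcntl;                     # exports O_RDWR and O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);

my $dir      = tempdir( CLEANUP => 1 );
my $cache_db = "$dir/synset_cache";      # hypothetical cache file name

tie my %cache, 'SDBM_File', $cache_db, O_RDWR | O_CREAT, 0640
    or die "cannot tie $cache_db: $!";

# Stand-in for an expensive WordNet query; counts how often it is called.
my $lookups = 0;
sub slow_lookup {
    my ($word) = @_;
    $lookups++;
    my %toy = ( server => ['host'], crash => [ 'collapse', 'wreck' ] );
    return @{ $toy{$word} || [] };
}

# Consult the on-disk cache first and fall back to the slow lookup
# only for words that have not been seen before.
sub cached_synonyms {
    my ($word) = @_;
    unless ( exists $cache{$word} ) {
        $cache{$word} = join ' ', slow_lookup($word);
    }
    return split ' ', $cache{$word};
}

cached_synonyms($_) for qw(server crash server server crash);
print "slow lookups made: $lookups\n";

untie %cache;
```

Five requests result in only two slow lookups here, one per distinct word, which mirrors the observation above that later tickets trigger few WordNet calls.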
As currently programmed, the synset cache is deleted and then created
afresh at every call to the function that extracts information from an
Excel spreadsheet. You would want to change this behavior of the module if
you are planning to use it in a production environment where the different
spreadsheets are likely to deal with the same application domain. To give
greater persistence to the synset cache, comment out the C<unlink
$self->{_synset_cache_db}> line in the method C<get_tickets_from_excel()>.
After a few updates of the synset cache, the module would almost never need
to make direct calls to WordNet, which would enhance the speed of the
module even further.
The textual content of the tickets, as produced by the preprocessing steps,
is used for document modeling, and the doc model thus created is used
subsequently for retrieving similar tickets. The doc modeling is carried
out using the Vector Space Model (VSM) in which each ticket is represented
by a vector whose size equals the size of the vocabulary used in all the
tickets and whose elements represent the word frequencies in the
ticket. After such a model is constructed, a query ticket is compared with
the other tickets on the basis of the cosine similarity between
the corresponding vectors.
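A minimal sketch of this kind of retrieval, using raw word counts as the vector elements, is shown below. The ticket texts, the ticket ids, and the function names C<to_vector> and C<cosine> are made up for the illustration and are not taken from the module.

```perl
use strict;
use warnings;

# Toy tickets, already reduced to lists of words.
my %tickets = (
    t1 => [qw(printer broken paper jam)],
    t2 => [qw(server crash reboot)],
    t3 => [qw(printer paper tray empty)],
);

# The vocabulary is the set of all words used in all the tickets.
my %seen;
$seen{$_}++ for map { @$_ } values %tickets;
my @vocab = sort keys %seen;

# One element per vocabulary word, holding that word's frequency.
sub to_vector {
    my ($words) = @_;
    my %count;
    $count{$_}++ for @$words;
    return [ map { $count{$_} || 0 } @vocab ];
}

# Cosine of the angle between two equal-length vectors.
sub cosine {
    my ( $u, $v ) = @_;
    my ( $dot, $nu, $nv ) = ( 0, 0, 0 );
    for my $i ( 0 .. $#$u ) {
        $dot += $u->[$i] * $v->[$i];
        $nu  += $u->[$i]**2;
        $nv  += $v->[$i]**2;
    }
    return 0 if $nu == 0 || $nv == 0;
    return $dot / sqrt( $nu * $nv );
}

my $query = to_vector( [qw(printer paper jam)] );
my %score;
for my $id ( keys %tickets ) {
    $score{$id} = cosine( $query, to_vector( $tickets{$id} ) );
}
my @ranked = sort { $score{$b} <=> $score{$a} } keys %score;
printf "%s: %.3f\n", $_, $score{$_} for @ranked;
```

Query words that do not occur in any stored ticket are simply ignored by C<to_vector>, since the vector has elements only for the vocabulary words.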
My decision to use the simplest of the text models, the Vector Space
Model, was based on the work carried out by Shivani Rao at Purdue who