Algorithm-TicketClusterer


examples/misspelled_words.txt

re-submitt resubmit
resubmittion resubmission
retify rectify
retreiv retreive
screeshots screenshots
scrrenshot screenshot
scedule schedule
schecul schedule
sceen schedule
scedule schedule
selectin selection
septemeber september
september september
septmeber september
servce service
shhets sheets
shoing showing
solv solve
stat state
storey story
stopp stop
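
The excerpt above suggests a simple layout: one misspelling per line, followed by its correction, separated by whitespace. A minimal sketch of how such a list could be loaded and applied is shown here. The helper name C<correct_words> and the inline sample are hypothetical, used only for illustration, and this is not necessarily how the module itself reads the file.

```perl
use strict;
use warnings;

# A few entries copied from the excerpt; a real script would read the file.
my $sample = <<'END';
retify rectify
servce service
shoing showing
END

# Build a lookup table: misspelling => correction.
my %correction;
for my $line (split /\n/, $sample) {
    my ($wrong, $right) = split ' ', $line;
    next unless defined $right;          # skip blank or malformed lines
    $correction{$wrong} = $right;
}

# Hypothetical helper: replace each known misspelling in a string.
# Words are lowercased for the lookup; unknown words pass through unchanged.
sub correct_words {
    my ($text) = @_;
    return join ' ', map { $correction{lc $_} // $_ } split ' ', $text;
}

print correct_words("please retify the servce outage"), "\n";
# prints "please rectify the service outage"
```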

lib/Algorithm/TicketClusterer.pm

the relatively small number of words in a ticket.  Adding too many synonyms
relative to the size of a ticket can not only distort the sense of the
ticket but also increase the computational cost of processing all the
tickets.

To balance the pros and cons of using synonyms, the present module strikes
a middle ground: You can specify how many synonyms to use for a word
(assuming that the number of synonyms supplied by WordNet is larger than
the number specified).  This allows you to experiment with retrieval
precision by altering the number of synonyms used.  The retained synonyms
are selected randomly from those supplied by WordNet.  (A smarter way to
select synonyms would be to base the selection on context.  For example,
you would not want to use the synonym `programmer' for the noun `developer'
if your application domain is real estate.  However, such context-dependent
selection of synonyms would take us into the realm of ontologies, which I
have chosen to stay away from in this first version of the module.)
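
The random selection described above can be sketched as follows. This is only an illustration under stated assumptions: the function name C<pick_synonyms> is hypothetical, and the synonym list is a made-up stand-in rather than actual WordNet output.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: keep at most $max synonyms, chosen at random.
# If fewer than $max are available, all of them are returned.
sub pick_synonyms {
    my ($synonyms, $max) = @_;
    return @$synonyms if @$synonyms <= $max;
    my @shuffled = shuffle @$synonyms;
    return @shuffled[0 .. $max - 1];
}

# Stand-in list for illustration only; not taken from WordNet.
my @candidates = qw(alpha beta gamma delta epsilon);
my @kept = pick_synonyms(\@candidates, 2);
print "kept: @kept\n";
```

Raising or lowering the second argument corresponds to the experiment the text describes: changing the number of synonyms used and observing the effect on retrieval precision.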

Another issue related to the overall run-time performance of this module is
the computational cost of the calls to WordNet through its Perl interface
C<WordNet::QueryData>.  This module uses what I call I<synset caching>
to make these lookups as efficient as possible.  The
result of each WordNet lookup is cached in a database file whose name you
supply through the constructor option C<synset_cache_db>.  If you are doing
a good job of catching spelling errors, the module will carry out a
decreasing number of WordNet lookups as the tickets are scanned for
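
The general idea of a disk-backed lookup cache can be sketched as below. This is a sketch under assumptions, not the module's actual implementation: it uses the core C<Storable> module for persistence, the file name is hypothetical, and C<slow_lookup> is a stub standing in for a real query through C<WordNet::QueryData>.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Hypothetical cache file; the module takes its own name via synset_cache_db.
my $cache_file = 'synset_cache_sketch.db';

# Load a previously saved cache, or start with an empty one.
my $cache = -e $cache_file ? retrieve($cache_file) : {};

my $lookups = 0;   # counts how often the expensive lookup actually runs

# Stub for the expensive lookup; returns made-up data for illustration.
sub slow_lookup {
    my ($word) = @_;
    $lookups++;
    return [ "$word-syn1", "$word-syn2" ];
}

# Return cached synonyms when available; otherwise look up and remember.
sub cached_synonyms {
    my ($word) = @_;
    $cache->{$word} = slow_lookup($word) unless exists $cache->{$word};
    return $cache->{$word};
}

cached_synonyms($_) for qw(schedule service schedule schedule);
print "lookups performed: $lookups\n";

store($cache, $cache_file);   # persist the cache for the next run
```

Repeated words trigger only one lookup, and a second run of the script performs none for words already stored in the file. A misspelled word would get its own cache entry, which is why correcting spelling first keeps the number of lookups down.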


