Algorithm-TicketClusterer
view release on metacpan or search on metacpan
lib/Algorithm/TicketClusterer.pm view on Meta::CPAN
spreadsheet that contains the textual data for the tickets. For example, if the
column heading for the textual content of the tickets is `Description', you must
supply this string as the value for the parameter C<clustering_fieldname>.
=item I<debug1:>
When this parameter is set, the module prints out information regarding what columns
of the spreadsheet it is extracting information from, the headers for those columns,
the index of the column that contains the textual content of the tickets, and of the
column that contains the unique integer identifier for each ticket. If you are
dealing with spreadsheets with a large number of tickets, it is best to pipe the
output of the module into a file to see the debugging information.
=item I<debug2:>
When this parameter is set, you will see how WordNet is being utilized to generate
word synonyms. This debugging output is also useful to see the extent of misspellings
in the tickets. If WordNet is unable to find the synonyms for a word, chances are
that the word is not spelled correctly (or that it is a jargon word or a jargon
acronym).
lib/Algorithm/TicketClusterer.pm view on Meta::CPAN
$clusterer->store_ticket_vectors()
As the name implies, this call stores the vectors, both regular and normalized, in a
database file on the disk.
=back
=head1 HOW THE MATCHING TICKETS ARE RETRIEVED
It is the method C<retrieve_similar_tickets_with_vsm()> that returns the best ticket
matches for a given query ticket. What this method returns is a hash reference; the
keys in this hash are the integer IDs of the matching tickets and the values the
cosine similarity distance between the query ticket and the matching tickets. The
number of matching tickets returned by C<retrieve_similar_tickets_with_vsm()> is set
by the constructor parameter C<how_many_retrievals>. Note that
C<retrieve_similar_tickets_with_vsm()> takes a single argument, which is the integer
ID of the query ticket.
=head1 THE C<examples> DIRECTORY
lib/Algorithm/TicketClusterer.pm view on Meta::CPAN
this module.
=head1 HOW YOU CAN TURN THIS MODULE INTO A PRODUCTION-QUALITY TOOL
By a production-quality tool, I mean a software package that you can I<actually> use
in a production environment for automated or semi-automated ticket routing in your
organization. I am assuming you already have the tools in place that insert in
real-time the new tickets in an Excel spreadsheet.
Turning this module into a production tool will require that you find the best values
to use for the following three parameters that are needed by the constructor: (1)
C<min_idf_threshold> for the minimum C<idf> value for the words in a query ticket in
order for them to be considered for matching with the other tickets; (2)
C<min_word_length> for discarding words that are too short; and (3)
C<max_num_syn_words> for how many synonyms to retain for a word if the number of
synonyms returned by WordNet is too large. In addition, you must also come up with a
misspelled-words file that is appropriate to your application domain and a stop-words
file.
In order to find the best values to use for the parameters that are mentioned above,
I suggest creating a graphical front-end for this module that would allow for
altering the values of the three parameters listed above in response to the
prevailing mis-routing rates for the tickets. The front-end will display to an
operator the latest ticket that needs to be routed and a small set of the
best-matching previously routed tickets as returned by this module. Used either in a
fully-automated mode or a semi-automated mode, this front-end would contain a
feedback recorder that would keep track of mis-routed tickets --- the mis-routed
tickets would presumably bounce back to the central operator monitoring the
front-end. The front-end display could be equipped with slider controls for altering
the values used for the three parameters. Obviously, as a parameter is changed, some
of the database files stored on the disk would need to be recomputed. The same would
be the case if you make changes to the misspelled-words file or to the stop-words
file.
=head1 REQUIRED
( run in 0.796 second using v1.01-cache-2.11-cpan-4e96b696675 )