Algorithm-TicketClusterer


README

Identifying old tickets similar to a new ticket is made
challenging by the fact that people who submit tickets often
write them quickly and informally.  The informal style
means that different people may use different colloquial
terms to describe the same thing.  And the haste with which
tickets are submitted means that they frequently contain
spelling and other errors such as conjoined words,
fragmented long words, and so on.  This module is an attempt
at dealing with these challenges.  The problem of different
people using different words for the same thing is dealt
with by using WordNet to expand the tickets with synonyms,
which grounds the tickets in a common vocabulary.
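
As a rough illustration of what synonym expansion does to a
ticket, here is a minimal sketch.  It does not call WordNet;
the %toy_synonyms table and the expand_with_synonyms()
function are hypothetical stand-ins invented for this
example and are not part of the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WordNet lookups: word => list of synonyms.
my %toy_synonyms = (
    broken => [ 'damaged', 'faulty' ],
    screen => [ 'display', 'monitor' ],
);

# Append up to $max_syns synonyms after each word of a ticket.
sub expand_with_synonyms {
    my ($text, $synonyms, $max_syns) = @_;
    my @expanded;
    foreach my $word (split ' ', lc $text) {
        push @expanded, $word;
        my @syns = @{ $synonyms->{$word} || [] };
        # Keep only the first $max_syns synonyms.
        splice @syns, $max_syns if @syns > $max_syns;
        push @expanded, @syns;
    }
    return join ' ', @expanded;
}

print expand_with_synonyms('Screen broken', \%toy_synonyms, 1), "\n";
```

After expansion, a ticket that says "screen" and another that
says "display" share at least one word, which is what
grounding in a common vocabulary amounts to.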

This module requires the following three modules:

    Spreadsheet::ParseExcel
    Spreadsheet::XLSX
    WordNet::QueryData

the first for extracting information from the old-style

examples/ticket_preprocessor_and_doc_modeler.pl

                     min_word_length           => 4,
                     want_stemming             => 1,
                );

## Extract information from Excel spreadsheets:
$clusterer->get_tickets_from_excel();

## Apply cleanup filters and add synonyms:
$clusterer->delete_markup_from_all_tickets();
$clusterer->apply_filter_to_all_tickets();
$clusterer->expand_all_tickets_with_synonyms();
$clusterer->store_processed_tickets_on_disk();

## Construct the VSM doc model for the tickets:
$clusterer->get_ticket_vocabulary_and_construct_inverted_index();
$clusterer->construct_doc_vectors_for_all_tickets();
$clusterer->store_stemmed_tickets_and_inverted_index_on_disk();
$clusterer->store_ticket_vectors();

lib/Algorithm/TicketClusterer.pm

    push @all_antonyms, @all_adj_antonyms  if @all_adj_antonyms > 0;
    push @all_antonyms, @all_adv_antonyms  if @all_adv_antonyms > 0;
    my %antonym_set;
    foreach my $antonym (@all_antonyms) {
        $antonym_set{$antonym} = 1;
    }
    my @antonym_set = sort keys %antonym_set;
    return \@antonym_set;
}

sub expand_all_tickets_with_synonyms {
    my $self = shift;
    return unless $self->{_add_synsets_to_tickets};
    my $num_of_tickets = $self->{_total_num_tickets};
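    if ($self->{_want_synset_caching}) {
        # Restore a previously saved cache only if the file exists and is
        # non-empty.  Checking $@ inside this branch avoids reporting a
        # stale error left over from an earlier eval when no restore ran.
        my $cache_db = $self->{_synset_cache_db};
        if (defined $cache_db && -s $cache_db) {
            eval {
                $self->{_synset_cache} = retrieve( $cache_db );
            };
            if ($@) {
               print "Something went wrong with restoration of synset cache: $@";
            }
        }
    }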
    my $i = 1;
    foreach my $ticket_id (sort {$a <=> $b} keys %{$self->{_processed_tkts_by_ids}}) {
        $self->_expand_one_ticket_with_synonyms($ticket_id);
        print "Finished syn expansion of ticket $ticket_id ($i out of $num_of_tickets)\n";
        $i++;
    }
    if ($self->{_want_synset_caching}) {
        $self->{_synset_cache_db} = "synset_cache.db" unless $self->{_synset_cache_db};
        eval {                    
            store( $self->{_synset_cache}, $self->{_synset_cache_db} ); 
        };
        if ($@) {                                 
           die "Something went wrong with disk storage of synset cache: $@";
        }
    }
}

sub _expand_one_ticket_with_synonyms {
    my $self = shift;
    my $ticket_id = shift;
    print "\n\nEXPANDING TICKET $ticket_id WITH SYN-SETS:\n\n" 
                                              if $self->{_debug2};
    $self->_replace_negated_words_with_antonyms_one_ticket( $ticket_id );
    $self->_add_to_words_their_synonyms_one_ticket( $ticket_id );
}

sub _replace_negated_words_with_antonyms_one_ticket {
    my $self = shift;

lib/Algorithm/TicketClusterer.pm

                         min_word_length           => 4,
                         want_stemming             => 1,
                    );
    
    ## Extract information from Excel spreadsheets:
    $clusterer->get_tickets_from_excel();
    
    ## Apply cleanup filters and add synonyms:
    $clusterer->delete_markup_from_all_tickets();
    $clusterer->apply_filter_to_all_tickets();
    $clusterer->expand_all_tickets_with_synonyms();
    
    ## Construct the VSM doc model for the tickets:
    $clusterer->get_ticket_vocabulary_and_construct_inverted_index();
    $clusterer->construct_doc_vectors_for_all_tickets();

    #  Of the various constructor parameters shown above, the following two
    #  are critical to how information is extracted from an Excel
    #  spreadsheet: `clustering_fieldname' and `unique_id_fieldname'.  The
    #  first is the heading of the column that contains the textual content
    #  of the tickets.  The second is the heading of the column that

lib/Algorithm/TicketClusterer.pm

colloquial terms to describe the same thing. And the quickness associated
with their submission causes the tickets to frequently contain spelling and
other errors such as conjoined words, fragmentation of long words, and so
on.

This module is an attempt at dealing with these challenges.

The problem of different people using different words to describe the same
thing is taken care of by using WordNet to add to each ticket a designated
number of synonyms for each word in the ticket.  Once all the tickets have
been expanded in this manner, they are grounded in a common vocabulary.
The synonym expansion of a ticket takes place only after the negated
phrases (that is, the words preceded by 'no' or 'not') are replaced by
their antonyms.
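
The negation step can be pictured with the following minimal sketch.  It is
not the module's implementation: the C<%toy_antonyms> table and the function
C<replace_negated_words()> are hypothetical names made up for this example,
whereas the module obtains its antonyms from WordNet.

```perl
use strict;
use warnings;

# Hypothetical antonym table; the module itself consults WordNet.
my %toy_antonyms = (
    working   => 'broken',
    connected => 'disconnected',
);

# Replace "no X" / "not X" with the antonym of X when one is known;
# otherwise leave the phrase untouched.
sub replace_negated_words {
    my ($text, $antonyms) = @_;
    $text =~ s{\b(no|not)\s+(\w+)}
              { exists $antonyms->{lc $2} ? $antonyms->{lc $2} : "$1 $2" }gie;
    return $text;
}

print replace_negated_words('printer not working', \%toy_antonyms), "\n";
```

Doing this before synonym expansion matters: if 'working' were expanded
first, the ticket would gain synonyms whose meaning is the opposite of what
the submitter wrote.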

Obviously, expanding a ticket by synonyms makes sense only after it is
corrected for spelling and other errors.  What sort of errors one looks for
and corrects would, in general, depend on the application domain of the
tickets.  (It is not uncommon for engineering services to use jargon words
and acronyms that look like spelling errors to those not familiar with the
services.)  The module expects to see a file that is supplied through the
constructor parameter C<misspelled_words_file> that contains misspelled
words in the first column and their corrected versions in the second
column.  An example of such a file is included in the C<examples>
directory.  You would need to create your own version of such a file for
your application domain. Since conjuring up the misspellings that your

lib/Algorithm/TicketClusterer.pm

faster to load the database back into the runtime environment than to process a large
spreadsheet.

=item I<stemmed_tickets_db:>

As mentioned in the section B<THE THREE STAGES OF PROCESSING>, one of the first
things you do in the second stage of processing is to stem the words in the tickets.
Stemming is important because it reduces the size of the vocabulary.  To illustrate,
stemming would reduce both the words `programming' and `programmed' to the common
root `program'.  This module uses a very simple stemmer whose rules can be found in
the utility subroutine C<_simple_stemmer()>.  It would be trivial to expand on these
rules, or, for that matter, to use the Perl module C<Lingua::Stem::En> for a full
application of the Porter Stemming Algorithm.  The stemmed tickets are saved in a
database file whose name is supplied through this constructor parameter.
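
To make the `programming'/`programmed' example concrete, here is a minimal
suffix-stripping sketch.  The rules below are assumptions chosen for
illustration; they are not the rules in C<_simple_stemmer()>, and
C<toy_stem()> is a hypothetical name.

```perl
use strict;
use warnings;

# Toy suffix-stripping rules (hypothetical, NOT the module's _simple_stemmer()).
sub toy_stem {
    my $word = lc shift;
    # Strip a common suffix, keeping a stem of at least three letters.
    if ($word =~ s/^(\w{3,}?)(?:ing|ed|s)$/$1/) {
        # Collapse a doubled final consonant: "programm" -> "program".
        $word =~ s/([b-df-hj-np-tv-z])\1$/$1/;
    }
    return $word;
}

print join(' ', map { toy_stem($_) } qw(programming programmed programs)), "\n";
```

Rules this crude will mangle some words (for example, they strip the final
`s' from `class'), which is one reason to consider a full Porter stemmer
when vocabulary quality matters.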

=item I<stop_words_file:>

This constructor parameter is for naming the file that contains the stop words, these
being words you do not wish to be included in the vocabulary.  The format of this
file must be as shown in the sample file C<stop_words.txt> in the C<examples>
directory.

=item I<synset_cache_db:>

As mentioned in B<DESCRIPTION>, we expand each ticket with a certain number of
synonyms for the words in the ticket for the purpose of grounding all the tickets in
a common vocabulary.  This entails making calls to WordNet through its Perl interface
C<WordNet::QueryData>.  Since these calls can be expensive, you can vastly improve
the runtime performance of the module by caching the results returned by WordNet.
This constructor parameter is for naming the disk file in which the cache will be
stored.
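
The caching idea can be sketched as follows, using C<store> and C<retrieve>
from the core C<Storable> module, which are the calls that appear in the
module's source.  Everything else here is an assumption made for the example:
C<slow_synonym_lookup()> is a hypothetical stand-in for a WordNet query, and
the cache is written to a temporary directory.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

my $lookups = 0;

# Hypothetical expensive lookup standing in for a WordNet query.
sub slow_synonym_lookup {
    my $word = shift;
    $lookups++;
    return [ "$word-syn1", "$word-syn2" ];
}

# Consult the in-memory cache first; fall back to the slow lookup.
sub cached_synonyms {
    my ($cache, $word) = @_;
    $cache->{$word} //= slow_synonym_lookup($word);
    return $cache->{$word};
}

my $cache_file = tempdir(CLEANUP => 1) . '/synset_cache.db';

# First run: populate the cache and save it to disk.
my %cache;
cached_synonyms(\%cache, 'printer') for 1 .. 3;
store(\%cache, $cache_file);

# Later run: restore the cache, so no new lookup is needed.
my $restored = (-s $cache_file) ? retrieve($cache_file) : {};
cached_synonyms($restored, 'printer');
print "lookups performed: $lookups\n";
```

The slow lookup runs once for the word, no matter how many tickets contain
it or how many times the program is restarted with the saved cache.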

=item I<tickets_vocab_db:>

This parameter is for naming the DBM in which the ticket vocabulary is stored after

lib/Algorithm/TicketClusterer.pm


=item  B<display_tickets_vocab()>

    $clusterer->display_tickets_vocab();

This method displays the ticket vocabulary constructed by a call to
C<get_ticket_vocabulary_and_construct_inverted_index()>.  The vocabulary display
consists of an alphabetized list of the words in all the tickets along with the
frequency of each word.
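
For readers unfamiliar with the terms, the following minimal sketch shows
what a vocabulary with word frequencies and an inverted index look like for
two toy tickets.  The data and the function C<build_vocab_and_index()> are
hypothetical; the module's internal data structures may differ.

```perl
use strict;
use warnings;

# Toy tickets keyed by ID (hypothetical data).
my %tickets = (
    101 => 'printer broken printer offline',
    102 => 'screen broken',
);

# Build word frequencies and an inverted index (word => sorted ticket IDs).
sub build_vocab_and_index {
    my ($tickets) = @_;
    my (%freq, %seen);
    foreach my $id (keys %$tickets) {
        foreach my $word (split ' ', $tickets->{$id}) {
            $freq{$word}++;
            $seen{$word}{$id} = 1;
        }
    }
    my %inverted;
    foreach my $word (keys %seen) {
        $inverted{$word} = [ sort { $a <=> $b } keys %{ $seen{$word} } ];
    }
    return (\%freq, \%inverted);
}

# Display the vocabulary alphabetically, with frequency and ticket IDs.
my ($freq, $inverted) = build_vocab_and_index(\%tickets);
foreach my $word (sort keys %$freq) {
    printf "%-10s %d  (tickets: %s)\n",
        $word, $freq->{$word}, join(',', @{ $inverted->{$word} });
}
```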

=item  B<expand_all_tickets_with_synonyms()>

    $clusterer->expand_all_tickets_with_synonyms();

This is the final step in the preprocessing of the tickets before they are ready for
the doc modeling stage.  This method calls other functions internal to the module
that ultimately make calls to WordNet through the Perl interface provided by the
C<WordNet::QueryData> module.

=item B<get_tickets_from_excel():>

    $clusterer->get_tickets_from_excel();

t/test.t

## Test 2 (Check Clustering Data):

$tclusterer->get_tickets_from_excel();
my $clustering_data = $tclusterer->_raw_ticket_clustering_data_for_given_id(101);

ok( $clustering_data =~ /i am unable/, 'Able to extract the clustering field from Excel' );


## Test 3 (Check Synset Extraction from WordNet):

$tclusterer->expand_all_tickets_with_synonyms();

ok( -s "t/__test_synset_cache_db" > 20, 'Able to extract synsets from WordNet' );

unlink glob "t/__test_*";


