Algorithm-TicketClusterer
examples/retrieve_similar_tickets.pl
print "\nDisplaying the tickets considered most similar to the query ticket $ticket_num\n\n";
my %retrieved_hash = %{$retrieved_hash_ref};
my $rank = 1;
foreach my $ticket_id (sort { $retrieved_hash{$b} <=> $retrieved_hash{$a} }
keys %retrieved_hash) {
my $similarity_score = $retrieved_hash{$ticket_id};
print "\n\n\n --------- Retrieved ticket at similarity rank $rank (similarity score: $similarity_score) ---------\n";
$clusterer->show_processed_ticket_clustering_data_for_given_id( $ticket_id );
$clusterer->show_original_ticket_for_given_id( $ticket_id );
$rank++;
}
lib/Algorithm/TicketClusterer.pm
}
if ($self->{_debug1}) {
my $num_of_tickets = @{$self->{_all_tickets}};
my $num_entries_check_hash = keys %check_hash;
print "Number of tickets: $num_of_tickets\n";
print "Number of keys in check hash: $num_entries_check_hash\n";
}
return \@duplicates;
}
sub show_original_ticket_for_given_id {
my $self = shift;
my $id = shift;
print "\n\nDisplaying the fields for the ticket $id:\n\n";
foreach my $ticket (@{$self->{_all_tickets}}) {
if ( $ticket->{$self->{_unique_id_fieldname}} == $id) {
foreach my $key (sort keys %{$ticket}) {
my $value = $ticket->{$key};
$value =~ s/^\s+//;
$value =~ s/\s+$//;
printf("%20s ==> %s\n", $key, $value);
lib/Algorithm/TicketClusterer.pm
my $outstring = sprintf("%30s %f", $_,$self->{_idf_t}->{$_});
print "$outstring\n";
}
}
# The following subroutine is useful for diagnostic purposes. It
# lists the number of tickets that a word appears in and also lists
# the tickets. But be careful in interpreting its results. Note
# that if you invoke this subroutine after the synsets have been added
# to the tickets, you may find words being attributed to tickets
# that do not actually contain them in the original Excel sheet.
sub list_processed_tickets_for_a_word {
my $self = shift;
while (my $word = <STDIN>) { #enter ctrl-D to exit the loop
chomp $word;
my @ticket_list;
foreach my $ticket_id (sort {$a <=> $b} keys %{$self->{_processed_tkts_by_ids}}) {
my $record = $self->{_processed_tkts_by_ids}->{$ticket_id};
# \Q...\E quotes the user-supplied word so regex metacharacters match literally
push @ticket_list, $ticket_id if $record =~ /\b\Q$word\E\b/i;
}
my $num = @ticket_list;
lib/Algorithm/TicketClusterer.pm
tkt_doc_vecs_db => $tkt_doc_vecs_db,
tkt_doc_vecs_normed_db => $tkt_doc_vecs_normed_db,
min_idf_threshold => 1.8,
how_many_retrievals => 5,
);
my $query_tkt = 1393548;
$clusterer->restore_ticket_vectors_and_inverted_index();
my %retrieved = %{$clusterer->retrieve_similar_tickets_with_vsm($query_tkt)};
foreach my $tkt_id (sort {$retrieved{$b} <=> $retrieved{$a}} keys %retrieved) {
$clusterer->show_original_ticket_for_given_id( $tkt_id );
}
# Of all the parameters shown above in the constructor call, the
# parameter min_idf_threshold plays a large role in what tickets are
# returned by the retrieval function. The value of this parameter
# depends on the number of tickets in your Excel spreadsheet. If the
# number of tickets is in the low hundreds, this parameter is likely to
# require a value of 1.5 to 1.8. If the number of tickets is in the
# thousands, the value of this parameter is likely to be between 2 and
# 3. See the writeup on this parameter in the API description in the
lib/Algorithm/TicketClusterer.pm
my $retrieved_hash_ref = $clusterer->retrieve_similar_tickets_with_vsm( $ticket_num )
This method retrieves the tickets that are most similar to a query ticket. It first
uses the inverted index to build a candidate list of the tickets that share words with
the query ticket; only words whose IDF values exceed C<min_idf_threshold> are considered
at this stage. The query ticket vector is then matched against each ticket vector in
the candidate list. The method returns a reference to a hash whose keys are the IDs of
the tickets that match the query ticket and whose values are the corresponding cosine
similarity scores.
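The two stages described above can be illustrated with a minimal, self-contained sketch. This is not the module's actual implementation: the toy term vectors, the IDF values, and the helper names C<cosine_similarity> and C<retrieve_similar> are all hypothetical and exist only to show how an IDF-filtered inverted-index lookup can feed a cosine-similarity ranking.

```perl
use strict;
use warnings;

# Hypothetical toy data: ticket ID => { word => weight } term vectors.
my %doc_vecs = (
    101 => { printer => 2, jam    => 1, paper => 1 },
    102 => { printer => 1, driver => 2 },
    103 => { network => 3, outage => 1 },
);

# Hypothetical IDF values (assumed, not computed from real tickets).
my %idf = (
    printer => 1.9, jam     => 2.4, paper  => 2.1,
    driver  => 2.2, network => 2.0, outage => 2.5,
);

# Inverted index: word => list of ticket IDs that contain it.
my %inverted_index;
for my $id (keys %doc_vecs) {
    push @{ $inverted_index{$_} }, $id for keys %{ $doc_vecs{$id} };
}

# Cosine similarity between two sparse vectors stored as hash refs.
sub cosine_similarity {
    my ($u, $v) = @_;
    my ($dot, $norm_u, $norm_v) = (0, 0, 0);
    $dot    += $u->{$_} * ($v->{$_} // 0) for keys %$u;
    $norm_u += $_ ** 2 for values %$u;
    $norm_v += $_ ** 2 for values %$v;
    return 0 unless $norm_u && $norm_v;
    return $dot / sqrt($norm_u * $norm_v);
}

# Stage 1: collect candidates through the inverted index, using only
# query words whose IDF exceeds the threshold.
# Stage 2: score each candidate against the query and keep the best ones.
sub retrieve_similar {
    my ($query_vec, $min_idf_threshold, $how_many) = @_;
    my %candidates;
    for my $word (keys %$query_vec) {
        next unless ($idf{$word} // 0) > $min_idf_threshold;
        $candidates{$_} = 1 for @{ $inverted_index{$word} // [] };
    }
    my %scores = map { $_ => cosine_similarity($query_vec, $doc_vecs{$_}) }
                 keys %candidates;
    my @ranked = sort { $scores{$b} <=> $scores{$a} } keys %scores;
    splice(@ranked, $how_many) if @ranked > $how_many;
    return +{ map { $_ => $scores{$_} } @ranked };
}

my $retrieved = retrieve_similar({ printer => 1, jam => 1 }, 1.8, 5);
printf "%s => %.3f\n", $_, $retrieved->{$_}
    for sort { $retrieved->{$b} <=> $retrieved->{$a} } keys %$retrieved;
```

In this toy run, ticket 103 never becomes a candidate because it shares no above-threshold word with the query, so no similarity is computed for it. Raising the threshold above 1.9 would also drop "printer" from candidate selection, which leaves only ticket 101 (reached through "jam").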
=item B<show_original_ticket_for_given_id()>
$clusterer->show_original_ticket_for_given_id( $ticket_num )
The argument to the method is the unique integer ID of a ticket for which
you want to see all the fields as stored in the Excel spreadsheet.
=item B<show_raw_ticket_clustering_data_for_given_id()>
While the previous method shows all the fields for a ticket, this method
shows only the textual content --- the content you want to use for
establishing similarity between a query ticket and the other tickets.