Algorithm-TicketClusterer

examples/retrieve_similar_tickets.pl view on Meta::CPAN


print "\nDisplaying the tickets considered most similar to the query ticket $ticket_num\n\n";

my %retrieved_hash = %{$retrieved_hash_ref};
my $rank = 1;
foreach my $ticket_id (sort { $retrieved_hash{$b} <=> $retrieved_hash{$a} } 
                                                          keys %retrieved_hash) {
    my $similarity_score = $retrieved_hash{$ticket_id};
    print "\n\n\n --------- Retrieved ticket at similarity rank $rank   (similarity score: $similarity_score) ---------\n";
    $clusterer->show_processed_ticket_clustering_data_for_given_id( $ticket_id );    
    $clusterer->show_original_ticket_for_given_id( $ticket_id );
    $rank++;
}

lib/Algorithm/TicketClusterer.pm view on Meta::CPAN

    }
    if ($self->{_debug1}) {
        my $num_of_tickets = @{$self->{_all_tickets}};
        my $num_entries_check_hash = keys %check_hash;
        print "Number of tickets: $num_of_tickets\n";
        print "Number of keys in check hash: $num_entries_check_hash\n";
    }
    return \@duplicates;
}

sub show_original_ticket_for_given_id {
    my $self = shift;
    my $id = shift;
    print "\n\nDisplaying the fields for the ticket $id:\n\n";
    foreach my $ticket (@{$self->{_all_tickets}}) {
        if ( $ticket->{$self->{_unique_id_fieldname}} == $id) {
            foreach my $key (sort keys %{$ticket}) {
                my $value = $ticket->{$key};
                $value =~ s/^\s+//;
                $value =~ s/\s+$//;
                printf("%20s  ==>  %s\n", $key, $value);

lib/Algorithm/TicketClusterer.pm view on Meta::CPAN

        my $outstring = sprintf("%30s     %f", $_,$self->{_idf_t}->{$_});
        print "$outstring\n";
    }
}

# The following subroutine is useful for diagnostic purposes.  It
# lists the number of tickets that a word appears in and also lists
# the tickets.  Be careful when interpreting its results: if you
# invoke this subroutine after the synsets have been added to the
# tickets, you may find words attributed to tickets that do not
# actually contain them in the original Excel sheet.
sub list_processed_tickets_for_a_word {
    my $self = shift;
    while (my $word = <STDIN>) {    #enter ctrl-D to exit the loop
        chomp $word;
        my @ticket_list;
        foreach my $ticket_id (sort {$a <=> $b} keys %{$self->{_processed_tkts_by_ids}}) {
            my $record = $self->{_processed_tkts_by_ids}->{$ticket_id};
            push @ticket_list, $ticket_id if $record =~ /\b\Q$word\E\b/i;   # \Q..\E: match the typed word literally
        }
        my $num = @ticket_list;

lib/Algorithm/TicketClusterer.pm view on Meta::CPAN

                         tkt_doc_vecs_db           => $tkt_doc_vecs_db,
                         tkt_doc_vecs_normed_db    => $tkt_doc_vecs_normed_db,
                         min_idf_threshold         => 1.8,
                         how_many_retrievals       => 5,
                    );
    
    my $query_tkt = 1393548;
    $clusterer->restore_ticket_vectors_and_inverted_index();
    my %retrieved = %{$clusterer->retrieve_similar_tickets_with_vsm($query_tkt)};
    foreach my $tkt_id (sort {$retrieved{$b} <=> $retrieved{$a}} keys %retrieved) {
        $clusterer->show_original_ticket_for_given_id( $tkt_id );
    }

    #  Of all the parameters shown above in the constructor call, the
    #  parameter min_idf_threshold plays a large role in what tickets are
    #  returned by the retrieval function. The value of this parameter
    #  depends on the number of tickets in your Excel spreadsheet.  If the
    #  number of tickets is in the low hundreds, this parameter is likely to
    #  require a value of 1.5 to 1.8.  If the number of tickets is in the
    #  thousands, the value of this parameter is likely to be between 2 and
    #  3. See the writeup on this parameter in the API description in the

lib/Algorithm/TicketClusterer.pm view on Meta::CPAN

    my $retrieved_hash_ref = $clusterer->retrieve_similar_tickets_with_vsm( $ticket_num )

This method retrieves the tickets that are most similar to a query ticket.  It
first uses the inverted index to construct a candidate list of the tickets that
share words with the query ticket.  Only words whose IDF values exceed
C<min_idf_threshold> play a role in this step.  The query ticket vector is then
matched against each of the ticket vectors in the candidate list.  The method
returns a reference to a hash whose keys are the IDs of the tickets that match the
query ticket and whose values are the corresponding cosine similarity scores.
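
The two-stage retrieval described above can be illustrated with a small, self-contained sketch.  It is not the module's actual implementation: the toy ticket vectors, the IDF values, and the names C<retrieve_similar> and C<cosine_similarity> are all hypothetical, and the vectors hold raw term counts as a simplifying assumption.

```perl
use strict;
use warnings;

# Hypothetical toy data: ticket ID => term-count vector.
my %doc_vecs = (
    101 => { printer => 2, jam    => 1, paper => 1 },
    102 => { printer => 1, driver => 2 },
    103 => { network => 2, outage => 1 },
    104 => { paper   => 1, jam    => 2, tray  => 1 },
);

# Hypothetical IDF values; a real system derives these from the corpus.
my %idf = (
    printer => 1.2, jam    => 2.1, paper => 2.0, driver => 2.4,
    network => 2.6, outage => 2.6, tray  => 2.5,
);

# Inverted index: word => list of IDs of the tickets containing it.
my %inverted_index;
for my $id (keys %doc_vecs) {
    push @{ $inverted_index{$_} }, $id for keys %{ $doc_vecs{$id} };
}

# Cosine similarity between two sparse vectors stored as hash refs.
sub cosine_similarity {
    my ($u, $v) = @_;
    my ($dot, $sq_u, $sq_v) = (0, 0, 0);
    $dot  += $u->{$_} * ($v->{$_} // 0) for keys %$u;
    $sq_u += $_ ** 2 for values %$u;
    $sq_v += $_ ** 2 for values %$v;
    return 0 unless $sq_u && $sq_v;
    return $dot / ( sqrt($sq_u) * sqrt($sq_v) );
}

# Stage 1: collect candidates through query words above the IDF threshold.
# Stage 2: score each candidate against the query vector.
sub retrieve_similar {
    my ($query_id, $min_idf_threshold) = @_;
    my $query_vec = $doc_vecs{$query_id};
    my %candidates;
    for my $word (keys %$query_vec) {
        next unless ($idf{$word} // 0) > $min_idf_threshold;
        $candidates{$_} = 1
            for grep { $_ != $query_id } @{ $inverted_index{$word} };
    }
    my %scores = map { $_ => cosine_similarity($query_vec, $doc_vecs{$_}) }
                 keys %candidates;
    return \%scores;
}

my $retrieved = retrieve_similar(101, 1.8);
for my $id (sort { $retrieved->{$b} <=> $retrieved->{$a} } keys %$retrieved) {
    printf "%d  %.3f\n", $id, $retrieved->{$id};
}
```

With a threshold of 1.8, the word "printer" (IDF 1.2 in this toy data) is ignored when building the candidate list, so ticket 102 is never scored even though it shares that word with the query.  Lowering the threshold admits more candidates at the cost of matching on less discriminating words.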

=item B<show_original_ticket_for_given_id()>

    $clusterer->show_original_ticket_for_given_id( $ticket_num )

The argument to the method is the unique integer ID of a ticket for which
you want to see all the fields as stored in the Excel spreadsheet.

=item B<show_raw_ticket_clustering_data_for_given_id()>

While the previous method shows all the fields for a ticket, this method
shows only the textual content --- the content you want to use for
establishing similarity between a query ticket and the other tickets.
