AI-Categorizer
lib/AI/Categorizer.pm
=head2 Feature Vectors
Most categorization algorithms don't deal directly with documents'
data; instead, they deal with a I<vector representation> of a
document's I<features>. The features may be any properties of the
document that seem helpful for determining its category, but they are usually
some version of the "most important" words in the document. A list of
features and their weights in each document is encapsulated by the
C<AI::Categorizer::FeatureVector> class. You may think of this class
as roughly analogous to a Perl hash, where the keys are the names of
features and the values are their weights.
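As a rough sketch of that analogy, a feature vector can be built and
inspected by hand; this assumes the C<new(features =E<gt> \%hash)>
constructor, while C<as_hash()> is the accessor used elsewhere in this
distribution:

  use AI::Categorizer::FeatureVector;

  # Feature names map to weights, much like a Perl hash
  my $fv = AI::Categorizer::FeatureVector->new
    ( features => { curling => 3, sheep => 1 } );

  my $weights = $fv->as_hash;   # { curling => 3, sheep => 1 }
  print "$_: $weights->{$_}\n" for keys %$weights;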
=head2 Hypotheses
The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of
"statistical guess" of what categories this document should be
assigned to. Since you may be interested in any of several pieces of
information about the hypothesis (for instance, which categories were
assigned, which category was the single most likely category, the
scores assigned to each category, etc.), the hypothesis is returned as
an C<AI::Categorizer::Hypothesis> object whose methods can answer each
of these questions.
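A minimal sketch of querying a hypothesis, assuming a trained
C<$learner> and a C<$document> to classify; C<best_category()> appears
in this distribution's test suite, while C<categories()> is assumed
here to return the list of assigned category names:

  my $hypothesis = $learner->categorize($document);

  # The single most likely category
  print "Best: ", $hypothesis->best_category, "\n";

  # All assigned categories (accessor name assumed)
  print "Assigned: ", join(', ', $hypothesis->categories), "\n";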
lib/AI/Categorizer/Document.pm
sub _weigh_tokens {
  my ($self, $tokens, $weight) = @_;

  my %counts;
  if (my $b = 0+$self->{front_bias}) {
    # Nonzero front_bias: weight each token by its position.  Positive
    # values favor tokens near the front of the document, negative
    # values favor tokens near the end.
    die "'front_bias' value must be between -1 and 1"
      unless -1 < $b and $b < 1;

    my $n = @$tokens;
    my $r = ($b-1)**2 / ($b+1);
    # Scale factor chosen so the total weight over the document stays
    # roughly equal to the unbiased case
    my $mult = $weight * log($r)/($r-1);

    my $i = 0;
    foreach my $feature (@$tokens) {
      # Geometric decay (or growth) with position $i
      $counts{$feature} += $mult * $r**($i/$n);
      $i++;
    }
  } else {
    # No front bias: every occurrence contributes the same weight
    foreach my $feature (@$tokens) {
      $counts{$feature} += $weight;
    }
  }

  return \%counts;
}
lib/AI/Categorizer/Experiment.pm
Adds a new result to the experiment. Please see the
C<Statistics::Contingency> documentation for a description of this
method.
=item add_hypothesis($hypothesis, $correct_categories)
Adds a new result to the experiment. The first argument is an
C<AI::Categorizer::Hypothesis> object such as one generated by a
Learner's C<categorize()> method. The list of correct categories can
be given as an array of category names (strings), as a hash whose keys
are the category names and whose values are anything logically true,
or as a single string if there is only one category. For example, all
of the following are legal:
$e->add_hypothesis($h, "sports");
$e->add_hypothesis($h, ["sports", "finance"]);
$e->add_hypothesis($h, {sports => 1, finance => 1});
=back
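Put together, a typical evaluation loop might look like the sketch
below.  C<new(categories =E<gt> ...)> and C<stats_table()> are assumed
here from the C<Statistics::Contingency> base class, and
C<@all_categories>, C<@test_documents>, and C<%true_categories> (a map
from document names to their correct categories) are hypothetical
stand-ins:

  use AI::Categorizer::Experiment;

  my $e = AI::Categorizer::Experiment->new( categories => \@all_categories );

  foreach my $doc (@test_documents) {
    my $h = $learner->categorize($doc);
    $e->add_hypothesis( $h, $true_categories{ $doc->name } );
  }

  # Precision/recall/F1 summary inherited from Statistics::Contingency
  print $e->stats_table;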
=head1 AUTHOR
lib/AI/Categorizer/FeatureSelector.pm
=back
The second character specifies the "collection frequency" component, which
can take the following values:
=over 4
=item f
Inverse document frequency - multiply term C<t>'s value by C<log(N/n)>,
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.
=item p
Probabilistic inverse document frequency - multiply term C<t>'s value
by C<log((N-n)/n)> (same variable meanings as above).  A short numeric
sketch of both variants follows this list.
=item x
No change - multiply by 1.
=back
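For concreteness, here is the numeric sketch promised above, for a
hypothetical collection of C<N = 10> documents in which a term occurs
in C<n = 2> of them:

  # N = documents in the collection, n = documents containing the term
  my ($N, $n) = (10, 2);

  my $idf  = log($N / $n);          # 'f': log(10/2) is about 1.61
  my $pidf = log(($N - $n) / $n);   # 'p': log(8/2)  is about 1.39

  # A raw term weight of 3 becomes:
  print 3 * $idf,  "\n";            # about 4.83
  print 3 * $pidf, "\n";            # about 4.16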
The third character specifies the "normalization" component, which
can take the following values:
lib/AI/Categorizer/KnowledgeSet.pm
# This could be made more efficient by figuring out an execution
# plan in advance
my $self = shift;
if ( $self->{term_weighting} =~ /^(t|x)$/ ) {
  # 't' (raw term frequency) or 'x' (no change): nothing to do
} elsif ( $self->{term_weighting} eq 'l' ) {
  # 'l': logarithmic term frequency, 1 + log(tf)
  foreach my $doc ($self->documents) {
    my $f = $doc->features->as_hash;
    $_ = 1 + log($_) foreach values %$f;
  }
} elsif ( $self->{term_weighting} eq 'n' ) {
  # 'n': normalized term frequency, 0.5 + 0.5 * tf / max_tf
  foreach my $doc ($self->documents) {
    my $f = $doc->features->as_hash;
    my $max_tf = AI::Categorizer::Util::max values %$f;
    $_ = 0.5 + 0.5 * $_ / $max_tf foreach values %$f;
  }
} elsif ( $self->{term_weighting} eq 'b' ) {
  # 'b': boolean (presence/absence) term weighting
  foreach my $doc ($self->documents) {
    my $f = $doc->features->as_hash;
lib/AI/Categorizer/KnowledgeSet.pm
}
if ($self->{collection_weighting} eq 'x') {
  # 'x': no collection (document-frequency) weighting
} elsif ($self->{collection_weighting} =~ /^(f|p)$/) {
  # 'f': multiply by log(N/n); 'p': multiply by log((N-n)/n), which
  # equals log(N/n - 1), hence the subtrahend below.  N is the number
  # of documents, n is the document frequency of the term.
  my $subtrahend = ($1 eq 'f' ? 0 : 1);
  my $num_docs = $self->documents;     # number of documents (N)
  $self->document_frequency('foo');    # Initialize the doc-frequency vector
  foreach my $doc ($self->documents) {
    my $f = $doc->features->as_hash;
    $f->{$_} *= log($num_docs / $self->{doc_freq_vector}{$_} - $subtrahend) foreach keys %$f;
  }
} else {
  die "collection_weighting must be one of 'x', 'f', or 'p'";
}
if ( $self->{normalize_weighting} eq 'x' ) {
  # 'x': no normalization
} elsif ( $self->{normalize_weighting} eq 'c' ) {
  # 'c': normalize each document's feature vector
  $_->features->normalize foreach $self->documents;
} else {
lib/AI/Categorizer/KnowledgeSet.pm
=back
The second character specifies the "collection frequency" component, which
can take the following values:
=over 4
=item f
Inverse document frequency - multiply term C<t>'s value by C<log(N/n)>,
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.
=item p
Probabilistic inverse document frequency - multiply term C<t>'s value
by C<log((N-n)/n)> (same variable meanings as above).
=item x
No change - multiply by 1.
=back
The third character specifies the "normalization" component, which
can take the following values:
lib/AI/Categorizer/Learner/NaiveBayes.pm
For example, if there are 5,000
total tokens (words) in the "sports" training documents and 200 of
them are the word "curling", then C<P(curling|sports) = 200/5000 =
0.04>.  If there are 10,000 total tokens in the training corpus and
5,000 of them are in documents belonging to the category "sports",
then C<P(sports) = 5,000/10,000 = 0.5>.
Because the probabilities involved are often very small and we
multiply many of them together, the result is often a tiny tiny
number. This could pose problems of floating-point underflow, so
instead of working with the actual probabilities we work with the
logarithms of the probabilities. This also speeds up various
calculations in the C<categorize()> method.
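As a sketch of the idea (not the module's internal code; C<$p_sports>,
C<%p_word_given_sports>, and C<@document_tokens> are hypothetical
probability tables and token lists), the per-category score becomes a
sum of logarithms instead of a product of probabilities:

  # log( P(sports) * prod P(word|sports) )
  #   = log P(sports) + sum log P(word|sports)   -- no underflow
  my $log_score = log($p_sports);                 # e.g. log(0.5)
  foreach my $word (@document_tokens) {
    $log_score += log( $p_word_given_sports{$word} );
  }
  # Compare $log_score across categories; the largest wins.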
=head1 TO DO
More work on the confidence scores - right now the winning category
tends to dominate the scores overwhelmingly, when the scores should
probably be more evenly distributed.
=head1 AUTHOR
t/01-naive_bayes.t
while (my ($name, $data) = each %docs) {
  $c->knowledge_set->make_document(name => $name, %$data);
}
$c->knowledge_set->finish;
# Make sure collection_weighting is working
ok $c->knowledge_set->document_frequency('vampires'), 2;
for ('vampires', 'mirrors') {
  ok ($c->knowledge_set->document('doc4')->features->as_hash->{$_},
      log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
     );
}
$c->learner->train( knowledge_set => $c->knowledge_set );
ok $c->learner;
my $doc = new AI::Categorizer::Document
  ( name => 'test1',
    content => 'I would like to begin farming sheep.' );
ok $c->learner->categorize($doc)->best_category, 'farming';