AI-Categorizer
lib/AI/Categorizer.pm
=head2 Feature Vectors
Most categorization algorithms don't deal directly with documents'
data; instead, they deal with a I<vector representation> of a
document's I<features>. The features may be any properties of the
document that seem helpful for determining its category, but they are usually
some version of the "most important" words in the document. A list of
features and their weights in each document is encapsulated by the
C<AI::Categorizer::FeatureVector> class. You may think of this class
as roughly analogous to a Perl hash, where the keys are the names of
features and the values are their weights.
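As a rough sketch of that analogy, a feature vector can be built and
inspected by hand; this assumes the C<new(features =E<gt> \%hash)>
constructor, while C<as_hash()> is the accessor used elsewhere in this
distribution:

  use AI::Categorizer::FeatureVector;

  # Feature names map to weights, much like a Perl hash
  my $fv = AI::Categorizer::FeatureVector->new
    ( features => { curling => 3, sheep => 1 } );

  my $weights = $fv->as_hash;   # { curling => 3, sheep => 1 }
  print "$_: $weights->{$_}\n" for keys %$weights;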
=head2 Hypotheses
The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of
"statistical guess" of what categories this document should be
assigned to. Since you may be interested in any of several pieces of
information about the hypothesis (for instance, which categories were
assigned, which category was the single most likely category, the
scores assigned to each category, etc.), the hypothesis is returned as
an C<AI::Categorizer::Hypothesis> object whose methods can answer each
of these questions.
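A minimal sketch of querying a hypothesis, assuming a trained
C<$learner> and a C<$document> to classify; C<best_category()> appears
in this distribution's test suite, while C<categories()> is assumed
here to return the list of assigned category names:

  my $hypothesis = $learner->categorize($document);

  # The single most likely category
  print "Best: ", $hypothesis->best_category, "\n";

  # All assigned categories (accessor name assumed)
  print "Assigned: ", join(', ', $hypothesis->categories), "\n";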
lib/AI/Categorizer/Document.pm
sub _weigh_tokens {
  my ($self, $tokens, $weight) = @_;

  my %counts;
  if (my $b = 0+$self->{front_bias}) {
    # Nonzero front_bias: weight each token by its position.  Positive
    # values favor tokens near the front of the document, negative
    # values favor tokens near the end.
    die "'front_bias' value must be between -1 and 1"
      unless -1 < $b and $b < 1;

    my $n = @$tokens;
    my $r = ($b-1)**2 / ($b+1);
    # Scale factor chosen so the total weight over the document stays
    # roughly equal to the unbiased case
    my $mult = $weight * log($r)/($r-1);

    my $i = 0;
    foreach my $feature (@$tokens) {
      # Geometric decay (or growth) with position $i
      $counts{$feature} += $mult * $r**($i/$n);
      $i++;
    }
  } else {
    # No front bias: every occurrence contributes the same weight
    foreach my $feature (@$tokens) {
      $counts{$feature} += $weight;
    }
  }

  return \%counts;
}
lib/AI/Categorizer/Experiment.pm
Adds a new result to the experiment. Please see the
C<Statistics::Contingency> documentation for a description of this
method.
=item add_hypothesis($hypothesis, $correct_categories)
Adds a new result to the experiment. The first argument is an
C<AI::Categorizer::Hypothesis> object such as one generated by a
Learner's C<categorize()> method. The list of correct categories can
be given as an array of category names (strings), as a hash whose keys
are the category names and whose values are anything logically true,
or as a single string if there is only one category. For example, all
of the following are legal:
$e->add_hypothesis($h, "sports");
$e->add_hypothesis($h, ["sports", "finance"]);
$e->add_hypothesis($h, {sports => 1, finance => 1});
=back
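Put together, a typical evaluation loop might look like the sketch
below.  C<new(categories =E<gt> ...)> and C<stats_table()> are assumed
here from the C<Statistics::Contingency> base class, and
C<@all_categories>, C<@test_documents>, and C<%true_categories> (a map
from document names to their correct categories) are hypothetical
stand-ins:

  use AI::Categorizer::Experiment;

  my $e = AI::Categorizer::Experiment->new( categories => \@all_categories );

  foreach my $doc (@test_documents) {
    my $h = $learner->categorize($doc);
    $e->add_hypothesis( $h, $true_categories{ $doc->name } );
  }

  # Precision/recall/F1 summary inherited from Statistics::Contingency
  print $e->stats_table;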
=head1 AUTHOR
lib/AI/Categorizer/FeatureSelector.pm
=back
The second character specifies the "collection frequency" component, which
can take the following values:
=over 4
=item f
Inverse document frequency - multiply term C<t>'s value by C<log(N/n)>,
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.
=item p
Probabilistic inverse document frequency - multiply term C<t>'s value
by C<log((N-n)/n)> (same variable meanings as above).  A short numeric
sketch of both variants follows this list.
=item x
No change - multiply by 1.
=back
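For concreteness, here is the numeric sketch promised above, for a
hypothetical collection of C<N = 10> documents in which a term occurs
in C<n = 2> of them:

  # N = documents in the collection, n = documents containing the term
  my ($N, $n) = (10, 2);

  my $idf  = log($N / $n);          # 'f': log(10/2) is about 1.61
  my $pidf = log(($N - $n) / $n);   # 'p': log(8/2)  is about 1.39

  # A raw term weight of 3 becomes:
  print 3 * $idf,  "\n";            # about 4.83
  print 3 * $pidf, "\n";            # about 4.16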
The third character specifies the "normalization" component, which
can take the following values:
lib/AI/Categorizer/KnowledgeSet.pm
# This could be made more efficient by figuring out an execution
# plan in advance
my $self = shift;
if ( $self->{term_weighting} =~ /^(t|x)$/ ) {
  # 't' (raw term frequency) or 'x' (no change): nothing to do
} elsif ( $self->{term_weighting} eq 'l' ) {
  # 'l': logarithmic term frequency, 1 + log(tf)
  foreach my $doc ($self->documents) {
    my $f = $doc->features->as_hash;
    $_ = 1 + log($_) foreach values %$f;
  }
} elsif ( $self->{term_weighting} eq 'n' ) {
  # 'n': normalized term frequency, 0.5 + 0.5 * tf / max_tf
  foreach my $doc ($self->documents) {
    my $f = $doc->features->as_hash;
    my $max_tf = AI::Categorizer::Util::max values %$f;
    $_ = 0.5 + 0.5 * $_ / $max_tf foreach values %$f;
  }
} elsif ( $self->{term_weighting} eq 'b' ) {
  # 'b': boolean (presence/absence) term weighting
  foreach my $doc ($self->documents) {
    my $f = $doc->features->as_hash;
lib/AI/Categorizer/KnowledgeSet.pm
}
if ($self->{collection_weighting} eq 'x') {
  # 'x': no collection (document-frequency) weighting
} elsif ($self->{collection_weighting} =~ /^(f|p)$/) {
  # 'f': multiply by log(N/n); 'p': multiply by log((N-n)/n), which
  # equals log(N/n - 1), hence the subtrahend below.  N is the number
  # of documents, n is the document frequency of the term.
  my $subtrahend = ($1 eq 'f' ? 0 : 1);
  my $num_docs = $self->documents;     # number of documents (N)
  $self->document_frequency('foo');    # Initialize the doc-frequency vector
  foreach my $doc ($self->documents) {
    my $f = $doc->features->as_hash;
    $f->{$_} *= log($num_docs / $self->{doc_freq_vector}{$_} - $subtrahend) foreach keys %$f;
  }
} else {
  die "collection_weighting must be one of 'x', 'f', or 'p'";
}
if ( $self->{normalize_weighting} eq 'x' ) {
  # 'x': no normalization
} elsif ( $self->{normalize_weighting} eq 'c' ) {
  # 'c': normalize each document's feature vector
  $_->features->normalize foreach $self->documents;
} else {
lib/AI/Categorizer/KnowledgeSet.pm
=back
The second character specifies the "collection frequency" component, which
can take the following values:
=over 4
=item f
Inverse document frequency - multiply term C<t>'s value by C<log(N/n)>,
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.
=item p
Probabilistic inverse document frequency - multiply term C<t>'s value
by C<log((N-n)/n)> (same variable meanings as above).
=item x
No change - multiply by 1.
=back
The third character specifies the "normalization" component, which
can take the following values:
lib/AI/Categorizer/Learner/NaiveBayes.pm
For example, if there are 5,000
total tokens (words) in the "sports" training documents and 200 of
them are the word "curling", then C<P(curling|sports) = 200/5000 =
0.04>.  If there are 10,000 total tokens in the training corpus and
5,000 of them are in documents belonging to the category "sports",
then C<P(sports) = 5,000/10,000 = 0.5>.
Because the probabilities involved are often very small and we
multiply many of them together, the result is often a tiny tiny
number. This could pose problems of floating-point underflow, so
instead of working with the actual probabilities we work with the
logarithms of the probabilities. This also speeds up various
calculations in the C<categorize()> method.
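As a sketch of the idea (not the module's internal code; C<$p_sports>,
C<%p_word_given_sports>, and C<@document_tokens> are hypothetical
probability tables and token lists), the per-category score becomes a
sum of logarithms instead of a product of probabilities:

  # log( P(sports) * prod P(word|sports) )
  #   = log P(sports) + sum log P(word|sports)   -- no underflow
  my $log_score = log($p_sports);                 # e.g. log(0.5)
  foreach my $word (@document_tokens) {
    $log_score += log( $p_word_given_sports{$word} );
  }
  # Compare $log_score across categories; the largest wins.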
=head1 TO DO
More work on the confidence scores - right now the winning category
tends to dominate the scores overwhelmingly, when the scores should
probably be more evenly distributed.
=head1 AUTHOR
t/01-naive_bayes.t
while (my ($name, $data) = each %docs) {
  $c->knowledge_set->make_document(name => $name, %$data);
}
$c->knowledge_set->finish;
# Make sure collection_weighting is working
ok $c->knowledge_set->document_frequency('vampires'), 2;
for ('vampires', 'mirrors') {
  ok ($c->knowledge_set->document('doc4')->features->as_hash->{$_},
      log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
     );
}
$c->learner->train( knowledge_set => $c->knowledge_set );
ok $c->learner;
my $doc = new AI::Categorizer::Document
  ( name => 'test1',
    content => 'I would like to begin farming sheep.' );
ok $c->learner->categorize($doc)->best_category, 'farming';