- Much more extensive regression testing of the code.
- Added a Document subclass for XML documents. [Implemented by
Jae-Moon Lee] Its interface is still unstable and may change in
later releases.
- Added a 'Build.PL' file for an alternate installation method using
Module::Build.
- Fixed a problem in the Hypothesis' best_category() method that
would often result in the wrong category being reported. Added a
regression test to exercise the Hypothesis class. [Spotted by
Xiaobo Li]
- The 'categorizer' script now records more useful benchmarking
information about time & memory in its outfile.
- The AI::Categorizer->dump_parameters() method now tries to avoid
showing you its entire list of stopwords.
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
DESCRIPTION
"AI::Categorizer" is a framework for automatic text categorization. It
consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can choose what
categorization algorithm to use, what features (words or otherwise) of the
documents should be used (or how to automatically choose these features),
what format the documents are in, and so on.
provides an umbrella class for high-level operations, or you may use the
interfaces of the individual classes in the framework.
A simple sample script that reads a training corpus, trains a categorizer,
and tests the categorizer on a test corpus, is distributed as eg/demo.pl .
Disclaimer: the results of any of the machine learning algorithms are far
from infallible (close to fallible?). Categorization of documents is often a
difficult task even for humans well-trained in the particular domain of
knowledge, and there are many things a human would consider that none of
these algorithms consider. These are only statistical tests - at best they
are neat tricks or helpful assistants, and at worst they are totally
unreliable. If you plan to use this module for anything really important,
human supervision is essential, both of the categorization process and the
final results.
For the usage details, please see the documentation of each individual
module.
FRAMEWORK COMPONENTS
This section explains the major pieces of the "AI::Categorizer" object
# If you want to get at the specific assigned categories for a
# specific document, you can do it like this:
my $doc = AI::Categorizer::Document->new
( content => "Hello, I am a pretty generic document with not much to say." );
my $h = $l->categorize( $doc );
print("For test document:\n",
" Best category = ", $h->best_category, "\n",
" All categories = ", join(', ', $h->categories), "\n");
lib/AI/Categorizer.pm view on Meta::CPAN
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
=head1 DESCRIPTION
C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically
lib/AI/Categorizer.pm view on Meta::CPAN
A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus, is
distributed as eg/demo.pl .
Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?). Categorization of documents
is often a difficult task even for humans well-trained in the
particular domain of knowledge, and there are many things a human
would consider that none of these algorithms consider. These are only
statistical tests - at best they are neat tricks or helpful
assistants, and at worst they are totally unreliable. If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.
For the usage details, please see the documentation of each individual
module.
=head1 FRAMEWORK COMPONENTS
This section explains the major pieces of the C<AI::Categorizer>
lib/AI/Categorizer/Document.pm view on Meta::CPAN
Stem stopwords according to 'stemming' parameter, then match them
against stemmed document words.
=item pre_stemmed
Stopwords are already stemmed, match them against stemmed document
words.
=back
The default value is C<stem>, which seems to produce the best results
in most cases I've tried. I'm not aware of any studies comparing the
C<no_stem> behavior to the C<stem> behavior in the general case.
This parameter has no effect if there are no stopwords being used, or
if stemming is not being used. In the latter case, the list of
stopwords will always be matched as-is against the document words.
Note that if the C<stem> option is used, the data structure passed as
the C<stopwords> parameter will be modified in-place to contain the
stemmed versions of the stopwords supplied.
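The difference between the C<stem> and C<pre_stemmed> behaviors can be sketched in plain Perl. This is only an illustration, not the module's implementation: C<toy_stem()> and C<filter_words()> are hypothetical helpers, and the toy stemmer merely strips a trailing "ing" or "s". Unlike the behavior described above, the sketch works on a copy of the stopword hash, so the caller's data is left unchanged.

```perl
use strict;
use warnings;

# Hypothetical toy stemmer standing in for a real stemming algorithm:
# it only strips a trailing "ing" or "s".
sub toy_stem {
    my ($word) = @_;
    $word =~ s/(?:ing|s)$//;
    return $word;
}

# Return the stemmed document words that are not stopwords.
# $mode is 'stem' or 'pre_stemmed'; $stopwords is a hash reference.
sub filter_words {
    my ($mode, $stopwords, @words) = @_;
    my %stop = %$stopwords;    # work on a copy; the caller's hash is untouched
    if ($mode eq 'stem') {
        # Stem the stopwords first, then match against stemmed words.
        %stop = map { (toy_stem($_) => 1) } keys %stop;
    }
    # In 'pre_stemmed' mode the stopwords are used exactly as given.
    return grep { !$stop{$_} } map { toy_stem($_) } @words;
}

my %stopwords = (farming => 1, the => 1);
my @words     = qw(the farms were farming sheep);

my @stem_mode = filter_words('stem',        \%stopwords, @words);
my @pre_mode  = filter_words('pre_stemmed', \%stopwords, @words);

print "stem:        @stem_mode\n";
print "pre_stemmed: @pre_mode\n";
```

With an unstemmed stopword such as "farming", only the C<stem> mode removes the stemmed word "farm"; in C<pre_stemmed> mode the stopword never matches, which is why that mode expects stopwords that have been stemmed already.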
lib/AI/Categorizer/FeatureSelector.pm view on Meta::CPAN
an optional argument.
=item scan_stats()
Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection. (XXX need to describe stats)
=item scan_features()
This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner. This process is known as "feature selection", and it's a
very important part of categorization.
The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.
The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.
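As a rough illustration of what feature selection does, the sketch below ranks words by document frequency and keeps the top few. It makes several assumptions: the ranking criterion, the toy corpus, and the variable C<$features_kept> (named after the parameter above, but used here as a plain count) are chosen for the example and do not show how the module itself selects features.

```perl
use strict;
use warnings;

# Toy corpus: document name => content. Purely illustrative data.
my %docs = (
    doc1 => 'sheep graze on the farm',
    doc2 => 'the farm sells sheep wool',
    doc3 => 'vampires avoid the farm at noon',
);

# Document frequency: in how many documents does each word occur?
my %df;
for my $content (values %docs) {
    my %seen = map { ($_ => 1) } split ' ', lc $content;
    $df{$_}++ for keys %seen;
}

# Keep the $features_kept words with the highest document frequency,
# breaking ties alphabetically so the result is deterministic.
my $features_kept = 3;
my @ranked = sort { $df{$b} <=> $df{$a} or $a cmp $b } keys %df;
my @kept   = @ranked[0 .. $features_kept - 1];

print "Kept features: @kept\n";
```

Words outside the kept list would then be ignored when documents are loaded, which shrinks the feature space the Learner has to deal with.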
lib/AI/Categorizer/Hypothesis.pm view on Meta::CPAN
all_categories => {type => ARRAYREF},
scores => {type => HASHREF},
threshold => {type => SCALAR},
document_name => {type => SCALAR, optional => 1},
);
sub all_categories { @{$_[0]->{all_categories}} }
sub document_name { $_[0]->{document_name} }
sub threshold { $_[0]->{threshold} }
sub best_category {
my ($self) = @_;
my $sc = $self->{scores};
return unless %$sc;
keys %$sc;  # reset the hash iterator so each() starts from the first entry
my ($best_cat, $best_score) = each %$sc;
while (my ($key, $val) = each %$sc) {
($best_cat, $best_score) = ($key, $val) if $val > $best_score;
}
return $best_cat;
}
sub in_category {
my ($self, $cat) = @_;
return '' unless exists $self->{scores}{$cat};
return $self->{scores}{$cat} > $self->{threshold};
}
sub categories {
my $self = shift;
lib/AI/Categorizer/Hypothesis.pm view on Meta::CPAN
=head1 SYNOPSIS
use AI::Categorizer::Hypothesis;
# Hypotheses are usually created by the Learner's categorize() method.
# (assume here that $learner and $document have been created elsewhere)
my $h = $learner->categorize($document);
print "Assigned categories: ", join(', ', $h->categories), "\n";
print "Best category: ", $h->best_category, "\n";
print "Assigned scores: ", join(', ', $h->scores( $h->categories )), "\n";
print "Chosen from: ", join(', ', $h->all_categories), "\n";
print +($h->in_category('geometry') ? '' : 'not '), "assigned to geometry\n";
=head1 DESCRIPTION
A Hypothesis embodies a set of category assignments that a categorizer
makes about a single document. Because one may be interested in
knowing different kinds of things about the assignments (for instance,
what categories were assigned, which category had the highest score,
lib/AI/Categorizer/Hypothesis.pm view on Meta::CPAN
An optional string parameter indicating the name of the document about
which this hypothesis was made.
=back
=item categories()
Returns an ordered list of the categories the document was placed in,
with best matches first. Categories are returned by their string names.
=item best_category()
Returns the name of the category with the highest score in this
hypothesis. Bear in mind that this category may not actually be
assigned if no categories' scores exceed the threshold.
=item in_category($name)
Returns true or false depending on whether the document was placed in
the given category.
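The relationship between the best category and the assigned categories can be sketched without the module. The scores and threshold below are made-up values; the only detail taken from the Hypothesis code is that a category counts as assigned when its score is strictly greater than the threshold.

```perl
use strict;
use warnings;

# Hypothetical scores for one document, and a threshold for assignment.
my %scores    = (farming => 0.82, vampire => 0.31, geometry => 0.47);
my $threshold = 0.9;

# Highest-scoring category, regardless of the threshold.
my ($best) = sort { $scores{$b} <=> $scores{$a} } keys %scores;

# Categories actually assigned: score strictly above the threshold,
# ordered with the best matches first.
my @assigned = sort { $scores{$b} <=> $scores{$a} }
               grep { $scores{$_} > $threshold } keys %scores;

print "Best category: $best\n";
print "Assigned: ", (@assigned ? join(', ', @assigned) : '(none)'), "\n";
```

Here "farming" is the best category even though nothing is assigned, because no score exceeds the threshold of 0.9.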
lib/AI/Categorizer/KnowledgeSet.pm view on Meta::CPAN
an optional argument.
=item scan_stats()
Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection. (XXX need to describe stats)
=item scan_features()
This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner. This process is known as "feature selection", and it's a
very important part of categorization.
The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.
The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.
lib/AI/Categorizer/Learner.pm view on Meta::CPAN
my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
$nb->train(knowledge_set => $k);
$nb->save_state('filename');
... time passes ...
$nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
my $c = new AI::Categorizer::Collection::Files( path => ... );
while (my $document = $c->next) {
my $hypothesis = $nb->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
}
=head1 DESCRIPTION
The C<AI::Categorizer::Learner> class is an abstract class that will
never actually be directly used in your code. Instead, you will use a
subclass like C<AI::Categorizer::Learner::NaiveBayes> which implements
an actual machine learning algorithm.
lib/AI/Categorizer/Learner.pm view on Meta::CPAN
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object. If you provided a C<knowledge_set> parameter to C<new()>,
specifying one here will override it.
=item categorize($document)
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
=item categorize_collection(collection => $collection)
Categorizes every document in a collection and returns an Experiment
object representing the results. Note that the Experiment does not
contain knowledge of the assigned categories for every document, only
a statistical summary of the results.
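To give a feel for what a statistical summary of categorization results can look like, the following sketch computes micro-averaged precision, recall, and F1 over a handful of documents. Which statistics the Experiment object actually reports is not described here, so the choice of measures, the data, and all variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical results: for each document, the categories a learner
# assigned and the categories that are actually correct.
my %assigned = (doc1 => ['farming'], doc2 => ['farming', 'vampire'], doc3 => []);
my %correct  = (doc1 => ['farming'], doc2 => ['vampire'],            doc3 => ['vampire']);

# Count true positives, false positives, and false negatives over all
# (document, category) pairs.
my ($tp, $fp, $fn) = (0, 0, 0);
for my $doc (sort keys %assigned) {
    my %is_correct  = map { ($_ => 1) } @{ $correct{$doc} };
    my %is_assigned = map { ($_ => 1) } @{ $assigned{$doc} };
    $tp += scalar grep {  $is_correct{$_}  } keys %is_assigned;
    $fp += scalar grep { !$is_correct{$_}  } keys %is_assigned;
    $fn += scalar grep { !$is_assigned{$_} } keys %is_correct;
}

# Guard against division by zero when nothing was assigned or correct.
my $precision = ($tp + $fp) ? $tp / ($tp + $fp) : 0;
my $recall    = ($tp + $fn) ? $tp / ($tp + $fn) : 0;
my $f1        = ($precision + $recall)
              ? 2 * $precision * $recall / ($precision + $recall)
              : 0;

printf "precision=%.3f recall=%.3f F1=%.3f\n", $precision, $recall, $f1;
```

A summary like this discards the per-document assignments, so if you need them you have to record each Hypothesis yourself while categorizing.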
lib/AI/Categorizer/Learner/DecisionTree.pm view on Meta::CPAN
sub create_model {
my $self = shift;
$self->SUPER::create_model;
$self->{model}{first_tree}->do_purge;
delete $self->{model}{first_tree};
}
sub create_boolean_model {
my ($self, $positives, $negatives, $cat) = @_;
my $t = new AI::DecisionTree(noise_mode => 'pick_best',
verbose => $self->verbose);
my %results;
for ($positives, $negatives) {
foreach my $doc (@$_) {
$results{$doc->name} = $_ eq $positives ? 1 : 0;
}
}
if ($self->{model}{first_tree}) {
lib/AI/Categorizer/Learner/DecisionTree.pm view on Meta::CPAN
my $l = new AI::Categorizer::Learner::DecisionTree(...parameters...);
$l->train(knowledge_set => $k);
$l->save_state('filename');
... time passes ...
$l = AI::Categorizer::Learner->restore_state('filename');
while (my $document = ... ) { # An AI::Categorizer::Document object
my $hypothesis = $l->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
}
=head1 DESCRIPTION
This class implements a Decision Tree machine learner, using
C<AI::DecisionTree> to do the internal work.
=head1 METHODS
This class inherits from the C<AI::Categorizer::Learner> class, so all
lib/AI/Categorizer/Learner/DecisionTree.pm view on Meta::CPAN
Trains the categorizer. This prepares it for later use in
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
=head2 categorize($document)
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
=head2 save_state($path)
Saves the categorizer for later use. This method is inherited from
C<AI::Categorizer::Storable>.
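The general save-and-restore pattern can be sketched with Perl's core C<Storable> module. This is an assumption-laden illustration: it does not show how C<AI::Categorizer::Storable> stores a learner, and the C<$model> hash and the file name are hypothetical stand-ins for a trained categorizer.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

# Hypothetical stand-in for a trained model: a plain hash of settings.
my $model = { categories => ['farming', 'vampire'], threshold => 0.5 };

# Write into a temporary directory that is removed when the program exits.
my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/model.sto";

nstore($model, $path);            # save in a portable (network-order) format
my $restored = retrieve($path);   # ... time passes ... load it back

print "Restored categories: @{ $restored->{categories} }\n";
```

The restored structure is a deep copy of the original, so training only has to happen once and later runs can start from the saved state.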
=head1 AUTHOR
lib/AI/Categorizer/Learner/Guesser.pm view on Meta::CPAN
my $l = new AI::Categorizer::Learner::Guesser;
$l->train(knowledge_set => $k);
$l->save_state('filename');
... time passes ...
$l = AI::Categorizer::Learner->restore_state('filename');
my $c = new AI::Categorizer::Collection::Files( path => ... );
while (my $document = $c->next) {
my $hypothesis = $l->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
}
=head1 DESCRIPTION
This implements a simple category guesser that makes assignments based
solely on the prior probabilities of categories. For instance, if 5%
of the training documents belong to a certain category, then the
probability of any test document being assigned to that category is
0.05. This can be useful for providing baseline scores to compare
lib/AI/Categorizer/Learner/KNN.pm view on Meta::CPAN
my $l = new AI::Categorizer::Learner::KNN(...parameters...);
$l->train(knowledge_set => $k);
$l->save_state('filename');
... time passes ...
$l = AI::Categorizer::Learner->restore_state('filename');
my $c = new AI::Categorizer::Collection::Files( path => ... );
while (my $document = $c->next) {
my $hypothesis = $l->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
}
=head1 DESCRIPTION
This is an implementation of the k-Nearest-Neighbor decision-making
algorithm, applied to the task of document categorization (as defined
by the AI::Categorizer module). See L<AI::Categorizer> for a complete
description of the interface.
lib/AI/Categorizer/Learner/KNN.pm view on Meta::CPAN
Trains the categorizer. This prepares it for later use in
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
=head2 categorize($document)
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
=head2 save_state($path)
Saves the categorizer for later use. This method is inherited from
C<AI::Categorizer::Storable>.
=head1 AUTHOR
lib/AI/Categorizer/Learner/NaiveBayes.pm view on Meta::CPAN
my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
$nb->train(knowledge_set => $k);
$nb->save_state('filename');
... time passes ...
$nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
my $c = new AI::Categorizer::Collection::Files( path => ... );
while (my $document = $c->next) {
my $hypothesis = $nb->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
}
=head1 DESCRIPTION
This is an implementation of the Naive Bayes decision-making
algorithm, applied to the task of document categorization (as defined
by the AI::Categorizer module). See L<AI::Categorizer> for a complete
description of the interface.
lib/AI/Categorizer/Learner/NaiveBayes.pm view on Meta::CPAN
Trains the categorizer. This prepares it for later use in
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
=head2 categorize($document)
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
=head2 save_state($path)
Saves the categorizer for later use. This method is inherited from
C<AI::Categorizer::Storable>.
=head1 CALCULATIONS
lib/AI/Categorizer/Learner/SVM.pm view on Meta::CPAN
my $l = new AI::Categorizer::Learner::SVM(...parameters...);
$l->train(knowledge_set => $k);
$l->save_state('filename');
... time passes ...
$l = AI::Categorizer::Learner->restore_state('filename');
while (my $document = ... ) { # An AI::Categorizer::Document object
my $hypothesis = $l->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
}
=head1 DESCRIPTION
This class implements a Support Vector Machine machine learner, using
Cory Spencer's C<Algorithm::SVM> module. In much of the recent
academic literature, SVMs are reported to perform very well for text
categorization.
=head1 METHODS
lib/AI/Categorizer/Learner/SVM.pm view on Meta::CPAN
Trains the categorizer. This prepares it for later use in
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
=head2 categorize($document)
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
=head2 save_state($path)
Saves the categorizer for later use. This method is inherited from
C<AI::Categorizer::Storable>.
=head1 AUTHOR
lib/AI/Categorizer/Learner/Weka.pm view on Meta::CPAN
my $nb = new AI::Categorizer::Learner::Weka(...parameters...);
$nb->train(knowledge_set => $k);
$nb->save_state('filename');
... time passes ...
$nb = AI::Categorizer::Learner->restore_state('filename');
my $c = new AI::Categorizer::Collection::Files( path => ... );
while (my $document = $c->next) {
my $hypothesis = $nb->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
}
=head1 DESCRIPTION
This class doesn't implement any machine learners of its own; it
merely passes the data through to the Weka machine learning system
(http://www.cs.waikato.ac.nz/~ml/weka/). This can give you access to
a collection of machine learning algorithms not otherwise implemented
in C<AI::Categorizer>.
lib/AI/Categorizer/Learner/Weka.pm view on Meta::CPAN
Trains the categorizer. This prepares it for later use in
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
=head2 categorize($document)
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
=head2 save_state($path)
Saves the categorizer for later use. This method is inherited from
C<AI::Categorizer::Storable>.
=head1 AUTHOR
t/01-naive_bayes.t view on Meta::CPAN
log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
);
}
$c->learner->train( knowledge_set => $c->knowledge_set );
ok $c->learner;
my $doc = new AI::Categorizer::Document
( name => 'test1',
content => 'I would like to begin farming sheep.' );
ok $c->learner->categorize($doc)->best_category, 'farming';
}
{
ok my $c = new AI::Categorizer(term_weighting => 'b');
while (my ($name, $data) = each %docs) {
$c->knowledge_set->make_document(name => $name, %$data);
}
$c->knowledge_set->finish;
t/12-hypothesis.t view on Meta::CPAN
x => 0.561960874125361,
y => 0.0025778217241168,
z => 0.760564740281552,
},
threshold => 0.95,
document_name => 'foo',
);
ok $h;
ok $h->categories, 4;
ok $h->best_category, 'd';
ok $h->in_category('d');
ok $h->in_category('m');
ok !$h->in_category('j');
ok !$h->in_category('foo');
t/common.pl view on Meta::CPAN
sub run_test_docs {
my $l = shift;
my $doc = new AI::Categorizer::Document
( name => 'test1',
content => 'I would like to begin farming sheep.' );
my $r = $l->categorize($doc);
print "Categories: ", join(', ', $r->categories), "\n";
ok($r->best_category, 'farming', "Best category is 'farming'");
ok $r->in_category('farming'), 1, sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('farming'));
ok $r->in_category('vampire'), '', sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('vampire'));
ok $r->all_categories, 2, "Should be 2 categories in total";
$doc = new AI::Categorizer::Document
( name => 'test2',
content => "I see that many vampires may have eaten my beautiful daughter's blood." );
$r = $l->categorize($doc);
print "Categories: ", join(', ', $r->categories), "\n";
ok($r->best_category, 'vampire', "Best category is 'vampire'");
ok $r->in_category('farming'), '', sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('farming'));
ok $r->in_category('vampire'), 1, sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('vampire'));
}
sub set_up_tests {
my %params = @_;
my $c = new AI::Categorizer(
knowledge_set => AI::Categorizer::KnowledgeSet->new
(
name => 'Vampires/Farmers',