AI-Categorizer

 view release on metacpan or  search on metacpan

Changes  view on Meta::CPAN


 - Much more extensive regression testing of the code.

 - Added a Document subclass for XML documents. [Implemented by
   Jae-Moon Lee] Its interface is still unstable, it may change in
   later releases.

 - Added a 'Build.PL' file for an alternate installation method using
   Module::Build.

 - Fixed a problem in the Hypothesis' best_category() method that
   would often result in the wrong category being reported.  Added a
   regression test to exercise the Hypothesis class.  [Spotted by
   Xiaobo Li]

 - The 'categorizer' script now records more useful benchmarking
   information about time & memory in its outfile.

 - The AI::Categorizer->dump_parameters() method now tries to avoid
   showing you its entire list of stopwords.

README  view on Meta::CPAN

     $c->train;
     $c->evaluate_test_set;
     print $c->stats_table;
 
     # After training, use the Learner for categorization
     my $l = $c->learner;
     while (...) {
       my $d = ...create a document...
       my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
       print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
       print "Best category: ", $hypothesis->best_category, "\n";
     }
 
DESCRIPTION
    "AI::Categorizer" is a framework for automatic text categorization. It
    consists of a collection of Perl modules that implement common
    categorization tasks, and a set of defined relationships among those
    modules. The various details are flexible - for example, you can choose what
    categorization algorithm to use, what features (words or otherwise) of the
    documents should be used (or how to automatically choose these features),
    what format the documents are in, and so on.

README  view on Meta::CPAN

    provides an umbrella class for high-level operations, or you may use the
    interfaces of the individual classes in the framework.

    A simple sample script that reads a training corpus, trains a categorizer,
    and tests the categorizer on a test corpus, is distributed as eg/demo.pl .

    Disclaimer: the results of any of the machine learning algorithms are far
    from infallible (close to fallible?). Categorization of documents is often a
    difficult task even for humans well-trained in the particular domain of
    knowledge, and there are many things a human would consider that none of
    these algorithms consider. These are only statistical tests - at best they
    are neat tricks or helpful assistants, and at worst they are totally
    unreliable. If you plan to use this module for anything really important,
    human supervision is essential, both of the categorization process and the
    final results.

    For the usage details, please see the documentation of each individual
    module.

FRAMEWORK COMPONENTS
    This section explains the major pieces of the "AI::Categorizer" object

eg/demo.pl  view on Meta::CPAN


# If you want to get at the specific assigned categories for a
# specific document, you can do it like this:

my $doc = AI::Categorizer::Document->new
  ( content => "Hello, I am a pretty generic document with not much to say." );

my $h = $l->categorize( $doc );

print ("For test document:\n",
       "  Best category = ", $h->best_category, "\n",
       "  All categories = ", join(', ', $h->categories), "\n");

lib/AI/Categorizer.pm  view on Meta::CPAN

 $c->train;
 $c->evaluate_test_set;
 print $c->stats_table;
 
 # After training, use the Learner for categorization
 my $l = $c->learner;
 while (...) {
   my $d = ...create a document...
   my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
   print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
   print "Best category: ", $hypothesis->best_category, "\n";
 }
 
=head1 DESCRIPTION

C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules.  The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically

lib/AI/Categorizer.pm  view on Meta::CPAN


A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus, is
distributed as eg/demo.pl .

Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?).  Categorization of documents
is often a difficult task even for humans well-trained in the
particular domain of knowledge, and there are many things a human
would consider that none of these algorithms consider.  These are only
statistical tests - at best they are neat tricks or helpful
assistants, and at worst they are totally unreliable.  If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.

For the usage details, please see the documentation of each individual
module.

=head1 FRAMEWORK COMPONENTS

This section explains the major pieces of the C<AI::Categorizer>

lib/AI/Categorizer/Document.pm  view on Meta::CPAN

Stem stopwords according to 'stemming' parameter, then match them
against stemmed document words.

=item pre_stemmed

Stopwords are already stemmed, match them against stemmed document
words.

=back

The default value is C<stem>, which seems to produce the best results
in most cases I've tried.  I'm not aware of any studies comparing the
C<no_stem> behavior to the C<stem> behavior in the general case.

This parameter has no effect if there are no stopwords being used, or
if stemming is not being used.  In the latter case, the list of
stopwords will always be matched as-is against the document words.

Note that if the C<stem> option is used, the data structure passed as
the C<stopwords> parameter will be modified in-place to contain the
stemmed versions of the stopwords supplied.

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN

an optional argument.

=item scan_stats()

Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)

=item scan_features()

This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner.  This process is known as "feature selection", and it's a
very important part of categorization.

The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.

The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.

lib/AI/Categorizer/Hypothesis.pm  view on Meta::CPAN

   all_categories => {type => ARRAYREF},
   scores => {type => HASHREF},
   threshold => {type => SCALAR},
   document_name => {type => SCALAR, optional => 1},
  );

sub all_categories { @{$_[0]->{all_categories}} }
sub document_name  { $_[0]->{document_name} }
sub threshold      { $_[0]->{threshold} }

sub best_category {
  my ($self) = @_;
  my $sc = $self->{scores};
  return unless %$sc;

  my ($best_cat, $best_score) = each %$sc;
  while (my ($key, $val) = each %$sc) {
    ($best_cat, $best_score) = ($key, $val) if $val > $best_score;
  }
  return $best_cat;
}

sub in_category {
  my ($self, $cat) = @_;
  return '' unless exists $self->{scores}{$cat};
  return $self->{scores}{$cat} > $self->{threshold};
}

sub categories {
  my $self = shift;

lib/AI/Categorizer/Hypothesis.pm  view on Meta::CPAN


=head1 SYNOPSIS

 use AI::Categorizer::Hypothesis;
 
 # Hypotheses are usually created by the Learner's categorize() method.
 # (assume here that $learner and $document have been created elsewhere)
 my $h = $learner->categorize($document);
 
 print "Assigned categories: ", join ', ', $h->categories, "\n";
 print "Best category: ", $h->best_category, "\n";
 print "Assigned scores: ", join ', ', $h->scores( $h->categories ), "\n";
 print "Chosen from: ", join ', ', $h->all_categories, "\n";
 print +($h->in_category('geometry') ? '' : 'not '), "assigned to geometry\n";

=head1 DESCRIPTION

A Hypothesis embodies a set of category assignments that a categorizer
makes about a single document.  Because one may be interested in
knowing different kinds of things about the assignments (for instance,
what categories were assigned, which category had the highest score,

lib/AI/Categorizer/Hypothesis.pm  view on Meta::CPAN


An optional string parameter indicating the name of the document about
which this hypothesis was made.

=back


=item categories()

Returns an ordered list of the categories the document was placed in,
with best matches first.  Categories are returned by their string names.

=item best_category()

Returns the name of the category with the highest score in this
hypothesis.  Bear in mind that this category may not actually be
assigned if no categories' scores exceed the threshold.

=item in_category($name)

Returns true or false depending on whether the document was placed in
the given category.

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

an optional argument.

=item scan_stats()

Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)

=item scan_features()

This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner.  This process is known as "feature selection", and it's a
very important part of categorization.

The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.

The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.

lib/AI/Categorizer/Learner.pm  view on Meta::CPAN

 my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
 $nb->train(knowledge_set => $k);
 $nb->save_state('filename');
 
 ... time passes ...
 
 $nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
 my $c = new AI::Categorizer::Collection::Files( path => ... );
 while (my $document = $c->next) {
   my $hypothesis = $nb->categorize($document);
   print "Best assigned category: ", $hypothesis->best_category, "\n";
   print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
 }

=head1 DESCRIPTION

The C<AI::Categorizer::Learner> class is an abstract class that will
never actually be directly used in your code.  Instead, you will use a
subclass like C<AI::Categorizer::Learner::NaiveBayes> which implements
an actual machine learning algorithm.

lib/AI/Categorizer/Learner.pm  view on Meta::CPAN

categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.  If you provided a C<knowledge_set> parameter to C<new()>,
specifying one here will override it.

=item categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=item categorize_collection(collection => $collection)

Categorizes every document in a collection and returns an Experiment
object representing the results.  Note that the Experiment does not
contain knowledge of the assigned categories for every document, only
a statistical summary of the results.

lib/AI/Categorizer/Learner/DecisionTree.pm  view on Meta::CPAN

sub create_model {
  my $self = shift;
  $self->SUPER::create_model;
  $self->{model}{first_tree}->do_purge;
  delete $self->{model}{first_tree};
}

sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;
  
  my $t = new AI::DecisionTree(noise_mode => 'pick_best', 
			       verbose => $self->verbose);

  my %results;
  for ($positives, $negatives) {
    foreach my $doc (@$_) {
      $results{$doc->name} = $_ eq $positives ? 1 : 0;
    }
  }

  if ($self->{model}{first_tree}) {

lib/AI/Categorizer/Learner/DecisionTree.pm  view on Meta::CPAN

  
  my $l = new AI::Categorizer::Learner::DecisionTree(...parameters...);
  $l->train(knowledge_set => $k);
  $l->save_state('filename');
  
  ... time passes ...
  
  $l = AI::Categorizer::Learner->restore_state('filename');
  while (my $document = ... ) {  # An AI::Categorizer::Document object
    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
  }

=head1 DESCRIPTION

This class implements a Decision Tree machine learner, using
C<AI::DecisionTree> to do the internal work.

=head1 METHODS

This class inherits from the C<AI::Categorizer::Learner> class, so all

lib/AI/Categorizer/Learner/DecisionTree.pm  view on Meta::CPAN

Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.

=head2 categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=head2 save_state($path)

Saves the categorizer for later use.  This method is inherited from
C<AI::Categorizer::Storable>.

=head1 AUTHOR

lib/AI/Categorizer/Learner/Guesser.pm  view on Meta::CPAN

  my $l = new AI::Categorizer::Learner::Guesser;
  $l->train(knowledge_set => $k);
  $l->save_state('filename');
  
  ... time passes ...
  
  $l = AI::Categorizer::Learner->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
  }

=head1 DESCRIPTION

This implements a simple category guesser that makes assignments based
solely on the prior probabilities of categories.  For instance, if 5%
of the training documents belong to a certain category, then the
probability of any test document being assigned to that category is
0.05.  This can be useful for providing baseline scores to compare

lib/AI/Categorizer/Learner/KNN.pm  view on Meta::CPAN

  my $nb = new AI::Categorizer::Learner::KNN(...parameters...);
  $nb->train(knowledge_set => $k);
  $nb->save_state('filename');
  
  ... time passes ...
  
  $l = AI::Categorizer::Learner->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
  }

=head1 DESCRIPTION

This is an implementation of the k-Nearest-Neighbor decision-making
algorithm, applied to the task of document categorization (as defined
by the AI::Categorizer module).  See L<AI::Categorizer> for a complete
description of the interface.

lib/AI/Categorizer/Learner/KNN.pm  view on Meta::CPAN

Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.

=head2 categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=head2 save_state($path)

Saves the categorizer for later use.  This method is inherited from
C<AI::Categorizer::Storable>.

=head1 AUTHOR

lib/AI/Categorizer/Learner/NaiveBayes.pm  view on Meta::CPAN

  my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
  $nb->train(knowledge_set => $k);
  $nb->save_state('filename');
  
  ... time passes ...
  
  $nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $nb->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
  }

=head1 DESCRIPTION

This is an implementation of the Naive Bayes decision-making
algorithm, applied to the task of document categorization (as defined
by the AI::Categorizer module).  See L<AI::Categorizer> for a complete
description of the interface.

lib/AI/Categorizer/Learner/NaiveBayes.pm  view on Meta::CPAN

Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.

=head2 categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=head2 save_state($path)

Saves the categorizer for later use.  This method is inherited from
C<AI::Categorizer::Storable>.

=head1 CALCULATIONS

lib/AI/Categorizer/Learner/SVM.pm  view on Meta::CPAN

  
  my $l = new AI::Categorizer::Learner::SVM(...parameters...);
  $l->train(knowledge_set => $k);
  $l->save_state('filename');
  
  ... time passes ...
  
  $l = AI::Categorizer::Learner->restore_state('filename');
  while (my $document = ... ) {  # An AI::Categorizer::Document object
    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
  }

=head1 DESCRIPTION

This class implements a Support Vector Machine machine learner, using
Cory Spencer's C<Algorithm::SVM> module.  In lots of the recent
academic literature, SVMs perform very well for text categorization.

=head1 METHODS

lib/AI/Categorizer/Learner/SVM.pm  view on Meta::CPAN

Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.

=head2 categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=head2 save_state($path)

Saves the categorizer for later use.  This method is inherited from
C<AI::Categorizer::Storable>.

=head1 AUTHOR

lib/AI/Categorizer/Learner/Weka.pm  view on Meta::CPAN

  my $nb = new AI::Categorizer::Learner::Weka(...parameters...);
  $nb->train(knowledge_set => $k);
  $nb->save_state('filename');
  
  ... time passes ...
  
  $nb = AI::Categorizer::Learner->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $nb->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
  }

=head1 DESCRIPTION

This class doesn't implement any machine learners of its own, it
merely passes the data through to the Weka machine learning system
(http://www.cs.waikato.ac.nz/~ml/weka/).  This can give you access to
a collection of machine learning algorithms not otherwise implemented
in C<AI::Categorizer>.

lib/AI/Categorizer/Learner/Weka.pm  view on Meta::CPAN

Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.

=head2 categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=head2 save_state($path)

Saves the categorizer for later use.  This method is inherited from
C<AI::Categorizer::Storable>.

=head1 AUTHOR

t/01-naive_bayes.t  view on Meta::CPAN

	log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
       );
  }

  $c->learner->train( knowledge_set => $c->knowledge_set );
  ok $c->learner;
  
  my $doc = new AI::Categorizer::Document
    ( name => 'test1',
      content => 'I would like to begin farming sheep.' );
  ok $c->learner->categorize($doc)->best_category, 'farming';
}

{
  ok my $c = new AI::Categorizer(term_weighting => 'b');
  
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
  
  $c->knowledge_set->finish;

t/12-hypothesis.t  view on Meta::CPAN

	      x => 0.561960874125361,
	      y => 0.0025778217241168,
	      z => 0.760564740281552,
	     },
   threshold => 0.95,
   document_name => 'foo',
  );
ok $h;

ok $h->categories, 4;
ok $h->best_category, 'd',
ok $h->in_category('d');
ok $h->in_category('m');
ok !$h->in_category('j');
ok !$h->in_category('foo');

t/common.pl  view on Meta::CPAN


sub run_test_docs {
  my $l = shift;

  my $doc = new AI::Categorizer::Document
    ( name => 'test1',
      content => 'I would like to begin farming sheep.' );
  my $r = $l->categorize($doc);
  
  print "Categories: ", join(', ', $r->categories), "\n";
  ok($r->best_category, 'farming', "Best category is 'farming'");
  ok $r->in_category('farming'),  1, sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('farming'));
  ok $r->in_category('vampire'), '', sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('vampire'));
  
  ok $r->all_categories, 2, "Should be 2 categories in total";
  
  $doc = new AI::Categorizer::Document
    ( name => 'test2',
      content => "I see that many vampires may have eaten my beautiful daughter's blood." );
  $r = $l->categorize($doc);
  
  print "Categories: ", join(', ', $r->categories), "\n";
  ok($r->best_category, 'vampire', "Best category is 'vampire'");
  ok $r->in_category('farming'), '', sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('farming'));
  ok $r->in_category('vampire'),  1, sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('vampire'));
}

sub set_up_tests {
  my %params = @_;
  my $c = new AI::Categorizer(
			      knowledge_set => AI::Categorizer::KnowledgeSet->new
			      (
			       name => 'Vampires/Farmers',



( run in 0.619 second using v1.01-cache-2.11-cpan-d6f9594c0a5 )