AI-Categorizer

Changes  view on Meta::CPAN

Revision history for Perl extension AI::Categorizer.

 - The t/01-naive_bayes.t test was failing (instead of skipping) when
   Algorithm::NaiveBayes wasn't installed.  Now it skips.

0.08 - Tue Mar 20 19:39:41 2007

 - Added a ChiSquared feature selection class. [Francois Paradis]

 - Changed the web locations of the reuters-21578 corpus that
   eg/demo.pl uses, since the location it referenced previously has
   gone away.

 - The building & installing process now uses Module::Build rather
   than ExtUtils::MakeMaker.

 - When the features_kept mechanism was used to explicitly state the
   features to use, and the scan_first parameter was left as its
   default value, the features_kept mechanism would silently fail to
   do anything.  This has now been fixed. [Spotted by Arnaud Gaudinat]

 - Recent versions of Weka have changed the name of the SVM class, so
   I've updated it in our test (t/03-weka.t) of the Weka wrapper
   too. [Sebastien Aperghis-Tramoni]

0.07  Tue May  6 16:15:04 CDT 2003

 - Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
   MANIFEST, so they weren't included in the 0.06 distribution.
   [Spotted by Zoltan Barta]

0.06 Tue Apr 22 10:27:26 CDT 2003

Changes  view on Meta::CPAN


 - Added a "Guesser" machine learner which simply uses overall class
   probabilities to make categorization decisions.  Sometimes useful
   for providing a set of baseline scores against which to evaluate
   other machine learners.

 - The NaiveBayes learner is now a wrapper around my new
   Algorithm::NaiveBayes module, which is just the old NaiveBayes code
   from here, turned into its own standalone module.

 - Much more extensive regression testing of the code.

 - Added a Document subclass for XML documents. [Implemented by
   Jae-Moon Lee] Its interface is still unstable and may change in
   later releases.

 - Added a 'Build.PL' file for an alternate installation method using
   Module::Build.

 - Fixed a problem in the Hypothesis' best_category() method that
   would often result in the wrong category being reported.  Added a
   regression test to exercise the Hypothesis class.  [Spotted by
   Xiaobo Li]

 - The 'categorizer' script now records more useful benchmarking
   information about time & memory in its outfile.

 - The AI::Categorizer->dump_parameters() method now tries to avoid
   showing you its entire list of stopwords.

 - Document objects now use a default 'name' if none is supplied.

Changes  view on Meta::CPAN

 - Removed F1(), precision(), recall(), etc. from Util package since
   they're in Statistics::Contingency.  Added random_elements() to
   Util.

 - Collection::Files now warns when no category information is known
   about a document in the collection (knowing it's in zero categories
   is okay).

 - Added the Collection::InMemory class

 - Much more thorough testing with 'make test'.

 - Added add_hypothesis() method to Experiment.

 - Added dot() and value() methods to FeatureVector.

 - Added 'feature_selection' parameter to KnowledgeSet.

 - Added document($name) accessor method to KnowledgeSet.

 - In KnowledgeSet, load(), read(), and scan_*() can now accept a

INSTALL  view on Meta::CPAN

           Installation instructions for AI::Categorizer

To install this module, follow the standard steps for installing most
Perl modules:

  perl Makefile.PL
  make
  make test
  make install

Or you may use the CPAN.pm module, which will automatically execute
these steps for you, and help you get the prerequisite dependencies
installed as well.

Alternatively, you can use the new Module::Build-style installer:

  perl Build.PL
  ./Build
  ./Build test
  ./Build install

 -Ken

README  view on Meta::CPAN

NAME
    AI::Categorizer - Automatic Text Categorization

SYNOPSIS
     use AI::Categorizer;
     my $c = new AI::Categorizer(...parameters...);
 
     # Run a complete experiment - training on a corpus, testing on a test
     # set, printing a summary of results to STDOUT
     $c->run_experiment;
 
     # Or, run the parts of $c->run_experiment separately
     $c->scan_features;
     $c->read_training_set;
     $c->train;
     $c->evaluate_test_set;
     print $c->stats_table;
 
     # After training, use the Learner for categorization
     my $l = $c->learner;
     while (...) {
       my $d = ...create a document...
       my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
       print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
       print "Best category: ", $hypothesis->best_category, "\n";
     }

README  view on Meta::CPAN


    The basic process of using this module will typically involve obtaining a
    collection of pre-categorized documents, creating a "knowledge set"
    representation of those documents, training a categorizer on that knowledge
    set, and saving the trained categorizer for later use. There are several
    ways to carry out this process. The top-level "AI::Categorizer" module
    provides an umbrella class for high-level operations, or you may use the
    interfaces of the individual classes in the framework.

    A simple sample script that reads a training corpus, trains a categorizer,
    and tests the categorizer on a test corpus, is distributed as eg/demo.pl .

    Disclaimer: the results of any of the machine learning algorithms are far
    from infallible (close to fallible?). Categorization of documents is often a
    difficult task even for humans well-trained in the particular domain of
    knowledge, and there are many things a human would consider that none of
    these algorithms consider. These are only statistical tests - at best they
    are neat tricks or helpful assistants, and at worst they are totally
    unreliable. If you plan to use this module for anything really important,
    human supervision is essential, both of the categorization process and the
    final results.

    For the usage details, please see the documentation of each individual
    module.

FRAMEWORK COMPONENTS
    This section explains the major pieces of the "AI::Categorizer" object

README  view on Meta::CPAN

    called "feature selection". It is managed by the
    "AI::Categorizer::KnowledgeSet" class, and you will find the details of
    feature selection processes in that class's documentation.

  Collections

    Because documents may be stored in lots of different formats, a "collection"
    class has been created as an abstraction of a stored set of documents,
    together with a way to iterate through the set and return Document objects.
    A knowledge set contains a single collection object. A "Categorizer" doing a
    complete test run generally contains two collections, one for training and
    one for testing. A "Learner" can mass-categorize a collection.

    The "AI::Categorizer::Collection" class and its subclasses instantiate the
    idea of a collection in this sense.
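
    As a rough illustration of the "stored set of documents plus a way to
    iterate" idea, the sketch below builds a tiny in-memory collection in
    plain Perl. The make_collection() helper and its next/rewind/count keys
    are hypothetical names chosen for this example; they are not the
    AI::Categorizer::Collection interface.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a collection: wraps a list of documents and
# hands them out one at a time, the way a collection iterates over a corpus.
sub make_collection {
    my @docs = @_;    # each document is a hashref: { name => ..., content => ... }
    my $pos  = 0;
    return {
        next   => sub { $pos < @docs ? $docs[ $pos++ ] : undef },
        rewind => sub { $pos = 0 },
        count  => sub { scalar @docs },
    };
}

my $collection = make_collection(
    { name => 'doc1', content => 'sheep farming news' },
    { name => 'doc2', content => 'vampires avoid mirrors' },
);

# Walk the collection until the iterator is exhausted.
while ( defined( my $doc = $collection->{next}->() ) ) {
    print "$doc->{name}: $doc->{content}\n";
}
```

    Keeping iteration behind a small interface like this is what lets the
    storage format (files, a database, memory) vary without the training
    code changing.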

  Documents

    Each document is represented by an "AI::Categorizer::Document" object, or an
    object of one of its subclasses. Each document class contains methods for
    turning a bunch of data into a Feature Vector. Each document also has a
    method to report which categories it belongs to.
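
    To make "turning a bunch of data into a Feature Vector" concrete, here is
    a minimal bag-of-words sketch in plain Perl. The text_to_features()
    function, its tokenizer, and its stopword handling are assumptions made
    for illustration; the real Document classes may tokenize, stem, and
    weight features differently.

```perl
use strict;
use warnings;

# Turn raw text into a bag-of-words hash (feature => count).
# Tokenization here is deliberately naive: lowercase runs of letters only.
sub text_to_features {
    my ( $text, $stopwords ) = @_;
    $stopwords ||= {};
    my %features;
    for my $word ( lc($text) =~ /([a-z]+)/g ) {
        next if $stopwords->{$word};    # skip words listed as stopwords
        $features{$word}++;
    }
    return \%features;
}

my $f = text_to_features(
    'I would like to begin farming sheep. Sheep are nice.',
    { map { $_ => 1 } qw(i to are) },
);
print join( ', ', map { "$_=$f->{$_}" } sort keys %$f ), "\n";
```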

README  view on Meta::CPAN

            left off.

        verbose
            If true, a few status messages will be printed during execution.

        training_set
            Specifies the "path" parameter that will be fed to the
            KnowledgeSet's "scan_features()" and "read()" methods during our
            "scan_features()" and "read_training_set()" methods.

        test_set
            Specifies the "path" parameter that will be used when creating a
            Collection during the "evaluate_test_set()" method.

        data_root
            A shortcut for setting the "training_set", "test_set", and
            "category_file" parameters separately. Sets "training_set" to
            "$data_root/training", "test_set" to "$data_root/test", and
            "category_file" (used by some of the Collection classes) to
            "$data_root/cats.txt".

    learner()
        Returns the Learner object associated with this Categorizer. Before
        "train()", the Learner will of course not be trained yet.

    knowledge_set()
        Returns the KnowledgeSet object associated with this Categorizer. If
        "read_training_set()" has not yet been called, the KnowledgeSet will not
        yet be populated with any training data.

    run_experiment()
        Runs a complete experiment on the training and testing data, reporting
        the results on "STDOUT". Internally, this is just a shortcut for calling
        the "scan_features()", "read_training_set()", "train()", and
        "evaluate_test_set()" methods, then printing the value of the
        "stats_table()" method.

    scan_features()
        Scans the Collection specified in the "training_set" parameter to
        determine the set of features (words) that will be considered when
        training the Learner. Internally, this calls the "scan_features()"
        method of the KnowledgeSet, then saves a list of the KnowledgeSet's
        features for later use.

        This step is not strictly necessary, but it can dramatically reduce
        memory requirements if you scan for features before reading the entire
        corpus into memory.

    read_training_set()
        Populates the KnowledgeSet with the data specified in the
        "training_set" parameter. Internally, this calls the "read()" method
        of the KnowledgeSet. Returns the KnowledgeSet. Also saves the
        KnowledgeSet object for later use.

    train()
        Calls the Learner's "train()" method, passing it the KnowledgeSet
        created during "read_training_set()". Returns the Learner object. Also
        saves the Learner object for later use.

    evaluate_test_set()
        Creates a Collection based on the value of the "test_set" parameter, and
        calls the Learner's "categorize_collection()" method using this
        Collection. Returns the resultant Experiment object. Also saves the
        Experiment object for later use in the "stats_table()" method.

    stats_table()
        Returns the value of the Experiment's (as created by
        "evaluate_test_set()") "stats_table()" method. This is a string that
        shows various statistics about the accuracy/precision/recall/F1/etc. of
        the assignments made during testing.
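
    The "data_root" shortcut described in the parameter list above can be
    sketched in a few lines of plain Perl. The expand_data_root() helper is
    a hypothetical name for this example; it follows the same idea as the
    module's constructor (derive default paths, then let explicit arguments
    win), but it is a sketch rather than the module's own code.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper mirroring the data_root shortcut: derive default
# paths from data_root, then let explicitly passed values override them.
sub expand_data_root {
    my %args = @_;
    my %defaults;
    if ( exists $args{data_root} ) {
        my $root = delete $args{data_root};
        $defaults{training_set}  = File::Spec->catfile( $root, 'training' );
        $defaults{test_set}      = File::Spec->catfile( $root, 'test' );
        $defaults{category_file} = File::Spec->catfile( $root, 'cats.txt' );
    }
    return ( %defaults, %args );    # later keys win, so explicit args override
}

my %params = expand_data_root( data_root => 'corpus', test_set => 'held_out' );
print "$_ => $params{$_}\n" for sort keys %params;
```

    In this sketch an explicit "test_set" overrides the derived default,
    while "training_set" and "category_file" still come from "data_root".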

HISTORY
    This module is a revised and redesigned version of the previous
    "AI::Categorize" module by the same author. Note the added 'r' in the new
    name. The older module has a different interface, and no attempt at backward
    compatibility has been made - that's why I changed the name.

    You can have both "AI::Categorize" and "AI::Categorizer" installed at the
    same time on the same machine, if you want. They don't know about each other
    or use conflicting namespaces.

eg/categorizer  view on Meta::CPAN

    } else {
      warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
    }
  }
}
  

run_section('scan_features',     1, $do_stage);
run_section('read_training_set', 2, $do_stage);
run_section('train',             3, $do_stage);
run_section('evaluate_test_set', 4, $do_stage);
if ($do_stage->{5}) {
  my $result = $c->stats_table;
  print $result if $c->verbose;
  print $out_fh $result if $out_fh;
}

sub run_section {
  my ($section, $stage, $do_stage) = @_;
  return unless $do_stage->{$stage};
  if (keys %$do_stage > 1) {

eg/demo.pl  view on Meta::CPAN

#!/usr/bin/perl

# This script is a fairly simple demonstration of how AI::Categorizer
# can be used.  There are lots of other less-simple demonstrations
# (actually, they're doing much simpler things, but are probably
# harder to follow) in the tests in the t/ subdirectory.  The
# eg/categorizer script can also be a good example if you're willing
# to figure out a bit how it works.
#
# This script reads a training corpus from a directory of plain-text
# documents, trains a Naive Bayes categorizer on it, then tests the
# categorizer on a set of test documents.

use strict;
use AI::Categorizer;
use AI::Categorizer::Collection::Files;
use AI::Categorizer::Learner::NaiveBayes;
use File::Spec;

die("Usage: $0 <corpus>\n".
    "  A sample corpus (data set) can be downloaded from\n".
    "     http://www.cpan.org/authors/Ken_Williams/data/reuters-21578.tar.gz\n".
    "  or http://www.limnus.com/~ken/reuters-21578.tar.gz\n")
  unless @ARGV == 1;

my $corpus = shift;

my $training  = File::Spec->catfile( $corpus, 'training' );
my $test      = File::Spec->catfile( $corpus, 'test' );
my $cats      = File::Spec->catfile( $corpus, 'cats.txt' );
my $stopwords = File::Spec->catfile( $corpus, 'stopwords' );

my %params;
if (-e $stopwords) {
  $params{stopword_file} = $stopwords;
} else {
  warn "$stopwords not found - no stopwords will be used.\n";
}

eg/demo.pl  view on Meta::CPAN

  die "$cats not found - can't proceed without category information.\n";
}


# In a real-world application these Collection objects could be of any
# type (any Collection subclass).  Or you could create each Document
# object manually.  Or you could let the KnowledgeSet create the
# Collection objects for you.

$training = AI::Categorizer::Collection::Files->new( path => $training, %params );
$test     = AI::Categorizer::Collection::Files->new( path => $test, %params );

# We turn on verbose mode so you can watch the progress of loading &
# training.  This looks nicer if you have Time::Progress installed!

print "Loading training set\n";
my $k = AI::Categorizer::KnowledgeSet->new( verbose => 1 );
$k->load( collection => $training );

print "Training categorizer\n";
my $l = AI::Categorizer::Learner::NaiveBayes->new( verbose => 1 );
$l->train( knowledge_set => $k );

print "Categorizing test set\n";
my $experiment = $l->categorize_collection( collection => $test );

print $experiment->stats_table;


# If you want to get at the specific assigned categories for a
# specific document, you can do it like this:

my $doc = AI::Categorizer::Document->new
  ( content => "Hello, I am a pretty generic document with not much to say." );

my $h = $l->categorize( $doc );

print ("For test document:\n",
       "  Best category = ", $h->best_category, "\n",
       "  All categories = ", join(', ', $h->categories), "\n");

eg/easy_guesser.pl  view on Meta::CPAN

#!/usr/bin/perl

# This script can be helpful for getting a set of baseline scores for
# a categorization task.  It simulates using the "Guesser" learner,
# but is much faster.  Because it doesn't leverage using the whole
# framework, though, it expects everything to be in a very strict
# format.  <cats-file> is in the same format as the 'category_file'
# parameter to the Collection class.  <training-dir> and <test-dir>
# give paths to directories of documents, named as in <cats-file>.

use strict;
use Statistics::Contingency;

die "Usage: $0 <cats-file> <training-dir> <test-dir>\n" unless @ARGV == 3;
my ($cats, $training, $test) = @ARGV;

die "$cats isn't a plain file\n" unless -f $cats;
die "$training isn't a directory\n" unless -d $training;
die "$test isn't a directory\n" unless -d $test;

my %cats;
print "Reading category file\n";
open my $fh, '<', $cats or die "Can't read $cats: $!";
while (<$fh>) {
    my ($doc, @cats) = split;
    $cats{$doc} = \@cats;
}

my (%freq, $docs);

eg/easy_guesser.pl  view on Meta::CPAN

    }
    $docs++;
    $freq{$_}++ foreach @{$cats{$file}};
}
closedir $dh;

print "Calculating probabilities (@{[ %freq ]})\n";
$_ /= $docs foreach values %freq;
my @cats = keys %freq;

print "Scoring test documents\n";
my $c = Statistics::Contingency->new(categories => \@cats);
opendir $dh, $test or die "Can't opendir $test: $!";
while (defined(my $file = readdir $dh)) {
    next if $file eq '.' or $file eq '..';
    unless ($cats{$file}) {
	warn "No category information for '$file'";
	next;
    }
    my @assigned;
    foreach (@cats) {
	push @assigned, $_ if rand() < $freq{$_};
    }

lib/AI/Categorizer.pm  view on Meta::CPAN

use AI::Categorizer::KnowledgeSet;


__PACKAGE__->valid_params
  (
   progress_file => { type => SCALAR, default => 'save' },
   knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
   learner       => { isa => 'AI::Categorizer::Learner' },
   verbose       => { type => BOOLEAN, default => 0 },
   training_set  => { type => SCALAR, optional => 1 },
   test_set      => { type => SCALAR, optional => 1 },
   data_root     => { type => SCALAR, optional => 1 },
  );

__PACKAGE__->contained_objects
  (
   knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
   learner       => { class => 'AI::Categorizer::Learner::NaiveBayes' },
   experiment    => { class => 'AI::Categorizer::Experiment',
		      delayed => 1 },
   collection    => { class => 'AI::Categorizer::Collection::Files',
		      delayed => 1 },
  );

sub new {
  my $package = shift;
  my %args = @_;
  my %defaults;
  if (exists $args{data_root}) {
    $defaults{training_set} = File::Spec->catfile($args{data_root}, 'training');
    $defaults{test_set} = File::Spec->catfile($args{data_root}, 'test');
    $defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
    delete $args{data_root};
  }

  return $package->SUPER::new(%defaults, %args);
}

#sub dump_parameters {
#  my $p = shift()->SUPER::dump_parameters;
#  delete $p->{stopwords} if $p->{stopword_file};

lib/AI/Categorizer.pm  view on Meta::CPAN


sub knowledge_set { shift->{knowledge_set} }
sub learner       { shift->{learner} }

# Combines several methods in one sub
sub run_experiment {
  my $self = shift;
  $self->scan_features;
  $self->read_training_set;
  $self->train;
  $self->evaluate_test_set;
  print $self->stats_table;
}

sub scan_features {
  my $self = shift;
  return unless $self->knowledge_set->scan_first;
  $self->knowledge_set->scan_features( path => $self->{training_set} );
  $self->knowledge_set->save_features( "$self->{progress_file}-01-features" );
}

lib/AI/Categorizer.pm  view on Meta::CPAN

}

sub train {
  my $self = shift;
  $self->_load_progress( '02', 'knowledge_set' );
  $self->learner->train( knowledge_set => $self->{knowledge_set} );
  $self->_save_progress( '03', 'learner' );
  return $self->learner;
}

sub evaluate_test_set {
  my $self = shift;
  $self->_load_progress( '03', 'learner' );
  my $c = $self->create_delayed_object('collection', path => $self->{test_set} );
  $self->{experiment} = $self->learner->categorize_collection( collection => $c );
  $self->_save_progress( '04', 'experiment' );
  return $self->{experiment};
}

sub stats_table {
  my $self = shift;
  $self->_load_progress( '04', 'experiment' );
  return $self->{experiment}->stats_table;
}

lib/AI/Categorizer.pm  view on Meta::CPAN


=head1 NAME

AI::Categorizer - Automatic Text Categorization

=head1 SYNOPSIS

 use AI::Categorizer;
 my $c = new AI::Categorizer(...parameters...);
 
 # Run a complete experiment - training on a corpus, testing on a test
 # set, printing a summary of results to STDOUT
 $c->run_experiment;
 
 # Or, run the parts of $c->run_experiment separately
 $c->scan_features;
 $c->read_training_set;
 $c->train;
 $c->evaluate_test_set;
 print $c->stats_table;
 
 # After training, use the Learner for categorization
 my $l = $c->learner;
 while (...) {
   my $d = ...create a document...
   my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
   print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
   print "Best category: ", $hypothesis->best_category, "\n";
 }

lib/AI/Categorizer.pm  view on Meta::CPAN

The basic process of using this module will typically involve
obtaining a collection of B<pre-categorized> documents, creating a
"knowledge set" representation of those documents, training a
categorizer on that knowledge set, and saving the trained categorizer
for later use.  There are several ways to carry out this process.  The
top-level C<AI::Categorizer> module provides an umbrella class for
high-level operations, or you may use the interfaces of the individual
classes in the framework.

A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus, is
distributed as eg/demo.pl .

Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?).  Categorization of documents
is often a difficult task even for humans well-trained in the
particular domain of knowledge, and there are many things a human
would consider that none of these algorithms consider.  These are only
statistical tests - at best they are neat tricks or helpful
assistants, and at worst they are totally unreliable.  If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.

For the usage details, please see the documentation of each individual
module.

=head1 FRAMEWORK COMPONENTS

This section explains the major pieces of the C<AI::Categorizer>

lib/AI/Categorizer.pm  view on Meta::CPAN

set is called "feature selection".  It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.

=head2 Collections

Because documents may be stored in lots of different formats, a
"collection" class has been created as an abstraction of a stored set
of documents, together with a way to iterate through the set and
return Document objects.  A knowledge set contains a single collection
object.  A C<Categorizer> doing a complete test run generally contains
two collections, one for training and one for testing.  A C<Learner>
can mass-categorize a collection.

The C<AI::Categorizer::Collection> class and its subclasses
instantiate the idea of a collection in this sense.

=head2 Documents

Each document is represented by an C<AI::Categorizer::Document>
object, or an object of one of its subclasses.  Each document class
contains methods for turning a bunch of data into a Feature Vector.

lib/AI/Categorizer.pm  view on Meta::CPAN

=item verbose

If true, a few status messages will be printed during execution.

=item training_set

Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.

=item test_set

Specifies the C<path> parameter that will be used when creating a
Collection during the C<evaluate_test_set()> method.

=item data_root

A shortcut for setting the C<training_set>, C<test_set>, and
C<category_file> parameters separately.  Sets C<training_set> to
C<$data_root/training>, C<test_set> to C<$data_root/test>, and
C<category_file> (used by some of the Collection classes) to
C<$data_root/cats.txt>.

=back

=item learner()

Returns the Learner object associated with this Categorizer.  Before
C<train()>, the Learner will of course not be trained yet.

=item knowledge_set()

Returns the KnowledgeSet object associated with this Categorizer.  If
C<read_training_set()> has not yet been called, the KnowledgeSet will
not yet be populated with any training data.

=item run_experiment()

Runs a complete experiment on the training and testing data, reporting
the results on C<STDOUT>.  Internally, this is just a shortcut for
calling the C<scan_features()>, C<read_training_set()>, C<train()>,
and C<evaluate_test_set()> methods, then printing the value of the
C<stats_table()> method.

=item scan_features()

Scans the Collection specified in the C<training_set> parameter to
determine the set of features (words) that will be considered when
training the Learner.  Internally, this calls the C<scan_features()>
method of the KnowledgeSet, then saves a list of the KnowledgeSet's
features for later use.

This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.

=item read_training_set()

Populates the KnowledgeSet with the data specified in the C<training_set>
parameter.  Internally, this calls the C<read()> method of the
KnowledgeSet.  Returns the KnowledgeSet.  Also saves the KnowledgeSet
object for later use.

=item train()

Calls the Learner's C<train()> method, passing it the KnowledgeSet
created during C<read_training_set()>.  Returns the Learner object.
Also saves the Learner object for later use.

=item evaluate_test_set()

Creates a Collection based on the value of the C<test_set> parameter,
and calls the Learner's C<categorize_collection()> method using this
Collection.  Returns the resultant Experiment object.  Also saves the
Experiment object for later use in the C<stats_table()> method.

=item stats_table()

Returns the value of the Experiment's (as created by
C<evaluate_test_set()>) C<stats_table()> method.  This is a string
that shows various statistics about the
accuracy/precision/recall/F1/etc. of the assignments made during
testing.

=back

=head1 HISTORY

This module is a revised and redesigned version of the previous
C<AI::Categorize> module by the same author.  Note the added 'r' in
the new name.  The older module has a different interface, and no
attempt at backward compatibility has been made - that's why I changed
the name.

lib/AI/Categorizer/Document.pm  view on Meta::CPAN

documents are plain text, but subclasses of the Document class may
handle any kind of data.

=head1 METHODS

=over 4

=item new(%parameters)

Creates a new Document object.  Document objects are used during
training (for the training documents), testing (for the test
documents), and when categorizing new unseen documents in an
application (for the unseen documents).  However, you'll typically
only call C<new()> in the latter case, since the KnowledgeSet or
Collection classes will create Document objects for you in the former
cases.

The C<new()> method accepts the following parameters:

=over 4

lib/AI/Categorizer/Hypothesis.pm  view on Meta::CPAN

=head1 METHODS

=over 4

=item new(%parameters)

Returns a new Hypothesis object.  Generally a user of
C<AI::Categorizer> doesn't create a Hypothesis object directly - they
are returned by the Learner's C<categorize()> method.  However, if you
wish to create a Hypothesis directly (maybe passing it some fake data
for testing purposes) you may do so using the C<new()> method.

The following parameters are accepted when creating a new Hypothesis:

=over 4

=item all_categories

A required parameter which gives the set of all categories that could
possibly be assigned to.  The categories should be specified as a
reference to an array of category names (as strings).

lib/AI/Categorizer/Learner/Guesser.pm  view on Meta::CPAN

    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
  }

=head1 DESCRIPTION

This implements a simple category guesser that makes assignments based
solely on the prior probabilities of categories.  For instance, if 5%
of the training documents belong to a certain category, then the
probability of any test document being assigned to that category is
0.05.  This can be useful for providing baseline scores to compare
with other more sophisticated algorithms.

See L<AI::Categorizer> for a complete description of the interface.
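
The prior-probability idea described above can be sketched in plain Perl
without the framework.  The C<category_priors()> and C<guess_categories()>
functions below are hypothetical names for this illustration, not part of
the Guesser class, and the real learner may differ in detail.

```perl
use strict;
use warnings;

# Estimate each category's prior: the fraction of training documents in it.
sub category_priors {
    my ($doc_categories) = @_;    # { doc_name => [ category names ] }
    my $total = keys %$doc_categories;
    my %prior;
    for my $cats ( values %$doc_categories ) {
        $prior{$_}++ for @$cats;
    }
    $_ /= $total for values %prior;    # counts become probabilities
    return \%prior;
}

# Assign each category independently with probability equal to its prior.
sub guess_categories {
    my ($prior) = @_;
    return grep { rand() < $prior->{$_} } sort keys %$prior;
}

my $prior = category_priors( {
    doc1 => ['farming'],
    doc2 => [ 'farming', 'finance' ],
    doc3 => [],
    doc4 => ['vampires'],
} );
printf "%-8s %.2f\n", $_, $prior->{$_} for sort keys %$prior;
print "Guess: ", join( ', ', guess_categories($prior) ), "\n";
```

Because the guesses ignore document content entirely, scores obtained this
way give a floor that any content-aware learner should beat.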

=head1 METHODS

This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available.

lib/AI/Categorizer/Learner/Weka.pm  view on Meta::CPAN

}

# java -classpath /Applications/Science/weka-3-2-3/weka.jar weka.classifiers.NaiveBayes -t /tmp/train_file.arff -d /tmp/weka-machine

sub create_model {
  my ($self) = shift;
  my $m = $self->{model} ||= {};
  $m->{all_features} = [ $self->knowledge_set->features->names ];
  $m->{_in_dir} = File::Temp::tempdir( DIR => $self->{tmpdir} );

  # Create a dummy test file $dummy_file in ARFF format (a kludgey WEKA requirement)
  my $dummy_features = $self->create_delayed_object('features');
  $m->{dummy_file} = $self->create_arff_file("dummy", [[$dummy_features, 0]]);

  $self->SUPER::create_model(@_);
}

sub create_boolean_model {
  my ($self, $pos, $neg, $cat) = @_;

  my @docs = (map([$_->features, 1], @$pos),

lib/AI/Categorizer/Learner/Weka.pm  view on Meta::CPAN

	      '-d', $outfile,
	      '-v',
	      '-p', '0',
	     );
  $self->do_cmd(@args);
  unlink $train_file or warn "Couldn't remove $train_file: $!";

  return \%info;
}

# java -classpath /Applications/Science/weka-3-2-3/weka.jar weka.classifiers.NaiveBayes -l out -T test.arff -p 0

sub get_boolean_score {
  my ($self, $doc, $info) = @_;
  
  # Create document file
  my $doc_file = $self->create_arff_file('doc', [[$doc->features, 0]], $self->{tmpdir});
  my $machine_file = File::Spec->catfile($self->{model}{_in_dir}, $info->{machine_file});

  my @args = ($self->{java_path},
	      @{$self->{java_args}},

t/01-naive_bayes.t  view on Meta::CPAN

#!/usr/bin/perl -w

# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'

#########################

use strict;
use Test;
BEGIN { 
  require 't/common.pl';
  need_module('Algorithm::NaiveBayes');
  plan tests => 15 + num_standard_tests();
}

ok(1);

#########################

perform_standard_tests(learner_class => 'AI::Categorizer::Learner::NaiveBayes');

#use Carp; $SIG{__DIE__} = \&Carp::confess;

my %docs = training_docs();

{
  ok my $c = new AI::Categorizer(collection_weighting => 'f');
  
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);

t/01-naive_bayes.t  view on Meta::CPAN

  for ('vampires', 'mirrors') {
    ok ($c->knowledge_set->document('doc4')->features->as_hash->{$_},
	log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
       );
  }

  $c->learner->train( knowledge_set => $c->knowledge_set );
  ok $c->learner;
  
  my $doc = new AI::Categorizer::Document
    ( name => 'test1',
      content => 'I would like to begin farming sheep.' );
  ok $c->learner->categorize($doc)->best_category, 'farming';
}

{
  ok my $c = new AI::Categorizer(term_weighting => 'b');
  
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }

t/02-experiment.t  view on Meta::CPAN

#!/usr/bin/perl -w

# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'

#########################

use strict;
use Test;
BEGIN { plan tests => 14 };

use AI::Categorizer;
use AI::Categorizer::Experiment;

ok(1);

my $all_categories = [qw(sports politics finance world)];

{
  my $e = new AI::Categorizer::Experiment(categories => $all_categories);

t/03-weka.t  view on Meta::CPAN

#!/usr/bin/perl -w

# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'

#########################

use strict;
use Test;
use Module::Build;

my $classpath = Module::Build->current->notes('classpath');

require 't/common.pl';
skip_test("Weka is not installed") unless defined $classpath;

plan tests => 1 + num_standard_tests();


ok(1);

#########################

my @args;
push @args, weka_path => $classpath
  unless $classpath eq '-';

perform_standard_tests(
		       learner_class => 'AI::Categorizer::Learner::Weka',
		       weka_classifier => 'weka.classifiers.functions.SMO',
                                     # or 'weka.classifiers.SMO' for older Weka versions
		       @args,
		      );

t/04-decision_tree.t  view on Meta::CPAN

#!/usr/bin/perl -w

# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'

#########################

use strict;
use Test;
BEGIN {
  require 't/common.pl';
  need_module('AI::DecisionTree 0.06');
  plan tests => 1 + num_standard_tests();
}

ok(1);

#########################

perform_standard_tests(learner_class => 'AI::Categorizer::Learner::DecisionTree');

t/05-svm.t

#!/usr/bin/perl -w

# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'

#########################

use strict;
use Test;
BEGIN {
  require 't/common.pl';
  need_module('Algorithm::SVM');
  plan tests => 1 + num_standard_tests();
}

ok(1);

#########################

perform_standard_tests(learner_class => 'AI::Categorizer::Learner::SVM');

t/06-knn.t

#!/usr/bin/perl -w

# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'

#########################

use strict;
use Test;
BEGIN {
  require 't/common.pl';
  plan tests => 5 + 2 * num_standard_tests();
}

ok(1);

#########################

# There are only 4 test documents, so use k=2
perform_standard_tests(learner_class => 'AI::Categorizer::Learner::KNN', k_value => 2);
perform_standard_tests(learner_class => 'AI::Categorizer::Learner::KNN', k_value => 2, knn_weighting => 'uniform');

my $q = AI::Categorizer::Learner::KNN::Queue->new(size => 3);

$q->add(five => 5);
$q->add(four => 4);
$q->add(one => 1);
$q->add(ten => 10);
$q->add(three => 3);
$q->add(eleven => 11);

t/07-guesser.t

#!/usr/bin/perl -w

#########################

use strict;
use Test;
BEGIN {
  require 't/common.pl';
  plan tests => 1 + num_setup_tests();
}

ok(1);

#########################

my ($learner, $docs) = set_up_tests(learner_class => 'AI::Categorizer::Learner::Guesser');

t/09-rocchio.t

#!/usr/bin/perl -w

# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'

#########################

use strict;
use Test;
BEGIN {
  require 't/common.pl';
  plan tests => 1 + num_standard_tests();
}

ok(1);

#########################

perform_standard_tests(learner_class => 'AI::Categorizer::Learner::Rocchio');

t/10-tools.t

#!/usr/bin/perl -w

# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'

#########################

use strict;
use Test;
BEGIN { 
  plan tests => 10;
};

use AI::Categorizer::Util qw(random_elements binary_search);
ok(1);

# Test random_elements()
my @x = ('a'..'j');
my @y = random_elements(\@x, 3);
ok @y, 3;
ok $y[0] =~ /^[a-j]$/;

t/11-feature_vector.t

#!/usr/bin/perl -w

# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'

#########################

use strict;
use Test;
BEGIN { 
  plan tests => 18;
}

use AI::Categorizer::FeatureVector;
ok(1);

my $f1 = new AI::Categorizer::FeatureVector(features => {sports => 2, finance => 3});
ok $f1;
ok $f1->includes('sports');
ok $f1->value('sports'), 2;

t/12-hypothesis.t

#!/usr/bin/perl -w

use strict;
use Test;
BEGIN { 
  plan tests => 8;
};

use AI::Categorizer::Hypothesis;
ok(1);

my @cats = ('a'..'z', 'foo', 'bar');

my $h = new AI::Categorizer::Hypothesis
  (
   all_categories => \@cats,

t/13-document.t

#!/usr/bin/perl -w

use strict;
use Test;
BEGIN { plan tests => 27, todo => [] };

use AI::Categorizer;
use AI::Categorizer::Document;
use AI::Categorizer::FeatureVector;

ok(1);
my $docclass = 'AI::Categorizer::Document';

# Test empty document creation
{

t/13-document.t

  ok $d->features->value('one'), 1;
  ok $d->features->value('two'), 2;
  ok $d->features->includes('foo'), '';
}
  

# Test some stemming & stopword stuff.
{
  my $d = $docclass->new
    (
     name => 'test',
     stopwords => ['stemmed'],
     stemming => 'porter',
     content  => 'stopword processing should happen after stemming',
     # Becomes qw(stopword process    should happen after stem    )
    );
  ok $d->stopword_behavior, 'stem', "stopword_behavior() is 'stem'";
  
  ok $d->features->includes('stopword'), 1,  "Should include 'stopword'";
  ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
  ok $d->features->includes('stem'),     '', "Shouldn't include 'stem'";
  print "Features: @{[ $d->features->names ]}\n";
}

{
  my $d = $docclass->new
    (
     name => 'test',
     stopwords => ['stemmed'],
     stemming => 'porter',
     stopword_behavior => 'no_stem',
     content  => 'stopword processing should happen after stemming',
     # Becomes qw(stopword process    should happen after stem    )
    );
  ok $d->stopword_behavior, 'no_stem', "stopword_behavior() is 'no_stem'";
  
  ok $d->features->includes('stopword'), 1,  "Should include 'stopword'";
  ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
  ok $d->features->includes('stem'),     1,  "Should include 'stem'";
  print "Features: @{[ $d->features->names ]}\n";
}

{
  my $d = $docclass->new
    (
     name => 'test',
     stopwords => ['stem'],
     stemming => 'porter',
     stopword_behavior => 'pre_stemmed',
     content  => 'stopword processing should happen after stemming',
     # Becomes qw(stopword process    should happen after stem    )
    );
  ok $d->stopword_behavior, 'pre_stemmed', "stopword_behavior() is 'pre_stemmed'";
  
  ok $d->features->includes('stopword'), 1,  "Should include 'stopword'";
  ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
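
The three `stopword_behavior` settings exercised in this excerpt differ in whether the stopword list and the document words are stemmed before they are compared. The sketch below is an illustration only: `toy_stem` is a lookup table standing in for the Porter stemmer, and `features_for` is a hypothetical helper written to reproduce the expectations visible in the assertions above. Neither is part of AI::Categorizer, and the reading of 'pre_stemmed' (stopwords used as given, compared against stemmed words) is an assumption based on the parameter name and the visible tests.

```perl
use strict;
use warnings;

# Stand-in for a real stemmer (assumption): just enough entries for this example.
my %STEM = (processing => 'process', stemming => 'stem', stemmed => 'stem');
sub toy_stem { my $w = shift; return exists $STEM{$w} ? $STEM{$w} : $w }

# Hypothetical helper: returns the features that survive for one behavior.
sub features_for {
  my ($behavior, $stopwords, @words) = @_;
  my %stop = map { $_ => 1 } @$stopwords;

  if ($behavior eq 'no_stem') {
    # Compare raw stopwords against raw words, then stem the survivors.
    return map { toy_stem($_) } grep { !$stop{$_} } @words;
  }
  if ($behavior eq 'stem') {
    # Stem the stopword list before comparing.
    %stop = map { (toy_stem($_) => 1) } @$stopwords;
  }
  # 'stem' and 'pre_stemmed': compare against the stemmed words.
  return grep { !$stop{$_} } map { toy_stem($_) } @words;
}

my @words = qw(stopword processing should happen after stemming);
print "stem:        @{[ features_for('stem',        ['stemmed'], @words) ]}\n";
print "no_stem:     @{[ features_for('no_stem',     ['stemmed'], @words) ]}\n";
print "pre_stemmed: @{[ features_for('pre_stemmed', ['stem'],    @words) ]}\n";
```

In this sketch only the 'no_stem' run keeps the feature 'stem', because the raw stopword 'stemmed' never matches the raw word 'stemming'. The other two runs drop it.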

t/14-collection.t

#!/usr/bin/perl -w

use strict;
use Test;
BEGIN { plan tests => 13 };

use AI::Categorizer;
use File::Spec;
require File::Spec->catfile('t', 'common.pl');

ok 1;  # Loaded

# Test InMemory collection
use AI::Categorizer::Collection::InMemory;
my $c = AI::Categorizer::Collection::InMemory->new(data => {training_docs()});

t/14-collection.t

  category_hash => {
    doc1 => ['farming'],
    doc2 => ['farming'],
    doc3 => ['vampire'],
    doc4 => ['vampire'],
  },
);
ok $c;
exercise_collection($c, 4);

# 5 tests here
sub exercise_collection {
  my ($c, $num_docs) = @_;
  
  my $d = $c->next;
  ok $d;
  ok $d->isa('AI::Categorizer::Document');
  
  $c->rewind;
  my $d2 = $c->next;
  ok $d2->name, $d->name, "Make sure we get the same document after a rewind";

t/15-knowledge_set.t

#!/usr/bin/perl -w

use strict;
use Test;
BEGIN { plan tests => 5 };

use AI::Categorizer;
ok 1; # Loaded

my $k = AI::Categorizer::KnowledgeSet->new();
ok $k;

my $c1 = AI::Categorizer::Category->by_name(name => 'one');
my $c2 = AI::Categorizer::Category->by_name(name => 'two');
ok $c1;

t/common.pl

use AI::Categorizer::KnowledgeSet;
use AI::Categorizer::Collection::InMemory;

sub have_module {
  my $module = shift;
  return eval "use $module; 1";
}

sub need_module {
  my $module = shift;
  skip_test("$module not installed") unless have_module($module);
}

sub skip_test {
  my $msg = @_ ? shift() : '';
  print "1..0 # Skipped: $msg\n";
  exit;
}

sub training_docs {
  return (
    doc1 => {categories => ['farming'],
             content => 'Sheep are very valuable in farming.' },
    doc2 => {categories => ['farming'],
             content => 'Farming requires many kinds of animals.' },
    doc3 => {categories => ['vampire'],
             content => 'Vampires drink blood and vampires may be staked.' },
    doc4 => {categories => ['vampire'],
             content => 'Vampires cannot see their images in mirrors.'},
  );
}

sub run_test_docs {
  my $l = shift;

  my $doc = new AI::Categorizer::Document
    ( name => 'test1',
      content => 'I would like to begin farming sheep.' );
  my $r = $l->categorize($doc);
  
  print "Categories: ", join(', ', $r->categories), "\n";
  ok($r->best_category, 'farming', "Best category is 'farming'");
  ok $r->in_category('farming'),  1, sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('farming'));
  ok $r->in_category('vampire'), '', sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('vampire'));
  
  ok $r->all_categories, 2, "Should be 2 categories in total";
  
  $doc = new AI::Categorizer::Document
    ( name => 'test2',
      content => "I see that many vampires may have eaten my beautiful daughter's blood." );
  $r = $l->categorize($doc);
  
  print "Categories: ", join(', ', $r->categories), "\n";
  ok($r->best_category, 'vampire', "Best category is 'vampire'");
  ok $r->in_category('farming'), '', sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('farming'));
  ok $r->in_category('vampire'),  1, sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('vampire'));
}

sub set_up_tests {
  my %params = @_;
  my $c = new AI::Categorizer(
    knowledge_set => AI::Categorizer::KnowledgeSet->new(
      name => 'Vampires/Farmers',
      stopwords => [qw(are be in of and)],
    ),
    verbose => $ENV{TEST_VERBOSE} ? 1 : 0,
    %params,
  );

t/common.pl

  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }

  my $l = $c->learner;
  ok $l;
  
  if ($params{learner_class}) {
    ok ref($l), $params{learner_class}, "Make sure the correct Learner class is instantiated";
  } else {
    ok 1, 1, "Dummy test";
  }

  $l->train;
  return ($l, \%docs);
}

sub perform_standard_tests {
  my ($l, $docs) = set_up_tests(@_);
  
  run_test_docs($l);

  # Make sure we can save state & restore state
  $l->save_state('t/state');
  $l = $l->restore_state('t/state');
  ok $l;

  run_test_docs($l);

  my $train_collection = AI::Categorizer::Collection::InMemory->new(data => $docs);
  ok $train_collection;
  
  my $h = $l->categorize_collection(collection => $train_collection);
  ok $h->micro_precision > 0.5;
}

sub num_setup_tests    () { 3 }
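# 17 = two run_test_docs() passes (7 ok() calls each) plus the 3 ok() calls
# made directly in perform_standard_tests().
sub num_standard_tests () { num_setup_tests + 17 }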

1;
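
The `have_module`, `need_module` and `skip_test` helpers above implement the usual skip-the-whole-file pattern: when an optional dependency is missing, the script prints a `1..0` plan with a skip reason and exits before running any tests. The following is a minimal standalone sketch of that decision. `plan_line` is a hypothetical helper that returns the plan line instead of printing it and exiting, and `No::Such::Module::Hopefully` is a made-up name assumed not to be installed.

```perl
use strict;
use warnings;

# Same probe as in t/common.pl: true if the module (optionally with a
# minimum version, e.g. 'AI::DecisionTree 0.06') can be loaded.
sub have_module {
  my $module = shift;
  return eval "use $module; 1";
}

# Hypothetical helper: the plan line a test file would print first.
sub plan_line {
  my ($module, $num_tests) = @_;
  return have_module($module)
    ? "1..$num_tests\n"
    : "1..0 # Skipped: $module not installed\n";
}

print plan_line('File::Spec', 3);                   # core module, so a normal plan
print plan_line('No::Such::Module::Hopefully', 3);  # missing, so a skip plan
```

Returning the line rather than calling `exit` keeps the sketch easy to check in isolation. The real helper exits immediately so that no `ok` calls run after a skip plan has been printed.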


