Revision history for Perl extension AI::Categorizer.
- The t/01-naive_bayes.t test was failing (instead of skipping) when
Algorithm::NaiveBayes wasn't installed. Now it skips.
0.08 - Tue Mar 20 19:39:41 2007
- Added a ChiSquared feature selection class. [Francois Paradis]
- Changed the web locations of the reuters-21578 corpus that
eg/demo.pl uses, since the location it referenced previously has
gone away.
- The building & installing process now uses Module::Build rather
than ExtUtils::MakeMaker.
- When the features_kept mechanism was used to explicitly state the
features to use, and the scan_first parameter was left as its
default value, the features_kept mechanism would silently fail to
do anything. This has now been fixed. [Spotted by Arnaud Gaudinat]
- Recent versions of Weka have changed the name of the SVM class, so
I've updated it in our test (t/03-weka.t) of the Weka wrapper
too. [Sebastien Aperghis-Tramoni]
0.07 Tue May 6 16:15:04 CDT 2003
- Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
MANIFEST, so they weren't included in the 0.06 distribution.
[Spotted by Zoltan Barta]
0.06 Tue Apr 22 10:27:26 CDT 2003
- Added a "Guesser" machine learner which simply uses overall class
probabilities to make categorization decisions. Sometimes useful
for providing a set of baseline scores against which to evaluate
other machine learners.
- The NaiveBayes learner is now a wrapper around my new
Algorithm::NaiveBayes module, which is just the old NaiveBayes code
from here, turned into its own standalone module.
- Much more extensive regression testing of the code.
- Added a Document subclass for XML documents. [Implemented by
Jae-Moon Lee] Its interface is still unstable, it may change in
later releases.
- Added a 'Build.PL' file for an alternate installation method using
Module::Build.
- Fixed a problem in the Hypothesis' best_category() method that
would often result in the wrong category being reported. Added a
regression test to exercise the Hypothesis class. [Spotted by
Xiaobo Li]
- The 'categorizer' script now records more useful benchmarking
information about time & memory in its outfile.
- The AI::Categorizer->dump_parameters() method now tries to avoid
showing you its entire list of stopwords.
- Document objects now use a default 'name' if none is supplied.
- Removed F1(), precision(), recall(), etc. from Util package since
they're in Statistics::Contingency. Added random_elements() to
Util.
- Collection::Files now warns when no category information is known
about a document in the collection (knowing it's in zero categories
is okay).
- Added the Collection::InMemory class
- Much more thorough testing with 'make test'.
- Added add_hypothesis() method to Experiment.
- Added dot() and value() methods to FeatureVector.
- Added 'feature_selection' parameter to KnowledgeSet.
- Added document($name) accessor method to KnowledgeSet.
- In KnowledgeSet, load(), read(), and scan_*() can now accept a
Installation instructions for AI::Categorizer
To install this module, follow the standard steps for installing most
Perl modules:
perl Makefile.PL
make
make test
make install
Or you may use the CPAN.pm module, which will automatically execute
these steps for you, and help you get the prerequisite dependencies
installed as well.
Alternatively, you can use the new Module::Build-style installer:
perl Build.PL
./Build
./Build test
./Build install
-Ken
NAME
AI::Categorizer - Automatic Text Categorization
SYNOPSIS
use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);
# Run a complete experiment - training on a corpus, testing on a test
# set, printing a summary of results to STDOUT
$c->run_experiment;
# Or, run the parts of $c->run_experiment separately
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
The basic process of using this module will typically involve obtaining a
collection of pre-categorized documents, creating a "knowledge set"
representation of those documents, training a categorizer on that knowledge
set, and saving the trained categorizer for later use. There are several
ways to carry out this process. The top-level "AI::Categorizer" module
provides an umbrella class for high-level operations, or you may use the
interfaces of the individual classes in the framework.
A simple sample script that reads a training corpus, trains a categorizer,
and tests the categorizer on a test corpus, is distributed as eg/demo.pl .
Disclaimer: the results of any of the machine learning algorithms are far
from infallible (close to fallible?). Categorization of documents is often a
difficult task even for humans well-trained in the particular domain of
knowledge, and there are many things a human would consider that none of
these algorithms consider. These are only statistical tests - at best they
are neat tricks or helpful assistants, and at worst they are totally
unreliable. If you plan to use this module for anything really important,
human supervision is essential, both of the categorization process and the
final results.
For the usage details, please see the documentation of each individual
module.
FRAMEWORK COMPONENTS
This section explains the major pieces of the "AI::Categorizer" object
framework. The process of choosing which features (words) to use in the
knowledge set is called "feature selection". It is managed by the
"AI::Categorizer::KnowledgeSet" class, and you will find the details of
feature selection processes in that class's documentation.
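The framework's own selection methods live in the KnowledgeSet class; as a framework-independent illustration, one common approach - keeping only the terms with the highest document frequency - can be sketched in plain Perl (the documents and the cutoff here are hypothetical):

```perl
use strict;
use warnings;

# Sketch: document-frequency-based feature selection over token lists.
my @docs = (
    [qw(stock market stock gains)],
    [qw(match goal goal referee)],
    [qw(stock merger market)],
);

my %df;
for my $doc (@docs) {
    my %seen;
    # Count each term once per document (document frequency, not term frequency).
    $df{$_}++ for grep { !$seen{$_}++ } @$doc;
}

my $k = 2;    # keep the k most widespread terms
my @kept = (sort { $df{$b} <=> $df{$a} or $a cmp $b } keys %df)[0 .. $k - 1];
print "Kept features: @kept\n";
```

With these toy documents, "market" and "stock" each appear in two documents and survive the cutoff, while the single-document terms are dropped.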
Collections
Because documents may be stored in lots of different formats, a "collection"
class has been created as an abstraction of a stored set of documents,
together with a way to iterate through the set and return Document objects.
A knowledge set contains a single collection object. A "Categorizer" doing a
complete test run generally contains two collections, one for training and
one for testing. A "Learner" can mass-categorize a collection.
The "AI::Categorizer::Collection" class and its subclasses instantiate the
idea of a collection in this sense.
Documents
Each document is represented by an "AI::Categorizer::Document" object, or an
object of one of its subclasses. Each document class contains methods for
turning a bunch of data into a Feature Vector. Each document also has a
method to report which categories it belongs to.
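As a sketch of that first step - turning raw text into a term => count hash, roughly what a Document class does before building a FeatureVector (the helper name here is hypothetical, not the module's actual API):

```perl
use strict;
use warnings;

# Sketch: lowercase the text and count word occurrences.
sub text_to_features {
    my ($text) = @_;
    my %features;
    $features{lc $1}++ while $text =~ /(\w+)/g;
    return \%features;
}

my $f = text_to_features('Hello hello, feature vector');
print "$_=$f->{$_}\n" for sort keys %$f;
```

A real Document class would additionally apply stopword removal and optional stemming, as described later in this documentation.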
verbose
If true, a few status messages will be printed during execution.
training_set
Specifies the "path" parameter that will be fed to the
KnowledgeSet's "scan_features()" and "read()" methods during our
"scan_features()" and "read_training_set()" methods.
test_set
Specifies the "path" parameter that will be used when creating a
Collection during the "evaluate_test_set()" method.
data_root
A shortcut for setting the "training_set", "test_set", and
"category_file" parameters separately. Sets "training_set" to
"$data_root/training", "test_set" to "$data_root/test", and
"category_file" (used by some of the Collection classes) to
"$data_root/cats.txt".
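For illustration, a minimal sketch of what the "data_root" shortcut expands to, using only the core File::Spec module (this mirrors the constructor defaults shown later in lib/AI/Categorizer.pm):

```perl
use strict;
use warnings;
use File::Spec;

# Sketch: the three paths derived from a single data_root value.
my $data_root = '/corpus';
my %defaults = (
    training_set  => File::Spec->catfile($data_root, 'training'),
    test_set      => File::Spec->catfile($data_root, 'test'),
    category_file => File::Spec->catfile($data_root, 'cats.txt'),
);
print "$_ => $defaults{$_}\n" for sort keys %defaults;
```

Passing any of the three parameters explicitly alongside "data_root" overrides the corresponding derived default.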
learner()
Returns the Learner object associated with this Categorizer. Before
"train()", the Learner will of course not be trained yet.
knowledge_set()
Returns the KnowledgeSet object associated with this Categorizer. If
"read_training_set()" has not yet been called, the KnowledgeSet will not
yet be populated with any training data.
run_experiment()
Runs a complete experiment on the training and testing data, reporting
the results on "STDOUT". Internally, this is just a shortcut for calling
the "scan_features()", "read_training_set()", "train()", and
"evaluate_test_set()" methods, then printing the value of the
"stats_table()" method.
scan_features()
Scans the Collection specified in the "training_set" parameter to determine
the set of features (words) that will be considered when training the
Learner. Internally, this calls the "scan_features()" method of the
KnowledgeSet, then saves a list of the KnowledgeSet's features for later
use.
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
read_training_set()
Populates the KnowledgeSet with the data specified in the "training_set"
parameter. Internally, this calls the "read()" method of the
KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
object for later use.
train()
Calls the Learner's "train()" method, passing it the KnowledgeSet
created during "read_training_set()". Returns the Learner object. Also
saves the Learner object for later use.
evaluate_test_set()
Creates a Collection based on the value of the "test_set" parameter, and
calls the Learner's "categorize_collection()" method using this
Collection. Returns the resultant Experiment object. Also saves the
Experiment object for later use in the "stats_table()" method.
stats_table()
Returns the value of the Experiment's (as created by
"evaluate_test_set()") "stats_table()" method. This is a string that
shows various statistics about the accuracy/precision/recall/F1/etc. of
the assignments made during testing.
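Those statistics can be sketched directly from raw assignment counts. Assuming a = correct assignments, b = incorrect assignments (false positives), and c = missed assignments (false negatives) - the numbers below are hypothetical:

```perl
use strict;
use warnings;

# Sketch: precision, recall, and F1 from contingency counts.
sub prf {
    my ($a, $b, $c) = @_;
    my $p  = $a / ($a + $b);            # fraction of assignments that were right
    my $r  = $a / ($a + $c);            # fraction of true memberships recovered
    my $f1 = 2 * $p * $r / ($p + $r);   # harmonic mean of the two
    return ($p, $r, $f1);
}

my ($p, $r, $f1) = prf(8, 2, 4);
printf "precision=%.3f recall=%.3f F1=%.3f\n", $p, $r, $f1;
```

The actual computation (including micro/macro averaging across categories) is performed by Statistics::Contingency, which this distribution uses.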
HISTORY
This module is a revised and redesigned version of the previous
"AI::Categorize" module by the same author. Note the added 'r' in the new
name. The older module has a different interface, and no attempt at backward
compatibility has been made - that's why I changed the name.
You can have both "AI::Categorize" and "AI::Categorizer" installed at the
same time on the same machine, if you want. They don't know about each other
or use conflicting namespaces.
eg/categorizer
} else {
warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
}
}
}
run_section('scan_features', 1, $do_stage);
run_section('read_training_set', 2, $do_stage);
run_section('train', 3, $do_stage);
run_section('evaluate_test_set', 4, $do_stage);
if ($do_stage->{5}) {
my $result = $c->stats_table;
print $result if $c->verbose;
print $out_fh $result if $out_fh;
}
sub run_section {
my ($section, $stage, $do_stage) = @_;
return unless $do_stage->{$stage};
if (keys %$do_stage > 1) {
#!/usr/bin/perl
# This script is a fairly simple demonstration of how AI::Categorizer
# can be used. There are lots of other less-simple demonstrations
# (actually, they're doing much simpler things, but are probably
# harder to follow) in the tests in the t/ subdirectory. The
# eg/categorizer script can also be a good example if you're willing
# to figure out a bit how it works.
#
# This script reads a training corpus from a directory of plain-text
# documents, trains a Naive Bayes categorizer on it, then tests the
# categorizer on a set of test documents.
use strict;
use AI::Categorizer;
use AI::Categorizer::Collection::Files;
use AI::Categorizer::Learner::NaiveBayes;
use File::Spec;
die("Usage: $0 <corpus>\n".
" A sample corpus (data set) can be downloaded from\n".
" http://www.cpan.org/authors/Ken_Williams/data/reuters-21578.tar.gz\n".
" or http://www.limnus.com/~ken/reuters-21578.tar.gz\n")
unless @ARGV == 1;
my $corpus = shift;
my $training = File::Spec->catfile( $corpus, 'training' );
my $test = File::Spec->catfile( $corpus, 'test' );
my $cats = File::Spec->catfile( $corpus, 'cats.txt' );
my $stopwords = File::Spec->catfile( $corpus, 'stopwords' );
my %params;
if (-e $stopwords) {
$params{stopword_file} = $stopwords;
} else {
warn "$stopwords not found - no stopwords will be used.\n";
}
unless (-e $cats) {
  die "$cats not found - can't proceed without category information.\n";
}
# In a real-world application these Collection objects could be of any
# type (any Collection subclass). Or you could create each Document
# object manually. Or you could let the KnowledgeSet create the
# Collection objects for you.
$training = AI::Categorizer::Collection::Files->new( path => $training, %params );
$test = AI::Categorizer::Collection::Files->new( path => $test, %params );
# We turn on verbose mode so you can watch the progress of loading &
# training. This looks nicer if you have Time::Progress installed!
print "Loading training set\n";
my $k = AI::Categorizer::KnowledgeSet->new( verbose => 1 );
$k->load( collection => $training );
print "Training categorizer\n";
my $l = AI::Categorizer::Learner::NaiveBayes->new( verbose => 1 );
$l->train( knowledge_set => $k );
print "Categorizing test set\n";
my $experiment = $l->categorize_collection( collection => $test );
print $experiment->stats_table;
# If you want to get at the specific assigned categories for a
# specific document, you can do it like this:
my $doc = AI::Categorizer::Document->new
( content => "Hello, I am a pretty generic document with not much to say." );
my $h = $l->categorize( $doc );
print ("For test document:\n",
" Best category = ", $h->best_category, "\n",
" All categories = ", join(', ', $h->categories), "\n");
eg/easy_guesser.pl
#!/usr/bin/perl
# This script can be helpful for getting a set of baseline scores for
# a categorization task. It simulates using the "Guesser" learner,
but is much faster. Because it doesn't use the whole
framework, though, it expects everything to be in a very strict
# format. <cats-file> is in the same format as the 'category_file'
# parameter to the Collection class. <training-dir> and <test-dir>
# give paths to directories of documents, named as in <cats-file>.
use strict;
use Statistics::Contingency;
die "Usage: $0 <cats-file> <training-dir> <test-dir>\n" unless @ARGV == 3;
my ($cats, $training, $test) = @ARGV;
die "$cats isn't a plain file\n" unless -f $cats;
die "$training isn't a directory\n" unless -d $training;
die "$test isn't a directory\n" unless -d $test;
my %cats;
print "Reading category file\n";
open my($fh), $cats or die "Can't read $cats: $!";
while (<$fh>) {
my ($doc, @cats) = split;
$cats{$doc} = \@cats;
}
my (%freq, $docs);
eg/easy_guesser.pl
}
$docs++;
$freq{$_}++ foreach @{$cats{$file}};
}
closedir $dh;
print "Calculating probabilities (@{[ %freq ]})\n";
$_ /= $docs foreach values %freq;
my @cats = keys %freq;
print "Scoring test documents\n";
my $c = Statistics::Contingency->new(categories => \@cats);
opendir $dh, $test or die "Can't opendir $test: $!";
while (defined(my $file = readdir $dh)) {
next if $file eq '.' or $file eq '..';
unless ($cats{$file}) {
warn "No category information for '$file'";
next;
}
my @assigned;
foreach (@cats) {
push @assigned, $_ if rand() < $freq{$_};
}
lib/AI/Categorizer.pm
use AI::Categorizer::KnowledgeSet;
__PACKAGE__->valid_params
(
progress_file => { type => SCALAR, default => 'save' },
knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
learner => { isa => 'AI::Categorizer::Learner' },
verbose => { type => BOOLEAN, default => 0 },
training_set => { type => SCALAR, optional => 1 },
test_set => { type => SCALAR, optional => 1 },
data_root => { type => SCALAR, optional => 1 },
);
__PACKAGE__->contained_objects
(
knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
learner => { class => 'AI::Categorizer::Learner::NaiveBayes' },
experiment => { class => 'AI::Categorizer::Experiment',
delayed => 1 },
collection => { class => 'AI::Categorizer::Collection::Files',
delayed => 1 },
);
sub new {
my $package = shift;
my %args = @_;
my %defaults;
if (exists $args{data_root}) {
$defaults{training_set} = File::Spec->catfile($args{data_root}, 'training');
$defaults{test_set} = File::Spec->catfile($args{data_root}, 'test');
$defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
delete $args{data_root};
}
return $package->SUPER::new(%defaults, %args);
}
#sub dump_parameters {
# my $p = shift()->SUPER::dump_parameters;
# delete $p->{stopwords} if $p->{stopword_file};
lib/AI/Categorizer.pm
sub knowledge_set { shift->{knowledge_set} }
sub learner { shift->{learner} }
# Combines several methods in one sub
sub run_experiment {
my $self = shift;
$self->scan_features;
$self->read_training_set;
$self->train;
$self->evaluate_test_set;
print $self->stats_table;
}
sub scan_features {
my $self = shift;
return unless $self->knowledge_set->scan_first;
$self->knowledge_set->scan_features( path => $self->{training_set} );
$self->knowledge_set->save_features( "$self->{progress_file}-01-features" );
}
lib/AI/Categorizer.pm
}
sub train {
my $self = shift;
$self->_load_progress( '02', 'knowledge_set' );
$self->learner->train( knowledge_set => $self->{knowledge_set} );
$self->_save_progress( '03', 'learner' );
return $self->learner;
}
sub evaluate_test_set {
my $self = shift;
$self->_load_progress( '03', 'learner' );
my $c = $self->create_delayed_object('collection', path => $self->{test_set} );
$self->{experiment} = $self->learner->categorize_collection( collection => $c );
$self->_save_progress( '04', 'experiment' );
return $self->{experiment};
}
sub stats_table {
my $self = shift;
$self->_load_progress( '04', 'experiment' );
return $self->{experiment}->stats_table;
}
lib/AI/Categorizer.pm
=head1 NAME
AI::Categorizer - Automatic Text Categorization
=head1 SYNOPSIS
use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);
# Run a complete experiment - training on a corpus, testing on a test
# set, printing a summary of results to STDOUT
$c->run_experiment;
# Or, run the parts of $c->run_experiment separately
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
lib/AI/Categorizer.pm
The basic process of using this module will typically involve
obtaining a collection of B<pre-categorized> documents, creating a
"knowledge set" representation of those documents, training a
categorizer on that knowledge set, and saving the trained categorizer
for later use. There are several ways to carry out this process. The
top-level C<AI::Categorizer> module provides an umbrella class for
high-level operations, or you may use the interfaces of the individual
classes in the framework.
A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus, is
distributed as eg/demo.pl .
Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?). Categorization of documents
is often a difficult task even for humans well-trained in the
particular domain of knowledge, and there are many things a human
would consider that none of these algorithms consider. These are only
statistical tests - at best they are neat tricks or helpful
assistants, and at worst they are totally unreliable. If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.
For the usage details, please see the documentation of each individual
module.
=head1 FRAMEWORK COMPONENTS
This section explains the major pieces of the C<AI::Categorizer>
object framework. The process of choosing which features to use in the
knowledge set is called "feature selection". It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.
=head2 Collections
Because documents may be stored in lots of different formats, a
"collection" class has been created as an abstraction of a stored set
of documents, together with a way to iterate through the set and
return Document objects. A knowledge set contains a single collection
object. A C<Categorizer> doing a complete test run generally contains
two collections, one for training and one for testing. A C<Learner>
can mass-categorize a collection.
The C<AI::Categorizer::Collection> class and its subclasses
instantiate the idea of a collection in this sense.
=head2 Documents
Each document is represented by an C<AI::Categorizer::Document>
object, or an object of one of its subclasses. Each document class
contains methods for turning a bunch of data into a Feature Vector.
lib/AI/Categorizer.pm
=item verbose
If true, a few status messages will be printed during execution.
=item training_set
Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.
=item test_set
Specifies the C<path> parameter that will be used when creating a
Collection during the C<evaluate_test_set()> method.
=item data_root
A shortcut for setting the C<training_set>, C<test_set>, and
C<category_file> parameters separately. Sets C<training_set> to
C<$data_root/training>, C<test_set> to C<$data_root/test>, and
C<category_file> (used by some of the Collection classes) to
C<$data_root/cats.txt>.
=back
=item learner()
Returns the Learner object associated with this Categorizer. Before
C<train()>, the Learner will of course not be trained yet.
=item knowledge_set()
Returns the KnowledgeSet object associated with this Categorizer. If
C<read_training_set()> has not yet been called, the KnowledgeSet will
not yet be populated with any training data.
=item run_experiment()
Runs a complete experiment on the training and testing data, reporting
the results on C<STDOUT>. Internally, this is just a shortcut for
calling the C<scan_features()>, C<read_training_set()>, C<train()>,
and C<evaluate_test_set()> methods, then printing the value of the
C<stats_table()> method.
=item scan_features()
Scans the Collection specified in the C<training_set> parameter to
determine the set of features (words) that will be considered when
training the Learner. Internally, this calls the C<scan_features()>
method of the KnowledgeSet, then saves a list of the KnowledgeSet's
features for later use.
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
=item read_training_set()
Populates the KnowledgeSet with the data specified in the C<training_set>
parameter. Internally, this calls the C<read()> method of the
KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
object for later use.
=item train()
Calls the Learner's C<train()> method, passing it the KnowledgeSet
created during C<read_training_set()>. Returns the Learner object.
Also saves the Learner object for later use.
=item evaluate_test_set()
Creates a Collection based on the value of the C<test_set> parameter,
and calls the Learner's C<categorize_collection()> method using this
Collection. Returns the resultant Experiment object. Also saves the
Experiment object for later use in the C<stats_table()> method.
=item stats_table()
Returns the value of the Experiment's (as created by
C<evaluate_test_set()>) C<stats_table()> method. This is a string
that shows various statistics about the
accuracy/precision/recall/F1/etc. of the assignments made during
testing.
=back
=head1 HISTORY
This module is a revised and redesigned version of the previous
C<AI::Categorize> module by the same author. Note the added 'r' in
the new name. The older module has a different interface, and no
attempt at backward compatibility has been made - that's why I changed
the name.
lib/AI/Categorizer/Document.pm
documents are plain text, but subclasses of the Document class may
handle any kind of data.
=head1 METHODS
=over 4
=item new(%parameters)
Creates a new Document object. Document objects are used during
training (for the training documents), testing (for the test
documents), and when categorizing new unseen documents in an
application (for the unseen documents). However, you'll typically
only call C<new()> in the latter case, since the KnowledgeSet or
Collection classes will create Document objects for you in the former
cases.
The C<new()> method accepts the following parameters:
=over 4
lib/AI/Categorizer/Hypothesis.pm
=head1 METHODS
=over 4
=item new(%parameters)
Returns a new Hypothesis object. Generally a user of
C<AI::Categorizer> doesn't create a Hypothesis object directly - they
are returned by the Learner's C<categorize()> method. However, if you
wish to create a Hypothesis directly (maybe passing it some fake data
for testing purposes) you may do so using the C<new()> method.
The following parameters are accepted when creating a new Hypothesis:
=over 4
=item all_categories
A required parameter which gives the set of all categories that could
possibly be assigned to. The categories should be specified as a
reference to an array of category names (as strings).
lib/AI/Categorizer/Learner/Guesser.pm
my $hypothesis = $l->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
}
=head1 DESCRIPTION
This implements a simple category guesser that makes assignments based
solely on the prior probabilities of categories. For instance, if 5%
of the training documents belong to a certain category, then the
probability of any test document being assigned to that category is
0.05. This can be useful for providing baseline scores to compare
with other more sophisticated algorithms.
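The rule itself is tiny; a self-contained sketch (with hypothetical prior probabilities, not the module's actual internals) looks like this:

```perl
use strict;
use warnings;

# Sketch of the Guesser's rule: assign each category with probability
# equal to its frequency among the training documents.
my %prior = (sports => 0.05, politics => 0.30, finance => 0.65);

sub guess_categories {
    my ($prior) = @_;
    return grep { rand() < $prior->{$_} } sort keys %$prior;
}

my @assigned = guess_categories(\%prior);
print "Assigned: @assigned\n";
```

Over many test documents this reproduces the training distribution on average, which is exactly what makes it a useful baseline.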
See L<AI::Categorizer> for a complete description of the interface.
=head1 METHODS
This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available.
lib/AI/Categorizer/Learner/Weka.pm
}
# java -classpath /Applications/Science/weka-3-2-3/weka.jar weka.classifiers.NaiveBayes -t /tmp/train_file.arff -d /tmp/weka-machine
sub create_model {
my ($self) = shift;
my $m = $self->{model} ||= {};
$m->{all_features} = [ $self->knowledge_set->features->names ];
$m->{_in_dir} = File::Temp::tempdir( DIR => $self->{tmpdir} );
# Create a dummy test file $dummy_file in ARFF format (a kludgey WEKA requirement)
my $dummy_features = $self->create_delayed_object('features');
$m->{dummy_file} = $self->create_arff_file("dummy", [[$dummy_features, 0]]);
$self->SUPER::create_model(@_);
}
sub create_boolean_model {
my ($self, $pos, $neg, $cat) = @_;
my @docs = (map([$_->features, 1], @$pos),
lib/AI/Categorizer/Learner/Weka.pm
'-d', $outfile,
'-v',
'-p', '0',
);
$self->do_cmd(@args);
unlink $train_file or warn "Couldn't remove $train_file: $!";
return \%info;
}
# java -classpath /Applications/Science/weka-3-2-3/weka.jar weka.classifiers.NaiveBayes -l out -T test.arff -p 0
sub get_boolean_score {
my ($self, $doc, $info) = @_;
# Create document file
my $doc_file = $self->create_arff_file('doc', [[$doc->features, 0]], $self->{tmpdir});
my $machine_file = File::Spec->catfile($self->{model}{_in_dir}, $info->{machine_file});
my @args = ($self->{java_path},
@{$self->{java_args}},
t/01-naive_bayes.t
#!/usr/bin/perl -w
# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'
#########################
use strict;
use Test;
BEGIN {
require 't/common.pl';
need_module('Algorithm::NaiveBayes');
plan tests => 15 + num_standard_tests();
}
ok(1);
#########################
perform_standard_tests(learner_class => 'AI::Categorizer::Learner::NaiveBayes');
#use Carp; $SIG{__DIE__} = \&Carp::confess;
my %docs = training_docs();
{
ok my $c = new AI::Categorizer(collection_weighting => 'f');
while (my ($name, $data) = each %docs) {
$c->knowledge_set->make_document(name => $name, %$data);
t/01-naive_bayes.t
for ('vampires', 'mirrors') {
ok ($c->knowledge_set->document('doc4')->features->as_hash->{$_},
log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
);
}
$c->learner->train( knowledge_set => $c->knowledge_set );
ok $c->learner;
my $doc = new AI::Categorizer::Document
( name => 'test1',
content => 'I would like to begin farming sheep.' );
ok $c->learner->categorize($doc)->best_category, 'farming';
}
{
ok my $c = new AI::Categorizer(term_weighting => 'b');
while (my ($name, $data) = each %docs) {
$c->knowledge_set->make_document(name => $name, %$data);
}
t/02-experiment.t
#!/usr/bin/perl -w
# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'
#########################
use strict;
use Test;
BEGIN { plan tests => 14 };
use AI::Categorizer;
use AI::Categorizer::Experiment;
ok(1);
my $all_categories = [qw(sports politics finance world)];
{
my $e = new AI::Categorizer::Experiment(categories => $all_categories);
t/03-weka.t
#!/usr/bin/perl -w
# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'
#########################
use strict;
use Test;
use Module::Build;
my $classpath = Module::Build->current->notes('classpath');
require 't/common.pl';
skip_test("Weka is not installed") unless defined $classpath;
plan tests => 1 + num_standard_tests();
ok(1);
#########################
my @args;
push @args, weka_path => $classpath
unless $classpath eq '-';
perform_standard_tests(
learner_class => 'AI::Categorizer::Learner::Weka',
weka_classifier => 'weka.classifiers.functions.SMO',
# or 'weka.classifiers.SMO' for older Weka versions
@args,
);
t/04-decision_tree.t
#!/usr/bin/perl -w
# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'
#########################
use strict;
use Test;
BEGIN {
require 't/common.pl';
need_module('AI::DecisionTree 0.06');
plan tests => 1 + num_standard_tests();
}
ok(1);
#########################
perform_standard_tests(learner_class => 'AI::Categorizer::Learner::DecisionTree');
#!/usr/bin/perl -w
# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'
#########################
use strict;
use Test;
BEGIN {
require 't/common.pl';
need_module('Algorithm::SVM');
plan tests => 1 + num_standard_tests();
}
ok(1);
#########################
perform_standard_tests(learner_class => 'AI::Categorizer::Learner::SVM');
#!/usr/bin/perl -w
# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'
#########################
use strict;
use Test;
BEGIN {
require 't/common.pl';
plan tests => 5 + 2 * num_standard_tests();
}
ok(1);
#########################
# There are only 4 test documents, so use k=2
perform_standard_tests(learner_class => 'AI::Categorizer::Learner::KNN', k_value => 2);
perform_standard_tests(learner_class => 'AI::Categorizer::Learner::KNN', k_value => 2, knn_weighting => 'uniform');
my $q = AI::Categorizer::Learner::KNN::Queue->new(size => 3);
$q->add(five => 5);
$q->add(four => 4);
$q->add(one => 1);
$q->add(ten => 10);
$q->add(three => 3);
$q->add(eleven => 11);
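The queue exercised above can be sketched in standalone Perl. This is an assumption-laden stand-in, not the distribution's actual class: it supposes the queue retains the k lowest scores (nearest neighbors for a distance metric), with `SketchQueue`, `add`, and `names` being hypothetical names:

```perl
use strict;
use warnings;

# Hedged sketch of a fixed-size "best entries" queue, KNN-style.
package SketchQueue;

sub new {
    my ($class, %args) = @_;
    return bless { size => $args{size}, list => [] }, $class;
}

sub add {
    my ($self, $name, $score) = @_;
    push @{ $self->{list} }, [ $name, $score ];
    # Keep only the size smallest scores once the queue overflows.
    @{ $self->{list} } =
        (sort { $a->[1] <=> $b->[1] } @{ $self->{list} })[ 0 .. $self->{size} - 1 ]
        if @{ $self->{list} } > $self->{size};
}

sub names { return map { $_->[0] } @{ $_[0]{list} } }

package main;

my $q = SketchQueue->new(size => 3);
$q->add(@$_) for [ five => 5 ], [ four => 4 ], [ one => 1 ],
                 [ ten => 10 ], [ three => 3 ], [ eleven => 11 ];
print join(' ', sort $q->names), "\n";
```

After the six adds above, only the entries with scores 1, 3, and 4 remain.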
t/07-guesser.t
#!/usr/bin/perl -w
#########################
use strict;
use Test;
BEGIN {
require 't/common.pl';
plan tests => 1 + num_setup_tests();
}
ok(1);
#########################
my ($learner, $docs) = set_up_tests(learner_class => 'AI::Categorizer::Learner::Guesser');
t/09-rocchio.t
#!/usr/bin/perl -w
# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'
#########################
use strict;
use Test;
BEGIN {
require 't/common.pl';
plan tests => 1 + num_standard_tests();
}
ok(1);
#########################
perform_standard_tests(learner_class => 'AI::Categorizer::Learner::Rocchio');
# t/10-tools.t
#!/usr/bin/perl -w
# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'
#########################
use strict;
use Test;
BEGIN {
plan tests => 10;
};
use AI::Categorizer::Util qw(random_elements binary_search);
ok(1);
# Test random_elements()
my @x = ('a'..'j');
my @y = random_elements(\@x, 3);
ok @y, 3;
ok $y[0] =~ /^[a-j]$/;
# t/11-feature_vector.t
#!/usr/bin/perl -w
# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl test.pl'
#########################
use strict;
use Test;
BEGIN {
plan tests => 18;
}
use AI::Categorizer::FeatureVector;
ok(1);
my $f1 = new AI::Categorizer::FeatureVector(features => {sports => 2, finance => 3});
ok $f1;
ok $f1->includes('sports');
ok $f1->value('sports'), 2;
# t/12-hypothesis.t
#!/usr/bin/perl -w
use strict;
use Test;
BEGIN {
plan tests => 8;
};
use AI::Categorizer::Hypothesis;
ok(1);
my @cats = ('a'..'z', 'foo', 'bar');
my $h = new AI::Categorizer::Hypothesis
(
all_categories => \@cats,
# t/13-document.t
#!/usr/bin/perl -w
use strict;
use Test;
BEGIN { plan tests => 27, todo => [] };
use AI::Categorizer;
use AI::Categorizer::Document;
use AI::Categorizer::FeatureVector;
ok(1);
my $docclass = 'AI::Categorizer::Document';
# Test empty document creation
{
ok $d->features->value('one'), 1;
ok $d->features->value('two'), 2;
ok $d->features->includes('foo'), '';
}
# Test some stemming & stopword stuff.
{
my $d = $docclass->new
(
name => 'test',
stopwords => ['stemmed'],
stemming => 'porter',
content => 'stopword processing should happen after stemming',
# Becomes qw(stopword process should happen after stem )
);
ok $d->stopword_behavior, 'stem', "stopword_behavior() is 'stem'";
ok $d->features->includes('stopword'), 1, "Should include 'stopword'";
ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
ok $d->features->includes('stem'), '', "Shouldn't include 'stem'";
print "Features: @{[ $d->features->names ]}\n";
}
{
my $d = $docclass->new
(
name => 'test',
stopwords => ['stemmed'],
stemming => 'porter',
stopword_behavior => 'no_stem',
content => 'stopword processing should happen after stemming',
# Becomes qw(stopword process should happen after stem )
);
ok $d->stopword_behavior, 'no_stem', "stopword_behavior() is 'no_stem'";
ok $d->features->includes('stopword'), 1, "Should include 'stopword'";
ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
ok $d->features->includes('stem'), 1, "Should include 'stem'";
print "Features: @{[ $d->features->names ]}\n";
}
{
my $d = $docclass->new
(
name => 'test',
stopwords => ['stem'],
stemming => 'porter',
stopword_behavior => 'pre_stemmed',
content => 'stopword processing should happen after stemming',
# Becomes qw(stopword process should happen after stem )
);
ok $d->stopword_behavior, 'pre_stemmed', "stopword_behavior() is 'pre_stemmed'";
ok $d->features->includes('stopword'), 1, "Should include 'stopword'";
ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
# t/14-collection.t
#!/usr/bin/perl -w
use strict;
use Test;
BEGIN { plan tests => 13 };
use AI::Categorizer;
use File::Spec;
require File::Spec->catfile('t', 'common.pl');
ok 1; # Loaded
# Test InMemory collection
use AI::Categorizer::Collection::InMemory;
my $c = AI::Categorizer::Collection::InMemory->new(data => {training_docs()});
category_hash => {
doc1 => ['farming'],
doc2 => ['farming'],
doc3 => ['vampire'],
doc4 => ['vampire'],
},
);
ok $c;
exercise_collection($c, 4);
# 5 tests here
sub exercise_collection {
my ($c, $num_docs) = @_;
my $d = $c->next;
ok $d;
ok $d->isa('AI::Categorizer::Document');
$c->rewind;
my $d2 = $c->next;
ok $d2->name, $d->name, "Make sure we get the same document after a rewind";
# t/15-knowledge_set.t
#!/usr/bin/perl -w
use strict;
use Test;
BEGIN { plan tests => 5 };
use AI::Categorizer;
ok 1; # Loaded
my $k = AI::Categorizer::KnowledgeSet->new();
ok $k;
my $c1 = AI::Categorizer::Category->by_name(name => 'one');
my $c2 = AI::Categorizer::Category->by_name(name => 'two');
ok $c1;
# t/common.pl
use AI::Categorizer::KnowledgeSet;
use AI::Categorizer::Collection::InMemory;
sub have_module {
my $module = shift;
return eval "use $module; 1";
}
sub need_module {
my $module = shift;
skip_test("$module not installed") unless have_module($module);
}
sub skip_test {
my $msg = @_ ? shift() : '';
print "1..0 # Skipped: $msg\n";
exit;
}
sub training_docs {
return (
doc1 => {categories => ['farming'],
content => 'Sheep are very valuable in farming.' },
doc2 => {categories => ['farming'],
content => 'Farming requires many kinds of animals.' },
doc3 => {categories => ['vampire'],
content => 'Vampires drink blood and vampires may be staked.' },
doc4 => {categories => ['vampire'],
content => 'Vampires cannot see their images in mirrors.'},
);
}
sub run_test_docs {
my $l = shift;
my $doc = new AI::Categorizer::Document
( name => 'test1',
content => 'I would like to begin farming sheep.' );
my $r = $l->categorize($doc);
print "Categories: ", join(', ', $r->categories), "\n";
ok($r->best_category, 'farming', "Best category is 'farming'");
ok $r->in_category('farming'), 1, sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('farming'));
ok $r->in_category('vampire'), '', sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('vampire'));
ok $r->all_categories, 2, "Should be 2 categories in total";
$doc = new AI::Categorizer::Document
( name => 'test2',
content => "I see that many vampires may have eaten my beautiful daughter's blood." );
$r = $l->categorize($doc);
print "Categories: ", join(', ', $r->categories), "\n";
ok($r->best_category, 'vampire', "Best category is 'vampire'");
ok $r->in_category('farming'), '', sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('farming'));
ok $r->in_category('vampire'), 1, sprintf("threshold = %s, score = %s", $r->threshold, $r->scores('vampire'));
}
sub set_up_tests {
my %params = @_;
my $c = new AI::Categorizer(
knowledge_set => AI::Categorizer::KnowledgeSet->new
(
name => 'Vampires/Farmers',
stopwords => [qw(are be in of and)],
),
verbose => $ENV{TEST_VERBOSE} ? 1 : 0,
%params,
);
while (my ($name, $data) = each %docs) {
$c->knowledge_set->make_document(name => $name, %$data);
}
my $l = $c->learner;
ok $l;
if ($params{learner_class}) {
ok ref($l), $params{learner_class}, "Make sure the correct Learner class is instantiated";
} else {
ok 1, 1, "Dummy test";
}
$l->train;
return ($l, \%docs);
}
sub perform_standard_tests {
my ($l, $docs) = set_up_tests(@_);
run_test_docs($l);
# Make sure we can save state & restore state
$l->save_state('t/state');
$l = $l->restore_state('t/state');
ok $l;
run_test_docs($l);
my $train_collection = AI::Categorizer::Collection::InMemory->new(data => $docs);
ok $train_collection;
my $h = $l->categorize_collection(collection => $train_collection);
ok $h->micro_precision > 0.5;
}
sub num_setup_tests () { 3 }
sub num_standard_tests () { num_setup_tests + 17 }
1;