Build.PL
use Module::Build;
use Config;
use File::Spec;
my $build = new Module::Build
(
module_name => 'AI::Categorizer',
license => 'perl',
requires => {
perl => '5.6.0',
Class::Container => 0.09,
Storable => 0,
Params::Validate => 0.18,
Statistics::Contingency => 0.06,
Lingua::Stem => 0.50,
File::Spec => 0,
},
recommends => {
Scalar::Util => 0,
Time::Progress => 1.1,
Algorithm::SVM => 0.06,
AI::DecisionTree => 0.06,
Algorithm::NaiveBayes => 0,
},
build_requires => {
Module::Build => 0.20,
},
create_makefile_pl => 'passthrough',
);
my $categorizer = File::Spec->catfile('eg', 'categorizer');
if ($build->y_n("Do you want to install the $categorizer script to $Config{installscript}?", 'n')) {
$build->scripts($categorizer);
}
Changes
- Added a ChiSquared feature selection class. [Francois Paradis]
- Changed the web locations of the reuters-21578 corpus that
eg/demo.pl uses, since the location it referenced previously has
gone away.
- The building & installing process now uses Module::Build rather
than ExtUtils::MakeMaker.
- When the features_kept mechanism was used to explicitly state the
features to use, and the scan_first parameter was left as its
default value, the features_kept mechanism would silently fail to
do anything. This has now been fixed. [Spotted by Arnaud Gaudinat]
- Recent versions of Weka have changed the name of the SVM class, so
I've updated it in our test (t/03-weka.t) of the Weka wrapper
too. [Sebastien Aperghis-Tramoni]
0.07 Tue May 6 16:15:04 CDT 2003
- Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
MANIFEST, so they weren't included in the 0.06 distribution.
Collection class.
0.05 Sat Mar 29 00:38:21 CST 2003
- Feature selection is now handled by an abstract FeatureSelector
framework class. Currently the only concrete subclass implemented
is FeatureSelector::DocFrequency. The 'feature_selection'
parameter has been replaced with a 'feature_selector_class'
parameter.
- Added a k-Nearest-Neighbor machine learner. [First revision
implemented by David Bell]
- Added a Rocchio machine learner. [Partially implemented by Xiaobo
Li]
- Added a "Guesser" machine learner which simply uses overall class
probabilities to make categorization decisions. Sometimes useful
for providing a set of baseline scores against which to evaluate
other machine learners.
- The NaiveBayes learner is now a wrapper around my new
Algorithm::NaiveBayes module, which is just the old NaiveBayes code
from here, turned into its own standalone module.
- Much more extensive regression testing of the code.
- Added a Document subclass for XML documents. [Implemented by
Jae-Moon Lee] Its interface is still unstable, it may change in
later releases.
- Added a 'Build.PL' file for an alternate installation method using
Module::Build.
- Fixed a problem in the Hypothesis' best_category() method that
would often result in the wrong category being reported. Added a
regression test to exercise the Hypothesis class. [Spotted by
Xiaobo Li]
- The 'categorizer' script now records more useful benchmarking
information about time & memory in its outfile.
- The AI::Categorizer->dump_parameters() method now tries to avoid
showing you its entire list of stopwords.
- Document objects now use a default 'name' if none is supplied.
- Added a virtual class for binary classifiers.
- Wrote documentation for lots of the undocumented classes.
- Added a PNG file giving an overview diagram of the classes.
- Added a script 'categorizer' to provide a simple command-line
interface to AI::Categorizer
- save_state() and restore_state() now save to a directory, not a
file.
- Removed F1(), precision(), recall(), etc. from Util package since
they're in Statistics::Contingency. Added random_elements() to
Util.
- Collection::Files now warns when no category information is known
about a document in the collection (knowing it's in zero categories
is okay).
- Added dot() and value() methods to FeatureVector.
- Added 'feature_selection' parameter to KnowledgeSet.
- Added document($name) accessor method to KnowledgeSet.
- In KnowledgeSet, load(), read(), and scan_*() can now accept a
Collection object.
- Added document_frequency(), finish(), and weigh_features() methods
to KnowledgeSet.
- Added save_features() and restore_features() to KnowledgeSet.
- Added default categories() and categorize() methods to Learner base
class. get_scores() is now abstract.
- Extended interface of ObjectSet class with retrieve(), includes(),
and includes_name().
- Moved 'term_weighting' parameter from Document to KnowledgeSet,
since the normalized version needs to know the maximum
term-frequency. Also changed its values to 'n', 'l', 'b', and 't',
with 'x' a synonym for 't'.
- Implemented full range of TF/IDF term weighting methods (see Salton
META.yml
name: AI-Categorizer
version: 0.09
author:
- 'Ken Williams <ken@mathforum.org>'
- |-
Discussion about this module can be directed to the perl-AI list at
<perl-ai@perl.org>. For more info about the list, see
http://lists.perl.org/showlist.cgi?name=perl-ai
abstract: Automatic Text Categorization
license: perl
resources:
license: http://dev.perl.org/licenses/
requires:
Class::Container: 0.09
File::Spec: 0
Lingua::Stem: 0.5
Params::Validate: 0.18
Statistics::Contingency: 0.06
Storable: 0
perl: 5.6.0
build_requires:
Module::Build: 0.2
recommends:
AI::DecisionTree: 0.06
Algorithm::NaiveBayes: 0
Algorithm::SVM: 0.06
Scalar::Util: 0
Time::Progress: 1.1
provides:
AI::Categorizer:
file: lib/AI/Categorizer.pm
version: 0.09
AI::Categorizer::Category:
file: lib/AI/Categorizer/Category.pm
AI::Categorizer::Collection:
file: lib/AI/Categorizer/Collection.pm
AI::Categorizer::Collection::DBI:
file: lib/AI/Categorizer/Collection/DBI.pm
Makefile.PL
# Note: this file was auto-generated by Module::Build::Compat version 0.03
unless (eval "use Module::Build::Compat 0.02; 1" ) {
print "This module requires Module::Build to install itself.\n";
require ExtUtils::MakeMaker;
my $yn = ExtUtils::MakeMaker::prompt
(' Install Module::Build now from CPAN?', 'y');
unless ($yn =~ /^y/i) {
die " *** Cannot install without Module::Build. Exiting ...\n";
}
require Cwd;
README
NAME
AI::Categorizer - Automatic Text Categorization
SYNOPSIS
use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);
# Run a complete experiment - training on a corpus, testing on a test
# set, printing a summary of results to STDOUT
$c->run_experiment;
# Or, run the parts of $c->run_experiment separately
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
DESCRIPTION
"AI::Categorizer" is a framework for automatic text categorization. It
consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can choose what
categorization algorithm to use, what features (words or otherwise) of the
documents should be used (or how to automatically choose these features),
what format the documents are in, and so on.
The basic process of using this module will typically involve obtaining a
collection of pre-categorized documents, creating a "knowledge set"
representation of those documents, training a categorizer on that knowledge
set, and saving the trained categorizer for later use. There are several
ways to carry out this process. The top-level "AI::Categorizer" module
provides an umbrella class for high-level operations, or you may use the
interfaces of the individual classes in the framework.
A simple sample script that reads a training corpus, trains a categorizer,
and tests the categorizer on a test corpus, is distributed as eg/demo.pl .
Disclaimer: the results of any of the machine learning algorithms are far
from infallible (close to fallible?). Categorization of documents is often a
difficult task even for humans well-trained in the particular domain of
knowledge, and there are many things a human would consider that none of
these algorithms consider. These are only statistical tests - at best they
are neat tricks or helpful assistants, and at worst they are totally
unreliable. If you plan to use this module for anything really important,
human supervision is essential, both of the categorization process and the
final results.
For the usage details, please see the documentation of each individual
module.
FRAMEWORK COMPONENTS
This section explains the major pieces of the "AI::Categorizer" object
framework. We give a conceptual overview, but don't get into any of the
details about interfaces or usage. See the documentation for the individual
classes for more details.
A diagram of the various classes in the framework can be seen in
"doc/classes-overview.png", and a more detailed view of the same thing can
be seen in "doc/classes.png".
Knowledge Sets
A "knowledge set" is defined as a collection of documents, together with
some information on the categories each document belongs to. Note that this
term is somewhat unique to this project - other sources may call it a
"training corpus", or "prior knowledge". A knowledge set also contains some
information on how documents will be parsed and how their features (words)
will be extracted and turned into meaningful representations. In this sense,
a knowledge set represents not only a collection of data, but a particular
view on that data.
A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
class. Before you can start playing with categorizers, you will have to
start playing with knowledge sets, so that the categorizers have some data
to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
module for information on its interface.
Feature selection
Deciding which features are the most important is a very large part of the
categorization task - you cannot simply consider all the words in all the
documents when training, and all the words in the document being
categorized. There are two main reasons for this - first, it would mean that
your training and categorizing processes would take forever and use tons of
memory, and second, the significant stuff of the documents would get lost in
the "noise" of the insignificant stuff.
The process of selecting the most important features in the training set is
called "feature selection". It is managed by the
"AI::Categorizer::KnowledgeSet" class, and you will find the details of
feature selection processes in that class's documentation.
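A rough sketch of how this looks in practice (values are illustrative;
"features_kept" is a parameter of the FeatureSelector class shown later in
this distribution, and reaches it through the "Class::Container" parameter
passing described under new(); $training_collection stands for any
Collection object):

    my $k = AI::Categorizer::KnowledgeSet->new
      (
       verbose       => 1,
       features_kept => 2000,   # keep the 2,000 highest-ranked features
      );
    $k->load( collection => $training_collection );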
Collections
Because documents may be stored in lots of different formats, a "collection"
class has been created as an abstraction of a stored set of documents,
together with a way to iterate through the set and return Document objects.
A knowledge set contains a single collection object. A "Categorizer" doing a
complete test run generally contains two collections, one for training and
one for testing. A "Learner" can mass-categorize a collection.
The "AI::Categorizer::Collection" class and its subclasses instantiate the
idea of a collection in this sense.
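In code, working through a collection looks roughly like this (a sketch;
the path is hypothetical, and next() and rewind() are documented in the
Collection class later in this distribution):

    my $collection = AI::Categorizer::Collection::Files->new
      ( path => 'corpus/training' );
    while (my $doc = $collection->next) {
      print $doc->name, "\n";   # each $doc is an AI::Categorizer::Document
    }
    $collection->rewind;        # reset the iterator for another pass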
Documents
Each document is represented by an "AI::Categorizer::Document" object, or an
object of one of its subclasses. Each document class contains methods for
turning a bunch of data into a Feature Vector. Each document also has a
method to report which categories it belongs to.
Categories
Each category is represented by an "AI::Categorizer::Category" object. Its
main purpose is to keep track of which documents belong to it, though you
can also examine statistical properties of an entire category, such as
obtaining a Feature Vector representing an amalgamation of all the documents
that belong to it.
Machine Learning Algorithms
There are lots of different ways to make the inductive leap from the
training documents to unseen documents. The Machine Learning community has
studied many algorithms for this purpose. To allow flexibility in choosing
and configuring categorization algorithms, each such algorithm is a subclass
of "AI::Categorizer::Learner". There are currently four categorizers
included in the distribution:
AI::Categorizer::Learner::NaiveBayes
A pure-perl implementation of a Naive Bayes classifier. No dependencies
on external modules or other resources. Naive Bayes is usually very fast
to train and fast to make categorization decisions, but isn't always the
most accurate categorizer.
AI::Categorizer::Learner::SVM
An interface to Corey Spencer's "Algorithm::SVM", which implements a
Support Vector Machine classifier. SVMs can take a while to train
(though in certain conditions there are optimizations to make them quite
fast), but are pretty quick to categorize. They often have very good
accuracy.
AI::Categorizer::Learner::DecisionTree
An interface to "AI::DecisionTree", which implements a Decision Tree
classifier. Decision Trees generally take longer to train than Naive
Bayes or SVM classifiers, but they are also quite fast when
categorizing. Decision Trees have the advantage that you can scrutinize
the structures of trained decision trees to see how decisions are being
made.
AI::Categorizer::Learner::Weka
An interface to version 2 of the Weka Knowledge Analysis system that
lets you use any of the machine learners it defines. This gives you
access to lots and lots of machine learning algorithms in use by machine
learning researchers. The main drawback is that Weka tends to be quite
slow and use a lot of memory, and the current interface between Weka and
"AI::Categorizer" is a bit clumsy.
Other machine learning methods that may be implemented soonish include
Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts combiner
for ensemble learning. No timetable for their creation has yet been set.
Please see the documentation of these individual modules for more details on
their guts and quirks. See the "AI::Categorizer::Learner" documentation for
a description of the general categorizer interface.
If you wish to create your own classifier, you should inherit from
"AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
abstract classes that manage some of the work for you.
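A skeleton of such a subclass might look like this (a sketch only:
"My::Learner" is a hypothetical name, and get_scores() is the abstract
method mentioned in this distribution's change log; check the
"AI::Categorizer::Learner" documentation for its exact calling and return
conventions):

    package My::Learner;
    use strict;
    use base qw(AI::Categorizer::Learner);

    # The base class supplies default categories() and categorize()
    # methods built on top of get_scores(), which is abstract.
    sub get_scores {
      my ($self, $doc) = @_;
      die "scoring logic goes here";   # placeholder
    }

    1;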
Feature Vectors
Most categorization algorithms don't deal directly with documents' data,
they instead deal with a *vector representation* of a document's *features*.
The features may be any properties of the document that seem helpful for
determining its category, but they are usually some version of the "most
important" words in the document. A list of features and their weights in
each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
You may think of this class as roughly analogous to a Perl hash, where the
keys are the names of features and the values are their weights.
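For example (a sketch; the constructor arguments shown are an assumption,
while as_hash() appears elsewhere in this distribution's code and dot() and
value() in its change log):

    use AI::Categorizer::FeatureVector;
    my $fv = AI::Categorizer::FeatureVector->new
      ( features => { perl => 3, categorize => 1 } );   # hypothetical args
    my $other = AI::Categorizer::FeatureVector->new
      ( features => { perl => 1 } );
    my $weights = $fv->as_hash;         # { perl => 3, categorize => 1 }
    my $w       = $fv->value('perl');   # weight of a single feature
    my $sim     = $fv->dot($other);     # inner product of two vectors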
Hypotheses
The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of "statistical
guess" of what categories this document should be assigned to. Since you may
be interested in any of several pieces of information about the hypothesis
(for instance, which categories were assigned, which category was the single
most likely category, the scores assigned to each category, etc.), the
hypothesis is returned as an object of the "AI::Categorizer::Hypothesis"
class, and you can use its object methods to get information about the
hypothesis. See its class documentation for the details.
Experiments
The "AI::Categorizer::Experiment" class helps you organize the results of
categorization experiments. As you get lots of categorization results
(Hypotheses) back from the Learner, you can feed these results to the
Experiment class, along with the correct answers. When all results have been
collected, you can get a report on accuracy, precision, recall, F1, and so
on, with both micro-averaging and macro-averaging over categories. We use
the "Statistics::Contingency" module from CPAN to manage the calculations.
See the docs for "AI::Categorizer::Experiment" for more details.
METHODS
new()
Creates a new Categorizer object and returns it. Accepts lots of
parameters controlling behavior. In addition to the parameters listed
here, you may pass any parameter accepted by any class that we create
internally (the KnowledgeSet, Learner, Experiment, or Collection
classes), or any class that *they* create. This is managed by the
"Class::Container" module, so see its documentation for the details of
how this works.
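For example (values are illustrative; "stemming" is actually a parameter of
the contained Document class, and reaches it via the mechanism just
described):

    my $c = AI::Categorizer->new
      (
       data_root => 'corpus',    # see the "data_root" parameter below
       verbose   => 1,
       stemming  => 'porter',    # consumed by the contained Document class
      );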
The specific parameters accepted here are:
progress_file
A string that indicates a place where objects will be saved during
several of the methods of this class. The default value is the
string "save", which means files like "save-01-knowledge_set" will
get created. The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.
verbose
If true, a few status messages will be printed during execution.
training_set
Specifies the "path" parameter that will be fed to the
KnowledgeSet's "scan_features()" and "read()" methods during our
"scan_features()" and "read_training_set()" methods.
test_set
Specifies the "path" parameter that will be used when creating a
Collection during the "evaluate_test_set()" method.
data_root
A shortcut for setting the "training_set", "test_set", and
"category_file" parameters separately. Sets "training_set" to
"$data_root/training", "test_set" to "$data_root/test", and
"category_file" (used by some of the Collection classes) to
learner()
Returns the Learner object associated with this Categorizer. Before
"train()", the Learner will of course not be trained yet.
knowledge_set()
Returns the KnowledgeSet object associated with this Categorizer. If
"read_training_set()" has not yet been called, the KnowledgeSet will not
yet be populated with any training data.
run_experiment()
Runs a complete experiment on the training and testing data, reporting
the results on "STDOUT". Internally, this is just a shortcut for calling
the "scan_features()", "read_training_set()", "train()", and
"evaluate_test_set()" methods, then printing the value of the
"stats_table()" method.
scan_features()
Scans the Collection specified in the "training_set" parameter to determine
the set of features (words) that will be considered when training the
Learner. Internally, this calls the "scan_features()" method of the
KnowledgeSet, then saves a list of the KnowledgeSet's features for later
use.
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
read_training_set()
Populates the KnowledgeSet with the data specified in the "training_set"
parameter. Internally, this calls the "read()" method of the
KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
object for later use.
train()
Calls the Learner's "train()" method, passing it the KnowledgeSet
created during "read_training_set()". Returns the Learner object. Also
saves the Learner object for later use.
evaluate_test_set()
Creates a Collection based on the value of the "test_set" parameter, and
calls the Learner's "categorize_collection()" method using this
Collection. Returns the resultant Experiment object. Also saves the
Experiment object for later use in the "stats_table()" method.
stats_table()
Returns the value of the Experiment's (as created by
"evaluate_test_set()") "stats_table()" method. This is a string that
shows various statistics about the accuracy/precision/recall/F1/etc. of
the assignments made during testing.
HISTORY
This module is a revised and redesigned version of the previous
Discussion about this module can be directed to the perl-AI list at
<perl-ai@perl.org>. For more info about the list, see
http://lists.perl.org/showlist.cgi?name=perl-ai
REFERENCES
An excellent introduction to the academic field of Text Categorization is
Fabrizio Sebastiani's "Machine Learning in Automated Text Categorization":
ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.
COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This distribution is free software; you can redistribute it and/or modify it
under the same terms as Perl itself. These terms apply to every file in the
distribution - if you have questions, please contact the author.
eg/categorizer
#!/usr/bin/perl
# This script creates a Categorizer and runs several of its methods on
# a corpus, reporting the results.
#
# Copyright 2002 Ken Williams, under the same license as the
# AI::Categorizer distribution.
use strict;
use AI::Categorizer;
use Benchmark;
my $HAVE_YAML = eval "use YAML; 1";
eg/categorizer
print $out_fh "~~~~~~~~~~~~~~~~", scalar(localtime), "~~~~~~~~~~~~~~~~~~~~~~~~~~~\n";
if ($HAVE_YAML) {
print {$out_fh} YAML::Dump($c->dump_parameters);
} else {
warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
}
}
}
run_section('scan_features', 1, $do_stage);
run_section('read_training_set', 2, $do_stage);
run_section('train', 3, $do_stage);
run_section('evaluate_test_set', 4, $do_stage);
if ($do_stage->{5}) {
my $result = $c->stats_table;
print $result if $c->verbose;
print $out_fh $result if $out_fh;
}
sub run_section {
my ($section, $stage, $do_stage) = @_;
return unless $do_stage->{$stage};
if (keys %$do_stage > 1) {
print " % $0 @ARGV -$stage\n" if $c->verbose;
die "$0 is not executable, please change its execution permissions"
unless -x $0;
system($0, @ARGV, "-$stage") == 0
eg/categorizer
sub parse_command_line {
my (%opt, %do_stage);
while (@_) {
if ($_[0] =~ /^-(\d+)$/) {
shift;
$do_stage{$1} = 1;
} elsif ( $_[0] eq '--config_file' ) {
die "--config_file requires the YAML module from CPAN to be installed.\n" unless $HAVE_YAML;
shift;
my $file = shift;
my $href = YAML::LoadFile($file);
@opt{keys %$href} = values %$href;
} elsif ( $_[0] =~ /^--/ ) {
my ($k, $v) = (shift, shift);
$k =~ s/^--//;
$opt{$k} = $v;
eg/categorizer
# Allow abbreviations
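# e.g. "--learner_class ::NaiveBayes" expands to "AI::Categorizer::Learner::NaiveBayes"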
if ($k =~ /^(\w+)_class$/) {
my $name = $1;
$v =~ s/^::/AI::Categorizer::\u${name}::/;
$opt{$k} = $v;
}
}
my $outfile;
unless ($outfile = delete $opt{outfile}) {
$outfile = $opt{progress_file} ? "$opt{progress_file}-results.txt" : "results.txt";
}
return (\%opt, \%do_stage, $outfile);
}
sub usage {
return <<EOF;
Usage:
$0 --parameter_1 <value_1> --parameter_2 <value_2>
# In a real-world application these Collection objects could be of any
# type (any Collection subclass). Or you could create each Document
# object manually. Or you could let the KnowledgeSet create the
# Collection objects for you.
$training = AI::Categorizer::Collection::Files->new( path => $training, %params );
$test = AI::Categorizer::Collection::Files->new( path => $test, %params );
# We turn on verbose mode so you can watch the progress of loading &
# training. This looks nicer if you have Time::Progress installed!
print "Loading training set\n";
my $k = AI::Categorizer::KnowledgeSet->new( verbose => 1 );
$k->load( collection => $training );
print "Training categorizer\n";
my $l = AI::Categorizer::Learner::NaiveBayes->new( verbose => 1 );
$l->train( knowledge_set => $k );
print "Categorizing test set\n";
eg/easy_guesser.pl
#!/usr/bin/perl
# This script can be helpful for getting a set of baseline scores for
# a categorization task. It simulates using the "Guesser" learner,
# but is much faster. Because it doesn't leverage using the whole
# framework, though, it expects everything to be in a very strict
# format. <cats-file> is in the same format as the 'category_file'
# parameter to the Collection class. <training-dir> and <test-dir>
# give paths to directories of documents, named as in <cats-file>.
use strict;
use Statistics::Contingency;
eg/easy_guesser.pl
while (defined(my $file = readdir $dh)) {
next if $file eq '.' or $file eq '..';
unless ($cats{$file}) {
warn "No category information for '$file'";
next;
}
my @assigned;
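# Simulate the Guesser: assign each category with probability equal to its training frequency.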
foreach (@cats) {
push @assigned, $_ if rand() < $freq{$_};
}
$c->add_result(\@assigned, $cats{$file});
}
print $c->stats_table(4);
lib/AI/Categorizer.pm
use AI::Categorizer::Learner;
use AI::Categorizer::Document;
use AI::Categorizer::Category;
use AI::Categorizer::Collection;
use AI::Categorizer::Hypothesis;
use AI::Categorizer::KnowledgeSet;
__PACKAGE__->valid_params
(
progress_file => { type => SCALAR, default => 'save' },
knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
learner => { isa => 'AI::Categorizer::Learner' },
verbose => { type => BOOLEAN, default => 0 },
training_set => { type => SCALAR, optional => 1 },
test_set => { type => SCALAR, optional => 1 },
data_root => { type => SCALAR, optional => 1 },
);
__PACKAGE__->contained_objects
(
lib/AI/Categorizer.pm
# delete $p->{stopwords} if $p->{stopword_file};
# return $p;
#}
sub knowledge_set { shift->{knowledge_set} }
sub learner { shift->{learner} }
# Combines several methods in one sub
sub run_experiment {
my $self = shift;
$self->scan_features;
$self->read_training_set;
$self->train;
$self->evaluate_test_set;
print $self->stats_table;
}
sub scan_features {
my $self = shift;
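# The KnowledgeSet's scan_first parameter (see the change log) controls whether this pre-scanning pass happens at all.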
return unless $self->knowledge_set->scan_first;
$self->knowledge_set->scan_features( path => $self->{training_set} );
$self->knowledge_set->save_features( "$self->{progress_file}-01-features" );
}
sub read_training_set {
my $self = shift;
$self->knowledge_set->restore_features( "$self->{progress_file}-01-features" )
if -e "$self->{progress_file}-01-features";
$self->knowledge_set->read( path => $self->{training_set} );
$self->_save_progress( '02', 'knowledge_set' );
return $self->knowledge_set;
}
sub train {
my $self = shift;
$self->_load_progress( '02', 'knowledge_set' );
$self->learner->train( knowledge_set => $self->{knowledge_set} );
$self->_save_progress( '03', 'learner' );
return $self->learner;
}
sub evaluate_test_set {
my $self = shift;
$self->_load_progress( '03', 'learner' );
my $c = $self->create_delayed_object('collection', path => $self->{test_set} );
$self->{experiment} = $self->learner->categorize_collection( collection => $c );
$self->_save_progress( '04', 'experiment' );
return $self->{experiment};
}
sub stats_table {
my $self = shift;
$self->_load_progress( '04', 'experiment' );
return $self->{experiment}->stats_table;
}
sub progress_file {
shift->{progress_file};
}
sub verbose {
shift->{verbose};
}
sub _save_progress {
my ($self, $stage, $node) = @_;
return unless $self->{progress_file};
my $file = "$self->{progress_file}-$stage-$node";
warn "Saving to $file\n" if $self->{verbose};
$self->{$node}->save_state($file);
}
sub _load_progress {
my ($self, $stage, $node) = @_;
return unless $self->{progress_file};
my $file = "$self->{progress_file}-$stage-$node";
warn "Loading $file\n" if $self->{verbose};
$self->{$node} = $self->contained_class($node)->restore_state($file);
}
1;
__END__
=head1 NAME
AI::Categorizer - Automatic Text Categorization
=head1 SYNOPSIS
use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);
# Run a complete experiment - training on a corpus, testing on a test
# set, printing a summary of results to STDOUT
$c->run_experiment;
# Or, run the parts of $c->run_experiment separately
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
=head1 DESCRIPTION
C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically
choose these features), what format the documents are in, and so on.
The basic process of using this module will typically involve
obtaining a collection of B<pre-categorized> documents, creating a
"knowledge set" representation of those documents, training a
categorizer on that knowledge set, and saving the trained categorizer
for later use. There are several ways to carry out this process. The
top-level C<AI::Categorizer> module provides an umbrella class for
high-level operations, or you may use the interfaces of the individual
classes in the framework.
A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus, is
distributed as eg/demo.pl .
Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?). Categorization of documents
is often a difficult task even for humans well-trained in the
particular domain of knowledge, and there are many things a human
would consider that none of these algorithms consider. These are only
statistical tests - at best they are neat tricks or helpful
assistants, and at worst they are totally unreliable. If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.
For the usage details, please see the documentation of each individual
module.
=head1 FRAMEWORK COMPONENTS
This section explains the major pieces of the C<AI::Categorizer>
object framework. We give a conceptual overview, but don't get into
any of the details about interfaces or usage. See the documentation
for the individual classes for more details.
lib/AI/Categorizer.pm
A diagram of the various classes in the framework can be seen in
C<doc/classes-overview.png>, and a more detailed view of the same
thing can be seen in C<doc/classes.png>.
=head2 Knowledge Sets
A "knowledge set" is defined as a collection of documents, together
with some information on the categories each document belongs to.
Note that this term is somewhat unique to this project - other sources
may call it a "training corpus", or "prior knowledge". A knowledge
set also contains some information on how documents will be parsed and
how their features (words) will be extracted and turned into
meaningful representations. In this sense, a knowledge set represents
not only a collection of data, but a particular view on that data.
A knowledge set is encapsulated by the
C<AI::Categorizer::KnowledgeSet> class. Before you can start playing
with categorizers, you will have to start playing with knowledge sets,
so that the categorizers have some data to train on. See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.
=head3 Feature selection
Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized. There are two main reasons for this - first, it
would mean that your training and categorizing processes would take
forever and use tons of memory, and second, the significant stuff of
the documents would get lost in the "noise" of the insignificant stuff.
The process of selecting the most important features in the training
set is called "feature selection". It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.
=head2 Collections
Because documents may be stored in lots of different formats, a
"collection" class has been created as an abstraction of a stored set
of documents, together with a way to iterate through the set and
return Document objects. A knowledge set contains a single collection
object. A C<Categorizer> doing a complete test run generally contains
two collections, one for training and one for testing. A C<Learner>
can mass-categorize a collection.
The C<AI::Categorizer::Collection> class and its subclasses
instantiate the idea of a collection in this sense.
=head2 Documents
Each document is represented by an C<AI::Categorizer::Document>
object, or an object of one of its subclasses. Each document class
contains methods for turning a bunch of data into a Feature Vector.
Each document also has a method to report which categories it belongs
to.
=head2 Categories
Each category is represented by an C<AI::Categorizer::Category>
object. Its main purpose is to keep track of which documents belong
to it, though you can also examine statistical properties of an entire
category, such as obtaining a Feature Vector representing an
amalgamation of all the documents that belong to it.
=head2 Machine Learning Algorithms
There are lots of different ways to make the inductive leap from the
training documents to unseen documents. The Machine Learning
community has studied many algorithms for this purpose. To allow
flexibility in choosing and configuring categorization algorithms,
each such algorithm is a subclass of C<AI::Categorizer::Learner>.
There are currently four categorizers included in the distribution:
=over 4
=item AI::Categorizer::Learner::NaiveBayes
A pure-perl implementation of a Naive Bayes classifier. No
dependencies on external modules or other resources. Naive Bayes is
usually very fast to train and fast to make categorization decisions,
but isn't always the most accurate categorizer.
=item AI::Categorizer::Learner::SVM
An interface to Corey Spencer's C<Algorithm::SVM>, which implements a
Support Vector Machine classifier. SVMs can take a while to train
(though in certain conditions there are optimizations to make them
quite fast), but are pretty quick to categorize. They often have very
good accuracy.
=item AI::Categorizer::Learner::DecisionTree
An interface to C<AI::DecisionTree>, which implements a Decision Tree
classifier. Decision Trees generally take longer to train than Naive
Bayes or SVM classifiers, but they are also quite fast when
categorizing. Decision Trees have the advantage that you can
scrutinize the structures of trained decision trees to see how
decisions are being made.
=item AI::Categorizer::Learner::Weka
An interface to version 2 of the Weka Knowledge Analysis system that
lets you use any of the machine learners it defines. This gives you
access to lots and lots of machine learning algorithms in use by
machine learning researchers. The main drawback is that Weka tends to
be quite slow and use a lot of memory, and the current interface
between Weka and C<AI::Categorizer> is a bit clumsy.
=back
Other machine learning methods that may be implemented soonish include
Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts
combiner for ensemble learning. No timetable for their creation has
yet been set.
Please see the documentation of these individual modules for more
details on their guts and quirks. See the C<AI::Categorizer::Learner>
documentation for a description of the general categorizer interface.
If you wish to create your own classifier, you should inherit from
C<AI::Categorizer::Learner> or C<AI::Categorizer::Learner::Boolean>,
which are abstract classes that manage some of the work for you.
=head2 Feature Vectors
Most categorization algorithms don't deal directly with documents'
data, they instead deal with a I<vector representation> of a
document's I<features>. The features may be any properties of the
document that seem helpful for determining its category, but they are usually
some version of the "most important" words in the document. A list of
features and their weights in each document is encapsulated by the
C<AI::Categorizer::FeatureVector> class. You may think of this class
as roughly analogous to a Perl hash, where the keys are the names of
features and the values are their weights.
=head2 Hypotheses
The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of
"statistical guess" of what categories this document should be
assigned to. Since you may be interested in any of several pieces of
information about the hypothesis (for instance, which categories were
assigned, which category was the single most likely category, the
scores assigned to each category, etc.), the hypothesis is returned as
an object of the C<AI::Categorizer::Hypothesis> class, and you can use
its object methods to get information about the hypothesis. See its
class documentation for the details.
=head2 Experiments
The C<AI::Categorizer::Experiment> class helps you organize the
results of categorization experiments. As you get lots of
categorization results (Hypotheses) back from the Learner, you can
feed these results to the Experiment class, along with the correct
answers. When all results have been collected, you can get a report
on accuracy, precision, recall, F1, and so on, with both
micro-averaging and macro-averaging over categories. We use the
C<Statistics::Contingency> module from CPAN to manage the
calculations. See the docs for C<AI::Categorizer::Experiment> for more
details.
=head1 METHODS
=over 4
lib/AI/Categorizer.pm
=item new()
Creates a new Categorizer object and returns it. Accepts lots of
parameters controlling behavior. In addition to the parameters listed
here, you may pass any parameter accepted by any class that we create
internally (the KnowledgeSet, Learner, Experiment, or Collection
classes), or any class that I<they> create. This is managed by the
C<Class::Container> module, so see
L<its documentation|Class::Container> for the details of how this
works.
The specific parameters accepted here are:
=over 4
=item progress_file
A string that indicates a place where objects will be saved during
several of the methods of this class. The default value is the string
C<save>, which means files like C<save-01-knowledge_set> will get
created. The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.
=item verbose
If true, a few status messages will be printed during execution.
=item training_set
Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.
=item test_set
Specifies the C<path> parameter that will be used when creating a
Collection during the C<evaluate_test_set()> method.
=item data_root
A shortcut for setting the C<training_set>, C<test_set>, and
lib/AI/Categorizer.pm
=item learner()
Returns the Learner object associated with this Categorizer. Before
C<train()>, the Learner will of course not be trained yet.
=item knowledge_set()
Returns the KnowledgeSet object associated with this Categorizer. If
C<read_training_set()> has not yet been called, the KnowledgeSet will
not yet be populated with any training data.
=item run_experiment()
Runs a complete experiment on the training and testing data, reporting
the results on C<STDOUT>. Internally, this is just a shortcut for
calling the C<scan_features()>, C<read_training_set()>, C<train()>,
and C<evaluate_test_set()> methods, then printing the value of the
C<stats_table()> method.
=item scan_features()
Scans the Collection specified in the C<training_set> parameter to
determine the set of features (words) that will be considered when
training the Learner. Internally, this calls the C<scan_features()>
method of the KnowledgeSet, then saves a list of the KnowledgeSet's
features for later use.
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
=item read_training_set()
Populates the KnowledgeSet with the data specified in the C<training_set>
parameter. Internally, this calls the C<read()> method of the
KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
object for later use.
=item train()
Calls the Learner's C<train()> method, passing it the KnowledgeSet
created during C<read_training_set()>. Returns the Learner object.
Also saves the Learner object for later use.
=item evaluate_test_set()
Creates a Collection based on the value of the C<test_set> parameter,
and calls the Learner's C<categorize_collection()> method using this
Collection. Returns the resultant Experiment object. Also saves the
Experiment object for later use in the C<stats_table()> method.
=item stats_table()
Returns the value of the Experiment's (as created by
C<evaluate_test_set()>) C<stats_table()> method. This is a string
that shows various statistics about the
accuracy/precision/recall/F1/etc. of the assignments made during
testing.
lib/AI/Categorizer.pm
=head1 REFERENCES
An excellent introduction to the academic field of Text Categorization
is Fabrizio Sebastiani's "Machine Learning in Automated Text
Categorization": ACM Computing Surveys, Vol. 34, No. 1, March 2002,
pp. 1-47.
=head1 COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This distribution is free software; you can redistribute it and/or
modify it under the same terms as Perl itself. These terms apply to
every file in the distribution - if you have questions, please contact
the author.
=cut
lib/AI/Categorizer/Category.pm
default => [],
callbacks => { 'all are Document objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Document'), @_ },
},
public => 0,
},
);
__PACKAGE__->contained_objects
(
features => {
class => 'AI::Categorizer::FeatureVector',
delayed => 1,
},
);
my %REGISTRY = ();
sub new {
my $self = shift()->SUPER::new(@_);
$self->{documents} = new AI::Categorizer::ObjectSet( @{$self->{documents}} );
lib/AI/Categorizer/Category.pm
return wantarray ? $d->members : $d->size;
}
sub contains_document {
return $_[0]->{documents}->includes( $_[1] );
}
sub add_document {
my $self = shift;
$self->{documents}->insert( $_[0] );
delete $self->{features}; # Could be more efficient?
}
sub features {
my $self = shift;
if (@_) {
$self->{features} = shift;
}
return $self->{features} if $self->{features};
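# No cached vector yet - build one by summing the feature vectors of all member documents.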
my $v = $self->create_delayed_object('features');
return $self->{features} = $v unless $self->documents;
foreach my $document ($self->documents) {
$v->add( $document->features );
}
return $self->{features} = $v;
}
1;
__END__
=head1 NAME
AI::Categorizer::Category - A named category of documents
=head1 SYNOPSIS
my $category = AI::Categorizer::Category->by_name("sports");
my $name = $category->name;
my @docs = $category->documents;
my $num_docs = $category->documents;
my $features = $category->features;
$category->add_document($doc);
if ($category->contains_document($doc)) { ...
=head1 DESCRIPTION
This simple class represents a named category which may contain zero
or more documents. Each category is a "singleton" by name, so two
Category objects with the same name should not be created at once.
=head1 METHODS
=over 4
=item new()
Creates a new Category object and returns it. Accepts the following
lib/AI/Categorizer/Category.pm
=item by_name(name => $string)
Returns the Category object with the given name, or creates one if no
such object exists.
=item documents()
Returns a list of the Document objects in this category in a list
context, or the number of such objects in a scalar context.
=item features()
Returns a FeatureVector object representing the sum of all the
FeatureVectors of the Documents in this Category.
=item add_document($document)
Informs the Category that the given Document belongs to it.
=item contains_document($document)
Returns true if the given document belongs to this category, or false
otherwise.
=back
=head1 AUTHOR
Ken Williams, ken@mathforum.org
=head1 COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3), Storable(3)
=cut
lib/AI/Categorizer/Collection.pm
Returns the next Document object in the Collection.
=item rewind()
Resets the iterator for further calls to C<next()>.
=item count_documents()
Returns the total number of documents in the Collection. Note that
this usually resets the iterator. This is because it may not be
possible to resume iterating where we left off.
=back
=head1 AUTHOR
Ken Williams, ken@mathforum.org
=head1 COPYRIGHT
Copyright 2002-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3), Storable(3)
=cut
lib/AI/Categorizer/Collection/DBI.pm
if ($self->{sth}{Active}) {
$self->{sth}->finish;
}
$self->{sth}->execute;
}
sub next {
my $self = shift;
my @result = $self->{sth}->fetchrow_array;
return undef unless @result;
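# Assumes the statement returns (name, category, content) columns, in that order.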
return $self->create_delayed_object('document',
name => $result[0],
categories => [$result[1]],
content => $result[2],
);
}
1;
lib/AI/Categorizer/Collection/Files.pm
=back
=back
=head1 AUTHOR
Ken Williams, ken@mathforum.org
=head1 COPYRIGHT
Copyright 2002-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer::Collection(3)
=cut
lib/AI/Categorizer/Document.pm
default => undef,
},
parse => {
type => SCALAR,
optional => 1,
},
parse_handle => {
type => HANDLE,
optional => 1,
},
features => {
isa => 'AI::Categorizer::FeatureVector',
optional => 1,
},
content_weights => {
type => HASHREF,
default => {},
},
front_bias => {
type => SCALAR,
default => 0,
},
use_features => {
type => HASHREF|UNDEF,
default => undef,
},
stemming => {
type => SCALAR|UNDEF,
optional => 1,
},
stopword_behavior => {
type => SCALAR,
default => "stem",
},
);
__PACKAGE__->contained_objects
(
features => { delayed => 1,
class => 'AI::Categorizer::FeatureVector' },
);
### Constructors
my $NAME = 'a';
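# Default document names use Perl's magic string increment: 'a', 'b', ..., 'z', 'aa', 'ab', ...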
sub new {
my $pkg = shift;
my $self = $pkg->SUPER::new(name => $NAME++, # Use a default name
@_);
# Get efficient internal data structures
$self->{categories} = new AI::Categorizer::ObjectSet( @{$self->{categories}} );
$self->_fix_stopwords;
# A few different ways for the caller to initialize the content
if (exists $self->{parse}) {
$self->parse(content => delete $self->{parse});
} elsif (exists $self->{parse_handle}) {
$self->parse_handle(handle => delete $self->{parse_handle});
lib/AI/Categorizer/Document.pm
# This flag is attached to the stopword structure itself so that
# other documents will notice it.
$s->{___stemmed} = 1;
}
sub finish {
my $self = shift;
$self->create_feature_vector;
# Now we're done with all the content stuff
delete @{$self}{'content', 'content_weights', 'stopwords', 'use_features'};
}
# Parse a document format - a virtual method
sub parse;
sub parse_handle {
my ($self, %args) = @_;
my $fh = $args{handle} or die "No 'handle' argument given to parse_handle()";
return $self->parse( content => join '', <$fh> );
}
### Accessors
sub name { $_[0]->{name} }
sub stopword_behavior { $_[0]->{stopword_behavior} }
sub features {
my $self = shift;
if (@_) {
$self->{features} = shift;
}
return $self->{features};
}
sub categories {
my $c = $_[0]->{categories};
return wantarray ? $c->members : $c->size;
}
### Workers
sub create_feature_vector {
my $self = shift;
my $content = $self->{content};
my $weights = $self->{content_weights};
die "'stopword_behavior' must be one of 'stem', 'no_stem', or 'pre_stemmed'"
unless $self->{stopword_behavior} =~ /^(?:stem|no_stem|pre_stemmed)$/;
$self->{features} = $self->create_delayed_object('features');
while (my ($name, $data) = each %$content) {
my $t = $self->tokenize($data);
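# 'no_stem' filters stopwords before stemming; 'stem' and 'pre_stemmed' filter after (see the stopword_behavior docs).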
$t = $self->_filter_tokens($t) if $self->{stopword_behavior} eq 'no_stem';
$self->stem_words($t);
$t = $self->_filter_tokens($t) if $self->{stopword_behavior} =~ /^(?:stem|pre_stemmed)$/;
my $h = $self->vectorize(tokens => $t, weight => exists($weights->{$name}) ? $weights->{$name} : 1 );
$self->{features}->add($h);
}
}
sub is_in_category {
return (ref $_[1]
? $_[0]->{categories}->includes( $_[1] )
: $_[0]->{categories}->includes_name( $_[1] ));
}
lib/AI/Categorizer/Document.pm
}
sub stem_words {
my ($self, $tokens) = @_;
return unless $self->{stemming};
return if $self->{stemming} eq 'none';
die "Unknown stemming option '$self->{stemming}' - options are 'porter' or 'none'"
unless $self->{stemming} eq 'porter';
eval {require Lingua::Stem; 1}
or die "Porter stemming requires the Lingua::Stem module, available from CPAN.\n";
@$tokens = @{ Lingua::Stem::stem(@$tokens) };
}
sub _filter_tokens {
my ($self, $tokens_in) = @_;
if ($self->{use_features}) {
my $f = $self->{use_features}->as_hash;
return [ grep exists($f->{$_}), @$tokens_in ];
} elsif ($self->{stopwords} and keys %{$self->{stopwords}}) {
my $s = $self->{stopwords};
return [ grep !exists($s->{$_}), @$tokens_in ];
}
return $tokens_in;
}
sub _weigh_tokens {
my ($self, $tokens, $weight) = @_;
lib/AI/Categorizer/Document.pm
my $self = $class->new(%args);
open my($fh), "< $path" or die "$path: $!";
$self->parse_handle(handle => $fh);
close $fh;
$self->finish;
return $self;
}
sub dump_features {
my ($self, %args) = @_;
my $path = $args{path} or die "No 'path' argument given to dump_features()";
open my($fh), "> $path" or die "Can't create $path: $!";
my $f = $self->features->as_hash;
while (my ($k, $v) = each %$f) {
print $fh "$k\t$v\n";
}
}
1;
__END__
=head1 NAME
lib/AI/Categorizer/Document.pm
# Other parameters are accepted:
my $d = new AI::Categorizer::Document(name => $string,
categories => \@category_objects,
content => { subject => $string,
body => $string2, ... },
content_weights => { subject => 3,
body => 1, ... },
stopwords => \%skip_these_words,
stemming => $string,
front_bias => $float,
use_features => $feature_vector,
);
# Specify explicit feature vector:
my $d = new AI::Categorizer::Document(name => $string);
$d->features( $feature_vector );
# Now pass the document to a categorization algorithm:
my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
my $hypothesis = $learner->categorize($document);
=head1 DESCRIPTION
The Document class embodies the data in a single document, and
contains methods for turning this data into a FeatureVector. Usually
documents are plain text, but subclasses of the Document class may
handle any kind of data.
=head1 METHODS
lib/AI/Categorizer/Document.pm
A string that identifies this document. Required.
=item content
The raw content of this document. May be specified as either a string
or as a hash reference, allowing structured document types.
=item content_weights
A hash reference indicating the weights that should be assigned to
features in different sections of a structured document when creating
its feature vector. The weight is a multiplier of the feature vector
values. For instance, if a C<subject> section has a weight of 3 and a
C<body> section has a weight of 1, and word counts are used as feature
vector values, then it will be as if all words appearing in the
C<subject> appeared 3 times.
If no weights are specified, all weights are set to 1.
=item front_bias
lib/AI/Categorizer/Document.pm
document. Negative numbers indicate the opposite. A bias of 0
indicates that no biasing should be done.
=item categories
A reference to an array of Category objects that this document belongs
to. Optional.
=item stopwords
A list/hash of features (words) that should be ignored when parsing
document content. A hash reference is preferred, with the features as
the keys. If you pass an array reference containing the features, it
will be converted to a hash reference internally.
=item use_features
A Feature Vector specifying the only features that should be
considered when parsing this document. This is an alternative to
using C<stopwords>.
=item stemming
Indicates the linguistic procedure that should be used to convert
tokens in the document to features. Possible values are C<none>,
which indicates that the tokens should be used without change, or
C<porter>, indicating that the Porter stemming algorithm should be
applied to each token. This requires the C<Lingua::Stem> module from
CPAN.
=item stopword_behavior
There are a few ways you might want the stopword list (specified with
the C<stopwords> parameter) to interact with the stemming algorithm
(specified with the C<stemming> parameter). These options can be
controlled with the C<stopword_behavior> parameter, which can take the
following values:
lib/AI/Categorizer/Document.pm
Stem stopwords according to 'stemming' parameter, then match them
against stemmed document words.
=item pre_stemmed
Stopwords are already stemmed, match them against stemmed document
words.
=back
The default value is C<stem>, which seems to produce the best results
in most cases I've tried. I'm not aware of any studies comparing the
C<no_stem> behavior to the C<stem> behavior in the general case.
This parameter has no effect if there are no stopwords being used, or
if stemming is not being used. In the latter case, the list of
stopwords will always be matched as-is against the document words.
Note that if the C<stem> option is used, the data structure passed as
the C<stopwords> parameter will be modified in-place to contain the
stemmed versions of the stopwords supplied.
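A sketch of how these parameters fit together (values are illustrative,
and $text stands for the document's raw content):

    my $d = AI::Categorizer::Document->new
      (
       name      => 'example',
       content   => $text,
       stemming  => 'porter',
       stopwords => { running => 1 },   # stemmed in place under 'stem'
       stopword_behavior => 'stem',
      );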
lib/AI/Categorizer/Document.pm
=item parse( content =E<gt> $content )
=item name()
Returns this document's C<name> property as specified when the
document was created.
=item features()
Returns the Feature Vector associated with this document.
=item categories()
In a list context, returns a list of Category objects to which this
document belongs. In a scalar context, returns the number of such
categories.
=item create_feature_vector()
lib/AI/Categorizer/Document/XML.pm
# Input: a hash whose 'weights' entry maps element names to weights
# Output: an object of this class
# Description: constructor
sub new {
my ($class, %args) = @_;
# Call the superclass constructor (e.g. XML::SAX::Base)
my $self = $class->SUPER::new;
# Save the hash of element weights (pairs of <elementName, weight>).
# A weight is the number of times the corresponding element's content
# is duplicated. It is supplied by the caller at construction time,
# and we keep it for the duplication done in end_element().
$self->{weightHash} = $args{weights};
# Storage for the data produced by Text, CDataSection, etc.
$self->{content} = '';
# This array stores the data for every element from the root down to
# the element currently being visited; only entries 0..($levelPointer-1)
# are valid. It records the starting location (index) of the content
# for each element,
lib/AI/Categorizer/Experiment.pm
UNIVERSAL::isa($c, 'HASH') ? keys(%$c) : @$c
};
return $self;
}
sub add_hypothesis {
my ($self, $h, $correct, $name) = @_;
die "No hypothesis given to add_hypothesis()" unless $h;
$name = $h->document_name unless defined $name;
$self->add_result([$h->categories], $correct, $name);
}
sub stats_table {
my $self = shift;
$self->SUPER::stats_table($self->{sig_figs});
}
1;
__END__
=head1 NAME
AI::Categorizer::Experiment - Coordinate experimental results
=head1 SYNOPSIS
use AI::Categorizer::Experiment;
my $e = new AI::Categorizer::Experiment(categories => \%categories);
my $l = AI::Categorizer::Learner->restore_state(...path...);
while (my $d = ... get document ...) {
my $h = $l->categorize($d); # A Hypothesis
$e->add_hypothesis($h, [map $_->name, $d->categories]);
}
print "Micro F1: ", $e->micro_F1, "\n"; # Access a single statistic
print $e->stats_table; # Show several stats in table form
=head1 DESCRIPTION
The C<AI::Categorizer::Experiment> class helps you organize the
results of categorization experiments. As you get lots of
categorization results (Hypotheses) back from the Learner, you can
feed these results to the Experiment class, along with the correct
answers. When all results have been collected, you can get a report
on accuracy, precision, recall, F1, and so on, with both
macro-averaging and micro-averaging over categories.
=head1 METHODS
The general execution flow when using this class is to create an
Experiment object, add a bunch of Hypotheses to it, and then report on
the results.
Internally, C<AI::Categorizer::Experiment> inherits from the
C<Statistics::Contingency>. Please see the documentation of
C<Statistics::Contingency> for a description of its interface. All of
its methods are available here, with the following additions:
=over 4
=item new( categories => \%categories )
lib/AI/Categorizer/Experiment.pm
Returns a new Experiment object. A required C<categories> parameter
specifies the names of all categories in the data set. The category
names may be specified either as the keys in a reference to a hash, or as
the entries in a reference to an array.
The C<new()> method also accepts a C<verbose> parameter; when it is
set to a true value, some status/debugging information will be printed
to C<STDOUT>.
A C<sig_figs> parameter indicates the number of significant figures
that should be used when showing the results in the C<stats_table()>
method. It does not affect other methods like C<micro_precision()>.
=item add_result($assigned, $correct, $name)
Adds a new result to the experiment. Please see the
C<Statistics::Contingency> documentation for a description of this
method.
=item add_hypothesis($hypothesis, $correct_categories)
Adds a new result to the experiment. The first argument is an
C<AI::Categorizer::Hypothesis> object such as one generated by a
Learner's C<categorize()> method. The list of correct categories can
be given as an array of category names (strings), as a hash whose keys
are the category names and whose values are anything logically true,
or as a single string if there is only one category. For example, all
of the following are legal:
$e->add_hypothesis($h, "sports");
$e->add_hypothesis($h, ["sports", "finance"]);
$e->add_hypothesis($h, {sports => 1, finance => 1});
lib/AI/Categorizer/FeatureSelector.pm
use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Util;
use Carp qw(croak);
__PACKAGE__->valid_params
(
features_kept => {
type => SCALAR,
default => 0.2,
},
verbose => {
type => SCALAR,
default => 0,
},
);
sub verbose {
my $self = shift;
$self->{verbose} = shift if @_;
return $self->{verbose};
}
sub reduce_features {
# Takes a feature vector whose weights are "feature scores", and
# keeps only the highest-scoring n features. n is specified by the
# 'features_kept' parameter. If it's zero, all features are kept.
# If it's between 0 and 1, it is treated as a fraction of the current
# number of features. If it's greater than 1, it is treated as the
# absolute number of features to keep.
my ($self, $f, %args) = @_;
my $kept = defined $args{features_kept} ? $args{features_kept} : $self->{features_kept};
return $f unless $kept;
my $num_kept = ($kept < 1 ?
$f->length * $kept :
$kept);
print "Trimming features - # features = " . $f->length . "\n" if $self->verbose;
# This is algorithmic overkill, but the sort seems fast enough. Will revisit later.
my $features = $f->as_hash;
my @new_features = (sort {$features->{$b} <=> $features->{$a}} keys %$features)
[0 .. $num_kept-1];
my $result = $f->intersection( \@new_features );
print "Finished trimming features - # features = " . $result->length . "\n" if $self->verbose;
return $result;
}
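# A sketch of the arithmetic above (illustrative values, not module
# code): with features_kept => 0.2 and a 50-feature vector, $num_kept
# becomes 50 * 0.2 = 10; with features_kept => 2000 the 2000
# highest-scoring features are kept; and with features_kept => 0 the
# vector is returned unchanged.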
# Abstract methods
sub rank_features;
sub scan_features;
sub select_features {
my ($self, %args) = @_;
die "No knowledge_set parameter provided to select_features()"
unless $args{knowledge_set};
my $f = $self->rank_features( knowledge_set => $args{knowledge_set} );
return $self->reduce_features( $f, features_kept => $args{features_kept} );
}
1;
__END__
=head1 NAME
AI::Categorizer::FeatureSelector - Abstract Feature Selection class
lib/AI/Categorizer/FeatureSelector.pm
=item new()
Creates a new KnowledgeSet and returns it. Accepts the following
parameters:
=over 4
=item load
If a C<load> parameter is present, the C<load()> method will be
invoked immediately. If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>. If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.
=item categories
An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet. If used, the
C<documents> parameter should also be specified.
=item documents
An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet. If used, the
C<categories> parameter should also be specified.
=item features_kept
A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents. May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used. The default is
0.2.
=item feature_selection
A string indicating the type of feature selection that should be
performed. Currently the only option is also the default option:
C<document_frequency>.
=item tfidf_weighting
lib/AI/Categorizer/FeatureSelector.pm
be multiplied for each feature to find the final vector value for that
feature. The default weighting is C<xxx>.
The first character specifies the "term frequency" component, which
can take the following values:
=over 4
=item b
Binary weighting - 1 for terms present in a document, 0 for terms absent.
=item t
Raw term frequency - equal to the number of times a feature occurs in
the document.
=item x
A synonym for 't'.
lib/AI/Categorizer/FeatureSelector.pm
Apply cosine normalization - multiply by 1/length(document_vector).
=item x
No change - multiply by 1.
=back
The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.
=item verbose
If set to a true value, some status/debugging information will be
output on C<STDOUT>.
=back
=item categories()
lib/AI/Categorizer/FeatureSelector.pm
=item documents()
In a list context returns a list of all Document objects in this
KnowledgeSet. In a scalar context returns the number of such objects.
=item document()
Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.
=item features()
Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.
=item verbose()
Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.
=item scan_stats()
Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection. (XXX need to describe stats)
=item scan_features()
This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner. This process is known as "feature selection", and it's a
very important part of categorization.
The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.
The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.
This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies. Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.
=item save_features()
Given the name of a file, this method writes the features (as
determined by the C<scan_features> method) to the file.
=item restore_features()
Given the name of a file written by C<save_features>, loads the
features from that file and passes them as the C<use_features>
parameter for any Document objects created in the future by this
KnowledgeSet.
=item read()
Iterates through a Collection of documents and adds them to the
KnowledgeSet. The Collection can be specified using a C<collection>
parameter - otherwise, specify the arguments to pass to the C<new()>
method of the Collection class.
lib/AI/Categorizer/FeatureSelector.pm
categories I<by name>. These are the categories that the document
belongs to. Any other parameters will be passed to the Document
class's C<new()> method.
=item finish()
This method will be called prior to training the Learner. Its purpose
is to perform any operations (such as feature vector weighting) that
may require examination of the entire KnowledgeSet.
=item weigh_features()
This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.
=item document_frequency()
Given a single feature (word) as an argument, this method will return
the number of documents in the KnowledgeSet that contain that feature.
=item partition()
Divides the KnowledgeSet into several subsets. This may be useful for
performing cross-validation. The relative sizes of the subsets should
lib/AI/Categorizer/FeatureSelector.pm
partitions will be returned as a list.
=back
=head1 AUTHOR
Ken Williams, ken@mathforum.org
=head1 COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3)
=cut
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
package AI::Categorizer::FeatureSelector::CategorySelector;
use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector);
use Params::Validate qw(:types);
__PACKAGE__->contained_objects
(
features => { class => 'AI::Categorizer::FeatureVector',
delayed => 1 },
);
1;
sub reduction_function;
# figure out the feature set before reading collection (default)
sub scan_features {
my ($self, %args) = @_;
my $c = $args{collection} or
die "No 'collection' parameter provided to scan_features()";
return unless $self->{features_kept};
my %cat_features;
my $coll_features = $self->create_delayed_object('features');
my $nbDocuments = 0;
while (my $doc = $c->next) {
$nbDocuments++;
$args{prog_bar}->() if $args{prog_bar};
my $docfeatures = $doc->features->as_hash;
foreach my $cat ($doc->categories) {
my $catname = $cat->name;
if(!(exists $cat_features{$catname})) {
$cat_features{$catname} = $self->create_delayed_object('features');
}
$cat_features{$catname}->add($docfeatures);
}
$coll_features->add( $docfeatures );
}
print STDERR "\n* Computing Chi-Square values\n" if $self->verbose;
my $r_features = $self->create_delayed_object('features');
my @terms = $coll_features->names;
my $progressBar = $self->prog_bar(scalar @terms);
my $allFeaturesSum = $coll_features->sum;
my %cat_features_sum;
while( my($catname,$features) = each %cat_features ) {
$cat_features_sum{$catname} = $features->sum;
}
foreach my $term (@terms) {
$progressBar->();
$r_features->{features}{$term} = $self->reduction_function($term,
$nbDocuments,$allFeaturesSum,$coll_features,
\%cat_features,\%cat_features_sum);
}
print STDERR "\n" if $self->verbose;
my $new_features = $self->reduce_features($r_features);
return $coll_features->intersection( $new_features );
}
# calculate feature set after reading collection (scan_first=0)
sub rank_features {
die "CategorySelector->rank_features is not implemented yet!";
# my ($self, %args) = @_;
#
# my $k = $args{knowledge_set}
# or die "No knowledge_set parameter provided to rank_features()";
#
# my %freq_counts;
# foreach my $name ($k->features->names) {
# $freq_counts{$name} = $k->document_frequency($name);
# }
# return $self->create_delayed_object('features', features => \%freq_counts);
}
# copied from KnowledgeSet->prog_bar by Ken Williams
sub prog_bar {
my ($self, $count) = @_;
return sub {} unless $self->verbose;
return sub { print STDERR '.' } unless eval "use Time::Progress; 1";
my $pb = 'Time::Progress'->new;
$pb->attr(max => $count);
my $i = 0;
return sub {
$i++;
return if $i % 25;
print STDERR $pb->report("%50b %p ($i/$count)\r", $i);
};
}
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
AI::Categorizer::FeatureSelector::CategorySelector - Abstract Category Selection class
=head1 SYNOPSIS
This class is abstract. For an example of instantiation, see
ChiSquare.
=head1 DESCRIPTION
A base class for FeatureSelectors that compute their global feature
scores from per-category feature counts.
=head1 METHODS
=head1 AUTHOR
Francois Paradis, paradifr@iro.umontreal.ca
with inspiration from Ken Williams' AI::Categorizer code
=cut
lib/AI/Categorizer/FeatureSelector/ChiSquare.pm
use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector::CategorySelector);
use Params::Validate qw(:types);
# Chi-Square function
# NB: this could probably be optimised a bit...
sub reduction_function {
my ($self,$term,$N,$allFeaturesSum,
$coll_features,$cat_features,$cat_features_sum) = @_;
my $CHI2SUM = 0;
my $nbcats = 0;
foreach my $catname (keys %{$cat_features}) {
# while ( my ($catname,$catfeatures) = each %{$cat_features}) {
my ($A,$B,$C,$D); # A = number of times where t and c co-occur
# B = " " " t occurs without c
# C = " " " c occurs without t
# D = " " " neither c nor t occur
$A = $cat_features->{$catname}->value($term);
$B = $coll_features->value($term) - $A;
$C = $cat_features_sum->{$catname} - $A;
$D = $allFeaturesSum - ($A+$B+$C);
my $ADminCB = ($A*$D)-($C*$B);
my $CHI2 = $N*$ADminCB*$ADminCB / (($A+$C)*($B+$D)*($A+$B)*($C+$D));
$CHI2SUM += $CHI2;
$nbcats++;
}
return $CHI2SUM/$nbcats;
}
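# Worked example of the formula above (illustrative numbers only): with
# N = 100 and, for one category, A = 3, B = 7, C = 17, D = 73, the
# numerator is 100 * (3*73 - 17*7)**2 = 100 * 100**2 = 1_000_000 and the
# denominator is 20 * 80 * 10 * 90 = 1_440_000, so that category
# contributes a chi-square of about 0.69 to the average.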
1;
lib/AI/Categorizer/FeatureSelector/ChiSquare.pm
use AI::Categorizer::KnowledgeSetSMART;
my $ksetCHI = new AI::Categorizer::KnowledgeSetSMART(
tfidf_notation =>'Categorizer',
feature_selection=>'chi_square', ...other parameters...);
# however it is also possible to pass an instance to the KnowledgeSet
use AI::Categorizer::KnowledgeSet;
use AI::Categorizer::FeatureSelector::ChiSquare;
my $ksetCHI = new AI::Categorizer::KnowledgeSet(
feature_selector => AI::Categorizer::FeatureSelector::ChiSquare->new(features_kept=>2000, verbose=>1),
...other parameters...
);
=head1 DESCRIPTION
Feature selection with the ChiSquare function.
Chi-Square(t,ci) = N*(AD-CB)^2 / ((A+C)*(B+D)*(A+B)*(C+D))
lib/AI/Categorizer/FeatureSelector/DocFrequency.pm
use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector);
use Params::Validate qw(:types);
use Carp qw(croak);
__PACKAGE__->contained_objects
(
features => { class => 'AI::Categorizer::FeatureVector',
delayed => 1 },
);
# The KnowledgeSet keeps track of document frequency, so just use that.
sub rank_features {
my ($self, %args) = @_;
my $k = $args{knowledge_set} or die "No knowledge_set parameter provided to rank_features()";
my %freq_counts;
foreach my $name ($k->features->names) {
$freq_counts{$name} = $k->document_frequency($name);
}
return $self->create_delayed_object('features', features => \%freq_counts);
}
sub scan_features {
my ($self, %args) = @_;
my $c = $args{collection} or die "No 'collection' parameter provided to scan_features()";
my $doc_freq = $self->create_delayed_object('features');
while (my $doc = $c->next) {
$args{prog_bar}->() if $args{prog_bar};
$doc_freq->add( $doc->features->as_boolean_hash );
}
print "\n" if $self->verbose;
return $self->reduce_features($doc_freq);
}
1;
__END__
=head1 NAME
AI::Categorizer::FeatureSelector::DocFrequency - Feature selection by document frequency
lib/AI/Categorizer/FeatureVector.pm
package AI::Categorizer::FeatureVector;
sub new {
my ($package, %args) = @_;
$args{features} ||= {};
return bless {features => $args{features}}, $package;
}
sub names {
my $self = shift;
return keys %{$self->{features}};
}
sub set {
my $self = shift;
$self->{features} = (ref $_[0] ? $_[0] : {@_});
}
sub as_hash {
my $self = shift;
return $self->{features};
}
sub euclidean_length {
my $self = shift;
my $f = $self->{features};
my $total = 0;
foreach (values %$f) {
$total += $_**2;
}
return sqrt($total);
}
sub normalize {
my $self = shift;
my $length = $self->euclidean_length;
return $length ? $self->scale(1/$length) : $self;
}
sub scale {
my ($self, $scalar) = @_;
$_ *= $scalar foreach values %{$self->{features}};
return $self;
}
sub as_boolean_hash {
my $self = shift;
return { map {($_ => 1)} keys %{$self->{features}} };
}
sub length {
my $self = shift;
return scalar keys %{$self->{features}};
}
sub clone {
my $self = shift;
return ref($self)->new( features => { %{$self->{features}} } );
}
sub intersection {
my ($self, $other) = @_;
$other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
my $common;
if (UNIVERSAL::isa($other, 'ARRAY')) {
$common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} @$other};
} elsif (UNIVERSAL::isa($other, 'HASH')) {
$common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} keys %$other};
}
return ref($self)->new( features => $common );
}
sub add {
my ($self, $other) = @_;
$other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
while (my ($k,$v) = each %$other) {
$self->{features}{$k} += $v;
}
}
sub dot {
my ($self, $other) = @_;
$other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
my $sum = 0;
my $f = $self->{features};
while (my ($k, $v) = each %$f) {
$sum += $other->{$k} * $v if exists $other->{$k};
}
return $sum;
}
sub sum {
my ($self) = @_;
# Return total of values in this vector
my $total = 0;
$total += $_ foreach values %{ $self->{features} };
return $total;
}
sub includes {
return exists $_[0]->{features}{$_[1]};
}
sub value {
return $_[0]->{features}{$_[1]};
}
sub values {
my $self = shift;
return @{ $self->{features} }{ @_ };
}
1;
__END__
=head1 NAME
AI::Categorizer::FeatureVector - Features vs. Values
=head1 SYNOPSIS
my $f1 = new AI::Categorizer::FeatureVector
(features => {howdy => 2, doody => 3});
my $f2 = new AI::Categorizer::FeatureVector
(features => {doody => 1, whopper => 2});
@names = $f1->names;
$x = $f1->length;
$x = $f1->sum;
$x = $f1->includes('howdy');
$x = $f1->value('howdy');
$x = $f1->dot($f2);
$f3 = $f1->clone;
$f3 = $f1->intersection($f2);
$f3 = $f1->add($f2);
$h = $f1->as_hash;
$h = $f1->as_boolean_hash;
$f1->normalize;
=head1 DESCRIPTION
This class implements a "feature vector", which is a flat data
structure indicating the values associated with a set of features. At
its base level, a FeatureVector usually represents the set of words in
a document, with the value for each feature indicating the number of
times each word appears in the document. However, the values are
arbitrary so they can represent other quantities as well, and
FeatureVectors may also be combined to represent the features of
multiple documents.
=head1 METHODS
=over 4
=item ...
=back
=head1 AUTHOR
Ken Williams, ken@mathforum.org
=head1 COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3), Storable(3)
=cut
lib/AI/Categorizer/Hypothesis.pm
use strict;
use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
all_categories => {type => ARRAYREF},
scores => {type => HASHREF},
threshold => {type => SCALAR},
document_name => {type => SCALAR, optional => 1},
);
sub all_categories { @{$_[0]->{all_categories}} }
sub document_name { $_[0]->{document_name} }
sub threshold { $_[0]->{threshold} }
sub best_category {
my ($self) = @_;
my $sc = $self->{scores};
return unless %$sc;
my ($best_cat, $best_score) = each %$sc;
while (my ($key, $val) = each %$sc) {
($best_cat, $best_score) = ($key, $val) if $val > $best_score;
}
return $best_cat;
}
sub in_category {
my ($self, $cat) = @_;
return '' unless exists $self->{scores}{$cat};
return $self->{scores}{$cat} > $self->{threshold};
}
sub categories {
my $self = shift;
return @{$self->{cats}} if $self->{cats};
$self->{cats} = [sort {$self->{scores}{$b} <=> $self->{scores}{$a}}
grep {$self->{scores}{$_} >= $self->{threshold}}
keys %{$self->{scores}}];
return @{$self->{cats}};
}
sub scores {
my $self = shift;
return @{$self->{scores}}{@_};
}
1;
__END__
=head1 NAME
AI::Categorizer::Hypothesis - Embodies a set of category assignments
=head1 SYNOPSIS
use AI::Categorizer::Hypothesis;
# Hypotheses are usually created by the Learner's categorize() method.
# (assume here that $learner and $document have been created elsewhere)
my $h = $learner->categorize($document);
print "Assigned categories: ", join ', ', $h->categories, "\n";
print "Best category: ", $h->best_category, "\n";
print "Assigned scores: ", join ', ', $h->scores( $h->categories ), "\n";
print "Chosen from: ", join ', ', $h->all_categories, "\n";
print +($h->in_category('geometry') ? '' : 'not '), "assigned to geometry\n";
=head1 DESCRIPTION
A Hypothesis embodies a set of category assignments that a categorizer
makes about a single document. Because one may be interested in
knowing different kinds of things about the assignments (for instance,
what categories were assigned, which category had the highest score,
whether a particular category was assigned), we provide a simple class
to help facilitate these scenarios.
=head1 METHODS
=over 4
=item new(%parameters)
lib/AI/Categorizer/Hypothesis.pm
The following parameters are accepted when creating a new Hypothesis
(a small constructed example follows this list):
=over 4
=item all_categories
A required parameter which gives the set of all categories that could
possibly be assigned to. The categories should be specified as a
reference to an array of category names (as strings).
=item scores
A hash reference indicating the assignment score for each category.
Any score higher than the C<threshold> will be considered to be
assigned.
=item threshold
A number controlling which categories should be assigned - any
category whose score is greater than or equal to C<threshold> will be
assigned, any category whose score is lower than C<threshold> will not
be assigned.
=item document_name
An optional string parameter indicating the name of the document about
which this hypothesis was made.
=back
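Although Hypotheses are normally produced by a Learner's
C<categorize()> method, the parameters above are sufficient to build
one by hand, which can be handy in tests. A minimal sketch (the
category names and scores are invented):
use AI::Categorizer::Hypothesis;
my $h = AI::Categorizer::Hypothesis->new
(all_categories => ['sports', 'finance'],
scores => {sports => 0.8, finance => 0.2},
threshold => 0.5,
document_name => 'example-doc');
print $h->best_category; # prints "sports"
print $h->in_category('finance'); # false: 0.2 is not above the threshold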
=item categories()
Returns an ordered list of the categories the document was placed in,
with best matches first. Categories are returned by their string names.
=item best_category()
Returns the name of the category with the highest score in this
hypothesis. Bear in mind that this category may not actually be
assigned if no categories' scores exceed the threshold.
=item in_category($name)
Returns true or false depending on whether the document was placed in
the given category.
=item scores(@names)
Returns a list of result scores for the given categories. Since the
interface is still changing, and since different Learners implement
scoring in different ways, not very much can officially be said
about the scores, except that a good score is higher than a bad
score. Individual Learners will have their own procedures for
determining scores, so you cannot compare one Learner's score with
another Learner's - for instance, one Learner might always give scores
between 0 and 1, and another Learner might always return scores less
than 0. You often cannot compare scores from a single Learner on two
different categorization tasks either.
=item all_categories()
Returns the list of category names specified with the
C<all_categories> constructor parameter.
=item document_name()
Returns the value of the C<document_name> parameter specified as a
lib/AI/Categorizer/KnowledgeSet.pm
);
__PACKAGE__->contained_objects
(
document => { delayed => 1,
class => 'AI::Categorizer::Document' },
category => { delayed => 1,
class => 'AI::Categorizer::Category' },
collection => { delayed => 1,
class => 'AI::Categorizer::Collection::Files' },
features => { delayed => 1,
class => 'AI::Categorizer::FeatureVector' },
feature_selector => 'AI::Categorizer::FeatureSelector::DocFrequency',
);
sub new {
my ($pkg, %args) = @_;
# Shortcuts
if ($args{tfidf_weighting}) {
@args{'term_weighting', 'collection_weighting', 'normalize_weighting'} = split '', $args{tfidf_weighting};
lib/AI/Categorizer/KnowledgeSet.pm
$self->{documents} = new AI::Categorizer::ObjectSet( @{$self->{documents}} );
if ($self->{load}) {
my $args = ref($self->{load}) ? $self->{load} : { path => $self->{load} };
$self->load(%$args);
delete $self->{load};
}
return $self;
}
sub features {
my $self = shift;
if (@_) {
$self->{features} = shift;
$self->trim_doc_features if $self->{features};
}
return $self->{features} if $self->{features};
# Create a feature vector encompassing the whole set of documents
my $v = $self->create_delayed_object('features');
foreach my $document ($self->documents) {
$v->add( $document->features );
}
return $self->{features} = $v;
}
sub categories {
my $c = $_[0]->{categories};
return wantarray ? $c->members : $c->size;
}
sub documents {
my $d = $_[0]->{documents};
return wantarray ? $d->members : $d->size;
lib/AI/Categorizer/KnowledgeSet.pm
sub feature_selector { $_[0]->{feature_selector} }
sub scan_first { $_[0]->{scan_first} }
sub verbose {
my $self = shift;
$self->{verbose} = shift if @_;
return $self->{verbose};
}
sub trim_doc_features {
my ($self) = @_;
foreach my $doc ($self->documents) {
$doc->features( $doc->features->intersection($self->features) );
}
}
sub prog_bar {
my ($self, $collection) = @_;
return sub {} unless $self->verbose;
return sub { print STDERR '.' } unless eval "use Time::Progress; 1";
my $count = $collection->can('count_documents') ? $collection->count_documents : 0;
my $pb = 'Time::Progress'->new;
$pb->attr(max => $count);
my $i = 0;
return sub {
$i++;
return if $i % 25;
print STDERR $pb->report("%50b %p ($i/$count)\r", $i);
};
}
# A little utility method for several other methods like scan_stats(),
lib/AI/Categorizer/KnowledgeSet.pm
my $collection = $self->_make_collection(\%args);
my $pb = $self->prog_bar($collection);
my %stats;
while (my $doc = $collection->next) {
$pb->();
$stats{category_count_with_duplicates} += $doc->categories;
my ($sum, $length) = ($doc->features->sum, $doc->features->length);
$stats{document_count}++;
$stats{token_count} += $sum;
$stats{type_count} += $length;
foreach my $cat ($doc->categories) {
#warn $doc->name, ": ", $cat->name, "\n";
$stats{categories}{$cat->name}{document_count}++;
$stats{categories}{$cat->name}{token_count} += $sum;
$stats{categories}{$cat->name}{type_count} += $length;
}
lib/AI/Categorizer/KnowledgeSet.pm
$stats{"${thing}_skew_by_category"} = sqrt($ssum/@cats) / $stats{"${thing}s_per_category"};
}
return \%stats;
}
sub load {
my ($self, %args) = @_;
my $c = $self->_make_collection(\%args);
if ($self->{features_kept}) {
# Read the whole thing in, then reduce
$self->read( collection => $c );
$self->select_features;
} elsif ($self->{scan_first}) {
# Figure out the feature set first, then read data in
$self->scan_features( collection => $c );
$c->rewind;
$self->read( collection => $c );
} else {
# Don't do any feature reduction, just read the data
$self->read( collection => $c );
}
}
sub read {
lib/AI/Categorizer/KnowledgeSet.pm
while (my $doc = $collection->next) {
$pb->();
$self->add_document($doc);
}
print "\n" if $self->verbose;
}
sub finish {
my $self = shift;
return if $self->{finished}++;
$self->weigh_features;
}
sub weigh_features {
# This could be made more efficient by figuring out an execution
# plan in advance
my $self = shift;
if ( $self->{term_weighting} =~ /^(t|x)$/ ) {
# Nothing to do
} elsif ( $self->{term_weighting} eq 'l' ) {
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
$_ = 1 + log($_) foreach values %$f;
}
} elsif ( $self->{term_weighting} eq 'n' ) {
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
my $max_tf = AI::Categorizer::Util::max values %$f;
$_ = 0.5 + 0.5 * $_ / $max_tf foreach values %$f;
}
} elsif ( $self->{term_weighting} eq 'b' ) {
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
$_ = $_ ? 1 : 0 foreach values %$f;
}
} else {
die "term_weighting must be one of 'x', 't', 'l', 'b', or 'n'";
}
if ($self->{collection_weighting} eq 'x') {
# Nothing to do
} elsif ($self->{collection_weighting} =~ /^(f|p)$/) {
my $subtrahend = ($1 eq 'f' ? 0 : 1);
my $num_docs = $self->documents;
$self->document_frequency('foo'); # Initialize
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
$f->{$_} *= log($num_docs / $self->{doc_freq_vector}{$_} - $subtrahend) foreach keys %$f;
}
} else {
die "collection_weighting must be one of 'x', 'f', or 'p'";
}
if ( $self->{normalize_weighting} eq 'x' ) {
# Nothing to do
} elsif ( $self->{normalize_weighting} eq 'c' ) {
$_->features->normalize foreach $self->documents;
} else {
die "normalize_weighting must be one of 'x' or 'c'";
}
}
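# Example of the mapping above (a sketch): passing
# tfidf_weighting => 'lfc' to new() is split character-by-character into
# term_weighting 'l' (1 + log(tf)), collection_weighting 'f'
# (log(N/df)), and normalize_weighting 'c' (cosine normalization).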
sub document_frequency {
my ($self, $term) = @_;
unless (exists $self->{doc_freq_vector}) {
die "No corpus has been scanned for features" unless $self->documents;
my $doc_freq = $self->create_delayed_object('features', features => {});
foreach my $doc ($self->documents) {
$doc_freq->add( $doc->features->as_boolean_hash );
}
$self->{doc_freq_vector} = $doc_freq->as_hash;
}
return exists $self->{doc_freq_vector}{$term} ? $self->{doc_freq_vector}{$term} : 0;
}
sub scan_features {
my ($self, %args) = @_;
my $c = $self->_make_collection(\%args);
my $pb = $self->prog_bar($c);
my $ranked_features = $self->{feature_selector}->scan_features( collection => $c, prog_bar => $pb );
$self->delayed_object_params('document', use_features => $ranked_features);
$self->delayed_object_params('collection', use_features => $ranked_features);
return $ranked_features;
}
sub select_features {
my $self = shift;
my $f = $self->feature_selector->select_features(knowledge_set => $self);
$self->features($f);
}
sub partition {
my ($self, @sizes) = @_;
my $num_docs = my @docs = $self->documents;
my @groups;
while (@sizes > 1) {
my $size = int ($num_docs * shift @sizes);
push @groups, [];
lib/AI/Categorizer/KnowledgeSet.pm
sub add_document {
my ($self, $doc) = @_;
foreach ($doc->categories) {
$_->add_document($doc);
}
$self->{documents}->insert($doc);
$self->{categories}->insert($doc->categories);
}
sub save_features {
my ($self, $file) = @_;
my $f = ($self->{features} || { $self->delayed_object_params('document') }->{use_features})
or croak "No features to save";
open my($fh), "> $file" or croak "Can't create $file: $!";
my $h = $f->as_hash;
print $fh "# Total: ", $f->length, "\n";
foreach my $k (sort {$h->{$b} <=> $h->{$a}} keys %$h) {
print $fh "$k\t$h->{$k}\n";
}
close $fh;
}
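# The file written above is a "# Total: N" header followed by one
# tab-separated feature/weight line per feature, sorted by descending
# weight -- the same format that restore_features() parses below.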
sub restore_features {
my ($self, $file, $n) = @_;
open my($fh), "< $file" or croak "Can't open $file: $!";
my %hash;
while (<$fh>) {
next if /^#/;
/^(.*)\t([\d.]+)$/ or croak "Malformed line: $_";
$hash{$1} = $2;
last if defined $n and $. >= $n;
}
my $features = $self->create_delayed_object('features', features => \%hash);
$self->delayed_object_params('document', use_features => $features);
$self->delayed_object_params('collection', use_features => $features);
}
1;
__END__
=head1 NAME
AI::Categorizer::KnowledgeSet - Encapsulates set of documents
lib/AI/Categorizer/KnowledgeSet.pm
=item new()
Creates a new KnowledgeSet and returns it. Accepts the following
parameters:
=over 4
=item load
If a C<load> parameter is present, the C<load()> method will be
invoked immediately. If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>. If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.
=item categories
An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet. If used, the
C<documents> parameter should also be specified.
=item documents
An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet. If used, the
C<categories> parameter should also be specified.
=item features_kept
A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents. May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used. The default is
0.2 (see the example after this parameter list).
=item feature_selection
A string indicating the type of feature selection that should be
performed. Currently the only option is also the default option:
C<document_frequency>.
=item tfidf_weighting
lib/AI/Categorizer/KnowledgeSet.pm
be multiplied for each feature to find the final vector value for that
feature. The default weighting is C<xxx>.
The first character specifies the "term frequency" component, which
can take the following values:
=over 4
=item b
Binary weighting - 1 for terms present in a document, 0 for terms absent.
=item t
Raw term frequency - equal to the number of times a feature occurs in
the document.
=item x
A synonym for 't'.
lib/AI/Categorizer/KnowledgeSet.pm
Apply cosine normalization - multiply by 1/length(document_vector).
=item x
No change - multiply by 1.
=back
The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.
=item verbose
If set to a true value, some status/debugging information will be
output on C<STDOUT>.
=back
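For example, a KnowledgeSet that keeps a fixed number of features and
uses SMART-style weighting might be created as follows (a sketch; the
parameter values are illustrative):
my $k = AI::Categorizer::KnowledgeSet->new
(categories => \@categories, # AI::Categorizer::Category objects
documents => \@documents, # AI::Categorizer::Document objects
features_kept => 2000, # keep the 2000 best features
tfidf_weighting => 'lfc'); # log TF, log(N/df) IDF, cosine normalization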
=item categories()
lib/AI/Categorizer/KnowledgeSet.pm
=item documents()
In a list context returns a list of all Document objects in this
KnowledgeSet. In a scalar context returns the number of such objects.
=item document()
Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.
=item features()
Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.
=item verbose()
Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.
=item scan_stats()
Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection. (XXX need to describe stats)
=item scan_features()
This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner. This process is known as "feature selection", and it's a
very important part of categorization.
The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.
The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.
This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies. Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.
=item save_features()
Given the name of a file, this method writes the features (as
determined by the C<scan_features> method) to the file.
=item restore_features()
Given the name of a file written by C<save_features>, loads the
features from that file and passes them as the C<use_features>
parameter for any Document objects created in the future by this
KnowledgeSet.
=item read()
Iterates through a Collection of documents and adds them to the
KnowledgeSet. The Collection can be specified using a C<collection>
parameter - otherwise, specify the arguments to pass to the C<new()>
method of the Collection class.
lib/AI/Categorizer/KnowledgeSet.pm
categories I<by name>. These are the categories that the document
belongs to. Any other parameters will be passed to the Document
class's C<new()> method.
=item finish()
This method will be called prior to training the Learner. Its purpose
is to perform any operations (such as feature vector weighting) that
may require examination of the entire KnowledgeSet.
=item weigh_features()
This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.
=item document_frequency()
Given a single feature (word) as an argument, this method will return
the number of documents in the KnowledgeSet that contain that feature.
=item partition()
Divides the KnowledgeSet into several subsets. This may be useful for
performing cross-validation. The relative sizes of the subsets should
lib/AI/Categorizer/KnowledgeSet.pm
partitions will be returned as a list.
=back
=head1 AUTHOR
Ken Williams, ken@mathforum.org
=head1 COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3)
=cut
lib/AI/Categorizer/Learner.pm
class => 'AI::Categorizer::Hypothesis',
delayed => 1,
},
experiment => {
class => 'AI::Categorizer::Experiment',
delayed => 1,
},
);
# Subclasses must override these virtual methods:
sub get_scores;
sub create_model;
# Optional virtual method for on-line learning:
sub add_knowledge;
sub verbose {
my $self = shift;
if (@_) {
$self->{verbose} = shift;
}
lib/AI/Categorizer/Learner.pm
$self->{knowledge_set}->finish;
$self->create_model; # Creates $self->{model}
$self->delayed_object_params('hypothesis',
all_categories => [map $_->name, $self->categories],
);
}
sub prog_bar {
my ($self, $count) = @_;
return sub { print STDERR '.' } unless eval "use Time::Progress; 1";
my $pb = 'Time::Progress'->new;
$pb->attr(max => $count);
my $i = 0;
return sub {
$i++;
return if $i % 25;
my $string = '';
if (@_) {
my $e = shift;
$string = sprintf " (maF1=%.03f, miF1=%.03f)", $e->macro_F1, $e->micro_F1;
}
lib/AI/Categorizer/Learner.pm
}
}
print STDERR "\n" if $self->verbose;
return $experiment;
}
sub categorize {
my ($self, $doc) = @_;
my ($scores, $threshold) = $self->get_scores($doc);
if ($self->verbose > 2) {
warn "scores: @{[ %$scores ]}" if $self->verbose > 3;
foreach my $key (sort {$scores->{$b} <=> $scores->{$a}} keys %$scores) {
print "$key: $scores->{$key}\n";
}
}
return $self->create_delayed_object('hypothesis',
scores => $scores,
threshold => $threshold,
document_name => $doc->name,
);
}
1;
__END__
=head1 NAME
AI::Categorizer::Learner - Abstract Machine Learner Class
lib/AI/Categorizer/Learner.pm
use AI::Categorizer::Learner::NaiveBayes; # Or other subclass
# Here $k is an AI::Categorizer::KnowledgeSet object
my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
$nb->train(knowledge_set => $k);
$nb->save_state('filename');
... time passes ...
$nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
my $c = new AI::Categorizer::Collection::Files( path => ... );
while (my $document = $c->next) {
my $hypothesis = $nb->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
}
=head1 DESCRIPTION
The C<AI::Categorizer::Learner> class is an abstract class that will
lib/AI/Categorizer/Learner.pm
If true, the Learner will display some diagnostic output while
training and categorizing documents.
=back
=item train()
=item train(knowledge_set => $k)
Trains the categorizer. This prepares it for later use in
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object. If you provided a C<knowledge_set> parameter to C<new()>,
specifying one here will override it.
=item categorize($document)
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
=item categorize_collection(collection => $collection)
Categorizes every document in a collection and returns an Experiment
object representing the results. Note that the Experiment does not
contain knowledge of the assigned categories for every document, only
a statistical summary of the results.
=item knowledge_set()
Gets/sets the internal C<knowledge_set> member. Note that since the
knowledge set may be enormous, some Learners may throw away their
knowledge set after training or after restoring state from a file.
=item $learner-E<gt>save_state($path)
Saves the Learner for later use. This method is inherited from
C<AI::Categorizer::Storable>.
=item $class-E<gt>restore_state($path)
Returns a Learner saved in a file with C<save_state()>. This method
is inherited from C<AI::Categorizer::Storable>.
=back
=head1 AUTHOR
Ken Williams, ken@mathforum.org
=head1 COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3)
=cut
lib/AI/Categorizer/Learner/Boolean.pm
use strict;
use AI::Categorizer::Learner;
use base qw(AI::Categorizer::Learner);
use Params::Validate qw(:types);
use AI::Categorizer::Util qw(random_elements);
__PACKAGE__->valid_params
(
max_instances => {type => SCALAR, default => 0},
threshold => {type => SCALAR, default => 0.5},
);
sub create_model {
my $self = shift;
my $m = $self->{model} ||= {};
my $mi = $self->{max_instances};
foreach my $cat ($self->knowledge_set->categories) {
my (@p, @n);
foreach my $doc ($self->knowledge_set->documents) {
if ($doc->is_in_category($cat)) {
push @p, $doc;
} else {
push @n, $doc;
}
}
if ($mi and @p + @n > $mi) {
# Get rid of random instances from training set, preserving
# current positive/negative ratio
my $ratio = $mi / (@p + @n);
@p = random_elements(\@p, @p * $ratio);
@n = random_elements(\@n, @n * $ratio);
warn "Limiting to ". @p ." positives and ". @n ." negatives\n" if $self->verbose;
}
warn "Creating model for ", $cat->name, "\n" if $self->verbose;
$m->{learners}{ $cat->name } = $self->create_boolean_model(\@p, \@n, $cat);
}
}
sub create_boolean_model; # Abstract method
sub get_scores {
my ($self, $doc) = @_;
my $m = $self->{model};
my %scores;
foreach my $cat (keys %{$m->{learners}}) {
$scores{$cat} = $self->get_boolean_score($doc, $m->{learners}{$cat});
}
return (\%scores, $self->{threshold});
}
sub get_boolean_score; # Abstract method
sub threshold {
my $self = shift;
$self->{threshold} = shift if @_;
return $self->{threshold};
}
sub categories {
my $self = shift;
return map AI::Categorizer::Category->by_name( name => $_ ), keys %{ $self->{model}{learners} };
}
1;
__END__
lib/AI/Categorizer/Learner/DecisionTree.pm
$self->{model}{first_tree}->do_purge;
delete $self->{model}{first_tree};
}
sub create_boolean_model {
my ($self, $positives, $negatives, $cat) = @_;
my $t = new AI::DecisionTree(noise_mode => 'pick_best',
verbose => $self->verbose);
my %results;
for ($positives, $negatives) {
foreach my $doc (@$_) {
$results{$doc->name} = $_ eq $positives ? 1 : 0;
}
}
if ($self->{model}{first_tree}) {
$t->copy_instances(from => $self->{model}{first_tree});
$t->set_results(\%results);
} else {
for ($positives, $negatives) {
foreach my $doc (@$_) {
$t->add_instance( attributes => $doc->features->as_boolean_hash,
result => $results{$doc->name},
name => $doc->name,
);
}
}
$t->purge(0);
$self->{model}{first_tree} = $t;
}
print STDERR "\nBuilding tree for category '", $cat->name, "'" if $self->verbose;
$t->train;
return $t;
}
sub get_scores {
my ($self, $doc) = @_;
local $self->{current_doc} = $doc->features->as_boolean_hash;
return $self->SUPER::get_scores($doc);
}
sub get_boolean_score {
my ($self, $doc, $t) = @_;
return $t->get_result( attributes => $self->{current_doc} ) || 0;
}
1;
__END__
=head1 NAME
AI::Categorizer::Learner::DecisionTree - Decision Tree Learner
=head1 SYNOPSIS
lib/AI/Categorizer/Learner/DecisionTree.pm
use AI::Categorizer::Learner::DecisionTree;
# Here $k is an AI::Categorizer::KnowledgeSet object
my $l = new AI::Categorizer::Learner::DecisionTree(...parameters...);
$l->train(knowledge_set => $k);
$l->save_state('filename');
... time passes ...
$l = AI::Categorizer::Learner->restore_state('filename');
while (my $document = ... ) { # An AI::Categorizer::Document object
my $hypothesis = $l->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
}
=head1 DESCRIPTION
This class implements a Decision Tree machine learner, using
C<AI::DecisionTree> to do the internal work.
lib/AI/Categorizer/Learner/DecisionTree.pm
This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available unless explicitly mentioned here.
=head2 new()
Creates a new DecisionTree Learner and returns it.
=head2 train(knowledge_set => $k)
Trains the categorizer. This prepares it for later use in
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
=head2 categorize($document)
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
=head2 save_state($path)
Saves the categorizer for later use. This method is inherited from
C<AI::Categorizer::Storable>.
=head1 AUTHOR
Ken Williams, ken@mathforum.org
=head1 COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3)
=cut
lib/AI/Categorizer/Learner/Guesser.pm
my $self = shift;
my $k = $self->knowledge_set;
my $num_docs = $k->documents;
foreach my $cat ($k->categories) {
next unless $cat->documents;
$self->{model}{$cat->name} = $cat->documents / $num_docs;
}
}
sub get_scores {
my ($self, $newdoc) = @_;
my %scores;
while (my ($cat, $prob) = each %{$self->{model}}) {
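# The score clears the fixed 0.5 threshold exactly when rand() < $prob,
# so each category is assigned with probability equal to its prior,
# matching the behavior described in the POD below.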
$scores{$cat} = 0.5 + $prob - rand();
}
return (\%scores, 0.5);
}
1;
__END__
=head1 NAME
AI::Categorizer::Learner::Guesser - Simple guessing based on class probabilities
lib/AI/Categorizer/Learner/Guesser.pm
use AI::Categorizer::Learner::Guesser;
# Here $k is an AI::Categorizer::KnowledgeSet object
my $l = new AI::Categorizer::Learner::Guesser;
$l->train(knowledge_set => $k);
$l->save_state('filename');
... time passes ...
$l = AI::Categorizer::Learner->restore_state('filename');
my $c = new AI::Categorizer::Collection::Files( path => ... );
while (my $document = $c->next) {
my $hypothesis = $l->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
}
=head1 DESCRIPTION
This implements a simple category guesser that makes assignments based
solely on the prior probabilities of categories. For instance, if 5%
of the training documents belong to a certain category, then the
probability of any test document being assigned to that category is
0.05. This can be useful for providing baseline scores to compare
with other more sophisticated algorithms.
See L<AI::Categorizer> for a complete description of the interface.
=head1 METHODS
This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available.
=head1 AUTHOR
Ken Williams (C<< <ken@mathforum.org> >>)
=head1 COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3)
=cut
lib/AI/Categorizer/Learner/KNN.pm
package AI::Categorizer::Learner::KNN;
use strict;
use AI::Categorizer::Learner;
use base qw(AI::Categorizer::Learner);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
threshold => {type => SCALAR, default => 0.4},
k_value => {type => SCALAR, default => 20},
knn_weighting => {type => SCALAR, default => 'score'},
max_instances => {type => SCALAR, default => 0},
);
sub create_model {
my $self = shift;
foreach my $doc ($self->knowledge_set->documents) {
$doc->features->normalize;
}
$self->knowledge_set->features; # Initialize
}
sub threshold {
my $self = shift;
$self->{threshold} = shift if @_;
return $self->{threshold};
}
sub categorize_collection {
my $self = shift;
my $f_class = $self->knowledge_set->contained_class('features');
if ($f_class->can('all_features')) {
$f_class->all_features([$self->knowledge_set->features->names]);
}
$self->SUPER::categorize_collection(@_);
}
sub get_scores {
my ($self, $newdoc) = @_;
my $currentDocName = $newdoc->name;
#print "classifying $currentDocName\n";
my $features = $newdoc->features->intersection($self->knowledge_set->features)->normalize;
my $q = AI::Categorizer::Learner::KNN::Queue->new(size => $self->{k_value});
my @docset;
if ($self->{max_instances}) {
# Use (approximately) max_instances documents, chosen randomly from corpus
my $probability = $self->{max_instances} / $self->knowledge_set->documents;
@docset = grep {rand() < $probability} $self->knowledge_set->documents;
} else {
# Use the whole corpus
@docset = $self->knowledge_set->documents;
}
foreach my $doc (@docset) {
my $score = $doc->features->dot( $features );
warn "Score for ", $doc->name, " (", ($doc->categories)[0]->name, "): $score" if $self->verbose > 1;
$q->add($doc, $score);
}
my %scores = map {+$_->name, 0} $self->categories;
foreach my $e (@{$q->entries}) {
foreach my $cat ($e->{thing}->categories) {
$scores{$cat->name} += ($self->{knn_weighting} eq 'score' ? $e->{score} : 1); #increment cat score
}
}
$_ /= $self->{k_value} foreach values %scores;
return (\%scores, $self->{threshold});
}
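# Sketch of the scoring above (invented numbers): with k_value => 3 and
# knn_weighting => 'score', if the three nearest neighbours score 0.9
# (sports), 0.8 (sports), and 0.7 (finance), the raw sums are
# sports = 1.7 and finance = 0.7; dividing by k gives 0.57 and 0.23,
# so only "sports" clears the default 0.4 threshold.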
###################################################################
package AI::Categorizer::Learner::KNN::Queue;
sub new {
my ($pkg, %args) = @_;
return bless {
size => $args{size},
entries => [],
}, $pkg;
}
sub add {
my ($self, $thing, $score) = @_;
# scores may be (0.2, 0.4, 0.4, 0.8) - ascending
return unless (@{$self->{entries}} < $self->{size} # Queue not filled
or $score > $self->{entries}[0]{score}); # Found a better entry
my $i;
if (!@{$self->{entries}}) {
$i = 0;
} elsif ($score > $self->{entries}[-1]{score}) {
$i = @{$self->{entries}};
} else {
lib/AI/Categorizer/Learner/KNN.pm
sub entries {
return shift->{entries};
}
1;
__END__
=head1 NAME
AI::Categorizer::Learner::KNN - K Nearest Neighbour Algorithm For AI::Categorizer
=head1 SYNOPSIS
use AI::Categorizer::Learner::KNN;
# Here $k is an AI::Categorizer::KnowledgeSet object
my $nb = new AI::Categorizer::Learner::KNN(...parameters...);
$nb->train(knowledge_set => $k);
$nb->save_state('filename');
... time passes ...
$l = AI::Categorizer::Learner->restore_state('filename');
my $c = new AI::Categorizer::Collection::Files( path => ... );
while (my $document = $c->next) {
my $hypothesis = $l->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
}
=head1 DESCRIPTION
This is an implementation of the k-Nearest-Neighbor decision-making
algorithm, applied to the task of document categorization (as defined
by the AI::Categorizer module). See L<AI::Categorizer> for a complete
description of the interface.
=head1 METHODS
This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available unless explicitly mentioned here.
=head2 new()
Creates a new KNN Learner and returns it. In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
KNN subclass accepts the following parameters:
=over 4
=item threshold
Sets the score threshold for category membership. The default is
currently 0.4. Set the threshold lower to assign more categories per
document, set it higher to assign fewer. This can be an effective way
to trade off between precision and recall.
=item k_value
Sets the C<k> value (as in k-Nearest-Neighbor) to the given integer.
This indicates how many of each document's nearest neighbors should be
considered when assigning categories. The default is 20.
=back
=head2 threshold()
Returns the current threshold value. With an optional numeric
argument, you may set the threshold.
=head2 train(knowledge_set => $k)
Trains the categorizer. This prepares it for later use in
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
=head2 categorize($document)
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
=head2 save_state($path)
Saves the categorizer for later use. This method is inherited from
C<AI::Categorizer::Storable>.
=head1 AUTHOR
Originally written by David Bell (C<< <dave@student.usyd.edu.au> >>),
October 2002.
Added to AI::Categorizer November 2002, modified, and maintained by
Ken Williams (C<< <ken@mathforum.org> >>).
=head1 COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3)
"A re-examination of text categorization methods" by Yiming Yang
L<http://www.cs.cmu.edu/~yiming/publications.html>
lib/AI/Categorizer/Learner/NaiveBayes.pm
package AI::Categorizer::Learner::NaiveBayes;
use strict;
use AI::Categorizer::Learner;
use base qw(AI::Categorizer::Learner);
use Params::Validate qw(:types);
use Algorithm::NaiveBayes;
__PACKAGE__->valid_params
(
threshold => {type => SCALAR, default => 0.3},
);
sub create_model {
my $self = shift;
my $m = $self->{model} = Algorithm::NaiveBayes->new;
foreach my $d ($self->knowledge_set->documents) {
$m->add_instance(attributes => $d->features->as_hash,
label => [ map $_->name, $d->categories ]);
}
$m->train;
}
sub get_scores {
my ($self, $newdoc) = @_;
return ($self->{model}->predict( attributes => $newdoc->features->as_hash ),
$self->{threshold});
}
sub threshold {
my $self = shift;
$self->{threshold} = shift if @_;
return $self->{threshold};
}
sub save_state {
my $self = shift;
local $self->{knowledge_set}; # Don't need the knowledge_set to categorize
$self->SUPER::save_state(@_);
}
sub categories {
my $self = shift;
lib/AI/Categorizer/Learner/NaiveBayes.pm
use AI::Categorizer::Learner::NaiveBayes;
# Here $k is an AI::Categorizer::KnowledgeSet object
my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
$nb->train(knowledge_set => $k);
$nb->save_state('filename');
... time passes ...
$nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
my $c = new AI::Categorizer::Collection::Files( path => ... );
while (my $document = $c->next) {
my $hypothesis = $nb->categorize($document);
print "Best assigned category: ", $hypothesis->best_category, "\n";
print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
}
=head1 DESCRIPTION
This is an implementation of the Naive Bayes decision-making
lib/AI/Categorizer/Learner/NaiveBayes.pm
of its methods are available unless explicitly mentioned here.
=head2 new()
Creates a new Naive Bayes Learner and returns it. In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
Naive Bayes subclass accepts the following parameters:
=over 4
=item * threshold
Sets the score threshold for category membership. The default is
currently 0.3. Set the threshold lower to assign more categories per
document, set it higher to assign fewer. This can be an effective way
to trade off between precision and recall.
=back
=head2 threshold()
Returns the current threshold value. With an optional numeric
argument, you may set the threshold.
=head2 train(knowledge_set => $k)
Trains the categorizer. This prepares it for later use in
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
=head2 categorize($document)
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to. See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
=head2 save_state($path)
Saves the categorizer for later use. This method is inherited from
C<AI::Categorizer::Storable>.
=head1 CALCULATIONS
The various probabilities used in the above calculations are found
directly from the training documents. For instance, if there are 5000
total tokens (words) in the "sports" training documents and 200 of
them are the word "curling", then C<P(curling|sports) = 200/5000 =
0.04>. If there are 10,000 total tokens in the training corpus and
5,000 of them are in documents belonging to the category "sports",
then C<P(sports) = 5,000/10,000 = 0.5>.
Because the probabilities involved are often very small and we
multiply many of them together, the result is often a tiny tiny
number. This could pose problems of floating-point underflow, so
instead of working with the actual probabilities we work with the
logarithms of the probabilities. This also speeds up various
calculations in the C<categorize()> method.
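The sums-of-logs computation looks roughly like this (a sketch using
the illustrative counts from this section, not code from the module):
my $log_p = log(5_000/10_000); # log P(sports)
$log_p += log(200/5_000) for 1 .. 2; # "curling" occurring twice
my $p = exp($log_p); # 0.5 * 0.04**2 = 0.0008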
=head1 TO DO
More work on the confidence scores - right now the winning category
tends to dominate the scores overwhelmingly, when the scores should
probably be more evenly distributed.
=head1 AUTHOR
Ken Williams, ken@forum.swarthmore.edu
=head1 COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3), Algorithm::NaiveBayes(3)
"A re-examination of text categorization methods" by Yiming Yang
L<http://www.cs.cmu.edu/~yiming/publications.html>
lib/AI/Categorizer/Learner/Rocchio.pm
use strict;
use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Learner::Boolean;
use base qw(AI::Categorizer::Learner::Boolean);
__PACKAGE__->valid_params
(
positive_setting => {type => SCALAR, default => 16 },
negative_setting => {type => SCALAR, default => 4 },
threshold => {type => SCALAR, default => 0.1},
);
sub create_model {
my $self = shift;
foreach my $doc ($self->knowledge_set->documents) {
$doc->features->normalize;
}
$self->{model}{all_features} = $self->knowledge_set->features(undef);
$self->SUPER::create_model(@_);
delete $self->{knowledge_set};
}
sub create_boolean_model {
my ($self, $positives, $negatives, $cat) = @_;
my $posdocnum = @$positives;
my $negdocnum = @$negatives;
my $beta = $self->{positive_setting};
my $gamma = $self->{negative_setting};
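# Rocchio profile: (beta/|P|)*sum(positives) - (gamma/|N|)*sum(negatives).
# Because all_features sums positives and negatives together, the two
# lines below compute it as -gamma/|N| * all_features plus
# (beta/|P| + gamma/|N|) * category_features, which is algebraically equal.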
my $profile = $self->{model}{all_features}->clone->scale(-$gamma/$negdocnum);
my $f = $cat->features(undef)->clone->scale( $beta/$posdocnum + $gamma/$negdocnum );
$profile->add($f);
return $profile->normalize;
}
sub get_boolean_score {
my ($self, $newdoc, $profile) = @_;
return $newdoc->features->normalize->dot($profile);
}
1;