AI-Categorizer
view release on metacpan or search on metacpan
NAME
AI::Categorizer - Automatic Text Categorization
SYNOPSIS
use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);
# Run a complete experiment - training on a corpus, testing on a test
# set, printing a summary of results to STDOUT
$c->run_experiment;
# Or, run the parts of $c->run_experiment separately
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
DESCRIPTION
"AI::Categorizer" is a framework for automatic text categorization. It
consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can choose what
categorization algorithm to use, what features (words or otherwise) of the
documents should be used (or how to automatically choose these features),
what format the documents are in, and so on.
The basic process of using this module will typically involve obtaining a
collection of pre-categorized documents, creating a "knowledge set"
representation of those documents, training a categorizer on that knowledge
set, and saving the trained categorizer for later use. There are several
ways to carry out this process. The top-level "AI::Categorizer" module
provides an umbrella class for high-level operations, or you may use the
interfaces of the individual classes in the framework.
A simple sample script that reads a training corpus, trains a categorizer,
and tests the categorizer on a test corpus, is distributed as eg/demo.pl .
Disclaimer: the results of any of the machine learning algorithms are far
from infallible (close to fallible?). Categorization of documents is often a
difficult task even for humans well-trained in the particular domain of
knowledge, and there are many things a human would consider that none of
these algorithms consider. These are only statistical tests - at best they
are neat tricks or helpful assistants, and at worst they are totally
unreliable. If you plan to use this module for anything really important,
human supervision is essential, both of the categorization process and the
final results.
For the usage details, please see the documentation of each individual
module.
FRAMEWORK COMPONENTS
This section explains the major pieces of the "AI::Categorizer" object
framework. We give a conceptual overview, but don't get into any of the
details about interfaces or usage. See the documentation for the individual
classes for more details.
A diagram of the various classes in the framework can be seen in
"doc/classes-overview.png", and a more detailed view of the same thing can
be seen in "doc/classes.png".
Knowledge Sets
A "knowledge set" is defined as a collection of documents, together with
some information on the categories each document belongs to. Note that this
term is somewhat unique to this project - other sources may call it a
"training corpus", or "prior knowledge". A knowledge set also contains some
information on how documents will be parsed and how their features (words)
will be extracted and turned into meaningful representations. In this sense,
a knowledge set represents not only a collection of data, but a particular
view on that data.
A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
class. Before you can start playing with categorizers, you will have to
start playing with knowledge sets, so that the categorizers have some data
to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
If you wish to create your own classifier, you should inherit from
"AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
abstract classes that manage some of the work for you.
Feature Vectors
Most categorization algorithms don't deal directly with documents' data,
they instead deal with a *vector representation* of a document's *features*.
The features may be any properties of the document that seem helpful for
determining its category, but they are usually some version of the "most
important" words in the document. A list of features and their weights in
each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
You may think of this class as roughly analogous to a Perl hash, where the
keys are the names of features and the values are their weights.
Hypotheses
The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of "statistical
guess" of what categories this document should be assigned to. Since you may
be interested in any of several pieces of information about the hypothesis
(for instance, which categories were assigned, which category was the single
most likely category, the scores assigned to each category, etc.), the
hypothesis is returned as an object of the "AI::Categorizer::Hypothesis"
class, and you can use its object methods to get information about the
hypothesis. See its class documentation for the details.
Experiments
The "AI::Categorizer::Experiment" class helps you organize the results of
categorization experiments. As you get lots of categorization results
(Hypotheses) back from the Learner, you can feed these results to the
Experiment class, along with the correct answers. When all results have been
collected, you can get a report on accuracy, precision, recall, F1, and so
on, with both micro-averaging and macro-averaging over categories. We use
the "Statistics::Contingency" module from CPAN to manage the calculations.
See the docs for "AI::Categorizer::Experiment" for more details.
METHODS
new()
Creates a new Categorizer object and returns it. Accepts lots of
parameters controlling behavior. In addition to the parameters listed
here, you may pass any parameter accepted by any class that we create
internally (the KnowledgeSet, Learner, Experiment, or Collection
classes), or any class that *they* create. This is managed by the
"Class::Container" module, so see its documentation for the details of
how this works.
The specific parameters accepted here are:
progress_file
A string that indicates a place where objects will be saved during
several of the methods of this class. The default value is the
string "save", which means files like "save-01-knowledge_set" will
get created. The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.
verbose
If true, a few status messages will be printed during execution.
training_set
Specifies the "path" parameter that will be fed to the
KnowledgeSet's "scan_features()" and "read()" methods during our
"scan_features()" and "read_training_set()" methods.
test_set
Specifies the "path" parameter that will be used when creating a
Collection during the "evaluate_test_set()" method.
data_root
A shortcut for setting the "training_set", "test_set", and
"category_file" parameters separately. Sets "training_set" to
"$data_root/training", "test_set" to "$data_root/test", and
"category_file" (used by some of the Collection classes) to
"$data_root/cats.txt".
learner()
Returns the Learner object associated with this Categorizer. Before
"train()", the Learner will of course not be trained yet.
knowledge_set()
Returns the KnowledgeSet object associated with this Categorizer. If
"read_training_set()" has not yet been called, the KnowledgeSet will not
yet be populated with any training data.
run_experiment()
Runs a complete experiment on the training and testing data, reporting
the results on "STDOUT". Internally, this is just a shortcut for calling
the "scan_features()", "read_training_set()", "train()", and
"evaluate_test_set()" methods, then printing the value of the
"stats_table()" method.
scan_features()
Scans the Collection specified in the "test_set" parameter to determine
the set of features (words) that will be considered when training the
Learner. Internally, this calls the "scan_features()" method of the
KnowledgeSet, then saves a list of the KnowledgeSet's features for later
use.
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
read_training_set()
Populates the KnowledgeSet with the data specified in the "test_set"
parameter. Internally, this calls the "read()" method of the
KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
object for later use.
train()
Calls the Learner's "train()" method, passing it the KnowledgeSet
created during "read_training_set()". Returns the Learner object. Also
saves the Learner object for later use.
evaluate_test_set()
Creates a Collection based on the value of the "test_set" parameter, and
calls the Learner's "categorize_collection()" method using this
Collection. Returns the resultant Experiment object. Also saves the
Experiment object for later use in the "stats_table()" method.
stats_table()
Returns the value of the Experiment's (as created by
"evaluate_test_set()") "stats_table()" method. This is a string that
shows various statistics about the accuracy/precision/recall/F1/etc. of
the assignments made during testing.
HISTORY
This module is a revised and redesigned version of the previous
"AI::Categorize" module by the same author. Note the added 'r' in the new
name. The older module has a different interface, and no attempt at backward
compatibility has been made - that's why I changed the name.
You can have both "AI::Categorize" and "AI::Categorizer" installed at the
same time on the same machine, if you want. They don't know about each other
or use conflicting namespaces.
AUTHOR
Ken Williams <ken@mathforum.org>
Discussion about this module can be directed to the perl-AI list at
<perl-ai@perl.org>. For more info about the list, see
http://lists.perl.org/showlist.cgi?name=perl-ai
REFERENCES
An excellent introduction to the academic field of Text Categorization is
Fabrizio Sebastiani's "Machine Learning in Automated Text Categorization":
ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.
COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
( run in 0.528 second using v1.01-cache-2.11-cpan-39bf76dae61 )