AI-Categorizer
lib/AI/Categorizer.pm
sub stats_table {
  my $self = shift;
  $self->_load_progress( '04', 'experiment' );
  return $self->{experiment}->stats_table;
}

sub progress_file {
  shift->{progress_file};
}

sub verbose {
  shift->{verbose};
}

sub _save_progress {
  my ($self, $stage, $node) = @_;
  return unless $self->{progress_file};
  my $file = "$self->{progress_file}-$stage-$node";
  warn "Saving to $file\n" if $self->{verbose};
  $self->{$node}->save_state($file);
}

sub _load_progress {
  my ($self, $stage, $node) = @_;
  return unless $self->{progress_file};
  my $file = "$self->{progress_file}-$stage-$node";
  warn "Loading $file\n" if $self->{verbose};
  $self->{$node} = $self->contained_class($node)->restore_state($file);
}

1;
__END__
=head1 NAME
AI::Categorizer - Automatic Text Categorization
=head1 SYNOPSIS
  use AI::Categorizer;
  my $c = AI::Categorizer->new(...parameters...);

  # Run a complete experiment - training on a corpus, testing on a test
  # set, printing a summary of results to STDOUT
  $c->run_experiment;

  # Or, run the parts of $c->run_experiment separately
  $c->scan_features;
  $c->read_training_set;
  $c->train;
  $c->evaluate_test_set;
  print $c->stats_table;

  # After training, use the Learner for categorization
  my $l = $c->learner;
  while (...) {
    my $d = ...create a document...
    my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
    print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
    print "Best category: ", $hypothesis->best_category, "\n";
  }
=head1 DESCRIPTION
C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically
choose these features), what format the documents are in, and so on.
The basic process of using this module will typically involve
obtaining a collection of B<pre-categorized> documents, creating a
"knowledge set" representation of those documents, training a
categorizer on that knowledge set, and saving the trained categorizer
for later use. There are several ways to carry out this process. The
top-level C<AI::Categorizer> module provides an umbrella class for
high-level operations, or you may use the interfaces of the individual
classes in the framework.
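The shape of that process (train on pre-categorized documents, save the result, restore it later, categorize something new) can be sketched in plain Perl. This is only an illustration and not how the framework is implemented: the C<train> and C<categorize> subs, the word-count "model", and the sample documents are all hypothetical, and the core C<Storable> module stands in for whatever persistence the real classes use.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Hypothetical pre-categorized documents: text plus a category label.
my @training = (
    { text => 'the striker scored a late goal',   category => 'sports'  },
    { text => 'their team won a tough match',     category => 'sports'  },
    { text => 'shares fell as the market closed', category => 'finance' },
    { text => 'the bank raised interest rates',   category => 'finance' },
);

# "Training" here just counts how often each word appears per category.
sub train {
    my (@docs) = @_;
    my %model;
    for my $doc (@docs) {
        $model{ $doc->{category} }{$_}++ for split ' ', lc $doc->{text};
    }
    return \%model;
}

# Score each category by the summed counts of the document's words
# and return the highest-scoring category (ties broken alphabetically).
sub categorize {
    my ($model, $text) = @_;
    my %score;
    for my $cat (keys %$model) {
        $score{$cat} = 0;
        $score{$cat} += $model->{$cat}{$_} // 0 for split ' ', lc $text;
    }
    my ($best) = sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
    return $best;
}

my $model = train(@training);

# Save the trained model, then restore it as a later run would.
my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;
store($model, $file);
my $restored = retrieve($file);

print categorize($restored, 'the market fell'), "\n";
```

Even in this toy version the common word "the" contributes to both categories' scores, which hints at why choosing features carefully matters.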
A simple sample script that reads a training corpus, trains a
categorizer, and tests it on a test corpus is distributed as
C<eg/demo.pl>.
Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?). Categorization of documents
is often a difficult task even for humans well-trained in the
particular domain of knowledge, and there are many things a human
would consider that none of these algorithms consider. These are only
statistical tests - at best they are neat tricks or helpful
assistants, and at worst they are totally unreliable. If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.
For usage details, please see the documentation of each individual
module.
=head1 FRAMEWORK COMPONENTS
This section explains the major pieces of the C<AI::Categorizer>
object framework. We give a conceptual overview, but don't get into
any of the details about interfaces or usage. See the documentation
for the individual classes for more details.
A diagram of the various classes in the framework can be seen in
C<doc/classes-overview.png>, and a more detailed view of the same
thing can be seen in C<doc/classes.png>.
=head2 Knowledge Sets
A "knowledge set" is defined as a collection of documents, together
with some information on the categories each document belongs to.
Note that this term is somewhat specific to this project - other sources
may call it a "training corpus" or "prior knowledge". A knowledge
set also contains some information on how documents will be parsed and
how their features (words) will be extracted and turned into
meaningful representations. In this sense, a knowledge set represents
not only a collection of data, but a particular view on that data.
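To make the idea concrete, here is a minimal sketch of what such a structure might hold, written as a plain Perl hash. It is not the layout used by C<AI::Categorizer::KnowledgeSet>; the C<tokenizer>, C<documents>, and C<features> keys and the sample documents are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a knowledge set: the documents, the categories
# each one belongs to, and the rule used to turn raw text into features.
my %knowledge_set = (
    tokenizer => sub {
        my ($text) = @_;
        return grep { length $_ } split /\W+/, lc $text;
    },
    documents => [
        {
            name       => 'doc1',
            categories => ['sports'],
            text       => 'Goal! The match ended 2-1.',
        },
        {
            name       => 'doc2',
            categories => [ 'finance', 'news' ],
            text       => 'Markets rallied as the bank cut rates.',
        },
    ],
);

# Apply the set's own view of the data: each document becomes a bag of words.
for my $doc (@{ $knowledge_set{documents} }) {
    my %features;
    $features{$_}++ for $knowledge_set{tokenizer}->($doc->{text});
    $doc->{features} = \%features;
}

# Index document names by category.
my %by_category;
for my $doc (@{ $knowledge_set{documents} }) {
    push @{ $by_category{$_} }, $doc->{name} for @{ $doc->{categories} };
}

for my $cat (sort keys %by_category) {
    print "$cat: @{ $by_category{$cat} }\n";
}
```

Swapping in a different tokenizer changes the features every document exposes without touching the documents themselves, which is the sense in which a knowledge set is a view on the data.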
A knowledge set is encapsulated by the
C<AI::Categorizer::KnowledgeSet> class. Before you can start playing
with categorizers, you will have to start playing with knowledge sets,
so that the categorizers have some data to train on. See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.
=head3 Feature selection
Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized. There are two main reasons for this - first,
training and categorizing would take far too long and use far too much
memory, and second, the significant features of the documents would get
lost in the "noise" of the insignificant ones.
The process of selecting the most important features in the training
set is called "feature selection". It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.
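As a rough illustration of the idea, the sketch below ranks features by document frequency (the number of training documents a word appears in) and keeps only the top few. Document frequency is used here simply because it is easy to follow; the C<select_features> sub, its arguments, and the stopword list are hypothetical, and the criteria the framework actually offers are described in the C<AI::Categorizer::KnowledgeSet> documentation.

```perl
use strict;
use warnings;

# Hypothetical tokenized training documents.
my @docs = (
    [qw(the match ended with a late goal)],
    [qw(the goal came late in the match)],
    [qw(the market closed lower)],
    [qw(the market rallied late)],
);

# Keep the $keep features with the highest document frequency,
# ignoring any feature listed in the stopword set.
sub select_features {
    my ($docs, $keep, $stopwords) = @_;
    my %df;
    for my $doc (@$docs) {
        my %seen;    # count each word at most once per document
        $df{$_}++ for grep { !$seen{$_}++ && !$stopwords->{$_} } @$doc;
    }
    # Highest frequency first; ties are broken alphabetically.
    my @ranked = sort { $df{$b} <=> $df{$a} || $a cmp $b } keys %df;
    $keep = @ranked if $keep > @ranked;
    return @ranked[ 0 .. $keep - 1 ];
}

my @kept = select_features( \@docs, 3, { the => 1, a => 1, in => 1, with => 1 } );
print "@kept\n";    # prints "late goal market"
```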
=head2 Collections
Because documents may be stored in lots of different formats, a
"collection" class has been created as an abstraction of a stored set
of documents, together with a way to iterate through the set and
return Document objects. A knowledge set contains a single collection
object. A C<Categorizer> doing a complete test run generally contains
two collections, one for training and one for testing. A C<Learner>
can mass-categorize a collection.
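The abstraction can be pictured as a small iterator class. The sketch below is not one of the framework's collection classes: the package name C<My::Collection::InMemory>, its constructor arguments, and the hash-based "documents" are assumptions, chosen only to show how callers can loop over documents without knowing how they are stored.

```perl
use strict;
use warnings;

# Hypothetical collection class: hides where the documents come from
# and hands them out one at a time through next().
package My::Collection::InMemory;

sub new {
    my ($class, %args) = @_;
    return bless { docs => [ @{ $args{docs} } ], pos => 0 }, $class;
}

# Return the next document, or undef when the collection is exhausted.
sub next {
    my $self = shift;
    return undef if $self->{pos} >= @{ $self->{docs} };
    return $self->{docs}[ $self->{pos}++ ];
}

# Start over from the first document.
sub rewind {
    my $self = shift;
    $self->{pos} = 0;
    return;
}

package main;

my $collection = My::Collection::InMemory->new(
    docs => [
        { name => 'a.txt', content => 'first document' },
        { name => 'b.txt', content => 'second document' },
    ],
);

my @names;
while ( my $doc = $collection->next ) {
    push @names, $doc->{name};
}
print "@names\n";    # prints "a.txt b.txt"
```

A second class with the same C<next> and C<rewind> methods could read files from a directory or rows from a database, and the loop above would not need to change.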