AI-Categorizer
view release on metacpan or search on metacpan
the KnowledgeSet objects were never accepted, claiming that it
failed the "All are Document objects" or "All are Category objects"
callbacks. [Spotted by rob@phraud.org]
- Moved the 'stopword_file' parameter from Categorizer.pm to the
Collection class.
0.05 Sat Mar 29 00:38:21 CST 2003
- Feature selection is now handled by an abstract FeatureSelector
framework class. Currently the only concrete subclass implemented
is FeatureSelector::DocFrequency. The 'feature_selection'
parameter has been replaced with a 'feature_selector_class'
parameter.
- Added a k-Nearest-Neighbor machine learner. [First revision
implemented by David Bell]
- Added a Rocchio machine learner. [Partially implemented by Xiaobo
Li]
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
DESCRIPTION
"AI::Categorizer" is a framework for automatic text categorization. It
consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can choose what
categorization algorithm to use, what features (words or otherwise) of the
documents should be used (or how to automatically choose these features),
what format the documents are in, and so on.
The basic process of using this module will typically involve obtaining a
collection of pre-categorized documents, creating a "knowledge set"
representation of those documents, training a categorizer on that knowledge
set, and saving the trained categorizer for later use. There are several
ways to carry out this process. The top-level "AI::Categorizer" module
provides an umbrella class for high-level operations, or you may use the
interfaces of the individual classes in the framework.
A simple sample script that reads a training corpus, trains a categorizer,
and tests the categorizer on a test corpus, is distributed as eg/demo.pl .
Disclaimer: the results of any of the machine learning algorithms are far
from infallible (close to fallible?). Categorization of documents is often a
difficult task even for humans well-trained in the particular domain of
knowledge, and there are many things a human would consider that none of
these algorithms consider. These are only statistical tests - at best they
are neat tricks or helpful assistants, and at worst they are totally
unreliable. If you plan to use this module for anything really important,
human supervision is essential, both of the categorization process and the
final results.
For the usage details, please see the documentation of each individual
module.
FRAMEWORK COMPONENTS
This section explains the major pieces of the "AI::Categorizer" object
framework. We give a conceptual overview, but don't get into any of the
details about interfaces or usage. See the documentation for the individual
classes for more details.
A diagram of the various classes in the framework can be seen in
"doc/classes-overview.png", and a more detailed view of the same thing can
be seen in "doc/classes.png".
Knowledge Sets
A "knowledge set" is defined as a collection of documents, together with
some information on the categories each document belongs to. Note that this
term is somewhat unique to this project - other sources may call it a
"training corpus", or "prior knowledge". A knowledge set also contains some
information on how documents will be parsed and how their features (words)
eg/easy_guesser.pl view on Meta::CPAN
#!/usr/bin/perl
# This script can be helpful for getting a set of baseline scores for
# a categorization task. It simulates using the "Guesser" learner,
# but is much faster. Because it doesn't leverage using the whole
# framework, though, it expects everything to be in a very strict
# format. <cats-file> is in the same format as the 'category_file'
# parameter to the Collection class. <training-dir> and <test-dir>
# give paths to directories of documents, named as in <cats-file>.
use strict;
use Statistics::Contingency;
die "Usage: $0 <cats-file> <training-dir> <test-dir>\n" unless @ARGV == 3;
my ($cats, $training, $test) = @ARGV;
lib/AI/Categorizer.pm view on Meta::CPAN
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
=head1 DESCRIPTION
C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically
choose these features), what format the documents are in, and so on.
The basic process of using this module will typically involve
obtaining a collection of B<pre-categorized> documents, creating a
"knowledge set" representation of those documents, training a
categorizer on that knowledge set, and saving the trained categorizer
for later use. There are several ways to carry out this process. The
top-level C<AI::Categorizer> module provides an umbrella class for
high-level operations, or you may use the interfaces of the individual
classes in the framework.
A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus, is
distributed as eg/demo.pl .
Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?). Categorization of documents
is often a difficult task even for humans well-trained in the
particular domain of knowledge, and there are many things a human
would consider that none of these algorithms consider. These are only
lib/AI/Categorizer.pm view on Meta::CPAN
assistants, and at worst they are totally unreliable. If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.
For the usage details, please see the documentation of each individual
module.
=head1 FRAMEWORK COMPONENTS
This section explains the major pieces of the C<AI::Categorizer>
object framework. We give a conceptual overview, but don't get into
any of the details about interfaces or usage. See the documentation
for the individual classes for more details.
A diagram of the various classes in the framework can be seen in
C<doc/classes-overview.png>, and a more detailed view of the same
thing can be seen in C<doc/classes.png>.
=head2 Knowledge Sets
A "knowledge set" is defined as a collection of documents, together
with some information on the categories each document belongs to.
Note that this term is somewhat unique to this project - other sources
may call it a "training corpus", or "prior knowledge". A knowledge
set also contains some information on how documents will be parsed and
( run in 1.410 second using v1.01-cache-2.11-cpan-677af5a14d3 )