AI-Categorizer

 view release on metacpan or  search on metacpan

lib/AI/Categorizer.pm  view on Meta::CPAN

usually very fast to train and fast to make categorization decisions,
but isn't always the most accurate categorizer.

=item AI::Categorizer::Learner::SVM

An interface to Corey Spencer's C<Algorithm::SVM>, which implements a
Support Vector Machine classifier.  SVMs can take a while to train
(though in certain conditions there are optimizations to make them
quite fast), but are pretty quick to categorize.  They often have very
good accuracy.

=item AI::Categorizer::Learner::DecisionTree

An interface to C<AI::DecisionTree>, which implements a Decision Tree
classifier.  Decision Trees generally take longer to train than Naive
Bayes or SVM classifiers, but they are also quite fast when
categorizing.  Decision Trees have the advantage that you can
scrutinize the structures of trained decision trees to see how
decisions are being made.

=item AI::Categorizer::Learner::Weka

An interface to version 2 of the Weka Knowledge Analysis system that
lets you use any of the machine learners it defines.  This gives you
access to lots and lots of machine learning algorithms in use by
machine learning researches.  The main drawback is that Weka tends to
be quite slow and use a lot of memory, and the current interface
between Weka and C<AI::Categorizer> is a bit clumsy.

=back

Other machine learning methods that may be implemented soonish include
Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts
combiner for ensemble learning.  No timetable for their creation has
yet been set.

Please see the documentation of these individual modules for more
details on their guts and quirks.  See the C<AI::Categorizer::Learner>
documentation for a description of the general categorizer interface.

If you wish to create your own classifier, you should inherit from
C<AI::Categorizer::Learner> or C<AI::Categorizer::Learner::Boolean>,
which are abstract classes that manage some of the work for you.

=head2 Feature Vectors

Most categorization algorithms don't deal directly with documents'
data, they instead deal with a I<vector representation> of a
document's I<features>.  The features may be any properties of the
document that seem helpful for determining its category, but they are usually
some version of the "most important" words in the document.  A list of
features and their weights in each document is encapsulated by the
C<AI::Categorizer::FeatureVector> class.  You may think of this class
as roughly analogous to a Perl hash, where the keys are the names of
features and the values are their weights.

=head2 Hypotheses

The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of
"statistical guess" of what categories this document should be
assigned to.  Since you may be interested in any of several pieces of
information about the hypothesis (for instance, which categories were
assigned, which category was the single most likely category, the
scores assigned to each category, etc.), the hypothesis is returned as
an object of the C<AI::Categorizer::Hypothesis> class, and you can use
its object methods to get information about the hypothesis.  See its
class documentation for the details.

=head2 Experiments

The C<AI::Categorizer::Experiment> class helps you organize the
results of categorization experiments.  As you get lots of
categorization results (Hypotheses) back from the Learner, you can
feed these results to the Experiment class, along with the correct
answers.  When all results have been collected, you can get a report
on accuracy, precision, recall, F1, and so on, with both
micro-averaging and macro-averaging over categories.  We use the
C<Statistics::Contingency> module from CPAN to manage the
calculations. See the docs for C<AI::Categorizer::Experiment> for more
details.

=head1 METHODS

=over 4

=item new()

Creates a new Categorizer object and returns it.  Accepts lots of
parameters controlling behavior.  In addition to the parameters listed
here, you may pass any parameter accepted by any class that we create
internally (the KnowledgeSet, Learner, Experiment, or Collection
classes), or any class that I<they> create.  This is managed by the
C<Class::Container> module, so see
L<its documentation|Class::Container> for the details of how this
works.

The specific parameters accepted here are:

=over 4

=item progress_file

A string that indicates a place where objects will be saved during
several of the methods of this class.  The default value is the string
C<save>, which means files like C<save-01-knowledge_set> will get
created.  The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.

=item verbose

If true, a few status messages will be printed during execution.

=item training_set

Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.

=item test_set



( run in 1.976 second using v1.01-cache-2.11-cpan-39bf76dae61 )