AI-Categorizer
these algorithms consider. These are only statistical tests - at best they
are neat tricks or helpful assistants, and at worst they are totally
unreliable. If you plan to use this module for anything really important,
human supervision is essential, both of the categorization process and the
final results.
For the usage details, please see the documentation of each individual
module.
FRAMEWORK COMPONENTS
This section explains the major pieces of the "AI::Categorizer" object
framework. We give a conceptual overview, but don't get into any of the
details about interfaces or usage. See the documentation for the individual
classes for more details.
A diagram of the various classes in the framework can be seen in
"doc/classes-overview.png", and a more detailed view of the same thing can
be seen in "doc/classes.png".
Knowledge Sets
A "knowledge set" is defined as a collection of documents, together with
some information on the categories each document belongs to. Note that this
term is largely specific to this project - other sources may call it a
"training corpus", or "prior knowledge". A knowledge set also contains some
information on how documents will be parsed and how their features (words)
will be extracted and turned into meaningful representations. In this sense,
a knowledge set represents not only a collection of data, but a particular
view on that data.
A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
class. Before you can start playing with categorizers, you will have to
start playing with knowledge sets, so that the categorizers have some data
to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
module for information on its interface.
Feature selection
Deciding which features are the most important is a very large part of the
categorization task - you cannot simply consider all the words in all the
documents when training, and all the words in the document being
categorized. There are two main reasons for this - first, it would mean that
your training and categorizing processes would take forever and use tons of
memory, and second, the significant stuff of the documents would get lost in
the "noise" of the insignificant stuff.
The process of selecting the most important features in the training set is
called "feature selection". It is managed by the
"AI::Categorizer::KnowledgeSet" class, and you will find the details of
feature selection processes in that class's documentation.
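
As a rough illustration of the idea only, here is a pure-Perl sketch that
keeps the features occurring in the largest number of training documents.
It does not use "AI::Categorizer::KnowledgeSet" and is not a description of
how that class works: the "select_features" sub, its "$keep" parameter, the
document-frequency criterion, and the toy data are all assumptions made for
this example.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of AI::Categorizer: keep the $keep features
# that occur in the largest number of documents (document frequency).
# Each document is a plain hash of feature => weight.
sub select_features {
    my ($docs, $keep) = @_;

    # Count how many documents each feature appears in.
    my %doc_freq;
    for my $doc (@$docs) {
        $doc_freq{$_}++ for keys %$doc;
    }

    # Rank by descending frequency, breaking ties alphabetically so the
    # result is deterministic.
    my @ranked = sort { $doc_freq{$b} <=> $doc_freq{$a} or $a cmp $b }
                 keys %doc_freq;

    # Truncate the ranking to the top $keep features.
    $#ranked = $keep - 1 if @ranked > $keep;

    return { map { $_ => 1 } @ranked };
}

# Toy training documents, already reduced to word counts.
my @docs = (
    { perl => 3, module => 1, code  => 5 },
    { perl => 1, hash   => 2, code  => 4 },
    { perl => 2, module => 2, regex => 1 },
);

my $selected = select_features(\@docs, 3);
print join(', ', sort keys %$selected), "\n";
```

Features outside the selected set would then be ignored both when training
and when categorizing, which is what keeps time and memory use manageable.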
Collections
Because documents may be stored in lots of different formats, a "collection"
class has been created as an abstraction of a stored set of documents,
together with a way to iterate through the set and return Document objects.
A knowledge set contains a single collection object. A "Categorizer" doing a
complete test run generally contains two collections, one for training and
one for testing. A "Learner" can mass-categorize a collection.
The "AI::Categorizer::Collection" class and its subclasses instantiate the
idea of a collection in this sense.
Documents
Each document is represented by an "AI::Categorizer::Document" object, or an
object of one of its subclasses. Each document class contains methods for
turning a bunch of data into a Feature Vector. Each document also has a
method to report which categories it belongs to.
Categories
Each category is represented by an "AI::Categorizer::Category" object. Its
main purpose is to keep track of which documents belong to it, though you
can also examine statistical properties of an entire category, such as
obtaining a Feature Vector representing an amalgamation of all the documents
that belong to it.
Machine Learning Algorithms
There are lots of different ways to make the inductive leap from the
training documents to unseen documents. The Machine Learning community has
studied many algorithms for this purpose. To allow flexibility in choosing
and configuring categorization algorithms, each such algorithm is a subclass
of "AI::Categorizer::Learner". There are currently four categorizers
included in the distribution:
AI::Categorizer::Learner::NaiveBayes
A pure-Perl implementation of a Naive Bayes classifier. No dependencies
on external modules or other resources. Naive Bayes is usually very fast
to train and fast to make categorization decisions, but isn't always the
most accurate categorizer.
AI::Categorizer::Learner::SVM
An interface to Corey Spencer's "Algorithm::SVM", which implements a
Support Vector Machine classifier. SVMs can take a while to train
(though in certain conditions there are optimizations to make them quite
fast), but are pretty quick to categorize. They often have very good
accuracy.
AI::Categorizer::Learner::DecisionTree
An interface to "AI::DecisionTree", which implements a Decision Tree
classifier. Decision Trees generally take longer to train than Naive
Bayes or SVM classifiers, but they are also quite fast when
categorizing. Decision Trees have the advantage that you can scrutinize
the structures of trained decision trees to see how decisions are being
made.
AI::Categorizer::Learner::Weka
An interface to version 2 of the Weka Knowledge Analysis system that
lets you use any of the machine learners it defines. This gives you
access to lots and lots of machine learning algorithms in use by machine
learning researchers. The main drawback is that Weka tends to be quite
slow and use a lot of memory, and the current interface between Weka and
"AI::Categorizer" is a bit clumsy.
Other machine learning methods that may be implemented soonish include
Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts combiner
for ensemble learning. No timetable for their creation has yet been set.
Please see the documentation of these individual modules for more details on
their guts and quirks. See the "AI::Categorizer::Learner" documentation for
a description of the general categorizer interface.
If you wish to create your own classifier, you should inherit from
"AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
abstract classes that manage some of the work for you.
Feature Vectors
Most categorization algorithms don't deal directly with documents' data;
they instead deal with a *vector representation* of a document's *features*.
The features may be any properties of the document that seem helpful for
determining its category, but they are usually some version of the "most
important" words in the document. A list of features and their weights in
each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
You may think of this class as roughly analogous to a Perl hash, where the
keys are the names of features and the values are their weights.
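
To make the hash analogy concrete, here is a small sketch that uses plain
Perl hashes as stand-ins for feature vectors. It deliberately avoids the
"AI::Categorizer::FeatureVector" class, so it says nothing about that
class's actual methods. The "word_counts" and "dot" subs, and the choice of
raw word counts as weights, are assumptions made for this example.

```perl
use strict;
use warnings;

# Plain-hash stand-in for a feature vector: feature name => weight.
# This only illustrates the analogy; it does not use the
# AI::Categorizer::FeatureVector class or its methods.
sub word_counts {
    my ($text) = @_;
    my %vector;
    # Every lowercased word is a feature; its weight is its count.
    $vector{ lc $_ }++ for $text =~ /\w+/g;
    return \%vector;
}

# Dot product over the features the two vectors have in common.
sub dot {
    my ($x, $y) = @_;
    my $sum = 0;
    for my $feature (keys %$x) {
        $sum += $x->{$feature} * $y->{$feature} if exists $y->{$feature};
    }
    return $sum;
}

my $doc1 = word_counts('Perl hashes map keys to values');
my $doc2 = word_counts('A feature vector maps feature names to weights');

printf "shared weight: %d\n", dot($doc1, $doc2);
```

Because only the features present in a document are stored, a hash stays
small even when the overall vocabulary is large.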
Hypotheses
The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of "statistical
guess" of what categories this document should be assigned to. Since you may