AI-Categorizer

 view release on metacpan or  search on metacpan

README  view on Meta::CPAN

DESCRIPTION
    "AI::Categorizer" is a framework for automatic text categorization. It
    consists of a collection of Perl modules that implement common
    categorization tasks, and a set of defined relationships among those
    modules. The various details are flexible - for example, you can choose what
    categorization algorithm to use, what features (words or otherwise) of the
    documents should be used (or how to automatically choose these features),
    what format the documents are in, and so on.

    The basic process of using this module will typically involve obtaining a
    collection of pre-categorized documents, creating a "knowledge set"
    representation of those documents, training a categorizer on that knowledge
    set, and saving the trained categorizer for later use. There are several
    ways to carry out this process. The top-level "AI::Categorizer" module
    provides an umbrella class for high-level operations, or you may use the
    interfaces of the individual classes in the framework.

    A simple sample script that reads a training corpus, trains a categorizer,
    and tests the categorizer on a test corpus, is distributed as eg/demo.pl .

    Disclaimer: the results of any of the machine learning algorithms are far
    from infallible (close to fallible?). Categorization of documents is often a
    difficult task even for humans well-trained in the particular domain of
    knowledge, and there are many things a human would consider that none of
    these algorithms consider. These are only statistical tests - at best they
    are neat tricks or helpful assistants, and at worst they are totally
    unreliable. If you plan to use this module for anything really important,
    human supervision is essential, both of the categorization process and the
    final results.

    For the usage details, please see the documentation of each individual
    module.

FRAMEWORK COMPONENTS
    This section explains the major pieces of the "AI::Categorizer" object
    framework. We give a conceptual overview, but don't get into any of the
    details about interfaces or usage. See the documentation for the individual
    classes for more details.

    A diagram of the various classes in the framework can be seen in
    "doc/classes-overview.png", and a more detailed view of the same thing can
    be seen in "doc/classes.png".

  Knowledge Sets

    A "knowledge set" is defined as a collection of documents, together with
    some information on the categories each document belongs to. Note that this
    term is somewhat unique to this project - other sources may call it a
    "training corpus", or "prior knowledge". A knowledge set also contains some
    information on how documents will be parsed and how their features (words)
    will be extracted and turned into meaningful representations. In this sense,
    a knowledge set represents not only a collection of data, but a particular
    view on that data.

    A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
    class. Before you can start playing with categorizers, you will have to
    start playing with knowledge sets, so that the categorizers have some data
    to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
    module for information on its interface.

   Feature selection

    Deciding which features are the most important is a very large part of the
    categorization task - you cannot simply consider all the words in all the
    documents when training, and all the words in the document being
    categorized. There are two main reasons for this - first, it would mean that
    your training and categorizing processes would take forever and use tons of
    memory, and second, the significant stuff of the documents would get lost in
    the "noise" of the insignificant stuff.

    The process of selecting the most important features in the training set is
    called "feature selection". It is managed by the
    "AI::Categorizer::KnowledgeSet" class, and you will find the details of
    feature selection processes in that class's documentation.

  Collections

    Because documents may be stored in lots of different formats, a "collection"
    class has been created as an abstraction of a stored set of documents,
    together with a way to iterate through the set and return Document objects.
    A knowledge set contains a single collection object. A "Categorizer" doing a
    complete test run generally contains two collections, one for training and
    one for testing. A "Learner" can mass-categorize a collection.

    The "AI::Categorizer::Collection" class and its subclasses instantiate the
    idea of a collection in this sense.

  Documents

    Each document is represented by an "AI::Categorizer::Document" object, or an
    object of one of its subclasses. Each document class contains methods for
    turning a bunch of data into a Feature Vector. Each document also has a
    method to report which categories it belongs to.

  Categories

    Each category is represented by an "AI::Categorizer::Category" object. Its
    main purpose is to keep track of which documents belong to it, though you
    can also examine statistical properties of an entire category, such as
    obtaining a Feature Vector representing an amalgamation of all the documents
    that belong to it.

  Machine Learning Algorithms

    There are lots of different ways to make the inductive leap from the
    training documents to unseen documents. The Machine Learning community has
    studied many algorithms for this purpose. To allow flexibility in choosing
    and configuring categorization algorithms, each such algorithm is a subclass
    of "AI::Categorizer::Learner". There are currently four categorizers
    included in the distribution:

    AI::Categorizer::Learner::NaiveBayes
        A pure-perl implementation of a Naive Bayes classifier. No dependencies
        on external modules or other resources. Naive Bayes is usually very fast
        to train and fast to make categorization decisions, but isn't always the
        most accurate categorizer.

    AI::Categorizer::Learner::SVM
        An interface to Corey Spencer's "Algorithm::SVM", which implements a
        Support Vector Machine classifier. SVMs can take a while to train
        (though in certain conditions there are optimizations to make them quite
        fast), but are pretty quick to categorize. They often have very good
        accuracy.

    AI::Categorizer::Learner::DecisionTree
        An interface to "AI::DecisionTree", which implements a Decision Tree
        classifier. Decision Trees generally take longer to train than Naive
        Bayes or SVM classifiers, but they are also quite fast when
        categorizing. Decision Trees have the advantage that you can scrutinize
        the structures of trained decision trees to see how decisions are being
        made.

    AI::Categorizer::Learner::Weka
        An interface to version 2 of the Weka Knowledge Analysis system that



( run in 1.785 second using v1.01-cache-2.11-cpan-39bf76dae61 )