AI-Categorizer


    that belong to it.

  Machine Learning Algorithms

    There are lots of different ways to make the inductive leap from the
    training documents to unseen documents. The Machine Learning community has
    studied many algorithms for this purpose. To allow flexibility in choosing
    and configuring categorization algorithms, each such algorithm is a subclass
    of "AI::Categorizer::Learner". There are currently four categorizers
    included in the distribution:

    AI::Categorizer::Learner::NaiveBayes
        A pure-perl implementation of a Naive Bayes classifier. No dependencies
        on external modules or other resources. Naive Bayes is usually very fast
        to train and fast to make categorization decisions, but isn't always the
        most accurate categorizer.

    AI::Categorizer::Learner::SVM
        An interface to Cory Spencer's "Algorithm::SVM", which implements a
        Support Vector Machine classifier. SVMs can take a while to train
        (though in certain conditions there are optimizations to make them quite
        fast), but are pretty quick to categorize. They often have very good
        accuracy.

    AI::Categorizer::Learner::DecisionTree
        An interface to "AI::DecisionTree", which implements a Decision Tree
        classifier. Decision Trees generally take longer to train than Naive
        Bayes or SVM classifiers, but they are also quite fast when
        categorizing. Decision Trees have the advantage that you can scrutinize
        the structures of trained decision trees to see how decisions are being
        made.

    AI::Categorizer::Learner::Weka
        An interface to version 2 of the Weka Knowledge Analysis system that
        lets you use any of the machine learners it defines. This gives you
        access to lots and lots of machine learning algorithms in use by machine
        learning researchers. The main drawback is that Weka tends to be quite
        slow and use a lot of memory, and the current interface between Weka and
        "AI::Categorizer" is a bit clumsy.

    Other machine learning methods that may be implemented soonish include
    Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts combiner
    for ensemble learning. No timetable for their creation has yet been set.

    Please see the documentation of these individual modules for more details on
    their guts and quirks. See the "AI::Categorizer::Learner" documentation for
    a description of the general categorizer interface.

    If you wish to create your own classifier, you should inherit from
    "AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
    abstract classes that manage some of the work for you.

  Feature Vectors

    Most categorization algorithms don't deal directly with documents' data;
    instead, they deal with a *vector representation* of a document's *features*.
    The features may be any properties of the document that seem helpful for
    determining its category, but they are usually some version of the "most
    important" words in the document. A list of features and their weights in
    each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
    You may think of this class as roughly analogous to a Perl hash, where the
    keys are the names of features and the values are their weights.
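
    The hash analogy can be sketched in plain Perl. This is only an
    illustration that uses raw word counts as weights; the helper
    "text_to_features" is hypothetical and is not part of
    "AI::Categorizer::FeatureVector", whose real interface is described in
    its own documentation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::Categorizer): count how often each
# lowercased word occurs and return a feature => weight hash reference.
sub text_to_features {
    my ($text) = @_;
    my %features;
    $features{ lc $_ }++ for $text =~ /\b(\w+)\b/g;
    return \%features;
}

my $features = text_to_features("The cat sat on the mat");

# Print features sorted by name so the output order is stable.
printf "%-4s %d\n", $_, $features->{$_} for sort keys %$features;
```

    A real feature vector would typically use a more refined weighting
    than raw counts, but the key/value shape is the same.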

  Hypotheses

    The result of asking a categorizer to categorize a previously unseen
    document is called a hypothesis, because it is some kind of "statistical
    guess" of what categories this document should be assigned to. Since you may
    be interested in any of several pieces of information about the hypothesis
    (for instance, which categories were assigned, which category was the single
    most likely category, the scores assigned to each category, etc.), the
    hypothesis is returned as an object of the "AI::Categorizer::Hypothesis"
    class, and you can use its object methods to get information about the
    hypothesis. See its class documentation for the details.
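
    To make the kinds of information listed above concrete, here is a
    minimal sketch using a plain hash of scores. The scores, category
    names, and threshold are all made up for illustration, and this is not
    the "AI::Categorizer::Hypothesis" interface; the real object exposes
    this information through its own methods.

```perl
use strict;
use warnings;

# Hypothetical scores for one document, keyed by category name.
my %scores    = (sports => 0.82, politics => 0.10, finance => 0.55);
my $threshold = 0.5;    # assumed cutoff for assigning a category

# Categories whose score clears the threshold, highest score first.
my @assigned = sort { $scores{$b} <=> $scores{$a} }
               grep { $scores{$_} > $threshold } keys %scores;

# The single most likely category is the top scorer overall.
my ($best) = sort { $scores{$b} <=> $scores{$a} } keys %scores;

print "assigned: @assigned\n";
print "best: $best\n";
```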

  Experiments

    The "AI::Categorizer::Experiment" class helps you organize the results of
    categorization experiments. As you get lots of categorization results
    (Hypotheses) back from the Learner, you can feed these results to the
    Experiment class, along with the correct answers. When all results have been
    collected, you can get a report on accuracy, precision, recall, F1, and so
    on, with both micro-averaging and macro-averaging over categories. We use
    the "Statistics::Contingency" module from CPAN to manage the calculations.
    See the docs for "AI::Categorizer::Experiment" for more details.
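
    The difference between micro-averaging and macro-averaging can be
    sketched in plain Perl. The counts below are invented, the helper
    functions are hypothetical, and the code does not use
    "Statistics::Contingency" or reproduce how
    "AI::Categorizer::Experiment" computes its report. Macro F1 is taken
    here as the mean of the per-category F1 values, which is one common
    convention.

```perl
use strict;
use warnings;

# Hypothetical per-category counts: true positives, false positives,
# and false negatives.
my %counts = (
    sports   => { tp => 8, fp => 2, fn => 2 },
    politics => { tp => 1, fp => 1, fn => 3 },
);

sub precision {
    my ($c) = @_;
    my $d = $c->{tp} + $c->{fp};
    return $d ? $c->{tp} / $d : 0;
}

sub recall {
    my ($c) = @_;
    my $d = $c->{tp} + $c->{fn};
    return $d ? $c->{tp} / $d : 0;
}

sub f1 {
    my ($p, $r) = @_;
    return ($p + $r) ? 2 * $p * $r / ($p + $r) : 0;
}

my @cats = sort keys %counts;
my $n    = @cats;

# Macro-averaging: compute each metric per category, then take the mean,
# so every category counts equally regardless of its size.
my ($macro_p, $macro_r, $macro_f1) = (0, 0, 0);
for my $cat (@cats) {
    my $p = precision($counts{$cat});
    my $r = recall($counts{$cat});
    $macro_p  += $p / $n;
    $macro_r  += $r / $n;
    $macro_f1 += f1($p, $r) / $n;
}

# Micro-averaging: pool the counts across categories, then compute each
# metric once, so every individual decision counts equally.
my %pooled = (tp => 0, fp => 0, fn => 0);
for my $cat (@cats) {
    $pooled{$_} += $counts{$cat}{$_} for qw(tp fp fn);
}
my $micro_p  = precision(\%pooled);
my $micro_r  = recall(\%pooled);
my $micro_f1 = f1($micro_p, $micro_r);

printf "macro: P=%.3f R=%.3f F1=%.3f\n", $macro_p, $macro_r, $macro_f1;
printf "micro: P=%.3f R=%.3f F1=%.3f\n", $micro_p, $micro_r, $micro_f1;
```

    With these counts the large category dominates the micro-averaged
    figures, while the macro-averaged figures are pulled down by the
    poorly performing small category.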

METHODS
    new()
        Creates a new Categorizer object and returns it. Accepts lots of
        parameters controlling behavior. In addition to the parameters listed
        here, you may pass any parameter accepted by any class that we create
        internally (the KnowledgeSet, Learner, Experiment, or Collection
        classes), or any class that *they* create. This is managed by the
        "Class::Container" module, so see its documentation for the details of
        how this works.

        The specific parameters accepted here are:

        progress_file
            A string that indicates a place where objects will be saved during
            several of the methods of this class. The default value is the
            string "save", which means files like "save-01-knowledge_set" will
            get created. The exact names of these files may change in future
            releases, since they're just used internally to resume where we last
            left off.

        verbose
            If true, a few status messages will be printed during execution.

        training_set
            Specifies the "path" parameter that will be fed to the
            KnowledgeSet's "scan_features()" and "read()" methods during our
            "scan_features()" and "read_training_set()" methods.

        test_set
            Specifies the "path" parameter that will be used when creating a
            Collection during the "evaluate_test_set()" method.

        data_root
            A shortcut for setting the "training_set", "test_set", and
            "category_file" parameters separately. Sets "training_set" to


