AI-Categorizer

 view release on metacpan or  search on metacpan

Build.PL  view on Meta::CPAN

use Module::Build;
use Config;
use File::Spec;

my $build = new Module::Build
  (
   module_name => 'AI::Categorizer',
   license => 'perl',
   requires => {
		perl                    => '5.6.0',
		Class::Container        => 0.09,
		Storable                => 0,
		Params::Validate        => 0.18,
		Statistics::Contingency => 0.06,
		Lingua::Stem            => 0.50,
		File::Spec              => 0,
	       },
   recommends => {
		  Scalar::Util          => 0,
		  Time::Progress        => 1.1,
		  Algorithm::SVM        => 0.06,
		  AI::DecisionTree      => 0.06,
		  Algorithm::NaiveBayes => 0,
		 },
   build_requires => {
		      Module::Build     => 0.20,
		     },
   create_makefile_pl => 'passthrough',
  );

my $categorizer = File::Spec->catfile('eg', 'categorizer');
if ($build->y_n("Do you want to install the $categorizer script to $Config{installscript}?", 'n')) {
  $build->scripts($categorizer);
}


$build->create_build_script;

{
  my $path = $build->prompt
    (
     "\nIf you have the Weka system installed, please specify the path\n".
     "to the 'weka.jar' file, or '-' to search CLASSPATH, or '!' to skip:",
     '!'
    );
  $build->notes(classpath => $path eq '!' ? undef : $path);
}

Changes  view on Meta::CPAN

Revision history for Perl extension AI::Categorizer.

 - The t/01-naive_bayes.t test was failing (instead of skipping) when
   Algorithm::NaiveBayes wasn't installed.  Now it skips.

0.08 - Tue Mar 20 19:39:41 2007

 - Added a ChiSquared feature selection class. [Francois Paradis]

 - Changed the web locations of the reuters-21578 corpus that
   eg/demo.pl uses, since the location it referenced previously has
   gone away.

 - The building & installing process now uses Module::Build rather
   than ExtUtils::MakeMaker.

 - When the features_kept mechanism was used to explicitly state the
   features to use, and the scan_first parameter was left as its
   default value, the features_kept mechanism would silently fail to
   do anything.  This has now been fixed. [Spotted by Arnaud Gaudinat]

 - Recent versions of Weka have changed the name of the SVM class, so
   I've updated it in our test (t/03-weka.t) of the Weka wrapper
   too. [Sebastien Aperghis-Tramoni]

0.07  Tue May  6 16:15:04 CDT 2003

 - Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
   MANIFEST, so they weren't included in the 0.06 distribution.
   [Spotted by Zoltan Barta]

0.06 Tue Apr 22 10:27:26 CDT 2003

 - Added a relatively simple example script at the request of several
   people, at eg/demo.pl

 - Forgot to note a dependency on Algorithm::NaiveBayes in version
   0.05.  Fixed.

 - AI::Categorizer class wasn't loading AI::Categorizer::KnowledgeSet
   class.  Fixed.

 - Fixed a bug in which the 'documents' and 'categories' parameters to
   the KnowledgeSet objects were never accepted, claiming that it
   failed the "All are Document objects" or "All are Category objects"
   callbacks. [Spotted by rob@phraud.org]

 - Moved the 'stopword_file' parameter from Categorizer.pm to the
   Collection class.

0.05  Sat Mar 29 00:38:21 CST 2003

 - Feature selection is now handled by an abstract FeatureSelector
   framework class.  Currently the only concrete subclass implemented
   is FeatureSelector::DocFrequency.  The 'feature_selection'
   parameter has been replaced with a 'feature_selector_class'
   parameter.

 - Added a k-Nearest-Neighbor machine learner. [First revision
   implemented by David Bell]

 - Added a Rocchio machine learner. [Partially implemented by Xiaobo
   Li]

 - Added a "Guesser" machine learner which simply uses overall class
   probabilities to make categorization decisions.  Sometimes useful
   for providing a set of baseline scores against which to evaluate
   other machine learners.

 - The NaiveBayes learner is now a wrapper around my new
   Algorithm::NaiveBayes module, which is just the old NaiveBayes code
   from here, turned into its own standalone module.

 - Much more extensive regression testing of the code.

 - Added a Document subclass for XML documents. [Implemented by
   Jae-Moon Lee] Its interface is still unstable, it may change in
   later releases.

 - Added a 'Build.PL' file for an alternate installation method using
   Module::Build.

 - Fixed a problem in the Hypothesis' best_category() method that
   would often result in the wrong category being reported.  Added a
   regression test to exercise the Hypothesis class.  [Spotted by
   Xiaobo Li]

 - The 'categorizer' script now records more useful benchmarking
   information about time & memory in its outfile.

 - The AI::Categorizer->dump_parameters() method now tries to avoid
   showing you its entire list of stopwords.

 - Document objects now use a default 'name' if none is supplied.

 - For some Learner classes, the generated Hypothesis objects had
   non-functioning all_categories() methods.  Fixed.

 - The Collection::Files class now uses File::Spec internally to
   manage cross-platform filenames.

 - Added the 'stopword_behavior' parameter for controlling how
   stopword lists and stemming interact.  Previously, if stopwords &
   stemming were both used, stopwords were assumed to be pre-stemmed,
   which often isn't the case.

 - parse() is now an instance method of the Document class, not a
   class method.  This means it can operate directly on an object, it
   doesn't have to return a hash of content.  This allows more
   flexible document parsing.  This may cause some backward
   compatibility problems if people were overriding the parse()
   method.

 - Added a parse_handle() method, which can parse a document directly
   from a filehandle.

 - Fixed documentation for add_hypothesis() [spotted by Thierry
   Guillotin]

 - Added documentation for the AI::Categorizer::Collection::Files
   class.

0.04  Thu Nov  7 19:27:15 AEST 2002

 - Added learners for SVMs, Decision Trees, and a pass-through to
   Weka.

 - Added a virtual class for binary classifiers.

 - Wrote documentation for lots of the undocumented classes.

 - Added a PNG file giving an overview diagram of the classes.

 - Added a script 'categorizer' to provide a simple command-line
   interface to AI::Categorizer

 - save_state() and restore_state() now save to a directory, not a
   file.

 - Removed F1(), precision(), recall(), etc. from Util package since
   they're in Statistics::Contingency.  Added random_elements() to
   Util.

 - Collection::Files now warns when no category information is known
   about a document in the collection (knowing it's in zero categories
   is okay).

 - Added the Collection::InMemory class

 - Much more thorough testing with 'make test'.

 - Added add_hypothesis() method to Experiment.

 - Added dot() and value() methods to FeatureVector.

 - Added 'feature_selection' parameter to KnowledgeSet.

 - Added document($name) accessor method to KnowledgeSet.

 - In KnowledgeSet, load(), read(), and scan_*() can now accept a
   Collection object.

 - Added document_frequency(), finish(), and weigh_features() methods
   to KnowledgeSet.

 - Added save_features() and restore_features() to KnowledgeSet.

 - Added default categories() and categorize() methods to Learner base
   class.  get_scores() is now abstract.

 - Extended interface of ObjectSet class with retrieve(), includes(),
   and includes_name().

 - Moved 'term_weighting' parameter from Document to KnowledgeSet,
   since the normalized version needs to know the maximum
   term-frequency.  Also changed its values to 'n', 'l', 'b', and 't',
   with 'x' a synonym for 't'.

 - Implemented full range of TF/IDF term weighting methods (see Salton
   & Buckley, "Term Weighting Approaches in Automatic Text Retrieval",
   in journal "Information Processing & Management", 1988 #5)

0.03  Wed Jul 24 01:57:00 AEST 2002

 - First version released to CPAN

0.01  Wed Apr 17 10:47:21 2002
 - original version; created by h2xs 1.21 with options
   -XA -n AI::Categorizer

INSTALL  view on Meta::CPAN

           Installation instructions for AI::Categorizer

To install this module, follow the standard steps for installing most
Perl modules:

  perl Makefile.PL
  make
  make test
  make install

Or you may use the CPAN.pm module, which will automatically execute
these steps for you, and help you get the prerequisite dependencies
installed as well.

Alternatively, you can use the new Module::Build-style installer:

  perl Build.PL
  ./Build
  ./Build test
  ./Build install

 -Ken

MANIFEST  view on Meta::CPAN

Build.PL
Changes
doc/classes-overview.png
doc/classes.png
eg/categorizer
eg/demo.pl
eg/easy_guesser.pl
INSTALL
lib/AI/Categorizer.pm
lib/AI/Categorizer/Category.pm
lib/AI/Categorizer/Collection.pm
lib/AI/Categorizer/Collection/DBI.pm
lib/AI/Categorizer/Collection/Files.pm
lib/AI/Categorizer/Collection/InMemory.pm
lib/AI/Categorizer/Collection/SingleFile.pm
lib/AI/Categorizer/Document.pm
lib/AI/Categorizer/Document/SMART.pm
lib/AI/Categorizer/Document/Text.pm
lib/AI/Categorizer/Document/XML.pm
lib/AI/Categorizer/Experiment.pm
lib/AI/Categorizer/FeatureSelector.pm
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
lib/AI/Categorizer/FeatureSelector/ChiSquare.pm
lib/AI/Categorizer/FeatureSelector/DocFrequency.pm
lib/AI/Categorizer/FeatureVector.pm
lib/AI/Categorizer/Hypothesis.pm
lib/AI/Categorizer/KnowledgeSet.pm
lib/AI/Categorizer/Learner.pm
lib/AI/Categorizer/Learner/Boolean.pm
lib/AI/Categorizer/Learner/DecisionTree.pm
lib/AI/Categorizer/Learner/Guesser.pm
lib/AI/Categorizer/Learner/KNN.pm
lib/AI/Categorizer/Learner/NaiveBayes.pm
lib/AI/Categorizer/Learner/Rocchio.pm
lib/AI/Categorizer/Learner/SVM.pm
lib/AI/Categorizer/Learner/Weka.pm
lib/AI/Categorizer/ObjectSet.pm
lib/AI/Categorizer/Storable.pm
lib/AI/Categorizer/Util.pm
Makefile.PL
MANIFEST
README
t/01-naive_bayes.t
t/02-experiment.t
t/03-weka.t
t/04-decision_tree.t
t/05-svm.t
t/06-knn.t
t/07-guesser.t
t/09-rocchio.t
t/10-tools.t
t/11-feature_vector.t
t/12-hypothesis.t
t/13-document.t
t/14-collection.t
t/15-knowledge_set.t
t/common.pl
t/traindocs/doc1
t/traindocs/doc2
t/traindocs/doc3
t/traindocs/doc4
META.yml

META.yml  view on Meta::CPAN

---
name: AI-Categorizer
version: 0.09
author:
  - 'Ken Williams <ken@mathforum.org>'
  - |-
    Discussion about this module can be directed to the perl-AI list at
    <perl-ai@perl.org>.  For more info about the list, see
    http://lists.perl.org/showlist.cgi?name=perl-ai
abstract: Automatic Text Categorization
license: perl
resources:
  license: http://dev.perl.org/licenses/
requires:
  Class::Container: 0.09
  File::Spec: 0
  Lingua::Stem: 0.5
  Params::Validate: 0.18
  Statistics::Contingency: 0.06
  Storable: 0
  perl: 5.6.0
build_requires:
  Module::Build: 0.2
recommends:
  AI::DecisionTree: 0.06
  Algorithm::NaiveBayes: 0
  Algorithm::SVM: 0.06
  Scalar::Util: 0
  Time::Progress: 1.1
provides:
  AI::Categorizer:
    file: lib/AI/Categorizer.pm
    version: 0.09
  AI::Categorizer::Category:
    file: lib/AI/Categorizer/Category.pm
  AI::Categorizer::Collection:
    file: lib/AI/Categorizer/Collection.pm
  AI::Categorizer::Collection::DBI:
    file: lib/AI/Categorizer/Collection/DBI.pm
  AI::Categorizer::Collection::Files:
    file: lib/AI/Categorizer/Collection/Files.pm
  AI::Categorizer::Collection::InMemory:
    file: lib/AI/Categorizer/Collection/InMemory.pm
  AI::Categorizer::Collection::SingleFile:
    file: lib/AI/Categorizer/Collection/SingleFile.pm
  AI::Categorizer::Document:
    file: lib/AI/Categorizer/Document.pm
  AI::Categorizer::Document::SMART:
    file: lib/AI/Categorizer/Document/SMART.pm
  AI::Categorizer::Document::Text:
    file: lib/AI/Categorizer/Document/Text.pm
  AI::Categorizer::Document::XML:
    file: lib/AI/Categorizer/Document/XML.pm
  AI::Categorizer::Document::XML::Handler:
    file: lib/AI/Categorizer/Document/XML.pm
  AI::Categorizer::Experiment:
    file: lib/AI/Categorizer/Experiment.pm
  AI::Categorizer::FeatureSelector:
    file: lib/AI/Categorizer/FeatureSelector.pm
  AI::Categorizer::FeatureSelector::CategorySelector:
    file: lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
  AI::Categorizer::FeatureSelector::ChiSquare:
    file: lib/AI/Categorizer/FeatureSelector/ChiSquare.pm
  AI::Categorizer::FeatureSelector::DocFrequency:
    file: lib/AI/Categorizer/FeatureSelector/DocFrequency.pm
  AI::Categorizer::FeatureVector:
    file: lib/AI/Categorizer/FeatureVector.pm
  AI::Categorizer::Hypothesis:
    file: lib/AI/Categorizer/Hypothesis.pm
  AI::Categorizer::KnowledgeSet:
    file: lib/AI/Categorizer/KnowledgeSet.pm
  AI::Categorizer::Learner:
    file: lib/AI/Categorizer/Learner.pm
  AI::Categorizer::Learner::Boolean:
    file: lib/AI/Categorizer/Learner/Boolean.pm
  AI::Categorizer::Learner::DecisionTree:
    file: lib/AI/Categorizer/Learner/DecisionTree.pm
    version: 0.01
  AI::Categorizer::Learner::Guesser:
    file: lib/AI/Categorizer/Learner/Guesser.pm
  AI::Categorizer::Learner::KNN:
    file: lib/AI/Categorizer/Learner/KNN.pm
  AI::Categorizer::Learner::KNN::Queue:
    file: lib/AI/Categorizer/Learner/KNN.pm
  AI::Categorizer::Learner::NaiveBayes:
    file: lib/AI/Categorizer/Learner/NaiveBayes.pm
  AI::Categorizer::Learner::Rocchio:
    file: lib/AI/Categorizer/Learner/Rocchio.pm
    version: 0.01
  AI::Categorizer::Learner::SVM:
    file: lib/AI/Categorizer/Learner/SVM.pm
    version: 0.01
  AI::Categorizer::Learner::Weka:
    file: lib/AI/Categorizer/Learner/Weka.pm
  AI::Categorizer::ObjectSet:
    file: lib/AI/Categorizer/ObjectSet.pm
  AI::Categorizer::Storable:
    file: lib/AI/Categorizer/Storable.pm
  AI::Categorizer::Util:
    file: lib/AI/Categorizer/Util.pm
generated_by: Module::Build version 0.2806
meta-spec:
  url: http://module-build.sourceforge.net/META-spec-v1.2.html
  version: 1.2

Makefile.PL  view on Meta::CPAN

# Note: this file was auto-generated by Module::Build::Compat version 0.03
    
    unless (eval "use Module::Build::Compat 0.02; 1" ) {
      print "This module requires Module::Build to install itself.\n";
      
      require ExtUtils::MakeMaker;
      my $yn = ExtUtils::MakeMaker::prompt
	('  Install Module::Build now from CPAN?', 'y');
      
      unless ($yn =~ /^y/i) {
	die " *** Cannot install without Module::Build.  Exiting ...\n";
      }
      
      require Cwd;
      require File::Spec;
      require CPAN;
      
      # Save this 'cause CPAN will chdir all over the place.
      my $cwd = Cwd::cwd();
      
      CPAN::Shell->install('Module::Build::Compat');
      CPAN::Shell->expand("Module", "Module::Build::Compat")->uptodate
	or die "Couldn't install Module::Build, giving up.\n";
      
      chdir $cwd or die "Cannot chdir() back to $cwd: $!";
    }
    eval "use Module::Build::Compat 0.02; 1" or die $@;
    
    Module::Build::Compat->run_build_pl(args => \@ARGV);
    require Module::Build;
    Module::Build::Compat->write_makefile(build_class => 'Module::Build');

README  view on Meta::CPAN

NAME
    AI::Categorizer - Automatic Text Categorization

SYNOPSIS
     use AI::Categorizer;
     my $c = new AI::Categorizer(...parameters...);
 
     # Run a complete experiment - training on a corpus, testing on a test
     # set, printing a summary of results to STDOUT
     $c->run_experiment;
 
     # Or, run the parts of $c->run_experiment separately
     $c->scan_features;
     $c->read_training_set;
     $c->train;
     $c->evaluate_test_set;
     print $c->stats_table;
 
     # After training, use the Learner for categorization
     my $l = $c->learner;
     while (...) {
       my $d = ...create a document...
       my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
       print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
       print "Best category: ", $hypothesis->best_category, "\n";
     }
 
DESCRIPTION
    "AI::Categorizer" is a framework for automatic text categorization. It
    consists of a collection of Perl modules that implement common
    categorization tasks, and a set of defined relationships among those
    modules. The various details are flexible - for example, you can choose what
    categorization algorithm to use, what features (words or otherwise) of the
    documents should be used (or how to automatically choose these features),
    what format the documents are in, and so on.

    The basic process of using this module will typically involve obtaining a
    collection of pre-categorized documents, creating a "knowledge set"
    representation of those documents, training a categorizer on that knowledge
    set, and saving the trained categorizer for later use. There are several
    ways to carry out this process. The top-level "AI::Categorizer" module
    provides an umbrella class for high-level operations, or you may use the
    interfaces of the individual classes in the framework.

    A simple sample script that reads a training corpus, trains a categorizer,
    and tests the categorizer on a test corpus, is distributed as eg/demo.pl .

    Disclaimer: the results of any of the machine learning algorithms are far
    from infallible (close to fallible?). Categorization of documents is often a
    difficult task even for humans well-trained in the particular domain of
    knowledge, and there are many things a human would consider that none of
    these algorithms consider. These are only statistical tests - at best they
    are neat tricks or helpful assistants, and at worst they are totally
    unreliable. If you plan to use this module for anything really important,
    human supervision is essential, both of the categorization process and the
    final results.

    For the usage details, please see the documentation of each individual
    module.

FRAMEWORK COMPONENTS
    This section explains the major pieces of the "AI::Categorizer" object
    framework. We give a conceptual overview, but don't get into any of the
    details about interfaces or usage. See the documentation for the individual
    classes for more details.

    A diagram of the various classes in the framework can be seen in
    "doc/classes-overview.png", and a more detailed view of the same thing can
    be seen in "doc/classes.png".

  Knowledge Sets

    A "knowledge set" is defined as a collection of documents, together with
    some information on the categories each document belongs to. Note that this
    term is somewhat unique to this project - other sources may call it a
    "training corpus", or "prior knowledge". A knowledge set also contains some
    information on how documents will be parsed and how their features (words)
    will be extracted and turned into meaningful representations. In this sense,
    a knowledge set represents not only a collection of data, but a particular
    view on that data.

    A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
    class. Before you can start playing with categorizers, you will have to
    start playing with knowledge sets, so that the categorizers have some data
    to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
    module for information on its interface.

   Feature selection

    Deciding which features are the most important is a very large part of the
    categorization task - you cannot simply consider all the words in all the
    documents when training, and all the words in the document being
    categorized. There are two main reasons for this - first, it would mean that
    your training and categorizing processes would take forever and use tons of
    memory, and second, the significant stuff of the documents would get lost in
    the "noise" of the insignificant stuff.

    The process of selecting the most important features in the training set is
    called "feature selection". It is managed by the
    "AI::Categorizer::KnowledgeSet" class, and you will find the details of
    feature selection processes in that class's documentation.

  Collections

    Because documents may be stored in lots of different formats, a "collection"
    class has been created as an abstraction of a stored set of documents,
    together with a way to iterate through the set and return Document objects.
    A knowledge set contains a single collection object. A "Categorizer" doing a
    complete test run generally contains two collections, one for training and
    one for testing. A "Learner" can mass-categorize a collection.

    The "AI::Categorizer::Collection" class and its subclasses instantiate the
    idea of a collection in this sense.

  Documents

    Each document is represented by an "AI::Categorizer::Document" object, or an
    object of one of its subclasses. Each document class contains methods for
    turning a bunch of data into a Feature Vector. Each document also has a
    method to report which categories it belongs to.

  Categories

    Each category is represented by an "AI::Categorizer::Category" object. Its
    main purpose is to keep track of which documents belong to it, though you
    can also examine statistical properties of an entire category, such as
    obtaining a Feature Vector representing an amalgamation of all the documents
    that belong to it.

  Machine Learning Algorithms

    There are lots of different ways to make the inductive leap from the
    training documents to unseen documents. The Machine Learning community has
    studied many algorithms for this purpose. To allow flexibility in choosing
    and configuring categorization algorithms, each such algorithm is a subclass
    of "AI::Categorizer::Learner". There are currently four categorizers
    included in the distribution:

    AI::Categorizer::Learner::NaiveBayes
        A pure-perl implementation of a Naive Bayes classifier. No dependencies
        on external modules or other resources. Naive Bayes is usually very fast
        to train and fast to make categorization decisions, but isn't always the
        most accurate categorizer.

    AI::Categorizer::Learner::SVM
        An interface to Corey Spencer's "Algorithm::SVM", which implements a
        Support Vector Machine classifier. SVMs can take a while to train
        (though in certain conditions there are optimizations to make them quite
        fast), but are pretty quick to categorize. They often have very good
        accuracy.

    AI::Categorizer::Learner::DecisionTree
        An interface to "AI::DecisionTree", which implements a Decision Tree
        classifier. Decision Trees generally take longer to train than Naive
        Bayes or SVM classifiers, but they are also quite fast when
        categorizing. Decision Trees have the advantage that you can scrutinize
        the structures of trained decision trees to see how decisions are being
        made.

    AI::Categorizer::Learner::Weka
        An interface to version 2 of the Weka Knowledge Analysis system that
        lets you use any of the machine learners it defines. This gives you
        access to lots and lots of machine learning algorithms in use by machine
        learning researches. The main drawback is that Weka tends to be quite
        slow and use a lot of memory, and the current interface between Weka and
        "AI::Categorizer" is a bit clumsy.

    Other machine learning methods that may be implemented soonish include
    Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts combiner
    for ensemble learning. No timetable for their creation has yet been set.

    Please see the documentation of these individual modules for more details on
    their guts and quirks. See the "AI::Categorizer::Learner" documentation for
    a description of the general categorizer interface.

    If you wish to create your own classifier, you should inherit from
    "AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
    abstract classes that manage some of the work for you.

  Feature Vectors

    Most categorization algorithms don't deal directly with documents' data,
    they instead deal with a *vector representation* of a document's *features*.
    The features may be any properties of the document that seem helpful for
    determining its category, but they are usually some version of the "most
    important" words in the document. A list of features and their weights in
    each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
    You may think of this class as roughly analogous to a Perl hash, where the
    keys are the names of features and the values are their weights.

  Hypotheses

    The result of asking a categorizer to categorize a previously unseen
    document is called a hypothesis, because it is some kind of "statistical
    guess" of what categories this document should be assigned to. Since you may
    be interested in any of several pieces of information about the hypothesis
    (for instance, which categories were assigned, which category was the single
    most likely category, the scores assigned to each category, etc.), the
    hypothesis is returned as an object of the "AI::Categorizer::Hypothesis"
    class, and you can use its object methods to get information about the
    hypothesis. See its class documentation for the details.

  Experiments

    The "AI::Categorizer::Experiment" class helps you organize the results of
    categorization experiments. As you get lots of categorization results
    (Hypotheses) back from the Learner, you can feed these results to the
    Experiment class, along with the correct answers. When all results have been
    collected, you can get a report on accuracy, precision, recall, F1, and so
    on, with both micro-averaging and macro-averaging over categories. We use
    the "Statistics::Contingency" module from CPAN to manage the calculations.
    See the docs for "AI::Categorizer::Experiment" for more details.

METHODS
    new()
        Creates a new Categorizer object and returns it. Accepts lots of
        parameters controlling behavior. In addition to the parameters listed
        here, you may pass any parameter accepted by any class that we create
        internally (the KnowledgeSet, Learner, Experiment, or Collection
        classes), or any class that *they* create. This is managed by the
        "Class::Container" module, so see its documentation for the details of
        how this works.

        The specific parameters accepted here are:

        progress_file
            A string that indicates a place where objects will be saved during
            several of the methods of this class. The default value is the
            string "save", which means files like "save-01-knowledge_set" will
            get created. The exact names of these files may change in future
            releases, since they're just used internally to resume where we last
            left off.

        verbose
            If true, a few status messages will be printed during execution.

        training_set
            Specifies the "path" parameter that will be fed to the
            KnowledgeSet's "scan_features()" and "read()" methods during our
            "scan_features()" and "read_training_set()" methods.

        test_set
            Specifies the "path" parameter that will be used when creating a
            Collection during the "evaluate_test_set()" method.

        data_root
            A shortcut for setting the "training_set", "test_set", and
            "category_file" parameters separately. Sets "training_set" to
            "$data_root/training", "test_set" to "$data_root/test", and
            "category_file" (used by some of the Collection classes) to
            "$data_root/cats.txt".

    learner()
        Returns the Learner object associated with this Categorizer. Before
        "train()", the Learner will of course not be trained yet.

    knowledge_set()
        Returns the KnowledgeSet object associated with this Categorizer. If
        "read_training_set()" has not yet been called, the KnowledgeSet will not
        yet be populated with any training data.

    run_experiment()
        Runs a complete experiment on the training and testing data, reporting
        the results on "STDOUT". Internally, this is just a shortcut for calling
        the "scan_features()", "read_training_set()", "train()", and
        "evaluate_test_set()" methods, then printing the value of the
        "stats_table()" method.

    scan_features()
        Scans the Collection specified in the "test_set" parameter to determine
        the set of features (words) that will be considered when training the
        Learner. Internally, this calls the "scan_features()" method of the
        KnowledgeSet, then saves a list of the KnowledgeSet's features for later
        use.

        This step is not strictly necessary, but it can dramatically reduce
        memory requirements if you scan for features before reading the entire
        corpus into memory.

    read_training_set()
        Populates the KnowledgeSet with the data specified in the "test_set"
        parameter. Internally, this calls the "read()" method of the
        KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
        object for later use.

    train()
        Calls the Learner's "train()" method, passing it the KnowledgeSet
        created during "read_training_set()". Returns the Learner object. Also
        saves the Learner object for later use.

    evaluate_test_set()
        Creates a Collection based on the value of the "test_set" parameter, and
        calls the Learner's "categorize_collection()" method using this
        Collection. Returns the resultant Experiment object. Also saves the
        Experiment object for later use in the "stats_table()" method.

    stats_table()
        Returns the value of the Experiment's (as created by
        "evaluate_test_set()") "stats_table()" method. This is a string that
        shows various statistics about the accuracy/precision/recall/F1/etc. of
        the assignments made during testing.

HISTORY
    This module is a revised and redesigned version of the previous
    "AI::Categorize" module by the same author. Note the added 'r' in the new
    name. The older module has a different interface, and no attempt at backward
    compatibility has been made - that's why I changed the name.

    You can have both "AI::Categorize" and "AI::Categorizer" installed at the
    same time on the same machine, if you want. They don't know about each other
    or use conflicting namespaces.

AUTHOR
    Ken Williams <ken@mathforum.org>

    Discussion about this module can be directed to the perl-AI list at
    <perl-ai@perl.org>. For more info about the list, see
    http://lists.perl.org/showlist.cgi?name=perl-ai

REFERENCES
    An excellent introduction to the academic field of Text Categorization is
    Fabrizio Sebastiani's "Machine Learning in Automated Text Categorization":
    ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.

COPYRIGHT
    Copyright 2000-2003 Ken Williams. All rights reserved.

    This distribution is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself. These terms apply to every file in the
    distribution - if you have questions, please contact the author.

eg/categorizer  view on Meta::CPAN

#!/usr/bin/perl

# This script creates a Categorizer and runs several of its methods on
# a corpus, reporting the results.
#
# Copyright 2002 Ken Williams, under the same license as the
# AI::Categorizer distribution.


use strict;
use AI::Categorizer;
use Benchmark;
my $HAVE_YAML = eval "use YAML; 1";

my ($opt, $do_stage, $outfile) = parse_command_line(@ARGV);
@ARGV = grep !/^-\d$/, @ARGV;

my $c = eval {new AI::Categorizer(%$opt)};
if ($@ and $@ =~ /^The following parameter/) {
  die "$@\nPlease see the AI::Categorizer documentation for a description of parameters accepted.\n";
}
die $@ if $@;

%$do_stage = map {$_, 1} 1..5 unless keys %$do_stage;

my $out_fh;
if ($outfile) {
  open $out_fh, ">> $outfile" or die "Can't create $outfile: $!";
  select((select($out_fh), $|=1)[0]);
  if (keys(%$do_stage) > 1) {
    print $out_fh "~~~~~~~~~~~~~~~~", scalar(localtime), "~~~~~~~~~~~~~~~~~~~~~~~~~~~\n";
    if ($HAVE_YAML) {
      print {$out_fh} YAML::Dump($c->dump_parameters);
    } else {
      warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
    }
  }
}
  

run_section('scan_features',     1, $do_stage);
run_section('read_training_set', 2, $do_stage);
run_section('train',             3, $do_stage);
run_section('evaluate_test_set', 4, $do_stage);
if ($do_stage->{5}) {
  my $result = $c->stats_table;
  print $result if $c->verbose;
  print $out_fh $result if $out_fh;
}

sub run_section {
  my ($section, $stage, $do_stage) = @_;
  return unless $do_stage->{$stage};
  if (keys %$do_stage > 1) {
    print " % $0 @ARGV -$stage\n" if $c->verbose;
    die "$0 is not executable, please change its execution permissions"
      unless -x $0;
    system($0, @ARGV, "-$stage") == 0
      or die "$0 returned nonzero status, \$?=$?";
    return;
  }
  my $start = new Benchmark;
  $c->$section();
  my $end = new Benchmark;
  my $summary = timestr(timediff($end, $start));
  my ($rss, $vsz) = memory_usage();
  print "$summary (memory: rss=$rss, vsz=$vsz)\n" if $c->verbose;
  print $out_fh "Stage $stage: $summary (memory: rss=$rss, vsz=$vsz)\n" if $out_fh;
}

sub parse_command_line {
  my (%opt, %do_stage);

  while (@_) {
    if ($_[0] =~ /^-(\d+)$/) {
      shift;
      $do_stage{$1} = 1;
      
    } elsif ( $_[0] eq '--config_file' ) {
      die "--config_file requires the YAML module from CPAN to be installed.\n" unless $HAVE_YAML;
      shift;
      my $file = shift;
      my $href = YAML::LoadFile($file);
      @opt{keys %$href} = values %$href;
      
    } elsif ( $_[0] =~ /^--/ ) {
      my ($k, $v) = (shift, shift);
      $k =~ s/^--//;
      $opt{$k} = $v;
      
    } else {
      die usage();
    }
  }

  while (my ($k, $v) = each %opt) {
    # Allow abbreviations
    if ($k =~ /^(\w+)_class$/) {
      my $name = $1;
      $v =~ s/^::/AI::Categorizer::\u${name}::/;
      $opt{$k} = $v;
    }
  }

  my $outfile;
  unless ($outfile = delete $opt{outfile}) {
    $outfile = $opt{progress_file} ? "$opt{progress_file}-results.txt" : "results.txt";
  }

  return (\%opt, \%do_stage, $outfile);
}

sub usage {
  return <<EOF;
 Usage:

  $0 --parameter_1 <value_1> --parameter_2 <value_2>
      # You may specify a YAML config file as follows:
  $0 --config_file <path> --parameter_3 <value_3>
      # Or, to run only step 3 (of 5)
  $0 --config_file <path> -3

 --parameter_1, --parameter_2, etc. are parameters accepted by
 AI::Categorizer objects' new() methods.

EOF
}

sub memory_usage {
  my ($rss, $vsz);
  if ($^O eq 'darwin' or $^O eq 'linux') {
    ($rss, $vsz) = `ps -eo rss,vsz -p $$` =~ /(\d+)\s+(\d+)/;
  } elsif ($^O eq 'solaris') {
    ($rss, $vsz) = `ps -o rss,vsz -p $$` =~ /(\d+)\s+(\d+)/;
  } else {
    warn "Unknown system, can't get memory usage";
  }
  return ($rss, $vsz);
}

eg/demo.pl  view on Meta::CPAN

#!/usr/bin/perl

# This script is a fairly simple demonstration of how AI::Categorizer
# can be used.  There are lots of other less-simple demonstrations
# (actually, they're doing much simpler things, but are probably
# harder to follow) in the tests in the t/ subdirectory.  The
# eg/categorizer script can also be a good example if you're willing
# to figure out a bit how it works.
#
# This script reads a training corpus from a directory of plain-text
# documents, trains a Naive Bayes categorizer on it, then tests the
# categorizer on a set of test documents.

use strict;
use AI::Categorizer;
use AI::Categorizer::Collection::Files;
use AI::Categorizer::Learner::NaiveBayes;
use File::Spec;

die("Usage: $0 <corpus>\n".
    "  A sample corpus (data set) can be downloaded from\n".
    "     http://www.cpan.org/authors/Ken_Williams/data/reuters-21578.tar.gz\n".
    "  or http://www.limnus.com/~ken/reuters-21578.tar.gz\n")
  unless @ARGV == 1;

my $corpus = shift;

my $training  = File::Spec->catfile( $corpus, 'training' );
my $test      = File::Spec->catfile( $corpus, 'test' );
my $cats      = File::Spec->catfile( $corpus, 'cats.txt' );
my $stopwords = File::Spec->catfile( $corpus, 'stopwords' );

my %params;
if (-e $stopwords) {
  $params{stopword_file} = $stopwords;
} else {
  warn "$stopwords not found - no stopwords will be used.\n";
}

if (-e $cats) {
  $params{category_file} = $cats;
} else {
  die "$cats not found - can't proceed without category information.\n";
}


# In a real-world application these Collection objects could be of any
# type (any Collection subclass).  Or you could create each Document
# object manually.  Or you could let the KnowledgeSet create the
# Collection objects for you.

$training = AI::Categorizer::Collection::Files->new( path => $training, %params );
$test     = AI::Categorizer::Collection::Files->new( path => $test, %params );

# We turn on verbose mode so you can watch the progress of loading &
# training.  This looks nicer if you have Time::Progress installed!

print "Loading training set\n";
my $k = AI::Categorizer::KnowledgeSet->new( verbose => 1 );
$k->load( collection => $training );

print "Training categorizer\n";
my $l = AI::Categorizer::Learner::NaiveBayes->new( verbose => 1 );
$l->train( knowledge_set => $k );

print "Categorizing test set\n";
my $experiment = $l->categorize_collection( collection => $test );

print $experiment->stats_table;


# If you want to get at the specific assigned categories for a
# specific document, you can do it like this:

my $doc = AI::Categorizer::Document->new
  ( content => "Hello, I am a pretty generic document with not much to say." );

my $h = $l->categorize( $doc );

print ("For test document:\n",
       "  Best category = ", $h->best_category, "\n",
       "  All categories = ", join(', ', $h->categories), "\n");

eg/easy_guesser.pl  view on Meta::CPAN

#!/usr/bin/perl

# This script can be helpful for getting a set of baseline scores for
# a categorization task.  It simulates using the "Guesser" learner,
# but is much faster.  Because it doesn't leverage using the whole
# framework, though, it expects everything to be in a very strict
# format.  <cats-file> is in the same format as the 'category_file'
# parameter to the Collection class.  <training-dir> and <test-dir>
# give paths to directories of documents, named as in <cats-file>.

use strict;
use Statistics::Contingency;

die "Usage: $0 <cats-file> <training-dir> <test-dir>\n" unless @ARGV == 3;
my ($cats, $training, $test) = @ARGV;

die "$cats isn't a plain file\n" unless -f $cats;
die "$training isn't a directory\n" unless -d $training;
die "$test isn't a directory\n" unless -d $test;

my %cats;
print "Reading category file\n";
open my($fh), $cats or die "Can't read $cats: $!";
while (<$fh>) {
    my ($doc, @cats) = split;
    $cats{$doc} = \@cats;
}

my (%freq, $docs);
print "Scanning training set\n";
opendir my($dh), $training or die "Can't opendir $training: $!";
while (defined(my $file = readdir $dh)) {
    next if $file eq '.' or $file eq '..';
    unless ($cats{$file}) {
	warn "No category information for '$file'";
	next;
    }
    $docs++;
    $freq{$_}++ foreach @{$cats{$file}};
}
closedir $dh;

print "Calculating probabilities (@{[ %freq ]})\n";
$_ /= $docs foreach values %freq;
my @cats = keys %freq;

print "Scoring test documents\n";
my $c = Statistics::Contingency->new(categories => \@cats);
opendir $dh, $test or die "Can't opendir $test: $!";
while (defined(my $file = readdir $dh)) {
    next if $file eq '.' or $file eq '..';
    unless ($cats{$file}) {
	warn "No category information for '$file'";
	next;
    }
    my @assigned;
    foreach (@cats) {
	push @assigned, $_ if rand() < $freq{$_};
    }
    $c->add_result(\@assigned, $cats{$file});
}

print $c->stats_table(4);

lib/AI/Categorizer.pm  view on Meta::CPAN

package AI::Categorizer;
$VERSION = '0.09';

use strict;
use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);
use File::Spec;
use AI::Categorizer::Learner;
use AI::Categorizer::Document;
use AI::Categorizer::Category;
use AI::Categorizer::Collection;
use AI::Categorizer::Hypothesis;
use AI::Categorizer::KnowledgeSet;


__PACKAGE__->valid_params
  (
   progress_file => { type => SCALAR, default => 'save' },
   knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
   learner       => { isa => 'AI::Categorizer::Learner' },
   verbose       => { type => BOOLEAN, default => 0 },
   training_set  => { type => SCALAR, optional => 1 },
   test_set      => { type => SCALAR, optional => 1 },
   data_root     => { type => SCALAR, optional => 1 },
  );

__PACKAGE__->contained_objects
  (
   knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
   learner       => { class => 'AI::Categorizer::Learner::NaiveBayes' },
   experiment    => { class => 'AI::Categorizer::Experiment',
		      delayed => 1 },
   collection    => { class => 'AI::Categorizer::Collection::Files',
		      delayed => 1 },
  );

sub new {
  my $package = shift;
  my %args = @_;
  my %defaults;
  if (exists $args{data_root}) {
    $defaults{training_set} = File::Spec->catfile($args{data_root}, 'training');
    $defaults{test_set} = File::Spec->catfile($args{data_root}, 'test');
    $defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
    delete $args{data_root};
  }

  return $package->SUPER::new(%defaults, %args);
}

#sub dump_parameters {
#  my $p = shift()->SUPER::dump_parameters;
#  delete $p->{stopwords} if $p->{stopword_file};
#  return $p;
#}

sub knowledge_set { shift->{knowledge_set} }
sub learner       { shift->{learner} }

# Combines several methods in one sub
sub run_experiment {
  my $self = shift;
  $self->scan_features;
  $self->read_training_set;
  $self->train;
  $self->evaluate_test_set;
  print $self->stats_table;
}

sub scan_features {
  my $self = shift;
  return unless $self->knowledge_set->scan_first;
  $self->knowledge_set->scan_features( path => $self->{training_set} );
  $self->knowledge_set->save_features( "$self->{progress_file}-01-features" );
}

sub read_training_set {
  my $self = shift;
  $self->knowledge_set->restore_features( "$self->{progress_file}-01-features" )
    if -e "$self->{progress_file}-01-features";
  $self->knowledge_set->read( path => $self->{training_set} );
  $self->_save_progress( '02', 'knowledge_set' );
  return $self->knowledge_set;
}

sub train {
  my $self = shift;
  $self->_load_progress( '02', 'knowledge_set' );
  $self->learner->train( knowledge_set => $self->{knowledge_set} );
  $self->_save_progress( '03', 'learner' );
  return $self->learner;
}

sub evaluate_test_set {
  my $self = shift;
  $self->_load_progress( '03', 'learner' );
  my $c = $self->create_delayed_object('collection', path => $self->{test_set} );
  $self->{experiment} = $self->learner->categorize_collection( collection => $c );
  $self->_save_progress( '04', 'experiment' );
  return $self->{experiment};
}

sub stats_table {
  my $self = shift;
  $self->_load_progress( '04', 'experiment' );
  return $self->{experiment}->stats_table;
}

sub progress_file {
  shift->{progress_file};
}

sub verbose {
  shift->{verbose};
}

sub _save_progress {
  my ($self, $stage, $node) = @_;
  return unless $self->{progress_file};
  my $file = "$self->{progress_file}-$stage-$node";
  warn "Saving to $file\n" if $self->{verbose};
  $self->{$node}->save_state($file);
}

sub _load_progress {
  my ($self, $stage, $node) = @_;
  return unless $self->{progress_file};
  my $file = "$self->{progress_file}-$stage-$node";
  warn "Loading $file\n" if $self->{verbose};
  $self->{$node} = $self->contained_class($node)->restore_state($file);
}

1;
__END__

=head1 NAME

AI::Categorizer - Automatic Text Categorization

=head1 SYNOPSIS

 use AI::Categorizer;
 my $c = new AI::Categorizer(...parameters...);
 
 # Run a complete experiment - training on a corpus, testing on a test
 # set, printing a summary of results to STDOUT
 $c->run_experiment;
 
 # Or, run the parts of $c->run_experiment separately
 $c->scan_features;
 $c->read_training_set;
 $c->train;
 $c->evaluate_test_set;
 print $c->stats_table;
 
 # After training, use the Learner for categorization
 my $l = $c->learner;
 while (...) {
   my $d = ...create a document...
   my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
   print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
   print "Best category: ", $hypothesis->best_category, "\n";
 }
 
=head1 DESCRIPTION

C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules.  The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically
choose these features), what format the documents are in, and so on.

The basic process of using this module will typically involve
obtaining a collection of B<pre-categorized> documents, creating a
"knowledge set" representation of those documents, training a
categorizer on that knowledge set, and saving the trained categorizer
for later use.  There are several ways to carry out this process.  The
top-level C<AI::Categorizer> module provides an umbrella class for
high-level operations, or you may use the interfaces of the individual
classes in the framework.

A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus, is
distributed as eg/demo.pl .

Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?).  Categorization of documents
is often a difficult task even for humans well-trained in the
particular domain of knowledge, and there are many things a human
would consider that none of these algorithms consider.  These are only
statistical tests - at best they are neat tricks or helpful
assistants, and at worst they are totally unreliable.  If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.

For the usage details, please see the documentation of each individual
module.

=head1 FRAMEWORK COMPONENTS

This section explains the major pieces of the C<AI::Categorizer>
object framework.  We give a conceptual overview, but don't get into
any of the details about interfaces or usage.  See the documentation
for the individual classes for more details.

A diagram of the various classes in the framework can be seen in
C<doc/classes-overview.png>, and a more detailed view of the same
thing can be seen in C<doc/classes.png>.

=head2 Knowledge Sets

A "knowledge set" is defined as a collection of documents, together
with some information on the categories each document belongs to.
Note that this term is somewhat unique to this project - other sources
may call it a "training corpus", or "prior knowledge".  A knowledge
set also contains some information on how documents will be parsed and
how their features (words) will be extracted and turned into
meaningful representations.  In this sense, a knowledge set represents
not only a collection of data, but a particular view on that data.

A knowledge set is encapsulated by the
C<AI::Categorizer::KnowledgeSet> class.  Before you can start playing
with categorizers, you will have to start playing with knowledge sets,
so that the categorizers have some data to train on.  See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.

=head3 Feature selection

Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized.  There are two main reasons for this - first, it
would mean that your training and categorizing processes would take
forever and use tons of memory, and second, the significant stuff of
the documents would get lost in the "noise" of the insignificant stuff.

The process of selecting the most important features in the training
set is called "feature selection".  It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.

=head2 Collections

Because documents may be stored in lots of different formats, a
"collection" class has been created as an abstraction of a stored set
of documents, together with a way to iterate through the set and
return Document objects.  A knowledge set contains a single collection
object.  A C<Categorizer> doing a complete test run generally contains
two collections, one for training and one for testing.  A C<Learner>
can mass-categorize a collection.

The C<AI::Categorizer::Collection> class and its subclasses
instantiate the idea of a collection in this sense.

=head2 Documents

Each document is represented by an C<AI::Categorizer::Document>
object, or an object of one of its subclasses.  Each document class
contains methods for turning a bunch of data into a Feature Vector.
Each document also has a method to report which categories it belongs
to.

=head2 Categories

Each category is represented by an C<AI::Categorizer::Category>
object.  Its main purpose is to keep track of which documents belong
to it, though you can also examine statistical properties of an entire
category, such as obtaining a Feature Vector representing an
amalgamation of all the documents that belong to it.

=head2 Machine Learning Algorithms

There are lots of different ways to make the inductive leap from the
training documents to unseen documents.  The Machine Learning
community has studied many algorithms for this purpose.  To allow
flexibility in choosing and configuring categorization algorithms,
each such algorithm is a subclass of C<AI::Categorizer::Learner>.
There are currently four categorizers included in the distribution:

=over 4

=item AI::Categorizer::Learner::NaiveBayes

A pure-perl implementation of a Naive Bayes classifier.  No
dependencies on external modules or other resources.  Naive Bayes is
usually very fast to train and fast to make categorization decisions,
but isn't always the most accurate categorizer.

=item AI::Categorizer::Learner::SVM

An interface to Corey Spencer's C<Algorithm::SVM>, which implements a
Support Vector Machine classifier.  SVMs can take a while to train
(though in certain conditions there are optimizations to make them
quite fast), but are pretty quick to categorize.  They often have very
good accuracy.

=item AI::Categorizer::Learner::DecisionTree

An interface to C<AI::DecisionTree>, which implements a Decision Tree
classifier.  Decision Trees generally take longer to train than Naive
Bayes or SVM classifiers, but they are also quite fast when
categorizing.  Decision Trees have the advantage that you can
scrutinize the structures of trained decision trees to see how
decisions are being made.

=item AI::Categorizer::Learner::Weka

An interface to version 2 of the Weka Knowledge Analysis system that
lets you use any of the machine learners it defines.  This gives you
access to lots and lots of machine learning algorithms in use by
machine learning researches.  The main drawback is that Weka tends to
be quite slow and use a lot of memory, and the current interface
between Weka and C<AI::Categorizer> is a bit clumsy.

=back

Other machine learning methods that may be implemented soonish include
Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts
combiner for ensemble learning.  No timetable for their creation has
yet been set.

Please see the documentation of these individual modules for more
details on their guts and quirks.  See the C<AI::Categorizer::Learner>
documentation for a description of the general categorizer interface.

If you wish to create your own classifier, you should inherit from
C<AI::Categorizer::Learner> or C<AI::Categorizer::Learner::Boolean>,
which are abstract classes that manage some of the work for you.

=head2 Feature Vectors

Most categorization algorithms don't deal directly with documents'
data, they instead deal with a I<vector representation> of a
document's I<features>.  The features may be any properties of the
document that seem helpful for determining its category, but they are usually
some version of the "most important" words in the document.  A list of
features and their weights in each document is encapsulated by the
C<AI::Categorizer::FeatureVector> class.  You may think of this class
as roughly analogous to a Perl hash, where the keys are the names of
features and the values are their weights.

=head2 Hypotheses

The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of
"statistical guess" of what categories this document should be
assigned to.  Since you may be interested in any of several pieces of
information about the hypothesis (for instance, which categories were
assigned, which category was the single most likely category, the
scores assigned to each category, etc.), the hypothesis is returned as
an object of the C<AI::Categorizer::Hypothesis> class, and you can use
its object methods to get information about the hypothesis.  See its
class documentation for the details.

=head2 Experiments

The C<AI::Categorizer::Experiment> class helps you organize the
results of categorization experiments.  As you get lots of
categorization results (Hypotheses) back from the Learner, you can
feed these results to the Experiment class, along with the correct
answers.  When all results have been collected, you can get a report
on accuracy, precision, recall, F1, and so on, with both
micro-averaging and macro-averaging over categories.  We use the
C<Statistics::Contingency> module from CPAN to manage the
calculations. See the docs for C<AI::Categorizer::Experiment> for more
details.

=head1 METHODS

=over 4

=item new()

Creates a new Categorizer object and returns it.  Accepts lots of
parameters controlling behavior.  In addition to the parameters listed
here, you may pass any parameter accepted by any class that we create
internally (the KnowledgeSet, Learner, Experiment, or Collection
classes), or any class that I<they> create.  This is managed by the
C<Class::Container> module, so see
L<its documentation|Class::Container> for the details of how this
works.

The specific parameters accepted here are:

=over 4

=item progress_file

A string that indicates a place where objects will be saved during
several of the methods of this class.  The default value is the string
C<save>, which means files like C<save-01-knowledge_set> will get
created.  The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.

=item verbose

If true, a few status messages will be printed during execution.

=item training_set

Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.

=item test_set

Specifies the C<path> parameter that will be used when creating a
Collection during the C<evaluate_test_set()> method.

=item data_root

A shortcut for setting the C<training_set>, C<test_set>, and
C<category_file> parameters separately.  Sets C<training_set> to
C<$data_root/training>, C<test_set> to C<$data_root/test>, and
C<category_file> (used by some of the Collection classes) to
C<$data_root/cats.txt>.

=back

=item learner()

Returns the Learner object associated with this Categorizer.  Before
C<train()>, the Learner will of course not be trained yet.

=item knowledge_set()

Returns the KnowledgeSet object associated with this Categorizer.  If
C<read_training_set()> has not yet been called, the KnowledgeSet will
not yet be populated with any training data.

=item run_experiment()

Runs a complete experiment on the training and testing data, reporting
the results on C<STDOUT>.  Internally, this is just a shortcut for
calling the C<scan_features()>, C<read_training_set()>, C<train()>,
and C<evaluate_test_set()> methods, then printing the value of the
C<stats_table()> method.

=item scan_features()

Scans the Collection specified in the C<test_set> parameter to
determine the set of features (words) that will be considered when
training the Learner.  Internally, this calls the C<scan_features()>
method of the KnowledgeSet, then saves a list of the KnowledgeSet's
features for later use.

This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.

=item read_training_set()

Populates the KnowledgeSet with the data specified in the C<test_set>
parameter.  Internally, this calls the C<read()> method of the
KnowledgeSet.  Returns the KnowledgeSet.  Also saves the KnowledgeSet
object for later use.

=item train()

Calls the Learner's C<train()> method, passing it the KnowledgeSet
created during C<read_training_set()>.  Returns the Learner object.
Also saves the Learner object for later use.

=item evaluate_test_set()

Creates a Collection based on the value of the C<test_set> parameter,
and calls the Learner's C<categorize_collection()> method using this
Collection.  Returns the resultant Experiment object.  Also saves the
Experiment object for later use in the C<stats_table()> method.

=item stats_table()

Returns the value of the Experiment's (as created by
C<evaluate_test_set()>) C<stats_table()> method.  This is a string
that shows various statistics about the
accuracy/precision/recall/F1/etc. of the assignments made during
testing.

=back

=head1 HISTORY

This module is a revised and redesigned version of the previous
C<AI::Categorize> module by the same author.  Note the added 'r' in
the new name.  The older module has a different interface, and no
attempt at backward compatibility has been made - that's why I changed
the name.

You can have both C<AI::Categorize> and C<AI::Categorizer> installed
at the same time on the same machine, if you want.  They don't know
about each other or use conflicting namespaces.

=head1 AUTHOR

Ken Williams <ken@mathforum.org>

Discussion about this module can be directed to the perl-AI list at
<perl-ai@perl.org>.  For more info about the list, see
http://lists.perl.org/showlist.cgi?name=perl-ai

=head1 REFERENCES

An excellent introduction to the academic field of Text Categorization
is Fabrizio Sebastiani's "Machine Learning in Automated Text
Categorization": ACM Computing Surveys, Vol. 34, No. 1, March 2002,
pp. 1-47.

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This distribution is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.  These terms apply to
every file in the distribution - if you have questions, please contact
the author.

=cut

lib/AI/Categorizer/Category.pm  view on Meta::CPAN

package AI::Categorizer::Category;

use strict;
use AI::Categorizer::ObjectSet;
use Class::Container;
use base qw(Class::Container);

use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;

__PACKAGE__->valid_params
  (
   name => {type => SCALAR, public => 0},
   documents  => {
		  type => ARRAYREF,
		  default => [],
		  callbacks => { 'all are Document objects' => 
				 sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Document'), @_ },
			       },
		  public => 0,
		 },
  );

__PACKAGE__->contained_objects
  (
   features => {
		class => 'AI::Categorizer::FeatureVector',
		delayed => 1,
	       },
  );

my %REGISTRY = ();

sub new {
  my $self = shift()->SUPER::new(@_);
  $self->{documents} = new AI::Categorizer::ObjectSet( @{$self->{documents}} );
  $REGISTRY{$self->{name}} = $self;
  return $self;
}

sub by_name {
  my ($class, %args) = @_;
  return $REGISTRY{$args{name}} if exists $REGISTRY{$args{name}};
  return $class->new(%args);
}

sub name { $_[0]->{name} }

sub documents {
  my $d = $_[0]->{documents};
  return wantarray ? $d->members : $d->size;
}

sub contains_document {
  return $_[0]->{documents}->includes( $_[1] );
}

sub add_document {
  my $self = shift;
  $self->{documents}->insert( $_[0] );
  delete $self->{features};  # Could be more efficient?
}

sub features {
  my $self = shift;

  if (@_) {
    $self->{features} = shift;
  }
  return $self->{features} if $self->{features};

  my $v = $self->create_delayed_object('features');
  return $self->{features} = $v unless $self->documents;

  foreach my $document ($self->documents) {
    $v->add( $document->features );
  }
  
  return $self->{features} = $v;
}

1;
__END__

=head1 NAME

AI::Categorizer::Category - A named category of documents

=head1 SYNOPSIS

  my $category = AI::Categorizer::Category->by_name("sports");
  my $name = $category->name;
  
  my @docs = $category->documents;
  my $num_docs = $category->documents;
  my $features = $category->features;
  
  $category->add_document($doc);
  if ($category->contains_document($doc)) { ...

=head1 DESCRIPTION

This simple class represents a named category which may contain zero
or more documents.  Each category is a "singleton" by name, so two
Category objects with the same name should not be created at once.

=head1 METHODS

=over 4

=item new()

Creates a new Category object and returns it.  Accepts the following
parameters:

=over 4

=item name

The name of this category

=item documents

A reference to an array of Document objects that should belong to this
category.

=back

=item by_name(name => $string)

Returns the Category object with the given name, or creates one if no
such object exists.

=item documents()

Returns a list of the Document objects in this category in a list
context, or the number of such objects in a scalar context.

=item features()

Returns a FeatureVector object representing the sum of all the
FeatureVectors of the Documents in this Category.

=item add_document($document)

Informs the Category that the given Document belongs to it.

=item contains_document($document)

Returns true if the given document belongs to this category, or false
otherwise.

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3), Storable(3)

=cut

lib/AI/Categorizer/Collection.pm  view on Meta::CPAN

package AI::Categorizer::Collection;
use strict;

use Params::Validate qw(:types);
use Class::Container;
use base qw(Class::Container);
__PACKAGE__->valid_params
  (
   verbose => {type => SCALAR, default => 0},
   stopword_file => { type => SCALAR, optional => 1 },
   category_hash => { type => HASHREF, default => {} },
   category_file => { type => SCALAR, optional => 1 },
  );

__PACKAGE__->contained_objects
  (
   document => { class => 'AI::Categorizer::Document::Text',
		 delayed => 1 },
  );

sub new {
  my ($class, %args) = @_;
  
  # Optimize so every document doesn't have to convert the stopword list to a hash
  if ($args{stopwords} and UNIVERSAL::isa($args{stopwords}, 'ARRAY')) {
    $args{stopwords} = { map {+$_ => 1} @{ $args{stopwords} } };
  }
  
  my $self = $class->SUPER::new(%args);

  if ($self->{category_file}) {
    local *FH;
    open FH, $self->{category_file} or die "Can't open $self->{category_file}: $!";
    while (<FH>) {
      my ($doc, @cats) = split;
      $self->{category_hash}{$doc} = \@cats;
    }
    close FH;
  }
  if (exists $self->{stopword_file}) {
    my %stopwords;
    local *FH;
    open FH, "< $self->{stopword_file}" or die "$self->{stopword_file}: $!";
    while (<FH>) {
      chomp;
      $stopwords{$_} = 1;
    }
    close FH;

    $self->delayed_object_params('document', stopwords => \%stopwords);
  }

  return $self;
}

# This should usually be replaced in subclasses with a faster version that doesn't
# need to create actual documents each time through
sub count_documents {
  my $self = shift;
  return $self->{document_count} if exists $self->{document_count};

  $self->rewind;
  my $count = 0;
  $count++ while $self->next;
  $self->rewind;

  return $self->{document_count} = $count;
}

# Abstract methods
sub next;
sub rewind;

1;
__END__

=head1 NAME

AI::Categorizer::Collection - Access stored documents

=head1 SYNOPSIS

  my $c = new AI::Categorizer::Collection::Files
    (path => '/tmp/docs/training',
     category_file => '/tmp/docs/cats.txt');
  print "Total number of docs: ", $c->count_documents, "\n";
  while (my $document = $c->next) {
    ...
  }
  $c->rewind; # For further operations
  
=head1 DESCRIPTION

This abstract class implements an iterator for accessing documents in
their natively stored format.  You cannot directly create an instance
of the Collection class, because it is abstract - see the
documentation for the C<Files>, C<SingleFile>, or C<InMemory>
subclasses for a concrete interface.

=head1 METHODS

=over 4

=item new()

Creates a new Collection object and returns it.  Accepts the following
parameters:

=over 4

=item category_hash

Indicates a reference to a hash which maps document names to category
names.  The keys of the hash are the document names, each value should
be a reference to an array containing the names of the categories to
which each document belongs.

=item category_file

Indicates a file which should be read in order to create the
C<category_hash>.  Each line of the file should list a document's
name, followed by a list of category names, all separated by
whitespace.

=item stopword_file

Specifies a file containing a list of "stopwords", which are words
that should automatically be disregarded when scanning/reading
documents.  The file should contain one word per line.  The file will
be parsed and then fed as the C<stopwords> parameter to the
Document C<new()> method.

=item verbose

If true, some status/debugging information will be printed to
C<STDOUT> during operation.

=item document_class

The class indicating what type of Document object should be created.
This generally specifies the format that the documents are stored in.
The default is C<AI::Categorizer::Document::Text>.

=back

=item next()

Returns the next Document object in the Collection.

=item rewind()

Resets the iterator for further calls to C<next()>.

=item count_documents()

Returns the total number of documents in the Collection.  Note that
this usually resets the iterator.  This is because it may not be
possible to resume iterating where we left off.

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2002-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3), Storable(3)

=cut

lib/AI/Categorizer/Collection/DBI.pm  view on Meta::CPAN

package AI::Categorizer::Collection::DBI;
use strict;

use DBI;
use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);

use Params::Validate qw(:types);

__PACKAGE__->valid_params
  (
   connection_string => {type => SCALAR, default => undef},
   dbh => {isa => 'DBI::db', default => undef},
   select_statement => {type => SCALAR, default => "SELECT text FROM documents"},
  );

__PACKAGE__->contained_objects
  (
   document => { class => 'AI::Categorizer::Document',
		 delayed => 1 },
  );

sub new {
  my $class = shift;
  my $self = $class->SUPER::new(@_);
  
  die "Must provide 'dbh' or 'connection_string' arguments"
    unless $self->{dbh} or $self->{connection_string};
  
  unless ($self->{dbh}) {
    $self->{dbh} = DBI->connect($self->{connection_string}, '', '', {RaiseError => 1})
      or die DBI->errstr;
    delete $self->{connection_string};
  }
  
  $self->rewind;
  return $self;
}

sub dbh { shift()->{dbh} }

sub rewind {
  my $self = shift;
  
  if (!$self->{sth}) {
    $self->{sth} = $self->dbh->prepare($self->{select_statement});
  }

  if ($self->{sth}{Active}) {
    $self->{sth}->finish;
  }

  $self->{sth}->execute;
}

sub next {
  my $self = shift;

  my @result = $self->{sth}->fetchrow_array;
  return undef unless @result;
  
  return $self->create_delayed_object('document',
				      name => $result[0],
				      categories => [$result[1]],
				      content => $result[2],
				     );
}

1;

lib/AI/Categorizer/Collection/Files.pm  view on Meta::CPAN

package AI::Categorizer::Collection::Files;
use strict;

use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);

use Params::Validate qw(:types);
use File::Spec;

__PACKAGE__->valid_params
  (
   path => { type => SCALAR|ARRAYREF },
   recurse => { type => BOOLEAN, default => 0 },
  );

sub new {
  my $class = shift;
  my $self = $class->SUPER::new(@_);
  
  $self->{dir_fh} = do {local *FH; *FH};  # double *FH avoids a warning

  # Documents are contained in a directory, or list of directories
  $self->{path} = [$self->{path}] unless ref $self->{path};
  $self->{used} = [];

  $self->_next_path;
  return $self;
}

sub _next_path {
  my $self = shift;
  closedir $self->{dir_fh} if $self->{cur_dir};

  $self->{cur_dir} = shift @{$self->{path}};
  push @{$self->{used}}, $self->{cur_dir};
  opendir $self->{dir_fh}, $self->{cur_dir} or die "$self->{cur_dir}: $!";
}

sub next {
  my $self = shift;
  my $file = $self->_read_file;
  return unless defined $file;
  
  warn "No category information about '$file'" unless defined $self->{category_hash}{$file};
  my @cats = map AI::Categorizer::Category->by_name(name => $_), @{ $self->{category_hash}{$file} || [] };

  return $self->call_method('document', 'read', 
			    path => File::Spec->catfile($self->{cur_dir}, $file),
			    name => $file,
			    categories => \@cats,
			   );
}

sub _read_file {
  my ($self) = @_;
  
  my $file = readdir $self->{dir_fh};

  if (!defined $file) { # Directory has been exhausted
    return undef unless @{$self->{path}};
    $self->_next_path;
    return $self->_read_file;
  } elsif ($file eq '.' or $file eq '..') {
    return $self->_read_file;
  } elsif (-d (my $path = File::Spec->catdir($self->{cur_dir}, $file))) {
    push @{$self->{path}}, $path  # Add for later processing
      if $self->{recurse} and !grep {$_ eq $path} @{$self->{path}}, @{$self->{used}};
    return $self->_read_file;
  }
  return $file;
}

sub rewind {
  my $self = shift;
  push @{$self->{path}}, @{$self->{used}};
  @{$self->{used}} = ();
  $self->_next_path;
}

# This should share an iterator with next()
sub count_documents {
    my $self = shift;
    return $self->{document_count} if defined $self->{document_count};
    
    $self->rewind;
    
    my $count = 0;
    $count++ while defined $self->_read_file;

    $self->rewind;
    return $self->{document_count} = $count;
}

1;
__END__

=head1 NAME

AI::Categorizer::Collection::Files - One document per file

=head1 SYNOPSIS

  my $c = new AI::Categorizer::Collection::Files
    (path => '/tmp/docs/training',
     category_file => '/tmp/docs/cats.txt');
  print "Total number of docs: ", $c->count_documents, "\n";
  while (my $document = $c->next) {
    ...
  }
  $c->rewind; # For further operations
  
=head1 DESCRIPTION

This implements a Collection class in which each document exists as a
single file on a filesystem.  The documents can exist in a single
directory, or in several directories.

=head1 METHODS

This is a subclass of the abstract AI::Categorizer::Collection class,
so any methods mentioned in its documentation are available here.

=over 4

=item new()

Creates a new Collection object and returns it.  In addition to the
parameters accepted by the superclass, the following parameters are
accepted:

=over 4

=item path

Indicates a location on disk where the documents can be found.  The
path may be specified as a string giving the name of a directory, or
as a reference to an array of such strings if the documents are
located in more than one directory.

=item recurse

Indicates whether subdirectories of the directory (or directories) in
the C<path> parameter should be descended into.  If set to a true
value, they will be descended into.  If false, they will be ignored.
The default is false.

=back

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2002-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer::Collection(3)

=cut

lib/AI/Categorizer/Collection/InMemory.pm  view on Meta::CPAN

package AI::Categorizer::Collection::InMemory;
use strict;

use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);

use Params::Validate qw(:types);

__PACKAGE__->valid_params
  (
   data => { type => HASHREF },
  );

sub new {
  my $self = shift()->SUPER::new(@_);
  
  while (my ($name, $params) = each %{$self->{data}}) {
    foreach (@{$params->{categories}}) {
      next if ref $_;
      $_ = AI::Categorizer::Category->by_name(name => $_);
    }
  }

  return $self;
}

sub next {
  my $self = shift;
  my ($name, $params) = each %{$self->{data}} or return;
  return AI::Categorizer::Document->new(name => $name, %$params);
}

sub rewind {
  my $self = shift;
  scalar keys %{$self->{data}};
  return;
}

sub count_documents {
  my $self = shift;
  return scalar keys %{$self->{data}};
}

1;

lib/AI/Categorizer/Collection/SingleFile.pm  view on Meta::CPAN

package AI::Categorizer::Collection::SingleFile;
use strict;

use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);

use Params::Validate qw(:types);

__PACKAGE__->valid_params
  (
   path => { type => SCALAR|ARRAYREF },
   categories => { type => HASHREF|UNDEF, default => undef },
   delimiter => { type => SCALAR },
  );

__PACKAGE__->contained_objects
  (
   document => { class => 'AI::Categorizer::Document::Text',
		 delayed => 1 },
  );

sub new {
  my $class = shift;
  my $self = $class->SUPER::new(@_);
  
  $self->{fh} = do {local *FH; *FH};  # double *FH avoids a warning

  # Documents are contained in a file, or list of files
  $self->{path} = [$self->{path}] unless ref $self->{path};
  $self->{used} = [];

  $self->_next_path;
  return $self;
}

sub _next_path {
  my $self = shift;
  close $self->{fh} if $self->{cur_file};

  push @{$self->{used}}, shift @{$self->{path}};
  $self->{cur_file} = $self->{used}[-1];
  open $self->{fh}, "< $self->{cur_file}" or die "$self->{cur_file}: $!";
}

sub next {
  my $self = shift;

  my $fh = $self->{fh}; # Must put in a simple scalar
  my $content = do {local $/ = $self->{delimiter}; <$fh>};

  if (!defined $content) { # File has been exhausted
    unless (@{$self->{path}}) { # All files have been exhausted
      $self->{fh} = undef;
      return undef;
    }
    $self->_next_path;
    return $self->next;
  } elsif ($content =~ /^\s*$self->{delimiter}$/) { # Skip empty docs
    return $self->next;
  }
#warn "doc is $content";
#warn "creating document=>@{[ %{$self->{container}{delayed}{document}} ]}";

  return $self->create_delayed_object('document', content => $content);
}

sub count_documents {
  my ($self) = @_;
  return $self->{document_count} if defined $self->{document_count};
  
  $self->rewind;

  my $count = 0;
  local $/ = $self->{delimiter};
  my $fh = $self->{fh};
  while (1) {
    $count++ while <$fh>;
    last unless @{$self->{path}};
    $self->_next_path;
  }
  
  $self->rewind;

  return $self->{document_count} = $count;
}

sub rewind {
  my ($self) = @_;

  close $self->{fh} if $self->{cur_file};
  unshift @{$self->{path}}, @{$self->{used}};
  $self->{used} = [];
  $self->_next_path;
}

1;

lib/AI/Categorizer/Document.pm  view on Meta::CPAN

package AI::Categorizer::Document;

use strict;
use Class::Container;
use base qw(Class::Container);

use Params::Validate qw(:types);
use AI::Categorizer::ObjectSet;
use AI::Categorizer::FeatureVector;

__PACKAGE__->valid_params
  (
   name       => {
		  type => SCALAR, 
		 },
   categories => {
		  type => ARRAYREF,
		  default => [],
		  callbacks => { 'all are Category objects' => 
				 sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Category'), @{$_[0]} },
			       },
		  public => 0,
		 },
   stopwords => {
		 type => ARRAYREF|HASHREF,
		 default => {},
		},
   content   => {
		 type => HASHREF|SCALAR,
		 default => undef,
		},
   parse => {
	     type => SCALAR,
	     optional => 1,
	    },
   parse_handle => {
		    type => HANDLE,
		    optional => 1,
		   },
   features => {
		isa => 'AI::Categorizer::FeatureVector',
		optional => 1,
	       },
   content_weights => {
		       type => HASHREF,
		       default => {},
		      },
   front_bias => {
		  type => SCALAR,
		  default => 0,
		  },
   use_features => {
		    type => HASHREF|UNDEF,
		    default => undef,
		   },
   stemming => {
		type => SCALAR|UNDEF,
		optional => 1,
	       },
   stopword_behavior => {
			 type => SCALAR,
			 default => "stem",
			},
  );

__PACKAGE__->contained_objects
  (
   features => { delayed => 1,
		 class => 'AI::Categorizer::FeatureVector' },
  );

### Constructors

my $NAME = 'a';

sub new {
  my $pkg = shift;
  my $self = $pkg->SUPER::new(name => $NAME++,  # Use a default name
			      @_);

  # Get efficient internal data structures
  $self->{categories} = new AI::Categorizer::ObjectSet( @{$self->{categories}} );

  $self->_fix_stopwords;
  
  # A few different ways for the caller to initialize the content
  if (exists $self->{parse}) {
    $self->parse(content => delete $self->{parse});
    
  } elsif (exists $self->{parse_handle}) {
    $self->parse_handle(handle => delete $self->{parse_handle});
    
  } elsif (defined $self->{content}) {
    # Allow a simple string as the content
    $self->{content} = { body => $self->{content} } unless ref $self->{content};
  }
  
  $self->finish if $self->{content};
  return $self;
}

sub _fix_stopwords {
  my $self = shift;
  
  # Convert to hash
  $self->{stopwords} = { map {($_ => 1)} @{ $self->{stopwords} } }
    if UNIVERSAL::isa($self->{stopwords}, 'ARRAY');
  
  my $s = $self->{stopwords};

  # May need to perform stemming on the stopwords
  return unless keys %$s; # No point in doing anything if there are no stopwords
  return unless $self->{stopword_behavior} eq 'stem';
  return if !defined($self->{stemming}) or $self->{stemming} eq 'none';
  return if $s->{___stemmed};
  
  my @keys = keys %$s;
  %$s = ();
  $self->stem_words(\@keys);
  $s->{$_} = 1 foreach @keys;
  
  # This flag is attached to the stopword structure itself so that
  # other documents will notice it.
  $s->{___stemmed} = 1;
}

sub finish {
  my $self = shift;
  $self->create_feature_vector;
  
  # Now we're done with all the content stuff
  delete @{$self}{'content', 'content_weights', 'stopwords', 'use_features'};
}


# Parse a document format - a virtual method
sub parse;

sub parse_handle {
  my ($self, %args) = @_;
  my $fh = $args{handle} or die "No 'handle' argument given to parse_handle()";
  return $self->parse( content => join '', <$fh> );
}

### Accessors

sub name { $_[0]->{name} }
sub stopword_behavior { $_[0]->{stopword_behavior} }

sub features {
  my $self = shift;
  if (@_) {
    $self->{features} = shift;
  }
  return $self->{features};
}

sub categories {
  my $c = $_[0]->{categories};
  return wantarray ? $c->members : $c->size;
}


### Workers

sub create_feature_vector {
  my $self = shift;
  my $content = $self->{content};
  my $weights = $self->{content_weights};

  die "'stopword_behavior' must be one of 'stem', 'no_stem', or 'pre_stemmed'"
    unless $self->{stopword_behavior} =~ /^stem|no_stem|pre_stemmed$/;

  $self->{features} = $self->create_delayed_object('features');
  while (my ($name, $data) = each %$content) {
    my $t = $self->tokenize($data);
    $t = $self->_filter_tokens($t) if $self->{stopword_behavior} eq 'no_stem';
    $self->stem_words($t);
    $t = $self->_filter_tokens($t) if $self->{stopword_behavior} =~ /^stem|pre_stemmed$/;
    my $h = $self->vectorize(tokens => $t, weight => exists($weights->{$name}) ? $weights->{$name} : 1 );
    $self->{features}->add($h);
  }
}

sub is_in_category {
  return (ref $_[1]
	  ? $_[0]->{categories}->includes( $_[1] )
	  : $_[0]->{categories}->includes_name( $_[1] ));
    
}

sub tokenize {
  my $self = shift;
  my @tokens;
  while ($_[0] =~ /([-\w]+)/g) {
    my $word = lc $1;
    next unless $word =~ /[a-z]/;
    $word =~ s/^[^a-z]+//;  # Trim leading non-alpha characters (helps with ordinals)
    push @tokens, $word;
  }
  return \@tokens;
}

sub stem_words {
  my ($self, $tokens) = @_;
  return unless $self->{stemming};
  return if $self->{stemming} eq 'none';
  die "Unknown stemming option '$self->{stemming}' - options are 'porter' or 'none'"
    unless $self->{stemming} eq 'porter';
  
  eval {require Lingua::Stem; 1}
    or die "Porter stemming requires the Lingua::Stem module, available from CPAN.\n";

  @$tokens = @{ Lingua::Stem::stem(@$tokens) };
}

sub _filter_tokens {
  my ($self, $tokens_in) = @_;

  if ($self->{use_features}) {
    my $f = $self->{use_features}->as_hash;
    return [ grep  exists($f->{$_}), @$tokens_in ];
  } elsif ($self->{stopwords} and keys %{$self->{stopwords}}) {
    my $s = $self->{stopwords};
    return [ grep !exists($s->{$_}), @$tokens_in ];
  }
  return $tokens_in;
}

sub _weigh_tokens {
  my ($self, $tokens, $weight) = @_;

  my %counts;
  if (my $b = 0+$self->{front_bias}) {
    die "'front_bias' value must be between -1 and 1"
      unless -1 < $b and $b < 1;
    
    my $n = @$tokens;
    my $r = ($b-1)**2 / ($b+1);
    my $mult = $weight * log($r)/($r-1);
    
    my $i = 0;
    foreach my $feature (@$tokens) {
      $counts{$feature} += $mult * $r**($i/$n);
      $i++;
    }
    
  } else {
    foreach my $feature (@$tokens) {
      $counts{$feature} += $weight;
    }
  }

  return \%counts;
}

sub vectorize {
  my ($self, %args) = @_;
  if ($self->{stem_stopwords}) {
    my $s = $self->stem_tokens([keys %{$self->{stopwords}}]);
    $self->{stopwords} = { map {+$_, 1} @$s };
    $args{tokens} = $self->_filter_tokens($args{tokens});
  }
  return $self->_weigh_tokens($args{tokens}, $args{weight});
}

sub read {
  my ($class, %args) = @_;
  my $path = delete $args{path} or die "Must specify 'path' argument to read()";
  
  my $self = $class->new(%args);
  
  open my($fh), "< $path" or die "$path: $!";
  $self->parse_handle(handle => $fh);
  close $fh;
  
  $self->finish;
  return $self;
}

sub dump_features {
  my ($self, %args) = @_;
  my $path = $args{path} or die "No 'path' argument given to dump_features()";
  open my($fh), "> $path" or die "Can't create $path: $!";
  my $f = $self->features->as_hash;
  while (my ($k, $v) = each %$f) {
    print $fh "$k\t$v\n";
  }
}

1;

__END__

=head1 NAME

AI::Categorizer::Document - Embodies a document

=head1 SYNOPSIS

 use AI::Categorizer::Document;
 
 # Simplest way to create a document:
 my $d = new AI::Categorizer::Document(name => $string,
                                       content => $string);
 
 # Other parameters are accepted:
 my $d = new AI::Categorizer::Document(name => $string,
                                       categories => \@category_objects,
                                       content => { subject => $string,
                                                    body => $string2, ... },
                                       content_weights => { subject => 3,
                                                            body => 1, ... },
                                       stopwords => \%skip_these_words,
                                       stemming => $string,
                                       front_bias => $float,
                                       use_features => $feature_vector,
                                      );
 
 # Specify explicit feature vector:
 my $d = new AI::Categorizer::Document(name => $string);
 $d->features( $feature_vector );
 
 # Now pass the document to a categorization algorithm:
 my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
 my $hypothesis = $learner->categorize($document);

=head1 DESCRIPTION

The Document class embodies the data in a single document, and
contains methods for turning this data into a FeatureVector.  Usually
documents are plain text, but subclasses of the Document class may
handle any kind of data.

=head1 METHODS

=over 4

=item new(%parameters)

Creates a new Document object.  Document objects are used during
training (for the training documents), testing (for the test
documents), and when categorizing new unseen documents in an
application (for the unseen documents).  However, you'll typically
only call C<new()> in the latter case, since the KnowledgeSet or
Collection classes will create Document objects for you in the former
cases.

The C<new()> method accepts the following parameters:

=over 4

=item name

A string that identifies this document.  Required.

=item content

The raw content of this document.  May be specified as either a string
or as a hash reference, allowing structured document types.

=item content_weights

A hash reference indicating the weights that should be assigned to
features in different sections of a structured document when creating
its feature vector.  The weight is a multiplier of the feature vector
values.  For instance, if a C<subject> section has a weight of 3 and a
C<body> section has a weight of 1, and word counts are used as feature
vector values, then it will be as if all words appearing in the
C<subject> appeared 3 times.

If no weights are specified, all weights are set to 1.

=item front_bias

Allows smooth bias of the weights of words in a document according to
their position.  The value should be a number between -1 and 1.
Positive numbers indicate that words toward the beginning of the
document should have higher weight than words toward the end of the
document.  Negative numbers indicate the opposite.  A bias of 0
indicates that no biasing should be done.

=item categories

A reference to an array of Category objects that this document belongs
to.  Optional.

=item stopwords

A list/hash of features (words) that should be ignored when parsing
document content.  A hash reference is preferred, with the features as
the keys.  If you pass an array reference containing the features, it
will be converted to a hash reference internally.

=item use_features

A Feature Vector specifying the only features that should be
considered when parsing this document.  This is an alternative to
using C<stopwords>.

=item stemming

Indicates the linguistic procedure that should be used to convert
tokens in the document to features.  Possible values are C<none>,
which indicates that the tokens should be used without change, or
C<porter>, indicating that the Porter stemming algorithm should be
applied to each token.  This requires the C<Lingua::Stem> module from
CPAN.

=item stopword_behavior

There are a few ways you might want the stopword list (specified with
the C<stopwords> parameter) to interact with the stemming algorithm
(specified with the C<stemming> parameter).  These options can be
controlled with the C<stopword_behavior> parameter, which can take the
following values:

=over 4

=item no_stem

Match stopwords against non-stemmed document words.  

=item stem

Stem stopwords according to 'stemming' parameter, then match them
against stemmed document words.

=item pre_stemmed

Stopwords are already stemmed, match them against stemmed document
words.

=back

The default value is C<stem>, which seems to produce the best results
in most cases I've tried.  I'm not aware of any studies comparing the
C<no_stem> behavior to the C<stem> behavior in the general case.

This parameter has no effect if there are no stopwords being used, or
if stemming is not being used.  In the latter case, the list of
stopwords will always be matched as-is against the document words.

Note that if the C<stem> option is used, the data structure passed as
the C<stopwords> parameter will be modified in-place to contain the
stemmed versions of the stopwords supplied.

=back

=item read( path =E<gt> $path )

An alternative constructor method which reads a file on disk and
returns a document with that file's contents.

=item parse( content =E<gt> $content )



=item name()

Returns this document's C<name> property as specified when the
document was created.

=item features()

Returns the Feature Vector associated with this document.

=item categories()

In a list context, returns a list of Category objects to which this
document belongs.  In a scalar context, returns the number of such
categories.

=item create_feature_vector()

Creates this document's Feature Vector by parsing its content.  You
won't call this method directly, it's called by C<new()>.

=back



=head1 AUTHOR

Ken Williams <ken@mathforum.org>

=head1 COPYRIGHT

This distribution is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.  These terms apply to
every file in the distribution - if you have questions, please contact
the author.

=cut

lib/AI/Categorizer/Document/SMART.pm  view on Meta::CPAN

package AI::Categorizer::Document::SMART;

use strict;
use AI::Categorizer::Document;
use base qw(AI::Categorizer::Document);

sub parse {
  my ($self, %args) = @_;
  
  $args{content} =~ s{
		   ^(?:\.I)?\s+(\d+)\n  # ID number - becomes document name
		   \.C\n
		   ([^\n]+)\n     # Categories
		   \.T\n
		   (.+)\n+        # Title
		   \.W\n
		  }
                  {}sx
     
     or die "Malformed record: $args{content}";
  
  my ($id, $categories, $title) = ($1, $2, $3);

  $self->{name} = $id;
  $self->{content} = { title => $title,
		       body  => $args{content} };

  my @categories = $categories =~ m/(.*?)\s+\d+[\s;]*/g;
  @categories = map AI::Categorizer::Category->by_name(name => $_), @categories;
  $self->{categories} = \@categories;
}

1;

lib/AI/Categorizer/Document/Text.pm  view on Meta::CPAN

package AI::Categorizer::Document::Text;

use strict;
use AI::Categorizer::Document;
use base qw(AI::Categorizer::Document);

#use Params::Validate qw(:types);
#use AI::Categorizer::ObjectSet;
#use AI::Categorizer::FeatureVector;

### Constructors

sub parse {
  my ($self, %args) = @_;
  $self->{content} = { body => $args{content} };
}

1;

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

package AI::Categorizer::Document::XML;

use strict;
use AI::Categorizer::Document;
use base qw(AI::Categorizer::Document);
use XML::SAX;

__PACKAGE__->contained_objects
  (
   xml_handler => 'AI::Categorizer::Document::XML::Handler',
  );

### Constructors

sub parse {
  my ($self, %args) = @_;

  # it is a string which contains the content of XML
  my $body= $args{content};			

  # it is a hash which includes a pair of <elementName, weight>
  my $elementWeight= $args{elementWeight};	

  # construct Handler which receive event of element, data, comment, processing_instruction
  # And convert their values into a sequence  of string and save it into buffer
  my $xmlHandler = $self->create_contained_object('xml_handler', weights => $elementWeight);

  # construct parser
  my $xmlParser= XML::SAX::ParserFactory->parser(Handler => $xmlHandler);

  # let's start parsing XML, where the methids of Handler will be called
  $xmlParser->parse_string($body);

  # extract the converted string from Handler
  $body= $xmlHandler->getContent;

  # Now, construct Document Object and return it
  return { body => $body };
}

##########################################################################
package AI::Categorizer::Document::XML::Handler;
use strict;
use base qw(XML::SAX::Base);

# Input: a hash which is weights of elements
# Output: object of this class
# Description: this is constructor
sub new{
  my ($class, %args) = @_;

  # call super class such as XML::SAX::Base
  my $self = $class->SUPER::new;

  # save weights of elements which is a hash for pairs <elementName, weight>
  # weight is times duplication of corresponding element
  # It is provided by caller(one of parameters) at construction, and
  # we must save it in order to use doing duplication at end_element
  $self->{weightHash} = $args{weights};

  # It is storage to store the data produced by Text, CDataSection and etc.
  $self->{content} = '';

  # This array is used to store the data for every element from root to the current visiting element.
  # Thus, data of 0~($levelPointer-1)th in the array is only valid.
  # The array which store the starting location(index) of the content for an element, 
  # From it, we can know all the data produced by an element at the end_element
  # It is needed at the duplication of the data produced by the specific element
  $self->{locationArray} = [];

  return $self;
}
	
# Input: None
# Output: None
# Description:
# 	it is called whenever the parser meets the document
# 	it will be called at once
#	Currently, confirm if the content buffer is an empty
sub start_document{
  my ($self, $doc)= @_;

  # The level(depth) of the last called element in XML tree
  # Calling of start_element is the preorder of the tree traversal.
  # The level is the level of current visiting element in tree.
  # the first element is 0-level
  $self->{levelPointer} = 0;

  # all data will be saved into here, initially, it is an empty
  $self->{content} = "";

  #$self->SUPER::start_document($doc);
}

# Input: None
# Output: None
# Description:
# 	it is called whenever the parser ends the document
# 	it will be called at once
#	Nothing to do
sub end_document{
  my ($self, $doc)= @_;

  #$self->SUPER::end_document($doc);
}

# Input
#	LocalName: 	$el->{LocalName}
#	NamespaceURI: 	$el->{NamespaceURI}
#	Name		$el->{Name}
#	Prefix		$el->{Prefix}
#	Attributes	$el->{Attributes}
#	for each attribute
#		LocalName: 	$el->{LocalName}
#		NamespaceURI: 	$el->{NamespaceURI}
#		Name		$el->{Name}
#		Prefix		$el->{Prefix}
#		Value		$el->{Value}
# Output: None
# Description:
# 	it is called whenever the parser meets the element
sub start_element{
  my ($self, $el)= @_;

  # find the last location of the content
  # its meaning is to append the new data at this location
  my $location= length $self->{content};

  # save the last location of the current content
  # so that at end_element the starting location of data of this element can be known
  $self->{locationArray}[$self->{levelPointer}] = $location;

  # for the next element, increase levelPointer
  $self->{levelPointer}++;

  #$self->SUPER::start_document($el);
}

# Input: None
# Output: None
# Description:
# 	it is called whenever the parser ends the element
sub end_element{
  my ($self, $el)= @_;

  $self->{levelPointer}--;
  my $location= $self->{locationArray}[$self->{levelPointer}];

  # find the name of element
  my $elementName= $el->{Name};

  # set the default weight
  my $weight= 1;

  # check if user give the weight to duplicate data
  $weight= $self->{weightHash}{$elementName} if exists $self->{weightHash}{$elementName};

  # 0 - remove all the data to be related to this element
  if($weight == 0){
    $self->{content} = substr($self->{content}, 0, $location);
    return;
  }

  # 1 - dont duplicate
  if($weight == 1){
    return;
  }
  
  # n - duplicate data by n times
  # get new content
  my $newContent= substr($self->{content}, $location);

  # start to copy
  for(my $i=1; $i<$weight;$i++){
    $self->{content} .= $newContent;
  }

  #$self->SUPER::end_document($el);
}

# Input: a hash which consists of pair <Data, Value>
# Output: None
# Description:
# 	it is called whenever the parser meets the text which comes from Text, CDataSection and etc
#	Value must be saved into content buffer.
sub characters{
  my ($self, $args)= @_;

  # save "data plus new line" into content
  $self->{content} .= "$args->{Data}\n";
}
	
# Input: a hash which consists of pair <Data, Value>
# Output: None
# Description:
# 	it is called whenever the parser meets the comment
#	Currently, it will be ignored
sub comment{
  my ($self, $args)= @_;
}

# Input: a hash which consists of pair <Data, Value> and <Target, Value>
# Output: None
# Description:
# 	it is called whenever the parser meets the processing_instructing
#	Currently, it will be ignored
sub processing_instruction{
  my ($self, $args)= @_;
}

# Input: None
# Output: the converted data, that is, content
# Description:
# 	return the content
sub getContent{
  my ($self)= @_;
  return $self->{content};
}

1;
__END__

lib/AI/Categorizer/Experiment.pm  view on Meta::CPAN

package AI::Categorizer::Experiment;

use strict;
use Class::Container;
use AI::Categorizer::Storable;
use Statistics::Contingency;

use base qw(Class::Container AI::Categorizer::Storable Statistics::Contingency);

use Params::Validate qw(:types);
__PACKAGE__->valid_params
  (
   categories => { type => ARRAYREF|HASHREF },
   sig_figs   => { type => SCALAR, default => 4 },
  );

sub new {
  my $package = shift;
  my $self = $package->Class::Container::new(@_);
  
  $self->{$_} = 0 foreach qw(a b c d);
  my $c = delete $self->{categories};
  $self->{categories} = { map {($_ => {a=>0, b=>0, c=>0, d=>0})} 
			  UNIVERSAL::isa($c, 'HASH') ? keys(%$c) : @$c
			};
  return $self;
}

sub add_hypothesis {
  my ($self, $h, $correct, $name) = @_;
  die "No hypothesis given to add_hypothesis()" unless $h;
  $name = $h->document_name unless defined $name;
  
  $self->add_result([$h->categories], $correct, $name);
}

sub stats_table {
  my $self = shift;
  $self->SUPER::stats_table($self->{sig_figs});
}

1;

__END__

=head1 NAME

AI::Categorizer::Experiment - Coordinate experimental results

=head1 SYNOPSIS

 use AI::Categorizer::Experiment;
 my $e = new AI::Categorizer::Experiment(categories => \%categories);
 my $l = AI::Categorizer::Learner->restore_state(...path...);
 
 while (my $d = ... get document ...) {
   my $h = $l->categorize($d); # A Hypothesis
   $e->add_hypothesis($h, [map $_->name, $d->categories]);
 }
 
 print "Micro F1: ", $e->micro_F1, "\n"; # Access a single statistic
 print $e->stats_table; # Show several stats in table form

=head1 DESCRIPTION

The C<AI::Categorizer::Experiment> class helps you organize the
results of categorization experiments.  As you get lots of
categorization results (Hypotheses) back from the Learner, you can
feed these results to the Experiment class, along with the correct
answers.  When all results have been collected, you can get a report
on accuracy, precision, recall, F1, and so on, with both
macro-averaging and micro-averaging over categories.

=head1 METHODS

The general execution flow when using this class is to create an
Experiment object, add a bunch of Hypotheses to it, and then report on
the results.

Internally, C<AI::Categorizer::Experiment> inherits from the
C<Statistics::Contingency>.  Please see the documentation of
C<Statistics::Contingency> for a description of its interface.  All of
its methods are available here, with the following additions:

=over 4

=item new( categories => \%categories )

=item new( categories => \@categories, verbose => 1, sig_figs => 2 )

Returns a new Experiment object.  A required C<categories> parameter
specifies the names of all categories in the data set.  The category
names may be specified either the keys in a reference to a hash, or as
the entries in a reference to an array.

The C<new()> method accepts a C<verbose> parameter which
will cause some status/debugging information to be printed to
C<STDOUT> when C<verbose> is set to a true value.

A C<sig_figs> indicates the number of significant figures that should
be used when showing the results in the C<results_table()> method.  It
does not affect the other methods like C<micro_precision()>.

=item add_result($assigned, $correct, $name)

Adds a new result to the experiment.  Please see the
C<Statistics::Contingency> documentation for a description of this
method.

=item add_hypothesis($hypothesis, $correct_categories)

Adds a new result to the experiment.  The first argument is a
C<AI::Categorizer::Hypothesis> object such as one generated by a
Learner's C<categorize()> method.  The list of correct categories can
be given as an array of category names (strings), as a hash whose keys
are the category names and whose values are anything logically true,
or as a single string if there is only one category.  For example, all
of the following are legal:

 $e->add_hypothesis($h, "sports");
 $e->add_hypothesis($h, ["sports", "finance"]);
 $e->add_hypothesis($h, {sports => 1, finance => 1});

=back

=head1 AUTHOR

Ken Williams <ken@mathforum.org>

=head1 COPYRIGHT

This distribution is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.  These terms apply to
every file in the distribution - if you have questions, please contact
the author.

=cut

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN

package AI::Categorizer::FeatureSelector;

use strict;
use Class::Container;
use base qw(Class::Container);

use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Util;
use Carp qw(croak);

__PACKAGE__->valid_params
  (
   features_kept => {
		     type => SCALAR,
		     default => 0.2,
		    },
   verbose => {
	       type => SCALAR,
	       default => 0,
	      },
  );

sub verbose {
  my $self = shift;
  $self->{verbose} = shift if @_;
  return $self->{verbose};
}

sub reduce_features {
  # Takes a feature vector whose weights are "feature scores", and
  # chops to the highest n features.  n is specified by the
  # 'features_kept' parameter.  If it's zero, all features are kept.
  # If it's between 0 and 1, we multiply by the present number of
  # features.  If it's greater than 1, we treat it as the number of
  # features to use.

  my ($self, $f, %args) = @_;
  my $kept = defined $args{features_kept} ? $args{features_kept} : $self->{features_kept};
  return $f unless $kept;

  my $num_kept = ($kept < 1 ? 
		  $f->length * $kept :
		  $kept);

  print "Trimming features - # features = " . $f->length . "\n" if $self->verbose;
  
  # This is algorithmic overkill, but the sort seems fast enough.  Will revisit later.
  my $features = $f->as_hash;
  my @new_features = (sort {$features->{$b} <=> $features->{$a}} keys %$features)
                      [0 .. $num_kept-1];

  my $result = $f->intersection( \@new_features );
  print "Finished trimming features - # features = " . $result->length . "\n" if $self->verbose;
  return $result;
}

# Abstract methods
sub rank_features;
sub scan_features;

sub select_features {
  my ($self, %args) = @_;
  
  die "No knowledge_set parameter provided to select_features()"
    unless $args{knowledge_set};

  my $f = $self->rank_features( knowledge_set => $args{knowledge_set} );
  return $self->reduce_features( $f, features_kept => $args{features_kept} );
}


1;

__END__

=head1 NAME

AI::Categorizer::FeatureSelector - Abstract Feature Selection class

=head1 SYNOPSIS

 ...

=head1 DESCRIPTION

The KnowledgeSet class that provides an interface to a set of
documents, a set of categories, and a mapping between the two.  Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.

=head1 METHODS

=over 4

=item new()

Creates a new KnowledgeSet and returns it.  Accepts the following
parameters:

=over 4

=item load

If a C<load> parameter is present, the C<load()> method will be
invoked immediately.  If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>.  If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.

=item categories

An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet.  If used, the
C<documents> parameter should also be specified.

=item documents

An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet.  If used, the
C<categories> parameter should also be specified.

=item features_kept

A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents.  May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used.  The default is
0.2.

=item feature_selection

A string indicating the type of feature selection that should be
performed.  Currently the only option is also the default option:
C<document_frequency>.

=item tfidf_weighting

Specifies how document word counts should be converted to vector
values.  Uses the three-character specification strings from Salton &
Buckley's paper "Term-weighting approaches in automatic text
retrieval".  The three characters indicate the three factors that will
be multiplied for each feature to find the final vector value for that
feature.  The default weighting is C<xxx>.

The first character specifies the "term frequency" component, which
can take the following values:

=over 4

=item b

Binary weighting - 1 for terms present in a document, 0 for terms absent.

=item t

Raw term frequency - equal to the number of times a feature occurs in
the document.

=item x

A synonym for 't'.

=item n

Normalized term frequency - 0.5 + 0.5 * t/max(t).  This is the same as
the 't' specification, but with term frequency normalized to lie
between 0.5 and 1.

=back

The second character specifies the "collection frequency" component, which
can take the following values:

=over 4

=item f

Inverse document frequency - multiply term C<t>'s value by C<log(N/n)>,
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.

=item p

Probabilistic inverse document frequency - multiply term C<t>'s value
by C<log((N-n)/n)> (same variable meanings as above).

=item x

No change - multiply by 1.

=back


The third character specifies the "normalization" component, which
can take the following values:

=over 4

=item c

Apply cosine normalization - multiply by 1/length(document_vector).

=item x

No change - multiply by 1.

=back

The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.

=item verbose

If set to a true value, some status/debugging information will be
output on C<STDOUT>.

=back


=item categories()

In a list context returns a list of all Category objects in this
KnowledgeSet.  In a scalar context returns the number of such objects.

=item documents()

In a list context returns a list of all Document objects in this
KnowledgeSet.  In a scalar context returns the number of such objects.

=item document()

Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.

=item features()

Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.

=item verbose()

Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.

=item scan_stats()

Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)

=item scan_features()

This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner.  This process is known as "feature selection", and it's a
very important part of categorization.

The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.

The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.

This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies.  Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.

=item save_features()

Given the name of a file, this method writes the features (as
determined by the C<scan_features> method) to the file.

=item restore_features()

Given the name of a file written by C<save_features>, loads the
features from that file and passes them as the C<use_features>
parameter for any Document objects created in the future by this
KnowledgeSet.

=item read()

Iterates through a Collection of documents and adds them to the
KnowledgeSet.  The Collection can be specified using a C<collection>
parameter - otherwise, specify the arguments to pass to the C<new()>
method of the Collection class.

=item load()

This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).

=item add_document()

Given a Document object as an argument, this method will add it and
any categories it belongs to to the KnowledgeSet.

=item make_document()

This method will create a Document object with the given data and then
call C<add_document()> to add it to the KnowledgeSet.  A C<categories>
parameter should specify an array reference containing a list of
categories I<by name>.  These are the categories that the document
belongs to.  Any other parameters will be passed to the Document
class's C<new()> method.

=item finish()

This method will be called prior to training the Learner.  Its purpose
is to perform any operations (such as feature vector weighting) that
may require examination of the entire KnowledgeSet.

=item weigh_features()

This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.

=item document_frequency()

Given a single feature (word) as an argument, this method will return
the number of documents in the KnowledgeSet that contain that feature.

=item partition()

Divides the KnowledgeSet into several subsets.  This may be useful for
performing cross-validation.  The relative sizes of the subsets should
be passed as arguments.  For example, to split the KnowledgeSet into
four KnowledgeSets of equal size, pass the arguments .25, .25, .25
(the final size is 1 minus the sum of the other sizes).  The
partitions will be returned as a list.

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3)

=cut

lib/AI/Categorizer/FeatureSelector/CategorySelector.pm  view on Meta::CPAN

package AI::Categorizer::FeatureSelector::CategorySelector;

use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector);

use Params::Validate qw(:types);

__PACKAGE__->contained_objects
  (
   features => { class => 'AI::Categorizer::FeatureVector',
		 delayed => 1 },
  );

1;


sub reduction_function;

# figure out the feature set before reading collection (default)

sub scan_features {
  my ($self, %args) = @_;
  my $c = $args{collection} or 
    die "No 'collection' parameter provided to scan_features()";

  if(!($self->{features_kept})) {return;}

  my %cat_features;
  my $coll_features = $self->create_delayed_object('features');
  my $nbDocuments = 0;

  while (my $doc = $c->next) {
    $nbDocuments++;
    $args{prog_bar}->() if $args{prog_bar};
    my $docfeatures = $doc->features->as_hash;
    foreach my $cat ($doc->categories) {
      my $catname = $cat->name;
      if(!(exists $cat_features{$catname})) {
        $cat_features{$catname} = $self->create_delayed_object('features');
      }
      $cat_features{$catname}->add($docfeatures);
    }
    $coll_features->add( $docfeatures );
  }
  print STDERR "\n* Computing Chi-Square values\n" if $self->verbose;

  my $r_features = $self->create_delayed_object('features');
  my @terms = $coll_features->names;
  my $progressBar = $self->prog_bar(scalar @terms);
  my $allFeaturesSum = $coll_features->sum;
  my %cat_features_sum;
  while( my($catname,$features) = each %cat_features ) {
    $cat_features_sum{$catname} = $features->sum;
  }

  foreach my $term (@terms) {
    $progressBar->();
    $r_features->{features}{$term} = $self->reduction_function($term,
      $nbDocuments,$allFeaturesSum,$coll_features,
      \%cat_features,\%cat_features_sum);
  }
  print STDERR "\n" if $self->verbose;
  my $new_features = $self->reduce_features($r_features);
  return $coll_features->intersection( $new_features );
}


# calculate feature set after reading collection (scan_first=0)

sub rank_features {
  die "CategorySelector->rank_features is not implemented yet!";
#  my ($self, %args) = @_;
#  
#  my $k = $args{knowledge_set} 
#    or die "No knowledge_set parameter provided to rank_features()";
#
#  my %freq_counts;
#  foreach my $name ($k->features->names) {
#    $freq_counts{$name} = $k->document_frequency($name);
#  }
#  return $self->create_delayed_object('features', features => \%freq_counts);
}


# copied from KnowledgeSet->prog_bar by Ken Williams

sub prog_bar {
  my ($self, $count) = @_;

  return sub {} unless $self->verbose;
  return sub { print STDERR '.' } unless eval "use Time::Progress; 1";

  my $pb = 'Time::Progress'->new;
  $pb->attr(max => $count);
  my $i = 0;
  return sub {
    $i++;
    return if $i % 25;
    print STDERR $pb->report("%50b %p ($i/$count)\r", $i);
  };
}


__END__

=head1 NAME

AI::Categorizer::CategorySelector - Abstract Category Selection class

=head1 SYNOPSIS

This class is abstract. For example of instanciation, see
ChiSquare.

=head1 DESCRIPTION

A base class for FeatureSelectors that calculate their global features
from a set of features by categories.

=head1 METHODS

=head1 AUTHOR

Francois Paradis, paradifr@iro.umontreal.ca
with inspiration from Ken Williams AI::Categorizer code

=cut

lib/AI/Categorizer/FeatureSelector/ChiSquare.pm  view on Meta::CPAN

package AI::Categorizer::FeatureSelector::ChiSquare;

use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector::CategorySelector);

use Params::Validate qw(:types);

# Chi-Square function
# NB: this could probably be optimised a bit...

sub reduction_function {
  my ($self,$term,$N,$allFeaturesSum,
      $coll_features,$cat_features,$cat_features_sum) = @_;
  my $CHI2SUM = 0;
  my $nbcats = 0;
  foreach my $catname (keys %{$cat_features}) {
#  while ( my ($catname,$catfeatures) = each %{$cat_features}) {
    my ($A,$B,$C,$D); # A = number of times where t and c co-occur
                      # B =   "     "   "   t occurs without c
                      # C =   "     "   "   c occurs without t
                      # D =   "     "   "   neither c nor t occur
    $A = $cat_features->{$catname}->value($term);
    $B = $coll_features->value($term) - $A;
    $C = $cat_features_sum->{$catname} - $A;
    $D = $allFeaturesSum - ($A+$B+$C);
    my $ADminCB = ($A*$D)-($C*$B);
    my $CHI2 = $N*$ADminCB*$ADminCB / (($A+$C)*($B+$D)*($A+$B)*($C+$D));
    $CHI2SUM += $CHI2;
    $nbcats++;
  }
  return $CHI2SUM/$nbcats;
}

1;

__END__

=head1 NAME

AI::Categorizer::FeatureSelector::ChiSquare - ChiSquare Feature Selection class

=head1 SYNOPSIS

 # the recommended way to use this class is to let the KnowledgeSet
 # instanciate it

 use AI::Categorizer::KnowledgeSetSMART;
 my $ksetCHI = new AI::Categorizer::KnowledgeSetSMART(
   tfidf_notation =>'Categorizer',
   feature_selection=>'chi_square', ...other parameters...); 

 # however it is also possible to pass an instance to the KnowledgeSet

 use AI::Categorizer::KnowledgeSet;
 use AI::Categorizer::FeatureSelector::ChiSquare;
 my $ksetCHI = new AI::Categorizer::KnowledgeSet(
   feature_selector => new ChiSquare(features_kept=>2000,verbose=>1),
   ...other parameters...
   );

=head1 DESCRIPTION

Feature selection with the ChiSquare function.

  Chi-Square(t,ci) = (N.(AD-CB)^2)
                    -----------------------
                    (A+C).(B+D).(A+B).(C+D)

where t = term
      ci = category i
      N = number of documents in the collection
      A = number of times where t and c co-occur
      B =   "     "   "   t occurs without c
      C =   "     "   "   c occurs without t
      D =   "     "   "   neither c nor t occur

for more details, see :
Yiming Yang, Jan O. Pedersen, A Comparative Study on Feature Selection 
in Text Categorization, in Proceedings of ICML-97, 14th International 
Conference on Machine Learning, 1997.
(available on citeseer.nj.nec.com)

=head1 METHODS

=head1 AUTHOR

Francois Paradis, paradifr@iro.umontreal.ca
with inspiration from Ken Williams AI::Categorizer code

=cut

lib/AI/Categorizer/FeatureSelector/DocFrequency.pm  view on Meta::CPAN

package AI::Categorizer::FeatureSelector::DocFrequency;

use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector);

use Params::Validate qw(:types);
use Carp qw(croak);

__PACKAGE__->contained_objects
  (
   features => { class => 'AI::Categorizer::FeatureVector',
		 delayed => 1 },
  );

# The KnowledgeSet keeps track of document frequency, so just use that.
sub rank_features {
  my ($self, %args) = @_;
  
  my $k = $args{knowledge_set} or die "No knowledge_set parameter provided to rank_features()";
  
  my %freq_counts;
  foreach my $name ($k->features->names) {
    $freq_counts{$name} = $k->document_frequency($name);
  }
  return $self->create_delayed_object('features', features => \%freq_counts);
}

sub scan_features {
  my ($self, %args) = @_;
  my $c = $args{collection} or die "No 'collection' parameter provided to scan_features()";

  my $doc_freq = $self->create_delayed_object('features');
  
  while (my $doc = $c->next) {
    $args{prog_bar}->() if $args{prog_bar};
    $doc_freq->add( $doc->features->as_boolean_hash );
  }
  print "\n" if $self->verbose;
  
  return $self->reduce_features($doc_freq);
}

1;

__END__

=head1 NAME

AI::Categorizer::FeatureSelector - Abstract Feature Selection class

=head1 SYNOPSIS

 ...

=head1 DESCRIPTION

The KnowledgeSet class that provides an interface to a set of
documents, a set of categories, and a mapping between the two.  Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.

=head1 METHODS

=over 4

=item new()

Creates a new KnowledgeSet and returns it.  Accepts the following
parameters:

lib/AI/Categorizer/FeatureVector.pm  view on Meta::CPAN

package AI::Categorizer::FeatureVector;

sub new {
  my ($package, %args) = @_;
  $args{features} ||= {};
  return bless {features => $args{features}}, $package;
}

sub names {
  my $self = shift;
  return keys %{$self->{features}};
}

sub set {
  my $self = shift;
  $self->{features} = (ref $_[0] ? $_[0] : {@_});
}

sub as_hash {
  my $self = shift;
  return $self->{features};
}

sub euclidean_length {
  my $self = shift;
  my $f = $self->{features};

  my $total = 0;
  foreach (values %$f) {
    $total += $_**2;
  }
  return sqrt($total);
}

sub normalize {
  my $self = shift;

  my $length = $self->euclidean_length;
  return $length ? $self->scale(1/$length) : $self;
}

sub scale {
  my ($self, $scalar) = @_;
  $_ *= $scalar foreach values %{$self->{features}};
  return $self;
}

sub as_boolean_hash {
  my $self = shift;
  return { map {($_ => 1)} keys %{$self->{features}} };
}

sub length {
  my $self = shift;
  return scalar keys %{$self->{features}};
}

sub clone {
  my $self = shift;
  return ref($self)->new( features => { %{$self->{features}} } );
}

sub intersection {
  my ($self, $other) = @_;
  $other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);

  my $common;
  if (UNIVERSAL::isa($other, 'ARRAY')) {
    $common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} @$other};
  } elsif (UNIVERSAL::isa($other, 'HASH')) {
    $common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} keys %$other};
  }
  return ref($self)->new( features => $common );
}

sub add {
  my ($self, $other) = @_;

  $other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
  while (my ($k,$v) = each %$other) {
    $self->{features}{$k} += $v;
  }
}

sub dot {
  my ($self, $other) = @_;
  $other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);

  my $sum = 0;
  my $f = $self->{features};
  while (my ($k, $v) = each %$f) {
    $sum += $other->{$k} * $v if exists $other->{$k};
  }
  return $sum;
}

sub sum {
  my ($self) = @_;

  # Return total of values in this vector
  my $total = 0;
  $total += $_ foreach values %{ $self->{features} };
  return $total;
}

sub includes {
  return exists $_[0]->{features}{$_[1]};
}

sub value {
  return $_[0]->{features}{$_[1]};
}

sub values {
  my $self = shift;
  return @{ $self->{features} }{ @_ };
}

1;
__END__

=head1 NAME

AI::Categorizer::FeatureVector - Features vs. Values

=head1 SYNOPSIS

  my $f1 = new AI::Categorizer::FeatureVector
    (features => {howdy => 2, doody => 3});
  my $f2 = new AI::Categorizer::FeatureVector
    (features => {doody => 1, whopper => 2});
   
  @names = $f1->names;
  $x = $f1->length;
  $x = $f1->sum;
  $x = $f1->includes('howdy');
  $x = $f1->value('howdy');
  $x = $f1->dot($f2);
  
  $f3 = $f1->clone;
  $f3 = $f1->intersection($f2);
  $f3 = $f1->add($f2);
  
  $h = $f1->as_hash;
  $h = $f1->as_boolean_hash;
  
  $f1->normalize;

=head1 DESCRIPTION

This class implements a "feature vector", which is a flat data
structure indicating the values associated with a set of features.  At
its base level, a FeatureVector usually represents the set of words in
a document, with the value for each feature indicating the number of
times each word appears in the document.  However, the values are
arbitrary so they can represent other quantities as well, and
FeatureVectors may also be combined to represent the features of
multiple documents.

=head1 METHODS

=over 4

=item ...

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3), Storable(3)

=cut

lib/AI/Categorizer/Hypothesis.pm  view on Meta::CPAN

package AI::Categorizer::Hypothesis;

use strict;

use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);

__PACKAGE__->valid_params
  (
   all_categories => {type => ARRAYREF},
   scores => {type => HASHREF},
   threshold => {type => SCALAR},
   document_name => {type => SCALAR, optional => 1},
  );

sub all_categories { @{$_[0]->{all_categories}} }
sub document_name  { $_[0]->{document_name} }
sub threshold      { $_[0]->{threshold} }

sub best_category {
  my ($self) = @_;
  my $sc = $self->{scores};
  return unless %$sc;

  my ($best_cat, $best_score) = each %$sc;
  while (my ($key, $val) = each %$sc) {
    ($best_cat, $best_score) = ($key, $val) if $val > $best_score;
  }
  return $best_cat;
}

sub in_category {
  my ($self, $cat) = @_;
  return '' unless exists $self->{scores}{$cat};
  return $self->{scores}{$cat} > $self->{threshold};
}

sub categories {
  my $self = shift;
  return @{$self->{cats}} if $self->{cats};
  $self->{cats} = [sort {$self->{scores}{$b} <=> $self->{scores}{$a}}
                   grep {$self->{scores}{$_} >= $self->{threshold}}
                   keys %{$self->{scores}}];
  return @{$self->{cats}};
}

sub scores {
  my $self = shift;
  return @{$self->{scores}}{@_};
}

1;

__END__

=head1 NAME

AI::Categorizer::Hypothesis - Embodies a set of category assignments

=head1 SYNOPSIS

 use AI::Categorizer::Hypothesis;
 
 # Hypotheses are usually created by the Learner's categorize() method.
 # (assume here that $learner and $document have been created elsewhere)
 my $h = $learner->categorize($document);
 
 print "Assigned categories: ", join ', ', $h->categories, "\n";
 print "Best category: ", $h->best_category, "\n";
 print "Assigned scores: ", join ', ', $h->scores( $h->categories ), "\n";
 print "Chosen from: ", join ', ', $h->all_categories, "\n";
 print +($h->in_category('geometry') ? '' : 'not '), "assigned to geometry\n";

=head1 DESCRIPTION

A Hypothesis embodies a set of category assignments that a categorizer
makes about a single document.  Because one may be interested in
knowing different kinds of things about the assignments (for instance,
what categories were assigned, which category had the highest score,
whether a particular category was assigned), we provide a simple class
to help facilitate these scenarios.

=head1 METHODS

=over 4

=item new(%parameters)

Returns a new Hypothesis object.  Generally a user of
C<AI::Categorize> doesn't create a Hypothesis object directly - they
are returned by the Learner's C<categorize()> method.  However, if you
wish to create a Hypothesis directly (maybe passing it some fake data
for testing purposes) you may do so using the C<new()> method.

The following parameters are accepted when creating a new Hypothesis:

=over 4

=item all_categories

A required parameter which gives the set of all categories that could
possibly be assigned to.  The categories should be specified as a
reference to an array of category names (as strings).

=item scores

A hash reference indicating the assignment score for each category.
Any score higher than the C<threshold> will be considered to be
assigned.

=item threshold

A number controlling which categories should be assigned - any
category whose score is greater than or equal to C<threshold> will be
assigned, any category whose score is lower than C<threshold> will not
be assigned.

=item document_name

An optional string parameter indicating the name of the document about
which this hypothesis was made.

=back


=item categories()

Returns an ordered list of the categories the document was placed in,
with best matches first.  Categories are returned by their string names.

=item best_category()

Returns the name of the category with the highest score in this
hypothesis.  Bear in mind that this category may not actually be
assigned if no categories' scores exceed the threshold.

=item in_category($name)

Returns true or false depending on whether the document was placed in
the given category.

=item scores(@names)

Returns a list of result scores for the given categories.  Since the
interface is still changing, and since different Learners implement
scoring in different ways, not very much can officially be said
about the scores, except that a good score is higher than a bad
score.  Individual Learners will have their own procedures for
determining scores, so you cannot compare one Learner's score with
another Learner's - for instance, one Learner might always give scores
between 0 and 1, and another Learner might always return scores less
than 0.  You often cannot compare scores from a single Learner on two
different categorization tasks either.

=item all_categories()

Returns the list of category names specified with the
C<all_categories> constructor parameter.

=item document_name()

Returns the value of the C<document_name> parameter specified as a
constructor parameter, or C<undef> if none was specified.

=back

=head1 AUTHOR

Ken Williams <ken@mathforum.org>

=head1 COPYRIGHT

This distribution is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.  These terms apply to
every file in the distribution - if you have questions, please contact
the author.

=cut

 view all matches for this distribution
 view release on metacpan -  search on metacpan

( run in 1.210 second using v1.00-cache-2.02-grep-82fe00e-cpan-72ae3ad1e6da )