AI-Categorizer


Build.PL

use Module::Build;
use Config;
use File::Spec;

my $build = Module::Build->new
  (
   module_name => 'AI::Categorizer',
   license => 'perl',
   requires => {
		perl                    => '5.6.0',
		Class::Container        => 0.09,
		Storable                => 0,
		Params::Validate        => 0.18,
		Statistics::Contingency => 0.06,
		Lingua::Stem            => 0.50,
		File::Spec              => 0,
	       },
   recommends => {
		  Scalar::Util          => 0,
		  Time::Progress        => 1.1,
		  Algorithm::SVM        => 0.06,
		  AI::DecisionTree      => 0.06,
		  Algorithm::NaiveBayes => 0,
		 },
   build_requires => {
		      Module::Build     => 0.20,
		     },
   create_makefile_pl => 'passthrough',
  );

my $categorizer = File::Spec->catfile('eg', 'categorizer');
if ($build->y_n("Do you want to install the $categorizer script to $Config{installscript}?", 'n')) {
  $build->scripts($categorizer);
}

Changes


 - Added a ChiSquared feature selection class. [Francois Paradis]

 - Changed the web locations of the reuters-21578 corpus that
   eg/demo.pl uses, since the location it previously referenced has
   gone away.

 - The building & installing process now uses Module::Build rather
   than ExtUtils::MakeMaker.

 - When the features_kept mechanism was used to explicitly state the
   features to use, and the scan_first parameter was left as its
   default value, the features_kept mechanism would silently fail to
   do anything.  This has now been fixed. [Spotted by Arnaud Gaudinat]

 - Recent versions of Weka have changed the name of the SVM class, so
   I've updated it in our test (t/03-weka.t) of the Weka wrapper
   too. [Sebastien Aperghis-Tramoni]

0.07  Tue May  6 16:15:04 CDT 2003

 - Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
   MANIFEST, so they weren't included in the 0.06 distribution.

Changes

   Collection class.

0.05  Sat Mar 29 00:38:21 CST 2003

 - Feature selection is now handled by an abstract FeatureSelector
   framework class.  Currently the only concrete subclass implemented
   is FeatureSelector::DocFrequency.  The 'feature_selection'
   parameter has been replaced with a 'feature_selector_class'
   parameter.

 - Added a k-Nearest-Neighbor machine learner. [First revision
   implemented by David Bell]

 - Added a Rocchio machine learner. [Partially implemented by Xiaobo
   Li]

 - Added a "Guesser" machine learner which simply uses overall class
   probabilities to make categorization decisions.  Sometimes useful
   for providing a set of baseline scores against which to evaluate
   other machine learners.

 - The NaiveBayes learner is now a wrapper around my new
   Algorithm::NaiveBayes module, which is just the old NaiveBayes code
   from here, turned into its own standalone module.

 - Much more extensive regression testing of the code.

 - Added a Document subclass for XML documents. [Implemented by
   Jae-Moon Lee] Its interface is still unstable and may change in
   later releases.

 - Added a 'Build.PL' file for an alternate installation method using
   Module::Build.

 - Fixed a problem in the Hypothesis' best_category() method that
   would often result in the wrong category being reported.  Added a
   regression test to exercise the Hypothesis class.  [Spotted by
   Xiaobo Li]

 - The 'categorizer' script now records more useful benchmarking
   information about time & memory in its outfile.

 - The AI::Categorizer->dump_parameters() method now tries to avoid
   showing you its entire list of stopwords.

 - Document objects now use a default 'name' if none is supplied.

Changes


 - Added a virtual class for binary classifiers.

 - Wrote documentation for lots of the undocumented classes.

 - Added a PNG file giving an overview diagram of the classes.

 - Added a script 'categorizer' to provide a simple command-line
   interface to AI::Categorizer.

 - save_state() and restore_state() now save to a directory, not a
   file.

 - Removed F1(), precision(), recall(), etc. from Util package since
   they're in Statistics::Contingency.  Added random_elements() to
   Util.

 - Collection::Files now warns when no category information is known
   about a document in the collection (knowing it's in zero categories
   is okay).

Changes


 - Added dot() and value() methods to FeatureVector.

 - Added 'feature_selection' parameter to KnowledgeSet.

 - Added document($name) accessor method to KnowledgeSet.

 - In KnowledgeSet, load(), read(), and scan_*() can now accept a
   Collection object.

 - Added document_frequency(), finish(), and weigh_features() methods
   to KnowledgeSet.

 - Added save_features() and restore_features() to KnowledgeSet.

 - Added default categories() and categorize() methods to Learner base
   class.  get_scores() is now abstract.

 - Extended interface of ObjectSet class with retrieve(), includes(),
   and includes_name().

 - Moved 'term_weighting' parameter from Document to KnowledgeSet,
   since the normalized version needs to know the maximum
   term-frequency.  Also changed its values to 'n', 'l', 'b', and 't',
   with 'x' a synonym for 't'.

 - Implemented full range of TF/IDF term weighting methods (see Salton

META.yml

name: AI-Categorizer
version: 0.09
author:
  - 'Ken Williams <ken@mathforum.org>'
  - |-
    Discussion about this module can be directed to the perl-AI list at
    <perl-ai@perl.org>.  For more info about the list, see
    http://lists.perl.org/showlist.cgi?name=perl-ai
abstract: Automatic Text Categorization
license: perl
resources:
  license: http://dev.perl.org/licenses/
requires:
  Class::Container: 0.09
  File::Spec: 0
  Lingua::Stem: 0.5
  Params::Validate: 0.18
  Statistics::Contingency: 0.06
  Storable: 0
  perl: 5.6.0
build_requires:
  Module::Build: 0.2
recommends:
  AI::DecisionTree: 0.06
  Algorithm::NaiveBayes: 0
  Algorithm::SVM: 0.06
  Scalar::Util: 0
  Time::Progress: 1.1
provides:
  AI::Categorizer:
    file: lib/AI/Categorizer.pm
    version: 0.09
  AI::Categorizer::Category:
    file: lib/AI/Categorizer/Category.pm
  AI::Categorizer::Collection:
    file: lib/AI/Categorizer/Collection.pm
  AI::Categorizer::Collection::DBI:
    file: lib/AI/Categorizer/Collection/DBI.pm

Makefile.PL

# Note: this file was auto-generated by Module::Build::Compat version 0.03
    
    unless (eval "use Module::Build::Compat 0.02; 1" ) {
      print "This module requires Module::Build to install itself.\n";
      
      require ExtUtils::MakeMaker;
      my $yn = ExtUtils::MakeMaker::prompt
	('  Install Module::Build now from CPAN?', 'y');
      
      unless ($yn =~ /^y/i) {
	die " *** Cannot install without Module::Build.  Exiting ...\n";
      }
      
      require Cwd;

README

NAME
    AI::Categorizer - Automatic Text Categorization

SYNOPSIS
     use AI::Categorizer;
     my $c = new AI::Categorizer(...parameters...);
 
     # Run a complete experiment - training on a corpus, testing on a test
     # set, printing a summary of results to STDOUT
     $c->run_experiment;
 
     # Or, run the parts of $c->run_experiment separately
     $c->scan_features;
     $c->read_training_set;
     $c->train;
     $c->evaluate_test_set;
     print $c->stats_table;
 
     # After training, use the Learner for categorization
     my $l = $c->learner;
     while (...) {
       my $d = ...create a document...
       my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
       print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
       print "Best category: ", $hypothesis->best_category, "\n";
     }
 
DESCRIPTION
    "AI::Categorizer" is a framework for automatic text categorization. It
    consists of a collection of Perl modules that implement common
    categorization tasks, and a set of defined relationships among those
    modules. The various details are flexible - for example, you can choose what
    categorization algorithm to use, what features (words or otherwise) of the
    documents should be used (or how to automatically choose these features),
    what format the documents are in, and so on.

    The basic process of using this module will typically involve obtaining a
    collection of pre-categorized documents, creating a "knowledge set"
    representation of those documents, training a categorizer on that knowledge
    set, and saving the trained categorizer for later use. There are several
    ways to carry out this process. The top-level "AI::Categorizer" module
    provides an umbrella class for high-level operations, or you may use the
    interfaces of the individual classes in the framework.

    A simple sample script that reads a training corpus, trains a
    categorizer, and tests the categorizer on a test corpus is distributed
    as eg/demo.pl.

    Disclaimer: the results of any of the machine learning algorithms are far
    from infallible (close to fallible?). Categorization of documents is often a
    difficult task even for humans well-trained in the particular domain of
    knowledge, and there are many things a human would consider that none of
    these algorithms consider. These are only statistical tests - at best they
    are neat tricks or helpful assistants, and at worst they are totally
    unreliable. If you plan to use this module for anything really important,
    human supervision is essential, both of the categorization process and the
    final results.

    For the usage details, please see the documentation of each individual
    module.

FRAMEWORK COMPONENTS
    This section explains the major pieces of the "AI::Categorizer" object
    framework. We give a conceptual overview, but don't get into any of the
    details about interfaces or usage. See the documentation for the individual
    classes for more details.

    A diagram of the various classes in the framework can be seen in
    "doc/classes-overview.png", and a more detailed view of the same thing can
    be seen in "doc/classes.png".

  Knowledge Sets

    A "knowledge set" is defined as a collection of documents, together with
    some information on the categories each document belongs to. Note that this
    term is somewhat specific to this project - other sources may call it a
    "training corpus", or "prior knowledge". A knowledge set also contains some
    information on how documents will be parsed and how their features (words)
    will be extracted and turned into meaningful representations. In this sense,
    a knowledge set represents not only a collection of data, but a particular
    view on that data.

    A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
    class. Before you can start playing with categorizers, you will have to
    start playing with knowledge sets, so that the categorizers have some data
    to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
    module for information on its interface.
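
    For example, a knowledge set can be built from a stored corpus along
    these lines (a sketch adapted from eg/demo.pl; the paths and the
    "category_file" argument are assumptions for illustration):

     use AI::Categorizer::KnowledgeSet;
     use AI::Categorizer::Collection::Files;

     my $training = AI::Categorizer::Collection::Files->new
       ( path => 'corpus/training', category_file => 'corpus/cats.txt' );

     my $k = AI::Categorizer::KnowledgeSet->new( verbose => 1 );
     $k->load( collection => $training );  # read the whole collection in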

   Feature selection

    Deciding which features are the most important is a very large part of the
    categorization task - you cannot simply consider all the words in all the
    documents when training, and all the words in the document being
    categorized. There are two main reasons for this - first, it would mean that
    your training and categorizing processes would take forever and use tons of
    memory, and second, the significant stuff of the documents would get lost in
    the "noise" of the insignificant stuff.

    The process of selecting the most important features in the training set is
    called "feature selection". It is managed by the
    "AI::Categorizer::KnowledgeSet" class, and you will find the details of
    feature selection processes in that class's documentation.
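
    For instance, feature selection can be configured when the knowledge
    set is constructed.  This is only a sketch: it assumes that
    "features_kept" accepts a fraction of features to keep, and uses the
    "feature_selector_class" parameter named in this distribution's
    Changes file:

     my $k = AI::Categorizer::KnowledgeSet->new
       ( features_kept          => 0.2,  # assumed: keep the top 20% of features
         feature_selector_class => 'AI::Categorizer::FeatureSelector::DocFrequency',
       );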

  Collections

    Because documents may be stored in lots of different formats, a "collection"
    class has been created as an abstraction of a stored set of documents,
    together with a way to iterate through the set and return Document objects.
    A knowledge set contains a single collection object. A "Categorizer" doing a
    complete test run generally contains two collections, one for training and
    one for testing. A "Learner" can mass-categorize a collection.

    The "AI::Categorizer::Collection" class and its subclasses instantiate the
    idea of a collection in this sense.
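
    A typical pass over a stored collection might look like this sketch
    (the constructor arguments are placeholders, and "category_file" is
    an assumption based on the parameter list later in this README):

     use AI::Categorizer::Collection::Files;

     my $collection = AI::Categorizer::Collection::Files->new
       ( path => 'corpus/test', category_file => 'corpus/cats.txt' );

     while (my $document = $collection->next) {
       # ... each $document is an AI::Categorizer::Document object ...
     }
     $collection->rewind;  # reset the iterator for another pass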

  Documents

    Each document is represented by an "AI::Categorizer::Document" object, or an
    object of one of its subclasses. Each document class contains methods for
    turning a bunch of data into a Feature Vector. Each document also has a
    method to report which categories it belongs to.
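
    A document can be created directly from its content, as in the
    following sketch (based on the Document class's SYNOPSIS; the
    explicit call to finish(), which the class's file-reading constructor
    makes for you, is an assumption here):

     my $d = AI::Categorizer::Document->new
       ( name    => 'doc1',
         content => { subject => 'A short subject line',
                      body    => 'The main body of the document' } );
     $d->finish;                   # build the FeatureVector from the content
     my $features = $d->features;  # an AI::Categorizer::FeatureVector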

  Categories

    Each category is represented by an "AI::Categorizer::Category" object. Its
    main purpose is to keep track of which documents belong to it, though you
    can also examine statistical properties of an entire category, such as
    obtaining a Feature Vector representing an amalgamation of all the documents
    that belong to it.
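
    For example (a sketch based on the Category class's SYNOPSIS, where
    $doc is a Document object):

     my $category = AI::Categorizer::Category->by_name(name => 'sports');
     $category->add_document($doc);

     my $num_docs = $category->documents;  # scalar context: number of documents
     my $features = $category->features;   # sum of the members' FeatureVectors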

  Machine Learning Algorithms

    There are lots of different ways to make the inductive leap from the
    training documents to unseen documents. The Machine Learning community has
    studied many algorithms for this purpose. To allow flexibility in choosing
    and configuring categorization algorithms, each such algorithm is a subclass
    of "AI::Categorizer::Learner". There are currently four categorizers
    included in the distribution:

    AI::Categorizer::Learner::NaiveBayes
        A pure-perl implementation of a Naive Bayes classifier. No dependencies
        on external modules or other resources. Naive Bayes is usually very fast
        to train and fast to make categorization decisions, but isn't always the
        most accurate categorizer.

    AI::Categorizer::Learner::SVM
        An interface to Corey Spencer's "Algorithm::SVM", which implements a
        Support Vector Machine classifier. SVMs can take a while to train
        (though in certain conditions there are optimizations to make them quite
        fast), but are pretty quick to categorize. They often have very good
        accuracy.

    AI::Categorizer::Learner::DecisionTree
        An interface to "AI::DecisionTree", which implements a Decision Tree
        classifier. Decision Trees generally take longer to train than Naive
        Bayes or SVM classifiers, but they are also quite fast when
        categorizing. Decision Trees have the advantage that you can scrutinize
        the structures of trained decision trees to see how decisions are being
        made.

    AI::Categorizer::Learner::Weka
        An interface to version 2 of the Weka Knowledge Analysis system that
        lets you use any of the machine learners it defines. This gives you
        access to lots and lots of machine learning algorithms in use by machine
        learning researchers. The main drawback is that Weka tends to be quite
        slow and use a lot of memory, and the current interface between Weka and
        "AI::Categorizer" is a bit clumsy.

    Other machine learning methods that may be implemented soonish include
    Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts combiner
    for ensemble learning. No timetable for their creation has yet been set.

    Please see the documentation of these individual modules for more details on
    their guts and quirks. See the "AI::Categorizer::Learner" documentation for
    a description of the general categorizer interface.

    If you wish to create your own classifier, you should inherit from
    "AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
    abstract classes that manage some of the work for you.
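
    A minimal skeleton of such a subclass might look like the sketch
    below.  The Learner base class supplies default categories() and
    categorize() methods and leaves get_scores() abstract; the
    create_model() hook and the (scores, threshold) return convention
    shown here are assumptions about the framework's internals:

     package AI::Categorizer::Learner::Random;
     use strict;
     use base 'AI::Categorizer::Learner';

     # Hypothetical baseline learner: every category gets a random score.
     sub create_model {
       my $self = shift;
       # A real learner would inspect $self->knowledge_set here.
     }

     sub get_scores {
       my ($self, $doc) = @_;
       my %scores = map { $_->name => rand() } $self->categories;
       return (\%scores, 0.5);  # per-category scores, assignment threshold
     }

     1;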

  Feature Vectors

    Most categorization algorithms don't deal directly with documents' data;
    instead, they deal with a *vector representation* of a document's *features*.
    The features may be any properties of the document that seem helpful for
    determining its category, but they are usually some version of the "most
    important" words in the document. A list of features and their weights in
    each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
    You may think of this class as roughly analogous to a Perl hash, where the
    keys are the names of features and the values are their weights.
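
    For instance (a sketch; dot() and value() are listed in this
    distribution's Changes file, while the constructor's "features"
    parameter is an assumption):

     use AI::Categorizer::FeatureVector;

     my $v1 = AI::Categorizer::FeatureVector->new
       ( features => { sports => 2, ball => 10 } );
     my $v2 = AI::Categorizer::FeatureVector->new
       ( features => { sports => 5, card => 3 } );

     my $hash = $v1->as_hash;        # { sports => 2, ball => 10 }
     my $dot  = $v1->dot($v2);       # 2*5 = 10; only 'sports' overlaps
     my $w    = $v1->value('ball');  # 10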

  Hypotheses

    The result of asking a categorizer to categorize a previously unseen
    document is called a hypothesis, because it is some kind of "statistical
    guess" of what categories this document should be assigned to. Since you may
    be interested in any of several pieces of information about the hypothesis
    (for instance, which categories were assigned, which category was the single
    most likely category, the scores assigned to each category, etc.), the
    hypothesis is returned as an object of the "AI::Categorizer::Hypothesis"
    class, and you can use its object methods to get information about the
    hypothesis. See its class documentation for the details.
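
    In code, examining a hypothesis might look like this sketch (the
    in_category() method name is an assumption; $l is the trained
    Learner from the SYNOPSIS above):

     my $h = $l->categorize($document);  # an AI::Categorizer::Hypothesis

     my @assigned = $h->categories;      # all assigned categories
     my $best     = $h->best_category;   # the single most likely category
     print "It's sports!\n" if $h->in_category('sports');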

  Experiments

    The "AI::Categorizer::Experiment" class helps you organize the results of
    categorization experiments. As you get lots of categorization results
    (Hypotheses) back from the Learner, you can feed these results to the
    Experiment class, along with the correct answers. When all results have been
    collected, you can get a report on accuracy, precision, recall, F1, and so
    on, with both micro-averaging and macro-averaging over categories. We use
    the "Statistics::Contingency" module from CPAN to manage the calculations.
    See the docs for "AI::Categorizer::Experiment" for more details.
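
    In code, that cycle looks roughly like this (adapted from the
    Experiment class's SYNOPSIS later in this distribution):

     use AI::Categorizer::Experiment;
     my $e = AI::Categorizer::Experiment->new( categories => \%categories );

     while (my $d = $collection->next) {
       my $h = $l->categorize($d);  # a Hypothesis
       $e->add_hypothesis($h, [ map $_->name, $d->categories ]);
     }

     print "Micro F1: ", $e->micro_F1, "\n";  # access a single statistic
     print $e->stats_table;                   # several stats in table form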

METHODS
    new()
        Creates a new Categorizer object and returns it. Accepts lots of
        parameters controlling behavior. In addition to the parameters listed
        here, you may pass any parameter accepted by any class that we create
        internally (the KnowledgeSet, Learner, Experiment, or Collection
        classes), or any class that *they* create. This is managed by the
        "Class::Container" module, so see its documentation for the details of
        how this works.

        The specific parameters accepted here are:

        progress_file
            A string that indicates a place where objects will be saved during
            several of the methods of this class. The default value is the
            string "save", which means files like "save-01-knowledge_set" will
            get created. The exact names of these files may change in future
            releases, since they're just used internally to resume where we last
            left off.

        verbose
            If true, a few status messages will be printed during execution.

        training_set
            Specifies the "path" parameter that will be fed to the
            KnowledgeSet's "scan_features()" and "read()" methods during our
            "scan_features()" and "read_training_set()" methods.

        test_set
            Specifies the "path" parameter that will be used when creating a
            Collection during the "evaluate_test_set()" method.

        data_root
            A shortcut for setting the "training_set", "test_set", and
            "category_file" parameters separately. Sets "training_set" to
            "$data_root/training", "test_set" to "$data_root/test", and
            "category_file" (used by some of the Collection classes) to

README

        Returns the Learner object associated with this Categorizer. Before
        "train()", the Learner will of course not be trained yet.

    knowledge_set()
        Returns the KnowledgeSet object associated with this Categorizer. If
        "read_training_set()" has not yet been called, the KnowledgeSet will not
        yet be populated with any training data.

    run_experiment()
        Runs a complete experiment on the training and testing data, reporting
        the results on "STDOUT". Internally, this is just a shortcut for calling
        the "scan_features()", "read_training_set()", "train()", and
        "evaluate_test_set()" methods, then printing the value of the
        "stats_table()" method.

    scan_features()
        Scans the Collection specified in the "training_set" parameter to determine
        the set of features (words) that will be considered when training the
        Learner. Internally, this calls the "scan_features()" method of the
        KnowledgeSet, then saves a list of the KnowledgeSet's features for later
        use.

        This step is not strictly necessary, but it can dramatically reduce
        memory requirements if you scan for features before reading the entire
        corpus into memory.

    read_training_set()
        Populates the KnowledgeSet with the data specified in the "training_set"
        parameter. Internally, this calls the "read()" method of the
        KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
        object for later use.

    train()
        Calls the Learner's "train()" method, passing it the KnowledgeSet
        created during "read_training_set()". Returns the Learner object. Also
        saves the Learner object for later use.

    evaluate_test_set()
        Creates a Collection based on the value of the "test_set" parameter, and
        calls the Learner's "categorize_collection()" method using this
        Collection. Returns the resultant Experiment object. Also saves the
        Experiment object for later use in the "stats_table()" method.

    stats_table()
        Returns the value of the Experiment's (as created by
        "evaluate_test_set()") "stats_table()" method. This is a string that
        shows various statistics about the accuracy/precision/recall/F1/etc. of
        the assignments made during testing.

HISTORY
    This module is a revised and redesigned version of the previous

README

    Discussion about this module can be directed to the perl-AI list at
    <perl-ai@perl.org>. For more info about the list, see
    http://lists.perl.org/showlist.cgi?name=perl-ai

REFERENCES
    An excellent introduction to the academic field of Text Categorization is
    Fabrizio Sebastiani's "Machine Learning in Automated Text Categorization":
    ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.

COPYRIGHT
    Copyright 2000-2003 Ken Williams. All rights reserved.

    This distribution is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself. These terms apply to every file in the
    distribution - if you have questions, please contact the author.

eg/categorizer

#!/usr/bin/perl

# This script creates a Categorizer and runs several of its methods on
# a corpus, reporting the results.
#
# Copyright 2002 Ken Williams, under the same license as the
# AI::Categorizer distribution.


use strict;
use AI::Categorizer;
use Benchmark;
my $HAVE_YAML = eval "use YAML; 1";

eg/categorizer

    print $out_fh "~~~~~~~~~~~~~~~~", scalar(localtime), "~~~~~~~~~~~~~~~~~~~~~~~~~~~\n";
    if ($HAVE_YAML) {
      print {$out_fh} YAML::Dump($c->dump_parameters);
    } else {
      warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
    }
  }
}
  

run_section('scan_features',     1, $do_stage);
run_section('read_training_set', 2, $do_stage);
run_section('train',             3, $do_stage);
run_section('evaluate_test_set', 4, $do_stage);
if ($do_stage->{5}) {
  my $result = $c->stats_table;
  print $result if $c->verbose;
  print $out_fh $result if $out_fh;
}

sub run_section {
  my ($section, $stage, $do_stage) = @_;
  return unless $do_stage->{$stage};
  if (keys %$do_stage > 1) {
    print " % $0 @ARGV -$stage\n" if $c->verbose;
    die "$0 is not executable, please change its execution permissions"
      unless -x $0;
    system($0, @ARGV, "-$stage") == 0

eg/categorizer


sub parse_command_line {
  my (%opt, %do_stage);

  while (@_) {
    if ($_[0] =~ /^-(\d+)$/) {
      shift;
      $do_stage{$1} = 1;
      
    } elsif ( $_[0] eq '--config_file' ) {
      die "--config_file requires the YAML module from CPAN to be installed.\n" unless $HAVE_YAML;
      shift;
      my $file = shift;
      my $href = YAML::LoadFile($file);
      @opt{keys %$href} = values %$href;
      
    } elsif ( $_[0] =~ /^--/ ) {
      my ($k, $v) = (shift, shift);
      $k =~ s/^--//;
      $opt{$k} = $v;
      

eg/categorizer

    # Allow abbreviations
    if ($k =~ /^(\w+)_class$/) {
      my $name = $1;
      $v =~ s/^::/AI::Categorizer::\u${name}::/;
      $opt{$k} = $v;
    }
  }

  my $outfile;
  unless ($outfile = delete $opt{outfile}) {
    $outfile = $opt{progress_file} ? "$opt{progress_file}-results.txt" : "results.txt";
  }

  return (\%opt, \%do_stage, $outfile);
}

sub usage {
  return <<EOF;
 Usage:

  $0 --parameter_1 <value_1> --parameter_2 <value_2>

eg/demo.pl



# In a real-world application these Collection objects could be of any
# type (any Collection subclass).  Or you could create each Document
# object manually.  Or you could let the KnowledgeSet create the
# Collection objects for you.

$training = AI::Categorizer::Collection::Files->new( path => $training, %params );
$test     = AI::Categorizer::Collection::Files->new( path => $test, %params );

# We turn on verbose mode so you can watch the progress of loading &
# training.  This looks nicer if you have Time::Progress installed!

print "Loading training set\n";
my $k = AI::Categorizer::KnowledgeSet->new( verbose => 1 );
$k->load( collection => $training );

print "Training categorizer\n";
my $l = AI::Categorizer::Learner::NaiveBayes->new( verbose => 1 );
$l->train( knowledge_set => $k );

print "Categorizing test set\n";

eg/easy_guesser.pl

#!/usr/bin/perl

# This script can be helpful for getting a set of baseline scores for
# a categorization task.  It simulates using the "Guesser" learner,
# but is much faster.  Because it doesn't leverage using the whole
# framework, though, it expects everything to be in a very strict
# format.  <cats-file> is in the same format as the 'category_file'
# parameter to the Collection class.  <training-dir> and <test-dir>
# give paths to directories of documents, named as in <cats-file>.

use strict;
use Statistics::Contingency;

eg/easy_guesser.pl

while (defined(my $file = readdir $dh)) {
    next if $file eq '.' or $file eq '..';
    unless ($cats{$file}) {
	warn "No category information for '$file'";
	next;
    }
    my @assigned;
    foreach (@cats) {
	push @assigned, $_ if rand() < $freq{$_};
    }
    $c->add_result(\@assigned, $cats{$file});
}

print $c->stats_table(4);

lib/AI/Categorizer.pm

use AI::Categorizer::Learner;
use AI::Categorizer::Document;
use AI::Categorizer::Category;
use AI::Categorizer::Collection;
use AI::Categorizer::Hypothesis;
use AI::Categorizer::KnowledgeSet;


__PACKAGE__->valid_params
  (
   progress_file => { type => SCALAR, default => 'save' },
   knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
   learner       => { isa => 'AI::Categorizer::Learner' },
   verbose       => { type => BOOLEAN, default => 0 },
   training_set  => { type => SCALAR, optional => 1 },
   test_set      => { type => SCALAR, optional => 1 },
   data_root     => { type => SCALAR, optional => 1 },
  );

__PACKAGE__->contained_objects
  (

lib/AI/Categorizer.pm

#  delete $p->{stopwords} if $p->{stopword_file};
#  return $p;
#}

sub knowledge_set { shift->{knowledge_set} }
sub learner       { shift->{learner} }

# Combines several methods in one sub
sub run_experiment {
  my $self = shift;
  $self->scan_features;
  $self->read_training_set;
  $self->train;
  $self->evaluate_test_set;
  print $self->stats_table;
}

sub scan_features {
  my $self = shift;
  return unless $self->knowledge_set->scan_first;
  $self->knowledge_set->scan_features( path => $self->{training_set} );
  $self->knowledge_set->save_features( "$self->{progress_file}-01-features" );
}

sub read_training_set {
  my $self = shift;
  $self->knowledge_set->restore_features( "$self->{progress_file}-01-features" )
    if -e "$self->{progress_file}-01-features";
  $self->knowledge_set->read( path => $self->{training_set} );
  $self->_save_progress( '02', 'knowledge_set' );
  return $self->knowledge_set;
}

sub train {
  my $self = shift;
  $self->_load_progress( '02', 'knowledge_set' );
  $self->learner->train( knowledge_set => $self->{knowledge_set} );
  $self->_save_progress( '03', 'learner' );
  return $self->learner;
}

sub evaluate_test_set {
  my $self = shift;
  $self->_load_progress( '03', 'learner' );
  my $c = $self->create_delayed_object('collection', path => $self->{test_set} );
  $self->{experiment} = $self->learner->categorize_collection( collection => $c );
  $self->_save_progress( '04', 'experiment' );
  return $self->{experiment};
}

sub stats_table {
  my $self = shift;
  $self->_load_progress( '04', 'experiment' );
  return $self->{experiment}->stats_table;
}

sub progress_file {
  shift->{progress_file};
}

sub verbose {
  shift->{verbose};
}

sub _save_progress {
  my ($self, $stage, $node) = @_;
  return unless $self->{progress_file};
  my $file = "$self->{progress_file}-$stage-$node";
  warn "Saving to $file\n" if $self->{verbose};
  $self->{$node}->save_state($file);
}

sub _load_progress {
  my ($self, $stage, $node) = @_;
  return unless $self->{progress_file};
  my $file = "$self->{progress_file}-$stage-$node";
  warn "Loading $file\n" if $self->{verbose};
  $self->{$node} = $self->contained_class($node)->restore_state($file);
}

1;
__END__

=head1 NAME

AI::Categorizer - Automatic Text Categorization

=head1 SYNOPSIS

 use AI::Categorizer;
 my $c = new AI::Categorizer(...parameters...);
 
 # Run a complete experiment - training on a corpus, testing on a test
 # set, printing a summary of results to STDOUT
 $c->run_experiment;
 
 # Or, run the parts of $c->run_experiment separately
 $c->scan_features;
 $c->read_training_set;
 $c->train;
 $c->evaluate_test_set;
 print $c->stats_table;
 
 # After training, use the Learner for categorization
 my $l = $c->learner;
 while (...) {
   my $d = ...create a document...
   my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
   print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
   print "Best category: ", $hypothesis->best_category, "\n";
 }
 
=head1 DESCRIPTION

C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules.  The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically
choose these features), what format the documents are in, and so on.

The basic process of using this module will typically involve
obtaining a collection of B<pre-categorized> documents, creating a
"knowledge set" representation of those documents, training a
categorizer on that knowledge set, and saving the trained categorizer
for later use.  There are several ways to carry out this process.  The
top-level C<AI::Categorizer> module provides an umbrella class for
high-level operations, or you may use the interfaces of the individual
classes in the framework.

A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus is
distributed as eg/demo.pl.

Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?).  Categorization of documents
is often a difficult task even for humans well-trained in the
particular domain of knowledge, and there are many things a human
would consider that none of these algorithms consider.  These are only
statistical tests - at best they are neat tricks or helpful
assistants, and at worst they are totally unreliable.  If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.

For the usage details, please see the documentation of each individual
module.

=head1 FRAMEWORK COMPONENTS

This section explains the major pieces of the C<AI::Categorizer>
object framework.  We give a conceptual overview, but don't get into
any of the details about interfaces or usage.  See the documentation
for the individual classes for more details.

lib/AI/Categorizer.pm

C<doc/classes-overview.png>, and a more detailed view of the same
thing can be seen in C<doc/classes.png>.

=head2 Knowledge Sets

A "knowledge set" is defined as a collection of documents, together
with some information on the categories each document belongs to.
Note that this term is somewhat specific to this project - other sources
may call it a "training corpus", or "prior knowledge".  A knowledge
set also contains some information on how documents will be parsed and
how their features (words) will be extracted and turned into
meaningful representations.  In this sense, a knowledge set represents
not only a collection of data, but a particular view on that data.

A knowledge set is encapsulated by the
C<AI::Categorizer::KnowledgeSet> class.  Before you can start playing
with categorizers, you will have to start playing with knowledge sets,
so that the categorizers have some data to train on.  See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.

=head3 Feature selection

Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized.  There are two main reasons for this - first, it
would mean that your training and categorizing processes would take
forever and use tons of memory, and second, the significant stuff of
the documents would get lost in the "noise" of the insignificant stuff.

The process of selecting the most important features in the training
set is called "feature selection".  It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.

=head2 Collections

Because documents may be stored in lots of different formats, a
"collection" class has been created as an abstraction of a stored set
of documents, together with a way to iterate through the set and
return Document objects.  A knowledge set contains a single collection
object.  A C<Categorizer> doing a complete test run generally contains
two collections, one for training and one for testing.  A C<Learner>
can mass-categorize a collection.

The C<AI::Categorizer::Collection> class and its subclasses
instantiate the idea of a collection in this sense.

=head2 Documents

Each document is represented by an C<AI::Categorizer::Document>
object, or an object of one of its subclasses.  Each document class
contains methods for turning a bunch of data into a Feature Vector.
Each document also has a method to report which categories it belongs
to.

=head2 Categories

Each category is represented by an C<AI::Categorizer::Category>
object.  Its main purpose is to keep track of which documents belong
to it, though you can also examine statistical properties of an entire
category, such as obtaining a Feature Vector representing an
amalgamation of all the documents that belong to it.

=head2 Machine Learning Algorithms

There are lots of different ways to make the inductive leap from the
training documents to unseen documents.  The Machine Learning
community has studied many algorithms for this purpose.  To allow
flexibility in choosing and configuring categorization algorithms,
each such algorithm is a subclass of C<AI::Categorizer::Learner>.
There are currently four categorizers included in the distribution:

=over 4

=item AI::Categorizer::Learner::NaiveBayes

A pure-perl implementation of a Naive Bayes classifier.  No
dependencies on external modules or other resources.  Naive Bayes is
usually very fast to train and fast to make categorization decisions,
but isn't always the most accurate categorizer.

=item AI::Categorizer::Learner::SVM

An interface to Corey Spencer's C<Algorithm::SVM>, which implements a
Support Vector Machine classifier.  SVMs can take a while to train
(though in certain conditions there are optimizations to make them
quite fast), but are pretty quick to categorize.  They often have very
good accuracy.

=item AI::Categorizer::Learner::DecisionTree

An interface to C<AI::DecisionTree>, which implements a Decision Tree
classifier.  Decision Trees generally take longer to train than Naive
Bayes or SVM classifiers, but they are also quite fast when
categorizing.  Decision Trees have the advantage that you can
scrutinize the structures of trained decision trees to see how
decisions are being made.

=item AI::Categorizer::Learner::Weka

An interface to version 2 of the Weka Knowledge Analysis system that
lets you use any of the machine learners it defines.  This gives you
access to lots and lots of machine learning algorithms in use by
machine learning researchers.  The main drawback is that Weka tends to
be quite slow and use a lot of memory, and the current interface
between Weka and C<AI::Categorizer> is a bit clumsy.

=back

Other machine learning methods that may be implemented soonish include
Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts
combiner for ensemble learning.  No timetable for their creation has
yet been set.

Please see the documentation of these individual modules for more
details on their guts and quirks.  See the C<AI::Categorizer::Learner>
documentation for a description of the general categorizer interface.

If you wish to create your own classifier, you should inherit from
C<AI::Categorizer::Learner> or C<AI::Categorizer::Learner::Boolean>,
which are abstract classes that manage some of the work for you.

=head2 Feature Vectors

Most categorization algorithms don't deal directly with documents'
data; instead, they deal with a I<vector representation> of a
document's I<features>.  The features may be any properties of the
document that seem helpful for determining its category, but they are usually
some version of the "most important" words in the document.  A list of
features and their weights in each document is encapsulated by the
C<AI::Categorizer::FeatureVector> class.  You may think of this class
as roughly analogous to a Perl hash, where the keys are the names of
features and the values are their weights.

=head2 Hypotheses

The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of
"statistical guess" of what categories this document should be
assigned to.  Since you may be interested in any of several pieces of
information about the hypothesis (for instance, which categories were
assigned, which category was the single most likely category, the
scores assigned to each category, etc.), the hypothesis is returned as
an object of the C<AI::Categorizer::Hypothesis> class, and you can use
its object methods to get information about the hypothesis.  See its
class documentation for the details.

=head2 Experiments

The C<AI::Categorizer::Experiment> class helps you organize the
results of categorization experiments.  As you get lots of
categorization results (Hypotheses) back from the Learner, you can
feed these results to the Experiment class, along with the correct
answers.  When all results have been collected, you can get a report
on accuracy, precision, recall, F1, and so on, with both
micro-averaging and macro-averaging over categories.  We use the
C<Statistics::Contingency> module from CPAN to manage the
calculations. See the docs for C<AI::Categorizer::Experiment> for more
details.

=head1 METHODS

=over 4

lib/AI/Categorizer.pm

internally (the KnowledgeSet, Learner, Experiment, or Collection
classes), or any class that I<they> create.  This is managed by the
C<Class::Container> module, so see
L<its documentation|Class::Container> for the details of how this
works.

The specific parameters accepted here are:

=over 4

=item progress_file

A string that indicates a place where objects will be saved during
several of the methods of this class.  The default value is the string
C<save>, which means files like C<save-01-knowledge_set> will get
created.  The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.

=item verbose

If true, a few status messages will be printed during execution.

=item training_set

Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.

=item test_set

Specifies the C<path> parameter that will be used when creating a
Collection during the C<evaluate_test_set()> method.

=item data_root

A shortcut for setting the C<training_set>, C<test_set>, and

lib/AI/Categorizer.pm


=item knowledge_set()

Returns the KnowledgeSet object associated with this Categorizer.  If
C<read_training_set()> has not yet been called, the KnowledgeSet will
not yet be populated with any training data.

=item run_experiment()

Runs a complete experiment on the training and testing data, reporting
the results on C<STDOUT>.  Internally, this is just a shortcut for
calling the C<scan_features()>, C<read_training_set()>, C<train()>,
and C<evaluate_test_set()> methods, then printing the value of the
C<stats_table()> method.

=item scan_features()

Scans the Collection specified in the C<training_set> parameter to
determine the set of features (words) that will be considered when
training the Learner.  Internally, this calls the C<scan_features()>
method of the KnowledgeSet, then saves a list of the KnowledgeSet's
features for later use.

This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.

=item read_training_set()

Populates the KnowledgeSet with the data specified in the C<training_set>
parameter.  Internally, this calls the C<read()> method of the
KnowledgeSet.  Returns the KnowledgeSet.  Also saves the KnowledgeSet
object for later use.

=item train()

Calls the Learner's C<train()> method, passing it the KnowledgeSet
created during C<read_training_set()>.  Returns the Learner object.
Also saves the Learner object for later use.

=item evaluate_test_set()

Creates a Collection based on the value of the C<test_set> parameter,
and calls the Learner's C<categorize_collection()> method using this
Collection.  Returns the resultant Experiment object.  Also saves the
Experiment object for later use in the C<stats_table()> method.

=item stats_table()

Returns the value of the Experiment's (as created by
C<evaluate_test_set()>) C<stats_table()> method.  This is a string
that shows various statistics about the
accuracy/precision/recall/F1/etc. of the assignments made during
testing.

lib/AI/Categorizer.pm


=head1 REFERENCES

An excellent introduction to the academic field of Text Categorization
is Fabrizio Sebastiani's "Machine Learning in Automated Text
Categorization": ACM Computing Surveys, Vol. 34, No. 1, March 2002,
pp. 1-47.

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This distribution is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.  These terms apply to
every file in the distribution - if you have questions, please contact
the author.

=cut

lib/AI/Categorizer/Category.pm

		  default => [],
		  callbacks => { 'all are Document objects' => 
				 sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Document'), @_ },
			       },
		  public => 0,
		 },
  );

__PACKAGE__->contained_objects
  (
   features => {
		class => 'AI::Categorizer::FeatureVector',
		delayed => 1,
	       },
  );

my %REGISTRY = ();

sub new {
  my $self = shift()->SUPER::new(@_);
  $self->{documents} = new AI::Categorizer::ObjectSet( @{$self->{documents}} );

lib/AI/Categorizer/Category.pm

  return wantarray ? $d->members : $d->size;
}

sub contains_document {
  return $_[0]->{documents}->includes( $_[1] );
}

sub add_document {
  my $self = shift;
  $self->{documents}->insert( $_[0] );
  delete $self->{features};  # Could be more efficient?
}

sub features {
  my $self = shift;

  if (@_) {
    $self->{features} = shift;
  }
  return $self->{features} if $self->{features};

  my $v = $self->create_delayed_object('features');
  return $self->{features} = $v unless $self->documents;

  foreach my $document ($self->documents) {
    $v->add( $document->features );
  }
  
  return $self->{features} = $v;
}

1;
__END__

=head1 NAME

AI::Categorizer::Category - A named category of documents

=head1 SYNOPSIS

  my $category = AI::Categorizer::Category->by_name("sports");
  my $name = $category->name;
  
  my @docs = $category->documents;
  my $num_docs = $category->documents;
  my $features = $category->features;
  
  $category->add_document($doc);
  if ($category->contains_document($doc)) { ...

=head1 DESCRIPTION

This simple class represents a named category which may contain zero
or more documents.  Each category is a "singleton" by name, so two
distinct Category objects with the same name should never exist at the
same time.

=head1 METHODS

=over 4

=item new()

Creates a new Category object and returns it.  Accepts the following

lib/AI/Categorizer/Category.pm

=item by_name(name => $string)

Returns the Category object with the given name, or creates one if no
such object exists.

=item documents()

Returns a list of the Document objects in this category in a list
context, or the number of such objects in a scalar context.

=item features()

Returns a FeatureVector object representing the sum of all the
FeatureVectors of the Documents in this Category.

=item add_document($document)

Informs the Category that the given Document belongs to it.

=item contains_document($document)

Returns true if the given document belongs to this category, or false
otherwise.

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3), Storable(3)

=cut

lib/AI/Categorizer/Collection.pm


Returns the next Document object in the Collection.

=item rewind()

Resets the iterator for further calls to C<next()>.

=item count_documents()

Returns the total number of documents in the Collection.  Note that
this usually resets the iterator.  This is because it may not be
possible to resume iterating where we left off.

=back
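
Putting these together, a full pass over a collection might look like
this sketch (count_documents() usually resets the iterator, so we
rewind defensively before iterating):

  my $n = $collection->count_documents;
  $collection->rewind;
  while (my $doc = $collection->next) {
    # ... process each AI::Categorizer::Document ...
  }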

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2002-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3), Storable(3)

=cut

lib/AI/Categorizer/Collection/DBI.pm

  if ($self->{sth}{Active}) {
    $self->{sth}->finish;
  }

  $self->{sth}->execute;
}

sub next {
  my $self = shift;

  my @result = $self->{sth}->fetchrow_array;
  return undef unless @result;
  
  return $self->create_delayed_object('document',
				      name => $result[0],
				      categories => [$result[1]],
				      content => $result[2],
				     );
}

1;

lib/AI/Categorizer/Collection/Files.pm

=back

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2002-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer::Collection(3)

=cut

lib/AI/Categorizer/Document.pm

		 default => undef,
		},
   parse => {
	     type => SCALAR,
	     optional => 1,
	    },
   parse_handle => {
		    type => HANDLE,
		    optional => 1,
		   },
   features => {
		isa => 'AI::Categorizer::FeatureVector',
		optional => 1,
	       },
   content_weights => {
		       type => HASHREF,
		       default => {},
		      },
   front_bias => {
		  type => SCALAR,
		  default => 0,
		  },
   use_features => {
		    type => HASHREF|UNDEF,
		    default => undef,
		   },
   stemming => {
		type => SCALAR|UNDEF,
		optional => 1,
	       },
   stopword_behavior => {
			 type => SCALAR,
			 default => "stem",
			},
  );

__PACKAGE__->contained_objects
  (
   features => { delayed => 1,
		 class => 'AI::Categorizer::FeatureVector' },
  );

### Constructors

my $NAME = 'a';

sub new {
  my $pkg = shift;
  my $self = $pkg->SUPER::new(name => $NAME++,  # Use a default name
			      @_);

  # Get efficient internal data structures
  $self->{categories} = new AI::Categorizer::ObjectSet( @{$self->{categories}} );

  $self->_fix_stopwords;
  
  # A few different ways for the caller to initialize the content
  if (exists $self->{parse}) {
    $self->parse(content => delete $self->{parse});
    
  } elsif (exists $self->{parse_handle}) {
    $self->parse_handle(handle => delete $self->{parse_handle});

lib/AI/Categorizer/Document.pm

  # This flag is attached to the stopword structure itself so that
  # other documents will notice it.
  $s->{___stemmed} = 1;
}

sub finish {
  my $self = shift;
  $self->create_feature_vector;
  
  # Now we're done with all the content stuff
  delete @{$self}{'content', 'content_weights', 'stopwords', 'use_features'};
}


# Parse a document format - a virtual method
sub parse;

sub parse_handle {
  my ($self, %args) = @_;
  my $fh = $args{handle} or die "No 'handle' argument given to parse_handle()";
  return $self->parse( content => join '', <$fh> );
}

### Accessors

sub name { $_[0]->{name} }
sub stopword_behavior { $_[0]->{stopword_behavior} }

sub features {
  my $self = shift;
  if (@_) {
    $self->{features} = shift;
  }
  return $self->{features};
}

sub categories {
  my $c = $_[0]->{categories};
  return wantarray ? $c->members : $c->size;
}


### Workers

sub create_feature_vector {
  my $self = shift;
  my $content = $self->{content};
  my $weights = $self->{content_weights};

  die "'stopword_behavior' must be one of 'stem', 'no_stem', or 'pre_stemmed'"
    unless $self->{stopword_behavior} =~ /^(?:stem|no_stem|pre_stemmed)$/;

  $self->{features} = $self->create_delayed_object('features');
  while (my ($name, $data) = each %$content) {
    my $t = $self->tokenize($data);
    $t = $self->_filter_tokens($t) if $self->{stopword_behavior} eq 'no_stem';
    $self->stem_words($t);
    $t = $self->_filter_tokens($t) if $self->{stopword_behavior} =~ /^(?:stem|pre_stemmed)$/;
    my $h = $self->vectorize(tokens => $t, weight => exists($weights->{$name}) ? $weights->{$name} : 1 );
    $self->{features}->add($h);
  }
}

sub is_in_category {
  return (ref $_[1]
	  ? $_[0]->{categories}->includes( $_[1] )
	  : $_[0]->{categories}->includes_name( $_[1] ));
    
}

lib/AI/Categorizer/Document.pm

}

sub stem_words {
  my ($self, $tokens) = @_;
  return unless $self->{stemming};
  return if $self->{stemming} eq 'none';
  die "Unknown stemming option '$self->{stemming}' - options are 'porter' or 'none'"
    unless $self->{stemming} eq 'porter';
  
  eval {require Lingua::Stem; 1}
    or die "Porter stemming requires the Lingua::Stem module, available from CPAN.\n";

  @$tokens = @{ Lingua::Stem::stem(@$tokens) };
}

sub _filter_tokens {
  my ($self, $tokens_in) = @_;

  if ($self->{use_features}) {
    my $f = $self->{use_features}->as_hash;
    return [ grep  exists($f->{$_}), @$tokens_in ];
  } elsif ($self->{stopwords} and keys %{$self->{stopwords}}) {
    my $s = $self->{stopwords};
    return [ grep !exists($s->{$_}), @$tokens_in ];
  }
  return $tokens_in;
}

sub _weigh_tokens {
  my ($self, $tokens, $weight) = @_;

lib/AI/Categorizer/Document.pm

  my $self = $class->new(%args);
  
  open my $fh, '<', $path or die "$path: $!";
  $self->parse_handle(handle => $fh);
  close $fh;
  
  $self->finish;
  return $self;
}

sub dump_features {
  my ($self, %args) = @_;
  my $path = $args{path} or die "No 'path' argument given to dump_features()";
  open my $fh, '>', $path or die "Can't create $path: $!";
  my $f = $self->features->as_hash;
  while (my ($k, $v) = each %$f) {
    print $fh "$k\t$v\n";
  }
}

1;

__END__

=head1 NAME

lib/AI/Categorizer/Document.pm

 # Other parameters are accepted:
 my $d = new AI::Categorizer::Document(name => $string,
                                       categories => \@category_objects,
                                       content => { subject => $string,
                                                    body => $string2, ... },
                                       content_weights => { subject => 3,
                                                            body => 1, ... },
                                       stopwords => \%skip_these_words,
                                       stemming => $string,
                                       front_bias => $float,
                                       use_features => $feature_vector,
                                      );
 
 # Specify explicit feature vector:
 my $d = new AI::Categorizer::Document(name => $string);
 $d->features( $feature_vector );
 
 # Now pass the document to a categorization algorithm:
 my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
 my $hypothesis = $learner->categorize($document);

=head1 DESCRIPTION

The Document class embodies the data in a single document, and
contains methods for turning this data into a FeatureVector.  Usually
documents are plain text, but subclasses of the Document class may
handle any kind of data.

=head1 METHODS

lib/AI/Categorizer/Document.pm

A string that identifies this document.  Required.

=item content

The raw content of this document.  May be specified as either a string
or as a hash reference, allowing structured document types.

=item content_weights

A hash reference indicating the weights that should be assigned to
features in different sections of a structured document when creating
its feature vector.  The weight is a multiplier of the feature vector
values.  For instance, if a C<subject> section has a weight of 3 and a
C<body> section has a weight of 1, and word counts are used as feature
vector values, then it will be as if all words appearing in the
C<subject> appeared 3 times.

If no weights are specified, all weights are set to 1.

=item front_bias

lib/AI/Categorizer/Document.pm

document.  Negative numbers indicate the opposite.  A bias of 0
indicates that no biasing should be done.

=item categories

A reference to an array of Category objects that this document belongs
to.  Optional.

=item stopwords

A list/hash of features (words) that should be ignored when parsing
document content.  A hash reference is preferred, with the features as
the keys.  If you pass an array reference containing the features, it
will be converted to a hash reference internally.

=item use_features

A Feature Vector specifying the only features that should be
considered when parsing this document.  This is an alternative to
using C<stopwords>.

=item stemming

Indicates the linguistic procedure that should be used to convert
tokens in the document to features.  Possible values are C<none>,
which indicates that the tokens should be used without change, or
C<porter>, indicating that the Porter stemming algorithm should be
applied to each token.  This requires the C<Lingua::Stem> module from
CPAN.

=item stopword_behavior

There are a few ways you might want the stopword list (specified with
the C<stopwords> parameter) to interact with the stemming algorithm
(specified with the C<stemming> parameter).  These options can be
controlled with the C<stopword_behavior> parameter, which can take the
following values:

lib/AI/Categorizer/Document.pm

Stem stopwords according to 'stemming' parameter, then match them
against stemmed document words.

=item pre_stemmed

Stopwords are already stemmed, match them against stemmed document
words.

=back

The default value is C<stem>, which seems to produce the best results
in most cases I've tried.  I'm not aware of any studies comparing the
C<no_stem> behavior to the C<stem> behavior in the general case.

This parameter has no effect if there are no stopwords being used, or
if stemming is not being used.  In the latter case, the list of
stopwords will always be matched as-is against the document words.

Note that if the C<stem> option is used, the data structure passed as
the C<stopwords> parameter will be modified in-place to contain the
stemmed versions of the stopwords supplied.
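
For example, here is a minimal sketch (the stopwords and document
content are hypothetical) that lets the Document stem an unstemmed
stopword list before matching:

 my %stop = map { $_ => 1 } qw(the running quickly);
 my $d = new AI::Categorizer::Document
   (name              => 'doc-1',
    content           => 'The runner was running quickly',
    stemming          => 'porter',
    stopwords         => \%stop,  # will be stemmed in place
    stopword_behavior => 'stem',  # the default
   );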

lib/AI/Categorizer/Document.pm  view on Meta::CPAN


=item parse( content =E<gt> $content )



=item name()

Returns this document's C<name> property as specified when the
document was created.

=item features()

Returns the Feature Vector associated with this document.

=item categories()

In a list context, returns a list of Category objects to which this
document belongs.  In a scalar context, returns the number of such
categories.

=item create_feature_vector()

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

# Input: a hash of element weights
# Output: an object of this class
# Description: the constructor
sub new {
  my ($class, %args) = @_;

  # Call the superclass constructor (XML::SAX::Base)
  my $self = $class->SUPER::new;

  # Save the element weights, a hash of <elementName, weight> pairs.
  # A weight is the number of times the content of the corresponding
  # element is duplicated.  The caller provides it as a constructor
  # parameter, and we save it here so end_element can do the duplication.
  $self->{weightHash} = $args{weights};

  # Storage for the data produced by Text, CDataSection, etc.
  $self->{content} = '';

  # This array stores the data for every element from the root down to
  # the element currently being visited, so only entries
  # 0..($levelPointer-1) are valid.  It stores the starting location
  # (index) of the content for each element,

lib/AI/Categorizer/Experiment.pm  view on Meta::CPAN

			  UNIVERSAL::isa($c, 'HASH') ? keys(%$c) : @$c
			};
  return $self;
}

sub add_hypothesis {
  my ($self, $h, $correct, $name) = @_;
  die "No hypothesis given to add_hypothesis()" unless $h;
  $name = $h->document_name unless defined $name;
  
  $self->add_result([$h->categories], $correct, $name);
}

sub stats_table {
  my $self = shift;
  $self->SUPER::stats_table($self->{sig_figs});
}

1;

__END__

=head1 NAME

AI::Categorizer::Experiment - Coordinate experimental results

=head1 SYNOPSIS

 use AI::Categorizer::Experiment;
 my $e = new AI::Categorizer::Experiment(categories => \%categories);
 my $l = AI::Categorizer::Learner->restore_state(...path...);
 
 while (my $d = ... get document ...) {
   my $h = $l->categorize($d); # A Hypothesis
   $e->add_hypothesis($h, [map $_->name, $d->categories]);
 }
 
 print "Micro F1: ", $e->micro_F1, "\n"; # Access a single statistic
 print $e->stats_table; # Show several stats in table form

=head1 DESCRIPTION

The C<AI::Categorizer::Experiment> class helps you organize the
results of categorization experiments.  As you get lots of
categorization results (Hypotheses) back from the Learner, you can
feed these results to the Experiment class, along with the correct
answers.  When all results have been collected, you can get a report
on accuracy, precision, recall, F1, and so on, with both
macro-averaging and micro-averaging over categories.

=head1 METHODS

The general execution flow when using this class is to create an
Experiment object, add a bunch of Hypotheses to it, and then report on
the results.

Internally, C<AI::Categorizer::Experiment> inherits from
C<Statistics::Contingency>.  Please see the documentation of
C<Statistics::Contingency> for a description of its interface.  All of
its methods are available here, with the following additions:

=over 4

=item new( categories => \%categories )

lib/AI/Categorizer/Experiment.pm  view on Meta::CPAN


Returns a new Experiment object.  A required C<categories> parameter
specifies the names of all categories in the data set.  The category
names may be specified either as the keys of a hash reference or as
the entries of an array reference.

The C<new()> method also accepts a C<verbose> parameter; when it is
set to a true value, some status/debugging information will be printed
to C<STDOUT>.

A C<sig_figs> parameter indicates the number of significant figures
that should be used when showing the results in the C<stats_table()>
method.  It does not affect other methods like C<micro_precision()>.
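
For example, a minimal construction sketch (the category names here
are hypothetical):

 my $e = new AI::Categorizer::Experiment
   (categories => [qw(sports finance)], # or {sports => 1, finance => 1}
    verbose    => 1,   # print status/debugging info to STDOUT
    sig_figs   => 4,   # significant figures for stats_table()
   );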

=item add_result($assigned, $correct, $name)

Adds a new result to the experiment.  Please see the
C<Statistics::Contingency> documentation for a description of this
method.

=item add_hypothesis($hypothesis, $correct_categories)

Adds a new result to the experiment.  The first argument is an
C<AI::Categorizer::Hypothesis> object, such as one generated by a
Learner's C<categorize()> method.  The list of correct categories can
be given as an array of category names (strings), as a hash whose keys
are the category names and whose values are anything logically true,
or as a single string if there is only one category.  For example, all
of the following are legal:

 $e->add_hypothesis($h, "sports");
 $e->add_hypothesis($h, ["sports", "finance"]);
 $e->add_hypothesis($h, {sports => 1, finance => 1});

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN

use Class::Container;
use base qw(Class::Container);

use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Util;
use Carp qw(croak);

__PACKAGE__->valid_params
  (
   features_kept => {
		     type => SCALAR,
		     default => 0.2,
		    },
   verbose => {
	       type => SCALAR,
	       default => 0,
	      },
  );

sub verbose {
  my $self = shift;
  $self->{verbose} = shift if @_;
  return $self->{verbose};
}

sub reduce_features {
  # Takes a feature vector whose weights are "feature scores", and
  # chops to the highest n features.  n is specified by the
  # 'features_kept' parameter.  If it's zero, all features are kept.
  # If it's between 0 and 1, we multiply by the present number of
  # features.  If it's greater than 1, we treat it as the number of
  # features to use.

  my ($self, $f, %args) = @_;
  my $kept = defined $args{features_kept} ? $args{features_kept} : $self->{features_kept};
  return $f unless $kept;

  my $num_kept = ($kept < 1 ? 
		  $f->length * $kept :
		  $kept);

  print "Trimming features - # features = " . $f->length . "\n" if $self->verbose;
  
  # This is algorithmic overkill, but the sort seems fast enough.  Will revisit later.
  my $features = $f->as_hash;
  my @new_features = (sort {$features->{$b} <=> $features->{$a}} keys %$features)
                      [0 .. $num_kept-1];

  my $result = $f->intersection( \@new_features );
  print "Finished trimming features - # features = " . $result->length . "\n" if $self->verbose;
  return $result;
}

# Abstract methods
sub rank_features;
sub scan_features;

sub select_features {
  my ($self, %args) = @_;
  
  die "No knowledge_set parameter provided to select_features()"
    unless $args{knowledge_set};

  my $f = $self->rank_features( knowledge_set => $args{knowledge_set} );
  return $self->reduce_features( $f, features_kept => $args{features_kept} );
}


1;

__END__

=head1 NAME

AI::Categorizer::FeatureSelector - Abstract Feature Selection class

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN


=item new()

Creates a new KnowledgeSet and returns it.  Accepts the following
parameters:

=over 4

=item load

If a C<load> parameter is present, the C<load()> method will be
invoked immediately.  If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>.  If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.

=item categories

An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet.  If used, the
C<documents> parameter should also be specified.

=item documents

An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet.  If used, the
C<categories> parameter should also be specified.

=item features_kept

A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents.  May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used.  The default is
0.2.
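
For example (a sketch; the numbers are arbitrary):

 my $k1 = new AI::Categorizer::KnowledgeSet(features_kept => 2000); # top 2000
 my $k2 = new AI::Categorizer::KnowledgeSet(features_kept => 0.2);  # top 20%
 my $k3 = new AI::Categorizer::KnowledgeSet(features_kept => 0);    # keep all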

=item feature_selection

A string indicating the type of feature selection that should be
performed.  Currently the only option is also the default option:
C<document_frequency>.

=item tfidf_weighting

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN

be multiplied for each feature to find the final vector value for that
feature.  The default weighting is C<xxx>.

The first character specifies the "term frequency" component, which
can take the following values:

=over 4

=item b

Binary weighting - 1 for terms present in a document, 0 for terms absent.

=item t

Raw term frequency - equal to the number of times a feature occurs in
the document.

=item x

A synonym for 't'.

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN

Apply cosine normalization - multiply by 1/length(document_vector).

=item x

No change - multiply by 1.

=back

The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.

=item verbose

If set to a true value, some status/debugging information will be
output on C<STDOUT>.

=back


=item categories()

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN

=item documents()

In a list context returns a list of all Document objects in this
KnowledgeSet.  In a scalar context returns the number of such objects.

=item document()

Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.

=item features()

Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.

=item verbose()

Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.

=item scan_stats()

Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)

=item scan_features()

This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner.  This process is known as "feature selection", and it's a
very important part of categorization.

The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.

The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.

This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies.  Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.

=item save_features()

Given the name of a file, this method writes the features (as
determined by the C<scan_features> method) to the file.

=item restore_features()

Given the name of a file written by C<save_features>, loads the
features from that file and passes them as the C<use_features>
parameter for any Document objects created in the future by this
KnowledgeSet.

=item read()

Iterates through a Collection of documents and adds them to the
KnowledgeSet.  The Collection can be specified using a C<collection>
parameter - otherwise, specify the arguments to pass to the C<new()>
method of the Collection class.

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN

categories I<by name>.  These are the categories that the document
belongs to.  Any other parameters will be passed to the Document
class's C<new()> method.

=item finish()

This method will be called prior to training the Learner.  Its purpose
is to perform any operations (such as feature vector weighting) that
may require examination of the entire KnowledgeSet.

=item weigh_features()

This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.

=item document_frequency()

Given a single feature (word) as an argument, this method will return
the number of documents in the KnowledgeSet that contain that feature.

=item partition()

Divides the KnowledgeSet into several subsets.  This may be useful for
performing cross-validation.  The relative sizes of the subsets should

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN

partitions will be returned as a list.

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3)

=cut

lib/AI/Categorizer/FeatureSelector/CategorySelector.pm  view on Meta::CPAN

package AI::Categorizer::FeatureSelector::CategorySelector;

use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector);

use Params::Validate qw(:types);

__PACKAGE__->contained_objects
  (
   features => { class => 'AI::Categorizer::FeatureVector',
		 delayed => 1 },
  );

1;


sub reduction_function;

# figure out the feature set before reading collection (default)

sub scan_features {
  my ($self, %args) = @_;
  my $c = $args{collection} or 
    die "No 'collection' parameter provided to scan_features()";

  return unless $self->{features_kept};

  my %cat_features;
  my $coll_features = $self->create_delayed_object('features');
  my $nbDocuments = 0;

  while (my $doc = $c->next) {
    $nbDocuments++;
    $args{prog_bar}->() if $args{prog_bar};
    my $docfeatures = $doc->features->as_hash;
    foreach my $cat ($doc->categories) {
      my $catname = $cat->name;
      if(!(exists $cat_features{$catname})) {
        $cat_features{$catname} = $self->create_delayed_object('features');
      }
      $cat_features{$catname}->add($docfeatures);
    }
    $coll_features->add( $docfeatures );
  }
  print STDERR "\n* Computing Chi-Square values\n" if $self->verbose;

  my $r_features = $self->create_delayed_object('features');
  my @terms = $coll_features->names;
  my $progressBar = $self->prog_bar(scalar @terms);
  my $allFeaturesSum = $coll_features->sum;
  my %cat_features_sum;
  while( my($catname,$features) = each %cat_features ) {
    $cat_features_sum{$catname} = $features->sum;
  }

  foreach my $term (@terms) {
    $progressBar->();
    $r_features->{features}{$term} = $self->reduction_function($term,
      $nbDocuments,$allFeaturesSum,$coll_features,
      \%cat_features,\%cat_features_sum);
  }
  print STDERR "\n" if $self->verbose;
  my $new_features = $self->reduce_features($r_features);
  return $coll_features->intersection( $new_features );
}


# calculate feature set after reading collection (scan_first=0)

sub rank_features {
  die "CategorySelector->rank_features is not implemented yet!";
#  my ($self, %args) = @_;
#  
#  my $k = $args{knowledge_set} 
#    or die "No knowledge_set parameter provided to rank_features()";
#
#  my %freq_counts;
#  foreach my $name ($k->features->names) {
#    $freq_counts{$name} = $k->document_frequency($name);
#  }
#  return $self->create_delayed_object('features', features => \%freq_counts);
}


# copied from KnowledgeSet->prog_bar by Ken Williams

sub prog_bar {
  my ($self, $count) = @_;

  return sub {} unless $self->verbose;
  return sub { print STDERR '.' } unless eval "use Time::Progress; 1";

  my $pb = 'Time::Progress'->new;
  $pb->attr(max => $count);
  my $i = 0;
  return sub {
    $i++;
    return if $i % 25;
    print STDERR $pb->report("%50b %p ($i/$count)\r", $i);
  };
}


lib/AI/Categorizer/FeatureSelector/CategorySelector.pm  view on Meta::CPAN


AI::Categorizer::FeatureSelector::CategorySelector - Abstract Category Selection class

=head1 SYNOPSIS

This class is abstract.  For an example of instantiation, see
ChiSquare.

=head1 DESCRIPTION

A base class for FeatureSelectors that compute their global feature
ranking from per-category feature sets.

=head1 METHODS

=head1 AUTHOR

Francois Paradis, paradifr@iro.umontreal.ca,
with inspiration from Ken Williams' AI::Categorizer code

=cut

lib/AI/Categorizer/FeatureSelector/ChiSquare.pm  view on Meta::CPAN

use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector::CategorySelector);

use Params::Validate qw(:types);

# Chi-Square function
# NB: this could probably be optimised a bit...

sub reduction_function {
  my ($self,$term,$N,$allFeaturesSum,
      $coll_features,$cat_features,$cat_features_sum) = @_;
  my $CHI2SUM = 0;
  my $nbcats = 0;
  foreach my $catname (keys %{$cat_features}) {
#  while ( my ($catname,$catfeatures) = each %{$cat_features}) {
    my ($A,$B,$C,$D); # A = number of times where t and c co-occur
                      # B =   "     "   "   t occurs without c
                      # C =   "     "   "   c occurs without t
                      # D =   "     "   "   neither c nor t occur
    $A = $cat_features->{$catname}->value($term);
    $B = $coll_features->value($term) - $A;
    $C = $cat_features_sum->{$catname} - $A;
    $D = $allFeaturesSum - ($A+$B+$C);
    my $ADminCB = ($A*$D)-($C*$B);
    my $CHI2 = $N*$ADminCB*$ADminCB / (($A+$C)*($B+$D)*($A+$B)*($C+$D));
    $CHI2SUM += $CHI2;
    $nbcats++;
  }
  return $CHI2SUM/$nbcats;
}

1;

lib/AI/Categorizer/FeatureSelector/ChiSquare.pm  view on Meta::CPAN

 use AI::Categorizer::KnowledgeSetSMART;
 my $ksetCHI = new AI::Categorizer::KnowledgeSetSMART(
   tfidf_notation =>'Categorizer',
   feature_selection=>'chi_square', ...other parameters...); 

 # however it is also possible to pass an instance to the KnowledgeSet

 use AI::Categorizer::KnowledgeSet;
 use AI::Categorizer::FeatureSelector::ChiSquare;
 my $ksetCHI = new AI::Categorizer::KnowledgeSet(
   feature_selector => new AI::Categorizer::FeatureSelector::ChiSquare
                         (features_kept => 2000, verbose => 1),
   ...other parameters...
   );

=head1 DESCRIPTION

Feature selection with the ChiSquare function.

                           N.(AD-CB)^2
  Chi-Square(t,ci) = -----------------------
                     (A+C).(B+D).(A+B).(C+D)

where, for term t and category ci (as in the C<reduction_function>
code above): A is the number of times t and ci co-occur, B the number
of times t occurs without ci, C the number of times ci occurs without
t, D the number of times neither occurs, and N the total number of
documents.

lib/AI/Categorizer/FeatureSelector/DocFrequency.pm  view on Meta::CPAN


use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector);

use Params::Validate qw(:types);
use Carp qw(croak);

__PACKAGE__->contained_objects
  (
   features => { class => 'AI::Categorizer::FeatureVector',
		 delayed => 1 },
  );

# The KnowledgeSet keeps track of document frequency, so just use that.
sub rank_features {
  my ($self, %args) = @_;
  
  my $k = $args{knowledge_set} or die "No knowledge_set parameter provided to rank_features()";
  
  my %freq_counts;
  foreach my $name ($k->features->names) {
    $freq_counts{$name} = $k->document_frequency($name);
  }
  return $self->create_delayed_object('features', features => \%freq_counts);
}

sub scan_features {
  my ($self, %args) = @_;
  my $c = $args{collection} or die "No 'collection' parameter provided to scan_features()";

  my $doc_freq = $self->create_delayed_object('features');
  
  while (my $doc = $c->next) {
    $args{prog_bar}->() if $args{prog_bar};
    $doc_freq->add( $doc->features->as_boolean_hash );
  }
  print "\n" if $self->verbose;
  
  return $self->reduce_features($doc_freq);
}

1;

__END__

=head1 NAME

AI::Categorizer::FeatureSelector::DocFrequency - Feature selection by document frequency

lib/AI/Categorizer/FeatureVector.pm  view on Meta::CPAN

package AI::Categorizer::FeatureVector;

sub new {
  my ($package, %args) = @_;
  $args{features} ||= {};
  return bless {features => $args{features}}, $package;
}

sub names {
  my $self = shift;
  return keys %{$self->{features}};
}

sub set {
  my $self = shift;
  $self->{features} = (ref $_[0] ? $_[0] : {@_});
}

sub as_hash {
  my $self = shift;
  return $self->{features};
}

sub euclidean_length {
  my $self = shift;
  my $f = $self->{features};

  my $total = 0;
  foreach (values %$f) {
    $total += $_**2;
  }
  return sqrt($total);
}

sub normalize {
  my $self = shift;

  my $length = $self->euclidean_length;
  return $length ? $self->scale(1/$length) : $self;
}

sub scale {
  my ($self, $scalar) = @_;
  $_ *= $scalar foreach values %{$self->{features}};
  return $self;
}

sub as_boolean_hash {
  my $self = shift;
  return { map {($_ => 1)} keys %{$self->{features}} };
}

sub length {
  my $self = shift;
  return scalar keys %{$self->{features}};
}

sub clone {
  my $self = shift;
  return ref($self)->new( features => { %{$self->{features}} } );
}

sub intersection {
  my ($self, $other) = @_;
  $other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);

  my $common;
  if (UNIVERSAL::isa($other, 'ARRAY')) {
    $common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} @$other};
  } elsif (UNIVERSAL::isa($other, 'HASH')) {
    $common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} keys %$other};
  }
  return ref($self)->new( features => $common );
}

sub add {
  my ($self, $other) = @_;

  $other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
  while (my ($k,$v) = each %$other) {
    $self->{features}{$k} += $v;
  }
}

sub dot {
  my ($self, $other) = @_;
  $other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);

  my $sum = 0;
  my $f = $self->{features};
  while (my ($k, $v) = each %$f) {
    $sum += $other->{$k} * $v if exists $other->{$k};
  }
  return $sum;
}

sub sum {
  my ($self) = @_;

  # Return total of values in this vector
  my $total = 0;
  $total += $_ foreach values %{ $self->{features} };
  return $total;
}

sub includes {
  return exists $_[0]->{features}{$_[1]};
}

sub value {
  return $_[0]->{features}{$_[1]};
}

sub values {
  my $self = shift;
  return @{ $self->{features} }{ @_ };
}

1;
__END__

=head1 NAME

AI::Categorizer::FeatureVector - Features vs. Values

=head1 SYNOPSIS

  my $f1 = new AI::Categorizer::FeatureVector
    (features => {howdy => 2, doody => 3});
  my $f2 = new AI::Categorizer::FeatureVector
    (features => {doody => 1, whopper => 2});
   
  @names = $f1->names;
  $x = $f1->length;
  $x = $f1->sum;
  $x = $f1->includes('howdy');
  $x = $f1->value('howdy');
  $x = $f1->dot($f2);
  
  $f3 = $f1->clone;
  $f3 = $f1->intersection($f2);
  $f1->add($f2);   # modifies $f1 in place
  
  $h = $f1->as_hash;
  $h = $f1->as_boolean_hash;
  
  $f1->normalize;

=head1 DESCRIPTION

This class implements a "feature vector", which is a flat data
structure indicating the values associated with a set of features.  At
its base level, a FeatureVector usually represents the set of words in
a document, with the value for each feature indicating the number of
times each word appears in the document.  However, the values are
arbitrary so they can represent other quantities as well, and
FeatureVectors may also be combined to represent the features of
multiple documents.
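
For instance, here is a small worked sketch using the constructor and
methods shown in the synopsis:

  my $f1 = new AI::Categorizer::FeatureVector
    (features => {howdy => 2, doody => 3});
  my $f2 = new AI::Categorizer::FeatureVector
    (features => {doody => 1, whopper => 2});
  
  print $f1->length;    # 2 - number of distinct features in $f1
  print $f1->sum;       # 5 - total of the feature values in $f1
  print $f1->dot($f2);  # 3 - 'doody' is the only shared feature (3*1)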

=head1 METHODS

=over 4

=item ...

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3), Storable(3)

=cut

lib/AI/Categorizer/Hypothesis.pm  view on Meta::CPAN


use strict;

use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);

__PACKAGE__->valid_params
  (
   all_categories => {type => ARRAYREF},
   scores => {type => HASHREF},
   threshold => {type => SCALAR},
   document_name => {type => SCALAR, optional => 1},
  );

sub all_categories { @{$_[0]->{all_categories}} }
sub document_name  { $_[0]->{document_name} }
sub threshold      { $_[0]->{threshold} }

sub best_category {
  my ($self) = @_;
  my $sc = $self->{scores};
  return unless %$sc;

  my ($best_cat, $best_score) = each %$sc;
  while (my ($key, $val) = each %$sc) {
    ($best_cat, $best_score) = ($key, $val) if $val > $best_score;
  }
  return $best_cat;
}

sub in_category {
  my ($self, $cat) = @_;
  return '' unless exists $self->{scores}{$cat};
  # Use >= so the result agrees with categories() and the documented
  # threshold semantics ("greater than or equal to").
  return $self->{scores}{$cat} >= $self->{threshold};
}

sub categories {
  my $self = shift;
  return @{$self->{cats}} if $self->{cats};
  $self->{cats} = [sort {$self->{scores}{$b} <=> $self->{scores}{$a}}
                   grep {$self->{scores}{$_} >= $self->{threshold}}
                   keys %{$self->{scores}}];
  return @{$self->{cats}};
}

sub scores {
  my $self = shift;
  return @{$self->{scores}}{@_};
}

1;

__END__

=head1 NAME

AI::Categorizer::Hypothesis - Embodies a set of category assignments

=head1 SYNOPSIS

 use AI::Categorizer::Hypothesis;
 
 # Hypotheses are usually created by the Learner's categorize() method.
 # (assume here that $learner and $document have been created elsewhere)
 my $h = $learner->categorize($document);
 
 print "Assigned categories: ", join ', ', $h->categories, "\n";
 print "Best category: ", $h->best_category, "\n";
 print "Assigned scores: ", join ', ', $h->scores( $h->categories ), "\n";
 print "Chosen from: ", join ', ', $h->all_categories, "\n";
 print +($h->in_category('geometry') ? '' : 'not '), "assigned to geometry\n";

=head1 DESCRIPTION

A Hypothesis embodies a set of category assignments that a categorizer
makes about a single document.  Because one may be interested in
knowing different kinds of things about the assignments (for instance,
what categories were assigned, which category had the highest score,
whether a particular category was assigned), we provide a simple class
to help facilitate these scenarios.

=head1 METHODS

=over 4

=item new(%parameters)

lib/AI/Categorizer/Hypothesis.pm  view on Meta::CPAN

The following parameters are accepted when creating a new Hypothesis:

=over 4

=item all_categories

A required parameter which gives the set of all categories that could
possibly be assigned to.  The categories should be specified as a
reference to an array of category names (as strings).

=item scores

A hash reference indicating the assignment score for each category.
Any score greater than or equal to the C<threshold> will be considered
assigned.

=item threshold

A number controlling which categories should be assigned - any
category whose score is greater than or equal to C<threshold> will be
assigned, any category whose score is lower than C<threshold> will not
be assigned.

=item document_name

An optional string parameter indicating the name of the document about
which this hypothesis was made.

=back


=item categories()

Returns an ordered list of the categories the document was placed in,
with best matches first.  Categories are returned by their string names.

=item best_category()

Returns the name of the category with the highest score in this
hypothesis.  Bear in mind that this category may not actually be
assigned if no categories' scores exceed the threshold.

=item in_category($name)

Returns true or false depending on whether the document was placed in
the given category.

=item scores(@names)

Returns a list of result scores for the given categories.  Since the
interface is still changing, and since different Learners implement
scoring in different ways, not very much can officially be said
about the scores, except that a good score is higher than a bad
score.  Individual Learners will have their own procedures for
determining scores, so you cannot compare one Learner's score with
another Learner's - for instance, one Learner might always give scores
between 0 and 1, and another Learner might always return scores less
than 0.  You often cannot compare scores from a single Learner on two
different categorization tasks either.

=item all_categories()

Returns the list of category names specified with the
C<all_categories> constructor parameter.

=item document_name()

Returns the value of the C<document_name> parameter specified as a

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

  );

__PACKAGE__->contained_objects
  (
   document => { delayed => 1,
		 class => 'AI::Categorizer::Document' },
   category => { delayed => 1,
		 class => 'AI::Categorizer::Category' },
   collection => { delayed => 1,
		   class => 'AI::Categorizer::Collection::Files' },
   features => { delayed => 1,
		 class => 'AI::Categorizer::FeatureVector' },
   feature_selector => 'AI::Categorizer::FeatureSelector::DocFrequency',
  );

sub new {
  my ($pkg, %args) = @_;
  
  # Shortcuts
  if ($args{tfidf_weighting}) {
    @args{'term_weighting', 'collection_weighting', 'normalize_weighting'} = split '', $args{tfidf_weighting};

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

  $self->{documents}  = new AI::Categorizer::ObjectSet( @{$self->{documents}}  );

  if ($self->{load}) {
    my $args = ref($self->{load}) ? $self->{load} : { path => $self->{load} };
    $self->load(%$args);
    delete $self->{load};
  }
  return $self;
}

sub features {
  my $self = shift;

  if (@_) {
    $self->{features} = shift;
    $self->trim_doc_features if $self->{features};
  }
  return $self->{features} if $self->{features};

  # Create a feature vector encompassing the whole set of documents
  my $v = $self->create_delayed_object('features');
  foreach my $document ($self->documents) {
    $v->add( $document->features );
  }
  return $self->{features} = $v;
}

sub categories {
  my $c = $_[0]->{categories};
  return wantarray ? $c->members : $c->size;
}

sub documents {
  my $d = $_[0]->{documents};
  return wantarray ? $d->members : $d->size;

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN


sub feature_selector { $_[0]->{feature_selector} }
sub scan_first       { $_[0]->{scan_first} }

sub verbose {
  my $self = shift;
  $self->{verbose} = shift if @_;
  return $self->{verbose};
}

sub trim_doc_features {
  my ($self) = @_;
  
  foreach my $doc ($self->documents) {
    $doc->features( $doc->features->intersection($self->features) );
  }
}


sub prog_bar {
  my ($self, $collection) = @_;

  return sub {} unless $self->verbose;
  return sub { print STDERR '.' } unless eval "use Time::Progress; 1";

  my $count = $collection->can('count_documents') ? $collection->count_documents : 0;
  
  my $pb = 'Time::Progress'->new;
  $pb->attr(max => $count);
  my $i = 0;
  return sub {
    $i++;
    return if $i % 25;
    print STDERR $pb->report("%50b %p ($i/$count)\r", $i);
  };
}

# A little utility method for several other methods like scan_stats(),

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

  my $collection = $self->_make_collection(\%args);
  my $pb = $self->prog_bar($collection);

  my %stats;


  while (my $doc = $collection->next) {
    $pb->();
    $stats{category_count_with_duplicates} += $doc->categories;

    my ($sum, $length) = ($doc->features->sum, $doc->features->length);
    $stats{document_count}++;
    $stats{token_count} += $sum;
    $stats{type_count}  += $length;
    
    foreach my $cat ($doc->categories) {
#warn $doc->name, ": ", $cat->name, "\n";
      $stats{categories}{$cat->name}{document_count}++;
      $stats{categories}{$cat->name}{token_count} += $sum;
      $stats{categories}{$cat->name}{type_count}  += $length;
    }

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

    $stats{"${thing}_skew_by_category"} = sqrt($ssum/@cats) / $stats{"${thing}s_per_category"};
  }

  return \%stats;
}

sub load {
  my ($self, %args) = @_;
  my $c = $self->_make_collection(\%args);

  if ($self->{features_kept}) {
    # Read the whole thing in, then reduce
    $self->read( collection => $c );
    $self->select_features;

  } elsif ($self->{scan_first}) {
    # Figure out the feature set first, then read data in
    $self->scan_features( collection => $c );
    $c->rewind;
    $self->read( collection => $c );

  } else {
    # Don't do any feature reduction, just read the data
    $self->read( collection => $c );
  }
}

sub read {

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

  while (my $doc = $collection->next) {
    $pb->();
    $self->add_document($doc);
  }
  print "\n" if $self->verbose;
}

sub finish {
  my $self = shift;
  return if $self->{finished}++;
  $self->weigh_features;
}

sub weigh_features {
  # This could be made more efficient by figuring out an execution
  # plan in advance

  my $self = shift;
  
  if ( $self->{term_weighting} =~ /^(t|x)$/ ) {
    # Raw term frequency - nothing to do
  } elsif ( $self->{term_weighting} eq 'l' ) {
    # Logarithmic term frequency: tf -> 1 + log(tf)
    foreach my $doc ($self->documents) {
      my $f = $doc->features->as_hash;
      $_ = 1 + log($_) foreach values %$f;
    }
  } elsif ( $self->{term_weighting} eq 'n' ) {
    # Augmented normalized term frequency: tf -> 0.5 + 0.5*tf/max_tf
    foreach my $doc ($self->documents) {
      my $f = $doc->features->as_hash;
      my $max_tf = AI::Categorizer::Util::max values %$f;
      $_ = 0.5 + 0.5 * $_ / $max_tf foreach values %$f;
    }
  } elsif ( $self->{term_weighting} eq 'b' ) {
    # Binary weighting: 1 if the term is present, 0 otherwise
    foreach my $doc ($self->documents) {
      my $f = $doc->features->as_hash;
      $_ = $_ ? 1 : 0 foreach values %$f;
    }
  } else {
    die "term_weighting must be one of 'x', 't', 'l', 'b', or 'n'";
  }
  
  if ($self->{collection_weighting} eq 'x') {
    # No collection (IDF) component - nothing to do
  } elsif ($self->{collection_weighting} =~ /^(f|p)$/) {
    # 'f': standard IDF, log(N/df); 'p': probabilistic IDF, log(N/df - 1)
    my $subtrahend = ($1 eq 'f' ? 0 : 1);
    my $num_docs = $self->documents;
    $self->document_frequency('foo');  # Initialize the doc-frequency vector
    foreach my $doc ($self->documents) {
      my $f = $doc->features->as_hash;
      $f->{$_} *= log($num_docs / $self->{doc_freq_vector}{$_} - $subtrahend) foreach keys %$f;
    }
  } else {
    die "collection_weighting must be one of 'x', 'f', or 'p'";
  }

  if ( $self->{normalize_weighting} eq 'x' ) {
    # No normalization - nothing to do
  } elsif ( $self->{normalize_weighting} eq 'c' ) {
    # Cosine normalization: scale each document vector to unit length
    $_->features->normalize foreach $self->documents;
  } else {
    die "normalize_weighting must be one of 'x' or 'c'";
  }
}

sub document_frequency {
  my ($self, $term) = @_;
  
  unless (exists $self->{doc_freq_vector}) {
    die "No corpus has been scanned for features" unless $self->documents;

    my $doc_freq = $self->create_delayed_object('features', features => {});
    foreach my $doc ($self->documents) {
      $doc_freq->add( $doc->features->as_boolean_hash );
    }
    $self->{doc_freq_vector} = $doc_freq->as_hash;
  }
  
  return exists $self->{doc_freq_vector}{$term} ? $self->{doc_freq_vector}{$term} : 0;
}

sub scan_features {
  my ($self, %args) = @_;
  my $c = $self->_make_collection(\%args);

  my $pb = $self->prog_bar($c);
  my $ranked_features = $self->{feature_selector}->scan_features( collection => $c, prog_bar => $pb );

  $self->delayed_object_params('document', use_features => $ranked_features);
  $self->delayed_object_params('collection', use_features => $ranked_features);
  return $ranked_features;
}

sub select_features {
  my $self = shift;
  
  my $f = $self->feature_selector->select_features(knowledge_set => $self);
  $self->features($f);
}

sub partition {
  my ($self, @sizes) = @_;
  my $num_docs = my @docs = $self->documents;
  my @groups;

  while (@sizes > 1) {
    my $size = int ($num_docs * shift @sizes);
    push @groups, [];

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

sub add_document {
  my ($self, $doc) = @_;

  foreach ($doc->categories) {
    $_->add_document($doc);
  }
  $self->{documents}->insert($doc);
  $self->{categories}->insert($doc->categories);
}

sub save_features {
  my ($self, $file) = @_;
  
  my $f = ($self->{features} || { $self->delayed_object_params('document') }->{use_features})
    or croak "No features to save";
  
  open my($fh), "> $file" or croak "Can't create $file: $!";
  my $h = $f->as_hash;
  print $fh "# Total: ", $f->length, "\n";
  
  foreach my $k (sort {$h->{$b} <=> $h->{$a}} keys %$h) {
    print $fh "$k\t$h->{$k}\n";
  }
  close $fh;
}

sub restore_features {
  my ($self, $file, $n) = @_;
  
  open my($fh), "< $file" or croak "Can't open $file: $!";

  my %hash;
  while (<$fh>) {
    next if /^#/;
    /^(.*)\t([\d.]+)$/ or croak "Malformed line: $_";
    $hash{$1} = $2;
    last if defined $n and $. >= $n;
  }
  my $features = $self->create_delayed_object('features', features => \%hash);
  
  $self->delayed_object_params('document',   use_features => $features);
  $self->delayed_object_params('collection', use_features => $features);
}

1;

__END__

=head1 NAME

AI::Categorizer::KnowledgeSet - Encapsulates set of documents

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN


=item new()

Creates a new KnowledgeSet and returns it.  Accepts the following
parameters:

=over 4

=item load

If a C<load> parameter is present, the C<load()> method will be
invoked immediately.  If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>.  If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.

=item categories

An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet.  If used, the
C<documents> parameter should also be specified.

=item documents

An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet.  If used, the
C<categories> parameter should also be specified.

=item features_kept

A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents.  May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used.  The default is
0.2.

=item feature_selection

A string indicating the type of feature selection that should be
performed.  Currently the only option is also the default option:
C<document_frequency>.

=item tfidf_weighting

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

be multiplied for each feature to find the final vector value for that
feature.  The default weighting is C<xxx>.

The first character specifies the "term frequency" component, which
can take the following values:

=over 4

=item b

Binary weighting - 1 for terms present in a document, 0 for terms absent.

=item t

Raw term frequency - equal to the number of times a feature occurs in
the document.

=item x

A synonym for 't'.

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

Apply cosine normalization - multiply by 1/length(document_vector).

=item x

No change - multiply by 1.

=back

The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.
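
For example, a minimal sketch (the letter meanings for the collection
component are taken from the C<weigh_features()> code):

 # 't' = raw term frequency, 'f' = multiply by log(N/df),
 # 'c' = cosine normalization
 my $k = new AI::Categorizer::KnowledgeSet(tfidf_weighting => 'tfc');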

=item verbose

If set to a true value, some status/debugging information will be
output on C<STDOUT>.

=back


=item categories()

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

=item documents()

In a list context returns a list of all Document objects in this
KnowledgeSet.  In a scalar context returns the number of such objects.

=item document()

Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.

=item features()

Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.

=item verbose()

Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.

=item scan_stats()

Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)

=item scan_features()

This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner.  This process is known as "feature selection", and it's a
very important part of categorization.

The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.

The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.

This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies.  Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.

=item save_features()

Given the name of a file, this method writes the features (as
determined by the C<scan_features> method) to the file.

=item restore_features()

Given the name of a file written by C<save_features>, loads the
features from that file and passes them as the C<use_features>
parameter for any Document objects created in the future by this
KnowledgeSet.
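
For example (a sketch; the filename is hypothetical, and C<$c> is a
Collection object as in C<scan_features()>):

 $k->scan_features(collection => $c);
 $k->save_features('features.txt');
 # ... later, perhaps in another process ...
 $k->restore_features('features.txt');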

=item read()

Iterates through a Collection of documents and adds them to the
KnowledgeSet.  The Collection can be specified using a C<collection>
parameter - otherwise, specify the arguments to pass to the C<new()>
method of the Collection class.

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

categories I<by name>.  These are the categories that the document
belongs to.  Any other parameters will be passed to the Document
class's C<new()> method.

=item finish()

This method will be called prior to training the Learner.  Its purpose
is to perform any operations (such as feature vector weighting) that
may require examination of the entire KnowledgeSet.

=item weigh_features()

This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.

=item document_frequency()

Given a single feature (word) as an argument, this method will return
the number of documents in the KnowledgeSet that contain that feature.

=item partition()

Divides the KnowledgeSet into several subsets.  This may be useful for
performing cross-validation.  The relative sizes of the subsets should

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

partitions will be returned as a list.

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3)

=cut

lib/AI/Categorizer/Learner.pm  view on Meta::CPAN

		  class => 'AI::Categorizer::Hypothesis',
		  delayed => 1,
		 },
   experiment => {
		  class => 'AI::Categorizer::Experiment',
		  delayed => 1,
		 },
  );

# Subclasses must override these virtual methods:
sub get_scores;
sub create_model;

# Optional virtual method for on-line learning:
sub add_knowledge;

sub verbose {
  my $self = shift;
  if (@_) {
    $self->{verbose} = shift;
  }

lib/AI/Categorizer/Learner.pm  view on Meta::CPAN

  $self->{knowledge_set}->finish;
  $self->create_model;    # Creates $self->{model}
  $self->delayed_object_params('hypothesis',
			       all_categories => [map $_->name, $self->categories],
			      );
}

sub prog_bar {
  my ($self, $count) = @_;
  
  return sub { print STDERR '.' } unless eval "use Time::Progress; 1";
  
  my $pb = 'Time::Progress'->new;
  $pb->attr(max => $count);
  my $i = 0;
  return sub {
    $i++;
    return if $i % 25;
    my $string = '';
    if (@_) {
      my $e = shift;
      $string = sprintf " (maF1=%.03f, miF1=%.03f)", $e->macro_F1, $e->micro_F1;
    }

lib/AI/Categorizer/Learner.pm  view on Meta::CPAN

    }
  }
  print STDERR "\n" if $self->verbose;

  return $experiment;
}

sub categorize {
  my ($self, $doc) = @_;
  
  my ($scores, $threshold) = $self->get_scores($doc);
  
  if ($self->verbose > 2) {
    warn "scores: @{[ %$scores ]}" if $self->verbose > 3;
    
    foreach my $key (sort {$scores->{$b} <=> $scores->{$a}} keys %$scores) {
      print "$key: $scores->{$key}\n";
    }
  }
  
  return $self->create_delayed_object('hypothesis',
                                      scores => $scores,
                                      threshold => $threshold,
                                      document_name => $doc->name,
                                     );
}
1;

__END__

=head1 NAME

AI::Categorizer::Learner - Abstract Machine Learner Class

lib/AI/Categorizer/Learner.pm  view on Meta::CPAN

 use AI::Categorizer::Learner::NaiveBayes;  # Or other subclass
 
 # Here $k is an AI::Categorizer::KnowledgeSet object
 
 my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
 $nb->train(knowledge_set => $k);
 $nb->save_state('filename');
 
 ... time passes ...
 
 $nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
 my $c = new AI::Categorizer::Collection::Files( path => ... );
 while (my $document = $c->next) {
   my $hypothesis = $nb->categorize($document);
   print "Best assigned category: ", $hypothesis->best_category, "\n";
   print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
 }

=head1 DESCRIPTION

The C<AI::Categorizer::Learner> class is an abstract class that will

lib/AI/Categorizer/Learner.pm  view on Meta::CPAN


If true, the Learner will display some diagnostic output while
training and categorizing documents.

=back

=item train()

=item train(knowledge_set => $k)

Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.  If you provided a C<knowledge_set> parameter to C<new()>,
specifying one here will override it.

=item categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=item categorize_collection(collection => $collection)

Categorizes every document in a collection and returns an Experiment
object representing the results.  Note that the Experiment does not
contain knowledge of the assigned categories for every document, only
a statistical summary of the results.
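
For example (a sketch; the path is hypothetical and C<$learner> is
assumed to be already trained):

 my $c = new AI::Categorizer::Collection::Files(path => 'corpus/test');
 my $experiment = $learner->categorize_collection(collection => $c);
 print $experiment->stats_table;  # summary statistics for the whole run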

=item knowledge_set()

Gets/sets the internal C<knowledge_set> member.  Note that since the
knowledge set may be enormous, some Learners may throw away their
knowledge set after training or after restoring state from a file.

=item $learner-E<gt>save_state($path)

Saves the Learner for later use.  This method is inherited from
C<AI::Categorizer::Storable>.

=item $class-E<gt>restore_state($path)

Returns a Learner saved in a file with C<save_state()>.  This method
is inherited from C<AI::Categorizer::Storable>.

=back

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3)

=cut

lib/AI/Categorizer/Learner/Boolean.pm  view on Meta::CPAN


use strict;
use AI::Categorizer::Learner;
use base qw(AI::Categorizer::Learner);
use Params::Validate qw(:types);
use AI::Categorizer::Util qw(random_elements);

__PACKAGE__->valid_params
  (
   max_instances => {type => SCALAR, default => 0},
   threshold => {type => SCALAR, default => 0.5},
  );

sub create_model {
  my $self = shift;
  my $m = $self->{model} ||= {};
  my $mi = $self->{max_instances};

  foreach my $cat ($self->knowledge_set->categories) {
    my (@p, @n);
    foreach my $doc ($self->knowledge_set->documents) {
      if ($doc->is_in_category($cat)) {
	push @p, $doc;
      } else {
	push @n, $doc;
      }
    }
    if ($mi and @p + @n > $mi) {
      # Get rid of random instances from training set, preserving
      # current positive/negative ratio
      my $ratio = $mi / (@p + @n);
      @p = random_elements(\@p, @p * $ratio);
      @n = random_elements(\@n, @n * $ratio);
      
      warn "Limiting to ". @p ." positives and ". @n ." negatives\n" if $self->verbose;
    }

    warn "Creating model for ", $cat->name, "\n" if $self->verbose;
    $m->{learners}{ $cat->name } = $self->create_boolean_model(\@p, \@n, $cat);
  }
}

sub create_boolean_model;  # Abstract method

sub get_scores {
  my ($self, $doc) = @_;
  my $m = $self->{model};
  my %scores;
  foreach my $cat (keys %{$m->{learners}}) {
    $scores{$cat} = $self->get_boolean_score($doc, $m->{learners}{$cat});
  }
  return (\%scores, $self->{threshold});
}

sub get_boolean_score;  # Abstract method

sub threshold {
  my $self = shift;
  $self->{threshold} = shift if @_;
  return $self->{threshold};
}

sub categories {
  my $self = shift;
  return map AI::Categorizer::Category->by_name( name => $_ ), keys %{ $self->{model}{learners} };
}

1;
__END__

lib/AI/Categorizer/Learner/DecisionTree.pm  view on Meta::CPAN

  $self->{model}{first_tree}->do_purge;
  delete $self->{model}{first_tree};
}

sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;
  
  my $t = new AI::DecisionTree(noise_mode => 'pick_best', 
			       verbose => $self->verbose);

  my %results;
  for ($positives, $negatives) {
    foreach my $doc (@$_) {
      $results{$doc->name} = $_ eq $positives ? 1 : 0;
    }
  }

  if ($self->{model}{first_tree}) {
    $t->copy_instances(from => $self->{model}{first_tree});
    $t->set_results(\%results);

  } else {
    for ($positives, $negatives) {
      foreach my $doc (@$_) {
	$t->add_instance( attributes => $doc->features->as_boolean_hash,
			  result => $results{$doc->name},
			  name => $doc->name,
			);
      }
    }
    $t->purge(0);
    $self->{model}{first_tree} = $t;
  }

  print STDERR "\nBuilding tree for category '", $cat->name, "'" if $self->verbose;
  $t->train;
  return $t;
}

sub get_scores {
  my ($self, $doc) = @_;
  local $self->{current_doc} = $doc->features->as_boolean_hash;
  return $self->SUPER::get_scores($doc);
}

sub get_boolean_score {
  my ($self, $doc, $t) = @_;
  return $t->get_result( attributes => $self->{current_doc} ) || 0;
}

1;
__END__

=head1 NAME

AI::Categorizer::Learner::DecisionTree - Decision Tree Learner

=head1 SYNOPSIS

lib/AI/Categorizer/Learner/DecisionTree.pm  view on Meta::CPAN

  use AI::Categorizer::Learner::DecisionTree;
  
  # Here $k is an AI::Categorizer::KnowledgeSet object
  
  my $l = new AI::Categorizer::Learner::DecisionTree(...parameters...);
  $l->train(knowledge_set => $k);
  $l->save_state('filename');
  
  ... time passes ...
  
  $l = AI::Categorizer::Learner->restore_state('filename');
  while (my $document = ... ) {  # An AI::Categorizer::Document object
    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
  }

=head1 DESCRIPTION

This class implements a Decision Tree machine learner, using
C<AI::DecisionTree> to do the internal work.

lib/AI/Categorizer/Learner/DecisionTree.pm  view on Meta::CPAN


This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available unless explicitly mentioned here.

=head2 new()

Creates a new DecisionTree Learner and returns it.

=head2 train(knowledge_set => $k)

Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.

=head2 categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=head2 save_state($path)

Saves the categorizer for later use.  This method is inherited from
C<AI::Categorizer::Storable>.

=head1 AUTHOR

Ken Williams, ken@mathforum.org

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3)

=cut

lib/AI/Categorizer/Learner/Guesser.pm  view on Meta::CPAN

  my $self = shift;
  my $k = $self->knowledge_set;
  my $num_docs = $k->documents;
  
  foreach my $cat ($k->categories) {
    next unless $cat->documents;
    $self->{model}{$cat->name} = $cat->documents / $num_docs;
  }
}

sub get_scores {
  my ($self, $newdoc) = @_;
  
  my %scores;
  while (my ($cat, $prob) = each %{$self->{model}}) {
    # A score of 0.5 + $prob - rand() reaches the 0.5 threshold with
    # probability $prob, so each category is assigned with its prior
    # probability from the training set.
    $scores{$cat} = 0.5 + $prob - rand();
  }
  
  return (\%scores, 0.5);
}

1;

__END__

=head1 NAME

AI::Categorizer::Learner::Guesser - Simple guessing based on class probabilities

lib/AI/Categorizer/Learner/Guesser.pm  view on Meta::CPAN

  use AI::Categorizer::Learner::Guesser;
  
  # Here $k is an AI::Categorizer::KnowledgeSet object
  
  my $l = new AI::Categorizer::Learner::Guesser;
  $l->train(knowledge_set => $k);
  $l->save_state('filename');
  
  ... time passes ...
  
  $l = AI::Categorizer::Learner->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
  }

=head1 DESCRIPTION

This implements a simple category guesser that makes assignments based
solely on the prior probabilities of categories.  For instance, if 5%
of the training documents belong to a certain category, then the
probability of any test document being assigned to that category is
0.05.  This can be useful for providing baseline scores to compare
with other more sophisticated algorithms.

See L<AI::Categorizer> for a complete description of the interface.

=head1 METHODS

This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available.

=head1 AUTHOR

Ken Williams (C<< <ken@mathforum.org> >>)

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3)

=cut

lib/AI/Categorizer/Learner/KNN.pm  view on Meta::CPAN

package AI::Categorizer::Learner::KNN;

use strict;
use AI::Categorizer::Learner;
use base qw(AI::Categorizer::Learner);
use Params::Validate qw(:types);

__PACKAGE__->valid_params
  (
   threshold => {type => SCALAR, default => 0.4},
   k_value => {type => SCALAR, default => 20},
   knn_weighting => {type => SCALAR, default => 'score'},
   max_instances => {type => SCALAR, default => 0},
  );

sub create_model {
  my $self = shift;
  foreach my $doc ($self->knowledge_set->documents) {
    $doc->features->normalize;
  }
  $self->knowledge_set->features;  # Initialize
}

sub threshold {
  my $self = shift;
  $self->{threshold} = shift if @_;
  return $self->{threshold};
}

sub categorize_collection {
  my $self = shift;
  
  my $f_class = $self->knowledge_set->contained_class('features');
  if ($f_class->can('all_features')) {
    $f_class->all_features([$self->knowledge_set->features->names]);
  }
  $self->SUPER::categorize_collection(@_);
}

sub get_scores {
  my ($self, $newdoc) = @_;
  my $currentDocName = $newdoc->name;
  #print "classifying $currentDocName\n";

  my $features = $newdoc->features->intersection($self->knowledge_set->features)->normalize;
  my $q = AI::Categorizer::Learner::KNN::Queue->new(size => $self->{k_value});

  my @docset;
  if ($self->{max_instances}) {
    # Use (approximately) max_instances documents, chosen randomly from corpus
    my $probability = $self->{max_instances} / $self->knowledge_set->documents;
    @docset = grep {rand() < $probability} $self->knowledge_set->documents;
  } else {
    # Use the whole corpus
    @docset = $self->knowledge_set->documents;
  }
  
  foreach my $doc (@docset) {
    my $score = $doc->features->dot( $features );
    warn "Score for ", $doc->name, " (", ($doc->categories)[0]->name, "): $score" if $self->verbose > 1;
    $q->add($doc, $score);
  }
  
  my %scores = map {+$_->name, 0} $self->categories;  # start each category at zero
  foreach my $e (@{$q->entries}) {
    foreach my $cat ($e->{thing}->categories) {
      # Each neighbor votes for its categories, weighted by its
      # similarity score ('score' weighting) or counted once otherwise.
      $scores{$cat->name} += ($self->{knn_weighting} eq 'score' ? $e->{score} : 1);
    }
  }
  
  $_ /= $self->{k_value} foreach values %scores;  # normalize votes by k
  
  return (\%scores, $self->{threshold});
}

###################################################################
package AI::Categorizer::Learner::KNN::Queue;

sub new {
  my ($pkg, %args) = @_;
  return bless {
		size => $args{size},
		entries => [],
	       }, $pkg;
}

sub add {
  my ($self, $thing, $score) = @_;

  # scores may be (0.2, 0.4, 0.4, 0.8) - ascending

  return unless (@{$self->{entries}} < $self->{size}       # Queue not filled
		 or $score > $self->{entries}[0]{score});  # Found a better entry
  
  my $i;
  if (!@{$self->{entries}}) {
    $i = 0;
  } elsif ($score > $self->{entries}[-1]{score}) {
    $i = @{$self->{entries}};
  } else {
    # The original listing is truncated here; the following is a
    # straightforward reconstruction.  Scan for the first entry that
    # outscores the new one and insert just before it, keeping the
    # list sorted in ascending order.
    for ($i = 0; $i < @{$self->{entries}}; $i++) {
      last if $score < $self->{entries}[$i]{score};
    }
  }
  splice @{$self->{entries}}, $i, 0, { thing => $thing, score => $score };
  shift @{$self->{entries}} if @{$self->{entries}} > $self->{size};
}

lib/AI/Categorizer/Learner/KNN.pm  view on Meta::CPAN

sub entries {
  return shift->{entries};
}

1;

__END__

=head1 NAME

AI::Categorizer::Learner::KNN - K Nearest Neighbour Algorithm For AI::Categorizer

=head1 SYNOPSIS

  use AI::Categorizer::Learner::KNN;
  
  # Here $k is an AI::Categorizer::KnowledgeSet object
  
  my $knn = new AI::Categorizer::Learner::KNN(...parameters...);
  $knn->train(knowledge_set => $k);
  $knn->save_state('filename');
  
  ... time passes ...
  
  $knn = AI::Categorizer::Learner->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $knn->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
  }

=head1 DESCRIPTION

This is an implementation of the k-Nearest-Neighbor decision-making
algorithm, applied to the task of document categorization (as defined
by the AI::Categorizer module).  See L<AI::Categorizer> for a complete
description of the interface.
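
In outline: every feature vector is normalized to unit length, so the
dot product computed in C<get_scores()> is a cosine similarity; the
C<k_value> most similar training documents then vote for their
categories.  The following simplified, self-contained sketch mimics
that process (the hash-based vectors and the C<dot()> helper are
hypothetical, not this module's API):

  sub knn_scores {
    my ($new_vec, $k, @labeled) = @_;  # each element: [ \%vector, @category_names ]
    my @ranked = sort { $b->{sim} <=> $a->{sim} }
                 map  { +{ sim  => dot($new_vec, $_->[0]),
                           cats => [ @{$_}[ 1 .. $#{$_} ] ] } } @labeled;
    $#ranked = $k - 1 if @ranked > $k;  # keep only the k nearest
    my %scores;
    for my $n (@ranked) {
      $scores{$_} += $n->{sim} for @{$n->{cats}};  # 'score' weighting
    }
    $scores{$_} /= $k for keys %scores;
    return \%scores;
  }

  sub dot {  # dot product of two sparse vectors stored as hashes
    my ($x, $y) = @_;
    my $sum = 0;
    $sum += $x->{$_} * ($y->{$_} || 0) for keys %$x;
    return $sum;
  }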

=head1 METHODS

This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available unless explicitly mentioned here.

=head2 new()

Creates a new KNN Learner and returns it.  In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
KNN subclass accepts the following parameters:

=over 4

=item threshold

Sets the score threshold for category membership.  The default is
currently 0.4.  Set the threshold lower to assign more categories per
document, set it higher to assign fewer.  This can be an effective way
to trade off between precision and recall; see the example following
this list.

=item k_value

Sets the C<k> value (as in k-Nearest-Neighbor) to the given integer.
This indicates how many of each document's nearest neighbors should be
considered when assigning categories.  The default is 20.

=item knn_weighting

Controls how the nearest neighbors vote.  With the default value
C<score>, each neighbor's vote for its categories is weighted by its
similarity score; with any other value, each neighbor casts a single
unweighted vote.

=item max_instances

If set to a nonzero number, only approximately that many training
documents, chosen at random, are searched for neighbors when
categorizing each new document.  This trades accuracy for speed on
large corpora.  The default is 0, meaning the whole training corpus is
searched.

=back
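
For example (parameter values here are arbitrary, chosen only to
illustrate the constructor):

  my $learner = new AI::Categorizer::Learner::KNN
    (k_value       => 10,
     threshold     => 0.2,
     knn_weighting => 'uniform',   # any value other than 'score'
     max_instances => 5000);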

=head2 threshold()

Returns the current threshold value.  With an optional numeric
argument, you may set the threshold.

=head2 train(knowledge_set => $k)

Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
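
For instance, a minimal sketch, assuming C<@categories> and
C<@documents> already hold C<AI::Categorizer::Category> and
C<AI::Categorizer::Document> objects and C<$knn> is a learner created
as in the synopsis:

  my $k = AI::Categorizer::KnowledgeSet->new
    (categories => \@categories,
     documents  => \@documents);
  $knn->train(knowledge_set => $k);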

=head2 categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=head2 save_state($path)

Saves the categorizer for later use.  This method is inherited from
C<AI::Categorizer::Storable>.

=head1 AUTHOR

Originally written by David Bell (C<< <dave@student.usyd.edu.au> >>),
October 2002.

Added to AI::Categorizer November 2002, modified, and maintained by
Ken Williams (C<< <ken@mathforum.org> >>).

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3)

"A re-examination of text categorization methods" by Yiming Yang
L<http://www.cs.cmu.edu/~yiming/publications.html>

lib/AI/Categorizer/Learner/NaiveBayes.pm  view on Meta::CPAN

package AI::Categorizer::Learner::NaiveBayes;

use strict;
use AI::Categorizer::Learner;
use base qw(AI::Categorizer::Learner);
use Params::Validate qw(:types);
use Algorithm::NaiveBayes;

__PACKAGE__->valid_params
  (
   threshold => {type => SCALAR, default => 0.3},
  );

sub create_model {
  my $self = shift;
  my $m = $self->{model} = Algorithm::NaiveBayes->new;

  foreach my $d ($self->knowledge_set->documents) {
    $m->add_instance(attributes => $d->features->as_hash,
		     label      => [ map $_->name, $d->categories ]);
  }
  $m->train;
}

sub get_scores {
  my ($self, $newdoc) = @_;

  return ($self->{model}->predict( attributes => $newdoc->features->as_hash ),
	  $self->{threshold});
}

sub threshold {
  my $self = shift;
  $self->{threshold} = shift if @_;
  return $self->{threshold};
}

sub save_state {
  my $self = shift;
  local $self->{knowledge_set};  # Don't need the knowledge_set to categorize
  $self->SUPER::save_state(@_);
}

sub categories {
  my $self = shift;

lib/AI/Categorizer/Learner/NaiveBayes.pm  view on Meta::CPAN

=head1 SYNOPSIS

  use AI::Categorizer::Learner::NaiveBayes;
  
  # Here $k is an AI::Categorizer::KnowledgeSet object
  
  my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
  $nb->train(knowledge_set => $k);
  $nb->save_state('filename');
  
  ... time passes ...
  
  $nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $nb->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
  }

=head1 DESCRIPTION

This is an implementation of the Naive Bayes decision-making
algorithm, applied to the task of document categorization (as defined
by the AI::Categorizer module).  See L<AI::Categorizer> for a complete
description of the interface.

lib/AI/Categorizer/Learner/NaiveBayes.pm  view on Meta::CPAN

=head1 METHODS

This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available unless explicitly mentioned here.

=head2 new()

Creates a new Naive Bayes Learner and returns it.  In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
Naive Bayes subclass accepts the following parameters:

=over 4

=item * threshold

Sets the score threshold for category membership.  The default is
currently 0.3.  Set the threshold lower to assign more categories per
document, set it higher to assign fewer.  This can be an effective way
to trade off between precision and recall; see the example following
this list.

=back
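
For instance (threshold values here are arbitrary, for illustration
only):

  # A lower threshold assigns more categories per document; a higher
  # one assigns fewer.
  my $liberal      = new AI::Categorizer::Learner::NaiveBayes(threshold => 0.1);
  my $conservative = new AI::Categorizer::Learner::NaiveBayes(threshold => 0.6);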

=head2 threshold()

Returns the current threshold value.  With an optional numeric
argument, you may set the threshold.

=head2 train(knowledge_set => $k)

Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.

=head2 categorize($document)

Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.

=head2 save_state($path)

Saves the categorizer for later use.  This method is inherited from
C<AI::Categorizer::Storable>.

=head1 CALCULATIONS

The various probabilities used in the above calculations are found
directly from the training documents.  For instance, if there are 5000
total tokens (words) in the "sports" training documents and 200 of
them are the word "curling", then C<P(curling|sports) = 200/5000 =
0.04>.  If there are 10,000 total tokens in the training corpus and
5,000 of them are in documents belonging to the category "sports",
then C<P(sports) = 5,000/10,000 = 0.5>.

Because the probabilities involved are often very small and we
multiply many of them together, the result is often a vanishingly
small number.  This could pose problems of floating-point underflow, so
instead of working with the actual probabilities we work with the
logarithms of the probabilities.  This also speeds up various
calculations in the C<categorize()> method.
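
The following self-contained sketch (illustrative only; the real work
happens inside C<Algorithm::NaiveBayes>) shows the log-space
arithmetic described above:

  use List::Util qw(sum);

  # log P(cat) + sum of count(word) * log P(word|cat) over the words
  # in the document.  Assumes each word already has a smoothed,
  # nonzero probability in %$p_word_given_cat.
  sub log_score {
    my ($word_counts, $p_word_given_cat, $p_cat) = @_;
    return log($p_cat) + sum(map {
      $word_counts->{$_} * log($p_word_given_cat->{$_})
    } keys %$word_counts);
  }

  # With P(sports) = 0.5 and P(curling|sports) = 0.04, a document
  # containing "curling" three times scores
  # log(0.5) + 3 * log(0.04), about -10.35, for "sports".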

=head1 TO DO

More work on the confidence scores - right now the winning category
tends to dominate the scores overwhelmingly, when the scores should
probably be more evenly distributed.

=head1 AUTHOR

Ken Williams, ken@forum.swarthmore.edu

=head1 COPYRIGHT

Copyright 2000-2003 Ken Williams.  All rights reserved.

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.

=head1 SEE ALSO

AI::Categorizer(3), Algorithm::NaiveBayes(3)

"A re-examination of text categorization methods" by Yiming Yang
L<http://www.cs.cmu.edu/~yiming/publications.html>

lib/AI/Categorizer/Learner/Rocchio.pm  view on Meta::CPAN

package AI::Categorizer::Learner::Rocchio;

use strict;
use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Learner::Boolean;
use base qw(AI::Categorizer::Learner::Boolean);

__PACKAGE__->valid_params
  (
   positive_setting => {type => SCALAR, default => 16 },
   negative_setting => {type => SCALAR, default => 4  },
   threshold        => {type => SCALAR, default => 0.1},
  );

sub create_model {
  my $self = shift;
  foreach my $doc ($self->knowledge_set->documents) {
    $doc->features->normalize;
  }
  
  # Cache the summed feature vector of the whole corpus; the
  # per-category profiles built below are derived from it.
  $self->{model}{all_features} = $self->knowledge_set->features(undef);
  $self->SUPER::create_model(@_);
  delete $self->{knowledge_set};  # no longer needed once profiles exist
}

sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;
  my $posdocnum = @$positives;
  my $negdocnum = @$negatives;
  
  my $beta = $self->{positive_setting};
  my $gamma = $self->{negative_setting};
  
  # Build the Rocchio prototype  beta/|P| * sum(positives) -
  # gamma/|N| * sum(negatives).  Since the corpus-wide vector is the
  # sum of the positives and the negatives, this works out to
  # -gamma/|N| * all_features + (beta/|P| + gamma/|N|) * positive features.
  my $profile = $self->{model}{all_features}->clone->scale(-$gamma/$negdocnum);
  my $f = $cat->features(undef)->clone->scale( $beta/$posdocnum + $gamma/$negdocnum );
  $profile->add($f);

  return $profile->normalize;
}

sub get_boolean_score {
  my ($self, $newdoc, $profile) = @_;
  return $newdoc->features->normalize->dot($profile);
}

1;
