AI-Categorizer

 view release on metacpan or  search on metacpan

README  view on Meta::CPAN

    be seen in "doc/classes.png".

  Knowledge Sets

    A "knowledge set" is defined as a collection of documents, together with
    some information on the categories each document belongs to. Note that this
    term is somewhat unique to this project - other sources may call it a
    "training corpus", or "prior knowledge". A knowledge set also contains some
    information on how documents will be parsed and how their features (words)
    will be extracted and turned into meaningful representations. In this sense,
    a knowledge set represents not only a collection of data, but a particular
    view on that data.

    A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
    class. Before you can start playing with categorizers, you will have to
    start playing with knowledge sets, so that the categorizers have some data
    to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
    module for information on its interface.

   Feature selection

    Deciding which features are the most important is a very large part of the
    categorization task - you cannot simply consider all the words in all the
    documents when training, and all the words in the document being
    categorized. There are two main reasons for this - first, it would mean that
    your training and categorizing processes would take forever and use tons of

README  view on Meta::CPAN

    complete test run generally contains two collections, one for training and
    one for testing. A "Learner" can mass-categorize a collection.

    The "AI::Categorizer::Collection" class and its subclasses instantiate the
    idea of a collection in this sense.

  Documents

    Each document is represented by an "AI::Categorizer::Document" object, or an
    object of one of its subclasses. Each document class contains methods for
    turning a bunch of data into a Feature Vector. Each document also has a
    method to report which categories it belongs to.

  Categories

    Each category is represented by an "AI::Categorizer::Category" object. Its
    main purpose is to keep track of which documents belong to it, though you
    can also examine statistical properties of an entire category, such as
    obtaining a Feature Vector representing an amalgamation of all the documents
    that belong to it.

README  view on Meta::CPAN

    Please see the documentation of these individual modules for more details on
    their guts and quirks. See the "AI::Categorizer::Learner" documentation for
    a description of the general categorizer interface.

    If you wish to create your own classifier, you should inherit from
    "AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
    abstract classes that manage some of the work for you.

  Feature Vectors

    Most categorization algorithms don't deal directly with documents' data,
    they instead deal with a *vector representation* of a document's *features*.
    The features may be any properties of the document that seem helpful for
    determining its category, but they are usually some version of the "most
    important" words in the document. A list of features and their weights in
    each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
    You may think of this class as roughly analogous to a Perl hash, where the
    keys are the names of features and the values are their weights.

  Hypotheses

README  view on Meta::CPAN


        training_set
            Specifies the "path" parameter that will be fed to the
            KnowledgeSet's "scan_features()" and "read()" methods during our
            "scan_features()" and "read_training_set()" methods.

        test_set
            Specifies the "path" parameter that will be used when creating a
            Collection during the "evaluate_test_set()" method.

        data_root
            A shortcut for setting the "training_set", "test_set", and
            "category_file" parameters separately. Sets "training_set" to
            "$data_root/training", "test_set" to "$data_root/test", and
            "category_file" (used by some of the Collection classes) to
            "$data_root/cats.txt".

    learner()
        Returns the Learner object associated with this Categorizer. Before
        "train()", the Learner will of course not be trained yet.

    knowledge_set()
        Returns the KnowledgeSet object associated with this Categorizer. If
        "read_training_set()" has not yet been called, the KnowledgeSet will not
        yet be populated with any training data.

    run_experiment()
        Runs a complete experiment on the training and testing data, reporting
        the results on "STDOUT". Internally, this is just a shortcut for calling
        the "scan_features()", "read_training_set()", "train()", and
        "evaluate_test_set()" methods, then printing the value of the
        "stats_table()" method.

    scan_features()
        Scans the Collection specified in the "test_set" parameter to determine
        the set of features (words) that will be considered when training the
        Learner. Internally, this calls the "scan_features()" method of the
        KnowledgeSet, then saves a list of the KnowledgeSet's features for later
        use.

        This step is not strictly necessary, but it can dramatically reduce
        memory requirements if you scan for features before reading the entire
        corpus into memory.

    read_training_set()
        Populates the KnowledgeSet with the data specified in the "test_set"
        parameter. Internally, this calls the "read()" method of the
        KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
        object for later use.

    train()
        Calls the Learner's "train()" method, passing it the KnowledgeSet
        created during "read_training_set()". Returns the Learner object. Also
        saves the Learner object for later use.

    evaluate_test_set()

eg/demo.pl  view on Meta::CPAN

# documents, trains a Naive Bayes categorizer on it, then tests the
# categorizer on a set of test documents.

use strict;
use AI::Categorizer;
use AI::Categorizer::Collection::Files;
use AI::Categorizer::Learner::NaiveBayes;
use File::Spec;

die("Usage: $0 <corpus>\n".
    "  A sample corpus (data set) can be downloaded from\n".
    "     http://www.cpan.org/authors/Ken_Williams/data/reuters-21578.tar.gz\n".
    "  or http://www.limnus.com/~ken/reuters-21578.tar.gz\n")
  unless @ARGV == 1;

my $corpus = shift;

my $training  = File::Spec->catfile( $corpus, 'training' );
my $test      = File::Spec->catfile( $corpus, 'test' );
my $cats      = File::Spec->catfile( $corpus, 'cats.txt' );
my $stopwords = File::Spec->catfile( $corpus, 'stopwords' );

lib/AI/Categorizer.pm  view on Meta::CPAN



__PACKAGE__->valid_params
  (
   progress_file => { type => SCALAR, default => 'save' },
   knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
   learner       => { isa => 'AI::Categorizer::Learner' },
   verbose       => { type => BOOLEAN, default => 0 },
   training_set  => { type => SCALAR, optional => 1 },
   test_set      => { type => SCALAR, optional => 1 },
   data_root     => { type => SCALAR, optional => 1 },
  );

__PACKAGE__->contained_objects
  (
   knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
   learner       => { class => 'AI::Categorizer::Learner::NaiveBayes' },
   experiment    => { class => 'AI::Categorizer::Experiment',
		      delayed => 1 },
   collection    => { class => 'AI::Categorizer::Collection::Files',
		      delayed => 1 },
  );

sub new {
  my $package = shift;
  my %args = @_;
  my %defaults;
  if (exists $args{data_root}) {
    $defaults{training_set} = File::Spec->catfile($args{data_root}, 'training');
    $defaults{test_set} = File::Spec->catfile($args{data_root}, 'test');
    $defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
    delete $args{data_root};
  }

  return $package->SUPER::new(%defaults, %args);
}

#sub dump_parameters {
#  my $p = shift()->SUPER::dump_parameters;
#  delete $p->{stopwords} if $p->{stopword_file};
#  return $p;
#}

lib/AI/Categorizer.pm  view on Meta::CPAN


=head2 Knowledge Sets

A "knowledge set" is defined as a collection of documents, together
with some information on the categories each document belongs to.
Note that this term is somewhat unique to this project - other sources
may call it a "training corpus", or "prior knowledge".  A knowledge
set also contains some information on how documents will be parsed and
how their features (words) will be extracted and turned into
meaningful representations.  In this sense, a knowledge set represents
not only a collection of data, but a particular view on that data.

A knowledge set is encapsulated by the
C<AI::Categorizer::KnowledgeSet> class.  Before you can start playing
with categorizers, you will have to start playing with knowledge sets,
so that the categorizers have some data to train on.  See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.

=head3 Feature selection

Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized.  There are two main reasons for this - first, it
would mean that your training and categorizing processes would take

lib/AI/Categorizer.pm  view on Meta::CPAN

two collections, one for training and one for testing.  A C<Learner>
can mass-categorize a collection.

The C<AI::Categorizer::Collection> class and its subclasses
instantiate the idea of a collection in this sense.

=head2 Documents

Each document is represented by an C<AI::Categorizer::Document>
object, or an object of one of its subclasses.  Each document class
contains methods for turning a bunch of data into a Feature Vector.
Each document also has a method to report which categories it belongs
to.

=head2 Categories

Each category is represented by an C<AI::Categorizer::Category>
object.  Its main purpose is to keep track of which documents belong
to it, though you can also examine statistical properties of an entire
category, such as obtaining a Feature Vector representing an
amalgamation of all the documents that belong to it.

lib/AI/Categorizer.pm  view on Meta::CPAN

details on their guts and quirks.  See the C<AI::Categorizer::Learner>
documentation for a description of the general categorizer interface.

If you wish to create your own classifier, you should inherit from
C<AI::Categorizer::Learner> or C<AI::Categorizer::Learner::Boolean>,
which are abstract classes that manage some of the work for you.

=head2 Feature Vectors

Most categorization algorithms don't deal directly with documents'
data, they instead deal with a I<vector representation> of a
document's I<features>.  The features may be any properties of the
document that seem helpful for determining its category, but they are usually
some version of the "most important" words in the document.  A list of
features and their weights in each document is encapsulated by the
C<AI::Categorizer::FeatureVector> class.  You may think of this class
as roughly analogous to a Perl hash, where the keys are the names of
features and the values are their weights.

=head2 Hypotheses

lib/AI/Categorizer.pm  view on Meta::CPAN


Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.

=item test_set

Specifies the C<path> parameter that will be used when creating a
Collection during the C<evaluate_test_set()> method.

=item data_root

A shortcut for setting the C<training_set>, C<test_set>, and
C<category_file> parameters separately.  Sets C<training_set> to
C<$data_root/training>, C<test_set> to C<$data_root/test>, and
C<category_file> (used by some of the Collection classes) to
C<$data_root/cats.txt>.

=back

=item learner()

Returns the Learner object associated with this Categorizer.  Before
C<train()>, the Learner will of course not be trained yet.

=item knowledge_set()

Returns the KnowledgeSet object associated with this Categorizer.  If
C<read_training_set()> has not yet been called, the KnowledgeSet will
not yet be populated with any training data.

=item run_experiment()

Runs a complete experiment on the training and testing data, reporting
the results on C<STDOUT>.  Internally, this is just a shortcut for
calling the C<scan_features()>, C<read_training_set()>, C<train()>,
and C<evaluate_test_set()> methods, then printing the value of the
C<stats_table()> method.

=item scan_features()

Scans the Collection specified in the C<test_set> parameter to
determine the set of features (words) that will be considered when
training the Learner.  Internally, this calls the C<scan_features()>
method of the KnowledgeSet, then saves a list of the KnowledgeSet's
features for later use.

This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.

=item read_training_set()

Populates the KnowledgeSet with the data specified in the C<test_set>
parameter.  Internally, this calls the C<read()> method of the
KnowledgeSet.  Returns the KnowledgeSet.  Also saves the KnowledgeSet
object for later use.

=item train()

Calls the Learner's C<train()> method, passing it the KnowledgeSet
created during C<read_training_set()>.  Returns the Learner object.
Also saves the Learner object for later use.

lib/AI/Categorizer/Collection/InMemory.pm  view on Meta::CPAN

package AI::Categorizer::Collection::InMemory;
use strict;

use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);

use Params::Validate qw(:types);

__PACKAGE__->valid_params
  (
   data => { type => HASHREF },
  );

sub new {
  my $self = shift()->SUPER::new(@_);
  
  while (my ($name, $params) = each %{$self->{data}}) {
    foreach (@{$params->{categories}}) {
      next if ref $_;
      $_ = AI::Categorizer::Category->by_name(name => $_);
    }
  }

  return $self;
}

sub next {
  my $self = shift;
  my ($name, $params) = each %{$self->{data}} or return;
  return AI::Categorizer::Document->new(name => $name, %$params);
}

sub rewind {
  my $self = shift;
  scalar keys %{$self->{data}};
  return;
}

sub count_documents {
  my $self = shift;
  return scalar keys %{$self->{data}};
}

1;

lib/AI/Categorizer/Document.pm  view on Meta::CPAN


### Constructors

my $NAME = 'a';

sub new {
  my $pkg = shift;
  my $self = $pkg->SUPER::new(name => $NAME++,  # Use a default name
			      @_);

  # Get efficient internal data structures
  $self->{categories} = new AI::Categorizer::ObjectSet( @{$self->{categories}} );

  $self->_fix_stopwords;
  
  # A few different ways for the caller to initialize the content
  if (exists $self->{parse}) {
    $self->parse(content => delete $self->{parse});
    
  } elsif (exists $self->{parse_handle}) {
    $self->parse_handle(handle => delete $self->{parse_handle});

lib/AI/Categorizer/Document.pm  view on Meta::CPAN


sub create_feature_vector {
  my $self = shift;
  my $content = $self->{content};
  my $weights = $self->{content_weights};

  die "'stopword_behavior' must be one of 'stem', 'no_stem', or 'pre_stemmed'"
    unless $self->{stopword_behavior} =~ /^stem|no_stem|pre_stemmed$/;

  $self->{features} = $self->create_delayed_object('features');
  while (my ($name, $data) = each %$content) {
    my $t = $self->tokenize($data);
    $t = $self->_filter_tokens($t) if $self->{stopword_behavior} eq 'no_stem';
    $self->stem_words($t);
    $t = $self->_filter_tokens($t) if $self->{stopword_behavior} =~ /^stem|pre_stemmed$/;
    my $h = $self->vectorize(tokens => $t, weight => exists($weights->{$name}) ? $weights->{$name} : 1 );
    $self->{features}->add($h);
  }
}

sub is_in_category {
  return (ref $_[1]

lib/AI/Categorizer/Document.pm  view on Meta::CPAN

 # Specify explicit feature vector:
 my $d = new AI::Categorizer::Document(name => $string);
 $d->features( $feature_vector );
 
 # Now pass the document to a categorization algorithm:
 my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
 my $hypothesis = $learner->categorize($document);

=head1 DESCRIPTION

The Document class embodies the data in a single document, and
contains methods for turning this data into a FeatureVector.  Usually
documents are plain text, but subclasses of the Document class may
handle any kind of data.

=head1 METHODS

=over 4

=item new(%parameters)

Creates a new Document object.  Document objects are used during
training (for the training documents), testing (for the test
documents), and when categorizing new unseen documents in an

lib/AI/Categorizer/Document.pm  view on Meta::CPAN

=back

The default value is C<stem>, which seems to produce the best results
in most cases I've tried.  I'm not aware of any studies comparing the
C<no_stem> behavior to the C<stem> behavior in the general case.

This parameter has no effect if there are no stopwords being used, or
if stemming is not being used.  In the latter case, the list of
stopwords will always be matched as-is against the document words.

Note that if the C<stem> option is used, the data structure passed as
the C<stopwords> parameter will be modified in-place to contain the
stemmed versions of the stopwords supplied.

=back

=item read( path =E<gt> $path )

An alternative constructor method which reads a file on disk and
returns a document with that file's contents.

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN


sub parse {
  my ($self, %args) = @_;

  # it is a string which contains the content of XML
  my $body= $args{content};			

  # it is a hash which includes a pair of <elementName, weight>
  my $elementWeight= $args{elementWeight};	

  # construct Handler which receive event of element, data, comment, processing_instruction
  # And convert their values into a sequence  of string and save it into buffer
  my $xmlHandler = $self->create_contained_object('xml_handler', weights => $elementWeight);

  # construct parser
  my $xmlParser= XML::SAX::ParserFactory->parser(Handler => $xmlHandler);

  # let's start parsing XML, where the methids of Handler will be called
  $xmlParser->parse_string($body);

  # extract the converted string from Handler

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN


  # call super class such as XML::SAX::Base
  my $self = $class->SUPER::new;

  # save weights of elements which is a hash for pairs <elementName, weight>
  # weight is times duplication of corresponding element
  # It is provided by caller(one of parameters) at construction, and
  # we must save it in order to use doing duplication at end_element
  $self->{weightHash} = $args{weights};

  # It is storage to store the data produced by Text, CDataSection and etc.
  $self->{content} = '';

  # This array is used to store the data for every element from root to the current visiting element.
  # Thus, data of 0~($levelPointer-1)th in the array is only valid.
  # The array which store the starting location(index) of the content for an element, 
  # From it, we can know all the data produced by an element at the end_element
  # It is needed at the duplication of the data produced by the specific element
  $self->{locationArray} = [];

  return $self;
}
	
# Input: None
# Output: None
# Description:
# 	it is called whenever the parser meets the document
# 	it will be called at once
#	Currently, confirm if the content buffer is an empty
sub start_document{
  my ($self, $doc)= @_;

  # The level(depth) of the last called element in XML tree
  # Calling of start_element is the preorder of the tree traversal.
  # The level is the level of current visiting element in tree.
  # the first element is 0-level
  $self->{levelPointer} = 0;

  # all data will be saved into here, initially, it is an empty
  $self->{content} = "";

  #$self->SUPER::start_document($doc);
}

# Input: None
# Output: None
# Description:
# 	it is called whenever the parser ends the document
# 	it will be called at once

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

#		Name		$el->{Name}
#		Prefix		$el->{Prefix}
#		Value		$el->{Value}
# Output: None
# Description:
# 	it is called whenever the parser meets the element
sub start_element{
  my ($self, $el)= @_;

  # find the last location of the content
  # its meaning is to append the new data at this location
  my $location= length $self->{content};

  # save the last location of the current content
  # so that at end_element the starting location of data of this element can be known
  $self->{locationArray}[$self->{levelPointer}] = $location;

  # for the next element, increase levelPointer
  $self->{levelPointer}++;

  #$self->SUPER::start_document($el);
}

# Input: None
# Output: None

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN


  $self->{levelPointer}--;
  my $location= $self->{locationArray}[$self->{levelPointer}];

  # find the name of element
  my $elementName= $el->{Name};

  # set the default weight
  my $weight= 1;

  # check if user give the weight to duplicate data
  $weight= $self->{weightHash}{$elementName} if exists $self->{weightHash}{$elementName};

  # 0 - remove all the data to be related to this element
  if($weight == 0){
    $self->{content} = substr($self->{content}, 0, $location);
    return;
  }

  # 1 - dont duplicate
  if($weight == 1){
    return;
  }
  
  # n - duplicate data by n times
  # get new content
  my $newContent= substr($self->{content}, $location);

  # start to copy
  for(my $i=1; $i<$weight;$i++){
    $self->{content} .= $newContent;
  }

  #$self->SUPER::end_document($el);
}

# Input: a hash which consists of pair <Data, Value>
# Output: None
# Description:
# 	it is called whenever the parser meets the text which comes from Text, CDataSection and etc
#	Value must be saved into content buffer.
sub characters{
  my ($self, $args)= @_;

  # save "data plus new line" into content
  $self->{content} .= "$args->{Data}\n";
}
	
# Input: a hash which consists of pair <Data, Value>
# Output: None
# Description:
# 	it is called whenever the parser meets the comment
#	Currently, it will be ignored
sub comment{
  my ($self, $args)= @_;

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

# Input: a hash which consists of pair <Data, Value> and <Target, Value>
# Output: None
# Description:
# 	it is called whenever the parser meets the processing_instructing
#	Currently, it will be ignored
sub processing_instruction{
  my ($self, $args)= @_;
}

# Input: None
# Output: the converted data, that is, content
# Description:
# 	return the content
sub getContent{
  my ($self)= @_;
  return $self->{content};
}

1;
__END__

lib/AI/Categorizer/Experiment.pm  view on Meta::CPAN

C<Statistics::Contingency> for a description of its interface.  All of
its methods are available here, with the following additions:

=over 4

=item new( categories => \%categories )

=item new( categories => \@categories, verbose => 1, sig_figs => 2 )

Returns a new Experiment object.  A required C<categories> parameter
specifies the names of all categories in the data set.  The category
names may be specified either the keys in a reference to a hash, or as
the entries in a reference to an array.

The C<new()> method accepts a C<verbose> parameter which
will cause some status/debugging information to be printed to
C<STDOUT> when C<verbose> is set to a true value.

A C<sig_figs> indicates the number of significant figures that should
be used when showing the results in the C<results_table()> method.  It
does not affect the other methods like C<micro_precision()>.

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN

This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).

=item add_document()

Given a Document object as an argument, this method will add it and
any categories it belongs to to the KnowledgeSet.

=item make_document()

This method will create a Document object with the given data and then
call C<add_document()> to add it to the KnowledgeSet.  A C<categories>
parameter should specify an array reference containing a list of
categories I<by name>.  These are the categories that the document
belongs to.  Any other parameters will be passed to the Document
class's C<new()> method.

=item finish()

This method will be called prior to training the Learner.  Its purpose
is to perform any operations (such as feature vector weighting) that

lib/AI/Categorizer/FeatureVector.pm  view on Meta::CPAN

  $f3 = $f1->intersection($f2);
  $f3 = $f1->add($f2);
  
  $h = $f1->as_hash;
  $h = $f1->as_boolean_hash;
  
  $f1->normalize;

=head1 DESCRIPTION

This class implements a "feature vector", which is a flat data
structure indicating the values associated with a set of features.  At
its base level, a FeatureVector usually represents the set of words in
a document, with the value for each feature indicating the number of
times each word appears in the document.  However, the values are
arbitrary so they can represent other quantities as well, and
FeatureVectors may also be combined to represent the features of
multiple documents.

=head1 METHODS

lib/AI/Categorizer/Hypothesis.pm  view on Meta::CPAN


=head1 METHODS

=over 4

=item new(%parameters)

Returns a new Hypothesis object.  Generally a user of
C<AI::Categorize> doesn't create a Hypothesis object directly - they
are returned by the Learner's C<categorize()> method.  However, if you
wish to create a Hypothesis directly (maybe passing it some fake data
for testing purposes) you may do so using the C<new()> method.

The following parameters are accepted when creating a new Hypothesis:

=over 4

=item all_categories

A required parameter which gives the set of all categories that could
possibly be assigned to.  The categories should be specified as a

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

sub load {
  my ($self, %args) = @_;
  my $c = $self->_make_collection(\%args);

  if ($self->{features_kept}) {
    # Read the whole thing in, then reduce
    $self->read( collection => $c );
    $self->select_features;

  } elsif ($self->{scan_first}) {
    # Figure out the feature set first, then read data in
    $self->scan_features( collection => $c );
    $c->rewind;
    $self->read( collection => $c );

  } else {
    # Don't do any feature reduction, just read the data
    $self->read( collection => $c );
  }
}

sub read {
  my ($self, %args) = @_;
  my $collection = $self->_make_collection(\%args);
  my $pb = $self->prog_bar($collection);
  
  while (my $doc = $collection->next) {

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).  

=item add_document()

Given a Document object as an argument, this method will add it and
any categories it belongs to to the KnowledgeSet.

=item make_document()

This method will create a Document object with the given data and then
call C<add_document()> to add it to the KnowledgeSet.  A C<categories>
parameter should specify an array reference containing a list of
categories I<by name>.  These are the categories that the document
belongs to.  Any other parameters will be passed to the Document
class's C<new()> method.

=item finish()

This method will be called prior to training the Learner.  Its purpose
is to perform any operations (such as feature vector weighting) that

lib/AI/Categorizer/Learner/SVM.pm  view on Meta::CPAN


sub create_model {
  my $self = shift;
  my $f = $self->knowledge_set->features->as_hash;
  my $rmap = [ keys %$f ];
  $self->{model}{feature_map} = { map { $rmap->[$_], $_ } 0..$#$rmap };
  $self->{model}{feature_map_reverse} = $rmap;
  $self->SUPER::create_model(@_);
}

sub _doc_2_dataset {
  my ($self, $doc, $label, $fm) = @_;

  my $ds = new Algorithm::SVM::DataSet(Label => $label);
  my $f = $doc->features->as_hash;
  while (my ($k, $v) = each %$f) {
    next unless exists $fm->{$k};
    $ds->attribute( $fm->{$k}, $v );
  }
  return $ds;
}

sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;
  my $svm = new Algorithm::SVM(Kernel => $self->{svm_kernel});
  
  my (@pos, @neg);
  foreach my $doc (@$positives) {
    push @pos, $self->_doc_2_dataset($doc, 1, $self->{model}{feature_map});
  }
  foreach my $doc (@$negatives) {
    push @neg, $self->_doc_2_dataset($doc, 0, $self->{model}{feature_map});
  }

  $svm->train(@pos, @neg);
  return $svm;
}

sub get_scores {
  my ($self, $doc) = @_;
  local $self->{current_doc} = $self->_doc_2_dataset($doc, -1, $self->{model}{feature_map});
  return $self->SUPER::get_scores($doc);
}

sub get_boolean_score {
  my ($self, $doc, $svm) = @_;
  return $svm->predict($self->{current_doc});
}

sub save_state {
  my ($self, $path) = @_;

lib/AI/Categorizer/Learner/Weka.pm  view on Meta::CPAN

  $nb = AI::Categorizer::Learner->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $nb->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
  }

=head1 DESCRIPTION

This class doesn't implement any machine learners of its own, it
merely passes the data through to the Weka machine learning system
(http://www.cs.waikato.ac.nz/~ml/weka/).  This can give you access to
a collection of machine learning algorithms not otherwise implemented
in C<AI::Categorizer>.

Currently this is a simple command-line wrapper that calls C<java>
subprocesses.  In the future this may be converted to an
C<Inline::Java> wrapper for better performance (faster running
times).  However, if you're looking for really great performance,
you're probably looking in the wrong place - this Weka wrapper is
intended more as a way to try lots of different machine learning

lib/AI/Categorizer/Storable.pm  view on Meta::CPAN

=head1 SYNOPSIS

  $object->save_state($path);
  ... time passes ...
  $object = Class->restore_state($path);
  
=head1 DESCRIPTION

This class implements methods for storing the state of an object to a
file and restoring from that file later.  In C<AI::Categorizer> it is
generally used in order to let data persist across multiple
invocations of a program.

=head1 METHODS

=over 4

=item save_state($path)

This object method saves the object to disk for later use.  The
C<$path> argument indicates the place on disk where the object should

t/01-naive_bayes.t  view on Meta::CPAN


perform_standard_tests(learner_class => 'AI::Categorizer::Learner::NaiveBayes');

#use Carp; $SIG{__DIE__} = \&Carp::confess;

my %docs = training_docs();

{
  ok my $c = new AI::Categorizer(collection_weighting => 'f');
  
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
  
  $c->knowledge_set->finish;

  # Make sure collection_weighting is working
  ok $c->knowledge_set->document_frequency('vampires'), 2;
  for ('vampires', 'mirrors') {
    ok ($c->knowledge_set->document('doc4')->features->as_hash->{$_},
	log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
       );

t/01-naive_bayes.t  view on Meta::CPAN

  
  my $doc = new AI::Categorizer::Document
    ( name => 'test1',
      content => 'I would like to begin farming sheep.' );
  ok $c->learner->categorize($doc)->best_category, 'farming';
}

{
  ok my $c = new AI::Categorizer(term_weighting => 'b');
  
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
  
  $c->knowledge_set->finish;
  
  # Make sure term_weighting is working
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
}

{
  ok my $c = new AI::Categorizer(term_weighting => 'n');
  
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
  
  $c->knowledge_set->finish;
  
  # Make sure term_weighting is working
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{blood}, 0.75;
  ok $c->knowledge_set->document('doc4')->features->as_hash->{mirrors}, 1;
}

{
  ok my $c = new AI::Categorizer(tfidf_weighting => 'txx');
  
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
  
  $c->knowledge_set->finish;
  
  # Make sure term_weighting is working
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 2;
}

t/14-collection.t  view on Meta::CPAN

BEGIN { plan tests => 13 };

use AI::Categorizer;
use File::Spec;
require File::Spec->catfile('t', 'common.pl');

ok 1;  # Loaded

# Test InMemory collection
use AI::Categorizer::Collection::InMemory;
my $c = AI::Categorizer::Collection::InMemory->new(data => {training_docs()});
ok $c;
exercise_collection($c, 4);

# Test Files collection
use AI::Categorizer::Collection::Files;
$c = AI::Categorizer::Collection::Files->new(path => File::Spec->catdir('t', 'traindocs'),
					     category_hash => {
							       doc1 => ['farming'],
							       doc2 => ['farming'],
							       doc3 => ['vampire'],

t/common.pl  view on Meta::CPAN

			      (
			       name => 'Vampires/Farmers',
			       stopwords => [qw(are be in of and)],
			      ),
			      verbose => $ENV{TEST_VERBOSE} ? 1 : 0,
			      %params,
			     );
  ok ref($c), 'AI::Categorizer', "Create an AI::Categorizer object";
  
  my %docs = training_docs();
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }

  my $l = $c->learner;
  ok $l;
  
  if ($params{learner_class}) {
    ok ref($l), $params{learner_class}, "Make sure the correct Learner class is instantiated";
  } else {
    ok 1, 1, "Dummy test";
  }

t/common.pl  view on Meta::CPAN

  
  run_test_docs($l);

  # Make sure we can save state & restore state
  $l->save_state('t/state');
  $l = $l->restore_state('t/state');
  ok $l;

  run_test_docs($l);

  my $train_collection = AI::Categorizer::Collection::InMemory->new(data => $docs);
  ok $train_collection;
  
  my $h = $l->categorize_collection(collection => $train_collection);
  ok $h->micro_precision > 0.5;
}

sub num_setup_tests    () { 3 }
sub num_standard_tests () { num_setup_tests + 17 }

1;



( run in 0.580 second using v1.01-cache-2.11-cpan-2b0bae70ee8 )