AI-Categorizer


Changes

   Algorithm::NaiveBayes wasn't installed.  Now it skips.

0.08 - Tue Mar 20 19:39:41 2007

 - Added a ChiSquared feature selection class. [Francois Paradis]

 - Changed the web locations of the reuters-21578 corpus that
   eg/demo.pl uses, since the location it referenced previously has
   gone away.

 - The building & installing process now uses Module::Build rather
   than ExtUtils::MakeMaker.

 - When the features_kept mechanism was used to explicitly state the
   features to use, and the scan_first parameter was left as its
   default value, the features_kept mechanism would silently fail to
   do anything.  This has now been fixed. [Spotted by Arnaud Gaudinat]

 - Recent versions of Weka have changed the name of the SVM class, so
   I've updated it in our test (t/03-weka.t) of the Weka wrapper
   too. [Sebastien Aperghis-Tramoni]

README

 
DESCRIPTION
    "AI::Categorizer" is a framework for automatic text categorization. It
    consists of a collection of Perl modules that implement common
    categorization tasks, and a set of defined relationships among those
    modules. The various details are flexible - for example, you can choose what
    categorization algorithm to use, what features (words or otherwise) of the
    documents should be used (or how to automatically choose these features),
    what format the documents are in, and so on.

    The basic process of using this module will typically involve obtaining a
    collection of pre-categorized documents, creating a "knowledge set"
    representation of those documents, training a categorizer on that knowledge
    set, and saving the trained categorizer for later use. There are several
    ways to carry out this process. The top-level "AI::Categorizer" module
    provides an umbrella class for high-level operations, or you may use the
    interfaces of the individual classes in the framework.

    A simple sample script that reads a training corpus, trains a categorizer,
    and tests the categorizer on a test corpus, is distributed as eg/demo.pl .

    Disclaimer: the results of any of the machine learning algorithms are far
    from infallible (close to fallible?). Categorization of documents is often a
    difficult task even for humans well-trained in the particular domain of
    knowledge, and there are many things a human would consider that none of
    these algorithms consider. These are only statistical tests - at best they
    are neat tricks or helpful assistants, and at worst they are totally
    unreliable. If you plan to use this module for anything really important,
    human supervision is essential, both of the categorization process and the
    final results.

    For the usage details, please see the documentation of each individual
    module.

FRAMEWORK COMPONENTS
    This section explains the major pieces of the "AI::Categorizer" object
    framework. We give a conceptual overview, but don't get into any of the
    details about interfaces or usage. See the documentation for the individual
    classes for more details.

README

    start playing with knowledge sets, so that the categorizers have some data
    to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
    module for information on its interface.

   Feature selection

    Deciding which features are the most important is a very large part of the
    categorization task - you cannot simply consider all the words in all the
    documents when training, and all the words in the document being
    categorized. There are two main reasons for this - first, it would mean that
    your training and categorizing processes would take forever and use tons of
    memory, and second, the significant stuff of the documents would get lost in
    the "noise" of the insignificant stuff.

    The process of selecting the most important features in the training set is
    called "feature selection". It is managed by the
    "AI::Categorizer::KnowledgeSet" class, and you will find the details of
    feature selection processes in that class's documentation.

  Collections

    Because documents may be stored in lots of different formats, a "collection"
    class has been created as an abstraction of a stored set of documents,
    together with a way to iterate through the set and return Document objects.
    A knowledge set contains a single collection object. A "Categorizer" doing a
    complete test run generally contains two collections, one for training and
    one for testing. A "Learner" can mass-categorize a collection.

lib/AI/Categorizer.pm

=head1 DESCRIPTION

C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules.  The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically
choose these features), what format the documents are in, and so on.

The basic process of using this module will typically involve
obtaining a collection of B<pre-categorized> documents, creating a
"knowledge set" representation of those documents, training a
categorizer on that knowledge set, and saving the trained categorizer
for later use.  There are several ways to carry out this process.  The
top-level C<AI::Categorizer> module provides an umbrella class for
high-level operations, or you may use the interfaces of the individual
classes in the framework.

A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus, is
distributed as eg/demo.pl .

Disclaimer: the results of any of the machine learning algorithms are
far from infallible (close to fallible?).  Categorization of documents
is often a difficult task even for humans well-trained in the
particular domain of knowledge, and there are many things a human
would consider that none of these algorithms consider.  These are only
statistical tests - at best they are neat tricks or helpful
assistants, and at worst they are totally unreliable.  If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.

For the usage details, please see the documentation of each individual
module.

=head1 FRAMEWORK COMPONENTS

This section explains the major pieces of the C<AI::Categorizer>
object framework.  We give a conceptual overview, but don't get into
any of the details about interfaces or usage.  See the documentation
for the individual classes for more details.

lib/AI/Categorizer.pm

so that the categorizers have some data to train on.  See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.

=head3 Feature selection

Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized.  There are two main reasons for this - first, it
would mean that your training and categorizing processes would take
forever and use tons of memory, and second, the significant stuff of
the documents would get lost in the "noise" of the insignificant stuff.

The process of selecting the most important features in the training
set is called "feature selection".  It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.
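
To make the idea concrete, here is a minimal, self-contained sketch of one simple selection strategy: rank words by the number of documents they appear in and keep only the top few. It does not use the C<AI::Categorizer> classes at all; the function name C<select_by_doc_frequency> and its arguments are assumptions made for this example, not part of the framework's interface.

```perl
use strict;
use warnings;

# Count, for each word, the number of documents that contain it
# (its "document frequency"), then keep the $keep highest-ranked words.
sub select_by_doc_frequency {
  my ($docs, $keep) = @_;          # $docs: array ref of document strings
  my %df;
  for my $text (@$docs) {
    my %seen;
    # Count each word at most once per document
    $df{$_}++ for grep { !$seen{$_}++ } split ' ', lc $text;
  }
  # Sort by descending frequency, breaking ties alphabetically
  my @ranked = sort { $df{$b} <=> $df{$a} or $a cmp $b } keys %df;
  $#ranked = $keep - 1 if @ranked > $keep;
  return { map { $_ => $df{$_} } @ranked };
}

my $kept = select_by_doc_frequency(
  [ 'the cat sat', 'the dog sat', 'the bird flew' ],
  2,
);
print join(', ', map { "$_=$kept->{$_}" } sort keys %$kept), "\n";
```

With the three toy documents above, only "the" and "sat" survive, since every other word occurs in a single document. A real selector would usually also discard very common words, which carry little information about the category.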

=head2 Collections

Because documents may be stored in lots of different formats, a
"collection" class has been created as an abstraction of a stored set
of documents, together with a way to iterate through the set and
return Document objects.  A knowledge set contains a single collection
object.  A C<Categorizer> doing a complete test run generally contains
two collections, one for training and one for testing.  A C<Learner>
can mass-categorize a collection.
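
The abstraction can be pictured as an iterator with a way to start over. The sketch below is a hypothetical in-memory stand-in, assuming only that a collection hands out one document at a time and can be rewound; the package name C<My::ArrayCollection> and the use of plain hashes as documents are inventions for this example, not the framework's classes.

```perl
use strict;
use warnings;

# Hypothetical in-memory collection: hands out one document at a time.
package My::ArrayCollection;

sub new {
  my ($class, @docs) = @_;
  return bless { docs => [@docs], pos => 0 }, $class;
}

# Return the next document, or undef when the collection is exhausted
sub next {
  my $self = shift;
  return if $self->{pos} >= @{ $self->{docs} };
  return $self->{docs}[ $self->{pos}++ ];
}

# Start over from the first document
sub rewind { $_[0]{pos} = 0 }

package main;

my $c = My::ArrayCollection->new(
  { name => 'doc1', content => 'first document' },
  { name => 'doc2', content => 'second document' },
);
while (my $doc = $c->next) {
  print "$doc->{name}: $doc->{content}\n";
}
$c->rewind;   # the same documents can now be read again
```

Because callers only ever ask for the next document, the storage format (files, a database, a single archive) can change without touching the code that trains or categorizes.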

lib/AI/Categorizer/Collection/Files.pm

  
  my $file = readdir $self->{dir_fh};

  if (!defined $file) { # Directory has been exhausted
    return undef unless @{$self->{path}};
    $self->_next_path;
    return $self->_read_file;
  } elsif ($file eq '.' or $file eq '..') {
    return $self->_read_file;
  } elsif (-d (my $path = File::Spec->catdir($self->{cur_dir}, $file))) {
    push @{$self->{path}}, $path  # Add for later processing
      if $self->{recurse} and !grep {$_ eq $path} @{$self->{path}}, @{$self->{used}};
    return $self->_read_file;
  }
  return $file;
}

sub rewind {
  my $self = shift;
  push @{$self->{path}}, @{$self->{used}};
  @{$self->{used}} = ();

lib/AI/Categorizer/Document/XML.pm


sub parse {
  my ($self, %args) = @_;

  # a string containing the XML content
  my $body = $args{content};

  # a hash of <elementName, weight> pairs
  my $elementWeight = $args{elementWeight};

  # construct the Handler, which receives element, data, comment, and
  # processing_instruction events, converts their values into a sequence
  # of strings, and saves them in a buffer
  my $xmlHandler = $self->create_contained_object('xml_handler', weights => $elementWeight);

  # construct the parser
  my $xmlParser = XML::SAX::ParserFactory->parser(Handler => $xmlHandler);

  # start parsing the XML; the Handler's methods will be called as events occur
  $xmlParser->parse_string($body);

  # extract the converted string from the Handler

lib/AI/Categorizer/Document/XML.pm

# Description:
# 	called whenever the parser encounters a comment
#	Comments are currently ignored
sub comment{
  my ($self, $args)= @_;
}

# Input: a hash consisting of the pairs <Data, Value> and <Target, Value>
# Output: None
# Description:
# 	called whenever the parser encounters a processing instruction
#	Processing instructions are currently ignored
sub processing_instruction{
  my ($self, $args)= @_;
}

# Input: None
# Output: the converted data, that is, content
# Description:
# 	return the content
sub getContent{
  my ($self)= @_;
  return $self->{content};

lib/AI/Categorizer/FeatureSelector.pm

AI::Categorizer::FeatureSelector - Abstract Feature Selection class

=head1 SYNOPSIS

 ...

=head1 DESCRIPTION

The KnowledgeSet class provides an interface to a set of
documents, a set of categories, and a mapping between the two.  Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.

=head1 METHODS

=over 4

=item new()

Creates a new KnowledgeSet and returns it.  Accepts the following
parameters:

lib/AI/Categorizer/FeatureSelector.pm


=item scan_stats()

Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)

=item scan_features()

This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner.  This process is known as "feature selection", and it's a
very important part of categorization.

The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.

The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.

This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies.  Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.
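
As an illustration of what a per-feature "quality" score can look like, the sketch below computes the textbook chi-squared statistic for one feature and one category from a 2x2 table of document counts. The Changes file mentions a ChiSquared feature selection class, but this code is not taken from it; the function name C<chi_squared> and its argument order are assumptions made for the example.

```perl
use strict;
use warnings;

# Chi-squared statistic for one feature and one category.
# Arguments are document counts from a 2x2 contingency table:
#   $in_with:     in the category, containing the feature
#   $out_with:    outside the category, containing the feature
#   $in_without:  in the category, lacking the feature
#   $out_without: outside the category, lacking the feature
sub chi_squared {
  my ($in_with, $out_with, $in_without, $out_without) = @_;
  my $n = $in_with + $out_with + $in_without + $out_without;
  my $denominator = ($in_with + $in_without) * ($out_with + $out_without)
                  * ($in_with + $out_with)   * ($in_without + $out_without);
  return 0 unless $denominator;   # an empty margin gives no evidence either way
  my $diff = $in_with * $out_without - $out_with * $in_without;
  return $n * $diff**2 / $denominator;
}

printf "%.2f\n", chi_squared(10, 0, 0, 10);  # feature perfectly tracks the category
printf "%.2f\n", chi_squared(5, 5, 5, 5);    # feature independent of the category
```

The first call prints 20.00 and the second 0.00: the more strongly a feature's presence depends on the category, the higher its score, so sorting features by this value and keeping the top ones is one way to perform the selection described above.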

lib/AI/Categorizer/FeatureSelector/DocFrequency.pm

AI::Categorizer::FeatureSelector - Abstract Feature Selection class

=head1 SYNOPSIS

 ...

=head1 DESCRIPTION

The KnowledgeSet class provides an interface to a set of
documents, a set of categories, and a mapping between the two.  Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.

=head1 METHODS

=over 4

=item new()

Creates a new KnowledgeSet and returns it.  Accepts the following
parameters:

lib/AI/Categorizer/KnowledgeSet.pm


 use AI::Categorizer::KnowledgeSet;
 my $k = new AI::Categorizer::KnowledgeSet(...parameters...);
 my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
 $nb->train(knowledge_set => $k);

=head1 DESCRIPTION

The KnowledgeSet class provides an interface to a set of
documents, a set of categories, and a mapping between the two.  Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.

=head1 METHODS

=over 4

=item new()

Creates a new KnowledgeSet and returns it.  Accepts the following
parameters:

lib/AI/Categorizer/KnowledgeSet.pm


=item scan_stats()

Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)

=item scan_features()

This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner.  This process is known as "feature selection", and it's a
very important part of categorization.

The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.

The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.

This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies.  Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.

lib/AI/Categorizer/Learner/Weka.pm


=head1 DESCRIPTION

This class doesn't implement any machine learners of its own; it
merely passes the data through to the Weka machine learning system
(http://www.cs.waikato.ac.nz/~ml/weka/).  This can give you access to
a collection of machine learning algorithms not otherwise implemented
in C<AI::Categorizer>.

Currently this is a simple command-line wrapper that calls C<java>
subprocesses.  In the future this may be converted to an
C<Inline::Java> wrapper for better performance (faster running
times).  However, if you're looking for really great performance,
you're probably looking in the wrong place - this Weka wrapper is
intended more as a way to try lots of different machine learning
methods.
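
To give a feel for what a command-line wrapper of this kind has to do, the sketch below assembles an argument list for a C<java> subprocess. It is not this class's actual code: the helper name C<weka_command>, the C<weka_args> key, and the file names are assumptions made for the example, while C<java_path>, C<java_args>, C<weka_path>, and C<weka_classifier> mirror the parameters documented for this class.

```perl
use strict;
use warnings;

# Assemble the argument list for a Weka run (hypothetical helper).
sub weka_command {
  my (%args) = @_;
  # Fall back to plain "java", which is then looked up in PATH
  my @cmd = ($args{java_path} || 'java', @{ $args{java_args} || [] });
  # Only add a classpath when weka.jar is not already on CLASSPATH
  push @cmd, '-classpath', $args{weka_path} if defined $args{weka_path};
  push @cmd, $args{weka_classifier}, @{ $args{weka_args} || [] };
  return @cmd;
}

my @cmd = weka_command(
  java_args       => ['-Xmx130M'],
  weka_path       => '/path/to/weka.jar',          # placeholder path
  weka_classifier => 'weka.classifiers.bayes.NaiveBayes',
  weka_args       => ['-t', 'train.arff', '-T', 'test.arff'],
);
print "@cmd\n";

# To run it, pass the list form to system() so no shell quoting is needed:
# system(@cmd) == 0 or die "java exited with status $?";
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces. Starting a fresh JVM for every call is also the main reason such a wrapper is slow.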

=head1 METHODS

This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available unless explicitly mentioned here.

lib/AI/Categorizer/Learner/Weka.pm


=item java_path

Specifies where the C<java> executable can be found on this system.
The default is simply C<java>, meaning that it will search your
C<PATH> to find java.

=item java_args

Specifies a list of any additional arguments to give to the java
process.  Commonly it's necessary to allocate more memory than the
default, using an argument like C<-Xmx130M>.

=item weka_path

Specifies the path to the C<weka.jar> file containing the Weka
bytecode.  If Weka has been installed somewhere in your java
C<CLASSPATH>, you needn't specify a C<weka_path>.

=item weka_classifier

t/13-document.t

}
  

# Test some stemming & stopword stuff.
{
  my $d = $docclass->new
    (
     name => 'test',
     stopwords => ['stemmed'],
     stemming => 'porter',
     content  => 'stopword processing should happen after stemming',
     # Becomes qw(stopword process    should happen after stem    )
    );
  ok $d->stopword_behavior, 'stem', "stopword_behavior() is 'stem'";
  
  ok $d->features->includes('stopword'), 1,  "Should include 'stopword'";
  ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
  ok $d->features->includes('stem'),     '', "Shouldn't include 'stem'";
  print "Features: @{[ $d->features->names ]}\n";
}

{
  my $d = $docclass->new
    (
     name => 'test',
     stopwords => ['stemmed'],
     stemming => 'porter',
     stopword_behavior => 'no_stem',
     content  => 'stopword processing should happen after stemming',
     # Becomes qw(stopword process    should happen after stem    )
    );
  ok $d->stopword_behavior, 'no_stem', "stopword_behavior() is 'no_stem'";
  
  ok $d->features->includes('stopword'), 1,  "Should include 'stopword'";
  ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
  ok $d->features->includes('stem'),     1,  "Should include 'stem'";
  print "Features: @{[ $d->features->names ]}\n";
}

{
  my $d = $docclass->new
    (
     name => 'test',
     stopwords => ['stem'],
     stemming => 'porter',
     stopword_behavior => 'pre_stemmed',
     content  => 'stopword processing should happen after stemming',
     # Becomes qw(stopword process    should happen after stem    )
    );
  ok $d->stopword_behavior, 'pre_stemmed', "stopword_behavior() is 'pre_stemmed'";
  
  ok $d->features->includes('stopword'), 1,  "Should include 'stopword'";
  ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
  ok $d->features->includes('stem'),     '', "Shouldn't include 'stem'";
  print "Features: @{[ $d->features->names ]}\n";
}
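
The three C<stopword_behavior> settings exercised above differ only in when the stopword list is compared with the document's words. The following stand-alone sketch reproduces that ordering with a toy stemmer; it does not use the Document class or a real Porter stemmer, and C<toy_stem> and C<features> are hypothetical helpers written for this illustration.

```perl
use strict;
use warnings;

# Toy stemmer standing in for the Porter algorithm: strips "ing"/"ed" and
# then undoubles a final consonant ("stemming" -> "stemm" -> "stem").
sub toy_stem {
  my ($word) = @_;
  if ($word =~ s/(?:ing|ed)$//) {
    $word =~ s/([^aeioulsz])\1$/$1/;
  }
  return $word;
}

sub features {
  my (%args) = @_;
  my $behavior = $args{stopword_behavior} || 'stem';
  my %stop  = map { $_ => 1 } @{ $args{stopwords} };
  my @words = split ' ', lc $args{content};

  if ($behavior eq 'no_stem') {
    # Stopwords are compared with the raw words, before any stemming
    return map { toy_stem($_) } grep { !$stop{$_} } @words;
  }
  # 'stem': stem the stopword list first; 'pre_stemmed': it is already stemmed
  %stop = map { (toy_stem($_) => 1) } keys %stop if $behavior eq 'stem';
  return grep { !$stop{$_} } map { toy_stem($_) } @words;
}

my $content = 'stopword processing should happen after stemming';
for my $case (['stem', 'stemmed'], ['no_stem', 'stemmed'], ['pre_stemmed', 'stem']) {
  my ($behavior, $stopword) = @$case;
  my @f = features(content => $content, stopwords => [$stopword],
                   stopword_behavior => $behavior);
  print "$behavior: @f\n";
}
```

Only the C<no_stem> case keeps "stem" among the features, because the unstemmed stopword "stemmed" never matches the raw word "stemming". This matches the expectations in the three test blocks.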


