AI-Categorizer


Changes

Revision history for Perl extension AI::Categorizer.

 - The t/01-naive_bayes.t test was failing (instead of skipping) when
   Algorithm::NaiveBayes wasn't installed.  Now it skips.

0.08 - Tue Mar 20 19:39:41 2007

 - Added a ChiSquared feature selection class. [Francois Paradis]

 - Changed the web locations of the reuters-21578 corpus that
   eg/demo.pl uses, since the location it referenced previously has
   gone away.

 - The building & installing process now uses Module::Build rather
   than ExtUtils::MakeMaker.

 - When the features_kept mechanism was used to explicitly state the
   features to use, and the scan_first parameter was left as its

Changes

 - Fixed a bug in which the 'documents' and 'categories' parameters to
   KnowledgeSet objects were never accepted; they were rejected as
   failing the "All are Document objects" or "All are Category objects"
   callbacks. [Spotted by rob@phraud.org]

 - Moved the 'stopword_file' parameter from Categorizer.pm to the
   Collection class.

0.05  Sat Mar 29 00:38:21 CST 2003

 - Feature selection is now handled by an abstract FeatureSelector
   framework class.  Currently the only concrete subclass implemented
   is FeatureSelector::DocFrequency.  The 'feature_selection'
   parameter has been replaced with a 'feature_selector_class'
   parameter.

 - Added a k-Nearest-Neighbor machine learner. [First revision
   implemented by David Bell]

 - Added a Rocchio machine learner. [Partially implemented by Xiaobo
   Li]

 - Added a "Guesser" machine learner which simply uses overall class
   probabilities to make categorization decisions.  Sometimes useful

Changes

   is okay).

 - Added the Collection::InMemory class

 - Much more thorough testing with 'make test'.

 - Added add_hypothesis() method to Experiment.

 - Added dot() and value() methods to FeatureVector.

 - Added 'feature_selection' parameter to KnowledgeSet.

 - Added document($name) accessor method to KnowledgeSet.

 - In KnowledgeSet, load(), read(), and scan_*() can now accept a
   Collection object.

 - Added document_frequency(), finish(), and weigh_features() methods
   to KnowledgeSet.

 - Added save_features() and restore_features() to KnowledgeSet.

README

    will be extracted and turned into meaningful representations. In this sense,
    a knowledge set represents not only a collection of data, but a particular
    view on that data.

    A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
    class. Before you can start playing with categorizers, you will have to
    start playing with knowledge sets, so that the categorizers have some data
    to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
    module for information on its interface.
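
    As a rough sketch of that interface (the make_document() arguments, the
    category names, and the document contents below are illustrative, not
    authoritative -- check the KnowledgeSet documentation for the real
    parameter lists):

        use AI::Categorizer::KnowledgeSet;

        # Build a tiny knowledge set by hand; normally you would load the
        # documents from a Collection instead (see "Collections" below).
        my $k = AI::Categorizer::KnowledgeSet->new(verbose => 1);

        $k->make_document(name       => 'recipe-1',
                          content    => 'flour sugar eggs butter vanilla',
                          categories => ['baking']);
        $k->make_document(name       => 'bug-report-7',
                          content    => 'perl hash array regex stack trace',
                          categories => ['programming']);

        print "Loaded ", scalar($k->documents), " documents\n";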

   Feature selection

    Deciding which features are the most important is a very large part of the
    categorization task - you cannot simply consider all the words in all the
    documents when training, and all the words in the document being
    categorized. There are two main reasons for this - first, it would mean that
    your training and categorizing processes would take forever and use tons of
    memory, and second, the significant stuff of the documents would get lost in
    the "noise" of the insignificant stuff.

    The process of selecting the most important features in the training set is
    called "feature selection". It is managed by the
    "AI::Categorizer::KnowledgeSet" class, and you will find the details of
    feature selection processes in that class's documentation.
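
    For example (a hedged sketch; the parameter names follow this
    distribution's own documentation, but check the KnowledgeSet and
    FeatureSelector docs for the authoritative interface), you can control
    how aggressively features are pruned, or swap in a different selection
    measure:

        use AI::Categorizer::KnowledgeSet;
        use AI::Categorizer::FeatureSelector::ChiSquare;

        # Keep only the best 10% of features, ranked by the default
        # document-frequency selector.
        my $k1 = AI::Categorizer::KnowledgeSet->new(features_kept => 0.1);

        # Or keep the 2000 best features as ranked by the Chi-Square
        # selector added in version 0.08.
        my $k2 = AI::Categorizer::KnowledgeSet->new(
          feature_selector =>
            AI::Categorizer::FeatureSelector::ChiSquare->new(features_kept => 2000),
        );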

  Collections

    Because documents may be stored in lots of different formats, a "collection"
    class has been created as an abstraction of a stored set of documents,
    together with a way to iterate through the set and return Document objects.
    A knowledge set contains a single collection object. A "Categorizer" doing a
    complete test run generally contains two collections, one for training and
    one for testing. A "Learner" can mass-categorize a collection.
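
    A minimal sketch of that last point (assuming $learner is any trained
    AI::Categorizer::Learner subclass and $test is any Collection subclass;
    best_category() is a method of the Hypothesis object returned by
    categorize()):

        while (my $doc = $test->next) {
          my $h = $learner->categorize($doc);
          printf "%s -> %s\n", $doc->name, $h->best_category;
        }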

eg/categorizer

if ($@ and $@ =~ /^The following parameter/) {
  die "$@\nPlease see the AI::Categorizer documentation for a description of parameters accepted.\n";
}
die $@ if $@;

%$do_stage = map {$_, 1} 1..5 unless keys %$do_stage;

my $out_fh;
if ($outfile) {
  open $out_fh, ">> $outfile" or die "Can't create $outfile: $!";
  # Enable autoflush on $out_fh without changing the currently selected filehandle
  select((select($out_fh), $|=1)[0]);
  if (keys(%$do_stage) > 1) {
    print $out_fh "~~~~~~~~~~~~~~~~", scalar(localtime), "~~~~~~~~~~~~~~~~~~~~~~~~~~~\n";
    if ($HAVE_YAML) {
      print {$out_fh} YAML::Dump($c->dump_parameters);
    } else {
      warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
    }
  }
}
  

lib/AI/Categorizer.pm

meaningful representations.  In this sense, a knowledge set represents
not only a collection of data, but a particular view on that data.

A knowledge set is encapsulated by the
C<AI::Categorizer::KnowledgeSet> class.  Before you can start playing
with categorizers, you will have to start playing with knowledge sets,
so that the categorizers have some data to train on.  See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.

=head3 Feature selection

Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized.  There are two main reasons for this - first, it
would mean that your training and categorizing processes would take
forever and use tons of memory, and second, the significant stuff of
the documents would get lost in the "noise" of the insignificant stuff.

The process of selecting the most important features in the training
set is called "feature selection".  It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.

=head2 Collections

Because documents may be stored in lots of different formats, a
"collection" class has been created as an abstraction of a stored set
of documents, together with a way to iterate through the set and
return Document objects.  A knowledge set contains a single collection
object.  A C<Categorizer> doing a complete test run generally contains
two collections, one for training and one for testing.  A C<Learner>
can mass-categorize a collection.
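
Putting those pieces together, a complete run looks roughly like the
sketch below.  The directory paths and the C<training_set>/C<test_set>
parameter names are illustrative; see the parameter documentation in
this module for the authoritative list.

  use AI::Categorizer;

  my $c = AI::Categorizer->new(
    training_set => '/corpus/training',   # Collection used for training
    test_set     => '/corpus/test',       # Collection used for testing
  );

  # Run the whole experiment in one call...
  $c->run_experiment;

  # ...or run its stages one at a time.
  $c->scan_features;
  $c->read_training_set;
  $c->train;
  $c->evaluate_test_set;
  print $c->stats_table;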

lib/AI/Categorizer/Collection/DBI.pm

use DBI;
use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);

use Params::Validate qw(:types);

__PACKAGE__->valid_params
  (
   connection_string => {type => SCALAR, default => undef},
   dbh => {isa => 'DBI::db', default => undef},
   select_statement => {type => SCALAR, default => "SELECT text FROM documents"},
  );

__PACKAGE__->contained_objects
  (
   document => { class => 'AI::Categorizer::Document',
		 delayed => 1 },
  );
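
# Usage sketch (not part of this module's code): the connection string,
# table name and column name below are hypothetical -- any DBI data
# source whose select_statement returns one text column should behave
# the same way.
#
#   use AI::Categorizer::Collection::DBI;
#
#   my $c = AI::Categorizer::Collection::DBI->new(
#     connection_string => 'dbi:SQLite:dbname=corpus.db',
#     select_statement  => 'SELECT text FROM documents',
#   );
#   while (my $doc = $c->next) {
#     # each $doc is an AI::Categorizer::Document object
#   }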

sub new {
  my $class = shift;

lib/AI/Categorizer/Collection/DBI.pm

  $self->rewind;
  return $self;
}

sub dbh { shift()->{dbh} }

sub rewind {
  my $self = shift;
  
  if (!$self->{sth}) {
    $self->{sth} = $self->dbh->prepare($self->{select_statement});
  }

  if ($self->{sth}{Active}) {
    $self->{sth}->finish;
  }

  $self->{sth}->execute;
}

sub next {

lib/AI/Categorizer/FeatureSelector.pm


  my $result = $f->intersection( \@new_features );
  print "Finished trimming features - # features = " . $result->length . "\n" if $self->verbose;
  return $result;
}

# Abstract methods
sub rank_features;
sub scan_features;

sub select_features {
  my ($self, %args) = @_;
  
  die "No knowledge_set parameter provided to select_features()"
    unless $args{knowledge_set};

  my $f = $self->rank_features( knowledge_set => $args{knowledge_set} );
  return $self->reduce_features( $f, features_kept => $args{features_kept} );
}
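
# Illustrative sketch of a concrete subclass (hypothetical; the real
# subclasses shipped with this distribution are DocFrequency and
# ChiSquare).  A subclass implements rank_features(), returning a
# FeatureVector that scores each feature; select_features() above then
# trims that vector down to features_kept.
#
#   package AI::Categorizer::FeatureSelector::TermLengthDemo;
#   use base qw(AI::Categorizer::FeatureSelector);
#   use AI::Categorizer::FeatureVector;
#
#   sub rank_features {
#     my ($self, %args) = @_;
#     my $features = $args{knowledge_set}->features->as_hash;
#     my %scores   = map { $_ => length $_ } keys %$features;  # toy ranking
#     return AI::Categorizer::FeatureVector->new(features => \%scores);
#   }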


1;

__END__

lib/AI/Categorizer/FeatureSelector.pm

complete set of documents in a KnowledgeSet.  If used, the
C<categories> parameter should also be specified.

=item features_kept

A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents.  May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used.  The default is
0.2.

=item feature_selection

A string indicating the type of feature selection that should be
performed.  Currently the only option is also the default option:
C<document_frequency>.

=item tfidf_weighting

Specifies how document word counts should be converted to vector
values.  Uses the three-character specification strings from Salton &
Buckley's paper "Term-weighting approaches in automatic text
retrieval".  The three characters indicate the three factors that will
be multiplied for each feature to find the final vector value for that

lib/AI/Categorizer/FeatureSelector.pm


=item scan_stats()

Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)

=item scan_features()

This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner.  This process is known as "feature selection", and it's a
very important part of categorization.

The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.

The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.

This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies.  Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.

=item save_features()

Given the name of a file, this method writes the features (as
determined by the C<scan_features> method) to the file.

=item restore_features()

lib/AI/Categorizer/FeatureSelector.pm


=item read()

Iterates through a Collection of documents and adds them to the
KnowledgeSet.  The Collection can be specified using a C<collection>
parameter - otherwise, specify the arguments to pass to the C<new()>
method of the Collection class.

=item load()

This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).

=item add_document()

Given a Document object as an argument, this method will add it and
any categories it belongs to to the KnowledgeSet.

=item make_document()

This method will create a Document object with the given data and then

lib/AI/Categorizer/FeatureSelector/ChiSquare.pm

AI::Categorizer::FeatureSelector::ChiSquare - ChiSquare Feature Selection class

=head1 SYNOPSIS

 # the recommended way to use this class is to let the KnowledgeSet
 # instantiate it

 use AI::Categorizer::KnowledgeSetSMART;
 my $ksetCHI = new AI::Categorizer::KnowledgeSetSMART(
   tfidf_notation =>'Categorizer',
   feature_selection=>'chi_square', ...other parameters...); 

 # however it is also possible to pass an instance to the KnowledgeSet

 use AI::Categorizer::KnowledgeSet;
 use AI::Categorizer::FeatureSelector::ChiSquare;
 my $ksetCHI = new AI::Categorizer::KnowledgeSet(
   feature_selector => new ChiSquare(features_kept=>2000,verbose=>1),
   ...other parameters...
   );

=head1 DESCRIPTION

Feature selection with the ChiSquare function.

  Chi-Square(t,ci) =       N.(AD-CB)^2
                     ------------------------
                     (A+C).(B+D).(A+B).(C+D)

where t = term
      ci = category i
      N = number of documents in the collection
      A = number of times where t and c co-occur
      B =   "     "   "   t occurs without c

lib/AI/Categorizer/KnowledgeSet.pm

		  default => [],
		  callbacks => { 'all are Document objects' => 
				 sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Document'),
					 @{$_[0]} },
			       },
		 },
   scan_first => {
		  type => BOOLEAN,
		  default => 1,
		 },
   feature_selector => {
			isa => 'AI::Categorizer::FeatureSelector',
		       },
   tfidf_weighting  => {
			type => SCALAR,
			optional => 1,
		       },
   term_weighting  => {
		       type => SCALAR,
		       default => 'x',
		      },

lib/AI/Categorizer/KnowledgeSet.pm

__PACKAGE__->contained_objects
  (
   document => { delayed => 1,
		 class => 'AI::Categorizer::Document' },
   category => { delayed => 1,
		 class => 'AI::Categorizer::Category' },
   collection => { delayed => 1,
		   class => 'AI::Categorizer::Collection::Files' },
   features => { delayed => 1,
		 class => 'AI::Categorizer::FeatureVector' },
   feature_selector => 'AI::Categorizer::FeatureSelector::DocFrequency',
  );

sub new {
  my ($pkg, %args) = @_;
  
  # Shortcuts
  if ($args{tfidf_weighting}) {
    @args{'term_weighting', 'collection_weighting', 'normalize_weighting'} = split '', $args{tfidf_weighting};
    delete $args{tfidf_weighting};
  }

lib/AI/Categorizer/KnowledgeSet.pm

sub documents {
  my $d = $_[0]->{documents};
  return wantarray ? $d->members : $d->size;
}

sub document {
  my ($self, $name) = @_;
  return $self->{documents}->retrieve($name);
}
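
# Usage note (illustrative): documents() is context-sensitive -- it
# returns the Document objects in list context and their count in
# scalar context -- while document($name) looks a single one up by name.
#
#   my $count = $knowledge_set->documents;            # how many documents
#   my @docs  = $knowledge_set->documents;            # the Document objects
#   my $doc   = $knowledge_set->document('doc-42');   # look one up by name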

sub feature_selector { $_[0]->{feature_selector} }
sub scan_first       { $_[0]->{scan_first} }

sub verbose {
  my $self = shift;
  $self->{verbose} = shift if @_;
  return $self->{verbose};
}

sub trim_doc_features {
  my ($self) = @_;

lib/AI/Categorizer/KnowledgeSet.pm

  return \%stats;
}

sub load {
  my ($self, %args) = @_;
  my $c = $self->_make_collection(\%args);

  if ($self->{features_kept}) {
    # Read the whole thing in, then reduce
    $self->read( collection => $c );
    $self->select_features;

  } elsif ($self->{scan_first}) {
    # Figure out the feature set first, then read data in
    $self->scan_features( collection => $c );
    $c->rewind;
    $self->read( collection => $c );

  } else {
    # Don't do any feature reduction, just read the data
    $self->read( collection => $c );

lib/AI/Categorizer/KnowledgeSet.pm

  }
  
  return exists $self->{doc_freq_vector}{$term} ? $self->{doc_freq_vector}{$term} : 0;
}

sub scan_features {
  my ($self, %args) = @_;
  my $c = $self->_make_collection(\%args);

  my $pb = $self->prog_bar($c);
  my $ranked_features = $self->{feature_selector}->scan_features( collection => $c, prog_bar => $pb );

  $self->delayed_object_params('document', use_features => $ranked_features);
  $self->delayed_object_params('collection', use_features => $ranked_features);
  return $ranked_features;
}

sub select_features {
  my $self = shift;
  
  my $f = $self->feature_selector->select_features(knowledge_set => $self);
  $self->features($f);
}

sub partition {
  my ($self, @sizes) = @_;
  my $num_docs = my @docs = $self->documents;
  my @groups;

  while (@sizes > 1) {
    my $size = int ($num_docs * shift @sizes);

lib/AI/Categorizer/KnowledgeSet.pm

complete set of documents in a KnowledgeSet.  If used, the
C<categories> parameter should also be specified.

=item features_kept

A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents.  May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used.  The default is
0.2.

=item feature_selection

A string indicating the type of feature selection that should be
performed.  Currently the only option is also the default option:
C<document_frequency>.

=item tfidf_weighting

Specifies how document word counts should be converted to vector
values.  Uses the three-character specification strings from Salton &
Buckley's paper "Term-weighting approaches in automatic text
retrieval".  The three characters indicate the three factors that will
be multiplied for each feature to find the final vector value for that

lib/AI/Categorizer/KnowledgeSet.pm


=item scan_stats()

Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)

=item scan_features()

This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner.  This process is known as "feature selection", and it's a
very important part of categorization.

The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.

The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.

This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies.  Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.

=item save_features()

Given the name of a file, this method writes the features (as
determined by the C<scan_features> method) to the file.

=item restore_features()

lib/AI/Categorizer/KnowledgeSet.pm


=item read()

Iterates through a Collection of documents and adds them to the
KnowledgeSet.  The Collection can be specified using a C<collection>
parameter - otherwise, specify the arguments to pass to the C<new()>
method of the Collection class.

=item load()

This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).  

=item add_document()

Given a Document object as an argument, this method will add it and
any categories it belongs to to the KnowledgeSet.

=item make_document()

This method will create a Document object with the given data and then

lib/AI/Categorizer/Util.pm


sub _hashify {
  return $_[0] if UNIVERSAL::isa($_[0], 'HASH');
  return {map {$_=>1} @{$_[0]}};
}

sub random_elements {
  my ($a_ref, $n) = @_;
  return @$a_ref if $n >= @$a_ref;
  
  # Pick whichever is cheaper to generate: $n indices to include, or
  # (@$a_ref - $n) indices to exclude.
  my ($select, $mode) = ($n < @$a_ref/2) ? ($n, 'include') : (@$a_ref - $n, 'exclude');

  # Collect $select distinct random indices.
  my %i;
  $i{int rand @$a_ref} = 1 while keys(%i) < $select;

  # 'include' returns the chosen slice; 'exclude' returns everything else,
  # preserving the original order.
  return @{$a_ref}[keys %i] if $mode eq 'include';
  return map {$i{$_} ? () : $a_ref->[$_]} 0..$#$a_ref;
}
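
# Usage sketch (illustrative): pick 3 distinct elements at random.
#
#   my @sample = AI::Categorizer::Util::random_elements([1 .. 10], 3);
#   # @sample now holds three of the ten values; order is not guaranteed.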

1;


