Revision history for Perl extension AI::Categorizer.
- The t/01-naive_bayes.t test was failing (instead of skipping) when
Algorithm::NaiveBayes wasn't installed. Now it skips.
0.08 - Tue Mar 20 19:39:41 2007
- Added a ChiSquared feature selection class. [Francois Paradis]
- Changed the web locations of the reuters-21578 corpus that
eg/demo.pl uses, since the location it referenced previously has
gone away.
- The building & installing process now uses Module::Build rather
than ExtUtils::MakeMaker.
- When the features_kept mechanism was used to explicitly state the
features to use, and the scan_first parameter was left as its
default value, the features_kept mechanism would silently fail to
do anything. This has now been fixed. [Spotted by Arnaud Gaudinat]
- Recent versions of Weka have changed the name of the SVM class, so
I've updated it in our test (t/03-weka.t) of the Weka wrapper
too. [Sebastien Aperghis-Tramoni]
0.07 Tue May 6 16:15:04 CDT 2003
- Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
MANIFEST, so they weren't included in the 0.06 distribution.
failed the "All are Document objects" or "All are Category objects"
callbacks. [Spotted by rob@phraud.org]
- Moved the 'stopword_file' parameter from Categorizer.pm to the
Collection class.
0.05 Sat Mar 29 00:38:21 CST 2003
- Feature selection is now handled by an abstract FeatureSelector
framework class. Currently the only concrete subclass implemented
is FeatureSelector::DocFrequency. The 'feature_selection'
parameter has been replaced with a 'feature_selector_class'
parameter.
- Added a k-Nearest-Neighbor machine learner. [First revision
implemented by David Bell]
- Added a Rocchio machine learner. [Partially implemented by Xiaobo
Li]
- Added a "Guesser" machine learner which simply uses overall class
probabilities to make categorization decisions. Sometimes useful
as a baseline (or for when very rough categorization is okay).
- Added the Collection::InMemory class.
- Much more thorough testing with 'make test'.
- Added add_hypothesis() method to Experiment.
- Added dot() and value() methods to FeatureVector.
- Added 'feature_selection' parameter to KnowledgeSet.
- Added document($name) accessor method to KnowledgeSet.
- In KnowledgeSet, load(), read(), and scan_*() can now accept a
Collection object.
- Added document_frequency(), finish(), and weigh_features() methods
to KnowledgeSet.
- Added save_features() and restore_features() to KnowledgeSet.
- Added default categories() and categorize() methods to Learner base
class. get_scores() is now abstract.
- Extended interface of ObjectSet class with retrieve(), includes(),
and includes_name().
- Moved 'term_weighting' parameter from Document to KnowledgeSet,
since the normalized version needs to know the maximum
term-frequency. Also changed its values to 'n', 'l', 'b', and 't'.
README
t/01-naive_bayes.t
t/02-experiment.t
t/03-weka.t
t/04-decision_tree.t
t/05-svm.t
t/06-knn.t
t/07-guesser.t
t/09-rocchio.t
t/10-tools.t
t/11-feature_vector.t
t/12-hypothesis.t
t/13-document.t
t/14-collection.t
t/15-knowledge_set.t
t/common.pl
t/traindocs/doc1
t/traindocs/doc2
t/traindocs/doc3
t/traindocs/doc4
META.yml
SYNOPSIS
use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);
# Run a complete experiment - training on a corpus, testing on a test
# set, printing a summary of results to STDOUT
$c->run_experiment;
# Or, run the parts of $c->run_experiment separately
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
DESCRIPTION
"AI::Categorizer" is a framework for automatic text categorization. It
consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can choose what
categorization algorithm to use, what features (words or otherwise) of the
documents should be used (or how to automatically choose these features),
what format the documents are in, and so on.
The basic process of using this module will typically involve obtaining a
collection of pre-categorized documents, creating a "knowledge set"
representation of those documents, training a categorizer on that knowledge
set, and saving the trained categorizer for later use. There are several
ways to carry out this process. The top-level "AI::Categorizer" module
provides an umbrella class for high-level operations, or you may use the
interfaces of the individual classes in the framework.
A diagram of the various classes in the framework can be seen in
"doc/classes-overview.png", and a more detailed view of the same thing can
be seen in "doc/classes.png".
Knowledge Sets
A "knowledge set" is defined as a collection of documents, together with
some information on the categories each document belongs to. Note that this
term is somewhat unique to this project - other sources may call it a
"training corpus", or "prior knowledge". A knowledge set also contains some
information on how documents will be parsed and how their features (words)
will be extracted and turned into meaningful representations. In this sense,
a knowledge set represents not only a collection of data, but a particular
view on that data.
A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
class. Before you can start playing with categorizers, you will have to
start playing with knowledge sets, so that the categorizers have some data
to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
module for information on its interface.
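For instance, a minimal sketch of building a knowledge set (the "load" and
"verbose" constructor parameters appear in the KnowledgeSet code later on
this page; the corpus path is hypothetical):
use AI::Categorizer::KnowledgeSet;
my $kset = AI::Categorizer::KnowledgeSet->new
  (verbose => 1,
   load => { path => '/path/to/corpus' }); # reads a Collection in one step
my $num_docs = $kset->documents; # scalar context returns a count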
Feature selection
Deciding which features are the most important is a very large part of the
categorization task - you cannot simply consider all the words in all the
documents when training, and all the words in the document being
categorized. There are two main reasons for this - first, it would mean that
your training and categorizing processes would take forever and use tons of
memory, and second, the significant stuff of the documents would get lost in
the "noise" of the insignificant stuff.
The process of selecting the most important features in the training set is
called "feature selection". It is managed by the
"AI::Categorizer::KnowledgeSet" class, and you will find the details of
feature selection processes in that class's documentation.
Collections
Because documents may be stored in lots of different formats, a "collection"
class has been created as an abstraction of a stored set of documents,
together with a way to iterate through the set and return Document objects.
A knowledge set contains a single collection object. A "Categorizer" doing a
complete test run generally contains two collections, one for training and
one for testing. A "Learner" can mass-categorize a collection.
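As a sketch of that iteration interface (the same next()/rewind() calls the
KnowledgeSet code on this page uses; the path is hypothetical):
use AI::Categorizer::Collection::Files;
my $collection = AI::Categorizer::Collection::Files->new(path => '/path/to/docs');
while (my $doc = $collection->next) { # each item is an AI::Categorizer::Document
  print $doc->name, "\n";
}
$collection->rewind; # start over for a second pass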
Machine Learning Algorithms
Various machine learning algorithms are implemented as subclasses of
"Learner"; see the documentation of the individual Learner modules for
their guts and quirks. See the "AI::Categorizer::Learner" documentation for
a description of the general categorizer interface.
If you wish to create your own classifier, you should inherit from
"AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
abstract classes that manage some of the work for you.
Feature Vectors
Most categorization algorithms don't deal directly with documents' data;
instead, they deal with a *vector representation* of a document's *features*.
The features may be any properties of the document that seem helpful for
determining its category, but they are usually some version of the "most
important" words in the document. A list of features and their weights in
each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
You may think of this class as roughly analogous to a Perl hash, where the
keys are the names of features and the values are their weights.
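To make the hash analogy concrete, here is a small sketch using the
FeatureVector methods shown later on this page:
use AI::Categorizer::FeatureVector;
my $v = AI::Categorizer::FeatureVector->new(features => {sports => 2, ball => 1});
print $v->value('sports'); # 2
my @names = $v->names; # feature names, like 'keys %hash'
my $size = $v->length; # number of features, like 'scalar keys %hash'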
Hypotheses
The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of "statistical
guess" of what categories this document should be assigned to. Since you may
be interested in any of several pieces of information about the hypothesis
(for instance, which categories were assigned, which category was the single
most likely category, the scores assigned to each category, etc.), the
hypothesis is returned as an object of the "AI::Categorizer::Hypothesis"
class.
progress_file
Sets a prefix for the files used to save progress between the stages of
an experiment. When set to the string "save", files like
"save-01-knowledge_set" will get created. The exact names of these
files may change in future releases, since they're just used internally
to resume where we last left off.
verbose
If true, a few status messages will be printed during execution.
training_set
Specifies the "path" parameter that will be fed to the
KnowledgeSet's "scan_features()" and "read()" methods during our
"scan_features()" and "read_training_set()" methods.
test_set
Specifies the "path" parameter that will be used when creating a
Collection during the "evaluate_test_set()" method.
data_root
A shortcut for setting the "training_set", "test_set", and
"category_file" parameters separately. Sets "training_set" to
"$data_root/training", "test_set" to "$data_root/test", and
"category_file" (used by some of the Collection classes) to
"train()", the Learner will of course not be trained yet.
knowledge_set()
Returns the KnowledgeSet object associated with this Categorizer. If
"read_training_set()" has not yet been called, the KnowledgeSet will not
yet be populated with any training data.
run_experiment()
Runs a complete experiment on the training and testing data, reporting
the results on "STDOUT". Internally, this is just a shortcut for calling
the "scan_features()", "read_training_set()", "train()", and
"evaluate_test_set()" methods, then printing the value of the
"stats_table()" method.
scan_features()
Scans the Collection specified in the "training_set" parameter to determine
the set of features (words) that will be considered when training the
Learner. Internally, this calls the "scan_features()" method of the
KnowledgeSet, then saves a list of the KnowledgeSet's features for later
use.
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
read_training_set()
Populates the KnowledgeSet with the data specified in the "training_set"
parameter. Internally, this calls the "read()" method of the
KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
object for later use.
train()
Calls the Learner's "train()" method, passing it the KnowledgeSet
created during "read_training_set()".
eg/categorizer
print $out_fh "~~~~~~~~~~~~~~~~", scalar(localtime), "~~~~~~~~~~~~~~~~~~~~~~~~~~~\n";
if ($HAVE_YAML) {
print {$out_fh} YAML::Dump($c->dump_parameters);
} else {
warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
}
}
}
run_section('scan_features', 1, $do_stage);
run_section('read_training_set', 2, $do_stage);
run_section('train', 3, $do_stage);
run_section('evaluate_test_set', 4, $do_stage);
if ($do_stage->{5}) {
my $result = $c->stats_table;
print $result if $c->verbose;
print $out_fh $result if $out_fh;
}
sub run_section {
lib/AI/Categorizer.pm
# delete $p->{stopwords} if $p->{stopword_file};
# return $p;
#}
sub knowledge_set { shift->{knowledge_set} }
sub learner { shift->{learner} }
# Combines several methods in one sub
sub run_experiment {
my $self = shift;
$self->scan_features;
$self->read_training_set;
$self->train;
$self->evaluate_test_set;
print $self->stats_table;
}
sub scan_features {
my $self = shift;
return unless $self->knowledge_set->scan_first;
$self->knowledge_set->scan_features( path => $self->{training_set} );
$self->knowledge_set->save_features( "$self->{progress_file}-01-features" );
}
sub read_training_set {
my $self = shift;
$self->knowledge_set->restore_features( "$self->{progress_file}-01-features" )
if -e "$self->{progress_file}-01-features";
$self->knowledge_set->read( path => $self->{training_set} );
$self->_save_progress( '02', 'knowledge_set' );
return $self->knowledge_set;
}
sub train {
my $self = shift;
$self->_load_progress( '02', 'knowledge_set' );
$self->learner->train( knowledge_set => $self->{knowledge_set} );
$self->_save_progress( '03', 'learner' );
lib/AI/Categorizer.pm
=head1 SYNOPSIS
use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);
# Run a complete experiment - training on a corpus, testing on a test
# set, printing a summary of results to STDOUT
$c->run_experiment;
# Or, run the parts of $c->run_experiment separately
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
=head1 DESCRIPTION
C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically
choose these features), what format the documents are in, and so on.
The basic process of using this module will typically involve
obtaining a collection of B<pre-categorized> documents, creating a
"knowledge set" representation of those documents, training a
categorizer on that knowledge set, and saving the trained categorizer
for later use. There are several ways to carry out this process. The
top-level C<AI::Categorizer> module provides an umbrella class for
high-level operations, or you may use the interfaces of the individual
classes in the framework.
lib/AI/Categorizer.pm
C<doc/classes-overview.png>, and a more detailed view of the same
thing can be seen in C<doc/classes.png>.
=head2 Knowledge Sets
A "knowledge set" is defined as a collection of documents, together
with some information on the categories each document belongs to.
Note that this term is somewhat unique to this project - other sources
may call it a "training corpus", or "prior knowledge". A knowledge
set also contains some information on how documents will be parsed and
how their features (words) will be extracted and turned into
meaningful representations. In this sense, a knowledge set represents
not only a collection of data, but a particular view on that data.
A knowledge set is encapsulated by the
C<AI::Categorizer::KnowledgeSet> class. Before you can start playing
with categorizers, you will have to start playing with knowledge sets,
so that the categorizers have some data to train on. See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.
=head3 Feature selection
Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized. There are two main reasons for this - first, it
would mean that your training and categorizing processes would take
forever and use tons of memory, and second, the significant stuff of
the documents would get lost in the "noise" of the insignificant stuff.
The process of selecting the most important features in the training
set is called "feature selection". It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.
=head2 Collections
Because documents may be stored in lots of different formats, a
"collection" class has been created as an abstraction of a stored set
of documents, together with a way to iterate through the set and
return Document objects. A knowledge set contains a single collection
object. A C<Categorizer> doing a complete test run generally contains
two collections, one for training and one for testing. A C<Learner>
can mass-categorize a collection.
lib/AI/Categorizer.pm
See the C<AI::Categorizer::Learner> documentation for a description of
the general categorizer interface.
If you wish to create your own classifier, you should inherit from
C<AI::Categorizer::Learner> or C<AI::Categorizer::Learner::Boolean>,
which are abstract classes that manage some of the work for you.
=head2 Feature Vectors
Most categorization algorithms don't deal directly with documents'
data; instead, they deal with a I<vector representation> of a
document's I<features>. The features may be any properties of the
document that seem helpful for determining its category, but they are usually
some version of the "most important" words in the document. A list of
features and their weights in each document is encapsulated by the
C<AI::Categorizer::FeatureVector> class. You may think of this class
as roughly analogous to a Perl hash, where the keys are the names of
features and the values are their weights.
=head2 Hypotheses
The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of
"statistical guess" of what categories this document should be
assigned to. Since you may be interested in any of several pieces of
information about the hypothesis (for instance, which categories were
assigned, which category was the single most likely category, the
scores assigned to each category, etc.), the hypothesis is returned as
lib/AI/Categorizer.pm
releases, since they're just used internally to resume where we last
left off.
=item verbose
If true, a few status messages will be printed during execution.
=item training_set
Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.
=item test_set
Specifies the C<path> parameter that will be used when creating a
Collection during the C<evaluate_test_set()> method.
=item data_root
A shortcut for setting the C<training_set>, C<test_set>, and
lib/AI/Categorizer.pm
=item knowledge_set()
Returns the KnowledgeSet object associated with this Categorizer. If
C<read_training_set()> has not yet been called, the KnowledgeSet will
not yet be populated with any training data.
=item run_experiment()
Runs a complete experiment on the training and testing data, reporting
the results on C<STDOUT>. Internally, this is just a shortcut for
calling the C<scan_features()>, C<read_training_set()>, C<train()>,
and C<evaluate_test_set()> methods, then printing the value of the
C<stats_table()> method.
=item scan_features()
Scans the Collection specified in the C<training_set> parameter to
determine the set of features (words) that will be considered when
training the Learner. Internally, this calls the C<scan_features()>
method of the KnowledgeSet, then saves a list of the KnowledgeSet's
features for later use.
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
=item read_training_set()
Populates the KnowledgeSet with the data specified in the C<training_set>
parameter. Internally, this calls the C<read()> method of the
KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
object for later use.
=item train()
lib/AI/Categorizer/Category.pm
default => [],
callbacks => { 'all are Document objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Document'), @{$_[0]} }, # check the array ref's contents, not all of @_
},
public => 0,
},
);
__PACKAGE__->contained_objects
(
features => {
class => 'AI::Categorizer::FeatureVector',
delayed => 1,
},
);
my %REGISTRY = ();
sub new {
my $self = shift()->SUPER::new(@_);
$self->{documents} = new AI::Categorizer::ObjectSet( @{$self->{documents}} );
lib/AI/Categorizer/Category.pm
return wantarray ? $d->members : $d->size;
}
sub contains_document {
return $_[0]->{documents}->includes( $_[1] );
}
sub add_document {
my $self = shift;
$self->{documents}->insert( $_[0] );
delete $self->{features}; # Could be more efficient?
}
sub features {
my $self = shift;
if (@_) {
$self->{features} = shift;
}
return $self->{features} if $self->{features};
my $v = $self->create_delayed_object('features');
return $self->{features} = $v unless $self->documents;
foreach my $document ($self->documents) {
$v->add( $document->features );
}
return $self->{features} = $v;
}
1;
__END__
=head1 NAME
AI::Categorizer::Category - A named category of documents
=head1 SYNOPSIS
my $category = AI::Categorizer::Category->by_name("sports");
my $name = $category->name;
my @docs = $category->documents;
my $num_docs = $category->documents;
my $features = $category->features;
$category->add_document($doc);
if ($category->contains_document($doc)) { ...
=head1 DESCRIPTION
This simple class represents a named category which may contain zero
or more documents. Each category is a "singleton" by name, so two
Category objects with the same name should not be created at once.
lib/AI/Categorizer/Category.pm
=item by_name(name => $string)
Returns the Category object with the given name, or creates one if no
such object exists.
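For instance, two lookups by the same name return the same underlying
object:
my $a = AI::Categorizer::Category->by_name(name => 'sports');
my $b = AI::Categorizer::Category->by_name(name => 'sports');
# $a and $b now refer to the same Category object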
=item documents()
Returns a list of the Document objects in this category in a list
context, or the number of such objects in a scalar context.
=item features()
Returns a FeatureVector object representing the sum of all the
FeatureVectors of the Documents in this Category.
=item add_document($document)
Informs the Category that the given Document belongs to it.
=item contains_document($document)
lib/AI/Categorizer/Document.pm
default => undef,
},
parse => {
type => SCALAR,
optional => 1,
},
parse_handle => {
type => HANDLE,
optional => 1,
},
features => {
isa => 'AI::Categorizer::FeatureVector',
optional => 1,
},
content_weights => {
type => HASHREF,
default => {},
},
front_bias => {
type => SCALAR,
default => 0,
},
use_features => {
type => HASHREF|UNDEF,
default => undef,
},
stemming => {
type => SCALAR|UNDEF,
optional => 1,
},
stopword_behavior => {
type => SCALAR,
default => "stem",
},
);
__PACKAGE__->contained_objects
(
features => { delayed => 1,
class => 'AI::Categorizer::FeatureVector' },
);
### Constructors
my $NAME = 'a';
sub new {
my $pkg = shift;
my $self = $pkg->SUPER::new(name => $NAME++, # Use a default name
lib/AI/Categorizer/Document.pm
$self->stem_words(\@keys);
$s->{$_} = 1 foreach @keys;
# This flag is attached to the stopword structure itself so that
# other documents will notice it.
$s->{___stemmed} = 1;
}
sub finish {
my $self = shift;
$self->create_feature_vector;
# Now we're done with all the content stuff
delete @{$self}{'content', 'content_weights', 'stopwords', 'use_features'};
}
# Parse a document format - a virtual method
sub parse;
sub parse_handle {
my ($self, %args) = @_;
my $fh = $args{handle} or die "No 'handle' argument given to parse_handle()";
return $self->parse( content => join '', <$fh> );
}
### Accessors
sub name { $_[0]->{name} }
sub stopword_behavior { $_[0]->{stopword_behavior} }
sub features {
my $self = shift;
if (@_) {
$self->{features} = shift;
}
return $self->{features};
}
sub categories {
my $c = $_[0]->{categories};
return wantarray ? $c->members : $c->size;
}
### Workers
sub create_feature_vector {
my $self = shift;
my $content = $self->{content};
my $weights = $self->{content_weights};
die "'stopword_behavior' must be one of 'stem', 'no_stem', or 'pre_stemmed'"
unless $self->{stopword_behavior} =~ /^stem|no_stem|pre_stemmed$/;
$self->{features} = $self->create_delayed_object('features');
while (my ($name, $data) = each %$content) {
my $t = $self->tokenize($data);
$t = $self->_filter_tokens($t) if $self->{stopword_behavior} eq 'no_stem';
$self->stem_words($t);
$t = $self->_filter_tokens($t) if $self->{stopword_behavior} =~ /^(?:stem|pre_stemmed)$/;
my $h = $self->vectorize(tokens => $t, weight => exists($weights->{$name}) ? $weights->{$name} : 1 );
$self->{features}->add($h);
}
}
sub is_in_category {
return (ref $_[1]
? $_[0]->{categories}->includes( $_[1] )
: $_[0]->{categories}->includes_name( $_[1] ));
}
lib/AI/Categorizer/Document.pm
eval {require Lingua::Stem; 1}
or die "Porter stemming requires the Lingua::Stem module, available from CPAN.\n";
@$tokens = @{ Lingua::Stem::stem(@$tokens) };
}
sub _filter_tokens {
my ($self, $tokens_in) = @_;
if ($self->{use_features}) {
my $f = $self->{use_features}->as_hash;
return [ grep exists($f->{$_}), @$tokens_in ];
} elsif ($self->{stopwords} and keys %{$self->{stopwords}}) {
my $s = $self->{stopwords};
return [ grep !exists($s->{$_}), @$tokens_in ];
}
return $tokens_in;
}
sub _weigh_tokens {
my ($self, $tokens, $weight) = @_;
lib/AI/Categorizer/Document.pm
my %counts;
if (my $b = 0+$self->{front_bias}) {
die "'front_bias' value must be between -1 and 1"
unless -1 < $b and $b < 1;
my $n = @$tokens;
my $r = ($b-1)**2 / ($b+1);
my $mult = $weight * log($r)/($r-1);
my $i = 0;
foreach my $feature (@$tokens) {
$counts{$feature} += $mult * $r**($i/$n);
$i++;
}
} else {
foreach my $feature (@$tokens) {
$counts{$feature} += $weight;
}
}
return \%counts;
}
sub vectorize {
my ($self, %args) = @_;
if ($self->{stem_stopwords}) {
my $s = $self->stem_tokens([keys %{$self->{stopwords}}]);
lib/AI/Categorizer/Document.pm
my $self = $class->new(%args);
open my($fh), "< $path" or die "$path: $!";
$self->parse_handle(handle => $fh);
close $fh;
$self->finish;
return $self;
}
sub dump_features {
my ($self, %args) = @_;
my $path = $args{path} or die "No 'path' argument given to dump_features()";
open my($fh), "> $path" or die "Can't create $path: $!";
my $f = $self->features->as_hash;
while (my ($k, $v) = each %$f) {
print $fh "$k\t$v\n";
}
}
1;
__END__
=head1 NAME
lib/AI/Categorizer/Document.pm
# Other parameters are accepted:
my $d = new AI::Categorizer::Document(name => $string,
categories => \@category_objects,
content => { subject => $string,
body => $string2, ... },
content_weights => { subject => 3,
body => 1, ... },
stopwords => \%skip_these_words,
stemming => $string,
front_bias => $float,
use_features => $feature_vector,
);
# Specify explicit feature vector:
my $d = new AI::Categorizer::Document(name => $string);
$d->features( $feature_vector );
# Now pass the document to a categorization algorithm:
my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
my $hypothesis = $learner->categorize($document);
=head1 DESCRIPTION
The Document class embodies the data in a single document, and
contains methods for turning this data into a FeatureVector. Usually
documents are plain text, but subclasses of the Document class may
lib/AI/Categorizer/Document.pm
A string that identifies this document. Required.
=item content
The raw content of this document. May be specified as either a string
or as a hash reference, allowing structured document types.
=item content_weights
A hash reference indicating the weights that should be assigned to
features in different sections of a structured document when creating
its feature vector. The weight is a multiplier of the feature vector
values. For instance, if a C<subject> section has a weight of 3 and a
C<body> section has a weight of 1, and word counts are used as feature
vector values, then it will be as if all words appearing in the
C<subject> appeared 3 times.
If no weights are specified, all weights are set to 1.
=item front_bias
Allows smooth bias of the weights of words in a document according to
their position. The value should be a number between -1 and 1.
Positive numbers indicate that words toward the beginning of the
lib/AI/Categorizer/Document.pm
document. Negative numbers indicate the opposite. A bias of 0
indicates that no biasing should be done.
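The exact decay is visible in the _weigh_tokens() code earlier on this
page: the token at position i of n is scaled by mult * r**(i/n), with
r = (b-1)**2/(b+1) and mult = weight * log(r)/(r-1). A self-contained
sketch with hypothetical values:
my ($b, $weight, $n) = (0.5, 1, 10); # bias, section weight, token count
my $r = ($b - 1)**2 / ($b + 1);
my $mult = $weight * log($r) / ($r - 1);
printf "position %2d: %.3f\n", $_, $mult * $r**($_ / $n) for 0 .. $n - 1;
With a positive bias such as 0.5, the multiplier shrinks smoothly from
the first token to the last; a negative bias does the reverse.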
=item categories
A reference to an array of Category objects that this document belongs
to. Optional.
=item stopwords
A list/hash of features (words) that should be ignored when parsing
document content. A hash reference is preferred, with the features as
the keys. If you pass an array reference containing the features, it
will be converted to a hash reference internally.
=item use_features
A Feature Vector specifying the only features that should be
considered when parsing this document. This is an alternative to
using C<stopwords>.
=item stemming
Indicates the linguistic procedure that should be used to convert
tokens in the document to features. Possible values are C<none>,
which indicates that the tokens should be used without change, or
C<porter>, indicating that the Porter stemming algorithm should be
applied to each token. This requires the C<Lingua::Stem> module from
CPAN.
=item stopword_behavior
There are a few ways you might want the stopword list (specified with
the C<stopwords> parameter) to interact with the stemming algorithm
(specified with the C<stemming> parameter). These options can be
lib/AI/Categorizer/Document.pm
=item parse( content =E<gt> $content )
=item name()
Returns this document's C<name> property as specified when the
document was created.
=item features()
Returns the Feature Vector associated with this document.
=item categories()
In a list context, returns a list of Category objects to which this
document belongs. In a scalar context, returns the number of such
categories.
=item create_feature_vector()
Creates this document's Feature Vector by parsing its content. You
won't call this method directly, it's called by C<new()>.
=back
=head1 AUTHOR
lib/AI/Categorizer/FeatureSelector.pm
use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Util;
use Carp qw(croak);
__PACKAGE__->valid_params
(
features_kept => {
type => SCALAR,
default => 0.2,
},
verbose => {
type => SCALAR,
default => 0,
},
);
sub verbose {
my $self = shift;
$self->{verbose} = shift if @_;
return $self->{verbose};
}
sub reduce_features {
# Takes a feature vector whose weights are "feature scores", and
# chops to the highest n features. n is specified by the
# 'features_kept' parameter. If it's zero, all features are kept.
# If it's between 0 and 1, we multiply by the present number of
# features. If it's greater than 1, we treat it as the number of
# features to use.
my ($self, $f, %args) = @_;
my $kept = defined $args{features_kept} ? $args{features_kept} : $self->{features_kept};
return $f unless $kept;
my $num_kept = ($kept < 1 ?
$f->length * $kept :
$kept);
print "Trimming features - # features = " . $f->length . "\n" if $self->verbose;
# This is algorithmic overkill, but the sort seems fast enough. Will revisit later.
my $features = $f->as_hash;
my @new_features = (sort {$features->{$b} <=> $features->{$a}} keys %$features)
[0 .. $num_kept-1];
my $result = $f->intersection( \@new_features );
print "Finished trimming features - # features = " . $result->length . "\n" if $self->verbose;
return $result;
}
# Abstract methods
sub rank_features;
sub scan_features;
sub select_features {
my ($self, %args) = @_;
die "No knowledge_set parameter provided to select_features()"
unless $args{knowledge_set};
my $f = $self->rank_features( knowledge_set => $args{knowledge_set} );
return $self->reduce_features( $f, features_kept => $args{features_kept} );
}
1;
__END__
=head1 NAME
AI::Categorizer::FeatureSelector - Abstract Feature Selection class
lib/AI/Categorizer/FeatureSelector.pm
An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet. If used, the
C<documents> parameter should also be specified.
=item documents
An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet. If used, the
C<categories> parameter should also be specified.
=item features_kept
A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents. May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used. The default is
0.2.
=item feature_selection
A string indicating the type of feature selection that should be
performed. Currently the only option is also the default option:
C<document_frequency>.
=item tfidf_weighting
Specifies how document word counts should be converted to vector
values. Uses the three-character specification strings from Salton &
Buckley's paper "Term-weighting approaches in automatic text
retrieval". The three characters indicate the three factors that will
be multiplied for each feature to find the final vector value for that
feature. The default weighting is C<xxx>.
The first character specifies the "term frequency" component, which
can take the following values:
=over 4
=item b
Binary weighting - 1 for terms present in a document, 0 for terms absent.
=item t
Raw term frequency - equal to the number of times a feature occurs in
the document.
=item x
A synonym for 't'.
=item n
Normalized term frequency - 0.5 + 0.5 * t/max(t). This is the same as
the 't' specification, but with term frequency normalized to lie
between 0.5 and 1.
lib/AI/Categorizer/FeatureSelector.pm
=item documents()
In a list context returns a list of all Document objects in this
KnowledgeSet. In a scalar context returns the number of such objects.
=item document()
Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.
=item features()
Returns a FeatureVector object which represents the features of all the
documents in this KnowledgeSet.
=item verbose()
Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.
=item scan_stats()
Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection. (XXX need to describe stats)
=item scan_features()
This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner. This process is known as "feature selection", and it's a
very important part of categorization.
The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.
The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.
This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies. Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.
=item save_features()
Given the name of a file, this method writes the features (as
determined by the C<scan_features> method) to the file.
=item restore_features()
Given the name of a file written by C<save_features>, loads the
features from that file and passes them as the C<use_features>
parameter for any Document objects created in the future by this
KnowledgeSet.
=item read()
Iterates through a Collection of documents and adds them to the
KnowledgeSet. The Collection can be specified using a C<collection>
parameter - otherwise, specify the arguments to pass to the C<new()>
method of the Collection class.
=item load()
This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).
=item add_document()
Given a Document object as an argument, this method will add it and
any categories it belongs to to the KnowledgeSet.
=item make_document()
This method will create a Document object with the given data and then
call C<add_document()> to add it to the KnowledgeSet. A C<categories>
parameter should specify an array reference containing a list of
categories I<by name>. These are the categories that the document
belongs to. Any other parameters will be passed to the Document
class's C<new()> method.
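For example (names and content hypothetical):
$kset->make_document(name => 'doc1',
                     categories => ['sports'],
                     content => $text);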
=item finish()
This method will be called prior to training the Learner. Its purpose
is to perform any operations (such as feature vector weighting) that
may require examination of the entire KnowledgeSet.
=item weigh_features()
This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.
=item document_frequency()
Given a single feature (word) as an argument, this method will return
the number of documents in the KnowledgeSet that contain that feature.
=item partition()
Divides the KnowledgeSet into several subsets. This may be useful for
performing cross-validation. The relative sizes of the subsets should
be passed as arguments. For example, to split the KnowledgeSet into
four KnowledgeSets of equal size, pass the arguments .25, .25, .25
(the final size is 1 minus the sum of the other sizes). The
partitions will be returned as a list.
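For example, a four-way equal split for cross-validation:
my @folds = $kset->partition(.25, .25, .25);
# @folds now holds four KnowledgeSets of roughly equal size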
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
package AI::Categorizer::FeatureSelector::CategorySelector;
use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector);
use Params::Validate qw(:types);
__PACKAGE__->contained_objects
(
features => { class => 'AI::Categorizer::FeatureVector',
delayed => 1 },
);
1;
sub reduction_function;
# figure out the feature set before reading collection (default)
sub scan_features {
my ($self, %args) = @_;
my $c = $args{collection} or
die "No 'collection' parameter provided to scan_features()";
if(!($self->{features_kept})) {return;}
my %cat_features;
my $coll_features = $self->create_delayed_object('features');
my $nbDocuments = 0;
while (my $doc = $c->next) {
$nbDocuments++;
$args{prog_bar}->() if $args{prog_bar};
my $docfeatures = $doc->features->as_hash;
foreach my $cat ($doc->categories) {
my $catname = $cat->name;
if(!(exists $cat_features{$catname})) {
$cat_features{$catname} = $self->create_delayed_object('features');
}
$cat_features{$catname}->add($docfeatures);
}
$coll_features->add( $docfeatures );
}
print STDERR "\n* Computing Chi-Square values\n" if $self->verbose;
my $r_features = $self->create_delayed_object('features');
my @terms = $coll_features->names;
my $progressBar = $self->prog_bar(scalar @terms);
my $allFeaturesSum = $coll_features->sum;
my %cat_features_sum;
while( my($catname,$features) = each %cat_features ) {
$cat_features_sum{$catname} = $features->sum;
}
foreach my $term (@terms) {
$progressBar->();
$r_features->{features}{$term} = $self->reduction_function($term,
$nbDocuments,$allFeaturesSum,$coll_features,
\%cat_features,\%cat_features_sum);
}
print STDERR "\n" if $self->verbose;
my $new_features = $self->reduce_features($r_features);
return $coll_features->intersection( $new_features );
}
# calculate feature set after reading collection (scan_first=0)
sub rank_features {
die "CategorySelector->rank_features is not implemented yet!";
# my ($self, %args) = @_;
#
# my $k = $args{knowledge_set}
# or die "No knowledge_set parameter provided to rank_features()";
#
# my %freq_counts;
# foreach my $name ($k->features->names) {
# $freq_counts{$name} = $k->document_frequency($name);
# }
# return $self->create_delayed_object('features', features => \%freq_counts);
}
# copied from KnowledgeSet->prog_bar by Ken Williams
sub prog_bar {
my ($self, $count) = @_;
return sub {} unless $self->verbose;
return sub { print STDERR '.' } unless eval "use Time::Progress; 1";
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
AI::Categorizer::CategorySelector - Abstract Category Selection class
=head1 SYNOPSIS
This class is abstract. For an example of instantiation, see
ChiSquare.
=head1 DESCRIPTION
A base class for FeatureSelectors that calculate their global feature
ranking from per-category feature sets.
=head1 METHODS
=head1 AUTHOR
Francois Paradis, paradifr@iro.umontreal.ca
with inspiration from Ken Williams' AI::Categorizer code
=cut
lib/AI/Categorizer/FeatureSelector/ChiSquare.pm
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector::CategorySelector);
use Params::Validate qw(:types);
# Chi-Square function
# NB: this could probably be optimised a bit...
sub reduction_function {
my ($self,$term,$N,$allFeaturesSum,
$coll_features,$cat_features,$cat_features_sum) = @_;
my $CHI2SUM = 0;
my $nbcats = 0;
foreach my $catname (keys %{$cat_features}) {
# while ( my ($catname,$catfeatures) = each %{$cat_features}) {
my ($A,$B,$C,$D); # A = number of times where t and c co-occur
# B = " " " t occurs without c
# C = " " " c occurs without t
# D = " " " neither c nor t occur
$A = $cat_features->{$catname}->value($term);
$B = $coll_features->value($term) - $A;
$C = $cat_features_sum->{$catname} - $A;
$D = $allFeaturesSum - ($A+$B+$C);
my $ADminCB = ($A*$D)-($C*$B);
my $CHI2 = $N*$ADminCB*$ADminCB / (($A+$C)*($B+$D)*($A+$B)*($C+$D));
$CHI2SUM += $CHI2;
$nbcats++;
}
return $CHI2SUM/$nbcats;
}
1;
lib/AI/Categorizer/FeatureSelector/ChiSquare.pm
AI::Categorizer::FeatureSelector::ChiSquare - ChiSquare Feature Selection class
=head1 SYNOPSIS
# the recommended way to use this class is to let the KnowledgeSet
# instantiate it
use AI::Categorizer::KnowledgeSetSMART;
my $ksetCHI = new AI::Categorizer::KnowledgeSetSMART(
tfidf_notation =>'Categorizer',
feature_selection=>'chi_square', ...other parameters...);
# however it is also possible to pass an instance to the KnowledgeSet
use AI::Categorizer::KnowledgeSet;
use AI::Categorizer::FeatureSelector::ChiSquare;
my $ksetCHI = new AI::Categorizer::KnowledgeSet(
feature_selector => AI::Categorizer::FeatureSelector::ChiSquare->new(features_kept => 2000, verbose => 1),
...other parameters...
);
=head1 DESCRIPTION
Feature selection with the ChiSquare function.
Chi-Square(t,ci) =        N.(AD-CB)^2
                   -----------------------
                   (A+C).(B+D).(A+B).(C+D)
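A worked instance of the formula with hypothetical counts, following the
A/B/C/D definitions in the reduction_function() code above:
my ($A, $B, $C, $D) = (40, 10, 20, 130); # co-occur, t only, c only, neither
my $N = $A + $B + $C + $D; # 200
my $chi2 = $N * ($A*$D - $C*$B)**2
         / (($A+$C) * ($B+$D) * ($A+$B) * ($C+$D));
# 200 * (5200 - 200)**2 / (60 * 140 * 50 * 150) = 5_000_000_000/63_000_000, about 79.4
reduction_function() then averages this quantity over all categories.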
lib/AI/Categorizer/FeatureSelector/DocFrequency.pm
use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector);
use Params::Validate qw(:types);
use Carp qw(croak);
__PACKAGE__->contained_objects
(
features => { class => 'AI::Categorizer::FeatureVector',
delayed => 1 },
);
# The KnowledgeSet keeps track of document frequency, so just use that.
sub rank_features {
my ($self, %args) = @_;
my $k = $args{knowledge_set} or die "No knowledge_set parameter provided to rank_features()";
my %freq_counts;
foreach my $name ($k->features->names) {
$freq_counts{$name} = $k->document_frequency($name);
}
return $self->create_delayed_object('features', features => \%freq_counts);
}
sub scan_features {
my ($self, %args) = @_;
my $c = $args{collection} or die "No 'collection' parameter provided to scan_features()";
my $doc_freq = $self->create_delayed_object('features');
while (my $doc = $c->next) {
$args{prog_bar}->() if $args{prog_bar};
$doc_freq->add( $doc->features->as_boolean_hash );
}
print "\n" if $self->verbose;
return $self->reduce_features($doc_freq);
}
1;
__END__
=head1 NAME
AI::Categorizer::FeatureSelector::DocFrequency - Feature selection by document frequency
lib/AI/Categorizer/FeatureVector.pm
package AI::Categorizer::FeatureVector;
sub new {
my ($package, %args) = @_;
$args{features} ||= {};
return bless {features => $args{features}}, $package;
}
sub names {
my $self = shift;
return keys %{$self->{features}};
}
sub set {
my $self = shift;
$self->{features} = (ref $_[0] ? $_[0] : {@_});
}
sub as_hash {
my $self = shift;
return $self->{features};
}
sub euclidean_length {
my $self = shift;
my $f = $self->{features};
my $total = 0;
foreach (values %$f) {
$total += $_**2;
}
return sqrt($total);
}
sub normalize {
my $self = shift;
my $length = $self->euclidean_length;
return $length ? $self->scale(1/$length) : $self;
}
sub scale {
my ($self, $scalar) = @_;
$_ *= $scalar foreach values %{$self->{features}};
return $self;
}
sub as_boolean_hash {
my $self = shift;
return { map {($_ => 1)} keys %{$self->{features}} };
}
sub length {
my $self = shift;
return scalar keys %{$self->{features}};
}
sub clone {
my $self = shift;
return ref($self)->new( features => { %{$self->{features}} } );
}
sub intersection {
my ($self, $other) = @_;
$other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
my $common;
if (UNIVERSAL::isa($other, 'ARRAY')) {
$common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} @$other};
} elsif (UNIVERSAL::isa($other, 'HASH')) {
$common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} keys %$other};
}
return ref($self)->new( features => $common );
}
sub add {
my ($self, $other) = @_;
$other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
while (my ($k,$v) = each %$other) {
$self->{features}{$k} += $v;
}
}
sub dot {
my ($self, $other) = @_;
$other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
my $sum = 0;
my $f = $self->{features};
while (my ($k, $v) = each %$f) {
$sum += $other->{$k} * $v if exists $other->{$k};
}
return $sum;
}
sub sum {
my ($self) = @_;
# Return total of values in this vector
my $total = 0;
$total += $_ foreach values %{ $self->{features} };
return $total;
}
sub includes {
return exists $_[0]->{features}{$_[1]};
}
sub value {
return $_[0]->{features}{$_[1]};
}
sub values {
my $self = shift;
return @{ $self->{features} }{ @_ };
}
1;
__END__
=head1 NAME
AI::Categorizer::FeatureVector - Features vs. Values
=head1 SYNOPSIS
my $f1 = new AI::Categorizer::FeatureVector
(features => {howdy => 2, doody => 3});
my $f2 = new AI::Categorizer::FeatureVector
(features => {doody => 1, whopper => 2});
@names = $f1->names;
$x = $f1->length;
$x = $f1->sum;
$x = $f1->includes('howdy');
$x = $f1->value('howdy');
$x = $f1->dot($f2);
$f3 = $f1->clone;
$f3 = $f1->intersection($f2);
$f1->add($f2); # modifies $f1 in place
$h = $f1->as_hash;
$h = $f1->as_boolean_hash;
$f1->normalize;
=head1 DESCRIPTION
This class implements a "feature vector", which is a flat data
structure indicating the values associated with a set of features. At
its base level, a FeatureVector usually represents the set of words in
a document, with the value for each feature indicating the number of
times each word appears in the document. However, the values are
arbitrary so they can represent other quantities as well, and
FeatureVectors may also be combined to represent the features of
multiple documents.
=head1 METHODS
=over 4
=item ...
=back
lib/AI/Categorizer/KnowledgeSet.pm
default => [],
callbacks => { 'all are Document objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Document'),
@{$_[0]} },
},
},
scan_first => {
type => BOOLEAN,
default => 1,
},
feature_selector => {
isa => 'AI::Categorizer::FeatureSelector',
},
tfidf_weighting => {
type => SCALAR,
optional => 1,
},
term_weighting => {
type => SCALAR,
default => 'x',
},
lib/AI/Categorizer/KnowledgeSet.pm
);
__PACKAGE__->contained_objects
(
document => { delayed => 1,
class => 'AI::Categorizer::Document' },
category => { delayed => 1,
class => 'AI::Categorizer::Category' },
collection => { delayed => 1,
class => 'AI::Categorizer::Collection::Files' },
features => { delayed => 1,
class => 'AI::Categorizer::FeatureVector' },
feature_selector => 'AI::Categorizer::FeatureSelector::DocFrequency',
);
sub new {
my ($pkg, %args) = @_;
# Shortcuts
if ($args{tfidf_weighting}) {
@args{'term_weighting', 'collection_weighting', 'normalize_weighting'} = split '', $args{tfidf_weighting};
delete $args{tfidf_weighting};
}
lib/AI/Categorizer/KnowledgeSet.pm
$self->{documents} = new AI::Categorizer::ObjectSet( @{$self->{documents}} );
if ($self->{load}) {
my $args = ref($self->{load}) ? $self->{load} : { path => $self->{load} };
$self->load(%$args);
delete $self->{load};
}
return $self;
}
sub features {
my $self = shift;
if (@_) {
$self->{features} = shift;
$self->trim_doc_features if $self->{features};
}
return $self->{features} if $self->{features};
# Create a feature vector encompassing the whole set of documents
my $v = $self->create_delayed_object('features');
foreach my $document ($self->documents) {
$v->add( $document->features );
}
return $self->{features} = $v;
}
sub categories {
my $c = $_[0]->{categories};
return wantarray ? $c->members : $c->size;
}
sub documents {
my $d = $_[0]->{documents};
return wantarray ? $d->members : $d->size;
}
sub document {
my ($self, $name) = @_;
return $self->{documents}->retrieve($name);
}
sub feature_selector { $_[0]->{feature_selector} }
sub scan_first { $_[0]->{scan_first} }
sub verbose {
my $self = shift;
$self->{verbose} = shift if @_;
return $self->{verbose};
}
sub trim_doc_features {
my ($self) = @_;
foreach my $doc ($self->documents) {
$doc->features( $doc->features->intersection($self->features) );
}
}
sub prog_bar {
my ($self, $collection) = @_;
return sub {} unless $self->verbose;
return sub { print STDERR '.' } unless eval "use Time::Progress; 1";
lib/AI/Categorizer/KnowledgeSet.pm
my $collection = $self->_make_collection(\%args);
my $pb = $self->prog_bar($collection);
my %stats;
while (my $doc = $collection->next) {
$pb->();
$stats{category_count_with_duplicates} += $doc->categories;
my ($sum, $length) = ($doc->features->sum, $doc->features->length);
$stats{document_count}++;
$stats{token_count} += $sum;
$stats{type_count} += $length;
foreach my $cat ($doc->categories) {
#warn $doc->name, ": ", $cat->name, "\n";
$stats{categories}{$cat->name}{document_count}++;
$stats{categories}{$cat->name}{token_count} += $sum;
$stats{categories}{$cat->name}{type_count} += $length;
}
lib/AI/Categorizer/KnowledgeSet.pm
$stats{"${thing}_skew_by_category"} = sqrt($ssum/@cats) / $stats{"${thing}s_per_category"};
}
return \%stats;
}
sub load {
my ($self, %args) = @_;
my $c = $self->_make_collection(\%args);
if ($self->{features_kept}) {
# Read the whole thing in, then reduce
$self->read( collection => $c );
$self->select_features;
} elsif ($self->{scan_first}) {
# Figure out the feature set first, then read data in
$self->scan_features( collection => $c );
$c->rewind;
$self->read( collection => $c );
} else {
# Don't do any feature reduction, just read the data
$self->read( collection => $c );
}
}
sub read {
my ($self, %args) = @_;
my $collection = $self->_make_collection(\%args);
my $pb = $self->prog_bar($collection);
while (my $doc = $collection->next) {
$pb->();
$self->add_document($doc);
}
print "\n" if $self->verbose;
}
sub finish {
my $self = shift;
return if $self->{finished}++;
$self->weigh_features;
}
sub weigh_features {
# This could be made more efficient by figuring out an execution
# plan in advance
my $self = shift;
if ( $self->{term_weighting} =~ /^(t|x)$/ ) {
# Nothing to do
} elsif ( $self->{term_weighting} eq 'l' ) {
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
$_ = 1 + log($_) foreach values %$f;
}
} elsif ( $self->{term_weighting} eq 'n' ) {
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
my $max_tf = AI::Categorizer::Util::max values %$f;
$_ = 0.5 + 0.5 * $_ / $max_tf foreach values %$f;
}
} elsif ( $self->{term_weighting} eq 'b' ) {
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
$_ = $_ ? 1 : 0 foreach values %$f;
}
} else {
die "term_weighting must be one of 'x', 't', 'l', 'b', or 'n'";
}
if ($self->{collection_weighting} eq 'x') {
# Nothing to do
} elsif ($self->{collection_weighting} =~ /^(f|p)$/) {
my $subtrahend = ($1 eq 'f' ? 0 : 1);
my $num_docs = $self->documents;
$self->document_frequency('foo'); # Initialize
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
$f->{$_} *= log($num_docs / $self->{doc_freq_vector}{$_} - $subtrahend) foreach keys %$f;
}
} else {
die "collection_weighting must be one of 'x', 'f', or 'p'";
}
if ( $self->{normalize_weighting} eq 'x' ) {
# Nothing to do
} elsif ( $self->{normalize_weighting} eq 'c' ) {
$_->features->normalize foreach $self->documents;
} else {
die "normalize_weighting must be one of 'x' or 'c'";
}
}
sub document_frequency {
my ($self, $term) = @_;
unless (exists $self->{doc_freq_vector}) {
die "No corpus has been scanned for features" unless $self->documents;
my $doc_freq = $self->create_delayed_object('features', features => {});
foreach my $doc ($self->documents) {
$doc_freq->add( $doc->features->as_boolean_hash );
}
$self->{doc_freq_vector} = $doc_freq->as_hash;
}
return exists $self->{doc_freq_vector}{$term} ? $self->{doc_freq_vector}{$term} : 0;
}
sub scan_features {
my ($self, %args) = @_;
my $c = $self->_make_collection(\%args);
my $pb = $self->prog_bar($c);
my $ranked_features = $self->{feature_selector}->scan_features( collection => $c, prog_bar => $pb );
$self->delayed_object_params('document', use_features => $ranked_features);
$self->delayed_object_params('collection', use_features => $ranked_features);
return $ranked_features;
}
sub select_features {
my $self = shift;
my $f = $self->feature_selector->select_features(knowledge_set => $self);
$self->features($f);
}
sub partition {
my ($self, @sizes) = @_;
my $num_docs = my @docs = $self->documents;
my @groups;
while (@sizes > 1) {
my $size = int ($num_docs * shift @sizes);
push @groups, [];
lib/AI/Categorizer/KnowledgeSet.pm
sub add_document {
my ($self, $doc) = @_;
foreach ($doc->categories) {
$_->add_document($doc);
}
$self->{documents}->insert($doc);
$self->{categories}->insert($doc->categories);
}
sub save_features {
my ($self, $file) = @_;
my $f = ($self->{features} || { $self->delayed_object_params('document') }->{use_features})
or croak "No features to save";
open my($fh), "> $file" or croak "Can't create $file: $!";
my $h = $f->as_hash;
print $fh "# Total: ", $f->length, "\n";
foreach my $k (sort {$h->{$b} <=> $h->{$a}} keys %$h) {
print $fh "$k\t$h->{$k}\n";
}
close $fh;
}
sub restore_features {
my ($self, $file, $n) = @_;
open my($fh), "< $file" or croak "Can't open $file: $!";
my %hash;
while (<$fh>) {
next if /^#/;
/^(.*)\t([\d.]+)$/ or croak "Malformed line: $_";
$hash{$1} = $2;
last if defined $n and $. >= $n;
}
my $features = $self->create_delayed_object('features', features => \%hash);
$self->delayed_object_params('document', use_features => $features);
$self->delayed_object_params('collection', use_features => $features);
}
1;
__END__
=head1 NAME
AI::Categorizer::KnowledgeSet - Encapsulates set of documents
lib/AI/Categorizer/KnowledgeSet.pm
An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet. If used, the
C<documents> parameter should also be specified.
=item documents
An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet. If used, the
C<categories> parameter should also be specified.
=item features_kept
A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents. May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used. The default is
0.2.
=item feature_selection
A string indicating the type of feature selection that should be
performed. Currently the only option is also the default option:
C<document_frequency>.
=item tfidf_weighting
Specifies how document word counts should be converted to vector
values. Uses the three-character specification strings from Salton &
Buckley's paper "Term-weighting approaches in automatic text
retrieval". The three characters indicate the three factors that will
be multiplied for each feature to find the final vector value for that
feature. The default weighting is C<xxx>.
The first character specifies the "term frequency" component, which
can take the following values:
=over 4
=item b
Binary weighting - 1 for terms present in a document, 0 for terms absent.
=item t
Raw term frequency - equal to the number of times a feature occurs in
the document.
=item x
A synonym for 't'.
=item n
Normalized term frequency - 0.5 + 0.5 * t/max(t). This is the same as
the 't' specification, but with term frequency normalized to lie
between 0.5 and 1.
lib/AI/Categorizer/KnowledgeSet.pm
=item documents()
In a list context returns a list of all Document objects in this
KnowledgeSet. In a scalar context returns the number of such objects.
=item document()
Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.
=item features()
Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.
=item verbose()
Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.
=item scan_stats()
Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection. (XXX need to describe stats)
=item scan_features()
This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner. This process is known as "feature selection", and it's a
very important part of categorization.
The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.
The process of feature selection is governed by the
C<feature_selection> and C<features_kept> parameters given to the
KnowledgeSet's C<new()> method.
This method returns the features as a FeatureVector whose values are
the "quality" of each feature, by whatever measure the
C<feature_selection> parameter specifies. Normally you won't need to
use the return value, because this FeatureVector will become the
C<use_features> parameter of any Document objects created by this
KnowledgeSet.
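For example (a sketch, assuming the training documents live in a
directory that C<AI::Categorizer::Collection::Files> can read):

    my $ranked = $ks->scan_features
      ( collection => AI::Categorizer::Collection::Files->new
          ( path => 'corpus/training' ) );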
=item save_features()
Given the name of a file, this method writes the features (as
determined by the C<scan_features> method) to the file.
=item restore_features()
Given the name of a file written by C<save_features>, loads the
features from that file and passes them as the C<use_features>
parameter for any Document objects created in the future by this
KnowledgeSet.
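A typical round trip might look like this (a sketch; the file name and
the limit of 1,000 features are arbitrary):

    $ks->scan_features(collection => $collection);
    $ks->save_features('features.txt');

    # Later, possibly in a separate run:
    $ks->restore_features('features.txt', 1000);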
=item read()
Iterates through a Collection of documents and adds them to the
KnowledgeSet. The Collection can be specified using a C<collection>
parameter - otherwise, specify the arguments to pass to the C<new()>
method of the Collection class.
=item load()
This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).
=item add_document()
Given a Document object as an argument, this method will add it, along
with any categories it belongs to, to the KnowledgeSet.
=item make_document()
This method will create a Document object with the given data and then
call C<add_document()> to add it to the KnowledgeSet. A C<categories>
parameter should specify an array reference containing a list of
categories I<by name>. These are the categories that the document
belongs to. Any other parameters will be passed to the Document
class's C<new()> method.
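For instance (a sketch with made-up data):

    $ks->make_document( name       => 'doc1',
                        categories => ['sports'],
                        content    => 'The final score was 3-2.' );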
=item finish()
This method will be called prior to training the Learner. Its purpose
is to perform any operations (such as feature vector weighting) that
may require examination of the entire KnowledgeSet.
=item weigh_features()
This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.
=item document_frequency()
Given a single feature (word) as an argument, this method will return
the number of documents in the KnowledgeSet that contain that feature.
=item partition()
Divides the KnowledgeSet into several subsets. This may be useful for
performing cross-validation. The relative sizes of the subsets should
be passed as arguments. For example, to split the KnowledgeSet into
four KnowledgeSets of equal size, pass the arguments .25, .25, .25
(the final size is 1 minus the sum of the other sizes). The
partitions will be returned as a list.
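For example, a four-way split for cross-validation (a sketch):

    my @folds = $ks->partition(0.25, 0.25, 0.25);
    # @folds now holds four KnowledgeSet objects of roughly equal size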
lib/AI/Categorizer/Learner/DecisionTree.pm
}
}
if ($self->{model}{first_tree}) {
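# Every category's tree is trained on the same document instances, so
# re-use them from the first tree built and just swap in this category's
# results rather than re-adding every document.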
$t->copy_instances(from => $self->{model}{first_tree});
$t->set_results(\%results);
} else {
for ($positives, $negatives) {
foreach my $doc (@$_) {
$t->add_instance( attributes => $doc->features->as_boolean_hash,
result => $results{$doc->name},
name => $doc->name,
);
}
}
$t->purge(0);
$self->{model}{first_tree} = $t;
}
print STDERR "\nBuilding tree for category '", $cat->name, "'" if $self->verbose;
$t->train;
return $t;
}
sub get_scores {
my ($self, $doc) = @_;
local $self->{current_doc} = $doc->features->as_boolean_hash;
return $self->SUPER::get_scores($doc);
}
sub get_boolean_score {
my ($self, $doc, $t) = @_;
return $t->get_result( attributes => $self->{current_doc} ) || 0;
}
1;
__END__
lib/AI/Categorizer/Learner/KNN.pm
__PACKAGE__->valid_params
(
threshold => {type => SCALAR, default => 0.4},
k_value => {type => SCALAR, default => 20},
knn_weighting => {type => SCALAR, default => 'score'},
max_instances => {type => SCALAR, default => 0},
);
sub create_model {
my $self = shift;
foreach my $doc ($self->knowledge_set->documents) {
$doc->features->normalize;
}
$self->knowledge_set->features; # Initialize
}
sub threshold {
my $self = shift;
$self->{threshold} = shift if @_;
return $self->{threshold};
}
sub categorize_collection {
my $self = shift;
my $f_class = $self->knowledge_set->contained_class('features');
if ($f_class->can('all_features')) {
$f_class->all_features([$self->knowledge_set->features->names]);
}
$self->SUPER::categorize_collection(@_);
}
sub get_scores {
my ($self, $newdoc) = @_;
my $features = $newdoc->features->intersection($self->knowledge_set->features)->normalize;
my $q = AI::Categorizer::Learner::KNN::Queue->new(size => $self->{k_value});
my @docset;
if ($self->{max_instances}) {
# Use (approximately) max_instances documents, chosen randomly from corpus
my $probability = $self->{max_instances} / $self->knowledge_set->documents;
@docset = grep {rand() < $probability} $self->knowledge_set->documents;
} else {
# Use the whole corpus
@docset = $self->knowledge_set->documents;
}
foreach my $doc (@docset) {
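# Both vectors have been normalized, so the dot product below is the
# cosine similarity between the new document and this training document.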
my $score = $doc->features->dot( $features );
warn "Score for ", $doc->name, " (", ($doc->categories)[0]->name, "): $score" if $self->verbose > 1;
$q->add($doc, $score);
}
my %scores = map {+$_->name, 0} $self->categories;
foreach my $e (@{$q->entries}) {
foreach my $cat ($e->{thing}->categories) {
$scores{$cat->name} += ($self->{knn_weighting} eq 'score' ? $e->{score} : 1); #increment cat score
}
}
lib/AI/Categorizer/Learner/NaiveBayes.pm
__PACKAGE__->valid_params
(
threshold => {type => SCALAR, default => 0.3},
);
sub create_model {
my $self = shift;
my $m = $self->{model} = Algorithm::NaiveBayes->new;
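# Each training document becomes one instance whose attributes are its
# (weighted) term frequencies and whose labels are its category names.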
foreach my $d ($self->knowledge_set->documents) {
$m->add_instance(attributes => $d->features->as_hash,
label => [ map $_->name, $d->categories ]);
}
$m->train;
}
sub get_scores {
my ($self, $newdoc) = @_;
return ($self->{model}->predict( attributes => $newdoc->features->as_hash ),
$self->{threshold});
}
sub threshold {
my $self = shift;
$self->{threshold} = shift if @_;
return $self->{threshold};
}
sub save_state {
lib/AI/Categorizer/Learner/Rocchio.pm
__PACKAGE__->valid_params
(
positive_setting => {type => SCALAR, default => 16 },
negative_setting => {type => SCALAR, default => 4 },
threshold => {type => SCALAR, default => 0.1},
);
sub create_model {
my $self = shift;
foreach my $doc ($self->knowledge_set->documents) {
$doc->features->normalize;
}
$self->{model}{all_features} = $self->knowledge_set->features(undef);
$self->SUPER::create_model(@_);
delete $self->{knowledge_set};
}
sub create_boolean_model {
my ($self, $positives, $negatives, $cat) = @_;
my $posdocnum = @$positives;
my $negdocnum = @$negatives;
my $beta = $self->{positive_setting};
my $gamma = $self->{negative_setting};
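# The algebra below assumes $self->{model}{all_features} is the feature
# sum over *all* training documents and $cat->features is the sum over
# this category's positive documents.  Then
#   -gamma/|N| * all + (beta/|P| + gamma/|N|) * positives
#     = beta/|P| * positives - gamma/|N| * negatives,
# which is the classic Rocchio profile vector.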
my $profile = $self->{model}{all_features}->clone->scale(-$gamma/$negdocnum);
my $f = $cat->features(undef)->clone->scale( $beta/$posdocnum + $gamma/$negdocnum );
$profile->add($f);
return $profile->normalize;
}
sub get_boolean_score {
my ($self, $newdoc, $profile) = @_;
return $newdoc->features->normalize->dot($profile);
}
1;
lib/AI/Categorizer/Learner/SVM.pm
use Params::Validate qw(:types);
use File::Spec;
__PACKAGE__->valid_params
(
svm_kernel => {type => SCALAR, default => 'linear'},
);
sub create_model {
my $self = shift;
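# Algorithm::SVM::DataSet identifies attributes by integer index, so
# build name-to-index and index-to-name maps once, up front.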
my $f = $self->knowledge_set->features->as_hash;
my $rmap = [ keys %$f ];
$self->{model}{feature_map} = { map { $rmap->[$_], $_ } 0..$#$rmap };
$self->{model}{feature_map_reverse} = $rmap;
$self->SUPER::create_model(@_);
}
sub _doc_2_dataset {
my ($self, $doc, $label, $fm) = @_;
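# $fm maps feature names to attribute indices; features the model has
# never seen (absent from the map) are silently skipped.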
my $ds = new Algorithm::SVM::DataSet(Label => $label);
my $f = $doc->features->as_hash;
while (my ($k, $v) = each %$f) {
next unless exists $fm->{$k};
$ds->attribute( $fm->{$k}, $v );
}
return $ds;
}
sub create_boolean_model {
my ($self, $positives, $negatives, $cat) = @_;
my $svm = new Algorithm::SVM(Kernel => $self->{svm_kernel});
my (@pos, @neg);
foreach my $doc (@$positives) {
push @pos, $self->_doc_2_dataset($doc, 1, $self->{model}{feature_map});
}
foreach my $doc (@$negatives) {
push @neg, $self->_doc_2_dataset($doc, 0, $self->{model}{feature_map});
}
$svm->train(@pos, @neg);
return $svm;
}
sub get_scores {
my ($self, $doc) = @_;
local $self->{current_doc} = $self->_doc_2_dataset($doc, -1, $self->{model}{feature_map});
return $self->SUPER::get_scores($doc);
}
sub get_boolean_score {
my ($self, $doc, $svm) = @_;
return $svm->predict($self->{current_doc});
}
sub save_state {
my ($self, $path) = @_;
lib/AI/Categorizer/Learner/Weka.pm
__PACKAGE__->valid_params
(
java_path => {type => SCALAR, default => 'java'},
java_args => {type => SCALAR|ARRAYREF, optional => 1},
weka_path => {type => SCALAR, optional => 1},
weka_classifier => {type => SCALAR, default => 'weka.classifiers.NaiveBayes'},
weka_args => {type => SCALAR|ARRAYREF, optional => 1},
tmpdir => {type => SCALAR, default => File::Spec->tmpdir},
);
__PACKAGE__->contained_objects
(
features => {class => 'AI::Categorizer::FeatureVector', delayed => 1},
);
sub new {
my $class = shift;
my $self = $class->SUPER::new(@_);
for ('java_args', 'weka_args') {
$self->{$_} = [] unless defined $self->{$_};
$self->{$_} = [$self->{$_}] unless UNIVERSAL::isa($self->{$_}, 'ARRAY');
}
lib/AI/Categorizer/Learner/Weka.pm
delete $self->{weka_path};
}
return $self;
}
# java -classpath /Applications/Science/weka-3-2-3/weka.jar weka.classifiers.NaiveBayes -t /tmp/train_file.arff -d /tmp/weka-machine
sub create_model {
my ($self) = shift;
my $m = $self->{model} ||= {};
$m->{all_features} = [ $self->knowledge_set->features->names ];
$m->{_in_dir} = File::Temp::tempdir( DIR => $self->{tmpdir} );
# Create a dummy test file in ARFF format (a kludgey Weka requirement)
my $dummy_features = $self->create_delayed_object('features');
$m->{dummy_file} = $self->create_arff_file("dummy", [[$dummy_features, 0]]);
$self->SUPER::create_model(@_);
}
sub create_boolean_model {
my ($self, $pos, $neg, $cat) = @_;
my @docs = (map([$_->features, 1], @$pos),
map([$_->features, 0], @$neg));
my $train_file = $self->create_arff_file($cat->name . '_train', \@docs);
my %info = (machine_file => $cat->name . '_model');
my $outfile = File::Spec->catfile($self->{model}{_in_dir}, $info{machine_file});
my @args = ($self->{java_path},
@{$self->{java_args}},
$self->{weka_classifier},
@{$self->{weka_args}},
'-t', $train_file,
lib/AI/Categorizer/Learner/Weka.pm
return \%info;
}
# java -classpath /Applications/Science/weka-3-2-3/weka.jar weka.classifiers.NaiveBayes -l out -T test.arff -p 0
sub get_boolean_score {
my ($self, $doc, $info) = @_;
# Create document file
my $doc_file = $self->create_arff_file('doc', [[$doc->features, 0]], $self->{tmpdir});
my $machine_file = File::Spec->catfile($self->{model}{_in_dir}, $info->{machine_file});
my @args = ($self->{java_path},
@{$self->{java_args}},
$self->{weka_classifier},
'-l', $machine_file,
'-T', $doc_file,
'-p', 0,
);
lib/AI/Categorizer/Learner/Weka.pm
}
sub categorize_collection {
my ($self, %args) = @_;
my $c = $args{collection} or die "No collection provided";
my @alldocs;
while (my $d = $c->next) {
push @alldocs, $d;
}
my $doc_file = $self->create_arff_file("docs", [map [$_->features, 0], @alldocs]);
my @assigned;
my $l = $self->{model}{learners};
foreach my $cat (keys %$l) {
my $machine_file = File::Spec->catfile($self->{model}{_in_dir}, "${cat}_model");
my @args = ($self->{java_path},
@{$self->{java_args}},
$self->{weka_classifier},
'-l', $machine_file,
lib/AI/Categorizer/Learner/Weka.pm
sub create_arff_file {
my ($self, $name, $docs, $dir) = @_;
$dir = $self->{model}{_in_dir} unless defined $dir;
my ($fh, $filename) = File::Temp::tempfile(
$name . "_XXXX", # Template
DIR => $dir,
SUFFIX => '.arff',
);
print $fh "\@RELATION foo\n\n";
my $feature_names = $self->{model}{all_features};
foreach my $name (@$feature_names) {
print $fh "\@ATTRIBUTE feature-$name REAL\n";
}
print $fh "\@ATTRIBUTE category {1, 0}\n\n";
my %feature_indices = map {$feature_names->[$_], $_} 0..$#{$feature_names};
my $last_index = keys %feature_indices;
# We use the 'sparse' format, see http://www.cs.waikato.ac.nz/~ml/weka/arff.html
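# A sparse data row lists only nonzero attributes as "index value" pairs,
# e.g. {0 2, 5 1, 123 '1'}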
print $fh "\@DATA\n";
foreach my $doc (@$docs) {
my ($features, $cat) = @$doc;
my $f = $features->as_hash;
my @ordered_keys = (sort {$feature_indices{$a} <=> $feature_indices{$b}}
grep {exists $feature_indices{$_}}
keys %$f);
print $fh ("{",
join(', ', map("$feature_indices{$_} $f->{$_}", @ordered_keys), "$last_index '$cat'"),
"}\n"
);
}
return $filename;
}
sub save_state {
my ($self, $path) = @_;
t/01-naive_bayes.t
while (my ($name, $data) = each %docs) {
$c->knowledge_set->make_document(name => $name, %$data);
}
$c->knowledge_set->finish;
# Make sure collection_weighting is working
ok $c->knowledge_set->document_frequency('vampires'), 2;
for ('vampires', 'mirrors') {
ok ($c->knowledge_set->document('doc4')->features->as_hash->{$_},
log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
);
}
$c->learner->train( knowledge_set => $c->knowledge_set );
ok $c->learner;
my $doc = new AI::Categorizer::Document
( name => 'test1',
content => 'I would like to begin farming sheep.' );
t/01-naive_bayes.t
{
ok my $c = new AI::Categorizer(term_weighting => 'b');
while (my ($name, $data) = each %docs) {
$c->knowledge_set->make_document(name => $name, %$data);
}
$c->knowledge_set->finish;
# Make sure term_weighting is working
ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
}
{
ok my $c = new AI::Categorizer(term_weighting => 'n');
while (my ($name, $data) = each %docs) {
$c->knowledge_set->make_document(name => $name, %$data);
}
$c->knowledge_set->finish;
# Make sure term_weighting is working
ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
ok $c->knowledge_set->document('doc3')->features->as_hash->{blood}, 0.75;
ok $c->knowledge_set->document('doc4')->features->as_hash->{mirrors}, 1;
}
{
ok my $c = new AI::Categorizer(tfidf_weighting => 'txx');
while (my ($name, $data) = each %docs) {
$c->knowledge_set->make_document(name => $name, %$data);
}
$c->knowledge_set->finish;
# Make sure term_weighting is working
ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 2;
}
t/11-feature_vector.t
use strict;
use Test;
BEGIN {
plan tests => 18;
}
use AI::Categorizer::FeatureVector;
ok(1);
my $f1 = new AI::Categorizer::FeatureVector(features => {sports => 2, finance => 3});
ok $f1;
ok $f1->includes('sports');
ok $f1->value('sports'), 2;
my $f2 = new AI::Categorizer::FeatureVector;
ok $f2;
$f2->set({sports => 5, hockey => 7});
ok $f2->value('sports'), 5;
ok $f2->value('hockey'), 7;
my $h = $f2->as_hash;
ok keys(%$h), 2;
ok $f1->dot($f2), 10;
ok $f2->dot($f1), 10;
my $pkg = 'AI::Categorizer::FeatureVector::FastDot';
if (eval "use $pkg; 1") {
my $f1 = $pkg->new(features => {sports => 2, finance => 3});
my $f2 = $pkg->new(features => {sports => 5, hockey => 7});
ok $f1;
ok $f2;
$pkg->all_features([qw(sports finance hockey)]);
ok keys(%{$pkg->all_features}), 3;
ok $f1->dot($f2), 10;
ok $f2->dot($f1), 10;
} else {
skip "skip $pkg is not available", 1 for 1..5;
}
{
# Call normalize() on an empty vector
my $f = AI::Categorizer::FeatureVector->new(features => {});
ok $f->euclidean_length, 0;
eval {$f->normalize};
ok $@, '';
ok $f->normalize, $f;
}
t/13-document.t
use AI::Categorizer::Document;
use AI::Categorizer::FeatureVector;
ok(1);
my $docclass = 'AI::Categorizer::Document';
# Test empty document creation
{
my $d = $docclass->new;
ok ref($d), $docclass, "Basic empty document creation";
ok $d->features, undef;
}
# Test basic document creation
{
my $d = $docclass->new(content => "Hello world");
ok ref($d), $docclass, "Basic document creation with 'content' parameter";
ok $d->features->includes('hello'), 1;
ok $d->features->includes('world'), 1;
ok $d->features->includes('foo'), '';
}
# Test document creation with 'parse'
{
require AI::Categorizer::Document::Text;
my $d = AI::Categorizer::Document::Text->new( parse => "Hello world" );
ok ref($d), 'AI::Categorizer::Document::Text', "Document creation with 'parse' parameter";
ok $d->features->includes('hello'), 1;
ok $d->features->includes('world'), 1;
ok $d->features->includes('foo'), '';
}
# Test document creation with 'features'
{
my $d = $docclass->new(features => AI::Categorizer::FeatureVector->new(features => {one => 1, two => 2}));
ok ref($d), $docclass, "Document creation with 'features' parameter";
ok $d->features->value('one'), 1;
ok $d->features->value('two'), 2;
ok $d->features->includes('foo'), '';
}
# Test some stemming & stopword stuff.
{
my $d = $docclass->new
(
name => 'test',
stopwords => ['stemmed'],
stemming => 'porter',
content => 'stopword processing should happen after stemming',
# Becomes qw(stopword process should happen after stem )
);
ok $d->stopword_behavior, 'stem', "stopword_behavior() is 'stem'";
ok $d->features->includes('stopword'), 1, "Should include 'stopword'";
ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
ok $d->features->includes('stem'), '', "Shouldn't include 'stem'";
print "Features: @{[ $d->features->names ]}\n";
}
{
my $d = $docclass->new
(
name => 'test',
stopwords => ['stemmed'],
stemming => 'porter',
stopword_behavior => 'no_stem',
content => 'stopword processing should happen after stemming',
# Becomes qw(stopword process should happen after stem )
);
ok $d->stopword_behavior, 'no_stem', "stopword_behavior() is 'no_stem'";
ok $d->features->includes('stopword'), 1, "Should include 'stopword'";
ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
ok $d->features->includes('stem'), 1, "Should include 'stem'";
print "Features: @{[ $d->features->names ]}\n";
}
{
my $d = $docclass->new
(
name => 'test',
stopwords => ['stem'],
stemming => 'porter',
stopword_behavior => 'pre_stemmed',
content => 'stopword processing should happen after stemming',
# Becomes qw(stopword process should happen after stem )
);
ok $d->stopword_behavior, 'pre_stemmed', "stopword_behavior() is 'pre_stemmed'";
ok $d->features->includes('stopword'), 1, "Should include 'stopword'";
ok $d->features->includes('stemming'), '', "Shouldn't include 'stemming'";
ok $d->features->includes('stem'), '', "Shouldn't include 'stem'";
print "Features: @{[ $d->features->names ]}\n";
}