- Changed the web locations of the reuters-21578 corpus that
eg/demo.pl uses, since the location it referenced previously has
gone away.
- The building & installing process now uses Module::Build rather
than ExtUtils::MakeMaker.
- When the features_kept mechanism was used to explicitly state the
features to use, and the scan_first parameter was left as its
default value, the features_kept mechanism would silently fail to
do anything. This has now been fixed. [Spotted by Arnaud Gaudinat]
- Recent versions of Weka have changed the name of the SVM class, so
I've updated it in our test (t/03-weka.t) of the Weka wrapper
too. [Sebastien Aperghis-Tramoni]
0.07 Tue May 6 16:15:04 CDT 2003
- Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
MANIFEST, so they weren't included in the 0.06 distribution.
would often result in the wrong category being reported. Added a
regression test to exercise the Hypothesis class. [Spotted by
Xiaobo Li]
- The 'categorizer' script now records more useful benchmarking
information about time & memory in its outfile.
- The AI::Categorizer->dump_parameters() method now tries to avoid
showing you its entire list of stopwords.
- Document objects now use a default 'name' if none is supplied.
- For some Learner classes, the generated Hypothesis objects had
non-functioning all_categories() methods. Fixed.
- The Collection::Files class now uses File::Spec internally to
manage cross-platform filenames.
- Added the 'stopword_behavior' parameter for controlling how
stopword lists and stemming interact. Previously, if stopwords &
stemming were both used, stopwords were assumed to be pre-stemmed,
- Added document($name) accessor method to KnowledgeSet.
- In KnowledgeSet, load(), read(), and scan_*() can now accept a
Collection object.
- Added document_frequency(), finish(), and weigh_features() methods
to KnowledgeSet.
- Added save_features() and restore_features() to KnowledgeSet.
- Added default categories() and categorize() methods to Learner base
class. get_scores() is now abstract.
- Extended interface of ObjectSet class with retrieve(), includes(),
and includes_name().
- Moved 'term_weighting' parameter from Document to KnowledgeSet,
since the normalized version needs to know the maximum
term-frequency. Also changed its values to 'n', 'l', 'b', and 't',
with 'x' a synonym for 't'.
here, you may pass any parameter accepted by any class that we create
internally (the KnowledgeSet, Learner, Experiment, or Collection
classes), or any class that *they* create. This is managed by the
"Class::Container" module, so see its documentation for the details of
how this works.
The specific parameters accepted here are:
progress_file
A string that indicates a place where objects will be saved during
several of the methods of this class. The default value is the
string "save", which means files like "save-01-knowledge_set" will
get created. The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.
verbose
If true, a few status messages will be printed during execution.
training_set
Specifies the "path" parameter that will be fed to the
lib/AI/Categorizer.pm view on Meta::CPAN
use AI::Categorizer::Learner;
use AI::Categorizer::Document;
use AI::Categorizer::Category;
use AI::Categorizer::Collection;
use AI::Categorizer::Hypothesis;
use AI::Categorizer::KnowledgeSet;
__PACKAGE__->valid_params
(
progress_file => { type => SCALAR, default => 'save' },
knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
learner => { isa => 'AI::Categorizer::Learner' },
verbose => { type => BOOLEAN, default => 0 },
training_set => { type => SCALAR, optional => 1 },
test_set => { type => SCALAR, optional => 1 },
data_root => { type => SCALAR, optional => 1 },
);
__PACKAGE__->contained_objects
(
knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
learner => { class => 'AI::Categorizer::Learner::NaiveBayes' },
experiment => { class => 'AI::Categorizer::Experiment',
delayed => 1 },
collection => { class => 'AI::Categorizer::Collection::Files',
delayed => 1 },
);
sub new {
my $package = shift;
my %args = @_;
my %defaults;
if (exists $args{data_root}) {
$defaults{training_set} = File::Spec->catfile($args{data_root}, 'training');
$defaults{test_set} = File::Spec->catfile($args{data_root}, 'test');
$defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
delete $args{data_root};
}
return $package->SUPER::new(%defaults, %args);
}
#sub dump_parameters {
# my $p = shift()->SUPER::dump_parameters;
# delete $p->{stopwords} if $p->{stopword_file};
# return $p;
#}
sub knowledge_set { shift->{knowledge_set} }
sub learner { shift->{learner} }
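The constructor above expands a single data_root argument into default paths and lets explicit arguments override them. The following standalone sketch shows that merge. The helper name expand_data_root and the sample directory names are hypothetical, chosen only for illustration, and are not part of the module.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper that mirrors the merge logic of the constructor.
sub expand_data_root {
    my %args = @_;
    my %defaults;
    if (exists $args{data_root}) {
        my $root = delete $args{data_root};
        $defaults{training_set}  = File::Spec->catfile($root, 'training');
        $defaults{test_set}      = File::Spec->catfile($root, 'test');
        $defaults{category_file} = File::Spec->catfile($root, 'cats.txt');
    }
    # When a list is assigned to a hash, later pairs win, so explicit
    # arguments take precedence over the derived defaults.
    return (%defaults, %args);
}

my %params = expand_data_root(data_root => 'corpus', test_set => 'held_out');
print "$_ => $params{$_}\n" for sort keys %params;
```

Placing %defaults before %args in the returned list is what makes the explicit test_set win over the derived one.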
lib/AI/Categorizer.pm view on Meta::CPAN
L<its documentation|Class::Container> for the details of how this
works.
The specific parameters accepted here are:
=over 4
=item progress_file
A string that indicates a place where objects will be saved during
several of the methods of this class. The default value is the string
C<save>, which means files like C<save-01-knowledge_set> will get
created. The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.
=item verbose
If true, a few status messages will be printed during execution.
=item training_set
lib/AI/Categorizer/Category.pm view on Meta::CPAN
use base qw(Class::Container);
use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
__PACKAGE__->valid_params
(
name => {type => SCALAR, public => 0},
documents => {
type => ARRAYREF,
default => [],
callbacks => { 'all are Document objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Document'), @{$_[0]} },
},
public => 0,
},
);
__PACKAGE__->contained_objects
(
features => {
lib/AI/Categorizer/Collection.pm view on Meta::CPAN
package AI::Categorizer::Collection;
use strict;
use Params::Validate qw(:types);
use Class::Container;
use base qw(Class::Container);
__PACKAGE__->valid_params
(
verbose => {type => SCALAR, default => 0},
stopword_file => { type => SCALAR, optional => 1 },
category_hash => { type => HASHREF, default => {} },
category_file => { type => SCALAR, optional => 1 },
);
__PACKAGE__->contained_objects
(
document => { class => 'AI::Categorizer::Document::Text',
delayed => 1 },
);
sub new {
lib/AI/Categorizer/Collection.pm view on Meta::CPAN
=item verbose
If true, some status/debugging information will be printed to
C<STDOUT> during operation.
=item document_class
The class indicating what type of Document object should be created.
This generally specifies the format that the documents are stored in.
The default is C<AI::Categorizer::Document::Text>.
=back
=item next()
Returns the next Document object in the Collection.
=item rewind()
Resets the iterator for further calls to C<next()>.
lib/AI/Categorizer/Collection/DBI.pm view on Meta::CPAN
use strict;
use DBI;
use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
connection_string => {type => SCALAR, default => undef},
dbh => {isa => 'DBI::db', default => undef},
select_statement => {type => SCALAR, default => "SELECT text FROM documents"},
);
__PACKAGE__->contained_objects
(
document => { class => 'AI::Categorizer::Document',
delayed => 1 },
);
sub new {
my $class = shift;
lib/AI/Categorizer/Collection/Files.pm view on Meta::CPAN
use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);
use Params::Validate qw(:types);
use File::Spec;
__PACKAGE__->valid_params
(
path => { type => SCALAR|ARRAYREF },
recurse => { type => BOOLEAN, default => 0 },
);
sub new {
my $class = shift;
my $self = $class->SUPER::new(@_);
$self->{dir_fh} = do {local *FH; *FH}; # double *FH avoids a warning
# Documents are contained in a directory, or list of directories
$self->{path} = [$self->{path}] unless ref $self->{path};
lib/AI/Categorizer/Collection/Files.pm view on Meta::CPAN
Indicates a location on disk where the documents can be found. The
path may be specified as a string giving the name of a directory, or
as a reference to an array of such strings if the documents are
located in more than one directory.
=item recurse
Indicates whether subdirectories of the directory (or directories) in
the C<path> parameter should be descended into. If set to a true
value, they will be descended into. If false, they will be ignored.
The default is false.
=back
=back
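As a rough illustration of the C<path> and C<recurse> semantics described above, here is a standalone sketch that lists files from one or more directories, descending into subdirectories only when asked. The function name list_documents is hypothetical, and this is not the module's own traversal code.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: $path is a directory name or an array reference of
# directory names; $recurse is a boolean.
sub list_documents {
    my ($path, $recurse) = @_;
    my @dirs = ref $path ? @$path : ($path);
    my @files;
    while (defined(my $dir = shift @dirs)) {
        opendir my $dh, $dir or die "Can't open directory $dir: $!";
        my @entries = sort grep { !/^\.\.?$/ } readdir $dh;
        closedir $dh;
        for my $entry (@entries) {
            my $full = File::Spec->catfile($dir, $entry);
            if (-d $full) {
                # Subdirectories are queued only when recursion is enabled.
                push @dirs, $full if $recurse;
            }
            else {
                push @files, $full;
            }
        }
    }
    return @files;
}

print "$_\n" for list_documents(['.'], 0);
```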
=head1 AUTHOR
Ken Williams, ken@mathforum.org
=head1 COPYRIGHT
lib/AI/Categorizer/Collection/SingleFile.pm view on Meta::CPAN
use strict;
use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
path => { type => SCALAR|ARRAYREF },
categories => { type => HASHREF|UNDEF, default => undef },
delimiter => { type => SCALAR },
);
__PACKAGE__->contained_objects
(
document => { class => 'AI::Categorizer::Document::Text',
delayed => 1 },
);
sub new {
lib/AI/Categorizer/Document.pm view on Meta::CPAN
use AI::Categorizer::ObjectSet;
use AI::Categorizer::FeatureVector;
__PACKAGE__->valid_params
(
name => {
type => SCALAR,
},
categories => {
type => ARRAYREF,
default => [],
callbacks => { 'all are Category objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Category'), @{$_[0]} },
},
public => 0,
},
stopwords => {
type => ARRAYREF|HASHREF,
default => {},
},
content => {
type => HASHREF|SCALAR,
default => undef,
},
parse => {
type => SCALAR,
optional => 1,
},
parse_handle => {
type => HANDLE,
optional => 1,
},
features => {
isa => 'AI::Categorizer::FeatureVector',
optional => 1,
},
content_weights => {
type => HASHREF,
default => {},
},
front_bias => {
type => SCALAR,
default => 0,
},
use_features => {
type => HASHREF|UNDEF,
default => undef,
},
stemming => {
type => SCALAR|UNDEF,
optional => 1,
},
stopword_behavior => {
type => SCALAR,
default => "stem",
},
);
__PACKAGE__->contained_objects
(
features => { delayed => 1,
class => 'AI::Categorizer::FeatureVector' },
);
### Constructors
my $NAME = 'a';
sub new {
my $pkg = shift;
my $self = $pkg->SUPER::new(name => $NAME++, # Use a default name
@_);
# Get efficient internal data structures
$self->{categories} = new AI::Categorizer::ObjectSet( @{$self->{categories}} );
$self->_fix_stopwords;
# A few different ways for the caller to initialize the content
if (exists $self->{parse}) {
$self->parse(content => delete $self->{parse});
lib/AI/Categorizer/Document.pm view on Meta::CPAN
Stem stopwords according to 'stemming' parameter, then match them
against stemmed document words.
=item pre_stemmed
Stopwords are already stemmed, match them against stemmed document
words.
=back
The default value is C<stem>, which seems to produce the best results
in most cases I've tried. I'm not aware of any studies comparing the
C<no_stem> behavior to the C<stem> behavior in the general case.
This parameter has no effect if there are no stopwords being used, or
if stemming is not being used. In the latter case, the list of
stopwords will always be matched as-is against the document words.
Note that if the C<stem> option is used, the data structure passed as
the C<stopwords> parameter will be modified in-place to contain the
stemmed versions of the stopwords supplied.
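The difference between the behaviors can be seen in a small standalone sketch. It assumes that C<no_stem> means matching the raw stopwords against the raw document words before stemming, and it uses a deliberately crude toy stemmer. Both toy_stem and filter_words are hypothetical names, not functions of this module.

```perl
use strict;
use warnings;

# Toy stemmer for illustration only: strips a trailing "ing" or "s".
sub toy_stem {
    my $word = shift;
    $word =~ s/(?:ing|s)$//;
    return $word;
}

# Returns the stemmed document words that survive stopword removal.
sub filter_words {
    my ($behavior, $stopwords, @words) = @_;
    my %stop = map { ($_ => 1) } @$stopwords;

    if ($behavior eq 'no_stem') {
        # Match raw stopwords against raw words, then stem what remains.
        return map { toy_stem($_) } grep { !$stop{$_} } @words;
    }
    if ($behavior eq 'stem') {
        # Stem the stopwords first so they can match stemmed words.
        %stop = map { (toy_stem($_) => 1) } @$stopwords;
    }
    # 'stem' and 'pre_stemmed' both match against stemmed document words.
    return grep { !$stop{$_} } map { toy_stem($_) } @words;
}

for my $behavior (qw(no_stem stem pre_stemmed)) {
    my @kept = filter_words($behavior, ['being'], qw(being be cats));
    print "$behavior: @kept\n";
}
```

Unlike the C<stem> option described above, this sketch builds a fresh lookup hash and leaves the caller's stopword list unmodified.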
lib/AI/Categorizer/Document/XML.pm view on Meta::CPAN
# it is called whenever the parser ends the element
sub end_element{
my ($self, $el)= @_;
$self->{levelPointer}--;
my $location= $self->{locationArray}[$self->{levelPointer}];
# find the name of the element
my $elementName= $el->{Name};
# set the default weight
my $weight= 1;
# check whether the user gave a weight for duplicating this element's data
$weight= $self->{weightHash}{$elementName} if exists $self->{weightHash}{$elementName};
# weight 0 - remove all the data related to this element
if($weight == 0){
$self->{content} = substr($self->{content}, 0, $location);
return;
}
lib/AI/Categorizer/Experiment.pm view on Meta::CPAN
use Class::Container;
use AI::Categorizer::Storable;
use Statistics::Contingency;
use base qw(Class::Container AI::Categorizer::Storable Statistics::Contingency);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
categories => { type => ARRAYREF|HASHREF },
sig_figs => { type => SCALAR, default => 4 },
);
sub new {
my $package = shift;
my $self = $package->Class::Container::new(@_);
$self->{$_} = 0 foreach qw(a b c d);
my $c = delete $self->{categories};
$self->{categories} = { map {($_ => {a=>0, b=>0, c=>0, d=>0})}
UNIVERSAL::isa($c, 'HASH') ? keys(%$c) : @$c
lib/AI/Categorizer/FeatureSelector.pm view on Meta::CPAN
use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Util;
use Carp qw(croak);
__PACKAGE__->valid_params
(
features_kept => {
type => SCALAR,
default => 0.2,
},
verbose => {
type => SCALAR,
default => 0,
},
);
sub verbose {
my $self = shift;
$self->{verbose} = shift if @_;
return $self->{verbose};
}
sub reduce_features {
lib/AI/Categorizer/FeatureSelector.pm view on Meta::CPAN
C<categories> parameter should also be specified.
=item features_kept
A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents. May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used. The default is
0.2.
=item feature_selection
A string indicating the type of feature selection that should be
performed. Currently the only option is also the default option:
C<document_frequency>.
=item tfidf_weighting
Specifies how document word counts should be converted to vector
values. Uses the three-character specification strings from Salton &
Buckley's paper "Term-weighting approaches in automatic text
retrieval". The three characters indicate the three factors that will
be multiplied for each feature to find the final vector value for that
feature. The default weighting is C<xxx>.
The first character specifies the "term frequency" component, which
can take the following values:
=over 4
=item b
Binary weighting - 1 for terms present in a document, 0 for terms absent.
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm view on Meta::CPAN
(
features => { class => 'AI::Categorizer::FeatureVector',
delayed => 1 },
);
1;
sub reduction_function;
# figure out the feature set before reading collection (default)
sub scan_features {
my ($self, %args) = @_;
my $c = $args{collection} or
die "No 'collection' parameter provided to scan_features()";
if(!($self->{features_kept})) {return;}
my %cat_features;
my $coll_features = $self->create_delayed_object('features');
lib/AI/Categorizer/KnowledgeSet.pm view on Meta::CPAN
use AI::Categorizer::Document;
use AI::Categorizer::Category;
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Util;
use Carp qw(croak);
__PACKAGE__->valid_params
(
categories => {
type => ARRAYREF,
default => [],
callbacks => { 'all are Category objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Category'),
@{$_[0]} },
},
},
documents => {
type => ARRAYREF,
default => [],
callbacks => { 'all are Document objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Document'),
@{$_[0]} },
},
},
scan_first => {
type => BOOLEAN,
default => 1,
},
feature_selector => {
isa => 'AI::Categorizer::FeatureSelector',
},
tfidf_weighting => {
type => SCALAR,
optional => 1,
},
term_weighting => {
type => SCALAR,
default => 'x',
},
collection_weighting => {
type => SCALAR,
default => 'x',
},
normalize_weighting => {
type => SCALAR,
default => 'x',
},
verbose => {
type => SCALAR,
default => 0,
},
);
__PACKAGE__->contained_objects
(
document => { delayed => 1,
class => 'AI::Categorizer::Document' },
category => { delayed => 1,
class => 'AI::Categorizer::Category' },
collection => { delayed => 1,
lib/AI/Categorizer/KnowledgeSet.pm view on Meta::CPAN
C<categories> parameter should also be specified.
=item features_kept
A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents. May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used. The default is
0.2.
=item feature_selection
A string indicating the type of feature selection that should be
performed. Currently the only option is also the default option:
C<document_frequency>.
=item tfidf_weighting
Specifies how document word counts should be converted to vector
values. Uses the three-character specification strings from Salton &
Buckley's paper "Term-weighting approaches in automatic text
retrieval". The three characters indicate the three factors that will
be multiplied for each feature to find the final vector value for that
feature. The default weighting is C<xxx>.
The first character specifies the "term frequency" component, which
can take the following values:
=over 4
=item b
Binary weighting - 1 for terms present in a document, 0 for terms absent.
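To make the "term frequency" component concrete, here is a standalone sketch of four common choices. The formulas are the conventional ones (binary, raw count, logarithmic as 1 + log(tf), and normalized as 0.5 + 0.5 * tf / max tf). Whether this module uses exactly these formulas for each letter is an assumption, and weigh_terms is a hypothetical name.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: converts raw term counts into weights.
# $mode is 'b' (binary), 'l' (logarithmic), 'n' (normalized),
# or 't'/'x' (raw term frequency).
sub weigh_terms {
    my ($mode, %tf) = @_;
    # The normalized mode needs the largest count in the document.
    my $max_tf = max(values %tf) || 1;
    my %weight;
    for my $term (keys %tf) {
        my $tf = $tf{$term};
        $weight{$term} = $mode eq 'b' ? ($tf > 0 ? 1 : 0)
                       : $mode eq 'l' ? ($tf > 0 ? 1 + log($tf) : 0)
                       : $mode eq 'n' ? 0.5 + 0.5 * $tf / $max_tf
                       :                $tf;
    }
    return %weight;
}

for my $mode (qw(b t l n)) {
    my %w = weigh_terms($mode, cat => 3, dog => 1);
    printf "%s: cat=%.3f dog=%.3f\n", $mode, $w{cat}, $w{dog};
}
```

The normalized variant is the only one that needs document-wide information (the maximum term frequency), which is why such a weighting cannot be computed from a single term in isolation.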
lib/AI/Categorizer/Learner.pm view on Meta::CPAN
use Class::Container;
use AI::Categorizer::Storable;
use base qw(Class::Container AI::Categorizer::Storable);
use Params::Validate qw(:types);
use AI::Categorizer::ObjectSet;
__PACKAGE__->valid_params
(
knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet', optional => 1 },
verbose => {type => SCALAR, default => 0},
);
__PACKAGE__->contained_objects
(
hypothesis => {
class => 'AI::Categorizer::Hypothesis',
delayed => 1,
},
experiment => {
class => 'AI::Categorizer::Experiment',
lib/AI/Categorizer/Learner.pm view on Meta::CPAN
=item new()
Creates a new Learner and returns it. Accepts the following
parameters:
=over 4
=item knowledge_set
A Knowledge Set that will be used by default during the C<train()>
method.
=item verbose
If true, the Learner will display some diagnostic output while
training and categorizing documents.
=back
=item train()
lib/AI/Categorizer/Learner/Boolean.pm view on Meta::CPAN
package AI::Categorizer::Learner::Boolean;
use strict;
use AI::Categorizer::Learner;
use base qw(AI::Categorizer::Learner);
use Params::Validate qw(:types);
use AI::Categorizer::Util qw(random_elements);
__PACKAGE__->valid_params
(
max_instances => {type => SCALAR, default => 0},
threshold => {type => SCALAR, default => 0.5},
);
sub create_model {
my $self = shift;
my $m = $self->{model} ||= {};
my $mi = $self->{max_instances};
foreach my $cat ($self->knowledge_set->categories) {
my (@p, @n);
foreach my $doc ($self->knowledge_set->documents) {
lib/AI/Categorizer/Learner/KNN.pm view on Meta::CPAN
package AI::Categorizer::Learner::KNN;
use strict;
use AI::Categorizer::Learner;
use base qw(AI::Categorizer::Learner);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
threshold => {type => SCALAR, default => 0.4},
k_value => {type => SCALAR, default => 20},
knn_weighting => {type => SCALAR, default => 'score'},
max_instances => {type => SCALAR, default => 0},
);
sub create_model {
my $self = shift;
foreach my $doc ($self->knowledge_set->documents) {
$doc->features->normalize;
}
$self->knowledge_set->features; # Initialize
}
lib/AI/Categorizer/Learner/KNN.pm view on Meta::CPAN
=head2 new()
Creates a new KNN Learner and returns it. In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
KNN subclass accepts the following parameters:
=over 4
=item threshold
Sets the score threshold for category membership. The default is
currently 0.4. Set the threshold lower to assign more categories per
document, set it higher to assign fewer. This can be an effective way
to trade off between precision and recall.
=item k_value
Sets the C<k> value (as in k-Nearest-Neighbor) to the given integer.
This indicates how many of each document's nearest neighbors should be
considered when assigning categories. The default is 20.
=back
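The interplay of C<k_value> and C<threshold> can be illustrated with a standalone sketch. It assumes each category's score is the sum of the similarities of the k nearest neighbors carrying that category, divided by k. This is one plausible scoring rule and not necessarily the one this class implements. The name knn_categories and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper. $neighbors is an array reference of
# [similarity, [category names]] pairs, one per training document.
sub knn_categories {
    my ($neighbors, $k, $threshold) = @_;

    # Keep the k most similar training documents.
    my @sorted  = sort { $b->[0] <=> $a->[0] } @$neighbors;
    my @nearest = splice @sorted, 0, $k;

    # Each neighbor contributes its similarity, averaged over k.
    my %score;
    for my $neighbor (@nearest) {
        my ($similarity, $categories) = @$neighbor;
        $score{$_} += $similarity / $k for @$categories;
    }

    # Assign every category whose score exceeds the threshold.
    my @assigned = grep { $score{$_} > $threshold } keys %score;
    return sort @assigned;
}

my $neighbors = [
    [0.9, ['sports']],
    [0.6, ['sports', 'news']],
    [0.1, ['finance']],
];
print join(', ', knn_categories($neighbors, 2, 0.4)), "\n";
print join(', ', knn_categories($neighbors, 2, 0.2)), "\n";
```

With the lower threshold the second call assigns more categories than the first, which is the precision/recall trade-off described above.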
=head2 threshold()
Returns the current threshold value. With an optional numeric
argument, you may set the threshold.
=head2 train(knowledge_set => $k)
lib/AI/Categorizer/Learner/NaiveBayes.pm view on Meta::CPAN
package AI::Categorizer::Learner::NaiveBayes;
use strict;
use AI::Categorizer::Learner;
use base qw(AI::Categorizer::Learner);
use Params::Validate qw(:types);
use Algorithm::NaiveBayes;
__PACKAGE__->valid_params
(
threshold => {type => SCALAR, default => 0.3},
);
sub create_model {
my $self = shift;
my $m = $self->{model} = Algorithm::NaiveBayes->new;
foreach my $d ($self->knowledge_set->documents) {
$m->add_instance(attributes => $d->features->as_hash,
label => [ map $_->name, $d->categories ]);
}
lib/AI/Categorizer/Learner/NaiveBayes.pm view on Meta::CPAN
=head2 new()
Creates a new Naive Bayes Learner and returns it. In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
Naive Bayes subclass accepts the following parameters:
=over 4
=item * threshold
Sets the score threshold for category membership. The default is
currently 0.3. Set the threshold lower to assign more categories per
document, set it higher to assign fewer. This can be an effective way
to trade off between precision and recall.
=back
=head2 threshold()
Returns the current threshold value. With an optional numeric
argument, you may set the threshold.
lib/AI/Categorizer/Learner/Rocchio.pm view on Meta::CPAN
$VERSION = '0.01';
use strict;
use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Learner::Boolean;
use base qw(AI::Categorizer::Learner::Boolean);
__PACKAGE__->valid_params
(
positive_setting => {type => SCALAR, default => 16 },
negative_setting => {type => SCALAR, default => 4 },
threshold => {type => SCALAR, default => 0.1},
);
sub create_model {
my $self = shift;
foreach my $doc ($self->knowledge_set->documents) {
$doc->features->normalize;
}
$self->{model}{all_features} = $self->knowledge_set->features(undef);
$self->SUPER::create_model(@_);
lib/AI/Categorizer/Learner/SVM.pm view on Meta::CPAN
use strict;
use AI::Categorizer::Learner::Boolean;
use base qw(AI::Categorizer::Learner::Boolean);
use Algorithm::SVM;
use Algorithm::SVM::DataSet;
use Params::Validate qw(:types);
use File::Spec;
__PACKAGE__->valid_params
(
svm_kernel => {type => SCALAR, default => 'linear'},
);
sub create_model {
my $self = shift;
my $f = $self->knowledge_set->features->as_hash;
my $rmap = [ keys %$f ];
$self->{model}{feature_map} = { map { $rmap->[$_], $_ } 0..$#$rmap };
$self->{model}{feature_map_reverse} = $rmap;
$self->SUPER::create_model(@_);
}
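The create_model method above builds two lookup structures: a map from feature name to integer index, and the reverse list from index back to name. The following standalone sketch shows the same idea and one way such a map can be used to turn a sparse document into a dense vector. The feature names, counts, and the to_vector helper are hypothetical. The keys are sorted here only to make the output reproducible.

```perl
use strict;
use warnings;

my %features = (ball => 3, goal => 1, vote => 2);   # hypothetical counts

# Index-to-name list and its inverse, the name-to-index map.
my @index_to_name = sort keys %features;
my %name_to_index = map { ($index_to_name[$_] => $_) } 0 .. $#index_to_name;

# Convert a sparse document (name => value) into a dense vector ordered
# by feature index. Features unknown to the map are ignored.
sub to_vector {
    my %doc = @_;
    my @vector = (0) x @index_to_name;
    for my $name (grep { exists $name_to_index{$_} } keys %doc) {
        $vector[ $name_to_index{$name} ] = $doc{$name};
    }
    return @vector;
}

print join(' ', to_vector(goal => 5, unknown => 9)), "\n";
```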
lib/AI/Categorizer/Learner/Weka.pm view on Meta::CPAN
use AI::Categorizer::Learner::Boolean;
use base qw(AI::Categorizer::Learner::Boolean);
use Params::Validate qw(:types);
use File::Spec;
use File::Copy;
use File::Path ();
use File::Temp ();
__PACKAGE__->valid_params
(
java_path => {type => SCALAR, default => 'java'},
java_args => {type => SCALAR|ARRAYREF, optional => 1},
weka_path => {type => SCALAR, optional => 1},
weka_classifier => {type => SCALAR, default => 'weka.classifiers.NaiveBayes'},
weka_args => {type => SCALAR|ARRAYREF, optional => 1},
tmpdir => {type => SCALAR, default => File::Spec->tmpdir},
);
__PACKAGE__->contained_objects
(
features => {class => 'AI::Categorizer::FeatureVector', delayed => 1},
);
sub new {
my $class = shift;
my $self = $class->SUPER::new(@_);
lib/AI/Categorizer/Learner/Weka.pm view on Meta::CPAN
Creates a new Weka Learner and returns it. In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
Weka subclass accepts the following parameters:
=over 4
=item java_path
Specifies where the C<java> executable can be found on this system.
The default is simply C<java>, meaning that it will search your
C<PATH> to find java.
=item java_args
Specifies a list of any additional arguments to give to the java
process. Commonly it's necessary to allocate more memory than the
default, using an argument like C<-Xmx130M>.
=item weka_path
Specifies the path to the C<weka.jar> file containing the Weka
bytecode. If Weka has been installed somewhere in your java
C<CLASSPATH>, you needn't specify a C<weka_path>.
=item weka_classifier
Specifies the Weka class to use for a categorizer. The default is
C<weka.classifiers.NaiveBayes>. Consult your Weka documentation for a
list of other classifiers available.
=item weka_args
Specifies a list of any additional arguments to pass to the Weka
classifier class when building the categorizer.
=item tmpdir
A directory in which temporary files will be written when training the
categorizer and categorizing new documents. The default is given by
C<< File::Spec->tmpdir >>.
=back
=head2 train(knowledge_set => $k)
Trains the categorizer. This prepares it for later use in
categorizing documents. The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories. See