be seen in "doc/classes.png".
Knowledge Sets
A "knowledge set" is defined as a collection of documents, together with
some information on the categories each document belongs to. Note that this
term is specific to this project; other sources may call it a "training
corpus" or "prior knowledge". A knowledge set also contains some
information on how documents will be parsed and how their features (words)
will be extracted and turned into meaningful representations. In this sense,
a knowledge set represents not only a collection of data, but a particular
view on that data.
A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
class. Before you can start playing with categorizers, you will have to
start playing with knowledge sets, so that the categorizers have some data
to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
module for information on its interface.
Feature selection
Deciding which features are the most important is a very large part of the
categorization task - you cannot simply consider all the words in all the
documents when training, and all the words in the document being
categorized. There are two main reasons for this - first, it would mean that
your training and categorizing processes would take forever and use tons of
lib/AI/Categorizer.pm view on Meta::CPAN
__PACKAGE__->valid_params
  (
   progress_file => { type => SCALAR, default => 'save' },
   knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
   learner       => { isa => 'AI::Categorizer::Learner' },
   verbose       => { type => BOOLEAN, default => 0 },
   training_set  => { type => SCALAR, optional => 1 },
   test_set      => { type => SCALAR, optional => 1 },
   data_root     => { type => SCALAR, optional => 1 },
  );

__PACKAGE__->contained_objects
  (
   knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
   learner       => { class => 'AI::Categorizer::Learner::NaiveBayes' },
   experiment    => { class => 'AI::Categorizer::Experiment',
                      delayed => 1 },
   collection    => { class => 'AI::Categorizer::Collection::Files',
                      delayed => 1 },
  );

sub new {
  my $package = shift;
  my %args = @_;
  my %defaults;
  if (exists $args{data_root}) {
    $defaults{training_set} = File::Spec->catfile($args{data_root}, 'training');
    $defaults{test_set} = File::Spec->catfile($args{data_root}, 'test');
    $defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
    delete $args{data_root};
  }
  return $package->SUPER::new(%defaults, %args);
}
#sub dump_parameters {
# my $p = shift()->SUPER::dump_parameters;
# delete $p->{stopwords} if $p->{stopword_file};
# return $p;
#}
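The new() constructor above builds default paths from data_root and then passes (%defaults, %args) upward, so any argument the caller supplies explicitly wins over the derived default. A minimal stand-alone sketch of that merge order follows; the helper name expand_data_root and the sample values are hypothetical and not part of AI::Categorizer.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper mirroring the merge order used in new():
# derived defaults come first, so explicit arguments override them.
sub expand_data_root {
    my %args = @_;
    my %defaults;
    if (exists $args{data_root}) {
        my $root = delete $args{data_root};
        $defaults{training_set}  = File::Spec->catfile($root, 'training');
        $defaults{test_set}      = File::Spec->catfile($root, 'test');
        $defaults{category_file} = File::Spec->catfile($root, 'cats.txt');
    }
    # Later pairs replace earlier ones when the list becomes a hash.
    return (%defaults, %args);
}

# 'corpus' and 'held_out' are made-up values for illustration.
my %p = expand_data_root(data_root => 'corpus', test_set => 'held_out');
print "$_ => $p{$_}\n" for sort keys %p;
```

Here test_set keeps the caller's value, while training_set and category_file fall back to paths under the data root.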
lib/AI/Categorizer/Collection/InMemory.pm view on Meta::CPAN
package AI::Categorizer::Collection::InMemory;
use strict;
use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
  (
   data => { type => HASHREF },
  );

sub new {
  my $self = shift()->SUPER::new(@_);

  # Replace category names with Category objects; leave existing objects alone
  while (my ($name, $params) = each %{$self->{data}}) {
    foreach (@{$params->{categories}}) {
      next if ref $_;
      $_ = AI::Categorizer::Category->by_name(name => $_);
    }
  }

  return $self;
}

sub next {
  my $self = shift;
  # each() returns an empty list once the hash is exhausted
  my ($name, $params) = each %{$self->{data}} or return;
  return AI::Categorizer::Document->new(name => $name, %$params);
}

sub rewind {
  my $self = shift;
  # Calling keys() resets the hash's internal each() iterator
  scalar keys %{$self->{data}};
  return;
}

sub count_documents {
  my $self = shift;
  return scalar keys %{$self->{data}};
}
1;
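The rewind() method above relies on a documented side effect of Perl's keys(): calling it resets the internal iterator that each() uses. A small sketch with a plain hash (the sample data is made up) shows the effect in isolation:

```perl
use strict;
use warnings;

# Made-up data standing in for the collection's document hash.
my %data = (doc1 => 'a', doc2 => 'b', doc3 => 'c');

# Consume one entry, leaving the hash's internal iterator part-way through.
my ($first) = each %data;

# keys() resets that iterator as a side effect; here it also gives the count.
my $count = keys %data;

# A fresh each() loop now visits every entry again, including the first.
my @seen;
while (my ($name) = each %data) {
    push @seen, $name;
}

print "count=$count, visited=", scalar(@seen), "\n";
```

Without the keys() call, the loop would resume after the entry already consumed and visit only two of the three documents.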
lib/AI/Categorizer/Document.pm view on Meta::CPAN
### Constructors
my $NAME = 'a';
sub new {
  my $pkg = shift;
  my $self = $pkg->SUPER::new(name => $NAME++,  # Use a default name
                              @_);

  # Get efficient internal data structures
  $self->{categories} = AI::Categorizer::ObjectSet->new( @{$self->{categories}} );

  $self->_fix_stopwords;

  # A few different ways for the caller to initialize the content
  if (exists $self->{parse}) {
    $self->parse(content => delete $self->{parse});

  } elsif (exists $self->{parse_handle}) {
    $self->parse_handle(handle => delete $self->{parse_handle});
lib/AI/Categorizer/Experiment.pm view on Meta::CPAN
C<Statistics::Contingency> for a description of its interface. All of
its methods are available here, with the following additions:
=over 4
=item new( categories => \%categories )
=item new( categories => \@categories, verbose => 1, sig_figs => 2 )
Returns a new Experiment object. A required C<categories> parameter
specifies the names of all categories in the data set. The category
names may be specified either as the keys in a reference to a hash, or as
the entries in a reference to an array.
The C<new()> method accepts a C<verbose> parameter which
will cause some status/debugging information to be printed to
C<STDOUT> when C<verbose> is set to a true value.
A C<sig_figs> parameter indicates the number of significant figures that should
be used when showing the results in the C<results_table()> method. It
does not affect the other methods like C<micro_precision()>.
lib/AI/Categorizer/FeatureSelector.pm view on Meta::CPAN
This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).
=item add_document()
Given a Document object as an argument, this method will add it and
any categories it belongs to to the KnowledgeSet.
=item make_document()
This method will create a Document object with the given data and then
call C<add_document()> to add it to the KnowledgeSet. A C<categories>
parameter should specify an array reference containing a list of
categories I<by name>. These are the categories that the document
belongs to. Any other parameters will be passed to the Document
class's C<new()> method.
=item finish()
This method will be called prior to training the Learner. Its purpose
is to perform any operations (such as feature vector weighting) that
lib/AI/Categorizer/FeatureVector.pm view on Meta::CPAN
$f3 = $f1->intersection($f2);
$f3 = $f1->add($f2);
$h = $f1->as_hash;
$h = $f1->as_boolean_hash;
$f1->normalize;
=head1 DESCRIPTION
This class implements a "feature vector", which is a flat data
structure indicating the values associated with a set of features. At
its base level, a FeatureVector usually represents the set of words in
a document, with the value for each feature indicating the number of
times each word appears in the document. However, the values are
arbitrary so they can represent other quantities as well, and
FeatureVectors may also be combined to represent the features of
multiple documents.
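As a rough illustration of the idea, the sketch below models a feature vector as a plain hash from word to count, sums two such vectors, and scales the result. The helper names (count_words, add_vectors, normalize) are hypothetical and are not this class's interface, and scaling to unit Euclidean length is an assumption about what normalization means here.

```perl
use strict;
use warnings;

# Count word occurrences in a piece of text: feature => frequency.
sub count_words {
    my ($text) = @_;
    my %f;
    $f{lc $_}++ for $text =~ /(\w+)/g;
    return \%f;
}

# Sum two vectors feature by feature, e.g. to represent two documents at once.
sub add_vectors {
    my ($x, $y) = @_;
    my %sum = %$x;
    $sum{$_} += $y->{$_} for keys %$y;
    return \%sum;
}

# Scale a vector to unit Euclidean length (assumed meaning of "normalize").
sub normalize {
    my ($v) = @_;
    my $len = 0;
    $len += $_ ** 2 for values %$v;
    $len = sqrt $len;
    my %unit;
    $unit{$_} = $len ? $v->{$_} / $len : 0 for keys %$v;
    return \%unit;
}

# Made-up sample documents.
my $d1   = count_words('vampires fear mirrors and vampires fear garlic');
my $d2   = count_words('farmers fear drought');
my $all  = add_vectors($d1, $d2);
my $unit = normalize($all);

printf "%s: %d\n", $_, $all->{$_} for sort keys %$all;
```

Because the values are ordinary numbers, the same structure can hold weights other than raw counts.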
=head1 METHODS
lib/AI/Categorizer/Hypothesis.pm view on Meta::CPAN
=head1 METHODS
=over 4
=item new(%parameters)
Returns a new Hypothesis object. Generally a user of
C<AI::Categorizer> doesn't create Hypothesis objects directly; they
are returned by the Learner's C<categorize()> method. However, if you
wish to create a Hypothesis directly (maybe passing it some fake data
for testing purposes) you may do so using the C<new()> method.
The following parameters are accepted when creating a new Hypothesis:
=over 4
=item all_categories
A required parameter which gives the set of all categories that could
possibly be assigned to. The categories should be specified as a
lib/AI/Categorizer/KnowledgeSet.pm view on Meta::CPAN
sub load {
  my ($self, %args) = @_;
  my $c = $self->_make_collection(\%args);

  if ($self->{features_kept}) {
    # Read the whole thing in, then reduce
    $self->read( collection => $c );
    $self->select_features;

  } elsif ($self->{scan_first}) {
    # Figure out the feature set first, then read data in
    $self->scan_features( collection => $c );
    $c->rewind;
    $self->read( collection => $c );

  } else {
    # Don't do any feature reduction, just read the data
    $self->read( collection => $c );
  }
}

sub read {
  my ($self, %args) = @_;
  my $collection = $self->_make_collection(\%args);
  my $pb = $self->prog_bar($collection);

  while (my $doc = $collection->next) {
lib/AI/Categorizer/Learner/Weka.pm view on Meta::CPAN
$nb = AI::Categorizer::Learner->restore_state('filename');
my $c = new AI::Categorizer::Collection::Files( path => ... );
while (my $document = $c->next) {
  my $hypothesis = $nb->categorize($document);
  print "Best assigned category: ", $hypothesis->best_category, "\n";
}
=head1 DESCRIPTION
This class doesn't implement any machine learners of its own; it
merely passes the data through to the Weka machine learning system
(http://www.cs.waikato.ac.nz/~ml/weka/). This can give you access to
a collection of machine learning algorithms not otherwise implemented
in C<AI::Categorizer>.
Currently this is a simple command-line wrapper that calls C<java>
subprocesses. In the future this may be converted to an
C<Inline::Java> wrapper for better performance (faster running
times). However, if you're looking for really great performance,
you're probably looking in the wrong place - this Weka wrapper is
intended more as a way to try lots of different machine learning
lib/AI/Categorizer/Storable.pm view on Meta::CPAN
=head1 SYNOPSIS
$object->save_state($path);
... time passes ...
$object = Class->restore_state($path);
=head1 DESCRIPTION
This class implements methods for storing the state of an object to a
file and restoring from that file later. In C<AI::Categorizer> it is
generally used in order to let data persist across multiple
invocations of a program.
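The save-and-restore pattern can be sketched with Perl's core C<Storable> module. This is only an illustration of the general idea, not this class's implementation: the C<My::Model> package, its data, and the temporary file are all hypothetical.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# A hypothetical object standing in for a trained learner.
my $model = bless { seen => { vampires => 2, farmers => 1 } }, 'My::Model';

# Use a temporary file so the sketch leaves nothing behind.
my ($fh, $path) = tempfile(UNLINK => 1);
close $fh;

nstore($model, $path);            # serialize the object to disk
my $restored = retrieve($path);   # read it back; the blessing is preserved

print ref($restored), " knows ", scalar(keys %{ $restored->{seen} }), " words\n";
```

A later run of the program could call retrieve() on the same path instead of rebuilding the object from scratch.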
=head1 METHODS
=over 4
=item save_state($path)
This object method saves the object to disk for later use. The
C<$path> argument indicates the place on disk where the object should
t/01-naive_bayes.t view on Meta::CPAN
perform_standard_tests(learner_class => 'AI::Categorizer::Learner::NaiveBayes');
#use Carp; $SIG{__DIE__} = \&Carp::confess;
my %docs = training_docs();
{
  ok my $c = new AI::Categorizer(collection_weighting => 'f');

  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }

  $c->knowledge_set->finish;

  # Make sure collection_weighting is working
  ok $c->knowledge_set->document_frequency('vampires'), 2;
  for ('vampires', 'mirrors') {
    ok ($c->knowledge_set->document('doc4')->features->as_hash->{$_},
        log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
       );
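The check above compares each feature weight against log(N / df), where N is the number of documents and df is the number of documents containing the feature. The following stand-alone sketch computes the same quantity with plain hashes; the toy corpus is hypothetical and is not the data returned by training_docs().

```perl
use strict;
use warnings;

# Hypothetical toy corpus: document name => list of words.
my %docs = (
    doc1 => [qw(vampires mirrors)],
    doc2 => [qw(farmers sheep)],
    doc3 => [qw(vampires garlic)],
    doc4 => [qw(farmers drought)],
);

# Document frequency: the number of documents that contain each word.
my %df;
for my $words (values %docs) {
    my %uniq = map { $_ => 1 } @$words;   # count each word once per document
    $df{$_}++ for keys %uniq;
}

# Inverse document frequency: log(N / df), using the natural logarithm.
my $n   = keys %docs;
my %idf = map { $_ => log($n / $df{$_}) } keys %df;

printf "%-8s df=%d idf=%.3f\n", $_, $df{$_}, $idf{$_} for sort keys %df;
```

Words that occur in fewer documents get a larger weight, which is why a rare word contributes more than a common one.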
t/common.pl view on Meta::CPAN
     (
      name => 'Vampires/Farmers',
      stopwords => [qw(are be in of and)],
     ),
     verbose => $ENV{TEST_VERBOSE} ? 1 : 0,
     %params,
    );
  ok ref($c), 'AI::Categorizer', "Create an AI::Categorizer object";

  my %docs = training_docs();
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }

  my $l = $c->learner;
  ok $l;

  if ($params{learner_class}) {
    ok ref($l), $params{learner_class}, "Make sure the correct Learner class is instantiated";
  } else {
    ok 1, 1, "Dummy test";
  }