these algorithms consider. These are only statistical tests - at best they
are neat tricks or helpful assistants, and at worst they are totally
unreliable. If you plan to use this module for anything really important,
human supervision is essential, both of the categorization process and the
final results.
For the usage details, please see the documentation of each individual
module.
FRAMEWORK COMPONENTS
This section explains the major pieces of the "AI::Categorizer" object
framework. We give a conceptual overview, but don't get into any of the
details about interfaces or usage. See the documentation for the individual
classes for more details.
A diagram of the various classes in the framework can be seen in
"doc/classes-overview.png", and a more detailed view of the same thing can
be seen in "doc/classes.png".
Knowledge Sets
to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
module for information on its interface.
Feature selection
Deciding which features are the most important is a very large part of the
categorization task - you cannot simply consider all the words in all the
documents when training, and all the words in the document being
categorized. There are two main reasons for this: first, your training and
categorizing processes would take far too long and use far too much memory,
and second, the significant content of the documents would get lost in the
"noise" of the insignificant content.
The process of selecting the most important features in the training set is
called "feature selection". It is managed by the
"AI::Categorizer::KnowledgeSet" class, and you will find the details of
feature selection processes in that class's documentation.
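As a toy illustration of what feature selection produces, one might keep only
the highest-scoring terms. This is a hypothetical helper for illustration
only, not code from this distribution:

```perl
use strict;
use warnings;

# Illustrative sketch only - not AI::Categorizer's actual selection code.
# Given a hash mapping each feature (word) to an importance score,
# keep only the $n highest-scoring features.
sub top_n_features {
    my ($scores, $n) = @_;
    my @ranked = sort { $scores->{$b} <=> $scores->{$a} } keys %$scores;
    $#ranked = $n - 1 if @ranked > $n;   # truncate to the top $n
    return \@ranked;
}

my %scores = (the => 0.01, perl => 0.9, module => 0.7, a => 0.02);
my $kept = top_n_features(\%scores, 2);
print "@$kept\n";   # the two highest-scoring features: perl module
```

Real selectors differ mainly in how the scores are computed (document
frequency, chi-square, and so on), not in this final keep-the-top-N step.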
Collections
Because documents may be stored in lots of different formats, a "collection"
eg/categorizer
print $out_fh "~~~~~~~~~~~~~~~~", scalar(localtime), "~~~~~~~~~~~~~~~~~~~~~~~~~~~\n";
if ($HAVE_YAML) {
print {$out_fh} YAML::Dump($c->dump_parameters);
} else {
warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
}
}
}
run_section('scan_features', 1, $do_stage);
run_section('read_training_set', 2, $do_stage);
run_section('train', 3, $do_stage);
run_section('evaluate_test_set', 4, $do_stage);
if ($do_stage->{5}) {
my $result = $c->stats_table;
print $result if $c->verbose;
print $out_fh $result if $out_fh;
}
sub run_section {
my ($section, $stage, $do_stage) = @_;
return unless $do_stage->{$stage};
if (keys %$do_stage > 1) {
print " % $0 @ARGV -$stage\n" if $c->verbose;
die "$0 is not executable, please change its execution permissions"
unless -x $0;
system($0, @ARGV, "-$stage") == 0
or die "$0 returned nonzero status, \$?=$?";
return;
}
my $start = Benchmark->new;
$c->$section();
my $end = Benchmark->new;
my $summary = timestr(timediff($end, $start));
my ($rss, $vsz) = memory_usage();
print "$summary (memory: rss=$rss, vsz=$vsz)\n" if $c->verbose;
print $out_fh "Stage $stage: $summary (memory: rss=$rss, vsz=$vsz)\n" if $out_fh;
}
sub parse_command_line {
my (%opt, %do_stage);
lib/AI/Categorizer.pm
statistical tests - at best they are neat tricks or helpful
assistants, and at worst they are totally unreliable. If you plan to
use this module for anything really important, human supervision is
essential, both of the categorization process and the final results.
For the usage details, please see the documentation of each individual
module.
=head1 FRAMEWORK COMPONENTS
This section explains the major pieces of the C<AI::Categorizer>
object framework. We give a conceptual overview, but don't get into
any of the details about interfaces or usage. See the documentation
for the individual classes for more details.
A diagram of the various classes in the framework can be seen in
C<doc/classes-overview.png>, and a more detailed view of the same
thing can be seen in C<doc/classes.png>.
=head2 Knowledge Sets
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.
=head3 Feature selection
Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized. There are two main reasons for this: first, your
training and categorizing processes would take far too long and use
far too much memory, and second, the significant content of the
documents would get lost in the "noise" of the insignificant content.
The process of selecting the most important features in the training
set is called "feature selection". It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.
=head2 Collections
Because documents may be stored in lots of different formats, a
lib/AI/Categorizer/Document.pm
A string that identifies this document. Required.
=item content
The raw content of this document. May be specified as either a string
or as a hash reference, allowing structured document types.
=item content_weights
A hash reference indicating the weights that should be assigned to
features in different sections of a structured document when creating
its feature vector. The weight is a multiplier of the feature vector
values. For instance, if a C<subject> section has a weight of 3 and a
C<body> section has a weight of 1, and word counts are used as feature
vector values, then it will be as if all words appearing in the
C<subject> appeared 3 times.
If no weights are specified, all weights are set to 1.
=item front_bias
Allows smooth bias of the weights of words in a document according to
their position. The value should be a number between -1 and 1.
Positive numbers indicate that words toward the beginning of the
lib/AI/Categorizer/FeatureSelector.pm
$f->length * $kept :
$kept);
print "Trimming features - # features = " . $f->length . "\n" if $self->verbose;
# This is algorithmic overkill, but the sort seems fast enough. Will revisit later.
my $features = $f->as_hash;
my @new_features = (sort {$features->{$b} <=> $features->{$a}} keys %$features)
[0 .. $num_kept-1];
my $result = $f->intersection( \@new_features );
print "Finished trimming features - # features = " . $result->length . "\n" if $self->verbose;
return $result;
}
# Abstract methods
sub rank_features;
sub scan_features;
sub select_features {
my ($self, %args) = @_;
A synonym for 't'.
=item n
Normalized term frequency - 0.5 + 0.5 * t/max(t). This is the same as
the 't' specification, but with term frequency normalized to lie
between 0.5 and 1.
=back
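The C<n> normalization above can be sketched as follows (illustrative code,
not part of the module; the C<normalized_tf> helper is hypothetical):

```perl
use strict;
use warnings;
use List::Util qw(max);

# Sketch of the 'n' option: 0.5 + 0.5 * t/max(t), so every term's
# value lies between 0.5 and 1 regardless of document length.
sub normalized_tf {
    my ($counts) = @_;                    # { term => raw count }
    my $max = max(values %$counts) || 1;
    return { map { $_ => 0.5 + 0.5 * $counts->{$_} / $max } keys %$counts };
}

my $tf = normalized_tf({ cat => 4, dog => 2, fish => 1 });
printf "cat=%.3f dog=%.3f fish=%.3f\n", @{$tf}{qw(cat dog fish)};
# cat=1.000 dog=0.750 fish=0.625
```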
The second character specifies the "collection frequency" component, which
can take the following values:
=over 4
=item f
Inverse document frequency - multiply term C<t>'s value by C<log(N/n)>,
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.
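As a sketch (not the module's internal code; C<idf_weight> is a hypothetical
helper), the inverse-document-frequency multiplier behaves like this:

```perl
use strict;
use warnings;

# Sketch of the 'f' (inverse document frequency) component: log(N/n)
sub idf_weight {
    my ($tf, $N, $n) = @_;   # raw value, total docs, docs containing term
    return $tf * log($N / $n);
}

# A term appearing in 10 of 1000 documents gets a large boost...
printf "rare:   %.2f\n", idf_weight(1, 1000, 10);   # log(100) ~ 4.61
# ...while a term appearing in every document contributes nothing.
printf "common: %.2f\n", idf_weight(1, 1000, 1000); # log(1) = 0
```

This is why idf weighting suppresses stopword-like terms: the more documents
a term occurs in, the closer its multiplier gets to zero.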
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
}
foreach my $term (@terms) {
$progressBar->();
$r_features->{features}{$term} = $self->reduction_function($term,
$nbDocuments,$allFeaturesSum,$coll_features,
\%cat_features,\%cat_features_sum);
}
print STDERR "\n" if $self->verbose;
my $new_features = $self->reduce_features($r_features);
return $coll_features->intersection( $new_features );
}
# calculate feature set after reading collection (scan_first=0)
sub rank_features {
die "CategorySelector->rank_features is not implemented yet!";
# my ($self, %args) = @_;
#
# my $k = $args{knowledge_set}
lib/AI/Categorizer/FeatureVector.pm
sub length {
my $self = shift;
return scalar keys %{$self->{features}};
}
sub clone {
my $self = shift;
return ref($self)->new( features => { %{$self->{features}} } );
}
sub intersection {
my ($self, $other) = @_;
$other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
my $common;
if (UNIVERSAL::isa($other, 'ARRAY')) {
$common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} @$other};
} elsif (UNIVERSAL::isa($other, 'HASH')) {
$common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} keys %$other};
}
return ref($self)->new( features => $common );
(features => {doody => 1, whopper => 2});
@names = $f1->names;
$x = $f1->length;
$x = $f1->sum;
$x = $f1->includes('howdy');
$x = $f1->value('howdy');
$x = $f1->dot($f2);
$f3 = $f1->clone;
$f3 = $f1->intersection($f2);
$f3 = $f1->add($f2);
$h = $f1->as_hash;
$h = $f1->as_boolean_hash;
$f1->normalize;
=head1 DESCRIPTION
This class implements a "feature vector", which is a flat data
lib/AI/Categorizer/KnowledgeSet.pm
sub verbose {
my $self = shift;
$self->{verbose} = shift if @_;
return $self->{verbose};
}
sub trim_doc_features {
my ($self) = @_;
foreach my $doc ($self->documents) {
$doc->features( $doc->features->intersection($self->features) );
}
}
sub prog_bar {
my ($self, $collection) = @_;
return sub {} unless $self->verbose;
return sub { print STDERR '.' } unless eval "use Time::Progress; 1";
A synonym for 't'.
=item n
Normalized term frequency - 0.5 + 0.5 * t/max(t). This is the same as
the 't' specification, but with term frequency normalized to lie
between 0.5 and 1.
=back
The second character specifies the "collection frequency" component, which
can take the following values:
=over 4
=item f
Inverse document frequency - multiply term C<t>'s value by C<log(N/n)>,
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.
lib/AI/Categorizer/Learner/Boolean.pm
that do I<not> belong to the given category. The final argument is
the Category object for the given category.
=head2 get_boolean_score()
Used during categorization to assign a score for a single document
relative to a single category. The score should be between 0 and 1,
with a score greater than 0.5 indicating membership in the category.
In addition to C<$self>, this method will be passed two arguments.
The first argument is the document to be categorized. The second
argument is the value returned by C<create_boolean_model()> for this
category.
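The scoring contract can be illustrated with a standalone toy function. This
is only a sketch in the spirit of C<get_boolean_score()> - a real subclass
method receives a document object and whatever C<create_boolean_model()>
returned, not plain word lists:

```perl
use strict;
use warnings;

# Toy score: the fraction of the category model's words that appear in
# the document. The result always lies between 0 and 1, as required,
# with > 0.5 meaning "member of the category".
sub toy_boolean_score {
    my ($doc_words, $model_words) = @_;   # both array refs of words
    return 0 unless @$model_words;
    my %in_doc = map { $_ => 1 } @$doc_words;
    my $hits = grep { $in_doc{$_} } @$model_words;
    return $hits / @$model_words;
}

my $score = toy_boolean_score([qw(perl module code)], [qw(perl code java python)]);
print "$score\n";   # 2 of 4 model words present -> 0.5
```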
=head1 AUTHOR
Ken Williams, <ken@mathforum.org>
=head1 SEE ALSO
AI::Categorizer
lib/AI/Categorizer/Learner/KNN.pm
$f_class->all_features([$self->knowledge_set->features->names]);
}
$self->SUPER::categorize_collection(@_);
}
sub get_scores {
my ($self, $newdoc) = @_;
my $currentDocName = $newdoc->name;
#print "classifying $currentDocName\n";
my $features = $newdoc->features->intersection($self->knowledge_set->features)->normalize;
my $q = AI::Categorizer::Learner::KNN::Queue->new(size => $self->{k_value});
my @docset;
if ($self->{max_instances}) {
# Use (approximately) max_instances documents, chosen randomly from corpus
my $probability = $self->{max_instances} / $self->knowledge_set->documents;
@docset = grep {rand() < $probability} $self->knowledge_set->documents;
} else {
# Use the whole corpus
@docset = $self->knowledge_set->documents;
lib/AI/Categorizer/Util.pm
package AI::Categorizer::Util;
use strict;
use Exporter;
use base qw(Exporter);
our @EXPORT_OK = qw(intersection average max min random_elements binary_search);
# It's possible that this can be a class - something like
#
# $e = Evaluate->new(); $e->correct([...]); $e->assigned([...]); print $e->precision;
# A simple binary search
sub binary_search {
my ($arr, $target) = @_;
return $min;
}
sub average {
return undef unless @_;
my $total;
$total += $_ foreach @_;
return $total/@_;
}
sub intersection {
my ($one, $two) = @_;
$two = _hashify($two);
return UNIVERSAL::isa($one, 'HASH') ? # Accept hash or array for $one
grep {exists $two->{$_}} keys %$one :
grep {exists $two->{$_}} @$one;
}
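Usage of C<intersection()> looks like this; the sub body is repeated here
(slightly simplified, with C<_hashify> inlined) so the example runs
standalone:

```perl
use strict;
use warnings;

# intersection(): accepts a hash ref or array ref for either argument
# and returns the elements of the first that also occur in the second.
sub intersection {
    my ($one, $two) = @_;
    $two = { map { $_ => 1 } @$two } unless ref $two eq 'HASH';
    return ref $one eq 'HASH'
        ? grep { exists $two->{$_} } keys %$one
        : grep { exists $two->{$_} } @$one;
}

my @common = intersection([qw(a b c d)], [qw(b d e)]);
print "@common\n";   # b d
```

When the first argument is an array ref, the original order of its elements
is preserved; for a hash ref, the order of C<keys> applies.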
sub _hashify {
return $_[0] if UNIVERSAL::isa($_[0], 'HASH');