AI-Categorizer


Changes

   features to use, and the scan_first parameter was left as its
   default value, the features_kept mechanism would silently fail to
   do anything.  This has now been fixed. [Spotted by Arnaud Gaudinat]
 
 - Recent versions of Weka have changed the name of the SVM class, so
   I've updated it in our test (t/03-weka.t) of the Weka wrapper
   too. [Sebastien Aperghis-Tramoni]
 
0.07  Tue May  6 16:15:04 CDT 2003
 
 - Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
   MANIFEST, so they weren't included in the 0.06 distribution.
   [Spotted by Zoltan Barta]
 
0.06  Tue Apr 22 10:27:26 CDT 2003
 
 - Added a relatively simple example script at the request of several
   people, at eg/demo.pl
 
 - Forgot to note a dependency on Algorithm::NaiveBayes in version
   0.05.  Fixed.

  parameter.
 
- Added a k-Nearest-Neighbor machine learner. [First revision
  implemented by David Bell]
 
- Added a Rocchio machine learner. [Partially implemented by Xiaobo
  Li]
 
- Added a "Guesser" machine learner which simply uses overall class
  probabilities to make categorization decisions.  Sometimes useful
  for providing a set of baseline scores against which to evaluate
  other machine learners.
 
- The NaiveBayes learner is now a wrapper around my new
  Algorithm::NaiveBayes module, which is just the old NaiveBayes code
  from here, turned into its own standalone module.
 
- Much more extensive regression testing of the code.
 
- Added a Document subclass for XML documents. [Implemented by
  Jae-Moon Lee] Its interface is still unstable; it may change in

MANIFEST

t/04-decision_tree.t
t/05-svm.t
t/06-knn.t
t/07-guesser.t
t/09-rocchio.t
t/10-tools.t
t/11-feature_vector.t
t/12-hypothesis.t
t/13-document.t
t/14-collection.t
t/15-knowledge_set.t
t/common.pl
t/traindocs/doc1
t/traindocs/doc2
t/traindocs/doc3
t/traindocs/doc4
META.yml

README

NAME
    AI::Categorizer - Automatic Text Categorization
 
SYNOPSIS
     use AI::Categorizer;
     my $c = new AI::Categorizer(...parameters...);
  
     # Run a complete experiment - training on a corpus, testing on a test
     # set, printing a summary of results to STDOUT
     $c->run_experiment;
  
     # Or, run the parts of $c->run_experiment separately
     $c->scan_features;
     $c->read_training_set;
     $c->train;
     $c->evaluate_test_set;
     print $c->stats_table;
  
     # After training, use the Learner for categorization
     my $l = $c->learner;
     while (...) {
       my $d = ...create a document...
       my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
       print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
       print "Best category: ", $hypothesis->best_category, "\n";
     }
  
DESCRIPTION
    "AI::Categorizer" is a framework for automatic text categorization. It
    consists of a collection of Perl modules that implement common
    categorization tasks, and a set of defined relationships among those
    modules. The various details are flexible - for example, you can choose what
    categorization algorithm to use, what features (words or otherwise) of the
    documents should be used (or how to automatically choose these features),
    what format the documents are in, and so on.
 
    The basic process of using this module will typically involve obtaining a
    collection of pre-categorized documents, creating a "knowledge set"
    representation of those documents, training a categorizer on that knowledge
    set, and saving the trained categorizer for later use. There are several
    ways to carry out this process. The top-level "AI::Categorizer" module
    provides an umbrella class for high-level operations, or you may use the
    interfaces of the individual classes in the framework.
 
    A simple sample script that reads a training corpus, trains a categorizer,
    and tests the categorizer on a test corpus, is distributed as eg/demo.pl .
 
    Disclaimer: the results of any of the machine learning algorithms are far
    from infallible (close to fallible?). Categorization of documents is often a
    difficult task even for humans well-trained in the particular domain of

  framework. We give a conceptual overview, but don't get into any of the
  details about interfaces or usage. See the documentation for the individual
  classes for more details.
 
  A diagram of the various classes in the framework can be seen in
  "doc/classes-overview.png", and a more detailed view of the same thing can
  be seen in "doc/classes.png".
 
Knowledge Sets
 
  A "knowledge set" is defined as a collection of documents, together with
  some information on the categories each document belongs to. Note that this
  term is somewhat specific to this project - other sources may call it a
  "training corpus", or "prior knowledge". A knowledge set also contains some
  information on how documents will be parsed and how their features (words)
  will be extracted and turned into meaningful representations. In this sense,
  a knowledge set represents not only a collection of data, but a particular
  view on that data.
 
  A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
  class. Before you can start playing with categorizers, you will have to
  start playing with knowledge sets, so that the categorizers have some data
  to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
  module for information on its interface.
 
 Feature selection
 
  Deciding which features are the most important is a very large part of the
  categorization task - you cannot simply consider all the words in all the
  documents when training, and all the words in the document being
  categorized. There are two main reasons for this - first, it would mean that
  your training and categorizing processes would take forever and use tons of
  memory, and second, the significant stuff of the documents would get lost in
  the "noise" of the insignificant stuff.
 
  The process of selecting the most important features in the training set is
  called "feature selection". It is managed by the
  "AI::Categorizer::KnowledgeSet" class, and you will find the details of
  feature selection processes in that class's documentation.
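  As a concrete illustration, feature selection is typically controlled through
  KnowledgeSet construction parameters. The sketch below assumes the
  "features_kept" parameter mentioned in the Changes above; check the
  "AI::Categorizer::KnowledgeSet" documentation for the authoritative option
  names and value semantics:

```perl
use AI::Categorizer::KnowledgeSet;

# A minimal sketch, assuming 'features_kept' limits how many of the
# highest-scoring features survive selection (see the KnowledgeSet
# docs for whether a count or a fraction is expected).
my $k = AI::Categorizer::KnowledgeSet->new(
  verbose       => 1,
  features_kept => 500,
);
```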
 
Collections
 
  Because documents may be stored in lots of different formats, a "collection"
  class has been created as an abstraction of a stored set of documents,
  together with a way to iterate through the set and return Document objects.
  A knowledge set contains a single collection object. A "Categorizer" doing a
  complete test run generally contains two collections, one for training and
  one for testing. A "Learner" can mass-categorize a collection.
 
  The "AI::Categorizer::Collection" class and its subclasses instantiate the
  idea of a collection in this sense.
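  For example, a directory of document files can be wrapped in a
  "AI::Categorizer::Collection::Files" object (the constructor and its "path"
  parameter appear in eg/demo.pl below) and then iterated with the "next()"
  method described in the Collection documentation:

```perl
use AI::Categorizer::Collection::Files;

# A collection over one directory of stored documents; the path here
# is only illustrative.
my $collection = AI::Categorizer::Collection::Files->new(
  path => 'corpus/training',
);

while (my $doc = $collection->next) {
  # each $doc is an AI::Categorizer::Document object
  print "got a document\n";
}
```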
 
Documents
 
  Each document is represented by an "AI::Categorizer::Document" object, or an
  object of one of its subclasses. Each document class contains methods for

  AI::Categorizer::Learner::Weka
      An interface to version 2 of the Weka Knowledge Analysis system that
      lets you use any of the machine learners it defines. This gives you
      access to lots and lots of machine learning algorithms in use by machine
      learning researchers. The main drawback is that Weka tends to be quite
      slow and use a lot of memory, and the current interface between Weka and
      "AI::Categorizer" is a bit clumsy.
 
  Other machine learning methods that may be implemented soonish include
  Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts combiner
  for ensemble learning. No timetable for their creation has yet been set.
 
  Please see the documentation of these individual modules for more details on
  their guts and quirks. See the "AI::Categorizer::Learner" documentation for
  a description of the general categorizer interface.
 
  If you wish to create your own classifier, you should inherit from
  "AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
  abstract classes that manage some of the work for you.
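  A bare-bones subclass might look like the following sketch. The hook method
  names ("create_boolean_model" and "get_boolean_score") are assumptions based
  on how the bundled Boolean learners are structured; verify them against the
  "AI::Categorizer::Learner::Boolean" documentation before relying on them:

```perl
package My::Learner;
use strict;
use AI::Categorizer::Learner::Boolean;
our @ISA = ('AI::Categorizer::Learner::Boolean');

# Assumed hook: build a per-category model from positive and
# negative example documents.
sub create_boolean_model {
  my ($self, $positives, $negatives, $category) = @_;
  ...  # return whatever structure get_boolean_score() needs
}

# Assumed hook: score one document against a per-category model.
sub get_boolean_score {
  my ($self, $document, $model) = @_;
  ...  # return a score, e.g. in the range 0..1
}

1;
```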
 
Feature Vectors

        internally (the KnowledgeSet, Learner, Experiment, or Collection
        classes), or any class that *they* create. This is managed by the
        "Class::Container" module, so see its documentation for the details of
        how this works.
 
        The specific parameters accepted here are:
 
        progress_file
            A string that indicates a place where objects will be saved during
            several of the methods of this class. The default value is the
            string "save", which means files like "save-01-knowledge_set" will
            get created. The exact names of these files may change in future
            releases, since they're just used internally to resume where we last
            left off.
 
        verbose
            If true, a few status messages will be printed during execution.
 
        training_set
            Specifies the "path" parameter that will be fed to the
            KnowledgeSet's "scan_features()" and "read()" methods during our
            "scan_features()" and "read_training_set()" methods.
 
        test_set
            Specifies the "path" parameter that will be used when creating a
            Collection during the "evaluate_test_set()" method.
 
        data_root
            A shortcut for setting the "training_set", "test_set", and
            "category_file" parameters separately. Sets "training_set" to
            "$data_root/training", "test_set" to "$data_root/test", and
            "category_file" (used by some of the Collection classes) to
            "$data_root/cats.txt".
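        In other words, the two constructor calls below are roughly
        equivalent (the module joins the paths with File::Spec, so the
        literal strings here are Unix-flavored):

```perl
use AI::Categorizer;

# Shorthand form:
my $c = AI::Categorizer->new(data_root => 'corpus');

# ...expands to approximately:
my $c2 = AI::Categorizer->new(
  training_set  => 'corpus/training',
  test_set      => 'corpus/test',
  category_file => 'corpus/cats.txt',
);
```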
 
    learner()
        Returns the Learner object associated with this Categorizer. Before
        "train()", the Learner will of course not be trained yet.
 
    knowledge_set()
        Returns the KnowledgeSet object associated with this Categorizer. If
        "read_training_set()" has not yet been called, the KnowledgeSet will not
        yet be populated with any training data.
 
    run_experiment()
        Runs a complete experiment on the training and testing data, reporting
        the results on "STDOUT". Internally, this is just a shortcut for calling
        the "scan_features()", "read_training_set()", "train()", and
        "evaluate_test_set()" methods, then printing the value of the
        "stats_table()" method.
 
    scan_features()
        Scans the Collection specified in the "training_set" parameter to determine
        the set of features (words) that will be considered when training the
        Learner. Internally, this calls the "scan_features()" method of the
        KnowledgeSet, then saves a list of the KnowledgeSet's features for later
        use.
 
        This step is not strictly necessary, but it can dramatically reduce
        memory requirements if you scan for features before reading the entire
        corpus into memory.
 
    read_training_set()
        Populates the KnowledgeSet with the data specified in the "training_set"
        parameter. Internally, this calls the "read()" method of the
        KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
        object for later use.
 
    train()
        Calls the Learner's "train()" method, passing it the KnowledgeSet
        created during "read_training_set()". Returns the Learner object. Also
        saves the Learner object for later use.
 
    evaluate_test_set()
        Creates a Collection based on the value of the "test_set" parameter, and
        calls the Learner's "categorize_collection()" method using this
        Collection. Returns the resultant Experiment object. Also saves the
        Experiment object for later use in the "stats_table()" method.
 
    stats_table()
        Returns the value of the Experiment's (as created by
        "evaluate_test_set()") "stats_table()" method. This is a string that
        shows various statistics about the accuracy/precision/recall/F1/etc. of
        the assignments made during testing.
 
HISTORY
    This module is a revised and redesigned version of the previous
    "AI::Categorize" module by the same author. Note the added 'r' in the new
    name. The older module has a different interface, and no attempt at backward
    compatibility has been made - that's why I changed the name.
 
    You can have both "AI::Categorize" and "AI::Categorizer" installed at the

eg/categorizer

    if ($HAVE_YAML) {
      print {$out_fh} YAML::Dump($c->dump_parameters);
    } else {
      warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
    }
  }
}
   
 
run_section('scan_features',     1, $do_stage);
run_section('read_training_set', 2, $do_stage);
run_section('train',             3, $do_stage);
run_section('evaluate_test_set', 4, $do_stage);
if ($do_stage->{5}) {
  my $result = $c->stats_table;
  print $result if $c->verbose;
  print $out_fh $result if $out_fh;
}
 
sub run_section {
  my ($section, $stage, $do_stage) = @_;
  return unless $do_stage->{$stage};
  if (keys %$do_stage > 1) {

eg/demo.pl

# This script is a fairly simple demonstration of how AI::Categorizer
# can be used.  There are lots of other less-simple demonstrations
# (actually, they're doing much simpler things, but are probably
# harder to follow) in the tests in the t/ subdirectory.  The
# eg/categorizer script can also be a good example if you're willing
# to figure out a bit how it works.
#
# This script reads a training corpus from a directory of plain-text
# documents, trains a Naive Bayes categorizer on it, then tests the
# categorizer on a set of test documents.
 
use strict;
use File::Spec;
 
die("Usage: $0 <corpus>\n".
    "  A sample corpus (data set) can be downloaded from\n")
  unless @ARGV == 1;
 
my $corpus = shift;
 
my $training  = File::Spec->catfile( $corpus, 'training' );
my $test      = File::Spec->catfile( $corpus, 'test' );
my $cats      = File::Spec->catfile( $corpus, 'cats.txt' );
my $stopwords = File::Spec->catfile( $corpus, 'stopwords' );

# type (any Collection subclass).  Or you could create each Document
# object manually.  Or you could let the KnowledgeSet create the
# Collection objects for you.
 
$training = AI::Categorizer::Collection::Files->new( path => $training, %params );
$test     = AI::Categorizer::Collection::Files->new( path => $test, %params );
 
# We turn on verbose mode so you can watch the progress of loading &
# training.  This looks nicer if you have Time::Progress installed!
 
print "Loading training set\n";
my $k = AI::Categorizer::KnowledgeSet->new( verbose => 1 );
$k->load( collection => $training );
 
print "Training categorizer\n";
my $l = AI::Categorizer::Learner::NaiveBayes->new( verbose => 1 );
$l->train( knowledge_set => $k );
 
print "Categorizing test set\n";
my $experiment = $l->categorize_collection( collection => $test );
 
print $experiment->stats_table;
 
 
# If you want to get at the specific assigned categories for a
# specific document, you can do it like this:
 
my $doc = AI::Categorizer::Document->new
  ( content => "Hello, I am a pretty generic document with not much to say." );
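# The excerpt ends here; following the interface shown in the
# SYNOPSIS, the natural next step is to hand this document to the
# trained learner:

```perl
# Continuing from the snippet above: $l is the trained
# AI::Categorizer::Learner::NaiveBayes object and $doc the ad-hoc
# document just created.
my $hypothesis = $l->categorize($doc);
print "Best category: ", $hypothesis->best_category, "\n";
print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
```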

eg/easy_guesser.pl

#!/usr/bin/perl
 
# This script can be helpful for getting a set of baseline scores for
# a categorization task.  It simulates using the "Guesser" learner,
# but is much faster.  Because it doesn't leverage using the whole
# framework, though, it expects everything to be in a very strict
# format.  <cats-file> is in the same format as the 'category_file'
# parameter to the Collection class.  <training-dir> and <test-dir>
# give paths to directories of documents, named as in <cats-file>.
 
use strict;

my %cats;
print "Reading category file\n";
open my $fh, '<', $cats or die "Can't read $cats: $!";
while (<$fh>) {
    my ($doc, @cats) = split;
    $cats{$doc} = \@cats;
}
 
my (%freq, $docs);
print "Scanning training set\n";
opendir my($dh), $training or die "Can't opendir $training: $!";
while (defined(my $file = readdir $dh)) {
    next if $file eq '.' or $file eq '..';
    unless ($cats{$file}) {
        warn "No category information for '$file'";
        next;
    }
    $docs++;
    $freq{$_}++ foreach @{$cats{$file}};
}
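The remainder of the script is not shown in this excerpt, but a plausible
continuation (hypothetical, matching the Guesser behaviour the header comment
describes) turns the counts just gathered into a constant baseline guess:

```perl
use strict;
use warnings;

# Hypothetical continuation of easy_guesser.pl: with %freq holding
# per-category document counts and $docs the total, always "guess"
# the most frequent category, as the Guesser learner would.
my %freq = (sports => 3, finance => 1);   # stand-in for the scanned counts
my $docs = 4;

my ($guess) = sort { $freq{$b} <=> $freq{$a} } keys %freq;
my $prob = $freq{$guess} / $docs;
printf "Baseline: always guess '%s' (P = %.2f)\n", $guess, $prob;
```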

lib/AI/Categorizer.pm

 
 
__PACKAGE__->valid_params
  (
   progress_file => { type => SCALAR, default => 'save' },
   knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
   learner       => { isa => 'AI::Categorizer::Learner' },
   verbose       => { type => BOOLEAN, default => 0 },
   training_set  => { type => SCALAR, optional => 1 },
   test_set      => { type => SCALAR, optional => 1 },
   data_root     => { type => SCALAR, optional => 1 },
  );
 
__PACKAGE__->contained_objects
  (
   knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
   learner       => { class => 'AI::Categorizer::Learner::NaiveBayes' },
   experiment    => { class => 'AI::Categorizer::Experiment',
                      delayed => 1 },
   collection    => { class => 'AI::Categorizer::Collection::Files',
                      delayed => 1 },
  );
 
sub new {
  my $package = shift;
  my %args = @_;
  my %defaults;
  if (exists $args{data_root}) {
    $defaults{training_set} = File::Spec->catfile($args{data_root}, 'training');
    $defaults{test_set} = File::Spec->catfile($args{data_root}, 'test');
    $defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
    delete $args{data_root};
  }
 
  return $package->SUPER::new(%defaults, %args);
}
 
#sub dump_parameters {
#  my $p = shift()->SUPER::dump_parameters;
#  delete $p->{stopwords} if $p->{stopword_file};
#  return $p;
#}
 
sub knowledge_set { shift->{knowledge_set} }
sub learner       { shift->{learner} }
 
# Combines several methods in one sub
sub run_experiment {
  my $self = shift;
  $self->scan_features;
  $self->read_training_set;
  $self->train;
  $self->evaluate_test_set;
  print $self->stats_table;
}
 
sub scan_features {
  my $self = shift;
  return unless $self->knowledge_set->scan_first;
  $self->knowledge_set->scan_features( path => $self->{training_set} );
  $self->knowledge_set->save_features( "$self->{progress_file}-01-features" );
}
 
sub read_training_set {
  my $self = shift;
  $self->knowledge_set->restore_features( "$self->{progress_file}-01-features" )
    if -e "$self->{progress_file}-01-features";
  $self->knowledge_set->read( path => $self->{training_set} );
  $self->_save_progress( '02', 'knowledge_set' );
  return $self->knowledge_set;
}
 
sub train {
  my $self = shift;
  $self->_load_progress( '02', 'knowledge_set' );
  $self->learner->train( knowledge_set => $self->{knowledge_set} );
  $self->_save_progress( '03', 'learner' );
  return $self->learner;
}
 
sub evaluate_test_set {
  my $self = shift;
  $self->_load_progress( '03', 'learner' );
  my $c = $self->create_delayed_object('collection', path => $self->{test_set} );
  $self->{experiment} = $self->learner->categorize_collection( collection => $c );
  $self->_save_progress( '04', 'experiment' );
  return $self->{experiment};
}
 
sub stats_table {
  my $self = shift;
  $self->_load_progress( '04', 'experiment' );
  return $self->{experiment}->stats_table;
}

=head1 NAME
 
AI::Categorizer - Automatic Text Categorization
 
=head1 SYNOPSIS
 
 use AI::Categorizer;
 my $c = new AI::Categorizer(...parameters...);
  
 # Run a complete experiment - training on a corpus, testing on a test
 # set, printing a summary of results to STDOUT
 $c->run_experiment;
  
 # Or, run the parts of $c->run_experiment separately
 $c->scan_features;
 $c->read_training_set;
 $c->train;
 $c->evaluate_test_set;
 print $c->stats_table;
  
 # After training, use the Learner for categorization
 my $l = $c->learner;
 while (...) {
   my $d = ...create a document...
   my $hypothesis = $l->categorize($d);  # An AI::Categorizer::Hypothesis object
   print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
   print "Best category: ", $hypothesis->best_category, "\n";
 }
  
=head1 DESCRIPTION
 
C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules.  The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically
choose these features), what format the documents are in, and so on.
 
The basic process of using this module will typically involve
obtaining a collection of B<pre-categorized> documents, creating a
"knowledge set" representation of those documents, training a
categorizer on that knowledge set, and saving the trained categorizer
for later use.  There are several ways to carry out this process.  The
top-level C<AI::Categorizer> module provides an umbrella class for
high-level operations, or you may use the interfaces of the individual
classes in the framework.
 
A simple sample script that reads a training corpus, trains a
categorizer, and tests the categorizer on a test corpus, is
distributed as eg/demo.pl .
 
Disclaimer: the results of any of the machine learning algorithms are

object framework.  We give a conceptual overview, but don't get into
any of the details about interfaces or usage.  See the documentation
for the individual classes for more details.
 
A diagram of the various classes in the framework can be seen in
C<doc/classes-overview.png>, and a more detailed view of the same
thing can be seen in C<doc/classes.png>.
 
=head2 Knowledge Sets
 
A "knowledge set" is defined as a collection of documents, together
with some information on the categories each document belongs to.
Note that this term is somewhat specific to this project - other sources
may call it a "training corpus", or "prior knowledge".  A knowledge
set also contains some information on how documents will be parsed and
how their features (words) will be extracted and turned into
meaningful representations.  In this sense, a knowledge set represents
not only a collection of data, but a particular view on that data.
 
A knowledge set is encapsulated by the
C<AI::Categorizer::KnowledgeSet> class.  Before you can start playing
with categorizers, you will have to start playing with knowledge sets,
so that the categorizers have some data to train on.  See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.
 
=head3 Feature selection
 
Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized.  There are two main reasons for this - first, it
would mean that your training and categorizing processes would take
forever and use tons of memory, and second, the significant stuff of
the documents would get lost in the "noise" of the insignificant stuff.
 
The process of selecting the most important features in the training
set is called "feature selection".  It is managed by the
C<AI::Categorizer::KnowledgeSet> class, and you will find the details
of feature selection processes in that class's documentation.
 
=head2 Collections
 
Because documents may be stored in lots of different formats, a
"collection" class has been created as an abstraction of a stored set
of documents, together with a way to iterate through the set and
return Document objects.  A knowledge set contains a single collection
object.  A C<Categorizer> doing a complete test run generally contains
two collections, one for training and one for testing.  A C<Learner>
can mass-categorize a collection.
 
The C<AI::Categorizer::Collection> class and its subclasses
instantiate the idea of a collection in this sense.
 
=head2 Documents
 
Each document is represented by an C<AI::Categorizer::Document>

access to lots and lots of machine learning algorithms in use by
machine learning researchers.  The main drawback is that Weka tends to
be quite slow and use a lot of memory, and the current interface
between Weka and C<AI::Categorizer> is a bit clumsy.
 
=back
 
Other machine learning methods that may be implemented soonish include
Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts
combiner for ensemble learning.  No timetable for their creation has
yet been set.
 
Please see the documentation of these individual modules for more
details on their guts and quirks.  See the C<AI::Categorizer::Learner>
documentation for a description of the general categorizer interface.
 
If you wish to create your own classifier, you should inherit from
C<AI::Categorizer::Learner> or C<AI::Categorizer::Learner::Boolean>,
which are abstract classes that manage some of the work for you.
 
=head2 Feature Vectors

works.
 
The specific parameters accepted here are:
 
=over 4
 
=item progress_file
 
A string that indicates a place where objects will be saved during
several of the methods of this class.  The default value is the string
C<save>, which means files like C<save-01-knowledge_set> will get
created.  The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.
 
=item verbose
 
If true, a few status messages will be printed during execution.
 
=item training_set
 
Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.
 
=item test_set
 
Specifies the C<path> parameter that will be used when creating a
Collection during the C<evaluate_test_set()> method.
 
=item data_root
 
A shortcut for setting the C<training_set>, C<test_set>, and
C<category_file> parameters separately.  Sets C<training_set> to
C<$data_root/training>, C<test_set> to C<$data_root/test>, and
C<category_file> (used by some of the Collection classes) to
C<$data_root/cats.txt>.
 
=back
 
=item learner()
 
Returns the Learner object associated with this Categorizer.  Before
C<train()>, the Learner will of course not be trained yet.
 
=item knowledge_set()
 
Returns the KnowledgeSet object associated with this Categorizer.  If
C<read_training_set()> has not yet been called, the KnowledgeSet will
not yet be populated with any training data.
 
=item run_experiment()
 
Runs a complete experiment on the training and testing data, reporting
the results on C<STDOUT>.  Internally, this is just a shortcut for
calling the C<scan_features()>, C<read_training_set()>, C<train()>,
and C<evaluate_test_set()> methods, then printing the value of the
C<stats_table()> method.
 
=item scan_features()
 
Scans the Collection specified in the C<training_set> parameter to
determine the set of features (words) that will be considered when
training the Learner.  Internally, this calls the C<scan_features()>
method of the KnowledgeSet, then saves a list of the KnowledgeSet's
features for later use.
 
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
 
=item read_training_set()
 
Populates the KnowledgeSet with the data specified in the C<training_set>
parameter.  Internally, this calls the C<read()> method of the
KnowledgeSet.  Returns the KnowledgeSet.  Also saves the KnowledgeSet
object for later use.
 
=item train()
 
Calls the Learner's C<train()> method, passing it the KnowledgeSet
created during C<read_training_set()>.  Returns the Learner object.
Also saves the Learner object for later use.
 
=item evaluate_test_set()
 
Creates a Collection based on the value of the C<test_set> parameter,
and calls the Learner's C<categorize_collection()> method using this
Collection.  Returns the resultant Experiment object.  Also saves the
Experiment object for later use in the C<stats_table()> method.
 
=item stats_table()
 
Returns the value of the Experiment's (as created by
C<evaluate_test_set()>) C<stats_table()> method.  This is a string
that shows various statistics about the
accuracy/precision/recall/F1/etc. of the assignments made during
testing.
 
=back
 
=head1 HISTORY
 
This module is a revised and redesigned version of the previous
C<AI::Categorize> module by the same author.  Note the added 'r' in

lib/AI/Categorizer/Collection.pm

The default is C<AI::Categorizer::Document::Text>.
 
=back
 
=item next()
 
Returns the next Document object in the Collection.
 
=item rewind()
 
Resets the iterator for further calls to C<next()>.
 
=item count_documents()
 
Returns the total number of documents in the Collection.  Note that
this usually resets the iterator.  This is because it may not be
possible to resume iterating where we left off.
 
=back
 
=head1 AUTHOR
 
Ken Williams, ken@mathforum.org
 
=head1 COPYRIGHT

lib/AI/Categorizer/Collection/Files.pm

=item path
 
Indicates a location on disk where the documents can be found.  The
path may be specified as a string giving the name of a directory, or
as a reference to an array of such strings if the documents are
located in more than one directory.
 
=item recurse
 
Indicates whether subdirectories of the directory (or directories) in
the C<path> parameter should be descended into.  If set to a true
value, they will be descended into.  If false, they will be ignored.
The default is false.
 
=back
 
=back
 
=head1 AUTHOR
 
Ken Williams, ken@mathforum.org

lib/AI/Categorizer/Document.pm

=item content_weights
 
A hash reference indicating the weights that should be assigned to
features in different sections of a structured document when creating
its feature vector.  The weight is a multiplier of the feature vector
values.  For instance, if a C<subject> section has a weight of 3 and a
C<body> section has a weight of 1, and word counts are used as feature
vector values, then it will be as if all words appearing in the
C<subject> appeared 3 times.
 
If no weights are specified, all weights are set to 1.
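The effect of section weighting can be sketched in a few lines of
plain Perl (the word counts here are hypothetical; the real work
happens inside the Document class):

```perl
use strict;
use warnings;

my %weights = (subject => 3, body => 1);
my %counts  = (
  subject => { perl => 1 },
  body    => { perl => 2, module => 1 },
);

# Combine the sections into one feature vector, multiplying each
# section's word counts by that section's weight
my %features;
while (my ($section, $words) = each %counts) {
  my $w = $weights{$section} // 1;   # unspecified sections get weight 1
  $features{$_} += $w * $words->{$_} for keys %$words;
}
# 'perl' ends up with 1*3 + 2*1 = 5, as if it appeared 5 times
```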
 
=item front_bias
 
Allows smooth bias of the weights of words in a document according to
their position.  The value should be a number between -1 and 1.
Positive numbers indicate that words toward the beginning of the
document should have higher weight than words toward the end of the
document.  Negative numbers indicate the opposite.  A bias of 0
indicates that no biasing should be done.
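The description above leaves the exact curve unspecified; one
plausible linear scheme (purely illustrative, not necessarily the
module's actual formula) would be:

```perl
use strict;
use warnings;

# $pos is the 0-based word position, $n the document length,
# $bias a number in [-1, 1] as described above
sub position_weight {
  my ($pos, $n, $bias) = @_;
  my $rel = $n > 1 ? $pos / ($n - 1) : 0;   # 0.0 at start, 1.0 at end
  return 1 + $bias * (1 - 2 * $rel);        # bias=0 gives uniform weight 1
}
# With $bias = 0.5, the first word gets weight 1.5, the last word 0.5
```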

lib/AI/Categorizer/Document/XML.pm

#       it is called whenever the parser ends the element
sub end_element{
  my ($self, $el)= @_;
 
  $self->{levelPointer}--;
  my $location= $self->{locationArray}[$self->{levelPointer}];
 
  # find the name of element
  my $elementName= $el->{Name};
 
  # set the default weight
  my $weight= 1;
 
  # use the user-supplied weight for this element, if one was given
  $weight= $self->{weightHash}{$elementName} if exists $self->{weightHash}{$elementName};
 
  # 0 - remove all the data to be related to this element
  if($weight == 0){
    $self->{content} = substr($self->{content}, 0, $location);
    return;
  }

lib/AI/Categorizer/Experiment.pm

C<Statistics::Contingency> for a description of its interface.  All of
its methods are available here, with the following additions:
 
=over 4
 
=item new( categories => \%categories )
 
=item new( categories => \@categories, verbose => 1, sig_figs => 2 )
 
Returns a new Experiment object.  A required C<categories> parameter
specifies the names of all categories in the data set.  The category
names may be specified either as the keys in a reference to a hash, or
as the entries in a reference to an array.
 
The C<new()> method also accepts a C<verbose> parameter; when it is
set to a true value, some status/debugging information will be printed
to C<STDOUT>.
 
A C<sig_figs> parameter indicates the number of significant figures that should
be used when showing the results in the C<results_table()> method.  It
does not affect the other methods like C<micro_precision()>.
 
=item add_result($assigned, $correct, $name)
 
Adds a new result to the experiment.  Please see the
C<Statistics::Contingency> documentation for a description of this
method.

lib/AI/Categorizer/FeatureSelector.pm

  return $result;
}
 
# Abstract methods
sub rank_features;
sub scan_features;
 
sub select_features {
  my ($self, %args) = @_;
   
  die "No knowledge_set parameter provided to select_features()"
    unless $args{knowledge_set};
 
  my $f = $self->rank_features( knowledge_set => $args{knowledge_set} );
  return $self->reduce_features( $f, features_kept => $args{features_kept} );
}
 
 
1;
 
__END__
 
=head1 NAME
 
AI::Categorizer::FeatureSelector - Abstract Feature Selection class
 
=head1 SYNOPSIS
 
 ...
 
=head1 DESCRIPTION
 
The KnowledgeSet class provides an interface to a set of
documents, a set of categories, and a mapping between the two.  Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.
 
=head1 METHODS
 
=over 4
 
=item new()
 
Creates a new KnowledgeSet and returns it.  Accepts the following

lib/AI/Categorizer/FeatureSelector.pm

If a C<load> parameter is present, the C<load()> method will be
invoked immediately.  If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>.  If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.
 
=item categories
 
An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet.  If used, the
C<documents> parameter should also be specified.
 
=item documents
 
An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet.  If used, the
C<categories> parameter should also be specified.
 
=item features_kept
 
A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents.  May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used.  The default is
0.2.
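The three forms of C<features_kept> can be summarized with a small
helper (an illustrative sketch only, not the module's internal code):

```perl
use strict;
use warnings;

# Turn a features_kept setting into an absolute number of features,
# given the total number of distinct features seen in the corpus
sub num_features_to_keep {
  my ($features_kept, $total) = @_;
  return $total                       if $features_kept == 0;  # keep everything
  return int($features_kept * $total) if $features_kept < 1;   # fraction
  return $features_kept;                                       # absolute count
}
# num_features_to_keep(2000, 50_000) keeps 2000 features;
# num_features_to_keep(0.2, 50_000) keeps 10_000; 0 keeps all 50_000
```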
 
=item feature_selection
 
A string indicating the type of feature selection that should be
performed.  Currently the only option is also the default option:
C<document_frequency>.
 
=item tfidf_weighting

lib/AI/Categorizer/FeatureSelector.pm

No change - multiply by 1.
 
=back
 
The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.
 
=item verbose
 
If set to a true value, some status/debugging information will be
output on C<STDOUT>.
 
=back
 
 
=item categories()
 
In a list context returns a list of all Category objects in this
KnowledgeSet.  In a scalar context returns the number of such objects.

lib/AI/Categorizer/FeatureSelector.pm

Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.
 
=item features()
 
Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.
 
=item verbose()
 
Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.
 
=item scan_stats()
 
Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)
 
=item scan_features()
 
This method scans through a Collection object and determines the

lib/AI/Categorizer/FeatureSelector.pm

This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.
 
=item document_frequency()
 
Given a single feature (word) as an argument, this method will return
the number of documents in the KnowledgeSet that contain that feature.
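Conceptually, the document frequency of a term is just a count over
the corpus; a toy sketch with made-up documents:

```perl
use strict;
use warnings;

my @docs = (
  { name => 'd1', features => { goal => 2, ball => 1 } },
  { name => 'd2', features => { goal => 1, vote => 3 } },
  { name => 'd3', features => { vote => 1 } },
);

# Number of documents whose feature set contains the term at all
# (the per-document count doesn't matter, only presence)
sub document_frequency {
  my ($term, @documents) = @_;
  return scalar grep { exists $_->{features}{$term} } @documents;
}
# document_frequency('goal', @docs) is 2; 'ball' is 1
```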
 
=item partition()
 
Divides the KnowledgeSet into several subsets.  This may be useful for
performing cross-validation.  The relative sizes of the subsets should
be passed as arguments.  For example, to split the KnowledgeSet into
four KnowledgeSets of equal size, pass the arguments .25, .25, .25
(the final size is 1 minus the sum of the other sizes).  The
partitions will be returned as a list.
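The size arithmetic can be sketched with a toy document list (an
illustration of the interface, not the module's implementation):

```perl
use strict;
use warnings;

my @docs  = (1 .. 100);          # stand-ins for 100 Document objects
my @sizes = (0.25, 0.25, 0.25);  # the final partition gets the remainder

my $total = @docs;
my @partitions;
while (@sizes) {
  my $n = int($total * shift @sizes);
  push @partitions, [ splice @docs, 0, $n ];
}
push @partitions, [ @docs ];     # whatever is left: 1 minus the sum of sizes
# Four partitions of 25 documents each
```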
 
=back
 
=head1 AUTHOR
 
Ken Williams, ken@mathforum.org

lib/AI/Categorizer/FeatureSelector/CategorySelector.pm

  (
   features => { class => 'AI::Categorizer::FeatureVector',
                 delayed => 1 },
  );
 
1;
 
 
sub reduction_function;
 
# figure out the feature set before reading collection (default)
 
sub scan_features {
  my ($self, %args) = @_;
  my $c = $args{collection} or
    die "No 'collection' parameter provided to scan_features()";
 
  return unless $self->{features_kept};
 
  my %cat_features;
  my $coll_features = $self->create_delayed_object('features');

lib/AI/Categorizer/FeatureSelector/CategorySelector.pm

    $r_features->{features}{$term} = $self->reduction_function($term,
      $nbDocuments,$allFeaturesSum,$coll_features,
      \%cat_features,\%cat_features_sum);
  }
  print STDERR "\n" if $self->verbose;
  my $new_features = $self->reduce_features($r_features);
  return $coll_features->intersection( $new_features );
}
 
 
# calculate feature set after reading collection (scan_first=0)
 
sub rank_features {
  die "CategorySelector->rank_features is not implemented yet!";
#  my ($self, %args) = @_;
#  my $k = $args{knowledge_set}
#    or die "No knowledge_set parameter provided to rank_features()";
#
#  my %freq_counts;
#  foreach my $name ($k->features->names) {
#    $freq_counts{$name} = $k->document_frequency($name);
#  }
#  return $self->create_delayed_object('features', features => \%freq_counts);
}
 
 
# copied from KnowledgeSet->prog_bar by Ken Williams

lib/AI/Categorizer/FeatureSelector/CategorySelector.pm

AI::Categorizer::CategorySelector - Abstract Category Selection class
 
=head1 SYNOPSIS
 
This class is abstract.  For an example of instantiation, see
ChiSquare.
 
=head1 DESCRIPTION
 
A base class for FeatureSelectors that calculate their global feature
set from per-category feature statistics.
 
=head1 METHODS
 
=head1 AUTHOR
 
Francois Paradis, paradifr@iro.umontreal.ca
with inspiration from Ken Williams' AI::Categorizer code
 
=cut

lib/AI/Categorizer/FeatureSelector/ChiSquare.pm

=head1 NAME
 
AI::Categorizer::FeatureSelector::ChiSquare - ChiSquare Feature Selection class
 
=head1 SYNOPSIS
 
 # the recommended way to use this class is to let the KnowledgeSet
 # instantiate it
 
 use AI::Categorizer::KnowledgeSetSMART;
 my $ksetCHI = new AI::Categorizer::KnowledgeSetSMART(
   tfidf_notation =>'Categorizer',
   feature_selection=>'chi_square', ...other parameters...);
 
 # however it is also possible to pass an instance to the KnowledgeSet
 
 use AI::Categorizer::KnowledgeSet;
 use AI::Categorizer::FeatureSelector::ChiSquare;
 my $ksetCHI = new AI::Categorizer::KnowledgeSet(
   feature_selector => new ChiSquare(features_kept=>2000,verbose=>1),
   ...other parameters...
   );
 
=head1 DESCRIPTION
 
Feature selection with the ChiSquare function.
 
  Chi-Square(t,ci) = (N.(AD-CB)^2)
                    -----------------------

lib/AI/Categorizer/FeatureSelector/DocFrequency.pm

__PACKAGE__->contained_objects
  (
   features => { class => 'AI::Categorizer::FeatureVector',
                 delayed => 1 },
  );
 
# The KnowledgeSet keeps track of document frequency, so just use that.
sub rank_features {
  my ($self, %args) = @_;
   
  my $k = $args{knowledge_set} or die "No knowledge_set parameter provided to rank_features()";
   
  my %freq_counts;
  foreach my $name ($k->features->names) {
    $freq_counts{$name} = $k->document_frequency($name);
  }
  return $self->create_delayed_object('features', features => \%freq_counts);
}
 
sub scan_features {
  my ($self, %args) = @_;

lib/AI/Categorizer/FeatureSelector/DocFrequency.pm

=head1 NAME
 
AI::Categorizer::FeatureSelector - Abstract Feature Selection class
 
=head1 SYNOPSIS
 
 ...
 
=head1 DESCRIPTION
 
The KnowledgeSet class provides an interface to a set of
documents, a set of categories, and a mapping between the two.  Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.
 
=head1 METHODS
 
=over 4
 
=item new()
 
Creates a new KnowledgeSet and returns it.  Accepts the following

lib/AI/Categorizer/FeatureVector.pm

  my ($package, %args) = @_;
  $args{features} ||= {};
  return bless {features => $args{features}}, $package;
}
 
sub names {
  my $self = shift;
  return keys %{$self->{features}};
}
 
sub set {
  my $self = shift;
  $self->{features} = (ref $_[0] ? $_[0] : {@_});
}
 
sub as_hash {
  my $self = shift;
  return $self->{features};
}
 
sub euclidean_length {

lib/AI/Categorizer/FeatureVector.pm

  $f3 = $f1->add($f2);
   
  $h = $f1->as_hash;
  $h = $f1->as_boolean_hash;
   
  $f1->normalize;
 
=head1 DESCRIPTION
 
This class implements a "feature vector", which is a flat data
structure indicating the values associated with a set of features.  At
its base level, a FeatureVector usually represents the set of words in
a document, with the value for each feature indicating the number of
times each word appears in the document.  However, the values are
arbitrary so they can represent other quantities as well, and
FeatureVectors may also be combined to represent the features of
multiple documents.
 
=head1 METHODS
 
=over 4

lib/AI/Categorizer/Hypothesis.pm

  my $self = shift;
  return @{$self->{scores}}{@_};
}
 
1;
 
__END__
 
=head1 NAME
 
AI::Categorizer::Hypothesis - Embodies a set of category assignments
 
=head1 SYNOPSIS
 
 use AI::Categorizer::Hypothesis;
  
 # Hypotheses are usually created by the Learner's categorize() method.
 # (assume here that $learner and $document have been created elsewhere)
 my $h = $learner->categorize($document);
  
 print "Assigned categories: ", join ', ', $h->categories, "\n";
 print "Best category: ", $h->best_category, "\n";
 print "Assigned scores: ", join ', ', $h->scores( $h->categories ), "\n";
 print "Chosen from: ", join ', ', $h->all_categories, "\n";
 print +($h->in_category('geometry') ? '' : 'not '), "assigned to geometry\n";
 
=head1 DESCRIPTION
 
A Hypothesis embodies a set of category assignments that a categorizer
makes about a single document.  Because one may be interested in
knowing different kinds of things about the assignments (for instance,
what categories were assigned, which category had the highest score,
whether a particular category was assigned), we provide a simple class
to help facilitate these scenarios.
 
=head1 METHODS
 
=over 4

lib/AI/Categorizer/Hypothesis.pm

are returned by the Learner's C<categorize()> method.  However, if you
wish to create a Hypothesis directly (maybe passing it some fake data
for testing purposes) you may do so using the C<new()> method.
 
The following parameters are accepted when creating a new Hypothesis:
 
=over 4
 
=item all_categories
 
A required parameter which gives the set of all categories that could
possibly be assigned to.  The categories should be specified as a
reference to an array of category names (as strings).
 
=item scores
 
A hash reference indicating the assignment score for each category.
Any score higher than the C<threshold> will be considered to be
assigned.
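The interplay of C<scores> and C<threshold> amounts to a simple
comparison, which can be sketched without the class (the scores here
are hypothetical):

```perl
use strict;
use warnings;

my %scores    = (sports => 0.8, politics => 0.3, arts => 0.55);
my $threshold = 0.5;

# Categories whose score exceeds the threshold count as assigned
my @assigned = sort grep { $scores{$_} > $threshold } keys %scores;

# The best category is simply the highest-scoring one
my ($best) = sort { $scores{$b} <=> $scores{$a} } keys %scores;
# @assigned is ('arts', 'sports'); $best is 'sports'
```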
 
=item threshold

lib/AI/Categorizer/KnowledgeSet.pm

  my ($pkg, %args) = @_;
   
  # Shortcuts
  if ($args{tfidf_weighting}) {
    @args{'term_weighting', 'collection_weighting', 'normalize_weighting'} = split '', $args{tfidf_weighting};
    delete $args{tfidf_weighting};
  }
 
  my $self = $pkg->SUPER::new(%args);
 
  # Convert to AI::Categorizer::ObjectSet sets
  $self->{categories} = new AI::Categorizer::ObjectSet( @{$self->{categories}} );
  $self->{documents}  = new AI::Categorizer::ObjectSet( @{$self->{documents}}  );
 
  if ($self->{load}) {
    my $args = ref($self->{load}) ? $self->{load} : { path => $self->{load} };
    $self->load(%$args);
    delete $self->{load};
  }
  return $self;
}
 
sub features {
  my $self = shift;
 
  if (@_) {
    $self->{features} = shift;
    $self->trim_doc_features if $self->{features};
  }
  return $self->{features} if $self->{features};
 
  # Create a feature vector encompassing the whole set of documents
  my $v = $self->create_delayed_object('features');
  foreach my $document ($self->documents) {
    $v->add( $document->features );
  }
  return $self->{features} = $v;
}
 
sub categories {
  my $c = $_[0]->{categories};
  return wantarray ? $c->members : $c->size;

lib/AI/Categorizer/KnowledgeSet.pm

sub load {
  my ($self, %args) = @_;
  my $c = $self->_make_collection(\%args);
 
  if ($self->{features_kept}) {
    # Read the whole thing in, then reduce
    $self->read( collection => $c );
    $self->select_features;
 
  } elsif ($self->{scan_first}) {
    # Figure out the feature set first, then read data in
    $self->scan_features( collection => $c );
    $c->rewind;
    $self->read( collection => $c );
 
  } else {
    # Don't do any feature reduction, just read the data
    $self->read( collection => $c );
  }
}

lib/AI/Categorizer/KnowledgeSet.pm

  my $ranked_features = $self->{feature_selector}->scan_features( collection => $c, prog_bar => $pb );
 
  $self->delayed_object_params('document', use_features => $ranked_features);
  $self->delayed_object_params('collection', use_features => $ranked_features);
  return $ranked_features;
}
 
sub select_features {
  my $self = shift;
   
  my $f = $self->feature_selector->select_features(knowledge_set => $self);
  $self->features($f);
}
 
sub partition {
  my ($self, @sizes) = @_;
  my $num_docs = my @docs = $self->documents;
  my @groups;
 
  while (@sizes > 1) {
    my $size = int ($num_docs * shift @sizes);

lib/AI/Categorizer/KnowledgeSet.pm

  $self->delayed_object_params('document',   use_features => $features);
  $self->delayed_object_params('collection', use_features => $features);
}
 
1;
 
__END__
 
=head1 NAME
 
AI::Categorizer::KnowledgeSet - Encapsulates set of documents
 
=head1 SYNOPSIS
 
 use AI::Categorizer::KnowledgeSet;
 my $k = new AI::Categorizer::KnowledgeSet(...parameters...);
 my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
 $nb->train(knowledge_set => $k);
 
=head1 DESCRIPTION
 
The KnowledgeSet class provides an interface to a set of
documents, a set of categories, and a mapping between the two.  Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.
 
=head1 METHODS
 
=over 4
 
=item new()
 
Creates a new KnowledgeSet and returns it.  Accepts the following

lib/AI/Categorizer/KnowledgeSet.pm

If a C<load> parameter is present, the C<load()> method will be
invoked immediately.  If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>.  If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.
 
=item categories
 
An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet.  If used, the
C<documents> parameter should also be specified.
 
=item documents
 
An optional reference to an array of Document objects representing the
complete set of documents in a KnowledgeSet.  If used, the
C<categories> parameter should also be specified.
 
=item features_kept
 
A number indicating how many features (words) should be considered
when training the Learner or categorizing new documents.  May be
specified as a positive integer (e.g. 2000) indicating the absolute
number of features to be kept, or as a decimal between 0 and 1
(e.g. 0.2) indicating the fraction of the total number of features to
be kept, or as 0 to indicate that no feature selection should be done
and that the entire set of features should be used.  The default is
0.2.
 
=item feature_selection
 
A string indicating the type of feature selection that should be
performed.  Currently the only option is also the default option:
C<document_frequency>.
 
=item tfidf_weighting

lib/AI/Categorizer/KnowledgeSet.pm

No change - multiply by 1.
 
=back
 
The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.
 
=item verbose
 
If set to a true value, some status/debugging information will be
output on C<STDOUT>.
 
=back
 
 
=item categories()
 
In a list context returns a list of all Category objects in this
KnowledgeSet.  In a scalar context returns the number of such objects.

lib/AI/Categorizer/KnowledgeSet.pm

Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.
 
=item features()
 
Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.
 
=item verbose()
 
Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.
 
=item scan_stats()
 
Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection.  (XXX need to describe stats)
 
=item scan_features()
 
This method scans through a Collection object and determines the

lib/AI/Categorizer/KnowledgeSet.pm

This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.
 
=item document_frequency()
 
Given a single feature (word) as an argument, this method will return
the number of documents in the KnowledgeSet that contain that feature.
 
=item partition()
 
Divides the KnowledgeSet into several subsets.  This may be useful for
performing cross-validation.  The relative sizes of the subsets should
be passed as arguments.  For example, to split the KnowledgeSet into
four KnowledgeSets of equal size, pass the arguments .25, .25, .25
(the final size is 1 minus the sum of the other sizes).  The
partitions will be returned as a list.
 
=back
 
=head1 AUTHOR
 
Ken Williams, ken@mathforum.org

lib/AI/Categorizer/Learner.pm

use strict;
use base qw(Class::Container AI::Categorizer::Storable);
 
use Params::Validate qw(:types);
 
__PACKAGE__->valid_params
  (
   knowledge_set  => { isa => 'AI::Categorizer::KnowledgeSet', optional => 1 },
   verbose => {type => SCALAR, default => 0},
  );
 
__PACKAGE__->contained_objects
  (
   hypothesis => {
                  class => 'AI::Categorizer::Hypothesis',
                  delayed => 1,
                 },
   experiment => {

lib/AI/Categorizer/Learner.pm

sub add_knowledge;
 
sub verbose {
  my $self = shift;
  if (@_) {
    $self->{verbose} = shift;
  }
  return $self->{verbose};
}
 
sub knowledge_set {
  my $self = shift;
  if (@_) {
    $self->{knowledge_set} = shift;
  }
  return $self->{knowledge_set};
}
 
sub categories {
  my $self = shift;
  return $self->knowledge_set->categories;
}
 
sub train {
  my ($self, %args) = @_;
  $self->{knowledge_set} = $args{knowledge_set} if $args{knowledge_set};
  die "No knowledge_set provided" unless $self->{knowledge_set};
 
  $self->{knowledge_set}->finish;
  $self->create_model;    # Creates $self->{model}
  $self->delayed_object_params('hypothesis',
                               all_categories => [map $_->name, $self->categories],
                              );
}
 
sub prog_bar {
  my ($self, $count) = @_;
   
  return sub { print STDERR '.' } unless eval "use Time::Progress; 1";

lib/AI/Categorizer/Learner.pm

AI::Categorizer::Learner - Abstract Machine Learner Class
 
=head1 SYNOPSIS
 
 use AI::Categorizer::Learner::NaiveBayes;  # Or other subclass
  
 # Here $k is an AI::Categorizer::KnowledgeSet object
  
 my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
 $nb->train(knowledge_set => $k);
 $nb->save_state('filename');
  
 ... time passes ...
  
 $nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
 my $c = new AI::Categorizer::Collection::Files( path => ... );
 while (my $document = $c->next) {
   my $hypothesis = $nb->categorize($document);
   print "Best assigned category: ", $hypothesis->best_category, "\n";
   print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";

lib/AI/Categorizer/Learner.pm

=over 4
 
=item new()
 
Creates a new Learner and returns it.  Accepts the following
parameters:
 
=over 4
 
=item knowledge_set
 
A Knowledge Set that will be used by default during the C<train()>
method.
 
=item verbose
 
If true, the Learner will display some diagnostic output while
training and categorizing documents.
 
=back
 
=item train()
 
=item train(knowledge_set => $k)
 
Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.  If you provided a C<knowledge_set> parameter to C<new()>,
specifying one here will override it.
 
=item categorize($document)
 
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more
details on how to use this object.
 
=item categorize_collection(collection => $collection)
 
Categorizes every document in a collection and returns an Experiment
object representing the results.  Note that the Experiment does not
contain knowledge of the assigned categories for every document, only
a statistical summary of the results.
 
=item knowledge_set()
 
Gets/sets the internal C<knowledge_set> member.  Note that since the
knowledge set may be enormous, some Learners may throw away their
knowledge set after training or after restoring state from a file.
 
=item $learner-E<gt>save_state($path)
 
Saves the Learner for later use.  This method is inherited from
C<AI::Categorizer::Storable>.
 
=item $class-E<gt>restore_state($path)
 
Returns a Learner saved in a file with C<save_state()>.  This method
is inherited from C<AI::Categorizer::Storable>.

lib/AI/Categorizer/Learner/Boolean.pm

  (
   max_instances => {type => SCALAR, default => 0},
   threshold => {type => SCALAR, default => 0.5},
  );
 
sub create_model {
  my $self = shift;
  my $m = $self->{model} ||= {};
  my $mi = $self->{max_instances};
 
  foreach my $cat ($self->knowledge_set->categories) {
    my (@p, @n);
    foreach my $doc ($self->knowledge_set->documents) {
      if ($doc->is_in_category($cat)) {
        push @p, $doc;
      } else {
        push @n, $doc;
      }
    }
    if ($mi and @p + @n > $mi) {
      # Get rid of random instances from training set, preserving
      # current positive/negative ratio
      my $ratio = $mi / (@p + @n);
      @p = random_elements(\@p, @p * $ratio);
      @n = random_elements(\@n, @n * $ratio);
       
      warn "Limiting to ". @p ." positives and ". @n ." negatives\n" if $self->verbose;
    }
 
    warn "Creating model for ", $cat->name, "\n" if $self->verbose;
    $m->{learners}{ $cat->name } = $self->create_boolean_model(\@p, \@n, $cat);

lib/AI/Categorizer/Learner/DecisionTree.pm

my %results;
for ($positives, $negatives) {
  foreach my $doc (@$_) {
    $results{$doc->name} = $_ eq $positives ? 1 : 0;
  }
}
 
if ($self->{model}{first_tree}) {
  $t->copy_instances(from => $self->{model}{first_tree});
  $t->set_results(\%results);
 
} else {
  for ($positives, $negatives) {
    foreach my $doc (@$_) {
      $t->add_instance( attributes => $doc->features->as_boolean_hash,
                        result => $results{$doc->name},
                        name => $doc->name,
                      );
    }
  }

lib/AI/Categorizer/Learner/DecisionTree.pm

AI::Categorizer::Learner::DecisionTree - Decision Tree Learner
 
=head1 SYNOPSIS
 
  use AI::Categorizer::Learner::DecisionTree;
   
  # Here $k is an AI::Categorizer::KnowledgeSet object
   
  my $l = new AI::Categorizer::Learner::DecisionTree(...parameters...);
  $l->train(knowledge_set => $k);
  $l->save_state('filename');
   
  ... time passes ...
   
  $l = AI::Categorizer::Learner->restore_state('filename');
  while (my $document = ... ) {  # An AI::Categorizer::Document object
    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
  }

lib/AI/Categorizer/Learner/DecisionTree.pm

=head1 METHODS
 
This class inherits from the C<AI::Categorizer::Learner> class, so all
of its methods are available unless explicitly mentioned here.
 
=head2 new()
 
Creates a new DecisionTree Learner and returns it.
 
=head2 train(knowledge_set => $k)
 
Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
 
=head2 categorize($document)
 
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more

lib/AI/Categorizer/Learner/Guesser.pm

 
use strict;
use base qw(AI::Categorizer::Learner);
 
sub create_model {
  my $self = shift;
  my $k = $self->knowledge_set;
  my $num_docs = $k->documents;
   
  foreach my $cat ($k->categories) {
    next unless $cat->documents;
    $self->{model}{$cat->name} = $cat->documents / $num_docs;
  }
}
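The C<create_model> above just records each category's share of the training corpus, and C<get_scores> hands those same priors back for every document. A standalone sketch of that strategy (the category names and document counts below are invented; no AI::Categorizer required):

```perl
use strict;
use warnings;

# Training statistics: how many documents fall in each category.
my %docs_per_cat = (sports => 60, politics => 30, weather => 10);
my $num_docs     = 100;

# The "model" is nothing but the overall class probabilities.
my %model = map { $_ => $docs_per_cat{$_} / $num_docs } keys %docs_per_cat;

# Every new document gets identical scores, so the best guess is
# always the most common category -- a useful baseline, nothing more.
my ($best) = sort { $model{$b} <=> $model{$a} } keys %model;
print "$best\n";   # prints "sports"
```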
 
sub get_scores {
  my ($self, $newdoc) = @_;

lib/AI/Categorizer/Learner/Guesser.pm

AI::Categorizer::Learner::Guesser - Simple guessing based on class probabilities
 
=head1 SYNOPSIS
 
  use AI::Categorizer::Learner::Guesser;
   
  # Here $k is an AI::Categorizer::KnowledgeSet object
   
  my $l = new AI::Categorizer::Learner::Guesser;
  $l->train(knowledge_set => $k);
  $l->save_state('filename');
   
  ... time passes ...
   
  $l = AI::Categorizer::Learner->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";

lib/AI/Categorizer/Learner/KNN.pm

__PACKAGE__->valid_params
  (
   threshold => {type => SCALAR, default => 0.4},
   k_value => {type => SCALAR, default => 20},
   knn_weighting => {type => SCALAR, default => 'score'},
   max_instances => {type => SCALAR, default => 0},
  );
 
sub create_model {
  my $self = shift;
  foreach my $doc ($self->knowledge_set->documents) {
    $doc->features->normalize;
  }
  $self->knowledge_set->features;  # Initialize
}
 
sub threshold {
  my $self = shift;
  $self->{threshold} = shift if @_;
  return $self->{threshold};
}
 
sub categorize_collection {
  my $self = shift;
   
  my $f_class = $self->knowledge_set->contained_class('features');
  if ($f_class->can('all_features')) {
    $f_class->all_features([$self->knowledge_set->features->names]);
  }
  $self->SUPER::categorize_collection(@_);
}
 
sub get_scores {
  my ($self, $newdoc) = @_;
  my $currentDocName = $newdoc->name;
  #print "classifying $currentDocName\n";
 
  my $features = $newdoc->features->intersection($self->knowledge_set->features)->normalize;
  my $q = AI::Categorizer::Learner::KNN::Queue->new(size => $self->{k_value});
 
  my @docset;
  if ($self->{max_instances}) {
    # Use (approximately) max_instances documents, chosen randomly from corpus
    my $probability = $self->{max_instances} / $self->knowledge_set->documents;
    @docset = grep {rand() < $probability} $self->knowledge_set->documents;
  } else {
    # Use the whole corpus
    @docset = $self->knowledge_set->documents;
  }
   
  foreach my $doc (@docset) {
    my $score = $doc->features->dot( $features );
    warn "Score for ", $doc->name, " (", ($doc->categories)[0]->name, "): $score" if $self->verbose > 1;
    $q->add($doc, $score);
  }
   
  my %scores = map {+$_->name, 0} $self->categories;
  foreach my $e (@{$q->entries}) {
    foreach my $cat ($e->{thing}->categories) {
      $scores{$cat->name} += ($self->{knn_weighting} eq 'score' ? $e->{score} : 1); #increment cat score
    }
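The scoring loop above amounts to weighted voting among the k most similar training documents, with dot products as the similarity measure. A self-contained sketch of the same idea (the feature vectors, categories, and k value are invented for illustration):

```perl
use strict;
use warnings;

# Sparse dot product between two feature-hash "vectors".
sub dot {
  my ($x, $y) = @_;
  my $sum = 0;
  $sum += $x->{$_} * ($y->{$_} // 0) for keys %$x;
  return $sum;
}

my $new   = { blood => 1, vampires => 1 };   # document to classify
my @train = (
  { feats => { blood => 1, vampires => 2 },   cat => 'horror'  },
  { feats => { sheep => 3 },                  cat => 'farming' },
  { feats => { vampires => 1, mirrors => 1 }, cat => 'horror'  },
);

# Score every training document, keep the k nearest.
my $k = 2;
my @scored  = map { +{ %$_, score => dot($new, $_->{feats}) } } @train;
my @nearest = (sort { $b->{score} <=> $a->{score} } @scored)[0 .. $k - 1];

# Each neighbor votes for its category, weighted by similarity
# (the 'score' weighting mode above; 'uniform' would add 1 instead).
my %scores;
$scores{ $_->{cat} } += $_->{score} for @nearest;
my ($best) = sort { $scores{$b} <=> $scores{$a} } keys %scores;
print "$best\n";   # prints "horror"
```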

lib/AI/Categorizer/Learner/KNN.pm

AI::Categorizer::Learner::KNN - K Nearest Neighbour Algorithm For AI::Categorizer
 
=head1 SYNOPSIS
 
  use AI::Categorizer::Learner::KNN;
   
  # Here $k is an AI::Categorizer::KnowledgeSet object
   
  my $nb = new AI::Categorizer::Learner::KNN(...parameters...);
  $nb->train(knowledge_set => $k);
  $nb->save_state('filename');
   
  ... time passes ...
   
  $l = AI::Categorizer::Learner->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";

lib/AI/Categorizer/Learner/KNN.pm

Creates a new KNN Learner and returns it.  In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
KNN subclass accepts the following parameters:
 
=over 4
 
=item threshold
 
Sets the score threshold for category membership.  The default is
currently 0.4.  Set the threshold lower to assign more categories per
document, set it higher to assign fewer.  This can be an effective way
to trade off between precision and recall.
 
=item k_value
 
Sets the C<k> value (as in k-Nearest-Neighbor) to the given integer.
This indicates how many of each document's nearest neighbors should be
considered when assigning categories.  The default is 20.
 
=back
 
=head2 threshold()
 
Returns the current threshold value.  With an optional numeric
argument, you may set the threshold.
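
For illustration, here is how such a threshold trades precision against recall: every category whose score clears the threshold is assigned, so a lower threshold yields more (but noisier) assignments. The scores below are invented, not produced by this module:

```perl
use strict;
use warnings;

# Hypothetical per-category scores for one document.
my %scores = (horror => 0.72, farming => 0.45, weather => 0.12);

# Assign every category whose score meets the threshold.
sub assigned {
  my ($threshold) = @_;
  return grep { $scores{$_} >= $threshold } sort keys %scores;
}

my @strict = assigned(0.5);   # higher threshold: fewer categories
my @loose  = assigned(0.3);   # lower threshold: more categories
print "@strict | @loose\n";   # prints "horror | farming horror"
```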
 
=head2 train(knowledge_set => $k)
 
Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
 
=head2 categorize($document)
 
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more

lib/AI/Categorizer/Learner/NaiveBayes.pm

__PACKAGE__->valid_params
  (
   threshold => {type => SCALAR, default => 0.3},
  );
 
sub create_model {
  my $self = shift;
  my $m = $self->{model} = Algorithm::NaiveBayes->new;
 
  foreach my $d ($self->knowledge_set->documents) {
    $m->add_instance(attributes => $d->features->as_hash,
                     label      => [ map $_->name, $d->categories ]);
  }
  $m->train;
}
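C<create_model> simply feeds each document's feature counts to Algorithm::NaiveBayes and lets it train. For readers unfamiliar with the underlying rule, here is a rough standalone sketch of multinomial naive Bayes scoring with add-one smoothing; the training counts are invented, and this is an illustration of the technique, not the Algorithm::NaiveBayes internals:

```perl
use strict;
use warnings;

# Invented per-category word counts and priors.
my %count = (
  horror  => { blood => 3, vampires => 2 },
  farming => { sheep => 4, wool  => 1 },
);
my %prior = (horror => 0.5, farming => 0.5);
my @vocab = qw(blood vampires sheep wool);

# log P(cat) + sum over words of log P(word | cat), smoothed.
sub log_score {
  my ($cat, @words) = @_;
  my $total = 0;
  $total += $_ for values %{ $count{$cat} };
  my $score = log $prior{$cat};
  for my $w (@words) {
    my $c = $count{$cat}{$w} || 0;
    $score += log( ($c + 1) / ($total + @vocab) );  # add-one smoothing
  }
  return $score;
}

my @doc = qw(blood blood vampires);
my ($best) = sort { log_score($b, @doc) <=> log_score($a, @doc) }
             keys %count;
print "$best\n";   # prints "horror"
```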
 
sub get_scores {
  my ($self, $newdoc) = @_;
 
  return ($self->{model}->predict( attributes => $newdoc->features->as_hash ),

lib/AI/Categorizer/Learner/NaiveBayes.pm

}
 
sub threshold {
  my $self = shift;
  $self->{threshold} = shift if @_;
  return $self->{threshold};
}
 
sub save_state {
  my $self = shift;
  local $self->{knowledge_set};  # Don't need the knowledge_set to categorize
  $self->SUPER::save_state(@_);
}
 
sub categories {
  my $self = shift;
  return map AI::Categorizer::Category->by_name( name => $_ ), $self->{model}->labels;
}
 
1;

lib/AI/Categorizer/Learner/NaiveBayes.pm

AI::Categorizer::Learner::NaiveBayes - Naive Bayes Algorithm For AI::Categorizer
 
=head1 SYNOPSIS
 
  use AI::Categorizer::Learner::NaiveBayes;
   
  # Here $k is an AI::Categorizer::KnowledgeSet object
   
  my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
  $nb->train(knowledge_set => $k);
  $nb->save_state('filename');
   
  ... time passes ...
   
  $nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $nb->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
    print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";

lib/AI/Categorizer/Learner/NaiveBayes.pm

Creates a new Naive Bayes Learner and returns it.  In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
Naive Bayes subclass accepts the following parameters:
 
=over 4
 
=item * threshold
 
Sets the score threshold for category membership.  The default is
currently 0.3.  Set the threshold lower to assign more categories per
document, set it higher to assign fewer.  This can be an effective way
to trade off between precision and recall.
 
=back
 
=head2 threshold()
 
Returns the current threshold value.  With an optional numeric
argument, you may set the threshold.
 
=head2 train(knowledge_set => $k)
 
Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
 
=head2 categorize($document)
 
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more

lib/AI/Categorizer/Learner/Rocchio.pm

$VERSION = '0.01';
 
use strict;
use Params::Validate qw(:types);
use base qw(AI::Categorizer::Learner::Boolean);
 
__PACKAGE__->valid_params
  (
   positive_setting => {type => SCALAR, default => 16 },
   negative_setting => {type => SCALAR, default => 4  },
   threshold        => {type => SCALAR, default => 0.1},
  );
 
sub create_model {
  my $self = shift;
  foreach my $doc ($self->knowledge_set->documents) {
    $doc->features->normalize;
  }
   
  $self->{model}{all_features} = $self->knowledge_set->features(undef);
  $self->SUPER::create_model(@_);
  delete $self->{knowledge_set};
}
 
sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;
  my $posdocnum = @$positives;
  my $negdocnum = @$negatives;
   
  my $beta = $self->{positive_setting};
  my $gamma = $self->{negative_setting};
   
  my $profile = $self->{model}{all_features}->clone->scale(-$gamma/$negdocnum);
  my $f = $cat->features(undef)->clone->scale( $beta/$posdocnum + $gamma/$negdocnum );
  $profile->add($f);
 
  return $profile->normalize;
}
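The arithmetic above is a compact form of the classic Rocchio profile: beta/|P| times the sum of positive documents minus gamma/|N| times the sum of negatives. Because C<all_features> is the sum of positives and negatives, scaling it by -gamma/|N| and then adding (beta/|P| + gamma/|N|) times the positive sum yields the same vector. A direct sketch of the expanded form, with made-up two-feature documents:

```perl
use strict;
use warnings;

my ($beta, $gamma) = (16, 4);   # the module's default settings
my @pos = ({ a => 1, b => 0 }, { a => 1, b => 1 });
my @neg = ({ a => 0, b => 1 });

# profile = beta/|P| * sum(pos)  -  gamma/|N| * sum(neg)
my %profile;
for my $d (@pos) {
  $profile{$_} += $beta / @pos * $d->{$_} for keys %$d;
}
for my $d (@neg) {
  $profile{$_} -= $gamma / @neg * $d->{$_} for keys %$d;
}
printf "a=%g b=%g\n", $profile{a}, $profile{b};   # prints "a=16 b=4"
```

(The real method then normalizes the profile, as shown in C<create_boolean_model>.)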
 
sub get_boolean_score {
  my ($self, $newdoc, $profile) = @_;

lib/AI/Categorizer/Learner/SVM.pm

use Params::Validate qw(:types);
 
__PACKAGE__->valid_params
  (
   svm_kernel => {type => SCALAR, default => 'linear'},
  );
 
sub create_model {
  my $self = shift;
  my $f = $self->knowledge_set->features->as_hash;
  my $rmap = [ keys %$f ];
  $self->{model}{feature_map} = { map { $rmap->[$_], $_ } 0..$#$rmap };
  $self->{model}{feature_map_reverse} = $rmap;
  $self->SUPER::create_model(@_);
}
 
sub _doc_2_dataset {
  my ($self, $doc, $label, $fm) = @_;
 
  my $ds = new Algorithm::SVM::DataSet(Label => $label);
  my $f = $doc->features->as_hash;
  while (my ($k, $v) = each %$f) {
    next unless exists $fm->{$k};
    $ds->attribute( $fm->{$k}, $v );
  }
  return $ds;
}
 
sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;
  my $svm = new Algorithm::SVM(Kernel => $self->{svm_kernel});
   
  my (@pos, @neg);
  foreach my $doc (@$positives) {
    push @pos, $self->_doc_2_dataset($doc, 1, $self->{model}{feature_map});
  }
  foreach my $doc (@$negatives) {
    push @neg, $self->_doc_2_dataset($doc, 0, $self->{model}{feature_map});
  }
 
  $svm->train(@pos, @neg);
  return $svm;
}
 
sub get_scores {
  my ($self, $doc) = @_;
  local $self->{current_doc} = $self->_doc_2_dataset($doc, -1, $self->{model}{feature_map});
  return $self->SUPER::get_scores($doc);
}
 
sub get_boolean_score {
  my ($self, $doc, $svm) = @_;
  return $svm->predict($self->{current_doc});
}
 
sub save_state {
  my ($self, $path) = @_;
  {
    local $self->{model}{learners};
    local $self->{knowledge_set};
    $self->SUPER::save_state($path);
  }
  return unless $self->{model};
   
  my $svm_dir = File::Spec->catdir($path, 'svms');
  mkdir($svm_dir, 0777) or die "Couldn't create $svm_dir: $!";
  while (my ($name, $learner) = each %{$self->{model}{learners}}) {
    my $path = File::Spec->catfile($svm_dir, $name);
    $learner->save($path);
  }

lib/AI/Categorizer/Learner/SVM.pm

AI::Categorizer::Learner::SVM - Support Vector Machine Learner
 
=head1 SYNOPSIS
 
  use AI::Categorizer::Learner::SVM;
   
  # Here $k is an AI::Categorizer::KnowledgeSet object
   
  my $l = new AI::Categorizer::Learner::SVM(...parameters...);
  $l->train(knowledge_set => $k);
  $l->save_state('filename');
   
  ... time passes ...
   
  $l = AI::Categorizer::Learner->restore_state('filename');
  while (my $document = ... ) {  # An AI::Categorizer::Document object
    my $hypothesis = $l->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
  }

lib/AI/Categorizer/Learner/SVM.pm

=over 4
 
=item svm_kernel
 
Specifies what type of kernel should be used when building the SVM.
Default is 'linear'.  Possible values are 'linear', 'polynomial',
'radial' and 'sigmoid'.
 
=back
 
=head2 train(knowledge_set => $k)
 
Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
 
=head2 categorize($document)
 
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more

lib/AI/Categorizer/Learner/Weka.pm

    delete $self->{weka_path};
  }
  return $self;
}
 
# java -classpath /Applications/Science/weka-3-2-3/weka.jar weka.classifiers.NaiveBayes -t /tmp/train_file.arff -d /tmp/weka-machine
 
sub create_model {
  my ($self) = shift;
  my $m = $self->{model} ||= {};
  $m->{all_features} = [ $self->knowledge_set->features->names ];
  $m->{_in_dir} = File::Temp::tempdir( DIR => $self->{tmpdir} );
 
  # Create a dummy test file $dummy_file in ARFF format (a kludgey WEKA requirement)
  my $dummy_features = $self->create_delayed_object('features');
  $m->{dummy_file} = $self->create_arff_file("dummy", [[$dummy_features, 0]]);
 
  $self->SUPER::create_model(@_);
}
 
sub create_boolean_model {

lib/AI/Categorizer/Learner/Weka.pm

              );
  }
   
  return $filename;
}
 
sub save_state {
  my ($self, $path) = @_;
 
  {
    local $self->{knowledge_set};
    $self->SUPER::save_state($path);
  }
  return unless $self->{model};
 
  my $model_dir = File::Spec->catdir($path, 'models');
  mkdir($model_dir, 0777) or die "Couldn't create $model_dir: $!";
  while (my ($name, $learner) = each %{$self->{model}{learners}}) {
    my $oldpath = File::Spec->catdir($self->{model}{_in_dir}, $learner->{machine_file});
    my $newpath = File::Spec->catfile($model_dir, "${name}_model");
    File::Copy::copy($oldpath, $newpath);

lib/AI/Categorizer/Learner/Weka.pm

AI::Categorizer::Learner::Weka - Pass-through wrapper to Weka system
 
=head1 SYNOPSIS
 
  use AI::Categorizer::Learner::Weka;
   
  # Here $k is an AI::Categorizer::KnowledgeSet object
   
  my $nb = new AI::Categorizer::Learner::Weka(...parameters...);
  $nb->train(knowledge_set => $k);
  $nb->save_state('filename');
   
  ... time passes ...
   
  $nb = AI::Categorizer::Learner->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $nb->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
  }

lib/AI/Categorizer/Learner/Weka.pm

classifier class when building the categorizer.
 
=item tmpdir
 
A directory in which temporary files will be written when training the
categorizer and categorizing new documents.  The default is given by
C<< File::Spec->tmpdir >>.
 
=back
 
=head2 train(knowledge_set => $k)
 
Trains the categorizer.  This prepares it for later use in
categorizing documents.  The C<knowledge_set> parameter must provide
an object of the class C<AI::Categorizer::KnowledgeSet> (or a subclass
thereof), populated with lots of documents and categories.  See
L<AI::Categorizer::KnowledgeSet> for the details of how to create such
an object.
 
=head2 categorize($document)
 
Returns an C<AI::Categorizer::Hypothesis> object representing the
categorizer's "best guess" about which categories the given document
should be assigned to.  See L<AI::Categorizer::Hypothesis> for more

t/01-naive_bayes.t

perform_standard_tests(learner_class => 'AI::Categorizer::Learner::NaiveBayes');
 
#use Carp; $SIG{__DIE__} = \&Carp::confess;
 
my %docs = training_docs();
 
{
  ok my $c = new AI::Categorizer(collection_weighting => 'f');
   
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
   
  $c->knowledge_set->finish;
 
  # Make sure collection_weighting is working
  ok $c->knowledge_set->document_frequency('vampires'), 2;
  for ('vampires', 'mirrors') {
    ok ($c->knowledge_set->document('doc4')->features->as_hash->{$_},
        log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
       );
  }
 
  $c->learner->train( knowledge_set => $c->knowledge_set );
  ok $c->learner;
   
  my $doc = new AI::Categorizer::Document
    ( name => 'test1',
      content => 'I would like to begin farming sheep.' );
  ok $c->learner->categorize($doc)->best_category, 'farming';
}
 
{
  ok my $c = new AI::Categorizer(term_weighting => 'b');
   
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
   
  $c->knowledge_set->finish;
   
  # Make sure term_weighting is working
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
}
 
{
  ok my $c = new AI::Categorizer(term_weighting => 'n');
   
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
   
  $c->knowledge_set->finish;
   
  # Make sure term_weighting is working
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{blood}, 0.75;
  ok $c->knowledge_set->document('doc4')->features->as_hash->{mirrors}, 1;
}
 
{
  ok my $c = new AI::Categorizer(tfidf_weighting => 'txx');
   
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
   
  $c->knowledge_set->finish;
   
  # Make sure term_weighting is working
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 2;
}
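The C<collection_weighting> check near the top of this test relies on inverse document frequency: a term's weight is log(N / df), where N is the number of documents in the collection and df is how many of them contain the term. A standalone sketch of that calculation, with invented counts:

```perl
use strict;
use warnings;

my $num_docs = 4;                          # N: documents in collection
my %df = (vampires => 2, mirrors => 1);    # df: documents containing term

# idf weight: rarer terms get larger weights.
my %idf = map { $_ => log($num_docs / $df{$_}) } keys %df;
printf "vampires=%.4f mirrors=%.4f\n", $idf{vampires}, $idf{mirrors};
# prints "vampires=0.6931 mirrors=1.3863"
```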

t/07-guesser.t

#!/usr/bin/perl -w
 
#########################
 
use strict;
use Test;
BEGIN {
  require 't/common.pl';
  plan tests => 1 + num_setup_tests();
}
 
ok(1);
 
#########################
 
my ($learner, $docs) = set_up_tests(learner_class => 'AI::Categorizer::Learner::Guesser');


