AI-Categorizer


README  view on Meta::CPAN

  be seen in "doc/classes.png".
 
Knowledge Sets
 
  A "knowledge set" is defined as a collection of documents, together with
  some information on the categories each document belongs to. Note that this
  term is somewhat unique to this project - other sources may call it a
  "training corpus", or "prior knowledge". A knowledge set also contains some
  information on how documents will be parsed and how their features (words)
  will be extracted and turned into meaningful representations. In this sense,
  a knowledge set represents not only a collection of data, but a particular
  view on that data.
 
  A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
  class. Before you can start playing with categorizers, you will have to
  start playing with knowledge sets, so that the categorizers have some data
  to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
  module for information on its interface.
 
 Feature selection
 
  Deciding which features are the most important is a very large part of the
  categorization task - you cannot simply consider all the words in all the
  documents when training, and all the words in the document being
  categorized. There are two main reasons for this - first, it would mean that
  your training and categorizing processes would take forever and use tons of

  complete test run generally contains two collections, one for training and
  one for testing. A "Learner" can mass-categorize a collection.
 
  The "AI::Categorizer::Collection" class and its subclasses instantiate the
  idea of a collection in this sense.
 
Documents
 
  Each document is represented by an "AI::Categorizer::Document" object, or an
  object of one of its subclasses. Each document class contains methods for
  turning a bunch of data into a Feature Vector. Each document also has a
  method to report which categories it belongs to.
 
Categories
 
  Each category is represented by an "AI::Categorizer::Category" object. Its
  main purpose is to keep track of which documents belong to it, though you
  can also examine statistical properties of an entire category, such as
  obtaining a Feature Vector representing an amalgamation of all the documents
  that belong to it.

  Please see the documentation of these individual modules for more details on
  their guts and quirks. See the "AI::Categorizer::Learner" documentation for
  a description of the general categorizer interface.
 
  If you wish to create your own classifier, you should inherit from
  "AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
  abstract classes that manage some of the work for you.
 
Feature Vectors
 
  Most categorization algorithms don't deal directly with documents' data;
  instead, they deal with a *vector representation* of a document's *features*.
  The features may be any properties of the document that seem helpful for
  determining its category, but they are usually some version of the "most
  important" words in the document. A list of features and their weights in
  each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
  You may think of this class as roughly analogous to a Perl hash, where the
  keys are the names of features and the values are their weights.
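
  To make the hash analogy concrete, here is a plain-hash sketch (not the
  real "AI::Categorizer::FeatureVector" class; the feature names and weights
  are invented) of the "add" and "intersection" operations its interface
  provides:

```perl
use strict;
use warnings;

# Two toy feature vectors: feature name => weight
my %doc1 = ( perl => 3, categorize => 2, vector => 1 );
my %doc2 = ( perl => 1, bayes => 4 );

# "add": combine two vectors by summing weights (analogous to $f1->add($f2))
my %sum = %doc1;
$sum{$_} = ($sum{$_} // 0) + $doc2{$_} for keys %doc2;

# "intersection": keep only the features present in both vectors
my %common = map { $_ => $doc1{$_} } grep { exists $doc2{$_} } keys %doc1;

printf "perl weight in sum: %d\n", $sum{perl};
printf "shared features: %s\n", join(',', sort keys %common);
```

  The real class adds normalization and boolean views on top of this idea,
  but the underlying data is the same name-to-weight mapping.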
 
Hypotheses

    training_set
        Specifies the "path" parameter that will be fed to the
        KnowledgeSet's "scan_features()" and "read()" methods during our
        "scan_features()" and "read_training_set()" methods.
 
    test_set
        Specifies the "path" parameter that will be used when creating a
        Collection during the "evaluate_test_set()" method.
 
    data_root
        A shortcut for setting the "training_set", "test_set", and
        "category_file" parameters separately. Sets "training_set" to
        "$data_root/training", "test_set" to "$data_root/test", and
        "category_file" (used by some of the Collection classes) to
        "$data_root/cats.txt".
 
learner()
    Returns the Learner object associated with this Categorizer. Before
    "train()", the Learner will of course not be trained yet.
 
knowledge_set()
    Returns the KnowledgeSet object associated with this Categorizer. If
    "read_training_set()" has not yet been called, the KnowledgeSet will not
    yet be populated with any training data.
 
run_experiment()
    Runs a complete experiment on the training and testing data, reporting
    the results on "STDOUT". Internally, this is just a shortcut for calling
    the "scan_features()", "read_training_set()", "train()", and
    "evaluate_test_set()" methods, then printing the value of the
    "stats_table()" method.
 
scan_features()
    Scans the Collection specified in the "training_set" parameter to determine
    the set of features (words) that will be considered when training the
    Learner. Internally, this calls the "scan_features()" method of the
    KnowledgeSet, then saves a list of the KnowledgeSet's features for later
    use.
 
    This step is not strictly necessary, but it can dramatically reduce
    memory requirements if you scan for features before reading the entire
    corpus into memory.
 
read_training_set()
    Populates the KnowledgeSet with the data specified in the "training_set"
    parameter. Internally, this calls the "read()" method of the
    KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
    object for later use.
 
train()
    Calls the Learner's "train()" method, passing it the KnowledgeSet
    created during "read_training_set()". Returns the Learner object. Also
    saves the Learner object for later use.
 
evaluate_test_set()

eg/demo.pl  view on Meta::CPAN

# documents, trains a Naive Bayes categorizer on it, then tests the
# categorizer on a set of test documents.
 
use strict;
 
die("Usage: $0 <corpus>\n".
    "  A sample corpus (data set) can be downloaded from\n")
  unless @ARGV == 1;
 
my $corpus = shift;
 
my $training  = File::Spec->catfile( $corpus, 'training' );
my $test      = File::Spec->catfile( $corpus, 'test' );
my $cats      = File::Spec->catfile( $corpus, 'cats.txt' );
my $stopwords = File::Spec->catfile( $corpus, 'stopwords' );

lib/AI/Categorizer.pm  view on Meta::CPAN

 
__PACKAGE__->valid_params
  (
   progress_file => { type => SCALAR, default => 'save' },
   knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
   learner       => { isa => 'AI::Categorizer::Learner' },
   verbose       => { type => BOOLEAN, default => 0 },
   training_set  => { type => SCALAR, optional => 1 },
   test_set      => { type => SCALAR, optional => 1 },
   data_root     => { type => SCALAR, optional => 1 },
  );
 
__PACKAGE__->contained_objects
  (
   knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
   learner       => { class => 'AI::Categorizer::Learner::NaiveBayes' },
   experiment    => { class => 'AI::Categorizer::Experiment',
                      delayed => 1 },
   collection    => { class => 'AI::Categorizer::Collection::Files',
                      delayed => 1 },
  );
 
sub new {
  my $package = shift;
  my %args = @_;
  my %defaults;
  if (exists $args{data_root}) {
    $defaults{training_set} = File::Spec->catfile($args{data_root}, 'training');
    $defaults{test_set} = File::Spec->catfile($args{data_root}, 'test');
    $defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
    delete $args{data_root};
  }
 
  return $package->SUPER::new(%defaults, %args);
}
 
#sub dump_parameters {
#  my $p = shift()->SUPER::dump_parameters;
#  delete $p->{stopwords} if $p->{stopword_file};
#  return $p;
#}

lib/AI/Categorizer.pm  view on Meta::CPAN

=head2 Knowledge Sets
 
A "knowledge set" is defined as a collection of documents, together
with some information on the categories each document belongs to.
Note that this term is somewhat unique to this project - other sources
may call it a "training corpus", or "prior knowledge".  A knowledge
set also contains some information on how documents will be parsed and
how their features (words) will be extracted and turned into
meaningful representations.  In this sense, a knowledge set represents
not only a collection of data, but a particular view on that data.
 
A knowledge set is encapsulated by the
C<AI::Categorizer::KnowledgeSet> class.  Before you can start playing
with categorizers, you will have to start playing with knowledge sets,
so that the categorizers have some data to train on.  See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.
 
=head3 Feature selection
 
Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized.  There are two main reasons for this - first, it
would mean that your training and categorizing processes would take

lib/AI/Categorizer.pm  view on Meta::CPAN

two collections, one for training and one for testing.  A C<Learner>
can mass-categorize a collection.
 
The C<AI::Categorizer::Collection> class and its subclasses
instantiate the idea of a collection in this sense.
 
=head2 Documents
 
Each document is represented by an C<AI::Categorizer::Document>
object, or an object of one of its subclasses.  Each document class
contains methods for turning a bunch of data into a Feature Vector.
Each document also has a method to report which categories it belongs
to.
 
=head2 Categories
 
Each category is represented by an C<AI::Categorizer::Category>
object.  Its main purpose is to keep track of which documents belong
to it, though you can also examine statistical properties of an entire
category, such as obtaining a Feature Vector representing an
amalgamation of all the documents that belong to it.

lib/AI/Categorizer.pm  view on Meta::CPAN

details on their guts and quirks.  See the C<AI::Categorizer::Learner>
documentation for a description of the general categorizer interface.
 
If you wish to create your own classifier, you should inherit from
C<AI::Categorizer::Learner> or C<AI::Categorizer::Learner::Boolean>,
which are abstract classes that manage some of the work for you.
 
=head2 Feature Vectors
 
Most categorization algorithms don't deal directly with documents'
data; instead, they deal with a I<vector representation> of a
document's I<features>.  The features may be any properties of the
document that seem helpful for determining its category, but they are usually
some version of the "most important" words in the document.  A list of
features and their weights in each document is encapsulated by the
C<AI::Categorizer::FeatureVector> class.  You may think of this class
as roughly analogous to a Perl hash, where the keys are the names of
features and the values are their weights.
 
=head2 Hypotheses

lib/AI/Categorizer.pm  view on Meta::CPAN

Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.
 
=item test_set
 
Specifies the C<path> parameter that will be used when creating a
Collection during the C<evaluate_test_set()> method.
 
=item data_root
 
A shortcut for setting the C<training_set>, C<test_set>, and
C<category_file> parameters separately.  Sets C<training_set> to
C<$data_root/training>, C<test_set> to C<$data_root/test>, and
C<category_file> (used by some of the Collection classes) to
C<$data_root/cats.txt>.
 
=back
 
=item learner()
 
Returns the Learner object associated with this Categorizer.  Before
C<train()>, the Learner will of course not be trained yet.
 
=item knowledge_set()
 
Returns the KnowledgeSet object associated with this Categorizer.  If
C<read_training_set()> has not yet been called, the KnowledgeSet will
not yet be populated with any training data.
 
=item run_experiment()
 
Runs a complete experiment on the training and testing data, reporting
the results on C<STDOUT>.  Internally, this is just a shortcut for
calling the C<scan_features()>, C<read_training_set()>, C<train()>,
and C<evaluate_test_set()> methods, then printing the value of the
C<stats_table()> method.
 
=item scan_features()
 
Scans the Collection specified in the C<training_set> parameter to
determine the set of features (words) that will be considered when
training the Learner.  Internally, this calls the C<scan_features()>
method of the KnowledgeSet, then saves a list of the KnowledgeSet's
features for later use.
 
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
 
=item read_training_set()
 
Populates the KnowledgeSet with the data specified in the C<training_set>
parameter.  Internally, this calls the C<read()> method of the
KnowledgeSet.  Returns the KnowledgeSet.  Also saves the KnowledgeSet
object for later use.
 
=item train()
 
Calls the Learner's C<train()> method, passing it the KnowledgeSet
created during C<read_training_set()>.  Returns the Learner object.
Also saves the Learner object for later use.

lib/AI/Categorizer/Collection/InMemory.pm  view on Meta::CPAN

use strict;
 
use base qw(AI::Categorizer::Collection);
 
use Params::Validate qw(:types);
 
__PACKAGE__->valid_params
  (
   data => { type => HASHREF },
  );
 
sub new {
  my $self = shift()->SUPER::new(@_);
   
  while (my ($name, $params) = each %{$self->{data}}) {
    foreach (@{$params->{categories}}) {
      next if ref $_;
      $_ = AI::Categorizer::Category->by_name(name => $_);
    }
  }
 
  return $self;
}
 
sub next {
  my $self = shift;
  my ($name, $params) = each %{$self->{data}} or return;
  return AI::Categorizer::Document->new(name => $name, %$params);
}
 
sub rewind {
  my $self = shift;
  # Calling keys() resets the hash's internal each() iterator
  scalar keys %{$self->{data}};
  return;
}
 
sub count_documents {
  my $self = shift;
  return scalar keys %{$self->{data}};
}
 
1;
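
The rewind() implementation above relies on a Perl idiom: calling keys()
on a hash, even in void context, resets that hash's internal each()
iterator. A standalone sketch of the idiom:

```perl
use strict;
use warnings;

my %docs = (a => 1, b => 2);

# Advance the each() iterator partway through the hash
my ($first) = each %docs;

# keys() resets the iterator as a documented side effect...
scalar keys %docs;

# ...so the next each() starts from the beginning again
my ($again) = each %docs;
print "restarted at: $again\n";
```

Within a single process the hash's iteration order is stable, so after
the reset each() hands back the same first key as before.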

lib/AI/Categorizer/Document.pm  view on Meta::CPAN

### Constructors
 
my $NAME = 'a';
 
sub new {
  my $pkg = shift;
  my $self = $pkg->SUPER::new(name => $NAME++,  # Use a default name
                              @_);
 
  # Get efficient internal data structures
  $self->{categories} = new AI::Categorizer::ObjectSet( @{$self->{categories}} );
 
  $self->_fix_stopwords;
   
  # A few different ways for the caller to initialize the content
  if (exists $self->{parse}) {
    $self->parse(content => delete $self->{parse});
     
  } elsif (exists $self->{parse_handle}) {
    $self->parse_handle(handle => delete $self->{parse_handle});

lib/AI/Categorizer/Document.pm  view on Meta::CPAN

sub create_feature_vector {
  my $self = shift;
  my $content = $self->{content};
  my $weights = $self->{content_weights};
 
  die "'stopword_behavior' must be one of 'stem', 'no_stem', or 'pre_stemmed'"
    unless $self->{stopword_behavior} =~ /^(stem|no_stem|pre_stemmed)$/;
 
  $self->{features} = $self->create_delayed_object('features');
  while (my ($name, $data) = each %$content) {
    my $t = $self->tokenize($data);
    $t = $self->_filter_tokens($t) if $self->{stopword_behavior} eq 'no_stem';
    $self->stem_words($t);
    $t = $self->_filter_tokens($t) if $self->{stopword_behavior} =~ /^(stem|pre_stemmed)$/;
    my $h = $self->vectorize(tokens => $t, weight => exists($weights->{$name}) ? $weights->{$name} : 1 );
    $self->{features}->add($h);
  }
}
 
sub is_in_category {
  return (ref $_[1]

lib/AI/Categorizer/Document.pm  view on Meta::CPAN

# Specify explicit feature vector:
 my $d = new AI::Categorizer::Document(name => $string);
 $d->features( $feature_vector );
  
 # Now pass the document to a categorization algorithm:
 my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
 my $hypothesis = $learner->categorize($document);
 
=head1 DESCRIPTION
 
The Document class embodies the data in a single document, and
contains methods for turning this data into a FeatureVector.  Usually
documents are plain text, but subclasses of the Document class may
handle any kind of data.
 
=head1 METHODS
 
=over 4
 
=item new(%parameters)
 
Creates a new Document object.  Document objects are used during
training (for the training documents), testing (for the test
documents), and when categorizing new unseen documents in an

lib/AI/Categorizer/Document.pm  view on Meta::CPAN

=back
 
The default value is C<stem>, which seems to produce the best results
in most cases I've tried.  I'm not aware of any studies comparing the
C<no_stem> behavior to the C<stem> behavior in the general case.
 
This parameter has no effect if there are no stopwords being used, or
if stemming is not being used.  In the latter case, the list of
stopwords will always be matched as-is against the document words.
 
Note that if the C<stem> option is used, the data structure passed as
the C<stopwords> parameter will be modified in-place to contain the
stemmed versions of the stopwords supplied.
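
The in-place modification can be illustrated with a toy sketch (the
suffix-stripping s/// here is a stand-in for the real stemmer, which the
module delegates to a stemming library):

```perl
use strict;
use warnings;

# The caller's stopword set, keyed by word
my $stopwords = { running => 1, quickly => 1 };

# A fake "stemmer" for illustration only: strip common suffixes
my %stemmed = map { (my $s = $_) =~ s/(?:ing|ly)$//; ($s => 1) } keys %$stopwords;

# Overwrite the caller's hash in place, as the stem option does
%$stopwords = %stemmed;

print join(',', sort keys %$stopwords), "\n";
```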
 
=back
 
=item read( path =E<gt> $path )
 
An alternative constructor method which reads a file on disk and
returns a document with that file's contents.

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

sub parse {
  my ($self, %args) = @_;
 
  # A string containing the XML content
  my $body = $args{content};
 
  # A hash of <elementName, weight> pairs
  my $elementWeight = $args{elementWeight};
 
  # Construct a handler that receives element, character-data, comment, and
  # processing-instruction events and converts their values into a string buffer
  my $xmlHandler = $self->create_contained_object('xml_handler', weights => $elementWeight);
 
  # Construct the parser
  my $xmlParser = XML::SAX::ParserFactory->parser(Handler => $xmlHandler);
 
  # Start parsing; the handler's methods will be called as events occur
  $xmlParser->parse_string($body);
 
  # extract the converted string from Handler

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

  # call super class such as XML::SAX::Base
  my $self = $class->SUPER::new;
 
  # Save the element-weight hash of <elementName, weight> pairs.
  # The weight is the number of times the corresponding element's content
  # is duplicated.  It is supplied by the caller at construction time and
  # saved here so end_element() can apply it.
  $self->{weightHash} = $args{weights};
 
  # Buffer for the character data produced by text, CDATA sections, etc.
  $self->{content} = '';
 
  # Stores, for each element from the root down to the element currently
  # being visited, the starting index of that element's content within the
  # buffer.  Only entries 0..($levelPointer - 1) are valid at any time.
  # At end_element() this tells us which slice of the buffer the element
  # produced, which is needed when duplicating that element's data.
  $self->{locationArray} = [];
 
  return $self;
}
         
# Input: None
# Output: None
# Description:
#       Called once, when the parser begins the document.
#       Initializes the level pointer and the content buffer.
sub start_document{
  my ($self, $doc)= @_;
 
  # Depth of the element currently being visited in the XML tree.
  # start_element() is called in preorder traversal of the tree, so this
  # tracks the level of the current element; the root element is at level 0.
  $self->{levelPointer} = 0;
 
  # All character data will be accumulated here; initially empty
  $self->{content} = "";
 
  #$self->SUPER::start_document($doc);
}
 
# Input: None
# Output: None
# Description:
#       Called once, when the parser reaches the end of the document.

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

#               Name            $el->{Name}
#               Prefix          $el->{Prefix}
#               Value           $el->{Value}
# Output: None
# Description:
#       Called whenever the parser encounters the start of an element.
sub start_element{
  my ($self, $el)= @_;
 
  # Current end of the content buffer; new data will be appended here
  my $location= length $self->{content};
 
  # Record this position so that end_element() knows where this
  # element's data begins
  $self->{locationArray}[$self->{levelPointer}] = $location;
 
  # Descend one level for the next element
  $self->{levelPointer}++;
 
  #$self->SUPER::start_element($el);
}
 
# Input: None
# Output: None

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

  $self->{levelPointer}--;
  my $location= $self->{locationArray}[$self->{levelPointer}];
 
  # find the name of element
  my $elementName= $el->{Name};
 
  # Set the default weight
  my $weight= 1;
 
  # Use the caller-supplied weight for this element, if any
  $weight= $self->{weightHash}{$elementName} if exists $self->{weightHash}{$elementName};
 
  # 0 - remove all the data produced by this element
  if($weight == 0){
    $self->{content} = substr($self->{content}, 0, $location);
    return;
  }
 
  # 1 - keep the data once, without duplication
  if($weight == 1){
    return;
  }
 
  # n - duplicate the data n times:
  # extract the content this element produced...
  my $newContent= substr($self->{content}, $location);
 
  # ...and append n-1 extra copies
  for (my $i = 1; $i < $weight; $i++) {
    $self->{content} .= $newContent;
  }
 
  #$self->SUPER::end_element($el);
}
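
The weighting rule applied at the end of an element can be sketched
outside the SAX handler (this helper and its sample strings are
illustrative, not part of the module):

```perl
use strict;
use warnings;

# weight 0 drops an element's text, 1 keeps it once, n repeats it n times.
# $location is the buffer offset where the element's text begins.
sub apply_weight {
  my ($content, $location, $weight) = @_;
  return substr($content, 0, $location) if $weight == 0;  # remove
  return $content if $weight == 1;                        # keep as-is
  my $new = substr($content, $location);                  # the element's text
  $content .= $new for 2 .. $weight;                      # append n-1 copies
  return $content;
}

my $buf = "intro TITLE";
print apply_weight($buf, 6, 3), "\n";  # "intro TITLETITLETITLE"
```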
 
# Input: a hash containing a <Data, Value> pair
# Output: None
# Description:
#       Called whenever the parser encounters character data (from text,
#       CDATA sections, etc.); the value is appended to the content buffer.
sub characters{
  my ($self, $args)= @_;
 
  # save "data plus new line" into content
  $self->{content} .= "$args->{Data}\n";
}
         
# Input: a hash containing a <Data, Value> pair
# Output: None
# Description:
#       Called whenever the parser encounters a comment.
#       Currently ignored.
sub comment{
  my ($self, $args)= @_;

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

# Input: a hash containing <Data, Value> and <Target, Value> pairs
# Output: None
# Description:
#       Called whenever the parser encounters a processing instruction.
#       Currently ignored.
sub processing_instruction{
  my ($self, $args)= @_;
}
 
# Input: None
# Output: the accumulated content string
# Description:
#       Returns the content buffer built up during parsing.
sub getContent{
  my ($self)= @_;
  return $self->{content};
}
 
1;
__END__

lib/AI/Categorizer/Experiment.pm  view on Meta::CPAN

C<Statistics::Contingency> for a description of its interface.  All of
its methods are available here, with the following additions:
 
=over 4
 
=item new( categories => \%categories )
 
=item new( categories => \@categories, verbose => 1, sig_figs => 2 )
 
Returns a new Experiment object.  A required C<categories> parameter
specifies the names of all categories in the data set.  The category
names may be specified either as the keys in a reference to a hash, or
as the entries in a reference to an array.
 
The C<new()> method accepts a C<verbose> parameter which
will cause some status/debugging information to be printed to
C<STDOUT> when C<verbose> is set to a true value.
 
A C<sig_figs> parameter indicates the number of significant figures that should
be used when showing the results in the C<results_table()> method.  It
does not affect the other methods like C<micro_precision()>.

lib/AI/Categorizer/FeatureSelector.pm  view on Meta::CPAN

This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).
 
=item add_document()
 
Given a Document object as an argument, this method will add the
document, along with any categories it belongs to, to the KnowledgeSet.
 
=item make_document()
 
This method will create a Document object with the given data and then
call C<add_document()> to add it to the KnowledgeSet.  A C<categories>
parameter should specify an array reference containing a list of
categories I<by name>.  These are the categories that the document
belongs to.  Any other parameters will be passed to the Document
class's C<new()> method.
 
=item finish()
 
This method will be called prior to training the Learner.  Its purpose
is to perform any operations (such as feature vector weighting) that

lib/AI/Categorizer/FeatureVector.pm  view on Meta::CPAN

  $f3 = $f1->intersection($f2);
  $f3 = $f1->add($f2);
   
  $h = $f1->as_hash;
  $h = $f1->as_boolean_hash;
   
  $f1->normalize;
 
=head1 DESCRIPTION
 
This class implements a "feature vector", which is a flat data
structure indicating the values associated with a set of features.  At
its base level, a FeatureVector usually represents the set of words in
a document, with the value for each feature indicating the number of
times each word appears in the document.  However, the values are
arbitrary so they can represent other quantities as well, and
FeatureVectors may also be combined to represent the features of
multiple documents.
 
=head1 METHODS

lib/AI/Categorizer/Hypothesis.pm  view on Meta::CPAN

=head1 METHODS
 
=over 4
 
=item new(%parameters)
 
Returns a new Hypothesis object.  Generally a user of
C<AI::Categorizer> doesn't create a Hypothesis object directly - they
are returned by the Learner's C<categorize()> method.  However, if you
wish to create a Hypothesis directly (maybe passing it some fake data
for testing purposes) you may do so using the C<new()> method.
 
The following parameters are accepted when creating a new Hypothesis:
 
=over 4
 
=item all_categories
 
A required parameter which gives the set of all categories that could
possibly be assigned to.  The categories should be specified as a

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

sub load {
  my ($self, %args) = @_;
  my $c = $self->_make_collection(\%args);
 
  if ($self->{features_kept}) {
    # Read the whole thing in, then reduce
    $self->read( collection => $c );
    $self->select_features;
 
  } elsif ($self->{scan_first}) {
    # Figure out the feature set first, then read data in
    $self->scan_features( collection => $c );
    $c->rewind;
    $self->read( collection => $c );
 
  } else {
    # Don't do any feature reduction, just read the data
    $self->read( collection => $c );
  }
}
 
sub read {
  my ($self, %args) = @_;
  my $collection = $self->_make_collection(\%args);
  my $pb = $self->prog_bar($collection);
   
  while (my $doc = $collection->next) {

lib/AI/Categorizer/KnowledgeSet.pm  view on Meta::CPAN

This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally). 
 
=item add_document()
 
Given a Document object as an argument, this method will add the
document, along with any categories it belongs to, to the KnowledgeSet.
 
=item make_document()
 
This method will create a Document object with the given data and then
call C<add_document()> to add it to the KnowledgeSet.  A C<categories>
parameter should specify an array reference containing a list of
categories I<by name>.  These are the categories that the document
belongs to.  Any other parameters will be passed to the Document
class's C<new()> method.
 
=item finish()
 
This method will be called prior to training the Learner.  Its purpose
is to perform any operations (such as feature vector weighting) that

lib/AI/Categorizer/Learner/SVM.pm  view on Meta::CPAN

sub create_model {
  my $self = shift;
  my $f = $self->knowledge_set->features->as_hash;
  my $rmap = [ keys %$f ];
  $self->{model}{feature_map} = { map { $rmap->[$_], $_ } 0..$#$rmap };
  $self->{model}{feature_map_reverse} = $rmap;
  $self->SUPER::create_model(@_);
}
 
sub _doc_2_dataset {
  my ($self, $doc, $label, $fm) = @_;
 
  my $ds = new Algorithm::SVM::DataSet(Label => $label);
  my $f = $doc->features->as_hash;
  while (my ($k, $v) = each %$f) {
    next unless exists $fm->{$k};
    $ds->attribute( $fm->{$k}, $v );
  }
  return $ds;
}
 
sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;
  my $svm = new Algorithm::SVM(Kernel => $self->{svm_kernel});
   
  my (@pos, @neg);
  foreach my $doc (@$positives) {
    push @pos, $self->_doc_2_dataset($doc, 1, $self->{model}{feature_map});
  }
  foreach my $doc (@$negatives) {
    push @neg, $self->_doc_2_dataset($doc, 0, $self->{model}{feature_map});
  }
 
  $svm->train(@pos, @neg);
  return $svm;
}
 
sub get_scores {
  my ($self, $doc) = @_;
  local $self->{current_doc} = $self->_doc_2_dataset($doc, -1, $self->{model}{feature_map});
  return $self->SUPER::get_scores($doc);
}
 
sub get_boolean_score {
  my ($self, $doc, $svm) = @_;
  return $svm->predict($self->{current_doc});
}
 
sub save_state {
  my ($self, $path) = @_;

lib/AI/Categorizer/Learner/Weka.pm  view on Meta::CPAN

  $nb = AI::Categorizer::Learner->restore_state('filename');
  my $c = new AI::Categorizer::Collection::Files( path => ... );
  while (my $document = $c->next) {
    my $hypothesis = $nb->categorize($document);
    print "Best assigned category: ", $hypothesis->best_category, "\n";
  }
 
=head1 DESCRIPTION
 
This class doesn't implement any machine learners of its own; it
merely passes the data through to the Weka machine learning system
(http://www.cs.waikato.ac.nz/~ml/weka/).  This can give you access to
a collection of machine learning algorithms not otherwise implemented
in C<AI::Categorizer>.
 
Currently this is a simple command-line wrapper that calls C<java>
subprocesses.  In the future this may be converted to an
C<Inline::Java> wrapper for better performance (faster running
times).  However, if you're looking for really great performance,
you're probably looking in the wrong place - this Weka wrapper is
intended more as a way to try lots of different machine learning

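Because the wrapper works by spawning C<java> subprocesses, the core mechanism is simply assembling a command line and handing it to C<system()>. A sketch under assumed paths, class name, and flags (none of these are the module's real defaults):

```perl
use strict;
use warnings;

# Illustrative only: the jar path, classifier class, and file names
# are assumptions, not the module's actual configuration.
my @cmd = ('java', '-cp', '/path/to/weka.jar',
           'weka.classifiers.bayes.NaiveBayes',
           '-t', 'train.arff', '-T', 'test.arff');

# A real wrapper would call system(@cmd) and check the wait status;
# here we only show the command that would be run.
print "would run: @cmd\n";
```

The per-document cost of starting a JVM is exactly why the documentation above warns against expecting great performance from this approach.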
lib/AI/Categorizer/Storable.pm

=head1 SYNOPSIS
 
  $object->save_state($path);
  ... time passes ...
  $object = Class->restore_state($path);
   
=head1 DESCRIPTION
 
This class implements methods for storing the state of an object to a
file and restoring it later.  In C<AI::Categorizer> it is generally
used to let data persist across multiple invocations of a program.
 
=head1 METHODS
 
=over 4
 
=item save_state($path)
 
This object method saves the object to disk for later use.  The
C<$path> argument indicates the place on disk where the object should

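The save/restore pattern can be sketched with the core C<Storable> module. Note that C<AI::Categorizer::Storable> has its own implementation, so the on-disk format below is only an analogue, not what the module actually writes:

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = "$dir/state";

# save_state() analogue: serialize the object to disk
my $object = { features => { vampires => 2 }, trained => 1 };
nstore($object, $path);

# restore_state() analogue: rebuild the object, possibly in a
# later invocation of the program
my $restored = retrieve($path);
die "state lost" unless $restored->{features}{vampires} == 2;
print "restored OK\n";
```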
t/01-naive_bayes.t

perform_standard_tests(learner_class => 'AI::Categorizer::Learner::NaiveBayes');
 
#use Carp; $SIG{__DIE__} = \&Carp::confess;
 
my %docs = training_docs();
 
{
  ok my $c = new AI::Categorizer(collection_weighting => 'f');
   
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
   
  $c->knowledge_set->finish;
 
  # Make sure collection_weighting is working
  ok $c->knowledge_set->document_frequency('vampires'), 2;
  for ('vampires', 'mirrors') {
    ok ($c->knowledge_set->document('doc4')->features->as_hash->{$_},
        log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
       );

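The expected values in the C<collection_weighting> test above come from inverse-document-frequency weighting: weight(term) = log(N / df(term)), where N is the number of documents and df(term) is how many contain the term. A self-contained sketch with a made-up corpus:

```perl
use strict;
use warnings;

# Four tiny documents (made up for illustration)
my %corpus = (
    doc1 => [qw(farming sheep)],
    doc2 => [qw(farming cows)],
    doc3 => [qw(vampires blood)],
    doc4 => [qw(vampires mirrors)],
);

# Document frequency: count each term once per document
my %df;
for my $terms (values %corpus) {
    $df{$_}++ for do { my %seen; grep { !$seen{$_}++ } @$terms };
}

my $n = keys %corpus;                # 4 documents
my $w = log($n / $df{vampires});     # log(4/2) = log(2)
printf "idf(vampires) = %.4f\n", $w;
```

This mirrors the assertion above: 'vampires' appears in two of the documents, so its weight is log(N/2).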
t/01-naive_bayes.t

    
  my $doc = new AI::Categorizer::Document
    ( name => 'test1',
      content => 'I would like to begin farming sheep.' );
  ok $c->learner->categorize($doc)->best_category, 'farming';
}
 
{
  ok my $c = new AI::Categorizer(term_weighting => 'b');
   
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
   
  $c->knowledge_set->finish;
   
  # Make sure term_weighting is working
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
}
 
{
  ok my $c = new AI::Categorizer(term_weighting => 'n');
   
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
   
  $c->knowledge_set->finish;
   
  # Make sure term_weighting is working
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{blood}, 0.75;
  ok $c->knowledge_set->document('doc4')->features->as_hash->{mirrors}, 1;
}
 
{
  ok my $c = new AI::Categorizer(tfidf_weighting => 'txx');
   
  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
   
  $c->knowledge_set->finish;
   
  # Make sure tfidf_weighting is working
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 2;
}
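The two C<term_weighting> schemes exercised above can be sketched directly: 'b' (boolean, 1 if the term occurs at all) and 'n' (normalized, 0.5 + 0.5 * tf / max_tf). That the module uses exactly this "augmented" normalization is an inference from the expected values in the test (e.g. blood => 0.75 when its count is half the maximum):

```perl
use strict;
use warnings;
use List::Util qw(max);

# Made-up term counts for one document
my %tf = (vampires => 2, blood => 1, bites => 1);

# 'b': boolean weighting -- presence only
my %boolean = map { $_ => ($tf{$_} ? 1 : 0) } keys %tf;

# 'n': normalized (augmented) weighting
my $max = max(values %tf);
my %normalized = map { $_ => 0.5 + 0.5 * $tf{$_} / $max } keys %tf;

printf "b: vampires=%d  n: blood=%.2f\n",
       $boolean{vampires}, $normalized{blood};
```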

t/14-collection.t

BEGIN { plan tests => 13 };
 
require File::Spec->catfile('t', 'common.pl');
 
ok 1;  # Loaded
 
# Test InMemory collection
my $c = AI::Categorizer::Collection::InMemory->new(data => {training_docs()});
ok $c;
exercise_collection($c, 4);
 
# Test Files collection
$c = AI::Categorizer::Collection::Files->new(path => File::Spec->catdir('t', 'traindocs'),
                                             category_hash => {
                                                               doc1 => ['farming'],
                                                               doc2 => ['farming'],
                                                               doc3 => ['vampire'],

t/common.pl

                            (
                             name => 'Vampires/Farmers',
                             stopwords => [qw(are be in of and)],
                            ),
                            verbose => $ENV{TEST_VERBOSE} ? 1 : 0,
                            %params,
                           );
ok ref($c), 'AI::Categorizer', "Create an AI::Categorizer object";
 
my %docs = training_docs();
while (my ($name, $data) = each %docs) {
  $c->knowledge_set->make_document(name => $name, %$data);
}
 
my $l = $c->learner;
ok $l;
 
if ($params{learner_class}) {
  ok ref($l), $params{learner_class}, "Make sure the correct Learner class is instantiated";
} else {
  ok 1, 1, "Dummy test";
}

t/common.pl

    
  run_test_docs($l);
 
  # Make sure we can save state & restore state
  $l->save_state('t/state');
  $l = $l->restore_state('t/state');
  ok $l;
 
  run_test_docs($l);
 
  my $train_collection = AI::Categorizer::Collection::InMemory->new(data => $docs);
  ok $train_collection;
   
  my $h = $l->categorize_collection(collection => $train_collection);
  ok $h->micro_precision > 0.5;
}
 
sub num_setup_tests    () { 3 }
sub num_standard_tests () { num_setup_tests + 17 }
 
1;
