AI-Categorizer
Changes
features to use, and the scan_first parameter was left as its
default value, the features_kept mechanism would silently fail to
do anything. This has now been fixed. [Spotted by Arnaud Gaudinat]
- Recent versions of Weka have changed the name of the SVM class, so
I've updated it in our test (t/03-weka.t) of the Weka wrapper
too. [Sebastien Aperghis-Tramoni]
0.07 Tue May 6 16:15:04 CDT 2003
- Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
MANIFEST, so they weren't included in the 0.06 distribution.
[Spotted by Zoltan Barta]
0.06 Tue Apr 22 10:27:26 CDT 2003
- Added a relatively simple example script at the request of several
people, at eg/demo.pl
- Forgot to note a dependency on Algorithm::NaiveBayes in version
0.05. Fixed.
Changes
parameter.
- Added a k-Nearest-Neighbor machine learner. [First revision
implemented by David Bell]
- Added a Rocchio machine learner. [Partially implemented by Xiaobo
Li]
- Added a "Guesser" machine learner which simply uses overall class
probabilities to make categorization decisions. Sometimes useful
for providing a set of baseline scores against which to evaluate
other machine learners.
- The NaiveBayes learner is now a wrapper around my new
Algorithm::NaiveBayes module, which is just the old NaiveBayes code
from here, turned into its own standalone module.
- Much more extensive regression testing of the code.
- Added a Document subclass for XML documents. [Implemented by
Jae-Moon Lee] Its interface is still unstable; it may change in
MANIFEST
t/04-decision_tree.t
t/05-svm.t
t/06-knn.t
t/07-guesser.t
t/09-rocchio.t
t/10-tools.t
t/11-feature_vector.t
t/12-hypothesis.t
t/13-document.t
t/14-collection.t
t/15-knowledge_set.t
t/common.pl
t/traindocs/doc1
t/traindocs/doc2
t/traindocs/doc3
t/traindocs/doc4
META.yml
README
NAME
AI::Categorizer - Automatic Text Categorization
SYNOPSIS
my $c = new AI::Categorizer(...parameters...);
$c->run_experiment;
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
my $l = $c->learner;
while (...) {
  my $d = ...create a document...
  my $hypothesis = $l->categorize($d);
  print "Assigned categories: ", join(', ', $hypothesis->categories), "\n";
  print "Best category: ", $hypothesis->best_category, "\n";
}
DESCRIPTION
"AI::Categorizer" is a framework for automatic text categorization. It
consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can choose what
categorization algorithm to use, what features (words or otherwise) of the
documents should be used (or how to automatically choose these features),
what format the documents are in, and so on.
The basic process of using this module will typically involve obtaining a
collection of pre-categorized documents, creating a "knowledge set"
representation of those documents, training a categorizer on that knowledge
set, and saving the trained categorizer for later use. There are several
ways to carry out this process. The top-level "AI::Categorizer" module
provides an umbrella class for high-level operations, or you may use the interfaces of the individual classes in the framework.
A simple sample script that reads a training corpus, trains a categorizer,
and tests the categorizer on a test corpus, is distributed as eg/demo.pl.
Disclaimer: the results of any of the machine learning algorithms are far
from infallible (close to fallible?). Categorization of documents is often a
difficult task even for humans well-trained in the particular domain of
README
framework. We give a conceptual overview, but don't get into any of the
details about interfaces or usage. See the documentation for the individual
classes for more details.
A diagram of the various classes in the framework can be seen in
"doc/classes-overview.png" , and a more detailed view of the same thing can
be seen in "doc/classes.png" .
Knowledge Sets
A "knowledge set" is defined as a collection of documents, together with
some information on the categories each document belongs to. Note that this
term is specific to this project - other sources may call it a
"training corpus" or "prior knowledge". A knowledge set also contains some
information on how documents will be parsed and how their features (words)
will be extracted and turned into meaningful representations. In this sense,
a knowledge set represents not only a collection of data, but a particular
view on that data.
A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
class. Before you can start playing with categorizers, you will have to
start playing with knowledge sets, so that the categorizers have some data
to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
module for information on its interface.
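As a concrete starting point, the same pattern used by eg/demo.pl (excerpted further down this page) builds a Collection and hands it to a KnowledgeSet. The corpus paths here are hypothetical, and the category_file parameter is an assumption based on the data_root description elsewhere in this README:

```perl
use AI::Categorizer::KnowledgeSet;
use AI::Categorizer::Collection::Files;

# 'corpus/training' is a hypothetical directory of pre-categorized
# documents; 'corpus/cats.txt' maps document names to category names.
my $training = AI::Categorizer::Collection::Files->new(
  path          => 'corpus/training',
  category_file => 'corpus/cats.txt',
);

my $ks = AI::Categorizer::KnowledgeSet->new(verbose => 1);
$ks->load(collection => $training);   # same call as in eg/demo.pl
```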
Feature selection
Deciding which features are the most important is a very large part of the
categorization task - you cannot simply consider all the words in all the
documents when training, and all the words in the document being
categorized. There are two main reasons for this - first, it would mean that
your training and categorizing processes would take forever and use tons
of memory, and second, the significant stuff of the documents would get
lost in the "noise" of the insignificant stuff.
The process of selecting the most important features in the training set is
called "feature selection" . It is managed by the
"AI::Categorizer::KnowledgeSet" class, and you will find the details of
feature selection processes in that class's documentation.
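The knob most users touch is the features_kept parameter mentioned in the Changes file above. A hedged sketch (the corpus layout and the particular value are illustrative, not prescribed by this README):

```perl
use AI::Categorizer;

# features_kept limits how many features survive selection; see the
# AI::Categorizer::KnowledgeSet documentation for the accepted values.
my $c = AI::Categorizer->new(
  data_root     => 'corpus',   # hypothetical corpus directory
  features_kept => 0.2,        # illustrative value only
);
```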
Collections
Because documents may be stored in lots of different formats, a "collection"
class has been created as an abstraction of a stored set of documents,
together with a way to iterate through the set and return Document objects.
A knowledge set contains a single collection object. A "Categorizer" doing a
complete test run generally contains two collections, one for training and
one for testing. A "Learner" can mass-categorize a collection.
The "AI::Categorizer::Collection" class and its subclasses instantiate the
idea of a collection in this sense.
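A sketch of how a Collection is typically constructed and consumed. The next() iterator is an assumption about the Collection interface (it is not shown in the excerpts on this page); rewind() does appear in the KnowledgeSet code:

```perl
use AI::Categorizer::Collection::Files;

my $coll = AI::Categorizer::Collection::Files->new(
  path => 'corpus/test',   # hypothetical directory of documents
);

# Assumed iterator: returns AI::Categorizer::Document objects until
# the set is exhausted.
while (my $doc = $coll->next) {
  print $doc->name, "\n";
}
$coll->rewind;   # rewind() is called the same way in KnowledgeSet::load()
```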
Documents
Each document is represented by an "AI::Categorizer::Document" object, or an
object of one of its subclasses. Each document class contains methods for
README
AI::Categorizer::Learner::Weka
An interface to version 2 of the Weka Knowledge Analysis system that
lets you use any of the machine learners it defines. This gives you
access to lots and lots of machine learning algorithms in use by
machine learning researchers. The main drawback is that Weka tends to
be quite slow and use a lot of memory, and the current interface
between Weka and "AI::Categorizer" is a bit clumsy.
Other machine learning methods that may be implemented soonish include
Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts combiner
for ensemble learning. No timetable for their creation has yet been set.
Please see the documentation of these individual modules for more details on
their guts and quirks. See the "AI::Categorizer::Learner" documentation for
a description of the general categorizer interface.
If you wish to create your own classifier, you should inherit from
"AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean" , which are
abstract classes that manage some of the work for you.
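A skeletal subclass might look like the following sketch, modeled on the create_boolean_model() call visible in the Learner::Boolean code excerpted later on this page. The trivial "model" returned here is purely illustrative; see the Learner::Boolean documentation for the full set of methods a subclass must provide:

```perl
package My::Learner::PriorOnly;
use strict;
use warnings;
use base 'AI::Categorizer::Learner::Boolean';

# Learner::Boolean invokes create_boolean_model() once per category,
# passing arrayrefs of positive and negative example documents.
sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;
  my $total = @$positives + @$negatives;
  # Record the positive-class prior as a trivial stand-in for a model.
  return { prior => $total ? @$positives / $total : 0 };
}

1;
```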
Feature Vectors
README
internally (the KnowledgeSet, Learner, Experiment, or Collection
classes), or any class that *they* create. This is managed by the
"Class::Container" module, so see its documentation for the details of
how this works.
The specific parameters accepted here are:
progress_file
A string that indicates a place where objects will be saved during
several of the methods of this class. The default value is the
string "save" , which means files like "save-01-knowledge_set" will
get created. The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.
verbose
If true, a few status messages will be printed during execution.
training_set
Specifies the "path" parameter that will be fed to the
KnowledgeSet's "scan_features()" and "read()" methods during our
"scan_features()" and "read_training_set()" methods.
test_set
Specifies the "path" parameter that will be used when creating a
Collection during the "evaluate_test_set()" method.
data_root
A shortcut for setting the "training_set", "test_set", and
"category_file" parameters separately. Sets "training_set" to
"$data_root/training", "test_set" to "$data_root/test", and
"category_file" (used by some of the Collection classes) to
"$data_root/cats.txt".
learner()
Returns the Learner object associated with this Categorizer. Before
"train()" , the Learner will of course not be trained yet.
knowledge_set()
Returns the KnowledgeSet object associated with this Categorizer. If
"read_training_set()" has not yet been called, the KnowledgeSet will not
yet be populated with any training data.
run_experiment()
Runs a complete experiment on the training and testing data, reporting
the results on "STDOUT". Internally, this is just a shortcut for calling
the "scan_features()", "read_training_set()", "train()", and
"evaluate_test_set()" methods, then printing the value of the
"stats_table()" method.
scan_features()
Scans the Collection specified in the "training_set" parameter to
determine the set of features (words) that will be considered when
training the Learner. Internally, this calls the "scan_features()"
method of the KnowledgeSet, then saves a list of the KnowledgeSet's
features for later use.
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
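The individual stages can be called in sequence to get this benefit (this is the same order run_experiment() uses internally):

```perl
use AI::Categorizer;

my $c = AI::Categorizer->new(data_root => 'corpus');  # hypothetical corpus
$c->scan_features;       # light pass: fix the feature set first
$c->read_training_set;   # only the kept features are now retained
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
```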
read_training_set()
Populates the KnowledgeSet with the data specified in the
"training_set" parameter. Internally, this calls the "read()" method
of the KnowledgeSet. Returns the KnowledgeSet. Also saves the
KnowledgeSet object for later use.
train()
Calls the Learner's "train()" method, passing it the KnowledgeSet
created during "read_training_set()" . Returns the Learner object. Also
saves the Learner object for later use.
evaluate_test_set()
Creates a Collection based on the value of the "test_set" parameter, and
calls the Learner's "categorize_collection()" method using this
Collection. Returns the resultant Experiment object. Also saves the
Experiment object for later use in the "stats_table()" method.
stats_table()
Returns the value of the Experiment's (as created by
"evaluate_test_set()") "stats_table()" method. This is a string that
shows various statistics about the accuracy/precision/recall/F1/etc. of
the assignments made during testing.
HISTORY
This module is a revised and redesigned version of the previous
"AI::Categorize" module by the same author. Note the added 'r' in the new
name. The older module has a different interface, and no attempt at backward
compatibility has been made - that's why I changed the name.
You can have both "AI::Categorize" and "AI::Categorizer" installed at the
eg/categorizer
if ($HAVE_YAML) {
  print {$out_fh} YAML::Dump($c->dump_parameters);
} else {
  warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
}
}
}
run_section('scan_features',     1, $do_stage);
run_section('read_training_set', 2, $do_stage);
run_section('train',             3, $do_stage);
run_section('evaluate_test_set', 4, $do_stage);
if ($do_stage->{5}) {
  my $result = $c->stats_table;
  print $result if $c->verbose;
  print $out_fh $result if $out_fh;
}
sub run_section {
  my ($section, $stage, $do_stage) = @_;
  return unless $do_stage->{$stage};
  if (keys %$do_stage > 1) {
eg/demo.pl
die("Usage: $0 <corpus>\n" .
    "  A sample corpus (data set) can be downloaded from\n" .
  unless @ARGV == 1;
my $corpus = shift;
my $training  = File::Spec->catfile($corpus, 'training');
my $test      = File::Spec->catfile($corpus, 'test');
my $cats      = File::Spec->catfile($corpus, 'cats.txt');
my $stopwords = File::Spec->catfile($corpus, 'stopwords');
eg/demo.pl
$training = AI::Categorizer::Collection::Files->new(path => $training, %params);
$test     = AI::Categorizer::Collection::Files->new(path => $test, %params);
print "Loading training set\n";
my $k = AI::Categorizer::KnowledgeSet->new(verbose => 1);
$k->load(collection => $training);
print "Training categorizer\n";
my $l = AI::Categorizer::Learner::NaiveBayes->new(verbose => 1);
$l->train(knowledge_set => $k);
print "Categorizing test set\n";
my $experiment = $l->categorize_collection(collection => $test);
print $experiment->stats_table;
my $doc = AI::Categorizer::Document->new
  (content => "Hello, I am a pretty generic document with not much to say.");
eg/easy_guesser.pl
my %cats;
print "Reading category file\n";
open my ($fh), $cats or die "Can't read $cats: $!";
while (<$fh>) {
  my ($doc, @cats) = split;
  $cats{$doc} = \@cats;
}
my (%freq, $docs);
print "Scanning training set\n";
opendir my ($dh), $training or die "Can't opendir $training: $!";
while (defined(my $file = readdir $dh)) {
  next if $file eq '.' or $file eq '..';
  unless ($cats{$file}) {
    warn "No category information for '$file'";
    next;
  }
  $docs++;
  $freq{$_}++ foreach @{ $cats{$file} };
}
lib/AI/Categorizer.pm
__PACKAGE__->valid_params
(
 progress_file => { type => SCALAR, default => 'save' },
 knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
 learner       => { isa => 'AI::Categorizer::Learner' },
 verbose       => { type => BOOLEAN, default => 0 },
 training_set  => { type => SCALAR, optional => 1 },
 test_set      => { type => SCALAR, optional => 1 },
 data_root     => { type => SCALAR, optional => 1 },
);
__PACKAGE__->contained_objects
(
 knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
 learner       => { class => 'AI::Categorizer::Learner::NaiveBayes' },
 experiment    => { class => 'AI::Categorizer::Experiment',
                    delayed => 1 },
 collection    => { class => 'AI::Categorizer::Collection::Files',
                    delayed => 1 },
);
sub new {
  my $package = shift;
  my %args = @_;
  my %defaults;
  if (exists $args{data_root}) {
    $defaults{training_set}  = File::Spec->catfile($args{data_root}, 'training');
    $defaults{test_set}      = File::Spec->catfile($args{data_root}, 'test');
    $defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
    delete $args{data_root};
  }
  return $package->SUPER::new(%defaults, %args);
}
sub knowledge_set { shift->{knowledge_set} }
sub learner       { shift->{learner} }
sub run_experiment {
  my $self = shift;
  $self->scan_features;
  $self->read_training_set;
  $self->train;
  $self->evaluate_test_set;
  print $self->stats_table;
}
sub scan_features {
  my $self = shift;
  return unless $self->knowledge_set->scan_first;
  $self->knowledge_set->scan_features( path => $self->{training_set} );
  $self->knowledge_set->save_features( "$self->{progress_file}-01-features" );
}
sub read_training_set {
  my $self = shift;
  $self->knowledge_set->restore_features( "$self->{progress_file}-01-features" )
    if -e "$self->{progress_file}-01-features";
  $self->knowledge_set->read( path => $self->{training_set} );
  $self->_save_progress( '02', 'knowledge_set' );
  return $self->knowledge_set;
}
sub train {
  my $self = shift;
  $self->_load_progress( '02', 'knowledge_set' );
  $self->learner->train( knowledge_set => $self->{knowledge_set} );
  $self->_save_progress( '03', 'learner' );
  return $self->learner;
}
sub evaluate_test_set {
  my $self = shift;
  $self->_load_progress( '03', 'learner' );
  my $c = $self->create_delayed_object( 'collection', path => $self->{test_set} );
  $self->{experiment} = $self->learner->categorize_collection( collection => $c );
  $self->_save_progress( '04', 'experiment' );
  return $self->{experiment};
}
sub stats_table {
  my $self = shift;
  $self->_load_progress( '04', 'experiment' );
  return $self->{experiment}->stats_table;
}
lib/AI/Categorizer.pm
object framework. We give a conceptual overview, but don't get into
any of the details about interfaces or usage. See the documentation
for the individual classes for more details.
A diagram of the various classes in the framework can be seen in
C<doc/classes-overview.png>, and a more detailed view of the same
thing can be seen in C<doc/classes.png>.
lib/AI/Categorizer.pm
access to lots and lots of machine learning algorithms in use by
machine learning researchers. The main drawback is that Weka tends to
be quite slow and use a lot of memory, and the current interface
between Weka and C<AI::Categorizer> is a bit clumsy.
lib/AI/Categorizer.pm
works.
The specific parameters accepted here are:
lib/AI/Categorizer/Collection.pm
The default is C<AI::Categorizer::Document::Text>.
lib/AI/Categorizer/Document/XML.pm
sub end_element {
  my ($self, $el) = @_;
  $self->{levelPointer}--;
  my $location = $self->{locationArray}[ $self->{levelPointer} ];
  my $elementName = $el->{Name};
  my $weight = 1;
  $weight = $self->{weightHash}{$elementName} if exists $self->{weightHash}{$elementName};
  if ($weight == 0) {
    $self->{content} = substr($self->{content}, 0, $location);
    return;
  }
lib/AI/Categorizer/Experiment.pm
C<Statistics::Contingency> for a description of its interface. All of
its methods are available here, with the following additions:
lib/AI/Categorizer/FeatureSelector.pm
  return $result;
}
sub rank_features;
sub scan_features;
sub select_features {
  my ($self, %args) = @_;
  die "No knowledge_set parameter provided to select_features()"
    unless $args{knowledge_set};
  my $f = $self->rank_features(knowledge_set => $args{knowledge_set});
  return $self->reduce_features($f, features_kept => $args{features_kept});
}
1;
lib/AI/Categorizer/FeatureSelector.pm
If a C<load> parameter is present, the C<load()> method will be
invoked immediately. If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>. If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.
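A short sketch of the two forms described above (paths hypothetical; shown with KnowledgeSet, whose constructor handles the C<load> parameter in the code excerpted on this page):

```perl
use AI::Categorizer::KnowledgeSet;

# String form: passed to load() as its 'path' parameter.
my $ks1 = AI::Categorizer::KnowledgeSet->new(load => 'corpus/training');

# Hashref form: becomes load()'s complete parameter list.
my $ks2 = AI::Categorizer::KnowledgeSet->new(
  load => { path => 'corpus/training' },
);
```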
lib/AI/Categorizer/FeatureSelector.pm
No change - multiply by 1.
lib/AI/Categorizer/FeatureSelector.pm
Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.
lib/AI/Categorizer/FeatureSelector.pm
This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
(
 features => { class => 'AI::Categorizer::FeatureVector',
               delayed => 1 },
);
1;
sub reduction_function;
sub scan_features {
  my ($self, %args) = @_;
  my $c = $args{collection} or
    die "No 'collection' parameter provided to scan_features()";
  if (!$self->{features_kept}) { return; }
  my %cat_features;
  my $coll_features = $self->create_delayed_object('features');
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
    $r_features->{features}{$term} = $self->reduction_function($term,
      $nbDocuments, $allFeaturesSum, $coll_features,
      \%cat_features, \%cat_features_sum);
  }
  print STDERR "\n" if $self->verbose;
  my $new_features = $self->reduce_features($r_features);
  return $coll_features->intersection($new_features);
}
sub rank_features {
  die "CategorySelector->rank_features is not implemented yet!";
}
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
AI::Categorizer::CategorySelector - Abstract Category Selection class
lib/AI/Categorizer/FeatureSelector/DocFrequency.pm
__PACKAGE__->contained_objects
(
 features => { class => 'AI::Categorizer::FeatureVector',
               delayed => 1 },
);
sub rank_features {
  my ($self, %args) = @_;
  my $k = $args{knowledge_set} or die "No knowledge_set parameter provided to rank_features()";
  my %freq_counts;
  foreach my $name ($k->features->names) {
    $freq_counts{$name} = $k->document_frequency($name);
  }
  return $self->create_delayed_object('features', features => \%freq_counts);
}
sub scan_features {
  my ($self, %args) = @_;
lib/AI/Categorizer/FeatureVector.pm
my ($package, %args) = @_;
$args{features} ||= {};
return bless { features => $args{features} }, $package;
}
sub names {
  my $self = shift;
  return keys %{ $self->{features} };
}
sub set {
  my $self = shift;
  $self->{features} = (ref $_[0] ? $_[0] : { @_ });
}
sub as_hash {
  my $self = shift;
  return $self->{features};
}
sub euclidean_length {
lib/AI/Categorizer/FeatureVector.pm
$f3 = $f1->add($f2);
$h = $f1->as_hash;
$h = $f1->as_boolean_hash;
$f1->normalize;
lib/AI/Categorizer/Hypothesis.pm
my $self = shift;
return @{ $self->{scores} }{ @_ };
}
1;
lib/AI/Categorizer/Hypothesis.pm
are returned by the Learner's C<categorize()> method. However, if you
wish to create a Hypothesis directly (maybe passing it some fake data
for testing purposes) you may do so using the C<new()> method.
The following parameters are accepted when creating a new Hypothesis:
lib/AI/Categorizer/KnowledgeSet.pm
my ($pkg, %args) = @_;
if ($args{tfidf_weighting}) {
  @args{ 'term_weighting', 'collection_weighting', 'normalize_weighting' } =
    split '', $args{tfidf_weighting};
  delete $args{tfidf_weighting};
}
my $self = $pkg->SUPER::new(%args);
$self->{categories} = new AI::Categorizer::ObjectSet( @{ $self->{categories} } );
$self->{documents}  = new AI::Categorizer::ObjectSet( @{ $self->{documents} } );
if ($self->{load}) {
  my $args = ref($self->{load}) ? $self->{load} : { path => $self->{load} };
  $self->load(%$args);
  delete $self->{load};
}
return $self;
}
sub features {
  my $self = shift;
  if (@_) {
    $self->{features} = shift;
    $self->trim_doc_features if $self->{features};
  }
  return $self->{features} if $self->{features};
  my $v = $self->create_delayed_object('features');
  foreach my $document ($self->documents) {
    $v->add($document->features);
  }
  return $self->{features} = $v;
}
sub categories {
  my $c = $_[0]->{categories};
  return wantarray ? $c->members : $c->size;
lib/AI/Categorizer/KnowledgeSet.pm
sub load {
  my ($self, %args) = @_;
  my $c = $self->_make_collection(\%args);
  if ($self->{features_kept}) {
    $self->read(collection => $c);
    $self->select_features;
  } elsif ($self->{scan_first}) {
    $self->scan_features(collection => $c);
    $c->rewind;
    $self->read(collection => $c);
  } else {
    $self->read(collection => $c);
  }
}
lib/AI/Categorizer/KnowledgeSet.pm
my $ranked_features = $self->{feature_selector}->scan_features(collection => $c, prog_bar => $pb);
$self->delayed_object_params('document',   use_features => $ranked_features);
$self->delayed_object_params('collection', use_features => $ranked_features);
return $ranked_features;
}
sub select_features {
  my $self = shift;
  my $f = $self->feature_selector->select_features(knowledge_set => $self);
  $self->features($f);
}
sub partition {
  my ($self, @sizes) = @_;
  my $num_docs = my @docs = $self->documents;
  my @groups;
  while (@sizes > 1) {
    my $size = int($num_docs * shift @sizes);
lib/AI/Categorizer/KnowledgeSet.pm
$self->delayed_object_params('document',   use_features => $features);
$self->delayed_object_params('collection', use_features => $features);
}
1;
lib/AI/Categorizer/KnowledgeSet.pm
If a C<load> parameter is present, the C<load()> method will be
invoked immediately. If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>. If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.
lib/AI/Categorizer/KnowledgeSet.pm
No change - multiply by 1.
lib/AI/Categorizer/KnowledgeSet.pm
Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.
lib/AI/Categorizer/KnowledgeSet.pm
This method will be called during C<finish()> to adjust the weights of
the features according to the C<tfidf_weighting> parameter.
lib/AI/Categorizer/Learner.pm
__PACKAGE__->valid_params
(
 knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet', optional => 1 },
 verbose       => { type => SCALAR, default => 0 },
);
__PACKAGE__->contained_objects
(
 hypothesis => {
   class   => 'AI::Categorizer::Hypothesis',
   delayed => 1,
 },
 experiment => {
lib/AI/Categorizer/Learner.pm
sub add_knowledge;
sub verbose {
  my $self = shift;
  if (@_) {
    $self->{verbose} = shift;
  }
  return $self->{verbose};
}
sub knowledge_set {
  my $self = shift;
  if (@_) {
    $self->{knowledge_set} = shift;
  }
  return $self->{knowledge_set};
}
sub categories {
  my $self = shift;
  return $self->knowledge_set->categories;
}
sub train {
  my ($self, %args) = @_;
  $self->{knowledge_set} = $args{knowledge_set} if $args{knowledge_set};
  die "No knowledge_set provided" unless $self->{knowledge_set};
  $self->{knowledge_set}->finish;
  $self->create_model;
  $self->delayed_object_params('hypothesis',
    all_categories => [ map $_->name, $self->categories ],
  );
}
sub prog_bar {
  my ($self, $count) = @_;
  return sub { print STDERR '.' } unless eval "use Time::Progress; 1";
lib/AI/Categorizer/Learner.pm
AI::Categorizer::Learner - Abstract Machine Learner Class
lib/AI/Categorizer/Learner/Boolean.pm
  (
   max_instances => { type => SCALAR, default => 0 },
   threshold     => { type => SCALAR, default => 0.5 },
  );

sub create_model {
  my $self = shift;
  my $m  = $self->{model} ||= {};
  my $mi = $self->{max_instances};

  foreach my $cat ($self->knowledge_set->categories) {
    my (@p, @n);
    foreach my $doc ($self->knowledge_set->documents) {
      if ($doc->is_in_category($cat)) {
        push @p, $doc;
      } else {
        push @n, $doc;
      }
    }

    if ($mi and @p + @n > $mi) {
      my $ratio = $mi / (@p + @n);
      @p = random_elements(\@p, @p * $ratio);
      @n = random_elements(\@n, @n * $ratio);
      warn "Limiting to " . @p . " positives and " . @n . " negatives\n" if $self->verbose;
    }

    warn "Creating model for ", $cat->name, "\n" if $self->verbose;
    $m->{learners}{ $cat->name } = $self->create_boolean_model(\@p, \@n, $cat);
lib/AI/Categorizer/Learner/DecisionTree.pm
  my %results;
  for ($positives, $negatives) {
    foreach my $doc (@$_) {
      $results{ $doc->name } = $_ eq $positives ? 1 : 0;
    }
  }

  if ($self->{model}{first_tree}) {
    $t->copy_instances( from => $self->{model}{first_tree} );
    $t->set_results(\%results);
  } else {
    for ($positives, $negatives) {
      foreach my $doc (@$_) {
        $t->add_instance( attributes => $doc->features->as_boolean_hash,
                          result     => $results{ $doc->name },
                          name       => $doc->name,
                        );
      }
    }
lib/AI/Categorizer/Learner/DecisionTree.pm
AI::Categorizer::Learner::DecisionTree - Decision Tree Learner
lib/AI/Categorizer/Learner/Guesser.pm
sub create_model {
  my $self = shift;
  my $k = $self->knowledge_set;
  my $num_docs = $k->documents;

  foreach my $cat ($k->categories) {
    next unless $cat->documents;
    $self->{model}{ $cat->name } = $cat->documents / $num_docs;
  }
}

sub get_scores {
  my ($self, $newdoc) = @_;
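The Guesser's "model" is nothing more than the prior probability of each category: documents in the category divided by total documents, as C<create_model()> above computes. A self-contained sketch of the same arithmetic with plain hashes (the category names and counts are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical training-set sizes: category => number of documents
my %cat_docs = ( farming => 6, vampires => 3, mirrors => 1 );

my $num_docs = 0;
$num_docs += $_ for values %cat_docs;

# The model is simply each category's share of the training documents
my %model;
for my $cat (keys %cat_docs) {
  next unless $cat_docs{$cat};   # skip categories with no documents
  $model{$cat} = $cat_docs{$cat} / $num_docs;
}

# Every new document receives these same scores, regardless of content
printf "%-9s %.2f\n", $_, $model{$_} for sort keys %model;
```

This is why the Guesser makes a useful baseline: any learner that actually looks at document content should beat scores derived purely from class frequencies.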
lib/AI/Categorizer/Learner/Guesser.pm
AI::Categorizer::Learner::Guesser - Simple guessing based on class probabilities
lib/AI/Categorizer/Learner/KNN.pm
__PACKAGE__->valid_params
  (
   threshold     => { type => SCALAR, default => 0.4 },
   k_value       => { type => SCALAR, default => 20 },
   knn_weighting => { type => SCALAR, default => 'score' },
   max_instances => { type => SCALAR, default => 0 },
  );

sub create_model {
  my $self = shift;
  foreach my $doc ($self->knowledge_set->documents) {
    $doc->features->normalize;
  }
  $self->knowledge_set->features;  # Make sure the combined feature vector gets built
}

sub threshold {
  my $self = shift;
  $self->{threshold} = shift if @_;
  return $self->{threshold};
}

sub categorize_collection {
  my $self = shift;
  my $f_class = $self->knowledge_set->contained_class('features');
  if ($f_class->can('all_features')) {
    $f_class->all_features([ $self->knowledge_set->features->names ]);
  }
  $self->SUPER::categorize_collection(@_);
}

sub get_scores {
  my ($self, $newdoc) = @_;
  my $currentDocName = $newdoc->name;
  my $features = $newdoc->features->intersection($self->knowledge_set->features)->normalize;
  my $q = AI::Categorizer::Learner::KNN::Queue->new( size => $self->{k_value} );

  my @docset;
  if ($self->{max_instances}) {
    # Sample roughly max_instances documents at random
    my $probability = $self->{max_instances} / $self->knowledge_set->documents;
    @docset = grep { rand() < $probability } $self->knowledge_set->documents;
  } else {
    @docset = $self->knowledge_set->documents;
  }

  foreach my $doc (@docset) {
    my $score = $doc->features->dot($features);
    warn "Score for ", $doc->name, " (", ($doc->categories)[0]->name, "): $score" if $self->verbose > 1;
    $q->add($doc, $score);
  }

  my %scores = map {+ $_->name, 0 } $self->categories;
  foreach my $e (@{ $q->entries }) {
    foreach my $cat ($e->{thing}->categories) {
      $scores{ $cat->name } += ($self->{knn_weighting} eq 'score' ? $e->{score} : 1);
    }
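C<get_scores()> keeps the k highest dot-product scores in a bounded queue, then lets each neighbour vote for its categories, weighted either by its similarity score or by a flat count of 1. A compact sketch of that voting step over a hypothetical neighbour list:

```perl
use strict;
use warnings;

# Hypothetical k nearest neighbours: similarity score plus their categories
my @neighbours = (
  { score => 0.9, categories => ['farming'] },
  { score => 0.7, categories => ['farming', 'sheep'] },
  { score => 0.2, categories => ['vampires'] },
);

my $knn_weighting = 'score';   # any other value means flat 1-per-neighbour voting

# Accumulate a per-category total across the neighbours' votes
my %scores;
for my $e (@neighbours) {
  for my $cat (@{ $e->{categories} }) {
    $scores{$cat} += ($knn_weighting eq 'score' ? $e->{score} : 1);
  }
}

# Categories whose total beats the threshold would then be assigned
printf "%-9s %.2f\n", $_, $scores{$_} for sort keys %scores;
```

With score weighting, a single very close neighbour can outvote several distant ones; flat weighting treats all k neighbours equally.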
lib/AI/Categorizer/Learner/KNN.pm
AI::Categorizer::Learner::KNN - K Nearest Neighbour Algorithm For AI::Categorizer
lib/AI/Categorizer/Learner/KNN.pm
Creates a new KNN Learner and returns it. In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
KNN subclass accepts the following parameters:
lib/AI/Categorizer/Learner/NaiveBayes.pm
__PACKAGE__->valid_params
  (
   threshold => { type => SCALAR, default => 0.3 },
  );

sub create_model {
  my $self = shift;
  my $m = $self->{model} = Algorithm::NaiveBayes->new;

  foreach my $d ($self->knowledge_set->documents) {
    $m->add_instance( attributes => $d->features->as_hash,
                      label      => [ map $_->name, $d->categories ] );
  }
  $m->train;
}

sub get_scores {
  my ($self, $newdoc) = @_;
  return ($self->{model}->predict( attributes => $newdoc->features->as_hash ),
lib/AI/Categorizer/Learner/NaiveBayes.pm
}

sub threshold {
  my $self = shift;
  $self->{threshold} = shift if @_;
  return $self->{threshold};
}

sub save_state {
  my $self = shift;
  local $self->{knowledge_set};  # Temporarily clear it so it isn't serialized
  $self->SUPER::save_state(@_);
}

sub categories {
  my $self = shift;
  return map AI::Categorizer::Category->by_name( name => $_ ), $self->{model}->labels;
}

1;
lib/AI/Categorizer/Learner/NaiveBayes.pm
AI::Categorizer::Learner::NaiveBayes - Naive Bayes Algorithm For AI::Categorizer
lib/AI/Categorizer/Learner/NaiveBayes.pm
Creates a new Naive Bayes Learner and returns it. In addition to the
parameters accepted by the C<AI::Categorizer::Learner> class, the
Naive Bayes subclass accepts the following parameters:
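The learner delegates its probability model to Algorithm::NaiveBayes, as shown in C<create_model()> above. For intuition about what that model computes, here is a self-contained sketch of the core Bayes scoring step: combine a class prior with per-word likelihoods in log space. The counts, the C<log_score> helper, and the use of add-one smoothing are all illustrative assumptions, not Algorithm::NaiveBayes internals:

```perl
use strict;
use warnings;

# Made-up per-class word counts from a toy training set
my %counts = (
  farming  => { sheep => 4, farm => 3 },
  vampires => { blood => 5, sheep => 1 },
);
my %prior = ( farming => 0.5, vampires => 0.5 );

# Vocabulary across all classes (needed for add-one smoothing)
my @vocab = do { my %v; $v{$_}++ for map { keys %$_ } values %counts; keys %v };

# log P(class) + sum over words of log P(word | class), add-one smoothed
sub log_score {
  my ($class, @words) = @_;
  my $total = 0;
  $total += $_ for values %{ $counts{$class} };
  my $score = log($prior{$class});
  for my $w (@words) {
    my $c = $counts{$class}{$w} || 0;
    $score += log( ($c + 1) / ($total + @vocab) );
  }
  return $score;
}

my @doc = qw(sheep farm);
printf "%-9s %.3f\n", $_, log_score($_, @doc) for sort keys %counts;
```

Working in log space avoids underflow when a document has many words; the class with the highest log score wins.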
lib/AI/Categorizer/Learner/Rocchio.pm
$VERSION = '0.01';

__PACKAGE__->valid_params
  (
   positive_setting => { type => SCALAR, default => 16 },
   negative_setting => { type => SCALAR, default => 4 },
   threshold        => { type => SCALAR, default => 0.1 },
  );

sub create_model {
  my $self = shift;
  foreach my $doc ($self->knowledge_set->documents) {
    $doc->features->normalize;
  }

  $self->{model}{all_features} = $self->knowledge_set->features(undef);
  $self->SUPER::create_model(@_);
  delete $self->{knowledge_set};
}

sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;

  my $posdocnum = @$positives;
  my $negdocnum = @$negatives;
  my $beta  = $self->{positive_setting};
  my $gamma = $self->{negative_setting};

  my $profile = $self->{model}{all_features}->clone->scale( -$gamma / $negdocnum );
  my $f = $cat->features(undef)->clone->scale( $beta / $posdocnum + $gamma / $negdocnum );
  $profile->add($f);

  return $profile->normalize;
}

sub get_boolean_score {
  my ($self, $newdoc, $profile) = @_;
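C<create_boolean_model()> builds the classic Rocchio prototype in two steps: since C<all_features> is the sum over positive and negative documents, scaling it by -gamma/|N| and then adding (beta/|P| + gamma/|N|) times the category's own features leaves beta/|P| * sum(pos) - gamma/|N| * sum(neg). A self-contained sketch of that identity with plain hash vectors (beta, gamma, the counts, and the vectors are illustrative values):

```perl
use strict;
use warnings;
use List::Util qw(sum);

my ($beta, $gamma) = (16, 4);          # positive_setting, negative_setting

# Illustrative summed feature vectors
my %pos_sum = ( sheep => 6, farm => 4 );    # sum over 2 positive docs
my %neg_sum = ( blood => 8, sheep => 2 );   # sum over 4 negative docs
my ($posdocnum, $negdocnum) = (2, 4);

# all_features = positives + negatives
my %all;
$all{$_} += $pos_sum{$_} for keys %pos_sum;
$all{$_} += $neg_sum{$_} for keys %neg_sum;

# profile = -gamma/|N| * all  +  (beta/|P| + gamma/|N|) * pos
#         =  beta/|P| * pos   -   gamma/|N| * neg
my %profile;
$profile{$_} += -$gamma / $negdocnum * $all{$_} for keys %all;
$profile{$_} += ($beta / $posdocnum + $gamma / $negdocnum) * $pos_sum{$_}
  for keys %pos_sum;

# Normalize to unit length, as the real method does
my $norm = sqrt(sum map { $_ ** 2 } values %profile);
$profile{$_} /= $norm for keys %profile;

printf "%-6s %7.3f\n", $_, $profile{$_} for sort keys %profile;
```

Features common in positive documents end up with positive weight, features common only in negative documents with negative weight, so a simple dot product against the profile serves as the category score.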
lib/AI/Categorizer/Learner/SVM.pm
__PACKAGE__->valid_params
  (
   svm_kernel => { type => SCALAR, default => 'linear' },
  );

sub create_model {
  my $self = shift;
  my $f = $self->knowledge_set->features->as_hash;
  my $rmap = [ keys %$f ];
  $self->{model}{feature_map} = { map { $rmap->[$_], $_ } 0..$#$rmap };
  $self->{model}{feature_map_reverse} = $rmap;
  $self->SUPER::create_model(@_);
}

sub _doc_2_dataset {
  my ($self, $doc, $label, $fm) = @_;

  my $ds = new Algorithm::SVM::DataSet( Label => $label );
  my $f = $doc->features->as_hash;
  while (my ($k, $v) = each %$f) {
    next unless exists $fm->{$k};
    $ds->attribute( $fm->{$k}, $v );
  }
  return $ds;
}

sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;
  my $svm = new Algorithm::SVM( Kernel => $self->{svm_kernel} );

  my (@pos, @neg);
  foreach my $doc (@$positives) {
    push @pos, $self->_doc_2_dataset($doc, 1, $self->{model}{feature_map});
  }
  foreach my $doc (@$negatives) {
    push @neg, $self->_doc_2_dataset($doc, 0, $self->{model}{feature_map});
  }

  $svm->train(@pos, @neg);
  return $svm;
}

sub get_scores {
  my ($self, $doc) = @_;
  local $self->{current_doc} = $self->_doc_2_dataset($doc, -1, $self->{model}{feature_map});
  return $self->SUPER::get_scores($doc);
}

sub get_boolean_score {
  my ($self, $doc, $svm) = @_;
  return $svm->predict($self->{current_doc});
}

sub save_state {
  my ($self, $path) = @_;
  {
    local $self->{model}{learners};
    local $self->{knowledge_set};
    $self->SUPER::save_state($path);
  }
  return unless $self->{model};

  my $svm_dir = File::Spec->catdir($path, 'svms');
  mkdir($svm_dir, 0777) or die "Couldn't create $svm_dir: $!";

  while (my ($name, $learner) = each %{ $self->{model}{learners} }) {
    my $path = File::Spec->catfile($svm_dir, $name);
    $learner->save($path);
  }
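Algorithm::SVM wants integer-indexed attributes rather than named features, so C<create_model()> above builds a C<feature_map> from feature names to column indices, plus a reverse array for decoding. A sketch of that mapping and of how a document's features get converted, using plain data structures (the feature hashes are made up; the C<sort> is added here only to make the example deterministic, where the real code uses unsorted hash order):

```perl
use strict;
use warnings;

# Hypothetical global feature hash, as features->as_hash might return
my %f = ( sheep => 3, blood => 5, farm => 2 );

# Assign each feature name a stable integer column, keep a reverse map
my $rmap = [ sort keys %f ];
my %feature_map = map { ($rmap->[$_], $_) } 0 .. $#$rmap;

# Converting one document: keep only known features, index them
my %doc = ( sheep => 1, garlic => 2 );   # 'garlic' was never seen in training
my %indexed;
while (my ($k, $v) = each %doc) {
  next unless exists $feature_map{$k};   # unseen features are silently dropped
  $indexed{ $feature_map{$k} } = $v;
}

print "$_ => $indexed{$_}\n" for sort keys %indexed;
```

Dropping unseen features mirrors C<_doc_2_dataset()>: at prediction time, a document can only be represented in the coordinate system fixed at training time.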
lib/AI/Categorizer/Learner/SVM.pm
AI::Categorizer::Learner::SVM - Support Vector Machine Learner
lib/AI/Categorizer/Learner/Weka.pm
    delete $self->{weka_path};
  }

  return $self;
}

sub create_model {
  my ($self) = shift;
  my $m = $self->{model} ||= {};
  $m->{all_features} = [ $self->knowledge_set->features->names ];

  $m->{_in_dir} = File::Temp::tempdir( DIR => $self->{tmpdir} );

  my $dummy_features = $self->create_delayed_object('features');
  $m->{dummy_file} = $self->create_arff_file("dummy", [[ $dummy_features, 0 ]]);

  $self->SUPER::create_model(@_);
}

sub create_boolean_model {
lib/AI/Categorizer/Learner/Weka.pm
                  );
  }
  return $filename;
}

sub save_state {
  my ($self, $path) = @_;
  {
    local $self->{knowledge_set};
    $self->SUPER::save_state($path);
  }
  return unless $self->{model};

  my $model_dir = File::Spec->catdir($path, 'models');
  mkdir($model_dir, 0777) or die "Couldn't create $model_dir: $!";

  while (my ($name, $learner) = each %{ $self->{model}{learners} }) {
    my $oldpath = File::Spec->catdir($self->{model}{_in_dir}, $learner->{machine_file});
    my $newpath = File::Spec->catfile($model_dir, "${name}_model");
    File::Copy::copy($oldpath, $newpath);
lib/AI/Categorizer/Learner/Weka.pm
AI::Categorizer::Learner::Weka - Pass-through wrapper to Weka system
lib/AI/Categorizer/Learner/Weka.pm
classifier class when building the categorizer.
t/01-naive_bayes.t
perform_standard_tests( learner_class => 'AI::Categorizer::Learner::NaiveBayes' );

my %docs = training_docs();

{
  ok my $c = new AI::Categorizer( collection_weighting => 'f' );

  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document( name => $name, %$data );
  }
  $c->knowledge_set->finish;

  ok $c->knowledge_set->document_frequency('vampires'), 2;

  for ('vampires', 'mirrors') {
    ok ($c->knowledge_set->document('doc4')->features->as_hash->{$_},
        log( keys(%docs) / $c->knowledge_set->document_frequency($_) )
       );
  }

  $c->learner->train( knowledge_set => $c->knowledge_set );
  ok $c->learner;

  my $doc = new AI::Categorizer::Document
    ( name => 'test1',
      content => 'I would like to begin farming sheep.' );
  ok $c->learner->categorize($doc)->best_category, 'farming';
}

{
  ok my $c = new AI::Categorizer( term_weighting => 'b' );

  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document( name => $name, %$data );
  }
  $c->knowledge_set->finish;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
}

{
  ok my $c = new AI::Categorizer( term_weighting => 'n' );

  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document( name => $name, %$data );
  }
  $c->knowledge_set->finish;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{blood}, 0.75;
  ok $c->knowledge_set->document('doc4')->features->as_hash->{mirrors}, 1;
}

{
  ok my $c = new AI::Categorizer( tfidf_weighting => 'txx' );

  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document( name => $name, %$data );
  }
  $c->knowledge_set->finish;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 2;
}
t/07-guesser.t
#!/usr/bin/perl -w

BEGIN {
  require 't/common.pl';
  plan tests => 1 + num_setup_tests();
}

ok(1);

my ($learner, $docs) = set_up_tests( learner_class => 'AI::Categorizer::Learner::Guesser' );