AI-Categorizer
README
be seen in "doc/classes.png".
Knowledge Sets
A "knowledge set" is defined as a collection of documents, together with
some information on the categories each document belongs to. Note that this
term is somewhat specific to this project; other sources may call it a
"training corpus" or "prior knowledge". A knowledge set also contains some
information on how documents will be parsed and how their features (words)
will be extracted and turned into meaningful representations. In this sense,
a knowledge set represents not only a collection of data, but a particular
view on that data.
A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
class. Before you can start playing with categorizers, you will have to
start playing with knowledge sets, so that the categorizers have some data
to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
module for information on its interface.
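For instance, a small knowledge set might be built by hand, along the lines of the distribution's test suite (a sketch; the document names, contents, and stopword list here are illustrative):

```perl
use AI::Categorizer::KnowledgeSet;

my $k = AI::Categorizer::KnowledgeSet->new
  (
   name      => 'Demo',
   stopwords => [qw(a the of and)],
  );

# Add a couple of labeled training documents.
$k->make_document(name       => 'doc1',
                  content    => 'Sheep are farmed for wool.',
                  categories => ['farming']);
$k->make_document(name       => 'doc2',
                  content    => 'Vampires avoid mirrors.',
                  categories => ['vampires']);

$k->finish;  # compute feature weights for the whole set
```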
Feature selection
Deciding which features are the most important is a very large part of the
categorization task - you cannot simply consider all the words in all the
documents when training, and all the words in the document being
categorized. There are two main reasons for this - first, it would mean that
your training and categorizing processes would take forever and use tons of
README
complete test run generally contains two collections, one for training and
one for testing. A "Learner" can mass-categorize a collection.
The "AI::Categorizer::Collection" class and its subclasses instantiate the
idea of a collection in this sense.
Documents
Each document is represented by an "AI::Categorizer::Document" object, or an
object of one of its subclasses. Each document class contains methods for
turning a bunch of data into a Feature Vector. Each document also has a
method to report which categories it belongs to.
Categories
Each category is represented by an "AI::Categorizer::Category" object. Its
main purpose is to keep track of which documents belong to it, though you
can also examine statistical properties of an entire category, such as
obtaining a Feature Vector representing an amalgamation of all the documents
that belong to it.
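A document and its categories can be wired together directly (a sketch; the constructor parameters follow the usage shown in the distribution's own code, and the names are illustrative):

```perl
use AI::Categorizer::Document;
use AI::Categorizer::Category;

# A document knows its categories; a category knows its documents.
my $doc = AI::Categorizer::Document->new
  (name       => 'doc1',
   content    => 'Sheep are farmed for wool.',
   categories => [ AI::Categorizer::Category->by_name(name => 'farming') ]);

# Ask the document which categories it belongs to.
foreach my $cat ($doc->categories) {
  print $cat->name, "\n";
}
```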
README
Please see the documentation of these individual modules for more details on
their guts and quirks. See the "AI::Categorizer::Learner" documentation for
a description of the general categorizer interface.
If you wish to create your own classifier, you should inherit from
"AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean" , which are
abstract classes that manage some of the work for you.
Feature Vectors
Most categorization algorithms don't deal directly with documents' data;
they instead deal with a *vector representation* of a document's *features*.
The features may be any properties of the document that seem helpful for
determining its category, but they are usually some version of the "most
important" words in the document. A list of features and their weights in
each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
You may think of this class as roughly analogous to a Perl hash, where the
keys are the names of features and the values are their weights.
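The synopsis later in this distribution shows the hash-like interface; a short sketch (the constructor's "features" parameter and the sample weights are assumptions based on that synopsis):

```perl
use AI::Categorizer::FeatureVector;

# A feature vector maps feature names to weights, much like a Perl hash.
my $f1 = AI::Categorizer::FeatureVector->new
  (features => { sheep => 3, vampires => 1 });
my $f2 = AI::Categorizer::FeatureVector->new
  (features => { sheep => 1, mirrors => 2 });

my $f3 = $f1->intersection($f2);  # features common to both vectors
my $h  = $f1->as_hash;            # plain { name => weight } hash
print $h->{sheep}, "\n";
```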
Hypotheses
README
training_set
Specifies the "path" parameter that will be fed to the
KnowledgeSet's "scan_features()" and "read()" methods during our
"scan_features()" and "read_training_set()" methods.
test_set
Specifies the "path" parameter that will be used when creating a
Collection during the "evaluate_test_set()" method.
data_root
A shortcut for setting the "training_set", "test_set", and
"category_file" parameters separately. Sets "training_set" to
"$data_root/training", "test_set" to "$data_root/test", and
"category_file" (used by some of the Collection classes) to
"$data_root/cats.txt".
learner()
Returns the Learner object associated with this Categorizer. Before
"train()" , the Learner will of course not be trained yet.
knowledge_set()
Returns the KnowledgeSet object associated with this Categorizer. If
"read_training_set()" has not yet been called, the KnowledgeSet will not
yet be populated with any training data.
run_experiment()
Runs a complete experiment on the training and testing data, reporting
the results on "STDOUT". Internally, this is just a shortcut for calling
the "scan_features()", "read_training_set()", "train()", and
"evaluate_test_set()" methods, then printing the value of the
"stats_table()" method.
scan_features()
Scans the Collection specified in the "training_set" parameter to determine
the set of features (words) that will be considered when training the
Learner. Internally, this calls the "scan_features()" method of the
KnowledgeSet, then saves a list of the KnowledgeSet's features for later
use.
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
read_training_set()
Populates the KnowledgeSet with the data specified in the "training_set"
parameter. Internally, this calls the "read()" method of the
KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
object for later use.
train()
Calls the Learner's "train()" method, passing it the KnowledgeSet
created during "read_training_set()". Returns the Learner object. Also
saves the Learner object for later use.
evaluate_test_set()
eg/demo.pl
die("Usage: $0 <corpus>\n" .
    "  A sample corpus (data set) can be downloaded from\n")
  unless @ARGV == 1;

my $corpus = shift;

my $training  = File::Spec->catfile($corpus, 'training');
my $test      = File::Spec->catfile($corpus, 'test');
my $cats      = File::Spec->catfile($corpus, 'cats.txt');
my $stopwords = File::Spec->catfile($corpus, 'stopwords');
lib/AI/Categorizer.pm
__PACKAGE__->valid_params
  (
   progress_file => { type => SCALAR, default => 'save' },
   knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
   learner       => { isa => 'AI::Categorizer::Learner' },
   verbose       => { type => BOOLEAN, default => 0 },
   training_set  => { type => SCALAR, optional => 1 },
   test_set      => { type => SCALAR, optional => 1 },
   data_root     => { type => SCALAR, optional => 1 },
  );

__PACKAGE__->contained_objects
  (
   knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
   learner       => { class => 'AI::Categorizer::Learner::NaiveBayes' },
   experiment    => { class => 'AI::Categorizer::Experiment',
                      delayed => 1 },
   collection    => { class => 'AI::Categorizer::Collection::Files',
                      delayed => 1 },
  );

sub new {
  my $package = shift;
  my %args = @_;
  my %defaults;
  if (exists $args{data_root}) {
    # Expand data_root into the three path parameters it abbreviates.
    $defaults{training_set}  = File::Spec->catfile($args{data_root}, 'training');
    $defaults{test_set}      = File::Spec->catfile($args{data_root}, 'test');
    $defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
    delete $args{data_root};
  }
  return $package->SUPER::new(%defaults, %args);
}
lib/AI/Categorizer.pm
two collections, one for training and one for testing. A C<Learner>
can mass-categorize a collection.
The C<AI::Categorizer::Collection> class and its subclasses
instantiate the idea of a collection in this sense.
lib/AI/Categorizer.pm
details on their guts and quirks. See the C<AI::Categorizer::Learner>
documentation for a description of the general categorizer interface.
If you wish to create your own classifier, you should inherit from
C<AI::Categorizer::Learner> or C<AI::Categorizer::Learner::Boolean>,
which are abstract classes that manage some of the work for you.
lib/AI/Categorizer.pm
Specifies the C<path> parameter that will be fed to the KnowledgeSet's
C<scan_features()> and C<read()> methods during our C<scan_features()>
and C<read_training_set()> methods.
lib/AI/Categorizer/Collection/InMemory.pm
__PACKAGE__->valid_params
  (
   data => { type => HASHREF },
  );

sub new {
  my $self = shift()->SUPER::new(@_);

  # Inflate any plain category names in the data hash into Category objects.
  while (my ($name, $params) = each %{$self->{data}}) {
    foreach (@{$params->{categories}}) {
      next if ref $_;
      $_ = AI::Categorizer::Category->by_name(name => $_);
    }
  }
  return $self;
}

sub next {
  my $self = shift;
  my ($name, $params) = each %{$self->{data}} or return;
  return AI::Categorizer::Document->new(name => $name, %$params);
}

sub rewind {
  my $self = shift;
  # Calling keys() in scalar context resets the hash's each() iterator.
  scalar keys %{$self->{data}};
  return;
}

sub count_documents {
  my $self = shift;
  return scalar keys %{$self->{data}};
}

1;
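A collection like this can then be iterated with next()/rewind(), in the style of the distribution's test suite (a sketch; the document data is illustrative):

```perl
use AI::Categorizer::Collection::InMemory;

my $c = AI::Categorizer::Collection::InMemory->new
  (data => {
     doc1 => { content    => 'Sheep are farmed for wool.',
               categories => ['farming'] },
     doc2 => { content    => 'Vampires avoid mirrors.',
               categories => ['vampires'] },
   });

print $c->count_documents, "\n";   # 2
while (my $doc = $c->next) {
  print $doc->name, "\n";
}
$c->rewind;                        # start iterating again from the top
```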
lib/AI/Categorizer/Document.pm
my $NAME = 'a';

sub new {
  my $pkg = shift;
  # Auto-generate a name ('a', 'b', ...) if the caller doesn't supply one.
  my $self = $pkg->SUPER::new(name => $NAME++,
                              @_);
  $self->{categories} = new AI::Categorizer::ObjectSet( @{$self->{categories}} );
  $self->_fix_stopwords;

  if (exists $self->{parse}) {
    $self->parse(content => delete $self->{parse});
  } elsif (exists $self->{parse_handle}) {
    $self->parse_handle(handle => delete $self->{parse_handle});
lib/AI/Categorizer/Document.pm
sub create_feature_vector {
  my $self = shift;
  my $content = $self->{content};
  my $weights = $self->{content_weights};

  die "'stopword_behavior' must be one of 'stem', 'no_stem', or 'pre_stemmed'"
    unless $self->{stopword_behavior} =~ /^(?:stem|no_stem|pre_stemmed)$/;

  $self->{features} = $self->create_delayed_object('features');
  while (my ($name, $data) = each %$content) {
    my $t = $self->tokenize($data);
    # Filter stopwords before stemming ('no_stem') or after it ('stem',
    # 'pre_stemmed'), depending on whether the stopword list is stemmed.
    $t = $self->_filter_tokens($t) if $self->{stopword_behavior} eq 'no_stem';
    $self->stem_words($t);
    $t = $self->_filter_tokens($t) if $self->{stopword_behavior} =~ /^(?:stem|pre_stemmed)$/;
    my $h = $self->vectorize(tokens => $t, weight => exists($weights->{$name}) ? $weights->{$name} : 1);
    $self->{features}->add($h);
  }
}

sub is_in_category {
  return (ref $_[1]
lib/AI/Categorizer/Document.pm
my $d = new AI::Categorizer::Document(name => $string);
$d->features($feature_vector);

my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
my $hypothesis = $learner->categorize($document);
lib/AI/Categorizer/Document/XML.pm
sub parse {
  my ($self, %args) = @_;
  my $body = $args{content};
  my $elementWeight = $args{elementWeight};

  my $xmlHandler = $self->create_contained_object('xml_handler', weights => $elementWeight);
  my $xmlParser = XML::SAX::ParserFactory->parser(Handler => $xmlHandler);
  $xmlParser->parse_string($body);
lib/AI/Categorizer/Document/XML.pm
  my $self = $class->SUPER::new;
  $self->{weightHash} = $args{weights};
  $self->{content} = '';
  $self->{locationArray} = [];
  return $self;
}

sub start_document {
  my ($self, $doc) = @_;
  $self->{levelPointer} = 0;
  $self->{content} = "";
}
lib/AI/Categorizer/Document/XML.pm
sub start_element {
  my ($self, $el) = @_;

  # Remember where this element's text starts, so end_element can
  # rescale or discard it according to the element's weight.
  my $location = length $self->{content};
  $self->{locationArray}[$self->{levelPointer}] = $location;
  $self->{levelPointer}++;
}
lib/AI/Categorizer/Document/XML.pm
  $self->{levelPointer}--;
  my $location = $self->{locationArray}[$self->{levelPointer}];

  my $elementName = $el->{Name};
  my $weight = 1;
  $weight = $self->{weightHash}{$elementName} if exists $self->{weightHash}{$elementName};

  # Weight 0: discard the element's text entirely.
  if ($weight == 0) {
    $self->{content} = substr($self->{content}, 0, $location);
    return;
  }

  # Weight 1: keep the text as-is.
  if ($weight == 1) {
    return;
  }

  # Weight n > 1: repeat the element's text n times, so its words count
  # n times as heavily in the resulting feature vector.
  my $newContent = substr($self->{content}, $location);
  for (my $i = 1; $i < $weight; $i++) {
    $self->{content} .= $newContent;
  }
}

sub characters {
  my ($self, $args) = @_;
  $self->{content} .= "$args->{Data}\n";
}

sub comment {
  my ($self, $args) = @_;
lib/AI/Categorizer/Document/XML.pm
sub processing_instruction {
  my ($self, $args) = @_;
}

sub getContent {
  my ($self) = @_;
  return $self->{content};
}

1;
lib/AI/Categorizer/Experiment.pm
C<Statistics::Contingency> for a description of its interface. All of
its methods are available here, with the following additions:
lib/AI/Categorizer/FeatureSelector.pm
This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).
lib/AI/Categorizer/FeatureVector.pm
$f3 = $f1->intersection($f2);
$f3 = $f1->add($f2);

$h = $f1->as_hash;
$h = $f1->as_boolean_hash;

$f1->normalize;
lib/AI/Categorizer/KnowledgeSet.pm
sub load {
  my ($self, %args) = @_;
  my $c = $self->_make_collection(\%args);

  if ($self->{features_kept}) {
    $self->read(collection => $c);
    $self->select_features;
  } elsif ($self->{scan_first}) {
    $self->scan_features(collection => $c);
    $c->rewind;
    $self->read(collection => $c);
  } else {
    $self->read(collection => $c);
  }
}

sub read {
  my ($self, %args) = @_;
  my $collection = $self->_make_collection(\%args);
  my $pb = $self->prog_bar($collection);

  while (my $doc = $collection->next) {
lib/AI/Categorizer/KnowledgeSet.pm
This method can do feature selection and load a Collection in one step
(though it currently uses two steps internally).
lib/AI/Categorizer/Learner/SVM.pm
sub create_model {
  my $self = shift;

  # Map feature names to integer indices, as Algorithm::SVM requires.
  my $f = $self->knowledge_set->features->as_hash;
  my $rmap = [ keys %$f ];
  $self->{model}{feature_map} = { map { $rmap->[$_], $_ } 0..$#$rmap };
  $self->{model}{feature_map_reverse} = $rmap;
  $self->SUPER::create_model(@_);
}

sub _doc_2_dataset {
  my ($self, $doc, $label, $fm) = @_;

  my $ds = new Algorithm::SVM::DataSet(Label => $label);
  my $f = $doc->features->as_hash;
  while (my ($k, $v) = each %$f) {
    next unless exists $fm->{$k};
    $ds->attribute($fm->{$k}, $v);
  }
  return $ds;
}

sub create_boolean_model {
  my ($self, $positives, $negatives, $cat) = @_;
  my $svm = new Algorithm::SVM(Kernel => $self->{svm_kernel});

  my (@pos, @neg);
  foreach my $doc (@$positives) {
    push @pos, $self->_doc_2_dataset($doc, 1, $self->{model}{feature_map});
  }
  foreach my $doc (@$negatives) {
    push @neg, $self->_doc_2_dataset($doc, 0, $self->{model}{feature_map});
  }

  $svm->train(@pos, @neg);
  return $svm;
}

sub get_scores {
  my ($self, $doc) = @_;
  local $self->{current_doc} = $self->_doc_2_dataset($doc, -1, $self->{model}{feature_map});
  return $self->SUPER::get_scores($doc);
}

sub get_boolean_score {
  my ($self, $doc, $svm) = @_;
  return $svm->predict($self->{current_doc});
}

sub save_state {
  my ($self, $path) = @_;
lib/AI/Categorizer/Learner/Weka.pm
$nb = AI::Categorizer::Learner->restore_state('filename');

my $c = new AI::Categorizer::Collection::Files(path => ...);
while (my $document = $c->next) {
  my $hypothesis = $nb->categorize($document);
  print "Best assigned category: ", $hypothesis->best_category, "\n";
}
t/01-naive_bayes.t
perform_standard_tests(learner_class => 'AI::Categorizer::Learner::NaiveBayes');

my %docs = training_docs();

{
  ok my $c = new AI::Categorizer(collection_weighting => 'f');

  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
  $c->knowledge_set->finish;

  ok $c->knowledge_set->document_frequency('vampires'), 2;

  for ('vampires', 'mirrors') {
    ok($c->knowledge_set->document('doc4')->features->as_hash->{$_},
       log(keys(%docs) / $c->knowledge_set->document_frequency($_))
      );
t/01-naive_bayes.t
  my $doc = new AI::Categorizer::Document
    (name    => 'test1',
     content => 'I would like to begin farming sheep.');
  ok $c->learner->categorize($doc)->best_category, 'farming';
}

{
  ok my $c = new AI::Categorizer(term_weighting => 'b');

  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
  $c->knowledge_set->finish;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
}

{
  ok my $c = new AI::Categorizer(term_weighting => 'n');

  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
  $c->knowledge_set->finish;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 1;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{blood}, 0.75;
  ok $c->knowledge_set->document('doc4')->features->as_hash->{mirrors}, 1;
}

{
  ok my $c = new AI::Categorizer(tfidf_weighting => 'txx');

  while (my ($name, $data) = each %docs) {
    $c->knowledge_set->make_document(name => $name, %$data);
  }
  $c->knowledge_set->finish;
  ok $c->knowledge_set->document('doc3')->features->as_hash->{vampires}, 2;
}
t/14-collection.t
BEGIN { plan tests => 13 };

ok 1;

my $c = AI::Categorizer::Collection::InMemory->new(data => {training_docs()});
ok $c;
exercise_collection($c, 4);

$c = AI::Categorizer::Collection::Files->new(path => File::Spec->catdir('t', 'traindocs'),
                                             category_hash => {
                                               doc1 => ['farming'],
                                               doc2 => ['farming'],
                                               doc3 => ['vampire'],
t/common.pl
    (
     name      => 'Vampires/Farmers',
     stopwords => [qw(are be in of and)],
    ),
   verbose => $ENV{TEST_VERBOSE} ? 1 : 0,
   %params,
  );
ok ref($c), 'AI::Categorizer', "Create an AI::Categorizer object";

my %docs = training_docs();
while (my ($name, $data) = each %docs) {
  $c->knowledge_set->make_document(name => $name, %$data);
}

my $l = $c->learner;
ok $l;
if ($params{learner_class}) {
  ok ref($l), $params{learner_class}, "Make sure the correct Learner class is instantiated";
} else {
  ok 1, 1, "Dummy test";
}
run_test_docs($l);

$l->save_state('t/state');
$l = $l->restore_state('t/state');
ok $l;

run_test_docs($l);

my $train_collection = AI::Categorizer::Collection::InMemory->new(data => \%docs);
ok $train_collection;

my $h = $l->categorize_collection(collection => $train_collection);
ok $h->micro_precision > 0.5;
}

sub num_setup_tests () { 3 }
sub num_standard_tests () { num_setup_tests + 17 }

1;