use Module::Build;
use Config;
use File::Spec;
my $build = new Module::Build
(
module_name => 'AI::Categorizer',
license => 'perl',
requires => {
perl => '5.6.0',
Class::Container => 0.09,
Storable => 0,
Params::Validate => 0.18,
Statistics::Contingency => 0.06,
Lingua::Stem => 0.50,
File::Spec => 0,
},
recommends => {
Scalar::Util => 0,
Time::Progress => 1.1,
Algorithm::SVM => 0.06,
AI::DecisionTree => 0.06,
Algorithm::NaiveBayes => 0,
},
build_requires => {
Module::Build => 0.20,
},
create_makefile_pl => 'passthrough',
);
my $categorizer = File::Spec->catfile('eg', 'categorizer');
if ($build->y_n("Do you want to install the $categorizer script to $Config{installscript}?", 'n')) {
$build->scripts($categorizer);
}
$build->create_build_script;
{
my $path = $build->prompt
(
"\nIf you have the Weka system installed, please specify the path\n".
"to the 'weka.jar' file, or '-' to search CLASSPATH, or '!' to skip:",
'!'
);
$build->notes(classpath => $path eq '!' ? undef : $path);
}
Revision history for Perl extension AI::Categorizer.
0.09
- The t/01-naive_bayes.t test was failing (instead of skipping) when
Algorithm::NaiveBayes wasn't installed. Now it skips.
0.08 Tue Mar 20 19:39:41 2007
- Added a ChiSquared feature selection class. [Francois Paradis]
- Changed the web locations of the reuters-21578 corpus that
eg/demo.pl uses, since the location it referenced previously has
gone away.
- The building & installing process now uses Module::Build rather
than ExtUtils::MakeMaker.
- When the features_kept mechanism was used to explicitly state the
features to use, and the scan_first parameter was left as its
default value, the features_kept mechanism would silently fail to
do anything. This has now been fixed. [Spotted by Arnaud Gaudinat]
- Recent versions of Weka have changed the name of the SVM class, so
I've updated it in our test (t/03-weka.t) of the Weka wrapper
too. [Sebastien Aperghis-Tramoni]
0.07 Tue May 6 16:15:04 CDT 2003
- Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
MANIFEST, so they weren't included in the 0.06 distribution.
[Spotted by Zoltan Barta]
0.06 Tue Apr 22 10:27:26 CDT 2003
- Added a relatively simple example script at the request of several
people, at eg/demo.pl
- Forgot to note a dependency on Algorithm::NaiveBayes in version
0.05. Fixed.
- AI::Categorizer class wasn't loading AI::Categorizer::KnowledgeSet
class. Fixed.
- Fixed a bug in which the 'documents' and 'categories' parameters to
the KnowledgeSet objects were never accepted, claiming that it
failed the "All are Document objects" or "All are Category objects"
callbacks. [Spotted by rob@phraud.org]
- Moved the 'stopword_file' parameter from Categorizer.pm to the
Collection class.
0.05 Sat Mar 29 00:38:21 CST 2003
- Feature selection is now handled by an abstract FeatureSelector
framework class. Currently the only concrete subclass implemented
is FeatureSelector::DocFrequency. The 'feature_selection'
parameter has been replaced with a 'feature_selector_class'
parameter.
- Added a k-Nearest-Neighbor machine learner. [First revision
implemented by David Bell]
- Added a Rocchio machine learner. [Partially implemented by Xiaobo
Li]
- Added a "Guesser" machine learner which simply uses overall class
probabilities to make categorization decisions. Sometimes useful
for providing a set of baseline scores against which to evaluate
other machine learners.
- The NaiveBayes learner is now a wrapper around my new
Algorithm::NaiveBayes module, which is just the old NaiveBayes code
from here, turned into its own standalone module.
- Much more extensive regression testing of the code.
- Added a Document subclass for XML documents. [Implemented by
Jae-Moon Lee] Its interface is still unstable, it may change in
later releases.
- Added a 'Build.PL' file for an alternate installation method using
Module::Build.
- Fixed a problem in the Hypothesis' best_category() method that
would often result in the wrong category being reported. Added a
regression test to exercise the Hypothesis class. [Spotted by
Xiaobo Li]
- The 'categorizer' script now records more useful benchmarking
information about time & memory in its outfile.
- The AI::Categorizer->dump_parameters() method now tries to avoid
showing you its entire list of stopwords.
- Document objects now use a default 'name' if none is supplied.
- For some Learner classes, the generated Hypothesis objects had
non-functioning all_categories() methods. Fixed.
- The Collection::Files class now uses File::Spec internally to
manage cross-platform filenames.
- Added the 'stopword_behavior' parameter for controlling how
stopword lists and stemming interact. Previously, if stopwords &
stemming were both used, stopwords were assumed to be pre-stemmed,
which often isn't the case.
- parse() is now an instance method of the Document class, not a
class method. This means it can operate directly on an object, it
doesn't have to return a hash of content. This allows more
flexible document parsing. This may cause some backward
compatibility problems if people were overriding the parse()
method.
- Added a parse_handle() method, which can parse a document directly
from a filehandle.
- Fixed documentation for add_hypothesis() [spotted by Thierry
Guillotin]
- Added documentation for the AI::Categorizer::Collection::Files
class.
0.04 Thu Nov 7 19:27:15 AEST 2002
- Added learners for SVMs, Decision Trees, and a pass-through to
Weka.
- Added a virtual class for binary classifiers.
- Wrote documentation for lots of the undocumented classes.
- Added a PNG file giving an overview diagram of the classes.
- Added a script 'categorizer' to provide a simple command-line
interface to AI::Categorizer
- save_state() and restore_state() now save to a directory, not a
file.
- Removed F1(), precision(), recall(), etc. from Util package since
they're in Statistics::Contingency. Added random_elements() to
Util.
- Collection::Files now warns when no category information is known
about a document in the collection (knowing it's in zero categories
is okay).
- Added the Collection::InMemory class
- Much more thorough testing with 'make test'.
- Added add_hypothesis() method to Experiment.
- Added dot() and value() methods to FeatureVector.
- Added 'feature_selection' parameter to KnowledgeSet.
- Added document($name) accessor method to KnowledgeSet.
- In KnowledgeSet, load(), read(), and scan_*() can now accept a
Collection object.
- Added document_frequency(), finish(), and weigh_features() methods
to KnowledgeSet.
- Added save_features() and restore_features() to KnowledgeSet.
- Added default categories() and categorize() methods to Learner base
class. get_scores() is now abstract.
- Extended interface of ObjectSet class with retrieve(), includes(),
and includes_name().
- Moved 'term_weighting' parameter from Document to KnowledgeSet,
since the normalized version needs to know the maximum
term-frequency. Also changed its values to 'n', 'l', 'b', and 't',
with 'x' a synonym for 't'.
- Implemented full range of TF/IDF term weighting methods (see Salton
& Buckley, "Term Weighting Approaches in Automatic Text Retrieval",
in journal "Information Processing & Management", 1988 #5)
0.03 Wed Jul 24 01:57:00 AEST 2002
- First version released to CPAN
0.01 Wed Apr 17 10:47:21 2002
- original version; created by h2xs 1.21 with options
-XA -n AI::Categorizer
Installation instructions for AI::Categorizer
To install this module, follow the standard steps for installing most
Perl modules:
perl Makefile.PL
make
make test
make install
Or you may use the CPAN.pm module, which will automatically execute
these steps for you, and help you get the prerequisite dependencies
installed as well.
Alternatively, you can use the new Module::Build-style installer:
perl Build.PL
./Build
./Build test
./Build install
-Ken
---
name: AI-Categorizer
version: 0.09
author:
- 'Ken Williams <ken@mathforum.org>'
- |-
Discussion about this module can be directed to the perl-AI list at
<perl-ai@perl.org>. For more info about the list, see
http://lists.perl.org/showlist.cgi?name=perl-ai
abstract: Automatic Text Categorization
license: perl
resources:
license: http://dev.perl.org/licenses/
requires:
Class::Container: 0.09
File::Spec: 0
Lingua::Stem: 0.5
Params::Validate: 0.18
Statistics::Contingency: 0.06
Storable: 0
perl: 5.6.0
build_requires:
Module::Build: 0.2
recommends:
AI::DecisionTree: 0.06
Algorithm::NaiveBayes: 0
Algorithm::SVM: 0.06
Scalar::Util: 0
Time::Progress: 1.1
provides:
AI::Categorizer:
file: lib/AI/Categorizer.pm
version: 0.09
AI::Categorizer::Category:
file: lib/AI/Categorizer/Category.pm
AI::Categorizer::Collection:
file: lib/AI/Categorizer/Collection.pm
AI::Categorizer::Collection::DBI:
file: lib/AI/Categorizer/Collection/DBI.pm
AI::Categorizer::Collection::Files:
file: lib/AI/Categorizer/Collection/Files.pm
AI::Categorizer::Collection::InMemory:
file: lib/AI/Categorizer/Collection/InMemory.pm
AI::Categorizer::Collection::SingleFile:
file: lib/AI/Categorizer/Collection/SingleFile.pm
AI::Categorizer::Document:
file: lib/AI/Categorizer/Document.pm
AI::Categorizer::Document::SMART:
file: lib/AI/Categorizer/Document/SMART.pm
AI::Categorizer::Document::Text:
file: lib/AI/Categorizer/Document/Text.pm
AI::Categorizer::Document::XML:
file: lib/AI/Categorizer/Document/XML.pm
AI::Categorizer::Document::XML::Handler:
file: lib/AI/Categorizer/Document/XML.pm
AI::Categorizer::Experiment:
file: lib/AI/Categorizer/Experiment.pm
AI::Categorizer::FeatureSelector:
file: lib/AI/Categorizer/FeatureSelector.pm
AI::Categorizer::FeatureSelector::CategorySelector:
file: lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
AI::Categorizer::FeatureSelector::ChiSquare:
file: lib/AI/Categorizer/FeatureSelector/ChiSquare.pm
AI::Categorizer::FeatureSelector::DocFrequency:
file: lib/AI/Categorizer/FeatureSelector/DocFrequency.pm
AI::Categorizer::FeatureVector:
file: lib/AI/Categorizer/FeatureVector.pm
AI::Categorizer::Hypothesis:
file: lib/AI/Categorizer/Hypothesis.pm
AI::Categorizer::KnowledgeSet:
file: lib/AI/Categorizer/KnowledgeSet.pm
AI::Categorizer::Learner:
file: lib/AI/Categorizer/Learner.pm
AI::Categorizer::Learner::Boolean:
file: lib/AI/Categorizer/Learner/Boolean.pm
AI::Categorizer::Learner::DecisionTree:
file: lib/AI/Categorizer/Learner/DecisionTree.pm
version: 0.01
AI::Categorizer::Learner::Guesser:
file: lib/AI/Categorizer/Learner/Guesser.pm
AI::Categorizer::Learner::KNN:
file: lib/AI/Categorizer/Learner/KNN.pm
AI::Categorizer::Learner::KNN::Queue:
file: lib/AI/Categorizer/Learner/KNN.pm
AI::Categorizer::Learner::NaiveBayes:
file: lib/AI/Categorizer/Learner/NaiveBayes.pm
AI::Categorizer::Learner::Rocchio:
file: lib/AI/Categorizer/Learner/Rocchio.pm
version: 0.01
AI::Categorizer::Learner::SVM:
file: lib/AI/Categorizer/Learner/SVM.pm
version: 0.01
AI::Categorizer::Learner::Weka:
file: lib/AI/Categorizer/Learner/Weka.pm
AI::Categorizer::ObjectSet:
file: lib/AI/Categorizer/ObjectSet.pm
AI::Categorizer::Storable:
file: lib/AI/Categorizer/Storable.pm
AI::Categorizer::Util:
file: lib/AI/Categorizer/Util.pm
generated_by: Module::Build version 0.2806
meta-spec:
url: http://module-build.sourceforge.net/META-spec-v1.2.html
version: 1.2
Makefile.PL
# Note: this file was auto-generated by Module::Build::Compat version 0.03
unless (eval "use Module::Build::Compat 0.02; 1" ) {
print "This module requires Module::Build to install itself.\n";
require ExtUtils::MakeMaker;
my $yn = ExtUtils::MakeMaker::prompt
(' Install Module::Build now from CPAN?', 'y');
unless ($yn =~ /^y/i) {
die " *** Cannot install without Module::Build. Exiting ...\n";
}
require Cwd;
require File::Spec;
require CPAN;
# Save this 'cause CPAN will chdir all over the place.
my $cwd = Cwd::cwd();
CPAN::Shell->install('Module::Build::Compat');
CPAN::Shell->expand("Module", "Module::Build::Compat")->uptodate
or die "Couldn't install Module::Build, giving up.\n";
chdir $cwd or die "Cannot chdir() back to $cwd: $!";
}
eval "use Module::Build::Compat 0.02; 1" or die $@;
Module::Build::Compat->run_build_pl(args => \@ARGV);
require Module::Build;
Module::Build::Compat->write_makefile(build_class => 'Module::Build');
NAME
AI::Categorizer - Automatic Text Categorization
SYNOPSIS
use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);
# Run a complete experiment - training on a corpus, testing on a test
# set, printing a summary of results to STDOUT
$c->run_experiment;
# Or, run the parts of $c->run_experiment separately
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
DESCRIPTION
"AI::Categorizer" is a framework for automatic text categorization. It
consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can choose what
categorization algorithm to use, what features (words or otherwise) of the
documents should be used (or how to automatically choose these features),
what format the documents are in, and so on.
The basic process of using this module will typically involve obtaining a
collection of pre-categorized documents, creating a "knowledge set"
representation of those documents, training a categorizer on that knowledge
set, and saving the trained categorizer for later use. There are several
ways to carry out this process. The top-level "AI::Categorizer" module
provides an umbrella class for high-level operations, or you may use the
interfaces of the individual classes in the framework.
A simple sample script that reads a training corpus, trains a categorizer,
and tests the categorizer on a test corpus, is distributed as eg/demo.pl .
Disclaimer: the results of any of the machine learning algorithms are far
from infallible (close to fallible?). Categorization of documents is often a
difficult task even for humans well-trained in the particular domain of
knowledge, and there are many things a human would consider that none of
these algorithms consider. These are only statistical tests - at best they
are neat tricks or helpful assistants, and at worst they are totally
unreliable. If you plan to use this module for anything really important,
human supervision is essential, both of the categorization process and the
final results.
For the usage details, please see the documentation of each individual
module.
FRAMEWORK COMPONENTS
This section explains the major pieces of the "AI::Categorizer" object
framework. We give a conceptual overview, but don't get into any of the
details about interfaces or usage. See the documentation for the individual
classes for more details.
A diagram of the various classes in the framework can be seen in
"doc/classes-overview.png", and a more detailed view of the same thing can
be seen in "doc/classes.png".
Knowledge Sets
A "knowledge set" is defined as a collection of documents, together with
some information on the categories each document belongs to. Note that this
term is specific to this project - other sources may call it a
"training corpus" or "prior knowledge". A knowledge set also contains some
information on how documents will be parsed and how their features (words)
will be extracted and turned into meaningful representations. In this sense,
a knowledge set represents not only a collection of data, but a particular
view on that data.
A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
class. Before you can start playing with categorizers, you will have to
start playing with knowledge sets, so that the categorizers have some data
to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
module for information on its interface.
Feature selection
Deciding which features are the most important is a very large part of the
categorization task - you cannot simply consider all the words in all the
documents when training, and all the words in the document being
categorized. There are two main reasons for this - first, it would mean that
your training and categorizing processes would take forever and use tons of
memory, and second, the significant stuff of the documents would get lost in
the "noise" of the insignificant stuff.
The process of selecting the most important features in the training set is
called "feature selection". It is managed by the
"AI::Categorizer::KnowledgeSet" class, and you will find the details of
feature selection processes in that class's documentation.
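As an illustration of the document-frequency idea (a standalone sketch with made-up documents, not the actual FeatureSelector::DocFrequency code), one can count how many documents each term occurs in and keep only the common ones:

```perl
use strict;
use warnings;

my @documents = (
    "perl text categorization",
    "naive bayes text model",
    "support vector machines",
);

# Count the number of documents each term occurs in (not total occurrences).
my %doc_freq;
for my $doc (@documents) {
    my %seen = map { $_ => 1 } split ' ', $doc;
    $doc_freq{$_}++ for keys %seen;
}

# Keep only terms that appear in at least two documents.
my @kept = sort grep { $doc_freq{$_} >= 2 } keys %doc_freq;
print "@kept\n";
```

In the real module the cutoff is expressed via parameters such as features_kept rather than a hard-coded count, but the principle is the same: rare terms carry little evidence and are dropped.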
Collections
Because documents may be stored in lots of different formats, a "collection"
class has been created as an abstraction of a stored set of documents,
together with a way to iterate through the set and return Document objects.
A knowledge set contains a single collection object. A "Categorizer" doing a
complete test run generally contains two collections, one for training and
one for testing. A "Learner" can mass-categorize a collection.
The "AI::Categorizer::Collection" class and its subclasses instantiate the
idea of a collection in this sense.
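In spirit, a collection is anything that can hand back one document at a time. Here is a minimal sketch of that idea (TinyCollection is a hypothetical stand-in, not a class from the distribution):

```perl
use strict;
use warnings;

# A toy in-memory collection with a next()-style iterator,
# conceptually like AI::Categorizer::Collection's interface.
package TinyCollection;

sub new {
    my ($class, @docs) = @_;
    return bless { docs => [@docs], pos => 0 }, $class;
}

# Return the next document, or undef when the set is exhausted.
sub next {
    my $self = shift;
    return $self->{docs}[ $self->{pos}++ ];
}

package main;

my $collection = TinyCollection->new('first doc', 'second doc');
while ( defined( my $doc = $collection->next ) ) {
    print "$doc\n";
}
```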
Documents
Each document is represented by an "AI::Categorizer::Document" object, or an
object of one of its subclasses. Each document class contains methods for
turning a bunch of data into a Feature Vector. Each document also has a
method to report which categories it belongs to.
Categories
Each category is represented by an "AI::Categorizer::Category" object. Its
main purpose is to keep track of which documents belong to it, though you
can also examine statistical properties of an entire category, such as
obtaining a Feature Vector representing an amalgamation of all the documents
that belong to it.
Machine Learning Algorithms
There are lots of different ways to make the inductive leap from the
training documents to unseen documents. The Machine Learning community has
studied many algorithms for this purpose. To allow flexibility in choosing
and configuring categorization algorithms, each such algorithm is a subclass
of "AI::Categorizer::Learner". There are currently four categorizers
included in the distribution:
AI::Categorizer::Learner::NaiveBayes
A pure-perl implementation of a Naive Bayes classifier. No dependencies
on external modules or other resources. Naive Bayes is usually very fast
to train and fast to make categorization decisions, but isn't always the
most accurate categorizer.
AI::Categorizer::Learner::SVM
An interface to Corey Spencer's "Algorithm::SVM", which implements a
Support Vector Machine classifier. SVMs can take a while to train
(though in certain conditions there are optimizations to make them quite
fast), but are pretty quick to categorize. They often have very good
accuracy.
AI::Categorizer::Learner::DecisionTree
An interface to "AI::DecisionTree", which implements a Decision Tree
classifier. Decision Trees generally take longer to train than Naive
Bayes or SVM classifiers, but they are also quite fast when
categorizing. Decision Trees have the advantage that you can scrutinize
the structures of trained decision trees to see how decisions are being
made.
AI::Categorizer::Learner::Weka
An interface to version 2 of the Weka Knowledge Analysis system that
lets you use any of the machine learners it defines. This gives you
access to lots and lots of machine learning algorithms in use by machine
learning researchers. The main drawback is that Weka tends to be quite
slow and use a lot of memory, and the current interface between Weka and
"AI::Categorizer" is a bit clumsy.
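To give a feel for how a Naive Bayes decision is made (a simplified standalone sketch with made-up probabilities, not the Algorithm::NaiveBayes implementation), each category's score is a sum of log-probabilities, and the highest-scoring category wins:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical trained model: log P(word | category) and log priors.
my %log_prob = (
    sports  => { ball => log(0.10), game => log(0.08), market => log(0.01) },
    finance => { ball => log(0.01), game => log(0.02), market => log(0.12) },
);
my %log_prior = ( sports => log(0.5), finance => log(0.5) );

# Score = log prior + sum of per-word log likelihoods,
# with a small floor for words the model has never seen.
sub score {
    my ($category, @words) = @_;
    return $log_prior{$category}
         + sum map { $log_prob{$category}{$_} // log(1e-6) } @words;
}

my @doc = qw(market game);
my ($best) = sort { score($b, @doc) <=> score($a, @doc) } keys %log_prior;
print "$best\n";
```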
Other machine learning methods that may be implemented soonish include
Neural Networks, k-Nearest-Neighbor, and/or a mixture-of-experts combiner
for ensemble learning. No timetable for their creation has yet been set.
Please see the documentation of these individual modules for more details on
their guts and quirks. See the "AI::Categorizer::Learner" documentation for
a description of the general categorizer interface.
If you wish to create your own classifier, you should inherit from
"AI::Categorizer::Learner" or "AI::Categorizer::Learner::Boolean", which are
abstract classes that manage some of the work for you.
Feature Vectors
Most categorization algorithms don't deal directly with documents' data,
they instead deal with a *vector representation* of a document's *features*.
The features may be any properties of the document that seem helpful for
determining its category, but they are usually some version of the "most
important" words in the document. A list of features and their weights in
each document is encapsulated by the "AI::Categorizer::FeatureVector" class.
You may think of this class as roughly analogous to a Perl hash, where the
keys are the names of features and the values are their weights.
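Continuing the hash analogy (a standalone sketch, not the FeatureVector class's real internals), a dot product between two such vectors, similar in spirit to FeatureVector's dot() method, sums the products of the weights of shared features:

```perl
use strict;
use warnings;

# Two "feature vectors" as plain hashes: feature name => weight.
my %doc_a = ( perl => 3, text => 2, bayes => 1 );
my %doc_b = ( perl => 1, categorization => 4, bayes => 2 );

# Dot product over the features the two vectors have in common.
sub dot {
    my ($v1, $v2) = @_;
    my $sum = 0;
    for my $feature ( keys %$v1 ) {
        $sum += $v1->{$feature} * $v2->{$feature}
            if exists $v2->{$feature};
    }
    return $sum;
}

print dot(\%doc_a, \%doc_b), "\n";   # 3*1 + 1*2
```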
Hypotheses
The result of asking a categorizer to categorize a previously unseen
document is called a hypothesis, because it is some kind of "statistical
guess" of what categories this document should be assigned to. Since you may
be interested in any of several pieces of information about the hypothesis
(for instance, which categories were assigned, which category was the single
most likely category, the scores assigned to each category, etc.), the
hypothesis is returned as an object of the "AI::Categorizer::Hypothesis"
class, and you can use its object methods to get information about the
hypothesis. See its class documentation for the details.
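Conceptually (a sketch of the idea, not the Hypothesis class's actual code), a hypothesis boils down to a per-category score hash: the assigned categories are those above some threshold, and the best category is simply the top scorer:

```perl
use strict;
use warnings;

# Hypothetical scores a categorizer might assign to one document.
my %scores = ( sports => 0.80, finance => 0.30, politics => 0.55 );

# Categories assigned to the document: those above a threshold.
my $threshold = 0.5;
my @assigned  = sort grep { $scores{$_} >= $threshold } keys %scores;

# The single most likely category.
my ($best) = sort { $scores{$b} <=> $scores{$a} } keys %scores;

print "Assigned categories: @assigned\n";
print "Best category: $best\n";
```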
Experiments
The "AI::Categorizer::Experiment" class helps you organize the results of
categorization experiments. As you get lots of categorization results
(Hypotheses) back from the Learner, you can feed these results to the
Experiment class, along with the correct answers. When all results have been
collected, you can get a report on accuracy, precision, recall, F1, and so
on, with both micro-averaging and macro-averaging over categories. We use
the "Statistics::Contingency" module from CPAN to manage the calculations.
See the docs for "AI::Categorizer::Experiment" for more details.
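The arithmetic behind those figures is simple. As a standalone sketch (the module itself delegates these calculations to Statistics::Contingency), precision, recall, and F1 for one category follow from the true-positive, false-positive, and false-negative counts:

```perl
use strict;
use warnings;

# Contingency counts for one category.
my $tp = 8;    # documents correctly assigned to the category
my $fp = 2;    # documents wrongly assigned to it
my $fn = 4;    # category documents that were missed

my $precision = $tp / ($tp + $fp);
my $recall    = $tp / ($tp + $fn);
my $f1        = 2 * $precision * $recall / ($precision + $recall);

printf "P=%.3f R=%.3f F1=%.3f\n", $precision, $recall, $f1;
```

Micro-averaging pools these counts across all categories before computing the ratios; macro-averaging computes the ratios per category and then averages them.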
METHODS
new()
Creates a new Categorizer object and returns it. Accepts lots of
parameters controlling behavior. In addition to the parameters listed
here, you may pass any parameter accepted by any class that we create
internally (the KnowledgeSet, Learner, Experiment, or Collection
classes), or any class that *they* create. This is managed by the
"Class::Container" module, so see its documentation for the details of
how this works.
The specific parameters accepted here are:
progress_file
A string that indicates a place where objects will be saved during
several of the methods of this class. The default value is the
string "save", which means files like "save-01-knowledge_set" will
get created. The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.
verbose
If true, a few status messages will be printed during execution.
training_set
Specifies the "path" parameter that will be fed to the
KnowledgeSet's "scan_features()" and "read()" methods during our
"scan_features()" and "read_training_set()" methods.
test_set
Specifies the "path" parameter that will be used when creating a
Collection during the "evaluate_test_set()" method.
data_root
A shortcut for setting the "training_set", "test_set", and
"category_file" parameters separately. Sets "training_set" to
"$data_root/training", "test_set" to "$data_root/test", and
"category_file" (used by some of the Collection classes) to
"$data_root/cats.txt".
learner()
Returns the Learner object associated with this Categorizer. Before
"train()", the Learner will of course not be trained yet.
knowledge_set()
Returns the KnowledgeSet object associated with this Categorizer. If
"read_training_set()" has not yet been called, the KnowledgeSet will not
yet be populated with any training data.
run_experiment()
Runs a complete experiment on the training and testing data, reporting
the results on "STDOUT". Internally, this is just a shortcut for calling
the "scan_features()", "read_training_set()", "train()", and
"evaluate_test_set()" methods, then printing the value of the
"stats_table()" method.
scan_features()
Scans the Collection specified in the "training_set" parameter to determine
the set of features (words) that will be considered when training the
Learner. Internally, this calls the "scan_features()" method of the
KnowledgeSet, then saves a list of the KnowledgeSet's features for later
use.
This step is not strictly necessary, but it can dramatically reduce
memory requirements if you scan for features before reading the entire
corpus into memory.
read_training_set()
Populates the KnowledgeSet with the data specified in the "training_set"
parameter. Internally, this calls the "read()" method of the
KnowledgeSet. Returns the KnowledgeSet. Also saves the KnowledgeSet
object for later use.
train()
Calls the Learner's "train()" method, passing it the KnowledgeSet
created during "read_training_set()". Returns the Learner object. Also
saves the Learner object for later use.
evaluate_test_set()
Creates a Collection based on the value of the "test_set" parameter, and
calls the Learner's "categorize_collection()" method using this
Collection. Returns the resultant Experiment object. Also saves the
Experiment object for later use in the "stats_table()" method.
stats_table()
Returns the value of the Experiment's (as created by
"evaluate_test_set()") "stats_table()" method. This is a string that
shows various statistics about the accuracy/precision/recall/F1/etc. of
the assignments made during testing.
HISTORY
This module is a revised and redesigned version of the previous
"AI::Categorize" module by the same author. Note the added 'r' in the new
name. The older module has a different interface, and no attempt at backward
compatibility has been made - that's why I changed the name.
You can have both "AI::Categorize" and "AI::Categorizer" installed at the
same time on the same machine, if you want. They don't know about each other
or use conflicting namespaces.
AUTHOR
Ken Williams <ken@mathforum.org>
Discussion about this module can be directed to the perl-AI list at
<perl-ai@perl.org>. For more info about the list, see
http://lists.perl.org/showlist.cgi?name=perl-ai
REFERENCES
An excellent introduction to the academic field of Text Categorization is
Fabrizio Sebastiani's "Machine Learning in Automated Text Categorization":
ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.
COPYRIGHT
Copyright 2000-2003 Ken Williams. All rights reserved.
This distribution is free software; you can redistribute it and/or modify it
under the same terms as Perl itself. These terms apply to every file in the
distribution - if you have questions, please contact the author.
eg/categorizer
use strict;
use AI::Categorizer;
use Benchmark;
my $HAVE_YAML = eval "use YAML; 1";
my ($opt, $do_stage, $outfile) = parse_command_line(@ARGV);
@ARGV = grep !/^-\d$/, @ARGV;
my $c = eval {new AI::Categorizer(%$opt)};
if ($@ and $@ =~ /^The following parameter/) {
die "$@\nPlease see the AI::Categorizer documentation for a description of parameters accepted.\n";
}
die $@ if $@;
%$do_stage = map {$_, 1} 1..5 unless keys %$do_stage;
my $out_fh;
if ($outfile) {
open $out_fh, ">> $outfile" or die "Can't create $outfile: $!";
select((select($out_fh), $|=1)[0]);
if (keys(%$do_stage) > 1) {
print $out_fh "~~~~~~~~~~~~~~~~", scalar(localtime), "~~~~~~~~~~~~~~~~~~~~~~~~~~~\n";
if ($HAVE_YAML) {
print {$out_fh} YAML::Dump($c->dump_parameters);
} else {
warn "More detailed parameter dumping is available if you install the YAML module from CPAN.\n";
}
}
}
run_section('scan_features', 1, $do_stage);
run_section('read_training_set', 2, $do_stage);
run_section('train', 3, $do_stage);
run_section('evaluate_test_set', 4, $do_stage);
if ($do_stage->{5}) {
my $result = $c->stats_table;
print $result if $c->verbose;
print $out_fh $result if $out_fh;
}
sub run_section {
my ($section, $stage, $do_stage) = @_;
return unless $do_stage->{$stage};
if (keys %$do_stage > 1) {
print " % $0 @ARGV -$stage\n" if $c->verbose;
die "$0 is not executable, please change its execution permissions"
unless -x $0;
system($0, @ARGV, "-$stage") == 0
or die "$0 returned nonzero status, \$?=$?";
return;
}
my $start = new Benchmark;
$c->$section();
my $end = new Benchmark;
my $summary = timestr(timediff($end, $start));
my ($rss, $vsz) = memory_usage();
print "$summary (memory: rss=$rss, vsz=$vsz)\n" if $c->verbose;
print $out_fh "Stage $stage: $summary (memory: rss=$rss, vsz=$vsz)\n" if $out_fh;
}
sub parse_command_line {
my (%opt, %do_stage);
while (@_) {
if ($_[0] =~ /^-(\d+)$/) {
shift;
$do_stage{$1} = 1;
} elsif ( $_[0] eq '--config_file' ) {
die "--config_file requires the YAML module from CPAN to be installed.\n" unless $HAVE_YAML;
shift;
my $file = shift;
my $href = YAML::LoadFile($file);
@opt{keys %$href} = values %$href;
} elsif ( $_[0] =~ /^--/ ) {
my ($k, $v) = (shift, shift);
$k =~ s/^--//;
$opt{$k} = $v;
} else {
die usage();
}
}
while (my ($k, $v) = each %opt) {
# Allow abbreviations
if ($k =~ /^(\w+)_class$/) {
my $name = $1;
$v =~ s/^::/AI::Categorizer::\u${name}::/;
$opt{$k} = $v;
}
}
my $outfile;
unless ($outfile = delete $opt{outfile}) {
$outfile = $opt{progress_file} ? "$opt{progress_file}-results.txt" : "results.txt";
}
return (\%opt, \%do_stage, $outfile);
}
sub usage {
return <<EOF;
Usage:
$0 --parameter_1 <value_1> --parameter_2 <value_2>
# You may specify a YAML config file as follows:
$0 --config_file <path> --parameter_3 <value_3>
# Or, to run only step 3 (of 5)
$0 --config_file <path> -3
--parameter_1, --parameter_2, etc. are parameters accepted by
AI::Categorizer objects' new() methods.
EOF
}
sub memory_usage {
my ($rss, $vsz);
if ($^O eq 'darwin' or $^O eq 'linux') {
($rss, $vsz) = `ps -eo rss,vsz -p $$` =~ /(\d+)\s+(\d+)/;
} elsif ($^O eq 'solaris') {
($rss, $vsz) = `ps -o rss,vsz -p $$` =~ /(\d+)\s+(\d+)/;
} else {
warn "Unknown system, can't get memory usage";
}
return ($rss, $vsz);
}
eg/demo.pl
# This script reads a corpus of categorized documents, trains a Naive Bayes
# categorizer on it, then tests the categorizer on a set of test documents.
use strict;
use AI::Categorizer;
use AI::Categorizer::Collection::Files;
use AI::Categorizer::Learner::NaiveBayes;
use File::Spec;
die("Usage: $0 <corpus>\n".
" A sample corpus (data set) can be downloaded from\n".
" http://www.cpan.org/authors/Ken_Williams/data/reuters-21578.tar.gz\n".
" or http://www.limnus.com/~ken/reuters-21578.tar.gz\n")
unless @ARGV == 1;
my $corpus = shift;
my $training = File::Spec->catfile( $corpus, 'training' );
my $test = File::Spec->catfile( $corpus, 'test' );
my $cats = File::Spec->catfile( $corpus, 'cats.txt' );
my $stopwords = File::Spec->catfile( $corpus, 'stopwords' );
my %params;
if (-e $stopwords) {
$params{stopword_file} = $stopwords;
} else {
warn "$stopwords not found - no stopwords will be used.\n";
}
if (-e $cats) {
$params{category_file} = $cats;
} else {
die "$cats not found - can't proceed without category information.\n";
}
# In a real-world application these Collection objects could be of any
# type (any Collection subclass). Or you could create each Document
# object manually. Or you could let the KnowledgeSet create the
# Collection objects for you.
$training = AI::Categorizer::Collection::Files->new( path => $training, %params );
$test = AI::Categorizer::Collection::Files->new( path => $test, %params );
lib/AI/Categorizer.pm
use File::Spec;
use AI::Categorizer::Learner;
use AI::Categorizer::Document;
use AI::Categorizer::Category;
use AI::Categorizer::Collection;
use AI::Categorizer::Hypothesis;
use AI::Categorizer::KnowledgeSet;
__PACKAGE__->valid_params
(
progress_file => { type => SCALAR, default => 'save' },
knowledge_set => { isa => 'AI::Categorizer::KnowledgeSet' },
learner => { isa => 'AI::Categorizer::Learner' },
verbose => { type => BOOLEAN, default => 0 },
training_set => { type => SCALAR, optional => 1 },
test_set => { type => SCALAR, optional => 1 },
data_root => { type => SCALAR, optional => 1 },
);
__PACKAGE__->contained_objects
(
knowledge_set => { class => 'AI::Categorizer::KnowledgeSet' },
learner => { class => 'AI::Categorizer::Learner::NaiveBayes' },
experiment => { class => 'AI::Categorizer::Experiment',
delayed => 1 },
collection => { class => 'AI::Categorizer::Collection::Files',
delayed => 1 },
);
sub new {
my $package = shift;
my %args = @_;
my %defaults;
if (exists $args{data_root}) {
$defaults{training_set} = File::Spec->catfile($args{data_root}, 'training');
$defaults{test_set} = File::Spec->catfile($args{data_root}, 'test');
$defaults{category_file} = File::Spec->catfile($args{data_root}, 'cats.txt');
delete $args{data_root};
}
return $package->SUPER::new(%defaults, %args);
}
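The C<data_root> shortcut above can be sketched in isolation. This standalone snippet (with a hypothetical C<expand_data_root> helper, not part of the distribution) mirrors how C<new()> derives the three path defaults:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper mirroring how new() expands data_root into the
# three path parameters it would otherwise require separately.
sub expand_data_root {
    my %args = @_;
    my %defaults;
    if (exists $args{data_root}) {
        my $root = delete $args{data_root};
        $defaults{training_set}  = File::Spec->catfile($root, 'training');
        $defaults{test_set}      = File::Spec->catfile($root, 'test');
        $defaults{category_file} = File::Spec->catfile($root, 'cats.txt');
    }
    return (%defaults, %args);  # explicit arguments override the defaults
}

my %params = expand_data_root(data_root => 'corpus', verbose => 1);
```

Because the defaults come first in the returned list, any explicitly supplied C<training_set>, C<test_set>, or C<category_file> still wins.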
#sub dump_parameters {
# my $p = shift()->SUPER::dump_parameters;
# delete $p->{stopwords} if $p->{stopword_file};
# return $p;
#}
sub knowledge_set { shift->{knowledge_set} }
sub learner { shift->{learner} }
# Combines several methods in one sub
sub run_experiment {
my $self = shift;
$self->scan_features;
$self->read_training_set;
$self->train;
$self->evaluate_test_set;
print $self->stats_table;
}
sub scan_features {
my $self = shift;
return unless $self->knowledge_set->scan_first;
$self->knowledge_set->scan_features( path => $self->{training_set} );
$self->knowledge_set->save_features( "$self->{progress_file}-01-features" );
}
sub read_training_set {
my $self = shift;
$self->knowledge_set->restore_features( "$self->{progress_file}-01-features" )
if -e "$self->{progress_file}-01-features";
$self->knowledge_set->read( path => $self->{training_set} );
$self->_save_progress( '02', 'knowledge_set' );
return $self->knowledge_set;
}
sub train {
my $self = shift;
$self->_load_progress( '02', 'knowledge_set' );
$self->learner->train( knowledge_set => $self->{knowledge_set} );
$self->_save_progress( '03', 'learner' );
return $self->learner;
}
sub evaluate_test_set {
my $self = shift;
$self->_load_progress( '03', 'learner' );
my $c = $self->create_delayed_object('collection', path => $self->{test_set} );
$self->{experiment} = $self->learner->categorize_collection( collection => $c );
$self->_save_progress( '04', 'experiment' );
return $self->{experiment};
}
sub stats_table {
my $self = shift;
$self->_load_progress( '04', 'experiment' );
return $self->{experiment}->stats_table;
}
sub progress_file {
shift->{progress_file};
}
sub verbose {
shift->{verbose};
}
sub _save_progress {
my ($self, $stage, $node) = @_;
return unless $self->{progress_file};
my $file = "$self->{progress_file}-$stage-$node";
warn "Saving to $file\n" if $self->{verbose};
$self->{$node}->save_state($file);
}
sub _load_progress {
my ($self, $stage, $node) = @_;
return unless $self->{progress_file};
my $file = "$self->{progress_file}-$stage-$node";
warn "Loading $file\n" if $self->{verbose};
$self->{$node} = $self->contained_class($node)->restore_state($file);
}
1;
__END__
=head1 NAME
AI::Categorizer - Automatic Text Categorization
=head1 SYNOPSIS
use AI::Categorizer;
my $c = new AI::Categorizer(...parameters...);
# Run a complete experiment - training on a corpus, testing on a test
# set, printing a summary of results to STDOUT
$c->run_experiment;
# Or, run the parts of $c->run_experiment separately
$c->scan_features;
$c->read_training_set;
$c->train;
$c->evaluate_test_set;
print $c->stats_table;
# After training, use the Learner for categorization
my $l = $c->learner;
while (...) {
my $d = ...create a document...
my $hypothesis = $l->categorize($d); # An AI::Categorizer::Hypothesis object
print "Assigned categories: ", join ', ', $hypothesis->categories, "\n";
print "Best category: ", $hypothesis->best_category, "\n";
}
=head1 DESCRIPTION
C<AI::Categorizer> is a framework for automatic text categorization.
It consists of a collection of Perl modules that implement common
categorization tasks, and a set of defined relationships among those
modules. The various details are flexible - for example, you can
choose what categorization algorithm to use, what features (words or
otherwise) of the documents should be used (or how to automatically
choose these features), what format the documents are in, and so on.
lib/AI/Categorizer/Category.pm
use strict;
use AI::Categorizer::ObjectSet;
use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
__PACKAGE__->valid_params
(
name => {type => SCALAR, public => 0},
documents => {
type => ARRAYREF,
default => [],
callbacks => { 'all are Document objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Document'), @_ },
},
public => 0,
},
);
__PACKAGE__->contained_objects
(
features => {
class => 'AI::Categorizer::FeatureVector',
delayed => 1,
},
);
my %REGISTRY = ();
sub new {
my $self = shift()->SUPER::new(@_);
$self->{documents} = new AI::Categorizer::ObjectSet( @{$self->{documents}} );
$REGISTRY{$self->{name}} = $self;
return $self;
}
sub by_name {
my ($class, %args) = @_;
return $REGISTRY{$args{name}} if exists $REGISTRY{$args{name}};
return $class->new(%args);
}
sub name { $_[0]->{name} }
sub documents {
my $d = $_[0]->{documents};
return wantarray ? $d->members : $d->size;
}
sub contains_document {
return $_[0]->{documents}->includes( $_[1] );
}
sub add_document {
my $self = shift;
$self->{documents}->insert( $_[0] );
delete $self->{features}; # Could be more efficient?
}
sub features {
my $self = shift;
if (@_) {
$self->{features} = shift;
}
return $self->{features} if $self->{features};
my $v = $self->create_delayed_object('features');
return $self->{features} = $v unless $self->documents;
foreach my $document ($self->documents) {
$v->add( $document->features );
}
return $self->{features} = $v;
}
1;
__END__
=head1 NAME
AI::Categorizer::Category - A named category of documents
=head1 SYNOPSIS
my $category = AI::Categorizer::Category->by_name("sports");
my $name = $category->name;
my @docs = $category->documents;
my $num_docs = $category->documents;
my $features = $category->features;
$category->add_document($doc);
if ($category->contains_document($doc)) { ...
=head1 DESCRIPTION
This simple class represents a named category which may contain zero
or more documents. Each category is a "singleton" by name, so two
Category objects with the same name should not be created at once; use
C<by_name()> to fetch an existing category or create a new one.
=head1 METHODS
=over 4
lib/AI/Categorizer/Collection.pm
package AI::Categorizer::Collection;
use strict;
use Params::Validate qw(:types);
use Class::Container;
use base qw(Class::Container);
__PACKAGE__->valid_params
(
verbose => {type => SCALAR, default => 0},
stopword_file => { type => SCALAR, optional => 1 },
category_hash => { type => HASHREF, default => {} },
category_file => { type => SCALAR, optional => 1 },
);
__PACKAGE__->contained_objects
(
document => { class => 'AI::Categorizer::Document::Text',
delayed => 1 },
);
sub new {
my ($class, %args) = @_;
# Optimize so every document doesn't have to convert the stopword list to a hash
if ($args{stopwords} and UNIVERSAL::isa($args{stopwords}, 'ARRAY')) {
$args{stopwords} = { map {+$_ => 1} @{ $args{stopwords} } };
}
my $self = $class->SUPER::new(%args);
if ($self->{category_file}) {
local *FH;
open FH, $self->{category_file} or die "Can't open $self->{category_file}: $!";
while (<FH>) {
my ($doc, @cats) = split;
$self->{category_hash}{$doc} = \@cats;
}
close FH;
}
if (exists $self->{stopword_file}) {
my %stopwords;
local *FH;
open FH, "< $self->{stopword_file}" or die "$self->{stopword_file}: $!";
while (<FH>) {
chomp;
$stopwords{$_} = 1;
}
close FH;
$self->delayed_object_params('document', stopwords => \%stopwords);
}
return $self;
}
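The category file read above is plain text with one document per line: the document name followed by its category names, all whitespace-separated. A minimal sketch of the same parsing, with hypothetical in-memory lines standing in for a cats.txt file:

```perl
use strict;
use warnings;

# Hypothetical cats.txt contents: "document-name cat1 cat2 ..."
my @lines = ("doc1 sports news\n", "doc2 finance\n");

my %category_hash;
for (@lines) {
    # split on whitespace: first field is the document, the rest are categories
    my ($doc, @cats) = split;
    $category_hash{$doc} = \@cats;
}
```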
# This should usually be replaced in subclasses with a faster version that doesn't
# need to create actual documents each time through
sub count_documents {
my $self = shift;
return $self->{document_count} if exists $self->{document_count};
$self->rewind;
my $count = 0;
$count++ while $self->next;
$self->rewind;
return $self->{document_count} = $count;
}
# Abstract methods
sub next;
sub rewind;
1;
__END__
=head1 NAME
AI::Categorizer::Collection - Access stored documents
=head1 SYNOPSIS
my $c = new AI::Categorizer::Collection::Files
(path => '/tmp/docs/training',
category_file => '/tmp/docs/cats.txt');
print "Total number of docs: ", $c->count_documents, "\n";
while (my $document = $c->next) {
...
}
$c->rewind; # For further operations
=head1 DESCRIPTION
This abstract class implements an iterator for accessing documents in
their natively stored format. You cannot directly create an instance
of the Collection class, because it is abstract - see the
documentation for the C<Files>, C<SingleFile>, or C<InMemory>
subclasses for a concrete interface.
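The C<next>/C<rewind> iterator contract described above can be sketched with a trivial in-memory package (hypothetical, not part of the distribution); C<count_documents> uses the same generic strategy as the base class, iterating once and rewinding:

```perl
use strict;
use warnings;

package Sketch::ArrayCollection;
sub new    { my ($class, @docs) = @_; bless { docs => \@docs, pos => 0 }, $class }
sub next   { my $self = shift; return undef if $self->{pos} >= @{$self->{docs}}; return $self->{docs}[$self->{pos}++] }
sub rewind { $_[0]{pos} = 0 }

# Same counting strategy as Collection::count_documents: iterate, then rewind
sub count_documents {
    my $self = shift;
    $self->rewind;
    my $count = 0;
    $count++ while defined $self->next;
    $self->rewind;
    return $count;
}

package main;
my $c = Sketch::ArrayCollection->new(qw(a b c));
```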
=head1 METHODS
lib/AI/Categorizer/Collection/DBI.pm
package AI::Categorizer::Collection::DBI;
use strict;
use DBI;
use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
connection_string => {type => SCALAR, default => undef},
dbh => {isa => 'DBI::db', default => undef},
select_statement => {type => SCALAR, default => "SELECT text FROM documents"},
);
__PACKAGE__->contained_objects
(
document => { class => 'AI::Categorizer::Document',
delayed => 1 },
);
sub new {
my $class = shift;
my $self = $class->SUPER::new(@_);
die "Must provide 'dbh' or 'connection_string' arguments"
unless $self->{dbh} or $self->{connection_string};
unless ($self->{dbh}) {
$self->{dbh} = DBI->connect($self->{connection_string}, '', '', {RaiseError => 1})
or die DBI->errstr;
delete $self->{connection_string};
}
$self->rewind;
return $self;
}
sub dbh { shift()->{dbh} }
sub rewind {
my $self = shift;
if (!$self->{sth}) {
$self->{sth} = $self->dbh->prepare($self->{select_statement});
}
if ($self->{sth}{Active}) {
$self->{sth}->finish;
}
$self->{sth}->execute;
}
sub next {
my $self = shift;
my @result = $self->{sth}->fetchrow_array;
return undef unless @result;
return $self->create_delayed_object('document',
name => $result[0],
categories => [$result[1]],
content => $result[2],
);
}
1;
lib/AI/Categorizer/Collection/Files.pm
package AI::Categorizer::Collection::Files;
use strict;
use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);
use Params::Validate qw(:types);
use File::Spec;
__PACKAGE__->valid_params
(
path => { type => SCALAR|ARRAYREF },
recurse => { type => BOOLEAN, default => 0 },
);
sub new {
my $class = shift;
my $self = $class->SUPER::new(@_);
$self->{dir_fh} = do {local *FH; *FH}; # double *FH avoids a warning
# Documents are contained in a directory, or list of directories
$self->{path} = [$self->{path}] unless ref $self->{path};
$self->{used} = [];
$self->_next_path;
return $self;
}
sub _next_path {
my $self = shift;
closedir $self->{dir_fh} if $self->{cur_dir};
$self->{cur_dir} = shift @{$self->{path}};
push @{$self->{used}}, $self->{cur_dir};
opendir $self->{dir_fh}, $self->{cur_dir} or die "$self->{cur_dir}: $!";
}
sub next {
my $self = shift;
my $file = $self->_read_file;
return unless defined $file;
warn "No category information about '$file'" unless defined $self->{category_hash}{$file};
my @cats = map AI::Categorizer::Category->by_name(name => $_), @{ $self->{category_hash}{$file} || [] };
return $self->call_method('document', 'read',
path => File::Spec->catfile($self->{cur_dir}, $file),
name => $file,
categories => \@cats,
);
}
sub _read_file {
my ($self) = @_;
my $file = readdir $self->{dir_fh};
if (!defined $file) { # Directory has been exhausted
return undef unless @{$self->{path}};
$self->_next_path;
return $self->_read_file;
} elsif ($file eq '.' or $file eq '..') {
return $self->_read_file;
} elsif (-d (my $path = File::Spec->catdir($self->{cur_dir}, $file))) {
push @{$self->{path}}, $path # Add for later processing
if $self->{recurse} and !grep {$_ eq $path} @{$self->{path}}, @{$self->{used}};
return $self->_read_file;
}
return $file;
}
sub rewind {
my $self = shift;
push @{$self->{path}}, @{$self->{used}};
@{$self->{used}} = ();
$self->_next_path;
}
# This should share an iterator with next()
sub count_documents {
my $self = shift;
return $self->{document_count} if defined $self->{document_count};
$self->rewind;
my $count = 0;
$count++ while defined $self->_read_file;
$self->rewind;
return $self->{document_count} = $count;
}
1;
__END__
=head1 NAME
AI::Categorizer::Collection::Files - One document per file
=head1 SYNOPSIS
my $c = new AI::Categorizer::Collection::Files
(path => '/tmp/docs/training',
category_file => '/tmp/docs/cats.txt');
print "Total number of docs: ", $c->count_documents, "\n";
while (my $document = $c->next) {
...
}
$c->rewind; # For further operations
=head1 DESCRIPTION
This implements a Collection class in which each document exists as a
single file on a filesystem. The documents can exist in a single
directory, or in several directories.
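The one-document-per-file layout this class expects can be demonstrated self-contained: create a temporary directory holding a file per document, then list it the way the iterator does, skipping C<.>, C<..>, and subdirectories (file names here are hypothetical):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

my $dir = tempdir(CLEANUP => 1);
for my $name (qw(doc1.txt doc2.txt)) {
    open my $fh, '>', File::Spec->catfile($dir, $name) or die $!;
    print $fh "sample text for $name\n";
    close $fh;
}
mkdir File::Spec->catdir($dir, 'subdir') or die $!;

# List plain files only, as the collection iterator effectively does
opendir my $dh, $dir or die "$dir: $!";
my @files = sort grep { !/^\.\.?$/ && !-d File::Spec->catdir($dir, $_) } readdir $dh;
closedir $dh;
```

With C<recurse> enabled, the real class would instead queue C<subdir> for later processing rather than skipping it.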
=head1 METHODS
This is a subclass of the abstract AI::Categorizer::Collection class,
so any methods mentioned in its documentation are available here.
lib/AI/Categorizer/Collection/InMemory.pm
package AI::Categorizer::Collection::InMemory;
use strict;
use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
data => { type => HASHREF },
);
sub new {
my $self = shift()->SUPER::new(@_);
while (my ($name, $params) = each %{$self->{data}}) {
foreach (@{$params->{categories}}) {
next if ref $_;
$_ = AI::Categorizer::Category->by_name(name => $_);
}
}
return $self;
}
sub next {
my $self = shift;
my ($name, $params) = each %{$self->{data}} or return;
return AI::Categorizer::Document->new(name => $name, %$params);
}
sub rewind {
my $self = shift;
scalar keys %{$self->{data}};
return;
}
sub count_documents {
my $self = shift;
return scalar keys %{$self->{data}};
}
1;
lib/AI/Categorizer/Collection/SingleFile.pm
package AI::Categorizer::Collection::SingleFile;
use strict;
use AI::Categorizer::Collection;
use base qw(AI::Categorizer::Collection);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
path => { type => SCALAR|ARRAYREF },
categories => { type => HASHREF|UNDEF, default => undef },
delimiter => { type => SCALAR },
);
__PACKAGE__->contained_objects
(
document => { class => 'AI::Categorizer::Document::Text',
delayed => 1 },
);
sub new {
my $class = shift;
my $self = $class->SUPER::new(@_);
$self->{fh} = do {local *FH; *FH}; # double *FH avoids a warning
# Documents are contained in a file, or list of files
$self->{path} = [$self->{path}] unless ref $self->{path};
$self->{used} = [];
$self->_next_path;
return $self;
}
sub _next_path {
my $self = shift;
close $self->{fh} if $self->{cur_file};
push @{$self->{used}}, shift @{$self->{path}};
$self->{cur_file} = $self->{used}[-1];
open $self->{fh}, "< $self->{cur_file}" or die "$self->{cur_file}: $!";
}
sub next {
my $self = shift;
my $fh = $self->{fh}; # Must put in a simple scalar
my $content = do {local $/ = $self->{delimiter}; <$fh>};
if (!defined $content) { # File has been exhausted
unless (@{$self->{path}}) { # All files have been exhausted
$self->{fh} = undef;
return undef;
}
$self->_next_path;
return $self->next;
} elsif ($content =~ /^\s*\Q$self->{delimiter}\E$/) { # Skip empty docs (quote any regex metacharacters in the delimiter)
return $self->next;
}
#warn "doc is $content";
#warn "creating document=>@{[ %{$self->{container}{delayed}{document}} ]}";
return $self->create_delayed_object('document', content => $content);
}
sub count_documents {
my ($self) = @_;
return $self->{document_count} if defined $self->{document_count};
$self->rewind;
my $count = 0;
local $/ = $self->{delimiter};
my $fh = $self->{fh};
while (1) {
$count++ while <$fh>;
last unless @{$self->{path}};
$self->_next_path;
}
$self->rewind;
return $self->{document_count} = $count;
}
sub rewind {
my ($self) = @_;
close $self->{fh} if $self->{cur_file};
unshift @{$self->{path}}, @{$self->{used}};
$self->{used} = [];
$self->_next_path;
}
1;
lib/AI/Categorizer/Document.pm
use strict;
use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);
use AI::Categorizer::ObjectSet;
use AI::Categorizer::FeatureVector;
__PACKAGE__->valid_params
(
name => {
type => SCALAR,
},
categories => {
type => ARRAYREF,
default => [],
callbacks => { 'all are Category objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Category'), @{$_[0]} },
},
public => 0,
},
stopwords => {
type => ARRAYREF|HASHREF,
default => {},
},
content => {
type => HASHREF|SCALAR,
default => undef,
},
parse => {
type => SCALAR,
optional => 1,
},
parse_handle => {
type => HANDLE,
optional => 1,
},
features => {
isa => 'AI::Categorizer::FeatureVector',
optional => 1,
},
content_weights => {
type => HASHREF,
default => {},
},
front_bias => {
type => SCALAR,
default => 0,
},
use_features => {
type => HASHREF|UNDEF,
default => undef,
},
stemming => {
type => SCALAR|UNDEF,
optional => 1,
},
stopword_behavior => {
type => SCALAR,
default => "stem",
},
);
__PACKAGE__->contained_objects
(
features => { delayed => 1,
class => 'AI::Categorizer::FeatureVector' },
);
### Constructors
my $NAME = 'a';
sub new {
my $pkg = shift;
my $self = $pkg->SUPER::new(name => $NAME++, # Use a default name
@_);
# Get efficient internal data structures
$self->{categories} = new AI::Categorizer::ObjectSet( @{$self->{categories}} );
$self->_fix_stopwords;
# A few different ways for the caller to initialize the content
if (exists $self->{parse}) {
$self->parse(content => delete $self->{parse});
} elsif (exists $self->{parse_handle}) {
$self->parse_handle(handle => delete $self->{parse_handle});
} elsif (defined $self->{content}) {
# Allow a simple string as the content
$self->{content} = { body => $self->{content} } unless ref $self->{content};
}
$self->finish if $self->{content};
return $self;
}
sub _fix_stopwords {
my $self = shift;
# Convert to hash
$self->{stopwords} = { map {($_ => 1)} @{ $self->{stopwords} } }
if UNIVERSAL::isa($self->{stopwords}, 'ARRAY');
my $s = $self->{stopwords};
# May need to perform stemming on the stopwords
return unless keys %$s; # No point in doing anything if there are no stopwords
return unless $self->{stopword_behavior} eq 'stem';
return if !defined($self->{stemming}) or $self->{stemming} eq 'none';
return if $s->{___stemmed};
my @keys = keys %$s;
%$s = ();
$self->stem_words(\@keys);
$s->{$_} = 1 foreach @keys;
# This flag is attached to the stopword structure itself so that
# other documents will notice it.
$s->{___stemmed} = 1;
}
sub finish {
my $self = shift;
$self->create_feature_vector;
# Now we're done with all the content stuff
delete @{$self}{'content', 'content_weights', 'stopwords', 'use_features'};
}
# Parse a document format - a virtual method
sub parse;
sub parse_handle {
my ($self, %args) = @_;
my $fh = $args{handle} or die "No 'handle' argument given to parse_handle()";
return $self->parse( content => join '', <$fh> );
}
### Accessors
sub name { $_[0]->{name} }
sub stopword_behavior { $_[0]->{stopword_behavior} }
sub features {
my $self = shift;
if (@_) {
$self->{features} = shift;
}
return $self->{features};
}
sub categories {
my $c = $_[0]->{categories};
return wantarray ? $c->members : $c->size;
}
### Workers
sub create_feature_vector {
my $self = shift;
my $content = $self->{content};
my $weights = $self->{content_weights};
die "'stopword_behavior' must be one of 'stem', 'no_stem', or 'pre_stemmed'"
unless $self->{stopword_behavior} =~ /^(?:stem|no_stem|pre_stemmed)$/;
$self->{features} = $self->create_delayed_object('features');
while (my ($name, $data) = each %$content) {
my $t = $self->tokenize($data);
$t = $self->_filter_tokens($t) if $self->{stopword_behavior} eq 'no_stem';
$self->stem_words($t);
$t = $self->_filter_tokens($t) if $self->{stopword_behavior} =~ /^(?:stem|pre_stemmed)$/;
my $h = $self->vectorize(tokens => $t, weight => exists($weights->{$name}) ? $weights->{$name} : 1 );
$self->{features}->add($h);
}
}
sub is_in_category {
return (ref $_[1]
? $_[0]->{categories}->includes( $_[1] )
: $_[0]->{categories}->includes_name( $_[1] ));
}
sub tokenize {
my $self = shift;
my @tokens;
while ($_[0] =~ /([-\w]+)/g) {
my $word = lc $1;
next unless $word =~ /[a-z]/;
$word =~ s/^[^a-z]+//; # Trim leading non-alpha characters (helps with ordinals)
push @tokens, $word;
}
return \@tokens;
}
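The tokenizer above can be exercised standalone. This sketch duplicates its regex and filtering so the behavior is visible: lowercasing, dropping purely numeric tokens, and trimming leading non-alpha characters (so an ordinal like "2nd" becomes "nd"):

```perl
use strict;
use warnings;

# Standalone replica of Document::tokenize
sub tokenize_sketch {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /([-\w]+)/g) {
        my $word = lc $1;
        next unless $word =~ /[a-z]/;   # drop purely numeric tokens
        $word =~ s/^[^a-z]+//;          # trim leading non-alpha ("2nd" -> "nd")
        push @tokens, $word;
    }
    return \@tokens;
}

my $tokens = tokenize_sketch("The 2nd try: 100 cats!");
# yields: the nd try cats
```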
sub stem_words {
my ($self, $tokens) = @_;
return unless $self->{stemming};
return if $self->{stemming} eq 'none';
die "Unknown stemming option '$self->{stemming}' - options are 'porter' or 'none'"
unless $self->{stemming} eq 'porter';
eval {require Lingua::Stem; 1}
or die "Porter stemming requires the Lingua::Stem module, available from CPAN.\n";
@$tokens = @{ Lingua::Stem::stem(@$tokens) };
}
sub _filter_tokens {
my ($self, $tokens_in) = @_;
if ($self->{use_features}) {
my $f = $self->{use_features}->as_hash;
return [ grep exists($f->{$_}), @$tokens_in ];
} elsif ($self->{stopwords} and keys %{$self->{stopwords}}) {
my $s = $self->{stopwords};
return [ grep !exists($s->{$_}), @$tokens_in ];
}
return $tokens_in;
}
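A quick standalone illustration of the stopword branch above: once the stopword list has been converted to a hash (as the constructor does), filtering is a single C<grep>:

```perl
use strict;
use warnings;

my %stopwords = map { $_ => 1 } qw(the a of);   # hypothetical stopword list
my @tokens    = qw(the cat of the house);
my @kept      = grep { !exists $stopwords{$_} } @tokens;
# @kept is now ('cat', 'house')
```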
sub _weigh_tokens {
my ($self, $tokens, $weight) = @_;
my %counts;
if (my $b = 0+$self->{front_bias}) {
die "'front_bias' value must be between -1 and 1"
unless -1 < $b and $b < 1;
my $n = @$tokens;
my $r = ($b-1)**2 / ($b+1);
my $mult = $weight * log($r)/($r-1);
my $i = 0;
foreach my $feature (@$tokens) {
$counts{$feature} += $mult * $r**($i/$n);
$i++;
}
} else {
foreach my $feature (@$tokens) {
$counts{$feature} += $weight;
}
}
return \%counts;
}
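The C<front_bias> branch above distributes weight along an exponential curve: for a bias b in (-1, 1) it sets r = (b-1)^2/(b+1) and gives the i-th of n tokens weight w * log(r)/(r-1) * r^(i/n). When b > 0, r < 1, so early tokens weigh more, while the total weight stays close to w * n. A standalone sketch of the same arithmetic (same formula as C<_weigh_tokens>; the token names are hypothetical):

```perl
use strict;
use warnings;

my $b      = 0.5;                      # front_bias, must be in (-1, 1)
my $weight = 1;
my @tokens = map { "t$_" } 0 .. 99;    # 100 hypothetical distinct tokens

my %counts;
my $n    = @tokens;
my $r    = ($b - 1)**2 / ($b + 1);     # 1/6 for b = 0.5
my $mult = $weight * log($r) / ($r - 1);
my $i    = 0;
for my $feature (@tokens) {
    $counts{$feature} += $mult * $r**($i / $n);
    $i++;
}

my $total = 0;
$total += $_ for values %counts;       # roughly $weight * $n
```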
sub vectorize {
my ($self, %args) = @_;
if ($self->{stem_stopwords}) {
# Stem the stopword list in place with stem_words(), then rebuild the hash
my @stopwords = keys %{$self->{stopwords}};
$self->stem_words(\@stopwords);
$self->{stopwords} = { map {+$_, 1} @stopwords };
$args{tokens} = $self->_filter_tokens($args{tokens});
}
return $self->_weigh_tokens($args{tokens}, $args{weight});
}
sub read {
my ($class, %args) = @_;
my $path = delete $args{path} or die "Must specify 'path' argument to read()";
my $self = $class->new(%args);
open my($fh), "< $path" or die "$path: $!";
$self->parse_handle(handle => $fh);
close $fh;
$self->finish;
return $self;
}
sub dump_features {
my ($self, %args) = @_;
my $path = $args{path} or die "No 'path' argument given to dump_features()";
open my($fh), "> $path" or die "Can't create $path: $!";
my $f = $self->features->as_hash;
while (my ($k, $v) = each %$f) {
print $fh "$k\t$v\n";
}
}
1;
__END__
=head1 NAME
AI::Categorizer::Document - Embodies a document
=head1 SYNOPSIS
use AI::Categorizer::Document;
# Simplest way to create a document:
my $d = new AI::Categorizer::Document(name => $string,
content => $string);
# Other parameters are accepted:
my $d = new AI::Categorizer::Document(name => $string,
categories => \@category_objects,
content => { subject => $string,
body => $string2, ... },
content_weights => { subject => 3,
body => 1, ... },
stopwords => \%skip_these_words,
stemming => $string,
front_bias => $float,
use_features => $feature_vector,
);
# Specify explicit feature vector:
my $d = new AI::Categorizer::Document(name => $string);
$d->features( $feature_vector );
# Now pass the document to a categorization algorithm:
my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
my $hypothesis = $learner->categorize($document);
=head1 DESCRIPTION
The Document class embodies the data in a single document, and
contains methods for turning this data into a FeatureVector. Usually
documents are plain text, but subclasses of the Document class may
handle any kind of data.
=head1 METHODS
lib/AI/Categorizer/Document/SMART.pm
package AI::Categorizer::Document::SMART;
use strict;
use AI::Categorizer::Document;
use base qw(AI::Categorizer::Document);
sub parse {
my ($self, %args) = @_;
$args{content} =~ s{
^(?:\.I)?\s+(\d+)\n # ID number - becomes document name
\.C\n
([^\n]+)\n # Categories
\.T\n
(.+)\n+ # Title
\.W\n
}
{}sx
or die "Malformed record: $args{content}";
my ($id, $categories, $title) = ($1, $2, $3);
$self->{name} = $id;
$self->{content} = { title => $title,
body => $args{content} };
my @categories = $categories =~ m/(.*?)\s+\d+[\s;]*/g;
@categories = map AI::Categorizer::Category->by_name(name => $_), @categories;
$self->{categories} = \@categories;
}
1;
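The record layout C<parse()> expects can be illustrated with a hypothetical SMART-style record; the field markers C<.I>, C<.C>, C<.T>, and C<.W> introduce the ID, categories, title, and body. This sketch runs the same regexes standalone:

```perl
use strict;
use warnings;

# Hypothetical record; the category field pairs each name with a number
my $content = ".I 18\n.C\nCOMP.OS 4;\n.T\nSample Title\n.W\nBody text.\n";

$content =~ s{
    ^(?:\.I)?\s+(\d+)\n   # ID number - becomes document name
    \.C\n
    ([^\n]+)\n            # Categories
    \.T\n
    (.+)\n+               # Title
    \.W\n
}{}sx or die "Malformed record";
my ($id, $cats, $title) = ($1, $2, $3);

# Strip the trailing numbers, keeping just the category names
my @categories = $cats =~ m/(.*?)\s+\d+[\s;]*/g;
# $content now holds only the body text
```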
lib/AI/Categorizer/Document/Text.pm
use AI::Categorizer::Document;
use base qw(AI::Categorizer::Document);
#use Params::Validate qw(:types);
#use AI::Categorizer::ObjectSet;
#use AI::Categorizer::FeatureVector;
### Constructors
sub parse {
my ($self, %args) = @_;
$self->{content} = { body => $args{content} };
}
1;
lib/AI/Categorizer/Document/XML.pm
package AI::Categorizer::Document::XML;
use strict;
use AI::Categorizer::Document;
use base qw(AI::Categorizer::Document);
use XML::SAX;
__PACKAGE__->contained_objects
(
xml_handler => 'AI::Categorizer::Document::XML::Handler',
);
### Constructors
sub parse {
my ($self, %args) = @_;
# The XML content, as a string
my $body= $args{content};
# A hash mapping element names to their duplication weights
my $elementWeight= $args{elementWeight};
# Construct a handler that receives element, character-data, comment,
# and processing-instruction events and accumulates their text into a
# buffer
my $xmlHandler = $self->create_contained_object('xml_handler', weights => $elementWeight);
# Construct the parser
my $xmlParser= XML::SAX::ParserFactory->parser(Handler => $xmlHandler);
# Parse the XML; the handler's methods are called as events arrive
$xmlParser->parse_string($body);
# Extract the accumulated text from the handler
$body= $xmlHandler->getContent;
# Construct the content structure for the Document object and return it
return { body => $body };
}
##########################################################################
package AI::Categorizer::Document::XML::Handler;
use strict;
use base qw(XML::SAX::Base);
# Input: a hash of element weights
# Output: an object of this class
# Description: constructor
sub new{
my ($class, %args) = @_;
# call the superclass constructor (XML::SAX::Base)
my $self = $class->SUPER::new;
# Save the element weights, a hash of <elementName, weight> pairs.
# A weight is the number of times the text of the corresponding
# element is duplicated; it is supplied by the caller at construction
# and used by end_element().
$self->{weightHash} = $args{weights};
# Buffer for the data produced by text nodes, CDATA sections, etc.
$self->{content} = '';
# Stores, for each element from the root down to the element currently
# being visited, the offset in the content buffer where that element's
# content begins (only entries 0..levelPointer-1 are valid).  At
# end_element() this offset identifies all the data the element
# produced, which is what gets duplicated or removed per its weight.
$self->{locationArray} = [];
return $self;
}
# Input: None
# Output: None
# Description:
# Called once, when the parser begins the document.
# Initializes the traversal state and the content buffer.
sub start_document{
my ($self, $doc)= @_;
# The level (depth) of the element currently being visited in the XML
# tree.  start_element() is called in preorder, so this tracks the
# level of the current element; the first (root) element is level 0.
$self->{levelPointer} = 0;
# All extracted data will be saved into here; initially empty
$self->{content} = "";
#$self->SUPER::start_document($doc);
}
# Input: None
# Output: None
# Description:
# Called once, when the parser finishes the document.  Nothing to do.
sub end_document{
my ($self, $doc)= @_;
#$self->SUPER::end_document($doc);
}
# Input: the element structure $el, with fields:
# LocalName $el->{LocalName}
# NamespaceURI $el->{NamespaceURI}
# Name $el->{Name}
# Prefix $el->{Prefix}
# Attributes $el->{Attributes}
# (each attribute in turn has LocalName, NamespaceURI, Name, Prefix,
# and Value fields)
# Output: None
# Description:
# Called when the parser encounters an element's start tag
sub start_element{
my ($self, $el)= @_;
# Find the current end of the content buffer; this element's data will
# be appended starting at this offset
my $location= length $self->{content};
# Remember the offset so that end_element() can recover where this
# element's content begins
$self->{locationArray}[$self->{levelPointer}] = $location;
# Descend one level for the next (nested) element
$self->{levelPointer}++;
#$self->SUPER::start_document($el);
}
# Input: None
# Output: None
# Description:
# Called when the parser encounters an element's end tag
sub end_element{
my ($self, $el)= @_;
$self->{levelPointer}--;
my $location= $self->{locationArray}[$self->{levelPointer}];
# Find the name of the element
my $elementName= $el->{Name};
# Set the default weight
my $weight= 1;
# Check whether the user assigned this element a duplication weight
$weight= $self->{weightHash}{$elementName} if exists $self->{weightHash}{$elementName};
# 0 - remove all the data produced by this element
if($weight == 0){
$self->{content} = substr($self->{content}, 0, $location);
return;
}
# 1 - don't duplicate
if($weight == 1){
return;
}
# n - duplicate this element's data n times in total
# get the new content
my $newContent= substr($self->{content}, $location);
# append n-1 extra copies
for(my $i=1; $i<$weight;$i++){
$self->{content} .= $newContent;
}
#$self->SUPER::end_document($el);
}
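The offset trick shared by start_element/end_element above can be shown in miniature: content appended since an element opened starts at a remembered offset, so weight 0 truncates it away and weight n appends n-1 extra copies. A standalone sketch with hypothetical strings:

```perl
use strict;
use warnings;

my $content  = "intro ";
my $location = length $content;    # remembered at start_element time
$content    .= "title text ";      # characters() arriving inside the element
my $weight   = 3;                  # duplication weight for this element

if ($weight == 0) {
    # discard everything this element produced
    $content = substr($content, 0, $location);
}
elsif ($weight > 1) {
    # text this element produced, appended weight-1 extra times
    my $new = substr($content, $location);
    $content .= $new for 2 .. $weight;
}
# $content is now "intro title text title text title text "
```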
# Input: a hash with a Data entry
# Output: None
# Description:
# Called whenever the parser encounters text (from text nodes, CDATA
# sections, etc.); the data is appended to the content buffer
sub characters{
my ($self, $args)= @_;
# save the data plus a newline into the content buffer
$self->{content} .= "$args->{Data}\n";
}
# Input: a hash with a Data entry
# Output: None
# Description:
# Called whenever the parser encounters a comment; currently ignored
sub comment{
my ($self, $args)= @_;
}
# Input: a hash with Data and Target entries
# Output: None
# Description:
# Called whenever the parser encounters a processing instruction;
# currently ignored
sub processing_instruction{
my ($self, $args)= @_;
}
# Input: None
# Output: the accumulated content string
# Description:
# Return the content buffer
sub getContent{
my ($self)= @_;
return $self->{content};
}
1;
__END__
lib/AI/Categorizer/Experiment.pm
use strict;
use Class::Container;
use AI::Categorizer::Storable;
use Statistics::Contingency;
use base qw(Class::Container AI::Categorizer::Storable Statistics::Contingency);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
categories => { type => ARRAYREF|HASHREF },
sig_figs => { type => SCALAR, default => 4 },
);
sub new {
my $package = shift;
my $self = $package->Class::Container::new(@_);
$self->{$_} = 0 foreach qw(a b c d);
my $c = delete $self->{categories};
$self->{categories} = { map {($_ => {a=>0, b=>0, c=>0, d=>0})}
UNIVERSAL::isa($c, 'HASH') ? keys(%$c) : @$c
};
return $self;
}
sub add_hypothesis {
my ($self, $h, $correct, $name) = @_;
die "No hypothesis given to add_hypothesis()" unless $h;
$name = $h->document_name unless defined $name;
$self->add_result([$h->categories], $correct, $name);
}
sub stats_table {
my $self = shift;
$self->SUPER::stats_table($self->{sig_figs});
}
1;
__END__
=head1 NAME
AI::Categorizer::Experiment - Coordinate experimental results
=head1 SYNOPSIS
use AI::Categorizer::Experiment;
my $e = new AI::Categorizer::Experiment(categories => \%categories);
my $l = AI::Categorizer::Learner->restore_state(...path...);
while (my $d = ... get document ...) {
my $h = $l->categorize($d); # A Hypothesis
$e->add_hypothesis($h, [map $_->name, $d->categories]);
}
print "Micro F1: ", $e->micro_F1, "\n"; # Access a single statistic
print $e->stats_table; # Show several stats in table form
=head1 DESCRIPTION
The C<AI::Categorizer::Experiment> class helps you organize the
results of categorization experiments. As you get lots of
categorization results (Hypotheses) back from the Learner, you can
feed these results to the Experiment class, along with the correct
answers. When all results have been collected, you can get a report
on accuracy, precision, recall, F1, and so on, with both
macro-averaging and micro-averaging over categories.
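The a/b/c/d slots initialized in C<new()> above follow the contingency-table convention of Statistics::Contingency. The following standalone sketch (all counts are made up for illustration) shows the per-category arithmetic the report is built from:

```perl
use strict;
use warnings;

# Hypothetical contingency counts for one category, in the same
# convention as the a/b/c/d slots above: a = assigned & correct,
# b = assigned & incorrect, c = not assigned but correct, d = neither.
my %counts = (a => 30, b => 10, c => 20, d => 940);

my $precision = $counts{a} / ($counts{a} + $counts{b});   # 0.75
my $recall    = $counts{a} / ($counts{a} + $counts{c});   # 0.60
my $F1        = 2 * $precision * $recall / ($precision + $recall);

printf "P=%.3f R=%.3f F1=%.3f\n", $precision, $recall, $F1;
```

Macro-averaging averages such per-category scores; micro-averaging pools the raw a/b/c/d counts across all categories before computing the ratios.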
lib/AI/Categorizer/FeatureSelector/CategorySelector.pm
package AI::Categorizer::FeatureSelector::CategorySelector;
use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector);
use Params::Validate qw(:types);
__PACKAGE__->contained_objects
(
features => { class => 'AI::Categorizer::FeatureVector',
delayed => 1 },
);
1;
sub reduction_function;
# figure out the feature set before reading collection (default)
sub scan_features {
my ($self, %args) = @_;
my $c = $args{collection} or
die "No 'collection' parameter provided to scan_features()";
return unless $self->{features_kept};
my %cat_features;
my $coll_features = $self->create_delayed_object('features');
my $nbDocuments = 0;
while (my $doc = $c->next) {
$nbDocuments++;
$args{prog_bar}->() if $args{prog_bar};
my $docfeatures = $doc->features->as_hash;
foreach my $cat ($doc->categories) {
my $catname = $cat->name;
if(!(exists $cat_features{$catname})) {
$cat_features{$catname} = $self->create_delayed_object('features');
}
$cat_features{$catname}->add($docfeatures);
}
$coll_features->add( $docfeatures );
}
print STDERR "\n* Computing Chi-Square values\n" if $self->verbose;
my $r_features = $self->create_delayed_object('features');
my @terms = $coll_features->names;
my $progressBar = $self->prog_bar(scalar @terms);
my $allFeaturesSum = $coll_features->sum;
my %cat_features_sum;
while( my($catname,$features) = each %cat_features ) {
$cat_features_sum{$catname} = $features->sum;
}
foreach my $term (@terms) {
$progressBar->();
$r_features->{features}{$term} = $self->reduction_function($term,
$nbDocuments,$allFeaturesSum,$coll_features,
\%cat_features,\%cat_features_sum);
}
print STDERR "\n" if $self->verbose;
my $new_features = $self->reduce_features($r_features);
return $coll_features->intersection( $new_features );
}
# calculate feature set after reading collection (scan_first=0)
sub rank_features {
die "CategorySelector->rank_features is not implemented yet!";
# my ($self, %args) = @_;
#
# my $k = $args{knowledge_set}
# or die "No knowledge_set parameter provided to rank_features()";
#
# my %freq_counts;
# foreach my $name ($k->features->names) {
# $freq_counts{$name} = $k->document_frequency($name);
# }
# return $self->create_delayed_object('features', features => \%freq_counts);
}
# copied from KnowledgeSet->prog_bar by Ken Williams
sub prog_bar {
my ($self, $count) = @_;
return sub {} unless $self->verbose;
return sub { print STDERR '.' } unless eval "use Time::Progress; 1";
my $pb = 'Time::Progress'->new;
$pb->attr(max => $count);
my $i = 0;
return sub {
$i++;
return if $i % 25;
print STDERR $pb->report("%50b %p ($i/$count)\r", $i);
};
}
__END__
=head1 NAME
AI::Categorizer::FeatureSelector::CategorySelector - Abstract Category Selection class
=head1 SYNOPSIS
lib/AI/Categorizer/FeatureSelector/ChiSquare.pm
use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector::CategorySelector);
use Params::Validate qw(:types);
# Chi-Square function
# NB: this could probably be optimised a bit...
sub reduction_function {
my ($self,$term,$N,$allFeaturesSum,
$coll_features,$cat_features,$cat_features_sum) = @_;
my $CHI2SUM = 0;
my $nbcats = 0;
foreach my $catname (keys %{$cat_features}) {
# while ( my ($catname,$catfeatures) = each %{$cat_features}) {
my ($A,$B,$C,$D); # A = number of times where t and c co-occur
# B = " " " t occurs without c
# C = " " " c occurs without t
# D = " " " neither c nor t occur
$A = $cat_features->{$catname}->value($term);
$B = $coll_features->value($term) - $A;
$C = $cat_features_sum->{$catname} - $A;
$D = $allFeaturesSum - ($A+$B+$C);
my $ADminCB = ($A*$D)-($C*$B);
my $CHI2 = $N*$ADminCB*$ADminCB / (($A+$C)*($B+$D)*($A+$B)*($C+$D));
$CHI2SUM += $CHI2;
$nbcats++;
}
return $CHI2SUM/$nbcats;
}
1;
__END__
=head1 NAME
AI::Categorizer::FeatureSelector::ChiSquare - ChiSquare Feature Selection class
=head1 SYNOPSIS
# the recommended way to use this class is to let the KnowledgeSet
# instantiate it
use AI::Categorizer::KnowledgeSetSMART;
my $ksetCHI = new AI::Categorizer::KnowledgeSetSMART(
tfidf_notation =>'Categorizer',
feature_selection=>'chi_square', ...other parameters...);
# however it is also possible to pass an instance to the KnowledgeSet
use AI::Categorizer::KnowledgeSet;
use AI::Categorizer::FeatureSelector::ChiSquare;
my $ksetCHI = new AI::Categorizer::KnowledgeSet(
feature_selector => new ChiSquare(features_kept=>2000,verbose=>1),
...other parameters...
);
=head1 DESCRIPTION
Feature selection with the ChiSquare function.
                          N . (AD - CB)^2
 Chi-Square(t,ci) = -------------------------
                     (A+C).(B+D).(A+B).(C+D)

 where t  = term
       ci = category i
       N  = number of documents in the collection
       A  = number of times t and ci co-occur
       B  = number of times t occurs without ci
       C  = number of times ci occurs without t
       D  = number of times neither ci nor t occurs

For more details, see:
Yiming Yang, Jan O. Pedersen, A Comparative Study on Feature Selection
in Text Categorization, in Proceedings of ICML-97, 14th International
Conference on Machine Learning, 1997.
(available on citeseer.nj.nec.com)
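Plugging illustrative counts into the formula gives a quick sanity check. This standalone sketch (the counts are made up) mirrors the arithmetic in C<reduction_function>:

```perl
use strict;
use warnings;

# Made-up counts for one (term, category) pair:
#   A = docs where t and ci co-occur, B = t without ci,
#   C = ci without t,                 D = neither.
my ($A, $B, $C, $D) = (40, 60, 10, 890);
my $N = $A + $B + $C + $D;

my $num  = $N * ($A * $D - $C * $B) ** 2;
my $den  = ($A + $C) * ($B + $D) * ($A + $B) * ($C + $D);
my $chi2 = $den ? $num / $den : 0;   # guard against an empty row/column

printf "Chi-Square(t,ci) = %.2f\n", $chi2;   # 286.55
```

Note that C<reduction_function> above feeds feature-value sums rather than document counts into A..D, so its figures are a frequency-weighted variant of this document-count formula.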
=head1 METHODS
=head1 AUTHOR
lib/AI/Categorizer/FeatureSelector/DocFrequency.pm
package AI::Categorizer::FeatureSelector::DocFrequency;
use strict;
use AI::Categorizer::FeatureSelector;
use base qw(AI::Categorizer::FeatureSelector);
use Params::Validate qw(:types);
use Carp qw(croak);
__PACKAGE__->contained_objects
(
features => { class => 'AI::Categorizer::FeatureVector',
delayed => 1 },
);
# The KnowledgeSet keeps track of document frequency, so just use that.
sub rank_features {
my ($self, %args) = @_;
my $k = $args{knowledge_set} or die "No knowledge_set parameter provided to rank_features()";
my %freq_counts;
foreach my $name ($k->features->names) {
$freq_counts{$name} = $k->document_frequency($name);
}
return $self->create_delayed_object('features', features => \%freq_counts);
}
sub scan_features {
my ($self, %args) = @_;
my $c = $args{collection} or die "No 'collection' parameter provided to scan_features()";
my $doc_freq = $self->create_delayed_object('features');
while (my $doc = $c->next) {
$args{prog_bar}->() if $args{prog_bar};
$doc_freq->add( $doc->features->as_boolean_hash );
}
print "\n" if $self->verbose;
return $self->reduce_features($doc_freq);
}
1;
__END__
=head1 NAME
AI::Categorizer::FeatureSelector::DocFrequency - Document-frequency Feature Selection class
=head1 SYNOPSIS
...
=head1 DESCRIPTION
The KnowledgeSet class provides an interface to a set of
documents, a set of categories, and a mapping between the two. Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.
=head1 METHODS
lib/AI/Categorizer/FeatureVector.pm
package AI::Categorizer::FeatureVector;
sub new {
my ($package, %args) = @_;
$args{features} ||= {};
return bless {features => $args{features}}, $package;
}
sub names {
my $self = shift;
return keys %{$self->{features}};
}
sub set {
my $self = shift;
$self->{features} = (ref $_[0] ? $_[0] : {@_});
}
sub as_hash {
my $self = shift;
return $self->{features};
}
sub euclidean_length {
my $self = shift;
my $f = $self->{features};
my $total = 0;
foreach (values %$f) {
$total += $_**2;
}
return sqrt($total);
}
sub normalize {
my $self = shift;
my $length = $self->euclidean_length;
return $length ? $self->scale(1/$length) : $self;
}
sub scale {
my ($self, $scalar) = @_;
$_ *= $scalar foreach values %{$self->{features}};
return $self;
}
sub as_boolean_hash {
my $self = shift;
return { map {($_ => 1)} keys %{$self->{features}} };
}
sub length {
my $self = shift;
return scalar keys %{$self->{features}};
}
sub clone {
my $self = shift;
return ref($self)->new( features => { %{$self->{features}} } );
}
sub intersection {
my ($self, $other) = @_;
$other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
my $common;
if (UNIVERSAL::isa($other, 'ARRAY')) {
$common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} @$other};
} elsif (UNIVERSAL::isa($other, 'HASH')) {
$common = {map {exists $self->{features}{$_} ? ($_ => $self->{features}{$_}) : ()} keys %$other};
}
return ref($self)->new( features => $common );
}
sub add {
my ($self, $other) = @_;
$other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
while (my ($k,$v) = each %$other) {
$self->{features}{$k} += $v;
}
}
sub dot {
my ($self, $other) = @_;
$other = $other->as_hash if UNIVERSAL::isa($other, __PACKAGE__);
my $sum = 0;
my $f = $self->{features};
while (my ($k, $v) = each %$f) {
$sum += $other->{$k} * $v if exists $other->{$k};
}
return $sum;
}
sub sum {
my ($self) = @_;
# Return total of values in this vector
my $total = 0;
$total += $_ foreach values %{ $self->{features} };
return $total;
}
sub includes {
return exists $_[0]->{features}{$_[1]};
}
sub value {
return $_[0]->{features}{$_[1]};
}
sub values {
my $self = shift;
return @{ $self->{features} }{ @_ };
}
1;
__END__
=head1 NAME
AI::Categorizer::FeatureVector - Features vs. Values
=head1 SYNOPSIS
my $f1 = new AI::Categorizer::FeatureVector
(features => {howdy => 2, doody => 3});
my $f2 = new AI::Categorizer::FeatureVector
(features => {doody => 1, whopper => 2});
@names = $f1->names;
$x = $f1->length;
$x = $f1->sum;
$x = $f1->includes('howdy');
$x = $f1->value('howdy');
$x = $f1->dot($f2);
$f3 = $f1->clone;
$f3 = $f1->intersection($f2);
$f3 = $f1->add($f2);
$h = $f1->as_hash;
$h = $f1->as_boolean_hash;
$f1->normalize;
=head1 DESCRIPTION
This class implements a "feature vector", which is a flat data
structure indicating the values associated with a set of features. At
its base level, a FeatureVector usually represents the set of words in
a document, with the value for each feature indicating the number of
times each word appears in the document. However, the values are
arbitrary so they can represent other quantities as well, and
FeatureVectors may also be combined to represent the features of
lib/AI/Categorizer/Hypothesis.pm
package AI::Categorizer::Hypothesis;
use strict;
use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);
__PACKAGE__->valid_params
(
all_categories => {type => ARRAYREF},
scores => {type => HASHREF},
threshold => {type => SCALAR},
document_name => {type => SCALAR, optional => 1},
);
sub all_categories { @{$_[0]->{all_categories}} }
sub document_name { $_[0]->{document_name} }
sub threshold { $_[0]->{threshold} }
sub best_category {
my ($self) = @_;
my $sc = $self->{scores};
return unless %$sc;
my ($best_cat, $best_score) = each %$sc;
while (my ($key, $val) = each %$sc) {
($best_cat, $best_score) = ($key, $val) if $val > $best_score;
}
return $best_cat;
}
sub in_category {
my ($self, $cat) = @_;
return '' unless exists $self->{scores}{$cat};
return $self->{scores}{$cat} > $self->{threshold};
}
sub categories {
my $self = shift;
return @{$self->{cats}} if $self->{cats};
$self->{cats} = [sort {$self->{scores}{$b} <=> $self->{scores}{$a}}
grep {$self->{scores}{$_} >= $self->{threshold}}
keys %{$self->{scores}}];
return @{$self->{cats}};
}
sub scores {
my $self = shift;
return @{$self->{scores}}{@_};
}
1;
__END__
=head1 NAME
AI::Categorizer::Hypothesis - Embodies a set of category assignments
=head1 SYNOPSIS
use AI::Categorizer::Hypothesis;
# Hypotheses are usually created by the Learner's categorize() method.
# (assume here that $learner and $document have been created elsewhere)
my $h = $learner->categorize($document);
print "Assigned categories: ", join ', ', $h->categories, "\n";
print "Best category: ", $h->best_category, "\n";
print "Assigned scores: ", join ', ', $h->scores( $h->categories ), "\n";
print "Chosen from: ", join ', ', $h->all_categories, "\n";
print +($h->in_category('geometry') ? '' : 'not '), "assigned to geometry\n";
=head1 DESCRIPTION
A Hypothesis embodies a set of category assignments that a categorizer
makes about a single document. Because one may be interested in
knowing different kinds of things about the assignments (for instance,
what categories were assigned, which category had the highest score,
whether a particular category was assigned), we provide a simple class
to help facilitate these scenarios.
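The threshold logic described above can be sketched standalone with a plain score hash (the scores and threshold here are made up; the sort/grep mirrors C<categories()>):

```perl
use strict;
use warnings;

# Made-up scores, standing in for the {scores} hash a Learner produces.
my %scores    = (sports => 0.85, politics => 0.40, science => 0.72);
my $threshold = 0.5;

# Categories clearing the threshold, best score first, as categories()
# computes them.
my @cats = sort { $scores{$b} <=> $scores{$a} }
           grep { $scores{$_} >= $threshold } keys %scores;

print "Assigned: @cats\n";   # Assigned: sports science
print "Best: $cats[0]\n";    # Best: sports
```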
lib/AI/Categorizer/KnowledgeSet.pm
use Params::Validate qw(:types);
use AI::Categorizer::ObjectSet;
use AI::Categorizer::Document;
use AI::Categorizer::Category;
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Util;
use Carp qw(croak);
__PACKAGE__->valid_params
(
categories => {
type => ARRAYREF,
default => [],
callbacks => { 'all are Category objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Category'),
@{$_[0]} },
},
},
documents => {
type => ARRAYREF,
default => [],
callbacks => { 'all are Document objects' =>
sub { ! grep !UNIVERSAL::isa($_, 'AI::Categorizer::Document'),
@{$_[0]} },
},
},
scan_first => {
type => BOOLEAN,
default => 1,
},
feature_selector => {
isa => 'AI::Categorizer::FeatureSelector',
},
tfidf_weighting => {
type => SCALAR,
optional => 1,
},
term_weighting => {
type => SCALAR,
default => 'x',
},
collection_weighting => {
type => SCALAR,
default => 'x',
},
normalize_weighting => {
type => SCALAR,
default => 'x',
},
verbose => {
type => SCALAR,
default => 0,
},
);
__PACKAGE__->contained_objects
(
document => { delayed => 1,
class => 'AI::Categorizer::Document' },
category => { delayed => 1,
class => 'AI::Categorizer::Category' },
collection => { delayed => 1,
class => 'AI::Categorizer::Collection::Files' },
features => { delayed => 1,
class => 'AI::Categorizer::FeatureVector' },
feature_selector => 'AI::Categorizer::FeatureSelector::DocFrequency',
);
sub new {
my ($pkg, %args) = @_;
# Shortcuts
if ($args{tfidf_weighting}) {
@args{'term_weighting', 'collection_weighting', 'normalize_weighting'} = split '', $args{tfidf_weighting};
delete $args{tfidf_weighting};
}
my $self = $pkg->SUPER::new(%args);
# Convert to AI::Categorizer::ObjectSet sets
$self->{categories} = new AI::Categorizer::ObjectSet( @{$self->{categories}} );
$self->{documents} = new AI::Categorizer::ObjectSet( @{$self->{documents}} );
if ($self->{load}) {
my $args = ref($self->{load}) ? $self->{load} : { path => $self->{load} };
$self->load(%$args);
delete $self->{load};
}
return $self;
}
sub features {
my $self = shift;
if (@_) {
$self->{features} = shift;
$self->trim_doc_features if $self->{features};
}
return $self->{features} if $self->{features};
# Create a feature vector encompassing the whole set of documents
my $v = $self->create_delayed_object('features');
foreach my $document ($self->documents) {
$v->add( $document->features );
}
return $self->{features} = $v;
}
sub categories {
my $c = $_[0]->{categories};
return wantarray ? $c->members : $c->size;
}
sub documents {
my $d = $_[0]->{documents};
return wantarray ? $d->members : $d->size;
}
sub document {
my ($self, $name) = @_;
return $self->{documents}->retrieve($name);
}
sub feature_selector { $_[0]->{feature_selector} }
sub scan_first { $_[0]->{scan_first} }
sub verbose {
my $self = shift;
$self->{verbose} = shift if @_;
return $self->{verbose};
}
sub trim_doc_features {
my ($self) = @_;
foreach my $doc ($self->documents) {
$doc->features( $doc->features->intersection($self->features) );
}
}
sub prog_bar {
my ($self, $collection) = @_;
return sub {} unless $self->verbose;
return sub { print STDERR '.' } unless eval "use Time::Progress; 1";
my $count = $collection->can('count_documents') ? $collection->count_documents : 0;
my $pb = 'Time::Progress'->new;
$pb->attr(max => $count);
my $i = 0;
return sub {
$i++;
return if $i % 25;
print STDERR $pb->report("%50b %p ($i/$count)\r", $i);
};
}
# A little utility method for several other methods like scan_stats(),
# load(), read(), etc.
sub _make_collection {
my ($self, $args) = @_;
return $args->{collection} || $self->create_delayed_object('collection', %$args);
}
sub scan_stats {
# Should determine:
# - number of documents
# - number of categories
# - avg. number of categories per document (whole corpus)
# - avg. number of tokens per document (whole corpus)
# - avg. number of types per document (whole corpus)
# - number of documents, tokens, & types for each category
# - "category skew index" (% variance?) by num. documents, tokens, and types
my ($self, %args) = @_;
my $collection = $self->_make_collection(\%args);
my $pb = $self->prog_bar($collection);
my %stats;
while (my $doc = $collection->next) {
$pb->();
$stats{category_count_with_duplicates} += $doc->categories;
my ($sum, $length) = ($doc->features->sum, $doc->features->length);
$stats{document_count}++;
$stats{token_count} += $sum;
$stats{type_count} += $length;
foreach my $cat ($doc->categories) {
#warn $doc->name, ": ", $cat->name, "\n";
$stats{categories}{$cat->name}{document_count}++;
$stats{categories}{$cat->name}{token_count} += $sum;
$stats{categories}{$cat->name}{type_count} += $length;
}
}
print "\n" if $self->verbose;
my @cats = keys %{ $stats{categories} };
$stats{category_count} = @cats;
$stats{categories_per_document} = $stats{category_count_with_duplicates} / $stats{document_count};
$stats{tokens_per_document} = $stats{token_count} / $stats{document_count};
$stats{types_per_document} = $stats{type_count} / $stats{document_count};
foreach my $thing ('type', 'token', 'document') {
$stats{"${thing}s_per_category"} = AI::Categorizer::Util::average
( map { $stats{categories}{$_}{"${thing}_count"} } @cats );
next unless @cats;
# Compute the skews
my $ssum;
foreach my $cat (@cats) {
$ssum += ($stats{categories}{$cat}{"${thing}_count"} - $stats{"${thing}s_per_category"}) ** 2;
}
$stats{"${thing}_skew_by_category"} = sqrt($ssum/@cats) / $stats{"${thing}s_per_category"};
}
return \%stats;
}
sub load {
my ($self, %args) = @_;
my $c = $self->_make_collection(\%args);
if ($self->{features_kept}) {
# Read the whole thing in, then reduce
$self->read( collection => $c );
$self->select_features;
} elsif ($self->{scan_first}) {
# Figure out the feature set first, then read data in
$self->scan_features( collection => $c );
$c->rewind;
$self->read( collection => $c );
} else {
# Don't do any feature reduction, just read the data
$self->read( collection => $c );
}
}
sub read {
my ($self, %args) = @_;
my $collection = $self->_make_collection(\%args);
my $pb = $self->prog_bar($collection);
while (my $doc = $collection->next) {
$pb->();
$self->add_document($doc);
}
print "\n" if $self->verbose;
}
sub finish {
my $self = shift;
return if $self->{finished}++;
$self->weigh_features;
}
sub weigh_features {
# This could be made more efficient by figuring out an execution
# plan in advance
my $self = shift;
if ( $self->{term_weighting} =~ /^(t|x)$/ ) {
# Nothing to do
} elsif ( $self->{term_weighting} eq 'l' ) {
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
$_ = 1 + log($_) foreach values %$f;
}
} elsif ( $self->{term_weighting} eq 'n' ) {
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
my $max_tf = AI::Categorizer::Util::max values %$f;
$_ = 0.5 + 0.5 * $_ / $max_tf foreach values %$f;
}
} elsif ( $self->{term_weighting} eq 'b' ) {
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
$_ = $_ ? 1 : 0 foreach values %$f;
}
} else {
die "term_weighting must be one of 'x', 't', 'l', 'b', or 'n'";
}
if ($self->{collection_weighting} eq 'x') {
# Nothing to do
} elsif ($self->{collection_weighting} =~ /^(f|p)$/) {
my $subtrahend = ($1 eq 'f' ? 0 : 1);
my $num_docs = $self->documents;
$self->document_frequency('foo'); # Initialize
foreach my $doc ($self->documents) {
my $f = $doc->features->as_hash;
$f->{$_} *= log($num_docs / $self->{doc_freq_vector}{$_} - $subtrahend) foreach keys %$f;
}
} else {
die "collection_weighting must be one of 'x', 'f', or 'p'";
}
if ( $self->{normalize_weighting} eq 'x' ) {
# Nothing to do
} elsif ( $self->{normalize_weighting} eq 'c' ) {
$_->features->normalize foreach $self->documents;
} else {
die "normalize_weighting must be one of 'x' or 'c'";
}
}
sub document_frequency {
my ($self, $term) = @_;
unless (exists $self->{doc_freq_vector}) {
die "No corpus has been scanned for features" unless $self->documents;
my $doc_freq = $self->create_delayed_object('features', features => {});
foreach my $doc ($self->documents) {
$doc_freq->add( $doc->features->as_boolean_hash );
}
$self->{doc_freq_vector} = $doc_freq->as_hash;
}
return exists $self->{doc_freq_vector}{$term} ? $self->{doc_freq_vector}{$term} : 0;
}
sub scan_features {
my ($self, %args) = @_;
my $c = $self->_make_collection(\%args);
my $pb = $self->prog_bar($c);
my $ranked_features = $self->{feature_selector}->scan_features( collection => $c, prog_bar => $pb );
$self->delayed_object_params('document', use_features => $ranked_features);
$self->delayed_object_params('collection', use_features => $ranked_features);
return $ranked_features;
}
sub select_features {
my $self = shift;
my $f = $self->feature_selector->select_features(knowledge_set => $self);
$self->features($f);
}
sub partition {
my ($self, @sizes) = @_;
my $num_docs = my @docs = $self->documents;
my @groups;
while (@sizes > 1) {
my $size = int ($num_docs * shift @sizes);
push @groups, [];
for (1..$size) {
push @{ $groups[-1] }, splice @docs, rand(@docs), 1;
}
}
push @groups, \@docs;
return map { ref($self)->new( documents => $_ ) } @groups;
}
sub make_document {
my ($self, %args) = @_;
my $cats = delete $args{categories};
my @cats = map { $self->call_method('category', 'by_name', name => $_) } @$cats;
my $d = $self->create_delayed_object('document', %args, categories => \@cats);
$self->add_document($d);
}
sub add_document {
my ($self, $doc) = @_;
foreach ($doc->categories) {
$_->add_document($doc);
}
$self->{documents}->insert($doc);
$self->{categories}->insert($doc->categories);
}
sub save_features {
my ($self, $file) = @_;
my $f = ($self->{features} || { $self->delayed_object_params('document') }->{use_features})
or croak "No features to save";
open my($fh), "> $file" or croak "Can't create $file: $!";
my $h = $f->as_hash;
print $fh "# Total: ", $f->length, "\n";
foreach my $k (sort {$h->{$b} <=> $h->{$a}} keys %$h) {
print $fh "$k\t$h->{$k}\n";
}
close $fh;
}
sub restore_features {
my ($self, $file, $n) = @_;
open my($fh), "< $file" or croak "Can't open $file: $!";
my %hash;
while (<$fh>) {
next if /^#/;
/^(.*)\t([\d.]+)$/ or croak "Malformed line: $_";
$hash{$1} = $2;
last if defined $n and $. >= $n;
}
my $features = $self->create_delayed_object('features', features => \%hash);
$self->delayed_object_params('document', use_features => $features);
$self->delayed_object_params('collection', use_features => $features);
}
1;
__END__
=head1 NAME
AI::Categorizer::KnowledgeSet - Encapsulates set of documents
=head1 SYNOPSIS
use AI::Categorizer::KnowledgeSet;
my $k = new AI::Categorizer::KnowledgeSet(...parameters...);
my $nb = new AI::Categorizer::Learner::NaiveBayes(...parameters...);
$nb->train(knowledge_set => $k);
=head1 DESCRIPTION
The KnowledgeSet class provides an interface to a set of
documents, a set of categories, and a mapping between the two. Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.
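One such parameter is the C<tfidf_weighting> shortcut accepted by C<new()>: a three-letter code that expands into the term, collection, and normalization weighting components. This sketch mirrors the expansion done in C<new()> (the code 'lfc' is just one illustrative combination of the letters handled in C<weigh_features>):

```perl
use strict;
use warnings;

# The shorthand: one letter each for term, collection, and
# normalization weighting, split into the three explicit parameters.
my %args = (tfidf_weighting => 'lfc');

if ($args{tfidf_weighting}) {
  @args{qw(term_weighting collection_weighting normalize_weighting)}
    = split '', delete $args{tfidf_weighting};
}

print "term=$args{term_weighting} collection=$args{collection_weighting} ",
      "normalize=$args{normalize_weighting}\n";  # term=l collection=f normalize=c
```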
=head1 METHODS