AI-Categorizer
gone away.
- The building & installing process now uses Module::Build rather
than ExtUtils::MakeMaker.
- When the features_kept mechanism was used to explicitly state the
features to use, and the scan_first parameter was left as its
default value, the features_kept mechanism would silently fail to
do anything. This has now been fixed. [Spotted by Arnaud Gaudinat]
- Recent versions of Weka have changed the name of the SVM class, so
I've updated it in our test (t/03-weka.t) of the Weka wrapper
too. [Sebastien Aperghis-Tramoni]
0.07 Tue May 6 16:15:04 CDT 2003
- Oops - eg/demo.pl and t/15-knowledge_set.t didn't make it into the
MANIFEST, so they weren't included in the 0.06 distribution.
[Spotted by Zoltan Barta]
0.06 Tue Apr 22 10:27:26 CDT 2003
for providing a set of baseline scores against which to evaluate
other machine learners.
- The NaiveBayes learner is now a wrapper around my new
Algorithm::NaiveBayes module, which is just the old NaiveBayes code
from here, turned into its own standalone module.
- Much more extensive regression testing of the code.
- Added a Document subclass for XML documents. [Implemented by
Jae-Moon Lee] Its interface is still unstable and may change in
later releases.
- Added a 'Build.PL' file for an alternate installation method using
Module::Build.
- Fixed a problem in the Hypothesis' best_category() method that
would often result in the wrong category being reported. Added a
regression test to exercise the Hypothesis class. [Spotted by
Xiaobo Li]
- Added save_features() and restore_features() to KnowledgeSet.
- Added default categories() and categorize() methods to Learner base
class. get_scores() is now abstract.
- Extended interface of ObjectSet class with retrieve(), includes(),
and includes_name().
- Moved 'term_weighting' parameter from Document to KnowledgeSet,
since the normalized version needs to know the maximum
term-frequency. Also changed its values to 'n', 'l', 'b', and 't',
with 'x' a synonym for 't'.
- Implemented full range of TF/IDF term weighting methods (see Salton
& Buckley, "Term Weighting Approaches in Automatic Text Retrieval",
in journal "Information Processing & Management", 1988 #5)
0.03 Wed Jul 24 01:57:00 AEST 2002
- First version released to CPAN
classes), or any class that *they* create. This is managed by the
"Class::Container" module, so see its documentation for the details of
how this works.
The specific parameters accepted here are:
progress_file
A string that indicates a place where objects will be saved during
several of the methods of this class. The default value is the
string "save", which means files like "save-01-knowledge_set" will
get created. The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.
verbose
If true, a few status messages will be printed during execution.
training_set
Specifies the "path" parameter that will be fed to the
KnowledgeSet's "scan_features()" and "read()" methods during our
"scan_features()" and "read_training_set()" methods.
stats_table()
Returns the value of the Experiment's (as created by
"evaluate_test_set()") "stats_table()" method. This is a string that
shows various statistics about the accuracy/precision/recall/F1/etc. of
the assignments made during testing.
HISTORY
This module is a revised and redesigned version of the previous
"AI::Categorize" module by the same author. Note the added 'r' in the new
name. The older module has a different interface, and no attempt at backward
compatibility has been made - that's why I changed the name.
You can have both "AI::Categorize" and "AI::Categorizer" installed at the
same time on the same machine, if you want. They don't know about each other
or use conflicting namespaces.
AUTHOR
Ken Williams <ken@mathforum.org>
Discussion about this module can be directed to the perl-AI list at
<perl-ai@perl.org>. For more info about the list, see
eg/categorizer
  my $result = $c->stats_table;
  print $result if $c->verbose;
  print $out_fh $result if $out_fh;
}

sub run_section {
  my ($section, $stage, $do_stage) = @_;
  return unless $do_stage->{$stage};

  # When more than one stage was requested, re-run this script in a
  # child process with only the current stage's flag appended.
  if (keys %$do_stage > 1) {
    print " % $0 @ARGV -$stage\n" if $c->verbose;
    die "$0 is not executable, please change its execution permissions"
      unless -x $0;
    system($0, @ARGV, "-$stage") == 0
      or die "$0 returned nonzero status, \$?=$?";
    return;
  }

  # Time the stage itself.
  my $start = Benchmark->new;
  $c->$section();
  my $end = Benchmark->new;
  my $summary = timestr(timediff($end, $start));
  my ($rss, $vsz) = memory_usage();
lib/AI/Categorizer.pm
The specific parameters accepted here are:
=over 4
=item progress_file
A string that indicates a place where objects will be saved during
several of the methods of this class. The default value is the string
C<save>, which means files like C<save-01-knowledge_set> will get
created. The exact names of these files may change in future
releases, since they're just used internally to resume where we last
left off.
=item verbose
If true, a few status messages will be printed during execution.
=item training_set
Specifies the C<path> parameter that will be fed to the KnowledgeSet's
lib/AI/Categorizer.pm
accuracy/precision/recall/F1/etc. of the assignments made during
testing.
=back
=head1 HISTORY
This module is a revised and redesigned version of the previous
C<AI::Categorize> module by the same author. Note the added 'r' in
the new name. The older module has a different interface, and no
attempt at backward compatibility has been made - that's why I changed
the name.
You can have both C<AI::Categorize> and C<AI::Categorizer> installed
at the same time on the same machine, if you want. They don't know
about each other or use conflicting namespaces.
=head1 AUTHOR
Ken Williams <ken@mathforum.org>
lib/AI/Categorizer/Document.pm
=item use_features
A Feature Vector specifying the only features that should be
considered when parsing this document. This is an alternative to
using C<stopwords>.
=item stemming
Indicates the linguistic procedure that should be used to convert
tokens in the document to features. Possible values are C<none>,
which indicates that the tokens should be used without change, or
C<porter>, indicating that the Porter stemming algorithm should be
applied to each token. This requires the C<Lingua::Stem> module from
CPAN.
=item stopword_behavior
There are a few ways you might want the stopword list (specified with
the C<stopwords> parameter) to interact with the stemming algorithm
(specified with the C<stemming> parameter). These options can be
controlled with the C<stopword_behavior> parameter, which can take the
lib/AI/Categorizer/FeatureSelector.pm
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.
=item p
Probabilistic inverse document frequency - multiply term C<t>'s value
by C<log((N-n)/n)> (same variable meanings as above).
=item x
No change - multiply by 1.
=back
The third character specifies the "normalization" component, which
can take the following values:
=over 4
=item c
Apply cosine normalization - multiply by 1/length(document_vector).
=item x
No change - multiply by 1.
=back
The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.
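
The C<p> and C<c> components described above can be sketched in plain Perl. This is only an illustration of the formulas, not the module's own code: the helper names C<prob_idf> and C<cosine_normalize> and the sample counts are made up, and "length" is assumed to mean the Euclidean norm of the vector.

```perl
use strict;
use warnings;

# Probabilistic inverse document frequency ('p'): log((N-n)/n).
# $N = documents in the collection, $n = documents containing the term.
sub prob_idf {
    my ($N, $n) = @_;
    die "need 0 < n < N" unless $n > 0 && $n < $N;
    return log( ($N - $n) / $n );
}

# Cosine normalization ('c'): multiply each value by 1/length(vector).
# Assumption: length is the Euclidean norm of the vector.
sub cosine_normalize {
    my (%vec) = @_;
    my $sum_sq = 0;
    $sum_sq += $_ ** 2 for values %vec;
    my $len = sqrt($sum_sq);
    return %vec unless $len;    # leave an all-zero vector unchanged
    return map { ($_ => $vec{$_} / $len) } keys %vec;
}

# Hypothetical collection of 10 documents with made-up counts.
my $N        = 10;
my %tf       = (perl => 3, module => 4);   # raw term frequencies in one document
my %doc_freq = (perl => 2, module => 5);   # documents containing each term

my %weighted = map { ($_ => $tf{$_} * prob_idf($N, $doc_freq{$_})) } keys %tf;
my %unit     = cosine_normalize(%weighted);

printf "%-8s %.4f\n", $_, $unit{$_} for sort keys %unit;
```

Note that a term found in exactly half of the documents gets a C<p> weight of C<log(1)>, which is zero, and a term found in more than half gets a negative weight.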
=item verbose
If set to a true value, some status/debugging information will be
lib/AI/Categorizer/KnowledgeSet.pm
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.
=item p
Probabilistic inverse document frequency - multiply term C<t>'s value
by C<log((N-n)/n)> (same variable meanings as above).
=item x
No change - multiply by 1.
=back
The third character specifies the "normalization" component, which
can take the following values:
=over 4
=item c
Apply cosine normalization - multiply by 1/length(document_vector).
=item x
No change - multiply by 1.
=back
The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.
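
The relationship between the three-character form and the three separate parameters can be sketched as follows. The helper C<split_weighting> is hypothetical and is not part of the module; it assumes only that the characters appear in the order term frequency, collection frequency, normalization, and it does not check that each character is a legal value.

```perl
use strict;
use warnings;

# Split a three-character weighting string into its components, keyed by
# the parameter name that would carry the same value on its own.
sub split_weighting {
    my ($spec) = @_;
    die "expected exactly three characters, got '$spec'"
        unless $spec =~ /\A(.)(.)(.)\z/;
    return (
        term_weighting       => $1,
        collection_weighting => $2,
        normalize_weighting  => $3,
    );
}

# 'x' (no change), 'p' (probabilistic IDF), 'c' (cosine normalization).
my %parts = split_weighting('xpc');
print "$_ => $parts{$_}\n" for sort keys %parts;
```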
=item verbose
If set to a true value, some status/debugging information will be