AI-Categorizer
lib/AI/Categorizer/FeatureSelector.pm
package AI::Categorizer::FeatureSelector;
use strict;
use Class::Container;
use base qw(Class::Container);
use Params::Validate qw(:types);
use AI::Categorizer::FeatureVector;
use AI::Categorizer::Util;
use Carp qw(croak);
__PACKAGE__->valid_params(
  features_kept => { type => SCALAR, default => 0.2 },
  verbose       => { type => SCALAR, default => 0 },
);
sub verbose {
  my $self = shift;
  $self->{verbose} = shift if @_;
  return $self->{verbose};
}
sub reduce_features {
  # Takes a feature vector whose weights are "feature scores" and
  # keeps only the highest-scoring n features.  n is specified by the
  # 'features_kept' parameter: if it's zero, all features are kept;
  # if it's between 0 and 1, it's treated as a fraction of the
  # present number of features; if it's greater than 1, it's treated
  # as an absolute number of features to keep.
  my ($self, $f, %args) = @_;
  my $kept = defined $args{features_kept} ? $args{features_kept} : $self->{features_kept};
  return $f unless $kept;

  my $num_kept = int($kept < 1 ? $f->length * $kept : $kept);
  $num_kept = $f->length if $num_kept > $f->length;  # avoid undef entries in the slice below

  print "Trimming features - # features = " . $f->length . "\n" if $self->verbose;

  # Sorting every feature is algorithmic overkill for a top-n
  # selection, but in practice the sort is fast enough.
  my $features = $f->as_hash;
  my @new_features = (sort { $features->{$b} <=> $features->{$a} } keys %$features)[0 .. $num_kept - 1];

  my $result = $f->intersection( \@new_features );
  print "Finished trimming features - # features = " . $result->length . "\n" if $self->verbose;
  return $result;
}
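# Example (hypothetical; assumes a concrete FeatureSelector subclass
# named My::Selector and a populated AI::Categorizer::FeatureVector
# of feature scores in $vector):
#
#   my $selector = My::Selector->new(features_kept => 500);
#   my $top = $selector->reduce_features($vector);
#   print $top->length, " features kept\n";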
# Abstract methods
sub rank_features;
sub scan_features;
sub select_features {
  my ($self, %args) = @_;

  die "No knowledge_set parameter provided to select_features()"
    unless $args{knowledge_set};

  my $f = $self->rank_features( knowledge_set => $args{knowledge_set} );
  return $self->reduce_features( $f, features_kept => $args{features_kept} );
}
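# A concrete subclass is expected to implement rank_features() (and
# scan_features()), returning an AI::Categorizer::FeatureVector whose
# weights are feature scores.  Hypothetical sketch:
#
#   package My::Selector;
#   use base qw(AI::Categorizer::FeatureSelector);
#
#   sub rank_features {
#     my ($self, %args) = @_;
#     # ... score each feature of $args{knowledge_set} and return
#     # a FeatureVector of those scores ...
#   }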
1;
__END__
=head1 NAME
AI::Categorizer::FeatureSelector - Abstract Feature Selection class
=head1 SYNOPSIS
...
=head1 DESCRIPTION
The KnowledgeSet class provides an interface to a set of
documents, a set of categories, and a mapping between the two.  Many
parameters for controlling the processing of documents are managed by
the KnowledgeSet class.
=head1 METHODS
=over 4
=item new()
Creates a new KnowledgeSet and returns it. Accepts the following
parameters:
=over 4
=item load
If a C<load> parameter is present, the C<load()> method will be
invoked immediately. If the C<load> parameter is a string, it will be
passed as the C<path> parameter to C<load()>. If the C<load>
parameter is a hash reference, it will represent all the parameters to
pass to C<load()>.
=item categories
An optional reference to an array of Category objects representing the
complete set of categories in a KnowledgeSet. If used, the
retrieval". The three characters indicate the three factors that will
be multiplied for each feature to find the final vector value for that
feature. The default weighting is C<xxx>.
The first character specifies the "term frequency" component, which
can take the following values:
=over 4
=item b
Binary weighting - 1 for terms present in a document, 0 for terms absent.
=item t
Raw term frequency - equal to the number of times a feature occurs in
the document.
=item x
A synonym for 't'.
=item n
Normalized term frequency - 0.5 + 0.5 * t/max(t). This is the same as
the 't' specification, but with term frequency normalized to lie
between 0.5 and 1.
=back
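As an illustration only (this code is not part of the module), the
term-frequency options above can be computed from a raw count C<$t>
and the largest raw count in the document C<$max_t>:

  my %tf = (
    b => $t > 0 ? 1 : 0,           # binary
    t => $t,                       # raw term frequency
    x => $t,                       # synonym for 't'
    n => 0.5 + 0.5 * $t / $max_t,  # normalized term frequency
  );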
The second character specifies the "collection frequency" component, which
can take the following values:
=over 4
=item f
Inverse document frequency - multiply term C<t>'s value by C<log(N/n)>,
where C<N> is the total number of documents in the collection, and
C<n> is the number of documents in which term C<t> is found.
=item p
Probabilistic inverse document frequency - multiply term C<t>'s value
by C<log((N-n)/n)> (same variable meanings as above).
=item x
No change - multiply by 1.
=back
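An illustrative sketch (again, not module code) of the
collection-frequency factors, where C<$N> is the total number of
documents and C<$n> the number of documents containing the term:

  my $f = log($N / $n);           # inverse document frequency
  my $p = log(($N - $n) / $n);    # probabilistic variant
  my $x = 1;                      # no change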
The third character specifies the "normalization" component, which
can take the following values:
=over 4
=item c
Apply cosine normalization - multiply by 1/length(document_vector).
=item x
No change - multiply by 1.
=back
The three components may alternatively be specified by the
C<term_weighting>, C<collection_weighting>, and C<normalize_weighting>
parameters respectively.
=item verbose
If set to a true value, some status/debugging information will be
output on C<STDOUT>.
=back
=item categories()
In a list context returns a list of all Category objects in this
KnowledgeSet. In a scalar context returns the number of such objects.
=item documents()
In a list context returns a list of all Document objects in this
KnowledgeSet. In a scalar context returns the number of such objects.
=item document()
Given a document name, returns the Document object with that name, or
C<undef> if no such Document object exists in this KnowledgeSet.
=item features()
Returns a FeatureSet object which represents the features of all the
documents in this KnowledgeSet.
=item verbose()
Returns the C<verbose> parameter of this KnowledgeSet, or sets it with
an optional argument.
=item scan_stats()
Scans all the documents of a Collection and returns a hash reference
containing several statistics about the Collection. (XXX need to describe stats)
=item scan_features()
This method scans through a Collection object and determines the
"best" features (words) to use when loading the documents and training
the Learner. This process is known as "feature selection", and it's a
very important part of categorization.
The Collection object should be specified as a C<collection> parameter,
or by giving the arguments to pass to the Collection's C<new()> method.
The process of feature selection is governed by the