AI-NaiveBayes
received the program in object code or executable form alone.)
Source code for a work means the preferred form of the work for making
modifications to it. For an executable file, complete source code means
all the source code for all modules it contains; but, as a special
exception, it need not include source code for modules which are standard
libraries that accompany the operating system on which the executable
file runs, or for standard header files or definitions files that
accompany that operating system.
4. You may not copy, modify, sublicense, distribute or transfer the
Program except as expressly provided under this General Public License.
Any attempt otherwise to copy, modify, sublicense, distribute or transfer
the Program is void, and will automatically terminate your rights to use
the Program under this License. However, parties who have received
copies, or rights to use copies, from you under this General Public
License will not have their licenses terminated so long as such parties
remain in full compliance.
5. By copying, distributing or modifying the Program (or any work based
on the Program) you indicate your acceptance of this license to do so,
and all its terms and conditions.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the original
licensor to copy, distribute or modify the Program subject to these
terms and conditions. You may not impose any further restrictions on the
recipients' exercise of the rights granted herein.
7. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of the license which applies to it and "any
may not charge a fee for this Package itself. However, you may distribute this
Package in aggregate with other (possibly commercial) programs as part of a
larger (possibly commercial) software distribution provided that you do not
advertise this Package as a product of your own.
6. The scripts and library files supplied as input to or produced as output
from the programs of this Package do not automatically fall under the copyright
of this Package, but belong to whomever generated them, and may be sold
commercially, and may be aggregated with this Package.
7. C or perl subroutines supplied by you and linked into this Package shall not
be considered part of this Package.
8. The name of the Copyright Holder may not be used to endorse or promote
products derived from this software without specific prior written permission.
9. THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The End
package AI::NaiveBayes;

use AI::NaiveBayes::Learner;
use Moose;
use MooseX::Storage;
use List::Util qw(max);
with Storage(format => 'Storable', io => 'File');
has model => (is => 'ro', isa => 'HashRef[HashRef]', required => 1);
sub train {
    my $self = shift;
    my $learner = AI::NaiveBayes::Learner->new();
    for my $example ( @_ ) {
        $learner->add_example( %$example );
    }
    return $learner->classifier;
}
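For context, a minimal usage sketch of the two methods above; the attribute hashes, counts, and labels are invented for illustration (the attributes/labels shape matches the tests later in this distribution):

use AI::NaiveBayes;

# Hypothetical training data: each example pairs attribute counts with labels.
my $classifier = AI::NaiveBayes->train(
    { attributes => { sheep => 1, wool => 1, farm => 2 }, labels => ['farming'] },
    { attributes => { bat => 1, blood => 2, night => 1 }, labels => ['vampire'] },
);

# Classify a new attribute-count hash; the result is an AI::NaiveBayes::Classification.
my $result = $classifier->classify( { wool => 1, farm => 1 } );
print $result->best_category, "\n";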
sub classify {
    my ($self, $newattrs) = @_;
    $newattrs or die "Missing parameter for classify()";

    my $m = $self->model;

    # Note that we're using the log(prob) here. That's why we add instead of multiply.
    my %scores = %{ $m->{prior_probs} };
    my %features;
    while (my ($feature, $value) = each %$newattrs) {
        next unless exists $m->{attributes}{$feature};   # ignore features never seen in training
        while (my ($label, $attributes) = each %{ $m->{probs} }) {
            my $score = ($attributes->{$feature} || 0) * $value;   # log P($feature|$label), weighted by the feature value
            $scores{$label} += $score;
            $features{$feature}{$label} = $score;
        }
    }

    rescale(\%scores);

    return AI::NaiveBayes::Classification->new( label_sums => \%scores, features => \%features );
}
sub rescale {
    my ($scores) = @_;

    # Scale everything back to a reasonable area in logspace (near zero), un-loggify, and normalize
    my $total = 0;
    my $max = max(values %$scores);
    foreach (values %$scores) {
        $_ = exp($_ - $max);
        $total += $_**2;
    }
    $total = sqrt($total);
    foreach (values %$scores) {
        $_ /= $total;    # normalize so the score vector has unit length
    }
}
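A quick illustration (a sketch, not part of the module) of why rescale() subtracts the maximum before exponentiating: raw log-scores are typically large negative numbers, and exponentiating them directly would underflow to zero, whereas shifting by the maximum keeps the best score at exp(0) = 1.

# Hypothetical log-scores; values chosen only to show the underflow problem.
my %scores = ( farming => -800, vampire => -805 );
# exp(-800) underflows to 0, but after shifting by the max (-800):
#   farming => exp(0)  == 1
#   vampire => exp(-5) ~= 0.0067
# and both are then divided by sqrt(1**2 + 0.0067**2), giving a unit-length vector.
rescale(\%scores);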
If C<w1, w2, ... wn> are the words in the document, the best category is the one maximizing the product of the per-word probabilities and the category prior:

    Best category =   ArgMax     P(w1|cat) * P(w2|cat) * ... * P(wn|cat) * P(cat)
                    cat in cats
That's the formula I use in my document categorization code. The last
step is the only non-rigorous one in the derivation, and this is the
"naive" part of the Naive Bayes technique. It assumes that the
probability of each word appearing in a document is unaffected by the
presence or absence of each other word in the document. We assume
this even though we know this isn't true: for example, the word
"iodized" is far more likely to appear in a document that contains the
word "salt" than it is to appear in a document that contains the word
"subroutine". Luckily, as it turns out, making this assumption even
when it isn't true may have little effect on our results, as the
following paper by Pedro Domingos argues:
L<"http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz">
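In the code above this product is computed in log space, which is why classify() adds scores rather than multiplying probabilities; a small sketch with made-up numbers:

use List::Util qw(sum);

# Hypothetical per-word log-probabilities for one category (illustration only).
my %log_p     = ( sheep => log(0.20), wool => log(0.10) );
my $log_prior = log(0.50);                        # log P(cat)
my $log_score = $log_prior + sum(values %log_p);  # == log( 0.50 * 0.20 * 0.10 )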
=head1 SEE ALSO
Algorithm::NaiveBayes(3), AI::Classifier::Text(3)
=head1 BASED ON
lib/AI/NaiveBayes/Classification.pm
package AI::NaiveBayes::Classification;
$AI::NaiveBayes::Classification::VERSION = '0.04';
use strict;
use warnings;
use 5.010;
use Moose;
has features => (is => 'ro', isa => 'HashRef[HashRef]', required => 1);
has label_sums => (is => 'ro', isa => 'HashRef', required => 1);
has best_category => (is => 'ro', isa => 'Str', lazy_build => 1);
sub _build_best_category {
    my $self = shift;
    my $sc = $self->label_sums;

    # Pick the label whose summed score is highest
    my ($best_cat, $best_score) = each %$sc;
    while (my ($key, $val) = each %$sc) {
        ($best_cat, $best_score) = ($key, $val) if $val > $best_score;
    }
    return $best_cat;
}
sub find_predictors {
    my $self = shift;

    my $best_cat = $self->best_category;
    my $features = $self->features;
    my @predictors;
    for my $feature ( keys %$features ) {
        for my $cat ( keys %{ $features->{$feature} } ) {
            next if $cat eq $best_cat;
            # How much more this feature contributed to the winning label than to $cat
            push @predictors, [ $feature, $features->{$feature}{$best_cat} - $features->{$feature}{$cat} ];
        }
    }
    return sort { abs( $b->[1] ) <=> abs( $a->[1] ) } @predictors;
}
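A brief usage sketch for this class, assuming $result is an AI::NaiveBayes::Classification returned by classify() (variable names here are illustrative):

my $best = $result->best_category;      # label with the highest summed score
for my $p ( $result->find_predictors ) {
    my ($feature, $delta) = @$p;
    # $delta is how much more this feature scored for the best label than
    # for a competing label; larger magnitude means more influence.
    printf "%-12s %.3f\n", $feature, $delta;
}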
lib/AI/NaiveBayes/Learner.pm
package AI::NaiveBayes::Learner;
$AI::NaiveBayes::Learner::VERSION = '0.04';
use strict;
use warnings;
use 5.010;
use List::Util qw( min sum );
use Moose;
use AI::NaiveBayes;
has attributes => (is => 'ro', isa => 'HashRef', default => sub { {} }, clearer => '_clear_attrs');
has labels => (is => 'ro', isa => 'HashRef', default => sub { {} }, clearer => '_clear_labels');
has examples => (is => 'ro', isa => 'Int', default => 0, clearer => '_clear_examples');
has features_kept => (is => 'ro', predicate => 'limit_features');
has classifier_class => ( is => 'ro', isa => 'Str', default => 'AI::NaiveBayes' );
sub add_example {
    my ($self, %params) = @_;
    for ('attributes', 'labels') {
        die "Missing required '$_' parameter" unless exists $params{$_};
    }
    $self->{examples}++;

    my $attributes = $params{attributes};
    my $labels     = $params{labels};

    add_hash($self->attributes(), $attributes);

    my $our_labels = $self->labels;
    foreach my $label ( @$labels ) {
        $our_labels->{$label}{count}++;
        $our_labels->{$label}{attributes} //= {};
        add_hash($our_labels->{$label}{attributes}, $attributes);
    }
}
sub classifier {
my $self = shift;
my $examples = $self->examples;
my $labels = $self->labels;
my $vocab_size = keys %{ $self->attributes };
my $model;
$model->{attributes} = $self->attributes;
# Calculate the log-probabilities for each category
    # ... (code not shown in this extract) ...
}
my @top = @features[0..$limit-1];
my %kept = map { $_ => $old{$_} } @top;
$model->{probs}{$label} = \%kept;
}
}
my $classifier_class = $self->classifier_class;
return $classifier_class->new( model => $model );
}
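For orientation, a sketch of the model structure this method builds, inferred from how AI::NaiveBayes::classify() reads it; the numbers are invented, and the elided code above computes the real values:

my $example_model = {
    attributes  => { sheep => 3, wool => 1 },                     # global attribute counts
    prior_probs => { farming => log(2/3), vampire => log(1/3) },  # log P(label)
    probs       => {                                              # smoothed log P(attribute|label)
        farming => { sheep => log(4/10), wool => log(2/10) },
        vampire => { sheep => log(1/8),  wool => log(1/8)  },
    },
};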
sub add_hash {
    my ($first, $second) = @_;
    $first //= {};

    # Accumulate the counts from %$second into %$first
    foreach my $k (keys %$second) {
        $first->{$k} //= 0;
        $first->{$k} += $second->{$k};
    }
}
__PACKAGE__->meta->make_immutable;
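A usage sketch for the learner itself (data invented; the features_kept value of 0.5 is an assumption based on the 'half features kept' test below, where a fractional setting appears to keep that fraction of features per label):

use AI::NaiveBayes::Learner;

my $learner = AI::NaiveBayes::Learner->new( features_kept => 0.5 );  # assumed fractional setting
$learner->add_example(
    attributes => { sheep => 1, wool => 1, farm => 2 },   # invented counts
    labels     => ['farming'],
);
my $classifier = $learner->classifier;   # an AI::NaiveBayes instance by default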
t/01-learner.t
$learner->add_example( attributes => _hash(qw(one two three four)),
                       labels     => ['farming'] );
$learner->add_example( attributes => _hash(qw(five six seven eight)),
                       labels     => ['farming'] );
$learner->add_example( attributes => _hash(qw(one two three four)),
                       labels     => ['farming'] );
$model = $learner->classifier->model;
is keys %{$model->{probs}{farming}}, 4, 'half features kept';
is join(" ", sort { $a cmp $b } keys %{$model->{probs}{farming}}), 'four one three two';
sub _hash { +{ map {$_,1} @_ } }
t/02-predict.t
$classifier = $lr->classifier;
# Predict
$s = $classifier->classify( _hash(qw(jakis tekst po polsku)) );
$h = $s->label_sums;
ok(abs( 3 - $h->{farming} / $h->{vampire} ) < 0.01, 'Prior probabilities' );
################################################################
sub _hash { +{ map {$_,1} @_ } }
t/default_training.t
    {
        attributes => _hash(qw(vampires cannot see their images mirrors)),
        labels     => ['vampire']
    },
);
isa_ok( $classifier, 'AI::NaiveBayes' );
################################################################
sub _hash { +{ map {$_,1} @_ } }