AI-NaiveBayes1
NaiveBayes1.pm view on Meta::CPAN
# (c) 2003-21 Vlado Keselj https://web.cs.dal.ca/~vlado
package AI::NaiveBayes1;
use strict;
require Exporter;
use vars qw($VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS);
@EXPORT = qw(new);
use vars qw($Version);
$Version = $VERSION = '2.012';
use vars @EXPORT_OK;
# non-exported package globals go here
use vars qw();
sub new {
    my $package = shift;
    return bless {
        attributes => [ ],
        labels => [ ],
        attvals => {},
        real_stat => {},
        numof_instances => 0,
        stat_labels => {},
        stat_attributes => {},
        smoothing => {},
        attribute_type => {},
    }, $package;
}
sub set_real {
    my ($self, @attr) = @_;
    foreach my $a (@attr) { $self->{attribute_type}{$a} = 'real' }
}
sub import_from_YAML {
    my $package = shift;
    my $yaml = shift;
    my $self = YAML::Load($yaml);
    return bless $self, $package;
}
sub import_from_YAML_file {
    my $package = shift;
    my $yamlf = shift;
    my $self = YAML::LoadFile($yamlf);
    return bless $self, $package;
}
# Assume that a last header named `count' holds instance counts.
# After optionally removing `count', the last header is the label.
sub add_table {
    my $self = shift;
    my @atts = (); my $lbl=''; my $cnt = '';
    while (@_) {
        my $table = shift;
        if ($table =~ /^(.*)\n[ \t]*-+\n/) {
            my $a = $1; $table = $';
            $a =~ s/^\s+//; $a =~ s/\s+$//;
            if ($a =~ /\s*\bcount\s*$/) {
                $a=$`; $cnt=1; } else { $cnt='' }
            @atts = split(/\s+/, $a);
            $lbl = pop @atts;
        }
        while ($table ne '') {
            $table =~ /^(.*)\n?/ or die;
            my $r=$1; $table = $';
            $r =~ s/^\s+//; $r=~ s/\s+$//;
            if ($r =~ /^-+$/) { next }
            my @v = split(/\s+/, $r);
            die "values (#=$#v): {@v}\natts (#=$#atts): @atts, lbl=$lbl,\n".
                "count: $cnt\n" unless $#v-($cnt?2:1) == $#atts;
            my %av=(); my @a = @atts;
            while (@a) { $av{shift @a} = shift(@v) }
            $self->add_instances(attributes=>\%av,
                                 label=>"$lbl=$v[0]",
                                 cases=>($cnt?$v[1]:1) );
        }
    }
} # end of add_table
# Simplified; not generally compatible.
# Assume that the last header is label. The first row contains
# attribute names.
sub add_csv_file {
    my $self = shift; my $fn = shift; local *F;
    open(F,$fn) or die "Cannot open CSV file `$fn': $!";
    local $_ = <F>; my @atts = (); my $lbl=''; my $cnt = '';
    chomp; @atts = split(/\s*,\s*/, $_); $lbl = pop @atts;
    while (<F>) {
        chomp; my @v = split(/\s*,\s*/, $_);
This module implements the classic "Naive Bayes" machine learning
algorithm.
=head2 Data Structure
An object contains the following fields:
=over 4
=item C<{attributes}>
List of attribute names.
=item C<{attribute_type}{$a}>
Attribute type: 'real', or anything else (e.g., 'nominal') for an
attribute that is not real-valued.
=item C<{labels}>
List of labels.
=item C<{attvals}{$a}>
List of attribute values
=item C<{real_stat}{$a}{$l}{sum}>
Statistics for a real-valued attribute C<$a> and label C<$l>; besides
'sum' also: count, mean, stddev.
=item C<{numof_instances}>
Number of training instances.
=item C<{stat_labels}{$l}>
Label count in training data.
=item C<{stat_attributes}{$a}>
Statistics for an attribute: C<...{$value}{$label}> = count of
instances.
=item C<{smoothing}{$attribute}>
Attribute smoothing. No smoothing is applied if the entry does not
exist. Implemented smoothing: a string matching /^unseen count=/
followed by a number, e.g., 0.5.
=back
=head2 Attribute Smoothing
For an attribute A one can specify:
$nb->{smoothing}{A} = 'unseen count=0.5';
to provide a count for unseen data. The count is taken into
consideration in training and prediction, when any unseen attribute
values are observed. Zero probabilities can be prevented in this way.
A count other than 0.5 can be provided, but if it is <=0 it will be
set to 0.5. The method is similar to add-one smoothing. A special
attribute value '*' is used for all unseen data.
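The effect of an unseen count can be illustrated with a small
self-contained sketch. It is not the module's code: the helper
C<smoothed_probs> and the way the counts are normalized are assumptions
made for illustration. Only the fallback to 0.5 and the '*' pseudo-value
are taken from the description above. The counts are those of the
attribute outlook under the label play=no in the weather data listed in
EXAMPLES, where the value 'overcast' never occurs.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of AI::NaiveBayes1): turn the value counts
# of one attribute under one label into probabilities, reserving
# $unseen_count for the pseudo-value '*' that stands for all unseen values.
sub smoothed_probs {
    my ($counts, $unseen_count) = @_;
    # As documented above, a count <= 0 falls back to 0.5.
    $unseen_count = 0.5 if $unseen_count <= 0;
    my %c = (%$counts, '*' => $unseen_count);
    my $total = 0;
    $total += $_ for values %c;
    my %p = map { ($_ => $c{$_} / $total) } keys %c;
    return \%p;
}

# Outlook counts for play=no: sunny 3 times, rainy 2 times, overcast never.
my $p = smoothed_probs({ sunny => 3, rainy => 2 }, 0.5);

# An unseen value is looked up under '*', so it gets a non-zero probability.
my $p_overcast = exists $p->{overcast} ? $p->{overcast} : $p->{'*'};
printf "P(outlook=overcast | play=no) = %.4f\n", $p_overcast;
```

Without the unseen count, the estimate for 'overcast' would be zero and
would force the whole product for play=no to zero.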
=head1 METHODS
=head2 Constructor Methods
=over 4
=item new()
Constructor. Creates a new C<AI::NaiveBayes1> object and returns it.
=item import_from_YAML($string)
Constructor. Creates a new C<AI::NaiveBayes1> object from a string where it is
represented in C<YAML>. Requires YAML module.
=item import_from_YAML_file($file_name)
Constructor. Creates a new C<AI::NaiveBayes1> object from a file where it is
represented in C<YAML>. Requires YAML module.
=back
=head2 Non-Constructor Methods
=over 4
=item add_table()
Add instances from a table. The first row lists the attribute names
and must be followed by a line of dashes; the remaining rows contain
values. If the name of the last attribute is `count', it is
interpreted as a repetition count and used appropriately. The last
attribute (after optionally removing `count') is the class attribute.
The attributes and values are separated by white space.
=item add_csv_file($filename)
Add instances from a CSV file. Primitive format implementation (e.g.,
no commas allowed in attribute names or values).
=item drop_attributes(@attributes)
Delete attributes after adding instances.
=item set_real(list_of_attributes)
Declares a list of attributes to be real-valued. During training,
their conditional probabilities will be modeled with Gaussian (normal)
distributions.
=item C<add_instance(attributes=E<gt>HASH,label=E<gt>STRING|ARRAY)>
Adds a training instance to the categorizer.
=item C<add_instances(attributes=E<gt>HASH,label=E<gt>STRING|ARRAY,cases=E<gt>NUMBER)>
Adds a number of identical instances to the categorizer.
=item export_to_YAML()
Returns a C<YAML> string representation of an C<AI::NaiveBayes1>
object. Requires YAML module.
=item C<export_to_YAML_file( $file_name )>
Writes a C<YAML> string representation of an C<AI::NaiveBayes1>
object to a file. Requires YAML module.
=item C<print_model( OPTIONAL 'with counts' )>
Returns a human-friendly string representation of the model.
The model should be trained before calling this method.
If the argument 'with counts' is supplied, explanatory expressions
with counts are included as well.
=item train()
Calculates the probabilities that will be necessary for categorization
using the C<predict()> method.
=item C<predict( attributes =E<gt> HASH )>
Use this method to predict the label of an unknown instance. The
attributes should be of the same format as you passed to
C<add_instance()>. C<predict()> returns a hash reference whose keys
are the names of labels, and whose values are corresponding
probabilities.
=item C<labels>
Returns a list of all the labels the object knows about (in no
particular order), or the number of labels if called in a scalar
context.
=back
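The hash reference returned by C<predict()> can be reduced to a single
decision, for example by picking the most probable label. The following
is a minimal sketch that does not call the module: C<$p> is a stand-in
hash whose two values are the ones checked in the weather example in
EXAMPLES, and the variable name C<$best> is chosen for illustration.

```perl
use strict;
use warnings;

# Stand-in for the hash reference returned by predict(): keys are labels,
# values are the corresponding probabilities.
my $p = { 'play=no' => 0.792, 'play=yes' => 0.208 };

# Sort the labels by descending probability and keep the first one.
my ($best) = sort { $p->{$b} <=> $p->{$a} } keys %$p;

printf "predicted label: %s (probability %.3f)\n", $best, $p->{$best};
```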
=head1 THEORY
Bayes' Theorem is a way of inverting a conditional probability. It
states:
           P(y|x) P(x)
 P(x|y) = -------------
              P(y)
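As a worked instance of the theorem, the following sketch takes x to be
play=yes and y to be outlook=sunny, with counts read off the weather
data listed in EXAMPLES. The variable names are chosen for illustration
only. The inverted probability is compared with the value obtained by
counting directly.

```perl
use strict;
use warnings;

# Counts from the weather data: 14 days, 9 with play=yes,
# 5 with outlook=sunny, 2 with both.
my ($n, $n_yes, $n_sunny, $n_both) = (14, 9, 5, 2);

my $p_sunny_given_yes = $n_both  / $n_yes;   # P(y|x)
my $p_yes             = $n_yes   / $n;       # P(x)
my $p_sunny           = $n_sunny / $n;       # P(y)

# Bayes' Theorem: P(x|y) = P(y|x) P(x) / P(y)
my $p_yes_given_sunny = $p_sunny_given_yes * $p_yes / $p_sunny;

# The same quantity counted directly, as a cross-check.
my $direct = $n_both / $n_sunny;

printf "P(yes|sunny) = %.3f (direct count: %.3f)\n",
    $p_yes_given_sunny, $direct;
```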
then this conditional probability is estimated using a table of all
possible values for A and C.
If A is real-valued, then the distribution P(A|C) is modeled as a
Gaussian (normal) distribution for each possible value of C=c. Hence,
for each C=c we collect the mean value (m) and standard deviation (s)
for A during training. During classification, P(A=a|C=c) is estimated
using the Gaussian distribution, i.e., in the following way:
                   1                (a-m)^2
 P(A=a|C=c) = ------------ * exp( - ------- )
              sqrt(2*Pi)*s           2*s^2
this boils down to the following lines of code:
 $scores{$label} *=
     0.398942280401433 / $m->{real_stat}{$att}{$label}{stddev}*
     exp( -0.5 *
          ( ( $newattrs->{$att} -
              $m->{real_stat}{$att}{$label}{mean})
            / $m->{real_stat}{$att}{$label}{stddev}
          ) ** 2
        );
i.e.,
 P(A=a|C=c) = 0.398942280401433 / s *
     exp( -0.5 * ( ( a-m ) / s ) ** 2 );
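The formula can be tried in isolation with the following sketch. The
helpers C<gaussian_density> and C<mean_stddev> are hypothetical names,
not functions of this module, and the use of the sample standard
deviation (n-1 in the denominator) is an assumption. The numbers are
the temperatures of the nine play=yes days in the weather data listed
in EXAMPLES.

```perl
use strict;
use warnings;

my $PI = 4 * atan2(1, 1);

# The constant used in the code above is 1/sqrt(2*Pi).
my $const = 1 / sqrt(2 * $PI);

# Hypothetical helper: Gaussian density at $x for a given mean and
# standard deviation.
sub gaussian_density {
    my ($x, $mean, $stddev) = @_;
    return $const / $stddev * exp(-0.5 * (($x - $mean) / $stddev) ** 2);
}

# Hypothetical helper: mean and sample standard deviation of a list
# (assumption: n-1 in the denominator).
sub mean_stddev {
    my @v = @_;
    my $n = scalar @v;
    my $sum = 0;
    $sum += $_ for @v;
    my $mean = $sum / $n;
    my $ss = 0;
    $ss += ($_ - $mean) ** 2 for @v;
    return ($mean, sqrt($ss / ($n - 1)));
}

# Temperatures of the play=yes days.
my @temp_yes = (83, 70, 68, 64, 69, 75, 75, 72, 81);
my ($m, $s) = mean_stddev(@temp_yes);

# Estimate of P(temperature=66 | play=yes).
my $density = gaussian_density(66, $m, $s);
printf "m=%.1f s=%.2f density=%.4f\n", $m, $s, $density;
```

Note that the result is a density, not a probability, so it can exceed
1 when the standard deviation is small.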
=head1 EXAMPLES
Example with real-valued attributes modeled by Gaussian
distributions (from the book "Data Mining" by I. Witten and E. Frank
(the WEKA book), page 86):
# @relation weather
#
# @attribute outlook {sunny, overcast, rainy}
# @attribute temperature real
# @attribute humidity real
# @attribute windy {TRUE, FALSE}
# @attribute play {yes, no}
#
# @data
# sunny,85,85,FALSE,no
# sunny,80,90,TRUE,no
# overcast,83,86,FALSE,yes
# rainy,70,96,FALSE,yes
# rainy,68,80,FALSE,yes
# rainy,65,70,TRUE,no
# overcast,64,65,TRUE,yes
# sunny,72,95,FALSE,no
# sunny,69,70,FALSE,yes
# rainy,75,80,FALSE,yes
# sunny,75,70,TRUE,yes
# overcast,72,90,TRUE,yes
# overcast,81,75,FALSE,yes
# rainy,71,91,TRUE,no
use AI::NaiveBayes1;
use YAML;   # needed for YAML::DumpFile below
my $nb = AI::NaiveBayes1->new;
$nb->set_real('temperature', 'humidity');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>85,humidity=>85,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>80,humidity=>90,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>83,humidity=>86,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>70,humidity=>96,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>68,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>65,humidity=>70,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>64,humidity=>65,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>72,humidity=>95,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>69,humidity=>70,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>75,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>75,humidity=>70,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>72,humidity=>90,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>81,humidity=>75,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>71,humidity=>91,windy=>'TRUE'},label=>'play=no');
$nb->train;
my $printedmodel = "Model:\n" . $nb->print_model;
my $p = $nb->predict(attributes=>{outlook=>'sunny',temperature=>66,humidity=>90,windy=>'TRUE'});
YAML::DumpFile('file', $p);
die unless (abs($p->{'play=no'} - 0.792) < 0.001);
die unless(abs($p->{'play=yes'} - 0.208) < 0.001);
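The numbers checked above can be reproduced by hand. The following
self-contained sketch does not use the module and is not its
implementation: the helper names are hypothetical, nominal attributes
are estimated by plain relative frequencies without smoothing, and
real-valued attributes use the sample standard deviation (n-1 in the
denominator), all of which are assumptions. Labels are written as 'yes'
and 'no' rather than 'play=yes' and 'play=no'. Under these assumptions
the normalized scores come out close to 0.792 and 0.208.

```perl
use strict;
use warnings;

my $PI = 4 * atan2(1, 1);

# Weather data: outlook, temperature, humidity, windy, play.
my @rows = map { [ split /,/ ] } split /\n/, <<'END';
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
END

my %real = (1 => 1, 2 => 1);   # column indices of real-valued attributes

# Training: label counts, value counts for nominal columns, and raw
# values for real-valued columns.
my (%n_label, %count, %values);
for my $r (@rows) {
    my $label = $r->[4];
    $n_label{$label}++;
    for my $i (0 .. 3) {
        if ($real{$i}) { push @{ $values{$i}{$label} }, $r->[$i] }
        else           { $count{$i}{ $r->[$i] }{$label}++ }
    }
}

# Hypothetical helper: Gaussian density.
sub gaussian_density {
    my ($x, $mean, $stddev) = @_;
    return 1 / (sqrt(2 * $PI) * $stddev)
         * exp(-0.5 * (($x - $mean) / $stddev) ** 2);
}

# Hypothetical helper: mean and sample standard deviation.
sub mean_stddev {
    my @v = @_;
    my $n = scalar @v;
    my $sum = 0;
    $sum += $_ for @v;
    my $mean = $sum / $n;
    my $ss = 0;
    $ss += ($_ - $mean) ** 2 for @v;
    return ($mean, sqrt($ss / ($n - 1)));
}

# Prediction: prior times the product of per-attribute estimates.
my @new = ('sunny', 66, 90, 'TRUE');
my %score;
for my $label (keys %n_label) {
    my $s = $n_label{$label} / scalar(@rows);
    for my $i (0 .. 3) {
        if ($real{$i}) {
            $s *= gaussian_density($new[$i],
                                   mean_stddev(@{ $values{$i}{$label} }));
        } else {
            $s *= ($count{$i}{ $new[$i] }{$label} // 0) / $n_label{$label};
        }
    }
    $score{$label} = $s;
}

# Normalize the scores so that they sum to 1.
my $total = 0;
$total += $_ for values %score;
my %p = map { ($_ => $score{$_} / $total) } keys %score;
printf "play=%s: %.3f\n", $_, $p{$_} for sort keys %p;
```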
=head1 HISTORY
L<Algorithm::NaiveBayes> by Ken Williams was not what I needed, so I
wrote this one. L<Algorithm::NaiveBayes> is oriented towards text
categorization and includes smoothing and log probabilities. This
module is a generic, basic Naive Bayes algorithm.
=head1 THANKS
I would like to thank Daniel Bohmer for documentation corrections,
Yung-chung Lin (cpan:xern) for the implementation of the Gaussian model
for continuous variables, and the following people for bug reports, support,
and comments (in no particular order):
Michael Stevens, Tom Dyson, Dan Von Kohorn, Craig Talbert,
Andrew Brian Clegg,
and CPAN-testers, including: Andreas Koenig, Alexandr Ciornii, jlatour,
Jost.Krieger, tvmaly, Matthew Musgrove, Michael Stevens, Nigel Horne,
Graham Crookham, David Cantrell (dcantrell).
=head1 AUTHOR
Copyright 2003-21 Vlado Keselj L<https://web.cs.dal.ca/~vlado>.
In 2004 Yung-chung Lin provided the implementation of the Gaussian model for
continuous variables.
This script is provided "as is" without expressed or implied warranty.
This is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.
The module is available on CPAN (L<https://metacpan.org/author/VLADO>) and
at L<https://web.cs.dal.ca/~vlado/srcperl/>. The latter site is
updated more frequently.