AI-NaiveBayes1
NaiveBayes1.pm view on Meta::CPAN
# Find results for unseen instances
my $result = $nb->predict
(attributes => {model=>'T', place=>'N'});
foreach my $k (keys(%{ $result })) {
print "for label $k P = " . $result->{$k} . "\n";
}
# export the model into a string
my $string = $nb->export_to_YAML();
# create the same model from the string
my $nb1 = AI::NaiveBayes1->import_from_YAML($string);
# write the model to a file (shorter than model->string->file)
$nb->export_to_YAML_file('t/tmp1');
# read the model from a file (shorter than file->string->model)
my $nb2 = AI::NaiveBayes1->import_from_YAML_file('t/tmp1');
See Examples for more examples.
=head1 DESCRIPTION
This module implements the classic "Naive Bayes" machine learning
algorithm.
=head2 Data Structure
An object contains the following fields:
=over 4
=item C<{attributes}>
List of attribute names.
=item C<{attribute_type}{$a}>
Attribute type: 'real' for real-valued attributes, anything else (e.g., 'nominal') otherwise.
=item C<{labels}>
List of labels.
=item C<{attvals}{$a}>
List of attribute values
=item C<{real_stat}{$a}{$v}{$l}{sum}>
Statistics for real-valued attributes. Besides C<sum>, the keys C<count>, C<mean>, and C<stddev> are also stored.
=item C<{numof_instances}>
Number of training instances.
=item C<{stat_labels}{$l}>
Label count in training data.
=item C<{stat_attributes}{$a}>
Statistics for an attribute: C<...{$value}{$label}> = count of
instances.
=item C<{smoothing}{$attribute}>
Attribute smoothing. No smoothing is applied if the entry does not exist.
Implemented smoothing: a string matching C</^unseen count=/> followed by a
number, e.g., C<unseen count=0.5>.
=back
=head2 Attribute Smoothing
For an attribute A one can specify:
$nb->{smoothing}{A} = 'unseen count=0.5';
to provide a count for unseen data. The count is taken into
consideration in training and prediction, when any unseen attribute
values are observed. Zero probabilities can be prevented in this way.
A count other than 0.5 can be provided, but if it is <=0 it will be
set to 0.5. The method is similar to add-one smoothing. A special
attribute value '*' is used for all unseen data.
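
The rule for the count can be sketched in a few lines of stand-alone Perl. This is only an illustration, not the module's code: the sub name C<unseen_count> is made up, and the way the count enters the denominator is an assumption about the normalisation rather than something stated above.

```perl
use strict;
use warnings;

# Illustrative helper, not the module's code: extract the unseen count
# from a smoothing specification such as 'unseen count=0.5'.
sub unseen_count {
    my ($spec) = @_;
    return undef unless defined $spec && $spec =~ /^unseen count=(.*)$/;
    my $count = $1;
    no warnings 'numeric';              # tolerate non-numeric text after '='
    return $count <= 0 ? 0.5 : $count + 0;
}

# Assumed normalisation: the unseen count is kept for the special value '*'
# and added to the denominator together with the observed counts.
my %seen  = ( sunny => 2, overcast => 4, rainy => 3 );
my $extra = unseen_count('unseen count=0.5');
my $total = $extra;
$total += $_ for values %seen;

printf "P(unseen value) = %.4f\n", $extra / $total;
printf "P(sunny)        = %.4f\n", $seen{sunny} / $total;
```

With this reading, an attribute value never seen in training gets a small non-zero probability instead of zero.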
=head1 METHODS
=head2 Constructor Methods
=over 4
=item new()
Constructor. Creates a new C<AI::NaiveBayes1> object and returns it.
=item import_from_YAML($string)
Constructor. Creates a new C<AI::NaiveBayes1> object from a string where it is
represented in C<YAML>. Requires YAML module.
=item import_from_YAML_file($file_name)
Constructor. Creates a new C<AI::NaiveBayes1> object from a file where it is
represented in C<YAML>. Requires YAML module.
=back
=head2 Non-Constructor Methods
=over 4
=item add_table()
Add instances from a table. The first row lists the attribute names; the
remaining rows hold the values. If the name of the last attribute is C<count>,
it is interpreted as a repetition count and used appropriately. The last
attribute (after optionally removing C<count>) is the class attribute.
Attributes and values are separated by white space.
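
The table layout described here can be illustrated with a small stand-alone parser. It is a sketch of the format only: C<parse_table>, the record keys, and the sample column names are hypothetical, and the module's own C<add_table()> may differ in details such as how labels are named.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's add_table(): turn a
# whitespace-separated table into a list of instance records.
sub parse_table {
    my ($text) = @_;
    # Split into rows of fields, dropping blank lines.
    my @rows = grep { @$_ } map { [ split ' ', $_ ] } split /\n/, $text;
    die "empty table\n" unless @rows;
    my @names = @{ shift @rows };

    # A trailing 'count' column holds repetition counts.
    my $has_count = $names[-1] eq 'count';
    pop @names if $has_count;
    my $class = pop @names;              # last remaining column is the class

    my @instances;
    for my $row (@rows) {
        my @values = @$row;
        my $cases  = $has_count ? pop @values : 1;
        my $label  = pop @values;
        my %attributes;
        @attributes{@names} = @values;
        push @instances,
            { attributes => \%attributes, class => $class,
              label => $label, cases => $cases };
    }
    return @instances;
}

my @instances = parse_table(<<'END');
model place broken count
H     B     Y      30
T     N     N      5
END

printf "%s=%s x%d\n", $_->{class}, $_->{label}, $_->{cases} for @instances;
```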
=item add_csv_file($filename)
Add instances from a CSV file. Primitive format implementation (e.g.,
no commas allowed in attribute names or values).
=item drop_attributes(@attributes)
Delete attributes after adding instances.
=item set_real(list_of_attributes)
Declares a list of attributes to be real-valued. During training,
their conditional probabilities will be modeled with Gaussian (normal)
distributions.
=item C<add_instance(attributes=E<gt>HASH,label=E<gt>STRING|ARRAY)>
Adds a training instance to the categorizer.
=item C<add_instances(attributes=E<gt>HASH,label=E<gt>STRING|ARRAY,cases=E<gt>NUMBER)>
Adds a number of identical instances to the categorizer.
=item export_to_YAML()
=head1 THEORY
Bayes' Theorem is a way of inverting a conditional probability. It
states:
            P(y|x) P(x)
  P(x|y) = -------------
               P(y)
and so on...
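
As a small numeric illustration of the inversion, the sketch below plugs in counts from the weather data in the EXAMPLES section: 9 of the 14 days are play=yes, 5 of the 14 are sunny, and 2 of the 9 play=yes days are sunny. The sub name C<bayes_invert> is made up for this illustration and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of AI::NaiveBayes1:
# returns P(x|y) given P(y|x), P(x) and P(y).
sub bayes_invert {
    my ($p_y_given_x, $p_x, $p_y) = @_;
    die "P(y) must be positive\n" unless $p_y > 0;
    return $p_y_given_x * $p_x / $p_y;
}

# x: play=yes, y: outlook=sunny
my $p_yes_given_sunny = bayes_invert(2/9, 9/14, 5/14);
printf "P(play=yes | outlook=sunny) = %.3f\n", $p_yes_given_sunny;
```

The result is 0.4, which agrees with counting directly: 2 of the 5 sunny days are play=yes.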
This is a pretty standard algorithm explained in many machine learning
textbooks (e.g., "Data Mining" by Witten and Frank).
The algorithm relies on estimating P(A|C), where A is an arbitrary
attribute, and C is the class attribute. If A is not real-valued,
then this conditional probability is estimated using a table of all
possible values for A and C.
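
For a nominal attribute, the table amounts to dividing co-occurrence counts by label counts. A minimal sketch, assuming a count layout of value by label similar to C<{stat_attributes}{$a}> and C<{stat_labels}> from the Data Structure section; the sub name C<cond_prob> is illustrative, the counts are those of C<outlook> in the weather data of the EXAMPLES section, and smoothing is ignored.

```perl
use strict;
use warnings;

# Counts of outlook values per label, laid out as {$value}{$label}.
my %outlook = (
    sunny    => { 'play=yes' => 2, 'play=no' => 3 },
    overcast => { 'play=yes' => 4 },
    rainy    => { 'play=yes' => 3, 'play=no' => 2 },
);
my %label_count = ( 'play=yes' => 9, 'play=no' => 5 );

# Illustrative helper: estimate P(A=value | C=label) as a relative frequency.
sub cond_prob {
    my ($counts, $labels, $value, $label) = @_;
    my $n = $counts->{$value}{$label} // 0;   # unseen pairs count as 0
    return $n / $labels->{$label};
}

printf "P(outlook=sunny | play=yes) = %.3f\n",
    cond_prob(\%outlook, \%label_count, 'sunny', 'play=yes');
printf "P(outlook=overcast | play=no) = %.3f\n",
    cond_prob(\%outlook, \%label_count, 'overcast', 'play=no');
```

The zero estimate for overcast under play=no is the kind of value that attribute smoothing is meant to prevent.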
If A is real-valued, then the distribution P(A|C) is modeled as a
Gaussian (normal) distribution for each possible value of C=c. Hence,
for each C=c we collect the mean value (m) and standard deviation (s)
for A during training. During classification, P(A=a|C=c) is estimated
using Gaussian distribution, i.e., in the following way:
                    1                (a-m)^2
  P(A=a|C=c) = ------------ * exp( - ------- )
               sqrt(2*Pi)*s           2*s^2
This boils down to the following lines of code:
$scores{$label} *=
0.398942280401433 / $m->{real_stat}{$att}{$label}{stddev}*
exp( -0.5 *
( ( $newattrs->{$att} -
$m->{real_stat}{$att}{$label}{mean})
/ $m->{real_stat}{$att}{$label}{stddev}
) ** 2
);
i.e.,
P(A=a|C=c) = 0.398942280401433 / s *
exp( -0.5 * ( ( a-m ) / s ) ** 2 );
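
The same density can be tried out on its own. The sketch below is a stand-alone illustration (the sub name C<gaussian_density> is not part of the module) that also checks where the constant 0.398942280401433 comes from. The sample numbers are the temperatures under play=yes in the weather data of the EXAMPLES section, with mean 73 and, assuming the sample (n-1) standard deviation is used, a standard deviation of sqrt(38).

```perl
use strict;
use warnings;

# Illustrative stand-alone version of the density above.
sub gaussian_density {
    my ($value, $mean, $stddev) = @_;
    die "stddev must be positive\n" unless $stddev > 0;
    my $z = ($value - $mean) / $stddev;
    return 0.398942280401433 / $stddev * exp(-0.5 * $z ** 2);
}

# 0.398942280401433 is 1/sqrt(2*Pi), with Pi obtained as 4*atan2(1,1).
my $inv_sqrt_2pi = 1 / sqrt(2 * 4 * atan2(1, 1));

my $density = gaussian_density(66, 73, sqrt(38));
printf "1/sqrt(2*Pi) = %.15f\n", $inv_sqrt_2pi;
printf "f(temperature=66 | play=yes) = %.4f\n", $density;
```

Note that the value is a density, not a probability, so it is only meaningful relative to the densities computed for the other labels.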
=head1 EXAMPLES
Example with a real-valued attribute modeled by a Gaussian
distribution (from the book "Data Mining" by I. Witten and E. Frank, the WEKA
book, page 86):
# @relation weather
#
# @attribute outlook {sunny, overcast, rainy}
# @attribute temperature real
# @attribute humidity real
# @attribute windy {TRUE, FALSE}
# @attribute play {yes, no}
#
# @data
# sunny,85,85,FALSE,no
# sunny,80,90,TRUE,no
# overcast,83,86,FALSE,yes
# rainy,70,96,FALSE,yes
# rainy,68,80,FALSE,yes
# rainy,65,70,TRUE,no
# overcast,64,65,TRUE,yes
# sunny,72,95,FALSE,no
# sunny,69,70,FALSE,yes
# rainy,75,80,FALSE,yes
# sunny,75,70,TRUE,yes
# overcast,72,90,TRUE,yes
# overcast,81,75,FALSE,yes
# rainy,71,91,TRUE,no
use AI::NaiveBayes1;
use YAML;
my $nb = AI::NaiveBayes1->new;
$nb->set_real('temperature', 'humidity');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>85,humidity=>85,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>80,humidity=>90,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>83,humidity=>86,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>70,humidity=>96,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>68,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>65,humidity=>70,windy=>'TRUE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>64,humidity=>65,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>72,humidity=>95,windy=>'FALSE'},label=>'play=no');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>69,humidity=>70,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>75,humidity=>80,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'sunny',temperature=>75,humidity=>70,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>72,humidity=>90,windy=>'TRUE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'overcast',temperature=>81,humidity=>75,windy=>'FALSE'},label=>'play=yes');
$nb->add_instance(attributes=>{outlook=>'rainy',temperature=>71,humidity=>91,windy=>'TRUE'},label=>'play=no');
$nb->train;
my $printedmodel = "Model:\n" . $nb->print_model;
my $p = $nb->predict(attributes=>{outlook=>'sunny',temperature=>66,humidity=>90,windy=>'TRUE'});
YAML::DumpFile('file', $p);
die unless (abs($p->{'play=no'} - 0.792) < 0.001);
die unless(abs($p->{'play=yes'} - 0.208) < 0.001);
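
The two figures checked above can be reproduced without the module, which may help when following the THEORY section by hand. The sketch below is an independent recomputation, not the module's implementation: all sub names are made up, labels are the bare C<yes>/C<no> values, nominal attributes use plain relative frequencies without smoothing, and real-valued attributes are assumed to use the sample (n-1) standard deviation.

```perl
use strict;
use warnings;

# Weather data: outlook, temperature, humidity, windy, play.
my @data = map { [ split /,/, $_ ] } split ' ', <<'END';
sunny,85,85,FALSE,no      sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes  rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes     rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes   sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes     rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes      overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes  rainy,71,91,TRUE,no
END

my @nominal = (0, 3);    # column indexes: outlook, windy
my @real    = (1, 2);    # column indexes: temperature, humidity

# Gaussian density with mean $m and standard deviation $s.
sub gaussian {
    my ($x, $m, $s) = @_;
    return 0.398942280401433 / $s * exp(-0.5 * (($x - $m) / $s) ** 2);
}

# Mean and sample (n-1) standard deviation of a list of numbers.
sub mean_stddev {
    my @x = @_;
    my $mean = 0;
    $mean += $_ / @x for @x;
    my $ss = 0;
    $ss += ($_ - $mean) ** 2 for @x;
    return ($mean, sqrt($ss / (@x - 1)));
}

# Returns a hash ref of label => normalised probability for one query,
# given as an array ref in the same column order, without the label.
sub predict {
    my ($query) = @_;
    my %by_label;
    push @{ $by_label{ $_->[4] } }, $_ for @data;

    my %score;
    for my $label (keys %by_label) {
        my @rows  = @{ $by_label{$label} };
        my $score = @rows / @data;                      # prior P(C=c)
        for my $col (@nominal) {
            my $match = grep { $_->[$col] eq $query->[$col] } @rows;
            $score *= $match / @rows;                   # relative frequency
        }
        for my $col (@real) {
            my ($m, $s) = mean_stddev(map { $_->[$col] } @rows);
            $score *= gaussian($query->[$col], $m, $s);
        }
        $score{$label} = $score;
    }

    my $total = 0;
    $total += $_ for values %score;
    my %prob = map { $_ => $score{$_} / $total } keys %score;
    return \%prob;
}

my $p = predict([ 'sunny', 66, 90, 'TRUE' ]);
printf "play=%s: %.3f\n", $_, $p->{$_} for sort keys %$p;
```

Under these assumptions the result agrees with the values asserted above, roughly 0.792 for play=no and 0.208 for play=yes.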
=head1 HISTORY
L<Algorithm::NaiveBayes> by Ken Williams was not what I needed, so I
wrote this one. L<Algorithm::NaiveBayes> is oriented towards text
categorization; it includes smoothing and log probabilities. This
module is a generic, basic Naive Bayes algorithm.
=head1 THANKS
I would like to thank Daniel Bohmer for documentation corrections,
Yung-chung Lin (cpan:xern) for the implementation of the Gaussian model
for continuous variables, and the following people for bug reports, support,
and comments (in no particular order):
Michael Stevens, Tom Dyson, Dan Von Kohorn, Craig Talbert,
Andrew Brian Clegg,
and CPAN-testers, including: Andreas Koenig, Alexandr Ciornii, jlatour,
Jost.Krieger, tvmaly, Matthew Musgrove, Michael Stevens, Nigel Horne,