=item do_purge()

Purges training instances and their associated information from the
NaiveBayes object. This can save memory after training.

=item purge()

Returns true or false depending on the value of the object's C<purge>
property. An optional boolean argument sets the property.
=item save_state($path)
This object method saves the object to disk for later use. The
C<$path> argument indicates the place on disk where the object should
be saved:
  $nb->save_state($path);

=item restore_state($path)
This class method reads the file specified by C<$path> and returns the
object that was previously stored there using C<save_state()>:
  $nb = Algorithm::NaiveBayes->restore_state($path);

=back
=head1 THEORY
Bayes' Theorem is a way of inverting a conditional probability. It
states:
            P(y|x) P(x)
  P(x|y) = -------------
               P(y)

The notation C<P(x|y)> means "the probability of C<x> given C<y>." See also
L<http://mathforum.org/dr.math/problems/battisfore.03.22.99.html>
for a simple but complete example of Bayes' Theorem.
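As a quick worked example (the numbers below are invented purely for
illustration): if C<P(y|x)> = 0.9, C<P(x)> = 0.01, and C<P(y)> = 0.05,
then C<P(x|y)> = 0.9 * 0.01 / 0.05 = 0.18. In Perl:

  use strict;
  use warnings;

  # Hypothetical numbers, chosen only to illustrate the inversion:
  my $p_y_given_x = 0.9;   # P(y|x)
  my $p_x         = 0.01;  # P(x)
  my $p_y         = 0.05;  # P(y)

  # Bayes' Theorem: P(x|y) = P(y|x) * P(x) / P(y)
  my $p_x_given_y = $p_y_given_x * $p_x / $p_y;
  printf "P(x|y) = %.2f\n", $p_x_given_y;  # prints "P(x|y) = 0.18"
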
In this case, we want to know the probability of a given category given a
certain string of words in a document, so we have:
                   P(words | cat) P(cat)
  P(cat | words) = ---------------------
                         P(words)

We have applied Bayes' Theorem because C<P(cat | words)> is a difficult
quantity to compute directly, but C<P(words | cat)> and C<P(cat)> are accessible
(see below).
The greater the expression above, the greater the probability that the given
document belongs to the given category. So we want to find the maximum
value. We write this as
                             P(words | cat) P(cat)
  Best category =   ArgMax   ---------------------
                  cat in cats       P(words)

Since C<P(words)> doesn't change over the range of categories, we can
discard it. That's convenient, because we didn't want to have to compute
these values anyway. So our new formula is:
  Best category =   ArgMax   P(words | cat) P(cat)
                  cat in cats

Finally, we note that if C<w1, w2, ... wn> are the words in the document,
then this expression is equivalent to:
  Best category =   ArgMax   P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat)
                  cat in cats

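The final formula can be sketched in a few lines of Perl. The toy
probabilities below are invented for illustration, and the product is
computed as a sum of logarithms, which is what practical implementations
do to avoid floating-point underflow on long documents:

  use strict;
  use warnings;

  # Invented toy model: P(word|cat) for two categories, plus priors P(cat).
  my %p_word_given_cat = (
    sports   => { ball => 0.30, vote => 0.05 },
    politics => { ball => 0.05, vote => 0.30 },
  );
  my %p_cat = ( sports => 0.5, politics => 0.5 );

  # ArgMax over categories of P(w1|cat)*...*P(wn|cat)*P(cat),
  # computed in log space: log P(cat) + sum of log P(wi|cat).
  sub best_category {
    my @words = @_;
    my ($best, $best_score);
    for my $cat (keys %p_cat) {
      my $score = log $p_cat{$cat};
      $score += log $p_word_given_cat{$cat}{$_} for @words;
      ($best, $best_score) = ($cat, $score)
        if !defined $best_score || $score > $best_score;
    }
    return $best;
  }

  print best_category(qw(vote vote ball)), "\n";  # prints "politics"

Since the logarithm is monotonic, the category that maximizes the sum of
logs is the same one that maximizes the original product.
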
That's the formula I use in my document categorization code. The last
step is the only non-rigorous one in the derivation, and this is the
"naive" part of the Naive Bayes technique. It assumes that the
probability of each word appearing in a document is unaffected by the
presence or absence of each other word in the document. We assume
this even though we know this isn't true: for example, the word
"iodized" is far more likely to appear in a document that contains the
word "salt" than it is to appear in a document that contains the word
"subroutine". Luckily, as it turns out, making this assumption even
when it isn't true may have little effect on our results, as the
following paper by Pedro Domingos argues:
L<http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz>
=head1 HISTORY
My first implementation of a Naive Bayes algorithm was in the
now-obsolete AI::Categorize module, first released in May 2001. I
replaced it with the Naive Bayes implementation in AI::Categorizer
(note the extra 'r'), first released in July 2002. I then extracted
that implementation into its own module that could be used outside the
framework, and that's what you see here.
=head1 AUTHOR
Ken Williams, ken@mathforum.org
=head1 COPYRIGHT
Copyright 2003-2004 Ken Williams. All rights reserved.
This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
=head1 SEE ALSO
AI::Categorizer(3), L<perl>.
=cut