AI-Calibrate

 view release on metacpan or  search on metacpan

lib/AI/Calibrate.pm  view on Meta::CPAN

package AI::Calibrate;

use 5.008008;
use strict;
use warnings;

use vars qw($VERSION);
$VERSION = "1.5";

require Exporter;

our @ISA = qw(Exporter);

# This allows declaration:
#	use AI::Calibrate ':all';
# If you do not need this, moving things directly into @EXPORT or @EXPORT_OK
# will save memory.
our %EXPORT_TAGS = (
    'all' => [
        qw(
              calibrate
              score_prob
              print_mapping
            )
    ]
);

our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our @EXPORT = qw( );

use constant DEBUG => 0;

# Structure slot names
use constant SCORE => 0;
use constant PROB  => 1;

=head1 NAME

AI::Calibrate - Perl module for producing probabilities from classifier scores

=head1 SYNOPSIS

  use AI::Calibrate ':all';
  ... train a classifier ...
  ... test classifier on $points ...
  $calibrated = calibrate($points);

=head1 DESCRIPTION

Classifiers usually return some sort of an instance score with their
classifications.  These scores can be used as probabilities in various
calculations, but first they need to be I<calibrated>.  Naive Bayes, for
example, is a very useful classifier, but the scores it produces are usually
"bunched" around 0 and 1, making these scores poor probability estimates.
Support vector machines have a similar problem.  Both classifier types should
be calibrated before their scores are used as probability estimates.

This module calibrates classifier scores using a method called the Pool
Adjacent Violators (PAV) algorithm.  After you train a classifier, you take a
(usually separate) set of test instances and run them through the classifier,
collecting the scores assigned to each.  You then supply this set of instances
to the calibrate function defined here, and it will return a set of ranges
mapping from a score range to a probability estimate.

For example, assume you have the following set of instance results from your
classifier.  Each result is of the form C<[ASSIGNED_SCORE, TRUE_CLASS]>:

 my $points = [
              [.9, 1],
              [.8, 1],
              [.7, 0],
              [.6, 1],
              [.55, 1],
              [.5, 1],
              [.45, 0],
              [.4, 1],
              [.35, 1],
              [.3, 0 ],
              [.27, 1],
              [.2, 0 ],
              [.18, 0],
              [.1, 1 ],
              [.02, 0]
             ];

If you then call calibrate($points), it will return this structure:

 [
   [.9,    1 ],
   [.7,  3/4 ],
   [.45, 2/3 ],
   [.3,  1/2 ],
   [.2,  1/3 ],
   [.02,   0 ]
  ]

This means that, given a SCORE produced by the classifier, you can map the
SCORE onto a probability like this:

               SCORE >= .9        prob = 1
         .9  > SCORE >= .7        prob = 3/4
         .7  > SCORE >= .45       prob = 2/3
         .45 > SCORE >= .3        prob = 3/4
         .2  > SCORE >= .7        prob = 3/4
         .02 > SCORE              prob = 0

For a realistic example of classifier calibration, see the test file
t/AI-Calibrate-NB.t, which uses the AI::NaiveBayes1 module to train a Naive
Bayes classifier then calibrates it using this module.

=cut

=head1 FUNCTIONS

=over 4

=item B<calibrate>

This is the main calibration function.  The calling form is:

my $calibrated = calibrate( $data, $sorted);

$data looks like: C<[ [score, class], [score, class], [score, class]...]>
Each score is a number.  Each class is either 0 (negative class) or 1
(positive class).

$sorted is boolean (0 by default) indicating whether the data are already
sorted by score.  Unless this is set to 1, calibrate() will sort the data
itself.

Calibrate returns a reference to an ordered list of references:

  [ [score, prob], [score, prob], [score, prob] ... ]

Scores will be in descending numerical order.  See the DESCRIPTION section for
how this structure is interpreted.  You can pass this structure to the
B<score_prob> function, along with a new score, to get a probability.

=cut

sub calibrate {
    my($data, $sorted) = @_;

    if (DEBUG) {
        print "Original data:\n";
        for my $pair (@$data) {
            my($score, $prob) = @$pair;
            print "($score, $prob)\n";
        }
    }

    #  Copy the data over so PAV can clobber the PROB field
    my $new_data = [ map([@$_], @$data) ];

    #   If not already sorted, sort data decreasing by score
    if (!$sorted) {
        $new_data = [ sort { $b->[SCORE] <=> $a->[SCORE] } @$new_data ];
    }

    PAV($new_data);

    if (DEBUG) {
        print("After PAV, vector is:\n");
        print_vector($new_data);
    }

    my(@result);



( run in 0.774 second using v1.01-cache-2.11-cpan-39bf76dae61 )