AI-Calibrate
1.3 Fri Nov 4
- Removed dependency on Test::Deep, added explicit declaration of
dependency on Test::More to Makefile.PL
1.2 Thu Nov 3
- Fixed test ./t/AI-Calibrate-NB.t so that the test wouldn't fail. It
used to call is_deeply, which was failing on slight differences
between floating-point numbers. It now compares with a small tolerance.
1.1 Thu Feb 28 19:00:06 2008
- Added new function print_mapping
- Added new test file AI-Calibrate-NB.t which, if AI::NaiveBayes1 is
present, trains a classifier and calibrates it.
1.0 Thu Feb 05 11:37:31 2008
- First public release to CPAN.
0.01 Thu Jan 24 11:37:31 2008
- original version; created by h2xs 1.23 with options
-XA -n AI::Calibrate
lib/AI/Calibrate.pm
# This allows declaration:
# use AI::Calibrate ':all';
# If you do not need this, moving things directly into @EXPORT or @EXPORT_OK
# will save memory.
our %EXPORT_TAGS = (
'all' => [
qw(
calibrate
score_prob
print_mapping
)
]
);
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
our @EXPORT = qw( );
use constant DEBUG => 0;
# Structure slot names (inferred from their use below: each data pair
# is indexed as [SCORE, PROB])
use constant SCORE => 0;
use constant PROB  => 1;
lib/AI/Calibrate.pm
Naive Bayes, for example, is a very useful classifier, but the scores it produces are usually
"bunched" around 0 and 1, making these scores poor probability estimates.
Support vector machines have a similar problem. Both classifier types should
be calibrated before their scores are used as probability estimates.
This module calibrates classifier scores using the Pool Adjacent
Violators (PAV) algorithm. After you train a classifier, you take a
(usually separate) set of test instances and run them through the classifier,
collecting the score assigned to each. You then supply this set of scored
instances to the calibrate function defined here, which returns a mapping
from score ranges to probability estimates.
For example, assume you have the following set of instance results from your
classifier. Each result is of the form C<[ASSIGNED_SCORE, TRUE_CLASS]>:
my $points = [
[.9, 1],
[.8, 1],
[.7, 0],
[.6, 1],
[.55, 1],
lib/AI/Calibrate.pm

Running calibrate over such a set of points yields a structure of
[score bound, probability] pairs, for example:
[
[.9, 1 ],
[.7, 3/4 ],
[.45, 2/3 ],
[.3, 1/2 ],
[.2, 1/3 ],
[.02, 0 ]
]
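The call itself connects the two structures. A minimal sketch (the second
argument is the "already sorted" flag; both sorted and unsorted forms appear
in this distribution's test files):

    # $points holds [ASSIGNED_SCORE, TRUE_CLASS] pairs, already sorted
    # by decreasing score, so we pass 1 for the $sorted flag.
    my $calibrated = calibrate($points, 1);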
This means that, given a SCORE produced by the classifier, you can map the
SCORE onto a probability like this:
 SCORE >= .9           prob = 1
 .9  > SCORE >= .7     prob = 3/4
 .7  > SCORE >= .45    prob = 2/3
 .45 > SCORE >= .3     prob = 1/2
 .3  > SCORE >= .2     prob = 1/3
 .2  > SCORE >= .02    prob = 0
 .02 > SCORE           prob = 0
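In code, that lookup is exactly what B<score_prob> does:

    # Map a raw score onto its calibrated probability estimate.
    my $prob = score_prob($calibrated, 0.6);   # 2/3, from the table above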
For a realistic example of classifier calibration, see the test file
t/AI-Calibrate-NB.t, which (if AI::NaiveBayes1 is installed) trains a
naive Bayes classifier and calibrates it.
lib/AI/Calibrate.pm
if (DEBUG) {
print "Original data:\n";
for my $pair (@$data) {
my($score, $prob) = @$pair;
print "($score, $prob)\n";
}
}
# Copy the data over so PAV can clobber the PROB field
my $new_data = [ map([@$_], @$data) ];
# If not already sorted, sort data decreasing by score
if (!$sorted) {
$new_data = [ sort { $b->[SCORE] <=> $a->[SCORE] } @$new_data ];
}
PAV($new_data);
if (DEBUG) {
print("After PAV, vector is:\n");
lib/AI/Calibrate.pm
for my $tuple (@$calibrated) {
my($bound, $prob) = @$tuple;
return $prob if $score >= $bound;
$last_prob = $prob;
}
# If we drop off the end, probability estimate is zero
return 0;
}
=item B<print_mapping>
This is a simple utility function that takes the structure returned by
B<calibrate> and prints out a simple list of lines describing the mapping
created.
Example calling form:
print_mapping($calibrated);
Sample output:
1.00 > SCORE >= 1.00 prob = 1.000
1.00 > SCORE >= 0.71 prob = 0.667
0.71 > SCORE >= 0.39 prob = 0.000
0.39 > SCORE >= 0.00 prob = 0.000
These ranges are not necessarily compressed/optimized, as this sample output
shows.
=back
=cut
sub print_mapping {
my($calibrated) = @_;
my $last_bound = 1.0;
for my $tuple (@$calibrated) {
my($bound, $prob) = @$tuple;
printf("%0.3f > SCORE >= %0.3f prob = %0.3f\n",
$last_bound, $bound, $prob);
$last_bound = $bound;
}
if ($last_bound != 0) {
printf("%0.3f > SCORE >= %0.3f prob = %0.3f\n",
lib/AI/Calibrate.pm
The PAV algorithm is conceptually straightforward. Given a set of training
cases ordered by the scores assigned by the classifier, it first assigns a
probability of one to each positive instance and a probability of zero to each
negative instance, and puts each instance in its own group. It then looks, at
each iteration, for adjacent violators: adjacent groups whose probabilities
locally increase rather than decrease. When it finds such groups, it pools
them and replaces their probability estimates with the average of the group's
values. It continues this process of averaging and replacement until the
entire sequence is monotonically decreasing. The result is a sequence of
instances, each of which has a score and an associated probability estimate,
which can then be used to map scores into probability estimates.
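As a rough illustration of that loop (a standalone sketch using the
representation described above, not the module's actual implementation, which
operates in place on its argument), the pooling pass can be written like this:

    # Sketch of PAV. Input: [score, class] pairs sorted by decreasing
    # score. Output: [lowest_score_in_group, probability] pairs, with
    # probabilities non-increasing.
    sub pav_sketch {
        my @pairs = @_;
        # One group per instance: [sum of class labels, count, lowest score].
        my @groups = map { [ $_->[1], 1, $_->[0] ] } @pairs;
        my $i = 0;
        while ($i < $#groups) {
            my ($cur, $nxt) = @groups[$i, $i + 1];
            if ($cur->[0] / $cur->[1] < $nxt->[0] / $nxt->[1]) {
                # Adjacent violators: pool the two groups and average them.
                splice(@groups, $i, 2,
                       [ $cur->[0] + $nxt->[0],
                         $cur->[1] + $nxt->[1],
                         $nxt->[2] ]);
                $i-- if $i > 0;   # pooling may expose a new violation
            }
            else {
                $i++;
            }
        }
        return map { [ $_->[2], $_->[0] / $_->[1] ] } @groups;
    }

The tuples it returns have the same [bound, probability] shape as the
structure B<calibrate> produces.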
For further information on the PAV algorithm, you can read the section in my
paper referenced below.
=head1 EXPORT
This module exports three functions: calibrate, score_prob and print_mapping.
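Nothing is exported by default (C<@EXPORT> is empty), so request the
functions by name or via the C<:all> tag:

    use AI::Calibrate ':all';
    # or equivalently:
    use AI::Calibrate qw(calibrate score_prob print_mapping);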
=head1 BUGS
None known. This implementation is straightforward but inefficient (its
running time is O(n^2) in the length of the data series). A linear-time
algorithm is known, and in a later version of this module I'll probably
implement it.
=head1 SEE ALSO
The AI::NaiveBayes1 Perl module.
t/AI-Calibrate-1.t
[.5, 3/4 ],
[.45, 2/3 ],
[.35, 2/3 ],
[.3, 1/2 ],
[.2, 1/3 ],
[.02, 0 ],
[.00001, 0]
);
print "Using this mapping:\n";
print_mapping($calibrated_got);
print "\n";
for my $pair (@test_estimates) {
my($score, $prob_expected) = @$pair;
my $prob_got = score_prob($calibrated_got, $score);
is($prob_got, $prob_expected, "score_prob test @$pair");
}
t/AI-Calibrate-KL.t
];
my $calibrated_got = calibrate( $points, 1 );
pass("ran_ok");
is_deeply($calibrated_got, $calibrated_expected, "calibration");
my $expected_mapping = "
1.000 > SCORE >= 0.998 prob = 1.000
0.998 > SCORE >= 0.505 prob = 0.900
0.505 > SCORE >= 0.475 prob = 0.667
0.475 > SCORE >= 0.425 prob = 0.500
0.425 > SCORE >= 0.359 prob = 0.385
0.359 > SCORE >= 0.000 prob = 0.000
";
my $output = '';
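# Capture print_mapping's STDOUT in $output by selecting an in-memory
# filehandle opened onto the string.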
open TOOUTPUT, '>', \$output or die "Can't open TOOUTPUT: $!";
my $stdout = select(TOOUTPUT);
print_mapping($calibrated_got);
close(TOOUTPUT);
select $stdout;
is(trim($output), trim($expected_mapping), "printed mapping");
t/AI-Calibrate-NB.t
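# For each instance, record the classifier's score for "play=yes"
# together with the true class (1 if play is "yes", else 0).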
my $ph = $nb->predict(attributes=>$attrs);
my $play_score = $ph->{"play=yes"};
push(@points, [$play_score, ($play eq "yes" ? 1 : 0)]);
}
my $calibrated = calibrate(\@points, 0); # not sorted
print "Mapping:\n";
print_mapping($calibrated);
my(@expected) =
(
[0.779495793582905, 1],
[0.535425255450615, 0.666666666666667]
);
for my $i (0 .. $#expected) {
print "$i = @{$expected[$i]}\n";
}