AI-DecisionTree
- The Instance class now has a numeric-only interface, without string
translation. String translation is done in the main DecisionTree
class. This isn't really a user-visible change.
0.04 Wed Sep 4 19:52:23 AEST 2002
- Now uses regular XS instead of Inline for the C code parts. [patch
by Matt Sergeant]
- Converted the inner loop of the best_attr() method to C code,
because it was spending a lot of time in accessor methods for the C
structures it was using. Don't worry, I'm not going C-crazy. I
won't be making many (any?) more of these kinds of changes, but
these ones were probably necessary.
- Removed a bit of debugging code that I left in for 0.03.
0.03 Mon Sep 2 11:41:18 AEST 2002
- Added a 'prune' parameter to new(), which controls whether the tree
will be pruned after training. This is usually a good idea, so the
default is to prune.
- Internally, each node in the tree now contains information about
how many training examples contributed to training this node, and
what the distribution of their classes was.
- Added an as_graphviz() method, which will help visualize trees.
They're not terribly pretty graphviz objects yet, but they're
visual.
0.02 Sat Aug 10, 2002 21:02 AEST
- Added support for noisy data, currently by picking the best (most
common) result when noise is encountered. See the 'noise_mode'
parameter to new().
- Added the rule_tree() method, which returns a data structure
representing the tree. [James Smith]
- Significantly sped up the train() method, especially for large data
sets.
- The get_result() method is no longer implemented recursively.
TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
END OF TERMS AND CONDITIONS
Appendix: How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to humanity, the best way to achieve this is to make it
free software which everyone can redistribute and change under these
terms.
To do so, attach the following notices to the program. It is safest to
attach them to the start of each source file to most effectively convey
the exclusion of warranty; and each file should have at least the
"copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) 19yy <name of author>
lib/AI/DecisionTree.pm view on Meta::CPAN
foreach (keys %results) {
$self->{prior_freqs}{$_} += $results{$_};
}
if (keys(%results) == 1) {
# All these instances have the same result - make this node a leaf
$node{result} = $self->_result($instances->[0]);
return \%node;
}
# Multiple values are present - find the best predictor attribute and split on it
my $best_attr = $self->best_attr($instances);
croak "Inconsistent data, can't build tree with noise_mode='fatal'"
if $self->{noise_mode} eq 'fatal' and !defined $best_attr;
if ( !defined($best_attr)
or $self->{max_depth} && $self->{curr_depth} > $self->{max_depth} ) {
# Pick the most frequent result for this leaf
$node{result} = (sort {$results{$b} <=> $results{$a}} keys %results)[0];
return \%node;
}
$node{split_on} = $best_attr;
my %split;
foreach my $i (@$instances) {
my $v = $self->_value($i, $best_attr);
push @{$split{ defined($v) ? $v : '<undef>' }}, $i;
}
die ("Something's wrong: attribute '$best_attr' didn't split ",
scalar @$instances, " instances into multiple buckets (@{[ keys %split ]})")
unless keys %split > 1;
foreach my $value (keys %split) {
$node{children}{$value} = $self->_expand_node( instances => $split{$value} );
}
return \%node;
}
sub best_attr {
my ($self, $instances) = @_;
# 0 is a perfect score, entropy(#instances) is the worst possible score
my ($best_score, $best_attr) = (@$instances * $self->entropy( map $_->result_int, @$instances ), undef);
my $all_attr = $self->{attributes};
foreach my $attr (keys %$all_attr) {
# %tallies counts instances by (attr value, result) pair
# %totals is the number of instances with each attr value
my (%totals, %tallies);
my $num_undef = AI::DecisionTree::Instance::->tally($instances, \%tallies, \%totals, $all_attr->{$attr});
next unless keys %totals; # Make sure at least one instance defines this attribute
my $score = 0;
while (my ($opt, $vals) = each %tallies) {
$score += $totals{$opt} * $self->entropy2( $vals, $totals{$opt} );
}
($best_attr, $best_score) = ($attr, $score) if $score < $best_score;
}
return $best_attr;
}
sub entropy2 {
shift;
my ($counts, $total) = @_;
# Entropy is defined with log base 2 - we just divide by log(2) at the end to adjust.
# Uses the identity -sum((n/N)*log(n/N)) = log(N) - sum(n*log(n))/N, where N = $total.
my $sum = 0;
$sum += $_ * log($_) foreach values %$counts;
return +(log($total) - $sum/$total)/log(2);
}
lib/AI/DecisionTree.pm view on Meta::CPAN
One of the biggest reasons for using a decision tree instead of many other
machine learning techniques is that a decision tree is a much more
scrutable decision maker than, say, a neural network.
The current implementation of this module uses an extremely simple
method for creating the decision tree based on the training instances.
It uses an Information Gain metric (based on expected reduction in
entropy) to select the "most informative" attribute at each node in
the tree. This is essentially the ID3 algorithm, developed by
J. R. Quinlan in 1986. The idea is that the attribute with the
highest Information Gain will (probably) be the best attribute to
split the tree on at each point if we're interested in making small
trees.
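
To make the metric concrete, here is a rough plain-Perl sketch of entropy
and Information Gain over hash-based instances. It is illustrative only:
the module's real implementation works on packed C structures through
AI::DecisionTree::Instance (see C<best_attr()> and C<entropy2()> above),
and the C<entropy()> and C<information_gain()> helpers below are
hypothetical names used just for this example.

  use strict;
  use warnings;
  use List::Util qw(sum);

  # Entropy (in bits) of a list of class labels.
  sub entropy {
      my @labels = @_;
      my %counts;
      $counts{$_}++ for @labels;
      my $h = 0;
      for my $n (values %counts) {
          my $p = $n / @labels;
          $h -= $p * log($p) / log(2);
      }
      return $h;
  }

  # Information Gain of splitting a set of instances on one attribute:
  # the parent set's entropy minus the size-weighted entropy of each subset.
  sub information_gain {
      my ($attr, @instances) = @_;   # instance: { attributes => {...}, result => '...' }
      my %subset;
      push @{ $subset{ $_->{attributes}{$attr} } }, $_->{result} for @instances;
      my $remainder = sum map { (@$_ / @instances) * entropy(@$_) } values %subset;
      return entropy(map { $_->{result} } @instances) - $remainder;
  }

  # e.g. information_gain('outlook', @instances) for instances like
  #   { attributes => { outlook => 'sunny', ... }, result => 'play' }

Note that C<best_attr()> simply minimizes the weighted subset entropy
rather than computing the gain explicitly; since the parent entropy is the
same for every attribute, the two orderings are equivalent.
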
=head1 METHODS
=head2 Building and Querying the Tree
=over 4
=item new(...parameters...)
lib/AI/DecisionTree.pm view on Meta::CPAN
=item noise_mode
Controls the behavior of the
C<train()> method when "noisy" data is encountered. Here "noisy"
means that two or more training instances contradict each other, such
that they have identical attributes but different results.
If C<noise_mode> is set to C<fatal> (the default), the C<train()>
method will throw an exception (die). If C<noise_mode> is set to
C<pick_best>, the most frequent result at each noisy node will be
selected.
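
For example, the noisy-data test in this distribution (F<t/02-noisy.t>)
constructs its tree with:

  my $dtree = AI::DecisionTree->new(noise_mode => 'pick_best');
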
=item prune
A boolean C<prune> parameter which specifies
whether the tree should be pruned after training. This is usually a
good idea, so the default is to prune. Currently we prune using a
simple minimum-description-length criterion.
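
Assuming C<prune> follows the usual Perl boolean convention, pruning could
presumably be disabled with something like:

  my $dtree = AI::DecisionTree->new(prune => 0);
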
=item verbose
t/02-noisy.t view on Meta::CPAN
#########################
use Test;
BEGIN { plan tests => 5 };
use AI::DecisionTree;
ok(1); # If we made it this far, we're ok.
#########################
my $dtree = AI::DecisionTree->new(noise_mode => 'pick_best');
ok $dtree;
my @names = split /, /, <DATA>;
chomp $names[-1];
# Train on first 600 instances
printf "Loading 600 training instances with %d attribute types each\n", scalar @names;
while (<DATA>) {
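# '2..601' below is a flip-flop on the input line number $., so only
# lines 2..601 of the DATA section (the 600 training rows) are processed.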
last unless 2..601;