AI-DecisionTree

Changes  view on Meta::CPAN


 - The Instance class now has a numeric-only interface, without string
   translation.  String translation is done in the main DecisionTree
   class.  This isn't really a user-visible change.

0.04  Wed Sep  4 19:52:23 AEST 2002

 - Now uses regular XS instead of Inline for the C code parts. [patch
   by Matt Sergeant]

 - Converted the inner loop of the best_attr() method to C code,
   because it was spending a lot of time in accessor methods for the C
   structures it was using.  Don't worry, I'm not going C-crazy.  I
   won't be making many (any?) more of these kinds of changes, but
   these ones were probably necessary.

 - Removed a bit of debugging code that I left in for 0.03.

0.03  Mon Sep  2 11:41:18 AEST 2002

 - Added a 'prune' parameter to new(), which controls whether the tree
   is pruned after training.

 - Internally, each node in the tree now contains information about
   how many training examples contributed to training this node, and
   what the distribution of their classes was.

 - Added an as_graphviz() method, which will help visualize trees.
   They're not terribly pretty graphviz objects yet, but they're
   visual.

0.02  Sat Aug 10, 2002 21:02 AEST

 - Added support for noisy data, currently by picking the best (most
   common) result when noise is encountered.  See the 'noise_mode'
   parameter to new().

 - Added the rule_tree() method, which returns a data structure
   representing the tree. [James Smith]

 - Significantly sped up the train() method, especially for large data
   sets.

 - The get_result() method is no longer implemented recursively, which

LICENSE  view on Meta::CPAN

TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.

                     END OF TERMS AND CONDITIONS

        Appendix: How to Apply These Terms to Your New Programs

  If you develop a new program, and you want it to be of the greatest
possible use to humanity, the best way to achieve this is to make it
free software which everyone can redistribute and change under these
terms.

  To do so, attach the following notices to the program.  It is safest to
attach them to the start of each source file to most effectively convey
the exclusion of warranty; and each file should have at least the
"copyright" line and a pointer to where the full notice is found.

    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) 19yy  <name of author>

lib/AI/DecisionTree.pm  view on Meta::CPAN

  foreach (keys %results) {
    $self->{prior_freqs}{$_} += $results{$_};
  }

  if (keys(%results) == 1) {
    # All these instances have the same result - make this node a leaf
    $node{result} = $self->_result($instances->[0]);
    return \%node;
  }
  
  # Multiple values are present - find the best predictor attribute and split on it
  my $best_attr = $self->best_attr($instances);

  croak "Inconsistent data, can't build tree with noise_mode='fatal'"
    if $self->{noise_mode} eq 'fatal' and !defined $best_attr;

  if ( !defined($best_attr)
       or $self->{max_depth} && $self->{curr_depth} > $self->{max_depth} ) {
    # Pick the most frequent result for this leaf
    $node{result} = (sort {$results{$b} <=> $results{$a}} keys %results)[0];
    return \%node;
  }
  
  $node{split_on} = $best_attr;
  
  my %split;
  foreach my $i (@$instances) {
    my $v = $self->_value($i, $best_attr);
    push @{$split{ defined($v) ? $v : '<undef>' }}, $i;
  }
  die ("Something's wrong: attribute '$best_attr' didn't split ",
       scalar @$instances, " instances into multiple buckets (@{[ keys %split ]})")
    unless keys %split > 1;

  foreach my $value (keys %split) {
    $node{children}{$value} = $self->_expand_node( instances => $split{$value} );
  }
  
  return \%node;
}
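
# For illustration only (not part of the module source): judging from the
# code above, _expand_node() returns a nested hash per node.  The attribute
# and value names here are invented:
#
#   {
#     split_on => 'outlook',                  # best predictor at this node
#     children => {
#       sunny     => { result => 'play' },    # leaf: all instances agreed
#       rainy     => { split_on => 'wind', children => { '...' => '...' } },
#       '<undef>' => { result => 'no_play' }, # instances missing the attribute
#     },
#   }
#
# Leaf nodes carry a 'result' key; internal nodes carry 'split_on' and
# 'children' (plus bookkeeping such as the class distribution mentioned in
# the 0.03 Changes entry).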

sub best_attr {
  my ($self, $instances) = @_;

  # 0 is a perfect score, entropy(#instances) is the worst possible score
  
  my ($best_score, $best_attr) = (@$instances * $self->entropy( map $_->result_int, @$instances ), undef);
  my $all_attr = $self->{attributes};
  foreach my $attr (keys %$all_attr) {

    # %tallies maps each attr value to a hash of result counts
    # %totals holds the number of instances having each attr value
    my (%totals, %tallies);
    my $num_undef = AI::DecisionTree::Instance::->tally($instances, \%tallies, \%totals, $all_attr->{$attr});
    next unless keys %totals; # Make sure at least one instance defines this attribute
    
    my $score = 0;
    while (my ($opt, $vals) = each %tallies) {
      $score += $totals{$opt} * $self->entropy2( $vals, $totals{$opt} );
    }

    ($best_attr, $best_score) = ($attr, $score) if $score < $best_score;
  }
  
  return $best_attr;
}

sub entropy2 {
  shift;
  my ($counts, $total) = @_;

  # Entropy is defined with log base 2 - we just divide by log(2) at the end to adjust.
  my $sum = 0;
  $sum += $_ * log($_) foreach values %$counts;
  return +(log($total) - $sum/$total)/log(2);
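
# Illustrative check (not part of the module): entropy2() above is an
# algebraic rearrangement of the textbook definition -sum(p_i * log2 p_i)
# with p_i = count_i / total.  For example, assuming counts {a => 3, b => 1}:
#
#   my %counts = (a => 3, b => 1);
#   my $total  = 4;
#   my $direct = 0;
#   $direct -= ($_/$total) * log($_/$total) / log(2) for values %counts;
#   # $direct is ~0.811 bits, the same value entropy2() gives via
#   # (log(4) - (3*log(3) + 1*log(1)) / 4) / log(2)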

lib/AI/DecisionTree.pm  view on Meta::CPAN

One of the biggest reasons for using a decision tree instead of many other
machine learning techniques is that a decision tree is a much more
scrutable decision maker than, say, a neural network.

The current implementation of this module uses an extremely simple
method for creating the decision tree based on the training instances.
It uses an Information Gain metric (based on expected reduction in
entropy) to select the "most informative" attribute at each node in
the tree.  This is essentially the ID3 algorithm, developed by
J. R. Quinlan in 1986.  The idea is that the attribute with the
highest Information Gain will (probably) be the best attribute to
split the tree on at each point if we're interested in making small
trees.
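
As a rough sketch (not code from this module; the data and the
C<entropy()> helper below are invented for illustration), the
Information Gain of splitting a set of instances on an attribute is the
entropy of the whole set minus the weighted average entropy of the
subsets that the split produces:

  # Entropy, in bits, of a list of class counts
  sub entropy {
    my @counts = grep { $_ > 0 } @_;
    my $total = 0;
    $total += $_ for @counts;
    my $h = 0;
    $h -= ($_/$total) * log($_/$total) / log(2) for @counts;
    return $h;
  }

  # 5 instances: 2 'play', 3 'no_play'.  A hypothetical 'outlook' attribute
  # splits them into sunny (2 play, 1 no_play) and rainy (0 play, 2 no_play).
  my $before = entropy(2, 3);                               # ~0.971 bits
  my $after  = (3/5) * entropy(2, 1) + (2/5) * entropy(2);  # ~0.551 bits
  my $gain   = $before - $after;                            # ~0.420 bits gained

Since the whole-set entropy is the same for every candidate attribute,
ranking attributes by Information Gain is equivalent to picking the one
with the smallest weighted sum of subset entropies, which is what the
C<$score> in C<best_attr()> measures.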

=head1 METHODS

=head2 Building and Querying the Tree

=over 4

=item new(...parameters...)

lib/AI/DecisionTree.pm  view on Meta::CPAN


=item noise_mode

Controls the behavior of the
C<train()> method when "noisy" data is encountered.  Here "noisy"
means that two or more training instances contradict each other, such
that they have identical attributes but different results.

If C<noise_mode> is set to C<fatal> (the default), the C<train()>
method will throw an exception (die).  If C<noise_mode> is set to
C<pick_best>, the most frequent result at each noisy node will be
selected.
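
For example (a sketch; the attribute and result names here are
invented), two instances that share identical attributes but disagree
on the result would normally abort training, while C<pick_best> lets
the conflicting leaf keep the majority answer:

  my $dtree = AI::DecisionTree->new(noise_mode => 'pick_best');
  $dtree->add_instance(attributes => { outlook => 'sunny' }, result => 'play');
  $dtree->add_instance(attributes => { outlook => 'sunny' }, result => 'play');
  $dtree->add_instance(attributes => { outlook => 'sunny' }, result => 'no_play');
  $dtree->train;   # with noise_mode => 'fatal' this would die; with
                   # 'pick_best' the leaf's result becomes 'play'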

=item prune

A boolean parameter that specifies whether the tree should be pruned
after training.  This is usually a good idea, so the default is to
prune.  Currently we prune using a simple minimum-description-length
criterion.
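
For example, pruning can be switched off by passing a false value:

  my $dtree = AI::DecisionTree->new(prune => 0);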

=item verbose

t/02-noisy.t  view on Meta::CPAN


#########################

use Test;
BEGIN { plan tests => 5 };
use AI::DecisionTree;
ok(1); # If we made it this far, we're ok.

#########################

my $dtree = AI::DecisionTree->new(noise_mode => 'pick_best');
ok $dtree;

my @names = split /, /, <DATA>;
chomp $names[-1];

# Train on first 600 instances
printf "Loading 600 training instances with %d attribute types each\n", scalar @names;
while (<DATA>) {
  last unless 2..601;   # flip-flop on $.: process data lines 2..601 only (600 instances)
  


