Changes
do_purge() method.
- Added the set_results() and copy_instances() methods, which let you
re-use training instances from one tree to another.
- Added the instances() and purge() accessor methods.
0.05 Thu Sep 12 01:22:34 AEST 2002
- Fixed a concurrency problem that occurred when making more than one
decision tree. All tree data is now stored as member data, not
class data.
- DecisionTree.pm is now pure-perl again (though Instance.pm still
has an XS component).
- Fixed an off-by-one bug in the Instance.xs code that could create
garbage data.
- Handles "sparse" data better. Sparse data means that every
attribute doesn't have to be defined for every training/test
instance. This can now be a meaningful property - the absence of a
value is currently equivalent to a special "<undef>" value.
- Don't trigger warnings when undefined attribute values are
encountered (as happens with sparse data).
- Added documentation for the 'prune' parameter to new()
- More consistent memory allocation in Instance.xs - uses the Perl
memory macros/functions from `perldoc perlclib` instead of raw
malloc/realloc/free.
- Catches possible infinite-loop situations when growing the tree
(which shouldn't usually happen, but other mistakes can cause it).
minimum-description-length criterion.
- Training instances are now represented using a C struct rather than
a Perl hash. This can dramatically reduce memory usage, though it
doesn't have much effect on speed. Note that Inline.pm is now
required.
- The list of instances is now deleted after training, since it's no
longer needed.
- Small speedup to the train() method, achieved by less copying of data.
- If get_result() is called in a list context, it now returns a list
containing the assigned result, a "confidence" score (tentative,
subject to change), and the tree depth of the leaf this instance
ended up at.
- Internally, each node in the tree now contains information about
how many training examples contributed to training this node, and
what the distribution of their classes was.
- Added an as_graphviz() method, which will help visualize trees.
They're not terribly pretty graphviz objects yet, but they're
visual.
0.02 Sat Aug 10, 2002 21:02 AEST
- Added support for noisy data, currently by picking the best (most
common) result when noise is encountered. See the 'noise_mode'
parameter to new().
- Added the rule_tree() method, which returns a data structure
representing the tree. [James Smith]
- Significantly sped up the train() method, especially for large data
sets.
- The get_result() method is no longer implemented recursively, which
simplifies it and speeds it up.
- Reformatted the documentation and added a TO DO section.
- Documented the nodes() method.
0.01 Sat Jun 8 12:45:03 2002
Instance/typemap
TYPEMAP
Instance * WRAPPED_C_OBJECT
INPUT
WRAPPED_C_OBJECT
$var = (($type)SvIV(SvRV($arg)))
OUTPUT
WRAPPED_C_OBJECT
/* 'class' must be an argument to any function that returns this datatype */
sv_setref_pv($arg, class, (void*)$var);
eg/example.pl
                                 wind => 'strong',
                               } );
print "Result 2: $result\n";  # yes

# Show the created tree structure as rules
print map "$_\n", $dtree->rule_statements;

# Will barf on inconsistent data
my $t2 = new AI::DecisionTree;
$t2->add_instance( attributes => { foo => 'bar' },
                   result => 1 );
$t2->add_instance( attributes => { foo => 'bar' },
                   result => 0 );
eval {$t2->train};
print "$@\n";
lib/AI/DecisionTree.pm
  foreach my $attr (keys %{$self->{attribute_values}}) {
    my $h = $self->{attribute_values}{$attr};
    $self->{attribute_values_reverse}{$attr} = [ undef, sort {$h->{$a} <=> $h->{$b}} keys %$h ];
  }
}

sub train {
  my ($self, %args) = @_;
  if (not @{ $self->{instances} }) {
    croak "Training data has been purged, can't re-train" if $self->{tree};
    croak "Must add training instances before calling train()";
  }
  $self->_create_lookup_hashes;

  local $self->{curr_depth} = 0;
  local $self->{max_depth} = $args{max_depth} if exists $args{max_depth};
  $self->{depth} = 0;

  $self->{tree} = $self->_expand_node( instances => $self->{instances} );
  $self->{total_instances} = @{$self->{instances}};
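In this excerpt, train() croaks unless training instances have been added (and not yet purged), and accepts an optional max_depth argument. A minimal usage sketch based on those visible parameters (the attribute values are placeholder data):

  use AI::DecisionTree;

  my $dtree = AI::DecisionTree->new;
  $dtree->add_instance( attributes => { outlook => 'sunny', humidity => 'high' },
                        result => 'no' );
  $dtree->add_instance( attributes => { outlook => 'overcast', humidity => 'normal' },
                        result => 'yes' );

  # Limit the tree to a single level of decisions
  $dtree->train(max_depth => 1);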
lib/AI/DecisionTree.pm
  if (keys(%results) == 1) {
    # All these instances have the same result - make this node a leaf
    $node{result} = $self->_result($instances->[0]);
    return \%node;
  }

  # Multiple values are present - find the best predictor attribute and split on it
  my $best_attr = $self->best_attr($instances);
  croak "Inconsistent data, can't build tree with noise_mode='fatal'"
    if $self->{noise_mode} eq 'fatal' and !defined $best_attr;

  if ( !defined($best_attr)
       or $self->{max_depth} && $self->{curr_depth} > $self->{max_depth} ) {
    # Pick the most frequent result for this leaf
    $node{result} = (sort {$results{$b} <=> $results{$a}} keys %results)[0];
    return \%node;
  }

  $node{split_on} = $best_attr;
lib/AI/DecisionTree.pm
=head1 VERSION
version 0.11
=head1 SYNOPSIS
  use AI::DecisionTree;
  my $dtree = new AI::DecisionTree;

  # A set of training data for deciding whether to play tennis
  $dtree->add_instance
    (attributes => {outlook     => 'sunny',
                    temperature => 'hot',
                    humidity    => 'high'},
     result => 'no');

  $dtree->add_instance
    (attributes => {outlook     => 'overcast',
                    temperature => 'hot',
                    humidity    => 'normal'},
lib/AI/DecisionTree.pm
  # Find results for unseen instances
  my $result = $dtree->get_result
    (attributes => {outlook     => 'sunny',
                    temperature => 'hot',
                    humidity    => 'normal'});
=head1 DESCRIPTION
The C<AI::DecisionTree> module automatically creates so-called
"decision trees" to explain a set of training data. A decision tree
is a kind of categorizer that uses a flowchart-like process for
categorizing new instances. For instance, a learned decision tree
might look like the following, which classifies for the concept "play
tennis":
                     OUTLOOK
                     /  |  \
                    /   |   \
                   /    |    \
             sunny/  overcast \rainy
                 /      |      \
          HUMIDITY    *yes*     WIND
            /   \               /  \
       high/     \normal  strong/   \weak
           /      \            /     \
        *no*    *yes*       *no*    *yes*
(This example, and the inspiration for the C<AI::DecisionTree> module,
come directly from Tom Mitchell's excellent book "Machine Learning",
available from McGraw Hill.)
A decision tree like this one can be learned from training data, and
then applied to previously unseen data to obtain results that are
consistent with the training data.
The usual goal of a decision tree is to somehow encapsulate the
training data in the smallest possible tree. This is motivated by an
"Occam's Razor" philosophy, in which the simplest possible explanation
for a set of phenomena should be preferred over other explanations.
Also, small trees will make decisions faster than large trees, and
they are much easier for a human to look at and understand. One of
the biggest reasons for using a decision tree instead of many other
machine learning techniques is that a decision tree is a much more
scrutable decision maker than, say, a neural network.
The current implementation of this module uses an extremely simple
method for creating the decision tree based on the training instances.
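The excerpt shown here stops before that method is described. As a generic illustration only (not necessarily this module's exact criterion, and the instance layout in the sketch is an assumption), a split attribute could be chosen by information gain roughly like this:

  # Shannon entropy (in bits) of a list of class labels
  sub entropy {
    my %count;
    $count{$_}++ for @_;
    my $total = @_;
    my $h = 0;
    foreach my $n (values %count) {
      my $p = $n / $total;
      $h -= $p * log($p) / log(2);
    }
    return $h;
  }

  # Information gain of splitting @instances on $attr.  Each instance is
  # assumed to be a hashref like { attributes => {...}, result => '...' }.
  sub information_gain {
    my ($attr, @instances) = @_;
    my $before = entropy(map { $_->{result} } @instances);

    my %subset;
    push @{ $subset{ $_->{attributes}{$attr} } }, $_ for @instances;

    my $after = 0;
    foreach my $group (values %subset) {
      $after += (@$group / @instances) * entropy(map { $_->{result} } @$group);
    }
    return $before - $after;
  }

The attribute with the largest gain would become the node's split attribute.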
lib/AI/DecisionTree.pm
=item new(...parameters...)
Creates a new decision tree object and returns it. Accepts the
following parameters:
=over 4
=item noise_mode
Controls the behavior of the
C<train()> method when "noisy" data is encountered. Here "noisy"
means that two or more training instances contradict each other, such
that they have identical attributes but different results.
If C<noise_mode> is set to C<fatal> (the default), the C<train()>
method will throw an exception (die). If C<noise_mode> is set to
C<pick_best>, the most frequent result at each noisy node will be
selected.
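For example, to build a tree that tolerates contradictory training instances instead of dying:

  use AI::DecisionTree;

  # Pick the most frequent result at noisy nodes rather than croaking
  my $dtree = AI::DecisionTree->new(noise_mode => 'pick_best');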
=item prune
lib/AI/DecisionTree.pm
C<add_instance()>) for the set of attribute values given. An
C<attributes> parameter specifies a hash of attribute-value pairs for
the instance. If the decision tree doesn't have enough information to
find a result, it will return C<undef>.
=item do_purge()
Purges training instances and their associated information from the
DecisionTree object. This can save memory after training, and since
the training instances are implemented as C structs, this turns the
DecisionTree object into a pure-perl data structure that can be more
easily saved with C<Storable.pm>, for instance.
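For example, one way to combine this with C<Storable> (the filename here is just a placeholder):

  use Storable qw(nstore);

  $dtree->train;
  $dtree->do_purge;               # drop the C-struct training instances
  nstore($dtree, 'tennis.tree');  # now a pure-perl structure, easy to serialize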
=item purge()
Returns true or false depending on the value of the tree's C<purge>
property. An optional boolean argument sets the property.
=item copy_instances(from =E<gt> $other_tree)
Allows two trees to share the same set of training instances. More
lib/AI/DecisionTree.pm
=item depth()
Returns the depth of the tree. This is the maximum number of
decisions that would need to be made to classify an unseen instance,
i.e. the length of the longest path from the tree's root to a leaf. A
tree with a single node would have a depth of zero.
=item rule_tree()
Returns a data structure representing the decision tree. For
instance, for the tree diagram above, the following data structure
is returned:
  [ 'outlook', {
      'rain'  => [ 'wind', {
          'strong' => 'no',
          'weak'   => 'yes',
      } ],
      'sunny' => [ 'humidity', {
          'normal' => 'yes',
          'high'   => 'no',
lib/AI/DecisionTree.pm
removed in future versions - especially with your help. =)
=over 4
=item No continuous attributes
In the current implementation, only discrete-valued attributes are
supported. This means that an attribute like "temperature" can have
values like "cool", "medium", and "hot", but using actual temperatures
like 87 or 62.3 is not going to work. This is because the values
would split the data too finely - the tree-building process would
probably think that it could make all its decisions based on the exact
temperature value alone, ignoring all other attributes, because each
temperature would have only been seen once in the training data.
The usual way to deal with this problem is for the tree-building
process to figure out how to place the continuous attribute values
into a set of bins (like "cool", "medium", and "hot") and then build
the tree based on these bin values. Future versions of
C<AI::DecisionTree> may provide support for this. For now, you have
to do it yourself; a rough sketch follows after this list.
=back
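As a rough sketch of the manual binning mentioned above (the cutoff values and the helper name are illustrative assumptions, not part of the module):

  # Map a raw temperature into a coarse, discrete bin before training
  sub temperature_bin {
    my $degrees = shift;
    return 'cool'   if $degrees < 65;
    return 'medium' if $degrees < 80;
    return 'hot';
  }

  $dtree->add_instance(attributes => { outlook     => 'sunny',
                                       temperature => temperature_bin(87) },
                       result => 'no');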
t/01-simple.t
{
  # Test max_depth
  $dtree->train(max_depth => 1);
  my @rules = $dtree->rule_statements;
  ok @rules, 3;
  ok $dtree->depth, 1;
}

{
  # Should barf on inconsistent data
  my $t2 = new AI::DecisionTree;
  $t2->add_instance( attributes => { foo => 'bar' },
                     result => 1 );
  $t2->add_instance( attributes => { foo => 'bar' },
                     result => 0 );
  eval {$t2->train};
  ok( "$@", '/Inconsistent data/' );
}

{
  # Make sure two trees can be trained concurrently
  my $t1 = new AI::DecisionTree;
  my $t2 = new AI::DecisionTree;
  my @train = (
               [farming => 'sheep very valuable farming'],
               [farming => 'farming requires many kinds animals'],
t/02-noisy.t
  my %pairs = map {$names[$_], $values[$_]} 0..$#names;
  $dtree->add_instance(attributes => \%pairs,
                       result => $result,
                      );
}

print "Building decision tree\n";
$dtree->train;
ok(1);

# Test on rest of data, get at least 80%
print "Testing on remainder of data\n";
my ($good, $bad) = (0,0);
while (<DATA>) {
  chomp;
  my @values = split /, /, $_;
  my $result = pop @values;
  my %pairs = map {$names[$_], $values[$_]} 0..$#names;
  my ($guess, $confidence) = $dtree->get_result(attributes => \%pairs);
  $guess ||= ''; $confidence ||= '';
  ($guess eq $result ? $good : $bad)++;
t/02-noisy.t
    my $file = '/tmp/tree2.png';
    open my($fh), "> $file" or die "$file: $!";
    print $fh $graphviz->as_png;
    close $fh;
    system('open', $file);
  }
} else {
  skip("Skipping: GraphViz is not installed", 0);
}
# The following data comes from the "C4.5" software package, in the
# "soybean.data" data file. It is somewhat noisy. I chose it because
# it was a pretty big data set, and because there are published
# results on it that I can compare to. Since the data seemed to be in
# order from most-information to least-information or something in the
# C4.5 distribution, I randomized the order of the instances. Note
# also that I'm treating the '?' value as a normal string value.
# It looks like the original data source is
# (a) Michalski, R.S. "Learning by being told and learning from
# examples: an experimental comparison of the two methods of knowledge
# acquisition in the context of developing an expert system for soybean
# disease diagnosis", International Journal of Policy Analysis and
# Information Systems, 1980, 4(2), 125-161.
# The "C4.5" package is written by J.R. Quinlan and may be downloaded
# and used for free, but it is not supported and may not be
# redistributed.