AI-DecisionTree


Changes

   hash reference, to check the attributes of the given instance.
   This allows lazy instance checking.

0.08  Mon Jul  7 18:01:16 CDT 2003

 - Added a 'leaf_color' parameter for making GraphViz objects more
   colorful.

0.07  Fri Jun  6 10:37:51 CDT 2003

 - Created tests for the set_results() and copy_instances() methods.

 - Added documentation for as_graphviz() and increased the information
   contained in the GraphViz object.

 - Added the ability to limit the absolute depth of the tree when
   training.

0.06  Wed Sep 18 13:59:24 EST 2002

 - Fixed an XS memory leak that was afflicting all training instances.
   Added tests to make sure leak stays plugged.

 - Added the 'purge' and 'verbose' parameters to new().

 - add_instance() now accepts a 'name' parameter.

 - Users can now control whether training instances are purged after
   training, using the 'purge' parameter to new() and/or the
   do_purge() method.

 - Added the set_results() and copy_instances() methods, which let you
   re-use training instances from one tree to another.

 - Added the instances() and purge() accessor methods.

0.05  Thu Sep 12 01:22:34 AEST 2002

 - Fixed a concurrency problem that occurred when making more than one
   decision tree.  All tree data is now stored as member data, not
   class data.

Changes

0.02  Sat Aug 10, 2002 21:02 AEST

 - Added support for noisy data, currently by picking the best (most
   common) result when noise is encountered.  See the 'noise_mode'
   parameter to new().

 - Added the rule_tree() method, which returns a data structure
   representing the tree. [James Smith]

 - Significantly sped up the train() method, especially for large data
   sets.

 - The get_result() method is no longer implemented recursively, which
   simplifies it and speeds it up.

 - Reformatted the documentation and added a TO DO section.

 - Documented the nodes() method.

0.01  Sat Jun  8 12:45:03 2002

INSTALL


    % cpanm AI::DecisionTree

If you are installing into a system-wide directory, you may need to pass the
"-S" flag to cpanm, which uses sudo to install the module:

    % cpanm -S AI::DecisionTree

## Installing with the CPAN shell

Alternatively, if your CPAN shell is set up, you should just be able to do:

    % cpan AI::DecisionTree

## Manual installation

As a last resort, you can install it manually. Download the tarball, untar it,
then build and install it:

    % perl Makefile.PL
    % make && make test
    % make install

Instance/Instance.pm

  
  my $i = new AI::DecisionTree::Instance([3,5], 7, 'this_instance');
  $i->value_int(0) == 3;
  $i->value_int(1) == 5;
  $i->result_int == 7;

=head1 DESCRIPTION

This class is just a simple Perl wrapper around a C struct embodying a
single training instance.  Its purpose is to reduce memory usage.  In
a "typical" training set with about 1000 instances, memory usage can
be reduced by about a factor of 5 (from 43.7M to 8.2M in my test
program).

A fairly tight loop is also implemented that speeds up the C<train()>
method of AI::DecisionTree by a constant factor of roughly 4.

Please do not consider this interface stable - I change it whenever I
have a new need in AI::DecisionTree.

=head1 AUTHOR

Instance/Instance.xs

char *
name (instance)
    Instance*   instance
  CODE:
    {
      RETVAL = instance->name;
    }
  OUTPUT:
    RETVAL

void
set_result (instance, result)
    Instance*   instance
    int         result
  CODE:
    {
      instance->result = result;
    }

void
set_value (instance, attribute, value)
    Instance*   instance
    int         attribute
    int         value
  PPCODE:
    {
      int *new_values;
      int i;
    
      if (attribute >= instance->num_values) {
        if (!value) return; /* Nothing to do */

Instance/Instance.xs

      Instance *instance;
      
      for (i=0; i<=top; i++) {
	instance_r = av_fetch(instances, i, 0);
	instance = (Instance *) SvIV(SvRV(*instance_r));
        v = attr < instance->num_values ? instance->values[attr] : 0;
	/* if (!v) { num_undef++; continue; } */
	
	/* $totals{$v}++ */
	hash_entry = hv_fetch(totals, (char *)&v, sizeof(I32), 1);
	if (!SvIOK(*hash_entry)) sv_setiv(*hash_entry, 0);
	sv_setiv( *hash_entry, 1+SvIV(*hash_entry) );
	
	/* $tallies{$v}{$_->result_int}++ */
	hash_entry = hv_fetch(tallies, (char *)&v, sizeof(I32), 0);
	
	if (!hash_entry) {
	  hash_entry = hv_store(tallies, (char *)&v, sizeof(I32), newRV_noinc((SV*) newHV()), 0);
	}
	
	sub_hash_entry = hv_fetch((HV*) SvRV(*hash_entry), (char *)&(instance->result), sizeof(int), 1);
	if (!SvIOK(*sub_hash_entry)) sv_setiv(*sub_hash_entry, 0);
	sv_setiv( *sub_hash_entry, 1+SvIV(*sub_hash_entry) );
      }

      RETVAL = num_undef;

	/*  Old code:
      foreach (@$instances) {
	my $v = $_->value_int($all_attr->{$attr});
	next unless $v;
	$totals{ $v }++;
	$tallies{ $v }{ $_->result_int }++;

Instance/t/01-basic.t

BEGIN { plan tests => 7 }

use AI::DecisionTree::Instance;
ok(1);

my $i = AI::DecisionTree::Instance->new([1, 2], 0, "foo");
ok $i->value_int(0), 1;
ok $i->value_int(1), 2;
ok $i->result_int, 0;

$i->set_value(0, 3);
ok $i->value_int(0), 3;

$i = new AI::DecisionTree::Instance([4], 2, "bar");
ok $i->value_int(0), 4;
ok $i->result_int, 2;

Instance/typemap

Instance *		WRAPPED_C_OBJECT


INPUT
WRAPPED_C_OBJECT
	$var = (($type)SvIV(SvRV($arg)))

OUTPUT
WRAPPED_C_OBJECT
	/* 'class' must be an argument to any function that returns this datatype */
	sv_setref_pv($arg, class, (void*)$var);

lib/AI/DecisionTree.pm

  croak "Missing 'from' parameter to copy_instances()" unless exists $opt{from};
  my $other = $opt{from};
  croak "'from' parameter is not a decision tree" unless UNIVERSAL::isa($other, __PACKAGE__);

  foreach (qw(instances attributes attribute_values results)) {
    $self->{$_} = $other->{$_};
  }
  $self->_create_lookup_hashes;
}

sub set_results {
  my ($self, $hashref) = @_;
  foreach my $instance (@{$self->{instances}}) {
    my $name = $instance->name;
    croak "No result given for instance '$name'" unless exists $hashref->{$name};
    $instance->set_result( $self->{results}{ $hashref->{$name} } );
  }
}

sub instances { $_[0]->{instances} }

sub purge {
  my $self = shift;
  $self->{purge} = shift if @_;
  return $self->{purge};
}

lib/AI/DecisionTree.pm

  my ($self, $instance) = @_;
  my $int = $instance->result_int;
  return $self->{results_reverse}[$int];
}

sub _delete_value {
  my ($self, $instance, $attr) = @_;
  my $val = $self->_value($instance, $attr);
  return unless defined $val;
  
  $instance->set_value($self->{attributes}{$attr}, 0);
  return $val;
}

sub _value {
  my ($self, $instance, $attr) = @_;
  return unless exists $self->{attributes}{$attr};
  my $val_int = $instance->value_int($self->{attributes}{$attr});
  return $self->{attribute_values_reverse}{$attr}[$val_int];
}

lib/AI/DecisionTree.pm


=head1 VERSION

version 0.11

=head1 SYNOPSIS

  use AI::DecisionTree;
  my $dtree = new AI::DecisionTree;
  
  # A set of training data for deciding whether to play tennis
  $dtree->add_instance
    (attributes => {outlook     => 'sunny',
                    temperature => 'hot',
                    humidity    => 'high'},
     result => 'no');
  
  $dtree->add_instance
    (attributes => {outlook     => 'overcast',
                    temperature => 'hot',
                    humidity    => 'normal'},

lib/AI/DecisionTree.pm

  
  # Find results for unseen instances
  my $result = $dtree->get_result
    (attributes => {outlook     => 'sunny',
                    temperature => 'hot',
                    humidity    => 'normal'});

=head1 DESCRIPTION

The C<AI::DecisionTree> module automatically creates so-called
"decision trees" to explain a set of training data.  A decision tree
is a kind of categorizer that uses a flowchart-like process for
categorizing new instances.  For instance, a learned decision tree
might look like the following, which classifies for the concept "play
tennis":

                   OUTLOOK
                   /  |  \
                  /   |   \
                 /    |    \
           sunny/  overcast \rainy

lib/AI/DecisionTree.pm

come directly from Tom Mitchell's excellent book "Machine Learning",
available from McGraw Hill.)

A decision tree like this one can be learned from training data, and
then applied to previously unseen data to obtain results that are
consistent with the training data.

The usual goal of a decision tree is to somehow encapsulate the
training data in the smallest possible tree.  This is motivated by an
"Occam's Razor" philosophy, in which the simplest possible explanation
for a set of phenomena should be preferred over other explanations.
Also, small trees will make decisions faster than large trees, and
they are much easier for a human to look at and understand.  One of
the biggest reasons for using a decision tree instead of many other
machine learning techniques is that a decision tree is a much more
scrutable decision maker than, say, a neural network.

The current implementation of this module uses an extremely simple
method for creating the decision tree based on the training instances.
It uses an Information Gain metric (based on expected reduction in
entropy) to select the "most informative" attribute at each node in
the tree.

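To make the metric concrete, here is a minimal standalone sketch of
entropy-based Information Gain.  The C<entropy> and
C<information_gain> helpers and the plain-hashref instance layout are
assumptions for illustration only; the module itself does this
counting in XS (see the Instance.xs tally loop above).

  use strict;
  use warnings;
  use List::Util qw(sum);

  # Entropy (in bits) of a list of result labels: -sum( p * log2(p) )
  sub entropy {
    my %count;
    $count{$_}++ for @_;
    my $total = @_;
    return -sum(map { my $p = $_ / $total; $p * log($p) / log(2) }
                values %count);
  }

  # Expected reduction in entropy when splitting @$instances on $attr.
  # Each instance is assumed to look like:
  #   { attributes => { attr => value, ... }, result => $string }
  sub information_gain {
    my ($attr, $instances) = @_;
    my $base = entropy(map { $_->{result} } @$instances);
    my %split;
    push @{ $split{ $_->{attributes}{$attr} } }, $_->{result}
      for @$instances;
    my $remainder = sum(map { (@$_ / @$instances) * entropy(@$_) }
                        values %split);
    return $base - $remainder;
  }

The attribute with the largest gain is chosen for the node, and the
process recurses on each resulting partition of the instances.
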
lib/AI/DecisionTree.pm


=over 4

=item noise_mode

Controls the behavior of the
C<train()> method when "noisy" data is encountered.  Here "noisy"
means that two or more training instances contradict each other, such
that they have identical attributes but different results.

If C<noise_mode> is set to C<fatal> (the default), the C<train()>
method will throw an exception (die).  If C<noise_mode> is set to
C<pick_best>, the most frequent result at each noisy node will be
selected (see the example after this option list).

=item prune

A boolean parameter that specifies
whether the tree should be pruned after training.  This is usually a
good idea, so the default is to prune.  Currently we prune using a
simple minimum-description-length criterion.

=item verbose

If set to a true value, some status information will be output while
training a decision tree.  Default is false.

=item purge

If set to a true value, the C<do_purge()> method will be invoked
during C<train()>.  The default is true.

=item max_depth

Controls the maximum depth of the tree that will be created during
C<train()>.  The default is 0, which means that trees of unlimited
depth can be constructed.

=back
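
A constructor call exercising these options might look like the
following (the parameter values are illustrative only):

  my $dtree = AI::DecisionTree->new
    (noise_mode => 'pick_best',  # tolerate contradictory instances
     prune      => 1,            # prune after training (the default)
     verbose    => 1,            # print status output during train()
     purge      => 0,            # keep training instances after train()
     max_depth  => 3);           # cap the depth of the learned tree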

=item add_instance(attributes => \%hash, result => $string, name => $string)

Adds a training instance to the set of instances which will be used to
form the tree.  An C<attributes> parameter specifies a hash of
attribute-value pairs for the instance, and a C<result> parameter
specifies the result.

An optional C<name> parameter lets you give a unique name to each
training instance.  This can be used in coordination with the
C<set_results()> method below.

=item train()

Builds the decision tree from the list of training instances.  If a
numeric C<max_depth> parameter is supplied, the maximum tree depth can
be controlled (see also the C<new()> method).

=item get_result(attributes => \%hash)

Returns the most likely result (from the set of all results given to
C<add_instance()>) for the set of attribute values given.  An
C<attributes> parameter specifies a hash of attribute-value pairs for
the instance.  If the decision tree doesn't have enough information to
find a result, it will return C<undef>.
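
A minimal sketch of handling the C<undef> case (the attribute values
here are illustrative):

  my $result = $dtree->get_result
    (attributes => {outlook     => 'rainy',
                    temperature => 'cool',
                    humidity    => 'high'});
  if (defined $result) {
    print "Prediction: $result\n";
  } else {
    warn "Not enough information to classify this instance\n";
  }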

=item do_purge()

Purges training instances and their associated information from the
DecisionTree object.  This can save memory after training, and since
the training instances are implemented as C structs, this turns the
DecisionTree object into a pure-perl data structure that can be more
easily saved with C<Storable.pm>, for instance.
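
For example, a trained and purged tree might be frozen with
C<Storable> like this (the file name is illustrative):

  use Storable qw(nstore retrieve);

  $dtree->train;
  $dtree->do_purge;                  # drop the C-struct training instances
  nstore($dtree, 'dtree.storable');  # now a pure-Perl structure

  my $thawed = retrieve('dtree.storable');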

=item purge()

Returns true or false depending on the value of the tree's C<purge>
property.  An optional boolean argument sets the property.

=item copy_instances(from =E<gt> $other_tree)

Allows two trees to share the same set of training instances.  More
commonly, this lets you train one tree, then re-use its instances in
another tree (possibly changing the instance C<result> values using
C<set_results()>), which is much faster than re-populating the second
tree's instances from scratch.

=item set_results(\%results)

Given a hash that relates instance names to instance result values,
change the result values as specified.
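
Together with C<copy_instances()>, this enables the re-labeling
workflow described above.  A sketch, modeled on the pattern in
F<t/01-simple.t> (instance names and attributes are illustrative):

  my $t1 = AI::DecisionTree->new(purge => 0);  # keep instances after train()
  $t1->add_instance(attributes => {outlook => 'sunny'},
                    result => 'yes', name => 'day1');
  $t1->add_instance(attributes => {outlook => 'overcast'},
                    result => 'no',  name => 'day2');
  $t1->train;

  my $t2 = AI::DecisionTree->new(purge => 0);
  $t2->copy_instances(from => $t1);                   # share $t1's instances
  $t2->set_results({ day1 => 'no', day2 => 'yes' });  # re-label by name
  $t2->train;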

=back

=head2 Tree Introspection

=over 4

lib/AI/DecisionTree.pm

supported.  This means that an attribute like "temperature" can have
values like "cool", "medium", and "hot", but using actual temperatures
like 87 or 62.3 is not going to work.  This is because the values
would split the data too finely - the tree-building process would
probably think that it could make all its decisions based on the exact
temperature value alone, ignoring all other attributes, because each
temperature would have only been seen once in the training data.

The usual way to deal with this problem is for the tree-building
process to figure out how to place the continuous attribute values
into a set of bins (like "cool", "medium", and "hot") and then build
the tree based on these bin values.  Future versions of
C<AI::DecisionTree> may provide support for this.  For now, you have
to do it yourself.
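
A do-it-yourself discretization can be as simple as the following
sketch (the thresholds and bin names are arbitrary):

  # Map a raw temperature to a coarse bin before adding the instance.
  sub temperature_bin {
    my $degrees = shift;
    return $degrees < 65 ? 'cool'
         : $degrees < 80 ? 'medium'
         :                 'hot';
  }

  $dtree->add_instance
    (attributes => {outlook     => 'sunny',
                    temperature => temperature_bin(87)},  # becomes 'hot'
     result => 'no');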

=back

=head1 TO DO

All the stuff in the LIMITATIONS section.  Also, revisit the pruning
algorithm to see how it can be improved.

t/01-simple.t

  ok $t1->instances->[1]->name, 2;
  ok $t1->_result($t1->instances->[0]), 1;  # Not a public interface
  ok $t1->_result($t1->instances->[1]), 0;  # Not a public interface

  $t2->copy_instances(from => $t1);
  ok $t2->instances->[0]->name, 1;
  ok $t2->instances->[1]->name, 2;
  ok $t2->_result($t2->instances->[0]), 1;  # Not a public interface
  ok $t2->_result($t2->instances->[1]), 0;  # Not a public interface

  $t2->set_results( {1=>0, 2=>1} );
  ok $t2->_result($t2->instances->[0]), 0;  # Not a public interface
  ok $t2->_result($t2->instances->[1]), 1;  # Not a public interface
}

t/02-noisy.t

    print $fh $graphviz->as_png;
    close $fh;
    system('open', $file);
  }
} else {
  skip("Skipping: GraphViz is not installed", 0);
}

# The following data comes from the "C4.5" software package, in the
# "soybean.data" data file.  It is somewhat noisy.  I chose it because
# it was a pretty big data set, and because there are published
# results on it that I can compare to.  Since the data seemed to be in
# order from most-information to least-information or something in the
# C4.5 distribution, I randomized the order of the instances.  Note
# also that I'm treating the '?' value as a normal string value.

# It looks like the original data source is
#     (a) Michalski,R.S. "Learning by being told and learning from
#         examples: an experimental comparison of the two methods of knowledge
#         acquisition in the context of developing an expert system for soybean
#         disease diagnosis", International Journal of Policy Analysis and


