Algorithm-DecisionTree


lib/Algorithm/DecisionTree.pm  view on Meta::CPAN


  # The option `symbolic_to_numeric_cardinality_threshold' is also important.  For
  # the example shown above, if an ostensibly numeric feature takes on only 10 or
  # fewer different values in your training datafile, it will be treated like a
  # symbolic feature.  The option `entropy_threshold' determines the granularity
  # with which the entropies are sampled for the purpose of calculating the entropy
  # gain for a particular choice of decision threshold for a numeric feature or a
  # feature value for a symbolic feature.

  # The option 'csv_cleanup_needed' is by default set to 0.  If you set it to 1,
  # all line records in your CSV file will be "sanitized" before they are used for
  # constructing a decision tree.  You need this option if your CSV file uses
  # double-quoted field names and field values in the line records and if such
  # double-quoted strings are allowed to include commas for, presumably, better
  # readability.
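
  # As an illustration, here is a sketch of a constructor call that sets the
  # options discussed above.  The training file name and the column indexes shown
  # are hypothetical; change them to match your own CSV file:

      my $dt = Algorithm::DecisionTree->new(
                   training_datafile => "training.csv",        # hypothetical name
                   csv_class_column_index => 1,                # hypothetical index
                   csv_columns_for_features => [2,3,4,5,6,7],  # hypothetical indexes
                   entropy_threshold => 0.01,
                   max_depth_desired => 8,
                   symbolic_to_numeric_cardinality_threshold => 10,
                   csv_cleanup_needed => 1,
               );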

  # After you have constructed an instance of the DecisionTree class as shown above,
  # you read in the training data file and initialize the probability cache by
  # calling:

      $dt->get_training_data();
      $dt->calculate_first_order_probabilities();
      $dt->calculate_class_priors();

  # Next you construct a decision tree for your training data by calling:

      $root_node = $dt->construct_decision_tree_classifier();

  # where $root_node is an instance of the DTNode class that is also defined in the
  # module file.  Now you are ready to classify a new data record.  Let's say that
  # your data record looks like:

      my @test_sample  = qw /  g2=4.2
                               grade=2.3
                               gleason=4
                               eet=1.7
                               age=55.0
                               ploidy=diploid /;

  # You can classify it by calling:

      my $classification = $dt->classify($root_node, \@test_sample);

  # The call to `classify()' returns a reference to a hash whose keys are the class
  # names and whose values are the associated classification probabilities.  This
  # hash also includes another key-value pair for the solution path from the root
  # node to the leaf node at which the final classification was carried out.
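
  # As an illustration, here is a minimal sketch for printing out the contents of
  # this hash.  (The key name 'solution_path' used below for the path entry is an
  # assumption; inspect the keys of the returned hash to confirm it for your
  # version of the module.)

      foreach my $class_name (sort keys %$classification) {
          next if $class_name eq 'solution_path';   # skip the path entry (assumed key name)
          print "$class_name => $classification->{$class_name}\n";
      }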


=head1 CHANGES

B<Version 3.43:> This version fixes a bug in the C<csv_cleanup_needed()> function.
The source of the bug was a typo in a regex component meant for matching with white
space.  I have also made one additional change to this function to increase its
versatility.  With this change, you are now allowed to have empty strings as values
for features.

B<Version 3.42:> This version reintroduces C<csv_cleanup_needed> as an optional
parameter in the module constructor.  This was done in response to several requests
received from the user community. (Previously, all line records from a CSV file were
processed by the C<cleanup_csv()> function no matter what.)  The main point made by
the users was that invoking C<cleanup_csv()> when there was no need for CSV clean-up
exacted a performance penalty when ingesting large database files with tens of
thousands of line records.  In addition to making C<csv_cleanup_needed> optional, I
have also tweaked the code in the C<cleanup_csv()> function so that it can extract
data from a larger range of messy CSV files.

B<Version 3.41:> All the changes made in this version relate to the construction of
regression trees.  I have fixed a couple of bugs in the calculation of the regression
coefficients. Additionally, the C<RegressionTree> class now comes with a new
constructor parameter named C<jacobian_choice>.  For most cases, you'd set this
parameter to 0, which causes the regression coefficients to be estimated through
linear least-squares minimization.

B<Version 3.40:> In addition to constructing decision trees, this version of the
module also allows you to construct regression trees. The regression tree capability
has been packed into a separate subclass, named C<RegressionTree>, of the main
C<DecisionTree> class.  The subdirectory C<ExamplesRegression> in the main
installation directory illustrates how you can use this new functionality of the
module.

B<Version 3.30:> This version incorporates four very significant upgrades/changes to
the C<DecisionTree> module: B<(1)> The CSV cleanup is now the default. So you do not
have to set any special parameters in the constructor calls to initiate CSV
cleanup. B<(2)> In the form of a new Perl class named C<RandomizedTreesForBigData>,
this module provides you with an easy-to-use programming interface for attempting
needle-in-a-haystack solutions for the case when your training data is overwhelmingly
dominated by a single class.  You need to set the constructor parameter
C<looking_for_needles_in_haystack> to invoke the logic that constructs multiple
decision trees, each using the minority class samples along with samples drawn
randomly from the majority class.  The final classification is made through a
majority vote from all the decision trees.  B<(3)> Assuming you are faced with a
big-data problem --- in the sense that you have been given a training database with a
very large number of training records --- the class C<RandomizedTreesForBigData> will
also let you construct multiple decision trees by pulling training data randomly from
your training database (without paying attention to the relative populations of the
classes).  The final classification decision for a test sample is based on a majority
vote from all the decision trees thus constructed.  See the C<ExamplesRandomizedTrees>
directory for how to use these new features of the module. And, finally, B<(4)>
Support for the old-style '.dat' training files has been dropped in this version.

B<Version 3.21:> This version makes it easier to use a CSV training file that
violates the assumption that a comma be used only to separate the different field
values in a line record.  Some large econometrics databases use double-quoted values
for fields, and these values may contain commas (presumably for better readability).
This version also allows you to specify the leftmost entry in the first CSV record
that names all the fields. Previously, this entry was required to be an empty
double-quoted string.  I have also made some minor changes to the
'C<get_training_data_from_csv()>' method to make it more user friendly for large
training files that may contain tens of thousands of records.  When pulling training
data from such files, this method prints out a dot on the terminal screen for every
10000 records it has processed. 

B<Version 3.20:> This version brings the boosting capability to the C<DecisionTree>
module.

B<Version 3.0:> This version adds bagging to the C<DecisionTree> module. If your
training dataset is large enough, you can ask the module to construct multiple
decision trees using data bags extracted from your dataset.  The module can show you
the results returned by the individual decision trees and also the results obtained
by taking a majority vote of the classification decisions made by the individual
trees.  You can specify any arbitrary extent of overlap between the data bags.

B<Version 2.31:> The introspection capability in this version packs more of a punch.
For each training data sample, you can now figure out not only the decision-tree
nodes that are affected directly by that sample, but also those nodes that are
affected indirectly through the generalization achieved by the probabilistic modeling
of the data.  The 'examples' directory of this version includes additional scripts
that illustrate these enhancements to the introspection capability.  See the section
"The Introspection API" for a declaration of the introspection related methods, old
and new.

B<Version 2.30:> In response to requests from several users, this version includes a new
capability: You can now ask the module to introspect about the classification
decisions returned by the decision tree.  Toward that end, the module includes a new
class named C<DTIntrospection>.  Perhaps the most important bit of information you
are likely to seek through DT introspection is the list of the training samples that
fall directly in the portion of the feature space that is assigned to a node.
B<CAVEAT:> When training samples are non-uniformly distributed in the underlying
feature space, IT IS POSSIBLE FOR A NODE TO EXIST EVEN WHEN NO TRAINING SAMPLES FALL
IN THE PORTION OF THE FEATURE SPACE ASSIGNED TO THE NODE.  B<(This is an important
part of the generalization achieved by probabilistic modeling of the training data.)>
For additional information related to DT introspection, see the section titled
"DECISION TREE INTROSPECTION" in this documentation page.

B<Version 2.26> fixes a bug in the part of the module that some folks use for generating
synthetic data for experimenting with decision tree construction and classification.
In the class C<TrainingDataGeneratorNumeric> that is a part of the module, there
was a problem with the order in which the features were recorded from the
user-supplied parameter file.  The basic code for decision tree construction and
classification remains unchanged.

B<Version 2.25> further downshifts the required version of Perl for this module.  This
was a result of testing the module with Version 5.10.1 of Perl.  Only one statement
in the module code needed to be changed for the module to work with the older version
of Perl.

B<Version 2.24> fixes the C<Makefile.PL> restriction on the required Perl version.  This
version should work with Perl versions 5.14.0 and higher.

B<Version 2.23> changes the required version of Perl from 5.18.0 to 5.14.0.  Everything
else remains the same.

B<Version 2.22> should prove more robust when the probability distribution for the
values of a feature is expected to be heavy-tailed; that is, when the supposedly rare
observations can occur with significant probabilities.  A new option in the
DecisionTree constructor lets the user specify the precision with which the
probability distributions are estimated for such features.

B<Version 2.21> fixes a bug that was caused by the explicitly set zero values for
numerical features being misconstrued as "false" in the conditional statements in
some of the method definitions.


is an anonymous array that holds the path, in the form of a list of nodes, from the
root node to the leaf node in the decision tree where the final classification was
made.


=item B<classify_by_asking_questions($root_node):>

This method allows you to use a decision-tree based classifier in an interactive
mode.  In this mode, a user is prompted for answers to the questions pertaining to
the feature tests at the nodes of the tree.  The syntax for invoking this method is:

    my $classification = $dt->classify_by_asking_questions($root_node);

where C<$dt> is an instance of the C<Algorithm::DecisionTree> class returned by a
call to C<new()> and C<$root_node> the root node of the decision tree returned by a
call to C<construct_decision_tree_classifier()>.

=back


=head1 THE INTROSPECTION API

To construct an instance of C<DTIntrospection>, you call

    my $introspector = DTIntrospection->new($dt);

where you supply the instance of the C<DecisionTree> class you used for constructing
the decision tree through the parameter C<$dt>.  After you have constructed an
instance of the introspection class, you must initialize it by calling:

    $introspector->initialize();

Subsequently, you can invoke either of the following methods:

    $introspector->explain_classification_at_one_node($node);

    $introspector->explain_classifications_at_multiple_nodes_interactively();

depending on whether you want introspection at a single specified node or inside an
infinite loop for an arbitrary number of nodes.

If you want to output a tabular display that shows for each node in the decision tree
all the training samples that fall in the portion of the feature space that belongs
to that node, call

    $introspector->display_training_samples_at_all_nodes_direct_influence_only();

If you want to output a tabular display that shows for each training sample a list of
all the nodes that are affected directly AND indirectly by that sample, call

    $introspector->display_training_training_samples_to_nodes_influence_propagation();

A training sample affects a node directly if the sample falls in the portion of the
feature space assigned to that node.  On the other hand, a training sample is
considered to affect a node indirectly if the node is a descendant of another node
that is affected directly by the sample.


=head1 BULK CLASSIFICATION OF DATA RECORDS

For large test datasets, you would obviously want to process an entire file of test
data at a time. The following scripts in the C<Examples> directory illustrate how you
can do that:

      classify_test_data_in_a_file.pl

This script requires three command-line arguments: the first names the training
datafile, the second the test datafile, and the third the file in which the
classification results are to be deposited.
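
By way of illustration, you might invoke this script as shown below (all three file
names here are hypothetical; substitute your own):

    perl classify_test_data_in_a_file.pl training.csv testdata.csv out.csv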

The other examples directories, C<ExamplesBagging>, C<ExamplesBoosting>, and
C<ExamplesRandomizedTrees>, also contain scripts that illustrate how to carry out
bulk classification of data records when you wish to take advantage of bagging,
boosting, or tree randomization.  In their respective directories, these scripts are
named:

    bagging_for_bulk_classification.pl
    boosting_for_bulk_classification.pl
    classify_database_records.pl


=head1 HOW THE CLASSIFICATION RESULTS ARE DISPLAYED

It depends on whether you apply the classifier at once to all the data samples in a
file, or whether you feed one data sample at a time into the classifier.

In general, the classifier returns soft classification for a test data vector.  What
that means is that, in general, the classifier will list all the classes to which a
given data vector could belong and the probability of each such class label for the
data vector. Run the example scripts in the Examples directory to see how the output
of classification can be displayed.

With regard to the soft classifications returned by this classifier, if the
probability distributions for the different classes overlap in the underlying feature
space, you would want the classifier to return all of the applicable class labels for
a data vector along with the corresponding class probabilities.  Another reason why
the decision tree classifier may associate significant probabilities with multiple
class labels is that you used an inadequate number of training samples to induce the
decision tree.  The good thing is that the classifier does not lie to you (unlike,
say, a hard classification rule that would return a single class label corresponding
to the partitioning of the underlying feature space).  The decision tree classifier
gives you the best classification that can be made given the training data you fed
into it.


=head1 USING BAGGING

Starting with Version 3.0, you can use the class C<DecisionTreeWithBagging> that
comes with the module to incorporate bagging in your decision tree based
classification.  Bagging means constructing multiple decision trees for different
(possibly overlapping) segments of the data extracted from your training dataset and
then aggregating the decisions made by the individual decision trees for the final
classification.  The aggregation of the classification decisions can average out the
noise and bias that may otherwise affect the classification decision obtained from
just one tree.

=over 4

=item B<Calling the bagging constructor:>

A typical call to the constructor for the C<DecisionTreeWithBagging> class looks


