Algorithm-DecisionTree

 view release on metacpan or  search on metacpan

lib/Algorithm/DecisionTree.pm  view on Meta::CPAN

=head1 NAME

Algorithm::DecisionTree - A Perl module for decision-tree based classification of
multidimensional data.


=head1 SYNOPSIS

  # FOR CONSTRUCTING A DECISION TREE AND FOR CLASSIFYING A SAMPLE:

  # In general, your call for constructing an instance of the DecisionTree class 
  # will look like:

      my $training_datafile = "stage3cancer.csv";
      my $dt = Algorithm::DecisionTree->new( 
                              training_datafile => $training_datafile,
                              csv_class_column_index => 2,
                              csv_columns_for_features => [3,4,5,6,7,8],
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
                              csv_cleanup_needed => 1,
               );

  # The constructor option `csv_class_column_index' informs the module as to which
  # column of your CSV file contains the class label.  THE COLUMN INDEXING IS ZERO
  # BASED.  The constructor option `csv_columns_for_features' specifies which columns
  # are to be used for feature values.  The first row of the CSV file must specify
  # the names of the features.  See examples of CSV files in the `Examples'
  # subdirectory.

  # The option `symbolic_to_numeric_cardinality_threshold' is also important.  For
  # the example shown above, if an ostensibly numeric feature takes on only 10 or
  # fewer different values in your training datafile, it will be treated like a
  # symbolic features.  The option `entropy_threshold' determines the granularity
  # with which the entropies are sampled for the purpose of calculating entropy gain
  # with a particular choice of decision threshold for a numeric feature or a feature
  # value for a symbolic feature.

  # The option 'csv_cleanup_needed' is by default set to 0.  If you set it
  # to 1, that would cause all line records in your CSV file to be "sanitized" before
  # they are used for constructing a decision tree.  You need this option if your CSV
  # file uses double-quoted field names and field values in the line records and if
  # such double-quoted strings are allowed to include commas for, presumably, better
  # readability.

  # After you have constructed an instance of the DecisionTree class as shown above,
  # you read in the training data file and initialize the probability cache by
  # calling:

      $dt->get_training_data();
      $dt->calculate_first_order_probabilities();
      $dt->calculate_class_priors();

  # Next you construct a decision tree for your training data by calling:

      $root_node = $dt->construct_decision_tree_classifier();

  # where $root_node is an instance of the DTNode class that is also defined in the
  # module file.  Now you are ready to classify a new data record.  Let's say that
  # your data record looks like:

      my @test_sample  = qw /  g2=4.2
                               grade=2.3
                               gleason=4
                               eet=1.7
                               age=55.0
                               ploidy=diploid /;

  # You can classify it by calling:

      my $classification = $dt->classify($root_node, \@test_sample);

  # The call to `classify()' returns a reference to a hash whose keys are the class
  # names and the values the associated classification probabilities.  This hash also
  # includes another key-value pair for the solution path from the root node to the
  # leaf node at which the final classification was carried out.


=head1 CHANGES

B<Version 3.43:> This version fixes a bug in the C<csv_cleanup_needed()> function.
The source of the bug was a typo in a regex component meant for matching with white
space.  I have also made one additional change to this function to increase its
versatility.  With this change, you are now allowed to have empty strings as values
for features.

B<Version 3.42:> This version reintroduces C<csv_cleanup_needed> as an optional
parameter in the module constructor.  This was done in response to several requests
received from the user community. (Previously, all line records from a CSV file were
processed by the C<cleanup_csv()> function no matter what.)  The main point made by
the users was that invoking C<cleanup_csv()> when there was no need for CSV clean-up
extracted a performance penalty when ingesting large database files with tens of
thousands of line records.  In addition to making C<csv_cleanup_needed> optional, I
have also tweaked up the code in the C<cleanup_csv()> function in order to extract
data from a larger range of messy CSV files.

B<Version 3.41:> All the changes made in this version relate to the construction of
regression trees.  I have fixed a couple of bugs in the calculation of the regression
coefficients. Additionally, the C<RegressionTree> class now comes with a new
constructor parameter named C<jacobian_choice>.  For most cases, you'd set this
parameter to 0, which causes the regression coefficients to be estimated through
linear least-squares minimization.

B<Version 3.40:> In addition to constructing decision trees, this version of the
module also allows you to construct regression trees. The regression tree capability
has been packed into a separate subclass, named C<RegressionTree>, of the main
C<DecisionTree> class.  The subdirectory C<ExamplesRegression> in the main
installation directory illustrates how you can use this new functionality of the
module.

B<Version 3.30:> This version incorporates four very significant upgrades/changes to
the C<DecisionTree> module: B<(1)> The CSV cleanup is now the default. So you do not
have to set any special parameters in the constructor calls to initiate CSV
cleanup. B<(2)> In the form of a new Perl class named C<RandomizedTreesForBigData>,
this module provides you with an easy-to-use programming interface for attempting
needle-in-a-haystack solutions for the case when your training data is overwhelmingly
dominated by a single class.  You need to set the constructor parameter
C<looking_for_needles_in_haystack> to invoke the logic that constructs multiple
decision trees, each using the minority class samples along with samples drawn
randomly from the majority class.  The final classification is made through a

lib/Algorithm/DecisionTree.pm  view on Meta::CPAN

in the module file.  A description of how to run this test is in the previous section
of this document.


=head1 DECISION TREE INTROSPECTION

Starting with Version 2.30, you can ask the C<DTIntrospection> class of the module to
explain the classification decisions made at the different nodes of the decision
tree.

Perhaps the most important bit of information you are likely to seek through DT
introspection is the list of the training samples that fall directly in the portion
of the feature space that is assigned to a node.

However, note that, when training samples are non-uniformly distributed in the
underlying feature space, it is possible for a node to exist even when there are no
training samples in the portion of the feature space assigned to the node.  That is
because the decision tree is constructed from the probability densities estimated
from the training data.  When the training samples are non-uniformly distributed, it
is entirely possible for the estimated probability densities to be non-zero in a
small region around a point even when there are no training samples specifically in
that region.  (After you have created a statistical model for, say, the height
distribution of people in a community, the model may return a non-zero probability
for the height values in a small interval even if the community does not include a
single individual whose height falls in that interval.)

That a decision-tree node can exist even when there are no training samples in that
portion of the feature space that belongs to the node is an important indication of
the generalization ability of a decision-tree-based classifier.

In light of the explanation provided above, before the DTIntrospection class supplies
any answers at all, it asks you to accept the fact that features can take on non-zero
probabilities at a point in the feature space even though there are zero training
samples at that point (or in a small region around that point).  If you do not accept
this rudimentary fact, the introspection class will not yield any answers (since you
are not going to believe the answers anyway).

The point made above implies that the path leading to a node in the decision tree may
test a feature for a certain value or threshold despite the fact that the portion of
the feature space assigned to that node is devoid of any training data.

See the following three scripts in the Examples directory for how to carry out DT
introspection:

    introspection_in_a_loop_interactive.pl

    introspection_show_training_samples_at_all_nodes_direct_influence.pl

    introspection_show_training_samples_to_nodes_influence_propagation.pl

The first script places you in an interactive session in which you will first be
asked for the node number you are interested in.  Subsequently, you will be asked for
whether or not you are interested in specific questions that the introspection can
provide answers for. The second script descends down the decision tree and shows for
each node the training samples that fall directly in the portion of the feature space
assigned to that node.  The third script shows for each training sample how it
affects the decision-tree nodes either directly or indirectly through the
generalization achieved by the probabilistic modeling of the data.

The output of the script
C<introspection_show_training_samples_at_all_nodes_direct_influence.pl> looks like:

    Node 0: the samples are: None
    Node 1: the samples are: [sample_46 sample_58]
    Node 2: the samples are: [sample_1 sample_4 sample_7 .....]
    Node 3: the samples are: []
    Node 4: the samples are: []
    ...
    ...            

The nodes for which no samples are listed come into existence through
the generalization achieved by the probabilistic modeling of the data.

The output produced by the script
C<introspection_show_training_samples_to_nodes_influence_propagation.pl> looks like

    sample_1:                                                                 
       nodes affected directly: [2 5 19 23]                                
       nodes affected through probabilistic generalization:                   
            2=> [3 4 25]                                                    
                25=> [26]                                                     
            5=> [6]                                                           
                6=> [7 13]                                                   
                    7=> [8 11]                                               
                        8=> [9 10]                                           
                        11=> [12]                                             
                    13=> [14 18]                                             
                        14=> [15 16]                                         
                            16=> [17]                                         
            19=> [20]                                                         
                20=> [21 22]                                                 
            23=> [24]                                                         
                             
    sample_4:                                                                 
       nodes affected directly: [2 5 6 7 11]                              
       nodes affected through probabilistic generalization:                   
            2=> [3 4 25]                                                    
                25=> [26]                                                     
            5=> [19]                                                          
                19=> [20 23]                                                 
                    20=> [21 22]                                             
                    23=> [24]                                                 
            6=> [13]                                                          
                13=> [14 18]                                                 
                    14=> [15 16]                                             
                        16=> [17]                                             
            7=> [8]                                                           
                8=> [9 10]                                                   
            11=> [12]                                                         
                                                                              
    ...                                                                       
    ...  
    ...

For each training sample, the display shown above first presents the list of nodes
that are directly affected by the sample.  A node is affected directly by a sample if
the latter falls in the portion of the feature space that belongs to the former.
Subsequently, for each training sample, the display shows a subtree of the nodes that
are affected indirectly by the sample through the generalization achieved by the
probabilistic modeling of the data.  In general, a node is affected indirectly by a
sample if it is a descendant of another node that is affected directly.

Also see the section titled B<The Introspection API> regarding how to invoke the
introspection capabilities of the module in your own code.

=head1 METHODS

The module provides the following methods for constructing a decision tree from
training data in a disk file and for classifying new data records with the decision
tree thus constructed:

=over 4

=item B<new():>

lib/Algorithm/DecisionTree.pm  view on Meta::CPAN

For large test datasets, you would obviously want to process an entire file of test
data at a time. The following scripts in the C<Examples> directory illustrate how you
can do that:

      classify_test_data_in_a_file.pl

This script requires three command-line arguments, the first argument names the
training datafile, the second the test datafile, and the third the file in which the
classification results are to be deposited.  

The other examples directories, C<ExamplesBagging>, C<ExamplesBoosting>, and
C<ExamplesRandomizedTrees>, also contain scripts that illustrate how to carry out
bulk classification of data records when you wish to take advantage of bagging,
boosting, or tree randomization.  In their respective directories, these scripts are
named:

    bagging_for_bulk_classification.pl
    boosting_for_bulk_classification.pl
    classify_database_records.pl


=head1 HOW THE CLASSIFICATION RESULTS ARE DISPLAYED

It depends on whether you apply the classifier at once to all the data samples in a
file, or whether you feed one data sample at a time into the classifier.

In general, the classifier returns soft classification for a test data vector.  What
that means is that, in general, the classifier will list all the classes to which a
given data vector could belong and the probability of each such class label for the
data vector. Run the examples scripts in the Examples directory to see how the output
of classification can be displayed.

With regard to the soft classifications returned by this classifier, if the
probability distributions for the different classes overlap in the underlying feature
space, you would want the classifier to return all of the applicable class labels for
a data vector along with the corresponding class probabilities.  Another reason for
why the decision tree classifier may associate significant probabilities with
multiple class labels is that you used inadequate number of training samples to
induce the decision tree.  The good thing is that the classifier does not lie to you
(unlike, say, a hard classification rule that would return a single class label
corresponding to the partitioning of the underlying feature space).  The decision
tree classifier give you the best classification that can be made given the training
data you fed into it.


=head1 USING BAGGING

Starting with Version 3.0, you can use the class C<DecisionTreeWithBagging> that
comes with the module to incorporate bagging in your decision tree based
classification.  Bagging means constructing multiple decision trees for different
(possibly overlapping) segments of the data extracted from your training dataset and
then aggregating the decisions made by the individual decision trees for the final
classification.  The aggregation of the classification decisions can average out the
noise and bias that may otherwise affect the classification decision obtained from
just one tree.

=over 4

=item B<Calling the bagging constructor::>

A typical call to the constructor for the C<DecisionTreeWithBagging> class looks
like:

    use Algorithm::DecisionTreeWithBagging;
    
    my $training_datafile = "stage3cancer.csv";
    
    my $dtbag = Algorithm::DecisionTreeWithBagging->new(
                                  training_datafile => $training_datafile,
                                  csv_class_column_index => 2,
                                  csv_columns_for_features => [3,4,5,6,7,8],
                                  entropy_threshold => 0.01,
                                  max_depth_desired => 8,
                                  symbolic_to_numeric_cardinality_threshold => 10,
                                  how_many_bags => 4,
                                  bag_overlap_fraction => 0.2,
                                  csv_cleanup_needed => 1,
                 );
    
Note in particular the following two constructor parameters:
                                                                                               
    how_many_bags

    bag_overlap_fraction

where, as the name implies, the parameter C<how_many_bags> controls how many bags
(and, therefore, how many decision trees) will be constructed from your training
dataset; and where the parameter C<bag_overlap_fraction> controls the degree of
overlap between the bags.  To understand what exactly is achieved by setting the
parameter C<bag_overlap_fraction> to 0.2 in the above example, let's say that the
non-overlapping partitioning of the training data between the bags results in 100
training samples per bag. With bag_overlap_fraction set to 0.2, additional 20 samples
drawn randomly from the other bags will be added to the data in each bag.

=back

=head2 B<Methods defined for C<DecisionTreeWithBagging> class>

=over 8

=item B<get_training_data_for_bagging():>

This method reads your training datafile, randomizes it, and then partitions it into
the specified number of bags.  Subsequently, if the constructor parameter
C<bag_overlap_fraction> is non-zero, it adds to each bag additional samples drawn at
random from the other bags.  The number of these additional samples added to each bag
is controlled by the constructor parameter C<bag_overlap_fraction>.  If this
parameter is set to, say, 0.2, the size of each bag will grow by 20% with the samples
drawn from the other bags.

=item B<show_training_data_in_bags():>

Shows for each bag the names of the training data samples in that bag.

=item B<calculate_first_order_probabilities():>

Calls on the appropriate methods of the main C<DecisionTree> class to estimate the
first-order probabilities from the data samples in each bag.

=item B<calculate_class_priors():>

lib/Algorithm/DecisionTree.pm  view on Meta::CPAN


=item B<display_decision_trees_for_bags():>

Display separately the decision tree for each bag..

=item B<classify_with_bagging( test_sample ):>

Calls on the appropriate methods of the main C<DecisionTree> class to classify the
argument test sample.

=item B<display_classification_results_for_each_bag():>

Displays separately the classification decision made by each the decision tree
constructed for each bag.

=item B<get_majority_vote_classification():>

Using majority voting, this method aggregates the classification decisions made by
the individual decision trees into a single decision.

=back

See the example scripts in the directory C<bagging_examples> for how to call these
methods for classifying individual samples and for bulk classification when you place
all your test samples in a single file.

=head1 USING BOOSTING

Starting with Version 3.20, you can use the class C<BoostedDecisionTree> for
constructing a boosted decision-tree classifier.  Boosting results in a cascade of
decision trees in which each decision tree is constructed with samples that are
mostly those that are misclassified by the previous decision tree.  To be precise,
you create a probability distribution over the training samples for the selection of
samples for training each decision tree in the cascade.  To start out, the
distribution is uniform over all of the samples. Subsequently, this probability
distribution changes according to the misclassifications by each tree in the cascade:
if a sample is misclassified by a given tree in the cascade, the probability of its
being selected for training the next tree is increased significantly.  You also
associate a trust factor with each decision tree depending on its power to classify
correctly all of the training data samples.  After a cascade of decision trees is
constructed in this manner, you construct a final classifier that calculates the
class label for a test data sample by taking into account the classification
decisions made by each individual tree in the cascade, the decisions being weighted
by the trust factors associated with the individual classifiers.  These boosting
notions --- generally referred to as the AdaBoost algorithm --- are based on a now
celebrated paper "A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting" by Yoav Freund and Robert Schapire that appeared in 1995 in
the Proceedings of the 2nd European Conf. on Computational Learning Theory.  For a
tutorial introduction to AdaBoost, see L<https://engineering.purdue.edu/kak/Tutorials/AdaBoost.pdf>

Keep in mind the fact that, ordinarily, the theoretical guarantees provided by
boosting apply only to the case of binary classification.  Additionally, your
training dataset must capture all of the significant statistical variations in the
classes represented therein.

=over 4

=item B<Calling the BoostedDecisionTree constructor:>

If you'd like to experiment with boosting, a typical call to the constructor for the
C<BoostedDecisionTree> class looks like: 

    use Algorithm::BoostedDecisionTree;
    my $training_datafile = "training6.csv";
    my $boosted = Algorithm::BoostedDecisionTree->new(
                              training_datafile => $training_datafile,
                              csv_class_column_index => 1,
                              csv_columns_for_features => [2,3],
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
                              how_many_stages => 4,
                              csv_cleanup_needed => 1,
                  );

Note in particular the constructor parameter:
    
    how_many_stages

As its name implies, this parameter controls how many stages will be used in the
boosted decision tree classifier.  As mentioned above, a separate decision tree is
constructed for each stage of boosting using a set of training samples that are drawn
through a probability distribution maintained over the entire training dataset.

=back

=head2 B<Methods defined for C<BoostedDecisionTree> class>

=over 8

=item B<get_training_data_for_base_tree():>

This method reads your training datafile, creates the data structures from the data
ingested for constructing the base decision tree.

=item B<show_training_data_for_base_tree():>

Writes to the standard output the training data samples and also some relevant
properties of the features used in the training dataset.

=item B<calculate_first_order_probabilities_and_class_priors():>

Calls on the appropriate methods of the main C<DecisionTree> class to estimate the
first-order probabilities and the class priors.

=item B<construct_base_decision_tree():>

Calls on the appropriate method of the main C<DecisionTree> class to construct the
base decision tree.

=item B<display_base_decision_tree():>

Displays the base decision tree in your terminal window. (The textual form of the
decision tree is written out to the standard output.)

=item B<construct_cascade_of_trees():>

Uses the AdaBoost algorithm to construct a cascade of decision trees.  As mentioned
earlier, the training samples for each tree in the cascade are drawn using a
probability distribution over the entire training dataset. This probability
distribution for any given tree in the cascade is heavily influenced by which



( run in 2.125 seconds using v1.01-cache-2.11-cpan-ceb78f64989 )