Algorithm-DecisionTree


lib/Algorithm/DecisionTree.pm  view on Meta::CPAN

            if ($param eq $legal) {
                $found_match_flag = 1;
                last;
            }
        }
        last if $found_match_flag == 0;
    }
    return $found_match_flag;
}

sub print_array_with_msg {
    my $message = shift;
    my $arr = shift;
    print "\n$message: ";
    print_nested_array( $arr );
}

sub print_nested_array {
    my $arr = shift;
    my @arr = @$arr;
    print "[";
    foreach my $item (@arr) {
        if (ref $item) {
            print_nested_array($item);
        } else {
            print "$item";
        }
    }
    print "]";
}    

sub cleanup_csv {
    my $line = shift;
    $line =~ tr/\/:?()[]{}'/          /;
#    my @double_quoted = substr($line, index($line,',')) =~ /\"[^\"]+\"/g;
    my @double_quoted = substr($line, index($line,',')) =~ /\"[^\"]*\"/g;
    for (@double_quoted) {
        my $item = $_;
        $item = substr($item, 1, -1);
        $item =~ s/^\s+|,|\s+$//g;
        $item = join '_',  split /\s+/, $item;
        substr($line, index($line, $_), length($_)) = $item;
    }
    my @white_spaced = $line =~ /,(\s*[^,]+)(?=,|$)/g;
    for (@white_spaced) {
        my $item = $_;
        $item =~ s/\s+/_/g;
        $item =~ s/^\s*_|_\s*$//g;
        substr($line, index($line, $_), length($_)) = $item;
    }
    $line =~ s/,\s*(?=,|$)/,NA/g;
    return $line;
}
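A quick standalone demonstration of what cleanup_csv() does to a raw CSV record (the sub is reproduced from above so the snippet runs on its own; the sample record is made up):

```perl
use strict;
use warnings;

# cleanup_csv() as defined in the module, reproduced here verbatim so
# this demo is self-contained.
sub cleanup_csv {
    my $line = shift;
    $line =~ tr/\/:?()[]{}'/          /;
    my @double_quoted = substr($line, index($line,',')) =~ /\"[^\"]*\"/g;
    for (@double_quoted) {
        my $item = $_;
        $item = substr($item, 1, -1);
        $item =~ s/^\s+|,|\s+$//g;
        $item = join '_',  split /\s+/, $item;
        substr($line, index($line, $_), length($_)) = $item;
    }
    my @white_spaced = $line =~ /,(\s*[^,]+)(?=,|$)/g;
    for (@white_spaced) {
        my $item = $_;
        $item =~ s/\s+/_/g;
        $item =~ s/^\s*_|_\s*$//g;
        substr($line, index($line, $_), length($_)) = $item;
    }
    $line =~ s/,\s*(?=,|$)/,NA/g;
    return $line;
}

# Quoted fields and fields with embedded white space have their words
# joined with underscores, and trailing/empty fields become NA:
my $cleaned = cleanup_csv('sample1,"good value",3.5, some thing,');
# $cleaned is now 'sample1,good_value,3.5,some_thing,NA'
```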

######################################### Class EvalTrainingData  ########################################

##  This subclass of the DecisionTree class is used to evaluate the quality of your
##  training data by running a 10-fold cross-validation test on it. This test divides
##  all of the training data into ten parts, with nine parts used for training a
##  decision tree and one part used for testing its ability to classify correctly.
##  This selection of nine parts for training and one part for testing is carried out
##  in all of the ten different possible ways.  This testing functionality can also
##  be used to find the best values to use for the constructor parameters
##  entropy_threshold, max_depth_desired, and
##  symbolic_to_numeric_cardinality_threshold.

##  Only CSV training files can be evaluated in this manner (because only CSV
##  training files are allowed to have numeric features --- which is the more
##  interesting case for evaluation analytics).

package EvalTrainingData;

@EvalTrainingData::ISA = ('Algorithm::DecisionTree');

sub new {
    my $class = shift;
    my $instance = Algorithm::DecisionTree->new(@_);
    bless $instance, $class;
}

sub evaluate_training_data {
    my $self = shift;
    my $evaldebug = 0;
    die "The data evaluation function in the module can only be used when your " .
        "training data is in a CSV file" unless $self->{_training_datafile} =~ /\.csv$/;
    print "\nWill run a 10-fold cross-validation test on your training data to test its " .
          "class-discriminatory power:\n";
    my %all_training_data = %{$self->{_training_data_hash}};
    my @all_sample_names = sort {Algorithm::DecisionTree::sample_index($a) <=> 
                                     Algorithm::DecisionTree::sample_index($b)}  keys %all_training_data;
    my $fold_size = int(0.1 * (scalar keys %all_training_data));
    print "fold size: $fold_size\n";
    my %confusion_matrix = ();
    foreach my $class_name (@{$self->{_class_names}}) {
        foreach my $inner_class_name (@{$self->{_class_names}}) {
            $confusion_matrix{$class_name}->{$inner_class_name} = 0;
        }
    }
    foreach my $fold_index (0..9) {
        print "\nStarting the iteration indexed $fold_index of the 10-fold cross-validation test\n"; 
        my @testing_samples = @all_sample_names[$fold_size * $fold_index .. $fold_size * ($fold_index+1) - 1];
        my @training_samples = (@all_sample_names[0 .. $fold_size * $fold_index-1],  
                     @all_sample_names[$fold_size * ($fold_index+1) .. (scalar keys %all_training_data) - 1]);
        my %testing_data = ();
        foreach my $x (@testing_samples) {
            $testing_data{$x} = $all_training_data{$x};
        }
        my %training_data = ();
        foreach my $x (@training_samples) {
            $training_data{$x} = $all_training_data{$x};
        }
        my $trainingDT = Algorithm::DecisionTree->new('evalmode');
        $trainingDT->{_training_data_hash} = \%training_data;
        $trainingDT->{_class_names} = $self->{_class_names};
        $trainingDT->{_feature_names} = $self->{_feature_names};
        $trainingDT->{_entropy_threshold} = $self->{_entropy_threshold};
        $trainingDT->{_max_depth_desired} = $self->{_max_depth_desired};
        $trainingDT->{_symbolic_to_numeric_cardinality_threshold} = 
                                                $self->{_symbolic_to_numeric_cardinality_threshold};
        foreach my $sample_name (@training_samples) {
            $trainingDT->{_samples_class_label_hash}->{$sample_name} = 


For each training data sample, you can now figure out not only the decision-tree
nodes that are affected directly by that sample, but also those nodes that are
affected indirectly through the generalization achieved by the probabilistic modeling
of the data.  The 'examples' directory of this version includes additional scripts
that illustrate these enhancements to the introspection capability.  See the section
"The Introspection API" for a declaration of the introspection related methods, old
and new.

B<Version 2.30:> In response to requests from several users, this version includes a new
capability: You can now ask the module to introspect about the classification
decisions returned by the decision tree.  Toward that end, the module includes a new
class named C<DTIntrospection>.  Perhaps the most important bit of information you
are likely to seek through DT introspection is the list of the training samples that
fall directly in the portion of the feature space that is assigned to a node.
B<CAVEAT:> When training samples are non-uniformly distributed in the underlying
feature space, IT IS POSSIBLE FOR A NODE TO EXIST EVEN WHEN NO TRAINING SAMPLES FALL
IN THE PORTION OF THE FEATURE SPACE ASSIGNED TO THE NODE.  B<(This is an important
part of the generalization achieved by probabilistic modeling of the training data.)>
For additional information related to DT introspection, see the section titled
"DECISION TREE INTROSPECTION" in this documentation page.

B<Version 2.26> fixes a bug in the part of the module that some folks use for generating
synthetic data for experimenting with decision tree construction and classification.
In the class C<TrainingDataGeneratorNumeric> that is a part of the module, there
was a problem with the order in which the features were recorded from the
user-supplied parameter file.  The basic code for decision tree construction and
classification remains unchanged.

B<Version 2.25> further downshifts the required version of Perl for this module.  This
was a result of testing the module with Version 5.10.1 of Perl.  Only one statement
in the module code needed to be changed for the module to work with the older version
of Perl.

B<Version 2.24> fixes the C<Makefile.PL> restriction on the required Perl version.  This
version should work with Perl versions 5.14.0 and higher.

B<Version 2.23> changes the required version of Perl from 5.18.0 to 5.14.0.  Everything
else remains the same.

B<Version 2.22> should prove more robust when the probability distribution for the
values of a feature is expected to be heavy-tailed; that is, when the supposedly rare
observations can occur with significant probabilities.  A new option in the
DecisionTree constructor lets the user specify the precision with which the
probability distributions are estimated for such features.

B<Version 2.21> fixes a bug that was caused by the explicitly set zero values for
numerical features being misconstrued as "false" in the conditional statements in
some of the method definitions.

B<Version 2.2> makes it easier to write code for classifying in one go all of your test
data samples in a CSV file.  The bulk classifications obtained can be written out to
either a CSV file or to a regular text file.  See the script
C<classify_test_data_in_a_file_numeric.pl> in the C<Examples> directory for how to
classify all of your test data records in a CSV file.  This version also includes
improved code for generating synthetic numeric/symbolic training and test data
records for experimenting with the decision tree classifier.

B<Version 2.1> allows you to test the quality of your training data by running a 10-fold
cross-validation test on the data.  This test divides all of the training data into
ten parts, with nine parts used for training a decision tree and one part used for
testing its ability to classify correctly. This selection of nine parts for training
and one part for testing is carried out in all of the ten different ways that are
possible.  This testing functionality in Version 2.1 can also be used to find the
best values to use for the constructor parameters C<entropy_threshold>,
C<max_depth_desired>, and C<symbolic_to_numeric_cardinality_threshold>.

B<Version 2.0 is a major rewrite of this module.> Now you can use both numeric and
symbolic features for constructing a decision tree. A feature is numeric if it can
take any floating-point value over an interval.

B<Version 1.71> fixes a bug in the code that was triggered by 0 being declared as one of
the feature values in the training datafile. Version 1.71 also includes an additional
safety feature that is useful for training datafiles that contain a very large number
of features.  The new version makes sure that the number of values you declare for
each sample record matches the number of features declared at the beginning of the
training datafile.

B<Version 1.7> includes safety checks on the consistency of the data you place in your
training datafile. When a training file contains thousands of samples, it is
difficult to manually check that you used the same class names in your sample records
that you declared at the top of your training file or that the values you have for your
features are legal vis-a-vis the earlier declarations of the values in the training
file.  Another safety feature incorporated in this version is the non-consideration
of classes that are declared at the top of the training file but that have no sample
records in the file.

B<Version 1.6> uses probability caching much more extensively compared to the previous
versions.  This should result in faster construction of large decision trees.
Another new feature in Version 1.6 is the use of a decision tree for interactive
classification. In this mode, after you have constructed a decision tree from the
training data, the user is prompted for answers to the questions pertaining to the
feature tests at the nodes of the tree.

Some key elements of the documentation were cleaned up and made more readable in
B<Version 1.41>.  The implementation code remains unchanged from Version 1.4.

B<Version 1.4> should make things faster (and easier) for folks who want to use this
module with training data that creates very large decision trees (that is, trees with
tens of thousands or more decision nodes).  The speedup in Version 1.4 has been
achieved by eliminating duplicate calculation of probabilities as the tree grows.  In
addition, this version provides an additional constructor parameter,
C<max_depth_desired> for controlling the size of the decision tree.  This is in
addition to the tree size control achieved by the parameter C<entropy_threshold> that
was introduced in Version 1.3.  Since large decision trees can take a long time to
create, you may find yourself wishing you could store the tree you just created in a
disk file and that, subsequently, you could use the stored tree for classification
work.  The C<Examples> directory contains two scripts, C<store_dt_on_disk.pl> and
C<classify_from_disk_stored_dt.pl>, that show how you can do exactly that with the
help of Perl's C<Storable> module.

B<Version 1.3> addresses the issue that arises when the header of a training datafile
declares a certain possible value for a feature but that (feature,value) pair does
NOT show up anywhere in the training data.  Version 1.3 also makes it possible for a
user to control the size of the decision tree by changing the value of the parameter
C<entropy_threshold.> Additionally, Version 1.3 includes a method called
C<determine_data_condition()> that displays useful information regarding the size and
some other attributes of the training data.  It also warns the user if the training
data might result in a decision tree that would simply be much too large --- unless
the user controls the size with the entropy_threshold parameter.

In addition to the removal of a couple of serious bugs, B<version 1.2> incorporates a
number of enhancements: (1) Version 1.2 includes checks on the names of the features
and values used in test data --- this is the data you want to classify with the
decision tree classifier constructed by this module.  (2) Version 1.2 includes a
separate constructor for generating test data.  To make it easier to generate test
data whose probabilistic parameters may not be identical to that used for the
training data, I have used separate routines for generating the test data.  (3)
Version 1.2 also includes in its examples directory a script that classifies the test
data in a file and outputs the class labels into another file.  This is for folks who
do not wish to write their own scripts using this module. (4) Version 1.2 also
includes addition to the documentation regarding the issue of numeric values for
features.

=head1 DESCRIPTION

B<Algorithm::DecisionTree> is a I<perl5> module for constructing a decision tree from
a training datafile containing multidimensional data.  In one form or another,
decision trees have been around for about fifty years.  From a statistical
perspective, they are closely related to classification and regression by recursive
partitioning of multidimensional data.  Early work that demonstrated the usefulness
of such partitioning of data for classification and regression can be traced to the
work of Terry Therneau in the early 1980's in the statistics community, and to the
work of Ross Quinlan in the mid 1990's in the machine learning community.

For those not familiar with decision tree ideas, the traditional way to classify
multidimensional data is to start with a feature space whose dimensionality is the
same as that of the data.  Each feature in this space corresponds to the attribute
that each dimension of the data measures.  You then use the training data to carve up
the feature space into different regions, each corresponding to a different class.
Subsequently, when you try to classify a new data sample, you locate it in the
feature space and find the class label of the region to which it belongs.  One can
also give the new data point the same class label as that of the nearest training
sample. This is referred to as the nearest neighbor classification.  There exist
hundreds of variations of varying power on these two basic approaches to the
classification of multidimensional data.

A decision tree classifier works differently.  When you construct a decision tree,
you select for the root node a feature test that partitions the training data in a
way that causes maximal disambiguation of the class labels associated with the data.
In terms of information content as measured by entropy, such a feature test would
cause maximum reduction in class entropy in going from all of the training data taken
together to the data as partitioned by the feature test.  You then drop from the root
node a set of child nodes, one for each partition of the training data created by the
feature test at the root node. When your features are purely symbolic, you'll have
one child node for each value of the feature chosen for the feature test at the root.
When the test at the root involves a numeric feature, you find the decision threshold
for the feature that best bipartitions the data and you drop from the root node two
child nodes, one for each partition.  Now at each child node you pose the same
question that you posed when you found the best feature to use at the root: Which
feature at the child node in question would maximally disambiguate the class labels
associated with the training data corresponding to that child node?
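The entropy reduction described above can be made concrete with a small sketch. This is not the module's own code; it is a minimal standalone illustration of class entropy, the quantity a feature test tries to drive down:

```perl
use strict;
use warnings;

# Class entropy of a set of labeled samples: H = -sum p_i * log2(p_i).
# A good feature test partitions the data so that the weighted entropy
# of the partitions is much lower than the entropy of the parent node.
sub class_entropy {
    my @labels = @_;
    my %counts;
    $counts{$_}++ for @labels;
    my $total   = scalar @labels;
    my $entropy = 0;
    for my $class (keys %counts) {
        my $p = $counts{$class} / $total;
        $entropy -= $p * log($p) / log(2);
    }
    return $entropy;
}

# A 50/50 mix of two class labels has the maximum entropy of 1 bit,
# whereas a pure node has entropy 0:
my $mixed = class_entropy(qw(buy sell buy sell));   # 1 bit
my $pure  = class_entropy(qw(buy buy buy));         # 0 bits
```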

As the reader would expect, the two key steps in any approach to decision-tree based
classification are the construction of the decision tree itself from a file
containing the training data, and then using the decision tree thus obtained for
classifying new data.

What is cool about decision tree classification is that it gives you soft
classification, meaning it may associate more than one class label with a given data
vector.  When this happens, it may mean that your classes are indeed overlapping in
the underlying feature space.  It could also mean that you simply have not supplied
sufficient training data to the decision tree classifier.  For a tutorial
introduction to how a decision tree is constructed and used, visit
L<https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf>

This module also allows you to generate your own synthetic training and test
data. Generating your own training data, using it for constructing a decision-tree
classifier, and subsequently testing the classifier on a synthetically generated
test set of data is a good way to develop greater proficiency with decision trees.


=head1 WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE

If you are new to the concept of a decision tree, its practical utility is best
understood with an example that involves only symbolic features. However, as
mentioned earlier, versions of the module higher than 2.0 allow you to use both
symbolic and numeric features.

Consider the following scenario: Let's say you are running a small investment company
that employs a team of stockbrokers who make buy/sell decisions for the customers of
your company.  Assume that your company has asked the traders to make each investment
decision on the basis of the following four criteria:

  price_to_earnings_ratio   (P_to_E)

  price_to_sales_ratio      (P_to_S)

  return_on_equity          (R_on_E)

  market_share              (MS)

Since you are the boss, you keep track of the buy/sell decisions made by the
individual traders.  But one unfortunate day, all of your traders decide to quit
because you did not pay them enough.  So what do you do?  If you had a module like
the one here, you could still run your company and do so in such a way that, on the
average, would do better than any of the individual traders who worked for your
company.  This is what you do: You pool together the buy/sell decisions made by the
individual traders over the past year.  This pooled information is


the test data value for such a feature to its closest value in the training data.
The module does that automatically for you for those numeric features for which the
number of different numeric values is less than a user-specified threshold.  For those
numeric features that the module is allowed to treat symbolically, this snapping of
the values of the features in the test data to the small set of values in the training
data is carried out automatically by the module.  That is, after a user has told the
module which numeric features to treat symbolically, the user need not worry about
how the feature values appear in the test data.

The constructor parameter C<symbolic_to_numeric_cardinality_threshold> lets you tell
the module when to treat an otherwise numeric feature symbolically. If you set this
parameter to 10, all numeric-looking features that take 10 or fewer different values
in the training datafile will be treated as symbolic features by the module.  See the
tutorial at
L<https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf> for
further information on the implementation issues related to the symbolic and numeric
features.
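The effect of this threshold can be sketched as follows. This is a standalone illustration of the cardinality test, not the module's internal code:

```perl
use strict;
use warnings;

# Decide whether a numeric-looking feature should be treated symbolically:
# count its distinct values in the training data and compare that count
# against the symbolic_to_numeric_cardinality_threshold.
sub treat_symbolically {
    my ($threshold, @values) = @_;
    my %distinct = map { $_ => 1 } @values;
    return scalar(keys %distinct) <= $threshold ? 1 : 0;
}

# A feature taking only the values 1..5 is treated symbolically under a
# threshold of 10:
my $symbolic = treat_symbolically(10, 1, 2, 3, 4, 5, 2, 3);       # 1
# A feature with 20 distinct values is treated as truly numeric:
my $numeric  = treat_symbolically(10, map { $_ / 7 } 1 .. 20);    # 0
```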

=head1 FEATURES WITH NOT SO "NICE" STATISTICAL PROPERTIES

For the purpose of estimating the probabilities, it is necessary to sample the range
of values taken on by a numerical feature. For features with "nice" statistical
properties, this sampling interval is set to the median of the differences between
the successive feature values in the training data.  (Obviously, as you would expect,
you first sort all the values for a feature before computing the successive
differences.)  This logic will not work for the sort of feature described below.

Consider a feature whose values are heavy-tailed, and, at the same time, the values
span a million to one range.  What I mean by heavy-tailed is that rare values can
occur with significant probabilities.  It could happen that most of the values for
such a feature are clustered at one of the two ends of the range. At the same time,
there may exist a significant number of values near the end of the range that is less
populated.  (Typically, features related to human economic activities --- such as
wealth, incomes, etc. --- are of this type.)  With the logic described in the
previous paragraph, you could end up with a sampling interval that is much too small,
which could result in millions of sampling points for the feature if you are not
careful.
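The default sampling-interval logic, and the way a heavy-tailed value distribution defeats it, can be sketched as follows (a minimal illustration with made-up income values, not the module's code):

```perl
use strict;
use warnings;

# Sampling interval = median of the successive differences of the sorted
# feature values.
sub sampling_interval {
    my @sorted = sort { $a <=> $b } @_;
    my @diffs  = map { $sorted[$_] - $sorted[$_ - 1] } 1 .. $#sorted;
    @diffs = sort { $a <=> $b } @diffs;
    return $diffs[int(@diffs / 2)];
}

# Heavy-tailed values spanning a million-to-one range: most values are
# clustered near the low end, so the median difference is tiny ...
my @incomes  = (1, 2, 3, 4, 5, 6, 7, 8, 1_000_000);
my $interval = sampling_interval(@incomes);                 # 1
# ... and covering the whole range at that interval would require close
# to a million sampling points:
my $points = ($incomes[-1] - $incomes[0]) / $interval;      # 999_999
```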

Beginning with Version 2.22, you have two options in dealing with such features.  You
can choose to go with the default behavior of the module, which is to sample the
value range for such a feature over a maximum of 500 points.  Or, you can supply an
additional option to the constructor that sets a user-defined value for the number of
points to use.  The name of the option is C<number_of_histogram_bins>.  The following
script 

    construct_dt_for_heavytailed.pl 

in the C<Examples> directory shows an example of how to call the constructor of the
module with the C<number_of_histogram_bins> option.


=head1 TESTING THE QUALITY OF YOUR TRAINING DATA

Versions 2.1 and higher include a new class named C<EvalTrainingData>, derived from
the main class C<DecisionTree>, that runs a 10-fold cross-validation test on your
training data to test its ability to discriminate between the classes mentioned in
the training file.

The 10-fold cross-validation test divides all of the training data into ten parts,
with nine parts used for training a decision tree and one part used for testing its
ability to classify correctly. This selection of nine parts for training and one part
for testing is carried out in all of the ten different possible ways.
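The ten train/test splits can be sketched as follows. This standalone snippet mirrors the array slicing done inside C<evaluate_training_data()>, using made-up sample names:

```perl
use strict;
use warnings;

# Partition the sample names into 10 folds; each fold in turn serves as
# the test set while the remaining nine folds form the training set.
my @samples   = map { "sample_$_" } 1 .. 50;
my $fold_size = int(0.1 * @samples);    # 5 samples per fold
my @splits;
for my $fold (0 .. 9) {
    my @testing  = @samples[$fold * $fold_size .. ($fold + 1) * $fold_size - 1];
    my @training = (@samples[0 .. $fold * $fold_size - 1],
                    @samples[($fold + 1) * $fold_size .. $#samples]);
    push @splits, [scalar @testing, scalar @training];
}
# Every split has 5 testing samples and the other 45 samples for training.
```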

The following code fragment illustrates how you invoke the testing function of the
EvalTrainingData class:

    my $training_datafile = "training.csv";                                         
    my $eval_data = EvalTrainingData->new(
                                  training_datafile => $training_datafile,
                                  csv_class_column_index => 1,
                                  csv_columns_for_features => [2,3],
                                  entropy_threshold => 0.01,
                                  max_depth_desired => 3,
                                  symbolic_to_numeric_cardinality_threshold => 10,
                                  csv_cleanup_needed => 1,
                    );
    $eval_data->get_training_data();
    $eval_data->evaluate_training_data();

The last statement above prints out a Confusion Matrix and the value of Training Data
Quality Index on a scale of 0 to 100, with 100 designating perfect training data.
The Confusion Matrix shows how the different classes were mislabeled in the 10-fold
cross-validation test.

This testing functionality can also be used to find the best values to use for the
constructor parameters C<entropy_threshold>, C<max_depth_desired>, and
C<symbolic_to_numeric_cardinality_threshold>.

The following two scripts in the C<Examples> directory illustrate the use of the
C<EvalTrainingData> class for testing the quality of your data:

    evaluate_training_data1.pl
    evaluate_training_data2.pl


=head1 HOW TO MAKE THE BEST CHOICES FOR THE CONSTRUCTOR PARAMETERS

Assuming your training data is good, the quality of the results you get from a
decision tree would depend on the choices you make for the constructor parameters
C<entropy_threshold>, C<max_depth_desired>, and
C<symbolic_to_numeric_cardinality_threshold>.  You can optimize your choices for
these parameters by running the 10-fold cross-validation test that is made available
in Versions 2.2 and higher through the new class C<EvalTrainingData> that is included
in the module file.  A description of how to run this test is in the previous section
of this document.
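A parameter sweep over these three constructor parameters might look like the sketch below. The C<score_parameters()> sub here is a placeholder stub: in real use you would construct an C<EvalTrainingData> object with the given parameters, call C<get_training_data()> and C<evaluate_training_data()>, and record the quality index it reports.

```perl
use strict;
use warnings;

# Placeholder scoring function standing in for a real EvalTrainingData
# run; this dummy formula merely makes the sweep below demonstrable.
sub score_parameters {
    my ($entropy, $depth, $cardinality) = @_;
    return 100 - abs($entropy - 0.01) * 1000 - abs($depth - 5);
}

# Exhaustive sweep over a small grid, keeping the best-scoring triple:
my ($best_score, @best) = (-1);
for my $entropy (0.001, 0.01, 0.1) {
    for my $depth (3 .. 8) {
        for my $cardinality (5, 10, 15) {
            my $score = score_parameters($entropy, $depth, $cardinality);
            ($best_score, @best) = ($score, $entropy, $depth, $cardinality)
                if $score > $best_score;
        }
    }
}
# @best now holds the winning (entropy_threshold, max_depth_desired,
# symbolic_to_numeric_cardinality_threshold) triple.
```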


=head1 DECISION TREE INTROSPECTION

Starting with Version 2.30, you can ask the C<DTIntrospection> class of the module to
explain the classification decisions made at the different nodes of the decision
tree.

Perhaps the most important bit of information you are likely to seek through DT
introspection is the list of the training samples that fall directly in the portion
of the feature space that is assigned to a node.

However, note that, when training samples are non-uniformly distributed in the
underlying feature space, it is possible for a node to exist even when there are no
training samples in the portion of the feature space assigned to the node.  That is
because the decision tree is constructed from the probability densities estimated


=item B<get_training_data_for_bagging():>

This method reads your training datafile, randomizes it, and then partitions it into
the specified number of bags.  Subsequently, if the constructor parameter
C<bag_overlap_fraction> is non-zero, it adds to each bag additional samples drawn at
random from the other bags.  The number of these additional samples added to each bag
is controlled by the constructor parameter C<bag_overlap_fraction>.  If this
parameter is set to, say, 0.2, the size of each bag will grow by 20% with the samples
drawn from the other bags.

=item B<show_training_data_in_bags():>

Shows for each bag the names of the training data samples in that bag.

=item B<calculate_first_order_probabilities():>

Calls on the appropriate methods of the main C<DecisionTree> class to estimate the
first-order probabilities from the data samples in each bag.

=item B<calculate_class_priors():>

Calls on the appropriate method of the main C<DecisionTree> class to estimate the
class priors for the data samples in each bag.

=item B<construct_decision_trees_for_bags():>

Calls on the appropriate method of the main C<DecisionTree> class to construct a
decision tree from the training data in each bag.

=item B<display_decision_trees_for_bags():>

Displays separately the decision tree constructed for each bag.

=item B<classify_with_bagging( test_sample ):>

Calls on the appropriate methods of the main C<DecisionTree> class to classify the
argument test sample.

=item B<display_classification_results_for_each_bag():>

Displays separately the classification decision made by the decision tree
constructed for each bag.

=item B<get_majority_vote_classification():>

Using majority voting, this method aggregates the classification decisions made by
the individual decision trees into a single decision.

=back

See the example scripts in the directory C<bagging_examples> for how to call these
methods for classifying individual samples and for bulk classification when you place
all your test samples in a single file.
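The majority-vote aggregation performed by C<get_majority_vote_classification()> can be sketched as follows (a standalone illustration; the actual module's tie-breaking behavior may differ):

```perl
use strict;
use warnings;

# Each bag's decision tree casts a vote in the form of a class label;
# the label with the most votes wins.  Ties are broken here
# alphabetically, purely for determinism in this demo.
sub majority_vote {
    my @decisions = @_;
    my %votes;
    $votes{$_}++ for @decisions;
    my ($winner) = sort { $votes{$b} <=> $votes{$a} or $a cmp $b } keys %votes;
    return $winner;
}

# Five bags vote: three say "buy", two say "sell":
my $label = majority_vote(qw(buy sell buy buy sell));   # "buy"
```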

=head1 USING BOOSTING

Starting with Version 3.20, you can use the class C<BoostedDecisionTree> for
constructing a boosted decision-tree classifier.  Boosting results in a cascade of
decision trees in which each decision tree is constructed with samples that are
mostly those that are misclassified by the previous decision tree.  To be precise,
you create a probability distribution over the training samples for the selection of
samples for training each decision tree in the cascade.  To start out, the
distribution is uniform over all of the samples. Subsequently, this probability
distribution changes according to the misclassifications by each tree in the cascade:
if a sample is misclassified by a given tree in the cascade, the probability of its
being selected for training the next tree is increased significantly.  You also
associate a trust factor with each decision tree depending on its power to classify
correctly all of the training data samples.  After a cascade of decision trees is
constructed in this manner, you construct a final classifier that calculates the
class label for a test data sample by taking into account the classification
decisions made by each individual tree in the cascade, the decisions being weighted
by the trust factors associated with the individual classifiers.  These boosting
notions --- generally referred to as the AdaBoost algorithm --- are based on a now
celebrated paper "A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting" by Yoav Freund and Robert Schapire that appeared in 1995 in
the Proceedings of the 2nd European Conf. on Computational Learning Theory.  For a
tutorial introduction to AdaBoost, see L<https://engineering.purdue.edu/kak/Tutorials/AdaBoost.pdf>

Keep in mind the fact that, ordinarily, the theoretical guarantees provided by
boosting apply only to the case of binary classification.  Additionally, your
training dataset must capture all of the significant statistical variations in the
classes represented therein.
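The weight-update and trust-factor logic described above follows the standard AdaBoost formulas. A minimal self-contained sketch of one boosting round (not the module's own code):

```perl
use strict;
use warnings;

# One AdaBoost round: given per-sample selection weights and a parallel
# 0/1 array marking which samples the current tree misclassified, compute
# the tree's trust factor alpha = 0.5 * ln((1 - err) / err) and re-weight
# the samples so that misclassified ones are more likely to be drawn for
# training the next tree in the cascade.
sub adaboost_round {
    my ($weights, $misclassified) = @_;    # array refs, parallel indices
    my $err = 0;
    $err += $weights->[$_] for grep { $misclassified->[$_] } 0 .. $#$weights;
    my $alpha = 0.5 * log((1 - $err) / $err);
    my @new = map {
        $weights->[$_] * exp($misclassified->[$_] ? $alpha : -$alpha)
    } 0 .. $#$weights;
    my $z = 0;
    $z += $_ for @new;
    $_ /= $z for @new;                     # renormalize to a distribution
    return ($alpha, \@new);
}

# Four equally weighted samples with one misclassification: err = 0.25,
# and the misclassified sample's weight rises from 0.25 to 0.5:
my ($alpha, $w) = adaboost_round([0.25, 0.25, 0.25, 0.25], [0, 0, 1, 0]);
```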

=over 4

=item B<Calling the BoostedDecisionTree constructor:>

If you'd like to experiment with boosting, a typical call to the constructor for the
C<BoostedDecisionTree> class looks like: 

    use Algorithm::BoostedDecisionTree;
    my $training_datafile = "training6.csv";
    my $boosted = Algorithm::BoostedDecisionTree->new(
                              training_datafile => $training_datafile,
                              csv_class_column_index => 1,
                              csv_columns_for_features => [2,3],
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
                              how_many_stages => 4,
                              csv_cleanup_needed => 1,
                  );

Note in particular the constructor parameter:
    
    how_many_stages

As its name implies, this parameter controls how many stages will be used in the
boosted decision tree classifier.  As mentioned above, a separate decision tree is
constructed for each stage of boosting using a set of training samples that are drawn
through a probability distribution maintained over the entire training dataset.

=back

=head2 B<Methods defined for C<BoostedDecisionTree> class>

=over 8

=item B<get_training_data_for_base_tree():>

This method reads your training datafile and creates, from the ingested data, the
data structures needed for constructing the base decision tree.

=item B<show_training_data_for_base_tree():>

Writes to the standard output the training data samples and also some relevant


probability distribution over the entire training dataset. This probability
distribution for any given tree in the cascade is heavily influenced by which
training samples are misclassified by the previous tree.

=item B<display_decision_trees_for_different_stages():>

Displays separately in your terminal window the decision tree constructed for each
stage of the cascade. (The textual form of the trees is written out to the standard
output.)

=item B<classify_with_boosting( $test_sample ):>

Calls on each decision tree in the cascade to classify the argument C<$test_sample>.

=item B<display_classification_results_for_each_stage():>

You can call this method to display in your terminal window the classification
decision made by each decision tree in the cascade.  The method also prints out the
trust factor associated with each decision tree.  It is important to look
simultaneously at the classification decision and the trust factor for each tree ---
since a classification decision made by a specific tree may appear bizarre for a
given test sample.  This method is useful primarily for debugging purposes.

=item B<show_class_labels_for_misclassified_samples_in_stage( $stage_index ):>

As with the previous method, this method is useful mostly for debugging. It returns
class labels for the samples misclassified by the stage whose integer index is
supplied as an argument to the method.  Say you have 10 stages in your cascade.  The
value of the argument C<stage_index> would go from 0 to 9, with 0 corresponding to
the base tree.

=item B<trust_weighted_majority_vote_classifier():>

Uses the "final classifier" formula of the AdaBoost algorithm to pool together the
classification decisions made by the individual trees while taking into account the
trust factors associated with the trees.  As mentioned earlier, we associate with
each tree of the cascade a trust factor that depends on the overall misclassification
rate associated with that tree.

=back

See the example scripts in the C<ExamplesBoosting> subdirectory for how to call the
methods listed above for classifying individual data samples with boosting and for
bulk classification when you place all your test samples in a single file.
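As one illustration of the pooling idea behind
C<trust_weighted_majority_vote_classifier()>, here is a pure-Perl sketch of
AdaBoost's final-classifier formula.  The per-stage decisions and trust factors
below are hypothetical, and this is not the module's own code.

```perl
#!/usr/bin/perl
# Trust-weighted pooling of per-stage decisions for one test sample.
# Decisions are encoded as +1/-1 for a two-class problem; the values
# here are made up for illustration.
use strict;
use warnings;

my @decisions     = ( +1, -1, +1, +1 );   # one decision per stage
my @trust_factors = ( 0.9, 0.4, 0.6, 0.2 );

my $pooled = 0;
$pooled += $trust_factors[$_] * $decisions[$_] for 0 .. $#decisions;

my $final_class = $pooled >= 0 ? '+1' : '-1';
printf "pooled score = %.2f => class %s\n", $pooled, $final_class;
```

Note that stage 1's dissenting vote is outvoted not by head count but by the larger
trust factors of the agreeing stages.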


=head1 USING RANDOMIZED DECISION TREES

Consider the following two situations that call for using randomized decision trees,
meaning multiple decision trees that are trained using data extracted randomly from a
large database of training samples: 

(1) Consider a two-class problem for which the training database is grossly
imbalanced in how many majority-class samples it contains vis-a-vis the number of
minority class samples.  Let's assume for a moment that the ratio of majority class
samples to minority class samples is 1000 to 1.  Let's also assume that you have a
test dataset that is drawn randomly from the same population mixture from which the
training database was created.  Now consider a naive classifier that assigns every
sample to the majority class.  If you measure the classification accuracy rate as
the ratio of the number of correctly classified samples to the total number of test
samples drawn randomly from the population, this classifier would achieve an
accuracy of roughly 99.9% while being useless for recognizing the minority class.
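A quick back-of-the-envelope computation shows why raw accuracy is a misleading
metric in this situation; this standalone snippet is only for illustration:

```perl
#!/usr/bin/perl
# Accuracy of a classifier that always answers "majority class" on a
# dataset with a 1000-to-1 class imbalance: right 1000 times out of 1001.
use strict;
use warnings;

my ($majority, $minority) = (1000, 1);
my $accuracy = $majority / ($majority + $minority);
printf "accuracy = %.2f%%\n", 100 * $accuracy;   # ~99.90%, yet it never finds a needle
```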

(2) Let's now consider another situation in which we are faced with a huge training
database but in which every class is equally well represented.  Feeding all the data
into a single decision tree would be akin to polling all of the population of the
United States for measuring the Coke-versus-Pepsi preference in the country.  You are
likely to get better results if you construct multiple decision trees, each trained
with a collection of training samples drawn randomly from the training database.
After you have created all the decision trees, your final classification decision
could then be based on, say, majority voting by the trees.
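The majority-voting step mentioned above can be sketched in a few lines of plain
Perl; the class labels and vote counts here are hypothetical:

```perl
#!/usr/bin/perl
# Simple (unweighted) majority vote over the class labels returned by an
# ensemble of trees for one test sample.
use strict;
use warnings;

my @tree_votes = qw(benign malignant benign benign malignant);

my %tally;
$tally{$_}++ for @tree_votes;
# Highest vote count wins; ties broken alphabetically for determinism:
my ($winner) = sort { $tally{$b} <=> $tally{$a} || $a cmp $b } keys %tally;
print "majority decision: $winner\n";
```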

In summary, the C<RandomizedTreesForBigData> class allows you to solve the following
two problems: (1) Data classification using the needle-in-a-haystack metaphor, that
is, when a vast majority of your training samples belong to just one class.  And (2)
You have access to a very large database of training samples and you wish to
construct an ensemble of decision trees for classification.

=over 4

=item B<Calling the RandomizedTreesForBigData constructor:>

Here is how you'd call the C<RandomizedTreesForBigData> constructor for
needle-in-a-haystack classification:

    use Algorithm::RandomizedTreesForBigData;
    my $training_datafile = "your_database.csv";
    my $rt = Algorithm::RandomizedTreesForBigData->new(
                              training_datafile => $training_datafile,
                              csv_class_column_index => 48,
                              csv_columns_for_features => [24,32,33,34,41],
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
                              how_many_trees => 5,
                              looking_for_needles_in_haystack => 1,
                              csv_cleanup_needed => 1,
             );

Note in particular the constructor parameters:

    looking_for_needles_in_haystack
    how_many_trees

The first of these parameters, C<looking_for_needles_in_haystack>, invokes the logic for
constructing an ensemble of decision trees, each based on a training dataset that
uses all of the minority class samples, and a random drawing from the majority class
samples.
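The sampling idea can be sketched as follows in plain Perl.  This is illustrative
only, not the module's internal code; in particular, drawing exactly as many
majority-class samples as there are minority-class samples is an assumption made
for this sketch.

```perl
#!/usr/bin/perl
# Sketch of needle-in-a-haystack sampling: every tree sees ALL the
# minority-class samples plus a fresh random draw from the majority class.
use strict;
use warnings;
use List::Util qw(shuffle);

my @minority = map { "needle_$_" } 1 .. 5;
my @majority = map { "hay_$_"    } 1 .. 1000;

my $how_many_trees = 3;
my $last_size;
for my $t (1 .. $how_many_trees) {
    # Draw as many majority samples as there are minority samples:
    my @drawn = (shuffle @majority)[0 .. $#minority];
    my @training_set = (@minority, @drawn);
    $last_size = @training_set;
    printf "tree %d trains on %d samples (%d needles, %d hay)\n",
           $t, scalar @training_set, scalar @minority, scalar @drawn;
}
```

Each tree therefore trains on a balanced dataset even though the underlying
database is grossly imbalanced.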

Here is how you'd call the C<RandomizedTreesForBigData> constructor for a more
general attempt at constructing an ensemble of decision trees, with each tree trained
with randomly drawn samples from a large database of training data (without paying
attention to the differences in the sizes of the populations for the different
classes):

    use Algorithm::RandomizedTreesForBigData;
    my $training_datafile = "your_database.csv";
    my $rt = Algorithm::RandomizedTreesForBigData->new(
                              training_datafile => $training_datafile,
                              csv_class_column_index => 2,
                              csv_columns_for_features => [3,4,5,6,7,8],
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
                              how_many_trees => 5,
                              csv_cleanup_needed => 1,
             );


