Algorithm-DecisionTree
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
=over 8
=item B<get_training_data_for_bagging():>
This method reads your training datafile, randomizes it, and then partitions it into
the specified number of bags. Subsequently, if the constructor parameter
C<bag_overlap_fraction> is non-zero, it adds to each bag additional samples drawn at
random from the other bags. If this parameter is set to, say, 0.2, the size of each
bag will grow by 20% with samples drawn from the other bags.
=item B<show_training_data_in_bags():>
Shows the names of the training data samples in each bag.
=item B<calculate_first_order_probabilities():>
Calls on the appropriate methods of the main C<DecisionTree> class to estimate the
first-order probabilities from the data samples in each bag.
=item B<calculate_class_priors():>
Calls on the appropriate method of the main C<DecisionTree> class to estimate the
class priors for the data samples in each bag.
=item B<construct_decision_trees_for_bags():>
Calls on the appropriate method of the main C<DecisionTree> class to construct a
decision tree from the training data in each bag.
=item B<display_decision_trees_for_bags():>
Displays separately the decision tree constructed for each bag.
=item B<classify_with_bagging( $test_sample ):>
Calls on the appropriate methods of the main C<DecisionTree> class to classify the
argument C<$test_sample>.
=item B<display_classification_results_for_each_bag():>
Displays separately the classification decision made by the decision tree
constructed for each bag.
=item B<get_majority_vote_classification():>
Using majority voting, this method aggregates the classification decisions made by
the individual decision trees into a single decision.
=back
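The bag-construction logic described above can be sketched compactly. What follows is an illustrative Python sketch (not code from this Perl module) of partitioning samples into bags and then growing each bag according to C<bag_overlap_fraction>:

```python
import random

def make_bags(samples, how_many_bags, bag_overlap_fraction, seed=0):
    """Randomize the samples, partition them into bags, then grow each
    bag with samples drawn at random from the other bags."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    # Partition the shuffled samples into roughly equal bags.
    bags = [shuffled[i::how_many_bags] for i in range(how_many_bags)]
    if bag_overlap_fraction:
        originals = [bag[:] for bag in bags]   # sample extras only from the original bags
        for i, bag in enumerate(bags):
            others = [s for j, other in enumerate(originals) if j != i for s in other]
            extra = int(len(bag) * bag_overlap_fraction)
            bag.extend(rng.sample(others, extra))
    return bags

bags = make_bags(list(range(100)), how_many_bags=4, bag_overlap_fraction=0.2)
print([len(bag) for bag in bags])   # [30, 30, 30, 30]: each bag grew from 25 to 30
```

With an overlap fraction of 0.2, each of the four bags of 25 samples picks up 5 additional samples from the other bags, matching the 20% growth described above.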
See the example scripts in the directory C<ExamplesBagging> for how to call these
methods for classifying individual samples and for bulk classification when you place
all your test samples in a single file.
=head1 USING BOOSTING
Starting with Version 3.20, you can use the class C<BoostedDecisionTree> for
constructing a boosted decision-tree classifier. Boosting results in a cascade of
decision trees in which each decision tree is constructed with samples that are
mostly those that are misclassified by the previous decision tree. To be precise,
you create a probability distribution over the training samples for the selection of
samples for training each decision tree in the cascade. To start out, the
distribution is uniform over all of the samples. Subsequently, this probability
distribution changes according to the misclassifications by each tree in the cascade:
if a sample is misclassified by a given tree in the cascade, the probability of its
being selected for training the next tree is increased significantly. You also
associate a trust factor with each decision tree depending on its power to classify
correctly all of the training data samples. After a cascade of decision trees is
constructed in this manner, you construct a final classifier that calculates the
class label for a test data sample by taking into account the classification
decisions made by each individual tree in the cascade, the decisions being weighted
by the trust factors associated with the individual classifiers. These boosting
notions --- generally referred to as the AdaBoost algorithm --- are based on a now
celebrated paper "A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting" by Yoav Freund and Robert Schapire that appeared in 1995 in
the Proceedings of the 2nd European Conf. on Computational Learning Theory. For a
tutorial introduction to AdaBoost, see L<https://engineering.purdue.edu/kak/Tutorials/AdaBoost.pdf>.
Keep in mind that, ordinarily, the theoretical guarantees provided by
boosting apply only to the case of binary classification. Additionally, your
training dataset must capture all of the significant statistical variations in the
classes represented therein.
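The weight-update and trust-factor computations described above are the heart of AdaBoost. Here is one stage of the update for binary labels in {-1, +1}, as an illustrative Python sketch of the generic algorithm, not code from this module:

```python
import math

def adaboost_update(weights, labels, predictions):
    """One AdaBoost stage: compute a trust factor (alpha) from the stage's
    weighted error, then boost the weights of the misclassified samples."""
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    alpha = 0.5 * math.log((1.0 - error) / error)      # the tree's trust factor
    new_weights = [w * math.exp(-alpha * y * p)
                   for w, y, p in zip(weights, labels, predictions)]
    total = sum(new_weights)                           # renormalize to a distribution
    return [w / total for w in new_weights], alpha

# Start with the uniform distribution; the last of four samples is misclassified:
weights, alpha = adaboost_update([0.25] * 4, [+1, +1, -1, -1], [+1, +1, -1, +1])
print(round(alpha, 3))                  # 0.549
print([round(w, 3) for w in weights])   # [0.167, 0.167, 0.167, 0.5]
```

Note how the single misclassified sample ends up carrying half of the probability mass, which makes it far more likely to be selected for training the next tree in the cascade.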
=over 4
=item B<Calling the BoostedDecisionTree constructor:>
If you'd like to experiment with boosting, a typical call to the constructor for the
C<BoostedDecisionTree> class looks like:
use Algorithm::BoostedDecisionTree;
my $training_datafile = "training6.csv";
my $boosted = Algorithm::BoostedDecisionTree->new(
training_datafile => $training_datafile,
csv_class_column_index => 1,
csv_columns_for_features => [2,3],
entropy_threshold => 0.01,
max_depth_desired => 8,
symbolic_to_numeric_cardinality_threshold => 10,
how_many_stages => 4,
csv_cleanup_needed => 1,
);
Note in particular the constructor parameter:
how_many_stages
As its name implies, this parameter controls how many stages will be used in the
boosted decision tree classifier. As mentioned above, a separate decision tree is
constructed for each stage of boosting using a set of training samples that are drawn
through a probability distribution maintained over the entire training dataset.
=back
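The per-stage sample selection mentioned above amounts to weighted sampling with replacement from the training dataset. As an illustration (in Python, not this module's code), suppose one of four samples was misclassified and now carries half the probability mass:

```python
import random

def draw_training_samples(samples, weights, how_many, seed=0):
    """Draw (with replacement) training samples for one boosting stage,
    following the probability distribution maintained over the dataset."""
    rng = random.Random(seed)
    return rng.choices(samples, weights=weights, k=how_many)

# The misclassified sample 's3' carries half the probability mass, so it
# shows up far more often in the next stage's training set:
picked = draw_training_samples(['s0', 's1', 's2', 's3'],
                               [1/6, 1/6, 1/6, 1/2], how_many=600)
print(picked.count('s3'))   # close to 300 of the 600 draws
```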
=head2 B<Methods defined for C<BoostedDecisionTree> class>
=over 8
=item B<get_training_data_for_base_tree():>
This method reads your training datafile and creates the data structures needed
for constructing the base decision tree.
=item B<show_training_data_for_base_tree():>
Writes to the standard output the training data samples and also some relevant
properties of the features used in the training dataset.
=item B<calculate_first_order_probabilities_and_class_priors():>
Calls on the appropriate methods of the main C<DecisionTree> class to estimate the
first-order probabilities and the class priors.
=item B<construct_base_decision_tree():>
Calls on the appropriate method of the main C<DecisionTree> class to construct the
base decision tree.
=item B<display_base_decision_tree():>
Displays the base decision tree in your terminal window. (The textual form of the
decision tree is written out to the standard output.)
=item B<construct_cascade_of_trees():>
Uses the AdaBoost algorithm to construct a cascade of decision trees. As mentioned
earlier, the training samples for each tree in the cascade are drawn using a
probability distribution over the entire training dataset. This probability
distribution for any given tree in the cascade is heavily influenced by which
training samples are misclassified by the previous tree.
=item B<display_decision_trees_for_different_stages():>
Displays separately in your terminal window the decision tree constructed for each
stage of the cascade. (The textual form of the trees is written out to the standard
output.)
=item B<classify_with_boosting( $test_sample ):>
Calls on each decision tree in the cascade to classify the argument C<$test_sample>.
=item B<display_classification_results_for_each_stage():>
You can call this method to display in your terminal window the classification
decision made by each decision tree in the cascade. The method also prints out the
trust factor associated with each decision tree. It is important to look
simultaneously at the classification decision and the trust factor for each tree ---
since a classification decision made by a specific tree may appear bizarre for a
given test sample. This method is useful primarily for debugging purposes.
=item B<show_class_labels_for_misclassified_samples_in_stage( $stage_index ):>
As with the previous method, this method is useful mostly for debugging. It returns
class labels for the samples misclassified by the stage whose integer index is
supplied as an argument to the method. Say you have 10 stages in your cascade. The
value of the argument C<$stage_index> would go from 0 to 9, with 0 corresponding to
the base tree.
=item B<trust_weighted_majority_vote_classifier():>
Uses the "final classifier" formula of the AdaBoost algorithm to pool together the
classification decisions made by the individual trees while taking into account the
trust factors associated with the trees. As mentioned earlier, we associate with
each tree of the cascade a trust factor that depends on the overall misclassification
rate associated with that tree.
=back
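The trust-weighted pooling performed by C<trust_weighted_majority_vote_classifier()> can be illustrated in a few lines; this is a generic Python sketch of trust-weighted voting, not this module's code:

```python
def trust_weighted_vote(stage_decisions, trust_factors):
    """Pool per-stage class labels, weighting each stage's vote by its trust factor."""
    votes = {}
    for label, trust in zip(stage_decisions, trust_factors):
        votes[label] = votes.get(label, 0.0) + trust
    return max(votes, key=votes.get)

# Three low-trust stages vote one way, one high-trust stage votes the other
# (class labels here are made up for the demonstration):
decisions = ['benign', 'benign', 'malignant', 'benign']
trusts    = [0.2, 0.3, 1.8, 0.4]
print(trust_weighted_vote(decisions, trusts))   # malignant (1.8 outweighs 0.9)
```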
See the example scripts in the C<ExamplesBoosting> subdirectory for how to call the
methods listed above for classifying individual data samples with boosting and for
bulk classification when you place all your test samples in a single file.
=head1 USING RANDOMIZED DECISION TREES
Consider the following two situations that call for using randomized decision trees,
meaning multiple decision trees that are trained using data extracted randomly from a
large database of training samples:
(1) Consider a two-class problem for which the training database is grossly
imbalanced in how many majority-class samples it contains vis-a-vis the number of
minority class samples. Let's assume for a moment that the ratio of majority class
samples to minority class samples is 1000 to 1. Let's also assume that you have a
test dataset that is drawn randomly from the same population mixture from which the
training database was created. Now consider a stupid data classification program
that classifies everything as belonging to the majority class. If you measure the
classification accuracy rate as the ratio of the number of samples correctly
classified to the total number of test samples selected randomly from the population,
this classifier would work with an accuracy of roughly 99.9% (1000 correct out of
every 1001 samples).
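The arithmetic behind that accuracy figure is easy to verify:

```python
# With a 1000-to-1 imbalance, a classifier that always answers with the
# majority class is correct on 1000 of every 1001 randomly drawn samples:
majority, minority = 1000, 1
accuracy = majority / (majority + minority)
print(f"{accuracy:.1%}")   # 99.9%, although the classifier has learned nothing
```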
(2) Let's now consider another situation in which we are faced with a huge training
database but in which every class is equally well represented. Feeding all the data
into a single decision tree would be akin to polling all of the population of the
United States for measuring the Coke-versus-Pepsi preference in the country. You are
likely to get better results if you construct multiple decision trees, each trained
with a collection of training samples drawn randomly from the training database.
After you have created all the decision trees, your final classification decision
could then be based on, say, majority voting by the trees.
In summary, the C<RandomizedTreesForBigData> class allows you to solve the following
two problems: (1) Data classification using the needle-in-a-haystack metaphor, that
is, when a vast majority of your training samples belong to just one class. And (2)
You have access to a very large database of training samples and you wish to
construct an ensemble of decision trees for classification.
=over 4
=item B<Calling the RandomizedTreesForBigData constructor:>
Here is how you'd call the C<RandomizedTreesForBigData> constructor for
needle-in-a-haystack classification:
use Algorithm::RandomizedTreesForBigData;
my $training_datafile = "your_database.csv";
my $rt = Algorithm::RandomizedTreesForBigData->new(
training_datafile => $training_datafile,
csv_class_column_index => 48,
csv_columns_for_features => [24,32,33,34,41],
entropy_threshold => 0.01,
max_depth_desired => 8,
symbolic_to_numeric_cardinality_threshold => 10,
how_many_trees => 5,
looking_for_needles_in_haystack => 1,
csv_cleanup_needed => 1,
);
of the feature space assigned that node. (As mentioned elsewhere in this
documentation, when this list is empty for a node, that means the node is a result of
the generalization achieved by probabilistic modeling of the data. Note that this
module constructs a decision tree NOT by partitioning the set of training samples,
BUT by partitioning the domains of the probability density functions.) The third
script listed above also generates a tabular display, but one that shows how the
influence of each training sample propagates in the tree. This display first shows
the list of nodes that are affected directly by the data in a training sample. This
list is followed by an indented display of the nodes that are affected indirectly by
the training sample. A training sample affects a node indirectly if the node is a
descendant of one of the nodes affected directly.
The latest addition to the Examples directory is the script:
get_indexes_associated_with_fields.pl
As to why you may find this script useful, note that large database files may have
hundreds of fields and it is not always easy to figure out what numerical index is
associated with a given field. At the same time, the constructor of the DecisionTree
module requires that the field that holds the class label and the fields that contain
the feature values be specified by their numerical zero-based indexes. If you have a
large database and you are faced with this problem, you can run this script to see
the zero-based numerical index values associated with the different columns of your
CSV file.
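The idea behind such a script is simple: read the header row of the CSV file and print each field name alongside its zero-based index. Here is an illustrative Python sketch (the field names below are made up for the demonstration):

```python
import csv
import io

def field_indexes(csv_text):
    """Return (zero_based_index, field_name) pairs for a CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return list(enumerate(header))

for index, field in field_indexes('"name","class","feature1","feature2"\nsample1,1,3.2,4.4\n'):
    print(index, field)
# 0 name
# 1 class
# 2 feature1
# 3 feature2
```

The indexes printed this way are exactly the values you would hand to the constructor parameters C<csv_class_column_index> and C<csv_columns_for_features>.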
=head1 THE C<ExamplesBagging> DIRECTORY
The C<ExamplesBagging> directory contains the following scripts:
bagging_for_classifying_one_test_sample.pl
bagging_for_bulk_classification.pl
As the names of the scripts imply, the first shows how to call the different methods
of the C<DecisionTreeWithBagging> class for classifying a single test sample. When
you are classifying a single test sample, you can also see how each bag is
classifying the test sample. You can, for example, display the training data used in
each bag, the decision tree constructed for each bag, etc.
The second script is for the case when you place all of the test samples in a single
file. The demonstration script displays for each test sample a single aggregate
classification decision that is obtained through majority voting by all the decision
trees.
=head1 THE C<ExamplesBoosting> DIRECTORY
The C<ExamplesBoosting> subdirectory in the main installation directory contains the
following three scripts:
boosting_for_classifying_one_test_sample_1.pl
boosting_for_classifying_one_test_sample_2.pl
boosting_for_bulk_classification.pl
As the names of the first two scripts imply, these show how to call the different
methods of the C<BoostedDecisionTree> class for classifying a single test sample.
When you are classifying a single test sample, you can see how each stage of the
cascade of decision trees is classifying the test sample. You can also view each
decision tree separately and also see the trust factor associated with the tree.
The third script is for the case when you place all of the test samples in a single
file. The demonstration script outputs for each test sample a single aggregate
classification decision that is obtained through trust-factor weighted majority
voting by all the decision trees.
=head1 THE C<ExamplesRandomizedTrees> DIRECTORY
The C<ExamplesRandomizedTrees> directory contains example scripts that you can use to
become more familiar with the C<RandomizedTreesForBigData> class for solving
needle-in-a-haystack and big-data data classification problems. These scripts are:
randomized_trees_for_classifying_one_test_sample_1.pl
randomized_trees_for_classifying_one_test_sample_2.pl
classify_database_records.pl
The first script shows the constructor options to use for solving a
needle-in-a-haystack problem --- that is, a problem in which a vast majority of the
training data belongs to just one class. The second script shows the constructor
options for using randomized decision trees for the case when you have access to a
very large database of training samples and you'd like to construct an ensemble of
decision trees using training samples pulled randomly from the training database.
The last script illustrates how you can evaluate the classification power of an
ensemble of decision trees as constructed by C<RandomizedTreesForBigData> by classifying
a large number of test samples extracted randomly from the training database.
=head1 THE C<ExamplesRegression> DIRECTORY
The C<ExamplesRegression> subdirectory in the main installation directory contains
example scripts that you can use to become familiar with regression trees and how
they can be used for nonlinear regression. If you are new to the concept of
regression trees, start by executing the following scripts without changing them and
see what sort of output is produced by them:
regression4.pl
regression5.pl
regression6.pl
regression8.pl
The C<regression4.pl> script involves only one predictor variable and one dependent
variable. The training data for this exercise is drawn from the file C<gendata4.csv>.
This data file contains strongly nonlinear data. When you run the script
C<regression4.pl>, you will see how much better the result from tree regression is
compared to what you can get with linear regression.
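To see why a regression tree can beat a single straight-line fit on strongly nonlinear data, consider step-shaped data that no line can follow: even a one-split regression "tree" drives the squared error to zero. This is a generic Python illustration and has nothing to do with the C<gendata4.csv> file:

```python
# Step-shaped data that no single straight line can fit well:
xs = list(range(10))
ys = [0 if x < 5 else 10 for x in xs]

# Ordinary least-squares straight line through the data.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
line = [my + slope * (x - mx) for x in xs]

# A one-split regression "tree": predict the mean of each half.
left  = [y for x, y in zip(xs, ys) if x < 5]
right = [y for x, y in zip(xs, ys) if x >= 5]
stump = [sum(left) / len(left) if x < 5 else sum(right) / len(right) for x in xs]

def sse(preds):
    """Sum of squared errors of the predictions against ys."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))

print(round(sse(line), 1), '>', sse(stump))   # 60.6 > 0.0
```

Real regression trees, of course, choose the split point by minimizing the error rather than knowing it in advance, and recurse on each side; but the comparison above captures why they shine on nonlinear data.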
The C<regression5.pl> script is essentially the same as the previous script except
for the fact that the training datafile used in this case, C<gendata5.csv>, consists
of three noisy segments, as opposed to just two in the previous case.
The script C<regression6.pl> deals with the case when we have two predictor variables
and one dependent variable. You can think of the data as consisting of noisy height
values over an C<(x1,x2)> plane. The data used in this script is drawn from the csv
file C<gen3Ddata1.csv>.