Algorithm-DecisionTree
view release on metacpan or search on metacpan
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
B<Version 1.6> uses probability caching much more extensively compared to the previous
versions. This should result in faster construction of large decision trees.
Another new feature in Version 1.6 is the use of a decision tree for interactive
classification. In this mode, after you have constructed a decision tree from the
training data, the user is prompted for answers to the questions pertaining to the
feature tests at the nodes of the tree.
Some key elements of the documentation were cleaned up and made more readable in
B<Version 1.41>. The implementation code remains unchanged from Version 1.4.
B<Version 1.4> should make things faster (and easier) for folks who want to use this
module with training data that creates very large decision trees (that is, trees with
tens of thousands or more decision nodes). The speedup in Version 1.4 has been
achieved by eliminating duplicate calculation of probabilities as the tree grows. In
addition, this version provides an additional constructor parameter,
C<max_depth_desired> for controlling the size of the decision tree. This is in
addition to the tree size control achieved by the parameter C<entropy_threshold> that
was introduced in Version 1.3. Since large decision trees can take a long time to
create, you may find yourself wishing you could store the tree you just created in a
disk file and that, subsequently, you could use the stored tree for classification
work. The C<Examples> directory contains two scripts, C<store_dt_on_disk.pl> and
C<classify_from_disk_stored_dt.pl>, that show how you can do exactly that with the
help of Perl's C<Storable> module.
B<Version 1.3> addresses the issue that arises when the header of a training datafile
declares a certain possible value for a feature but that (feature,value) pair does
NOT show up anywhere in the training data. Version 1.3 also makes it possible for a
user to control the size of the decision tree by changing the value of the parameter
C<entropy_threshold.> Additionally, Version 1.3 includes a method called
C<determine_data_condition()> that displays useful information regarding the size and
some other attributes of the training data. It also warns the user if the training
data might result in a decision tree that would simply be much too large --- unless
the user controls the size with the entropy_threshold parameter.
In addition to the removal of a couple of serious bugs, B<version 1.2> incorporates a
number of enhancements: (1) Version 1.2 includes checks on the names of the features
and values used in test data --- this is the data you want to classify with the
decision tree classifier constructed by this module. (2) Version 1.2 includes a
separate constructor for generating test data. To make it easier to generate test
data whose probabilistic parameters may not be identical to that used for the
training data, I have used separate routines for generating the test data. (3)
Version 1.2 also includes in its examples directory a script that classifies the test
data in a file and outputs the class labels into another file. This is for folks who
do not wish to write their own scripts using this module. (4) Version 1.2 also
includes addition to the documentation regarding the issue of numeric values for
features.
=head1 DESCRIPTION
B<Algorithm::DecisionTree> is a I<perl5> module for constructing a decision tree from
a training datafile containing multidimensional data. In one form or another,
decision trees have been around for about fifty years. From a statistical
perspective, they are closely related to classification and regression by recursive
partitioning of multidimensional data. Early work that demonstrated the usefulness
of such partitioning of data for classification and regression can be traced to the
work of Terry Therneau in the early 1980's in the statistics community, and to the
work of Ross Quinlan in the mid 1990's in the machine learning community.
For those not familiar with decision tree ideas, the traditional way to classify
multidimensional data is to start with a feature space whose dimensionality is the
same as that of the data. Each feature in this space corresponds to the attribute
that each dimension of the data measures. You then use the training data to carve up
the feature space into different regions, each corresponding to a different class.
Subsequently, when you try to classify a new data sample, you locate it in the
feature space and find the class label of the region to which it belongs. One can
also give the new data point the same class label as that of the nearest training
sample. This is referred to as the nearest neighbor classification. There exist
hundreds of variations of varying power on these two basic approaches to the
classification of multidimensional data.
A decision tree classifier works differently. When you construct a decision tree,
you select for the root node a feature test that partitions the training data in a
way that causes maximal disambiguation of the class labels associated with the data.
In terms of information content as measured by entropy, such a feature test would
cause maximum reduction in class entropy in going from all of the training data taken
together to the data as partitioned by the feature test. You then drop from the root
node a set of child nodes, one for each partition of the training data created by the
feature test at the root node. When your features are purely symbolic, you'll have
one child node for each value of the feature chosen for the feature test at the root.
When the test at the root involves a numeric feature, you find the decision threshold
for the feature that best bipartitions the data and you drop from the root node two
child nodes, one for each partition. Now at each child node you pose the same
question that you posed when you found the best feature to use at the root: Which
feature at the child node in question would maximally disambiguate the class labels
associated with the training data corresponding to that child node?
As the reader would expect, the two key steps in any approach to decision-tree based
classification are the construction of the decision tree itself from a file
containing the training data, and then using the decision tree thus obtained for
classifying new data.
What is cool about decision tree classification is that it gives you soft
classification, meaning it may associate more than one class label with a given data
vector. When this happens, it may mean that your classes are indeed overlapping in
the underlying feature space. It could also mean that you simply have not supplied
sufficient training data to the decision tree classifier. For a tutorial
introduction to how a decision tree is constructed and used, visit
L<https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf>
This module also allows you to generate your own synthetic training and test
data. Generating your own training data, using it for constructing a decision-tree
classifier, and subsequently testing the classifier on a synthetically generated
test set of data is a good way to develop greater proficiency with decision trees.
=head1 WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE
If you are new to the concept of a decision tree, their practical utility is best
understood with an example that only involves symbolic features. However, as
mentioned earlier, versions of the module higher than 2.0 allow you to use both
symbolic and numeric features.
Consider the following scenario: Let's say you are running a small investment company
that employs a team of stockbrokers who make buy/sell decisions for the customers of
your company. Assume that your company has asked the traders to make each investment
decision on the basis of the following four criteria:
price_to_earnings_ratio (P_to_E)
price_to_sales_ratio (P_to_S)
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
This method reads your training datafile, randomizes it, and then partitions it into
the specified number of bags. Subsequently, if the constructor parameter
C<bag_overlap_fraction> is non-zero, it adds to each bag additional samples drawn at
random from the other bags. The number of these additional samples added to each bag
is controlled by the constructor parameter C<bag_overlap_fraction>. If this
parameter is set to, say, 0.2, the size of each bag will grow by 20% with the samples
drawn from the other bags.
=item B<show_training_data_in_bags():>
Shows for each bag the names of the training data samples in that bag.
=item B<calculate_first_order_probabilities():>
Calls on the appropriate methods of the main C<DecisionTree> class to estimate the
first-order probabilities from the data samples in each bag.
=item B<calculate_class_priors():>
Calls on the appropriate method of the main C<DecisionTree> class to estimate the
class priors for the data samples in each bag.
=item B<construct_decision_trees_for_bags():>
Calls on the appropriate method of the main C<DecisionTree> class to construct a
decision tree from the training data in each bag.
=item B<display_decision_trees_for_bags():>
Display separately the decision tree for each bag..
=item B<classify_with_bagging( test_sample ):>
Calls on the appropriate methods of the main C<DecisionTree> class to classify the
argument test sample.
=item B<display_classification_results_for_each_bag():>
Displays separately the classification decision made by each the decision tree
constructed for each bag.
=item B<get_majority_vote_classification():>
Using majority voting, this method aggregates the classification decisions made by
the individual decision trees into a single decision.
=back
See the example scripts in the directory C<bagging_examples> for how to call these
methods for classifying individual samples and for bulk classification when you place
all your test samples in a single file.
=head1 USING BOOSTING
Starting with Version 3.20, you can use the class C<BoostedDecisionTree> for
constructing a boosted decision-tree classifier. Boosting results in a cascade of
decision trees in which each decision tree is constructed with samples that are
mostly those that are misclassified by the previous decision tree. To be precise,
you create a probability distribution over the training samples for the selection of
samples for training each decision tree in the cascade. To start out, the
distribution is uniform over all of the samples. Subsequently, this probability
distribution changes according to the misclassifications by each tree in the cascade:
if a sample is misclassified by a given tree in the cascade, the probability of its
being selected for training the next tree is increased significantly. You also
associate a trust factor with each decision tree depending on its power to classify
correctly all of the training data samples. After a cascade of decision trees is
constructed in this manner, you construct a final classifier that calculates the
class label for a test data sample by taking into account the classification
decisions made by each individual tree in the cascade, the decisions being weighted
by the trust factors associated with the individual classifiers. These boosting
notions --- generally referred to as the AdaBoost algorithm --- are based on a now
celebrated paper "A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting" by Yoav Freund and Robert Schapire that appeared in 1995 in
the Proceedings of the 2nd European Conf. on Computational Learning Theory. For a
tutorial introduction to AdaBoost, see L<https://engineering.purdue.edu/kak/Tutorials/AdaBoost.pdf>
Keep in mind the fact that, ordinarily, the theoretical guarantees provided by
boosting apply only to the case of binary classification. Additionally, your
training dataset must capture all of the significant statistical variations in the
classes represented therein.
=over 4
=item B<Calling the BoostedDecisionTree constructor:>
If you'd like to experiment with boosting, a typical call to the constructor for the
C<BoostedDecisionTree> class looks like:
use Algorithm::BoostedDecisionTree;
my $training_datafile = "training6.csv";
my $boosted = Algorithm::BoostedDecisionTree->new(
training_datafile => $training_datafile,
csv_class_column_index => 1,
csv_columns_for_features => [2,3],
entropy_threshold => 0.01,
max_depth_desired => 8,
symbolic_to_numeric_cardinality_threshold => 10,
how_many_stages => 4,
csv_cleanup_needed => 1,
);
Note in particular the constructor parameter:
how_many_stages
As its name implies, this parameter controls how many stages will be used in the
boosted decision tree classifier. As mentioned above, a separate decision tree is
constructed for each stage of boosting using a set of training samples that are drawn
through a probability distribution maintained over the entire training dataset.
=back
=head2 B<Methods defined for C<BoostedDecisionTree> class>
=over 8
=item B<get_training_data_for_base_tree():>
This method reads your training datafile, creates the data structures from the data
ingested for constructing the base decision tree.
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
you are classifying a single test sample, you can also see how each bag is
classifying the test sample. You can, for example, display the training data used in
each bag, the decision tree constructed for each bag, etc.
The second script is for the case when you place all of the test samples in a single
file. The demonstration script displays for each test sample a single aggregate
classification decision that is obtained through majority voting by all the decision
trees.
=head1 THE C<ExamplesBoosting> DIRECTORY
The C<ExamplesBoosting> subdirectory in the main installation directory contains the
following three scripts:
boosting_for_classifying_one_test_sample_1.pl
boosting_for_classifying_one_test_sample_2.pl
boosting_for_bulk_classification.pl
As the names of the first two scripts imply, these show how to call the different
methods of the C<BoostedDecisionTree> class for classifying a single test sample.
When you are classifying a single test sample, you can see how each stage of the
cascade of decision trees is classifying the test sample. You can also view each
decision tree separately and also see the trust factor associated with the tree.
The third script is for the case when you place all of the test samples in a single
file. The demonstration script outputs for each test sample a single aggregate
classification decision that is obtained through trust-factor weighted majority
voting by all the decision trees.
=head1 THE C<ExamplesRandomizedTrees> DIRECTORY
The C<ExamplesRandomizedTrees> directory shows example scripts that you can use to
become more familiar with the C<RandomizedTreesForBigData> class for solving
needle-in-a-haystack and big-data data classification problems. These scripts are:
randomized_trees_for_classifying_one_test_sample_1.pl
randomized_trees_for_classifying_one_test_sample_2.pl
classify_database_records.pl
The first script shows the constructor options to use for solving a
needle-in-a-haystack problem --- that is, a problem in which a vast majority of the
training data belongs to just one class. The second script shows the constructor
options for using randomized decision trees for the case when you have access to a
very large database of training samples and you'd like to construct an ensemble of
decision trees using training samples pulled randomly from the training database.
The last script illustrates how you can evaluate the classification power of an
ensemble of decision trees as constructed by C<RandomizedTreesForBigData> by classifying
a large number of test samples extracted randomly from the training database.
=head1 THE C<ExamplesRegression> DIRECTORY
The C<ExamplesRegression> subdirectory in the main installation directory shows
example scripts that you can use to become familiar with regression trees and how
they can be used for nonlinear regression. If you are new to the concept of
regression trees, start by executing the following scripts without changing them and
see what sort of output is produced by them:
regression4.pl
regression5.pl
regression6.pl
regression8.pl
The C<regression4.pl> script involves only one predictor variable and one dependent
variable. The training data for this exercise is drawn from the file C<gendata4.csv>.
This data file contains strongly nonlinear data. When you run the script
C<regression4.pl>, you will see how much better the result from tree regression is
compared to what you can get with linear regression.
The C<regression5.pl> script is essentially the same as the previous script except
for the fact that the training datafile used in this case, C<gendata5.csv>, consists
of three noisy segments, as opposed to just two in the previous case.
The script C<regression6.pl> deals with the case when we have two predictor variables
and one dependent variable. You can think of the data as consisting of noisy height
values over an C<(x1,x2)> plane. The data used in this script is drawn from the csv
file C<gen3Ddata1.csv>.
Finally, the script C<regression8.pl> shows how you can carry out bulk prediction for
all your test data records in a disk file. The script writes all the calculated
predictions into another disk file whose name is derived from the name of the test
data file.
=head1 EXPORT
None by design.
=head1 BUGS
Please notify the author if you encounter any bugs. When sending email, please place
the string 'DecisionTree' in the subject line.
=head1 INSTALLATION
Download the archive from CPAN in any directory of your choice. Unpack the archive
with a command that on a Linux machine would look like:
tar zxvf Algorithm-DecisionTree-3.43.tar.gz
This will create an installation directory for you whose name will be
C<Algorithm-DecisionTree-3.43>. Enter this directory and execute the following
commands for a standard install of the module if you have root privileges:
perl Makefile.PL
make
make test
sudo make install
If you do not have root privileges, you can carry out a non-standard install the
module in any directory of your choice by:
perl Makefile.PL prefix=/some/other/directory
( run in 1.393 second using v1.01-cache-2.11-cpan-98e64b0badf )