Algorithm-DecisionTree
view release on metacpan or search on metacpan
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
=head1 NAME
Algorithm::DecisionTree - A Perl module for decision-tree based classification of
multidimensional data.
=head1 SYNOPSIS
# FOR CONSTRUCTING A DECISION TREE AND FOR CLASSIFYING A SAMPLE:
# In general, your call for constructing an instance of the DecisionTree class
# will look like:
my $training_datafile = "stage3cancer.csv";
my $dt = Algorithm::DecisionTree->new(
training_datafile => $training_datafile,
csv_class_column_index => 2,
csv_columns_for_features => [3,4,5,6,7,8],
entropy_threshold => 0.01,
max_depth_desired => 8,
symbolic_to_numeric_cardinality_threshold => 10,
csv_cleanup_needed => 1,
);
# The constructor option `csv_class_column_index' informs the module as to which
# column of your CSV file contains the class label. THE COLUMN INDEXING IS ZERO
# BASED. The constructor option `csv_columns_for_features' specifies which columns
# are to be used for feature values. The first row of the CSV file must specify
# the names of the features. See examples of CSV files in the `Examples'
# subdirectory.
# The option `symbolic_to_numeric_cardinality_threshold' is also important. For
# the example shown above, if an ostensibly numeric feature takes on only 10 or
# fewer different values in your training datafile, it will be treated like a
# symbolic features. The option `entropy_threshold' determines the granularity
# with which the entropies are sampled for the purpose of calculating entropy gain
# with a particular choice of decision threshold for a numeric feature or a feature
# value for a symbolic feature.
# The option 'csv_cleanup_needed' is by default set to 0. If you set it
# to 1, that would cause all line records in your CSV file to be "sanitized" before
# they are used for constructing a decision tree. You need this option if your CSV
# file uses double-quoted field names and field values in the line records and if
# such double-quoted strings are allowed to include commas for, presumably, better
# readability.
# After you have constructed an instance of the DecisionTree class as shown above,
# you read in the training data file and initialize the probability cache by
# calling:
$dt->get_training_data();
$dt->calculate_first_order_probabilities();
$dt->calculate_class_priors();
# Next you construct a decision tree for your training data by calling:
$root_node = $dt->construct_decision_tree_classifier();
# where $root_node is an instance of the DTNode class that is also defined in the
# module file. Now you are ready to classify a new data record. Let's say that
# your data record looks like:
my @test_sample = qw / g2=4.2
grade=2.3
gleason=4
eet=1.7
age=55.0
ploidy=diploid /;
# You can classify it by calling:
my $classification = $dt->classify($root_node, \@test_sample);
# The call to `classify()' returns a reference to a hash whose keys are the class
# names and the values the associated classification probabilities. This hash also
# includes another key-value pair for the solution path from the root node to the
# leaf node at which the final classification was carried out.
=head1 CHANGES
B<Version 3.43:> This version fixes a bug in the C<csv_cleanup_needed()> function.
The source of the bug was a typo in a regex component meant for matching with white
space. I have also made one additional change to this function to increase its
versatility. With this change, you are now allowed to have empty strings as values
for features.
B<Version 3.42:> This version reintroduces C<csv_cleanup_needed> as an optional
parameter in the module constructor. This was done in response to several requests
received from the user community. (Previously, all line records from a CSV file were
processed by the C<cleanup_csv()> function no matter what.) The main point made by
the users was that invoking C<cleanup_csv()> when there was no need for CSV clean-up
extracted a performance penalty when ingesting large database files with tens of
thousands of line records. In addition to making C<csv_cleanup_needed> optional, I
have also tweaked up the code in the C<cleanup_csv()> function in order to extract
data from a larger range of messy CSV files.
B<Version 3.41:> All the changes made in this version relate to the construction of
regression trees. I have fixed a couple of bugs in the calculation of the regression
coefficients. Additionally, the C<RegressionTree> class now comes with a new
constructor parameter named C<jacobian_choice>. For most cases, you'd set this
parameter to 0, which causes the regression coefficients to be estimated through
linear least-squares minimization.
B<Version 3.40:> In addition to constructing decision trees, this version of the
module also allows you to construct regression trees. The regression tree capability
has been packed into a separate subclass, named C<RegressionTree>, of the main
C<DecisionTree> class. The subdirectory C<ExamplesRegression> in the main
installation directory illustrates how you can use this new functionality of the
module.
B<Version 3.30:> This version incorporates four very significant upgrades/changes to
the C<DecisionTree> module: B<(1)> The CSV cleanup is now the default. So you do not
have to set any special parameters in the constructor calls to initiate CSV
cleanup. B<(2)> In the form of a new Perl class named C<RandomizedTreesForBigData>,
this module provides you with an easy-to-use programming interface for attempting
needle-in-a-haystack solutions for the case when your training data is overwhelmingly
dominated by a single class. You need to set the constructor parameter
C<looking_for_needles_in_haystack> to invoke the logic that constructs multiple
decision trees, each using the minority class samples along with samples drawn
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
perspective, they are closely related to classification and regression by recursive
partitioning of multidimensional data. Early work that demonstrated the usefulness
of such partitioning of data for classification and regression can be traced to the
work of Terry Therneau in the early 1980's in the statistics community, and to the
work of Ross Quinlan in the mid 1990's in the machine learning community.
For those not familiar with decision tree ideas, the traditional way to classify
multidimensional data is to start with a feature space whose dimensionality is the
same as that of the data. Each feature in this space corresponds to the attribute
that each dimension of the data measures. You then use the training data to carve up
the feature space into different regions, each corresponding to a different class.
Subsequently, when you try to classify a new data sample, you locate it in the
feature space and find the class label of the region to which it belongs. One can
also give the new data point the same class label as that of the nearest training
sample. This is referred to as the nearest neighbor classification. There exist
hundreds of variations of varying power on these two basic approaches to the
classification of multidimensional data.
A decision tree classifier works differently. When you construct a decision tree,
you select for the root node a feature test that partitions the training data in a
way that causes maximal disambiguation of the class labels associated with the data.
In terms of information content as measured by entropy, such a feature test would
cause maximum reduction in class entropy in going from all of the training data taken
together to the data as partitioned by the feature test. You then drop from the root
node a set of child nodes, one for each partition of the training data created by the
feature test at the root node. When your features are purely symbolic, you'll have
one child node for each value of the feature chosen for the feature test at the root.
When the test at the root involves a numeric feature, you find the decision threshold
for the feature that best bipartitions the data and you drop from the root node two
child nodes, one for each partition. Now at each child node you pose the same
question that you posed when you found the best feature to use at the root: Which
feature at the child node in question would maximally disambiguate the class labels
associated with the training data corresponding to that child node?
As the reader would expect, the two key steps in any approach to decision-tree based
classification are the construction of the decision tree itself from a file
containing the training data, and then using the decision tree thus obtained for
classifying new data.
What is cool about decision tree classification is that it gives you soft
classification, meaning it may associate more than one class label with a given data
vector. When this happens, it may mean that your classes are indeed overlapping in
the underlying feature space. It could also mean that you simply have not supplied
sufficient training data to the decision tree classifier. For a tutorial
introduction to how a decision tree is constructed and used, visit
L<https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf>
This module also allows you to generate your own synthetic training and test
data. Generating your own training data, using it for constructing a decision-tree
classifier, and subsequently testing the classifier on a synthetically generated
test set of data is a good way to develop greater proficiency with decision trees.
=head1 WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE
If you are new to the concept of a decision tree, their practical utility is best
understood with an example that only involves symbolic features. However, as
mentioned earlier, versions of the module higher than 2.0 allow you to use both
symbolic and numeric features.
Consider the following scenario: Let's say you are running a small investment company
that employs a team of stockbrokers who make buy/sell decisions for the customers of
your company. Assume that your company has asked the traders to make each investment
decision on the basis of the following four criteria:
price_to_earnings_ratio (P_to_E)
price_to_sales_ratio (P_to_S)
return_on_equity (R_on_E)
market_share (MS)
Since you are the boss, you keep track of the buy/sell decisions made by the
individual traders. But one unfortunate day, all of your traders decide to quit
because you did not pay them enough. So what do you do? If you had a module like
the one here, you could still run your company and do so in such a way that, on the
average, would do better than any of the individual traders who worked for your
company. This is what you do: You pool together the individual trader buy/sell
decisions you have accumulated during the last one year. This pooled information is
likely to look like:
example buy/sell P_to_E P_to_S R_on_E MS
============================================================+=
example_1 buy high low medium low
example_2 buy medium medium low low
example_3 sell low medium low high
....
....
This data, when formatted according to CSV, would constitute your training file. You
could feed this file into the module by calling:
my $dt = Algorithm::DecisionTree->new(
training_datafile => $training_datafile,
csv_class_column_index => 1,
csv_columns_for_features => [2,3,4,5],
);
$dt->get_training_data();
$dt->calculate_first_order_probabilities();
$dt->calculate_class_priors();
Subsequently, you would construct a decision tree by calling:
my $root_node = $dt->construct_decision_tree_classifier();
Now you and your company (with practically no employees) are ready to service the
customers again. Suppose your computer needs to make a buy/sell decision about an
investment prospect that is best described by:
price_to_earnings_ratio = low
price_to_sales_ratio = very_low
return_on_equity = none
market_share = medium
All that your computer would need to do would be to construct a data vector like
my @data = qw / P_to_E=low
P_to_S=very_low
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
You can choose the best values to use for the last three constructor parameters by
running a 10-fold cross-validation test on your training data through the class
C<EvalTrainingData> that comes with Versions 2.1 and higher of this module. See the
section "TESTING THE QUALITY OF YOUR TRAINING DATA" of this document page.
=over
=item B<get_training_data():>
After you have constructed a new instance of the C<Algorithm::DecisionTree> class,
you must now read in the training data that is the file named in the call to the
constructor. This you do by:
$dt->get_training_data();
=item B<show_training_data():>
If you wish to see the training data that was just digested by the module,
call
$dt->show_training_data();
=item B<calculate_first_order_probabilities():>
=item B<calculate_class_priors():>
After the module has read the training data file, it needs to initialize the
probability cache. This you do by invoking:
$dt->calculate_first_order_probabilities()
$dt->calculate_class_priors()
=item B<construct_decision_tree_classifier():>
With the probability cache initialized, it is time to construct a decision tree
classifier. This you do by
my $root_node = $dt->construct_decision_tree_classifier();
This call returns an instance of type C<DTNode>. The C<DTNode> class is defined
within the main package file. So, don't forget, that C<$root_node> in the above
example call will be instantiated to an object of type C<DTNode>.
=item B<$root_nodeC<< -> >>display_decision_tree(" "):>
$root_node->display_decision_tree(" ");
This will display the decision tree in your terminal window by using a recursively
determined offset for each node as the display routine descends down the tree.
I have intentionally left the syntax fragment C<$root_node> in the above call to
remind the reader that C<display_decision_tree()> is NOT called on the instance of
the C<DecisionTree> we constructed earlier, but on the C<DTNode> instance returned by
the call to C<construct_decision_tree_classifier()>.
=item B<classify($root_node, \@test_sample):>
Let's say you want to classify the following data record:
my @test_sample = qw / g2=4.2
grade=2.3
gleason=4
eet=1.7
age=55.0
ploidy=diploid /;
you'd make the following call:
my $classification = $dt->classify($root_node, \@test_sample);
where, again, C<$root_node> is an instance of type C<DTNode> returned by the call to
C<construct_decision_tree_classifier()>. The variable C<$classification> holds a
reference to a hash whose keys are the class names and whose values the associated
probabilities. The hash that is returned by the above call also includes a special
key-value pair for a key named C<solution_path>. The value associated with this key
is an anonymous array that holds the path, in the form of a list of nodes, from the
root node to the leaf node in the decision tree where the final classification was
made.
=item B<classify_by_asking_questions($root_node):>
This method allows you to use a decision-tree based classifier in an interactive
mode. In this mode, a user is prompted for answers to the questions pertaining to
the feature tests at the nodes of the tree. The syntax for invoking this method is:
my $classification = $dt->classify_by_asking_questions($root_node);
where C<$dt> is an instance of the C<Algorithm::DecisionTree> class returned by a
call to C<new()> and C<$root_node> the root node of the decision tree returned by a
call to C<construct_decision_tree_classifier()>.
=back
=head1 THE INTROSPECTION API
To construct an instance of C<DTIntrospection>, you call
my $introspector = DTIntrospection->new($dt);
where you supply the instance of the C<DecisionTree> class you used for constructing
the decision tree through the parameter C<$dt>. After you have constructed an
instance of the introspection class, you must initialize it by
$introspector->initialize();
Subsequently, you can invoke either of the following methods:
$introspector->explain_classification_at_one_node($node);
$introspector->explain_classifications_at_multiple_nodes_interactively();
depending on whether you want introspection at a single specified node or inside an
infinite loop for an arbitrary number of nodes.
If you want to output a tabular display that shows for each node in the decision tree
all the training samples that fall in the portion of the feature space that belongs
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
data vector. Run the examples scripts in the Examples directory to see how the output
of classification can be displayed.
With regard to the soft classifications returned by this classifier, if the
probability distributions for the different classes overlap in the underlying feature
space, you would want the classifier to return all of the applicable class labels for
a data vector along with the corresponding class probabilities. Another reason for
why the decision tree classifier may associate significant probabilities with
multiple class labels is that you used inadequate number of training samples to
induce the decision tree. The good thing is that the classifier does not lie to you
(unlike, say, a hard classification rule that would return a single class label
corresponding to the partitioning of the underlying feature space). The decision
tree classifier give you the best classification that can be made given the training
data you fed into it.
=head1 USING BAGGING
Starting with Version 3.0, you can use the class C<DecisionTreeWithBagging> that
comes with the module to incorporate bagging in your decision tree based
classification. Bagging means constructing multiple decision trees for different
(possibly overlapping) segments of the data extracted from your training dataset and
then aggregating the decisions made by the individual decision trees for the final
classification. The aggregation of the classification decisions can average out the
noise and bias that may otherwise affect the classification decision obtained from
just one tree.
=over 4
=item B<Calling the bagging constructor::>
A typical call to the constructor for the C<DecisionTreeWithBagging> class looks
like:
use Algorithm::DecisionTreeWithBagging;
my $training_datafile = "stage3cancer.csv";
my $dtbag = Algorithm::DecisionTreeWithBagging->new(
training_datafile => $training_datafile,
csv_class_column_index => 2,
csv_columns_for_features => [3,4,5,6,7,8],
entropy_threshold => 0.01,
max_depth_desired => 8,
symbolic_to_numeric_cardinality_threshold => 10,
how_many_bags => 4,
bag_overlap_fraction => 0.2,
csv_cleanup_needed => 1,
);
Note in particular the following two constructor parameters:
how_many_bags
bag_overlap_fraction
where, as the name implies, the parameter C<how_many_bags> controls how many bags
(and, therefore, how many decision trees) will be constructed from your training
dataset; and where the parameter C<bag_overlap_fraction> controls the degree of
overlap between the bags. To understand what exactly is achieved by setting the
parameter C<bag_overlap_fraction> to 0.2 in the above example, let's say that the
non-overlapping partitioning of the training data between the bags results in 100
training samples per bag. With bag_overlap_fraction set to 0.2, additional 20 samples
drawn randomly from the other bags will be added to the data in each bag.
=back
=head2 B<Methods defined for C<DecisionTreeWithBagging> class>
=over 8
=item B<get_training_data_for_bagging():>
This method reads your training datafile, randomizes it, and then partitions it into
the specified number of bags. Subsequently, if the constructor parameter
C<bag_overlap_fraction> is non-zero, it adds to each bag additional samples drawn at
random from the other bags. The number of these additional samples added to each bag
is controlled by the constructor parameter C<bag_overlap_fraction>. If this
parameter is set to, say, 0.2, the size of each bag will grow by 20% with the samples
drawn from the other bags.
=item B<show_training_data_in_bags():>
Shows for each bag the names of the training data samples in that bag.
=item B<calculate_first_order_probabilities():>
Calls on the appropriate methods of the main C<DecisionTree> class to estimate the
first-order probabilities from the data samples in each bag.
=item B<calculate_class_priors():>
Calls on the appropriate method of the main C<DecisionTree> class to estimate the
class priors for the data samples in each bag.
=item B<construct_decision_trees_for_bags():>
Calls on the appropriate method of the main C<DecisionTree> class to construct a
decision tree from the training data in each bag.
=item B<display_decision_trees_for_bags():>
Display separately the decision tree for each bag..
=item B<classify_with_bagging( test_sample ):>
Calls on the appropriate methods of the main C<DecisionTree> class to classify the
argument test sample.
=item B<display_classification_results_for_each_bag():>
Displays separately the classification decision made by each the decision tree
constructed for each bag.
=item B<get_majority_vote_classification():>
Using majority voting, this method aggregates the classification decisions made by
the individual decision trees into a single decision.
=back
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
max_depth_desired => 2,
jacobian_choice => 0,
csv_cleanup_needed => 1,
);
Note in particular the constructor parameters:
dependent_variable
predictor_columns
mse_threshold
jacobian_choice
The first of these parameters, C<dependent_variable>, is set to the column index in
the CSV file for the dependent variable. The second constructor parameter,
C<predictor_columns>, tells the system as to which columns contain values for the
predictor variables. The third parameter, C<mse_threshold>, is for deciding when to
partition the data at a node into two child nodes as a regression tree is being
constructed. If the minmax of MSE (Mean Squared Error) that can be achieved by
partitioning any of the features at a node is smaller than C<mse_threshold>, that
node becomes a leaf node of the regression tree.
The last parameter, C<jacobian_choice>, must be set to either 0 or 1 or 2. Its
default value is 0. When this parameter equals 0, the regression coefficients are
calculated using the linear least-squares method and no further "refinement" of the
coefficients is carried out using gradient descent. This is the fastest way to
calculate the regression coefficients. When C<jacobian_choice> is set to 1, you get
a weak version of gradient descent in which the Jacobian is set to the "design
matrix" itself. Choosing 2 for C<jacobian_choice> results in a more reasonable
approximation to the Jacobian. That, however, is at a cost of much longer
computation time. B<NOTE:> For most cases, using 0 for C<jacobian_choice> is the
best choice. See my tutorial "I<Linear Regression and Regression Trees>" for why
that is the case.
=back
=head2 B<Methods defined for C<RegressionTree> class>
=over 8
=item B<get_training_data_for_regression():>
Only CSV training datafiles are allowed. Additionally, the first record in the file
must list the names of the fields, and the first column must contain an integer ID
for each record.
=item B<construct_regression_tree():>
As the name implies, this is the method that construct a regression tree.
=item B<display_regression_tree(" "):>
Displays the regression tree, as the name implies. The white-space string argument
specifies the offset to use in displaying the child nodes in relation to a parent
node.
=item B<prediction_for_single_data_point( $root_node, $test_sample ):>
You call this method after you have constructed a regression tree if you want to
calculate the prediction for one sample. The parameter C<$root_node> is what is
returned by the call C<construct_regression_tree()>. The formatting of the argument
bound to the C<$test_sample> parameter is important. To elaborate, let's say you are
using two variables named C<$xvar1> and C<$xvar2> as your predictor variables. In
this case, the C<$test_sample> parameter will be bound to a list that will look like
['xvar1 = 23.4', 'xvar2 = 12.9']
Arbitrary amount of white space, including none, on the two sides of the equality
symbol is allowed in the construct shown above. A call to this method returns a
dictionary with two key-value pairs. One of the keys is called C<solution_path> and
the other C<prediction>. The value associated with key C<solution_path> is the path
in the regression tree to the leaf node that yielded the prediction. And the value
associated with the key C<prediction> is the answer you are looking for.
=item B<predictions_for_all_data_used_for_regression_estimation( $root_node ):>
This call calculates the predictions for all of the predictor variables data in your
training file. The parameter C<$root_node> is what is returned by the call to
C<construct_regression_tree()>. The values for the dependent variable thus predicted
can be seen by calling C<display_all_plots()>, which is the method mentioned below.
=item B<display_all_plots():>
This method displays the results obtained by calling the prediction method of the
previous entry. This method also creates a hardcopy of the plots and saves it as a
C<.png> disk file. The name of this output file is always C<regression_plots.png>.
=item B<mse_for_tree_regression_for_all_training_samples( $root_node ):>
This method carries out an error analysis of the predictions for the samples in your
training datafile. It shows you the overall MSE (Mean Squared Error) with tree-based
regression, the MSE for the data samples at each of the leaf nodes of the regression
tree, and the MSE for the plain old Linear Regression as applied to all of the data.
The parameter C<$root_node> in the call syntax is what is returned by the call to
C<construct_regression_tree()>.
=item B<bulk_predictions_for_data_in_a_csv_file( $root_node, $filename, $columns ):>
Call this method if you want to apply the regression tree to all your test data in a
disk file. The predictions for all of the test samples in the disk file are written
out to another file whose name is the same as that of the test file except for the
addition of C<_output> in the name of the file. The parameter C<$filename> is the
name of the disk file that contains the test data. And the parameter C<$columns> is a
list of the column indices for the predictor variables in the test file.
=back
=head1 GENERATING SYNTHETIC TRAINING DATA
The module file contains the following additional classes: (1)
C<TrainingDataGeneratorNumeric>, and (2) C<TrainingDataGeneratorSymbolic> for
generating synthetic training data.
The class C<TrainingDataGeneratorNumeric> outputs one CSV file for the
training data and another one for the test data for experimenting with numeric
features. The numeric values are generated using a multivariate Gaussian
distribution whose mean and covariance are specified in a parameter file. See the
file C<param_numeric.txt> in the C<Examples> directory for an example of such a
parameter file. Note that the dimensionality of the data is inferred from the
information you place in the parameter file.
The class C<TrainingDataGeneratorSymbolic> generates synthetic training for the
( run in 0.320 second using v1.01-cache-2.11-cpan-483215c6ad5 )