Algorithm-DecisionTree
view release on metacpan or search on metacpan
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
sample. This is referred to as the nearest neighbor classification. There exist
hundreds of variations of varying power on these two basic approaches to the
classification of multidimensional data.
A decision tree classifier works differently. When you construct a decision tree,
you select for the root node a feature test that partitions the training data in a
way that causes maximal disambiguation of the class labels associated with the data.
In terms of information content as measured by entropy, such a feature test would
cause maximum reduction in class entropy in going from all of the training data taken
together to the data as partitioned by the feature test. You then drop from the root
node a set of child nodes, one for each partition of the training data created by the
feature test at the root node. When your features are purely symbolic, you'll have
one child node for each value of the feature chosen for the feature test at the root.
When the test at the root involves a numeric feature, you find the decision threshold
for the feature that best bipartitions the data and you drop from the root node two
child nodes, one for each partition. Now at each child node you pose the same
question that you posed when you found the best feature to use at the root: Which
feature at the child node in question would maximally disambiguate the class labels
associated with the training data corresponding to that child node?
As the reader would expect, the two key steps in any approach to decision-tree based
classification are the construction of the decision tree itself from a file
containing the training data, and then using the decision tree thus obtained for
classifying new data.
What is cool about decision tree classification is that it gives you soft
classification, meaning it may associate more than one class label with a given data
vector. When this happens, it may mean that your classes are indeed overlapping in
the underlying feature space. It could also mean that you simply have not supplied
sufficient training data to the decision tree classifier. For a tutorial
introduction to how a decision tree is constructed and used, visit
L<https://engineering.purdue.edu/kak/Tutorials/DecisionTreeClassifiers.pdf>
This module also allows you to generate your own synthetic training and test
data. Generating your own training data, using it for constructing a decision-tree
classifier, and subsequently testing the classifier on a synthetically generated
test set of data is a good way to develop greater proficiency with decision trees.
=head1 WHAT PRACTICAL PROBLEM IS SOLVED BY THIS MODULE
If you are new to the concept of a decision tree, their practical utility is best
understood with an example that only involves symbolic features. However, as
mentioned earlier, versions of the module higher than 2.0 allow you to use both
symbolic and numeric features.
Consider the following scenario: Let's say you are running a small investment company
that employs a team of stockbrokers who make buy/sell decisions for the customers of
your company. Assume that your company has asked the traders to make each investment
decision on the basis of the following four criteria:
price_to_earnings_ratio (P_to_E)
price_to_sales_ratio (P_to_S)
return_on_equity (R_on_E)
market_share (MS)
Since you are the boss, you keep track of the buy/sell decisions made by the
individual traders. But one unfortunate day, all of your traders decide to quit
because you did not pay them enough. So what do you do? If you had a module like
the one here, you could still run your company and do so in such a way that, on the
average, would do better than any of the individual traders who worked for your
company. This is what you do: You pool together the individual trader buy/sell
decisions you have accumulated during the last one year. This pooled information is
likely to look like:
example buy/sell P_to_E P_to_S R_on_E MS
============================================================+=
example_1 buy high low medium low
example_2 buy medium medium low low
example_3 sell low medium low high
....
....
This data, when formatted according to CSV, would constitute your training file. You
could feed this file into the module by calling:
my $dt = Algorithm::DecisionTree->new(
training_datafile => $training_datafile,
csv_class_column_index => 1,
csv_columns_for_features => [2,3,4,5],
);
$dt->get_training_data();
$dt->calculate_first_order_probabilities();
$dt->calculate_class_priors();
Subsequently, you would construct a decision tree by calling:
my $root_node = $dt->construct_decision_tree_classifier();
Now you and your company (with practically no employees) are ready to service the
customers again. Suppose your computer needs to make a buy/sell decision about an
investment prospect that is best described by:
price_to_earnings_ratio = low
price_to_sales_ratio = very_low
return_on_equity = none
market_share = medium
All that your computer would need to do would be to construct a data vector like
my @data = qw / P_to_E=low
P_to_S=very_low
R_on_E=none
MS=medium /;
and call the decision tree classifier you just constructed by
$dt->classify($root_node, \@data);
The answer returned will be 'buy' and 'sell', along with the associated
probabilities. So if the probability of 'buy' is considerably greater than the
probability of 'sell', that's what you should instruct your computer to do.
The chances are that, on the average, this approach would beat the performance of any
of your individual traders who worked for you previously since the buy/sell decisions
made by the computer would be based on the collective wisdom of all your previous
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
the training dataset must contain a relatively small number of discrete class labels
for all of your data records if you want to model the data with one or more decision
trees. However, when one is trying to understand all of the associational
relationships that exist in a large database, one often runs into situations where,
instead of discrete class labels, you have a continuously valued variable as a
dependent variable whose values are predicated on a set of feature values. It is for
such situations that you will find useful the new class C<RegressionTree> that is now
a part of the C<DecisionTree> module. The C<RegressionTree> class has been
programmed as a subclass of the main C<DecisionTree> class.
You can think of regression with a regression tree as a powerful generalization of
the very commonly used Linear Regression algorithms. Although you can certainly
carry out polynomial regression with run-of-the-mill Linear Regression algorithms for
modeling nonlinearities between the predictor variables and the dependent variable,
specifying the degree of the polynomial is often tricky. Additionally, a polynomial
can inject continuities between the predictor and the predicted variables that may
not really exist in the real data. Regression trees, on the other hand, give you a
piecewise linear relationship between the predictor and the predicted variables that
is freed from the constraints of superimposed continuities at the joins between the
different segments. See the following tutorial for further information regarding the
standard linear regression approach and the regression that can be achieved with the
RegressionTree class in this module:
L<https://engineering.purdue.edu/kak/Tutorials/RegressionTree.pdf>
The RegressionTree class in the current version of the module assumes that all of
your data is numerical. That is, unlike what is possible with the DecisionTree class
(and the other more closely related classes in this module) that allow your training
file to contain a mixture of numerical and symbolic data, the RegressionTree class
requires that ALL of your data be numerical. I hope to relax this constraint in
future versions of this module. Obviously, the dependent variable will always be
numerical for regression.
See the example scripts in the directory C<ExamplesRegression> if you wish to become
more familiar with the regression capabilities of the module.
=over 4
=item B<Calling the RegressionTree constructor:>
my $training_datafile = "gendata5.csv";
my $rt = Algorithm::RegressionTree->new(
training_datafile => $training_datafile,
dependent_variable_column => 2,
predictor_columns => [1],
mse_threshold => 0.01,
max_depth_desired => 2,
jacobian_choice => 0,
csv_cleanup_needed => 1,
);
Note in particular the constructor parameters:
dependent_variable
predictor_columns
mse_threshold
jacobian_choice
The first of these parameters, C<dependent_variable>, is set to the column index in
the CSV file for the dependent variable. The second constructor parameter,
C<predictor_columns>, tells the system as to which columns contain values for the
predictor variables. The third parameter, C<mse_threshold>, is for deciding when to
partition the data at a node into two child nodes as a regression tree is being
constructed. If the minmax of MSE (Mean Squared Error) that can be achieved by
partitioning any of the features at a node is smaller than C<mse_threshold>, that
node becomes a leaf node of the regression tree.
The last parameter, C<jacobian_choice>, must be set to either 0 or 1 or 2. Its
default value is 0. When this parameter equals 0, the regression coefficients are
calculated using the linear least-squares method and no further "refinement" of the
coefficients is carried out using gradient descent. This is the fastest way to
calculate the regression coefficients. When C<jacobian_choice> is set to 1, you get
a weak version of gradient descent in which the Jacobian is set to the "design
matrix" itself. Choosing 2 for C<jacobian_choice> results in a more reasonable
approximation to the Jacobian. That, however, is at a cost of much longer
computation time. B<NOTE:> For most cases, using 0 for C<jacobian_choice> is the
best choice. See my tutorial "I<Linear Regression and Regression Trees>" for why
that is the case.
=back
=head2 B<Methods defined for C<RegressionTree> class>
=over 8
=item B<get_training_data_for_regression():>
Only CSV training datafiles are allowed. Additionally, the first record in the file
must list the names of the fields, and the first column must contain an integer ID
for each record.
=item B<construct_regression_tree():>
As the name implies, this is the method that construct a regression tree.
=item B<display_regression_tree(" "):>
Displays the regression tree, as the name implies. The white-space string argument
specifies the offset to use in displaying the child nodes in relation to a parent
node.
=item B<prediction_for_single_data_point( $root_node, $test_sample ):>
You call this method after you have constructed a regression tree if you want to
calculate the prediction for one sample. The parameter C<$root_node> is what is
returned by the call C<construct_regression_tree()>. The formatting of the argument
bound to the C<$test_sample> parameter is important. To elaborate, let's say you are
using two variables named C<$xvar1> and C<$xvar2> as your predictor variables. In
this case, the C<$test_sample> parameter will be bound to a list that will look like
['xvar1 = 23.4', 'xvar2 = 12.9']
Arbitrary amount of white space, including none, on the two sides of the equality
symbol is allowed in the construct shown above. A call to this method returns a
dictionary with two key-value pairs. One of the keys is called C<solution_path> and
the other C<prediction>. The value associated with key C<solution_path> is the path
in the regression tree to the leaf node that yielded the prediction. And the value
associated with the key C<prediction> is the answer you are looking for.
=item B<predictions_for_all_data_used_for_regression_estimation( $root_node ):>
This call calculates the predictions for all of the predictor variables data in your
( run in 1.634 second using v1.01-cache-2.11-cpan-5a3173703d6 )