Algorithm-DecisionTree
coefficients is carried out using gradient descent. This is the fastest way to
calculate the regression coefficients. When C<jacobian_choice> is set to 1, you get
a weak version of gradient descent in which the Jacobian is set to the "design
matrix" itself. Choosing 2 for C<jacobian_choice> results in a more reasonable
approximation to the Jacobian. That, however, is at a cost of much longer
computation time. B<NOTE:> For most cases, using 0 for C<jacobian_choice> is the
best choice. See my tutorial "I<Linear Regression and Regression Trees>" for why
that is the case.
=back
=head2 B<Methods defined for C<RegressionTree> class>
=over 8
=item B<get_training_data_for_regression():>
Only CSV training datafiles are allowed. Additionally, the first record in the file
must list the names of the fields, and the first column must contain an integer ID
for each record.
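
The two format requirements above can be checked before a file is handed to the module. What follows is only a minimal sketch: the helper C<check_training_csv> is hypothetical and not part of the module, and it assumes a naive comma split with no commas embedded inside quoted fields.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Algorithm::DecisionTree): returns a list of
# problems found in the lines of a CSV training file. An empty list means the
# lines have a header record and an integer ID in the first column.
# Assumption: fields are separated by plain commas, none embedded in quotes.
sub check_training_csv {
    chomp(my @lines = @_);
    return ("no records found") unless @lines;

    my @problems;
    # The first record must list the names of the fields.
    my @header = split /,/, $lines[0], -1;
    push @problems, "header must have an ID column and at least one field"
        if @header < 2;

    for my $i (1 .. $#lines) {
        my @fields = split /,/, $lines[$i], -1;
        my $id = @fields ? $fields[0] : '';
        $id =~ s/^\s*"?//;      # drop leading white space and an optional quote
        $id =~ s/"?\s*$//;      # drop an optional closing quote and white space
        push @problems, "record $i: ID '$id' is not an integer"
            unless $id =~ /^-?\d+$/;
        push @problems, "record $i: expected " . scalar(@header)
                      . " fields, got " . scalar(@fields)
            unless @fields == @header;
    }
    return @problems;
}

# Made-up records for illustration only.
my @sample = (
    'id,xvar1,xvar2,yvar',
    '1,23.4,12.9,7.1',
    '2,19.0,11.2,6.3',
);
my @problems = check_training_csv(@sample);
print @problems ? join("\n", @problems) . "\n" : "format looks fine\n";
```

A check like this only catches layout problems early; the module's own reader remains the authority on what it accepts.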
=item B<construct_regression_tree():>
As the name implies, this is the method that constructs a regression tree. It returns
the root node of the tree, which the prediction methods below take as C<$root_node>.
=item B<display_regression_tree(" "):>
Displays the regression tree, as the name implies. The white-space string argument
specifies the offset to use in displaying the child nodes in relation to a parent
node.
=item B<prediction_for_single_data_point( $root_node, $test_sample ):>
You call this method after you have constructed a regression tree if you want to
calculate the prediction for one sample. The parameter C<$root_node> is what is
returned by the call C<construct_regression_tree()>. The formatting of the argument
bound to the C<$test_sample> parameter is important. To elaborate, let's say you are
using two variables named C<xvar1> and C<xvar2> as your predictor variables. In
this case, the C<$test_sample> parameter will be bound to a list that looks like

    ['xvar1 = 23.4', 'xvar2 = 12.9']

Any amount of white space, including none, is allowed on either side of the equality
symbol in the construct shown above. A call to this method returns a
dictionary with two key-value pairs. One of the keys is called C<solution_path> and
the other C<prediction>. The value associated with key C<solution_path> is the path
in the regression tree to the leaf node that yielded the prediction. And the value
associated with the key C<prediction> is the answer you are looking for.
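
To make the expected format of C<$test_sample> concrete, here is a minimal sketch that builds such a list from a hash of predictor values and parses an entry back into a name and a value. The helpers C<make_test_sample> and C<parse_test_sample_entry> are hypothetical and not part of the module; the regular expression merely illustrates the white-space tolerance described above and is not claimed to be what the module uses internally.

```perl
use strict;
use warnings;

# Hypothetical helper: turn { name => value } into ['name = value', ...],
# sorted by name so that the order of the entries is deterministic.
sub make_test_sample {
    my ($values) = @_;
    return [ map { "$_ = $values->{$_}" } sort keys %$values ];
}

# Hypothetical helper: split one 'name = value' entry, allowing any amount
# of white space (including none) on either side of the equality symbol.
sub parse_test_sample_entry {
    my ($entry) = @_;
    my ($name, $value) = $entry =~ /^\s*(\w+)\s*=\s*(\S+)\s*$/
        or die "malformed test sample entry: '$entry'";
    return ($name, $value);
}

my $test_sample = make_test_sample({ xvar1 => 23.4, xvar2 => 12.9 });
print join(", ", @$test_sample), "\n";    # prints: xvar1 = 23.4, xvar2 = 12.9

for my $entry ('xvar1=23.4', 'xvar2   =   12.9') {
    my ($name, $value) = parse_test_sample_entry($entry);
    print "$name -> $value\n";
}
```

The array reference produced by C<make_test_sample> has the shape that the C<$test_sample> parameter expects.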
=item B<predictions_for_all_data_used_for_regression_estimation( $root_node ):>
This call calculates the predictions for all of the predictor-variable data in your
training file. The parameter C<$root_node> is what is returned by the call to
C<construct_regression_tree()>. The values for the dependent variable thus predicted
can be seen by calling C<display_all_plots()>, which is the method described next.
=item B<display_all_plots():>
This method displays the results obtained by calling the prediction method of the
previous entry. This method also creates a hardcopy of the plots and saves it as a
C<.png> disk file. The name of this output file is always C<regression_plots.png>.
=item B<mse_for_tree_regression_for_all_training_samples( $root_node ):>
This method carries out an error analysis of the predictions for the samples in your
training datafile. It shows you the overall MSE (Mean Squared Error) with tree-based
regression, the MSE for the data samples at each of the leaf nodes of the regression
tree, and the MSE for the plain old Linear Regression as applied to all of the data.
The parameter C<$root_node> in the call syntax is what is returned by the call to
C<construct_regression_tree()>.
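
For reference, the MSE over a set of samples is the average of the squared differences between the observed and the predicted values of the dependent variable. The sketch below shows that calculation in isolation. The helper C<mean_squared_error> is hypothetical, the numbers are made up, and the code is independent of how the module computes its own figures.

```perl
use strict;
use warnings;

# Hypothetical helper: mean of the squared differences between two
# equally long lists of observed and predicted values.
sub mean_squared_error {
    my ($observed, $predicted) = @_;
    die "need two non-empty lists of equal length"
        unless @$observed && @$observed == @$predicted;
    my $sum = 0;
    $sum += ($observed->[$_] - $predicted->[$_]) ** 2 for 0 .. $#$observed;
    return $sum / @$observed;
}

# Made-up values for illustration only.
my @observed  = (1.0, 2.0, 4.0);
my @predicted = (1.5, 2.0, 3.0);
printf "MSE = %.4f\n", mean_squared_error(\@observed, \@predicted);
# (0.25 + 0 + 1) / 3, so this prints: MSE = 0.4167
```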
=item B<bulk_predictions_for_data_in_a_csv_file( $root_node, $filename, $columns ):>
Call this method if you want to apply the regression tree to all your test data in a
disk file. The predictions for all of the test samples in the disk file are written
out to another file whose name is the same as that of the test file except for the
addition of C<_output> in the name of the file. The parameter C<$filename> is the
name of the disk file that contains the test data. And the parameter C<$columns> is a
list of the column indices for the predictor variables in the test file.
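
The role of C<$columns> can be illustrated with a small sketch that picks the predictor columns out of each record and formats them in the 'name = value' style used for single-sample prediction. This is not the module's code: the helper C<samples_from_csv_lines> is hypothetical, the column indices are assumed to be zero-based with the ID column at index 0, and a naive comma split is used.

```perl
use strict;
use warnings;

# Hypothetical helper: given the lines of a CSV file (header first) and an
# array reference of column indices, return one ['name = value', ...] list
# per data record.
# Assumptions: zero-based indices, plain commas, no quoted fields.
sub samples_from_csv_lines {
    my ($lines, $columns) = @_;
    my ($header, @records) = @$lines;
    my @names = split /,/, $header;
    my @samples;
    for my $record (@records) {
        my @fields = split /,/, $record;
        push @samples, [ map { "$names[$_] = $fields[$_]" } @$columns ];
    }
    return \@samples;
}

# Made-up test records for illustration only.
my @lines = (
    'id,xvar1,xvar2',
    '1,23.4,12.9',
    '2,19.0,11.2',
);
my $samples = samples_from_csv_lines(\@lines, [1, 2]);
print join(", ", @$_), "\n" for @$samples;
```

With the made-up records above, the column list C<[1, 2]> selects C<xvar1> and C<xvar2> and skips the ID column.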
=back
=head1 GENERATING SYNTHETIC TRAINING DATA
The module file contains two additional classes for generating synthetic training
data: (1) C<TrainingDataGeneratorNumeric>, and (2) C<TrainingDataGeneratorSymbolic>.
The class C<TrainingDataGeneratorNumeric> outputs one CSV file for the
training data and another one for the test data for experimenting with numeric
features. The numeric values are generated using a multivariate Gaussian
distribution whose mean and covariance are specified in a parameter file. See the
file C<param_numeric.txt> in the C<Examples> directory for an example of such a
parameter file. Note that the dimensionality of the data is inferred from the
information you place in the parameter file.
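
To illustrate what drawing from a multivariate Gaussian with a given mean and covariance involves, here is a minimal two-dimensional sketch. It is not the implementation used by C<TrainingDataGeneratorNumeric> and it does not read the parameter-file format; the helper names and the numeric values are made up for illustration. The sketch uses the Box-Muller transform for standard normal draws and a Cholesky factor of the covariance matrix.

```perl
use strict;
use warnings;

use constant PI => 4 * atan2(1, 1);

# One draw from the standard normal distribution (Box-Muller transform).
sub standard_normal {
    my $u1 = 1 - rand();    # lies in (0, 1], so log() is always defined
    my $u2 = rand();
    return sqrt(-2 * log($u1)) * cos(2 * PI * $u2);
}

# Hypothetical helper: draw $n samples from a 2-D Gaussian with mean
# [m1, m2] and covariance [[v11, v12], [v12, v22]], using the lower
# triangular Cholesky factor L for which covariance = L * transpose(L).
sub bivariate_gaussian_samples {
    my ($n, $mean, $cov) = @_;
    my ($v11, $v12, $v22) = ($cov->[0][0], $cov->[0][1], $cov->[1][1]);
    die "covariance matrix must be positive definite" unless $v11 > 0;
    my $l11  = sqrt($v11);
    my $l21  = $v12 / $l11;
    my $rest = $v22 - $l21 ** 2;
    die "covariance matrix must be positive definite" unless $rest > 0;
    my $l22  = sqrt($rest);

    my @samples;
    for (1 .. $n) {
        my ($z1, $z2) = (standard_normal(), standard_normal());
        push @samples, [ $mean->[0] + $l11 * $z1,
                         $mean->[1] + $l21 * $z1 + $l22 * $z2 ];
    }
    return \@samples;
}

srand(42);    # fixed seed so that repeated runs give the same samples
my $samples = bivariate_gaussian_samples(5, [10, 20], [[4, 1], [1, 2]]);
printf "%.3f,%.3f\n", @$_ for @$samples;
```

For higher dimensions the same idea applies with a general Cholesky factorization; the two-dimensional case is used here only to keep the sketch short.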
The class C<TrainingDataGeneratorSymbolic> generates synthetic training data for the
purely symbolic case. The relative frequencies of the different possible values for
the features are controlled by the biasing information you place in a parameter file.
See C<param_symbolic.txt> for an example of such a file.
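
The idea of biasing the relative frequencies of symbolic values can be sketched as a weighted random choice. The helper C<biased_choice> below is hypothetical, the feature values and weights are made up, and the code does not reflect the parameter-file format or the internals of C<TrainingDataGeneratorSymbolic>.

```perl
use strict;
use warnings;

# Hypothetical helper: pick one value with probability proportional to its
# weight, given a hash reference such as { never => 0.7, heavy => 0.3 }.
sub biased_choice {
    my ($weights) = @_;
    my $total = 0;
    $total += $_ for values %$weights;
    die "weights must sum to a positive number" unless $total > 0;

    my @values    = sort keys %$weights;
    my $threshold = rand($total);          # uniform in [0, $total)
    for my $value (@values) {
        $threshold -= $weights->{$value};
        return $value if $threshold < 0;
    }
    return $values[-1];                    # guard against rounding at the edge
}

# Made-up feature values and weights for illustration only.
srand(42);    # fixed seed so that repeated runs give the same counts
my %bias  = (never => 0.7, light => 0.2, heavy => 0.1);
my %count = map { $_ => 0 } keys %bias;
$count{ biased_choice(\%bias) }++ for 1 .. 1000;
print "$_: $count{$_}\n" for sort keys %count;
```

Over many draws, the counts approach the proportions given by the weights.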
=head1 THE C<Examples> DIRECTORY
See the C<Examples> directory in the distribution for how to construct a decision
tree, and how to then classify new data using the decision tree. To become more
familiar with the module, run the scripts

    construct_dt_and_classify_one_sample_case1.pl
    construct_dt_and_classify_one_sample_case2.pl
    construct_dt_and_classify_one_sample_case3.pl
    construct_dt_and_classify_one_sample_case4.pl

The first script is for the purely symbolic case, the second for the case that
involves both numeric and symbolic features, the third for the case of purely numeric
features, and the last for the case when the training data is synthetically generated
by the script C<generate_training_data_numeric.pl>.
Next run the following script as it is for bulk classification of data records placed
in a CSV file: