=item B<Calling the RegressionTree constructor:>

    my $training_datafile = "gendata5.csv";
    my $rt = Algorithm::RegressionTree->new(
                              training_datafile => $training_datafile,
                              dependent_variable_column => 2,
                              predictor_columns => [1],
                              mse_threshold => 0.01,
                              max_depth_desired => 2,
                              jacobian_choice => 0,
                              csv_cleanup_needed => 1,
             );

Note in particular the constructor parameters:

    dependent_variable_column
    predictor_columns
    mse_threshold
    jacobian_choice

The first of these parameters, C<dependent_variable_column>, is set to the column
index in the CSV file for the dependent variable.  The second constructor parameter,
C<predictor_columns>, tells the system which columns contain values for the
predictor variables.  The third parameter, C<mse_threshold>, is for deciding when to
partition the data at a node into two child nodes as a regression tree is being
constructed.  If the minmax of MSE (Mean Squared Error) that can be achieved by
partitioning any of the features at a node is smaller than C<mse_threshold>, that
node becomes a leaf node of the regression tree.
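
The leaf test described above can be sketched in a few lines of plain Perl.  This is
only an illustration under stated assumptions, not the module's own code: it reads
"minmax" as the smallest value, over all candidate split points, of the larger of the
two child MSEs, and it measures each child's MSE about the child's mean rather than
about a regression fit.  The helper names C<mse_about_mean> and
C<minmax_mse_for_feature> are hypothetical.

    use strict;
    use warnings;
    use List::Util qw(sum max min);

    # MSE of a set of values about their own mean (a simplification that
    # stands in for the error of a regression fit on that partition).
    sub mse_about_mean {
        my @v = @_;
        return 0 unless @v;
        my $mean = sum(@v) / @v;
        return sum(map { ($_ - $mean) ** 2 } @v) / @v;
    }

    # For one predictor, try a split between each pair of neighbouring
    # samples (ordered by x) and return the smallest achievable
    # max(left MSE, right MSE).  Expects at least two samples.
    sub minmax_mse_for_feature {
        my ($xs, $ys) = @_;
        my @idx = sort { $xs->[$a] <=> $xs->[$b] } 0 .. $#$xs;
        my @candidates;
        for my $k (0 .. $#idx - 1) {
            my @left  = map { $ys->[$_] } @idx[0 .. $k];
            my @right = map { $ys->[$_] } @idx[$k + 1 .. $#idx];
            push @candidates, max(mse_about_mean(@left), mse_about_mean(@right));
        }
        return min(@candidates);
    }

    my @x = (1, 2, 3, 4, 5, 6);
    my @y = (1.0, 1.1, 0.9, 5.0, 5.1, 4.9);
    my $mse_threshold = 0.01;
    my $minmax = minmax_mse_for_feature(\@x, \@y);
    printf "minmax MSE = %.4f, threshold = %.2f\n", $minmax, $mse_threshold;
    print $minmax < $mse_threshold
        ? "below threshold: the node becomes a leaf\n"
        : "not below threshold: the node is partitioned\n";

With several predictors, the same calculation would be repeated for each one and the
smallest result compared against C<mse_threshold>.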

The last parameter, C<jacobian_choice>, must be set to 0, 1, or 2.  Its
default value is 0. When this parameter equals 0, the regression coefficients are
calculated using the linear least-squares method and no further "refinement" of the
coefficients is carried out using gradient descent.  This is the fastest way to
calculate the regression coefficients.  When C<jacobian_choice> is set to 1, you get
a weak version of gradient descent in which the Jacobian is set to the "design
matrix" itself. Choosing 2 for C<jacobian_choice> results in a more reasonable
approximation to the Jacobian.  That, however, is at a cost of much longer
computation time.  B<NOTE:> For most cases, using 0 for C<jacobian_choice> is the
best choice.  See my tutorial "I<Linear Regression and Regression Trees>" for why
that is the case.
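
To make the C<jacobian_choice =E<gt> 0> case concrete, here is a minimal sketch of a
linear least-squares fit for a single predictor, using the closed-form solution of
the normal equations.  It is an illustration only and makes no claim about how the
module computes its coefficients internally; the helper name C<least_squares_line>
is hypothetical.

    use strict;
    use warnings;

    # Hypothetical helper: ordinary least-squares fit of y = slope*x + intercept
    # for one predictor, with no gradient-descent refinement afterwards.
    sub least_squares_line {
        my ($xs, $ys) = @_;
        my $n = @$xs;
        die "need at least two points" if $n < 2;
        my ($sx, $sy, $sxx, $sxy) = (0, 0, 0, 0);
        for my $i (0 .. $n - 1) {
            $sx  += $xs->[$i];
            $sy  += $ys->[$i];
            $sxx += $xs->[$i] ** 2;
            $sxy += $xs->[$i] * $ys->[$i];
        }
        my $denom = $n * $sxx - $sx ** 2;
        die "all x values are identical" if $denom == 0;
        my $slope     = ($n * $sxy - $sx * $sy) / $denom;
        my $intercept = ($sy - $slope * $sx) / $n;
        return ($slope, $intercept);
    }

    # Points that lie exactly on y = 2x + 1.
    my ($slope, $intercept) = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9]);
    printf "slope = %.3f, intercept = %.3f\n", $slope, $intercept;

Because the solution is obtained in one pass over the data, there is no iteration,
which is why this setting is the fastest of the three.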

=back

=head2 B<Methods defined for C<RegressionTree> class>

=over 8

=item B<get_training_data_for_regression():>

Only CSV training datafiles are allowed. Additionally, the first record in the file
must list the names of the fields, and the first column must contain an integer ID
for each record.
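
Before handing a file to the module, you may want to check it against these two
requirements yourself.  The sketch below is a hypothetical pre-flight check, not part
of the module.  It works on a list of lines, treats any first record that contains a
letter or underscore as a header, and uses a naive comma split that does not handle
quoted fields containing commas.

    use strict;
    use warnings;

    # Hypothetical helper: returns "ok" or a short description of the first
    # problem found in the given CSV lines.
    sub check_training_csv_lines {
        my @lines  = @_;
        my $header = shift @lines;
        return "missing header record"
            unless defined $header && $header =~ /[A-Za-z_]/;
        my $row = 1;
        for my $line (@lines) {
            $row++;
            next if $line =~ /^\s*$/;              # skip blank lines
            my $id = (split /,/, $line)[0] // '';  # first column
            $id =~ s/^\s*"?//;                     # strip leading space and quote
            $id =~ s/"?\s*$//;                     # strip trailing quote and space
            return "record $row: first column '$id' is not an integer ID"
                unless $id =~ /^\d+$/;
        }
        return "ok";
    }

    print check_training_csv_lines("id,xcoord,ycoord", "1,2.0,3.5", "2,2.5,4.1"), "\n";
    print check_training_csv_lines("id,xcoord,ycoord", "a,2.0,3.5"), "\n";

The column names C<xcoord> and C<ycoord> are made up for the example.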

=item B<construct_regression_tree():>

As the name implies, this is the method that constructs a regression tree.

=item B<display_regression_tree("     "):>

Displays the regression tree, as the name implies.  The white-space string argument
specifies the offset to use in displaying the child nodes in relation to a parent
node.

=item B<prediction_for_single_data_point( $root_node, $test_sample ):>

You call this method after you have constructed a regression tree if you want to
calculate the prediction for one sample.  The parameter C<$root_node> is what is
returned by the call to C<construct_regression_tree()>.  The formatting of the
argument bound to the C<$test_sample> parameter is important.  To elaborate, let's
say you are using two variables named C<xvar1> and C<xvar2> as your predictor
variables.  In this case, the C<$test_sample> parameter will be bound to an array
reference that will look like

    ['xvar1 = 23.4', 'xvar2 = 12.9'] 

An arbitrary amount of white space, including none, is allowed on the two sides of
the equality symbol in the construct shown above.  A call to this method returns a
hash with two key-value pairs.  One of the keys is called C<solution_path> and
the other C<prediction>.  The value associated with the key C<solution_path> is the
path in the regression tree to the leaf node that yielded the prediction.  And the
value associated with the key C<prediction> is the answer you are looking for.
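
The following sketch illustrates the C<'name = value'> format and its tolerance for
white space.  It is not the module's own parser; the helper C<parse_test_sample> is
hypothetical and is shown only so that you can check your own test samples before
passing them in.

    use strict;
    use warnings;

    # Hypothetical helper: turns ['xvar1 = 23.4', ...] into { xvar1 => '23.4', ... },
    # allowing any amount of white space (including none) around the '='.
    sub parse_test_sample {
        my ($sample) = @_;
        my %values;
        for my $entry (@$sample) {
            my ($name, $value) = $entry =~ /^\s*(\w+)\s*=\s*(\S+)\s*$/
                or die "malformed entry: '$entry'";
            $values{$name} = $value;
        }
        return \%values;
    }

    my $parsed = parse_test_sample(['xvar1 = 23.4', 'xvar2=12.9', 'xvar3   =   7']);
    print "$_ => $parsed->{$_}\n" for sort keys %$parsed;

An entry with no equality symbol, such as C<'xvar1 23.4'>, makes the helper die with
a message naming the offending entry.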

=item B<predictions_for_all_data_used_for_regression_estimation( $root_node ):>

This call calculates the predictions for all of the predictor-variable data in your
training file.  The parameter C<$root_node> is what is returned by the call to
C<construct_regression_tree()>.  The values for the dependent variable thus predicted
can be seen by calling C<display_all_plots()>, which is the method mentioned below.

=item B<display_all_plots():>

This method displays the results obtained by calling the prediction method of the
previous entry.  This method also creates a hardcopy of the plots and saves it as a
C<.png> disk file. The name of this output file is always C<regression_plots.png>.

=item B<mse_for_tree_regression_for_all_training_samples( $root_node ):>

This method carries out an error analysis of the predictions for the samples in your
training datafile.  It shows you the overall MSE (Mean Squared Error) with tree-based
regression, the MSE for the data samples at each of the leaf nodes of the regression
tree, and the MSE for the plain old Linear Regression as applied to all of the data.
The parameter C<$root_node> in the call syntax is what is returned by the call to
C<construct_regression_tree()>.

=item B<bulk_predictions_for_data_in_a_csv_file( $root_node, $filename, $columns ):>

Call this method if you want to apply the regression tree to all your test data in a
disk file.  The predictions for all of the test samples in the disk file are written
out to another file whose name is the same as that of the test file except for the
addition of C<_output> in the name of the file.  The parameter C<$filename> is the
name of the disk file that contains the test data. And the parameter C<$columns> is a
list of the column indices for the predictor variables in the test file.
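
Putting several of the methods above together, a typical session might look like the
sketch below.  Treat it as a sketch under stated assumptions rather than a tested
script: it assumes the class is loaded with C<require Algorithm::RegressionTree>,
that column 1 of C<gendata5.csv> is named C<xcoord> (a made-up name), and that the
value returned by C<prediction_for_single_data_point()> can be used as a hash
reference.  The wrapper C<run_regression_workflow> is hypothetical.

    use strict;
    use warnings;

    # Hypothetical wrapper around the calls documented above.
    sub run_regression_workflow {
        my ($training_datafile) = @_;
        require Algorithm::RegressionTree;    # assumed module name

        my $rt = Algorithm::RegressionTree->new(
            training_datafile         => $training_datafile,
            dependent_variable_column => 2,
            predictor_columns         => [1],
            mse_threshold             => 0.01,
            max_depth_desired         => 2,
            jacobian_choice           => 0,
            csv_cleanup_needed        => 1,
        );
        $rt->get_training_data_for_regression();
        my $root_node = $rt->construct_regression_tree();

        # Prediction for one sample; 'xcoord' is an assumed column name.
        my $answer = $rt->prediction_for_single_data_point($root_node, ['xcoord = 50']);
        # Assumption: the return value behaves as a hash reference.
        print "prediction: $answer->{prediction}\n";

        # Error analysis over the training samples.
        $rt->mse_for_tree_regression_for_all_training_samples($root_node);
        return $root_node;
    }

    # Run only when the training file is actually present.
    run_regression_workflow('gendata5.csv') if -e 'gendata5.csv';
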

=back

=head1 GENERATING SYNTHETIC TRAINING DATA

The module file contains the following additional classes: (1)
C<TrainingDataGeneratorNumeric>, and (2) C<TrainingDataGeneratorSymbolic> for
generating synthetic training data.

The class C<TrainingDataGeneratorNumeric> outputs one CSV file for the