Algorithm-DecisionTree


ExamplesBoosting/boosting_for_bulk_classification.pl

                              training_datafile => $training_datafile,
                              csv_class_column_index => $training_file_class_name_in_column,
                              csv_columns_for_features => $training_file_columns_for_feature_values,
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
                              how_many_stages => $how_many_stages,
                              csv_cleanup_needed => 1,
              );

print "Reading and processing training data...\n";
$boosted->get_training_data_for_base_tree();

##   UNCOMMENT THE FOLLOWING STATEMENT if you want to see the training data used for
##   just the base tree:
#$boosted->show_training_data_for_base_tree();

# This is a required call:
print "Calculating first-order probabilities...\n";
$boosted->calculate_first_order_probabilities_and_class_priors();

ExamplesBoosting/boosting_for_classifying_one_test_sample_1.pl

                              training_datafile => $training_datafile,
                              csv_class_column_index => 2,
                              csv_columns_for_features => [3,4,5,6,7,8],
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
                              how_many_stages => 4,
                              csv_cleanup_needed => 1,
             );

print "Reading and processing training data...\n";
$boosted->get_training_data_for_base_tree();

##   UNCOMMENT THE FOLLOWING STATEMENT if you want to see the training data used for
##   just the base tree:
#$boosted->show_training_data_for_base_tree();

# This is a required call:
print "Calculating first-order probabilities...\n";
$boosted->calculate_first_order_probabilities_and_class_priors();

ExamplesBoosting/boosting_for_classifying_one_test_sample_2.pl

                              training_datafile => $training_datafile,
                              csv_class_column_index => 1,
                              csv_columns_for_features => [2,3],
                              entropy_threshold => 0.01,
                              max_depth_desired => 8,
                              symbolic_to_numeric_cardinality_threshold => 10,
                              how_many_stages => 4,
                              csv_cleanup_needed => 1,
              );

print "Reading and processing training data...\n";
$boosted->get_training_data_for_base_tree();

##   UNCOMMENT THE FOLLOWING STATEMENT if you want to see the training data used for
##   just the base tree:
#$boosted->show_training_data_for_base_tree();

# This is a required call:
print "Calculating first-order probabilities...\n";
$boosted->calculate_first_order_probabilities_and_class_priors();

lib/Algorithm/DecisionTree.pm


B<Version 3.43:> This version fixes a bug in the C<csv_cleanup_needed()> function.
The source of the bug was a typo in a regex component meant to match white
space.  I have also made one additional change to this function to increase its
versatility.  With this change, you are now allowed to have empty strings as values
for features.

B<Version 3.42:> This version reintroduces C<csv_cleanup_needed> as an optional
parameter in the module constructor.  This was done in response to several requests
received from the user community. (Previously, all line records from a CSV file were
processed by the C<cleanup_csv()> function no matter what.)  The main point made by
the users was that invoking C<cleanup_csv()> when there was no need for CSV clean-up
exacted a performance penalty when ingesting large database files with tens of
thousands of line records.  In addition to making C<csv_cleanup_needed> optional, I
have also tweaked the code in the C<cleanup_csv()> function so that it can extract
data from a larger range of messy CSV files.

B<Version 3.41:> All the changes made in this version relate to the construction of
regression trees.  I have fixed a couple of bugs in the calculation of the regression
coefficients. Additionally, the C<RegressionTree> class now comes with a new
constructor parameter named C<jacobian_choice>.  For most cases, you'd set this

lib/Algorithm/DecisionTree.pm

B<Version 3.21:> This version makes it easier to use a CSV training file that
violates the assumption that commas are used only to separate the different field
values in a line record.  Some large econometrics databases use double-quoted values
for fields, and these values may contain commas (presumably for better readability).
This version also allows you to specify the leftmost entry in the first CSV record
that names all the fields. Previously, this entry was required to be an empty
double-quoted string.  I have also made some minor changes to the
C<get_training_data_from_csv()> method to make it more user-friendly for large
training files that may contain tens of thousands of records.  When pulling training
data from such files, this method prints out a dot on the terminal screen for every
10,000 records it has processed.

B<Version 3.20:> This version brings the boosting capability to the C<DecisionTree>
module.

B<Version 3.0:> This version adds bagging to the C<DecisionTree> module. If your
training dataset is large enough, you can ask the module to construct multiple
decision trees using data bags extracted from your dataset.  The module can show you
the results returned by the individual decision trees and also the results obtained
by taking a majority vote of the classification decisions made by the individual
trees.  You can specify any arbitrary extent of overlap between the data bags.

lib/Algorithm/DecisionTree.pm

    $introspector->display_training_training_samples_to_nodes_influence_propagation();

A training sample affects a node directly if the sample falls in the portion of the
feature space assigned to that node. On the other hand, a training sample is
considered to affect a node indirectly if the node is a descendant of a node that is
affected directly by the sample.


=head1 BULK CLASSIFICATION OF DATA RECORDS

For large test datasets, you would obviously want to process an entire file of test
data at a time. The following scripts in the C<Examples> directory illustrate how you
can do that:

      classify_test_data_in_a_file.pl

This script requires three command-line arguments: the first names the training
datafile, the second the test datafile, and the third the file in which the
classification results are to be deposited.
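A typical invocation would therefore look like the following sketch. The CSV
filenames shown here are placeholders, not files guaranteed to ship with the
distribution; substitute your own training file, test file, and desired output
file:

```shell
# Run from the Examples directory of the distribution.
# Arguments: <training datafile> <test datafile> <output file for results>
# (the filenames below are hypothetical placeholders)
perl classify_test_data_in_a_file.pl training_data.csv test_data.csv results.csv
```

On completion, the classification results for all the records in the test
datafile are deposited in the output file named by the third argument.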

The other examples directories, C<ExamplesBagging>, C<ExamplesBoosting>, and


