Algorithm-DecisionTree


Examples/README


        construct_dt_and_classify_one_sample_case1.pl

        construct_dt_and_classify_one_sample_case2.pl

        construct_dt_and_classify_one_sample_case3.pl

        construct_dt_and_classify_one_sample_case4.pl

    as they are.  The first script is for the purely symbolic case, the
    second for a case that involves both numeric and symbolic features, the
    third for the case of purely numeric features, and the last for the
    case when the training data is synthetically generated by the script
    generate_training_data_numeric.pl.

    Next, try to modify the test sample in these scripts and see what
    classification results you get for the new test samples.
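
    To get a feel for the API that these scripts exercise, here is a
    minimal sketch along the lines of
    construct_dt_and_classify_one_sample_case1.pl.  The column indices,
    parameter values, and the feature=value strings in the test sample
    below are placeholders for illustration only:

        use strict;
        use warnings;
        use Algorithm::DecisionTree;

        # Construct a decision tree from purely symbolic training data:
        my $dt = Algorithm::DecisionTree->new(
                     training_datafile        => 'training_symbolic.csv',
                     csv_class_column_index   => 1,          # which column holds the class label
                     csv_columns_for_features => [2,3,4,5],  # which columns hold the features
                     entropy_threshold        => 0.01,
                     max_depth_desired        => 5,
                 );
        $dt->get_training_data();
        $dt->calculate_first_order_probabilities();
        $dt->calculate_class_priors();
        my $root_node = $dt->construct_decision_tree_classifier();

        # Classify one test sample; the feature=value strings are illustrative only:
        my @test_sample = ('exercising=never', 'smoking=heavy',
                           'fatIntake=heavy',  'videoAddiction=heavy');
        my $classification = $dt->classify($root_node, \@test_sample);
        foreach my $class (sort keys %$classification) {
            next if $class eq 'solution_path';   # the solution path is an array ref
            print "$class => $classification->{$class}\n";
        }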



(2) The first script mentioned above uses the training file
    'training_symbolic.csv', the second and the third scripts listed above
    use the training file `stage3cancer.csv', and the last script named
    above uses the training data file `training.csv'.  Regarding
    these training files:

       training_symbolic.csv:    See the script

                                    'generate_training_data_symbolic.pl'

                                 regarding how this purely symbolic data
                                 is generated.

Examples/README


          training2.csv

          training3.csv

    These are similar to the file `training.csv' in the sense that
    they too contain two classes, each a 2D Gaussian distribution.
    The first, `training2.csv', was generated by the script
    `generate_training_data_numeric.pl' using the parameter file

           param_numeric_strongly_overlapping_classes.txt

    and the second, `training3.csv', was generated by the same script
    using the parameter file

           param_numeric_extremely_overlapping_classes.txt
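
    If you want to generate similar data yourself, note that this script is
    a thin wrapper around the module's TrainingDataGeneratorNumeric class.
    The sketch below shows its typical use; the method names are assumptions
    modeled on the script, and the output file name and sample count are
    placeholders, so consult generate_training_data_numeric.pl for the exact
    calls:

        use Algorithm::DecisionTree;

        # Sketch only; method names are assumptions, see the script:
        my $generator = TrainingDataGeneratorNumeric->new(
                            output_csv_file => 'my_training.csv',
                            parameter_file  => 'param_numeric_strongly_overlapping_classes.txt',
                            number_of_samples_per_class => 300,
                        );
        $generator->read_parameter_file_numeric();
        $generator->gen_numeric_training_data_and_write_to_csv();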


(3) So far we have talked about classifying one test data record at a time.
    You can place multiple test data records in a disk file and classify
    them all in one go.  To see how that can be done, execute the following
    command line in the `examples' directory:

     classify_test_data_in_a_file.pl   training4.csv   test4.csv   out4.csv

    This script constructs the decision tree from the data in the first
    argument file and then uses it to classify the data in the second
    argument file.  The computed class labels are deposited in the third
    argument file.

    In general, the test data files should look identical to the training
    data files.  Of course, for real-world test data, you will not have the
    class labels for the test samples.  You are still required to reserve a
    column for the class label, which now must be just the empty string ""
    for each data record.  For example, the test data supplied in the
    following two calls through the files test4_no_class_labels.csv and
    test4_no_class_labels.dat does not mention class labels:

Examples/README


       EvalTrainingData

defined in the main DecisionTree module file makes it straightforward to
evaluate the class discriminatory power of your data (as long as it resides in
a `.csv' file).  This new class is a subclass of the DecisionTree class
in the module file.

Both the `evaluate' scripts mentioned above are identical in terms of their
usage logic.  The first is specifically for the training data file
`stage3cancer.csv' and the second for the training data files `training.csv',
`training2.csv', and `training3.csv'.  The latter three data files contain
two Gaussian classes that are increasingly overlapping.  You can see for
yourself the decreasing quality of the training data as you evaluate first
the training file `training.csv', then the training file `training2.csv',
and finally the training file `training3.csv'.
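
A minimal sketch of how such an evaluation script drives the EvalTrainingData
class follows; the column indices and parameter values are placeholders, so
consult the two `evaluate' scripts for the settings appropriate to each data
file:

    use Algorithm::DecisionTree;

    # Evaluate the class-discriminatory power of the data in training3.csv.
    # Column indices and parameter values here are placeholders:
    my $evaluator = EvalTrainingData->new(
                        training_datafile        => 'training3.csv',
                        csv_class_column_index   => 1,
                        csv_columns_for_features => [2,3],
                        entropy_threshold        => 0.01,
                        max_depth_desired        => 8,
                        symbolic_to_numeric_cardinality_threshold => 10,
                    );
    $evaluator->get_training_data();
    $evaluator->evaluate_training_data();   # runs the cross-validation based evaluation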


===========================================================================

                  USING THE DT INTROSPECTION CLASS

Examples/README


    introspection_in_a_loop_interactive.pl

    introspection_show_training_samples_at_all_nodes_direct_influence.pl

    introspection_show_training_samples_to_nodes_influence_propagation.pl

The first script places you in an interactive session in which you will be
asked for the node number you are interested in.  Subsequently, you will be
asked whether or not you are interested in specific questions that
introspection can provide answers for. The second script descends the
decision tree and shows for each node the training samples that fall
directly in the portion of the feature space assigned to that node.  The
third script shows for each training sample how it affects the
decision-tree nodes either directly or indirectly through the
generalization achieved by the probabilistic modeling of the data.

===========================================================================


              GENERATING SYNTHETIC TRAINING AND TEST DATA

lib/Algorithm/BoostedDecisionTree.pm

            $sum_of_probs += $self->{_sample_selection_probs}->{$stage_index}->{$sample};
            push @training_samples_this_stage, $sample if $sum_of_probs < 0.5;
            last if $sum_of_probs > 0.5;
        }
        $self->{_training_samples}->{$stage_index} = [sort {sample_index($a) <=> sample_index($b)} @training_samples_this_stage];
        if ($self->{_stagedebug}) {
            print "\nTraining samples for stage $stage_index: @{$self->{_training_samples}->{$stage_index}}\n\n";
            my $num_of_training_samples = @{$self->{_training_samples}->{$stage_index}};
            print "\nNumber of training samples this stage $num_of_training_samples\n\n";
        }
        # Find which of this stage's training samples were misclassified in the previous stage (set intersection):
        my %misclassified_samples = map {$_ => 1} @{$self->{_misclassified_samples}->{$stage_index-1}};
        my @training_samples_selection_check = grep $misclassified_samples{$_}, @{$self->{_training_samples}->{$stage_index}};
        if ($self->{_stagedebug}) {
            my @training_in_misclassified = sort {sample_index($a) <=> sample_index($b)} @training_samples_selection_check;
            print "\nTraining samples in the misclassified set: @training_in_misclassified\n";
            my $how_many = @training_samples_selection_check;
            print "\nNumber_of_miscalssified_samples_in_training_set: $how_many\n";
        }
        my $dt_this_stage = Algorithm::DecisionTree->new('boostingmode');
        $dt_this_stage->{_training_data_hash} = { map {$_ => $self->{_all_training_data}->{$_} } @{$self->{_training_samples}->{$stage_index}} };

lib/Algorithm/DecisionTree.pm

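    # If the test sample carries no value for the feature tested at this node,
    # stop the descent and return the class probabilities stored at this node: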
    if (!defined $value_for_feature) {
        my @leaf_node_class_probabilities = @{$node->get_class_probabilities()};
        foreach my $i (0..@{$self->{_class_names}}-1) {
            $answer{$self->{_class_names}->[$i]} = $leaf_node_class_probabilities[$i];
        }
        push @{$answer{'solution_path'}}, $node->get_serial_num();
        return \%answer;
    }
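    # Otherwise follow the child whose branch test (a numeric threshold or a
    # symbolic feature=value combination) is satisfied by the sample's value: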
    if ($value_for_feature) {
        if (contained_in($feature_tested_at_node, keys %{$self->{_prob_distribution_numeric_features_hash}})) {
            print( "\nCLRD2 In the truly numeric section") if $self->{_debug3};
            my $pattern1 = '(.+)<(.+)';
            my $pattern2 = '(.+)>(.+)';
            foreach my $child (@children) {
                my @branch_features_and_values = @{$child->get_branch_features_and_values_or_thresholds()};
                my $last_feature_and_value_on_branch = $branch_features_and_values[-1]; 
                if ($last_feature_and_value_on_branch =~ /$pattern1/) {
                    my ($feature, $threshold) = ($1,$2); 
                    if ($value_for_feature <= $threshold) {
                        $path_found = 1;
                        %answer = %{$self->recursive_descent_for_classification($child,

lib/Algorithm/DecisionTree.pm

                        %answer = %{$self->recursive_descent_for_classification($child,
                                                                            $features_and_values,\%answer)};
                        push @{$answer{'solution_path'}}, $node->get_serial_num();
                        last;
                    }
                }
            }
            return \%answer if $path_found;
        } else {
            my $feature_value_combo = "$feature_tested_at_node" . '=' . "$value_for_feature";
            print "\nCLRD3 In the symbolic section with feature_value_combo: $feature_value_combo\n" 
                if $self->{_debug3};
            foreach my $child (@children) {
                my @branch_features_and_values = @{$child->get_branch_features_and_values_or_thresholds()};
                print "\nCLRD4 branch features and values: @branch_features_and_values\n" if $self->{_debug3};
                my $last_feature_and_value_on_branch = $branch_features_and_values[-1]; 
                if ($last_feature_and_value_on_branch eq $feature_value_combo) {
                    %answer = %{$self->recursive_descent_for_classification($child,
                                                                              $features_and_values,\%answer)};
                    push @{$answer{'solution_path'}}, $node->get_serial_num();
                    $path_found = 1;

lib/Algorithm/DecisionTree.pm

                $self->recursive_descent($left_child_node);
            }
            if ($best_entropy_for_greater < $existing_node_entropy - $self->{_entropy_threshold}) {
                my $right_child_node = DTNode->new(undef, $best_entropy_for_greater,
                                                         \@class_probabilities_for_greaterthan_child_node,
                            \@extended_branch_features_and_values_or_thresholds_for_greaterthan_child, $self);
                $node->add_child_link($right_child_node);
                $self->recursive_descent($right_child_node);
            }
        } else {
            print "\nRD16 RECURSIVE DESCENT: In section for symbolic features for creating children"
                if $self->{_debug3};
            my @values_for_feature = @{$self->{_features_and_unique_values_hash}->{$best_feature}};
            print "\nRD17 Values for feature $best_feature are @values_for_feature\n" if $self->{_debug3};
            my @feature_value_combos = sort map {"$best_feature" . '=' . $_} @values_for_feature;
            my @class_entropies_for_children = ();
            foreach my $feature_and_value_index (0..@feature_value_combos-1) {
                print "\nRD18 Creating a child node for: $feature_value_combos[$feature_and_value_index]\n"
                    if $self->{_debug3};
                my @extended_branch_features_and_values_or_thresholds;
                if (! @features_and_values_or_thresholds_on_branch) {

lib/Algorithm/DecisionTree.pm


                push @partitioning_entropies, $partitioning_entropy;
                $partitioning_point_child_entropies_hash{$feature_name}{$value} = [$entropy1, $entropy2];
            }
            my ($min_entropy, $best_partition_point_index) = minimum(\@partitioning_entropies);
            if ($min_entropy < $existing_node_entropy) {
                $partitioning_point_threshold{$feature_name} = $newvalues[$best_partition_point_index];
                $entropy_values_for_different_features{$feature_name} = $min_entropy;
            }
        } else {
            print "\nBFC2:  Entering section reserved for symbolic features\n" if $self->{_debug3};
            print "\nBFC3 Feature name: $feature_name\n" if $self->{_debug3};
            my %seen;
            my @values = grep {$_ ne 'NA' && !$seen{$_}++} 
                                    @{$self->{_features_and_unique_values_hash}->{$feature_name}};
            @values = sort @values;
            print "\nBFC4 values for feature $feature_name are @values\n" if $self->{_debug3};

            my $entropy = 0;
            foreach my $value (@values) {
                my $feature_value_string = "$feature_name" . '=' . "$value";

lib/Algorithm/DecisionTree.pm

            foreach my $i (0..@values_for_feature-1) {
                $self->{_probability_cache}->{$values_for_feature[$i]} = $probs[$i];
            }
            if (defined($value) && exists $self->{_probability_cache}->{$feature_and_value}) {
                return $self->{_probability_cache}->{$feature_and_value};
            } else {
                return 0;
            }
        }
    } else {
        # This section is only for purely symbolic features:  
        my @values_for_feature = @{$self->{_features_and_values_hash}->{$feature_name}};        
        @values_for_feature = map {"$feature_name=$_"} @values_for_feature;
        my @value_counts = (0) x @values_for_feature;
#        foreach my $sample (sort {sample_index($a) cmp sample_index($b)} keys %{$self->{_training_data_hash}}) {
        foreach my $sample (sort {sample_index($a) <=> sample_index($b)} keys %{$self->{_training_data_hash}}) {
            my @features_and_values = @{$self->{_training_data_hash}->{$sample}};
            foreach my $i (0..@values_for_feature-1) {
                for my $current_value (@features_and_values) {
                    $value_counts[$i]++ if $values_for_feature[$i] eq $current_value;
                }

lib/Algorithm/DecisionTree.pm

                                                                     @sampling_points_for_feature;
            foreach my $i (0..@values_for_feature_and_class-1) {
                $self->{_probability_cache}->{$values_for_feature_and_class[$i]} = $probs[$i];
            }
            if (exists $self->{_probability_cache}->{$feature_value_class}) {
                return $self->{_probability_cache}->{$feature_value_class};
            } else {
                return 0;
            }
        } else {
            # This section is for numeric features that will be treated symbolically
            my %seen = ();
            my @values_for_feature = grep {$_ ne 'NA' && !$seen{$_}++}
                                                 @{$self->{_features_and_values_hash}->{$feature_name}};
            @values_for_feature = map {"$feature_name=$_"} @values_for_feature;
            my @value_counts = (0) x @values_for_feature;
            foreach my $sample (@samples_for_class) {
                my @features_and_values = @{$self->{_training_data_hash}->{$sample}};
                foreach my $i (0..@values_for_feature-1) {
                    foreach my $current_value (@features_and_values) {
                        $value_counts[$i]++ if $values_for_feature[$i] eq $current_value;

lib/Algorithm/DecisionTree.pm

                $self->{_probability_cache}->{$feature_and_value_and_class} = 
                                                           $value_counts[$i] / (1.0 * $total_counts);
            }
            if (exists $self->{_probability_cache}->{$feature_value_class}) {
                return $self->{_probability_cache}->{$feature_value_class};
            } else {
                return 0;
            }
        }
    } else {
        # This section is for purely symbolic features
        my %seen = ();
        my @values_for_feature = grep {$_ ne 'NA' && !$seen{$_}++}
                                             @{$self->{_features_and_values_hash}->{$feature_name}};
        @values_for_feature = map {"$feature_name=$_"} @values_for_feature;
        my @value_counts = (0) x @values_for_feature;
        foreach my $sample (@samples_for_class) {
            my @features_and_values = @{$self->{_training_data_hash}->{$sample}};
            foreach my $i (0..@values_for_feature-1) {
                foreach my $current_value (@features_and_values) {

lib/Algorithm/DecisionTree.pm

        my $feature_at_node = $self->get_feature() || " ";
        my $node_creation_entropy_at_node = $self->get_node_entropy();
        my $print_node_creation_entropy_at_node = sprintf("%.3f", $node_creation_entropy_at_node);
        my @branch_features_and_values_or_thresholds = @{$self->get_branch_features_and_values_or_thresholds()};
        my @class_probabilities = @{$self->get_class_probabilities()};
        my @print_class_probabilities = map {sprintf("%0.3f", $_)} @class_probabilities;
        my @class_names = @{$self->get_class_names()};
        my @print_class_probabilities_with_class =
            map {"$class_names[$_]" . '=>' . $print_class_probabilities[$_]} 0..@class_names-1;
        print "NODE $serial_num: $offset BRANCH TESTS TO NODE: @branch_features_and_values_or_thresholds\n";
        my $second_line_offset = "$offset" . " " x (8 + length("$serial_num"));
        print "$second_line_offset" . "Decision Feature: $feature_at_node    Node Creation Entropy: " ,
              "$print_node_creation_entropy_at_node   Class Probs: @print_class_probabilities_with_class\n\n";
        $offset .= "   ";
        foreach my $child (@{$self->get_children()}) {
            $child->display_decision_tree($offset);
        }
    } else {
        my $node_creation_entropy_at_node = $self->get_node_entropy();
        my $print_node_creation_entropy_at_node = sprintf("%.3f", $node_creation_entropy_at_node);
        my @branch_features_and_values_or_thresholds = @{$self->get_branch_features_and_values_or_thresholds()};
        my @class_probabilities = @{$self->get_class_probabilities()};
        my @print_class_probabilities = map {sprintf("%0.3f", $_)} @class_probabilities;
        my @class_names = @{$self->get_class_names()};
        my @print_class_probabilities_with_class =
            map {"$class_names[$_]" . '=>' . $print_class_probabilities[$_]} 0..@class_names-1;
        print "NODE $serial_num: $offset BRANCH TESTS TO LEAF NODE: @branch_features_and_values_or_thresholds\n";
        my $second_line_offset = "$offset" . " " x (8 + length("$serial_num"));
        print "$second_line_offset" . "Node Creation Entropy: $print_node_creation_entropy_at_node   " .
              "Class Probs: @print_class_probabilities_with_class\n\n";
    }
}


##############################  Generate Your Own Numeric Training Data  #################################
#############################      Class TrainingDataGeneratorNumeric     ################################

##  See the script generate_training_data_numeric.pl in the examples
##  directory on how to use this class for generating your own numeric training and test data.

lib/Algorithm/DecisionTree.pm

decision trees using data bags extracted from your dataset.  The module can show you
the results returned by the individual decision trees and also the results obtained
by taking a majority vote of the classification decisions made by the individual
trees.  You can specify any arbitrary extent of overlap between the data bags.

B<Version 2.31:> The introspection capability in this version packs more of a punch.
For each training data sample, you can now figure out not only the decision-tree
nodes that are affected directly by that sample, but also those nodes that are
affected indirectly through the generalization achieved by the probabilistic modeling
of the data.  The 'examples' directory of this version includes additional scripts
that illustrate these enhancements to the introspection capability.  See the section
"The Introspection API" for a declaration of the introspection related methods, old
and new.

B<Version 2.30:> In response to requests from several users, this version includes a new
capability: You can now ask the module to introspect about the classification
decisions returned by the decision tree.  Toward that end, the module includes a new
class named C<DTIntrospection>.  Perhaps the most important bit of information you
are likely to seek through DT introspection is the list of the training samples that
fall directly in the portion of the feature space that is assigned to a node.
B<CAVEAT:> When training samples are non-uniformly distributed in the underlying
feature space, IT IS POSSIBLE FOR A NODE TO EXIST EVEN WHEN NO TRAINING SAMPLES FALL
IN THE PORTION OF THE FEATURE SPACE ASSIGNED TO THE NODE.  B<(This is an important
part of the generalization achieved by probabilistic modeling of the training data.)>
For additional information related to DT introspection, see the section titled
"DECISION TREE INTROSPECTION" in this documentation page.

B<Version 2.26> fixes a bug in the part of the module that some folks use for generating
synthetic data for experimenting with decision tree construction and classification.
In the class C<TrainingDataGeneratorNumeric> that is a part of the module, there
was a problem with the order in which the features were recorded from the
user-supplied parameter file.  The basic code for decision tree construction and
classification remains unchanged.

B<Version 2.25> further downshifts the required version of Perl for this module.  This

lib/Algorithm/DecisionTree.pm



=head1 HOW TO MAKE THE BEST CHOICES FOR THE CONSTRUCTOR PARAMETERS

Assuming your training data is good, the quality of the results you get from a
decision tree would depend on the choices you make for the constructor parameters
C<entropy_threshold>, C<max_depth_desired>, and
C<symbolic_to_numeric_cardinality_threshold>.  You can optimize your choices for
these parameters by running the 10-fold cross-validation test that is made available
in Versions 2.2 and higher through the new class C<EvalTrainingData> that is included
in the module file.  A description of how to run this test is in the previous section
of this document.


=head1 DECISION TREE INTROSPECTION

Starting with Version 2.30, you can ask the C<DTIntrospection> class of the module to
explain the classification decisions made at the different nodes of the decision
tree.

Perhaps the most important bit of information you are likely to seek through DT

lib/Algorithm/DecisionTree.pm


    introspection_in_a_loop_interactive.pl

    introspection_show_training_samples_at_all_nodes_direct_influence.pl

    introspection_show_training_samples_to_nodes_influence_propagation.pl

The first script places you in an interactive session in which you will first be
asked for the node number you are interested in.  Subsequently, you will be asked
whether or not you are interested in specific questions that the introspection can
provide answers for. The second script descends the decision tree and shows for
each node the training samples that fall directly in the portion of the feature space
assigned to that node.  The third script shows for each training sample how it
affects the decision-tree nodes either directly or indirectly through the
generalization achieved by the probabilistic modeling of the data.

The output of the script
C<introspection_show_training_samples_at_all_nodes_direct_influence.pl> looks like:

    Node 0: the samples are: None
    Node 1: the samples are: [sample_46 sample_58]

lib/Algorithm/DecisionTree.pm

    ...

For each training sample, the display shown above first presents the list of nodes
that are directly affected by the sample.  A node is affected directly by a sample if
the latter falls in the portion of the feature space that belongs to the former.
Subsequently, for each training sample, the display shows a subtree of the nodes that
are affected indirectly by the sample through the generalization achieved by the
probabilistic modeling of the data.  In general, a node is affected indirectly by a
sample if it is a descendant of another node that is affected directly.

Also see the section titled B<The Introspection API> regarding how to invoke the
introspection capabilities of the module in your own code.
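
As a quick orientation, the skeleton of such code looks like the sketch below.
The method names mirror the introspection scripts listed earlier; check the
declarations in "The Introspection API" section for the exact set of methods
and their arguments:

    # $dt is a previously constructed Algorithm::DecisionTree instance.
    # Method names below mirror the scripts listed above; check "The
    # Introspection API" section for the exact declarations:
    my $introspector = DTIntrospection->new($dt);
    $introspector->initialize();

    # Interactive explanations of the classification decisions at the nodes you pick:
    $introspector->explain_classifications_at_multiple_nodes_interactively();

    # Tabular display of the training samples that land directly at each node:
    $introspector->display_training_samples_at_all_nodes_direct_influence_only();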

=head1 METHODS

The module provides the following methods for constructing a decision tree from
training data in a disk file and for classifying new data records with the decision
tree thus constructed:

=over 4

lib/Algorithm/DecisionTree.pm

may use too large a number of bins for estimating the probabilities and that may slow
down the calculation of the decision tree.  You can get around this difficulty by
explicitly giving a value to the 'C<number_of_histogram_bins>' parameter.

=back


You can choose the best values to use for the last three constructor parameters by
running a 10-fold cross-validation test on your training data through the class
C<EvalTrainingData> that comes with Versions 2.1 and higher of this module.  See the
section "TESTING THE QUALITY OF YOUR TRAINING DATA" of this document page.

=over

=item B<get_training_data():>

After you have constructed a new instance of the C<Algorithm::DecisionTree> class,
you must read in the training data from the file named in the call to the
constructor.  You do that by calling:

    $dt->get_training_data(); 

lib/Algorithm/DecisionTree.pm


=head1 BULK CLASSIFICATION OF DATA RECORDS

For large test datasets, you would obviously want to process an entire file of test
data at a time. The following script in the C<Examples> directory illustrates how you
can do that:

      classify_test_data_in_a_file.pl

This script requires three command-line arguments: the first names the
training datafile, the second the test datafile, and the third the file in
which the classification results are to be deposited.
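
For example, using the training, test, and output file names that appear
elsewhere in this documentation:

      classify_test_data_in_a_file.pl   training4.csv   test4.csv   out4.csv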

The other examples directories, C<ExamplesBagging>, C<ExamplesBoosting>, and
C<ExamplesRandomizedTrees>, also contain scripts that illustrate how to carry out
bulk classification of data records when you wish to take advantage of bagging,
boosting, or tree randomization.  In their respective directories, these scripts are
named:

    bagging_for_bulk_classification.pl
    boosting_for_bulk_classification.pl

lib/Algorithm/DecisionTree.pm

             );

Note in particular the constructor parameters:

    dependent_variable
    predictor_columns
    mse_threshold
    jacobian_choice

The first of these parameters, C<dependent_variable>, is set to the column index in
the CSV file for the dependent variable.  The second constructor parameter,
C<predictor_columns>, tells the system as to which columns contain values for the
predictor variables.  The third parameter, C<mse_threshold>, is for deciding when to
partition the data at a node into two child nodes as a regression tree is being
constructed.  If the minmax of MSE (Mean Squared Error) that can be achieved by
partitioning any of the features at a node is smaller than C<mse_threshold>, that
node becomes a leaf node of the regression tree.
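
Putting these parameters together, a constructor call has roughly the following
shape; the file name, column indices, and threshold value below are
placeholders, and only the four parameter names come from the description
above:

    my $rt = Algorithm::RegressionTree->new(
                 training_datafile  => 'gendata.csv',  # placeholder file name
                 dependent_variable => 2,      # CSV column of the dependent variable
                 predictor_columns  => [1],    # CSV columns of the predictor variables
                 mse_threshold      => 0.01,   # stop partitioning when below this MSE
                 jacobian_choice    => 0,      # the default; see below
             );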

The last parameter, C<jacobian_choice>, must be set to either 0 or 1 or 2.  Its
default value is 0. When this parameter equals 0, the regression coefficients are
calculated using the linear least-squares method and no further "refinement" of the

lib/Algorithm/DecisionTree.pm


See the C<Examples> directory in the distribution for how to construct a decision
tree, and how to then classify new data using the decision tree.  To become more
familiar with the module, run the scripts

    construct_dt_and_classify_one_sample_case1.pl
    construct_dt_and_classify_one_sample_case2.pl
    construct_dt_and_classify_one_sample_case3.pl
    construct_dt_and_classify_one_sample_case4.pl

The first script is for the purely symbolic case, the second for the case that
involves both numeric and symbolic features, the third for the case of purely numeric
features, and the last for the case when the training data is synthetically generated
by the script C<generate_training_data_numeric.pl>.

Next run the following script as it is for bulk classification of data records placed
in a CSV file:

    classify_test_data_in_a_file.pl   training4.csv   test4.csv   out4.csv

The script first constructs a decision tree using the training data in the file
supplied through the first argument, C<training4.csv>.  The script then
calculates the class label for each data record in the test data file supplied
through the second argument, C<test4.csv>.  The estimated class labels are
written out to the output file, which in the call shown above is C<out4.csv>.  An
important thing to note here is that your test file --- in this case C<test4.csv> ---
must have a column for class labels.  Obviously, in real-life situations, there will
be no class labels in this column.  When that is the case, you can place an empty
string C<""> there for each data record. This is demonstrated by the following call:

    classify_test_data_in_a_file.pl   training4.csv   test4_no_class_labels.csv   out4.csv

The following script in the C<Examples> directory

lib/Algorithm/DecisionTree.pm


    introspection_show_training_samples_at_all_nodes_direct_influence.pl

    introspection_show_training_samples_to_nodes_influence_propagation.pl

The first script illustrates how to use the C<DTIntrospection> class of the module
interactively for generating explanations for the classification decisions made at
the nodes of the decision tree.  In the interactive session you are first asked for
the node number you are interested in.  Subsequently, you are asked whether or
not you are interested in specific questions that the introspector can provide
answers for. The second script generates a tabular display that shows for each node
of the decision tree a list of the training samples that fall directly in the portion
of the feature space assigned to that node.  (As mentioned elsewhere in this
documentation, when this list is empty for a node, that means the node is a result of
the generalization achieved by probabilistic modeling of the data.  Note that this
module constructs a decision tree NOT by partitioning the set of training samples,
BUT by partitioning the domains of the probability density functions.)  The third
script listed above also generates a tabular display, but one that shows how the
influence of each training sample propagates in the tree.  This display first shows
the list of nodes that are affected directly by the data in a training sample. This
list is followed by an indented display of the nodes that are affected indirectly by

lib/Algorithm/DecisionTree.pm

    bagging_for_classifying_one_test_sample.pl
                                                                                               
    bagging_for_bulk_classification.pl

As the names of the scripts imply, the first shows how to call the different methods
of the C<DecisionTreeWithBagging> class for classifying a single test sample.  When
you are classifying a single test sample, you can also see how each bag is
classifying the test sample.  You can, for example, display the training data used in
each bag, the decision tree constructed for each bag, etc.

The second script is for the case when you place all of the test samples in a single
file.  The demonstration script displays for each test sample a single aggregate
classification decision that is obtained through majority voting by all the decision
trees.
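
For orientation, the single-sample script is organized roughly along the lines
of the sketch below.  The constructor parameters beyond the usual
C<Algorithm::DecisionTree> ones, the method names, and the data file and column
indices shown here are assumptions modeled on
C<bagging_for_classifying_one_test_sample.pl>; consult that script for the
exact API:

    # A rough sketch only; names below are assumptions, see the demo script:
    use Algorithm::DecisionTreeWithBagging;

    my $dtbag = Algorithm::DecisionTreeWithBagging->new(
                    training_datafile        => 'stage3cancer.csv',  # placeholder
                    csv_class_column_index   => 2,                   # placeholder
                    csv_columns_for_features => [3,4,5,6,7,8],       # placeholder
                    how_many_bags            => 4,
                    bag_overlap_fraction     => 0.2,
                );
    $dtbag->get_training_data_for_bagging();
    $dtbag->construct_decision_trees_for_bags();

    # Classify one test sample (illustrative feature=value strings):
    my @test_sample = ('g2=4.2', 'grade=2.3', 'gleason=4',
                       'eet=1.7', 'age=55.0', 'ploidy=diploid');
    $dtbag->classify_with_bagging(\@test_sample);
    $dtbag->display_classification_results_for_each_bag();
    $dtbag->get_majority_vote_classification();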


=head1 THE C<ExamplesBoosting> DIRECTORY

The C<ExamplesBoosting> subdirectory in the main installation directory contains the
following three scripts:

lib/Algorithm/DecisionTree.pm

needle-in-a-haystack and big-data classification problems. These scripts are:

    randomized_trees_for_classifying_one_test_sample_1.pl

    randomized_trees_for_classifying_one_test_sample_2.pl

    classify_database_records.pl

The first script shows the constructor options to use for solving a
needle-in-a-haystack problem --- that is, a problem in which a vast majority of the
training data belongs to just one class.  The second script shows the constructor
options for using randomized decision trees for the case when you have access to a
very large database of training samples and you'd like to construct an ensemble of
decision trees using training samples pulled randomly from the training database.
The last script illustrates how you can evaluate the classification power of an
ensemble of decision trees as constructed by C<RandomizedTreesForBigData> by classifying
a large number of test samples extracted randomly from the training database.
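
As a rough illustration of the first case, the constructor call in such a
script has the following shape.  The option names are assumptions modeled on
the demo script, and the file name and column indices are placeholders; verify
everything against C<randomized_trees_for_classifying_one_test_sample_1.pl>:

    # Sketch only; option names are assumptions, see the demo script:
    my $rt_classifier = Algorithm::RandomizedTreesForBigData->new(
                            training_datafile        => 'my_database.csv',  # placeholder
                            csv_class_column_index   => 1,                  # placeholder
                            csv_columns_for_features => [2,3,4,5],          # placeholder
                            how_many_trees           => 5,
                            looking_for_needles_in_haystack => 1,
                        );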


=head1 THE C<ExamplesRegression> DIRECTORY

lib/Algorithm/RegressionTree.pm

}

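# Recursively print the regression tree: each node is shown with the branch
# tests that lead to it; interior nodes also show their decision feature.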
sub display_regression_tree {
    my $self = shift;
    my $offset = shift;
    my $serial_num = $self->get_serial_num();
    if (@{$self->get_children()} > 0) {
        my $feature_at_node = $self->get_feature() || " ";
        my @branch_features_and_values_or_thresholds = @{$self->get_branch_features_and_values_or_thresholds()};
        print "NODE $serial_num: $offset BRANCH TESTS TO NODE: @branch_features_and_values_or_thresholds\n";
        my $second_line_offset = "$offset" . " " x (8 + length("$serial_num"));
        print "$second_line_offset" . "Decision Feature: $feature_at_node\n\n";
        $offset .= "   ";
        foreach my $child (@{$self->get_children()}) {
            $child->display_regression_tree($offset);
        }
    } else {
        my @branch_features_and_values_or_thresholds = @{$self->get_branch_features_and_values_or_thresholds()};
        print "NODE $serial_num: $offset BRANCH TESTS TO LEAF NODE: @branch_features_and_values_or_thresholds\n";
        my $second_line_offset = "$offset" . " " x (8 + length("$serial_num"));
    }
}

1;


