ExamplesBoosting/boosting_for_bulk_classification.pl
print "Constructing base decision tree...\n";
$boosted->construct_base_decision_tree();
# UNCOMMENT THE FOLLOWING TWO STATEMENTS if you would like to see the base decision
# tree displayed in your terminal window:
#print "\n\nThe Decision Tree:\n\n";
#$boosted->display_base_decision_tree();
# This is a required call:
print "Constructing the rest of the decision trees....\n";
$boosted->construct_cascade_of_trees();
# COMMENT OUT the following statement if you do not wish to see the class labels
# for the samples misclassified by any particular stage. The integer argument in the
# call you see below is the stage index. When set to 0, that means the base classifier.
$boosted->show_class_labels_for_misclassified_samples_in_stage(0);
## COMMENT OUT the next two statements if you do not want to see the decision trees
## constructed for each stage of the cascade:
print "\nDisplaying the decision trees for all stages:\n\n";
$boosted->display_decision_trees_for_different_stages();
### NOW YOU ARE READY TO CLASSIFY THE FILE-BASED TEST DATA:
get_test_data_from_csv();
open FILEOUT, ">$outputfile"
or die "Unable to open file $outputfile for writing out classification results: $!";
my $class_names = join ",", sort @{$boosted->get_all_class_names()};
ExamplesBoosting/boosting_for_classifying_one_test_sample_1.pl
## This script demonstrates how you can use boosting to classify a single
## test sample.
## The most important thing to keep in mind if you want to use boosting is
## the constructor parameters:
##
## how_many_stages
## As its name implies, this parameter controls how many decision trees
## will be cascaded together for the boosted classifier. Recall that the
## training set for each decision tree in a cascade is heavily influenced
## by what gets misclassified by the previous decision tree. At the same
## time, the trust we place in each decision tree is based on its overall
## performance for classifying the entire training dataset.
use strict;
use warnings;
use Algorithm::BoostedDecisionTree;
my $training_datafile = "stage3cancer.csv";
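The constructor call itself falls outside this excerpt. A minimal sketch of what it might look like, assuming the constructor parameters documented for C<Algorithm::BoostedDecisionTree> (the column indices and threshold values below are purely illustrative, not taken from this script):

```perl
use strict;
use warnings;
use Algorithm::BoostedDecisionTree;

my $training_datafile = "stage3cancer.csv";

# Hypothetical parameter values -- adjust to your CSV's actual layout:
my $boosted = Algorithm::BoostedDecisionTree->new(
                  training_datafile        => $training_datafile,
                  csv_class_column_index   => 2,            # assumed class column
                  csv_columns_for_features => [3,4,5,6,7,8],# assumed feature columns
                  entropy_threshold        => 0.01,
                  max_depth_desired        => 8,
                  how_many_stages          => 4,
              );
# Ingest the training data before constructing the base tree:
$boosted->get_training_data_for_base_tree();
```

The C<how_many_stages> parameter is the one discussed in the comment block above; the remaining parameters mirror those of the underlying C<Algorithm::DecisionTree> constructor.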
print "Constructing base decision tree...\n";
$boosted->construct_base_decision_tree();
# UNCOMMENT THE FOLLOWING TWO STATEMENTS if you would like to see the base decision
# tree displayed in your terminal window:
#print "\n\nThe Decision Tree:\n\n";
#$boosted->display_base_decision_tree();
# This is a required call:
print "Constructing the rest of the decision trees....\n";
$boosted->construct_cascade_of_trees();
# COMMENT OUT the following statement if you do not wish to see the class labels
# for the samples misclassified by any particular stage. The integer argument in the
# call you see below is the stage index. When set to 0, that means the base classifier.
$boosted->show_class_labels_for_misclassified_samples_in_stage(0);
## COMMENT OUT the next two statements if you do not want to see the decision trees
## constructed for each stage of the cascade:
print "\nDisplaying the decision trees for all stages:\n\n";
$boosted->display_decision_trees_for_different_stages();
print "Reading the test sample ...\n";
my $test_sample = ['g2 = 4.2',
'grade = 2.3',
'gleason = 4',
'eet = 1.7',
'age = 55.0',
'ploidy = diploid'];
# This is a required call:
print "Classifying with all the decision trees ....\n";
$boosted->classify_with_boosting($test_sample);
# COMMENT OUT the following two statements if you do not wish to see the classification
# results obtained with all the decision trees in the cascade:
print "\nDisplaying the classification results with all stages:\n\n";
$boosted->display_classification_results_for_each_stage();
my $final_classification = $boosted->trust_weighted_majority_vote_classifier();
print "\nFinal classification: $final_classification\n";
$boosted->display_trust_weighted_decision_for_test_sample();
ExamplesBoosting/boosting_for_classifying_one_test_sample_2.pl
## This script demonstrates how you can use boosting to classify a single
## test sample.
## The most important thing to keep in mind if you want to use boosting is
## the constructor parameters:
##
## how_many_stages
## As its name implies, this parameter controls how many decision trees
## will be cascaded together for the boosted classifier. Recall that the
## training set for each decision tree in a cascade is heavily influenced
## by what gets misclassified by the previous decision tree. At the same
## time, the trust we place in each decision tree is based on its overall
## performance for classifying the entire training dataset.
use strict;
use warnings;
use Algorithm::BoostedDecisionTree;
my $training_datafile = "training6.csv";
print "Constructing base decision tree...\n";
$boosted->construct_base_decision_tree();
# UNCOMMENT THE FOLLOWING TWO STATEMENTS if you would like to see the base decision
# tree displayed in your terminal window:
#print "\n\nThe Decision Tree:\n\n";
#$boosted->display_base_decision_tree();
# This is a required call:
print "Constructing the rest of the decision trees....\n";
$boosted->construct_cascade_of_trees();
# UNCOMMENT the following statement if you wish to see the class labels for the
# samples misclassified by any particular stage. The integer argument in the call
# you see below is the stage index. When set to 0, that means the base classifier.
#$boosted->show_class_labels_for_misclassified_samples_in_stage(0);
## COMMENT OUT the next two statements if you do not want to see the decision trees
## constructed for each stage of the cascade:
print "\nDisplaying the decision trees for all stages:\n\n";
$boosted->display_decision_trees_for_different_stages();
print "Reading the test sample ...\n";
my $test_sample = ['gdp = 50.0',
'return_on_invest = 45'];
# This is a required call:
print "Classifying with all the decision trees ....\n";
$boosted->classify_with_boosting($test_sample);
# COMMENT OUT the following two statements if you do not wish to see the classification
# results obtained with all the decision trees in the cascade:
print "\nDisplaying the classification results with all stages:\n\n";
$boosted->display_classification_results_for_each_stage();
my $final_classification = $boosted->trust_weighted_majority_vote_classifier();
print "\nFinal classification: $final_classification\n";
$boosted->display_trust_weighted_decision_for_test_sample();
lib/Algorithm/BoostedDecisionTree.pm
sub construct_base_decision_tree {
my $self = shift;
$self->{_root_nodes}->{0} = $self->{_all_trees}->{0}->construct_decision_tree_classifier();
}
sub display_base_decision_tree {
my $self = shift;
$self->{_root_nodes}->{0}->display_decision_tree(" ");
}
sub construct_cascade_of_trees {
my $self = shift;
$self->{_training_samples}->{0} = $self->{_all_sample_names};
$self->{_misclassified_samples}->{0} = $self->evaluate_one_stage_of_cascade($self->{_all_trees}->{0}, $self->{_root_nodes}->{0});
if ($self->{_stagedebug}) {
$self->show_class_labels_for_misclassified_samples_in_stage(0);
print "\n\nSamples misclassified by base classifier: @{$self->{_misclassified_samples}->{0}}\n";
my $how_many = @{$self->{_misclassified_samples}->{0}};
print "\nNumber of misclassified samples: $how_many\n";
}
my $misclassification_error_rate = reduce {$a+$b} map {$self->{_sample_selection_probs}->{0}->{$_}} @{$self->{_misclassified_samples}->{0}};
print "\nMisclassification_error_rate for base classifier: $misclassification_error_rate\n" if $self->{_stagedebug};
$self->{_trust_factors}->{0} = 0.5 * log((1-$misclassification_error_rate)/$misclassification_error_rate);
print "\nBase class trust factor: $self->{_trust_factors}->{0}\n" if $self->{_stagedebug};
$dt_this_stage->{_feature_values_how_many_uniques_hash} = {map {$_ => undef} keys %{$self->{_all_trees}->{0}->{_features_and_unique_values_hash}}};
$dt_this_stage->{_feature_values_how_many_uniques_hash} = {map {$_ => scalar @{$dt_this_stage->{_features_and_unique_values_hash}->{$_}}} keys %{$self->{_all_trees}->{0}->{_features_and_unique_values_hash}}};
$dt_this_stage->calculate_first_order_probabilities();
$dt_this_stage->calculate_class_priors();
print "\n\n>>>>>>>Done with the initialization of the tree for stage $stage_index<<<<<<<<<<\n" if $self->{_stagedebug};
my $root_node_this_stage = $dt_this_stage->construct_decision_tree_classifier();
$root_node_this_stage->display_decision_tree(" ") if $self->{_stagedebug};
$self->{_all_trees}->{$stage_index} = $dt_this_stage;
$self->{_root_nodes}->{$stage_index} = $root_node_this_stage;
$self->{_misclassified_samples}->{$stage_index} = $self->evaluate_one_stage_of_cascade($self->{_all_trees}->{$stage_index}, $self->{_root_nodes}->{$stage_index});
if ($self->{_stagedebug}) {
print "\nSamples misclassified by stage $stage_index classifier: @{$self->{_misclassified_samples}->{$stage_index}}\n";
printf("\nNumber of misclassified samples: %d\n", scalar @{$self->{_misclassified_samples}->{$stage_index}});
$self->show_class_labels_for_misclassified_samples_in_stage($stage_index);
}
my $misclassification_error_rate = reduce {$a+$b} map {$self->{_sample_selection_probs}->{$stage_index}->{$_}} @{$self->{_misclassified_samples}->{$stage_index}};
print "\nStage $stage_index misclassification_error_rate: $misclassification_error_rate\n" if $self->{_stagedebug};
$self->{_trust_factors}->{$stage_index} = 0.5 * log((1-$misclassification_error_rate)/$misclassification_error_rate);
print "\nStage $stage_index trust factor: $self->{_trust_factors}->{$stage_index}\n" if $self->{_stagedebug};
}
}
sub evaluate_one_stage_of_cascade {
my $self = shift;
my $trainingDT = shift;
my $root_node = shift;
my @misclassified_samples = ();
foreach my $test_sample_name (@{$self->{_all_sample_names}}) {
my @test_sample_data = @{$self->{_all_trees}->{0}->{_training_data_hash}->{$test_sample_name}};
print "original data in $test_sample_name:@test_sample_data\n" if $self->{_stagedebug};
@test_sample_data = grep {$_ !~ /=NA$/} @test_sample_data;
print "$test_sample_name: @test_sample_data\n" if $self->{_stagedebug};
my %classification = %{$trainingDT->classify($root_node, \@test_sample_data)};
my $true_class_label_for_test_sample = $self->{_all_trees}->{0}->{_samples_class_label_hash}->{$test_sample_name};
printf("%s: true_class: %s estimated_class: %s\n", $test_sample_name, $true_class_label_for_test_sample, $most_likely_class_label) if $self->{_stagedebug};
push @misclassified_samples, $test_sample_name if $true_class_label_for_test_sample ne $most_likely_class_label;
}
return [sort {sample_index($a) <=> sample_index($b)} @misclassified_samples];
}
sub show_class_labels_for_misclassified_samples_in_stage {
my $self = shift;
my $stage_index = shift;
die "\nYou must first call 'construct_cascade_of_trees()' before invoking 'show_class_labels_for_misclassified_samples_in_stage()'" unless @{$self->{_misclassified_samples}->{0}} > 0;
my @classes_for_misclassified_samples = ();
my @just_class_labels = ();
for my $sample (@{$self->{_misclassified_samples}->{$stage_index}}) {
my $true_class_label_for_sample = $self->{_all_trees}->{0}->{_samples_class_label_hash}->{$sample};
push @classes_for_misclassified_samples, sprintf("%s => %s", $sample, $true_class_label_for_sample);
push @just_class_labels, $true_class_label_for_sample;
}
print "\nSamples misclassified by the classifier for Stage $stage_index: @{$self->{_misclassified_samples}->{$stage_index}}\n";
my $how_many = @{$self->{_misclassified_samples}->{$stage_index}};
lib/Algorithm/DecisionTree.pm
=back
See the example scripts in the directory C<bagging_examples> for how to call these
methods for classifying individual samples and for bulk classification when you place
all your test samples in a single file.
=head1 USING BOOSTING
Starting with Version 3.20, you can use the class C<BoostedDecisionTree> for
constructing a boosted decision-tree classifier. Boosting results in a cascade of
decision trees in which each decision tree is constructed with samples that are
mostly those that are misclassified by the previous decision tree. To be precise,
you create a probability distribution over the training samples for the selection of
samples for training each decision tree in the cascade. To start out, the
distribution is uniform over all of the samples. Subsequently, this probability
distribution changes according to the misclassifications by each tree in the cascade:
if a sample is misclassified by a given tree in the cascade, the probability of its
being selected for training the next tree is increased significantly. You also
associate a trust factor with each decision tree depending on its power to classify
correctly all of the training data samples. After a cascade of decision trees is
constructed in this manner, you construct a final classifier that calculates the
class label for a test data sample by taking into account the classification
decisions made by each individual tree in the cascade, the decisions being weighted
by the trust factors associated with the individual classifiers. These boosting
notions --- generally referred to as the AdaBoost algorithm --- are based on a now
celebrated paper "A Decision-Theoretic Generalization of On-Line Learning and an
Application to Boosting" by Yoav Freund and Robert Schapire that appeared in 1995 in
the Proceedings of the 2nd European Conf. on Computational Learning Theory. For a
tutorial introduction to AdaBoost, see L<https://engineering.purdue.edu/kak/Tutorials/AdaBoost.pdf>
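The distribution update and the trust factor described above follow the standard AdaBoost formulas. The following self-contained sketch (illustrative values, not code from this module) shows one round of the update for four training samples, one of which the current stage misclassifies:

```perl
use strict;
use warnings;
use List::Util qw(sum);

# One AdaBoost round: @weights is the sampling distribution over the
# training samples; @misclassified marks which samples the current
# tree got wrong (1 = misclassified, 0 = correct).
my @weights       = (0.25, 0.25, 0.25, 0.25);   # initially uniform
my @misclassified = (1, 0, 0, 0);

# Weighted misclassification error rate of this stage:
my $err = sum map { $weights[$_] * $misclassified[$_] } 0 .. $#weights;

# Trust factor for this stage (the same formula the module uses):
my $trust = 0.5 * log((1 - $err) / $err);

# Reweight: boost the misclassified samples, shrink the rest, renormalize.
@weights = map {
    $weights[$_] * exp($misclassified[$_] ? $trust : -$trust)
} 0 .. $#weights;
my $z = sum @weights;
@weights = map { $_ / $z } @weights;

printf "error rate = %.3f, trust = %.4f\n", $err, $trust;
printf "updated weights: %s\n", join(", ", map { sprintf "%.3f", $_ } @weights);
```

After the update the misclassified sample carries half of the total probability mass (0.500 versus 0.167 for each of the others), which is exactly the AdaBoost property that makes the next stage concentrate on the previous stage's mistakes.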
Keep in mind that, ordinarily, the theoretical guarantees provided by
boosting apply only to the case of binary classification. Additionally, your
training dataset must capture all of the significant statistical variations in the
=item B<construct_base_decision_tree():>
Calls on the appropriate method of the main C<DecisionTree> class to construct the
base decision tree.
=item B<display_base_decision_tree():>
Displays the base decision tree in your terminal window. (The textual form of the
decision tree is written out to the standard output.)
=item B<construct_cascade_of_trees():>
Uses the AdaBoost algorithm to construct a cascade of decision trees. As mentioned
earlier, the training samples for each tree in the cascade are drawn using a
probability distribution over the entire training dataset. This probability
distribution for any given tree in the cascade is heavily influenced by which
training samples are misclassified by the previous tree.
=item B<display_decision_trees_for_different_stages():>
Displays separately in your terminal window the decision tree constructed for each
stage of the cascade. (The textual form of the trees is written out to the standard
output.)
=item B<classify_with_boosting( $test_sample ):>
Calls on each decision tree in the cascade to classify the argument C<$test_sample>.
=item B<display_classification_results_for_each_stage():>
You can call this method to display in your terminal window the classification
decision made by each decision tree in the cascade. The method also prints out the
trust factor associated with each decision tree. It is important to look
simultaneously at the classification decision and the trust factor for each tree ---
since a classification decision made by a specific tree may appear bizarre for a
given test sample. This method is useful primarily for debugging purposes.
=item B<show_class_labels_for_misclassified_samples_in_stage( $stage_index ):>
As with the previous method, this method is useful mostly for debugging. It returns
class labels for the samples misclassified by the stage whose integer index is
supplied as an argument to the method. Say you have 10 stages in your cascade. The
value of the argument C<stage_index> would go from 0 to 9, with 0 corresponding to
the base tree.
=item B<trust_weighted_majority_vote_classifier():>
Uses the "final classifier" formula of the AdaBoost algorithm to pool together the
classification decisions made by the individual trees while taking into account the
trust factors associated with the trees. As mentioned earlier, we associate with
each tree of the cascade a trust factor that depends on the overall misclassification
rate associated with that tree.
=back
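The pooling performed by C<trust_weighted_majority_vote_classifier()> can be illustrated with a small self-contained sketch. The per-stage decisions and trust factors below are hypothetical, not output from this module:

```perl
use strict;
use warnings;

# Hypothetical per-stage decisions and trust factors for one test sample:
my @stage_decisions = ('benign', 'malignant', 'benign');
my @trust_factors   = (0.9, 0.4, 0.3);

# Sum the trust factors voting for each class label...
my %vote;
$vote{ $stage_decisions[$_] } += $trust_factors[$_] for 0 .. $#stage_decisions;

# ...and pick the label with the largest total trust.
my ($winner) = sort { $vote{$b} <=> $vote{$a} } keys %vote;
print "final classification: $winner\n";   # benign (1.2 vs 0.4)
```

Note that a stage with a low trust factor contributes little to the final decision even when it disagrees with the rest of the cascade, which is why inspecting the per-stage decisions alongside the trust factors (as C<display_classification_results_for_each_stage()> does) is informative.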
See the example scripts in the C<ExamplesBoosting> subdirectory for how to call the
methods listed above for classifying individual data samples with boosting and for
bulk classification when you place all your test samples in a single file.
=head1 USING RANDOMIZED DECISION TREES
boosting_for_classifying_one_test_sample_1.pl
boosting_for_classifying_one_test_sample_2.pl
boosting_for_bulk_classification.pl
As the names of the first two scripts imply, these show how to call the different
methods of the C<BoostedDecisionTree> class for classifying a single test sample.
When you are classifying a single test sample, you can see how each stage of the
cascade of decision trees is classifying the test sample. You can also view each
decision tree separately and also see the trust factor associated with the tree.
The third script is for the case when you place all of the test samples in a single
file. The demonstration script outputs for each test sample a single aggregate
classification decision that is obtained through trust-factor weighted majority
voting by all the decision trees.
=head1 THE C<ExamplesRandomizedTrees> DIRECTORY
The C<ExamplesRandomizedTrees> directory shows example scripts that you can use to