Algorithm-DecisionTree
This method reads your training datafile and creates from the ingested data the
data structures needed for constructing the base decision tree.
=item B<show_training_data_for_base_tree():>
Writes to the standard output the training data samples and also some relevant
properties of the features used in the training dataset.
=item B<calculate_first_order_probabilities_and_class_priors():>
Calls on the appropriate methods of the main C<DecisionTree> class to estimate the
first-order probabilities and the class priors.
=item B<construct_base_decision_tree():>
Calls on the appropriate method of the main C<DecisionTree> class to construct the
base decision tree.
=item B<display_base_decision_tree():>
Displays the base decision tree in your terminal window. (The textual form of the
decision tree is written out to the standard output.)
=item B<construct_cascade_of_trees():>
Uses the AdaBoost algorithm to construct a cascade of decision trees. As mentioned
earlier, the training samples for each tree in the cascade are drawn using a
probability distribution over the entire training dataset. This probability
distribution for any given tree in the cascade is heavily influenced by which
training samples are misclassified by the previous tree.
=item B<display_decision_trees_for_different_stages():>
Displays separately in your terminal window the decision tree constructed for each
stage of the cascade. (The textual form of the trees is written out to the standard
output.)
=item B<classify_with_boosting( $test_sample ):>
Calls on each decision tree in the cascade to classify the argument C<$test_sample>.
=item B<display_classification_results_for_each_stage():>
You can call this method to display in your terminal window the classification
decision made by each decision tree in the cascade. The method also prints out the
trust factor associated with each decision tree. It is important to look at the
classification decision and the trust factor for each tree together, since a
decision that appears bizarre for a given test sample may be coming from a tree
with a low trust factor. This method is useful primarily for debugging purposes.
=item B<show_class_labels_for_misclassified_samples_in_stage( $stage_index ):>
As with the previous method, this method is useful mostly for debugging. It
returns the class labels for the samples misclassified by the stage whose integer
index is supplied as an argument. If, say, you have 10 stages in your cascade,
the value of the argument C<$stage_index> goes from 0 to 9, with 0 corresponding
to the base tree.
=item B<trust_weighted_majority_vote_classifier():>
Uses the "final classifier" formula of the AdaBoost algorithm to pool together the
classification decisions made by the individual trees while taking into account the
trust factors associated with the trees. As mentioned earlier, we associate with
each tree of the cascade a trust factor that depends on the overall misclassification
rate associated with that tree.
=back
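The trust-weighted pooling performed by this final classifier can be sketched in
plain Perl. The per-stage votes and trust factors below are made-up numbers, and
the two-class setup is purely illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical per-stage classification decisions and the trust factors
# associated with the corresponding trees (made-up numbers):
my @votes         = ('class1', 'class0', 'class1', 'class1');
my @trust_factors = (0.9, 1.1, 0.3, 0.2);

# Pool the votes, weighting each stage's decision by its trust factor:
my %weighted_votes;
for my $i (0 .. $#votes) {
    $weighted_votes{ $votes[$i] } += $trust_factors[$i];
}

# The final classification is the class with the largest pooled weight:
my ($winner) = sort { $weighted_votes{$b} <=> $weighted_votes{$a} }
               keys %weighted_votes;
print "final classification: $winner\n";    # class1 (1.4 versus 1.1)
```

Note that C<class1> wins even though the single most trusted tree voted for
C<class0> --- it is the pooled, trust-weighted total that decides.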
See the example scripts in the C<ExamplesBoosting> subdirectory for how to call the
methods listed above for classifying individual data samples with boosting and for
bulk classification when you place all your test samples in a single file.
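A typical call sequence strings the methods above together roughly as follows.
This is only a sketch: the constructor options shown (including
C<how_many_stages>) and the method C<get_training_data_for_base_tree()> are
recalled from this module's documentation and should be checked against it, and
the file name, column indices, and feature names are purely illustrative:

```perl
use strict;
use warnings;
use Algorithm::BoostedDecisionTree;

my $training_datafile = "stagedata.csv";        # illustrative file name
my $boosted = Algorithm::BoostedDecisionTree->new(
                  training_datafile => $training_datafile,
                  csv_class_column_index => 1,
                  csv_columns_for_features => [2,3],
                  entropy_threshold => 0.01,
                  max_depth_desired => 8,
                  symbolic_to_numeric_cardinality_threshold => 10,
                  how_many_stages => 4,
              );
$boosted->get_training_data_for_base_tree();
$boosted->calculate_first_order_probabilities_and_class_priors();
$boosted->construct_base_decision_tree();
$boosted->construct_cascade_of_trees();

# Classify a single test sample with the full cascade:
my @test_sample = ('feature1 = 23.4', 'feature2 = 12.1');
$boosted->classify_with_boosting(\@test_sample);
my $label = $boosted->trust_weighted_majority_vote_classifier();
```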
=head1 USING RANDOMIZED DECISION TREES
Consider the following two situations that call for using randomized decision trees,
meaning multiple decision trees that are trained using data extracted randomly from a
large database of training samples:
(1) Consider a two-class problem for which the training database is grossly
imbalanced in how many majority-class samples it contains vis-a-vis the number of
minority-class samples. Let's assume for a moment that the ratio of
majority-class samples to minority-class samples is 1000 to 1. Let's also assume
that you have a test dataset that is drawn randomly from the same population
mixture from which the training database was created. Now consider a naive
classifier that labels everything as belonging to the majority class. If you
measure the classification accuracy rate as the ratio of the number of samples
correctly classified to the total number of test samples selected randomly from
the population, this classifier would achieve an accuracy of roughly 99.9% ---
despite never detecting a single minority-class sample.
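The arithmetic behind that number is straightforward; a minimal sketch:

```perl
use strict;
use warnings;

# With 1000 majority-class samples for every minority-class sample, a
# classifier that always answers "majority class" is correct on 1000 of
# every 1001 randomly drawn test samples:
my ($majority, $minority) = (1000, 1);
my $accuracy = $majority / ($majority + $minority);
printf "accuracy: %.4f\n", $accuracy;    # 0.9990
```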
(2) Let's now consider another situation in which we are faced with a huge training
database but in which every class is equally well represented. Feeding all the data
into a single decision tree would be akin to polling all of the population of the
United States for measuring the Coke-versus-Pepsi preference in the country. You are
likely to get better results if you construct multiple decision trees, each trained
with a collection of training samples drawn randomly from the training database.
After you have created all the decision trees, your final classification decision
could then be based on, say, majority voting by the trees.
In summary, the C<RandomizedTreesForBigData> class allows you to solve the
following two problems: (1) data classification with the needle-in-a-haystack
metaphor, that is, when a vast majority of your training samples belong to just
one class; and (2) classification when you have access to a very large database
of training samples and wish to construct an ensemble of decision trees.
=over 4
=item B<Calling the RandomizedTreesForBigData constructor:>
Here is how you'd call the C<RandomizedTreesForBigData> constructor for
needle-in-a-haystack classification:
    use Algorithm::RandomizedTreesForBigData;

    my $training_datafile = "your_database.csv";
    my $rt = Algorithm::RandomizedTreesForBigData->new(
                 training_datafile => $training_datafile,
                 csv_class_column_index => 48,
                 csv_columns_for_features => [24,32,33,34,41],
                 entropy_threshold => 0.01,
                 max_depth_desired => 8,
                 symbolic_to_numeric_cardinality_threshold => 10,
             );
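Once the object is constructed, classification might then proceed along the
following lines. The method names in this sketch
(C<get_training_data_for_all_trees()> and so on) are recalled from this class's
documented interface and should be treated as assumptions to verify against the
class's own documentation; the feature names are purely illustrative:

```perl
# Assumed call sequence for Algorithm::RandomizedTreesForBigData:
$rt->get_training_data_for_all_trees();
$rt->calculate_first_order_probabilities();
$rt->calculate_class_priors();
$rt->construct_all_decision_trees();

# Classify a single test sample with every tree and pool the decisions:
my @test_sample = ('feature24 = 32.1', 'feature32 = 0.5');
$rt->classify_with_all_trees(\@test_sample);
my $label = $rt->get_majority_vote_classification();
print "majority-vote label: $label\n";
```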