DATA results from the CPAN


Algorithm-DecisionTree


script

        classify_by_asking_questions.pl

Execute the script as it is and see what happens.


===========================================================================


     EVALUATING THE CLASS DISCRIMINATORY POWER OF YOUR TRAINING DATA


Given a training data file that contains data records and the associated
class labels, one often wants to know the quality of the data in the file.
In other words, one wants to know if a training data file contains
sufficient information to discriminate between the different classes
mentioned in the file.

Starting with Version 2.2 of the DecisionTree module, you can now run a
10-fold cross-validation test on your training data to find out how much

Examples/README view on Meta::CPAN

introspection can provide answers for. The second script descends the
decision tree and shows for each node the training samples that fall
directly in the portion of the feature space assigned to that node.  The
third script shows for each training sample how it affects the
decision-tree nodes either directly or indirectly through the
generalization achieved by the probabilistic modeling of the data.

===========================================================================


              GENERATING SYNTHETIC TRAINING AND TEST DATA


Starting with Version 1.6, you can use the module itself to generate
synthetic training and test data.  See the script

        generate_training_and_test_data_numeric.pl

        generate_training_and_test_data_symbolic.pl

for how to generate training data for the decision-tree classifier for the

Examples/classify_test_data_in_a_file.pl view on Meta::CPAN

### the training data that was read from the disk file:
#$dt->show_training_data();

my $root_node = $dt->construct_decision_tree_classifier();


### UNCOMMENT THE NEXT STATEMENT if you would like to see
### the decision tree displayed in your terminal window:
#$root_node->display_decision_tree("   ");

# NOW YOU ARE READY TO CLASSIFY THE FILE BASED TEST DATA:
my (@all_class_names, @feature_names, %class_for_sample_hash, %feature_values_for_samples_hash,
    %features_and_values_hash, %features_and_unique_values_hash, 
    %numeric_features_valuerange_hash, %feature_values_how_many_uniques_hash);

get_test_data_from_csv();
open OUTPUTHANDLE, ">$outputfile"
    or die "Unable to open the file $outputfile for writing out the classification results: $!";
if ($show_hard_classifications && ($outputfile !~ /\.csv$/i)) {
    print OUTPUTHANDLE "\nOnly the most probable class shown for each test sample\n\n";
} elsif (!$show_hard_classifications && ($outputfile !~ /\.csv$/i)) {

ExamplesBagging/bagging_for_bulk_classification.pl view on Meta::CPAN


$dtbag->calculate_first_order_probabilities();

$dtbag->calculate_class_priors();

$dtbag->construct_decision_trees_for_bags();

##  COMMENT OUT the following statement if you do not want to see the decision trees constructed for each bag
$dtbag->display_decision_trees_for_bags();

### NOW YOU ARE READY TO CLASSIFY THE FILE-BASED TEST DATA:
get_test_data_from_csv();

open FILEOUT, ">$outputfile"
    or die "Unable to open file $outputfile for writing out classification results: $!";

my $class_names = join ",", sort @{$dtbag->get_all_class_names()};

my $output_string = "sample_index,$class_names\n";

print FILEOUT $output_string;

ExamplesBoosting/boosting_for_bulk_classification.pl view on Meta::CPAN

#  samples misclassified by any particular stage.  The integer argument in the call
#  you see below is the stage index.  When set to 0, that means the base classifier.
$boosted->show_class_labels_for_misclassified_samples_in_stage(0);


##  COMMENT OUT the next two statements if you do not want to see the decision trees
##  constructed for each stage of the cascade:
print "\nDisplaying the decision trees for all stages:\n\n";
$boosted->display_decision_trees_for_different_stages();

### NOW YOU ARE READY TO CLASSIFY THE FILE-BASED TEST DATA:
get_test_data_from_csv();

open FILEOUT, ">$outputfile"
    or die "Unable to open file $outputfile for writing out classification results: $!";

my $class_names = join ",", sort @{$boosted->get_all_class_names()};

my $output_string = "sample_index,$class_names\n";

print FILEOUT $output_string;

lib/Algorithm/DecisionTree.pm view on Meta::CPAN

additional option to the constructor that sets a user-defined value for the number of
points to use.  The name of the option is C<number_of_histogram_bins>.  The following
script 

    construct_dt_for_heavytailed.pl 

in the C<Examples> directory shows an example of how to call the constructor of the
module with the C<number_of_histogram_bins> option.


=head1 TESTING THE QUALITY OF YOUR TRAINING DATA

Versions 2.1 and higher include a new class named C<EvalTrainingData>, derived from
the main class C<DecisionTree>, that runs a 10-fold cross-validation test on your
training data to test its ability to discriminate between the classes mentioned in
the training file.

The 10-fold cross-validation test divides all of the training data into ten parts,
with nine parts used for training a decision tree and one part used for testing its
ability to classify correctly. This selection of nine parts for training and one part
for testing is carried out in all of the ten different possible ways.
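
The partitioning described above can be sketched in plain Perl. This is only an illustration of the general 10-fold idea, not the code used by C<EvalTrainingData>; the helper names C<make_folds> and C<train_test_splits> and the sample names are hypothetical, and the shuffle-then-round-robin assignment is an assumption.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Split sample names into $k folds of near-equal size
# (shuffle first, then deal the samples out round-robin).
sub make_folds {
    my ($samples, $k) = @_;
    my @shuffled = shuffle @$samples;
    my @folds    = map { [] } 1 .. $k;
    push @{ $folds[ $_ % $k ] }, $shuffled[$_] for 0 .. $#shuffled;
    return \@folds;
}

# For each fold, build one [ \@training, \@testing ] pair in which
# that fold is held out for testing and the rest are used for training.
sub train_test_splits {
    my ($folds) = @_;
    my @splits;
    for my $held_out (0 .. $#$folds) {
        my @training = map  { @{ $folds->[$_] } }
                       grep { $_ != $held_out } 0 .. $#$folds;
        push @splits, [ \@training, [ @{ $folds->[$held_out] } ] ];
    }
    return \@splits;
}

# Hypothetical sample names, used only for this illustration.
my @samples = map { "sample_$_" } 1 .. 50;
my $splits  = train_test_splits( make_folds(\@samples, 10) );

printf "round %2d: %d training, %d testing\n",
    $_ + 1, scalar @{ $splits->[$_][0] }, scalar @{ $splits->[$_][1] }
    for 0 .. $#$splits;
```

Each of the ten rounds holds out a different fold, so every sample is used for testing exactly once and for training nine times.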

lib/Algorithm/DecisionTree.pm view on Meta::CPAN

may use too large a number of bins for estimating the probabilities and that may slow
down the calculation of the decision tree.  You can get around this difficulty by
explicitly giving a value to the 'C<number_of_histogram_bins>' parameter.

=back


You can choose the best values to use for the last three constructor parameters by
running a 10-fold cross-validation test on your training data through the class
C<EvalTrainingData> that comes with Versions 2.1 and higher of this module.  See the
section "TESTING THE QUALITY OF YOUR TRAINING DATA" of this document page.

=over

=item B<get_training_data():>

After you have constructed a new instance of the C<Algorithm::DecisionTree> class,
you must now read in the training data from the file named in the call to the
constructor.  This you do by:

    $dt->get_training_data();

lib/Algorithm/DecisionTree.pm view on Meta::CPAN

all the nodes that are affected directly AND indirectly by that sample, call

    $introspector->display_training_samples_to_nodes_influence_propagation();

A training sample affects a node directly if the sample falls in the portion of the
features space assigned to that node. On the other hand, a training sample is
considered to affect a node indirectly if the node is a descendant of a node that is
affected directly by the sample.
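
As a rough illustration of the direct/indirect distinction, the sketch below works on a toy tree stored as a hash of child lists. The tree, the node ids, and the helpers C<descendants> and C<indirectly_affected> are all hypothetical and are not part of the module; the set of directly affected nodes is simply given as input. Nodes that are already directly affected are left out of the indirect list, which is a presentation choice made here.

```perl
use strict;
use warnings;

# Hypothetical toy tree: node id => list of child node ids.
my %children = (
    0 => [1, 2],
    1 => [3, 4],
    2 => [5],
    3 => [], 4 => [], 5 => [],
);

# All descendants of a node, found by depth-first traversal.
sub descendants {
    my ($tree, $node) = @_;
    my @found;
    for my $child (@{ $tree->{$node} || [] }) {
        push @found, $child, descendants($tree, $child);
    }
    return @found;
}

# Nodes reached only indirectly: descendants of the directly affected
# nodes that are not themselves directly affected.
sub indirectly_affected {
    my ($tree, @direct) = @_;
    my %is_direct = map { $_ => 1 } @direct;
    my %seen;
    for my $node (@direct) {
        $seen{$_} = 1 for grep { !$is_direct{$_} } descendants($tree, $node);
    }
    return sort { $a <=> $b } keys %seen;
}

my @direct   = (1);
my @indirect = indirectly_affected(\%children, @direct);
print "directly affected:   @direct\n";
print "indirectly affected: @indirect\n";
```

With node 1 as the only directly affected node, the indirectly affected nodes in this toy tree are its descendants, nodes 3 and 4.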


=head1 BULK CLASSIFICATION OF DATA RECORDS

For large test datasets, you would obviously want to process an entire file of test
data at a time. The following scripts in the C<Examples> directory illustrate how you
can do that:

      classify_test_data_in_a_file.pl

This script requires three command-line arguments: the first names the training
datafile, the second the test datafile, and the third the file in which the
classification results are to be deposited.
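
A minimal sketch of how a script might check and unpack three such arguments is shown below. It is not taken from C<classify_test_data_in_a_file.pl>; the helper names, the hash keys, and the header row written to the results file are assumptions made for illustration.

```perl
use strict;
use warnings;

# Validate the three positional arguments described above and return
# them as a hash.  The key names are illustrative only.
sub parse_arguments {
    my @args = @_;
    die "Usage: $0 <training_datafile> <test_datafile> <outputfile>\n"
        unless @args == 3;
    my %files;
    @files{qw(training_datafile test_datafile outputfile)} = @args;
    return \%files;
}

# Open the results file with three-argument open and a lexical filehandle.
sub open_results_file {
    my ($path) = @_;
    open my $out, '>', $path
        or die "Unable to open $path for writing classification results: $!";
    return $out;
}

# Only act when arguments were actually supplied on the command line.
if (@ARGV) {
    my $files = parse_arguments(@ARGV);
    my $out   = open_results_file( $files->{outputfile} );
    print {$out} "sample_index,predicted_class\n";    # hypothetical header row
    close $out or die "Unable to close $files->{outputfile}: $!";
}
```

Checking the argument count up front gives a usage message instead of a failure later when one of the files is opened.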

lib/Algorithm/DecisionTree.pm view on Meta::CPAN


Call this method if you want to apply the regression tree to all your test data in a
disk file.  The predictions for all of the test samples in the disk file are written
out to another file whose name is the same as that of the test file except for the
addition of C<_output> in the name of the file.  The parameter C<$filename> is the
name of the disk file that contains the test data. And the parameter C<$columns> is a
list of the column indices for the predictor variables in the test file.

=back

=head1 GENERATING SYNTHETIC TRAINING DATA

The module file contains the following additional classes: (1)
C<TrainingDataGeneratorNumeric>, and (2) C<TrainingDataGeneratorSymbolic> for
generating synthetic training data.

The class C<TrainingDataGeneratorNumeric> outputs one CSV file for the
training data and another one for the test data for experimenting with numeric
features.  The numeric values are generated using a multivariate Gaussian
distribution whose mean and covariance are specified in a parameter file. See the
file C<param_numeric.txt> in the C<Examples> directory for an example of such a
