Algorithm-DecisionTree


Examples/README

                                 is generated.

       stage3cancer.csv:   Example of a CSV training data file with both 
                           symbolic and numeric features. 

       training.csv    :   Example of a CSV training data file for the
                           purely numeric case.  Contains two classes, each
                           a Gaussian distribution in 2D.  The parameters of
                           the two Gaussians are in the file: 
                           `param_numeric.txt'

    There are two additional training data files in the directory:

          training2.csv

          training3.csv

    These are similar to the file `training.csv' in that each of them
    also contains two classes, each a 2D Gaussian distribution.
    The first, `training2.csv', was generated by the script
    `generate_training_data_numeric.pl' using the parameter file

           param_numeric_strongly_overlapping_classes.txt

    and the second, `training3.csv', was generated by the same script 
    using the parameter file

           param_numeric_extremely_overlapping_classes.txt
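
    As a rough illustration of what "two classes, each a 2D Gaussian
    distribution" means for such a file, here is a minimal sketch that
    draws labeled 2D samples with the Box-Muller transform.  It is not the
    code of `generate_training_data_numeric.pl'.  The class names, the
    means and standard deviations, the `id,class,x,y' column layout, and
    the assumption of independent axes are all hypothetical:

```perl
use strict;
use warnings;

my $TWO_PI = 8 * atan2(1, 1);

# Draw one normally distributed value using the Box-Muller transform.
sub gaussian {
    my ($mean, $stddev) = @_;
    my $u1 = 1 - rand();                 # in (0, 1], so log() is safe
    my $u2 = rand();
    my $z  = sqrt(-2 * log($u1)) * cos($TWO_PI * $u2);
    return $mean + $stddev * $z;
}

# Hypothetical class parameters: a mean and a standard deviation per axis.
# The two axes are treated as independent, which is a simplification.
my %params = (
    class_a => { mean => [ 0, 0 ], stddev => [ 1, 1 ] },
    class_b => { mean => [ 1, 1 ], stddev => [ 1, 1 ] },
);

# Return CSV lines: a header, then one "id,class,x,y" row per sample.
sub make_training_rows {
    my ($params, $samples_per_class) = @_;
    my @rows = ('id,class,x,y');
    my $id   = 0;
    for my $class (sort keys %$params) {
        my $p = $params->{$class};
        for (1 .. $samples_per_class) {
            my $x = gaussian($p->{mean}[0], $p->{stddev}[0]);
            my $y = gaussian($p->{mean}[1], $p->{stddev}[1]);
            push @rows, sprintf('%d,%s,%.4f,%.4f', ++$id, $class, $x, $y);
        }
    }
    return @rows;
}

print "$_\n" for make_training_rows(\%params, 3);
```

    The closer the two means are relative to the standard deviations, the
    more the classes overlap and the harder they are to separate.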


(3) So far we have talked about classifying one test data record at a time.
    You can place multiple test data records in a disk file and classify
    them all in one go.  To see how that can be done, execute the following
    command line in the `Examples' directory:

     classify_test_data_in_a_file.pl   training4.csv   test4.csv   out4.csv

    This script constructs the decision tree from the data in the first
    argument file and then uses it to classify the data in the second
    argument file.  The computed class labels are deposited in the third
    argument file.

    In general, the test data files should have the same format as the
    training data files.  Of course, for real-world test data, you will not
    have the class labels for the test samples.  You are still required to
    reserve a column for the class label, which must now be just the empty
    string "" for each data record.  For example, the test data supplied
    through the file test4_no_class_labels.csv in the following call does
    not mention class labels:

     classify_test_data_in_a_file.pl training4.csv test4_no_class_labels.csv out4.csv
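
    If you already have a labeled file and want an unlabeled copy of it
    for testing, the class column can be blanked out with a few lines of
    Perl.  The following is only a sketch: the column layout in the sample
    data is hypothetical, the fields are assumed to contain no embedded
    commas, and the first line is assumed to be a header row:

```perl
use strict;
use warnings;

# Replace the class-label field of every data row with the empty string "".
# Assumes plain comma-separated fields with no embedded commas, and that the
# first line is a header row that should pass through unchanged.
sub blank_class_labels {
    my ($class_column_index, @lines) = @_;
    my @out = (shift @lines);                 # keep the header as it is
    for my $line (@lines) {
        my @fields = split /,/, $line, -1;    # -1 keeps trailing empty fields
        $fields[$class_column_index] = '""';
        push @out, join(',', @fields);
    }
    return @out;
}

# Hypothetical labeled records; the class label sits in column index 1.
my @labeled = (
    'id,class,x,y',
    '1,benign,0.25,1.50',
    '2,malignant,3.10,0.75',
);
print "$_\n" for blank_class_labels(1, @labeled);
```

    For files with quoted fields that may contain commas, a real CSV
    parser would be the safer choice.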



(4) Let's now talk about how you can deal with features that, statistically
    speaking, are not so "nice".  We are talking about features with
    heavy-tailed distributions over large value ranges.  As mentioned in
    the HTML-based API for this module, such features can create problems
    with the estimation of the probability distributions associated with
    them.  The main problem they cause is in deciding how best to sample
    the value range.

    Beginning with Version 2.22, you have two options in dealing with such
    features.  You can choose to go with the default behavior of the
    module, which is to sample the value range for such a feature over a
    maximum of 500 points.  Or, you can supply an additional option to the
    constructor that sets a user-defined value for the number of points to
    use.  The name of the option is "number_of_histogram_bins".  The
    following script

          construct_dt_for_heavytailed.pl

    shows an example of a DecisionTree constructor with the
    "number_of_histogram_bins" option.
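
    To see why the number of sampling points matters for a heavy-tailed
    feature, consider the sketch below.  It builds a plain equal-width
    histogram and is not the estimator used inside the module; the sample
    values are made up for the illustration:

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Count how many values fall into each of $num_bins equal-width bins
# spanning the range from the smallest to the largest value.
sub histogram {
    my ($values, $num_bins) = @_;
    my ($lo, $hi) = (min(@$values), max(@$values));
    my $width  = ($hi - $lo) / $num_bins || 1;     # guard against zero width
    my @counts = (0) x $num_bins;
    for my $v (@$values) {
        my $bin = int(($v - $lo) / $width);
        $bin = $num_bins - 1 if $bin >= $num_bins; # the maximum goes in the last bin
        $counts[$bin]++;
    }
    return @counts;
}

# Made-up heavy-tailed sample: most values are small, one is very large.
my @values = (1, 2, 2, 3, 3, 3, 4, 1000);

my @coarse = histogram(\@values, 4);
print "4 bins:    @coarse\n";

my @fine = histogram(\@values, 1000);
print "1000 bins: first four counts are @fine[0 .. 3]\n";
```

    With only a few bins, the single outlier stretches the range so much
    that all the other values collapse into one bin.  More bins recover
    the detail at the low end, at the cost of more computation.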


===========================================================================


          FOR USING A DECISION TREE CLASSIFIER INTERACTIVELY


Starting with Version 1.6 of the module, you can use the DecisionTree
classifier in an interactive mode.  In this mode, after the decision tree
has been constructed, the user is prompted for answers to questions about
the feature tests at the nodes of the tree.  Depending on the answer
supplied at a node, the classifier follows the corresponding branch down
to the next node, and so on.  To get a feel for using a decision tree in
this mode, examine the script

        classify_by_asking_questions.pl

Execute the script as it is and see what happens.
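
The question-and-answer descent described above can be sketched in a few
lines.  The sketch does not use the module's own tree representation: the
hand-built tree, the feature names, and the class labels are all
hypothetical, and the answers are scripted so that the example runs without
a terminal:

```perl
use strict;
use warnings;

# A hypothetical hand-built tree: each internal node tests one feature and
# maps each possible answer to a child; each leaf carries a class label.
my $tree = {
    feature  => 'outlook',
    children => {
        rainy => { label => 'stay' },
        sunny => {
            feature  => 'wind',
            children => {
                strong => { label => 'stay' },
                weak   => { label => 'play' },
            },
        },
    },
};

# Descend from the root, calling $ask->($feature, @choices) at every node
# until a leaf is reached.  Returns the label stored at that leaf.
sub classify_interactively {
    my ($node, $ask) = @_;
    while (!exists $node->{label}) {
        my @choices = sort keys %{ $node->{children} };
        my $answer  = $ask->($node->{feature}, @choices);
        die "No answer supplied\n" unless defined $answer;
        if (!exists $node->{children}{$answer}) {
            warn "Unrecognized answer '$answer'; please try again.\n";
            next;                       # ask the same question again
        }
        $node = $node->{children}{$answer};
    }
    return $node->{label};
}

# Prompt on the terminal and read one answer from standard input.
sub ask_on_terminal {
    my ($feature, @choices) = @_;
    print "Value for feature '$feature' (", join(', ', @choices), "): ";
    my $answer = <STDIN>;
    return undef unless defined $answer;
    chomp $answer;
    return $answer;
}

# Scripted answers stand in for the user here.
my @scripted = ('sunny', 'weak');
print classify_interactively($tree, sub { shift @scripted }), "\n";
```

To answer the questions yourself, pass `\&ask_on_terminal' as the second
argument in place of the scripted callback.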


===========================================================================


     EVALUATING THE CLASS DISCRIMINATORY POWER OF YOUR TRAINING DATA


Given a training data file that contains data records and the associated
class labels, one often wants to know the quality of the data in the file.
In other words, one wants to know if a training data file contains
sufficient information to discriminate between the different classes
mentioned in the file.

Starting with Version 2.2 of the DecisionTree module, you can run a
10-fold cross-validation test on your training data to find out how much
class-discriminatory information is contained in the data.  The following
two scripts in the Examples directory show how to run such a test:

       evaluate_training_data1.pl

       evaluate_training_data2.pl

As these scripts show, the following class 

       EvalTrainingData
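
For readers unfamiliar with the 10-fold cross-validation mentioned above,
here is a generic sketch of the procedure.  It is not the implementation of
the EvalTrainingData class: the helper names are hypothetical, and a trivial
majority-class predictor stands in for the decision tree:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Split the record indices 0 .. $n-1 into $k folds of nearly equal size.
sub make_folds {
    my ($n, $k) = @_;
    my @indices = shuffle(0 .. $n - 1);
    my @folds   = map { [] } 1 .. $k;
    push @{ $folds[ $_ % $k ] }, $indices[$_] for 0 .. $#indices;
    return @folds;
}

# Run k-fold cross-validation.  $train->(\@train_idx) must return a model,
# $predict->($model, $idx) must return a label, and $labels->[$idx] is the
# true label.  Returns the fraction of held-out records predicted correctly.
sub cross_validate {
    my ($labels, $k, $train, $predict) = @_;
    my @folds   = make_folds(scalar @$labels, $k);
    my $correct = 0;
    for my $held_out (0 .. $#folds) {
        my @train_idx = map { @{ $folds[$_] } }
                        grep { $_ != $held_out } 0 .. $#folds;
        my $model = $train->(\@train_idx);
        $correct += grep { $predict->($model, $_) eq $labels->[$_] }
                    @{ $folds[$held_out] };
    }
    return $correct / @$labels;
}

# Stand-in "classifier": always predict the majority class of the training part.
my @labels = (('a') x 14, ('b') x 6);
my $train = sub {
    my ($idx) = @_;
    my %count;
    $count{ $labels[$_] }++ for @$idx;
    my ($majority) = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    return $majority;
};
my $predict = sub { my ($model, $idx) = @_; return $model };

printf "accuracy: %.2f\n", cross_validate(\@labels, 10, $train, $predict);
```

Each record is held out exactly once, so the returned accuracy estimates how
well a classifier trained on this data can be expected to label unseen
records.  A value close to that of random guessing suggests that the data
carries little class-discriminatory information.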


