Algorithm-DecisionTree


lib/Algorithm/DecisionTree.pm

  # You can classify it by calling:

      my $classification = $dt->classify($root_node, \@test_sample);

  # The call to classify() returns a reference to a hash whose keys are the class
  # names and whose values are the associated classification probabilities.  This
  # hash also includes one more key-value pair that records the solution path from
  # the root node to the leaf node at which the final classification was carried out.
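
A minimal sketch of how such a returned hash might be unpacked is shown here.  It runs on a hand-built stand-in for the return value, so it does not need the module or any training data.  The key name C<solution_path>, the class names, the node labels, and the helper C<rank_classes()> are all assumptions made for illustration; check the hash you actually get back before relying on any of them.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash reference returned by classify().
# The key name 'solution_path' and every value below are assumptions.
my $classification = {
    'class=benign'    => 0.28,
    'class=malignant' => 0.72,
    'solution_path'   => [ 'NODE0', 'NODE2', 'NODE5' ],
};

# Hypothetical helper: separate the path entry from the class probabilities
# and rank the classes from most to least probable.
sub rank_classes {
    my ($result) = @_;
    my %probs = %$result;                        # shallow copy; the caller's hash is untouched
    my $path  = delete $probs{'solution_path'};  # what remains are class => probability pairs
    my @ranked = sort { $probs{$b} <=> $probs{$a} || $a cmp $b } keys %probs;
    return (\@ranked, \%probs, $path);
}

my ($ranked, $probs, $path) = rank_classes($classification);
printf "%-20s %.2f\n", $_, $probs->{$_} for @$ranked;
print "Solution path: @$path\n";
```

Copying the hash before deleting the path entry keeps the original return value intact in case it is needed again later.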


=head1 CHANGES

B<Version 3.43:> This version fixes a bug in the C<cleanup_csv()> function.
The source of the bug was a typo in a regex component meant for matching white
space.  I have also made one additional change to this function to increase its
versatility.  With this change, you are now allowed to have empty strings as values
for features.

B<Version 3.42:> This version reintroduces C<csv_cleanup_needed> as an optional
parameter in the module constructor.  This was done in response to several requests
received from the user community. (Previously, all line records from a CSV file were
processed by the C<cleanup_csv()> function no matter what.)  The main point made by
the users was that invoking C<cleanup_csv()> when there was no need for CSV clean-up
exacted a performance penalty when ingesting large database files with tens of
thousands of line records.  In addition to making C<csv_cleanup_needed> optional, I
have also tweaked the code in the C<cleanup_csv()> function so that it can extract
data from a wider range of messy CSV files.

B<Version 3.41:> All the changes made in this version relate to the construction of
regression trees.  I have fixed a couple of bugs in the calculation of the regression
coefficients. Additionally, the C<RegressionTree> class now comes with a new
constructor parameter named C<jacobian_choice>.  For most cases, you'd set this
parameter to 0, which causes the regression coefficients to be estimated through
linear least-squares minimization.
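
For readers who want a concrete picture of linear least-squares minimization, the sketch below fits a straight line to a single predictor by minimizing the sum of squared residuals in closed form.  It is not the code used by C<RegressionTree>; the helper name C<fit_line()> and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Algorithm::DecisionTree or RegressionTree.
# Fits y = slope * x + intercept by minimizing the sum of squared residuals.
sub fit_line {
    my ($xs, $ys) = @_;
    my $n = scalar @$xs;
    die "need at least two points and equal-length inputs\n"
        if $n < 2 || $n != scalar @$ys;

    # Sample means of the predictor and the response.
    my ($mean_x, $mean_y) = (0, 0);
    $mean_x += $_ / $n for @$xs;
    $mean_y += $_ / $n for @$ys;

    # Accumulate the centered cross-product and the centered sum of squares.
    my ($sxy, $sxx) = (0, 0);
    for my $i (0 .. $n - 1) {
        my $dx = $xs->[$i] - $mean_x;
        $sxy += $dx * ($ys->[$i] - $mean_y);
        $sxx += $dx * $dx;
    }
    die "all x values are identical; slope is undefined\n" if $sxx == 0;

    my $slope     = $sxy / $sxx;
    my $intercept = $mean_y - $slope * $mean_x;
    return ($slope, $intercept);
}

my ($slope, $intercept) = fit_line([0, 1, 2, 3], [1, 3, 5, 7]);
printf "slope=%.3f intercept=%.3f\n", $slope, $intercept;
```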

B<Version 3.40:> In addition to constructing decision trees, this version of the
module also allows you to construct regression trees. The regression tree capability
has been packed into a separate subclass, named C<RegressionTree>, of the main
C<DecisionTree> class.  The subdirectory C<ExamplesRegression> in the main
installation directory illustrates how you can use this new functionality of the
module.

B<Version 3.30:> This version incorporates four very significant upgrades/changes to
the C<DecisionTree> module: B<(1)> The CSV cleanup is now the default. So you do not
have to set any special parameters in the constructor calls to initiate CSV
cleanup. B<(2)> In the form of a new Perl class named C<RandomizedTreesForBigData>,
this module provides you with an easy-to-use programming interface for attempting
needle-in-a-haystack solutions for the case when your training data is overwhelmingly
dominated by a single class.  You need to set the constructor parameter
C<looking_for_needles_in_haystack> to invoke the logic that constructs multiple
decision trees, each using the minority class samples along with samples drawn
randomly from the majority class.  The final classification is made through a
majority vote from all the decision trees.  B<(3)> Assuming you are faced with a
big-data problem --- in the sense that you have been given a training database with a
very large number of training records --- the class C<RandomizedTreesForBigData> will
also let you construct multiple decision trees by pulling training data randomly from
your training database (without paying attention to the relative populations of the
classes).  The final classification decision for a test sample is based on a majority
vote from all the decision trees thus constructed.  See the C<ExamplesRandomizedTrees>
directory for how to use these new features of the module. And, finally, B<(4)>
Support for the old-style '.dat' training files has been dropped in this version.
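
The needle-in-a-haystack idea described in B<(2)> can be sketched as follows.  This is an illustration only and does not use C<RandomizedTreesForBigData>: the helpers C<balanced_subsets()> and C<majority_vote()> are hypothetical, and drawing as many majority-class samples as there are minority-class samples is an assumption of the sketch.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helpers for illustration; these are not methods of
# RandomizedTreesForBigData.

# Build one training subset per tree: all minority samples plus a random
# draw from the majority class.  The size of the draw (equal to the number
# of minority samples) is an assumption of this sketch.
sub balanced_subsets {
    my ($minority, $majority, $how_many_trees) = @_;
    my $draw = scalar(@$minority) < scalar(@$majority)
             ? scalar(@$minority)
             : scalar(@$majority);
    my @subsets;
    for (1 .. $how_many_trees) {
        my @shuffled = shuffle @$majority;
        push @subsets, [ @$minority, @shuffled[0 .. $draw - 1] ];
    }
    return @subsets;
}

# Return the label that the largest number of trees voted for.
# Ties are broken alphabetically so that the result is deterministic.
sub majority_vote {
    my @labels = @_;
    my %votes;
    $votes{$_}++ for @labels;
    my ($winner) = sort { $votes{$b} <=> $votes{$a} || $a cmp $b } keys %votes;
    return $winner;
}

my @minority = qw(needle_1 needle_2);
my @majority = map { "hay_$_" } 1 .. 10;
my @subsets  = balanced_subsets(\@minority, \@majority, 3);
print scalar(@subsets), " subsets of size ", scalar(@{ $subsets[0] }), "\n";
print "vote: ", majority_vote(qw(needle hay needle)), "\n";
```

In a real run, each subset would be used to train one decision tree, and the labels passed to C<majority_vote()> would be the per-tree classifications of a single test sample.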

B<Version 3.21:> This version makes it easier to use a CSV training file that
violates the assumption that a comma be used only to separate the different field
values in a line record.  Some large econometrics databases use double-quoted values
for fields, and these values may contain commas (presumably for better readability).
This version also allows you to specify the leftmost entry in the first CSV record
that names all the fields. Previously, this entry was required to be an empty
double-quoted string.  I have also made some minor changes to the
C<get_training_data_from_csv()> method to make it more user-friendly for large
training files that may contain tens of thousands of records.  When pulling training
data from such files, this method prints a dot on the terminal screen for every
10000 records it has processed.
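
To see why commas inside double-quoted values need special handling, consider the sketch below.  It splits one CSV record on commas while leaving commas inside double quotes alone.  It is not the module's own CSV code; the helper name C<split_csv_record()> and the sample record are hypothetical, and the sketch deliberately ignores escaped quotes inside a field.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own CSV handling.
# Splits a record on commas, except for commas that appear between
# double quotes.  Escaped quotes inside a field are not handled.
sub split_csv_record {
    my ($line) = @_;
    my @fields;
    my $current   = '';
    my $in_quotes = 0;
    for my $ch (split //, $line) {
        if ($ch eq '"') {
            $in_quotes = !$in_quotes;          # toggle state and drop the quote itself
        }
        elsif ($ch eq ',' && !$in_quotes) {
            push @fields, $current;            # an unquoted comma ends the field
            $current = '';
        }
        else {
            $current .= $ch;                   # includes commas inside quotes
        }
    }
    push @fields, $current;                    # the last field has no trailing comma
    return @fields;
}

my @fields = split_csv_record('"12","1,250,000",high');
print join(' | ', @fields), "\n";
```

A naive C<split /,/> on the same record would break the second value into three pieces, which is the failure mode this version of the module was meant to avoid.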

B<Version 3.20:> This version brings the boosting capability to the C<DecisionTree>
module.

B<Version 3.0:> This version adds bagging to the C<DecisionTree> module. If your
training dataset is large enough, you can ask the module to construct multiple
decision trees using data bags extracted from your dataset.  The module can show you
the results returned by the individual decision trees and also the results obtained
by taking a majority vote of the classification decisions made by the individual
trees.  You can specify any arbitrary extent of overlap between the data bags.

B<Version 2.31:> The introspection capability in this version packs more of a punch.
For each training data sample, you can now figure out not only the decision-tree
nodes that are affected directly by that sample, but also those nodes that are
affected indirectly through the generalization achieved by the probabilistic modeling
of the data.  The 'examples' directory of this version includes additional scripts
that illustrate these enhancements to the introspection capability.  See the section
"The Introspection API" for a declaration of the introspection related methods, old
and new.

B<Version 2.30:> In response to requests from several users, this version includes a new
capability: You can now ask the module to introspect about the classification
decisions returned by the decision tree.  Toward that end, the module includes a new
class named C<DTIntrospection>.  Perhaps the most important bit of information you
are likely to seek through DT introspection is the list of the training samples that
fall directly in the portion of the feature space that is assigned to a node.
B<CAVEAT:> When training samples are non-uniformly distributed in the underlying
feature space, IT IS POSSIBLE FOR A NODE TO EXIST EVEN WHEN NO TRAINING SAMPLES FALL
IN THE PORTION OF THE FEATURE SPACE ASSIGNED TO THE NODE.  B<(This is an important
part of the generalization achieved by probabilistic modeling of the training data.)>
For additional information related to DT introspection, see the section titled
"DECISION TREE INTROSPECTION" in this documentation page.

B<Version 2.26> fixes a bug in the part of the module that some folks use for generating
synthetic data for experimenting with decision tree construction and classification.
In the class C<TrainingDataGeneratorNumeric> that is a part of the module, there
was a problem with the order in which the features were recorded from the
user-supplied parameter file.  The basic code for decision tree construction and
classification remains unchanged.

B<Version 2.25> further downshifts the required version of Perl for this module.  This
was a result of testing the module with Version 5.10.1 of Perl.  Only one statement
in the module code needed to be changed for the module to work with the older version
of Perl.

B<Version 2.24> fixes the C<Makefile.PL> restriction on the required Perl version.  This
version should work with Perl versions 5.14.0 and higher.


