Algorithm-DecisionTree
Examples/README view on Meta::CPAN
classify_test_data_in_a_file.pl training4.csv test4_no_class_labels.csv out4.csv
(4) Let's now talk about how you can deal with features that, statistically
speaking, are not so "nice", meaning features with heavy-tailed
distributions over large value ranges. As mentioned in the HTML-based API
for this module, such features can create problems with the estimation of
the probability distributions associated with them; the main difficulty
they cause is in deciding how best to sample the value range.
Beginning with Version 2.22, you have two options for dealing with such
features. You can go with the default behavior of the module, which is to
sample the value range of such a feature over a maximum of 500 points. Or
you can supply an additional constructor option that sets a user-defined
number of sampling points. The name of the option is
"number_of_histogram_bins". The following script demonstrates its use:
construct_dt_for_heavytailed.pl
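Along the lines of that script, here is a minimal sketch of a constructor call that overrides the 500-point default with the "number_of_histogram_bins" option. The training file name and the column indices below are hypothetical; the other constructor parameters and method calls follow the module's documented usage pattern.

```perl
use strict;
use warnings;
use Algorithm::DecisionTree;

my $dt = Algorithm::DecisionTree->new(
             training_datafile        => "training_heavytailed.csv",  # hypothetical file
             csv_class_column_index   => 1,        # hypothetical: class labels in column 1
             csv_columns_for_features => [2, 3],   # hypothetical feature columns
             entropy_threshold        => 0.01,
             max_depth_desired        => 8,
             number_of_histogram_bins => 100,      # override the 500-point default sampling
             csv_cleanup_needed       => 1,
         );
$dt->get_training_data();
$dt->calculate_first_order_probabilities();
$dt->calculate_class_priors();
my $root_node = $dt->construct_decision_tree_classifier();
```

A smaller number of bins coarsens the sampling of a heavy-tailed feature's value range, which can stabilize the estimated probability distributions when the tails are sparsely populated.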
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
price_to_earnings_ratio (P_to_E)
price_to_sales_ratio (P_to_S)
return_on_equity (R_on_E)
market_share (MS)
Since you are the boss, you keep track of the buy/sell decisions made by the
individual traders. But one unfortunate day, all of your traders decide to quit
because you did not pay them enough. So what do you do? If you had a module like
the one here, you could still run your company, and in a way that, on average,
would do better than any of the individual traders who worked for you. This is
what you do: you pool together the individual traders' buy/sell decisions
accumulated over the past year. This pooled information is likely to look like:
    example      buy/sell     P_to_E     P_to_S     R_on_E     MS
    =============================================================
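With the pooled decisions in a CSV file of that shape, a decision tree can stand in for the departed traders. The sketch below assumes a hypothetical file name and column layout (buy/sell label in column 1, the four features in columns 2 through 5); the constructor parameters and the "feature=value" format expected by classify() follow the module's documented usage pattern.

```perl
use strict;
use warnings;
use Algorithm::DecisionTree;

# Hypothetical pooled data file: column 1 holds the buy/sell label,
# columns 2..5 hold P_to_E, P_to_S, R_on_E, and MS.
my $dt = Algorithm::DecisionTree->new(
             training_datafile        => "pooled_trader_decisions.csv",  # hypothetical
             csv_class_column_index   => 1,
             csv_columns_for_features => [2, 3, 4, 5],
             entropy_threshold        => 0.01,
             max_depth_desired        => 5,
             csv_cleanup_needed       => 1,
         );
$dt->get_training_data();
$dt->calculate_first_order_probabilities();
$dt->calculate_class_priors();
my $root_node = $dt->construct_decision_tree_classifier();

# Ask the tree for a buy/sell call on a fresh sample (illustrative values):
my @test_sample    = qw/ P_to_E=25.0  P_to_S=2.1  R_on_E=0.15  MS=0.30 /;
my $classification = $dt->classify($root_node, \@test_sample);
```

Because the tree is trained on the union of all the traders' decisions, its splits reflect the consensus patterns in their behavior rather than any single trader's judgment.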
Note in particular the constructor parameters:
dependent_variable
predictor_columns
mse_threshold
jacobian_choice
The first of these parameters, C<dependent_variable>, is set to the column index in
the CSV file for the dependent variable. The second constructor parameter,
C<predictor_columns>, tells the system which columns contain values for the
predictor variables. The third parameter, C<mse_threshold>, is for deciding when to
partition the data at a node into two child nodes as a regression tree is being
constructed. If the minimum MSE (Mean Squared Error) that can be achieved by
partitioning on any of the features at a node is smaller than C<mse_threshold>, that
node becomes a leaf node of the regression tree.
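To make the four parameters concrete, here is a minimal constructor sketch for the distribution's regression-tree class. The file name and column indices are hypothetical, and the parameter names are taken from the list above; treat the exact spellings and method names as assumptions to be checked against the module's API documentation.

```perl
use strict;
use warnings;
use Algorithm::RegressionTree;

# Hypothetical data file: predictor in column 1, dependent variable in column 2.
my $rt = Algorithm::RegressionTree->new(
             training_datafile  => "regression_data.csv",  # hypothetical file
             dependent_variable => 2,      # column index of the dependent variable
             predictor_columns  => [1],    # columns holding predictor values
             mse_threshold      => 0.01,   # stop splitting when achievable MSE is below this
             jacobian_choice    => 0,      # 0 = plain linear least-squares, no refinement
             csv_cleanup_needed => 1,
         );
$rt->get_training_data_for_regression();
my $root_node = $rt->construct_regression_tree();
```

A smaller C<mse_threshold> yields deeper trees, since a node only becomes a leaf when no feature split can push the MSE below the threshold.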
The last parameter, C<jacobian_choice>, must be set to 0, 1, or 2. Its
default value is 0. When this parameter equals 0, the regression coefficients are
calculated using the linear least-squares method and no further "refinement" of the
coefficients is carried out using gradient descent. This is the fastest way to
calculate the regression coefficients. When C<jacobian_choice> is set to 1, you get