Algorithm-DecisionTree
Examples/README view on Meta::CPAN
classify_test_data_in_a_file.pl training4.csv test4_no_class_labels.csv out4.csv
(4) Let's now talk about how you can deal with features that, statistically
speaking, are not so "nice", meaning features with heavy-tailed
distributions over large value ranges. As mentioned in the HTML-based API
for this module, such features can create problems with the estimation of
the probability distributions associated with them; the main difficulty
they cause is in deciding how best to sample the value range.
Beginning with Version 2.22, you have two options for dealing with such
features. You can go with the default behavior of the module, which is to
sample the value range of such a feature over a maximum of 500 points. Or
you can supply an additional constructor option that sets a user-defined
number of sampling points. The name of the option is
"number_of_histogram_bins". The following script demonstrates its use:
construct_dt_for_heavytailed.pl
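Along the lines of that script, here is a minimal sketch of a constructor call that overrides the 500-point default with the "number_of_histogram_bins" option. The training file name and the column indices below are hypothetical; the other constructor parameters and method calls follow the module's documented usage pattern.

```perl
use strict;
use warnings;
use Algorithm::DecisionTree;

my $dt = Algorithm::DecisionTree->new(
             training_datafile        => "training_heavytailed.csv",  # hypothetical file
             csv_class_column_index   => 1,        # hypothetical: class labels in column 1
             csv_columns_for_features => [2, 3],   # hypothetical feature columns
             entropy_threshold        => 0.01,
             max_depth_desired        => 8,
             number_of_histogram_bins => 100,      # override the 500-point default sampling
             csv_cleanup_needed       => 1,
         );
$dt->get_training_data();
$dt->calculate_first_order_probabilities();
$dt->calculate_class_priors();
my $root_node = $dt->construct_decision_tree_classifier();
```

A smaller number of bins coarsens the sampling of a heavy-tailed feature's value range, which can stabilize the estimated probability distributions when the tails are sparsely populated.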
lib/Algorithm/DecisionTree.pm view on Meta::CPAN
price_to_earnings_ratio (P_to_E)
price_to_sales_ratio (P_to_S)
return_on_equity (R_on_E)
market_share (MS)
Since you are the boss, you keep track of the buy/sell decisions made by the
individual traders. But one unfortunate day, all of your traders decide to quit
because you did not pay them enough. So what do you do? If you had a module like
the one here, you could still run your company, and in a way that, on average,
would do better than any of the individual traders who worked for you. This is
what you do: you pool together the individual traders' buy/sell decisions
accumulated over the past year. This pooled information is likely to look like:
    example      buy/sell     P_to_E     P_to_S     R_on_E     MS
    =============================================================
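With the pooled decisions in a CSV file of that shape, a decision tree can stand in for the departed traders. The sketch below assumes a hypothetical file name and column layout (buy/sell label in column 1, the four features in columns 2 through 5); the constructor parameters and the "feature=value" format expected by classify() follow the module's documented usage pattern.

```perl
use strict;
use warnings;
use Algorithm::DecisionTree;

# Hypothetical pooled data file: column 1 holds the buy/sell label,
# columns 2..5 hold P_to_E, P_to_S, R_on_E, and MS.
my $dt = Algorithm::DecisionTree->new(
             training_datafile        => "pooled_trader_decisions.csv",  # hypothetical
             csv_class_column_index   => 1,
             csv_columns_for_features => [2, 3, 4, 5],
             entropy_threshold        => 0.01,
             max_depth_desired        => 5,
             csv_cleanup_needed       => 1,
         );
$dt->get_training_data();
$dt->calculate_first_order_probabilities();
$dt->calculate_class_priors();
my $root_node = $dt->construct_decision_tree_classifier();

# Ask the tree for a buy/sell call on a fresh sample (illustrative values):
my @test_sample    = qw/ P_to_E=25.0  P_to_S=2.1  R_on_E=0.15  MS=0.30 /;
my $classification = $dt->classify($root_node, \@test_sample);
```

Because the tree is trained on the union of all the traders' decisions, its splits reflect the consensus patterns in their behavior rather than any single trader's judgment.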
Note in particular the constructor parameters:
dependent_variable
predictor_columns
mse_threshold
jacobian_choice
The first of these parameters, C<dependent_variable>, is set to the column index in
the CSV file for the dependent variable. The second constructor parameter,
C<predictor_columns>, tells the system which columns contain values for the
predictor variables. The third parameter, C<mse_threshold>, is for deciding when to
partition the data at a node into two child nodes as a regression tree is being
constructed. If the minimum MSE (Mean Squared Error) that can be achieved by
partitioning on any of the features at a node is smaller than C<mse_threshold>, that
node becomes a leaf node of the regression tree.
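To make the four parameters concrete, here is a minimal constructor sketch for the distribution's regression-tree class. The file name and column indices are hypothetical, and the parameter names are taken from the list above; treat the exact spellings and method names as assumptions to be checked against the module's API documentation.

```perl
use strict;
use warnings;
use Algorithm::RegressionTree;

# Hypothetical data file: predictor in column 1, dependent variable in column 2.
my $rt = Algorithm::RegressionTree->new(
             training_datafile  => "regression_data.csv",  # hypothetical file
             dependent_variable => 2,      # column index of the dependent variable
             predictor_columns  => [1],    # columns holding predictor values
             mse_threshold      => 0.01,   # stop splitting when achievable MSE is below this
             jacobian_choice    => 0,      # 0 = plain linear least-squares, no refinement
             csv_cleanup_needed => 1,
         );
$rt->get_training_data_for_regression();
my $root_node = $rt->construct_regression_tree();
```

A smaller C<mse_threshold> yields deeper trees, since a node only becomes a leaf when no feature split can push the MSE below the threshold.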
The last parameter, C<jacobian_choice>, must be set to 0, 1, or 2. Its
default value is 0. When this parameter equals 0, the regression coefficients are
calculated using the linear least-squares method and no further "refinement" of the
coefficients is carried out using gradient descent. This is the fastest way to
calculate the regression coefficients. When C<jacobian_choice> is set to 1, you get