Algorithm-TrunkClassifier


validation. Two other classification procedures are supported: split-sample and dual
datasets. The split-sample procedure causes the input dataset to be split into two
sets, a training set and a test set. Trunks are built using the training set and
classification is done only on the test set. It is also possible to supply two datasets.
The input data file (final command line argument) is then used as training set and the
second dataset as test set. Thus the value for the -p option should be C<loocv> for
leave-one-out cross validation, C<split> for the split-sample procedure, or C<dual> when
using two datasets.
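The split-sample partitioning can be illustrated with a short sketch (a hypothetical
Python helper, not part of this module; the function name, seed handling and sample
names are made up for illustration):

```python
import random

def split_sample(samples, test_percent=20, seed=0):
    # Shuffle a copy of the sample list, then hold out test_percent
    # of the samples (mirroring the -e option) as the test set;
    # the remainder becomes the training set.
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, len(shuffled) * test_percent // 100)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_sample([f"sample{i}" for i in range(10)])
```

With ten samples and the default 20 %, two samples end up in the test set and eight
in the training set.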

=item C<-e value>

The percentage of samples in the input data file that should be used as test set when
using C<-p split>. Must be from 1 to 99. Default is 20.

=item C<-t value>

The name of the testset data file when using the C<-p dual> option.

=item C<-c value>

The value should be the name of the classification variable to use. Default is TISSUE.

=item C<-o value>

The value should be the name of the output folder. Created if it does not exist in the
current directory. Default is current directory.

=item C<-l value>

By default, the algorithm selects the number of decision levels to use for
classification. To override this, supply the -l option and an integer from 1 to 5. This
will force the algorithm to use that number of decision levels.

=item C<-i value>

This option can be used to inspect the dataset without running the classifier.
The option takes one of three possible values: C<samples>, C<probes> or C<classes>.

  samples: prints the number of samples in each class for the classification variable
  probes:  prints the number of probes in the dataset
  classes: prints all classification variables in the meta data

=item C<-s value>

Name of a supplementary file containing class information for the samples in the
dataset. The contents should be in table format with columns being tab-separated. The
first row needs to contain column names and the first column should contain sample
names. The second and subsequent columns can contain class information, with the name
of the classification variable given as the column name, followed by class labels
on the rows starting with sample names. Examples of classification variables are STAGE,
GRADE and HISTOLOGY. Class labels could be EARLY and LATE for STAGE, or LOW and HIGH
for GRADE. The format of the file is illustrated here.

  Samples	ClassVar1	ClassVar2
  sample1	classLabel1	classLabel3
  sample2	classLabel1	classLabel4
  sample3	classLabel2	classLabel3
  sample4	classLabel1	classLabel4
  sample5	classLabel2	classLabel4

When this option is given, the algorithm first processes the supplementary file and
writes a new data file containing meta data. This data file is then used as input.

Note: If the C<-p dual> option is used, two datasets must be supplied. In this case the
supplementary file needs to contain the class information of all samples in both datasets.
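A file in the layout shown above could be read with a sketch like the following
(a hypothetical Python parser, not the module's own code):

```python
def parse_supplementary(lines):
    # First row holds column names; first column holds sample names.
    # Returns {class_variable: {sample_name: class_label}}.
    header = lines[0].rstrip("\n").split("\t")
    class_vars = header[1:]
    table = {var: {} for var in class_vars}
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        sample, labels = fields[0], fields[1:]
        for var, label in zip(class_vars, labels):
            table[var][sample] = label
    return table

example = [
    "Samples\tClassVar1\tClassVar2",
    "sample1\tclassLabel1\tclassLabel3",
    "sample2\tclassLabel1\tclassLabel4",
]
info = parse_supplementary(example)
```

Indexing first by classification variable makes it cheap to look up the class label
of any sample for the variable named by the -c option.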

=item C<-v>

This option makes the algorithm report its progress to the terminal during a run.

=item C<-u>

This option circumvents selection of decision levels and makes the algorithm use trunks
with 1, 2, 3, 4 and 5 decision levels during classification.

=item C<-h>

This option causes argument documentation to be printed to the terminal.

=back

=head2 OUTPUT

The algorithm produces five files as output: F<performance.txt>, F<loo_trunks.txt>,
F<cts_trunks.txt>, F<class_report.txt> and F<log.txt>. The classification accuracy
can be found in F<performance.txt>. In case of leave-one-out cross validation, the
accuracy for each fold is reported along with the average accuracy across all folds.
Since the test set consists of one sample, the accuracy of one LOOCV fold is either
0 % (wrong) or 100 % (correct). For split-sample and dual-dataset classification, a
single accuracy value is reported since there is only one test set.
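The fold accounting described above amounts to averaging a series of 0/100 values,
as in this illustrative sketch (hypothetical Python, not the module's code):

```python
def loocv_accuracy(fold_results):
    # Each LOOCV fold tests a single held-out sample, so its
    # accuracy is either 0 % (wrong) or 100 % (correct); the
    # reported figure is the average across all folds.
    per_fold = [100.0 if correct else 0.0 for correct in fold_results]
    return sum(per_fold) / len(per_fold)

print(loocv_accuracy([True, True, False, True]))  # prints 75.0
```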

The F<loo_trunks.txt> file contains the decision trunks resulting from leave-one-out
training on the training set. Since the training set is different in each fold,
different probes may be selected in the trunks. The decision levels of a trunk are shown
in order starting with the first level at the top. Each level consists of two rows:
the first row shows the name of the probe and the second row contains the decision
thresholds and the associated class labels. An illustration of a decision trunk with
three levels is shown here.

              Probe X
  <= A (class1)     > B (class2)
  
              Probe Y
  <= C (class1)     > D (class2)
  
              Probe Z
  <= E (class1)     > F (class2)

Classification of a sample I<S> using this decision trunk would proceed as follows.

Compare the expression of probe I<X> in sample I<S> with thresholds I<A> and I<B>. If
the expression is less than or equal to I<A>, the sample is classified as I<class1>. If
the expression is greater than I<B>, the sample is classified as I<class2>. If the
expression lies between I<A> and I<B>, the algorithm proceeds to the next decision level. This
continues until the last level, where the thresholds I<E> and I<F> are equal, meaning
that sample I<S> is guaranteed to be classified as either I<class1> or I<class2>.
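The walk through the trunk can be sketched as follows (a minimal Python
illustration with made-up probe names and thresholds; the tuple layout is an
assumption for the sketch, not the module's internal representation):

```python
def classify(sample_expr, trunk):
    # Each level is (probe, lower_thr, upper_thr, class_low, class_high).
    # A sample at or below the lower threshold gets class_low, above the
    # upper threshold gets class_high; otherwise it falls through to the
    # next level. At the last level the two thresholds are equal, so a
    # class is always assigned.
    for probe, low, high, cls_low, cls_high in trunk:
        x = sample_expr[probe]
        if x <= low:
            return cls_low
        if x > high:
            return cls_high
    raise ValueError("last level must have equal thresholds")

trunk = [
    ("ProbeX", 2.0, 5.0, "class1", "class2"),
    ("ProbeY", 1.0, 4.0, "class1", "class2"),
    ("ProbeZ", 3.0, 3.0, "class1", "class2"),  # E == F at the last level
]
print(classify({"ProbeX": 3.0, "ProbeY": 2.5, "ProbeZ": 6.0}, trunk))  # prints class2
```

Here the sample falls between the thresholds at the first two levels and is only
resolved at the third, where the equal thresholds force a decision.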

The F<cts_trunks.txt> file contains decision trunks built using the complete training set.

The classification of each sample can be found in the F<class_report.txt> file. The rows
in this file start with a sample name, followed by "in X-class". X is the level in the


