Algorithm-TrunkClassifier


pod/TrunkClassifier.pod  view on Meta::CPAN

=head1 NAME

Algorithm::TrunkClassifier - Implementation of the Decision Trunk Classifier algorithm

=head1 SYNOPSIS

  use Algorithm::TrunkClassifier qw(runClassifier);

=head1 DESCRIPTION

This module contains the implementation of the Decision Trunk Classifier. The algorithm
can be used to perform binary classification on numeric data, e.g. the result of a
gene expression profiling experiment. Classification is based on so-called decision
trunks, which consist of a sequence of decision levels, represented as nodes in the
trunk. For each decision level, a probe is selected from the input data, and two decision
thresholds are calculated. These thresholds are associated with the two outgoing edges of
the decision level. One edge represents the first class and the other edge represents
the second class.

During classification, the decision levels of a trunk are considered one at a time. To
classify a sample, its expression of the probe at the decision level is compared to the
thresholds of outgoing edges. If the expression is less than the first threshold,
class1 is assigned to the sample. If, on the other hand, the expression is greater than
the second threshold, class2 is assigned to the sample. If the expression lies
between the two thresholds, the algorithm proceeds to the next decision level of the
trunk.
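
The traversal described above can be sketched in a few lines of Perl. This is only an
illustration, not the module's internal code: the trunk layout (a list of hashes with
C<probe>, C<lower> and C<upper> keys), the probe names and the threshold values are all
assumptions made for the example. What happens when a sample stays undecided at the last
level is not covered by the text above, so the sketch simply returns an undefined value.

```perl
use strict;
use warnings;

# Hypothetical trunk: an ordered list of decision levels, each holding a probe
# name and two thresholds. Names and values are illustrative only.
my @trunk = (
    { probe => 'probeA', lower => 2.0, upper => 5.0 },
    { probe => 'probeB', lower => 1.0, upper => 3.0 },
);

# Walk the decision levels in order and stop at the first level whose
# thresholds decide the class of the sample.
sub classify_sample {
    my ($trunk, $sample) = @_;
    for my $level (@$trunk) {
        my $value = $sample->{ $level->{probe} };
        return 'class1' if $value < $level->{lower};
        return 'class2' if $value > $level->{upper};
        # Otherwise the value lies between the thresholds: try the next level.
    }
    return;    # undecided at every level (assumption: left to the caller)
}

my %sample = (probeA => 3.5, probeB => 4.2);
my $label  = classify_sample(\@trunk, \%sample);
print defined $label ? "$label\n" : "undecided\n";
```

With the values above, probeA (3.5) falls between its thresholds, so the decision is made
at the second level, where probeB (4.2) exceeds the upper threshold and class2 is assigned.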

By default, classification is done by leave-one-out cross validation (LOOCV), meaning
that a single sample is used as the test set while the remaining samples are used to build
the classifier. This is repeated for every sample in the input dataset. See the algorithm
publication for more details. A PubMed link can be found in L</"SEE ALSO">.
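
The leave-one-out procedure itself is independent of the classifier and can be sketched
as a generic loop. The C<loocv> subroutine and the two callbacks below are hypothetical
names for this example, and the toy "model" (a mean value used as a cut-off) merely stands
in for trunk building. It is not how the module trains its classifier.

```perl
use strict;
use warnings;

# Generic leave-one-out loop: each sample is held out once, a model is built
# from the remaining samples, and the held-out sample is then predicted.
sub loocv {
    my ($samples, $train, $predict) = @_;
    my @predictions;
    for my $i (0 .. $#$samples) {
        my @training = @{$samples}[ grep { $_ != $i } 0 .. $#$samples ];
        my $model    = $train->(\@training);
        push @predictions, $predict->($model, $samples->[$i]);
    }
    return \@predictions;
}

# Toy stand-in for a classifier: the model is the mean of the training values.
my $train = sub {
    my ($values) = @_;
    my $sum = 0;
    $sum += $_ for @$values;
    return $sum / @$values;
};
my $predict = sub {
    my ($mean, $value) = @_;
    return $value > $mean ? 'high' : 'low';
};

my $result = loocv([1, 2, 3, 10], $train, $predict);
print "@$result\n";
```

Each of the four samples is predicted exactly once, always by a model that never saw it
during training.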

=head2 ARGUMENTS

Following installation, the algorithm can be run from the terminal using the
run_classifier.pl script supplied in the t/ folder. The command has the following form:

C<perl run_classifier.pl [Options] [Input data file]>

=head3 INPUT DATA FILE

The last argument must be the name of the input data file containing the expression
data in table format, where columns are tab-separated. The first row must contain
column names and the first column must contain row names. Samples need to be given in
columns and probes/attributes in rows. Before the name of the input data file, a number of
optional arguments may be given, see L</"OPTIONS"> below. A data file containing random data
is provided in the t/ folder.

=head3 META DATA

At the top of the input data file, before the expression data table, a number of meta data
rows starting with # can be given. The purpose of these rows is to tell the algorithm
which classification variables are defined for the data and what classes the samples
belong to. The classification variable is the name of the property by which the samples
are divided into two groups. For example, if the samples should be classified as either
early or late stage cancer, the name of the classification variable would be STAGE.

Classification variables are defined on rows starting with #CLASSVAR, followed by the
name of the variable and the two class labels.

  #CLASSVAR name classLabel1 classLabel2

Class labels (e.g. EARLY and LATE) are assigned to samples on rows starting with #CLASSMEM,
followed by the name of the classification variable and the class labels for all samples.

  #CLASSMEM name sampleOneClass sampleTwoClass sampleThreeClass ...

Since this would be very tedious to fill in manually for large datasets, the algorithm
accepts a supplementary file with class information for all samples. See the C<-s value>
option below. An example of a supplementary file is given in the t/ folder.

An example of meta rows for the classification variable STAGE in a dataset with five
samples is

  #CLASSVAR STAGE EARLY LATE
  #CLASSMEM STAGE LATE LATE EARLY LATE EARLY
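
To make the layout of the meta rows concrete, the following sketch reads the two STAGE
rows from the example into Perl data structures. The C<parse_meta_rows> subroutine is a
hypothetical helper written for this illustration, not part of the module, and it assumes
that the fields on a meta row are separated by whitespace.

```perl
use strict;
use warnings;

# Collect class labels (#CLASSVAR) and per-sample class membership (#CLASSMEM),
# keyed by the name of the classification variable.
sub parse_meta_rows {
    my (@lines) = @_;
    my (%labels, %members);
    for my $line (@lines) {
        my ($tag, $name, @rest) = split ' ', $line;
        next unless defined $tag;    # skip empty lines
        if    ($tag eq '#CLASSVAR') { $labels{$name}  = [@rest] }
        elsif ($tag eq '#CLASSMEM') { $members{$name} = [@rest] }
    }
    return (\%labels, \%members);
}

my ($labels, $members) = parse_meta_rows(
    '#CLASSVAR STAGE EARLY LATE',
    '#CLASSMEM STAGE LATE LATE EARLY LATE EARLY',
);
print "STAGE classes: @{ $labels->{STAGE} }\n";
print "Sample count: ", scalar @{ $members->{STAGE} }, "\n";
```

For the example rows, this yields the two class labels EARLY and LATE and a membership
vector with five entries, one per sample.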

=head3 OPTIONS

=over 4

=item C<-p value>

By default, the algorithm trains and classifies one dataset using leave-one-out cross
validation. Two other classification procedures are supported: split-sample and dual
datasets. The split-sample procedure causes the input dataset to be split into two
sets, a training set and a test set. Trunks are built using the training set and
classification is done only on the test set. It is also possible to supply two datasets.
The input data file (final command line argument) is then used as training set and the
second dataset as test set. Thus the value for the -p option should be C<loocv> for
leave-one-out cross validation, C<split> for split-sample procedure, or C<dual> when

file must be given as the last command line argument. A testset data file must be
supplied using the -t option when the C<-p dual> option is used.

=item Unable to create new data file I<filename>

Indicates that the new input data file with meta data could not be written.

=item No samples in input/testset data file

Indicates that the input/testset data file, which is supposed to contain the expression data table,
does not contain any sample columns.

=item CLASSVAR class label equals NULL CLASS in input/testset data file

Indicates that one of the I<classLabel> values for a #CLASSVAR meta data row is equal
to the null class symbol #NA. The I<classLabel> values are the class labels for the
classification variable and cannot be equal to #NA, because this symbol is reserved as
a null class symbol.

=item Missing meta data for classification variable I<classVar> in input/testset data file

Indicates that the classification variable name given to the -c option is missing in
the meta data of the input or testset data file.

=item CLASSMEM vector for I<classVar> and sample vector have different lengths in input/testset data file

Indicates that the number of class labels on the #CLASSMEM row for I<classVar> is
different from the number of samples in the input or testset data file.

=item Invalid class label in I<classVar> CLASSMEM vector in input/testset data file

Indicates that one or more class labels on the #CLASSMEM row for I<classVar> are
invalid. Valid class labels are those given on the #CLASSVAR row for I<classVar> and
the null class symbol #NA.

=item Class I<classLabel> for classification variable I<classVar> has zero members in input/testset data file

Indicates that no samples have the class label I<classLabel> for classification
variable I<classVar>. This means that classification cannot be carried out, since all
samples belong to the same class.

=item Wrong number of columns in input/testset data file at probe I<index>

Indicates that a probe row in the input or testset data file has the wrong number of columns with respect
to the number of samples.

=item Probe C<probename> in input data file not found in testset data file

Indicates that a probe in the input data file is missing in the testset. For classification
to be carried out using two datasets, all probes in the input data file must be present
in the testset.

=item Unable to create output file

Indicates that output files were not writable in the output folder.

=back
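
Several of the messages above concern the consistency of the #CLASSVAR and #CLASSMEM
meta data. The sketch below shows how such checks could be expressed. It is an
illustration only: C<check_class_meta> is a hypothetical helper, and the strings it returns
paraphrase the conditions described above rather than reproduce the module's exact
messages.

```perl
use strict;
use warnings;

my $NULL_CLASS = '#NA';    # reserved null class symbol

# Return a list of problems found for one classification variable.
sub check_class_meta {
    my ($class_var, $labels, $members, $n_samples) = @_;
    my @problems;

    # Class labels must not collide with the null class symbol.
    push @problems, "$class_var: class label equals $NULL_CLASS"
        if grep { $_ eq $NULL_CLASS } @$labels;

    # One class label is expected per sample.
    push @problems, "$class_var: CLASSMEM length differs from sample count"
        if @$members != $n_samples;

    # Only the declared labels and the null class symbol are valid.
    my %valid = map { ($_ => 1) } @$labels, $NULL_CLASS;
    push @problems, "$class_var: invalid class label in CLASSMEM"
        if grep { !$valid{$_} } @$members;

    # Both classes need at least one member.
    for my $label (@$labels) {
        push @problems, "$class_var: class $label has zero members"
            unless grep { $_ eq $label } @$members;
    }
    return @problems;
}

my @problems = check_class_meta(
    'STAGE', [qw(EARLY LATE)], [qw(LATE LATE EARLY LATE EARLY)], 5,
);
print @problems ? join("\n", @problems) . "\n" : "meta data looks consistent\n";
```

Returning all problems at once, rather than stopping at the first, is a design choice made
for the example so that a data file can be corrected in one pass.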

=head1 SEE ALSO

The publication describing the algorithm can be found on PubMed via this link:
L<http://www.ncbi.nlm.nih.gov/pubmed?Db=pubmed&Cmd=DetailsSearch&Term=23467331%5Buid%5D>

=head1 EXPORT

None by default. The runClassifier subroutine is exported on request.

=head1 AUTHOR

Benjamin Ulfenborg, E<lt>wolftower85@gmail.comE<gt>

=head1 COPYRIGHT AND LICENSE

Copyright (C) 2012 by Benjamin Ulfenborg

This module is free to use, modify and redistribute for academic purposes.

=cut


