Algorithm-TrunkClassifier

the expression is less than I<A>, the sample is classified as I<class1>. If the
expression is greater than I<B>, the sample is classified as I<class2>. If the expression
is between I<A> and I<B>, the algorithm proceeds to the next decision level. This
continues until the last level, where the thresholds I<E> and I<F> are equal, meaning
that sample I<S> is guaranteed to be classified as either I<class1> or I<class2>.
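
The level-by-level procedure described above can be sketched in a few lines of Perl.
This is an illustration only, not the module's internal code: the trunk layout (an
array of hashes with C<probe>, C<lower> and C<upper> keys), the sub name
C<classify_sample> and the probe names are all hypothetical, and the handling of an
expression value exactly equal to the final threshold is an assumption.

```perl
use strict;
use warnings;

# Hypothetical trunk layout: one hash per decision level, holding the probe
# to inspect and its lower/upper thresholds. On the last level the two
# thresholds are equal, so every sample receives a class there.
sub classify_sample {
    my ($trunk, $sample) = @_;
    my $level = 0;
    for my $node (@$trunk) {
        $level++;
        my $expr = $sample->{ $node->{probe} };
        return ('class1', $level) if $expr < $node->{lower};
        return ('class2', $level) if $expr > $node->{upper};
        # Assumption of this sketch: a value still undecided on the final
        # level (equal to the shared threshold) is assigned to class2.
        return ('class2', $level) if $level == @$trunk;
    }
    return;    # empty trunk: nothing to classify with
}

my @trunk = (
    { probe => 'probe_17', lower => 0.2, upper => 0.8 },
    { probe => 'probe_42', lower => 0.5, upper => 0.5 },
);
my %sample = (probe_17 => 0.5, probe_42 => 0.9);

my ($class, $level) = classify_sample(\@trunk, \%sample);
print "in $level-$class\n";    # prints "in 2-class2"
```

The sample falls between the thresholds on the first level, so the decision is
deferred to the second level, where the equal thresholds force a class.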

The F<cts_trunks.txt> file contains decision trunks built using the complete training set.

The classification of each sample can be found in the F<class_report.txt> file. The rows
in this file start with a sample name, followed by "in X-class". X is the level in the
decision trunk where the sample was classified, and class is the class label assigned to
the sample.
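
For post-processing, the rows of the class report could be parsed along these lines.
The exact layout of F<class_report.txt> is not specified here, so the sketch assumes
that the sample name contains no whitespace and is separated from the "in X-class"
part by whitespace. The sub name C<parse_report_row> and the sample names are
hypothetical.

```perl
use strict;
use warnings;

# Split one report row into sample name, decision level and class label.
# Returns undef (in scalar context) for rows that do not match the assumed layout.
sub parse_report_row {
    my ($row) = @_;
    return unless $row =~ /^(\S+)\s+in\s+(\d+)-(\S+)/;
    return { sample => $1, level => $2, class => $3 };
}

# Tally how many samples were classified at each decision level.
my @rows = ("sample_001 in 1-T_HEALTHY", "sample_150 in 3-T_MALIGN");
my %per_level;
for my $row (@rows) {
    my $rec = parse_report_row($row) or next;
    $per_level{ $rec->{level} }++;
}
print "level $_: $per_level{$_} sample(s)\n" for sort { $a <=> $b } keys %per_level;
```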

The F<log.txt> file gives a summary of the classifier run. The information given includes
the name of the input data file, the name of the test data file (if any), the name of
the classification procedure, the split-sample percentage (if any), number of decision
levels used for classification, the name of the classification variable, the sizes of
class1 and class2 in the training and test set respectively, and the version of the algorithm.

If the C<-u> option is used, the output files will contain the results from using
decision trunks with 1, 2, 3, 4 and 5 levels.

=head2 EXAMPLE

To provide an easy way of testing the algorithm, the t/ folder contains two test files.
The F<test_data.txt> file contains a random dataset with 200 samples and 1000 probes. This
set has been generated such that the first 100 samples (healthy) have a mean gene expression
of 0 and standard deviation of 0.5 (normal distribution) for all genes, while the remaining
100 samples (malignant) have a mean of 1 and standard deviation of 0.5. The F<test_supp.txt>
file is a supplementary file containing the class information associated with the random
dataset. To run the algorithm with this dataset, use the following command.

C<perl run_classifier.pl -v -o test_set_tissue -s test_supp.txt test_data.txt>
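
As a side note, data with the statistical properties stated for the random dataset
can be simulated with a short script. The sketch below is not how F<test_data.txt>
was actually produced (that is not documented here). It draws normally distributed
values with the Box-Muller transform, and the sub names C<rand_normal> and
C<simulate_dataset> are hypothetical. It builds the matrix in memory only and does not
attempt to reproduce the input file format.

```perl
use strict;
use warnings;

use constant PI => 4 * atan2(1, 1);

# Draw one value from a normal distribution using the Box-Muller transform.
sub rand_normal {
    my ($mean, $sd) = @_;
    my $u1 = 1 - rand();    # in (0, 1], which avoids log(0)
    my $u2 = rand();
    return $mean + $sd * sqrt(-2 * log($u1)) * cos(2 * PI * $u2);
}

# Build a probes-by-samples matrix: in every row the first half of the
# samples has mean 0 and the second half mean 1, both with sd 0.5.
sub simulate_dataset {
    my ($n_probes, $n_per_class) = @_;
    my @rows;
    for my $probe (1 .. $n_probes) {
        my @healthy   = map { rand_normal(0, 0.5) } 1 .. $n_per_class;
        my @malignant = map { rand_normal(1, 0.5) } 1 .. $n_per_class;
        push @rows, [ @healthy, @malignant ];
    }
    return \@rows;
}

my $data = simulate_dataset(1000, 100);
printf "%d probes x %d samples\n", scalar @$data, scalar @{ $data->[0] };
```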

Since a supplementary file is given, a new data file with class information will be
written. Following this, the algorithm will build decision trunks and determine how many
decision levels to use for classification. Finally, LOOCV will be performed using the
selected trunks and output written. If no classification variable is explicitly given,
the algorithm will default to TISSUE. For the random dataset, this variable states if the
sample comes from healthy tissue or from a tumor. The supplementary file labels healthy
samples as T_HEALTHY and tumor samples as T_MALIGN. By looking in the supplementary file
it can also be seen that the random dataset comes with a second classification variable:
GRADE. This variable states whether a tumor sample comes from a low- or high-grade tumor.
This is indicated by G_LOW and G_HIGH. Since the healthy samples do not come from tumors,
they do not have GRADE classes. To indicate this, #NA is used. The #NA symbol is
interpreted by the algorithm as a null class, causing the sample to be excluded if GRADE
is given as the classification variable. To test this, use the following command.

C<perl run_classifier.pl -v -c GRADE -o test_set_stage -s test_supp.txt test_data.txt>
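
The effect of the null class can be illustrated with a small filter. This is a sketch
of the idea, not the module's own code: the sub name C<samples_with_class> and the
sample names are hypothetical. Because LOOCV holds out each retained sample once, the
number of retained samples is also the number of folds.

```perl
use strict;
use warnings;

# Keep only samples whose label for the chosen classification variable
# is not the null class '#NA'.
sub samples_with_class {
    my ($labels) = @_;    # hash ref: sample name => class label
    return sort { $a cmp $b } grep { $labels->{$_} ne '#NA' } keys %$labels;
}

my %grade = (
    s001 => '#NA',      # healthy samples carry no GRADE class
    s002 => '#NA',
    s101 => 'G_LOW',
    s102 => 'G_HIGH',
);

my @kept = samples_with_class(\%grade);
printf "%d of %d samples kept, so LOOCV would run %d folds\n",
    scalar @kept, scalar keys %grade, scalar @kept;
```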

By comparing the output files, differences can be seen in how many folds of LOOCV have
been carried out, and in which probes were selected for the decision trunks. The log
file will also reflect that a different classification variable was used. Accuracy will
be good when classifying TISSUE, because the healthy and tumor samples have sufficiently
different gene expression values. For GRADE, however, all tumor samples have the same mean
and standard deviation, so the algorithm is not able to separate them.
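
The LOOCV loop itself can be sketched with a deliberately simple stand-in classifier.
The code below is not the decision trunk algorithm: it uses a single probe and a
threshold halfway between the two class means, purely to show how each sample is held
out in turn. The sub name C<loocv_accuracy> is hypothetical, and the sketch assumes
exactly two classes with at least two samples each.

```perl
use strict;
use warnings;

# Leave-one-out cross-validation around a one-threshold classifier.
# $values and $labels are parallel array refs (expression value, class).
sub loocv_accuracy {
    my ($values, $labels) = @_;
    my $correct = 0;
    for my $held_out (0 .. $#$values) {
        my (%sum, %count);
        for my $i (0 .. $#$values) {
            next if $i == $held_out;    # train on all other samples
            $sum{ $labels->[$i] }   += $values->[$i];
            $count{ $labels->[$i] } += 1;
        }
        # Order the two classes by their mean expression in the training fold.
        my %mean = map { $_ => $sum{$_} / $count{$_} } keys %sum;
        my ($low, $high) = sort { $mean{$a} <=> $mean{$b} } keys %mean;
        my $threshold = ($mean{$low} + $mean{$high}) / 2;
        my $predicted = $values->[$held_out] < $threshold ? $low : $high;
        $correct++ if $predicted eq $labels->[$held_out];
    }
    return $correct / scalar @$values;
}

my @values = (0.1, -0.2, 0.3, 1.1, 0.9, 1.2);
my @labels = (('T_HEALTHY') x 3, ('T_MALIGN') x 3);
printf "LOOCV accuracy: %.2f\n", loocv_accuracy(\@values, \@labels);
```

With well-separated class means, as for TISSUE, every held-out sample in this toy
example lands on the correct side of the threshold. When both classes share the same
distribution, as for GRADE, no threshold can be expected to separate them reliably.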

=head2 WARNINGS AND ERROR MESSAGES

If an invalid argument is given, or if there is something wrong with the input data file
or supplementary file, the algorithm will output a warning or error message. Warnings
will not prevent the algorithm from running, but errors will. Here is a list of all
warnings/errors and how to interpret them.

=head3 WARNINGS

=over 4

=item No classification variable names found in supplementary file

Indicates that the supplementary file has fewer than two columns. The algorithm expects the
first column to contain sample names and the following columns to contain class labels of
samples. The first row in the file must contain the names of the columns. For column 2, 3
and so on, the column name should be the classification variable name.

=item Missing class in supplmentary file at line I<index>, replacing with #NA

Indicates that a class label is missing on line I<index>. The missing label is
replaced with #NA, the symbol for the null class.

=item No sample classes found in supplementary file

Indicates that no class labels (rows) were found in the supplementary file.

=item Sample I<sampleName> has no I<classVar> class in supplementary file

Indicates that sample I<sampleName> in the input/testset data file is missing a class
label for I<classVar> in the supplementary file. The sample's class label becomes #NA.

=item CLASSVAR name missing in meta data of input/testset data file

Indicates that a meta data row (starting with #CLASSVAR) is missing the I<name> value.
The expected format is

#CLASSVAR name classLabel1 classLabel2

=item CLASSVAR class labels for I<classVar> missing in meta data of input/testset data file

Indicates that a meta data row (starting with #CLASSVAR) is missing one or both of the
I<classLabel> values.

=item CLASSMEM name missing in meta data of input/testset data file

Indicates that a meta data row (starting with #CLASSMEM) is missing the I<name> value.
The expected format is

#CLASSMEM name sampleOneClass sampleTwoClass sampleThreeClass ...

=item Duplicate sample name I<sampleName> at positions I<pos1> and I<pos2> in input/testset data file

Indicates that two sample names in the input or testset data file are identical. This
does not affect classification.

=item Missing/invalid value I<value> in input/testset data file at probe I<index>

Indicates that an expression value for probe I<index> is missing or invalid.

=item Supplied level is to high, using trunks with I<level> level(s) instead

Indicates that the number of levels given to the C<-l> argument was too high. This can
happen when there are not enough samples to create five levels in a trunk. If 4 is given
to the C<-l> argument but only two levels could be created, the algorithm will use two
levels instead.
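
To make the supplementary file warnings listed above more concrete, here is a minimal
sketch of a parser for the layout they describe: a header row with column names, sample
names in the first column and class labels in the following columns. It is not the
module's own parser. The sub name C<parse_supplementary> is hypothetical, tab-delimited
input is an assumption, the warning texts are only modeled on the messages above, and
counting the header as line 1 is also an assumption.

```perl
use strict;
use warnings;

# Parse supplementary-file lines into { classVar => { sampleName => label } },
# replacing missing labels with the null class '#NA'.
sub parse_supplementary {
    my (@lines) = @_;
    return {} unless @lines;
    chomp @lines;

    # First row: name of the sample column, then classification variable names.
    my ($name_header, @class_vars) = split /\t/, shift @lines;
    warn "No classification variable names found in supplementary file\n"
        unless @class_vars;

    my %classes;
    my $line_no = 1;    # the header is counted as line 1 (assumption)
    for my $line (@lines) {
        $line_no++;
        my ($sample, @labels) = split /\t/, $line, -1;    # -1 keeps trailing empties
        for my $i (0 .. $#class_vars) {
            my $label = $labels[$i];
            if (!defined $label || $label eq '') {
                warn "Missing class in supplementary file at line $line_no, replacing with #NA\n";
                $label = '#NA';
            }
            $classes{ $class_vars[$i] }{$sample} = $label;
        }
    }
    return \%classes;
}

my $classes = parse_supplementary(
    "SAMPLE\tTISSUE\tGRADE\n",
    "s001\tT_HEALTHY\t#NA\n",
    "s101\tT_MALIGN\tG_LOW\n",
);
print "s101 GRADE: $classes->{GRADE}{s101}\n";    # prints "s101 GRADE: G_LOW"
```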


