Algorithm-TrunkClassifier
Algorithm/TrunkClassifier/ppport.h view on Meta::CPAN
macro. Just C<#define> the macro before including C<ppport.h>:
#define DPPP_NAMESPACE MyOwnNamespace_
#include "ppport.h"
The default namespace is C<DPPP_>.
=back
The good thing is that most of the above can be checked by running
F<ppport.h> on your source code. See the next section for
details.
=head1 EXAMPLES
To verify whether F<ppport.h> is needed for your module, whether you
should make any changes to your code, and whether any special defines
should be used, F<ppport.h> can be run as a Perl script to check your
source code. Simply say:
perl ppport.h
Algorithm/TrunkClassifier/ppport.h view on Meta::CPAN
* data from C. All statics in extensions should be reworked to use
* this, if you want to make the extension thread-safe. See ext/re/re.xs
* for an example of the use of these macros.
*
* Code that uses these macros is responsible for the following:
* 1. #define MY_CXT_KEY to a unique string, e.g. "DynaLoader_guts"
* 2. Declare a typedef named my_cxt_t that is a structure that contains
* all the data that needs to be interpreter-local.
* 3. Use the START_MY_CXT macro after the declaration of my_cxt_t.
* 4. Use the MY_CXT_INIT macro such that it is called exactly once
* (typically put in the BOOT: section).
* 5. Use the members of the my_cxt_t structure everywhere as
* MY_CXT.member.
* 6. Use the dMY_CXT macro (a declaration) in all the functions that
* access MY_CXT.
*/
#if defined(MULTIPLICITY) || defined(PERL_OBJECT) || \
defined(PERL_CAPI) || defined(PERL_IMPLICIT_CONTEXT)
#ifndef START_MY_CXT
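The six steps listed in the comment above fit together roughly as follows. This is a hedged sketch of the documented MY_CXT pattern in an XS file, not code from this distribution; the module name C<MyModule> and the C<counter> member are illustrative assumptions:

```c
#define MY_CXT_KEY "MyModule::_guts" XS_VERSION   /* step 1: unique key */

typedef struct {
    int counter;              /* step 2: all interpreter-local data */
} my_cxt_t;

START_MY_CXT                  /* step 3: right after the typedef */

MODULE = MyModule  PACKAGE = MyModule

BOOT:
{
    MY_CXT_INIT;              /* step 4: called exactly once, in BOOT: */
    MY_CXT.counter = 0;
}

int
next_id()
    PREINIT:
        dMY_CXT;              /* step 6: declare access in each function */
    CODE:
        RETVAL = ++MY_CXT.counter;   /* step 5: access via MY_CXT.member */
    OUTPUT:
        RETVAL
```

With this in place, each Perl interpreter in a threaded build gets its own copy of C<my_cxt_t>, so the extension state is thread-safe without true file-level statics.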
Algorithm/TrunkClassifier/src/feature_selection.c view on Meta::CPAN
*
* Description
* This library implements the independent t-test
*/
#include "feature_selection.h"
/*
Description: Feature selection method that uses the independent t-test to select a feature
Parameters: (1) 2D expression data matrix, (2) number of matrix rows, (3) number of matrix columns, (4) list of column names,
(5) first group symbol and (6) second group symbol
Return value: Index of the top t-value feature (row in data matrix)
*/
int indTTest(double** expData, int numFeatures, int numSamples, char** sampleNames, char* NORMAL, char* MALIGN){
//Determine class sizes
int sample;
int numNormal = 0;
int numMalign = 0;
for(sample = 0; sample < numSamples; sample++){
if(strcmp(sampleNames[sample], NORMAL) == 0)
lib/Algorithm/TrunkClassifier/Util.pm view on Meta::CPAN
package Algorithm::TrunkClassifier::Util;
use warnings;
use strict;
our $VERSION = 'v1.0.1';
#Description: Sorts two arrays in ascending order based on values in the first
#Parameters: (1) Numerical array reference, (2) second array reference
#Return value: None
sub dataSort($ $){
my ($numArrayRef, $secondArrayRef) = @_;
my $limiter = 1;
for(my $outer = 0; $outer < scalar(@{$numArrayRef}); $outer++){
for(my $inner = 0; $inner < scalar(@{$numArrayRef}) - $limiter; $inner++){
if(${$numArrayRef}[$inner] > ${$numArrayRef}[$inner+1]){
my $buffer = ${$numArrayRef}[$inner];
${$numArrayRef}[$inner] = ${$numArrayRef}[$inner+1];
${$numArrayRef}[$inner+1] = $buffer;
$buffer = ${$secondArrayRef}[$inner];
${$secondArrayRef}[$inner] = ${$secondArrayRef}[$inner+1];
${$secondArrayRef}[$inner+1] = $buffer;
}
}
$limiter++;
}
}
return 1;
pod/TrunkClassifier.pod view on Meta::CPAN
=head1 DESCRIPTION
This module contains the implementation of the Decision Trunk Classifier. The algorithm
can be used to perform binary classification on numeric data, e.g. the result of a
gene expression profiling experiment. Classification is based on so-called decision
trunks, which consist of a sequence of decision levels, represented as nodes in the
trunk. For each decision level, a probe is selected from the input data, and two decision
thresholds are calculated. These thresholds are associated with two outgoing edges from
the decision level. One edge represents the first class and the other edge represents
the second class.
During classification, the decision levels of a trunk are considered one at a time. To
classify a sample, its expression of the probe at the decision level is compared to the
thresholds of outgoing edges. If the expression is less than the first threshold,
class1 is assigned to the sample. If, on the other hand, the expression is greater than
the second threshold, class2 is assigned to the sample. If the expression falls
between the thresholds, the algorithm proceeds to the next decision level of the
trunk.
By default, classification is done by leave-one-out cross validation (LOOCV), meaning
that a single sample is used as test set, while the remaining samples are used to build
the classifier. This is done for every sample in the input dataset. See the algorithm
publication for more details. A PubMed link can be found in L</"SEE ALSO">.
=head2 ARGUMENTS
pod/TrunkClassifier.pod view on Meta::CPAN
=over 4
=item C<-p value>
By default, the algorithm trains and classifies one dataset using leave-one-out cross
validation. Two other classification procedures are supported: split-sample and dual
datasets. The split-sample procedure causes the input dataset to be split into two
sets, a training set and a test set. Trunks are built using the training set and
classification is done only on the test set. It is also possible to supply two datasets.
The input data file (final command line argument) is then used as training set and the
second dataset as test set. Thus the value for the -p option should be C<loocv>, for
leave-one-out cross validation, C<split> for split-sample procedure, or C<dual> when
using two datasets.
=item C<-e value>
The percentage of samples in the input data file that should be used as test set when
using C<-p split>. Must be from 1 to 99. Default is 20.
=item C<-t value>
pod/TrunkClassifier.pod view on Meta::CPAN
samples: prints the number of samples in each class for the classification variable
probes: prints the number of probes in the dataset
classes: prints all classification variables in the meta data
=item C<-s value>
Name of a supplementary file containing class information for the samples in the
dataset. The contents should be in table format with columns being tab-separated. The
first row needs to contain column names and the first column should contain sample
names. The second and subsequent columns can contain class information, with the name
of the classification variable given as the column name, followed by class labels
on the rows starting with sample names. Examples of classification variables are STAGE,
GRADE and HISTOLOGY. Class labels could be EARLY and LATE for STAGE, or LOW and HIGH
for GRADE. The format of the file is illustrated here.
Samples ClassVar1 ClassVar2
sample1 classLabel1 classLabel3
sample2 classLabel1 classLabel4
sample3 classLabel2 classLabel3
sample4 classLabel1 classLabel4
pod/TrunkClassifier.pod view on Meta::CPAN
can be found in F<performance.txt>. In case of leave-one-out cross validation, the
accuracy for each fold is reported along with the average accuracy across all folds.
Since the test set consists of one sample, the accuracy of one LOOCV fold is either
0 % (wrong) or 100 % (correct). For split-sample and dual datasets classification, only
the average accuracy is reported since there is only one test set.
The F<loo_trunks.txt> file contains the decision trunks resulting from leave-one-out
training on the training set. Since the training set is different in each fold,
different probes may be selected in the trunks. The decision levels of a trunk are shown
in order starting with the first level at the top. Each level consists of two rows:
the first row shows the name of the probe and the second row contains the decision
thresholds and the associated class labels. An illustration of a decision trunk with
three levels is shown here.
Probe X
<= A (class1) > B (class2)
Probe Y
<= C (class1) > D (class2)
Probe Z
pod/TrunkClassifier.pod view on Meta::CPAN
C<perl run_classifier.pl -v -o test_set_tissue -s test_supp.txt test_data.txt>
Since a supplementary file is given, a new data file with class information will be
written. Following this, the algorithm will build decision trunks and determine how many
decision levels to use for classification. Finally, LOOCV will be performed using the
selected trunks and output written. If no classification variable is explicitly given,
the algorithm will default to TISSUE. For the random dataset, this variable states if the
sample comes from healthy tissue or from a tumor. The supplementary file labels healthy
samples as T_HEALTHY and tumor samples as T_MALIGN. By looking in the supplementary file
it can also be seen that the random dataset comes with a second classification variable:
GRADE. This variable states whether a tumor sample comes from a low- or high-grade tumor.
This is indicated by G_LOW and G_HIGH. Since the healthy samples do not come from tumors,
they do not have GRADE classes. To indicate this, #NA is used. The #NA symbol is
interpreted by the algorithm as a null class, causing the sample to be excluded if GRADE
is given as the classification variable. To test this, use the following command.
C<perl run_classifier.pl -v -c GRADE -o test_set_stage -s test_supp.txt test_data.txt>
By comparing the output files, differences can be seen in how many folds of LOOCV have
been carried out, and in which probes were selected for the decision trunks. The log