Algorithm-TrunkClassifier


lib/Algorithm/TrunkClassifier/Classification.pm


use Algorithm::TrunkClassifier::DataWrapper;
use Algorithm::TrunkClassifier::FeatureSelection;
use Algorithm::TrunkClassifier::DecisionTrunk;
use Algorithm::TrunkClassifier::Util;
use POSIX;

our $VERSION = "v1.0.1";

#Description: Function responsible for building decision trunks and classifying test samples using LOOCV
#Parameters: (1) Package, (2) input dataset, (3) test dataset, (4) classification procedure, (5) split percent,
#            (6) testset data file name, (7) classification variable name, (8) output folder name,
#            (9) number of levels, (10) verbose flag, (11) input data file name, (12) useall flag
#Return value: None
sub trainAndClassify($ $ $ $ $ $ $ $ $ $ $ $ $){
	shift(@_);
	my ($dataWrapper, $testset, $CLASSIFY, $SPLITPERCENT, $TESTFILE, $CLASSNAME, $OUTPUT, $LEVELS, $VERBOSE, $DATAFILE, $USEALL) = @_;
	
	#Create output files
	if(!-e $OUTPUT && $OUTPUT ne "."){
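		system("mkdir", $OUTPUT);	#List form bypasses the shell, so spaces or metacharacters in the folder name are passed through safely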

lib/Algorithm/TrunkClassifier/Classification.pm

	}
	if($CLASSIFY ne "split"){
		$SPLITPERCENT = "NA";
	}
	my $name1 = $dataWrapper->getClassOneName();
	my $name2 = $dataWrapper->getClassTwoName();
	my $log = "Trunk classifier log\n";
	$log .= "Input data file: $DATAFILE\n";
	$log .= "Testset data file: $TESTFILE\n";
	$log .= "Procedure: $CLASSIFY\n";
	$log .= "Split percent: $SPLITPERCENT\n";
	$log .= "Number of levels: $numTrunkLevels[0]\n";
	$log .= "Classification variable: $CLASSNAME\n";
	$log .= "Training set classes:\n";
	if($CLASSIFY eq "loocv"){
		$log .= "\tClass one size: " . $dataWrapper->getClassSize($name1) . " ($name1)\n";
		$log .= "\tClass two size: " . $dataWrapper->getClassSize($name2) . " ($name2)\n";
	}
	else{
		$log .= "\tClass one size: " . $trainingSet->getClassSize($name1) . " ($name1)\n";
		$log .= "\tClass two size: " . $trainingSet->getClassSize($name2) . " ($name2)\n";

lib/Algorithm/TrunkClassifier/DataWrapper.pm

	my $newWrapper = Algorithm::TrunkClassifier::DataWrapper->new();
	$newWrapper->{"colnames"} = \@colnames;
	$newWrapper->{"rownames"} = \@rownames;
	$newWrapper->{"data_matrix"} = \@matrixCol;
	$newWrapper->{"class_vector"} = \@classVector;
	$newWrapper->{"class_one"} = $self->{"class_one"};
	$newWrapper->{"class_two"} = $self->{"class_two"};
	return $newWrapper;
}

#Description: Removes a percentage of samples from a TrunkClassifier::DataWrapper object
#Parameters: (1) TrunkClassifier::DataWrapper object, (2) split percent
#Return value: TrunkClassifier::DataWrapper object containing the removed samples
sub splitSamples($ $){
	my ($self, $split) = @_;
	my $totNumSamples = $self->getNumSamples();
	my $testSetSize = floor(($split / 100) * $totNumSamples);
	my @colnames;
	my @rownames = $self->getProbeList();
	my @classVector;
	my @matrix;
	for(my $row = 0; $row < $self->getNumProbes(); $row++){

pod/TrunkClassifier.pod

datasets. The split-sample procedure causes the input dataset to be split into two
sets, a training set and a test set. Trunks are built using the training set and
classification is done only on the test set. It is also possible to supply two datasets.
The input data file (final command line argument) is then used as training set and the
second dataset as test set. Thus the value for the C<-p> option should be C<loocv>, for
leave-one-out cross validation, C<split> for split-sample procedure, or C<dual> when
using two datasets.

=item C<-e value>

The percentage of samples in the input data file that should be used as test set when
using C<-p split>. Must be from 1 to 99. Default is 20.

=item C<-t value>

The name of the testset data file when using the C<-p dual> option.

=item C<-c value>

The value should be the name of the classification variable to use. Default is TISSUE.
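
As a rough illustration of how the C<-e> percentage translates into a test set size, the sketch below mirrors the C<floor(($split / 100) * $totNumSamples)> computation used in C<splitSamples>. The helper name C<test_set_size> and its range check are assumptions made for this example and are not part of the module.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Hypothetical helper (not part of Algorithm::TrunkClassifier): returns the
# number of samples moved to the test set for a given split percentage.
sub test_set_size {
    my ($split, $num_samples) = @_;
    # The documentation requires the percentage to be from 1 to 99.
    die "Split percent must be from 1 to 99\n"
        if $split !~ /^\d+$/ || $split < 1 || $split > 99;
    # Same arithmetic as splitSamples: round down to a whole number of samples.
    return floor(($split / 100) * $num_samples);
}

# Default of 20 percent applied to a dataset with 200 samples.
print test_set_size(20, 200), "\n";
```

Because the result is rounded down, a small dataset combined with a low percentage can yield an empty test set under this arithmetic.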

pod/TrunkClassifier.pod


The F<cts_trunks.txt> file contains decision trunks built using the complete training set.

The classification of each sample can be found in the F<class_report.txt> file. The rows
in this file start with a sample name, followed by "in X-class". X is the level in the
decision trunk where the sample was classified, and class is the class label assigned to
the sample.
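
A minimal sketch of reading such rows follows. The exact layout of F<class_report.txt> is not shown here, so the pattern assumes that the sample name contains no whitespace, that it is separated from "in" by whitespace, and that X is an integer level. The function name C<parse_report_row> and the example row are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical parser: splits a row of the assumed form
# "<sample> in <level>-<class>" into its three parts.
sub parse_report_row {
    my ($row) = @_;
    if ($row =~ /^(\S+)\s+in\s+(\d+)-(\S+)\s*$/) {
        return { sample => $1, level => $2, class => $3 };
    }
    return;    # row does not match the assumed layout
}

# Made-up example row, not taken from a real report.
my $parsed = parse_report_row("sample_17 in 2-tumor");
print "$parsed->{sample}: level $parsed->{level}, class $parsed->{class}\n";
```

If real rows use a different separator or class labels containing spaces, the regular expression would need to be adjusted accordingly.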

The F<log.txt> file gives a summary of the classifier run. The information given includes
the name of the input data file, the name of the test data file (if any), the name of
the classification procedure, the split-sample percentage (if any), number of decision
levels used for classification, the name of the classification variable, the sizes of
class one and class two in the training and test sets, and the version of the algorithm.

In case the C<-u> option is used, the output files will contain the results from using
decision trunks with 1, 2, 3, 4 and 5 levels.

=head2 EXAMPLE

To provide an easy way of testing the algorithm, the t/ folder contains two test files.
The F<test_data.txt> file contains a random dataset with 200 samples and 1000 probes. This set


