Bio-ToolBox


scripts/manipulate_datasets.pl

		$Data->metadata( $index, 'converted', 'Z-score' );

		# done
		push @datasets_modified, $name;
	}

	# report results
	if (@datasets_modified) {
		printf " %s were converted to Z-scores\n", join( ", ", @datasets_modified );
	}
	return scalar(@datasets_modified);
}

sub sort_function {

	# This will sort the entire data table by the values in one dataset

	# Request dataset
	my @indices;
	if (@_) {

		# from another subroutine
		@indices = @_;
	}
	else {
		@indices = _request_indices(
			" Enter column index number (or column ranges for mean) to sort by  ");
	}
	unless ( scalar(@indices) ) {
		print " WARNING: no index provided; nothing done\n";
		return 0;
	}

	# Ask the sort direction
	my $direction;
	if ($opt_direction) {

		# direction was specified on the command line
		$direction = $opt_direction;
	}
	else {
		# otherwise ask the user for the direction
		my $p = ' Sort by (i)ncreasing or (d)ecreasing order?:  ';
		$direction = prompt($p);
		unless ( $direction =~ /^[id]$/i ) {
			print " WARNING: unknown order; nothing done\n";
			return 0;
		}
	}

	# sort
	if ( scalar(@indices) == 1 ) {

		# excellent! only one column index to sort by
		$Data->sort_data( $indices[0], $direction );
	}
	else {
		# need to sort by the mean of provided column indices
		# we will generate a temporary column of the mean
		# first need to set the target of mean which is needed by combine function
		my $original = $opt_target;    # keep a backup just in case
		$opt_target = 'mean';
		combine_function(@indices);
		my $i = $Data->last_column;
		$opt_target = $original;       # restore backup just in case
		$Data->sort_data( $i, $direction );
		$Data->delete_column($i);      # delete the temporary column
	}

	# remove any pre-existing sorted metadata since no longer valid
	for ( my $i = 0; $i < $Data->number_columns; $i++ ) {
		$Data->delete_metadata( $i, 'sorted' );
	}

	# annotate metadata, but only if there was one index
	if ( scalar(@indices) == 1 ) {
		# AUTO is an internal flag to not accept metadata
		my $order = lc($direction) eq 'i' ? 'increasing' : 'decreasing';
		$Data->metadata( $indices[0], 'sorted', $order )
			unless $Data->metadata( $indices[0], 'AUTO' );
	}

	return 1;
}

sub genomic_sort_function {

	# This will sort the entire data table by chromosome and start position

	my $result = $Data->gsort_data;
	unless ($result) {
		print " WARNING: Data table not sorted\n";
		return 0;
	}

	# remove any pre-existing sorted metadata since no longer valid
	for ( my $i = 0; $i < $Data->number_columns; $i++ ) {
		$Data->delete_metadata( $i, 'sorted' );
	}

	# annotate metadata
	my $chr_i   = $Data->chromo_column;
	my $start_i = $Data->start_column;
	$Data->metadata( $chr_i, 'sorted', 'genomic' )
		unless $Data->metadata( $chr_i, 'AUTO' );
	$Data->metadata( $start_i, 'sorted', 'genomic' )
		unless $Data->metadata( $start_i, 'AUTO' );

	print " Data table is sorted by genomic order\n";
	return 1;
}

sub toss_nulls_function {

	# Toss out datapoints (lines) that have a non-value in the specified dataset

	# generate the list of datasets to check
	my @order = _request_indices(
		" Enter one or more column index numbers to check for non-values\n   ");

scripts/manipulate_datasets.pl

=item --dir [i | d]

Specify the direction of a sort: 
  
  - (i)ncreasing
  - (d)ecreasing
  
=item --name E<lt>stringE<gt>

Specify a new column name when re-naming a column using the rename function 
from the command line. Also, when generating a new column using a defined 
function C<--func [function]> from the command line, the new column will use 
this name.

=item --log 

Indicate whether the data is (not) in log2 space. This is required to ensure 
accurate mathematical calculations in some manipulations. This is not necessary 
when the log status is appropriately recorded in the dataset metadata.

=back
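
As an illustration of how C<--dir> applies when more than one column is
given to a sort, the following is a minimal standalone sketch and not the
program's own code. The helper name C<sort_rows> and the sample rows are
hypothetical, and all selected values are assumed to be numeric. Rows are
ordered by the mean of the selected columns, in (i)ncreasing or
(d)ecreasing order.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: sort rows (array refs) by one column, or by the
# mean of several columns, in (i)ncreasing or (d)ecreasing order.
sub sort_rows {
	my ( $rows, $direction, @indices ) = @_;
	die "unknown order '$direction'\n" unless $direction =~ /^[id]$/i;

	# compute the sort key (mean of the selected columns) once per row
	my @keyed = map {
		my $row = $_;
		[ sum( @{$row}[@indices] ) / scalar(@indices), $row ]
	} @{$rows};

	# pick the comparison according to the requested direction
	my $cmp =
		lc($direction) eq 'i'
		? sub { $_[0] <=> $_[1] }
		: sub { $_[1] <=> $_[0] };

	my @sorted = sort { $cmp->( $a->[0], $b->[0] ) } @keyed;
	return [ map { $_->[1] } @sorted ];
}

# sample data: name, value 1, value 2
my $rows = [ [ 'geneA', 4, 8 ], [ 'geneB', 1, 3 ], [ 'geneC', 9, 5 ] ];

# decreasing order by the mean of columns 1 and 2
my $by_mean = sort_rows( $rows, 'd', 1, 2 );
print join( "\t", @{$_} ), "\n" for @{$by_mean};
```

With a single index the mean is simply that column's value, so the same
helper covers both cases.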

=head2 General options

=over 4

=item --gz 

Indicate whether the output file should be gzip compressed. The compression 
status of the input file will be preserved if overwriting.

=item --bgz

Specify whether the output file should be compressed with block gzip 
(bgzip) for tabix compatibility.

=item --version

Print the version number.

=item --help

Display the POD documentation using perldoc. 

=back

=head1 DESCRIPTION

This program allows some common mathematical and other manipulations on one
or more columns in a datafile. It is designed as a simple, convenient
command line replacement for common manipulations performed in a
full-featured spreadsheet program, e.g. Excel, particularly with datasets
too large to be loaded. The program is designed to be operated primarily
as an interactive program, allowing for multiple manipulations to be
performed. Alternatively, single manipulations may be performed as
specified using command line options. As such, the program can be called
in shell scripts.

The program keeps track of the number of manipulations performed, and if 
any are performed, will write out to file the changed data. Unless an 
output file name is provided, it will overwrite the input file (NO backup is
made!).

=head1 FUNCTIONS

This is a list of the functions available for manipulating columns. These may 
be selected interactively from the main menu (note the case sensitivity!), 
or specified on the command line using the C<--func> option.

=over 4

=item B<stat> (menu option B<t>)

Print some basic statistics for a column, including mean, 
median, standard deviation, min, and max. If 0 values are present,
indicate whether to include them (y or n).

=item B<lengthstat> (menu option B<k>)

Print basic statistics on interval lengths represented by the
data table, which must include coordinate information.

=item B<reorder> (menu option B<R>)

The columns may be rewritten in a different order. The new order 
is requested as a string of index numbers in the desired order. 
Also, a column may be deleted by skipping its number or duplicated
by including it twice.

=item B<delete> (menu option B<D>)

One or more columns may be selected for deletion.

=item B<rename> (menu option B<n>)

Assign a new name to a column. For automatic execution, use the C<--name> 
option to specify the new name. Also, for any automatically executed 
function (using the C<--func> option) that generates a new column, the 
column's new name may be explicitly defined with this option.

=item B<number> (menu option B<b>)

Assign a number to each datapoint (or line), incrementing from 1 
to the end. The numbered column will be inserted after the requested 
column index.

=item B<concatenate> (menu option B<C>)

Concatenate the values from two or more columns into a single new 
column. The character used to join the values may be specified 
interactively or by the command line option C<--target> (default is '_' 
in automatic execution mode). The new column is appended at the end.

=item B<split> (menu option B<T>)

Split a column into two or more new columns using a specified character 
as the delimiter. The character may be specified interactively or 
with the C<--target> command line option (default is '_' in automatic 
execution mode). The new columns are appended at the end. If the 
number of split items is not equal among the rows, missing values 
are filled in with null values.
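
The B<concatenate> and B<split> behavior described above can be sketched
in plain Perl. This is an illustrative sketch only, not the program's
implementation. The helper names C<concatenate_columns> and
C<split_column> are hypothetical, and the null marker is passed in as a
parameter (C<.> is used here as an assumption).

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: join the values of the selected columns into one
# new value per row, using the given joining character.
sub concatenate_columns {
	my ( $rows, $joiner, @indices ) = @_;
	return [ map { join( $joiner, @{$_}[@indices] ) } @{$rows} ];
}

# Hypothetical helper: split one value per row on a literal delimiter and
# pad short rows so that every row yields the same number of new columns.
sub split_column {
	my ( $values, $delimiter, $null ) = @_;
	my @parts = map { [ split /\Q$delimiter\E/, $_ ] } @{$values};
	my $width = max( map { scalar @{$_} } @parts );
	for my $p (@parts) {
		# append null markers to rows with fewer split items
		push @{$p}, ($null) x ( $width - scalar @{$p} );
	}
	return \@parts;
}

my $rows   = [ [ 'chr1', 100, 200 ], [ 'chr2', 50, 75 ] ];
my $joined = concatenate_columns( $rows, '_', 0, 1, 2 );
print "$_\n" for @{$joined};

my $split = split_column( [ 'a_b_c', 'd_e' ], '_', '.' );
print join( "\t", @{$_} ), "\n" for @{$split};
```

Quoting the delimiter with C<\Q...\E> keeps characters such as C<.> or
C<|> from being interpreted as regular expression metacharacters.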


