Bio-ToolBox
scripts/manipulate_datasets.pl view on Meta::CPAN
        $Data->metadata( $index, 'converted', 'Z-score' );

        # done
        push @datasets_modified, $name;
    }

    # report results
    if (@datasets_modified) {
        printf " %s were converted to Z-scores\n", join( ", ", @datasets_modified );
    }
    return scalar(@datasets_modified);
}
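The Z-score conversion reported above can be illustrated in plain Perl. This is a minimal sketch, not the script's own implementation: `zscores` is a hypothetical helper, and it assumes the population standard deviation (dividing by N) and that non-numeric entries such as `.` are passed through unchanged.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Convert a list of values to Z-scores: (value - mean) / standard deviation.
# Assumption: population standard deviation; non-numeric entries are kept as-is.
sub zscores {
    my @values  = @_;
    my @numeric = grep { looks_like_number($_) } @values;
    return @values unless @numeric;

    my $mean = 0;
    $mean += $_ for @numeric;
    $mean /= scalar @numeric;

    my $sumsq = 0;
    $sumsq += ( $_ - $mean )**2 for @numeric;
    my $sd = sqrt( $sumsq / scalar @numeric );
    return @values if $sd == 0;    # all values identical; nothing to scale

    return map { looks_like_number($_) ? ( $_ - $mean ) / $sd : $_ } @values;
}

# Usage: the '.' entry is left untouched
my @z = zscores( 2, 4, '.', 6 );
print join( "\t", map { looks_like_number($_) ? sprintf( '%.3f', $_ ) : $_ } @z ), "\n";
```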
sub sort_function {

    # This will sort the entire data table by the values in one dataset,
    # or by the mean of several datasets

    # Request dataset
    my @indices;
    if (@_) {
        # from another subroutine
        @indices = @_;
    }
    else {
        @indices = _request_indices(
            " Enter column index number (or column ranges for mean) to sort by ");
    }
    unless ( scalar(@indices) ) {
        print " WARNING: no index provided. nothing done\n";
        return 0;
    }

    # Ask the sort direction
    my $direction;
    if ($opt_direction) {
        # direction was specified on the command line
        $direction = $opt_direction;
    }
    else {
        # otherwise ask the user for the direction
        my $p = ' Sort by (i)ncreasing or (d)ecreasing order?: ';
        $direction = prompt($p);
        unless ( $direction =~ /^[id]$/i ) {
            print " WARNING: unknown order; nothing done\n";
            return 0;
        }
    }

    # sort
    if ( scalar(@indices) == 1 ) {
        # only one column index to sort by
        $Data->sort_data( $indices[0], $direction );
    }
    else {
        # need to sort by the mean of provided column indices
        # we will generate a temporary column of the mean
        # first need to set the target of mean which is needed by combine function
        my $original = $opt_target;    # keep a backup just in case
        $opt_target = 'mean';
        combine_function(@indices);
        my $i = $Data->last_column;
        $opt_target = $original;       # restore backup just in case
        $Data->sort_data( $i, $direction );
        $Data->delete_column($i);      # delete the temporary column
    }

    # remove any pre-existing sorted metadata since no longer valid
    for ( my $i = 0; $i < $Data->number_columns; $i++ ) {
        $Data->delete_metadata( $i, 'sorted' );
    }

    # annotate metadata, but only if there was one index
    if ( scalar(@indices) == 1 ) {
        my $order = lc $direction eq 'i' ? 'increasing' : 'decreasing';

        # AUTO is an internal flag to not accept metadata
        $Data->metadata( $indices[0], 'sorted', $order )
            unless $Data->metadata( $indices[0], 'AUTO' );
    }
    return 1;
}
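The multi-column branch above follows a simple pattern: add a temporary column of means, sort on it, then delete it. The sketch below shows the same pattern on a plain array of rows. It makes no use of the `$Data` object, and `sort_by_mean` and the sample rows are hypothetical names for illustration only.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical stand-in for the data table: an array of row arrays.
my @rows = (
    [ 'geneA', 5, 1 ],
    [ 'geneB', 2, 2 ],
    [ 'geneC', 9, 7 ],
);

# Sort rows by the mean of the given columns; $direction is 'i' or 'd'.
sub sort_by_mean {
    my ( $rows, $direction, @indices ) = @_;

    # 1. append a temporary column holding the mean of the chosen columns
    for my $row (@$rows) {
        push @$row, sum( @{$row}[@indices] ) / scalar @indices;
    }

    # 2. sort on that temporary (last) column
    my $sign   = lc $direction eq 'i' ? 1 : -1;
    my @sorted = sort { $sign * ( $a->[-1] <=> $b->[-1] ) } @$rows;

    # 3. delete the temporary column again
    pop @$_ for @sorted;
    return @sorted;
}

# Usage: decreasing order by the mean of columns 1 and 2
print join( ' ', map { $_->[0] } sort_by_mean( \@rows, 'd', 1, 2 ) ), "\n";
```

Because the temporary column is removed before returning, callers see rows of the original width, which mirrors the `delete_column` call in the subroutine above.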
sub genomic_sort_function {

    # This will sort the entire data table by chromosome and start position
    my $result = $Data->gsort_data;
    unless ($result) {
        print " WARNING: Data table not sorted\n";
        return 0;
    }

    # remove any pre-existing sorted metadata since no longer valid
    for ( my $i = 0; $i < $Data->number_columns; $i++ ) {
        $Data->delete_metadata( $i, 'sorted' );
    }

    # annotate metadata
    my $chr_i   = $Data->chromo_column;
    my $start_i = $Data->start_column;
    $Data->metadata( $chr_i, 'sorted', 'genomic' )
        unless $Data->metadata( $chr_i, 'AUTO' );
    $Data->metadata( $start_i, 'sorted', 'genomic' )
        unless $Data->metadata( $start_i, 'AUTO' );
    print " Data table is sorted by genomic order\n";
    return 1;
}
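For readers unfamiliar with genomic ordering, the following sketch sorts rows by chromosome and then by start position. It is an illustration under stated assumptions, not the logic of `gsort_data`: `chromo_key` and `genomic_sort` are hypothetical helpers, and the rule that numeric chromosome names precede named ones (X, Y, M) is an assumed convention.

```perl
use strict;
use warnings;

# Hypothetical rows of [chromosome, start, value].
my @rows = (
    [ 'chr10', 500, 0.3 ],
    [ 'chr2',  900, 1.2 ],
    [ 'chrX',  100, 0.7 ],
    [ 'chr2',  150, 0.9 ],
);

# Build a sort key for a chromosome name: numeric names sort numerically
# and come first; all other names sort alphabetically afterwards.
sub chromo_key {
    my $name = shift;
    ( my $bare = $name ) =~ s/^chr//i;
    return $bare =~ /^\d+$/ ? ( 0, $bare, '' ) : ( 1, 0, $bare );
}

# Schwartzian transform: compute each key once, sort, then unwrap the rows.
sub genomic_sort {
    my @input = @_;
    return map { $_->[1] }
        sort {
               $a->[0][0] <=> $b->[0][0]
            || $a->[0][1] <=> $b->[0][1]
            || $a->[0][2] cmp $b->[0][2]
            || $a->[1][1] <=> $b->[1][1]    # start position breaks ties
        }
        map { [ [ chromo_key( $_->[0] ) ], $_ ] } @input;
}

print join( ' ', map { "$_->[0]:$_->[1]" } genomic_sort(@rows) ), "\n";
```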
sub toss_nulls_function {

    # Toss out datapoints (lines) that have a non-value in the specified dataset

    # generate the list of datasets to check
    my @order = _request_indices(
        " Enter one or more column index numbers to check for non-values\n ");
scripts/manipulate_datasets.pl view on Meta::CPAN
=item --dir [i | d]
Specify the direction of a sort:
- (i)ncreasing
- (d)ecreasing
=item --name E<lt>stringE<gt>
Specify a new column name when re-naming a column using the rename function
from the command line. Also, when generating a new column using a defined
function C<--func [function]> from the command line, the new column will use
this name.
=item --log
Indicate whether the data is (not) in log2 space. This is required to ensure
accurate mathematical calculations in some manipulations. This is not necessary
when the log status is appropriately recorded in the dataset metadata.
=back
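To illustrate why the log status can matter, consider taking a mean of values stored in log2 space. The sketch below is one plausible approach and an assumption on our part, not a description of what the program does internally; `mean_of_log2` is a hypothetical helper. It converts to linear space, averages, and converts back, whereas averaging the log2 values directly would give the log2 of the geometric mean instead.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Mean of values stored in log2 space: de-log, average, then re-log.
sub mean_of_log2 {
    my @log2_values = @_;
    my @linear      = map { 2**$_ } @log2_values;
    return log( sum(@linear) / scalar @linear ) / log(2);
}

# log2 values 1 and 3 are 2 and 8 in linear space; their mean is 5
printf "%.3f\n", mean_of_log2( 1, 3 );
```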
=head2 General options
=over 4
=item --gz
Indicate whether the output file should be gzip compressed. The compression
status of the input file will be preserved if overwriting.
=item --bgz
Specify whether the output file should be compressed with block gzip
(bgzip) for tabix compatibility.
=item --version
Print the version number.
=item --help
Display the POD documentation using perldoc.
=back
=head1 DESCRIPTION
This program performs common mathematical and other manipulations on one
or more columns in a data file. It is designed as a simple, convenient
command line replacement for manipulations commonly performed in a
full-featured spreadsheet program, e.g. Excel, particularly with datasets
too large to be loaded into one. The program is designed
to be operated primarily as an interactive program, allowing for multiple
manipulations to be performed. Alternatively, single manipulations may be
performed as specified using command line options. As such, the program can
be called in shell scripts.
The program keeps track of the number of manipulations performed and, if
any were performed, writes the changed data to file. Unless an
output file name is provided, it will overwrite the input file (NO backup is
made!).
=head1 FUNCTIONS
This is a list of the functions available for manipulating columns. These may
be selected interactively from the main menu (note the case sensitivity!),
or specified on the command line using the C<--func> option.
=over 4
=item B<stat> (menu option B<t>)
Print some basic statistics for a column, including mean,
median, standard deviation, min, and max. If 0 values are present,
indicate whether to include them (y or n).
=item B<lengthstat> (menu option B<k>)
Print basic statistics on interval lengths represented by the
data table, which must include coordinate information.
=item B<reorder> (menu option B<R>)
The columns may be rewritten in a different order. The new order
is requested as a string of index numbers in the desired order.
Also, a column may be deleted by skipping its number or duplicated
by including it twice.
=item B<delete> (menu option B<D>)
One or more columns may be selected for deletion.
=item B<rename> (menu option B<n>)
Assign a new name to a column. For automatic execution, use the C<--name>
option to specify the new name. Also, for any automatically executed
function (using the C<--func> option) that generates a new column, the
column's new name may be explicitly defined with this option.
=item B<number> (menu option B<b>)
Assign a number to each datapoint (or line), incrementing from 1
to the end. The numbered column will be inserted after the requested
column index.
=item B<concatenate> (menu option B<C>)
Concatenate the values from two or more columns into a single new
column. The character used to join the values may be specified
interactively or by the command line option C<--target> (default is '_'
in automatic execution mode). The new column is appended at the end.
=item B<split> (menu option B<T>)
Split a column into two or more new columns using a specified character
as the delimiter. The character may be specified interactively or
with the C<--target> command line option (default is '_' in automatic
execution mode). The new columns are appended at the end. If the
number of split items is not equal among the rows, the missing values
are filled in with null values.
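The concatenate and split behaviors described above can be sketched on plain Perl arrays. This is a minimal illustration, not the script's own code: `concatenate_columns` and `split_column` are hypothetical helpers, and the `.` null placeholder is an assumption made for the example.

```perl
use strict;
use warnings;

my $null = '.';    # assumed null placeholder, for illustration only

# Join the values of the chosen columns of one row with a delimiter.
sub concatenate_columns {
    my ( $row, $delimiter, @indices ) = @_;
    return join $delimiter, @{$row}[@indices];
}

# Split one value per row on a delimiter, padding short rows with nulls
# so that every row yields the same number of new columns.
sub split_column {
    my ( $delimiter, @values ) = @_;
    my @pieces = map { [ split /\Q$delimiter\E/, $_ ] } @values;

    # find the widest split result
    my $width = 0;
    for my $p (@pieces) {
        $width = scalar @$p if scalar @$p > $width;
    }

    # pad the shorter rows
    for my $p (@pieces) {
        push @$p, ($null) x ( $width - scalar @$p );
    }
    return @pieces;
}

# Usage
print concatenate_columns( [ 'chr1', 100, 200 ], '_', 0, 1 ), "\n";
print join( "\t", @$_ ), "\n" for split_column( '_', 'a_b_c', 'd_e' );
```

The delimiter is quoted with `\Q...\E` so that characters such as `.` or `|` are treated literally rather than as regular expression syntax.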