Bio-ToolBox
lib/Bio/ToolBox/Data.pm view on Meta::CPAN
if ( defined $d ) {
# we have what appears to be a dataset column
$possibles{$d} ||= [];
push @{ $possibles{$d} }, $i;
}
else {
# still an unknown possibility
push @{ $possibles{unknown} }, $i;
}
}
}
# check datasets
unless (@datasets) {
if ( scalar( keys %possibles ) > 1 ) {
# we will always have the unknown category, so anything more than one
# means we found legitimate dataset columns
delete $possibles{unknown};
}
@datasets = sort { $a cmp $b } keys %possibles;
}
# check starts
if ( scalar @startcolumns != scalar @datasets ) {
@startcolumns = (); # ignore what we were given?
foreach my $d (@datasets) {
# take the first column with this dataset
push @startcolumns, $possibles{$d}->[0];
}
}
# check stops
if ( scalar @endcolumns != scalar @datasets ) {
@endcolumns = (); # ignore what we were given?
foreach my $d (@datasets) {
# take the last column with this dataset
push @endcolumns, $possibles{$d}->[-1];
}
}
# Prepare Data object to store the summed data
my $summed_data = $self->new(
feature => 'averaged_windows',
columns => [ 'Window', 'Midpoint' ],
);
# Go through each dataset
foreach my $d ( 0 .. $#datasets ) {
# Prepare score column name
my $data_name = simplify_dataset_name( $datasets[$d] );
# add column
my $i = $summed_data->add_column($data_name);
$summed_data->metadata( $i, 'dataset', $datasets[$d] );
# tag for remembering we're working with percentile bins
my $do_percentile = 0;
# remember the row
my $row = 1;
# Collect summarized data
for my $column ( $startcolumns[$d] .. $endcolumns[$d] ) {
# determine the midpoint position of the window
# this assumes the column metadata has start and stop
my $midpoint = int(
sum0(
$self->metadata( $column, 'start' ),
$self->metadata( $column, 'stop' )
) / 2
);
# convert midpoint to fraction of 1000 for plotting if necessary
if ( substr( $self->name($column), -1 ) eq '%' ) {
$midpoint *= 10; # midpoint * 0.01 * 1000 bp
$do_percentile++;
}
if ( $do_percentile and substr( $self->name($column), -2 ) eq 'bp' ) {
# working on the extension after the percentile bins
$midpoint += 1000;
}
# collect the values in the column
my @values;
for my $row ( 1 .. $self->last_row ) {
my $v = $self->value( $row, $column );
if ( looks_like_number($v) ) {
push @values, $v;
}
else {
# we treat this as zero, as opposed to skipping it, so that we
# do not over-emphasize the remaining signal from those columns
# that do not have much signal to begin with
# it distorts the interpretation
push @values, 0;
}
}
# adjust if log value
my $log = $self->metadata( $column, 'log2' ) || 0;
if ($log) {
@values = map { 2**$_ } @values;
}
# determine the combined window value using the requested method
my $window_value;
my $num_values = scalar(@values);
if (@values) {
if ( $args{method} eq 'mean' ) {
$window_value = sum0(@values) / $num_values;
}
elsif ( $args{method} eq 'trimmean' ) {
if ( scalar @values == 1 ) {
$window_value = $values[0];
}
elsif ( scalar @values < 100 ) {
# use standard mean
$window_value = sum0(@values) / $num_values;
}
else {
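@values = sort { $a <=> $b } @values;
my $x = sprintf( "%.0f", $num_values / 100 );
# discard $x values from each end; the last index kept is
# $num_values - $x - 1 so that the slice length matches the divisor
$window_value = sum0( @values[ $x .. ( $num_values - $x - 1 ) ] ) /
( $num_values - ( 2 * $x ) );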
}
}
elsif ( $args{method} eq 'median' ) {
# the median is only meaningful on sorted values
@values = sort { $a <=> $b } @values;
if ( scalar @values == 1 ) {
$window_value = $values[0];
}
elsif ( $num_values & 1 ) {
# odd number of values
$window_value = $values[ ( $#values / 2 ) ];
}
else {
my $mid = $num_values / 2;
$window_value = sum0( $values[ $mid - 1 ], $values[$mid] ) / 2;
lib/Bio/ToolBox/Data.pm view on Meta::CPAN
first 10 lines. Returns two strings: the first is a generic flavor, and the
second is a more specific format, if applicable. Generic flavor values will
be one of C<gff>, C<bed>, C<ucsc>, or C<undefined>. These correlate to specific
Parser adapters. Specific formats could be any number of possibilities, for
example C<undefined>, C<gtf>, C<gff3>, C<narrowPeak>, C<genePred>, etc.
=item save
=item write_file
    my $success = $Data->save;
    my $success = $Data->save('my_file.txt');
    my $success = $Data->save(filename => $file, gz => 1);
    print "file $success was saved!\n";
Pass the file name to be written. If no file name is passed, then
the filename and path stored in the metadata are used, if present.
These methods write the Data structure out to a file, after first
verifying that it is properly structured. Opened BED and GFF files
are checked to see if their structure is maintained. If so, they
are written in the same format; if not, they are written as regular
tab-delimited text files.
You may pass additional options.
=over 4
=item filename
Optionally pass a new filename. Required for new objects; previously
opened files may be overwritten if a new name is not provided. If
necessary, the file extension may be changed; for example, BED files
that no longer match the defined format lose the F<.bed> and gain a F<.txt>
extension. Compression may add or strip F<.gz> as appropriate. If
a path is not provided, the current working directory is used.
=item gz
Value to change the compression status of the output file. If
overwriting an input file, the default is to maintain the compression
status; otherwise the default is no compression. Pass a 0 for no compression, 1 for standard
gzip compression, or 2 for block gzip (bgzip) compression for tabix
compatibility.
=back
If the save is successful, the method returns the full path and
name of the saved file, complete with any changes to the file extension.
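The extension rules described above can be sketched as follows. This is only an illustration, not the module's actual code: the helper C<adjust_extension> and its arguments are hypothetical, and it assumes that both gzip and bgzip output carry a F<.gz> extension.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::ToolBox: shows how a file name
# could be adjusted when the format or compression status changes.
#   $name   - requested file name
#   $is_bed - true if the data still matches the BED format
#   $gz     - 0 for none, 1 for gzip, 2 for bgzip (assumed to share .gz)
sub adjust_extension {
    my ( $name, $is_bed, $gz ) = @_;

    # strip any existing compression extension first
    $name =~ s/\.gz\z//i;

    # a BED file that no longer matches the format becomes plain text
    $name =~ s/\.bed\z/.txt/i unless $is_bed;

    # re-add the compression extension only when compression is requested
    $name .= '.gz' if $gz;
    return $name;
}

print adjust_extension( 'peaks.bed.gz', 0, 0 ), "\n";
print adjust_extension( 'peaks.bed',    1, 2 ), "\n";
```

In real use the returned value of C<save> should be consulted for the final name, since the module itself decides which extension applies.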
=item summary_file
Write a separate file summarizing columns of data (mean values).
The mean value of each column becomes a row value, and each column
header becomes a row identifier (i.e. the table is transposed). The
best use of this is to summarize the mean profile of windowed data
collected across a feature. See the L<Bio::ToolBox> scripts
L<get_relative_data.pl> and L<get_binned_data.pl> as examples.
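The transposition can be illustrated with plain Perl arrays. The sketch below is not the module's implementation: C<summarize_columns> is a hypothetical helper, the column names are made up, and non-numeric cells are counted as zero, as the source comments for this method describe.

```perl
use strict;
use warnings;
use List::Util qw(sum0);
use Scalar::Util qw(looks_like_number);

# Hypothetical stand-alone illustration using plain arrays rather than a
# Bio::ToolBox::Data object. Each column of the input becomes one row of
# the output: [ column name, mean of the column ].
sub summarize_columns {
    my ( $header, $rows ) = @_;
    my @summary;
    for my $col ( 0 .. $#{$header} ) {

        # non-numeric cells are treated as zero rather than skipped
        my @values =
            map { looks_like_number( $_->[$col] ) ? $_->[$col] : 0 } @{$rows};
        my $mean = @values ? sum0(@values) / scalar(@values) : 0;
        push @summary, [ $header->[$col], $mean ];
    }
    return \@summary;
}

my $summary = summarize_columns(
    [ 'win1', 'win2' ],           # made-up column names
    [ [ 1, 4 ], [ 3, '.' ] ],     # two rows; '.' is a null value
);
printf "%s\t%s\n", @{$_} for @{$summary};
```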
For data from L<get_binned_data.pl> where the columns are expressed
as percentile bins, the reported midpoint column is automatically
converted based on a length of 1000 bp.
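A minimal sketch of that conversion, modeled on the logic in the source above, is shown here. The helper C<window_midpoints> and the window names are hypothetical; it assumes that percentile columns have names ending in C<%> and that extension columns following them have names ending in C<bp>.

```perl
use strict;
use warnings;

# Hypothetical stand-alone illustration of the midpoint conversion.
# Each window is [ name, start, stop ].
sub window_midpoints {
    my @windows         = @_;
    my $seen_percentile = 0;
    my @midpoints;
    for my $w (@windows) {
        my ( $name, $start, $stop ) = @{$w};
        my $mid = int( ( $start + $stop ) / 2 );
        if ( $name =~ /%\z/ ) {

            # percent of a nominal 1000 bp feature: mid * 0.01 * 1000
            $mid *= 10;
            $seen_percentile = 1;
        }
        elsif ( $seen_percentile and $name =~ /bp\z/ ) {

            # extension window placed after the end of the feature
            $mid += 1000;
        }
        push @midpoints, $mid;
    }
    return @midpoints;
}

my @mids = window_midpoints(
    [ 'data:0-10%',   0,  10 ],     # made-up window names
    [ 'data:90-100%', 90, 100 ],
    [ 'data:0-100bp', 0,  100 ],
);
print join( ',', @mids ), "\n";
```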
All of the following options are optional.
=over 4
=item filename
Pass an optional new filename. The default is to take the basename
and append "_<method>_summary" to it.
=item startcolumn
=item stopcolumn
Provide the starting and ending columns to summarize. The default
start is the leftmost column without a recognized standard name.
The default ending column is the rightmost column.
=item dataset
Pass a string that is the name of the dataset. This could be collected
from the metadata, if present. This will become the name of the score
column if defined.
=item method
Pass the name of the method to combine the values. Methods include
C<mean> (default if not specified), C<median>, or C<trimmean>, where
the top and bottom 1% of the sorted values are discarded and a mean
of the remaining 98% of the values is used. If fewer than 100 values
are available, no trimming is done and a straight mean value is
determined.
=back
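The three methods can be sketched in a few lines of stand-alone Perl. This is an illustration under stated assumptions rather than the module's code: C<combine_values> is a hypothetical helper, and an unrecognized method name simply falls through to the plain mean.

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# Hypothetical helper, not the module's code: combine a list of numbers
# with one of the documented methods (mean, median, trimmean).
sub combine_values {
    my ( $method, @values ) = @_;
    return unless @values;
    my $n      = scalar @values;
    my @sorted = sort { $a <=> $b } @values;

    if ( $method eq 'median' ) {
        my $mid = int( $n / 2 );
        return $n % 2
            ? $sorted[$mid]
            : ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
    }
    if ( $method eq 'trimmean' and $n >= 100 ) {

        # drop 1% of the values (rounded) from each end of the sorted list
        my $x    = sprintf( "%.0f", $n / 100 );
        my @kept = @sorted[ $x .. $n - $x - 1 ];
        return sum0(@kept) / scalar(@kept);
    }

    # 'mean', or 'trimmean' with fewer than 100 values
    return sum0(@values) / $n;
}

my @data = ( 1 .. 99, 1000 );    # 100 values with a single outlier
printf "mean=%.2f median=%.2f trimmean=%.2f\n",
    map { combine_values( $_, @data ) } qw(mean median trimmean);
```

With a single large outlier, the median and trimmed mean stay near the center of the data while the plain mean is pulled upward, which is the usual reason to prefer them for noisy windows.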
The name of the summarized column is either the provided dataset name,
the defined basename in the metadata of the Data structure, or a generic
name. If successful, it will return the name of the file saved.
=back
=head2 Verifying Datasets
When working with row Features and collecting scores, the dataset
from which you are collecting must be verified prior to collection.
This ensures that the proper database adaptor is available and loaded,
and that the dataset is correctly specified (otherwise nothing would be
collected). This verification is normally performed transparently when
you call L<get_score|Bio::ToolBox::Data::Feature/get_score> or
L<get_position_scores|Bio::ToolBox::Data::Feature/get_position_scores>.
However, datasets may be explicitly verified prior to calling the score
methods.
=over 4
=item verify_dataset
    my $dataset = $Data->verify_dataset($dataset, $database);