- Added a new position option when adjusting coordinates of retrieved
features using the script get_features.pl. Coordinates may be adjusted
at the 5 prime end, the 3 prime end, or both ends of stranded features.
This also fixes bugs where collected features on the reverse strand
with adjusted coordinates were not reported properly.
- Improved automatic recognition of the name, score, and other columns
in the converter scripts data2bed.pl, data2gff.pl, and data2wig.pl.
- Improved the Cluster and Treeview export function in script
manipulate_datasets.pl. The generated CDT files now include separate ID
and NAME columns per the specification, and new manipulations,
including percentile rank and log2, are available prior to exporting.
- The convert null function in script manipulate_datasets.pl now also
converts zero values if requested.
- Added a new minimum size option when trimming windows in the script
find_enriched_regions.pl.
- Increased the radius from 35 bp to 50 bp when verifying a putative
mapped nucleosome in script map_nucleosomes.pl, leading to fewer
overlapping or offset nucleosomes.
- Added new option to re-center offset nucleosomes in script
verify_nucleosome_mapping.pl. Also improved report formatting.
- Added checks and warnings when writing file names longer than 256
characters.
lib/Bio/ToolBox/Data.pm view on Meta::CPAN
# Go through each dataset
foreach my $d ( 0 .. $#datasets ) {

    # Prepare score column name
    my $data_name = simplify_dataset_name( $datasets[$d] );

    # add column
    my $i = $summed_data->add_column($data_name);
    $summed_data->metadata( $i, 'dataset', $datasets[$d] );

    # tag for remembering we're working with percentile bins
    my $do_percentile = 0;

    # remember the row
    my $row = 1;

    # Collect summarized data
    for my $column ( $startcolumns[$d] .. $endcolumns[$d] ) {

        # determine the midpoint position of the window
        # this assumes the column metadata has start and stop
        my $midpoint = int(
            sum0(
                $self->metadata( $column, 'start' ),
                $self->metadata( $column, 'stop' )
            ) / 2
        );

        # convert midpoint to fraction of 1000 for plotting if necessary
        if ( substr( $self->name($column), -1 ) eq '%' ) {
            $midpoint *= 10;    # midpoint * 0.01 * 1000 bp
            $do_percentile++;
        }
        if ( $do_percentile and substr( $self->name($column), -2 ) eq 'bp' ) {

            # working on the extension after the percentile bins
            $midpoint += 1000;
        }

        # collect the values in the column
        my @values;
        for my $row ( 1 .. $self->last_row ) {
            my $v = $self->value( $row, $column );
            if ( looks_like_number($v) ) {
                push @values, $v;
            }
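            # For illustration (hypothetical values, not part of Data.pm): a
            # percentile bin with metadata start => 30 and stop => 40 and a
            # name ending in '%' has a midpoint of int( sum0(30, 40) / 2 ) == 35,
            # which is rescaled above to 350 out of the nominal 1000 bp
            # plot length.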
lib/Bio/ToolBox/Data.pm view on Meta::CPAN
=item summary_file
Write a separate file summarizing columns of data (mean values).
The mean value of each column becomes a row value, and each column
header becomes a row identifier (i.e. the table is transposed). The
best use of this is to summarize the mean profile of windowed data
collected across a feature. See the Bio::ToolBox scripts
L<get_relative_data.pl> and L<get_binned_data.pl> as examples.
For data from L<get_binned_data.pl> where the columns are expressed
as percentile bins, the reported midpoint column is automatically
converted based on a length of 1000 bp.
You may pass the following options; all are optional.
=over 4
=item filename
Pass an optional new filename. The default is to take the basename
and append "_<method>_summary" to it.
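A minimal usage sketch, assuming a previously written data file and
that the method returns the name of the summary file written (the
file names here are hypothetical):

  use Bio::ToolBox::Data;
  my $Data    = Bio::ToolBox::Data->new( file => 'binned_scores.txt' );
  my $written = $Data->summary_file( filename => 'binned_scores_summary.txt' );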
scripts/data2wig.pl view on Meta::CPAN
if ( $attribute_name and $Input->gff ) {

    # a GFF attribute
    return sub {
        my $row     = shift;
        my $attribs = $row->gff_attributes;
        my $score   = $attribs->{$attribute_name} || 0;
        return if $score eq '.';

        # format as necessary
        $score =~ s/\%$//;    # strip trailing percent sign if present
        return $score;
    };
}
elsif ( $attribute_name and $Input->vcf and defined $score_index ) {

    # a VCF attribute from one sample
    return sub {
        my $row     = shift;
        my $attribs = $row->vcf_attributes;
        my $score   = $attribs->{$score_index}{$attribute_name} || 0;
        return 0 if $score eq '.';

        # format as necessary
        $score =~ s/\%$//;    # strip trailing percent sign if present
        return $score;
    };
}
elsif ( $attribute_name and $Input->vcf and @score_indices ) {

    # a VCF attribute from many samples
    return sub {
        my $row     = shift;
        my $attribs = $row->vcf_attributes;
        my @scores;
        foreach (@score_indices) {
            my $s = $attribs->{$_}{$attribute_name} || 0;
            $s =~ s/\%$//;    # strip trailing percent sign if present
            if ( looks_like_number($s) ) {
                push @scores, $s;
            }
        }
        return &{$method_sub}(@scores);
    };
}
elsif ( @score_indices and $fast ) {

    # collect over multiple score columns from array reference
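    # For illustration (not from data2wig.pl): each branch returns a closure so
    # that the input-type checks above run once, not once per row. A minimal
    # standalone version of the same pattern:
    #
    #   use Scalar::Util qw(looks_like_number);
    #
    #   my $make_extractor = sub {
    #       my ($key) = @_;
    #       return sub {
    #           my ($attribs) = @_;
    #           my $score = $attribs->{$key} // 0;
    #           return 0 if $score eq '.';
    #           $score =~ s/%$//;    # strip trailing percent sign
    #           return looks_like_number($score) ? $score : 0;
    #       };
    #   };
    #
    #   my $get_score = $make_extractor->('coverage');
    #   print $get_score->( { coverage => '85%' } );    # prints 85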
scripts/get_binned_data.pl view on Meta::CPAN
    my $length = $row->length;    # subfeatures not allowed here, so use feature length

    # collect the scores to the bins in the region
    for my $column ( $startcolumn .. ( $Data->last_column ) ) {

        # we will step through each data column, representing each window (bin)
        # across the feature's region
        # any scores within this window will be collected and the mean
        # value reported

        # convert the window start and stop coordinates (as percentages) to
        # actual bp
        # this depends on whether the binsize is explicitly defined in bp or
        # is a fraction of the feature length
        my ( $start, $stop );
        if ( $Data->metadata( $column, 'bin_size' ) =~ /bp$/ ) {

            # the bin size is explicitly defined
            # the start and stop points are relative to either the feature
            # start (always 0) or the end (the feature length), depending
scripts/get_binned_data.pl view on Meta::CPAN
        # across the feature's region
        # any scores within this window will be collected and the mean
        # value reported

        # record nulls if no data returned
        unless ( scalar keys %{$regionscores} ) {
            $row->value( $column, calculate_score( $method, undef ) );
            next;
        }

        # convert the window start and stop coordinates (as percentages) to
        # actual bp
        # this depends on whether the binsize is explicitly defined in bp or
        # is a fraction of the feature length
        my ( $start, $stop );
        if ( $Data->metadata( $column, 'bin_size' ) =~ /bp$/ ) {

            # the bin size is explicitly defined
            # the start and stop points are relative to either the feature
            # start (always 0) or the end (the feature length), depending
scripts/get_binned_data.pl view on Meta::CPAN
            $col++;
        }
    }
}
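# For illustration (hypothetical numbers, not code from the script): in the
# fractional case, a window spanning 20 to 30 percent of a 2000 bp feature
# would presumably work out along the lines of
#   my $start = int( 2000 * 20 / 100 );    # 400 bp
#   my $stop  = int( 2000 * 30 / 100 );    # 600 bp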
### Prepare all of the bin columns and their metadata
sub prepare_bins {
    my ( $binsize, $dataset ) = @_;

    # the size of the bin in percentage units, default would be 10%
    # each bin will be titled the starting and ending point for that bin in
    # percentage units
    # for example, -20..-10,-10..0,0..10,10..20
    # if $extension is defined, then it will add the appropriate flanking bins,
    # otherwise it should skip them

    # bin(s) on 5' flank
    if ($extension) {
        if ($extension_size) {
scripts/get_binned_data.pl view on Meta::CPAN
=back
=head2 Bin specification
=over 4
=item --bins E<lt>integerE<gt>
Specify the number of bins that will be generated over the length
of the feature. The size of each bin is a percentage of the
feature length. The default number is 10, which results in bins of
size equal to 10% of the feature length (see the sketch following
this list).
=item --ext E<lt>integerE<gt>
Specify the number of extended bins on either side of the feature.
The bins are of the same size as determined by the feature
length and the --bins value. The default is 0.
=item --extsize E<lt>integerE<gt>
Specify the exact bin size in bp of the extended bins rather than
using a percentage of the feature length.
=item --min E<lt>integerE<gt>
Specify the minimum feature size to be averaged. Features with a
length below this value will be skipped (all bins will be assigned
null values). This is to avoid having bin sizes below the average
microarray tiling distance. The default is undefined (no limit).
=back
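As a worked sketch of how these options interact (the numbers are
hypothetical, and the code is illustrative rather than taken from
get_binned_data.pl): a 2500 bp feature with the default of 10 bins
gives 250 bp bins, and two extended bins on each side reuse that size
unless an exact size is given with --extsize.

  my $feature_length = 2500;
  my $bins           = 10;                          # --bins
  my $binsize        = $feature_length / $bins;     # 250 bp, i.e. 10%
  my $ext            = 2;                           # --ext
  my @flank5_starts  = map { -$_ * $binsize } reverse 1 .. $ext;    # -500, -250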
scripts/get_binned_data.pl view on Meta::CPAN
=item --help
This help text.
=back
=head1 DESCRIPTION
This program will collect data across a gene or feature body into numerous
percentile bins. It is used to determine if there is a spatial
distribution preference for the dataset over gene bodies. The number
of bins may be specified as a command argument (default 10). Additionally,
extra bins may be extended on either side of the gene (default 0 on either
side). The bin size is determined as a percentage of gene length.
=head1 EXAMPLES
These are some examples of some common scenarios for collecting data.
=over 4
=item Collect scores in intervals
You want to collect the mean score from a bigWig file in 10% intervals
scripts/manipulate_datasets.pl view on Meta::CPAN
        push @datasets_modified, $Data->name($index);
    }

    # report results
    if (@datasets_modified) {
        printf " %s were median scaled to $target\n", join( ", ", @datasets_modified );
    }
    return scalar(@datasets_modified);
}
sub percentile_rank_function {

    # this subroutine will convert a dataset into a percentile rank

    # request datasets
    my @indices;
    if (@_) {

        # provided from an internal subroutine
        @indices = @_;
    }
    else {

        # otherwise request from user
        @indices = _request_indices(
            " Enter one or more column index numbers to convert to percentile rank "
        );
    }
    unless (@indices) {
        print " WARNING: unknown index number(s). nothing done\n";
        return 0;
    }

    # Where to put new values?
    my $placement = _request_placement();

    # Process each index request
    my @datasets_modified;    # a list of which datasets were modified
    foreach my $index (@indices) {

        # Calculate percent rank of values
        my @cv = $Data->column_values($index);
        shift @cv;    # skip header
        my @values = grep { looks_like_number($_) } @cv;
        unless (@values) {
            printf " WARNING: no numeric values in dataset %d, %s! Skipping\n",
                $index, $Data->name($index);
            next;
        }
        my $total = scalar @values;
        my %percentrank;
        my $n = 1;
        foreach ( sort { $a <=> $b } @values ) {

            # the percent rank is the 1-based rank in the sorted order
            # divided by the total count of values
            $percentrank{$_} = $n / $total;
            $n++;
        }

        # Replace the contents with the calculated percent rank
        $index = _prepare_new_destination( $index, '_pr' ) if $placement =~ /^n/i;
        $Data->iterate(
            sub {
                my $row = shift;
                my $v   = $row->value($index);
                return unless looks_like_number($v);    # return, not next, inside a sub
                $row->value( $index, $percentrank{$v} );
            }
        );

        # update metadata
        $Data->metadata( $index, 'converted', 'percent_rank' );

        # done
        push @datasets_modified, $Data->name($index);
    }

    # report results
    if (@datasets_modified) {
        printf " %s were converted to percent rank\n", join( ", ", @datasets_modified );
    }
    return scalar(@datasets_modified);
}
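# Illustration (not part of the script): for the values (8, 3, 5, 3), the
# mapping above yields 3 => 0.50, 5 => 0.75, 8 => 1.00. Tied values share the
# highest rank, since later assignments overwrite earlier ones in %percentrank.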
sub zscore_function {

    # this subroutine will generate a z-score for each value in a dataset

    # identify the datasets to convert
    my @indices;
scripts/manipulate_datasets.pl view on Meta::CPAN
    }
    else {

        # ask the user
        print <<LIST;
 Available dataset manipulations
   su  - decreasing sort by sum of row values
   sm  - decreasing sort by mean of row values
   cg  - median center features (genes)
   cd  - median center datasets
   zd  - convert dataset to Z-scores
   pd  - convert dataset to percentile rank
   L2  - convert dataset to log2
   L10 - convert dataset to log10
   n0  - convert null values to 0
LIST
        my $p      = 'Enter the manipulation(s) in order of desired execution: ';
        my $answer = prompt($p);
        @manipulations = split /[,\s]+/, $answer;
    }
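    # For example, an answer of "cd L2" (or "cd,L2") queues median centering
    # of the datasets followed by log2 conversion, since the reply is split
    # on commas and/or whitespace.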
### First, delete extraneous datasets or columns
scripts/manipulate_datasets.pl view on Meta::CPAN
        subtract_function(@datasets);
    }
    elsif (/^zd$/i) {

        # Z-score convert dataset
        print " converting datasets to Z-scores....\n";
        zscore_function(@datasets);
    }
    elsif (/^pd$/i) {

        # convert dataset to percentile rank
        print " converting datasets to percentile ranks....\n";
        percentile_rank_function(@datasets);
    }
    elsif (/^l2$/i) {

        # convert dataset to log2 values
        print " converting datasets to log2 values....\n";
        log_function( 2, @datasets );
    }
    elsif (/^l10$/i) {

        # convert dataset to log10 values
scripts/manipulate_datasets.pl view on Meta::CPAN
    'addname'  => \&addname_function,
    'cnull'    => \&convert_nulls_function,
    'absolute' => \&convert_absolute_function,
    'minimum'  => \&minimum_function,
    'maximum'  => \&maximum_function,
    'add'      => \&add_function,
    'subtract' => \&subtract_function,
    'multiply' => \&multiply_function,
    'divide'   => \&divide_function,
    'scale'    => \&median_scale_function,
    'pr'       => \&percentile_rank_function,
    'zscore'   => \&zscore_function,
    'log'      => \&log_function,
    'log2'     => \&log_function,      # holdover from previous
    'delog'    => \&delog_function,
    'delog2'   => \&delog_function,
    'format'   => \&format_function,
    'combine'  => \&combine_function,
    'ratio'    => \&ratio_function,
    'diff'     => \&difference_function,
    'normdiff' => \&normalized_difference_function,
scripts/manipulate_datasets.pl view on Meta::CPAN
A column may be median scaled as a means of normalization
with other columns. The current median of the requested column is
presented, and a new median target is requested. The column may
either be replaced with the median-scaled values or added as a new
column. For automatic execution, specify the new median target
with the --target option.
=item B<pr> (menu option B<p>)
A column may be converted to a percentile rank, whereby the
data values are sorted in ascending order and each is assigned a
new value from 0..1 based on its rank in the sorted order. The
column may either be replaced with the percentile ranks or added
as a new column. The original order of the column is maintained.
=item B<zscore> (menu option B<Z>)
Generate a Z-score or standard score for each value in a column. The
Z-score is the number of standard deviations the value is away from
the column's mean, such that the new mean is 0 and the standard
deviation is 1. Provides a simple method of normalizing columns
with disparate dynamic ranges.
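As a minimal sketch of the transform (the numbers are illustrative,
and the population standard deviation is used here for brevity; the
script may use the sample form):

  use List::Util qw(sum);

  my @values = ( 2, 4, 4, 4, 5, 5, 7, 9 );
  my $mean   = sum(@values) / @values;                                       # 5
  my $sd     = sqrt( sum( map { ( $_ - $mean )**2 } @values ) / @values );   # 2
  my @z      = map { ( $_ - $mean ) / $sd } @values;    # first value => -1.5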
scripts/manipulate_datasets.pl view on Meta::CPAN
--index <name>,<start-stop>). Extraneous columns are removed.
Additional manipulations on the columns may be performed prior to
exporting. These may be chosen interactively or using the codes
listed below and specified using the --target option.
su - decreasing sort by sum of row values
sm - decreasing sort by mean of row values
cg - median center features (rows)
cd - median center datasets (columns)
zd - convert columns to Z-scores
pd - convert columns to percentile ranks
L2 - convert values to log2
L10 - convert values to log10
n0 - convert nulls to 0.0
A simple Cluster data text file is written (default file name
"<basename>.cdt"), but without the GWEIGHT column or EWEIGHT row. The
original file will not be rewritten.
=item B<rewrite> (menu option B<W>)