Bio-ToolBox


CHANGES  view on Meta::CPAN

	- Added a new position option when adjusting the coordinates of
	retrieved features with the script get_features.pl. Coordinates may be
	adjusted at the 5' end, the 3' end, or both ends of stranded features.
	This also fixes bugs where features collected on the reverse strand
	with adjusted coordinates were not reported properly.
	- Improved automatic recognition of the name, score, and other columns
	in the converter scripts data2bed.pl, data2gff.pl, and data2wig.pl.
	- Improved the Cluster and Treeview export function in script
	manipulate_datasets.pl. The CDT files generated now include separate ID
	and NAME columns per the specification, and new manipulations,
	including percentile rank and log2 conversion, may be applied prior to
	exporting.
	- The convert null function in script manipulate_datasets.pl now also
	converts zero values when requested.
	- Added a new minimum-size option when trimming windows in the script
	find_enriched_regions.pl.
	- Increased the radius from 35 bp to 50 bp when verifying a putative
	mapped nucleosome in script map_nucleosomes.pl, leading to fewer
	overlapping or offset nucleosomes.
	- Added new option to re-center offset nucleosomes in script
	verify_nucleosome_mapping.pl. Also improved report formatting.
	- Added checks and warnings when writing file names longer than 256
	characters.

lib/Bio/ToolBox/Data.pm  view on Meta::CPAN

	# Go through each dataset
	foreach my $d ( 0 .. $#datasets ) {

		# Prepare score column name
		my $data_name = simplify_dataset_name( $datasets[$d] );

		# add column
		my $i = $summed_data->add_column($data_name);
		$summed_data->metadata( $i, 'dataset', $datasets[$d] );

		# tag for remembering we're working with percentile bins
		my $do_percentile = 0;

		# remember the row
		my $row = 1;

		# Collect summarized data
		for my $column ( $startcolumns[$d] .. $endcolumns[$d] ) {

			# determine the midpoint position of the window
			# this assumes the column metadata has start and stop
			my $midpoint = int(
				sum0(
					$self->metadata( $column, 'start' ),
					$self->metadata( $column, 'stop' )
				) / 2
			);

			# convert midpoint to fraction of 1000 for plotting if necessary
			if ( substr( $self->name($column), -1 ) eq '%' ) {
				$midpoint *= 10;    # midpoint * 0.01 * 1000 bp
				$do_percentile++;
			}
			if ( $do_percentile and substr( $self->name($column), -2 ) eq 'bp' ) {

				# working on the extension after the percentile bins
				$midpoint += 1000;
			}

			# collect the values in the column
			my @values;
			for my $row ( 1 .. $self->last_row ) {
				my $v = $self->value( $row, $column );
				if ( looks_like_number($v) ) {
					push @values, $v;
				}

lib/Bio/ToolBox/Data.pm  view on Meta::CPAN


=item summary_file

Write a separate file summarizing columns of data (mean values). 
The mean value of each column becomes a row value, and each column 
header becomes a row identifier (i.e. the table is transposed). The 
best use of this is to summarize the mean profile of windowed data 
collected across a feature. See the Bio::ToolBox scripts 
L<get_relative_data.pl> and L<get_binned_data.pl> as examples. 
For data from L<get_binned_data.pl> where the columns are expressed 
as percentile bins, the reported midpoint column is automatically 
converted based on a length of 1000 bp.
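
A minimal usage sketch (the file names here are illustrative, not part
of the distribution):

    use Bio::ToolBox::Data;

    # load a table of windowed scores collected across features,
    # e.g. output from get_binned_data.pl
    my $Data = Bio::ToolBox::Data->new( file => 'gene_profile.txt' );

    # write the transposed summary file of column means
    my $success = $Data->summary_file( filename => 'gene_profile_summary.txt' );
    print " wrote summary file\n" if $success;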

The following options may be passed; all are optional.

=over 4

=item filename

Pass an optional new filename. The default is to take the basename 
and append "_<method>_summary" to it.

scripts/data2wig.pl  view on Meta::CPAN

	if ( $attribute_name and $Input->gff ) {

		# a GFF attribute
		return sub {
			my $row     = shift;
			my $attribs = $row->gff_attributes;
			my $score   = $attribs->{$attribute_name} || 0;
			return if $score eq '.';

			# format as necessary
			$score =~ s/\%$//;    # strip a trailing percent sign if present
			return $score;
		};
	}
	elsif ( $attribute_name and $Input->vcf and defined $score_index ) {

		# a VCF attribute from one sample
		return sub {
			my $row     = shift;
			my $attribs = $row->vcf_attributes;
			my $score   = $attribs->{$score_index}{$attribute_name} || 0;
			return 0 if $score eq '.';

			# format as necessary
			$score =~ s/\%$//;    # strip a trailing percent sign if present
			return $score;
		};
	}
	elsif ( $attribute_name and $Input->vcf and @score_indices ) {

		# a VCF attribute from many samples
		return sub {
			my $row     = shift;
			my $attribs = $row->vcf_attributes;
			my @scores;
			foreach (@score_indices) {
				my $s = $attribs->{$_}{$attribute_name} || 0;
				$s =~ s/\%$//;    # strip a trailing percent sign if present
				if ( looks_like_number($s) ) {
					push @scores, $s;
				}
			}
			return &{$method_sub}(@scores);
		};
	}
	elsif ( @score_indices and $fast ) {

		# collect over multiple score columns from array reference

scripts/get_binned_data.pl  view on Meta::CPAN

		my $length = $row->length;   # subfeatures not allowed here, so use feature length

		# collect the scores to the bins in the region
		for my $column ( $startcolumn .. ( $Data->last_column ) ) {

			# we will step through each data column, representing each window (bin)
			# across the feature's region
			# any scores within this window will be collected and the mean
			# value reported

			# convert the window start and stop coordinates (as percentages) to
			# actual bp
			# this depends on whether the binsize is explicitly defined in bp or
			# is a fraction of the feature length
			my ( $start, $stop );
			if ( $Data->metadata( $column, 'bin_size' ) =~ /bp$/ ) {

				# the bin size is explicitly defined

				# the start and stop points are relative to either the feature
				# start (always 0) or the end (the feature length), depending

scripts/get_binned_data.pl  view on Meta::CPAN

		# across the feature's region
		# any scores within this window will be collected and the mean
		# value reported

		# record nulls if no data returned
		unless ( scalar keys %{$regionscores} ) {
			$row->value( $column, calculate_score( $method, undef ) );
			next;
		}

		# convert the window start and stop coordinates (as percentages) to
		# actual bp
		# this depends on whether the binsize is explicitly defined in bp or
		# is a fraction of the feature length
		my ( $start, $stop );
		if ( $Data->metadata( $column, 'bin_size' ) =~ /bp$/ ) {

			# the bin size is explicitly defined

			# the start and stop points are relative to either the feature
			# start (always 0) or the end (the feature length), depending

scripts/get_binned_data.pl  view on Meta::CPAN

			$col++;
		}
	}
}

### Prepare all of the bin columns and their metadata
sub prepare_bins {

	my ( $binsize, $dataset ) = @_;

	# the size of the bin in percentage units, default would be 10%
	# each bin will be titled the starting and ending point for that bin in
	# percentage units
	# for example, -20..-10,-10..0,0..10,10..20

	# if $extension is defined, then it will add the appropriate flanking bins,
	# otherwise it should skip them

	# bin(s) on 5' flank
	if ($extension) {

		if ($extension_size) {

scripts/get_binned_data.pl  view on Meta::CPAN


=back

=head2 Bin specification

=over 4

=item --bins E<lt>integerE<gt>

Specify the number of bins that will be generated over the length 
of the feature. The size of each bin is a percentage of the 
feature length. The default number is 10, which results in bins of 
size equal to 10% of the feature length. See the worked example 
following this list.

=item --ext E<lt>integerE<gt>

Specify the number of extended bins on either side of the feature. 
The bins are of the same size as determined by the feature 
length and the --bins value. The default is 0. 

=item --extsize E<lt>integerE<gt>

Specify the exact bin size in bp of the extended bins rather than
a percentage of the feature length.

=item --min E<lt>integerE<gt>

Specify the minimum feature size to be averaged. Features with a
length below this value will be skipped (all bins will be assigned
null values). This is to avoid having bin sizes below the average 
microarray tiling distance. The default is undefined (no limit).

=back
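
As a worked example of the arithmetic implied by these options: with
C<--bins 10> and C<--ext 2>, a 2 kb feature is divided into ten 200 bp
bins (each 10% of the feature length) plus two flanking 200 bp bins on
each side; adding C<--extsize 500> instead fixes each flanking bin at
500 bp regardless of the feature length.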

scripts/get_binned_data.pl  view on Meta::CPAN


=item --help

This help text.

=back

=head1 DESCRIPTION

This program will collect data across a gene or feature body into numerous 
percentile bins. It is used to determine whether there is a spatial 
distribution preference for the dataset over gene bodies. The number 
of bins may be specified as a command argument (default 10). Additionally, 
extra bins may be extended on either side of the gene (default 0 on each 
side). The bin size is determined as a percentage of gene length.

=head1 EXAMPLES

These are examples of common scenarios for collecting data.

=over 4

=item Collect scores in intervals

You want to collect the mean score from a bigWig file in 10% intervals 
across each feature.
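
A plausible invocation (the --in, --data, and --out option names are
assumed from common Bio::ToolBox script conventions and may differ):

  get_binned_data.pl --in genes.txt --data scores.bw --bins 10 --out gene_bins.txt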

scripts/manipulate_datasets.pl  view on Meta::CPAN

		push @datasets_modified, $Data->name($index);
	}

	# report results
	if (@datasets_modified) {
		printf " %s were median scaled to $target\n", join( ", ", @datasets_modified );
	}
	return scalar(@datasets_modified);
}

sub percentile_rank_function {

	# this subroutine will convert a dataset into a percentile rank

	# request datasets
	my @indices;
	if (@_) {

		# provided from an internal subroutine
		@indices = @_;
	}
	else {
		# otherwise request from user
		@indices = _request_indices(
			" Enter one or more column index numbers to convert to percentile rank  "
		);
	}
	unless (@indices) {
		print " WARNING: unknown index number(s). nothing done\n";
		return 0;
	}

	# Where to put new values?
	my $placement = _request_placement();

	# Process each index request
	my @datasets_modified;    # a list of which datasets were modified
	foreach my $index (@indices) {

		# Calculate percent rank of values
		my @cv = $Data->column_values($index);
		shift @cv;            # skip header
		my @values = grep { looks_like_number($_) } @cv;
		unless (@values) {
			printf " WARNING: no numeric values in dataset %d, %s! Skipping\n",
				$index, $Data->name($index);
			next;
		}
		my $total = scalar @values;
		my %percentrank;
		my $n = 1;
		foreach ( sort { $a <=> $b } @values ) {

			# walk the values in increasing numeric order;
			# the percent rank is the rank position divided by the total count
			# (tied values end up with the highest rank among their duplicates)
			$percentrank{$_} = $n / $total;
			$n++;
		}

		# Replace the contents with the calculated percent rank
		$index = _prepare_new_destination( $index, '_pr' ) if $placement =~ /^n/i;
		$Data->iterate(
			sub {
				my $row = shift;
				my $v   = $row->value($index);
				return unless looks_like_number($v);    # skip non-numeric values
				$row->value( $index, $percentrank{$v} );
			}
		);

		# update metadata
		$Data->metadata( $index, 'converted', 'percent_rank' );

		# done
		push @datasets_modified, $Data->name($index);
	}

	# report results
	if (@datasets_modified) {
		printf " %s were converted to percent rank\n", join( ", ", @datasets_modified );
	}
	return scalar(@datasets_modified);
}

sub zscore_function {

	# this subroutine will generate a z-score for each value in a dataset

	# identify the datasets to convert
	my @indices;

scripts/manipulate_datasets.pl  view on Meta::CPAN

	}
	else {
		# ask the user
		print <<LIST;
Available dataset manipulations
  su - decreasing sort by sum of row values
  sm - decreasing sort by mean of row values
  cg - median center features (genes)
  cd - median center datasets
  zd - convert dataset to Z-scores
  pd - convert dataset to percentile rank
  L2 - convert dataset to log2
  L10 - convert dataset to log10
  n0 - convert null values to 0 
LIST
		my $p      = 'Enter the manipulation(s) in order of desired execution: ';
		my $answer = prompt($p);
		@manipulations = split /[,\s]+/, $answer;
	}

	### First, delete extraneous datasets or columns

scripts/manipulate_datasets.pl  view on Meta::CPAN

			subtract_function(@datasets);
		}
		elsif (/^zd$/i) {

			# Z-score convert dataset
			print " converting datasets to Z-scores....\n";
			zscore_function(@datasets);
		}
		elsif (/^pd$/i) {

			# convert dataset to percentile rank
			print " converting datasets to percentile ranks....\n";
			percentile_rank_function(@datasets);
		}
		elsif (/^l2$/i) {

			# convert dataset to log2 values
			print " converting datasets to log2 values....\n";
			log_function( 2, @datasets );
		}
		elsif (/^l10$/i) {

			# convert dataset to log10 values

scripts/manipulate_datasets.pl  view on Meta::CPAN

		'addname'     => \&addname_function,
		'cnull'       => \&convert_nulls_function,
		'absolute'    => \&convert_absolute_function,
		'minimum'     => \&minimum_function,
		'maximum'     => \&maximum_function,
		'add'         => \&add_function,
		'subtract'    => \&subtract_function,
		'multiply'    => \&multiply_function,
		'divide'      => \&divide_function,
		'scale'       => \&median_scale_function,
		'pr'          => \&percentile_rank_function,
		'zscore'      => \&zscore_function,
		'log'         => \&log_function,
		'log2'        => \&log_function,                     # holdover from previous versions
		'delog'       => \&delog_function,
		'delog2'      => \&delog_function,
		'format'      => \&format_function,
		'combine'     => \&combine_function,
		'ratio'       => \&ratio_function,
		'diff'        => \&difference_function,
		'normdiff'    => \&normalized_difference_function,

scripts/manipulate_datasets.pl  view on Meta::CPAN


A column may be median scaled as a means of normalization 
with other columns. The current median of the requested column is
presented, and a new median target is requested. The column may 
either be replaced with the median scaled values or added as a new 
column. For automatic execution, specify the new median target 
with the --target option.

=item B<pr> (menu option B<p>)

A column may be converted to a percentile rank, whereby the
data values are sorted in ascending order and each is assigned a new
value between 0 and 1 based on its rank in the sorted order. The column
may either be replaced with the percentile ranks or added as a new
column. The original row order of the column is maintained.
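
As a small worked example, a column containing the values 40, 5, 20,
and 10 sorts to 5, 10, 20, 40, so the values are replaced with 1.00,
0.25, 0.75, and 0.50 respectively (each rank divided by the count of
4), while the original row order is preserved.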

=item B<zscore> (menu option B<Z>)

Generate a Z-score or standard score for each value in a column. The
Z-score is the number of standard deviations the value is away from
the column's mean, such that the new mean is 0 and the standard 
deviation is 1. Provides a simple method of normalizing columns
with disparate dynamic ranges.
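
The standard score formula applied to each value x is

  Z = (x - mean) / standard_deviation

so, for example, a value of 12 in a column with mean 10 and standard
deviation 2 receives a Z-score of 1.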

scripts/manipulate_datasets.pl  view on Meta::CPAN

--index <name>,<start-stop>). Extraneous columns are removed. 
Additional manipulations on the columns may be performed prior to 
exporting. These may be chosen interactively or using the codes 
listed below and specified using the --target option.
  
  su - decreasing sort by sum of row values
  sm - decreasing sort by mean of row values
  cg - median center features (rows)
  cd - median center datasets (columns)
  zd - convert columns to Z-scores
  pd - convert columns to percentile ranks
  L2 - convert values to log2
  L10 - convert values to log10
  n0 - convert nulls to 0.0

A simple Cluster data text file is written (default file name 
"<basename>.cdt"), but without the GWEIGHT column or EWEIGHT row. The 
original file will not be rewritten.
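
A plausible non-interactive invocation (the function name and the
--func, --index, and --target option names are assumed from the
script's conventions and may differ):

  manipulate_datasets.pl --func treeview --index 4-12 --target cg,zd mydata.txt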

=item B<rewrite> (menu option B<W>)


