Bio-ToolBox


lib/Bio/ToolBox/Data.pm  view on Meta::CPAN

			if ( defined $d ) {

				# we have what appears to be a dataset column
				$possibles{$d} ||= [];
				push @{ $possibles{$d} }, $i;
			}
			else {
				# still an unknown possibility
				push @{ $possibles{unknown} }, $i;
			}
		}
	}

	# check datasets
	unless (@datasets) {
		if ( scalar( keys %possibles ) > 1 ) {

			# we will always have the unknown category, so anything more than one
			# means we found legitimate dataset columns
			delete $possibles{unknown};
		}
		@datasets = sort { $a cmp $b } keys %possibles;
	}

	# check starts
	if ( scalar @startcolumns != scalar @datasets ) {
		@startcolumns = ();    # ignore what we were given?
		foreach my $d (@datasets) {

			# take the first column with this dataset
			push @startcolumns, $possibles{$d}->[0];
		}
	}

	# check stops
	if ( scalar @endcolumns != scalar @datasets ) {
		@endcolumns = ();      # ignore what we were given?
		foreach my $d (@datasets) {

			# take the last column with this dataset
			push @endcolumns, $possibles{$d}->[-1];
		}
	}

	# Prepare Data object to store the summed data
	my $summed_data = $self->new(
		feature => 'averaged_windows',
		columns => [ 'Window', 'Midpoint' ],
	);

	# Go through each dataset
	foreach my $d ( 0 .. $#datasets ) {

		# Prepare score column name
		my $data_name = simplify_dataset_name( $datasets[$d] );

		# add column
		my $i = $summed_data->add_column($data_name);
		$summed_data->metadata( $i, 'dataset', $datasets[$d] );

		# tag for remembering we're working with percentile bins
		my $do_percentile = 0;

		# remember the row
		my $row = 1;

		# Collect summarized data
		for my $column ( $startcolumns[$d] .. $endcolumns[$d] ) {

			# determine the midpoint position of the window
			# this assumes the column metadata has start and stop
			my $midpoint = int(
				sum0(
					$self->metadata( $column, 'start' ),
					$self->metadata( $column, 'stop' )
				) / 2
			);

			# convert midpoint to fraction of 1000 for plotting if necessary
			if ( substr( $self->name($column), -1 ) eq '%' ) {
				$midpoint *= 10;    # midpoint * 0.01 * 1000 bp
				$do_percentile++;
			}
			if ( $do_percentile and substr( $self->name($column), -2 ) eq 'bp' ) {

				# working on the extension after the percentile bins
				$midpoint += 1000;
			}

			# collect the values in the column
			my @values;
			for my $row ( 1 .. $self->last_row ) {
				my $v = $self->value( $row, $column );
				if ( looks_like_number($v) ) {
					push @values, $v;
				}
				else {
					# we treat this as zero, as opposed to skipping it, so that we
					# do not over-emphasize the remaining signal from those columns
					# that do not have much signal to begin with
					# it distorts the interpretation
					push @values, 0;
				}
			}

			# adjust if log value
			my $log = $self->metadata( $column, 'log2' ) || 0;
			if ($log) {
				@values = map { 2**$_ } @values;
			}

			# determine mean value
			my $window_value;
			my $num_values = scalar(@values);
			if (@values) {
				if ( $args{method} eq 'mean' ) {
					$window_value = sum0(@values) / $num_values;
				}
				elsif ( $args{method} eq 'trimmean' ) {
					if ( scalar @values == 1 ) {
						$window_value = $values[0];
					}
					elsif ( scalar @values < 100 ) {

						# use standard mean
						$window_value = sum0(@values) / $num_values;
					}
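					else {
						@values = sort { $a <=> $b } @values;
						my $x = sprintf( "%.0f", $num_values / 100 );

						# drop $x values from each end of the sorted list; the
						# slice ends at index $num_values - $x - 1 so that the
						# number of summed values matches the divisor
						$window_value =
							sum0( @values[ $x .. ( $num_values - $x - 1 ) ] ) /
							( $num_values - ( 2 * $x ) );
					}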
				}
				elsif ( $args{method} eq 'median' ) {

					# the median requires the values in sorted order
					@values = sort { $a <=> $b } @values;
					if ( scalar @values == 1 ) {
						$window_value = $values[0];
					}
					elsif ( $num_values & 1 ) {

						# odd number of values
						$window_value = $values[ ( $#values / 2 ) ];
					}
					else {
						my $mid = $num_values / 2;
						$window_value = sum0( $values[ $mid - 1 ], $values[$mid] ) / 2;

lib/Bio/ToolBox/Data.pm  view on Meta::CPAN

first 10 lines. Returns two strings: the first is a generic flavor, and the
second is a more specific format, if applicable. Generic flavor values will
be one of C<gff>, C<bed>, C<ucsc>, or C<undefined>. These correlate to specific
Parser adapters. Specific formats could be any number of possibilities, for
example C<undefined>, C<gtf>, C<gff3>, C<narrowPeak>, C<genePred>, etc.  


=item save

=item write_file

  my $success = $Data->save;
  my $success = $Data->save('my_file.txt');
  my $success = $Data->save(filename => $file, gz => 1);
  print "file $success was saved!\n";

Pass the file name to be written. If no file name is passed, then 
the filename and path stored in the metadata are used, if present.

These methods write the Data structure out to a file. The structure 
is verified first. Opened BED and GFF files 
are checked to see if their structure is maintained. If so, they 
are written in the same format; if not, they are written as regular 
tab-delimited text files. 

You may pass additional options.

=over 4

=item filename

Optionally pass a new filename. Required for new objects; previously 
opened files may be overwritten if a new name is not provided. If 
necessary, the file extension may be changed; for example, BED files 
that no longer match the defined format lose the F<.bed> and gain a F<.txt> 
extension. Compression may add or strip F<.gz> as appropriate. If 
a path is not provided, the current working directory is used.

=item gz

Value to change the compression status of the output file. If 
overwriting an input file, the default is to maintain the compression 
status; otherwise the default is no compression. Pass a 0 for no 
compression, 1 for standard gzip compression, or 2 for block gzip 
(bgzip) compression for tabix compatibility.

=back

If the file save is successful, it will return the full path and 
name of the saved file, complete with any changes to the file extension.

=item summary_file

Write a separate file summarizing columns of data (mean values). 
The mean value of each column becomes a row value, and each column 
header becomes a row identifier (i.e. the table is transposed). The 
best use of this is to summarize the mean profile of windowed data 
collected across a feature. See the L<Bio::ToolBox> scripts 
L<get_relative_data.pl> and L<get_binned_data.pl> as examples. 
For data from L<get_binned_data.pl> where the columns are expressed 
as percentile bins, the reported midpoint column is automatically 
converted based on a length of 1000 bp.
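
As a rough illustration of that conversion, the standalone sketch below 
mirrors the rule: the midpoint of a percentile bin is multiplied by 10 so 
that 0-100% maps onto a nominal 1000 bp, and a bp extension bin that 
follows the percentile bins is shifted by 1000. The helper name 
C<plot_midpoint> and the sample column names are hypothetical and are not 
part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper (not a Bio::ToolBox method): convert the start/stop
# metadata of a window column into a plotting midpoint. $seen_percent is
# true once a percentile ('%') column has already been encountered.
sub plot_midpoint {
	my ( $name, $start, $stop, $seen_percent ) = @_;
	my $midpoint = int( ( $start + $stop ) / 2 );
	if ( substr( $name, -1 ) eq '%' ) {
		$midpoint *= 10;    # percent of feature scaled to a 1000 bp length
	}
	elsif ( $seen_percent and substr( $name, -2 ) eq 'bp' ) {
		$midpoint += 1000;    # extension bin placed after the feature
	}
	return $midpoint;
}

# Illustrative column names only; real names come from the input file
print plot_midpoint( 'data_0..10%',     0,   10,  0 ), "\n";    # 50
print plot_midpoint( 'data_100..200bp', 100, 200, 1 ), "\n";    # 1150
```

A bp-named column seen before any percentile column is left unscaled, 
which matches the behavior for ordinary windowed data.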

You may pass the following optional parameters.

=over 4

=item filename

Pass an optional new filename. The default is to take the basename 
and append "_<method>_summary" to it.

=item startcolumn

=item stopcolumn

Provide the starting and ending columns to summarize. The default 
starting column is the leftmost column without a recognized standard 
name. The default ending column is the rightmost column.

=item dataset

Pass a string that is the name of the dataset. This could be collected 
from the metadata, if present. This will become the name of the score 
column if defined.

=item method

Pass the name of the method to combine the values. Methods include 
C<mean> (default if not specified), C<median>, or C<trimmean>, where 
the top and bottom 1% of the sorted values are discarded and a mean
of the remaining 98% of the values is used. If fewer than 100 values
are available, no trimming is done and a straight mean value is 
determined.

=back
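
The three methods can be compared with a small standalone sketch. This is 
an illustration written against the description above, not the module's 
own code; the helper name C<combine_values> and the sample data are 
hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(sum0);

# Hypothetical helper: combine a list of numbers with the named method.
# Any method other than 'median' or 'trimmean' falls through to the mean.
sub combine_values {
	my ( $method, @values ) = @_;
	return unless @values;
	my $n = scalar @values;
	if ( $method eq 'median' ) {
		my @sorted = sort { $a <=> $b } @values;
		my $mid    = int( $n / 2 );
		return $n % 2
			? $sorted[$mid]
			: ( $sorted[ $mid - 1 ] + $sorted[$mid] ) / 2;
	}
	if ( $method eq 'trimmean' and $n >= 100 ) {
		my @sorted = sort { $a <=> $b } @values;
		my $trim   = sprintf( "%.0f", $n / 100 );    # 1% from each end
		my @kept   = @sorted[ $trim .. $n - $trim - 1 ];
		return sum0(@kept) / scalar(@kept);
	}

	# 'mean', or 'trimmean' with fewer than 100 values
	return sum0(@values) / $n;
}

my @values = ( 1 .. 99, 10_000 );    # 100 values with one extreme outlier
printf "mean     %.2f\n", combine_values( 'mean',     @values );    # 149.50
printf "median   %.2f\n", combine_values( 'median',   @values );    # 50.50
printf "trimmean %.2f\n", combine_values( 'trimmean', @values );    # 50.50
```

With a single large outlier, the plain mean is pulled far from the bulk of 
the values, while the median and trimmed mean stay near the center.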

The name of the summarized column is either the provided dataset name, 
the defined basename in the metadata of the Data structure, or a generic 
name. If successful, it will return the name of the file saved.

=back

=head2 Verifying Datasets

When working with row Features and collecting scores, the dataset 
from which you are collecting must be verified prior to collection. 
This ensures that the proper database adaptor is available and loaded, 
and that the dataset is correctly specified (otherwise nothing would be 
collected). This verification is normally performed transparently when 
you call L<get_score|Bio::ToolBox::Data::Feature/get_score> or 
L<get_position_scores|Bio::ToolBox::Data::Feature/get_position_scores>.
However, datasets may be explicitly verified prior to calling the score 
methods. 

=over 4

=item verify_dataset

 my $dataset = $Data->verify_dataset($dataset, $database);


