Bio-BigFile
lib/Bio/DB/BigBed.pm
called like this:
@subfeats = $feature->getSeqFeatures()
This will return a list of subfeatures corresponding to the block
structure specified in the BED line. The boundaries of the blocks are
found using the features' start() and end() methods:
for my $block (@subfeats) {
    my $start = $block->start;
    my $end   = $block->end;
    print "$start..$end\n";
}
The start and end coordinates are given in chromosome coordinates.
The BioPerl API requires that features and subfeatures have primary
tags. If the main (parent) feature is of type "mRNA" (either
determined via the blocks heuristic or requested explicitly), then
subfeatures will have primary tags "CDS", "five_prime_UTR" and/or
"three_prime_UTR". If the main feature has the type "region", then the
subfeatures will have primary tags "thickregion" and "thinregion",
based on where they are with respect to the thickStart and thickEnd
BED fields. Similarly, a feature of type "feature" will have subparts
of type "thickfeature" and "thinfeature".
Notice that the module will split blocks in two at the thickStart and
thickEnd positions.
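The splitting rule can be sketched in plain Perl. This is an illustration only, not the module's own code: split_block() and the 'thin'/'thick' labels are hypothetical names, and coordinates are assumed to be 1-based and inclusive.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::DB::BigBed: split one block
# [$start,$end] at the boundaries of the thick region
# [$thick_start,$thick_end]. All coordinates are assumed to be
# 1-based and inclusive.
sub split_block {
    my ($start, $end, $thick_start, $thick_end) = @_;
    my @parts;

    # Thin piece lying before the thick region, if any.
    if ($start < $thick_start) {
        my $e = $end < $thick_start ? $end : $thick_start - 1;
        push @parts, { kind => 'thin', start => $start, end => $e };
    }

    # Thick piece: the overlap between the block and the thick region.
    my $s = $start > $thick_start ? $start : $thick_start;
    my $e = $end   < $thick_end   ? $end   : $thick_end;
    push @parts, { kind => 'thick', start => $s, end => $e } if $s <= $e;

    # Thin piece lying after the thick region, if any.
    if ($end > $thick_end) {
        my $s2 = $start > $thick_end ? $start : $thick_end + 1;
        push @parts, { kind => 'thin', start => $s2, end => $end };
    }
    return @parts;
}

# A block that straddles thickStart is cut into a thin and a thick piece.
for my $p (split_block(100, 200, 150, 500)) {
    print "$p->{kind}: $p->{start}..$p->{end}\n";
}
```

A block that spans both boundaries comes back as three pieces (thin, thick, thin) under this sketch.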
I<bin:#count>
A type named "bin:" followed by an integer will divide each
chromosome/contig into the indicated number of summary bins, and
return one feature for each bin. For example, type "bin:100" will
return 100 evenly-spaced bins across each chromosome/contig.
The returned bins have all the same methods as those returned by the
"region" type, except that the start() and end() methods return the
boundaries of the bin rather than any individual interval reported in
the BED file. Instead of returning a single integer value, the score()
method returns a hash reference to statistical summary information:
Key          Value
---          -----
validCount   Number of intervals in the bin
maxVal       Maximum value in the bin
minVal       Minimum value in the bin
sumData      Sum of the intervals in the bin
sumSquares   Sum of the squares of the intervals in the bin
In addition, the bin objects add the following convenience methods:
$bin->count()      Same as $bin->score->{validCount}
$bin->minVal()     Same as $bin->score->{minVal}
$bin->maxVal()     Same as $bin->score->{maxVal}
$bin->mean()       The mean of values in the bin (from the formulas below)
$bin->variance()   The variance of values in the bin (ditto)
$bin->stdev()      The standard deviation of values in the bin (ditto)
From these values one can determine the mean, variance and standard
deviation across one or more genomic intervals. The formulas are as
follows:
sub mean {
    my ($sumData,$validCount) = @_;
    return $sumData/$validCount;
}
sub variance {
    my ($sumData,$sumSquares,$validCount) = @_;
    my $var = $sumSquares - $sumData*$sumData/$validCount;
    if ($validCount > 1) {
        $var /= $validCount-1;
    }
    return 0 if $var < 0;
    return $var;
}
sub stdev {
    my ($sumData,$sumSquares,$validCount) = @_;
    my $variance = variance($sumData,$sumSquares,$validCount);
    return sqrt($variance);
}
Note that in the calculation of variance, there is a chance of getting
very small negative numbers in a tight distribution due to floating
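point rounding errors. Hence the check for variance < 0. To pool bins,
sum the validCount, sumData and sumSquares values across the bins (and
take the smallest minVal and largest maxVal), then apply the formulas
to the pooled totals.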
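A minimal sketch of pooling, using made-up numbers in place of real $bin->score hashes. pool_bins() is a hypothetical helper, not one of the module's exported functions. The additive keys are summed, while minVal and maxVal are combined with min and max instead.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Hypothetical helper: pool several score hashes into one.
sub pool_bins {
    my @scores = grep { $_->{validCount} } @_;   # ignore empty bins
    return unless @scores;

    my %pooled = (validCount => 0, sumData => 0, sumSquares => 0);
    for my $s (@scores) {
        $pooled{$_} += $s->{$_} for qw(validCount sumData sumSquares);
    }
    # Extremes are not additive: take the overall minimum and maximum.
    $pooled{minVal} = min map { $_->{minVal} } @scores;
    $pooled{maxVal} = max map { $_->{maxVal} } @scores;
    return \%pooled;
}

# Made-up data: the first bin holds the values (1,3), the second (2,4,6).
my $pooled = pool_bins(
    { validCount => 2, minVal => 1, maxVal => 3, sumData => 4,  sumSquares => 10 },
    { validCount => 3, minVal => 2, maxVal => 6, sumData => 12, sumSquares => 56 },
);

# Apply the mean and variance formulas to the pooled totals.
my $n    = $pooled->{validCount};
my $mean = $pooled->{sumData} / $n;
my $var  = ($pooled->{sumSquares} - $pooled->{sumData}**2 / $n) / ($n - 1);
printf "n=%d mean=%.2f variance=%.2f\n", $n, $mean, $var;
```

The pooled result matches what the formulas would give for the five underlying values taken together.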
For your convenience, this module optionally exports functions that
perform these calculations for you. Please see L</EXPORTED FUNCTIONS>
below.
If no bin count is specified, then a value of 1 is assumed. This will
return one bin spanning the entirety of the region specified. For
example:
# One bin covering an explicit range
my ($bin) = $bigbed->features(-seq_id=>'chr1',
                              -start=>1,-end=>120_000_000,
                              -type => 'bin');
print "Features on chr1:1..120,000,000 : ",$bin->count,"\n";

# One bin covering all of chr1 ($bin is reused, so no second "my")
($bin) = $bigbed->features(-seq_id=>'chr1',-type=>'bin');
print "Features on chr1: ",$bin->count,"\n";

# One bin per chromosome/contig
my @bins = $bigbed->features(-type=>'bin'); # no position specified
for my $bin (@bins) {
    my $chr = $bin->seq_id;
    print "Features on $chr: ",$bin->count,"\n";
}
I<summary>
This feature type is similar to "bin" except that instead of returning
one feature for each binned interval on the genome, it returns a
single object from which you can retrieve summary statistics across
fixed-width bins in a more memory-efficient manner. Call the object's
statistical_summary() method with the number of bins you need; it returns
an array ref with one element per bin. Each element of the array will be a
hashref containing the B<minVal>, B<maxVal>, B<sumData>, B<sumSquares>
and B<validCount> keys. The following code illustrates how this works:
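A minimal sketch of consuming such an array ref. The data below is hand-made and stands in for a real statistical_summary() call, and the all-zero hash for an empty bin is an assumption, not confirmed module behaviour.

```perl
use strict;
use warnings;

# Stand-in for the array ref that statistical_summary(3) would return.
# The numbers are invented for illustration only.
my $stats = [
    { validCount => 4, minVal => 1, maxVal => 9, sumData => 20, sumSquares => 140 },
    { validCount => 0, minVal => 0, maxVal => 0, sumData => 0,  sumSquares => 0   },
    { validCount => 2, minVal => 3, maxVal => 5, sumData => 8,  sumSquares => 34  },
];

my @means;
for my $i (0 .. $#$stats) {
    my $s = $stats->[$i];
    # Guard against empty bins so we never divide by zero.
    my $mean = $s->{validCount} ? $s->{sumData} / $s->{validCount} : 0;
    push @means, $mean;
    printf "bin %d: n=%d min=%s max=%s mean=%.2f\n",
        $i, $s->{validCount}, $s->{minVal}, $s->{maxVal}, $mean;
}
```

Because each element is a plain hashref, no per-bin feature objects need to be created, which is where the memory saving described above comes from.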