Bio-BigFile
view release on metacpan or search on metacpan
lib/Bio/DB/BigWig.pm view on Meta::CPAN
Return the DNA accessor (usually a Bio::DB::Fasta object) which the
object uses to fetch the sequence of the reference genome.
=cut
sub fa { shift->{fa} }
=back
=cut
sub segment {
my $self = shift;
my ($seqid,$start,$end) = @_;
if ($_[0] =~ /^-/) {
my %args = @_;
$seqid = $args{-seq_id} || $args{-name};
$start = $args{-start};
$end = $args{-stop} || $args{-end};
} else {
($seqid,$start,$end) = @_;
}
my $size = $self->bw->chromSize($seqid) or return;
$start ||= 1;
$end ||= $size-1;
return unless $start >= 1 && $start < $size;
return unless $end >= 1 && $end < $size;
return Bio::DB::BigFile::Segment->new(-bf => $self,
-seq_id=> $seqid,
-start => $start,
-end => $end);
}
=head2 Retrieving intervals and summary statistics
This section describes methods that return lists of intervals and
summary statistics from the BigWig object. Most of the methods are
oriented towards retrieving information about the distribution of
values in the WIG file. Summary information typically consists of the
following fields:
Key Value
--- ---------
validCount Number of intervals in the bin
maxVal Maximum value in the bin
minVal Minimum value in the bin
sumData Sum of the intervals in the bin
sumSquares Sum of the squares of the intervals in the bin
From these values one can determine the mean, variance and standard
deviation across one or more genomic intervals. The formulas are as
follows:
sub mean {
my ($sumData,$validCount) = @_;
return $sumData/$validCount;
}
sub variance {
my ($sumData,$sumSquares,$validCount) = @_;
my $var = $sumSquares - $sumData*$sumData/$validCount;
if ($validCount > 1) {
$var /= $validCount-1;
}
return 0 if $var < 0;
return $var;
}
sub stdev {
my ($sumData,$sumSquares,$validCount) = @_;
my $variance = variance($sumData,$sumSquares,$validCount);
return sqrt($variance);
}
Note that in the calculation of variance, there is a chance of getting
very small negative numbers in a tight distribution due to floating
point rounding errors. Hence the check for variance < 0. To pool bins,
simply sum the individual values.
For your convenience, Bio::DB::BigWig optionally exports functions
that perform these calculations for you. Please see L</EXPORTED
FUNCTIONS> below.
The following methods allow you to query the BigWig file:
=over 4
=item B<@features = $bigwig-E<gt>features(@args)>
This method is the workhorse for retrieving various types of intervals
and summary statistics from the BigWig database. It takes a series of
named arguments in the format (-argument1 => value1, -argument2 =>
value2, ...) and returns a list of zero or more BioPerl
Bio::SeqFeatureI objects.
The following arguments are recognized:
Argument Description Default
-------- ----------- -------
-seq_id Chromosome or contig name defining All chromosomes/contigs.
the range of interest.
-start Start of the range of interest. 1
-end End of the range of interest Chromosome/contig end
-type Type of feature to retrieve 'region'
(see below).
-iterator Boolean, which if true, returns undef (false)
lib/Bio/DB/BigWig.pm view on Meta::CPAN
end and score are of the greatest interest.
Three different feature types are accepted, each one with slightly
different properties:
I<region>,I<interval>
Features retrieved using either of these types will represent the raw
intervals and values present in the original WIG file (the two are
equivalent, "region" is preferred because it is a Sequence Ontology
term, but "interval" is probably more natural). Note that this
operation may consume a lot of memory and be processor-intensive. It
is almost always better to create an iterator using get_seq_stream().
These features have the following useful methods:
$feature->seq_id() The chromosome/contig name
$feature->start() Start of the interval
$feature->end() End of the interval
$feature->score() Score for the interval
$feature->primary_id() Primary ID for the interval
Due to floating point precision issues at the BigWig C level, the
scores may be very slightly different from the originals.
Here is a simple script that dumps out the contents of chromosome I in
BedGraph format:
use Bio::DB::BigWig;
my $wig = Bio::DB::BigWig->new(-bigwig=>$path);
my @features = $wig->features(-seq_id=>'I',-type=>'region');
for my $p (@features) {
my $seqid = $p->seq_id;
my $start = $p->start;
my $end = $p->end;
my $val = $p->score;
print join("\t",$seqid,$start,$end,$val),"\n";
}
I<bin:#count>
A type named "bin:" followed by an integer will divide each
chromosome/contig into the indicated number of summary bins, and
return one feature for each bin. For example, type "bin:100" will
return 100 evenly-spaced bins across each chromosome/contig.
The returned bin features have the same methods as those returned by
the "region"/"interval" types, except that the start() and end()
methods return the boundaries of the bin rather than any individual
interval reported in the WIG file. Instead of returning a single
value, the score() method returns a hash of statistical summary
information containing the keys B<validCount>, B<maxVal>, B<minVal>,
B<sumData> and B<sumSquares> as described earlier.
In addition, the bin objects add the following convenience methods:
$bin->count() Same as $bin->score->{validCount}
$bin->minVal() Same as $bin->score->{minVal}
$bin->maxVal() Same as $bin->score->{maxVal}
$bin->mean() The mean of values in the bin (from the formula above)
$bin->variance() The variance of values in the bin (ditto)
$bin->stdev() The standard deviation of values in the bin (ditto)
If no number is specified (i.e. you search for type "bin"), then an
interval of "1" is assumed. You will receive one bin spanning each
chromosome/contig.
I<summary>
This feature type is similar to "bin" except that instead of returning
one feature for each binned interval on the genome, it returns a
single object from which you can retrieve summary statistics across
fixed-width bins in a more memory-efficient manner. Call the object's
statistical_summary() method with the number of bins you need to get
an array ref of bins length. Each element of the array will be a
hashref containing the B<minVal>, B<maxVal>, B<sumData>, B<sumSquares>
and B<validCount> keys. The following code illustrates how this works:
use Bio::DB::BigWig 'binMean','binStdev';
my $wig = Bio::DB::BigWig->new(-bigwig=>$path);
my @chroms = $wig->features(-type=>'summary');
for my $c (@chroms) {
my $seqid = $c->seq_id;
my $c_start = $c->start;
my $stats = $c->statistical_summary(10);
my $bin_width = $c->length/@$stats;
my $start = $c_start;
for my $s (@$stats) {
my $mean = binMean($s);
my $stdev = binStdev($s);
my $end = $start + $bin_width-1;
print "$seqid:",int($start),'..',int($end),": $mean +/- $stdev\n";
} continue {
$start += $c_start;
}
}
"summary" features have a score() method which returns a statistical
summary hash across the entire region. They also have a
get_seq_stream() method which returns a feature iterator across the
region they cover.
my $summary = $c->score;
To get the mean and stdev across the entire chromosome containing the
summary feature, call its chr_mean() and chr_stdev() methods:
my $mean = $c->chr_mean;
my $stdev = $c->chr_stdev;
To get the global mean and stdev across the entire genome containing
the summary feature, call its global_mean() and global_stdev() methods:
my $mean = $c->global_mean;
my $stdev = $c->global_stdev;
The summary object's chr_stats() and global_stats() methods return a
( run in 0.716 second using v1.01-cache-2.11-cpan-39bf76dae61 )