streaming results from the CPAN

streaming
Bio-DB-HTS
view release on metacpan or search on metacpan

 Bio::DB::HTSfile->index_build($bamfile);
 my $index = Bio::DB::HTSfile->index_load($hfile);
 my $index = Bio::DB::HTSfile->index_open_in_safewd($hfile);

 my $callback = sub {
     my $alignment = shift;
     my $start       = $alignment->start;
     my $end         = $alignment->end;
     my $seqid       = $target_names->[$alignment->tid];
     print $alignment->qname," aligns to $seqid:$start..$end\n";
 }
 my $header = $index->header;
 $index->fetch($hfile,$header->parse_region('seq2'),$callback);

=head1 DESCRIPTION

This module provides a Perl interface to the HTSlib library for
indexed and unindexed SAM/BAM/CRAM sequence alignment databases. 
It provides support for retrieving information on individual alignments,
read pairs, and alignment coverage information across large
regions. It also provides callback functionality for calling SNPs and
performing other base-by-base functions.

=head2 The high-level API

The high-level API provides a BioPerl-compatible interface to indexed
BAM and CRAM files. The alignment file database is treated as a collection of
Bio::SeqFeatureI features, and can be searched for features by name,
location, type and combinations of feature tags such as whether the
alignment is part of a mate-pair.

When opening a alignment database using the high-level API, you provide the
pathnames of two files: the FASTA file that contains the reference
genome sequence, and the BAM file that contains the query sequences
and their alignments. If either of the two files needs to be indexed,
the indexing will need to be built. You can then query the
database for alignment features by combinations of name, position,
type, and feature tag.

The high-level API provides access to up to four feature "types":

 * "match": The "raw" unpaired alignment between a read and the
   reference sequence.

 * "read_pair": Paired alignments; a single composite
   feature that contains two subfeatures for the alignments of each
   of the mates in a mate pair.

 * "coverage": A feature that spans a region of interest that contains
   numeric information on the coverage of reads across the region.

 * "region": A way of retrieving information about the reference
   sequence. Searching for features of type "region" will return a
   list of chromosomes or contigs in the reference sequence, rather
   than read alignments.

 * "chromosome": A synonym for "region".

B<Features> can be en masse in a single call, retrieved in a
memory-efficient streaming basis using an iterator, or interrogated
using a filehandle that return a series of SAM-format lines.

B<SAM alignment flags> can be retrieved using BioPerl's feature "tag"
mechanism. For example, to interrogate the FIRST_MATE flag, one
fetches the "FIRST_MATE" tag:

  warn "aye aye captain!" if $alignment->get_tag_values('FIRST_MATE');

The Bio::SeqFeatureI interface has been extended to retrieve all flags
as a compact human-readable string, and to return the CIGAR alignment
in a variety of formats.

B<Split alignments>, such as reads that cover introns, are dealt with
in one of two ways. The default is to leave split alignments alone:
they can be detected by one or more "N" operations in the CIGAR
string. Optionally, you can choose to have the API split these
alignments across two or more subfeatures; the CIGAR strings of these
split alignments will be adjusted accordingly.

B<Interface to the pileup routines> The API provides you with access
to the samtools "pileup" API. This gives you the ability to write a
callback that will be invoked on every column of the alignment for the
purpose of calculating coverage, quality score metrics, or SNP
calling.

B<Access to the reference sequence> When you create the Bio::DB::HTS
object, you can pass the path to a FASTA file containing the reference
sequence. Alternatively, you may pass an object that knows how to
retrieve DNA sequences across a range via the seq() or fetch_seq()
methods, as described under new().

If the SAM/BAM file has MD tags, then these tags will be used to
reconstruct the reference sequence when necessary, in which case you
can completely omit the -fasta argument. Note that not all SAM/BAM
files have MD tags, and those that do may not use them correctly due
to the newness of this part of the SAM spec. You may wish to populate
these tags using samtools' "calmd" command.

If the -fasta argument is omitted and no MD tags are present, then the
reference sequence will be returned as 'N'.

The B<main object classes> that you will be dealing with in the
high-level API are as follows:

 * Bio::DB::HTS               -- A collection of alignments and reference sequences.
 * Bio::DB::HTS::Alignment    -- The alignment between a query and the reference.
 * Bio::DB::HTS::Query        -- An object corresponding to the query sequence in
                                  which both (+) and (-) strand alignments are
                                  shown in the reference (+) strand.
 * Bio::DB::HTS::Target       -- An interface to the query sequence in which
                                   (-) strand alignments are shown in reverse
                                   complement

You may encounter other classes as well. These include:

 * Bio::DB::HTS::Segment       -- This corresponds to a region on the reference
                                  sequence.
 * Bio::DB::HTS::Constants     -- This defines CIGAR symbol constants and flags.
 * Bio::DB::HTS::AlignWrapper  -- An alignment helper object that adds split
                                  alignment functionality. See Bio::DB::HTS::Alignment
( run in 1.484 second using v1.01-cache-2.11-cpan-140bd7fdf52 )