Compress-BGZF
view release on metacpan or search on metacpan
lib/Compress/BGZF.pm view on Meta::CPAN
package Compress::BGZF 0.007001;
use 5.012;
use strict;
use warnings;
1;
__END__
=head1 NAME
Compress::BGZF - Read/write blocked GZIP (BGZF) files
=head1 SYNOPSIS
use Compress::BGZF::Writer;
use Compress::BGZF::Reader;
# create a BGZF file
my @records = generate_data();
my $fh_out = Compress::BGZF::Writer->new_filehandle( 'somefile.gz' );
print {$fh_out} $_ for (@records);
close $fh_out;
# perform non-sequential reads
my $fh_in = Compress::BGZF::Reader->new_filehandle( 'somefile.gz' );
# read 32 bytes from uncompressed file offset 3020
seek $fh_in, 3020, 0;
read $fh_in, my $buffer, 32;
print "data: $buffer\n";
=head1 DESCRIPTION
C<Compress::BGZF> contains a pair of modules for working with block GZIP (BGZF) files.
BGZF is a specialized GZIP format that is compatible with existing GZIP tools
and libraries, but which allows for fast random access at the cost of a modest
increase in file size. It does this by concatenating together multiple
complete GZIP blocks, each of which has a full header and footer and thus can
be decompressed individually without reading through earlier parts of the
file, and by including an extra field in each header that contains the size of
the block. Upon creation of a Reader object, an index containing the
compressed and uncompressed offsets of the start of each block is either read
from disk or generated from the data itself. C<seek>, C<read>, and C<tell> (or
their object-oriented counterparts) can then be performed on the compressed
file as if it were uncompressed. Seeks are fast, and a worst-case maximum of
64k of preceeding data will be uncompressed in order to reach the data of
interest.
=head2 Selected Implementation Notes
According to the BGZF specification, each GZIP block is limited to 64kb in
size (including an 18 byte header and 8 byte footer). While in theory the
uncompressed size could be larger, limits of the virtual offset calculation
and ease of implementation mean that this size limit is enforced on the
uncompressed data.
Virtual offsets are calculated as follows: for any given position in the
uncompressed file, the virtual offset is calculated from the starting byte
offset A of the block in which it occurs (relative to the compressed file) and
the byte offset B at which it occurs in the uncompressed payload of that
block, such that VO = A << 16 | B. This single value then contains sufficient
information to quickly seek to the given location and begin extracting data.
=head1 METHODS
See individual POD of Reader and Writer modules.
A demonstration is included under bin/ named "bgzip.pl" which is designed to
( run in 2.456 seconds using v1.01-cache-2.11-cpan-8f98c5d2c55 )