Archive-Tar-Stream
lib/Archive/Tar/Stream.pm
package Archive::Tar::Stream;
use strict;
use warnings;
# the 512-byte block size is fixed by the tar format!
use constant BLOCKSIZE => 512;
# buffer size in blocks and in bytes (2048 * 512 = 1 MiB)
use constant BLOCKCOUNT => 2048;
use constant BUFSIZE => (BLOCKSIZE * BLOCKCOUNT);
# dependencies
use IO::File;
use IO::Handle;
use File::Temp;
use List::Util qw(min);
# XXX - make this an OO attribute
our $VERBOSE = 0;
=head1 NAME

Archive::Tar::Stream - pure perl IO-friendly tar file management

=head1 VERSION

Version 0.05

=cut

our $VERSION = '0.05';

=head1 SYNOPSIS

Archive::Tar::Stream grew from a requirement to process very large
archives containing email backups, where the IO hit for unpacking
a tar file, repacking parts of it, and then unlinking all the files
was prohibitive.

Archive::Tar::Stream takes two file handles, one purely for reads,
one purely for writes. It does no seeking; it just unpacks
individual records from the input filehandle, and packs records
to the output filehandle.

This module does not attempt to do any file handle management or
compression for you. External zcat and gzip processes are quite fast
and run on separate cores.

    use Archive::Tar::Stream;

    # add a file to the output stream
    my $ts = Archive::Tar::Stream->new(outfh => $outfh);
    $ts->AddFile($name, -s $fh, $fh);

    # remove large non-jpeg files from a tar.gz
    my $infh = IO::File->new("zcat $infile |") || die "oops";
    my $outfh = IO::File->new("| gzip > $outfile") || die "double oops";
    my $ts = Archive::Tar::Stream->new(infh => $infh, outfh => $outfh);
    $ts->StreamCopy(sub {
        my ($header, $outpos, $fh) = @_;

        # we want all small files
        return 'KEEP' if $header->{size} < 64 * 1024;
        # and any other jpegs
        return 'KEEP' if $header->{name} =~ m/\.jpg$/i;

        # no, seriously: without a filehandle yet, ask for one
        return 'EDIT' unless $fh;
        return 'KEEP' if mimetype_of_filehandle($fh) eq 'image/jpeg';

        # ok, we don't want other big files
        return 'SKIP';
    });
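
The no-seek approach described above can be sketched without the module itself. The following is a minimal illustration, not this module's implementation: it assumes plain ustar-style headers (name in bytes 0-99, size as octal text in bytes 124-135) with no long-name or pax extension records, and the helper names C<fake_header>, C<read_exact> and C<list_members> are hypothetical.

```perl
use strict;
use warnings;

use constant BLOCKSIZE => 512;

# Build one header block by hand: name in bytes 0-99, size as octal text
# in bytes 124-135. All other fields stay NUL, so this is NOT a valid
# archive header (no checksum); it exists only to feed the reader below.
sub fake_header {
    my ($name, $size) = @_;
    my $block = "\0" x BLOCKSIZE;
    substr($block, 0, length($name)) = $name;
    substr($block, 124, 12) = sprintf("%011o\0", $size);
    return $block;
}

# Read exactly $want bytes from a stream; return undef at end of input.
sub read_exact {
    my ($fh, $want) = @_;
    my $buf = '';
    while (length($buf) < $want) {
        my $got = read($fh, $buf, $want - length($buf), length($buf));
        die "read failed: $!" unless defined $got;
        return if $got == 0;
    }
    return $buf;
}

# Walk the stream front to back: parse each header, then consume the
# padded data by reading it (never by seeking past it).
sub list_members {
    my ($fh) = @_;
    my @members;
    while (defined(my $header = read_exact($fh, BLOCKSIZE))) {
        last if $header !~ /[^\0]/;    # an all-NUL block marks the end
        my $name   = unpack('Z100', $header);
        my $size   = oct(unpack('x124 Z12', $header));
        my $padded = BLOCKSIZE * int(($size + BLOCKSIZE - 1) / BLOCKSIZE);
        if ($padded) {
            defined(read_exact($fh, $padded))
                or die "truncated member $name";
        }
        push @members, [$name, $size];
    }
    return @members;
}

# An in-memory "archive": two members, then the end-of-archive marker.
my $archive = fake_header('hello.txt', 5) . 'hello' . ("\0" x 507)
            . fake_header('empty', 0)
            . ("\0" x (2 * BLOCKSIZE));

open(my $in, '<', \$archive) or die "cannot open in-memory stream: $!";
my @members = list_members($in);
print "$_->[0] ($_->[1] bytes)\n" for @members;
```

Because the reader only ever calls C<read>, the same loop would work on a pipe such as the C<zcat> handle in the synopsis, where seeking is not possible.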

=head1 SUBROUTINES/METHODS

=head2 new

    my $ts = Archive::Tar::Stream->new(%args);

Args:

    infh      - filehandle to read from
    outfh     - filehandle to write to
    inpos     - initial offset in infh
    outpos    - initial offset in outfh
    safe_copy - boolean

Offsets are for informational purposes only, but can be
useful if you are tracking offsets of items within your
tar files separately. All read and write functions
update these offsets. If you don't provide offsets, they
will default to zero.
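
If you do track item offsets separately, the arithmetic can be sketched as follows. This is an illustration under stated assumptions, not a description of how this module counts: it assumes byte offsets, one 512-byte header block per member, data padded to a whole number of blocks, and no extension headers. The helpers C<tar_member_span> and C<tar_member_offsets> are hypothetical and not part of Archive::Tar::Stream.

```perl
use strict;
use warnings;

use constant BLOCKSIZE => 512;

# Bytes occupied by one member: a single header block plus the data
# rounded up to a whole number of blocks (assumes no extension headers).
sub tar_member_span {
    my ($size) = @_;
    my $data_blocks = int(($size + BLOCKSIZE - 1) / BLOCKSIZE);
    return BLOCKSIZE * (1 + $data_blocks);
}

# Given a starting offset and a list of [name, size] pairs, record where
# each member's header would start and where the next one would go.
sub tar_member_offsets {
    my ($start, @members) = @_;
    my %offset_of;
    my $pos = $start;
    for my $member (@members) {
        my ($name, $size) = @$member;
        $offset_of{$name} = $pos;
        $pos += tar_member_span($size);
    }
    return ($pos, \%offset_of);
}

my ($end, $offsets) = tar_member_offsets(
    0,
    ['a.txt', 10],
    ['b.jpg', 1024],
    ['empty', 0],
);

printf "%-6s starts at %d\n", $_, $offsets->{$_} for sort keys %$offsets;
print "next header would start at $end\n";
```

Passing a non-zero first argument plays the same role as supplying an initial offset when appending to a stream that already holds data.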
Safe Copy is the default - you have to explicitly turn it
off. If Safe Copy is set, every file is first extracted