File-Unpack
view release on metacpan or search on metacpan
my ($number, $dec_places) = @_;
$dec_places = 2 unless defined $dec_places;
my $div = 1;
my $unit = '';
my $neg = '';
if ($number < 0)
{
$neg = '-'; $number = -$number;
}
if ($number > $div * 1024)
{
$div *= 1024; $unit = 'k';
if ($number > $div * 1024)
{
$div *= 1024; $unit = 'm';
if ($number > $div * 1024)
{
$div *= 1024; $unit = 'g';
if ($number > $div * 1024)
{
$div *= 1024; $unit = 't';
}
}
}
}
return sprintf "%s%.*f%s", $neg, $dec_places, ($number / $div), $unit;
}
# see fs.pm/check_fs_health()
sub minfree
{
my $self = shift;
my %opt = @_;
for my $i (qw(factor bytes percent))
{
$self->{minfree}{$i} = $opt{$i} if defined $opt{$i};
$self->{minfree}{$i} ||= 0;
}
$self->{minfree}{bytes} = _bytes_unit($self->{minfree}{bytes});
$self->{minfree}{percent} =~ s{%$}{};
$self->{fs_warn} = $opt{warning} if ref $opt{warning};
}
=head2 mime
$u->mime($filename)
$u->mime(file => $filename)
$u->mime(buf => "#!/bin ...", file => "what-was-read")
$u->mime(fd => \*STDIN, file => "what-was-opened")
Determines the MIME type (and optionally additional information) of a file.
The file can be specified by filename, by a provided buffer or an opened file descriptor.
For the latter two cases, specifying a filename is optional, and used only for diagnostics.
C<mime> uses libmagic by Christos Zoulas exposed via File::LibMagic and also uses
the shared-mime-info database from freedesktop.org exposed via
File::MimeInfo::Magic, if available. Either one is sufficient, but having both
is better. LibMagic sometimes says 'text/x-pascal', although we have a F<.desktop>
file, or says 'text/plain', but has contradicting details in its description.
C<File::MimeInfo::Magic::magic> is consulted where the libmagic output is dubious. E.g. when
the desciption says something interesting like 'Debian binary package (format 2.0)' but the
mimetype says 'application/octet-stream'. The combination of both libraries gives us
excellent reliability in the critical field of MIME type recognition.
This implementation also features multi-level MIME type recognition for efficient unpacking.
When e.g. unpacking a large bzipped tar archive, this saves us from creating a
huge temporary tar-file which C<unpack> would extract in a second step. The multi-level recognition
returns 'application/x-tar+bzip2' in this case, and allows for a MIME helper
to e.g. pipe the bzip2 contents into tar (which is exactly what 'tar jxvf'
does, making a very simple and efficient MIME helper).
C<mime> returns a 3 or 4 element arrayref with mimetype, charset, description, diff;
where diff is only present when the libfile and shared-mime-info methods disagree.
In case of 'text/plain', an additional rule based on file name suffix is used to allow
recognition of well known plain text pack formats.
We return 'text/x-suffix-XX+plain', where XX is one of the recognized suffixes
(in all lower case and without the dot). E.g. a plain mmencoded file has no
header and looks like 'plain/text' to all the known magic libraries. We
recognize the suffixes .mm, .b64, and .base64 for this (case insignificant).
A similar rule exitst for 'application/octect-stream'. It may trigger e.g. for
LZMA compressed files which fail to provide a magic number.
Examples:
[ 'text/x-perl', 'us-ascii', 'a /usr/bin/perl -w script text']
[ 'text/x-mpegurl', 'utf-8', 'M3U playlist text',
[ 'text/plain', 'application/x-mpegurl']]
[ 'application/x-tar+bzip2, 'binary',
"bzip2 compressed data, block size = 900k\nPOSIX tar archive (GNU)", ...]
=cut
sub mime
{
my ($self, @in) = @_;
my %in;
%in = %{$in[0]} if !$#in and ref $in[0] eq 'HASH';
unshift @in, 'file' if !$#in and !ref $in[0];
%in = @in if $#in > 0;
my $flm = $self->{flm} ||= File::LibMagic->new();
unless (defined $in{buf})
{
my $fd = $in{fd};
unless ($fd)
{
open $fd, "<", $in{file} or
return [ 'x-system/x-error', undef, "cannot open '$in{file}': $!" ];
}
my $f = $in{file}||'-';
$in{buf} = '';
my $pos = tell $fd;
##bzip2 below needs a long buffer, or it returns 0.
my $len = read $fd, $in{buf}, $UNCOMP_BUFSZ;
return [ 'x-system/x-error', undef, "read '$f' failed: $!" ] unless defined $len;
return [ 'x-system/x-error', undef, "read '$f' failed: $len: $!" ] if $len < 0;
return [ 'text/x-empty', undef, 'empty' ] if $len == 0;
seek $fd, $pos, 0;
close $fd unless $in{fd};
}
## flm can say 'cannot open \'IP\' (No such file or directory)'
## flm can say 'CDF V2 Document, corrupt: Can\'t read SAT' (application/vnd.ms-excel)
my $mime1 = $flm->checktype_contents($in{buf});
if ($mime1 =~ m{, corrupt: } or $mime1 =~ m{^application/octet-stream\b})
{
# application/x-iso9660-image is reported as application/octet-stream if the buffer is short.
# iso images usually start with 0x8000 bytes of all '\0'.
print STDERR "mime: readahead buffer $UNCOMP_BUFSZ too short\n" if $self->{verbose} > 2;
if (defined $in{file} and -f $in{file})
{
print STDERR "mime: reopening $in{file}\n" if $self->{verbose} > 1;
$mime1 = $flm->checktype_filename($in{file});
}
}
print STDERR "flm->checktype_contents: $mime1\n" if $self->{verbose} > 1;
$in{file} = '-' unless defined $in{file};
return [ 'x-system/x-error', undef, $mime1 ] if $mime1 =~ m{^cannot open};
# in SLES11 we get 'text/plain charset=utf-8' without semicolon.
my $enc; ($mime1, $enc) = ($1,$2) if $mime1 =~ m{^(.*?);\s*(.*)$} or
$mime1 =~ m{^(.*?)\s+(.*)$};
$enc =~ s{^charset=}{} if defined $enc;
my @r = ($mime1, $enc, $flm->describe_contents($in{buf}) );
my $mime2;
if ($mime1 =~ m{^application/xml})
{
# This is horrible from a greedy text cruncher perspective:
# although xml is a plain text syntax, it is reported by flm to be
# outside text/*
$r[0] = "text/x-application-xml";
}
if ($mime1 =~ m{^text/x-(?:pascal|fortran)$})
{
# xterm.desktop
# ['text/x-pascal; charset=utf-8','UTF-8 Unicode Pascal program text']
# 'application/x-desktop'
#
# Times-Roman.afm
# ['text/x-fortran; charset=us-ascii','ASCII font metrics']
# 'application/x-font-afm'
#
# debian/rules
# ['text/x-pascal; charset=us-ascii','a /usr/bin/make -f script text']
# 'text/x-makefile'
if ($mime2 ||= eval { open my $fd,'<',\$in{buf}; File::MimeInfo::Magic::magic($fd); })
{
$r[0] = "text/$1" if $mime2 =~ m{/(\S+)};
}
}
elsif (($mime1 eq 'text/plain' and $r[2] =~ m{(?:PostScript|font)}i)
or ($mime1 eq 'application/postscript'))
{
# 11.3 says:
# IPA.pfa
# ['text/plain; charset=us-ascii','PostScript Type 1 font text (OmegaSerifIPA 001.000)']
# sles11 says:
# IPA.pfa
# ['application/postscript', undef, 'PostScript document text']
#
# mime2 = 'application/x-font-type1'
# $mime2 = eval { File::MimeInfo::Magic::mimetype($in{file}); };
$mime2 ||= eval { open my $fd,'<',\$in{buf}; File::MimeInfo::Magic::magic($fd); };
if ($mime2 and $mime2 =~ m{^(.*)/(.*)$})
{
my ($a,$b) = ($1,$2);
$a = 'text' if $r[2] =~ m{\btext\b}i;
$r[0] = "$a/$b";
}
}
if ($r[0] eq 'text/plain' or
$r[0] eq 'application/octet-stream')
{
# hmm, are we sure? No, if the description contradicts:
#
$r[0] = "text/x-uuencode" if $r[2] eq 'uuencoded or xxencoded text';
# bin/floor
# ['text/x-pascal; charset=us-ascii','a /usr/bin/tclsh script text']
# 'text/plain'
$r[0] = "text/x-$2" if $r[2] =~ m{^a (\S*/)?([^/\s]+) .*script text$}i;
if ($r[2] =~ m{\bimage\b})
{
# ./opengl/test.tga
# ['application/octet-stream; charset=binary','Targa image data - RGB 128 x 128']
# 'image/x-tga'
$mime2 ||= eval { open my $fd,'<',\$in{buf}; File::MimeInfo::Magic::magic($fd); };
$r[0] = $mime2 if $mime2 and $mime2 =~ m{^image/};
}
}
if ($r[0] eq 'application/octet-stream')
{
# it can't get much worse, can it?
##
# dotdot.tar.lzma
if ($r[0] =~ m{^application/x-(ms-dos-|)executable$})
{
if (-x '/usr/bin/upx')
{
# upx refuses to read symlinks. Work around this.
my $in_file = $in{file};
$in_file = readlink($in{file}) if -l $in{file};
$r[0] .= '+upx' unless run(['/usr/bin/upx', '-q', '-q', '-t', $in_file]);
}
}
${$in{uncomp}} = $uncomp_buf if ref $in{uncomp} eq 'SCALAR';
$r[3] = [ $mime1, $mime2 ] if $mime1 ne $r[0] or ($mime2 and $mime2 ne $mime1);
return \@r;
}
=head1 AUTHOR
Juergen Weigert, C<< <jnw at cpan.org> >>
=head1 BUGS
The implementation of C<mime> is an ugly hack. We suffer from the existence of
multiple file magic databases, and multiple conflicting implementations. With
Perl we have at least 5 modules for this; here we use two.
The builtin list of MIME helpers is incomplete. Please submit your handler code.
Please report any bugs or feature requests to C<bug-file-unpack at rt.cpan.org>, or through
the web interface at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Unpack>. I will be notified, and then you'll
automatically be notified of progress on your bug as I make changes.
=head1 RELATED MODULES
While designing File::Unpack, a range of other perl modules were examined. Many modules provide valuable service to File::Unpack and became dependencies or are recommended.
Others exposed drawbacks during closer examination and may find some of their
wheels re-invented here.
=head2 Used Modules
=over
=item File::LibMagic
This is the prefered mimetype engine. It disregards the suffix, recognizes more
types than any of the alternatives, and uses exactly the same engine as
/usr/bin/file in openSUSE systems. It also returns charset and description
information. We crossreference the description with the mimetype to detect
weaknesses, and consult File::MimeInfo::Magic and some own logic, for e.g.
detecting LZMA compression which fails to provide any recognizable magic.
Required if you use C<mime>; otherwise not a hard requirement.
=item File::MimeInfo::Magic
Uses both magic information and file suffixes to determine the mimetype. Its
magic() function is used in a few cases, where File::LibMagic fails. E.g. as
of June 2010, libmagic does not recognize 'image/x-targa'.
File::MimeInfo::Magic may be slower, but it features the shared-mime-info
database from freedesktop.org . Recommended if you use C<mime>.
=item String::ShellQuote
Used to call external MIME helpers. Required.
=item BSD::Resource
Used to reliably restrict the maximum file size. Recommended.
=item File::Path
mkpath(). Required.
=item Cwd
fast_abs_path(). Required.
=item JSON
Used for formatting the logfile. Required.
=back
=head2 Modules Not Used
=over
=item Archive::Extract
Archive::Extract tries first to determine what type of archive you are passing
it, by inspecting its suffix. 'Maybe this module should use something like
"File::Type" to determine the type, rather than blindly trust the suffix'.
[quoted from perldoc]
Set $Archive::Extract::PREFER_BIN to 1, which will prefer the use of command
line programs and won't consume so much memory. Default: use "Archive::Tar".
=item Archive::Zip
If you are just going to be extracting zips (and/or other archives) you are
recommended to look at using Archive::Extract . [quoted from perldoc]
It is pure perl, so it's a lot slower then your '/usr/bin/zip'.
=item Archive::Tar
It is pure Perl, so it's a lot slower then your "/bin/tar".
It is heavy on memory, all will be read into memory. [quoted from perldoc]
=item File::MMagic, File::MMagic::XS, File::Type
Compared to File::LibMagic and File::MimeInfo::Magic, these three are inferior.
They often say 'text/plain' or 'application/octet-stream' where the latter two report
useful mimetypes.
=back
=head1 SUPPORT
You can find documentation for this module with the perldoc command.
( run in 1.138 second using v1.01-cache-2.11-cpan-e1769b4cff6 )