File-Unpack2

 view release on metacpan or  search on metacpan

Unpack2.pm  view on Meta::CPAN

  my ($number, $dec_places) = @_;
  $dec_places = 2 unless defined $dec_places;
  my $div = 1;
  my $unit = '';
  my $neg = '';
  if ($number < 0)
    {
      $neg = '-'; $number = -$number;
    }
  if ($number > $div * 1024)
    {
      $div *= 1024; $unit = 'k'; 
      if ($number > $div * 1024)
        {
	  $div *= 1024; $unit = 'm'; 
	  if ($number > $div * 1024)
	    {
	      $div *= 1024; $unit = 'g'; 
	      if ($number > $div * 1024)
	        {
		  $div *= 1024; $unit = 't'; 
		}
	    }
	}
    }
  return sprintf "%s%.*f%s", $neg, $dec_places, ($number / $div), $unit;
}

# see fs.pm/check_fs_health()

sub minfree
{
  my $self = shift;
  my %opt = @_;

  for my $i (qw(factor bytes percent))
    {
      $self->{minfree}{$i} = $opt{$i} if defined $opt{$i};
      $self->{minfree}{$i} ||= 0;
    }
  $self->{minfree}{bytes} = _bytes_unit($self->{minfree}{bytes});
  $self->{minfree}{percent} =~ s{%$}{};
  $self->{fs_warn} = $opt{warning} if ref $opt{warning};
}

=head2 mime

$u->mime($filename)

$u->mime(file => $filename)

$u->mime(buf => "#!/bin ...", file => "what-was-read")

$u->mime(fd => \*STDIN, file => "what-was-opened")

Determines the MIME type (and optionally additional information) of a file.
The file can be specified by filename, by a provided buffer or an opened file descriptor.
For the latter two cases, specifying a filename is optional, and used only for diagnostics.

C<mime> uses libmagic by Christos Zoulas exposed via File::LibMagic and also uses
the shared-mime-info database from freedesktop.org exposed via
File::MimeInfo::Magic, if available.  Either one is sufficient, but having both
is better. LibMagic sometimes says 'text/x-pascal', although we have a F<.desktop>
file, or says 'text/plain', but has contradicting details in its description.

C<File::MimeInfo::Magic::magic> is consulted where the libmagic output is dubious. E.g. when 
the desciption says something interesting like 'Debian binary package (format 2.0)' but the 
mimetype says 'application/octet-stream'. The combination of both libraries gives us 
excellent reliability in the critical field of MIME type recognition.

This implementation also features multi-level MIME type recognition for efficient unpacking.
When e.g. unpacking a large bzipped tar archive, this saves us from creating a
huge temporary tar-file which C<unpack> would extract in a second step.  The multi-level recognition
returns 'application/x-tar+bzip2' in this case, and allows for a MIME helper
to e.g. pipe the bzip2 contents into tar (which is exactly what 'tar jxvf'
does, making a very simple and efficient MIME helper).

C<mime> returns a 3 or 4 element arrayref with mimetype, charset, description, diff;
where diff is only present when the libfile and shared-mime-info methods disagree.

In case of 'text/plain', an additional rule based on file name suffix is used to allow
recognition of well known plain text pack formats. 
We return 'text/x-suffix-XX+plain', where XX is one of the recognized suffixes
(in all lower case and without the dot).  E.g. a plain mmencoded file has no
header and looks like 'plain/text' to all the known magic libraries. We
recognize the suffixes .mm, .b64, and .base64 for this (case insignificant).
A similar rule exitst for 'application/octect-stream'. It may trigger e.g. for
LZMA compressed files which fail to provide a magic number.

Examples:
 
 [ 'text/x-perl', 'us-ascii', 'a /usr/bin/perl -w script text']

 [ 'text/x-mpegurl', 'utf-8', 'M3U playlist text', 
   [ 'text/plain', 'application/x-mpegurl']]

 [ 'application/x-tar+bzip2, 'binary', 
   "bzip2 compressed data, block size = 900k\nPOSIX tar archive (GNU)", ...]

=cut

sub mime 
{
  my ($self, @in) = @_;

  my %in;
     %in = %{$in[0]}  if !$#in and ref $in[0] eq 'HASH';
  unshift @in, 'file' if !$#in and !ref $in[0];
  %in = @in if $#in > 0;

  my $flm = $self->{flm} ||= File::LibMagic->new();

  unless (defined $in{buf})
    {
      my $fd = $in{fd};
      unless ($fd)
        {
	  open $fd, "<", $in{file} or
	    return [ 'x-system/x-error', undef, "cannot open '$in{file}': $!" ];
	}

      my $f = $in{file}||'-';
      $in{buf} = '';
      my $pos = tell $fd;
      ##bzip2 below needs a long buffer, or it returns 0.
      my $len = read $fd, $in{buf}, $UNCOMP_BUFSZ;
      return [ 'x-system/x-error', undef, "read '$f' failed: $!" ] unless defined $len;
      return [ 'x-system/x-error', undef, "read '$f' failed: $len: $!" ] if $len < 0;
      return [ 'text/x-empty', undef, 'empty' ] if $len == 0;
      seek $fd, $pos, 0;

      close $fd unless $in{fd};
    }


  ## flm can say 'cannot open \'IP\' (No such file or directory)'
  ## flm can say 'CDF V2 Document, corrupt: Can\'t read SAT'	(application/vnd.ms-excel)
  my $mime1 = eval { $flm->checktype_contents($in{buf}) };
  if ($@) {
    warn $@;
    return [ 'x-system/x-error', undef, "libmimemagic exception"];
  }
  if ($mime1 =~ m{, corrupt: } or $mime1 =~ m{^application/octet-stream\b})
    {
      # application/x-iso9660-image is reported as application/octet-stream if the buffer is short.
      # iso images usually start with 0x8000 bytes of all '\0'.
      print STDERR "mime: readahead buffer $UNCOMP_BUFSZ too short\n" if $self->{verbose} > 2;
      if (defined $in{file} and -f $in{file})
        {
          print STDERR "mime: reopening $in{file}\n" if $self->{verbose} > 1;
          $mime1 = $flm->checktype_filename($in{file});
	}
    }
  print STDERR "flm->checktype_contents: $mime1\n" if $self->{verbose} > 1;
  $in{file} = '-' unless defined $in{file};
    
  return [ 'x-system/x-error', undef, $mime1 ] if $mime1 =~ m{^cannot open};

  # in SLES11 we get 'text/plain charset=utf-8' without semicolon.
  my $enc; ($mime1, $enc) = ($1,$2) if $mime1 =~ m{^(.*?);\s*(.*)$} or
                                       $mime1 =~ m{^(.*?)\s+(.*)$};
  $enc =~ s{^charset=}{} if defined $enc;
  my @r = ($mime1, $enc, $flm->describe_contents($in{buf}) );
  my $mime2;

  
  if ($mime1 =~ m{^application/xml})
    {
      # This is horrible from a greedy text cruncher perspective:
      # although xml is a plain text syntax, it is reported by flm to be 
      # outside text/*
      $r[0] = "text/x-application-xml";
    }

  if ($mime1 =~ m{^text/x-(?:pascal|fortran)$})
    {
      # xterm.desktop
      # ['text/x-pascal; charset=utf-8','UTF-8 Unicode Pascal program text']
      # 'application/x-desktop'
      #
      # Times-Roman.afm
      # ['text/x-fortran; charset=us-ascii','ASCII font metrics']
      # 'application/x-font-afm'
      #
      # debian/rules
      # ['text/x-pascal; charset=us-ascii','a /usr/bin/make -f  script text']
      # 'text/x-makefile'
      if ($mime2 ||= eval { open my $fd,'<',\$in{buf}; File::MimeInfo::Magic::magic($fd); })
        {
	  $r[0] = "text/$1" if $mime2 =~ m{/(\S+)};
	}
    }
  elsif (($mime1 eq 'text/plain' and $r[2] =~ m{(?:PostScript|font)}i)
	or ($mime1 eq 'application/postscript'))
    {
      # 11.3 says:
      #  IPA.pfa
      #  ['text/plain; charset=us-ascii','PostScript Type 1 font text (OmegaSerifIPA 001.000)']
      # sles11 says:
      #  IPA.pfa
      #  ['application/postscript', undef, 'PostScript document text']
      #
      # mime2 = 'application/x-font-type1'
      # $mime2 = eval { File::MimeInfo::Magic::mimetype($in{file}); };
      $mime2 ||= eval { open my $fd,'<',\$in{buf}; File::MimeInfo::Magic::magic($fd); };
      if ($mime2 and $mime2 =~ m{^(.*)/(.*)$})
        {
	  my ($a,$b) = ($1,$2);
	  $a = 'text' if $r[2] =~ m{\btext\b}i; 
	  $r[0] = "$a/$b";
	}
    }

  if ($r[0] eq 'text/plain' or 
      $r[0] eq 'application/octet-stream')
    {
      # hmm, are we sure? No, if the description contradicts:
      # 
      $r[0] = "text/x-uuencode" if $r[2] eq 'uuencoded or xxencoded text';

      # bin/floor
      # ['text/x-pascal; charset=us-ascii','a /usr/bin/tclsh script text']
      # 'text/plain'
      $r[0] = "text/x-$2" if $r[2] =~ m{^a (\S*/)?([^/\s]+) .*script text$}i;
      if ($r[2] =~ m{\bimage\b})
        {
	  # ./opengl/test.tga
	  # ['application/octet-stream; charset=binary','Targa image data - RGB 128 x 128']
	  # 'image/x-tga'
          $mime2 ||= eval { open my $fd,'<',\$in{buf}; File::MimeInfo::Magic::magic($fd); };
	  $r[0] = $mime2 if $mime2 and $mime2 =~ m{^image/};
	}
    }

  if ($r[0] eq 'application/octet-stream')
    {
      # it can't get much worse, can it?
      ##
      # dotdot.tar.lzma

Unpack2.pm  view on Meta::CPAN

  if ($r[0] =~ m{^application/x-(ms-dos-|)executable$})
    {
      if (-x '/usr/bin/upx')
        {
	  # upx refuses to read symlinks. Work around this.
	  my $in_file = $in{file};
	  $in_file = readlink($in{file}) if -l $in{file};
	  $r[0] .= '+upx' unless run(['/usr/bin/upx', '-q', '-q', '-t', $in_file]);
	}
    }

  ${$in{uncomp}} = $uncomp_buf if ref $in{uncomp} eq 'SCALAR';
  $r[3] = [ $mime1, $mime2 ] if $mime1 ne $r[0] or ($mime2 and $mime2 ne $mime1);

  return \@r;
}

=head1 AUTHOR

Juergen Weigert, C<< <jnw at cpan.org> >>

=head1 BUGS

The implementation of C<mime> is an ugly hack. We suffer from the existence of
multiple file magic databases, and multiple conflicting implementations. With
Perl we have at least 5 modules for this; here we use two.

The builtin list of MIME helpers is incomplete. Please submit your handler code.

Please report any bugs or feature requests to C<bug-file-unpack at rt.cpan.org>, or through
the web interface at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=File-Unpack2>.  I will be notified, and then you'll
automatically be notified of progress on your bug as I make changes.


=head1 RELATED MODULES

While designing File::Unpack2, a range of other perl modules were examined. Many modules provide valuable service to File::Unpack2 and became dependencies or are recommended.
Others exposed drawbacks during closer examination and may find some of their
wheels re-invented here.

=head2 Used Modules

=over

=item File::LibMagic

This is the prefered mimetype engine. It disregards the suffix, recognizes more
types than any of the alternatives, and uses exactly the same engine as
/usr/bin/file in openSUSE systems. It also returns charset and description
information.  We crossreference the description with the mimetype to detect
weaknesses, and consult File::MimeInfo::Magic and some own logic, for e.g.
detecting LZMA compression which fails to provide any recognizable magic.
Required if you use C<mime>; otherwise not a hard requirement.

=item File::MimeInfo::Magic

Uses both magic information and file suffixes to determine the mimetype. Its
magic() function is used in a few cases, where File::LibMagic fails.  E.g. as
of June 2010, libmagic does not recognize 'image/x-targa'.
File::MimeInfo::Magic may be slower, but it features the shared-mime-info
database from freedesktop.org .  Recommended if you use C<mime>.

=item String::ShellQuote 

Used to call external MIME helpers. Required.

=item BSD::Resource

Used to reliably restrict the maximum file size. Recommended.

=item File::Path

mkpath(). Required.

=item Cwd

fast_abs_path(). Required.

=item JSON

Used for formatting the logfile. Required.

=back

=head2 Modules Not Used

=over

=item Archive::Extract

Archive::Extract tries first to determine what type of archive you are passing
it, by inspecting its suffix. 'Maybe this module should use something like
"File::Type" to determine the type, rather than blindly trust the suffix'.
[quoted from perldoc]

Set $Archive::Extract::PREFER_BIN to 1, which will prefer the use of command 
line programs and won't consume so much memory. Default: use "Archive::Tar".

=item Archive::Zip

If you are just going to be extracting zips (and/or other archives) you are 
recommended to look at using Archive::Extract . [quoted from perldoc]
It is pure perl, so it's a lot slower then your '/usr/bin/zip'.

=item Archive::Tar

It is pure Perl, so it's a lot slower then your "/bin/tar".
It is heavy on memory, all will be read into memory. [quoted from perldoc]

=item File::MMagic, File::MMagic::XS, File::Type

Compared to File::LibMagic and File::MimeInfo::Magic, these three are inferior.
They often say 'text/plain' or 'application/octet-stream' where the latter two report 
useful mimetypes.

=back

=head1 SUPPORT

You can find documentation for this module with the perldoc command.



( run in 1.712 second using v1.01-cache-2.11-cpan-e1769b4cff6 )