Archive-BagIt

# NAME

Archive::BagIt - The main module to handle bags.

# VERSION

version 0.101

# SOURCE

The original development version was hosted on GitHub at [http://github.com/rjeschmi/Archive-BagIt](http://github.com/rjeschmi/Archive-BagIt)
and may be cloned from there.

The current development version is available at [https://git.fsfe.org/art1pirat/Archive-BagIt](https://git.fsfe.org/art1pirat/Archive-BagIt).

# Conformance to RFC8493

The module should fulfill the RFC requirements, with the following limitations:

- only UTF-8 encoding is supported
- only BagIt versions 0.97 and 1.0 are allowed
- version 0.97 requires tag/manifest files with MD5 fixity
- version 1.0 requires tag/manifest files with SHA-512 fixity
- a BOM (byte order mark) is not supported
- carriage returns in BagIt files are not allowed
- fetch.txt is not supported
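
The encoding, BOM, and carriage-return limitations can be illustrated with a small check on the raw bytes of a tag or manifest file. This is only a sketch using the core `Encode` module; the helper name `check_tag_bytes` is hypothetical and is not part of Archive::BagIt, whose own checks may differ.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not part of Archive::BagIt): returns a list of
# problems found in the raw bytes of a tag or manifest file.
sub check_tag_bytes {
    my ($bytes) = @_;
    my @problems;

    # A UTF-8 byte order mark is the byte sequence EF BB BF at the start.
    push @problems, 'BOM found' if $bytes =~ /\A\xEF\xBB\xBF/;

    # Carriage returns (as in CRLF line endings) are rejected.
    push @problems, 'carriage return found' if $bytes =~ /\r/;

    # Strict decoding dies on byte sequences that are not valid UTF-8.
    my $copy = $bytes;    # decode() may modify its input when a check is set
    my $ok = eval { decode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 };
    push @problems, 'not valid UTF-8' unless $ok;

    return @problems;
}

my @clean = check_tag_bytes("BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n");
my @crlf  = check_tag_bytes("BagIt-Version: 1.0\r\n");
print scalar(@clean), " problems in clean file\n";
print "CRLF file: @crlf\n";
```

Working on raw bytes rather than decoded text keeps the BOM test independent of how the file was opened.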

At the moment, only Linux-style file paths are supported.

For a more detailed overview, see the test suite in `t/verify_bag.t` and the corresponding test bags from the Library of Congress BagIt conformance test suite under `bagit_conformance_suite/`.

See [https://datatracker.ietf.org/doc/rfc8493/?include\_text=1](https://datatracker.ietf.org/doc/rfc8493/?include_text=1) for details.

# TODO

- enhanced testsuite
- reduce complexity
- use modern perl code
- add flag to enable very strict verify

# Backward Compatibility

To reduce code complexity in the current module, support for

- parallel processing
- synchronous I/O

has been removed. The existing code is very fast, so there is no performance loss.

In the near future, support for [Archive::BagIt::Fast](https://metacpan.org/pod/Archive%3A%3ABagIt%3A%3AFast) will be removed, because it needs hooks that increase code
complexity in the current module without any performance benefit.

# FAQ

## How to access the manifest-entries directly?

Try this:

    foreach my $algorithm ( keys %{ $self->manifests }) {
        my $entries_ref = $self->manifests->{$algorithm}->manifest_entries();
        # $entries_ref is a hashref that maps each payload path to its digest, e.g.:
        # {
        #     "data/hello.txt" => "e7c22b994c59d9cf2b48e549b1e24666636045930d3da7c1acb299d1c3b7f931f94aae41edda2c2b207a36e10f8bcb8d45223e54878f5b316e7ce3b6bc019629"
        # }
    }

Tag manifests can be accessed in a similar way.
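
As a sketch of what can be done with such a hashref, the following self-contained example recomputes SHA-512 digests with the core `Digest::SHA` module and compares them with the recorded values. The temporary bag directory, the helper `sha512_of_file`, and the hand-built `$entries_ref` are stand-ins chosen for illustration; in real code the hashref would come from `manifest_entries()` as shown above.

```perl
use strict;
use warnings;
use Digest::SHA;
use File::Temp qw(tempdir);
use File::Spec;

# Build a tiny stand-in bag directory so the sketch runs on its own.
my $bag_dir = tempdir( CLEANUP => 1 );
mkdir File::Spec->catdir( $bag_dir, 'data' ) or die "mkdir failed: $!";
my $payload = File::Spec->catfile( $bag_dir, 'data', 'hello.txt' );
open my $fh, '>:raw', $payload or die "cannot write $payload: $!";
print {$fh} "Hello, bag!\n";
close $fh or die "close failed: $!";

# Hypothetical helper: hex SHA-512 digest of a file, read in binary mode.
sub sha512_of_file {
    my ($path) = @_;
    return Digest::SHA->new(512)->addfile( $path, 'b' )->hexdigest;
}

# Stand-in for the hashref returned by manifest_entries():
# relative path => hex digest.
my $entries_ref = { 'data/hello.txt' => sha512_of_file($payload) };

# Compare each recorded digest with a freshly computed one.
my @mismatches;
foreach my $relpath ( sort keys %{$entries_ref} ) {
    my $file = File::Spec->catfile( $bag_dir, split m{/}, $relpath );
    push @mismatches, $relpath
        if sha512_of_file($file) ne lc $entries_ref->{$relpath};
}
print @mismatches ? "mismatch: @mismatches\n" : "all entries match\n";
```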

## How fast is [Archive::BagIt](https://metacpan.org/pod/Archive%3A%3ABagIt)?

I have made great efforts to optimize Archive::BagIt for high throughput. There are two limiting factors:

- Calculation of checksums: switching from the "Digest" module to OpenSSL via [Net::SSLeay](https://metacpan.org/pod/Net%3A%3ASSLeay) gave a significant
   speed increase.
- Loading the files referenced in the manifest files: this was previously done serially with synchronous I/O. With
   the [IO::Async](https://metacpan.org/pod/IO%3A%3AAsync) module the files are loaded asynchronously, and the performance gain is huge.

On my system with 8 cores, an SSD, and a large 9 GB bag with 568 payload files, the results for `verify_bag()` are:

                     processing time          run time             throughput
    Version       user time    system time    total time    total    MB/s
     v0.71        38.31s        1.60s         39.938s       100%     230
     v0.81        25.48s        1.68s         27.1s          67%     340
     v0.82        48.85s        3.89s          6.84s         17%    1346


