App-MtAws
* [Installation via CPAN](#or-installation-via-cpan)
* [Installation general instructions, troubleshooting, edge cases and misc instructions](#installation-general-instructions-troubleshooting-edge-cases-and-misc-instructions)
* [Warnings ( MUST READ )](#warnings--must-read-)
* [Help/contribute this project](#helpcontribute-this-project)
* [Usage](#usage)
* [Restoring journal](#restoring-journal)
* [Journal concept](#journal-concept)
* [Specification for some commands](#specification-for-some-commands)
* [sync](#sync)
* [restore](#restore)
* [restore-completed](#restore-completed)
* [upload-file](#upload-file)
* [retrieve-inventory](#retrieve-inventory)
* [download-inventory](#download-inventory)
* [list-vaults](#list-vaults)
* [other commands](#other-commands)
* [File selection options](#file-selection-options)
* [Additional command line options](#additional-command-line-options)
* [Configuring Character Encodings](#configuring-character-encodings)
* [Limitations](#limitations)
* [See also](#see-also)
* [Minimum Amazon Glacier permissions](#minimum-amazon-glacier-permissions)
## Features
* Does not use any existing Amazon Glacier library, so can be flexible in implementing advanced features
* Amazon Glacier Multipart upload
* Multi-segment download (using HTTP Range header)
* Multithreaded upload/download
* Multipart+Multithreaded download/upload
* Multithreaded archive retrieval, deletion and download
* TreeHash validation while downloading
* Tracking of all uploaded files with a local journal file (opened for write in append mode only)
* Checking integrity of local files using journal
* Ability to limit number of archives to retrieve
* File selection options for all commands (using flexible rules with wildcard support)
* Full synchronization to Amazon Glacier - new files are uploaded, modified files can be replaced, deletions can be propagated
* File name and modification times are stored as Glacier metadata ([metadata format for developers][mt-aws-glacier Amazon Glacier meta-data format specification])
* Ability to re-create journal file from Amazon Glacier metadata
* Full UTF-8 support (and full single-byte encoding support for *BSD systems)
* Multipart/multithreaded upload from STDIN
* User selectable HTTPS support. Currently defaults to plaintext HTTP
* Vault creation and deletion
* STS/IAM security tokens support
[mt-aws-glacier Amazon Glacier meta-data format specification]:https://github.com/vsespb/mt-aws-glacier/blob/master/lib/App/MtAws/MetaData.pm
## Important bugs/missing features
* Only multipart upload implemented, no plain upload
* Mac OS X filesystem treated as case-sensitive
## Production readiness
* **One year** after the first public version was released, beta testing was finished and version 1.xxx was released. Current project status is **non-beta**, **stable**.
## Installation/System requirements
Script is made for Unix OS. Tested under Linux. Should work under other POSIX OSes (*BSD, Solaris). Lightly tested under Mac OS X.
Will NOT work under Windows/Cygwin. Minimum Perl version required is 5.8.8 (pretty old, AFAIK there are no supported distributions with older Perls)
### Installation via OS package manager
NOTE: If you've used manual installation before, please remove previously installed `mtglacier` executable from your path.
NOTE: If you've used CPAN installation before, please remove the previously installed module ([cpanm] is capable of doing that)
##### Ubuntu 12.04+
Can be installed/updated via PPA [vsespb/mt-aws-glacier](https://launchpad.net/~vsespb/+archive/mt-aws-glacier):
1. `sudo apt-get update`
2. `sudo apt-get install software-properties-common python-software-properties`
3. `sudo add-apt-repository ppa:vsespb/mt-aws-glacier`
(GPG key id/fingerprint would be **D2BFA5E4** and **D7F1BC2238569FC447A8D8249E86E8B2D2BFA5E4**)
4. `sudo apt-get update`
5. `sudo apt-get install libapp-mtaws-perl`
##### Debian 6 (Squeeze)
Can be installed/updated via custom repository
1. `wget -O - http://mt-aws.com/vsespb.gpg.key | sudo apt-key add -`
(this will add GPG key 2C00 B003 A56C 5F2A 75C4 4BF8 2A6E 0307 **D0FF 5699**)
2. Add repository: `echo "deb http://dl.mt-aws.com/debian/current squeeze main"|sudo tee /etc/apt/sources.list.d/mt-aws.list`
3. `sudo apt-get update`
4. `sudo apt-get install libapp-mtaws-perl`
[Amazon Glacier metadata format used by mt-aws glacier]:https://github.com/vsespb/mt-aws-glacier/blob/master/lib/App/MtAws/MetaData.pm
## Journal concept
#### What is Journal
The Journal is a file in the local filesystem which contains a list of all files uploaded to Amazon Glacier.
Strictly speaking, this file contains a list of operations (a list of records) performed on an Amazon Glacier vault. The main operations are:
file creation, file deletion and file retrieval.
Create operation records contain: *local filename* (relative to transfer root - `--dir`), file *size*, file last *modification time* (in 1 second resolution), file *TreeHash* (Amazon
hashing algorithm, based on SHA256), file upload time, and Amazon Glacier *archive id*.
Delete operation records contain the *local filename* and the corresponding Amazon Glacier *archive id*.
Having such a list of operations, we can at any time reconstruct the list of files that are currently stored in Amazon Glacier.
As you can see, Journal records don't contain Amazon Glacier *region*, *vault*, file permissions, last access times or other filesystem metadata.
Thus you should always use a separate Journal file for each Amazon Glacier *vault*. Also, file metadata (except filename and file *modification time*) will
be lost if you restore files from Amazon Glacier.
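
As an illustration of how the list of stored files can be reconstructed from such records, here is a minimal Python sketch. The `Record` type and the `replay` function are hypothetical names used only for this example. They work on already-parsed records and do not describe `mtglacier`'s on-disk journal format or its internal code.

```python
from typing import Dict, Iterable, NamedTuple


class Record(NamedTuple):
    """Hypothetical, already-parsed journal record (not the on-disk format)."""
    op: str          # "CREATED" or "DELETED"; other operations are ignored below
    filename: str    # local filename, relative to the transfer root
    archive_id: str  # Amazon Glacier archive id


def replay(records: Iterable[Record]) -> Dict[str, str]:
    """Replay operations in order and return {archive_id: filename} still stored."""
    stored: Dict[str, str] = {}
    for rec in records:
        if rec.op == "CREATED":
            stored[rec.archive_id] = rec.filename
        elif rec.op == "DELETED":
            # Deleting an unknown archive id is silently ignored in this sketch.
            stored.pop(rec.archive_id, None)
    return stored


records = [
    Record("CREATED", "photos/a.jpg", "id-1"),
    Record("CREATED", "photos/b.jpg", "id-2"),
    Record("DELETED", "photos/a.jpg", "id-1"),
]
print(replay(records))  # {'id-2': 'photos/b.jpg'}
```

The result is keyed by archive id rather than by filename, because the same filename can appear in more than one create record (for example when a modified file is uploaded again).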
#### Some Journal features
* It's a text file. You can parse it with `grep`, `awk`, `cut`, `tail` etc. to extract information in case you need to perform some advanced stuff that `mtglacier` can't do (NOTE: make sure you know what you're doing).

  To view only some files:

      grep Majorca Photos.journal

  To count creation records:

      grep CREATED Photos.journal | wc -l

  To compare only the important fields of two journals:

      cut -f 4,5,6,7,8 journal | sort > journal.cut
      cut -f 4,5,6,7,8 new-journal | sort > new-journal.cut
      diff journal.cut new-journal.cut

* Each text line in the file represents one record
* It's an append-only file. The file is opened in append-only mode, and new records are only added to the end. This guarantees that
you can recover the Journal file to a previous state in case of a bug in the program, a crash, or power/filesystem issues. You can even use `chattr +a` to set append-only protection on the Journal.
* As the Journal file is append-only, it's easy to perform incremental backups of it
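
Because records are only ever appended, an incremental backup only needs to copy the bytes added since the previous backup. The following Python sketch illustrates the idea. The function `append_new_records` and the file names are hypothetical and not part of `mtglacier`, and whole files are read into memory for brevity.

```python
import os
import tempfile


def append_new_records(journal_path, backup_path):
    """Append to the backup only the bytes added to the journal since last time.

    Returns the number of bytes copied. Raises ValueError if the existing
    backup is not a prefix of the journal (the journal was not append-only).
    """
    backup_size = os.path.getsize(backup_path) if os.path.exists(backup_path) else 0
    with open(journal_path, "rb") as src:
        if backup_size:
            with open(backup_path, "rb") as old:
                # Sanity check: the old backup must match the start of the journal.
                if src.read(backup_size) != old.read():
                    raise ValueError("backup is not a prefix of the journal")
        tail = src.read()  # everything after the already backed-up part
    with open(backup_path, "ab") as dst:
        dst.write(tail)
    return len(tail)


with tempfile.TemporaryDirectory() as tmp:
    journal = os.path.join(tmp, "Photos.journal")
    backup = os.path.join(tmp, "Photos.journal.bak")
    with open(journal, "wb") as f:
        f.write(b"record 1\n")
    first = append_new_records(journal, backup)   # copies the whole journal
    with open(journal, "ab") as f:
        f.write(b"record 2\n")
    second = append_new_records(journal, backup)  # copies only the new line
    print(first, second)  # 9 9
```

The prefix check is what makes the append-only property useful here: if it fails, the journal was rewritten rather than extended, and the older backup should be kept for inspection.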
#### Why is the Journal a file in the local filesystem, and not in online Cloud storage (like Amazon S3 or Amazon DynamoDB)?
The Journal is needed to restore a backup, and we can expect that if you need to restore a backup, you have lost your filesystem together with the Journal.
However, the Journal is also needed to perform *new backups* (`sync` command), to determine which files are already in Glacier and which are not, and to check local file integrity (`check-local-hash` command).
In practice, you usually perform new backups every day, and you restore backups (and lose your filesystem) very rarely.
So a fast (local) journal is essential to perform new backups quickly and cheaply (important for users who back up thousands or millions of files).
And if you lose your journal, you can restore it from Amazon Glacier (see `retrieve-inventory` command). It's also recommended to back up your journal
to another backup system (Amazon S3? Dropbox?) with another tool, because retrieving inventory from Amazon Glacier is pretty slow.
Also, some users might want to back up the *same* files from *multiple* different locations. They will need a *synchronization* solution for journal files.
Anyway, I think the problem of putting Journals into the cloud can be automated and solved with a 3-line bash script.
#### How to maintain a relation between my journal files and my vaults?
1. You can give the journal the same name as your vault. Example: Vault name is `Photos`. Journal file name is `Photos.journal`. Or `eu-west-1-Photos.journal`
2. (Almost) any command line option can be used in a config file, so you can create `myphotos.cfg` with the following content:

        key=YOURKEY
        secret=YOURSECRET
        protocol=http
        region=us-east-1
        vault=Photos
        journal=/home/me/.glacier/photos.journal

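
Since each vault should have its own journal, it can be useful to check a set of config files for a journal that is accidentally shared. The Python sketch below is one hypothetical way to do that. `parse_config` is a simplified `key=value` reader written for this example (it is not `mtglacier`'s own config parser), and the second config is invented for the demonstration.

```python
def parse_config(text):
    """Parse simple key=value lines into a dict (blank lines are skipped)."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected key=value, got: %r" % line)
        options[key.strip()] = value.strip()
    return options


def shared_journals(configs):
    """Return journals that are referenced by more than one region/vault pair."""
    users = {}
    for text in configs:
        opts = parse_config(text)
        users.setdefault(opts["journal"], set()).add((opts["region"], opts["vault"]))
    return {journal: sorted(pairs) for journal, pairs in users.items() if len(pairs) > 1}


photos = """
key=YOURKEY
secret=YOURSECRET
protocol=http
region=us-east-1
vault=Photos
journal=/home/me/.glacier/photos.journal
"""
# Hypothetical second config that mistakenly reuses the same journal file.
videos = photos.replace("vault=Photos", "vault=Videos")

print(shared_journals([photos]))  # {}
print(shared_journals([photos, videos]))
```

The second call reports `/home/me/.glacier/photos.journal` together with both region/vault pairs that point at it.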
#### Why does the Journal not contain region/vault information?
Keeping journal/vault in the config looks to me more like the Unix way. It can be a bit dangerous, but it is easier to maintain, because:
1. Let's imagine I decided to put region/vault into the Journal. There are two options:
    a. Put it at the beginning of the file, before journal creation.
    b. Store the same region/vault in each record of the file. It looks like a waste of disk space.

    Option (a) looks better. So this way the journal will contain something like

        region=us-east-1
        vault=Photos

    at the beginning. But the same can be achieved by putting the same lines into the config file (see previous question)
2. Also, putting vault/region into the journal would make the command line options `--vault` and `--region` useless
for general commands and would require adding another command (something like `create-journal-file`)
3. It is possible to use a different *account id* in Amazon Glacier (i.e. a different person's account). It's not supported yet in `mtglacier`,
but when it is, I'll have to store the *account id* together with *region*/*vault*. Also, the default *account id* is '-' (means 'my account'). If one wishes to use the same
vault from a different Amazon Glacier account, he'll have to change '-' to the real account id. So there needs to be a way to edit the *account id*.
And *region/vault* information does not make sense without an account.
4. Some users can have different permissions for different vaults, so they need to maintain the `key`/`secret`/`account_id`, `region/vault` and `journal` relation in the same place
(this can only be a config file, because it involves `secret`)
5. Amazon might allow renaming of vaults or moving it across regions, in the future.
6. Currently journal consists of independent records, so can be split to separate records using `grep`, or several
journals can be merged using `cat` (but be careful if doing that)
7. In the future, other features and options may be added, such as compression/encryption, which might require deciding again where to put the new attributes for them.
8. Usually there are different policies for backing up config files and journal files (modifiable). So if you lose your journal file, you won't be sure which config corresponds to which *vault* (and the journal file
can be restored from a *vault*)
9. It's better to keep the relation between *vault* and transfer root (`--dir` option) in one place, such as the config file.
#### Why does the Journal (and the metadata stored in Amazon Glacier) not contain file metadata (like permissions)?
If you want to store permissions, put your files into archives before backing up to Amazon Glacier. There are lots of different possible things to store as file metadata,
most of them are not portable. Take a look at archive file formats - different formats allow storing different metadata.
## File selection options
`filter`, `include`, `exclude` options allow you to construct a list of RULES to select only certain files for the operation.
Can be used with commands: `sync`, `purge-vault`, `restore`, `restore-completed` and `check-local-hash`
+ **--filter**
Adds one or several RULES to the list of rules. One filter value can contain multiple rules; this has the same effect as multiple filter values with one
RULE each.

      --filter 'RULE1 RULE2' --filter 'RULE3'

  is the same as

      --filter 'RULE1 RULE2 RULE3'

  RULES should be a sequence of PATTERNS, each preceded by '+' or '-' and separated by spaces. There can be a space between '+'/'-' and PATTERN.

      RULES: [+-]PATTERN [+-]PATTERN ...

  '+' means INCLUDE PATTERN, '-' means EXCLUDE PATTERN
NOTES:
1. If RULES contain spaces or wildcards, you must quote them when running `mtglacier` from the shell (Example: `mtglacier ... --filter -tmp/` but `mtglacier --filter '-log/ -tmp/'`)
2. Although a PATTERN can contain spaces, you cannot use them here, because RULES are separated by space(s).
3. PATTERN can be empty (Example: `--filter +data/ --filter -` - excludes everything except any directory with name `data`, last pattern is empty)
4. Unlike other options, `filter`, `include` and `exclude` cannot be used in config file (in order to avoid mess with order of rules)
+ **--include**
Adds an INCLUDE PATTERN to list of rules (Example: `--include /data/ --filter '+/photos/ -'` - include only photos and data directories)
+ **--exclude**
Adds an EXCLUDE PATTERN to list of rules (Example: `--exclude /data/` - include everything except /data and subdirectories)
NOTES:
1. You can use spaces in PATTERNS here (Example: `--exclude '/my documents/'` - include everything except "/my documents" and subdirectories)
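
To make the relation between the three options concrete, here is a small Python sketch that turns a sequence of `filter`/`include`/`exclude` values into one ordered list of rules. The function names are hypothetical and this is not `mtglacier`'s parser. For simplicity it handles only the form with no space between '+'/'-' and PATTERN.

```python
def parse_filter(value):
    """Split one --filter value into (sign, pattern) rules.

    Assumption of this sketch: each whitespace-separated token starts with
    '+' or '-'; the rest of the token (possibly empty) is the pattern.
    """
    rules = []
    for token in value.split():
        if token[0] not in "+-":
            raise ValueError("rule must start with '+' or '-': %r" % token)
        rules.append((token[0], token[1:]))
    return rules


def build_rules(options):
    """Combine (option_name, value) pairs, in command line order, into rules."""
    rules = []
    for name, value in options:
        if name == "filter":
            rules.extend(parse_filter(value))
        elif name == "include":
            rules.append(("+", value))  # the whole value is one pattern
        elif name == "exclude":
            rules.append(("-", value))  # spaces inside the value are kept
        else:
            raise ValueError("unknown option: %r" % name)
    return rules


# --include /data/ --filter '+/photos/ -'
print(build_rules([("include", "/data/"), ("filter", "+/photos/ -")]))
# [('+', '/data/'), ('+', '/photos/'), ('-', '')]

# --exclude '/my documents/'
print(build_rules([("exclude", "/my documents/")]))
# [('-', '/my documents/')]
```

Keeping everything in a single ordered list matters because the order of rules is significant when they are evaluated.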
+ **How PATTERNS work**
+ 1) If the pattern starts with a '/' then it is anchored to a particular spot in the hierarchy of files, otherwise it is matched against the final
component of the filename.
`/tmp/myfile` - matches only `/tmp/myfile`. But `tmp/myfile` - matches `/tmp/myfile` and `/home/john/tmp/myfile`
+ 2) If the pattern ends with a '/' then it will only match a directory and all files/subdirectories inside this directory. It won't match a regular file.
Note that if a directory is empty, it won't be synchronized to Amazon Glacier, as Glacier does not support directories
`log/` - matches only directory `log`, but not a file `log`
+ 3) If the pattern does not end with a '/', it won't match a directory (directories are not supported by Amazon Glacier, so it makes no sense to match a directory
without subdirectories). However if, in future versions, we find a way to store empty directories in Amazon Glacier, this behavior may change.
`log` - matches only file `log`, but not a directory `log` nor files inside it
+ 4) if the pattern contains a '/' (not counting a trailing '/') then it is matched against the full pathname, including any leading directories.
Otherwise it is matched only against the final component of the filename.
`myfile` - matches `myfile` in any directory (i.e. matches both `/home/ivan/myfile` and `/data/tmp/myfile`), but it does not match
`/tmp/myfile/myfile1`. While `tmp/myfile` matches `/data/tmp/myfile` and `/tmp/myfile/myfile1`
+ 5) Wildcard '*' matches zero or more characters, but it stops at slashes.
`/tmp*/file` matches `/tmp/file`, `/tmp1/file`, `/tmp2/file` but not `/tmp1/x/file`
+ 6) Wildcard '**' matches anything, including slashes.
`/tmp**/file` matches `/tmp/file`, `/tmp1/file`, `/tmp2/file`, `/tmp1/x/file` and `/tmp1/x/y/z/file`
+ 7) When wildcard '**' is a separate path component (i.e. surrounded by slashes/beginning of line/end of line), it matches 0 or more subdirectories.
`/foo/**/bar` matches `/foo/bar` and `/foo/x/bar`. Also `**/file` matches `/file` and `x/file`
+ 8) Wildcard '?' matches any (exactly one) character except a slash ('/').
`??.txt` matches `11.txt`, `xy.txt` but not `abc.txt`
+ 9) if PATTERN is empty, it matches anything.
`mtglacier ... --filter '+data/ -'` - Last pattern is empty string (followed by '-')
+ 10) If PATTERN starts with '!', it only matches when the rest of the pattern (i.e. without '!') does not match.
`mtglacier ... --filter '-!/data/ +*.gz -'` - include only `*.gz` files inside `data/` directory.
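
The wildcard rules (5), (6) and (8) can be illustrated by translating a pattern into a regular expression. The Python sketch below is a simplified model written for this README, not the code `mtglacier` uses. It deliberately ignores anchoring, trailing slashes, the special case of '**' as a separate path component, and '!', and simply matches the whole string it is given. Paths are written with a leading slash here.

```python
import re


def wildcard_to_regex(pattern):
    """Translate '*', '**' and '?' into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # '**' matches anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # '*' matches zero or more non-slash characters
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # '?' matches exactly one non-slash character
            i += 1
        else:
            parts.append(re.escape(pattern[i]))  # everything else is literal
            i += 1
    return re.compile("".join(parts), re.DOTALL)


def matches(pattern, text):
    """Return True if the whole text matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(text) is not None


print(matches("/tmp*/file", "/tmp1/file"))         # True
print(matches("/tmp*/file", "/tmp1/x/file"))       # False
print(matches("/tmp**/file", "/tmp1/x/y/z/file"))  # True
print(matches("??.txt", "abc.txt"))                # False
```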
+ **How rules are processed**
+ 1) File's relative filename (relative to `--dir` root) is checked against the rules in the list. Once the filename matches a PATTERN, the file is included or excluded depending on the kind of PATTERN matched.
No other rules are checked after the first match.
`--filter '+*.txt -file.txt'` File `file.txt` is INCLUDED, it matches 1st pattern, so 2nd pattern is ignored
+ 2) If no rules matched - file is included (default rule is INCLUDE rule).
`--filter '+*.jpeg'` File `file.txt` is INCLUDED, as it does not match any rules
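
The "first match wins, otherwise include" behavior from (1) and (2) can be sketched in a few lines of Python. This is a hypothetical illustration, not `mtglacier`'s code: `is_included` is an invented name, and `fnmatch.fnmatchcase` on the final path component stands in for real pattern matching, so it is only suitable for patterns without '/'. Note that `fnmatch` also understands `[...]`, which this README does not describe.

```python
import posixpath
from fnmatch import fnmatchcase


def pattern_matches(pattern, relpath):
    """Stand-in matcher: compare the pattern with the final path component."""
    if pattern == "":
        return True  # an empty pattern matches anything
    return fnmatchcase(posixpath.basename(relpath), pattern)


def is_included(rules, relpath):
    """Apply (sign, pattern) rules in order; the first matching rule decides."""
    for sign, pattern in rules:
        if pattern_matches(pattern, relpath):
            return sign == "+"
    return True  # no rule matched: the default rule is INCLUDE


# --filter '+*.txt -file.txt': the first rule matches, the second is never checked
print(is_included([("+", "*.txt"), ("-", "file.txt")], "file.txt"))  # True
# --filter '+*.jpeg': nothing matches file.txt, so the default applies
print(is_included([("+", "*.jpeg")], "file.txt"))                    # True
# --filter '+*.jpeg -': the trailing empty exclude pattern catches everything else
print(is_included([("+", "*.jpeg"), ("-", "")], "file.txt"))         # False
```
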
+ 3) When we process both local files and the Journal filelist (sync, restore commands), rules are applied to BOTH sides.
+ 4) When traversing the directory tree (in contrast to the behavior of some tools, like _Rsync_), if a directory (and all subdirectories) matches an exclude pattern,
the directory tree is not pruned; traversal goes into the directory. So this will work fine (it will include `/tmp/data/a/b/c`, but exclude all other files in `/tmp/data`):

      --filter '+/tmp/data/a/b/c -/tmp/data/ +'

+ 5) In some cases, to reduce disk IO, directory traversal into an excluded directory can be stopped.
This can only happen when `mtglacier` is absolutely sure that it won't break behavior (4) described above.
Currently it's guaranteed that traversal stops only in the case when: