Compress-Zstd

 view release on metacpan or  search on metacpan

ext/zstd/doc/zstd_compression_format.md  view on Meta::CPAN

In this case, `Window_Descriptor` byte is skipped,
but `Frame_Content_Size` is necessarily present.
As a consequence, the decoder must allocate a memory segment
of size equal or larger than `Frame_Content_Size`.

In order to preserve the decoder from unreasonable memory requirements,
a decoder is allowed to reject a compressed frame
which requests a memory size beyond decoder's authorized range.

For broader compatibility, decoders are recommended to support
memory sizes of at least 8 MB.
This is only a recommendation,
each decoder is free to support higher or lower limits,
depending on local limitations.

__`Unused_bit`__

A decoder compliant with this specification version shall not interpret this bit.
It might be used in any future version,
to signal a property which is transparent to properly decode the frame.
An encoder compliant with this specification version must set this bit to zero.

__`Reserved_bit`__

This bit is reserved for some future feature.
Its value _must be zero_.
A decoder compliant with this specification version must ensure it is not set.
This bit may be used in a future revision,
to signal a feature that must be interpreted to decode the frame correctly.

__`Content_Checksum_flag`__

If this flag is set, a 32-bits `Content_Checksum` will be present at frame's end.
See `Content_Checksum` paragraph.

__`Dictionary_ID_flag`__

This is a 2-bits flag (`= FHD & 3`),
telling if a dictionary ID is provided within the header.
It also specifies the size of this field as `DID_Field_Size`.

|`Flag_Value`    |  0  |  1  |  2  |  3  |
| -------------- | --- | --- | --- | --- |
|`DID_Field_Size`|  0  |  1  |  2  |  4  |

#### `Window_Descriptor`

Provides guarantees on minimum memory buffer required to decompress a frame.
This information is important for decoders to allocate enough memory.

The `Window_Descriptor` byte is optional.
When `Single_Segment_flag` is set, `Window_Descriptor` is not present.
In this case, `Window_Size` is `Frame_Content_Size`,
which can be any value from 0 to 2^64-1 bytes (16 ExaBytes).

| Bit numbers |     7-3    |     2-0    |
| ----------- | ---------- | ---------- |
| Field name  | `Exponent` | `Mantissa` |

The minimum memory buffer size is called `Window_Size`.
It is described by the following formulas :
```
windowLog = 10 + Exponent;
windowBase = 1 << windowLog;
windowAdd = (windowBase / 8) * Mantissa;
Window_Size = windowBase + windowAdd;
```
The minimum `Window_Size` is 1 KB.
The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB.

In general, larger `Window_Size` tend to improve compression ratio,
but at the cost of memory usage.

To properly decode compressed data,
a decoder will need to allocate a buffer of at least `Window_Size` bytes.

In order to preserve decoder from unreasonable memory requirements,
a decoder is allowed to reject a compressed frame
which requests a memory size beyond decoder's authorized range.

For improved interoperability,
it's recommended for decoders to support `Window_Size` of up to 8 MB,
and it's recommended for encoders to not generate frame requiring `Window_Size` larger than 8 MB.
It's merely a recommendation though,
decoders are free to support larger or lower limits,
depending on local limitations.

#### `Dictionary_ID`

This is a variable size field, which contains
the ID of the dictionary required to properly decode the frame.
`Dictionary_ID` field is optional. When it's not present,
it's up to the decoder to know which dictionary to use.

`Dictionary_ID` field size is provided by `DID_Field_Size`.
`DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`.
1 byte can represent an ID 0-255.
2 bytes can represent an ID 0-65535.
4 bytes can represent an ID 0-4294967295.
Format is __little-endian__.

It's allowed to represent a small ID (for example `13`)
with a large 4-bytes dictionary ID, even if it is less efficient.

_Reserved ranges :_
Within private environments, any `Dictionary_ID` can be used.

However, for frames and dictionaries distributed in public space,
`Dictionary_ID` must be attributed carefully.
Rules for public environment are not yet decided,
but the following ranges are reserved for some future registrar :
- low range  : `<= 32767`
- high range : `>= (1 << 31)`

Outside of these ranges, any value of `Dictionary_ID`
which is both `>= 32768` and `< (1<<31)` can be used freely,
even in public environment.



#### `Frame_Content_Size`

ext/zstd/doc/zstd_compression_format.md  view on Meta::CPAN


| `Literals_Length_Code` |         0-15           |
| ---------------------- | ---------------------- |
| length                 | `Literals_Length_Code` |
| `Number_of_Bits`       |          0             |

| `Literals_Length_Code` |  16  |  17  |  18  |  19  |  20  |  21  |  22  |  23  |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`             |  16  |  18  |  20  |  22  |  24  |  28  |  32  |  40  |
| `Number_of_Bits`       |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |

| `Literals_Length_Code` |  24  |  25  |  26  |  27  |  28  |  29  |  30  |  31  |
| ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`             |  48  |  64  |  128 |  256 |  512 | 1024 | 2048 | 4096 |
| `Number_of_Bits`       |   4  |   6  |   7  |   8  |   9  |  10  |  11  |  12  |

| `Literals_Length_Code` |  32  |  33  |  34  |  35  |
| ---------------------- | ---- | ---- | ---- | ---- |
| `Baseline`             | 8192 |16384 |32768 |65536 |
| `Number_of_Bits`       |  13  |  14  |  15  |  16  |


##### Match length codes

Match length codes are values ranging from `0` to `52` included.
They define lengths from 3 to 131074 bytes.
The match length is equal to the decoded `Baseline` plus
the result of reading `Number_of_Bits` bits from the bitstream,
as a __little-endian__ value.

| `Match_Length_Code` |         0-31            |
| ------------------- | ----------------------- |
| value               | `Match_Length_Code` + 3 |
| `Number_of_Bits`    |          0              |

| `Match_Length_Code` |  32  |  33  |  34  |  35  |  36  |  37  |  38  |  39  |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`          |  35  |  37  |  39  |  41  |  43  |  47  |  51  |  59  |
| `Number_of_Bits`    |   1  |   1  |   1  |   1  |   2  |   2  |   3  |   3  |

| `Match_Length_Code` |  40  |  41  |  42  |  43  |  44  |  45  |  46  |  47  |
| ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`          |  67  |  83  |  99  |  131 |  259 |  515 | 1027 | 2051 |
| `Number_of_Bits`    |   4  |   4  |   5  |   7  |   8  |   9  |  10  |  11  |

| `Match_Length_Code` |  48  |  49  |  50  |  51  |  52  |
| ------------------- | ---- | ---- | ---- | ---- | ---- |
| `Baseline`          | 4099 | 8195 |16387 |32771 |65539 |
| `Number_of_Bits`    |  12  |  13  |  14  |  15  |  16  |

##### Offset codes

Offset codes are values ranging from `0` to `N`.

A decoder is free to limit its maximum `N` supported.
Recommendation is to support at least up to `22`.
For information, at the time of this writing.
the reference decoder supports a maximum `N` value of `31`.

An offset code is also the number of additional bits to read in __little-endian__ fashion,
and can be translated into an `Offset_Value` using the following formulas :

```
Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
if (Offset_Value > 3) offset = Offset_Value - 3;
```
It means that maximum `Offset_Value` is `(2^(N+1))-1`
supporting back-reference distances up to `(2^(N+1))-4`,
but is limited by [maximum back-reference distance](#window_descriptor).

`Offset_Value` from 1 to 3 are special : they define "repeat codes".
This is described in more detail in [Repeat Offsets](#repeat-offsets).

#### Decoding Sequences
FSE bitstreams are read in reverse direction than written. In zstd,
the compressor writes bits forward into a block and the decompressor
must read the bitstream _backwards_.

To find the start of the bitstream it is therefore necessary to
know the offset of the last byte of the block which can be found
by counting `Block_Size` bytes after the block header.

After writing the last bit containing information, the compressor
writes a single `1`-bit and then fills the byte with 0-7 `0` bits of
padding. The last byte of the compressed bitstream cannot be `0` for
that reason.

When decompressing, the last byte containing the padding is the first
byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
the first `1`-bit it occurs. Afterwards, the useful part of the bitstream
begins.

FSE decoding requires a 'state' to be carried from symbol to symbol.
For more explanation on FSE decoding, see the [FSE section](#fse).

For sequence decoding, a separate state keeps track of each
literal lengths, offsets, and match lengths symbols.
Some FSE primitives are also used.
For more details on the operation of these primitives, see the [FSE section](#fse).

##### Starting states
The bitstream starts with initial FSE state values,
each using the required number of bits in their respective _accuracy_,
decoded previously from their normalized distribution.

It starts by `Literals_Length_State`,
followed by `Offset_State`,
and finally `Match_Length_State`.

Reminder : always keep in mind that all values are read _backward_,
so the 'start' of the bitstream is at the highest position in memory,
immediately before the last `1`-bit for padding.

After decoding the starting states, a single sequence is decoded
`Number_Of_Sequences` times.
These sequences are decoded in order from first to last.
Since the compressor writes the bitstream in the forward direction,
this means the compressor must encode the sequences starting with the last
one and ending with the first.

##### Decoding a sequence

ext/zstd/doc/zstd_compression_format.md  view on Meta::CPAN

[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/

FSE decoding involves a decoding table which has a power of 2 size, and contain three elements:
`Symbol`, `Num_Bits`, and `Baseline`.
The `log2` of the table size is its `Accuracy_Log`.
An FSE state value represents an index in this table.

To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
The next symbol in the stream is the `Symbol` indicated in the table for that state.
To obtain the next state value,
the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.

[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems

### FSE Table Description
To decode FSE streams, it is necessary to construct the decoding table.
The Zstandard format encodes FSE table descriptions as follows:

An FSE distribution table describes the probabilities of all symbols
from `0` to the last present one (included)
on a normalized scale of `1 << Accuracy_Log` .
Note that there must be two or more symbols with nonzero probability.

It's a bitstream which is read forward, in __little-endian__ fashion.
It's not necessary to know bitstream exact size,
it will be discovered and reported by the decoding process.

The bitstream starts by reporting on which scale it operates.
Let's `low4Bits` designate the lowest 4 bits of the first byte :
`Accuracy_Log = low4bits + 5`.

Then follows each symbol value, from `0` to last present one.
The number of bits used by each field is variable.
It depends on :

- Remaining probabilities + 1 :
  __example__ :
  Presuming an `Accuracy_Log` of 8,
  and presuming 100 probabilities points have already been distributed,
  the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
  Therefore, it must read `log2sup(157) == 8` bits.

- Value decoded : small values use 1 less bit :
  __example__ :
  Presuming values from 0 to 157 (inclusive) are possible,
  255-157 = 98 values are remaining in an 8-bits field.
  They are used this way :
  first 98 values (hence from 0 to 97) use only 7 bits,
  values from 98 to 157 use 8 bits.
  This is achieved through this scheme :

  | Value read | Value decoded | Number of bits used |
  | ---------- | ------------- | ------------------- |
  |   0 -  97  |   0 -  97     |  7                  |
  |  98 - 127  |  98 - 127     |  8                  |
  | 128 - 225  |   0 -  97     |  7                  |
  | 226 - 255  | 128 - 157     |  8                  |

Symbols probabilities are read one by one, in order.

Probability is obtained from Value decoded by following formula :
`Proba = value - 1`

It means value `0` becomes negative probability `-1`.
`-1` is a special probability, which means "less than 1".
Its effect on distribution table is described in the [next section].
For the purpose of calculating total allocated probability points, it counts as one.

[next section]:#from-normalized-distribution-to-decoding-tables

When a symbol has a __probability__ of `zero`,
it is followed by a 2-bits repeat flag.
This repeat flag tells how many probabilities of zeroes follow the current one.
It provides a number ranging from 0 to 3.
If it is a 3, another 2-bits repeat flag follows, and so on.

When last symbol reaches cumulated total of `1 << Accuracy_Log`,
decoding is complete.
If the last symbol makes cumulated total go above `1 << Accuracy_Log`,
distribution is considered corrupted.

Then the decoder can tell how many bytes were used in this process,
and how many symbols are present.
The bitstream consumes a round number of bytes.
Any remaining bit within the last byte is just unused.

#### From normalized distribution to decoding tables

The distribution of normalized probabilities is enough
to create a unique decoding table.

It follows the following build rule :

The table has a size of `Table_Size = 1 << Accuracy_Log`.
Each cell describes the symbol decoded,
and instructions to get the next state.

Symbols are scanned in their natural order for "less than 1" probabilities.
Symbols with this probability are being attributed a single cell,
starting from the end of the table and retreating.
These symbols define a full state reset, reading `Accuracy_Log` bits.

All remaining symbols are allocated in their natural order.
Starting from symbol `0` and table position `0`,
each symbol gets allocated as many cells as its probability.
Cell allocation is spreaded, not linear :
each successor position follow this rule :

```
position += (tableSize>>1) + (tableSize>>3) + 3;
position &= tableSize-1;
```

A position is skipped if already occupied by a "less than 1" probability symbol.
`position` does not reset between symbols, it simply iterates through
each position in the table, switching to the next symbol when enough
states have been allocated to the current one.

The result is a list of state values.
Each state will decode the current symbol.

ext/zstd/doc/zstd_compression_format.md  view on Meta::CPAN

each takes its allocated width from Baseline.

| state order      |   0   |   1   |    2   |   3  |   4   |
| ---------------- | ----- | ----- | ------ | ---- | ----- |
| width            |  32   |  32   |   32   |  16  |  16   |
| `Number_of_Bits` |   5   |   5   |    5   |   4  |   4   |
| range number     |   2   |   4   |    6   |   0  |   1   |
| `Baseline`       |  32   |  64   |   96   |   0  |  16   |
| range            | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |

The next state is determined from current state
by reading the required `Number_of_Bits`, and adding the specified `Baseline`.

See [Appendix A] for the results of this process applied to the default distributions.

[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes


Huffman Coding
--------------
Zstandard Huffman-coded streams are read backwards,
similar to the FSE bitstreams.
Therefore, to find the start of the bitstream, it is therefore to
know the offset of the last byte of the Huffman-coded stream.

After writing the last bit containing information, the compressor
writes a single `1`-bit and then fills the byte with 0-7 `0` bits of
padding. The last byte of the compressed bitstream cannot be `0` for
that reason.

When decompressing, the last byte containing the padding is the first
byte to read. The decompressor needs to skip 0-7 initial `0`-bits and
the first `1`-bit it occurs. Afterwards, the useful part of the bitstream
begins.

The bitstream contains Huffman-coded symbols in __little-endian__ order,
with the codes defined by the method below.

### Huffman Tree Description

Prefix coding represents symbols from an a priori known alphabet
by bit sequences (codewords), one codeword for each symbol,
in a manner such that different symbols may be represented
by bit sequences of different lengths,
but a parser can always parse an encoded string
unambiguously symbol-by-symbol.

Given an alphabet with known symbol frequencies,
the Huffman algorithm allows the construction of an optimal prefix code
using the fewest bits of any possible prefix codes for that alphabet.

Prefix code must not exceed a maximum code length.
More bits improve accuracy but cost more header size,
and require more memory or more complex decoding operations.
This specification limits maximum code length to 11 bits.

#### Representation

All literal values from zero (included) to last present one (excluded)
are represented by `Weight` with values from `0` to `Max_Number_of_Bits`.
Transformation from `Weight` to `Number_of_Bits` follows this formula :
```
Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
```
The last symbol's `Weight` is deduced from previously decoded ones,
by completing to the nearest power of 2.
This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
`Max_Number_of_Bits` must be <= 11,
otherwise the representation is considered corrupted.

__Example__ :
Let's presume the following Huffman tree must be described :

|  literal value   |  0  |  1  |  2  |  3  |  4  |  5  |
| ---------------- | --- | --- | --- | --- | --- | --- |
| `Number_of_Bits` |  1  |  2  |  3  |  0  |  4  |  4  |

The tree depth is 4, since its longest elements uses 4 bits
(longest elements are the one with smallest frequency).
Value `5` will not be listed, as it can be determined from values for 0-4,
nor will values above `5` as they are all 0.
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Weight formula is :
```
Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0
```
It gives the following series of weights :

| literal value |  0  |  1  |  2  |  3  |  4  |
| ------------- | --- | --- | --- | --- | --- |
|   `Weight`    |  4  |  3  |  2  |  0  |  1  |

The decoder will do the inverse operation :
having collected weights of literal symbols from `0` to `4`,
it knows the last literal, `5`, is present with a non-zero `Weight`.
The `Weight` of `5` can be determined by advancing to the next power of 2.
The sum of `2^(Weight-1)` (excluding 0's) is :
`8 + 4 + 2 + 0 + 1 = 15`.
Nearest larger power of 2 value is 16.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`.

#### Huffman Tree header

This is a single byte value (0-255),
which describes how the series of weights is encoded.

- if `headerByte` < 128 :
  the series of weights is compressed using FSE (see below).
  The length of the FSE-compressed series is equal to `headerByte` (0-127).

- if `headerByte` >= 128 :
  + the series of weights uses a direct representation,
    where each `Weight` is encoded directly as a 4 bits field (0-15).
  + They are encoded forward, 2 weights to a byte,
    first weight taking the top four bits and second one taking the bottom four.
    * e.g. the following operations could be used to read the weights:
      `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc.
  + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes,
    meaning it uses only full bytes even if `Number_of_Weights` is odd.
  + `Number_of_Weights = headerByte - 127`.
    * Note that maximum `Number_of_Weights` is 255-127 = 128,
      therefore, only up to 128 `Weight` can be encoded using direct representation.
    * Since the last non-zero `Weight` is _not_ encoded,
      this scheme is compatible with alphabet sizes of up to 129 symbols,
      hence including literal symbol 128.
    * If any literal symbol > 128 has a non-zero `Weight`,
      direct representation is not possible.
      In such case, it's necessary to use FSE compression.


#### Finite State Entropy (FSE) compression of Huffman weights

In this case, the series of Huffman weights is compressed using FSE compression.
It's a single bitstream with 2 interleaved states,
sharing a single distribution table.

To decode an FSE bitstream, it is necessary to know its compressed size.
Compressed size is provided by `headerByte`.
It's also necessary to know its _maximum possible_ decompressed size,
which is `255`, since literal values span from `0` to `255`,
and last symbol's `Weight` is not represented.

An FSE bitstream starts by a header, describing probabilities distribution.
It will create a Decoding Table.
For a list of Huffman weights, the maximum accuracy log is 6 bits.
For more description see the [FSE header description](#fse-table-description)

The Huffman header compression uses 2 states,
which share the same FSE distribution table.
The first state (`State1`) encodes the even indexed symbols,
and the second (`State2`) encodes the odd indexed symbols.
`State1` is initialized first, and then `State2`, and they take turns
decoding a single symbol and updating their state.
For more details on these FSE operations, see the [FSE section](#fse).

The number of symbols to decode is determined
by tracking bitStream overflow condition:
If updating state after decoding a symbol would require more bits than
remain in the stream, it is assumed that extra bits are 0.  Then,
symbols for each of the final states are decoded and the process is complete.

#### Conversion from weights to Huffman prefix codes

All present symbols shall now have a `Weight` value.
It is possible to transform weights into `Number_of_Bits`, using this formula:
```
Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0
```
Symbols are sorted by `Weight`.
Within same `Weight`, symbols keep natural sequential order.
Symbols with a `Weight` of zero are removed.
Then, starting from lowest `Weight`, prefix codes are distributed in sequential order.

__Example__ :
Let's presume the following list of weights has been decoded :

| Literal  |  0  |  1  |  2  |  3  |  4  |  5  |
| -------- | --- | --- | --- | --- | --- | --- |
| `Weight` |  4  |  3  |  2  |  0  |  1  |  1  |

Sorted by weight and then natural sequential order,
it gives the following distribution :

| Literal          |  3  |  4  |  5  |  2  |  1  |   0  |
| ---------------- | --- | --- | --- | --- | --- | ---- |
| `Weight`         |  0  |  1  |  1  |  2  |  3  |   4  |
| `Number_of_Bits` |  0  |  4  |  4  |  3  |  2  |   1  |
| prefix codes     | N/A | 0000| 0001| 001 | 01  |   1  |

### Huffman-coded Streams

Given a Huffman decoding table,
it's possible to decode a Huffman-coded stream.

Each bitstream must be read _backward_,
that is starting from the end down to the beginning.
Therefore it's necessary to know the size of each bitstream.

It's also necessary to know exactly which _bit_ is the last one.
This is detected by a final bit flag :
the highest bit of latest byte is a final-bit-flag.
Consequently, a last byte of `0` is not possible.
And the final-bit-flag itself is not part of the useful bitstream.
Hence, the last byte contains between 0 and 7 useful bits.

Starting from the end,
it's possible to read the bitstream in a __little-endian__ fashion,
keeping track of already used bits. Since the bitstream is encoded in reverse
order, starting from the end read symbols in forward order.

For example, if the literal sequence "0145" was encoded using above prefix code,
it would be encoded (in reverse order) as:

|Symbol  |   5  |   4  |  1 | 0 | Padding |
|--------|------|------|----|---|---------|
|Encoding|`0000`|`0001`|`01`|`1`| `00001` |

Resulting in following 2-bytes bitstream :
```
00010000 00001101
```

Here is an alternative representation with the symbol codes separated by underscore:
```
0001_0000 00001_1_01



( run in 1.006 second using v1.01-cache-2.11-cpan-39bf76dae61 )