        de                bc
       /  \              /  \
   "0"/    \"1"      "0"/    \"1"
     d      e          b      c
     

Finally, this encoding table would be created:

   a    1
   b    010
   c    011
   d    000
   e    001

Please note that there is no rule defining which element in the tree
goes to the left and which to the right. So it's also possible to get,
e.g., this coding scheme:

   a    0
   b    100
   c    101
   d    110
   e    111
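
As a minimal sketch of why both tables are equally valid, the following builds a code table with a plain Huffman merge loop and lets the caller choose which bit labels the left branch. The counts in C<%count> are hypothetical (chosen only so that the code lengths match the tables above), and C<huffman_codes> is an illustrative helper, not part of this module.

```perl
use strict;
use warnings;

# Hypothetical counts, chosen so that 'a' gets one bit and the rest three.
my %count = (a => 15, b => 7, c => 6, d => 6, e => 5);

# Build a code table; $left_bit labels every left branch,
# the right branch gets the other bit.
sub huffman_codes {
    my ($count, $left_bit) = @_;
    my $right_bit = $left_bit eq '0' ? '1' : '0';

    # Each node carries its weight and the codes built so far for its symbols.
    my @nodes;
    for my $sym (sort keys %$count) {
        push @nodes, { weight => $count->{$sym}, codes => { $sym => '' } };
    }

    while (@nodes > 1) {
        # Merge the two lightest nodes; ties are broken arbitrarily.
        @nodes = sort { $a->{weight} <=> $b->{weight} } @nodes;
        my $left  = shift @nodes;
        my $right = shift @nodes;
        my %codes;
        $codes{$_} = $left_bit  . $left->{codes}{$_}  for keys %{ $left->{codes} };
        $codes{$_} = $right_bit . $right->{codes}{$_} for keys %{ $right->{codes} };
        push @nodes, {
            weight => $left->{weight} + $right->{weight},
            codes  => \%codes,
        };
    }
    return $nodes[0]{codes};
}

my $t0 = huffman_codes(\%count, '0');   # left branches labelled "0"
my $t1 = huffman_codes(\%count, '1');   # left branches labelled "1"
printf "%s  %-4s %-4s\n", $_, $t0->{$_}, $t1->{$_} for sort keys %count;
```

Both tables give C<a> one bit and every other symbol three bits; only the bit labels differ, so both compress equally well.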

=head2 METHODS

=over

=item my $huff = Algorithm::Huffman->new( HASHREF )

Creates a new Huffman table
based on the given occurrence counts of characters.
The keys of the given hashref are the characters/strings;
the values are their occurrence counts.

A hashref is used, as such a hash can become quite large
(e.g. all three letter combinations).

The passed hash must have at least two elements,
as a Huffman code for one or zero elements isn't
very useful for anything.
Even with two elements the result is trivial: one becomes "0" and
the other "1", regardless of their counts.

The counts (given as values in the hash)
must be greater than or equal to zero (negative counts don't make
any sense). If a character/substring has a count of zero,
it is still encoded. This is a feature intended for situations where you
want to encode a large text. You count every character and
the most common substrings in the first part of this large text
(or in a dictionary) to get a good estimate of the
character/substring counts of the whole text. There could be some characters
(e.g. 'ä' in an English text) that didn't occur. To ensure that
the whole text is encodable, you simply set the count of every
character not yet counted to zero. That guarantees that
there is an encoding/decoding bit sequence for each of them.
It also guarantees that these bit sequences are at least as long as
all other encoding/decoding sequences of counted characters/substrings.

The counts needn't be integers;
they can also be fractions (e.g. percentages).
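
The zero-count idea above can be sketched as follows. The sample text and the alphabet are made up for illustration; only the final hashref is what the constructor is described as taking.

```perl
use strict;
use warnings;

# Hypothetical sample: the first part of a much larger text.
my $sample = "this is only the first part of a large text";

# Count every character that actually occurs in the sample.
my %count;
$count{$_}++ for split //, $sample;

# Hypothetical alphabet that the full text might use.
my @alphabet = ('a' .. 'z', ' ', '.', ',');

# Give every character not yet counted a count of zero,
# so that it still receives a code.
$count{$_} //= 0 for @alphabet;

printf "'%s' => %d\n", $_, $count{$_} for sort keys %count;

# The hash could then be handed over as a hashref, e.g.
#   my $huff = Algorithm::Huffman->new(\%count);
```

Fractional weights work the same way: divide each count by the total before passing the hash on.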

=item $huff->encode_hash

Returns a reference to the encoding hash.
The keys of the encoding hash are the characters/strings passed
to the constructor. The values are their bit representations.
Please note that the bit representations are returned as strings
of ones and zeros, not as binary numbers.

=item $huff->decode_hash

Returns a reference to the decoding hash.
The keys of the decoding hash are the bit representations,
while the values are the characters/strings the bitstrings stand for.
Please note that the bit representations are strings
of ones and zeros, not binary numbers.
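
To illustrate what "strings, not binary numbers" means in practice, here is a small sketch that writes the first encoding table from above as plain hashes. It assumes the decoding hash is the exact inverse of the encoding hash, which the descriptions suggest but do not state.

```perl
use strict;
use warnings;

# The first encoding table shown above, written as an encoding hash.
my %encode = (a => '1', b => '010', c => '011', d => '000', e => '001');

# No two symbols share a bitstring, so inverting the hash loses nothing.
my %decode = reverse %encode;

# The values are strings, so leading zeros survive.
my $bits_d = $encode{d};            # the three-character string '000'
my $len_d  = length $bits_d;

# If a number is really needed, convert explicitly,
# e.g. with oct() and a "0b" prefix.
my $num_b = oct("0b$encode{b}");

print "$_ => $decode{$_}\n" for sort keys %decode;
print "d has $len_d bits; b as a number is $num_b\n";
```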

=item $huff->encode_bitstring($string)

Returns a bitstring of '1' and '0' characters
representing the given string,
encoded with the current Huffman tree.

There can be ambiguities,
e.g. if there is both an 'e' and an 'er' in the Huffman tree.
This algorithm is greedy:
the given string is traversed from the beginning,
and in every iteration the longest possible match from the Huffman tree is taken.
In the above example,
that would be 'er' instead of 'e'.

The greedy behaviour isn't guaranteed to remain in future versions.
(E.g., I could imagine looking at the next two (or n) possible encoding
substrings from the Huffman tree
and selecting the one with the shortest encoding bitstring.)
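
The greedy, longest-match-first strategy can be sketched in a few lines. The table below is hypothetical, C<greedy_encode> is an illustrative helper rather than the module's implementation, and dying on text that has no entry is this sketch's own choice.

```perl
use strict;
use warnings;

# Hypothetical prefix-free table containing both 'e' and 'er'.
my %table = (e => '0', r => '10', er => '11');

sub greedy_encode {
    my ($table, $string) = @_;

    # Try longer keys first, so 'er' wins over 'e'.
    my @keys = sort { length($b) <=> length($a) } keys %$table;

    my $bits = '';
    my $pos  = 0;
    while ($pos < length($string)) {
        # First key (longest first) that matches at the current position.
        my ($match) = grep { substr($string, $pos, length($_)) eq $_ } @keys;
        die "no encoding for text at position $pos\n" unless defined $match;
        $bits .= $table->{$match};
        $pos  += length($match);
    }
    return $bits;
}

print greedy_encode(\%table, 'er'),  "\n";   # uses 'er', not 'e' followed by 'r'
print greedy_encode(\%table, 'ere'), "\n";   # 'er' followed by 'e'
print greedy_encode(\%table, 're'),  "\n";   # 'r' followed by 'e'
```

With this table, encoding C<'er'> as two separate symbols would cost three bits (C<'010'>), while the greedy match costs two (C<'11'>).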

=item $huff->encode($string)

Returns the Huffman-encoded packed bit vector of C<$string>.

Please see the description of C<encode_bitstring> for details.
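
The exact layout of the packed bit vector is not described here. Purely as an illustration of what packing a bitstring means, the sketch below uses Perl's built-in C<pack> with the C<B*> template. That the module uses this template, and how it deals with the padding bits, are assumptions.

```perl
use strict;
use warnings;

# A five-bit bitstring, e.g. the output of a hypothetical encoding run.
my $bitstring = '10100';

# Pack the bits into bytes; the last byte is padded with zero bits.
my $packed = pack 'B*', $bitstring;

# Unpacking returns whole bytes, so the padding comes back as well.
my $unpacked = unpack 'B*', $packed;

# The original length is needed to cut the padding off again.
my $restored = substr $unpacked, 0, length $bitstring;

printf "%d byte(s), hex %s, unpacked %s, restored %s\n",
    length $packed, unpack('H*', $packed), $unpacked, $restored;
```

Five characters of bitstring shrink to a single byte, which is the point of a packed representation.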

=item $huff->decode_bitstring($bitstring)

Decodes a bitstring of '1' and '0' characters to the original string.
Although the encoding can be somewhat ambiguous,
the decoding is always unambiguous.

Please take care that only ones and zeroes are in the bitstring.
The method will die otherwise.
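
Decoding is unambiguous because no code is a prefix of another, so the first complete match is always the right one. A minimal sketch, using the first encoding table from above and an illustrative helper C<decode_bits> (not the module's own code), which also treats leftover bits as an error:

```perl
use strict;
use warnings;

# Decoding hash for the first encoding table shown above.
my %decode = ('1' => 'a', '010' => 'b', '011' => 'c', '000' => 'd', '001' => 'e');

sub decode_bits {
    my ($decode, $bits) = @_;
    die "bitstring may only contain 0 and 1\n" if $bits =~ /[^01]/;

    my ($out, $buffer) = ('', '');
    for my $bit (split //, $bits) {
        # Collect bits until they form a complete code.
        $buffer .= $bit;
        if (exists $decode->{$buffer}) {
            $out   .= $decode->{$buffer};
            $buffer = '';
        }
    }
    die "incomplete bitstring: '$buffer' is left over\n" if length $buffer;
    return $out;
}

print decode_bits(\%decode, '1010011'), "\n";   # 1 | 010 | 011

# Leftover bits and foreign characters both raise an exception.
for my $bad ('10', '12') {
    eval { decode_bits(\%decode, $bad); 1 } or print "error: $@";
}
```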

It will also die if the bitstring isn't complete.
E.g., suppose
you had a Huffman table

  a => 1
  b => 01
  c => 00
  
and wanted to encode 'abc'. The correct encoding is '10100'.