Algorithm-Huffman
must be greater than or equal to zero. (Negative counts don't make
any sense.) If a character/substring has a count of zero,
it is still encoded. This is a feature intended for the situation where you
want to encode a large text. You count every character and the
most common substrings in the first part of this large text
(or from a dictionary) to get a good estimate of the
character/substring counts for the whole text. There could be some characters
(e.g. 'ä' in an English text) that didn't occur. To ensure that
the whole text is encodable, you simply set the count of every
character not yet counted to zero. That guarantees that
there are encoding/decoding bit sequences for them.
It also guarantees that these bit sequences are at least as long as
all other encoding/decoding sequences of counted characters/substrings.

The counts needn't be integers;
they can also be fractions (e.g. percentages).
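
The zero-count padding described above can be sketched in plain Perl.
The sample text and the alphabet (C<'a' .. 'z'> plus the space character) are
assumptions chosen purely for illustration; the resulting hash is the kind of
character/count table this section talks about.

```perl
use strict;
use warnings;

# Hypothetical sample: only the first part of a larger text is counted.
my $sample = "the quick brown fox";

# Count every character that actually occurs in the sample.
my %count;
$count{$_}++ for split //, $sample;

# Assumed alphabet that the whole text may draw from.
my @alphabet = ('a' .. 'z', ' ');

# Give every character not yet counted a count of zero,
# so that it still receives a bit sequence later on.
$count{$_} //= 0 for @alphabet;

print "'$_' => $count{$_}\n" for sort keys %count;
```

Characters that were seen keep their real counts; everything else in the
assumed alphabet ends up with a count of zero rather than being missing.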

=item $huff->encode_hash

Returns a reference to the encoding hash.
The keys of the encoding hash are the characters/strings passed
to the constructor. The values are their bit representations.
Please note that the bit representations are returned as strings
of ones and zeros, not as binary numbers.

=item $huff->decode_hash

Returns a reference to the decoding hash.
The keys of the decoding hash are the bit representations,
while the values are the characters/strings the bitstrings stand for.
Please note that the bit representations are returned as strings
of ones and zeros, not as binary numbers.

=item $huff->encode_bitstring($string)

Returns a bitstring of '1' and '0'
representing an encoded version of the given string
(using the current Huffman tree).

There can be ambiguities,
e.g. if there is an 'e' and an 'er' in the Huffman tree.
This algorithm is greedy:
the given string is traversed from the beginning,
and in every iteration the longest matching substring from the Huffman tree is taken.
In the above example,
that would be 'er' instead of 'e'.

The greedy approach isn't guaranteed to remain in future versions.
(E.g., I could imagine looking at the next two (or n) possible encoding
substrings from the Huffman tree
and selecting the one with the shortest encoding bitstring.)
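
The greedy, longest-match behaviour described above can be illustrated with a
small standalone sketch. It is not the module's implementation: the table
C<%encode> and the function name C<greedy_encode_bitstring> are hypothetical
and exist only to show how 'er' wins over 'e'.

```perl
use strict;
use warnings;

# Hypothetical prefix-free encoding table containing both 'e' and 'er'.
my %encode = (
    e  => '0',
    r  => '10',
    er => '110',
    t  => '111',
);

# Greedy encoding: at each position take the longest key that matches.
sub greedy_encode_bitstring {
    my ($string, $table) = @_;
    # Longest keys first, so 'er' is tried before 'e'.
    my @keys = sort { length($b) <=> length($a) } keys %$table;
    my $bits = '';
    my $pos  = 0;
  POSITION:
    while ($pos < length $string) {
        for my $key (@keys) {
            if (substr($string, $pos, length $key) eq $key) {
                $bits .= $table->{$key};
                $pos  += length $key;
                next POSITION;
            }
        }
        die "No encoding for substring starting at index $pos\n";
    }
    return $bits;
}

print greedy_encode_bitstring('tert', \%encode), "\n";
```

With this table, 'tert' is split as 't', 'er', 't'. A non-greedy split into
't', 'e', 'r', 't' would also be a valid encoding, which is the ambiguity
mentioned above.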

=item $huff->encode($string)

Returns the Huffman-encoded packed bit vector of C<$string>.

See the description of C<encode_bitstring> for details.

=item $huff->decode_bitstring($bitstring)

Decodes a bitstring of '1' and '0' to the original string.
Although the encoding can be somewhat ambiguous,
the decoding is always unambiguous.

Please make sure that the bitstring contains only ones and zeros.
The method will die otherwise.

It will also die if the bitstring isn't complete.
E.g., assume
you have a Huffman table

  a => 1
  b => 01
  c => 00
  
and want to encode 'abc'. The correct encoding is '10100'.
But '1010' (the last 0 is missing) will produce the error message
C<Unknown bit sequence starting at index 3 in the bitstring>.
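
The example above can be traced with a minimal standalone decoder. This is a
sketch of prefix-code decoding in general, not the module's code: the function
name C<sketch_decode_bitstring> and the wording of the first error message are
hypothetical, while the table and the second message are taken from the
example.

```perl
use strict;
use warnings;

# The decoding table from the example above.
my %decode = ('1' => 'a', '01' => 'b', '00' => 'c');

# Extend the current bit sequence until it matches a key,
# then emit the value and start over.
sub sketch_decode_bitstring {
    my ($bitstring, $table) = @_;
    die "Only ones and zeros are allowed in the bitstring\n"
        if $bitstring =~ /[^01]/;
    my ($out, $current, $start) = ('', '', 0);
    for my $i (0 .. length($bitstring) - 1) {
        $current .= substr($bitstring, $i, 1);
        if (exists $table->{$current}) {
            $out    .= $table->{$current};
            $current = '';
            $start   = $i + 1;   # index where the next sequence begins
        }
    }
    # Leftover bits mean the bitstring was incomplete.
    die "Unknown bit sequence starting at index $start in the bitstring\n"
        if length $current;
    return $out;
}

print sketch_decode_bitstring('10100', \%decode), "\n";   # complete input
my $ok = eval { sketch_decode_bitstring('1010', \%decode); 1 };
print "died: $@" unless $ok;                               # incomplete input
```

For '1010' the sketch consumes '1' and '01' and is then left with a single
'0' that starts at index 3 and matches no key.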

=item $huff->decode($bitvector)

Decodes a packed bit vector (encoded with the ->encode method).

See the description of C<decode_bitstring> for details.

=back   

=head2 EXPORT

None by default.

=head1 BUGS

If a character/string occurs zero times, it is still encoded.
At the moment, you have to grep such entries out beforehand if you don't want them.
I don't plan to change this,
as it can realistically happen and they would play a role.
(Imagine you encode all three-letter combinations found in some
English texts. You would still have to encode all ASCII characters,
even if they don't occur in the texts you have analyzed.
The reason is that they could occur in other texts, and
you have to guarantee that you can encode every text
without losing any information.)

If you encode your stream part by part,
you might get the idea of doing something like:

  my $encode1 = $huff->encode_bitstring($chapter1);
  my $encode2 = $huff->encode_bitstring($chapter2);
  
  my $total_encode = $encode1 . $encode2;
  
  my $all_chapters = $huff->decode_bitstring($total_encode);
  
  # Now $all_chapters eq $chapter1 . $chapter2
  
That will work fine,
but I'm afraid it won't work
if you replace the C<..code_bitstring> methods
with the C<..code> methods.
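
One plausible reason, stated here as an assumption rather than taken from the
module's source, is that a packed bit vector is padded to a whole number of
bytes, so concatenating two packed vectors leaves padding bits in the middle.
The sketch below uses Perl's built-in C<pack>/C<unpack> with the C<B*>
template only to illustrate the effect; the ->encode method may pack its bits
differently, and the two bitstrings are hypothetical.

```perl
use strict;
use warnings;

# Two hypothetical bitstrings whose lengths are not multiples of 8.
my $bits1 = '10100';
my $bits2 = '011';

# Concatenating the bitstrings themselves keeps the bit sequence intact.
my $joined_bits = $bits1 . $bits2;

# Packing each one separately pads it with zero bits to a full byte.
my $packed1 = pack 'B*', $bits1;
my $packed2 = pack 'B*', $bits2;

# After concatenating the packed forms, the padding sits in the middle.
my $joined_packed = unpack 'B*', $packed1 . $packed2;

print "bitstrings: $joined_bits\n";
print "packed:     $joined_packed\n";
```

The second line of output contains extra zero bits between the two parts,
which a decoder would try to interpret as encoded data.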