Algorithm-RabinKarp
view release on metacpan or search on metacpan
lib/Algorithm/RabinKarp.pm view on Meta::CPAN
=head1 DESCRIPTION
This is an implementation of Rabin and Karp's streaming hash, as
described in "Winnowing: Local Algorithms for Document Fingerprinting" by
Schleimer, Wilkerson, and Aiken. Following the suggestion of Schleimer,
I am using their second equation:
$H[ $c[2..$k + 1] ] = (( $H[ $c[1..$k] ] - $c[1] ** $k ) + $c[$k+1] ) * $k
The results of this hash encodes information about the next k values in
the stream (hense k-gram.) This means for any given stream of length n
integer values (or characters), you will get back n - k + 1 hash
values.
For best results, you will want to create a code generator that filters
your data to remove all unnecessary information. For example, in a large
english document, you should probably remove all white space, as well
as removing all capitalization.
=head1 INTENT
lib/Algorithm/RabinKarp/Util.pm view on Meta::CPAN
=item stream_string ( $scalar )
Iterates across characters in a string.
=cut
sub stream_string {
my $string = shift;
my $pos = 0;
sub {
return if ($pos >= length($string));
my @ret = (ord(substr($string, $pos, 1)), $pos);
$pos++;
return @ret;
};
}
=back
=head1 AUTHOR
#! perl
use strict;
use warnings;
use Test::More;
my $source = lc("A do run run run, a do run run");
$source =~ tr[a-z][]cd; #we don't want anything else.
my $k = 5;
my $kgrams = length($source) - $k + 1;
plan tests => 2 + 3 * $kgrams;
use Algorithm::RabinKarp;
my $kgram = Algorithm::RabinKarp->new($k,$source);
ok my @values = $kgram->values, "We get a kgram hash array";
is @values, $kgrams, "We get length - k + 1 kgram hash values";
my %kgram_seen;
my %source_seen;
#use Data::Dumper; warn Dumper( [
#map { [ $_->[0], substr($source, $_->[1], $_->[2]) ] }
#map { [ $_->[0], $_->[1], $_->[2] - $_->[1] + 1]} @values ]);
for my $i (0..(length($source)-$k)) {
my $fragment = substr($source, $i, $k);
my $occurences = $source_seen{$fragment}++;
my $kgram = shift @{$values[$i]};
is $kgram_seen{$kgram}++, $occurences,
"$fragment has occurred $occurences times.";
my ($start, $end) = @{$values[$i]};
is_deeply [$start, $end], [$i, $i + $k - 1],
"$fragment position correctly recorded";
is substr($source, $start, $end - $start + 1 ), $fragment,
( run in 0.871 second using v1.01-cache-2.11-cpan-65fba6d93b7 )