Algorithm-RabinKarp

 view release on metacpan or  search on metacpan

lib/Algorithm/RabinKarp.pm  view on Meta::CPAN

=head1 DESCRIPTION

This is an implementation of Rabin and Karp's streaming hash, as
described in "Winnowing: Local Algorithms for Document Fingerprinting" by
Schleimer, Wilkerson, and Aiken.  Following the suggestion of Schleimer,
I am using their second equation:

  $H[ $c[2..$k + 1] ] = (( $H[ $c[1..$k] ] - $c[1] ** $k ) + $c[$k+1] ) * $k

The results of this hash encodes information about the next k values in
the stream (hense k-gram.) This means for any given stream of length n
integer values (or characters), you will get back n - k + 1 hash
values.

For best results, you will want to create a code generator that filters
your data to remove all unnecessary information. For example, in a large
english document, you should probably remove all white space, as well
as removing all capitalization.

=head1 INTENT

lib/Algorithm/RabinKarp/Util.pm  view on Meta::CPAN

=item stream_string ( $scalar )

Iterates across characters in a string.

=cut 

sub stream_string {
  my $string = shift;
  my $pos = 0;
  sub {
      return if ($pos >= length($string));
      my @ret = (ord(substr($string, $pos, 1)), $pos);
      $pos++;
      return @ret;
  };
}

=back

=head1 AUTHOR

t/hash.t  view on Meta::CPAN

#! perl

use strict;
use warnings;
use Test::More;


my $source = lc("A do run run run, a do run run");
$source =~ tr[a-z][]cd; #we don't want anything else.
my $k = 5;
my $kgrams =  length($source) - $k + 1;
plan tests => 2 + 3 * $kgrams;

use Algorithm::RabinKarp;

my $kgram = Algorithm::RabinKarp->new($k,$source);
ok my @values = $kgram->values, "We get a kgram hash array";
is @values, $kgrams, "We get length - k + 1 kgram hash values";

my %kgram_seen;
my %source_seen;

#use Data::Dumper; warn Dumper( [
#map { [ $_->[0], substr($source, $_->[1], $_->[2]) ] }
#map { [ $_->[0], $_->[1], $_->[2] - $_->[1] + 1]} @values ]);

for my $i (0..(length($source)-$k)) {
  my $fragment = substr($source, $i, $k);
  my $occurences = $source_seen{$fragment}++;
  my $kgram = shift @{$values[$i]};
  is $kgram_seen{$kgram}++, $occurences, 
    "$fragment has occurred $occurences times.";
    
  my ($start, $end) = @{$values[$i]};
  is_deeply [$start, $end], [$i, $i + $k - 1], 
    "$fragment position correctly recorded";
  is substr($source, $start, $end - $start + 1 ), $fragment,



( run in 0.871 second using v1.01-cache-2.11-cpan-65fba6d93b7 )