Algorithm-RabinKarp
lib/Algorithm/RabinKarp.pm
package Algorithm::RabinKarp;
use warnings;
use strict;
use Algorithm::RabinKarp::Util qw(stream_fh stream_string);
use UNIVERSAL;
use constant BASE => 2;
our $VERSION = "0.41";
=head1 NAME
Algorithm::RabinKarp - Rabin-Karp streaming hash
=head1 SYNOPSIS
my $text = "A do run run run, a do run run";
my $kgram = Algorithm::RabinKarp->new($window, $text);
or
my $kgram2 = Algorithm::RabinKarp->new($window, $fh);
or
my $kgram3 = Algorithm::RabinKarp->new($window, sub {
...
return $num, $position;
});
my ($hash, $start_position, $end_position) = $kgram->next;
my @values = $kgram->values;
my %occurrences; # a dictionary of all k-grams.
while (my $kgram = shift @values) {
  my ($hash, @pos) = @$kgram;
  push @{$occurrences{$hash}}, \@pos;
}
my $needle = Algorithm::RabinKarp->new(6, "needle");
open my $fh, '<', "haystack.txt" or die "haystack.txt: $!";
my $haystack = Algorithm::RabinKarp->new(6, $fh);
my $needle_hash = $needle->next;
while (my ($hay_hash, @pos) = $haystack->next) {
warn "Possible match for 'needle' at @pos"
if $needle_hash == $hay_hash;
}
=head1 DESCRIPTION
This is an implementation of Rabin and Karp's streaming hash, as
described in "Winnowing: Local Algorithms for Document Fingerprinting" by
Schleimer, Wilkerson, and Aiken. Following the suggestion of Schleimer,
I am using their second equation:
$H[ $c[2..$k+1] ] = ( $H[ $c[1..$k] ] - $c[1] * $b**($k-1) ) * $b + $c[$k+1]
where $b is the hash base.
The result of this hash encodes information about the next k values in
the stream (hence "k-gram"). This means that for any given stream of n
integer values (or characters), you will get back n - k + 1 hash
values.
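The rolling update above can be sketched in plain Perl, independent of this module. This is a minimal illustration, not the module's implementation: the base $b here is an arbitrary illustrative choice standing in for the BASE constant.

```perl
use strict;
use warnings;

# Standalone sketch of the rolling hash described above.
sub rolling_hashes {
    my ($k, @c) = @_;
    my $b  = 256;               # illustrative base
    my $bk = $b ** ($k - 1);    # weight of the value leaving the window
    my $h  = 0;
    $h = $h * $b + $c[$_] for 0 .. $k - 1;    # hash of the first k-gram
    my @hashes = ($h);
    for my $i ($k .. $#c) {
        # drop the leading value, shift everything left, append the new one
        $h = ($h - $c[$i - $k] * $bk) * $b + $c[$i];
        push @hashes, $h;
    }
    return @hashes;             # n - k + 1 hashes for n input values
}

my @h = rolling_hashes(3, map { ord } split //, 'abcdef');
# 6 input characters with k = 3 yield 4 hashes
```

Each rolled hash equals the hash you would get by computing the k-gram from scratch, which is what makes the update O(1) per position.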
For best results, you will want to create a code generator that filters
your data to remove all unnecessary information. For example, in a large
English document, you should probably remove all whitespace, as well
as all capitalization.
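One way to build such a filter is as a coderef stream, following the callback interface shown in the SYNOPSIS (return a value and its original position, or an empty list when the input is exhausted). The helper name here is hypothetical, not part of this distribution:

```perl
use strict;
use warnings;

# Hypothetical helper: a stream that skips whitespace and folds case,
# while reporting each value's original position in the text.
sub make_filtered_stream {
    my ($text) = @_;
    my $pos = 0;
    return sub {
        while ($pos < length $text) {
            my $ch = substr $text, $pos, 1;
            my $at = $pos++;
            next if $ch =~ /\s/;          # drop whitespace entirely
            return (ord lc $ch, $at);     # folded value, original position
        }
        return;                           # empty list: end of stream
    };
}

my $stream = make_filtered_stream('A do RUN');
# my $kgram = Algorithm::RabinKarp->new(4, $stream);
```

Reporting the original positions means matches can still be located in the unfiltered document.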
=head1 INTENT
Preprocessing your document with the Rabin-Karp hashing algorithm
makes it possible to create a "fingerprint" of your document (or documents),
and then perform multiple searches for fragments contained within your document
database.
Schleimer, Wilkerson, and Aiken suggest preprocessing to remove
unnecessary information (like whitespace), as well as known redundant information
(like, say, copyright notices or other boilerplate that is 'acceptable').
They also suggest a postprocessing pass to reduce data volume, using a technique
called winnowing (see the link at the end of this documentation).
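The winnowing pass can be sketched as follows. This is a rough illustration after the Schleimer, Wilkerson, and Aiken paper, not something implemented by this module: slide a window of $w consecutive hashes, keep the rightmost minimum of each window, and record each selected position only once.

```perl
use strict;
use warnings;

# Sketch of winnowing: select the rightmost minimal hash per window.
sub winnow {
    my ($w, @hashes) = @_;
    my (@fingerprint, $last);
    for my $i (0 .. @hashes - $w) {
        my $min = $i;
        for my $j ($i + 1 .. $i + $w - 1) {
            $min = $j if $hashes[$j] <= $hashes[$min];  # rightmost minimum
        }
        unless (defined $last && $last == $min) {
            push @fingerprint, [ $hashes[$min], $min ]; # [hash, position]
            $last = $min;
        }
    }
    return @fingerprint;
}

my @fp = winnow(3, 5, 1, 4, 1, 5, 9, 2, 6);
```

Because adjacent windows usually share their minimum, the fingerprint is much smaller than the full hash stream while still guaranteeing that any sufficiently long match is detected.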
=head1 METHODS
=over
=item new($k, [FileHandle|Scalar|Coderef])
Creates a new hash generator. If you provide a callback function, it must
return the next integer value in the stream each time it is called, and an
empty list once the stream is exhausted. Additionally, it may
return the original position of the value in the stream (i.e. you may have been
filtering characters out because they're redundant).
=cut
sub new {
my $class = shift;
my $k = shift;
my $stream = $class->make_stream(shift);
my $rm_k = BASE;
bless {
k => $k % 32,
vals => [],
stream => $stream,
}, ref $class || $class;
}
sub make_stream {
my $class = shift;
my $source = shift;