Data-Startup
view release on metacpan or search on metacpan
t/ExtUtils/SVDmaker/Algorithm/Diff.pm view on Meta::CPAN
Imagine that there are two arrows. Arrow A points to an element of sequence A,
and arrow B points to an element of the sequence B. Initially, the arrows
point to the first elements of the respective sequences. C<traverse_sequences>
will advance the arrows through the sequences one element at a time, calling an
appropriate user-specified callback function before each advance. It
willadvance the arrows in such a way that if there are equal elements C<$A[$i]>
and C<$B[$j]> which are equal and which are part of the LCS, there will be
some moment during the execution of C<traverse_sequences> when arrow A is
pointing to C<$A[$i]> and arrow B is pointing to C<$B[$j]>. When this happens,
C<traverse_sequences> will call the C<MATCH> callback function and then it will
advance both arrows.
Otherwise, one of the arrows is pointing to an element of its sequence that is
not part of the LCS. C<traverse_sequences> will advance that arrow and will
call the C<DISCARD_A> or the C<DISCARD_B> callback, depending on which arrow it
advanced. If both arrows point to elements that are not part of the LCS, then
C<traverse_sequences> will advance one of them and call the appropriate
callback, but it is not specified which it will call.
The arguments to C<traverse_sequences> are the two sequences to traverse, and a
hash which specifies the callback functions, like this:
traverse_sequences( \@seq1, \@seq2,
{ MATCH => $callback_1,
DISCARD_A => $callback_2,
DISCARD_B => $callback_3,
} );
Callbacks for MATCH, DISCARD_A, and DISCARD_B are invoked with at least the
indices of the two arrows as their arguments. They are not expected to return
any values. If a callback is omitted from the table, it is not called.
Callbacks for A_FINISHED and B_FINISHED are invoked with at least the
corresponding index in A or B.
If arrow A reaches the end of its sequence, before arrow B does,
C<traverse_sequences> will call the C<A_FINISHED> callback when it advances
arrow B, if there is such a function; if not it will call C<DISCARD_B> instead.
Similarly if arrow B finishes first. C<traverse_sequences> returns when both
arrows are at the ends of their respective sequences. It returns true on
success and false on failure. At present there is no way to fail.
C<traverse_sequences> may be passed an optional fourth parameter; this is a
CODE reference to a key generation function. See L</KEY GENERATION FUNCTIONS>.
Additional parameters, if any, will be passed to the key generation function.
=head2 C<traverse_balanced>
C<traverse_balanced> is an alternative to C<traverse_sequences>. It
uses a different algorithm to iterate through the entries in the
computed LCS. Instead of sticking to one side and showing element changes
as insertions and deletions only, it will jump back and forth between
the two sequences and report I<changes> occurring as deletions on one
side followed immediatly by an insertion on the other side.
In addition to the
C<DISCARD_A>,
C<DISCARD_B>, and
C<MATCH>
callbacks supported by C<traverse_sequences>, C<traverse_balanced> supports
a C<CHANGE> callback indicating that one element got C<replaced> by another:
traverse_sequences( \@seq1, \@seq2,
{ MATCH => $callback_1,
DISCARD_A => $callback_2,
DISCARD_B => $callback_3,
CHANGE => $callback_4,
} );
If no C<CHANGE> callback is specified, C<traverse_balanced>
will map C<CHANGE> events to C<DISCARD_A> and C<DISCARD_B> actions,
therefore resulting in a similar behaviour as C<traverse_sequences>
with different order of events.
C<traverse_balanced> might be a bit slower than C<traverse_sequences>,
noticable only while processing huge amounts of data.
The C<sdiff> function of this module
is implemented as call to C<traverse_balanced>.
=head1 KEY GENERATION FUNCTIONS
C<diff>, C<LCS>, and C<traverse_sequences> accept an optional last parameter.
This is a CODE reference to a key generating (hashing) function that should
return a string that uniquely identifies a given element. It should be the
case that if two elements are to be considered equal, their keys should be the
same (and the other way around). If no key generation function is provided,
the key will be the element as a string.
By default, comparisons will use "eq" and elements will be turned into keys
using the default stringizing operator '""'.
Where this is important is when you're comparing something other than strings.
If it is the case that you have multiple different objects that should be
considered to be equal, you should supply a key generation function. Otherwise,
you have to make sure that your arrays contain unique references.
For instance, consider this example:
package Person;
sub new
{
my $package = shift;
return bless { name => '', ssn => '', @_ }, $package;
}
sub clone
{
my $old = shift;
my $new = bless { %$old }, ref($old);
}
sub hash
{
return shift()->{'ssn'};
}
my $person1 = Person->new( name => 'Joe', ssn => '123-45-6789' );
my $person2 = Person->new( name => 'Mary', ssn => '123-47-0000' );
t/ExtUtils/SVDmaker/Algorithm/Diff.pm view on Meta::CPAN
# now the end
while ( $aStart <= $aFinish
and $bStart <= $bFinish
and &$compare( $a->[$aFinish], $b->[$bFinish], @_ ) )
{
$matchVector->[ $aFinish-- ] = $bFinish--;
}
# Now compute the equivalence classes of positions of elements
my $bMatches =
_withPositionsOfInInterval( $b, $bStart, $bFinish, $keyGen, @_ );
my $thresh = [];
my $links = [];
my ( $i, $ai, $j, $k );
for ( $i = $aStart ; $i <= $aFinish ; $i++ )
{
$ai = &$keyGen( $a->[$i], @_ );
if ( exists( $bMatches->{$ai} ) )
{
$k = 0;
for $j ( @{ $bMatches->{$ai} } )
{
# optimization: most of the time this will be true
if ( $k and $thresh->[$k] > $j and $thresh->[ $k - 1 ] < $j )
{
$thresh->[$k] = $j;
}
else
{
$k = _replaceNextLargerWith( $thresh, $j, $k );
}
# oddly, it's faster to always test this (CPU cache?).
if ( defined($k) )
{
$links->[$k] =
[ ( $k ? $links->[ $k - 1 ] : undef ), $i, $j ];
}
}
}
}
if (@$thresh)
{
for ( my $link = $links->[$#$thresh] ; $link ; $link = $link->[0] )
{
$matchVector->[ $link->[1] ] = $link->[2];
}
}
return wantarray ? @$matchVector : $matchVector;
}
sub traverse_sequences
{
my $a = shift; # array ref
my $b = shift; # array ref
my $callbacks = shift || {};
my $keyGen = shift;
my $matchCallback = $callbacks->{'MATCH'} || sub { };
my $discardACallback = $callbacks->{'DISCARD_A'} || sub { };
my $finishedACallback = $callbacks->{'A_FINISHED'};
my $discardBCallback = $callbacks->{'DISCARD_B'} || sub { };
my $finishedBCallback = $callbacks->{'B_FINISHED'};
my $matchVector = _longestCommonSubsequence( $a, $b, $keyGen, @_ );
# Process all the lines in @$matchVector
my $lastA = $#$a;
my $lastB = $#$b;
my $bi = 0;
my $ai;
for ( $ai = 0 ; $ai <= $#$matchVector ; $ai++ )
{
my $bLine = $matchVector->[$ai];
if ( defined($bLine) ) # matched
{
&$discardBCallback( $ai, $bi++, @_ ) while $bi < $bLine;
&$matchCallback( $ai, $bi++, @_ );
}
else
{
&$discardACallback( $ai, $bi, @_ );
}
}
# The last entry (if any) processed was a match.
# $ai and $bi point just past the last matching lines in their sequences.
while ( $ai <= $lastA or $bi <= $lastB )
{
# last A?
if ( $ai == $lastA + 1 and $bi <= $lastB )
{
if ( defined($finishedACallback) )
{
&$finishedACallback( $lastA, @_ );
$finishedACallback = undef;
}
else
{
&$discardBCallback( $ai, $bi++, @_ ) while $bi <= $lastB;
}
}
# last B?
if ( $bi == $lastB + 1 and $ai <= $lastA )
{
if ( defined($finishedBCallback) )
{
&$finishedBCallback( $lastB, @_ );
$finishedBCallback = undef;
}
else
{
&$discardACallback( $ai++, $bi, @_ ) while $ai <= $lastA;
}
}
&$discardACallback( $ai++, $bi, @_ ) if $ai <= $lastA;
&$discardBCallback( $ai, $bi++, @_ ) if $bi <= $lastB;
}
return 1;
}
sub traverse_balanced
{
my $a = shift; # array ref
my $b = shift; # array ref
my $callbacks = shift || {};
my $keyGen = shift;
my $matchCallback = $callbacks->{'MATCH'} || sub { };
my $discardACallback = $callbacks->{'DISCARD_A'} || sub { };
my $discardBCallback = $callbacks->{'DISCARD_B'} || sub { };
my $changeCallback = $callbacks->{'CHANGE'};
my $matchVector = _longestCommonSubsequence( $a, $b, $keyGen, @_ );
# Process all the lines in match vector
my $lastA = $#$a;
my $lastB = $#$b;
my $bi = 0;
my $ai = 0;
my $ma = -1;
my $mb;
while (1)
{
# Find next match indices $ma and $mb
do { $ma++ } while ( $ma <= $#$matchVector && !defined $matchVector->[$ma] );
last if $ma > $#$matchVector; # end of matchVector?
$mb = $matchVector->[$ma];
# Proceed with discard a/b or change events until
# next match
while ( $ai < $ma || $bi < $mb )
{
if ( $ai < $ma && $bi < $mb )
{
# Change
if ( defined $changeCallback )
{
&$changeCallback( $ai++, $bi++, @_ );
}
else
{
&$discardACallback( $ai++, $bi, @_ );
&$discardBCallback( $ai, $bi++, @_ );
}
}
elsif ( $ai < $ma )
{
&$discardACallback( $ai++, $bi, @_ );
}
else
{
# $bi < $mb
&$discardBCallback( $ai, $bi++, @_ );
}
}
# Match
&$matchCallback( $ai++, $bi++, @_ );
}
while ( $ai <= $lastA || $bi <= $lastB )
{
if ( $ai <= $lastA && $bi <= $lastB )
{
# Change
( run in 0.798 second using v1.01-cache-2.11-cpan-39bf76dae61 )