Algorithm-Diff
view release on metacpan or search on metacpan
lib/Algorithm/Diff.pm view on Meta::CPAN
# $index = int(( $high + $low ) / 2); # without 'use integer'
$found = $array->[$index];
if ( $aValue == $found )
{
return undef;
}
elsif ( $aValue > $found )
{
$low = $index + 1;
}
else
{
$high = $index - 1;
}
}
# now insertion point is in $low.
$array->[$low] = $aValue; # overwrite next larger
return $low;
}
# This method computes the longest common subsequence in $a and $b.
# Result is array or ref, whose contents is such that
# $a->[ $i ] == $b->[ $result[ $i ] ]
# foreach $i in ( 0 .. $#result ) if $result[ $i ] is defined.
# An additional argument may be passed; this is a hash or key generating
# function that should return a string that uniquely identifies the given
# element. It should be the case that if the key is the same, the elements
# will compare the same. If this parameter is undef or missing, the key
# will be the element as a string.
# By default, comparisons will use "eq" and elements will be turned into keys
# using the default stringizing operator '""'.
# Additional parameters, if any, will be passed to the key generation
# routine.
sub _longestCommonSubsequence
{
my $a = shift; # array ref or hash ref
my $b = shift; # array ref or hash ref
my $counting = shift; # scalar
my $keyGen = shift; # code ref
my $compare; # code ref
if ( ref($a) eq 'HASH' )
{ # prepared hash must be in $b
my $tmp = $b;
$b = $a;
$a = $tmp;
}
# Check for bogus (non-ref) argument values
if ( !ref($a) || !ref($b) )
{
my @callerInfo = caller(1);
die 'error: must pass array or hash references to ' . $callerInfo[3];
}
# set up code refs
# Note that these are optimized.
if ( !defined($keyGen) ) # optimize for strings
{
$keyGen = sub { $_[0] };
$compare = sub { my ( $a, $b ) = @_; $a eq $b };
}
else
{
$compare = sub {
my $a = shift;
my $b = shift;
&$keyGen( $a, @_ ) eq &$keyGen( $b, @_ );
};
}
my ( $aStart, $aFinish, $matchVector ) = ( 0, $#$a, [] );
my ( $prunedCount, $bMatches ) = ( 0, {} );
if ( ref($b) eq 'HASH' ) # was $bMatches prepared for us?
{
$bMatches = $b;
}
else
{
my ( $bStart, $bFinish ) = ( 0, $#$b );
# First we prune off any common elements at the beginning
while ( $aStart <= $aFinish
and $bStart <= $bFinish
and &$compare( $a->[$aStart], $b->[$bStart], @_ ) )
{
$matchVector->[ $aStart++ ] = $bStart++;
$prunedCount++;
}
# now the end
while ( $aStart <= $aFinish
and $bStart <= $bFinish
and &$compare( $a->[$aFinish], $b->[$bFinish], @_ ) )
{
$matchVector->[ $aFinish-- ] = $bFinish--;
$prunedCount++;
}
# Now compute the equivalence classes of positions of elements
$bMatches =
_withPositionsOfInInterval( $b, $bStart, $bFinish, $keyGen, @_ );
}
my $thresh = [];
my $links = [];
my ( $i, $ai, $j, $k );
for ( $i = $aStart ; $i <= $aFinish ; $i++ )
{
$ai = &$keyGen( $a->[$i], @_ );
if ( exists( $bMatches->{$ai} ) )
{
lib/Algorithm/Diff.pm view on Meta::CPAN
@indices = $diff->Range( $seqNum, $base );
C<Range> is like C<Items> except that it returns a list of I<indices> to
the items rather than the items themselves. By default, the index of
the first item (in each sequence) is 0 but this can be changed by
calling the C<Base> method. So, by default, the following two snippets
return the same lists:
@list = $diff->Items(2);
@list = @seq2[ $diff->Range(2) ];
You can also specify the base to use as the second argument. So the
following two snippets I<always> return the same lists:
@list = $diff->Items(1);
@list = @seq1[ $diff->Range(1,0) ];
=item C<Base>
$curBase = $diff->Base();
$oldBase = $diff->Base($newBase);
C<Base> sets and/or returns the current base (usually 0 or 1) that is
used when you request range information. The base defaults to 0 so
that range information is returned as array indices. You can set the
base to 1 if you want to report traditional line numbers instead.
=item C<Min>
$min1 = $diff->Min(1);
$min = $diff->Min( $seqNum, $base );
C<Min> returns the first value that C<Range> would return (given the
same arguments) or returns C<undef> if C<Range> would return an empty
list.
=item C<Max>
C<Max> returns the last value that C<Range> would return or C<undef>.
=item C<Get>
( $n, $x, $r ) = $diff->Get(qw( min1 max1 range1 ));
@values = $diff->Get(qw( 0min2 1max2 range2 same base ));
C<Get> returns one or more scalar values. You pass in a list of the
names of the values you want returned. Each name must match one of the
following regexes:
/^(-?\d+)?(min|max)[12]$/i
/^(range[12]|same|diff|base)$/i
The 1 or 2 after a name says which sequence you want the information
for (and where allowed, it is required). The optional number before
"min" or "max" is the base to use. So the following equalities hold:
$diff->Get('min1') == $diff->Min(1)
$diff->Get('0min2') == $diff->Min(2,0)
Using C<Get> in a scalar context when you've passed in more than one
name is a fatal error (C<die> is called).
=back
=head2 C<prepare>
Given a reference to a list of items, C<prepare> returns a reference
to a hash which can be used when comparing this sequence to other
sequences with C<LCS> or C<LCS_length>.
$prep = prepare( \@seq1 );
for $i ( 0 .. 10_000 )
{
@lcs = LCS( $prep, $seq[$i] );
# do something useful with @lcs
}
C<prepare> may be passed an optional third parameter; this is a CODE
reference to a key generation function. See L</KEY GENERATION
FUNCTIONS>.
$prep = prepare( \@seq1, \&keyGen );
for $i ( 0 .. 10_000 )
{
@lcs = LCS( $seq[$i], $prep, \&keyGen );
# do something useful with @lcs
}
Using C<prepare> provides a performance gain of about 50% when calling LCS
many times compared with not preparing.
=head2 C<diff>
@diffs = diff( \@seq1, \@seq2 );
$diffs_ref = diff( \@seq1, \@seq2 );
C<diff> computes the smallest set of additions and deletions necessary
to turn the first sequence into the second, and returns a description
of these changes. The description is a list of I<hunks>; each hunk
represents a contiguous section of items which should be added,
deleted, or replaced. (Hunks containing unchanged items are not
included.)
The return value of C<diff> is a list of hunks, or, in scalar context, a
reference to such a list. If there are no differences, the list will be
empty.
Here is an example. Calling C<diff> for the following two sequences:
a b c e h j l m n p
b c d e f j k l m r s t
would produce the following list:
(
[ [ '-', 0, 'a' ] ],
[ [ '+', 2, 'd' ] ],
[ [ '-', 4, 'h' ],
[ '+', 4, 'f' ] ],
( run in 1.495 second using v1.01-cache-2.11-cpan-5735350b133 )