Algorithm-Diff

 view release on metacpan or  search on metacpan

Changes  view on Meta::CPAN

1.05 Sun Jun 11 15:17:05 PDT 2000
- Changed version label string.
- Put MJD's PPT diff version into this distribution as diffnew.pl

1.04 Added documentation.

1.03 Working version

1.01 First version by Ned Konz.
- Total re-write to cure problems with memory usage and speed.
  Now takes only a few seconds and less than three megabytes
  to compare two 4000-line files.

- Changed optional callback function reference from being equality
  function to being key (hash) generation function.

0.59 Last version maintained by Mark-Jason Dominus

cdiff.pl  view on Meta::CPAN

    # context after
    my $end1 = $hunk->{"end1"};
    $num_added = ($end1+$context_items > $#f1) ?
                  $#f1 - $end1 :
                  $context_items;
    $hunk->{"end1"} += $num_added;
    $hunk->{"end2"} += $num_added;
}

# Is there an overlap between hunk arg0 and old hunk arg1?
# Note: if end of old hunk is one less than beginning of second, they overlap
sub does_overlap {
    my ($hunk, $oldhunk) = @_;
    return "" unless $oldhunk; # first time through, $oldhunk is empty

    # Do I actually need to test both?
    return ($hunk->{"start1"} - $oldhunk->{"end1"} <= 1 ||
            $hunk->{"start2"} - $oldhunk->{"end2"} <= 1);
}

# Prepend hunk arg1 to hunk arg0

diffnew.pl  view on Meta::CPAN

    # context after
    my $end1 = $hunk->{"end1"};
    $num_added = ($end1+$context_items > $#f1) ?
                  $#f1 - $end1 :
                  $context_items;
    $hunk->{"end1"} += $num_added;
    $hunk->{"end2"} += $num_added;
}

# Is there an overlap between hunk arg0 and old hunk arg1?
# Note: if end of old hunk is one less than beginning of second, they overlap
sub does_overlap {
    my ($hunk, $oldhunk) = @_;
    return "" unless $oldhunk; # first time through, $oldhunk is empty

    # Do I actually need to test both?
    return ($hunk->{"start1"} - $oldhunk->{"end1"} <= 1 ||
            $hunk->{"start2"} - $oldhunk->{"end2"} <= 1);
}

# Prepend hunk arg1 to hunk arg0

htmldiff.pl  view on Meta::CPAN

	{ -style => 'margin-left: 24pt' },
	span( { -style => 'color: red' }, $ARGV[0] ),
	span(" <i>vs.</i> "),
	span( { -style => 'color: blue' }, $ARGV[1] )
  ),
  "\n";

# And compare the arrays
traverse_sequences(
	\@a,    # first sequence
	\@b,    # second sequence
	{
		MATCH     => \&match,     # callback on identical lines
		DISCARD_A => \&only_a,    # callback on A-only
		DISCARD_B => \&only_b,    # callback on B-only
	}
);

print end_html();

lib/Algorithm/Diff.pm  view on Meta::CPAN

    return wantarray ? @$retval : $retval;
}

########################################
my $Root= __PACKAGE__;
package Algorithm::Diff::_impl;
use strict;

sub _Idx()  { 0 } # $me->[_Idx]: Ref to array of hunk indices
            # 1   # $me->[1]: Ref to first sequence
            # 2   # $me->[2]: Ref to second sequence
sub _End()  { 3 } # $me->[_End]: Diff between forward and reverse pos
sub _Same() { 4 } # $me->[_Same]: 1 if pos 1 contains unchanged items
sub _Base() { 5 } # $me->[_Base]: Added to range's min and max
sub _Pos()  { 6 } # $me->[_Pos]: Which hunk is currently selected
sub _Off()  { 7 } # $me->[_Off]: Offset into _Idx for current position
sub _Min() { -2 } # Added to _Off to get min instead of max+1

sub Die
{
    require Carp;

lib/Algorithm/Diff.pm  view on Meta::CPAN

subsequence' method.  In the LCS problem, you have two sequences of
items:

    a b c d f g h j q z

    a b c d e f g i j k r x y z

and you want to find the longest sequence of items that is present in
both original sequences in the same order.  That is, you want to find
a new sequence I<S> which can be obtained from the first sequence by
deleting some items, and from the second sequence by deleting other
items.  You also want I<S> to be as long as possible.  In this case I<S>
is

    a b c d f g j z

From there it's only a small step to get diff-like output:

    e   h i   k   q r x y
    +   - +   +   - + + +

lib/Algorithm/Diff.pm  view on Meta::CPAN

=head2 C<LCS_length>

This is just like C<LCS> except it only returns the length of the
longest common subsequence.  This provides a performance gain of about
9% compared to C<LCS>.

=head2 C<LCSidx>

Like C<LCS> except it returns references to two arrays.  The first array
contains the indices into @seq1 where the LCS items are located.  The
second array contains the indices into @seq2 where the LCS items are located.

Therefore, the following three lists will contain the same values:

    my( $idx1, $idx2 ) = LCSidx( \@seq1, \@seq2 );
    my @list1 = @seq1[ @$idx1 ];
    my @list2 = @seq2[ @$idx2 ];
    my @list3 = LCS( \@seq1, \@seq2 );

=head2 C<new>

    $diff = Algorithm::Diffs->new( \@seq1, \@seq2 );
    $diff = Algorithm::Diffs->new( \@seq1, \@seq2, \%opts );

C<new> computes the smallest set of additions and deletions necessary
to turn the first sequence into the second and compactly records them
in the object.

You use the object to iterate over I<hunks>, where each hunk represents
a contiguous section of items which should be added, deleted, replaced,
or left unchanged.

=over 4

The following summary of all of the methods looks a lot like Perl code
but some of the symbols have different meanings:

    [ ]     Encloses optional arguments
    :       Is followed by the default value for an optional argument
    |       Separates alternate return results

lib/Algorithm/Diff.pm  view on Meta::CPAN

Actually, C<Next> returns the object's new position, which is a number
between 1 and the number of hunks (inclusive), or returns a false value.

=item C<Prev>

C<Prev($N)> is almost identical to C<Next(-$N)>; it moves to the $Nth
previous hunk.  On a 'reset' object, C<Prev()> [and C<Next(-1)>] move
to the last hunk.

The position returned by C<Prev> is relative to the I<end> of the
hunks; -1 for the last hunk, -2 for the second-to-last, etc.

=item C<Reset>

    $diff->Reset();     # Reset the object's position
    $diff->Reset($pos); # Move to the specified hunk
    $diff->Reset(1);    # Move to the first hunk
    $diff->Reset(-1);   # Move to the last hunk

C<Reset> returns the object, so, for example, you could use
C<< $diff->Reset()->Next(-1) >> to get the number of hunks.

lib/Algorithm/Diff.pm  view on Meta::CPAN

object's position.  But C<Copy> takes an optional first argument to set the
new position, so the following three snippets are equivalent:

    $copy = $diff->Copy($pos);

    $copy = $diff->Copy();
    $copy->Reset($pos);

    $copy = $diff->Copy()->Reset($pos);

C<Copy> takes an optional second argument to set the base for
the copy.  If you wish to change the base of the copy but leave
the position the same as in the original, here are two
equivalent ways:

    $copy = $diff->Copy();
    $copy->Base( 0 );

    $copy = $diff->Copy(undef,0);

Here are two equivalent way to get a "reset" copy:

lib/Algorithm/Diff.pm  view on Meta::CPAN


C<Range> is like C<Items> except that it returns a list of I<indices> to
the items rather than the items themselves.  By default, the index of
the first item (in each sequence) is 0 but this can be changed by
calling the C<Base> method.  So, by default, the following two snippets
return the same lists:

    @list = $diff->Items(2);
    @list = @seq2[ $diff->Range(2) ];

You can also specify the base to use as the second argument.  So the
following two snippets I<always> return the same lists:

    @list = $diff->Items(1);
    @list = @seq1[ $diff->Range(1,0) ];

=item C<Base>

    $curBase = $diff->Base();
    $oldBase = $diff->Base($newBase);

lib/Algorithm/Diff.pm  view on Meta::CPAN


Using C<prepare> provides a performance gain of about 50% when calling LCS
many times compared with not preparing.

=head2 C<diff>

    @diffs     = diff( \@seq1, \@seq2 );
    $diffs_ref = diff( \@seq1, \@seq2 );

C<diff> computes the smallest set of additions and deletions necessary
to turn the first sequence into the second, and returns a description
of these changes.  The description is a list of I<hunks>; each hunk
represents a contiguous section of items which should be added,
deleted, or replaced.  (Hunks containing unchanged items are not
included.)

The return value of C<diff> is a list of hunks, or, in scalar context, a
reference to such a list.  If there are no differences, the list will be
empty.

Here is an example.  Calling C<diff> for the following two sequences:

    a b c e h j l m n p

lib/Algorithm/Diff.pm  view on Meta::CPAN

      [ [ '+', 6, 'k' ] ],

      [ [ '-',  8, 'n' ],
        [ '-',  9, 'p' ],
        [ '+',  9, 'r' ],
        [ '+', 10, 's' ],
        [ '+', 11, 't' ] ],
    )

There are five hunks here.  The first hunk says that the C<a> at
position 0 of the first sequence should be deleted (C<->).  The second
hunk says that the C<d> at position 2 of the second sequence should
be inserted (C<+>).  The third hunk says that the C<h> at position 4
of the first sequence should be removed and replaced with the C<f>
from position 4 of the second sequence.  And so on.

C<diff> may be passed an optional third parameter; this is a CODE
reference to a key generation function.  See L</KEY GENERATION
FUNCTIONS>.

Additional parameters, if any, will be passed to the key generation
routine.

=head2 C<sdiff>

lib/Algorithm/Diff.pm  view on Meta::CPAN

So each pair of indices (except the last pair) describes where a hunk
begins (in each sequence).  Since each hunk must end at the item just
before the item that starts the next hunk, the next pair of indices can
be used to determine where the hunk ends.

So, the first 4 entries (0..3) describe the first hunk.  Entries 0 and 1
describe where the first hunk begins (and so are always both 0).
Entries 2 and 3 describe where the next hunk begins, so subtracting 1
from each tells us where the first hunk ends.  That is, the first hunk
contains items C<$diff[0]> through C<$diff[2] - 1> of the first sequence
and contains items C<$diff[1]> through C<$diff[3] - 1> of the second
sequence.

In other words, the first hunk consists of the following two lists of items:

               #  1st pair     2nd pair
               # of indices   of indices
    @list1 = @a[ $cdiff[0] .. $cdiff[2]-1 ];
    @list2 = @b[ $cdiff[1] .. $cdiff[3]-1 ];
               # Hunk start   Hunk end

lib/Algorithm/Diff.pm  view on Meta::CPAN

example is empty (otherwise we'd violate the above convention).  Note
that the first 4 index values in our example are all zero.  Plug these
values into our previous code block and we get:

    @hunk1a = @a[ 0 .. 0-1 ];
    @hunk1b = @b[ 0 .. 0-1 ];

And C<0..-1> returns the empty list.

Move down one pair of indices (2..5) and we get the offset ranges for
the second hunk, which contains changed items.

Since C<@diff[2..5]> contains (0,0,1,0) in our example, the second hunk
consists of these two lists of items:

        @hunk2a = @a[ $cdiff[2] .. $cdiff[4]-1 ];
        @hunk2b = @b[ $cdiff[3] .. $cdiff[5]-1 ];
    # or
        @hunk2a = @a[ 0 .. 1-1 ];
        @hunk2b = @b[ 0 .. 0-1 ];
    # or
        @hunk2a = @a[ 0 .. 0 ];
        @hunk2b = @b[ 0 .. -1 ];



( run in 0.688 second using v1.01-cache-2.11-cpan-39bf76dae61 )