length results from the CPAN

Algorithm-NeedlemanWunsch
view release on metacpan or search on metacpan
lib/Algorithm/NeedlemanWunsch.pm view on Meta::CPAN
__END__

=head1 NAME

Algorithm::NeedlemanWunsch - sequence alignment with configurable scoring

=head1 VERSION

Version 0.03

=cut

=head1 SYNOPSIS

    use Algorithm::NeedlemanWunsch;

    sub score_sub {
        if (!@_) {
	    return -2; # gap penalty
        }

	return ($_[0] eq $_[1]) ? 1 : -1;
    }

    my $matcher = Algorithm::NeedlemanWunsch->new(\&score_sub);
    my $score = $matcher->align(
               \@a,
               \@b,
               {   align     => \&on_align,
                   shift_a => \&on_shift_a,
                   shift_b => \&on_shift_b,
		   select_align => \&on_select_align
               });

=head1 DESCRIPTION

Sequence alignment is a way to find commonalities in two (or more)
similar sequences or strings of some items or characters. Standard
motivating example is the comparison of DNA sequences and their
functional and evolutionary similarities and differences, but the
problem has much wider applicability - for example finding the longest
common subsequence (that is, C<diff>) is a special case of sequence
alignment.

Conceptually, sequence alignment works by scoring all possible
alignments and choosing the alignment with maximal score. For example,
sequences C<a t c t> and C<t g a t> may be aligned

  sequence A: a t c - t
                | |   |
  sequence B: - t g a t

or

  sequence A: - - a t c t
                  | |
  sequence B: t g a t - -

(and exponentially many other ways, of course). Note that
Needleman-Wunsch considers I<global> alignments, over the entire
length of both sequences; each item is either aligned with an item of
the other sequence, or corresponds to a I<gap> (which is always
aligned with an item - aligning two gaps wouldn't help anything). This
approach is especially suitable for comparing sequences of comparable
length and somewhat similar along their whole lengths - that is,
without long stretches that have nothing to do with each other. If
your sequences don't satisfy these requirements, consider using local
alignment, which, strictly speaking, isn't Needleman-Wunsch, but is
similar enough to be implemented in this module as well - see below
for details.

In the example above, the second alignment has more gaps than the
first, but perhaps your a's are structurally important and you like
them lined up so much that you'd still prefer the second
alignment. Conversely, if c is "almost the same" as g, it might be the
first alignment that matches better. Needleman-Wunsch formalizes such
considerations into a I<similarity matrix>, assigning payoffs to each
(ordered, but the matrix is normally symmetrical so that the order
doesn't matter) pair of possible sequence items, plus a I<gap
penalty>, quantifying the desirability of a gap in a sequence. A
preference of pairings over gaps is expressed by a low (relative to
the similarity matrix values, normally negative) gap penalty.

The alignment score is then defined as the sum, over the positions
where at least one sequence has an item, of the similarity matrix
values indexed by the first and second item (when both are defined)
and gap penalties (for items aligned with a gap). For example, if C<S>
is the similarity matrix and C<g> denotes the gap penalty, the
alignment

  sequence A: a a t t c c

  sequence B: a - - - t c

has score C<S[a, a] + 3 * g + S[c, t] + S[c, c]>.

When the gap penalty is 0 and the similarity an identity matrix, i.e.
assigning 1 to every match and 0 to every mismatch, Needleman-Wunsch
reduces to finding the longest common subsequence.

The algorithm for maximizing the score is a standard application of
dynamic programming, computing the optimal alignment score of empty
and 1-item sequences and building it up until the whole input
sequences are taken into consideration. Once the optimal score is
known, the algorithm traces back to find the gap positions. Note that
while the maximal score is obviously unique, the alignment having it
in general isn't; this module's interface allows the calling
application to choose between different optimal alignments.

=head1 METHODS

=head2 Standard algorithm

=head3 new(\&score_sub [, $gap_penalty ])

The constructor. Takes one mandatory argument, which is a coderef to a
sub implementing the similarity matrix, plus an optional gap penalty
argument. If the gap penalty isn't specified as a constructor
argument, the C<Algorithm::NeedlemanWunsch> object gets it by calling
the scoring sub without arguments; apart from that case, the sub is
called with 2 arguments, which are items from the first and second
sequence, respectively, passed to
C<Algorithm::NeedlemanWunsch::align>. Note that the sub must be pure,
i.e. always return the same value when called with the same arguments.
( run in 0.969 second using v1.01-cache-2.11-cpan-39bf76dae61 )