Bio-Grep


lib/Bio/Grep.pm


To start the back-end with the specified settings, simply call

  $sbe->search();

This method also accepts a hash reference with settings. In this case, all
previously defined options except the paths and the database are reset to their
default values.

  $sbe->search({ mismatches => 2, 
                 reverse_complement => 0, 
                 query => 'AGAGCCCT' });

=head2 ANALYZE SEARCH RESULTS

Use a L<Bio::Perl>-like while loop to analyze the search results.

  while ( my $res = $sbe->next_res ) {
     print $res->sequence->id . "\n";
     print $res->alignment_string() . "\n\n";
  }

See L<Bio::Grep::SearchResult> for all available information.


=head1 BGREP

This distribution comes with a sample script called L<bgrep>. 

=head1 WHICH BACK-END?


We support these external back-ends:

=over

=item C<Vmatch> 

L<http://vmatch.de/>
	
=item C<Agrep> 

L<ftp://ftp.cs.arizona.edu/agrep/> (original Wu-Manber 1992 implementation for
UNIX),
L<http://www.tgries.de/agrep/> (DOS, Windows, OS/2),
L<http://webglimpse.net/download.php> (Agrep binary of C<Glimpse>) and
L<http://laurikari.net/tre/download.html> (TRE implementation).

=item C<GUUGle> 

L<http://bibiserv.techfak.uni-bielefeld.de/guugle/>

=back

=head2 FEATURE COMPARISON

=begin html

<table><tr><th>Feature</th><th>Agrep</th><th>GUUGle</th><th>RE</th><th>Vmatch</th></tr><tr><td>Suffix Arrays/Trees</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Sliding Window</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
</tr>
<tr><td>Persistent Index<sup>1</sup></td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Mismatches</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Edit Distance</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Insertions</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
</tr>
<tr><td>Deletions</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
</tr>
<tr><td>Multiple Queries<sup>2</sup></td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>GU</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
</tr>
<tr><td>DNA/RNA</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Protein</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Direct and Revcom</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Reverse Complement</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Upstream/Downstream</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Filters</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Query Length<sup>3</sup></td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
</tr>
<tr><td>Regular Expressions<sup>4</sup></td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
<td style="font-weight: bold;text-align: center;background-color: #00ff00;">yes</td>
<td style="text-align:center;background-color: #ffe0e0;">no</td>
</tr>
</table><br/><div style="font-size: smaller"><hr width="300"
align="left"><sup>1</sup>Needs pre-calculation and (much) more memory but queries are in general faster<br/><sup>2</sup>With query_file<br/><sup>3</sup>Matches if a substring of the query of size n or larger matches<br/><sup>4</sup>Agrep soon</div>

=end html

=begin man

   Features               || Agrep  | GUUGle |   RE   | Vmatch 
   Suffix Arrays/Trees    ||   no   |  yes   |   no   |  yes   
   Sliding Window         ||  yes   |   no   |  yes   |   no   
   Persistent Index 1     ||   no   |   no   |   no   |  yes   
   Mismatches             ||  yes   |   no   |   no   |  yes   
   Edit Distance          ||  yes   |   no   |   no   |  yes   
   Insertions             ||   no   |   no   |   no   |   no   
   Deletions              ||   no   |   no   |   no   |   no   
   Multiple Queries 2     ||   no   |  yes   |   no   |  yes   
   GU                     ||   no   |  yes   |   no   |   no   
   DNA/RNA                ||  yes   |  yes   |  yes   |  yes   
   Protein                ||  yes   |   no   |  yes   |  yes   
   Direct and Revcom      ||   no   |  yes   |  yes   |  yes   
   Reverse Complement     ||  yes   |  yes   |  yes   |  yes   
   Upstream/Downstream    ||   no   |  yes   |  yes   |  yes   
   Filters                ||   no   |  yes   |  yes   |  yes   
   Query Length 3         ||   no   |  yes   |   no   |  yes   
   Regular Expressions 4  ||   no   |   no   |  yes   |   no   

--
 1 Needs pre-calculation and (much) more memory but queries are in general faster
 2 With query_file
 3 Matches if a substring of the query of size n or larger matches
 4 Agrep soon

=end man


C<Vmatch> is fast but needs a lot of memory. C<Agrep> is the best choice if
you allow many mismatches in short sequences, if you want to search in Fasta
files with relatively short sequences (e.g. cDNA or protein databases), and if
you are only interested in which sequences contain an approximate match.
Its performance in this case is excellent. If you want the exact positions of a
match in the sequence, choose C<Vmatch>. If you want nice alignments, choose
C<Vmatch> too (C<EMBOSS> can automatically align the sequence and the query in
the C<Agrep> back-end, but then C<Vmatch> is faster). Filters require exact
positions, so you can't use them with C<Agrep>. This may or may not change in
a future version. The C<Agrep> implementation of the C<TRE> library
(L<http://laurikari.net/tre/>) is also supported. This implementation has fewer
limitations and more features (e.g. you get the exact hit positions) but is
much slower. See L<Bio::Grep::Benchmarks>.
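
The recommendations in this section can be condensed into a small decision
helper. The following is only a sketch: C<choose_backend> and its option
names (C<regex>, C<gu_wobble>, C<mismatches>, C<positions>, C<alignments>,
C<filters>) are hypothetical and not part of C<Bio::Grep>. It returns one of
the back-end names from the feature table.

  use strict;
  use warnings;

  # Hypothetical helper (not part of Bio::Grep): maps search
  # requirements to a back-end name, following the advice in this
  # section.
  sub choose_backend {
      my (%need) = @_;

      # Only the RE back-end offers regular expressions.
      return 'RE' if $need{regex};

      # GUUGle tolerates GU pairs but reports exact matches only.
      return 'GUUGle' if $need{gu_wobble} && !$need{mismatches};

      # Exact positions, alignments and filters call for Vmatch;
      # GU pairs plus mismatches would need Vmatch with a filter.
      return 'Vmatch'
          if $need{positions}
          || $need{alignments}
          || $need{filters}
          || $need{gu_wobble};

      # Otherwise: only interested in which sequences match.
      return 'Agrep';
  }

  print choose_backend( mismatches => 3 ), "\n";               # Agrep
  print choose_backend( mismatches => 1, filters => 1 ), "\n"; # Vmatch
  print choose_backend( gu_wobble  => 1 ), "\n";               # GUUGle

The returned string could then be passed on when the back-end object is
created. Memory limits and database size are not modelled here, so treat the
result as a starting point and consult L<Bio::Grep::Benchmarks>.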

C<GUUGle> may be the best choice if you have RNA queries (it does not count GU
pairs as mismatches) and if you are only interested in exact matches. Another
solution here would be to use C<Vmatch> and write a filter (see next section)
that only allows GU mismatches. Of course, this is only an alternative if you
can limit (C<$sbe-E<gt>settings-E<gt>mismatches()>) the maximal number of GU
mismatches. C<Vmatch> with its pre-calculated suffix arrays is really fast, so
you should consider this option.
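
The core test that such a GU filter would need can be sketched in plain Perl.
This is an illustration only: C<count_gu_mismatches> is a hypothetical helper,
not part of C<Bio::Grep>, and it assumes that the reverse complement of the
query and the subject are already aligned without gaps and have equal length.
A query C<G> (C<C> in the reverse complement) can also pair with a subject
C<U>, and a query C<U> (C<A> in the reverse complement) can also pair with a
subject C<G>, so only these two substitutions are tolerated.

  use strict;
  use warnings;

  # Hypothetical helper: returns the number of GU wobble positions, or
  # undef if the strings differ in length or contain any other mismatch.
  sub count_gu_mismatches {
      my ( $rc_query, $subject ) = @_;
      return if length $rc_query != length $subject;

      # Normalize case and treat DNA input (T) like RNA (U).
      ( my $q = uc $rc_query ) =~ tr/T/U/;
      ( my $s = uc $subject )  =~ tr/T/U/;

      # Keys are "reverse-complement base" . "subject base".
      my %wobble = ( CU => 1, AG => 1 );

      my $gu = 0;
      for my $i ( 0 .. length($q) - 1 ) {
          my $x = substr $q, $i, 1;
          my $y = substr $s, $i, 1;
          next if $x eq $y;             # identical: regular base pair
          return if !$wobble{"$x$y"};   # any other mismatch: reject
          $gu++;
      }
      return $gu;
  }

  my $n = count_gu_mismatches( 'ACGU', 'GCGU' );
  print defined $n ? "GU mismatches: $n\n" : "rejected\n";
  # prints "GU mismatches: 1"

A filter built around this check would keep a hit only if the return value is
defined, which is why the number of allowed mismatches has to be limited
beforehand.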

Perl regular expressions are available in the C<RE> back-end. It is a very


