BioPerl

 view release on metacpan or  search on metacpan

Bio/Search/HSP/PsiBlastHSP.pm  view on Meta::CPAN

#-----------------------------------------------------------------
#
# BioPerl module Bio::Search::HSP::PsiBlastHSP
#
# (This module was originally called Bio::Tools::Blast::HSP)
#
# Please direct questions and support issues to <bioperl-l@bioperl.org> 
#
# Cared for by Steve Chervitz <sac@bioperl.org>
#
# You may distribute this module under the same terms as perl itself
#-----------------------------------------------------------------

## POD Documentation:

=head1 NAME

Bio::Search::HSP::PsiBlastHSP - Bioperl BLAST High-Scoring Pair object

=head1 SYNOPSIS

See L<Bio::Search::Hit::BlastHit>.

=head1 DESCRIPTION

A Bio::Search::HSP::PsiBlastHSP object provides an interface to data
obtained in a single alignment section of a Blast report (known as a
"High-scoring Segment Pair"). This is essentially a pairwise
alignment with score information.

PsiBlastHSP objects are accessed via L<Bio::Search::Hit::BlastHit>
objects after parsing a BLAST report using the L<Bio::SearchIO>
system.

The construction of PsiBlastHSP objects is performed by
Bio::Factory::BlastHitFactory in a process that is
orchestrated by the Blast parser (L<Bio::SearchIO::psiblast>).
The resulting PsiBlastHSPs are then accessed via
L<Bio::Search::Hit::BlastHit>). Therefore, you do not need to
use L<Bio::Search::HSP::PsiBlastHSP>) directly. If you need to construct
PsiBlastHSPs directly, see the new() function for details.

For L<Bio::SearchIO> BLAST parsing usage examples, see the
C<examples/search-blast> directory of the Bioperl distribution.


=head2 Start and End coordinates

Sequence endpoints are swapped so that start is always less than
end. This affects For TBLASTN/X hits on the minus strand. Strand
information can be recovered using the strand() method. This
normalization step is standard Bioperl practice. It also facilitates
use of range information by methods such as match().

=over 1

=item * Supports BLAST versions 1.x and 2.x, gapped and ungapped.

=back

Bio::Search::HSP::PsiBlastHSP.pm has the ability to extract a list of all
residue indices for identical and conservative matches along both
query and sbjct sequences. Since this degree of detail is not always
needed, this behavior does not occur during construction of the PsiBlastHSP
object.  These data will automatically be collected as necessary as
the PsiBlastHSP.pm object is used.

=head1 DEPENDENCIES

Bio::Search::HSP::PsiBlastHSP.pm is a concrete class that inherits from
L<Bio::SeqFeature::SimilarityPair> and L<Bio::Search::HSP::HSPI>.
L<Bio::Seq> and L<Bio::SimpleAlign> are employed for creating
sequence and alignment objects, respectively.

=head2 Relationship to L<Bio::SimpleAlign> and L<Bio::Seq>

PsiBlastHSP.pm can provide the query or sbjct sequence as a L<Bio::Seq>
object via the L<seq()|seq> method. The PsiBlastHSP.pm object can also create a
two-sequence L<Bio::SimpleAlign> alignment object using the the query
and sbjct sequences via the L<get_aln()|get_aln> method. Creation of alignment
objects is not automatic when constructing the PsiBlastHSP.pm object since
this level of functionality is not always required and would generate
a lot of extra overhead when crunching many reports.


=head1 FEEDBACK

=head2 Mailing Lists

User feedback is an integral part of the evolution of this and other
Bioperl modules.  Send your comments and suggestions preferably to one
of the Bioperl mailing lists.  Your participation is much appreciated.

  bioperl-l@bioperl.org                  - General discussion
  http://bioperl.org/wiki/Mailing_lists  - About the mailing lists

=head2 Support 

Please direct usage questions or support issues to the mailing list:

I<bioperl-l@bioperl.org>

rather than to the module maintainer directly. Many experienced and 
reponsive experts will be able look at the problem and quickly 
address it. Please include a thorough description of the problem 
with code and data examples if at all possible.

=head2 Reporting Bugs

Bio/Search/HSP/PsiBlastHSP.pm  view on Meta::CPAN

        $self->{'_sbjctStrand'} = $2;
    }

#    if($data =~ m!Gaps = (\d+)/(\d+)!) {
#         $self->{'_totalGaps'} = $1;
#    } else {
#         $self->{'_totalGaps'} = 0;
#    }
}



=head2 _set_seq_data

 Usage     : called automatically when sequence data is requested.
 Purpose   : Sets the HSP sequence data for both query and sbjct sequences.
           : Includes: start, stop, length, gaps, and raw sequence.
 Argument  : n/a
 Throws    : Propagates any exception thrown by _set_match_seq()
 Comments  : Uses raw data stored by _set_data() during object construction.
           : These data are not always needed, so it is conditionally
           : executed only upon demand by methods such as gaps(), _set_residues(),
           : etc. _set_seq() does the dirty work.

See Also   : L</_set_seq>

=cut

#-----------------
sub _set_seq_data {
#-----------------
    my $self = shift;

    $self->_set_seq('query', @{$self->{'_queryList'}});
    $self->_set_seq('sbjct', @{$self->{'_sbjctList'}});

    # Liberate some memory.
    @{$self->{'_queryList'}} = @{$self->{'_sbjctList'}} = ();
    undef $self->{'_queryList'};
    undef $self->{'_sbjctList'};

    $self->{'_set_seq_data'} = 1;
}



=head2 _set_seq

 Usage     : called automatically by _set_seq_data()
           : $hsp_obj->($seq_type, @data);
 Purpose   : Sets sequence information for both the query and sbjct sequences.
           : Directly counts the number of gaps in each sequence (if gapped Blast).
 Argument  : $seq_type = 'query' or 'sbjct'
           : @data = all seq lines with the form:
           : Query: 61  SPHNVKDRKEQNGSINNAISPTATANTSGSQQINIDSALRDRSSNVAAQPSLSDASSGSN 120
 Throws    : Exception if data strings cannot be parsed, probably due to a change
           : in the Blast report format.
 Comments  : Uses first argument to determine which data members to set
           : making this method sensitive data member name changes.
           : Behavior is dependent on the type of BLAST analysis (TBLASTN, BLASTP, etc).
 Warning   : Sequence endpoints are normalized so that start < end. This affects HSPs
           : for TBLASTN/X hits on the minus strand. Normalization facilitates use
           : of range information by methods such as match().

See Also   : L</_set_seq_data>, L</matches>, L</range>, L</start>, L</end>

=cut

#-------------
sub _set_seq {
#-------------
    my $self      = shift;
    my $seqType   = shift;
    my @data      = @_;
    my @ranges    = ();
    my @sequence  = ();
    my $numGaps   = 0;

    foreach( @data ) {
        if( m/(\d+) *([^\d\s]+) *(\d+)/) {
            push @ranges, ( $1, $3 ) ;
            push @sequence, $2;
        #print STDERR "_set_seq found sequence \"$2\"\n";
        } else {
            $self->warn("Bad sequence data: $_");
        }
    }

    if( !(scalar(@sequence) and scalar(@ranges))) {
        my $id_str = $self->_id_str;
        $self->throw("Can't set sequence: missing data. Possibly unrecognized Blast format. ($id_str)");
   }

    # Sensitive to member name changes.
    $seqType = "_\L$seqType\E";
    $self->{$seqType.'Start'} = $ranges[0];
    $self->{$seqType.'Stop'}  = $ranges[ $#ranges ];
    $self->{$seqType.'Seq'}   = \@sequence;

    $self->{$seqType.'Length'} = abs($ranges[ $#ranges ] - $ranges[0]) + 1;

    # Adjust lengths for BLASTX, TBLASTN, TBLASTX sequences
    # Converting nucl coords to amino acid coords.

    my $prog = $self->algorithm;
    if($prog eq 'TBLASTN' and $seqType eq '_sbjct') {
        $self->{$seqType.'Length'} /= 3;
    } elsif($prog eq 'BLASTX' and $seqType eq '_query') {
        $self->{$seqType.'Length'} /= 3;
    } elsif($prog eq 'TBLASTX') {
        $self->{$seqType.'Length'} /= 3;
    }

    if( $prog ne 'BLASTP' ) {
        $self->{$seqType.'Strand'} = 'Plus' if $prog =~ /BLASTN/;
        $self->{$seqType.'Strand'} = 'Plus' if ($prog =~ /BLASTX/ and $seqType eq '_query');
        # Normalize sequence endpoints so that start < end.
        # Reverse complement or 'minus strand' HSPs get flipped here.
        if($self->{$seqType.'Start'} > $self->{$seqType.'Stop'}) {
            ($self->{$seqType.'Start'}, $self->{$seqType.'Stop'}) =
                ($self->{$seqType.'Stop'}, $self->{$seqType.'Start'});
            $self->{$seqType.'Strand'} = 'Minus';
        }
    }

    ## Count number of gaps in each seq. Only need to do this for gapped Blasts.
#    if($self->{'_gapped'}) {
        my $seqstr = join('', @sequence);
        $seqstr =~ s/\s//g;
        my $num_gaps = CORE::length($seqstr) - $self->{$seqType.'Length'};
        $self->{$seqType.'Gaps'} = $num_gaps if $num_gaps > 0;
#    }
}


=head2 _set_residues

 Usage     : called automatically when residue data is requested.
 Purpose   : Sets the residue numbers representing the identical and
           : conserved positions. These data are obtained by analyzing the
           : symbols between query and sbjct lines of the alignments.
 Argument  : n/a
 Throws    : Propagates any exception thrown by _set_seq_data() and _set_match_seq().
 Comments  : These data are not always needed, so it is conditionally
           : executed only upon demand by methods such as seq_inds().
           : Behavior is dependent on the type of BLAST analysis (TBLASTN, BLASTP, etc).

See Also   : L</_set_seq_data>, L</_set_match_seq>, L</seq_inds>

=cut

#------------------
sub _set_residues {
#------------------
    my $self      = shift;
    my @sequence  = ();

    $self->_set_seq_data() unless $self->{'_set_seq_data'};

    # Using hashes to avoid saving duplicate residue numbers.
    my %identicalList_query = ();
    my %identicalList_sbjct = ();
    my %conservedList_query = ();
    my %conservedList_sbjct = ();

    my $aref = $self->_set_match_seq() if not ref $self->{'_matchSeq'};
    $aref  ||= $self->{'_matchSeq'};
    my $seqString = join('', @$aref );

    my $qseq = join('',@{$self->{'_querySeq'}});
    my $sseq = join('',@{$self->{'_sbjctSeq'}});
    my $resCount_query = $self->{'_queryStop'} || 0;
    my $resCount_sbjct = $self->{'_sbjctStop'} || 0;

    my $prog = $self->algorithm;
    if($prog !~ /^BLASTP|^BLASTN/) {
        if($prog eq 'TBLASTN') {



( run in 0.534 second using v1.01-cache-2.11-cpan-39bf76dae61 )