BioPerl
view release on metacpan or search on metacpan
Bio/Search/HSP/BlastHSP.pm view on Meta::CPAN
#-----------------------------------------------------------------
#
# BioPerl module Bio::Search::HSP::BlastHSP
#
# (This module was originally called Bio::Tools::Blast::HSP)
#
# Please direct questions and support issues to <bioperl-l@bioperl.org>
#
# Cared for by Steve Chervitz <sac@bioperl.org>
#
# You may distribute this module under the same terms as perl itself
#-----------------------------------------------------------------
## POD Documentation:
=head1 NAME
Bio::Search::HSP::BlastHSP - Bioperl BLAST High-Scoring Pair object
=head1 SYNOPSIS
See L<Bio::Search::Hit::BlastHit>.
=head1 DESCRIPTION
A Bio::Search::HSP::BlastHSP object provides an interface to data
obtained in a single alignment section of a Blast report (known as a
"High-scoring Segment Pair"). This is essentially a pairwise
alignment with score information.
BlastHSP objects are accessed via L<Bio::Search::Hit::BlastHit>
objects after parsing a BLAST report using the L<Bio::SearchIO>
system.
The construction of BlastHSP objects is performed by
Bio::Factory::BlastHitFactory in a process that is
orchestrated by the Blast parser (L<Bio::SearchIO::psiblast>).
The resulting BlastHSPs are then accessed via
L<Bio::Search::Hit::BlastHit>). Therefore, you do not need to
use L<Bio::Search::HSP::BlastHSP>) directly. If you need to construct
BlastHSPs directly, see the new() function for details.
For L<Bio::SearchIO> BLAST parsing usage examples, see the
C<examples/searchio> directory of the Bioperl distribution.
=head2 Start and End coordinates
Sequence endpoints are swapped so that start is always less than
end. This affects For TBLASTN/X hits on the minus strand. Strand
information can be recovered using the strand() method. This
normalization step is standard Bioperl practice. It also facilitates
use of range information by methods such as match().
=over 1
=item * Supports BLAST versions 1.x and 2.x, gapped and ungapped.
=back
Bio::Search::HSP::BlastHSP.pm has the ability to extract a list of all
residue indices for identical and conservative matches along both
query and sbjct sequences. Since this degree of detail is not always
needed, this behavior does not occur during construction of the BlastHSP
object. These data will automatically be collected as necessary as
the BlastHSP.pm object is used.
=head1 DEPENDENCIES
Bio::Search::HSP::BlastHSP.pm is a concrete class that inherits from
L<Bio::SeqFeature::SimilarityPair> and L<Bio::Search::HSP::HSPI>.
L<Bio::Seq> and L<Bio::SimpleAlign> are employed for creating
sequence and alignment objects, respectively.
=head2 Relationship to SimpleAlign.pm & Seq.pm
BlastHSP.pm can provide the query or sbjct sequence as a L<Bio::Seq>
object via the L<seq()|seq> method. The BlastHSP.pm object can also create a
two-sequence L<Bio::SimpleAlign> alignment object using the the query
and sbjct sequences via the L<get_aln()|get_aln> method. Creation of alignment
objects is not automatic when constructing the BlastHSP.pm object since
this level of functionality is not always required and would generate
a lot of extra overhead when crunching many reports.
=head1 FEEDBACK
=head2 Mailing Lists
User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
=head2 Support
Please direct usage questions or support issues to the mailing list:
I<bioperl-l@bioperl.org>
rather than to the module maintainer directly. Many experienced and
reponsive experts will be able look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.
=head2 Reporting Bugs
Bio/Search/HSP/BlastHSP.pm view on Meta::CPAN
$self->{'_sbjctStrand'} = $2;
}
# if($data =~ m!Gaps = (\d+)/(\d+)!) {
# $self->{'_totalGaps'} = $1;
# } else {
# $self->{'_totalGaps'} = 0;
# }
}
=head2 _set_seq_data
Usage : called automatically when sequence data is requested.
Purpose : Sets the HSP sequence data for both query and sbjct sequences.
: Includes: start, stop, length, gaps, and raw sequence.
Argument : n/a
Throws : Propagates any exception thrown by _set_match_seq()
Comments : Uses raw data stored by _set_data() during object construction.
: These data are not always needed, so it is conditionally
: executed only upon demand by methods such as gaps(), _set_residues(),
: etc. _set_seq() does the dirty work.
See Also : L</_set_seq>
=cut
#-----------------
sub _set_seq_data {
#-----------------
my $self = shift;
$self->_set_seq('query', @{$self->{'_queryList'}});
$self->_set_seq('sbjct', @{$self->{'_sbjctList'}});
# Liberate some memory.
@{$self->{'_queryList'}} = @{$self->{'_sbjctList'}} = ();
undef $self->{'_queryList'};
undef $self->{'_sbjctList'};
$self->{'_set_seq_data'} = 1;
}
=head2 _set_seq
Usage : called automatically by _set_seq_data()
: $hsp_obj->($seq_type, @data);
Purpose : Sets sequence information for both the query and sbjct sequences.
: Directly counts the number of gaps in each sequence (if gapped Blast).
Argument : $seq_type = 'query' or 'sbjct'
: @data = all seq lines with the form:
: Query: 61 SPHNVKDRKEQNGSINNAISPTATANTSGSQQINIDSALRDRSSNVAAQPSLSDASSGSN 120
Throws : Exception if data strings cannot be parsed, probably due to a change
: in the Blast report format.
Comments : Uses first argument to determine which data members to set
: making this method sensitive data member name changes.
: Behavior is dependent on the type of BLAST analysis (TBLASTN, BLASTP, etc).
Warning : Sequence endpoints are normalized so that start < end. This affects HSPs
: for TBLASTN/X hits on the minus strand. Normalization facilitates use
: of range information by methods such as match().
See Also : L</_set_seq_data>, L</matches>, L</range>, L</start>, L</end>
=cut
#-------------
sub _set_seq {
#-------------
my $self = shift;
my $seqType = shift;
my @data = @_;
my @ranges = ();
my @sequence = ();
my $numGaps = 0;
foreach( @data ) {
if( m/(\d+) *([^\d\s]+) *(\d+)/) {
push @ranges, ( $1, $3 ) ;
push @sequence, $2;
#print STDERR "_set_seq found sequence \"$2\"\n";
} else {
$self->warn("Bad sequence data: $_");
}
}
if( !(scalar(@sequence) and scalar(@ranges))) {
my $id_str = $self->_id_str;
$self->throw("Can't set sequence: missing data. Possibly unrecognized Blast format. ($id_str) $seqType");
}
# Sensitive to member name changes.
$seqType = "_\L$seqType\E";
$self->{$seqType.'Start'} = $ranges[0];
$self->{$seqType.'Stop'} = $ranges[ $#ranges ];
$self->{$seqType.'Seq'} = \@sequence;
$self->{$seqType.'Length'} = abs($ranges[ $#ranges ] - $ranges[0]) + 1;
# Adjust lengths for BLASTX, TBLASTN, TBLASTX sequences
# Converting nucl coords to amino acid coords.
my $prog = $self->algorithm;
if($prog eq 'TBLASTN' and $seqType eq '_sbjct') {
$self->{$seqType.'Length'} /= 3;
} elsif($prog eq 'BLASTX' and $seqType eq '_query') {
$self->{$seqType.'Length'} /= 3;
} elsif($prog eq 'TBLASTX') {
$self->{$seqType.'Length'} /= 3;
}
if( $prog ne 'BLASTP' ) {
$self->{$seqType.'Strand'} = 'Plus' if $prog =~ /BLASTN/;
$self->{$seqType.'Strand'} = 'Plus' if ($prog =~ /BLASTX/ and $seqType eq '_query');
# Normalize sequence endpoints so that start < end.
# Reverse complement or 'minus strand' HSPs get flipped here.
if($self->{$seqType.'Start'} > $self->{$seqType.'Stop'}) {
($self->{$seqType.'Start'}, $self->{$seqType.'Stop'}) =
($self->{$seqType.'Stop'}, $self->{$seqType.'Start'});
$self->{$seqType.'Strand'} = 'Minus';
}
}
## Count number of gaps in each seq. Only need to do this for gapped Blasts.
# if($self->{'_gapped'}) {
my $seqstr = join('', @sequence);
$seqstr =~ s/\s//g;
my $num_gaps = CORE::length($seqstr) - $self->{$seqType.'Length'};
$self->{$seqType.'Gaps'} = $num_gaps if $num_gaps > 0;
# }
}
=head2 _set_residues
Usage : called automatically when residue data is requested.
Purpose : Sets the residue numbers representing the identical and
: conserved positions. These data are obtained by analyzing the
: symbols between query and sbjct lines of the alignments.
Argument : n/a
Throws : Propagates any exception thrown by _set_seq_data() and _set_match_seq().
Comments : These data are not always needed, so it is conditionally
: executed only upon demand by methods such as seq_inds().
: Behavior is dependent on the type of BLAST analysis (TBLASTN, BLASTP, etc).
See Also : L</_set_seq_data>, L</_set_match_seq>, L</seq_inds>
=cut
#------------------
sub _set_residues {
#------------------
my $self = shift;
my @sequence = ();
$self->_set_seq_data() unless $self->{'_set_seq_data'};
# Using hashes to avoid saving duplicate residue numbers.
my %identicalList_query = ();
my %identicalList_sbjct = ();
my %conservedList_query = ();
my %conservedList_sbjct = ();
my $aref = $self->_set_match_seq() if not ref $self->{'_matchSeq'};
$aref ||= $self->{'_matchSeq'};
my $seqString = join('', @$aref );
my $qseq = join('',@{$self->{'_querySeq'}});
my $sseq = join('',@{$self->{'_sbjctSeq'}});
my $resCount_query = $self->{'_queryStop'} || 0;
my $resCount_sbjct = $self->{'_sbjctStop'} || 0;
my $prog = $self->algorithm;
if($prog !~ /^BLASTP|^BLASTN/) {
if($prog eq 'TBLASTN') {
( run in 1.271 second using v1.01-cache-2.11-cpan-39bf76dae61 )