AcePerl
Ace/Sequence.pm view on Meta::CPAN
['END','STOP'],
['OFFSET','OFF'],
['LENGTH','LEN'],
'REFSEQ',
['DATABASE','DB'],
],@_);
# Object must have a parent sequence and/or a reference
# sequence. In some cases, the parent sequence will be the
# object itself. The reference sequence is used to set up
# the frame of reference for the coordinate system.
# fetch the sequence object if we don't have it already
croak "Please provide either a Sequence object or a database and name"
unless ref($seq) || ($seq && $db);
# convert start into offset
$offset = $start - 1 if defined($start) and !defined($offset);
# convert stop/end into length
$length = ($end > $start) ? $end - $offset : $end - $offset - 2
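The two conversions above can be tried in isolation. The following is a minimal sketch, assuming 1-based inclusive start/end coordinates; `start_end_to_offset_length` is a hypothetical helper written for illustration and is not part of Ace::Sequence.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Ace::Sequence): applies the same
# arithmetic as the excerpt above to a start/end pair.
sub start_end_to_offset_length {
    my ($start, $end) = @_;
    my $offset = $start - 1;              # 1-based start -> 0-based offset
    my $length = ($end > $start)
               ? $end - $offset           # forward range: positive length
               : $end - $offset - 2;      # reversed range: negative length
    return ($offset, $length);
}

my ($off, $len) = start_end_to_offset_length(5, 10);
print "offset=$off length=$len\n";        # offset=4 length=6

($off, $len) = start_end_to_offset_length(10, 5);
print "offset=$off length=$len\n";        # offset=9 length=-6
```

With this arithmetic, a range whose end precedes its start yields a negative length, which appears to be how a reversed segment is signalled.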
Ace/Sequence/Feature.pm view on Meta::CPAN
'-1' => '+1'); # war is peace, &c.
use overload
'""' => 'asString',
;
# parse a line from a sequence list
sub new {
my $pack = shift;
my ($parent,$ref,$r_offset,$r_strand,$abs,$gff_line,$db) = @_;
my ($sourceseq,$method,$type,$start,$end,$score,$strand,$frame,$group) = split "\t",$gff_line;
if (defined($strand)) {
$strand = $strand eq '-' ? '-1' : '+1';
} else {
$strand = 0;
}
# For efficiency, we don't call the superclass new() method; instead we build the
# object and handle coordinates directly. See SCRAPS below for what should be in here.
$strand = '+1' if $strand < 0 && $r_strand < 0; # two wrongs do make a right
($start,$end) = ($end,$start) if $strand < 0;
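The field splitting and strand handling above can be sketched as a standalone function. This is an illustration under stated assumptions, not the module's constructor: `parse_gff_line` is a hypothetical name, the sample line is invented, and mapping '.' to 0 is an assumption made here (the excerpt maps every defined strand other than '-' to '+1').

```perl
use strict;
use warnings;

# Hypothetical standalone parser; field order follows the split in new().
sub parse_gff_line {
    my ($gff_line, $r_strand) = @_;
    $r_strand = 1 unless defined $r_strand;   # assume a forward reference strand

    my ($seqname, $method, $type, $start, $end, $score, $strand, $frame, $group)
        = split /\t/, $gff_line;

    # Assumption: '.' (or a missing field) means "not stranded" and maps to 0.
    my $fstrand = (!defined($strand) || $strand eq '.') ? 0
                : ($strand eq '-')                      ? -1
                :                                          1;

    # A minus-strand feature on a minus-strand reference reads as forward.
    $fstrand = 1 if $fstrand < 0 && $r_strand < 0;

    # Reversed features are stored with start > end.
    ($start, $end) = ($end, $start) if $fstrand < 0;

    return { seqname => $seqname, type => $type,
             start   => $start,   end  => $end, fstrand => $fstrand };
}

# Invented sample line, for illustration only.
my $f = parse_gff_line(join("\t", 'SEQ1', 'demo', 'exon', 100, 250, '.', '-', 0));
print "$f->{type} $f->{start}..$f->{end} strand $f->{fstrand}\n";
```

Passing -1 as the second argument models a reversed reference sequence, in which case the same line comes back as a forward feature with its coordinates left unswapped.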
Ace/Sequence/Feature.pm view on Meta::CPAN
p_offset => $r_offset,
refseq => [$ref,$r_offset,$r_strand],
strand => $r_strand,
fstrand => $strand,
absolute => $abs,
info => {
seqname=> $sourceseq,
method => $method,
type => $type,
score => $score,
frame => $frame,
group => $group,
db => $db,
}
},$pack;
return $self;
}
sub smapped { 1; }
# $_[0] is field name, $_[1] is self, $_[2] is optional replacement value
Ace/Sequence/Feature.pm view on Meta::CPAN
sub seqname {
my $self = shift;
my $seq = $self->_field('seqname');
$self->db->fetch(Sequence=>$seq);
}
sub method { shift->_field('method',@_) } # ... I prefer "method"
sub subtype { shift->_field('method',@_) } # ... or even "subtype"
sub type { shift->_field('type',@_) } # ... I prefer "type"
sub score { shift->_field('score',@_) } # float indicating some sort of score
sub frame { shift->_field('frame',@_) } # one of 0, 1, 2 or undef
sub info { # returns Ace::Object(s) with info about the feature
my $self = shift;
unless ($self->{group}) {
my $info = $self->{info}{group} || 'Method "'.$self->method.'"';
$info =~ s/(\"[^\"]*);([^\"]*\")/$1$;$2/g;
my @data = split(/\s*;\s*/,$info);
foreach (@data) { s/$;/;/g }
$self->{group} = [map {$self->toAce($_)} @data];
}
return wantarray ? @{$self->{group}} : $self->{group}->[0];
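The info() excerpt above splits the group field on semicolons while protecting a semicolon that sits inside a quoted value, using a placeholder character. The sketch below shows the same idea with a quote-aware substitution in place of the module's regex; `split_group_field` and the sample string are hypothetical, and the placeholder is "\034", which is Perl's default value for `$;`.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Ace::Sequence::Feature.
sub split_group_field {
    my ($info) = @_;
    my $placeholder = "\034";    # Perl's default $; (subscript separator)

    # Hide every semicolon that sits inside a double-quoted value.
    $info =~ s{("[^"]*")}{
        (my $quoted = $1) =~ s/;/$placeholder/g;
        $quoted;
    }ge;

    # Now any remaining semicolon is a real separator.
    my @data = split /\s*;\s*/, $info;

    # Put the hidden semicolons back.
    s/\Q$placeholder\E/;/g for @data;
    return @data;
}

# Invented sample value, for illustration only.
my @parts = split_group_field('Gene "g1" ; Note "first; second; third"');
print scalar(@parts), " parts\n";
print "$_\n" for @parts;
```

Because this variant pairs quotes from left to right, it also copes with more than one semicolon inside a single quoted value.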
Ace/Sequence/Feature.pm view on Meta::CPAN
Returns the strandedness of this feature, either "+1" or "-1". For
features that are not stranded, returns 0.
=item reversed()
$reversed = $feature->reversed;
Returns true if the feature is reversed relative to its source
sequence.
=item frame()
$frame = $feature->frame;
For features that have a frame, such as a predicted coding sequence,
returns the frame, either 0, 1 or 2. For other features, returns undef.
=item group()
=item info()
=item target()
$info = $feature->info;
These methods (synonyms for one another) return an Ace::Object
Ace/Sequence/Feature.pm view on Meta::CPAN
=cut
__END__
# SCRAPS
# the new() code done "right"
# sub new {
# my $pack = shift;
# my ($ref,$r_offset,$r_strand,$gff_line) = @_;
# my ($sourceseq,$method,$type,$start,$end,$score,$strand,$frame,$group) = split "\t";
# ($start,$end) = ($end,$start) if $strand < 0;
# my $self = $pack->SUPER::new($source,$start,$end);
# $self->{info} = {
# seqname=> $sourceseq,
# method => $method,
# type => $type,
# score => $score,
# frame => $frame,
# group => $group,
# };
# $self->{fstrand} = $strand;
# return $self;
# }
docs/GFF_Spec.html view on Meta::CPAN
document are described and marked as (<b>Version 2 changes</b>).
</FONT>
<P>
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="fields"><h2>Definition</h2></A>
Fields are:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [group] [comments] <P>
<dl>
<dt><seqname>
<dd>The name of the sequence. Having an explicit sequence name
allows a feature file to be prepared for a data set of multiple
sequences. Normally the seqname will be the identifier of the
sequence in an accompanying fasta format file. An alternative is that
'seqname' is the identifier for a sequence in a public database, such
as an EMBL/Genbank/DDBJ accession number. Which is the case, and
which file or database to use, should be explained in accompanying
docs/GFF_Spec.html view on Meta::CPAN
<dt><score>
<dd> A floating point value. When there is no score (i.e. for a
sensor that just records the possible presence of a signal, as for the
EMBL features above) you should use '.'. (<b>Version 2 change</b>: in
version 1 of GFF you had to write 0 in such circumstances.)<P>
<dt><strand>
<dd> One of '+', '-' or '.'. '.' should be used when
strand is not relevant, e.g. for dinucleotide repeats.<P>
<dt><frame>
<dd> One of '0', '1', '2' or '.'. '0' indicates that the specified
region is in frame, i.e. that its first base corresponds to the first
base of a codon. '1' indicates that there is one extra base,
i.e. that the second base of the region corresponds to the first base
of a codon, and '2' means that the third base of the region is the
first base of a codon. If the strand is '-', then the first base of
the region is the value of <end>, because the corresponding
coding region will run from <end> to <start> on
the reverse strand. As with <strand>, if the frame is not
relevant then set <frame> to '.'.
It has been pointed out that "phase" might be a better descriptor than
"frame" for this field.<P>
<dt><A NAME="group_field">[group] </A>
<dd> An optional string-valued field that can be used as a name to
group together a set of records. Typical uses might be to group the
introns and exons in one gene prediction (or experimentally verified
gene structure), or to group multiple regions of match to another
sequence, such as an EST or a protein. See below for examples.<br>
<b>Version 2 change</b>: In version 2, the optional [group] field on the line
must have a tag-value structure following the syntax used within
docs/GFF_Spec.html view on Meta::CPAN
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="cum_score_array"><h3> Cumulative Score Arrays </h3></A>
One issue that comes up with a record-based format such as the GFF
format is how to cope with large numbers of overlapping segments. For
example, in a long sequence, if one tries to include a separate record
giving the score of every candidate exon, where a candidate exon is
defined as a segment of the sequence that begins and ends at candidate
splice sites and consists of an open reading frame in between, then
one can have an infeasibly large number of records. The problem is
that there can be a huge number of highly overlapping exon
candidates. <P>
Let us assume that the score of an exon can be decomposed into three
parts: the score of the 5' splice site, the score of the 3' splice
site, and the sum of the scores of all the codons in between. In such
a case it can be much more efficient to use the GFF format to report
separate scores for the splice site sensors and for the individual
codons in all three (or six, including reverse strand) frames, and let
the program that interprets this file assemble the exon scores. The
exon scores can be calculated efficiently by first creating three
arrays, each of which contains in its [i]th position a value A[i] that
is the partial sum of the codon scores in a particular frame for the
entire sequence from position 1 up to position i. Then for any
positions i < j, the sum of the scores of all codons after position i,
up to and including position j, can be obtained as A[j] - A[i]. Using these arrays, along with the
candidate splice site scores, a very large number of scores for
overlapping exons are implicitly defined in a data structure that
takes only linear space with respect to the number of positions in the
sequence, and such that the score for each exon can be retrieved in
constant time. <P>
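The cumulative array described above is a prefix sum. The sketch below is a minimal illustration for a single frame, with invented per-position scores; `cumulative_scores` and `range_score` are hypothetical names. It assumes A[0] = 0, so that A[j] - A[i] covers positions i+1 through j.

```perl
use strict;
use warnings;

# Build A, where A[i] is the sum of the scores at positions 1..i and A[0] = 0.
sub cumulative_scores {
    my ($scores) = @_;           # arrayref; element k holds the score at position k+1
    my @A = (0);                 # empty prefix
    push @A, $A[-1] + $_ for @$scores;
    return \@A;
}

# Sum of the scores at positions i+1 .. j, in constant time.
sub range_score {
    my ($A, $i, $j) = @_;
    return $A->[$j] - $A->[$i];
}

# Invented scores for positions 1..5 of one frame.
my $A = cumulative_scores([2, -1, 3, 0, 4]);
print "A = @$A\n";                                        # A = 0 2 1 4 4 8
print "positions 2..4: ", range_score($A, 1, 4), "\n";    # positions 2..4: 2
```

One such array per frame takes space linear in the sequence length, and each exon's codon score is then a single subtraction, to which the two splice-site scores are added.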
When the GFF format is used to transmit scores that can be summed for