BioPerl
Bio/DB/GFF.pm
=back
=head2 GFF Fundamentals
The GFF format is a flat tab-delimited file, each line of which
corresponds to an annotation, or feature. Each line has nine columns
and looks like this:
    Chr1 curated CDS 365647 365963 . + 1 Transcript "R119.7"
The 9 columns are as follows:
=over 4
=item 1.
reference sequence
This is the ID of the sequence that is used to establish the
coordinate system of the annotation. In the example above, the
reference sequence is "Chr1".
=item 2.
source
The source of the annotation. This field describes how the annotation
was derived. In the example above, the source is "curated" to
indicate that the feature is the result of human curation. The names
and versions of software programs are often used for the source field,
as in "tRNAScan-SE/1.2".
=item 3.
method
The annotation method. This field describes the type of the
annotation, such as "CDS". Together the method and source describe
the annotation type.
=item 4.
start position
The start of the annotation relative to the reference sequence.
=item 5.
stop position
The stop of the annotation relative to the reference sequence. Start
is always less than or equal to stop.
=item 6.
score
For annotations that are associated with a numeric score (for example,
a sequence similarity), this field describes the score. The score
units are completely unspecified, but for sequence similarities, it is
typically percent identity. Annotations that don't have a score can
use "." instead.
=item 7.
strand
For those annotations which are strand-specific, this field is the
strand on which the annotation resides. It is "+" for the forward
strand, "-" for the reverse strand, or "." for annotations that are
not stranded.
=item 8.
phase
For annotations that are linked to proteins, this field describes the
phase of the annotation on the codons. It is a number from 0 to 2, or
"." for features that have no phase.
=item 9.
group
GFF provides a simple way of generating annotation hierarchies ("is
composed of" relationships) by providing a group field. The group
field contains the class and ID of an annotation which is the logical
parent of the current one. In the example given above, the group is
the Transcript named "R119.7".
The group field is also used to store information about the target of
sequence similarity hits, and miscellaneous notes. See the next
section for a description of how to describe similarity targets.
The format of the group field is "Class ID" with a single space (not
a tab) separating the class from the ID. It is VERY IMPORTANT to
follow this format, or grouping will not work properly.
=back
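The nine columns described above can be pulled apart with a few lines of plain Perl. The following is only a minimal sketch, not the parser this module uses: C<parse_gff_line> and its hash keys are hypothetical names chosen for illustration, and the checks simply restate the rules given above (start less than or equal to stop, strand one of "+", "-" or ".", phase 0 to 2 or ".").

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::DB::GFF): split one GFF line
# into its nine named columns and apply the basic rules from the text.
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;

    # A limit of 9 keeps any extra tabs inside the group column.
    my @cols = split /\t/, $line, 9;
    die "expected 9 tab-delimited columns, got " . scalar(@cols) . "\n"
        unless @cols == 9;

    my %f;
    @f{qw(ref source method start stop score strand phase group)} = @cols;

    die "start and stop must be integers\n"
        unless $f{start} =~ /^\d+$/ && $f{stop} =~ /^\d+$/;
    die "start must be less than or equal to stop\n"
        if $f{start} > $f{stop};
    die "bad strand '$f{strand}'\n" unless $f{strand} =~ /^[-+.]$/;
    die "bad phase '$f{phase}'\n"   unless $f{phase}  =~ /^[012.]$/;

    return \%f;
}

# The example line from the text, with real tabs between the columns.
my $line = join "\t", 'Chr1', 'curated', 'CDS', 365647, 365963,
                      '.', '+', 1, 'Transcript "R119.7"';
my $feature = parse_gff_line($line);
printf "%s:%d..%d (%s) %s:%s\n",
    @{$feature}{qw(ref start stop strand source method)};
```

The group column is returned as a raw string here, because its internal "Class ID" structure is a separate concern from the tab-delimited layout.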
The sequences used to establish the coordinate system for annotations
can correspond to sequenced clones, clone fragments, contigs or
super-contigs. Thus, this module can be used throughout the lifecycle
of a sequencing project.
In addition to a group ID, the GFF format allows annotations to have a
group class. For example, in the ACeDB representation, RNA
interference experiments have a class of "RNAi" and an ID that is
unique among the RNAi experiments. Since not all databases support
this notion, the class is optional in all calls to this module, and
defaults to "Sequence" when not provided.
Double-quotes are sometimes used in GFF files around components of the
group field. Strictly, this is only necessary if the group name or
class contains whitespace.
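As an illustration of the "Class ID" convention and the optional double-quotes, the sketch below splits a group field into its class and ID. C<parse_group> is a hypothetical helper, not a method of this module. It handles only the plain "Class ID" form (not similarity targets or notes), and it assumes that a group with a single token should be treated as an ID whose class is "Sequence", mirroring the default described above for calls to this module.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::DB::GFF): split a group field
# into (class, ID). Each token is either a double-quoted string, which
# may contain whitespace, or a run of non-whitespace characters.
sub parse_group {
    my ($group) = @_;

    my @tokens;
    while ($group =~ /\G\s*(?:"([^"]*)"|(\S+))/g) {
        push @tokens, defined $1 ? $1 : $2;
    }
    die "empty group field\n" unless @tokens;

    # Assumption: a lone token is an ID with the default class.
    return ('Sequence', $tokens[0]) if @tokens == 1;

    # Anything after the first two tokens is ignored in this sketch.
    return @tokens[0, 1];
}

for my $group ('Transcript "R119.7"', 'Transcript R119.7', 'R119.7') {
    my ($class, $id) = parse_group($group);
    print "class=$class id=$id\n";
}
```

Quoted and unquoted forms of the same group yield the same class and ID, so downstream grouping does not depend on whether the file's author chose to add quotes.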
=head2 Making GFF files work with this module
Some annotations do not need to be individually named. For example,
it is probably not useful to assign a unique name to each ALU repeat
in a vertebrate genome. Others, such as predicted genes, correspond