BioPerl

Bio/DB/GFF.pm

=back

=head2 GFF Fundamentals

The GFF format is a flat tab-delimited file, each line of which
corresponds to an annotation, or feature.  Each line has nine columns
and looks like this:

 Chr1  curated  CDS  365647  365963  .  +  1  Transcript "R119.7"

The nine columns are as follows:

=over 4

=item 1.

reference sequence

This is the ID of the sequence that is used to establish the
coordinate system of the annotation.  In the example above, the
reference sequence is "Chr1".

=item 2.

source

The source of the annotation.  This field describes how the annotation
was derived.  In the example above, the source is "curated" to
indicate that the feature is the result of human curation.  The names
and versions of software programs are often used for the source field,
as in "tRNAScan-SE/1.2".

=item 3.

method

The annotation method.  This field describes the type of the
annotation, such as "CDS".  Together the method and source describe
the annotation type.

=item 4.

start position

The start of the annotation relative to the reference sequence. 

=item 5.

stop position

The stop of the annotation relative to the reference sequence.  Start
is always less than or equal to stop.

=item 6.

score

For annotations that are associated with a numeric score (for example,
a sequence similarity), this field holds the score.  The score units
are unspecified, but for sequence similarities the score is typically
percent identity.  Annotations that don't have a score can use ".".

=item 7.

strand

For those annotations which are strand-specific, this field is the
strand on which the annotation resides.  It is "+" for the forward
strand, "-" for the reverse strand, or "." for annotations that are
not stranded.

=item 8.

phase

For annotations that are linked to proteins, this field describes the
phase of the annotation on the codons.  It is a number from 0 to 2, or
"." for features that have no phase.

=item 9.

group

GFF provides a simple way of generating annotation hierarchies ("is
composed of" relationships) by providing a group field.  The group
field contains the class and ID of an annotation which is the logical
parent of the current one.  In the example given above, the group is
the Transcript named "R119.7".

The group field is also used to store information about the target of
sequence similarity hits, and miscellaneous notes.  See the next
section for a description of how to describe similarity targets.

The format of the group field is "Class ID" with a single space (not
a tab) separating the class from the ID.  It is VERY IMPORTANT to
follow this format, or grouping will not work properly.

=back
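
As a rough illustration of the column layout described above, the
following sketch splits one GFF line on tabs and checks a few of the
stated constraints (nine columns, start less than or equal to stop,
valid strand and phase values).  It uses only core Perl and does not
call this module; the subroutine name C<parse_gff_line> and the hash
keys are hypothetical, chosen for this example only.

 use strict;
 use warnings;

 # Hypothetical helper, not part of Bio::DB::GFF: split one GFF line
 # into its nine tab-delimited columns and return them as a hash ref.
 sub parse_gff_line {
     my ($line) = @_;
     chomp $line;
     my @cols = split /\t/, $line, 9;
     die "expected 9 columns, got " . scalar(@cols) . "\n"
         unless @cols == 9;

     my %f;
     @f{qw(refseq source method start stop score strand phase group)} = @cols;

     die "start must be <= stop\n"   if $f{start} > $f{stop};
     die "bad strand '$f{strand}'\n" unless $f{strand} =~ /^[+.-]$/;
     die "bad phase '$f{phase}'\n"   unless $f{phase}  =~ /^[012.]$/;

     # "." means "no value" for score and phase
     for my $key (qw(score phase)) {
         $f{$key} = undef if $f{$key} eq '.';
     }
     return \%f;
 }

 # The example line from above, joined with real tabs
 my $line = join "\t", qw(Chr1 curated CDS 365647 365963 . + 1),
                       'Transcript "R119.7"';
 my $feature = parse_gff_line($line);
 print "$feature->{method} on $feature->{refseq}: ",
       "$feature->{start}..$feature->{stop} ($feature->{strand})\n";
 # prints "CDS on Chr1: 365647..365963 (+)"

The group column is returned unparsed here; it keeps its internal
space because the line is split on tabs only.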

The sequences used to establish the coordinate system for annotations
can correspond to sequenced clones, clone fragments, contigs or
super-contigs.  Thus, this module can be used throughout the lifecycle
of a sequencing project.

In addition to a group ID, the GFF format allows annotations to have a
group class.  For example, in the ACeDB representation, RNA
interference experiments have a class of "RNAi" and an ID that is
unique among the RNAi experiments.  Since not all databases support
this notion, the class is optional in all calls to this module, and
defaults to "Sequence" when not provided.

Double-quotes are sometimes used in GFF files around components of the
group field.  Strictly, this is only necessary if the group name or
class contains whitespace.
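
The "Class ID" convention and the optional double-quotes can be
sketched in a few lines of core Perl.  The subroutine name
C<parse_group> is hypothetical and is not part of this module.
Treating a group field that has only one component as an ID of class
"Sequence" is an assumption of this sketch, modelled on the default
described above for calls to this module.

 use strict;
 use warnings;

 # Hypothetical helper, not part of Bio::DB::GFF: split a group field
 # into (class, ID), stripping optional double quotes.
 sub parse_group {
     my ($group) = @_;
     my @parts;
     # Each component is either a double-quoted string (which may
     # contain whitespace) or a run of non-whitespace characters.
     while ($group =~ /\G\s*(?:"([^"]*)"|(\S+))/g) {
         push @parts, defined $1 ? $1 : $2;
     }
     die "empty group field\n" unless @parts;

     # Assumption: a lone component is an ID of class "Sequence".
     return @parts == 1 ? ('Sequence', $parts[0]) : @parts[0, 1];
 }

 for my $group ('Transcript "R119.7"', 'Transcript R119.7', 'R119.7') {
     my ($class, $id) = parse_group($group);
     print "class=$class id=$id\n";
 }
 # prints:
 #   class=Transcript id=R119.7
 #   class=Transcript id=R119.7
 #   class=Sequence id=R119.7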

=head2 Making GFF files work with this module

Some annotations do not need to be individually named.  For example,
it is probably not useful to assign a unique name to each ALU repeat
in a vertebrate genome.  Others, such as predicted genes, correspond
