AcePerl


docs/GFF_Spec.html

extract information out of the file.  For these reasons, for the
primary format, we propose a record-based structure, where each
feature is described on a single line, and line order is not relevant.
<P>

We do not intend GFF format to be used for complete data management of
the analysis and annotation of genomic sequence.  Systems such as
Acedb, Genotator etc. that have much richer data representation
semantics have been designed for that purpose.  The disadvantages of
using their formats for data exchange (or other richer formats such as
ASN.1) are that (1) they require more complexity in parsing/processing, and (2)
there is little hope of achieving consensus on how to capture all
information.  GFF intentionally aims for a low common
denominator. <P>

Here are some example records:

<pre>
SEQ1	EMBL	atg	103	105	.	+	0
SEQ1	EMBL	exon	103	172	.	+	0
SEQ1	EMBL	splice5	172	173	.	+	.
SEQ1	netgene	splice5	172	173	0.94	+	.
SEQ1	genie	sp5-20	163	182	2.3	+	.
SEQ1	genie	sp5-10	168	177	2.1	+	.
SEQ2	grail	ATG	17	19	2.1	-	0
</pre>
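<P>
Records like those above can be read with very little code. The following is a minimal sketch in Python, not part of the specification. It assumes the fields are tab-delimited, as in the example records, and that '.' marks a missing score. The names <code>GffRecord</code> and <code>parse_gff_line</code> are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One GFF line; a hypothetical container, not defined by the spec."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the score column is '.'
    strand: str
    frame: str
    group: Optional[str]     # None when the optional field is absent


def parse_gff_line(line: str) -> GffRecord:
    """Split one record into its fields (assumes tab-delimited columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    start_pos, end_pos = int(start), int(end)
    if start_pos > end_pos:
        raise ValueError("start must be less than or equal to end")
    # The optional ninth column is kept as raw text here.
    group = fields[8] if len(fields) > 8 and fields[8] else None
    return GffRecord(
        seqname, source, feature, start_pos, end_pos,
        None if score == "." else float(score),
        strand, frame, group,
    )


record = parse_gff_line("SEQ1\tnetgene\tsplice5\t172\t173\t0.94\t+\t.")
print(record.feature, record.start, record.end, record.score)
```

Because each feature sits on one line and line order is not relevant, a whole file can be handled by applying such a function line by line.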
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="version_2_update"><h2>Version 2 GFF Update</h2></A>
<P>
<FONT COLOR="#8F2020"><b>ALERT 98/12/16</b>: Following discussions with Lincoln Stein and others, we
propose the Version 2 format of GFF, as specifically described in 
this document. The Version 2 specification has not yet been frozen and 
is presented as a "work-in-progress" at this time, open to
user feedback on the proposed changes (plus other suggestions for improvement).
The main change from Version 1 to Version 2 is the requirement for a tag-value
type structure (essentially .ace format) for any additional material on the
line, following the mandatory fields.  We also now 
allow '.' as a score, for features for which there is no score.  Dumping in version
2 format is implemented in ACEDB.  Changes in the remainder of this
document are described and marked as (<b>Version 2 changes</b>).
</FONT>
<P>
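The tag-value structure required in Version 2 can be read mechanically. The sketch below is in Python and is an illustration only, not how ACEDB does it. It assumes the syntax given under [group] in the Definition section: tags are standard identifiers, tag-value sets are separated by semicolons, and free text is double-quoted with C-style backslash escapes. The name <code>parse_group</code> is hypothetical, and only the escapes \n, \t, \" and \\ are translated.

```python
import re

# A double-quoted string, a semicolon separator, or a bare token.
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|(;)|([^\s;"]+)')
TAG = re.compile(r'[A-Za-z][A-Za-z0-9_]*\Z')
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def unescape(text):
    """Translate backslash escapes; unknown escapes keep the escaped character."""
    return re.sub(r'\\(.)', lambda m: ESCAPES.get(m.group(1), m.group(1)), text)


def parse_group(field):
    """Return {tag: [values, ...]} for a Version 2 style [group] field."""
    result = {}
    current = None                      # value list of the tag being read
    for match in TOKEN.finditer(field):
        quoted, semicolon, bare = match.groups()
        if semicolon:
            current = None              # the next token starts a new tag
        elif current is None:
            if bare is None or not TAG.match(bare):
                raise ValueError("expected a tag identifier")
            current = result.setdefault(bare, [])
        else:
            current.append(unescape(quoted) if quoted is not None else bare)
    return result


print(parse_group('Target "HBA_HUMAN" 11 55 ; E_value 0.0003'))
```

All values are kept as strings; converting them to numbers is left to the caller, since the meaning of each value depends on the tag.
<P>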
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="fields"><h2>Definition</h2></A>

Fields are:
&#060;seqname&#062; &#060;source&#062; &#060;feature&#062; &#060;start&#062; &#060;end&#062; &#060;score&#062; &#060;strand&#062; &#060;frame&#062; [group] [comments] <P>
 <dl>

 <dt>&#060;seqname&#062; 
 <dd>The name of the sequence.  Having an explicit sequence name
allows a feature file to be prepared for a data set of multiple
sequences.  Normally the seqname will be the identifier of the
sequence in an accompanying fasta format file.  An alternative is that
'seqname' is the identifier for a sequence in a public database, such
as an EMBL/Genbank/DDBJ accession number.  Which is the case, and
which file or database to use, should be explained in accompanying
information.<P>

 <dt>&#060;source&#062; 
 <dd> The source of this feature.  This field will normally be used to
indicate the program making the prediction, or if it comes from public
database annotation, or is experimentally verified, etc.<P>

 <dt>&#060;feature&#062; 
 <dd> The feature type name.  We hope to suggest a standard set of
features, to facilitate import/export, comparison, etc.  Of course,
people are free to define new ones as needed.  For example, Genie
splice detectors account for a region of DNA, and multiple detectors
may be available for the same site, as shown above.<P>
<A name="standard_feature_table"></A>
(<b>Version 2 change</b>: <u>Standard Table of Features</u> - 
we would like to enforce a standard nomenclature for
common GFF features. This does not forbid the use of other features;
it means only that if a feature is clearly described in the standard
list, the standard label should be used. For this standard table
we propose to fall back on the international public standards for genomic 
database feature annotation, specifically the 
<a href="http://www.ebi.ac.uk/ebi_docs/embl_db/ft/feature_key_ref.html">
DDBJ/EMBL/GenBank feature table</a>).<P>

 <dt>&#060;start&#062;, &#060;end&#062;
 <dd> Integers.  &#060;start&#062; must be less than or equal to
&#060;end&#062;.  Sequence numbering starts at 1, so these numbers
should be between 1 and the length of the relevant sequence,
inclusive. (<b>Version 2 change</b>: version 2 condones values of
&#060;start&#062; and &#060;end&#062; that extend outside the
reference sequence.  This is often more natural when dumping from
acedb, rather than clipping.  It means that some software using the
files may need to clip for itself.)<P>
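Software that needs coordinates inside the reference sequence can clip them itself. A minimal sketch in Python follows. The function name <code>clip_feature</code> is hypothetical, and the sequence length is assumed to be known to the caller. Returning <code>None</code> for a feature that lies wholly outside the sequence is a design choice of this sketch, not something the specification prescribes.

```python
def clip_feature(start, end, seq_length):
    """Clip a 1-based inclusive range to 1..seq_length.

    Returns (start, end) after clipping, or None when the feature does
    not overlap the reference sequence at all.
    """
    if start > end:
        raise ValueError("start must be less than or equal to end")
    if end < 1 or start > seq_length:
        return None
    return max(start, 1), min(end, seq_length)


print(clip_feature(-5, 20, 100))
print(clip_feature(90, 120, 100))
```
<P>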

 <dt>&#060;score&#062; 
 <dd> A floating point value.  When there is no score (i.e. for a
sensor that just records the possible presence of a signal, as for the
EMBL features above) you should use '.'. (<b>Version 2 change</b>: in
version 1 of GFF you had to write 0 in such circumstances.)<P>

 <dt>&#060;strand&#062; 
 <dd> One of '+', '-' or '.'.  '.' should be used when
strand is not relevant, e.g. for dinucleotide repeats.<P>

 <dt>&#060;frame&#062;
 <dd> One of '0', '1', '2' or '.'.  '0' indicates that the specified
region is in frame, i.e. that its first base corresponds to the first
base of a codon.  '1' indicates that there is one extra base,
i.e. that the second base of the region corresponds to the first base
of a codon, and '2' means that the third base of the region is the
first base of a codon.  If the strand is '-', then the first base of
the region is the value of &#060;end&#062;, because the corresponding
coding region will run from &#060;end&#062; to &#060;start&#062; on
the reverse strand.  As with &#060;strand&#062;, if the frame is not
relevant then set &#060;frame&#062; to '.'.  
It has been pointed out that "phase" might be a better descriptor than
"frame" for this field.<P>

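The frame rule above amounts to a small piece of arithmetic. As a sketch in Python, with the hypothetical function name <code>first_codon_base</code>, the position of the first base of the first complete codon can be computed as follows. The sketch assumes that a frame is only meaningful on a '+' or '-' strand and rejects other combinations.

```python
def first_codon_base(start, end, strand, frame):
    """Return the coordinate of the first base of a codon in the region.

    Returns None when frame is '.', i.e. when frame is not relevant.
    """
    if frame == ".":
        return None
    if frame not in ("0", "1", "2"):
        raise ValueError("frame must be '0', '1', '2' or '.'")
    offset = int(frame)
    if strand == "+":
        return start + offset    # region is read from start towards end
    if strand == "-":
        return end - offset      # region is read from end towards start
    raise ValueError("a frame needs a '+' or '-' strand")


print(first_codon_base(103, 172, "+", "0"))
print(first_codon_base(17, 19, "-", "0"))
```
<P>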
 <dt><A NAME="group_field">[group] </A>
 <dd> An optional string-valued field that can be used as a name to
group together a set of records.  Typical uses might be to group the
introns and exons in one gene prediction (or experimentally verified
gene structure), or to group multiple regions of match to another
sequence, such as an EST or a protein.  See below for examples.<br>

<b>Version 2 change</b>: In version 2, the optional [group] field on the line
must have a tag-value structure following the syntax used within
objects in a .ace file, flattened onto one line by semicolon
separators.  Tags must be standard identifiers
([A-Za-z][A-Za-z0-9_]*).  Free text values must be quoted with double
quotes. <em>Note: all non-printing characters in such free text value strings
(e.g. newlines, tabs, control characters, etc)
must be explicitly represented by their C (UNIX) style backslash-escaped
representation (e.g. newlines as '\n', tabs as '\t').</em>
As in ACEDB, multiple values can follow a specific tag.  The
aim is to establish consistent use of particular tags, corresponding
to an underlying implied ACEDB model if you want to think that way
(but acedb is not required).  Examples of these would be:
<font size="3"><pre>
seq1     BLASTX  similarity   101  235 87.1 + 0	Target "HBA_HUMAN" 11 55 ; E_value 0.0003


