</TT></FONT>
</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
<P>
<!-- open table cell holding the page content -->
<CENTER><TABLE BORDER="0" WIDTH="80%"><TR><TD ALIGN="LEFT" VALIGN="TOP">
<!-- page content starts here -->
<A NAME="TOC"></A>
<H1 ALIGN="CENTER">GFF (Gene Finding Features) Specifications Document</H1>
<!-- INDEX BEGIN -->
<UL>
<LI><A HREF="#introduction">Introduction</A>
<LI><A HREF="#version_2_update">Version 2 GFF Update</A>
<LI><A HREF="#fields">Definition</A>
<UL>
<LI><A HREF="#standard_feature_table">Standard Table of Features</A>
<LI><A HREF="#group_field">Group Field</A>
<LI><A HREF="#comments">Comments</A>
<UL>
<LI><A HREF="#meta_info">Comments for Meta-Information</A>
</UL>
<LI><A HREF="#file_names">File Naming</A>
</UL>
<LI><A HREF="#semantics">Semantics</A>
<LI><A HREF="#GFF_use">Ways to use GFF</A>
<UL>
<LI><A HREF="#examples">Complex Examples</A>
<UL>
<LI><A HREF="#homology_feature">Similarities to Other Sequences</A>
</UL>
<LI><A HREF="#cum_score_array">Cumulative Score Arrays</A>
</UL>
<LI><A HREF="#mailing_list">Mailing List</A>
<LI><A HREF="#edit_history">Edit History</A>
<LI><A HREF="#authors">Authors</A>
</UL>
<!-- INDEX END -->
<HR>
<A NAME="introduction"><h2>Introduction</h2></A>
<P>
Essentially all current approaches to gene finding in higher organisms
use a variety of recognition methods that give scores to likely
signals (starts, splice sites, stops etc.) or to extended regions
(exons, introns etc.), and then combine these to give complete gene
structures. Normally the combination step is done in the same program
as the feature detection, often using dynamic programming methods. We
would like to enable these processes to be decoupled, by proposing a
format called GFF (Gene-Finding Format) for the transfer of feature
information. It would then be possible to take features from an
outside source and add them to an existing program or, in the
extreme case, to write a dynamic programming system that takes only
external features.
<P>
In particular, establishing GFF would allow people to develop features
and have them tested without having to maintain a complete
gene-finding system. Equally, it would help those developing and
applying integrated gene-finding programs to test new feature
detectors developed by others, or even by themselves.
<P>
We want the GFF format to be easy to parse and process by a variety of
programs in different languages. For example, it would be useful if Unix
tools like grep and sort, and simple perl and awk scripts, could easily
extract information from the file. For these reasons, for the
primary format, we propose a record-based structure, where each
feature is described on a single line, and line order is not relevant.
<P>
We do not intend GFF format to be used for complete data management of
the analysis and annotation of genomic sequence. Systems such as
Acedb, Genotator etc. that have much richer data representation
semantics have been designed for that purpose. The disadvantages of
using their formats for data exchange (or other richer formats such as
ASN.1) are that (1) they require more complexity in parsing and
processing, and (2) there is little hope of achieving consensus on how
to capture all information. GFF intentionally aims for a low common
denominator. <P>
Here are some example records:
<pre>
SEQ1 EMBL atg 103 105 . + 0
SEQ1 EMBL exon 103 172 . + 0
SEQ1 EMBL splice5 172 173 . + .
SEQ1 netgene splice5 172 173 0.94 + .
SEQ1 genie sp5-20 163 182 2.3 + .
SEQ1 genie sp5-10 168 177 2.1 + .
SEQ2 grail ATG 17 19 2.1 - 0
</pre>
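<P>
Because each feature sits on one line and line order carries no meaning,
records can be filtered and re-sorted much as one would with grep, sort or awk.
The following is a minimal Python sketch of that idea, not part of the
specification. It assumes the columns of the example records above are
separated by whitespace, and the helper names (<TT>split_fields</TT>,
<TT>select</TT>, <TT>sort_by_position</TT>) are illustrative only.

```python
# Example records copied from the specification text above.
EXAMPLE_RECORDS = """\
SEQ1 EMBL atg 103 105 . + 0
SEQ1 EMBL exon 103 172 . + 0
SEQ1 EMBL splice5 172 173 . + .
SEQ1 netgene splice5 172 173 0.94 + .
SEQ1 genie sp5-20 163 182 2.3 + .
SEQ1 genie sp5-10 168 177 2.1 + .
SEQ2 grail ATG 17 19 2.1 - 0
"""


def split_fields(line):
    """Split one record into columns.

    Assumption: columns are whitespace-separated, as in the examples.
    """
    return line.split()


def select(lines, column, value):
    """Keep the lines whose given column (0-based) equals value."""
    return [ln for ln in lines if split_fields(ln)[column] == value]


def sort_by_position(lines):
    """Order lines by sequence name, then start, then end coordinate."""
    def key(ln):
        fields = split_fields(ln)
        return (fields[0], int(fields[3]), int(fields[4]))
    return sorted(lines, key=key)


lines = [ln for ln in EXAMPLE_RECORDS.splitlines() if ln.strip()]
genie_lines = select(lines, 1, "genie")   # column 1 is the source
ordered = sort_by_position(lines)         # any input order gives this result

for ln in ordered:
    print(ln)
```

Sorting changes nothing about the content of the file, which is the point of
a format where line order is not relevant.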
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="version_2_update"><h2>Version 2 GFF Update</h2></A>
<P>
<FONT COLOR="#8F2020"><b>ALERT 98/12/16</b>: Following discussions with Lincoln Stein and others, we
propose the Version 2 format of GFF, as specifically described in
this document. The Version 2 specification has not yet been frozen and
is presented as a "work-in-progress" at this time, open to
user feedback on the proposed changes (plus other suggestions for improvement).
The main change from Version 1 to Version 2 is the requirement for a tag-value
type structure (essentially .ace format) for any additional material on the
line, following the mandatory fields. We also now
allow '.' as a score, for features for which there is no score. Dumping in version
2 format is implemented in ACEDB. Changes in the remainder of this
document are described and marked as (<b>Version 2 changes</b>).
</FONT>
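<P>
Since Version 2 allows '.' as a score, a reader of GFF files has to treat the
score field as optional rather than always numeric. The sketch below shows one
possible way to do that in Python. The function name <TT>parse_score</TT> and
the choice of <TT>None</TT> to represent "no score" are assumptions made for
illustration, not something the specification prescribes.

```python
from typing import Optional


def parse_score(field: str) -> Optional[float]:
    """Return the score as a float, or None when the field is '.'.

    Assumption: any value other than '.' is a valid floating point number.
    """
    if field == ".":
        return None
    return float(field)


# Score fields taken from the example records in the Introduction.
scores = [parse_score(s) for s in (".", "0.94", "2.3")]
print(scores)
```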
<P>
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="fields"><h2>Definition</h2></A>
Fields are:
&lt;seqname&gt; &lt;source&gt; &lt;feature&gt; &lt;start&gt; &lt;end&gt; &lt;score&gt; &lt;strand&gt; &lt;frame&gt; [group] [comments] <P>
<dl>
<dt>&lt;seqname&gt;
<dd>The name of the sequence. Having an explicit sequence name
allows a feature file to be prepared for a data set of multiple
sequences. Normally the seqname will be the identifier of the
sequence in an accompanying fasta format file. An alternative is that
'seqname' is the identifier for a sequence in a public database, such
as an EMBL/Genbank/DDBJ accession number. Which is the case, and
which file or database to use, should be explained in accompanying
information.<P>
<dt>&lt;source&gt;
<dd> The source of this feature. This field will normally be used to
indicate the program making the prediction, or to indicate that the
feature comes from public database annotation, is experimentally verified, etc.<P>
<dt>&lt;feature&gt;