AcePerl
docs/GFF_Spec.html view on Meta::CPAN
</FONT></TT>
</TD>
</TR>
</TABLE>
</TD>
</TR>
</TABLE>
<P>
<!-- open table cell holding the page content -->
<CENTER><TABLE BORDER="0" WIDTH="80%"><TR><TD ALIGN="LEFT" VALIGN="TOP">
<!-- page content starts here -->
<A NAME="TOC"></A>
<H1 ALIGN="CENTER">GFF (Gene Finding Features) Specifications Document</H1>
<!-- INDEX BEGIN -->
<UL>
<LI><A HREF="#introduction">Introduction</A>
<LI><A HREF="#version_2_update">Version 2 GFF Update</A>
<LI><A HREF="#fields">Definition</A>
<UL>
<LI><A HREF="#standard_feature_table">Standard Table of Features</A>
<LI><A HREF="#group_field">Group Field</A>
<LI><A HREF="#comments">Comments</A>
<UL>
<LI><A HREF="#meta_info">Comments for Meta-Information</A>
</UL>
<LI><A HREF="#file_names">File Naming</A>
</UL>
<LI><A HREF="#semantics">Semantics</A>
<LI><A HREF="#GFF_use">Ways to use GFF</A>
<UL>
<LI><A HREF="#examples">Complex Examples</A>
<UL>
<LI><A HREF="#homology_feature">Similarities to Other Sequences</A>
</UL>
<LI><A HREF="#cum_score_array">Cumulative Score Arrays</A>
</UL>
<LI><A HREF="#mailing_list"> Mailing list</A>
<LI><A HREF="#edit_history">Edit History</A>
<LI><A HREF="#authors">Authors</A>
</UL>
<!-- INDEX END -->
<HR>
<A NAME="introduction"><h2>Introduction</h2></A>
<P>
Essentially all current approaches to gene finding in higher organisms
use a variety of recognition methods that give scores to likely
signals (starts, splice sites, stops etc.) or to extended regions
(exons, introns etc.), and then combine these to give complete gene
structures. Normally the combination step is done in the same program
as the feature detection, often using dynamic programming methods. We
would like to enable these processes to be decoupled, by proposing a
format called GFF (Gene-Finding Format) for the transfer of feature
information. It would then be possible to take features from an
outside source and add them to an existing program or, in the
extreme, to write a dynamic programming system that takes only
external features.
<P>
In particular, establishing GFF would allow people to develop features
and have them tested without having to maintain a complete
gene-finding system. Equally, it would help those developing and
applying integrated gene-finding programs to test new feature
detectors developed by others, or even by themselves.
<P>
We want the GFF format to be easy to parse and process by a variety of
programs in different languages. For example, it would be useful if
Unix tools like grep and sort, and simple perl and awk scripts, could
easily extract information from the file. For these reasons, for the
primary format, we propose a record-based structure, where each
feature is described on a single line, and line order is not relevant.
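<P>
As a rough illustration of how little machinery a line-per-feature
format needs, here is a minimal Python sketch of a record reader. The
names GffRecord and parse_gff_line are hypothetical, and the field
names simply follow the column order of the example records in this
document. The sketch splits on runs of whitespace, which is an
assumption; a stricter reader might insist on a particular delimiter.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    # Field names are assumptions that follow the example column order.
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the score column is "."
    strand: str
    frame: str
    group: str              # anything after the eighth column, may be empty


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one feature line; return None for blank lines and comments."""
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    # Split at most 8 times so an optional trailing group field
    # keeps its internal spaces.
    parts = text.split(None, 8)
    if len(parts) not in (8, 9):
        raise ValueError("expected 8 or 9 columns: " + text)
    score = None if parts[5] == "." else float(parts[5])
    group = parts[8] if len(parts) == 9 else ""
    return GffRecord(parts[0], parts[1], parts[2], int(parts[3]),
                     int(parts[4]), score, parts[6], parts[7], group)
```

Because every record is independent, a caller can filter or sort the
parsed records in any order, much as one would with grep or sort on
the raw file.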
<P>
We do not intend GFF format to be used for complete data management of
the analysis and annotation of genomic sequence. Systems such as
Acedb, Genotator etc. that have much richer data representation
semantics have been designed for that purpose. The disadvantages of
using their formats (or other richer formats such as ASN.1) for data
exchange are that (1) they require more complexity in parsing and
processing, and (2) there is little hope of achieving consensus on how
to capture all information. GFF intentionally aims for a low common
denominator. <P>
Here are some example records:
<pre>
SEQ1 EMBL atg 103 105 . + 0
SEQ1 EMBL exon 103 172 . + 0
SEQ1 EMBL splice5 172 173 . + .
SEQ1 netgene splice5 172 173 0.94 + .
SEQ1 genie sp5-20 163 182 2.3 + .
SEQ1 genie sp5-10 168 177 2.1 + .
SEQ2 grail ATG 17 19 2.1 - 0
without whitespace. That allows things like 1.3, 4a etc.
<dt> <pre> ##date {date} </pre>
<dd> The date the file was made, or perhaps the date on which the
prediction programs were run. We suggest using astronomical format
(year-month-day): 1997-11-08 for 8th November 1997, first because such
dates sort properly, and second to avoid any US/European bias.
<dt> <pre>
##DNA {seqname}
##acggctcggattggcgctggatgatagatcagacgac
##...
##end-DNA
</pre>
<dd> To give a DNA sequence. Several people have pointed out that it may
be convenient to include the sequence in the file. It should not
become mandatory to do so. Often the seqname will be a well-known
identifier, and the sequence can easily be retrieved from a
database, or an accompanying file.
<dt> <pre> ##sequence-region {seqname} {start} {end} </pre>
<dd> To indicate that this file only contains entries for the
specified subregion of a sequence.
</dl>
Please feel free to propose new ## lines.
The ## line proposal came out of some discussions including Anders
Krogh, David Haussler, people at the Newton Institute on 1997-10-29
and some email from Suzanna Lewis. Of course, naive programs can
ignore all of these...
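<P>
For programs that do want the meta-information, the following Python
sketch collects the three kinds of ## line described above (##date,
##DNA ... ##end-DNA, and ##sequence-region). The function name
read_meta_lines and the shape of the returned dictionary are
assumptions made for illustration; other ## lines are silently
skipped, as a naive program would do.

```python
import datetime


def read_meta_lines(lines):
    """Collect ##date, ##sequence-region and ##DNA blocks from text lines."""
    meta = {"date": None, "regions": [], "dna": {}}
    dna_name = None  # name of the sequence currently being read, if any
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.startswith("##"):
            continue  # feature lines and ordinary comments are ignored here
        body = line[2:].strip()
        if dna_name is not None:
            # Inside a ##DNA block: accumulate until ##end-DNA.
            if body == "end-DNA":
                dna_name = None
            else:
                meta["dna"][dna_name] += body
            continue
        fields = body.split()
        if not fields:
            continue
        if fields[0] == "date" and len(fields) == 2:
            # Assumes the suggested year-month-day form, e.g. 1997-11-08.
            meta["date"] = datetime.date.fromisoformat(fields[1])
        elif fields[0] == "sequence-region" and len(fields) == 4:
            meta["regions"].append((fields[1], int(fields[2]), int(fields[3])))
        elif fields[0] == "DNA" and len(fields) == 2:
            dna_name = fields[1]
            meta["dna"][dna_name] = ""
    return meta
```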
<A NAME="file_names"><h3> File Naming </h3></A>
We propose that the format be called "GFF", with the conventional file
name ending ".gff".
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="semantics"><h2> Semantics </h2></A>
We have intentionally avoided overspecifying the semantics of the
format. For example, we have not restricted the items expressible in
GFF to a specified set of feature types (splice sites, exons etc.)
with defined semantics. Therefore, for the information in a GFF file
to be useful to somebody else, the person producing the features must
describe their meaning. <P>
In the example given above the feature "splice5" indicates that there
is a candidate 5' splice site between positions 172 and 173. The
"sp5-20" feature is a prediction based on a window of 20 bp for the
same splice site. To use either of these, you must know the position
within the feature of the predicted splice site. This only needs to
be given once, possibly in comments at the head of the file, or in a
separate document. <P>
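To make the point about feature-relative positions concrete, the
Python sketch below converts a feature into the pair of positions
flanking the predicted splice site. The table SITE_OFFSET is a
hypothetical example of the information a producer would have to
publish. Its values are assumptions, chosen only so that the sp5-20
and sp5-10 example records agree with the splice5 record; they are not
stated anywhere in this document.

```python
from typing import Tuple

# Hypothetical producer-supplied table: how many bases of the feature
# lie to the left of the predicted splice site.
SITE_OFFSET = {"splice5": 1, "sp5-20": 10, "sp5-10": 5}


def splice_site(feature: str, start: int) -> Tuple[int, int]:
    """Return the two sequence positions that flank the predicted site."""
    left = start + SITE_OFFSET[feature] - 1
    return left, left + 1
```

With these assumed offsets, splice_site("sp5-20", 163) and
splice_site("sp5-10", 168) both give (172, 173), the same site as the
splice5 record.
<P>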
Another example is the scoring scheme; we ourselves would like the
score to be a log-odds likelihood score in bits relative to a defined
null model, but that is not required, because different methods take
different approaches. <P>
Avoiding a prespecified feature set also leaves open the possibility
for GFF to be used for new feature types, such as CpG islands,
hypersensitive sites, promoter/enhancer elements, etc.
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="GFF_use"><h2> Ways to use GFF </h2></A>
Here are a few suggestions on how the GFF format might be used.
<ol>
<li> Simple sharing of sensors. In this case, researcher A has a sensor,
such as a 3' splice site sensor, and researcher B wants to test that
sensor. They agree on a set of sequences, researcher A runs the
sensor on these sequences and sends the resulting GFF file to
researcher B, who then evaluates the result.<P>
<li> Representing experimental results. GFF feature records can also
be created for experimentally confirmed exons and other features. In
these cases there will presumably be no score. Such "confirmed" GFF
files will be useful for evaluating predictions, using the same
software as you would to compare predictions.<P>
<li> Integrated gene parsing. Several GFF files from different
researchers can be combined to provide the features used by an
integrated genefinder. As mentioned above, this has the advantage
that different combinations of sensors and dynamic programming methods
for assembling sensor scores into consistent gene parses can be easily
explored.<P>
<li> Reporting final predictions. GFF format can also be used to
communicate finished gene predictions. One simply reports final
predicted exons and other predicted gene features, either with their
original scores, or with some sort of posterior scores, rather than,
or in addition to, reporting all candidate gene features with their
scores. To show that a set of the components belong to a single
prediction, a "group" field can be added to all the accepted sites.
This is useful for comparing the outputs of several integrated
genefinders among themselves, and to "confirmed" GFF files. A
particular advantage of having the same format for both raw sensor
feature score files and final gene parse files is that one can easily
explore the possibility of combining the final gene parses from
several different genefinders, using another round of dynamic
programming, into a single integrated predicted parse.<P>
<li> Visualisation. GFF will also provide a simple standard format for
standardising input to visualisation programs, showing predicted and
experimentally determined features, gene structures etc.
</ol>
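<P>
As one possible illustration of evaluating predictions against a
"confirmed" file, the Python sketch below counts exons that match
exactly between two sets of GFF lines. It assumes that both files use
the feature name "exon" and whitespace-separated columns in the order
of the example records; the function names are hypothetical, and real
evaluations would usually also consider partial overlaps.

```python
def exon_keys(lines):
    """Return the set of (seqname, start, end, strand) for exon features."""
    keys = set()
    for line in lines:
        parts = line.split()
        # Skip comments and anything without the eight leading columns.
        if len(parts) >= 8 and not line.startswith("#") and parts[2] == "exon":
            keys.add((parts[0], int(parts[3]), int(parts[4]), parts[6]))
    return keys


def compare_exons(predicted_lines, confirmed_lines):
    """Count exact matches, missed confirmed exons and extra predictions."""
    predicted = exon_keys(predicted_lines)
    confirmed = exon_keys(confirmed_lines)
    return {
        "exact": len(predicted.intersection(confirmed)),
        "missed": len(confirmed - predicted),
        "extra": len(predicted - confirmed),
    }
```

The same function can compare two predictors with each other, since
both inputs are ordinary GFF lines.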
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="examples"><h3> Complex Examples</h3></A>
<A NAME="homology_feature">
<h4> Similarities to Other Sequences </h4></A>
A major source of information about a sequence comes from similarities
to other sequences. For example, BLAST hits to protein sequences help
identify potential coding regions. We can represent these as a set of
"homology gene features", grouping hits to the same target as follows:
<font size="3"><pre>
seq1 BLASTX similarity 101 136 87.1 + 0 HBA_HUMAN
seq1 BLASTX similarity 107 133 72.4 + 0 HBB_HUMAN
seq1 BLASTX similarity 290 343 67.1 + 0 HBA_HUMAN
</pre></font>
If further information is needed about where in the target protein
each match occurs, it can be given after the protein name, e.g.
as the start coordinate in the target.
<P>
<b>Version 2 change</b>: In version 2 this has been formalised using
the tag Target which expects to be followed by the name of the target,
followed (optionally) by start and end point in the target as
integers, as in
<font size="3"><pre>
seq1 BLASTX similarity 101 235 87.1 + 0 Target "HBA_HUMAN" 11 55 ; E_value 0.0003
</pre></font>
We still need to settle on a tag model for gapped alignments.
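<P>
A reader of the Version 2 form might pull the Target information out
of the group field roughly as follows. This is a Python sketch under
stated assumptions: tags are separated by semicolons, no semicolon
occurs inside a quoted value, and the helper names parse_group and
target_of are hypothetical.

```python
import shlex


def parse_group(group: str) -> dict:
    """Map each tag to its list of values (strings, quotes removed)."""
    tags = {}
    # Naive split: assumes no ';' occurs inside a quoted value.
    for chunk in group.split(";"):
        tokens = shlex.split(chunk)
        if tokens:
            tags[tokens[0]] = tokens[1:]
    return tags


def target_of(group: str):
    """Return (name, start, end) for a Target tag, or None if absent.

    start and end are None when the optional coordinates are missing.
    """
    values = parse_group(group).get("Target")
    if not values:
        return None
    has_coords = len(values) == 3
    start = int(values[1]) if has_coords else None
    end = int(values[2]) if has_coords else None
    return values[0], start, end
```

For the example record above, target_of returns the target name
together with the integer coordinates 11 and 55.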
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="cum_score_array"><h3> Cumulative Score Arrays </h3></A>
One issue that comes up with a record-based format such as the GFF
format is how to cope with large numbers of overlapping segments. For
example, in a long sequence, if one tries to include a separate record
giving the score of every candidate exon, where a candidate exon is
defined as a segment of the sequence that begins and ends at candidate
splice sites and consists of an open reading frame in between, then
one can have an infeasibly large number of records. The problem is
that there can be a huge number of highly overlapping exon
candidates. <P>
Let us assume that the score of an exon can be decomposed into three
parts: the score of the 5' splice site, the score of the 3' splice
site, and the sum of the scores of all the codons in between. In such
a case it can be much more efficient to use the GFF format to report
separate scores for the splice site sensors and for the individual
codons in all three (or six, including reverse strand) frames, and let
the program that interprets this file assemble the exon scores. The
exon scores can be calculated efficiently by first creating three
arrays, each of which contains in its [i]th position a value A[i] that
is the partial sum of the codon scores in a particular frame for the
entire sequence from position 1 up to position i. Then for any
positions i &lt; j, the sum of the scores of all codons lying after
position i, up to and including position j, can be obtained as
A[j] - A[i]. Using these arrays, along with the
candidate splice site scores, a very large number of scores for
overlapping exons are implicitly defined in a data structure that
takes only linear space with respect to the number of positions in the
sequence, and such that the score for each exon can be retrieved in
constant time. <P>
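The cumulative array idea can be sketched in Python as follows. The
conventions here are assumptions made for the sketch: each codon's
score is attributed to the position of its last base, and an extra
entry A[0] = 0 is added so that the codons of an exon occupying
positions i to j inclusive sum to A[j] - A[i-1]. The function names
and the two splice site score parameters are hypothetical.

```python
from itertools import accumulate


def cumulative_array(seq_length, codon_scores):
    """Build A[0..seq_length] for one frame.

    codon_scores maps the 1-based position of a codon's last base to
    that codon's score. A[i] is the total score of all codons in this
    frame that end at or before position i, with A[0] = 0.
    """
    per_position = [codon_scores.get(pos, 0.0)
                    for pos in range(1, seq_length + 1)]
    return [0.0] + list(accumulate(per_position))


def codon_sum(A, i, j):
    """Sum of codon scores for codons lying within positions i..j."""
    return A[j] - A[i - 1]


def exon_score(A, i, j, start_site_score, end_site_score):
    """Exon score: the two splice site scores plus the codon sum."""
    return start_site_score + end_site_score + codon_sum(A, i, j)
```

Building each array takes one pass over the sequence, after which any
candidate exon is scored with two lookups and a subtraction, so no
per-exon record needs to be stored.
<P>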
When the GFF format is used to transmit scores that can be summed for
efficient retrieval as in the case of the codon scores above, we ask
that the provider of the scores indicate that these scores are
summable in this manner, and provide a recipe for calculating the
scores that are to be derived from these summable scores, such as the
exon scores described above. We place no limit on the complexity of