AcePerl


docs/GFF_Spec.html

rather, just that if the feature is obviously described in the standard
list, that the standard label should be used. For this standard table
we propose to fall back on the international public standards for genomic 
database feature annotation, specifically, the 
<a href="http://www.ebi.ac.uk/ebi_docs/embl_db/ft/feature_key_ref.html">
DDBJ/EMBL/GenBank feature table</a>).<P>

 <dt>&#060;start&#062;, &#060;end&#062;
 <dd> Integers.  &#060;start&#062; must be less than or equal to
&#060;end&#062;.  Sequence numbering starts at 1, so these numbers
should be between 1 and the length of the relevant sequence,
inclusive. (<b>Version 2 change</b>: version 2 condones values of
&#060;start&#062; and &#060;end&#062; that extend outside the
reference sequence.  This is often more natural when dumping from
acedb, rather than clipping.  It means that some software using the
files may need to clip for itself.)<P>

 <dt>&#060;score&#062; 
 <dd> A floating point value.  When there is no score (i.e. for a
sensor that just records the possible presence of a signal, as for the
EMBL features above) you should use '.'. (<b>Version 2 change</b>: in
version 1 of GFF you had to write 0 in such circumstances.)<P>

 <dt>&#060;strand&#062; 
 <dd> One of '+', '-' or '.'.  '.' should be used when
strand is not relevant, e.g. for dinucleotide repeats.<P>

 <dt>&#060;frame&#062;
 <dd> One of '0', '1', '2' or '.'.  '0' indicates that the specified
region is in frame, i.e. that its first base corresponds to the first
base of a codon.  '1' indicates that there is one extra base,
i.e. that the second base of the region corresponds to the first base
of a codon, and '2' means that the third base of the region is the
first base of a codon.  If the strand is '-', then the first base of
the region is the value of &#060;end&#062;, because the corresponding
coding region will run from &#060;end&#062; to &#060;start&#062; on
the reverse strand.  As with &#060;strand&#062;, if the frame is not
relevant then set &#060;frame&#062; to '.'.  
It has been pointed out that "phase" might be a better descriptor than
"frame" for this field.<P>

 <dt><A NAME="group_field">[group] </A>
 <dd> An optional string-valued field that can be used as a name to
group together a set of records.  Typical uses might be to group the
introns and exons in one gene prediction (or experimentally verified
gene structure), or to group multiple regions of match to another
sequence, such as an EST or a protein.  See below for examples.<br>

<b>Version 2 change</b>: In version 2, the optional [group] field on the line
must have a tag-value structure following the syntax used within
objects in a .ace file, flattened onto one line by semicolon
separators.  Tags must be standard identifiers
([A-Za-z][A-Za-z0-9_]*).  Free text values must be quoted with double
quotes. <em>Note: all non-printing characters in such free text value strings
(e.g. newlines, tabs, control characters, etc)
must be explicitly represented by their C (UNIX) style backslash-escaped
representation (e.g. newlines as '\n', tabs as '\t').</em>
As in ACEDB, multiple values can follow a specific tag.  The
aim is to establish consistent use of particular tags, corresponding
to an underlying implied ACEDB model if you want to think that way
(but acedb is not required).  Examples of these would be:
<font size="3"><pre>
seq1     BLASTX  similarity   101  235 87.1 + 0	Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201   .  - 2 Sequence "dJ102G20.C1.1"
</pre></font>

</dl>
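<P>
The version 2 [group] syntax can be illustrated with a short parser.
This is only a sketch in Python, not a reference implementation: the
function names (<code>parse_group</code>, <code>unescape</code>) are
ours, all values are returned as strings, and malformed input such as
an unterminated quote is not fully diagnosed.

```python
import re

# Tags must be standard identifiers, as required for version 2.
TAG_RE = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# One token: a double-quoted string (backslash escapes allowed),
# a ';' separator, or a bare run of other non-space characters.
TOKEN_RE = re.compile(r'"((?:[^"\\]|\\.)*)"|(;)|([^\s;"]+)')

# Escapes decoded here; any other escaped character is kept as itself.
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def unescape(text):
    """Turn C-style backslash escapes in a quoted value into characters."""
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(1)), text)


def parse_group(field):
    """Return a list of (tag, [values]) pairs, in file order."""
    pairs = []
    tokens = []  # (text, was_quoted) for the tag-value set being read

    def flush():
        # Close the current tag-value set: first token is the tag.
        if not tokens:
            return
        tag, was_quoted = tokens[0]
        if was_quoted or not TAG_RE.fullmatch(tag):
            raise ValueError("bad tag in [group] field: %r" % tag)
        pairs.append((tag, [text for text, _ in tokens[1:]]))
        tokens.clear()

    for match in TOKEN_RE.finditer(field):
        quoted, separator, bare = match.groups()
        if separator:
            flush()
        elif quoted is not None:
            tokens.append((unescape(quoted), True))
        else:
            tokens.append((bare, False))
    flush()
    return pairs
```

Applied to the [group] field of the first example line above, this
yields the tag <code>Target</code> with three values and the tag
<code>E_value</code> with one.
<P>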

All strings (i.e. values of the &#060;seqname&#062;,
&#060;source&#062; or &#060;feature&#062; fields) should be under 256
characters long, and should not include whitespace.  The whole line
should be under 32k long.  A character limit is not very desirable,
but helps write parsers in some languages. The slightly silly 32k
limit is to allow plenty of space for comments/extra data. <b>Version 2 change</b>:
field and line size limitations are removed; however, fields (except the optional
[group] field above) must still not include whitespace.
<P>

All of the fields described above should be separated by TAB characters ('\t').
<b>Version 2 note</b>: previous Version 2 permission to use arbitrary whitespace
as field delimiters is now <b>revoked</b>! (99/02/26)
<P>
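The field rules above can be checked mechanically.  The following
Python sketch splits one feature line on TABs and validates the fixed
fields; the function names and the dictionary layout are our own
choices, comments are assumed to have been stripped already, and the
[group] field is returned unparsed.  The helper
<code>first_codon_base</code> applies the &#060;frame&#062; rule,
treating a '.' strand like '+' (an assumption the text does not
settle).

```python
def parse_feature_line(line):
    """Split one GFF feature line on TABs and check the fixed fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 TAB-separated fields, got %d"
                         % len(fields))
    seqname, source, feature, start, end, score, strand, frame = fields[:8]

    # The name fields must be non-empty and free of whitespace.
    for name in (seqname, source, feature):
        if not name or any(ch.isspace() for ch in name):
            raise ValueError("empty field or whitespace in %r" % name)

    start, end = int(start), int(end)
    if start > end:
        raise ValueError("start %d is greater than end %d" % (start, end))
    if strand not in ("+", "-", "."):
        raise ValueError("bad strand %r" % strand)
    if frame not in ("0", "1", "2", "."):
        raise ValueError("bad frame %r" % frame)

    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": start,
        "end": end,
        "score": None if score == "." else float(score),  # '.' means no score
        "strand": strand,
        "frame": frame,
        "group": fields[8] if len(fields) > 8 else None,  # optional field
    }


def first_codon_base(record):
    """Coordinate of the first base of the first whole codon, or None."""
    if record["frame"] == ".":
        return None
    offset = int(record["frame"])
    if record["strand"] == "-":
        # On the reverse strand the region starts at <end> and runs down.
        return record["end"] - offset
    return record["start"] + offset
```

For the second example line above (strand '-', frame '2', end 7201),
<code>first_codon_base</code> gives 7199.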
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="comments"><h3> Comments </h3>

Comments are allowed, starting with "#" as in Perl, awk etc.
Everything following # until the end of the line is ignored.
A comment can be used in two ways.  Either the # comes at the
beginning of the line (after any whitespace), making the whole line a
comment, or it comes after all the required fields on
the line.
<P>
We also permit extra information to be given on the line following the
group field without a '#' character (<b>Version 2 change</b>: this extra
information <B>must</B> be delimited by the '#' comment delimiter <B>OR</B>
by another tab field delimiter character, following 
any and all [group] field tag-value pairs). 
<P>
This allows extra method-specific information to be transferred with the line.  However,
we discourage overuse of this feature: better to find a way to do it
with more true feature lines, and perhaps groups. (<b>Version 2
change</b>: we gave in and defined a structured way of passing
additional information, as described above under [group].  But the
sentiment of this paragraph still applies - don't overuse the
tag-value syntax. The use of tag-value pairs (with whitespace) renders problematic the parsing of
Version 1 style comments (following the group field, without a '#' character), so in Version 2,
such [group] trailing comments <B>must</B> start with the "#", as noted above.)
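<P>
A reader of version 2 files might separate a trailing comment from the
data as sketched below in Python.  The function name
<code>split_comment</code> is ours.  The sketch assumes that a '#'
inside a double-quoted [group] value is part of the value rather than
a comment delimiter; the text above does not state this explicitly.
Lines beginning with '##' should be recognised before this function is
called, since it would treat them as ordinary comments.

```python
def split_comment(line):
    """Return (data, comment); comment is None when the line has none."""
    in_quotes = False
    escaped = False
    for i, ch in enumerate(line):
        if escaped:
            # Character after a backslash inside quotes: take it literally.
            escaped = False
        elif ch == "\\" and in_quotes:
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
        elif ch == "#" and not in_quotes:
            # First unquoted '#': everything after it is the comment.
            return line[:i].rstrip(), line[i + 1:].strip()
    return line.rstrip(), None
```

A whole-line comment comes back with an empty data part, so the caller
can skip it.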

<A NAME="meta_info"><h4> ## comment lines for meta information </h4>

There is a set of standardised (i.e. parsable) ## line types that can
be used optionally at the top of a gff file.  The philosophy is a
little like the special set of %% lines at the top of postscript
files, used for example to give the BoundingBox for EPS files.<P>

Current proposed ## lines are:

<dl>

  <dt><pre> ##gff-version 1 </pre>
  <dd> GFF version - in case it is a real success and we want to
change it.  The current version is 2. (<b>Version 2 change</b>!)

  <dt><pre> ##source-version {source} {version text} </pre>
  <dd> So that people can record what version of a program or package was
used to make the data in this file. I suggest the version is text
 without whitespace.  That allows things like 1.3, 4a etc.

  <dt> <pre> ##date {date} </pre>
  <dd> The date the file was made, or perhaps that the prediction
programs were run.  We suggest using astronomical format: 1997-11-08
for 8th November 1997, first because these sort properly, and second
to avoid any US/European bias.

<dt> <pre> 
 ##DNA {seqname}
 ##acggctcggattggcgctggatgatagatcagacgac
 ##...
 ##end-DNA
</pre>

<dd> To give a DNA sequence.  Several people have pointed out that it may
be convenient to include the sequence in the file.  It should not
become mandatory to do so.  Often the seqname will be a well-known
identifier, and the sequence can easily be retrieved from a
database, or an accompanying file.

<dt> <pre> ##sequence-region {seqname} {start} {end} </pre>
<dd> To indicate that this file only contains entries for the
specified subregion of a sequence.

</dl>
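<P>
The ## line types listed above might be collected as in the following
Python sketch.  The function name <code>parse_meta</code> and the
layout of the returned dictionary are our own choices, values are kept
as text except for the sequence-region coordinates, and unrecognised
## lines are silently ignored, as a naive program is allowed to do.

```python
def parse_meta(lines):
    """Collect the ## meta lines of a GFF file into a dictionary."""
    meta = {"dna": {}}
    dna_name = None  # name of the sequence while inside a ##DNA block
    for raw in lines:
        line = raw.strip()
        if not line.startswith("##"):
            continue  # feature lines and ordinary comments are skipped
        body = line[2:].strip()

        if dna_name is not None:
            # Inside ##DNA ... ##end-DNA: each line carries sequence.
            if body == "end-DNA":
                dna_name = None
            else:
                meta["dna"][dna_name] += body
            continue

        parts = body.split()
        if not parts:
            continue
        key, args = parts[0], parts[1:]
        if key == "gff-version" and len(args) == 1:
            meta["gff-version"] = args[0]
        elif key == "source-version" and len(args) >= 2:
            versions = meta.setdefault("source-version", {})
            versions[args[0]] = " ".join(args[1:])
        elif key == "date" and args:
            meta["date"] = args[0]
        elif key == "DNA" and len(args) == 1:
            dna_name = args[0]
            meta["dna"][dna_name] = ""
        elif key == "sequence-region" and len(args) == 3:
            regions = meta.setdefault("sequence-region", {})
            regions[args[0]] = (int(args[1]), int(args[2]))
    return meta
```

Because dates are written in the suggested year-month-day format, the
stored date strings can be compared or sorted directly as text.
<P>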

Please feel free to propose new ## lines.

The ## line proposal came out of some discussions including Anders
Krogh, David Haussler, people at the Newton Institute on 1997-10-29
and some email from Suzanna Lewis.  Of course, naive programs can
ignore all of these...

<A NAME="file_names"><h3> File Naming </h3>

We propose that the format is called "GFF", with conventional file
name ending ".gff".
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="semantics"><h2> Semantics </h2>

We have intentionally avoided overspecifying the semantics of the
format.  For example, we have not restricted the items expressible in
GFF to a specified set of feature types (splice sites, exons etc.)
with defined semantics.  Therefore, in order for the information in a
gff file to be useful to somebody else, the person producing the
features must describe the meaning of the features.  <P>

In the example given above the feature "splice5" indicates that there
is a candidate 5' splice site between positions 172 and 173.  The
"sp5-20" feature is a prediction based on a window of 20 bp for the
same splice site.  To use either of these, you must know the position
within the feature of the predicted splice site.  This only needs to
be given once, possibly in comments at the head of the file, or in a
separate document.  <P>

Another example is the scoring scheme; we ourselves would like the
score to be a log-odds likelihood score in bits to a defined null
model, but that is not required, because different methods take
different approaches.
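<P>
To make the preferred scoring scheme concrete: a log-odds score in
bits is the base-2 logarithm of the ratio between the probability of
the observed sequence under the feature model and its probability
under the null model.  The Python sketch below is illustrative only.
The function names are ours, and the per-position independence used in
<code>site_score_bits</code> is one simple modelling assumption, not
something GFF requires.

```python
import math


def log_odds_bits(p_model, p_null):
    """Log-odds score in bits: log2 of the likelihood ratio."""
    if p_model <= 0 or p_null <= 0:
        raise ValueError("probabilities must be positive")
    return math.log2(p_model / p_null)


def site_score_bits(site, position_probs, background):
    """Sum per-position log-odds, assuming positions are independent.

    position_probs[i][base] is the model probability of `base` at
    position i; background[base] is the null-model probability.
    """
    return sum(
        log_odds_bits(position_probs[i][base], background[base])
        for i, base in enumerate(site)
    )
```

A score of 0 then means the model and the null model explain the
sequence equally well, and each additional bit doubles the likelihood
ratio in favour of the feature.
<P>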

Avoiding a prespecified feature set also leaves open the possibility
for GFF to be used for new feature types, such as CpG islands,
hypersensitive sites, promoter/enhancer elements, etc.
<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="GFF_use"><h2> Ways to use GFF </h2>

Here are a few suggestions on how the GFF format might be used.
 <ol>
 <li> Simple sharing of sensors. In this case, researcher A has a sensor,
such as a 3' splice site sensor, and researcher B wants to test that
sensor.  They agree on a set of sequences, researcher A runs the
sensor on these sequences and sends the resulting GFF file to
researcher B, who then evaluates the result.<P>

 <li> Representing experimental results.  GFF feature records can also
be created for experimentally confirmed exons and other features.  In
these cases there will presumably be no score.  Such "confirmed" GFF
files will be useful for evaluating predictions, using the same
software as you would to compare predictions.<P>

 <li> Integrated gene parsing. Several GFF files from different
researchers can be combined to provide the features used by an
integrated genefinder.  As mentioned above, this has the advantage
that different combinations of sensors and dynamic programming methods
for assembling sensor scores into consistent gene parses can be easily
explored.<P>

 <li> Reporting final predictions. GFF format can also be used to
communicate finished gene predictions. One simply reports final
predicted exons and other predicted gene features, either with their
original scores, or with some sort of posterior scores, rather than,
or in addition to, reporting all candidate gene features with their
scores.  To show that a set of the components belong to a single
prediction, a "group" field can be added to all the accepted sites.
This is useful for comparing the outputs of several integrated
genefinders among themselves, and to "confirmed" GFF files.  A
particular advantage of having the same format for both raw sensor
feature score files and final gene parse files is that one can easily
explore the possibility of combining the final gene parses from
several different genefinders, using another round of dynamic
programming, into a single integrated predicted parse.<P>

 <li> Visualisation. GFF will also provide a simple standard format for
standardising input to visualisation programs, showing predicted and
experimentally determined features, gene structures etc.

</ol>
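<P>
As an illustration of evaluating predictions against a "confirmed" GFF
file, the Python sketch below counts features that agree exactly.  It
is a minimal example under stated assumptions: the function names are
ours, two records match only if &#060;seqname&#062;,
&#060;start&#062;, &#060;end&#062; and &#060;strand&#062; are all
identical, and the feature label "exon" is just a default.  Real
evaluations often use looser criteria, such as overlap.

```python
def feature_keys(lines, feature="exon"):
    """Set of (seqname, start, end, strand) for records of one feature."""
    keys = set()
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines, comments and ## meta lines
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 8 and fields[2] == feature:
            keys.add((fields[0], int(fields[3]), int(fields[4]), fields[6]))
    return keys


def compare_features(predicted_lines, confirmed_lines, feature="exon"):
    """Exact-match comparison of predicted against confirmed features."""
    predicted = feature_keys(predicted_lines, feature)
    confirmed = feature_keys(confirmed_lines, feature)
    matched = len(predicted & confirmed)
    return {
        "matched": matched,
        # Fraction of confirmed features that were predicted.
        "sensitivity": matched / len(confirmed) if confirmed else None,
        # Fraction of predictions that are confirmed.
        "precision": matched / len(predicted) if predicted else None,
    }
```

Because both inputs are plain GFF, the same function can compare two
sets of predictions with each other by passing one of them as the
"confirmed" set.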

<P>
Back to <A HREF="#TOC">Table of Contents</A>
<P>
<HR>
<A NAME="examples"><h3> Complex Examples</h3>

<A NAME="homology_feature">


