GBrowse
<h3><a name="data_file">1.1 The Data File</a></h3>
<p>
Let's now look in detail at the data file we loaded. If you open the
<a href="data_files/volvox_remarks.gff3">volvox_remarks.gff3</a> file in a text
editor, you will see that it contains a series of 15 genome "features"
that look like this:
<blockquote class="example"><pre>
ctgA example contig 1 50000 . . . Name=ctgA
ctgA example remark 1659 1984 . + . Name=f07;Note=This is an example
ctgA example remark 3014 6130 . + . Name=f06;Note=This is another example
ctgA example remark 4715 5968 . - . Name=f05;Note=Ok! Ok! I get the message.
ctgA example remark 13280 16394 . + . Name=f08
...
</pre></blockquote>
<p>
Each feature has a "source" of "example", a type of "remark", and
occupies a short range (roughly 1.5 kb) on a contig named "ctgA." In
addition to the features themselves, there is an entry for the contig
itself (type "contig"). This entry is needed to tell GBrowse what the
length of ctgA is.
<p>
The load file uses a standard known as <a
href="http://www.sequenceontology.org/gff3.shtml">GFF3 (General
Feature Format version 3)</a>. Each line of the file corresponds to a
feature on the genome, and the nine columns are separated by tabs.
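<p>
As a rough illustration of the tab-separated layout, the following Python
sketch splits one feature line into its nine columns. It is not part of
GBrowse; the names <code>GffLine</code> and <code>parse_gff3_line</code> are
hypothetical, and the sketch assumes the line is a feature line rather than a
comment or directive.

```python
from typing import NamedTuple


class GffLine(NamedTuple):
    """The nine GFF3 columns (hypothetical container, not a GBrowse type)."""
    seqid: str
    source: str
    feature_type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: str


def parse_gff3_line(line):
    """Split one tab-separated GFF3 feature line into its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    # Start and end are 1-based positions; the other columns stay as text
    # because a dot (".") is a legal placeholder in several of them.
    return GffLine(seqid, source, ftype, int(start), int(end),
                   score, strand, phase, attrs)


example = "ctgA\texample\tremark\t1659\t1984\t.\t+\t.\tName=f07;Note=This is an example"
feature = parse_gff3_line(example)
print(feature.seqid, feature.feature_type, feature.start, feature.end)
```

Splitting on tabs only, rather than on any whitespace, matters here because
the last column may legitimately contain spaces.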
<p>
The 9 columns are as follows:
<ol>
<li><b>reference sequence</b><br>
This is the name of the feature that will be used to establish the
coordinate system for the annotation. This is usually the name of
a chromosome, a clone, or a contig. In our example, the
reference sequence is "ctgA". A single GFF file can refer to
multiple reference sequences.</li><br>
<li><b>source</b><br>
The source of the annotation. This field describes how the
feature was derived. In the example, the source is
"example" for want of a better description. Many people use
the source as a way of distinguishing between similar features
that were derived by different methods, for example, gene
calls derived from different prediction software. You can
leave this column blank by replacing the source with a single
dot (".").</li><br>
<li><b>type</b><br>
This column describes the feature type. Although you can choose anything
you like to describe the feature type, you are strongly encouraged to use
well-recognized Sequence Ontology (SO) terms such as "gene", "repeat_region", "exon",
and "CDS." You can find a list of the recognized SO terms at
<a
href="http://song.cvs.sourceforge.net/song/ontology/sofa.ontology?rev=HEAD&content-type=text/vnd.viewcvs-markup">the Sequence Ontology Project web site</a>. For
lack of a better name, the features in the volvox example are of
type "remark."</li><br>
<li><b>start position</b><br>
The position that the feature starts at, relative to the
reference sequence. The first base of the reference sequence
is position 1.</li><br>
<li><b>end position</b><br>
The end of the feature, again relative to the reference
sequence. End is always greater than or equal to start.</li><br>
<li><b>score</b><br>
For features that have a numeric score, such as sequence
similarities, this field holds the score. Score units are
arbitrary, but most people use the expectation value for
similarity features. You can leave it blank by replacing
the column with a dot.</li><br>
<li><b>strand</b><br>
For features that are strand-specific, this field is the
strand on which the annotation resides. It is "+" for the forward
strand, "-" for the reverse strand, or "." for annotations that are
not stranded. If you are unsure whether a feature is
stranded, it won't hurt to use a "+" here.</li><br>
<li><b>phase</b><br>
For CDS features that encode proteins, this field describes
where the next codon starts.
The phase is one of the integers 0, 1, or 2, indicating the
number of bases that should be removed from the beginning of
this feature in order to reach the first base of the next codon. In other
words, a phase of "0" indicates that the next codon begins at
the first base of the region described by the current line, a
phase of "1" indicates that the next codon begins at the second
base of this region, and a phase of "2" indicates that the next codon
begins at the third base of this region. This
information is used by the "cds" glyph to show how the reading
frame changes across splice sites. For all other feature types,
use a dot here.</li><br>
<li><b>attributes</b><br>
A list of feature attributes in the format tag=value. Multiple
tag=value pairs are separated by semicolons. URL escaping rules are
used for tags or values containing the following characters: ",=;".
Spaces are allowed in this field, but tabs must be replaced with the
%09 URL escape.
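<p>
The tag=value rules above can be sketched in Python as follows. This is a
minimal illustration, not GBrowse's own parser: the function name
<code>parse_attributes</code> is hypothetical, each value is kept as a single
string, and a repeated tag simply overwrites the earlier one.

```python
from urllib.parse import unquote


def parse_attributes(column):
    """Turn 'tag=value;tag=value' into a dict, undoing URL escapes."""
    attributes = {}
    if column in ("", "."):
        return attributes  # an empty or dotted column carries no attributes
    for pair in column.split(";"):
        if not pair:
            continue  # tolerate a trailing semicolon
        tag, _, value = pair.partition("=")
        # Decode escapes such as %3B (semicolon) and %09 (tab) only after
        # splitting, so escaped separators are not mistaken for real ones.
        attributes[unquote(tag)] = unquote(value)
    return attributes


attrs = parse_attributes("Name=f05;Note=Ok! Ok! I get the message.")
print(attrs["Name"])
print(attrs["Note"])
```

Decoding after the split is the important design choice: a value written as
<code>a%3Bb</code> comes back as the text "a;b" instead of being cut in two.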
<br><br>
These tags have predefined meanings:
<dl>
<dt>ID</dt>
<dd>Gives the feature a unique identifier. Useful when grouping features
together (such as all the exons in a transcript).</dd>
<dt>Name</dt>
<dd>Display name for the feature. This is the name to be
displayed to the user.</dd>
<dt>Alias</dt>
<dd>
A secondary name for the feature. It is suggested that
this tag be used whenever a secondary identifier for the
feature is needed, such as locus names and
accession numbers.</dd>