GBrowse


<h3><a name="data_file">1.1 The Data File</a></h3>

<p>


Let's look at the data file we loaded in detail now.  If you open the
<a href="data_files/volvox_remarks.gff3">volvox_remarks.gff3</a> file in a text
editor, you will see that it contains a series of 15 genome "features"
that look like this:

<blockquote class="example"><pre>
ctgA example contig 1     50000 . . . Name=ctgA
ctgA example remark 1659  1984  . + . Name=f07;Note=This is an example
ctgA example remark 3014  6130  . + . Name=f06;Note=This is another example
ctgA example remark 4715  5968  . - . Name=f05;Note=Ok! Ok! I get the message.
ctgA example remark 13280 16394 . + . Name=f08
...
</pre></blockquote>

<p>

Each feature has a "source" of "example", a type of "remark", and
occupies a short range (a few hundred to a few thousand bases in the
lines shown above) on a contig named "ctgA".  In
addition to the features themselves, there is an entry for the contig
itself (type "contig").  This entry is needed to tell GBrowse what the
length of ctgA is.

<p>

The load file uses a standard known as <a
href="http://www.sequenceontology.org/gff3.shtml">GFF3 (General
Feature Format version 3)</a>.  Each line of the file corresponds to a
feature on the genome, and the nine columns are separated by tabs.
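
<p>

As a rough illustration of this layout, the following minimal Python
sketch splits one feature line on tabs and unpacks the semicolon-separated
tag=value pairs seen in the last column of the sample lines above.  The
names "parse_gff3_line" and "COLUMNS" are hypothetical and exist only for
this illustration; they are not part of GBrowse, and the sketch makes no
attempt to handle comments, directives, or repeated tags.

<blockquote class="example"><pre>
```python
from urllib.parse import unquote

# Column names in the order they appear on a GFF3 feature line.
# (These labels are assumptions chosen for this sketch.)
COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Split one tab-separated GFF3 feature line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError("expected 9 tab-separated columns, got %d" % len(fields))
    feature = dict(zip(COLUMNS, fields))

    # Coordinates are 1-based integers.
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])

    # A single dot means the column is empty or does not apply.
    for key in ("source", "score", "strand", "phase"):
        if feature[key] == ".":
            feature[key] = None

    # Last column: semicolon-separated tag=value pairs, URL-escaped where needed.
    attributes = {}
    for pair in feature["attributes"].split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            attributes[unquote(tag)] = unquote(value)
    feature["attributes"] = attributes
    return feature


line = "ctgA\texample\tremark\t1659\t1984\t.\t+\t.\tName=f07;Note=This is an example"
feature = parse_gff3_line(line)
print(feature["type"], feature["start"], feature["end"], feature["attributes"]["Name"])
# prints: remark 1659 1984 f07
```
</pre></blockquote>

<p>

Splitting strictly on tabs, rather than on any whitespace, is what lets
values such as "This is an example" keep their spaces.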

<p>

The nine columns are as follows:

<ol>
  <li><b>reference sequence</b><br>
      This is the name of the sequence that will be used to establish the
      coordinate system for the annotation.  This is usually the name of
      a chromosome, a clone, or a contig.  In our example, the
      reference sequence is "ctgA".  A single GFF file can refer to
      multiple reference sequences.</li><br>
  <li><b>source</b><br>
      The source of the annotation.  This field describes how the
      feature was derived.  In the example, the source is
      "example" for want of a better description.  Many people use
      the source as a way of distinguishing between similar features
      that were derived by different methods, for example, gene
      calls derived from different prediction software.  You can
      leave this column blank by replacing the source with a single
      dot (".").</li><br>
  <li><b>type</b><br>
      This column describes the feature type.  Although you can choose anything
      you like to describe the feature type, you are strongly encouraged to use
      well-recognized sequence ontology (SO) terms such as "gene", "repeat_region", "exon",
      and "CDS".  You can find a list of the recognized SO terms at
      <a
      href="http://song.cvs.sourceforge.net/song/ontology/sofa.ontology?rev=HEAD&content-type=text/vnd.viewcvs-markup">the Sequence Ontology Project web site</a>. For
      lack of a better name, the features in the volvox example are of
      type "remark".</li><br>
  <li><b>start position</b><br>
      The position that the feature starts at, relative to the
      reference sequence.  The first base of the reference sequence
      is position 1.</li><br>
  <li><b>end position</b><br>
      The end of the feature, again relative to the reference
      sequence.  End is always greater than or equal to start.</li><br>
  <li><b>score</b><br>
      For features that have a numeric score, such as sequence
      similarities, this field holds the score.  Score units are
      arbitrary, but most people use the expectation value for
      similarity features.  You can leave it blank by replacing
      the column with a dot.</li><br>
  <li><b>strand</b><br>
      For features that are strand-specific, this field is the
      strand on which the annotation resides.  It is "+" for the forward
      strand, "-" for the reverse strand, or "." for annotations that are
      not stranded.  If you are unsure whether a feature is
      stranded, it won't hurt to use a "+" here.</li><br>
  <li><b>phase</b><br>
      For CDS features that encode proteins, this field describes
      where the next codon starts.
      The phase is one of the integers 0, 1, or 2, indicating the
      number of bases that should be removed from the beginning of
      this feature in order to reach the first base of the next codon. In other
      words, a phase of "0" indicates that the next codon begins at
      the first base of the region described by the current line, a
      phase of "1" indicates that the next codon begins at the second
      base of this region, and a phase of "2" indicates that the next codon
      begins at the third base of this region. This
      information is used by the "cds" glyph to show how the reading
      frame changes across splice sites.  For all other feature types,
      use a dot here.</li><br>
  <li><b>attributes</b><br>
      A list of feature attributes in the format tag=value.  Multiple
      tag=value pairs are separated by semicolons.  URL escaping rules are
      used for tags or values containing the following characters: ",=;".
      Spaces are allowed in this field, but tabs must be replaced with the
      %09 URL escape.

      <br><br>
      These tags have predefined meanings:
      <dl>
	<dt>ID</dt>
	<dd>Gives the feature a unique identifier. Useful when grouping features
	    together (such as all the exons in a transcript).</dd>
	    
	<dt>Name</dt>
	<dd>Display name for the feature.  This is the name to be
	    displayed to the user.</dd>

	<dt>Alias</dt>
	<dd>
	    A secondary name for the feature.  It is suggested that
	    this tag be used whenever a secondary identifier for the
	    feature is needed, such as locus names and
	    accession numbers.</dd>
	    


