BioPerl

Bio/DB/SeqFeature/Store/FeatureFileLoader.pm

  use Bio::DB::SeqFeature::Store;
  use Bio::DB::SeqFeature::Store::FeatureFileLoader;

  # Open the sequence database
  my $db      = Bio::DB::SeqFeature::Store->new( -adaptor => 'DBI::mysql',
                                                 -dsn     => 'dbi:mysql:test',
                                                 -write   => 1 );

  my $loader =
    Bio::DB::SeqFeature::Store::FeatureFileLoader->new(-store    => $db,
                                                       -verbose  => 1,
                                                       -fast     => 1);

  $loader->load('./my_genome.fff');


=head1 DESCRIPTION

The Bio::DB::SeqFeature::Store::FeatureFileLoader object parses
FeatureFile-format sequence annotation files and loads them into
Bio::DB::SeqFeature::Store databases. For certain combinations of
SeqFeature classes and SeqFeature::Store databases it offers a "fast
load" mode that speeds up database loading by a factor of 5-10.

FeatureFile Format (.fff) is very simple:

 mRNA B0511.1 Chr1:1..100 Type=UTR;Note="putative primase"
 mRNA B0511.1 Chr1:101..200,300..400,500..800 Type=CDS
 mRNA B0511.1 Chr1:801..1000 Type=UTR

 reference = Chr3
 Cosmid	B0511	516..619
 Cosmid	B0511	3185..3294
 Cosmid	B0511	10946..11208
 Cosmid	B0511	13126..13511
 Cosmid	B0511	11394..11539
 EST	yk260e10.5	15569..15724
 EST	yk672a12.5	537..618,3187..3294
 EST	yk595e6.5	552..618
 EST	yk595e6.5	3187..3294
 EST	yk846e07.3	11015..11208
 EST	yk53c10
 	yk53c10.3	15000..15500,15700..15800
 	yk53c10.5	18892..19154
 EST	yk53c10.5	16032..16105
 SwissProt	PECANEX	13153-13656	Note="Swedish fish"
 FGENESH	"Predicted gene 1"	1-205,518-616,661-735,3187-3365,3436-3846	"Pfam domain"
 # file ends

Each line has up to four columns of text delimited by WHITESPACE (not
necessarily tabs). Embedded whitespace must be escaped using shell
escaping rules (quoting the column or backslashing whitespace).

  Column 1: The feature type. You may use type:subtype as a convention
            for method:source.

  Column 2: The feature name/ID.

  Column 3: The position of this feature in base pair
            coordinates. Ranges can be given as either 
            start-end or start..end. A chromosome position
            can be specified using the format "reference:start..end".
            A discontinuous feature can be specified by giving
            multiple ranges separated by commas. Minus-strand features
            are indicated by specifying a start > end.

  Column 4: Comment/attribute field. A single Note can be given, or
            a series of attribute=value pairs, separated by
            spaces or semicolons, as in "score=23;type=transmembrane"
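
As a rough illustration of the column rules above, the following sketch
splits one line into columns with the core Text::ParseWords module,
whose shellwords() function honors quotes and backslash escapes. The
helper name split_columns is hypothetical and is not part of the
loader's API; this is not necessarily how the loader itself tokenizes
lines.

```perl
use strict;
use warnings;
use Text::ParseWords qw(shellwords);

# Hypothetical helper: split one FeatureFile line into its
# whitespace-delimited columns. shellwords() keeps a quoted or
# backslash-escaped column together and removes the quotes.
sub split_columns {
    my ($line) = @_;
    chomp $line;
    return shellwords($line);
}

my @cols = split_columns(
    q{FGENESH "Predicted gene 1" 1-205,518-616 "Pfam domain"}
);
print scalar(@cols), " columns: ", join(' | ', @cols), "\n";
```

Note that shellwords() returns an empty list when a line contains
unbalanced quotes, so a caller would need to check for that case.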

=head2 Specifying Positions and Ranges

A feature position is specified using a sequence ID (a GenBank
accession number, a chromosome name, a contig, or any other meaningful
reference system), followed by a colon and a position range. Ranges are
two integers separated by double dots or a hyphen. Examples:
"Chr1:516..11208", "ctgA:1-5000". Negative coordinates are allowed, as
in "Chr1:-187..1000".

A discontinuous range ("split location") uses commas to separate the
ranges.  For example:

 Gene B0511.1  Chr1:516..619,3185..3294,10946..11208

In the case of a split location, the sequence id only has to appear in
front of the first range.
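
The position syntax described above can be sketched with a single
regular expression. The parse_position helper below is hypothetical and
only illustrates the format; it is not the loader's own parser. It
returns the sequence ID and a list of [start, end, strand] triples,
where representing strand as +1 or -1 (derived from start > end) is an
assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: parse "seqid:start..end,start..end" into a seqid
# and a list of [start, end, strand] triples. $default_seqid is used
# when no "seqid:" prefix is present.
sub parse_position {
    my ($pos, $default_seqid) = @_;
    my $seqid = $default_seqid;
    my @ranges;
    for my $part (split /,/, $pos) {
        # Optional "seqid:" prefix, then two integers separated by ".."
        # or "-". Each integer may carry a leading minus sign.
        my ($ref, $start, $end) =
            $part =~ /^(?:(.+):)?(-?\d+)(?:\.\.|-)(-?\d+)$/
            or die "cannot parse range '$part'\n";
        $seqid = $ref if defined $ref;
        my $strand = $start > $end ? -1 : 1;   # start > end means minus strand
        push @ranges, [$start, $end, $strand];
    }
    return ($seqid, \@ranges);
}

my ($seqid, $ranges) = parse_position('Chr1:516..619,3185..3294,10946..11208');
print "$seqid: ", join(' ', map { "$_->[0]..$_->[1]" } @$ranges), "\n";
```

Because the seqid is remembered across the comma-separated parts, it
only needs to appear in front of the first range.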

Alternatively, a split location can be indicated by repeating the
feature's type and name on multiple adjacent lines:

 Gene	B0511.1	Chr1:516..619
 Gene	B0511.1	Chr1:3185..3294
 Gene	B0511.1	Chr1:10946..11208

If all the locations are on the same reference sequence, you can
specify a default chromosome using a "reference=E<lt>seqidE<gt>" line:

 reference=Chr1
 Gene	B0511.1	516..619
 Gene	B0511.1	3185..3294
 Gene	B0511.1	10946..11208

The default seqid is in effect until the next "reference" line
appears.
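
The scoping of the default seqid can be sketched as a small state
machine over the lines of a file. The resolve_references helper is
hypothetical, and for brevity it assumes three simple columns with no
quoted or escaped whitespace; lines without a position column are
skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: attach a seqid to every feature line, honoring
# "reference=<seqid>" lines. Returns a list of [type, name, seqid, position].
sub resolve_references {
    my (@lines) = @_;
    my $current;     # default seqid currently in effect
    my @resolved;
    for my $line (@lines) {
        next if $line =~ /^\s*(?:#|$)/;       # skip comments and blank lines
        if ($line =~ /^\s*reference\s*=\s*(\S+)/) {
            $current = $1;                    # in effect until the next reference line
            next;
        }
        my ($type, $name, $pos) = split ' ', $line;
        next unless defined $pos;             # no position column on this line
        # An explicit "seqid:" prefix takes precedence over the default.
        my $seqid = $pos =~ /^([^:]+):/ ? $1 : $current;
        push @resolved, [$type, $name, $seqid, $pos];
    }
    return @resolved;
}

my @features = resolve_references(
    "reference=Chr1",
    "Gene\tB0511.1\t516..619",
    "Gene\tB0511.1\t3185..3294",
);
print join("\t", @$_), "\n" for @features;
```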

=head2 Feature Tags

Tags can be added to features by adding a fourth column consisting of
"tag=value" pairs:

 Gene  B0511.1  Chr1:516..619,3185..3294 Note="Putative primase"

Tags and their values take any form you want, and multiple tags can be
separated by semicolons. You can also repeat tags multiple times:

 Gene  B0511.1  Chr1:516..619,3185..3294 GO_Term=GO:100;GO_Term=GO:2087
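
Since a tag may repeat, a natural in-memory form is a hash whose values
are lists. The parse_tags helper below is a hypothetical sketch of that
idea, not the loader's implementation. It assumes pairs are separated
by semicolons only and that quoted values contain no semicolons.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a tag column such as
# 'GO_Term=GO:100;GO_Term=GO:2087' into a hash of array refs, so that
# repeated tags keep all of their values.
sub parse_tags {
    my ($column) = @_;
    my %tags;
    for my $pair (split /;/, $column) {
        my ($tag, $value) = split /=/, $pair, 2;   # split on the first "=" only
        next unless defined $value;
        $value =~ s/^"(.*)"$/$1/;                  # strip surrounding double quotes
        push @{ $tags{$tag} }, $value;
    }
    return \%tags;
}

my $tags = parse_tags('GO_Term=GO:100;GO_Term=GO:2087');
print "$_ => @{ $tags->{$_} }\n" for sort keys %$tags;
```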

Several tags have special meanings:

 Tag     Meaning
 ---     -------
