[![DOI](http://joss.theoj.org/papers/10.21105/joss.01364/status.svg)](https://doi.org/10.21105/joss.01364)


WGS2NCBI - preparing genomes for submission to NCBI
===================================================

The process of going from an annotated genome to a valid NCBI submission is somewhat 
cumbersome. "Boutique" genome projects typically produce a scaffolded assembly in FASTA 
format, as produced by any of a variety of de-novo assemblers, and predicted genes in GFF3 
tabular format, e.g. as produced by the 
[maker](http://www.yandell-lab.org/software/maker.html) pipeline. No convenient tools 
appear to exist to turn these results into a format, and bring them up to a standard, that NCBI accepts.

NCBI requires that "whole genome shotgun" (WGS) genomes be submitted as `.sqn` files. 
An `.sqn` file is a file in ASN.1 syntax that contains the sequences, their features, and 
the metadata about the submission, i.e. the authors, the publication title, the organism, 
etc. `.sqn` files are normally produced by the 
[sequin](https://www.ncbi.nlm.nih.gov/Sequin/) program, which has a graphical user 
interface. Sequin works fine for a single gene or for a small genome (e.g. a mitochondrial 
genome) but for large genomes with thousands of genes spread out over potentially 
thousands of scaffolds the submission process done in this way is unworkable.

The alternative is to use the [tbl2asn](https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/) 
command line program, which takes a directory with FASTA files (`.fsa`), corresponding 
files with the gene features in tabular format (`.tbl`), and a submission template 
(`template.sbt`), to produce `.sqn` files. The trick thus becomes to convert the assembly
FASTA file and the annotation GFF3 file into a collection of FASTA chunks with
corresponding feature tables. This is doable in principle (several toolkits provide
generic converters), but NCBI places quite a few restrictions on what are permissible 
things to have in the FASTA headers, what coordinate ranges are credible as gene features,
and what gene and gene product names are acceptable.
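
To make the target format concrete, here is a minimal sketch of the kind of conversion
involved. It is not the approach WGS2NCBI itself takes. The helper name
`gff3_genes_to_tbl` is hypothetical, and the sketch assumes a GFF3 file whose lines are
grouped by sequence ID. It emits only the location lines of `gene` features in NCBI's
tab-delimited feature table layout, where minus-strand features are written with the
start coordinate greater than the stop coordinate. Qualifiers such as locus tags and
product names, and all of the cleaning described above, are left out.

```shell
# Hypothetical helper: print a bare-bones feature table for the 'gene'
# features in a GFF3 file. Assumes the file is grouped by sequence ID.
gff3_genes_to_tbl() {
    awk -F'\t' -v OFS='\t' '
        # Skip comments, directives, malformed lines, and non-gene features.
        /^#/ || NF < 9 || $3 != "gene" { next }
        # Start a new table whenever the sequence ID changes.
        $1 != current { current = $1; print ">Feature " $1 }
        # Minus-strand features are written with start > stop.
        $7 == "-" { print $5, $4, "gene"; next }
        { print $4, $5, "gene" }
    ' "$1"
}
```

With a hypothetical input file, `gff3_genes_to_tbl annotations.gff3` prints one
`>Feature` header per sequence, followed by one location line per gene.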

This project addresses these challenges by providing a command-line utility (with no 
third-party dependencies except [URI::Escape](http://search.cpan.org/dist/URI-Escape)) that 
does the required data reformatting and cleaning. Also included is a shell script that 
chains the Perl scripts together and runs NCBI's tbl2asn on the result. This shell script is 
intended as an example and should be edited or copied to provide the right values for your project.

Installation
============

The WGS2NCBI release is organized in a way that is standard for software releases written
in the Perl5 programming language. This means that it can be installed using a series
of commands that either you yourself, or your systems administrator, are likely already
familiar with. The first step is to install a required dependency using the Perl5 package
manager ([cpan](https://perldoc.perl.org/cpan.html)), as follows:

    $ sudo cpan -i URI::Escape
    
The next steps assume that you have downloaded the WGS2NCBI release (for example from the
[git repository](https://github.com/naturalis/wgs2ncbi/archive/master.zip)), have unzipped
it, and have moved into the root folder of the release in your terminal. From there, run
the following commands:

    $ perl Makefile.PL
    $ make test
    $ sudo make install

The second command (`make test`) performs a number of basic tests of the software on your
system. These should all pass without problems. If you do encounter issues, it is best
_not_ to proceed to the following step for the actual installation, but rather to try to
resolve the outstanding problems, for example by submitting an
[issue report](https://github.com/naturalis/wgs2ncbi/issues), so that the authors can help
you out.

In addition to the preceding steps, you also need to install the `tbl2asn` program. The 
instructions for this are [here](https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/).
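
Once everything is installed, a quick check along the following lines can confirm that
the prerequisites named in this section are visible from your shell. This is only a
sketch: the helper names `have_tool` and `have_perl_module` are made up for this
example, and it assumes that `tbl2asn` was installed under that name somewhere on your
`PATH`.

```shell
# Report whether a program can be found on the PATH.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Report whether a Perl module can be loaded by the perl on the PATH.
have_perl_module() {
    if perl -M"$1" -e 1 >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

have_tool perl || true
have_perl_module URI::Escape || true
have_tool tbl2asn || true
```

Anything reported as missing should be installed before moving on to the usage steps.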

Usage
=====

WGS2NCBI is used by following a number of steps, which are detailed below:

- [Before you start](#before-you-start) - set up all the input files, prepare a submission
  template
- [Subcommand `prepare`](#subcommand-prepare) - pre-process the annotation file for rapid
  access in the following steps
- [Subcommand `process`](#subcommand-process) - convert the genome file and annotations
  to FASTA chunks and feature tables
- [Subcommand `convert`](#subcommand-convert) - run tbl2asn to convert the FASTA chunks
  and feature tables to `.sqn` files
- [Subcommand `compress`](#subcommand-compress) - collate the `.sqn` files into a single
  archive for upload to NCBI

Before you start
----------------

Before issuing any commands, the following steps need to be taken:

1. The installation (see above) needs to be completed.
2. You need to have the genome assembly available as a FASTA file, and the annotations
   as a GFF3 file.
3. You will need to prepare a 
   [submission template](http://www.ncbi.nlm.nih.gov/WebSub/template.cgi). The file
   [template.sbt](share/template.sbt) is an example of what these files look like.
4. You need to prepare a number of `.ini` configuration files. Using the linked files as 
   examples, the following need to be prepared:
   - [wgs2ncbi.ini](share/wgs2ncbi.ini) - the main configuration file, in which you 
     specify the locations of the input files and output directories. In addition, here
     you will specify the prefixes for the identifiers that will be inserted in the 
     feature tables and various parameters for what to filter on. The file is well 
     documented with comments.
   - [info.ini](share/info.ini) - a file with key/value pairs whose contents will be 
     inserted in the FASTA headers of the sequence files. These key/value pairs have to
     do with the organism that was sequenced, such as the taxon name, its sex, its
     developmental stage, which tissues were sampled, and so on.
   - [adaptors.ini](share/adaptors.ini) - this is a file that contains the coordinates 
     of sequence fragments that NCBI considers inadmissible. What will happen over the
     course of your submission is that NCBI will scan your sequence data for suspicious
     sequence fragments. These might be adaptor sequences of various sequencing platforms,
     or fragments that NCBI thinks might be contaminants. Hence, during your first pass
     it is more or less impossible to get the values right in this file: this part will
     be an iterative process where you blank out parts of your data that NCBI really will
     not accept. Start out with an empty file, and populate it based on the feedback you
     will get, making sure you follow the same syntax as the provided example file.
   - [products.ini](share/products.ini) - this is a file that contains mappings from 
     (parts of) the gene names that you assigned during the annotation process to names
     that NCBI will accept. Again, this is impossible to predict during the first pass:
     you will get feedback on which names NCBI doesn't like (for example because there are
     things in the names that look like database identifiers, organism names, molecular
     weights, etc.) and in this file you map these to allowed names.
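
Before moving on, it can save time to sanity-check the inputs listed above. The
following sketch is not part of WGS2NCBI, and the helper names are hypothetical. It
assumes that a FASTA identifier is the first word after `>` on a header line and that
the GFF3 sequence ID is in the first tab-delimited column. It lists files that are
missing, counts the FASTA records, and reports annotations that refer to sequences
absent from the assembly.

```shell
# Print every argument that is not a readable regular file.
missing_files() {
    for f in "$@"; do
        [ -f "$f" ] && [ -r "$f" ] || echo "$f"
    done
}

# Count the sequence records in a FASTA file (header lines start with '>').
count_fasta_records() {
    grep -c '^>' "$1"
}

# Print each GFF3 sequence ID (column 1) that has no matching FASTA header.
# Usage: missing_seqids genome.fasta annotations.gff3
missing_seqids() {
    awk -F'\t' '
        # First file: remember the first word of every FASTA header.
        FILENAME == ARGV[1] {
            if (/^>/) { split(substr($0, 2), w, /[ \t]/); ids[w[1]] = 1 }
            next
        }
        # Second file: skip comments and lines without nine GFF3 columns.
        /^#/ || NF < 9 { next }
        # Report each unknown sequence ID once.
        !($1 in ids) && !reported[$1]++ { print $1 }
    ' "$1" "$2"
}
```

For example, with hypothetical file names for the assembly and annotations,
`missing_files genome.fasta annotations.gff3 template.sbt wgs2ncbi.ini info.ini adaptors.ini products.ini`
should print nothing, and so should `missing_seqids genome.fasta annotations.gff3`.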


