Bio-WGS2NCBI
[DOI: 10.21105/joss.01364](https://doi.org/10.21105/joss.01364)

WGS2NCBI - preparing genomes for submission to NCBI
===================================================
The process of going from an annotated genome to a valid NCBI submission is somewhat
cumbersome. "Boutique" genome projects typically produce a scaffolded assembly in FASTA
format, as produced by any of a variety of de-novo assemblers, and predicted genes in GFF3
tabular format, e.g. as produced by the
[maker](http://www.yandell-lab.org/software/maker.html) pipeline. No convenient tools
appear to exist to turn these results into a format, and to a standard, that NCBI accepts.

NCBI requires that "whole genome shotgun" (WGS) genomes are submitted as `.sqn` files.
A `.sqn` file is a file in ASN.1 syntax that contains the sequences, their features, and
the metadata about the submission, i.e. the authors, the publication title, the organism,
etc. `.sqn` files are normally produced by the
[sequin](https://www.ncbi.nlm.nih.gov/Sequin/) program, which has a graphical user
interface. Sequin works fine for a single gene or for a small genome (e.g. a mitochondrial
genome), but it is unworkable for large genomes with thousands of genes spread out over
potentially thousands of scaffolds.

The alternative is to use the [tbl2asn](https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/)
command line program, which takes a directory with FASTA files (`.fsa`), corresponding
files with the gene features in tabular format (`.tbl`), and a submission template
(`template.sbt`), to produce `.sqn` files. The trick thus becomes to convert the assembly
FASTA file and the annotation GFF3 file into a collection of FASTA chunks with
corresponding feature tables. This is doable in principle (several toolkits provide
generic converters), but NCBI places quite a few restrictions on what is permissible
in the FASTA headers, what coordinate ranges are credible as gene features,
and what gene and gene product names are acceptable.
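
To make the conversion concrete, here is a minimal sketch of turning the `gene` rows of a GFF3 file into NCBI's tab-delimited feature table layout. It is not the WGS2NCBI implementation: the function name `gff3_genes_to_tbl` is hypothetical, the GFF3 `ID` attribute is used directly as the `locus_tag` (an illustrative simplification), and rows are assumed to be grouped by scaffold.

```shell
#!/bin/sh
# Sketch only: print a minimal feature table for the "gene" rows of a GFF3 file.
# Assumptions (not taken from WGS2NCBI): the ID attribute doubles as locus_tag,
# and rows for the same scaffold are adjacent in the input.
gff3_genes_to_tbl() {
    awk -F '\t' '
        /^#/ || NF < 9 || $3 != "gene" { next }    # skip comments, short rows, non-gene rows
        {
            if ($1 != seq) {                       # new scaffold: start a new table
                printf ">Feature %s\n", $1
                seq = $1
            }
            start = $4; stop = $5
            if ($7 == "-") { start = $5; stop = $4 }   # minus strand: start > stop
            printf "%s\t%s\tgene\n", start, stop
            id = ""
            n = split($9, attrs, ";")
            for (i = 1; i <= n; i++) {
                if (attrs[i] ~ /^ID=/) { id = substr(attrs[i], 4) }
            }
            if (id != "") printf "\t\t\tlocus_tag\t%s\n", id   # qualifier line: three leading tabs
        }
    ' "$1"
}
```

Calling `gff3_genes_to_tbl annotations.gff3` (a placeholder file name) writes the table to standard output. A real conversion also has to handle mRNA and CDS features, partial coordinates, and NCBI's naming rules, which is the work this project takes on.
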
This project addresses these challenges by providing a command-line utility (with no
third-party dependencies except [URI::Escape](http://search.cpan.org/dist/URI-Escape)) to do
the required data reformatting and cleaning. Also included is a shell script that chains
the Perl scripts together and runs NCBI's tbl2asn on the result. This shell script is
intended as an example and should be edited or copied to provide the right values.

Installation
============
The WGS2NCBI release is organized in a way that is standard for software releases written
in the Perl5 programming language. This means that it can be installed using a series
of commands that either you yourself, or your systems administrator, are likely already
familiar with. The first step is to install a required dependency using the Perl5 package
manager ([cpan](https://perldoc.perl.org/cpan.html)), as follows:

    $ sudo cpan -i URI::Escape

The following steps assume that you have downloaded the WGS2NCBI release (for example from the
[git repository](https://github.com/naturalis/wgs2ncbi/archive/master.zip)), have unzipped
it, and have moved into the root folder of the release in your terminal. Then run:

    $ perl Makefile.PL
    $ make test
    $ sudo make install

The second command (`make test`) performs a number of basic tests of the software on your
system. These should all pass without problems. If you do encounter issues, it is best
_not_ to proceed to the following step for the actual installation, but rather to try to
resolve the outstanding problems, for example by submitting an
[issue report](https://github.com/naturalis/wgs2ncbi/issues), so that the authors can help
you out.
In addition to the preceding steps, you also need to install the `tbl2asn` program. The
instructions for this are [here](https://www.ncbi.nlm.nih.gov/genbank/tbl2asn2/).
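
Before moving on, it can help to confirm that both prerequisites are visible from your shell. The following is a small sketch, not part of the release; the function name `check_prereqs` is hypothetical, and it assumes `tbl2asn` is installed under that name somewhere on your `PATH`.

```shell
#!/bin/sh
# Sketch only: report whether the two prerequisites named above can be found.
# Assumption: tbl2asn is installed under that exact name on the PATH.
check_prereqs() {
    status=0
    # Loading the module and running an empty program succeeds only if it is installed.
    if perl -MURI::Escape -e 1 2>/dev/null; then
        echo "URI::Escape: ok"
    else
        echo "URI::Escape: missing"
        status=1
    fi
    if command -v tbl2asn >/dev/null 2>&1; then
        echo "tbl2asn: ok"
    else
        echo "tbl2asn: missing"
        status=1
    fi
    return "$status"
}
```

Run `check_prereqs` after sourcing the snippet. A non-zero return status means at least one prerequisite was not found.
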
Usage
=====
WGS2NCBI is used by following a number of steps, which are detailed below:
- [Before you start](#before-you-start) - set up all the input files, prepare a submission
template
- [Subcommand `prepare`](#subcommand-prepare) - pre-process the annotation file for rapid
access in the following steps
- [Subcommand `process`](#subcommand-process) - convert the genome file and annotations
to FASTA chunks and feature tables
- [Subcommand `convert`](#subcommand-convert) - run tbl2asn to convert the FASTA chunks
and feature tables to Sequin (`.sqn`) files
- [Subcommand `compress`](#subcommand-compress) - collate the `.sqn` files into a single
archive for upload to NCBI

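
The four subcommands are meant to be run in the order listed. As a rough sketch of chaining them, the function below stops at the first failing step. Two things here are assumptions rather than facts from this section: that the installed executable is called `wgs2ncbi`, and that it takes the configuration file through a `-conf` option. Check the installed script's own help before relying on either. The function name `run_pipeline` and the `DRY_RUN` variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: run the four subcommands in order, stopping at the first failure.
# ASSUMPTIONS: the executable is named `wgs2ncbi` and accepts `-conf <file>`.
# Set DRY_RUN to any non-empty value to print the commands without running them.
run_pipeline() {
    conf=$1
    for step in prepare process convert compress; do
        echo "wgs2ncbi $step -conf $conf"
        if [ -z "${DRY_RUN:-}" ]; then
            wgs2ncbi "$step" -conf "$conf" || return 1
        fi
    done
}
```

With `DRY_RUN=1` exported, `run_pipeline wgs2ncbi.ini` only prints the four commands, which is a cheap way to review them before a real run.
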
Before you start
----------------
Before issuing any commands, the following steps need to be taken:
1. The installation (see above) needs to be completed.
2. You need to have the genome assembly available as a FASTA file, and the annotations
as a GFF3 file.
3. You will need to prepare a
[submission template](http://www.ncbi.nlm.nih.gov/WebSub/template.cgi). The file
[template.sbt](share/template.sbt) is an example of what these files look like.
4. You need to prepare a number of `.ini` configuration files. Using the linked files as
examples, the following need to be prepared:
- [wgs2ncbi.ini](share/wgs2ncbi.ini) - the main configuration file, in which you
specify the locations of the input files and output directories. In addition, here
you will specify the prefixes for the identifiers that will be inserted in the
feature tables and various parameters for what to filter on. The file is well
documented with comments.
- [info.ini](share/info.ini) - a file with key/value pairs whose contents will be
inserted in the FASTA headers of the sequence files. These key/value pairs have to
do with the organism that was sequenced, such as the taxon name, its sex, its
developmental stage, which tissues were sampled, and so on.
- [adaptors.ini](share/adaptors.ini) - this is a file that contains the coordinates
of sequence fragments that NCBI considers inadmissible. What will happen over the
course of your submission is that NCBI will scan your sequence data for suspicious
sequence fragments. These might be adaptor sequences from various sequencing platforms,
or fragments that NCBI suspects are contaminants. Hence, during your first pass
it is more or less impossible to get the values right in this file: this part will
be an iterative process where you blank out parts of your data that NCBI really will
not accept. Start out with an empty file, and populate it based on the feedback you
will get, making sure you follow the same syntax as the provided example file.
- [products.ini](share/products.ini) - this is a file that contains mappings from
(parts of) the gene names that you assigned during the annotation process to names
that NCBI will accept. Again, this is impossible to predict during the first pass:
you will get feedback on which names NCBI doesn't like (for example because there are
things in the names that look like database identifiers, organism names, molecular
weights, etc.) and in this file you map these to allowed names.
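
Two of the ideas above can be illustrated with a short sketch: rendering key/value pairs as bracketed `[key=value]` modifiers on a FASTA header line, and blanking out a coordinate range of a sequence with `N`. This is not the WGS2NCBI code. The function names `defline_modifiers` and `mask_range` are hypothetical, the one-pair-per-line input layout is an assumption, and replacing bases with `N` is one plausible reading of "blank out". Follow the syntax of the shipped example files for the real configuration.

```shell
#!/bin/sh
# Sketch only: turn key=value lines into "[key=value]" FASTA header modifiers.
# Assumption: one pair per line; blank lines, lines starting with ';' or '#',
# and [section] headers are skipped.
defline_modifiers() {
    awk '
        /^[ \t]*$/ || /^[ \t]*(;|#|\[)/ { next }
        {
            eq = index($0, "=")
            if (eq == 0) next
            key = substr($0, 1, eq - 1)
            val = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)       # trim whitespace around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)       # trim whitespace around the value
            out = out (out == "" ? "" : " ") "[" key "=" val "]"
        }
        END { print out }
    ' "$1"
}

# Sketch only: replace bases START..END (1-based, inclusive) of a sequence with N.
# Assumption: START <= END and both lie within the sequence.
mask_range() {
    printf '%s\n' "$1" | awk -v s="$2" -v e="$3" '{
        n = ""
        for (i = s; i <= e; i++) n = n "N"
        print substr($0, 1, s - 1) n substr($0, e + 1)
    }'
}
```

For example, `mask_range ACGTACGT 3 5` prints `ACNNNCGT`, and a file containing `sex=female` yields `[sex=female]` from `defline_modifiers`.
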