App-Sandy
view release on metacpan or search on metacpan
lib/App/Sandy/Command/Genome.pm view on Meta::CPAN
This subcommand simulates genome sequencing reads taking into account the
quality-profile and the genome-variation patterns, along with: raffle
seed; coverage (depth); fragment mean and standard deviation; single-end
(long and short fragments) and paired-end sequencing type; bam, sam,
fastq.gz and fastq output formats and more.
=head2 INPUT
I<sandy genome> expects as argument a fasta file with chromosome sequences.
For example, L<the GENCODE human genome|https://www.gencodegenes.org/human/>
GRCh38.p13 fasta file.
=head2 OUTPUT
The output file generated will depend on the I<output-format> (fastq, bam),
on the I<join-paired-ends> option (mate read pairs into a single file) and
on the I<sequencing-type> (single-end, paired-end). A file with the simulated
coverage (${prefix}_coverage.tsv) for each chromosome (read counts) also
accompanies the output file.
=head1 OPTIONS
=over 8
=item B<--help>
Print a brief help message and exits.
=item B<--man>
Prints the manual page and exits.
=item B<--verbose>
Prints log information to standard error
=item B<--prefix>
Concatenates the prefix to the output-file name.
=item B<--output-dir>
Creates output-file inside output-dir. If output-dir
does not exist, it is created recursively
=item B<--output-format>
Choose the output format. Available options are:
I<bam>, I<sam>, I<fastq.gz>, I<fastq>.
For I<bam> option, B<--append-id> is ignored, considering
that the sequence identifier is splitted by blank character, so
just the first field is included into the query name column
(first column).
=item B<--join-paired-ends>
By default, paired-end reads are put into two different files,
I<prefix_R[12]_001.fastq(\.gz)?>. If the user wants both outputs
together, she can pass this option.
If the B<--id> does not have the escape character %R, it is
automatically included right after the first field (blank separated values)
as in I<id/%R> - which resolves to I<id/1> or I<id/2>.
It is necessary to distinguish which read is R1/R2
=item B<--compression-level>
Regulates the speed of compression using the specified digit (between 1 and 9),
where "1" indicates the fastest compression method (less compression) and "9"
indicates the slowest compression method (best compression). The default
compression level is "6"
=item B<--append-id>
Append string template to the defined template id.
See B<Format>
=item B<--id>
Overlap the default defined template id:
I<single-end> %i.%U_%c_%s_%t_%n and I<paired-end> %i.%U_%c_%s_%S_%E
e.g. SR123.1_chr1_P_1001_1101
See B<Format>
=item B<Format>
A string B<Format> is a combination of literal and escape characters similar to the way I<printf> works.
That way, the user has the freedom to customize the fastq sequence identifier to fit her needs. Valid
escape characteres are:
B<Common escape characters>
----------------------------------------------------------------------------
Escape Meaning
----------------------------------------------------------------------------
%i instrument id composed by SR + PID
%I job slot number
%q quality profile
%e sequencing error
%x sequencing error position
%R read 1, or 2 if it is the paired-end mate
%U read number
%r read size
%m read mean
%d read standard deviation
%c sequence id as chromossome, gene/transcript id
%C sequence id type (reference or alternate non reference allele) ***
%s read strand
%t read start position
%n read end position
%a read start position regarding reference genome ***
%b read end position regarding reference genome ***
%v genomic variation position ***
----------------------------------------------------------------------------
*** specific for genomic variation (genome simulation only)
B<Paired-end specific escape characters>
----------------------------------------------------------------------------
Escape Meaning
----------------------------------------------------------------------------
%T mate read start position
%N mate read end position
%A mate read start position regarding reference genome ***
%B mate read end position regarding reference genome ***
%D distance between the paired-reads
%M fragment mean
%D fragment standard deviation
%f fragment size
%F fragment strand
%S fragment start position
%E fragment end position
%X fragment start position regarding reference genome ***
%Z fragment end position regarding reference genome ***
----------------------------------------------------------------------------
*** specific for genomic variation (genome simulation only)
=item B<--jobs>
Sets the number of child jobs to be created
=item B<--seed>
Sets the seed of the base generator. The ability to set the seed is
useful for those who want reproducible simulations. Pay attention to
the number of jobs (--jobs) set, because each job receives a different
seed calculated from the I<main seed>. So, for reproducibility, the
same seed set before needs the same number of jobs set before as well.
=item B<--read-mean>
Sets the read mean if quality-profile is equal to 'poisson'. The
quality-profile from database overrides the read-size
=item B<--read-stdd>
Sets the read standard deviation if quality-profile is equal to
'poisson'. The quality-profile from database overrides the read-stdd
=item B<--coverage>
Calculates the number of reads based on the genome
coverage: number_of_reads = (sequence_size * coverage) / read_size.
This is the default option for genome sequencing simulation
=item B<--sequencing-type>
Sets the sequencing type to single-end or paired-end
=item B<--fragment-mean>
If the sequencing-type is set to paired-end, it sets the
fragment mean
=item B<--fragment-stdd>
If the sequencing-type is set to paired-end, it sets the
lib/App/Sandy/Command/Genome.pm view on Meta::CPAN
distribution, but the user can choose among several profiles stored into the
database or import his own data.
See B<quality> command for more details
=item B<--genomic-variation>
Sets the genomic variation to be applied on the genome feeded. By
default no variation is included to the simulation, but the user has
the power to point some entries from B<variation> database or index his
own data. This option accepts a list with comma separated values
and can be passed multiple times, which is useful in order to join
various types of genomic variation into the same simulation. It is
possible to combine this option with B<--genomic-variation-regex>
See B<variation> command for the available list of genomic variation
entries
=item B<--genomic-variation-regex>
Applies perl-regex in the variation database and selects all entryes
that match the pattern. This option accepts a list with comma separated
values and can be passed multiple times. It is possible to combine this
option with B<--genomic-variation>
See B<variation> command for the available list of genomic variation
entries
=back
=head1 EXAMPLES
The following command will produce two fastq.gz files (paired-end), with a
coverage of 20x and a ${prefix}_coverage.tsv file with the read counts.
$ sandy genome \
--verbose \
--jobs=10 \
--sequencing-type=paired-end \
--coverage=20 hg38.fa 2> sim.log
or, equivalently:
$ sandy genome -v -j 10 -t paired-end -c 20 hg38.fa 2> sim.log
Using a seed for reproducibility.
$ sandy genome --seed=1717 my_fasta.fa
To simulate reads with a specific quality-profile other than the default
poisson:
$ sandy genome --quality-profile=hiseq_101 hg19.fa
To see the current list of available quality-profiles:
$ sandy quality
And in order to learn how to add your custom quality-profile, see:
$ sandy quality add --help
Sequence identifiers (first lines of fastq entries) may be customized in output using
a format string passed by the user. This format is a combination of literal and escaped
characters, in a similar fashion to that used in C programming languageâs printf function.
For example, letâs simulate a paired-end sequencing and add the read length, read position
and mate position into all sequence identifiers:
$ sandy genome \
--id="%i.%U read=%c:%t-%n mate=%c:%T-%N length=%r" \
hg38.fa
In this case, results would be:
==> Into R1
@SR.1 read=chr6:979-880 mate=chr6:736-835 length=100
...
==> Into R2
@SR.1 read=chr6:736-835 mate=chr6:979-880 length=100
See B<Format> section for details.
An important feature of B<Sandy> is to simulate a genome with genomic-variation.
So to exemplify, let's simulate a genome which includes the fusion between the genes
NPM1 and ALK:
$ sandy genome --genomic-variation=fusion_hg38_NPM1-ALK hg38.fa
To see the current list of available genomic-variation
$ sandy variation
And in order to learn how to add your custom genomic-variation, see:
$ sandy variation add --help
Putting all together:
$ sandy genome \
--verbose \
--jobs=10 \
--sequencing-type=paired-end \
--seed=1717 \
--quality-profile=hiseq_101 \
--id="%i.%U read=%c:%t-%n mate=%c:%T-%N length=%r" \
--genomic-variation=fusion_hg38_NPM1-ALK \
hg38.fa 2> sim.log
=head1 AUTHORS
=over 4
=item *
Thiago L. A. Miller <tmiller@mochsl.org.br>
=item *
J. Leonel Buzzo <lbuzzo@mochsl.org.br>
=item *
Felipe R. C. dos Santos <fsantos@mochsl.org.br>
( run in 2.201 seconds using v1.01-cache-2.11-cpan-39bf76dae61 )