Bio-MUST-Apps-FortyTwo

lib/Bio/MUST/Apps/FortyTwo/Manual.pod

C<ref_bank_suffix> options, respectively. If your banks are built from protein
sequences, use C<.psq>; otherwise, for nucleotide sequences, use C<.nsq>.

Because of this scanning behavior, it is better to prepare your files directly
on the computer on which you plan to run C<42>. If you try to prepare your
C<config> file locally (for subsequent upload to a remote computer), the wizard
will very likely complain about some directories not being found.

=head3 Command-line options

Since the configuration (C<config>) file specifies all the details, running
C<42> boils down to a simple command:

    $ forty-two.pl --config=config.yaml *.fasta

By default, C<42> is very terse. Yet it can be made quite verbose using the
corresponding C<--verbosity> option. If you need all the debugging information,
select level C<6>. In any case, it is useful to redirect the C<STDERR> stream to
a log file for post-run analysis.

    $ forty-two.pl --config=config.yaml --verbosity=3 *.fasta 2> 42.log

C<42> supports multithreading by allowing parallel enrichment of multiple MSAs.
This is controlled by the C<--threads> command line option. MSAs will be
arranged in an internal queue and processed in parallel using the specified
number of threads. As long as there remain more MSAs to enrich than that
number, C<42> will make efficient use of the CPU cores. Obviously, there is
no speed gain in specifying more threads than MSAs to process.

    $ forty-two.pl --config=config.yaml --threads=20 *.fasta
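
The rule of thumb above can be turned into a small helper. The following Python
sketch is not part of C<42>: the function names C<pick_threads> and
C<build_command> are hypothetical, and capping the thread count at the number of
CPU cores is an assumption added here on top of the "no more threads than MSAs"
advice. It only uses the C<--config> and C<--threads> options shown above.

```python
import os


def pick_threads(n_msas, n_cores=None):
    """Return a thread count no larger than the number of MSAs.

    Assumption: the count is also capped at the number of CPU cores
    (os.cpu_count() when n_cores is not given). At least 1 is returned.
    """
    if n_cores is None:
        n_cores = os.cpu_count() or 1
    return max(1, min(n_msas, n_cores))


def build_command(config, msa_files, n_cores=None):
    """Build the forty-two.pl argument list for the given MSA files."""
    threads = pick_threads(len(msa_files), n_cores)
    return [
        "forty-two.pl",
        f"--config={config}",
        f"--threads={threads}",
        *msa_files,
    ]


if __name__ == "__main__":
    # Two MSAs on a 20-core machine: only two threads are useful.
    print(" ".join(build_command("config.yaml", ["a.fasta", "b.fasta"], 20)))
```

The resulting list can be passed to C<subprocess.run> if you want to launch
C<42> from a Python driver script.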

Unfortunately, the current parallel implementation scheme leads to completely
scrambled log files. There is thus no point in asking for a high verbosity
level when using multiple threads.

=head3 Estimation of the contamination level in metagenomic mode

C<42> can be used as a stand-alone contamination detection tool to spot foreign
sequences and estimate the contamination level in transcriptomic or genomic data
as well as the taxonomic sources of contamination. To this end, it comes with
two sets of ribosomal protein MSAs: one set of 78 eukaryotic MSAs, manually
curated and continuously enriched with new species in H. Philippe's lab, and one
set of 90 prokaryotic MSAs, fetched from I<RiboDB>. Both sets are available at
L<https://bitbucket.org/phylogeno/42-ribo-msas/>.

For each transcriptome/genome, C<42> recovers the ribosomal protein orthologs
and then labels each one by computing the last common ancestor (LCA) of its
closest relatives (best BLAST hits) in the corresponding MSA (excluding
self-matches). The algorithm relies on the C<megan-like> mode described in
L<"Contamination detection and handling">. In this regard, since ribosomal
proteins are highly conserved, we suggest a more stringent parameterization of
the C<megan-like> algorithm, so as to avoid false positives during LCA
computation, with a C<--tax_score_mul> of C<0.99> instead of C<0.95> and a
C<--tax_min_ident> of C<50> instead of C<0>.
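
To see why a higher C<--tax_score_mul> yields more specific labels, consider the
following Python sketch. It is an illustration only, not the code used by
C<42>. It assumes that a hit is retained when its identity is at least
C<tax_min_ident> and its bit score is at least C<tax_score_mul> times the best
retained score, and that the label is the deepest taxon shared by all retained
lineages. The hit records, the order of the two filters and the function names
are hypothetical.

```python
def filter_hits(hits, score_mul=0.99, min_ident=50.0):
    """Keep hits passing the identity and relative bit-score thresholds.

    Assumption: identity is checked first, then the score is compared to
    score_mul times the best score among the remaining hits.
    """
    kept = [h for h in hits if h["ident"] >= min_ident]
    if not kept:
        return []
    best = max(h["score"] for h in kept)
    return [h for h in kept if h["score"] >= score_mul * best]


def lca(lineages):
    """Return the longest common prefix of root-to-tip lineages."""
    common = []
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            common.append(ranks[0])
        else:
            break
    return common


# Made-up hits for one ortholog (scores, identities and taxa are invented).
HITS = [
    {"score": 200, "ident": 90.0,
     "lineage": ["cellular organisms", "Eukaryota", "Sar", "Alveolata"]},
    {"score": 199, "ident": 88.0,
     "lineage": ["cellular organisms", "Eukaryota", "Sar", "Stramenopiles"]},
    {"score": 192, "ident": 85.0,
     "lineage": ["cellular organisms", "Bacteria", "Proteobacteria"]},
]


def label(hits, score_mul, min_ident):
    """Label an ortholog with the LCA of its retained hits."""
    return lca([h["lineage"] for h in filter_hits(hits, score_mul, min_ident)])


if __name__ == "__main__":
    print(label(HITS, 0.99, 50.0))  # stringent setting
    print(label(HITS, 0.95, 0.0))   # default-like setting
```

With these invented hits, the stringent setting keeps only the two best hits and
the label resolves to C<Sar>, whereas the looser multiplier also admits the
bacterial hit and the label collapses to C<cellular organisms>.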

The follow-up consists of running C<debrief-42.pl>, which parses the taxonomic
reports produced by C<42>. It compares the taxonomic label (LCA) of each
ortholog with the lineage of the source organism (according to I<NCBI
Taxonomy>) and classifies a sequence as a contaminant if the two differ at a
predefined taxonomic rank, based on a first user-defined list of taxa
(C<--seq_labeling>). After each ortholog has been classified, an estimated
contamination percentage is computed.

Additionally, contaminations are further classified to determine the main
sources of contaminants, based on a second user-defined list of taxa
(C<--contam_labeling>), which allows the user to fine-tune the output report.
In this regard, we distinguish two types of sequences, B<classified
contaminations> and B<unclassified contaminations>. The latter are those that
bear an uninformative taxonomic label, i.e., too broad to point to a specific
lineage with accuracy (e.g., C<Sar>). Finally, the sequences that can only be
affiliated at the highest taxonomic levels, such as C<cellular organisms>,
C<Eukaryota>, C<Bacteria> or C<Archaea>, are classified as B<unknown sequences>.
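
The classification logic described above can be sketched as follows in Python.
This is a simplified illustration under stated assumptions, not the
implementation of C<debrief-42.pl>: the function names, the in-memory label sets
(standing in for the two user-defined lists) and the example lineages are
hypothetical, and the percentage is computed here as contaminants over all
orthologs, which may differ from the denominator actually used by the script.

```python
from collections import Counter

# Labels considered too broad to assign a sequence to any lineage.
TOP_LEVEL = {"cellular organisms", "Eukaryota", "Bacteria", "Archaea"}


def deepest_label(lineage, labels):
    """Return the most specific taxon of a lineage found in a label set."""
    return next((t for t in reversed(lineage) if t in labels), None)


def classify(lca_lineage, own_lineage, seq_labels, contam_labels):
    """Return (category, source) for one ortholog.

    Categories: 'unknown', 'own', 'classified', 'unclassified'.
    """
    if not lca_lineage or lca_lineage[-1] in TOP_LEVEL:
        return ("unknown", None)
    seq_label = deepest_label(lca_lineage, seq_labels)
    own_label = deepest_label(own_lineage, seq_labels)
    if seq_label is not None and seq_label == own_label:
        return ("own", None)
    source = deepest_label(lca_lineage, contam_labels)
    if source is not None:
        return ("classified", source)
    return ("unclassified", None)


def summarize(lca_lineages, own_lineage, seq_labels, contam_labels):
    """Count categories and estimate the contamination percentage."""
    counts = Counter(
        classify(lin, own_lineage, seq_labels, contam_labels)[0]
        for lin in lca_lineages
    )
    total = sum(counts.values())
    contam = counts["classified"] + counts["unclassified"]
    percent = 100.0 * contam / total if total else 0.0
    return counts, percent


# Invented example: an apicomplexan organism and five labeled orthologs.
OWN = ["cellular organisms", "Eukaryota", "Sar", "Alveolata", "Apicomplexa"]
SEQ_LABELS = {"Alveolata", "Stramenopiles", "Proteobacteria"}
CONTAM_LABELS = {"Stramenopiles", "Proteobacteria"}
ORTHOLOGS = [
    ["cellular organisms", "Eukaryota", "Sar", "Alveolata", "Apicomplexa"],
    ["cellular organisms", "Eukaryota", "Sar", "Alveolata"],
    ["cellular organisms", "Bacteria", "Proteobacteria"],
    ["cellular organisms", "Eukaryota", "Sar"],
    ["cellular organisms", "Eukaryota"],
]

if __name__ == "__main__":
    counts, percent = summarize(ORTHOLOGS, OWN, SEQ_LABELS, CONTAM_LABELS)
    print(dict(counts), percent)
```

In this invented example, two orthologs match the source organism, one is a
classified contamination (C<Proteobacteria>), one is an unclassified
contamination (C<Sar>) and one is an unknown sequence.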

A typical command for running the metagenomic debriefer is shown below:

    $ debrief-42.pl --indir=./MSAs/ --in=-42 --taxdir=taxdump/ \
        --seq_labeling=seq-labels.idl --contam_labeling=contam-labels.idl

=head1 AUTHOR

Denis BAURAIN <denis.baurain@uliege.be>

=head1 CONTRIBUTOR

=for stopwords Mick VAN VLIERBERGHE

Mick VAN VLIERBERGHE <mvanvlierberghe@doct.uliege.be>

=head1 COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.

This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.

=cut


