Bio-MUST-Apps-FortyTwo
    $ cpanm Bio::MUST::Drivers
Since C<Bio::MUST> modules rely on external bioinformatics programs and come
with complex test suites, they sometimes raise errors during installation. If
you encounter any such error, consider enabling the C<--force> and/or
C<--notest> options of C<cpanm>.
    $ cpanm --force Bio::MUST::Drivers
Install C<42> itself. All remaining dependencies can also be taken care of by
C<cpanm>.
    $ cpanm Bio::MUST::Apps::FortyTwo
Finally, install a local mirror of the I<NCBI Taxonomy>. It will be used by C<42>
to taxonomically affiliate inferred orthologous sequences.
    $ setup-taxdir.pl --taxdir=taxdump/
=head2 Input and configuration files
To help with the configuration of the numerous parameters of the software, we
designed a C<config> file generator: C<yaml-generator-42.pl>. When run with the
C<--wizard> option, it will guide you through the configuration by prompting for
all required parameters (pressing C<ENTER> selects the default value). At the
end of the process, it will produce a C<YAML> C<config> file named
C<config-$out_suffix.yaml> and a file (C<build-$out_suffix.sh>) providing the
command to reproduce the exact same configuration without using the wizard.
=head3 MSAs (C<*.fasta>)
The native file format of C<42> for MSAs is known as the C<ALI> format. It is
very similar to the well-known C<FASTA> format except for a few differences: (1)
sequences must appear on a single (long) line; (2) gaps are encoded as asterisk
characters (C<*>) instead of dashes (C<->) and any whitespace is interpreted as
missing character states; (3) sequence identifiers accept a single whitespace
between genus and species (more on this just below); and (4) comment lines
(starting with the hash character C<#>) are allowed. Although C<42> can read
and write C<FASTA> files transparently, its C<ALI> roots sometimes play tricks
on the user.
This is especially true for sequence identifiers. Basically, each identifier has
to hold the organism name (C<org>) followed by a separator (C<@>) and by a
protein/gene accession number. The organism name is usually the binomial name.
Genus and species must be separated by a whitespace (C< > if in C<ALI> format)
or underscore character (C<_> if in C<FASTA> format). In addition, strain name
and/or NCBI taxon id are also allowed after the species name but each preceded
by an underscore character (C<_>). If both are used in the sequence identifier,
the taxon id has to come last. Finally, all sequence identifiers must be unique
within each MSA. See examples below:
=for todo TODO: explain other formats and subtleties (e.g., families)
    # Genus species@protacc
    >Arabidopsis thaliana@AAL15244
    # Genus species_taxonid@protacc
    >Arabidopsis thaliana_3702@AAO44026
    # Genus species_subspecies_taxonid@protacc
    >Arabidopsis lyrata_lyrata_81972@EFH60692
    # Genus species_taxonid@protacc
    >Archaeoglobus fulgidus_2234@WP_048095550
    # Genus species_strain_taxonid@protacc
    >Archaeoglobus fulgidus_DSM4304_224325@AAB90113
    # Genus species_strain_taxonid@protacc
    >archaeon 13_1_20CM_2_54_9_1805008@OLE74253
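The identifier rules above can be checked before running C<42> with a small stand-alone script. The following Python sketch is only an illustration and is not part of C<Bio::MUST>: the names C<SeqId>, C<parse_seq_id> and C<find_duplicate_ids> are hypothetical, and the splitting heuristic (a trailing all-digit token is taken to be the NCBI taxon id) is an assumption that may misread a purely numeric strain name.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SeqId:
    """Components of an 'org@accession' identifier (hypothetical helper type)."""
    genus: str
    species: Optional[str]
    strain: Optional[str]
    taxon_id: Optional[str]
    accession: str


def parse_seq_id(seq_id: str) -> SeqId:
    """Split an identifier such as 'Genus species_strain_taxonid@protacc'."""
    seq_id = seq_id.lstrip(">").strip()
    # The organism name and the accession are separated by '@'.
    full_org, sep, accession = seq_id.rpartition("@")
    if not sep or not full_org or not accession:
        raise ValueError(f"expected 'org@accession', got {seq_id!r}")

    # Genus and species are separated by a space (ALI) or an underscore (FASTA).
    if " " in full_org:
        genus, rest = full_org.split(" ", 1)
    elif "_" in full_org:
        genus, rest = full_org.split("_", 1)
    else:
        genus, rest = full_org, ""

    tokens = rest.split("_") if rest else []
    taxon_id = None
    # Assumption: the taxon id, when present, is the last token and all digits.
    if len(tokens) > 1 and tokens[-1].isdigit():
        taxon_id = tokens.pop()
    species = tokens[0] if tokens else None
    # Anything between species and taxon id is treated as strain/subspecies.
    strain = "_".join(tokens[1:]) or None
    return SeqId(genus, species, strain, taxon_id, accession)


def find_duplicate_ids(seq_ids: List[str]) -> List[str]:
    """Return identifiers that occur more than once within one MSA."""
    counts = Counter(seq_ids)
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)
```

For instance, C<parse_seq_id("Archaeoglobus fulgidus_DSM4304_224325@AAB90113")> yields the genus C<Archaeoglobus>, the species C<fulgidus>, the strain C<DSM4304>, the taxon id C<224325> and the accession C<AAB90113>. Running C<find_duplicate_ids> on all identifiers of an MSA is a quick way to verify the uniqueness requirement.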
=head3 Reference organisms (C<ref_orgs>, C<ref_banks>)
=for todo TODO: discuss inst-abbr-ids.pl
The reference proteome set must be described in the C<config> file. Firstly,
each of the reference proteomes must be in C<FASTA> format in order to be
formatted as a C<BLAST> database with the C<makeblastdb> command. For
robustness, it is advised to use simple (one-word) sequence identifiers here.
    $ for REFORG in *.faa; do makeblastdb -in "$REFORG" -dbtype prot \
        -out "$(basename "$REFORG" .faa)" -parse_seqids; done
Then, C<yaml-generator-42.pl> will read a file describing the reference proteome
set (C<ref_org_mapper.idm>). This file is composed of two columns separated by a
tab character (C<\t>), with the first column being the organism name
(C<ref_org>) and the second being the database basename (C<ref_bank>).
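Since a stray space used in place of the tab is easy to miss, it can help to check the mapper file before handing it to the generator. The Python sketch below is only an illustration of the two-column layout described above; C<parse_org_mapper> is a hypothetical helper, not a C<Bio::MUST> function, and the choice to reject any line that does not have exactly two tab-separated fields is an assumption of this sketch.

```python
from typing import List, Tuple


def parse_org_mapper(text: str) -> List[Tuple[str, str]]:
    """Parse 'ref_org<TAB>ref_bank' lines into (ref_org, ref_bank) pairs."""
    pairs = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 tab-separated columns, "
                f"found {len(fields)}"
            )
        ref_org, ref_bank = (field.strip() for field in fields)
        if not ref_org or not ref_bank:
            raise ValueError(f"line {lineno}: empty organism or bank name")
        pairs.append((ref_org, ref_bank))
    return pairs
```

A file can be checked with C<parse_org_mapper(open("ref_org_mapper.idm").read())>; a C<ValueError> then points to the first offending line.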
If your banks look like this:
    $ ls Arabidopsis_thaliana_3702_bank.*
    Arabidopsis_thaliana_3702_bank.faa
    Arabidopsis_thaliana_3702_bank.phr
    Arabidopsis_thaliana_3702_bank.pin
    Arabidopsis_thaliana_3702_bank.pog
    Arabidopsis_thaliana_3702_bank.psd
    Arabidopsis_thaliana_3702_bank.psi
    Arabidopsis_thaliana_3702_bank.psq
Then the C<ref_org_mapper> file should look like this:
    Arabidopsis thaliana_3702	Arabidopsis_thaliana_3702_bank
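When many banks follow the same naming pattern as in this example, the mapper lines can be generated rather than typed. The Python sketch below assumes that every bank basename is the organism name with underscores, followed by a C<_bank> suffix; both the suffix and the helper names C<org_from_bank> and C<build_org_mapper> are assumptions made for illustration and are not part of C<42>.

```python
from typing import Iterable


def org_from_bank(bank: str, suffix: str = "_bank") -> str:
    """Derive an organism name from a bank basename (assumed naming scheme)."""
    name = bank
    # Drop the assumed suffix, if present.
    if suffix and name.endswith(suffix):
        name = name[: -len(suffix)]
    # Only the first underscore (between genus and species) becomes a space;
    # later underscores precede the strain name and/or taxon id and are kept.
    return name.replace("_", " ", 1)


def build_org_mapper(banks: Iterable[str]) -> str:
    """Return the mapper content: one 'ref_org<TAB>ref_bank' line per bank."""
    return "".join(f"{org_from_bank(bank)}\t{bank}\n" for bank in banks)
```

With the bank shown above, C<build_org_mapper(["Arabidopsis_thaliana_3702_bank"])> returns a single line holding C<Arabidopsis thaliana_3702>, a tab and C<Arabidopsis_thaliana_3702_bank>. The output should still be reviewed by eye, since organism names that do not fit the assumed pattern will be converted incorrectly.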
If you mainly work with microbes, you may want to name your banks after the NCBI
GCA/GCF accessions of the corresponding genome assemblies. In this case, you can
use C<fetch-tax.pl> to generate a suitable file from a list of such numbers:
    $ head -n5 banks.idl
    GCA_000008085.1
    GCA_000011505.1
    GCA_000012285.1
    GCA_000014585.1
    GCA_000019605.1
    $ fetch-tax.pl --taxdir=taxdump/ --org-mapper --item-type=taxid banks.idl
    $ head -n5 banks.org-idm
    Nanoarchaeum equitans_GCA_000008085.1	GCA_000008085.1
    Staphylococcus aureus_GCA_000011505.1	GCA_000011505.1
    Sulfolobus acidocaldarius_GCA_000012285.1	GCA_000012285.1
    Synechococcus sp._GCA_000014585.1	GCA_000014585.1
    Korarchaeum cryptofilum_GCA_000019605.1	GCA_000019605.1
=head3 Query organisms (C<query_orgs>)