    $ cpanm Bio::MUST::Drivers

Since C<Bio::MUST> modules rely on external bioinformatics programs and come
with complex test suites, they sometimes raise errors during installation. If
you encounter such an error, consider using the C<--force> and/or C<--notest>
options of C<cpanm>.

    $ cpanm --force Bio::MUST::Drivers

Install C<42> itself. All remaining dependencies can also be taken care of by
C<cpanm>.

    $ cpanm Bio::MUST::Apps::FortyTwo

Finally, install a local mirror of the I<NCBI Taxonomy>. C<42> uses it to
taxonomically affiliate inferred orthologous sequences.

    $ setup-taxdir.pl --taxdir=taxdump/

=head2 Input and configuration files

To help configure the numerous parameters of the software, we designed a
C<config> file generator: C<yaml-generator-42.pl>. When run with the
C<--wizard> option, it guides you through the configuration by prompting for
all required parameters (pressing C<ENTER> selects the default value). At the
end of the process, it produces a C<YAML> C<config> file named
C<config-$out_suffix.yaml> and a file (C<build-$out_suffix.sh>) providing the
command to reproduce the exact same configuration without using the wizard.

=head3 MSAs (C<*.fasta>)

The native file format of C<42> for MSAs is known as the C<ALI> format. It is
very similar to the well-known C<FASTA> format except for a few differences:
(1) sequences must appear on a single (long) line; (2) gaps are encoded as
asterisk characters (C<*>) instead of dashes (C<->) and any whitespace is
interpreted as missing character states; (3) sequence identifiers accept a
single space between genus and species (more on this just below); and (4)
comment lines (starting with the hash character C<#>) are allowed. Although
C<42> can read and write C<FASTA> files transparently, its C<ALI> roots
sometimes play tricks on the user.
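
The differences listed above can be illustrated with a small shell sketch that
rewrites C<ALI>-style input into C<FASTA>-style output. The function name
C<ali_to_fasta> is hypothetical and is not part of C<42> or C<Bio::MUST>.
Whitespace inside sequence lines is left untouched because its C<FASTA>
equivalent is not specified here, and sequences are not re-wrapped.

```shell
# ali_to_fasta: hypothetical helper (an illustration, not a tool shipped with 42).
# Reads ALI-style text from the given files, or from stdin if none are given.
ali_to_fasta() {
    awk '
        /^#/ { next }                          # drop ALI comment lines
        /^>/ { gsub(/ /, "_"); print; next }   # identifiers: space -> underscore
             { gsub(/\*/, "-"); print }        # sequences: ALI gap (*) -> dash (-)
    ' "$@"
}
```

For instance, C<ali_to_fasta E<lt> input.ali E<gt> output.fasta> would convert a
single file, where both file names are placeholders.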

This is especially true for sequence identifiers. Basically, each identifier has
to hold the organism name (C<org>) followed by a separator (C<@>) and by a
protein/gene accession number. The organism name is usually the binomial name.
Genus and species must be separated by a space (C< > if in C<ALI> format) or
an underscore character (C<_> if in C<FASTA> format). A strain name and/or an
NCBI taxon id may follow the species name, each preceded by an underscore
character (C<_>). If both are used in the sequence identifier, the taxon id has
to come last. Finally, all sequence identifiers must be unique within each MSA.
See examples below:

=for todo TODO: explain other formats and subtleties (e.g., families)

    # Genus species@protacc
    >Arabidopsis thaliana@AAL15244
    # Genus species_taxonid@protacc
    >Arabidopsis thaliana_3702@AAO44026
    # Genus species_subspecies_taxonid@protacc
    >Arabidopsis lyrata_lyrata_81972@EFH60692
    # Genus species_taxonid@protacc
    >Archaeoglobus fulgidus_2234@WP_048095550
    # Genus species_strain_taxonid@protacc
    >Archaeoglobus fulgidus_DSM4304_224325@AAB90113
    # Genus species_strain_taxonid@protacc
    >archaeon 13_1_20CM_2_54_9_1805008@OLE74253
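
As a rough sketch of the rules above, the following shell helpers split an
identifier at the C<@> separator and report identifiers that occur more than
once in an MSA. The function names are hypothetical and only illustrate the
identifier layout; they do not reproduce how C<42> parses identifiers
internally.

```shell
# Hypothetical helpers illustrating the org@accession layout.

# Print the organism part (everything before the last "@"), without the ">".
seq_id_org() {
    id=${1#>}
    printf '%s\n' "${id%@*}"
}

# Print the accession part (everything after the last "@").
seq_id_acc() {
    printf '%s\n' "${1##*@}"
}

# Print each identifier line that appears more than once in the input
# (files given as arguments, or stdin); prints nothing if all are unique.
duplicate_ids() {
    awk '/^>/ && seen[$0]++ == 1 { print }' "$@"
}
```

An empty output from C<duplicate_ids> suggests that the uniqueness requirement
is met for that file.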

=head3 Reference organisms (C<ref_orgs>, C<ref_banks>)

=for todo TODO: discuss inst-abbr-ids.pl

The reference proteome set must be described in the C<config> file. Firstly,
each of the reference proteomes must be in C<FASTA> format in order to be
formatted as a C<BLAST> database with the C<makeblastdb> command. For
robustness, it is advised to use simple (one-word) sequence identifiers here.

    $ for REFORG in *.faa; do makeblastdb -in "$REFORG" -dbtype prot \
        -out "$(basename "$REFORG" .faa)" -parse_seqids; done

Then, C<yaml-generator-42.pl> will read a file describing the reference proteome
set (C<ref_org_mapper.idm>). This file is composed of two columns separated by a
tabulation character (C<\t>) with the first column being the organism name
(C<ref_org>) and the second being the database basename (C<ref_bank>).

If your bank files look like this:

    $ ls Arabidopsis_thaliana_3702_bank.*

    Arabidopsis_thaliana_3702_bank.faa
    Arabidopsis_thaliana_3702_bank.phr
    Arabidopsis_thaliana_3702_bank.pin
    Arabidopsis_thaliana_3702_bank.pog
    Arabidopsis_thaliana_3702_bank.psd
    Arabidopsis_thaliana_3702_bank.psi
    Arabidopsis_thaliana_3702_bank.psq

then the C<ref_org_mapper.idm> file should look like this:

    Arabidopsis thaliana_3702    Arabidopsis_thaliana_3702_bank
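
Because the two columns must be separated by a tabulation character, a quick
sanity check of the mapper file can catch lines where spaces were used
instead. The sketch below is one possible check; the function name
C<check_mapper> is hypothetical and the check is not performed this way by
C<yaml-generator-42.pl> as far as this manual states.

```shell
# check_mapper: hypothetical sanity check for a ref_org_mapper.idm-style file.
# Reports every line that does not hold exactly two non-empty tab-separated
# columns, and returns a non-zero status if any such line is found.
check_mapper() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected ref_org<TAB>ref_bank\n", NR
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

Running C<check_mapper ref_org_mapper.idm> prints nothing when every line is
well formed.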

If you mainly work with microbes, you may want to name your banks after the NCBI
GCA/GCF accessions of the corresponding genome assemblies. In this case, you can
use C<fetch-tax.pl> to generate a suitable file from a list of such accessions:

    $ head -n5 banks.idl

    GCA_000008085.1
    GCA_000011505.1
    GCA_000012285.1
    GCA_000014585.1
    GCA_000019605.1

    $ fetch-tax.pl --taxdir=taxdump/ --org-mapper --item-type=taxid banks.idl

    $ head -n5 banks.org-idm
    
    Nanoarchaeum equitans_GCA_000008085.1        GCA_000008085.1
    Staphylococcus aureus_GCA_000011505.1        GCA_000011505.1
    Sulfolobus acidocaldarius_GCA_000012285.1    GCA_000012285.1
    Synechococcus sp._GCA_000014585.1            GCA_000014585.1
    Korarchaeum cryptofilum_GCA_000019605.1      GCA_000019605.1
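
If the banks are already formatted, a list such as C<banks.idl> could be
derived from the database files themselves. The sketch below assumes
single-volume protein databases, each with one C<.psq> file named after its
accession; the function name C<list_bank_basenames> is hypothetical.

```shell
# list_bank_basenames: hypothetical helper listing BLAST bank basenames.
# Assumes one <basename>.psq file per bank in the given directory
# (defaults to the current directory).
list_bank_basenames() {
    for f in "${1:-.}"/*.psq; do
        [ -e "$f" ] || continue    # no .psq file: the glob stays literal
        basename "$f" .psq
    done
}
```

For example, C<list_bank_basenames . E<gt> banks.idl> would write one basename
per line.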

=head3 Query organisms (C<query_orgs>)