transparent results from the CPAN

transparent
Bio-MUST-Apps-FortyTwo
view release on metacpan or search on metacpan
lib/Bio/MUST/Apps/FortyTwo/Manual.pod view on Meta::CPAN
    $ perlbrew available
    # install the last even version (e.g., 5.24.x, 5.26.x, 5.28.x)
    # (this will take a while)
    $ perlbrew install perl-5.26.2
    # install cpanm (for Perl dependencies)
    $ perlbrew install-cpanm

    # enable the just-installed version
    $ perlbrew list
    $ perlbrew switch perl-5.26.2

    # make perlbrew always available
    # if using bash (be sure to use double >> to append)
    $ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.bashrc
    # if using zsh  (only the destination file changes)
    $ echo "source ~/perl5/perlbrew/etc/bashrc" >> ~/.zshrc

Major C<42> dependencies are the C<Bio::MUST> series of modules. Install them as
follows.

    $ cpanm Bio::FastParsers
    $ cpanm Bio::MUST::Core
    $ cpanm Bio::MUST::Drivers

Since C<Bio::MUST> modules rely on external bioinformatics programs and come
with complex test suites, they sometimes raise errors during installation. If
you encounter any such error, consider enabling C<--force> and/or C<--notest>
options of C<cpanm>.

    $ cpanm --force Bio::MUST::Drivers

Install C<42> itself. All remaining dependencies can also be taken care of by
C<cpanm>.

    $ cpanm Bio::MUST::Apps::FortyTwo

Finally install a local mirror of the I<NCBI Taxonomy>. It will be used by C<42>
to taxonomically affiliate inferred orthologous sequences.

    $ setup-taxdir.pl --taxdir=taxdump/

=head2 Input and configuration files

To help with the configuration of the numerous parameters of the software, we
designed a C<config> file generator: C<yaml-generator-42.pl>. When run with the
C<--wizard> option, it will guide you through the configuration by prompting for
all required parameters (pressing C<ENTER> selects the default value). At the
end of process, it will produce a C<YAML> C<config> file named
C<config-$out_suffix.yaml> and a file (C<build-$out_suffix.sh>) providing the
command to reproduce the exact same configuration without using the wizard.

=head3 MSAs (C<*.fasta>)

C<42> native file format for MSAs is known as the C<ALI> format. It is very
similar to the well-known C<FASTA> format except for a few differences: (1)
sequences must appear on a single (long) line; (2) gaps are encoded as asterisk
characters (C<*>) instead of dashes (C<->) and any whitespace is interpreted as
missing character states; (3) sequence identifiers accept a single whitespace
between genus and species (more on this just below); and (4) comment lines
(starting with the hashtag character C<#>) are allowed. Although C<42> can read
and write C<FASTA> files transparently, its C<ALI> roots sometimes play tricks
to the user.

This is especially true for sequence identifiers. Basically, each identifier has
to hold the organism name (C<org>) followed by a separator (C<@>) and by a
protein/gene accession number. The organism name is usually the binomial name.
Genus and species must be separated by a whitespace (C< > if in C<ALI> format)
or underscore character (C<_> if in C<FASTA> format). In addition, strain name
and/or NCBI taxon id are also allowed after the species name but each preceded
by an underscore character (C<_>). If both are used in the sequence identifier,
the taxon id has to come last. Finally, all sequence identifiers must be unique
within each MSA. See examples below:

=for todo TODO: explain other formats and subtleties (e.g., families)

    # Genus species@protacc
    >Arabidopsis thaliana@AAL15244
    # Genus species_taxonid@protacc
    >Arabidopsis thaliana_3702@AAO44026
    # Genus species_subspecies_taxonid@protacc
    >Arabidopsis lyrata_lyrata_81972@EFH60692
    # Genus species_taxonid@protacc
    >Archaeoglobus fulgidus_2234@WP_048095550
    # Genus species_strain_taxonid@protacc
    >Archaeoglobus fulgidus_DSM4304_224325@AAB90113
    # Genus species_strain_taxonid@protacc
    >archaeon 13_1_20CM_2_54_9_1805008@OLE74253

=head3 Reference organisms (C<ref_orgs>, C<ref_banks>)

=for todo TODO: discuss inst-abbr-ids.pl

The reference proteome set must be described in the C<config> file. Firstly,
each of the reference proteomes must be in C<FASTA> format in order to be
formatted as a C<BLAST> database with the C<makeblastdb> command. For
robustness, it is advised to use simple (one-word) sequence identifiers here.

    $ for REFORG in *.faa; do makeblastdb -in $REFORG -dbtype prot \
        -out `basename $REFORG .faa` -parse_seqids; done

Then, C<yaml-generator-42.pl> will read a file describing the reference proteome
set (C<ref_org_mapper.idm>). This file is composed of two columns separated by a
tabulation character (C<\t>) with the first column being the organism name
(C<ref_org>) and the second being the database basename (C<ref_bank>).

If your banks are like this:

    $ ls Arabidopsis_thaliana_3702_bank.*

    Arabidopsis_thaliana_3702_bank.faa
    Arabidopsis_thaliana_3702_bank.phr
    Arabidopsis_thaliana_3702_bank.pin
    Arabidopsis_thaliana_3702_bank.pog
    Arabidopsis_thaliana_3702_bank.psd
    Arabidopsis_thaliana_3702_bank.psi
    Arabidopsis_thaliana_3702_bank.psq

Then the C<ref_org_mapper> file should look like this:

    Arabidopsis thaliana_3702    Arabidopsis_thaliana_3702_bank
( run in 1.268 second using v1.01-cache-2.11-cpan-524268b4103 )