Bio-MUST-Apps-FortyTwo

lib/Bio/MUST/Apps/FortyTwo/Manual.pod

# PODNAME: FortyTwo::Manual
# ABSTRACT: User Guide for Forty-Two
# CONTRIBUTOR: Mick VAN VLIERBERGHE <mvanvlierberghe@doct.uliege.be>

# perl -M'List::AllUtils qw(uniq)' -nle 'push @found, m/(C<.*?>)/g; END{ print join "\n", uniq @found }' Manual.pod

# Feedback Forty-two (Nicolas MAGAIN)
# 
# [DONE] 1. When I copy/paste commands from the Forty-two manual into the terminal or a text editor, I have to replace all the hyphens manually. The hyphens in the manual appear shorter and break the commands.
# 
# [DONE] 2. The manual stops at the generation of the yaml file, but it would be useful to add a paragraph on launching the forty-two.pl script itself with --config, --verbosity etc. This is available by running forty-two.pl --help but ...
# 
# [DONE] 3. In this part of the manual:
# query_orgs should be listed in a file (queries.txt) and spelled exactly as in your MSAs. This file will be processed by yaml‐generator‐42.pl to populate the config file. To easily draft a list of query_orgs, you can for example use the 10 to 20...
# $grep‐h\>*.fasta|cut‐f1‐d'@'|sort|uniq‐c|sort‐rn|head‐n10
# 22498 >Danio_rerio
# 21071 >Homo_sapiens
# 20722 >Mus_musculus
# 18933 >Monodelphis_domestica
# 18616 >Loxodonta_africana
# 17762 >Latimeria_chalumnae
# 17678 >Canis_familiaris
# 17114 >Xenopus_tropicalis
# 16665 >Anolis_carolinensis
# 16611 >Sarcophilus_harrisii
# 
# In this part, it is unclear whether the > should be part of the organism name or not, since it appears in the example above. An uninformed user would be tempted to copy the names as is into their queries.txt file
# 
# Since the names go without the '>', you could state this explicitly, and then for example add the command below to create queries.txt directly
# grep -h \>*.fasta | cut -f1 -d'@' | sort | uniq -c | sort -rn | head -n 10 | cut -f2 -d'>' > queries.txt
# 
# If you provide ready-made command lines like this, you will make life easier for beginners.
# 
# [DONE] 4a. I find the part on tax_filter unclear (probably aimed at people who already master these concepts?). I do not quite understand what it is for.
# 
# [DONE] 4b. "but the generator only supports the plain tax_filter syntax shown in the first example." -> I find this unclear. Is it only the first of the 4 lines (+Poaceae) that works compared with the three other examples, ...
# 
# [DONE] 4c. A user who does not need tax_filter or does not understand what it is wonders, after reading this paragraph, what to do about this part. Should they prepare something anyway, or could...
# 
# [DONE] 5. I did not find explanations in the manual (or did not understand that the explanations were about this) regarding these 4 parameters in the wizard
# Set ref_brh_mode
# Set reference banks suffix
# Set trim_max_shift
# Set candiate banks suffix
# In fact, there are more explanations in the yaml file itself, but since one does not have it yet when launching the wizard for the first time, the explanations should come earlier, either in the manual or in the wizard
# 
# [DONE] 6. If I prepare a config file locally in order to upload it to a cluster afterwards, and I want to specify the path to the folder containing the reference genomes, or to the genomes to mine on the cluster, I will end up ...

__END__

=pod

=head1 NAME

FortyTwo::Manual - User Guide for Forty-Two

=head1 VERSION

version 0.213470

=encoding UTF-8

=head1 Background

=head2 Aim and features

C<42> is a phylogenomic tool designed to add (and optionally align) sequences to
a preexisting multiple sequence alignment (MSA) while controlling for orthology
relationships and potentially contaminating sequences. Sequences to add are
either nucleotide transcripts resulting from transcriptome assembly or already
translated protein sequences. In theory, one can also use genomic nucleotide
sequences (because C<42> can splice introns), but this possibility has not been
extensively tested so far.

=for todo TODO: amend these paragraphs after publication...

The working hypothesis of C<42> is that its orthology-controlling heuristics can
enrich not only MSAs of single-copy genes but also more complicated MSAs
including terminally duplicated genes (in-paralogues) and/or corresponding to
multigenic families featuring different out-paralogues of different ages.
Preliminary tests on a broadly sampled eukaryotic data set suggest that the


Then, C<yaml-generator-42.pl> will read a file describing the reference proteome
set (C<ref_org_mapper.idm>). This file consists of two columns separated by a
tab character (C<\t>): the first column is the organism name (C<ref_org>) and
the second is the database basename (C<ref_bank>).

If your bank files look like this:

    $ ls Arabidopsis_thaliana_3702_bank.*

    Arabidopsis_thaliana_3702_bank.faa
    Arabidopsis_thaliana_3702_bank.phr
    Arabidopsis_thaliana_3702_bank.pin
    Arabidopsis_thaliana_3702_bank.pog
    Arabidopsis_thaliana_3702_bank.psd
    Arabidopsis_thaliana_3702_bank.psi
    Arabidopsis_thaliana_3702_bank.psq

Then the C<ref_org_mapper.idm> file should look like this:

    Arabidopsis thaliana_3702    Arabidopsis_thaliana_3702_bank
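
When there are many banks, the mapper can be drafted from the bank files
themselves. The following is only a sketch, not part of C<42>: the function name
C<make_ref_org_mapper> is hypothetical, and it assumes single-volume protein
banks whose basenames follow the C<Genus_species_taxid_bank> pattern shown
above, so that replacing the first underscore by a space yields the organism
name.

```shell
# Hypothetical helper (not shipped with 42): print one mapper line per bank.
# ASSUMPTION: basenames look like Genus_species_taxid_bank and each bank has
# a single .psq volume file.
make_ref_org_mapper() {
    dir=${1:-.}
    for psq in "$dir"/*.psq; do
        [ -e "$psq" ] || continue            # no bank found: print nothing
        bank=$(basename "$psq" .psq)         # database basename (ref_bank)
        # Drop the trailing _bank, then turn only the first underscore
        # (between genus and species) into a space to get ref_org.
        org=$(printf '%s\n' "${bank%_bank}" | sed 's/_/ /')
        printf '%s\t%s\n' "$org" "$bank"     # columns are tab-separated
    done
}
```

It could then be called as C<< make_ref_org_mapper . > ref_org_mapper.idm >>.
The resulting file should still be checked by eye, since banks named
differently will produce wrong organism names.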

If you mainly work with microbes, you may want to name your banks after the NCBI
GCA/GCF accessions of the corresponding genome assemblies. In this case, you can
use C<fetch-tax.pl> to generate a suitable file from a list of such accessions:

    $ head -n5 banks.idl

    GCA_000008085.1
    GCA_000011505.1
    GCA_000012285.1
    GCA_000014585.1
    GCA_000019605.1

    $ fetch-tax.pl --taxdir=taxdump/ --org-mapper --item-type=taxid banks.idl

    $ head -n5 banks.org-idm
    
    Nanoarchaeum equitans_GCA_000008085.1        GCA_000008085.1
    Staphylococcus aureus_GCA_000011505.1        GCA_000011505.1
    Sulfolobus acidocaldarius_GCA_000012285.1    GCA_000012285.1
    Synechococcus sp._GCA_000014585.1            GCA_000014585.1
    Korarchaeum cryptofilum_GCA_000019605.1      GCA_000019605.1

=head3 Query organisms (C<query_orgs>)

C<query_orgs> should be listed in a file (C<queries.txt>) and spelled exactly as
in your MSAs (excluding the C<FASTA>-specific C<< > >> character preceding each
sequence identifier). This file will be processed by C<yaml-generator-42.pl> to
populate the C<config> file. To easily draft a list of C<query_orgs>, you can
for example use the 10 to 20 most represented organisms across all your MSAs
(prior to enrichment).

    $ grep -h \> *.fasta | cut -f1 -d'@' | cut -c2- | sort | uniq -c | sort -rn | head -n10

    22498 Danio_rerio
    21071 Homo_sapiens
    20722 Mus_musculus
    18933 Monodelphis_domestica
    18616 Loxodonta_africana
    17762 Latimeria_chalumnae
    17678 Canis_familiaris
    17114 Xenopus_tropicalis
    16665 Anolis_carolinensis
    16611 Sarcophilus_harrisii

B<Note:> Organism names must follow the same rules as above. This means that no
underscore should appear between genus and species. C<42> emits a warning when
it suspects a malformed name, but it cannot fix the name for you. When working
with native C<ALI> files, this issue does not crop up:

    $ grep -h \> *.ali | cut -f1 -d'@' | cut -c2- | sort | uniq -c | sort -rn | head -n10

    22498 Danio rerio
    21071 Homo sapiens
    ...
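
The pipelines above print occurrence counts, which must not end up in
C<queries.txt>. As a sketch (the function name C<top_query_orgs> is
hypothetical and not part of C<42>), the same pipeline can be wrapped so that it
strips the counts and prints only the organism names:

```shell
# Hypothetical helper (not shipped with 42): print the N most represented
# organisms across the given MSA files, one name per line, without counts.
top_query_orgs() {
    n=$1
    shift
    grep -h '>' "$@" | cut -f1 -d'@' | cut -c2- | sort | uniq -c | sort -rn \
        | head -n "$n" \
        | sed 's/^ *[0-9][0-9]* //'   # strip the count added by uniq -c
}
```

For example, C<< top_query_orgs 10 *.ali > queries.txt >> would write the ten
most frequent names. Names are written exactly as they appear in the MSAs, so
any underscore between genus and species still has to be fixed by hand.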

=head3 Candidate organisms (C<orgs>, C<banks>)

The candidate organisms set must be described in the C<config> file. First,
each candidate organism file must be in C<FASTA> format so that a C<BLAST>
database can be built from it with the C<makeblastdb> command:

    $ for ORG in *.fna; do makeblastdb -in "$ORG" -dbtype nucl \
        -out "$(basename "$ORG" .fna)" -parse_seqids; done

Within each C<BLAST> database, sequence identifiers must be unique. C<42> will
use the first run of non-whitespace characters as the accession. If this first
chunk is composed of multiple parts separated by pipe characters (C<|>), only
the last part is taken as the sequence accession (see L<"Orthologue
naming..."|"Family affiliation and orthologue naming"> above).

    sequence identifier                      accession

    >seq37                                   seq37
    >comp12_c0_seq1                          comp12_c0_seq1
    >EH093040.1 Sl_SlB_01N04_T7 SLB ...      EH093040.1
    >MMETSP0151_2-20130828|7_1 len=174       7_1
    >gi|301500844|ref|YP_003795256.1| ...    YP_003795256.1
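
To preview the accessions before building a bank, the rule above can be
approximated with C<awk>. This is a sketch written for this manual, not the code
used by C<42>; the function names are hypothetical. It also assumes that an
empty part after a trailing pipe is ignored, which is what the last row of the
table suggests.

```shell
# Hypothetical helpers (not shipped with 42).
# Print one accession per FASTA defline read from the given files (or stdin).
list_accessions() {
    awk '/^>/ {
        id = $1                      # first run of non-whitespace characters
        sub(/^>/, "", id)            # drop the FASTA marker
        n = split(id, parts, "|")    # pipe-separated parts
        while (n > 1 && parts[n] == "") n--   # ASSUMPTION: skip empty trailing parts
        print parts[n]
    }' "$@"
}

# Print accessions that occur more than once (no output: all are unique).
duplicate_accessions() {
    list_accessions "$@" | sort | uniq -d
}
```

Running C<duplicate_accessions Euglena.fna> (file name given as an example)
before C<makeblastdb> gives an early hint about identifiers that would clash
within a bank.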

Then, as for C<ref_orgs> above, you need to produce a C<bank_mapper.idm> file
consisting of two columns separated by a tab character (C<\t>): the first
column is the organism name (C<org>) and the second is the database basename
(C<bank>).

    Euglena gracilis    Euglena_bank

B<Note:> Again, organism names must follow the same rules as above!
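
Since mapper files are easy to get subtly wrong (spaces instead of a tab, or an
underscore left between genus and species), a quick sanity check can help. The
sketch below is a heuristic written for this manual, not a feature of C<42>; the
function name C<check_idm> is hypothetical, and the test for a space in the
organism name is only an approximation of the naming rules.

```shell
# Hypothetical helper (not shipped with 42): sanity-check a two-column
# mapper file. Returns non-zero if any line looks suspicious.
check_idm() {
    awk -F '\t' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: expected 2 tab-separated columns\n", NR
            bad = 1
            next
        }
        # HEURISTIC: genus and species should be separated by a space
        $1 !~ / / {
            printf "line %d: no space in organism name: %s\n", NR, $1
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

C<check_idm bank_mapper.idm> prints nothing and returns 0 when every line has
two tab-separated columns and an organism name containing a space.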

=head3 Taxonomic filters (C<tax_filter>)

B<Note:> this section deals with an advanced use of C<42>. It can be skipped if
you do not plan to check added sequences for potential contamination (see
L<"Contamination detection..."|"Contamination detection and handling"> for
theoretical background).

C<Bio::MUST> modules provide quite sophisticated taxonomic filters. In
C<tax_filter> syntax, wanted taxa are prefixed by a C<+> symbol, whereas
unwanted taxa are prefixed by a C<-> symbol. Wanted taxa are linked by
logical ORs while unwanted taxa are linked by logical ANDs. Unwanted taxa should
not include wanted taxa because the former ones take precedence over the latter


