Bio-MUST-Apps-FortyTwo


bin/compress-db.pl


=item --cap3-o=<n>

Overlap length cutoff for CAP3 (should be > 15) [default: 40].

=for Euclid: n.type:    n > 15
    n.default: 40

=item --cap3-p=<n>

Overlap percent identity cutoff for CAP3 (should be > 65) [default: 90].

=for Euclid: n.type:    n > 65
    n.default: 90

=item --verbosity=<level>

Verbosity level for logging to STDERR [default: 0]. Available levels range from
0 to 6. Level 6 corresponds to debugging mode.

=for Euclid: level.type: int, level >= 0 && level <= 6

bin/debrief-42.pl


# Write file contents
say {$out}     join "\t", @{ $line_for{$_} }              for @sort_all_banks;
say {$out_sum} join "\t", @{ $line_for{$_} }[0..8,-7..-1] for @sort_all_banks;

### Done!


##################################### SUBS #####################################

# Format each count in $array as 'count/fraction', where fraction is the count
# divided by $total (two decimals). Zero counts are reported as '-/-'.
sub compute_percentage {
    my $array = shift;
    my $total = shift;

    my @results;
    CALC:
    for my $value (@$array) {
        if ($value == 0) {
            push @results, '-/-';
            next CALC;
        }

        # despite the sub name, this is a fraction (0-1), not a percentage
        my $percentage = sprintf "%.2f", $value / $total;
#       my $percentage = sprintf "%.2f", $value * 100 / $total;
        push @results, $value . '/' . $percentage;
    }
    return \@results;
}

# for testing:
# perl -Ilib bin/debrief-42.pl --indir=xtest/tax_reports/ \
#   --in-strip=-42-camera-megan99-tf --taxdir=../Bio-MUST-Core/test/taxdump \
#   --seq_labeling=xtest/seq-labels.idl --contam_labeling=xtest/contam-labels.idl \
#   --outdir=dbout
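
As a quick illustration of the output format, the sketch below copies the logic of compute_percentage into a standalone script and runs it on made-up counts. The counts and the total of 40 are assumptions for the example, not values from a real report. Note that the second field is a fraction of the total, not a percentage.

```perl
use strict;
use warnings;

# Standalone copy of the formatting logic, for illustration only.
sub compute_percentage {
    my ($counts, $total) = @_;

    my @results;
    for my $value (@$counts) {
        # zero counts are shown as a placeholder
        if ($value == 0) {
            push @results, '-/-';
            next;
        }
        # fraction of the total with two decimals (not multiplied by 100)
        push @results, $value . '/' . sprintf('%.2f', $value / $total);
    }
    return \@results;
}

# Hypothetical counts out of a hypothetical total of 40 sequences
my $formatted = compute_percentage([0, 10, 30], 40);
print join("\t", @$formatted), "\n";    # prints: -/-  10/0.25  30/0.75 (tab-separated)
```

A total of 0 combined with a non-zero count would still raise a division-by-zero error, so callers are expected to pass a meaningful total.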

bin/prune-outliers.pl


Minimum identity value used for selecting sequences that match at least this
proportion in the all-versus-all BLAST searches [default: n.default]. One output
directory is created per step of 0.1 between the min and max thresholds.

=for Euclid: n.type: num
    n.default: 0.3

=item --max-ident=<n> | --max_ident=<n>

Maximum identity value used for selecting sequences that match at least this
proportion in the all-versus-all BLAST searches [default: n.default]. One output
directory is created per step of 0.1 between the min and max thresholds.

=for Euclid: n.type: num
    n.default: 0.8

=item --min-hits=<n> | --min_hits=<n>

Minimum number of hits in the all-versus-all BLAST searches required for a
sequence to be retained in the output file [default: n.default].
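
To make the threshold stepping concrete, here is a minimal sketch that enumerates the identity thresholds between the documented defaults (0.3 and 0.8) in steps of 0.1. It assumes both bounds are inclusive, and the "ident-..." directory names are hypothetical; the script's actual naming scheme is not shown in this excerpt.

```perl
use strict;
use warnings;

my ($min_ident, $max_ident, $step) = (0.3, 0.8, 0.1);

# Count the steps first and derive each threshold from its index, which avoids
# accumulating floating-point error from repeated additions of 0.1.
my $n_steps    = sprintf '%.0f', ($max_ident - $min_ident) / $step;
my @thresholds = map { sprintf '%.1f', $min_ident + $_ * $step } 0 .. $n_steps;

# One (hypothetical) output directory per threshold
print "ident-$_\n" for @thresholds;
```

With the defaults this yields six thresholds, from 0.3 to 0.8.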

bin/yaml-generator-42.pl

[% IF megan_like -%]
tax_max_hits: 100
[% ELSIF best_hit -%]
tax_max_hits: 1
[% ELSIF tax_max_hits -%]
tax_max_hits: [% tax_max_hits %]
[% END -%]

# ===Min identity of relatives to use when inferring taxonomy of new seqs===
# Only meaningful when enabling 'tax_reports' or specifying 'tax_filter'.
# This parameter is the traditional BLAST 'percent identity' statistic except
# that it is specified as a fractional number (between 0 and 1). It is
# evaluated on the first HSP of potential relatives.
# When not specified, 'tax_min_ident' internally defaults to 0.
[% IF megan_like -%]
tax_min_ident: 0
[% ELSIF best_hit -%]
tax_min_ident: 0
[% ELSIF tax_min_ident -%]
tax_min_ident: [% tax_min_ident %]
[% END -%]

bin/yaml-generator-42.pl


                my $hit_filtering = prompt "\nSet hit-filtering mode: ",
                    -menu => {
                        'default values'  => 0,
                        'length/identity' => 'length_identity',
                        'Bit score'       => 'bitscore',
                    };

                if ($hit_filtering eq 'bitscore') {
                    $ARGV{'--tax_min_score'} = prompt "\nSet minimum bit score to consider a hit: ",
                        -must => { 'be an integer' => qr{^[0-9]+\z} },
                        -def  => $ARGV{'--tax_min_score'};
                }
                if ($hit_filtering eq 'length_identity') {
                    $ARGV{'--tax_min_ident'} = prompt "\nSet minimum identity (fraction between 0 and 1) to consider a hit: ",
                        -must => { 'be a number between 0 and 1' => qr{\A (?: 0 (?:\.\d+)? | 1 (?:\.0+)? ) \z}xms },
                        -def  => $ARGV{'--tax_min_ident'};

                    $ARGV{'--tax_min_len'} = prompt "\nSet minimum length to consider a hit: ",
                        -must => { 'be an integer' => qr{^[0-9]+\z} },
                        -def  => $ARGV{'--tax_min_len'};
                }
            }

            # MEGAN_LIKE
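
A prompt constraint on a fraction between 0 and 1 is only as strict as its regex: a pattern anchored at the start but not at the end also accepts inputs such as 1.5 or 0.5abc. The sketch below shows fully anchored patterns for a fraction and for a non-negative integer. The variable names and sample inputs are illustrative assumptions, not part of the script.

```perl
use strict;
use warnings;

# Fraction in [0, 1]: 0 with optional decimals, or 1 with optional zero decimals
my $fraction_re = qr{\A (?: 0 (?: \.[0-9]+ )? | 1 (?: \.0+ )? ) \z}xms;

# Non-negative integer (0 is allowed)
my $integer_re  = qr{\A [0-9]+ \z}xms;

for my $input (qw(0 0.5 1 1.0 1.5 0.5abc)) {
    printf "%-7s %s\n", $input, $input =~ $fraction_re ? 'ok' : 'rejected';
}

for my $input (qw(0 10 250 1abc -3)) {
    printf "%-7s %s\n", $input, $input =~ $integer_re ? 'ok' : 'rejected';
}
```

Using \A and \z rather than ^ and $ also rules out a trailing newline, which matters when the value comes straight from user input.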

lib/Bio/MUST/Apps/FortyTwo/Manual.pod

the C<megan-like> algorithm, so as to avoid false positives during LCA
computation, with a C<--tax_score_mul> of C<0.99> instead of C<0.95> and a
C<--tax_min_ident> of C<0.5> instead of C<0>.

The follow-up consists of running C<debrief-42.pl>, which parses the taxonomic
reports produced by C<42> in order to compare the taxonomic label (LCA) of each
ortholog computed by C<42> with the source organism lineage (according to I<NCBI
Taxonomy>) and classifies the sequences as contaminants if they differ at a
predefined taxonomic rank, based on a first user-defined list of taxa
(C<--seq_labeling>). After each ortholog has been classified, an estimated
contamination percentage is computed.

Additionally, contaminations are further classified to determine the main
sources of contaminants, based on a second user-defined list of taxa
(C<--contam_labeling>), which allows the user to fine-tune the output report.
In this regard, we distinguish two types of sequences, B<classified
contaminations> and B<unclassified contaminations>. The latter are those that
bear an uninformative taxonomic label, i.e., too broad to point to a specific
lineage with accuracy (e.g., C<Sar>). Finally, the sequences that can only be
affiliated at the highest taxonomic levels, such as C<cellular organisms>,
C<Eukaryota>, C<Bacteria> or C<Archaea>, are classified as B<unknown sequences>.
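
The classification described above can be sketched roughly as follows. This is not the code of debrief-42.pl: the taxa, sequence ids and lineages are made up, the two label lists merely stand in for the --seq_labeling and --contam_labeling files, and the percentage is computed over all sequences, which is an assumption about how the estimate is defined.

```perl
use strict;
use warnings;

# All taxa, sequence ids and lineages below are made up for illustration.
my @seq_labels    = ('Viridiplantae');       # stand-in for the --seq_labeling list
my @contam_labels = ('Fungi', 'Metazoa');    # stand-in for the --contam_labeling list
my %too_broad     = map { $_ => 1 }
    ('cellular organisms', 'Eukaryota', 'Bacteria', 'Archaea');

# Lineage of each ortholog's LCA, from the root down to the LCA itself
my %lineage_for = (
    seq1 => [ 'cellular organisms', 'Eukaryota', 'Viridiplantae', 'Streptophyta' ],
    seq2 => [ 'cellular organisms', 'Eukaryota', 'Opisthokonta', 'Fungi' ],
    seq3 => [ 'cellular organisms', 'Eukaryota', 'Sar' ],
    seq4 => [ 'cellular organisms', 'Eukaryota' ],
);

sub classify {
    my ($lineage) = @_;
    my %seen = map { $_ => 1 } @$lineage;

    return 'unknown'      if $too_broad{ $lineage->[-1] };     # LCA too high
    return 'expected'     if grep { $seen{$_} } @seq_labels;    # source lineage
    return 'classified'   if grep { $seen{$_} } @contam_labels; # known source
    return 'unclassified';                                      # e.g. Sar
}

my %count_for;
for my $seq_id (sort keys %lineage_for) {
    my $class = classify( $lineage_for{$seq_id} );
    $count_for{$class}++;
    print "$seq_id\t$class\n";
}

# Assumption: both contamination classes count, relative to all sequences
my $contam = ($count_for{classified} // 0) + ($count_for{unclassified} // 0);
my $n_seqs = scalar keys %lineage_for;
printf "estimated contamination: %.1f%%\n", 100 * $contam / $n_seqs;
```

With these four toy sequences, one falls into each class, so the sketch reports an estimated contamination of 50.0%.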

lib/Bio/MUST/Apps/FortyTwo/OrgProcessor.pm

        $ali->add_seq( shift @seqs2cap );
        return;
    }

    # TODO: add debugging comments?

    # try to cap seqs
    my $cap = Cap3->new(
        seqs      => \@seqs2cap,
        cap3_args => {
            -p => $rp->merge_min_ident * 100.0,     # CAP3 expects percents
            -o => $rp->merge_min_len,
        },
    );

    # add singlet seqs
    my @singlets = $cap->all_singlets;
    $ali->add_seq($_) for @singlets;

    # proceed only if contigs of seqs
    my @contigs  = $cap->all_contigs;

test/config-42-prot-tax.yaml

tax_min_hits: 1

# ===Max number of relatives to use when inferring taxonomy of new seqs===
# Only meaningful when enabling 'tax_reports' or specifying 'tax_filter'.
# As for 'tax_min_hits' above, this parameter is an upper bound.
# When not specified, 'tax_max_hits' internally defaults to unlimited.
tax_max_hits: 1

# ===Min identity of relatives to use when inferring taxonomy of new seqs===
# Only meaningful when enabling 'tax_reports' or specifying 'tax_filter'.
# This parameter is the traditional BLAST 'percent identity' statistic except
# that it is specified as a fractional number (between 0 and 1). It is
# evaluated on the first HSP of potential relatives.
# When not specified, 'tax_min_ident' internally defaults to 0.
tax_min_ident: 0
# ===Min length of relatives to use when inferring taxonomy of new seqs===
# Only meaningful when enabling 'tax_reports' or specifying 'tax_filter'.
# This parameter is the traditional BLAST 'alignment length' statistic. It is
# evaluated on the first HSP of potential relatives.
# When not specified, 'tax_min_len' internally defaults to 0.
tax_min_len: 0


