Bio-MUST-Core
lib/Bio/MUST/Core/Ali.pm view on Meta::CPAN
This method requires one argument.
=head2 idealize
Computes and applies an ideal sequence mask to the Ali and returns it. This
is only a convenience method.
When invoked without arguments, it will discard the gaps that are
universally shared by all the sequences. Otherwise, the provided argument
corresponds to the threshold of the C<ideal_mask> method described in
L<Bio::MUST::Core::SeqMask>.
use aliased 'Bio::MUST::Core::IdList';
my $fast_seqs = IdList->load('fast_evolving_seqs.idl');
my $seqs2keep = $fast_seqs->negative_list($ali);
$ali->apply_list($seqs2keep); # discard fast-evolving seqs
$ali->idealize; # discard newly shared gaps caused by fast seqs
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('hmm_based.ali');
$ali->idealize(0.05); # discard insertions due to <5% of the seqs
This method accepts an optional argument.
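The idea behind an ideal mask can be illustrated on plain strings. The following is only a minimal sketch, not the implementation used by this module: C<drop_gappy_columns> is a hypothetical helper, C<-> is assumed to be the only gap symbol, and a column is assumed to be dropped when its fraction of non-gap states is lower than or equal to the threshold (the exact comparison used by C<ideal_mask> may differ).

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::MUST::Core): drop every column whose
# fraction of non-gap states is <= $threshold (assumed comparison).
sub drop_gappy_columns {
    my ($seqs, $threshold) = @_;
    $threshold //= 0;                   # 0 => drop only all-gap columns

    my $width = length $seqs->[0];      # all seqs are assumed aligned
    my @keep;
    for my $col (0 .. $width - 1) {
        # grep in scalar context counts the seqs with a non-gap state
        my $filled = grep { substr($_, $col, 1) ne '-' } @{$seqs};
        push @keep, $col if $filled / @{$seqs} > $threshold;
    }

    # rebuild each sequence from the retained columns only
    return [
        map { my $s = $_; join q{}, map { substr $s, $_, 1 } @keep } @{$seqs}
    ];
}

my $masked = drop_gappy_columns( [ 'AC-GT', 'A--GT', 'AT-G-' ] );
print "$_\n" for @{$masked};
```

With the default threshold, only the third column (shared gaps) is removed; a higher threshold also removes columns in which too few sequences carry a state.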
=head1 MISC METHODS
=head2 gapmiss_regex
Returns a regular expression matching gaps and ambiguous or missing states.
The exact regex returned depends on the type of sequences in the Ali (nucl. or
proteins).
my $regex = $ali->gapmiss_regex;
my $first_seq = $ali->get_seq(0)->seq;
my $gapmiss_n = () = $first_seq =~ m/$regex/xmsg; # list context yields the match count
say "The first sequence has $gapmiss_n gaps or ambiguous/missing sites";
This method does not accept any arguments.
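Counting the sites matched by such a regex requires evaluating the match in list context. The sketch below is self-contained and does not call this module: the pattern is only a stand-in (an assumption) for what C<gapmiss_regex> might return for proteins, and C<count_matches> is a hypothetical helper.

```perl
use strict;
use warnings;

# Stand-in pattern (an assumption): gaps plus a few ambiguous/missing symbols.
my $regex = qr/[-*?Xx]/;

# Hypothetical helper: count all non-overlapping matches of $regex in $seq.
sub count_matches {
    my ($seq, $regex) = @_;
    # the empty-list assignment forces list context; assigning that list
    # assignment to a scalar then yields the number of matches
    my $n = () = $seq =~ m/$regex/g;
    return $n;
}

my $count = count_matches( 'MK-LV??XA*', $regex );
print "$count gaps or ambiguous/missing sites\n";
```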
=head2 map_coords
Converts a set of site positions from Ali coordinates to coordinates of the
specified sequence (thereby ignoring positions due to gaps). Returns the
converted sites in sequence coordinates as an array reference.
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load('input.ali');
my $id = 'GIV-Norovirus Hum.GIV.1.POL_1338688@508124125';
my $ali_coords = [ 4, 25, 73, 89, 104, 116 ];
my $seq_coords = $ali->map_coords($id, $ali_coords);
# $seq_coords is [ 3, 23, 59, 71, 71, 74 ]
This method requires two arguments: the id of a sequence and an array
reference of input sites in Ali coordinates.
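The conversion can be pictured as a running count of non-gap states along the aligned sequence. The following is a minimal sketch on a plain string, not the implementation used by this module: C<map_coords_sketch> is a hypothetical helper, coordinates are assumed to be 1-based, C<-> and C<*> are assumed to be the gap symbols, and a site falling in a gap is assumed to map to the last preceding residue.

```perl
use strict;
use warnings;

# Hypothetical helper: map 1-based alignment coordinates to 1-based
# coordinates of the ungapped sequence.
sub map_coords_sketch {
    my ($aligned_seq, $ali_coords) = @_;

    # $cumul[$i] = number of non-gap states among the first $i columns
    my @cumul = (0);
    for my $state (split //, $aligned_seq) {
        push @cumul, $cumul[-1] + ( $state =~ m/[-*]/ ? 0 : 1 );
    }

    return [ map { $cumul[$_] } @{$ali_coords} ];
}

my $seq_coords = map_coords_sketch( 'AC--GT-A', [ 1, 2, 3, 4, 5, 8 ] );
print join(', ', @{$seq_coords}), "\n";
```

Under these assumptions, consecutive alignment sites lying in the same gap run all collapse onto the same sequence coordinate.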
=head1 I/O METHODS
=head2 load
Class method (constructor) returning a new Ali read from disk. This method
will transparently import plain FASTA files in addition to the MUST
pseudo-FASTA format (ALI files).
use Test::More;
use aliased 'Bio::MUST::Core::Ali';
my $ali1 = Ali->load('example.ali');
my $ali2 = Ali->load('example.fasta');
my @seqs1 = $ali1->all_seqs;
my @seqs2 = $ali2->all_seqs;
is_deeply \@seqs1, \@seqs2, 'should be true';
done_testing;
This method requires one argument.
=head2 store
Writes the Ali to disk in the MUST pseudo-FASTA format (ALI files).
Note that the ALI format is only used when the suffix of the outfile name is
'.ali'. In all other cases (including lack of suffix), this method
automatically forwards the call to C<store_fasta>.
$ali->store('output.ali');
# output.ali is written in ALI format
$ali->store('output.fasta');
# output.fasta is written in FASTA format
This method requires one argument (but see C<store_fasta> in case of automatic
forwarding of the method call).
=head2 store_fasta
Writes the Ali to disk in the plain FASTA format.
For compatibility purposes, this method automatically fetches sequence ids
using the C<foreign_id> method instead of the native C<full_id> method, both
described in L<Bio::MUST::Core::SeqId>.
$ali->store_fasta( 'output.fasta' );
$ali->store_fasta( 'output.fasta', {chunk => -1, degap => 1} );
This method requires one argument and accepts a second optional argument
controlling the output format. It is a hash reference that may contain one
or more of the following keys:
- clean: replace all ambiguous and missing states by C<X> (default: false)
- degap: boolean value controlling degapping (default: false)
- chunk: line width (default: 60 chars; a negative value means no wrap)
Finally, it is possible to fine-tune the behavior of the C<clean> option by
providing a character other than C<X> through the C<gapify> key. This can be
useful to replace all ambiguous and missing states by gaps, as shown below:
$ali->store_fasta( 'output.fasta', { clean => 1, gapify => '*' } );
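The effect of the C<chunk> and C<degap> keys can be illustrated with a minimal sketch on plain strings. This is not the code used by this module: C<format_fasta_record> is a hypothetical helper, C<-> and C<*> are assumed to be the gap symbols, and a chunk of zero is treated like a negative one here simply to avoid a degenerate line width.

```perl
use strict;
use warnings;

# Hypothetical helper: format one FASTA record from an id and a sequence.
sub format_fasta_record {
    my ($id, $seq, $args) = @_;
    my $chunk = $args->{chunk} // 60;           # default line width

    $seq =~ tr/-*//d if $args->{degap};         # strip assumed gap symbols

    # non-positive chunk: single line; otherwise split into fixed-width lines
    my $body = $chunk <= 0
        ? $seq
        : join "\n", unpack "(a$chunk)*", $seq;

    return ">$id\n$body\n";
}

print format_fasta_record( 'seq1', 'ACGTACGTAC', { chunk => 4 } );
print format_fasta_record( 'seq2', 'AC--GT', { chunk => -1, degap => 1 } );
```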
=head2 temp_fasta
Writes a temporary copy of the Ali to disk in the plain FASTA format using
numeric sequence ids and returns the name of the temporary file. This is
only a convenience method.
In list context, returns the IdMapper object along with the temporary filename.
=head2 store_phylip
Writes the Ali to disk in the interleaved (or sequential) PHYLIP format.
To ensure maximal flexibility, this method fetches sequence ids using the
native C<full_id> method described in L<Bio::MUST::Core::SeqId>, but
truncates them to 10 characters, as expected by the original PHYLIP software
package. No other tinkering is carried out on the ids. Thus, if the ids
contain whitespace or are not unique in their first 10 characters, it is
advised to first map them using one of the constructors in
L<Bio::MUST::Core::IdMapper>.
use aliased 'Bio::MUST::Core::Ali';
use aliased 'Bio::MUST::Core::IdMapper';
my $ali = Ali->load('input.ali');
my $mapper = IdMapper->std_mapper($ali);
$ali->shorten_ids($mapper);
$ali->store_phylip( 'input.phy', { chunk => 50 } );
This method requires one argument and accepts a second optional argument
controlling the output format. It is a hash reference that may contain one
or more of the following keys:
- short: truncate ids to 10 chars, as in original PHYLIP (default: yes)
- clean: replace all ambiguous and missing states by 'X' (default: false)
- chunk: line width (default: 60 chars; a negative value means no wrap)
To store the Ali in PHYLIP sequential format, specify a negative chunk (-1).
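Before writing a PHYLIP file with truncated ids, it can help to check whether the ids would remain usable. The following is a minimal sketch on plain strings, independent of this module; C<phylip_id_problems> is a hypothetical helper reporting ids that contain whitespace or that clash with an earlier id in their first 10 characters.

```perl
use strict;
use warnings;

# Hypothetical helper: list the ids likely to cause trouble once truncated.
sub phylip_id_problems {
    my @ids = @_;

    my (%seen, @problems);
    for my $id (@ids) {
        my $short = substr $id, 0, 10;          # what PHYLIP would retain
        push @problems, "$id: contains whitespace"
            if $id =~ m/\s/xms;
        push @problems, "$id: clashes with $seen{$short}"
            if exists $seen{$short};
        $seen{$short} //= $id;                  # remember the first owner
    }

    return @problems;
}

print "$_\n"
    for phylip_id_problems( 'Homo_sapiens_1', 'Homo_sapiens_2', 'Mus musculus' );
```

If this check reports anything, mapping the ids first (as in the example above) is the safer route.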
=head2 load_stockholm
Class method (constructor) returning a new Ali read from a file in the
STOCKHOLM format. =GF comments are retained (see above) but not the other
comment classes (=GS, =GR and =GC).
use aliased 'Bio::MUST::Core::Ali';
my $ali = Ali->load_stockholm('upsk.stockholm');
say $ali->header;
# outputs:
# ID UPSK
# SE Predicted; Infernal
# SS Published; PMID 9223489
# RN [1]
# RM 9223489
# RT The role of the pseudoknot at the 3' end of turnip yellow mosaic
# RT virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
# RT polymerase.
# RA Deiman BA, Kortlever RM, Pleij CW;
# RL J Virol 1997;71:5990-5996.
This method requires one argument.
=head2 load_tinyseq
Class method (constructor) returning a new Ali read from a file in NCBI
TinySeq XML format.
=head2 instant_store
Class method intended to transform a large sequence file read from disk
without loading it in memory. This method will transparently process plain
FASTA files in addition to the MUST pseudo-FASTA format (ALI files).
my $chunk = 200;
my $split = sub {
    my $seq = shift;
    my $base_id = ( split /\s+/xms, $seq->full_id )[0];
    my $max_pos = $seq->seq_len - $chunk;
    my $n = 0;
    my $out_str;
    for (my $pos = 0; $pos <= $max_pos; $pos += $chunk, $n++) {
        $out_str .= ">$base_id.$n\n" . $seq->edit_seq($pos,
            $pos + $chunk <= $max_pos ? $chunk : 2 * $chunk
        ) . "\n";
    }
    return $out_str;
};
use aliased 'Bio::MUST::Core::Ali';
Ali->instant_store(
    'outfile.fasta', { infile => 'infile.fasta', coderef => $split }
);
This method requires two arguments. The second is a hash reference that must
contain the following keys:
- infile: input sequence file
- coderef: subroutine implementing the transforming logic
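The general pattern of transforming a sequence file record by record, without ever holding the whole file in memory, can be sketched in plain Perl. This is not the implementation used by this module: C<stream_fasta> is a hypothetical helper, it handles plain FASTA only, and its callback receives the id and the sequence as strings instead of a sequence object.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records one at a time from $in_fh, pass
# each to $coderef and print whatever string it returns to $out_fh.
sub stream_fasta {
    my ($in_fh, $out_fh, $coderef) = @_;

    my ($id, $seq);
    my $flush = sub {
        print {$out_fh} $coderef->($id, $seq) if defined $id;
    };

    while (my $line = <$in_fh>) {
        chomp $line;
        if ($line =~ m/\A > (.*) \z/xms) {      # new record: emit previous one
            $flush->();
            ($id, $seq) = ($1, q{});
        }
        elsif (defined $id) {                   # sequence line of current record
            $seq .= $line;
        }
    }
    $flush->();                                 # emit the last record

    return;
}

# demo on in-memory filehandles: reverse each sequence
my $fasta = ">seq1 demo\nACGT\nAC\n>seq2\nGGTT\n";
open my $in,  '<', \$fasta     or die "Cannot open in-memory input: $!";
open my $out, '>', \my $result or die "Cannot open in-memory output: $!";
stream_fasta( $in, $out, sub {
    my ($id, $seq) = @_;
    my $rev = reverse $seq;                     # scalar context: string reversal
    return ">$id\n$rev\n";
} );
close $out;
print $result;
```

Only one record is held at a time, so memory use depends on the longest sequence and not on the file size.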
=head2 instant_count
Class method returning the number of seqs in any sequence file read from disk
without loading it in memory. This method will transparently process plain
FASTA files in addition to the MUST pseudo-FASTA format (ALI files).
use aliased 'Bio::MUST::Core::Ali';
my $seq_n = Ali->instant_count('input.ali');
say $seq_n;
=head1 ALIASES
=head2 height
Alias for the C<count_seqs> method, provided for API consistency.
=head1 AUTHOR
Denis BAURAIN <denis.baurain@uliege.be>
=head1 CONTRIBUTORS
=for stopwords Catherine COLSON Arnaud DI FRANCO
=over 4
=item *
Catherine COLSON <ccolson@doct.uliege.be>
=item *
Arnaud DI FRANCO <arnaud.difranco@gmail.com>
=back
=head1 COPYRIGHT AND LICENSE
This software is copyright (c) 2013 by University of Liege / Unit of Eukaryotic Phylogenomics / Denis BAURAIN.
This is free software; you can redistribute it and/or modify it under
the same terms as the Perl 5 programming language system itself.
=cut