Bio-MUST-Apps-FortyTwo


lib/Bio/MUST/Apps/FortyTwo/Manual.pod


=head3 MSAs (C<*.fasta>)

The native file format of C<42> for MSAs is known as the C<ALI> format. It is
very similar to the well-known C<FASTA> format, with a few differences: (1)
each sequence must appear on a single (long) line; (2) gaps are encoded as
asterisk characters (C<*>) instead of dashes (C<->), and any whitespace is
interpreted as missing character states; (3) sequence identifiers accept a
single whitespace between genus and species (more on this just below); and (4)
comment lines (starting with the hash character C<#>) are allowed. Although
C<42> can read and write C<FASTA> files transparently, its C<ALI> roots
sometimes play tricks on the user.
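
To make the four differences concrete, here is a minimal sketch of an
C<ALI>-to-C<FASTA> conversion. It is written in Python purely for illustration
and is not part of C<42>: the function name C<ali_to_fasta>, the 60-column
wrapping and the choice of C<?> for missing character states are all
assumptions of this example, not documented behavior of C<42>.

```python
def ali_to_fasta(ali_text, width=60):
    """Convert ALI-formatted text to FASTA text (illustrative sketch only)."""
    records = []
    seq_id = None
    for line in ali_text.splitlines():
        if line.startswith("#"):
            # (4) comment lines exist in ALI only, so they are dropped
            continue
        if line.startswith(">"):
            # (3) the whitespace between genus and species becomes '_'
            seq_id = line[1:].strip().replace(" ", "_")
            continue
        if seq_id is None or not line:
            continue
        # (1) the whole sequence sits on this single line
        # (2) '*' is a gap; whitespace is a missing state
        #     (rendering it as '?' is an assumption of this sketch)
        seq = line.replace("*", "-").replace(" ", "?")
        records.append((seq_id, seq))
        seq_id = None

    out = []
    for rec_id, seq in records:
        out.append(">" + rec_id)
        # FASTA allows a sequence to be wrapped over several lines
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in out)


if __name__ == "__main__":
    example = "# toy alignment\n>Homo sapiens@P12345\nMK*L V\n"
    print(ali_to_fasta(example), end="")
```

Note that the sequence line is deliberately not stripped, since in C<ALI>
even leading or trailing whitespace would count as missing character states.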

This is especially true for sequence identifiers. Each identifier must hold the
organism name (C<org>), followed by a separator (C<@>) and a protein/gene
accession number. The organism name is usually the binomial name. Genus and
species must be separated by a whitespace (C< > in C<ALI> format) or by an
underscore character (C<_> in C<FASTA> format). A strain name and/or an NCBI
taxon id are also allowed after the species name, each preceded by an
underscore character (C<_>). If both are used in the sequence identifier, the
taxon id must come last. Finally, all sequence identifiers must be unique
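
The identifier layout described above can be sketched as a small parser. This
is a hypothetical Python illustration, not the code used by C<42>: the names
C<SeqId>, C<parse_seq_id> and C<find_duplicates> are invented here, and the
sketch assumes that a trailing all-digit field is the NCBI taxon id (so a
purely numeric strain name without a taxon id would be misread).

```python
import re
from collections import Counter
from typing import NamedTuple, Optional


class SeqId(NamedTuple):
    genus: str
    species: str
    strain: Optional[str]
    taxon_id: Optional[int]
    accession: str


def parse_seq_id(full_id):
    """Split 'Genus species[_strain][_taxid]@accession' into its parts."""
    # Everything before the first '@' is the organism, the rest the accession
    org, sep, accession = full_id.partition("@")
    if not sep or not org or not accession:
        raise ValueError("expected 'org@accession': %r" % full_id)

    # Genus and species are separated by ' ' (ALI) or '_' (FASTA)
    parts = re.split(r"[ _]", org)
    if len(parts) < 2 or not all(parts):
        raise ValueError("expected at least genus and species: %r" % full_id)
    genus, species, *rest = parts

    # Assumption: if present, the taxon id is numeric and comes last
    taxon_id = None
    if rest and re.fullmatch(r"[0-9]+", rest[-1]):
        taxon_id = int(rest.pop())
    strain = "_".join(rest) if rest else None
    return SeqId(genus, species, strain, taxon_id, accession)


def find_duplicates(ids):
    """Return the identifiers occurring more than once, in first-seen order."""
    counts = Counter(ids)
    return [seq_id for seq_id in counts if counts[seq_id] > 1]


if __name__ == "__main__":
    print(parse_seq_id("Homo sapiens@P12345"))
    print(parse_seq_id("Escherichia_coli_K12_83333@NP_414542"))
```

Splitting on the first C<@> only keeps underscores inside the accession
intact, and running C<find_duplicates> over all identifiers of a file is one
simple way to check the uniqueness requirement.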


