BioPerl
view release on metacpan or search on metacpan
Bio/Assembly/IO/tigr.pm view on Meta::CPAN
This package loads and writes assembly information in/from files in the default
TIGR Assembler v2 format. The files are lassie-formatted and often have the
.tasm extension. This module was written to be used as a driver module for
Bio::Assembly::IO input/output.
=head2 Implementation
Assemblies are loaded into Bio::Assembly::Scaffold objects composed of
Bio::Assembly::Contig and Bio::Assembly::Singlet objects. Since aligned reads
and contig gapped consensus can be obtained in the tasm files, only
aligned/gapped sequences are added to the different BioPerl objects.
Additional assembly information is stored as features. Contig objects have
SeqFeature information associated with the primary_tag:
_main_contig_feature:$contig_id -> misc contig information
_quality_clipping:$read_id -> quality clipping position
Read objects have sub_seqFeature information associated with the
primary_tag:
_main_read_feature:$read_id -> misc read information
Singlets are considered by TIGR Assembler as contigs of one sequence. Contigs
are represented here with features having these primary_tag:
_main_contig_feature:$contig_id
_quality_clipping:$read_primary_id
_main_read_feature:$read_primary_id
_aligned_coord:$read_primary_id
=head1 THE TIGR TASM LASSIE FORMAT
=head2 Description
In the TIGR tasm lassie format, contigs are separated by a line containing a single
pipe character "|", whereas the reads in a contig are separated by a blank line.
Singlets can be present in the file and are represented as a contig
composed of a single sequence.
Other than the two above-mentioned separators, each line has an attribute name,
followed a tab and then an attribute value.
The tasm format is used by more TIGR applications than just TIGR Assembler.
Some of the attributes are not used by TIGR Assembler or have constant values.
They are indicated by an asterisk *
Contigs have the following attributes:
asmbl_id -> contig ID
sequence -> contig ungapped consensus sequence (ambiguities are lowercase)
lsequence -> gapped consensus sequence (lowercase ambiguities)
quality -> gapped consensus quality score (in hexadecimal)
seq_id -> *
com_name -> *
type -> *
method -> always 'asmg' *
ed_status -> *
redundancy -> fold coverage of the contig consensus
perc_N -> percent of ambiguities in the contig consensus
seq# -> number of sequences in the contig
full_cds -> *
cds_start -> start of coding sequence *
cds_end -> end of coding sequence *
ed_pn -> name of editor (always 'GRA') *
ed_date -> date and time of edition
comment -> some comments *
frameshift -> *
Each read has the following attributes:
seq_name -> read name
asm_lend -> position of first base on contig ungapped consensus sequence
asm_rend -> position of last base on contig ungapped consensus sequence
seq_lend -> start of quality-trimmed sequence (aligned read coordinates)
seq_rend -> end of quality-trimmed sequence (aligned read coordinates)
best -> always '0' *
comment -> some comments *
db -> database name associated with the sequence (e.g. >my_db|seq1234)
offset -> offset of the sequence (gapped consensus coordinates)
lsequence -> aligned read sequence (ambiguities are uppercase)
When asm_rend E<lt> asm_lend, the sequence was on the complementary DNA strand but
its reverse complement is shown in the aligned sequence of the assembly file,
not the original read.
Ambiguities are reflected in the contig consensus sequence as
lowercase IUPAC characters: a c g t u m r w s y k x n . In the read
sequences, however, ambiguities are uppercase: M R W S Y K X N
=head2 Example
Example of a contig containing three sequences:
sequence CGATGCTGTACGGCTGTTGCGACAGATTGCGCTGGGTCGATACCGCGTTGGTGATCGGCTTGTTCAGCGGGCTCTGGTTCGGCGACAGCGCGGCGATCTTGGCGGCTGCGAAGGTTGCCGGCGCAATCATGCGCTGCTGACCGTTGACCTGGTCCTGCCAGTACACCCAGTCGCCCACCATGACCTTCAGCGCGTAGCTGTCACAGCCGGCTGTGGTCAGCGCAGTGGCGACGGTGG...
lsequence CGATGCTGTACGGCTGTTGCGACAGATTGCGCTGGGTCGATACCGCGTTGGTGATCGGCTTGTTCAGCGGGCTCTGGTTCGGCGACAGCGCGGCGATCTTGGCGGCTGCGAAGGTTGCCGGCGCAATCATGCGCTGCTGACCGTTGACCTGGTCCTGCCAGTACACCCAGTCGCCCACCATGACCTTCAGCGCGTAGCTGTCACAGCCGGCTGTGGTCAGCGCAGTGGCGACGGTG...
quality 0x0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909...
asmbl_id 93
seq_id
com_name
type
method asmg
ed_status
redundancy 1.11
perc_N 0.20
seq# 3
full_cds
cds_start
cds_end
ed_pn GRA
ed_date 08/16/07 17:10:12
comment
frameshift
seq_name SDSU_RFPERU_010_C09.x01.phd.1
asm_lend 1
asm_rend 4423
seq_lend 1
seq_rend 442
best 0
Bio/Assembly/IO/tigr.pm view on Meta::CPAN
"best\t$readinfo{'best'}\n".
"comment\t$readinfo{'comment'}\n".
"db\t$readinfo{'db'}\n".
"offset\t$readinfo{'offset'}\n".
"lsequence\t$readinfo{'lsequence'}\n"
);
if ($seqno < $contiginfo{'seqnum'}) {
$self->_print("\n");
} else {
$self->_print("|\n")
};
}
}
return 1;
}
=head2 write_header
Title : write_header
Usage : $asmio->write_header($assembly)
Function: In the TIGR Asseformat assembly driver, this does nothing. The
method is present for compatibility with other assembly drivers
that need to write a file header.
Returns : 1 on success, 0 for error
Args : A Bio::Assembly::Scaffold object
=cut
sub write_header {
my ($self) = @_;
return 1;
}
=head2 write_footer
Title : write_footer
Usage : $asmio->write_footer($assembly)
Function: Write TIGR footer, i.e. do nothing except making sure that the
file does not end with a '|'.
Returns : 1 on success, 0 for error
Args : A Bio::Assembly::Scaffold object
=cut
sub write_footer {
my ($self) = @_;
# In this implementation, the TIGR file always ends with '|\n'. Remove it.
seek $self->_fh, -length("|\n"), 2;
$self->_print("\n\n");
return 1;
}
=head2 _perc_N
Title : _perc_N
Usage : my $perc_N = $asmio->_perc_N($sequence_string)
Function: Calculate the percent of ambiguities in a sequence.
M R W S Y K X N are regarded as ambiguities in an aligned read
sequence by TIGR Assembler. In the case of a gapped contig
consensus sequence, all lowercase symbols are ambiguities, i.e.:
a c g t u m r w s y k x n.
Returns : decimal number
Args : string
=cut
sub _perc_N {
my ($self, $seq_string) = @_;
$self->throw("Cannot accept an empty sequence") if length($seq_string) == 0;
my $perc_N = 0;
for my $base ( split //, $seq_string ) {
# individual base matches an ambiguity?
if (( $base =~ m/[x|n|m|r|w|s|y|k]/i ) || ( $base =~ m/[a|c|g|t|u]/ ) ) {
$perc_N++;
}
}
$perc_N = $perc_N * 100 / length $seq_string;
return $perc_N;
}
=head2 _redundancy
Title : _redundancy
Usage : my $ref = $asmio->_redundancy($contigobj)
Function: Calculate the fold coverage (redundancy) of a contig consensus
(average number of read base pairs covering the consensus)
Returns : decimal number
Args : Bio::Assembly::Contig
=cut
sub _redundancy {
# redundancy = (sum of all aligned read lengths - ( number of gaps in gapped
# consensus + number of gaps in aligned reads that are also in the consensus ) )
# / length of ungapped consensus
my ($self, $contigobj) = @_;
my $redundancy = 0;
# sum of all aligned read lengths
my $read_tot = 0;
for my $readobj ( $contigobj->each_seq ) {
my $read_length = length($readobj->seq);
$read_tot += $read_length;
}
$redundancy += $read_tot;
# - respected gaps
my $consensus_sequence = $contigobj->get_consensus_sequence->seq;
my @consensus_gaps = ();
$contigobj->_register_gaps($consensus_sequence, \@consensus_gaps);
my $respected_gaps = scalar(@consensus_gaps);
if ($respected_gaps > 0) {
my @cons_arr = split //, $consensus_sequence;
for my $gap_pos_cons ( @consensus_gaps ) {
for my $readobj ( $contigobj->each_seq ) {
my $readid = $readobj->id;
( run in 3.242 seconds using v1.01-cache-2.11-cpan-39bf76dae61 )