BioPerl

 view release on metacpan or  search on metacpan

Bio/Assembly/IO/tigr.pm  view on Meta::CPAN


This package loads and writes assembly information in/from files in the default
TIGR Assembler v2 format. The files are lassie-formatted and often have the
.tasm extension. This module was written to be used as a driver module for
Bio::Assembly::IO input/output.

=head2 Implementation

Assemblies are loaded into Bio::Assembly::Scaffold objects composed of
Bio::Assembly::Contig and Bio::Assembly::Singlet objects. Since aligned reads
and contig gapped consensus can be obtained in the tasm files, only
aligned/gapped sequences are added to the different BioPerl objects.

Additional assembly information is stored as features. Contig objects have
SeqFeature information associated with the primary_tag:

    _main_contig_feature:$contig_id -> misc contig information
    _quality_clipping:$read_id      -> quality clipping position

Read objects have sub_seqFeature information associated with the
primary_tag:

    _main_read_feature:$read_id     -> misc read information

Singlets are considered by TIGR Assembler as contigs of one sequence. Contigs
are represented here with features having these primary_tag: 

    _main_contig_feature:$contig_id
    _quality_clipping:$read_primary_id
    _main_read_feature:$read_primary_id
    _aligned_coord:$read_primary_id

=head1 THE TIGR TASM LASSIE FORMAT

=head2 Description

In the TIGR tasm lassie format, contigs are separated by a line containing a single
pipe character "|", whereas the reads in a contig are separated by a blank line.
Singlets can be present in the file and are represented as a contig
composed of a single sequence.

Other than the two above-mentioned separators, each line has an attribute name,
followed a tab and then an attribute value.

The tasm format is used by more TIGR applications than just TIGR Assembler.
Some of the attributes are not used by TIGR Assembler or have constant values.
They are indicated by an asterisk *

Contigs have the following attributes:

    asmbl_id   -> contig ID
    sequence   -> contig ungapped consensus sequence (ambiguities are lowercase)
    lsequence  -> gapped consensus sequence (lowercase ambiguities)
    quality    -> gapped consensus quality score (in hexadecimal)
    seq_id     -> *
    com_name   -> *
    type       -> *
    method     -> always 'asmg' *
    ed_status  -> *
    redundancy -> fold coverage of the contig consensus
    perc_N     -> percent of ambiguities in the contig consensus
    seq#       -> number of sequences in the contig
    full_cds   -> *
    cds_start  -> start of coding sequence *
    cds_end    -> end of coding sequence *
    ed_pn      -> name of editor (always 'GRA') *
    ed_date    -> date and time of edition
    comment    -> some comments *
    frameshift -> *

Each read has the following attributes:

    seq_name  -> read name
    asm_lend  -> position of first base on contig ungapped consensus sequence
    asm_rend  -> position of last base on contig ungapped consensus sequence
    seq_lend  -> start of quality-trimmed sequence (aligned read coordinates)
    seq_rend  -> end of quality-trimmed sequence (aligned read coordinates)
    best      -> always '0' *
    comment   -> some comments *
    db        -> database name associated with the sequence (e.g. >my_db|seq1234)
    offset    -> offset of the sequence (gapped consensus coordinates)
    lsequence -> aligned read sequence (ambiguities are uppercase)

When asm_rend E<lt> asm_lend, the sequence was on the complementary DNA strand but
its reverse complement is shown in the aligned sequence of the assembly file,
not the original read.

Ambiguities are reflected in the contig consensus sequence as
lowercase IUPAC characters: a c g t u m r w s y k x n . In the read
sequences, however, ambiguities are uppercase: M R W S Y K X N

=head2 Example

Example of a contig containing three sequences:

    sequence	CGATGCTGTACGGCTGTTGCGACAGATTGCGCTGGGTCGATACCGCGTTGGTGATCGGCTTGTTCAGCGGGCTCTGGTTCGGCGACAGCGCGGCGATCTTGGCGGCTGCGAAGGTTGCCGGCGCAATCATGCGCTGCTGACCGTTGACCTGGTCCTGCCAGTACACCCAGTCGCCCACCATGACCTTCAGCGCGTAGCTGTCACAGCCGGCTGTGGTCAGCGCAGTGGCGACGGTGG...
    lsequence	CGATGCTGTACGGCTGTTGCGACAGATTGCGCTGGGTCGATACCGCGTTGGTGATCGGCTTGTTCAGCGGGCTCTGGTTCGGCGACAGCGCGGCGATCTTGGCGGCTGCGAAGGTTGCCGGCGCAATCATGCGCTGCTGACCGTTGACCTGGTCCTGCCAGTACACCCAGTCGCCCACCATGACCTTCAGCGCGTAGCTGTCACAGCCGGCTGTGGTCAGCGCAGTGGCGACGGTG...
    quality	0x0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B0B090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909090909...
    asmbl_id	93
    seq_id	
    com_name	
    type	
    method	asmg
    ed_status	
    redundancy	1.11
    perc_N	0.20
    seq#	3
    full_cds	
    cds_start	
    cds_end	
    ed_pn	GRA
    ed_date	08/16/07 17:10:12
    comment	
    frameshift	

    seq_name	SDSU_RFPERU_010_C09.x01.phd.1
    asm_lend	1
    asm_rend	4423
    seq_lend	1
    seq_rend	442
    best	0

Bio/Assembly/IO/tigr.pm  view on Meta::CPAN

                "best\t$readinfo{'best'}\n".
                "comment\t$readinfo{'comment'}\n".
                "db\t$readinfo{'db'}\n".
                "offset\t$readinfo{'offset'}\n".
                "lsequence\t$readinfo{'lsequence'}\n"
            );
            if ($seqno < $contiginfo{'seqnum'}) {
                $self->_print("\n");
            } else {
                $self->_print("|\n")
            };
        }
    }
    return 1;
}

=head2 write_header

    Title   : write_header
    Usage   : $asmio->write_header($assembly)
    Function: In the TIGR Asseformat assembly driver, this does nothing. The
              method is present for compatibility with other assembly drivers
              that need to write a file header.
    Returns : 1 on success, 0 for error
    Args    : A Bio::Assembly::Scaffold object

=cut

sub write_header {
    my ($self) = @_;
    return 1;
}


=head2 write_footer

    Title   : write_footer
    Usage   : $asmio->write_footer($assembly)
    Function: Write TIGR footer, i.e. do nothing except making sure that the
              file does not end with a '|'.
    Returns : 1 on success, 0 for error
    Args    : A Bio::Assembly::Scaffold object

=cut

sub write_footer {
    my ($self) = @_;

    # In this implementation, the TIGR file always ends with '|\n'. Remove it.
    seek $self->_fh, -length("|\n"), 2;
    $self->_print("\n\n");

    return 1;
}


=head2 _perc_N

    Title   : _perc_N
    Usage   : my $perc_N = $asmio->_perc_N($sequence_string)
    Function: Calculate the percent of ambiguities in a sequence.
              M R W S Y K X N are regarded as ambiguities in an aligned read
              sequence by TIGR Assembler. In the case of a gapped contig
              consensus sequence, all lowercase symbols are ambiguities, i.e.:
              a c g t u m r w s y k x n.
    Returns : decimal number
    Args    : string

=cut

sub _perc_N {
    my ($self, $seq_string) = @_;
    $self->throw("Cannot accept an empty sequence") if length($seq_string) == 0;
    my $perc_N = 0;
    for my $base ( split //, $seq_string ) {
        # individual base matches an ambiguity?
        if (( $base =~ m/[x|n|m|r|w|s|y|k]/i ) || ( $base =~ m/[a|c|g|t|u]/ ) ) {
            $perc_N++;
        }
    }
    $perc_N = $perc_N * 100 / length $seq_string;
    return $perc_N;
}


=head2 _redundancy

    Title   : _redundancy
    Usage   : my $ref = $asmio->_redundancy($contigobj)
    Function: Calculate the fold coverage (redundancy) of a contig consensus
              (average number of read base pairs covering the consensus)
    Returns : decimal number
    Args    : Bio::Assembly::Contig

=cut

sub _redundancy {
    # redundancy = (sum of all aligned read lengths - ( number of gaps in gapped
    # consensus + number of gaps in aligned reads that are also in the consensus ) )
    # / length of ungapped consensus
    my ($self, $contigobj) = @_;
    my $redundancy = 0;
    
    # sum of all aligned read lengths
    my $read_tot = 0;
    for my $readobj ( $contigobj->each_seq ) {
        my $read_length = length($readobj->seq);
        $read_tot += $read_length;
    }
    $redundancy += $read_tot;
    
    # - respected gaps
    my $consensus_sequence = $contigobj->get_consensus_sequence->seq;
    my @consensus_gaps = ();
    $contigobj->_register_gaps($consensus_sequence, \@consensus_gaps);
    my $respected_gaps = scalar(@consensus_gaps);
    if ($respected_gaps > 0) {
        my @cons_arr = split //, $consensus_sequence;
        for my $gap_pos_cons ( @consensus_gaps ) {
            for my $readobj ( $contigobj->each_seq ) {
                my $readid = $readobj->id;



( run in 3.242 seconds using v1.01-cache-2.11-cpan-39bf76dae61 )