desktop results from the CPAN

Bio-SAGE-DataProcessing
view release on metacpan or search on metacpan
lib/Bio/SAGE/DataProcessing.pm view on Meta::CPAN
B<PURPOSE>

This module facilitates the processing of SAGE data.
Specifically:

  1. extracting ditags from raw sequence reads.
  2. extracting tags from ditags, with the option to
     exclude tags if the Phred scores (described by
     Ewing and Green, 1998a and Ewing et al., 1998b)
     do not meet a minimum cutoff value.
  3. calculating descriptive values
  4. statistical analysis to determine, where possible,
     additional nucleotides to extend the length of the
     SAGE tag (thus facilitating more accurate tag to
     gene mapping).

Both regular SAGE (14mer tag) and LongSAGE (21mer tag)
are supported by this module.  Future protocols should
be configurable with this module.

B<REFERENCES>

  Velculescu V, Zhang L, Vogelstein B, Kinzler KW. (1995)
  Serial analysis of gene expression. Science. 270:484-487.

  Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B,
  Kinzler KW, Velculescu V. (2002) Using the transcriptome
  to annotate the genome. Nat. Biotechnol. 20:508-512.

  Ewing B, Hillier L, Wendl MC, Green P. (1998a) Base-calling
  of automated sequencer traces using phred. I. Accuracy
  assessment. Genome Res. 8:175-185.

  Ewing B, Green P. (1998b) Base-calling of automated
  sequencer traces using phred. II. Error probabilities.
  Genome Res. 8:186-194.

=head1 INSTALLATION

Follow the usual steps for installing any Perl module:

  perl Makefile.PL
  make test
  make install

=head1 PREREQUISITES

None (this module used to require the C<Statistics::Distributions> package).

=head1 CHANGES

  1.20 2004.10.15 - Minor spelling errors and misuse of terminology fixed in docs.
                  - Module now allows FASTA files with a blank line folling the
                    header (denoting an attempted read with no sequence), but prints
                    a warning to STDERR that this has happened. Module died previously.
  1.11 2004.06.20 - Added flag in constructor to keep duplicate ditags.
  1.10 2004.06.02 - Wrote new documentation and modified several methods to use the read-by-read
                    processing approach (see line below).
                  - Revamped the module to conserve memory. Reads are now processed one at a time
                    and then discarded. The memory requirements in the previous versions were
                    prohibitive to those with regular desktop machines.
                  - The Bio::SAGE::DataProcessing::Filter package can be subclassed to create
                    custom filters at the ditag and tag processing steps (previous versions only
                    allowed one approach to ditag/tag filtering).
  1.01 2004.05.27 - Fixed bug where extract_tag_counts didn't work with quality cutoff defined.
                  - extract_tags was not applying the get_quality_cutoff value (was returning all data)
                  - Duplicate ditags are now removed by default.
  1.00 2004.05.23 - Initial release.

=cut

use strict;
use diagnostics;
use vars qw( $VERSION @ISA @EXPORT @EXPORT_OK $PROTOCOL_SAGE $PROTOCOL_LONGSAGE $DEBUG $ENZYME_NLAIII $ENZYME_SAU3A $DEFAULT_DITAG_FILTER $DEFAULT_TAG_FILTER );

require Exporter;
require AutoLoader;

@ISA = qw( Exporter AutoLoader );
@EXPORT = qw();
$VERSION = "1.11";

#use Statistics::Distributions;
use Bio::SAGE::DataProcessing::Filter;
use Bio::SAGE::DataProcessing::AveragePhredFilter;
use Bio::SAGE::DataProcessing::MinimumPhredFilter;

my $PACKAGE = "Bio::SAGE::DataProcessing";

=pod

=head1 VARIABLES

B<Globals>

=over 2

I<$PROTOCOL_SAGE>

  Hashref containing default protocol parameters for the
  regular/original SAGE protocol (see set_protocol
  documentation for more information).

I<$PROTOCOL_LONGSAGE>

  Hashref containing default protocol parameters for the
  LongSAGE protocol (see set_protocol documentation
  for more information).

I<$ENZYME_NLAIII> = 'CATG'

  Constant denoting the recognition sequence for NlaIII.

I<$ENZYME_SAU3A> = 'GATC'

  Constant denoting the recognition sequence for Sau3a.

I<$DEFAULT_TAG_FILTER>

  A default tag filter used when none is specified.
  This filter rejects tags that contain any base pair
( run in 1.730 second using v1.01-cache-2.11-cpan-524268b4103 )