Bio-SAGE-DataProcessing
lib/Bio/SAGE/DataProcessing.pm
B<PURPOSE>
This module facilitates the processing of SAGE data.
Specifically:
1. extracting ditags from raw sequence reads.
2. extracting tags from ditags, with the option to
exclude tags if the Phred scores (described by
Ewing et al., 1998a and Ewing and Green, 1998b)
do not meet a minimum cutoff value.
3. calculating descriptive values.
4. statistical analysis to determine, where possible,
additional nucleotides to extend the length of the
SAGE tag (thus facilitating more accurate tag to
gene mapping).
Both regular SAGE (14mer tag) and LongSAGE (21mer tag)
are supported by this module. Future protocols should
be configurable with this module.
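Steps 1 and 2 above can be sketched in plain Perl as follows. This is a minimal illustration, not the module's implementation: the function names (find_ditags, ditag_to_tags, revcomp), the ditag length bounds, and the "reject a tag if any base falls below the cutoff" rule are all assumptions made for the example. It also assumes that the quoted 14mer tag length includes the 4 bp NlaIII site, leaving 10 bases taken from each end of the ditag.

```perl
use strict;
use warnings;

# All names below are hypothetical and are NOT part of the module's API.
my $SITE = 'CATG';    # NlaIII recognition sequence

# Step 1: return the fragments that lie between two enzyme sites and
# whose length falls within [$min, $max] (illustrative bounds).
sub find_ditags {
    my ( $read, $min, $max ) = @_;
    my @pieces = split /$SITE/, $read, -1;
    shift @pieces;    # leading piece is not preceded by a site
    pop @pieces;      # trailing piece is not followed by a site
    return grep { length($_) >= $min && length($_) <= $max } @pieces;
}

# Reverse-complement an uppercase DNA string.
sub revcomp {
    my $seq = reverse $_[0];
    $seq =~ tr/ACGT/TGCA/;
    return $seq;
}

# Step 2: take $len bases from each end of the ditag. The right-hand tag
# is read from the opposite strand, so it is reverse-complemented.
# Assumed rule: a tag is dropped if any of its bases has a Phred score
# below $cutoff. $scores is an arrayref with one score per ditag base.
sub ditag_to_tags {
    my ( $ditag, $scores, $len, $cutoff ) = @_;
    my @tags;
    for my $side ( 0, 1 ) {
        my $start = $side ? length($ditag) - $len : 0;
        my @q     = @{$scores}[ $start .. $start + $len - 1 ];
        next if grep { $_ < $cutoff } @q;
        my $tag = substr( $ditag, $start, $len );
        $tag = revcomp($tag) if $side;
        push @tags, $SITE . $tag;    # 4 bp site + 10 bases = 14mer
    }
    return @tags;
}

# Usage: one read holding a single 22 bp ditag, all bases at Phred 30.
my $read = 'GGCATGAAAAAAAAAACCGGGGGGGGGGCATGTT';
for my $ditag ( find_ditags( $read, 20, 24 ) ) {
    my @scores = (30) x length($ditag);
    print join( "\n", ditag_to_tags( $ditag, \@scores, 10, 20 ) ), "\n";
}
```

Splitting on the recognition sequence, rather than matching a fixed-width pattern, keeps a fragment from silently spanning an internal enzyme site. For LongSAGE the same sketch would apply with 17 bases per end and wider ditag bounds, again under the assumption that the 21mer length includes the enzyme site.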
B<REFERENCES>
Velculescu V, Zhang L, Vogelstein B, Kinzler KW. (1995)
Serial analysis of gene expression. Science. 270:484-487.
Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B,
Kinzler KW, Velculescu V. (2002) Using the transcriptome
to annotate the genome. Nat. Biotechnol. 20:508-512.
Ewing B, Hillier L, Wendl MC, Green P. (1998a) Base-calling
of automated sequencer traces using phred. I. Accuracy
assessment. Genome Res. 8:175-185.
Ewing B, Green P. (1998b) Base-calling of automated
sequencer traces using phred. II. Error probabilities.
Genome Res. 8:186-194.
=head1 INSTALLATION
Follow the usual steps for installing any Perl module:
perl Makefile.PL
make
make test
make install
=head1 PREREQUISITES
None (this module used to require the C<Statistics::Distributions> package).
=head1 CHANGES
1.20 2004.10.15 - Minor spelling errors and misuse of terminology fixed in docs.
- Module now allows FASTA files with a blank line following the
header (denoting an attempted read with no sequence), but prints
a warning to STDERR when this happens. Previously the module died.
1.11 2004.06.20 - Added flag in constructor to keep duplicate ditags.
1.10 2004.06.02 - Wrote new documentation and modified several methods to use the read-by-read
processing approach (see line below).
- Revamped the module to conserve memory. Reads are now processed one at a time
and then discarded. The memory requirements in the previous versions were
prohibitive to those with regular desktop machines.
- The Bio::SAGE::DataProcessing::Filter package can be subclassed to create
custom filters at the ditag and tag processing steps (previous versions only
allowed one approach to ditag/tag filtering).
1.01 2004.05.27 - Fixed bug where extract_tag_counts didn't work with quality cutoff defined.
- extract_tags was not applying the get_quality_cutoff value (was returning all data)
- Duplicate ditags are now removed by default.
1.00 2004.05.23 - Initial release.
=cut
use strict;
use diagnostics;
use vars qw( $VERSION @ISA @EXPORT @EXPORT_OK $PROTOCOL_SAGE $PROTOCOL_LONGSAGE $DEBUG $ENZYME_NLAIII $ENZYME_SAU3A $DEFAULT_DITAG_FILTER $DEFAULT_TAG_FILTER );
require Exporter;
require AutoLoader;
@ISA = qw( Exporter AutoLoader );
@EXPORT = qw();
$VERSION = "1.11";
#use Statistics::Distributions;
use Bio::SAGE::DataProcessing::Filter;
use Bio::SAGE::DataProcessing::AveragePhredFilter;
use Bio::SAGE::DataProcessing::MinimumPhredFilter;
my $PACKAGE = "Bio::SAGE::DataProcessing";
=pod
=head1 VARIABLES
B<Globals>
=over 2
I<$PROTOCOL_SAGE>
Hashref containing default protocol parameters for the
regular/original SAGE protocol (see set_protocol
documentation for more information).
I<$PROTOCOL_LONGSAGE>
Hashref containing default protocol parameters for the
LongSAGE protocol (see set_protocol documentation
for more information).
I<$ENZYME_NLAIII> = 'CATG'
Constant denoting the recognition sequence for NlaIII.
I<$ENZYME_SAU3A> = 'GATC'
Constant denoting the recognition sequence for Sau3A.
I<$DEFAULT_TAG_FILTER>
A default tag filter used when none is specified.
This filter rejects tags that contain any base pair