Bio-ToolBox
scripts/get_gene_regions.pl
which is described at L<Ensembl TSL glossary entry|http://uswest.ensembl.org/info/website/glossary.html>.
Provide a level of support to filter. Values include:
    1       All splice junctions supported by evidence
    2       Transcript flagged as suspect or only support from multiple ESTs
    3       Only support from single EST
    4       Best supporting EST is suspect
    5       No support
    best    Transcripts at the best (lowest) available level are taken
    best1   The word "best" followed by a digit 1-5, indicating any
            transcript at or better (lower) than the indicated level
    NA      Only transcripts without a level (NA) are retained.
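The selection rules above can be sketched as follows. This is an illustrative Python sketch, not the script's own Perl implementation: the function name C<filter_by_tsl>, the C<(name, level)> tuple layout, the exact-match behavior of a plain digit, and the empty result for C<best> when no transcript has a level are all assumptions.

```python
def filter_by_tsl(transcripts, level):
    """Filter (name, tsl) pairs; tsl is an int 1-5, or None for NA.

    Hypothetical helper: the rules follow the option text, but the
    data layout and edge-case handling are assumptions.
    """
    if level == "NA":
        # Keep only transcripts that carry no support level.
        return [t for t in transcripts if t[1] is None]
    if level == "best":
        scored = [t[1] for t in transcripts if t[1] is not None]
        if not scored:
            return []  # assumption: nothing to keep without any level
        top = min(scored)  # lowest number is the best support
        return [t for t in transcripts if t[1] == top]
    if level.startswith("best"):
        # "bestN": keep anything at or better (lower) than N.
        cutoff = int(level[4:])
        return [t for t in transcripts if t[1] is not None and t[1] <= cutoff]
    # Plain digit: assumed here to mean an exact match on that level.
    exact = int(level)
    return [t for t in transcripts if t[1] == exact]


transcripts = [("t1", 2), ("t2", 1), ("t3", None), ("t4", 4)]
print(filter_by_tsl(transcripts, "best2"))
```

With these inputs, C<best2> keeps C<t1> and C<t2>, while C<best> keeps only C<t2>.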
=item --unique
Compare start and stop coordinates of each collected region from
each feature and remove duplicate regions. When the --slop option
is provided, only the start coordinate plus/minus the slop factor
is checked.
=item --slop E<lt>integerE<gt>
When identifying unique regions, specify the number of bp to
add and subtract to the start position (the slop or fudge factor)
of the regions when considering duplicates. Any other region
within this window will be considered a duplicate. Useful, for
example, when start sites of transcription are not precisely mapped,
but not useful with defined introns and exons. This does not take
into consideration transcripts from other genes, only the current
gene. The default is 0 (no sloppiness).
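The interaction between --unique and --slop might look like the sketch below. It is written in Python for illustration only; the function name C<unique_regions>, the C<(start, stop)> tuples, and the rule that the first region seen is the one kept are assumptions rather than the script's documented behavior.

```python
def unique_regions(regions, slop=0):
    """Remove duplicate (start, stop) regions collected from one gene.

    Hypothetical helper. With slop == 0 both coordinates must match;
    with slop > 0 only the start coordinate is compared, and any start
    within +/- slop of a kept region counts as a duplicate.
    """
    kept = []
    for start, stop in regions:
        if slop > 0:
            is_dup = any(abs(start - k_start) <= slop for k_start, _ in kept)
        else:
            is_dup = (start, stop) in kept
        if not is_dup:
            kept.append((start, stop))  # assumption: first seen wins
    return kept


regions = [(100, 200), (100, 200), (103, 250), (150, 200)]
print(unique_regions(regions))          # exact comparison
print(unique_regions(regions, slop=5))  # start compared within 5 bp
```

Because only regions from the current gene are passed in, regions from other genes are never compared, in keeping with the description above.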
=item --chrskip E<lt>regexE<gt>
Exclude features from the output whose sequence ID or chromosome matches
the provided regex-compatible string. Expressions should be quoted or
properly escaped on the command line. Examples include:
    'chrM'
    'scaffold.+'
    'chr.+alt|chrUn.+|chr.+_random'
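To see what such an expression would exclude, the following Python sketch applies the last example pattern to a few sequence IDs. The helper name C<keep_feature> and the sample IDs are hypothetical, and the sketch assumes an unanchored, case-sensitive match, which may differ from how the script applies the expression.

```python
import re


def keep_feature(seq_id, chrskip):
    """Return True when seq_id does not match the exclusion regex.

    Hypothetical helper; assumes an unanchored, case-sensitive search.
    """
    return re.search(chrskip, seq_id) is None


pattern = r"chr.+alt|chrUn.+|chr.+_random"
seq_ids = [
    "chr1",
    "chr1_KI270706v1_random",
    "chrUn_GL000195v1",
    "chr6_GL000250v2_alt",
    "chrM",
]
print([s for s in seq_ids if keep_feature(s, pattern)])
```

Under the unanchored assumption, a pattern such as 'chrM' would also match any ID that merely contains that string, so anchoring with C<^> and C<$> may be worth considering.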
=back
=head2 Adjustments
=over 4
=item --start E<lt>integerE<gt>
=item --begin E<lt>integerE<gt>
=item --stop E<lt>integerE<gt>
=item --end E<lt>integerE<gt>
Optionally adjust the reported start and end coordinates of the
collected regions by the given values. A negative value shifts the
coordinate upstream (5' direction), and a positive value shifts it
downstream. Adjustments are made relative to the feature's strand, such
that a start adjustment always modifies the feature's 5' end, which is
either the feature startpoint or endpoint, depending on its orientation.
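The strand-relative arithmetic can be illustrated with a short Python sketch. The function name C<adjust_region>, its parameters, and the use of 1 and -1 for strand are assumptions made for the example, not the script's internals.

```python
def adjust_region(start, stop, strand, start_adj=0, stop_adj=0):
    """Shift region ends relative to strand; returns (start, stop).

    Hypothetical helper. start_adj moves the 5' end and stop_adj moves
    the 3' end; negative values move upstream, positive downstream.
    """
    if strand >= 0:
        # Forward strand: the 5' end is the lower coordinate.
        return start + start_adj, stop + stop_adj
    # Reverse strand: the 5' end is the higher coordinate, and
    # upstream means increasing coordinates, so the signs flip.
    return start - stop_adj, stop - start_adj


# Extend 100 bp upstream of the 5' end and 50 bp past the 3' end.
print(adjust_region(1000, 2000, 1, start_adj=-100, stop_adj=50))
print(adjust_region(1000, 2000, -1, start_adj=-100, stop_adj=50))
```

The forward-strand call yields (900, 2050) and the reverse-strand call yields (950, 2100): the same options extend different genomic coordinates depending on orientation.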
=back
=head2 Output options
=over 4
=item --out E<lt>filenameE<gt>
Specify the output filename.
=item --bed
Automatically convert the output file to a BED file.
=item --bedname E<lt>nameE<gt>
Specify what to use for the Name column in the output BED file.
Several options are available, including:
    geneid            The Primary ID of the parent Gene feature
    genename          The Display Name of the parent Gene feature
    transcriptid      The Primary ID of the parent Transcript feature
    transcriptname    The Display Name of the parent Transcript feature
    featurename       The generated name of the feature (default)
=item --gz
Specify whether (or not) the output file should be compressed with gzip.
=back
=head2 General options
=over 4
=item --version
Print the version number.
=item --help
Display this POD documentation.
=back
=head1 DESCRIPTION
This program will collect specific regions from annotated genes and/or
transcripts. Often these regions are not explicitly defined in the
source GFF3 annotation, necessitating a script to pull them out. These
regions include the start and stop sites of transcription, introns,
the splice sites (both 5' and 3'), exons, the first (5') or last (3')
exons, or all alternate or common exons of genes with multiple
transcripts. Importantly, the output may be restricted to unique
regions, which is especially useful when a single gene has multiple
alternative transcripts. A slop factor is included for imprecise annotation.
The program will report the chromosome, start and stop coordinates,
strand, name, and parent and transcript names for each region