Bio-ToolBox
parallel process forks.
- Fixed bug where the interactive menu would exit upon an empty value
in script manipulate_datasets.pl. A "q" must now be provided to exit.
- Minor optimization when calculating shift values in script bam2wig.pl.
v.1.12 (svn 619)
- Major improvements to performance of some data collection scripts by
adding multi-threaded options. These include get_datasets.pl,
get_relative_data.pl, average_gene.pl, and bam2wig.pl. The number of
CPU forks may be specified with the --cpu option (default 2). This option
requires the installation of Parallel::ForkManager, available through
CPAN. Run the check_dependencies.pl script to install it.
- All gzip compression reads and writes are now forked through an
external gzip utility for a considerable boost in performance (2-5X).
The gzip executable must be in your path for this to work (it usually
is on most Unix-like environments).
- Added --long option when collecting data from long features in script
average_gene.pl.
- Improved efficiency when collecting data from very large windows in
both get_relative_data.pl and average_gene.pl.
- Summing the total number of read alignments in Bam files is also
multi-threaded. Summing the total number of intervals in a BigBed file
is also improved.
- Fixed a critical error where not all windows had data collected when
using the script get_relative_data.pl.
v.1.11 (svn 603)
- Major revision of how features are now retrieved from the database
using primary_IDs rather than relying on unique names in the database.
Generating lists of features will now return Primary_ID, Name, and Type.
The Primary_ID is unique to a database and is usually non-portable.
Current feature lists with only Name and Type will still work, and are
subject to limitations of non-unique Names in the database. This affects
all scripts that work with database features, including get_features.pl,
get_feature_info.pl, get_datasets.pl, get_relative_data.pl,
average_gene.pl, get_intersecting_features.pl, and correlate_position_data.pl.
- GFF3 annotation scripts get_ensembl_annotation.pl and ucsc_table2gff3.pl
now produce GFF3 files that better match the GFF3 specification. Names
are no longer made unique (which broke ties with the originating data),
proper Dbxref tags are attributed when external sources could be
identified, and chromosomes are now sorted by name. Other minor
improvements were also made.
- Fixed critical bug that prevented spliced alignments from being
counted in script bam2wig.pl. Thanks to Pinal K. for reporting.
v.1.10.3 (svn 597)
- Unified column names and improved their recognition in scripts
get_feature_info.pl and the graphing scripts graph_data.pl,
graph_histogram.pl, and graph_profile.pl.
- Graphing scripts now write the output graph directory in the input
file parent directory instead of the current directory.
v.1.10.2 (svn 591)
- Added a new option of position when adjusting coordinates of retrieved
features using the script get_features.pl. Coordinates may be adjusted
at the 5 prime, 3 prime, or both ends of stranded features. This also
fixes bugs where collected features on the reverse strand with adjusted
coordinates were not reported properly.
- Improved automatic recognition of the name, score, and other columns
in the converter scripts data2bed.pl, data2gff.pl, and data2wig.pl.
- Improved the Cluster and Treeview export function in script
manipulate_datasets.pl. The CDT files generated now include separate ID
and NAME columns per the specification, and new manipulations are
included prior to exporting, including percentile rank and log2.
- The convert null function now also converts zero values if requested
in script manipulate_datasets.pl.
- Added new option of a minimum size when trimming windows in the script
find_enriched_regions.pl.
- Increased the radius from 35 bp to 50 bp when verifying a putative
mapped nucleosome in script map_nucleosomes.pl, leading to fewer
overlapping or offset nucleosomes.
- Added new option to re-center offset nucleosomes in script
verify_nucleosome_mapping.pl. Also improved report formatting.
- Added checks and warnings when writing file names longer than 256
characters. Some scripts automatically generate file names that may
exceed this limit, preventing writing. File names are now truncated.
Thanks to Adam F. for reporting.
- Added new methods and code improvements to the gff3 parsing library.
- Fixed a bug in script merge_datasets.pl where the column index for a
second file may not be properly validated leading to premature
termination.
- Fixed a bug where multiple datasets combined with an ampersand for
merging were not properly verified.
- Fixed a bug where a user may not be prompted to select a dataset from
a database if none was supplied from the command line.
- Fixed a bug where files containing trailing nulls do not load
properly.
- Fixed a bug related to finding specific data columns by name.
- Fixed a bug with writing summary files.
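The strand-aware coordinate adjustment described above for get_features.pl
can be illustrated with a minimal Python sketch. It is not the script's
actual code: the function name adjust_ends and the parameters extend_5p and
extend_3p are hypothetical, and the convention that positive values extend a
feature outward is an assumption made for the example. The point it shows is
that the 5 prime end is the start coordinate on the forward strand but the
end coordinate on the reverse strand.

```python
def adjust_ends(start, end, strand, extend_5p=0, extend_3p=0):
    """Extend (positive) or trim (negative) the ends of a stranded feature.

    Coordinates are 1-based and inclusive. strand is 1 (forward) or -1
    (reverse). All names here are hypothetical, chosen for illustration.
    """
    if strand >= 0:
        # Forward strand: the 5' end is the start coordinate.
        new_start = start - extend_5p
        new_end = end + extend_3p
    else:
        # Reverse strand: the 5' end is the end coordinate.
        new_start = start - extend_3p
        new_end = end + extend_5p

    # Do not run off the beginning of the chromosome.
    new_start = max(new_start, 1)
    if new_start > new_end:
        raise ValueError("adjustment leaves an empty feature")
    return new_start, new_end
```

For example, extending the 5 prime end of a reverse-strand feature at
100..200 by 10 bp moves the end coordinate to 210 and leaves the start at 100.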
v.1.10.1 (svn 568)
- Added support for Bio::DB::Fasta in the main BioToolBox library, and
added the support to scripts data2fasta.pl and CpG_calculator.pl. Any
BioToolBox program that requires chromosome information or sequence can
now use a genomic multi-fasta or directory of fasta files in the --db
option.
- Fixed critical error in data2gff.pl that prevented files from being
converted to GFF format.
- Fixed critical error in merge_datasets.pl that prevented column headers
from being written to the output file.
- Made the warning about unavailable files on the UCSC FTP server less
scary in the script ucsc_table2gff3.pl.
- Updated and clarified some script documentation.
v.1.10 (svn 559)
- Significantly improved performance when collecting data from Bam files
by using a low level API. Improvements of at least 2X may be realized.
- Significantly improved the performance of the bam2wig.pl script by at
least 2X. Added a new option of recording extended regions across the
predicted fragment based on empirically determined shift values.
Sampling to determine shift values has been increased. BedGraph files
are now written more efficiently. A maximum number of identical reads is
now enforced.
- Significantly improved the performance of the split_bam_by_isize.pl
script to increase speed by at least 2X. Added an option to skip
checking of mates. Improved reporting of results.
- Added a filter option to remove overlapping nucleosomes in script
verify_nucleosome_mapping.pl; also fixed bugs in reporting offset
distances and improved output reporting.
- Removed confusing separate scan and tag datasets required for script
map_nucleosomes.pl. Cleaned up and organized code. Fixed bugs that
v.1.9.5 (svn 525)
- Changed the non-intuitive --except option to a more intuitive --zero
option in script manipulate_datasets.pl; this is now a boolean option to
include or exclude zero values when calculating statistics. The printed
statistics output has also been cleaned up and no longer includes
decimal formatting. The export function now generates an output name
itself when executed automatically.
- Added capability to use a column of source values rather than a static
text string for the GFF source tag in script data2gff.pl. Also made
improvements to the interactive ask session.
- Added the capability to use a big file dataset as the database for
chromosome information in script find_enriched_regions.pl.
- Added an option to automatically convert the output file to a BED file
in script get_gene_regions.pl, and included a description of the --in
option in the POD documentation.
v.1.9.4 (svn 519)
- Fixed first critical bug in script get_datasets.pl where strand
information in input files with genomic coordinates (e.g. BED files) was
not considered when adjusting coordinates (start, stop, or fractional).
- Fixed second critical bug in script get_datasets.pl where collecting
fractional data for named database features resulted in data collection
over the entire feature.
- Improved interpretation of input file features as genomic regions or
named features in script get_datasets.pl.
- Changed the --set_strand option to --force_strand in multiple data
collection scripts. This should make the function a little more obvious
as to its purpose. Documentation changed as appropriate.
v.1.9.3 (svn 516)
- Fixed bug where wig definition lines may not be written when no
alignments exist in the first 2 Mb of a chromosome when converting a bam
file to a wig file in script bam2wig.pl. Definition lines are now always
written. Thanks to Matt J. for reporting.
- Fixed bug where the format_with_commas sub was not properly imported
into the tim_db_helper library.
- Fixed bug where the bed output from script get_features.pl did not
properly report strand information.
v.1.9.2 (svn 510)
- Fixed critical bug where codon changes were not reported correctly for
minus strand genes in script locate_SNPs.pl. Thanks to Craig K. for
reporting.
v.1.9.1 (svn 507)
- Added critical code to interpret strand information from input files
such as Bed and GFF into BioPerl standards. Essential for collecting
stranded data. Also properly writes back strand information for valid
Bed and GFF files.
- Updated and unified internal library methods for validating and
requesting database feature types. By default, all database features are
presented to the user as a list when selecting database features to
collect data. The source_exclude parameter in the biotoolbox.cfg
configuration file is now deprecated.
- Upgraded script get_intersecting_features.pl to automatically
recognize input file columns and search for more than one feature type.
- Fixed bug in script get_datasets.pl where the program would not
continue when only a data database was provided.
- Fixed bug of requesting index when using a .kgg file as a gene list in
script pull_features.pl.
- Fixed bug in generating file name for Treeview export function in
script manipulate_datasets.pl.
- Fixed behavior when reading files to prevent adding the current
program name to the metadata when the input file does not have this
metadata.
- Minor updates to script novo_wrapper.pl.
v.1.9.0 (svn 493)
- Added new script get_features.pl which generates a list of features
for one or more feature types from a database. Information about the
features may be returned, including name, type, and coordinates. Sub
features may be included. The data may be written as a BioToolBox
formatted text file, GFF or BED.
- Added new script correlate_position_data.pl that calculates a Pearson
correlation between the score values at identical positions along a
feature between two datasets. This helps in identifying changes in
spatial distribution of values. An option for calculating shifts is also
available.
- Improved Big File generation such that Bio::DB::BigWig or
Bio::DB::BigBed is no longer required just to generate the big file, as
conversion uses external utilities anyway.
- Fixed generation of bin values when calculating distribution
frequencies in scripts data2frequency.pl and graph_histogram.pl.
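The position-wise Pearson correlation described above for
correlate_position_data.pl can be sketched in Python as follows. This is an
illustration of the statistic only, not the script's implementation: the
function name pearson_at_positions is hypothetical, and skipping positions
where either dataset has no value (None) is an assumption made for the
example.

```python
import math


def pearson_at_positions(scores_a, scores_b):
    """Pearson r between two score lists sampled at identical positions.

    Positions where either value is None are skipped. Returns None when
    fewer than two usable pairs remain or when either series is constant.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both datasets must cover the same positions")

    pairs = [(a, b) for a, b in zip(scores_a, scores_b)
             if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return None

    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n

    # Sums of squared deviations and of cross products.
    ss_a = sum((a - mean_a) ** 2 for a, _ in pairs)
    ss_b = sum((b - mean_b) ** 2 for _, b in pairs)
    sp_ab = sum((a - mean_a) * (b - mean_b) for a, b in pairs)

    if ss_a == 0 or ss_b == 0:
        return None  # correlation is undefined for a constant series
    return sp_ab / math.sqrt(ss_a * ss_b)
```

A value near 1 indicates that the two datasets rise and fall at the same
positions along the feature; a low or negative value suggests a change in
spatial distribution.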
v.1.8.7 (svn 487)
- Added new command line options to script merge_datasets.pl to control
the program's behavior. The "--lookupname" option allows you to specify
the name of the lookup column, while "--manual" turns off all automatic
guessing of columns. Also improved handling of original_file metadata.
- Added a new option to collect data from long features (such as genomic
annotations) instead of point data (microarray or sequence data) in
script get_relative_data.pl.
- Added option to convert to and from Roman numerals in chromosome names
and support for wig files in script change_chr_prefix.pl.
- Added option to change the IP port number when connecting to a remote
MySQL database host in script get_ensembl_annotation.pl.
- Fixed bug to properly close opened files in script split_data_file.pl
and avoid unnecessary error messages.
- Modified statements and warnings regarding step and span values in
script data2wig.pl.
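The Roman numeral conversion of chromosome names mentioned above for
change_chr_prefix.pl can be illustrated with a short Python sketch. It is
not the script's code: the function names and the exclude parameter are
hypothetical. The exclude set is an assumption added because names such as
chrX or chrM are ambiguous (a sex or mitochondrial chromosome in some
genomes, a Roman numeral in others), so the caller has to decide.

```python
import re

_ROMAN_PAIRS = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
                (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
                (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50,
                 "C": 100, "D": 500, "M": 1000}


def to_roman(number):
    """Convert a positive integer to a Roman numeral string."""
    out = []
    for value, symbol in _ROMAN_PAIRS:
        while number >= value:
            out.append(symbol)
            number -= value
    return "".join(out)


def from_roman(numeral):
    """Convert a Roman numeral string to an integer."""
    total = 0
    for i, char in enumerate(numeral):
        value = _ROMAN_VALUES[char]
        # A smaller symbol before a larger one is subtracted (IV = 4).
        if i + 1 < len(numeral) and _ROMAN_VALUES[numeral[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def chr_to_arabic(name, exclude=()):
    """chrIV -> chr4. Names in exclude, or not Roman, are returned as-is."""
    match = re.match(r"^(chr)?([IVXLCDM]+)$", name)
    if not match or match.group(2) in exclude:
        return name
    return (match.group(1) or "") + str(from_roman(match.group(2)))


def chr_to_roman(name):
    """chr12 -> chrXII. Non-numeric names are returned unchanged."""
    match = re.match(r"^(chr)?(\d+)$", name)
    if not match or int(match.group(2)) == 0:
        return name
    return (match.group(1) or "") + to_roman(int(match.group(2)))
```

For a yeast-style annotation, chr_to_arabic("chrXVI") gives "chr16", while
passing exclude={"X", "M"} leaves chrX and chrM untouched.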
v.1.8.6 (svn 477)
- Added numerous enhancements and bug fixes to script data2wig.pl,
including automatically assigning the span parameter in the wig file,
identifying coordinate columns, adding command line options for
coordinate columns, and updating the POD documentation.
- Improved the treeview export function in script manipulate_datasets.pl
to include different manipulations, including median center of genes or
datasets, converting to Z-scores, and converting null values. Also
changed the default output name to <basename>.cdt.
- Added advanced option to script merge_datasets.pl to specify the
column order on the command line instead of interactively. Also
increased the number of columns that can be specified as letters.
- Added the "value" command line option to specify the type of data to
collect to the script find_enriched_regions.pl. Also added the sum
method plus some improvements for identifying depleted regions.
- Updated the script run_cluster.pl to accept any file name as input,
and added basic file format validation checks prior to running the
cluster algorithm, among a few other minor improvements.
- Improved handling of error messages when attempting to open databases
that do not exist or cannot otherwise be opened.