Bio-ToolBox
lib/Bio/ToolBox/Data/file.pm
=head1 DESCRIPTION
These are methods for providing file IO for the L<Bio::ToolBox::Data>
data structure. These file IO methods work with any generic tab-delimited
text file of rows and columns. They also properly handle comment lines,
metadata, and column-specific metadata custom to L<Bio::ToolBox> programs.
Special file formats used in bioinformatics, for example GFF and BED
files, are automatically recognized by their file extension, and the
appropriate metadata is added.
Files opened using these subroutines are stored in a specific complex data
structure described below. This format allows for data access and also
records metadata about each column (dataset) and the file in general. This
metadata helps preserve a "history" of the dataset: where it came from, how
it was collected, and how it was processed.
Additional subroutines are also present for general processing and output of
this data structure.
The data file format is described below, followed by a description of the
data structure.
=head1 RECOGNIZED FILE FORMATS
L<Bio::ToolBox> will recognize a number of standard bioinformatic file
formats, almost all of which are recognized by their extension. Recognition
is NOT guaranteed if an alternate file extension is used.
These formats include:
=over 4
=item BED
These include file extensions F<.bed>, F<.bedgraph>, and F<.bdg>.
BED files must have 3-12 columns. BedGraph files must have 4 columns.
=item GFF
These include file extensions F<.gff>, F<.gff3>, and F<.gtf>.
The specific format may also be recognized by the C<gff-version> pragma.
These files must have 9 columns.
=item UCSC tables
These include file extensions F<.refFlat>, F<.genePred>, and F<.ucsc>. In
some cases, a simple F<.txt> can also be recognized if the file matches the
expected file structure. Different formats are typically recognized by the
number of columns, and can include simple refFlat, gene prediction, extended
gene prediction, and known Gene tables. The Bin column may or may not be present.
=item Peak files
These include file extensions F<.narrowPeak> and F<.broadPeak>.
These are special "BED6+4" file formats.
=item CDT
This includes the file extension F<.cdt>.
Cluster data files used with Cluster 3.0 and Treeview.
=item SGR
A rarely used file format with three columns: chromosome, position, and
score. The file extension is F<.sgr>.
=item TEXT
Almost any tab-delimited text file with a F<.txt> or F<.tsv> extension
can be loaded.
=item Compressed files
File extensions F<.gz> and F<.bz2> are recognized as compressed files.
Compressed files are usually read through an external decompression
program. All of the above formats can be loaded as compressed files.
=back
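As a rough illustration of extension-based recognition, the following Python sketch maps the extensions listed above to their format family, peeling off a compression suffix first. It is not the module's implementation: the function name C<guess_format>, the lookup tables, and the returned labels are all assumptions made for this example, and it ignores the column-count and C<gff-version> checks described above.

```python
import os

# Extension-to-format table transcribed from the list above.
# The labels are simply the list item names; they are NOT identifiers
# used by Bio::ToolBox itself.
EXTENSION_FORMATS = {
    ".bed": "BED", ".bedgraph": "BED", ".bdg": "BED",
    ".gff": "GFF", ".gff3": "GFF", ".gtf": "GFF",
    ".refflat": "UCSC", ".genepred": "UCSC", ".ucsc": "UCSC",
    ".narrowpeak": "Peak", ".broadpeak": "Peak",
    ".cdt": "CDT",
    ".sgr": "SGR",
    ".txt": "TEXT", ".tsv": "TEXT",
}
COMPRESSION_EXTENSIONS = {".gz": "gzip", ".bz2": "bzip2"}


def guess_format(path):
    """Return (format_family, compression) guessed from the file name only.

    Either element is None when the extension is not in the tables above.
    """
    name = os.path.basename(path).lower()
    compression = None
    root, ext = os.path.splitext(name)
    if ext in COMPRESSION_EXTENSIONS:
        # Strip the compression suffix, then look at the inner extension.
        compression = COMPRESSION_EXTENSIONS[ext]
        root, ext = os.path.splitext(root)
    return EXTENSION_FORMATS.get(ext), compression
```

A file with an unlisted extension yields C<None> for the format, which mirrors the warning above that recognition is not guaranteed for alternate extensions.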
=head1 DEFAULT BIO::TOOLBOX DATA TEXT FILE FORMAT
When not writing to a defined format, e.g. BED or GFF, a L<Bio::ToolBox>
Data structure is written as a simple tab-delimited text file, with the
first line being the column header names. Such files are easily parsed
by other programs.
If additional metadata is included in the Data object, then it is
written as comment lines, each prefixed by "# ", before the table. Metadata
can describe the data within the table with regard to its type, source,
methodology, history, and processing. The metadata is designed to be read
by both humans and computers. Opening a file without this metadata
will result in basic default metadata being assigned to each column.
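The layout described above (leading comment lines, then a header line, then tab-delimited rows) can be read by other programs with very little code. The Python sketch below is one minimal way to do so, not the parser used by L<Bio::ToolBox>; the function name C<parse_table> is hypothetical, and the exact text of the sample metadata lines is an assumption made for illustration.

```python
def parse_table(text):
    """Split file content into (metadata_lines, header, rows).

    Lines starting with '#' before the header are treated as metadata;
    the first non-comment line is the column header; the rest are rows.
    """
    metadata, header, rows = [], None, []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if header is None and line.startswith("#"):
            metadata.append(line[1:].strip())
        elif header is None:
            header = line.split("\t")
        else:
            rows.append(line.split("\t"))
    return metadata, header, rows


# Hypothetical file content; the metadata wording is illustrative only.
sample = (
    "# Program my_script\n"
    "# Feature gene\n"
    "Name\tScore\n"
    "geneA\t1.5\n"
    "geneB\t2.0\n"
)
metadata, header, rows = parse_table(sample)
```

Values are left as strings here; a real reader would decide per column whether to convert them to numbers.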
Some common metadata lines that are specifically recognized are listed below.
=over 4
=item Feature
The Feature describes the types of features represented on each row in the
data table. These can include gene, transcript, genome, etc.
=item Database
The name of the database used in generation of the feature table. This
is often also the database used in collecting the data, unless the dataset
metadata specifies otherwise.
=item Program
The name of the program generating the data table and file. It usually
includes the whole path of the executable.
=item Column
The next header lines include column specific metadata. Each column
will have a separate header line, specified initially by the word
'Column', followed by the column number (1-based).
Following this is a series of 'key=value' pairs separated by ';'.
Spaces are generally not allowed. The characters '=' and ';' are not
allowed within keys or values, as they would interfere with the parsing.
The metadata