Bio-ToolBox


lib/Bio/ToolBox/Data/file.pm


=head1 DESCRIPTION

These are methods for providing file IO for the L<Bio::ToolBox::Data> 
data structure. These file IO methods work with any generic tab-delimited 
text file of rows and columns. They also properly handle the comment lines, 
metadata, and column-specific metadata custom to L<Bio::ToolBox> programs.
Special file formats used in bioinformatics, for example GFF and BED 
files, are automatically recognized by their file extension, and the 
appropriate metadata is added. 

Files opened using these subroutines are stored in a specific complex data 
structure described below. This structure provides access to the data and 
also records metadata about each column (dataset) and the file in general. 
This metadata helps preserve a "history" of the dataset: where it came from, 
how it was collected, and how it was processed.

Additional subroutines are also present for general processing and output of
this data structure.

The data file format is described below, followed by a description of the 
data structure.

=head1 RECOGNIZED FILE FORMATS

L<Bio::ToolBox> will recognize a number of standard bioinformatic file 
formats, almost all of which are recognized by their extension. Recognition 
is NOT guaranteed if an alternate file extension is used.

These formats include

=over 4

=item BED 

These include file extensions F<.bed>, F<.bedgraph>, and F<.bdg>.
Bed files must have 3-12 columns. BedGraph files must have 4 columns.

=item GFF

These include file extensions F<.gff>, F<.gff3>, and F<.gtf>. 
The specific format may also be recognized by the C<gff-version> pragma. 
These files must have 9 columns.

=item UCSC tables

These include file extensions F<.refFlat>, F<.genePred>, and F<.ucsc>. In 
some cases, a simple F<.txt> can also be recognized if the file matches the 
expected file structure. Different formats are typically recognized by the 
number of columns, and can include simple refFlat, gene prediction, extended 
gene prediction, and knownGene tables. The Bin column may or may not be present.

=item Peak files

These include file extensions F<.narrowPeak> and F<.broadPeak>. 
These are special "BED6+4" file formats. 

=item CDT

This includes the file extension F<.cdt>. 
These are cluster data files used with Cluster 3.0 and Treeview.

=item SGR

A rare file format with columns of chromosome, position, and score. The 
file extension is F<.sgr>.

=item TEXT

Almost any tab-delimited text file with a F<.txt> or F<.tsv> extension
can be loaded.

=item Compressed files

The file extensions F<.gz> and F<.bz2> are recognized as compressed files.
Compressed files are usually read through an external decompression 
program. All of the above formats can be loaded as compressed files.

=back
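
As a rough illustration of extension-based recognition, the following sketch 
maps the extensions listed above to a format label. It is not the module's 
actual detection code: the C<%FORMAT_FOR> table, the C<guess_format> function, 
and the format labels are hypothetical names chosen for this example, and the 
column-count checks and C<gff-version> pragma mentioned above are ignored.

```perl
use strict;
use warnings;

# Hypothetical lookup table built only from the extensions listed above.
my %FORMAT_FOR = (
    bed        => 'BED',  bedgraph   => 'BED',  bdg  => 'BED',
    gff        => 'GFF',  gff3       => 'GFF',  gtf  => 'GFF',
    refflat    => 'UCSC', genepred   => 'UCSC', ucsc => 'UCSC',
    narrowpeak => 'Peak', broadpeak  => 'Peak',
    cdt        => 'CDT',
    sgr        => 'SGR',
    txt        => 'TEXT', tsv        => 'TEXT',
);

# Return (format, is_compressed) for a file name.
# The format is undef when the extension is not in the table.
sub guess_format {
    my ($filename) = @_;

    # Strip a trailing compression extension, remembering whether one was present.
    my $compressed = ( $filename =~ s/\.(?:gz|bz2)$//i ) ? 1 : 0;

    # Take the last remaining extension and look it up case-insensitively.
    my ($ext) = $filename =~ /\.([^.\/]+)$/;
    my $format = defined $ext ? $FORMAT_FOR{ lc $ext } : undef;

    return ( $format, $compressed );
}

my ( $format, $compressed ) = guess_format('peaks.narrowPeak.gz');
printf "%s (compressed: %d)\n", $format // 'unknown', $compressed;
```

A lookup table keeps the mapping easy to audit against the list above. A file 
with an unlisted extension yields an undefined format here, which mirrors the 
caveat that recognition is not guaranteed for alternate extensions.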

=head1 DEFAULT BIO::TOOLBOX DATA TEXT FILE FORMAT

When not writing to a defined format, e.g. BED or GFF, a L<Bio::ToolBox> 
Data structure is written as a simple tab-delimited text file, with the 
first line being the column header names. Such files are easily parsed 
by other programs. 

If additional metadata is included in the Data object, then it is 
written as comment lines, each prefixed by "# ", before the table. Metadata 
can describe the data within the table with regard to its type, source, 
methodology, history, and processing. The metadata is designed to be read 
by both humans and computers. Opening files without this metadata 
will result in basic default metadata assigned to each column. 
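
The layout described above (comment lines, then a header line, then data 
rows) can be sketched with a minimal reader. This is an illustration only, 
not the module's parser: C<parse_table> is a hypothetical helper, the sample 
content is invented, and the exact syntax of the metadata lines in the sample 
is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: split file text into comment lines, column names, and rows.
sub parse_table {
    my ($text) = @_;
    my ( @comments, @columns, @rows );
    for my $line ( split /\r?\n/, $text ) {
        next unless length $line;
        if ( $line =~ /^#\s?(.*)$/ ) {
            # Comment or metadata line; store it without the "# " prefix.
            push @comments, $1;
        }
        elsif ( !@columns ) {
            # The first non-comment line holds the column header names.
            @columns = split /\t/, $line;
        }
        else {
            # Data row; the -1 limit keeps empty trailing fields.
            push @rows, [ split /\t/, $line, -1 ];
        }
    }
    return { comments => \@comments, columns => \@columns, rows => \@rows };
}

# Invented sample content, for illustration only.
my $sample = join "\n",
    '# Program example_program',
    '# Feature gene',
    "Name\tScore",
    "geneA\t1.5",
    "geneB\t2.0",
    '';

my $table = parse_table($sample);
printf "%d comment lines, %d columns, %d rows\n",
    scalar @{ $table->{comments} },
    scalar @{ $table->{columns} },
    scalar @{ $table->{rows} };
```

Because the metadata lives entirely in comment lines, a program that simply 
skips lines beginning with "#" can still read the table itself.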

Some common metadata lines that are specifically recognized are listed below.

=over 4

=item Feature

The Feature describes the types of features represented on each row in the 
data table. These can include gene, transcript, genome, etc.

=item Database

The name of the database used in generation of the feature table. This 
is often also the database used in collecting the data, unless the dataset
metadata specifies otherwise.

=item Program

The name of the program generating the data table and file. It usually 
includes the whole path of the executable.

=item Column

The next header lines include column-specific metadata. Each column 
has a separate header line, beginning with the word 'Column' and 
followed by the column number (1-based). 
Following this is a series of 'key=value' pairs separated by ';'. 
Spaces are generally not allowed. The characters '=' and ';' are not 
allowed within a key or value, as they would interfere with the parsing. The metadata 


