GO-TermFinder
lib/GO/AnnotationProvider/AnnotationParser.pm
       10       0,n       Alias(es) of the annotated product
       11       1         type of annotated entity (one of gene, transcript, protein)
       12       1,2       taxonomic id of the organism encoding and/or using the product
       13       1         Date of annotation YYYYMMDD
       14       1         Assigned_by : The database which made the annotation

Columns are separated by tabs.  For those entries with a cardinality
greater than 1, multiple entries are pipe (|) delimited.
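
As a minimal sketch of the format described above (not this module's actual parsing code), a line can be split on tabs and the alias column split on pipes. The helper name split_annotation_line and the sample line are hypothetical and made up for illustration; the alias column is taken to be column 10, as in the table above.

```perl
use strict;
use warnings;

# Hypothetical helper: split one annotation line into its columns, and
# split the alias column (index 10 in the table above) on pipes.
sub split_annotation_line {
    my ($line) = @_;

    chomp $line;

    # A limit of -1 keeps trailing empty columns
    my @columns = split /\t/, $line, -1;

    # The alias column has cardinality 0,n, so it may be empty
    my $aliasField = defined $columns[10] ? $columns[10] : '';
    my @aliases    = split /\|/, $aliasField;

    return (\@columns, \@aliases);
}

# Made-up sample line, for illustration only
my $line = join("\t", 'SGD', 'S000000001', 'ABC1', '', 'GO:0000001',
                'REF:1', 'IDA', '', 'P', 'Example product',
                'YAL001C|ABC1-alt', 'gene', 'taxon:4932', '20030101', 'SGD');

my ($columns, $aliases) = split_annotation_line($line);

print scalar(@{$columns}), " columns; aliases: @{$aliases}\n";
```

The pipe is escaped in the pattern because an unescaped | means alternation in a Perl regular expression.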

Further details can be found at:

http://www.geneontology.org/doc/GO.annotation.html#file

The following assumptions about the file are made (and should be true):

    1.  All aliases appear for all entries of a given annotated product
    2.  The database identifiers are unique, in that two different
        entities cannot have the same database id.
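
The second assumption could be checked along the following lines. This is only a sketch: the function name find_conflicting_ids is hypothetical, and it assumes that two rows with the same database id but different standard names indicate two different entities, which is a simplification not stated in the source.

```perl
use strict;
use warnings;

# Hypothetical check: return the database ids that were seen with
# more than one standard name.  Each row is [ databaseId, standardName ].
sub find_conflicting_ids {
    my (@rows) = @_;

    my %namesForId;

    foreach my $row (@rows) {
        my ($id, $name) = @{$row};
        $namesForId{$id}{$name} = 1;
    }

    # Keep only the ids that map to two or more distinct names
    my @conflicts = grep { scalar(keys %{ $namesForId{$_} }) > 1 }
                    sort keys %namesForId;

    return @conflicts;
}

# Made-up rows, for illustration only
my @conflicts = find_conflicting_ids(['ID1', 'ABC1'],
                                     ['ID1', 'ABC1'],
                                     ['ID2', 'XYZ1'],
                                     ['ID2', 'XYZ2']);

print "Conflicting ids: @conflicts\n";
```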

=head1 TODO

Also see the TODO list in the parent, GO::AnnotationProvider.

 1.  Add in methods that will allow retrieval of evidence codes with
     the annotations for a particular entity.

 2.  Add in methods that return all the annotated entities for a
     particular GOID.

 3.  Add in the ability to request only annotations either including
     or excluding particular evidence codes.  Such evidence codes
     could be provided as an anonymous array as the value of a named
     argument.

 4.  Same as number 3, except allow the retrieval of annotated
     entities for a particular GOID, based on inclusion or exclusion
     of certain evidence codes.

 These first four items will require a reworking of how data are
 stored on the backend, and thus the parsing code itself, though it
 should not affect any of the already existing API.

 5.  Rather than 'use'ing Storable, 'require' it only at the point of
     use, so that AnnotationParser can be used in the absence of
     Storable, just without those functions that need it.

 6.  Extend the ValidateFile class method to check that an entity is
     never annotated to the same node twice with the same evidence
     and the same reference.

 7.  An additional checker that uses an AnnotationProvider in
     conjunction with an OntologyProvider would be useful, to check
     that the annotations themselves are valid, for example that no
     entity is annotated to the 'unknown' node in a particular aspect
     and also to another node within that same aspect.  Can
     annotations be redundant?  That is, if an entity is annotated to
     a node and to an ancestor of that node, is that annotation
     redundant?  Does it depend on the evidence codes and references?
     Or are such annotations reinforcing?  These things are useful to
     consider when formulating the confidence which can be attributed
     to an annotation.
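
Item 5 above could be approached roughly as follows. This is a sketch only, not code from this module: the function name serialize_to_disk is hypothetical, and the File::Temp usage is there just to make the example runnable.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write a data structure to disk, loading
# Storable only when it is actually needed.
sub serialize_to_disk {
    my ($data, $filename) = @_;

    # 'require' runs at run time, so a missing Storable only matters
    # if this function is called
    eval { require Storable; 1 }
        or die "Storable is required to serialize: $@";

    # Fully qualified call, since nothing was imported
    Storable::nstore($data, $filename);

    return;
}

# Round trip through a temporary file
my ($fh, $filename) = tempfile(UNLINK => 1);
close $fh;

serialize_to_disk({ example => 1 }, $filename);

my $copy = Storable::retrieve($filename);

print "round trip ok\n" if $copy->{example} == 1;
```

Because 'require' does not import anything, the Storable functions have to be called by their fully qualified names.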

=cut

use strict;
use warnings;
use diagnostics;

use Storable qw (nstore);
use IO::File;

use vars qw (@ISA $PACKAGE $VERSION);

use GO::AnnotationProvider;
@ISA = qw (GO::AnnotationProvider);

$PACKAGE = "GO::AnnotationProvider::AnnotationParser";
$VERSION = "0.15";

# CLASS Attributes
#
# These should be considered as constants, and are initialized here

my $DEBUG = 0;

# constants for instance attribute names


my $kDatabaseName           = $PACKAGE.'::__databaseName';           # stores the name of the annotating database
my $kFileName               = $PACKAGE.'::__fileName';               # stores the name of the file used to instantiate the object
my $kNameToIdMapInsensitive = $PACKAGE.'::__nameToIdMapInsensitive'; # stores a case insensitive map of all unambiguous names for a gene to the database id
my $kNameToIdMapSensitive   = $PACKAGE.'::__nameToIdMapSensitive';   # stores a case sensitive map of all names where a particular casing is unambiguous for a gene to the database id
my $kAmbiguousNames         = $PACKAGE.'::__ambiguousNames';         # stores the database id's for all ambiguous names
my $kIdToStandardName       = $PACKAGE.'::__idToStandardName';       # stores a map of database id's to standard names of all entities
my $kStandardNameToId       = $PACKAGE.'::__StandardNameToId';       # stores a map of standard names to their database id's
my $kUcIdToId               = $PACKAGE.'::__ucIdToId';               # stores a map of uppercased databaseIds to the databaseId
my $kUcStdNameToStdName     = $PACKAGE.'::__ucStdNameToStdName';     # stores a map of uppercased standard names to the standard name
my $kNameToCount            = $PACKAGE.'::__nameToCount';            # stores a case sensitive map of the number of times a name has been seen
my $kGoids                  = $PACKAGE.'::__goids';                  # stores all the goid annotations
my $kNumAnnotatedGenes      = $PACKAGE.'::__numAnnotatedGenes';      # stores number of genes with annotations, per aspect

my $kAmbiguousNamesSensitive = $PACKAGE.'::__ambiguousNamesSensitive'; # names (case sensitive) that are ambiguous

my $kTotalNumAnnotatedGenes = $PACKAGE.'::__totalNumAnnotatedGenes'; # total number of annotated genes

# constants to describe what is in which column in the annotation file

my $kDatabaseNameColumn = 0;
my $kDatabaseIdColumn   = 1;
my $kStandardNameColumn = 2;
my $kNotColumn          = 3;
my $kGoidColumn         = 4;
my $kReferenceColumn    = 5;
my $kEvidenceColumn     = 6;
my $kWithColumn         = 7;
my $kAspectColumn       = 8;
my $kNameColumn         = 9;
my $kAliasesColumn      = 10;
my $kEntityTypeColumn   = 11;
my $kTaxonomicIDColumn  = 12;
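
As a rough illustration of how column constants like these can be used (this is not the module's own parsing code), fields of a split line can be looked up by constant name rather than by a bare number. The sample line and the hash keys below are made up for the example.

```perl
use strict;
use warnings;

# A few of the column constants, repeated here so the sketch is
# self-contained
my $kDatabaseIdColumn   = 1;
my $kStandardNameColumn = 2;
my $kGoidColumn         = 4;
my $kAspectColumn       = 8;

# Made-up annotation line, for illustration only
my $line = join("\t", 'SGD', 'S000000001', 'ABC1', '', 'GO:0000001',
                'REF:1', 'IDA', '', 'P', 'Example product', '', 'gene',
                'taxon:4932');

my @columns = split /\t/, $line, -1;

# Index by constant name rather than by a bare number
my %annotation = (
    databaseId   => $columns[$kDatabaseIdColumn],
    standardName => $columns[$kStandardNameColumn],
    goid         => $columns[$kGoidColumn],
    aspect       => $columns[$kAspectColumn],
);

print "$annotation{standardName} ($annotation{databaseId}) => ",
      "$annotation{goid}, aspect $annotation{aspect}\n";
```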