ALBD


INSTALL

     perl Makefile.PL

     make

     make test

     make install

DESCRIPTION
    ALBD provides a system for performing ABC co-occurrence literature
    based discovery (LBD) using a variety of options and
    association-based ranking methods.

REQUIREMENTS
    ALBD requires the following software packages and data:

  Programming Languages
         Perl (version 5.16.3 or better)

  CPAN Modules
         UMLS::Association
         UMLS::Interface

  Required for some Methods:
         MATLAB

INSTALL

    If you have superuser access, or have configured MCPAN for local
    installs, you can install each of these modules via:

         perl -MCPAN -e shell
         > install <packageName>
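
    For example, to install both prerequisite modules from the CPAN
    shell:

         perl -MCPAN -e shell
         > install UMLS::Interface
         > install UMLS::Association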

   UMLS::Interface
    The core UMLS package provides a dictionary from concept unique
    identifiers (CUIs) to their meanings in the Unified Medical Language
    System. Refer to the UMLS::Interface documentation for how to
    install the UMLS database on your system.

    The package is freely available at:
    <http://search.cpan.org/dist/UMLS-Interface/>

   UMLS::Association
    Used to calculate the association scores used by most of the ranking
    methods.

    The package is freely available at:

        <http://search.cpan.org/dist/UMLS-Association/>

INSTALL

    For details, see:

        <http://search.cpan.org/dist/Test-Simple/lib/Test/Builder.pm#EXIT_CODES>

  Stage 4: Create a co-occurrence matrix
    ALBD requires that a co-occurrence matrix of CUIs has been created.
    This matrix is stored as a flat file in a sparse matrix format, such
    that each line contains three tab separated values: cui_1, cui_2,
    and n_11 (the count of their co-occurrences). Any matrix with that
    format is acceptable; however, the intended method of matrix
    generation is to convert a UMLS::Association database into a flat
    matrix file. These databases are created using the CUICollector tool
    of UMLS::Association, run over the MetaMapped Medline baseline. With
    that database, run utils/datasetCreator/fromMySQL/dbToTab.pl to
    convert the desired database into a matrix file. Note that the code
    in dbToTab.pl is just a sample mysql command; if the input database
    was created by another method, a different command may be needed. As
    long as the resulting co-occurrence matrix is in the correct format,
    LBD may be run on it. This allows flexibility in where co-occurrence
    information comes from.

    Note: utils/datasetCreator/fromMySQL/removeQuotes.pl may need to be
    run on the resulting tab separated file, if quotes are included in
    the resulting co-occurrence matrix file.
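
    For illustration, a co-occurrence matrix file in this format looks
    like the following (the CUIs and counts here are invented for the
    example):

        C0000005	C0001254	12
        C0000005	C0002776	3
        C0001254	C0002776	27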

  Stage 5: Set up Dummy UMLS::Association Database
    UMLS::Association requires that a database in the correct format can
    be connected to. Although this database is not required for ALBD
    (since co-occurrence data is loaded from a co-occurrence matrix), it
    is required to run UMLS::Association. If you ran UMLS::Association
    to generate a co-occurrence matrix, you should be fine. Otherwise
    you will need to create a dummy database that it can connect to.
    This can be done in a few steps:

    1) Open MySQL by typing mysql at the terminal.

    2) Create the default database in the correct format by typing:

        CREATE DATABASE cuicounts;
        USE cuicounts;
        CREATE TABLE N_11(cui_1 CHAR(10), cui_2 CHAR(10), n_11 BIGINT(20));
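
    With the dummy database in place, UMLS::Association can be pointed
    at it through the association configuration file described in
    config/association (pairing the <database> option with the cuicounts
    name from the CREATE DATABASE statement above is our illustration):

        <database>cuicounts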

CONTACT US
    If you have any trouble installing or using ALBD, please contact us
    directly:

        Sam Henry: henryst at vcu.edu

        Bridget McInnes: btmcinnes at vcu.edu

MANIFEST

samples/sampleGoldMatrix
samples/timeSliceCuiList
samples/timeSlicingConfig
samples/configFileSamples/UMLSAssociationConfig
samples/configFileSamples/UMLSInterfaceConfig
samples/configFileSamples/UMLSInterfaceInternalConfig
t/test.t
t/goldSampleOutput
t/goldSampleTimeSliceOutput
utils/runDiscovery.pl
utils/datasetCreator/applyMaxThreshold.pl
utils/datasetCreator/applyMinThreshold.pl
utils/datasetCreator/applySemanticFilter.pl
utils/datasetCreator/combineCooccurrenceMatrices.pl
utils/datasetCreator/makeOrderNotMatter.pl
utils/datasetCreator/removeCUIPair.pl
utils/datasetCreator/removeExplicit.pl
utils/datasetCreator/testMatrixEquality.pl
utils/datasetCreator/dataStats/getCUICooccurrences.pl
utils/datasetCreator/dataStats/getMatrixStats.pl
utils/datasetCreator/dataStats/metaAnalysis.pl
utils/datasetCreator/fromMySQL/dbToTab.pl
utils/datasetCreator/fromMySQL/removeQuotes.pl
utils/datasetCreator/squaring/convertForSquaring_MATLAB.pl
utils/datasetCreator/squaring/squareMatrix.m
utils/datasetCreator/squaring/squareMatrix_partial.m
utils/datasetCreator/squaring/squareMatrix_perl.pl
META.yml                                 Module YAML meta-data (added by MakeMaker)
META.json                                Module JSON meta-data (added by MakeMaker)

README

NAME
    ALBD README

  SYNOPSIS
        This package consists of Perl modules along with supporting Perl
        programs that perform Literature Based Discovery (LBD). The core
        data from which LBD is performed are co-occurrence matrices
        generated from UMLS::Association. ALBD is based on the ABC
        co-occurrence model. Many options can be specified, and many
        ranking methods are available, including novel ranking methods
        that use association measures as well as frequency based ranking
        methods. See samples/lbd for more info. ALBD can perform open
        and closed LBD as well as time slicing evaluation.

        ALBD requires UMLS::Association both to compute the co-occurrence
        database from which the co-occurrence matrix is derived and to
        rank the generated C terms.

        UMLS::Association requires the UMLS::Interface module to access 
        the Unified Medical Language System (UMLS) for semantic type filtering
        and to determine if CUIs are valid.

        The following sections describe the organization of this software
        package and how to use it. A few typical examples are given to help
        clearly understand the usage of the modules and the supporting
        utilities.

README

        details of these can be found in the ExtUtils::MakeMaker
        documentation. However, it is highly recommended that you not
        change other parameters unless you know what you're doing.
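
        For example, a local (non-root) install can be configured with
        standard ExtUtils::MakeMaker options (the path below is
        illustrative):

            perl Makefile.PL PREFIX=/home/user/perl5
            make
            make test
            make install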

  CO-OCCURRENCE MATRIX SETUP
    ALBD requires that a co-occurrence matrix of CUIs has been created.
    This matrix is stored as a flat file in a sparse matrix format, such
    that each line contains three tab separated values: cui_1, cui_2,
    and n_11 (the count of their co-occurrences). Any matrix with that
    format is acceptable; however, the intended method of matrix
    generation is to convert a UMLS::Association database into a flat
    matrix file. These databases are created using the CUICollector tool
    of UMLS::Association, run over the MetaMapped Medline baseline. With
    that database, run utils/datasetCreator/fromMySQL/dbToTab.pl to
    convert the desired database into a matrix file. Note that the code
    in dbToTab.pl is just a sample mysql command; if the input database
    was created by another method, a different command may be needed. As
    long as the resulting co-occurrence matrix is in the correct format,
    LBD may be run on it. This allows flexibility in where co-occurrence
    information comes from.

    Note: utils/datasetCreator/fromMySQL/removeQuotes.pl may need to be
    run on the resulting tab separated file, if quotes are included in
    the resulting co-occurrence matrix file.

  Set Up Dummy UMLS::Association Database
    UMLS::Association requires that a database in the correct format can
    be connected to. Although this database is not required for ALBD
    (since co-occurrence data is loaded from a co-occurrence matrix), it
    is required to run UMLS::Association. If you ran UMLS::Association
    to generate a co-occurrence matrix, you should be fine. Otherwise
    you will need to create a dummy database that it can connect to.
    This can be done in a few steps:

    1) Open MySQL by typing mysql at the terminal.

    2) Create the default database in the correct format by typing:

        CREATE DATABASE cuicounts;
        USE cuicounts;
        CREATE TABLE N_11(cui_1 CHAR(10), cui_2 CHAR(10), n_11 BIGINT(20));

  INITIALIZING THE MODULE
    To create an instance of the ALBD object, using default values for
    all configuration options:

        %options = ();
        $options{'lbdConfig'} = 'configFile';
        my $lbd = LiteratureBasedDiscovery->new(\%options);
        $lbd->performLBD();

    The following configuration options are also provided:

README

    present in the '/lib' directory tree of the package.

    The package contains a utils/ directory that contains Perl utility
    programs. These utilities use the modules or provide some supporting
    functionality.

    runDiscovery.pl -- runs LBD using the parameters specified in the input
    file, and outputs to an output file.
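
    For example, samples/runSample.pl invokes it from within the
    samples/ directory as:

        perl ../utils/runDiscovery.pl lbdConfig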

    The package contains a large selection of functions to manipulate
    CUI co-occurrence matrices in the utils/datasetCreator/ directory.
    These are short scripts that generally require modifying the code at
    the top to set user input parameters specific to each run (see the
    sketch after this list). These scripts include:

    applyMaxThreshold.pl -- applies a maximum co-occurrence threshold to the
    co-occurrence matrix

    applyMinThreshold.pl -- applies a minimum co-occurrence threshold to the
    co-occurrence matrix

    applySemanticFilter.pl -- applies a semantic type and/or group filter to
    the co-occurrence matrix

README

    removeCUIPair.pl -- removes all occurrences of the specified CUI pair
    from the co-occurrence matrix

    removeExplicit.pl -- removes any keys that occur in an explicit
    co-occurrence matrix from another co-occurrence matrix (typically the
    squared explicit co-occurrence matrix itself, which generates a
    prediction matrix, or the post cutoff matrix used in time slicing to
    generate a gold standard file)

    testMatrixEquality.pl -- checks to see if two co-occurrence matrix files
    contain the same data
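
    As noted above, the datasetCreator scripts are configured by editing
    the variables at the top of each file. A hypothetical illustration
    of that pattern (the variable names here are illustrative, not
    necessarily those used in any particular script):

        # user input section at the top of a datasetCreator script
        my $matrixFileName    = 'explicitMatrix';      # input co-occurrence matrix
        my $matrixOutFileName = 'explicitMatrix_out';  # output matrix file
        my $threshold         = 5;                     # script-specific parameter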

    Also included are several subfolders with more specific purposes.
    Within the dataStats subfolder are scripts to collect various
    statistics about the co-occurrence matrices used in LBD. These
    scripts include:

    getCUICooccurrences.pl -- a data statistics file that gets the number of
    co-occurrences, and number of unique co-occurrences for every CUI in the
    dataset

    getMatrixStats.pl -- determines the number of rows, columns, and entries
    of a co-occurrence matrix

    metaAnalysis.pl -- determines the number of rows, columns, vocabulary
    size, and total number of co-occurrences of a co-occurrence file, or set
    of co-occurrence files

    There is another folder containing scripts to square co-occurrence
    matrices. Squaring an explicit (A to B) co-occurrence matrix results
    in an implicit (A to C) co-occurrence matrix.

README

    squareMatrix.m -- MATLAB script to square a matrix while holding
    everything in RAM. Faster, but requires more RAM.

    squareMatrix_partial.m -- MATLAB script to square a matrix in
    chunks. Only loads parts of the matrix into RAM at a time, which
    makes squaring any size matrix possible, but can take impractical
    amounts of time.

    squareMatrix_perl.pl -- squares a matrix in Perl, but requires the
    most RAM of any squaring method. The easiest method to use, but only
    practical for small datasets.
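
    A minimal sketch of what sparse squaring computes, using the
    hash-of-hashes representation described in lib/ALBD.pm (illustrative
    only, not the actual squareMatrix_perl.pl code):

        # implicit{a}{c} += explicit{a}{b} * explicit{b}{c} for each
        # shared middle (B) term
        sub squareSparse {
            my $m = shift;                  # ref to sparse hash-of-hashes
            my %squared = ();
            foreach my $a (keys %{$m}) {
                foreach my $b (keys %{$m->{$a}}) {
                    next unless exists $m->{$b};    # B term has no row
                    foreach my $c (keys %{$m->{$b}}) {
                        $squared{$a}{$c} += $m->{$a}{$b} * $m->{$b}{$c};
                    }
                }
            }
            return \%squared;
        }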

    The fromMySQL folder contains scripts that convert UMLS::Association
    databases to ALBD co-occurrence matrices. The files contained are:

    dbToTab.pl -- converts a UMLS::Association co-occurrence database to a
    sparse format co-occurrence matrix used for ALBD

    removeQuotes.pl -- removes quotes from lines in the co-occurrence matrix
    file after converting from a database (sometimes needed)

  REFERENCING
        If you write a paper that has used ALBD in some way, we'd
        certainly be grateful if you sent us a copy.

  CONTACT US
    If you have any trouble installing or using ALBD, please contact us
    directly:

        Sam Henry: henryst at vcu.edu

config/association

##############################################################################
#                Configuration File for UMLS::Association
##############################################################################
# All the options in this file are put into an options hash and 
# passed directly to UMLS::Association for initialization. Options hash keys 
# are in <>'s, and values follow directly after with no space. As an example, 
# the line "<database>bigrams" will pass the 'database' parameter with a 
# value of 'bigrams' to the UMLS::Association options hash for its 
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')

<database>CUI_Bigram
<hostname>192.168.24.89
<username>henryst
<password>OhFaht3eique
<socket>/var/run/mysqld.sock
<t>

config/interface

#############################################################################
#                Configuration File for UMLS::Interface
############################################################################
# All the options in this file are put into an options hash and 
# passed directly to UMLS::Interface for initialization. Options hash keys 
# are in <>'s, and values follow directly after with no space. As an example, 
# the line "<database>umls" will pass the 'database' parameter with a value 
# of 'umls' to the UMLS::Interface options hash for its initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')

<t>
<config>interfaceConfig

<hostname>192.168.24.89
<username>henryst

lib/ALBD.pm

    use ALBD;
    %options = ();
    $options{'lbdConfig'} = 'configFile';
    my $lbd = LiteratureBasedDiscovery->new(\%options);
    $lbd->performLBD();

=head1 ABSTRACT

      This package consists of Perl modules along with supporting Perl
      programs that perform Literature Based Discovery (LBD). The core
      data from which LBD is performed are co-occurrence matrices
      generated from UMLS::Association. ALBD is based on the ABC
      co-occurrence model. Many options can be specified, and many
      ranking methods are available, including novel ranking methods
      that use association measures as well as frequency based ranking
      methods. See samples/lbd for more info. ALBD can perform open and
      closed LBD as well as time slicing evaluation.

=head1 INSTALL

To install the module, run the following magic commands:

lib/ALBD.pm

#
# LiteratureBasedDiscovery.pm - provides functionality to perform LBD
#
# Matrix Representation:
# LBD is performed using Matrix and Vector operations. The major components 
# are an explicit knowledge matrix, which is squared to find the implicit 
# knowledge matrix.
#
# The explicit knowledge is read from the UMLS::Association N11 matrix. This 
# matrix contains the co-occurrence counts for all CUI pairs. The 
# UMLS::Association database is completely independent from this 
# implementation, so any dataset, window size, or anything else may be used. 
# Data is read in as a sparse matrix using the Discovery::tableToSparseMatrix 
# function. This returns the primary data structures and variables used 
# throughout LBD.
#
# Matrix representation: 
# This module uses a matrix representation for LBD. All operations are 
# performed either as matrix or vector operations. The core data structure
# are the co-occurrence matrices explicitMatrix and implicitMatrix. These
# matrices have dimensions vocabulary size by vocabulary size. Each row 
# corresponds to all co-occurrences for a single CUI, and each column of 
# that row corresponds to a co-occurrence with a single CUI. Since the 
# matrices tend to be sparse, they are stored as hashes of hashes, where 
# the first key is for a row, and the second key is for a column. The keys 
# of each hash are the indices within the matrix. The hash values are the 
# number of co-occurrences for that CUI pair (e.g. 
# ${$explicit{C0000000}}{C1111111} = 10 means that CUI C0000000 and 
# C1111111 co-occurred 10 times).
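#
# For example, a minimal illustrative sketch of storing and reading a 
# count in this structure (the CUIs and count here are invented):
#   my %explicit = ();
#   ${$explicit{C0000000}}{C1111111} = 10;  # store the pair's count
#   my $n11 = ${$explicit{C0000000}}{C1111111};  # read it back (10)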
#
# Now with an understanding of the data structures, below is a brief 
# description of each: 
#
# startingMatrix <- A matrix containing the explicit matrix rows for all of the
#                   start terms. This makes it easy to have multiple start terms
#                   and using this matrix as opposed to the entire explicit 
#                   matrix drastically improves performance.
# explicitMatrix <- A matrix containing explicit connections (known connections)
#                   for every CUI in the dataset.            
# implicitMatrix <- A matrix containing implicit connections (discovered 
#                   connections) for every CUI in the dataset


package ALBD;

use strict;
use warnings;

use LiteratureBasedDiscovery::Discovery;
use LiteratureBasedDiscovery::Evaluation;
use LiteratureBasedDiscovery::Rank;

lib/ALBD.pm

    
    #read in all options from the config file
    open IN, $configFileName or die("Error: Cannot open config file: $configFileName\n");
    my %optionsHash = ();
    my $firstChar;
    while (my $line = <IN>) {
	#check if it's a comment or blank line
	$firstChar = substr $line, 0, 1;
	
	if ($firstChar ne '#' && $line =~ /[^\s]+/) {
	    #line contains data, grab the key and value
	    $line =~ /<([^>]+)>([^\n]*)/;	  
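	    # e.g. '<database>cuicounts' captures $1 = 'database' and
	    #   $2 = 'cuicounts'; a bare flag like '<t>' captures an
	    #   empty string for $2 (stored as a default value of 1 below)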

	    #make sure the data was read in correctly
	    if (!$1) {
		print STDERR 
		    "Warning: Invalid line in $configFileName: $line\n";
	    }
	    else {
		#data was grabbed from the line, add to hash
		if ($2) {
		    #add key and value to the optionsHash
		    $optionsHash{$1} = $2;
		}
		else {
		    #add key and set default value to the optionsHash
		    $optionsHash{$1} = 1;
		}
	    }
	}

lib/LiteratureBasedDiscovery/Discovery.pm


package Discovery;
use strict;
use warnings;
use DBI;

######################################################################
#                        MySQL Notes
######################################################################
#TODO I think some of these notes should be elsewhere
# A Note about the database structure expected
#   Each LBD database is expected to have:
#   PreCutoff_N11
#   PostCutoff_N11
#   PreCutoff_Implicit
#
# Both PreCutoff_N11 and PostCutoff_N11 should
# be generated manually using CUI_Collector
# PreCutoff_Implicit is generated using the tableToSparseMatrix
# function here, which exports a sparse matrix. That matrix 
# can then be imported into matlab, squared, and reloaded into
# a mysql database. With these 3 tables LBD can be performed


######################################################################
#                          Description
######################################################################
# Discovery.pm - provides matrix operations on n11 counts from 
# UMLS::Association
#
#TODO I think some of these notes should be elsewhere
# 'B' term filtering may be applied by removing elements from the 

lib/LiteratureBasedDiscovery/Discovery.pm

		}
	    }
	}
    }
    return $implicitMatrixRef;
}


# loads a tab separated file as a sparse matrix (a hash of hashes)
#    each line of the file contains CUI1 <TAB> CUI2 <TAB> Count
# input:  the filename containing the data
# output: a hash ref to the sparse matrix (${$hash{$index1}}{$index2} = value)
sub fileToSparseMatrix {
    my $fileName = shift;

    open IN, $fileName or die ("unable to open file: $fileName\n");
    my %matrix = ();
    my ($cui1,$cui2,$val);
    while (my $line = <IN>) {
	chomp $line;
	($cui1,$cui2,$val) = split(/\t/,$line);

lib/LiteratureBasedDiscovery/Discovery.pm

	}
	$matrix{$cui1}{$cui2} = $val;
    }
    close IN;
    return \%matrix;
}

# outputs the matrix to the output file in sparse matrix format, which
# is a file containing rowKey\tcolKey\tvalue
# input:  $outFile - a string specifying the output file
#         $matrixRef - a ref to the sparse matrix containing the data
# output: nothing, but the matrix is output to file
sub outputMatrixToFile {
    my $outFile = shift;
    my $matrixRef = shift;
    
    #open the output file and output the matrix
    open OUT, ">$outFile" or die ("Error opening matrix output file: $outFile\n");
    my $rowRef;
    foreach my $rowKey (keys %{$matrixRef}) {
	$rowRef = ${$matrixRef}{$rowKey};

lib/LiteratureBasedDiscovery/Discovery.pm

#  retrieve a table from mysql and convert it to a sparse matrix (a hash of 
#     hashes)
#  input : $tableName <- the name of the table to output
#          $cuiFinder <- an instance of UMLS::Interface::CuiFinder
#  output: a hash ref to the sparse matrix (${$hash{$index1}}{$index2} = value)
sub tableToSparseMatrix {
    my $tableName = shift;
    my $cuiFinder = shift;

    # check tableName
    #TODO check that the table exists in the database
    # or die "Error: table does not exist: $tableName\n";

    #  set up database
    my $db = $cuiFinder->_getDB(); 
    
    # retrieve the table as a nested hash where keys are CUI1, 
    # then CUI2, value is N11
    my @keyFields = ('cui_1', 'cui_2');
    my $matrixRef = $db->selectall_hashref(
	"select * from $tableName", \@keyFields);

    # set values of the loaded table to n_11
    # ...default is hash of hash of hash

lib/LiteratureBasedDiscovery/Evaluation.pm

# ALBD::Evaluation.pm
#
# Provides functionality to evaluate LBD systems
# Key components are:
# Results Matrix <- all new knowledge generated by an LBD system (e.g.
#                   all proposed discoveries of a system from pre-cutoff
#                   data).
# Gold Standard Matrix <- the gold standard knowledge matrix (e.g. all
#                         knowledge present in the post-cutoff dataset
#                         that is not present in the pre-cutoff dataset).
#
# Copyright (c) 2017
#
# Sam Henry
# henryst at vcu.edu
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.

lib/LiteratureBasedDiscovery/Rank.pm

    seek IN, 0,0; #reset to the beginning of the implicit file

    #iterate over the lines of interest, and grab values
    my %np1 = ();
    my %n11 = ();
    my $n1p = 0;
    my $npp = 0;
    my $matchedCuiB = 0;
    my ($cuiA, $cuiB, $val);
    while (my $line = <IN>) {
	#grab data from the line
	($cuiA, $cuiB, $val) = split(/\t/,$line);

	#see if updates are necessary
	if (exists $aTerms{$cuiA} || exists $cTerms{$cuiB}) {

	    #update npp with the count read from this line
	    $npp += $val;
	    
	    #update np1
	    if (exists $cTerms{$cuiB}) {

lib/LiteratureBasedDiscovery/TimeSlicing.pm

	#see if this line contains a key that should be read in 
	if (exists $cuisToGrab{$cui1}) {

	    #add the value
	    if (!(defined $postCutoffMatrix{$cui1})) {
		my %newHash = ();
		$postCutoffMatrix{$cui1} = \%newHash;
	    }

	    #check to ensure that the column cui is in the 
	    #  vocabulary of the pre-cutoff dataset.
	    #  it is impossible to make predictions of words that
	    #  don't already exist
	    #NOTE: this assumes $explicitMatrixRef is a square 
	    #   matrix (so unordered)
	    if (exists ${$explicitMatrixRef}{$cui2}) {
		${$postCutoffMatrix{$cui1}}{$cui2} = $val;
	    }
	}
    }
    close IN;

samples/configFileSamples/UMLSAssociationConfig

##############################################################################
#                Configuration File for UMLS::Association
##############################################################################
# All the options in this file are put into an options hash and 
# passed directly to UMLS::Association for initialization. Options hash keys 
# are in <>'s, and values follow directly after with no space. As an example, 
# the line "<database>bigrams" will pass the 'database' parameter with a 
# value of 'bigrams' to the UMLS::Association options hash for its 
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
#
#
# See UMLS::Association for more details

# Database of Association Scores. Not used, but required to initialize
# UMLS::Association
<database>CUI_Bigram

# If the UMLS::Association database is not installed on the local machine,
# the following parameters may be needed to connect to the server
<hostname>192.168.00.00
<username>username
<password>password
<socket>/var/run/mysqld.sock

# makes the UMLS::Association not print to the command line
<t>

samples/configFileSamples/UMLSInterfaceConfig

#############################################################################
#                Configuration File for UMLS::Interface
############################################################################
# All the options in this file are put into an options hash and 
# passed directly to UMLS::Interface for initialization. Options hash keys 
# are in <>'s, and values follow directly after with no space. As an example, 
# the line "<database>umls" will pass the 'database' parameter with a value 
# of 'umls' to the UMLS::Interface options hash for its initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
#
#
# See UMLS::Interface for more detail

# makes the UMLS::Interface not print to the command line
<t>

samples/runSample.pl

#Demo file, showing how to run open discovery using the sample data, and how 
# to perform time slicing evaluation using the sample data

# run a sample lbd using the parameters in the lbd configuration file
print "\n           OPEN DISCOVERY          \n";
`perl ../utils/runDiscovery.pl lbdConfig`;
print "LBD Open discovery results output to sampleOutput\n\n";

# run a sample time slicing
# first remove the co-occurrences of the precutoff matrix (in this case 
# the sampleExplicitMatrix) from the post cutoff matrix. This generates a 
# gold standard discovery matrix from which time slicing may be performed. 
# This requires modifying removeExplicit.pl, which we have done for you. 
# The variables for this example in removeExplicit.pl are:
#  my $matrixFileName = 'sampleExplicitMatrix';
#  my $squaredMatrixFileName = 'postCutoffMatrix';
#  my $outputFileName = 'sampleGoldMatrix';
#`perl ../utils/datasetCreator/removeExplicit.pl`;

# next, run time slicing 
print "          TIME SLICING          \n";
`perl ../utils/runDiscovery.pl timeSlicingConfig > sampleTimeSliceOutput`;
print "LBD Time Slicing results output to sampleTimeSliceOutput\n";

samples/timeSlicingConfig


# similar to target concept groups, this restricts the acceptable target (C) 
# terms to terms within the semantic types listed
# See https://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt for a list
#<linkingAcceptGroups>clnd,chem

# Input file path for the explicit co-occurrence matrix used in LBD
<explicitInputFile>sampleExplicitMatrix

# Input file path for the gold standard matrix (matrix of true predictions)
# See utils/datasetCreator on how to make this
<goldInputFile>sampleGoldMatrix

# Input file path of the pre-computed predictions file
# This is optional, but can speed up computation time, since computing the 
# prediction matrix can be time consuming.
# The prediction matrix is all predicted discoveries. It is easiest to generate
# by running timeslicing first with the predictionsOutFile specified, then
# in subsequent runs using that as an input
# <predictionsInFile>predictionsMatrix
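# On a first run, the predictions matrix can instead be written out by 
# specifying the predictionsOutFile option mentioned above (shown here as 
# an illustration of the format):
# <predictionsOutFile>predictionsMatrix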

t/test.t

	last;
    }
}
ok($fAtKSame == 1, "Frequency at K Matches");

print "Done with Time Slicing Tests\n";



############################################################
#function to read in time slicing data values
sub readTimeSlicingData {
    my $fileName = shift;

    #read in the gold time slicing values
    my @APScores = ();
    my $MAP;
    my @PAtKScores = ();
    my @FAtKScores = ();
    open IN, "$fileName" 
    #open IN, './t/goldSampleTimeSliceOutput'

utils/datasetCreator/applySemanticFilter.pl

	    $matrixRef, $acceptTypesRef, $umls_interface);
    } else {
	Filters::semanticTypeFilter_rowsAndColumns(
	    $matrixRef, $acceptTypesRef, $umls_interface);
    }

    #output the matrix
    Discovery::outputMatrixToFile($outputFileName, $matrixRef);

    #TODO re-enable this and then try to run again
    #disconnect from the database and return
    #$umls_interface->disconnect();
}


# transforms the string of accept types or groups into a hash of accept TUIs
# input:  a string specifying whether linking or target types are being defined
# output: a hash of acceptable TUIs
sub getAcceptTypes {
    my $umls_interface = shift;
    my $acceptTypesString = shift;

utils/datasetCreator/combineCooccurrenceMatrices.pl

# different time slicing or discovery replication results. We ran CUICollector
# separately for each year of the MetaMapped MEDLINE baseline and stored each
# co-occurrence matrix in a single folder "hadoopByYear/output/". That folder 
# contained files named by the year and window size used (e.g. 1975_window8).
# The code may need to be modified slightly for other purposes.
use strict;
use warnings;
my $startYear;
my $endYear;
my $windowSize;
my $dataFolder;

#user input
$dataFolder = '/home/henryst/hadoopByYear/output/';
$startYear = '1983';
$endYear = '1985';
$windowSize = 8;
&combineFiles($startYear,$endYear,$windowSize);


#####################################################
####### Program Start ########
sub combineFiles {
    my $startYear = shift;

utils/datasetCreator/combineCooccurrenceMatrices.pl

    my $outFileName = "$startYear".'_'."$endYear".'_window'."$windowSize";
(!(-e $outFileName)) 
    or die ("ERROR: output file already exists: $outFileName\n");
open OUT, ">$outFileName" 
    or die ("ERROR: unable to open output file: $outFileName\n");

#combine the files
my %matrix = ();
for(my $year = $startYear; $year <= $endYear; $year++) {
    print "reading $year\n";
    my $inFile = $dataFolder.$year.'_window'.$windowSize;
    if (!(open IN, $inFile)) {
	print "   ERROR: unable to open $inFile\n";
	next;
    }

    #read each line of the file and add to the matrix
    while (my $line = <IN>) {
	#read values from the line
	$line =~ /([^\s]+)\t([^\s]+)\t([^\s]+)/;
	my $rowKey = $1;

utils/datasetCreator/dataStats/getCUICooccurrences.pl

# A data statistics tool that gets a list of all cuis, and outputs their number
# of co-occurrences, and their number of unique co-occurrences to file

my $inputFile = '/home/henryst/lbdData/groupedData/reg/1975_1999_window8_noOrder';
my $outputFile = '/home/henryst/lbdData/groupedData/1975_1999_window8_noOrder_stats';

###################################
###################################

#open files
open IN, $inputFile or die("ERROR: unable to open inputFile\n");

utils/datasetCreator/dataStats/metaAnalysis.pl

# co-occurrences of a co-occurrence file, or set of co-occurrence files
use strict;
use warnings;

#perform meta-analysis on a single co-occurrence matrix
&metaAnalysis('/home/henryst/lbdData/groupedData/1960_1989_window8_noOrder');

#perform meta-analysis on a date range of co-occurrence matrices in a folder
# this expects a folder to contain a co-occurrence matrix for every year
# specified within the date range
my $dataFolder = '/home/henryst/lbdData/dataByYear/1960_1989';
my $startYear = '1809';
my $endYear = '2015';
my $windowSize = 1;
my $statsOutFileName = '/home/henryst/lbdData/stats_window1';
&folderMetaAnalysis($startYear, $endYear, $windowSize, $statsOutFileName, $dataFolder);


#####################
# runs meta analysis on a set of files
sub folderMetaAnalysis {
    my $startYear = shift;
    my $endYear = shift;
    my $windowSize = shift;
    my $statsOutFileName= shift;
    my $dataFolder = shift;

    #Check on I/O
    open OUT, ">$statsOutFileName" 
	or die ("ERROR: unable to open stats out file: $statsOutFileName\n");

    #print header row
    print OUT "year\tnumRows\tnumCols\tvocabularySize\tnumCooccurrences\n";

    #get stats for each file and output to file
    for(my $year = $startYear; $year <= $endYear; $year++) {
	print "reading $year\n";
	my $inFile = $dataFolder.$year.'_window'.$windowSize;
	if (open IN, $inFile) {
	    (my $numRows, my $numCols, my $vocabularySize, my $numCooccurrences)
		= &metaAnalysis($inFile);
	    print OUT "$year\t$numRows\t$numCols\t$vocabularySize\t$numCooccurrences\n"	
	}
	else {
	    #just skip the file
	    print "   ERROR: unable to open $inFile\n";
	}
    }

utils/datasetCreator/fromMySQL/dbToTab.pl

#converts a mysql database to a tab separated file readable by LBD
#command is of the form:
#`mysql <DB_NAME> -e "SELECT * FROM N_11 INTO OUTFILE '<OUTPUT_FILE>' FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"`
#
# the following line is an example using a database with cui co-occurrence 
# counts from 1980 to 1984 with a window size of 1. The mysql database is 
# called 1980_1984_window1, and the output matrix file is called 
# 1980_1984_window1_data.txt
`mysql 1980_1984_window1 -e "SELECT * FROM N_11 INTO OUTFILE '1980_1984_window1_data.txt' FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"`;

utils/datasetCreator/fromMySQL/removeQuotes.pl

#removes quotes from a db to tab file

my $inFile = '1980_1984_window1_retest_data.txt';
my $outFile = '1980_1984_window1_restest_DELETEME';


open IN, $inFile or die ("unable to open inFile: $inFile\n");
open OUT, '>'.$outFile or die ("unable to open outFile: $outFile\n");

while (my $line  = <IN>) {
    $line =~ s/"//g;
    #print $line;
    print OUT $line;

utils/datasetCreator/removeCUIPair.pl

# removes the cui pair from the dataset
# used to remove Somatomedin C and Arginine from the 1960-1989 datasets
use strict;
use warnings;

my $cuiA = 'C0021665'; #somatomedin c
my $cuiB = 'C0003765'; #arginine
my $matrixFileName = '/home/henryst/lbdData/groupedData/1960_1989_window8_ordered';
my $matrixOutFileName = $matrixFileName.'_removed';
&removeCuiPair($cuiA, $cuiB, $matrixFileName, $matrixOutFileName);

print STDERR "DONE\n";

###########################################
# remove the CUI pair from the dataset
sub removeCuiPair {
    my $cuiA = shift;
    my $cuiB = shift;
    my $matrixFileName = shift;
    my $matrixOutFileName = shift;
    print STDERR "removing $cuiA,$cuiB from $matrixFileName\n";
    
    #open the in and out files
    open IN, $matrixFileName 
	or die ("ERROR: cannot open matrix in file: $matrixFileName\n");

utils/datasetCreator/squaring/squareMatrix.m

clear all;
close all;

sparseSquare('/home/henryst/lbdData/squaring/1975_1999_window8_noOrder','/home/henryst/lbdData/squaring/1975_1999_window8_noOrder_squared');

error('DONE!');


function sparseSquare(fileIn, fileOut)

    %load the data
    data = load(fileIn);
    disp('   loaded data');

    %convert to sparse
    vals = max(data);
    maxVal = vals(1);
    if (vals(2) > maxVal) 
       maxVal = vals(2); 
    end
    sp = sparse(data(:,1), data(:,2), data(:,3), maxVal, maxVal);
    clear data;
    clear vals;
    clear maxVal;
    disp('   converted to sparse');

    %square the matrix
    squared = sp*sp;
    clear sp;
    disp('    squared');

    %output the matrix
    [i,j,val] = find(squared);
    clear squared;
    disp('    values grabbed for output');
    data_dump = [i,j,val];
    clear i;
    clear j;
    clear val;
    disp('    values ready for output dump');
    fid = fopen(fileOut,'w');
    fprintf( fid,'%d %d %d\n', transpose(data_dump) );
    fclose(fid);
    disp('   DONE!');

end

utils/datasetCreator/squaring/squareMatrix_partial.m

sparseSquare_sectioned('/home/henryst/lbdData/squaring/1975_1999_window8_noOrder','/home/henryst/lbdData/squaring/1975_1999_window8_noOrder_squared_secondTry',increment);
error('DONE!');

function sparseSquare_sectioned(fileIn, fileOut, increment)
  disp(fileIn);

  %open, close, and clear the output file
  fid = fopen(fileOut,'w');
  fclose(fid);

  %load the data
  data = load(fileIn);
    
  vals = max(data);
  matrixSize = vals(1);
  if (vals(2) > matrixSize) 
    matrixSize = vals(2); 
  end
  disp('got matrixDim');
  clear data;

  %multiply each segment of the matrices
  for rowStartIndex = 1:increment:matrixSize
    rowEndIndex = rowStartIndex+increment-1;
    if (rowEndIndex > matrixSize) 
      rowEndIndex = matrixSize;
    end

    for colStartIndex = 1: increment: matrixSize
      colEndIndex = colStartIndex+increment-1;
      if (colEndIndex > matrixSize)
        colEndIndex = matrixSize;
      end

      dispString = [num2str(rowStartIndex), ',', num2str(rowEndIndex),' - ', num2str(colStartIndex),', ', num2str(colEndIndex),':'];
      disp(dispString)
      clear dispString;

      %load the data
      data = load(fileIn);
      disp('   loaded data');

      %convert to sparse
      vals = max(data);
      maxVal = vals(1);
      if (vals(2) > maxVal) 
        maxVal = vals(2); 
      end
      sp = sparse(data(:,1), data(:,2), data(:,3), maxVal, maxVal);
      clear data;
      clear vals;
      clear maxVal;
      disp('   converted to sparse');

      %grab a piece of the matrix
      sp1 = sparse(matrixSize,matrixSize);
      sp2 = sparse(matrixSize,matrixSize);
      sp1(rowStartIndex:rowEndIndex,:) = sp(rowStartIndex:rowEndIndex,:);
      sp2(:,colStartIndex:colEndIndex) = sp(:,colStartIndex:colEndIndex);
      clear sp;
    
      %square the matrix
      squared = sp1*sp2;
      clear sp1 sp2;
      disp('    squared');

      %output the matrix
      [i,j,val] = find(squared);
      clear squared;
      disp('    values grabbed for output');
      data_dump = [i,j,val];
      clear i;
      clear j;
      clear val;
      disp('    values ready for output dump');
      fid = fopen(fileOut,'a+');
      fprintf( fid,'%d %d %d\n', transpose(data_dump) );
      clear data_dump;
      fclose(fid);
      disp('   values output');
    end
  end
end


