ALBD


INSTALL  view on Meta::CPAN


  Stage 4: Create a co-occurrence matrix
    ALBD requires that a co-occurrence matrix of CUIs has been created. This
    matrix is stored as a flat file, in a sparse matrix format such that
    each line contains three tab-separated values: cui_1, cui_2, and n_11
    (the count of their co-occurrences). Any matrix with that format is
    acceptable; however, the intended method of matrix generation is to
    convert a UMLS::Association database into a flat matrix file. These
    databases are created by running the CUICollector tool of
    UMLS::Association over the MetaMapped Medline baseline. With that
    database, run utils/datasetCreator/fromMySQL/dbToTab.pl to convert it
    into a matrix file. Note that the code in dbToTab.pl is just a sample
    mysql command. If the input database was created by another method, a
    different command may be needed. As long as the resulting co-occurrence
    matrix is in the correct format, LBD may be run on it. This allows
    flexibility in where co-occurrence information comes from.

    Note: utils/datasetCreator/fromMySQL/removeQuotes.pl may need to be run
    on the resulting tab-separated file if quotes are included in the
    resulting co-occurrence matrix file.
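
    The matrix format described above can be illustrated with a short
    sketch that reads tab-separated lines into a hash of hashes. The
    helper name readSparseMatrix is hypothetical and is not part of ALBD,
    and the quote stripping is only an assumption about the kind of
    cleanup removeQuotes.pl performs.

```perl
use strict;
use warnings;

# Parse lines of "cui_1<TAB>cui_2<TAB>n_11" into a hash of hashes, so that
# $matrix->{$cui1}{$cui2} holds the co-occurrence count.
# (readSparseMatrix is a hypothetical helper, not part of ALBD.)
sub readSparseMatrix {
    my ($fh) = @_;
    my %matrix;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;              # skip blank lines
        my ($cui1, $cui2, $n11) = split /\t/, $line;
        next unless defined $n11;              # skip malformed lines
        s/^"|"$//g for ($cui1, $cui2, $n11);   # drop surrounding quotes (assumption)
        $matrix{$cui1}{$cui2} = $n11;
    }
    return \%matrix;
}

# Example: read an in-memory "file" containing two entries
my $data = "C0000000\tC1111111\t10\nC0000000\tC2222222\t3\n";
open my $fh, '<', \$data or die "cannot open in-memory file\n";
my $matrix = readSparseMatrix($fh);
close $fh;
print "$matrix->{C0000000}{C1111111}\n";
```

    Keeping the matrix as a hash of hashes means only observed pairs use
    memory, which matches the sparse flat-file layout.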

  Stage 5: Set up Dummy UMLS::Association Database
    UMLS::Association requires that a database can be connected to that is
    in the correct format. Although this database is not required for ALBD
    (since co-occurrence data is loaded from a co-occurrence matrix), it is
    required to run UMLS::Association. If you ran UMLS::Association to
    generate a co-occurrence matrix, you should be fine. Otherwise you will
    need to create a dummy database that it can connect to. This can be done

MANIFEST  view on Meta::CPAN

samples/sampleGoldMatrix
samples/timeSliceCuiList
samples/timeSlicingConfig
samples/configFileSamples/UMLSAssociationConfig
samples/configFileSamples/UMLSInterfaceConfig
samples/configFileSamples/UMLSInterfaceInternalConfig
t/test.t
t/goldSampleOutput
t/goldSampleTimeSliceOutput
utils/runDiscovery.pl
utils/datasetCreator/applyMaxThreshold.pl
utils/datasetCreator/applyMinThreshold.pl
utils/datasetCreator/applySemanticFilter.pl
utils/datasetCreator/combineCooccurrenceMatrices.pl
utils/datasetCreator/makeOrderNotMatter.pl
utils/datasetCreator/removeCUIPair.pl
utils/datasetCreator/removeExplicit.pl
utils/datasetCreator/testMatrixEquality.pl
utils/datasetCreator/dataStats/getCUICooccurrences.pl
utils/datasetCreator/dataStats/getMatrixStats.pl
utils/datasetCreator/dataStats/metaAnalysis.pl
utils/datasetCreator/fromMySQL/dbToTab.pl
utils/datasetCreator/fromMySQL/removeQuotes.pl
utils/datasetCreator/squaring/convertForSquaring_MATLAB.pl
utils/datasetCreator/squaring/squareMatrix.m
utils/datasetCreator/squaring/squareMatrix_partial.m
utils/datasetCreator/squaring/squareMatrix_perl.pl
META.yml                                 Module YAML meta-data (added by MakeMaker)
META.json                                Module JSON meta-data (added by MakeMaker)

README  view on Meta::CPAN


  CO-OCCURRENCE MATRIX SETUP
    ALBD requires that a co-occurrence matrix of CUIs has been created. This
    matrix is stored as a flat file, in a sparse matrix format such that
    each line contains three tab-separated values: cui_1, cui_2, and n_11
    (the count of their co-occurrences). Any matrix with that format is
    acceptable; however, the intended method of matrix generation is to
    convert a UMLS::Association database into a flat matrix file. These
    databases are created by running the CUICollector tool of
    UMLS::Association over the MetaMapped Medline baseline. With that
    database, run utils/datasetCreator/fromMySQL/dbToTab.pl to convert it
    into a matrix file. Note that the code in dbToTab.pl is just a sample
    mysql command. If the input database was created by another method, a
    different command may be needed. As long as the resulting co-occurrence
    matrix is in the correct format, LBD may be run on it. This allows
    flexibility in where co-occurrence information comes from.

    Note: utils/datasetCreator/fromMySQL/removeQuotes.pl may need to be run
    on the resulting tab-separated file if quotes are included in the
    resulting co-occurrence matrix file.

  Set Up Dummy UMLS::Association Database
    UMLS::Association requires that a database can be connected to that is
    in the correct format. Although this database is not required for ALBD
    (since co-occurrence data is loaded from a co-occurrence matrix), it is
    required to run UMLS::Association. If you ran UMLS::Association to
    generate a co-occurrence matrix, you should be fine. Otherwise you will
    need to create a dummy database that it can connect to. This can be done

README  view on Meta::CPAN

    present in the '/lib' directory tree of the package.

    The package contains a utils/ directory that contains Perl utility
    programs. These utilities use the modules or provide some supporting
    functionality.

    runDiscovery.pl -- runs LBD using the parameters specified in the input
    file, and outputs to an output file.

    The package contains a large selection of functions to manipulate CUI
    Co-occurrence matrices in the utils/datasetCreator/ directory. These are
    short scripts and generally require modifying the code at the top with
    user input parameters specific for each run. These scripts include:

    applyMaxThreshold.pl -- applies a maximum co-occurrence threshold to the
    co-occurrence matrix

    applyMinThreshold.pl -- applies a minimum co-occurrence threshold to the
    co-occurrence matrix

    applySemanticFilter.pl -- applies a semantic type and/or group filter to

README  view on Meta::CPAN


    testMatrixEquality.pl -- checks to see if two co-occurrence matrix files
    contain the same data

    Also included are several subfolders with more specific purposes. Within
    the dataStats subfolder are scripts to collect various statistics about
    the co-occurrence matrices used in LBD. These scripts include:

    getCUICooccurrences.pl -- a data statistics file that gets the number of
    co-occurrences, and number of unique co-occurrences for every CUI in the
    dataset

    getMatrixStats.pl -- determines the number of rows, columns, and entries
    of a co-occurrence matrix

    metaAnalysis.pl -- determines the number of rows, columns, vocabulary
    size, and total number of co-occurrences of a co-occurrence file, or set
    of co-occurrence files
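
    As a rough illustration of the kind of statistics these scripts
    report, the sketch below counts rows, columns, and entries of a
    matrix held as a hash of hashes. The helper name matrixStats is
    hypothetical; the actual scripts read matrix files and may compute
    their statistics differently.

```perl
use strict;
use warnings;

# Count rows, columns, and entries of a sparse matrix stored as a hash of
# hashes. (matrixStats is a hypothetical helper, not part of ALBD.)
sub matrixStats {
    my ($matrixRef) = @_;
    my %cols;
    my $entries = 0;
    foreach my $row (keys %{$matrixRef}) {
        foreach my $col (keys %{ $matrixRef->{$row} }) {
            $cols{$col} = 1;    # remember each distinct column CUI
            $entries++;         # one entry per stored (row, column) cell
        }
    }
    return (scalar(keys %{$matrixRef}), scalar(keys %cols), $entries);
}

my %matrix = (
    C0000000 => { C1111111 => 10, C2222222 => 3 },
    C1111111 => { C2222222 => 7 },
);
my ($rows, $cols, $entries) = matrixStats(\%matrix);
print "rows=$rows cols=$cols entries=$entries\n";
```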

    There is another folder containing scripts to square co-occurrence
    matrices. Squaring an explicit (A to B) co-occurrence matrix results in
    a co-occurrence matrix containing all implicit (A to C) connections.
    This is useful for time slicing and other analysis. Removal of the
    original explicit matrix is an additional step that is required if you
    wish to create a predictions matrix file for every CUI. This can be done
    with the removeExplicit.pl script. Squaring a co-occurrence matrix can
    be very computationally expensive, both in terms of ram and cpu. For

README  view on Meta::CPAN

    squareMatrix.m -- MATLAB script to square a matrix while holding
    everything in ram. Faster, but requires more ram.

    squareMatrix_partial.m -- MATLAB script to square a matrix in chunks.
    Only loads parts of the matrix into ram at a time which makes squaring
    any size matrix possible, but may take impractical amounts of
    time.

    squareMatrix_perl.pl -- squares a matrix in perl, but requires the most
    ram of any squaring method. The easiest method to use, but only
    practical for small datasets.
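
    To make the squaring step concrete, here is a minimal sketch of
    sparse matrix squaring over a hash of hashes. It is plain sparse
    matrix multiplication written for illustration; the helper name
    squareSparseMatrix is hypothetical, and squareMatrix_perl.pl may be
    organized differently.

```perl
use strict;
use warnings;

# Square a sparse matrix held as a hash of hashes: for each A->B entry and
# each B->C entry, add their product to the A->C cell of the result.
# (squareSparseMatrix is a hypothetical helper, not part of ALBD.)
sub squareSparseMatrix {
    my ($m) = @_;
    my %squared;
    foreach my $cuiA (keys %{$m}) {
        foreach my $cuiB (keys %{ $m->{$cuiA} }) {
            my $rowB = $m->{$cuiB} or next;    # B has no row of its own
            foreach my $cuiC (keys %{$rowB}) {
                $squared{$cuiA}{$cuiC} += $m->{$cuiA}{$cuiB} * $rowB->{$cuiC};
            }
        }
    }
    return \%squared;
}

# A co-occurs with B twice, B co-occurs with C three times
my %explicit = (
    A => { B => 2 },
    B => { C => 3 },
);
my $squared = squareSparseMatrix(\%explicit);
print "A->C = $squared->{A}{C}\n";
```

    The whole input and output are held in memory here, which is why an
    approach like this is only practical for small datasets.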

    The fromMySQL folder contains scripts that convert UMLS::Association
    databases to ALBD co-occurrence matrices. The files contained are:

    dbToTab.pl -- converts a UMLS::Association co-occurrence database to a
    sparse format co-occurrence matrix used for ALBD

    removeQuotes.pl -- removes quotes from lines in the co-occurrence matrix
    file after converting from a database (sometimes needed)

config/lbd  view on Meta::CPAN

##############################################################################
#            Configuration File for Literature Based Discovery
##############################################################################
# All the options in this file are parsed and used as parameters for
# Literature Based Discovery
# Option keys are in <>'s, and values follow directly after with no space.
# As an example, the line "<rankingMethod>ll" will set the 'rankingMethod'
# parameter with a value of 'll' for literature based discovery
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')

<rankingProcedure>averageMinimumWeight
<rankingMethod>chi
<implicitOutputFile>../samples/sampleLBDOutput

<linkingAcceptGroups>CHEM,DISO,GENE,PHYS,ANAT
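
The `<key>value` convention described in the header comments can be sketched as a small parser. The helper name parseConfigLines and its regular expression are assumptions made for illustration and are not the code ALBD itself uses; skipping `#` lines follows the note found in the sample configuration files.

```perl
use strict;
use warnings;

# Parse "<key>value" lines into a hash; a key with no value is set to 1.
# (parseConfigLines is a hypothetical helper; the regular expression is an
# assumption, not the one used by ALBD.)
sub parseConfigLines {
    my @lines = @_;
    my %options;
    foreach my $line (@lines) {
        chomp $line;
        next if $line =~ /^\s*(#|$)/;    # skip comments and blank lines
        if ($line =~ /^<([^>]+)>(.*)$/) {
            my ($key, $value) = ($1, $2);
            $options{$key} = length($value) ? $value : 1;
        }
        else {
            print STDERR "Warning: Invalid line: $line\n";
        }
    }
    return \%options;
}

my $options = parseConfigLines(
    "# sample configuration\n",
    "<rankingProcedure>averageMinimumWeight\n",
    "<rankingMethod>chi\n",
    "<debug>\n",
);
print "$_=$options->{$_}\n" for sort keys %{$options};
```

Testing the value with `length` rather than plain truthiness keeps a literal value of `0` from being mistaken for a missing value.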

lib/ALBD.pm  view on Meta::CPAN

# LiteratureBasedDiscovery.pm - provides functionality to perform LBD
#
# Matrix Representation:
# LBD is performed using Matrix and Vector operations. The major components 
# are an explicit knowledge matrix, which is squared to find the implicit 
# knowledge matrix.
#
# The explicit knowledge is read from UMLS::Association N11 matrix. This 
# matrix contains the co-occurrence counts for all CUI pairs. The 
# UMLS::Association database is completely independent from 
# implementation, so any dataset, window size, or anything else may be used. 
# Data is read in as a sparse matrix using the Discovery::tableToSparseMatrix 
# function. This returns the primary data structures and variables used 
# throughout LBD.
#
# Matrix representation: 
# This module uses a matrix representation for LBD. All operations are 
# performed either as matrix or vector operations. The core data structures
# are the co-occurrence matrices explicitMatrix and implicitMatrix. These
# matrices have dimensions vocabulary size by vocabulary size. Each row
# corresponds to all of the co-occurrences for a single CUI. Each column of that

lib/ALBD.pm  view on Meta::CPAN

# means that CUI C0000000 and C1111111 co-occurred 10 times).
#
# Now with an understanding of the data structures, below is a brief
# description of each: 
#
# startingMatrix <- A matrix containing the explicit matrix rows for all of the
#                   start terms. This makes it easy to have multiple start terms
#                   and using this matrix as opposed to the entire explicit 
#                   matrix drastically improves performance.
# explicitMatrix <- A matrix containing explicit connections (known connections)
#                   for every CUI in the dataset.            
# implicitMatrix <- A matrix containing implicit connections (discovered 
#                   connections) for every CUI in the dataset


package ALBD;

use strict;
use warnings;

use LiteratureBasedDiscovery::Discovery;

lib/ALBD.pm  view on Meta::CPAN

#####################################################
####################################################

# performs LBD
# input:  none
# output: none, but a results file is written to disk
sub performLBD {
    my $self = shift;
    my $start; #used to record run times

    #implicit matrix ranking requires a different set of procedures
    if ($lbdOptions{'rankingProcedure'} eq 'implicitMatrix') { 
	$self->performLBD_implicitMatrixRanking();
	return;
    }
    if (exists $lbdOptions{'targetCuis'}) {
	$self->performLBD_closedDiscovery();
	return;
    }
    if (exists $lbdOptions{'precisionAndRecall_explicit'}) {
	$self->timeSlicing_generatePrecisionAndRecall_explicit();

lib/ALBD.pm  view on Meta::CPAN

    my $className = shift;
    my $optionsHashRef = shift;
    bless($self, $className);

    $self->_initialize($optionsHashRef);
    return $self;
}

# Initializes everything needed for Literature Based Discovery
# input: $optionsHashRef <- reference to LBD options hash (command line input)
# output: none, but global parameters are set
sub _initialize {
    my $self = shift;
    my $optionsHashRef = shift; 

    #initialize UMLS::Interface
    my %tHash = ();
    $tHash{'t'} = 1; #default hash values are with t=1 (silence module output)
    my $componentOptions = \%tHash;
    if (${$optionsHashRef}{'interfaceConfig'} ne '') {
	#read configuration file if it is defined

lib/ALBD.pm  view on Meta::CPAN


    #initialize LBD parameters
    %lbdOptions = %{$self->_readConfigFile(${$optionsHashRef}{'lbdConfig'})};
    
}    

# Reads the config file in as an options hash
# input: the name of a configuration file that has key fields in '<>'s, 
#        The '>' is followed directly by the value for that key, no space.
#        Each line of the file contains a new key-value pair (e.g. <key>value)
#        If no value is provided, a default value of 1 is set
# output: a hash ref to a hash containing each key value pair
sub _readConfigFile {
    my $self = shift;
    my $configFileName = shift;
    
    #read in all options from the config file
    open IN, $configFileName or die("Error: Cannot open config file: $configFileName\n");
    my %optionsHash = ();
    my $firstChar;
    while (my $line = <IN>) {

lib/ALBD.pm  view on Meta::CPAN

		print STDERR 
		    "Warning: Invalid line in $configFileName: $line\n";
	    }
	    else {
		#data was grabbed from the line, add to hash
		if (defined $2 && length $2) {
		    #add key and value to the optionsHash
		    $optionsHash{$1} = $2;
		}
		else {
		    #add key and set default value to the optionsHash
		    $optionsHash{$1} = 1;
		}
	    }
	}
    }
    close IN;

    return \%optionsHash;
}

lib/ALBD.pm  view on Meta::CPAN

#         $printTo <- optional, outputs the $printTo top ranked terms. If not
#                     specified, all terms are output
# output: a line separated string containing ranked terms, scores, and their
#         preferred terms
sub _rankedTermsToString {
    my $self = shift;
    my $scoresRef = shift;
    my $ranksRef = shift;
    my $printTo = shift;

    #set printTo
    if (!$printTo) {
	$printTo = scalar @{$ranksRef};
    }
    
    #construct the output string
    my $string = '';
    my $index;
    for (my $i = 0; $i < $printTo; $i++) {
	#add the rank
	$index = $i+1;

lib/LiteratureBasedDiscovery/Discovery.pm  view on Meta::CPAN

# 6) filter implicit knowledge
# 
# which has code as:
# TODO insert sample code

#NOTE: CUI merging/term expansion can also be easily done by adding
#   two or more explicit vectors, then generating explicit knowledge from
#   them.  BUT also interesting is that term expansion, etc... is 
#   unnecessary if we just rank against every term. We may however need
#   to modify the ranking metrics to account for synonyms, etc.. (max value
#   of a set of synonyms or something)


######################################################################
#           Functions to perform Literature Based Discovery
######################################################################


# gets the rows of the cuis from the matrix
# input:  $cuisRef <- an array reference to a list of CUIs
#         $matrixRef <- a reference to a co-occurrence matrix

lib/LiteratureBasedDiscovery/Discovery.pm  view on Meta::CPAN

#          $cuiFinder <- an instance of UMLS::Interface::CuiFinder
#  output: a hash ref to the sparse matrix (${$hash{$index1}}{$index2} = value)
sub tableToSparseMatrix {
    my $tableName = shift;
    my $cuiFinder = shift;

    # check tableName
    #TODO check that the table exists in the database
    # or die "Error: table does not exist: $tableName\n";

    #  set up database
    my $db = $cuiFinder->_getDB(); 
    
    # retrieve the table as a nested hash where keys are CUI1,
    # then CUI2, value is N11
     my @keyFields = ('cui_1', 'cui_2');
     my $matrixRef = $db->selectall_hashref(
	"select * from $tableName", \@keyFields);

    # set values of the loaded table to n_11
    # ...default is hash of hash of hash
    foreach my $key1(keys %{$matrixRef}) {
	foreach my $key2(keys %{${$matrixRef}{$key1}}) {
	    ${${$matrixRef}{$key1}}{$key2} = ${${${$matrixRef}{$key1}}{$key2}}{'n_11'};
	}
    }
    return $matrixRef;
}
=cut

lib/LiteratureBasedDiscovery/Evaluation.pm  view on Meta::CPAN

# ALBD::Evaluation.pm
#
# Provides functionality to evaluate LBD systems
# Key components are:
# Results Matrix <- all new knowledge generated by an LBD system (e.g.
#                   all proposed discoveries of a system from pre-cutoff
#                   data).
# Gold Standard Matrix <- the gold standard knowledge matrix (e.g. all
#                         knowledge present in the post-cutoff dataset
#                         that is not present in the pre-cutoff dataset).
#
# Copyright (c) 2017
#
# Sam Henry
# henryst at vcu.edu
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
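
The relationship between the Results Matrix and the Gold Standard Matrix described in the header above can be illustrated by computing precision and recall over their cells. This is a minimal sketch under the assumption that both matrices are hashes of hashes keyed by CUI; the helper name precisionAndRecall is hypothetical and is not the module's own function.

```perl
use strict;
use warnings;

# Compute precision and recall of predicted pairs against gold pairs.
# Both matrices are hashes of hashes keyed by CUI.
# (precisionAndRecall is a hypothetical helper, not part of ALBD.)
sub precisionAndRecall {
    my ($resultsRef, $goldRef) = @_;
    my ($numPredicted, $numGold, $numCorrect) = (0, 0, 0);

    # count predictions, and how many of them appear in the gold standard
    foreach my $row (keys %{$resultsRef}) {
        foreach my $col (keys %{ $resultsRef->{$row} }) {
            $numPredicted++;
            $numCorrect++
                if exists $goldRef->{$row} && exists $goldRef->{$row}{$col};
        }
    }

    # count all gold standard cells
    $numGold += scalar(keys %{$_}) for values %{$goldRef};

    my $precision = $numPredicted ? $numCorrect / $numPredicted : 0;
    my $recall    = $numGold      ? $numCorrect / $numGold      : 0;
    return ($precision, $recall);
}

my %results = ( A => { C => 1, D => 1 } );
my %gold    = ( A => { C => 1, E => 1, F => 1, G => 1 } );
my ($precision, $recall) = precisionAndRecall(\%results, \%gold);
print "precision=$precision recall=$recall\n";
```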

lib/LiteratureBasedDiscovery/Filters.pm  view on Meta::CPAN

}

# gets the semantic types of the group
# input:  $group <- a string specifying a semantic group
#         $umls <- an instance of UMLS::Interface
# output: a ref to a hash of TUIs
sub getTypesOfGroup {
    my $group = shift;
    my $umls = shift;

    #add each type of the group to the set of accept types
    my %acceptTuis = ();
    my @groupTypes = @{ $umls->getStsFromSg($group) };
    foreach my $abr(@groupTypes) {
	#check that it is defined (types that are no longer in 
	#the UMLS may be returned as part of the group)
	if (defined $abr) {
	    my $tui = uc $umls->getStTui($abr);
	    $acceptTuis{$tui} = 1;
	}
    }

lib/LiteratureBasedDiscovery/Rank.pm  view on Meta::CPAN


	#find the sum of C squared
	my $cSum = 0;
	foreach my $key (keys ${$aVectorRef}) {
	    $cSum += ($key*$key);
	}

	#find the denominator, which is the product of A and C lengths
	my $denom = sqrt($aSum)*sqrt($cSum);

	#set the score (maximum score seen for that C term)
	my $score = -1;
	if ($denom != 0) {
	    $score = $numerator/$denom;
	}
	if (exists $scores{$cKey}) {
	    if ($score > $scores{$cKey}) {
		$scores{$cKey} = $score;
	    }
	}
	else {
	    $scores{$cKey} = $score;
	}	
    }
    
    return \%scores;
}

# gets a list of A->C pairs, and sets the value as the implicit matrix value
# input:  $startingMatrixRef <- ref to the starting matrix
#         $implicitMatrixRef <- ref to the implicit matrix
# output: a hash ref where keys are comma separated cui pairs hash{'C000,C111'}
#         and values are set to the value at that index in the implicit matrix
sub _getACPairs {
    my $startingMatrixRef = shift;
    my $implicitMatrixRef = shift;

    #generate a list of ac pairs
    my %acPairs = ();
    foreach my $keyA (keys %{$implicitMatrixRef}) {
	foreach my $keyC (keys %{${$implicitMatrixRef}{$keyA}}) {
	    $acPairs{"$keyA,$keyC"} = ${${$implicitMatrixRef}{$keyA}}{$keyC};
	}

lib/LiteratureBasedDiscovery/Rank.pm  view on Meta::CPAN

	}
    }

######################################
    #Get Co-occurrence values, N11, N1P, NP1, NPP
######################################
    #NPP is the number of co-occurrences total
    #@NP1 is the number of co-occurrences of a C term with any term ... so sum of XXX\tCTerm\tVal for each cTerm
    #@N1P is the number of co-occurrences of any A term ... so sum of anyATerm\tXXX\t
    #N11{Cterm} is the sum of anyATerm\tCTerm\tVal
    seek IN, 0,0; #reset to the beginning of the implicit file

    #iterate over the lines of interest, and grab values
    my %np1 = ();
    my %n11 = ();
    my $n1p = 0;
    my $npp = 0;
    my $matchedCuiB = 0;
    my ($cuiA, $cuiB, $val);
    while (my $line = <IN>) {
	#grab data from the line

lib/LiteratureBasedDiscovery/Rank.pm  view on Meta::CPAN


    #get all the cTerms (unique column values in the implicit matrix)
    my %cTerms = ();
    foreach my $rowKey(keys %{$implicitMatrixRef}) {
	$rowRef = ${$implicitMatrixRef}{$rowKey};
	foreach my $colKey (keys %{$rowRef}) {
	    $cTerms{$colKey} = 1;
	}
    }

    #get all bc pairs, set value to be the frequency of co-occurrence
    my %bcPairs = ();
    foreach my $bTerm(keys %bTerms) {
	$rowRef = ${$explicitMatrixRef}{$bTerm};
	if ($rowRef) {
	    foreach my $cTerm(keys %{$rowRef}) {
		if (exists $cTerms{$cTerm}) {
		    #add because this a->b->c term (%cTerms) is also a b->c term
		    $bcPairs{"$bTerm,$cTerm"} = ${$rowRef}{$cTerm};
		}
	    }

lib/LiteratureBasedDiscovery/Rank.pm  view on Meta::CPAN

    }

    #return the ranked cuis
    return \@rankedCuis;
}


#XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
#XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

# gets association scores for a set of cui pairs 
# input:  $cuiPairsRef <- reference to a hash of pairs of matrix indices (key = '1,2')
#         $matrixRef <- a reference to a sparse matrix of n11 values
#         $measure <- the association measure to perform
#         $association <- an instance of UMLS::Association
# output: none, but the cuiPairs ref has values updated to reflect the
#         computed association score
sub getBatchAssociationScores {
    my $cuiPairsRef = shift;
    my $matrixRef = shift;
    my $measure = shift;

lib/LiteratureBasedDiscovery/TimeSlicing.pm  view on Meta::CPAN

	#see if this line contains a key that should be read in 
	if (exists $cuisToGrab{$cui1}) {

	    #add the value
	    if (!(defined $postCutoffMatrix{$cui1})) {
		my %newHash = ();
		$postCutoffMatrix{$cui1} = \%newHash;
	    }

	    #check to ensure that the column cui is in the 
	    #  vocabulary of the pre-cutoff dataset.
	    #  it is impossible to make predictions of words that
	    #  don't already exist
	    #NOTE: this assumes $explicitMatrixRef is a square 
	    #   matrix (so unordered)
	    if (exists ${$explicitMatrixRef}{$cui2}) {
		${$postCutoffMatrix{$cui1}}{$cui2} = $val;
	    }
	}
    }
    close IN;

lib/LiteratureBasedDiscovery/TimeSlicing.pm  view on Meta::CPAN


	#add key if val >= threshold
	if (${$assocScoresRef}{$key} >= $threshold) {
	    ($cui1,$cui2) = split(/,/, $key);

	    #create new hash at rowkey location
	    if (!(exists $thresholdedMatrix{$cui1})) {
		my %newHash = ();
		$thresholdedMatrix{$cui1} = \%newHash;
	    }
	    #set key value
	    ${$thresholdedMatrix{$cui1}}{$cui2} = ${${$matrixRef}{$cui1}}{$cui2};
	    $postKeyCount++;
	}
    }

    #return the thresholded matrix
    return \%thresholdedMatrix;
}

# Grabs the K highest ranked samples. This is for thresholding based on the number

lib/LiteratureBasedDiscovery/TimeSlicing.pm  view on Meta::CPAN

    my ($cui1, $cui2);
    foreach my $key (@sortedKeys) {
	($cui1, $cui2) = split(/,/, $key);

	#create new hash at rowkey location (if needed)
	if (!(exists $thresholdedMatrix{$cui1})) {
	    my %newHash = ();
	    $thresholdedMatrix{$cui1} = \%newHash;
	}

	#set key value for the key pair
	${$thresholdedMatrix{$cui1}}{$cui2} = ${${$matrixRef}{$cui1}}{$cui2};
	$postKeyCount++;

	#stop adding keys when below the threshold
	if (${$assocScoresRef}{$key} < $threshold) {
	    last;
	}
    }
    #return the thresholded matrix
    return \%thresholdedMatrix;

lib/LiteratureBasedDiscovery/TimeSlicing.pm  view on Meta::CPAN

	if ($k == 10) {
	    $interval = 10;
	}
    }

    #return the mean precisions at k
    return \%meanPrecision;
}


# calculates the number of co-occurrences in the gold set of the top ranked 
# k predictions at k at intervals of 1, from k = 1-10 and intervals of 10 
# for 10-100. Co-occurrence counts are averaged over each of the starting terms
# input:  $trueMatrixRef <- a ref to a hash of true discoveries
#         $rowRanksRef <- a ref to a hash of arrays of ranked predictions. 
#                         Each hash key is a cui,  each hash element is an 
#                         array of ranked predictions for that cui. The ranked 
#                         predictions are cuis are ordered in descending order 
#                         based on association. (from Rank::RankDescending)
# output: \%meanCooccurrenceCounts <- a hash of mean precisions at K, each key
#                                     is the value of k, the value is the

samples/lbdConfig  view on Meta::CPAN

##############################################################################
#            Configuration File for Literature Based Discovery
##############################################################################
# All the options in this file are parsed and used as parameters for
# Literature Based Discovery
# Option keys are in <>'s, and values follow directly after with no space.
# As an example, the line "<rankingMethod>ll" will set the 'rankingMethod'
# parameter with a value of 'll' for literature based discovery
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
# lines starting with a # are skipped and may be used for comments

# The ranking procedure to use for LBD
# valid ranking procedures are:
#   allPairs (maxBC) - maximum B to C term value
#   averageMinimumWeight (AMW) - average of minimum A to B and B to C values

samples/runSample.pl  view on Meta::CPAN


# run a sample time slicing
# first remove the co-occurrences of the precutoff matrix (in this case it is
# the sampleExplicitMatrix) from the post cutoff matrix. This generates a gold
# standard discovery matrix from which time slicing may be performed.
# This requires modifying removeExplicit.pl, which we have done for you.
# The variables for this example in removeExplicit.pl are:
#  my $matrixFileName = 'sampleExplicitMatrix';
#  my $squaredMatrixFileName = 'postCutoffMatrix';
#  my $outputFileName = 'sampleGoldMatrix';
#`perl ../utils/datasetCreator/removeExplicit.pl`;

# next, run time slicing 
print "          TIME SLICING          \n";
`perl ../utils/runDiscovery.pl timeSlicingConfig > sampleTimeSliceOutput`;
print "LBD Time Slicing results output to sampleTimeSliceOutput\n";
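
The first step described in the comments above, removing pre-cutoff co-occurrences from the post-cutoff matrix to obtain a gold standard, can be sketched as follows. This is an illustration over in-memory hashes of hashes; the helper name removeKnownPairs is hypothetical, and removeExplicit.pl itself works on matrix files.

```perl
use strict;
use warnings;

# Return a copy of the post-cutoff matrix without any cell that also
# appears in the pre-cutoff matrix.
# (removeKnownPairs is a hypothetical helper, not part of ALBD.)
sub removeKnownPairs {
    my ($postRef, $preRef) = @_;
    my %gold;
    foreach my $cui1 (keys %{$postRef}) {
        foreach my $cui2 (keys %{ $postRef->{$cui1} }) {
            # skip pairs that were already known before the cutoff
            next if exists $preRef->{$cui1} && exists $preRef->{$cui1}{$cui2};
            $gold{$cui1}{$cui2} = $postRef->{$cui1}{$cui2};
        }
    }
    return \%gold;
}

my %preCutoff  = ( A => { B => 5 } );
my %postCutoff = ( A => { B => 8, C => 2 } );
my $gold = removeKnownPairs(\%postCutoff, \%preCutoff);
print join(',', sort keys %{ $gold->{A} }), "\n";
```

Only the pair A,C survives in this example, since A,B was already present before the cutoff.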

samples/timeSlicingConfig  view on Meta::CPAN

##############################################################################
#            Configuration File for Literature Based Discovery
##############################################################################
# All the options in this file are parsed and used as parameters for
# Literature Based Discovery
# Option keys are in <>'s, and values follow directly after with no space.
# As an example, the line "<rankingMethod>ll" will set the 'rankingMethod'
# parameter with a value of 'll' for literature based discovery
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
# lines starting with a # are skipped and may be used for comments

#----- Time Slicing Specific Parameters ------------------------

#Tell LBD to enter precision and recall mode (time slicing)
<precisionAndRecall_implicit>

samples/timeSlicingConfig  view on Meta::CPAN


# similar to target termcept groups, this restricts the acceptable target (C) 
# terms to terms within the semantic types listed
# See http://metampa.nlm.gov/Docs/SemanticTypes_2013AA.txt for a list
#<linkingAcceptGroups>clnd,chem

# Input file path for the explicit co-occurrence matrix used in LBD
<explicitInputFile>sampleExplicitMatrix

# Input file path for the gold standard matrix (matrix of true predictions)
# See utils/datasetCreator on how to make this
<goldInputFile>sampleGoldMatrix

# Input file path of the pre-computed predictions file
# This is optional, but can speed up computation time, since computing the 
# prediction matrix can be time consuming.
# The prediction matrix is all predicted discoveries. It is easiest to generate
# by running timeslicing first with the predictionsOutFile specified, then
# in subsequent runs using that as an input
# <predictionsInFile>predictionsMatrix

utils/datasetCreator/applyMaxThreshold.pl  view on Meta::CPAN

	or die ("ERROR: unable to open outputFile: $outputFile\n");

    print "ApplyingThreshold\n";
    #threshold each line of the file
    my ($cui1, $cui2, $val);
    while (my $line = <IN>) {
	#grab values 
	($cui1, $cui2, $val) = split(/\t/,$line);

	#skip if the count of either $cui1 or $cui2 is greater than the threshold
	# the counts in %count have been set already according to 
	# whether $applyToUnique or not
	if (${$countRef}{$cui1} > $maxThreshold 
	    || ${$countRef}{$cui2} > $maxThreshold) {
	    next;
	}
	else {
	    print OUT $line;
	}

    }
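
The two-pass idea behind this kind of thresholding, counting first and filtering second, can be sketched on in-memory lines. How the script fills its count hash is not shown in this excerpt, so summing each CUI's co-occurrences in either position is an assumption, and the helper name filterByMaxCount is hypothetical.

```perl
use strict;
use warnings;

# Keep only lines whose two CUIs each have a total co-occurrence count at
# or below $maxThreshold. (filterByMaxCount is a hypothetical helper;
# counting each CUI in both positions is an assumption.)
sub filterByMaxCount {
    my ($linesRef, $maxThreshold) = @_;

    # first pass: total count per CUI
    my %count;
    foreach my $line (@{$linesRef}) {
        my ($cui1, $cui2, $val) = split /\t/, $line;
        chomp $val;
        $count{$cui1} += $val;
        $count{$cui2} += $val;
    }

    # second pass: keep lines where neither CUI exceeds the threshold
    my @kept;
    foreach my $line (@{$linesRef}) {
        my ($cui1, $cui2) = split /\t/, $line;
        next if $count{$cui1} > $maxThreshold
             || $count{$cui2} > $maxThreshold;
        push @kept, $line;
    }
    return \@kept;
}

my @lines = ("C1\tC2\t10\n", "C1\tC3\t5\n", "C3\tC4\t1\n");
my $kept = filterByMaxCount(\@lines, 10);
print @{$kept};
```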

utils/datasetCreator/dataStats/metaAnalysis.pl  view on Meta::CPAN

# determines the number of rows, columns, vocabulary size, and total number of 
# co-occurrences of a co-occurrence file, or set of co-occurrence files
use strict;
use warnings;

#perform meta-analysis on a single co-occurrence matrix
&metaAnalysis('/home/henryst/lbdData/groupedData/1960_1989_window8_noOrder');

#perform meta-analysis on a date range of co-occurrence matrices in a folder
# this expects a folder to contain a co-occurrence matrix for every year
# specified within the date range
my $dataFolder = '/home/henryst/lbdData/dataByYear/1960_1989';
my $startYear = '1809';
my $endYear = '2015';
my $windowSize = 1;
my $statsOutFileName = '/home/henryst/lbdData/stats_window1';
&folderMetaAnalysis($startYear, $endYear, $windowSize, $statsOutFileName, $dataFolder);


#####################
# runs meta analysis on a set of files
sub folderMetaAnalysis {
    my $startYear = shift;
    my $endYear = shift;
    my $windowSize = shift;
    my $statsOutFileName= shift;
    my $dataFolder = shift;

    #Check on I/O
    open OUT, ">$statsOutFileName" 
	or die ("ERROR: unable to open stats out file: $statsOutFileName\n");

utils/datasetCreator/removeCUIPair.pl  view on Meta::CPAN

# removes the cui pair from the dataset
# used to remove Somatomedin C and Arginine from the 1960-1989 datasets
use strict;
use warnings;

my $cuiA = 'C0021665'; #somatomedin c
my $cuiB = 'C0003765'; #arginine
my $matrixFileName = '/home/henryst/lbdData/groupedData/1960_1989_window8_ordered';
my $matrixOutFileName = $matrixFileName.'_removed';
&removeCuiPair($cuiA, $cuiB, $matrixFileName, $matrixOutFileName);

print STDERR "DONE\n";

###########################################
# remove the CUI pair from the dataset
sub removeCuiPair {
    my $cuiA = shift;
    my $cuiB = shift;
    my $matrixFileName = shift;
    my $matrixOutFileName = shift;
    print STDERR "removing $cuiA,$cuiB from $matrixFileName\n";
    
    #open the in and out files
    open IN, $matrixFileName 
	or die ("ERROR: cannot open matrix in file: $matrixFileName\n");

utils/runDiscovery.pl  view on Meta::CPAN

."   runDiscovery lbdConfigFile\n";
;

#############################################################################
#                       Parse command line options 
#############################################################################
my $DEBUG = 0;      # Prints EVERYTHING. Use with small testing files.        
my $HELP = '';      # Prints usage and exits if true.
my $VERSION;

#set default param values
my %options = ();
$options{'assocConfig'}  = '';
$options{'interfaceConfig'} = '';

#grab all the options and set values
GetOptions( 'debug'             => \$DEBUG, 
            'help'              => \$HELP,
	    'version'           => \$VERSION,
            'assocConfig=s'     => \$options{'assocConfig'},
            'interfaceConfig=s' => \$options{'interfaceConfig'},
);
 
#Check for version or help
if ($VERSION) {
    print "current version is ".(ALBD->version())."\n";


