ALBD


INSTALL

     perl Makefile.PL

     make

     make test

     make install

DESCRIPTION
    ALBD provides a system for performing ABC co-occurrence literature
    based discovery (LBD), with a variety of options and association-based
    ranking methods.

REQUIREMENTS
    ALBD requires the following software packages and data:

  Programming Languages
         Perl (version 5.16.3 or better)

  CPAN Modules
         UMLS::Association
         UMLS::Interface

  Required for some Methods:
         MATLAB

MANIFEST

samples/sampleGoldMatrix
samples/timeSliceCuiList
samples/timeSlicingConfig
samples/configFileSamples/UMLSAssociationConfig
samples/configFileSamples/UMLSInterfaceConfig
samples/configFileSamples/UMLSInterfaceInternalConfig
t/test.t
t/goldSampleOutput
t/goldSampleTimeSliceOutput
utils/runDiscovery.pl
utils/datasetCreator/applyMaxThreshold.pl
utils/datasetCreator/applyMinThreshold.pl
utils/datasetCreator/applySemanticFilter.pl
utils/datasetCreator/combineCooccurrenceMatrices.pl
utils/datasetCreator/makeOrderNotMatter.pl
utils/datasetCreator/removeCUIPair.pl
utils/datasetCreator/removeExplicit.pl
utils/datasetCreator/testMatrixEquality.pl
utils/datasetCreator/dataStats/getCUICooccurrences.pl
utils/datasetCreator/dataStats/getMatrixStats.pl
utils/datasetCreator/dataStats/metaAnalysis.pl
utils/datasetCreator/fromMySQL/dbToTab.pl
utils/datasetCreator/fromMySQL/removeQuotes.pl
utils/datasetCreator/squaring/convertForSquaring_MATLAB.pl
utils/datasetCreator/squaring/squareMatrix.m
utils/datasetCreator/squaring/squareMatrix_partial.m
utils/datasetCreator/squaring/squareMatrix_perl.pl
META.yml                                 Module YAML meta-data (added by MakeMaker)
META.json                                Module JSON meta-data (added by MakeMaker)

README

NAME
    ALBD README

  SYNOPSIS
        This package consists of Perl modules along with supporting Perl
        programs that perform Literature Based Discovery (LBD). The core
        data from which LBD is performed are co-occurrence matrices
        generated from UMLS::Association. ALBD is based on the ABC
        co-occurrence model. Many options can be specified, and many
        ranking methods are available: novel ranking methods that use
        association measures, as well as frequency-based ranking methods.
        See samples/lbd for more info. ALBD can perform open and closed
        LBD as well as time slicing evaluation.

        ALBD requires UMLS::Association both to compute the co-occurrence
        database that the co-occurrence matrix is derived from and to rank
        the generated C terms.

        UMLS::Association requires the UMLS::Interface module to access 
        the Unified Medical Language System (UMLS) for semantic type filtering
        and to determine if CUIs are valid.

        The following sections describe the organization of this software
        package and how to use it. A few typical examples are given to
        illustrate the usage of the modules and the supporting utilities.
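
        As a toy illustration of the ABC model (a sketch for this
        document, not ALBD's internal code): if an A term co-occurs with
        B terms, and those B terms co-occur with C terms that never
        co-occur with A directly, then those C terms are candidate
        discoveries. With co-occurrence counts stored as a hash of
        hashes, that is:

            use strict;
            use warnings;

            # explicit co-occurrence counts: $explicit{$cui1}{$cui2} = count
            my %explicit = (
                'C0000001' => { 'C0000002' => 5 },
                'C0000002' => { 'C0000003' => 3, 'C0000004' => 2 },
            );
            my $aTerm = 'C0000001';

            # B terms are everything that co-occurs with the A term
            my @bTerms = keys %{ $explicit{$aTerm} };

            # C terms co-occur with a B term but not with A itself
            my %cTerms;
            foreach my $b (@bTerms) {
                foreach my $c (keys %{ $explicit{$b} || {} }) {
                    next if $c eq $aTerm || exists $explicit{$aTerm}{$c};
                    $cTerms{$c} += $explicit{$b}{$c};   # accumulate linking counts
                }
            }
            print "$_ => $cTerms{$_}\n" for sort keys %cTerms;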

config/association

##############################################################################
#                Configuration File for UMLS::Association
##############################################################################
# All the options in this file are put into an options hash and passed
# directly to UMLS::Association for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an
# example, the line "<database>bigrams" will pass the 'database' parameter
# with a value of 'bigrams' to the UMLS::Association options hash for its
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')

<database>CUI_Bigram
<hostname>192.168.24.89
<username>henryst
<password>OhFaht3eique
<socket>/var/run/mysqld.sock
<t>
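
For illustration, <key>value lines like the ones above might be parsed into
an options hash with a sketch such as the following (an assumption about the
format described in the comments, not ALBD's actual configuration parser):

    use strict;
    use warnings;

    my %options;
    while (my $line = <DATA>) {
        chomp $line;
        next if $line =~ /^\s*(#|$)/;     # skip comments and blank lines
        if ($line =~ /^<([^>]+)>(.*)$/) {
            $options{$1} = $2;            # flags like <t> get an empty value
        }
    }
    print "$_ => '$options{$_}'\n" for sort keys %options;

    __DATA__
    <database>CUI_Bigram
    <t>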

config/interface

#############################################################################
#                Configuration File for UMLS::Interface
############################################################################
# All the options in this file are put into an options hash and passed
# directly to UMLS::Interface for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an
# example, the line "<database>umls" will pass the 'database' parameter
# with a value of 'umls' to the UMLS::Interface options hash for its
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')

<t>
<config>interfaceConfig

<hostname>192.168.24.89
<username>henryst

lib/ALBD.pm

    use ALBD;
    my %options = ();
    $options{'lbdConfig'} = 'configFile';
    my $lbd = LiteratureBasedDiscovery->new(\%options);
    $lbd->performLBD();

=head1 ABSTRACT

      This package consists of Perl modules along with supporting Perl
      programs that perform Literature Based Discovery (LBD). The core
      data from which LBD is performed are co-occurrence matrices
      generated from UMLS::Association. ALBD is based on the ABC
      co-occurrence model. Many options can be specified, and many
      ranking methods are available: novel ranking methods that use
      association measures, as well as frequency-based ranking methods.
      See samples/lbd for more info. ALBD can perform open and closed
      LBD as well as time slicing evaluation.

=head1 INSTALL

To install the module, run the following magic commands:
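
For a MakeMaker-based distribution such as this one, the standard sequence
is:

    perl Makefile.PL
    make
    make test
    make install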

lib/LiteratureBasedDiscovery/Discovery.pm


package Discovery;
use strict;
use warnings;
use DBI;

######################################################################
#                        MySQL Notes
######################################################################
#TODO I think some of these notes should be elsewhere
# A Note about the database structure expected
#   Each LBD database is expected to have:
#   PreCutoff_N11
#   PostCutoff_N11
#   PreCutoff_Implicit
#
# Both PreCutoff_N11 and PostCutoff_N11 should
# be generated manually using CUI_Collector
# PreCutoff_Implicit is generated using the tableToSparseMatrix
# function here, which exports a sparse matrix. That matrix 
# can then be imported into MATLAB, squared, and reloaded into
# a MySQL database. With these three tables, LBD can be performed.
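#
# For reference, the sparse matrix files read and written throughout this
# package are tab-separated triples, one co-occurrence count per line,
# i.e. CUI1<tab>CUI2<tab>count (this is the format split on /\t/ in
# Rank.pm and TimeSlicing.pm), for example:
#   C0000001	C0000002	5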


######################################################################
#                          Description
######################################################################
# Discovery.pm - provides matrix operations on n11 counts from
# UMLS::Association
#
#TODO I think some of these notes should be elsewhere
# 'B' term filtering may be applied by removing elements from the 

lib/LiteratureBasedDiscovery/Rank.pm

    seek IN, 0,0; #reset to the beginning of the implicit file

    #iterate over the lines of interest, and grab values
    my %np1 = ();
    my %n11 = ();
    my $n1p = 0;
    my $npp = 0;
    my $matchedCuiB = 0;
    my ($cuiA, $cuiB, $val);
    while (my $line = <IN>) {
	#grab data from the line
	($cuiA, $cuiB, $val) = split(/\t/,$line);

	#see if updates are necessary
	if (exists $aTerms{$cuiA} || exists $cTerms{$cuiB}) {

	    #update npp
	    $npp += $val;
	    
	    #update np1
	    if (exists $cTerms{$cuiB}) {

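The excerpt above accumulates 2x2 contingency-table counts: n11 (joint
co-occurrences of a term pair), n1p and np1 (the marginal totals), and npp
(the grand total). As a sketch of how such counts feed an association
measure, here is the dice coefficient, one common measure (an illustration,
not ALBD's actual ranking code):

    use strict;
    use warnings;

    # dice coefficient from contingency counts:
    #   n11 = co-occurrences of the pair
    #   n1p = total co-occurrences of term 1, np1 = total for term 2
    sub dice {
        my ($n11, $n1p, $np1) = @_;
        return 0 if $n1p + $np1 == 0;
        return (2 * $n11) / ($n1p + $np1);
    }

    print dice(10, 40, 60), "\n";   # prints 0.2
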
lib/LiteratureBasedDiscovery/TimeSlicing.pm

	#see if this line contains a key that should be read in 
	if (exists $cuisToGrab{$cui1}) {

	    #add the value
	    if (!(defined $postCutoffMatrix{$cui1})) {
		my %newHash = ();
		$postCutoffMatrix{$cui1} = \%newHash;
	    }

	    #check to ensure that the column cui is in the 
	    #  vocabulary of the pre-cutoff dataset.
	    #  it is impossible to make predictions of words that
	    #  don't already exist
	    #NOTE: this assumes $explicitMatrixRef is a square 
	    #   matrix (so unordered)
	    if (exists ${$explicitMatrixRef}{$cui2}) {
		${$postCutoffMatrix{$cui1}}{$cui2} = $val;
	    }
	}
    }
    close IN;
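
Both Rank.pm and this file read such tab-separated triples into a hash of
hashes, as above. A minimal standalone reader in the same style (a sketch
assuming the CUI1<tab>CUI2<tab>count format, not one of the package's own
routines):

    use strict;
    use warnings;

    sub readSparseMatrix {
        my $fileName = shift;
        my %matrix;
        open my $fh, '<', $fileName
            or die ("ERROR: cannot open matrix file: $fileName\n");
        while (my $line = <$fh>) {
            chomp $line;
            my ($cui1, $cui2, $val) = split /\t/, $line;
            $matrix{$cui1}{$cui2} = $val;
        }
        close $fh;
        return \%matrix;
    }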

samples/configFileSamples/UMLSAssociationConfig

##############################################################################
#                Configuration File for UMLS::Association
##############################################################################
# All the options in this file are put into an options hash and passed
# directly to UMLS::Association for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an
# example, the line "<database>bigrams" will pass the 'database' parameter
# with a value of 'bigrams' to the UMLS::Association options hash for its
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
#
#
# See UMLS::Association for more details

# Database of Association Scores. Not used, but required to initialize
# UMLS::Association
<database>CUI_Bigram

# If the UMLS::Association Database is not installed on the local machine
# The following parameters may be needed to connect to the server
<hostname>192.168.00.00
<username>username
<password>password
<socket>/var/run/mysqld.sock

# prevents UMLS::Association from printing to the command line
<t>

samples/configFileSamples/UMLSInterfaceConfig

#############################################################################
#                Configuration File for UMLS::Interface
############################################################################
# All the options in this file are put into an options hash and passed
# directly to UMLS::Interface for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an
# example, the line "<database>umls" will pass the 'database' parameter
# with a value of 'umls' to the UMLS::Interface options hash for its
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
#
#
# See UMLS::Interface for more detail

# prevents UMLS::Interface from printing to the command line
<t>

samples/runSample.pl

#Demo file, showing how to run open discovery using the sample data, and how 
# to perform time slicing evaluation using the sample data

# run a sample lbd using the parameters in the lbd configuration file
print "\n           OPEN DISCOVERY          \n";
`perl ../utils/runDiscovery.pl lbdConfig`;
print "LBD Open discovery results output to sampleOutput\n\n";

# run a sample time slicing
# first remove the co-occurrences of the pre-cutoff matrix (in this case,
# the sampleExplicitMatrix) from the post-cutoff matrix. This generates a gold
# standard discovery matrix from which time slicing may be performed.
# This requires modifying removeExplicit.pl, which we have done for you.
# The variables for this example in removeExplicit.pl are:
#  my $matrixFileName = 'sampleExplicitMatrix';
#  my $squaredMatrixFileName = 'postCutoffMatrix';
#  my $outputFileName = 'sampleGoldMatrix';
#`perl ../utils/datasetCreator/removeExplicit.pl`;

# next, run time slicing 
print "          TIME SLICING          \n";
`perl ../utils/runDiscovery.pl timeSlicingConfig > sampleTimeSliceOutput`;
print "LBD Time Slicing results output to sampleTimeSliceOutput\n";

samples/timeSlicingConfig


# similar to the target accept groups, this restricts the acceptable
# linking (B) terms to terms within the semantic types listed
# See http://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt for a list
#<linkingAcceptGroups>clnd,chem

# Input file path for the explicit co-occurrence matrix used in LBD
<explicitInputFile>sampleExplicitMatrix

# Input file path for the gold standard matrix (matrix of true predictions)
# See utils/datasetCreator on how to make this
<goldInputFile>sampleGoldMatrix

# Input file path of the pre-computed predictions file
# This is optional, but can speed up computation, since computing the
# prediction matrix can be time consuming.
# The prediction matrix contains all predicted discoveries. It is easiest
# to generate by running time slicing first with the predictionsOutFile
# specified, then using that file as an input in subsequent runs.
# <predictionsInFile>predictionsMatrix

t/test.t

	last;
    }
}
ok($fAtKSame == 1, "Frequency at K Matches");

print "Done with Time Slicing Tests\n";



############################################################
#function to read in time slicing data values
sub readTimeSlicingData {
    my $fileName = shift;

    #read in the gold time slicing values
    my @APScores = ();
    my $MAP;
    my @PAtKScores = ();
    my @FAtKScores = ();
    open IN, "$fileName" 
    #open IN, './t/goldSampleTimeSliceOutput'

utils/datasetCreator/applySemanticFilter.pl

	    $matrixRef, $acceptTypesRef, $umls_interface);
    } else {
	Filters::semanticTypeFilter_rowsAndColumns(
	    $matrixRef, $acceptTypesRef, $umls_interface);
    }

    #output the matrix
    Discovery::outputMatrixToFile($outputFileName, $matrixRef);

    #TODO re-enable this and then try to run again
    #disconnect from the database and return
    #$umls_interface->disconnect();
}


# transforms the string of accept types or groups into a hash of accept TUIs
# input:  a string specifying whether linking or target types are being defined
# output: a hash of acceptable TUIs
sub getAcceptTypes {
    my $umls_interface = shift;
    my $acceptTypesString = shift;

utils/datasetCreator/combineCooccurrenceMatrices.pl

# different time slicing or discovery replication results. We ran CUI Collector
# separately for each year of the MetaMapped MEDLINE baseline and stored each
# co-occurrence matrix in a single folder "hadoopByYear/output/". That folder
# contained files named by the year and window size used (e.g. 1975_window8).
# The code may need to be modified slightly for other purposes.
use strict;
use warnings;
my $startYear;
my $endYear;
my $windowSize;
my $dataFolder;

#user input
$dataFolder = '/home/henryst/hadoopByYear/output/';
$startYear = '1983';
$endYear = '1985';
$windowSize = 8;
&combineFiles($startYear,$endYear,$windowSize);


#####################################################
####### Program Start ########
sub combineFiles {
    my $startYear = shift;

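The rest of combineFiles is not shown here; combining presumably amounts to
summing the counts of matching CUI pairs across the per-year files. A
minimal sketch of that accumulation (a hypothetical helper, assuming the
tab-separated triple format):

    use strict;
    use warnings;

    sub sumMatrixFiles {
        my @fileNames = @_;
        my %combined;
        foreach my $fileName (@fileNames) {
            open my $fh, '<', $fileName
                or die ("ERROR: unable to open $fileName\n");
            while (my $line = <$fh>) {
                chomp $line;
                my ($cui1, $cui2, $val) = split /\t/, $line;
                $combined{$cui1}{$cui2} += $val;   # sum matching pairs
            }
            close $fh;
        }
        return \%combined;
    }
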
utils/datasetCreator/dataStats/metaAnalysis.pl

# co-occurrences of a co-occurrence file, or set of co-occurrence files
use strict;
use warnings;

#perform meta-analysis on a single co-occurrence matrix
&metaAnalysis('/home/henryst/lbdData/groupedData/1960_1989_window8_noOrder');

#perform meta-analysis on a date range of co-occurrence matrices in a folder
# this expects a folder to contain a co-occurrence matrix for every year
# specified within the date range
my $dataFolder = '/home/henryst/lbdData/dataByYear/1960_1989';
my $startYear = '1809';
my $endYear = '2015';
my $windowSize = 1;
my $statsOutFileName = '/home/henryst/lbdData/stats_window1';
&folderMetaAnalysis($startYear, $endYear, $windowSize, $statsOutFileName, $dataFolder);


#####################
# runs meta analysis on a set of files
sub folderMetaAnalysis {
    my $startYear = shift;
    my $endYear = shift;
    my $windowSize = shift;
    my $statsOutFileName= shift;
    my $dataFolder = shift;

    #Check on I/O
    open OUT, ">$statsOutFileName" 
	or die ("ERROR: unable to open stats out file: $statsOutFileName\n");

    #print header row
    print OUT "year\tnumRows\tnumCols\tvocabularySize\tnumCooccurrences\n";

    #get stats for each file and output to file
    for(my $year = $startYear; $year <= $endYear; $year++) {
	print "reading $year\n";
	my $inFile = $dataFolder.$year.'_window'.$windowSize;
	if (open IN, $inFile) {
	    (my $numRows, my $numCols, my $vocabularySize, my $numCooccurrences)
		= &metaAnalysis($inFile);
	    print OUT "$year\t$numRows\t$numCols\t$vocabularySize\t$numCooccurrences\n"	
	}
	else {
	    #just skip the file
	    print "   ERROR: unable to open $inFile\n";
	}
    }
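
    #metaAnalysis itself falls outside this excerpt; a sketch of how the
    # four statistics in the header row could be computed from a
    # tab-separated CUI1, CUI2, count file (an illustration, not the
    # script's actual code):
    sub metaAnalysisSketch {
        my $inFile = shift;
        my (%rows, %cols, %vocab);
        my $numCooccurrences = 0;
        open my $fh, '<', $inFile
            or die ("ERROR: unable to open $inFile\n");
        while (my $line = <$fh>) {
            chomp $line;
            my ($cui1, $cui2, $val) = split /\t/, $line;
            $rows{$cui1} = 1;
            $cols{$cui2} = 1;
            $vocab{$cui1} = 1;
            $vocab{$cui2} = 1;
            $numCooccurrences += $val;
        }
        close $fh;
        return (scalar keys %rows, scalar keys %cols,
                scalar keys %vocab, $numCooccurrences);
    }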

utils/datasetCreator/fromMySQL/dbToTab.pl

#converts a MySQL database to a tab-separated file readable by LBD
#command is of the form:
#`mysql <DB_NAME> -e "SELECT * FROM N_11 INTO OUTFILE '<OUTPUT_FILE>' FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"`
#
# the following line is an example using a database with cui co-occurrence 
# counts from 1980 to 1984 with a window size of 1. The mysql database is 
# called 1980_1984_window1, and the output matrix file is called 
# 1980_1984_window1_data.txt
`mysql 1980_1984_window1 -e "SELECT * FROM N_11 INTO OUTFILE '1980_1984_window1_data.txt' FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"`;

utils/datasetCreator/fromMySQL/removeQuotes.pl

#removes quotes from a db to tab file

my $inFile = '1980_1984_window1_retest_data.txt';
my $outFile = '1980_1984_window1_restest_DELETEME';


open IN, $inFile or die ("unable to open inFile: $inFile\n");
open OUT, '>'.$outFile or die ("unable to open outFile: $outFile\n");

while (my $line  = <IN>) {
    $line =~ s/"//g;
    #print $line;
    print OUT $line;

utils/datasetCreator/removeCUIPair.pl

# removes the cui pair from the dataset
# used to remove Somatomedin C and Arginine from the 1960-1989 datasets
use strict;
use warnings;

my $cuiA = 'C0021665'; #somatomedin C
my $cuiB = 'C0003765'; #arginine
my $matrixFileName = '/home/henryst/lbdData/groupedData/1960_1989_window8_ordered';
my $matrixOutFileName = $matrixFileName.'_removed';
&removeCuiPair($cuiA, $cuiB, $matrixFileName, $matrixOutFileName);

print STDERR "DONE\n";

###########################################
# remove the CUI pair from the dataset
sub removeCuiPair {
    my $cuiA = shift;
    my $cuiB = shift;
    my $matrixFileName = shift;
    my $matrixOutFileName = shift;
    print STDERR "removing $cuiA,$cuiB from $matrixFileName\n";
    
    #open the in and out files
    open IN, $matrixFileName 
	or die ("ERROR: cannot open matrix in file: $matrixFileName\n");
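
    #the rest of the sub is not shown in this excerpt; a sketch of the
    # filtering step (an illustration, not necessarily the script's exact
    # code): copy every line except the one containing the removed pair
    open OUT, '>'.$matrixOutFileName
        or die ("ERROR: cannot open matrix out file: $matrixOutFileName\n");
    while (my $line = <IN>) {
        my ($cui1, $cui2) = split /\t/, $line;
        next if ($cui1 eq $cuiA && $cui2 eq $cuiB);   #skip the removed pair
        print OUT $line;
    }
    close IN;
    close OUT;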

utils/datasetCreator/squaring/squareMatrix.m

clear all;
close all;

sparseSquare('/home/henryst/lbdData/squaring/1975_1999_window8_noOrder','/home/henryst/lbdData/squaring/1975_1999_window8_noOrder_squared');

error('DONE!');


function sparseSquare(fileIn, fileOut)

    %load the data
    data = load(fileIn);
    disp('   loaded data');

    %convert to sparse
    vals = max(data);
    maxVal = vals(1);
    if (vals(2) > maxVal) 
       maxVal = vals(2); 
    end
    sp = sparse(data(:,1), data(:,2), data(:,3), maxVal, maxVal);
    clear data;
    clear vals;
    clear maxVal;
    disp('   converted to sparse');

    %square the matrix
    squared = sp*sp;
    clear sp;
    disp('    squared');

    %output the matrix
    [i,j,val] = find(squared);
    clear squared;
    disp('    values grabbed for output');
    data_dump = [i,j,val];
    clear i;
    clear j;
    clear val;
    disp('    values ready for output dump');
    fid = fopen(fileOut,'w');
    fprintf( fid,'%d %d %d\n', transpose(data_dump) );
    fclose(fid);
    disp('   DONE!');

end

utils/datasetCreator/squaring/squareMatrix_partial.m

sparseSquare_sectioned('/home/henryst/lbdData/squaring/1975_1999_window8_noOrder','/home/henryst/lbdData/squaring/1975_1999_window8_noOrder_squared_secondTry',increment);
error('DONE!');

function sparseSquare_sectioned(fileIn, fileOut, increment)
  disp(fileIn);

  %open, close, and clear the output file
  fid = fopen(fileOut,'w');
  fclose(fid);

  %load the data
  data = load(fileIn);
    
  vals = max(data);
  matrixSize = vals(1);
  if (vals(2) > matrixSize) 
    matrixSize = vals(2); 
  end
  disp('got matrixDim');
  clear data;

  %multiply each segment of the matrices
  for rowStartIndex = 1:increment:matrixSize
    rowEndIndex = rowStartIndex+increment-1;
    if (rowEndIndex > matrixSize) 
      rowEndIndex = matrixSize;
    end

    for colStartIndex = 1: increment: matrixSize
      colEndIndex = colStartIndex+increment-1;
      if (colEndIndex > matrixSize)
        colEndIndex = matrixSize;
      end

      dispString = [num2str(rowStartIndex), ',', num2str(rowEndIndex),' - ', num2str(colStartIndex),', ', num2str(colEndIndex),':'];
      disp(dispString)
      clear dispString;

      %load the data
      data = load(fileIn);
      disp('   loaded data');

      %convert to sparse
      vals = max(data);
      maxVal = vals(1);
      if (vals(2) > maxVal) 
        maxVal = vals(2); 
      end
      sp = sparse(data(:,1), data(:,2), data(:,3), maxVal, maxVal);
      clear data;
      clear vals;
      clear maxVal;
      disp('   converted to sparse');

      %grab a piece of the matrix
      sp1 = sparse(matrixSize,matrixSize);
      sp2 = sparse(matrixSize,matrixSize);
      sp1(rowStartIndex:rowEndIndex,:) = sp(rowStartIndex:rowEndIndex,:);
      sp2(:,colStartIndex:colEndIndex) = sp(:,colStartIndex:colEndIndex);
      clear sp;
    
      %square the matrix
      squared = sp1*sp2;
      clear sp1 sp2;
      disp('    squared');

      %output the matrix
      [i,j,val] = find(squared);
      clear squared;
      disp('    values grabbed for output');
      data_dump = [i,j,val];
      clear i;
      clear j;
      clear val;
      disp('    values ready for output dump');
      fid = fopen(fileOut,'a+');
      fprintf( fid,'%d %d %d\n', transpose(data_dump) );
      clear data_dump;
      fclose(fid);
      disp('   values output');
    end
  end
end
