DESCRIPTION
ALBD provides a system for performing ABC co-occurrence literature based
discovery (LBD) using a variety of options and association-based ranking
methods.
REQUIREMENTS
ALBD requires the following software packages and data:
Programming Languages
Perl (version 5.16.3 or better)
CPAN Modules
UMLS::Association
UMLS::Interface
Required for some Methods:
MATLAB
samples/sampleGoldMatrix
samples/timeSliceCuiList
samples/timeSlicingConfig
samples/configFileSamples/UMLSAssociationConfig
samples/configFileSamples/UMLSInterfaceConfig
samples/configFileSamples/UMLSInterfaceInternalConfig
t/test.t
t/goldSampleOutput
t/goldSampleTimeSliceOutput
utils/runDiscovery.pl
utils/datasetCreator/applyMaxThreshold.pl
utils/datasetCreator/applyMinThreshold.pl
utils/datasetCreator/applySemanticFilter.pl
utils/datasetCreator/combineCooccurrenceMatrices.pl
utils/datasetCreator/makeOrderNotMatter.pl
utils/datasetCreator/removeCUIPair.pl
utils/datasetCreator/removeExplicit.pl
utils/datasetCreator/testMatrixEquality.pl
utils/datasetCreator/dataStats/getCUICooccurrences.pl
utils/datasetCreator/dataStats/getMatrixStats.pl
utils/datasetCreator/dataStats/metaAnalysis.pl
utils/datasetCreator/fromMySQL/dbToTab.pl
utils/datasetCreator/fromMySQL/removeQuotes.pl
utils/datasetCreator/squaring/convertForSquaring_MATLAB.pl
utils/datasetCreator/squaring/squareMatrix.m
utils/datasetCreator/squaring/squareMatrix_partial.m
utils/datasetCreator/squaring/squareMatrix_perl.pl
META.yml Module YAML meta-data (added by MakeMaker)
META.json Module JSON meta-data (added by MakeMaker)
NAME
ALBD README
SYNOPSIS
This package consists of Perl modules along with supporting Perl
programs that perform Literature Based Discovery (LBD). The core
data from which LBD is performed are co-occurrence matrices
generated from UMLS::Association. ALBD is based on the ABC
co-occurrence model. Many options can be specified, and many
ranking methods are available. Novel ranking methods that use
association measures are available, as well as frequency-based
ranking methods. See samples/lbd for more information. ALBD can
perform open and closed LBD as well as time slicing evaluation.
ALBD requires UMLS::Association both to compute the co-occurrence
database that the co-occurrence matrix is derived from, and to rank
the generated C terms.
UMLS::Association requires the UMLS::Interface module to access
the Unified Medical Language System (UMLS) for semantic type filtering
and to determine if CUIs are valid.
The following sections describe the organization of this software
package and how to use it. A few typical examples are given to
illustrate the usage of the modules and the supporting utilities.
config/association
##############################################################################
# Configuration File for UMLS::Association
##############################################################################
# All the options in this file are put into an options hash and
# passed directly to UMLS::Association for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an example,
# the line "<database>bigrams" will pass the 'database' parameter with a
# value of 'bigrams' to the UMLS::Association options hash for its
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
<database>CUI_Bigram
<hostname>192.168.24.89
<username>henryst
<password>OhFaht3eique
<socket>/var/run/mysqld.sock
<t>
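For reference, here is a minimal sketch of how a configuration file in this <key>value format could be read into an options hash. The parseConfig subroutine below is illustrative only, not ALBD's actual parser.

# Minimal sketch (illustrative, not ALBD's actual parser): read a
# <key>value config file into an options hash. Flag-style options
# such as <t> get a defined but empty value.
use strict;
use warnings;

sub parseConfig {
    my $fileName = shift;
    my %options = ();
    open my $fh, '<', $fileName
        or die "ERROR: unable to open config file: $fileName\n";
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*#/ || $line =~ /^\s*$/; #skip comments and blanks
        if ($line =~ /^<([^>]+)>(.*)$/) {
            $options{$1} = $2;
        }
    }
    close $fh;
    return \%options;
}

# usage: my $optionsRef = parseConfig('config/association');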
config/interface
#############################################################################
# Configuration File for UMLS::Interface
############################################################################
# All the options in this file are put into an options hash and
# passed directly to UMLS::Interface for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an example,
# the line "<database>umls" will pass the 'database' parameter with a value
# of 'umls' to the UMLS::Interface options hash for its initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
<t>
<config>interfaceConfig
<hostname>192.168.24.89
<username>henryst
lib/ALBD.pm
use ALBD;
my %options = ();
$options{'lbdConfig'} = 'configFile';
my $lbd = LiteratureBasedDiscovery->new(\%options);
$lbd->performLBD();
=head1 ABSTRACT
This package consists of Perl modules along with supporting Perl
programs that perform Literature Based Discovery (LBD). The core
data from which LBD is performed are co-occurrence matrices
generated from UMLS::Association. ALBD is based on the ABC
co-occurrence model. Many options can be specified, and many
ranking methods are available. Novel ranking methods that use
association measures are available, as well as frequency-based
ranking methods. See samples/lbd for more information. ALBD can
perform open and closed LBD as well as time slicing evaluation.
=head1 INSTALL
To install the module, run the following magic commands:
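perl Makefile.PL
make
make test
make install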
lib/LiteratureBasedDiscovery/Discovery.pm
package Discovery;
use strict;
use warnings;
use DBI;
######################################################################
# MySQL Notes
######################################################################
#TODO I think some of these notes should be elsewhere
# A Note about the database structure expected
# Each LBD database is expected to have:
# PreCutoff_N11
# PostCutoff_N11
# PreCutoff_Implicit
#
# Both PreCutoff_N11 and PostCutoff_N11 should
# be generated manually using CUI_Collector
# PreCutoff_Implicit is generated using the tableToSparseMatrix
# function here, which exports a sparse matrix. That matrix
# can then be imported into matlab, squared, and reloaded into
# a mysql database. With these 3 tables LBD can be performed
######################################################################
# Description
######################################################################
# Discovery.pm - provides matrix operations on n11 counts from
# UMLS::Association
#
#TODO I think some of these notes should be elsewhere
# 'B' term filtering may be applied by removing elements from the
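The notes above describe exporting the PreCutoff_N11 table as a sparse matrix file for external squaring. Below is a minimal sketch of such an export, assuming illustrative connection details and an illustrative (cui_1, cui_2, n_11) column layout; the actual export is done by the tableToSparseMatrix function mentioned above.

# Minimal sketch (illustrative): dump an N11 co-occurrence table to a
# file of "cui1\tcui2\tcount" triples. Table and column names here are
# assumptions, not confirmed by this excerpt.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:mysql:database=lbdDB;host=localhost',
                       'username', 'password', { RaiseError => 1 });
my $sth = $dbh->prepare('SELECT cui_1, cui_2, n_11 FROM PreCutoff_N11');
$sth->execute();
open my $out, '>', 'preCutoffSparseMatrix'
    or die "ERROR: unable to open output file\n";
while (my ($cui1, $cui2, $n11) = $sth->fetchrow_array()) {
    print $out "$cui1\t$cui2\t$n11\n";
}
close $out;
$dbh->disconnect();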
lib/LiteratureBasedDiscovery/Rank.pm
seek IN, 0,0; #reset to the beginning of the implicit file
#iterate over the lines of interest, and grab values
my %np1 = ();
my %n11 = ();
my $n1p = 0;
my $npp = 0;
my $matchedCuiB = 0;
my ($cuiA, $cuiB, $val);
while (my $line = <IN>) {
#grab data from the line
($cuiA, $cuiB, $val) = split(/\t/,$line);
#see if updates are necessary
if (exists $aTerms{$cuiA} || exists $cTerms{$cuiB}) {
#update npp
$npp += $val;
#update np1
if (exists $cTerms{$cuiB}) {
lib/LiteratureBasedDiscovery/TimeSlicing.pm
#see if this line contains a key that should be read in
if (exists $cuisToGrab{$cui1}) {
#add the value
if (!(defined $postCutoffMatrix{$cui1})) {
my %newHash = ();
$postCutoffMatrix{$cui1} = \%newHash;
}
#check to ensure that the column cui is in the
# vocabulary of the pre-cutoff dataset.
# it is impossible to make predictions of words that
# don't already exist
#NOTE: this assumes $explicitMatrixRef is a square
# matrix (so unordered)
if (exists ${$explicitMatrixRef}{$cui2}) {
${$postCutoffMatrix{$cui1}}{$cui2} = $val;
}
}
}
close IN;
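The excerpt above builds the post-cutoff matrix as a hash of hash references, keyed first by row CUI and then by column CUI. Here is a minimal sketch of loading an entire sparse matrix file into that structure, assuming the tab-separated "cui1 cui2 value" line format used throughout these utilities (readSparseMatrix is an illustrative name):

# Minimal sketch (illustrative): load a "cui1\tcui2\tvalue" sparse
# matrix file into a hash of hash references.
sub readSparseMatrix {
    my $fileName = shift;
    my %matrix = ();
    open my $fh, '<', $fileName
        or die "ERROR: cannot open matrix file: $fileName\n";
    while (my $line = <$fh>) {
        chomp $line;
        my ($cui1, $cui2, $val) = split /\t/, $line;
        #autovivifies the row hash on first use
        ${$matrix{$cui1}}{$cui2} = $val;
    }
    close $fh;
    return \%matrix;
}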
samples/configFileSamples/UMLSAssociationConfig
##############################################################################
# Configuration File for UMLS::Association
##############################################################################
# All the options in this file are put into an options hash and
# passed directly to UMLS::Association for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an example,
# the line "<database>bigrams" will pass the 'database' parameter with a
# value of 'bigrams' to the UMLS::Association options hash for its
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
#
#
# See UMLS::Association for more detail
# Database of Association Scores. Not used, but required to initialize
# UMLS::Association
<database>CUI_Bigram
# If the UMLS::Association Database is not installed on the local machine
# The following parameters may be needed to connect to the server
<hostname>192.168.00.00
<username>username
<password>password
<socket>/var/run/mysqld.sock
# prevents UMLS::Association from printing to the command line
<t>
samples/configFileSamples/UMLSInterfaceConfig
#############################################################################
# Configuration File for UMLS::Interface
############################################################################
# All the options in this file are put into an options hash and
# passed directly to UMLS::Interface for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an example,
# the line "<database>umls" will pass the 'database' parameter with a value
# of 'umls' to the UMLS::Interface options hash for its initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
#
#
# See UMLS::Interface for more detail
# prevents UMLS::Interface from printing to the command line
<t>
samples/runSample.pl
#Demo file, showing how to run open discovery using the sample data, and how
# to perform time slicing evaluation using the sample data
# run a sample lbd using the parameters in the lbd configuration file
print "\n OPEN DISCOVERY \n";
`perl ../utils/runDiscovery.pl lbdConfig`;
print "LBD Open discovery results output to sampleOutput\n\n";
# run a sample time slicing
# first remove the co-occurrences of the pre-cutoff matrix (in this case
# the sampleExplicitMatrix) from the post-cutoff matrix. This generates a gold
# standard discovery matrix from which time slicing may be performed.
# This requires modifying removeExplicit.pl, which we have done for you.
# The variables for this example in removeExplicit.pl are:
# my $matrixFileName = 'sampleExplicitMatrix';
# my $squaredMatrixFileName = 'postCutoffMatrix';
# my $outputFileName = 'sampleGoldMatrix';
#`perl ../utils/datasetCreator/removeExplicit.pl`;
# next, run time slicing
print " TIME SLICING \n";
`perl ../utils/runDiscovery.pl timeSlicingConfig > sampleTimeSliceOutput`;
print "LBD Time Slicing results output to sampleTimeSliceOutput\n";
samples/timeSlicingConfig
# similar to the target term accept groups, this restricts the acceptable
# linking (B) terms to terms within the semantic types listed
# See http://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt for a list
#<linkingAcceptGroups>clnd,chem
# Input file path for the explicit co-occurrence matrix used in LBD
<explicitInputFile>sampleExplicitMatrix
# Input file path for the gold standard matrix (matrix of true predictions)
# See utils/datasetCreator on how to make this
<goldInputFile>sampleGoldMatrix
# Input file path of the pre-computed predictions file
# This is optional, but can speed up computation time, since computing the
# prediction matrix can be time consuming.
# The prediction matrix is all predicted discoveries. It is easiest to generate
# by running timeslicing first with the predictionsOutFile specified, then
# in subsequent runs using that as an input
# <predictionsInFile>predictionsMatrix
last;
}
}
ok($fAtKSame == 1, "Frequency at K Matches");
print "Done with Time Slicing Tests\n";
############################################################
#function to read in time slicing data values
sub readTimeSlicingData {
my $fileName = shift;
#read in the gold time slicing values
my @APScores = ();
my $MAP;
my @PAtKScores = ();
my @FAtKScores = ();
open IN, "$fileName"
#open IN, './t/goldSampleTimeSliceOutput'
utils/datasetCreator/applySemanticFilter.pl
$matrixRef, $acceptTypesRef, $umls_interface);
} else {
Filters::semanticTypeFilter_rowsAndColumns(
$matrixRef, $acceptTypesRef, $umls_interface);
}
#output the matrix
Discovery::outputMatrixToFile($outputFileName, $matrixRef);
#TODO re-enable this and then try to run again
#disconnect from the database and return
#$umls_interface->disconnect();
}
# transforms the string of accept types or groups into a hash of accept TUIs
# input: a string specifying whether linking or target types are being defined
# output: a hash of acceptable TUIs
sub getAcceptTypes {
my $umls_interface = shift;
my $acceptTypesString = shift;
utils/datasetCreator/combineCooccurrenceMatrices.pl
# different time slicing or discovery replication results. We ran CUI Collector
# separately for each year of the MetaMapped MEDLINE baseline and stored each
# co-occurrence matrix in a single folder "hadoopByYear/output/". That folder
# contained files named by the year and window size used (e.g. 1975_window8).
# The code may need to be modified slightly for other purposes.
use strict;
use warnings;
my $startYear;
my $endYear;
my $windowSize;
my $dataFolder;
#user input
$dataFolder = '/home/henryst/hadoopByYear/output/';
$startYear = '1983';
$endYear = '1985';
$windowSize = 8;
&combineFiles($startYear,$endYear,$windowSize);
#####################################################
####### Program Start ########
sub combineFiles {
my $startYear = shift;
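The rest of combineFiles is truncated in this excerpt. Conceptually it sums the per-year sparse matrix files cell by cell; a minimal sketch of that summation follows, assuming the same tab-separated triple format (sumMatrixFiles is an illustrative reimplementation, not the original code):

# Minimal sketch (illustrative): sum several "cui1\tcui2\tcount" sparse
# matrix files into one combined matrix file.
sub sumMatrixFiles {
    my ($outFileName, @inFileNames) = @_;
    my %combined = ();
    foreach my $inFileName (@inFileNames) {
        open my $in, '<', $inFileName
            or die "ERROR: unable to open $inFileName\n";
        while (my $line = <$in>) {
            chomp $line;
            my ($cui1, $cui2, $count) = split /\t/, $line;
            $combined{"$cui1\t$cui2"} += $count;
        }
        close $in;
    }
    open my $out, '>', $outFileName
        or die "ERROR: unable to open $outFileName\n";
    foreach my $pair (keys %combined) {
        print $out "$pair\t$combined{$pair}\n";
    }
    close $out;
}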
utils/datasetCreator/dataStats/metaAnalysis.pl
# co-occurrences of a co-occurrence file, or set of co-occurrence files
use strict;
use warnings;
#perform meta-analysis on a single co-occurrence matrix
&metaAnalysis('/home/henryst/lbdData/groupedData/1960_1989_window8_noOrder');
#perform meta-analysis on a date range of co-occurrence matrices in a folder
# this expects a folder to contain a co-occurrence matrix for every year
# specified within the date range
my $dataFolder = '/home/henryst/lbdData/dataByYear/1960_1989';
my $startYear = '1809';
my $endYear = '2015';
my $windowSize = 1;
my $statsOutFileName = '/home/henryst/lbdData/stats_window1';
&folderMetaAnalysis($startYear, $endYear, $windowSize, $statsOutFileName, $dataFolder);
#####################
# runs meta analysis on a set of files
sub folderMetaAnalysis {
my $startYear = shift;
my $endYear = shift;
my $windowSize = shift;
my $statsOutFileName= shift;
my $dataFolder = shift;
#Check on I/O
open OUT, ">$statsOutFileName"
or die ("ERROR: unable to open stats out file: $statsOutFileName\n");
#print header row
print OUT "year\tnumRows\tnumCols\tvocabularySize\tnumCooccurrences\n";
#get stats for each file and output to file
for(my $year = $startYear; $year <= $endYear; $year++) {
print "reading $year\n";
my $inFile = $dataFolder.$year.'_window'.$windowSize;
if (open IN, $inFile) {
(my $numRows, my $numCols, my $vocabularySize, my $numCooccurrences)
= &metaAnalysis($inFile);
print OUT "$year\t$numRows\t$numCols\t$vocabularySize\t$numCooccurrences\n"
}
else {
#just skip the file
print " ERROR: unable to open $inFile\n";
}
}
utils/datasetCreator/fromMySQL/dbToTab.pl
#converts a mysql database to a tab separated file readable by LBD
#command is of the form:
#`mysql <DB_NAME> -e "SELECT * FROM N_11 INTO OUTFILE '<OUTPUT_FILE>' FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"`
#
# the following line is an example using a database with cui co-occurrence
# counts from 1980 to 1984 with a window size of 1. The mysql database is
# called 1980_1984_window1, and the output matrix file is called
# 1980_1984_window1_data.txt
`mysql 1980_1984_window1 -e "SELECT * FROM N_11 INTO OUTFILE '1980_1984_window1_data.txt' FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"`;
utils/datasetCreator/fromMySQL/removeQuotes.pl
#removes quotes from a db to tab file
my $inFile = '1980_1984_window1_retest_data.txt';
my $outFile = '1980_1984_window1_restest_DELETEME';
open IN, $inFile or die ("unable to open inFile: $inFile\n");
open OUT, '>'.$outFile or die ("unable to open outFile: $outFile\n");
while (my $line = <IN>) {
$line =~ s/"//g;
#print $line;
print OUT $line;
utils/datasetCreator/removeCUIPair.pl
# removes the cui pair from the dataset
# used to remove Somatomedin C and Arginine from the 1960-1989 datasets
use strict;
use warnings;
my $cuiA = 'C0021665'; #somatomedin c
my $cuiB = 'C0003765'; #arginine
my $matrixFileName = '/home/henryst/lbdData/groupedData/1960_1989_window8_ordered';
my $matrixOutFileName = $matrixFileName.'_removed';
&removeCuiPair($cuiA, $cuiB, $matrixFileName, $matrixOutFileName);
print STDERR "DONE\n";
###########################################
# remove the CUI pair from the dataset
sub removeCuiPair {
my $cuiA = shift;
my $cuiB = shift;
my $matrixFileName = shift;
my $matrixOutFileName = shift;
print STDERR "removing $cuiA,$cuiB from $matrixFileName\n";
#open the in and out files
open IN, $matrixFileName
or die ("ERROR: cannot open matrix in file: $matrixFileName\n");
utils/datasetCreator/squaring/squareMatrix.m
clear all;
close all;
sparseSquare('/home/henryst/lbdData/squaring/1975_1999_window8_noOrder','/home/henryst/lbdData/squaring/1975_1999_window8_noOrder_squared');
error('DONE!');
function sparseSquare(fileIn, fileOut)
%load the data
data = load(fileIn);
disp(' loaded data');
%convert to sparse
vals = max(data);
maxVal = vals(1);
if (vals(2) > maxVal)
maxVal = vals(2);
end
sp = sparse(data(:,1), data(:,2), data(:,3), maxVal, maxVal);
clear data;
clear vals;
clear maxVal;
disp(' converted to sparse');
%square the matrix
squared = sp*sp;
clear sp;
disp(' squared');
%output the matrix
[i,j,val] = find(squared);
clear squared;
disp(' values grabbed for output');
data_dump = [i,j,val];
clear i;
clear j;
clear val;
disp(' values ready for output dump');
fid = fopen(fileOut,'w');
fprintf( fid,'%d %d %d\n', transpose(data_dump) );
fclose(fid);
disp(' DONE!');
end
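For matrices small enough to square in memory without MATLAB, the same computation can be done in pure Perl. The sketch below is an illustrative reimplementation under that assumption, not the contents of squareMatrix_perl.pl:

# Minimal sketch (illustrative): square a sparse matrix stored as
# "row\tcol\tvalue" triples. squared(i,k) = sum over j of m(i,j)*m(j,k).
sub squareSparseMatrix {
    my ($inFileName, $outFileName) = @_;
    my %matrix = (); #row -> {col -> value}
    open my $in, '<', $inFileName or die "cannot open $inFileName\n";
    while (my $line = <$in>) {
        chomp $line;
        my ($row, $col, $val) = split /\t/, $line;
        ${$matrix{$row}}{$col} = $val;
    }
    close $in;
    open my $out, '>', $outFileName or die "cannot open $outFileName\n";
    foreach my $i (keys %matrix) {
        my %rowResult = ();
        foreach my $j (keys %{$matrix{$i}}) {
            next unless exists $matrix{$j};
            foreach my $k (keys %{$matrix{$j}}) {
                $rowResult{$k} += ${$matrix{$i}}{$j} * ${$matrix{$j}}{$k};
            }
        }
        foreach my $k (keys %rowResult) {
            print $out "$i\t$k\t$rowResult{$k}\n";
        }
    }
    close $out;
}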
utils/datasetCreator/squaring/squareMatrix_partial.m
sparseSquare_sectioned('/home/henryst/lbdData/squaring/1975_1999_window8_noOrder','/home/henryst/lbdData/squaring/1975_1999_window8_noOrder_squared_secondTry',increment);
error('DONE!');
function sparseSquare_sectioned(fileIn, fileOut, increment)
disp(fileIn);
%open, close, and clear the output file
fid = fopen(fileOut,'w');
fclose(fid);
%load the data
data = load(fileIn);
vals = max(data);
matrixSize = vals(1);
if (vals(2) > matrixSize)
matrixSize = vals(2);
end
disp('got matrixDim');
clear data;
%multiply each segment of the matrices
for rowStartIndex = 1:increment:matrixSize
rowEndIndex = rowStartIndex+increment-1;
if (rowEndIndex > matrixSize)
rowEndIndex = matrixSize;
end
for colStartIndex = 1: increment: matrixSize
colEndIndex = colStartIndex+increment-1;
if (colEndIndex > matrixSize)
colEndIndex = matrixSize;
end
dispString = [num2str(rowStartIndex), ',', num2str(rowEndIndex),' - ', num2str(colStartIndex),', ', num2str(colEndIndex),':'];
disp(dispString)
clear dispString;
%load the data
data = load(fileIn);
disp(' loaded data');
%convert to sparse
vals = max(data);
maxVal = vals(1);
if (vals(2) > maxVal)
maxVal = vals(2);
end
sp = sparse(data(:,1), data(:,2), data(:,3), maxVal, maxVal);
clear data;
clear vals;
clear maxVal;
disp(' converted to sparse');
%grab a piece of the matrix
sp1 = sparse(matrixSize,matrixSize);
sp2 = sparse(matrixSize,matrixSize);
sp1(rowStartIndex:rowEndIndex,:) = sp(rowStartIndex:rowEndIndex,:);
sp2(:,colStartIndex:colEndIndex) = sp(:,colStartIndex:colEndIndex);
clear sp;
%square the matrix
squared = sp1*sp2;
clear sp1 sp2;
disp(' squared');
%output the matrix
[i,j,val] = find(squared);
clear squared;
disp(' values grabbed for output');
data_dump = [i,j,val];
clear i;
clear j;
clear val;
disp(' values ready for output dump');
fid = fopen(fileOut,'a+');
fprintf( fid,'%d %d %d\n', transpose(data_dump) );
clear data_dump;
fclose(fid);
disp(' values output');
end
end
end