perl Makefile.PL
make
make test
make install
DESCRIPTION
ALBD provides a system for performing ABC co-occurrence literature based
discovery using a variety of options, and association-based ranking
methods
REQUIREMENTS
ALBD requires the following software packages and data:
Programming Languages
Perl (version 5.16.3 or better)
CPAN Modules
UMLS::Association
UMLS::Interface
Required for some Methods:
MATLAB
If you have superuser access, or have configured CPAN for local
install, you can install each of these via:
perl -MCPAN -e shell
> install <packageName>
UMLS::Interface
The core UMLS package provides a dictionary from concept unique
identifiers (CUIs) to their meanings in the Unified Medical Language
System. Refer to the UMLS::Interface documentation for how to install
the UMLS database on your system.
The package is freely available at:
<http://search.cpan.org/dist/UMLS-Interface/>
UMLS::Association
Used to calculate the association scores used in most of the ranking methods.
The package is freely available at:
<http://search.cpan.org/dist/UMLS-Association/>
For details on make test exit codes, see:
<http://search.cpan.org/dist/Test-Simple/lib/Test/Builder.pm#EXIT_CODES>
Stage 4: Create a Co-occurrence Matrix
ALBD requires that a co-occurrence matrix of CUIs has been created. This
matrix is stored as a flat file in a sparse matrix format, such that
each line contains three tab-separated values: cui_1, cui_2, and n_11
(the count of their co-occurrences). Any matrix in that format is
acceptable; however, the intended method of matrix generation is to
convert a UMLS::Association database into a flat matrix file. These
databases are created using the CUICollector tool of UMLS::Association,
which is run over the MetaMapped Medline baseline. With that database,
run utils/datasetCreator/fromMySQL/dbToTab.pl to convert the desired
database into a matrix file. Note that the code in dbToTab.pl is just a
sample mysql command; if the input database was created by another
method, a different command may be needed. As long as the resulting
co-occurrence matrix is in the correct format, LBD may be run on it.
This allows flexibility in where co-occurrence information comes from.
Note: utils/datasetCreator/fromMySQL/removeQuotes.pl may need to be run
on the resulting tab-separated file if quotes are included in the
resulting co-occurrence matrix file.
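To make the expected format concrete, here is a minimal Perl sketch that reads such a file into the hash-of-hashes sparse representation used throughout ALBD. The subroutine name here is illustrative; the package itself provides Discovery::fileToSparseMatrix for this purpose.

```perl
use strict;
use warnings;

# Read a tab-separated co-occurrence file (cui_1 <TAB> cui_2 <TAB> n_11)
# into a hash of hashes: $matrix{$cui1}{$cui2} = $count
sub readCooccurrenceMatrix {
    my ($fileName) = @_;
    open(my $in, '<', $fileName)
        or die "unable to open file: $fileName\n";
    my %matrix;
    while (my $line = <$in>) {
        chomp $line;
        my ($cui1, $cui2, $count) = split(/\t/, $line);
        $matrix{$cui1}{$cui2} = $count;
    }
    close $in;
    return \%matrix;
}
```

For example, a file containing the line "C0000000&lt;TAB&gt;C1111111&lt;TAB&gt;10" loads such that $matrix->{'C0000000'}{'C1111111'} is 10.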
Stage 5: Set up Dummy UMLS::Association Database
UMLS::Association requires a database in the correct format that it can
connect to. Although this database is not required for ALBD (since
co-occurrence data is loaded from a co-occurrence matrix), it is
required to run UMLS::Association. If you ran UMLS::Association to
generate a co-occurrence matrix, you should be fine. Otherwise you will
need to create a dummy database that it can connect to. This can be done
in a few steps:
1) Open MySQL by typing mysql at the terminal.
2) Create the default database in the correct format by typing:
   CREATE DATABASE cuicounts;
   USE cuicounts;
   CREATE TABLE N_11(cui_1 CHAR(10), cui_2 CHAR(10), n_11 BIGINT(20));
CONTACT US
If you have any trouble installing or using ALBD, please contact us
directly:
Sam Henry: henryst at vcu.edu
Bridget McInnes: btmcinnes at vcu.edu
samples/sampleGoldMatrix
samples/timeSliceCuiList
samples/timeSlicingConfig
samples/configFileSamples/UMLSAssociationConfig
samples/configFileSamples/UMLSInterfaceConfig
samples/configFileSamples/UMLSInterfaceInternalConfig
t/test.t
t/goldSampleOutput
t/goldSampleTimeSliceOutput
utils/runDiscovery.pl
utils/datasetCreator/applyMaxThreshold.pl
utils/datasetCreator/applyMinThreshold.pl
utils/datasetCreator/applySemanticFilter.pl
utils/datasetCreator/combineCooccurrenceMatrices.pl
utils/datasetCreator/makeOrderNotMatter.pl
utils/datasetCreator/removeCUIPair.pl
utils/datasetCreator/removeExplicit.pl
utils/datasetCreator/testMatrixEquality.pl
utils/datasetCreator/dataStats/getCUICooccurrences.pl
utils/datasetCreator/dataStats/getMatrixStats.pl
utils/datasetCreator/dataStats/metaAnalysis.pl
utils/datasetCreator/fromMySQL/dbToTab.pl
utils/datasetCreator/fromMySQL/removeQuotes.pl
utils/datasetCreator/squaring/convertForSquaring_MATLAB.pl
utils/datasetCreator/squaring/squareMatrix.m
utils/datasetCreator/squaring/squareMatrix_partial.m
utils/datasetCreator/squaring/squareMatrix_perl.pl
META.yml Module YAML meta-data (added by MakeMaker)
META.json Module JSON meta-data (added by MakeMaker)
NAME
ALBD README
SYNOPSIS
This package consists of Perl modules, along with supporting Perl
programs, that perform Literature Based Discovery (LBD). The core
data from which LBD is performed are co-occurrence matrices
generated from UMLS::Association. ALBD is based on the ABC
co-occurrence model. Many options can be specified, and many
ranking methods are available, including novel ranking methods that
use association measures as well as frequency based ranking methods.
See samples/lbd for more info. ALBD can perform open and closed LBD
as well as time slicing evaluation.
ALBD requires UMLS::Association both to compute the co-occurrence
database from which the co-occurrence matrix is derived, and to rank
the generated C terms.
UMLS::Association requires the UMLS::Interface module to access
the Unified Medical Language System (UMLS) for semantic type filtering
and to determine if CUIs are valid.
The following sections describe the organization of this software
package and how to use it. A few typical examples are given to help
clearly understand the usage of the modules and the supporting
utilities.
Details of these can be found in the ExtUtils::MakeMaker documentation.
However, it is highly recommended that you not change other parameters
unless you know what you're doing.
CO-OCCURRENCE MATRIX SETUP
ALBD requires that a co-occurrence matrix of CUIs has been created. This
matrix is stored as a flat file in a sparse matrix format, such that
each line contains three tab-separated values: cui_1, cui_2, and n_11
(the count of their co-occurrences). Any matrix in that format is
acceptable; however, the intended method of matrix generation is to
convert a UMLS::Association database into a flat matrix file. These
databases are created using the CUICollector tool of UMLS::Association,
which is run over the MetaMapped Medline baseline. With that database,
run utils/datasetCreator/fromMySQL/dbToTab.pl to convert the desired
database into a matrix file. Note that the code in dbToTab.pl is just a
sample mysql command; if the input database was created by another
method, a different command may be needed. As long as the resulting
co-occurrence matrix is in the correct format, LBD may be run on it.
This allows flexibility in where co-occurrence information comes from.
Note: utils/datasetCreator/fromMySQL/removeQuotes.pl may need to be run
on the resulting tab-separated file if quotes are included in the
resulting co-occurrence matrix file.
Set Up Dummy UMLS::Association Database
UMLS::Association requires a database in the correct format that it can
connect to. Although this database is not required for ALBD (since
co-occurrence data is loaded from a co-occurrence matrix), it is
required to run UMLS::Association. If you ran UMLS::Association to
generate a co-occurrence matrix, you should be fine. Otherwise you will
need to create a dummy database that it can connect to. This can be done
in a few steps:
1) Open MySQL by typing mysql at the terminal.
2) Create the default database in the correct format by typing:
   CREATE DATABASE cuicounts;
   USE cuicounts;
   CREATE TABLE N_11(cui_1 CHAR(10), cui_2 CHAR(10), n_11 BIGINT(20));
INITIALIZING THE MODULE
To create an instance of the ALBD object using default values for all
configuration options:
   %options = ();
   $options{'lbdConfig'} = 'configFile';
   my $lbd = LiteratureBasedDiscovery->new(\%options);
   $lbd->performLBD();
However, the following configuration options are also provided:
present in the '/lib' directory tree of the package.
The package contains a utils/ directory holding Perl utility
programs. These utilities use the modules or provide some supporting
functionality.
runDiscovery.pl -- runs LBD using the parameters specified in the input
configuration file and writes the results to an output file.
The package contains a large selection of scripts to manipulate CUI
co-occurrence matrices in the utils/datasetCreator/ directory. These are
short scripts and generally require modifying the code at the top with
user input parameters specific to each run. These scripts include:
applyMaxThreshold.pl -- applies a maximum co-occurrence threshold to the
co-occurrence matrix
applyMinThreshold.pl -- applies a minimum co-occurrence threshold to the
co-occurrence matrix
applySemanticFilter.pl -- applies a semantic type and/or group filter to
the co-occurrence matrix
removeCUIPair.pl -- removes all occurrences of the specified CUI pair
from the co-occurrence matrix
removeExplicit.pl -- removes any keys that occur in an explicit
co-occurrence matrix from another co-occurrence matrix (typically from
the squared explicit co-occurrence matrix, which generates a prediction
matrix, or from the post-cutoff matrix used in time slicing to generate
a gold standard file)
testMatrixEquality.pl -- checks to see if two co-occurrence matrix files
contain the same data
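Assuming the hash-of-hashes sparse matrix representation described in lib/ALBD.pm, the core operations of applyMinThreshold.pl and removeExplicit.pl can be sketched as follows. These subroutines are illustrations of the idea, not the scripts' actual code:

```perl
use strict;
use warnings;

# Delete entries whose co-occurrence count falls below $min,
# then drop any rows left empty
sub applyMinThreshold {
    my ($matrixRef, $min) = @_;
    foreach my $row (keys %$matrixRef) {
        foreach my $col (keys %{ $matrixRef->{$row} }) {
            delete $matrixRef->{$row}{$col}
                if $matrixRef->{$row}{$col} < $min;
        }
        delete $matrixRef->{$row} unless %{ $matrixRef->{$row} };
    }
}

# Remove every key pair present in the explicit matrix from another
# matrix (e.g. from the squared matrix, yielding a prediction matrix)
sub removeExplicit {
    my ($matrixRef, $explicitRef) = @_;
    foreach my $row (keys %$explicitRef) {
        next unless exists $matrixRef->{$row};
        delete $matrixRef->{$row}{$_} for keys %{ $explicitRef->{$row} };
        delete $matrixRef->{$row} unless %{ $matrixRef->{$row} };
    }
}
```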
Also included are several subfolders with more specific purposes. Within
the dataStats subfolder are scripts to collect various statistics about
the co-occurrence matrices used in LBD. These scripts include:
getCUICooccurrences.pl -- a data statistics script that gets the number
of co-occurrences and the number of unique co-occurrences for every CUI
in the dataset
getMatrixStats.pl -- determines the number of rows, columns, and entries
of a co-occurrence matrix
metaAnalysis.pl -- determines the number of rows, columns, vocabulary
size, and total number of co-occurrences of a co-occurrence file, or set
of co-occurrence files
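As an illustration of the kind of counts getMatrixStats.pl reports, a minimal sketch over the hash-of-hashes representation might look like this (the subroutine is hypothetical, not the script's actual code):

```perl
use strict;
use warnings;

# Count the rows, columns, and non-zero entries of a sparse matrix
# stored as a hash of hashes
sub getMatrixStats {
    my ($matrixRef) = @_;
    my %cols;
    my $numEntries = 0;
    foreach my $row (keys %$matrixRef) {
        foreach my $col (keys %{ $matrixRef->{$row} }) {
            $cols{$col} = 1;   # record each distinct column key
            $numEntries++;     # every stored value is a non-zero entry
        }
    }
    return (scalar keys %$matrixRef, scalar keys %cols, $numEntries);
}
```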
There is another folder containing scripts to square co-occurrence
matrices. Squaring an explicit (A to B) co-occurrence matrix results in
an implicit (A to C) co-occurrence matrix. These scripts include:
squareMatrix.m -- MATLAB script to square a matrix while holding
everything in RAM. Faster, but requires more RAM.
squareMatrix_partial.m -- MATLAB script to square a matrix in chunks. It
loads only parts of the matrix into RAM at a time, which makes squaring
a matrix of any size possible, but it can take an impractically long
time.
squareMatrix_perl.pl -- squares a matrix in Perl. The easiest method to
use, but it requires the most RAM of any squaring method, so it is only
practical for small datasets.
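The squaring operation itself can be sketched in Perl over the hash-of-hashes representation (an illustration of the approach, not the code of squareMatrix_perl.pl): multiplying the explicit matrix by itself sums, for every A and C, the products of the A-to-B and B-to-C co-occurrence counts, giving the implicit (A to C) connection weights.

```perl
use strict;
use warnings;

# Square a sparse matrix stored as a hash of hashes:
# $implicit{$cuiA}{$cuiC} = sum over B of
#     $explicit{$cuiA}{$cuiB} * $explicit{$cuiB}{$cuiC}
sub squareMatrix {
    my ($explicitRef) = @_;
    my %implicit;
    foreach my $cuiA (keys %$explicitRef) {
        foreach my $cuiB (keys %{ $explicitRef->{$cuiA} }) {
            # only CUIs that start a row can link onward to a C term
            next unless exists $explicitRef->{$cuiB};
            foreach my $cuiC (keys %{ $explicitRef->{$cuiB} }) {
                $implicit{$cuiA}{$cuiC} +=
                    $explicitRef->{$cuiA}{$cuiB}
                    * $explicitRef->{$cuiB}{$cuiC};
            }
        }
    }
    return \%implicit;
}
```

This in-memory approach is what makes the pure-Perl method simple but RAM-hungry; the MATLAB scripts do the same multiplication on sparse numeric matrices instead.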
The fromMySQL folder contains scripts that convert UMLS::Association
databases to ALBD co-occurrence matrices. The files contained are:
dbToTab.pl -- converts a UMLS::Association co-occurrence database to a
sparse format co-occurrence matrix used for ALBD
removeQuotes.pl -- removes quotes from lines in the co-occurrence matrix
file after converting from a database (sometimes needed)
REFERENCING
If you write a paper that has used ALBD in some way, we'd certainly be
grateful if you sent us a copy.
CONTACT US
If you have any trouble installing or using ALBD, please contact us
directly:
Sam Henry: henryst at vcu.edu
config/association view on Meta::CPAN
##############################################################################
# Configuration File for UMLS::Association
##############################################################################
# All the options in this file are put into an options hash and
# passed directly to UMLS::Association for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an example,
# the line "<database>bigrams" will pass the 'database' parameter with a
# value of 'bigrams' to the UMLS::Association options hash for its
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
<database>CUI_Bigram
<hostname>192.168.24.89
<username>henryst
<password>OhFaht3eique
<socket>/var/run/mysqld.sock
<t>
config/interface view on Meta::CPAN
#############################################################################
# Configuration File for UMLS::Interface
############################################################################
# All the options in this file are put into an options hash and
# passed directly to UMLS::Interface for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an example,
# the line "<database>umls" will pass the 'database' parameter with a value
# of 'umls' to the UMLS::Interface options hash for its initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
<t>
<config>interfaceConfig
<hostname>192.168.24.89
<username>henryst
lib/ALBD.pm view on Meta::CPAN
use ALBD;
%options = ();
$options{'lbdConfig'} = 'configFile';
my $lbd = LiteratureBasedDiscovery->new(\%options);
$lbd->performLBD();
=head1 ABSTRACT
This package consists of Perl modules, along with supporting Perl
programs, that perform Literature Based Discovery (LBD). The core
data from which LBD is performed are co-occurrence matrices
generated from UMLS::Association. ALBD is based on the ABC
co-occurrence model. Many options can be specified, and many
ranking methods are available, including novel ranking methods that
use association measures as well as frequency based ranking methods.
See samples/lbd for more info. ALBD can perform open and closed LBD
as well as time slicing evaluation.
=head1 INSTALL
To install the module, run the following magic commands:
lib/ALBD.pm view on Meta::CPAN
#
# LiteratureBasedDiscovery.pm - provides functionality to perform LBD
#
# Matrix Representation:
# LBD is performed using Matrix and Vector operations. The major components
# are an explicit knowledge matrix, which is squared to find the implicit
# knowledge matrix.
#
# The explicit knowledge is read from UMLS::Association N11 matrix. This
# matrix contains the co-occurrence counts for all CUI pairs. The
# UMLS::Association database is completely independent from
# implementation, so any dataset, window size, or anything else may be used.
# Data is read in as a sparse matrix using the Discovery::tableToSparseMatrix
# function. This returns the primary data structures and variables used
# throughout LBD.
#
# Matrix representation:
# This module uses a matrix representation for LBD. All operations are
# performed either as matrix or vector operations. The core data structure
# are the co-occurrence matrices explicitMatrix and implicitMatrix. These
# matrices have dimensions vocabulary size by vocabulary size. Each row
# corresponds to all co-occurrences for a single CUI, and each column of that
# row corresponds to a co-occurrence with a single CUI. Since the matrices
# tend to be sparse, they are stored as hashes of hashes, where the first
# key is for a row, and the second key is for a column. The keys of each hash
# are the indices within the matrix. The hash values are the number of
# co-occurrences for that CUI pair (e.g. ${$explicit{C0000000}}{C1111111} = 10
# means that CUI C0000000 and C1111111 co-occurred 10 times).
#
# Now with an understanding of the data structures, below is a brief
# description of each:
#
# startingMatrix <- A matrix containing the explicit matrix rows for all of the
# start terms. This makes it easy to have multiple start terms,
# and using this matrix as opposed to the entire explicit
# matrix drastically improves performance.
# explicitMatrix <- A matrix containing explicit connections (known connections)
# for every CUI in the dataset.
# implicitMatrix <- A matrix containing implicit connections (discovered
# connections) for every CUI in the dataset
package ALBD;
use strict;
use warnings;
use LiteratureBasedDiscovery::Discovery;
use LiteratureBasedDiscovery::Evaluation;
use LiteratureBasedDiscovery::Rank;
lib/ALBD.pm view on Meta::CPAN
#read in all options from the config file
open IN, $configFileName or die("Error: Cannot open config file: $configFileName\n");
my %optionsHash = ();
my $firstChar;
while (my $line = <IN>) {
#check if its a comment or blank line
$firstChar = substr $line, 0, 1;
if ($firstChar ne '#' && $line =~ /[^\s]+/) {
#line contains data, grab the key and value
$line =~ /<([^>]+)>([^\n]*)/;
#make sure the data was read in correctly
if (!$1) {
print STDERR
"Warning: Invalid line in $configFileName: $line\n";
}
else {
#data was grabbed from the line, add to hash
if ($2) {
#add key and value to the optionsHash
$optionsHash{$1} = $2;
}
else {
#add key and set default value to the optionsHash
$optionsHash{$1} = 1;
}
}
}
lib/LiteratureBasedDiscovery/Discovery.pm view on Meta::CPAN
package Discovery;
use strict;
use warnings;
use DBI;
######################################################################
# MySQL Notes
######################################################################
#TODO I think some of these notes should be elsewhere
# A Note about the database structure expected
# Each LBD database is expected to have:
# PreCutoff_N11
# PostCutoff_N11
# PreCutoff_Implicit
#
# Both PreCutoff_N11 and PostCutoff_N11 should
# be generated manually using CUI_Collector
# PreCutoff_Implicit is generated using the tableToSparseMatrix
# function here, which exports a sparse matrix. That matrix
# can then be imported into matlab, squared, and reloaded into
# a mysql database. With these 3 tables LBD can be performed
######################################################################
# Description
######################################################################
# Discovery.pm - provides matrix operations from n11 counts from
# UMLS::Association
#
#TODO I think some of these notes should be elsewhere
# 'B' term filtering may be applied by removing elements from the
lib/LiteratureBasedDiscovery/Discovery.pm view on Meta::CPAN
}
}
}
}
return $implicitMatrixRef;
}
# loads a tab-separated file as a sparse matrix (a hash of hashes)
# each line of the file contains CUI1 <TAB> CUI2 <TAB> Count
# input: the filename containing the data
# output: a hash ref to the sparse matrix (${$hash{$index1}}{$index2} = value)
sub fileToSparseMatrix {
my $fileName = shift;
open IN, $fileName or die ("unable to open file: $fileName\n");
my %matrix = ();
my ($cui1,$cui2,$val);
while (my $line = <IN>) {
chomp $line;
($cui1,$cui2,$val) = split(/\t/,$line);
lib/LiteratureBasedDiscovery/Discovery.pm view on Meta::CPAN
}
$matrix{$cui1}{$cui2} = $val;
}
close IN;
return \%matrix;
}
# outputs the matrix to the output file in sparse matrix format, which
# is a file containing rowKey\tcolKey\tvalue
# input: $outFile - a string specifying the output file
# $matrixRef - a ref to the sparse matrix containing the data
# output: nothing, but the matrix is output to file
sub outputMatrixToFile {
my $outFile = shift;
my $matrixRef = shift;
#open the output file and output the matrix
open OUT, ">$outFile" or die ("Error opening matrix output file: $outFile\n");
my $rowRef;
foreach my $rowKey (keys %{$matrixRef}) {
$rowRef = ${$matrixRef}{$rowKey};
lib/LiteratureBasedDiscovery/Discovery.pm view on Meta::CPAN
# retrieve a table from mysql and convert it to a sparse matrix (a hash of
# hashes)
# input : $tableName <- the name of the table to output
#         $cuiFinder <- an instance of UMLS::Interface::CuiFinder
# output: a hash ref to the sparse matrix (${$hash{$index1}}{$index2} = value)
sub tableToSparseMatrix {
my $tableName = shift;
my $cuiFinder = shift;
# check tableName
#TODO check that the table exists in the database
# or die "Error: table does not exist: $tableName\n";
# set up database
my $db = $cuiFinder->_getDB();
# retrieve the table as a nested hash where keys are CUI1,
# then CUI2, value is N11
my @keyFields = ('cui_1', 'cui_2');
my $matrixRef = $db->selectall_hashref(
"select * from $tableName", \@keyFields);
# set values of the loaded table to n_11
# ...default is hash of hash of hash
lib/LiteratureBasedDiscovery/Evaluation.pm view on Meta::CPAN
# ALBD::Evaluation.pm
#
# Provides functionality to evaluate LBD systems
# Key components are:
# Results Matrix <- all new knowledge generated by an LBD system (e.g.
# all proposed discoveries of a system from pre-cutoff
# data).
# Gold Standard Matrix <- the gold standard knowledge matrix (e.g. all
# knowledge present in the post-cutoff dataset
# that is not present in the pre-cutoff dataset).
#
# Copyright (c) 2017
#
# Sam Henry
# henryst at vcu.edu
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
lib/LiteratureBasedDiscovery/Rank.pm view on Meta::CPAN
seek IN, 0,0; #reset to the beginning of the implicit file
#iterate over the lines of interest, and grab values
my %np1 = ();
my %n11 = ();
my $n1p = 0;
my $npp = 0;
my $matchedCuiB = 0;
my ($cuiA, $cuiB, $val);
while (my $line = <IN>) {
#grab data from the line
($cuiA, $cuiB, $val) = split(/\t/,$line);
#see if updates are necessary
if (exists $aTerms{$cuiA} || exists $cTerms{$cuiB}) {
#update npp
$npp += $val;
#update np1
if (exists $cTerms{$cuiB}) {
lib/LiteratureBasedDiscovery/TimeSlicing.pm view on Meta::CPAN
#see if this line contains a key that should be read in
if (exists $cuisToGrab{$cui1}) {
#add the value
if (!(defined $postCutoffMatrix{$cui1})) {
my %newHash = ();
$postCutoffMatrix{$cui1} = \%newHash;
}
#check to ensure that the column cui is in the
# vocabulary of the pre-cutoff dataset.
# it is impossible to make predictions of words that
# don't already exist
#NOTE: this assumes $explicitMatrixRef is a square
# matrix (so unordered)
if (exists ${$explicitMatrixRef}{$cui2}) {
${$postCutoffMatrix{$cui1}}{$cui2} = $val;
}
}
}
close IN;
samples/configFileSamples/UMLSAssociationConfig view on Meta::CPAN
##############################################################################
# Configuration File for UMLS::Association
##############################################################################
# All the options in this file are put into an options hash and
# passed directly to UMLS::Association for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an example,
# the line "<database>bigrams" will pass the 'database' parameter with a
# value of 'bigrams' to the UMLS::Association options hash for its
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
#
#
# See UMLS::Association for more details
# Database of Association Scores. Not used, but required to initialize
# UMLS::Association
<database>CUI_Bigram
# If the UMLS::Association Database is not installed on the local machine
# The following parameters may be needed to connect to the server
<hostname>192.168.00.00
<username>username
<password>password
<socket>/var/run/mysqld.sock
# makes the UMLS::Association not print to the command line
<t>
samples/configFileSamples/UMLSInterfaceConfig view on Meta::CPAN
#############################################################################
# Configuration File for UMLS::Interface
############################################################################
# All the options in this file are put into an options hash and
# passed directly to UMLS::Interface for initialization. Options hash keys
# are in <>'s, and values follow directly after with no space. As an example,
# the line "<database>umls" will pass the 'database' parameter with a value
# of 'umls' to the UMLS::Interface options hash for its initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
#
#
# See UMLS::Interface for more detail
# makes the UMLS::Interface not print to the command line
<t>
samples/runSample.pl view on Meta::CPAN
#Demo file, showing how to run open discovery using the sample data, and how
# to perform time slicing evaluation using the sample data
# run a sample lbd using the parameters in the lbd configuration file
print "\n OPEN DISCOVERY \n";
`perl ../utils/runDiscovery.pl lbdConfig`;
print "LBD Open discovery results output to sampleOutput\n\n";
# run a sample time slicing
# first remove the co-occurrences of the precutoff matrix (in this case
# the sampleExplicitMatrix) from the post cutoff matrix. This generates a gold
# standard discovery matrix from which time slicing may be performed.
# This requires modifying removeExplicit.pl, which we have done for you.
# The variables for this example in removeExplicit.pl are:
# my $matrixFileName = 'sampleExplicitMatrix';
# my $squaredMatrixFileName = 'postCutoffMatrix';
# my $outputFileName = 'sampleGoldMatrix';
#`perl ../utils/datasetCreator/removeExplicit.pl`;
# next, run time slicing
print " TIME SLICING \n";
`perl ../utils/runDiscovery.pl timeSlicingConfig > sampleTimeSliceOutput`;
print "LBD Time Slicing results output to sampleTimeSliceOutput\n";
samples/timeSlicingConfig view on Meta::CPAN
# Similar to the target accept groups, this restricts the acceptable linking
# (B) terms to terms within the semantic groups listed
# See http://metampa.nlm.gov/Docs/SemanticTypes_2013AA.txt for a list
#<linkingAcceptGroups>clnd,chem
# Input file path for the explicit co-occurrence matrix used in LBD
<explicitInputFile>sampleExplicitMatrix
# Input file path for the gold standard matrix (matrix of true predictions)
# See utils/datasetCreator on how to make this
<goldInputFile>sampleGoldMatrix
# Input file path of the pre-computed predictions file
# This is optional, but can speed up computation time, since computing the
# prediction matrix can be time consuming.
# The prediction matrix is all predicted discoveries. It is easiest to generate
# by running timeslicing first with the predictionsOutFile specified, then
# in subsequent runs using that as an input
# <predictionsInFile>predictionsMatrix
last;
}
}
ok($fAtKSame == 1, "Frequency at K Matches");
print "Done with Time Slicing Tests\n";
############################################################
#function to read in time slicing data values
sub readTimeSlicingData {
my $fileName = shift;
#read in the gold time slicing values
my @APScores = ();
my $MAP;
my @PAtKScores = ();
my @FAtKScores = ();
open IN, "$fileName"
#open IN, './t/goldSampleTimeSliceOutput'
utils/datasetCreator/applySemanticFilter.pl view on Meta::CPAN
$matrixRef, $acceptTypesRef, $umls_interface);
} else {
Filters::semanticTypeFilter_rowsAndColumns(
$matrixRef, $acceptTypesRef, $umls_interface);
}
#output the matrix
Discovery::outputMatrixToFile($outputFileName, $matrixRef);
#TODO re-enable this and then try to run again
#disconnect from the database and return
#$umls_interface->disconnect();
}
# transforms the string of accept types or groups into a hash of accept TUIs
# input: a string specifying whether linking or target types are being defined
# output: a hash of acceptable TUIs
sub getAcceptTypes {
my $umls_interface = shift;
my $acceptTypesString = shift;
utils/datasetCreator/combineCooccurrenceMatrices.pl view on Meta::CPAN
# different time slicing or discovery replication results. We ran CUI Collector
# separately for each year of the MetaMapped MEDLINE baseline and stored each
# co-occurrence matrix in a single folder "hadoopByYear/output/". That folder
# contained files named by the year and window size used (e.g. 1975_window8).
# The code may need to be modified slightly for other purposes.
use strict;
use warnings;
my $startYear;
my $endYear;
my $windowSize;
my $dataFolder;
#user input
$dataFolder = '/home/henryst/hadoopByYear/output/';
$startYear = '1983';
$endYear = '1985';
$windowSize = 8;
&combineFiles($startYear,$endYear,$windowSize);
#####################################################
####### Program Start ########
sub combineFiles {
my $startYear = shift;
utils/datasetCreator/combineCooccurrenceMatrices.pl view on Meta::CPAN
my $outFileName = "$startYear".'_'."$endYear".'_window'."$windowSize";
(!(-e $outFileName))
or die ("ERROR: output file already exists: $outFileName\n");
open OUT, ">$outFileName"
or die ("ERROR: unable to open output file: $outFileName\n");
#combine the files
my %matrix = ();
for(my $year = $startYear; $year <= $endYear; $year++) {
print "reading $year\n";
my $inFile = $dataFolder.$year.'_window'.$windowSize;
if (!(open IN, $inFile)) {
print " ERROR: unable to open $inFile\n";
next;
}
#read each line of the file and add to the matrix
while (my $line = <IN>) {
#read values from the line
$line =~ /([^\s]+)\t([^\s]+)\t([^\s]+)/;
my $rowKey = $1;
utils/datasetCreator/dataStats/getCUICooccurrences.pl view on Meta::CPAN
# A data statistics tool that gets a list of all cuis, and outputs their number
# of co-occurrences, and their number of unique co-occurrences to file
my $inputFile = '/home/henryst/lbdData/groupedData/reg/1975_1999_window8_noOrder';
my $outputFile = '/home/henryst/lbdData/groupedData/1975_1999_window8_noOrder_stats';
###################################
###################################
#open files
open IN, $inputFile or die("ERROR: unable to open inputFile\n");
utils/datasetCreator/dataStats/metaAnalysis.pl view on Meta::CPAN
# co-occurrences of a co-occurrence file, or set of co-occurrence files
use strict;
use warnings;
#perform meta-analysis on a single co-occurrence matrix
&metaAnalysis('/home/henryst/lbdData/groupedData/1960_1989_window8_noOrder');
#perform meta-analysis on a date range of co-occurrence matrices in a folder
# this expects a folder to contain a co-occurrence matrix for every year
# specified within the date range
my $dataFolder = '/home/henryst/lbdData/dataByYear/1960_1989';
my $startYear = '1809';
my $endYear = '2015';
my $windowSize = 1;
my $statsOutFileName = '/home/henryst/lbdData/stats_window1';
&folderMetaAnalysis($startYear, $endYear, $windowSize, $statsOutFileName, $dataFolder);
#####################
# runs meta analysis on a set of files
sub folderMetaAnalysis {
my $startYear = shift;
my $endYear = shift;
my $windowSize = shift;
my $statsOutFileName= shift;
my $dataFolder = shift;
#Check on I/O
open OUT, ">$statsOutFileName"
or die ("ERROR: unable to open stats out file: $statsOutFileName\n");
#print header row
print OUT "year\tnumRows\tnumCols\tvocabularySize\tnumCooccurrences\n";
#get stats for each file and output to file
for(my $year = $startYear; $year <= $endYear; $year++) {
print "reading $year\n";
my $inFile = $dataFolder.$year.'_window'.$windowSize;
if (open IN, $inFile) {
(my $numRows, my $numCols, my $vocabularySize, my $numCooccurrences)
= &metaAnalysis($inFile);
print OUT "$year\t$numRows\t$numCols\t$vocabularySize\t$numCooccurrences\n"
}
else {
#just skip the file
print " ERROR: unable to open $inFile\n";
}
}
utils/datasetCreator/fromMySQL/dbToTab.pl view on Meta::CPAN
#converts a mysql database to a tab-separated file readable by LBD
#command is of the form:
#`mysql <DB_NAME> -e "SELECT * FROM N_11 INTO OUTFILE '<OUTPUT_FILE>' FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"`
#
# the following line is an example using a database with cui co-occurrence
# counts from 1980 to 1984 with a window size of 1. The mysql database is
# called 1980_1984_window1, and the output matrix file is called
# 1980_1984_window1_data.txt
`mysql 1980_1984_window1 -e "SELECT * FROM N_11 INTO OUTFILE '1980_1984_window1_data.txt' FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '\"' LINES TERMINATED BY '\n';"`;
utils/datasetCreator/fromMySQL/removeQuotes.pl view on Meta::CPAN
#removes quotes from a db-to-tab file
my $inFile = '1980_1984_window1_retest_data.txt';
my $outFile = '1980_1984_window1_restest_DELETEME';
open IN, $inFile or die ("unable to open inFile: $inFile\n");
open OUT, '>'.$outFile or die ("unable to open outFile: $outFile\n");
while (my $line = <IN>) {
$line =~ s/"//g;
#print $line;
print OUT $line;
utils/datasetCreator/removeCUIPair.pl view on Meta::CPAN
# removes the cui pair from the dataset
# used to remove Somatomedin C and Arginine from the 1960-1989 datasets
use strict;
use warnings;
my $cuiA = 'C0021665'; #somatomedin C
my $cuiB = 'C0003765'; #arginine
my $matrixFileName = '/home/henryst/lbdData/groupedData/1960_1989_window8_ordered';
my $matrixOutFileName = $matrixFileName.'_removed';
&removeCuiPair($cuiA, $cuiB, $matrixFileName, $matrixOutFileName);
print STDERR "DONE\n";
###########################################
# remove the CUI pair from the dataset
sub removeCuiPair {
my $cuiA = shift;
my $cuiB = shift;
my $matrixFileName = shift;
my $matrixOutFileName = shift;
print STDERR "removing $cuiA,$cuiB from $matrixFileName\n";
#open the in and out files
open IN, $matrixFileName
or die ("ERROR: cannot open matrix in file: $matrixFileName\n");
utils/datasetCreator/squaring/squareMatrix.m view on Meta::CPAN
clear all;
close all;
sparseSquare('/home/henryst/lbdData/squaring/1975_1999_window8_noOrder','/home/henryst/lbdData/squaring/1975_1999_window8_noOrder_squared');
error('DONE!');
function sparseSquare(fileIn, fileOut)
%load the data
data = load(fileIn);
disp(' loaded data');
%convert to sparse
vals = max(data);
maxVal = vals(1);
if (vals(2) > maxVal)
maxVal = vals(2);
end
sp = sparse(data(:,1), data(:,2), data(:,3), maxVal, maxVal);
clear data;
clear vals;
clear maxVal;
disp(' converted to sparse');
%square the matrix
squared = sp*sp;
clear sp;
disp(' squared');
%output the matrix
[i,j,val] = find(squared);
clear squared;
disp(' values grabbed for output');
data_dump = [i,j,val];
clear i;
clear j;
clear val;
disp(' values ready for output dump');
fid = fopen(fileOut,'w');
fprintf( fid,'%d %d %d\n', transpose(data_dump) );
fclose(fid);
disp(' DONE!');
end
utils/datasetCreator/squaring/squareMatrix_partial.m view on Meta::CPAN
sparseSquare_sectioned('/home/henryst/lbdData/squaring/1975_1999_window8_noOrder','/home/henryst/lbdData/squaring/1975_1999_window8_noOrder_squared_secondTry',increment);
error('DONE!');
function sparseSquare_sectioned(fileIn, fileOut, increment)
disp(fileIn);
%open, close, and clear the output file
fid = fopen(fileOut,'w');
fclose(fid);
%load the data
data = load(fileIn);
vals = max(data);
matrixSize = vals(1);
if (vals(2) > matrixSize)
matrixSize = vals(2);
end
disp('got matrixDim');
clear data;
%multiply each segment of the matrices
for rowStartIndex = 1:increment:matrixSize
rowEndIndex = rowStartIndex+increment-1;
if (rowEndIndex > matrixSize)
rowEndIndex = matrixSize;
end
for colStartIndex = 1: increment: matrixSize
colEndIndex = colStartIndex+increment-1;
if (colEndIndex > matrixSize)
colEndIndex = matrixSize;
end
dispString = [num2str(rowStartIndex), ',', num2str(rowEndIndex),' - ', num2str(colStartIndex),', ', num2str(colEndIndex),':'];
disp(dispString)
clear dispString;
%load the data
data = load(fileIn);
disp(' loaded data');
%convert to sparse
vals = max(data);
maxVal = vals(1);
if (vals(2) > maxVal)
maxVal = vals(2);
end
sp = sparse(data(:,1), data(:,2), data(:,3), maxVal, maxVal);
clear data;
clear vals;
clear maxVal;
disp(' converted to sparse');
%grab a piece of the matrix
sp1 = sparse(matrixSize,matrixSize);
sp2 = sparse(matrixSize,matrixSize);
sp1(rowStartIndex:rowEndIndex,:) = sp(rowStartIndex:rowEndIndex,:);
sp2(:,colStartIndex:colEndIndex) = sp(:,colStartIndex:colEndIndex);
clear sp;
%square the matrix
squared = sp1*sp2;
clear sp1 sp2;
disp(' squared');
%output the matrix
[i,j,val] = find(squared);
clear squared;
disp(' values grabbed for output');
data_dump = [i,j,val];
clear i;
clear j;
clear val;
disp(' values ready for output dump');
fid = fopen(fileOut,'a+');
fprintf( fid,'%d %d %d\n', transpose(data_dump) );
clear data_dump;
fclose(fid);
disp(' values output');
end
end
end