ALBD
file, and outputs to an output file.
The package contains a large selection of functions to manipulate CUI
co-occurrence matrices in the utils/datasetCreator/ directory. These are
short scripts, and most require editing the user input parameters at the
top of the code for each run. These scripts include:
applyMaxThreshold.pl -- applies a maximum co-occurrence threshold to the
co-occurrence matrix
applyMinThreshold.pl -- applies a minimum co-occurrence threshold to the
co-occurrence matrix
applySemanticFilter.pl -- applies a semantic type and/or group filter to
the co-occurrence matrix.
combineCooccurrenceMatrices.pl -- combines the co-occurrence counts of
multiple co-occurrence matrices
makeOrderNotMatter.pl -- makes the order of CUI co-occurrences not
matter by updating the co-occurrence matrix file. (UMLS::Association
generates co-occurrence files where order does matter, so the sentence
'cui1 cui2' will only mark a co-occurrence between cui1 and cui2, but
not between cui2 and cui1).
removeCUIPair.pl -- removes all occurrences of the specified CUI pair
from the co-occurrence matrix
removeExplicit.pl -- removes any keys that occur in an explicit
co-occurrence matrix from another co-occurrence matrix (typically the
squared explicit co-occurrence matrix itself, which generates a
prediction matrix, or the post-cutoff matrix used in time slicing to
generate a gold standard file)
testMatrixEquality.pl -- checks to see if two co-occurrence matrix files
contain the same data
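As an illustration of what three of these operations do, here is a minimal Python sketch. It is not the code of the Perl scripts: the function names are hypothetical, the matrix is assumed to be held in memory as a nested dictionary of the form {cui1: {cui2: count}}, and the thresholds are assumed to apply to each pair's count, which the descriptions above do not specify.

```python
from collections import defaultdict


def make_order_not_matter(matrix):
    """Return a symmetric copy: each count is credited to (a, b) and (b, a)."""
    symmetric = defaultdict(lambda: defaultdict(int))
    for a, cols in matrix.items():
        for b, count in cols.items():
            symmetric[a][b] += count
            if a != b:  # assumption: a self co-occurrence is counted once
                symmetric[b][a] += count
    return {a: dict(cols) for a, cols in symmetric.items()}


def apply_thresholds(matrix, min_count=None, max_count=None):
    """Keep pairs whose count lies within the given bounds (assumed per pair)."""
    filtered = {}
    for a, cols in matrix.items():
        kept = {
            b: n
            for b, n in cols.items()
            if (min_count is None or n >= min_count)
            and (max_count is None or n <= max_count)
        }
        if kept:
            filtered[a] = kept
    return filtered


def remove_explicit(target, explicit):
    """Drop from target every (a, b) key that also occurs in explicit."""
    result = {}
    for a, cols in target.items():
        known = explicit.get(a, {})
        kept = {b: n for b, n in cols.items() if b not in known}
        if kept:
            result[a] = kept
    return result
```

For example, symmetrizing {"C1": {"C2": 3}, "C2": {"C1": 2}} under these assumptions gives a count of 5 in both directions.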
Also included are several subfolders with more specific purposes. Within
the dataStats subfolder are scripts to collect various statistics about
the co-occurrence matrices used in LBD. These scripts include:
getCUICooccurrences.pl -- gets the number of co-occurrences and the
number of unique co-occurrences for every CUI in the dataset
getMatrixStats.pl -- determines the number of rows, columns, and entries
of a co-occurrence matrix
metaAnalysis.pl -- determines the number of rows, columns, vocabulary
size, and total number of co-occurrences of a co-occurrence file, or set
of co-occurrence files
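The statistics described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scripts' implementation: the function names are hypothetical, the matrix is assumed to be a nested dictionary {cui1: {cui2: count}}, "vocabulary size" is assumed to mean the number of distinct CUIs appearing as a row or a column, and the per-CUI counts consider only the row side of each pair.

```python
def matrix_stats(matrix):
    """Summarize a sparse co-occurrence matrix held as {row: {col: count}}."""
    rows = set(matrix)
    cols = {b for cs in matrix.values() for b in cs}
    return {
        "rows": len(rows),
        "columns": len(cols),
        "entries": sum(len(cs) for cs in matrix.values()),
        "vocabulary": len(rows | cols),  # assumption: distinct CUIs overall
        "total_cooccurrences": sum(n for cs in matrix.values() for n in cs.values()),
    }


def cui_cooccurrences(matrix):
    """For each row CUI: (total co-occurrences, unique co-occurring CUIs)."""
    return {a: (sum(cs.values()), len(cs)) for a, cs in matrix.items()}
```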
There is another folder containing scripts to square co-occurrence
matrices. Squaring an explicit (A to B) co-occurrence matrix results in
a co-occurrence matrix containing all implicit (A to C) connections.
This is useful for time slicing and other analysis. If you wish to
create a predictions matrix file for every CUI, an additional step is
required: removing the entries of the original explicit matrix from the
squared matrix. This can be done with the removeExplicit.pl script.
Squaring a co-occurrence matrix can be very computationally expensive,
in terms of both RAM and CPU. For this reason the MATLAB scripts are
preferred over the Perl script. Even with MATLAB, RAM can become an
issue, and it may be necessary to square sections of the matrix and
combine them into a single output matrix, which takes much longer.
Scripts in the squaring folder include:
convertForSquaring_MATLAB.pl -- functions to convert between the ALBD
and MATLAB sparse matrix formats
squareMatrix.m -- MATLAB script to square a matrix while holding
everything in RAM. Faster, but requires more RAM.
squareMatrix_partial.m -- MATLAB script to square a matrix in chunks.
Only loads parts of the matrix into RAM at a time, which makes squaring
a matrix of any size possible, but can take impractical amounts of
time.
squareMatrix_perl.pl -- squares a matrix in Perl, but requires the most
RAM of any squaring method. The easiest method to use, but only
practical for small datasets.
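To make the squaring step concrete, the following Python sketch multiplies a sparse matrix by itself, either all at once or one block of rows at a time. It is a simplified illustration and not a port of the MATLAB or Perl scripts: the function names are hypothetical, the matrix is assumed to be a nested dictionary {cui1: {cui2: count}}, and the chunked version keeps the whole input in memory and only limits how much of the output is built at once.

```python
from collections import defaultdict


def square_rows(matrix, row_keys):
    """Compute the rows named in row_keys of the product matrix * matrix."""
    out = {}
    for a in row_keys:
        acc = defaultdict(int)
        # Every path a -> b -> c contributes count(a, b) * count(b, c) to (a, c).
        for b, n_ab in matrix.get(a, {}).items():
            for c, n_bc in matrix.get(b, {}).items():
                acc[c] += n_ab * n_bc
        if acc:
            out[a] = dict(acc)
    return out


def square_matrix(matrix):
    """Square the whole matrix in one pass."""
    return square_rows(matrix, list(matrix))


def square_matrix_in_chunks(matrix, chunk_size):
    """Yield the squared matrix one block of rows at a time."""
    keys = sorted(matrix)
    for start in range(0, len(keys), chunk_size):
        yield square_rows(matrix, keys[start:start + chunk_size])
```

Each yielded block can be written to the output file before the next one is computed, which keeps the memory used for results bounded at the cost of more passes over the data.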
The fromMySQL folder contains scripts that convert UMLS::Association
databases to ALBD co-occurrence matrices. The files contained are:
dbToTab.pl -- converts a UMLS::Association co-occurrence database to a
sparse format co-occurrence matrix used for ALBD
removeQuotes.pl -- removes quotes from lines in the co-occurrence matrix
file after converting from a database (sometimes needed)
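The quote clean-up step is simple enough to sketch in Python. This is a hypothetical illustration, not the removeQuotes.pl code: it assumes the stray characters are ASCII single or double quotes, and the sample line in the usage note assumes tab-separated fields.

```python
def remove_quotes(lines):
    """Yield each line with ASCII single and double quotes stripped out."""
    for line in lines:
        yield line.replace('"', "").replace("'", "")
```

Under these assumptions, a line such as '"C0000001"\t"C0000002"\t5' becomes 'C0000001\tC0000002\t5'.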
REFERENCING
If you write a paper that has used UMLS-Association in some way, we'd
certainly be grateful if you sent us a copy.
CONTACT US
If you have any trouble installing or using ALBD, please contact us
directly:
Sam Henry: henryst at vcu.edu
Bridget McInnes: btmcinnes at vcu.edu
SOFTWARE COPYRIGHT AND LICENSE
Copyright (C) 2017 Sam Henry & Bridget McInnes
This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published
by the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.
This program is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
Public License for more details.
You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
Note: The text of the GNU General Public License is provided in the file
'GPL.txt' that you should have received with this distribution.