This License is a kind of "copyleft", which means that derivative
works of the document must themselves be free in the same sense. It
complements the GNU General Public License, which is a copyleft
license designed for free software.
We have designed this License in order to use it for manuals for free
software, because free software needs free documentation: a free
program should come with manuals providing the same freedoms that the
software does. But this License is not limited to software manuals;
it can be used for any textual work, regardless of subject matter or
whether it is published as a printed book. We recommend this License
principally for works whose purpose is instruction or reference.
1. APPLICABILITY AND DEFINITIONS
This License applies to any manual or other work, in any medium, that
contains a notice placed by the copyright holder saying it can be
distributed under the terms of this License. Such a notice grants a
world-wide, royalty-free license, unlimited in duration, to use that
work under the conditions stated herein. The "Document", below,
ASCII without markup, Texinfo input format, LaTeX input format, SGML
or XML using a publicly available DTD, and standard-conforming simple
HTML, PostScript or PDF designed for human modification. Examples of
transparent image formats include PNG, XCF and JPG. Opaque formats
include proprietary formats that can be read and edited only by
proprietary word processors, SGML or XML for which the DTD and/or
processing tools are not generally available, and the
machine-generated HTML, PostScript or PDF produced by some word
processors for output purposes only.
The "Title Page" means, for a printed book, the title page itself,
plus such following pages as are needed to hold, legibly, the material
this License requires to appear in the title page. For works in
formats which do not have any title page as such, "Title Page" means
the text near the most prominent appearance of the work's title,
preceding the beginning of the body of the text.
A section "Entitled XYZ" means a named subunit of the Document whose
title either is precisely XYZ or contains XYZ in parentheses following
text that translates XYZ in another language. (Here XYZ stands for a
specific section name mentioned below, such as "Acknowledgements",
copying of the copies you make or distribute. However, you may accept
compensation in exchange for copies. If you distribute a large enough
number of copies you must also follow the conditions in section 3.
You may also lend copies, under the same conditions stated above, and
you may publicly display copies.
3. COPYING IN QUANTITY
If you publish printed copies (or copies in media that commonly have
printed covers) of the Document, numbering more than 100, and the
Document's license notice requires Cover Texts, you must enclose the
copies in covers that carry, clearly and legibly, all these Cover
Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
the back cover. Both covers must also clearly and legibly identify
you as the publisher of these copies. The front cover must present
the full title with all words of the title equally prominent and
visible. You may add other material on the covers in addition.
Copying with changes limited to the covers, as long as they preserve
the title of the Document and satisfy these conditions, can be treated
as verbatim copying in other respects.
of the compilation's users beyond what the individual works permit.
When the Document is included in an aggregate, this License does not
apply to the other works in the aggregate which are not themselves
derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these
copies of the Document, then if the Document is less than one half of
the entire aggregate, the Document's Cover Texts may be placed on
covers that bracket the Document within the aggregate, or the
electronic equivalent of covers if the Document is in electronic form.
Otherwise they must appear on printed covers that bracket the whole
aggregate.
8. TRANSLATION
Translation is considered a kind of modification, so you may
distribute translations of the Document under the terms of section 4.
Replacing Invariant Sections with translations requires special
permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the
a) You must cause the modified files to carry prominent notices
stating that you changed the files and the date of any change.
b) You must cause any work that you distribute or publish, that in
whole or in part contains or is derived from the Program or any
part thereof, to be licensed as a whole at no charge to all third
parties under the terms of this License.
c) If the modified program normally reads commands interactively
when run, you must cause it, when started running for such
interactive use in the most ordinary way, to print or display an
announcement including an appropriate copyright notice and a
notice that there is no warranty (or else, saying that you provide
a warranty) and that users may redistribute the program under
these conditions, and telling the user how to view a copy of this
License. (Exception: if the Program itself is interactive but
does not normally print such an announcement, your work based on
the Program is not required to print an announcement.)
These requirements apply to the modified work as a whole. If
identifiable sections of that work are not derived from the Program,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works. But when you
distribute the same sections as part of a whole which is a work based
on the Program, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote it.
lib/ALBD.pm
return;
}
if (exists $lbdOptions{'precisionAndRecall_explicit'}) {
$self->timeSlicing_generatePrecisionAndRecall_explicit();
return;
}
if (exists $lbdOptions{'precisionAndRecall_implicit'}) {
$self->timeSlicing_generatePrecisionAndRecall_implicit();
return;
}
print "Open Discovery\n";
print $self->_parametersToString();
#Get inputs
my $startCuisRef = $self->_getStartCuis();
my $linkingAcceptTypesRef = $self->_getAcceptTypes('linking');
my $targetAcceptTypesRef = $self->_getAcceptTypes('target');
print "startCuis = ".(join(',', @{$startCuisRef}))."\n";
print "linkingAcceptTypes = ".(join(',', keys %{$linkingAcceptTypesRef}))."\n";
print "targetAcceptTypes = ".(join(',', keys %{$targetAcceptTypesRef}))."\n";
#Get the Explicit Matrix
$start = time;
my $explicitMatrixRef;
if(!defined $lbdOptions{'explicitInputFile'}) {
die ("ERROR: explicitInputFile must be defined in LBD config file\n");
}
$explicitMatrixRef = Discovery::fileToSparseMatrix($lbdOptions{'explicitInputFile'});
print "Got Explicit Matrix in ".(time() - $start)."\n";
#Get the Starting Matrix
$start = time();
my $startingMatrixRef =
Discovery::getRows($startCuisRef, $explicitMatrixRef);
print "Got Starting Matrix in ".(time() - $start)."\n";
#if using average minimum weight, grab the a->b scores
my %abPairsWithScores = ();
if ($lbdOptions{'rankingProcedure'} eq 'averageMinimumWeight'
|| $lbdOptions{'rankingProcedure'} eq 'ltc_amw') {
#apply semantic type filter to columns only
if ((scalar keys %{$linkingAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_columns(
$explicitMatrixRef, $linkingAcceptTypesRef, $umls_interface);
lib/ALBD.pm
}
}
Rank::getBatchAssociationScores(\%abPairsWithScores, $explicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association);
}
#Apply Semantic Type Filter to the explicit matrix
if ((scalar keys %{$linkingAcceptTypesRef}) > 0) {
$start = time();
Filters::semanticTypeFilter_rowsAndColumns(
$explicitMatrixRef, $linkingAcceptTypesRef, $umls_interface);
print "Semantic Type Filter in ".(time() - $start)."\n";
}
#Get Implicit Connections
$start = time();
my $implicitMatrixRef;
if (defined $lbdOptions{'implicitInputFile'}) {
$implicitMatrixRef = Discovery::fileToSparseMatrix($lbdOptions{'implicitInputFile'});
} else {
$implicitMatrixRef = Discovery::findImplicit($explicitMatrixRef, $startingMatrixRef);
}
print "Got Implicit Matrix in ".(time() - $start)."\n";
#Remove Known Connections
$start = time();
$implicitMatrixRef = Discovery::removeExplicit($startingMatrixRef, $implicitMatrixRef);
print "Removed Known Connections in ".(time() - $start)."\n";
#Apply Semantic Type Filter
if ((scalar keys %{$targetAcceptTypesRef}) > 0) {
$start = time();
Filters::semanticTypeFilter_columns(
$implicitMatrixRef, $targetAcceptTypesRef, $umls_interface);
print "Semantic Type Filter in ".(time() - $start)."\n";
}
#Score Implicit Connections
$start = time();
my $scoresRef;
if ($lbdOptions{'rankingProcedure'} eq 'allPairs') {
$scoresRef = Rank::scoreImplicit_fromAllPairs($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association);
} elsif ($lbdOptions{'rankingProcedure'} eq 'averageMinimumWeight') {
$scoresRef = Rank::scoreImplicit_averageMinimumWeight($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association, \%abPairsWithScores);
} elsif ($lbdOptions{'rankingProcedure'} eq 'linkingTermCount') {
$scoresRef = Rank::scoreImplicit_linkingTermCount($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef);
} elsif ($lbdOptions{'rankingProcedure'} eq 'frequency') {
$scoresRef = Rank::scoreImplicit_frequency($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef);
} elsif ($lbdOptions{'rankingProcedure'} eq 'ltcAssociation') {
$scoresRef = Rank::scoreImplicit_ltcAssociation($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association);
} elsif ($lbdOptions{'rankingProcedure'} eq 'ltc_amw') {
$scoresRef = Rank::scoreImplicit_LTC_AMW($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association, \%abPairsWithScores);
} else {
die ("Error: Invalid Ranking Procedure\n");
}
print "Scored in: ".(time()-$start)."\n";
#Rank Implicit Connections
$start = time();
my $ranksRef = Rank::rankDescending($scoresRef);
print "Ranked in: ".(time()-$start)."\n";
#Output The Results
open OUT, ">$lbdOptions{implicitOutputFile}"
or die "unable to open implicit output file: "
."$lbdOptions{implicitOutputFile}\n";
my $outputString = $self->_rankedTermsToString($scoresRef, $ranksRef);
my $paramsString = $self->_parametersToString();
print OUT $paramsString;
print OUT $outputString;
close OUT;
#Done
print "DONE!\n\n";
}
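The open discovery steps above (find implicit connections, then remove known ones) can be sketched on plain hashes of hashes. This is a minimal illustration, not the module's implementation: `find_implicit_sketch` and `remove_explicit_sketch` are hypothetical stand-ins for `Discovery::findImplicit` and `Discovery::removeExplicit`, and summing the link-to-target counts is an assumption about how implicit values are accumulated.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Discovery::findImplicit: for every starting row,
# follow each start->link entry, then every link->target entry in the
# explicit matrix, summing the link->target counts per target (assumption).
sub find_implicit_sketch {
    my ($explicitRef, $startingRef) = @_;
    my %implicit;
    foreach my $start (keys %{$startingRef}) {
        foreach my $link (keys %{ $startingRef->{$start} }) {
            next unless exists $explicitRef->{$link};
            foreach my $target (keys %{ $explicitRef->{$link} }) {
                $implicit{$start}{$target} += $explicitRef->{$link}{$target};
            }
        }
    }
    return \%implicit;
}

# Hypothetical stand-in for Discovery::removeExplicit: drop every implicit
# entry that is already a direct (known) connection of the starting term.
sub remove_explicit_sketch {
    my ($startingRef, $implicitRef) = @_;
    foreach my $start (keys %{$implicitRef}) {
        foreach my $target (keys %{ $implicitRef->{$start} }) {
            delete $implicitRef->{$start}{$target}
                if exists $startingRef->{$start}{$target};
        }
    }
    return $implicitRef;
}

# Toy explicit matrix: A co-occurs with B and D, B with C and D, D with E.
my %explicit = (
    A => { B => 2, D => 1 },
    B => { C => 3, D => 4 },
    D => { E => 5 },
);
my $starting = { A => $explicit{A} };
my $implicit = remove_explicit_sketch(
    $starting, find_implicit_sketch(\%explicit, $starting));

# Print the remaining implicit connections of A in a stable order.
foreach my $target (sort keys %{ $implicit->{A} }) {
    print "A -> $target: $implicit->{A}{$target}\n";
}
```

In this toy data D is reachable through B but is also a direct neighbour of A, so it is removed as a known connection; only C and E remain as candidate discoveries.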
#----------------------------------------------------------------------------
# performs LBD, closed discovery
# input: none
# output: none, but a results file is written to disk
sub performLBD_closedDiscovery {
my $self = shift;
my $start; #used to record run times
print "Closed Discovery\n";
print $self->_parametersToString();
#Get inputs
my $startCuisRef = $self->_getStartCuis();
my $targetCuisRef = $self->_getTargetCuis();
my $linkingAcceptTypesRef = $self->_getAcceptTypes('linking');
#Get the Explicit Matrix
$start = time;
my $explicitMatrixRef;
if(!defined $lbdOptions{'explicitInputFile'}) {
die ("ERROR: explicitInputFile must be defined in LBD config file\n");
}
$explicitMatrixRef = Discovery::fileToSparseMatrix($lbdOptions{'explicitInputFile'});
print "Got Explicit Matrix in ".(time() - $start)."\n";
#Get the Starting Matrix
$start = time();
my $startingMatrixRef =
Discovery::getRows($startCuisRef, $explicitMatrixRef);
print "Got Starting Matrix in ".(time() - $start)."\n";
print " numRows in startMatrix = ".(scalar keys %{$startingMatrixRef})."\n";
#Apply Semantic Type Filter to the explicit matrix
if ((scalar keys %{$linkingAcceptTypesRef}) > 0) {
$start = time();
Filters::semanticTypeFilter_rowsAndColumns(
$explicitMatrixRef, $linkingAcceptTypesRef, $umls_interface);
print "Semantic Type Filter in ".(time() - $start)."\n";
}
#Get the Target Matrix
$start = time();
my $targetMatrixRef =
Discovery::getRows($targetCuisRef, $explicitMatrixRef);
print "Got Target Matrix in ".(time() - $start)."\n";
print " numRows in targetMatrix = ".(scalar keys %{$targetMatrixRef})."\n";
#find the linking terms in common for starting and target matrices
print "Finding terms in common\n";
#get starting linking terms
my %startLinks = ();
foreach my $row (keys %{$startingMatrixRef}) {
foreach my $col (keys %{${$startingMatrixRef}{$row}}) {
$startLinks{$col} = ${${$startingMatrixRef}{$row}}{$col};
}
}
print " num start links = ".(scalar keys %startLinks)."\n";
#get target linking terms
my %targetLinks = ();
foreach my $row (keys %{$targetMatrixRef}) {
foreach my $col (keys %{${$targetMatrixRef}{$row}}) {
$targetLinks{$col} = ${${$targetMatrixRef}{$row}}{$col};
}
}
print " num target links = ".(scalar keys %targetLinks)."\n";
#find linking terms in common
my %inCommon = ();
foreach my $startLink (keys %startLinks) {
if (exists $targetLinks{$startLink}) {
$inCommon{$startLink} = $startLinks{$startLink} + $targetLinks{$startLink};
}
}
print " num in common = ".(scalar keys %inCommon)."\n";
#Score and Rank
#Score the linking terms in common
my $scoresRef = \%inCommon;
#TODO score is just summed frequency right now
#Rank Implicit Connections
$start = time();
my $ranksRef = Rank::rankDescending($scoresRef);
print "Ranked in: ".(time()-$start)."\n";
#Output The Results
open OUT, ">$lbdOptions{implicitOutputFile}"
or die "unable to open implicit output file: "
."$lbdOptions{implicitOutputFile}\n";
my $outputString = $self->_rankedTermsToString($scoresRef, $ranksRef);
my $paramsString = $self->_parametersToString();
print OUT $paramsString;
print OUT $outputString;
print OUT "\n\n---------------------------------------\n\n";
print OUT "starting linking terms:\n";
print OUT join("\n", keys %startLinks);
print OUT "\n\n---------------------------------------\n\n";
print OUT "target linking terms:\n";
print OUT join("\n", keys %targetLinks);
close OUT;
#Done
print "DONE!\n\n";
}
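The closed discovery step above (collect linking terms on each side, keep those in common, score by summed frequency) can be sketched as follows. The helper names are hypothetical. Unlike the code above, which keeps the last value seen for a column, this sketch sums values for the same column across rows; that is an assumption about the intended behaviour, not the module's current behaviour.

```perl
use strict;
use warnings;

# Collect every column (linking term) of a sparse matrix, summing the
# values seen for the same column across rows (an assumption).
sub linking_terms_sketch {
    my ($matrixRef) = @_;
    my %links;
    foreach my $row (keys %{$matrixRef}) {
        foreach my $col (keys %{ $matrixRef->{$row} }) {
            $links{$col} += $matrixRef->{$row}{$col};
        }
    }
    return \%links;
}

# Keep only linking terms reachable from both sides; the score of a shared
# linking term is its start-side frequency plus its target-side frequency.
sub shared_links_sketch {
    my ($startLinksRef, $targetLinksRef) = @_;
    my %inCommon;
    foreach my $link (keys %{$startLinksRef}) {
        next unless exists $targetLinksRef->{$link};
        $inCommon{$link} = $startLinksRef->{$link} + $targetLinksRef->{$link};
    }
    return \%inCommon;
}

# Toy matrices with made-up identifiers: only C200 links both sides.
my $startMatrix  = { C001 => { C100 => 4, C200 => 1 } };
my $targetMatrix = { C009 => { C200 => 6, C300 => 2 } };
my $shared = shared_links_sketch(
    linking_terms_sketch($startMatrix),
    linking_terms_sketch($targetMatrix));

print "$_\t$shared->{$_}\n" foreach sort keys %{$shared};
```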
#NOTE, this is experimental code for using the implicit matrix as input
# to association measures and then rank. This provides a nice method of
# association for implicit terms, but there are implementation problems
# primarily memory constraints or time constraints now, because this
# requires the entire implicit matrix be computed. This can be done, but
# access to it is then slow. Would require a major redo of the code
#
=comment
# performs LBD, but using implicit matrix ranking schemes.
# Since the order of operations for those methods is slightly different,
# a new method has been created.
# input: none
# output: none, but a results file is written to disk
sub performLBD_implicitMatrixRanking {
my $self = shift;
my $start; #used to record run times
print $self->_parametersToString();
print "In Implicit Ranking\n";
#Get inputs
my $startCuisRef = $self->_getStartCuis();
my $linkingAcceptTypesRef = $self->_getAcceptTypes('linking');
my $targetAcceptTypesRef = $self->_getAcceptTypes('target');
print "startCuis = ".(join(',', @{$startCuisRef}))."\n";
print "linkingAcceptTypes = ".(join(',', keys %{$linkingAcceptTypesRef}))."\n";
print "targetAcceptTypes = ".(join(',', keys %{$targetAcceptTypesRef}))."\n";
#Score Implicit Connections
$start = time();
my $scoresRef;
$scoresRef = Rank::scoreImplicit_fromImplicitMatrix($startCuisRef, $lbdOptions{'implicitInputFile'}, $lbdOptions{'rankingMeasure'}, $umls_association);
print "Scored in: ".(time()-$start)."\n";
#Rank Implicit Connections
$start = time();
my $ranksRef = Rank::rankDescending($scoresRef);
print "Ranked in: ".(time()-$start)."\n";
#Output The Results
open OUT, ">$lbdOptions{implicitOutputFile}"
or die "unable to open implicit output file: "
."$lbdOptions{implicitOutputFile}\n";
my $outputString = $self->_rankedTermsToString($scoresRef, $ranksRef);
my $paramsString = $self->_parametersToString();
print OUT $paramsString;
print OUT $outputString;
close OUT;
#Done
print "DONE!\n\n";
}
=cut
##################################################
################ Time Slicing ####################
##################################################
#NOTE: This function isn't really tested, and is really slow right now
# Generates precision and recall values by varying the threshold
# of the A->B ranking measure.
# input: none
# output: none, but precision and recall values are printed to STDOUT
sub timeSlicing_generatePrecisionAndRecall_explicit {
my $NUM_SAMPLES = 100; #TODO, read from file the number of samples to average over for timeslicing
my $self = shift;
print "In timeSlicing_generatePrecisionAndRecall\n";
my $numIntervals = 10;
#Get inputs
my $startAcceptTypesRef = $self->_getAcceptTypes('start');
my $linkingAcceptTypesRef = $self->_getAcceptTypes('linking');
my $targetAcceptTypesRef = $self->_getAcceptTypes('target');
#Get the Explicit Matrix
lib/ALBD.pm
my $trueMax = -999999;
my $predictedMin = 999999;
my $predictedMax = -999999;
my $predictedTotal = 0;
my $trueTotal = 0;
my $allPairsCount = scalar keys %{$assocScoresRef};
for (my $i = $numIntervals; $i >= 0; $i--) {
#determine the number of samples to threshold
my $numSamples = $i*($allPairsCount/$numIntervals);
print "i, numSamples/allPairsCount = $i, $numSamples/$allPairsCount\n";
#grab samples at just 10 to estimate the final point (this is what
# makes it an 11 point curve)
if ($numSamples == 0) {
$numSamples = 10;
}
#apply a threshold (number of samples)
my $thresholdedStartingMatrixRef = TimeSlicing::grabKHighestRankedSamples($numSamples, $assocScoresRef, $startingMatrixRef);
#generate implicit knowledge
lib/ALBD.pm
#apply a semantic type filter to the implicit matrix
if ((scalar keys %{$targetAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_columns(
$implicitMatrixRef, $targetAcceptTypesRef, $umls_interface);
}
#calculate precision and recall
my ($precision, $recall) = TimeSlicing::calculatePrecisionRecall(
$implicitMatrixRef, $postCutoffMatrixRef);
print "precision = $precision, recall = $recall\n";
#calculate averages/min/max only for $i= $numIntervals, which is all terms
if ($i == $numIntervals) {
#average over all terms
foreach my $rowKey(keys %{$implicitMatrixRef}) {
#get the counts true and predicted for this term (row of matrix)
my $numPredicted = scalar keys %{${$implicitMatrixRef}{$rowKey}};
my $numTrue = scalar keys %{${$postCutoffMatrixRef}{$rowKey}};
#sum counts
lib/ALBD.pm
$trueTotal += $numTrue;
}
#take the average, both true and predicted matrices
# have the same number of rows.
$predictedAverage /= (scalar keys %{$implicitMatrixRef});
$trueAverage /= (scalar keys %{$implicitMatrixRef});
}
}
#output stats
print "predicted - total, min, max, average = $predictedTotal, $predictedMin, $predictedMax, $predictedAverage\n";
print "true - total, min, max, average = $trueTotal, $trueMin, $trueMax, $trueAverage\n";
}
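The precision and recall values printed above compare a predicted (implicit) matrix against a matrix of true post-cutoff connections. A minimal sketch of one way to compute them over sparse matrices follows. It is not `TimeSlicing::calculatePrecisionRecall` itself: the function name is hypothetical, and pooling all rows into a single count (rather than averaging per row) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical pooled precision/recall over two sparse matrices
# (hashes of hashes keyed row -> column).
sub precision_recall_sketch {
    my ($predictedRef, $trueRef) = @_;
    my ($numPredicted, $numTrue, $numCorrect) = (0, 0, 0);

    # Count predictions, and those that also appear in the true matrix.
    foreach my $row (keys %{$predictedRef}) {
        foreach my $col (keys %{ $predictedRef->{$row} }) {
            $numPredicted++;
            $numCorrect++
                if exists $trueRef->{$row} && exists $trueRef->{$row}{$col};
        }
    }

    # Count all true connections.
    foreach my $row (keys %{$trueRef}) {
        $numTrue += scalar keys %{ $trueRef->{$row} };
    }

    # Guard against division by zero for empty matrices.
    my $precision = $numPredicted ? $numCorrect / $numPredicted : 0;
    my $recall    = $numTrue      ? $numCorrect / $numTrue      : 0;
    return ($precision, $recall);
}

my $predicted = { A => { C => 1, E => 1, F => 1 }, B => { G => 1 } };
my $gold      = { A => { C => 1 },                 B => { G => 1, H => 1 } };
my ($precision, $recall) = precision_recall_sketch($predicted, $gold);
printf "precision = %.2f, recall = %.2f\n", $precision, $recall;
```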
# generates precision and recall values by varying the threshold
# of the A->C ranking measure. Also generates precision at k, and
# mean average precision
# input: none
# output: none, but precision, recall, precision at k, and map values
# output to STDOUT
sub timeSlicing_generatePrecisionAndRecall_implicit {
my $NUM_SAMPLES = 200; #TODO, read from file the number of samples to average over for timeslicing
my $self = shift;
my $start; #used to record run times
print "In timeSlicing_generatePrecisionAndRecall_implicit\n";
#Get inputs
my $startAcceptTypesRef = $self->_getAcceptTypes('start');
my $linkingAcceptTypesRef = $self->_getAcceptTypes('linking');
my $targetAcceptTypesRef = $self->_getAcceptTypes('target');
#-----------
# Starting Matrix Creation
#-----------
#Get the Explicit Matrix
print "loading explicit\n";
my $explicitMatrixRef;
if(!defined $lbdOptions{'explicitInputFile'}) {
die ("ERROR: explicitInputFile must be defined in LBD config file\n");
}
$explicitMatrixRef = Discovery::fileToSparseMatrix($lbdOptions{'explicitInputFile'});
#create the starting matrix
print "generating starting\n";
my $startingMatrixRef
= TimeSlicing::generateStartingMatrix($explicitMatrixRef, \%lbdOptions, $startAcceptTypesRef, $NUM_SAMPLES, $umls_interface);
#----------
#--------
# Gold Loading/Creation
#--------
#load or create the gold matrix
my $goldMatrixRef;
if (exists $lbdOptions{'goldInputFile'}) {
print "inputting gold\n";
$goldMatrixRef = Discovery::fileToSparseMatrix($lbdOptions{'goldInputFile'});
}
else {
print "loading post cutoff\n";
$goldMatrixRef = TimeSlicing::loadPostCutOffMatrix($startingMatrixRef, $explicitMatrixRef, $lbdOptions{'postCutoffFileName'});
#remove explicit knowledge from the post cutoff matrix
$goldMatrixRef = Discovery::removeExplicit($startingMatrixRef, $goldMatrixRef);
#apply a semantic type filter to the post cutoff matrix
print "applying semantic filter to post-cutoff matrix\n";
if ((scalar keys %{$targetAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_columns(
$goldMatrixRef, $targetAcceptTypesRef, $umls_interface);
}
#TODO why is the gold matrix outputting with an extra line between samples?
#output the gold matrix
if (exists $lbdOptions{'goldOutputFile'}) {
print "outputting gold\n";
Discovery::outputMatrixToFile($lbdOptions{'goldOutputFile'}, $goldMatrixRef);
}
}
#-------
#-------
# AB Scoring (if needed)
#-------
#if using average minimum weight, grab the a->b scores, #TODO this is sloppy here, but it has to be here...how to make it fit better?
my %abPairsWithScores = ();
if ($lbdOptions{'rankingProcedure'} eq 'averageMinimumWeight'
|| $lbdOptions{'rankingProcedure'} eq 'ltc_amw') {
print "getting AB scores\n";
#apply semantic type filter to columns only
if ((scalar keys %{$linkingAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_columns(
$explicitMatrixRef, $linkingAcceptTypesRef, $umls_interface);
}
#initialize the abPairs to the frequency of co-occurrence
foreach my $row (keys %{$startingMatrixRef}) {
foreach my $col (keys %{${$startingMatrixRef}{$row}}) {
$abPairsWithScores{"$row,$col"} = ${${$startingMatrixRef}{$row}}{$col};
lib/ALBD.pm
Rank::getBatchAssociationScores(
\%abPairsWithScores, $explicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association);
}
#--------
#------------
# Matrix Filtering/Thresholding
#------------
#load or threshold the matrix
if (exists $lbdOptions{'thresholdedMatrix'}) {
print "loading thresholded matrix\n";
$explicitMatrixRef = (); #clear (for memory)
$explicitMatrixRef = Discovery::fileToSparseMatrix($lbdOptions{'thresholdedMatrix'});
}
#else {#TODO apply a threshold}
#NOTE, we must threshold the entire matrix because that is how we are calculating association scores
#Apply Semantic Type Filter to the explicit matrix
print "applying semantic filter to explicit matrix\n";
if ((scalar keys %{$linkingAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_rowsAndColumns(
$explicitMatrixRef, $linkingAcceptTypesRef, $umls_interface);
}
#------------
# Prediction Generation
#------------
#load or create the predictions matrix
my $predictionsMatrixRef;
if (exists $lbdOptions{'predictionsInFile'}) {
print "loading predictions\n";
$predictionsMatrixRef = Discovery::fileToSparseMatrix($lbdOptions{'predictionsInFile'});
}
else {
print "generating predictions\n";
#generate implicit knowledge
print "Squaring Matrix\n";
$predictionsMatrixRef = Discovery::findImplicit(
$explicitMatrixRef, $startingMatrixRef);
#Remove Known Connections
print "Removing Known from Predictions\n";
$predictionsMatrixRef
= Discovery::removeExplicit($startingMatrixRef, $predictionsMatrixRef);
#apply a semantic type filter to the predictions matrix
print "Applying Semantic Filter to Predictions\n";
if ((scalar keys %{$targetAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_columns(
$predictionsMatrixRef, $targetAcceptTypesRef, $umls_interface);
}
#save the implicit knowledge matrix to file
if (exists ($lbdOptions{'predictionsOutFile'})) {
print "outputting predictions\n";
Discovery::outputMatrixToFile($lbdOptions{'predictionsOutFile'}, $predictionsMatrixRef);
}
}
#-------------------------------------------
#At this point, the explicitMatrixRef has been filtered and thresholded.
#The predictions matrix ref has been generated from the filtered and
# thresholded explicitMatrixRef: only rows of starting terms remain, it has
# been filtered, and explicit connections have been removed
lib/ALBD.pm
#--------------
# Get the ranks of all predictions
#--------------
#get the scores and ranks separately for each row
# thereby generating scores and ranks for each starting
# term individually
my %rowRanks = ();
my ($n1pRef, $np1Ref, $npp);
print "getting row ranks\n";
foreach my $rowKey (keys %{$predictionsMatrixRef}) {
#grab rows from start and implicit matrices
my %startingRow = ();
$startingRow{$rowKey} = ${$startingMatrixRef}{$rowKey};
my %implicitRow = ();
$implicitRow{$rowKey} = ${$predictionsMatrixRef}{$rowKey};
#Score Implicit Connections
my $scoresRef;
if ($lbdOptions{'rankingProcedure'} eq 'allPairs') {
lib/ALBD.pm
while (my $line = <IN>) {
#check if its a comment or blank line
$firstChar = substr $line, 0, 1;
if ($firstChar ne '#' && $line =~ /[^\s]+/) {
#line contains data, grab the key and value
$line =~ /<([^>]+)>([^\n]*)/;
#make sure the data was read in correctly
if (!$1) {
print STDERR
"Warning: Invalid line in $configFileName: $line\n";
}
else {
#data was grabbed from the line, add to hash
if ($2) {
#add key and value to the optionsHash
$optionsHash{$1} = $2;
}
else {
#add key and set default value to the optionsHash
lib/ALBD.pm
}
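The configuration reader shown earlier accepts lines of the form `<key>value`, skips blank lines and lines starting with `#`, and assigns a default when a key has no value. A minimal sketch of that parsing over an in-memory list of lines is given below. The function name is hypothetical, and the default value of 1 is an assumption, since the original default assignment is not shown.

```perl
use strict;
use warnings;

# Hypothetical parser for "<key>value" configuration lines.
sub parse_config_lines_sketch {
    my @lines = @_;
    my %options;
    foreach my $line (@lines) {
        chomp(my $copy = $line);
        next if $copy =~ /^\s*$/;    # skip blank lines
        next if $copy =~ /^#/;       # skip comment lines
        if ($copy =~ /<([^>]+)>(.*)/) {
            my ($key, $value) = ($1, $2);
            # A key without a value gets a default of 1 (assumption).
            $options{$key} = length($value) ? $value : 1;
        }
        else {
            warn "Warning: invalid config line: $copy\n";
        }
    }
    return \%options;
}

my $optionsRef = parse_config_lines_sketch(
    "# ranking settings\n",
    "\n",
    "<rankingProcedure>allPairs\n",
    "<precisionAndRecall_implicit>\n",
);
print "$_ = $optionsRef->{$_}\n" foreach sort keys %{$optionsRef};
```

Capturing the key and value in the same conditional that tests the match avoids reading stale `$1` and `$2` values left over from an earlier successful match.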
##############################################################################
# function to produce output
##############################################################################
# outputs the implicit terms to string
# input: $scoresRef <- a reference to a hash of scores (hash{CUI}=score)
# $ranksRef <- a reference to an array of CUIs ranked by their score
# $printTo <- optional, outputs the $printTo top ranked terms. If not
# specified, all terms are output
# output: a line separated string containing ranked terms, scores, and their
# preferred terms
sub _rankedTermsToString {
my $self = shift;
my $scoresRef = shift;
my $ranksRef = shift;
my $printTo = shift;
#set printTo
if (!$printTo) {
$printTo = scalar @{$ranksRef};
}
#construct the output string
my $string = '';
my $index;
for (my $i = 0; $i < $printTo; $i++) {
#add the rank
$index = $i+1;
$string .= "$index\t";
#add the score
$string .= sprintf "%.5f\t", ${$scoresRef}{${$ranksRef}[$i]};
#add the CUI
$string .= "${$ranksRef}[$i]\t";
#add the name
my $name = $umls_interface->getPreferredTerm(${$ranksRef}[$i]);
#if no preferred name, get anything
if (!defined $name || $name eq '') {
my $termListRef = $umls_interface->getTermList(${$ranksRef}[$i]);
if (scalar @{$termListRef} > 0) {
$name = '.**'.${$termListRef}[0];
}
lib/ALBD.pm
}
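The ranking and output steps used throughout this file (rank terms by descending score, then emit rank, score, CUI, and name separated by tabs) can be sketched without a UMLS connection. Both function names are hypothetical, the `%names` hash stands in for `$umls_interface->getPreferredTerm`, and breaking score ties alphabetically by CUI is an assumption made so that the output is deterministic.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Rank::rankDescending: highest score first,
# ties broken alphabetically by key (assumption).
sub rank_descending_sketch {
    my ($scoresRef) = @_;
    my @ranked = sort {
        $scoresRef->{$b} <=> $scoresRef->{$a} || $a cmp $b
    } keys %{$scoresRef};
    return \@ranked;
}

# Build one "rank<TAB>score<TAB>CUI<TAB>name" line per ranked term.
# $printTo is optional; when omitted (or too large) all terms are output.
sub ranked_terms_to_string_sketch {
    my ($scoresRef, $ranksRef, $namesRef, $printTo) = @_;
    $printTo = scalar @{$ranksRef}
        if !$printTo || $printTo > scalar @{$ranksRef};
    my $string = '';
    for my $i (0 .. $printTo - 1) {
        my $cui  = $ranksRef->[$i];
        my $name = exists $namesRef->{$cui} ? $namesRef->{$cui} : '';
        $string .= sprintf "%d\t%.5f\t%s\t%s\n",
            $i + 1, $scoresRef->{$cui}, $cui, $name;
    }
    return $string;
}

# Made-up CUIs and names, for illustration only.
my %scores = (C0000001 => 0.5, C0000002 => 2, C0000003 => 0.5);
my %names  = (C0000002 => 'Term two');
my $ranks  = rank_descending_sketch(\%scores);
print ranked_terms_to_string_sketch(\%scores, $ranks, \%names);
```

Passing the score to `sprintf` as a number, rather than as a string with a trailing tab, keeps `%.5f` from raising a non-numeric argument warning.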
##############################################################################
# functions for debugging
##############################################################################
=comment
sub debugLBD {
my $self = shift;
my $startingCuisRef = shift;
print "Starting CUIs = ".(join(',', @{$startingCuisRef}))."\n";
#Get the Explicit Matrix
my ($explicitMatrixRef, $cuiToIndexRef, $indexToCuiRef, $matrixSize) =
Discovery::tableToSparseMatrix('N_11', $cuiFinder);
print "Explicit Matrix:\n";
_printMatrix($explicitMatrixRef, $matrixSize, $indexToCuiRef);
print "-----------------------\n";
#Get the Starting Matrix
my $startingMatrixRef =
Discovery::getRows($startingCuisRef, $explicitMatrixRef);
print "Starting Matrix:\n";
_printMatrix($startingMatrixRef, $matrixSize, $indexToCuiRef);
print "-----------------------\n";
#Get Implicit Connections
my $implicitMatrixRef
= Discovery::findImplicit($explicitMatrixRef, $startingMatrixRef,
$indexToCuiRef, $matrixSize);
print "Implicit Matrix:\n";
_printMatrix($implicitMatrixRef, $matrixSize, $indexToCuiRef);
print "-----------------------\n";
#Remove Known Connections
$implicitMatrixRef = Discovery::removeExplicit($explicitMatrixRef,
$implicitMatrixRef);
print "Implicit Matrix with Explicit Removed\n";
_printMatrix($implicitMatrixRef, $matrixSize, $indexToCuiRef);
print "-----------------------\n";
print "\n\n";
#Test N11, N1P, etc...
#NOTE...always do n11 first, if n11 = -1, no need to compute the others...there is no co-occurrence between them
my $n11 = Rank::getN11('C0','C2',$explicitMatrixRef);
my $npp = Rank::getNPP($explicitMatrixRef);
my $n1p = Rank::getN1P('C0', $explicitMatrixRef);
my $np1 = Rank::getNP1('C2', $explicitMatrixRef);
print "Contingency Table Values from Explicit Matrix\n";
print "n11 = $n11\n";
print "npp = $npp\n";
print "n1p = $n1p\n";
print "np1 = $np1\n";
#Test other rank methods
my $scoresRef = Rank::scoreImplicit_fromAllPairs($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association);
my $ranksRef = Rank::rankDescending($scoresRef);
print "Scores: \n";
foreach my $cui (keys %{$scoresRef}) {
print " scores{$cui} = ${$scoresRef}{$cui}\n";
}
print "Ranks = ".join(',', @{$ranksRef})."\n";
}
sub _printMatrix {
my $matrixRef = shift;
my $matrixSize = shift;
my $indexToCuiRef = shift;
for (my $i = 0; $i < $matrixSize; $i++) {
my $index1 = ${$indexToCuiRef}{$i};
for (my $j = 0; $j < $matrixSize; $j++) {
my $printed = 0;
my $index2 = ${$indexToCuiRef}{$j};
my $hash1Ref = ${$matrixRef}{$index1};
if (defined $hash1Ref) {
my $val = ${$hash1Ref}{$index2};
if (defined $val) {
print $val."\t";
$printed = 1;
}
}
if (!$printed) {
print "0\t";
}
}
print "\n";
}
}
=cut
1;
lib/LiteratureBasedDiscovery/Discovery.pm
sub outputMatrixToFile {
my $outFile = shift;
my $matrixRef = shift;
#open the output file and output the matrix
open OUT, ">$outFile" or die ("Error opening matrix output file: $outFile\n");
my $rowRef;
foreach my $rowKey (keys %{$matrixRef}) {
$rowRef = ${$matrixRef}{$rowKey};
foreach my $colKey (keys %{$rowRef}) {
print OUT "$rowKey\t$colKey\t${$rowRef}{$colKey}\n";
}
}
close OUT;
}
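The tab-separated `row<TAB>column<TAB>value` layout written by `outputMatrixToFile` can be round-tripped through a hash of hashes. The sketch below works on an in-memory string instead of a file so that it is self-contained. Both function names are hypothetical, and assuming that `Discovery::fileToSparseMatrix` reads the same three-column layout is a guess, since its body is not shown here.

```perl
use strict;
use warnings;

# Serialize a sparse matrix (hash of hashes) as "row\tcol\tvalue" lines.
# Keys are sorted only to make the output deterministic.
sub matrix_to_string_sketch {
    my ($matrixRef) = @_;
    my $out = '';
    foreach my $rowKey (sort keys %{$matrixRef}) {
        my $rowRef = $matrixRef->{$rowKey};
        foreach my $colKey (sort keys %{$rowRef}) {
            $out .= "$rowKey\t$colKey\t$rowRef->{$colKey}\n";
        }
    }
    return $out;
}

# Parse "row\tcol\tvalue" lines back into a hash of hashes.
sub string_to_matrix_sketch {
    my ($text) = @_;
    my %matrix;
    # Three-argument open on a scalar reference gives an in-memory handle.
    open my $fh, '<', \$text or die "cannot read in-memory text: $!\n";
    while (my $line = <$fh>) {
        chomp $line;
        my ($rowKey, $colKey, $value) = split /\t/, $line;
        next unless defined $value;    # skip malformed or empty lines
        $matrix{$rowKey}{$colKey} = $value;
    }
    close $fh;
    return \%matrix;
}

my $matrix = { C1 => { C2 => 3, C3 => 1 }, C2 => { C1 => 3 } };
my $text   = matrix_to_string_sketch($matrix);
my $copy   = string_to_matrix_sketch($text);
print $text;
print "round trip ok\n" if matrix_to_string_sketch($copy) eq $text;
```

Using a lexical filehandle with three-argument `open`, and closing it explicitly, avoids leaving a shared bareword handle such as `OUT` open between calls.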
#Note: Table to sparse is no longer used, but could be useful in the future
=comment
# retrieve a table from mysql and convert it to a sparse matrix (a hash of
# hashes)
# input : $tableName <- the name of the table to output
lib/LiteratureBasedDiscovery/Filters.pm
my $umls = shift;
=comment
#Count the number of keys before and after filtering (for debugging)
my %termsHash = ();
foreach my $key1 (keys %{$matrixRef}) {
foreach my $key2 (keys %{${$matrixRef}{$key1}}) {
$termsHash{$key2} = 1;
}
}
print " number of keys before filtering = ".(scalar keys %termsHash)."\n";
=cut
#eliminate values that are incorrect semantic groups
#do each row at a time, remove column values that
#are the incorrect semantic type
my %cuisChecked = ();
#cuisChecked keeps track of cuis that have been checked
# for elimination. If the cui has been checked its key
# will exist in the hash. Values of -1 indicate it should
# be eliminated, values of 1 indicate it should stay.
lib/LiteratureBasedDiscovery/Filters.pm
=comment
#Count the number of keys after filtering (for debugging)
%termsHash = ();
foreach my $key1 (keys %{$matrixRef}) {
foreach my $key2 (keys %{${$matrixRef}{$key1}}) {
$termsHash{$key2} = 1;
}
}
print " number of keys after filtering = ".(scalar keys %termsHash)."\n";
=cut
}
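The `%cuisChecked` cache described in the comments above (1 means keep, -1 means eliminate) can be sketched as a small memoized filter over rows and columns. This is an illustration under stated assumptions: the function name is hypothetical, and the `$typesOfRef` code reference stands in for whatever lookup the real filter performs through the UMLS interface, whose calls are not shown here.

```perl
use strict;
use warnings;

# Hypothetical row-and-column semantic type filter with a decision cache.
# $typesOfRef is a code ref returning the list of semantic types of a CUI.
sub semantic_filter_sketch {
    my ($matrixRef, $acceptTypesRef, $typesOfRef) = @_;

    # Cache of decisions per CUI: 1 = keep, -1 = eliminate.
    my %cuisChecked;
    my $keep = sub {
        my ($cui) = @_;
        unless (exists $cuisChecked{$cui}) {
            my $ok = grep { exists $acceptTypesRef->{$_} } $typesOfRef->($cui);
            $cuisChecked{$cui} = $ok ? 1 : -1;
        }
        return $cuisChecked{$cui} == 1;
    };

    foreach my $row (keys %{$matrixRef}) {
        # Drop the whole row when the row term is not an accepted type.
        if (!$keep->($row)) {
            delete $matrixRef->{$row};
            next;
        }
        # Otherwise drop only the columns of unaccepted types.
        foreach my $col (keys %{ $matrixRef->{$row} }) {
            delete $matrixRef->{$row}{$col} unless $keep->($col);
        }
    }
    return $matrixRef;
}

# Made-up CUIs and type abbreviations, for illustration only.
my %types   = (C1 => ['dsyn'], C2 => ['phsu'], C3 => ['geoa']);
my $typesOf = sub { @{ $types{ $_[0] } || [] } };
my $matrix  = { C1 => { C2 => 3, C3 => 1 }, C3 => { C1 => 1 } };

semantic_filter_sketch($matrix, { dsyn => 1, phsu => 1 }, $typesOf);
foreach my $row (sort keys %{$matrix}) {
    print "$row: ", join(',', sort keys %{ $matrix->{$row} }), "\n";
}
```

Caching the decision means each CUI is looked up once, however many rows it appears in, which matters when every lookup is a database query.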
# applies a semantic group filter to the matrix, by removing keys that
# are not of an allowed semantic type. Only removes types from rows,
# so it is applied for time slicing, before randomly selecting terms of
# one semantic type
# input: $matrixRef <- ref to a sparse matrix to be filtered
# $acceptTypesRef <- a ref to a hash of accept type strings
lib/LiteratureBasedDiscovery/Filters.pm
my $umls = shift;
=comment
#Count the number of keys before and after filtering (for debugging)
my %termsHash = ();
foreach my $key1 (keys %{$matrixRef}) {
foreach my $key2 (keys %{${$matrixRef}{$key1}}) {
$termsHash{$key2} = 1;
}
}
print " number of keys before filtering = ".(scalar keys %termsHash)."\n";
=cut
#eliminate values that are incorrect semantic groups
#do each row at a time, remove column values that
#are the incorrect semantic type
my $keep = -1;
#cuisChecked keeps track of cuis that have been checked
# for elimination. If the cui has been checked its key
# will exist in the hash. Values of -1 indicate it should
# be eliminated, values of 1 indicate it should stay.
lib/LiteratureBasedDiscovery/Filters.pm view on Meta::CPAN
}
=comment
#Count the number of keys after filtering (for debugging)
%termsHash = ();
foreach my $key1 (keys %{$matrixRef}) {
foreach my $key2 (keys %{${$matrixRef}{$key1}}) {
$termsHash{$key2} = 1;
}
}
print " number of keys after filtering = ".(scalar keys %termsHash)."\n";
=cut
}
# applies a semantic group filter to the matrix by removing keys that
# are not of an allowed semantic type. Only removes types from columns,
# so it is applied to the implicit matrix (starting term rows with implicit
# columns).
# input: $matrixRef <- ref to a sparse matrix to be filtered
# $acceptTypesRef <- a ref to a hash of accept type strings
lib/LiteratureBasedDiscovery/Filters.pm view on Meta::CPAN
my $umls = shift;
=comment
#Count the number of keys before and after filtering (for debugging)
my %termsHash = ();
foreach my $key1 (keys %{$matrixRef}) {
foreach my $key2 (keys %{${$matrixRef}{$key1}}) {
$termsHash{$key2} = 1;
}
}
print " number of keys before filtering = ".(scalar keys %termsHash)."\n";
=cut
#eliminate values that are incorrect semantic groups
#do each row at a time, remove column values that
#are the incorrect semantic type
my %cuisChecked = ();
#cuisChecked keeps track of cuis that have been checked
# for elimination. If the cui has been checked its key
# will exist in the hash. Values of -1 indicate it should
# be eliminated, values of 1 indicate it should stay.
lib/LiteratureBasedDiscovery/Filters.pm view on Meta::CPAN
}
=comment
#Count the number of keys after filtering (for debugging)
%termsHash = ();
foreach my $key1 (keys %{$matrixRef}) {
foreach my $key2 (keys %{${$matrixRef}{$key1}}) {
$termsHash{$key2} = 1;
}
}
print " number of keys after filtering = ".(scalar keys %termsHash)."\n";
=cut
}
# gets the semantic types of the group
# input: $group <- a string specifying a semantic group
# $umls <- an instance of UMLS::Interface
# output: a ref to a hash of TUIs
sub getTypesOfGroup {
my $group = shift;
lib/LiteratureBasedDiscovery/TimeSlicing.pm view on Meta::CPAN
#grab the input
my $goldMatrixRef = shift;
my $rowRanksRef = shift;
my $numIntervals = shift;
#calculate and output stats
#------------------------------------------
#calculate precision and recall
print "calculating precision and recall\n";
my ($precisionRef, $recallRef) = &calculatePrecisionAndRecall_implicit(
$goldMatrixRef, $rowRanksRef, $numIntervals);
#output precision and recall
print "----- average precision at 10% recall intervals (i recall precision) ----> \n";
foreach my $i (sort {$a <=> $b} keys %{$precisionRef}) {
print " $i ${$recallRef}{$i} ${$precisionRef}{$i}\n";
}
print "\n";
#-------------------------------------------
#calculate mean average precision
my $map = &calculateMeanAveragePrecision(
$goldMatrixRef, $rowRanksRef);
#output mean average precision
print "---------- mean average precision ---------------> \n";
print " MAP = $map\n";
print "\n";
#-------------------------------------------
#calculate precision at k
print "calculating precision at k\n";
my $precisionAtKRef = &calculatePrecisionAtK($goldMatrixRef, $rowRanksRef);
#output precision at k
print "---------- mean precision at k intervals ---------------> \n";
foreach my $k (sort {$a <=> $b} keys %{$precisionAtKRef}) {
print " $k ${$precisionAtKRef}{$k}\n";
}
print "\n";
#-------------------------------------------
#calculate cooccurrences at k
print "calculating mean cooccurrences at k\n";
my $cooccurrencesAtKRef = &calculateMeanCooccurrencesAtK($goldMatrixRef, $rowRanksRef);
#output cooccurrences at k
print "---------- mean cooccurrences at k intervals ---------------> \n";
foreach my $k (sort {$a <=> $b} keys %{$cooccurrencesAtKRef}) {
print " $k ${$cooccurrencesAtKRef}{$k}\n";
}
print "\n";
}
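# Illustrative sketch (not the module's calculatePrecisionAtK): precision at k
# for a single ranked list, i.e. the fraction of the top k predictions that
# appear in the gold row. sketchPrecisionAtK and its arguments are
# hypothetical; the real routine may average over rows and k values differently.
use strict;
use warnings;
sub sketchPrecisionAtK {
    my ($goldRowRef, $rankedRef, $k) = @_;
    return 0 if $k < 1;
    my $numCorrect = 0;
    #only the first $k ranked predictions are considered
    for (my $i = 0; $i < $k && $i < scalar @{$rankedRef}; $i++) {
        $numCorrect++ if exists $goldRowRef->{ $rankedRef->[$i] };
    }
    return $numCorrect / $k;
}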
# loads a list of cuis for use in time slicing from file
# the CUI file contains a line separated list of CUIs
# input: $cuiFileName <- a string specifying the file to load cuis from
# output: \%cuis <- a ref to a hash of cuis, each key is a cui, values are 1
sub loadCUIs {
my $cuiFileName = shift;
lib/LiteratureBasedDiscovery/TimeSlicing.pm view on Meta::CPAN
# calculates average precision and recall of the generated implicit matrix
# compared to the post cutoff matrix
# input: $predictionsMatrixRef <- a ref to a sparse matrix of predicted
# discoveries
# $trueMatrixRef <- a ref to a sparse matrix of true discoveries
# output: ($precision, $recall) <- two scalar values specifying the precision
# and recall
sub calculatePrecisionRecall {
my $predictionsMatrixRef = shift; #a matrix of predicted discoveries
my $trueMatrixRef = shift; #a matrix of true discoveries
print "calculating precision and recall\n";
#bounds check, the predictions matrix must contain keys
if ((scalar keys %{$predictionsMatrixRef}) < 1) {
return (0,0); #precision and recall are both zero
}
#calculate precision and recall averaged over each cui
my $precision = 0;
my $recall = 0;
#each row key corresponds to a term for which we calculate
lib/LiteratureBasedDiscovery/TimeSlicing.pm view on Meta::CPAN
# to rows in the starting matrix ref to save memory, and because those are
# the only rows that are needed.
# input: $startingMatrixRef <- a ref to the starting sparse matrix
# $explicitMatrixRef <- a ref to the explicit sparse matrix
# $postCutoffFileName <- the filename to the postCutoffMatrix
# output: \%postCutoffMatrix <- a ref to the postCutoff sparse matrix
sub loadPostCutOffMatrix {
my $startingMatrixRef = shift;
my $explicitMatrixRef = shift;
my $postCutoffFileName = shift;
print "loading postCutoff Matrix\n";
#open the post cutoff file
open IN, $postCutoffFileName
or die ("ERROR: cannot open post cutoff file: $postCutoffFileName");
#create hash of cuis to grab
my %cuisToGrab = ();
foreach my $rowKey (keys %{$startingMatrixRef}) {
$cuisToGrab{$rowKey} = 1;
}
lib/LiteratureBasedDiscovery/TimeSlicing.pm view on Meta::CPAN
#check if a file is defined
if (exists ${$lbdOptionsRef}{'cuiListFileName'}) {
#grab the rows defined by the cuiListFile
my $cuisRef = &loadCUIs(${$lbdOptionsRef}{'cuiListFileName'});
foreach my $cui (keys %{$cuisRef}) {
if(exists ${$explicitMatrixRef}{$cui}) {
$startingMatrix{$cui} = ${$explicitMatrixRef}{$cui};
}
else {
print STDERR "WARNING: CUI from cuiListFileName is not in explicitMatrix: $cui\n";
}
}
}
else {
#randomly grab rows
#apply semantic filter to the rows (just retrieve appropriate rows)
my $rowsToKeepRef = getRowsOfSemanticTypes(
$explicitMatrixRef, $startTermAcceptTypesRef, $umls_interface);
((scalar keys %{$rowsToKeepRef}) >= $numRows) or die("ERROR: number of acceptable starting term rows is less than $numRows\n");
lib/LiteratureBasedDiscovery/TimeSlicing.pm view on Meta::CPAN
if (exists $rowNumbers{$i}) {
$startingMatrix{$key} = ${$explicitMatrixRef}{$key}
}
$i++;
}
#output the cui list if needed
if (exists ${$lbdOptionsRef}{'cuiListOutputFile'}) {
open OUT, ">".${$lbdOptionsRef}{'cuiListOutputFile'} or die ("ERROR: cannot open cuiListOutputFile:".${$lbdOptionsRef}{'cuiListOutputFile'}."\n");
foreach my $cui (keys %startingMatrix) {
print OUT "$cui\n";
}
close OUT;
}
}
#return the starting matrix
return \%startingMatrix;
}
lib/LiteratureBasedDiscovery/TimeSlicing.pm view on Meta::CPAN
# $umls_association <- an instance of UMLS::Association
# output: \%cuiPairs <- a ref to a hash of CUI pairs and their association
# each key of the hash is a comma separated string
# containing cui1, and cui2 of the pair
# (e.g. 'cui1,cui2'), and each value is their association
# score using the specified association measure
sub getAssociationScores {
my $matrixRef = shift;
my $rankingMeasure = shift;
my $umls_association = shift;
print " getting Association Scores, rankingMeasure = $rankingMeasure\n";
#generate a list of cui pairs in the matrix
my %cuiPairs = ();
print " generating association scores:\n";
foreach my $rowKey (keys %{$matrixRef}) {
foreach my $colKey (keys %{${$matrixRef}{$rowKey}}) {
$cuiPairs{"$rowKey,$colKey"} = ${${$matrixRef}{$rowKey}}{$colKey};
}
}
#get ranks for all the cui pairs in the matrix
#return a hash of cui pairs and their frequency
if ($rankingMeasure eq 'frequency') {
return \%cuiPairs;
lib/LiteratureBasedDiscovery/TimeSlicing.pm view on Meta::CPAN
# (e.g. 'cui1,cui2'), values are their association
# scores.
# $matrixRef <- a reference to a co-occurrence sparse matrix that
# corresponds to the assocScoresRef
# output: \%thresholdedMatrix <- a ref to a sparse matrix containing only the
# $k ranked samples (cui pairs)
sub grabKHighestRankedSamples {
my $k = shift;
my $assocScoresRef = shift;
my $matrixRef = shift;
print "getting $k highest ranked samples\n";
#apply the threshold
my $preKeyCount = scalar keys %{$assocScoresRef};
my $postKeyCount = 0;
my %thresholdedMatrix = ();
#get the keys sorted by value in descending order
my @sortedKeys = sort { $assocScoresRef->{$b} <=> $assocScoresRef->{$a} } keys(%$assocScoresRef);
my $threshold = ${$assocScoresRef}{$sortedKeys[$k-1]};
print " threshold = $threshold\n";
#add the first k keys to the thresholded matrix
my ($cui1, $cui2);
foreach my $key (@sortedKeys) {
($cui1, $cui2) = split(/,/, $key);
#create new hash at rowkey location (if needed)
if (!(exists $thresholdedMatrix{$cui1})) {
my %newHash = ();
$thresholdedMatrix{$cui1} = \%newHash;
lib/LiteratureBasedDiscovery/TimeSlicing.pm view on Meta::CPAN
# $rowRanksRef <- a ref to a hash of arrays of ranked predictions.
# Each hash key is a cui, each hash element is an
# array of ranked predictions for that cui. The ranked
# predictions are cuis ordered in descending order
# based on association. (from Rank::RankDescending)
# output: $map <- a scalar value of mean average precision (MAP)
sub calculateMeanAveragePrecision {
#grab the input
my $trueMatrixRef = shift; # a matrix of true discoveries
my $rowRanksRef = shift; # a hash of ranked predicted discoveries
print "calculating mean average precision\n";
#calculate MAP for each true discovery being predicted
my $map = 0;
foreach my $rowKey (keys %{$trueMatrixRef}) {
my $rankedPredictionsRef = ${$rowRanksRef}{$rowKey}; #an array ref of ranked predictions
#skip for rows that have no predictions
if (!defined $rankedPredictionsRef) {
next;
}
samples/configFileSamples/UMLSAssociationConfig view on Meta::CPAN
# UMLS::Association
<database>CUI_Bigram
# If the UMLS::Association Database is not installed on the local machine
# The following parameters may be needed to connect to the server
<hostname>192.168.00.00
<username>username
<password>password
<socket>/var/run/mysqld.sock
# makes the UMLS::Association not print to the command line
<t>
samples/configFileSamples/UMLSInterfaceConfig view on Meta::CPAN
# are in <>'s, and values follow directly after with no space. As an example,
# the line "<database>umls" will pass the 'database' parameter with a value
# of 'umls' to the UMLS::Interface options hash for its initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
#
#
# See UMLS::Interface for more detail
# makes the UMLS::Interface not print to the command line
<t>
# Put the full pathname of the UMLS::Interface Config File
<config>/home/sam/assocLBD-0.01/config/interfaceConfig
# If the UMLS::Interface Database is not installed on the local machine
# The following parameters may be needed to connect to the server
<hostname>192.168.00.00
<username>username
<password>password
samples/runSample.pl view on Meta::CPAN
#Demo file, showing how to run open discovery using the sample data, and how
# to perform time slicing evaluation using the sample data
# run a sample lbd using the parameters in the lbd configuration file
print "\n OPEN DISCOVERY \n";
`perl ../utils/runDiscovery.pl lbdConfig`;
print "LBD Open discovery results output to sampleOutput\n\n";
# run a sample time slicing
# first remove the co-occurrences of the precutoff matrix (in this case
# the sampleExplicitMatrix) from the post cutoff matrix. This generates a gold
# standard discovery matrix from which time slicing may be performed
# This requires modifying the removeExplicit.pl, which we have done for you.
# The variables for this example in removeExplicit.pl are:
# my $matrixFileName = 'sampleExplicitMatrix';
# my $squaredMatrixFileName = 'postCutoffMatrix';
# my $outputFileName = 'sampleGoldMatrix';
#`perl ../utils/datasetCreator/removeExplicit.pl`;
# next, run time slicing
print " TIME SLICING \n";
`perl ../utils/runDiscovery.pl timeSlicingConfig > sampleTimeSliceOutput`;
print "LBD Time Slicing results output to sampleTimeSliceOutput\n";
# open and closed discovery code portions
#########################################################
#Test that the demo file can run correctly
`(cd ./samples/; perl runSample.pl) &`;
#######################################################
#test that the demo output matches the expected demo output
#########################################################
print "Performing Open Discovery Tests:\n";
#read in the gold scores from the open discovery gold
my %goldScores = ();
open IN, './t/goldSampleOutput'
or die ("Error: Cannot open gold sample output\n");
while (my $line = <IN>) {
if ($line =~ /\d+\t(\d+\.\d+)\t(C\d+)/) {
$goldScores{$2} = $1;
}
}
}
else {
$allExist = 0;
$allMatch = 0;
last;
}
}
ok ($allExist == 1, "All CUIs exist in the output"); #all cuis exist in the new output file
ok ($allMatch == 1, "All Scores are the same in the output"); #all scores are the same in the new output file
print "Done with Open Discovery Tests\n\n";
#######################################################
#test that time slicing is computed correctly
#########################################################
print "Performing Time Slicing Tests\n";
#read in gold time slicing output
(my $goldAPScoresRef, my $goldMAP, my $goldPAtKScoresRef, my $goldFAtKScoresRef)
= &readTimeSlicingData('./t/goldSampleTimeSliceOutput');
#read in new time slicing output
(my $newAPScoresRef, my $newMAP, my $newPAtKScoresRef, my $newFAtKScoresRef)
= &readTimeSlicingData('./samples/sampleTimeSliceOutput');
#check that the correct number of values are read for all the
# (within error tolerance)
my $fAtKSame = 1;
for (my $i = 0; $i < scalar @{$goldFAtKScoresRef}; $i++) {
if (abs(${$goldFAtKScoresRef}[$i] - ${$newFAtKScoresRef}[$i]) > $atKErrorTol) {
$fAtKSame = 0;
last;
}
}
ok($fAtKSame == 1, "Frequency at K Matches");
print "Done with Time Slicing Tests\n";
############################################################
#function to read in time slicing data values
sub readTimeSlicingData {
my $fileName = shift;
#read in the gold time slicing values
my @APScores = ();
utils/datasetCreator/applyMaxThreshold.pl view on Meta::CPAN
# gets co-occurrence stats, returns a hash of (unique) co-occurrence counts
# for each CUI. (count is unique or not depending on $applyToUnique)
sub getStats {
my $inputFile = shift;
my $applyToUnique = shift;
#open files
open IN, $inputFile or die("ERROR: unable to open inputFile: $inputFile\n");
print "Getting Stats\n";
#count stats for each line of the file
my ($cui1, $cui2, $val);
my %count = (); #a count of the number of (unique) co-occurrences
while (my $line = <IN>) {
#split the line
($cui1, $cui2, $val) = split(/\t/,$line);
if ($applyToUnique) {
#update the unique co-occurrence counts
$count{$cui1}++;
utils/datasetCreator/applyMaxThreshold.pl view on Meta::CPAN
my $inputFile = shift;
my $outputFile = shift;
my $maxThreshold = shift;
my $countRef = shift;
#open the input and output
open IN, $inputFile or die("ERROR: unable to open inputFile: $inputFile\n");
open OUT, ">$outputFile"
or die ("ERROR: unable to open outputFile: $outputFile\n");
print "ApplyingThreshold\n";
#threshold each line of the file
my ($cui1, $cui2, $val);
while (my $line = <IN>) {
#grab values
($cui1, $cui2, $val) = split(/\t/,$line);
#skip if the count for either $cui1 or $cui2 is greater than the threshold
# the counts in %count have been set already according to
# whether $applyToUnique is set or not
if (${$countRef}{$cui1} > $maxThreshold
|| ${$countRef}{$cui2} > $maxThreshold) {
next;
}
else {
print OUT $line;
}
}
close IN;
close OUT;
print "Done!\n";
}
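# Illustrative sketch (hypothetical helper, in-memory rather than file based):
# keeps only the lines whose two CUIs both have counts at or below the
# threshold, mirroring the filtering idea above. Treating a CUI with no
# recorded count as 0 is an assumption of this sketch. Requires Perl 5.10+
# for the // operator.
use strict;
use warnings;
sub sketchApplyMaxThreshold {
    my ($linesRef, $countRef, $maxThreshold) = @_;
    my @kept = ();
    foreach my $line (@{$linesRef}) {
        my ($cui1, $cui2, $val) = split(/\t/, $line);
        my $count1 = $countRef->{$cui1} // 0;
        my $count2 = $countRef->{$cui2} // 0;
        #drop the line if either CUI co-occurs too often
        next if ($count1 > $maxThreshold || $count2 > $maxThreshold);
        push @kept, $line;
    }
    return \@kept;
}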
utils/datasetCreator/applyMinThreshold.pl view on Meta::CPAN
#grab the input
my $minThreshold = shift;
my $inputFile = shift;
my $outputFile = shift;
#open files
open IN, $inputFile or die("ERROR: unable to open inputFile: $inputFile\n");
open OUT, ">$outputFile"
or die ("ERROR: unable to open outputFile: $outputFile\n");
print "Reading File\n";
#threshold each line of the file
my ($cui1, $cui2, $val);
while (my $line = <IN>) {
#grab values
($cui1, $cui2, $val) = split(/\t/,$line);
#check minThreshold
if ($val > $minThreshold) {
print OUT $line;
}
}
close IN;
close OUT;
print "Done!\n";
}
utils/datasetCreator/applySemanticFilter.pl view on Meta::CPAN
# Applies the semantic type filter
sub applySemanticFilter {
#grab the input
my $matrixFileName = shift;
my $outputFileName = shift;
my $acceptTypesString = shift;
my $acceptGroupsString = shift;
my $interfaceConfig = shift;
my $columnsOnly = shift;
print STDERR "Applying Semantic Filter to $matrixFileName\n";
#load the matrix
my $matrixRef = Discovery::fileToSparseMatrix($matrixFileName);
#initialize the UMLS::Interface
my $componentOptions =
LiteratureBasedDiscovery::_readConfigFile('',$interfaceConfig);
my $umls_interface = UMLS::Interface->new($componentOptions)
or die "Error: Unable to create UMLS::Interface object.\n";
utils/datasetCreator/combineCooccurrenceMatrices.pl view on Meta::CPAN
#Check on I/O
my $outFileName = "$startYear".'_'."$endYear".'_window'."$windowSize";
(!(-e $outFileName))
or die ("ERROR: output file already exists: $outFileName\n");
open OUT, ">$outFileName"
or die ("ERROR: unable to open output file: $outFileName\n");
#combine the files
my %matrix = ();
for(my $year = $startYear; $year <= $endYear; $year++) {
print "reading $year\n";
my $inFile = $dataFolder.$year.'_window'.$windowSize;
if (!(open IN, $inFile)) {
print " ERROR: unable to open $inFile\n";
next;
}
#read each line of the file and add to the matrix
while (my $line = <IN>) {
#read values from the line
$line =~ /([^\s]+)\t([^\s]+)\t([^\s]+)/;
my $rowKey = $1;
my $colKey = $2;
my $val = $3;
utils/datasetCreator/combineCooccurrenceMatrices.pl view on Meta::CPAN
}
if (!exists ${$matrix{$rowKey}}{$colKey}) {
${$matrix{$rowKey}}{$colKey} = 0;
}
${$matrix{$rowKey}}{$colKey}+=$val;
}
close IN;
}
#output the matrix
print "outputting the matrix\n";
foreach my $rowKey(keys %matrix) {
foreach my $colKey(keys %{$matrix{$rowKey}}) {
print OUT "$rowKey\t$colKey\t${$matrix{$rowKey}}{$colKey}\n";
}
}
close OUT;
print "DONE!\n";
}
utils/datasetCreator/dataStats/getCUICooccurrences.pl view on Meta::CPAN
###################################
###################################
#open files
open IN, $inputFile or die("ERROR: unable to open inputFile: $inputFile\n");
open OUT, ">$outputFile"
or die ("ERROR: unable to open outputFile: $outputFile\n");
print "Reading File\n";
#count stats for each line of the file
my %ucoCount = (); #a count of the number of unique co-occurrences
my %coCount = (); #a count of the number of co-occurrences
my ($cui1, $cui2, $val);
while (my $line = <IN>) {
#split the line
($cui1, $cui2, $val) = split(/\t/,$line);
#update the cooccurrence count
$coCount{$cui1}+=$val;
utils/datasetCreator/dataStats/getCUICooccurrences.pl view on Meta::CPAN
#update the unique co-occurrence counts
$ucoCount{$cui1}++;
#NOTE: do not update counts for $cui2, because in the case where order
#does not matter, the matrix will have been pre-processed to ensure
#the second cui will appear first in the key. In the case where order
#does matter we just shouldn't be counting it anyway
}
close IN;
print "Outputting Results\n";
#output the co-occurrence counts, sorted by number of unique
# co-occurrences (descending)
foreach my $cui(sort {$ucoCount{$b}<=>$ucoCount{$a}} keys %ucoCount) {
#coCount and ucoCount will have the same keys (see above loop)
print OUT "$cui\t$coCount{$cui}\t$ucoCount{$cui}\n";
}
close OUT;
print "Done!\n";
utils/datasetCreator/dataStats/getMatrixStats.pl view on Meta::CPAN
# (number of rows, number of columns, number of keys)
&getStats('/home/henryst/lbdData/groupedData/1852_window1_squared_inParts');
#############################################
# gets the stats for the matrix
#############################################
sub getStats {
my $fileName = shift;
print STDERR "$fileName\n";
#read in the matrix
open IN, $fileName or die ("unable to open file: $fileName\n");
my %matrix = ();
my $numCooccurrences = 0;
while (my $line = <IN>) {
#$line =~ /([^\t]+)\t([^\t]+)\t([\d]+)/;
$line =~ /([^\s]+)\s([^\s]+)\s([\d]+)/;
if (!exists $matrix{$1}) {
my %hash = ();
$matrix{$1} = \%hash;
}
$matrix{$1}{$2} = $3;
$numCooccurrences += $3;
}
close IN;
print STDERR " num rows in matrix = ".(scalar keys %matrix)."\n";
#count the number of columns and the number of keys
# this is done outside of the loop above because I also need to count the number of columns
my $numKeys = 0;
my %colKeys = ();
foreach my $row (keys %matrix) {
foreach my $colKey (keys %{$matrix{$row}}) {
$colKeys{$colKey} = 1;
$numKeys++;
}
}
print STDERR " num columns in matrix = ".(scalar keys %colKeys)."\n";
print STDERR " number of keys in the matrix = $numKeys\n";
print STDERR " number of cooccurrences in the matrix = $numCooccurrences\n";
}
utils/datasetCreator/dataStats/metaAnalysis.pl view on Meta::CPAN
my $startYear = shift;
my $endYear = shift;
my $windowSize = shift;
my $statsOutFileName= shift;
my $dataFolder = shift;
#Check on I/O
open OUT, ">$statsOutFileName"
or die ("ERROR: unable to open stats out file: $statsOutFileName\n");
#print header row
print OUT "year\tnumRows\tnumCols\tvocabularySize\tnumCooccurrences\n";
#get stats for each file and output to file
for(my $year = $startYear; $year <= $endYear; $year++) {
print "reading $year\n";
my $inFile = $dataFolder.$year.'_window'.$windowSize;
if (open IN, $inFile) {
(my $numRows, my $numCols, my $vocabularySize, my $numCooccurrences)
= &metaAnalysis($inFile);
print OUT "$year\t$numRows\t$numCols\t$vocabularySize\t$numCooccurrences\n"
}
else {
#just skip the file
print " ERROR: unable to open $inFile\n";
}
}
close OUT;
print "Done getting stats\n";
}
##############################
# runs meta analysis on a single file
sub metaAnalysis {
my $fileName = shift;
open IN, $fileName or die ("unable to open file: $fileName\n");
utils/datasetCreator/dataStats/metaAnalysis.pl view on Meta::CPAN
$uniqueKeys{$1} = 1;
$uniqueKeys{$2} = 1;
$numCooccurrences++;
}
close IN;
my $numRows = scalar keys %rowKeys;
my $numCols = scalar keys %colKeys;
my $vocabularySize = scalar keys %uniqueKeys;
print "$fileName: $numRows, $numCols, $vocabularySize, $numCooccurrences\n";
return $numRows, $numCols, $vocabularySize, $numCooccurrences;
}
utils/datasetCreator/fromMySQL/removeQuotes.pl view on Meta::CPAN
my $inFile = '1980_1984_window1_retest_data.txt';
my $outFile = '1980_1984_window1_restest_DELETEME';
open IN, $inFile or die ("unable to open inFile: $inFile\n");
open OUT, '>'.$outFile or die ("unable to open outFile: $outFile\n");
while (my $line = <IN>) {
$line =~ s/"//g;
#print $line;
print OUT $line;
}
close IN;
close OUT;
utils/datasetCreator/makeOrderNotMatter.pl view on Meta::CPAN
#make order not matter
#...output every $outputLimit iterations to avoid too much IO
my %matrix = ();
while (my $line = <IN>) {
#TODO use split instead of regex match
$line =~ /([^\s]+)\t([^\s]+)\t([^\s]+)/;
#$1 = row, $2 = col, $3 = val
if (!(defined $1) || !(defined $2) || !(defined $3)) {
print "Not all defined: $line";
}
#initialize rows if needed
if (!(exists $matrix{$1})) {
my %newHash = ();
$matrix{$1} = \%newHash;
}
if (!(exists $matrix{$2})) {
my %newHash = ();
$matrix{$2} = \%newHash;
utils/datasetCreator/makeOrderNotMatter.pl view on Meta::CPAN
#add the value
${$matrix{$1}}{$2} += $3;
#${$matrix{$2}}{$1} += $3;
}
close IN;
#output the matrix
foreach my $key1 (keys %matrix) {
foreach my $key2 (keys %{$matrix{$key1}}) {
print OUT "$key1\t$key2\t${$matrix{$key1}}{$key2}\n";
}
}
foreach my $key1 (keys %matrix) {
foreach my $key2 (keys %{$matrix{$key1}}) {
print OUT "$key2\t$key1\t${$matrix{$key1}}{$key2}\n";
}
}
close OUT;
print "DONE!\n";
utils/datasetCreator/removeCUIPair.pl view on Meta::CPAN
# used to remove Somatomedin C and Arginine from the 1960-1989 datasets
use strict;
use warnings;
my $cuiA = 'C0021665'; #somatomedin c
my $cuiB = 'C0003765'; #arginine
my $matrixFileName = '/home/henryst/lbdData/groupedData/1960_1989_window8_ordered';
my $matrixOutFileName = $matrixFileName.'_removed';
&removeCuiPair($cuiA, $cuiB, $matrixFileName, $matrixOutFileName);
print STDERR "DONE\n";
###########################################
# remove the CUI pair from the dataset
sub removeCuiPair {
my $cuiA = shift;
my $cuiB = shift;
my $matrixFileName = shift;
my $matrixOutFileName = shift;
print STDERR "removing $cuiA,$cuiB from $matrixFileName\n";
#open the in and out files
open IN, $matrixFileName
or die ("ERROR: cannot open matrix in file: $matrixFileName\n");
open OUT, ">$matrixOutFileName"
or die ("ERROR: cannot open matrix out file: $matrixOutFileName\n");
# read in each line of the matrix and copy to the new file
# but omit any $cuiA,$cuiB or $cuiB,$cuiA lines
while (my $line = <IN>) {
if ($line =~ /$cuiA\t$cuiB/ || $line =~ /$cuiB\t$cuiA/) {
print " removing $line";
next;
}
else {
print OUT $line;
}
}
close IN;
close OUT;
}
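# Illustrative sketch (hypothetical helper): decides whether a matrix line is
# the CUI pair in either order by comparing the first two tab-separated fields
# exactly. An unanchored pattern match can, in principle, also match inside
# longer fields, which exact comparison avoids.
use strict;
use warnings;
sub sketchIsCuiPairLine {
    my ($line, $cuiA, $cuiB) = @_;
    chomp(my $copy = $line);    #work on a copy so the caller's line is intact
    my ($cui1, $cui2) = split(/\t/, $copy);
    return 0 unless defined $cui1 && defined $cui2;
    return (($cui1 eq $cuiA && $cui2 eq $cuiB)
         || ($cui1 eq $cuiB && $cui2 eq $cuiA)) ? 1 : 0;
}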
utils/datasetCreator/removeExplicit.pl view on Meta::CPAN
###############################
###############################
#removes explicit knowledge ($matrixFileName) from the implicit
# knowledge ($squaredMatrixFileName)
sub removeExplicit {
my $matrixFileName = shift; #the explicit knowledge matrix (usually not filtered)
my $squaredMatrixFileName = shift; #the implicit with explicit knowledge matrix (filtered squared)
my $outputFileName = shift; #the implicit knowledge matrix output file
print STDERR "Removing Explicit from $matrixFileName\n";
#read in the matrix
open IN, $matrixFileName
or die("ERROR: unable to open matrix input file: $matrixFileName\n");
my %matrix = ();
my $numCooccurrences = 0;
while (my $line = <IN>) {
#$line =~ /([^\t]+)\t([^\t]+)\t([\d]+)/;
$line =~ /([^\s]+)\s([^\s]+)\s([\d]+)/;
if (!exists $matrix{$1}) {
utils/datasetCreator/removeExplicit.pl view on Meta::CPAN
close IN;
#copy the implicit values of the squared matrix over to a new file
open IN, $squaredMatrixFileName
or die("ERROR: unable to open squared matrix input file: $squaredMatrixFileName\n");
open OUT, ">$outputFileName"
or die("ERROR: unable to open output file: $outputFileName\n");
while (my $line = <IN>) {
$line =~ /([^\s]+)\s([^\s]+)\s([\d]+)/;
if (!exists ${$matrix{$1}}{$2}) {
print OUT $line;
}
}
close IN;
close OUT;
print STDERR "DONE!\n";
}
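# Illustrative sketch (hypothetical helper, in-memory rather than file based):
# the same idea as above expressed as a set difference on hashes of hashes.
# Every (row, col) pair of the squared matrix that is absent from the explicit
# matrix is copied to the result.
use strict;
use warnings;
sub sketchRemoveExplicit {
    my ($explicitRef, $squaredRef) = @_;
    my %implicit = ();
    foreach my $rowKey (keys %{$squaredRef}) {
        foreach my $colKey (keys %{ $squaredRef->{$rowKey} }) {
            #the two-step exists avoids autovivifying rows in the explicit matrix
            next if exists $explicitRef->{$rowKey}
                 && exists $explicitRef->{$rowKey}{$colKey};
            $implicit{$rowKey}{$colKey} = $squaredRef->{$rowKey}{$colKey};
        }
    }
    return \%implicit;
}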
utils/datasetCreator/squaring/convertForSquaring_MATLAB.pl view on Meta::CPAN
########################################
########################################
#converts the matrix to format for squaring in MATLAB
sub convertTo {
#grab input
my $inFile = shift;
my $matrixOutFile = shift;
my $keyOutFile = shift;
print STDERR "converting $inFile\n";
#open all the files
open IN, $inFile
or die ("ERROR: unable to open inFile: $inFile\n");
open MATRIX_OUT, ">$matrixOutFile"
or die ("ERROR: unable to open matrixOutFile: $matrixOutFile\n");
open KEY_OUT, ">$keyOutFile"
or die ("ERROR: unable to open keyOutFile: $keyOutFile\n");
#convert the infile to the proper format
print " outputting matrix\n";
open IN, $inFile or die ("ERROR unable to reopen inFile: $inFile\n");
my %keyHash = ();
my ($cui1,$cui2,$value);
while (my $line = <IN>) {
#$line =~ /([^\s]+)\t([^\s]+)\t([^\s]+)/;
#my $cui1 = $1;
#my $cui2 = $2;
#my $value = $3;
($cui1,$cui2,$value) = split(/\t/,$line);
if (!exists $keyHash{$cui1}) {
$keyHash{$cui1} = (scalar keys %keyHash)+1;
}
if (!exists $keyHash{$cui2}) {
$keyHash{$cui2} = (scalar keys %keyHash)+1;
}
#NOTE: $value has a \n character
print MATRIX_OUT "$keyHash{$cui1}\t$keyHash{$cui2}\t$value";
}
close IN;
#output the keys file
print " Outputting keys\n";
foreach my $key (sort keys %keyHash) {
print KEY_OUT "$key\t$keyHash{$key}\n";
}
close MATRIX_OUT;
close KEY_OUT;
print " DONE!\n";
}
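# Illustrative sketch (hypothetical helper): builds the CUI-to-index mapping in
# first-seen order, starting at 1, together with the reverse mapping that is
# needed to convert indices back to CUIs. The input here is an array of
# [cui1, cui2] pairs rather than a file, which is an assumption of this sketch.
use strict;
use warnings;
sub sketchBuildKeyIndex {
    my ($pairsRef) = @_;
    my %keyHash = ();
    foreach my $pairRef (@{$pairsRef}) {
        foreach my $cui (@{$pairRef}) {
            next if exists $keyHash{$cui};
            #compute the next index before creating the new key
            my $nextIndex = (scalar keys %keyHash) + 1;
            $keyHash{$cui} = $nextIndex;
        }
    }
    my %indexToCui = reverse %keyHash;    #indices are unique, so this is safe
    return (\%keyHash, \%indexToCui);
}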
#converts from the format for squaring in MATLAB
sub convertFrom {
#grab input
my $matrixInFile = shift;
my $matrixOutFile = shift;
my $keyInFile = shift;
print "converting $matrixInFile\n";
#open all the files
open IN, $matrixInFile
or die ("ERROR: unable to open matrixInFile: $matrixInFile\n");
open MATRIX_OUT, ">$matrixOutFile"
or die ("ERROR: unable to open matrixOutFile: $matrixOutFile\n");
open KEY_IN, $keyInFile
or die ("ERROR: unable to open keyInFile: $keyInFile\n");
#read in all the keys
utils/datasetCreator/squaring/convertForSquaring_MATLAB.pl view on Meta::CPAN
}
close KEY_IN;
#read in the file and convert on output
while (my $line = <IN>) {
$line =~ /([^\s]+)\s([^\s]+)\s([^\s]+)/;
my $key1 = $1;
my $key2 = $2;
my $value = $3;
print MATRIX_OUT "$keyHash{$key1}\t$keyHash{$key2}\t$value\n";
}
close IN;
close MATRIX_OUT;
print " DONE!\n";
}
utils/datasetCreator/squaring/squareMatrix.m view on Meta::CPAN
%output the matrix
[i,j,val] = find(squared);
clear squared;
disp(' values grabbed for output');
data_dump = [i,j,val];
clear i;
clear j;
clear val;
disp(' values ready for output dump');
fid = fopen(fileOut,'w');
fprintf( fid,'%d %d %d\n', transpose(data_dump) );
fclose(fid);
disp(' DONE!');
end
utils/datasetCreator/squaring/squareMatrix_partial.m view on Meta::CPAN
%output the matrix
[i,j,val] = find(squared);
clear squared;
disp(' values grabbed for output');
data_dump = [i,j,val];
clear i;
clear j;
clear val;
disp(' values ready for output dump');
fid = fopen(fileOut,'a+');
fprintf( fid,'%d %d %d\n', transpose(data_dump) );
clear data_dump;
fclose(fid);
disp(' values output');
end
end
end
utils/datasetCreator/squaring/squareMatrix_perl.pl view on Meta::CPAN
}
}
#output if needed
if ($keyCount > $dumpThreshold) {
&outputMatrix(\%product, $options{'outputFile'});
$keyCount = 0;
}
}
print STDERR "done with row: $count/$total\n";
$count++;
}
#output any other elements in the matrix and finish
&outputMatrix(\%product, $options{'outputFile'});
print STDERR "DONE!\n";
#########################################################
# Helper Functions
#########################################################
sub outputMatrix {
my $matrixRef = shift;
my $outputFile = shift;
#append to the output file
print STDERR "outputFile = $outputFile\n";
open OUT, '>>'.$outputFile or die ("ERROR: unable to open output file: $outputFile\n");
#output the matrix
foreach my $key0 (keys %{$matrixRef}) {
foreach my $key1 (keys %{${$matrixRef}{$key0}}) {
print OUT "$key0\t$key1\t".${${$matrixRef}{$key0}}{$key1}."\n";
}
}
#clear the matrix in place; assigning a new hash ref to $matrixRef would
# only change the local copy and leave the caller's hash full
%{$matrixRef} = ();
close OUT;
}
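# Illustrative sketch (hypothetical helpers): emptying a hash through a
# reference. Assigning a new reference to a sub's local variable leaves the
# caller's hash untouched, while assigning an empty list through the reference
# empties the caller's hash.
use strict;
use warnings;
sub sketchReassignRef {
    my $hashRef = shift;
    my %newHash = ();
    $hashRef = \%newHash;    #only the local variable changes
    return;
}
sub sketchClearInPlace {
    my $hashRef = shift;
    %{$hashRef} = ();        #the caller's hash is emptied
    return;
}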
utils/datasetCreator/testMatrixEquality.pl view on Meta::CPAN
#check that matrix B has all the same elements as matrix A
my $equal = 1;
foreach my $key1 (keys %{$matrixARef}) {
foreach my $key2 (keys %{${$matrixARef}{$key1}}) {
#check that it exists in matrix B and that the value is the same
if (exists ${${$matrixBRef}{$key1}}{$key2}) {
if (${${$matrixARef}{$key1}}{$key2} != ${${$matrixBRef}{$key1}}{$key2}) {
$equal = 0;
print "A\n";
last;
}
} else {
$equal = 0;
print "B\n";
last;
}
#remove from matrix B
delete ${${$matrixBRef}{$key1}}{$key2};
}
if (!$equal) {
last;
}
}
#check that matrix B doesn't contain any elements that aren't in matrix A
if ($equal) {
foreach my $key1 (keys %{$matrixBRef}) {
if (scalar keys %{${$matrixBRef}{$key1}} > 0) {
$equal = 0;
print "C\n";
last;
}
}
}
#print the results
if ($equal) {
print "Matrices are Equal\n";
} else {
print "Matrices are NOT Equal\n";
}
print "DONE!\n";
utils/runDiscovery.pl view on Meta::CPAN
#grab all the options and set values
GetOptions( 'debug' => \$DEBUG,
'help' => \$HELP,
'version' => \$VERSION,
'assocConfig=s' => \$options{'assocConfig'},
'interfaceConfig=s' => \$options{'interfaceConfig'},
);
#Check for version or help
if ($VERSION) {
print "current version is ".(ALBD->version())."\n";
exit;
}
if ($HELP) {
&showHelp();
exit;
}
############################################################################
# Begin Running LBD
utils/runDiscovery.pl view on Meta::CPAN
defined $options{'lbdConfig'} or die ($usage);
my $lbd = ALBD->new(\%options);
$lbd->performLBD();
############################################################################
# function to output help messages for this program
############################################################################
sub showHelp() {
print "This utility takes an lbd configuration file and outputs\n";
print "the results of lbd to file. The parameters for LBD are\n";
print "specified in the input file. Please see samples/lbd or\n";
print "samples/thresholding for sample input files and descriptions\n";
print "of parameters and full details on what can be in an LBD input\n";
print "file.\n";
print "\n";
print "Usage: runDiscovery.pl LBD_CONFIG_FILE [OPTIONS]\n";
print "\n";
print "General Options:\n\n";
print "--help displays help, a quick summary of program\n";
print " options\n";
print "--assocConfig path to a UMLS::Association configuration\n";
print " file. Default location is\n";
print " '../config/association'. Replace this file\n";
print " for your computer to avoid having to specify\n";
print " it each time.\n";
print "--interfaceConfig path to a UMLS::Interface configuration\n";
print " file. Default location is\n";
print " '../config/interface'. Replace this file\n";
print " for your computer to avoid having to specify\n";
print " it each time.\n";
print "--debug enter debug mode\n";
print "--version prints the current version to screen\n";
}