Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
0. PREAMBLE
The purpose of this License is to make a manual, textbook, or other
functional and useful document "free" in the sense of freedom: to
assure everyone the effective freedom to copy and redistribute it,
with or without modifying it, either commercially or noncommercially.
Secondarily, this License preserves for the author and publisher a way
to get credit for their work, while not being considered responsible
for modifications made by others.
This License is a kind of "copyleft", which means that derivative
works of the document must themselves be free in the same sense. It
complements the GNU General Public License, which is a copyleft
license designed for free software.
We have designed this License in order to use it for manuals for free
software, because free software needs free documentation: a free
program should come with manuals providing the same freedoms that the
1. APPLICABILITY AND DEFINITIONS
This License applies to any manual or other work, in any medium, that
contains a notice placed by the copyright holder saying it can be
distributed under the terms of this License. Such a notice grants a
world-wide, royalty-free license, unlimited in duration, to use that
work under the conditions stated herein. The "Document", below,
refers to any such manual or work. Any member of the public is a
licensee, and is addressed as "you". You accept the license if you
copy, modify or distribute the work in a way requiring permission
under copyright law.
A "Modified Version" of the Document means any work containing the
Document or a portion of it, either copied verbatim, or with
modifications and/or translated into another language.
A "Secondary Section" is a named appendix or a front-matter section of
the Document that deals exclusively with the relationship of the
publishers or authors of the Document to the Document's overall subject
allowed to be designated as Invariant. The Document may contain zero
Invariant Sections. If the Document does not identify any Invariant
Sections then there are none.
The "Cover Texts" are certain short passages of text that are listed,
as Front-Cover Texts or Back-Cover Texts, in the notice that says that
the Document is released under this License. A Front-Cover Text may
be at most 5 words, and a Back-Cover Text may be at most 25 words.
A "Transparent" copy of the Document means a machine-readable copy,
represented in a format whose specification is available to the
general public, that is suitable for revising the document
straightforwardly with generic text editors or (for images composed of
pixels) generic paint programs or (for drawings) some widely available
drawing editor, and that is suitable for input to text formatters or
for automatic translation to a variety of formats suitable for input
to text formatters. A copy made in an otherwise Transparent file
format whose markup, or absence of markup, has been arranged to thwart
or discourage subsequent modification by readers is not Transparent.
An image format is not Transparent if used for any substantial amount
of text. A copy that is not "Transparent" is called "Opaque".
HTML, PostScript or PDF designed for human modification. Examples of
transparent image formats include PNG, XCF and JPG. Opaque formats
include proprietary formats that can be read and edited only by
proprietary word processors, SGML or XML for which the DTD and/or
processing tools are not generally available, and the
machine-generated HTML, PostScript or PDF produced by some word
processors for output purposes only.
The "Title Page" means, for a printed book, the title page itself,
plus such following pages as are needed to hold, legibly, the material
this License requires to appear in the title page. For works in
formats which do not have any title page as such, "Title Page" means
the text near the most prominent appearance of the work's title,
preceding the beginning of the body of the text.
A section "Entitled XYZ" means a named subunit of the Document whose
title either is precisely XYZ or contains XYZ in parentheses following
text that translates XYZ in another language. (Here XYZ stands for a
specific section name mentioned below, such as "Acknowledgements",
"Dedications", "Endorsements", or "History".) To "Preserve the Title"
of such a section when you modify the Document means that it remains a
section "Entitled XYZ" according to this definition.
The Document may include Warranty Disclaimers next to the notice which
states that this License applies to the Document. These Warranty
Disclaimers are considered to be included by reference in this
License, but only as regards disclaiming warranties: any other
implication that these Warranty Disclaimers may have is void and has
no effect on the meaning of this License.
2. VERBATIM COPYING
You may copy and distribute the Document in any medium, either
commercially or noncommercially, provided that this License, the
copyright notices, and the license notice saying this License applies
to the Document are reproduced in all copies, and that you add no other
conditions whatsoever to those of this License. You may not use
technical measures to obstruct or control the reading or further
copying of the copies you make or distribute. However, you may accept
compensation in exchange for copies. If you distribute a large enough
number of copies you must also follow the conditions in section 3.
You may also lend copies, under the same conditions stated above, and
you may publicly display copies.
3. COPYING IN QUANTITY
If you publish printed copies (or copies in media that commonly have
printed covers) of the Document, numbering more than 100, and the
Document's license notice requires Cover Texts, you must enclose the
copies in covers that carry, clearly and legibly, all these Cover
Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
the back cover. Both covers must also clearly and legibly identify
you as the publisher of these copies. The front cover must present
the full title with all words of the title equally prominent and
visible. You may add other material on the covers in addition.
Copying with changes limited to the covers, as long as they preserve
the title of the Document and satisfy these conditions, can be treated
as verbatim copying in other respects.
If the required texts for either cover are too voluminous to fit
legibly, you should put the first ones listed (as many as fit
reasonably) on the actual cover, and continue the rest onto adjacent
pages.
If you publish or distribute Opaque copies of the Document numbering
more than 100, you must either include a machine-readable Transparent
copy along with each Opaque copy, or state in or with each Opaque copy
a computer-network location from which the general network-using
public has access to download using public-standard network protocols
a complete Transparent copy of the Document, free of added material.
If you use the latter option, you must take reasonably prudent steps,
when you begin distribution of Opaque copies in quantity, to ensure
Version filling the role of the Document, thus licensing distribution
and modification of the Modified Version to whoever possesses a copy
of it. In addition, you must do these things in the Modified Version:
A. Use in the Title Page (and on the covers, if any) a title distinct
from that of the Document, and from those of previous versions
(which should, if there were any, be listed in the History section
of the Document). You may use the same title as a previous version
if the original publisher of that version gives permission.
B. List on the Title Page, as authors, one or more persons or entities
responsible for authorship of the modifications in the Modified
Version, together with at least five of the principal authors of the
Document (all of its principal authors, if it has fewer than five),
unless they release you from this requirement.
C. State on the Title page the name of the publisher of the
Modified Version, as the publisher.
D. Preserve all the copyright notices of the Document.
E. Add an appropriate copyright notice for your modifications
adjacent to the other copyright notices.
F. Include, immediately after the copyright notices, a license notice
giving the public permission to use the Modified Version under the
terms of this License, in the form shown in the Addendum below.
G. Preserve in that license notice the full lists of Invariant Sections
and required Cover Texts given in the Document's license notice.
H. Include an unaltered copy of this License.
I. Preserve the section Entitled "History", Preserve its Title, and add
to it an item stating at least the title, year, new authors, and
publisher of the Modified Version as given on the Title Page. If
there is no section Entitled "History" in the Document, create one
stating the title, year, authors, and publisher of the Document as
given on its Title Page, then add an item describing the Modified
Version as stated in the previous sentence.
J. Preserve the network location, if any, given in the Document for
public access to a Transparent copy of the Document, and likewise
the network locations given in the Document for previous versions
it was based on. These may be placed in the "History" section.
You may omit a network location for a work that was published at
least four years before the Document itself, or if the original
publisher of the version it refers to gives permission.
K. For any section Entitled "Acknowledgements" or "Dedications",
Preserve the Title of the section, and preserve in the section all
the substance and tone of each of the contributor acknowledgements
and/or dedications given therein.
L. Preserve all the Invariant Sections of the Document,
unaltered in their text and in their titles. Section numbers
or the equivalent are not considered part of the section titles.
M. Delete any section Entitled "Endorsements". Such a section
may not be included in the Modified Version.
N. Do not retitle any existing section to be Entitled "Endorsements"
or to conflict in title with any Invariant Section.
O. Preserve any Warranty Disclaimers.
If the Modified Version includes new front-matter sections or
appendices that qualify as Secondary Sections and contain no material
copied from the Document, you may at your option designate some or all
of these sections as invariant. To do this, add their titles to the
list of Invariant Sections in the Modified Version's license notice.
These titles must be distinct from any other section titles.
You may add a section Entitled "Endorsements", provided it contains
nothing but endorsements of your Modified Version by various
imply endorsement of any Modified Version.
5. COMBINING DOCUMENTS
You may combine the Document with other documents released under this
License, under the terms defined in section 4 above for modified
versions, provided that you include in the combination all of the
Invariant Sections of all of the original documents, unmodified, and
list them all as Invariant Sections of your combined work in its
license notice, and that you preserve all their Warranty Disclaimers.
The combined work need only contain one copy of this License, and
multiple identical Invariant Sections may be replaced with a single
copy. If there are multiple Invariant Sections with the same name but
different contents, make the title of each such section unique by
adding at the end of it, in parentheses, the name of the original
author or publisher of that section if known, or else a unique number.
Make the same adjustment to the section titles in the list of
Invariant Sections in the license notice of the combined work.
and any sections Entitled "Dedications". You must delete all sections
Entitled "Endorsements".
6. COLLECTIONS OF DOCUMENTS
You may make a collection consisting of the Document and other documents
released under this License, and replace the individual copies of this
License in the various documents with a single copy that is included in
the collection, provided that you follow the rules of this License for
verbatim copying of each of the documents in all other respects.
You may extract a single document from such a collection, and distribute
it individually under this License, provided you insert a copy of this
License into the extracted document, and follow this License in all
other respects regarding verbatim copying of that document.
7. AGGREGATION WITH INDEPENDENT WORKS
A compilation of the Document or its derivatives with other separate
and independent documents or works, in or on a volume of a storage or
distribution medium, is called an "aggregate" if the copyright
resulting from the compilation is not used to limit the legal rights
of the compilation's users beyond what the individual works permit.
When the Document is included in an aggregate, this License does not
apply to the other works in the aggregate which are not themselves
derivative works of the Document.
If the Cover Text requirement of section 3 is applicable to these
copies of the Document, then if the Document is less than one half of
the entire aggregate, the Document's Cover Texts may be placed on
covers that bracket the Document within the aggregate, or the
electronic equivalent of covers if the Document is in electronic form.
Otherwise they must appear on printed covers that bracket the whole
aggregate.
8. TRANSLATION
Translation is considered a kind of modification, so you may
distribute translations of the Document under the terms of section 4.
Replacing Invariant Sections with translations requires special
permission from their copyright holders, but you may include
translations of some or all Invariant Sections in addition to the
original versions of these Invariant Sections. You may include a
translation of this License, and all the license notices in the
Document, and any Warranty Disclaimers, provided that you also include
the original English version of this License and the original versions
of those notices and disclaimers. In case of a disagreement between
the translation and the original version of this License or a notice
or disclaimer, the original version will prevail.
If a section in the Document is Entitled "Acknowledgements",
"Dedications", or "History", the requirement (section 4) to Preserve
its Title (section 1) will typically require changing the actual
title.
9. TERMINATION
You may not copy, modify, sublicense, or distribute the Document except
as expressly provided for under this License. Any other attempt to
copy, modify, sublicense or distribute the Document is void, and will
automatically terminate your rights under this License. However,
parties who have received copies, or rights, from you under this
License will not have their licenses terminated so long as such
parties remain in full compliance.
10. FUTURE REVISIONS OF THIS LICENSE
The Free Software Foundation may publish new, revised versions
of the GNU Free Documentation License from time to time. Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns. See
http://www.gnu.org/copyleft/.
Each version of the License is given a distinguishing version number.
If the Document specifies that a particular numbered version of this
License "or any later version" applies to it, you have the option of
following the terms and conditions either of that specified version or
of any later version that has been published (not as a draft) by the
Free Software Foundation. If the Document does not specify a version
number of this License, you may choose any version ever published (not
as a draft) by the Free Software Foundation.
the GNU Library General Public License instead.) You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
this service if you wish), that you receive source code or can get it
if you want it, that you can change the software or use pieces of it
in new free programs; and that you know you can do these things.
To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have. You must make sure that they, too, receive or can get the
source code. And you must show them these terms so they know their
rights.
We protect your rights with two steps: (1) copyright the software, and
(2) offer you this license which gives you legal permission to copy,
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. This License applies to any program or other work which contains
a notice placed by the copyright holder saying it may be distributed
under the terms of this General Public License. The "Program", below,
refers to any such program or work, and a "work based on the Program"
means either the Program or any derivative work under copyright law:
that is to say, a work containing the Program or a portion of it,
either verbatim or with modifications and/or translated into another
language. (Hereinafter, translation is included without limitation in
the term "modification".) Each licensee is addressed as "you".
Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of
running the Program is not restricted, and the output from the Program
is covered only if its contents constitute a work based on the
Program (independent of having been made by running the Program).
Whether that is true depends on what the Program does.
1. You may copy and distribute verbatim copies of the Program's
source code as you receive it, in any medium, provided that you
conspicuously and appropriately publish on each copy an appropriate
copyright notice and disclaimer of warranty; keep intact all the
notices that refer to this License and to the absence of any warranty;
and give any other recipients of the Program a copy of this License
In addition, mere aggregation of another work not based on the Program
with the Program (or with a work based on the Program) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.
3. You may copy and distribute the Program (or a work based on it,
under Section 2) in object code or executable form under the terms of
Sections 1 and 2 above provided that you also do one of the following:
a) Accompany it with the complete corresponding machine-readable
source code, which must be distributed under the terms of Sections
1 and 2 above on a medium customarily used for software interchange; or,
b) Accompany it with a written offer, valid for at least three
years, to give any third party, for a charge no more than your
cost of physically performing source distribution, a complete
machine-readable copy of the corresponding source code, to be
distributed under the terms of Sections 1 and 2 above on a medium
customarily used for software interchange; or,
c) Accompany it with the information you received as to the offer
to distribute corresponding source code. (This alternative is
allowed only for noncommercial distribution and only if you
received the program in object code or executable form with such
an offer, in accord with Subsection b above.)
The source code for a work means the preferred form of the work for
making modifications to it. For an executable work, complete source
code means all the source code for all modules it contains, plus any
associated interface definition files, plus the scripts used to
control compilation and installation of the executable. However, as a
special exception, the source code distributed need not include
operating system on which the executable runs, unless that component
itself accompanies the executable.
If distribution of executable or object code is made by offering
access to copy from a designated place, then offering equivalent
access to copy the source code from the same place counts as
distribution of the source code, even though third parties are not
compelled to copy the source along with the object code.
4. You may not copy, modify, sublicense, or distribute the Program
except as expressly provided under this License. Any attempt
otherwise to copy, modify, sublicense or distribute the Program is
void, and will automatically terminate your rights under this License.
However, parties who have received copies, or rights, from you under
this License will not have their licenses terminated so long as such
parties remain in full compliance.
5. You are not required to accept this License, since you have not
signed it. However, nothing else grants you permission to modify or
distribute the Program or its derivative works. These actions are
prohibited by law if you do not accept this License. Therefore, by
modifying or distributing the Program (or any work based on the
Program), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Program or works based on it.
6. Each time you redistribute the Program (or any work based on the
Program), the recipient automatically receives a license from the
original licensor to copy, distribute or modify the Program subject to
these terms and conditions. You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties to
this License.
7. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Program at all. For example, if a patent
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system, which is
implemented by public license practices. Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.
This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.
8. If the distribution and/or use of the Program is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Program under this License
may add an explicit geographical distribution limitation excluding
those countries, so that distribution is permitted only in or among
countries not thus excluded. In such case, this License incorporates
the limitation as if written in the body of this License.
9. The Free Software Foundation may publish revised and/or new versions
of the General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the Program
specifies a version number of this License which applies to it and "any
later version", you have the option of following the terms and conditions
either of that version or of any later version published by the Free
Software Foundation. If the Program does not specify a version number of
this License, you may choose any version ever published by the Free Software
Foundation.
10. If you wish to incorporate parts of the Program into other free
programs whose distribution conditions are different, write to the author
to ask for permission. For software which is copyrighted by the Free
Software Foundation, write to the Free Software Foundation; we sometimes
make exceptions for this. Our decision will be guided by the two goals
of preserving the free status of all derivatives of our free software and
of promoting the sharing and reuse of software generally.
NO WARRANTY
11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, the commands you use may
be called something other than `show w' and `show c'; they could even be
mouse-clicks or menu items--whatever suits your program.
You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the program, if
necessary. Here is a sample; alter the names:
Yoyodyne, Inc., hereby disclaims all copyright interest in the program
`Gnomovision' (which makes passes at compilers) written by James Hacker.
<signature of Ty Coon>, 1 April 1989
Ty Coon, President of Vice
This General Public License does not permit incorporating your program into
proprietary programs. If your program is a subroutine library, you may
consider it more useful to permit linking proprietary applications with the
library. If this is what you want to do, use the GNU Library General
Public License instead of this License.
UMLS::Interface
The core UMLS package provides a dictionary from concept unique
identifiers (CUIs) to their meanings in the Unified Medical Language
System. Refer to the UMLS::Interface documentation for how to install
the UMLS database on your system.
The package is freely available at:
<http://search.cpan.org/dist/UMLS-Interface/>
UMLS::Association
Used to calculate the association scores used in most of the ranking methods.
The package is freely available at:
<http://search.cpan.org/dist/UMLS-Association/>
Stage 3: Install ALBD package
The usual way to install the package is to run the following commands:
perl Makefile.PL
make
make test
make install
If 'make test' fails, you will see a summary of the tests that failed,
followed by a message of the form "make: *** [test_dynamic] Error Y"
where Y is a number between 1 and 255 (inclusive). If the number is less
than 255, then it indicates how many tests failed (if more than 254
tests failed, then 254 will still be shown). If one or more tests died,
then 255 will be shown. For more details, see:
<http://search.cpan.org/dist/Test-Simple/lib/Test/Builder.pm#EXIT_CODES>
Stage 4: Create a co-occurrence matrix
ALBD requires that a co-occurrence matrix of CUIs has been created. This
matrix is stored as a flat file in a sparse matrix format, such that
each line contains three tab-separated values: cui_1, cui_2, and n_11
(the count of their co-occurrences). Any matrix with that format is
acceptable; however, the intended method of matrix generation is to
convert a UMLS::Association database into a flat matrix file. These
databases are created using the CUICollector tool of UMLS::Association
and are run over the MetaMapped Medline baseline. With that file, run
utils/datasetCreator/fromMySQL/dbToTab.pl to convert the desired
database into a matrix file. Note that the code in dbToTab.pl is just a
sample MySQL command; if the input database was created by another
method, a different command may be needed. As long as the resulting
co-occurrence matrix is in the correct format, LBD may be run on it. This
allows flexibility in where co-occurrence information comes from.
Note: utils/datasetCreator/fromMySQL/removeQuotes.pl may need to be run
on the resulting tab-separated file, if quotes are included in the
resulting co-occurrence matrix file.
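The flat-file format described above is simple enough to generate or inspect with a short script. A minimal sketch (the file name and CUIs below are made up for illustration):

```python
import csv
import tempfile

# Each line of the co-occurrence matrix holds three tab-separated
# values: cui_1, cui_2, and n_11 (their co-occurrence count).
sample = "C0000001\tC0000002\t12\nC0000001\tC0000003\t5\n"

with tempfile.NamedTemporaryFile("w", suffix=".matrix", delete=False) as f:
    f.write(sample)
    path = f.name

# Load the file into a dict keyed by the ordered CUI pair.
cooccurrences = {}
with open(path, newline="") as f:
    for cui1, cui2, n11 in csv.reader(f, delimiter="\t"):
        cooccurrences[(cui1, cui2)] = int(n11)

print(cooccurrences[("C0000001", "C0000002")])  # 12
```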
Stage 5: Set up Dummy UMLS::Association Database
UMLS::Association requires a database in the correct format that it can
connect to. Although this database is not required for ALBD
(since co-occurrence data is loaded from a co-occurrence matrix), it is
required to run UMLS::Association. If you ran UMLS::Association to
generate a co-occurrence matrix, you should be fine. Otherwise you will
need to create a dummy database that it can connect to. This can be done
in a few steps:
1) Open MySQL by typing mysql at the terminal
2) Create the default database in the correct format by typing: CREATE
samples/sampleGoldMatrix
samples/timeSliceCuiList
samples/timeSlicingConfig
samples/configFileSamples/UMLSAssociationConfig
samples/configFileSamples/UMLSInterfaceConfig
samples/configFileSamples/UMLSInterfaceInternalConfig
t/test.t
t/goldSampleOutput
t/goldSampleTimeSliceOutput
utils/runDiscovery.pl
utils/datasetCreator/applyMaxThreshold.pl
utils/datasetCreator/applyMinThreshold.pl
utils/datasetCreator/applySemanticFilter.pl
utils/datasetCreator/combineCooccurrenceMatrices.pl
utils/datasetCreator/makeOrderNotMatter.pl
utils/datasetCreator/removeCUIPair.pl
utils/datasetCreator/removeExplicit.pl
utils/datasetCreator/testMatrixEquality.pl
utils/datasetCreator/dataStats/getCUICooccurrences.pl
utils/datasetCreator/dataStats/getMatrixStats.pl
utils/datasetCreator/dataStats/metaAnalysis.pl
utils/datasetCreator/fromMySQL/dbToTab.pl
},
"name" : "ALBD",
"no_index" : {
"directory" : [
"t",
"inc"
]
},
"prereqs" : {
"build" : {
"requires" : {
"ExtUtils::MakeMaker" : "0"
}
},
"configure" : {
"requires" : {
"ExtUtils::MakeMaker" : "0"
}
},
"runtime" : {
"requires" : {
"UMLS::Association" : "0",
"UMLS::Interface" : "0"
}
}
},
"release_status" : "stable",
"version" : 0.05
}
---
abstract: 'a perl implementation of Literature Based Discovery'
author:
- 'Sam Henry <henryst@vcu.edu>'
build_requires:
ExtUtils::MakeMaker: 0
configure_requires:
ExtUtils::MakeMaker: 0
dynamic_config: 1
generated_by: 'ExtUtils::MakeMaker version 6.66, CPAN::Meta::Converter version 2.120921'
license: unknown
meta-spec:
url: http://module-build.sourceforge.net/META-spec-v1.4.html
version: 1.4
name: ALBD
no_index:
directory:
- t
- inc
requires:
UMLS::Association: 0
UMLS::Interface: 0
version: 0.05
This package consists of Perl modules along with supporting Perl
programs that perform Literature Based Discovery (LBD). The core
data from which LBD is performed are co-occurrence matrices
generated from UMLS::Association. ALBD is based on the ABC
co-occurrence model. Many options can be specified, and many
ranking methods are available, including novel ranking methods that
use association measures as well as frequency-based ranking methods.
See samples/lbd for more info. ALBD can perform open and
closed LBD as well as time-slicing evaluation.
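The ABC co-occurrence model mentioned above can be sketched in a few lines: starting from an A term, the B terms are concepts that explicitly co-occur with A, and candidate C terms are concepts that co-occur with some B term but not with A itself. The following is an illustrative sketch with made-up CUIs, not ALBD's actual implementation (which also applies semantic filtering and ranking):

```python
# Explicit co-occurrence matrix: (cui_1, cui_2) -> count. CUIs are made up.
cooccurrences = {
    ("A", "B1"): 4, ("A", "B2"): 2,    # A co-occurs with B1 and B2
    ("B1", "C1"): 3, ("B2", "C2"): 1,  # B terms co-occur with C terms
    ("A", "C2"): 7,                    # C2 already co-occurs with A
}

def abc_candidates(a, matrix):
    """Open discovery: concepts reachable from A through a B term,
    minus concepts already explicitly co-occurring with A, and A itself."""
    linked_to_a = {c2 for (c1, c2) in matrix if c1 == a}
    c_terms = {c2 for (c1, c2) in matrix if c1 in linked_to_a}
    return c_terms - linked_to_a - {a}

print(sorted(abc_candidates("A", cooccurrences)))  # ['C1']
```

C2 is excluded because it already co-occurs with A; only C1 is a novel (implicit) connection.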
ALBD requires UMLS::Association both to compute the co-occurrence
database from which the co-occurrence matrix is derived, and to rank
the generated C terms.
UMLS::Association requires the UMLS::Interface module to access
the Unified Medical Language System (UMLS) for semantic type filtering
and to determine if CUIs are valid.
The following sections describe the organization of this software
package and how to use it. A few typical examples are given to help
clearly understand the usage of the modules and the supporting
utilities.
INSTALL
To install the module, run the following magic commands:
perl Makefile.PL
make
make test
make install
This will install the module in the standard location. If you wish to
install in a non-standard directory, you can specify an alternative
location during the 'perl Makefile.PL' stage as:
perl Makefile.PL PREFIX=/home/programs
It is possible to modify other parameters during installation; the
details can be found in the ExtUtils::MakeMaker documentation.
However, it is highly recommended that you do not change other
parameters unless you know what you are doing.
CO-OCCURRENCE MATRIX SETUP
ALBD requires that a co-occurrence matrix of CUIs has been created. This
matrix is stored as a flat file in a sparse matrix format, such that
each line contains three tab-separated values: cui_1, cui_2, and n_11
(the count of their co-occurrences). Any matrix in that format is
acceptable; however, the intended method of matrix generation is to
convert a UMLS::Association database into a flat matrix file. These
databases are created using the CUICollector tool of UMLS::Association,
run over the MetaMapped Medline baseline. With that database, run
utils/datasetCreator/fromMySQL/dbToTab.pl to convert the desired
database into a matrix file. Note that the code in dbToTab.pl is just a
sample mysql command; if the input database was created by another
method, a different command may be needed. As long as the resulting
co-occurrence matrix is in the correct format, LBD may be run on it. This
allows flexibility in where co-occurrence information comes from.
Note: utils/datasetCreator/fromMySQL/removeQuotes.pl may need to be run
on the resulting tab-separated file if quotes are included in the
resulting co-occurrence matrix file.
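The flat-file format above maps directly onto the hash-of-hashes sparse representation the package uses internally. As a minimal sketch (an illustration only; ALBD's own loader is Discovery::fileToSparseMatrix, and the subroutine name here is hypothetical):

```perl
use strict;
use warnings;

# Read a flat co-occurrence matrix file (cui_1 <tab> cui_2 <tab> n_11
# per line) into a hash-of-hashes sparse matrix:
#   $matrix{$cui1}{$cui2} = co-occurrence count
sub loadCooccurrenceMatrix {
    my $fileName = shift;
    my %matrix = ();
    open(my $fh, '<', $fileName)
        or die "unable to open co-occurrence file: $fileName\n";
    while (my $line = <$fh>) {
        chomp $line;
        my ($cui1, $cui2, $count) = split(/\t/, $line);
        next unless defined $count;   # skip malformed lines
        $matrix{$cui1}{$cui2} = $count;
    }
    close $fh;
    return \%matrix;
}
```

Any tool that emits this three-column format can feed its output into ALBD, which is what gives the package its flexibility about where co-occurrence counts come from.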
Set Up Dummy UMLS::Association Database
UMLS::Association requires that a database in the correct format can be
connected to. Although this database is not required for ALBD
(since co-occurrence data is loaded from a co-occurrence matrix), it is
required to run UMLS::Association. If you ran UMLS::Association to
generate a co-occurrence matrix, you should be fine. Otherwise, you will
need to create a dummy database that it can connect to. This can be done
in a few steps:
1) Open mysql: type mysql at the terminal
2) Create the default database in the correct format; type: CREATE
my %options = ();
$options{'assocConfig'} = '/home/share/ALBD/config/association';
$options{'interfaceConfig'} = '/home/share/ALBD/config/interface';
$options{'lbdConfig'} = 'configFile';
my $lbd = LiteratureBasedDiscovery->new(\%options);
$lbd->performLBD();
CONTENTS
All the modules that will be installed in the Perl system directory are
present in the '/lib' directory tree of the package.
The package contains a utils/ directory that contains Perl utility
programs. These utilities use the modules or provide some supporting
functionality.
runDiscovery.pl -- runs LBD using the parameters specified in the input
file, and outputs to an output file.
The package contains a large selection of scripts to manipulate CUI
co-occurrence matrices in the utils/datasetCreator/ directory. These are
short scripts and generally require modifying the code at the top with
input parameters specific to each run. These scripts include:
applyMaxThreshold.pl -- applies a maximum co-occurrence threshold to the
co-occurrence matrix
applyMinThreshold.pl -- applies a minimum co-occurrence threshold to the
co-occurrence matrix
applySemanticFilter.pl -- applies a semantic type and/or group filter to
the co-occurrence matrix.
combineCooccurrenceMatrices.pl -- combines the co-occurrence counts of
multiple co-occurrence matrices
makeOrderNotMatter.pl -- makes the order of CUI co-occurrences not
matter by updating the co-occurrence matrix file (order matters in a
UMLS::Association dataset)
getMatrixStats.pl -- determines the number of rows, columns, and entries
of a co-occurrence matrix
metaAnalysis.pl -- determines the number of rows, columns, vocabulary
size, and total number of co-occurrences of a co-occurrence file, or set
of co-occurrence files
There is another folder containing scripts to square co-occurrence
matrices. Squaring an explicit (A to B) co-occurrence matrix results in
a co-occurrence matrix containing all implicit (A to C) connections.
This is useful for time slicing and other analysis. Removal of the
original explicit matrix is an additional required step if you wish to
create a predictions matrix file for every CUI; this can be done with
the removeExplicit.pl script. Squaring a co-occurrence matrix can be
very computationally expensive, both in terms of RAM and CPU. For this
reason, the MATLAB scripts are preferred over the Perl scripts. Even
using MATLAB, RAM can become an issue; squaring sections of a matrix and
combining them into a single output matrix may be necessary, but takes
much longer. Scripts in the squaring folder include:
convertForSquaring_MATLAB.pl -- functions to convert to and from ALBD
and MATLAB sparse matrix formats
squareMatrix.m -- MATLAB script to square a matrix while holding
everything in ram. Faster, but requires more ram.
squareMatrix_partial.m -- MATLAB script to square a matrix in chunks.
It only loads parts of the matrix into RAM at a time, which makes
squaring a matrix of any size possible, but potentially takes
impractical amounts of time.
squareMatrix_perl.pl -- squares a matrix in Perl, but requires the most
RAM of any squaring method. The easiest method to use, but only
practical for small datasets.
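To make the squaring operation concrete, here is a minimal Perl sketch of what these scripts compute (an illustration only, not the optimized implementations above; the subroutine names are hypothetical). Squaring the explicit hash-of-hashes matrix accumulates A-to-C weights through each linking term B, and removing explicit entries leaves only the novel connections:

```perl
use strict;
use warnings;

# Square a sparse co-occurrence matrix stored as a hash of hashes.
# $implicit{$a}{$c} accumulates, over every linking term $b, the
# product explicit{$a}{$b} * explicit{$b}{$c}.
sub squareSparseMatrix {
    my $explicitRef = shift;
    my %implicit = ();
    foreach my $rowCui (keys %{$explicitRef}) {
        foreach my $linkCui (keys %{$explicitRef->{$rowCui}}) {
            # the linking term must itself have a row of co-occurrences
            next unless exists $explicitRef->{$linkCui};
            foreach my $colCui (keys %{$explicitRef->{$linkCui}}) {
                $implicit{$rowCui}{$colCui} +=
                    $explicitRef->{$rowCui}{$linkCui}
                    * $explicitRef->{$linkCui}{$colCui};
            }
        }
    }
    return \%implicit;
}

# Delete implicit entries that are already explicit (A to B)
# connections, leaving only novel (A to C) connections, in the
# spirit of removeExplicit.pl.
sub removeExplicitEntries {
    my ($explicitRef, $implicitRef) = @_;
    foreach my $rowCui (keys %{$implicitRef}) {
        foreach my $colCui (keys %{$implicitRef->{$rowCui}}) {
            delete $implicitRef->{$rowCui}{$colCui}
                if exists $explicitRef->{$rowCui}
                && exists $explicitRef->{$rowCui}{$colCui};
        }
    }
    return $implicitRef;
}
```

The triple loop over a vocabulary-sized matrix is what makes squaring so expensive; the chunked MATLAB script trades that RAM cost for time.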
The fromMySQL folder contains scripts that convert UMLS::Association
databases to ALBD co-occurrence matrices. The files contained are:
dbToTab.pl -- converts a UMLS::Association co-occurrence database to a
sparse format co-occurrence matrix used for ALBD
removeQuotes.pl -- removes quotes from lines in the co-occurrence matrix
lib/ALBD.pm view on Meta::CPAN
######################################################################
# Description
######################################################################
#
# This is a description geared more towards understanding or modifying
# the code, rather than using the program.
#
# LiteratureBasedDiscovery.pm - provides functionality to perform LBD
#
# Matrix Representation:
# LBD is performed using Matrix and Vector operations. The major components
# are an explicit knowledge matrix, which is squared to find the implicit
# knowledge matrix.
#
# The explicit knowledge is read from UMLS::Association N11 matrix. This
# matrix contains the co-occurrence counts for all CUI pairs. The
# UMLS::Association database is completely independent from
# implementation, so any dataset, window size, or anything else may be used.
# Data is read in as a sparse matrix using the Discovery::tableToSparseMatrix
# function. This returns the primary data structures and variables used
# throughout LBD.
#
# Matrix representation:
# This module uses a matrix representation for LBD. All operations are
# performed either as matrix or vector operations. The core data structure
# are the co-occurrence matrices explicitMatrix and implicitMatrix. These
# matrices have dimensions vocabulary size by vocabulary size. Each row
# corresponds to all co-occurrences for a single CUI, and each column of that
# row corresponds to a co-occurrence with a single CUI. Since the matrices
# tend to be sparse, they are stored as hashes of hashes, where the first
# key is for a row, and the second key is for a column. The keys of each hash
# are the indices within the matrix. The hash values are the number of
# co-occurrences for that CUI pair (e.g. ${$explicit{C0000000}}{C1111111} = 10
# means that CUI C0000000 and C1111111 co-occurred 10 times).
#
# Now with an understanding of the data structures, below is a brief
# description of each:
#
# startingMatrix <- A matrix containing the explicit matrix rows for all of the
# start terms. This makes it easy to have multiple start terms,
# and using this matrix rather than the entire explicit
# matrix drastically improves performance.
# explicitMatrix <- A matrix containing explicit connections (known connections)
# for every CUI in the dataset.
# implicitMatrix <- A matrix containing implicit connections (discovered
# connections) for every CUI in the dataset
#### UPDATE VERSION HERE #######
use vars qw($VERSION);
$VERSION = 0.05;
#global variables
my $DEBUG = 0;
my $N11_TABLE = 'N_11';
my %lbdOptions = ();
#rankingProcedure <-- the procedure to use for ranking
#rankingMeasure <-- the association measure to use for ranking
#implicitOutputFile <--- the output file of results
#explicitInputFile <-- file to load explicit matrix from
#implicitInputFile <-- load implicit from file rather than calculating
#references to other packages
my $umls_interface;
my $umls_association;
#####################################################
####################################################
# performs LBD
# input: none
# output: none, but a results file is written to disk
sub performLBD {
my $self = shift;
my $start; #used to record run times
#implicit matrix ranking requires a different set of procedures
if ($lbdOptions{'rankingProcedure'} eq 'implicitMatrix') {
$self->performLBD_implicitMatrixRanking();
return;
}
if (exists $lbdOptions{'targetCuis'}) {
$self->performLBD_closedDiscovery();
return;
}
if (exists $lbdOptions{'precisionAndRecall_explicit'}) {
$self->timeSlicing_generatePrecisionAndRecall_explicit();
}
$explicitMatrixRef = Discovery::fileToSparseMatrix($lbdOptions{'explicitInputFile'});
print "Got Explicit Matrix in ".(time() - $start)."\n";
#Get the Starting Matrix
$start = time();
my $startingMatrixRef =
Discovery::getRows($startCuisRef, $explicitMatrixRef);
print "Got Starting Matrix in ".(time() - $start)."\n";
#if using average minimum weight, grab the a->b scores
my %abPairsWithScores = ();
if ($lbdOptions{'rankingProcedure'} eq 'averageMinimumWeight'
|| $lbdOptions{'rankingProcedure'} eq 'ltc_amw') {
#apply semantic type filter to columns only
if ((scalar keys %{$linkingAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_columns(
$explicitMatrixRef, $linkingAcceptTypesRef, $umls_interface);
}
#initialize the abPairs to frequency of co-occurrence
foreach my $row (keys %{$startingMatrixRef}) {
foreach my $col (keys %{${$startingMatrixRef}{$row}}) {
$abPairsWithScores{"$row,$col"} = ${${$startingMatrixRef}{$row}}{$col};
}
}
Rank::getBatchAssociationScores(\%abPairsWithScores, $explicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association);
}
#Apply Semantic Type Filter to the explicit matrix
if ((scalar keys %{$linkingAcceptTypesRef}) > 0) {
$start = time();
Filters::semanticTypeFilter_rowsAndColumns(
$explicitMatrixRef, $linkingAcceptTypesRef, $umls_interface);
print "Semantic Type Filter in ".(time() - $start)."\n";
}
#Apply Semantic Type Filter
if ((scalar keys %{$targetAcceptTypesRef}) > 0) {
$start = time();
Filters::semanticTypeFilter_columns(
$implicitMatrixRef, $targetAcceptTypesRef, $umls_interface);
print "Semantic Type Filter in ".(time() - $start)."\n";
}
#Score Implicit Connections
$start = time();
my $scoresRef;
if ($lbdOptions{'rankingProcedure'} eq 'allPairs') {
$scoresRef = Rank::scoreImplicit_fromAllPairs($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association);
} elsif ($lbdOptions{'rankingProcedure'} eq 'averageMinimumWeight') {
$scoresRef = Rank::scoreImplicit_averageMinimumWeight($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association, \%abPairsWithScores);
} elsif ($lbdOptions{'rankingProcedure'} eq 'linkingTermCount') {
$scoresRef = Rank::scoreImplicit_linkingTermCount($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef);
} elsif ($lbdOptions{'rankingProcedure'} eq 'frequency') {
$scoresRef = Rank::scoreImplicit_frequency($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef);
} elsif ($lbdOptions{'rankingProcedure'} eq 'ltcAssociation') {
$scoresRef = Rank::scoreImplicit_ltcAssociation($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association);
} elsif ($lbdOptions{'rankingProcedure'} eq 'ltc_amw') {
$scoresRef = Rank::scoreImplicit_LTC_AMW($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association, \%abPairsWithScores);
} else {
die ("Error: Invalid Ranking Procedure\n");
}
print "Scored in: ".(time()-$start)."\n";
#Rank Implicit Connections
$start = time();
my $ranksRef = Rank::rankDescending($scoresRef);
print "Ranked in: ".(time()-$start)."\n";
#Output The Results
open OUT, ">$lbdOptions{implicitOutputFile}"
or die "unable to open implicit ouput file: "
."$lbdOptions{implicitOutputFile}\n";
my $outputString = $self->_rankedTermsToString($scoresRef, $ranksRef);
my $paramsString = $self->_parametersToString();
print OUT $paramsString;
print OUT $outputString;
close OUT;
#Done
print "DONE!\n\n";
}
#----------------------------------------------------------------------------
# performs LBD, closed discovery
# input: none
# output: none, but a results file is written to disk
sub performLBD_closedDiscovery {
my $self = shift;
my $start; #used to record run times
print "Closed Discovery\n";
print $self->_parametersToString();
#Get inputs
my $startCuisRef = $self->_getStartCuis();
my $targetCuisRef = $self->_getTargetCuis();
my %inCommon = ();
foreach my $startLink (keys %startLinks) {
if (exists $targetLinks{$startLink}) {
$inCommon{$startLink} = $startLinks{$startLink} + $targetLinks{$startLink};
}
}
print " num in common = ".(scalar keys %inCommon)."\n";
#Score and Rank
#Score the linking terms in common
my $scoresRef = \%inCommon;
#TODO score is just summed frequency right now
#Rank Implicit Connections
$start = time();
my $ranksRef = Rank::rankDescending($scoresRef);
print "Ranked in: ".(time()-$start)."\n";
#Output The Results
open OUT, ">$lbdOptions{implicitOutputFile}"
or die "unable to open implicit ouput file: "
."$lbdOptions{implicitOutputFile}\n";
my $outputString = $self->_rankedTermsToString($scoresRef, $ranksRef);
my $paramsString = $self->_parametersToString();
print OUT $paramsString;
print OUT $outputString;
print OUT "\n\n---------------------------------------\n\n";
print OUT "starting linking terms:\n";
print OUT join("\n", keys %startLinks);
print OUT "\n\n---------------------------------------\n\n";
print OUT "target linking terms:\n";
print OUT join("\n", keys %targetLinks, );
close OUT;
#Done
print "DONE!\n\n";
}
#NOTE, this is experimental code for using the implicit matrix as input
# to association measures and then rank. This provides a nice method of
# association for implicit terms, but there are implementation problems,
# primarily memory and time constraints, because this
# requires the entire implicit matrix to be computed. This can be done, but
# access to it is then slow. It would require a major redo of the code.
#
=comment
# performs LBD, but using implicit matrix ranking schemes.
# Since the order of operations for those methods is slightly different,
# a new method has been created.
# input: none
# output: none, but a results file is written to disk
sub performLBD_implicitMatrixRanking {
my $self = shift;
my $start; #used to record run times
print $self->_parametersToString();
print "In Implicit Ranking\n";
#Get inputs
my $startCuisRef = $self->_getStartCuis();
my $linkingAcceptTypesRef = $self->_getAcceptTypes('linking');
my $targetAcceptTypesRef = $self->_getAcceptTypes('target');
print "startCuis = ".(join(',', @{$startCuisRef}))."\n";
print "linkingAcceptTypes = ".(join(',', keys %{$linkingAcceptTypesRef}))."\n";
print "targetAcceptTypes = ".(join(',', keys %{$targetAcceptTypesRef}))."\n";
#Score Implicit Connections
$start = time();
my $scoresRef;
$scoresRef = Rank::scoreImplicit_fromImplicitMatrix($startCuisRef, $lbdOptions{'implicitInputFile'}, $lbdOptions{'rankingMeasure'}, $umls_association);
print "Scored in: ".(time()-$start)."\n";
#Rank Implicit Connections
$start = time();
my $ranksRef = Rank::rankDescending($scoresRef);
print "Ranked in: ".(time()-$start)."\n";
#Output The Results
open OUT, ">$lbdOptions{implicitOutputFile}"
or die "unable to open implicit ouput file: "
."$lbdOptions{implicitOutputFile}\n";
my $outputString = $self->_rankedTermsToString($scoresRef, $ranksRef);
my $paramsString = $self->_parametersToString();
print OUT $paramsString;
print OUT $outputString;
close OUT;
#Done
print "DONE!\n\n";
}
=cut
##################################################
################ Time Slicing ####################
##################################################
#NOTE: This function isn't really tested, and is really slow right now
# Generates precision and recall values by varying the threshold
# of the A->B ranking measure.
# input: none
# output: none, but precision and recall values are printed to STDOUT
sub timeSlicing_generatePrecisionAndRecall_explicit {
my $NUM_SAMPLES = 100; #TODO, read from file the number of samples to average over for time slicing
my $self = shift;
print "In timeSlicing_generatePrecisionAndRecall\n";
my $numIntervals = 10;
die ("ERROR: explicitInputFile must be defined in LBD config file\n");
}
$explicitMatrixRef = Discovery::fileToSparseMatrix($lbdOptions{'explicitInputFile'});
#------------------------------------------
#create the starting matrix
my $startingMatrixRef
= TimeSlicing::generateStartingMatrix($explicitMatrixRef, \%lbdOptions, $startAcceptTypesRef, $NUM_SAMPLES, $umls_interface);
#get association scores for the starting matrix
my $assocScoresRef = TimeSlicing::getAssociationScores(
$startingMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association);
my ($min, $max) = TimeSlicing::getMinMax($assocScoresRef);
my $range = $max-$min;
#load the post cutoff matrix for the necessary rows
my $postCutoffMatrixRef
= TimeSlicing::loadPostCutOffMatrix($startingMatrixRef, $explicitMatrixRef, $lbdOptions{'postCutoffFileName'});
#apply a semantic type filter to the post cutoff matrix
if ((scalar keys %{$targetAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_columns(
$postCutoffMatrixRef, $targetAcceptTypesRef, $umls_interface);
}
#apply a threshold at $numIntervals% intervals to generate an 11 point
# interpolated precision/recall curve for linking term ranking/thresholding
#stats for collecting info about predicted vs. true
my $predictedAverage = 0;
my $trueAverage = 0;
my $trueMin = 999999;
my $trueMax = -999999;
my $predictedMin = 999999;
my $predictedMax = -999999;
my $predictedTotal = 0;
my $trueTotal = 0;
my $allPairsCount = scalar keys %{$assocScoresRef};
for (my $i = $numIntervals; $i >= 0; $i--) {
#determine the number of samples to threshold
my $numSamples = $i*($allPairsCount/$numIntervals);
print "i, numSamples/allPairsCount = $i, $numSamples/$allPairsCount\n";
#grab samples at just 10 to estimate the final point (this is what
# makes it an 11 point curve)
if ($numSamples == 0) {
$numSamples = 10;
}
#apply a threshold (number of samples)
my $thresholdedStartingMatrixRef = TimeSlicing::grabKHighestRankedSamples($numSamples, $assocScoresRef, $startingMatrixRef);
#generate implicit knowledge
my $implicitMatrixRef = Discovery::findImplicit($explicitMatrixRef, $thresholdedStartingMatrixRef);
#Remove Known Connections
$implicitMatrixRef
= Discovery::removeExplicit($startingMatrixRef, $implicitMatrixRef);
#apply a semantic type filter to the implicit matrix
if ((scalar keys %{$targetAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_columns(
$implicitMatrixRef, $targetAcceptTypesRef, $umls_interface);
}
lib/ALBD.pm view on Meta::CPAN
$trueAverage /= (scalar keys %{$implicitMatrixRef});
}
}
#output stats
print "predicted - total, min, max, average = $predictedTotal, $predictedMin, $predictedMax, $predictedAverage\n";
print "true - total, min, max, average = $trueTotal, $trueMin, $trueMax, $trueAverage\n";
}
# generates precision and recall values by varying the threshold
# of the A->C ranking measure. Also generates precision at k, and
# mean average precision
# input: none
# output: none, but precision, recall, precision at k, and map values
# output to STDOUT
sub timeSlicing_generatePrecisionAndRecall_implicit {
my $NUM_SAMPLES = 200; #TODO, read from file the number of samples to average over for time slicing
my $self = shift;
my $start; #used to record run times
print "In timeSlicing_generatePrecisionAndRecall_implicit\n";
if (exists $lbdOptions{'goldOutputFile'}) {
print "outputting gold\n";
Discovery::outputMatrixToFile($lbdOptions{'goldOutputFile'}, $goldMatrixRef);
}
}
#-------
#-------
# AB Scoring (if needed)
#-------
#if using average minimum weight, grab the a->b scores, #TODO this is sloppy here, but it has to be here...how to make it fit better?
my %abPairsWithScores = ();
if ($lbdOptions{'rankingProcedure'} eq 'averageMinimumWeight'
|| $lbdOptions{'rankingProcedure'} eq 'ltc_amw') {
print "getting AB scores\n";
#apply semantic type filter to columns only
if ((scalar keys %{$linkingAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_columns(
$explicitMatrixRef, $linkingAcceptTypesRef, $umls_interface);
}
#initialize the abPairs to the frequency of co-occurrence
foreach my $row (keys %{$startingMatrixRef}) {
foreach my $col (keys %{${$startingMatrixRef}{$row}}) {
$abPairsWithScores{"$row,$col"} = ${${$startingMatrixRef}{$row}}{$col};
}
}
Rank::getBatchAssociationScores(
\%abPairsWithScores, $explicitMatrixRef, $lbdOptions{'rankingMeasure'}, $umls_association);
}
#--------
#------------
# Matrix Filtering/Thresholding
#------------
#load or threshold the matrix
if (exists $lbdOptions{'thresholdedMatrix'}) {
print "loading thresholded matrix\n";
$explicitMatrixRef = (); #clear (for memory)
$explicitMatrixRef = Discovery::fileToSparseMatrix($lbdOptions{'thresholdedMatrix'});
}
#else {#TODO apply a threshold}
#NOTE, we must threshold the entire matrix because that is how we are calculating association scores
#Apply Semantic Type Filter to the explicit matrix
print "applying semantic filter to explicit matrix\n";
if ((scalar keys %{$linkingAcceptTypesRef}) > 0) {
Filters::semanticTypeFilter_rowsAndColumns(
$explicitMatrixRef, $linkingAcceptTypesRef, $umls_interface);
}
#------------
# Prediction Generation
#save the implicit knowledge matrix to file
if (exists ($lbdOptions{'predictionsOutFile'})) {
print "outputting predictions\n";
Discovery::outputMatrixToFile($lbdOptions{'predictionsOutFile'}, $predictionsMatrixRef);
}
}
#-------------------------------------------
#At this point, the explicitMatrixRef has been filtered and thresholded.
#The predictions matrix ref has been generated from the filtered and
# thresholded explicitMatrixRef; only rows of starting terms remain, those
# rows have been filtered, and explicit connections have been removed.
#Association scores are generated using the explicitMatrixRef
#--------------
# Get the ranks of all predictions
#--------------
#get the scores and ranks separately for each row
# thereby generating scores and ranks for each starting
# term individually
my %rowRanks = ();
my ($n1pRef, $np1Ref, $npp);
print "getting row ranks\n";
foreach my $rowKey (keys %{$predictionsMatrixRef}) {
#grab rows from start and implicit matrices
my %startingRow = ();
$startingRow{$rowKey} = ${$startingMatrixRef}{$rowKey};
my %implicitRow = ();
$implicitRow{$rowKey} = ${$predictionsMatrixRef}{$rowKey};
#Score Implicit Connections
my $scoresRef;
if ($lbdOptions{'rankingProcedure'} eq 'allPairs') {
#get stats just a single time
if (!defined $n1pRef || !defined $np1Ref || !defined $npp) {
($n1pRef, $np1Ref, $npp) = Rank::getAllStats($explicitMatrixRef);
}
$scoresRef = Rank::scoreImplicit_fromAllPairs(\%startingRow, $explicitMatrixRef, \%implicitRow, $lbdOptions{'rankingMeasure'}, $umls_association, $n1pRef, $np1Ref, $npp);
} elsif ($lbdOptions{'rankingProcedure'} eq 'averageMinimumWeight') {
#get stats just a single time
if (!defined $n1pRef || !defined $np1Ref || !defined $npp) {
($n1pRef, $np1Ref, $npp) = Rank::getAllStats($explicitMatrixRef);
}
$scoresRef = Rank::scoreImplicit_averageMinimumWeight(\%startingRow, $explicitMatrixRef, \%implicitRow, $lbdOptions{'rankingMeasure'}, $umls_association, \%abPairsWithScores, $n1pRef, $np1Ref, $npp);
} elsif ($lbdOptions{'rankingProcedure'} eq 'linkingTermCount') {
$scoresRef = Rank::scoreImplicit_linkingTermCount(\%startingRow, $explicitMatrixRef, \%implicitRow);
} elsif ($lbdOptions{'rankingProcedure'} eq 'frequency') {
$scoresRef = Rank::scoreImplicit_frequency(\%startingRow, $explicitMatrixRef, \%implicitRow);
} elsif ($lbdOptions{'rankingProcedure'} eq 'ltcAssociation') {
$scoresRef = Rank::scoreImplicit_ltcAssociation(\%startingRow, $explicitMatrixRef, \%implicitRow, $lbdOptions{'rankingMeasure'}, $umls_association);
} elsif ($lbdOptions{'rankingProcedure'} eq 'ltc_amw') {
#get stats just a single time
if (!defined $n1pRef || !defined $np1Ref || !defined $npp) {
($n1pRef, $np1Ref, $npp) = Rank::getAllStats($explicitMatrixRef);
}
$scoresRef = Rank::scoreImplicit_LTC_AMW(\%startingRow, $explicitMatrixRef, \%implicitRow, $lbdOptions{'rankingMeasure'}, $umls_association, \%abPairsWithScores, $n1pRef, $np1Ref, $npp);
} else {
die ("Error: Invalid Ranking Procedure\n");
}
#Rank Implicit Connections
my $ranksRef = Rank::rankDescending($scoresRef);
#save the row ranks
$rowRanks{$rowKey} = $ranksRef;
}
#output the results at 10 intervals
TimeSlicing::outputTimeSlicingResults($goldMatrixRef, \%rowRanks, 10);
}
##############################################################################
# functions to grab parameters and initialize all input
##############################################################################
# method to create a new LiteratureBasedDiscovery object
# input: $optionsHashRef <- a reference to an LBD options hash
return \%acceptTypes;
}
##############################################################################
# function to produce output
##############################################################################
# outputs the implicit terms to string
# input: $scoresRef <- a reference to a hash of scores (hash{CUI}=score)
# $ranksRef <- a reference to an array of CUIs ranked by their score
# $printTo <- optional, outputs the $printTo top ranked terms. If not
# specified, all terms are output
# output: a line-separated string containing ranked terms, scores, and their
# preferred terms
sub _rankedTermsToString {
my $self = shift;
my $scoresRef = shift;
my $ranksRef = shift;
my $printTo = shift;
#set printTo
if (!$printTo) {
$printTo = scalar @{$ranksRef};
}
#construct the output string
my $string = '';
my $index;
for (my $i = 0; $i < $printTo; $i++) {
#add the rank
$index = $i+1;
$string .= "$index\t";
#add the score
$string .= sprintf "%.5f\t", ${$scoresRef}{${$ranksRef}[$i]};
#add the CUI
$string .= "${$ranksRef}[$i]\t";
#add the name
my $name = $umls_interface->getPreferredTerm(${$ranksRef}[$i]);
#if no preferred name, get anything
if (!defined $name || $name eq '') {
my $termListRef = $umls_interface->getTermList(${$ranksRef}[$i]);
if (scalar @{$termListRef} > 0) {
$name = '.**'.${$termListRef}[0];
}
my $npp = Rank::getNPP($explicitMatrixRef);
my $n1p = Rank::getN1P('C0', $explicitMatrixRef);
my $np1 = Rank::getNP1('C2', $explicitMatrixRef);
print "Contingency Table Values from Explicit Matrix\n";
print "n11 = $n11\n";
print "npp = $npp\n";
print "n1p = $n1p\n";
print "np1 = $np1\n";
#Test other rank methods
my $scoresRef = Rank::scoreImplicit_fromAllPairs($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $lbdOptions{rankingMethod}, $umls_association);
my $ranksRef = Rank::rankDescending($scoresRef);
print "Scores: \n";
foreach my $cui (keys %{$scoresRef}) {
print " scores{$cui} = ${$scoresRef}{$cui}\n";
}
print "Ranks = ".join(',', @{$ranksRef})."\n";
}
sub _printMatrix {
my $matrixRef = shift;
my $matrixSize = shift;
my $indexToCuiRef = shift;
for (my $i = 0; $i < $matrixSize; $i++) {
lib/LiteratureBasedDiscovery/Discovery.pm view on Meta::CPAN
# 3) apply filtering to explicit knowledge
# 4) square explicit knowledge to generate implicit knowledge
# 5) remove explicit knowledge from implicit knowledge
# 6) filter implicit knowledge
#
# which has code as:
# TODO insert sample code
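# A hedged sketch of those steps, using the Discovery and Filters calls
# that appear in ALBD.pm (arguments abbreviated; treat this as an
# illustration, not the exact invocation):
#   my $explicitMatrixRef = Discovery::fileToSparseMatrix($matrixFileName);
#   my $startingMatrixRef = Discovery::getRows($startCuisRef, $explicitMatrixRef);
#   Filters::semanticTypeFilter_rowsAndColumns($explicitMatrixRef,
#       $linkingAcceptTypesRef, $umls_interface);
#   my $implicitMatrixRef = Discovery::findImplicit($explicitMatrixRef,
#       $startingMatrixRef);
#   $implicitMatrixRef = Discovery::removeExplicit($startingMatrixRef,
#       $implicitMatrixRef);
#   Filters::semanticTypeFilter_columns($implicitMatrixRef,
#       $targetAcceptTypesRef, $umls_interface);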
#NOTE: CUI merging/term expansion can also be easily done by adding
# two or more explicit vectors, then generating explicit knowledge from
# them. BUT also interesting is that term expansion, etc. is
# unnecessary if we just rank against every term. We may however need
# to modify the ranking metrics to account for synonyms, etc.. (max value
# of a set of synonyms or something)
######################################################################
# Functions to perform Literature Based Discovery
######################################################################
lib/LiteratureBasedDiscovery/Evaluation.pm view on Meta::CPAN
# ALBD::Evaluation.pm
#
# Provides functionality to evaluate LBD systems
# Key components are:
# Results Matrix <- all new knowledge generated by an LBD system (e.g.
# all proposed discoveries of a system from pre-cutoff
# data).
# Gold Standard Matrix <- the gold standard knowledge matrix (e.g. all
# knowledge present in the post-cutoff dataset
# that is not present in the pre-cutoff dataset).
#
# Copyright (c) 2017
#
# Sam Henry
# henryst at vcu.edu
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# The Free Software Foundation, Inc.,
# 59 Temple Place - Suite 330,
# Boston, MA 02111-1307, USA.
package Evaluation;
use strict;
use warnings;
# Timeslicing evaluation that calculates the precision of LBD
# (O(k), where k is the number of keys in results)
# input: $resultsMatrixRef <- ref a matrix of LBD results
# $goldMatrixRef <- ref to a gold standard matrix
# output: the precision of results
sub calculatePrecision {
my $resultsMatrixRef = shift;
my $goldMatrixRef = shift;
# calculate the precision, which is the percentage of results that
# are in the gold standard
# (percent of generated that is gold)
my $count = 0;
foreach my $key(keys %{$resultsMatrixRef}) {
if (exists ${$goldMatrixRef}{$key}) {
$count++;
}
}
return $count/(scalar keys %{$resultsMatrixRef});
}
# Timeslicing evaluation that calculates the recall of LBD
# (O(k), where k is the number of keys in gold)
# input: $resultsMatrixRef <- ref a matrix of LBD results
# $goldMatrixRef <- ref to a gold standard matrix
# output: the recall of results
sub calculateRecall {
my $resultsMatrixRef = shift;
my $goldMatrixRef = shift;
# calculate the recall, which is the percentage of knowledge in the gold
# standard that was generated by the LBD system
# (percent of gold that is generated)
my $count = 0;
foreach my $key(keys %{$goldMatrixRef}) {
if (exists ${$resultsMatrixRef}{$key}) {
$count++;
}
}
return $count/(scalar keys %{$goldMatrixRef});
}
1;
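As a worked illustration of the two functions above, the key-overlap arithmetic looks like this on toy matrices (the keys here are hypothetical stand-ins for matrix keys):

```perl
use strict;
use warnings;

# Toy example of the precision and recall computations in Evaluation.pm.
my %results = ('C1' => 1, 'C2' => 1, 'C3' => 1, 'C4' => 1); # LBD output
my %gold    = ('C2' => 1, 'C3' => 1, 'C5' => 1);            # gold standard

# count of generated results that are in the gold standard
my $hits = grep { exists $gold{$_} } keys %results;

my $precision = $hits / (scalar keys %results); # 2 of 4 generated are gold
my $recall    = $hits / (scalar keys %gold);    # 2 of 3 gold were generated

printf "precision = %.3f, recall = %.3f\n", $precision, $recall;
```

Since both functions only compare hash keys, a "matrix" here can be any hashed representation of (start, target) discoveries, as long as the results and gold matrices use the same key scheme.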
lib/LiteratureBasedDiscovery/Rank.pm view on Meta::CPAN
# along with this program; if not, write to
#
# The Free Software Foundation, Inc.,
# 59 Temple Place - Suite 330,
# Boston, MA 02111-1307, USA.
package Rank;
use strict;
use warnings;
# scores each implicit CUI using an association measure, but the input to
# the association measure is based on linking term counts, rather than
# co-occurrence counts.
# input: $startingMatrixRef <- ref to the starting matrix
# $explicitMatrixRef <- ref to the explicit matrix
# $implicitMatrixRef <- ref to the implicit matrix
# $measure <- the string of the umls association measure to use
# $association <- an instance of umls association
# output: a hash ref of scores for each implicit key. (hash{cui} = score)
sub scoreImplicit_ltcAssociation {
my $startingMatrixRef = shift;
my $explicitMatrixRef = shift;
my $implicitMatrixRef = shift;
my $measure = shift;
my $association = shift;
#bTerms to calculate n1p (number of unique co-occurring terms)
my %bTerms = ();
my $rowRef;
my $npp = 0;
my %uniqueKeys = ();
foreach my $key1 (keys %{$explicitMatrixRef}) {
$rowRef = ${$explicitMatrixRef}{$key1};
foreach my $key2 (keys %{$rowRef}) {
$uniqueKeys{$key2} = 1;
}
}
$npp = scalar keys %uniqueKeys;
#get scores for each cTerm
my %score = ();
foreach my $cTerm (keys %cTerms) {
#assume calculation cannot be made
$score{$cTerm} = -1;
#only calculate if np1 > 0
if ($np1{$cTerm} > 0) {
#get score
$score{$cTerm} = $association->_calculateAssociation_fromObservedCounts($n11{$cTerm}, $n1p, $np1{$cTerm}, $npp, $measure);
}
}
return \%score;
}
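The npp loop above counts the unique column keys across the whole sparse explicit matrix. A self-contained sketch of the same count on a toy matrix (names are illustrative):

```perl
use strict;
use warnings;

# Toy sparse matrix: row => { column => count }
my %explicit = (
    A => { B1 => 1, B2 => 2 },
    C => { B2 => 1, B3 => 4 },
);

# Collect every column key seen in any row
my %uniqueKeys;
foreach my $row (keys %explicit) {
    $uniqueKeys{$_} = 1 for keys %{ $explicit{$row} };
}
my $npp = scalar keys %uniqueKeys;   # B1, B2, B3 -> 3
```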
# scores each implicit CUI using an association measure. Score is the average
# of the minimum between association score between start and linking, and
# linking and target.
# input: $startingMatrixRef <- ref to the starting matrix
# $explicitMatrixRef <- ref to the explicit matrix
# $implicitMatrixRef <- ref to the implicit matrix
# $measure <- the string of the umls association measure to use
# $association <- an instance of umls association
# $abScoresRef <- hashRef of the a to b scores used in AMW
# key is the a,b cui pair (e.g. hash{'C00,C11'})
# values are their score
#
# Optional Input for passing in precalculated stats
# so that they don't have to get recalculated each time
# such as in timeslicing
# $n1pRef <- hashRef where key is a cui, value is n1p
# $np1Ref <- hashRef where key is a cui, value is np1
# $npp <- scalar = value of npp
# output: a hash ref of scores for each implicit key. (hash{cui} = score)
sub scoreImplicit_averageMinimumWeight {
#grab input
my $startingMatrixRef = shift;
my $explicitMatrixRef = shift;
my $implicitMatrixRef = shift;
my $measure = shift;
my $association = shift;
my $abScoresRef = shift;
#optionally pass in stats so they don't get recalculated for
# multiple terms (such as with time slicing)
my $n1pRef = shift;
my $np1Ref = shift;
my $npp = shift;
#get all BC pairs (call it bcScores because it will hold the scores)
my $bcScoresRef = &_getBCPairs($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef);
#get cui pair scores
&getBatchAssociationScores(
$bcScoresRef, $explicitMatrixRef, $measure, $association,
$n1pRef, $np1Ref, $npp);
#find the max a->b score (since there can be multiple a terms)
my %maxABScores = ();
my ($key1, $key2, $score);
foreach my $pairKey (keys %{$abScoresRef}) {
#second value is b term
($key1, $key2) = split(/,/,$pairKey);
$score = ${$abScoresRef}{$pairKey};
if ($score != -1) { #only compute for associations that exist
if (exists $maxABScores{$key2}) {
if ($score > $maxABScores{$key2}) {
$maxABScores{$key2} = $score;
}
} else {
$maxABScores{$key2} = $score;
}
}
}
# Find the average minimum weight (cScores) for each c term
# average of minimum a->b score and b->c score
my %cScores = ();
my %counts = ();
my ($value, $count, $min, $bTerm, $cTerm);
#sum min scores
foreach my $pairKey (keys %{$bcScoresRef}) {
#only compute for scores that exist
if (${$bcScoresRef}{$pairKey} != -1) {
#first is bTerm, second is cTerm
($bTerm, $cTerm) = split(/,/,$pairKey);
#check there is an AB value
if ($maxABScores{$bTerm} != -1) {
#get the minimum between a->b and b->c
$min = ${$bcScoresRef}{$pairKey};
if ($maxABScores{$bTerm} < $min) {
$min = $maxABScores{$bTerm};
}
#increase the sum (automatically initialize to 0)
$cScores{$cTerm} += $min;
$counts{$cTerm}++;
}
}
}
#normalize by counts
foreach my $key (keys %cScores) {
$cScores{$key} /= $counts{$key};
}
return \%cScores;
}
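A sketch of the AMW aggregation on toy scores: for each C term, take the minimum of the best A-to-B score for the linking B term and the B-to-C score, then average over all linking B terms (all names and score values below are illustrative):

```perl
use strict;
use warnings;

my %maxABScores = (B1 => 0.8, B2 => 0.4);           # best A->B score per B term
my %bcScores    = ('B1,C1' => 0.6, 'B2,C1' => 0.9); # B->C association scores

my (%sum, %count);
foreach my $pairKey (keys %bcScores) {
    my ($bTerm, $cTerm) = split /,/, $pairKey;
    # minimum of the A->B and B->C scores for this path
    my $min = $bcScores{$pairKey} < $maxABScores{$bTerm}
            ? $bcScores{$pairKey} : $maxABScores{$bTerm};
    $sum{$cTerm} += $min;
    $count{$cTerm}++;
}
# normalize by the number of linking terms
my %amw = map { $_ => $sum{$_} / $count{$_} } keys %sum;
# C1: (min(0.8,0.6) + min(0.4,0.9)) / 2 = (0.6 + 0.4) / 2 = 0.5
```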
# scores each implicit CUI using linking term count, and AMW as a tie breaker
# input: $startingMatrixRef <- ref to the starting matrix
# $explicitMatrixRef <- ref to the explicit matrix
# $implicitMatrixRef <- ref to the implicit matrix
# $measure <- the string of the umls association measure to use
# $association <- an instance of umls association
# $abScoresRef <- hashRef of the a to b scores used in AMW
# key is the a,b cui pair (e.g. hash{'C00,C11'})
# values are their score
# Optional Input for passing in precalculated stats
# so that they don't have to get recalculated each time
# such as in timeslicing
# $n1pRef <- hashRef where key is a cui, value is n1p
# $np1Ref <- hashRef where key is a cui, value is np1
# $npp <- scalar = value of npp
# output: a hash ref of scores for each implicit key. (hash{cui} = score)
sub scoreImplicit_LTC_AMW {
#grab the input
my $startingMatrixRef = shift;
my $explicitMatrixRef = shift;
my $implicitMatrixRef = shift;
my $measure = shift;
my $association = shift;
my $abScoresRef = shift;
#optionally pass in stats so they don't get recalculated for
# multiple terms (such as with time slicing)
my $n1pRef = shift;
my $np1Ref = shift;
my $nppRef = shift;
#get linking term count scores
my $ltcAssociationsRef = &scoreImplicit_linkingTermCount($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef);
#get average minimum weight scores
my $amwScoresRef = &scoreImplicit_averageMinimumWeight($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef, $measure, $association, $abScoresRef, $n1pRef, $np1Ref, $nppRef);
#create a hash of cui pairs for which the key is the ltc, and the value is an array of cui pairs that have that LTC
my %ltcHash = ();
foreach my $pairKey (keys %{$ltcAssociationsRef}) {
#get the LTC we will be tie breaking
my $currentLTC = ${$ltcAssociationsRef}{$pairKey};
if (!exists $ltcHash{$currentLTC}) {
my @newArray = ();
$ltcHash{$currentLTC} = \@newArray;
}
push @{$ltcHash{$currentLTC}}, $pairKey;
}
#generate the LTC-AMW scores by assigning a rank value
# first by LTC, and then by AMW
my %ltcAMWScores = ();
my $topRank = scalar keys %{$ltcAssociationsRef};
my $currentRank = $topRank;
#iterate first over ltc in descending order
foreach my $ltc (sort {$b <=> $a} keys %ltcHash) {
#check each cuiPair with this ltc
my %tiedAMWScores = ();
foreach my $cuiPair (@{$ltcHash{$ltc}}) {
$tiedAMWScores{$cuiPair} = ${$amwScoresRef}{$cuiPair};
}
#add the cui pairs by descending amw score
foreach my $cuiPair (sort {$tiedAMWScores{$b} <=> $tiedAMWScores{$a}} keys %tiedAMWScores) {
$ltcAMWScores{$cuiPair} = $currentRank;
$currentRank--;
}
}
#return the scores
return \%ltcAMWScores;
}
#TODO this is an untested method
# gets the max cosine distance score between all a terms and each cTerm
# input: $startingMatrixRef <- ref to the starting matrix
# $explicitMatrixRef <- ref to the explicit matrix
# $implicitMatrixRef <- ref to the implicit matrix
# output: a hash ref of scores for each implicit key. (hash{cui} = score)
sub score_cosineDistance {
#LBD Info
my $startingMatrixRef = shift;
my $explicitMatrixRef = shift;
my $implicitMatrixRef = shift;
#get all the A->C pairs
my $acPairsRef = &_getACPairs($startingMatrixRef, $implicitMatrixRef);
my %scores = ();
foreach my $pairKey (keys %{$acPairsRef}) {
#get the A and C keys
my ($aKey, $cKey) = split(/,/,$pairKey);
#grab the A and C explicit vectors
my $aVectorRef = ${$explicitMatrixRef}{$aKey};
my $cVectorRef = ${$explicitMatrixRef}{$cKey};
#find the numerator which is the sum of A[i]*C[i] values
my $numerator = 0;
}
#find the denominator, which is the product of A and C lengths
my $denom = sqrt($aSum)*sqrt($cSum);
#set the score (maximum score seen for that C term)
my $score = -1;
if ($denom != 0) {
$score = $numerator/$denom;
}
if (exists $scores{$cKey}) {
if ($score > $scores{$cKey}) {
$scores{$cKey} = $score;
}
}
else {
$scores{$cKey} = $score;
}
}
return \%scores;
}
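The elided numerator and denominator loops compute cosine similarity between two sparse rows. A self-contained sketch on toy vectors (uses the defined-or operator `//`, which needs perl >= 5.10):

```perl
use strict;
use warnings;

# Two toy sparse vectors (hash of dimension => value)
my %a = (x => 1, y => 2);
my %c = (y => 2, z => 1);

my ($dot, $aSum, $cSum) = (0, 0, 0);
foreach my $dim (keys %a) {
    $dot  += $a{$dim} * ($c{$dim} // 0);  # missing dims contribute 0
    $aSum += $a{$dim} ** 2;
}
$cSum += $c{$_} ** 2 for keys %c;

# cosine = dot product over the product of vector lengths
my $cosine = $dot / (sqrt($aSum) * sqrt($cSum));
# dot = 4, |a| = sqrt(5), |c| = sqrt(5), so cosine = 4/5 = 0.8
```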
# gets a list of A->C pairs, and sets the value as the implicit matrix value
# input: $startingMatrixRef <- ref to the starting matrix
# $implicitMatrixRef <- ref to the implicit matrix
# output: a hash ref where keys are comma separated cui pairs hash{'C000,C111'}
# and values are set to the value at that index in the implicit matrix
sub _getACPairs {
my $startingMatrixRef = shift;
my $implicitMatrixRef = shift;
foreach my $keyC (keys %{${$implicitMatrixRef}{$keyA}}) {
$acPairs{$keyA,$keyC} = ${${$implicitMatrixRef}{$keyA}}{$keyC};
}
}
return \%acPairs;
}
# scores each implicit CUI based on the number of linking terms between
# it and all starting terms.
# input: $startingMatrixRef <- ref to the starting matrix
# $explicitMatrixRef <- ref to the explicit matrix
# $implicitMatrixRef <- ref to the implicit matrix
# output: a hash ref of scores for each implicit key. (hash{cui} = score)
sub scoreImplicit_linkingTermCount {
#LBD Info
my $startingMatrixRef = shift;
my $explicitMatrixRef = shift;
my $implicitMatrixRef = shift;
#get all bc pairs
my $bcPairsRef = &_getBCPairs($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef);
# Find the linking term count for each cTerm
my %scores = ();
my ($key1, $key2);
foreach my $pairKey (keys %{$bcPairsRef}) {
#cTerm is the second value ($key2)
($key1, $key2) = split(/,/,$pairKey);
#automatically initializes to 0
$scores{$key2}++;
}
return \%scores;
}
# scores each implicit CUI based on the summed frequency of co-occurrence
# between it and all B terms (A->B frequencies are NOT considered)
# input: $startingMatrixRef <- ref to the starting matrix
# $explicitMatrixRef <- ref to the explicit matrix
# $implicitMatrixRef <- ref to the implicit matrix
# output: a hash ref of scores for each implicit key. (hash{cui} = score)
sub scoreImplicit_frequency {
#LBD Info
my $startingMatrixRef = shift;
my $explicitMatrixRef = shift;
my $implicitMatrixRef = shift;
#get all bc pairs
my $bcPairsRef = &_getBCPairs($startingMatrixRef, $explicitMatrixRef, $implicitMatrixRef);
# Find the frequency count for each cTerm
my %scores = ();
my ($key1, $key2);
foreach my $pairKey (keys %{$bcPairsRef}) {
#cTerm is the second value ($key2)
($key1, $key2) = split(/,/,$pairKey);
#automatically initializes to 0 (with +=)
$scores{$key2} += ${$bcPairsRef}{$pairKey};
}
return \%scores;
}
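The two subs above differ only in what they accumulate per C term: a count of linking terms versus a sum of co-occurrence frequencies. A toy sketch showing both side by side (the pair hash is illustrative):

```perl
use strict;
use warnings;

# Toy B->C pairs with their co-occurrence frequencies
my %bcPairs = ('B1,C1' => 3, 'B2,C1' => 2, 'B1,C2' => 5);

my (%ltc, %freq);
foreach my $pairKey (keys %bcPairs) {
    my (undef, $cTerm) = split /,/, $pairKey;
    $ltc{$cTerm}++;                       # one count per linking term
    $freq{$cTerm} += $bcPairs{$pairKey};  # summed co-occurrence frequency
}
# C1: ltc = 2, freq = 5;  C2: ltc = 1, freq = 5
```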
# scores each implicit CUI using an association measure. Score is the maximum
# association between a column in the implicit matrix, and one of the start
# matrix terms (so max between any A and that C term).
# Score is calculated using the implicit matrix
# input: $startCuisRef <- ref to an array of start cuis (A terms)
# $implicitMatrixFileName <- fileName of the implicit matrix
# $measure <- the string of the umls association measure to use
# $association <- an instance of umls association
# output: a hash ref of scores for each implicit key. (hash{cui} = score)
sub scoreImplicit_fromImplicitMatrix {
#LBD Info
my $startCuisRef = shift;
my $implicitMatrixFileName = shift;
my $measure = shift;
my $association = shift;
######################################
#Get hashes for A and C terms
#####################################
}
}
######################################
#Get Co-occurrence values, N11, N1P, NP1, NPP
######################################
#NPP is the total number of co-occurrences
#np1{cTerm} is the number of co-occurrences of a C term with any term ... so sum of XXX\tCTerm\tVal for each cTerm
#n1p is the number of co-occurrences of any A term ... so sum of anyATerm\tXXX\t
#n11{cTerm} is the sum of anyATerm\tCTerm\tVal
seek IN, 0,0; #reset to the beginning of the implicit file
#iterate over the lines of interest, and grab values
my %np1 = ();
my %n11 = ();
my $n1p = 0;
my $npp = 0;
my $matchedCuiB = 0;
my ($cuiA, $cuiB, $val);
while (my $line = <IN>) {
#grab data from the line
($cuiA, $cuiB, $val) = split(/\t/,$line);
$matchedCuiB = 0;
}
}
}
}
######################################
# Calculate Association for each c term
######################################
my %associationScores = ();
foreach my $cTerm(keys %cTerms) {
$associationScores{$cTerm} =
$association->_calculateAssociation_fromObservedCounts($n11{$cTerm}, $n1p, $np1{$cTerm}, $npp, $measure);
}
return \%associationScores;
}
# scores each implicit CUI using an association measure. Score is the maximum
# association between any of the linking terms.
# input: $startingMatrixRef <- ref to the starting matrix
# $explicitMatrixRef <- ref to the explicit matrix
# $implicitMatrixRef <- ref to the implicit matrix
# $measure <- the string of the umls association measure to use
# $association <- an instance of umls association
# output: a hash ref of scores for each implicit key. (hash{cui} = score)
sub scoreImplicit_fromAllPairs {
#LBD Info
my $startingMatrixRef = shift;
my $explicitMatrixRef = shift;
my $implicitMatrixRef = shift;
my $measure = shift;
my $association = shift;
#optionally pass in stats so they don't get recalculated for
# multiple terms (such as with time slicing)
my $n1pRef = shift;
my $np1Ref = shift;
my $npp = shift;
#get all bc pairs
my $bcPairsRef = &_getBCPairs($startingMatrixRef,
$explicitMatrixRef, $implicitMatrixRef);
#get bc pairs scores
&getBatchAssociationScores(
$bcPairsRef, $explicitMatrixRef, $measure, $association,
$n1pRef, $np1Ref, $npp);
# Find the max explicitCUI,implicitCUI association for each implicit CUI.
# The association score is the maximum value between a C term and all
# B terms
my %scores = ();
my $max;
my $value;
my $implicitCui;
my ($key1,$key2);
foreach my $pairKey (keys %{$bcPairsRef}) {
#only compare association scores that are valid
if (${$bcPairsRef}{$pairKey} != -1) {
($key1,$key2) = split(/,/,$pairKey);
#only use key2, since that is the implicit cui (c term)
#update max for this implicit cui or create if needed
if (!exists $scores{$key2}) {
$scores{$key2} = ${$bcPairsRef}{$pairKey};
}
elsif (${$bcPairsRef}{$pairKey} > $scores{$key2}) {
$scores{$key2} = ${$bcPairsRef}{$pairKey}
}
}
}
return \%scores;
}
#TODO: unimplemented stub
sub scoreImplicit_minimumWeightAssociation {
}
#XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
#XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# Builds a list of B->C term pairs that also co-occur with A terms
# Only adds B->C term pairs for C terms that are also present in the
# implicitMatrix.
# The value of the bcPairs Hash is the value in the explicit matrix
# for that pair.
# input: $startingMatrixRef <- ref to the starting matrix
# $explicitMatrixRef <- ref to the explicit matrix
# $implicitMatrixRef <- ref to the implicit matrix
# output: a hash ref of BC term pairs. Each key is "$bTerm,$cTerm",
# value is by default the frequency of BC co-occurrences in the
# matrix
sub _getBCPairs {
#add because this a->b->c term (%cTerms) is also a b->c term
$bcPairs{"$bTerm,$cTerm"} = ${$rowRef}{$cTerm};
}
}
}
}
return \%bcPairs;
}
# ranks the scores in descending order
# input: $scoresRef <- a hash ref to a hash of cuis and scores (hash{cui} = score)
# output: an array ref of the ranked cuis in descending order
sub rankDescending {
#grab the input
my $scoresRef = shift;
#sort in descending order, and use the CUI string as a tie-breaker
my @rankedCuis = ();
my @tiedCuis = ();
my $currentScore = -1;
foreach my $cui (
#sort function to sort by value
sort {${$scoresRef}{$b} <=> ${$scoresRef}{$a}}
keys %{$scoresRef}) {
#see if this cui is tied with the previous
if (${$scoresRef}{$cui} != $currentScore) {
#this cui is not tied with the previous,
# so save all previous ones to the ranked array.
# Here, we sort by key name, so the tie breaker
# is the cui name itself. This is arbitrary but
# allows for results to be precisely replicated.
# UPDATE: Almost precisely replicated. There is
# a numerical stability problem, so the sort
# by value may group ties differently between
# runs: an item with a value of 0.66666666666667
# may sort above another item with the same value
# on one run, and with it on the next. This is
# essentially unavoidable without implementing
# a tolerance threshold, which seems like overkill.
foreach my $tiedCui (sort @tiedCuis) {
push @rankedCuis, $tiedCui;
}
#clear the list of tied CUIs
@tiedCuis = ();
}
#add current CUI to the tied CUI list and update the
# current score
$currentScore = ${$scoresRef}{$cui};
push @tiedCuis, $cui;
}
#add any remaining tied cuis to the final list
foreach my $cui (sort @tiedCuis) {
push @rankedCuis, $cui;
}
#return the ranked cuis
return \@rankedCuis;
}
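The same ordering can be sketched as a single sort with an explicit tie-breaker in the comparator. This is a compact equivalent on toy scores, not the module's code:

```perl
use strict;
use warnings;

my %scores = (C3 => 0.5, C1 => 0.9, C4 => 0.5, C2 => 0.9);

# sort by score descending; break ties deterministically on the CUI string
my @rankedCuis = sort { $scores{$b} <=> $scores{$a} or $a cmp $b } keys %scores;
# -> C1, C2 (tied at 0.9, alphabetical), then C3, C4 (tied at 0.5)
```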
#XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
#XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
# gets association scores for a set of cui pairs
# input: $cuiPairsRef <- reference to a hash of pairs of matrix indices (key = '1,2')
# $matrixRef <- a reference to a sparse matrix of n11 values
# $measure <- the association measure to perform
# $association <- an instance of UMLS::Association
# output: none, but the cuiPairs ref has its values updated to reflect the
# computed association scores
sub getBatchAssociationScores {
my $cuiPairsRef = shift;
my $matrixRef = shift;
my $measure = shift;
my $association = shift;
#optionally pass in $n1pRef, $np1Ref, and $npp
# do this if they get calculated multiple times
# (such as with time slicing)
my $n1pRef = shift;
my $np1Ref = shift;
# the cuiPairs ref which already holds CUI frequencies
if ($measure eq 'freq') {
return $cuiPairsRef;
}
#calculate stats if needed
if (!defined $n1pRef || !defined $np1Ref || !defined $npp) {
($n1pRef, $np1Ref, $npp) = &getAllStats($matrixRef);
}
#get association scores for each CUI pair
my ($n11, $cui1, $cui2);
foreach my $key (keys %{$cuiPairsRef}) {
#get the cui indices
($cui1, $cui2) = split(/,/,$key);
#assume calculation cannot be made
${$cuiPairsRef}{$key} = -1;
#get n11
$n11 = ${${$matrixRef}{$cui1}}{$cui2};
lib/LiteratureBasedDiscovery/TimeSlicing.pm view on Meta::CPAN
print "calculating precision and recall\n";
#bounds check, the predictions matrix must contain keys
if ((scalar keys %{$predictionsMatrixRef}) < 1) {
return (0,0); #precision and recall are both zero
}
#calculate precision and recall averaged over each cui
my $precision = 0;
my $recall = 0;
#each row key corresponds to a term for which we calculate
# precision and recall. From each term's precision and recall
# we calculate an average over all terms
foreach my $rowKey (keys %{$trueMatrixRef}) {
#calculate precision for this term
my $truePositive = 0;
my $falsePositive = 0;
foreach my $colKey (keys %{${$predictionsMatrixRef}{$rowKey}}) {
if (exists ${${$trueMatrixRef}{$rowKey}}{$colKey}) {
$truePositive++;
#calculate the averages (divide by the number of rows
# = the number of terms in the post cutoff matrix)
$precision /= scalar keys %{$trueMatrixRef};
$recall /= scalar keys %{$trueMatrixRef};
#return the average precision and recall
return ($precision, $recall);
}
# loads the post cutoff matrix from file. Only loads rows corresponding
# to rows in the starting matrix ref to save memory, and because those are
# the only rows that are needed.
# input: $startingMatrixRef <- a ref to the starting sparse matrix
# $explicitMatrix Ref <- a ref to the explicit sparse matrix
# $postCutoffFileName <- the filename to the postCutoffMatrix
# output: \%postCutoffMatrix <- a ref to the postCutoff sparse matrix
sub loadPostCutOffMatrix {
my $startingMatrixRef = shift;
my $explicitMatrixRef = shift;
my $postCutoffFileName = shift;
$explicitMatrixRef, $startTermAcceptTypesRef, $umls_interface);
((scalar keys %{$rowsToKeepRef}) >= $numRows) or die("ERROR: the number of acceptable starting term rows is less than $numRows\n");
#randomly select 100 rows (to generate the 'starting matrix')
#generate random numbers from 0 to number of rows in the explicit matrix
my %rowNumbers = ();
while ((scalar keys %rowNumbers) < $numRows) {
$rowNumbers{int(rand(scalar keys %{$rowsToKeepRef}))} = 1;
}
#fill starting matrix with keys corresponding to the random numbers
my $i = 0;
foreach my $key (keys %{$rowsToKeepRef}) {
if (exists $rowNumbers{$i}) {
$startingMatrix{$key} = ${$explicitMatrixRef}{$key}
}
$i++;
}
#output the cui list if needed
if (exists ${$lbdOptionsRef}{'cuiListOutputFile'}) {
$rowsToKeep{$cui1} = 1;
last;
}
}
}
#return the rowsToKeep
return \%rowsToKeep
}
# generates a hash of all association scores from the matrix.
# The hash keys are "$rowKey,$colKey"; hash values are the association scores
# between the $rowKey and $colKey. Scores are calculated for all co-occurring
# cui pairs in the matrix
# input: $matrixRef <- a reference to a sparse matrix
# $rankingMeasue <- a string specifying the ranking measure to use
# $umls_association <- an instance of UMLS::Association
# output: \%cuiPairs <- a ref to a hash of CUI pairs and their association
# each key of the hash is a comma separated string
# containing cui1 and cui2 of the pair
# (e.g. 'cui1,cui2'), and each value is their association
# score using the specified association measure
sub getAssociationScores {
my $matrixRef = shift;
my $rankingMeasure = shift;
my $umls_association = shift;
print " getting Association Scores, rankingMeasure = $rankingMeasure\n";
#generate a list of cui pairs in the matrix
my %cuiPairs = ();
print " generating association scores:\n";
foreach my $rowKey (keys %{$matrixRef}) {
foreach my $colKey (keys %{${$matrixRef}{$rowKey}}) {
$cuiPairs{"$rowKey,$colKey"} = ${${$matrixRef}{$rowKey}}{$colKey};
}
}
#get ranks for all the cui pairs in the matrix
#return a hash of cui pairs and their frequency
if ($rankingMeasure eq 'frequency') {
return \%cuiPairs;
} else {
#updates values in cuiPairs hash with their association scores and returns
Rank::getBatchAssociationScores(\%cuiPairs, $matrixRef, $rankingMeasure, $umls_association);
return \%cuiPairs;
}
}
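A sketch of the pair-hash flattening step that getAssociationScores performs before batch scoring, on a toy sparse matrix (names are illustrative):

```perl
use strict;
use warnings;

# Toy sparse matrix: row => { column => co-occurrence count }
my %matrix = (C1 => { C2 => 3 }, C2 => { C3 => 1 });

# Flatten into a "row,col" => value pair hash
my %cuiPairs;
foreach my $rowKey (keys %matrix) {
    foreach my $colKey (keys %{ $matrix{$rowKey} }) {
        $cuiPairs{"$rowKey,$colKey"} = $matrix{$rowKey}{$colKey};
    }
}
```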
# gets the min and max value of a hash
# returns a two element array, where the first value is the min, and
# the second values is the max
# input: $hashref <- a reference to a hash with numbers as values
# output: ($min, $max) <- the minimum and maximum values in the hash
sub getMinMax {
if ($val < $min) {
$min = $val;
}
if ($val > $max) {
$max = $val;
}
}
return ($min,$max);
}
# Applies a threshold to a matrix using a corresponding association scores
# hash. Any keys less than the threshold are not copied to the new matrix
# input: $threshold <- a scalar threshold
# $assocScoresRef <- a reference to a cui pair hash of association
# scores. Each key is a comma separated cui pair
# (e.g. 'cui1,cui2'), values are their association
# scores.
# $matrixRef <- a reference to a co-occurrence sparse matrix that
# corresponds to the assocScoresRef
# output: \%thresholdedMatrix <- a ref to a new matrix, built from the
# $matrixRef after applying the $threshold
sub applyThreshold {
my $threshold = shift;
my $assocScoresRef = shift;
my $matrixRef = shift;
#apply the threshold
my $preKeyCount = scalar keys %{$assocScoresRef};
my $postKeyCount = 0;
my %thresholdedMatrix = ();
my ($cui1, $cui2);
foreach my $key (keys %{$assocScoresRef}) {
#add key if val >= threshold
if (${$assocScoresRef}{$key} >= $threshold) {
($cui1,$cui2) = split(/,/, $key);
#create new hash at rowkey location
if (!(exists $thresholdedMatrix{$cui1})) {
my %newHash = ();
$thresholdedMatrix{$cui1} = \%newHash;
}
#set key value
${$thresholdedMatrix{$cui1}}{$cui2} = ${${$matrixRef}{$cui1}}{$cui2};
$postKeyCount++;
}
}
#return the thresholded matrix
return \%thresholdedMatrix;
}
# Grabs the K highest ranked samples. This is for thresholding based on the
# number of samples. Used in explicit timeslicing
# input: $k <- the number of samples to get
# $assocScoresRef <- a reference to a cui pair hash of association
# scores. Each key is a comma separated cui pair
# (e.g. 'cui1,cui2'), values are their association
# scores.
# $matrixRef <- a reference to a co-occurrence sparse matrix that
# corresponds to the assocScoresRef
# output: \%thresholdedMatrix <- a ref to a sparse matrix containing only the
# $k ranked samples (cui pairs)
sub grabKHighestRankedSamples {
my $k = shift;
my $assocScoresRef = shift;
my $matrixRef = shift;
print "getting $k highest ranked samples\n";
#apply the threshold
my $preKeyCount = scalar keys %{$assocScoresRef};
my $postKeyCount = 0;
my %thresholdedMatrix = ();
#get the keys sorted by value in descending order
my @sortedKeys = sort { $assocScoresRef->{$b} <=> $assocScoresRef->{$a} } keys(%$assocScoresRef);
my $threshold = ${$assocScoresRef}{$sortedKeys[$k-1]};
print " threshold = $threshold\n";
#add the first k keys to the thresholded matrix
my ($cui1, $cui2);
foreach my $key (@sortedKeys) {
($cui1, $cui2) = split(/,/, $key);
#create new hash at rowkey location (if needed)
if (!(exists $thresholdedMatrix{$cui1})) {
my %newHash = ();
$thresholdedMatrix{$cui1} = \%newHash;
}
#set key value for the key pair
${$thresholdedMatrix{$cui1}}{$cui2} = ${${$matrixRef}{$cui1}}{$cui2};
$postKeyCount++;
#stop adding keys when below the threshold
if (${$assocScoresRef}{$key} < $threshold) {
last;
}
}
#return the thresholded matrix
return \%thresholdedMatrix;
}
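A toy sketch of the top-k selection step (the module additionally copies values into a sparse matrix and keeps extra pairs that tie with the score at the cutoff):

```perl
use strict;
use warnings;

my %assoc = ('a,b' => 0.9, 'a,c' => 0.1, 'b,c' => 0.5, 'b,d' => 0.7);
my $k = 2;

# sort pair keys by association score in descending order
my @sortedKeys = sort { $assoc{$b} <=> $assoc{$a} } keys %assoc;

# take the first k keys
my @topK = @sortedKeys[0 .. $k - 1];
# -> ('a,b', 'b,d')
```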
# calculates precision and recall at $numIntervals (e.g. 10 for 10%) recall
# intervals using an implicit ranking threshold
# input: $trueMatrixRef <- a ref to a hash of true discoveries
# $rowRanksRef <- a ref to a hash of arrays of ranked predictions.
# Each hash key is a cui, each hash element is an
# array of ranked predictions for that cui. The ranked
# predictions are cuis are ordered in descending order
# based on association. (from Rank::rankDescending)
# $numIntervals <- the number of recall intervals to generate
# output: (\%precision, \%recall) <- refs to hashes of precision and recall.
# Each hash key is the interval number, and
# the value is the precision and recall
# respectively
sub calculatePrecisionAndRecall_implicit {
my $trueMatrixRef = shift; #a ref to the true matrix
my $rowRanksRef = shift; #a ref to ranked predictions; each hash element is an array of cuis ordered by their rank, the predictions for a single cui
my $numIntervals = shift; #the recall intervals to test at
#find precision and recall curves for each cui that is being predicted
# take the sum of precisions, then average after the loop
my %precision = ();
my %recall = ();
foreach my $rowKey (keys %{$trueMatrixRef}) {
#skip if there are NO new discoveries for this start term
if ($numTrue == 0) {
next;
}
#skip if there are NO predictions for this start term
if ($numPredictions == 0) {
next;
}
#determine precision and recall at 10% intervals of the number of
#predicted true values. This is done by simulating a threshold being
#applied, so the top $numToTest ranked terms are tested at 10% intervals
my $interval = $numPredictions/$numIntervals;
for (my $i = 0; $i <= 1; $i+=(1/$numIntervals)) {
#determine the number true to grab
my $numTrueForInterval = 1; #at $i = 0, grab just the first term that is true
if ($i > 0) {
$numTrueForInterval = $numTrue*$i;
}
#average the mean precision over all terms
foreach my $rowKey (keys %{$trueMatrixRef}) {
my $rankedPredictionsRef = ${$rowRanksRef}{$rowKey}; #an array ref of ranked predictions
#skip for rows that have no predictions
if (!defined $rankedPredictionsRef) {
next;
}
my $trueRef = ${$trueMatrixRef}{$rowKey}; #a list of true discoveries
#threshold the interval, so that it does not exceed
# the number of predictions
my $interval = $k;
if ($k > scalar @{$rankedPredictionsRef}) {
$interval = scalar @{$rankedPredictionsRef};
}
#find the number of true positives in the top $interval ranked terms
my $truePositiveCount = 0;
for (my $rank = 0; $rank < $interval; $rank++) {
my $cui = ${$rankedPredictionsRef}[$rank];
# or in time slicing
foreach my $rowKey (keys %{$trueMatrixRef}) {
my $rankedPredictionsRef = ${$rowRanksRef}{$rowKey}; #an array ref of ranked predictions
#skip for rows that have no predictions
if (!defined $rankedPredictionsRef) {
next;
}
my $trueRef = ${$trueMatrixRef}{$rowKey}; #a list of true discoveries
#threshold the interval, so that it does not exceed
# the number of predictions
my $interval = $k;
if ($k > scalar @{$rankedPredictionsRef}) {
$interval = scalar @{$rankedPredictionsRef};
}
#find the number of true co-occurrence for the top $interval
# ranked terms
my $cooccurrenceCount = 0;
for (my $rank = 0; $rank < $interval; $rank++) {
samples/configFileSamples/UMLSAssociationConfig view on Meta::CPAN
# the line "<database>bigrams" will pass the 'database' parameter with a
# value of 'bigrams' to the UMLS::Association options hash for its
# initialization.
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
#
#
# See UMLS::Association for more details.
# Database of Association Scores. Not used, but required to initialize
# UMLS::Association
<database>CUI_Bigram
# If the UMLS::Association Database is not installed on the local machine
# The following parameters may be needed to connect to the server
<hostname>192.168.00.00
<username>username
<password>password
<socket>/var/run/mysqld.sock
samples/lbdConfig view on Meta::CPAN
# Literature Based Discovery
# Options keys are in <>'s, and values follow directly after with no space.
# As an example, the line "<rankingMethod>ll" will set the 'rankingMethod'
# parameter with a value of 'll' for literature based discovery
#
# For parameters where no value is needed, just write the name of the
# parameter in '<>' (e.g. '<debug>')
# lines started with a # are skipped and may be used for comments
# The ranking procedure to use for LBD
# valid ranking procedures are:
# allPairs (maxBC) - maximum B to C term value
# averageMinimumWeight (AMW) - average of minimum A to B and B to C values
# linkingTermCount* (LTC) - count of shared linking terms
# frequency* (freq) - sum of A to B co-occurrences of shared B terms
# ltcAssociation (LTA) - association measures that use linking terms as inputs
# ltc_AMW - linking term count with AMW as a tie-breaker
#
# *all procedures require a measure to be specified except LTC and freq
<rankingProcedure>averageMinimumWeight
# The association measure to use as a value in the ranking procedure.
# The string is passed directly to UMLS::Association, so as that gets
# updated, new association measures will work automatically.
# At the time of this writing, valid arguments are:
# freq - Frequency
# dice - Dice Coefficient
# left - Fisher's exact test - left sided
# right - Fisher's exact test - right sided
# twotailed - Fisher's two-tailed test
# jaccard - Jaccard Coefficient
# ll - Log-likelihood ratio
# tmi - Mutual Information
# odds - Odds Ratio
# pmi - Pointwise Mutual Information
# phi - Phi Coefficient
# chi - Pearson's Chi Squared Test
# ps - Poisson Stirling Measure
# tscore - T-score
<rankingMeasure>ll
# The output path of the results of lbd
<implicitOutputFile>sampleOutput
# a comma separated list of linking (B) term accept semantic groups, which
# restricts the linking terms to the semantic groups specified. Group names
# come directly from the UMLS.
# See https://metamap.nlm.nih.gov/Docs/SemGroups_2013.txt for a list
#<linkingAcceptGroups>CHEM,DISO,GENE,PHYS,ANAT
# similar to linking accept groups, this restricts the acceptable linking (B)
# terms to terms within the semantic types listed
# See https://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt for a list
#<linkingAcceptTypes>clnd,chem
# a comma separated list of target (C) term accept semantic groups, which
# restricts the target terms to the semantic groups specified. Group names
# come directly from the UMLS.
# See https://metamap.nlm.nih.gov/Docs/SemGroups_2013.txt for a list
#<targetAcceptGroups>CHEM,GENE
# similar to target accept groups, this restricts the acceptable target (C)
# terms to terms within the semantic types listed
# See https://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt for a list
#<targetAcceptTypes>clnd,chem
# Input file path for the explicit co-occurrence matrix used in LBD
<explicitInputFile>sampleExplicitMatrix
# A comma-separated list of starting (A) CUIs used in LBD
<startCuis>C0001554,C1961131
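The `<key>value` lines in the config file above can be consumed with a small parser. The sketch below is illustrative only; ALBD's actual option handling may differ:

```perl
use strict;
use warnings;

# Parse LBD-style config text: lines of the form "<key>value", skipping
# comments and blank lines. Returns a hash ref of key => value.
# (Illustrative sketch; not ALBD's actual parser.)
sub parseConfigString {
    my $text = shift;
    my %options;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*#/ or $line =~ /^\s*$/;
        if ($line =~ /^<([^>]+)>(.*)$/) {
            $options{$1} = $2;
        }
    }
    return \%options;
}

my $sample = <<'END';
# comment lines are ignored
<rankingProcedure>averageMinimumWeight
<rankingMeasure>ll
<startCuis>C0001554,C1961131
END
my $opts = parseConfigString($sample);
print "$opts->{rankingMeasure}\n";   # prints "ll"
```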
samples/runSample.pl view on Meta::CPAN
#Demo file, showing how to run open discovery using the sample data, and how
# to perform time slicing evaluation using the sample data
# run a sample lbd using the parameters in the lbd configuration file
print "\n OPEN DISCOVERY \n";
`perl ../utils/runDiscovery.pl lbdConfig`;
print "LBD Open discovery results output to sampleOutput\n\n";
# run a sample time slicing
# first remove the co-occurrences of the pre-cutoff matrix (in this case the
# sampleExplicitMatrix) from the post-cutoff matrix. This generates a gold
# standard discovery matrix from which time slicing may be performed.
# This requires modifying removeExplicit.pl, which we have done for you.
# The variables for this example in removeExplicit.pl are:
# my $matrixFileName = 'sampleExplicitMatrix';
# my $squaredMatrixFileName = 'postCutoffMatrix';
# my $outputFileName = 'sampleGoldMatrix';
#`perl ../utils/datasetCreator/removeExplicit.pl`;
# next, run time slicing
print " TIME SLICING \n";
`perl ../utils/runDiscovery.pl timeSlicingConfig > sampleTimeSliceOutput`;
print "LBD Time Slicing results output to sampleTimeSliceOutput\n";
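The gold-matrix step described in the comments above (removing pre-cutoff co-occurrences from the post-cutoff matrix) can be sketched as follows. This is a conceptual illustration on in-memory lines, not the actual removeExplicit.pl, which works on the files named above:

```perl
use strict;
use warnings;

# Any CUI pair present in the pre-cutoff (explicit) matrix is removed
# from the post-cutoff matrix, leaving only newly appearing (gold)
# co-occurrences. Matrix lines are "cui1\tcui2\tcount".
sub removeExplicitPairs {
    my ($preRef, $postRef) = @_;   # array refs of matrix lines
    my %known;
    for my $line (@{$preRef}) {
        my ($c1, $c2) = split /\t/, $line;
        $known{"$c1,$c2"} = 1;
    }
    my @gold;
    for my $line (@{$postRef}) {
        my ($c1, $c2) = split /\t/, $line;
        push @gold, $line unless $known{"$c1,$c2"};
    }
    return \@gold;
}

my @pre  = ("C1\tC2\t3");
my @post = ("C1\tC2\t5", "C1\tC3\t2");
my $gold = removeExplicitPairs(\@pre, \@post);
print scalar @{$gold}, "\n";   # 1 (only the C1,C3 pair is new)
```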
samples/timeSlicingConfig view on Meta::CPAN
# A list of starting accept types. This is used to randomly generate 100
# starting terms if a cuiListFileName is not specified. All starting terms
# will be of the types listed
<startAcceptTypes>dsyn
#--------------------------------------
# The ranking procedure to use for LBD
# valid ranking procedures are:
# allPairs (maxBC) - maximum B to C term value
# averageMinimumWeight (AMW) - average of minimum A to B and B to C values
# linkingTermCount* (LTC) - count of shared linking terms
# frequency* (freq) - sum of A to B co-occurrences of shared B terms
# ltcAssociation (LTA) - association measures that use linking terms as inputs
# ltc_AMW - linking term count with AMW as a tie-breaker
#
# *all procedures require a measure to be specified except LTC and freq
<rankingProcedure>averageMinimumWeight
# The association measure to use as a value in the ranking procedure.
# The string is passed directly to UMLS::Association, so as that module is
# updated, new association measures will work automatically.
# At the time of this writing, valid arguments are:
# freq - Frequency
# dice - Dice Coefficient
# left - Fisher's exact test, left-sided
# right - Fisher's exact test, right-sided
# twotailed - Fisher's exact test, two-tailed
# jaccard - Jaccard Coefficient
# ll - Log-likelihood ratio
# tmi - Mutual Information
# odds - Odds Ratio
# pmi - Pointwise Mutual Information
# phi - Phi Coefficient
# chi - Pearson's Chi Squared Test
# ps - Poisson Stirling Measure
# tscore - T-score
<rankingMeasure>ll
# a comma-separated list of linking (B) term accept semantic groups, which
# restricts the linking terms to the semantic groups specified. Group names
# come directly from the UMLS.
# See https://metamap.nlm.nih.gov/Docs/SemGroups_2013.txt for a list
#<linkingAcceptGroups>CHEM,DISO,GENE,PHYS,ANAT
# similar to linking accept groups, this restricts the acceptable linking (B)
# terms to terms within the semantic types listed
# See https://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt for a list
#<linkingAcceptTypes>clnd,chem
# a comma-separated list of target (C) term accept semantic groups, which
# restricts the target terms to the semantic groups specified. Group names
# come directly from the UMLS.
# See https://metamap.nlm.nih.gov/Docs/SemGroups_2013.txt for a list
#<targetAcceptGroups>CHEM,GENE
# similar to target accept groups, this restricts the acceptable target (C)
# terms to terms within the semantic types listed
# See https://metamap.nlm.nih.gov/Docs/SemanticTypes_2013AA.txt for a list
#<targetAcceptTypes>clnd,chem
# Input file path for the explicit co-occurrence matrix used in LBD
<explicitInputFile>sampleExplicitMatrix
# Input file path for the gold standard matrix (matrix of true predictions)
# See utils/datasetCreator on how to make this
<goldInputFile>sampleGoldMatrix
samples/timeSlicingConfig view on Meta::CPAN
# by running timeslicing first with the predictionsOutFile specified, then
# in subsequent runs using that as an input
# <predictionsInFile>predictionsMatrix
# Output file path of the pre-computed predictions file
# This is optional, but can speed up computation time, since computing the
# prediction matrix can be time consuming.
# The prediction matrix is all predicted discoveries
<predictionsOutFile>predictionsMatrix
# A thresholded matrix to use for computing association measure values. The
# user must specify a non-thresholded matrix in order to properly compute
# predictions. The thresholded matrix file is optional, and is used only
# for the values in ranking procedures
# <thresholdedMatrix>thresholdMatrix
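The averageMinimumWeight (AMW) procedure selected in the config above can be sketched as below. This is an illustrative reimplementation under assumed inputs (per-B association scores), not ALBD's Rank module:

```perl
use strict;
use warnings;

# Sketch of averageMinimumWeight (AMW) ranking: for a start term A and a
# target term C, take min(score(A,B), score(B,C)) over each shared
# linking term B, and average over the shared B terms.
sub amw {
    my ($abRef, $bcRef) = @_;   # hash refs of B => association score
    my ($sum, $n) = (0, 0);
    for my $b (keys %{$abRef}) {
        next unless exists $bcRef->{$b};   # only shared linking terms
        my $min = $abRef->{$b} < $bcRef->{$b} ? $abRef->{$b} : $bcRef->{$b};
        $sum += $min;
        $n++;
    }
    return $n ? $sum / $n : 0;
}

my %ab = (B1 => 4, B2 => 2);
my %bc = (B1 => 3, B2 => 5);
print amw(\%ab, \%bc), "\n";   # (min(4,3) + min(2,5)) / 2 = 2.5
```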
t/goldSampleTimeSliceOutput view on Meta::CPAN
In timeSlicing_generatePrecisionAndRecall_implicit
loading explicit
generating starting
inputting gold
getting AB scores
applying semantic filter to explicit matrix
generating predictions
Squaring Matrix
Removing Known from Predictions
Applying Semantic Filter to Predictions
outputting predictions
getting row ranks
calculating precision and recall
----- average precision at 10% recall intervals (i recall precision) ---->
0 0.430555555555556 0.383928571428571
# `make test'. After `make install' it should work as `perl t/lch.t'
use strict;
use warnings;
use Test::Simple tests => 10;
#Error tolerance for numerical comparisons. Due to precision issues,
# and sort issues (again caused by precision), there may be small
# differences between runs. The precision at K tolerance is
# larger because small differences in ranking make big differences
# in scores when K < 10. See Rank::rankDescending for more
# details as to why the ranking imprecision occurs
my $precRecallErrorTol = 0.0001;
my $atKErrorTol = 1.0;
#######################################################
# test script to run the sample code and compare its
# output to the expected output. This tests both the
# open and closed discovery code portions
#########################################################
#Test that the demo file can run correctly
`(cd ./samples/; perl runSample.pl) &`;
#######################################################
#test that the demo output matches the expected demo output
#########################################################
print "Performing Open Discovery Tests:\n";
#read in the gold scores from the open discovery gold
my %goldScores = ();
open IN, './t/goldSampleOutput'
or die ("Error: Cannot open gold sample output\n");
while (my $line = <IN>) {
if ($line =~ /\d+\t(\d+\.\d+)\t(C\d+)/) {
$goldScores{$2} = $1;
}
}
close IN;
#read in the scores that were just generated
my %newScores = ();
open IN, './samples/sampleOutput'
or die ("Error: Cannot open sample output\n");
while (my $line = <IN>) {
if ($line =~ /\d+\t(\d+\.\d+)\t(C\d+)/) {
$newScores{$2} = $1;
}
}
close IN;
#check that the number of keys in the input and output files are the same
ok(scalar keys %goldScores == scalar keys %newScores, "Number of Output CUIs match");
#check that the gold and sample scores match
my $allMatch = 1;
my $allExist = 1;
foreach my $key(keys %goldScores) {
if (exists $newScores{$key}) {
if ($newScores{$key} != $goldScores{$key}) {
$allMatch = 0;
last;
}
}
else {
$allExist = 0;
$allMatch = 0;
last;
}
}
ok ($allExist == 1, "All CUIs exist in the output"); #all cuis exist in the new output file
ok ($allMatch == 1, "All Scores are the same in the output"); #all scores are the same in the new output file
print "Done with Open Discovery Tests\n\n";
#######################################################
#test that time slicing is computed correctly
#########################################################
print "Performing Time Slicing Tests\n";
#read in gold time slicing output
(my $goldAPScoresRef, my $goldMAP, my $goldPAtKScoresRef, my $goldFAtKScoresRef)
= &readTimeSlicingData('./t/goldSampleTimeSliceOutput');
#read in new time slicing output
(my $newAPScoresRef, my $newMAP, my $newPAtKScoresRef, my $newFAtKScoresRef)
= &readTimeSlicingData('./samples/sampleTimeSliceOutput');
#check that the correct number of values are read for all the
# time slicing metrics
ok (scalar @{$newAPScoresRef} == 11, "Correct Count of Average Precisions");
ok (scalar @{$newPAtKScoresRef} == 19, "Correct Count of Precision at K's");
ok (scalar @{$newFAtKScoresRef} == 19, "Correct Count of Freq at K's");
#check that each of the AP scores match the gold (within error tolerance)
my $apSame = 1;
for (my $i = 0; $i < scalar @{$goldAPScoresRef}; $i++) {
#check both comma-separated values (precision and recall)
my @goldScores = split(',',${$goldAPScoresRef}[$i]);
my @newScores = split(',',${$newAPScoresRef}[$i]);
if ((abs($goldScores[0]-$newScores[0]) > $precRecallErrorTol)
|| (abs($goldScores[1]-$newScores[1]) > $precRecallErrorTol)) {
$apSame = 0;
last;
}
}
ok($apSame == 1, "Average Precisions Match");
#check MAP is the same (within error tolerance)
ok (abs($goldMAP - $newMAP) < $precRecallErrorTol, "Mean Average Precision Matches");
#check that each of Precision at K scores match the gold
# (within error tolerance)
my $pAtKSame = 1;
for (my $i = 0; $i < scalar @{$goldPAtKScoresRef}; $i++) {
if (abs(${$goldPAtKScoresRef}[$i] - ${$newPAtKScoresRef}[$i]) > $atKErrorTol) {
$pAtKSame = 0;
last;
}
}
ok($pAtKSame == 1, "Precision at K Matches");
#check that each of the Freq at K scores match the gold
# (within error tolerance)
my $fAtKSame = 1;
for (my $i = 0; $i < scalar @{$goldFAtKScoresRef}; $i++) {
if (abs(${$goldFAtKScoresRef}[$i] - ${$newFAtKScoresRef}[$i]) > $atKErrorTol) {
$fAtKSame = 0;
last;
}
}
ok($fAtKSame == 1, "Frequency at K Matches");
print "Done with Time Slicing Tests\n";
############################################################
#function to read in time slicing data values
sub readTimeSlicingData {
my $fileName = shift;
#read in the gold time slicing values
my @APScores = ();
my $MAP;
my @PAtKScores = ();
my @FAtKScores = ();
open IN, "$fileName"
or die ("Error: Cannot open timeSliceOutput: $fileName\n");
while (my $line = <IN>) {
#read in the 11 values of average precision
if ($line =~ /average precision at 10% recall intervals/ ) {
while (my $line2 = <IN>) {
if ($line2 =~ /\d\s(\d\.?\d*)\s(\d\.\d*)/) {
push @APScores, "$1,$2";
}
else {
last;
}
}
}
#read in the MAP value
if ($line =~ /MAP = (\d+\.\d+)/ ) {
$MAP = $1;
}
#read in the 19 values of precision at k
if ($line =~ /mean precision at k interval/ ) {
while (my $line2 = <IN>) {
if ($line2 =~ /\d\s(\d\.\d*)/) {
push @PAtKScores, "$1";
}
else {
last;
}
}
}
#read in the 19 values of frequency at k
if ($line =~ /mean cooccurrences at k intervals/ ) {
while (my $line2 = <IN>) {
if ($line2 =~ /\d+\s(\d\.?\d*)/) {
push @FAtKScores, "$1";
}
else {
last;
}
}
}
}
close IN;
return (\@APScores, $MAP, \@PAtKScores, \@FAtKScores)
}
utils/datasetCreator/applyMaxThreshold.pl view on Meta::CPAN
use strict;
use warnings;
# applies a max threshold to a matrix. The max threshold is based on either
# the number of unique co-occurrences of a CUI, or the total number of
# co-occurrences of a CUI. Any CUI that occurs more than the $maxThreshold
# number of times (or with $maxThreshold number of CUIs) is eliminated from
# the matrix. This is done by copying values from the $inputFile to the
# $outputFile. $applyToUnique is used to toggle on or off unique number of
# CUIs threshold vs. total number of co-occurrences.
my $inputFile = '/home/henryst/lbdData/groupedData/reg/1975_1999_window8_noOrder';
my $outputFile = '/home/henryst/lbdData/groupedData/1975_1999_window8_noOrder_threshold5000u';
my $maxThreshold = 5000;
my $applyToUnique = 1;
my $countRef = &getStats($inputFile, $applyToUnique);
&applyMaxThreshold($inputFile, $outputFile, $maxThreshold, $countRef);
# gets co-occurrence stats, returns a hash of (unique) co-occurrence counts
# for each CUI. (count is unique or not depending on $applyToUnique)
sub getStats {
my $inputFile = shift;
my $applyToUnique = shift;
#open files
open IN, $inputFile or die("ERROR: unable to open inputFile\n");
utils/datasetCreator/applyMaxThreshold.pl view on Meta::CPAN
#NOTE: do not update counts for $2, because in the case where order
#does not matter, the matrix will have been pre-processed to ensure
#the second cui will appear first in the key. In the case where order
#does matter we just shouldn't be counting it anyway
}
close IN;
return \%count;
}
#applies a maxThreshold, $countRef is the output of getStats
sub applyMaxThreshold {
my $inputFile = shift;
my $outputFile = shift;
my $maxThreshold = shift;
my $countRef = shift;
#open the input and output
open IN, $inputFile or die("ERROR: unable to open inputFile\n");
open OUT, ">$outputFile"
or die ("ERROR: unable to open outputFile: $outputFile\n");
print "ApplyingThreshold\n";
#threshold each line of the file
my ($cui1, $cui2, $val);
while (my $line = <IN>) {
#grab values
($cui1, $cui2, $val) = split(/\t/,$line);
#skip if either $cui1 or $cui2 are greater than the threshold
# the counts in %count have been set already according to
# whether $applyToUnique or not
if (${$countRef}{$cui1} > $maxThreshold
|| ${$countRef}{$cui2} > $maxThreshold) {
next;
}
else {
print OUT $line;
}
}
close IN;
close OUT;
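The counting loop of getStats is truncated in the listing above. A hedged reconstruction of what it likely does, based on the surrounding comments, is:

```perl
use strict;
use warnings;

# Assumed reconstruction of getStats' counting loop: for each
# "cui1\tcui2\tcount" line, record either the number of unique
# co-occurring CUIs or the total co-occurrence count for cui1.
# (Only $1/cui1 is counted; see the NOTE in the original listing.)
sub countOccurrences {
    my ($linesRef, $applyToUnique) = @_;
    my %count;
    for my $line (@{$linesRef}) {
        my ($cui1, $cui2, $val) = split /\t/, $line;
        if ($applyToUnique) {
            $count{$cui1}++;          # one per unique co-occurring CUI
        } else {
            $count{$cui1} += $val;    # total co-occurrences
        }
    }
    return \%count;
}

my @lines = ("C1\tC2\t10", "C1\tC3\t5");
print countOccurrences(\@lines, 1)->{C1}, "\n";   # 2 unique partners
print countOccurrences(\@lines, 0)->{C1}, "\n";   # 15 total co-occurrences
```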
utils/datasetCreator/applyMinThreshold.pl view on Meta::CPAN
#Applies a minimum number of co-occurrences threshold to a file by
#copying the $inputFile to $outputFile, omitting lines that have
#$minThreshold or fewer co-occurrences
my $minThreshold = 5;
my $inputFile = '/home/henryst/1975_2015_window8_noOrder_preThresh';
my $outputFile = '/home/henryst/1975_2015_window8_noOrder_threshold'.$minThreshold;
&applyMinThreshold($minThreshold, $inputFile, $outputFile);
############
sub applyMinThreshold {
#grab the input
my $minThreshold = shift;
my $inputFile = shift;
my $outputFile = shift;
#open files
open IN, $inputFile or die("ERROR: unable to open inputFile\n");
open OUT, ">$outputFile"
or die ("ERROR: unable to open outputFile: $outputFile\n");
print "Reading File\n";
#threshold each line of the file
my ($cui1, $cui2, $val);
while (my $line = <IN>) {
#grab values
($cui1, $cui2, $val) = split(/\t/,$line);
#check minThreshold
if ($val > $minThreshold) {
print OUT $line;
}
}
close IN;
print "Done!\n";
}
utils/datasetCreator/applySemanticFilter.pl view on Meta::CPAN
use LiteratureBasedDiscovery::Discovery;
use LiteratureBasedDiscovery::Evaluation;
use LiteratureBasedDiscovery::Rank;
use LiteratureBasedDiscovery::Filters;
use LiteratureBasedDiscovery;
use UMLS::Association;
use UMLS::Interface;
####### User input
my $matrixFileName = '/home/henryst/lbdData/groupedData/1975_1999_window8_noOrder_threshold5';
my $outputFileName = $matrixFileName.'_filtered';
my $acceptTypesString = ''; #leave blank if none are applied
my $acceptGroupsString = 'CHEM,DISO,GENE,PHYS,ANAT'; #for the explicit matrix
my $interfaceConfig = '/home/share/packages/ALBD/config/interface';
#apply the filter to rows and columns or columns only
# apply to just columns generally for the implicit matrix
# ...if the rows are just the starting terms
# apply to rows and columns generally for the explicit matrix
my $columnsOnly = 0; #apply to columns only, or rows and columns
utils/datasetCreator/combineCooccurrenceMatrices.pl view on Meta::CPAN
# combines the co-occurrence counts for the year range specified (inclusive;
# e.g. 1983-1985 will combine counts from files of 1983, 1984, and 1985
# co-occurrences). This file is intended to run on co-occurrence matrices
# created separately for each year, and stored in a single folder. Creating
# co-occurrence matrices in this manner is useful because it makes running
# the CUICollector faster, and because files can be easily combined for
# different time slicing or discovery replication results. We ran CUICollector
# separately for each year of the MetaMapped MEDLINE baseline and stored each
# co-occurrence matrix in a single folder "hadoopByYear/output/". That folder
# contained files named after the year and window size used (e.g. 1975_window8).
# The code may need to be modified slightly for other purposes.
use strict;
use warnings;
my $startYear;
my $endYear;
my $windowSize;
my $dataFolder;
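The combining step itself reduces to summing counts per CUI pair across the per-year files. A minimal in-memory sketch (assuming the tab-separated matrix format used throughout; the real script reads files named like "1975_window8" from $dataFolder) is:

```perl
use strict;
use warnings;

# Sum co-occurrence counts for the same CUI pair across several
# per-year matrices, each given as an array ref of "cui1\tcui2\tcount"
# lines. Returns a hash ref keyed by "cui1\tcui2".
sub combineMatrices {
    my @matrixRefs = @_;
    my %combined;
    for my $ref (@matrixRefs) {
        for my $line (@{$ref}) {
            chomp $line;
            my ($c1, $c2, $val) = split /\t/, $line;
            $combined{"$c1\t$c2"} += $val;
        }
    }
    return \%combined;
}

my $combined = combineMatrices(["C1\tC2\t3"], ["C1\tC2\t4", "C2\tC3\t1"]);
print $combined->{"C1\tC2"}, "\n";   # 7 (3 from year one + 4 from year two)
```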
utils/datasetCreator/fromMySQL/removeQuotes.pl view on Meta::CPAN
#removes quotes from a db-to-tab export file
my $inFile = '1980_1984_window1_retest_data.txt';
my $outFile = '1980_1984_window1_restest_DELETEME';
open IN, $inFile or die ("unable to open inFile: $inFile\n");
open OUT, '>'.$outFile or die ("unable to open outFile: $outFile\n");
while (my $line = <IN>) {
$line =~ s/"//g;
#print $line;
print OUT $line;
}
utils/datasetCreator/makeOrderNotMatter.pl view on Meta::CPAN
#makes the order of CUIs not matter in a co-occurrence matrix and writes
# the result to file
use strict;
use warnings;
use Getopt::Long;
my $DEBUG = 0;
my $HELP = '';
my %options = ();
GetOptions( 'debug' => \$DEBUG,
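The core transformation here (canonicalizing pair order and merging counts) can be sketched as below. This is illustrative only, assuming the tab-separated matrix format; the real script is driven by command-line options:

```perl
use strict;
use warnings;

# Make CUI order not matter: put each pair into a canonical (sorted)
# order so that counts for (a,b) and (b,a) are summed into one entry.
sub mergeOrderedPairs {
    my $linesRef = shift;   # array ref of "cui1\tcui2\tcount" lines
    my %merged;
    for my $line (@{$linesRef}) {
        chomp $line;
        my ($c1, $c2, $val) = split /\t/, $line;
        ($c1, $c2) = sort ($c1, $c2);   # canonical order
        $merged{"$c1\t$c2"} += $val;
    }
    return \%merged;
}

my $m = mergeOrderedPairs(["C1\tC2\t3", "C2\tC1\t4"]);
print $m->{"C1\tC2"}, "\n";   # 7 (both orderings merged)
```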
utils/datasetCreator/squaring/convertForSquaring_MATLAB.pl view on Meta::CPAN
# functions to convert to and from assocLBD and MATLAB sparse matrix formats
use strict;
use warnings;
#convert to MATLAB sparse format
my $fileName = "1975_1999_window8_noOrder_threshold5_filtered";
&convertTo("/home/henryst/lbdData/groupedData/$fileName",
"/home/henryst/lbdData/groupedData/forSquaring/$fileName".'_converted',
"/home/henryst/lbdData/groupedData/forSquaring/$fileName".'_keys');
#convert from MATLAB sparse format
$fileName = "1980_1984_window1_ordered_filtered";
&convertFrom("/home/henryst/lbdData/groupedData/squared/$fileName".'_squared',
"/home/henryst/lbdData/groupedData/squared/$fileName".'_squared_convertedBack',
"/home/henryst/lbdData/groupedData/forSquaring/".$fileName.'_keys');
########################################
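MATLAB's sparse format wants integer row/column indices, so convertTo presumably maps each CUI to a 1-based index and records the mapping in the keys file. A sketch under that assumption (not the script's actual implementation):

```perl
use strict;
use warnings;

# Convert "cui1\tcui2\tcount" lines to integer triplets suitable for a
# MATLAB sparse matrix, assigning each CUI a 1-based index on first
# sight. Returns the triplets and the index->CUI key list.
sub toMatlabTriplets {
    my $linesRef = shift;
    my (%index, @triplets, @keys);
    my $next = 1;
    for my $line (@{$linesRef}) {
        chomp $line;
        my ($c1, $c2, $val) = split /\t/, $line;
        for my $cui ($c1, $c2) {
            unless (exists $index{$cui}) {
                $index{$cui} = $next++;
                push @keys, $cui;   # $keys[i-1] recovers the CUI for index i
            }
        }
        push @triplets, "$index{$c1}\t$index{$c2}\t$val";
    }
    return (\@triplets, \@keys);
}

my ($trip, $keys) = toMatlabTriplets(["C0001\tC0002\t5"]);
print "$trip->[0]\n";   # 1, 2, 5 separated by tabs
```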
utils/datasetCreator/squaring/squareMatrix_perl.pl view on Meta::CPAN
#squares a matrix from file and writes the result to file
#use strict;
#use warnings;
#use Getopt::Long;
my $DEBUG = 0;
my $HELP = '';
my %options = ();
#GetOptions( 'debug' => \$DEBUG,
utils/datasetCreator/squaring/squareMatrix_perl.pl view on Meta::CPAN
close OUT;
#read in the matrix
my $matrixRef = fileToSparseMatrix($options{'inputFile'});
#loop over the rows of the B matrix
my %product = ();
my $count = 1;
my $total = scalar keys %{$matrixRef};
my $dumpThreshold = 20000; #dump to file every 20,000 keys
my $keyCount = 0;
foreach my $key0 (keys %{$matrixRef}) {
#loop over row
foreach my $key1 (keys %{$matrixRef}) {
#loop over column
foreach my $key2 (keys %{${$matrixRef}{$key1}}) {
#update values
if (exists ${${$matrixRef}{$key0}}{$key1}) {
#update
utils/datasetCreator/squaring/squareMatrix_perl.pl view on Meta::CPAN
$keyCount++;
}
${$product{$key0}}{$key2} +=
${${$matrixRef}{$key0}}{$key1} *
${${$matrixRef}{$key1}}{$key2};
}
}
#output if needed
if ($keyCount > $dumpThreshold) {
&outputMatrix(\%product, $options{'outputFile'});
$keyCount = 0;
}
}
print STDERR "done with row: $count/$total\n";
$count++;
}
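The triple loop above interleaves a sparse matrix product with periodic dumps to disk to bound memory. The core computation, stripped of the memory management, is the standard row-by-row sparse square:

```perl
use strict;
use warnings;

# Square a sparse matrix stored as a hash of hashes:
# $m->{row}{col} = value. For each nonzero M[i][k], walk row k and
# accumulate M[i][k] * M[k][j] into product[i][j].
sub squareSparse {
    my $m = shift;
    my %product;
    for my $i (keys %{$m}) {
        for my $k (keys %{$m->{$i}}) {
            next unless exists $m->{$k};   # row k may be empty
            for my $j (keys %{$m->{$k}}) {
                $product{$i}{$j} += $m->{$i}{$k} * $m->{$k}{$j};
            }
        }
    }
    return \%product;
}

my %m = (a => { b => 2 }, b => { c => 3 });
my $sq = squareSparse(\%m);
print $sq->{a}{c}, "\n";   # 6: the a->b->c path, 2 * 3
```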
utils/runDiscovery.pl view on Meta::CPAN
#!/usr/bin/perl
=head1 NAME
runDiscovery.pl - This program runs literature based discovery with the
parameters specified in the input file. Please see samples/lbd or
samples/thresholding for sample input files and descriptions of parameters
and full details on what can be in an LBD input file.
=head1 SYNOPSIS
This utility takes an lbd configuration file and outputs the results
of lbd
=head1 USAGE
Usage: runDiscovery.pl LBD_CONFIG_FILE [OPTIONS]
=head1 INPUT
=head2 LBD_CONFIG_FILE
utils/runDiscovery.pl view on Meta::CPAN
=head3 --debug
enter debug mode
=head3 --version
display the version number
=head1 OUTPUT
A file containing the results of LBD
=head1 SYSTEM REQUIREMENTS
=over
=item * Perl (version 5.16.5 or better) - http://www.perl.org
=item * UMLS::Interface - http://search.cpan.org/dist/UMLS-Interface
=item * UMLS::Association - http://search.cpan.org/dist/UMLS-Association
utils/runDiscovery.pl view on Meta::CPAN
my $lbd = ALBD->new(\%options);
$lbd->performLBD();
############################################################################
# function to output help messages for this program
############################################################################
sub showHelp() {
print "This utility takes an lbd configuration file and outputs\n";
print "the results of lbd to file. The parameters for LBD are\n";
print "specified in the input file. Please see samples/lbd or\n";
print "samples/thresholding for sample input files and descriptions\n";
print "of parameters and full details on what can be in an LBD input\n";
print "file.\n";
print "\n";
print "Usage: runDiscovery.pl LBD_CONFIG_FILE [OPTIONS]\n";
print "\n";
print "General Options:\n\n";
print "--help displays help, a quick summary of program\n";
print " options\n";