1
|
Henry S, Wijesinghe DS, Myers A, McInnes BT. Using Literature Based Discovery to Gain Insights Into the Metabolomic Processes of Cardiac Arrest. Front Res Metr Anal 2021; 6:644728. [PMID: 34250435 PMCID: PMC8267364 DOI: 10.3389/frma.2021.644728] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/21/2020] [Accepted: 05/07/2021] [Indexed: 12/19/2022] Open
Abstract
In this paper, we describe how we applied literature-based discovery (LBD) techniques to discover lecithin cholesterol acyltransferase (LCAT) as a druggable target for cardiac arrest. We fully describe our process, which includes the use of high-throughput metabolomic analysis to identify metabolites significantly related to cardiac arrest, and how we used LBD to gain insights into how these metabolites relate to cardiac arrest. These insights led to our proposal (for the first time) of LCAT as a druggable target, the effects of which are supported by in vivo studies that were brought forth by this work. Metabolites are the end product of many biochemical pathways within the human body. Observed changes in metabolite levels are indicative of changes in these pathways and provide valuable insights into the cause, progression, and treatment of diseases. Following cardiac arrest, we observed changes in metabolite levels pre- and post-resuscitation. We used LBD to help discover diseases implicitly linked via these metabolites of interest. Results of LBD indicated a strong link between fish eye disease and cardiac arrest. Since fish eye disease is characterized by an LCAT deficiency, this prompted an investigation into the effect of LCAT on cardiac arrest survival. In the investigation, we found that decreased LCAT activity may increase cardiac arrest survival rates by increasing ω-3 polyunsaturated fatty acid availability in circulation. We verified the effect of ω-3 polyunsaturated fatty acids on increasing survival rate following cardiac arrest in vivo using rat models.
Affiliation(s)
- Sam Henry
- Department of Physics, Computer Science and Engineering, Christopher Newport University, Newport News, VA, United States
- D. Shanaka Wijesinghe
- Department of Pharmacotherapy and Outcomes Science, Virginia Commonwealth University, Richmond, VA, United States
- Aidan Myers
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
- Bridget T. McInnes
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
|
2
|
|
3
|
Crichton G, Baker S, Guo Y, Korhonen A. Neural networks for open and closed Literature-based Discovery. PLoS One 2020; 15:e0232891. [PMID: 32413059 PMCID: PMC7228051 DOI: 10.1371/journal.pone.0232891] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 11/02/2019] [Accepted: 04/23/2020] [Indexed: 12/18/2022] Open
Abstract
Literature-based Discovery (LBD) aims to discover new knowledge automatically from large collections of literature. Scientific literature is growing at an exponential rate, making it difficult for researchers to stay current in their discipline and easy to miss knowledge necessary to advance their research. LBD can facilitate hypothesis testing and generation and thus accelerate scientific progress. Neural networks have demonstrated improved performance on LBD-related tasks but are yet to be applied to it. We propose four graph-based, neural network methods to perform open and closed LBD. We compared our methods with those used by the state-of-the-art LION LBD system on the same evaluations to replicate recently published findings in cancer biology. We also applied them to a time-sliced dataset of human-curated peer-reviewed biological interactions. These evaluations and the metrics they employ represent performance on real-world knowledge advances and are thus robust indicators of approach efficacy. In the first experiments, our best methods performed 2-4 times better than the baselines in closed discovery and 2-3 times better in open discovery. In the second, our best methods performed almost 2 times better than the baselines in open discovery. These results are strong indications that neural LBD is potentially a very effective approach for generating new scientific discoveries from existing literature. The code for our models and other information can be found at: https://github.com/cambridgeltl/nn_for_LBD.
Affiliation(s)
- Gamal Crichton
- Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
- Simon Baker
- Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
- Yufan Guo
- Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
- Anna Korhonen
- Language Technology Laboratory, TAL, University of Cambridge, Cambridge, United Kingdom
|
4
|
|
5
|
Visualizing a field of research: A methodology of systematic scientometric reviews. PLoS One 2019; 14:e0223994. [PMID: 31671124 PMCID: PMC6822756 DOI: 10.1371/journal.pone.0223994] [Citation(s) in RCA: 342] [Impact Index Per Article: 68.4] [Received: 06/12/2019] [Accepted: 10/02/2019] [Indexed: 12/14/2022] Open
Abstract
Systematic scientometric reviews, empowered by computational and visual analytic approaches, offer opportunities to improve the timeliness, accessibility, and reproducibility of studies of the literature of a field of research. On the other hand, effectively and adequately identifying the most representative body of scholarly publications as the basis of subsequent analyses remains a common bottleneck in the current practice. What can we do to reduce the risk of missing something potentially significant? How can we compare different search strategies in terms of the relevance and specificity of topical areas covered? In this study, we introduce a flexible and generic methodology based on a significant extension of the general conceptual framework of citation indexing for delineating the literature of a research field. The method, through cascading citation expansion, provides a practical connection between studies of science from local and global perspectives. We demonstrate an application of the methodology to the research of literature-based discovery (LBD) and compare five datasets constructed based on three use scenarios and corresponding retrieval strategies, namely a query-based lexical search (one dataset), forward expansions starting from a groundbreaking article of LBD (two datasets), and backward expansions starting from a recently published review article by a prominent expert in LBD (two datasets). We particularly discuss the relevance of areas captured by expansion processes with reference to the query-based scientometric visualization. The method used in this study for comparing bibliometric datasets is applicable to comparative studies of search strategies.
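The forward and backward expansions described in this abstract can be pictured as a breadth-first walk over citation links. The following is only a minimal sketch under stated assumptions: the article identifiers and the `cites` mapping are hypothetical toy data, the helper names `invert` and `expand` are invented for illustration, and the published method may apply selection criteria at each generation that are not modeled here.

```python
def invert(cites):
    """Turn an 'article -> references' map into a 'article -> citing articles' map."""
    cited_by = {}
    for article, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(article)
    return cited_by


def expand(seed, links, generations):
    """Collect every article reachable from `seed` within `generations` link steps."""
    collected = {seed}
    frontier = {seed}
    for _ in range(generations):
        reached = set()
        for article in frontier:
            reached |= links.get(article, set())
        frontier = reached - collected  # only newly discovered articles move on
        collected |= frontier
    return collected


# Hypothetical toy citation data: each key cites the articles in its set.
cites = {
    "P1": {"P0"},
    "P2": {"P0", "P1"},
    "P3": {"P2"},
    "R": {"P2", "P3"},
}
cited_by = invert(cites)

backward = expand("R", cites, 2)      # follow references of a review article
forward = expand("P0", cited_by, 2)   # follow citations to a seed article
print(sorted(backward))
print(sorted(forward))
```

Using the same `expand` routine for both directions keeps the comparison between datasets simple: only the link map changes, not the traversal.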
|
6
|
Henry S, McInnes BT. Indirect association and ranking hypotheses for literature based discovery. BMC Bioinformatics 2019; 20:425. [PMID: 31416434 PMCID: PMC6694578 DOI: 10.1186/s12859-019-2989-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/24/2018] [Accepted: 07/09/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Literature Based Discovery (LBD) produces more potential hypotheses than can be manually reviewed, making automatically ranking these hypotheses critical. In this paper, we introduce the indirect association measures of Linking Term Association (LTA), Minimum Weight Association (MWA), and Shared B to C Set Association (SBC), and compare them to Linking Set Association (LSA), concept embeddings vector cosine, Linking Term Count (LTC), and direct co-occurrence vector cosine. Our proposed indirect association measures extend traditional association measures to quantify indirect rather than direct associations while preserving valuable statistical properties. RESULTS We perform a comparison between several different hypothesis ranking methods for LBD, and compare them against our proposed indirect association measures. We intrinsically evaluate each method's performance using its ability to estimate semantic relatedness on standard evaluation datasets. We extrinsically evaluate each method's ability to rank hypotheses in LBD using a time-slicing dataset based on co-occurrence information, and another time-slicing dataset based on SemRep extracted-relationships. Precision and recall curves are generated by ranking term pairs and applying a threshold at each rank. CONCLUSIONS Results differ depending on the evaluation methods and datasets, but it is unclear if this is a result of biases in the evaluation datasets or if one method is truly better than another. We conclude that LTC and SBC are the best suited methods for hypothesis ranking in LBD, but there is value in having a variety of methods to choose from.
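The evaluation step mentioned in this abstract, ranking term pairs and applying a threshold at each rank, can be sketched as follows. This is an illustrative sketch only: the scores and the gold-standard set are hypothetical, the function name `precision_recall_at_ranks` is invented here, and the scoring measure itself (LTA, MWA, SBC, and so on) is treated as a black box.

```python
def precision_recall_at_ranks(ranked_pairs, true_pairs):
    """Return (rank, precision, recall) with a cutoff applied after every rank."""
    true_pairs = set(true_pairs)
    curve = []
    hits = 0
    for rank, pair in enumerate(ranked_pairs, start=1):
        if pair in true_pairs:
            hits += 1
        precision = hits / rank
        recall = hits / len(true_pairs) if true_pairs else 0.0
        curve.append((rank, precision, recall))
    return curve


# Hypothetical association scores for candidate (A, C) term pairs.
scores = {
    ("a", "c1"): 0.9,
    ("a", "c2"): 0.7,
    ("a", "c3"): 0.4,
    ("a", "c4"): 0.1,
}
# Hypothetical gold standard, e.g. pairs that appear only in a later time slice.
gold = {("a", "c1"), ("a", "c3")}

ranked = sorted(scores, key=scores.get, reverse=True)
curve = precision_recall_at_ranks(ranked, gold)
for rank, precision, recall in curve:
    print(rank, round(precision, 3), round(recall, 3))
```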
Affiliation(s)
- Sam Henry
- Department of Computer Science, Virginia Commonwealth University, 601 W. Main St. Rm 435, Richmond, 23284 USA
- Bridget T. McInnes
- Department of Computer Science, Virginia Commonwealth University, 601 W. Main St. Rm 435, Richmond, 23284 USA
|
7
|
Preiss J. Is automatic detection of hidden knowledge an anomaly? BMC Bioinformatics 2019; 20:251. [PMID: 31138105 PMCID: PMC6538538 DOI: 10.1186/s12859-019-2815-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The quantity of documents being published requires researchers to specialize to a narrower field, meaning that inferable connections between publications (particularly from different domains) can be missed. This has given rise to automatic literature based discovery (LBD). However, unless heavily filtered, LBD generates more potential new knowledge than can be manually verified and another form of selection is required before the results can be passed onto a user. Since a large proportion of the automatically generated hidden knowledge is valid but generally known, we investigate the hypothesis that non trivial, interesting, hidden knowledge can be treated as an anomaly and identified using anomaly detection approaches. RESULTS Two experiments are conducted: (1) to avoid errors arising from incorrect extraction of relations, the hypothesis is validated using manually annotated relations appearing in a thesaurus, and (2) automatically extracted relations are used to investigate the hypothesis on publication abstracts. These allow an investigation of a potential upper bound and the detection of limitations yielded by automatic relation extraction. CONCLUSION We apply one-class SVM and isolation forest anomaly detection algorithms to a set of hidden connections to rank connections by identifying outlying (interesting) ones and show that the approach increases the F1 measure by a factor of 10 while greatly reducing the quantity of hidden knowledge to manually verify. We also demonstrate the statistical significance of this result.
Affiliation(s)
- Judita Preiss
- University of Salford, The School of Computing, Science & Engineering, Newton Building, Salford, Greater Manchester, M5 4WT, UK.
|
8
|
Gopalakrishnan V, Jha K, Jin W, Zhang A. A survey on literature based discovery approaches in biomedical domain. J Biomed Inform 2019; 93:103141. [PMID: 30857950 DOI: 10.1016/j.jbi.2019.103141] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/24/2018] [Revised: 02/17/2019] [Accepted: 02/19/2019] [Indexed: 02/06/2023]
Abstract
Literature Based Discovery (LBD) refers to the problem of inferring new and interesting knowledge by logically connecting independent fragments of information units through explicit or implicit means. This area of research, which incorporates techniques from Natural Language Processing (NLP), Information Retrieval and Artificial Intelligence, has significant potential to reduce discovery time in biomedical research fields. Formally introduced in 1986, LBD has grown to be a significant and a core task for text mining practitioners in the biomedical domain. Together with its inter-disciplinary nature, this has led researchers across domains to contribute in advancing this field of study. This survey attempts to consolidate and present the evolution of techniques in this area. We cover a variety of techniques and provide a detailed description of the problem setting, the intuition, the advantages and limitations of various influential papers. We also list the current bottlenecks in this field and provide a general direction of research activities for the future. In an effort to be comprehensive and for ease of reference for off-the-shelf users, we also list many publicly available tools for LBD. We hope this survey will act as a guide to both academic and industry (bio)-informaticians, introduce the various methodologies currently employed and also the challenges yet to be tackled.
Affiliation(s)
- Wei Jin
- University of North Texas at Denton, TX, United States.
|
9
|
Thilakaratne M, Falkner K, Atapattu T. A systematic review on literature-based discovery workflow. PeerJ Comput Sci 2019; 5:e235. [PMID: 33816888 PMCID: PMC7924697 DOI: 10.7717/peerj-cs.235] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/08/2019] [Accepted: 10/17/2019] [Indexed: 05/02/2023]
Abstract
As scientific publication rates increase, knowledge acquisition and the research development process have become more complex and time-consuming. Literature-Based Discovery (LBD), supporting automated knowledge discovery, helps facilitate this process by eliciting novel knowledge by analysing existing scientific literature. This systematic review provides a comprehensive overview of the LBD workflow by answering nine research questions related to the major components of the LBD workflow (i.e., input, process, output, and evaluation). With regards to the input component, we discuss the data types and data sources used in the literature. The process component presents filtering techniques, ranking/thresholding techniques, domains, generalisability levels, and resources. Subsequently, the output component focuses on the visualisation techniques used in LBD discipline. As for the evaluation component, we outline the evaluation techniques, their generalisability, and the quantitative measures used to validate results. To conclude, we summarise the findings of the review for each component by highlighting the possible future research directions.
Affiliation(s)
- Menasha Thilakaratne
- Faculty of Engineering, Computer and Mathematical Sciences, The University of Adelaide, Adelaide, South Australia, Australia
- Katrina Falkner
- Faculty of Engineering, Computer and Mathematical Sciences, The University of Adelaide, Adelaide, South Australia, Australia
- Thushari Atapattu
- Faculty of Engineering, Computer and Mathematical Sciences, The University of Adelaide, Adelaide, South Australia, Australia
|
10
|
Henry S, McInnes BT. Literature Based Discovery: Models, methods, and trends. J Biomed Inform 2017; 74:20-32. [PMID: 28838802 DOI: 10.1016/j.jbi.2017.08.011] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 03/03/2017] [Revised: 07/21/2017] [Accepted: 08/20/2017] [Indexed: 01/25/2023]
Abstract
OBJECTIVES This paper provides an introduction and overview of literature based discovery (LBD) in the biomedical domain. It introduces the reader to modern and historical LBD models, key system components, evaluation methodologies, and current trends. After completion, the reader will be familiar with the challenges and methodologies of LBD. The reader will be capable of distinguishing between recent LBD systems and publications, and be capable of designing an LBD system for a specific application. TARGET AUDIENCE From biomedical researchers curious about LBD, to someone looking to design an LBD system, to an LBD expert trying to catch up on trends in the field. The reader need not be familiar with LBD, but knowledge of biomedical text processing tools is helpful. SCOPE This paper describes a unifying framework for LBD systems. Within this framework, different models and methods are presented to both distinguish and show overlap between systems. Topics include term and document representation, system components, and an overview of models including co-occurrence models, semantic models, and distributional models. Other topics include uninformative term filtering, term ranking, results display, system evaluation, an overview of the application areas of drug development, drug repurposing, and adverse drug event prediction, and challenges and future directions. A timeline showing contributions to LBD, and a table summarizing the works of several authors is provided. Topics are presented from a high level perspective. References are given if more detailed analysis is required.
Affiliation(s)
- Sam Henry
- Department of Computer Science, Virginia Commonwealth University, 401 S. Main St., Rm E4222, Richmond, VA 23284, USA.
- Bridget T McInnes
- Department of Computer Science, Virginia Commonwealth University, 401 S. Main St., Rm E4222, Richmond, VA 23284, USA
|
11
|
Abstract
Background Literature based discovery (LBD) automatically infers missed connections between concepts in literature. It is often assumed that LBD generates more information than can be reasonably examined. Methods We present a detailed analysis of the quantity of hidden knowledge produced by an LBD system and the effect of various filtering approaches upon this. The investigation of filtering combined with single or multi-step linking term chains is carried out on all articles in PubMed. Results The evaluation is carried out using both replication of existing discoveries, which provides justification for multi-step linking chain knowledge in specific cases, and using timeslicing, which gives a large scale measure of performance. Conclusions While the quantity of hidden knowledge generated by LBD can be vast, we demonstrate that (a) intelligent filtering can greatly reduce the number of hidden knowledge pairs generated, (b) for a specific term, the number of single step connections can be manageable, and (c) in the absence of single step hidden links, considering multiple steps can provide valid links.
Affiliation(s)
- Judita Preiss
- Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield, UK.
- Mark Stevenson
- Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield, UK
|
12
|
Abstract
Literature-based discovery systems aim at discovering valuable latent connections between previously disparate research areas. This is achieved by analyzing the contents of their respective literatures with the help of various intelligent computational techniques. In this paper, we review the progress of literature-based discovery research, focusing on understanding the technical features of these systems and evaluating their performance. Present literature-based discovery techniques can be divided into two general approaches: the traditional approach and the emerging approach. The traditional approach, which dominates the current research landscape, consists mainly of techniques that rely on lexical statistics, knowledge-based methods, and visualization methods to address literature-based discovery problems. On the other hand, we have also observed the birth of new trends and unprecedented paradigm shifts among the recently emerging literature-based discovery approaches. These trends are likely to shape the future trajectory of the next generation of literature-based discovery systems.
|
13
|
Kastrin A, Rindflesch TC, Hristovski D. Link Prediction on a Network of Co-occurring MeSH Terms: Towards Literature-based Discovery. Methods Inf Med 2016; 55:340-6. [PMID: 27435341 DOI: 10.3414/me15-01-0108] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 08/17/2015] [Accepted: 05/19/2016] [Indexed: 12/24/2022]
Abstract
OBJECTIVES Literature-based discovery (LBD) is a text mining methodology for automatically generating research hypotheses from existing knowledge. We mimic the process of LBD as a classification problem on a graph of MeSH terms. We employ unsupervised and supervised link prediction methods for predicting previously unknown connections between biomedical concepts. METHODS We evaluate the effectiveness of link prediction through a series of experiments using a MeSH network that contains the history of link formation between biomedical concepts. We performed link prediction using proximity measures, such as common neighbor (CN), Jaccard coefficient (JC), Adamic / Adar index (AA) and preferential attachment (PA). Our approach relies on the assumption that similar nodes are more likely to establish a link in the future. RESULTS Applying an unsupervised approach, the AA measure achieved the best performance in terms of area under the ROC curve (AUC = 0.76), followed by CN, JC, and PA. In a supervised approach, we evaluate whether proximity measures can be combined to define a model of link formation across all four predictors. We applied various classifiers, including decision trees, k-nearest neighbors, logistic regression, multilayer perceptron, naïve Bayes, and random forests. Random forest classifier accomplishes the best performance (AUC = 0.87). CONCLUSIONS The link prediction approach proved to be effective for LBD processing. Supervised statistical learning approaches clearly outperform an unsupervised approach to link prediction.
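The four proximity measures named in this abstract have standard textbook definitions, which can be written out directly. The sketch below assumes an undirected co-occurrence graph stored as adjacency sets; the toy edges and node names are hypothetical and stand in for MeSH terms, and nothing here reproduces the authors' actual network or results.

```python
import math


def build_graph(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def common_neighbors(graph, u, v):
    """CN: number of nodes adjacent to both u and v."""
    return len(graph[u] & graph[v])


def jaccard_coefficient(graph, u, v):
    """JC: shared neighbors divided by all neighbors of either node."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def adamic_adar(graph, u, v):
    """AA: shared neighbors weighted by the inverse log of their degree."""
    return sum(
        1.0 / math.log(len(graph[w]))
        for w in graph[u] & graph[v]
        if len(graph[w]) > 1  # guard against log(1) = 0
    )


def preferential_attachment(graph, u, v):
    """PA: product of the two node degrees."""
    return len(graph[u]) * len(graph[v])


# Hypothetical toy network; A and D are not yet linked.
graph = build_graph([("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("C", "E")])

for name, measure in [
    ("CN", common_neighbors),
    ("JC", jaccard_coefficient),
    ("AA", adamic_adar),
    ("PA", preferential_attachment),
]:
    print(name, measure(graph, "A", "D"))
```

In an unsupervised setting each score ranks candidate pairs on its own; in a supervised setting the four scores would serve as the feature vector for a classifier.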
Affiliation(s)
- Andrej Kastrin
- Andrej Kastrin, PhD, Faculty of Information Studies, Ljubljanska cesta 31A, SI-8000 Novo Mesto, Slovenia
|
14
|
Kajikawa Y, Abe K, Noda S. Filling the gap between researchers studying different materials and different methods: a proposal for structured keywords. J Inf Sci 2016. [DOI: 10.1177/0165551506067125] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Indexed: 11/16/2022]
Abstract
Scientific publications written in natural language still play a central role as our knowledge source. However, due to the flood of publications, obtaining a comprehensive view even on a topic of limited scope, from a stack of publications is becoming an arduous task. Examples are presented from our recent experiences in the materials science field, where information is not shared among researchers studying different materials and different methods. To overcome the limitation, we propose a structured keywords method to reinforce the functionality of a future e-library.
Affiliation(s)
- Yuya Kajikawa
- Institute of Engineering Innovation, School of Engineering, University of Tokyo, Japan
- Koji Abe
- Intelligent Modelling Laboratory, University of Tokyo, Japan
- Suguru Noda
- Department of Chemical System Engineering, Graduate School of Engineering, University of Tokyo, Japan
|
15
|
Abstract
The applicability, validity and usefulness of Karl Popper’s ‘Three Worlds’ epistemology is examined, with specific reference to health-care information and knowledge. It is concluded that Popper’s ideas provide a valuable way of understanding this domain and that, in particular, Popper’s controversial ‘World 3’ of objective knowledge is a valid concept.
|
16
|
Preiss J, Stevenson M, Gaizauskas R. Exploring relation types for literature-based discovery. J Am Med Inform Assoc 2015; 22:987-92. [PMID: 25971437 PMCID: PMC4986660 DOI: 10.1093/jamia/ocv002] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 08/13/2014] [Accepted: 12/26/2014] [Indexed: 11/25/2022] Open
Abstract
Objective Literature-based discovery (LBD) aims to identify “hidden knowledge” in the medical literature by: (1) analyzing documents to identify pairs of explicitly related concepts (terms), then (2) hypothesizing novel relations between pairs of unrelated concepts that are implicitly related via a shared concept to which both are explicitly related. Many LBD approaches use simple techniques to identify semantically weak relations between concepts, for example, document co-occurrence. These generate huge numbers of hypotheses, difficult for humans to assess. More complex techniques rely on linguistic analysis, for example, shallow parsing, to identify semantically stronger relations. Such approaches generate fewer hypotheses, but may miss hidden knowledge. The authors investigate this trade-off in detail, comparing techniques for identifying related concepts to discover which are most suitable for LBD. Materials and methods A generic LBD system that can utilize a range of relation types was developed. Experiments were carried out comparing a number of techniques for identifying relations. Two approaches were used for evaluation: replication of existing discoveries and the “time slicing” approach. Results Previous LBD discoveries could be replicated using relations based either on document co-occurrence or linguistic analysis. Using relations based on linguistic analysis generated many fewer hypotheses, but a significantly greater proportion of them were candidates for hidden knowledge. Discussion and Conclusion The use of linguistic analysis-based relations improves accuracy of LBD without overly damaging coverage. LBD systems often generate huge numbers of hypotheses, which are infeasible to manually review. Improving their accuracy has the potential to make these systems significantly more usable.
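The two-step procedure in the Objective, first collecting explicitly related pairs and then proposing unrelated pairs that share a linking concept, can be sketched with document co-occurrence as the relation type. This is a minimal sketch, not the authors' system: the function names are invented, and the toy documents are hypothetical term sets loosely modeled on the well-known fish oil and Raynaud disease example rather than real extracted data.

```python
from collections import defaultdict


def build_relations(documents):
    """Step 1: treat two terms as explicitly related if they share a document."""
    related = defaultdict(set)
    for terms in documents:
        for term in terms:
            related[term].update(other for other in terms if other != term)
    return related


def open_discovery(related, a):
    """Step 2: propose C terms linked to `a` only through shared B terms."""
    direct = related.get(a, set())
    hypotheses = defaultdict(set)
    for b in direct:
        for c in related.get(b, set()):
            if c != a and c not in direct:
                hypotheses[c].add(b)  # remember which B terms support A-C
    return dict(hypotheses)


# Hypothetical toy corpus: each set holds the concepts found in one document.
documents = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "raynaud disease"},
    {"fish oil", "platelet aggregation"},
    {"platelet aggregation", "raynaud disease"},
    {"fish oil", "triglycerides"},
]

related = build_relations(documents)
hidden = open_discovery(related, "fish oil")
for c_term, linking_terms in hidden.items():
    print(c_term, sorted(linking_terms))
```

Swapping `build_relations` for a stricter relation extractor (for example one based on linguistic analysis) would shrink the `related` map and therefore the number of hypotheses, which is the trade-off the abstract examines.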
Affiliation(s)
- Judita Preiss
- Department of Computer Science, The University of Sheffield 211 Portobello, Sheffield S1 4DP, UK
- Mark Stevenson
- Department of Computer Science, The University of Sheffield 211 Portobello, Sheffield S1 4DP, UK
- Robert Gaizauskas
- Department of Computer Science, The University of Sheffield 211 Portobello, Sheffield S1 4DP, UK
|
17
|
Cameron D, Kavuluru R, Rindflesch TC, Sheth AP, Thirunarayan K, Bodenreider O. Context-driven automatic subgraph creation for literature-based discovery. J Biomed Inform 2015; 54:141-57. [PMID: 25661592 PMCID: PMC4888806 DOI: 10.1016/j.jbi.2015.01.014] [Citation(s) in RCA: 51] [Impact Index Per Article: 5.7] [Received: 09/16/2014] [Revised: 01/21/2015] [Accepted: 01/25/2015] [Indexed: 01/29/2023]
Abstract
BACKGROUND Literature-based discovery (LBD) is characterized by uncovering hidden associations in non-interacting scientific literature. Prior approaches to LBD include use of: (1) domain expertise and structured background knowledge to manually filter and explore the literature, (2) distributional statistics and graph-theoretic measures to rank interesting connections, and (3) heuristics to help eliminate spurious connections. However, manual approaches to LBD are not scalable and purely distributional approaches may not be sufficient to obtain insights into the meaning of poorly understood associations. While several graph-based approaches have the potential to elucidate associations, their effectiveness has not been fully demonstrated. A considerable degree of a priori knowledge, heuristics, and manual filtering is still required. OBJECTIVES In this paper we implement and evaluate a context-driven, automatic subgraph creation method that captures multifaceted complex associations between biomedical concepts to facilitate LBD. Given a pair of concepts, our method automatically generates a ranked list of subgraphs, which provide informative and potentially unknown associations between such concepts. METHODS To generate subgraphs, the set of all MEDLINE articles that contain either of the two specified concepts (A, C) are first collected. Then binary relationships or assertions, which are automatically extracted from the MEDLINE articles, called semantic predications, are used to create a labeled directed predications graph. In this predications graph, a path is represented as a sequence of semantic predications. The hierarchical agglomerative clustering (HAC) algorithm is then applied to cluster paths that are bounded by the two concepts (A, C). HAC relies on implicit semantics captured through Medical Subject Heading (MeSH) descriptors, and explicit semantics from the MeSH hierarchy, for clustering. 
Paths that exceed a threshold of semantic relatedness are clustered into subgraphs based on their shared context. Finally, the automatically generated clusters are provided as a ranked list of subgraphs. RESULTS The subgraphs generated using this approach facilitated the rediscovery of 8 out of 9 existing scientific discoveries. In particular, they directly (or indirectly) led to the recovery of several intermediates (or B-concepts) between A- and C-terms, while also providing insights into the meaning of the associations. Such meaning is derived from predicates between the concepts, as well as the provenance of the semantic predications in MEDLINE. Additionally, by generating subgraphs on different thematic dimensions (such as Cellular Activity, Pharmaceutical Treatment and Tissue Function), the approach may enable a broader understanding of the nature of complex associations between concepts. Finally, in a statistical evaluation to determine the interestingness of the subgraphs, it was observed that an arbitrary association is mentioned in only approximately 4 articles in MEDLINE on average. CONCLUSION These results suggest that leveraging the implicit and explicit semantics provided by manually assigned MeSH descriptors is an effective representation for capturing the underlying context of complex associations, along multiple thematic dimensions in LBD situations.
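The first part of the method, building a labeled directed graph from semantic predications and collecting the paths bounded by the two concepts (A, C), can be sketched as below. This is an assumption-laden illustration: the triples are hypothetical toy predications, not output extracted from MEDLINE, the `find_paths` helper and its `max_len` limit are invented for this example, and the clustering and ranking stages (HAC over MeSH context) are not modeled.

```python
def find_paths(predications, start, goal, max_len=3):
    """List every path from start to goal as a sequence of (subject, predicate, object)."""
    outgoing = {}
    for subj, pred, obj in predications:
        outgoing.setdefault(subj, []).append((subj, pred, obj))

    paths = []

    def walk(node, path, visited):
        if node == goal and path:
            paths.append(list(path))
            return
        if len(path) == max_len:
            return
        for triple in outgoing.get(node, []):
            nxt = triple[2]
            if nxt in visited:  # avoid cycles
                continue
            path.append(triple)
            visited.add(nxt)
            walk(nxt, path, visited)
            path.pop()
            visited.remove(nxt)

    walk(start, [], {start})
    return paths


# Hypothetical semantic predications (subject, predicate, object).
predications = [
    ("fish oil", "INHIBITS", "platelet aggregation"),
    ("platelet aggregation", "CAUSES", "raynaud disease"),
    ("fish oil", "AFFECTS", "blood viscosity"),
    ("blood viscosity", "ASSOCIATED_WITH", "raynaud disease"),
    ("fish oil", "TREATS", "hypertension"),
]

paths = find_paths(predications, "fish oil", "raynaud disease")
for path in paths:
    print(" -> ".join(f"{s} [{p}] {o}" for s, p, o in path))
```

Each returned path keeps its predicates, so the intermediate B-concepts and the nature of each link remain visible to whatever clustering step follows.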
Affiliation(s)
- Delroy Cameron
- Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis), Wright State University, Dayton, OH 45435, USA.
- Ramakanth Kavuluru
- Division of Biomedical Informatics, University of Kentucky, Lexington, KY 40506, USA
- Amit P Sheth
- Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis), Wright State University, Dayton, OH 45435, USA
- Krishnaprasad Thirunarayan
- Ohio Center of Excellence in Knowledge-Enabled Computing (Kno.e.sis), Wright State University, Dayton, OH 45435, USA
|
18
|
Cheng L, Lin H, Zhou F, Yang Z, Wang J. Enhancing the accuracy of knowledge discovery: a supervised learning method. BMC Bioinformatics 2014; 15 Suppl 12:S9. [PMID: 25474584 PMCID: PMC4243114 DOI: 10.1186/1471-2105-15-s12-s9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/18/2022] Open
Abstract
Background The amount of biomedical literature available is growing at an explosive speed, but a large amount of useful information remains undiscovered in it. Researchers can make informed biomedical hypotheses through mining this literature. Unfortunately, popular mining methods based on co-occurrence produce too many target concepts, leading to the declining relevance ranking of the potential target concepts. Methods This paper presents a new method for selecting linking concepts which exploits statistical and textual features to represent each linking concept, and then classifies them as relevant or irrelevant to the starting concepts. Relevant linking concepts are then used to discover target concepts. Results Through an evaluation it is observed textual features improve the results obtained with only statistical features. We successfully replicate Swanson's two classic discoveries and find the rankings of potentially relevant target concepts are relatively high. Conclusions The number of target concepts is greatly reduced and potentially relevant target concepts gain higher ranking by adopting only relevant linking concepts. Thus, the proposed method has the potential to help biomedical experts find the most useful and valuable target concepts effectively.
|
19
|
Kazmer MM, Lustria MLA, Cortese J, Burnett G, Kim JH, Ma J, Frost J. Distributed knowledge in an online patient support community: Authority and discovery. J Assoc Inf Sci Technol 2014. [DOI: 10.1002/asi.23064] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.6] [Indexed: 12/11/2022]
Affiliation(s)
- Michelle M. Kazmer
  - School of Library and Information Studies; Florida State University; 142 Collegiate Loop Tallahassee FL 32306-2100
- Mia Liza A. Lustria
  - School of Library and Information Studies; Florida State University; 142 Collegiate Loop Tallahassee FL 32306-2100
- Juliann Cortese
  - School of Communication; Florida State University; 296 Champions Way Tallahassee FL 32306-2664
- Gary Burnett
  - School of Library and Information Studies; Florida State University; 142 Collegiate Loop Tallahassee FL 32306-2100
- Ji-Hyun Kim
  - School of Library and Information Studies; Florida State University; 142 Collegiate Loop Tallahassee FL 32306-2100
- Jinxuan Ma
  - School of Library and Information Studies; Florida State University; 142 Collegiate Loop Tallahassee FL 32306-2100
- Jeana Frost
  - VU Amsterdam; De Boelelaan 1081 1081 HV Amsterdam
|
20
|
Workman TE, Fiszman M, Rindflesch TC, Nahl D. Framing serendipitous information-seeking behavior for facilitating literature-based discovery: A proposed model. J Assoc Inf Sci Technol 2014. [DOI: 10.1002/asi.22999] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Affiliation(s)
- T. Elizabeth Workman
  - National Library of Medicine; Lister Hill Center; 8600 Rockville Pike, Building 38A, Room B1N30A4 Bethesda MD 20894
- Marcelo Fiszman
  - National Library of Medicine; Lister Hill Center; 8600 Rockville Pike, Building 38A, Room B1N30A4 Bethesda MD 20894
- Thomas C. Rindflesch
  - National Library of Medicine; Lister Hill Center; 8600 Rockville Pike, Building 38A, Room B1N30A4 Bethesda MD 20894
- Diane Nahl
  - University of Hawai'i; Library and Information Science Program; Department of Information and Computer Sciences; Hamilton Library 3C, 2550 McCarthy Mall Honolulu HI 96822
|
21
|
Jenkin TA, Chan YE, Skillicorn DB, Rogers KW. Individual Exploration, Sensemaking, and Innovation: A Design for the Discovery of Novel Information. Decision Sciences 2013. [DOI: 10.1111/deci.12042] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/27/2022]
Affiliation(s)
- Tracy A. Jenkin
  - School of Business; Queen's University; Kingston ON K7L 3N6 Canada
- Yolande E. Chan
  - School of Business; Queen's University; Kingston ON K7L 3N6 Canada
- Keith W. Rogers
  - School of Business; Queen's University; Kingston ON K7L 3N6 Canada
|
22
|
Abstract
This paper proposes entitymetrics to measure the impact of knowledge units. Entitymetrics highlight the importance of entities embedded in scientific literature for further knowledge discovery. In this paper, we use Metformin, a drug for diabetes, as an example to form an entity-entity citation network based on literature related to Metformin. We then calculate the network features and compare the centrality ranks of biological entities with results from Comparative Toxicogenomics Database (CTD). The comparison demonstrates the usefulness of entitymetrics to detect most of the outstanding interactions manually curated in CTD.
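As a rough illustration of ranking entities by centrality in an entity-entity network, the sketch below computes normalized degree centrality in plain Python. The abstract does not say which centrality measures were used, so degree centrality is an assumption, and the edges are invented for illustration rather than taken from the paper or from CTD.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Normalized degree centrality for an undirected graph given as edge pairs."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    # Divide by the maximum possible number of neighbors (n - 1).
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical entity-entity links (illustrative only, not curated data).
edges = [
    ("Metformin", "AMPK"),
    ("Metformin", "Insulin"),
    ("Metformin", "Glucose"),
    ("Insulin", "Glucose"),
]
centrality = degree_centrality(edges)
# Rank by descending centrality, breaking ties alphabetically.
ranking = sorted(centrality, key=lambda e: (-centrality[e], e))
```

A ranking produced this way could then be compared against a curated list of interactions, which is the kind of comparison the abstract describes.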
|
24
|
Cameron D, Bodenreider O, Yalamanchili H, Danh T, Vallabhaneni S, Thirunarayan K, Sheth AP, Rindflesch TC. A graph-based recovery and decomposition of Swanson's hypothesis using semantic predications. J Biomed Inform 2012; 46:238-51. [PMID: 23026233 DOI: 10.1016/j.jbi.2012.09.004] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.2] [Received: 03/22/2012] [Revised: 09/05/2012] [Accepted: 09/08/2012] [Indexed: 11/29/2022]
Abstract
OBJECTIVES This paper presents a methodology for recovering and decomposing Swanson's Raynaud Syndrome-Fish Oil hypothesis semi-automatically. The methodology leverages the semantics of assertions extracted from biomedical literature (called semantic predications) along with structured background knowledge and graph-based algorithms to semi-automatically capture the informative associations originally discovered manually by Swanson. Demonstrating that Swanson's manually intensive techniques can be undertaken semi-automatically, paves the way for fully automatic semantics-based hypothesis generation from scientific literature. METHODS Semantic predications obtained from biomedical literature allow the construction of labeled directed graphs which contain various associations among concepts from the literature. By aggregating such associations into informative subgraphs, some of the relevant details originally articulated by Swanson have been uncovered. However, by leveraging background knowledge to bridge important knowledge gaps in the literature, a methodology for semi-automatically capturing the detailed associations originally explicated in natural language by Swanson, has been developed. RESULTS Our methodology not only recovered the three associations commonly recognized as Swanson's hypothesis, but also decomposed them into an additional 16 detailed associations, formulated as chains of semantic predications. Altogether, 14 out of the 19 associations that can be attributed to Swanson were retrieved using our approach. To the best of our knowledge, such an in-depth recovery and decomposition of Swanson's hypothesis has never been attempted. CONCLUSION In this work therefore, we presented a methodology to semi-automatically recover and decompose Swanson's RS-DFO hypothesis using semantic representations and graph algorithms. Our methodology provides new insights into potential prerequisites for semantics-driven Literature-Based Discovery (LBD). 
Based on our observations, three critical aspects of LBD include: (1) the need for more expressive representations beyond Swanson's ABC model; (2) an ability to accurately extract semantic information from text; and (3) the semantic integration of scientific literature and structured background knowledge.
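The idea of associations "formulated as chains of semantic predications" can be sketched as path enumeration over a labeled directed graph. The code below is a minimal illustration, not the authors' method: the triples and predicate labels are hypothetical examples written by hand, and the depth limit `max_len` is an assumed parameter.

```python
from collections import defaultdict


def find_paths(predications, source, target, max_len=3):
    """Enumerate simple directed paths (chains of predications) from source to target."""
    outgoing = defaultdict(list)
    for subj, pred, obj in predications:
        outgoing[subj].append((pred, obj))

    paths = []

    def dfs(node, visited, chain):
        if node == target:
            paths.append(list(chain))
            return
        if len(chain) == max_len:
            return  # depth limit reached without hitting the target
        for pred, obj in outgoing[node]:
            if obj not in visited:  # keep paths simple (no repeated concepts)
                chain.append((node, pred, obj))
                dfs(obj, visited | {obj}, chain)
                chain.pop()

    dfs(source, {source}, [])
    return paths


# Hypothetical (subject, predicate, object) triples, for illustration only.
predications = [
    ("Fish Oils", "AFFECTS", "Blood Viscosity"),
    ("Blood Viscosity", "ASSOCIATED_WITH", "Raynaud Disease"),
    ("Fish Oils", "INHIBITS", "Platelet Aggregation"),
    ("Platelet Aggregation", "ASSOCIATED_WITH", "Raynaud Disease"),
    ("Fish Oils", "TREATS", "Eczema"),
]
paths = find_paths(predications, "Fish Oils", "Raynaud Disease")
```

Each returned path is a list of triples whose intermediate concepts play the role of B-terms between the A- and C-terms.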
Affiliation(s)
- Delroy Cameron
  - Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis), Wright State University, Dayton, OH 45435, USA
|
25
|
Lekka E, Deftereos SN, Persidis A, Persidis A, Andronis C. Literature analysis for systematic drug repurposing: a case study from Biovista. 2011. [DOI: 10.1016/j.ddstr.2011.06.005] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.3] [Indexed: 11/28/2022]
|
26
|
Bisgin H, Liu Z, Fang H, Xu X, Tong W. Mining FDA drug labels using an unsupervised learning technique--topic modeling. BMC Bioinformatics 2011; 12 Suppl 10:S11. [PMID: 22166012 PMCID: PMC3236833 DOI: 10.1186/1471-2105-12-s10-s11] [Citation(s) in RCA: 60] [Impact Index Per Article: 4.6] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The Food and Drug Administration (FDA) approved drug labels contain a broad array of information, ranging from adverse drug reactions (ADRs) to drug efficacy, risk-benefit consideration, and more. However, the labeling language used to describe these information is free text often containing ambiguous semantic descriptions, which poses a great challenge in retrieving useful information from the labeling text in a consistent and accurate fashion for comparative analysis across drugs. Consequently, this task has largely relied on the manual reading of the full text by experts, which is time consuming and labor intensive. METHOD In this study, a novel text mining method with unsupervised learning in nature, called topic modeling, was applied to the drug labeling with a goal of discovering "topics" that group drugs with similar safety concerns and/or therapeutic uses together. A total of 794 FDA-approved drug labels were used in this study. First, the three labeling sections (i.e., Boxed Warning, Warnings and Precautions, Adverse Reactions) of each drug label were processed by the Medical Dictionary for Regulatory Activities (MedDRA) to convert the free text of each label to the standard ADR terms. Next, the topic modeling approach with latent Dirichlet allocation (LDA) was applied to generate 100 topics, each associated with a set of drugs grouped together based on the probability analysis. Lastly, the efficacy of the topic modeling was evaluated based on known information about the therapeutic uses and safety data of drugs. RESULTS The results demonstrate that drugs grouped by topics are associated with the same safety concerns and/or therapeutic uses with statistical significance (P<0.05). The identified topics have distinct context that can be directly linked to specific adverse events (e.g., liver injury or kidney injury) or therapeutic application (e.g., antiinfectives for systemic use). 
We were also able to identify potential adverse events that might arise from specific medications via topics. CONCLUSIONS The successful application of topic modeling on the FDA drug labeling demonstrates its potential utility as a hypothesis generation means to infer hidden relationships of concepts such as, in this study, drug safety and therapeutic use in the study of biomedical documents.
Affiliation(s)
- Halil Bisgin
  - Department of Information Science, University of Arkansas at Little Rock, 2801 S. University Ave., Little Rock, AR 72204-1099, USA
- Zhichao Liu
  - Center for Bioinformatics, National Center for Toxicological Research, US Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Hong Fang
  - ICF International at FDA's National Center for Toxicological Research, 3900 NCTR Rd, Jefferson, AR 72079, USA
- Xiaowei Xu
  - Department of Information Science, University of Arkansas at Little Rock, 2801 S. University Ave., Little Rock, AR 72204-1099, USA
  - Center for Bioinformatics, National Center for Toxicological Research, US Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
- Weida Tong
  - Center for Bioinformatics, National Center for Toxicological Research, US Food and Drug Administration, 3900 NCTR Road, Jefferson, AR 72079, USA
|
27
|
Kostoff RN, Block JA, Solka JL, Briggs MB, Rushenberg RL, Stump JA, Johnson D, Lyons TJ, Wyatt JR. Literature-related discovery. 2011. [DOI: 10.1002/aris.2009.1440430112] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 01/14/2023]
|
28
|
Deftereos SN, Andronis C, Friedla EJ, Persidis A, Persidis A. Drug repurposing and adverse event prediction using high-throughput literature analysis. Wiley Interdisciplinary Reviews: Systems Biology and Medicine 2011; 3:323-34. [PMID: 21416632 DOI: 10.1002/wsbm.147] [Citation(s) in RCA: 80] [Impact Index Per Article: 6.2] [Indexed: 01/06/2023]
Abstract
Drug repurposing is the process of using existing drugs in indications other than the ones they were originally designed for. It is an area of significant recent activity due to the mounting costs of traditional drug development and scarcity of new chemical entities brought to the market by bio-pharmaceutical companies. By selecting drugs that already satisfy basic toxicity, ADME and related criteria, drug repurposing promises to deliver significant value at reduced cost and in dramatically shorter time frames than is normally the case for the drug development process. The same process that results in drug repurposing can also be used for the prediction of adverse events of known or novel drugs. The analytics method is based on the description of the mechanism of action of a drug, which is then compared to the molecular mechanisms underlying all known adverse events. This review will focus on those approaches to drug repurposing and adverse event prediction that are based on the biomedical literature. Such approaches typically begin with an analysis of the literature and aim to reveal indirect relationships among seemingly unconnected biomedical entities such as genes, signaling pathways, physiological processes, and diseases. Networks of associations of these entities allow the uncovering of the molecular mechanisms underlying a disease, better understanding of the biological effects of a drug and the evaluation of its benefit/risk profile. In silico results can be tested in relevant cellular and animal models and, eventually, in clinical trials.
|
29
|
Ijaz AZ, Song M, Lee D. MKEM: a Multi-level Knowledge Emergence Model for mining undiscovered public knowledge. BMC Bioinformatics 2010; 11 Suppl 2:S3. [PMID: 20406501 PMCID: PMC3165192 DOI: 10.1186/1471-2105-11-s2-s3] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND Since Swanson proposed the Undiscovered Public Knowledge (UPK) model, there have been many approaches to uncover UPK by mining the biomedical literature. These earlier works, however, required substantial manual intervention to reduce the number of possible connections and are mainly applied to disease-effect relation. With the advancement in biomedical science, it has become imperative to extract and combine information from multiple disjoint researches, studies and articles to infer new hypotheses and expand knowledge. METHODS We propose MKEM, a Multi-level Knowledge Emergence Model, to discover implicit relationships using Natural Language Processing techniques such as Link Grammar and Ontologies such as Unified Medical Language System (UMLS) MetaMap. The contribution of MKEM is as follows: First, we propose a flexible knowledge emergence model to extract implicit relationships across different levels such as molecular level for gene and protein and Phenomic level for disease and treatment. Second, we employ MetaMap for tagging biological concepts. Third, we provide an empirical and systematic approach to discover novel relationships. RESULTS We applied our system on 5000 abstracts downloaded from PubMed database. We performed the performance evaluation as a gold standard is not yet available. Our system performed with a good precision and recall and we generated 24 hypotheses. CONCLUSIONS Our experiments show that MKEM is a powerful tool to discover hidden relationships residing in extracted entities that were represented by our Substance-Effect-Process-Disease-Body Part (SEPDB) model.
Affiliation(s)
- Ali Z Ijaz
  - Department of Bio and Brain Engineering, KAIST, South Korea
- Min Song
  - Information Systems, New Jersey Institute of Technology, USA
- Doheon Lee
  - Department of Bio and Brain Engineering, KAIST, South Korea
|
30
|
Cohen T, Schvaneveldt R, Widdows D. Reflective Random Indexing and indirect inference: a scalable method for discovery of implicit connections. J Biomed Inform 2009; 43:240-56. [PMID: 19761870 DOI: 10.1016/j.jbi.2009.09.003] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.3] [Received: 04/24/2009] [Revised: 09/09/2009] [Accepted: 09/09/2009] [Indexed: 10/20/2022]
Abstract
The discovery of implicit connections between terms that do not occur together in any scientific document underlies the model of literature-based knowledge discovery first proposed by Swanson. Corpus-derived statistical models of semantic distance such as Latent Semantic Analysis (LSA) have been evaluated previously as methods for the discovery of such implicit connections. However, LSA in particular is dependent on a computationally demanding method of dimension reduction as a means to obtain meaningful indirect inference, limiting its ability to scale to large text corpora. In this paper, we evaluate the ability of Random Indexing (RI), a scalable distributional model of word associations, to draw meaningful implicit relationships between terms in general and biomedical language. Proponents of this method have achieved comparable performance to LSA on several cognitive tasks while using a simpler and less computationally demanding method of dimension reduction than LSA employs. In this paper, we demonstrate that the original implementation of RI is ineffective at inferring meaningful indirect connections, and evaluate Reflective Random Indexing (RRI), an iterative variant of the method that is better able to perform indirect inference. RRI is shown to lead to more clearly related indirect connections and to outperform existing RI implementations in the prediction of future direct co-occurrence in the MEDLINE corpus.
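The contrast between plain Random Indexing and a reflective cycle can be sketched on a toy corpus. This is a simplified illustration under stated assumptions, not the authors' implementation: documents start from sparse random vectors, the dimension (1000), the number of nonzero entries (10), the seed, and the two-document corpus are all chosen for illustration, and only one reflective cycle is run.

```python
import math
import random


def elemental_vector(dim, nonzero, rng):
    """Sparse ternary random vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0.0] * dim
    for k, pos in enumerate(rng.sample(range(dim), nonzero)):
        vec[pos] = 1.0 if k % 2 == 0 else -1.0
    return vec


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def term_vectors(docs, doc_vecs, dim):
    """Each term vector is the sum of the vectors of the documents containing it."""
    vecs = {}
    for doc_id, terms in docs.items():
        for t in terms:
            vecs[t] = add(vecs.get(t, [0.0] * dim), doc_vecs[doc_id])
    return vecs


def doc_vectors(docs, term_vecs, dim):
    """Each document vector is the sum of the vectors of the terms it contains."""
    out = {}
    for doc_id, terms in docs.items():
        vec = [0.0] * dim
        for t in terms:
            vec = add(vec, term_vecs[t])
        out[doc_id] = vec
    return out


rng = random.Random(0)
dim = 1000
docs = {"d1": ["a", "b"], "d2": ["b", "c"]}  # "a" and "c" never co-occur

# Plain RI: documents get random elemental vectors, terms sum them up.
doc_vecs = {d: elemental_vector(dim, 10, rng) for d in docs}
ri_terms = term_vectors(docs, doc_vecs, dim)

# One reflective cycle: rebuild document vectors from the term vectors,
# then rebuild the term vectors from those document vectors.
rri_terms = term_vectors(docs, doc_vectors(docs, ri_terms, dim), dim)

ri_sim = cosine(ri_terms["a"], ri_terms["c"])
rri_sim = cosine(rri_terms["a"], rri_terms["c"])
```

In this toy setting the terms `a` and `c` are linked only through `b`. Plain RI gives them nearly orthogonal vectors, while the reflective cycle folds the shared term's vector into both, so `rri_sim` should come out clearly higher than `ri_sim`.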
Affiliation(s)
- Trevor Cohen
  - Center for Cognitive Informatics and Decision Making, School of Health Information Sciences, University of Texas, Houston, TX, USA
|
33
|
Hu X, Zhang X, Yoo I, Wang X, Feng J. Mining hidden connections among biomedical concepts from disjoint biomedical literature sets through semantic-based association rule. Int J Intell Syst 2009. [DOI: 10.1002/int.20396] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 11/05/2022]
|
34
|
A new evaluation methodology for literature-based discovery systems. J Biomed Inform 2008; 42:633-43. [PMID: 19124086 DOI: 10.1016/j.jbi.2008.12.001] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.3] [Received: 07/06/2008] [Revised: 12/03/2008] [Accepted: 12/05/2008] [Indexed: 11/21/2022]
Abstract
While medical researchers formulate new hypotheses to test, they need to identify connections to their work from other parts of the medical literature. However, the current volume of information has become a great barrier for this task. Recently, many literature-based discovery (LBD) systems have been developed to help researchers identify new knowledge that bridges gaps across distinct sections of the medical literature. Each LBD system uses different methods for mining the connections from text and ranking the identified connections, but none of the currently available LBD evaluation approaches can be used to compare the effectiveness of these methods. In this paper, we present an evaluation methodology for LBD systems that allows comparisons across different systems. We demonstrate the abilities of our evaluation methodology by using it to compare the performance of different correlation-mining and ranking approaches used by existing LBD systems. This evaluation methodology should help other researchers compare approaches, make informed algorithm choices, and ultimately help to improve the performance of LBD systems overall.
|
35
|
Literature-Based Knowledge Discovery using Natural Language Processing. 2008. [DOI: 10.1007/978-3-540-68690-3_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/22/2023]
|
37
|
Zhou X, Liu B, Wu Z, Feng Y. Integrative mining of traditional Chinese medicine literature and MEDLINE for functional gene networks. Artif Intell Med 2007; 41:87-104. [PMID: 17804209 DOI: 10.1016/j.artmed.2007.07.007] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.0] [Received: 12/01/2006] [Revised: 07/24/2007] [Accepted: 07/24/2007] [Indexed: 01/17/2023]
Abstract
OBJECTIVE The amount of biomedical data in different disciplines is growing at an exponential rate. Integrating these significant knowledge sources to generate novel hypotheses for systems biology research is difficult. Traditional Chinese medicine (TCM) is a completely different discipline, and is a complementary knowledge system to modern biomedical science. This paper uses a significant TCM bibliographic literature database in China, together with MEDLINE, to help discover novel gene functional knowledge. MATERIALS AND METHODS We present an integrative mining approach to uncover the functional gene relationships from MEDLINE and TCM bibliographic literature. This paper introduces TCM literature (about 50,000 records) as one knowledge source for constructing literature-based gene networks. We use the TCM diagnosis, TCM syndrome, to automatically congregate the related genes. The syndrome-gene relationships are discovered based on the syndrome-disease relationships extracted from TCM literature and the disease-gene relationships in MEDLINE. Based on the bubble-bootstrapping and relation weight computing methods, we have developed a prototype system called MeDisco/3S, which has name entity and relation extraction, and online analytical processing (OLAP) capabilities, to perform the integrative mining process. RESULTS We have got about 200,000 syndrome-gene relations, which could help generate syndrome-based gene networks, and help analyze the functional knowledge of genes from syndrome perspective. We take the gene network of Kidney-Yang Deficiency syndrome (KYD syndrome) and the functional analysis of some genes, such as CRH (corticotropin releasing hormone), PTH (parathyroid hormone), PRL (prolactin), BRCA1 (breast cancer 1, early onset) and BRCA2 (breast cancer 2, early onset), to demonstrate the preliminary results. 
The underlying hypothesis is that the related genes of the same syndrome will have some biological functional relationships, and will constitute a functional network. CONCLUSION This paper presents an approach to integrate TCM literature and modern biomedical data to discover novel gene networks and functional knowledge of genes. The preliminary results show that the novel gene functional knowledge and gene networks, which are worthy of further investigation, could be generated by integrating the two complementary biomedical data sources. It will be a promising research field through integrative mining of TCM and modern life science literature.
Affiliation(s)
- Xuezhong Zhou
  - China Academy of Chinese Medical Sciences, Beijing 100700, China
|
38
|
Tulipano PK, Tao Y, Millar WS, Zanzonico P, Kolbert K, Xu H, Yu H, Chen L, Lussier YA, Friedman C. Natural language processing and visualization in the molecular imaging domain. J Biomed Inform 2006; 40:270-81. [PMID: 17084109 DOI: 10.1016/j.jbi.2006.08.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 12/14/2005] [Revised: 08/25/2006] [Accepted: 08/29/2006] [Indexed: 11/16/2022]
Abstract
Molecular imaging is at the crossroads of genomic sciences and medical imaging. Information within the molecular imaging literature could be used to link to genomic and imaging information resources and to organize and index images in a way that is potentially useful to researchers. A number of natural language processing (NLP) systems are available to automatically extract information from genomic literature. One existing NLP system, known as BioMedLEE, automatically extracts biological information consisting of biomolecular substances and phenotypic data. This paper focuses on the adaptation, evaluation, and application of BioMedLEE to the molecular imaging domain. In order to adapt BioMedLEE for this domain, we extend an existing molecular imaging terminology and incorporate it into BioMedLEE. BioMedLEE's performance is assessed with a formal evaluation study. The system's performance, measured as recall and precision, is 0.74 (95% CI: [.70-.76]) and 0.70 (95% CI [.63-.76]), respectively. We adapt a JAVA viewer known as PGviewer for the simultaneous visualization of images with NLP extracted information.
Affiliation(s)
- P Karina Tulipano
  - Department of Biomedical Informatics, Columbia University, 622 West 168th Street, Vanderbilt Clinic Floor 5, NY 10032, USA
|
39
|
Bekhuis T. Conceptual biology, hypothesis discovery, and text mining: Swanson's legacy. Biomedical Digital Libraries 2006; 3:2. [PMID: 16584552 PMCID: PMC1459187 DOI: 10.1186/1742-5581-3-2] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.3] [Received: 09/26/2005] [Accepted: 04/03/2006] [Indexed: 11/10/2022]
Abstract
Innovative biomedical librarians and information specialists who want to expand their roles as expert searchers need to know about profound changes in biology and parallel trends in text mining. In recent years, conceptual biology has emerged as a complement to empirical biology. This is partly in response to the availability of massive digital resources such as the network of databases for molecular biologists at the National Center for Biotechnology Information. Developments in text mining and hypothesis discovery systems based on the early work of Swanson, a mathematician and information scientist, are coincident with the emergence of conceptual biology. Very little has been written to introduce biomedical digital librarians to these new trends. In this paper, background for data and text mining, as well as for knowledge discovery in databases (KDD) and in text (KDT) is presented, then a brief review of Swanson's ideas, followed by a discussion of recent approaches to hypothesis discovery and testing. 'Testing' in the context of text mining involves partially automated methods for finding evidence in the literature to support hypothetical relationships. Concluding remarks follow regarding (a) the limits of current strategies for evaluation of hypothesis discovery systems and (b) the role of literature-based discovery in concert with empirical research. Report of an informatics-driven literature review for biomarkers of systemic lupus erythematosus is mentioned. Swanson's vision of the hidden value in the literature of science and, by extension, in biomedical digital databases, is still remarkably generative for information scientists, biologists, and physicians.
Affiliation(s)
- Tanja Bekhuis
  - Department of Library & Information Science, School of Information Sciences, University of Pittsburgh, 135 North Bellefield Avenue, Pittsburgh, PA 15260, USA
|
40
|
Swanson DR. Atrial fibrillation in athletes: Implicit literature-based connections suggest that overtraining and subsequent inflammation may be a contributory mechanism. Med Hypotheses 2006; 66:1085-92. [PMID: 16504414 DOI: 10.1016/j.mehy.2006.01.006] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.1] [Received: 01/10/2006] [Accepted: 01/10/2006] [Indexed: 11/21/2022]
Abstract
Research on atrial fibrillation (AF), a common heart arrhythmia in the elderly, over many decades has resulted in a literature of more than 16,000 articles indexed in Medline. An exploratory Medline search was conducted in which the subheadings for epidemiology and etiology of AF were combined to form a small subset of the initial records. Further computer-assisted selection led to a few articles that reported an unexpectedly high prevalence of AF in groups of otherwise healthy middle-aged endurance runners and other athletes. Why athletes should be unusually susceptible to AF is mysterious and puzzling. Because relatively few articles are about both AF and endurance exercise, a computer was used first to create a list of important terms that these two separate literatures had in common. Several inflammation-related terms, including C-reactive protein (CRP) and interleukin-6, were on that list. Further searching and literature analysis revealed that excessive endurance exercise or overtraining can lead to chronic systemic inflammation and, separately, that there is a solid association between CRP and AF and that anti-inflammatory agents have been reported to lower CRP and ameliorate AF. No articles were found that brought together all three concepts - AF, inflammation, and exercise. The following hypothesis is plausible, readily testable, and apparently novel: Older athletes diagnosed with AF but otherwise healthy who have engaged in rigorous aerobic endurance exercise for more than a decade will have CRP levels that are higher than those of a similar population of athletes without AF. Corroboration of this hypothesis would then justify a prospective clinical trial of anti-inflammation therapy. It is of particular interest to extend recent studies of inflammation in AF to athletes; athletic behavior that can induce inflammation may contribute to understanding the origins of AF.
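The step of listing "important terms that these two separate literatures had in common" can be sketched as a set intersection over two collections of records. This is a minimal illustration only: the record texts and stopword list are invented, tokenization is a plain whitespace split, and ranking by the smaller document frequency is an assumed heuristic rather than the procedure used in the paper.

```python
from collections import Counter


def shared_terms(lit_a, lit_c, stopwords=frozenset()):
    """Terms appearing in both literatures, ranked by the smaller document frequency."""
    def doc_freq(records):
        counts = Counter()
        for text in records:
            # Count each term once per record (document frequency).
            counts.update(set(text.lower().split()) - stopwords)
        return counts

    freq_a, freq_c = doc_freq(lit_a), doc_freq(lit_c)
    common = freq_a.keys() & freq_c.keys()
    return sorted(common, key=lambda t: (-min(freq_a[t], freq_c[t]), t))


# Hypothetical record titles for the two literatures (illustrative only).
lit_af = [
    "crp elevated in atrial fibrillation",
    "inflammation markers and atrial fibrillation",
]
lit_exercise = [
    "overtraining raises crp",
    "endurance exercise and systemic inflammation",
]
linking = shared_terms(lit_af, lit_exercise, stopwords={"in", "and"})
```

The shared terms act as candidate linking concepts between the two literatures; a person would still need to read the underlying articles to judge whether a link is meaningful.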
Affiliation(s)
- Don R Swanson
  - Division of the Humanities, The University of Chicago, IL 60637, USA
|
42
|
Hristovski D, Peterlin B, Mitchell JA, Humphrey SM. Using literature-based discovery to identify disease candidate genes. Int J Med Inform 2005; 74:289-98. [PMID: 15694635 DOI: 10.1016/j.ijmedinf.2004.04.024] [Citation(s) in RCA: 169] [Impact Index Per Article: 8.9] [Received: 11/25/2003] [Revised: 04/06/2004] [Accepted: 04/20/2004] [Indexed: 11/27/2022]
Abstract
We present BITOLA, an interactive literature-based biomedical discovery support system. The goal of this system is to discover new, potentially meaningful relations between a given starting concept of interest and other concepts, by mining the bibliographic database MEDLINE. To make the system more suitable for disease candidate gene discovery and to decrease the number of candidate relations, we integrate background knowledge about the chromosomal location of the starting disease as well as the chromosomal location of the candidate genes from resources such as LocusLink and Human Genome Organization (HUGO). BITOLA can also be used as an alternative way of searching the MEDLINE database. The system is available at http://www.mf.uni-lj.si/bitola/.
Affiliation(s)
- Dimitar Hristovski
- Institute of Biomedical Informatics, Faculty of Medicine, University of Ljubljana, Vrazov trg 2/2 1104 Ljubljana, Slovenia.
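The idea in the abstract above, relating a starting concept to new concepts through the literature and then narrowing the candidate genes by chromosomal location, might be sketched as below. This is a hedged illustration rather than BITOLA's implementation: the function name, the co-occurrence dictionary, the gene names, the region labels, and the ranking by number of intermediates are all hypothetical.

```python
from collections import defaultdict

def candidate_genes(start, cooccurs, gene_location, disease_region):
    """Open discovery sketch: start concept X -> intermediate Y -> candidate gene Z.

    Keeps only genes Z that (a) are not already directly linked to X and
    (b) are annotated with the chromosomal region associated with the disease.
    """
    direct = cooccurs.get(start, set())
    support = defaultdict(set)  # gene -> set of intermediates linking it to X
    for y in direct:
        for z in cooccurs.get(y, set()):
            if z == start or z in direct:
                continue  # relation already known, so not a new candidate
            if gene_location.get(z) == disease_region:
                support[z].add(y)
    # Illustrative ranking only: more distinct intermediates first, then by name.
    return sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))

# Toy data, entirely made up for this sketch.
cooccurs = {
    "disease_X": {"pathway_A", "symptom_B"},
    "pathway_A": {"GENE1", "GENE2"},
    "symptom_B": {"GENE1", "GENE3"},
}
gene_location = {"GENE1": "7q31", "GENE2": "7q31", "GENE3": "2p13"}

for gene, intermediates in candidate_genes("disease_X", cooccurs, gene_location, "7q31"):
    print(gene, sorted(intermediates))
```

The location filter is what shrinks the candidate list here: GENE3 is reachable through the literature but is dropped because its region does not match the one assumed for the disease.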
|
43
|
MacMullen WJ, Denn SO. Information problems in molecular biology and bioinformatics. J Am Soc Inf Sci Technol 2005. [DOI: 10.1002/asi.20134] [Citation(s) in RCA: 32] [Impact Index Per Article: 1.7] [Indexed: 11/11/2022]
|
44
|
Kostoff RN, Block JA, Stump JA, Pfeil KM. Information content in Medline record fields. Int J Med Inform 2004; 73:515-27. [PMID: 15171980 DOI: 10.1016/j.ijmedinf.2004.02.008] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.1] [Received: 09/03/2003] [Revised: 11/07/2003] [Accepted: 02/27/2004] [Indexed: 11/30/2022]
Abstract
BACKGROUND The authors have been conducting text mining analyses (extraction of useful information from text) of Medline records, using Abstracts as the main data source. For literature-based discovery, and other text mining applications as well, all records in a discipline need to be evaluated for determining prior art. Many Medline records do not contain Abstracts, but typically contain Titles and Mesh terms. Substitution of these fields for Abstracts in the non-Abstract records would restore the missing literature to some degree. OBJECTIVES Determine how well the information content of Title and Mesh fields approximates that of Abstracts in Medline records. APPROACH Select historical Medline records related to Raynaud's Phenomenon that contain Abstracts. Determine the information content in the Abstract fields through text mining. Then, determine the information content in the Title fields, the Mesh fields, and the combined Title-Mesh fields, and compare with the information content in the Abstracts. RESULTS Four metrics were used to compare the information content related to Raynaud's Phenomenon in the different fields: total number of phrases; number of unique phrases; content of factors from factor analyses; content of clusters from multi-link clustering. The Abstract field contains almost an order of magnitude more phrases than the other fields, and slightly more than an order of magnitude more unique phrases than the other fields. Each field used a factor matrix with 14 factors, and the combination of all 56 factors for the four fields represented 27 separate, but not unique, themes. These themes could be placed in two major categories, with two sub-categories per major category: Auto-immunity (antibodies, inflammation) and circulation (peripheral vessel circulation, coronary vessel circulation). All four sub-categories included representation from each field. 
Thus, while the focus of the representation of each field in each sub-category was moderately different, the four sub-category structure could be identified by analyzing the total factors in each field. In the cluster comparison phase of the study, the phrases used to create the clusters were the most important phrases identified for each factor. Thus, the factor matrix served as a filter for words used for clustering. While clusters were generated for all four fields, the Title hierarchy tended to be fragmented due to sparsity of the co-occurrence matrix that underlies the clusters. Therefore, the Title clusters were examined at only the lower levels of aggregation. The Abstract, Mesh, and Mesh + Title fields had the same first level taxonomy categories, auto-immunity and circulation. At the second level, the Abstract, Mesh, and Mesh + Title fields had the autoimmune diseases and antibodies sub-category in common. The Abstract and Mesh fields shared fascia inflammation as the other auto-immunity sub-category, while the other Mesh + Title sub-category focuses on vinyl chloride poisoning from industrial contact, and consequences of antineoplastic agents. However, in both cases, even though the words may be different, inflammation may be the common theme. CONCLUSIONS For taxonomy generation, especially at the higher levels, each of the four fields has a similar thematic structure. At very detailed levels, the Mesh and Title fields run out of phrases relative to the Abstract field. Therefore, selection of field(s) to be employed for taxonomy generation depends on the objectives of the study, particularly the level of categorization required for the taxonomy. For information retrieval, or literature-based discovery, selection of the appropriate field again depends on the study objectives. If large queries, or large numbers of concepts or themes are desired, then the field with the largest number of technical phrases would be desirable. 
If queries or concepts represented by the more accepted popular terminology are adequate, then the smaller fields may be sufficient. Because of its established and controlled vocabulary, the Mesh field lags the Title or Abstract fields in currency. Thus, the Title or Abstract fields would retrieve records with the most explicitly stated current concepts, but the Mesh field would capture a larger swath of fields that contained a concept of interest but perhaps had a wider range of specific terminology in the Abstract or Title text. In addition, this study provides the first validated estimate of the disparity in information retrieved through text mining limited to Titles and Mesh terms relative to entire Abstracts. As much of the older biomedical literature was entered into electronic databases without associated Abstracts, literature-based discovery exercises that search the older medical literature may miss a substantial proportion of relevant information. On the basis of this study, it may be estimated that up to a log order more information may be retrieved when complete Abstracts are searched.
|
45
|
|
46
|
|
47
|
|
48
|
Viator JA, Pestorius FM. Investigating trends in acoustics research from 1970-1999. J Acoust Soc Am 2001; 109:1779-1783. [PMID: 11386532 DOI: 10.1121/1.1366711] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 05/23/2023]
Abstract
Text data mining is a burgeoning field in which new information is extracted from existing text databases. Computational methods are used to compare relationships between database elements to yield new information about the existing data. Text data mining software was used to determine research trends in acoustics for the years 1970, 1980, 1990, and 1999. Trends were indicated by the number of published articles in the categories of acoustics using the Journal of the Acoustical Society of America (JASA) as the article source. Research was classified using a method based on the Physics and Astronomy Classification Scheme (PACS). Research was further subdivided into world regions, including North and South America, Eastern and Western Europe, Asia, Africa, Middle East, and Australia/New Zealand. In order to gauge the use of JASA as an indicator of international acoustics research, three subjects, underwater sound, nonlinear acoustics, and bioacoustics, were further tracked in 1999, using all journals in the INSPEC database. Research trends indicated a shift in emphasis of certain areas, notably underwater sound, audition, and speech. JASA also showed steady growth, with increasing participation by non-US authors, from about 20% in 1970 to nearly 50% in 1999.
Affiliation(s)
- J A Viator
- Beckman Laser Institute and Medical Clinic, University of California, Irvine 92612, USA
|
49
|
|
50
|
Weeber M, Klein H, de Jong-van den Berg LT, Vos R. Using concepts in literature-based discovery: Simulating Swanson's Raynaud-fish oil and migraine-magnesium discoveries. J Am Soc Inf Sci Technol 2001. [DOI: 10.1002/asi.1104] [Citation(s) in RCA: 143] [Impact Index Per Article: 6.2] [Indexed: 11/11/2022]
|