1. Biomedical Text Classification Using Augmented Word Representation Based on Distributional and Relational Contexts. Computational Intelligence and Neuroscience 2023. [DOI: 10.1155/2023/2989791] [Indexed: 02/17/2023]
Abstract
Due to the increasing use of information technologies by biomedical experts, researchers, public health agencies, and healthcare professionals, a rapidly growing volume of scientific literature, clinical notes, and other structured and unstructured text resources is being stored in data sources such as PubMed. These massive text resources can be leveraged to extract valuable knowledge and insights using machine learning techniques. Neural network-based classification models, which have recently gained popularity, take numeric vectors (aka word representations) of the training data as input; the better these input vectors, the more accurate the resulting classification. Word representations are learned as the distribution of words in an embedding space, wherein each word has its own vector and words that are semantically similar in context appear near each other. However, such distributional word representations cannot encapsulate relational semantics between distant words. In the biomedical domain, relation mining is a well-studied problem that aims to extract relational words linking distant entities, which generally represent the subject and object of a sentence. Our goal is to capture this relational semantic information between distant words in a large corpus, learn enhanced word representations, and employ them for natural language processing tasks such as text classification. In this article, we propose an application of biomedical relation triplets to learn word representations by incorporating relational semantic information into the distributional representation of words. In other words, the proposed approach captures both the distributional and the relational contexts of words to learn their numeric vectors from a text corpus. We also propose an application of the learned word representations to text classification.
The proposed approach is evaluated on multiple benchmark datasets, and the efficacy of the learned word representations is tested on word similarity and concept categorization tasks, where it outperforms the state-of-the-art GloVe model. Furthermore, we apply the learned word representations to classify biomedical texts using four neural network-based classification models; the classification accuracy further confirms the effectiveness of the word representations learned by our proposed approach.
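The core idea of this entry, augmenting distributional co-occurrence with relational context from relation triplets, can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the corpus, the triplet, and the weighting scheme are all made up for the example.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Distributional context: count word pairs inside a sliding window."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(w, tokens[j])] += 1.0
    return counts

def add_relational_context(counts, triplets, weight=1.0):
    """Relational context: boost counts for (subject, object) pairs that a
    relation-mining step has linked, even when the words are distant."""
    for subj, _rel, obj in triplets:
        counts[(subj, obj)] += weight
        counts[(obj, subj)] += weight
    return counts

sentences = [["aspirin", "reduces", "risk", "of", "stroke"]]
# Hypothetical output of a biomedical relation-mining step:
triplets = [("aspirin", "reduces", "stroke")]

counts = add_relational_context(cooccurrence_counts(sentences), triplets)
# "aspirin" and "stroke" never co-occur within the window, yet the
# triplet gives the pair a nonzero count before the vectors are fit.
```

A GloVe-style factorization of the augmented count matrix would then place relationally linked words nearer each other than pure window statistics would.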
3. Sänger M, Leser U. Large-scale entity representation learning for biomedical relationship extraction. Bioinformatics 2021; 37:236-242. [PMID: 32726411] [DOI: 10.1093/bioinformatics/btaa674] [Received: 03/03/2020] [Revised: 07/14/2020] [Accepted: 07/21/2020] [Indexed: 11/13/2022]
Abstract
MOTIVATION The automatic extraction of published relationships between molecular entities has important applications in many biomedical fields, ranging from systems biology to personalized medicine. Existing work has focused on extracting relationships described in single articles or single sentences. However, a single record is rarely sufficient to judge the biological correctness of a relation, as experimental evidence may be weak or valid only in a certain context. Furthermore, statements may be more speculative than confirmative, and different articles often contradict each other. Experts therefore always take the complete literature into account to reach a reliable decision about a relationship; how to do this effectively in an automatic manner is an open research question. RESULTS We propose two novel relation extraction approaches that use recent representation learning techniques to create comprehensive models of biomedical entities or entity pairs, respectively. These representations are learned by considering all publications in PubMed mentioning an entity or a pair. They are used as input to a neural network that classifies relations globally, i.e., the derived predictions are corpus-based, not sentence- or article-based as in prior art. Experiments on the extraction of mutation-disease, drug-disease, and drug-drug relationships show that the learned embeddings indeed capture semantic information about the entities under study and outperform traditional methods by 4-29% in F1 score. AVAILABILITY AND IMPLEMENTATION Source code is available at: https://github.com/mariosaenger/bio-re-with-entity-embeddings. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
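The corpus-level idea can be sketched minimally as follows, assuming mention-context vectors have already been computed for every sentence in which an entity appears (the vectors below are random stand-ins, and the mean pooling and pair features are simplifications, not the paper's actual models).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mention-context vectors: one per sentence in which the
# entity appears anywhere in the corpus (random stand-ins here).
mention_vectors = {
    "BRAF": rng.normal(size=(120, 16)),      # 120 mentions, 16-dim contexts
    "melanoma": rng.normal(size=(300, 16)),  # 300 mentions
}

def entity_embedding(name):
    """Corpus-level representation: pool over ALL mentions of the
    entity, not just those in a single sentence or article."""
    return mention_vectors[name].mean(axis=0)

def pair_features(e1, e2):
    """Feature vector a downstream relation classifier would receive
    for the candidate pair (e1, e2)."""
    v1, v2 = entity_embedding(e1), entity_embedding(e2)
    return np.concatenate([v1, v2, v1 * v2])

x = pair_features("BRAF", "melanoma")  # shape (48,)
```

The key design choice mirrored here is that the classifier's prediction depends on evidence aggregated across the whole corpus, so no single weak or contradictory sentence decides the relation.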
Affiliation(s)
- Mario Sänger
- Computer Science Department, Knowledge Management in Bioinformatics, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Ulf Leser
- Computer Science Department, Knowledge Management in Bioinformatics, Humboldt-Universität zu Berlin, Berlin 10099, Germany
4. Jiang S, Wu W, Tomita N, Ganoe C, Hassanpour S. Multi-Ontology Refined Embeddings (MORE): A hybrid multi-ontology and corpus-based semantic representation model for biomedical concepts. J Biomed Inform 2020; 111:103581. [PMID: 33010425] [DOI: 10.1016/j.jbi.2020.103581] [Received: 04/14/2020] [Revised: 09/22/2020] [Accepted: 09/26/2020] [Indexed: 11/25/2022]
Abstract
OBJECTIVE Currently, a major limitation for natural language processing (NLP) analyses in clinical applications is that concepts are not effectively referenced in various forms across different texts. This paper introduces Multi-Ontology Refined Embeddings (MORE), a novel hybrid framework that incorporates domain knowledge from multiple ontologies into a distributional semantic model, learned from a corpus of clinical text. MATERIALS AND METHODS We use the RadCore and MIMIC-III free-text datasets for the corpus-based component of MORE. For the ontology-based part, we use the Medical Subject Headings (MeSH) ontology and three state-of-the-art ontology-based similarity measures. In our approach, we propose a new learning objective, modified from the sigmoid cross-entropy objective function. RESULTS AND DISCUSSION We used two established datasets of semantic similarities among biomedical concept pairs to evaluate the quality of the generated word embeddings. On the first dataset with 29 concept pairs, with similarity scores established by physicians and medical coders, MORE's similarity scores have the highest combined correlation (0.633), which is 5.0% higher than that of the baseline model, and 12.4% higher than that of the best ontology-based similarity measure. On the second dataset with 449 concept pairs, MORE's similarity scores have a correlation of 0.481, based on the average of four medical residents' similarity ratings, and that outperforms the skip-gram model by 8.1%, and the best ontology measure by 6.9%. Furthermore, MORE outperforms three pre-trained transformer-based word embedding models (i.e., BERT, ClinicalBERT, and BioBERT) on both datasets. CONCLUSION MORE incorporates knowledge from several biomedical ontologies into an existing corpus-based distributional semantics model, improving both the accuracy of the learned word embeddings and the extensibility of the model to a broader range of biomedical concepts. 
MORE allows for more accurate clustering of concepts across a wide range of applications, such as analyzing patient health records to identify subjects with similar pathologies, or integrating heterogeneous clinical data to improve interoperability between hospitals.
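The abstract describes a learning objective modified from sigmoid cross-entropy. One plausible reading, shown here only as a sketch, is to replace the binary co-occurrence label with a soft target drawn from an ontology-based similarity measure; MORE's actual objective differs in its details.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_sigmoid_xent(score, target):
    """Sigmoid cross-entropy against a soft target in [0, 1].

    With target in {0, 1} this is the ordinary objective; using an
    ontology-based similarity (e.g. 0.8) as the target is one way
    domain knowledge can be folded into a distributional model.
    """
    p = sigmoid(score)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# The gradient with respect to the score is p - target, so a soft
# target of 0.5 is minimized at score 0 (p = 0.5), where the loss
# equals the entropy ln 2:
loss = soft_sigmoid_xent(0.0, 0.5)
```

A soft target pulls the model's score toward a high but unsaturated value for ontology-similar pairs, rather than forcing the all-or-nothing fit of a hard label.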
Affiliation(s)
- Steven Jiang
- Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA
- Weiyi Wu
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Naofumi Tomita
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Craig Ganoe
- Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Saeed Hassanpour
- Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA; Department of Biomedical Data Science, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA; Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA.
5. Arguello-Casteleiro M, Stevens R, Des-Diz J, Wroe C, Fernandez-Prieto MJ, Maroto N, Maseda-Fernandez D, Demetriou G, Peters S, Noble PJM, Jones PH, Dukes-McEwan J, Radford AD, Keane J, Nenadic G. Exploring semantic deep learning for building reliable and reusable one health knowledge from PubMed systematic reviews and veterinary clinical notes. J Biomed Semantics 2019; 10:22. [PMID: 31711540] [PMCID: PMC6849172] [DOI: 10.1186/s13326-019-0212-6] [Indexed: 01/14/2023]
Abstract
Background Deep learning opens up opportunities for routinely scanning large bodies of biomedical literature and clinical narratives to represent the meaning of biomedical and clinical terms. However, validating and integrating this knowledge at scale requires cross-checking against ground truths (i.e., evidence-based resources) that are unavailable in an actionable or computable form. In this paper we explore how to turn information about diagnoses, prognoses, therapies, and other clinical concepts into computable knowledge using free-text data about human and animal health. We used a Semantic Deep Learning approach that combines Semantic Web technologies and deep learning to acquire and validate knowledge about 11 well-known medical conditions mined from two sets of unstructured free-text data: 300 K PubMed systematic review articles (the PMSB dataset) and 2.5 M veterinary clinical notes (the VetCN dataset). For each target condition we obtained 20 related clinical concepts using two deep learning methods applied separately to the two datasets, resulting in 880 term pairs (target term, candidate term). Each concept, represented by an n-gram, is mapped to UMLS using MetaMap; we also developed a bespoke method for mapping short forms (e.g., abbreviations and acronyms). Existing ontologies were used to formally represent the associations, and we created ontological modules to illustrate how the extracted knowledge can be queried. The evaluation was performed using the content of BMJ Best Practice. Results MetaMap achieves an F-measure of 88% (precision 85%, recall 91%) when applied directly to the 613 unique candidate terms in the 880 term pairs. When the processing of short forms is included, MetaMap achieves an F-measure of 94% (precision 92%, recall 96%). Validation of the term pairs against BMJ Best Practice yields a precision between 98% and 99%.
Conclusions The Semantic Deep Learning approach can transform neural embeddings built from unstructured free-text data into reliable and reusable One Health knowledge, using ontologies and content from BMJ Best Practice.
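The reported F-measures follow directly from the stated precision and recall via the harmonic mean; a quick check:

```python
def f_measure(precision, recall):
    """F1: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The two MetaMap configurations reported above:
print(round(f_measure(0.85, 0.91), 2))  # 0.88  (direct mapping)
print(round(f_measure(0.92, 0.96), 2))  # 0.94  (with short-form handling)
```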
Affiliation(s)
- Robert Stevens
- School of Computer Science, University of Manchester, Manchester, UK
- Julio Des-Diz
- Hospital do Salnés, Villagarcía de Arousa, Pontevedra, Spain
- Nava Maroto
- Departamento de Lingüística Aplicada a la Ciencia y a la Tecnología, Universidad Politécnica de Madrid, Madrid, Spain
- Diego Maseda-Fernandez
- Midcheshire Hospital Foundation Trust, NHS England, Crewe, UK; School of Medical Sciences, University of Manchester, Manchester, UK
- George Demetriou
- School of Computer Science, University of Manchester, Manchester, UK
- Simon Peters
- School of Social Sciences, University of Manchester, Manchester, UK
- Peter-John M Noble
- Small Animal Veterinary Surveillance Network, University of Liverpool, Liverpool, UK
- Phil H Jones
- Small Animal Veterinary Surveillance Network, University of Liverpool, Liverpool, UK
- Jo Dukes-McEwan
- Small Animal Teaching Hospital, University of Liverpool, Liverpool, UK
- Alan D Radford
- Small Animal Veterinary Surveillance Network, University of Liverpool, Liverpool, UK
- John Keane
- School of Computer Science, University of Manchester, Manchester, UK; Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
- Goran Nenadic
- School of Computer Science, University of Manchester, Manchester, UK; Manchester Institute of Biotechnology, University of Manchester, Manchester, UK; Health eResearch Centre, University of Manchester, Manchester, UK
6. Grabar N, Grouin C. A Year of Papers Using Biomedical Texts: Findings from the Section on Natural Language Processing of the IMIA Yearbook. Yearb Med Inform 2019; 28:218-222. [PMID: 31419835] [PMCID: PMC6697498] [DOI: 10.1055/s-0039-1677937] [Indexed: 01/09/2023]
Abstract
Objectives: To analyze the content of publications in the medical natural language processing (NLP) domain in 2018.
Methods: Automatic and manual pre-selection of publications to be reviewed, selection of the best NLP papers of the year, and analysis of the important issues.
Results: Two best papers were selected this year: one dedicated to the generation of multi-document summaries and the other to the generation of imaging reports. We also propose an analysis of the main research trends in NLP publications in 2018.
Conclusions: The year 2018 was very rich in the NLP issues and topics addressed. It shows the will of researchers to move towards robust and reproducible results, and researchers also prove creative in tackling original issues and approaches.
Affiliation(s)
- Natalia Grabar
- LIMSI, CNRS, Université Paris-Saclay, Orsay, France; STL, CNRS, Université de Lille, Villeneuve-d'Ascq, France
- Cyril Grouin
- LIMSI, CNRS, Université Paris-Saclay, Orsay, France