1. Nezamuldeen L, Jafri MS. Protein-Protein Interaction Network Extraction Using Text Mining Methods Adds Insight into Autism Spectrum Disorder. Biology 2023; 12:1344. [PMID: 37887054 PMCID: PMC10604135 DOI: 10.3390/biology12101344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2023] [Revised: 10/02/2023] [Accepted: 10/12/2023] [Indexed: 10/28/2023]
Abstract
Text mining methods are being developed to assimilate the continually expanding volume of biomedical text. Understanding protein-protein interaction (PPI) deficits would help explain the genesis of diseases. In this study, we designed an automated system to extract PPIs from the biomedical literature that uses a deep learning sentence classification model, a pretrained word embedding, a BiLSTM recurrent neural network with additional layers, a conditional random field (CRF) named entity recognition (NER) model, and a shortest-dependency path (SDP) model using the spaCy library in Python. The automated system targets sentences that contain PPIs rather than sentences that merely mention these proteins in the context of disease discovery or other topics. Our first model achieved 13% greater precision on the AIMed/BioInfer benchmark corpus than the previous state-of-the-art BiLSTM neural network models. The NER model presented in this study achieved 98% precision on the AIMed/BioInfer corpus, exceeding previous models. To facilitate an accurate representation of the PPI network, the processes were developed to systematically map the protein interactions in the texts. Overall, evaluating our system on 6027 abstracts pertaining to seven proteins associated with Autism Spectrum Disorder completed the manually curated PPI network for these proteins. For complicated diseases, these networks would assist in understanding how PPI deficits contribute to disease development while also emphasizing the influence of interactions on protein function and biological processes.
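The shortest-dependency path idea mentioned in the abstract can be illustrated without a parser. The sketch below is not the authors' implementation: it assumes a hand-written list of (head, child) dependency edges for a made-up sentence with placeholder protein names `PROT_A` and `PROT_B`, and finds the path between them with a breadth-first search. In a real pipeline the edges would come from a dependency parser such as spaCy.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the token path between two tokens in an undirected dependency graph."""
    graph = {}
    for head, child in edges:
        graph.setdefault(head, set()).add(child)
        graph.setdefault(child, set()).add(head)
    if source not in graph or target not in graph:
        return None
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        # Sort neighbours so the search order is deterministic.
        for neighbour in sorted(graph[node]):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Hand-written (head, child) edges for the hypothetical sentence
# "PROT_A directly binds to the scaffolding protein PROT_B".
# These are an assumption for illustration, not parser output.
EDGES = [
    ("binds", "PROT_A"),
    ("binds", "directly"),
    ("binds", "to"),
    ("to", "PROT_B"),
    ("PROT_B", "the"),
    ("PROT_B", "scaffolding"),
    ("PROT_B", "protein"),
]

sdp = shortest_dependency_path(EDGES, "PROT_A", "PROT_B")
print(sdp)
```

The path keeps the interaction verb and drops modifiers such as "directly" and "scaffolding", which is why SDP features are commonly used to focus a relation classifier on the words linking two entities.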
Affiliation(s)
- Leena Nezamuldeen
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA
- King Fahd Medical Research Centre, King Abdulaziz University, Jeddah 21589, Saudi Arabia
- Mohsin Saleet Jafri
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA
- Center for Biomedical Engineering and Technology, University of Maryland School of Medicine, Baltimore, MD 21201, USA
2. Cai L, Li J, Lv H, Liu W, Niu H, Wang Z. Integrating domain knowledge for biomedical text analysis into deep learning: A survey. J Biomed Inform 2023; 143:104418. [PMID: 37290540 DOI: 10.1016/j.jbi.2023.104418] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/16/2022] [Revised: 04/24/2023] [Accepted: 05/31/2023] [Indexed: 06/10/2023]
Abstract
The past decade has witnessed an explosion of textual information in the biomedical field. Biomedical texts provide a basis for healthcare delivery, knowledge discovery, and decision-making. Over the same period, deep learning has achieved remarkable performance in biomedical natural language processing; however, its development has been limited by the scarcity of well-annotated datasets and by poor interpretability. To address this, researchers have considered combining domain knowledge (such as biomedical knowledge graphs) with biomedical data, which has become a promising means of introducing more information into biomedical datasets and following evidence-based medicine. This paper comprehensively reviews more than 150 recent studies on incorporating domain knowledge into deep learning models to facilitate typical biomedical text analysis tasks, including information extraction, text classification, and text generation. Finally, we discuss various challenges and future directions.
Affiliation(s)
- Linkun Cai
- School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China
- Jia Li
- Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China
- Han Lv
- Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China
- Wenjuan Liu
- Aerospace Center Hospital, 100049 Beijing, China
- Haijun Niu
- School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China
- Zhenchang Wang
- School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China; Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China
3. Molina M, Jiménez C, Montenegro C. Improving Drug-Drug Interaction Extraction with Gaussian Noise. Pharmaceutics 2023; 15:1823. [PMID: 37514010 PMCID: PMC10385013 DOI: 10.3390/pharmaceutics15071823] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/28/2023] [Revised: 04/28/2023] [Accepted: 06/12/2023] [Indexed: 07/30/2023] [Open Access]
Abstract
Drug-drug interactions (DDIs) provide essential and valuable insights for healthcare professionals, since they supply data on the impact of administering medications concurrently to patients during therapy. Several relevant works related to the DDIExtraction2013 Challenge are available in the current technical literature. This study aims to improve on previous results using two models, where a Gaussian noise layer is added to achieve better DDI relationship extraction. (1) A Piecewise Convolutional Neural Network (PW-CNN) model is used to capture relationships among pharmacological entities described in biomedical databases. Additionally, the model incorporates multichannel words to enrich the vocabulary and reduce unfamiliar words. (2) The second model uses the pre-trained BERT language model to classify relationships, while also integrating data from the target entities. After identifying the target entities, the model transfers the relevant information through the pre-trained architecture and integrates the encoded data for both entities. The experimental results show improved performance with respect to previous models.
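As a rough illustration of what a Gaussian noise layer does, the sketch below adds zero-mean noise to an embedding vector during training and passes the input through unchanged at inference time. The abstract does not state where the layer sits or which standard deviation is used, so the function name `gaussian_noise`, the `stddev` value, and the example vector are all assumptions made for illustration.

```python
import random


def gaussian_noise(vector, stddev, training, rng=random):
    """Add zero-mean Gaussian noise to each component, but only while training."""
    if not training or stddev == 0:
        # At inference time the layer is the identity.
        return list(vector)
    return [x + rng.gauss(0.0, stddev) for x in vector]


rng = random.Random(0)              # fixed seed so the sketch is repeatable
embedding = [0.2, -0.5, 1.0]        # hypothetical word-embedding values

noisy = gaussian_noise(embedding, stddev=0.1, training=True, rng=rng)
clean = gaussian_noise(embedding, stddev=0.1, training=False, rng=rng)
print(noisy)
print(clean)
```

Noise of this kind acts as a regularizer: the model sees a slightly different input on every pass, which discourages it from memorizing exact embedding values.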
Affiliation(s)
- Marco Molina
- Department of Informatics and Computer Science, Faculty of Systems Engineering, Escuela Politécnica Nacional, Av. Ladron de Guevara E11-25, Quito 170525, Ecuador
- Cristina Jiménez
- Department of Informatics and Computer Science, Faculty of Systems Engineering, Escuela Politécnica Nacional, Av. Ladron de Guevara E11-25, Quito 170525, Ecuador
- Carlos Montenegro
- Department of Informatics and Computer Science, Faculty of Systems Engineering, Escuela Politécnica Nacional, Av. Ladron de Guevara E11-25, Quito 170525, Ecuador
4. Luo L, Lai PT, Wei CH, Lu Z. A sequence labeling framework for extracting drug-protein relations from biomedical literature. Database (Oxford) 2022; 2022:baac058. [PMID: 35856889 PMCID: PMC9297941 DOI: 10.1093/database/baac058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2022] [Revised: 05/24/2022] [Accepted: 07/14/2022] [Indexed: 06/15/2023]
Abstract
Automatically extracting interactions between chemical compounds/drugs and genes/proteins is significantly beneficial to drug discovery, drug repurposing, drug design and biomedical knowledge graph construction. To promote the development of relation extraction between drugs and proteins, the BioCreative VII challenge organized the DrugProt track. This paper describes the approach we developed for this task. In addition to the conventional text classification framework that has been widely used in relation extraction tasks, we propose a sequence labeling framework for drug-protein relation extraction. We first comprehensively compared the cutting-edge biomedical pre-trained language models for both frameworks. Then, we explored several ensemble methods to further improve the final performance. In the evaluation of the challenge, our best submission (i.e. the ensemble of models in two frameworks via majority voting) achieved an F1-score of 0.795 on the official test set. Further, we found that the sequence labeling framework is more efficient and achieves better performance than the text classification framework. Finally, our ensemble of the sequence labeling models with majority voting achieves the best F1-score of 0.800 on the test set. Database URL: https://github.com/lingluodlut/BioCreativeVII_DrugProt.
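Majority voting, the ensemble method named in the abstract, can be sketched in a few lines. This is a generic illustration rather than the authors' code: the three model outputs and the label strings are invented for the example, and the tie-breaking rule (first model to propose a top-scoring label wins) is an assumption.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists (aligned by instance) into one label list."""
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # Tie-break (assumption): keep the label proposed by the earliest model.
        winner = next(label for label in labels if counts[label] == top)
        voted.append(winner)
    return voted


# Hypothetical outputs of three models on three candidate drug-protein pairs.
model_outputs = [
    ["INHIBITOR", "NONE", "SUBSTRATE"],
    ["INHIBITOR", "ACTIVATOR", "NONE"],
    ["NONE", "ACTIVATOR", "SUBSTRATE"],
]

ensemble = majority_vote(model_outputs)
print(ensemble)
```

Each instance receives the label that most models agree on, so an error made by a single model is outvoted as long as the others are correct.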
Affiliation(s)
- Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
- Zhiyong Lu
- *Corresponding author: Tel: 301 594 7089; Fax: 301 480 2288
5. Raja K. Biomedical Literature Mining and Its Components. Methods Mol Biol 2022; 2496:1-16. [PMID: 35713856 DOI: 10.1007/978-1-0716-2305-3_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Published biomedical articles are the best source of knowledge for understanding the importance of biomedical entities such as diseases and drugs, and their role in different patient population groups. The volume of biomedical literature available and being published is increasing at an exponential rate with the use of large-scale experimental techniques. Manual extraction of such information is becoming extremely difficult because of the huge volume of biomedical literature available. Alternatively, text mining approaches are receiving much interest within biomedicine because they automatically extract such information in a more structured format from unstructured biomedical text. Here, we present a text mining protocol to extract patient population information and to identify disease and drug mentions in PubMed titles and abstracts, together with a simple information retrieval approach to retrieve a list of relevant documents for a user query. The text mining protocol presented in this chapter is useful for retrieving information on drugs for patients with a specific disease. The protocol covers three major text mining tasks, namely, information retrieval, information extraction, and knowledge discovery.
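The abstract does not say which retrieval method the chapter uses. As one plausible reading of "a simple information retrieval approach", the sketch below ranks documents against a query with a basic TF-IDF score. The scoring formula, the function names, and the three document titles are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a real system would also handle punctuation."""
    return text.lower().split()


def rank_documents(query, documents):
    """Return indices of documents with a positive TF-IDF score, best first."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scored = []
    for index, tokens in enumerate(tokenized):
        term_freq = Counter(tokens)
        score = sum(
            term_freq[term] * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term in query_terms
            if term in term_freq
        )
        if score > 0:
            scored.append((score, index))
    return [index for score, index in sorted(scored, key=lambda item: (-item[0], item[1]))]


# Hypothetical titles standing in for PubMed records.
titles = [
    "metformin treatment in adults with type 2 diabetes",
    "statin therapy in elderly patients with heart disease",
    "insulin dosing in children with type 1 diabetes",
]

ranking = rank_documents("diabetes children", titles)
print(ranking)
```

Rare query terms weigh more than common ones, so the title that mentions both "diabetes" and the rarer "children" ranks first, and the title that matches neither term is dropped.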
Affiliation(s)
- Kalpana Raja
- Regenerative Biology, Morgridge Institute for Research, Madison, WI, USA.
6. Arumugam K, Shanker RR. Text Mining and Machine Learning Protocol for Extracting Human-Related Protein Phosphorylation Information from PubMed. Methods Mol Biol 2022; 2496:159-177. [PMID: 35713864 DOI: 10.1007/978-1-0716-2305-3_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
In modern health care research, protein phosphorylation has gained enormous attention from researchers across the globe and requires automated approaches to process a huge volume of data on proteins and their modifications at the cellular level. The data generated at the cellular level is unique as well as arbitrary, and the accumulation of a massive volume of information is inevitable. Biological research has revealed that a huge array of cellular communication aided by protein phosphorylation and other similar mechanisms implies different and diverse meanings. This has led to the collection of a huge volume of data to understand the biological functions of human evolution, especially for combating diseases in a better way. Text mining, an automated approach to mining information from unstructured data, finds application in extracting protein phosphorylation information from biomedical literature databases such as PubMed. This chapter outlines a recent text mining protocol that applies natural language parsing (NLP) for named entity recognition and text processing, and support vector machines (SVM), a machine learning algorithm, for classifying the processed text related to human protein phosphorylation. We discuss the evaluation of the text mining system that results from the protocol on three corpora, namely, the human Protein Phosphorylation (hPP) corpus, the Integrated Protein Literature Information and Knowledge corpus (iProLink), and the Phosphorylation Literature corpus (PLC). We also present a basic understanding of the chemistry and biology that drive the protein phosphorylation process in the human body. We believe that this basic understanding will be useful for advancing the existing text mining systems for extracting protein phosphorylation information from PubMed.
Affiliation(s)
- Krishnamurthy Arumugam
- Department of Management Studies, Coimbatore Institute of Engineering and Technology, Coimbatore, Tamilnadu, India
- Raja Ravi Shanker
- International Business Unit, Alembic Pharmaceuticals Limited, Vadodara, Gujarat, India
7. Zhang Y, Wang M, Saberi M, Chang E. Knowledge fusion through academic articles: a survey of definitions, techniques, applications and challenges. Scientometrics 2020. [DOI: 10.1007/s11192-020-03683-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
8. Li Z, Yang Z, Xiang Y, Luo L, Sun Y, Lin H. Exploiting sequence labeling framework to extract document-level relations from biomedical texts. BMC Bioinformatics 2020; 21:125. [PMID: 32216746 PMCID: PMC7099809 DOI: 10.1186/s12859-020-3457-2] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 10/30/2019] [Accepted: 03/18/2020] [Indexed: 12/02/2022] [Open Access]
Abstract
Background: Both intra- and inter-sentential semantic relations in biomedical texts provide valuable information for biomedical research. However, most existing methods either focus on extracting intra-sentential relations and ignore inter-sentential ones or fail to extract inter-sentential relations accurately and regard the instances containing entity relations as being independent, which neglects the interactions between relations. We propose a novel sequence labeling-based biomedical relation extraction method named Bio-Seq. In the method, the sequence labeling framework is extended by multiple specified feature extractors so as to facilitate feature extraction at different levels, especially at the inter-sentential level. In addition, the sequence labeling framework enables Bio-Seq to take advantage of the interactions between relations and thus further improves the precision of document-level relation extraction. Results: Our proposed method obtained an F1-score of 63.5% on the BioCreative V chemical disease relation corpus, and an F1-score of 54.4% on inter-sentential relations, which was 10.5% better than the document-level classification baseline. Also, our method achieved an F1-score of 85.1% on the n2c2-ADE sub-dataset. Conclusion: The sequence labeling method can be successfully used to extract document-level relations, especially for boosting the performance on inter-sentential relation extraction. Our work can facilitate the research on document-level biomedical text mining.
Affiliation(s)
- Zhiheng Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
- Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
- Yang Xiang
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, 77030, USA
- Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
- Yuanyuan Sun
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
- Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
9. Métais E, Meziane F, Horacek H, Cimiano P. Improving Named Entity Recognition for Biomedical and Patent Data Using Bi-LSTM Deep Neural Network Models. Natural Language Processing and Information Systems 2020. [PMCID: PMC7298184 DOI: 10.1007/978-3-030-51310-8_3] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 11/30/2022]
Affiliation(s)
- Elisabeth Métais
- Laboratoire Cédric, Conservatoire National des Arts et Métiers, Paris, France
- Farid Meziane
- School of Science, Engineering and Environment, University of Salford, Salford, UK
- Helmut Horacek
- Language Technology, German Research Center for Artificial Intelligence, Saarbrücken, Germany
- Philipp Cimiano
- Semantic Computing Group, Bielefeld University, Bielefeld, Germany
10. Tian B, Wu X, Chen C, Qiu W, Ma Q, Yu B. Predicting protein–protein interactions by fusing various Chou's pseudo components and using wavelet denoising approach. J Theor Biol 2019; 462:329-346. [DOI: 10.1016/j.jtbi.2018.11.011] [Citation(s) in RCA: 25] [Impact Index Per Article: 4.2] [Received: 09/13/2018] [Revised: 11/08/2018] [Accepted: 11/15/2018] [Indexed: 12/26/2022]
11. A research framework for pharmacovigilance in health social media: Identification and evaluation of patient adverse drug event reports. J Biomed Inform 2015; 58:268-279. [PMID: 26518315 DOI: 10.1016/j.jbi.2015.10.011] [Citation(s) in RCA: 42] [Impact Index Per Article: 4.2] [Received: 01/06/2015] [Revised: 10/20/2015] [Accepted: 10/21/2015] [Indexed: 11/23/2022]
Abstract
Social media offer insights into patients' medical problems such as drug side effects and treatment failures. Patient reports of adverse drug events from social media have great potential to improve the current practice of pharmacovigilance. However, extracting patient adverse drug event reports from social media continues to be an important challenge for health informatics research. In this study, we develop a research framework with advanced natural language processing techniques for integrated and high-performance extraction of patient-reported adverse drug events. The framework consists of medical entity extraction for recognizing patient discussions of drugs and events, adverse drug event extraction with a shortest dependency path kernel based statistical learning method and semantic filtering with information from medical knowledge bases, and report source classification to tease out noise. To evaluate the proposed framework, a series of experiments were conducted on a test bed encompassing about postings from major diabetes and heart disease forums in the United States. The results reveal that each component of the framework significantly contributes to its overall effectiveness. Our framework significantly outperforms prior work.
12. Chiang JH, Ju JH. Discovering novel protein–protein interactions by measuring the protein semantic similarity from the biomedical literature. J Bioinform Comput Biol 2015; 12:1442008. [DOI: 10.1142/s0219720014420086] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/18/2022]
Abstract
Protein–protein interactions (PPIs) are involved in the majority of biological processes. Identification of PPIs is therefore one of the key aims of biological research. Although there are many databases of PPIs, many other unidentified PPIs could be buried in the biomedical literature. Therefore, automated identification of PPIs from biomedical literature repositories could be used to discover otherwise hidden interactions. Search engines, such as Google, have been successfully applied to measure the relatedness among words. Inspired by such approaches, we propose a novel method to identify PPIs through semantic similarity measures among protein mentions. We define six semantic similarity measures as features based on the page counts retrieved from the MEDLINE database. A machine learning classifier, Random Forest, is trained using the above features. The proposed approach achieves an averaged micro-F of 71.28% and an averaged macro-F of 64.03% over five PPI corpora, an improvement over the results of using only the conventional co-occurrence feature (averaged micro-F of 68.79% and averaged macro-F of 60.49%). A relation-word reinforcement further improves the averaged micro-F to 71.3% and the averaged macro-F to 65.12%. Comparing the results of the current work with other studies on the AIMed corpus (ranging from 77.58% to 85.1% in micro-F, 62.18% to 76.27% in macro-F), we show that the proposed approach achieves a micro-F of 81.88% and a macro-F of 64.01% without the use of sophisticated feature extraction. Finally, we manually examine the newly discovered PPI pairs based on a literature review, and the results suggest that our approach could extract novel protein–protein interactions.
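The abstract does not list its six page-count measures. To show the general idea of turning page counts into similarity features, the sketch below computes three widely used co-occurrence measures (Jaccard, Dice, and pointwise mutual information) from hypothetical counts. The choice of measures, the function name, and all numbers are assumptions and may differ from what the authors used.

```python
import math


def cooccurrence_measures(count_a, count_b, count_ab, total):
    """Similarity scores for two protein names from document (page) counts.

    count_a, count_b: documents mentioning each protein
    count_ab: documents mentioning both
    total: size of the collection
    """
    if count_ab == 0:
        # Convention used here (assumption): no co-occurrence means zero similarity.
        return {"jaccard": 0.0, "dice": 0.0, "pmi": 0.0}
    jaccard = count_ab / (count_a + count_b - count_ab)
    dice = 2 * count_ab / (count_a + count_b)
    pmi = math.log2((count_ab / total) / ((count_a / total) * (count_b / total)))
    return {"jaccard": jaccard, "dice": dice, "pmi": pmi}


# Hypothetical page counts for one protein pair.
scores = cooccurrence_measures(count_a=400, count_b=100, count_ab=50, total=100000)
print(scores)
```

Scores like these can be stacked into a feature vector per protein pair and passed to a classifier such as the Random Forest mentioned in the abstract.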
Affiliation(s)
- Jung-Hsien Chiang
- Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, East District, Tainan City 701, Taiwan
- Jiun-Huang Ju
- Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, East District, Tainan City 701, Taiwan
13. Bagewadi S, Bobić T, Hofmann-Apitius M, Fluck J, Klinger R. Detecting miRNA Mentions and Relations in Biomedical Literature. F1000Res 2014; 3:205. [PMID: 26535109 PMCID: PMC4602280 DOI: 10.12688/f1000research.4591.3] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Accepted: 09/24/2015] [Indexed: 12/16/2022] [Open Access]
Abstract
Introduction: MicroRNAs (miRNAs) have demonstrated their potential as post-transcriptional gene expression regulators, participating in a wide spectrum of regulatory events such as apoptosis, differentiation, and stress response. Apart from the role of miRNAs in normal physiology, their dysregulation is implicated in a vast array of diseases. Dissection of miRNA-related associations is valuable for contemplating their mechanism in diseases, leading to the discovery of novel miRNAs for disease prognosis, diagnosis, and therapy. Motivation: Apart from databases and prediction tools, miRNA-related information is largely available as unstructured text. Manual retrieval of these associations can be labor-intensive due to the steadily growing number of publications. Additionally, most of the published miRNA entity recognition methods are keyword based and are further subjected to manual inspection for retrieval of relations. Although several databases host miRNA associations derived from text, lower sensitivity and a lack of published details for miRNA entity recognition and associated relation identification have motivated the need for developing comprehensive methods that are freely available to the scientific community. Additionally, the lack of a standard corpus for miRNA relations has caused difficulty in evaluating the available systems. We propose methods to automatically extract mentions of miRNAs, species, genes/proteins, disease, and relations from scientific literature. Our generated corpora, along with dictionaries and the miRNA regular expression, are freely available for academic purposes. To our knowledge, these resources are the most comprehensive developed so far. Results: The identification of specific miRNA mentions reaches a recall of 0.94 and precision of 0.93. Extraction of miRNA-disease and miRNA-gene relations leads to an F1 score of up to 0.76. A comparison of the information extracted by our approach to the databases miR2Disease and miRSel for the extraction of Alzheimer's disease related relations shows the capability of our proposed methods in identifying correct relations with improved sensitivity. The published resources and described methods can help researchers with maximal retrieval of miRNA relations and the generation of miRNA-regulatory networks. Availability: The training and test corpora, annotation guidelines, developed dictionaries, and supplementary files are available at http://www.scai.fraunhofer.de/mirna-corpora.html.
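The abstract mentions a miRNA regular expression but does not reproduce it. The pattern below is a simplified stand-in written for illustration only: it assumes common naming forms such as an optional three-letter species prefix, a `miR`/`mir`/`let` stem, a number, an optional letter suffix, and an optional `-3p`/`-5p` arm. It is not the authors' published expression and will miss many real variants.

```python
import re

# Illustrative pattern (assumption), not the expression released with the paper.
MIRNA_PATTERN = re.compile(
    r"\b(?:[a-z]{3}-)?"        # optional species prefix, e.g. "hsa-"
    r"(?:miR|mir|MIR|let)-?"   # stem
    r"\d+[a-z]?"               # number with optional letter, e.g. "146a"
    r"(?:-\d+(?!p))?"          # optional copy number, e.g. "-1" (but not the "5" of "-5p")
    r"(?:-[35]p|\*)?"          # optional arm: "-3p", "-5p", or legacy "*"
)

# Made-up sentence used only to exercise the pattern.
sentence = "Levels of hsa-miR-21 and miR-155-5p rose, whereas let-7a and mir-146a fell."

mentions = MIRNA_PATTERN.findall(sentence)
print(mentions)
```

A keyword-style pattern like this finds surface mentions only. Linking each mention to a disease or gene requires a separate relation extraction step, which is the part of the pipeline the abstract evaluates with the F1 score.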
Affiliation(s)
- Shweta Bagewadi
- Fraunhofer SCAI, Bioinformatics, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
- University of Bonn, B-IT, Dahlmannstr. 2, 53113 Bonn, Germany
- Tamara Bobić
- Hasso Plattner Institute Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
- Martin Hofmann-Apitius
- Fraunhofer SCAI, Bioinformatics, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
- University of Bonn, B-IT, Dahlmannstr. 2, 53113 Bonn, Germany
- Juliane Fluck
- Fraunhofer SCAI, Bioinformatics, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
- Roman Klinger
- Semantic Computing Group, CIT-EC, Bielefeld University, 33615 Bielefeld, Germany
14. Bagewadi S, Bobić T, Hofmann-Apitius M, Fluck J, Klinger R. Detecting miRNA Mentions and Relations in Biomedical Literature. F1000Res 2014; 3:205. [PMID: 26535109 DOI: 10.12688/f1000research.4591.1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Accepted: 08/19/2014] [Indexed: 12/23/2022] [Open Access]
Abstract
Introduction: MicroRNAs (miRNAs) have demonstrated their potential as post-transcriptional gene expression regulators, participating in a wide spectrum of regulatory events such as apoptosis, differentiation, and stress response. Apart from the role of miRNAs in normal physiology, their dysregulation is implicated in a vast array of diseases. Dissection of miRNA-related associations is valuable for contemplating their mechanism in diseases, leading to the discovery of novel miRNAs for disease prognosis, diagnosis, and therapy. Motivation: Apart from databases and prediction tools, miRNA-related information is largely available as unstructured text. Manual retrieval of these associations can be labor-intensive due to the steadily growing number of publications. Although several databases host miRNA associations derived from text, lower sensitivity has motivated the need for an improvised framework. Additionally, the lack of a standard corpus for miRNA relations has caused difficulty in evaluating the available systems. We propose methods to automatically extract mentions of miRNAs, species, genes/proteins, disease, and relations from scientific literature. Our generated corpora, along with dictionaries and the miRNA regular expression, are freely available for academic purposes. To our knowledge, these resources are the most comprehensive developed so far. Results: The identification of specific miRNA mentions reaches a recall of 0.94 and precision of 0.93. Extraction of miRNA-disease and miRNA-gene relations leads to an F1 score of up to 0.76. A comparison of the information extracted by our approach to the databases miR2Disease and miRSel for the extraction of Alzheimer's disease related relations shows the capability of our proposed methods in identifying correct relations with improved sensitivity. The published resources and described methods can help researchers with maximal retrieval of miRNA relations and the generation of miRNA-regulatory networks. Availability: The training and test corpora, annotation guidelines, developed dictionaries, and supplementary files are available at http://www.scai.fraunhofer.de/mirna-corpora.html.
Affiliation(s)
- Shweta Bagewadi
- Fraunhofer SCAI, Bioinformatics, Schloss Birlinghoven, 53754, Sankt Augustin, Germany; University of Bonn, B-IT, Dahlmannstr. 2, 53113 Bonn, Germany
- Tamara Bobić
- Hasso Plattner Institute Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
- Martin Hofmann-Apitius
- Fraunhofer SCAI, Bioinformatics, Schloss Birlinghoven, 53754, Sankt Augustin, Germany; University of Bonn, B-IT, Dahlmannstr. 2, 53113 Bonn, Germany
- Juliane Fluck
- Fraunhofer SCAI, Bioinformatics, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
- Roman Klinger
- Semantic Computing Group, CIT-EC, Bielefeld University, 33615 Bielefeld, Germany
15. Bagewadi S, Bobić T, Hofmann-Apitius M, Fluck J, Klinger R. Detecting miRNA Mentions and Relations in Biomedical Literature. F1000Res 2014; 3:205. [PMID: 26535109 DOI: 10.12688/f1000research.4591.2] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Accepted: 12/15/2014] [Indexed: 12/30/2022] [Open Access]
Abstract
Introduction: MicroRNAs (miRNAs) have demonstrated their potential as post-transcriptional gene expression regulators, participating in a wide spectrum of regulatory events such as apoptosis, differentiation, and stress response. Apart from the role of miRNAs in normal physiology, their dysregulation is implicated in a vast array of diseases. Dissection of miRNA-related associations is valuable for understanding their mechanisms in disease, leading to the discovery of novel miRNAs for disease prognosis, diagnosis, and therapy. Motivation: Apart from databases and prediction tools, miRNA-related information is largely available as unstructured text. Manual retrieval of these associations can be labor-intensive due to the steadily growing number of publications. Additionally, most of the published miRNA entity recognition methods are keyword-based and require further manual inspection to retrieve relations. Although several databases host miRNA associations derived from text, their lower sensitivity and the lack of published details on miRNA entity recognition and relation identification have motivated the development of comprehensive methods that are freely available to the scientific community. Additionally, the lack of a standard corpus for miRNA relations has made it difficult to evaluate the available systems. We propose methods to automatically extract mentions of miRNAs, species, genes/proteins, diseases, and relations from scientific literature. Our generated corpora, along with dictionaries and a miRNA regular expression, are freely available for academic purposes. To our knowledge, these resources are the most comprehensive developed so far. Results: The identification of specific miRNA mentions reaches a recall of 0.94 and a precision of 0.93. Extraction of miRNA-disease and miRNA-gene relations leads to an F1 score of up to 0.76.
A comparison of the information extracted by our approach with that in the databases miR2Disease and miRSel for Alzheimer's disease-related relations shows that the proposed methods identify correct relations with improved sensitivity. The published resources and described methods can help researchers maximize the retrieval of miRNA relations and generate miRNA regulatory networks. Availability: The training and test corpora, annotation guidelines, developed dictionaries, and supplementary files are available at http://www.scai.fraunhofer.de/mirna-corpora.html.
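As a rough illustration of what a regular expression for miRNA mentions might look like, the Python sketch below matches common naming forms such as a species prefix, a family keyword, a number, and an optional arm. The pattern and the function name `find_mirna_mentions` are assumptions made for illustration; they are not the expression published by the authors.

```python
import re

# Hypothetical pattern (an assumption, not the authors' published regex).
MIRNA_PATTERN = re.compile(
    r"\b(?:[a-z]{3}-)?"                    # optional species prefix, e.g. hsa-
    r"(?:(?:miR|miRNA|microRNA)-?|let-)"   # family keyword
    r"\d+[a-z]?"                           # family number plus optional letter
    r"(?:-[35]p)?",                        # optional arm suffix (-3p or -5p)
    re.IGNORECASE,
)


def find_mirna_mentions(text):
    """Return all substrings of `text` that look like miRNA mentions."""
    return [match.group(0) for match in MIRNA_PATTERN.finditer(text)]


if __name__ == "__main__":
    sentence = (
        "Expression of hsa-miR-21-5p and let-7a was reduced, "
        "while microRNA-155 increased."
    )
    print(find_mirna_mentions(sentence))
```

A keyword-style pattern like this only finds surface forms; mapping each mention to a database identifier and filtering false positives would need the dictionaries described in the abstract.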
Affiliation(s)
- Shweta Bagewadi
- Fraunhofer SCAI, Bioinformatics, Schloss Birlinghoven, 53754, Sankt Augustin, Germany ; University of Bonn, B-IT, Dahlmannstr. 2, 53113 Bonn, Germany
- Tamara Bobić
- Hasso Plattner Institute Potsdam, Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
- Martin Hofmann-Apitius
- Fraunhofer SCAI, Bioinformatics, Schloss Birlinghoven, 53754, Sankt Augustin, Germany ; University of Bonn, B-IT, Dahlmannstr. 2, 53113 Bonn, Germany
- Juliane Fluck
- Fraunhofer SCAI, Bioinformatics, Schloss Birlinghoven, 53754, Sankt Augustin, Germany
- Roman Klinger
- Semantic Computing Group, CIT-EC, Bielefeld University, 33615 Bielefeld, Germany
|
16
|
Yang Z, Zhao Z, Li Y, Hu Y, Lin H. PPIExtractor: a protein interaction extraction and visualization system for biomedical literature. IEEE Trans Nanobioscience 2013; 12:173-81. [PMID: 23974658 DOI: 10.1109/tnb.2013.2263837] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Indexed: 11/08/2022]
Abstract
Protein-protein interactions (PPIs) play a key role in various aspects of the structural and functional organization of the cell. Knowledge about them unveils the molecular mechanisms of biological processes. However, the amount of biomedical literature regarding protein interactions is increasing rapidly and it is difficult for interaction database curators to detect and curate protein interaction information manually. In this paper, we present a PPI extraction system, termed PPIExtractor, which automatically extracts PPIs from biomedical text and visualizes them. Given a Medline record dataset, PPIExtractor first applies Feature Coupling Generalization (FCG) to tag protein names in text, next uses the extended semantic similarity-based method to normalize them, then combines feature-based, convolution tree and graph kernels to extract PPIs, and finally visualizes the PPI network. Experimental evaluations show that PPIExtractor can achieve state-of-the-art performance on a DIP subset with respect to comparable evaluations.
Affiliation(s)
- Zhihao Yang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China.
|
17
|
Köster J, Zamir E, Rahmann S. Efficiently mining protein interaction dependencies from large text corpora. Integr Biol (Camb) 2012; 4:805-12. [PMID: 22706334 DOI: 10.1039/c2ib00126h] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 01/04/2023]
Abstract
Biochemical research has yielded an extensive amount of information about dependencies between protein interactions, as generated by allosteric regulation, steric hindrance, and other mechanisms. Collectively, this information is valuable for understanding large intracellular protein networks. However, it is sparsely distributed among millions of publications and documented as free-form text meant for manual reading. Here we develop a computational approach for extracting information about interaction dependencies from large numbers of publications. First, keyword-based tokenization reduces full papers to short strings, facilitating an efficient search for patterns that are likely to indicate descriptions of interaction dependencies. Sentences that match such patterns are extracted, thereby reducing the amount of text to be read by human curators. Applied to the integrin adhesome network, this approach extracted 208 short statements from 59,933 papers, close to half of which indeed describe interaction dependencies. We visualize the obtained hypernetwork of dependencies and illustrate that these dependencies confine the feasible mechanisms of adhesion site assembly and generate testable hypotheses about their switchability.
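The idea of reducing text to short keyword strings before pattern search can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the keyword classes, the single-letter codes (`P` for protein, `I` for interaction cue, `D` for dependency cue), the placeholder protein names, and the pattern are all hypothetical and are not taken from the paper.

```python
import re

# Hypothetical keyword classes; ProtA/ProtB/ProtC are placeholder names.
KEYWORDS = {
    "prota": "P", "protb": "P", "protc": "P",            # protein names
    "binds": "I", "binding": "I", "interacts": "I",      # interaction cues
    "requires": "D", "inhibits": "D", "enhances": "D",   # dependency cues
}

# Hypothetical pattern: an interaction (protein, cue, protein) that occurs
# together with a dependency cue anywhere in the same sentence.
DEPENDENCY_PATTERN = re.compile(r"PIP.*D|D.*PIP")


def tokenize(sentence):
    """Reduce a sentence to a short string of keyword class codes."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    return "".join(KEYWORDS[word] for word in words if word in KEYWORDS)


def candidate_sentences(sentences):
    """Keep only sentences whose code string matches the dependency pattern."""
    return [s for s in sentences if DEPENDENCY_PATTERN.search(tokenize(s))]


if __name__ == "__main__":
    corpus = [
        "ProtA binding to ProtB requires ProtC.",
        "ProtA binds ProtB.",
        "ProtC was discussed at length.",
    ]
    for sentence in candidate_sentences(corpus):
        print(sentence)
```

Searching the compact code string instead of the full text keeps each pattern short and cheap to evaluate; the surviving sentences would still need review by a human curator, as the abstract notes.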
Affiliation(s)
- Johannes Köster
- Genome Informatics, Institute of Human Genetics, Faculty of Medicine, University of Duisburg-Essen, Essen, Germany
|
18
|
Song M, Yu H, Han WS. Combining active learning and semi-supervised learning techniques to extract protein interaction sentences. BMC Bioinformatics 2011; 12 Suppl 12:S4. [PMID: 22168401 PMCID: PMC3247085 DOI: 10.1186/1471-2105-12-s12-s4] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/25/2022] Open
Abstract
Background: Protein-protein interaction (PPI) extraction has been a focal point of many biomedical research and database curation tools. Both Active Learning (AL) and Semi-supervised SVMs have recently been applied to extract PPIs automatically. In this paper, we explore combining AL with semi-supervised learning (SSL) to improve the performance of the PPI task. Methods: We propose a novel PPI extraction technique called PPISpotter, which combines Deterministic Annealing-based SSL and an AL technique to extract protein-protein interactions. In addition, we extract a comprehensive set of features from MEDLINE records using Natural Language Processing (NLP) techniques, which further improve the SVM classifiers. Our feature selection technique incorporates syntactic, semantic, and lexical properties of text, which boosts system performance significantly. Results: In experiments with three different PPI corpora, we show that PPISpotter is superior to other techniques incorporated into semi-supervised SVMs, such as Random Sampling, Clustering, and Transductive SVMs, in terms of precision, recall, and F-measure. Conclusions: Our system is a novel, state-of-the-art technique for efficiently extracting protein-protein interaction pairs.
Affiliation(s)
- Min Song
- Information Systems Department, New Jersey Institute of Technology, University Heights, Newark, New Jersey, USA.
|
19
|
Segura-Bedmar I, Martínez P, de Pablo-Sánchez C. Using a shallow linguistic kernel for drug–drug interaction extraction. J Biomed Inform 2011; 44:789-804. [DOI: 10.1016/j.jbi.2011.04.005] [Citation(s) in RCA: 89] [Impact Index Per Article: 6.4] [Received: 10/08/2010] [Revised: 04/14/2011] [Accepted: 04/19/2011] [Indexed: 11/26/2022]
|
20
|
Faro A, Giordano D, Spampinato C. Combining literature text mining with microarray data: advances for system biology modeling. Brief Bioinform 2011; 13:61-82. [PMID: 21677032 DOI: 10.1093/bib/bbr018] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Indexed: 01/10/2023] Open
Abstract
A huge amount of important biomedical information is hidden in the bulk of research articles in biomedical fields. At the same time, the publication of databases of biological information and of experimental datasets generated by high-throughput methods is expanding rapidly, and a wealth of annotated gene databases and chemical, genomic (including microarray datasets), clinical, and other types of data repositories are now available on the Web. Thus, a current challenge of bioinformatics is to develop targeted methods and tools that integrate scientific literature, biological databases, and experimental data to reduce the time needed for database curation and to access evidence, either in the literature or in the datasets, that is useful for the analysis at hand. In this context, this article reviews the knowledge discovery systems that fuse information from the literature, gathered by text mining, with microarray data to enrich the lists of down- and upregulated genes with elements for biological understanding and to generate and validate new biological hypotheses. Finally, an easy-to-use and freely accessible tool, GeneWizard, which exploits text mining and microarray data fusion to support researchers in discovering gene-disease relationships, is described.
Affiliation(s)
- Alberto Faro
- Department of Informatics and Telecommunication Engineering-University of Catania, Catania, Italy
|
21
|
Segura-Bedmar I, Martínez P, de Pablo-Sánchez C. A linguistic rule-based approach to extract drug-drug interactions from pharmacological documents. BMC Bioinformatics 2011; 12 Suppl 2:S1. [PMID: 21489220 PMCID: PMC3073181 DOI: 10.1186/1471-2105-12-s2-s1] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Indexed: 11/10/2022] Open
Abstract
Background: A drug-drug interaction (DDI) occurs when one drug influences the level or activity of another drug. The increasing volume of the scientific literature overwhelms health care professionals trying to keep up to date with all published studies on DDIs. Methods: This paper describes a hybrid linguistic approach to DDI extraction that combines shallow parsing and syntactic simplification with pattern matching. Appositions and coordinate structures are interpreted based on shallow syntactic parsing provided by the UMLS MetaMap tool (MMTx). Subsequently, complex and compound sentences are broken down into clauses, from which simple sentences are generated by a set of simplification rules. A pharmacist defined a set of domain-specific lexical patterns to capture the most common expressions of DDIs in texts. These lexical patterns are matched against the generated sentences in order to extract DDIs. Results: We performed different experiments to analyze the performance of the different processes. The lexical patterns achieve a reasonable precision (67.30%) but very low recall (14.07%). The inclusion of appositions and coordinate structures improves recall (25.70%), but precision is lower (48.69%). The detection of clauses does not improve the performance. Conclusions: Information Extraction (IE) techniques can provide an interesting way of reducing the time spent by health care professionals on reviewing the literature. Nevertheless, no approach has been carried out to extract DDIs from texts. To the best of our knowledge, this work proposes the first integral solution for the automatic extraction of DDIs from biomedical texts.
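The trade-off between the two reported configurations is easier to judge with a single combined score. The short Python sketch below derives the balanced F-measure from the precision and recall values quoted in the abstract, assuming the standard harmonic-mean definition; the abstract itself does not report these F-scores, and the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, expressed as fractions.
patterns_only = f1_score(0.6730, 0.1407)
with_appositions = f1_score(0.4869, 0.2570)

if __name__ == "__main__":
    print(f"lexical patterns only:           {patterns_only:.4f}")
    print(f"with appositions / coordination: {with_appositions:.4f}")
```

Under this definition the derived score rises from roughly 0.23 to roughly 0.34, so the gain in recall outweighs the loss in precision when both are weighted equally.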
Affiliation(s)
- Isabel Segura-Bedmar
- Computer Science Department, University Carlos III of Madrid, Leganés, 28911, Spain.
|
22
|
Yang Z, Tang N, Zhang X, Lin H, Li Y, Yang Z. Multiple kernel learning in protein-protein interaction extraction from biomedical literature. Artif Intell Med 2011; 51:163-73. [PMID: 21208788 DOI: 10.1016/j.artmed.2010.12.002] [Citation(s) in RCA: 36] [Impact Index Per Article: 2.6] [Received: 09/26/2009] [Revised: 11/25/2010] [Accepted: 12/07/2010] [Indexed: 11/19/2022]
Abstract
OBJECTIVE: Knowledge about protein-protein interactions (PPIs) unveils the molecular mechanisms of biological processes. The volume and content of published biomedical literature on protein interactions is expanding rapidly, making it increasingly difficult for interaction database administrators, who are responsible for content input and maintenance, to detect and manually update protein interaction information. The objective of this work is to develop an effective approach to automatic extraction of PPI information from biomedical literature. METHODS AND MATERIALS: We present a weighted multiple kernel learning-based approach for automatic PPI extraction from biomedical literature. The approach combines the following kernels: feature-based, tree, graph, and part-of-speech (POS) path. In particular, we extend the shortest path-enclosed tree (SPT) and dependency path tree to capture richer contextual information. RESULTS: Our experimental results show that the combination of SPT and dependency path tree extensions improves performance by almost 0.7 percentage units in F-score and 2 percentage units in area under the receiver operating characteristics curve (AUC). Combining two or more appropriately weighted individual kernels further improves the performance. In both individual-corpus and cross-corpus evaluation, our combined kernel achieves state-of-the-art performance with respect to comparable evaluations, with 64.41% F-score and 88.46% AUC on the AImed corpus. CONCLUSIONS: Because different kernels calculate the similarity between two sentences from different aspects, our combined kernel can reduce the risk of missing important features. More specifically, we use a weighted linear combination of individual kernels instead of assigning the same weight to each individual kernel, thus allowing the introduction of each kernel to incrementally contribute to the performance improvement. In addition, SPT and dependency path tree extensions can improve the performance by including richer context information.
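A weighted linear combination of kernels can be sketched in a few lines of Python. The example below is a minimal illustration, assuming each kernel is given as a precomputed Gram matrix over the same set of sentences; the function name `combine_kernels`, the normalization of weights to sum to one, and the toy matrices are assumptions made here, not details from the paper.

```python
def combine_kernels(kernel_matrices, weights):
    """Return the weighted sum of same-shaped Gram matrices.

    Weights must be non-negative so that the sum of positive semidefinite
    kernels remains positive semidefinite. They are normalized to sum to
    one (an assumption made for this sketch).
    """
    if len(kernel_matrices) != len(weights):
        raise ValueError("need exactly one weight per kernel matrix")
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be non-negative")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    normalized = [w / total for w in weights]

    n_rows = len(kernel_matrices[0])
    n_cols = len(kernel_matrices[0][0])
    return [
        [
            sum(w * kernel[i][j] for w, kernel in zip(normalized, kernel_matrices))
            for j in range(n_cols)
        ]
        for i in range(n_rows)
    ]


if __name__ == "__main__":
    # Toy Gram matrices for two sentences (illustrative values only).
    feature_kernel = [[1.0, 0.5], [0.5, 1.0]]
    graph_kernel = [[1.0, 0.0], [0.0, 1.0]]
    print(combine_kernels([feature_kernel, graph_kernel], [3.0, 1.0]))
```

The combined matrix can then be passed to any classifier that accepts a precomputed kernel. How the weights themselves are chosen is the learning problem the paper addresses and is not covered by this sketch.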
Affiliation(s)
- Zhihao Yang
- Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024, China.
|