1. Rehana H, Çam NB, Basmaci M, Zheng J, Jemiyo C, He Y, Özgür A, Hur J. Evaluation of GPT and BERT-based models on identifying protein-protein interactions in biomedical text. ARXIV 2023:arXiv:2303.17728v2. [PMID: 38764593 PMCID: PMC11101131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/21/2024]
Abstract
Detecting protein-protein interactions (PPIs) is crucial for understanding genetic mechanisms, disease pathogenesis, and drug design. However, with the fast-paced growth of biomedical literature, there is a growing need for automated and accurate extraction of PPIs to facilitate scientific knowledge discovery. Pre-trained language models, such as generative pre-trained transformers (GPT) and bidirectional encoder representations from transformers (BERT), have shown promising results in natural language processing (NLP) tasks. We evaluated the performance of PPI identification of multiple GPT and BERT models using three manually curated gold-standard corpora: Learning Language in Logic (LLL) with 164 PPIs in 77 sentences, Human Protein Reference Database with 163 PPIs in 145 sentences, and Interaction Extraction Performance Assessment with 335 PPIs in 486 sentences. BERT-based models achieved the best overall performance, with BioBERT achieving the highest recall (91.95%) and F1-score (86.84%) and PubMedBERT achieving the highest precision (85.25%). Interestingly, despite not being explicitly trained for biomedical texts, GPT-4 achieved commendable performance, comparable to the top-performing BERT models. It achieved a precision of 88.37%, a recall of 85.14%, and an F1-score of 86.49% on the LLL dataset. These results suggest that GPT models can effectively detect PPIs from text data, offering promising avenues for application in biomedical literature mining. Further research could explore how these models might be fine-tuned for even more specialized tasks within the biomedical domain.
Affiliation(s)
- Hasin Rehana
  - Computer Science Graduate Program, University of North Dakota, Grand Forks, North Dakota, 58202, USA
- Nur Bengisu Çam
  - Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey
- Mert Basmaci
  - Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey
- Jie Zheng
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, 48109, USA
- Christianah Jemiyo
  - Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, North Dakota, 58202, USA
- Yongqun He
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, 48109, USA
  - Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, 48109, USA
- Arzucan Özgür
  - Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey
- Junguk Hur
  - Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, North Dakota, 58202, USA
2. He Y, Yu H, Huffman A, Lin AY, Natale DA, Beverley J, Zheng L, Perl Y, Wang Z, Liu Y, Ong E, Wang Y, Huang P, Tran L, Du J, Shah Z, Shah E, Desai R, Huang HH, Tian Y, Merrell E, Duncan WD, Arabandi S, Schriml LM, Zheng J, Masci AM, Wang L, Liu H, Smaili FZ, Hoehndorf R, Pendlington ZM, Roncaglia P, Ye X, Xie J, Tang YW, Yang X, Peng S, Zhang L, Chen L, Hur J, Omenn GS, Athey B, Smith B. A comprehensive update on CIDO: the community-based coronavirus infectious disease ontology. J Biomed Semantics 2022; 13:25. [PMID: 36271389 PMCID: PMC9585694 DOI: 10.1186/s13326-022-00279-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 05/17/2022] [Accepted: 09/13/2022] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND The current COVID-19 pandemic and the previous SARS/MERS outbreaks of 2003 and 2012 have resulted in a series of major global public health crises. We argue that in the interest of developing effective and safe vaccines and drugs and to better understand coronaviruses and associated disease mechanisms it is necessary to integrate the large and exponentially growing body of heterogeneous coronavirus data. Ontologies play an important role in standard-based knowledge and data representation, integration, sharing, and analysis. Accordingly, we initiated the development of the community-based Coronavirus Infectious Disease Ontology (CIDO) in early 2020. RESULTS As an Open Biomedical Ontology (OBO) library ontology, CIDO is open source and interoperable with other existing OBO ontologies. CIDO is aligned with the Basic Formal Ontology and Viral Infectious Disease Ontology. CIDO has imported terms from over 30 OBO ontologies. For example, CIDO imports all SARS-CoV-2 protein terms from the Protein Ontology, COVID-19-related phenotype terms from the Human Phenotype Ontology, and over 100 COVID-19 terms for vaccines (both authorized and in clinical trial) from the Vaccine Ontology. CIDO systematically represents variants of SARS-CoV-2 viruses and over 300 amino acid substitutions therein, along with over 300 diagnostic kits and methods. CIDO also describes hundreds of host-coronavirus protein-protein interactions (PPIs) and the drugs that target proteins in these PPIs. CIDO has been used to model COVID-19 related phenomena in areas such as epidemiology. The scope of CIDO was evaluated by visual analysis supported by a summarization network method. CIDO has been used in various applications such as term standardization, inference, natural language processing (NLP) and clinical data integration. We have applied the amino acid variant knowledge present in CIDO to analyze differences between SARS-CoV-2 Delta and Omicron variants. CIDO's integrative host-coronavirus PPIs and drug-target knowledge has also been used to support drug repurposing for COVID-19 treatment. CONCLUSION CIDO represents entities and relations in the domain of coronavirus diseases with a special focus on COVID-19. It supports shared knowledge representation, data and metadata standardization and integration, and has been used in a range of applications.
Affiliation(s)
- Yongqun He
  - University of Michigan Medical School, Ann Arbor, MI USA
- Hong Yu
  - People’s Hospital of Guizhou Province, Guiyang, Guizhou China
- Asiyah Yu Lin
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD USA
  - National Center for Ontological Research, Buffalo, NY USA
- John Beverley
  - National Center for Ontological Research, Buffalo, NY USA
  - The Johns Hopkins University Applied Physics Laboratory, Laurel, MD USA
- Ling Zheng
  - Computer Science and Software Engineering Department, Monmouth University, West Long Branch, NJ USA
- Yehoshua Perl
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ USA
- Zhigang Wang
  - Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & School of Basic Medicine, Peking Union Medical College, Beijing, China
- Yingtong Liu
  - University of Michigan Medical School, Ann Arbor, MI USA
- Edison Ong
  - University of Michigan Medical School, Ann Arbor, MI USA
- Yang Wang
  - University of Michigan Medical School, Ann Arbor, MI USA
  - People’s Hospital of Guizhou Province, Guiyang, Guizhou China
- Philip Huang
  - University of Michigan Medical School, Ann Arbor, MI USA
- Long Tran
  - University of Michigan Medical School, Ann Arbor, MI USA
- Jinyang Du
  - University of Michigan Medical School, Ann Arbor, MI USA
- Zalan Shah
  - University of Michigan Medical School, Ann Arbor, MI USA
- Easheta Shah
  - University of Michigan Medical School, Ann Arbor, MI USA
- Roshan Desai
  - University of Michigan Medical School, Ann Arbor, MI USA
- Hsin-hui Huang
  - University of Michigan Medical School, Ann Arbor, MI USA
  - National Yang-Ming University, Taipei, Taiwan
- Yujia Tian
  - Rutgers University, New Brunswick, NJ USA
- Lynn M. Schriml
  - University of Maryland School of Medicine, Baltimore, MD USA
- Jie Zheng
  - Department of Biology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA USA
- Anna Maria Masci
  - Office of Data Science, National Institute of Environmental Health Sciences, Research Triangle Park, NC USA
- Robert Hoehndorf
  - King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Zoë May Pendlington
  - European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
- Paola Roncaglia
  - European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, UK
- Xianwei Ye
  - People’s Hospital of Guizhou Province, Guiyang, Guizhou China
- Jiangan Xie
  - School of Bioinformatics, Chongqing University of Posts and Telecommunications, Chongqing, China
- Yi-Wei Tang
  - Cepheid, Danaher Diagnostic Platform, Shanghai, China
- Xiaolin Yang
  - Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & School of Basic Medicine, Peking Union Medical College, Beijing, China
- Suyuan Peng
  - National Institute of Health Data Science, Peking University, Beijing, China
- Luxia Zhang
  - National Institute of Health Data Science, Peking University, Beijing, China
- Luonan Chen
  - Shanghai Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai, China
- Junguk Hur
  - University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND USA
- Brian Athey
  - University of Michigan Medical School, Ann Arbor, MI USA
- Barry Smith
  - National Center for Ontological Research, Buffalo, NY USA
  - University at Buffalo, Buffalo, NY 14260 USA
3. Huffman A, Ong E, Hur J, D’Mello A, Tettelin H, He Y. COVID-19 vaccine design using reverse and structural vaccinology, ontology-based literature mining and machine learning. Brief Bioinform 2022; 23:bbac190. [PMID: 35649389 PMCID: PMC9294427 DOI: 10.1093/bib/bbac190] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 01/30/2022] [Revised: 04/13/2022] [Accepted: 04/26/2022] [Indexed: 12/11/2022] Open
Abstract
Rational vaccine design, especially vaccine antigen identification and optimization, is critical to successful and efficient vaccine development against various infectious diseases including coronavirus disease 2019 (COVID-19). In general, computational vaccine design includes three major stages: (i) identification and annotation of experimentally verified gold standard protective antigens through literature mining, (ii) rational vaccine design using reverse vaccinology (RV) and structural vaccinology (SV) and (iii) post-licensure vaccine success and adverse event surveillance and its usage for vaccine design. Protegen is a database of experimentally verified protective antigens, which can be used as gold standard data for rational vaccine design. RV predicts protective antigen targets primarily from genome sequence analysis. SV refines antigens through structural engineering. Recently, RV and SV approaches, with the support of various machine learning methods, have been applied to COVID-19 vaccine design. The analysis of post-licensure vaccine adverse event report data also provides valuable results in terms of vaccine safety and how vaccines should be used or paused. Ontology standardizes and incorporates heterogeneous data and knowledge in a human- and computer-interpretable manner, further supporting machine learning and vaccine design. Future directions on rational vaccine design are discussed.
Affiliation(s)
- Anthony Huffman
  - Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan 48109, USA
- Edison Ong
  - Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan 48109, USA
- Junguk Hur
  - Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, North Dakota 58202, USA
- Adonis D’Mello
  - Department of Microbiology and Immunology, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Hervé Tettelin
  - Department of Microbiology and Immunology, Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD 21201, USA
- Yongqun He
  - Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan 48109, USA
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, Michigan 48109, USA
4. Wang Z, He Y. Precision omics data integration and analysis with interoperable ontologies and their application for COVID-19 research. Brief Funct Genomics 2021; 20:235-248. [PMID: 34159360 PMCID: PMC8287950 DOI: 10.1093/bfgp/elab029] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/25/2021] [Revised: 05/10/2021] [Accepted: 05/24/2021] [Indexed: 12/12/2022] Open
Abstract
Omics technologies are widely used in biomedical research. Precision medicine focuses on individual-level disease treatment and prevention. Here, we propose the usage of the term 'precision omics' to represent the combinatorial strategy that applies omics to translate large-scale molecular omics data for precision disease understanding and accurate disease diagnosis, treatment and prevention. Given the complexity of both omics and precision medicine, precision omics requires standardized representation and integration of heterogeneous data types. Ontology has emerged as an important artificial intelligence component to become critical for standard data and metadata representation, standardization and integration. To support precision omics, we propose a precision omics ontology hypothesis, which hypothesizes that the effectiveness of precision omics is positively correlated with the interoperability of ontologies used for data and knowledge integration. Therefore, to make effective precision omics studies, interoperable ontologies are required to standardize and incorporate heterogeneous data and knowledge in a human- and computer-interpretable manner. Methods for efficient development and application of interoperable ontologies are proposed and illustrated. With the interoperable omics data and knowledge, omics tools such as OmicsViz can also be evolved to process, integrate, visualize and analyze various omics data, leading to the identification of new knowledge and hypotheses of molecular mechanisms underlying the outcomes of diseases such as COVID-19. Given extensive COVID-19 omics research, we propose the strategy of precision omics supported by interoperable ontologies, accompanied with ontology-based semantic reasoning and machine learning, leading to systematic disease mechanism understanding and rational design of precision treatment and prevention. SHORT ABSTRACT Precision medicine focuses on individual-level disease treatment and prevention. 
Precision omics is a new strategy that applies omics for precision medicine research, which requires standardized representation and integration of individual genetics and phenotypes, experimental conditions, and data analysis settings. Ontology has emerged as an important artificial intelligence component to become critical for standard data and metadata representation, standardization and integration. To support precision omics, interoperable ontologies are required in order to standardize and incorporate heterogeneous data and knowledge in a human- and computer-interpretable manner. With the interoperable omics data and knowledge, omics tools such as OmicsViz can also be evolved to process, integrate, visualize and analyze various omics data, leading to the identification of new knowledge and hypotheses of molecular mechanisms underlying disease outcomes. The precision COVID-19 omics study is provided as the primary use case to illustrate the rationale and implementation of the precision omics strategy.
Affiliation(s)
- Yongqun He
  - University of Michigan Medical School, Ann Arbor, MI, USA
5. Islamaj Dogan R, Kim S, Chatr-Aryamontri A, Wei CH, Comeau DC, Antunes R, Matos S, Chen Q, Elangovan A, Panyam NC, Verspoor K, Liu H, Wang Y, Liu Z, Altinel B, Hüsünbeyi ZM, Özgür A, Fergadis A, Wang CK, Dai HJ, Tran T, Kavuluru R, Luo L, Steppi A, Zhang J, Qu J, Lu Z. Overview of the BioCreative VI Precision Medicine Track: mining protein interactions and mutations for precision medicine. Database (Oxford) 2019; 2019:5303240. [PMID: 30689846 PMCID: PMC6348314 DOI: 10.1093/database/bay147] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 03/02/2018] [Accepted: 12/19/2018] [Indexed: 12/16/2022]
Abstract
The Precision Medicine Initiative is a multicenter effort aiming at formulating personalized treatments leveraging on individual patient data (clinical, genome sequence and functional genomic data) together with the information in large knowledge bases (KBs) that integrate genome annotation, disease association studies, electronic health records and other data types. The biomedical literature provides a rich foundation for populating these KBs, reporting genetic and molecular interactions that provide the scaffold for the cellular regulatory systems and detailing the influence of genetic variants in these interactions. The goal of BioCreative VI Precision Medicine Track was to extract this particular type of information and was organized in two tasks: (i) document triage task, focused on identifying scientific literature containing experimentally verified protein-protein interactions (PPIs) affected by genetic mutations and (ii) relation extraction task, focused on extracting the affected interactions (protein pairs). To assist system developers and task participants, a large-scale corpus of PubMed documents was manually annotated for this task. Ten teams worldwide contributed 22 distinct text-mining models for the document triage task, and six teams worldwide contributed 14 different text-mining systems for the relation extraction task. When comparing the text-mining system predictions with human annotations, for the triage task, the best F-score was 69.06%, the best precision was 62.89%, the best recall was 98.0% and the best average precision was 72.5%. For the relation extraction task, when taking homologous genes into account, the best F-score was 37.73%, the best precision was 46.5% and the best recall was 54.1%. Submitted systems explored a wide range of methods, from traditional rule-based, statistical and machine learning systems to state-of-the-art deep learning methods. 
Given the level of participation and the individual team results we find the precision medicine track to be successful in engaging the text-mining research community. In the meantime, the track produced a manually annotated corpus of 5509 PubMed documents developed by BioGRID curators and relevant for precision medicine. The data set is freely available to the community, and the specific interactions have been integrated into the BioGRID data set. In addition, this challenge provided the first results of automatically identifying PubMed articles that describe PPI affected by mutations, as well as extracting the affected relations from those articles. Still, much progress is needed for computer-assisted precision medicine text mining to become mainstream. Future work should focus on addressing the remaining technical challenges and incorporating the practical benefits of text-mining tools into real-world precision medicine information-related curation.
Affiliation(s)
- Rezarta Islamaj Dogan
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Sun Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Chih-Hsuan Wei
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Donald C Comeau
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Rui Antunes
  - Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Sérgio Matos
  - Department of Electronics, Telecommunications and Informatics (DETI)/Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Qingyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Aparna Elangovan
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Nagesh C Panyam
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
- Hongfang Liu
  - Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
- Yanshan Wang
  - Department of Health Science Research, Mayo Clinic, Rochester, MN, USA
- Zhuang Liu
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Berna Altinel
  - Department of Computer Engineering, Marmara University, Istanbul, Turkey
- Aris Fergadis
  - School of Electrical and Computer Engineering, National Technical University of Athens, Zografou, Athens, Greece
- Chen-Kai Wang
  - Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
- Hong-Jie Dai
  - Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
- Tung Tran
  - Department of Computer Science, University of Kentucky, Lexington, KY, USA
- Ramakanth Kavuluru
  - Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY, USA
- Ling Luo
  - College of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Albert Steppi
  - Department of Statistics, Florida State University, Florida, USA
- Jinfeng Zhang
  - Department of Statistics, Florida State University, Florida, USA
- Jinchan Qu
  - Department of Statistics, Florida State University, Florida, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
6. Sogancioglu G, Öztürk H, Özgür A. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics 2018; 33:i49-i58. [PMID: 28881973 PMCID: PMC5870675 DOI: 10.1093/bioinformatics/btx238] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Indexed: 11/14/2022] Open
Abstract
Motivation The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text. Methods We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods. Results The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems up to 42.6% in terms of the Pearson correlation metric. Availability and implementation A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/.
Affiliation(s)
- Gizem Sogancioglu
  - Department of Computer Engineering, Bogazici University, Istanbul, Turkey
  - R&D and Special Projects Department, Yapı Kredi Technology, Istanbul, Turkey
- Hakime Öztürk
  - Department of Computer Engineering, Bogazici University, Istanbul, Turkey
- Arzucan Özgür
  - Department of Computer Engineering, Bogazici University, Istanbul, Turkey
7. Chen Q, Panyam NC, Elangovan A, Verspoor K. BioCreative VI Precision Medicine Track system performance is constrained by entity recognition and variations in corpus characteristics. Database (Oxford) 2018; 2018:5255181. [PMID: 30576491 PMCID: PMC6301335 DOI: 10.1093/database/bay122] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 04/02/2018] [Revised: 09/24/2018] [Accepted: 10/16/2018] [Indexed: 01/01/2023]
Abstract
Precision medicine aims to provide personalized treatments based on individual patient profiles. One critical step towards precision medicine is leveraging knowledge derived from biomedical publications, a tremendous literature resource presenting the latest scientific discoveries on genes, mutations and diseases. Biomedical natural language processing (BioNLP) plays a vital role in supporting automation of this process. BioCreative VI Track 4 brings community effort to the task of automatically identifying and extracting protein-protein interactions (PPI) affected by mutations (PPIm), important in the precision medicine context for capturing individual genotype variation related to disease. We present the READ-BioMed team's approach to identifying PPIm-related publications and to extracting specific PPIm information from those publications in the context of the BioCreative VI PPIm track. We observe that current BioNLP tools are insufficient to recognise entities for these two tasks; the best existing mutation recognition tool achieves only 55% recall in the document triage training set, while relation extraction performance is limited by the low recall performance of gene entity recognition. We develop the models accordingly: for document triage, we develop term lists capturing interactions and mutations to complement BioNLP tools, and select effective features via a feature contribution study, whereas an ensemble of BioNLP tools is employed for relation extraction. Our best document triage model achieves an F-score of 66.77% while our best model for relation extraction achieved an F-score of 35.09% over the final (updated post-task) test set. Impacting the document triage task, the characteristics of mutations are statistically different in the training and testing sets. While a vital new direction for biomedical text mining research, this early attempt to tackle the problem of identifying genetic variation of substantial biological significance highlights the importance of representative training data and the cascading impact of tool limitations in a modular system.
Affiliation(s)
- Qingyu Chen
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Nagesh C Panyam
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Aparna Elangovan
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
- Karin Verspoor
  - School of Computing and Information Systems, The University of Melbourne, Parkville VIC Australia
8. Hur J, Özgür A, He Y. Ontology-based literature mining of E. coli vaccine-associated gene interaction networks. J Biomed Semantics 2017; 8:12. [PMID: 28288685 PMCID: PMC5348867 DOI: 10.1186/s13326-017-0122-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 12/24/2016] [Accepted: 03/03/2017] [Indexed: 12/21/2022] Open
Abstract
Background Pathogenic Escherichia coli infections cause various diseases in humans and many animal species. However, with extensive E. coli vaccine research, we are still unable to fully protect ourselves against E. coli infections. To support more rational development of effective and safe E. coli vaccines, it is important to better understand E. coli vaccine-associated gene interaction networks. Methods In this study, we first extended the Vaccine Ontology (VO) to semantically represent various E. coli vaccines and genes used in the vaccine development. We also normalized E. coli gene names compiled from the annotations of various E. coli strains using a pan-genome-based annotation strategy. The Interaction Network Ontology (INO) includes a hierarchy of various interaction-related keywords useful for literature mining. Using VO, INO, and normalized E. coli gene names, we applied an ontology-based SciMiner literature mining strategy to mine all PubMed abstracts and retrieve E. coli vaccine-associated E. coli gene interactions. Four centrality metrics (i.e., degree, eigenvector, closeness, and betweenness) were calculated for identifying highly ranked genes and interaction types. Results Using vaccine-related PubMed abstracts, our study identified 11,350 sentences that contain 88 unique INO interaction types and 1,781 unique E. coli genes. Each sentence contained at least one interaction type and two unique E. coli genes. An E. coli gene interaction network of genes and INO interaction types was created. From this big network, a sub-network consisting of 5 E. coli vaccine genes, including carA, carB, fimH, fepA, and vat, and 62 other E. coli genes, and 25 INO interaction types was identified. While many interaction types represent direct interactions between two indicated genes, our study has also shown that many of these retrieved interaction types are indirect in that the two genes participated in the specified interaction process in a required but indirect process. Our centrality analysis of these gene interaction networks identified top-ranked E. coli genes and 6 INO interaction types (e.g., regulation and gene expression). Conclusions A vaccine-related E. coli gene-gene interaction network was constructed using an ontology-based literature mining strategy, which identified important E. coli vaccine genes and their interactions with other genes through specific interaction types.
Affiliation(s)
- Junguk Hur
  - Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND, 58202, USA
- Arzucan Özgür
  - Department of Computer Engineering, Bogazici University, Istanbul, 34342, Turkey
- Yongqun He
  - Department of Microbiology and Immunology, Unit for Laboratory Animal Medicine, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
  - Department of Microbiology and Immunology, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
  - Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
  - Comprehensive Cancer Center, University of Michigan Medical School, Ann Arbor, MI, 48109, USA