1
Yu B, Zhao Y, Jiang L, Zhou J, Xu H, Lei L, Xu L, Wang X, Bu S. Network pharmacology and experimental validation of Compound Kushen Powder for the treatment of diarrhea in vivo. Vet Anim Sci 2025; 28:100443. PMID: 40206406; PMCID: PMC11979447; DOI: 10.1016/j.vas.2025.100443.
Abstract
To explore the mechanism of sophora flavescens, cortex fraxini, and pomegranate peel complex powder (Compound Kushen Powder) in the treatment of animal diarrhea, a network pharmacology approach leveraging databases like TCMSP and SwissTarget was applied in this study. Molecular docking was executed between the primary constituents and pivotal targets, enabling an additional refinement of main targets and key medications. Subsequently, a rat diarrhea model induced by folium sennae leaves was established for in vivo validation. The rats were divided into four groups: negative control group, positive control group, positive drug treatment group, and Compound Kushen Powder treatment group. Key protein targets, such as Caspase-3, IL-1β, IL-10, MMP9, STAT3, TNF, TP53, and VEGFA, essential for mitigating diarrhea in response to the composite medication were found through network pharmacology. Additionally, the results of molecular docking analysis unveiled fundamental constituents of Compound Kushen Powder, namely beta-sitosterol, ursolic acid, formononetin, and matrine, which demonstrated significant binding affinities with those identified key protein targets. The results of mRNA and protein expression analyses of rat colonic tissue validated the in vivo alterations of core genes identified through network screening. Except for IL-10 and STAT3, the expression of all targets exhibited noteworthy reductions when compared to the positive control group (P < 0.05). These results demonstrated that Compound Kushen Powder can inhibit inflammation and regulate cell apoptosis by modulating signaling pathways such as IL-17, TNF-α, MAPK, and NF-κB. Collectively, this study sheds light on the traditional application of complex powder for the prevention and treatment of diarrhea.
Affiliation(s)
- Bo Yu
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- College of Veterinary Medicine, Yangzhou University, Yangzhou 225009, China
- Jiangsu Co-innovation Center for Prevention and Control of Important Animal Infectious Diseases and Zoonoses, Yangzhou 225009, China
- Yuanfeng Zhao
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Lingling Jiang
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Jingrui Zhou
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Haoxiang Xu
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Lu Lei
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Longxin Xu
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Xin Wang
- Institute of Animal Husbandry and Veterinary Medicine, Guizhou Academy of Agricultural Sciences, Guiyang 550005, China
- Shijin Bu
- College of Veterinary Medicine, Yangzhou University, Yangzhou 225009, China
- Jiangsu Co-innovation Center for Prevention and Control of Important Animal Infectious Diseases and Zoonoses, Yangzhou 225009, China
2
Saad E, Kishk S, Ali-Eldin A, Saleh AI. SB-AGT: A stochastic beam search-enhanced attention-based Gumbel tree framework for drug-drug interaction extraction from biomedical literature. Comput Biol Med 2025; 189:110011. PMID: 40086288; DOI: 10.1016/j.compbiomed.2025.110011.
Abstract
Detection of drug-drug interactions (DDI) and chemical-protein interactions (CPI) is crucial for patient safety, as unidentified interactions may lead to severe adverse drug reactions (ADRs). While extensive DDI and CPI information exists within the biomedical literature, manual extraction by experts is time-intensive and resource-demanding. This study presents SB-AGT (Stochastic Beam Search-enhanced Attention-based Gumbel Tree), an approach to automated DDI and CPI extraction built on an attention-based architecture that incorporates a modified Gumbel-Tree method with stochastic beam search optimization. SB-AGT leverages multi-head attention mechanisms and enhanced latent tree structures to capture complex syntactic features and drug relationship patterns. The methodology also includes comprehensive preprocessing protocols and addresses dataset imbalance through hybrid sampling techniques. Experimental validation on the DDIExtraction 2013 and CHEMPROT datasets demonstrates the model's effectiveness, with precision surpassing traditional approaches and competitive performance compared with contemporary pre-trained models. Through parametric analysis, we establish optimal beam search configurations that maximize the model's extraction accuracy.
Affiliation(s)
- Eman Saad
- Communication and Information Technology Center, Mansoura University, Egypt.
- Sherif Kishk
- Electronics and Communication Engineering Dept., Mansoura University, Egypt.
- Amr Ali-Eldin
- Computer Engineering & Control Systems Dept., Mansoura University, Egypt.
- Ahmed I Saleh
- Computer Engineering & Control Systems Dept., Mansoura University, Egypt.
3
Zhang Y, Sui X, Pan F, Yu K, Li K, Tian S, Erdengasileng A, Han Q, Wang W, Wang J, Wang J, Sun D, Chung H, Zhou J, Zhou E, Lee B, Zhang P, Qiu X, Zhao T, Zhang J. A comprehensive large scale biomedical knowledge graph for AI powered data driven biomedical research. bioRxiv [Preprint] 2025:2023.10.13.562216. PMID: 38168218; PMCID: PMC10760044; DOI: 10.1101/2023.10.13.562216.
Abstract
To address the rapid growth of scientific publications and data in biomedical research, knowledge graphs (KGs) have become a critical tool for integrating large volumes of heterogeneous data to enable efficient information retrieval and automated knowledge discovery (AKD). However, transforming unstructured scientific literature into KGs remains a significant challenge, with previous methods unable to achieve human-level accuracy. In this study, we utilized an information extraction pipeline that won first place in the LitCoin NLP Challenge (2022) to construct a large-scale KG named iKraph using all PubMed abstracts. The extracted information matches human expert annotations and significantly exceeds the content of manually curated public databases. To enhance the KG's comprehensiveness, we integrated relation data from 40 public databases and relation information inferred from high-throughput genomics data. This KG facilitates rigorous performance evaluation of AKD, which was infeasible in previous studies. We designed an interpretable, probabilistic-based inference method to identify indirect causal relations and applied it to real-time COVID-19 drug repurposing from March 2020 to May 2023. Our method identified 600-1400 candidate drugs per month, with one-third of those discovered in the first two months later supported by clinical trials or PubMed publications. These outcomes are very challenging to attain through alternative approaches that lack a thorough understanding of the existing literature. A cloud-based platform (https://biokde.insilicom.com) was developed for academic users to access this rich structured data and associated tools.
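The indirect-relation idea can be illustrated with a small, hedged sketch: per-edge confidences along two-hop paths are combined with a noisy-OR rule. The graph content and the aggregation rule below are illustrative assumptions, not the scoring function used to build iKraph.

```python
# Illustrative sketch of scoring an indirect (two-hop) causal relation in a
# knowledge graph by combining per-edge confidences with a noisy-OR rule.
# Both the toy edges and the noisy-OR aggregation are assumptions for
# illustration only.
from collections import defaultdict

# edges[source][target] = confidence that "source causally affects target"
edges = defaultdict(dict)
edges["drugX"]["geneA"] = 0.9
edges["geneA"]["covid19"] = 0.7
edges["drugX"]["geneB"] = 0.6
edges["geneB"]["covid19"] = 0.5

def indirect_score(src: str, dst: str) -> float:
    """Probability that at least one two-hop path supports src -> dst."""
    p_no_support = 1.0
    for mid, p1 in edges[src].items():
        p2 = edges[mid].get(dst)
        if p2 is not None:
            p_no_support *= 1.0 - p1 * p2  # the path fails unless both hops hold
    return 1.0 - p_no_support

print(indirect_score("drugX", "covid19"))  # 0.741
```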
Affiliation(s)
- Yuan Zhang
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Insilicom LLC, Tallahassee, FL 32303
- Xin Sui
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Feng Pan
- Insilicom LLC, Tallahassee, FL 32303
- Keqiao Li
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Shubo Tian
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Qing Han
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Wanjing Wang
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Jian Wang
- 977 Wisteria Ter., Sunnyvale, CA 94086
- Jun Zhou
- Insilicom LLC, Tallahassee, FL 32303
- Eric Zhou
- Insilicom LLC, Tallahassee, FL 32303
- Ben Lee
- Insilicom LLC, Tallahassee, FL 32303
- Peili Zhang
- Forward Informatics, Winchester, Massachusetts, 01890
- Xing Qiu
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY 14642
- Tingting Zhao
- Insilicom LLC, Tallahassee, FL 32303
- Department of Geography, Florida State University, Tallahassee, FL 32306
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL 32306
- Insilicom LLC, Tallahassee, FL 32303
4
He J, Li F, Li J, Hu X, Nian Y, Xiang Y, Wang J, Wei Q, Li Y, Xu H, Tao C. Prompt Tuning in Biomedical Relation Extraction. Journal of Healthcare Informatics Research 2024; 8:206-224. PMID: 38681754; PMCID: PMC11052745; DOI: 10.1007/s41666-024-00162-9.
Abstract
Biomedical relation extraction (RE) is critical in constructing high-quality knowledge graphs and databases as well as supporting many downstream text mining applications. This paper explores prompt tuning on biomedical RE and its few-shot scenarios, aiming to propose a simple yet effective model for this specific task. Prompt tuning reformulates natural language processing (NLP) downstream tasks into masked language problems by embedding specific text prompts into the original input, facilitating the adaption of pre-trained language models (PLMs) to better address these tasks. This study presents a customized prompt tuning model designed explicitly for biomedical RE, including its applicability in few-shot learning contexts. The model's performance was rigorously assessed using the chemical-protein relation (CHEMPROT) dataset from BioCreative VI and the drug-drug interaction (DDI) dataset from SemEval-2013, showcasing its superior performance over conventional fine-tuned PLMs across both datasets, encompassing few-shot scenarios. This observation underscores the effectiveness of prompt tuning in enhancing the capabilities of conventional PLMs, though the extent of enhancement may vary by specific model. Additionally, the model demonstrated a harmonious balance between simplicity and efficiency, matching state-of-the-art performance without needing external knowledge or extra computational resources. The pivotal contribution of our study is the development of a suitably designed prompt tuning model, highlighting prompt tuning's effectiveness in biomedical RE. It offers a robust, efficient approach to the field's challenges and represents a significant advancement in extracting complex relations from biomedical texts. Supplementary Information The online version contains supplementary material available at 10.1007/s41666-024-00162-9.
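As a rough illustration of the prompt-tuning idea described above, the sketch below wraps a sentence in a template with a [MASK] slot and lets a masked language model score a handful of verbalizer words that map to relation labels. The model name, template, and verbalizer are assumptions for illustration, not the configuration reported in the paper.

```python
# Minimal sketch of prompt-based relation classification with a masked LM.
# The checkpoint, template, and verbalizer are illustrative assumptions.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "bert-base-uncased"  # a biomedical PLM would be used in practice
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# Verbalizer: one vocabulary word per relation label. For this simple scoring
# to be meaningful, each word should be a single token in the PLM vocabulary.
verbalizer = {"blocks": "inhibition", "activates": "activation", "ignores": "no-relation"}

def classify(sentence: str, chemical: str, protein: str) -> str:
    # Wrap the original input in a prompt that turns RE into a cloze task.
    prompt = f"{sentence} In this sentence, {chemical} {tokenizer.mask_token} {protein}."
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index]
    # Score only the verbalizer words and map the best one to its label.
    scores = {w: logits[tokenizer.convert_tokens_to_ids(w)].item() for w in verbalizer}
    return verbalizer[max(scores, key=scores.get)]

print(classify("Aspirin irreversibly inhibits COX-1.", "aspirin", "COX-1"))
```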
Affiliation(s)
- Jianping He
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Fang Li
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
- Jianfu Li
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
- Xinyue Hu
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
- Yi Nian
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Yang Xiang
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Jingqi Wang
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Qiang Wei
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Yiming Li
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Hua Xu
- Department of Bioinformatics and Data Science, Yale School of Medicine, New Haven, CT USA
- Cui Tao
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL USA
5
Huang MS, Han JC, Lin PY, You YT, Tsai RTH, Hsu WL. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Brief Bioinform 2024; 25:bbae132. PMID: 38609331; PMCID: PMC11014787; DOI: 10.1093/bib/bbae132.
Abstract
Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.
Affiliation(s)
- Ming-Siang Huang
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan
- Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
- Jen-Chieh Han
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Pei-Yen Lin
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Yu-Ting You
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Richard Tzong-Han Tsai
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Center for Geographic Information Science, Research Center for Humanities and Social Sciences, Academia Sinica, Taipei, Taiwan
- Wen-Lian Hsu
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
6
Whitton J, Hunter A. Automated tabulation of clinical trial results: A joint entity and relation extraction approach with transformer-based language representations. Artif Intell Med 2023; 144:102661. PMID: 37783549; DOI: 10.1016/j.artmed.2023.102661.
Abstract
Evidence-based medicine, the practice in which healthcare professionals refer to the best available evidence when making decisions, forms the foundation of modern healthcare. However, it relies on labour-intensive systematic reviews, where domain specialists must aggregate and extract information from thousands of publications, primarily of randomised controlled trial (RCT) results, into evidence tables. This paper investigates automating evidence table generation by decomposing the problem across two language processing tasks: named entity recognition, which identifies key entities within text, such as drug names, and relation extraction, which maps their relationships for separating them into ordered tuples. We focus on the automatic tabulation of sentences from published RCT abstracts that report the results of the study outcomes. Two deep neural net models were developed as part of a joint extraction pipeline, using the principles of transfer learning and transformer-based language representations. To train and test these models, a new gold-standard corpus was developed, comprising over 550 result sentences from six disease areas. This approach demonstrated significant advantages, with our system performing well across multiple natural language processing tasks and disease areas, as well as in generalising to disease domains unseen during training. Furthermore, we show these results were achievable through training our models on as few as 170 example sentences. The final system is a proof of concept that the generation of evidence tables can be semi-automated, representing a step towards fully automating systematic reviews.
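A minimal sketch of the tabulation step is given below: entities produced by an NER model and pairwise links produced by an RE model are joined into ordered tuples (table rows). The entity types, relation pairs, and example sentence are hypothetical, not the authors' schema.

```python
# Sketch of assembling evidence-table rows from hypothetical NER and RE output.
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    text: str
    etype: str  # e.g. "intervention", "outcome", "measure" (assumed types)

# Hypothetical NER output for one result sentence.
entities = [
    Entity("semaglutide", "intervention"),
    Entity("HbA1c", "outcome"),
    Entity("-1.5%", "measure"),
]

# Hypothetical RE output: (head, tail) pairs judged to be related.
relations = {("semaglutide", "HbA1c"), ("HbA1c", "-1.5%")}

def to_rows(entities, relations):
    """Join intervention -> outcome -> measure chains into ordered table rows."""
    rows = []
    for iv in (e for e in entities if e.etype == "intervention"):
        for oc in (e for e in entities if e.etype == "outcome"):
            if (iv.text, oc.text) not in relations:
                continue
            for ms in (e for e in entities if e.etype == "measure"):
                if (oc.text, ms.text) in relations:
                    rows.append((iv.text, oc.text, ms.text))
    return rows

print(to_rows(entities, relations))  # [('semaglutide', 'HbA1c', '-1.5%')]
```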
Affiliation(s)
- Jetsun Whitton
- Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK.
- Anthony Hunter
- Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK.
7
Ai X, Kavuluru R. End-to-End Models for Chemical-Protein Interaction Extraction: Better Tokenization and Span-Based Pipeline Strategies. IEEE International Conference on Healthcare Informatics 2023; 2023:610-618. PMID: 38274947; PMCID: PMC10809256; DOI: 10.1109/ichi57859.2023.00108.
Abstract
End-to-end relation extraction (E2ERE) is an important task in information extraction, more so for biomedicine as scientific literature continues to grow exponentially. E2ERE typically involves identifying entities (or named entity recognition (NER)) and associated relations, while most RE tasks simply assume that the entities are provided upfront and end up performing relation classification. E2ERE is inherently more difficult than RE alone given the potential snowball effect of errors from NER leading to more errors in RE. A complex dataset in biomedical E2ERE is the ChemProt dataset (BioCreative VI, 2017) that identifies relations between chemical compounds and genes/proteins in scientific literature. ChemProt is included in all recent biomedical natural language processing benchmarks including BLUE, BLURB, and BigBio. However, its treatment in these benchmarks and in other separate efforts is typically not end-to-end, with few exceptions. In this effort, we employ a span-based pipeline approach to produce a new state-of-the-art E2ERE performance on the ChemProt dataset, resulting in > 4% improvement in F1-score over the prior best effort. Our results indicate that a straightforward fine-grained tokenization scheme helps span-based approaches excel in E2ERE, especially with regards to handling complex named entities. Our error analysis also identifies a few key failure modes in E2ERE for ChemProt.
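One plausible reading of the fine-grained tokenization scheme is sketched below: splitting on every transition between letters, digits, and punctuation so that the boundaries of complex chemical and gene names fall on token boundaries. The exact scheme used in the paper may differ.

```python
# Rough sketch of fine-grained tokenization for span-based NER: separate
# alphabetic runs, digit runs, and individual punctuation characters.
import re

TOKEN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

def fine_grained_tokenize(text):
    return TOKEN.findall(text)

print(fine_grained_tokenize("IL-2/IL-15Rbeta agonist ALT-803"))
# ['IL', '-', '2', '/', 'IL', '-', '15', 'Rbeta', 'agonist', 'ALT', '-', '803']
```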
Affiliation(s)
- Xuguang Ai
- Department of Computer Science, University of Kentucky, Lexington, USA
- Ramakanth Kavuluru
- Division of Biomedical Informatics, Dept. of Internal Medicine, University of Kentucky, Lexington, USA
8
Bokharaeian B, Dehghani M, Diaz A. Automatic extraction of ranked SNP-phenotype associations from text using a BERT-LSTM-based method. BMC Bioinformatics 2023; 24:144. PMID: 37046202; PMCID: PMC10099837; DOI: 10.1186/s12859-023-05236-w.
Abstract
Extraction of associations between single nucleotide polymorphisms (SNPs) and phenotypes from biomedical literature is a vital task in BioNLP. Recently, some methods have been developed to extract mutation-disease associations, but no accessible method for extracting SNP-phenotype associations from text considers their degree of certainty. In this paper, several machine learning methods were developed to extract ranked SNP-phenotype associations from biomedical abstracts and were compared with one another. These included shallow machine learning methods (random forest, logistic regression, and decision tree), two kernel-based methods (subtree and local context kernels), a rule-based method, a deep CNN-LSTM-based method, and two BERT-based methods. The experiments indicated that although the linguistic features used could support an association extraction method that outperforms the kernel-based counterparts, the deep learning and BERT-based methods exhibited the best performance, with the PubMedBERT-LSTM model performing best among the developed methods. Moreover, similar experiments were conducted to estimate the degree of certainty of the extracted associations, which can be used to assess the strength of a reported association. These experiments revealed that the proposed PubMedBERT-CNN-LSTM method outperformed the other sophisticated methods on this task.
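A hedged sketch of a BERT-CNN-LSTM classifier of the general shape described above is shown below; the checkpoint id, layer sizes, and label count are assumptions, not the authors' configuration.

```python
# Minimal BERT-CNN-LSTM sentence classifier sketch (assumed hyperparameters).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class BertCnnLstmClassifier(nn.Module):
    def __init__(self, encoder_name: str, num_labels: int = 2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.conv = nn.Conv1d(hidden, 128, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(128, 64, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * 64, num_labels)

    def forward(self, input_ids, attention_mask):
        states = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        # Conv1d expects (batch, channels, sequence), so transpose around it.
        feats = torch.relu(self.conv(states.transpose(1, 2))).transpose(1, 2)
        lstm_out, _ = self.lstm(feats)
        return self.classifier(lstm_out[:, 0, :])  # classify from the first position

name = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(name)
model = BertCnnLstmClassifier(name)
batch = tokenizer(["rs12345 is associated with hypertension."],
                  return_tensors="pt", padding=True)
print(model(batch["input_ids"], batch["attention_mask"]).shape)  # torch.Size([1, 2])
```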
Affiliation(s)
- Mohammad Dehghani
- School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran
- Alberto Diaz
- Facultad Informatica, Complutense University of Madrid, Madrid, Spain
9
Kumar A, Sharaff A. ABEE: automated bio entity extraction from biomedical text documents. Data Technologies and Applications 2022. DOI: 10.1108/dta-04-2022-0151.
Abstract
Purpose: The purpose of this study was to design a multitask learning model so that biomedical entities can be extracted from biomedical texts without any ambiguity. Design/methodology/approach: In the proposed automated bio entity extraction (ABEE) model, a multitask learning model has been introduced as a combination of single-task learning models. The model used Bidirectional Encoder Representations from Transformers to train the single-task learning models and then combined their outputs to identify the variety of entities in biomedical text. Findings: The proposed ABEE model targeted unique gene/protein, chemical and disease entities in biomedical text. The findings are important for biomedical research such as drug finding and clinical trials, as this work helps to reduce not only the effort of researchers but also the cost of new drug discoveries and new treatments. Research limitations/implications: As such, there are no limitations with the model, but the research team plans to test the model with gigabytes of data and to establish a knowledge graph so that researchers can easily estimate the entities of similar groups. Practical implications: The ABEE model will be helpful in various natural language processing tasks: in information extraction (IE) it plays an important role in biomedical named entity recognition and biomedical relation extraction, and it also supports information retrieval tasks such as literature-based knowledge discovery. Social implications: During the COVID-19 pandemic, the demand for this type of work increased because of the increase in clinical trials at that time. If this type of research had been introduced previously, it would have reduced the time and effort needed for new drug discoveries in this area. Originality/value: In this work we proposed a novel multitask learning model that is capable of extracting biomedical entities from biomedical text without any ambiguity. The proposed model achieved state-of-the-art performance in terms of precision, recall and F1 score.
10
Hou Y, Xia Y, Wu L, Xie S, Fan Y, Zhu J, Qin T, Liu TY. Discovering drug-target interaction knowledge from biomedical literature. Bioinformatics 2022; 38:5100-5107. PMID: 36205562; DOI: 10.1093/bioinformatics/btac648.
Abstract
MOTIVATION The interaction between drugs and targets (DTI) in human body plays a crucial role in biomedical science and applications. As millions of papers come out every year in the biomedical domain, automatically discovering DTI knowledge from biomedical literature, which are usually triplets about drugs, targets and their interaction, becomes an urgent demand in the industry. Existing methods of discovering biological knowledge are mainly extractive approaches that often require detailed annotations (e.g. all mentions of biological entities, relations between every two entity mentions, etc.). However, it is difficult and costly to obtain sufficient annotations due to the requirement of expert knowledge from biomedical domains. RESULTS To overcome these difficulties, we explore an end-to-end solution for this task by using generative approaches. We regard the DTI triplets as a sequence and use a Transformer-based model to directly generate them without using the detailed annotations of entities and relations. Further, we propose a semi-supervised method, which leverages the aforementioned end-to-end model to filter unlabeled literature and label them. Experimental results show that our method significantly outperforms extractive baselines on DTI discovery. We also create a dataset, KD-DTI, to advance this task and release it to the community. AVAILABILITY AND IMPLEMENTATION Our code and data are available at https://github.com/bert-nmt/BERT-DTI. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
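The generate-then-parse pattern behind this kind of generative extraction can be sketched as follows; the separator tokens are assumptions, not the exact target format used for KD-DTI.

```python
# Sketch of linearizing DTI triplets into a seq2seq target string and parsing
# a generated string back into (drug, target, interaction) tuples.
def linearize(triplets):
    return " <sep> ".join(f"{d} | {t} | {i}" for d, t, i in triplets)

def parse(generated: str):
    triplets = []
    for chunk in generated.split("<sep>"):
        parts = [p.strip() for p in chunk.split("|")]
        if len(parts) == 3 and all(parts):  # drop malformed generations
            triplets.append(tuple(parts))
    return triplets

target = linearize([("imatinib", "ABL1", "inhibitor")])
print(target)         # "imatinib | ABL1 | inhibitor"
print(parse(target))  # [('imatinib', 'ABL1', 'inhibitor')]
```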
Affiliation(s)
- Yutai Hou
- Harbin Institute of Technology, Harbin 150001, China
- Yingce Xia
- Microsoft Research, Beijing 100080, China
- Lijun Wu
- Microsoft Research, Beijing 100080, China
- Yang Fan
- University of Science and Technology of China, Hefei 230027, China
- Jinhua Zhu
- University of Science and Technology of China, Hefei 230027, China
- Tao Qin
- Microsoft Research, Beijing 100080, China
11
Nicholson DN, Himmelstein DS, Greene CS. Expanding a database-derived biomedical knowledge graph via multi-relation extraction from biomedical abstracts. BioData Min 2022; 15:26. PMID: 36258252; PMCID: PMC9578183; DOI: 10.1186/s13040-022-00311-z.
Abstract
BACKGROUND Knowledge graphs support biomedical research efforts by providing contextual information for biomedical entities, constructing networks, and supporting the interpretation of high-throughput analyses. These databases are populated via manual curation, which is challenging to scale with an exponentially rising publication rate. Data programming is a paradigm that circumvents this arduous manual process by combining databases with simple rules and heuristics written as label functions, which are programs designed to annotate textual data automatically. Unfortunately, writing a useful label function requires substantial error analysis and is a nontrivial task that takes multiple days per function. This bottleneck makes populating a knowledge graph with multiple nodes and edge types practically infeasible. Thus, we sought to accelerate the label function creation process by evaluating how label functions can be re-used across multiple edge types. RESULTS We obtained entity-tagged abstracts and subsetted these entities to only contain compounds, genes, and disease mentions. We extracted sentences containing co-mentions of certain biomedical entities contained in a previously described knowledge graph, Hetionet v1. We trained a baseline model that used database-only label functions and then used a sampling approach to measure how well adding edge-specific or edge-mismatch label function combinations improved over our baseline. Next, we trained a discriminator model to detect sentences that indicated a biomedical relationship and then estimated the number of edge types that could be recalled and added to Hetionet v1. We found that adding edge-mismatch label functions rarely improved relationship extraction, while control edge-specific label functions did. There were two exceptions to this trend, Compound-binds-Gene and Gene-interacts-Gene, which both indicated physical relationships and showed signs of transferability. Across the scenarios tested, discriminative model performance strongly depends on generated annotations. Using the best discriminative model for each edge type, we recalled close to 30% of established edges within Hetionet v1. CONCLUSIONS Our results show that this framework can incorporate novel edges into our source knowledge graph. However, results with label function transfer were mixed. Only label functions describing very similar edge types supported improved performance when transferred. We expect that the continued development of this strategy may provide essential building blocks to populating biomedical knowledge graphs with discoveries, ensuring that these resources include cutting-edge results.
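The label-function idea can be sketched briefly: simple programs vote on whether a sentence expresses an edge, and the votes are aggregated into a noisy label. The rules and the majority-vote aggregation below are illustrative assumptions, not the label functions or the generative model used in the study.

```python
# Toy label functions for Compound-binds-Gene sentences plus a naive
# majority-vote aggregator (illustrative only).
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

KNOWN_CBG_PAIRS = {("imatinib", "ABL1")}  # stand-in for a database-backed check

def lf_database(sentence, compound, gene):
    return POSITIVE if (compound, gene) in KNOWN_CBG_PAIRS else ABSTAIN

def lf_binds_keyword(sentence, compound, gene):
    return POSITIVE if "binds" in sentence.lower() else ABSTAIN

def lf_no_interaction(sentence, compound, gene):
    return NEGATIVE if "does not interact" in sentence.lower() else ABSTAIN

LABEL_FUNCTIONS = [lf_database, lf_binds_keyword, lf_no_interaction]

def weak_label(sentence, compound, gene):
    votes = [lf(sentence, compound, gene) for lf in LABEL_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return POSITIVE if votes.count(POSITIVE) >= votes.count(NEGATIVE) else NEGATIVE

print(weak_label("Imatinib binds the ABL1 kinase domain.", "imatinib", "ABL1"))  # 1
```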
Affiliation(s)
- David N. Nicholson
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Daniel S. Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- Casey S. Greene
- Department of Biomedical Informatics, University of Colorado School of Medicine and Center for Health Artificial Intelligence (CHAI), University of Colorado School of Medicine, Aurora, USA
12
Kim H, Sung M, Yoon W, Park S, Kang J. Full-text chemical identification with improved generalizability and tagging consistency. Database (Oxford) 2022; 2022:6726385. PMID: 36170114; PMCID: PMC9518746; DOI: 10.1093/database/baac074.
Abstract
Chemical identification involves finding chemical entities in text (i.e. named entity recognition) and assigning unique identifiers to the entities (i.e. named entity normalization). While current models are developed and evaluated based on article titles and abstracts, their effectiveness has not been thoroughly verified in full text. In this paper, we identify two limitations of models in tagging full-text articles: (1) low generalizability to unseen mentions and (2) tagging inconsistency. We use simple training and post-processing methods to address the limitations such as transfer learning and mention-wise majority voting. We also present a hybrid model for the normalization task that utilizes the high recall of a neural model while maintaining the high precision of a dictionary model. In the BioCreative VII NLM-Chem track challenge, our best model achieves 86.72 and 78.31 F1 scores in named entity recognition and normalization, significantly outperforming the median (83.73 and 77.49 F1 scores) and taking first place in named entity recognition. In a post-challenge evaluation, we re-implement our model and obtain 84.70 F1 score in the normalization task, outperforming the best score in the challenge by 3.34 F1 score. Database URL: https://github.com/dmis-lab/bc7-chem-id
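Mention-wise majority voting, one of the post-processing steps mentioned above, can be sketched as follows; the data structures are illustrative assumptions.

```python
# Sketch of mention-wise majority voting for tagging consistency: every
# occurrence of the same mention string in a document receives the label most
# often predicted for that string.
from collections import Counter, defaultdict

# Hypothetical per-occurrence predictions: (mention text, predicted identifier)
predictions = [
    ("aspirin", "D001241"),
    ("aspirin", "D001241"),
    ("aspirin", "D000893"),   # one inconsistent prediction
    ("ibuprofen", "D007052"),
]

def majority_vote(preds):
    by_mention = defaultdict(Counter)
    for mention, label in preds:
        by_mention[mention][label] += 1
    consensus = {m: c.most_common(1)[0][0] for m, c in by_mention.items()}
    return [(m, consensus[m]) for m, _ in preds]

print(majority_vote(predictions))
# every "aspirin" occurrence is re-labelled D001241
```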
Affiliation(s)
- Hyunjae Kim
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Mujeen Sung
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Wonjin Yoon
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Sungjoon Park
- Department of Medicine, University of California, San Diego, CA, USA
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- AIGEN Sciences, Seoul, South Korea
13
He J, Li F, Hu X, Li J, Nian Y, Wang J, Xiang Y, Wei Q, Xu H, Tao C. Chemical-Protein Relation Extraction with Pre-trained Prompt Tuning. Proceedings of the IEEE International Conference on Healthcare Informatics 2022; 2022:608-609. PMID: 37664001; PMCID: PMC10474649; DOI: 10.1109/ichi54592.2022.00120.
Abstract
Biomedical relation extraction plays a critical role in the construction of high-quality knowledge graphs and databases, which can further support many downstream applications. Pre-trained prompt tuning, as a new paradigm, has shown great potential in many natural language processing (NLP) tasks. Through inserting a piece of text into the original input, prompt converts NLP tasks into masked language problems, which could be better addressed by pre-trained language models (PLMs). In this study, we applied pre-trained prompt tuning to chemical-protein relation extraction using the BioCreative VI CHEMPROT dataset. The experiment results showed that the pre-trained prompt tuning outperformed the baseline approach in chemical-protein interaction classification. We conclude that the prompt tuning can improve the efficiency of the PLMs on chemical-protein relation extraction tasks.
Affiliation(s)
- Jianping He
- School of Biomedical Informatics UTHealth Houston, USA
- Fang Li
- School of Biomedical Informatics UTHealth Houston, USA
- Xinyue Hu
- School of Biomedical Informatics UTHealth Houston, USA
- Jianfu Li
- School of Biomedical Informatics UTHealth Houston, USA
- Yi Nian
- School of Biomedical Informatics UTHealth Houston, USA
- Jingqi Wang
- School of Biomedical Informatics UTHealth Houston, USA
- Yang Xiang
- School of Biomedical Informatics UTHealth Houston, USA
- Qiang Wei
- School of Biomedical Informatics UTHealth Houston, USA
- Hua Xu
- School of Biomedical Informatics UTHealth Houston, USA
- Cui Tao
- School of Biomedical Informatics UTHealth Houston, USA
14
Brincat A, Hofmann M. Automated extraction of genes associated with antibiotic resistance from the biomedical literature. Database (Oxford) 2022; 2022:6520791. PMID: 35134132; PMCID: PMC9263533; DOI: 10.1093/database/baab077.
Abstract
The detection of bacterial antibiotic resistance phenotypes is important when carrying out clinical decisions for patient treatment. Conventional phenotypic testing involves culturing bacteria which requires a significant amount of time and work. Whole-genome sequencing is emerging as a fast alternative to resistance prediction, by considering the presence/absence of certain genes. A lot of research has focused on determining which bacterial genes cause antibiotic resistance and efforts are being made to consolidate these facts in knowledge bases (KBs). KBs are usually manually curated by domain experts to be of the highest quality. However, this limits the pace at which new facts are added. Automated relation extraction of gene-antibiotic resistance relations from the biomedical literature is one solution that can simplify the curation process. This paper reports on the development of a text mining pipeline that takes in English biomedical abstracts and outputs genes that are predicted to cause resistance to antibiotics. To test the generalisability of this pipeline it was then applied to predict genes associated with Helicobacter pylori antibiotic resistance, that are not present in common antibiotic resistance KBs or publications studying H. pylori. These genes would be candidates for further lab-based antibiotic research and inclusion in these KBs. For relation extraction, state-of-the-art deep learning models were used. These models were trained on a newly developed silver corpus which was generated by distant supervision of abstracts using the facts obtained from KBs. The top performing model was superior to a co-occurrence model, achieving a recall of 95%, a precision of 60% and F1-score of 74% on a manually annotated holdout dataset. To our knowledge, this project was the first attempt at developing a complete text mining pipeline that incorporates deep learning models to extract gene-antibiotic resistance relations from the literature. Additional related data can be found at https://github.com/AndreBrincat/Gene-Antibiotic-Resistance-Relation-Extraction
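The distant-supervision step used to build such a silver corpus can be sketched as follows; the KB facts and sentences are illustrative assumptions.

```python
# Sketch of distant supervision: sentences co-mentioning a gene and an
# antibiotic that form a known resistance pair in the KB become positive
# examples, other co-mentions become negative examples.
KB_RESISTANCE_PAIRS = {("gyrA", "ciprofloxacin"), ("rpoB", "rifampicin")}

sentences = [
    "Mutations in gyrA confer resistance to ciprofloxacin in E. coli.",
    "Expression of gyrA was measured after amoxicillin exposure.",
]

def silver_label(sentence, gene, antibiotic):
    if gene in sentence and antibiotic in sentence:
        label = 1 if (gene, antibiotic) in KB_RESISTANCE_PAIRS else 0
        return (sentence, gene, antibiotic, label)
    return None  # no co-mention, no silver example

corpus = [ex for s in sentences
          for ex in [silver_label(s, "gyrA", "ciprofloxacin"),
                     silver_label(s, "gyrA", "amoxicillin")]
          if ex is not None]
print(corpus)  # one positive and one negative example
```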
Affiliation(s)
- Andre Brincat
- Department of Informatics, TU Dublin, Blanchardstown Campus, Dublin D15 YV78, Ireland
- Markus Hofmann
- Department of Informatics, TU Dublin, Blanchardstown Campus, Dublin D15 YV78, Ireland
15
Zanoli R, Lavelli A, Löffler T, Perez Gonzalez NA, Rinaldi F. An annotated dataset for extracting gene-melanoma relations from scientific literature. J Biomed Semantics 2022; 13:2. PMID: 35045882; PMCID: PMC8772125; DOI: 10.1186/s13326-021-00251-3.
Abstract
Background
Melanoma is one of the least common but the deadliest of skin cancers. This cancer begins when the genes of a cell suffer damage or fail, and identifying the genes involved in melanoma is crucial for understanding the melanoma tumorigenesis. Thousands of publications about human melanoma appear every year. However, while biological curation of data is costly and time-consuming, to date the application of machine learning for gene-melanoma relation extraction from text has been severely limited by the lack of annotated resources.
Results
To overcome this lack of resources for melanoma, we have exploited the information of the Melanoma Gene Database (MGDB, a manually curated database of genes involved in human melanoma) to automatically build an annotated dataset of binary relations between gene and melanoma entities occurring in PubMed abstracts. The entities were automatically annotated by state-of-the-art text-mining tools. Their annotation includes both the mention text spans and normalized concept identifiers. The relations among the entities were annotated at concept- and mention-level. The concept-level annotation was produced using the information of the genes in MGDB to decide if a relation holds between a gene and melanoma concept in the whole abstract. The exploitability of this dataset was tested with both traditional machine learning, and neural network-based models like BERT. The models were then used to automatically extract gene-melanoma relations from the biomedical literature. Most of the current models use context-aware representations of the target entities to establish relations between them. To facilitate researchers in their experiments we generated a mention-level annotation in support to the concept-level annotation. The mention-level annotation was generated by automatically linking gene and melanoma mentions co-occurring within the sentences that in MGDB establish the association of the gene with melanoma.
Conclusions
This paper presents a corpus containing gene-melanoma annotated relations. Additionally, it discusses experiments which show the usefulness of such a corpus for training a system capable of mining gene-melanoma relationships from the literature. Researchers can use the corpus to develop and compare their own models, and produce results which might be integrated with existing structured knowledge databases, which in turn might facilitate medical research.
16
Bhasuran B. BioBERT and Similar Approaches for Relation Extraction. Methods Mol Biol 2022; 2496:221-235. PMID: 35713867; DOI: 10.1007/978-1-0716-2305-3_12.
Abstract
In biomedicine, facts about relations between entities (disease, gene, drug, etc.) are hidden in the large trove of 30 million scientific publications. Curated information of this kind has proven to play an important role in various applications such as drug repurposing and precision medicine. Recently, owing to advances in deep learning, a transformer architecture named BERT (Bidirectional Encoder Representations from Transformers) was proposed. This pretrained language model, trained on the BooksCorpus (800M words) and English Wikipedia (2500M words), reported state-of-the-art results in various NLP (Natural Language Processing) tasks, including relation extraction. It is a widely accepted notion that, due to word distribution shift, general-domain models exhibit poor performance on information extraction tasks in the biomedical domain. For this reason, the architecture was later adapted to the biomedical domain by training the language models on 28 million scientific articles from PubMed and PubMed Central. This chapter presents a protocol for relation extraction using BERT and discusses the state of the art for BERT versions in the biomedical domain, such as BioBERT. The protocol emphasizes the general BERT architecture, pretraining and fine-tuning, leveraging biomedical information, and finally knowledge graph infusion into the BERT model layers.
Affiliation(s)
- Balu Bhasuran
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India.
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA.
17
Pourreza Shahri M, Kahanda I. Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes. BMC Bioinformatics 2021; 22:500. PMID: 34656098; PMCID: PMC8520253; DOI: 10.1186/s12859-021-04421-z.
Abstract
Background Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing due to its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitive, automated tools capable of accurately extracting these associations from the biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward. Results In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised, and ensemble learning for classifying human protein-phenotype co-mentions with the help of unlabeled data. This framework allows the ability to incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes with a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of our proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach provides a new state-of-the-art performance in classifying human protein-phenotype co-mentions by outperforming other supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists. Conclusions This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
Affiliation(s)
- Indika Kahanda
- School of Computing, University of North Florida, Jacksonville, USA.
18
Warikoo N, Chang YC, Hsu WL. LBERT: Lexically aware Transformer-based Bidirectional Encoder Representation model for learning universal bio-entity relations. Bioinformatics 2021; 37:404-412. PMID: 32810217; DOI: 10.1093/bioinformatics/btaa721.
Abstract
MOTIVATION Natural Language Processing techniques are constantly being advanced to accommodate the influx of data as well as to provide exhaustive and structured knowledge dissemination. Within the biomedical domain, relation detection between bio-entities known as the Bio-Entity Relation Extraction (BRE) task has a critical function in knowledge structuring. Although recent advances in deep learning-based biomedical domain embedding have improved BRE predictive analytics, these works are often task selective or use external knowledge-based pre-/post-processing. In addition, deep learning-based models do not account for local syntactic contexts, which have improved data representation in many kernel classifier-based models. In this study, we propose a universal BRE model, i.e. LBERT, which is a Lexically aware Transformer-based Bidirectional Encoder Representation model, and which explores both local and global contexts representations for sentence-level classification tasks. RESULTS This article presents one of the most exhaustive BRE studies ever conducted over five different bio-entity relation types. Our model outperforms state-of-the-art deep learning models in protein-protein interaction (PPI), drug-drug interaction and protein-bio-entity relation classification tasks by 0.02%, 11.2% and 41.4%, respectively. LBERT representations show a statistically significant improvement over BioBERT in detecting true bio-entity relation for large corpora like PPI. Our ablation studies clearly indicate the contribution of the lexical features and distance-adjusted attention in improving prediction performance by learning additional local semantic context along with bi-directionally learned global context. AVAILABILITY AND IMPLEMENTATION Github. https://github.com/warikoone/LBERT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Neha Warikoo
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei 112, Taiwan
- Bioinformatics Program, Taiwan International Graduate Program, Institute of Information Science, Academia Sinica, Taipei 115, Taiwan
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Yung-Chun Chang
- Graduate Institute of Data Science, College of Management, Taipei Medical University, Taipei 106, Taiwan
- Clinical Big Data Research Center, Taipei Medical University, Taipei 110, Taiwan
- Pervasive AI Research Labs, Ministry of Science and Technology, Hsinchu City 300, Taiwan
- Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei 11529, Taiwan
- Pervasive AI Research Labs, Ministry of Science and Technology, Hsinchu City 300, Taiwan
19
Shao Y, Li H, Gu J, Qian L, Zhou G. Extraction of causal relations based on SBEL and BERT model. Database (Oxford) 2021; 2021:6133143. PMID: 33570092; PMCID: PMC7904051; DOI: 10.1093/database/baab005.
Abstract
Extraction of causal relations between biomedical entities in the form of Biological Expression Language (BEL) poses a new challenge to the community of biomedical text mining due to the complexity of BEL statements. We propose a simplified form of BEL statements [Simplified Biological Expression Language (SBEL)] to facilitate BEL extraction and employ BERT (Bidirectional Encoder Representation from Transformers) to improve the performance of causal relation extraction (RE). On the one hand, BEL statement extraction is transformed into the extraction of an intermediate form—SBEL statement, which is then further decomposed into two subtasks: entity RE and entity function detection. On the other hand, we use a powerful pretrained BERT model to both extract entity relations and detect entity functions, aiming to improve the performance of two subtasks. Entity relations and functions are then combined into SBEL statements and finally merged into BEL statements. Experimental results on the BioCreative-V Track 4 corpus demonstrate that our method achieves the state-of-the-art performance in BEL statement extraction with F1 scores of 54.8% in Stage 2 evaluation and of 30.1% in Stage 1 evaluation, respectively. Database URL: https://github.com/grapeff/SBEL_datasets
Affiliation(s)
- Yifan Shao
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
- Haoru Li
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
- Jinghang Gu
- Department of Chinese & Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China, 999077
- Longhua Qian
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
- Guodong Zhou
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
20
Abstract
Chemometrics plays a critical role in biosensor-based detection, analysis, and diagnosis. Nowadays, as a branch of artificial intelligence (AI), machine learning (ML) has achieved impressive advances. However, novel advanced ML methods, especially deep learning, which is famous for image analysis, facial recognition, and speech recognition, have remained relatively elusive to the biosensor community. Herein, how ML can benefit biosensors is systematically discussed. The advantages and drawbacks of the most popular ML algorithms are summarized on the basis of sensing data analysis. In particular, deep learning methods such as the convolutional neural network (CNN) and recurrent neural network (RNN) are emphasized. Diverse ML-assisted electrochemical biosensors, wearable electronics, SERS and other spectra-based biosensors, fluorescence biosensors, and colorimetric biosensors are comprehensively discussed. Furthermore, biosensor networks and multibiosensor data fusion are introduced. This review will nicely bridge ML with biosensors and greatly expand chemometrics for detection, analysis, and diagnosis.
Affiliation(s)
- Feiyun Cui
- Department of Chemical Engineering, Worcester Polytechnic Institute, 100 Institute Road, Worcester, Massachusetts 01609, United States
- Yun Yue
- Department of Electrical & Computer Engineering, Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- Yi Zhang
- Department of Biomedical Engineering, University of Connecticut, Storrs, Connecticut 06269, United States
- Ziming Zhang
- Department of Electrical & Computer Engineering, Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- H. Susan Zhou
- Department of Chemical Engineering, Worcester Polytechnic Institute, 100 Institute Road, Worcester, Massachusetts 01609, United States
21
22
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36:1234-1240. PMID: 31501885; PMCID: PMC7703786; DOI: 10.1093/bioinformatics/btz682.
Abstract
MOTIVATION Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. RESULTS We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. AVAILABILITY AND IMPLEMENTATION We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
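A minimal sketch of reusing a BioBERT checkpoint for a downstream relation-classification task through the Hugging Face transformers API is shown below; the hub id and label count are assumptions, and the official weights and fine-tuning code are in the repositories linked above.

```python
# Loading a BioBERT-style checkpoint with a fresh classification head.
# Fine-tuning on labelled RE data (e.g. ChemProt) is required before the
# scores below are meaningful.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # assumed hub id for BioBERT
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

text = "Aspirin inhibits cyclooxygenase-1."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))
```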
Affiliation(s)
- Jinhyuk Lee
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Wonjin Yoon
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Sungdong Kim
- Clova AI Research, Naver Corp, Seong-Nam 13561, Korea
- Donghyeon Kim
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Chan Ho So
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul 02841, Korea
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul 02841, Korea
23
Liu X, Fan J, Dong S. Document-Level Biomedical Relation Extraction Leveraging Pretrained Self-Attention Structure and Entity Replacement: Algorithm and Pretreatment Method Validation Study. JMIR Med Inform 2020; 8:e17644. [PMID: 32469325 PMCID: PMC7314385 DOI: 10.2196/17644] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2019] [Revised: 03/02/2020] [Accepted: 03/19/2020] [Indexed: 01/26/2023] Open
Abstract
Background Most current methods for intrasentence relation extraction in the biomedical literature are inadequate for document-level relation extraction, in which the relationship may cross sentence boundaries. Hence, some approaches have been proposed that extract relations by splitting document-level datasets through heuristic rules and learning methods. However, these approaches can introduce additional noise and do not really solve the problem of intersentence relation extraction. It is challenging to avoid noise while extracting cross-sentence relations. Objective This study aimed to avoid the errors introduced by dividing the document-level dataset, to verify that a self-attention structure can extract biomedical relations from a document with long-distance dependencies and complex semantics, and to discuss the relative benefits of different entity pretreatment methods for biomedical relation extraction. Methods This paper proposes a new data preprocessing method and attempts to apply a pretrained self-attention structure to document-level biomedical relation extraction, using an entity replacement method to capture very long-distance dependencies and complex semantics. Results Compared with state-of-the-art approaches, our method greatly improved precision and increased the F1 score. Through experiments with biomedical entity pretreatments, we found that a model using an entity replacement method can improve performance. Conclusions When all target entity pairs in a document-level dataset are considered as a whole, a pretrained self-attention structure is suitable for capturing very long-distance dependencies and learning the textual context and complicated semantics. A replacement method for biomedical entities is conducive to biomedical relation extraction, especially document-level relation extraction.
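A minimal sketch of the entity replacement pretreatment described here: mentions of the two candidate entities are substituted with generic placeholder tokens before the text is passed to a pretrained self-attention encoder. The placeholder strings, helper function, and toy sentences are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the "entity replacement" idea: swap every mention of the head/tail
# candidate entities for generic placeholder tokens so the encoder sees the pair
# positions rather than the surface names. Placeholders and sentences are invented.
import re

def replace_entities(text, head_mentions, tail_mentions,
                     head_token="@CHEMICAL$", tail_token="@DISEASE$"):
    """Replace every listed mention of the head/tail entities with placeholder tokens."""
    for mention in sorted(head_mentions, key=len, reverse=True):
        text = re.sub(re.escape(mention), head_token, text)
    for mention in sorted(tail_mentions, key=len, reverse=True):
        text = re.sub(re.escape(mention), tail_token, text)
    return text

doc = ("Tamoxifen is widely used in breast cancer treatment. "
       "Long-term tamoxifen use has been associated with endometrial cancer.")
print(replace_entities(doc, ["Tamoxifen", "tamoxifen"], ["endometrial cancer"]))
```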
Affiliation(s)
- Xiaofeng Liu
- Communication and Computer Network Key Laboratory of Guangdong, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Jianye Fan
- Communication and Computer Network Key Laboratory of Guangdong, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
- Shoubin Dong
- Communication and Computer Network Key Laboratory of Guangdong, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
24
Döring K, Qaseem A, Becer M, Li J, Mishra P, Gao M, Kirchner P, Sauter F, Telukunta KK, Moumbock AFA, Thomas P, Günther S. Automated recognition of functional compound-protein relationships in literature. PLoS One 2020; 15:e0220925. [PMID: 32126064 PMCID: PMC7053725 DOI: 10.1371/journal.pone.0220925] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2019] [Accepted: 01/29/2020] [Indexed: 11/18/2022] Open
Abstract
MOTIVATION Much effort has been invested in identifying protein-protein interactions using text mining and machine learning methods. The extraction of functional relationships between chemical compounds and proteins from the literature has received much less attention, and no ready-to-use open-source software has so far been available for this task. METHOD We created a new benchmark dataset of 2,613 sentences from abstracts containing annotations of proteins, small molecules, and their relationships. Two kernel methods, the shallow linguistic kernel and the all-paths graph kernel, were applied to classify these relationships as functional or non-functional. Furthermore, the benefit of interaction verbs in sentences was evaluated. RESULTS In cross-validation on our benchmark dataset, the all-paths graph kernel (AUC value: 84.6%, F1 score: 79.0%) performs slightly better than the shallow linguistic kernel (AUC value: 82.5%, F1 score: 77.2%). Both models achieve state-of-the-art performance in the research area of relation extraction. Furthermore, combining the shallow linguistic and all-paths graph kernels could further increase overall performance slightly. We used each of the two kernels to identify functional relationships in all 29 million PubMed abstracts and provide the results, including the recorded processing time. AVAILABILITY The software for the tested kernels, the benchmark, the processed 29 million PubMed abstracts, all evaluation scripts, as well as the scripts for processing the complete PubMed database are freely available at https://github.com/KerstenDoering/CPI-Pipeline.
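The sketch below gives a heavily simplified picture of this classification setup: sentences containing a compound-protein pair are mapped to shallow lexical features and labeled functional or non-functional with an SVM. It is a bag-of-words stand-in, far simpler than the shallow linguistic and all-paths graph kernels evaluated in the paper, and the toy sentences and labels are invented for illustration.

```python
# Reduced sketch: classify candidate sentences as functional (1) or non-functional (0)
# compound-protein relationships using shallow word/bigram features and a linear SVM.
# The real kernels in the paper use richer linguistic and dependency-graph information.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

sentences = [
    "Imatinib inhibits the kinase activity of ABL1.",       # functional
    "Aspirin irreversibly acetylates COX-1.",               # functional
    "Glucose and insulin were both measured in plasma.",    # non-functional
    "The buffer contained EDTA and bovine serum albumin.",  # non-functional
]
labels = [1, 1, 0, 0]

clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), SVC(kernel="linear"))
clf.fit(sentences, labels)
print(clf.predict(["Dasatinib blocks SRC kinase signalling."]))
```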
Affiliation(s)
- Kersten Döring
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Ammar Qaseem
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Michael Becer
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Jianyu Li
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Pankaj Mishra
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Mingjie Gao
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Pascal Kirchner
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Florian Sauter
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Kiran K. Telukunta
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Aurélien F. A. Moumbock
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
- Stefan Günther
- Institute of Pharmaceutical Sciences, Albert-Ludwigs-Universität Freiburg, Freiburg, Germany
25
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36:1234-1240. [PMID: 31501885 DOI: 10.48550/arxiv.1901.08746] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Received: 05/16/2019] [Revised: 07/29/2019] [Accepted: 09/05/2019] [Indexed: 05/20/2023]
26
Lung PY, He Z, Zhao T, Yu D, Zhang J. Extracting chemical-protein interactions from literature using sentence structure analysis and feature engineering. Database (Oxford) 2019; 2019:5280305. [PMID: 30624652 PMCID: PMC6323317 DOI: 10.1093/database/bay138] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2018] [Revised: 12/04/2018] [Accepted: 12/06/2018] [Indexed: 12/14/2022]
Abstract
Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and the development of therapeutic drugs. Manually extracting such information from the biomedical literature is very time- and resource-consuming. In this study, we propose a computational method to automatically extract chemical-protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound and a protein name, and a CPI triplet consists of a CPI pair along with an interaction word describing their relationship. We extracted a diverse set of features from sentences that were used to build multiple machine learning models. Our models contain both simple features, which can be computed directly from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted based on the shortest paths between the CPI pairs or among the CPI triplets in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed best among systems that use non-deep-learning methods and outperformed several deep-learning-based systems in track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods, including deep learning.
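One of the feature families mentioned here, the shortest path between a chemical and a protein mention in the sentence's dependency graph, can be sketched as follows with spaCy and networkx. The example sentence, the chosen tokens, and the small English model are assumptions for illustration, not the authors' actual pipeline.

```python
# Sketch: extract the shortest dependency-graph path between a chemical and a protein
# mention. Requires spaCy plus the small English model
# (python -m spacy download en_core_web_sm); sentence and target tokens are invented.
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Imatinib potently inhibits the BCR-ABL tyrosine kinase.")

# Build an undirected graph over token indices using head -> child dependency edges.
graph = nx.Graph()
for token in doc:
    for child in token.children:
        graph.add_edge(token.i, child.i)

chemical = next(t.i for t in doc if t.text == "Imatinib")
protein = next(t.i for t in doc if t.text == "kinase")
path = nx.shortest_path(graph, source=chemical, target=protein)
print([doc[i].text for i in path])   # tokens on the shortest dependency path
```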
Affiliation(s)
- Pei-Yau Lung
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Zhe He
- School of Information, Florida State University, Tallahassee, FL, USA
- Tingting Zhao
- Department of Geography, Florida State University, Tallahassee, FL, USA
- Disa Yu
- Department of Statistics, Florida State University, Tallahassee, FL, USA
- Jinfeng Zhang
- Department of Statistics, Florida State University, Tallahassee, FL, USA
27
Antunes R, Matos S. Extraction of chemical-protein interactions from the literature using neural networks and narrow instance representation. Database (Oxford) 2019; 2019:baz095. [PMID: 31622463 PMCID: PMC6796919 DOI: 10.1093/database/baz095] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2018] [Revised: 06/28/2019] [Accepted: 07/01/2019] [Indexed: 01/21/2023]
Abstract
The scientific literature contains large amounts of information on genes, proteins, chemicals and their interactions. Extraction and integration of this information in curated knowledge bases help researchers support their experimental results, leading to new hypotheses and discoveries. This is especially relevant for precision medicine, which aims to understand the individual variability across patient groups in order to select the most appropriate treatments. Methods for improved retrieval and automatic relation extraction from biomedical literature are therefore required for collecting structured information from the growing number of published works. In this paper, we follow a deep learning approach for extracting mentions of chemical-protein interactions from biomedical articles, based on various enhancements over our participation in the BioCreative VI CHEMPROT task. A significant aspect of our best method is the use of a simple deep learning model together with a very narrow representation of the relation instances, using only up to 10 words from the shortest dependency path and the respective dependency edges. Bidirectional long short-term memory recurrent networks or convolutional neural networks are used to build the deep learning models. We report the results of several experiments and show that our best model is competitive with more complex sentence representations or network structures, achieving an F1-score of 0.6306 on the test set. The source code of our work, along with detailed statistics, is publicly available.
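A minimal sketch of the narrow-instance idea described here: only the tokens on the shortest dependency path (capped at 10) are embedded and run through a bidirectional LSTM whose final states feed a relation classifier. Vocabulary size, dimensions, and the number of relation classes are assumptions for illustration, not the paper's exact configuration.

```python
# Sketch: a bidirectional LSTM over at most 10 shortest-dependency-path token ids,
# followed by a linear relation classifier. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PathBiLSTM(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=64, n_classes=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, path_ids):              # path_ids: (batch, <=10) token ids
        embedded = self.embed(path_ids)
        _, (h_n, _) = self.lstm(embedded)     # h_n: (2, batch, hidden_dim)
        h = torch.cat([h_n[0], h_n[1]], dim=-1)
        return self.out(h)

if __name__ == "__main__":
    model = PathBiLSTM()
    dummy_paths = torch.randint(1, 5000, (4, 10))   # 4 instances, 10 path tokens each
    print(model(dummy_paths).shape)                 # torch.Size([4, 6])
```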
Affiliation(s)
- Rui Antunes
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal
- Sérgio Matos
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Aveiro, Portugal