1
Yang Y, Lu Y, Zheng Z, Wu H, Lin Y, Qian F, Yan W. MKG-GC: A multi-task learning-based knowledge graph construction framework with personalized application to gastric cancer. Comput Struct Biotechnol J 2024; 23:1339-1347. [PMID: 38585647 PMCID: PMC10995799 DOI: 10.1016/j.csbj.2024.03.021] [Received: 01/16/2024] [Revised: 03/24/2024] [Accepted: 03/24/2024] [Indexed: 04/09/2024] [Open access]
Abstract
Over the past decade, information for precision disease medicine has accumulated in the form of textual data. To effectively utilize this expanding medical text, we proposed a multi-task learning framework based on hard parameter sharing for knowledge graph construction (MKG), and then used it to automatically extract gastric cancer (GC)-related biomedical knowledge from the literature and identify GC drug candidates. In MKG, we designed three separate modules, MT-BGIPN, MT-SGTF and MT-ScBERT, for entity recognition, entity normalization, and relation classification, respectively. To address the challenges posed by the long and irregular naming of medical entities, MT-BGIPN utilized bidirectional gated recurrent unit and interactive pointer network techniques, improving entity recognition accuracy to an average F1 value of 84.5% across datasets. In MT-SGTF, we employed term frequency-inverse document frequency and the gated attention unit. These combine both semantic and characteristic features of entities, resulting in an average Hits@1 score of 94.5% across five datasets. MT-ScBERT integrated cross-text, entity, and context features, yielding an average F1 value of 86.9% across 11 relation classification datasets. Based on MKG, we then developed a specific knowledge graph for GC (MKG-GC), which encompasses a total of 9129 entities and 88,482 triplets. Lastly, MKG-GC was used to predict potential GC drugs using a pre-trained language model called BioKGE-BERT and a drug-disease discriminant model based on CNN-BiLSTM. Remarkably, nine of the top ten predicted drugs have previously been reported as effective for gastric cancer treatment. An online platform for exploration and visualization of MKG-GC is available at https://www.yanglab-mi.org.cn/MKG-GC/.
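The Hits@1 score reported for entity normalization can be read as the share of mentions whose correct concept is ranked first. The following is a minimal sketch of that metric, not the authors' evaluation code; the function name `hits_at_k`, the concept identifiers, and the toy rankings are all hypothetical.

```python
def hits_at_k(ranked_candidates, gold_ids, k=1):
    """Fraction of mentions whose gold concept appears among the top-k candidates."""
    if not ranked_candidates or len(ranked_candidates) != len(gold_ids):
        raise ValueError("need one non-empty ranking per gold label")
    hits = sum(
        1 for candidates, gold in zip(ranked_candidates, gold_ids)
        if gold in candidates[:k]
    )
    return hits / len(ranked_candidates)


# Hypothetical rankings for four entity mentions (best candidate first).
predictions = [
    ["C001", "C007"],
    ["C010", "C002"],  # the gold concept is only ranked second here
    ["C003", "C009"],
    ["C004", "C008"],
]
gold = ["C001", "C002", "C003", "C004"]

hits1 = hits_at_k(predictions, gold, k=1)  # 3 of 4 mentions
hits2 = hits_at_k(predictions, gold, k=2)  # 4 of 4 mentions
```

Raising `k` can only keep the score equal or increase it, which is why Hits@1 is the strictest variant of the metric.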
Affiliation(s)
- Yang Yang
  - Computing Science and Artificial Intelligence College, Suzhou City University, Suzhou 215004, China
  - School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Yuwei Lu
  - School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Zixuan Zheng
  - School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Hao Wu
  - Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
- Yuxin Lin
  - Center for Systems Biology, Soochow University, Suzhou 215123, China
  - Department of Urology, the First Affiliated Hospital of Soochow University, Suzhou 215000, China
- Fuliang Qian
  - Center for Systems Biology, Soochow University, Suzhou 215123, China
  - Medical Center of Soochow University, Suzhou 215123, China
  - Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
- Wenying Yan
  - Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
  - Center for Systems Biology, Soochow University, Suzhou 215123, China
  - Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
2
Peng L, Luo G, Zhou S, Chen J, Xu Z, Sun J, Zhang R. An in-depth evaluation of federated learning on biomedical natural language processing for information extraction. NPJ Digit Med 2024; 7:127. [PMID: 38750290 PMCID: PMC11096157 DOI: 10.1038/s41746-024-01126-4] [Received: 11/01/2023] [Accepted: 04/23/2024] [Indexed: 05/18/2024] [Open access]
Abstract
Language models (LMs) such as BERT and GPT have revolutionized natural language processing (NLP). However, the medical field faces challenges in training LMs due to limited data access and privacy constraints imposed by regulations like the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). Federated learning (FL) offers a decentralized solution that enables collaborative learning while ensuring data privacy. In this study, we evaluated FL on 2 biomedical NLP tasks encompassing 8 corpora using 6 LMs. Our results show that: (1) FL models consistently outperformed models trained on individual clients' data and sometimes performed comparably with models trained on pooled data; (2) with a fixed total amount of data, FL models trained with more clients performed worse, although pre-trained transformer-based models exhibited great resilience; and (3) FL models significantly outperformed pre-trained LLMs with few-shot prompting.
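To make the idea of collaborative training without data sharing concrete, the sketch below shows one server-side aggregation round in the style of federated averaging (FedAvg). The abstract does not name the aggregation rule used in the study, so FedAvg is an assumption here, and the function name, parameter vectors, and client sizes are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    if total <= 0 or len(client_weights) != len(client_sizes):
        raise ValueError("need one positive-sized dataset per client")
    averaged = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        for i, value in enumerate(weights):
            averaged[i] += value * size / total
    return averaged


# Two hypothetical clients: only parameters leave each site, never the raw text.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]  # number of local training examples per client

global_weights = federated_average(clients, sizes)  # [2.5, 3.5]
```

The larger client pulls the global model toward its own parameters, which is one reason splitting a fixed amount of data across more, smaller clients can change the result.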
Affiliation(s)
- Le Peng
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Gaoxiang Luo
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
- Sicheng Zhou
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA
- Jiandong Chen
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA
- Ziyue Xu
  - Nvidia Corporation, Santa Clara, CA, USA
- Ju Sun
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Rui Zhang
  - Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA
3
Wang M, Vijayaraghavan A, Beck T, Posma JM. Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition. J Proteome Res 2024. [PMID: 38733346 DOI: 10.1021/acs.jproteome.3c00367] [Indexed: 05/13/2024]
Abstract
Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
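The F1-score quoted for the annotation pipeline follows from its precision and recall, since F1 is their harmonic mean. A minimal sketch of that arithmetic (the helper name `f1_score` is ours, not taken from the paper's code):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall of the annotation pipeline as stated in the abstract.
pipeline_f1 = f1_score(1.00, 0.76)
print(round(pipeline_f1, 2))  # 0.86
```

Because the harmonic mean is pulled toward the smaller value, perfect precision cannot compensate for the pipeline's lower recall.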
Affiliation(s)
- Meiqi Wang
  - Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.
- Avish Vijayaraghavan
  - Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.
  - UKRI Centre for Doctoral Training in AI for Healthcare, Department of Computing, Imperial College London, London SW7 2AZ, U.K.
- Tim Beck
  - School of Medicine, University of Nottingham, Biodiscovery Institute, Nottingham NG7 2RD, U.K.
  - Health Data Research (HDR) U.K., London NW1 2BE, U.K.
- Joram M Posma
  - Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.
  - Health Data Research (HDR) U.K., London NW1 2BE, U.K.
4
Di Maria A, Bellomo L, Billeci F, Cardillo A, Alaimo S, Ferragina P, Ferro A, Pulvirenti A. NetMe 2.0: a web-based platform for extracting and modeling knowledge from biomedical literature as a labeled graph. Bioinformatics 2024; 40:btae194. [PMID: 38597890 DOI: 10.1093/bioinformatics/btae194] [Received: 01/19/2024] [Revised: 03/29/2024] [Accepted: 04/08/2024] [Indexed: 04/11/2024]
Abstract
MOTIVATION: The rapid increase of biomedical literature makes it harder and harder for scientists to keep pace with the discoveries on which they build their studies. Therefore, computational tools have become more widespread, among which network analysis plays a crucial role in several life-science contexts. Nevertheless, building correct and complete networks about some user-defined biomedical topics on top of the available literature is still challenging. RESULTS: We introduce NetMe 2.0, a web-based platform that automatically extracts relevant biomedical entities and their relations from a set of input texts (i.e., full texts or abstracts of PubMed Central papers, free text, or PDFs uploaded by users) and models them as a BioMedical Knowledge Graph (BKG). NetMe 2.0 also implements an innovative Retrieval Augmented Generation module (Graph-RAG) that works on top of the relationships modeled by the BKG and allows the distilling of well-formed sentences that explain their content. The experimental results show that NetMe 2.0 can infer comprehensive and reliable biological networks with significant Precision-Recall metrics when compared to state-of-the-art approaches. AVAILABILITY AND IMPLEMENTATION: https://netme.click/.
Affiliation(s)
- Antonio Di Maria
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
- Fabrizio Billeci
  - Department of Computer Science, University of Catania, Catania, 95125, Italy
- Alfio Cardillo
  - Department of Computer Science, University of Catania, Catania, 95125, Italy
- Salvatore Alaimo
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
- Paolo Ferragina
  - Department of Computer Science, University of Pisa, Pisa, 56126, Italy
- Alfredo Ferro
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
- Alfredo Pulvirenti
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
5
Tie X, Shin M, Lee C, Perlman SB, Huemann Z, Weisman AJ, Castellino SM, Kelly KM, McCarten KM, Alazraki AL, Hu J, Cho SY, Bradshaw TJ. Automatic Quantification of Serial PET/CT Images for Pediatric Hodgkin Lymphoma Patients Using a Longitudinally-Aware Segmentation Network. ArXiv 2024:arXiv:2404.08611v1. [PMID: 38659641 PMCID: PMC11042444] [Indexed: 04/26/2024]
Abstract
Purpose: Automatic quantification of longitudinal changes in PET scans for lymphoma patients has proven challenging, as residual disease in interim-therapy scans is often subtle and difficult to detect. Our goal was to develop a longitudinally-aware segmentation network (LAS-Net) that can quantify serial PET/CT images for pediatric Hodgkin lymphoma patients. Materials and Methods: This retrospective study included baseline (PET1) and interim (PET2) PET/CT images from 297 patients enrolled in two Children's Oncology Group clinical trials (AHOD1331 and AHOD0831). LAS-Net incorporates longitudinal cross-attention, allowing relevant features from PET1 to inform the analysis of PET2. Model performance was evaluated using Dice coefficients for PET1 and detection F1 scores for PET2. Additionally, we extracted and compared quantitative PET metrics, including metabolic tumor volume (MTV) and total lesion glycolysis (TLG) in PET1, as well as qPET and ΔSUVmax in PET2, against physician measurements. We quantified their agreement using Spearman's ρ correlations and employed bootstrap resampling for statistical analysis. Results: LAS-Net detected residual lymphoma in PET2 with an F1 score of 0.606 (precision/recall: 0.615/0.600), outperforming all comparator methods (P<0.01). For baseline segmentation, LAS-Net achieved a mean Dice score of 0.772. In PET quantification, LAS-Net's measurements of qPET, ΔSUVmax, MTV and TLG were strongly correlated with physician measurements, with Spearman's ρ of 0.78, 0.80, 0.93 and 0.96, respectively. The performance remained high, with a slight decrease, in an external testing cohort. Conclusion: LAS-Net achieved high performance in quantifying PET metrics across serial scans, highlighting the value of longitudinal awareness in evaluating multi-time-point imaging datasets.
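For readers unfamiliar with the Dice coefficient used to score the baseline segmentations, the sketch below computes it for two binary masks as twice the overlap divided by the total foreground size. It is an illustration only: the flattened toy masks and the function name are hypothetical, and the study's own evaluation code is not shown in the abstract.

```python
def dice_coefficient(pred_mask, true_mask):
    """Dice overlap between two equally sized binary masks given as flat sequences."""
    if len(pred_mask) != len(true_mask):
        raise ValueError("masks must have the same number of voxels")
    pred = {i for i, v in enumerate(pred_mask) if v}
    true = {i for i, v in enumerate(true_mask) if v}
    if not pred and not true:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2 * len(pred & true) / (len(pred) + len(true))


# Hypothetical six-voxel masks: two voxels overlap, each mask has three foreground voxels.
predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 0, 1, 1, 1, 0]

dice = dice_coefficient(predicted, reference)  # 2 * 2 / (3 + 3)
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground voxels.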
Affiliation(s)
- Xin Tie
  - Department of Radiology, University of Wisconsin, Madison, WI, USA
  - Department of Medical Physics, University of Wisconsin, Madison, WI, USA
- Muheon Shin
  - Department of Radiology, University of Wisconsin, Madison, WI, USA
- Changhee Lee
  - Department of Radiology, University of Wisconsin, Madison, WI, USA
- Scott B Perlman
  - Department of Radiology, University of Wisconsin, Madison, WI, USA
  - University of Wisconsin Carbone Comprehensive Cancer Center, Madison, WI, USA
- Zachary Huemann
  - Department of Radiology, University of Wisconsin, Madison, WI, USA
- Amy J Weisman
  - Department of Medical Physics, University of Wisconsin, Madison, WI, USA
- Sharon M Castellino
  - Department of Pediatrics, Emory University School of Medicine, Atlanta, GA, USA
  - Aflac Cancer and Blood Disorders Center, Children's Healthcare of Atlanta, Atlanta, GA, USA
- Kara M Kelly
  - Department of Pediatric Oncology, Roswell Park Comprehensive Cancer Center, Buffalo, NY, USA
  - Department of Pediatrics, University at Buffalo Jacobs School of Medicine and Biomedical Sciences, Buffalo, NY, USA
- Kathleen M McCarten
  - Pediatric Radiology, Imaging and Radiation Oncology Core Rhode Island, Lincoln, RI, USA
- Adina L Alazraki
  - Department of Radiology, Emory University School of Medicine and Children's Healthcare of Atlanta, Atlanta, GA, USA
- Junjie Hu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
  - Department of Computer Science, School of Computer, University of Wisconsin, Madison, WI, USA
- Steve Y Cho
  - Department of Radiology, University of Wisconsin, Madison, WI, USA
  - University of Wisconsin Carbone Comprehensive Cancer Center, Madison, WI, USA
- Tyler J Bradshaw
  - Department of Radiology, University of Wisconsin, Madison, WI, USA
6
Alamro H, Gojobori T, Essack M, Gao X. BioBBC: a multi-feature model that enhances the detection of biomedical entities. Sci Rep 2024; 14:7697. [PMID: 38565624 PMCID: PMC10987643 DOI: 10.1038/s41598-024-58334-x] [Received: 10/31/2023] [Accepted: 03/27/2024] [Indexed: 04/04/2024] [Open access]
Abstract
The rapid increase in biomedical publications necessitates efficient systems to automatically handle Biomedical Named Entity Recognition (BioNER) tasks in unstructured text. However, accurately detecting biomedical entities is quite challenging due to the complexity of their names and the frequent use of abbreviations. In this paper, we propose BioBBC, a deep learning (DL) model that utilizes multi-feature embeddings and is built on the BERT-BiLSTM-CRF architecture to address the BioNER task. BioBBC consists of three main layers: an embedding layer, a Bidirectional Long Short-Term Memory (BiLSTM) layer, and a Conditional Random Fields (CRF) layer. BioBBC takes sentences from the biomedical domain as input and identifies the biomedical entities mentioned within the text. The embedding layer generates enriched contextual representation vectors of the input by learning the text through four types of embeddings: part-of-speech tag (POS tag) embedding, char-level embedding, BERT embedding, and data-specific embedding. The BiLSTM layer produces additional syntactic and semantic feature representations. Finally, the CRF layer identifies the best possible tag sequence for the input sentence. Our model is well-constructed and well-optimized for detecting different types of biomedical entities. Based on experimental results, our model outperformed state-of-the-art (SOTA) models with significant improvements on six benchmark BioNER datasets.
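The final step described above, in which the CRF layer picks the best tag sequence, is commonly carried out with Viterbi decoding. The sketch below illustrates that decoding step only, under stated assumptions: the tag set, emission scores, and transition scores are hypothetical hand-written values, whereas a trained model would learn them, and the abstract does not confirm that BioBBC decodes exactly this way.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag path given per-token and tag-to-tag scores."""
    best = [{tag: emissions[0][tag] for tag in tags}]
    backpointers = []
    for emission in emissions[1:]:
        scores, pointers = {}, {}
        for cur in tags:
            # Best previous tag for reaching `cur` at this position.
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emission[cur]
            pointers[cur] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


tags = ["O", "B", "I"]
# Hypothetical transition scores: an inside tag may not directly follow an outside tag.
transitions = {(p, c): 0.0 for p in tags for c in tags}
transitions[("O", "I")] = -5.0
# Hypothetical emission scores for three tokens.
emissions = [
    {"O": 2.0, "B": 0.0, "I": 0.0},
    {"O": 0.5, "B": 1.0, "I": 1.2},  # a per-token argmax would pick the invalid "I"
    {"O": 0.3, "B": 0.0, "I": 1.0},
]

best_path = viterbi_decode(emissions, transitions, tags)
```

Scoring whole sequences rather than single tokens lets the transition penalty override the locally attractive but invalid tag on the second token.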
Affiliation(s)
- Hind Alamro
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia
- Takashi Gojobori
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Magbubah Essack
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Xin Gao
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
7
Tie X, Shin M, Pirasteh A, Ibrahim N, Huemann Z, Castellino SM, Kelly KM, Garrett J, Hu J, Cho SY, Bradshaw TJ. Personalized Impression Generation for PET Reports Using Large Language Models. J Imaging Inform Med 2024; 37:471-488. [PMID: 38308070 PMCID: PMC11031527 DOI: 10.1007/s10278-024-00985-3] [Received: 12/06/2023] [Revised: 01/17/2024] [Accepted: 01/18/2024] [Indexed: 02/04/2024]
Abstract
Large language models (LLMs) have shown promise in accelerating radiology reporting by summarizing clinical findings into impressions. However, automatic impression generation for whole-body PET reports presents unique challenges and has received little attention. Our study aimed to evaluate whether LLMs can create clinically useful impressions for PET reporting. To this end, we fine-tuned twelve open-source language models on a corpus of 37,370 retrospective PET reports collected from our institution. All models were trained using the teacher-forcing algorithm, with the report findings and patient information as input and the original clinical impressions as reference. An extra input token encoded the reading physician's identity, allowing models to learn physician-specific reporting styles. To compare the performances of different models, we computed various automatic evaluation metrics and benchmarked them against physician preferences, ultimately selecting PEGASUS as the top LLM. To evaluate its clinical utility, three nuclear medicine physicians assessed the PEGASUS-generated impressions and original clinical impressions across 6 quality dimensions (3-point scales) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. When physicians assessed LLM impressions generated in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08/5. On average, physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P = 0.41). In summary, our study demonstrated that personalized impressions generated by PEGASUS were clinically useful in most cases, highlighting its potential to expedite PET reporting by automatically drafting impressions.
Affiliation(s)
- Xin Tie
  - Department of Radiology, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
  - Department of Medical Physics, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
- Muheon Shin
  - Department of Radiology, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
- Ali Pirasteh
  - Department of Radiology, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
  - Department of Medical Physics, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
- Nevein Ibrahim
  - Department of Radiology, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
- Zachary Huemann
  - Department of Radiology, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
- Sharon M Castellino
  - Department of Pediatrics, Emory University School of Medicine, Atlanta, GA, USA
  - Aflac Cancer and Blood Disorders Center, Children's Healthcare of Atlanta, Atlanta, GA, USA
- Kara M Kelly
  - Department of Pediatric Oncology, Roswell Park Comprehensive Cancer Center, Buffalo, NY, USA
  - Department of Pediatrics, University at Buffalo Jacobs School of Medicine and Biomedical Sciences, Buffalo, NY, USA
- John Garrett
  - Department of Radiology, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
  - Department of Medical Physics, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
- Junjie Hu
  - Department of Biostatistics and Medical Informatics, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
  - Department of Computer Science, School of Computer, Data and Information Sciences, University of Wisconsin, Madison, WI, USA
- Steve Y Cho
  - Department of Radiology, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
  - University of Wisconsin Carbone Comprehensive Cancer Center, Madison, WI, USA
- Tyler J Bradshaw
  - Department of Radiology, School of Medicine and Public Health, University of Wisconsin, Madison, WI, USA
8
Park YJ, Yang GJ, Sohn CB, Park SJ. GPDminer: a tool for extracting named entities and analyzing relations in biological literature. BMC Bioinformatics 2024; 25:101. [PMID: 38448845 PMCID: PMC10916184 DOI: 10.1186/s12859-024-05710-z] [Received: 11/08/2023] [Accepted: 02/19/2024] [Indexed: 03/08/2024] [Open access]
Abstract
PURPOSE: The expansion of research across various disciplines has led to a substantial increase in published papers and journals, highlighting the necessity for reliable text mining platforms for database construction and knowledge acquisition. This abstract introduces GPDMiner (Gene, Protein, and Disease Miner), a platform designed for the biomedical domain, addressing the challenges posed by the growing volume of academic papers. METHODS: GPDMiner is a text mining platform that utilizes advanced information retrieval techniques. It operates by searching PubMed for specific queries, extracting and analyzing information relevant to the biomedical field. This system is designed to discern and illustrate relationships between biomedical entities obtained from automated information extraction. RESULTS: The implementation of GPDMiner demonstrates its efficacy in navigating the extensive corpus of biomedical literature. It efficiently retrieves, extracts, and analyzes information, highlighting significant connections between genes, proteins, and diseases. The platform also allows users to save their analytical outcomes in various formats, including Excel and images. CONCLUSION: GPDMiner offers a notable additional functionality among the array of text mining tools available for the biomedical field. This tool presents an effective solution for researchers to navigate and extract relevant information from the vast unstructured texts found in biomedical literature, thereby providing distinctive capabilities that set it apart from existing methodologies. Its application is expected to greatly benefit researchers in this domain, enhancing their capacity for knowledge discovery and data management.
Affiliation(s)
- Yeon-Ji Park
  - Department of Electronics and Communications Engineering, Kwangwoon University, 20 Gwangun-ro, Seoul, 01897, Republic of Korea
- Geun-Je Yang
  - Department of Electronics and Communications Engineering, Kwangwoon University, 20 Gwangun-ro, Seoul, 01897, Republic of Korea
- Chae-Bong Sohn
  - Department of Electronics and Communications Engineering, Kwangwoon University, 20 Gwangun-ro, Seoul, 01897, Republic of Korea
- Soo Jun Park
  - Welfare & Medical ICT Research Department, Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Daejeon, 34129, Republic of Korea
9
Yao X, He Z, Liu Y, Wang Y, Ouyang S, Xia J. Cancer-Alterome: a literature-mined resource for regulatory events caused by genetic alterations in cancer. Sci Data 2024; 11:265. [PMID: 38431735 PMCID: PMC10908799 DOI: 10.1038/s41597-024-03083-9] [Received: 08/07/2023] [Accepted: 02/20/2024] [Indexed: 03/05/2024] [Open access]
Abstract
It is vital to investigate the complex mechanisms underlying tumors to better understand cancer and develop effective treatments. Metabolic abnormalities and clinical phenotypes can serve as essential biomarkers for diagnosing this challenging disease. Additionally, genetic alterations provide profound insights into the fundamental aspects of cancer. This study introduces Cancer-Alterome, a literature-mined dataset that focuses on the regulatory events of an organism's biological processes or clinical phenotypes caused by genetic alterations. By proposing and leveraging a text-mining pipeline, we identify 16,681K regulatory event records encompassing 21K genes, 157K genetic alterations and 154K downstream bio-concepts, extracted from 4,354K pan-cancer publications. The resulting dataset empowers a multifaceted investigation of cancer pathology, enabling the meticulous tracking of relevant literature support. Its potential applications extend to evidence-based medicine and precision medicine, yielding valuable insights for further advancements in cancer research.
Affiliation(s)
- Xinzhi Yao
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Zhihan He
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Yawen Liu
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Yuxing Wang
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
  - School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, P.R. China
- Sizhuo Ouyang
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
- Jingbo Xia
  - Hubei Key Lab of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P.R. China
10
Crema C, Buonocore TM, Fostinelli S, Parimbelli E, Verde F, Fundarò C, Manera M, Ramusino MC, Capelli M, Costa A, Binetti G, Bellazzi R, Redolfi A. Advancing Italian biomedical information extraction with transformers-based models: Methodological insights and multicenter practical application. J Biomed Inform 2023; 148:104557. [PMID: 38012982 DOI: 10.1016/j.jbi.2023.104557] [Received: 06/29/2023] [Revised: 10/26/2023] [Accepted: 11/24/2023] [Indexed: 11/29/2023]
Abstract
The introduction of computerized medical records in hospitals has reduced burdensome activities like manual writing and information fetching. However, the data contained in medical records are still far underutilized, primarily because extracting data from unstructured textual medical records takes time and effort. Information Extraction, a subfield of Natural Language Processing, can help clinical practitioners overcome this limitation by using automated text-mining pipelines. In this work, we created the first Italian neuropsychiatric Named Entity Recognition dataset, PsyNIT, and used it to develop a Transformers-based model. Moreover, we collected and leveraged three external independent datasets to implement an effective multicenter model, with an overall F1-score of 84.77%, Precision of 83.16%, and Recall of 86.44%. The lessons learned are: (i) the crucial role of a consistent annotation process and (ii) a fine-tuning strategy that combines classical methods with a "low-resource" approach. This allowed us to establish methodological guidelines that pave the way for Natural Language Processing studies in less-resourced languages.
Affiliation(s)
- Claudio Crema
- Laboratory of Neuroinformatics, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, Italy.
- Tommaso Mario Buonocore
- Dept. of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy.
- Silvia Fostinelli
- Molecular Markers Laboratory, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, Italy.
- Enea Parimbelli
- Dept. of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy.
- Federico Verde
- Department of Neurology and Laboratory of Neuroscience, IRCCS Istituto Auxologico Italiano, Milan, Italy; Department of Pathophysiology and Transplantation, Dino Ferrari Center, Università degli Studi di Milano, Milan, Italy.
- Cira Fundarò
- Neurophysiopatology Unit, IRCCS Istituti Clinici Scientifici Maugeri, Pavia, Italy.
- Marina Manera
- Psychology Unit, IRCCS Istituti Clinici Scientifici Maugeri, Pavia, Italy.
- Matteo Cotta Ramusino
- Unit of Behavioral Neurology, IRCCS Mondino Foundation Pavia, and Dept. of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy.
- Marco Capelli
- Unit of Behavioral Neurology, IRCCS Mondino Foundation Pavia, and Dept. of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy.
- Alfredo Costa
- Unit of Behavioral Neurology, IRCCS Mondino Foundation Pavia, and Dept. of Brain and Behavioral Sciences, University of Pavia, Pavia, Italy.
- Giuliano Binetti
- Molecular Markers Laboratory, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, Italy.
- Riccardo Bellazzi
- Dept. of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy.
- Alberto Redolfi
- Laboratory of Neuroinformatics, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, Italy.
11
Miranda-Escalada A, Mehryary F, Luoma J, Estrada-Zavala D, Gasco L, Pyysalo S, Valencia A, Krallinger M. Overview of DrugProt task at BioCreative VII: data and methods for large-scale text mining and knowledge graph generation of heterogenous chemical-protein relations. Database (Oxford) 2023; 2023:baad080. [PMID: 38015956 PMCID: PMC10683943 DOI: 10.1093/database/baad080] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2022] [Revised: 09/22/2023] [Accepted: 10/30/2023] [Indexed: 11/30/2023]
Abstract
It is getting increasingly challenging to efficiently exploit drug-related information described in the growing amount of scientific literature. Indeed, for drug-gene/protein interactions, the challenge is even bigger, considering the scattered information sources and types of interactions. However, their systematic, large-scale exploitation is key for developing tools, impacting knowledge fields as diverse as drug design or metabolic pathway research. Previous efforts in the extraction of drug-gene/protein interactions from the literature did not address these scalability and granularity issues. To tackle them, we have organized the DrugProt track at BioCreative VII. In the context of the track, we have released the DrugProt Gold Standard corpus, a collection of 5000 PubMed abstracts, manually annotated with granular drug-gene/protein interactions. We have proposed a novel large-scale track to evaluate the capacity of natural language processing systems to scale to the range of millions of documents, and generate with their predictions a silver standard knowledge graph of 53 993 602 nodes and 19 367 406 edges. Its use exceeds the shared task and points toward pharmacological and biological applications such as drug discovery or continuous database curation. Finally, we have created a persistent evaluation scenario on CodaLab to continuously evaluate new relation extraction systems that may arise. Thirty teams from four continents, which involved 110 people, sent 107 submission runs for the Main DrugProt track, and nine teams submitted 21 runs for the Large Scale DrugProt track. Most participants implemented deep learning approaches based on pretrained transformer-like language models (LMs) such as BERT or BioBERT, reaching precision and recall values as high as 0.9167 and 0.9542 for some relation types. 
Finally, some initial explorations of the applicability of the knowledge graph have shown its potential to explore the chemical-protein relations described in the literature, or chemical compound-enzyme interactions. Database URL: https://doi.org/10.5281/zenodo.4955410.
Affiliation(s)
- Farrokh Mehryary
- TurkuNLP Group, Department of Computing, University of Turku, Turku 20014, Finland
- Jouni Luoma
- TurkuNLP Group, Department of Computing, University of Turku, Turku 20014, Finland
- Luis Gasco
- Life Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Sampo Pyysalo
- TurkuNLP Group, Department of Computing, University of Turku, Turku 20014, Finland
- Alfonso Valencia
- Life Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
- Martin Krallinger
- Life Sciences Department, Barcelona Supercomputing Center, Barcelona 08034, Spain
12
Tian S, Jin Q, Yeganova L, Lai PT, Zhu Q, Chen X, Yang Y, Chen Q, Kim W, Comeau DC, Islamaj R, Kapoor A, Gao X, Lu Z. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief Bioinform 2023; 25:bbad493. [PMID: 38168838 PMCID: PMC10762511 DOI: 10.1093/bib/bbad493] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/28/2023] [Revised: 11/15/2023] [Accepted: 12/06/2023] [Indexed: 01/05/2024] Open
Abstract
ChatGPT has drawn considerable attention from both the general public and domain experts with its remarkable text generation capabilities. This has subsequently led to the emergence of diverse applications in the field of biomedicine and health. In this work, we examine the diverse applications of large language models (LLMs), such as ChatGPT, in biomedicine and health. Specifically, we explore the areas of biomedical information retrieval, question answering, medical text summarization, information extraction, and medical education, and investigate whether LLMs possess the transformative power to revolutionize these tasks or whether the distinct complexities of the biomedical domain present unique challenges. Following an extensive literature survey, we find that significant advances have been made in the field of text generation tasks, surpassing the previous state-of-the-art methods. For other applications, the advances have been modest. Overall, LLMs have not yet revolutionized biomedicine, but recent rapid progress indicates that such methods hold great potential to provide valuable means for accelerating discovery and improving health. We also find that the use of LLMs, like ChatGPT, in the fields of biomedicine and health entails various risks and challenges, including fabricated information in its generated responses, as well as legal and privacy concerns associated with sensitive patient data. We believe this survey can provide a comprehensive and timely overview to biomedical researchers and healthcare practitioners on the opportunities and challenges associated with using ChatGPT and other LLMs for transforming biomedicine and health.
Affiliation(s)
- Shubo Tian
- National Library of Medicine, National Institutes of Health
- Qiao Jin
- National Library of Medicine, National Institutes of Health
- Lana Yeganova
- National Library of Medicine, National Institutes of Health
- Po-Ting Lai
- National Library of Medicine, National Institutes of Health
- Qingqing Zhu
- National Library of Medicine, National Institutes of Health
- Xiuying Chen
- King Abdullah University of Science and Technology
- Yifan Yang
- National Library of Medicine, National Institutes of Health
- Qingyu Chen
- National Library of Medicine, National Institutes of Health
- Won Kim
- National Library of Medicine, National Institutes of Health
- Aadit Kapoor
- National Library of Medicine, National Institutes of Health
- Xin Gao
- King Abdullah University of Science and Technology
- Zhiyong Lu
- National Library of Medicine, National Institutes of Health
13
He F, Liu K, Yang Z, Chen Y, Hammer RD, Xu D, Popescu M. pathCLIP: Detection of Genes and Gene Relations from Biological Pathway Figures through Image-Text Contrastive Learning. bioRxiv 2023:2023.10.31.564859. [PMID: 37961680 PMCID: PMC10635012 DOI: 10.1101/2023.10.31.564859] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
In biomedical literature, biological pathways are commonly described through a combination of images and text. These pathways contain valuable information, including genes and their relationships, which provide insight into biological mechanisms and precision medicine. Curating pathway information across the literature enables the integration of this information to build a comprehensive knowledge base. While some studies have extracted pathway information from images and text independently, they often overlook the correspondence between the two modalities. In this paper, we present a pathway figure curation system named pathCLIP for identifying genes and gene relations from pathway figures. Our key innovation is the use of an image-text contrastive learning model to learn coordinated embeddings of image snippets and text descriptions of genes and gene relations, thereby improving curation. Our validation results, using pathway figures from PubMed, showed that our multimodal model outperforms models using only a single modality. Additionally, our system effectively curates genes and gene relations from multiple literature sources. A case study on extracting pathway information from non-small cell lung cancer literature further demonstrates the usefulness of our curated pathway information in enhancing related pathways in the KEGG database.
Affiliation(s)
- Fei He
- School of Information Science and Technology, Northeast Normal University, Changchun 130000, China; Department of Electrical Engineering and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia Missouri, MO 65211 USA
- Kai Liu
- School of Information Science and Technology, Northeast Normal University, Changchun 130000, China
- Zhiyuan Yang
- School of Information Science and Technology, Northeast Normal University, Changchun 130000, China
- Yibo Chen
- Department of Electrical Engineering and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia Missouri, MO 65211 USA
- Richard D Hammer
- School of Medicine, University of Missouri, Columbia Missouri, MO 65211 USA
- Dong Xu
- Department of Electrical Engineering and Computer Science, Christopher S. Bond Life Sciences Center, University of Missouri, Columbia Missouri, MO 65211 USA
- Mihail Popescu
- School of Medicine, University of Missouri, Columbia Missouri, MO 65211 USA
14
Yang X, Saha S, Venkatesan A, Tirunagari S, Vartak V, McEntyre J. Europe PMC annotated full-text corpus for gene/proteins, diseases and organisms. Sci Data 2023; 10:722. [PMID: 37857688 PMCID: PMC10587067 DOI: 10.1038/s41597-023-02617-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 10/03/2023] [Indexed: 10/21/2023] Open
Abstract
Named entity recognition (NER) is a widely used text-mining and natural language processing (NLP) subtask. In recent years, deep learning methods have superseded traditional dictionary- and rule-based NER approaches. A high-quality dataset is essential to fully leverage recent deep learning advancements. While several gold-standard corpora for biomedical entities in abstracts exist, only a few are based on full-text research articles. The Europe PMC literature database routinely annotates Gene/Proteins, Diseases, and Organisms entities. To transition this pipeline from a dictionary-based to a machine learning-based approach, we have developed a human-annotated full-text corpus for these entities, comprising 300 full-text open-access research articles. Over 72,000 mentions of biomedical concepts have been identified within approximately 114,000 sentences. This article describes the corpus and details how to access and reuse this open community resource.
Affiliation(s)
- Xiao Yang
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Shyamasree Saha
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Aravind Venkatesan
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Santosh Tirunagari
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Open Targets, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Vid Vartak
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
- Johanna McEntyre
- Literature Services, EMBL-EBI, Wellcome Trust Genome Campus, Cambridge, UK
15
Tie X, Shin M, Pirasteh A, Ibrahim N, Huemann Z, Castellino SM, Kelly KM, Garrett J, Hu J, Cho SY, Bradshaw TJ. Automatic Personalized Impression Generation for PET Reports Using Large Language Models. ArXiv 2023:arXiv:2309.10066v2. [PMID: 37904738 PMCID: PMC10614982] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2023]
Abstract
Purpose: To determine if fine-tuned large language models (LLMs) can generate accurate, personalized impressions for whole-body PET reports. Materials and Methods: Twelve language models were trained on a corpus of PET reports using the teacher-forcing algorithm, with the report findings as input and the clinical impressions as reference. An extra input token encodes the reading physician's identity, allowing models to learn physician-specific reporting styles. Our corpus comprised 37,370 retrospective PET reports collected from our institution between 2010 and 2022. To identify the best LLM, 30 evaluation metrics were benchmarked against quality scores from two nuclear medicine (NM) physicians, with the most aligned metrics selecting the model for expert evaluation. In a subset of data, model-generated impressions and original clinical impressions were assessed by three NM physicians according to 6 quality dimensions (3-point scale) and an overall utility score (5-point scale). Each physician reviewed 12 of their own reports and 12 reports from other physicians. Bootstrap resampling was used for statistical analysis. Results: Of all evaluation metrics, domain-adapted BARTScore and PEGASUSScore showed the highest Spearman's ρ correlations (ρ=0.568 and 0.563) with physician preferences. Based on these metrics, the fine-tuned PEGASUS model was selected as the top LLM. When physicians reviewed PEGASUS-generated impressions in their own style, 89% were considered clinically acceptable, with a mean utility score of 4.08 out of 5. Physicians rated these personalized impressions as comparable in overall utility to the impressions dictated by other physicians (4.03, P=0.41). Conclusion: Personalized impressions generated by PEGASUS were clinically useful, highlighting its potential to expedite PET reporting.
Affiliation(s)
- Xin Tie
- Department of Radiology, University of Wisconsin, Madison, WI, USA
- Department of Medical Physics, University of Wisconsin, Madison, WI, USA
- Muheon Shin
- Department of Radiology, University of Wisconsin, Madison, WI, USA
- Ali Pirasteh
- Department of Radiology, University of Wisconsin, Madison, WI, USA
- Department of Medical Physics, University of Wisconsin, Madison, WI, USA
- Nevein Ibrahim
- Department of Radiology, University of Wisconsin, Madison, WI, USA
- Zachary Huemann
- Department of Radiology, University of Wisconsin, Madison, WI, USA
- Sharon M. Castellino
- Department of Pediatrics, Emory University School of Medicine, Atlanta, GA, USA
- Aflac Cancer and Blood Disorders Center, Children’s Healthcare of Atlanta, Atlanta, GA, USA
- Kara M. Kelly
- Department of Pediatric Oncology, Roswell Park Comprehensive Cancer Center, Buffalo, NY, USA
- Department of Pediatrics, University at Buffalo Jacobs School of Medicine and Biomedical Sciences, Buffalo, NY, USA
- John Garrett
- Department of Radiology, University of Wisconsin, Madison, WI, USA
- Department of Medical Physics, University of Wisconsin, Madison, WI, USA
- Junjie Hu
- Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
- Department of Computer Science, University of Wisconsin, Madison, WI, USA
- Steve Y. Cho
- Department of Radiology, University of Wisconsin, Madison, WI, USA
- University of Wisconsin Carbone Comprehensive Cancer Center, Madison, WI, USA
16
Chen Q, Sun H, Liu H, Jiang Y, Ran T, Jin X, Xiao X, Lin Z, Chen H, Niu Z. An extensive benchmark study on biomedical text generation and mining with ChatGPT. Bioinformatics 2023; 39:btad557. [PMID: 37682111 PMCID: PMC10562950 DOI: 10.1093/bioinformatics/btad557] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/24/2023] [Revised: 08/09/2023] [Accepted: 09/06/2023] [Indexed: 09/09/2023] Open
Abstract
MOTIVATION: In recent years, the development of natural language processing (NLP) technologies and deep learning hardware has led to significant improvement in large language models (LLMs). ChatGPT, the state-of-the-art LLM built on GPT-3.5 and GPT-4, shows excellent capabilities in general language understanding and reasoning. Researchers have also tested the GPT models on a variety of NLP-related tasks and benchmarks and obtained excellent results. Given this strong performance on everyday chat, researchers began to explore the capacity of ChatGPT in areas of expertise that require professional education for humans; we are interested in the biomedical domain. RESULTS: To evaluate the performance of ChatGPT on biomedical tasks, this article presents a comprehensive benchmark study on the use of ChatGPT for biomedical corpora, including article abstracts, clinical trial descriptions, biomedical questions, and so on. Typical NLP tasks such as named entity recognition, relation extraction, sentence similarity, question answering, and document classification are included. Overall, ChatGPT achieved a BLURB score of 58.50, while the state-of-the-art model had a score of 84.30. Through a series of experiments, we demonstrated the effectiveness and versatility of ChatGPT in biomedical text understanding, reasoning, and generation, as well as the limitations of ChatGPT built on GPT-3.5. AVAILABILITY AND IMPLEMENTATION: All the datasets are available from the BLURB benchmark at https://microsoft.github.io/BLURB/index.html. The prompts are described in the article.
Affiliation(s)
- Qijie Chen
- AIDD, Mindrank AI Ltd, Zhejiang 310000, China
- Haotong Sun
- AIDD, Mindrank AI Ltd, Zhejiang 310000, China
- Haoyang Liu
- College of Life Sciences, Nankai University, Tianjin 300071, China
- Guangzhou Laboratory, GuangDong 510005, China
- Ting Ran
- Guangzhou Laboratory, GuangDong 510005, China
- Xurui Jin
- AIDD, Mindrank AI Ltd, Zhejiang 310000, China
- Zhimin Lin
- AIDD, Mindrank AI Ltd, Zhejiang 310000, China
- Zhangmin Niu
- AIDD, Mindrank AI Ltd, Zhejiang 310000, China
- National Heart and Lung Institute, Imperial College London, London, United Kingdom
17
Buonocore TM, Crema C, Redolfi A, Bellazzi R, Parimbelli E. Localizing in-domain adaptation of transformer-based biomedical language models. J Biomed Inform 2023; 144:104431. [PMID: 37385327 DOI: 10.1016/j.jbi.2023.104431] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2022] [Revised: 06/09/2023] [Accepted: 06/17/2023] [Indexed: 07/01/2023]
Abstract
In the era of digital healthcare, the huge volumes of textual information generated every day in hospitals constitute an essential but underused asset that could be exploited with task-specific, fine-tuned biomedical language representation models, improving patient care and management. For such specialized domains, previous research has shown that fine-tuning models stemming from broad-coverage checkpoints can largely benefit from additional training rounds over large-scale in-domain resources. However, these resources are often unreachable for less-resourced languages like Italian, preventing local medical institutions from employing in-domain adaptation. In order to reduce this gap, our work investigates two accessible approaches to derive biomedical language models in languages other than English, taking Italian as a concrete use-case: one based on neural machine translation of English resources, favoring quantity over quality; the other based on a high-grade, narrow-scoped corpus natively written in Italian, thus preferring quality over quantity. Our study shows that data quantity is a harder constraint than data quality for biomedical adaptation, but the concatenation of high-quality data can improve model performance even when dealing with relatively size-limited corpora. The models published from our investigations have the potential to unlock important research opportunities for Italian hospitals and academia. Finally, the set of lessons learned from the study constitutes valuable insights towards a solution to build biomedical language models that are generalizable to other less-resourced languages and different domain settings.
Affiliation(s)
- Tommaso Mario Buonocore
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, 27100, Italy
- Claudio Crema
- Laboratory of Neuroinformatics, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, 25125, Italy
- Alberto Redolfi
- Laboratory of Neuroinformatics, IRCCS Istituto Centro San Giovanni di Dio Fatebenefratelli, Brescia, 25125, Italy
- Riccardo Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, 27100, Italy
- Enea Parimbelli
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, 27100, Italy
18
Badenes-Olmedo C, Corcho O. Lessons learned to enable question answering on knowledge graphs extracted from scientific publications: A case study on the coronavirus literature. J Biomed Inform 2023; 142:104382. [PMID: 37156393 PMCID: PMC10163941 DOI: 10.1016/j.jbi.2023.104382] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/15/2022] [Revised: 04/14/2023] [Accepted: 05/03/2023] [Indexed: 05/10/2023]
Abstract
The article presents a workflow to create a question-answering system whose knowledge base combines knowledge graphs and scientific publications on coronaviruses. It is based on the experience gained in modeling evidence from research articles to provide answers to questions in natural language. The work contains best practices for acquiring scientific publications, tuning language models to identify and normalize relevant entities, creating representational models based on probabilistic topics, and formalizing an ontology that describes the associations between domain concepts supported by the scientific literature. All the resources generated in the domain of coronavirus are available openly as part of the Drugs4COVID initiative, and can be (re)-used independently or as a whole. They can be exploited by scientific communities conducting research related to SARS-CoV-2/COVID-19 and also by therapeutic communities, laboratories, etc., wishing to find and understand relationships between symptoms, drugs, active ingredients and their documentary evidence.
Affiliation(s)
- Oscar Corcho
- Artificial Intelligence Department, Campus de Montegancedo, s/n., Boadilla del Monte, 28660, Madrid, Spain
19
Tinn R, Cheng H, Gu Y, Usuyama N, Liu X, Naumann T, Gao J, Poon H. Fine-tuning large neural language models for biomedical natural language processing. Patterns (N Y) 2023; 4:100729. [PMID: 37123444 PMCID: PMC10140607 DOI: 10.1016/j.patter.2023.100729] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Revised: 12/12/2022] [Accepted: 03/17/2023] [Indexed: 05/02/2023]
Abstract
Large neural language models have transformed modern natural language processing (NLP) applications. However, fine-tuning such models for specific tasks remains challenging as model size increases, especially with small labeled datasets, which are common in biomedical NLP. We conduct a systematic study on fine-tuning stability in biomedical NLP. We show that fine-tuning performance may be sensitive to pretraining settings and conduct an exploration of techniques for addressing fine-tuning instability. We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications. Specifically, freezing lower layers is helpful for standard BERT-BASE models, while layerwise decay is more effective for BERT-LARGE and ELECTRA models. For low-resource text similarity tasks, such as BIOSSES, reinitializing the top layers is the optimal strategy. Overall, domain-specific vocabulary and pretraining facilitate robust models for fine-tuning. Based on these findings, we establish a new state of the art on a wide range of biomedical NLP applications.
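Layerwise decay, as named in the abstract above, is commonly implemented by giving each encoder layer a learning rate that shrinks geometrically with its distance from the output layer. The Python sketch below illustrates that idea under this assumption only. The function name `layerwise_learning_rates`, the base rate, the decay factor, and the layer count are hypothetical values chosen for illustration, not settings reported by the authors.

```python
def layerwise_learning_rates(base_lr: float, decay: float, num_layers: int) -> list[float]:
    """Return one learning rate per layer; index 0 is the lowest (input-side) layer.

    The top layer keeps base_lr, and each layer below it is scaled by
    `decay` one additional time.
    """
    return [base_lr * decay ** (num_layers - 1 - layer) for layer in range(num_layers)]


# Hypothetical settings, for illustration only.
rates = layerwise_learning_rates(base_lr=1e-4, decay=0.9, num_layers=12)
for layer, lr in enumerate(rates):
    print(f"layer {layer:2d}: lr = {lr:.2e}")
```

In a real fine-tuning setup, these per-layer rates would typically be passed to the optimizer as separate parameter groups, so that lower layers, which hold more general features, change more slowly than the task-specific top layers.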
Affiliation(s)
- Hao Cheng
- Microsoft Research, Redmond, WA, USA
- Yu Gu
- Microsoft Research, Redmond, WA, USA
- Hoifung Poon
- Microsoft Research, Redmond, WA, USA
- Corresponding author
20
Abstract
Generative Pre-trained Transformers (GPT) are powerful language models that have great potential to transform biomedical research. However, they are known to suffer from artificial hallucinations and provide false answers that are seemingly correct in some situations. We developed GeneTuring, a comprehensive QA database with 600 questions in genomics, and manually scored 10,800 answers returned by six GPT models, including GPT-3, ChatGPT, and New Bing. New Bing has the best overall performance and significantly reduces the level of AI hallucination compared to other models, thanks to its ability to recognize its incapacity in answering questions. We argue that improving incapacity awareness is equally important as improving model accuracy to address AI hallucination.
Affiliation(s)
- Wenpin Hou
- Department of Biostatistics, The Mailman School of Public Health, Columbia University, New York City, NY, USA
- Zhicheng Ji
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
21
Leaman R, Islamaj R, Adams V, Alliheedi MA, Almeida JR, Antunes R, Bevan R, Chang YC, Erdengasileng A, Hodgskiss M, Ida R, Kim H, Li K, Mercer RE, Mertová L, Mobasher G, Shin HC, Sung M, Tsujimura T, Yeh WC, Lu Z. Chemical identification and indexing in full-text articles: an overview of the NLM-Chem track at BioCreative VII. Database (Oxford) 2023; 2023:7071696. [PMID: 36882099 PMCID: PMC9991492 DOI: 10.1093/database/baad005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2022] [Revised: 01/06/2023] [Accepted: 02/15/2023] [Indexed: 03/09/2023]
Abstract
The BioCreative National Library of Medicine (NLM)-Chem track calls for a community effort to fine-tune automated recognition of chemical names in the biomedical literature. Chemicals are one of the most searched biomedical entities in PubMed, and, as highlighted during the coronavirus disease 2019 pandemic, their identification may significantly advance research in multiple biomedical subfields. While previous community challenges focused on identifying chemical names mentioned in titles and abstracts, the full text contains valuable additional detail. We, therefore, organized the BioCreative NLM-Chem track as a community effort to address automated chemical entity recognition in full-text articles. The track consisted of two tasks: (i) chemical identification and (ii) chemical indexing. The chemical identification task required predicting all chemicals mentioned in recently published full-text articles, both span [i.e. named entity recognition (NER)] and normalization (i.e. entity linking), using Medical Subject Headings (MeSH). The chemical indexing task required identifying which chemicals reflect topics for each article and should therefore appear in the listing of MeSH terms for the document in the MEDLINE article indexing. This manuscript summarizes the BioCreative NLM-Chem track and post-challenge experiments. We received a total of 85 submissions from 17 teams worldwide. The highest performance achieved for the chemical identification task was 0.8672 F-score (0.8759 precision and 0.8587 recall) for strict NER performance and 0.8136 F-score (0.8621 precision and 0.7702 recall) for strict normalization performance. The highest performance achieved for the chemical indexing task was 0.6073 F-score (0.7417 precision and 0.5141 recall).
This community challenge demonstrated that (i) the current substantial achievements in deep learning technologies can be utilized to improve automated prediction accuracy further and (ii) the chemical indexing task is substantially more challenging. We look forward to further developing biomedical text-mining methods to respond to the rapid growth of biomedical literature. The NLM-Chem track dataset and other challenge materials are publicly available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/. Database URL https://ftp.ncbi.nlm.nih.gov/pub/lu/BC7-NLM-Chem-track/.
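The F-scores quoted in the abstract above are consistent with the harmonic mean of the stated precision and recall values. The Python sketch below recomputes them, assuming the balanced F1 definition. The helper name `f_score` is illustrative and is not taken from the track's evaluation code.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs quoted in the abstract above.
reported = {
    "strict NER": (0.8759, 0.8587),
    "strict normalization": (0.8621, 0.7702),
    "chemical indexing": (0.7417, 0.5141),
}

scores = {task: round(f_score(p, r), 4) for task, (p, r) in reported.items()}
for task, value in scores.items():
    print(f"{task}: {value}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, the low recall on the indexing task (0.5141) drags its F-score well below its precision (0.7417).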
Affiliation(s)
- Virginia Adams
  - NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
- Mohammed A Alliheedi
  - Department of Computer Science, Al Baha University, 4781 King Fahd Rd, Al Aqiq 65779, Saudi Arabia
- João Rafael Almeida
  - Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
  - Department of Information and Communications Technologies, University of A Coruña, Camiño do Lagar de Castro, A Coruña 15008, Spain
- Rui Antunes
  - Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Robert Bevan
  - Informatics Department, Medicines Discovery Catapult, Alderley Park, Block 35, Mereside, Macclesfield SK10 4ZF, UK
- Yung-Chun Chang
  - Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Da’an District, Taipei City, Taipei 106, Taiwan
- Arslan Erdengasileng
  - Department of Statistics, Florida State University, 117 N. Woodward Ave, Tallahassee, FL 32306, USA
- Matthew Hodgskiss
  - Informatics Department, Medicines Discovery Catapult, Alderley Park, Block 35, Mereside, Macclesfield SK10 4ZF, UK
- Ryuki Ida
  - Computational Intelligence Laboratory, Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, Aichi 468-8511, Japan
- Hyunjae Kim
  - Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
- Keqiao Li
  - Department of Statistics, Florida State University, 117 N. Woodward Ave, Tallahassee, FL 32306, USA
- Robert E Mercer
  - Department of Computer Science, The University of Western Ontario, Room 355, Middlesex College, Ontario, London N6A 5B7, Canada
- Lukrécia Mertová
  - Scientific Databases and Visualization Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Schloss-Wolfsbrunnenweg 35, Heidelberg 69118, Germany
- Ghadeer Mobasher
  - Scientific Databases and Visualization Group, Heidelberg Institute for Theoretical Studies (HITS gGmbH), Schloss-Wolfsbrunnenweg 35, Heidelberg 69118, Germany
  - Institute of Computer Science, Heidelberg University, Im Neuenheimer Feld 205, Heidelberg 69120, Germany
- Hoo-Chang Shin
  - NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
- Mujeen Sung
  - Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul 02841, South Korea
- Tomoki Tsujimura
  - Computational Intelligence Laboratory, Toyota Technological Institute, 2-12-1 Hisakata, Tempaku-ku, Nagoya, Aichi 468-8511, Japan
- Wen-Chao Yeh
  - Institute of Information Systems and Applications, National Tsing Hua University, No. 101, Section 2, Kuang-Fu Road, Hsinchu 30013, Taiwan
- Zhiyong Lu
  - *Corresponding author: Tel: +1-301-594-7089; Fax: +1-301-480-2290;
|
22
|
Rohanian O, Nouriborji M, Kouchaki S, Clifton DA. On the effectiveness of compact biomedical transformers. Bioinformatics 2023; 39:btad103. [PMID: 36825820 PMCID: PMC10027428 DOI: 10.1093/bioinformatics/btad103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2022] [Revised: 12/23/2022] [Accepted: 02/23/2023] [Indexed: 02/25/2023]
Abstract
MOTIVATION: Language models pre-trained on biomedical corpora, such as BioBERT, have recently shown promising results on downstream biomedical tasks. Many existing pre-trained models, however, are resource-intensive and computationally heavy owing to factors such as embedding size, hidden dimension and number of layers. The natural language processing community has developed numerous strategies to compress these models using techniques such as pruning, quantization and knowledge distillation, resulting in models that are considerably faster, smaller and consequently easier to use in practice. In the same spirit, in this article we introduce six lightweight models, namely BioDistilBERT, BioTinyBERT, BioMobileBERT, DistilBioBERT, TinyBioBERT and CompactBioBERT, which are obtained either by knowledge distillation from a biomedical teacher or by continual learning on the PubMed dataset. We evaluate all of our models on three biomedical tasks and compare them with BioBERT-v1.1, with the aim of producing efficient lightweight models that perform on par with their larger counterparts. RESULTS: We trained six different models in total, with the largest having 65 million parameters and the smallest 15 million, far fewer than BioBERT's 110 million. Based on our experiments on three different biomedical tasks, we found that models distilled from a biomedical teacher and models that have been additionally pre-trained on the PubMed dataset can retain up to 98.8% and 98.6% of the performance of BioBERT-v1.1, respectively. Overall, our best model below 30 million parameters is BioMobileBERT, while our best models over 30 million parameters are DistilBioBERT and CompactBioBERT, which retain up to 98.2% and 98.8% of the performance of BioBERT-v1.1, respectively. AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/nlpie-research/Compact-Biomedical-Transformers. Trained models can be accessed at https://huggingface.co/nlpie.
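Knowledge distillation, named above as one of the compression strategies, is commonly implemented by training a small student model to match the temperature-softened output distribution of a larger teacher. The following is a minimal sketch of that generic soft-target loss in plain Python. It is not the authors' training code, and the function names, the default temperature and the example logits are assumptions made for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) over softened distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a single three-class prediction.
teacher_logits = [1.0, 2.0, 3.0]
matching_student = [1.0, 2.0, 3.0]
diverging_student = [3.0, 2.0, 1.0]

loss_match = distillation_loss(matching_student, teacher_logits)
loss_diverge = distillation_loss(diverging_student, teacher_logits)
print(loss_match, loss_diverge)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice this term is usually combined with the ordinary supervised loss on the gold labels.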
Affiliation(s)
- Omid Rohanian
  - Department of Engineering Science, University of Oxford, Oxford, UK
  - NLPie Research, Oxford, UK
- Samaneh Kouchaki
  - Department of Electrical and Electronic Engineering, University of Surrey, Guildford, UK
- David A Clifton
  - Department of Engineering Science, University of Oxford, Oxford, UK
  - Oxford-Suzhou Centre for Advanced Research, Suzhou, China
|
23
|
Saxena P, Rauniyar S, Thakur P, Singh RN, Bomgni A, Alaba MO, Tripathi AK, Gnimpieba EZ, Lushbough C, Sani RK. Integration of text mining and biological network analysis: Identification of essential genes in sulfate-reducing bacteria. Front Microbiol 2023; 14:1086021. [PMID: 37125195 PMCID: PMC10133479 DOI: 10.3389/fmicb.2023.1086021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2022] [Accepted: 03/23/2023] [Indexed: 05/02/2023]
Abstract
The growth and survival of an organism in a particular environment depend heavily on certain indispensable genes, termed essential genes. Sulfate-reducing bacteria (SRB) are obligate anaerobes that rely on sulfate reduction for their energy requirements. The present study used Oleidesulfovibrio alaskensis G20 (OA G20) as a model SRB to categorize the essential genes based on their key metabolic pathways. Herein, we report a feedback loop framework for gene-of-interest discovery, from bio-problem to gene set of interest, leveraging expert annotation with computational prediction. A defined bio-problem was applied to retrieve SRB genes from literature databases (PubMed and PubMed Central), and these genes were annotated to the genome of OA G20. The retrieved gene list was further used to enrich protein-protein interactions and was corroborated against the pangenome analysis to categorize the enriched gene sets and their respective pathways as essential or non-essential. Interestingly, the sat gene (dde_2265) from sulfur metabolism was the bridging gene between all the enriched pathways. Gene clusters involved in essential pathways were linked with genes from seleno-compound metabolism, amino acid metabolism, secondary metabolite synthesis, and cofactor biosynthesis. Furthermore, pangenome analysis demonstrated the gene distribution, where 69.83% of the 116 enriched genes were mapped under "persistent," implying the essentiality of these genes. Likewise, 21.55% of the enriched genes, which notably include the formate dehydrogenases and metallic hydrogenases, appeared under "shell." Our methodology suggests that semi-automated text mining and network analysis may play a crucial role in deciphering previously unexplored genes and key mechanisms, which can help to establish a baseline before any experimental studies are performed.
Affiliation(s)
- Priya Saxena
  - Department of Chemical and Biological Engineering, South Dakota School of Mines and Technology, Rapid City, SD, United States
  - Data Driven Material Discovery Center for Bioengineering Innovation, South Dakota School of Mines and Technology, Rapid City, SD, United States
- Shailabh Rauniyar
  - Department of Chemical and Biological Engineering, South Dakota School of Mines and Technology, Rapid City, SD, United States
  - 2-Dimensional Materials for Biofilm Engineering, Science and Technology, South Dakota School of Mines and Technology, Rapid City, SD, United States
- Payal Thakur
  - Department of Chemical and Biological Engineering, South Dakota School of Mines and Technology, Rapid City, SD, United States
  - Data Driven Material Discovery Center for Bioengineering Innovation, South Dakota School of Mines and Technology, Rapid City, SD, United States
- Ram Nageena Singh
  - Department of Chemical and Biological Engineering, South Dakota School of Mines and Technology, Rapid City, SD, United States
  - 2-Dimensional Materials for Biofilm Engineering, Science and Technology, South Dakota School of Mines and Technology, Rapid City, SD, United States
- Alain Bomgni
  - Department of Biomedical Engineering, University of South Dakota, Sioux Falls, SD, United States
- Mathew O. Alaba
  - Department of Biomedical Engineering, University of South Dakota, Sioux Falls, SD, United States
- Abhilash Kumar Tripathi
  - Department of Chemical and Biological Engineering, South Dakota School of Mines and Technology, Rapid City, SD, United States
  - 2-Dimensional Materials for Biofilm Engineering, Science and Technology, South Dakota School of Mines and Technology, Rapid City, SD, United States
- Etienne Z. Gnimpieba
  - Department of Biomedical Engineering, University of South Dakota, Sioux Falls, SD, United States
  - *Correspondence: Etienne Z. Gnimpieba,
- Carol Lushbough
  - Department of Biomedical Engineering, University of South Dakota, Sioux Falls, SD, United States
- Rajesh Kumar Sani
  - Department of Chemical and Biological Engineering, South Dakota School of Mines and Technology, Rapid City, SD, United States
  - Data Driven Material Discovery Center for Bioengineering Innovation, South Dakota School of Mines and Technology, Rapid City, SD, United States
  - 2-Dimensional Materials for Biofilm Engineering, Science and Technology, South Dakota School of Mines and Technology, Rapid City, SD, United States
  - BuG ReMeDEE Consortium, South Dakota School of Mines and Technology, Rapid City, SD, United States
  - Rajesh Kumar Sani,
|
24
|
Chai Z, Jin H, Shi S, Zhan S, Zhuo L, Yang Y, Lian Q. Noise Reduction Learning Based on XLNet-CRF for Biomedical Named Entity Recognition. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:595-605. [PMID: 35259113 DOI: 10.1109/tcbb.2022.3157630] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 05/04/2023]
Abstract
In recent years, Biomedical Named Entity Recognition (BioNER) systems have mainly been based on deep neural networks, which are used to extract information from the rapidly expanding biomedical literature. Long-distance context autoencoding language models based on transformers have recently been employed for BioNER with great success. However, noise interference exists in the process of pre-training and fine-tuning, and there is no effective decoder for label dependency. Current models have many aspects in need of improvement for better performance. We propose two kinds of noise reduction models, Shared Labels and Dynamic Splicing, which use XLNet, a permutation language pre-training model, for encoding and a Conditional Random Field (CRF) for decoding. In tests on 15 biomedical named entity recognition datasets, the two models improved the average F1-score by 1.504 and 1.48, respectively, and achieved state-of-the-art performance on 7 of them. Further analysis demonstrates the effectiveness of the two models and the improvement in recognition contributed by the CRF, and suggests the applicable scope of the models according to different data characteristics.
|
25
|
Kumar A, Sharaff A. ABEE: automated bio entity extraction from biomedical text documents. DTA 2022. [DOI: 10.1108/dta-04-2022-0151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2023]
Abstract
Purpose: The purpose of this study was to design a multitask learning model so that biomedical entities can be extracted from biomedical texts without ambiguity. Design/methodology/approach: In the proposed automated bio entity extraction (ABEE) model, a multitask learning model is built by combining single-task learning models. Our model used Bidirectional Encoder Representations from Transformers to train the single-task learning models and then combined their outputs so that we can find the variety of entities in biomedical text. Findings: The proposed ABEE model targeted unique gene/protein, chemical and disease entities in biomedical text. The findings are relevant to biomedical research such as drug discovery and clinical trials. This research helps to reduce not only the effort of researchers but also the cost of new drug discoveries and new treatments. Research limitations/implications: No limitations of the model have been identified as such, but the research team plans to test the model with gigabytes of data and to establish a knowledge graph so that researchers can easily estimate entities of similar groups. Practical implications: The ABEE model will be helpful in various natural language processing tasks. In information extraction (IE), it plays an important role in biomedical named entity recognition and biomedical relation extraction, and it also supports information retrieval tasks such as literature-based knowledge discovery. Social implications: During the COVID-19 pandemic, the demand for this type of work increased because of the increase in clinical trials at that time. Had this type of research been introduced earlier, it would have reduced the time and effort required for new drug discoveries in this area. Originality/value: In this work we proposed a novel multitask learning model that is capable of extracting biomedical entities from biomedical text without ambiguity. The proposed model achieved state-of-the-art performance in terms of precision, recall and F1 score.
|
26
|
Jeon SH, Cho S. Edge Weight Updating Neural Network for Named Entity Normalization. Neural Process Lett 2022; 55:1-22. [PMID: 36573130 PMCID: PMC9770557 DOI: 10.1007/s11063-022-11102-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/07/2022] [Indexed: 12/24/2022]
Abstract
Discriminating matched named entity pairs or identifying entities' canonical forms is critical in text mining tasks. More precise named entity normalization in text mining will benefit subsequent text analytic applications. We built a named entity normalization model with a novel edge weight updating neural network. We then verified our model's performance on the NCBI disease, BC5CDR disease, and BC5CDR chemical datasets, which are widely used named entity normalization datasets in the bioinformatics field. We also tested our model on our own financial named entity normalization dataset to validate its efficacy for more general applications. Using the constructed dataset, we differentiate named entity pairs. Our model achieved the highest named entity normalization performance in terms of various evaluation metrics. When tested on four different datasets, our proposed model achieved state-of-the-art results.
Affiliation(s)
- Sung Hwan Jeon
  - Department of Industrial Engineering, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
- Sungzoon Cho
  - Department of Industrial Engineering, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
  - Institute for Industrial Systems Innovation, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
|
27
|
Bashir SR, Raza S, Kocaman V, Qamar U. Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach. Viruses 2022; 14:v14122761. [PMID: 36560764 PMCID: PMC9781729 DOI: 10.3390/v14122761] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Accepted: 12/08/2022] [Indexed: 12/14/2022]
Abstract
The clinical application of detecting COVID-19 factors is a challenging task. Existing named entity recognition models are usually trained on a limited set of named entities. Besides clinical factors, non-clinical factors, such as social determinants of health (SDoH), are also important for studying infectious disease. In this paper, we propose a generalizable machine learning approach that improves on previous efforts by recognizing a large number of clinical risk factors and SDoH. The novelty of the proposed method lies in the subtle combination of a number of deep neural networks, including the BiLSTM-CNN-CRF method and a transformer-based embedding layer. Experimental results on a cohort of COVID-19 data prepared from PubMed articles show the superiority of the proposed approach. When compared to other methods, the proposed approach achieves a performance gain of about 1-5% in terms of macro- and micro-average F1 scores. Clinical practitioners and researchers can use this approach to obtain accurate information regarding clinical risks and SDoH factors, and use this pipeline as a tool to end the pandemic or to prepare for future pandemics.
Affiliation(s)
- Syed Raza Bashir
  - Department of Computer Science, Toronto Metropolitan University, Toronto, ON M5B 2K3, Canada
- Shaina Raza
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
  - Correspondence:
- Urooj Qamar
  - Institute of Business & Information Technology, University of the Punjab, Lahore 54590, Pakistan
|
28
|
Zheng X, Du H, Luo X, Tong F, Song W, Zhao D. BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework. BMC Bioinformatics 2022; 23:501. [PMID: 36418937 PMCID: PMC9682683 DOI: 10.1186/s12859-022-05051-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/05/2022] [Accepted: 11/10/2022] [Indexed: 11/24/2022]
Abstract
BACKGROUND: Automatic and accurate recognition of various biomedical named entities from literature is an important task in biomedical text mining and is the foundation for extracting biomedical knowledge from unstructured texts into structured formats. Using the sequence labeling framework and deep neural networks to implement biomedical named entity recognition (BioNER) is a common method at present. However, this method often underutilizes syntactic features such as the dependencies and topology of sentences. Integrating semantic and syntactic features into the BioNER model is therefore an urgent problem. RESULTS: In this paper, we propose a novel biomedical named entity recognition model, named BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax), which uses a graph to model the dependencies and topology of a sentence and formulates the BioNER task as a node classification problem. This formulation introduces more topological features of language and is no longer concerned only with the distance between words in the sequence. First, we use periods to segment sentences, and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, and syntactic features such as parts of speech, dependencies and topology are preprocessed by SpaCy. A graph attention network is then used to generate a fused representation that considers both the contextual features and the syntactic features. Last, a softmax function is used to calculate the probabilities and obtain the results. We conduct experiments on 8 benchmark datasets, and our proposed model outperforms existing BioNER state-of-the-art methods on the BC2GM, JNLPBA, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, NCBI-disease, Species-800, and LINNAEUS datasets, and achieves F1-scores of 85.15%, 78.16%, 92.97%, 94.74%, 87.74%, 91.57%, 75.01%, 90.99%, respectively. CONCLUSION: The experimental results on 8 biomedical benchmark datasets demonstrate the effectiveness of our model, and indicate that formulating the BioNER task as a node classification problem and incorporating syntactic features into the graph attention network can significantly improve model performance.
Affiliation(s)
- Xiangwen Zheng
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Haijian Du
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Xiaowei Luo
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Fan Tong
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Wei Song
  - Beijing MedPeer Information Technology Co., Ltd, Beijing, 102300, China
- Dongsheng Zhao
  - Academy of Military Medical Sciences, Beijing, 100039, China
|
29
|
Chatr-Aryamontri A, Hirschman L, Ross KE, Oughtred R, Krallinger M, Dolinski K, Tyers M, Korves T, Arighi CN. Overview of the COVID-19 text mining tool interactive demonstration track in BioCreative VII. Database (Oxford) 2022; 2022:6748864. [PMID: 36197453 PMCID: PMC9534061 DOI: 10.1093/database/baac084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2022] [Revised: 08/18/2022] [Accepted: 09/08/2022] [Indexed: 11/06/2022]
Abstract
The coronavirus disease 2019 (COVID-19) pandemic has compelled biomedical researchers to communicate data in real time to establish more effective medical treatments and public health policies. Nontraditional sources such as preprint publications, i.e. articles not yet validated by peer review, have become crucial hubs for the dissemination of scientific results. Natural language processing (NLP) systems have been recently developed to extract and organize COVID-19 data in reasoning systems. Given this scenario, the BioCreative COVID-19 text mining tool interactive demonstration track was created to assess the landscape of the available tools and to gauge user interest, thereby providing a two-way communication channel between NLP system developers and potential end users. The goal was to inform system designers about the performance and usability of their products and to suggest new additional features. Considering the exploratory nature of this track, the call for participation solicited teams to apply for the track, based on their system's ability to perform COVID-19-related tasks and interest in receiving user feedback. We also recruited volunteer users to test systems. Seven teams registered systems for the track, and >30 individuals volunteered as test users; these volunteer users covered a broad range of specialties, including bench scientists, bioinformaticians and biocurators. The users, who had the option to participate anonymously, were provided with written and video documentation to familiarize themselves with the NLP tools and completed a survey to record their evaluation. Additional feedback was also provided by NLP system developers. The track was well received as shown by the overall positive feedback from the participating teams and the users. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-vii/track-4/.
Affiliation(s)
- Andrew Chatr-Aryamontri
  - Institute for Research in Immunology and Cancer (IRIC), University of Montreal, Marcelle-Coutu Pavilion, 2950 Chem. de Polytechnique Montreal, Quebec H3T 1J4, Canada
- Lynette Hirschman
  - MITRE Labs, The MITRE Corporation, 202 Burlington Rd., Bedford, MA 01730, USA
- Karen E Ross
  - Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, 2115 Wisconsin Ave NW, DC 20007, USA
- Rose Oughtred
  - Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, South Drive, Princeton, NJ 08544, USA
- Martin Krallinger
  - Barcelona Supercomputing Center (BSC), Plaça d'Eusebi Güell, 1-3, Barcelona 08034, Spain
- Kara Dolinski
  - Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, South Drive, Princeton, NJ 08544, USA
- Mike Tyers
  - Institute for Research in Immunology and Cancer (IRIC), University of Montreal, Marcelle-Coutu Pavilion, 2950 Chem. de Polytechnique Montreal, Quebec H3T 1J4, Canada
- Tonia Korves
  - MITRE Labs, The MITRE Corporation, 202 Burlington Rd., Bedford, MA 01730, USA
- Cecilia N Arighi
  - Computer and Information Sciences Department, University of Delaware, Ammon-Pinizzotto Biopharmaceutical Innovation Building, 590 Avenue 1743, Newark, DE 19713, USA
|
30
|
Liu Z, He M, Jiang Z, Wu Z, Dai H, Zhang L, Luo S, Han T, Li X, Jiang X, Zhu D, Cai X, Ge B, Liu W, Liu J, Shen D, Liu T. Survey on natural language processing in medical image analysis. Zhong Nan Da Xue Xue Bao Yi Xue Ban 2022; 47:981-993. [PMID: 36097765 PMCID: PMC10950114 DOI: 10.11817/j.issn.1672-7347.2022.220376] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/13/2022] [Indexed: 06/15/2023]
Abstract
Recent advancement in natural language processing (NLP) and medical imaging empowers the wide applicability of deep learning models. These developments have increased not only data understanding, but also knowledge of state-of-the-art architectures and their real-world potentials. Medical imaging researchers have recognized the limitations of only targeting images, as well as the importance of integrating multimodal inputs into medical image analysis. The lack of comprehensive surveys of the current literature, however, impedes the progress of this domain. Existing research perspectives, as well as the architectures, tasks, datasets, and performance measures examined in the present literature, are reviewed in this work, and we also provide a brief description of possible future directions in the field, aiming to provide researchers and healthcare professionals with a detailed summary of existing academic research and to provide rational insights to facilitate future research.
Affiliation(s)
- Zhengliang Liu
  - Department of Computer Science, University of Georgia, Athens, GA 30602, USA
- Mengshen He
  - School of Physics & Information Technology, Shaanxi Normal University, Xi'an 710119, China
- Zuowei Jiang
  - School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
- Zihao Wu
  - Department of Computer Science, University of Georgia, Athens, GA 30602, USA
- Haixing Dai
  - Department of Computer Science, University of Georgia, Athens, GA 30602, USA
- Lian Zhang
  - Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ 85054, USA
- Siyi Luo
  - Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
- Tianle Han
  - School of Physics & Information Technology, Shaanxi Normal University, Xi'an 710119, China
- Xiang Li
  - Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
- Xi Jiang
  - School of Life Science and Technology, University of Electronic Science and Technology, Chengdu 611731, China
- Dajiang Zhu
  - Department of Computer Science and Engineering, University of Texas at Arlington, Arlington, TX 76019, USA
- Xiaoyan Cai
  - School of Automation, Northwestern Polytechnical University, Xi'an 710072, China
- Bao Ge
  - School of Physics & Information Technology, Shaanxi Normal University, Xi'an 710119, China
- Wei Liu
  - Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ 85054, USA
- Jun Liu
  - Department of Radiology, Second Xiangya Hospital, Central South University, Changsha 410011, China
- Dinggang Shen
  - School of Biomedical Engineering, ShanghaiTech University, Shanghai 201210, China
- Tianming Liu
  - Department of Computer Science, University of Georgia, Athens, GA 30602, USA
|
31
|
Tong Y, Zhuang F, Zhang H, Fang C, Zhao Y, Wang D, Zhu H, Ni B. Improving biomedical named entity recognition by dynamic caching inter-sentence information. Bioinformatics 2022; 38:3976-3983. [PMID: 35758612 DOI: 10.1093/bioinformatics/btac422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Revised: 06/03/2022] [Accepted: 06/24/2022] [Indexed: 12/24/2022]
Abstract
MOTIVATION: Biomedical Named Entity Recognition (BioNER) aims to identify biomedical domain-specific entities (e.g. gene, chemical and disease) from unstructured texts. Although deep learning-based methods for BioNER achieve satisfactory results, there is still much room for improvement. Firstly, most existing methods use independent sentences as training units and ignore inter-sentence context, which usually leads to labeling inconsistency. Secondly, previous document-level BioNER work has shown that inter-sentence information is essential, but what information should be regarded as context remains ambiguous. Moreover, few pre-training-based BioNER models have introduced inter-sentence information. Hence, we propose a cache-based inter-sentence model called BioNER-Cache to alleviate the aforementioned problems. RESULTS: We propose a simple but effective dynamic caching module to capture inter-sentence information for BioNER. Specifically, the cache stores recent hidden representations constrained by predefined caching rules, and the model uses a query-and-read mechanism to retrieve similar historical records from the cache as the local context. Then, an attention-based gated network is adopted to generate context-related features with BioBERT. To dynamically update the cache, we design a scoring function and implement a multi-task approach to jointly train our model. We build a comprehensive benchmark on four biomedical datasets to evaluate the model performance fairly. Finally, extensive experiments clearly validate the superiority of our proposed BioNER-Cache compared with various state-of-the-art intra-sentence and inter-sentence baselines. AVAILABILITY AND IMPLEMENTATION: Code will be available at https://github.com/zgzjdx/BioNER-Cache. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
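The query-and-read idea described in this abstract can be illustrated with a toy cache. The sketch below is not the BioNER-Cache implementation: the fixed-capacity first-in-first-out eviction, the cosine-similarity ranking, and the names `RecentCache`, `write` and `read` are all assumptions chosen for illustration, since the abstract does not specify the caching rules or the scoring function.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class RecentCache:
    """Keeps the most recent representations; the oldest entry is evicted first."""

    def __init__(self, capacity):
        self._items = deque(maxlen=capacity)

    def write(self, key, vector):
        # Appending to a full deque silently drops the oldest record.
        self._items.append((key, vector))

    def read(self, query, k=2):
        # Rank cached records by similarity to the query and return the top-k keys.
        ranked = sorted(self._items, key=lambda kv: cosine(query, kv[1]), reverse=True)
        return [key for key, _ in ranked[:k]]


# Hypothetical two-dimensional "hidden representations" of four sentences.
cache = RecentCache(capacity=3)
cache.write("s1", [1.0, 0.0])
cache.write("s2", [0.0, 1.0])
cache.write("s3", [1.0, 1.0])
cache.write("s4", [0.9, 0.1])  # evicts "s1"

context = cache.read([1.0, 0.0], k=2)
print(context)
```

In the model described above, the retrieved records would then be fused with the current sentence representation by the attention-based gated network; that step is omitted from this sketch.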
Affiliation(s)
- Yiqi Tong
  - Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
- Fuzhen Zhuang
  - Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
  - SKLSDE, School of Computer Science, Beihang University, Beijing 100191, China
- Huajie Zhang
  - Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
- Chuyu Fang
  - Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
- Yu Zhao
  - School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China
- Deqing Wang
  - SKLSDE, School of Computer Science, Beihang University, Beijing 100191, China
- Bin Ni
  - Xiamen Data Intelligence Academy of ICT, CAS, Xiamen 361021, China
|
32
|
Lin SJ, Yeh WC, Chiu YW, Chang YC, Hsu MH, Chen YS, Hsu WL. A BERT-based ensemble learning approach for the BioCreative VII challenges: full-text chemical identification and multi-label classification in PubMed articles. Database (Oxford) 2022; 2022:6645124. [PMID: 35849027 PMCID: PMC9290865 DOI: 10.1093/database/baac056] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2022] [Revised: 06/20/2022] [Accepted: 07/02/2022] [Indexed: 11/25/2022]
Abstract
In this research, we explored various state-of-the-art biomedical-specific pre-trained Bidirectional Encoder Representations from Transformers (BERT) models for the National Library of Medicine - Chemistry (NLM-CHEM) and LitCovid tracks in the BioCreative VII Challenge, and proposed a BERT-based ensemble learning approach that integrates the advantages of various models to improve the system's performance. The experimental results of the NLM-CHEM track demonstrate that our method can achieve remarkable performance, with F1-scores of 85% and 91.8% in strict and approximate evaluations, respectively. Moreover, the proposed Medical Subject Headings identifier (MeSH ID) normalization algorithm is effective in entity normalization, achieving an F1-score of about 80% in both strict and approximate evaluations. For the LitCovid track, the proposed method is also effective in detecting topics in the Coronavirus disease 2019 (COVID-19) literature, outperforming the compared methods and achieving state-of-the-art performance on the LitCovid corpus. Database URL: https://www.ncbi.nlm.nih.gov/research/coronavirus/.
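One common way to combine several taggers is per-token majority voting. The abstract does not say how the ensemble merges its models, so the sketch below is only an assumed illustration; the function name `majority_vote` and the label strings are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label sequences from several models by per-token majority vote.

    `predictions` is a list of label sequences, one per model, all of the
    same length. Ties go to the label seen first, i.e. to earlier models.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same number of tokens")
    voted = []
    for labels_at_position in zip(*predictions):
        # most_common(1) returns the most frequent label at this position.
        voted.append(Counter(labels_at_position).most_common(1)[0][0])
    return voted
```

For example, three models that label a three-token span as `["B-Chem", "O", "O"]`, `["B-Chem", "I-Chem", "O"]` and `["O", "I-Chem", "O"]` would be merged position by position.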
Affiliation(s)
- Sheng-Jie Lin
  - Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Wen-Chao Yeh
  - Institute of Information Systems and Applications, National Tsing Hua University, No. 101, Section 2, Guangfu Rd, East District, Hsinchu City 300, Taiwan
- Yu-Wen Chiu
  - Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Yung-Chun Chang
  - Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
  - Clinical Big Data Research Center, Taipei Medical University Hospital, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
  - Pervasive AI Research Labs, Ministry of Science and Technology, No. 1001, Daxue Rd, East District, Hsinchu City 300, Taiwan
- Min-Huei Hsu
  - Graduate Institute of Data Science, Taipei Medical University, No. 172-1, Section 2, Keelung Rd, Dáan District, Taipei City 106, Taiwan
- Yi-Shin Chen
  - Institute of Information Systems and Applications, National Tsing Hua University, No. 101, Section 2, Guangfu Rd, East District, Hsinchu City 300, Taiwan
- Wen-Lian Hsu
  - Pervasive AI Research Labs, Ministry of Science and Technology, No. 1001, Daxue Rd, East District, Hsinchu City 300, Taiwan
  - Department of Computer Science and Information Engineering, Asia University, No. 500, Liufeng Rd, Wufeng District, Taichung City 413, Taiwan
|
33
|
|
34
|
Yan A, McAuley J, Lu X, Du J, Chang EY, Gentili A, Hsu CN. RadBERT: Adapting Transformer-based Language Models to Radiology. Radiol Artif Intell 2022; 4:e210258. [PMID: 35923376 PMCID: PMC9344353 DOI: 10.1148/ryai.210258] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Received: 10/12/2021] [Revised: 04/28/2022] [Accepted: 06/03/2022] [Indexed: 06/15/2023]
Abstract
PURPOSE To investigate if tailoring a transformer-based language model to radiology is beneficial for radiology natural language processing (NLP) applications. MATERIALS AND METHODS This retrospective study presents a family of bidirectional encoder representations from transformers (BERT)-based language models adapted for radiology, named RadBERT. Transformers were pretrained with either 2.16 or 4.42 million radiology reports from U.S. Department of Veterans Affairs health care systems nationwide on top of four different initializations (BERT-base, Clinical-BERT, robustly optimized BERT pretraining approach [RoBERTa], and BioMed-RoBERTa) to create six variants of RadBERT. Each variant was fine-tuned for three representative NLP tasks in radiology: (a) abnormal sentence classification: models classified sentences in radiology reports as reporting abnormal or normal findings; (b) report coding: models assigned a diagnostic code to a given radiology report for five coding systems; and (c) report summarization: given the findings section of a radiology report, models selected key sentences that summarized the findings. Model performance was compared by bootstrap resampling with five intensively studied transformer language models as baselines: BERT-base, BioBERT, Clinical-BERT, BlueBERT, and BioMed-RoBERTa. RESULTS For abnormal sentence classification, all models performed well (accuracies above 97.5 and F1 scores above 95.0). RadBERT variants achieved significantly higher scores than corresponding baselines when given only 10% or less of 12 458 annotated training sentences. For report coding, all variants outperformed baselines significantly for all five coding systems. The variant RadBERT-BioMed-RoBERTa performed the best among all models for report summarization, achieving a Recall-Oriented Understudy for Gisting Evaluation-1 score of 16.18 compared with 15.27 by the corresponding baseline (BioMed-RoBERTa, P < .004). 
CONCLUSION Transformer-based language models tailored to radiology had improved performance of radiology NLP tasks compared with baseline transformer language models. Keywords: Translation, Unsupervised Learning, Transfer Learning, Neural Networks, Informatics. Supplemental material is available for this article. © RSNA, 2022. See also commentary by Wiggins and Tejani in this issue.
|
35
|
Cho H, Kim B, Choi W, Lee D, Lee H. Plant phenotype relationship corpus for biomedical relationships between plants and phenotypes. Sci Data 2022; 9:235. [PMID: 35618736 PMCID: PMC9135735 DOI: 10.1038/s41597-022-01350-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2021] [Accepted: 05/03/2022] [Indexed: 11/09/2022]
Abstract
Medicinal plants have demonstrated therapeutic potential across a wide range of observable characteristics in the human body, known as "phenotypes," and have been considered favorably in clinical treatment. With ever-increasing interest in plants, many researchers have attempted to extract meaningful information by identifying relationships between plants and phenotypes from the existing literature. Although natural language processing (NLP) aims to extract useful information from unstructured textual data, there is no appropriate corpus available to train and evaluate NLP models for plants and phenotypes. Therefore, in the present study, we present the plant-phenotype relationship (PPR) corpus, a high-quality resource that supports the development of various NLP fields; it includes information derived from 600 PubMed abstracts corresponding to 5,668 plant and 11,282 phenotype entities, and contains a total of 9,709 relationships. We also describe benchmark results from named entity recognition and relation extraction systems to verify the quality of our data and to show the significant performance of NLP tasks on the PPR test set.
Affiliation(s)
- Hyejin Cho
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
- Baeksoo Kim
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
- Wonjun Choi
  - Digital Curation Center, Korea Institute of Science and Technology Information, Daejeon, 34141, Republic of Korea
- Doheon Lee
  - Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, 34141, Republic of Korea
- Hyunju Lee
  - School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, 61005, Republic of Korea
|
36
|
Li Z, Bai H, Zhang R, Chen B, Wang J, Xue B, Ren X, Wang J, Jia Y, Zang W, Wang J, Chen X. Systematic analysis of critical genes and pathways identified a signature of neuropathic pain after spinal cord injury. Eur J Neurosci 2022; 56:3991-4008. [PMID: 35560852 DOI: 10.1111/ejn.15693] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2021] [Revised: 03/21/2022] [Accepted: 03/26/2022] [Indexed: 11/28/2022]
Abstract
Spinal cord injury (SCI) damages sensory systems, producing chronic neuropathic pain that is resistant to medical treatment. The specific mechanisms underlying SCI-induced neuropathic pain (SCI-NP) remain unclear, and protein biomarkers have not yet been integrated into diagnostic screening. To better understand the host molecular pathways involved in SCI-NP, we used the PubMed database and bioinformatics methods to identify target genes and their associated pathways. We reviewed 2504 articles on the regulation of SCI-NP and used text mining of PubMed database abstracts to determine associations among 12 pathways and networks. Based on this method, we identified two central genes in SCI-NP: interleukin-6 (IL-6) and tumor necrosis factor-α (TNF-α). Adult male Sprague-Dawley rats were used to build the SCI-NP models. The threshold for paw withdrawal was significantly reduced in the SCI group and TLR4 was activated in microglia after SCI. ELISA showed that TNF-α and IL-6 levels were significantly higher in the SCI group than in the sham group. Western blot showed that expression of TLR4/MyD88/NF-κB inflammatory pathway proteins increased dramatically in the SCI group. With the TLR4 inhibitor TAK-242, the changes in pain threshold and in the expression of inflammatory factors and inflammatory signaling pathway proteins were reversed, and TLR4 in microglia was suppressed, suggesting that SCI-NP is related to neuroinflammation mediated by the TLR4 signaling pathway. In conclusion, we found that TNF-α and IL-6 are neuroinflammation-related genes involved in SCI-NP, which can be alleviated by inhibiting the inflammatory pathway upstream of the TLR4/MyD88/NF-κB inflammatory pathway.
Affiliation(s)
- Zefu Li
  - Department of Basic Medical College of Human Anatomy of Zhengzhou University, Zhengzhou, Henan Province, China
- Huiying Bai
  - Outpatient Surgery, Zhengzhou University Hospital, Zhengzhou, Henan Province, China
- Ruoyu Zhang
  - Department of Basic Medical College of Human Anatomy of Zhengzhou University, Zhengzhou, Henan Province, China
- Bohan Chen
  - Department of Basic Medical College of Human Anatomy of Zhengzhou University, Zhengzhou, Henan Province, China
- Junmin Wang
  - Department of Basic Medical College of Human Anatomy of Zhengzhou University, Zhengzhou, Henan Province, China
- Bohan Xue
  - Department of Basic Medical College of Human Anatomy of Zhengzhou University, Zhengzhou, Henan Province, China
- Xiuhua Ren
  - Department of Basic Medical College of Human Anatomy of Zhengzhou University, Zhengzhou, Henan Province, China
- Jiarui Wang
  - The Johns Hopkins University, Baltimore, Maryland, USA
- Yanjie Jia
  - Department of Neurology, the first affiliated Hospital Zhengzhou University, Zhengzhou, Henan Province, China
- Weidong Zang
  - Department of Basic Medical College of Human Anatomy of Zhengzhou University, Zhengzhou, Henan Province, China
- Jian Wang
  - Department of Basic Medical College of Human Anatomy of Zhengzhou University, Zhengzhou, Henan Province, China
- Xuemei Chen
  - Department of Basic Medical College of Human Anatomy of Zhengzhou University, Zhengzhou, Henan Province, China
|
37
|
Church K, Liu B. Acronyms and Opportunities for Improving Deep Nets. Front Artif Intell 2022; 4:732381. [PMID: 34988434 PMCID: PMC8721666 DOI: 10.3389/frai.2021.732381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2021] [Accepted: 10/21/2021] [Indexed: 11/13/2022]
Abstract
Recently, several studies have reported promising results with BERT-like methods on acronym tasks. In this study, we find an older rule-based program, Ab3P, not only performs better, but error analysis suggests why. There is a well-known spelling convention in acronyms where each letter in the short form (SF) refers to “salient” letters in the long form (LF). The error analysis uses decision trees and logistic regression to show that there is an opportunity for many pre-trained models (BERT, T5, BioBert, BART, ERNIE) to take advantage of this spelling convention.
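The spelling convention mentioned here can be approximated with a crude check that every character of the short form appears, in order, in the long form, starting at its first letter. This is a simplified heuristic written for illustration only; it is not the Ab3P algorithm, and the function name `matches_spelling_convention` is hypothetical.

```python
def matches_spelling_convention(short_form, long_form):
    """Return True if the short form's characters occur in order in the long form.

    The first character of the short form must also match the first
    character of the long form. Case and punctuation are ignored.
    """
    sf = [c.lower() for c in short_form if c.isalnum()]
    lf = long_form.lower()
    if not sf or not lf or lf[0] != sf[0]:
        return False
    position = 0
    for ch in sf:
        # Search only to the right of the previously matched character.
        position = lf.find(ch, position)
        if position == -1:
            return False
        position += 1
    return True
```

A check like this accepts pairs such as `("SF", "short form")` and rejects pairs whose letters cannot be aligned; a real system would additionally prefer word-initial and other salient letters.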
Affiliation(s)
- Boxiang Liu
  - Baidu Research, Sunnyvale, CA, United States
|
38
|
Chai Z, Jin H, Shi S, Zhan S, Zhuo L, Yang Y. Hierarchical shared transfer learning for biomedical named entity recognition. BMC Bioinformatics 2022; 23:8. [PMID: 34983362 PMCID: PMC8729142 DOI: 10.1186/s12859-021-04551-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/14/2021] [Accepted: 12/22/2021] [Indexed: 02/01/2023]
Abstract
BACKGROUND Biomedical named entity recognition (BioNER) is a basic and important medical information extraction task that extracts medical entities with special meaning from medical texts. In recent years, deep learning has become the main research direction of BioNER due to its excellent data-driven context coding ability. However, in the BioNER task, deep learning suffers from poor generalization and instability. RESULTS We propose hierarchical shared transfer learning, which combines multi-task learning and fine-tuning, and realizes multi-level information fusion between the underlying entity features and the upper data features. We select 14 datasets containing 4 types of entities for training and evaluate the model. The experimental results showed that the F1-scores on the gold standard datasets BC5CDR-chemical, BC5CDR-disease, BC2GM, BC4CHEMD, NCBI-disease and LINNAEUS changed by 0.57, 0.90, 0.42, 0.77, 0.98 and -2.16, respectively, compared to the single-task XLNet-CRF model. BC5CDR-chemical, BC5CDR-disease and BC4CHEMD achieved state-of-the-art results. The reasons why LINNAEUS's multi-task results are lower than single-task results are discussed at the dataset level. CONCLUSION Compared with using multi-task learning and fine-tuning alone, the model recognizes medical entities more accurately and has higher generalization and stability.
Affiliation(s)
- Zhaoying Chai
  - College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
- Han Jin
  - College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
- Shenghui Shi
  - College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
- Siyan Zhan
  - School of Public Health, Peking University, Beijing, China
- Lin Zhuo
  - Research Center of Clinical Epidemiology, Peking University Third Hospital, Beijing, China
- Yu Yang
  - National Institute of Health Data Science, Peking University, Beijing, China
|
39
|
Jha K, Zhang A. Continual knowledge infusion into pre-trained biomedical language models. Bioinformatics 2022; 38:494-502. [PMID: 34554186 DOI: 10.1093/bioinformatics/btab671] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/09/2021] [Revised: 09/12/2021] [Accepted: 09/20/2021] [Indexed: 02/03/2023]
Abstract
MOTIVATION Biomedical language models produce meaningful concept representations that are useful for a variety of biomedical natural language processing (bioNLP) applications such as named entity recognition, relationship extraction and question answering. Recent research trends have shown that contextualized language models (e.g. BioBERT, BioELMo) possess tremendous representational power and are able to achieve impressive accuracy gains. However, these models are still unable to learn high-quality representations for concepts with low context information (i.e. rare words). Infusing complementary information from knowledge bases (KBs) is likely to be helpful when the corpus-specific information is insufficient to learn robust representations. Moreover, as the biomedical domain contains numerous KBs, it is imperative to develop approaches that can integrate the KBs in a continual fashion. RESULTS We propose a new representation learning approach that progressively fuses the semantic information from multiple KBs into pretrained biomedical language models. Since most of the KBs in the biomedical domain are expressed as parent-child hierarchies, we choose to model the hierarchical KBs and propose a new knowledge modeling strategy that encodes their topological properties at a granular level. Moreover, the proposed continual learning technique efficiently updates the concept representations to accommodate the new knowledge while preserving the memory efficiency of contextualized language models. Altogether, the proposed approach generates knowledge-powered embeddings with high fidelity and learning efficiency. Extensive experiments conducted on bioNLP tasks validate the efficacy of the proposed approach and demonstrate its capability in generating robust concept representations.
Affiliation(s)
- Kishlay Jha
  - Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
- Aidong Zhang
  - Department of Computer Science, University of Virginia, Charlottesville, VA 22903, USA
|
40
|
Xiong Y, Chen S, Tang B, Chen Q, Wang X, Yan J, Zhou Y. Improving deep learning method for biomedical named entity recognition by using entity definition information. BMC Bioinformatics 2021; 22:600. [PMID: 34920699 PMCID: PMC8680061 DOI: 10.1186/s12859-021-04236-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/30/2021] [Accepted: 06/04/2021] [Indexed: 11/19/2022]
Abstract
BACKGROUND Biomedical named entity recognition (NER) is a fundamental task of biomedical text mining that finds the boundaries of entity mentions in biomedical text and determines their entity type. To accelerate the development of biomedical NER techniques in Spanish, the PharmaCoNER organizers launched a competition to recognize pharmacological substances, compounds, and proteins. Biomedical NER is usually treated as a sequence labeling task, and almost all state-of-the-art sequence labeling methods ignore the meaning of different entity types. In this paper, we investigate some methods to introduce the meaning of entity types in deep learning methods for biomedical NER and apply them to the PharmaCoNER 2019 challenge. The meaning of each entity type is represented by its definition information. MATERIAL AND METHOD We investigate how to use entity definition information in the following two methods: (1) SQuAD-style machine reading comprehension (MRC) methods that treat entity definition information as query and biomedical text as context and predict answer spans as entities. (2) Span-level one-pass (SOne) methods that predict entity spans one type at a time and introduce entity type meaning, which is represented by entity definition information. All models are trained and tested on the PharmaCoNER 2019 corpus, and their performance is evaluated by strict micro-average precision, recall, and F1-score. RESULTS Entity definition information brings improvements to both SQuAD-style MRC and SOne methods by about 0.003 in micro-averaged F1-score. The SQuAD-style MRC model using entity definition information as query achieves the best performance with a micro-averaged precision of 0.9225, a recall of 0.9050, and an F1-score of 0.9137. It outperforms the best model of the PharmaCoNER 2019 challenge by 0.0032 in F1-score. Compared with the state-of-the-art model without using manually-crafted features, our model obtains a 1% improvement in F1-score, which is significant. These results indicate that entity definition information is useful for deep learning methods on biomedical NER. CONCLUSION Our entity definition information enhanced models achieve the state-of-the-art micro-average F1 score of 0.9137, which implies that entity definition information has a positive impact on biomedical NER detection. In the future, we will explore more entity definition information from knowledge graphs.
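The reported F1-score follows from the reported precision and recall, since F1 is their harmonic mean: 2 × 0.9225 × 0.9050 / (0.9225 + 0.9050) ≈ 0.9137. A short sketch of the arithmetic, with helper names (`f1_score`, `micro_prf`) chosen here for illustration:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(true_positives, false_positives, false_negatives):
    """Micro-averaged precision, recall and F1 from counts pooled over all types."""
    predicted = true_positives + false_positives
    gold = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / gold if gold else 0.0
    return precision, recall, f1_score(precision, recall)


# Reproduce the abstract's F1 from its precision and recall.
print(round(f1_score(0.9225, 0.9050), 4))
```

Micro-averaging pools the true-positive, false-positive and false-negative counts across entity types before computing the ratios, so frequent types weigh more than rare ones.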
Affiliation(s)
- Ying Xiong
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
  - Peng Cheng Laboratory, Shenzhen, China
- Shuai Chen
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
- Buzhou Tang
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
  - Peng Cheng Laboratory, Shenzhen, China
- Qingcai Chen
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
  - Peng Cheng Laboratory, Shenzhen, China
- Xiaolong Wang
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
- Jun Yan
  - Yidu Cloud (Beijing) Technology Co., Ltd, Beijing, China
- Yi Zhou
  - Zhongshan School of Medicine, Sun Yat-Sen University, Guangzhou, 510080, China
|
41
|
Alshammari N, Alanazi S. The impact of using different annotation schemes on named entity recognition. Egyptian Informatics Journal 2021. [DOI: 10.1016/j.eij.2020.10.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
|
42
|
Serrano Nájera G, Narganes Carlón D, Crowther DJ. TrendyGenes, a computational pipeline for the detection of literature trends in academia and drug discovery. Sci Rep 2021; 11:15747. [PMID: 34344904 PMCID: PMC8333311 DOI: 10.1038/s41598-021-94897-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/16/2021] [Accepted: 07/08/2021] [Indexed: 02/07/2023]
Abstract
Target identification and prioritisation are prominent first steps in modern drug discovery. Traditionally, individual scientists have used their expertise to manually interpret scientific literature and prioritise opportunities. However, increasing publication rates and the wider routine coverage of human genes by omic-scale research make it difficult to maintain meaningful overviews from which to identify promising new trends. Here we propose an automated yet flexible pipeline that identifies trends in the scientific corpus which align with the specific interests of a researcher and facilitate an initial prioritisation of opportunities. Using a procedure based on co-citation networks and machine learning, genes and diseases are first parsed from PubMed articles using a novel named entity recognition system together with publication date and supporting information. Then recurrent neural networks are trained to predict the publication dynamics of all human genes. For a user-defined therapeutic focus, genes generating more publications or citations are identified as high-interest targets. We also use topic detection routines to help understand why a gene is trendy and implement a system to propose the most prominent review articles for a potential target. This TrendyGenes pipeline detects emerging targets and pathways and provides a new way to explore the literature for individual researchers, pharmaceutical companies and funding agencies.
Affiliation(s)
- Guillermo Serrano Nájera
  - Division of Cell and Developmental Biology, School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK
- David Narganes Carlón
  - Division of Cell and Developmental Biology, School of Life Sciences, University of Dundee, Dundee, DD1 5EH, UK
  - Division of Population Health and Genomics, Ninewells Hospital, School of Medicine, University of Dundee, Dundee, DD1 9SY, UK
  - Exscientia Ltd, Dundee One, River Court, 5 West Victoria Dock Road, Dundee, DD1 3JT, UK
- Daniel J Crowther
  - Exscientia Ltd, Dundee One, River Court, 5 West Victoria Dock Road, Dundee, DD1 3JT, UK
|
43
|
Tian Y, Shen W, Song Y, Xia F, He M, Li K. Improving biomedical named entity recognition with syntactic information. BMC Bioinformatics 2020; 21:539. [PMID: 33238875 PMCID: PMC7687711 DOI: 10.1186/s12859-020-03834-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/13/2020] [Accepted: 10/23/2020] [Indexed: 11/29/2022]
Abstract
Background Biomedical named entity recognition (BioNER) is an important task for understanding biomedical texts, which can be challenging due to the lack of large-scale labeled training data and domain knowledge. To address the challenge, in addition to using powerful encoders (e.g., biLSTM and BioBERT), one possible method is to leverage extra knowledge that is easy to obtain. Previous studies have shown that auto-processed syntactic information can be a useful resource to improve model performance, but their approaches are limited to directly concatenating the embeddings of syntactic information to the input word embeddings. Therefore, such syntactic information is leveraged in an inflexible way, where inaccurate information may hurt model performance. Results In this paper, we propose BioKMNER, a BioNER model for biomedical texts with key-value memory networks (KVMN) to incorporate auto-processed syntactic information. We evaluate BioKMNER on six English biomedical datasets, where our method with KVMN outperforms the strong baseline method, namely, BioBERT, from the previous study on all datasets. Specifically, the F1 scores of our best performing model are 85.29% on BC2GM, 77.83% on JNLPBA, 94.22% on BC5CDR-chemical, 90.08% on NCBI-disease, 89.24% on LINNAEUS, and 76.33% on Species-800, where state-of-the-art performance is obtained on four of them (i.e., BC2GM, BC5CDR-chemical, NCBI-disease, and Species-800). Conclusion The experimental results on six English benchmark datasets demonstrate that auto-processed syntactic information can be a useful resource for BioNER and our method with KVMN can appropriately leverage such information to improve model performance.
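A key-value memory read can be sketched in a few lines: the query is scored against each key, the scores are normalized with a softmax, and the output is the weighted sum of the values. The sketch below shows only this generic mechanism. How BioKMNER builds its keys and values is not specified in the abstract, so treating keys as context-feature embeddings and values as syntactic-label embeddings is an assumption, and the function name `kv_memory_read` is hypothetical.

```python
import numpy as np


def kv_memory_read(query, keys, values):
    """Generic key-value memory read.

    query:  (d,) vector for the current token
    keys:   (n, d) matrix, one key per memory slot
    values: (n, m) matrix, one value per memory slot
    Returns the (m,) weighted sum of values and the (n,) attention weights.
    """
    scores = keys @ query
    scores = scores - scores.max()  # subtract the max for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ values, weights
```

Because the weights are learned from the query-key match, unreliable memory slots can receive low weight, unlike direct concatenation of syntactic embeddings to the input.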
Affiliation(s)
- Yan Song
  - The Chinese University of Hong Kong, Shenzhen, China
  - Shenzhen Research Institute of Big Data, Shenzhen, China
- Fei Xia
  - University of Washington, Seattle, USA
- Min He
  - Hunan University, Changsha, China
- Kenli Li
  - Hunan University, Changsha, China
|
44
|
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36:1234-1240. [PMID: 31501885 PMCID: PMC7703786 DOI: 10.1093/bioinformatics/btz682] [Citation(s) in RCA: 930] [Impact Index Per Article: 232.5] [Received: 05/16/2019] [Revised: 07/29/2019] [Accepted: 09/05/2019] [Indexed: 12/15/2022]
Abstract
MOTIVATION Biomedical text mining is becoming increasingly important as the number of biomedical documents rapidly grows. With the progress in natural language processing (NLP), extracting valuable information from biomedical literature has gained popularity among researchers, and deep learning has boosted the development of effective biomedical text mining models. However, directly applying the advancements in NLP to biomedical text mining often yields unsatisfactory results due to a word distribution shift from general domain corpora to biomedical corpora. In this article, we investigate how the recently introduced pre-trained language model BERT can be adapted for biomedical corpora. RESULTS We introduce BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining), which is a domain-specific language representation model pre-trained on large-scale biomedical corpora. With almost the same architecture across tasks, BioBERT largely outperforms BERT and previous state-of-the-art models in a variety of biomedical text mining tasks when pre-trained on biomedical corpora. While BERT obtains performance comparable to that of previous state-of-the-art models, BioBERT significantly outperforms them on the following three representative biomedical text mining tasks: biomedical named entity recognition (0.62% F1 score improvement), biomedical relation extraction (2.80% F1 score improvement) and biomedical question answering (12.24% MRR improvement). Our analysis results show that pre-training BERT on biomedical corpora helps it to understand complex biomedical texts. AVAILABILITY AND IMPLEMENTATION We make the pre-trained weights of BioBERT freely available at https://github.com/naver/biobert-pretrained, and the source code for fine-tuning BioBERT available at https://github.com/dmis-lab/biobert.
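The question-answering result above is reported as mean reciprocal rank (MRR): for each question, take the reciprocal of the rank of the first correct answer (0 if none is returned), then average over questions. A minimal sketch, with the function name and input layout chosen here for illustration:

```python
def mean_reciprocal_rank(ranked_lists, gold_answers):
    """Average of 1/rank of the first correct candidate for each question.

    ranked_lists: list of candidate lists, best candidate first
    gold_answers: list of sets of acceptable answers, aligned with ranked_lists
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for ranked, gold in zip(ranked_lists, gold_answers):
        for rank, candidate in enumerate(ranked, start=1):
            if candidate in gold:
                total += 1.0 / rank
                break  # only the first correct candidate counts
    return total / len(ranked_lists)
```

With three questions whose first correct answers sit at rank 1, rank 2 and nowhere, the score is (1 + 0.5 + 0) / 3 = 0.5.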
Affiliation(s)
- Jinhyuk Lee
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Wonjin Yoon
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Sungdong Kim
- Clova AI Research, Naver Corp, Seong-Nam 13561, Korea
- Donghyeon Kim
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea
- Chan Ho So
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul 02841, Korea
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Korea; Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul 02841, Korea
|
45
|
Wang CCN, Jin J, Chang JG, Hayakawa M, Kitazawa A, Tsai JJP, Sheu PCY. Identification of most influential co-occurring gene suites for gastrointestinal cancer using biomedical literature mining and graph-based influence maximization. BMC Med Inform Decis Mak 2020; 20:208. [PMID: 32883271 PMCID: PMC7469322 DOI: 10.1186/s12911-020-01227-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/19/2020] [Accepted: 08/20/2020] [Indexed: 12/02/2022]
Abstract
Background: Gastrointestinal (GI) cancers, including colorectal cancer, gastric cancer, and pancreatic cancer, are among the most frequent malignancies diagnosed annually and represent a major public health problem worldwide. Methods: This paper reports an aided curation pipeline to identify potential influential genes for gastrointestinal cancer. The curation pipeline integrates biomedical literature to identify named entities by Bi-LSTM-CNN-CRF methods. The entities and their associations can be used to construct a graph, from which we can compute the sets of co-occurring genes that are the most influential based on an influence maximization algorithm. Results: The most influential sets of co-occurring genes that we discover include RARA - CRBP1, CASP3 - BCL2, BCL2 - CASP3 - CRBP1, RARA - CASP3 - CRBP1, FOXJ1 - RASSF3 - ESR1, FOXJ1 - RASSF1A - ESR1, and FOXJ1 - RASSF1A - TNFAIP8 - ESR1. With TCGA and functional and pathway enrichment analysis, we prove the proposed approach works well in the context of gastrointestinal cancer. Conclusions: Our pipeline, which uses text mining to identify objects and relationships to construct a graph and uses graph-based influence maximization to discover the most influential co-occurring genes, presents a viable direction to assist knowledge discovery for clinical applications.
Affiliation(s)
- Charles C N Wang
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan; Center for Artificial Intelligence in Precision Medicine, Asia University, Taichung, Taiwan
- Jennifer Jin
- Department of EECS and BME, University of California, Irvine, USA
- Jan-Gowth Chang
- Department of Laboratory Medicine, China Medical University Hospital, Taichung, Taiwan; Center for Precision Medicine, China Medical University Hospital, Taichung, Taiwan; Graduate Institute of Clinical Medical Science, School of Medicine, College of Medicine, China Medical University, Taichung, Taiwan
- Jeffrey J P Tsai
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan
- Phillip C-Y Sheu
- Department of EECS and BME, University of California, Irvine, USA
|
46
|
Patra R, Saha SK. Utilizing external corpora through kernel function: application in biomedical named entity recognition. Prog Artif Intell 2020; 9:209-219. [DOI: 10.1007/s13748-020-00208-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
|
47
|
Xu J, Kim S, Song M, Jeong M, Kim D, Kang J, Rousseau JF, Li X, Xu W, Torvik VI, Bu Y, Chen C, Ebeid IA, Li D, Ding Y. Building a PubMed knowledge graph. Sci Data 2020; 7:205. [PMID: 32591513 PMCID: PMC7320186 DOI: 10.1038/s41597-020-0543-2] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 12/11/2019] [Accepted: 05/26/2020] [Indexed: 01/08/2023]
Abstract
PubMed® is an essential resource for the medical domain, but useful concepts are either difficult to extract or are ambiguous, which has significantly hindered knowledge discovery. To address this issue, we constructed a PubMed knowledge graph (PKG) by extracting bio-entities from 29 million PubMed abstracts, disambiguating author names, integrating funding data through the National Institutes of Health (NIH) ExPORTER, collecting affiliation history and educational background of authors from ORCID®, and identifying fine-grained affiliation data from MapAffil. Through the integration of these credible multi-source data, we could create connections among the bio-entities, authors, articles, affiliations, and funding. Data validation revealed that the BioBERT deep learning method of bio-entity extraction significantly outperformed the state-of-the-art models based on the F1 score (by 0.51%), with the author name disambiguation (AND) achieving an F1 score of 98.09%. PKG can trigger broader innovations, not only enabling us to measure scholarly impact, knowledge usage, and knowledge transfer, but also assisting us in profiling authors and organizations based on their connections with bio-entities.
Measurement(s): textual entity • author information textual entity • funding source declaration textual entity • abstract • Biologic Entity Classification
Technology Type(s): machine learning • computational modeling technique
Machine-accessible metadata file describing the reported data: 10.6084/m9.figshare.12452597
Affiliation(s)
- Jian Xu
- School of Information Management, Sun Yat-sen University, Guangzhou, China
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Min Song
- Department of Library and Information Science, Yonsei University, Seoul, South Korea
- Minbyul Jeong
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Donghyeon Kim
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, South Korea
- Xin Li
- School of Information, University of Texas at Austin, Austin, TX, USA
- Weijia Xu
- Texas Advanced Computing Center, Austin, TX, USA
- Vetle I Torvik
- School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA
- Yi Bu
- Department of Information Management, Peking University, Beijing, China
- Chongyan Chen
- School of Information, University of Texas at Austin, Austin, TX, USA
- Islam Akef Ebeid
- School of Information, University of Texas at Austin, Austin, TX, USA
- Daifeng Li
- School of Information Management, Sun Yat-sen University, Guangzhou, China
- Ying Ding
- Dell Medical School, University of Texas at Austin, Austin, TX, USA; School of Information, University of Texas at Austin, Austin, TX, USA
|
48
|
Savery ME, Rogers WJ, Pillai M, Mork JG, Demner-Fushman D. Chemical Entity Recognition for MEDLINE Indexing. AMIA Jt Summits Transl Sci Proc 2020; 2020:561-568. [PMID: 32477678 PMCID: PMC7233078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Chemical entity recognition is essential for indexing scientific literature in the MEDLINE database at the National Library of Medicine. However, the tool currently used to suggest terms for indexing, the Medical Text Indexer, was not originally conceived as a chemical recognition tool. It has instead been adapted to the task via its use of MetaMap and the addition of in-house patterns and rules. In order to develop a tool more suitable for chemical recognition, we have created a collection of 200 MEDLINE titles and abstracts annotated with genes, proteins, inorganic and organic chemicals, as well as other biological molecules. We use this collection to evaluate eleven chemical entity recognition systems, where we seek to identify a tool that effectively recognizes chemical entities for indexing and also performs well on chemical recognition beyond the indexing task. We observe the highest performance with a SciBERT ensemble.
Affiliation(s)
- Max E Savery
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
- Willie J Rogers
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
- Malvika Pillai
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
- James G Mork
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
- Dina Demner-Fushman
- Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, National Institutes of Health, Bethesda, MD
|
49
|
Wang X, Zhang Y, Ren X, Zhang Y, Zitnik M, Shang J, Langlotz C, Han J. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics 2019; 35:1745-1752. [PMID: 30307536 DOI: 10.1093/bioinformatics/bty869] [Citation(s) in RCA: 75] [Impact Index Per Article: 18.8] [Received: 04/02/2018] [Revised: 10/03/2018] [Accepted: 10/09/2018] [Indexed: 12/13/2022]
Abstract
MOTIVATION: State-of-the-art biomedical named entity recognition (BioNER) systems often require handcrafted features specific to each entity type, such as genes, chemicals and diseases. Although recent studies explored using neural network models for BioNER to free experts from manual feature engineering, the performance remains limited by the available training data for each entity type. RESULTS: We propose a multi-task learning framework for BioNER to collectively use the training data of different types of entities and improve the performance on each of them. In experiments on 15 benchmark BioNER datasets, our multi-task model achieves substantially better performance compared with state-of-the-art BioNER systems and baseline neural sequence labeling models. Further analysis shows that the large performance gains come from sharing character- and word-level information among relevant biomedical entities across differently labeled corpora. AVAILABILITY AND IMPLEMENTATION: Our source code is available at https://github.com/yuzhimanhua/lm-lstm-crf. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xuan Wang
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Yu Zhang
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Xiang Ren
- Department of Computer Science, University of Southern California, Los Angeles, CA, USA
- Yuhao Zhang
- Department of Radiology, School of Medicine, Stanford University, Stanford, CA, USA
- Marinka Zitnik
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Jingbo Shang
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Curtis Langlotz
- Department of Radiology, School of Medicine, Stanford University, Stanford, CA, USA
- Jiawei Han
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
|
50
|
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36:1234-1240. [PMID: 31501885 DOI: 10.48550/arxiv.1901.08746] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/16/2019] [Revised: 07/29/2019] [Accepted: 09/05/2019] [Indexed: 05/20/2023]
|