1
Wang Y, Tong H, Zhu Z, Hou F, Li Y. Enhancing biomedical named entity recognition with parallel boundary detection and category classification. BMC Bioinformatics 2025; 26:63. [PMID: 40000968 PMCID: PMC11863403 DOI: 10.1186/s12859-025-06086-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2024] [Accepted: 02/14/2025] [Indexed: 02/27/2025]
Abstract
BACKGROUND Named entity recognition (NER) is a fundamental task in natural language processing. Recognizing entities in biomedical text, known as BioNER, is particularly crucial for cutting-edge applications. However, BioNER poses greater challenges than traditional NER due to (1) nested structures and (2) category correlations inherent in biomedical entities. Recently, various BioNER models have been developed based on region classification or large language models. Despite being successful, these models still struggle to balance handling nested structures and capturing category knowledge.
RESULTS We present a novel parallel BioNER model, BEAN, designed to address the unique properties of biomedical entities while achieving a reasonable balance between handling nested structures and incorporating category correlations. Extensive experiments on five public NER datasets, including four biomedical datasets, demonstrate that BEAN achieves state-of-the-art performance.
CONCLUSIONS The proposed BEAN is elaborately designed to achieve two key objectives of the BioNER task: clearly detecting entity boundaries and correctly classifying entity categories. It is the first BioNER model to handle nested structures and category correlations in parallel. We exploit head, tail, and contextualized features to efficiently detect entity boundaries via a triaffine model. To the best of our knowledge, we are the first to introduce a multi-label classification model for the BioNER task to extract entity category information without boundary guidance.
Affiliation(s)
- Yu Wang
  - School of Science, China Pharmaceutical University, Nanjing, China
- Hanghang Tong
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
- Ziye Zhu
  - Jiangsu Key Laboratory of Big Data Security and Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, China
- Fengzhen Hou
  - School of Science, China Pharmaceutical University, Nanjing, China
- Yun Li
  - Jiangsu Key Laboratory of Big Data Security and Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, China

2
Wiegers TC, Davis AP, Wiegers J, Sciaky D, Barkalow F, Wyatt B, Strong M, McMorran R, Abrar S, Mattingly CJ. Integrating AI-powered text mining from PubTator into the manual curation workflow at the Comparative Toxicogenomics Database. Database (Oxford) 2025; 2025:baaf013. [PMID: 39982792 PMCID: PMC11844237 DOI: 10.1093/database/baaf013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2024] [Revised: 01/23/2025] [Accepted: 02/10/2025] [Indexed: 02/23/2025]
Abstract
The Comparative Toxicogenomics Database (CTD) is a manually curated knowledge- and discovery-base that seeks to advance understanding about the relationship between environmental exposures and human health. CTD's manual curation process extracts from the biomedical literature molecular relationships between chemicals/drugs, genes/proteins, phenotypes, diseases, anatomical terms, and species. These relationships are organized in a highly systematic way in order to make them not only informative but also scientifically computational, enabling inferential hypotheses to be formed to address gaps in understanding. Integral to CTD's functionality is the use of structured, hierarchical ontologies and controlled vocabularies to describe these molecular relationships. Normalizing text (i.e. translating raw text from the literature into these controlled vocabularies) can be a time-consuming process for biocurators. To facilitate the normalization process and improve the efficiency with which our scientists curate the literature, CTD evaluated and integrated into the curation process PubTator 3.0, a state-of-the-art, AI-powered resource which extracts and normalizes from the literature many of the key biomedical concepts CTD curates. Here, we describe CTD's long-standing history with Natural Language Processing (NLP), how this history helped form our objectives for NLP integration, the evaluation of PubTator against our objectives, and the integration of PubTator into CTD's curation workflow. Database URL: https://ctdbase.org.
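The normalization step described above (mapping raw text to controlled-vocabulary terms) can be illustrated with a minimal dictionary-lookup sketch. This is not how CTD or PubTator 3.0 work internally; the vocabulary, the identifiers, and the function name `normalize_mentions` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical controlled vocabulary: synonym -> (preferred term, identifier).
# The identifiers are placeholders, not real CTD or MeSH accessions.
VOCABULARY = {
    "acetaminophen": ("Acetaminophen", "CHEM:0001"),
    "paracetamol": ("Acetaminophen", "CHEM:0001"),
    "tnf-alpha": ("TNF", "GENE:0001"),
    "tumor necrosis factor": ("TNF", "GENE:0001"),
}


def normalize_mentions(text, vocabulary=VOCABULARY):
    """Return (mention, preferred term, identifier) for each vocabulary hit."""
    hits = []
    # Longer synonyms are checked first so multi-word terms are reported first.
    for synonym in sorted(vocabulary, key=len, reverse=True):
        # Require non-word characters (or string edges) around the synonym.
        pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            term, identifier = vocabulary[synonym]
            hits.append((match.group(0), term, identifier))
    return hits


if __name__ == "__main__":
    for hit in normalize_mentions("Paracetamol increased TNF-alpha expression."):
        print(hit)
```

Exact string matching like this misses spelling variants and ambiguous mentions, which is one reason curated workflows tend to rely on trained recognition and normalization tools instead.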
Affiliation(s)
- Thomas C Wiegers
  - Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA
- Allan Peter Davis
  - Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA
- Jolene Wiegers
  - Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA
- Daniela Sciaky
  - Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA
- Fern Barkalow
  - Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA
- Brent Wyatt
  - Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA
- Melissa Strong
  - Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA
- Roy McMorran
  - Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA
- Sakib Abrar
  - Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA
- Carolyn J Mattingly
  - Department of Biological Sciences, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA
  - Center for Human Health and the Environment, North Carolina State University, Toxicology Building, 850 Main Campus Drive, Raleigh, NC 27695, USA

3
Musella L, Afonso Castro A, Lai X, Widmann M, Vera J. ENQUIRE automatically reconstructs, expands, and drives enrichment analysis of gene and Mesh co-occurrence networks from context-specific biomedical literature. PLoS Comput Biol 2025; 21:e1012745. [PMID: 39932993 PMCID: PMC11844901 DOI: 10.1371/journal.pcbi.1012745] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2024] [Revised: 02/21/2025] [Accepted: 12/20/2024] [Indexed: 02/13/2025]
Abstract
The accelerating growth of scientific literature overwhelms our capacity to manually distil complex phenomena like molecular networks linked to diseases. Moreover, biases in biomedical research and database annotation limit our interpretation of facts and generation of hypotheses. ENQUIRE (Expanding Networks by Querying Unexpectedly Inter-Related Entities) offers a time- and resource-efficient alternative to manual literature curation and database mining. ENQUIRE reconstructs and expands co-occurrence networks of genes and biomedical ontologies from user-selected input corpora and network-inferred PubMed queries. Its modest resource usage and its integration of text mining, automatic querying, and network-based statistics that mitigate literature biases make ENQUIRE unique in its broad-scope applications. For example, ENQUIRE can generate co-occurrence gene networks that reflect high-confidence, functional networks. When tested on case studies spanning cancer, cell differentiation, and immunity, ENQUIRE identified interlinked genes and enriched pathways unique to each topic, thereby preserving their underlying context specificity. ENQUIRE supports biomedical researchers by easing literature annotation, boosting hypothesis formulation, and facilitating the identification of molecular targets for subsequent experimentation.
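As a rough illustration of what a co-occurrence network is, the sketch below counts how many documents mention each pair of entities. It is not ENQUIRE's algorithm (which, per the abstract, also involves automatic querying and network-based statistics); the function name `cooccurrence_edges`, the `min_count` threshold, and the toy entity lists are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(doc_entities, min_count=1):
    """Count how many documents mention each unordered pair of entities."""
    counts = Counter()
    for entities in doc_entities:
        # Deduplicate within a document and sort so each pair has one
        # canonical orientation, e.g. ("MDM2", "TP53") and never the reverse.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    # Keep only pairs seen in at least `min_count` documents.
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Toy input: one list of recognized gene/MeSH terms per abstract.
    docs = [
        ["TP53", "MDM2", "Apoptosis"],
        ["TP53", "MDM2"],
        ["MDM2", "MDM2"],
    ]
    print(cooccurrence_edges(docs, min_count=2))
```

Raw document counts favor heavily studied entities, so a practical pipeline would normally weight or test these edges statistically before interpreting them.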
Affiliation(s)
- Luca Musella
  - Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Deutsches Zentrum Immuntherapie, BZKF, and Uniklinikum Erlangen, Erlangen, Germany
- Alejandro Afonso Castro
  - Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Deutsches Zentrum Immuntherapie, BZKF, and Uniklinikum Erlangen, Erlangen, Germany
- Xin Lai
  - Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Deutsches Zentrum Immuntherapie, BZKF, and Uniklinikum Erlangen, Erlangen, Germany
  - Faculty of Medicine and Health Technology, Systems and Network Medicine Lab, Biomedicine Unit, Tampere University, Tampere, Finland
- Max Widmann
  - Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Deutsches Zentrum Immuntherapie, BZKF, and Uniklinikum Erlangen, Erlangen, Germany
  - University of Konstanz, Konstanz, Germany
- Julio Vera
  - Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Deutsches Zentrum Immuntherapie, BZKF, and Uniklinikum Erlangen, Erlangen, Germany

4
Yin Y, Kim H, Xiao X, Wei CH, Kang J, Lu Z, Xu H, Fang M, Chen Q. Augmenting biomedical named entity recognition with general-domain resources. J Biomed Inform 2024; 159:104731. [PMID: 39368529 DOI: 10.1016/j.jbi.2024.104731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2024] [Revised: 09/05/2024] [Accepted: 09/27/2024] [Indexed: 10/07/2024]
Abstract
OBJECTIVE Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets.
METHODS We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset.
RESULTS We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset.
CONCLUSION This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, code, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
Affiliation(s)
- Yu Yin
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Hyunjae Kim
  - Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Xiao Xiao
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Chih Hsuan Wei
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, United States of America
- Jaewoo Kang
  - Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, United States of America
- Hua Xu
  - Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
- Meng Fang
  - Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Qingyu Chen
  - Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America

5
Phan CP, Phan B, Chiang JH. Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini. Database (Oxford) 2024; 2024:baae104. [PMID: 39383312 PMCID: PMC11463225 DOI: 10.1093/database/baae104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 08/21/2024] [Accepted: 09/04/2024] [Indexed: 10/11/2024]
Abstract
Despite numerous research efforts by teams participating in the BioCreative VIII Track 01 employing various techniques to achieve the high accuracy of biomedical relation tasks, the overall performance in this area still has substantial room for improvement. Large language models bring a new opportunity to improve the performance of existing techniques in natural language processing tasks. This paper presents our improved method for relation extraction, which involves integrating two renowned large language models: Gemini and GPT-4. Our new approach utilizes GPT-4 to generate augmented data for training, followed by an ensemble learning technique to combine the outputs of diverse models to create a more precise prediction. We then employ a method using Gemini responses as input to fine-tune the BioNLP-PubMed-Bert classification model, which leads to improved performance as measured by precision, recall, and F1 scores on the same test dataset used in the challenge evaluation. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-1/.
Affiliation(s)
- Cong-Phuoc Phan
  - Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan
- Ben Phan
  - Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan
- Jung-Hsien Chiang
  - Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan

6
Sänger M, Garda S, Wang XD, Weber-Genzel L, Droop P, Fuchs B, Akbik A, Leser U. HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools. Bioinformatics 2024; 40:btae564. [PMID: 39302686 PMCID: PMC11453098 DOI: 10.1093/bioinformatics/btae564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Revised: 08/23/2024] [Accepted: 09/17/2024] [Indexed: 09/22/2024]
Abstract
MOTIVATION With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization, i.e. grounding them in a knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied "in the wild," i.e. on application-dependent text collections that differ moderately to extremely from those used for training, varying, e.g. in focus, genre or text type. This raises the question of whether the reported performance, usually obtained by training and evaluating on different partitions of the same corpus, can be trusted for downstream applications.
RESULTS Here, we report on the results of a carefully designed cross-corpus benchmark for entity recognition and normalization, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five, based on predefined criteria like feature richness and availability, for an in-depth analysis on three publicly available corpora covering four entity types. Our results present a mixed picture and show that cross-corpus performance is significantly lower than the in-corpus performance. HunFlair2, the redesigned and extended successor of the HunFlair tool, showed the best performance on average, being closely followed by PubTator Central. Our results indicate that users of BTM tools should expect a lower performance than the originally published one when applying tools "in the wild" and show that further research is necessary for more robust BTM tools.
AVAILABILITY AND IMPLEMENTATION All our models are integrated into the Natural Language Processing (NLP) framework flair: https://github.com/flairNLP/flair. Code to reproduce our results is available at: https://github.com/hu-ner/hunflair2-experiments.
Affiliation(s)
- Mario Sänger
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Samuele Garda
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Xing David Wang
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Leon Weber-Genzel
  - Center for Information and Language Processing (CIS), Ludwig Maximilian University Munich, München 80539, Germany
- Pia Droop
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Benedikt Fuchs
  - Research Industrial Systems Engineering (RISE) Forschungs-, Entwicklungs- und Großprojektberatung GmbH, Schwechat 2320, Austria
- Alan Akbik
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Ulf Leser
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany

7
Lai PT, Coudert E, Aimo L, Axelsen K, Breuza L, de Castro E, Feuermann M, Morgat A, Pourcel L, Pedruzzi I, Poux S, Redaschi N, Rivoire C, Sveshnikova A, Wei CH, Leaman R, Luo L, Lu Z, Bridge A. EnzChemRED, a rich enzyme chemistry relation extraction dataset. Sci Data 2024; 11:982. [PMID: 39251610 PMCID: PMC11384730 DOI: 10.1038/s41597-024-03835-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Accepted: 08/23/2024] [Indexed: 09/11/2024]
Abstract
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
Grants
- U24 HG007822 NHGRI NIH HHS
- U41 HG007822 NHGRI NIH HHS
- NIH Intramural Research Program, National Library of Medicine
- Expert curation and evaluation of EnzChemRED at Swiss-Prot were supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI) and the National Human Genome Research Institute (NHGRI), Office of Director [OD/DPCPSI/ODSS], National Institute of Allergy and Infectious Diseases (NIAID), National Institute on Aging (NIA), National Institute of General Medical Sciences (NIGMS), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Eye Institute (NEI), National Cancer Institute (NCI), National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health [U24HG007822], and by the European Union's Horizon Europe Framework Programme (grant number 101080997), supported in Switzerland through the State Secretariat for Education, Research and Innovation (SERI).
- Fundamental Research Funds for the Central Universities [DUT23RC(3)014 to L.L.]
Affiliation(s)
- Po-Ting Lai
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Elisabeth Coudert
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lucila Aimo
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Kristian Axelsen
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lionel Breuza
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Edouard de Castro
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Marc Feuermann
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Anne Morgat
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Lucille Pourcel
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Ivo Pedruzzi
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Sylvain Poux
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Nicole Redaschi
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Catherine Rivoire
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Anastasia Sveshnikova
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland
- Chih-Hsuan Wei
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Robert Leaman
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Ling Luo
  - School of Computer Science and Technology, Dalian University of Technology, 116024, Dalian, China
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, 20894, USA
- Alan Bridge
  - Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211, Geneva, 4, Switzerland

8
Sarol MJ, Hong G, Guerra E, Kilicoglu H. Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach. Database (Oxford) 2024; 2024:baae079. [PMID: 39197056 PMCID: PMC11352595 DOI: 10.1093/database/baae079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2024] [Revised: 06/21/2024] [Accepted: 08/14/2024] [Indexed: 08/30/2024]
Abstract
Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP) and can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that effectively leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus that was the basis of the shared task. We explored several methods for each task and combinations thereof: for NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, for mapping other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline showed substantial improvement compared to our shared task submission, NER: 93.53 (+3.09), EL: 83.87 (+9.73), RE: 46.18 (+15.67), and ND: 38.86 (+14.9). While the performances of the NER and EL models are reasonably high, RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/. Database URL: https://github.com/janinaj/e2eBioMedRE/.
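The BIO scheme mentioned above labels each token as beginning (B-), inside (I-), or outside (O) an entity. A minimal sketch of decoding such labels into entity spans follows; it is not the authors' code, and the function name `bio_to_spans`, the example tags, and the choice to ignore stray I- tags are assumptions made for illustration.

```python
def bio_to_spans(tags):
    """Convert BIO tags to (start, end, label) spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the labels agree.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the span that is currently open, if any.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray I- tag with no matching open span is ignored in this sketch.
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


if __name__ == "__main__":
    tokens = ["Aspirin", "acetate", "reduces", "fever", "."]
    tags = ["B-Chemical", "I-Chemical", "O", "B-Disease", "O"]
    for start, end, label in bio_to_spans(tags):
        print(label, tokens[start:end])
```

A span classification model, the alternative compared in the abstract, scores candidate (start, end) spans directly instead of decoding per-token tags, which is one way to represent nested entities.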
Affiliation(s)
- M Janina Sarol
  - Informatics Programs, University of Illinois Urbana-Champaign, 614 E Daniel Street, Champaign, IL 61820, United States
- Gibong Hong
  - School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
- Evan Guerra
  - School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
- Halil Kilicoglu
  - School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States

9
Islamaj R, Lai PT, Wei CH, Luo L, Almeida T, Jonker RAA, Conceição SIR, Sousa DF, Phan CP, Chiang JH, Li J, Pan D, Meesawad W, Tsai RTH, Sarol MJ, Hong G, Valiev A, Tutubalina E, Lee SM, Hsu YY, Li M, Verspoor K, Lu Z. The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII. Database (Oxford) 2024; 2024:baae069. [PMID: 39114977 PMCID: PMC11306928 DOI: 10.1093/database/baae069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 05/27/2024] [Accepted: 07/09/2024] [Indexed: 08/11/2024]
Abstract
The BioRED track at BioCreative VIII calls for a community effort to identify, semantically categorize, and highlight the novelty factor of the relationships between biomedical entities in unstructured text. Relation extraction is crucial for many biomedical natural language processing (NLP) applications, from drug discovery to custom medical solutions. The BioRED track simulates a real-world application of biomedical relationship extraction, and as such, considers multiple biomedical entity types, normalized to their specific corresponding database identifiers, and defines relationships between them in the documents. The challenge consisted of two subtasks: (i) in Subtask 1, participants were given the article text and human expert annotated entities, and were asked to extract the relation pairs and identify their semantic type and novelty factor, and (ii) in Subtask 2, participants were given only the article text, and were asked to build an end-to-end system that could identify and categorize the relationships and their novelty. We received a total of 94 submissions from 14 teams worldwide. The highest F-scores achieved for Subtask 1 were: 77.17% for relation pair identification, 58.95% for relation type identification, 59.22% for novelty identification, and 44.55% when evaluating all of the above aspects of comprehensive relation extraction. The highest F-scores achieved for Subtask 2 were: 55.84% for relation pair, 43.03% for relation type, 42.74% for novelty, and 32.75% for comprehensive relation extraction. The entire BioRED track dataset and other challenge materials are available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ and https://codalab.lisn.upsaclay.fr/competitions/13377 and https://codalab.lisn.upsaclay.fr/competitions/13378. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ and https://codalab.lisn.upsaclay.fr/competitions/13377 and https://codalab.lisn.upsaclay.fr/competitions/13378.
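The F-scores above fall as the matching criterion gets stricter (pair only, then pair plus type, and so on). The sketch below shows that effect with a plain set-based precision, recall, and F1 computation. It is not the official BioRED evaluation script; the function name `precision_recall_f1` and the toy document, entity, and relation-type identifiers are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 for two collections of hashable items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy relations: (document id, entity 1, entity 2, relation type).
    gold = [("D1", "C1", "G1", "Association"), ("D1", "C2", "G2", "Bind")]
    pred = [("D1", "C1", "G1", "Association"), ("D1", "C2", "G2", "Association")]

    # Pair-level matching ignores the relation type (last field).
    pair_scores = precision_recall_f1([r[:3] for r in pred], [r[:3] for r in gold])
    # Type-level matching requires the relation type to agree as well.
    type_scores = precision_recall_f1(pred, gold)
    print("pair:", pair_scores)
    print("pair + type:", type_scores)
```

In this toy case the second prediction counts as correct at the pair level but not once the relation type must also match, so the stricter score is lower.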
Collapse
Affiliation(s)
- Rezarta Islamaj
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
- Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
- Tiago Almeida
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Richard A. A Jonker
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Sofia I. R Conceição
- Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Edifício C6 Campo Grande, Lisbon 1749-016, Portugal
- Diana F Sousa
- Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Edifício C6 Campo Grande, Lisbon 1749-016, Portugal
- Cong-Phuoc Phan
- Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan, Republic of China
- Jung-Hsien Chiang
- Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan, Republic of China
- Jiru Li
- School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
- Dinghao Pan
- School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
- Wilailack Meesawad
- Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd., Zhongli District, Taoyuan City 32001, Taiwan, Republic of China
- Richard Tzong-Han Tsai
- Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd., Zhongli District, Taoyuan City 32001, Taiwan, Republic of China
- Research Center for Humanities and Social Sciences, Academia Sinica, No. 128, Section 2, Academia Rd., Nangang District, Taipei City 115201, Taiwan, Republic of China
- M. Janina Sarol
- School of Information Sciences, University of Illinois at Urbana-Champaign, 614 E. Daniel St, Champaign, IL 61820, United States
- Gibong Hong
- School of Information Sciences, University of Illinois at Urbana-Champaign, 614 E. Daniel St, Champaign, IL 61820, United States
- Airat Valiev
- Higher School of Economics University, 20 Myasnitskaya St, Moscow 101000, Russia
- Elena Tutubalina
- Artificial Intelligence Research Institute (AIRI), 32 Kutuzovskiy St, Moscow 121170, Russia
- Kazan Federal University, 18 Kremlevskaya St, Kazan 420008, Russia
- Shao-Man Lee
- Miin Wu School of Computing, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan, Republic of China
- Yi-Yu Hsu
- Miin Wu School of Computing, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan, Republic of China
- Mingjie Li
- School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, Victoria 3000, Australia
- Karin Verspoor
- School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, Victoria 3000, Australia
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
|
10
|
Garda S, Leser U. BELHD: improving biomedical entity linking with homonym disambiguation. Bioinformatics 2024; 40:btae474. [PMID: 39067036 PMCID: PMC11310454 DOI: 10.1093/bioinformatics/btae474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Revised: 06/14/2024] [Accepted: 07/25/2024] [Indexed: 07/30/2024] Open
Abstract
MOTIVATION Biomedical entity linking (BEL) is the task of grounding entity mentions to a given knowledge base (KB). Recently, neural name-based methods, i.e. systems that identify the most appropriate name in the KB for a given mention using a neural network (either via dense retrieval or autoregressive modeling), have achieved remarkable results for the task without requiring manual tuning or the definition of domain/entity-specific rules. However, as name-based methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance for KBs where homonyms account for a large number of entity mentions (e.g. UMLS and NCBI Gene). RESULTS We present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. BELHD builds upon the BioSyn model with two crucial extensions. First, it performs pre-processing of the KB, during which it expands homonyms with a specifically constructed disambiguating string, thus enforcing unique linking decisions. Second, it introduces candidate sharing, a novel strategy that strengthens the overall training signal by including similar mentions from the same document as positive or negative examples, according to their corresponding KB identifier. Experiments with 10 corpora and 5 entity types show that BELHD improves upon current neural state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the prediction model and thus can also improve other neural methods, which we exemplify for GenBioEL, a generative name-based BEL approach. AVAILABILITY AND IMPLEMENTATION The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belhd.
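The KB pre-processing idea can be illustrated with a minimal sketch: any name shared by several identifiers is expanded so that every name maps to exactly one entity. The abstract does not say how BELHD builds its disambiguating string, so the suffix used here (the identifier itself) is only a placeholder assumption, and the function name and KB entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def expand_homonyms(kb: List[Tuple[str, str]]) -> Dict[str, str]:
    """Map each (possibly expanded) name to exactly one KB identifier.

    kb is a list of (identifier, name) pairs. Names shared by several
    identifiers get a disambiguating suffix. The suffix is simply the
    identifier, an illustrative choice rather than BELHD's construction.
    """
    ids_by_name = defaultdict(set)
    for identifier, name in kb:
        ids_by_name[name].add(identifier)

    name_to_id = {}
    for identifier, name in kb:
        if len(ids_by_name[name]) > 1:  # homonym: several entities share it
            name = f"{name} ({identifier})"
        name_to_id[name] = identifier
    return name_to_id


# Hypothetical KB in which two entities share the name "CDK2".
toy_kb = [("gene:A", "CDK2"), ("gene:B", "CDK2"), ("gene:C", "TP53")]
lookup = expand_homonyms(toy_kb)
print(lookup)
```

After expansion, a name-based linker that returns a name can be resolved to a single identifier by dictionary lookup, which is what makes the linking decision unique.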
Affiliation(s)
- Samuele Garda
- Computer Science, Humboldt-Universität zu Berlin, Berlin 12489, Germany
- Ulf Leser
- Computer Science, Humboldt-Universität zu Berlin, Berlin 12489, Germany
|
11
|
Jonker RAA, Almeida T, Antunes R, Almeida JR, Matos S. Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes. Database (Oxford) 2024; 2024:baae068. [PMID: 39083461 PMCID: PMC11290360 DOI: 10.1093/database/baae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Revised: 05/15/2024] [Accepted: 07/08/2024] [Indexed: 08/02/2024]
Abstract
The identification of medical concepts from clinical narratives is of great interest to the biomedical scientific community due to its importance for treatment improvement and drug development research. Biomedical named entity recognition (NER) in clinical texts is crucial for automated information extraction, facilitating patient record analysis, drug development, and medical research. Traditional approaches often focus on single-class NER tasks, yet recent advancements emphasize the necessity of addressing multi-class scenarios, particularly in complex biomedical domains. This paper proposes a strategy to integrate a multi-head conditional random field (CRF) classifier for multi-class NER in Spanish clinical documents. Our methodology handles overlapping entity instances of different types, a common challenge in traditional NER methodologies, by using a multi-head CRF model. This architecture enhances computational efficiency and ensures scalability for multi-class NER tasks while maintaining high performance. By combining four diverse datasets, SympTEMIST, MedProcNER, DisTEMIST, and PharmaCoNER, we expand the scope of NER to encompass five classes: symptoms, procedures, diseases, chemicals, and proteins. To the best of our knowledge, these datasets combined create the largest Spanish multi-class dataset focusing on biomedical entity recognition and linking for clinical notes, which is important for training a biomedical model in Spanish. We also provide entity linking to the multi-lingual Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, with the eventual goal of performing biomedical relation extraction. In experiments on Spanish clinical documents, our strategy provides competitive results against single-class NER models. For NER, our system achieves a combined micro-averaged F1-score of 78.73, and clinical mentions are normalized to SNOMED CT with an end-to-end F1-score of 54.51.
The code to run our system is publicly available at https://github.com/ieeta-pt/Multi-Head-CRF. Database URL: https://github.com/ieeta-pt/Multi-Head-CRF.
Affiliation(s)
- Richard A A Jonker
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Tiago Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Rui Antunes
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- João R Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Sérgio Matos
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
|
12
|
Almeida T, Jonker RAA, Antunes R, Almeida JR, Matos S. Towards discovery: an end-to-end system for uncovering novel biomedical relations. Database (Oxford) 2024; 2024:baae057. [PMID: 38994795 PMCID: PMC11240158 DOI: 10.1093/database/baae057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 05/20/2024] [Accepted: 06/19/2024] [Indexed: 07/13/2024]
Abstract
Biomedical relation extraction is an ongoing challenge within the natural language processing community. Its application is important for understanding scientific biomedical literature, with many use cases, such as drug discovery, precision medicine, disease diagnosis, treatment optimization and biomedical knowledge graph construction. Therefore, the development of a tool capable of effectively addressing this task holds the potential to improve knowledge discovery by automating the extraction of relations from research manuscripts. The first track in the BioCreative VIII competition extended the scope of this challenge by introducing the detection of novel relations within the literature. This paper describes our participating system, which initially focused on jointly extracting and classifying novel relations between biomedical entities. We then describe our subsequent advancement to an end-to-end model. Specifically, we enhanced our initial system by incorporating it into a cascading pipeline that includes a tagger and linker module. This integration enables the comprehensive extraction of relations and classification of their novelty directly from raw text. Our experiments yielded promising results: our tagger module attained state-of-the-art named entity recognition performance, with a micro F1-score of 90.24, while our end-to-end system achieved a competitive novelty F1-score of 24.59. The code to run our system is publicly available at https://github.com/ieeta-pt/BioNExt. Database URL: https://github.com/ieeta-pt/BioNExt.
Affiliation(s)
- Tiago Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Richard A A Jonker
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Rui Antunes
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- João R Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Sérgio Matos
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
|
13
|
Wei CH, Allot A, Lai PT, Leaman R, Tian S, Luo L, Jin Q, Wang Z, Chen Q, Lu Z. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024; 52:W540-W546. [PMID: 38572754 PMCID: PMC11223843 DOI: 10.1093/nar/gkae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 03/02/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] Open
Abstract
PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Shubo Tian
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Qiao Jin
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhizheng Wang
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
|
14
|
Keloth VK, Hu Y, Xie Q, Peng X, Wang Y, Zheng A, Selek M, Raja K, Wei CH, Jin Q, Lu Z, Chen Q, Xu H. Advancing entity recognition in biomedicine via instruction tuning of large language models. Bioinformatics 2024; 40:btae163. [PMID: 38514400 PMCID: PMC11001490 DOI: 10.1093/bioinformatics/btae163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 02/18/2024] [Accepted: 03/19/2024] [Indexed: 03/23/2024] Open
Abstract
MOTIVATION Large Language Models (LLMs) have the potential to revolutionize the field of Natural Language Processing, excelling not only in text generation and reasoning tasks but also in zero/few-shot learning, swiftly adapting to new tasks with minimal fine-tuning. LLMs have also demonstrated great promise in biomedical and healthcare applications. However, when it comes to Named Entity Recognition (NER), particularly within the biomedical domain, LLMs fall short of the effectiveness exhibited by fine-tuned domain-specific models. One key reason is that NER is typically conceptualized as a sequence labeling task, whereas LLMs are optimized for text generation and reasoning tasks. RESULTS We developed an instruction-based learning paradigm that transforms biomedical NER from a sequence labeling task into a generation task. This paradigm is end-to-end and streamlines the training and evaluation process by automatically repurposing pre-existing biomedical NER datasets. We further developed BioNER-LLaMA using the proposed paradigm with LLaMA-7B as the foundational LLM. We conducted extensive testing on BioNER-LLaMA across three widely recognized biomedical NER datasets, consisting of entities related to diseases, chemicals, and genes. The results revealed that BioNER-LLaMA consistently achieved F1-scores 5% to 30% higher than few-shot GPT-4 on datasets with different biomedical entities. We show that a general-domain LLM can match the performance of rigorously fine-tuned PubMedBERT models and PMC-LLaMA, a biomedical-specific language model. Our findings underscore the potential of our proposed paradigm in developing general-domain LLMs that can rival SOTA performances in multi-task, multi-domain scenarios in biomedical and health applications. AVAILABILITY AND IMPLEMENTATION Datasets and other resources are available at https://github.com/BIDS-Xu-Lab/BioNER-LLaMA.
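Recasting sequence labeling as generation amounts to converting token-level tags into an instruction, an input text, and a target string. The sketch below assumes BIO-tagged data with a single entity type. The prompt wording, the `;` separator, and the `None` placeholder are assumptions made for illustration and are not the templates used for BioNER-LLaMA.

```python
from typing import Dict, List


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect entity mentions from BIO tags as space-joined strings."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)
        else:  # "O", or a stray "I-" with no open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


def to_instruction_example(tokens: List[str], tags: List[str],
                           entity_type: str) -> Dict[str, str]:
    """Turn one tagged sentence into an instruction-tuning record."""
    entities = bio_to_spans(tokens, tags)
    return {
        "instruction": f"Extract all {entity_type} entities from the text.",
        "input": " ".join(tokens),
        "output": "; ".join(entities) if entities else "None",
    }


tokens = ["Patients", "with", "breast", "cancer", "received", "tamoxifen", "."]
tags = ["O", "O", "B-Disease", "I-Disease", "O", "O", "O"]
record = to_instruction_example(tokens, tags, "disease")
print(record)
```

Because the conversion is mechanical, an existing NER corpus can be repurposed without new annotation, and model outputs can be parsed back into mentions for the usual F1 evaluation.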
Affiliation(s)
- Vipina K Keloth
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
- Yan Hu
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX-77030, United States
- Qianqian Xie
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
- Xueqing Peng
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
- Yan Wang
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
- Andrew Zheng
- William P. Clements High School, Sugar Land, TX-77479, United States
- Melih Selek
- Stephen F. Austin High School, Sugar Land, TX-77498, United States
- Kalpana Raja
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
- Chih Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
- Qiao Jin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
- Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
- Qingyu Chen
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
- Hua Xu
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
|