1
|
Lu Q, Li R, Wen A, Wang J, Wang L, Liu H. Large Language Models Struggle in Token-Level Clinical Named Entity Recognition. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2025; 2024:748-757. [PMID: 40417588 PMCID: PMC12099373] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 05/27/2025]
Abstract
Large Language Models (LLMs) have revolutionized various sectors, including healthcare where they are employed in diverse applications. Their utility is particularly significant in the context of rare diseases, where data scarcity, complexity, and specificity pose considerable challenges. In the clinical domain, Named Entity Recognition (NER) stands out as an essential task and it plays a crucial role in extracting relevant information from clinical texts. Despite the promise of LLMs, current research mostly concentrates on document-level NER, identifying entities in a more general context across entire documents, without extracting their precise location. Additionally, efforts have been directed towards adapting ChatGPTfor token-level NER. However, there is a significant research gap when it comes to employing token-level NER for clinical texts, especially with the use of local open-source LLMs. This study aims to bridge this gap by investigating the effectiveness of both proprietary and local LLMs in token-level clinical NER. Essentially, we delve into the capabilities of these models through a series of experiments involving zero-shot prompting, few-shot prompting, retrieval-augmented generation (RAG), and instruction-fine-tuning. Our exploration reveals the inherent challenges LLMs face in token-level NER, particularly in the context of rare diseases, and suggests possible improvements for their application in healthcare. This research contributes to narrowing a significant gap in healthcare informatics and offers insights that could lead to a more refined application of LLMs in the healthcare sector.
Collapse
Affiliation(s)
- Qiuhao Lu
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
| | - Rui Li
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
| | - Andrew Wen
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
| | - Jinlian Wang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
| | - Liwei Wang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
| | - Hongfang Liu
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
| |
Collapse
|
2
|
Wang Q, Huo X, Cui J, Li J, Bai X, Wu C, Zhang Q, Sun D, Zhao J. Systematic evaluation of patient-reported outcomes in clinical trials of digital health in cardiovascular diseases. NPJ Digit Med 2025; 8:238. [PMID: 40316762 PMCID: PMC12048614 DOI: 10.1038/s41746-025-01637-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2024] [Accepted: 04/13/2025] [Indexed: 05/04/2025] Open
Abstract
The integration of patient-reported outcomes (PROs) in digital health (DH) trials for cardiovascular diseases (CVDs) remains unknown. We searched ClinicalTrials.gov for randomized trials that tested DH interventions in CVDs from 2004 to 2024. The search identified 8037 trials, with 673 eligible trials included in the analysis. Among these, 321 trials (48%) incorporated PROs. The number of DH trials and the use of PROs have shown a significant upward trend. Phone-based interventions predominated (38%), mostly targeting hypertension (38%) and heart failure (27%). Behavioral interventions showed higher prevalence of PROs' usage (1.24 [1.04-1.48]), while trials for diagnostic or screening purpose (0.39 [0.20-0.77]) utilized PROs less frequently. Only 15% of trials reported results on ClinicalTrials.gov, while 58% were published in PubMed after completion. Despite DH trial expansion, PRO integration remains insufficient, especially in trials where clinical and patient perspectives are important in informing treatment decisions. Timely results dissemination is critical to improving transparency.
Collapse
Affiliation(s)
- Qianying Wang
- National Science Library, Chinese Academy of Sciences, Beijing, PR China
| | - Xiqian Huo
- National Clinical Research Center for Cardiovascular Diseases, State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, PR China.
| | - Jianlan Cui
- National Clinical Research Center for Cardiovascular Diseases, State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, PR China
| | - Jie Li
- National Science Library, Chinese Academy of Sciences, Beijing, PR China
| | - Xueke Bai
- National Clinical Research Center for Cardiovascular Diseases, State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, PR China
| | - Chaoqun Wu
- National Clinical Research Center for Cardiovascular Diseases, State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, PR China
| | - Qian Zhang
- Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, PR China
| | - Dianjianyi Sun
- Department of Epidemiology & Biostatistics, School of Public Health, Peking University, Beijing, PR China.
- Peking University Center for Public Health and Epidemic Preparedness & Response, Beijing, PR China.
- Key Laboratory of Epidemiology of Major Diseases (Peking University), Ministry of Education, Beijing, PR China.
| | - Jie Zhao
- Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, PR China.
| |
Collapse
|
3
|
Remaki A, Ung J, Pages P, Wajsburt P, Liu E, Faure G, Petit-Jean T, Tannier X, Gérardin C. Improving Phenotyping of Patients With Immune-Mediated Inflammatory Diseases Through Automated Processing of Discharge Summaries: Multicenter Cohort Study. JMIR Med Inform 2025; 13:e68704. [PMID: 40203304 PMCID: PMC12018854 DOI: 10.2196/68704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2024] [Revised: 01/21/2025] [Accepted: 01/25/2025] [Indexed: 04/11/2025] Open
Abstract
BACKGROUND Valuable insights gathered by clinicians during their inquiries and documented in textual reports are often unavailable in the structured data recorded in electronic health records (EHRs). OBJECTIVE This study aimed to highlight that mining unstructured textual data with natural language processing techniques complements the available structured data and enables more comprehensive patient phenotyping. A proof-of-concept for patients diagnosed with specific autoimmune diseases is presented, in which the extraction of information on laboratory tests and drug treatments is performed. METHODS We collected EHRs available in the clinical data warehouse of the Greater Paris University Hospitals from 2012 to 2021 for patients hospitalized and diagnosed with 1 of 4 immune-mediated inflammatory diseases: systemic lupus erythematosus, systemic sclerosis, antiphospholipid syndrome, and Takayasu arteritis. Then, we built, trained, and validated natural language processing algorithms on 103 discharge summaries selected from the cohort and annotated by a clinician. Finally, all discharge summaries in the cohort were processed with the algorithms, and the extracted data on laboratory tests and drug treatments were compared with the structured data. RESULTS Named entity recognition followed by normalization yielded F1-scores of 71.1 (95% CI 63.6-77.8) for the laboratory tests and 89.3 (95% CI 85.9-91.6) for the drugs. Application of the algorithms to 18,604 EHRs increased the detection of antibody results and drug treatments. For instance, among patients in the systemic lupus erythematosus cohort with positive antinuclear antibodies, the rate increased from 18.34% (752/4102) to 71.87% (2949/4102), making the results more consistent with the literature. CONCLUSIONS While challenges remain in standardizing laboratory tests, particularly with abbreviations, this work, based on secondary use of clinical data, demonstrates that automated processing of discharge summaries enriched the information available in structured data and facilitated more comprehensive patient profiling.
Collapse
Affiliation(s)
- Adam Remaki
- Limics, Université Sorbonne Paris-Nord, Inserm, Sorbonne Université, Paris, France
| | - Jacques Ung
- Pôle Innovation et Données, Direction des Services Numériques, Assistance Publique - Hôpitaux de Paris, Paris, France
| | - Pierre Pages
- Pôle Innovation et Données, Direction des Services Numériques, Assistance Publique - Hôpitaux de Paris, Paris, France
| | - Perceval Wajsburt
- Pôle Innovation et Données, Direction des Services Numériques, Assistance Publique - Hôpitaux de Paris, Paris, France
| | - Elise Liu
- Centre de Pharmacoépidémiologie, Hôpital Pitié Salpêtrière, Assistance Publique - Hôpitaux de Paris, Paris, France
| | - Guillaume Faure
- Limics, Université Sorbonne Paris-Nord, Inserm, Sorbonne Université, Paris, France
| | - Thomas Petit-Jean
- Pôle Innovation et Données, Direction des Services Numériques, Assistance Publique - Hôpitaux de Paris, Paris, France
| | - Xavier Tannier
- Limics, Université Sorbonne Paris-Nord, Inserm, Sorbonne Université, Paris, France
| | - Christel Gérardin
- Limics, Université Sorbonne Paris-Nord, Inserm, Sorbonne Université, Paris, France
- Service de médecine interne, Hôpital Tenon, Assistance Publique - Hôpitaux de Paris, Paris, France
| |
Collapse
|
4
|
Liu D, Ames C, Khader S, Rapaport F. SciLinker: a large-scale text mining framework for mapping associations among biological entities. Front Artif Intell 2025; 8:1528562. [PMID: 40212086 PMCID: PMC11983328 DOI: 10.3389/frai.2025.1528562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2024] [Accepted: 02/27/2025] [Indexed: 04/13/2025] Open
Abstract
Introduction The biomedical literature is the go-to source of information regarding relationships between biological entities, including genes, diseases, cell types, and drugs, but the rapid pace of publication makes an exhaustive manual exploration impossible. In order to efficiently explore an up-to-date repository of millions of abstracts, we constructed an efficient and modular natural language processing pipeline and applied it to the entire PubMed abstract corpora. Methods We developed SciLinker using open-source libraries and pre-trained named entity recognition models to identify human genes, diseases, cell types and drugs, normalizing these biological entities to the Unified Medical Language System (UMLS). We implemented a scoring schema to quantify the statistical significance of entity co-occurrences and applied a fine-tuned PubMedBERT model for gene-disease relationship extraction. Results We identified and analyzed over 30 million association sentences, including more than 11 million gene-disease co-occurrence sentences, revealing more than 1.25 million unique gene-disease associations. We demonstrate SciLinker's ability to extract specific gene-disease relationships using osteoporosis as a case study. We show how such an analysis benefits target identification as clinically validated targets are enriched in SciLinker-derived disease-associated genes. Moreover, this co-occurrence data can be used to construct disease-specific networks, providing insights into significant relationships among biological entities from scientific literature. Conclusion SciLinker represents a novel text mining approach that extracts and quantifies associations between biomedical entities through co-occurrence analysis and relationship extraction from PubMed abstracts. Its modular design enables expansion to additional entities and text corpora, making it a versatile tool for transforming unstructured biomedical data into actionable insights for drug discovery.
Collapse
Affiliation(s)
| | | | | | - Franck Rapaport
- Target, Disease and Systems Biology, Cambridge, MA, United States
| |
Collapse
|
5
|
Hier DB, Do TS, Obafemi-Ajayi T. A simplified retriever to improve accuracy of phenotype normalizations by large language models. Front Digit Health 2025; 7:1495040. [PMID: 40103736 PMCID: PMC11913805 DOI: 10.3389/fdgth.2025.1495040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2024] [Accepted: 02/12/2025] [Indexed: 03/20/2025] Open
Abstract
Large language models have shown improved accuracy in phenotype term normalization tasks when augmented with retrievers that suggest candidate normalizations based on term definitions. In this work, we introduce a simplified retriever that enhances large language model accuracy by searching the Human Phenotype Ontology (HPO) for candidate matches using contextual word embeddings from BioBERT without the need for explicit term definitions. Testing this method on terms derived from the clinical synopses of Online Mendelian Inheritance in Man (OMIM®), we demonstrate that the normalization accuracy of GPT-4o increases from a baseline of 62% without augmentation to 85% with retriever augmentation. This approach is potentially generalizable to other biomedical term normalization tasks and offers an efficient alternative to more complex retrieval methods.
Collapse
Affiliation(s)
- Daniel B Hier
- Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, IL, United States
| | - Thanh Son Do
- Department of Computer Science, Missouri State University, Springfield, MO, United States
| | - Tayo Obafemi-Ajayi
- Engineering Program, Missouri State University, Springfield, MO, United States
| |
Collapse
|
6
|
Alhassan A, Schlegel V, Aloud M, Batista-Navarro R, Nenadic G. Discontinuous named entities in clinical text: A systematic literature review. J Biomed Inform 2025; 162:104783. [PMID: 39863246 DOI: 10.1016/j.jbi.2025.104783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/04/2024] [Revised: 01/03/2025] [Accepted: 01/19/2025] [Indexed: 01/27/2025]
Abstract
OBJECTIVE Extracting named entities from clinical free-text presents unique challenges, particularly when dealing with discontinuous entities-mentions that are separated by unrelated words. Traditional NER methods often struggle to accurately identify these entities, prompting the development of specialised computational solutions. This paper systematically reviews and presents the methodologies developed for Discontinuous Named Entity Recognition in clinical texts, highlighting their effectiveness and the challenges they face. METHOD We conducted a systematic literature review focused on discontinuous named entities, using structured searches across four Computer Science-related and one medical-related electronic database. A combination of search terms, grouped into three synonym categories-problem, entity/approach, and task-yielded 2,442 articles. Guided by our research objectives, we identified five key dimensions to systematically annotate and normalise the data for comprehensive analysis. RESULT The review included 44 studies which were coded across several key dimensions: the chronological development of approaches, the corpora used, the downstream tasks affected by discontinuous named entities, the methodological approaches proposed to address the issue, and the reported performance outcomes. The discussion section examines the challenges encountered in this area and suggests potential directions for future research. CONCLUSION Significant progress has been made in discontinuous named entity recognition; however, there remains a need for more adaptable, generalisable solutions that are independent of custom annotation schemes. Exploring various configurations of generative language models presents a promising avenue for advancing this area. Additionally, future research should investigate the impact of precise versus imprecise recognition of discontinuous entities on clinical downstream tasks to better understand its practical implications in healthcare applications.
Collapse
Affiliation(s)
- Areej Alhassan
- University of Manchester, United Kingdom; King Saud University, Saudi Arabia.
| | - Viktor Schlegel
- University of Manchester, United Kingdom; Imperial Global, Singapore
| | | | | | | |
Collapse
|
7
|
Musella L, Afonso Castro A, Lai X, Widmann M, Vera J. ENQUIRE automatically reconstructs, expands, and drives enrichment analysis of gene and Mesh co-occurrence networks from context-specific biomedical literature. PLoS Comput Biol 2025; 21:e1012745. [PMID: 39932993 PMCID: PMC11844901 DOI: 10.1371/journal.pcbi.1012745] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2024] [Revised: 02/21/2025] [Accepted: 12/20/2024] [Indexed: 02/13/2025] Open
Abstract
The accelerating growth of scientific literature overwhelms our capacity to manually distil complex phenomena like molecular networks linked to diseases. Moreover, biases in biomedical research and database annotation limit our interpretation of facts and generation of hypotheses. ENQUIRE (Expanding Networks by Querying Unexpectedly Inter-Related Entities) offers a time- and resource-efficient alternative to manual literature curation and database mining. ENQUIRE reconstructs and expands co-occurrence networks of genes and biomedical ontologies from user-selected input corpora and network-inferred PubMed queries. Its modest resource usage and the integration of text mining, automatic querying, and network-based statistics mitigating literature biases makes ENQUIRE unique in its broad-scope applications. For example, ENQUIRE can generate co-occurrence gene networks that reflect high-confidence, functional networks. When tested on case studies spanning cancer, cell differentiation, and immunity, ENQUIRE identified interlinked genes and enriched pathways unique to each topic, thereby preserving their underlying context specificity. ENQUIRE supports biomedical researchers by easing literature annotation, boosting hypothesis formulation, and facilitating the identification of molecular targets for subsequent experimentation.
Collapse
Affiliation(s)
- Luca Musella
- Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Deutsches Zentrum Immuntherapie, BZKF, and Uniklinikum Erlangen, Erlangen, Germany
| | - Alejandro Afonso Castro
- Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Deutsches Zentrum Immuntherapie, BZKF, and Uniklinikum Erlangen, Erlangen, Germany
| | - Xin Lai
- Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Deutsches Zentrum Immuntherapie, BZKF, and Uniklinikum Erlangen, Erlangen, Germany
- Faculty of Medicine and Health Technology, Systems and Network Medicine Lab, Biomedicine Unit, Tampere University, Tampere, Finland
| | - Max Widmann
- Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Deutsches Zentrum Immuntherapie, BZKF, and Uniklinikum Erlangen, Erlangen, Germany
- University of Konstanz, Konstanz, Germany
| | - Julio Vera
- Laboratory of Systems Tumor Immunology, Friedrich-Alexander-Universität Erlangen-Nürnberg, Deutsches Zentrum Immuntherapie, BZKF, and Uniklinikum Erlangen, Erlangen, Germany
| |
Collapse
|
8
|
Bhushan RC, Donthi RK, Chilukuri Y, Srinivasarao U, Swetha P. Biomedical named entity recognition using improved green anaconda-assisted Bi-GRU-based hierarchical ResNet model. BMC Bioinformatics 2025; 26:34. [PMID: 39885428 PMCID: PMC11780922 DOI: 10.1186/s12859-024-06008-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2024] [Accepted: 12/05/2024] [Indexed: 02/01/2025] Open
Abstract
BACKGROUND Biomedical text mining is a technique that extracts essential information from scientific articles using named entity recognition (NER). Traditional NER methods rely on dictionaries, rules, or curated corpora, which may not always be accessible. To overcome these challenges, deep learning (DL) methods have emerged. However, DL-based NER methods may need help identifying long-distance relationships within text and require significant annotated datasets. RESULTS This research has proposed a novel model to address the challenges in natural language processing. The Improved Green anaconda-assisted Bi-GRU based Hierarchical ResNet BNER model (IGa-BiHR BNERM) is the model. IGa-BiHR BNERM model has shown promising results in accurately identifying named entities. The MACCROBAT dataset was obtained from Kaggle and underwent several pre-processing steps such as Stop Word Filtering, WordNet processing, Removal of non-alphanumeric characters, stemming Segmentation, and Tokenization, which is standardized and improves its quality. The pre-processed text was fed into a feature extraction model like the Robustly Optimized BERT -Whole Word Masking model. This model provides word embeddings with semantic information. Then, the BNER process utilized an Improved Green Anaconda-assisted Bi-GRU-based Hierarchical ResNet BNER model (IGa-BiHR BNERM). CONCLUSION To improve the training phase of the IGa-BiHR BNERM, the Improved Green Anaconda Optimization technique was used to select optimal weight parameter coefficients for training the model parameters. After the model was tested using the MACCROBAT dataset, it outperformed previous models with a tremendous accuracy rate of 99.11%. This model effectively and accurately identifies biomedical names within the text, significantly advancing this field.
Collapse
Affiliation(s)
| | - Rakesh Kumar Donthi
- Department of CSE GITAM (Deemed to be) UNIVERSITY Hyderabad, Rudraram, India
| | - Yojitha Chilukuri
- St. Jude Childrens Cancer Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA
| | | | - Polisetty Swetha
- Department of Information Technology, Vardhaman College of Engineering, Shamshabad, Hyderabad, India
| |
Collapse
|
9
|
Yang Y, Lu Y, Zheng Z, Wu H, Lin Y, Qian F, Yan W. MKG-GC: A multi-task learning-based knowledge graph construction framework with personalized application to gastric cancer. Comput Struct Biotechnol J 2024; 23:1339-1347. [PMID: 38585647 PMCID: PMC10995799 DOI: 10.1016/j.csbj.2024.03.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2024] [Revised: 03/24/2024] [Accepted: 03/24/2024] [Indexed: 04/09/2024] Open
Abstract
Over the past decade, information for precision disease medicine has accumulated in the form of textual data. To effectively utilize this expanding medical text, we proposed a multi-task learning-based framework based on hard parameter sharing for knowledge graph construction (MKG), and then used it to automatically extract gastric cancer (GC)-related biomedical knowledge from the literature and identify GC drug candidates. In MKG, we designed three separate modules, MT-BGIPN, MT-SGTF and MT-ScBERT, for entity recognition, entity normalization, and relation classification, respectively. To address the challenges posed by the long and irregular naming of medical entities, the MT-BGIPN utilized bidirectional gated recurrent unit and interactive pointer network techniques, significantly improving entity recognition accuracy to an average F1 value of 84.5% across datasets. In MT-SGTF, we employed the term frequency-inverse document frequency and the gated attention unit. These combine both semantic and characteristic features of entities, resulting in an average Hits@ 1 score of 94.5% across five datasets. The MT-ScBERT integrated cross-text, entity, and context features, yielding an average F1 value of 86.9% across 11 relation classification datasets. Based on the MKG, we then developed a specific knowledge graph for GC (MKG-GC), which encompasses a total of 9129 entities and 88,482 triplets. Lastly, the MKG-GC was used to predict potential GC drugs using a pre-trained language model called BioKGE-BERT and a drug-disease discriminant model based on CNN-BiLSTM. Remarkably, nine out of the top ten predicted drugs have been previously reported as effective for gastric cancer treatment. Finally, an online platform was created for exploration and visualization of MKG-GC at https://www.yanglab-mi.org.cn/MKG-GC/.
Collapse
Affiliation(s)
- Yang Yang
- Computing Science and Artificial Intelligence College, Suzhou City University, Suzhou 215004, China
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China
| | - Yuwei Lu
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China
| | - Zixuan Zheng
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China
| | - Hao Wu
- Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
| | - Yuxin Lin
- Center for Systems Biology, Soochow University, Suzhou 215123, China
- Department of Urology, the First Affiliated Hospital of Soochow University, Suzhou 215000, China
| | - Fuliang Qian
- Center for Systems Biology, Soochow University, Suzhou 215123, China
- Medical Center of Soochow University, Suzhou 215123, China
- Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
| | - Wenying Yan
- Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
- Center for Systems Biology, Soochow University, Suzhou 215123, China
- Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
| |
Collapse
|
10
|
Ferrara B, Bourgoin-Voillard S, Habert D, Vallée B, Nicolas-Boluda A, Simanic I, Seve M, Vingert B, Gazeau F, Castellano F, Cohen J, Courty J, Cascone I. Matrix stiffness regulates the protein profile of extracellular vesicles of pancreatic cancer cell lines. Proteomics 2024; 24:e2400058. [PMID: 39279557 DOI: 10.1002/pmic.202400058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2024] [Revised: 07/31/2024] [Accepted: 08/19/2024] [Indexed: 09/18/2024]
Abstract
The fibrotic stroma characterizing pancreatic ductal adenocarcinoma (PDAC) derives from a progressive tissue rigidification, which induces epithelial mesenchymal transition and metastatic dissemination. The aim of this study was to investigate the influence of matrix stiffness on PDAC progression by analyzing the proteome of PDAC-derived extracellular vesicles (EVs). PDAC cell lines (mPDAC and KPC) were grown on synthetic supports with a stiffness close to non-tumor (NT) or tumor tissue (T), and the protein expression levels in cell-derived EVs were analyzed by a quantitative MSE label-free mass spectrometry approach. Our analysis figured out 15 differentially expressed proteins (DEPs) in mPDAC-EVs and 20 DEPs in KPC-EVs in response to matrix rigidification. Up-regulated proteins participate to the processes of metabolism, matrix remodeling, and immune response, altogether hallmarks of PDAC progression. A multimodal network analysis revealed that the majority of DEPs are strongly related to pancreatic cancer. Interestingly, among DEPs, 11 related genes (ACTB/ANXA7/C3/IGSF8/LAMC1/LGALS3/PCD6IP/SFN/TPM3/VARS/YWHAZ) for mPDAC-EVs and 9 (ACTB/ALDH2/GAPDH/HNRNPA2B/ITGA2/NEXN/PKM/RPN1/S100A6) for KPC-EVs were significantly overexpressed in tumor tissues according to gene expression profiling interaction analysis (GEPIA). Concerning the potential clinical relevance of these data, the cluster of ACTB, ITGA2, GAPDH and PKM genes displayed an adverse effect (p < 0.05) on the overall survival of PDAC patients.
Collapse
Affiliation(s)
- Benedetta Ferrara
- Immunorégulation et Biothérapie, INSERM U955, Hôpital Henri Mondor, Université Paris-Est, Créteil, France
- Diabetes Research Institute, IRCCS San Raffaele Scientific Institute, Milan, Italy
| | - Sandrine Bourgoin-Voillard
- Université Grenoble Alpes, CNRS UMR 5525, Grenoble INP, TIMC, EPSP, Grenoble, France
- Université Grenoble Alpes, CNRS UMR 5525, Grenoble INP, CHU Grenoble Alpes, TIMC, EPSP, Grenoble, France
- Université Grenoble Alpes, LBFA et BEeSy, Inserm, U1055, CHU Grenoble Alpes, PROMETHEE Proteomic Platform, Grenoble, France
| | - Damien Habert
- Immunorégulation et Biothérapie, INSERM U955, Hôpital Henri Mondor, Université Paris-Est, Créteil, France
| | - Benoit Vallée
- Immunorégulation et Biothérapie, INSERM U955, Hôpital Henri Mondor, Université Paris-Est, Créteil, France
| | - Alba Nicolas-Boluda
- Matière et Systèmes Complexes MSC, CNRS, Université Paris Cité, Paris, France
| | - Isidora Simanic
- Modèles de cellules souches malignes et therapeutiques, INSERM UMR-S 935, Université Paris-Saclay, Villejuif, France
| | - Michel Seve
- Université Grenoble Alpes, CNRS UMR 5525, Grenoble INP, TIMC, EPSP, Grenoble, France
- Université Grenoble Alpes, CNRS UMR 5525, Grenoble INP, CHU Grenoble Alpes, TIMC, EPSP, Grenoble, France
- Université Grenoble Alpes, LBFA et BEeSy, Inserm, U1055, CHU Grenoble Alpes, PROMETHEE Proteomic Platform, Grenoble, France
| | - Benoit Vingert
- Etablissement Français du Sang, Créteil, France
- Inserm, U955, Equipe 2, Créteil, France
| | - Florence Gazeau
- Matière et Systèmes Complexes MSC, CNRS, Université Paris Cité, Paris, France
| | - Flavia Castellano
- Immunorégulation et Biothérapie, INSERM U955, Hôpital Henri Mondor, Université Paris-Est, Créteil, France
| | - José Cohen
- Immunorégulation et Biothérapie, INSERM U955, Hôpital Henri Mondor, Université Paris-Est, Créteil, France
- AP-HP, Groupe hospitalo-universitaire Chenevier Mondor, Centre d'investigation clinique Biotherapie, Créteil, France
| | - José Courty
- Immunorégulation et Biothérapie, INSERM U955, Hôpital Henri Mondor, Université Paris-Est, Créteil, France
- AP-HP, Groupe hospitalo-universitaire Chenevier Mondor, Centre d'investigation clinique Biotherapie, Créteil, France
| | - Ilaria Cascone
- Immunorégulation et Biothérapie, INSERM U955, Hôpital Henri Mondor, Université Paris-Est, Créteil, France
- AP-HP, Groupe hospitalo-universitaire Chenevier Mondor, Centre d'investigation clinique Biotherapie, Créteil, France
| |
Collapse
|
11
|
Li T, Li M, Wu Y, Li Y. Visualization Methods for DNA Sequences: A Review and Prospects. Biomolecules 2024; 14:1447. [PMID: 39595624 PMCID: PMC11592258 DOI: 10.3390/biom14111447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2024] [Revised: 11/08/2024] [Accepted: 11/12/2024] [Indexed: 11/28/2024] Open
Abstract
The efficient analysis and interpretation of biological sequence data remain major challenges in bioinformatics. Graphical representation, as an emerging and effective visualization technique, offers a more intuitive method for analyzing DNA sequences. However, many visualization approaches are dispersed across research databases, requiring urgent organization, integration, and analysis. Additionally, no single visualization method excels in all aspects. To advance these methods, knowledge graphs and advanced machine learning techniques have become key areas of exploration. This paper reviews the current 2D and 3D DNA sequence visualization methods and proposes a new research direction focused on constructing knowledge graphs for biological sequence visualization, explaining the relevant theories, techniques, and models involved. Additionally, we summarize machine learning techniques applicable to sequence visualization, such as graph embedding methods and the use of convolutional neural networks (CNNs) for processing graphical representations. These machine learning techniques and knowledge graphs aim to provide valuable insights into computational biology, bioinformatics, genomic computing, and evolutionary analysis. The study serves as an important reference for improving intelligent search systems, enriching knowledge bases, and enhancing query systems related to biological sequence visualization, offering a comprehensive framework for future research.
Collapse
Affiliation(s)
- Tan Li
- School of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China; (T.L.); (Y.L.)
| | - Mengshan Li
- School of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China; (T.L.); (Y.L.)
| | - Yan Wu
- School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China;
| | - Yelin Li
- School of Physics and Electronic Information, Gannan Normal University, Ganzhou 341000, China; (T.L.); (Y.L.)
| |
Collapse
|
12
|
Yin Y, Kim H, Xiao X, Wei CH, Kang J, Lu Z, Xu H, Fang M, Chen Q. Augmenting biomedical named entity recognition with general-domain resources. J Biomed Inform 2024; 159:104731. [PMID: 39368529 DOI: 10.1016/j.jbi.2024.104731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2024] [Revised: 09/05/2024] [Accepted: 09/27/2024] [Indexed: 10/07/2024]
Abstract
OBJECTIVE Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. METHODS We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. RESULTS We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset. CONCLUSION This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, codes, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
Collapse
Affiliation(s)
- Yu Yin
- Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
| | - Hyunjae Kim
- Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
| | - Xiao Xiao
- Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
| | - Chih Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
| | - Jaewoo Kang
- Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
| | - Hua Xu
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
| | - Meng Fang
- Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom.
| | - Qingyu Chen
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America.
| |
Collapse
|
13
|
Ghosh S, Abushukair HM, Ganesan A, Pan C, Naqash AR, Lu K. Harnessing explainable artificial intelligence for patient-to-clinical-trial matching: A proof-of-concept pilot study using phase I oncology trials. PLoS One 2024; 19:e0311510. [PMID: 39446771 PMCID: PMC11500892 DOI: 10.1371/journal.pone.0311510] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2024] [Accepted: 09/19/2024] [Indexed: 10/26/2024] Open
Abstract
This study aims to develop explainable AI methods for matching patients with phase 1 oncology clinical trials using Natural Language Processing (NLP) techniques to address challenges in patient recruitment for improved efficiency in drug development. A prototype system based on modern NLP techniques has been developed to match patient records with phase 1 oncology clinical trial protocols. Four criteria are considered for the matching: cancer type, performance status, genetic mutation, and measurable disease. The system outputs a summary matching score along with explanations of the evidence. The outputs of the AI system were evaluated against the ground truth matching results provided by the domain expert on a dataset of twelve synthesized dummy patient records and six clinical trial protocols. The system achieved a precision of 73.68%, sensitivity/recall of 56%, accuracy of 77.78%, and specificity of 89.36%. Further investigation into the misclassified cases indicated that ambiguity of abbreviation and misunderstanding of context are significant contributors to errors. The system found evidence of no matching for all false positive cases. To the best of our knowledge, no system in the public domain currently deploys an explainable AI-based approach to identify optimal patients for phase 1 oncology trials. This initial attempt to develop an AI system for patients and clinical trial matching in the context of phase 1 oncology trials showed promising results that are set to increase efficiency without sacrificing quality in patient-trial matching.
Collapse
Affiliation(s)
- Satanu Ghosh
- Department of Computer Science, University of New Hampshire, Durham, New Hampshire, United States of America
| | | | - Arjun Ganesan
- School of Computer Science, University of Oklahoma Norman Campus, Norman, Oklahoma, United States of America
| | - Chongle Pan
- School of Computer Science, University of Oklahoma Norman Campus, Norman, Oklahoma, United States of America
| | - Abdul Rafeh Naqash
- Medical Oncology/ TSET Phase 1 Program, Stephenson Cancer Center, The University of Oklahoma Health Science Campus, Oklahoma City, Oklahoma, United States of America
| | - Kun Lu
- School of Library and Information Studies, University of Oklahoma Norman Campus, Norman, Oklahoma, United States of America
| |
Collapse
|
14
|
Sänger M, Garda S, Wang XD, Weber-Genzel L, Droop P, Fuchs B, Akbik A, Leser U. HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools. Bioinformatics 2024; 40:btae564. [PMID: 39302686 PMCID: PMC11453098 DOI: 10.1093/bioinformatics/btae564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2024] [Revised: 08/23/2024] [Accepted: 09/17/2024] [Indexed: 09/22/2024] Open
Abstract
MOTIVATION With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization, i.e. grounding them in knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied "in the wild," i.e. on application-dependent text collections from moderately to extremely different from those used for training, varying, e.g. in focus, genre or text type. This raises the question whether the reported performance, usually obtained by training and evaluating on different partitions of the same corpus, can be trusted for downstream applications. RESULTS Here, we report on the results of a carefully designed cross-corpus benchmark for entity recognition and normalization, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five, based on predefined criteria like feature richness and availability, for an in-depth analysis on three publicly available corpora covering four entity types. Our results present a mixed picture and show that cross-corpus performance is significantly lower than the in-corpus performance. HunFlair2, the redesigned and extended successor of the HunFlair tool, showed the best performance on average, being closely followed by PubTator Central. Our results indicate that users of BTM tools should expect a lower performance than the original published one when applying tools in "the wild" and show that further research is necessary for more robust BTM tools. AVAILABILITY AND IMPLEMENTATION All our models are integrated into the Natural Language Processing (NLP) framework flair: https://github.com/flairNLP/flair. Code to reproduce our results is available at: https://github.com/hu-ner/hunflair2-experiments.
Collapse
Affiliation(s)
- Mario Sänger
- Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Samuele Garda
- Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Xing David Wang
- Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Leon Weber-Genzel
- Center for Information and Language Processing (CIS), Ludwig Maximilian University Munich, München 80539, Germany
| | - Pia Droop
- Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Benedikt Fuchs
- Research Industrial Systems Engineering (RISE) Forschungs-, Entwicklungs- und Großprojektberatung GmbH, Schwechat 2320, Austria
| | - Alan Akbik
- Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Ulf Leser
- Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| |
Collapse
|
15
|
Mante J, Sents Z, Britt D, Mo W, Liao C, Greer R, Myers CJ. SeqImprove: Machine-Learning-Assisted Curation of Genetic Circuit Sequence Information. ACS Synth Biol 2024; 13:3051-3055. [PMID: 39230953 DOI: 10.1021/acssynbio.4c00392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/05/2024]
Abstract
The progress and utility of synthetic biology is currently hindered by the lengthy process of studying literature and replicating poorly documented work. Reconstruction of crucial design information through post hoc curation is highly noisy and error-prone. To combat this, author participation during the curation process is crucial. To encourage author participation without overburdening them, an ML-assisted curation tool called SeqImprove has been developed. Using named entity recognition, called entity normalization, and sequence matching, SeqImprove creates machine-accessible sequence data and metadata annotations, which authors can then review and edit before submitting a final sequence file. SeqImprove makes it easier for authors to submit sequence data that is FAIR (findable, accessible, interoperable, and reusable).
Collapse
Affiliation(s)
- Jeanet Mante
- University of Colorado Boulder, Boulder, Colorado 80309, United States
| | - Zach Sents
- University of Colorado Boulder, Boulder, Colorado 80309, United States
| | - Duncan Britt
- University of Colorado Boulder, Boulder, Colorado 80309, United States
| | - William Mo
- University of Colorado Boulder, Boulder, Colorado 80309, United States
| | - Chunxiao Liao
- University of Colorado Boulder, Boulder, Colorado 80309, United States
| | - Ryan Greer
- University of Colorado Boulder, Boulder, Colorado 80309, United States
| | - Chris J Myers
- University of Colorado Boulder, Boulder, Colorado 80309, United States
| |
Collapse
|
16
|
Sarol MJ, Hong G, Guerra E, Kilicoglu H. Integrating deep learning architectures for enhanced biomedical relation extraction: a pipeline approach. Database (Oxford) 2024; 2024:baae079. [PMID: 39197056 PMCID: PMC11352595 DOI: 10.1093/database/baae079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2024] [Revised: 06/21/2024] [Accepted: 08/14/2024] [Indexed: 08/30/2024]
Abstract
Biomedical relation extraction from scientific publications is a key task in biomedical natural language processing (NLP) and can facilitate the creation of large knowledge bases, enable more efficient knowledge discovery, and accelerate evidence synthesis. In this paper, building upon our previous effort in the BioCreative VIII BioRED Track, we propose an enhanced end-to-end pipeline approach for biomedical relation extraction (RE) and novelty detection (ND) that effectively leverages existing datasets and integrates state-of-the-art deep learning methods. Our pipeline consists of four tasks performed sequentially: named entity recognition (NER), entity linking (EL), RE, and ND. We trained models using the BioRED benchmark corpus that was the basis of the shared task. We explored several methods for each task and combinations thereof: for NER, we compared a BERT-based sequence labeling model that uses the BIO scheme with a span classification model. For EL, we trained a convolutional neural network model for diseases and chemicals and used an existing tool, PubTator 3.0, for mapping other entity types. For RE and ND, we adapted the BERT-based, sentence-bound PURE model to bidirectional and document-level extraction. We also performed extensive hyperparameter tuning to improve model performance. We obtained our best performance using BERT-based models for NER, RE, and ND, and the hybrid approach for EL. Our enhanced and optimized pipeline showed substantial improvement compared to our shared task submission, NER: 93.53 (+3.09), EL: 83.87 (+9.73), RE: 46.18 (+15.67), and ND: 38.86 (+14.9). While the performances of the NER and EL models are reasonably high, RE and ND tasks remain challenging at the document level. Further enhancements to the dataset could enable more accurate and useful models for practical use. We provide our models and code at https://github.com/janinaj/e2eBioMedRE/. Database URL: https://github.com/janinaj/e2eBioMedRE/.
Collapse
Affiliation(s)
- M Janina Sarol
- Informatics Programs, University of Illinois Urbana-Champaign, 614 E Daniel Street, Champaign, IL 61820, United States
| | - Gibong Hong
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
| | - Evan Guerra
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
| | - Halil Kilicoglu
- School of Information Sciences, University of Illinois Urbana-Champaign, 501 E Daniel Street, Champaign, IL 61820, United States
| |
Collapse
|
17
|
Islamaj R, Lai PT, Wei CH, Luo L, Almeida T, Jonker RAA, Conceição SIR, Sousa DF, Phan CP, Chiang JH, Li J, Pan D, Meesawad W, Tsai RTH, Sarol MJ, Hong G, Valiev A, Tutubalina E, Lee SM, Hsu YY, Li M, Verspoor K, Lu Z. The overview of the BioRED (Biomedical Relation Extraction Dataset) track at BioCreative VIII. Database (Oxford) 2024; 2024:baae069. [PMID: 39114977 PMCID: PMC11306928 DOI: 10.1093/database/baae069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 05/27/2024] [Accepted: 07/09/2024] [Indexed: 08/11/2024]
Abstract
The BioRED track at BioCreative VIII calls for a community effort to identify, semantically categorize, and highlight the novelty factor of the relationships between biomedical entities in unstructured text. Relation extraction is crucial for many biomedical natural language processing (NLP) applications, from drug discovery to custom medical solutions. The BioRED track simulates a real-world application of biomedical relationship extraction, and as such, considers multiple biomedical entity types, normalized to their specific corresponding database identifiers, as well as defines relationships between them in the documents. The challenge consisted of two subtasks: (i) in Subtask 1, participants were given the article text and human expert annotated entities, and were asked to extract the relation pairs, identify their semantic type and the novelty factor, and (ii) in Subtask 2, participants were given only the article text, and were asked to build an end-to-end system that could identify and categorize the relationships and their novelty. We received a total of 94 submissions from 14 teams worldwide. The highest F-score performances achieved for the Subtask 1 were: 77.17% for relation pair identification, 58.95% for relation type identification, 59.22% for novelty identification, and 44.55% when evaluating all of the above aspects of the comprehensive relation extraction. The highest F-score performances achieved for the Subtask 2 were: 55.84% for relation pair, 43.03% for relation type, 42.74% for novelty, and 32.75% for comprehensive relation extraction. The entire BioRED track dataset and other challenge materials are available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/ and https://codalab.lisn.upsaclay.fr/competitions/13377 and https://codalab.lisn.upsaclay.fr/competitions/13378. Database URL: https://ftp.ncbi.nlm.nih.gov/pub/lu/BC8-BioRED-track/https://codalab.lisn.upsaclay.fr/competitions/13377https://codalab.lisn.upsaclay.fr/competitions/13378.
Collapse
Affiliation(s)
- Rezarta Islamaj
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| | - Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
| | - Tiago Almeida
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Richard A. A Jonker
- Department of Electronics, Telecommunications and Informatics (DETI), Institute of Electronics and Informatics Engineering of Aveiro (IEETA), University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Sofia I. R Conceição
- Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Edifício C6 Campo Grande, Lisbon 1749-016, Portugal
| | - Diana F Sousa
- Departamento de Informática, Faculdade de Ciências da Universidade de Lisboa, Edifício C6 Campo Grande, Lisbon 1749-016, Portugal
| | - Cong-Phuoc Phan
- Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan, Republic of China
| | - Jung-Hsien Chiang
- Department of Computer Science and Information Engineering, National Cheng Kung University, No.1, University Road, Tainan City 701, Taiwan, Republic of China
| | - Jiru Li
- School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
| | - Dinghao Pan
- School of Computer Science and Technology, Dalian University of Technology, No. 2 Linggong Road, Ganjingzi District, Dalian 116024, China
| | - Wilailack Meesawad
- Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd., Zhongli District, Taoyuan City 32001, Taiwan, Republic of China
| | - Richard Tzong-Han Tsai
- Department of Computer Science and Information Engineering, National Central University, No. 300, Zhongda Rd., Zhongli District, Taoyuan City 32001, Taiwan, Republic of China
- Research Center for Humanities and Social Sciences, Academia Sinica, No. 128, Section 2, Academia Rd., Nangang District, Taoyuan City 115201, Taiwan, Republic of China
| | - M. Janina Sarol
- School of Information Sciences, University of Illinois at Urbana-Champaign, 614 E. Daniel St, Champaign, IL 61820, United States
| | - Gibong Hong
- School of Information Sciences, University of Illinois at Urbana-Champaign, 614 E. Daniel St, Champaign, IL 61820, United States
| | - Airat Valiev
- Higher School of Economics University, 20 Myasnitskaya St, Moscow 101000, Russia
| | - Elena Tutubalina
- Artificial Intelligence Research Institute (AIRI), 32 Kutuzovskiy St, Moscow 121170, Russia
- Kazan Federal University, 18 Kremlevskaya St, Kazan 420008, Russia
| | - Shao-Man Lee
- Miin Wu School of Computing, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan, Republic of China
| | - Yi-Yu Hsu
- Miin Wu School of Computing, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan, Republic of China
| | - Mingjie Li
- School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, Victoria 3000, Australia
| | - Karin Verspoor
- School of Computing Technologies, RMIT University, 124 La Trobe Street, Melbourne, Victoria 3000, Australia
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
| |
Collapse
|
18
|
Garda S, Leser U. BELHD: improving biomedical entity linking with homonym disambiguation. Bioinformatics 2024; 40:btae474. [PMID: 39067036 PMCID: PMC11310454 DOI: 10.1093/bioinformatics/btae474] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Revised: 06/14/2024] [Accepted: 07/25/2024] [Indexed: 07/30/2024] Open
Abstract
MOTIVATION Biomedical entity linking (BEL) is the task of grounding entity mentions to a given knowledge base (KB). Recently, neural name-based methods, system identifying the most appropriate name in the KB for a given mention using neural network (either via dense retrieval or autoregressive modeling), achieved remarkable results for the task, without requiring manual tuning or definition of domain/entity-specific rules. However, as name-based methods directly return KB names, they cannot cope with homonyms, i.e. different KB entities sharing the exact same name. This significantly affects their performance for KBs where homonyms account for a large amount of entity mentions (e.g. UMLS and NCBI Gene). RESULTS We present BELHD (Biomedical Entity Linking with Homonym Disambiguation), a new name-based method that copes with this challenge. BELHD builds upon the BioSyn model with two crucial extensions. First, it performs pre-processing of the KB, during which it expands homonyms with a specifically constructed disambiguating string, thus enforcing unique linking decisions. Second, it introduces candidate sharing, a novel strategy that strengthens the overall training signal by including similar mentions from the same document as positive or negative examples, according to their corresponding KB identifier. Experiments with 10 corpora and 5 entity types show that BELHD improves upon current neural state-of-the-art approaches, achieving the best results in 6 out of 10 corpora with an average improvement of 4.55pp recall@1. Furthermore, the KB preprocessing is orthogonal to the prediction model and thus can also improve other neural methods, which we exemplify for GenBioEL, a generative name-based BEL approach. AVAILABILITY AND IMPLEMENTATION The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belhd.
Collapse
Affiliation(s)
- Samuele Garda
- Computer Science, Humboldt-Universität zu Berlin, Berlin 12489, Germany
| | - Ulf Leser
- Computer Science, Humboldt-Universität zu Berlin, Berlin 12489, Germany
| |
Collapse
|
19
|
Jonker RAA, Almeida T, Antunes R, Almeida JR, Matos S. Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes. Database (Oxford) 2024; 2024:baae068. [PMID: 39083461 PMCID: PMC11290360 DOI: 10.1093/database/baae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2024] [Revised: 05/15/2024] [Accepted: 07/08/2024] [Indexed: 08/02/2024]
Abstract
The identification of medical concepts from clinical narratives has a large interest in the biomedical scientific community due to its importance in treatment improvements or drug development research. Biomedical named entity recognition (NER) in clinical texts is crucial for automated information extraction, facilitating patient record analysis, drug development, and medical research. Traditional approaches often focus on single-class NER tasks, yet recent advancements emphasize the necessity of addressing multi-class scenarios, particularly in complex biomedical domains. This paper proposes a strategy to integrate a multi-head conditional random field (CRF) classifier for multi-class NER in Spanish clinical documents. Our methodology overcomes overlapping entity instances of different types, a common challenge in traditional NER methodologies, by using a multi-head CRF model. This architecture enhances computational efficiency and ensures scalability for multi-class NER tasks, maintaining high performance. By combining four diverse datasets, SympTEMIST, MedProcNER, DisTEMIST, and PharmaCoNER, we expand the scope of NER to encompass five classes: symptoms, procedures, diseases, chemicals, and proteins. To the best of our knowledge, these datasets combined create the largest Spanish multi-class dataset focusing on biomedical entity recognition and linking for clinical notes, which is important to train a biomedical model in Spanish. We also provide entity linking to the multi-lingual Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, with the eventual goal of performing biomedical relation extraction. Through experimentation and evaluation of Spanish clinical documents, our strategy provides competitive results against single-class NER models. For NER, our system achieves a combined micro-averaged F1-score of 78.73, with clinical mentions normalized to SNOMED CT with an end-to-end F1-score of 54.51. The code to run our system is publicly available at https://github.com/ieeta-pt/Multi-Head-CRF. Database URL: https://github.com/ieeta-pt/Multi-Head-CRF.
Collapse
Affiliation(s)
- Richard A A Jonker
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Tiago Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Rui Antunes
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - João R Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Sérgio Matos
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| |
Collapse
|
20
|
Madan S, Lentzen M, Brandt J, Rueckert D, Hofmann-Apitius M, Fröhlich H. Transformer models in biomedicine. BMC Med Inform Decis Mak 2024; 24:214. [PMID: 39075407 PMCID: PMC11287876 DOI: 10.1186/s12911-024-02600-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2023] [Accepted: 07/08/2024] [Indexed: 07/31/2024] Open
Abstract
Deep neural networks (DNN) have fundamentally revolutionized the artificial intelligence (AI) field. The transformer model is a type of DNN that was originally used for the natural language processing tasks and has since gained more and more attention for processing various kinds of sequential data, including biological sequences and structured electronic health records. Along with this development, transformer-based models such as BioBERT, MedBERT, and MassGenie have been trained and deployed by researchers to answer various scientific questions originating in the biomedical domain. In this paper, we review the development and application of transformer models for analyzing various biomedical-related datasets such as biomedical textual data, protein sequences, medical structured-longitudinal data, and biomedical images as well as graphs. Also, we look at explainable AI strategies that help to comprehend the predictions of transformer-based models. Finally, we discuss the limitations and challenges of current models, and point out emerging novel research directions.
Collapse
Affiliation(s)
- Sumit Madan
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany.
- Institute of Computer Science, University of Bonn, Bonn, 53115, Germany.
| | - Manuel Lentzen
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
| | - Johannes Brandt
- School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
| | - Daniel Rueckert
- School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
- School of Computation, Information and Technology, Technical University Munich, Munich, Germany
- Department of Computing, Imperial College London, London, UK
| | - Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
| | - Holger Fröhlich
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany.
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany.
| |
Collapse
|
21
|
Almeida T, Jonker RAA, Antunes R, Almeida JR, Matos S. Towards discovery: an end-to-end system for uncovering novel biomedical relations. Database (Oxford) 2024; 2024:baae057. [PMID: 38994795 PMCID: PMC11240158 DOI: 10.1093/database/baae057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 05/20/2024] [Accepted: 06/19/2024] [Indexed: 07/13/2024]
Abstract
Biomedical relation extraction is an ongoing challenge within the natural language processing community. Its application is important for understanding scientific biomedical literature, with many use cases, such as drug discovery, precision medicine, disease diagnosis, treatment optimization and biomedical knowledge graph construction. Therefore, the development of a tool capable of effectively addressing this task holds the potential to improve knowledge discovery by automating the extraction of relations from research manuscripts. The first track in the BioCreative VIII competition extended the scope of this challenge by introducing the detection of novel relations within the literature. This paper describes that our participation system initially focused on jointly extracting and classifying novel relations between biomedical entities. We then describe our subsequent advancement to an end-to-end model. Specifically, we enhanced our initial system by incorporating it into a cascading pipeline that includes a tagger and linker module. This integration enables the comprehensive extraction of relations and classification of their novelty directly from raw text. Our experiments yielded promising results, and our tagger module managed to attain state-of-the-art named entity recognition performance, with a micro F1-score of 90.24, while our end-to-end system achieved a competitive novelty F1-score of 24.59. The code to run our system is publicly available at https://github.com/ieeta-pt/BioNExt. Database URL: https://github.com/ieeta-pt/BioNExt.
Collapse
Affiliation(s)
- Tiago Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Richard A A Jonker
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Rui Antunes
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - João R Almeida
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| | - Sérgio Matos
- IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
| |
Collapse
|
22
|
Nédellec C, Sauvion C, Bossy R, Borovikova M, Deléger L. TaeC: A manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature. PLoS One 2024; 19:e0305475. [PMID: 38870159 PMCID: PMC11175518 DOI: 10.1371/journal.pone.0305475] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2023] [Accepted: 05/31/2024] [Indexed: 06/15/2024] Open
Abstract
Wheat varieties show a large diversity of traits and phenotypes. Linking them to genetic variability is essential for shorter and more efficient wheat breeding programs. A growing number of plant molecular information networks provide interlinked interoperable data to support the discovery of gene-phenotype interactions. A large body of scientific literature and observational data obtained in-field and under controlled conditions document wheat breeding experiments. The cross-referencing of this complementary information is essential. Text from databases and scientific publications has been identified early on as a relevant source of information. However, the wide variety of terms used to refer to traits and phenotype values makes it difficult to find and cross-reference the textual information, e.g. simple dictionary lookup methods miss relevant terms. Corpora with manually annotated examples are thus needed to evaluate and train textual information extraction methods. While several corpora contain annotations of human and animal phenotypes, no corpus is available for plant traits. This hinders the evaluation of text mining-based crop knowledge graphs (e.g. AgroLD, KnetMiner, WheatIS-FAIDARE) and limits the ability to train machine learning methods and improve the quality of information. The Triticum aestivum trait Corpus is a new gold standard for traits and phenotypes of wheat. It consists of 528 PubMed references that are fully annotated by trait, phenotype, and species. We address the interoperability challenge of crossing sparse assay data and publications by using the Wheat Trait and Phenotype Ontology to normalize trait mentions and the species taxonomy of the National Center for Biotechnology Information to normalize species. The paper describes the construction of the corpus. A study of the performance of state-of-the-art language models for both named entity recognition and linking tasks trained on the corpus shows that it is suitable for training and evaluation. This corpus is currently the most comprehensive manually annotated corpus for natural language processing studies on crop phenotype information from the literature.
Collapse
Affiliation(s)
- Claire Nédellec
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
| | - Clara Sauvion
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
| | - Robert Bossy
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
| | - Mariya Borovikova
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
- TETIS, Univ. Montpellier, AgroParisTech, CIRAD, CNRS, INRAE, Montpellier, France
| | - Louise Deléger
- Université Paris-Saclay, INRAE, MaIAGE, Jouy-en-Josas, France
| |
Collapse
|
23
|
Wang M, Vijayaraghavan A, Beck T, Posma JM. Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition. J Proteome Res 2024; 23:1915-1925. [PMID: 38733346 PMCID: PMC11165580 DOI: 10.1021/acs.jproteome.3c00367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2023] [Revised: 01/30/2024] [Accepted: 04/29/2024] [Indexed: 05/13/2024]
Abstract
Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
Collapse
Affiliation(s)
- Meiqi Wang
- Section
of Bioinformatics, Division of Systems Medicine, Department of Metabolism,
Digestion and Reproduction, Imperial College
London, London W12 0NN, U.K.
| | - Avish Vijayaraghavan
- Section
of Bioinformatics, Division of Systems Medicine, Department of Metabolism,
Digestion and Reproduction, Imperial College
London, London W12 0NN, U.K.
- UKRI
Centre for Doctoral Training in AI for Healthcare, Department of Computing, Imperial College London, London SW7 2AZ, U.K.
| | - Tim Beck
- School
of Medicine, University of Nottingham, Biodiscovery
Institute, Nottingham NG7 2RD, U.K.
- Health
Data Research (HDR) U.K., London NW1 2BE, U.K.
| | - Joram M. Posma
- Section
of Bioinformatics, Division of Systems Medicine, Department of Metabolism,
Digestion and Reproduction, Imperial College
London, London W12 0NN, U.K.
- Health
Data Research (HDR) U.K., London NW1 2BE, U.K.
| |
Collapse
|
24
|
Huang DL, Zeng Q, Xiong Y, Liu S, Pang C, Xia M, Fang T, Ma Y, Qiang C, Zhang Y, Zhang Y, Li H, Yuan Y. A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature. Interdiscip Sci 2024; 16:333-344. [PMID: 38340264 PMCID: PMC11289304 DOI: 10.1007/s12539-024-00605-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 01/02/2024] [Accepted: 01/03/2024] [Indexed: 02/12/2024]
Abstract
We report a combined manual annotation and deep-learning natural language processing study to make accurate entity extraction in hereditary disease related biomedical literature. A total of 400 full articles were manually annotated based on published guidelines by experienced genetic interpreters at Beijing Genomics Institute (BGI). The performance of our manual annotations was assessed by comparing our re-annotated results with those publicly available. The overall Jaccard index was calculated to be 0.866 for the four entity types-gene, variant, disease and species. Both a BERT-based large name entity recognition (NER) model and a DistilBERT-based simplified NER model were trained, validated and tested, respectively. Due to the limited manually annotated corpus, Such NER models were fine-tuned with two phases. The F1-scores of BERT-based NER for gene, variant, disease and species are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those of DistilBERT-based NER are 95.14%, 86.26%, 91.37% and 89.92%, respectively. Most importantly, the entity type of variant has been extracted by a large language model for the first time and a comparable F1-score with the state-of-the-art variant extraction model tmVar has been achieved.
Collapse
Affiliation(s)
- Dao-Ling Huang
- BGI Research, Shenzhen, 518083, China.
- Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China.
| | - Quanlei Zeng
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yun Xiong
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Shuixia Liu
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Chaoqun Pang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Menglei Xia
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Ting Fang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yanli Ma
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Cuicui Qiang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yi Zhang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yu Zhang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Hong Li
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yuying Yuan
- Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China
| |
Collapse
|
25
|
Di Maria A, Bellomo L, Billeci F, Cardillo A, Alaimo S, Ferragina P, Ferro A, Pulvirenti A. NetMe 2.0: a web-based platform for extracting and modeling knowledge from biomedical literature as a labeled graph. Bioinformatics 2024; 40:btae194. [PMID: 38597890 PMCID: PMC11074003 DOI: 10.1093/bioinformatics/btae194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2024] [Revised: 03/29/2024] [Accepted: 04/08/2024] [Indexed: 04/11/2024] Open
Abstract
MOTIVATION The rapid increase of bio-medical literature makes it harder and harder for scientists to keep pace with the discoveries on which they build their studies. Therefore, computational tools have become more widespread, among which network analysis plays a crucial role in several life-science contexts. Nevertheless, building correct and complete networks about some user-defined biomedical topics on top of the available literature is still challenging. RESULTS We introduce NetMe 2.0, a web-based platform that automatically extracts relevant biomedical entities and their relations from a set of input texts-i.e. in the form of full-text or abstract of PubMed Central's papers, free texts, or PDFs uploaded by users-and models them as a BioMedical Knowledge Graph (BKG). NetMe 2.0 also implements an innovative Retrieval Augmented Generation module (Graph-RAG) that works on top of the relationships modeled by the BKG and allows the distilling of well-formed sentences that explain their content. The experimental results show that NetMe 2.0 can infer comprehensive and reliable biological networks with significant Precision-Recall metrics when compared to state-of-the-art approaches. AVAILABILITY AND IMPLEMENTATION https://netme.click/.
Collapse
Affiliation(s)
- Antonio Di Maria
- Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
| | | | - Fabrizio Billeci
- Department of Computer Science, University of Catania, Catania, 95125, Italy
| | - Alfio Cardillo
- Department of Computer Science, University of Catania, Catania, 95125, Italy
| | - Salvatore Alaimo
- Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
| | - Paolo Ferragina
- Department of Computer Science, University of Pisa, Pisa, 56126 , Italy
| | - Alfredo Ferro
- Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
| | - Alfredo Pulvirenti
- Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
| |
Collapse
|
26
|
Valero-Rello A, Baeza-Delgado C, Andreu-Moreno I, Sanjuán R. Cellular receptors for mammalian viruses. PLoS Pathog 2024; 20:e1012021. [PMID: 38377111 PMCID: PMC10906839 DOI: 10.1371/journal.ppat.1012021] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2023] [Revised: 03/01/2024] [Accepted: 02/02/2024] [Indexed: 02/22/2024] Open
Abstract
The interaction of viral surface components with cellular receptors and other entry factors determines key features of viral infection such as host range, tropism and virulence. Despite intensive research, our understanding of these interactions remains limited. Here, we report a systematic analysis of published work on mammalian virus receptors and attachment factors. We build a dataset twice the size of those available to date and specify the role of each factor in virus entry. We identify cellular proteins that are preferentially used as virus receptors, which tend to be plasma membrane proteins with a high propensity to interact with other proteins. Using machine learning, we assign cell surface proteins a score that predicts their ability to function as virus receptors. Our results also reveal common patterns of receptor usage among viruses and suggest that enveloped viruses tend to use a broader repertoire of alternative receptors than non-enveloped viruses, a feature that might confer them with higher interspecies transmissibility.
Collapse
Affiliation(s)
- Ana Valero-Rello
- Institute for Integrative Systems Biology (I2SysBio), Consejo Superior de Investigaciones Científicas-Universitat de València, Paterna, València, Spain
| | - Carlos Baeza-Delgado
- Institute for Integrative Systems Biology (I2SysBio), Consejo Superior de Investigaciones Científicas-Universitat de València, Paterna, València, Spain
| | - Iván Andreu-Moreno
- Institute for Integrative Systems Biology (I2SysBio), Consejo Superior de Investigaciones Científicas-Universitat de València, Paterna, València, Spain
| | - Rafael Sanjuán
- Institute for Integrative Systems Biology (I2SysBio), Consejo Superior de Investigaciones Científicas-Universitat de València, Paterna, València, Spain
| |
Collapse
|
27
|
Yang J, Liu C, Deng W, Wu D, Weng C, Zhou Y, Wang K. Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT. PATTERNS (NEW YORK, N.Y.) 2024; 5:100887. [PMID: 38264716 PMCID: PMC10801236 DOI: 10.1016/j.patter.2023.100887] [Citation(s) in RCA: 21] [Impact Index Per Article: 21.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/14/2023] [Revised: 10/25/2023] [Accepted: 11/06/2023] [Indexed: 01/25/2024]
Abstract
To enhance phenotype recognition in clinical notes of genetic diseases, we developed two models-PhenoBCBERT and PhenoGPT-for expanding the vocabularies of Human Phenotype Ontology (HPO) terms. While HPO offers a standardized vocabulary for phenotypes, existing tools often fail to capture the full scope of phenotypes due to limitations from traditional heuristic or rule-based approaches. Our models leverage large language models to automate the detection of phenotype terms, including those not in the current HPO. We compare these models with PhenoTagger, another HPO recognition tool, and found that our models identify a wider range of phenotype concepts, including previously uncharacterized ones. Our models also show strong performance in case studies on biomedical literature. We evaluate the strengths and weaknesses of BERT- and GPT-based models in aspects such as architecture and accuracy. Overall, our models enhance automated phenotype detection from clinical texts, improving downstream analyses on human diseases.
Collapse
Affiliation(s)
- Jingye Yang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Cong Liu
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Wendy Deng
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Da Wu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Yunyun Zhou
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Biostatistics and Bioinformatics Facility, Fox Chase Cancer Center, Philadelphia, PA 19111, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
Collapse
|
28
|
Skoufos G, Kakoulidis P, Tastsoglou S, Zacharopoulou E, Kotsira V, Miliotis M, Mavromati G, Grigoriadis D, Zioga M, Velli A, Koutou I, Karagkouni D, Stavropoulos S, Kardaras F, Lifousi A, Vavalou E, Ovsepian A, Skoulakis A, Tasoulis S, Georgakopoulos S, Plagianakos V, Hatzigeorgiou A. TarBase-v9.0 extends experimentally supported miRNA-gene interactions to cell-types and virally encoded miRNAs. Nucleic Acids Res 2024; 52:D304-D310. [PMID: 37986224 PMCID: PMC10767993 DOI: 10.1093/nar/gkad1071] [Citation(s) in RCA: 46] [Impact Index Per Article: 46.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2023] [Revised: 10/18/2023] [Accepted: 11/02/2023] [Indexed: 11/22/2023] Open
Abstract
TarBase is a reference database dedicated to produce, curate and deliver high quality experimentally-supported microRNA (miRNA) targets on protein-coding transcripts. In its latest version (v9.0, https://dianalab.e-ce.uth.gr/tarbasev9), it pushes the envelope by introducing virally-encoded miRNAs, interactions leading to target-directed miRNA degradation (TDMD) events and the largest collection of miRNA-gene interactions to date in a plethora of experimental settings, tissues and cell-types. It catalogues ∼6 million entries, comprising ∼2 million unique miRNA-gene pairs, supported by 37 experimental (high- and low-yield) protocols in 172 tissues and cell-types. Interactions are annotated with rich metadata including information on genes/transcripts, miRNAs, samples, experimental contexts and publications, while millions of miRNA-binding locations are also provided at cell-type resolution. A completely re-designed interface with state-of-the-art web technologies, incorporates more features, and allows flexible and ingenious use. The new interface provides the capability to design sophisticated queries with numerous filtering criteria including cell lines, experimental conditions, cell types, experimental methods, species and/or tissues of interest. Additionally, a plethora of fine-tuning capacities have been integrated to the platform, offering the refinement of the returned interactions based on miRNA confidence and expression levels, while boundless local retrieval of the offered interactions and metadata is enabled.
Collapse
Affiliation(s)
- Giorgos Skoufos
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens11521, Greece
| | - Panos Kakoulidis
- Dept. of Informatics and Telecommunications, National and Kapodistrian Univ. of Athens, Athens, Greece
- Biomedical Research Foundation of the Academy of Athens, 11527Athens, Greece
| | - Spyros Tastsoglou
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens11521, Greece
| | - Elissavet Zacharopoulou
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens11521, Greece
| | - Vasiliki Kotsira
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens11521, Greece
| | - Marios Miliotis
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens11521, Greece
| | - Galatea Mavromati
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| | - Dimitris Grigoriadis
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| | - Maria Zioga
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| | - Angeliki Velli
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| | - Ioanna Koutou
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| | - Dimitra Karagkouni
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens11521, Greece
| | - Steve Stavropoulos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| | - Filippos S Kardaras
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens11521, Greece
| | - Anna Lifousi
- Technical University of Denmark – Department of Health Technology, Copenhagen, Denmark
| | - Eustathia Vavalou
- Department of Biology, National and Kapodistrian University of Athens, 15784Athens, Greece
| | - Armen Ovsepian
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens11521, Greece
| | - Anargyros Skoulakis
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens11521, Greece
| | - Sotiris K Tasoulis
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| | | | - Vassilis P Plagianakos
- Department of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
| | - Artemis G Hatzigeorgiou
- DIANA-Lab, Dept. of Computer Science and Biomedical Informatics, University of Thessaly, Lamia, Greece
- Hellenic Pasteur Institute, Athens11521, Greece
| |
Collapse
|
29
|
Cui C, Zhong B, Fan R, Cui Q. HMDD v4.0: a database for experimentally supported human microRNA-disease associations. Nucleic Acids Res 2024; 52:D1327-D1332. [PMID: 37650649 PMCID: PMC10767894 DOI: 10.1093/nar/gkad717] [Citation(s) in RCA: 50] [Impact Index Per Article: 50.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2023] [Revised: 08/07/2023] [Accepted: 08/19/2023] [Indexed: 09/01/2023] Open
Abstract
MicroRNAs (miRNAs) are a class of important small non-coding RNAs with critical molecular functions in almost all biological processes, and thus, they play important roles in disease diagnosis and therapy. Human MicroRNA Disease Database (HMDD) represents an important and comprehensive resource for biomedical researchers in miRNA-related medicine. Here, we introduce HMDD v4.0, which curates 53530 miRNA-disease association entries from literatures. In comparison to HMDD v3.0 released five years ago, HMDD v4.0 contains 1.5 times more entries. In addition, some new categories have been curated, including exosomal miRNAs implicated in diseases, virus-encoded miRNAs involved in human diseases, and entries containing miRNA-circRNA interactions. We also curated sex-biased miRNAs in diseases. Furthermore, in a case study, disease similarity analysis successfully revealed that sex-biased miRNAs related to developmental anomalies are associated with a number of human diseases with sex bias. HMDD can be freely visited at http://www.cuilab.cn/hmdd.
Collapse
Affiliation(s)
- Chunmei Cui
- Department of Biomedical Informatics, Center for Noncoding RNA Medicine, State Key Laboratory of Vascular Homeostasis and Remodeling, School of Basic Medical Sciences, Peking University, 38 Xueyuan Rd, Beijing 100191, China
| | - Bitao Zhong
- Department of Biomedical Informatics, Center for Noncoding RNA Medicine, State Key Laboratory of Vascular Homeostasis and Remodeling, School of Basic Medical Sciences, Peking University, 38 Xueyuan Rd, Beijing 100191, China
| | - Rui Fan
- Department of Biomedical Informatics, Center for Noncoding RNA Medicine, State Key Laboratory of Vascular Homeostasis and Remodeling, School of Basic Medical Sciences, Peking University, 38 Xueyuan Rd, Beijing 100191, China
| | - Qinghua Cui
- Department of Biomedical Informatics, Center for Noncoding RNA Medicine, State Key Laboratory of Vascular Homeostasis and Remodeling, School of Basic Medical Sciences, Peking University, 38 Xueyuan Rd, Beijing 100191, China
- School of Sports Medicine, Wuhan Institute of Physical Education, No. 461 Luoyu Rd. Wuchang District, Wuhan 430079, Hubei Province, China
| |
Collapse
|
30
|
Collins C, Baker S, Brown J, Zheng H, Chan A, Stenius U, Narita M, Korhonen A. Text mining for contexts and relationships in cancer genomics literature. Bioinformatics 2024; 40:btae021. [PMID: 38258418 PMCID: PMC10822582 DOI: 10.1093/bioinformatics/btae021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2023] [Revised: 09/27/2023] [Accepted: 01/15/2024] [Indexed: 01/24/2024] Open
Abstract
MOTIVATION Scientific advances build on the findings of existing research. The 2001 publication of the human genome has led to the production of huge volumes of literature exploring the context-specific functions and interactions of genes. Technology is needed to perform large-scale text mining of research papers to extract the reported actions of genes in specific experimental contexts and cell states, such as cancer, thereby facilitating the design of new therapeutic strategies. RESULTS We present a new corpus and Text Mining methodology that can accurately identify and extract the most important details of cancer genomics experiments from biomedical texts. We build a Named Entity Recognition model that accurately extracts relevant experiment details from PubMed abstract text, and a second model that identifies the relationships between them. This system outperforms earlier models and enables the analysis of gene function in diverse and dynamically evolving experimental contexts. AVAILABILITY AND IMPLEMENTATION Code and data are available here: https://github.com/cambridgeltl/functional-genomics-ie.
Collapse
Affiliation(s)
- Charlotte Collins
- Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge, Cambridge CB3 9DA, United Kingdom
| | - Simon Baker
- Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge, Cambridge CB3 9DA, United Kingdom
| | - Jason Brown
- Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge, Cambridge CB3 9DA, United Kingdom
| | - Huiyuan Zheng
- Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden
| | - Adelyne Chan
- Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge CB2 0RE, United Kingdom
| | - Ulla Stenius
- Institute of Environmental Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden
| | - Masashi Narita
- Cancer Research UK Cambridge Institute, Li Ka Shing Centre, University of Cambridge, Cambridge CB2 0RE, United Kingdom
| | - Anna Korhonen
- Language Technology Laboratory, Theoretical and Applied Linguistics, Faculty of Modern and Medieval Languages and Linguistics, University of Cambridge, Cambridge CB3 9DA, United Kingdom
| |
Collapse
|
31
|
Abdulnazar A, Roller R, Schulz S, Kreuzthaler M. Unsupervised SapBERT-based bi-encoders for medical concept annotation of clinical narratives with SNOMED CT. Digit Health 2024; 10:20552076241288681. [PMID: 39493636 PMCID: PMC11531008 DOI: 10.1177/20552076241288681] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2024] [Accepted: 09/03/2024] [Indexed: 11/05/2024] Open
Abstract
Objective Clinical narratives provide comprehensive patient information. Achieving interoperability involves mapping relevant details to standardized medical vocabularies. Typically, natural language processing divides this task into named entity recognition (NER) and medical concept normalization (MCN). State-of-the-art results require supervised setups with abundant training data. However, the limited availability of annotated data due to sensitivity and time constraints poses challenges. This study addressed the need for unsupervised medical concept annotation (MCA) to overcome these limitations and support the creation of annotated datasets. Method We use an unsupervised SapBERT-based bi-encoder model to analyze n-grams from narrative text and measure their similarity to SNOMED CT concepts. At the end, we apply a syntactical re-ranker. For evaluation, we use the semantic tags of SNOMED CT candidates to assess the NER phase and their concept IDs to assess the MCN phase. The approach is evaluated with both English and German narratives. Result Without training data, our unsupervised approach achieves an F1 score of 0.765 in English and 0.557 in German for MCN. Evaluation at the semantic tag level reveals that "disorder" has the highest F1 scores, 0.871 and 0.648 on English and German datasets. Furthermore, the MCA approach on the semantic tag "disorder" shows F1 scores of 0.839 and 0.696 in English and 0.685 and 0.437 in German for NER and MCN, respectively. Conclusion This unsupervised approach demonstrates potential for initial annotation (pre-labeling) in manual annotation tasks. While promising for certain semantic tags, challenges remain, including false positives, contextual errors, and variability of clinical language, requiring further fine-tuning.
Collapse
Affiliation(s)
- Akhila Abdulnazar
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
- CBmed GmbH – Center for Biomarker Research in Medicine, Graz, Austria
| | - Roland Roller
- German Research Center for Artificial Intelligence (DFKI), Berlin, Germany
| | - Stefan Schulz
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
| | - Markus Kreuzthaler
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Austria
| |
Collapse
|
32
|
Yang J, Liu C, Deng W, Wu D, Weng C, Zhou Y, Wang K. Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT. ARXIV 2023:arXiv:2308.06294v2. [PMID: 37986722 PMCID: PMC10659449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 11/22/2023]
Abstract
To enhance phenotype recognition in clinical notes of genetic diseases, we developed two models - PhenoBCBERT and PhenoGPT - for expanding the vocabularies of Human Phenotype Ontology (HPO) terms. While HPO offers a standardized vocabulary for phenotypes, existing tools often fail to capture the full scope of phenotypes, due to limitations from traditional heuristic or rule-based approaches. Our models leverage large language models (LLMs) to automate the detection of phenotype terms, including those not in the current HPO. We compared these models to PhenoTagger, another HPO recognition tool, and found that our models identify a wider range of phenotype concepts, including previously uncharacterized ones. Our models also showed strong performance in case studies on biomedical literature. We evaluated the strengths and weaknesses of BERT-based and GPT-based models in aspects such as architecture and accuracy. Overall, our models enhance automated phenotype detection from clinical texts, improving downstream analyses on human diseases.
Collapse
Affiliation(s)
- Jingye Yang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Cong Liu
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Wendy Deng
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Da Wu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Yunyun Zhou
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Biostatistics and Bioinformatics facility, Fox Chase Cancer Center, Philadelphia, PA 19111, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
Collapse
|
33
|
Garda S, Weber-Genzel L, Martin R, Leser U. BELB: a biomedical entity linking benchmark. Bioinformatics 2023; 39:btad698. [PMID: 37975879 PMCID: PMC10681865 DOI: 10.1093/bioinformatics/btad698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2023] [Revised: 10/30/2023] [Accepted: 11/16/2023] [Indexed: 11/19/2023] Open
Abstract
MOTIVATION Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad coverage KB UMLS, leaving their performance to more specialized ones, e.g. genes or variants, understudied. RESULTS We therefore developed BELB, a biomedical entity linking benchmark, providing access in a unified format to 11 corpora linked to 7 KBs and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture showing that neural approaches fail to perform consistently across entity types, highlighting the need of further studies towards entity-agnostic models. AVAILABILITY AND IMPLEMENTATION The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp.
Collapse
Affiliation(s)
- Samuele Garda
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Leon Weber-Genzel
- Center for Information and Language Processing, Ludwig-Maximilians-Universität München, München 80539, Germany
| | - Robert Martin
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| | - Ulf Leser
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
| |
Collapse
|
34
|
Wei CH, Luo L, Islamaj R, Lai PT, Lu Z. GNorm2: an improved gene name recognition and normalization system. Bioinformatics 2023; 39:btad599. [PMID: 37878810 PMCID: PMC10612401 DOI: 10.1093/bioinformatics/btad599] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2023] [Revised: 09/06/2023] [Accepted: 10/23/2023] [Indexed: 10/27/2023] Open
Abstract
MOTIVATION Gene name normalization is an important yet highly complex task in biomedical text mining research, as gene names can be highly ambiguous and may refer to different genes in different species or share similar names with other bioconcepts. This poses a challenge for accurately identifying and linking gene mentions to their corresponding entries in databases such as NCBI Gene or UniProt. While there has been a body of literature on the gene normalization task, few have addressed all of these challenges or make their solutions publicly available to the scientific community. RESULTS Building on the success of GNormPlus, we have created GNorm2: a more advanced tool with optimized functions and improved performance. GNorm2 integrates a range of advanced deep learning-based methods, resulting in the highest levels of accuracy and efficiency for gene recognition and normalization to date. Our tool is freely available for download. AVAILABILITY AND IMPLEMENTATION https://github.com/ncbi/GNorm2.
Collapse
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Ling Luo
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Rezarta Islamaj
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| |
Collapse
|
35
|
Domingo-Fernández D, Gadiya Y, Mubeen S, Bollerman TJ, Healy MD, Chanana S, Sadovsky RG, Healey D, Colluru V. Modern drug discovery using ethnobotany: A large-scale cross-cultural analysis of traditional medicine reveals common therapeutic uses. iScience 2023; 26:107729. [PMID: 37701812 PMCID: PMC10494464 DOI: 10.1016/j.isci.2023.107729] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2023] [Revised: 08/08/2023] [Accepted: 08/22/2023] [Indexed: 09/14/2023] Open
Abstract
For millennia, numerous cultures and civilizations have relied on traditional remedies derived from plants to treat a wide range of conditions and ailments. Here, we systematically analyzed ethnobotanical patterns across taxonomically related plants, demonstrating that congeneric medicinal plants are more likely to be used for treating similar indications. Next, we reconstructed the phytochemical space covered by medicinal plants to reveal that (i) taxonomically related medicinal plants cover a similar phytochemical space, and (ii) chemical similarity correlates with similar therapeutic usage. Lastly, we present several case scenarios illustrating how mining this information can be used for drug discovery applications, including: (i) investigating taxonomic hotspots around particular indications, (ii) exploring shared patterns of congeneric plants located in different geographic areas, but which have been used to treat the same indications, and (iii) showing the concordance between ethnobotanical patterns among non-taxonomically related plants and the presence of shared bioactive phytochemicals.
Collapse
|
36
|
Neves M, Klippert A, Knöspel F, Rudeck J, Stolz A, Ban Z, Becker M, Diederich K, Grune B, Kahnau P, Ohnesorge N, Pucher J, Schönfelder G, Bert B, Butzke D. Automatic classification of experimental models in biomedical literature to support searching for alternative methods to animal experiments. J Biomed Semantics 2023; 14:13. [PMID: 37658458 PMCID: PMC10472567 DOI: 10.1186/s13326-023-00292-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2022] [Accepted: 07/29/2023] [Indexed: 09/03/2023] Open
Abstract
Current animal protection laws require replacement of animal experiments with alternative methods, whenever such methods are suitable to reach the intended scientific objective. However, searching for alternative methods in the scientific literature is a time-consuming task that requires careful screening of an enormously large number of experimental biomedical publications. The identification of potentially relevant methods, e.g. organ or cell culture models, or computer simulations, can be supported with text mining tools specifically built for this purpose. Such tools are trained (or fine tuned) on relevant data sets labeled by human experts. We developed the GoldHamster corpus, composed of 1,600 PubMed (Medline) articles (titles and abstracts), in which we manually identified the used experimental model according to a set of eight labels, namely: "in vivo", "organs", "primary cells", "immortal cell lines", "invertebrates", "humans", "in silico" and "other" (models). We recruited 13 annotators with expertise in the biomedical domain and assigned each article to two individuals. Four additional rounds of annotation aimed at improving the quality of the annotations with disagreements in the first round. Furthermore, we conducted various machine learning experiments based on supervised learning to evaluate the corpus for our classification task. We obtained more than 7,000 document-level annotations for the above labels. After the first round of annotation, the inter-annotator agreement (kappa coefficient) varied among labels, and ranged from 0.42 (for "others") to 0.82 (for "invertebrates"), with an overall score of 0.62. All disagreements were resolved in the subsequent rounds of annotation. The best-performing machine learning experiment used the PubMedBERT pre-trained model with fine-tuning to our corpus, which gained an overall f-score of 0.83. We obtained a corpus with high agreement for all labels, and our evaluation demonstrated that our corpus is suitable for training reliable predictive models for automatic classification of biomedical literature according to the used experimental models. Our SMAFIRA - "Smart feature-based interactive" - search tool ( https://smafira.bf3r.de ) will employ this classifier for supporting the retrieval of alternative methods to animal experiments. The corpus is available for download ( https://doi.org/10.5281/zenodo.7152295 ), as well as the source code ( https://github.com/mariananeves/goldhamster ) and the model ( https://huggingface.co/SMAFIRA/goldhamster ).
Collapse
Affiliation(s)
- Mariana Neves
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany.
| | - Antonina Klippert
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Current affiliation: Nuvisan ICB GmbH, Müllerstraße 178, 13353, Berlin, Germany
| | - Fanny Knöspel
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Juliane Rudeck
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Ailine Stolz
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Zsofia Ban
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Markus Becker
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Kai Diederich
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Barbara Grune
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Pia Kahnau
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Nils Ohnesorge
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Johannes Pucher
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Gilbert Schönfelder
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Institute of Clinical Pharmacology and Toxicology, Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
| | - Bettina Bert
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| | - Daniel Butzke
- German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
| |
Collapse
|
37
|
Jeong M, Kang J. Consistency enhancement of model prediction on document-level named entity recognition. Bioinformatics 2023; 39:btad361. [PMID: 37261870 PMCID: PMC10272703 DOI: 10.1093/bioinformatics/btad361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2022] [Revised: 04/17/2023] [Accepted: 05/31/2023] [Indexed: 06/02/2023] Open
Abstract
SUMMARY Biomedical named entity recognition (NER) plays a crucial role in extracting information from documents in biomedical applications. However, many of these applications require NER models to operate at a document level, rather than just a sentence level. This presents a challenge, as the extension from a sentence model to a document model is not always straightforward. Despite the existence of document NER models that are able to make consistent predictions, they still fall short of meeting the expectations of researchers and practitioners in the field. To address this issue, we have undertaken an investigation into the underlying causes of inconsistent predictions. Our research has led us to believe that the use of adjectives and prepositions within entities may be contributing to low label consistency. In this article, we present our method, ConNER, to enhance a label consistency of modifiers such as adjectives and prepositions. By refining the labels of these modifiers, ConNER is able to improve representations of biomedical entities. The effectiveness of our method is demonstrated on four popular biomedical NER datasets. On three datasets, we achieve a higher F1 score than the previous state-of-the-art model. Our method shows its efficacy on two datasets, resulting in 7.5%-8.6% absolute improvements in the F1 score. Our findings suggest that our ConNER method is effective on datasets with intrinsically low label consistency. Through qualitative analysis, we demonstrate how our approach helps the NER model generate more consistent predictions. AVAILABILITY AND IMPLEMENTATION Our code and resources are available at https://github.com/dmis-lab/ConNER/.
Collapse
Affiliation(s)
- Minbyul Jeong
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
- Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Republic of Korea
- AIGEN Sciences, Seoul 04778, Republic of Korea
| |
Collapse
|
38
|
Sun Z, Tao C. Named Entity Recognition and Normalization for Alzheimer's Disease Eligibility Criteria. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS 2023; 2023:558-564. [PMID: 38283164 PMCID: PMC10815931 DOI: 10.1109/ichi57859.2023.00100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/30/2024]
Abstract
Alzheimer's Disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide. Finding effective treatments for this disease is crucial. Clinical trials play an essential role in developing and testing new treatments for AD. However, identifying eligible participants can be challenging, time-consuming, and costly. In recent years, the development of natural language processing (NLP) techniques, specifically named entity recognition (NER) and named entity normalization (NEN), have helped to automate the identification and extraction of relevant information from the eligibility criteria (EC) more efficiently, in order to facilitate semi-automatic patient recruitment and enable data FAIRness for clinical trial data. Nevertheless, most current biomedical NER models only provide annotations for a restricted set of entity types that may not be applicable to the clinical trial data. Additionally, accurately performing NEN on entities that are negated using a negative prefix currently lacks established techniques. In this paper, we introduce a pipeline designed for information extraction from AD clinical trial EC, which involves preprocessing of the EC data, clinical NER, and biomedical NEN to Unified Medical Language System (UMLS). Our NER model can identify named entities in seven pre-defined categories, while our NEN model employs a combination of exact match and partial match search strategies, as well as customized rules to accurately normalize entities with negative prefixes. To evaluate the performance of our pipeline, we measured the precision, recall, and F1 score for the NER component, and we manually reviewed the top five mapping results produced by the NEN component. Our evaluation of the pipeline's performance revealed that it can successfully normalize named entities in clinical trial ECs with optimal accuracies. The NER component achieved a overall F1 of 0.816, demonstrating its ability to accurately identify seven types of named entities in clinical text. The NEN component of the pipeline also demonstrated impressive performance, with customized rules and a combination of exact and partial match strategies leading to an accuracy of 0.940 for normalized entities.
Collapse
Affiliation(s)
- Zenan Sun
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas
| | - Cui Tao
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas
| |
Collapse
|
39
|
Luo L, Wei CH, Lai PT, Leaman R, Chen Q, Lu Z. AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning. Bioinformatics 2023; 39:btad310. [PMID: 37171899 PMCID: PMC10212279 DOI: 10.1093/bioinformatics/btad310] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2022] [Revised: 04/12/2023] [Accepted: 05/11/2023] [Indexed: 05/14/2023] Open
Abstract
MOTIVATION Biomedical named entity recognition (BioNER) seeks to automatically recognize biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications such as information extraction and question answering. Manually labeling training data for the BioNER task is costly, however, due to the significant domain expertise required for accurate annotation. The resulting data scarcity causes current BioNER approaches to be prone to overfitting, to suffer from limited generalizability, and to address a single entity type at a time (e.g. gene or disease). RESULTS We therefore propose a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to enhance the accuracy and stability of BioNER models. We further present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO schema. We evaluate AIONER on 14 BioNER benchmark tasks and show that AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrate the practical utility of AIONER in three independent tasks to recognize entity types not previously seen in training data, as well as the advantages of AIONER over existing methods for processing biomedical text at a large scale (e.g. the entire PubMed data). AVAILABILITY AND IMPLEMENTATION The source code, trained models and data for AIONER are freely available at https://github.com/ncbi/AIONER.
Collapse
Affiliation(s)
- Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| |
Collapse
|
40
|
Li M, Yang H, Liu Y. Biomedical named entity recognition based on fusion multi-features embedding. Technol Health Care 2023; 31:111-121. [PMID: 37038786 DOI: 10.3233/thc-] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/12/2023]
Abstract
BACKGROUND With the exponential increase in the volume of biomedical literature, text mining tasks are becoming increasingly important in the medical domain. Named entities are the primary identification tasks in text mining, prerequisites and critical parts for building medical domain knowledge graphs, medical question and answer systems, medical text classification. OBJECTIVE The study goal is to recognize biomedical entities effectively by fusing multi-feature embedding. Multiple features provide more comprehensive information so that better predictions can be obtained. METHODS Firstly, three different kinds of features are generated, including deep contextual word-level features, local char-level features, and part-of-speech features at the word representation layer. The word representation vectors are inputs into BiLSTM as features to obtain the dependency information. Finally, the CRF algorithm is used to learn the features of the state sequences to obtain the global optimal tagging sequences. RESULTS The experimental results showed that the model outperformed other state-of-the-art methods for all-around performance in six datasets among eight of four biomedical entity types. CONCLUSION The proposed method has a positive effect on the prediction results. It comprehensively considers the relevant factors of named entity recognition because the semantic information is enhanced by fusing multi-features embedding.
Collapse
|
41
|
Li M, Yang H, Liu Y. Biomedical named entity recognition based on fusion multi-features embedding. Technol Health Care 2023; 31:111-121. [PMID: 37038786 DOI: 10.3233/thc-236011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
BACKGROUND With the exponential increase in the volume of biomedical literature, text mining tasks are becoming increasingly important in the medical domain. Named entities are the primary identification tasks in text mining, prerequisites and critical parts for building medical domain knowledge graphs, medical question and answer systems, medical text classification. OBJECTIVE The study goal is to recognize biomedical entities effectively by fusing multi-feature embedding. Multiple features provide more comprehensive information so that better predictions can be obtained. METHODS Firstly, three different kinds of features are generated, including deep contextual word-level features, local char-level features, and part-of-speech features at the word representation layer. The word representation vectors are inputs into BiLSTM as features to obtain the dependency information. Finally, the CRF algorithm is used to learn the features of the state sequences to obtain the global optimal tagging sequences. RESULTS The experimental results showed that the model outperformed other state-of-the-art methods for all-around performance in six datasets among eight of four biomedical entity types. CONCLUSION The proposed method has a positive effect on the prediction results. It comprehensively considers the relevant factors of named entity recognition because the semantic information is enhanced by fusing multi-features embedding.
Collapse
|
42
|
Hoyt CT, Balk M, Callahan TJ, Domingo-Fernández D, Haendel MA, Hegde HB, Himmelstein DS, Karis K, Kunze J, Lubiana T, Matentzoglu N, McMurry J, Moxon S, Mungall CJ, Rutz A, Unni DR, Willighagen E, Winston D, Gyori BM. Unifying the identification of biomedical entities with the Bioregistry. Sci Data 2022; 9:714. [PMID: 36402838 PMCID: PMC9675740 DOI: 10.1038/s41597-022-01807-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2022] [Accepted: 10/26/2022] [Indexed: 11/21/2022] Open
Abstract
The standardized identification of biomedical entities is a cornerstone of interoperability, reuse, and data integration in the life sciences. Several registries have been developed to catalog resources maintaining identifiers for biomedical entities such as small molecules, proteins, cell lines, and clinical trials. However, existing registries have struggled to provide sufficient coverage and metadata standards that meet the evolving needs of modern life sciences researchers. Here, we introduce the Bioregistry, an integrative, open, community-driven metaregistry that synthesizes and substantially expands upon 23 existing registries. The Bioregistry addresses the need for a sustainable registry by leveraging public infrastructure and automation, and employing a progressive governance model centered around open code and open data to foster community contribution. The Bioregistry can be used to support the standardized annotation of data, models, ontologies, and scientific literature, thereby promoting their interoperability and reuse. The Bioregistry can be accessed through https://bioregistry.io and its source code and data are available under the MIT and CC0 Licenses at https://github.com/biopragmatics/bioregistry .
Collapse
Affiliation(s)
| | | | | | - Daniel Domingo-Fernández
- Department of Bioinformatics, Fraunhofer SCAI, Sankt Augustin, Germany
- Enveda Biosciences, Boulder, USA
| | | | | | | | - Klas Karis
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA
| | - John Kunze
- California Digital Library, University of California, Berkeley, USA
| | - Tiago Lubiana
- School of Pharmaceutical Sciences, University of São Paulo, São Paulo, Brazil
| | | | - Julie McMurry
- University of Colorado Anschutz Medical Campus, Aurora, USA
| | - Sierra Moxon
- Lawrence Berkeley National Laboratory, Berkeley, USA
| | | | - Adriano Rutz
- School of Pharmaceutical Sciences, University of Geneva, Geneva, Switzerland
- Institute of Pharmaceutical Sciences of Western Switzerland, University of Geneva, Geneva, Switzerland
| | - Deepak R Unni
- Lawrence Berkeley National Laboratory, Berkeley, USA
- European Molecular Biology Laboratory, Heidelberg, Germany
| | - Egon Willighagen
- Department of Bioinformatics - BiGCaT, NUTRIM, Maastricht University, Maastricht, Netherlands
| | | | - Benjamin M Gyori
- Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA.
| |
Collapse
|
43
|
Kim H, Sung M, Yoon W, Park S, Kang J. Full-text chemical identification with improved generalizability and tagging consistency. Database (Oxford) 2022; 2022:6726385. [PMID: 36170114 PMCID: PMC9518746 DOI: 10.1093/database/baac074] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2022] [Revised: 07/11/2022] [Accepted: 08/22/2022] [Indexed: 11/14/2022]
Abstract
Chemical identification involves finding chemical entities in text (i.e. named entity recognition) and assigning unique identifiers to the entities (i.e. named entity normalization). While current models are developed and evaluated based on article titles and abstracts, their effectiveness has not been thoroughly verified in full text. In this paper, we identify two limitations of models in tagging full-text articles: (1) low generalizability to unseen mentions and (2) tagging inconsistency. We use simple training and post-processing methods to address the limitations such as transfer learning and mention-wise majority voting. We also present a hybrid model for the normalization task that utilizes the high recall of a neural model while maintaining the high precision of a dictionary model. In the BioCreative VII NLM-Chem track challenge, our best model achieves 86.72 and 78.31 F1 scores in named entity recognition and normalization, significantly outperforming the median (83.73 and 77.49 F1 scores) and taking first place in named entity recognition. In a post-challenge evaluation, we re-implement our model and obtain 84.70 F1 score in the normalization task, outperforming the best score in the challenge by 3.34 F1 score. Database URL: https://github.com/dmis-lab/bc7-chem-id
Collapse
Affiliation(s)
- Hyunjae Kim
- Department of Computer Science and Engineering, Korea University , Seoul, South Korea
| | - Mujeen Sung
- Department of Computer Science and Engineering, Korea University , Seoul, South Korea
| | - Wonjin Yoon
- Department of Computer Science and Engineering, Korea University , Seoul, South Korea
| | - Sungjoon Park
- Department of Medicine, University of California , San Diego, CA, USA
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University , Seoul, South Korea
- AIGEN Sciences , Seoul, South Korea
| |
Collapse
|