1
Sänger M, Garda S, Wang XD, Weber-Genzel L, Droop P, Fuchs B, Akbik A, Leser U. HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools. Bioinformatics 2024; 40:btae564. [PMID: 39302686 PMCID: PMC11453098 DOI: 10.1093/bioinformatics/btae564] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2024] [Revised: 08/23/2024] [Accepted: 09/17/2024] [Indexed: 09/22/2024]
Abstract
MOTIVATION With the exponential growth of the life sciences literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. The identification of entities in texts, such as diseases or genes, and their normalization, i.e. grounding them in a knowledge base, are crucial steps in any BTM pipeline to enable information aggregation from multiple documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied "in the wild," i.e. on application-dependent text collections that differ moderately to extremely from those used for training, varying, e.g. in focus, genre or text type. This raises the question whether the reported performance, usually obtained by training and evaluating on different partitions of the same corpus, can be trusted for downstream applications.
RESULTS Here, we report on the results of a carefully designed cross-corpus benchmark for entity recognition and normalization, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five, based on predefined criteria like feature richness and availability, for an in-depth analysis on three publicly available corpora covering four entity types. Our results present a mixed picture and show that cross-corpus performance is significantly lower than the in-corpus performance. HunFlair2, the redesigned and extended successor of the HunFlair tool, showed the best performance on average, being closely followed by PubTator Central. Our results indicate that users of BTM tools should expect a lower performance than the original published one when applying tools "in the wild" and show that further research is necessary for more robust BTM tools.
AVAILABILITY AND IMPLEMENTATION All our models are integrated into the Natural Language Processing (NLP) framework flair: https://github.com/flairNLP/flair. Code to reproduce our results is available at: https://github.com/hu-ner/hunflair2-experiments.
Affiliation(s)
- Mario Sänger
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Samuele Garda
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Xing David Wang
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Leon Weber-Genzel
  - Center for Information and Language Processing (CIS), Ludwig Maximilian University Munich, München 80539, Germany
- Pia Droop
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Benedikt Fuchs
  - Research Industrial Systems Engineering (RISE) Forschungs-, Entwicklungs- und Großprojektberatung GmbH, Schwechat 2320, Austria
- Alan Akbik
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Ulf Leser
  - Department of Computer Science, Humboldt-Universität zu Berlin, Berlin 10099, Germany
2
Weber L, Barth F, Lorenz L, Konrath F, Huska K, Wolf J, Leser U. PEDL+: protein-centered relation extraction from PubMed at your fingertip. Bioinformatics 2023; 39:btad603. [PMID: 37950510 PMCID: PMC10660277 DOI: 10.1093/bioinformatics/btad603] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 08/29/2023] [Accepted: 10/31/2023] [Indexed: 11/12/2023]
Abstract
SUMMARY Relation extraction (RE) from large text collections is an important tool for database curation, pathway reconstruction, or functional omics data analysis. In practice, RE is often part of a complex data analysis pipeline requiring specific adaptations like restricting the types of relations or the set of proteins to be considered. However, current systems are either non-programmable web sites or research code with fixed functionality. We present PEDL+, a user-friendly tool for extracting protein-protein and protein-chemical associations from PubMed articles. PEDL+ combines state-of-the-art NLP technology with adaptable ranking and filtering options and can easily be integrated into analysis pipelines. We evaluated PEDL+ in two pathway curation projects and found that 59% to 80% of its extractions were helpful.
AVAILABILITY AND IMPLEMENTATION PEDL+ is freely available at https://github.com/leonweber/pedl.
Affiliation(s)
- Leon Weber
  - Center for Information and Language Processing, Ludwig-Maximilians-Universität München, Geschwister-Scholl-Platz 1, München 80539, Germany
- Fabio Barth
  - Computer Science Department, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
- Leonie Lorenz
  - Pathogen Informatics and Modelling, EMBL-EBI, Hinxton, Cambridgeshire CB10 1SD, United Kingdom
- Fabian Konrath
  - Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, Berlin 13125, Germany
- Kirsten Huska
  - Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, Berlin 13125, Germany
- Jana Wolf
  - Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine, Robert-Rössle-Str. 10, Berlin 13125, Germany
  - Department of Mathematics and Computer Science, Free University Berlin, Berlin, 14195, Germany
- Ulf Leser
  - Computer Science Department, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
3
Zirkle J, Han X, Racz R, Samieegohar M, Chaturbedi A, Mann J, Chakravartula S, Li Z. Deep learning-enabled natural language processing to identify directional pharmacokinetic drug-drug interactions. BMC Bioinformatics 2023; 24:413. [PMID: 37914988 PMCID: PMC10619324 DOI: 10.1186/s12859-023-05520-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2023] [Accepted: 10/04/2023] [Indexed: 11/03/2023]
Abstract
BACKGROUND During drug development, it is essential to gather information about the change of clinical exposure of a drug (object) due to the pharmacokinetic (PK) drug-drug interactions (DDIs) with another drug (precipitant). While many natural language processing (NLP) methods for DDI have been published, most were designed to evaluate if (and what kind of) DDI relationships exist in the text, without identifying the direction of DDI (object vs. precipitant drug). Here we present a method for the automatic identification of the directionality of a PK DDI from literature or drug labels.
METHODS We reannotated the Text Analysis Conference (TAC) DDI track 2019 corpus for identifying the direction of a PK DDI and evaluated the performance of a fine-tuned BioBERT model on this task by following the training and validation steps prespecified by TAC.
RESULTS This initial attempt showed the model achieved an F-score of 0.82 in identifying sentences as containing PK DDI and an F-score of 0.97 in identifying object versus precipitant drugs in those sentences.
DISCUSSION AND CONCLUSION Despite a growing list of NLP methods for DDI extraction, most of them use a common set of corpora to perform general purpose tasks (e.g., classifying a sentence into one of several fixed DDI categories). There is a lack of coordination between the drug development and biomedical informatics method development community to develop corpora and methods to perform specific tasks (e.g., extract clinical exposure changes due to PK DDI). We hope that our effort can encourage such a coordination so that more "fit for purpose" NLP methods could be developed and used to facilitate the drug development process.
Affiliation(s)
- Joel Zirkle
  - Division of Applied Regulatory Science, Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration, WO Bldg 64 Rm 2078, 10903 New Hampshire Ave, Silver Spring, MD, 20993, USA
- Xiaomei Han
  - Division of Applied Regulatory Science, Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration, WO Bldg 64 Rm 2078, 10903 New Hampshire Ave, Silver Spring, MD, 20993, USA
- Rebecca Racz
  - Division of Applied Regulatory Science, Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration, WO Bldg 64 Rm 2078, 10903 New Hampshire Ave, Silver Spring, MD, 20993, USA
- Mohammadreza Samieegohar
  - Division of Applied Regulatory Science, Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration, WO Bldg 64 Rm 2078, 10903 New Hampshire Ave, Silver Spring, MD, 20993, USA
- Anik Chaturbedi
  - Division of Applied Regulatory Science, Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration, WO Bldg 64 Rm 2078, 10903 New Hampshire Ave, Silver Spring, MD, 20993, USA
- John Mann
  - Division of Applied Regulatory Science, Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration, WO Bldg 64 Rm 2078, 10903 New Hampshire Ave, Silver Spring, MD, 20993, USA
- Shilpa Chakravartula
  - Division of Applied Regulatory Science, Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration, WO Bldg 64 Rm 2078, 10903 New Hampshire Ave, Silver Spring, MD, 20993, USA
- Zhihua Li
  - Division of Applied Regulatory Science, Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, Food and Drug Administration, WO Bldg 64 Rm 2078, 10903 New Hampshire Ave, Silver Spring, MD, 20993, USA
4
Dhrangadhariya A, Müller H. Not so weak PICO: leveraging weak supervision for participants, interventions, and outcomes recognition for systematic review automation. JAMIA Open 2023; 6:ooac107. [PMID: 36632329 PMCID: PMC9828146 DOI: 10.1093/jamiaopen/ooac107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2022] [Revised: 12/01/2022] [Accepted: 12/21/2022] [Indexed: 01/11/2023]
Abstract
Objective: The aim of this study was to test the feasibility of PICO (participants, interventions, comparators, outcomes) entity extraction using weak supervision and natural language processing.
Methodology: We re-purpose more than 127 medical and nonmedical ontologies and expert-generated rules to obtain multiple noisy labels for PICO entities in the evidence-based medicine (EBM)-PICO corpus. These noisy labels are aggregated using simple majority voting and generative modeling to get consensus labels. The resulting probabilistic labels are used as weak signals to train a weakly supervised (WS) discriminative model and observe performance changes. We explore mistakes in the EBM-PICO that could have led to inaccurate evaluation of previous automation methods.
Results: In total, 4081 randomized clinical trials were weakly labeled to train the WS models and compared against full supervision. The models were separately trained for PICO entities and evaluated on the EBM-PICO test set. A WS approach combining ontologies and expert-generated rules outperformed full supervision for the participant entity by 1.71% macro-F1. Error analysis on the EBM-PICO subset revealed 18-23% erroneous token classifications.
Discussion: Automatic PICO entity extraction accelerates the writing of clinical systematic reviews that commonly use PICO information to filter health evidence. However, PICO extends to more entities, namely PICOS (S: study type and design), PICOC (C: context), and PICOT (T: timeframe), for which labelled datasets are unavailable. In such cases, the ability to use weak supervision overcomes the expensive annotation bottleneck.
Conclusions: We show the feasibility of WS PICO entity extraction using freely available ontologies and heuristics without manually annotated data. Weak supervision has encouraging performance compared to full supervision but requires careful design to outperform it.
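As a rough illustration of the aggregation step mentioned in the abstract above, simple majority voting over noisy per-token labels could be sketched as follows. The label strings, the `ABSTAIN` marker, the function names, and the tie-breaking rule are assumptions made for this example, not details taken from the paper.

```python
from collections import Counter

ABSTAIN = None  # hypothetical marker: a labeling source that assigns no label


def majority_vote(votes, default="O"):
    """Return the most frequent non-abstain label; fall back to `default`
    when every source abstains or the top two labels are tied."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return default
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return default  # tie-breaking by default label is an assumption
    return ranked[0][0]


def aggregate(token_votes, default="O"):
    """Aggregate votes from several labeling sources into one label per token."""
    return [majority_vote(votes, default) for votes in token_votes]
```

Each inner list holds the votes that different ontologies or rules cast for one token, so `aggregate` yields one consensus label per token.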
Affiliation(s)
- Anjani Dhrangadhariya
  - Corresponding Author: Anjani Dhrangadhariya, MSc, Institute of Informatics, University of Applied Sciences Western Switzerland (HES-SO), Rue de Technopôle 3, 3960 Sierre, Switzerland
- Henning Müller
  - Institute of Informatics, University of Applied Sciences Western Switzerland (HES-SO), Sierre, Switzerland
  - University of Geneva (UNIGE), Geneva, Switzerland
5
Weber L, Sänger M, Garda S, Barth F, Alt C, Leser U. Chemical-protein relation extraction with ensembles of carefully tuned pretrained language models. Database (Oxford) 2022; 2022:6833204. [PMID: 36399413 PMCID: PMC9674024 DOI: 10.1093/database/baac098] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/10/2022] [Revised: 10/18/2022] [Accepted: 10/21/2022] [Indexed: 11/19/2022]
Abstract
The identification of chemical-protein interactions described in the literature is an important task with applications in drug design, precision medicine and biotechnology. Manual extraction of such relationships from the biomedical literature is costly and often prohibitively time-consuming. The BioCreative VII DrugProt shared task provides a benchmark for methods for the automated extraction of chemical-protein relations from scientific text. Here we describe our contribution to the shared task and report on the achieved results. We define the task as a relation classification problem, which we approach with pretrained transformer language models. Upon this basic architecture, we experiment with utilizing textual and embedded side information from knowledge bases as well as additional training data to improve extraction performance. We perform a comprehensive evaluation of the proposed model and the individual extensions including an extensive hyperparameter search leading to 2647 different runs. We find that ensembling and choosing the right pretrained language model are crucial for optimal performance, whereas adding additional data and embedded side information did not improve results. Our best model is based on an ensemble of 10 pretrained transformers and additional textual descriptions of chemicals taken from the Comparative Toxicogenomics Database. The model reaches an F1 score of 79.73% on the hidden DrugProt test set and achieves the first rank out of 107 submitted runs in the official evaluation. Database URL: https://github.com/leonweber/drugprot.
Affiliation(s)
- Leon Weber
  - *Corresponding authors: Tel: +49 30 209341293
- Mario Sänger
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
- Samuele Garda
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
- Fabio Barth
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
- Christoph Alt
  - Computer Science, Humboldt-Universität zu Berlin, Unter den Linden 6, Berlin 10099, Germany
  - Research Cluster of Excellence, Science of Intelligence, Marchstr. 23, Berlin 10587, Germany
- Ulf Leser
  - *Corresponding authors: Tel: +49 30 209341293