1
|
Ding L, Colavizza G, Zhang Z. Partial Annotation Learning for Biomedical Entity Recognition. IEEE J Biomed Health Inform 2025; 29:1409-1418. [PMID: 39312441 DOI: 10.1109/jbhi.2024.3466294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/25/2024]
Abstract
Named Entity Recognition (NER) is a key task to support biomedical research. In Biomedical Named Entity Recognition (BioNER), obtaining high-quality expert annotated data is laborious and expensive, leading to the development of automatic approaches such as distant supervision. However, manually and automatically generated data often suffer from the unlabeled entity problem, whereby many entity annotations are missing, degrading the performance of full annotation NER models. To address this issue, we systematically explore the efficacy of partial annotation learning methods for BioNER, with a comprehensive evaluation across a range of simulated scenarios of missing entity annotations. Furthermore, we propose a TS-PubMedBERT-Partial-CRF partial annotation learning model. We standardize a compilation of 16 BioNER corpora, covering five distinct entity types, to establish a gold standard. We compare against the state-of-the-art partial annotation model EER-PubMedBERT, the widely acknowledged partial annotation model BiLSTM-Partial-CRF, and the state-of-the-art full annotation learning BioNER model, the PubMedBERT tagger. Results show that partial annotation learning-based methods can effectively learn from biomedical corpora with missing entity annotations. Our proposed model outperforms alternatives and, specifically, the PubMedBERT tagger by 38% in F1-score under high missing entity rates. Moreover, the recall of entity mentions in our model is competitive with the upper bound observed on the fully annotated dataset.
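The abstract refers to simulated scenarios of missing entity annotations but does not say how they were generated. As a rough illustration only, the sketch below assumes BIO-tagged sentences and relabels a random fraction of entity spans as "O"; the function names, the tag set, and the uniform sampling are assumptions for this example, not the authors' procedure.

```python
import random


def extract_spans(tags):
    """Return (start, end) pairs, end exclusive, for each BIO entity span.

    A span starts at a "B-" tag and runs until the next "B-" or "O" tag.
    An "I-" tag with no preceding "B-" is ignored (assumption of this sketch).
    """
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def drop_entities(tags, missing_rate, rng):
    """Relabel a random fraction of entity spans as "O".

    missing_rate is the fraction of spans to remove (0.0 to 1.0);
    rng is a random.Random instance so the simulation is reproducible.
    """
    if not 0.0 <= missing_rate <= 1.0:
        raise ValueError("missing_rate must be between 0 and 1")
    spans = extract_spans(tags)
    n_drop = round(len(spans) * missing_rate)
    out = list(tags)
    for start, end in rng.sample(spans, n_drop):
        for i in range(start, end):
            out[i] = "O"
    return out


# Hypothetical sentence with four entity spans.
example = ["B-Gene", "I-Gene", "O", "B-Disease", "O",
           "B-Chemical", "I-Chemical", "O", "B-Gene"]
print(drop_entities(example, 0.5, random.Random(0)))
```

Passing an explicit random.Random instance keeps each simulated missing rate reproducible across runs.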
|
2
|
Majdik ZP, Graham SS, Shiva Edward JC, Rodriguez SN, Karnes MS, Jensen JT, Barbour JB, Rousseau JF. Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study. JMIR AI 2024; 3:e52095. [PMID: 38875593 PMCID: PMC11140272 DOI: 10.2196/52095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2023] [Revised: 12/13/2023] [Accepted: 03/30/2024] [Indexed: 06/16/2024]
Abstract
BACKGROUND Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. OBJECTIVE This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements. METHODS A random sample of 200 disclosure statements was prepared for annotation. All "PERSON" and "ORG" entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density. RESULTS Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. Two-predictor multiple linear regression models were statistically significant with multiple R² ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38. CONCLUSIONS Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture's intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size.
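The two-predictor regression described in the abstract relates F1-score to sample size (sentences) and entity density (EPS). The following is a minimal sketch of such a fit using ordinary least squares on the normal equations in plain Python; the helper names and the data points are invented for illustration and are not taken from the study.

```python
def entities_per_sentence(n_entities, n_sentences):
    """Entity density (EPS): annotated entities divided by sentences."""
    return n_entities / n_sentences


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_two_predictor_ols(sentences, eps, f1):
    """Fit f1 = b0 + b1 * sentences + b2 * eps by ordinary least squares.

    Returns [b0, b1, b2].
    """
    rows = [[1.0, float(s), float(e)] for s, e in zip(sentences, eps)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, f1)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Hypothetical subsamples: sentence counts, entity densities, and F1-scores.
sentences = [100, 200, 300, 400, 500]
eps = [1.0, 1.2, 0.9, 1.5, 1.1]
f1 = [0.62, 0.71, 0.74, 0.88, 0.86]
b0, b1, b2 = fit_two_predictor_ols(sentences, eps, f1)
print(f"intercept={b0:.4f} per_sentence={b1:.6f} per_eps={b2:.4f}")
```

The threshold regression models mentioned in the abstract are a separate analysis and are not covered by this sketch.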
Affiliation(s)
- Zoltan P Majdik
- Department of Communication, North Dakota State University, Fargo, ND, United States
| | - S Scott Graham
- Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States
| | - Jade C Shiva Edward
- Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States
| | - Sabrina N Rodriguez
- Department of Neurology, The Dell Medical School, The University of Texas at Austin, Austin, TX, United States
| | - Martha S Karnes
- Department of Rhetoric & Writing, University of Arkansas Little Rock, Little Rock, AR, United States
| | - Jared T Jensen
- Department of Rhetoric & Writing, The University of Texas at Austin, Austin, TX, United States
| | - Joshua B Barbour
- Department of Communication, The University of Illinois at Urbana-Champaign, Urbana, IL, United States
| | - Justin F Rousseau
- Statistical Planning and Analysis Section, Department of Neurology, The University of Texas Southwestern Medical Center, Dallas, TX, United States
- Peter O'Donnell Jr. Brain Institute, The University of Texas Southwestern Medical Center, Dallas, TX, United States
| |
|
3
|
Cao L, Wu C, Luo G, Guo C, Zheng A. Online biomedical named entities recognition by data and knowledge-driven model. Artif Intell Med 2024; 150:102813. [PMID: 38553155 DOI: 10.1016/j.artmed.2024.102813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 12/15/2023] [Accepted: 02/12/2024] [Indexed: 04/02/2024]
Abstract
Named entity recognition (NER) is an important task for the natural language processing of biomedical text. Currently, most NER studies focus on standardized biomedical text, while NER for unstandardized biomedical text draws less attention from researchers. Named entities in online biomedical text appear with errors and polymorphisms, which negatively impact NER models' performance and impede support from knowledge representation methods. In this paper, we propose a neural network method that can effectively recognize entities in unstandardized online medical/health text. We introduce a new pre-training scheme that uses large-scale online question-answering pairs to enhance transformers' model capacity on online biomedical text. Moreover, we supply models with knowledge representations from a knowledge base, called multi-channel knowledge labels; this method overcomes the restriction imposed by languages, such as Chinese, that require word segmentation tools to represent knowledge. Our model significantly outperforms other baseline methods in experiments on a dataset for Chinese online medical entity recognition and achieves state-of-the-art results.
Affiliation(s)
- Lulu Cao
- Department of Rheumatology and Immunology, Peking University People's Hospital, 100044, China
| | - Chaochen Wu
- Renmin University of China, Beijing, 100872, China.
| | - Guan Luo
- State Key Laboratory of Multimodal Artificial Intelligence Systems Institute of Automation, Chinese Academy of Sciences, China.
| | - Chao Guo
- Department of Cardiology, Fuwai Hospital CAMS and PUMC, Beijing, 100037, China
| | - Anni Zheng
- State Key Laboratory of Multimodal Artificial Intelligence Systems Institute of Automation, Chinese Academy of Sciences, China
| |
|
4
|
Msosa YJ, Grauslys A, Zhou Y, Wang T, Buchan I, Langan P, Foster S, Walker M, Pearson M, Folarin A, Roberts A, Maskell S, Dobson R, Kullu C, Kehoe D. Trustworthy Data and AI Environments for Clinical Prediction: Application to Crisis-Risk in People With Depression. IEEE J Biomed Health Inform 2023; 27:5588-5598. [PMID: 37669205 DOI: 10.1109/jbhi.2023.3312011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/07/2023]
Abstract
Depression is a common mental health condition that often occurs in association with other chronic illnesses, and varies considerably in severity. Electronic Health Records (EHRs) contain rich information about a patient's medical history and can be used to train, test and maintain predictive models to support and improve patient care. This work evaluated the feasibility of implementing an environment for predicting mental health crisis among people living with depression based on both structured and unstructured EHRs. A large EHR from a mental health provider, Mersey Care, was pseudonymised and ingested into the Natural Language Processing (NLP) platform CogStack, allowing text content in binary clinical notes to be extracted. All unstructured clinical notes and summaries were semantically annotated by MedCAT and BioYODIE NLP services. Cases of crisis in patients with depression were then identified. Random forest models, gradient boosting trees, and Long Short-Term Memory (LSTM) networks, with varying feature arrangement, were trained to predict the occurrence of crisis. The results showed that all the prediction models can use a combination of structured and unstructured EHR information to predict crisis in patients with depression with good and useful accuracy. The LSTM network that was trained on a modified dataset with only the 1000 most important features from the random forest model with temporality showed the best performance, with a mean AUC of 0.901 and a standard deviation of 0.006 using a training dataset and a mean AUC of 0.810 and a standard deviation of 0.01 using a hold-out test dataset. Comparing the results from the technical evaluation with the views of psychiatrists shows that there are now opportunities to refine and integrate such prediction models into pragmatic point-of-care clinical decision support tools for supporting mental healthcare delivery.
|
5
|
Ahmad PN, Shah AM, Lee K. A Review on Electronic Health Record Text-Mining for Biomedical Name Entity Recognition in Healthcare Domain. Healthcare (Basel) 2023; 11:1268. [PMID: 37174810 PMCID: PMC10178605 DOI: 10.3390/healthcare11091268] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/28/2023] [Revised: 04/24/2023] [Accepted: 04/26/2023] [Indexed: 05/15/2023] Open
Abstract
Biomedical-named entity recognition (bNER) is critical in biomedical informatics. It identifies biomedical entities with special meanings, such as people, places, and organizations, as predefined semantic types in electronic health records (EHR). bNER is essential for discovering novel knowledge using computational methods and Information Technology. Early bNER systems were configured manually to include domain-specific features and rules. However, these systems were limited in handling the complexity of the biomedical text. Recent advances in deep learning (DL) have led to the development of more powerful bNER systems. DL-based bNER systems can learn the patterns of biomedical text automatically, making them more robust and efficient than traditional rule-based systems. This paper reviews the healthcare domain of bNER, using DL techniques and artificial intelligence in clinical records, for mining treatment prediction. bNER-based tools are categorized systematically and represent the distribution of input, context, and tag (encoder/decoder). Furthermore, to create a labeled dataset for our machine learning sentiment analyzer to analyze the sentiment of a set of tweets, we used a manual coding approach and the multi-task learning method to bias the training signals with domain knowledge inductively. To conclude, we discuss the challenges facing bNER systems and future directions in the healthcare field.
Affiliation(s)
- Pir Noman Ahmad
- School of Computer Science, Harbin Institute of Technology, Harbin 150001, China
| | - Adnan Muhammad Shah
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
| | - KangYoon Lee
- Department of Computer Engineering, Gachon University, Seongnam 13120, Republic of Korea
| |
|
6
|
Ivanisenko TV, Demenkov PS, Kolchanov NA, Ivanisenko VA. The New Version of the ANDDigest Tool with Improved AI-Based Short Names Recognition. Int J Mol Sci 2022; 23:ijms232314934. [PMID: 36499269 PMCID: PMC9738852 DOI: 10.3390/ijms232314934] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 10/14/2022] [Revised: 11/19/2022] [Accepted: 11/22/2022] [Indexed: 12/05/2022] Open
Abstract
The body of scientific literature continues to grow annually. Over 1.5 million abstracts of biomedical publications were added to the PubMed database in 2021. Therefore, developing cognitive systems that provide a specialized search for information in scientific publications based on subject area ontology and modern artificial intelligence methods is urgently needed. We previously developed a web-based information retrieval system, ANDDigest, designed to search and analyze information in the PubMed database using a customized domain ontology. This paper presents an improved ANDDigest version that uses fine-tuned PubMedBERT classifiers to enhance the quality of short name recognition for molecular-genetics entities in PubMed abstracts on eight biological object types: cell components, diseases, side effects, genes, proteins, pathways, drugs, and metabolites. This approach increased average short name recognition accuracy by 13%.
Affiliation(s)
- Timofey V. Ivanisenko
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
| | - Pavel S. Demenkov
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
| | - Nikolay A. Kolchanov
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Faculty of Natural Sciences, Novosibirsk State University, St. Pirogova 1, Novosibirsk 630090, Russia
| | - Vladimir A. Ivanisenko
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Faculty of Natural Sciences, Novosibirsk State University, St. Pirogova 1, Novosibirsk 630090, Russia
| |
|
7
|
Kühnel L, Fluck J. We are not ready yet: limitations of state-of-the-art disease named entity recognizers. J Biomed Semantics 2022; 13:26. [PMID: 36303237 DOI: 10.1186/s13326-022-00280-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2021] [Accepted: 10/12/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Intense research has been done in the area of biomedical natural language processing. Since the breakthrough of transfer learning-based methods, BERT models have been used in a variety of biomedical and clinical applications. For the available data sets, these models show excellent results, partly exceeding the inter-annotator agreements. However, biomedical named entity recognition applied to COVID-19 preprints shows a performance drop compared to the results on test data. The question arises how well trained models are able to predict on completely new data, i.e. to generalize. RESULTS Based on the example of disease named entity recognition, we investigate the robustness of different machine learning-based methods, including transfer learning, and show that current state-of-the-art methods work well for a given training set and the corresponding test set but show a significant lack of generalization when applied to new data. CONCLUSIONS We argue that there is a need for larger annotated data sets for training and testing. Therefore, we foresee the curation of further data sets and, moreover, the investigation of continual learning processes for machine learning-based models.
Affiliation(s)
- Lisa Kühnel
- ZB MED - Information Centre for Life Sciences, Gleueler Str. 60, Cologne, Germany; Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, Postfach 10 01 31, 33501, Bielefeld, Germany
| | - Juliane Fluck
- ZB MED - Information Centre for Life Sciences, Gleueler Str. 60, Cologne, Germany; Institute of Geodesy and Geoinformation, Agricultural Faculty, University of Bonn, Nussallee 1, 53115, Bonn, Germany
| |
|
8
|
Luo L, Wei CH, Lai PT, Chen Q, Islamaj R, Lu Z. Assigning species information to corresponding genes by a sequence labeling framework. Database (Oxford) 2022; 2022:6760187. [PMID: 36227127 PMCID: PMC9558450 DOI: 10.1093/database/baac090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2022] [Revised: 08/26/2022] [Accepted: 10/11/2022] [Indexed: 01/24/2023]
Abstract
The automatic assignment of species information to the corresponding genes in a research article is a critically important step in the gene normalization task, whereby a gene mention is normalized and linked to a database record or an identifier by a text-mining algorithm. Existing methods typically rely on heuristic rules based on gene and species co-occurrence in the article, but their accuracy is suboptimal. We therefore developed a high-performance method, using a novel deep learning-based framework, to identify whether there is a relation between a gene and a species. Instead of the traditional binary classification framework in which all possible pairs of genes and species in the same article are evaluated, we treat the problem as a sequence labeling task such that only a fraction of the pairs needs to be considered. Our benchmarking results show that our approach obtains significantly higher performance than the rule-based baseline method for the species assignment task (accuracy improves from 65.8% to 81.3%). The source code and data for species assignment are freely available. Database URL https://github.com/ncbi/SpeciesAssignment.
Affiliation(s)
| | | | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Rezarta Islamaj
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - Zhiyong Lu
- *Corresponding author: Tel: +301 594 7089; Fax: +301 480 2288;
| |
|
9
|
McInnes BT, Downie JS, Hao Y, Jett J, Keating K, Nakum G, Ranjan S, Rodriguez NE, Tang J, Xiang D, Young EM, Nguyen MH. Discovering Content through Text Mining for a Synthetic Biology Knowledge System. ACS Synth Biol 2022; 11:2043-2054. [PMID: 35671034 DOI: 10.1021/acssynbio.1c00611] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/29/2022]
Abstract
Scientific articles contain a wealth of information about experimental methods and results describing biological designs. Due to its unstructured nature and multiple sources of ambiguity and variability, extracting this information from text is a difficult task. In this paper, we describe the development of the synthetic biology knowledge system (SBKS) text processing pipeline. The pipeline uses natural language processing techniques to extract and correlate information from the literature for synthetic biology researchers. Specifically, we apply named entity recognition, relation extraction, concept grounding, and topic modeling to extract information from published literature to link articles to elements within our knowledge system. Our results show the efficacy of each of the components on synthetic biology literature and provide future directions for further advancement of the pipeline.
Affiliation(s)
- Bridget T McInnes
- Virginia Commonwealth University, Richmond, Virginia 23284, United States
| | - J Stephen Downie
- University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Yikai Hao
- University of California San Diego, La Jolla, California 92093, United States
| | - Jacob Jett
- University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Kevin Keating
- Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
| | - Gaurav Nakum
- University of California San Diego, La Jolla, California 92093, United States
| | - Sudhanshu Ranjan
- University of California San Diego, La Jolla, California 92093, United States
| | | | - Jiawei Tang
- University of California San Diego, La Jolla, California 92093, United States
| | - Du Xiang
- University of California San Diego, La Jolla, California 92093, United States
| | - Eric M Young
- Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
| | - Mai H Nguyen
- University of California San Diego, La Jolla, California 92093, United States
| |
|
10
|
Lange L, Adel H, Strötgen J, Klakow D. CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain. Bioinformatics 2022; 38:3267-3274. [PMID: 35485748 DOI: 10.1093/bioinformatics/btac297] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2021] [Revised: 03/03/2022] [Accepted: 04/26/2022] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION The field of natural language processing (NLP) has recently seen a large change towards using pre-trained language models for solving almost any task. Despite showing great improvements in benchmark datasets for various tasks, these models often perform sub-optimally in non-standard domains like the clinical domain, where a large gap between pre-training documents and target documents is observed. In this paper, we aim to close this gap with domain-specific training of the language model and we investigate its effect on a diverse set of downstream tasks and settings. RESULTS We introduce the pre-trained CLIN-X (Clinical XLM-R) language models and show how CLIN-X outperforms other pre-trained transformer models by a large margin for ten clinical concept extraction tasks from two languages. In addition, we demonstrate how the transformer model can be further improved with our proposed task- and language-agnostic model architecture based on ensembles over random splits and cross-sentence context. Our studies in low-resource and transfer settings reveal stable model performance despite a lack of annotated data, with improvements of up to 47 F1 points when only 250 labeled sentences are available. Our results highlight the importance of specialized language models, such as CLIN-X, for concept extraction in non-standard domains, but also show that our task-agnostic model architecture is robust across the tested tasks and languages so that domain- or task-specific adaptations are not required. AVAILABILITY The CLIN-X language models and source code for fine-tuning and transferring the model are publicly available at https://github.com/boschresearch/clin_x/ and the huggingface model hub.
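The abstract mentions ensembles over random splits without describing how member predictions are combined. One common way to realise the idea, shown here purely as an assumed sketch, is to create several random train/dev splits, train one tagger per split (training is omitted here), and merge the per-token predictions by majority vote. The function names and the voting rule are illustrative and are not taken from the CLIN-X code base.

```python
import random
from collections import Counter


def random_splits(n_items, n_splits, dev_fraction, seed):
    """Create n_splits random (train, dev) index partitions of n_items examples."""
    rng = random.Random(seed)
    n_dev = int(n_items * dev_fraction)
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_items))
        rng.shuffle(indices)
        # First n_dev shuffled indices form the dev set, the rest the training set.
        splits.append((indices[n_dev:], indices[:n_dev]))
    return splits


def majority_vote(predictions):
    """Combine tag sequences from several models token by token.

    predictions holds one tag sequence per ensemble member, all of equal length.
    Ties go to the tag seen first among the tied candidates.
    """
    voted = []
    for token_tags in zip(*predictions):
        voted.append(Counter(token_tags).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three taggers for a three-token sentence.
member_outputs = [
    ["B-Problem", "I-Problem", "O"],
    ["B-Problem", "O", "O"],
    ["B-Problem", "I-Problem", "O"],
]
print(majority_vote(member_outputs))
```

Token-level voting can produce tag sequences that are not valid BIO (for example an "I-" tag following "O"), so a real pipeline would likely need a repair step or span-level voting.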
Affiliation(s)
- Lukas Lange
- Bosch Center for Artificial Intelligence, Robert Bosch GmbH, Renningen, 71272, Germany; Spoken Language Systems, Saarland University, Saarbrücken, 66111, Germany
| | - Heike Adel
- Bosch Center for Artificial Intelligence, Robert Bosch GmbH, Renningen, 71272, Germany
| | - Jannik Strötgen
- Bosch Center for Artificial Intelligence, Robert Bosch GmbH, Renningen, 71272, Germany
| | - Dietrich Klakow
- Spoken Language Systems, Saarland University, Saarbrücken, 66111, Germany
| |
|
11
|
Rodriguez NE, Nguyen M, McInnes BT. Effects of Data and Entity Ablation on Multitask Learning Models for Biomedical Entity Recognition. J Biomed Inform 2022; 130:104062. [DOI: 10.1016/j.jbi.2022.104062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2021] [Revised: 02/11/2022] [Accepted: 03/27/2022] [Indexed: 11/24/2022]
|
12
|
Furrer L, Cornelius J, Rinaldi F. Parallel sequence tagging for concept recognition. BMC Bioinformatics 2022; 22:623. [PMID: 35331131 PMCID: PMC8943923 DOI: 10.1186/s12859-021-04511-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2021] [Accepted: 12/01/2021] [Indexed: 11/23/2022] Open
Abstract
BACKGROUND Named Entity Recognition (NER) and Normalisation (NEN) are core components of any text-mining system for biomedical texts. In a traditional concept-recognition pipeline, these tasks are combined in a serial way, which is inherently prone to error propagation from NER to NEN. We propose a parallel architecture, where both NER and NEN are modeled as a sequence-labeling task, operating directly on the source text. We examine different harmonisation strategies for merging the predictions of the two classifiers into a single output sequence. RESULTS We test our approach on the recent Version 4 of the CRAFT corpus. In all 20 annotation sets of the concept-annotation task, our system outperforms the pipeline system reported as a baseline in the CRAFT shared task, a competition of the BioNLP Open Shared Tasks 2019. We further refine the systems from the shared task by optimising the harmonisation strategy separately for each annotation set. CONCLUSIONS Our analysis shows that the strengths of the two classifiers can be combined in a fruitful way. However, prediction harmonisation requires individual calibration on a development set for each annotation set. This allows achieving a good trade-off between established knowledge (training set) and novel information (unseen concepts).
Affiliation(s)
- Lenz Furrer
- Department of Computational Linguistics, University of Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Joseph Cornelius
- Dalle Molle Institute for Artificial Intelligence Research (IDSIA USI/SUPSI), Lugano, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Fabio Rinaldi
- Dalle Molle Institute for Artificial Intelligence Research (IDSIA USI/SUPSI), Lugano, Switzerland.
- Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland.
- Swiss Institute of Bioinformatics, Zurich, Switzerland.
- Fondazione Bruno Kessler, Trento, Italy.
| |
|
13
|
Comparison of Text Mining Models for Food and Dietary Constituent Named-Entity Recognition. MACHINE LEARNING AND KNOWLEDGE EXTRACTION 2022. [DOI: 10.3390/make4010012] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
Biomedical Named-Entity Recognition (BioNER) has become an essential part of text mining due to the continuously increasing digital archives of biological and medical articles. While there are many well-performing BioNER tools for entities such as genes, proteins, diseases or species, there is very little research into food and dietary constituent named-entity recognition. For this reason, in this paper, we study seven BioNER models for food and dietary constituents recognition. Specifically, we study a dictionary-based model, a conditional random fields (CRF) model and a new hybrid model, called FooDCoNER (Food and Dietary Constituents Named-Entity Recognition), which we introduce combining the former two models. In addition, we study deep language models including BERT, BioBERT, RoBERTa and ELECTRA. As a result, we find that FooDCoNER does not only lead to the overall best results, comparable with the deep language models, but FooDCoNER is also much more efficient with respect to run time and sample size requirements of the training data. The latter has been identified via the study of learning curves. Overall, our results not only provide a new tool for food and dietary constituent NER but also shed light on the difference between classical machine learning models and recent deep language models.
|
14
|
The Construction Model of the TCM Clinical Knowledge Coding Database Based on Knowledge Organization. BIOMED RESEARCH INTERNATIONAL 2022; 2022:2503779. [PMID: 35083328 PMCID: PMC8786531 DOI: 10.1155/2022/2503779] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2021] [Accepted: 12/16/2021] [Indexed: 11/17/2022]
Abstract
Based on the knowledge organization method, this paper explores the construction method of the traditional Chinese medicine (TCM) clinical knowledge coding model by taking TCM clinical electronic medical record data as the research object. First, extraction technology is used to obtain the required data from the electronic medical record. Then, by constructing the clinical knowledge coding model, the tacit knowledge is made explicit, establishing the clinical knowledge base and exploring the connotation of TCM clinical knowledge. It provides necessary data resources for deepening the expression level of TCM clinical knowledge, constructing accurate TCM clinical diagnosis, intervention, and evaluation models, and promoting the inheritance, innovation, and development of TCM. In this paper, we extracted the data of 318 cases of distention and established the TCM clinical database from the basic information of patients, clinical diagnosis information, clinical diagnosis and treatment information, and clinical evaluation information. Based on the knowledge coding model and the connotation of knowledge attributes, the established TCM clinical knowledge base was used to explore the law of TCM clinical precision diagnosis and treatment.
Collapse
|
15
|
Rivera-Zavala RM, Martínez P. Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization. BMC Bioinformatics 2021; 22:601. [PMID: 34920703 PMCID: PMC8680060 DOI: 10.1186/s12859-021-04247-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/27/2021] [Accepted: 06/09/2021] [Indexed: 11/14/2022]
Abstract
BACKGROUND The volume of biomedical literature and clinical data is growing at an exponential rate. Therefore, efficient access to data described in unstructured biomedical texts is a crucial task for the biomedical industry and research. Named Entity Recognition (NER) is the first step for information and knowledge acquisition when we deal with unstructured texts. Recent NER approaches use contextualized word representations as input for a downstream classification task. However, distributed word vectors (embeddings) are very limited in Spanish and even more so for the biomedical domain. METHODS In this work, we develop several biomedical Spanish word representations, and we introduce two Deep Learning approaches for the recognition of pharmaceutical, chemical, and other biomedical entities in Spanish clinical case texts and biomedical texts, one based on a Bi-LSTM-CRF model and the other on a BERT-based architecture. RESULTS Several Spanish biomedical embeddings together with the two deep learning models were evaluated on the PharmaCoNER and CORD-19 datasets. The PharmaCoNER dataset is composed of a set of Spanish clinical cases annotated with drugs, chemical compounds and pharmacological substances; our extended Bi-LSTM-CRF model obtains an F-score of 85.24% on entity identification and classification and the BERT model obtains an F-score of 88.80%. For the entity normalization task, the extended Bi-LSTM-CRF model achieves an F-score of 72.85% and the BERT model achieves 79.97%. The CORD-19 dataset consists of scholarly articles written in English annotated with biomedical concepts such as disorder, species, chemical or drugs, gene and protein, enzyme and anatomy. On the CORD-19 dataset, the Bi-LSTM-CRF and BERT models obtain an F-measure of 78.23% and 78.86%, respectively, on entity identification and classification.
CONCLUSION These results show that deep learning models with in-domain knowledge learned from large-scale datasets substantially improve named entity recognition performance. Moreover, contextualized representations help to capture the complexities and ambiguity inherent to biomedical texts. Embeddings based on words, concepts, senses, etc. for languages other than English are required to improve NER in those languages.
Affiliation(s)
- Renzo M Rivera-Zavala
  - Computer Science Department, University Carlos III of Madrid, Leganes, Madrid, Spain
- Paloma Martínez
  - Computer Science Department, University Carlos III of Madrid, Leganes, Madrid, Spain
|
16
|
Larmande P, Liu Y, Yao X, Xia J. OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition. Genomics Inform 2021; 19:e27. [PMID: 34638174 PMCID: PMC8510865 DOI: 10.5808/gi.21015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/17/2021] [Accepted: 07/27/2021] [Indexed: 12/02/2022]
Abstract
Due to the rapid evolution of high-throughput technologies, a tremendous amount of data is being produced in the biological domain, which poses a challenging task for information extraction and natural language understanding. Biological named entity recognition (NER) and named entity normalisation (NEN) are two common tasks aiming at identifying and linking biologically important entities such as genes or gene products mentioned in the literature to biological databases. In this paper, we present an updated version of OryzaGP, a gene and protein dataset for rice species created to help natural language processing (NLP) tools in processing NER and NEN tasks. To create the dataset, we selected more than 15,000 abstracts associated with articles previously curated for rice genes. We developed four dictionaries of gene and protein names associated with database identifiers. We used these dictionaries to annotate the dataset. We also annotated the dataset using pre-trained NLP models. Finally, we analysed the annotation results and discussed how to improve OryzaGP.
Affiliation(s)
- Pierre Larmande
  - DIADE, Univ. Montpellier, IRD, CIRAD, 34394 Montpellier, France
  - French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier F-34398, France
- Yusha Liu
  - Hubei Provincial Key Laboratory of Agricultural Bioinformatics, College of informatics, Huazhong Agricultural University, Wuhan 430070, Hubei Province, China
- Xinzhi Yao
  - Hubei Provincial Key Laboratory of Agricultural Bioinformatics, College of informatics, Huazhong Agricultural University, Wuhan 430070, Hubei Province, China
- Jingbo Xia
  - Hubei Provincial Key Laboratory of Agricultural Bioinformatics, College of informatics, Huazhong Agricultural University, Wuhan 430070, Hubei Province, China
|
17
|
Mante J, Hao Y, Jett J, Joshi U, Keating K, Lu X, Nakum G, Rodriguez NE, Tang J, Terry L, Wu X, Yu E, Downie JS, McInnes BT, Nguyen MH, Sepulvado B, Young EM, Myers CJ. Synthetic Biology Knowledge System. ACS Synth Biol 2021; 10:2276-2285. [PMID: 34387462 DOI: 10.1021/acssynbio.1c00188] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 02/07/2023]
Abstract
The Synthetic Biology Knowledge System (SBKS) is an instance of the SynBioHub repository that includes text and data information that has been mined from papers published in ACS Synthetic Biology. This paper describes the SBKS curation framework that is being developed to construct the knowledge stored in this repository. The text mining pipeline performs automatic annotation of the articles using natural language processing techniques to identify salient content such as key terms, relationships between terms, and main topics. The data mining pipeline performs automatic annotation of the sequences extracted from the supplemental documents with the genetic parts used in them. Together these two pipelines link genetic parts to papers describing the context in which they are used. Ultimately, SBKS will reduce the time necessary for synthetic biologists to find the information necessary to complete their designs.
Affiliation(s)
- Jeanet Mante
  - University of Colorado Boulder, Boulder, Colorado 80309, United States
- Yikai Hao
  - University of California San Diego, La Jolla, California 92093, United States
- Jacob Jett
  - University of Illinois at Urbana−Champaign, Urbana, Illinois 61801, United States
- Udayan Joshi
  - University of California San Diego, La Jolla, California 92093, United States
- Kevin Keating
  - Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- Xiang Lu
  - University of California San Diego, La Jolla, California 92093, United States
- Gaurav Nakum
  - University of California San Diego, La Jolla, California 92093, United States
- Jiawei Tang
  - University of California San Diego, La Jolla, California 92093, United States
- Logan Terry
  - University of Utah, Salt Lake City, Utah 84112, United States
- Xuanyu Wu
  - University of California San Diego, La Jolla, California 92093, United States
- Eric Yu
  - University of Utah, Salt Lake City, Utah 84112, United States
- J. Stephen Downie
  - University of Illinois at Urbana−Champaign, Urbana, Illinois 61801, United States
- Bridget T. McInnes
  - Virginia Commonwealth University, Richmond, Virginia 23284, United States
- Mai H. Nguyen
  - University of California San Diego, La Jolla, California 92093, United States
- Brandon Sepulvado
  - NORC at the University of Chicago Bethesda, Chicago, Illinois 60637, United States
- Eric M. Young
  - Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
- Chris J. Myers
  - University of Colorado Boulder, Boulder, Colorado 80309, United States
|
18
|
Weber L, Sänger M, Münchmeyer J, Habibi M, Leser U, Akbik A. HunFlair: an easy-to-use tool for state-of-the-art biomedical named entity recognition. Bioinformatics 2021; 37:2792-2794. [PMID: 33508086 PMCID: PMC8428609 DOI: 10.1093/bioinformatics/btab042] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 08/19/2020] [Revised: 01/13/2021] [Accepted: 01/20/2021] [Indexed: 01/31/2023]
Abstract
SUMMARY Named entity recognition (NER) is an important step in biomedical information extraction pipelines. Tools for NER should be easy to use, cover multiple entity types, be highly accurate and be robust toward variations in text genre and style. We present HunFlair, a NER tagger fulfilling these requirements. HunFlair is integrated into the widely used NLP framework Flair, recognizes five biomedical entity types, reaches or exceeds state-of-the-art performance on a wide set of evaluation corpora, and is trained in a cross-corpus setting to avoid corpus-specific bias. Technically, it uses a character-level language model pretrained on roughly 24 million biomedical abstracts and three million full texts. It outperforms other off-the-shelf biomedical NER tools with an average gain of 7.26 pp over the next best tool in a cross-corpus setting and achieves on-par results with state-of-the-art research prototypes in in-corpus experiments. HunFlair can be installed with a single command and is applied with only four lines of code. Furthermore, it is accompanied by harmonized versions of 23 biomedical NER corpora. AVAILABILITY AND IMPLEMENTATION HunFlair is freely available through the Flair NLP framework (https://github.com/flairNLP/flair) under an MIT license and is compatible with all major operating systems. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Leon Weber
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
  - Group Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 13125, Germany
- Mario Sänger
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Jannes Münchmeyer
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
  - Section Seismology, GFZ German Research Centre for Geosciences, Potsdam 14473, Germany
- Maryam Habibi
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Ulf Leser
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Alan Akbik
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
|
19
|
Parolo S, Tomasoni D, Bora P, Ramponi A, Kaddi C, Azer K, Domenici E, Neves-Zaph S, Lombardo R. Reconstruction of the Cytokine Signaling in Lysosomal Storage Diseases by Literature Mining and Network Analysis. Front Cell Dev Biol 2021; 9:703489. [PMID: 34490253 PMCID: PMC8417786 DOI: 10.3389/fcell.2021.703489] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/30/2021] [Accepted: 07/30/2021] [Indexed: 11/13/2022]
Abstract
Lysosomal storage diseases (LSDs) are characterized by the abnormal accumulation of substrates in tissues due to the deficiency of lysosomal proteins. Among the numerous clinical manifestations, chronic inflammation has been consistently reported for several LSDs. However, the molecular mechanisms involved in the inflammatory response are still not completely understood. In this study, we performed text-mining and systems biology analyses to investigate the inflammatory signals in three LSDs characterized by sphingolipid accumulation: Gaucher disease, Acid Sphingomyelinase Deficiency (ASMD), and Fabry Disease. We first identified the cytokines linked to the LSDs, and then built on the extracted knowledge to investigate the inflammatory signals. We found numerous transcription factors that are putative regulators of cytokine expression in a cell-specific context, such as the signaling axes controlled by STAT2, JUN, and NR4A2 as candidate regulators of the monocyte Gaucher disease cytokine network. Overall, our results suggest the presence of a complex inflammatory signaling in LSDs involving many cellular and molecular players that could be further investigated as putative targets of anti-inflammatory therapies.
Affiliation(s)
- Silvia Parolo
  - Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
- Danilo Tomasoni
  - Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
- Pranami Bora
  - Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
- Alan Ramponi
  - Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
- Chanchala Kaddi
  - Data and Data Science - Translational Disease Modeling, Sanofi, Bridgewater, NJ, United States
- Karim Azer
  - Data and Data Science - Translational Disease Modeling, Sanofi, Bridgewater, NJ, United States
- Enrico Domenici
  - Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
  - Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Trento, Italy
- Susana Neves-Zaph
  - Data and Data Science - Translational Disease Modeling, Sanofi, Bridgewater, NJ, United States
- Rosario Lombardo
  - Fondazione the Microsoft Research-University of Trento Centre for Computational and Systems Biology, Rovereto, Italy
|
20
|
Song B, Li F, Liu Y, Zeng X. Deep learning methods for biomedical named entity recognition: a survey and qualitative comparison. Brief Bioinform 2021; 22:6326536. [PMID: 34308472 DOI: 10.1093/bib/bbab282] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 04/27/2021] [Revised: 06/07/2021] [Accepted: 07/02/2021] [Indexed: 11/13/2022]
Abstract
The biomedical literature is growing rapidly, and the extraction of meaningful information from the large amount of literature is increasingly important. Biomedical named entity (BioNE) identification is one of the critical and fundamental tasks in biomedical text mining. Accurate identification of entities in the literature facilitates the performance of other tasks. Given that an end-to-end neural network can automatically extract features, several deep learning-based methods have been proposed for BioNE recognition (BioNER), yielding state-of-the-art performance. In this review, we comprehensively summarize deep learning-based methods for BioNER and datasets used in training and testing. The deep learning methods are classified into four categories: single neural network-based, multitask learning-based, transfer learning-based and hybrid model-based methods. They can be applied to BioNER in multiple domains, and the results are determined by the dataset size and type. Lastly, we discuss the future development and opportunities of BioNER methods.
Affiliation(s)
- Bosheng Song
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Fen Li
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Yuansheng Liu
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
- Xiangxiang Zeng
  - College of Information Science and Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
|
21
|
Gao S, Kotevska O, Sorokine A, Christian JB. A pre-training and self-training approach for biomedical named entity recognition. PLoS One 2021; 16:e0246310. [PMID: 33561139 PMCID: PMC7872256 DOI: 10.1371/journal.pone.0246310] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 09/16/2020] [Accepted: 01/18/2021] [Indexed: 11/18/2022]
Abstract
Named entity recognition (NER) is a key component of many scientific literature mining tasks, such as information retrieval, information extraction, and question answering; however, many modern approaches require large amounts of labeled training data in order to be effective. This severely limits the effectiveness of NER models in applications where expert annotations are difficult and expensive to obtain. In this work, we explore the effectiveness of transfer learning and semi-supervised self-training to improve the performance of NER models in biomedical settings with very limited labeled data (250-2000 labeled samples). We first pre-train a BiLSTM-CRF and a BERT model on a very large general biomedical NER corpus such as MedMentions or Semantic Medline, and then we fine-tune the model on a more specific target NER task that has very limited training data; finally, we apply semi-supervised self-training using unlabeled data to further boost model performance. We show that in NER tasks that focus on common biomedical entity types such as those in the Unified Medical Language System (UMLS), combining transfer learning with self-training enables a NER model such as a BiLSTM-CRF or BERT to obtain performance similar to that of the same model trained on 3x-8x the amount of labeled data. We further show that our approach can also boost performance in a low-resource application where entity types are rarer and not specifically covered in UMLS.
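The self-training step described in this abstract follows a general pattern: train on labeled data, pseudo-label the unlabeled examples the model is confident about, and retrain. The sketch below shows only that loop, with a toy one-dimensional nearest-centroid classifier standing in for the BiLSTM-CRF or BERT tagger. Every name, the confidence measure, and the `threshold` value are assumptions made for illustration, not the authors' procedure.

```python
def fit_centroids(xs, ys):
    """Toy 'model': the mean feature value of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_with_confidence(centroids, x):
    """Return (label, confidence); confidence is a distance margin in [0, 1].

    Assumes at least two classes are present in `centroids`.
    """
    ranked = sorted(centroids, key=lambda label: abs(x - centroids[label]))
    d_best = abs(x - centroids[ranked[0]])
    d_next = abs(x - centroids[ranked[1]])
    total = d_best + d_next
    return ranked[0], ((d_next - d_best) / total if total else 0.0)

def self_train(xs, ys, unlabeled, threshold=0.6, max_rounds=5):
    """Repeatedly add confidently pseudo-labeled examples and refit."""
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    model = fit_centroids(xs, ys)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for x in pool:
            label, confidence = predict_with_confidence(model, x)
            if confidence >= threshold:
                xs.append(x)          # accept the pseudo-label as training data
                ys.append(label)
                added += 1
            else:
                remaining.append(x)   # keep uncertain examples for a later round
        if added == 0:
            break                     # nothing new to learn from
        pool = remaining
        model = fit_centroids(xs, ys)
    return model, pool

# Made-up data: "O" (non-entity) clusters near 0, "ENT" near 10.
model, pool = self_train([0.0, 1.0, 9.0, 10.0], ["O", "O", "ENT", "ENT"],
                         unlabeled=[0.5, 2.0, 8.0, 5.0])
print(model, pool)
```

In this toy run the ambiguous point 5.0 never clears the confidence threshold and stays unlabeled, which is the behavior a threshold is meant to give: uncertain predictions are kept out of the training set.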
Affiliation(s)
- Shang Gao
  - Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
- Olivera Kotevska
  - Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
- Alexandre Sorokine
  - Geospatial Science and Human Security Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
- J. Blair Christian
  - Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, TN, United States of America
|
22
|
Casaní-Galdón S, Pereira C, Conesa A. Padhoc: a computational pipeline for pathway reconstruction on the fly. Bioinformatics 2020; 36:i795-i803. [PMID: 33381819 DOI: 10.1093/bioinformatics/btaa811] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/26/2022]
Abstract
MOTIVATION Molecular pathway databases represent cellular processes in a structured and standardized way. These databases support the community-wide utilization of pathway information in biological research and the computational analysis of high-throughput biochemical data. Although pathway databases are critical in genomics research, the fast progress of biomedical sciences prevents databases from staying up-to-date. Moreover, the compartmentalization of cellular reactions into defined pathways reflects arbitrary choices that might not always be aligned with the needs of the researcher. Today, no tool exists that allows the easy creation of user-defined pathway representations. RESULTS Here we present Padhoc, a pipeline for pathway ad hoc reconstruction. Based on a set of user-provided keywords, Padhoc combines natural language processing, database knowledge extraction, orthology search and powerful graph algorithms to create navigable pathways tailored to the user's needs. We validate Padhoc with a set of well-established Escherichia coli pathways and demonstrate usability to create not-yet-available pathways in model (human) and non-model (sweet orange) organisms. AVAILABILITY AND IMPLEMENTATION Padhoc is freely available at https://github.com/ConesaLab/padhoc. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Cecile Pereira
  - Institute for Food and Agricultural Sciences, Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, 32603, USA
  - EURA NOVA, Marseille 13382, France
- Ana Conesa
  - Institute for Food and Agricultural Sciences, Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, 32603, USA
|
23
|
Ivanisenko TV, Saik OV, Demenkov PS, Ivanisenko NV, Savostianov AN, Ivanisenko VA. ANDDigest: a new web-based module of ANDSystem for the search of knowledge in the scientific literature. BMC Bioinformatics 2020; 21:228. [PMID: 32921303 PMCID: PMC7488989 DOI: 10.1186/s12859-020-03557-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 05/22/2020] [Accepted: 05/25/2020] [Indexed: 12/26/2022]
Abstract
BACKGROUND The rapid growth of scientific literature has rendered the task of finding relevant information one of the critical problems in almost any research. Search engines, like Google Scholar, Web of Knowledge, PubMed, Scopus, and others, are highly effective in document search; however, they do not allow knowledge extraction. In contrast to the search engines, text-mining systems provide extraction of knowledge with representations in the form of semantic networks. Of particular interest are tools performing a full cycle of knowledge management and engineering, including automated retrieval, integration, and representation of knowledge in the form of semantic networks, their visualization, and analysis. STRING, Pathway Studio, MetaCore, and others are well-known examples of such products. Previously, we developed the Associative Network Discovery System (ANDSystem), which also implements such a cycle. However, the drawback of these systems is dependence on the employed ontologies describing the subject area, which limits their functionality in searching information based on user-specified queries. RESULTS The ANDDigest system is a new web-based module of the ANDSystem tool, permitting searching within PubMed by using dictionaries from the ANDSystem tool and sets of user-defined keywords. ANDDigest allows performing the search based on complex queries simultaneously, taking into account many types of objects from the ANDSystem's ontology. The system has a user-friendly interface, providing sorting, visualization, and filtering of the found information, including mapping of mentioned objects in text, linking to external databases, sorting of data by publication date, citations number, journal H-indices, etc. The system provides data on trends for identified entities based on dynamics of interest according to the frequency of their mentions in PubMed by years. 
CONCLUSIONS The main feature of ANDDigest is its functionality, serving as a specialized search for information about multiple associative relationships of objects from the ANDSystem's ontology vocabularies, taking into account user-specified keywords. The tool can be applied to the interpretation of experimental genetics data, the search for associations between molecular genetics objects, and the preparation of scientific and analytical reviews. It is presently available at https://anddigest.sysbio.ru/.
Affiliation(s)
- Timofey V Ivanisenko
  - Laboratory of Computer-Assisted Proteomics, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk, 630090, Russia
  - Laboratory of Computer Genomics, Novosibirsk State University, st. Pirogova 1, Novosibirsk, 630090, Russia
  - Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk, 630090, Russia
- Olga V Saik
  - Laboratory of Computer-Assisted Proteomics, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk, 630090, Russia
- Pavel S Demenkov
  - Laboratory of Computer-Assisted Proteomics, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk, 630090, Russia
  - Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk, 630090, Russia
  - Novosibirsk State University, st. Pirogova 1, Novosibirsk, 630090, Russia
- Nikita V Ivanisenko
  - Laboratory of Computer-Assisted Proteomics, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk, 630090, Russia
  - Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk, 630090, Russia
- Vladimir A Ivanisenko
  - Laboratory of Computer-Assisted Proteomics, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk, 630090, Russia
  - Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk, 630090, Russia
  - Novosibirsk State University, st. Pirogova 1, Novosibirsk, 630090, Russia
|
24
|
Perera N, Dehmer M, Emmert-Streib F. Named Entity Recognition and Relation Detection for Biomedical Information Extraction. Front Cell Dev Biol 2020; 8:673. [PMID: 32984300 PMCID: PMC7485218 DOI: 10.3389/fcell.2020.00673] [Citation(s) in RCA: 54] [Impact Index Per Article: 10.8] [Received: 12/12/2019] [Accepted: 07/02/2020] [Indexed: 12/29/2022]
Abstract
The number of scientific publications in the literature is steadily growing, containing our knowledge in the biomedical, health, and clinical sciences. Since there is currently no automatic archiving of the obtained results, much of this information remains buried in textual details not readily available for further usage or analysis. For this reason, natural language processing (NLP) and text mining methods are used for information extraction from such publications. In this paper, we review practices for Named Entity Recognition (NER) and Relation Detection (RD), which allow, e.g., the identification of interactions between proteins and drugs or genes and diseases. This information can be integrated into networks to summarize large-scale details on a particular biomedical or clinical problem, which is then amenable to easy data management and further analysis. Furthermore, we survey novel deep learning methods that have recently been introduced for such tasks.
Affiliation(s)
- Nadeesha Perera
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
- Matthias Dehmer
  - Department of Mechatronics and Biomedical Computer Science, University for Health Sciences, Medical Informatics and Technology (UMIT), Hall in Tirol, Austria
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Frank Emmert-Streib
  - Predictive Society and Data Analytics Lab, Faculty of Information Technology and Communication Sciences, Tampere University, Tampere, Finland
  - Faculty of Medicine and Health Technology, Institute of Biosciences and Medical Technology, Tampere University, Tampere, Finland
|
25
|
Weber L, Thobe K, Migueles Lozano OA, Wolf J, Leser U. PEDL: extracting protein-protein associations using deep language models and distant supervision. Bioinformatics 2020; 36:i490-i498. [PMID: 32657389 PMCID: PMC7355289 DOI: 10.1093/bioinformatics/btaa430] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 12/04/2022]
Abstract
Motivation A significant portion of molecular biology investigates signalling pathways and thus depends on an up-to-date and complete resource of functional protein–protein associations (PPAs) that constitute such pathways. Despite extensive curation efforts, major pathway databases are still notoriously incomplete. Relation extraction can help to gather such pathway information from biomedical publications. Current methods for extracting PPAs typically rely exclusively on rare manually labelled data which severely limits their performance. Results We propose PPA Extraction with Deep Language (PEDL), a method for predicting PPAs from text that combines deep language models and distant supervision. Due to the reliance on distant supervision, PEDL has access to an order of magnitude more training data than methods solely relying on manually labelled annotations. We introduce three different datasets for PPA prediction and evaluate PEDL for the two subtasks of predicting PPAs between two proteins, as well as identifying the text spans stating the PPA. We compared PEDL with a recently published state-of-the-art model and found that on average PEDL performs better in both tasks on all three datasets. An expert evaluation demonstrates that PEDL can be used to predict PPAs that are missing from major pathway databases and that it correctly identifies the text spans supporting the PPA. Availability and implementation PEDL is freely available at https://github.com/leonweber/pedl. The repository also includes scripts to generate the used datasets and to reproduce the experiments from this article. Supplementary information Supplementary data are available at Bioinformatics online.
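Distant supervision, as used in this abstract, generally means deriving training labels from an existing knowledge base instead of manual annotation. The sketch below illustrates that general idea only and is not the PEDL code: the function `distant_labels`, the input format (each sentence paired with the protein names already recognized in it), and the example data are hypothetical.

```python
from itertools import combinations

def distant_labels(sentences, known_pairs):
    """Label every co-mentioned protein pair using a set of known associations.

    sentences   -- list of (text, [protein names found in text]); assumed input format
    known_pairs -- iterable of (protein_a, protein_b) taken from a pathway database
    Returns a list of (text, protein_a, protein_b, label) with label 1 or 0.
    """
    known = {frozenset(pair) for pair in known_pairs}   # order-independent lookup
    examples = []
    for text, proteins in sentences:
        for a, b in combinations(sorted(set(proteins)), 2):
            label = 1 if frozenset((a, b)) in known else 0
            examples.append((text, a, b, label))
    return examples

# Invented example sentences and a one-entry 'database'.
sentences = [
    ("MDM2 ubiquitinates TP53 in the nucleus.", ["MDM2", "TP53"]),
    ("TP53 and GAPDH were measured as controls.", ["TP53", "GAPDH"]),
]
examples = distant_labels(sentences, known_pairs=[("TP53", "MDM2")])
for example in examples:
    print(example)
```

Labels produced this way are noisy: a sentence that merely co-mentions a known pair is marked positive even if it does not state the association, which is one reason such data is usually paired with models that tolerate label noise.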
Affiliation(s)
- Leon Weber
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
  - Group Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 13125, Germany
- Kirsten Thobe
  - Group Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 13125, Germany
- Oscar Arturo Migueles Lozano
  - Group Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 13125, Germany
- Jana Wolf
  - Group Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin 13125, Germany
- Ulf Leser
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
|
26
|
Jose JM, Yilmaz E, Magalhães J, Castells P, Ferro N, Silva MJ, Martins F. On Biomedical Named Entity Recognition: Experiments in Interlingual Transfer for Clinical and Social Media Texts. ADVANCES IN INFORMATION RETRIEVAL 2020. [PMCID: PMC7148079 DOI: 10.1007/978-3-030-45442-5_35] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Indexed: 12/13/2022]
Abstract
Although deep neural networks yield state-of-the-art performance in biomedical named entity recognition (bioNER), much research shares one limitation: models are usually trained and evaluated on English texts from a single domain. In this work, we present a fine-grained evaluation intended to understand the efficiency of multilingual BERT-based models for bioNER of drug and disease mentions across two domains in two languages, namely clinical data and user-generated texts on drug therapy in English and Russian. We investigate the role of transfer learning (TL) strategies between four corpora to reduce the number of examples that have to be manually annotated. Evaluation results demonstrate that multi-BERT shows the best transfer capabilities in the zero-shot setting when training and test sets are either in the same language or in the same domain. TL reduces the amount of labeled data needed to achieve high performance on three out of four corpora: pretrained models reach 98–99% of the full dataset performance on both types of entities after training on 10–25% of sentences. We demonstrate that pretraining on data with one or both types of transfer can be effective.
|