1
Neves M, Klippert A, Knöspel F, Rudeck J, Stolz A, Ban Z, Becker M, Diederich K, Grune B, Kahnau P, Ohnesorge N, Pucher J, Schönfelder G, Bert B, Butzke D. Automatic classification of experimental models in biomedical literature to support searching for alternative methods to animal experiments. J Biomed Semantics 2023; 14:13. [PMID: 37658458 PMCID: PMC10472567 DOI: 10.1186/s13326-023-00292-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2022] [Accepted: 07/29/2023] [Indexed: 09/03/2023] Open
Abstract
Current animal protection laws require replacement of animal experiments with alternative methods, whenever such methods are suitable to reach the intended scientific objective. However, searching for alternative methods in the scientific literature is a time-consuming task that requires careful screening of an enormously large number of experimental biomedical publications. The identification of potentially relevant methods, e.g. organ or cell culture models, or computer simulations, can be supported with text mining tools specifically built for this purpose. Such tools are trained (or fine-tuned) on relevant data sets labeled by human experts. We developed the GoldHamster corpus, composed of 1,600 PubMed (Medline) articles (titles and abstracts), in which we manually identified the experimental model used according to a set of eight labels, namely: "in vivo", "organs", "primary cells", "immortal cell lines", "invertebrates", "humans", "in silico" and "other" (models). We recruited 13 annotators with expertise in the biomedical domain and assigned each article to two individuals. Four additional rounds of annotation aimed to improve the quality of the annotations that had disagreements in the first round. Furthermore, we conducted various machine learning experiments based on supervised learning to evaluate the corpus for our classification task. We obtained more than 7,000 document-level annotations for the above labels. After the first round of annotation, the inter-annotator agreement (kappa coefficient) varied among labels, and ranged from 0.42 (for "others") to 0.82 (for "invertebrates"), with an overall score of 0.62. All disagreements were resolved in the subsequent rounds of annotation. The best-performing machine learning experiment used the PubMedBERT pre-trained model with fine-tuning to our corpus, which achieved an overall F-score of 0.83.
We obtained a corpus with high agreement for all labels, and our evaluation demonstrated that our corpus is suitable for training reliable predictive models for automatic classification of biomedical literature according to the used experimental models. Our SMAFIRA - "Smart feature-based interactive" - search tool ( https://smafira.bf3r.de ) will employ this classifier for supporting the retrieval of alternative methods to animal experiments. The corpus is available for download ( https://doi.org/10.5281/zenodo.7152295 ), as well as the source code ( https://github.com/mariananeves/goldhamster ) and the model ( https://huggingface.co/SMAFIRA/goldhamster ).
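The agreement figures in this abstract are kappa coefficients. As a minimal sketch, assuming Cohen's kappa computed for a single label over two annotators' binary decisions (the abstract does not state which kappa variant was used, and the annotation vectors below are invented for illustration), the calculation could look like this:

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical decisions for one label (1 = label assigned, 0 = not assigned).
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why a label that is rarely assigned can show high raw agreement but a modest kappa.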
Affiliation(s)
- Mariana Neves
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Antonina Klippert
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
  - Current affiliation: Nuvisan ICB GmbH, Müllerstraße 178, 13353, Berlin, Germany
- Fanny Knöspel
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Juliane Rudeck
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Ailine Stolz
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Zsofia Ban
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Markus Becker
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Kai Diederich
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Barbara Grune
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Pia Kahnau
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Nils Ohnesorge
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Johannes Pucher
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Gilbert Schönfelder
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
  - Institute of Clinical Pharmacology and Toxicology, Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
- Bettina Bert
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
- Daniel Butzke
  - German Centre for the Protection of Laboratory Animals (Bf3R), German Federal Institute for Risk Assessment (BfR), Berlin, Germany
2
Jeong M, Kang J. Consistency enhancement of model prediction on document-level named entity recognition. Bioinformatics 2023; 39:btad361. [PMID: 37261870 PMCID: PMC10272703 DOI: 10.1093/bioinformatics/btad361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2022] [Revised: 04/17/2023] [Accepted: 05/31/2023] [Indexed: 06/02/2023] Open
Abstract
SUMMARY: Biomedical named entity recognition (NER) plays a crucial role in extracting information from documents in biomedical applications. However, many of these applications require NER models to operate at a document level, rather than just a sentence level. This presents a challenge, as the extension from a sentence model to a document model is not always straightforward. Despite the existence of document NER models that are able to make consistent predictions, they still fall short of meeting the expectations of researchers and practitioners in the field. To address this issue, we have undertaken an investigation into the underlying causes of inconsistent predictions. Our research has led us to believe that the use of adjectives and prepositions within entities may be contributing to low label consistency. In this article, we present our method, ConNER, to enhance the label consistency of modifiers such as adjectives and prepositions. By refining the labels of these modifiers, ConNER is able to improve representations of biomedical entities. The effectiveness of our method is demonstrated on four popular biomedical NER datasets. On three datasets, we achieve a higher F1 score than the previous state-of-the-art model. Our method shows its efficacy on two datasets, resulting in 7.5%-8.6% absolute improvements in the F1 score. Our findings suggest that our ConNER method is effective on datasets with intrinsically low label consistency. Through qualitative analysis, we demonstrate how our approach helps the NER model generate more consistent predictions. AVAILABILITY AND IMPLEMENTATION: Our code and resources are available at https://github.com/dmis-lab/ConNER/.
Affiliation(s)
- Minbyul Jeong
  - Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
- Jaewoo Kang
  - Department of Computer Science and Engineering, Korea University, Seoul 02841, Republic of Korea
  - Interdisciplinary Graduate Program in Bioinformatics, Korea University, Seoul, Republic of Korea
  - AIGEN Sciences, Seoul 04778, Republic of Korea
3
Abstract
Named Entity Recognition (NER) on Clinical Electronic Medical Records (CEMR) is a fundamental step in extracting disease knowledge by identifying specific entity terms such as diseases, symptoms, etc. However, the state-of-the-art NER methods based on Long Short-Term Memory (LSTM) fail to exploit GPU parallelism fully under the massive medical records. Although a novel NER method based on Iterated Dilated CNNs (ID-CNNs) can accelerate network computing, it tends to ignore the word-order feature and semantic information of the current word. In order to enhance the performance of ID-CNNs-based models on NER tasks, an attention-based ID-CNNs-CRF model, which combines the word-order feature and local context, is proposed. Firstly, position embedding is utilized to fuse word-order information. Secondly, the ID-CNNs architecture is used to extract global semantic information rapidly. Simultaneously, the attention mechanism is employed to pay attention to the local context. Finally, we apply the CRF to obtain the optimal tag sequence. Experiments conducted on two CEMR datasets show that our model outperforms traditional ones. The F1-scores of 94.55% and 91.17% are obtained respectively on these two datasets, and both are better than LSTM-based models.
4
Furrer L, Jancso A, Colic N, Rinaldi F. OGER++: hybrid multi-type entity recognition. J Cheminform 2019; 11:7. [PMID: 30666476 PMCID: PMC6689863 DOI: 10.1186/s13321-018-0326-3] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 07/31/2018] [Accepted: 12/27/2018] [Indexed: 12/14/2022] Open
Abstract
Background: We present a text-mining tool for recognizing biomedical entities in scientific literature. OGER++ is a hybrid system for named entity recognition and concept recognition (linking), which combines a dictionary-based annotator with a corpus-based disambiguation component. The annotator uses an efficient look-up strategy combined with a normalization method for matching spelling variants. The disambiguation classifier is implemented as a feed-forward neural network which acts as a postfilter to the previous step. Results: We evaluated the system in terms of processing speed and annotation quality. In the speed benchmarks, the OGER++ web service processes 9.7 abstracts or 0.9 full-text documents per second. On the CRAFT corpus, we achieved 71.4% and 56.7% F1 for named entity recognition and concept recognition, respectively. Conclusions: Combining knowledge-based and data-driven components allows creating a system with competitive performance in biomedical text mining.
Affiliation(s)
- Lenz Furrer
  - Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland
- Anna Jancso
  - Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland
- Nicola Colic
  - Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland
- Fabio Rinaldi
  - Institute of Computational Linguistics, University of Zurich, Andreasstr. 15, 8050, Zürich, Switzerland
  - Fondazione Bruno Kessler, Via Sommarive, 18, 38123, Trento, Italy
5
Chen X, Gururaj AE, Ozyurt B, Liu R, Soysal E, Cohen T, Tiryaki F, Li Y, Zong N, Jiang M, Rogith D, Salimi M, Kim HE, Rocca-Serra P, Gonzalez-Beltran A, Farcas C, Johnson T, Margolis R, Alter G, Sansone SA, Fore IM, Ohno-Machado L, Grethe JS, Xu H. DataMed - an open source discovery index for finding biomedical datasets. J Am Med Inform Assoc 2018; 25:300-308. [PMID: 29346583 PMCID: PMC7378878 DOI: 10.1093/jamia/ocx121] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Received: 05/30/2017] [Revised: 09/20/2017] [Accepted: 09/28/2017] [Indexed: 12/17/2022] Open
Abstract
Objective: Finding relevant datasets is important for promoting data reuse in the biomedical domain, but it is challenging given the volume and complexity of biomedical data. Here we describe the development of an open source biomedical data discovery system called DataMed, with the goal of promoting the building of additional data indexes in the biomedical domain. Materials and Methods: DataMed, which can efficiently index and search diverse types of biomedical datasets across repositories, is developed through the National Institutes of Health–funded biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE) consortium. It consists of 2 main components: (1) a data ingestion pipeline that collects and transforms original metadata information to a unified metadata model, called DatA Tag Suite (DATS), and (2) a search engine that finds relevant datasets based on user-entered queries. In addition to describing its architecture and techniques, we evaluated individual components within DataMed, including the accuracy of the ingestion pipeline, the prevalence of the DATS model across repositories, and the overall performance of the dataset retrieval engine. Results and Conclusion: Our manual review shows that the ingestion pipeline could achieve an accuracy of 90% and core elements of DATS had varied frequency across repositories. On a manually curated benchmark dataset, the DataMed search engine achieved an inferred average precision of 0.2033 and a precision at 10 (P@10, the fraction of relevant results among the top 10 search results) of 0.6022, by implementing advanced natural language processing and terminology services. Currently, we have made the DataMed system publicly available as an open source package for the biomedical community.
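Precision at 10 is a standard ranked-retrieval metric. The following is a minimal sketch of how it is usually computed for one query; the function name, document identifiers, and relevance judgments are hypothetical and are not taken from DataMed or its benchmark. The inferred average precision also reported in the abstract is a different estimator and is not shown here.

```python
def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked results that are judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Dividing by k (not len(top_k)) penalizes result lists shorter than k.
    return hits / k


# Hypothetical ranking returned for one query, best result first.
ranking = [f"d{i}" for i in range(1, 13)]
# Hypothetical relevance judgments for the same query.
relevant = {"d1", "d3", "d4", "d7", "d9", "d10", "d12"}

p10 = precision_at_k(ranking, relevant, k=10)
print(f"P@10 = {p10:.2f}")
```

A benchmark-level score such as the one reported would typically be the mean of this per-query value over all benchmark queries.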
Affiliation(s)
- Xiaoling Chen
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Anupama E Gururaj
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ruiling Liu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ergin Soysal
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Trevor Cohen
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Firat Tiryaki
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Yueling Li
  - Center for Research in Biological Systems
- Nansu Zong
  - Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Min Jiang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Deevakar Rogith
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Mandana Salimi
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Hyeon-Eui Kim
  - Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Claudiu Farcas
  - Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Todd Johnson
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Ron Margolis
  - National Institutes of Health, Bethesda, MD, USA
- Ian M Fore
  - National Institutes of Health, Bethesda, MD, USA
- Lucila Ohno-Machado
  - Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, USA
- Hua Xu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
6
Kaewphan S, Hakala K, Miekka N, Salakoski T, Ginter F. Wide-scope biomedical named entity recognition and normalization with CRFs, fuzzy matching and character level modeling. Database (Oxford) 2018; 2018:1-10. [PMID: 30239666 PMCID: PMC6146133 DOI: 10.1093/database/bay096] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 03/01/2018] [Revised: 08/16/2018] [Accepted: 08/17/2018] [Indexed: 11/13/2022]
Abstract
We present a system for automatically identifying a multitude of biomedical entities from the literature. This work is based on our previous efforts in the BioCreative VI: Interactive Bio-ID Assignment shared task in which our system demonstrated state-of-the-art performance with the highest achieved results in named entity recognition. In this paper we describe the original conditional random field-based system used in the shared task as well as experiments conducted since, including better hyperparameter tuning and character level modeling, which led to further performance improvements. For normalizing the mentions into unique identifiers we use fuzzy character n-gram matching. The normalization approach has also been improved with a better abbreviation resolution method and stricter guideline compliance resulting in vastly improved results for various entity types. All tools and models used for both named entity recognition and normalization are publicly available under open license. Database URL: https://github.com/TurkuNLP/BioCreativeVI_BioID_assignment.
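The abstract names fuzzy character n-gram matching as the normalization step but gives no details. The sketch below illustrates the general idea only, assuming padded character trigrams compared by Jaccard similarity against a small lexicon; the similarity measure, threshold, identifiers, and lexicon entries are all hypothetical and should not be read as the authors' implementation.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the lowercased text, padded with '#'."""
    pad = "#" * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def normalize(mention, lexicon, threshold=0.5):
    """Return (identifier, score) of the best lexicon match, or (None, score)."""
    mention_grams = char_ngrams(mention)
    best_id, best_score = None, 0.0
    for identifier, name in lexicon.items():
        score = jaccard(mention_grams, char_ngrams(name))
        if score > best_score:
            best_id, best_score = identifier, score
    if best_score < threshold:
        return None, best_score  # no sufficiently similar entry
    return best_id, best_score


# Hypothetical lexicon mapping made-up identifiers to canonical names.
lexicon = {
    "ID:0001": "HeLa cell",
    "ID:0002": "HEK293 cell",
    "ID:0003": "zebrafish",
}
match_id, score = normalize("HeLa cells", lexicon)
print(match_id, round(score, 3))
```

Comparing n-gram sets rather than whole strings makes the lookup tolerant of plural forms, hyphenation, and small spelling variants, at the cost of occasional false matches between short, similar names.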
Affiliation(s)
- Suwisa Kaewphan
  - Turku Centre for Computer Science, Turku, Finland
  - Department of Future Technologies, University of Turku, Turku, Finland
  - University of Turku Graduate School, Turku, Finland
- Kai Hakala
  - Department of Future Technologies, University of Turku, Turku, Finland
  - University of Turku Graduate School, Turku, Finland
- Niko Miekka
  - Department of Future Technologies, University of Turku, Turku, Finland
- Tapio Salakoski
  - Turku Centre for Computer Science, Turku, Finland
  - University of Turku Graduate School, Turku, Finland
- Filip Ginter
  - Department of Future Technologies, University of Turku, Turku, Finland
7
Abstract
BACKGROUND: Cell lines and cell types are extensively studied in biomedical research, yielding a significant number of publications each year. Identifying cell lines and cell types precisely in publications is crucial for science reproducibility and knowledge integration. There are efforts for standardisation of the cell nomenclature based on ontology development to support FAIR principles of the cell knowledge. However, it is important to analyse the usage of cell nomenclature in publications at a large scale for understanding the level of uptake of cell nomenclature in literature by scientists. In this study, we analyse the usage of cell nomenclature, both in vivo and in vitro, in biomedical literature by using text mining methods and present our results. RESULTS: We identified 59% of the cell type classes in the Cell Ontology and 13% of the cell line classes in the Cell Line Ontology in the literature. Our analysis showed that cell line nomenclature is much more ambiguous compared to the cell type nomenclature. However, trends indicate that standardised nomenclature for cell lines and cell types is being increasingly used in publications by scientists. CONCLUSIONS: Our findings provide insight into how experimental cells are described in publications, may allow for an improved standardisation of cell type and cell line nomenclature, and can be utilised to develop efficient text mining applications on cell types and cell lines. All data generated in this study is available at https://github.com/shenay/CellNomenclatureStudy.
Affiliation(s)
- Şenay Kafkas
  - Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, 23955-6900 Saudi Arabia
- Sirarat Sarntivijai
  - The European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
- Robert Hoehndorf
  - Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, 4700 KAUST, Thuwal, 23955-6900 Saudi Arabia
8
Habibi M, Weber L, Neves M, Wiegandt DL, Leser U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics 2017; 33:i37-i48. [PMID: 28881963 PMCID: PMC5870729 DOI: 10.1093/bioinformatics/btx228] [Citation(s) in RCA: 186] [Impact Index Per Article: 26.6] [Indexed: 12/04/2022] Open
Abstract
MOTIVATION: Text mining has become an important tool for biomedical research. The most fundamental text-mining task is the recognition of biomedical named entities (NER), such as genes, chemicals and diseases. Current NER methods rely on pre-defined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult. RESULTS: We show that a completely generic method based on deep learning and statistical word embeddings [called long short-term memory network-conditional random field (LSTM-CRF)] outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall. AVAILABILITY AND IMPLEMENTATION: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora/ . CONTACT: habibima@informatik.hu-berlin.de.
Affiliation(s)
- Maryam Habibi
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
- Leon Weber
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
- Mariana Neves
  - Enterprise Platform and Integration Concepts, Hasso-Plattner-Institute, Potsdam, Germany
- David Luis Wiegandt
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany
- Ulf Leser
  - Computer Science Department, Humboldt-Universität zu Berlin, Berlin, Germany