1. Corvi JO, McKitrick A, Fernández JM, Fuenteslópez CV, Gelpí JL, Ginebra MP, Capella-Gutierrez S, Hakimi O. DEBBIE: The Open Access Database of Experimental Scaffolds and Biomaterials Built Using an Automated Text Mining Pipeline. Adv Healthc Mater 2023; 12:e2300150. [PMID: 37563883 DOI: 10.1002/adhm.202300150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2023] [Revised: 05/04/2023] [Indexed: 08/12/2023]
Abstract
Biomaterials research output has experienced an exponential increase over the last three decades. The majority of research is published in the form of scientific articles and is therefore available as unstructured text, making it a challenging input for computational processing. Computational tools are becoming essential to overcome this information overload. Among them, text mining systems present an attractive option for the automated extraction of information from text documents into structured datasets. This work presents the first automated system for extracting biomaterial-related information from research abstracts in the National Library of Medicine's premier bibliographic database (MEDLINE) into a searchable database. The system is a text mining pipeline that periodically retrieves abstracts from PubMed and identifies research and clinical studies of biomaterials. Thereafter, the pipeline identifies sixteen concept types of interest in the abstract using the Biomaterials Annotator, a tool for biomaterials Named Entity Recognition (NER). These concepts of interest, along with the abstract and relevant metadata, are then deposited in DEBBIE, the Database of Experimental Biomaterials and their Biological Effect. DEBBIE is accessible through a web application that provides keyword searches and displays results in an intuitive and meaningful manner, aiming to facilitate efficient mapping and organization of biomaterials information.
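The periodic retrieval step described in this abstract can be illustrated with a minimal sketch. The abstract does not say which interface DEBBIE uses to query PubMed; the sketch assumes the public NCBI E-utilities ESearch endpoint, and the function name `build_esearch_url`, the example query and the seven-day window are all hypothetical choices made for illustration. The code only builds the request URL and does not contact the network.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint. That DEBBIE uses this endpoint is an
# assumption; the abstract only says abstracts are retrieved from PubMed.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term: str, retmax: int = 100, days: int = 7) -> str:
    """Build an ESearch URL for PubMed records entered in the last `days` days."""
    params = {
        "db": "pubmed",      # search the PubMed database
        "term": term,        # query string (hypothetical example below)
        "retmax": retmax,    # maximum number of PMIDs to return
        "retmode": "json",   # ask for a JSON response
        "reldate": days,     # restrict to the most recent `days` days
        "datetype": "edat",  # apply the window to the Entrez entry date
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical query; a real pipeline would use its own curated search terms.
url = build_esearch_url("biocompatible materials[MeSH Terms]")
print(url)
```

A scheduled job could request this URL at a fixed interval and pass the returned PMIDs to the downstream classification and NER steps.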
Affiliation(s)
- Javier O Corvi: Barcelona Supercomputing Center (BSC), Barcelona, 08034, Spain
- Carla V Fuenteslópez: Institute of Biomedical Engineering, Botnar Research Centre, University of Oxford, Oxford, OX3 7LD, UK
- Josep L Gelpí: Department of Biochemistry and Molecular Biology, University of Barcelona, Barcelona, 08014, Spain
- Maria-Pau Ginebra: Department of Materials Science and Engineering, The Technical University of Catalonia, 08222, Barcelona, Spain
- Osnat Hakimi: Faculty of Medicine and Health Sciences, The International University of Catalonia, Barcelona, 08017, Spain
2. Gyori BM, Hoyt CT, Steppi A. Gilda: biomedical entity text normalization with machine-learned disambiguation as a service. Bioinformatics Advances 2022; 2:vbac034. [PMID: 36699362 PMCID: PMC9710686 DOI: 10.1093/bioadv/vbac034] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/11/2021] [Revised: 04/27/2022] [Accepted: 05/06/2022] [Indexed: 01/28/2023]
Abstract
Summary: Gilda is a software tool and web service that implements a scored string matching algorithm for names and synonyms across entries in biomedical ontologies covering genes, proteins (and their families and complexes), small molecules, biological processes and diseases. Gilda integrates machine-learned disambiguation models to choose between ambiguous strings given relevant surrounding text as context, and supports species-prioritization in case of ambiguity. Availability and implementation: The Gilda web service is available at http://grounding.indra.bio with source code, documentation and tutorials available via https://github.com/indralab/gilda. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
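The idea of scored string matching against a lexicon of names and synonyms can be sketched in a few lines. This is a toy illustration, not Gilda's algorithm or API: the lexicon entries, the placeholder identifiers, the `Match` class, the `ground` function and the three score levels are all assumptions made for the example, and no context-based disambiguation is attempted.

```python
from __future__ import annotations

from dataclasses import dataclass

# Toy lexicon mapping a surface name to (namespace, identifier).
# The identifiers are placeholders, not real ontology IDs.
LEXICON = {
    "KRAS": ("EXAMPLE", "0001"),
    "estrogen receptor": ("EXAMPLE", "0002"),
    "Aspirin": ("EXAMPLE", "0003"),
}


@dataclass
class Match:
    name: str
    namespace: str
    identifier: str
    score: float


def normalize(text: str) -> str:
    """Lowercase the text and drop every non-alphanumeric character."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def ground(text: str) -> list[Match]:
    """Return lexicon entries matching `text`, best score first.

    Hypothetical scoring: exact match 1.0, case-insensitive match 0.9,
    match after stripping punctuation and whitespace 0.8.
    """
    matches = []
    for name, (namespace, identifier) in LEXICON.items():
        if text == name:
            score = 1.0
        elif text.lower() == name.lower():
            score = 0.9
        elif normalize(text) == normalize(name):
            score = 0.8
        else:
            continue
        matches.append(Match(name, namespace, identifier, score))
    return sorted(matches, key=lambda m: m.score, reverse=True)


for query in ("KRAS", "kras", "k-ras", "unknown"):
    print(query, ground(query))
```

Penalizing looser matches lets a caller rank candidates when several lexicon entries normalize to the same string; choosing among equally scored candidates would need the kind of context model the abstract describes.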
Affiliation(s)
- Benjamin M Gyori: Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
- Charles Tapley Hoyt: Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
- Albert Steppi: Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA 02115, USA
3. Biziukova NY, Tarasova OA, Rudik AV, Filimonov DA, Poroikov VV. Automatic Recognition of Chemical Entity Mentions in Texts of Scientific Publications. Automatic Documentation and Mathematical Linguistics 2021. [DOI: 10.3103/s0005105520060023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
4. Biziukova N, Tarasova O, Ivanov S, Poroikov V. Automated Extraction of Information From Texts of Scientific Publications: Insights Into HIV Treatment Strategies. Front Genet 2021; 11:618862. [PMID: 33414815 PMCID: PMC7783389 DOI: 10.3389/fgene.2020.618862] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/18/2020] [Accepted: 11/26/2020] [Indexed: 12/16/2022]
Abstract
Text analysis can help to identify named entities (NEs) of small molecules, proteins, and genes. Such data are very important for the analysis of molecular mechanisms of disease progression and the development of new strategies for the treatment of various diseases and pathological conditions. The texts of publications are a primary source of information, which makes them especially valuable for collecting high-quality data because information appears in them sooner than in databases. In our study, we aimed to develop and test an approach to named entity recognition in the abstracts of publications. More specifically, we have developed and tested an algorithm based on conditional random fields, which recognizes NEs of (i) genes and proteins and (ii) chemicals. Careful selection of abstracts strictly related to the subject of interest makes it possible to extract the NEs strongly associated with that subject. To test the applicability of our approach, we applied it to the extraction of (i) potential HIV inhibitors and (ii) a set of proteins and genes potentially responsible for viremic control in HIV-positive patients. The computational experiments provide estimates of the accuracy of recognition of chemical NEs and of proteins (genes). The precision of chemical NE recognition is over 0.91, recall is 0.86, and the F1-score (harmonic mean of precision and recall) is 0.89; the precision of recognition of protein and gene names is over 0.86, recall is 0.83, and the F1-score is above 0.85. Evaluation of the algorithm on two case studies related to HIV treatment confirms that it is possible to extract the NEs strongly relevant to (i) HIV inhibitors and (ii) a group of patients, i.e., the group of HIV-positive individuals able to maintain an undetectable HIV-1 viral load over time in the absence of antiretroviral therapy. Analysis of the results obtained provides insights into the function of proteins that can be responsible for viremic control. Our study demonstrated the applicability of the developed approach for the extraction of useful data on HIV treatment.
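The F1-score quoted in this abstract is the harmonic mean of precision and recall, which can be checked with a short calculation. The function name `f1_score` is chosen here for illustration. Plugging in the rounded figures gives values slightly below the reported ones; one plausible reason is that the abstract states precision only as a lower bound ("over 0.91", "over 0.86").

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures from the abstract, with precision taken at its stated lower bound.
chemical_f1 = f1_score(0.91, 0.86)
protein_f1 = f1_score(0.86, 0.83)

print(f"chemicals: {chemical_f1:.3f}")       # prints "chemicals: 0.884"
print(f"proteins/genes: {protein_f1:.3f}")   # prints "proteins/genes: 0.845"
```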
Affiliation(s)
- Nadezhda Biziukova: Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Olga Tarasova: Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
- Sergey Ivanov: Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia; Department of Bioinformatics, Faculty of Biomedicine, Pirogov Russian National Research Medical University, Moscow, Russia
- Vladimir Poroikov: Laboratory of Structure-Function Based Drug Design, Department of Bioinformatics, Institute of Biomedical Chemistry, Moscow, Russia
5. Zhou H, Ning S, Liu Z, Lang C, Liu Z, Lei B. Knowledge-enhanced biomedical named entity recognition and normalization: application to proteins and genes. BMC Bioinformatics 2020; 21:35. [PMID: 32000677 PMCID: PMC6990512 DOI: 10.1186/s12859-020-3375-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 06/24/2019] [Accepted: 01/20/2020] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Automated biomedical named entity recognition and normalization serves as the basis for many downstream applications in information management. However, this task is challenging due to name variations and entity ambiguity. A biomedical entity may have multiple variants and a variant could denote several different entity identifiers. RESULTS: To remedy the above issues, we present a novel knowledge-enhanced system for protein/gene named entity recognition (PNER) and normalization (PNEN). On one hand, a large amount of entity name knowledge extracted from biomedical knowledge bases is used to recognize more entity variants. On the other hand, structural knowledge of entities is extracted and encoded as identifier (ID) embeddings, which are then used for better entity normalization. Moreover, deep contextualized word representations generated by pre-trained language models are also incorporated into our knowledge-enhanced system for modeling multi-sense information of entities. Experimental results on the BioCreative VI Bio-ID corpus show that our proposed knowledge-enhanced system achieves 0.871 F1-score for PNER and 0.445 F1-score for PNEN, respectively, leading to a new state-of-the-art performance. CONCLUSIONS: We propose a knowledge-enhanced system that combines both entity knowledge and deep contextualized word representations. Comparison results show that entity knowledge is beneficial to the PNER and PNEN task and can be well combined with contextualized information in our system for further improvement.
Affiliation(s)
- Huiwei Zhou: School of Computer Science and Technology, Dalian University of Technology, Chuangxinyuan Building, No.2 Linggong Road, Ganjingzi District, Dalian, 116024, Liaoning, China
- Shixian Ning: School of Computer Science and Technology, Dalian University of Technology, Chuangxinyuan Building, No.2 Linggong Road, Ganjingzi District, Dalian, 116024, Liaoning, China
- Zhe Liu: School of Computer Science and Technology, Dalian University of Technology, Chuangxinyuan Building, No.2 Linggong Road, Ganjingzi District, Dalian, 116024, Liaoning, China
- Chengkun Lang: School of Computer Science and Technology, Dalian University of Technology, Chuangxinyuan Building, No.2 Linggong Road, Ganjingzi District, Dalian, 116024, Liaoning, China
- Zhuang Liu: School of Computer Science and Technology, Dalian University of Technology, Chuangxinyuan Building, No.2 Linggong Road, Ganjingzi District, Dalian, 116024, Liaoning, China
- Bizun Lei: School of Computer Science and Technology, Dalian University of Technology, Chuangxinyuan Building, No.2 Linggong Road, Ganjingzi District, Dalian, 116024, Liaoning, China
6. Jose JM, Yilmaz E, Magalhães J, Castells P, Ferro N, Silva MJ, Martins F. MedLinker: Medical Entity Linking with Neural Representations and Dictionary Matching. Lecture Notes in Computer Science 2020. [PMCID: PMC7148021 DOI: 10.1007/978-3-030-45442-5_29] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022]
Abstract
Progress in the field of Natural Language Processing (NLP) has been closely followed by applications in the medical domain. Recent advancements in Neural Language Models (NLMs) have transformed the field and are currently motivating numerous works exploring their application in different domains. In this paper, we explore how NLMs can be used for Medical Entity Linking with the recently introduced MedMentions dataset, which presents two major challenges: (1) a large target ontology of over 2M concepts, and (2) low overlap between concepts in train, validation and test sets. We introduce a solution, MedLinker, that addresses these issues by leveraging specialized NLMs with Approximate Dictionary Matching, and show that it performs competitively on semantic type linking, while improving the state-of-the-art on the more fine-grained task of concept linking (+4 F1 on MedMentions main task).
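Approximate dictionary matching, as mentioned in this abstract, links a mention to the closest dictionary entry even when the strings differ slightly. The abstract does not specify how MedLinker measures closeness; the sketch below swaps in character-trigram Jaccard similarity as one simple possibility. The dictionary entries, the placeholder concept IDs, the function names and the 0.5 threshold are all hypothetical.

```python
from __future__ import annotations


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def approximate_lookup(
    mention: str, dictionary: dict[str, str], threshold: float = 0.5
) -> tuple[str, float] | None:
    """Return (concept_id, similarity) for the best entry, or None below threshold."""
    mention_grams = char_ngrams(mention)
    scored = [
        (jaccard(mention_grams, char_ngrams(name)), concept_id)
        for name, concept_id in dictionary.items()
    ]
    if not scored:
        return None
    best_score, best_id = max(scored)
    return (best_id, best_score) if best_score >= threshold else None


# Placeholder concept IDs, not real ontology identifiers.
toy_dictionary = {
    "myocardial infarction": "C-EX-1",
    "diabetes mellitus": "C-EX-2",
}

print(approximate_lookup("myocardial infarctions", toy_dictionary))
print(approximate_lookup("xyz", toy_dictionary))
```

Scanning every entry is only practical for a toy dictionary; with a target ontology of over 2M concepts, as in MedMentions, an index over the n-grams would be needed to keep lookups fast.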