1
Attana F, Kim S, Spencer J, Iorga BI, Docquier JD, Rossolini GM, Perilli M, Amicosante G, Vila AJ, Vakulenko SB, Mobashery S, Bradford P, Bush K, Partridge SR, Hujer AM, Hujer KM, Bonomo RA, Haider S. SAND: a comprehensive annotation of class D β-lactamases using structural alignment-based numbering. Antimicrob Agents Chemother 2025:e0015025. [PMID: 40422930 DOI: 10.1128/aac.00150-25] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2025] [Accepted: 03/29/2025] [Indexed: 05/28/2025] Open
Abstract
Class D β-lactamases are a diverse group of enzymes that contribute to antibiotic resistance by inactivating β-lactam antibiotics. Examination of class D β-lactamases has evolved significantly over the years, with advancements in molecular biology and structural analysis providing deeper insights into their mechanisms of action and variation in specificity. However, one of the challenges in the field is the inconsistent residue numbering and secondary structure annotation across different studies, which complicates the comparison and interpretation of data. To address this, we propose SAND, a standardized naming system for both residues and secondary structure elements, based on a comprehensive structural alignment of all documented sequences and experimentally obtained crystal structures of class D β-lactamases. This unified framework will streamline cross-study comparisons and enhance data interpretation. Moreover, the standardized framework will enable AI-driven natural language processing (NLP) techniques to efficiently mine and compile relevant data from scientific literature, speeding up the discovery process and contributing to more rapid advancements in β-lactamase research.
Affiliation(s)
- Fedaa Attana
  - UCL School of Pharmacy, University College London, London, United Kingdom
- Soobin Kim
  - UCL School of Pharmacy, University College London, London, United Kingdom
- James Spencer
  - School of Cellular and Molecular Medicine, University of Bristol, Bristol, United Kingdom
- Bogdan I Iorga
  - Institut de Chimie des Substances Naturelles, Université Paris-Saclay, Gif-sur-Yvette, France
- Gian Maria Rossolini
  - Department of Experimental and Clinical Medicine, University of Florence, Florence, Italy
- Mariagrazia Perilli
  - Department of Biotechnological and Applied Clinical Sciences, University of L'Aquila, L'Aquila, Italy
- Gianfranco Amicosante
  - Department of Biotechnological and Applied Clinical Sciences, University of L'Aquila, L'Aquila, Italy
- Alejandro J Vila
  - Instituto de Biología Molecular y Celular de Rosario (CONICET IBR-UNR), Rosario, Argentina
- Sergei B Vakulenko
  - Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana, USA
- Shahriar Mobashery
  - Department of Chemistry and Biochemistry, University of Notre Dame, Notre Dame, Indiana, USA
- Karen Bush
  - Department of Biology, Indiana University, Bloomington, Indiana, USA
- Sally R Partridge
  - Sydney School of Medicine, The University of Sydney, Sydney, New South Wales, Australia
- Andrea M Hujer
  - Department of Medicine, Case Western Reserve University School of Medicine, Cleveland, Ohio, USA
  - Louis Stokes Cleveland Department of Veterans Affairs Medical Center, Cleveland, Ohio, USA
- Kristine M Hujer
  - Department of Medicine, Case Western Reserve University School of Medicine, Cleveland, Ohio, USA
  - Louis Stokes Cleveland Department of Veterans Affairs Medical Center, Cleveland, Ohio, USA
- Robert A Bonomo
  - Department of Medicine, Case Western Reserve University School of Medicine, Cleveland, Ohio, USA
  - Louis Stokes Cleveland Department of Veterans Affairs Medical Center, Cleveland, Ohio, USA
  - Department of Molecular Biology and Microbiology, Case Western Reserve University School of Medicine, Cleveland, Ohio, USA
  - Departments of Pharmacology, Biochemistry, and Proteomics and Bioinformatics, Case Western Reserve University School of Medicine, Cleveland, Ohio, USA
  - CWRU-Cleveland VAMC Center for Antimicrobial Resistance and Epidemiology (Case VA CARES), Cleveland, Ohio, USA
- Shozeb Haider
  - UCL School of Pharmacy, University College London, London, United Kingdom
  - University of Tabuk (PFSCBR), Tabuk, Saudi Arabia
  - UCL Centre for Advanced Research in Computing, University College London, London, United Kingdom
2
Niyonkuru E, Caufield JH, Carmody LC, Gargano MA, Toro S, Whetzel PL, Blau H, Gomez MS, Casiraghi E, Chimirri L, Reese JT, Valentini G, Haendel MA, Mungall CJ, Robinson PN. Leveraging Generative AI to Accelerate Biocuration of Medical Actions for Rare Disease. medRxiv 2024:2024.08.22.24310814. [PMID: 39228707 PMCID: PMC11370550 DOI: 10.1101/2024.08.22.24310814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024]
Abstract
Structured representations of clinical data can support computational analysis of individuals and cohorts, and ontologies representing disease entities and phenotypic abnormalities are now commonly used for translational research. The Medical Action Ontology (MAxO) provides a computational representation of treatments and other actions taken for the clinical management of patients. Currently, manual biocuration is used to assign MAxO terms to rare diseases, enabling clinical management of rare diseases to be described computationally for use in clinical decision support and mechanism discovery. However, it is challenging to scale manual curation to comprehensively capture information about medical actions for the more than 10,000 rare diseases. We present AutoMAxO, a semi-automated workflow that leverages Large Language Models (LLMs) to streamline MAxO biocuration for rare diseases. AutoMAxO first uses LLMs to retrieve candidate curations from abstracts of relevant publications. Next, the candidate curations are matched to ontology terms from MAxO, Human Phenotype Ontology (HPO), and MONDO disease ontology via a combination of LLMs and post-processing techniques. Finally, the matched terms are presented in a structured form to a human curator for approval. We used this approach to process 4,918 unique medical abstracts and identified annotations for 21 rare genetic diseases. We extracted 18,631 candidate disease-treatment curations, 538 of which were confirmed and transferred to the MAxO annotation dataset. The results of this project underscore the potential of generative AI to accelerate precision medicine by enabling a robust and comprehensive curation of the primary literature to represent information about diseases and procedures in a structured fashion. Although we focused on MAxO in this project, similar approaches could be taken for other biomedical curation tasks.
Affiliation(s)
- Enock Niyonkuru
  - Trinity College, Hartford, CT, USA
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- J Harry Caufield
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Leigh C Carmody
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Michael A Gargano
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Sabrina Toro
  - Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Patricia L Whetzel
  - Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Hannah Blau
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Mauricio Soto Gomez
  - AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, 20133, Italy
- Elena Casiraghi
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, 20133, Italy
- Leonardo Chimirri
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany
- Justin T Reese
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Giorgio Valentini
  - AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, 20133, Italy
- Melissa A Haendel
  - Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Christopher J Mungall
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany
3
Pu J, Yu Y, Liu Y, Wang D, Gui S, Zhong X, Chen W, Chen X, Chen Y, Chen X, Qiao R, Jiang Y, Zhang H, Fan L, Ren Y, Chen X, Wang H, Xie P. ProMENDA: an updated resource for proteomic and metabolomic characterization in depression. Transl Psychiatry 2024; 14:229. [PMID: 38816410 PMCID: PMC11139925 DOI: 10.1038/s41398-024-02948-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/24/2023] [Revised: 05/15/2024] [Accepted: 05/17/2024] [Indexed: 06/01/2024] Open
Abstract
Depression is a prevalent mental disorder with a complex biological mechanism. Following the rapid development of systems biology technology, a growing number of studies have applied proteomics and metabolomics to explore the molecular profiles of depression. However, a standardized resource facilitating the identification and annotation of the available knowledge from these scattered studies associated with depression is currently lacking. This study presents ProMENDA, an upgraded resource that provides a platform for manual annotation of candidate proteins and metabolites linked to depression. Following the establishment of the protein dataset and the update of the metabolite dataset, the ProMENDA database was developed as a major extension of its initial release. A multi-faceted annotation scheme was employed to provide comprehensive knowledge of the molecules and studies. A new web interface was also developed to improve the user experience. The ProMENDA database now contains 43,366 molecular entries, comprising 20,847 protein entries and 22,519 metabolite entries, which were manually curated from 1,370 human, rat, mouse, and non-human primate studies. This represents a significant increase (more than 7-fold) in molecular entries compared to the initial release. To demonstrate the usage of ProMENDA, a case study identifying consistently reported proteins and metabolites in the brains of animal models of depression was presented. Overall, ProMENDA is a comprehensive resource that offers a panoramic view of proteomic and metabolomic knowledge in depression. ProMENDA is freely available at https://menda.cqmu.edu.cn.
Affiliation(s)
- Juncai Pu
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Yue Yu
  - Department of Health Sciences Research, Mayo Clinic, MN, 55901, USA
- Yiyun Liu
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Dongfang Wang
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Siwen Gui
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Xiaogang Zhong
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Weiyi Chen
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Xiaopeng Chen
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Yue Chen
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Xiang Chen
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Renjie Qiao
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Yanyi Jiang
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Hanping Zhang
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Li Fan
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Yi Ren
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Xiangyu Chen
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Haiyang Wang
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
- Peng Xie
  - Department of Neurology, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - NHC Key Laboratory of Diagnosis and Treatment on Brain Functional Diseases, The First Affiliated Hospital of Chongqing Medical University, Chongqing, 400016, China
  - The Jinfeng Laboratory, Chongqing, 401336, China
  - Chongqing Institute for Brain and Intelligence, Chongqing, 400072, China
4
Papadimitriou S, Gravel B, Nachtegael C, De Baere E, Loeys B, Vikkula M, Smits G, Lenaerts T. Toward reporting standards for the pathogenicity of variant combinations involved in multilocus/oligogenic diseases. HGG Adv 2022; 4:100165. [PMID: 36578772 PMCID: PMC9791921 DOI: 10.1016/j.xhgg.2022.100165] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 12/03/2022] Open
Abstract
Although standards and guidelines for the interpretation of variants identified in genes that cause Mendelian disorders have been developed, this is not the case for more complex genetic models including variant combinations in multiple genes. During a large curation process conducted on 318 research articles presenting oligogenic variant combinations, we encountered several recurring issues concerning their proper reporting and pathogenicity assessment. These mainly concern the absence of strong evidence that refutes a monogenic model and the lack of a proper genetic and functional assessment of the joint effect of the involved variants. With the increasing accumulation of such cases, it has become essential to develop standards and guidelines on how these oligogenic/multilocus variant combinations should be interpreted, validated, and reported in order to provide high-quality data and supporting evidence to the scientific community.
Affiliation(s)
- Sofia Papadimitriou (corresponding author)
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, 1050 Brussels, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, 1050 Brussels, Belgium
  - Artificial Intelligence Laboratory, Vrije Universiteit Brussel, 1050 Brussels, Belgium
- Barbara Gravel
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, 1050 Brussels, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, 1050 Brussels, Belgium
  - Artificial Intelligence Laboratory, Vrije Universiteit Brussel, 1050 Brussels, Belgium
- Charlotte Nachtegael
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, 1050 Brussels, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, 1050 Brussels, Belgium
- Elfride De Baere
  - Center for Medical Genetics, Ghent University Hospital, Department of Biomolecular Medicine, Ghent University, 9000 Ghent, Belgium
- Bart Loeys
  - Center for Medical Genetics, Antwerp University Hospital/University of Antwerp, 2650 Antwerp, Belgium
- Miikka Vikkula
  - Human Molecular Genetics, de Duve Institute, UCLouvain, Brussels, Belgium
- Guillaume Smits
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, 1050 Brussels, Belgium
  - Center of Human Genetics, Hôpital Erasme, Université Libre de Bruxelles, 1070 Brussels, Belgium
  - Hôpital Universitaire des Enfants Reine Fabiola, Université Libre de Bruxelles, 1020 Brussels, Belgium
- Tom Lenaerts (corresponding author)
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, 1050 Brussels, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, 1050 Brussels, Belgium
  - Artificial Intelligence Laboratory, Vrije Universiteit Brussel, 1050 Brussels, Belgium
5
Peng J, Xu D, Lee R, Xu S, Zhou Y, Wang K. Expediting knowledge acquisition by a web framework for Knowledge Graph Exploration and Visualization (KGEV): case studies on COVID-19 and Human Phenotype Ontology. BMC Med Inform Decis Mak 2022; 22:147. [PMID: 35655307 PMCID: PMC9161770 DOI: 10.1186/s12911-022-01848-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2022] [Accepted: 04/11/2022] [Indexed: 11/10/2022] Open
Abstract
Background
Knowledge graphs (KGs) serve as a convenient framework for structuring knowledge. A number of computational methods have been developed to generate KGs from biomedical literature and use them for downstream tasks such as link prediction and question answering. However, there is a lack of computational tools or web frameworks to support the exploration and visualization of the KGs themselves, which would facilitate interactive knowledge discovery and formulation of novel biological hypotheses.
Method
We developed a web framework for Knowledge Graph Exploration and Visualization (KGEV), to construct and visualize KGs in five stages: triple extraction, triple filtration, metadata preparation, knowledge integration, and graph database preparation. The application has convenient user interface tools, such as node and edge search and filtering, data source filtering, neighborhood retrieval, and shortest path calculation, that work by querying a backend graph database. Unlike other KGs, our framework allows fast retrieval of relevant texts supporting the relationships in the KG, thus allowing human reviewers to judge the reliability of the knowledge extracted.
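As a rough illustration of two of the query tools named above (neighborhood retrieval and shortest path calculation), the following Python sketch works over a handful of extracted triples. It is not the KGEV implementation: the abstract says KGEV queries a backend graph database, whereas this sketch assumes a small in-memory adjacency map, and the triples, node names, and function names are all hypothetical.

```python
from collections import deque

# Hypothetical (subject, relation, object) triples, invented for illustration;
# they are not taken from the KGEV COVID-19 knowledge graph.
TRIPLES = [
    ("gene_A", "associated_with", "disease_X"),
    ("drug_B", "targets", "gene_A"),
    ("disease_X", "has_phenotype", "phenotype_Y"),
    ("drug_C", "treats", "disease_X"),
]


def build_adjacency(triples):
    """Index triples as an undirected map: node -> {neighbor: relation}."""
    adjacency = {}
    for subject, relation, obj in triples:
        adjacency.setdefault(subject, {})[obj] = relation
        adjacency.setdefault(obj, {})[subject] = relation
    return adjacency


def neighborhood(adjacency, node):
    """Return the sorted direct neighbors of a node (empty list if unknown)."""
    return sorted(adjacency.get(node, {}))


def shortest_path(adjacency, start, goal):
    """Breadth-first search; returns the node list of a shortest path, or None."""
    if start not in adjacency or goal not in adjacency:
        return None
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            path = []
            while current is not None:
                path.append(current)
                current = previous[current]
            return path[::-1]
        for neighbor in adjacency[current]:
            if neighbor not in previous:
                previous[neighbor] = current
                queue.append(neighbor)
    return None


ADJACENCY = build_adjacency(TRIPLES)
```

Calling `shortest_path(ADJACENCY, "drug_B", "phenotype_Y")` walks from the drug through its target gene and the associated disease to the phenotype. In a deployed system the same questions would more plausibly be posed as queries to the graph database rather than computed in application memory.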
Results
We demonstrated a case study of using the KGEV framework to perform research on COVID-19. The COVID-19 pandemic resulted in an explosion of relevant literature, making it challenging to make full use of the vast and heterogeneous sources of information. We generated a COVID-19 KG with heterogeneous information, including literature information from the CORD-19 dataset, as well as other existing knowledge from eight data sources. We showed the utility of KGEV in three intuitive case studies to explore and query knowledge on COVID-19. A demo of this web application can be accessed at http://covid19nlp.wglab.org. Finally, we also demonstrated a turn-key adaptation of the KGEV framework to study clinical phenotypic presentation of human diseases by Human Phenotype Ontology (HPO), illustrating the versatility of the framework.
Conclusion
In an era of literature explosion, the KGEV framework can be applied to many emerging diseases to support structured navigation of the vast amount of newly published biomedical literature and other existing biological knowledge in various databases. It can also be used as a general-purpose tool to explore and query gene-phenotype-disease-drug relationships interactively.
6
Boosting biomedical document classification through the use of domain entity recognizers and semantic ontologies for document representation: The case of gluten bibliome. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.10.100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
7
Beckman MF, Brennan EJ, Igba CK, Brennan MT, Mougeot FB, Mougeot JLC. A Computational Text Mining-Guided Meta-Analysis Approach to Identify Potential Xerostomia Drug Targets. J Clin Med 2022; 11:1442. [PMID: 35268532 PMCID: PMC8911392 DOI: 10.3390/jcm11051442] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/03/2022] [Revised: 03/02/2022] [Accepted: 03/03/2022] [Indexed: 02/01/2023] Open
Abstract
Xerostomia (subjective complaint of dry mouth) is commonly associated with salivary gland hypofunction. Molecular mechanisms associated with xerostomia pathobiology are poorly understood, thus hampering drug development. Our objectives were to (i) use text-mining tools to investigate xerostomia and dry mouth concepts, (ii) identify associated molecular interactions involving genes as candidate drug targets, and (iii) determine how drugs currently used in clinical trials may impact these genes and associated pathways. PubMed and PubMed Central were used to identify search terms associated with xerostomia and/or dry mouth. Search terms were queried in pubmed2ensembl. Protein-protein interaction (PPI) networks were determined using the gene/protein network visualization program search tool for recurring instances of neighboring genes (STRING). A similar program, Cytoscape, was used to determine PPIs of overlapping gene sets. The drug-gene interaction database (DGIdb) and the clinicaltrials.gov database were used to identify potential drug targets from the xerostomia/dry mouth PPI gene set. We identified 64 search terms in common between xerostomia and dry mouth. STRING confirmed PPIs between identified genes (CL = 0.90). Cytoscape analysis determined 58 shared genes, with cytokine-cytokine receptor interaction representing the most significant pathway (p = 1.29 × 10⁻²³) found in the Kyoto encyclopedia of genes and genomes (KEGG). Fifty-four genes in common had drug interactions, per DGIdb analysis. Eighteen drugs, targeting the xerostomia/dry mouth PPI network, have been evaluated for xerostomia, head and neck cancer oral complications, and Sjögren's Syndrome. The PPI network genes IL6R, EGFR, NFKB1, MPO, and TNFSF13B constitute a possible biomarker signature of xerostomia. Validation of the candidate biomarkers is necessary to better stratify patients at the genetic and molecular levels to facilitate drug development or to monitor response to treatment.
Affiliation(s)
- Farah B. Mougeot
  - Department of Oral Medicine, Carolinas Medical Center, Atrium Health, Charlotte, NC 28203, USA; (M.F.B.); (E.J.B.); (C.K.I.); (M.T.B.)
- Jean-Luc C. Mougeot
  - Department of Oral Medicine, Carolinas Medical Center, Atrium Health, Charlotte, NC 28203, USA; (M.F.B.); (E.J.B.); (C.K.I.); (M.T.B.)
8
Conceição SIR, Couto FM. Text Mining for Building Biomedical Networks Using Cancer as a Case Study. Biomolecules 2021; 11:1430. [PMID: 34680062 PMCID: PMC8533101 DOI: 10.3390/biom11101430] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 09/09/2021] [Revised: 09/24/2021] [Accepted: 09/27/2021] [Indexed: 12/15/2022] Open
Abstract
In the assembly of biological networks it is important to provide reliable interactions in an effort to have the most accurate possible representation of real-life systems. Commonly, the data used to build a network comes from diverse high-throughput assays; however, most of the interaction data is available through scientific literature. This has become a challenge with the notable increase in scientific literature being published, as it is hard for human curators to track all recent discoveries without using efficient tools to help them identify these interactions in an automatic way. This can be overcome by using text mining approaches, which are capable of extracting knowledge from scientific documents. One of the most important tasks in text mining for biological network building is relation extraction, which identifies relations between the entities of interest. Many interaction databases already use text mining systems, and the development of these tools will lead to more reliable networks, as well as the possibility to personalize the networks by selecting the desired relations. This review will focus on different approaches of automatic information extraction from biomedical text that can be used to enhance existing networks or create new ones, such as deep learning state-of-the-art approaches, focusing on cancer disease as a case-study.
9
Hobbs ET, Goralski SM, Mitchell A, Simpson A, Leka D, Kotey E, Sekira M, Munro JB, Nadendla S, Jackson R, Gonzalez-Aguirre A, Krallinger M, Giglio M, Erill I. ECO-CollecTF: A Corpus of Annotated Evidence-Based Assertions in Biomedical Manuscripts. Front Res Metr Anal 2021; 6:674205. [PMID: 34327299 PMCID: PMC8313968 DOI: 10.3389/frma.2021.674205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/28/2021] [Accepted: 06/28/2021] [Indexed: 11/20/2022] Open
Abstract
Analysis of high-throughput experiments in the life sciences frequently relies upon standardized information about genes, gene products, and other biological entities. To provide this information, expert curators are increasingly relying on text mining tools to identify, extract and harmonize statements from biomedical journal articles that discuss findings of interest. For determining reliability of the statements, curators need the evidence used by the authors to support their assertions. It is important to annotate the evidence directly used by authors to qualify their findings rather than simply annotating mentions of experimental methods without the context of what findings they support. Text mining tools require tuning and adaptation to achieve accurate performance. Many annotated corpora exist to enable developing and tuning text mining tools; however, none currently provides annotations of evidence based on the extensive and widely used Evidence and Conclusion Ontology. We present the ECO-CollecTF corpus, a novel, freely available, biomedical corpus of 84 documents that captures high-quality, evidence-based statements annotated with the Evidence and Conclusion Ontology.
Affiliation(s)
- Elizabeth T Hobbs
  - Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Stephen M Goralski
  - Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Ashley Mitchell
  - Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Andrew Simpson
  - Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Dorjan Leka
  - Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Emmanuel Kotey
  - Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- Matt Sekira
  - Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
- James B Munro
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Suvarna Nadendla
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Rebecca Jackson
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Martin Krallinger
  - Barcelona Supercomputing Center (BSC), Barcelona, Spain
  - Centro Nacional de Investigaciones Oncológicas (CNIO), Madrid, Spain
- Michelle Giglio
  - Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, United States
- Ivan Erill
  - Department of Biological Sciences, University of Maryland Baltimore County, Baltimore, MD, United States
10
Islamaj R, Wei CH, Cissel D, Miliaras N, Printseva O, Rodionov O, Sekiya K, Ward J, Lu Z. NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition. J Biomed Inform 2021; 118:103779. [PMID: 33839304 PMCID: PMC11037554 DOI: 10.1016/j.jbi.2021.103779] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/04/2021] [Revised: 03/14/2021] [Accepted: 04/05/2021] [Indexed: 10/21/2022]
Abstract
The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. While current methods for tagging gene entities have been developed for biomedical literature, their performance on species other than human is substantially lower due to the lack of annotation data. We therefore present the NLM-Gene corpus, a high-quality manually annotated corpus for genes developed at the US National Library of Medicine (NLM), covering ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per document, and a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) when compared to previous gene annotation corpora. NLM-Gene consists of 550 PubMed abstracts from 156 biomedical journals, doubly annotated by six experienced NLM indexers, randomly paired for each document to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. This gold-standard corpus can serve as a benchmark to develop and test new gene text mining algorithms. Using this new resource, we have developed a new gene finding algorithm based on deep learning which improved on both the precision and recall of existing tools. The NLM-Gene annotated corpus is freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/lu/NLMGene. We have also applied this tool to the entire PubMed/PMC, with the results freely accessible through our web-based tool PubTator (www.ncbi.nlm.nih.gov/research/pubtator).
Affiliation(s)
- Rezarta Islamaj
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Chih-Hsuan Wei
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - David Cissel
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Nicholas Miliaras
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Olga Printseva
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Oleg Rodionov
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Keiko Sekiya
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Janice Ward
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Zhiyong Lu
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| |
|
11
|
Bayer S, Clark C, Dang O, Aberdeen J, Brajovic S, Swank K, Hirschman L, Ball R. ADE Eval: An Evaluation of Text Processing Systems for Adverse Event Extraction from Drug Labels for Pharmacovigilance. Drug Saf 2021; 44:83-94. [PMID: 33006728 PMCID: PMC7813736 DOI: 10.1007/s40264-020-00996-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Accepted: 09/02/2020] [Indexed: 12/05/2022]
Abstract
INTRODUCTION: The US FDA is interested in a tool that would enable pharmacovigilance safety evaluators to automate the identification of adverse drug events (ADEs) mentioned in FDA prescribing information. The MITRE Corporation (MITRE) and the FDA organized a shared task, Adverse Drug Event Evaluation (ADE Eval), to determine whether the performance of algorithms currently used for natural language processing (NLP) might be good enough for real-world use. OBJECTIVE: ADE Eval was conducted to evaluate a range of NLP techniques for identifying ADEs mentioned in publicly available FDA-approved drug labels (package inserts). It was designed specifically to reflect pharmacovigilance practices within the FDA and model possible pharmacovigilance use cases. METHODS: Pharmacovigilance-specific annotation guidelines and annotated corpora were created. Two metrics modeled the experiences of FDA safety evaluators: one measured the ability of an algorithm to identify correct Medical Dictionary for Regulatory Activities (MedDRA®) terms for the text from the annotated corpora, and the other assessed the quality of evidence extracted from the corpora to support the selected MedDRA® term by measuring the portion of annotated text an algorithm correctly identified. A third metric assessed the cost of correcting system output for subsequent training (averaged, weighted F1-measure for mention finding). RESULTS: In total, 13 teams submitted 23 runs: the top MedDRA® coding F1-measure was 0.79, the top quality score was 0.96, and the top mention-finding F1-measure was 0.89. CONCLUSION: While NLP techniques do not perform at levels that would allow them to be used without intervention, it is now worthwhile exploring making NLP outputs available in human pharmacovigilance workflows.
Affiliation(s)
- Samuel Bayer
- The MITRE Corporation, 202 Burlington Rd, Bedford, MA 01730 USA
| | - Cheryl Clark
- The MITRE Corporation, 202 Burlington Rd, Bedford, MA 01730 USA
| | - Oanh Dang
- Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD USA
| | - John Aberdeen
- The MITRE Corporation, 202 Burlington Rd, Bedford, MA 01730 USA
| | - Sonja Brajovic
- Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD USA
| | - Kimberley Swank
- Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD USA
| | | | - Robert Ball
- Office of Surveillance and Epidemiology, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, MD USA
| |
|
12
|
Gobeill J, Caucheteur D, Michel PA, Mottin L, Pasche E, Ruch P. SIB Literature Services: RESTful customizable search engines in biomedical literature, enriched with automatically mapped biomedical concepts. Nucleic Acids Res 2020; 48:W12-W16. [PMID: 32379317 PMCID: PMC7319474 DOI: 10.1093/nar/gkaa328] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 03/10/2020] [Revised: 04/09/2020] [Accepted: 04/22/2020] [Indexed: 01/05/2023] Open
Abstract
Thanks to recent efforts by the text mining community, biocurators now have access to plenty of good tools and Web interfaces for identifying and visualizing biomedical entities in literature. Yet, many of these systems start with a PubMed query, which is limited by strong Boolean constraints. Some semantic search engines exploit entities for Information Retrieval, and/or deliver relevance-based ranked results. Yet, they are not designed to support a specific curation workflow, and allow very limited control over the search process. The Swiss Institute of Bioinformatics Literature Services (SIBiLS) provide personalized Information Retrieval in the biological literature. Indeed, SIBiLS allow fully customizable search in semantically enriched contents, based on keywords and/or mapped biomedical entities from a growing set of standardized and legacy vocabularies. The services have been used and favourably evaluated to assist the curation of genes and gene products, by delivering customized literature triage engines to different curation teams. SIBiLS (https://candy.hesge.ch/SIBiLS) are freely accessible via REST APIs and are ready to empower any curation workflow, built on modern technologies scalable with big data: MongoDB and Elasticsearch. They cover MEDLINE and PubMed Central Open Access, enriched with nearly 2 billion mapped biomedical entities, and are updated daily.
Affiliation(s)
- Julien Gobeill
- To whom correspondence should be addressed. Tel: +41 22 388 17 86; Fax: +41 22 546 97 38;
| | - Déborah Caucheteur
- BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
| | - Pierre-André Michel
- SIB Text Mining group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland
| | - Luc Mottin
- BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
| | - Emilie Pasche
- SIB Text Mining group, Swiss Institute of Bioinformatics, 1206 Geneva, Switzerland
- BiTeM group, Information Sciences, HES-SO / HEG Geneva, 1227 Carouge, Switzerland
| | - Patrick Ruch
- Correspondence may also be addressed to Patrick Ruch. Tel: +41 22 388 17 81; Fax: +41 22 546 97 38;
| |
|
13
|
Wei CH, Allot A, Leaman R, Lu Z. PubTator central: automated concept annotation for biomedical full text articles. Nucleic Acids Res 2019; 47:W587-W593. [PMID: 31114887 DOI: 10.1093/nar/gkz389] [Citation(s) in RCA: 228] [Impact Index Per Article: 45.6] [Received: 02/17/2019] [Revised: 04/08/2019] [Accepted: 04/30/2019] [Indexed: 11/12/2022] Open
Abstract
PubTator Central (https://www.ncbi.nlm.nih.gov/research/pubtator/) is a web service for viewing and retrieving bioconcept annotations in full text biomedical articles. PubTator Central (PTC) provides automated annotations from state-of-the-art text mining systems for genes/proteins, genetic variants, diseases, chemicals, species and cell lines, all available for immediate download. PTC annotates PubMed (29 million abstracts) and the PMC Text Mining subset (3 million full text articles). The new PTC web interface allows users to build full text document collections and visualize concept annotations in each document. Annotations are downloadable in multiple formats (XML, JSON and tab delimited) via the online interface, a RESTful web service and bulk FTP. Improved concept identification systems and a new disambiguation module based on deep learning increase annotation accuracy, and the new server-side architecture is significantly faster. PTC is synchronized with PubMed and PubMed Central, with new articles added daily. The original PubTator service has served annotated abstracts for ∼300 million requests, enabling third-party research in use cases such as biocuration support, gene prioritization, genetic disease analysis, and literature-based knowledge discovery. We demonstrate the full text results in PTC significantly increase biomedical concept coverage and anticipate this expansion will both enhance existing downstream applications and enable new use cases.
Affiliation(s)
- Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Alexis Allot
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| |
|
14
|
Teodoro D, Knafou J, Naderi N, Pasche E, Gobeill J, Arighi CN, Ruch P. UPCLASS: a deep learning-based classifier for UniProtKB entry publications. Database (Oxford) 2020; 2020:5822772. [PMID: 32367111 PMCID: PMC7198315 DOI: 10.1093/database/baaa026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 10/31/2019] [Revised: 02/19/2020] [Accepted: 03/11/2020] [Indexed: 12/20/2022]
Abstract
In the UniProt Knowledgebase (UniProtKB), publications providing evidence for a specific protein annotation entry are organized across different categories, such as function, interaction and expression, based on the type of data they contain. To provide a systematic way of categorizing computationally mapped bibliographies in UniProt, we investigate a convolutional neural network (CNN) model to classify publications with accession annotations according to UniProtKB categories. The main challenge of categorizing publications at the accession annotation level is that the same publication can be annotated with multiple proteins and thus be associated with different category sets according to the evidence provided for the protein. We propose a model that divides the document into parts containing and not containing evidence for the protein annotation. Then, we use these parts to create different feature sets for each accession and feed them to separate layers of the network. The CNN model achieved a micro F1-score of 0.72 and a macro F1-score of 0.62, outperforming baseline models based on logistic regression and support vector machine by up to 22 and 18 percentage points, respectively. We believe that such an approach could be used to systematically categorize the computationally mapped bibliography in UniProtKB, which represents a significant set of the publications, and help curators to decide whether a publication is relevant for further curation for a protein accession. Database URL: https://goldorak.hesge.ch/bioexpclass/upclass/.
Affiliation(s)
- Douglas Teodoro
- Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Julien Knafou
- Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Nona Naderi
- Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Emilie Pasche
- Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Julien Gobeill
- Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Cecilia N Arighi
- Center of Bioinformatics and Computational Biology, 15 Innovation Way, 19711, Department of Computer and Information Sciences, University of Delaware, Newark, DE, USA
| | - Patrick Ruch
- Geneva School of Business Administration, CH-1227, University of Applied Sciences and Arts Western Switzerland, HES-SO, Geneva, Switzerland; Text Mining Group, Rue Michel-Servet 1, CH-1206, SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
| |
|
15
|
Chen Q, Lee K, Yan S, Kim S, Wei CH, Lu Z. BioConceptVec: Creating and evaluating literature-based biomedical concept embeddings on a large scale. PLoS Comput Biol 2020; 16:e1007617. [PMID: 32324731 PMCID: PMC7237030 DOI: 10.1371/journal.pcbi.1007617] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 06/06/2019] [Revised: 05/19/2020] [Accepted: 12/19/2019] [Indexed: 12/14/2022] Open
Abstract
A massive number of biological entities, such as genes and mutations, are mentioned in the biomedical literature. The capturing of the semantic relatedness of biological entities is vital to many biological applications, such as protein-protein interaction prediction and literature-based discovery. Concept embeddings, which involve the learning of vector representations of concepts using machine learning models, have been employed to capture the semantics of concepts. To develop concept embeddings, named-entity recognition (NER) tools are first used to identify and normalize concepts from the literature, and then different machine learning models are used to train the embeddings. Despite multiple attempts, existing biomedical concept embeddings generally suffer from suboptimal NER tools, small-scale evaluation, and limited availability. In response, we employed high-performance machine learning-based NER tools for concept recognition and trained our concept embeddings, BioConceptVec, via four different machine learning models on ~30 million PubMed abstracts. BioConceptVec covers over 400,000 biomedical concepts mentioned in the literature and is one of the largest among the publicly available biomedical concept embeddings to date. To evaluate the validity and utility of BioConceptVec, we performed two intrinsic evaluations (identifying related concepts based on drug-gene and gene-gene interactions) and two extrinsic evaluations (protein-protein interaction prediction and drug-drug interaction extraction), collectively using over 25 million instances from nine independent datasets (17 million instances from six intrinsic evaluation tasks and 8 million instances from three extrinsic evaluation tasks), which is, to the best of our knowledge, by far the most comprehensive evaluation. The intrinsic evaluation results demonstrate that BioConceptVec consistently has, by a large margin, better performance than existing concept embeddings in identifying similar and related concepts. More importantly, the extrinsic evaluation results demonstrate that using BioConceptVec with advanced deep learning models can significantly improve performance in downstream bioinformatics studies and biomedical text-mining applications. Our BioConceptVec embeddings and benchmarking datasets are publicly available at https://github.com/ncbi-nlp/BioConceptVec.
Capturing the semantics of related biological concepts, such as genes and mutations, is of significant importance to many research tasks in computational biology such as protein-protein interaction detection, gene-drug association prediction, and biomedical literature-based discovery. Here, we propose to leverage state-of-the-art text mining tools and machine learning models to learn the semantics via vector representations (a.k.a. embeddings) of over 400,000 biological concepts mentioned across all PubMed abstracts. Our learned embeddings, namely BioConceptVec, can capture related concepts based on their surrounding contextual information in the literature, which is beyond exact term match or co-occurrence-based methods. BioConceptVec has been thoroughly evaluated in multiple bioinformatics tasks consisting of over 25 million instances from nine different biological datasets. The evaluation results demonstrate that BioConceptVec has better performance than existing methods in all tasks. Finally, BioConceptVec is made freely available to the research community and general public.
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Kyubum Lee
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Shankai Yan
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Sun Kim
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, Maryland, United States of America
| |
|
16
|
Arnaboldi V, Raciti D, Van Auken K, Chan JN, Müller HM, Sternberg PW. Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase. Database (Oxford) 2020; 2020:baaa006. [PMID: 32185395 PMCID: PMC7078066 DOI: 10.1093/database/baaa006] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 10/29/2019] [Revised: 01/08/2020] [Accepted: 01/14/2020] [Indexed: 01/17/2023]
Abstract
Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators need to follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods (e.g. support vector machines (SVMs)) to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information, rather than enter it de novo as the previous form required. With this new system, we lessen the burden for authors, while at the same time receiving valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving the authors the opportunity to contribute a more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to additional knowledgebases that would like to engage their user community in assisting with the curation. In the five months following the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received have greatly improved.
Affiliation(s)
- Valerio Arnaboldi
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Daniela Raciti
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Kimberly Van Auken
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Juancarlos N Chan
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Hans-Michael Müller
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| | - Paul W Sternberg
- Division of Biology and Biological Engineering 156–29, California Institute of Technology, 1200 E California Blvd, Pasadena, CA 91125, USA
| |
|
17
|
Tang YA, Pichler K, Füllgrabe A, Lomax J, Malone J, Munoz-Torres MC, Vasant DV, Williams E, Haendel M. Ten quick tips for biocuration. PLoS Comput Biol 2019; 15:e1006906. [PMID: 31048830 PMCID: PMC6497217 DOI: 10.1371/journal.pcbi.1006906] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 11/29/2022] Open
Affiliation(s)
- Y. Amy Tang
- Genestack Limited, Cambridge, Cambridgeshire, United Kingdom
| | - Klemens Pichler
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, United Kingdom
| | - Anja Füllgrabe
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, United Kingdom
| | - Jane Lomax
- SciBite Limited, BioData Innovation Centre, Hinxton, Cambridgeshire, United Kingdom
| | - James Malone
- SciBite Limited, BioData Innovation Centre, Hinxton, Cambridgeshire, United Kingdom
| | | | - Drashtti V. Vasant
- Bayer Business Services GmbH, BP Research and Development, Translational Sciences, Berlin, Germany
| | - Eleanor Williams
- Centre for Gene Regulation and Expression, School of Life Sciences, University of Dundee, Dundee, United Kingdom
- Genomics England, Queen Mary University of London, London, United Kingdom
| | - Melissa Haendel
- Linus Pauling Institute, Oregon State University, Corvallis, Oregon, United States of America
| |
|
18
|
Saqi M, Lysenko A, Guo YK, Tsunoda T, Auffray C. Navigating the disease landscape: knowledge representations for contextualizing molecular signatures. Brief Bioinform 2019; 20:609-623. [PMID: 29684165 PMCID: PMC6556902 DOI: 10.1093/bib/bby025] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 11/09/2017] [Revised: 02/05/2018] [Indexed: 12/14/2022] Open
Abstract
Large amounts of data emerging from experiments in molecular medicine are leading to the identification of molecular signatures associated with disease subtypes. The contextualization of these patterns is important for obtaining mechanistic insight into the aberrant processes associated with a disease, and this typically involves the integration of multiple heterogeneous types of data. In this review, we discuss knowledge representations that can be useful to explore the biological context of molecular signatures, in particular three main approaches, namely, pathway mapping approaches, molecular network centric approaches and approaches that represent biological statements as knowledge graphs. We discuss the utility of each of these paradigms, illustrate how they can be leveraged with selected practical examples and identify ongoing challenges for this field of research.
Affiliation(s)
- Mansoor Saqi
- Data Science Institute, Imperial College London, UK
| | - Artem Lysenko
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Yi-Ke Guo
- Data Science Institute, Imperial College London, UK
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan; CREST, JST, Tokyo, Japan; Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
| | - Charles Auffray
- European Institute for Systems Biology and Medicine, Lyon, France
| |
|
19
|
Labbé C, Grima N, Gautier T, Favier B, Byrne JA. Semi-automated fact-checking of nucleotide sequence reagents in biomedical research publications: The Seek & Blastn tool. PLoS One 2019; 14:e0213266. [PMID: 30822319 PMCID: PMC6396917 DOI: 10.1371/journal.pone.0213266] [Citation(s) in RCA: 31] [Impact Index Per Article: 5.2] [Received: 07/12/2018] [Accepted: 02/18/2019] [Indexed: 12/14/2022] Open
Abstract
Nucleotide sequence reagents are verifiable experimental reagents in biomedical publications, because their sequence identities can be independently verified and compared with associated text descriptors. We have previously reported that incorrectly identified nucleotide sequence reagents are characteristic of highly similar human gene knockdown studies, some of which have been retracted from the literature on account of possible research fraud. Because of the throughput limitations of manual verification of nucleotide sequences, we developed a semi-automated fact-checking tool, Seek & Blastn, to verify the targeting or non-targeting status of published nucleotide sequence reagents. From previously described and unknown corpora of 48 and 155 publications, respectively, Seek & Blastn correctly extracted 304/342 (88.9%) and 1066/1522 (70.0%) nucleotide sequences and a predicted targeting/non-targeting status. Seek & Blastn correctly predicted the targeting/non-targeting status of 293/304 (96.4%) and 988/1066 (92.7%) of the correctly extracted nucleotide sequences. A total of 38/39 (97.4%) or 31/79 (39.2%) Seek & Blastn predictions of incorrect nucleotide sequence reagent use were correct in the two literature corpora. Combined Seek & Blastn and manual analyses identified a list of 91 misidentified nucleotide sequence reagents, which could be built upon through future studies. In summary, incorrect nucleotide sequence reagents represent an under-recognized source of error within the biomedical literature, and fact-checking tools such as Seek & Blastn may help to identify papers and manuscripts affected by these errors.
Affiliation(s)
- Cyril Labbé
- Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, Grenoble, France
| | - Natalie Grima
- Molecular Oncology Laboratory, Children’s Cancer Research Unit, Kids Research, The Children’s Hospital at Westmead, Westmead, New South Wales, Australia
| | - Thierry Gautier
- INSERM U1209/ CNRS UMR 5309, Univ. Grenoble Alpes, Grenoble, France
| | - Bertrand Favier
- Univ. Grenoble Alpes, Team GREPI, Etablissement Français du Sang, La Tronche, France
| | - Jennifer A. Byrne
- Molecular Oncology Laboratory, Children’s Cancer Research Unit, Kids Research, The Children’s Hospital at Westmead, Westmead, New South Wales, Australia
- Discipline of Child and Adolescent Health, Faculty of Medicine and Health, The University of Sydney, Westmead, New South Wales, Australia
| |
|
20
|
Abstract
Recent advances in technology have led to the exponential growth of scientific literature in biomedical sciences. This rapid increase in information has surpassed the threshold for manual curation efforts, necessitating the use of text mining approaches in the field of life sciences. One such application of text mining is in fostering in silico drug discovery such as drug target screening, pharmacogenomics, adverse drug event detection, etc. This chapter serves as an introduction to the applications of various text mining approaches in drug discovery. It is divided into two parts with the first half as an overview of text mining in the biosciences. The second half of the chapter reviews strategies and methods for four unique applications of text mining in drug discovery.
Affiliation(s)
- Si Zheng
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Shazia Dharssi
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| | - Meng Wu
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Jiao Li
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD, USA
| |
|
21
|
Tawfik NS, Spruit MR. The SNPcurator: literature mining of enriched SNP-disease associations. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2018:4925332. [PMID: 29688369 PMCID: PMC5844215 DOI: 10.1093/database/bay020] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/18/2017] [Accepted: 02/05/2018] [Indexed: 01/08/2023]
Abstract
The uniqueness of each human genetic structure motivated the shift from the current practice of medicine to a more tailored one. This personalized medicine revolution would not be possible today without the genetics data collected from genome-wide association studies (GWASs) that investigate the relation between different phenotypic traits and single-nucleotide polymorphisms (SNPs). The huge increase in the literature publication space imposes a challenge on the conventional manual curation process which is becoming more and more expensive. This research aims at automatically extracting SNP associations of any given disease and their reported statistical significance (P-value) and odds ratio as well as cohort information such as size and ethnicity. Our evaluation illustrates that SNPcurator was able to replicate a large number of SNP-disease associations that were also reported in the NHGRI-EBI Catalog of published GWASs. SNPcurator was also tested by eight external genetics experts, who queried the system to examine diseases of their choice, and was found to be efficient and satisfactory. We conclude that the text-mining-based system has a great potential for helping researchers and scientists, especially in their preliminary genetics research. SNPcurator is publicly available at http://snpcurator.science.uu.nl/. Database URL: http://snpcurator.science.uu.nl/
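To make the extraction task described in this abstract concrete, here is a minimal Python sketch that pulls an rsID, a P-value, and an odds ratio out of a single sentence using regular expressions. This is not SNPcurator's actual method, which the abstract does not detail. The patterns, the function name `extract_association`, and the example sentence are all hypothetical and cover only simple phrasings.

```python
import re

# Hypothetical patterns for this sketch; real literature uses many more variants.
RSID = re.compile(r"\brs\d+\b")
P_VALUE = re.compile(
    r"\bP(?:-value)?\s*[=<]\s*(\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)",
    re.IGNORECASE,
)
ODDS_RATIO = re.compile(r"\b(?:OR|[Oo]dds\s+ratio)\s*[=:]\s*(\d+(?:\.\d+)?)")


def extract_association(sentence: str) -> dict:
    """Return the SNP identifiers, P-value, and odds ratio found in a sentence.

    Missing values are reported as None rather than guessed.
    """
    p_match = P_VALUE.search(sentence)
    or_match = ODDS_RATIO.search(sentence)
    return {
        "snps": RSID.findall(sentence),
        "p_value": float(p_match.group(1)) if p_match else None,
        "odds_ratio": float(or_match.group(1)) if or_match else None,
    }


if __name__ == "__main__":
    # Invented example sentence, not taken from any publication.
    example = "rs12345 was associated with asthma (OR = 1.42, P = 3.1e-8)."
    print(extract_association(example))
```

A production system would also need to link each value to the correct SNP and disease when a sentence mentions several, which simple pattern matching does not address.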
Affiliation(s)
- Noha S Tawfik
- Computer Engineering Department, College of Engineering, Arab Academy for Science, Technology, and Maritime Transport (AAST), Abukir, 1029 Alexandria, Egypt; Department of Information and Computing Sciences, Utrecht University, Princetonplein 5, 3584 CC Utrecht, The Netherlands
| | - Marco R Spruit
- Department of Information and Computing Sciences, Utrecht University, Princetonplein 5, 3584 CC Utrecht, The Netherlands
| |
|
22
|
Sahu SK, Anand A. Drug-drug interaction extraction from biomedical texts using long short-term memory network. J Biomed Inform 2018; 86:15-24. [DOI: 10.1016/j.jbi.2018.08.005] [Citation(s) in RCA: 85] [Impact Index Per Article: 12.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2018] [Accepted: 08/07/2018] [Indexed: 12/15/2022]
|
23
|
|
24
|
Using machine learning tools for protein database biocuration assistance. Sci Rep 2018; 8:10148. [PMID: 29977071 PMCID: PMC6033909 DOI: 10.1038/s41598-018-28330-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2017] [Accepted: 06/21/2018] [Indexed: 12/30/2022] Open
Abstract
Biocuration in the omics sciences has become paramount, as research in these fields rapidly evolves towards increasingly data-dependent models. As a result, the management of web-accessible publicly-available databases becomes a central task in biological knowledge dissemination. One relevant challenge for biocurators is the unambiguous identification of biological entities. In this study, we illustrate the adequacy of machine learning methods as biocuration assistance tools using a publicly available protein database as an example. This database contains information on G Protein-Coupled Receptors (GPCRs), which are part of eukaryotic cell membranes and relevant in cell communication as well as major drug targets in pharmacology. These receptors are characterized according to subtype labels. Previous analysis of this database provided evidence that some of the receptor sequences could be affected by a case of label noise, as they appeared to be too consistently misclassified by machine learning methods. Here, we extend our analysis to recent and quite substantially modified new versions of the database and reveal their now extremely accurate labeling using several machine learning models and different transformations of the unaligned sequences. These findings support the adequacy of our proposed method to identify problematic labeling cases as a tool for database biocuration.
|
25
|
Müller HM, Van Auken KM, Li Y, Sternberg PW. Textpresso Central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. BMC Bioinformatics 2018; 19:94. [PMID: 29523070 PMCID: PMC5845379 DOI: 10.1186/s12859-018-2103-8] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2017] [Accepted: 03/01/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The biomedical literature continues to grow at a rapid pace, making the challenge of knowledge retrieval and extraction ever greater. Tools that provide a means to search and mine the full text of literature thus represent an important way by which the efficiency of these processes can be improved. RESULTS We describe the next generation of the Textpresso information retrieval system, Textpresso Central (TPC). TPC builds on the strengths of the original system by expanding the full text corpus to include the PubMed Central Open Access Subset (PMC OA), as well as the WormBase C. elegans bibliography. In addition, TPC allows users to create a customized corpus by uploading and processing documents of their choosing. TPC is UIMA compliant, to facilitate compatibility with external processing modules, and takes advantage of Lucene indexing and search technology for efficient handling of millions of full text documents. Like Textpresso, TPC searches can be performed using keywords and/or categories (semantically related groups of terms), but to provide better context for interpreting and validating queries, search results may now be viewed as highlighted passages in the context of full text. To facilitate biocuration efforts, TPC also allows users to select text spans from the full text and annotate them, create customized curation forms for any data type, and send resulting annotations to external curation databases. As an example of such a curation form, we describe integration of TPC with the Noctua curation tool developed by the Gene Ontology (GO) Consortium. CONCLUSION Textpresso Central is an online literature search and curation platform that enables biocurators and biomedical researchers to search and mine the full text of literature by integrating keyword and category searches with viewing search results in the context of the full text. It also allows users to create customized curation interfaces, use those interfaces to make annotations linked to supporting evidence statements, and then send those annotations to any database in the world. Textpresso Central URL: http://www.textpresso.org/tpc.
Affiliation(s)
- H.-M. Müller
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
| | - K. M. Van Auken
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
| | - Y. Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
| | - P. W. Sternberg
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125 USA
| |
|
26
|
Singhal A, Srivastava J. Research dataset discovery from research publications using web context. WEB INTELLIGENCE 2017. [DOI: 10.3233/web-170354] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Affiliation(s)
- Ayush Singhal
- National Center of Biotechnology Information, National Institutes of Health, Bethesda, Maryland, USA
| | - Jaideep Srivastava
- Department of Computer Science and Engineering, University of Minnesota, Twin Cities, Minneapolis, Minnesota, USA
| |
|