1
Baldarelli RM, Smith CL, Ringwald M, Richardson JE, Bult CJ. Mouse Genome Informatics: an integrated knowledgebase system for the laboratory mouse. Genetics 2024; 227:iyae031. [PMID: 38531069 DOI: 10.1093/genetics/iyae031] [Received: 12/02/2023] [Accepted: 02/13/2024] [Indexed: 03/28/2024]
Abstract
Mouse Genome Informatics (MGI) is a federation of expertly curated information resources designed to support experimental and computational investigations into genetic and genomic aspects of human biology and disease using the laboratory mouse as a model system. The Mouse Genome Database (MGD) and the Gene Expression Database (GXD) are core MGI databases that share data and system architecture. MGI serves as the central community resource of integrated information about mouse genome features, variation, expression, gene function, phenotype, and human disease models acquired from peer-reviewed publications, author submissions, and major bioinformatics resources. To facilitate integration and standardization of data, biocuration scientists annotate using terms from controlled metadata vocabularies and biological ontologies (e.g. Mammalian Phenotype Ontology, Mouse Developmental Anatomy, Disease Ontology, Gene Ontology, etc.), and by applying international community standards for gene, allele, and mouse strain nomenclature. MGI serves basic scientists, translational researchers, and data scientists by providing access to FAIR-compliant data in both human-readable and compute-ready formats. The MGI resource is accessible at https://informatics.jax.org. Here, we present an overview of the core data types represented in MGI and highlight recent enhancements to the resource with a focus on new data and functionality for MGD and GXD.
Affiliation(s)
- Carol J Bult
- The Jackson Laboratory, Bar Harbor, ME 04609, USA
2
Liu A, Peng B, Pankajam AV, Scheuermann RH, Zhang Y. Discovery of cell type classification marker genes from single cell RNA sequencing data using NS-Forest. bioRxiv 2024:2024.04.22.590194. [PMID: 38712147 PMCID: PMC11071431 DOI: 10.1101/2024.04.22.590194] [Indexed: 05/08/2024]
Abstract
The use of single-cell transcriptomic technologies that quantitatively describe cell transcriptional phenotypes using single cell/nucleus RNA sequencing (scRNA-seq) is revolutionizing our understanding of cell biology, leading to new insights in cell type identification, disease mechanisms, and drug development. The tremendous growth in scRNA-seq data has posed new challenges in efficiently characterizing data-driven cell types and identifying quantifiable marker genes for cell type classification. The use of machine learning and explainable artificial intelligence has emerged as an effective approach to study large-scale scRNA-seq data. NS-Forest is a random forest machine learning-based algorithm that aims to provide a scalable data-driven solution to identify minimum combinations of necessary and sufficient marker genes that capture cell type identity with maximum classification accuracy. Here, we describe the latest version, NS-Forest version 4.0, and its companion Python package (https://github.com/JCVenterInstitute/NSForest), with several enhancements, to select marker gene combinations that exhibit selective expression patterns among closely related cell types and more efficiently perform marker gene selection for large-scale scRNA-seq data atlases with millions of cells. By modularizing the final decision tree step, NS-Forest v4.0 can be used to compare the performance of user-defined marker genes with the NS-Forest computationally derived marker genes based on the decision tree classifiers. To quantify how well the identified markers exhibit the desired pattern of being exclusively expressed at high levels within their target cell types, we introduce the On-Target Fraction metric that ranges from 0 to 1, with a metric of 1 given to markers that are only expressed within their target cell types and not in cells of any other cell types. We have applied NS-Forest v4.0 to scRNA-seq datasets from three human organs, including the brain, kidney, and lung. We observe that NS-Forest v4.0 outperforms previous versions in its ability to identify markers with higher On-Target Fraction values for closely related cell types and outperforms other marker gene selection approaches in classification performance with significantly higher F-beta scores.
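The two quantities named in this abstract can be illustrated with a small sketch. The abstract does not give a formula for the On-Target Fraction, so the version below is an assumption: it takes the median expression of a gene in the target cell type divided by the sum of its medians over all cell types, which yields 1 when the gene is expressed only in the target type. The F-beta function uses the standard definition from true positives, false positives, and false negatives; the default `beta=0.5` is likewise an illustrative choice, and both function names are hypothetical rather than part of the NS-Forest package.

```python
from statistics import median


def on_target_fraction(expr_by_type, target):
    """Assumed definition: median expression in the target cell type
    divided by the sum of per-type medians for the same gene.

    expr_by_type maps a cell type label to a list of expression values
    for one candidate marker gene.
    """
    medians = {cell_type: median(values) for cell_type, values in expr_by_type.items()}
    total = sum(medians.values())
    # A gene with zero median expression everywhere has no on-target signal.
    return medians[target] / total if total > 0 else 0.0


def f_beta(tp, fp, fn, beta=0.5):
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy values for one gene across two hypothetical cell types.
    toy = {"target_type": [5, 6, 7], "other_type": [0, 0, 0]}
    print(on_target_fraction(toy, "target_type"))
    print(f_beta(tp=8, fp=2, fn=8))
```

Under this reading, a marker shared equally between two cell types scores 0.5, which matches the abstract's description of a metric bounded by 0 and 1.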
Affiliation(s)
- Angela Liu
- Department of Informatics, J. Craig Venter Institute, La Jolla, CA, United States of America
- Beverly Peng
- Department of Informatics, J. Craig Venter Institute, La Jolla, CA, United States of America
- Ajith V. Pankajam
- National Library of Medicine, National Institutes of Health, Bethesda, MD, United States of America
- Richard H. Scheuermann
- National Library of Medicine, National Institutes of Health, Bethesda, MD, United States of America
- Yun Zhang
- Department of Informatics, J. Craig Venter Institute, La Jolla, CA, United States of America
3
Ross KE, Bastian FB, Buys M, Cook CE, D’Eustachio P, Harrison M, Hermjakob H, Li D, Lord P, Natale DA, Peters B, Sternberg PW, Su AI, Thakur M, Thomas PD, Bateman A. Perspectives on tracking data reuse across biodata resources. Bioinformatics Advances 2024; 4:vbae057. [PMID: 38721398 PMCID: PMC11076920 DOI: 10.1093/bioadv/vbae057] [Received: 01/26/2024] [Revised: 03/13/2024] [Accepted: 04/11/2024] [Indexed: 06/14/2024]
Abstract
Motivation: Data reuse is a common and vital practice in molecular biology and enables the knowledge gathered over recent decades to drive discovery and innovation in the life sciences. Much of this knowledge has been collated into molecular biology databases, such as UniProtKB, and these resources derive enormous value from sharing data among themselves. However, quantifying and documenting this kind of data reuse remains a challenge. Results: The article reports on a one-day virtual workshop hosted by the UniProt Consortium in March 2023, attended by representatives from biodata resources, experts in data management, and NIH program managers. Workshop discussions focused on strategies for tracking data reuse, best practices for reusing data, and the challenges associated with data reuse and tracking. Surveys and discussions showed that data reuse is widespread, but critical information for reproducibility is sometimes lacking. Challenges include costs of tracking data reuse, tensions between tracking data and open sharing, restrictive licenses, and difficulties in tracking commercial data use. Recommendations that emerged from the discussion include: development of standardized formats for documenting data reuse, education about the obstacles posed by restrictive licenses, and continued recognition by funding agencies that data management is a critical activity that requires dedicated resources. Availability and implementation: Summaries of survey results are available at: https://docs.google.com/forms/d/1j-VU2ifEKb9C-sW6l3ATB79dgHdRk5v_lESv2hawnso/viewanalytics (survey of data providers) and https://docs.google.com/forms/d/18WbJFutUd7qiZoEzbOytFYXSfWFT61hVce0vjvIwIjk/viewanalytics (survey of users).
Affiliation(s)
- Karen E Ross
- Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20007, United States
- Frederic B Bastian
- Evolutionary Bioinformatics Group, SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland
- Peter D’Eustachio
- Department of Biochemistry & Molecular Pharmacology, NYU Grossman School of Medicine, New York, NY 10012, United States
- Melissa Harrison
- Literature Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Henning Hermjakob
- Molecular Systems, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Donghui Li
- Chan Zuckerberg Initiative, Redwood City, CA 94063, United States
- Phillip Lord
- School of Computing, Newcastle University, Newcastle upon Tyne NE4 5TG, United Kingdom
- Darren A Natale
- Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20007, United States
- Bjoern Peters
- Center for Vaccine Innovation, La Jolla Institute of Immunology, La Jolla, CA 92037, United States
- Paul W Sternberg
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA 91125, United States
- Andrew I Su
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA 92037, United States
- Matthew Thakur
- Data Services, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SA, United Kingdom
- Paul D Thomas
- Department of Population and Public Health Sciences, University of Southern California, Los Angeles, CA 90089, United States
- Alex Bateman
- MSCB, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
4
Carlson B, Watkins M, Li M, Furner B, Cohen E, Volchenboum SL. Using A Standardized Nomenclature to Semantically Map Oncology-Related Concepts from Common Data Models to a Pediatric Cancer Data Model. AMIA Annu Symp Proc 2024; 2023:874-883. [PMID: 38222364 PMCID: PMC10785885] [Indexed: 01/16/2024]
Abstract
The Pediatric Cancer Data Commons (PCDC) comprises an international community whose ironclad commitment to data sharing is combatting pediatric cancer in an unprecedented way. The byproduct of their data sharing efforts is a gold-standard consensus data model covering many types of pediatric cancer. This article describes an effort to utilize SSSOM, an emerging specification for semantically-rich data mappings, to provide a "hub and spoke" model of mappings from several common data models (CDMs) to the PCDC data model. This provides important contributions to the research community, including: 1) a clear view of the current coverage of these CDMs in the domain of pediatric oncology, and 2) a demonstration of creating standardized mappings. These mappings can allow downstream crosswalk for data transformation and enhance data sharing. This can guide those who currently create and maintain brittle ad hoc data mappings in order to utilize the growing volume of viable research data.
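The "hub and spoke" arrangement described in this abstract can be sketched in a few lines: each common data model is mapped only to the hub model, and a crosswalk between any two spokes is derived by joining on the shared hub term. The field names `subject_id`, `predicate_id`, and `object_id` and the predicate `skos:exactMatch` follow the SSSOM vocabulary; the `CDM_A`, `CDM_B`, and `HUB` identifiers and the `crosswalk` helper are hypothetical placeholders, not terms or tooling from the article.

```python
from collections import defaultdict

# Hypothetical spoke mappings: each row maps a source-model term to a hub term.
spoke_a = [
    {"subject_id": "CDM_A:sex", "predicate_id": "skos:exactMatch", "object_id": "HUB:sex"},
    {"subject_id": "CDM_A:dx_date", "predicate_id": "skos:broadMatch", "object_id": "HUB:diagnosis_date"},
]
spoke_b = [
    {"subject_id": "CDM_B:gender", "predicate_id": "skos:exactMatch", "object_id": "HUB:sex"},
    {"subject_id": "CDM_B:date_of_dx", "predicate_id": "skos:exactMatch", "object_id": "HUB:diagnosis_date"},
]


def crosswalk(a_rows, b_rows, predicate="skos:exactMatch"):
    """Derive (term in A, term in B) pairs that share a hub term.

    Only rows with the given predicate are used, so looser mappings
    (for example skos:broadMatch) do not produce a crosswalk entry.
    """
    hub_to_b = defaultdict(list)
    for row in b_rows:
        if row["predicate_id"] == predicate:
            hub_to_b[row["object_id"]].append(row["subject_id"])

    pairs = []
    for row in a_rows:
        if row["predicate_id"] == predicate:
            for b_term in hub_to_b.get(row["object_id"], []):
                pairs.append((row["subject_id"], b_term))
    return pairs


if __name__ == "__main__":
    print(crosswalk(spoke_a, spoke_b))
```

With N data models, this design needs N mapping sets to the hub rather than a separate pairwise mapping for every combination of models.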
Affiliation(s)
- Bradley Carlson
- Department of Pediatrics, University of Chicago, Chicago, IL
- Michael Watkins
- Department of Pediatrics, University of Chicago, Chicago, IL
- Mei Li
- Department of Pediatrics, University of Chicago, Chicago, IL
- Brian Furner
- Department of Pediatrics, University of Chicago, Chicago, IL
- Ellen Cohen
- Department of Pediatrics, University of Chicago, Chicago, IL
5
Gargano MA, Matentzoglu N, Coleman B, Addo-Lartey EB, Anagnostopoulos A, Anderton J, Avillach P, Bagley AM, Bakštein E, Balhoff JP, Baynam G, Bello SM, Berk M, Bertram H, Bishop S, Blau H, Bodenstein DF, Botas P, Boztug K, Čady J, Callahan TJ, Cameron R, Carbon S, Castellanos F, Caufield JH, Chan LE, Chute C, Cruz-Rojo J, Dahan-Oliel N, Davids JR, de Dieuleveult M, de Souza V, de Vries BBA, de Vries E, DePaulo JR, Derfalvi B, Dhombres F, Diaz-Byrd C, Dingemans AJM, Donadille B, Duyzend M, Elfeky R, Essaid S, Fabrizzi C, Fico G, Firth HV, Freudenberg-Hua Y, Fullerton JM, Gabriel DL, Gilmour K, Giordano J, Goes FS, Moses RG, Green I, Griese M, Groza T, Gu W, Guthrie J, Gyori B, Hamosh A, Hanauer M, Hanušová K, He YO, Hegde H, Helbig I, Holasová K, Hoyt CT, Huang S, Hurwitz E, Jacobsen JOB, Jiang X, Joseph L, Keramatian K, King B, Knoflach K, Koolen DA, Kraus M, Kroll C, Kusters M, Ladewig MS, Lagorce D, Lai MC, Lapunzina P, Laraway B, Lewis-Smith D, Li X, Lucano C, Majd M, Marazita ML, Martinez-Glez V, McHenry TH, McInnis MG, McMurry JA, Mihulová M, Millett CE, Mitchell PB, Moslerová V, Narutomi K, Nematollahi S, Nevado J, Nierenberg AA, Čajbiková NN, Nurnberger JI, Ogishima S, Olson D, Ortiz A, Pachajoa H, Perez de Nanclares G, Peters A, Putman T, Rapp CK, Rath A, Reese J, Rekerle L, Roberts A, Roy S, Sanders SJ, Schuetz C, Schulte EC, Schulze TG, Schwarz M, Scott K, Seelow D, Seitz B, Shen Y, Similuk MN, Simon ES, Singh B, Smedley D, Smith CL, Smolinsky JT, Sperry S, Stafford E, Stefancsik R, Steinhaus R, Strawbridge R, Sundaramurthi JC, Talapova P, Tenorio Castano JA, Tesner P, Thomas RH, Thurm A, Turnovec M, van Gijn ME, Vasilevsky NA, Vlčková M, Walden A, Wang K, Wapner R, Ware JS, Wiafe AA, Wiafe SA, Wiggins LD, Williams AE, Wu C, Wyrwoll MJ, Xiong H, Yalin N, Yamamoto Y, Yatham LN, Yocum AK, Young AH, Yüksel Z, Zandi PP, Zankl A, Zarante I, Zvolský M, Toro S, Carmody LC, Harris NL, Munoz-Torres MC, Danis D, Mungall CJ, Köhler S, Haendel MA, Robinson PN. The Human Phenotype Ontology in 2024: phenotypes around the world. Nucleic Acids Res 2024; 52:D1333-D1346. [PMID: 37953324 PMCID: PMC10767975 DOI: 10.1093/nar/gkad1005] [Received: 09/13/2023] [Revised: 10/12/2023] [Accepted: 10/19/2023] [Indexed: 11/14/2023]
Abstract
The Human Phenotype Ontology (HPO) is a widely used resource that comprehensively organizes and defines the phenotypic features of human disease, enabling computational inference and supporting genomic and phenotypic analyses through semantic similarity and machine learning algorithms. The HPO has widespread applications in clinical diagnostics and translational research, including genomic diagnostics, gene-disease discovery, and cohort analytics. In recent years, groups around the world have developed translations of the HPO from English to other languages, and the HPO browser has been internationalized, allowing users to view HPO term labels and in many cases synonyms and definitions in ten languages in addition to English. Since our last report, a total of 2239 new HPO terms and 49235 new HPO annotations were developed, many in collaboration with external groups in the fields of psychiatry, arthrogryposis, immunology and cardiology. The Medical Action Ontology (MAxO) is a new effort to model treatments and other measures taken for clinical management. Finally, the HPO consortium is contributing to efforts to integrate the HPO and the GA4GH Phenopacket Schema into electronic health records (EHRs) with the goal of more standardized and computable integration of rare disease data in EHRs.
Affiliation(s)
- Ben Coleman
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Joel Anderton
- Center for Craniofacial and Dental Genetics, Department of Oral and Craniofacial Sciences, School of Dental Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Anita M Bagley
- Shriners Children's Northern California, Sacramento, CA, USA
- Eduard Bakštein
- National Institute of Mental Health, Klecany, Czech Republic
- James P Balhoff
- Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC 27517, USA
- Gareth Baynam
- Rare Care Centre, Perth Children's Hospital, Perth, Australia
- Michael Berk
- Deakin University, IMPACT - the Institute for Mental and Physical Health and Clinical Translation, School of Medicine, Barwon Health, Geelong, Australia
- Holli Bertram
- Department of Psychiatry, University of Michigan, Ann Arbor, MI, USA
- Somer Bishop
- Department of Psychiatry and Behavioral Sciences, UCSF Weil Institute for Neuroscience, San Francisco, CA, USA
- Hannah Blau
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- David F Bodenstein
- Department of Pharmacology and Toxicology, University of Toronto, Toronto, ON, Canada
- Kaan Boztug
- St. Anna Children's Cancer Research Institute (CCRI), Vienna, Austria
- Jolana Čady
- Institute of Health Information and Statistics of the Czech Republic, Prague, Czech Republic
- Tiffany J Callahan
- Department of Biomedical Informatics, Columbia University Irving Medical Center, NY, NY, USA
- Seth J Carbon
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- J Harry Caufield
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Lauren E Chan
- College of Public Health and Human Sciences, Oregon State University, Corvallis, OR 97331, USA
- Christopher G Chute
- Schools of Medicine, Public Health, and Nursing, Johns Hopkins University, Baltimore, MD 21287, USA
- Jaime Cruz-Rojo
- UDISGEN (Dysmorphology and Genetics Unit), 12 de Octubre Hospital, Madrid, Spain
- Noémi Dahan-Oliel
- Department of Clinical Research, Shriners Hospitals for Children, Montreal, Quebec, Canada
- Jon R Davids
- Shriners Children's Northern California, Sacramento, CA, USA
- Maud de Dieuleveult
- Département I&D, AP-HP, Banque Nationale de Données Maladies Rares, Paris, France
- Vinicius de Souza
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Bert B A de Vries
- Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, Netherlands
- J Raymond DePaulo
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Beata Derfalvi
- Department of Pediatrics, Dalhousie University, Halifax, NS, Canada
- Ferdinand Dhombres
- Fetal Medicine Department, Armand Trousseau Hospital, Sorbonne University, GRC26, INSERM, Limics, Paris, France
- Claudia Diaz-Byrd
- Department of Psychiatry, University of Michigan, Ann Arbor, MI, USA
- Alexander J M Dingemans
- Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, Netherlands
- Bruno Donadille
- St Antoine Hospital, Reference Center for Rare Growth Endocrine Disorders, Sorbonne University, AP-HP, INSERM, US14 - Orphanet, Plateforme Maladies Rares, Paris, France
- Reem Elfeky
- Department of Immunology, GOS Hospital for Children NHS Foundation Trust, University College London, London, UK
- Shahim Essaid
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Giovanna Fico
- Bipolar and Depressive Disorders Unit, Institute of Neuroscience, Hospital Clinic, University of Barcelona, IDIBAPS, CIBERSAM, Barcelona, Catalonia, Spain
- Helen V Firth
- Addenbrooke's Hospital, Cambridge University Hospitals, Cambridge, UK
- Yun Freudenberg-Hua
- Department of Psychiatry, Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
- Davera L Gabriel
- School of Medicine, Johns Hopkins University, Baltimore, MD 21287, USA
- Jessica Giordano
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
- Fernando S Goes
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Rachel Gore Moses
- National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Ian Green
- SNOMED International, London W2 6BD, UK
- Matthias Griese
- Department of Pediatrics, Dr. von Hauner Children's Hospital, University Hospital, LMU Munich, German center for Lung research (DZL), Munich, Germany
- Tudor Groza
- Rare Care Centre, Perth Children's Hospital, Perth, Australia
- Julia Guthrie
- Department of Structural and Computational Biology, University of Vienna; Max Perutz Labs, Vienna, Austria
- Benjamin Gyori
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
- Ada Hamosh
- Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Marc Hanauer
- INSERM, US14 - Orphanet, Plateforme Maladies Rares, Paris, France
- Kateřina Hanušová
- Institute of Health Information and Statistics of the Czech Republic, Prague, Czech Republic
- Harshad Hegde
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Ingo Helbig
- Neurology, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Kateřina Holasová
- Institute of Health Information and Statistics of the Czech Republic, Prague, Czech Republic
- Charles Tapley Hoyt
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
- Eric Hurwitz
- University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Julius O B Jacobsen
- William Harvey Research Institute, Queen Mary University of London, London, UK
- Lisa Joseph
- Neurodevelopmental and Behavioral Phenotyping Service, National Institute of Mental Health, Bethesda, MD, USA
- Kamyar Keramatian
- Department of Psychiatry, University of British Columbia, Vancouver, BC, Canada
- Bryan King
- Department of Psychiatry and Behavioral Sciences, UCSF Weil Institute for Neuroscience, San Francisco, CA, USA
- Katrin Knoflach
- Department of Pediatrics, Dr. von Hauner Children's Hospital, University Hospital, LMU Munich, German center for Lung research (DZL), Munich, Germany
- David A Koolen
- Department of Human Genetics, Donders Institute for Brain, Cognition and Behaviour, Radboud University Medical Center, Nijmegen, Netherlands
- Megan L Kraus
- University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Carlo Kroll
- William Harvey Research Institute, Queen Mary University of London, London, UK
- Maaike Kusters
- Immunology, NIHR Great Ormond Street Hospital BRC, London, UK
- Markus S Ladewig
- Department of Ophthalmology, University Clinic Marburg - Campus Fulda, Fulda, Germany
- David Lagorce
- INSERM, US14 - Orphanet, Plateforme Maladies Rares, Paris, France
- Meng-Chuan Lai
- Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, Canada
- Pablo Lapunzina
- Institute of Medical and Molecular Genetics, Hospital Univ. La Paz, Madrid, Spain
- Bryan Laraway
- University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- David Lewis-Smith
- Translational and Clinical Research Institute, Henry Wellcome Building, Framlington Place, Newcastle University, Newcastle-Upon-Tyne NE14LP, UK
- Caterina Lucano
- INSERM, US14 - Orphanet, Plateforme Maladies Rares, Paris, France
- Marzieh Majd
- Department of Psychiatry, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Mary L Marazita
- Center for Craniofacial and Dental Genetics, Department of Oral and Craniofacial Sciences, School of Dental Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Victor Martinez-Glez
- Center for Genomic Medicine, Parc Taulí Hospital Universitari, Institut d’Investigació i Innovació Parc Taulí (I3PT-CERCA), Sabadell, Spain
- Toby H McHenry
- Center for Craniofacial and Dental Genetics, Department of Oral and Craniofacial Sciences, School of Dental Medicine, University of Pittsburgh, Pittsburgh, PA, USA
- Melvin G McInnis
- Department of Psychiatry, University of Michigan, Ann Arbor, MI, USA
- Julie A McMurry
- University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Michaela Mihulová
- Department of Biology and Medical Genetics, 2nd Medical Faculty of Charles University and University Hospital Motol, Prague, Czech Republic
- Caitlin E Millett
- Feinstein Institutes for Medical Research, Northwell Health, Manhasset, NY, USA
- Philip B Mitchell
- Discipline of Psychiatry & Mental Health, School of Clinical Medicine, Faculty of Medicine & Health, University of New South Wales, Sydney, NSW, Australia
- Veronika Moslerová
- Department of Biology and Medical Genetics, 2nd Medical Faculty of Charles University and University Hospital Motol, Prague, Czech Republic
- Kenji Narutomi
- Okinawa Prefectural Nanbu Medical Center & Children's Medical Center
- Shahrzad Nematollahi
- School of Physical and Occupational Therapy, McGill University, Montreal, Quebec, Canada
- Julian Nevado
- Institute of Medical and Molecular Genetics, Hospital Univ. La Paz, Madrid, Spain
- Andrew A Nierenberg
- Dauten Family Center for Bipolar Treatment Innovation, Massachusetts General Hospital, Boston, MA, USA
- Nikola Novák Čajbiková
- Department of Biology and Medical Genetics, 2nd Medical Faculty of Charles University and University Hospital Motol, Prague, Czech Republic
- John I Nurnberger
- Stark Neurosciences Research Institute, Departments of Psychiatry and Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA
- Daniel Olson
- Data Collaboration Center, Data Science, Critical Path Institute, Tucson, AZ, USA
- Abigail Ortiz
- Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Harry Pachajoa
- Centro de Investigaciones en Anomalías Congénitas y Enfermedades Raras (CIACER), Universidad Icesi, Cali, Colombia
- Guiomar Perez de Nanclares
- Molecular (epi) genetics lab, Bioaraba Health Research Institute, Araba University Hospital, Vitoria-Gasteiz, Spain
- Amy Peters
- Department of Psychiatry, Massachusetts General Hospital, Boston, MA, USA
- Tim Putman
- University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Christina K Rapp
- Department of Pediatrics, Dr. von Hauner Children's Hospital, University Hospital, LMU Munich, German center for Lung research (DZL), Munich, Germany
- Ana Rath
- INSERM, US14 - Orphanet, Plateforme Maladies Rares, Paris, France
- Justin Reese
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
- Lauren Rekerle
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Angharad M Roberts
- National Heart & Lung Institute & MRC London Institute of Medical Sciences, Imperial College London, London W12 0HS, UK
- Suzy Roy
- SNOMED International, London W2 6BD, UK
- Stephan J Sanders
- Department of Paediatrics, Institute of Developmental and Regenerative Medicine, University of Oxford, Oxford, UK
- Catharina Schuetz
- Universitätsklinikum Carl Gustav Carus, Medizinische Fakultät, TU, Dresden, Germany
- Eva C Schulte
- Institute of Psychiatric Phenomics and Genomics (IPPG), LMU University Hospital, LMU Munich, Munich, Germany
- Thomas G Schulze
- Department of Psychiatry and Behavioral Sciences, SUNY Upstate Medical University, Syracuse, NY, USA
- Martin Schwarz
- Department of Biology and Medical Genetics, 2nd Medical Faculty of Charles University and University Hospital Motol, Prague, Czech Republic
- Katie Scott
- Department of Psychiatry, Dalhousie University, Halifax, NS, Canada
- Dominik Seelow
- Exploratory Diagnostic Sciences, Berliner Institut für Gesundheitsforschung - Charité, Berlin, Germany
- Berthold Seitz
- Department of Ophthalmology, Saarland University Medical Center UKS, Homburg/Saar, Germany
- Morgan N Similuk
- National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, MD 20892, USA
- Eric S Simon
- Eisenberg Family Depression Center, University of Michigan, Ann Arbor, MI, USA
- Balwinder Singh
- Department of Psychiatry and Psychology, Mayo Clinic, Rochester, MN, USA
- Damian Smedley
- William Harvey Research Institute, Queen Mary University of London, London, UK
- Jake T Smolinsky
- Human Genetics Institute of New Jersey, Rutgers University, Piscataway, NJ, USA
- Sarah Sperry
- Department of Psychiatry, University of Michigan, Ann Arbor, MI, USA
- Ray Stefancsik
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton CB10 1SD, UK
- Robin Steinhaus
- Exploratory Diagnostic Sciences, Berliner Institut für Gesundheitsforschung - Charité, Berlin, Germany
- Rebecca Strawbridge
- Department of Psychological Medicine, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Polina Talapova
- Institute for Research and Health Policy Studies, Tufts Medicine, Boston, MA 2111, USA
- Pavel Tesner
- Department of Biology and Medical Genetics, 2nd Medical Faculty of Charles University and University Hospital Motol, Prague, Czech Republic
- Rhys H Thomas
- Translational and Clinical Research Institute, Henry Wellcome Building, Framlington Place, Newcastle University, Newcastle-Upon-Tyne NE14LP, UK
- Audrey Thurm
- Neurodevelopmental and Behavioral Phenotyping Service, National Institute of Mental Health, Bethesda, MD, USA
- Marek Turnovec
- Department of Biology and Medical Genetics, 2nd Medical Faculty of Charles University and University Hospital Motol, Prague, Czech Republic
- Marielle E van Gijn
- Department of Genetics, University Medical Center Groningen, Groningen, Netherlands
- Markéta Vlčková
- Department of Biology and Medical Genetics, 2nd Medical Faculty of Charles University and University Hospital Motol, Prague, Czech Republic
- Anita Walden
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Kai Wang
- Chinese HPO Consortium, Beijing, China
- Ron Wapner
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
- James S Ware
- National Heart & Lung Institute & MRC London Institute of Medical Sciences, Imperial College London, London W12 0HS, UK
- Lisa D Wiggins
- National Center on Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Andrew E Williams
- Institute for Research and Health Policy Studies, Tufts Medicine, Boston, MA 2111, USA
- Chen Wu
- Chinese HPO Consortium, Beijing, China
- Margot J Wyrwoll
- Centre for Regenerative Medicine, Institute for Regeneration and Repair, Institute for Stem Cell Research, University of Edinburgh, Edinburgh, UK
- Hui Xiong
- Chinese HPO Consortium, Beijing, China
- Nefize Yalin
- Department of Psychological Medicine, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Yasunori Yamamoto
- Database Center for Life Science, Joint Support-Center for Data Science Research, Research Organization of Information and Systems, Japan
- Lakshmi N Yatham
- Department of Psychiatry, University of British Columbia, Vancouver, BC, Canada
| | - Anastasia K Yocum
- Department of Psychiatry, University of Michigan, Ann Arbor, MI, USA
| | - Allan H Young
- Psychological Medicine, Institute of Psychiatry, Psychology and Neuroscience, King's College London & South London and Maudsley NHS Foundation Trust, Bethlem Royal Hospital, Monks Orchard Road, Beckenham, Kent, London SE5 8AF, UK
| | - Zafer Yüksel
- Department of Human Genetics, Bioscientia Healthcare GmbH, Ingelheim, Germany
| | - Peter P Zandi
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
| | - Andreas Zankl
- Faculty of Medicine and Health, The University of Sydney, Camperdown, Australia
| | - Ignacio Zarante
- Institute of Human Genetics, Pontificia Universidad Javeriana, Bogotá, Colombia
| | - Miroslav Zvolský
- Institute of Health Information and Statistics of the Czech Republic, Prague, Czech Republic
| | - Sabrina Toro
- University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
| | - Leigh C Carmody
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Nomi L Harris
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | - Monica C Munoz-Torres
- Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
| | - Daniel Danis
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
| | - Christopher J Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
| | | | - Melissa A Haendel
- University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
| | - Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
6
Cooper L, Elser J, Laporte MA, Arnaud E, Jaiswal P. Planteome 2024 Update: Reference Ontologies and Knowledgebase for Plant Biology. Nucleic Acids Res 2024; 52:D1548-D1555. [PMID: 38055832 PMCID: PMC10767901 DOI: 10.1093/nar/gkad1028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2023] [Revised: 10/14/2023] [Accepted: 10/23/2023] [Indexed: 12/08/2023] Open
Abstract
The Planteome project (https://planteome.org/) provides a suite of reference and crop-specific ontologies and an integrated knowledgebase of plant genomics data. The plant genomics data in the Planteome has been obtained through manual and automated curation and sourced from more than 40 partner databases and resources. Here, we report on updates to the Planteome reference ontologies, namely, the Plant Ontology (PO), Trait Ontology (TO), the Plant Experimental Conditions Ontology (PECO), and integration of species/crop-specific vocabularies from our partners, the Crop Ontology (CO) into the TO ontology graph. Currently, 11 CO vocabularies are integrated into the Planteome with the addition of yam, sorghum, and potato since 2018. In addition, the size of the annotation database has increased by 34%, and the number of bioentities (genes, proteins, etc.) from 125 plant taxa has increased by 72%. We developed new tools to facilitate user requests and improvements to the CO vocabularies, and to allow fast searching and browsing of PO terms and definitions. These enhancements and future changes to automate the TO-CO mappings and knowledge discovery tools ensure that the Planteome will continue to be a valuable resource for plant biology.
Affiliation(s)
- Laurel Cooper
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
| | - Justin Elser
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
| | | | - Elizabeth Arnaud
- Digital Inclusion, Bioversity International, 34397 Montpellier, France
| | - Pankaj Jaiswal
- Department of Botany and Plant Pathology, Oregon State University, Corvallis, OR 97331, USA
7
Eloe-Fadrosh EA, Mungall CJ, Miller MA, Smith M, Patil SS, Kelliher JM, Johnson LYD, Rodriguez FE, Chain PSG, Hu B, Thornton MB, McCue LA, McHardy AC, Harris NL, Reddy TBK, Mukherjee S, Hunter CI, Walls R, Schriml LM. A Practical Approach to Using the Genomic Standards Consortium MIxS Reporting Standard for Comparative Genomics and Metagenomics. Methods Mol Biol 2024; 2802:587-609. [PMID: 38819573 DOI: 10.1007/978-1-0716-3838-5_20] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Comparative analysis of (meta)genomes necessitates aggregation, integration, and synthesis of well-annotated data using standards. The Genomic Standards Consortium (GSC) collaborates with the research community to develop and maintain the Minimum Information about any (x) Sequence (MIxS) reporting standard for genomic data. To facilitate the use of the GSC's MIxS reporting standard, we provide a description of the structure and terminology, how to navigate ontologies for required terms in MIxS, and demonstrate practical usage through a soil metagenome example.
Affiliation(s)
- Emiley A Eloe-Fadrosh
- Environmental Genomics and System Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Christopher J Mungall
- Environmental Genomics and System Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Mark Andrew Miller
- Environmental Genomics and System Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Montana Smith
- Pacific Northwest National Laboratory, Richland, WA, USA
| | - Sujay Sanjeev Patil
- Environmental Genomics and System Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Julia M Kelliher
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Leah Y D Johnson
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
| | | | - Patrick S G Chain
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Bin Hu
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA
| | - Michael B Thornton
- Environmental Genomics and System Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Lee Ann McCue
- Pacific Northwest National Laboratory, Richland, WA, USA
| | - Alice Carolyn McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
| | - Nomi L Harris
- Environmental Genomics and System Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - T B K Reddy
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Supratim Mukherjee
- DOE Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Christopher I Hunter
- GigaScience Press, Hong Kong Science Park, Pak Shek Kok, New Territories, Hong Kong
| | | | - Lynn M Schriml
- University of Maryland School of Medicine, Institute for Genome Sciences, Baltimore, MD, USA
8
Seah BKB. Paying it forward: Crowdsourcing the harmonisation and linking of taxon names and biodiversity identifiers. Biodivers Data J 2023; 11:e114076. [PMID: 38312332 PMCID: PMC10838036 DOI: 10.3897/bdj.11.e114076] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Accepted: 11/06/2023] [Indexed: 02/06/2024] Open
Abstract
Linking records for the same taxa between different databases is an essential step when working with biodiversity data. However, name-matching alone is error-prone, because of issues such as homonyms (unrelated taxa with the same name) and synonyms (same taxon under different names). Therefore, most projects will require some curation to ensure that taxon identifiers are correctly linked. Unfortunately, formal guidance on such curation is uncommon and these steps are often ad hoc and poorly documented, which hinders transparency and reproducibility, yet the task requires specialist knowledge and cannot be easily automated without careful validation. Here, we present a case study on linking identifiers between the GBIF and NCBI taxonomies for a species checklist. This represents a common scenario: finding published sequence data (from NCBI) for species chosen by occurrence or geographical distribution (from GBIF). Wikidata, a publicly editable knowledge base of structured data, can serve as an additional information source for identifier linking. We suggest a software toolkit for taxon name-matching and data-cleaning, describe common issues encountered during curation and propose concrete steps to address them. For example, about 2.8% of the taxa in our dataset had wrong identifiers linked on Wikidata because of errors in name-matching caused by homonyms. By correcting such errors during data-cleaning, either directly (through editing Wikidata) or indirectly (by reporting errors in GBIF or NCBI), we crowdsource the curation and contribute to community resources, thereby improving the quality of downstream analyses.
Affiliation(s)
- Brandon Kwee Boon Seah
- Thünen Institute for Biodiversity, Braunschweig, Germany
9
Lee J, Song J. Towards Semantic Smart Cities: A Study on the Conceptualization and Implementation of Semantic Context Inference Systems. Sensors (Basel, Switzerland) 2023; 23:9392. [PMID: 38067764 PMCID: PMC10708828 DOI: 10.3390/s23239392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Revised: 11/16/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023]
Abstract
Smart cities provide integrated management and operation of urban data emerging within a city, supplying the infrastructure for smart city services and resolving various urban challenges. Nevertheless, cities continue to grapple with substantial issues, such as contagious diseases and terrorism, that pose severe financial and human risks. These problems sporadically arise in various locales, and current smart city frameworks lack the capability to autonomously identify and address these issues. The challenge intensifies especially when trying to recognize and respond to unprecedented problems. The primary objective of this research is to predict potential urban issues and support their resolution proactively. To achieve this, our system makes use of semantic reasoning to understand the ongoing situations within the city. In this process, the 5W1H principles serve as inference rules, guiding the extraction and consolidation of context. Firstly, utilizing domain-specific annotation templates, we craft a semantic graph by amalgamating information from various sources available in the city, such as municipal public data and IoT platforms. Subsequently, the system autonomously infers and accumulates contexts of situations occurring in the city using 5W1H-based reasoning. As a result, the accumulated contexts allow for inferring potential urban problems by identifying repeated disruptions in city services at specific times or locations and establishing connections among them. The main contribution of this paper lies in proposing a comprehensive conceptual model for the suggested system and presenting actual implementation cases and applicable use cases. These contributions facilitate awareness among city administrators and citizens within a smart city regarding potential problem-prone areas or times, thereby aiding in the preemptive identification and mitigation of urban challenges.
Affiliation(s)
| | - JaeSeung Song
- Department of Convergence Engineering for Intelligent Drone, Sejong University, Gwangjin-gu, Seoul 05006, Republic of Korea
10
Clarke JL, Cooper LD, Poelchau MF, Berardini TZ, Elser J, Farmer AD, Ficklin S, Kumari S, Laporte MA, Nelson RT, Sadohara R, Selby P, Thessen AE, Whitehead B, Sen TZ. Data sharing and ontology use among agricultural genetics, genomics, and breeding databases and resources of the Agbiodata Consortium. Database (Oxford) 2023; 2023:baad076. [PMID: 37971715 PMCID: PMC10653126 DOI: 10.1093/database/baad076] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/17/2023] [Accepted: 10/17/2023] [Indexed: 11/19/2023]
Abstract
Over the last couple of decades, there has been a rapid growth in the number and scope of agricultural genetics, genomics and breeding databases and resources. The AgBioData Consortium (https://www.agbiodata.org/) currently represents 44 databases and resources (https://www.agbiodata.org/databases) covering model or crop plant and animal GGB data, ontologies, pathways, genetic variation and breeding platforms (referred to as 'databases' throughout). One of the goals of the Consortium is to facilitate FAIR (Findable, Accessible, Interoperable, and Reusable) data management and the integration of datasets, which requires data sharing along with structured vocabularies and/or ontologies. Two AgBioData working groups, focused on Data Sharing and Ontologies, respectively, conducted a Consortium-wide survey to assess the current status and future needs of the members in those areas. A total of 33 researchers responded to the survey, representing 37 databases. Results suggest that data-sharing practices by AgBioData databases are in a fairly healthy state, but it is not clear whether this is true for all metadata and data types across all databases; and that ontology use has not substantially changed since a similar survey was conducted in 2017. Based on our evaluation of the survey results, we recommend (i) providing training for database personnel in specific data-sharing techniques, as well as in ontology use; (ii) further study on what metadata is shared, and how well it is shared among databases; (iii) promoting an understanding of data sharing and ontologies in the stakeholder community; (iv) improving data sharing and ontologies for specific phenotypic data types and formats; and (v) lowering specific barriers to data sharing and ontology use, by identifying sustainability solutions, and the identification, promotion, or development of data standards. Combined, these improvements are likely to help AgBioData databases increase development efforts towards improved ontology use, and data sharing via programmatic means. Database URL https://www.agbiodata.org/databases.
Affiliation(s)
- Jennifer L Clarke
- Department of Statistics and Department of Food Science and Technology, University of Nebraska–Lincoln, 340 Hardin Hall North Wing, Lincoln, NE 68583, USA
| | - Laurel D Cooper
- Department of Botany and Plant Pathology, Oregon State University, 2503 Cordley Hall, Corvallis, OR 97331, USA
| | - Monica F Poelchau
- USDA, Agricultural Research Service, National Agricultural Library, 10301 Baltimore Ave, Beltsville 20705, USA
| | - Tanya Z Berardini
- The Arabidopsis Information Resource and Phoenix Bioinformatics, 39899 Balentine Drive, Suite 200, Newark, CA, USA
| | - Justin Elser
- Department of Botany and Plant Pathology, Oregon State University, 2503 Cordley Hall, Corvallis, OR 97331, USA
| | - Andrew D Farmer
- National Center for Genome Resources, 2935 Rodeo Park Dr. E., Santa Fe, NM 87505, USA
| | - Stephen Ficklin
- Department of Horticulture, Washington State University, 249 Clark Hall, PO Box 646414, Pullman, WA 99164, USA
| | - Sunita Kumari
- Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724, USA
| | - Marie-Angélique Laporte
- Digital Inclusion, Bioversity International, Parc Scientifique Agropolis II, 1990 Bd de la Lironde, Montpellier 34397, France
| | - Rex T Nelson
- USDA, Agricultural Research Service, Corn Insects and Crop Genetics Research Unit, Iowa State University, 716 Farmhouse Lane, Ames, IA 50011, USA
| | - Rie Sadohara
- Department of Plant, Soil, and Microbial Sciences, Michigan State University, 1066 Bogue St, East Lansing, MI 48824, USA
| | - Peter Selby
- School of Integrative Plant Science, College of Agriculture and Life Sciences, Cornell University, 215 Garden Avenue, Ithaca, NY 14850, USA
| | - Anne E Thessen
- Department of Biomedical Informatics, University of Colorado Anschutz, 1890 N. Revere Court, Mailstop F600, Aurora CO 80045, USA
| | - Brandon Whitehead
- Data Science and Informatics, Manaaki Whenua—Landcare Research, Ltd., Riddet Road, Massey University, Palmerston North 4472, New Zealand
| | - Taner Z Sen
- USDA, Agricultural Research Service, Crop Improvement Genetics Research Unit, Western Regional Research Center, 800 Buchanan St, Albany 94710, USA
- Department of Bioengineering, University of California, 306 Stanley Hall, Berkeley, CA 94720, USA
11
Meyer R, Appeltans W, Duncan WD, Dimitrova M, Gan YM, Stjernegaard Jeppesen T, Mungall C, Paul DL, Provoost P, Robertson T, Schriml L, Suominen S, Walls R, Sweetlove M, Ung V, Van de Putte A, Wallis E, Wieczorek J, Buttigieg PL. Aligning Standards Communities for Omics Biodiversity Data: Sustainable Darwin Core-MIxS Interoperability. Biodivers Data J 2023; 11:e112420. [PMID: 37829294 PMCID: PMC10565567 DOI: 10.3897/bdj.11.e112420] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/08/2023] [Accepted: 10/02/2023] [Indexed: 10/14/2023] Open
Abstract
The standardization of data, encompassing both primary and contextual information (metadata), plays a pivotal role in facilitating data (re-)use, integration, and knowledge generation. However, the biodiversity and omics communities, converging on omics biodiversity data, have historically developed and adopted their own distinct standards, hindering effective (meta)data integration and collaboration. In response to this challenge, the Task Group (TG) for Sustainable DwC-MIxS Interoperability was established. Convening experts from the Biodiversity Information Standards (TDWG) and the Genomic Standards Consortium (GSC) alongside external stakeholders, the TG aimed to promote sustainable interoperability between the Minimum Information about any (x) Sequence (MIxS) and Darwin Core (DwC) specifications. To achieve this goal, the TG utilized the Simple Standard for Sharing Ontology Mappings (SSSOM) to create a comprehensive mapping of DwC keys to MIxS keys. This mapping, combined with the development of the MIxS-DwC extension, enables the incorporation of MIxS core terms into DwC-compliant metadata records, facilitating seamless data exchange between MIxS and DwC user communities. Through the implementation of this translation layer, data produced in either MIxS- or DwC-compliant formats can now be efficiently brokered, breaking down silos and fostering closer collaboration between the biodiversity and omics communities. To ensure its sustainability and lasting impact, TDWG and GSC have both signed a Memorandum of Understanding (MoU) on creating a continuous model to synchronize their standards. These achievements mark a significant step forward in enhancing data sharing and utilization across domains, thereby unlocking new opportunities for scientific discovery and advancement.
Affiliation(s)
- Raïssa Meyer
- Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven, Germany
- Max Planck Institute for Marine Microbiology, Bremen, Germany
- University of Bremen, Faculty of Geosciences, Bremen, Germany
| | - Ward Appeltans
- Intergovernmental Oceanographic Commission of UNESCO, Ocean Biodiversity Information System (OBIS), Oostende, Belgium
| | - William D. Duncan
- University of Florida, Gainesville, United States of America
| | - Mariya Dimitrova
- Bulgarian Academy of Sciences, Sofia, Bulgaria
- Pensoft Publishers, Sofia, Bulgaria
| | - Yi-Ming Gan
- Royal Belgian Institute of Natural Sciences, Brussels, Belgium
| | - Thomas Stjernegaard Jeppesen
- Global Biodiversity Information Facility (GBIF), Copenhagen, Denmark
| | - Christopher Mungall
- Lawrence Berkeley National Laboratory, National Microbiome Data Collaborative (NMDC), Berkeley, United States of America
| | - Deborah L Paul
- University of Illinois, Illinois Natural History Survey, Species File Group, Champaign-Urbana, United States of America
| | - Pieter Provoost
- Intergovernmental Oceanographic Commission of UNESCO, Ocean Biodiversity Information System (OBIS), Oostende, Belgium
| | - Tim Robertson
- Global Biodiversity Information Facility (GBIF), Copenhagen, Denmark
| | - Lynn Schriml
- Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, MD, USA
| | - Saara Suominen
- Intergovernmental Oceanographic Commission of UNESCO, Ocean Biodiversity Information System (OBIS), Oostende, Belgium
| | - Ramona Walls
- Critical Path Institute, Tucson, United States of America
| | - Maxime Sweetlove
- Royal Belgian Institute of Natural Sciences, Brussels, Belgium
| | - Visotheary Ung
- UMR 7205 CNRS-MNHN-SU-EPHE-UA, Paris, France
| | - Anton Van de Putte
- Royal Belgian Institute of Natural Sciences, Université Libre de Bruxelles, Brussels, Belgium
| | - Elycia Wallis
- Atlas of Living Australia, CSIRO, Melbourne, Australia
| | - John Wieczorek
- University of California, Berkeley, United States of America
| | - Pier Luigi Buttigieg
- Helmholtz Metadata Collaboration, GEOMAR Helmholtz Centre for Ocean Research, Kiel, Germany
12
Jones SE, Bradwell KR, Chan LE, McMurry JA, Olson-Chen C, Tarleton J, Wilkins KJ, Ly V, Ljazouli S, Qin Q, Faherty EG, Lau YK, Xie C, Kao YH, Liebman MN, Mariona F, Challa AP, Li L, Ratcliffe SJ, Haendel MA, Patel RC, Hill EL. Who is pregnant? Defining real-world data-based pregnancy episodes in the National COVID Cohort Collaborative (N3C). JAMIA Open 2023; 6:ooad067. [PMID: 37600074 PMCID: PMC10432357 DOI: 10.1093/jamiaopen/ooad067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Revised: 05/12/2023] [Accepted: 08/08/2023] [Indexed: 08/22/2023] Open
Abstract
Objectives: To define pregnancy episodes and estimate gestational age within electronic health record (EHR) data from the National COVID Cohort Collaborative (N3C). Materials and Methods: We developed a comprehensive approach, named Hierarchy and rule-based pregnancy episode Inference integrated with Pregnancy Progression Signatures (HIPPS), and applied it to EHR data in the N3C (January 1, 2018-April 7, 2022). HIPPS combines: (1) an extension of a previously published pregnancy episode algorithm, (2) a novel algorithm to detect gestational age-specific signatures of a progressing pregnancy for further episode support, and (3) pregnancy start date inference. Clinicians performed validation of HIPPS on a subset of episodes. We then generated pregnancy cohorts based on gestational age precision and pregnancy outcomes for assessment of accuracy and comparison of COVID-19 and other characteristics. Results: We identified 628 165 pregnant persons with 816 471 pregnancy episodes, of which 52.3% were live births, 24.4% were other outcomes (stillbirth, ectopic pregnancy, abortions), and 23.3% had unknown outcomes. Clinician validation agreed 98.8% with HIPPS-identified episodes. We were able to estimate start dates within 1 week of precision for 475 433 (58.2%) episodes. 62 540 (7.7%) episodes had incident COVID-19 during pregnancy. Discussion: HIPPS provides measures of support for pregnancy-related variables such as gestational age and pregnancy outcomes based on N3C data. Gestational age precision allows researchers to find time to events with reasonable confidence. Conclusion: We have developed a novel and robust approach for inferring pregnancy episodes and gestational age that addresses data inconsistency and missingness in EHR data.
Affiliation(s)
- Sara E Jones
- Office of Data Science and Emerging Technologies, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Rockville, MD 20852, United States
| | | | - Lauren E Chan
- College of Public Health and Human Sciences, Oregon State University, Corvallis, OR 97331, United States
| | - Julie A McMurry
- Department of Biomedical Informatics, University of Colorado, Anschutz Medical Campus, Aurora, CO 80045, United States
| | - Courtney Olson-Chen
- Department of Obstetrics and Gynecology, University of Rochester Medical Center, Rochester, NY 14620, United States
| | - Jessica Tarleton
- Department of Obstetrics and Gynecology, Medical University of South Carolina, Charleston, SC 29425, United States
| | - Kenneth J Wilkins
- Biostatistics Program, Office of the Director, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, United States
| | - Victoria Ly
- Department of Obstetrics and Gynecology, University of Rochester Medical Center, Rochester, NY 14620, United States
| | - Saad Ljazouli
- Palantir Technologies, Denver, CO 80202, United States
| | - Qiuyuan Qin
- Department of Public Health Sciences, University of Rochester Medical Center, Rochester, NY 14618, United States
| | - Emily Groene Faherty
- School of Public Health, University of Minnesota, Minneapolis, MN 55455, United States
| | | | - Catherine Xie
- Department of Public Health Sciences, University of Rochester Medical Center, Rochester, NY 14618, United States
| | - Yu-Han Kao
- Sema4, Stamford, CT 06902, United States
| | | | - Federico Mariona
- Beaumont Hospital, Dearborn, MI 48124, United States
- Wayne State University, Detroit, MI 48202, United States
| | - Anup P Challa
- Department of Chemical and Biomolecular Engineering, Vanderbilt University, Nashville, TN 37212, United States
| | - Li Li
- Sema4, Stamford, CT 06902, United States
| | - Sarah J Ratcliffe
- Department of Public Health Sciences, University of Virginia, Charlottesville, VA 22903, United States
| | - Melissa A Haendel
- College of Public Health and Human Sciences, Oregon State University, Corvallis, OR 97331, United States
| | - Rena C Patel
- Department of Medicine and Global Health, University of Washington, Seattle, WA 98105, United States
| | - Elaine L Hill
- Department of Obstetrics and Gynecology, University of Rochester Medical Center, Rochester, NY 14620, United States
- Department of Public Health Sciences, University of Rochester Medical Center, Rochester, NY 14618, United States
13
Amardeilh F, Aubin S, Bernard S, Bravo S, Bossy R, Faron C, Michel F, Raphel J, Roussey C. Combining different points of view on plant descriptions: mapping agricultural plant roles and biological taxa. Front Artif Intell 2023; 6:1188036. [PMID: 37829659 PMCID: PMC10565037 DOI: 10.3389/frai.2023.1188036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2023] [Accepted: 08/14/2023] [Indexed: 10/14/2023] Open
Abstract
This article describes our study on the alignment of two complementary knowledge graphs useful in agriculture: the thesaurus of cultivated plants in France named French Crop Usage (FCU) and the French national taxonomic repository TAXREF for fauna, flora, and fungi. FCU describes the usages of plants in agriculture: "tomatoes" are crops used for human food, and "grapevines" are crops used for human beverage. TAXREF describes biological taxa and associated scientific names: for example, a tomato species may be "Solanum lycopersicum" or a grapevine species may be "Vitis vinifera". Both knowledge graphs contain vernacular names of plants but those names are ambiguous. Thus, a group of agricultural experts produced some mappings from FCU crops to TAXREF taxa. Moreover, new RDF properties have been defined to declare those new types of mapping relations between plant descriptions. The metadata for the mappings and the mapping set are encoded with the Simple Standard for Sharing Ontological Mappings (SSSOM), a new model which, among other qualities, offers means to report on provenance of particular interest for this study. The produced mappings are available for download in Recherche Data Gouv, the federated national platform for research data in France.
Affiliation(s)
| | | | - Stephan Bernard
- Université Clermont Auvergne, INRAE, UR TSCF, Clermont-Ferrand, France
| | | | - Robert Bossy
- MaIAGE, INRAE, Université Paris-Saclay, Jouy-en-Josas, France
| | - Catherine Faron
- Université Côte d'Azur, Inria, I3S, Sophia-Antipolis, France
| | - Franck Michel
- Université Côte d'Azur, Inria, I3S, Sophia-Antipolis, France
| | | | - Catherine Roussey
- Université Clermont Auvergne, INRAE, UR TSCF, Clermont-Ferrand, France
- MISTEA, University of Montpellier, INRAE & Institut Agro, Montpellier, France
14
Stefancsik R, Balhoff JP, Balk MA, Ball RL, Bello SM, Caron AR, Chesler EJ, de Souza V, Gehrke S, Haendel M, Harris LW, Harris NL, Ibrahim A, Koehler S, Matentzoglu N, McMurry JA, Mungall CJ, Munoz-Torres MC, Putman T, Robinson P, Smedley D, Sollis E, Thessen AE, Vasilevsky N, Walton DO, Osumi-Sutherland D. The Ontology of Biological Attributes (OBA)-computational traits for the life sciences. Mamm Genome 2023; 34:364-378. [PMID: 37076585 PMCID: PMC10382347 DOI: 10.1007/s00335-023-09992-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2023] [Accepted: 04/06/2023] [Indexed: 04/21/2023]
Abstract
Existing phenotype ontologies were originally developed to represent phenotypes that manifest as a character state in relation to a wild-type or other reference. However, these do not include the phenotypic trait or attribute categories required for the annotation of genome-wide association studies (GWAS), Quantitative Trait Loci (QTL) mappings or any population-focussed measurable trait data. The integration of trait and biological attribute information with an ever increasing body of chemical, environmental and biological data greatly facilitates computational analyses and it is also highly relevant to biomedical and clinical applications. The Ontology of Biological Attributes (OBA) is a formalised, species-independent collection of interoperable phenotypic trait categories that is intended to fulfil a data integration role. OBA is a standardised representational framework for observable attributes that are characteristics of biological entities, organisms, or parts of organisms. OBA has a modular design which provides several benefits for users and data integrators, including an automated and meaningful classification of trait terms computed on the basis of logical inferences drawn from domain-specific ontologies for cells, anatomical and other relevant entities. The logical axioms in OBA also provide a previously missing bridge that can computationally link Mendelian phenotypes with GWAS and quantitative traits. The term components in OBA provide semantic links and enable knowledge and data integration across specialised research community boundaries, thereby breaking silos.
Affiliation(s)
- Ray Stefancsik
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- James P Balhoff
- Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC, 27517, USA
- Meghan A Balk
- Natural History Museum, University of Oslo, Oslo, Norway
- Robyn L Ball
- The Jackson Laboratory, Bar Harbor, ME, 04609, USA
- Anita R Caron
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Vinicius de Souza
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Sarah Gehrke
- Anschutz Medical Campus, University of Colorado, Aurora, CO, 80045, USA
- Melissa Haendel
- Anschutz Medical Campus, University of Colorado, Aurora, CO, 80045, USA
- Laura W Harris
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Nomi L Harris
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Arwa Ibrahim
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Julie A McMurry
- Anschutz Medical Campus, University of Colorado, Aurora, CO, 80045, USA
- Christopher J Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Tim Putman
- Anschutz Medical Campus, University of Colorado, Aurora, CO, 80045, USA
- Damian Smedley
- William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London, EC1M 6BQ, UK
- Elliot Sollis
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Anne E Thessen
- Anschutz Medical Campus, University of Colorado, Aurora, CO, 80045, USA
- Nicole Vasilevsky
- Data Collaboration Center, Critical Path Institute, Tucson, AZ, 85718, USA
|
15
|
Charlet J, Cui L. Knowledge Representation and Management 2022: Findings in Ontology Development and Applications. Yearb Med Inform 2023; 32:225-229. [PMID: 38147864 PMCID: PMC10751114 DOI: 10.1055/s-0043-1768747] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2023] Open
Abstract
OBJECTIVES To select, present, and summarize the best papers in 2022 for the Knowledge Representation and Management (KRM) section of the International Medical Informatics Association (IMIA) Yearbook. METHODS We conducted PubMed queries and followed the IMIA Yearbook guidelines for performing biomedical informatics literature review to select the best papers in KRM published in 2022. RESULTS We retrieved 1,847 publications from PubMed. We nominated 15 candidate best papers, and two of them were finally selected as the best papers in the KRM section. The topics covered by the candidate papers include ontology and knowledge graph creation, ontology applications, ontology quality assurance, ontology mapping standard, and conceptual model. CONCLUSIONS In the KRM best paper selection for 2022, the candidate best papers encompassed a broad range of topics, with ontology and knowledge graph creation remaining a considerable research focus.
Affiliation(s)
- Jean Charlet
- Sorbonne Université, INSERM, Univ Sorbonne Paris Nord, LIMICS, Paris, France
- AP-HP, DRCI, Paris, France
- Licong Cui
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
|
16
|
Della Mea V, Almborg AH, Martinuzzi M, Tu SW, Martinuzzi A. Harmonization of ICF Body Structures and ICD-11 Anatomic Detail: One foundation for multiple classifications. PLoS One 2023; 18:e0280106. [PMID: 37498874 PMCID: PMC10374130 DOI: 10.1371/journal.pone.0280106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2022] [Accepted: 04/08/2023] [Indexed: 07/29/2023] Open
Abstract
The Family of International Classifications of the World Health Organization (WHO-FIC) currently includes three reference classifications, namely International Classification of Diseases (ICD), International Classification of Functioning, Disability, and Health (ICF), and International Classification of Health Interventions (ICHI). Recently, the three classifications have been incorporated into a single WHO-FIC Foundation that serves as the repository of all concepts in the classifications. Each classification serves a specific classification need. However, they share some common concepts that are present, in different forms, in two or all of them. For the WHO-FIC Foundation to be a logically consistent repository without duplicates, these common concepts must be reconciled. One important set of shared concepts is the representation of human anatomy entities, which are not always modeled in the same way and with the same level of detail. To understand the relationships among the three anatomical representations, an effort is needed to compare them, identifying common areas, gaps, and compatible and incompatible modeling. The work presented here contributes to this effort, focusing on the anatomy representations in ICF and ICD-11. For this aim, three experts were asked to identify, for each entity in the ICF Body Structures, one or more entities in the ICD-11 Anatomic Detail that could be considered identical, broader or narrower. To do this, they used a specifically developed web application, which also automatically identified the most obvious equivalences. A total of 631 maps were independently identified by the three mappers for 218 ICF Body Structures, with an interobserver agreement of 93.5%. Together with 113 maps identified by the software, they were then consolidated into 434 relations. 
The results highlight some differences between the two classifications: in general, ICF is less detailed than ICD-11; ICF favors lumping of structures; in very few cases, the two classifications follow different anatomic models. For these issues, solutions have to be found that are compliant with the WHO approach to classification modeling and maintenance.
Affiliation(s)
- Vincenzo Della Mea
- Dept. of Mathematics, Computer Science and Physics, University of Udine, Udine, Italy
- Italian WHO-FIC Collaboration Center, Trieste, Italy
- Ann-Helene Almborg
- National Board of Health and Welfare, Stockholm, Sweden
- Nordic WHO-FIC Collaboration Center, Oslo, Norway
- Samson W Tu
- Stanford Center for Biomedical Informatics Research, Stanford University, Stanford, CA, United States of America
- Stanford WHO-FIC Collaboration Center, Stanford, CA, United States of America
- Andrea Martinuzzi
- Department of Neurorehabilitation, Conegliano Research Centre, RCCS Medea, Conegliano, Italy
- Italian WHO-FIC Collaboration Center, Trieste, Italy
|
17
|
Callahan TJ, Stefanski AL, Wyrwa JM, Zeng C, Ostropolets A, Banda JM, Baumgartner WA, Boyce RD, Casiraghi E, Coleman BD, Collins JH, Deakyne Davies SJ, Feinstein JA, Lin AY, Martin B, Matentzoglu NA, Meeker D, Reese J, Sinclair J, Taneja SB, Trinkley KE, Vasilevsky NA, Williams AE, Zhang XA, Denny JC, Ryan PB, Hripcsak G, Bennett TD, Haendel MA, Robinson PN, Hunter LE, Kahn MG. Ontologizing health systems data at scale: making translational discovery a reality. NPJ Digit Med 2023; 6:89. [PMID: 37208468 PMCID: PMC10196319 DOI: 10.1038/s41746-023-00830-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Accepted: 04/28/2023] [Indexed: 05/21/2023] Open
Abstract
Common data models solve many challenges of standardizing electronic health record (EHR) data but are unable to semantically integrate all of the resources needed for deep phenotyping. Open Biological and Biomedical Ontology (OBO) Foundry ontologies provide computable representations of biological knowledge and enable the integration of heterogeneous data. However, mapping EHR data to OBO ontologies requires significant manual curation and domain expertise. We introduce OMOP2OBO, an algorithm for mapping Observational Medical Outcomes Partnership (OMOP) vocabularies to OBO ontologies. Using OMOP2OBO, we produced mappings for 92,367 conditions, 8611 drug ingredients, and 10,673 measurement results, which covered 68-99% of concepts used in clinical practice when examined across 24 hospitals. When used to phenotype rare disease patients, the mappings helped systematically identify undiagnosed patients who might benefit from genetic testing. By aligning OMOP vocabularies to OBO ontologies our algorithm presents new opportunities to advance EHR-based deep phenotyping.
Affiliation(s)
- Tiffany J Callahan
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
- Adrianne L Stefanski
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Jordan M Wyrwa
- Department of Physical Medicine and Rehabilitation, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Chenjie Zeng
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Anna Ostropolets
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
- Juan M Banda
- Department of Computer Science, Georgia State University, Atlanta, GA, 30303, USA
- William A Baumgartner
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Richard D Boyce
- Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15260, USA
- Elena Casiraghi
- Computer Science, Università degli Studi di Milano, Milan, Italy
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
- Ben D Coleman
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
- Janine H Collins
- Department of Haematology, University of Cambridge, Cambridge, UK
- Sara J Deakyne Davies
- Department of Research Informatics & Data Science, Analytics Resource Center, Children's Hospital Colorado, Aurora, CO, 80045, USA
- James A Feinstein
- Adult and Child Center for Health Outcomes Research and Delivery Science (ACCORDS), University of Colorado Anschutz School of Medicine, Aurora, CO, 80045, USA
- Asiyah Y Lin
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Blake Martin
- Departments of Biomedical Informatics and Pediatrics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Justin Reese
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Sanya B Taneja
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, 15260, USA
- Katy E Trinkley
- Department of Family Medicine, University of Colorado Anschutz School of Medicine, Aurora, CO, 80045, USA
- Nicole A Vasilevsky
- Translational and Integrative Sciences Lab, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Andrew E Williams
- Tufts Institute for Clinical Research and Health Policy Studies, Tufts University, Boston, MA, 02155, USA
- Xingmin A Zhang
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
- Joshua C Denny
- National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Patrick B Ryan
- Janssen Research and Development, Raritan, NJ, 08869, USA
- George Hripcsak
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
- Tellen D Bennett
- Departments of Biomedical Informatics and Pediatrics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Melissa A Haendel
- Departments of Biomedical Informatics and Pediatrics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT, 06032, USA
- Lawrence E Hunter
- Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Michael G Kahn
- Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
|
18
|
Stefancsik R, Balhoff JP, Balk MA, Ball R, Bello SM, Caron AR, Chessler E, de Souza V, Gehrke S, Haendel M, Harris LW, Harris NL, Ibrahim A, Koehler S, Matentzoglu N, McMurry JA, Mungall CJ, Munoz-Torres MC, Putman T, Robinson P, Smedley D, Sollis E, Thessen AE, Vasilevsky N, Walton DO, Osumi-Sutherland D. The Ontology of Biological Attributes (OBA) - Computational Traits for the Life Sciences. bioRxiv 2023:2023.01.26.525742. [PMID: 36747660 PMCID: PMC9900877 DOI: 10.1101/2023.01.26.525742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2023]
Abstract
Existing phenotype ontologies were originally developed to represent phenotypes that manifest as a character state in relation to a wild-type or other reference. However, these do not include the phenotypic trait or attribute categories required for the annotation of genome-wide association studies (GWAS), Quantitative Trait Loci (QTL) mappings or any population-focused measurable trait data. Moreover, variations in gene expression in response to environmental disturbances even without any genetic alterations can also be associated with particular biological attributes. The integration of trait and biological attribute information with an ever increasing body of chemical, environmental and biological data greatly facilitates computational analyses and it is also highly relevant to biomedical and clinical applications. The Ontology of Biological Attributes (OBA) is a formalised, species-independent collection of interoperable phenotypic trait categories that is intended to fulfil a data integration role. OBA is a standardised representational framework for observable attributes that are characteristics of biological entities, organisms, or parts of organisms. OBA has a modular design which provides several benefits for users and data integrators, including an automated and meaningful classification of trait terms computed on the basis of logical inferences drawn from domain-specific ontologies for cells, anatomical and other relevant entities. The logical axioms in OBA also provide a previously missing bridge that can computationally link Mendelian phenotypes with GWAS and quantitative traits. The term components in OBA provide semantic links and enable knowledge and data integration across specialised research community boundaries, thereby breaking silos.
Affiliation(s)
- Ray Stefancsik
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- James P. Balhoff
- Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC 27517, USA
- Meghan A. Balk
- National Ecological Observatory Network, Battelle, Boulder, CO 80301, USA
- Robyn Ball
- The Jackson Laboratory, Bar Harbor, ME 04609, USA
- Anita R. Caron
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Vinicius de Souza
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Sarah Gehrke
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, USA
- Melissa Haendel
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, USA
- Laura W. Harris
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Nomi L. Harris
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Arwa Ibrahim
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Julie A. McMurry
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, USA
- Christopher J. Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Tim Putman
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, USA
- Damian Smedley
- William Harvey Research Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, London EC1M 6BQ, UK
- Elliot Sollis
- European Bioinformatics Institute (EMBL-EBI), Hinxton, Cambridgeshire, CB10 1SD, UK
- Anne E Thessen
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, USA
- Nicole Vasilevsky
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, USA
|
19
|
Unifying the identification of biomedical entities with the Bioregistry. Sci Data 2022; 9:714. [DOI: 10.1038/s41597-022-01807-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2022] [Accepted: 10/26/2022] [Indexed: 11/21/2022] Open
Abstract
The standardized identification of biomedical entities is a cornerstone of interoperability, reuse, and data integration in the life sciences. Several registries have been developed to catalog resources maintaining identifiers for biomedical entities such as small molecules, proteins, cell lines, and clinical trials. However, existing registries have struggled to provide sufficient coverage and metadata standards that meet the evolving needs of modern life sciences researchers. Here, we introduce the Bioregistry, an integrative, open, community-driven metaregistry that synthesizes and substantially expands upon 23 existing registries. The Bioregistry addresses the need for a sustainable registry by leveraging public infrastructure and automation, and employing a progressive governance model centered around open code and open data to foster community contribution. The Bioregistry can be used to support the standardized annotation of data, models, ontologies, and scientific literature, thereby promoting their interoperability and reuse. The Bioregistry can be accessed through https://bioregistry.io and its source code and data are available under the MIT and CC0 Licenses at https://github.com/biopragmatics/bioregistry.
|
20
|
Matentzoglu N, Goutte-Gattat D, Tan SZK, Balhoff JP, Carbon S, Caron AR, Duncan WD, Flack JE, Haendel M, Harris NL, Hogan WR, Hoyt CT, Jackson RC, Kim H, Kir H, Larralde M, McMurry JA, Overton JA, Peters B, Pilgrim C, Stefancsik R, Robb SMC, Toro S, Vasilevsky NA, Walls R, Mungall CJ, Osumi-Sutherland D. Ontology Development Kit: a toolkit for building, maintaining and standardizing biomedical ontologies. Database (Oxford) 2022; 2022:6754192. [PMID: 36208225 PMCID: PMC9547537 DOI: 10.1093/database/baac087] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 07/07/2022] [Revised: 08/19/2022] [Accepted: 09/23/2022] [Indexed: 11/21/2022]
Abstract
Similar to managing software packages, managing the ontology life cycle involves multiple complex workflows such as preparing releases, continuous quality control checking and dependency management. To manage these processes, a diverse set of tools is required, from command-line utilities to powerful ontology-engineering environments. Particularly in the biomedical domain, which has developed a set of highly diverse yet inter-dependent ontologies, standardizing release practices and metadata and establishing shared quality standards are crucial to enable interoperability. The Ontology Development Kit (ODK) provides a set of standardized, customizable and automatically executable workflows, and packages all required tooling in a single Docker image. In this paper, we provide an overview of how the ODK works, show how it is used in practice and describe how we envision it driving standardization efforts in our community. Database URL: https://github.com/INCATools/ontology-development-kit.
Affiliation(s)
- Damien Goutte-Gattat
- Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Shawn Zheng Kai Tan
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- James P Balhoff
- RENCI, University of North Carolina, Chapel Hill, NC 27517, USA
- Seth Carbon
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA
- Anita R Caron
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- William D Duncan
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA
- College of Dentistry; Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, William D. Duncan: 1395 Center Dr, Gainesville, William R. Hogan: 1600 SW Archer Rd, Gainesville, FL 32610, USA
- Joe E Flack
- School of Medicine, Johns Hopkins University, 733 N Broadway, Baltimore, MD 21205, USA
- Melissa Haendel
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
- Nomi L Harris
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA
- William R Hogan
- College of Dentistry; Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, William D. Duncan: 1395 Center Dr, Gainesville, William R. Hogan: 1600 SW Archer Rd, Gainesville, FL 32610, USA
- Charles Tapley Hoyt
- Laboratory of Systems Pharmacology, Harvard Medical School, 200 Longwood Avenue Armenise Building Room 109, Boston, MA 02115, USA
- Rebecca C Jackson
- Bend Informatics LLC, 5305 River Rd North, Ste B, Keizer, OR 97303, USA
- Huseyin Kir
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Martin Larralde
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, Meyerhofstraße 1, Heidelberg 69117, Germany
- Julie A McMurry
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
- Bjoern Peters
- Institute for Allergy & Immunology, La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA 92037, USA
- Clare Pilgrim
- Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge, CB2 3DY, UK
- Ray Stefancsik
- Samples Phenotypes and Ontologies Team (SPOT), European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, UK
- Sofia MC Robb
- Stowers Institute for Medical Research, 1000 E. 50th St., Kansas City, MO 64110, USA
- Sabrina Toro
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
- Nicole A Vasilevsky
- University of Colorado Anschutz Medical Campus, 13001 E 17th Pl, Aurora, CO 80045, USA
- Ramona Walls
- Critical Path Institute, 1730 E River Road, Tucson, AZ 85718, USA
- Christopher J Mungall
- Berkeley Bioinformatics Open-source Projects (BBOP), Lawrence Berkeley National Laboratory (LBNL), 1 Cyclotron Road, Mailstop 977-0257, Berkeley, CA 94720, USA
|