1. Kushida T, de Farias TM, Sima AC, Dessimoz C, Chiba H, Bastian FB, Masuya H. Federated SPARQL query performance evaluation for exploring disease model mouse: combining gene expression, orthology, and disease knowledge graphs. BMC Med Inform Decis Mak 2025; 25:189. [PMID: 40380154 DOI: 10.1186/s12911-025-03013-8] [Received: 08/15/2023] [Accepted: 04/09/2025] [Indexed: 05/19/2025]
Abstract
BACKGROUND The RIKEN BRC develops and maintains the RIKEN BioResource MetaDatabase to help users explore appropriate target bioresources for their experiments and prepare precise and high-quality data infrastructures. The Swiss Institute of Bioinformatics develops two multi-species databases for the study of gene expression and orthology: Bgee and Orthologous MAtrix (OMA, an orthology database). METHODS This study combines the RIKEN BioResource data with Resource Description Framework (RDF) datasets from Bgee (a gene expression database), OMA, DisGeNET (a human gene-disease association database), Mouse Genome Informatics (MGI), UniProt, and four disease ontologies in the RIKEN BioResource MetaDatabase. Our aim is to evaluate the distributed SPARQL query performance when exploring which model organisms are most appropriate for specific medical science research applications across the aforementioned interoperable datasets. More precisely, in our biomedical use cases, we investigate disease-related genes and the anatomical parts where these genes are expressed, and subsequently identify appropriate bioresource candidates available for specific disease research applications. RESULTS We illustrate the above through two use cases targeting either Alzheimer's disease or melanoma. We identified 14 Alzheimer's disease-related genes that were expressed in the prefrontal cortex (e.g., APP and APOE) and 55 RIKEN bioresources, which were genetically modified mice related to these genes, predicted to be relevant to Alzheimer's disease research. Furthermore, executing a transitive search for the Uberon terms by using the Property Paths function, we identified 14 melanoma-related genes (e.g., HRAS and PTEN) and 12 anatomical parts in which these genes were expressed, such as the "skin of limb".
Finally, we compared the performance of the federated SPARQL query via the remote Bgee SPARQL endpoint with the performance of a centralized SPARQL query using the Bgee dataset as part of the RIKEN BioResource MetaDatabase. CONCLUSIONS We confirmed that query performance degraded under the federated approach. We reduced this degradation for federated queries from the BioResource MetaDatabase to the SIB by refining the transferred data through a subquery and by enhancing the server specifications, thereby optimizing the triple store query evaluation.
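The idea of refining the transferred data through a subquery can be sketched as follows. This is a minimal illustration, not the query used in the study: the `ex:` vocabulary, the predicate names, and the endpoint URL are hypothetical placeholders, and the helper only builds a SPARQL 1.1 query string. The point it shows is that the gene filter and `DISTINCT` sit inside the `SERVICE` block, so the remote endpoint returns only the rows the local join needs.

```python
def build_federated_query(endpoint, gene_symbols):
    """Build a federated SPARQL query string (hypothetical vocabulary).

    The subquery inside SERVICE restricts the remote result set to the
    requested genes before anything is sent back to the local triple store.
    Gene symbols are inserted as plain literals without escaping, so this
    sketch assumes trusted input.
    """
    values = " ".join(f'"{symbol}"' for symbol in gene_symbols)
    return f"""PREFIX ex: <http://example.org/vocab#>
SELECT ?resource ?gene ?anatomy WHERE {{
  SERVICE <{endpoint}> {{
    SELECT DISTINCT ?gene ?anatomy WHERE {{
      VALUES ?geneId {{ {values} }}
      ?gene ex:identifier ?geneId ;
            ex:expressedIn ?anatomy .
    }}
  }}
  ?resource ex:relatedGene ?gene .
}}"""


# Placeholder endpoint; a real deployment would use its own endpoint URL.
query = build_federated_query("https://example.org/sparql", ["APP", "APOE"])
print(query)
```

Without the subquery, a federated engine may fetch every expression triple from the remote endpoint and filter locally, which is one plausible source of the degradation the abstract describes.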
Affiliation(s)
- Tarcisio Mendes de Farias
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - University of Lausanne, Lausanne, Switzerland
- Ana C Sima
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Christophe Dessimoz
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - University of Lausanne, Lausanne, Switzerland
- Hirokazu Chiba
  - Database Center for Life Science, DS, ROIS, Kashiwa-shi, Japan
- Frederic B Bastian
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
  - University of Lausanne, Lausanne, Switzerland
2. Vendetti J, Harris NL, Dorf MV, Skrenchuk A, Caufield JH, Gonçalves RS, Graybeal JB, Hegde H, Redmond T, Mungall CJ, Musen MA. BioPortal: an open community resource for sharing, searching, and utilizing biomedical ontologies. Nucleic Acids Res 2025:gkaf402. [PMID: 40357648 DOI: 10.1093/nar/gkaf402] [Received: 03/04/2025] [Revised: 04/11/2025] [Accepted: 05/02/2025] [Indexed: 05/15/2025]
Abstract
BioPortal (https://bioportal.bioontology.org) is the world's most comprehensive repository of biomedical ontologies. It provides infrastructure for finding, sharing, searching, and utilizing biomedical ontologies. Launched in 2005, BioPortal now includes 1549 ontologies (1182 of them public). Its open, freely accessible website enables anyone (i) to browse the ontology library, (ii) to search for terms across ontologies, (iii) to browse mappings between terms, (iv) to see popularity ratings and recommendations on which ontologies are most relevant to their use cases, (v) to annotate text with ontology terms, (vi) to submit an ontology, and (vii) to request ontology changes. The library of ontologies can be accessed programmatically via a REST application programming interface (API). Recent enhancements include a BioPortal knowledge graph that integrates knowledge from multiple ontologies; a unified data model for interoperability with other knowledge sources; ontology popularity ratings and recommendations for relevant ontologies; and the ability to request ontology changes via a simple user interface that automatically converts user change requests to GitHub Pull Requests that specify the edits that will be made to the ontology upon approval.
Affiliation(s)
- Jennifer Vendetti
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- Nomi L Harris
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Michael V Dorf
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- Alex Skrenchuk
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- J Harry Caufield
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Rafael S Gonçalves
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- John B Graybeal
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- Harshad Hegde
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Timothy Redmond
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- Christopher J Mungall
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Mark A Musen
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
3. McLaughlin J, Lagrimas J, Iqbal H, Parkinson H, Harmse H. OLS4: a new Ontology Lookup Service for a growing interdisciplinary knowledge ecosystem. Bioinformatics 2025; 41:btaf279. [PMID: 40323307 DOI: 10.1093/bioinformatics/btaf279] [Received: 01/23/2025] [Revised: 03/24/2025] [Accepted: 05/01/2025] [Indexed: 05/23/2025]
Abstract
SUMMARY The Ontology Lookup Service (OLS) is an open source search engine for ontologies which is used extensively in the bioinformatics and chemistry communities to annotate biological and biomedical data with ontology terms. Recently, there has been a significant increase in the size and complexity of ontologies due to new scales of biological knowledge, such as spatial transcriptomics, new ontology development methodologies, and curation on an increased scale. Existing Web-based tools for ontology browsing such as BioPortal and OntoBee do not support the full range of definitions used by today's ontologies. In order to support the community going forward, we have developed OLS4, implementing the complete OWL2 specification, internationalization support for multiple languages, and a new user interface with UX enhancements such as links out to external databases. OLS4 has replaced OLS3 in production at EMBL-EBI and has a backward compatible API supporting users of OLS3 to transition. AVAILABILITY AND IMPLEMENTATION The source code of OLS is available at https://github.com/EBISPOT/ols4 and DOI 10.5281/zenodo.14960290 with Apache 2.0 License. A freely available implementation is accessible at https://www.ebi.ac.uk/ols4.
Affiliation(s)
- James McLaughlin
  - Samples, Phenotypes and Ontologies Team (SPOT), EMBL-EBI, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Josh Lagrimas
  - Samples, Phenotypes and Ontologies Team (SPOT), EMBL-EBI, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Haider Iqbal
  - Samples, Phenotypes and Ontologies Team (SPOT), EMBL-EBI, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Helen Parkinson
  - Samples, Phenotypes and Ontologies Team (SPOT), EMBL-EBI, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
- Henriette Harmse
  - Samples, Phenotypes and Ontologies Team (SPOT), EMBL-EBI, Wellcome Genome Campus, Hinxton CB10 1SD, United Kingdom
4. Hegde H, Vendetti J, Goutte-Gattat D, Caufield JH, Graybeal JB, Harris NL, Karam N, Kindermann C, Matentzoglu N, Overton JA, Musen MA, Mungall CJ. A change language for ontologies and knowledge graphs. Database (Oxford) 2025; 2025:baae133. [PMID: 39841813 PMCID: PMC11753292 DOI: 10.1093/database/baae133] [Received: 09/20/2024] [Revised: 11/21/2024] [Accepted: 12/30/2024] [Indexed: 01/24/2025]
Abstract
Ontologies and knowledge graphs (KGs) are general-purpose computable representations of some domain, such as human anatomy, and are frequently a crucial part of modern information systems. Most of these structures change over time, incorporating new knowledge or information that was previously missing. Managing these changes is a challenge, both in terms of communicating changes to users and providing mechanisms to make it easier for multiple stakeholders to contribute. To fill that need, we have created KGCL, the Knowledge Graph Change Language (https://github.com/INCATools/kgcl), a standard data model for describing changes to KGs and ontologies at a high level, and an accompanying human-readable Controlled Natural Language (CNL). This language serves two purposes: a curator can use it to request desired changes, and it can also be used to describe changes that have already happened, corresponding to the concepts of "apply patch" and "diff" commonly used for managing changes in text documents and computer programs. Another key feature of KGCL is that descriptions are at a high enough level to be useful and understood by a variety of stakeholders; for example, ontology edits can be specified by commands like "add synonym 'arm' to 'forelimb'" or "move 'Parkinson disease' under 'neurodegenerative disease'." We have also built a suite of tools for managing ontology changes. These include an automated agent that integrates with and monitors GitHub ontology repositories and applies any requested changes, and a new component in the BioPortal ontology resource that allows users to make change requests directly from within the BioPortal user interface. Overall, the KGCL data model, its CNL, and associated tooling allow for easier management and processing of changes associated with the development of ontologies and KGs. Database URL: https://github.com/INCATools/kgcl.
Affiliation(s)
- Harshad Hegde
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, One Cyclotron Rd., Berkeley, CA 94720, United States
- Jennifer Vendetti
  - Center for Biomedical Informatics Research, Stanford University, 3180 Porter Dr., Palo Alto, CA 94304, United States
- Damien Goutte-Gattat
  - Department of Physiology, Development and Neuroscience, University of Cambridge, Downing Street, Cambridge CB2 3DY, United Kingdom
- J Harry Caufield
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, One Cyclotron Rd., Berkeley, CA 94720, United States
- John B Graybeal
  - Center for Biomedical Informatics Research, Stanford University, 3180 Porter Dr., Palo Alto, CA 94304, United States
- Nomi L Harris
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, One Cyclotron Rd., Berkeley, CA 94720, United States
- Naouel Karam
  - Institute for Applied Informatics (InfAI), Leipzig University, Goerdelerring 9, Leipzig 04109, Germany
- Christian Kindermann
  - Center for Biomedical Informatics Research, Stanford University, 3180 Porter Dr., Palo Alto, CA 94304, United States
- James A Overton
  - Knocean Inc., 2 - 107 Quebec Ave., Toronto, Ontario M6P 2T3, Canada
- Mark A Musen
  - Center for Biomedical Informatics Research, Stanford University, 3180 Porter Dr., Palo Alto, CA 94304, United States
- Christopher J Mungall
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, One Cyclotron Rd., Berkeley, CA 94720, United States
5. Charlet J, Cui L. Knowledge Representation and Management: 2023 Highlights and the Rise of Knowledge Graph Embeddings. Yearb Med Inform 2024; 33:223-226. [PMID: 40199309 PMCID: PMC12020553 DOI: 10.1055/s-0044-1800748] [Indexed: 04/10/2025]
Abstract
OBJECTIVES We aim to identify, select, and summarize the best papers published in 2023 for the Knowledge Representation and Management (KRM) section of the International Medical Informatics Association (IMIA) Yearbook. METHODS We performed PubMed queries and adhered to the IMIA Yearbook guidelines for conducting biomedical informatics literature review to select the best papers in KRM published in 2023. RESULTS Our search yielded a total of 1,666 publications from PubMed. From these, we identified 15 papers as potential candidates for the best papers, and three of them were finally selected as the best papers in the KRM section. The candidate best papers covered three main topics: knowledge graph, knowledge interoperability, and ontology. Notably, two of the three selected best papers explored the potential of knowledge graph embeddings for predicting intensive care unit readmissions and measuring disease distances, respectively. CONCLUSIONS The selection process for the best papers in the KRM section for 2023 showcased a wide spectrum of topics, with knowledge graph embeddings emerging as a promising area for supporting machine learning applications in biomedicine.
Affiliation(s)
- Jean Charlet
  - Sorbonne Université, INSERM, Univ Sorbonne Paris Nord, LIMICS, Paris, France
  - AP-HP, DRCI, Paris, France
- Licong Cui
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
6. Joachimiak MP, Caufield JH, Harris NL, Kim H, Mungall CJ. Gene Set Summarization Using Large Language Models. arXiv 2024:arXiv:2305.13338v3. [PMID: 37292480 PMCID: PMC10246080] [Indexed: 06/10/2023]
Abstract
Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling Large Language Models (LLMs) to use scientific texts directly and avoid reliance on a KB. TALISMAN (Terminological ArtificiaL Intelligence SuMmarization of Annotation and Narratives) uses generative AI to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct retrieval from the model. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for an input gene set. However, LLM-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, in our experiments these methods were rarely able to recapitulate the most precise and informative term from standard enrichment analysis. We also observe minor differences depending on prompt input information, with GO term descriptions leading to higher recall but lower precision. However, newer LLM models perform statistically significantly better than the oldest model across all performance metrics, suggesting that future models may lead to further improvements. Overall, the results are nondeterministic, with minor variations in prompt resulting in radically different term lists, true to the stochastic nature of LLMs. 
Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis; however, they may provide summarization benefits for implicit knowledge integration across extant but unstandardized knowledge, for large sets of features, and where the amount of information is difficult for humans to process.
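For context, the standard enrichment analysis that the abstract uses as its baseline is commonly computed as a one-sided hypergeometric test, which is what yields the p-values that LLM-based summaries cannot provide. The sketch below is illustrative only: the gene and term names are made up, it is not TALISMAN code, and it applies no multiple-testing correction.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn from N, of which K carry the term."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def enrich(gene_set, annotations, background):
    """Return (term, overlap, p_value) tuples sorted by ascending p-value."""
    background = set(background)
    gene_set = set(gene_set) & background
    N, n = len(background), len(gene_set)
    results = []
    for term, genes in annotations.items():
        annotated = set(genes) & background
        k = len(gene_set & annotated)
        if k == 0:
            continue  # terms with no overlap cannot be over-represented
        results.append((term, k, hypergeom_sf(k, N, len(annotated), n)))
    return sorted(results, key=lambda r: r[2])


# Toy data: 20 background genes and two hypothetical terms.
background = [f"g{i}" for i in range(20)]
annotations = {
    "term_A": ["g0", "g1", "g2", "g3"],
    "term_B": ["g0"] + [f"g{i}" for i in range(10, 19)],
}
for term, overlap, p in enrich(["g0", "g1", "g2", "g5"], annotations, background):
    print(term, overlap, round(p, 4))
```

In practice the p-values would be corrected for multiple testing across all terms; that step is omitted here to keep the sketch short.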
Affiliation(s)
- Marcin P Joachimiak
  - Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- J Harry Caufield
  - Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Nomi L Harris
  - Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Christopher J Mungall
  - Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
7. Di Maria A, Bellomo L, Billeci F, Cardillo A, Alaimo S, Ferragina P, Ferro A, Pulvirenti A. NetMe 2.0: a web-based platform for extracting and modeling knowledge from biomedical literature as a labeled graph. Bioinformatics 2024; 40:btae194. [PMID: 38597890 PMCID: PMC11074003 DOI: 10.1093/bioinformatics/btae194] [Received: 01/19/2024] [Revised: 03/29/2024] [Accepted: 04/08/2024] [Indexed: 04/11/2024]
Abstract
MOTIVATION The rapid increase of biomedical literature makes it harder and harder for scientists to keep pace with the discoveries on which they build their studies. Therefore, computational tools have become more widespread, among which network analysis plays a crucial role in several life-science contexts. Nevertheless, building correct and complete networks about some user-defined biomedical topics on top of the available literature is still challenging. RESULTS We introduce NetMe 2.0, a web-based platform that automatically extracts relevant biomedical entities and their relations from a set of input texts (i.e., full texts or abstracts of PubMed Central papers, free texts, or PDFs uploaded by users) and models them as a BioMedical Knowledge Graph (BKG). NetMe 2.0 also implements an innovative Retrieval Augmented Generation module (Graph-RAG) that works on top of the relationships modeled by the BKG and allows the distilling of well-formed sentences that explain their content. The experimental results show that NetMe 2.0 can infer comprehensive and reliable biological networks with significant Precision-Recall metrics when compared to state-of-the-art approaches. AVAILABILITY AND IMPLEMENTATION https://netme.click/.
Affiliation(s)
- Antonio Di Maria
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
- Fabrizio Billeci
  - Department of Computer Science, University of Catania, Catania, 95125, Italy
- Alfio Cardillo
  - Department of Computer Science, University of Catania, Catania, 95125, Italy
- Salvatore Alaimo
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
- Paolo Ferragina
  - Department of Computer Science, University of Pisa, Pisa, 56126, Italy
- Alfredo Ferro
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
- Alfredo Pulvirenti
  - Department of Clinical and Experimental Medicine, University of Catania, Catania, 95125, Italy
8. Callahan TJ, Tripodi IJ, Stefanski AL, Cappelletti L, Taneja SB, Wyrwa JM, Casiraghi E, Matentzoglu NA, Reese J, Silverstein JC, Hoyt CT, Boyce RD, Malec SA, Unni DR, Joachimiak MP, Robinson PN, Mungall CJ, Cavalleri E, Fontana T, Valentini G, Mesiti M, Gillenwater LA, Santangelo B, Vasilevsky NA, Hoehndorf R, Bennett TD, Ryan PB, Hripcsak G, Kahn MG, Bada M, Baumgartner WA, Hunter LE. An open source knowledge graph ecosystem for the life sciences. Sci Data 2024; 11:363. [PMID: 38605048 PMCID: PMC11009265 DOI: 10.1038/s41597-024-03171-w] [Received: 07/26/2023] [Accepted: 03/21/2024] [Indexed: 04/13/2024]
Abstract
Translational research requires data at multiple scales of biological organization. Advancements in sequencing and multi-omics technologies have increased the availability of these data, but researchers face significant integration challenges. Knowledge graphs (KGs) are used to model complex phenomena, and methods exist to construct them automatically. However, tackling complex biomedical integration problems requires flexibility in the way knowledge is modeled. Moreover, existing KG construction methods provide robust tooling at the cost of fixed or limited choices among knowledge representation models. PheKnowLator (Phenotype Knowledge Translator) is a semantic ecosystem for automating the FAIR (Findable, Accessible, Interoperable, and Reusable) construction of ontologically grounded KGs with fully customizable knowledge representation. The ecosystem includes KG construction resources (e.g., data preparation APIs), analysis tools (e.g., SPARQL endpoint resources and abstraction algorithms), and benchmarks (e.g., prebuilt KGs). We evaluated the ecosystem by systematically comparing it to existing open-source KG construction methods and by analyzing its computational performance when used to construct 12 different large-scale KGs. With flexible knowledge representation, PheKnowLator enables fully customizable KGs without compromising performance or usability.
Affiliation(s)
- Tiffany J Callahan
  - Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
- Ignacio J Tripodi
  - Computer Science Department, Interdisciplinary Quantitative Biology, University of Colorado Boulder, Boulder, CO, 80301, USA
- Adrianne L Stefanski
  - Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Luca Cappelletti
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- Sanya B Taneja
  - Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, 15260, USA
- Jordan M Wyrwa
  - Department of Physical Medicine and Rehabilitation, School of Medicine, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
- Elena Casiraghi
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Justin Reese
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Jonathan C Silverstein
  - Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15206, USA
- Charles Tapley Hoyt
  - Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, 02115, USA
- Richard D Boyce
  - Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15206, USA
- Scott A Malec
  - Division of Translational Informatics, University of New Mexico School of Medicine, Albuquerque, NM, 87131, USA
- Deepak R Unni
  - SIB Swiss Institute of Bioinformatics, Basel, Switzerland
- Marcin P Joachimiak
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Peter N Robinson
  - Berlin Institute of Health at Charité-Universitätsmedizin, 10117, Berlin, Germany
- Christopher J Mungall
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Emanuele Cavalleri
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- Tommaso Fontana
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- Giorgio Valentini
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
  - ELLIS, European Laboratory for Learning and Intelligent Systems, Milan Unit, Italy
- Marco Mesiti
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Via Celoria 18, 20133, Milan, Italy
- Lucas A Gillenwater
  - Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Brook Santangelo
  - Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Nicole A Vasilevsky
  - Data Collaboration Center, Critical Path Institute, 1840 E River Rd. Suite 100, Tucson, AZ, 85718, USA
- Robert Hoehndorf
  - Computer, Electrical and Mathematical Sciences & Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, 23955-6900, Kingdom of Saudi Arabia
- Tellen D Bennett
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
  - Department of Pediatrics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Patrick B Ryan
  - Janssen Research and Development, Raritan, NJ, 08869, USA
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, 10032, USA
- Michael G Kahn
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Michael Bada
  - Division of General Internal Medicine, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- William A Baumgartner
  - Division of General Internal Medicine, University of Colorado School of Medicine, Aurora, CO, 80045, USA
- Lawrence E Hunter
  - Computational Bioscience Program, University of Colorado Anschutz Medical Campus, Aurora, CO, 80045, USA
  - Department of Biomedical Informatics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
9. Cappelletti L, Rekerle L, Fontana T, Hansen P, Casiraghi E, Ravanmehr V, Mungall CJ, Yang JJ, Spranger L, Karlebach G, Caufield JH, Carmody L, Coleman B, Oprea TI, Reese J, Valentini G, Robinson PN. Node-degree aware edge sampling mitigates inflated classification performance in biomedical random walk-based graph representation learning. Bioinformatics Advances 2024; 4:vbae036. [PMID: 38577542 PMCID: PMC10994718 DOI: 10.1093/bioadv/vbae036] [Received: 01/11/2023] [Revised: 01/31/2024] [Accepted: 02/29/2024] [Indexed: 04/06/2024]
Abstract
MOTIVATION Graph representation learning is a family of related approaches that learn low-dimensional vector representations of nodes and other graph elements, called embeddings. Embeddings approximate characteristics of the graph and can be used for a variety of machine-learning tasks such as novel edge prediction. For many biomedical applications, partial knowledge exists about positive edges that represent relationships between pairs of entities, but little to no knowledge is available about negative edges that represent the explicit lack of a relationship between two nodes. For this reason, classification procedures are forced to assume that the vast majority of unlabeled edges are negative. Existing approaches to sampling negative edges for training and evaluating classifiers do so by uniformly sampling pairs of nodes. RESULTS We show here that this sampling strategy typically leads to sets of positive and negative examples with imbalanced node degree distributions. Using a representative heterogeneous biomedical knowledge graph and random walk-based graph machine learning, we show that this strategy substantially impacts classification performance. If users of graph machine-learning models apply the models to prioritize examples that are drawn from approximately the same distribution as the positive examples, then the performance of models as estimated in the validation phase may be artificially inflated. We present a degree-aware node sampling approach that mitigates this effect and is simple to implement. AVAILABILITY AND IMPLEMENTATION Our code and data are publicly available at https://github.com/monarch-initiative/negativeExampleSelection.
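One simple way to realize degree-aware negative sampling is to draw both endpoints of a candidate negative edge with probability proportional to their degree in the positive graph, then reject self-loops and known edges. The sketch below illustrates that idea under stated assumptions; it is not the authors' implementation (see their repository for that), and the function name, parameters, and toy graph are hypothetical.

```python
import random


def degree_aware_negatives(positive_edges, num_samples, seed=0, max_tries=100_000):
    """Sample undirected negative edges whose endpoints follow the positive degree distribution."""
    rng = random.Random(seed)
    known = {frozenset(edge) for edge in positive_edges}
    # Each node appears once per incident positive edge, so a uniform draw
    # from this list selects nodes with probability proportional to degree.
    endpoints = [node for edge in positive_edges for node in edge]
    negatives = set()
    tries = 0
    while len(negatives) < num_samples and tries < max_tries:
        tries += 1
        u, v = rng.choice(endpoints), rng.choice(endpoints)
        pair = frozenset((u, v))
        if u == v or pair in known or pair in negatives:
            continue  # reject self-loops, positives, and duplicates
        negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


# Toy graph: one hub "h" with five neighbours plus an isolated edge.
positives = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d"), ("h", "e"), ("x", "y")]
for edge in degree_aware_negatives(positives, num_samples=5):
    print(edge)
```

Uniform sampling over node pairs would pick the hub in roughly one pair in four on this toy graph, whereas it takes part in five of six positive edges; drawing from the endpoint list narrows that gap, which is the imbalance the abstract describes.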
Affiliation(s)
- Luca Cappelletti
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy
- Lauren Rekerle
  - The Jackson Laboratory for Genomic Medicine, CT 06032, United States
- Tommaso Fontana
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy
- Peter Hansen
  - The Jackson Laboratory for Genomic Medicine, CT 06032, United States
- Elena Casiraghi
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, United States
- Vida Ravanmehr
  - The Jackson Laboratory for Genomic Medicine, CT 06032, United States
- Christopher J Mungall
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, United States
- Jeremy J Yang
  - Department of Internal Medicine and UNM Comprehensive Cancer Center, UNM School of Medicine, Albuquerque, NM 87102, United States
- Leonard Spranger
  - Institute of Bioinformatics, Freie Universität Berlin, Berlin, 14195, Germany
- Guy Karlebach
  - The Jackson Laboratory for Genomic Medicine, CT 06032, United States
- J Harry Caufield
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, United States
- Leigh Carmody
  - The Jackson Laboratory for Genomic Medicine, CT 06032, United States
- Ben Coleman
  - The Jackson Laboratory for Genomic Medicine, CT 06032, United States
  - Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, United States
- Tudor I Oprea
  - Department of Internal Medicine and UNM Comprehensive Cancer Center, UNM School of Medicine, Albuquerque, NM 87102, United States
- Justin Reese
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94710, United States
- Giorgio Valentini
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano 20133, Italy
  - ELLIS, European Laboratory for Learning and Intelligent Systems
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, CT 06032, United States
  - Institute for Systems Genomics, University of Connecticut, Farmington, CT 06032, United States
  - ELLIS, European Laboratory for Learning and Intelligent Systems
  - Berlin Institute of Health, Charité – Universitätsmedizin Berlin, Berlin, 10117, Germany