1
McNeela D, Sala F, Gitter A. Product Manifold Representations for Learning on Biological Pathways. ARXIV 2025:arXiv:2401.15478v2. [PMID: 39975438 PMCID: PMC11838783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Machine learning models that embed graphs in non-Euclidean spaces have shown substantial benefits in a variety of contexts, but their application has not been studied extensively in the biological domain, particularly with respect to biological pathway graphs. Such graphs exhibit a variety of complex network structures, presenting challenges to existing embedding approaches. Learning high-quality embeddings for biological pathway graphs is important for researchers looking to understand the underpinnings of disease and train high-quality predictive models on these networks. In this work, we investigate the effects of embedding pathway graphs in non-Euclidean mixed-curvature spaces and compare against traditional Euclidean graph representation learning models. We then train a supervised model using the learned node embeddings to predict missing protein-protein interactions in pathway graphs. We find large reductions in distortion and gains in in-distribution edge prediction performance as a result of using mixed-curvature embeddings and their corresponding graph neural network models. However, we find that mixed-curvature representations underperform existing baselines on out-of-distribution edge prediction, suggesting that these representations may overfit to the training graph topology. We provide our Mixed-Curvature Product Graph Convolutional Network code at https://github.com/mcneela/Mixed-Curvature-GCN and our pathway analysis code at https://github.com/mcneela/Mixed-Curvature-Pathways.
Affiliation(s)
- Daniel McNeela
  - Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
  - Morgridge Institute for Research, Madison, WI, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Frederic Sala
  - Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
- Anthony Gitter
  - Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI, USA
  - Morgridge Institute for Research, Madison, WI, USA
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
2
Joe H, Kim HG. Multi-label classification with XGBoost for metabolic pathway prediction. BMC Bioinformatics 2024; 25:52. [PMID: 38297220 PMCID: PMC10832249 DOI: 10.1186/s12859-024-05666-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 07/14/2023] [Accepted: 01/22/2024] [Indexed: 02/02/2024]
Abstract
BACKGROUND: Metabolic pathway prediction is one possible approach to the systems biology problem of reconstructing an organism's metabolic network from its genome sequence. Recent studies of machine learning-based pathway prediction methods have concluded that these approaches perform similarly to the most widely used method, PathoLogic, which is rule-based. One issue is that previous studies evaluated PathoLogic without taxonomic pruning, which decreases its performance. RESULTS: In this study, we update the evaluation results from previous studies to demonstrate that PathoLogic with taxonomic pruning outperforms previous machine learning-based approaches and that these approaches need further performance improvements to be competitive. Furthermore, we introduce mlXGPR, an XGBoost-based metabolic pathway prediction method built on the multi-label classification framework for pathway prediction introduced by mlLGPR. We also improve on this multi-label framework by exploiting correlations between labels using classifier chains. We propose a ranking method that determines the order of the chain so that lower-performing classifiers are placed later in the chain and can make greater use of the correlations between labels. We evaluate mlXGPR with and without classifier chains on single-organism and multi-organism benchmarks. Our results indicate that mlXGPR outperforms previous pathway prediction methods, including PathoLogic with taxonomic pruning, in terms of Hamming loss, precision and F1 score on single-organism benchmarks. CONCLUSIONS: The results from our study indicate that the performance of machine learning-based pathway prediction methods can be substantially improved and can even exceed that of PathoLogic with taxonomic pruning.
Affiliation(s)
- Hyunwhan Joe
  - Biomedical Knowledge Engineering Lab., Seoul National University, Seoul, Republic of Korea
- Hong-Gee Kim
  - Biomedical Knowledge Engineering Lab., Seoul National University, Seoul, Republic of Korea
  - School of Dentistry and Dental Research Institute, Seoul National University, Seoul, Republic of Korea
3
Alvarez-Mamani E, Dechant R, Beltran-Castañón CA, Ibáñez AJ. Graph embedding on mass spectrometry- and sequencing-based biomedical data. BMC Bioinformatics 2024; 25:1. [PMID: 38166530 PMCID: PMC10763173 DOI: 10.1186/s12859-023-05612-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2023] [Accepted: 12/11/2023] [Indexed: 01/04/2024]
Abstract
Graph embedding techniques use deep learning algorithms in data analysis to solve problems such as node classification, link prediction, community detection, and visualization. Although they are typically used to predict friendships in social media, several applications of graph embedding techniques in biomedical data analysis have emerged. While these approaches remain computationally demanding, several developments in recent years have facilitated their application to biomedical data and thus may help advance biological discoveries. Therefore, in this review, we discuss the principles of graph embedding techniques and explore their usefulness for understanding biological network data derived from mass spectrometry and sequencing experiments, the current workhorses of systems biology studies. In particular, we focus on recent examples of characterizing protein-protein interaction networks and predicting novel drug functions.
Affiliation(s)
- Edwin Alvarez-Mamani
  - Engineering Department, Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
  - Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
- Reinhard Dechant
  - Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
  - Calico Life Sciences, 1170 Veterans Blvd, San Francisco, CA, 94080, USA
- Alfredo J Ibáñez
  - Institute for Omics Sciences and Applied Biotechnology (ICOBA PUCP), Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
  - Science Department, Pontificia Universidad Católica del Perú, San Miguel, Lima, Peru
4
Anstett J, Plominsky AM, DeLong EF, Kiesser A, Jürgens K, Morgan-Lang C, Stepanauskas R, Stewart FJ, Ulloa O, Woyke T, Malmstrom R, Hallam SJ. A compendium of bacterial and archaeal single-cell amplified genomes from oxygen deficient marine waters. Sci Data 2023; 10:332. [PMID: 37244914 DOI: 10.1038/s41597-023-02222-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2022] [Accepted: 05/10/2023] [Indexed: 05/29/2023]
Abstract
Oxygen-deficient marine waters, referred to as oxygen minimum zones (OMZs) or anoxic marine zones (AMZs), are common oceanographic features. They host both cosmopolitan and endemic microorganisms adapted to low-oxygen conditions. Microbial metabolic interactions within OMZs and AMZs drive coupled biogeochemical cycles, resulting in nitrogen loss and in the production and consumption of climate-active trace gases. Global warming is causing oxygen-deficient waters to expand and intensify. Therefore, studies focused on microbial communities inhabiting oxygen-deficient regions are necessary to both monitor and model the impacts of climate change on marine ecosystem functions and services. Here we present a compendium of 5,129 single-cell amplified genomes (SAGs) from marine environments encompassing representative OMZ and AMZ geochemical profiles. Of these, 3,570 SAGs have been sequenced to different levels of completion, providing a strain-resolved perspective on the genomic content and potential metabolic interactions within OMZ and AMZ microbiomes. Hierarchical clustering confirmed that samples from similar oxygen concentrations and geographic regions also had analogous taxonomic compositions, providing a coherent framework for comparative community analysis.
Affiliation(s)
- Julia Anstett
  - Graduate Program in Genome Sciences and Technology, Genome Sciences Centre, University of British Columbia, Vancouver, British Columbia, Canada
  - Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, V6T 1Z3, Canada
- Alvaro M Plominsky
  - Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, V6T 1Z3, Canada
  - Marine Biology Research Division, Scripps Institution of Oceanography, University of California San Diego, La Jolla, CA, 92037, USA
- Edward F DeLong
  - Daniel K. Inouye Center for Microbial Oceanography: Research and Education, University of Hawaii, Manoa, Honolulu, HI, 96822, USA
- Alyse Kiesser
  - School of Engineering, The University of British Columbia, Kelowna, BC, Canada
- Klaus Jürgens
  - Leibniz Institute for Baltic Sea Research, Warnemünde, Germany
- Connor Morgan-Lang
  - Graduate Program in Bioinformatics, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Frank J Stewart
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
  - Center for Microbial Dynamics and Infection, Georgia Institute of Technology, Atlanta, GA, USA
  - Department of Microbiology and Cell Biology, Montana State University, Bozeman, MT, USA
- Osvaldo Ulloa
  - Departamento de Oceanografía, Universidad de Concepción, Casilla 160-C, 4070386, Concepción, Chile
  - Instituto Milenio de Oceanografía, Casilla 1313, 4070386, Concepción, Chile
- Tanja Woyke
  - Department of Energy Joint Genome Institute, Berkeley, CA, USA
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Rex Malmstrom
  - Department of Energy Joint Genome Institute, Berkeley, CA, USA
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Steven J Hallam
  - Graduate Program in Genome Sciences and Technology, Genome Sciences Centre, University of British Columbia, Vancouver, British Columbia, Canada
  - Department of Microbiology and Immunology, University of British Columbia, Vancouver, British Columbia, V6T 1Z3, Canada
  - Graduate Program in Bioinformatics, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
  - Life Sciences Institute, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada
  - ECOSCOPE Training Program, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada
5
Basher ARMA, Mclaughlin RJ, Hallam SJ. Metabolic Pathway Prediction Using Non-Negative Matrix Factorization with Improved Precision. J Comput Biol 2021; 28:1075-1103. [PMID: 34520674 DOI: 10.1089/cmb.2021.0258] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/28/2022]
Abstract
Machine learning provides a probabilistic framework for metabolic pathway inference from genomic sequence information at different levels of complexity and completion. However, several challenges, including pathway feature engineering, multiple mapping of enzymatic reactions, and emergent or distributed metabolism within populations or communities of cells, can limit prediction performance. In this article, we present triUMPF (triple non-negative matrix factorization [NMF] with community detection for metabolic pathway inference), which combines three stages of NMF to capture myriad relationships between enzymes and pathways within a graph network. This is followed by community detection to extract a higher-order structure based on the clustering of vertices that share similar statistical properties. We evaluated triUMPF performance by using experimental datasets manifesting diverse multi-label properties, including Tier 1 genomes from the BioCyc collection of organismal Pathway/Genome Databases and low-complexity microbial communities. The resulting performance metrics equaled or exceeded those of other prediction methods on organismal genomes, with improved precision on multi-organismal datasets.
Affiliation(s)
- Abdur Rahman M A Basher
  - Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, Vancouver, British Columbia, Canada
- Ryan J Mclaughlin
  - Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, Vancouver, British Columbia, Canada
- Steven J Hallam
  - Graduate Program in Bioinformatics, University of British Columbia, Genome Sciences Centre, Vancouver, British Columbia, Canada
  - Department of Microbiology & Immunology, University of British Columbia, Vancouver, British Columbia, Canada
  - Genome Science and Technology Program, University of British Columbia, Vancouver, British Columbia, Canada
  - Life Sciences Institute, University of British Columbia, Vancouver, British Columbia, Canada
  - ECOSCOPE Training Program, University of British Columbia, Vancouver, British Columbia, Canada