1. Xu Z, Barghout RA, Wu J, Garg D, Song YS, Mahadevan R. CPI-Pred: A deep learning framework for predicting functional parameters of compound-protein interactions. bioRxiv 2025:2025.01.16.633372. [PMID: 39896624 PMCID: PMC11785036 DOI: 10.1101/2025.01.16.633372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2025]
Abstract
Recent advancements in deep learning have enabled functional annotation of genome sequences, facilitating the discovery of new enzymes and metabolites. However, accurately predicting compound-protein interactions (CPI) from sequences remains challenging due to the complexity of these interactions and the sparsity and heterogeneity of available data, which constrain the generalization of patterns across their solution space. In this work, we introduce CPI-Pred, a versatile deep learning model designed to predict compound-protein interaction function. CPI-Pred integrates compound representations derived from a novel message-passing neural network and enzyme representations generated by state-of-the-art protein language models, leveraging innovative sequence pooling and cross-attention mechanisms. To train and evaluate CPI-Pred, we compiled the largest dataset of enzyme kinetic parameters to date, encompassing four key metrics: the Michaelis-Menten constant (K_M), enzyme turnover number (k_cat), catalytic efficiency (k_cat/K_M), and inhibition constant (K_I). These kinetic parameters are critical for elucidating enzyme function in metabolic contexts and understanding their regulation by compounds within biological networks. We demonstrate that CPI-Pred can predict diverse types of CPI using only the amino acid sequence of enzymes and structural representations of compounds, outperforming state-of-the-art models on unseen compounds and structurally dissimilar enzymes. Our workflow provides a valuable tool for tackling a range of metabolic engineering challenges, including the design of novel enzyme sequences and compounds, such as enzyme inhibitors. Additionally, the datasets curated in this study offer a valuable resource for the scientific community, serving as a benchmark for machine learning models focused on enzyme activity and promiscuity prediction.
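For orientation, the four kinetic parameters named in this abstract can be related through the textbook Michaelis-Menten rate law. The expressions below are the standard forms, not equations taken from the preprint; the symbols v (reaction rate), [E]_0 (total enzyme concentration), [S] (substrate concentration), and [I] (inhibitor concentration) are introduced here for illustration, and the inhibited form assumes simple competitive inhibition.

```latex
v = \frac{k_{\mathrm{cat}}\,[E]_0\,[S]}{K_M + [S]},
\qquad
v \approx \frac{k_{\mathrm{cat}}}{K_M}\,[E]_0\,[S] \quad \text{when } [S] \ll K_M,
\qquad
v_{\mathrm{inh}} = \frac{k_{\mathrm{cat}}\,[E]_0\,[S]}{K_M\left(1 + \dfrac{[I]}{K_I}\right) + [S]}
```

The middle expression shows why k_cat/K_M is reported as a separate quantity: it sets the rate at low substrate concentration.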
Affiliation(s)
- Zhiqing Xu
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON, Canada
- Rana Ahmed Barghout
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON, Canada
- Jinghao Wu
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON, Canada
- Dhruv Garg
  - Computer Science and Engineering Department, Thapar Institute of Engineering and Technology, Patiala, India
- Yun S Song
  - Computer Science Division and Department of Statistics, University of California, Berkeley, CA, US
- Radhakrishnan Mahadevan
  - Department of Chemical Engineering and Applied Chemistry, University of Toronto, Toronto, ON, Canada
2. Lee H, Ozbulak U, Park H, Depuydt S, De Neve W, Vankerschaver J. Assessing the reliability of point mutation as data augmentation for deep learning with genomic data. BMC Bioinformatics 2024; 25:170. [PMID: 38689247 PMCID: PMC11059627 DOI: 10.1186/s12859-024-05787-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2024] [Accepted: 04/15/2024] [Indexed: 05/02/2024]
Abstract
BACKGROUND: Deep neural networks (DNNs) have the potential to revolutionize our understanding and treatment of genetic diseases. An inherent limitation of deep neural networks, however, is their high demand for data during training. To overcome this challenge, other fields, such as computer vision, use various data augmentation techniques to artificially increase the available training data for DNNs. Unfortunately, most data augmentation techniques used in other domains do not transfer well to genomic data.
RESULTS: Most genomic data possesses peculiar properties and data augmentations may significantly alter the intrinsic properties of the data. In this work, we propose a novel data augmentation technique for genomic data inspired by biology: point mutations. By employing point mutations as substitutes for codons, we demonstrate that our newly proposed data augmentation technique enhances the performance of DNNs across various genomic tasks that involve coding regions, such as translation initiation and splice site detection.
CONCLUSION: Silent and missense mutations are found to positively influence effectiveness, while nonsense mutations and random mutations in non-coding regions generally lead to degradation. Overall, point mutation-based augmentations in genomic datasets present valuable opportunities for improving the accuracy and reliability of predictive models for DNA sequences.
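As a rough illustration of what a silent point-mutation augmentation could look like, the Python sketch below swaps codons for synonymous codons that differ by exactly one base. It is not the authors' implementation: the abstract does not specify their procedure, and the function names (`translate`, `silent_point_mutations`), the `rate` parameter, and the assumption that the input is an in-frame DNA coding sequence over A, C, G, T are all hypothetical choices made here.

```python
import itertools
import random

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(codon): aa
    for codon, aa in zip(itertools.product(BASES, repeat=3), AMINO)
}


def translate(seq):
    """Translate an in-frame DNA sequence; trailing partial codons are ignored."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, usable, 3))


def silent_point_mutations(seq, rate, rng):
    """Return a copy of seq in which each codon is, with probability `rate`,
    replaced by a synonymous codon that differs in exactly one base.
    Codons with no such alternative (e.g. ATG, TGG) are left unchanged."""
    usable = len(seq) - len(seq) % 3
    out = []
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        if rng.random() < rate:
            options = [
                alt for alt, aa in CODON_TO_AA.items()
                if aa == CODON_TO_AA[codon]
                and sum(a != b for a, b in zip(alt, codon)) == 1
            ]
            if options:
                codon = rng.choice(options)
        out.append(codon)
    out.append(seq[usable:])  # keep any trailing bases untouched
    return "".join(out)


if __name__ == "__main__":
    original = "ATGGCTCTGAAATGA"
    augmented = silent_point_mutations(original, rate=0.5, rng=random.Random(0))
    print(original, augmented, translate(augmented) == translate(original))
```

Because every substitution is synonymous, the translated protein is unchanged; a missense variant of the same idea would instead pick a one-base neighbour that encodes a different amino acid.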
Affiliation(s)
- Utku Ozbulak
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
- Homin Park
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Stephen Depuydt
  - Erasmus Brussels University of Applied Sciences and Arts, Brussels, Belgium
- Wesley De Neve
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - IDLab, Department of Electronics and Information Systems, Ghent University, Ghent, Belgium
- Joris Vankerschaver
  - Center for Biosystems and Biotech Data Science, Ghent University Global Campus, Incheon, South Korea
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
3. Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Central Science 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity, followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Affiliation(s)
- Jason Yang
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Francesca-Zhoufan Li
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Frances H. Arnold
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
4. Minot M, Reddy ST. Meta learning addresses noisy and under-labeled data in machine learning-guided antibody engineering. Cell Syst 2024; 15:4-18.e4. [PMID: 38194961 DOI: 10.1016/j.cels.2023.12.003] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/30/2023] [Revised: 07/21/2023] [Accepted: 12/07/2023] [Indexed: 01/11/2024]
Abstract
Machine learning-guided protein engineering is rapidly progressing; however, collecting high-quality, large datasets remains a bottleneck. Directed evolution and protein engineering studies often require extensive experimental processes to eliminate noise and label protein sequence-function data. Meta learning has proven effective in other fields in learning from noisy data via bi-level optimization given the availability of a small dataset with trusted labels. Here, we leverage meta learning approaches to overcome noisy and under-labeled data and expedite workflows in antibody engineering. We generate yeast display antibody mutagenesis libraries and screen them for target antigen binding followed by deep sequencing. We then create representative learning tasks, including learning from noisy training data, positive and unlabeled learning, and learning out of distribution properties. We demonstrate that meta learning has the potential to reduce experimental screening time and improve the robustness of machine learning models by training with noisy and under-labeled training data.
Affiliation(s)
- Mason Minot
  - ETH Zurich, Department of Biosystems Science and Engineering, Basel 4056, Switzerland
- Sai T Reddy
  - ETH Zurich, Department of Biosystems Science and Engineering, Basel 4056, Switzerland
5. Riwes MM, Golob JL, Magenau J, Shan M, Dick G, Braun T, Schmidt TM, Pawarode A, Anand S, Ghosh M, Maciejewski J, King D, Choi S, Yanik G, Geer M, Hillman E, Lyssiotis CA, Tewari M, Reddy P. Feasibility of a dietary intervention to modify gut microbial metabolism in patients with hematopoietic stem cell transplantation. Nat Med 2023; 29:2805-2813. [PMID: 37857710 PMCID: PMC10667101 DOI: 10.1038/s41591-023-02587-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Accepted: 09/12/2023] [Indexed: 10/21/2023]
Abstract
Evaluation of the impact of dietary intervention on gastrointestinal microbiota and metabolites after allogeneic hematopoietic stem cell transplantation (HCT) is lacking. We conducted a feasibility study as the first of a two-phase trial. Ten adults received resistant potato starch (RPS) daily from day -7 to day 100. The primary objective was to test the feasibility of RPS and its effect on intestinal microbiome and metabolites, including the short-chain fatty acid butyrate. Feasibility met the preset goal of 60% or more of participants adhering to 70% or more of doses; fecal butyrate levels were significantly higher when participants were on RPS than when they were not (P < 0.0001). An exploratory objective was to evaluate plasma metabolites. We observed longitudinal changes in plasma metabolites compared to baseline, which were independent of RPS (P < 0.0001). However, in recipients of RPS, the dominant plasma metabolites were more stable compared to historical controls, with a significant difference at engraftment (P < 0.05). These results indicate that RPS in recipients of allogeneic HCT is feasible; in this study, it was associated with significant alterations in intestinal and plasma metabolites. A phase 2 trial examining the effect of RPS on graft-versus-host disease in recipients of allogeneic HCT is underway. ClinicalTrials.gov registration: NCT02763033.
Affiliation(s)
- Mary M Riwes
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- Jonathan L Golob
  - Department of Internal Medicine, Division of Infectious Disease, University of Michigan, Ann Arbor, MI, USA
- John Magenau
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- Mengrou Shan
  - Department of Molecular & Integrative Physiology, University of Michigan, Ann Arbor, MI, USA
- Gregory Dick
  - Department of Earth & Environmental Sciences, University of Michigan, Ann Arbor, MI, USA
- Thomas Braun
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Thomas M Schmidt
  - Department of Internal Medicine, Division of Infectious Disease, University of Michigan, Ann Arbor, MI, USA
- Attaphol Pawarode
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- Sarah Anand
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- Monalisa Ghosh
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- John Maciejewski
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- Darren King
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- Sung Choi
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- Gregory Yanik
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- Marcus Geer
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- Ethan Hillman
  - Department of Internal Medicine, Division of Infectious Disease, University of Michigan, Ann Arbor, MI, USA
- Costas A Lyssiotis
  - Department of Molecular & Integrative Physiology, University of Michigan, Ann Arbor, MI, USA
- Muneesh Tewari
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
- Pavan Reddy
  - Department of Internal Medicine, Division of Hematology and Oncology, University of Michigan, Rogel Cancer Center, Ann Arbor, MI, USA
  - Dan L Duncan Comprehensive Cancer Center, Baylor College of Medicine, Houston, TX, USA