1
Jiang J, Zhang C, Ke L, Hayes N, Zhu Y, Qiu H, Zhang B, Zhou T, Wei GW. A review of machine learning methods for imbalanced data challenges in chemistry. Chem Sci 2025; 16:7637-7658. [PMID: 40271022 PMCID: PMC12013631 DOI: 10.1039/d5sc00270b] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/13/2025] [Accepted: 04/06/2025] [Indexed: 04/25/2025] Open
Abstract
Imbalanced data, where certain classes are significantly underrepresented in a dataset, is a widespread machine learning (ML) challenge across various fields of chemistry, yet it remains inadequately addressed. This data imbalance can lead to biased ML or deep learning (DL) models, which fail to accurately predict the underrepresented classes, thus limiting the robustness and applicability of these models. With the rapid advancement of ML and DL algorithms, several promising solutions to this issue have emerged, prompting the need for a comprehensive review of current methodologies. In this review, we examine the prominent ML approaches used to tackle the imbalanced data challenge in different areas of chemistry, including resampling techniques, data augmentation techniques, algorithmic approaches, and feature engineering strategies. Each of these methods is evaluated in the context of its application across various aspects of chemistry, such as drug discovery, materials science, cheminformatics, and catalysis. We also explore future directions for overcoming the imbalanced data challenge and emphasize data augmentation via physical models, large language models (LLMs), and advanced mathematics. The benefit of balanced data in new material design and production and the persistent challenges are discussed. Overall, this review aims to elucidate the prevalent ML techniques applied to mitigate the impacts of imbalanced data within the field of chemistry and offer insights into future directions for research and application.
Affiliation(s)
- Jian Jiang
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P. R. China
  - Department of Mathematics, Michigan State University East Lansing Michigan 48824 USA
- Chunhuan Zhang
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P. R. China
- Lu Ke
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P. R. China
- Nicole Hayes
  - Department of Mathematics, Michigan State University East Lansing Michigan 48824 USA
- Yueying Zhu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P. R. China
- Huahai Qiu
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P. R. China
- Bengong Zhang
  - Research Center of Nonlinear Science, School of Mathematical and Physical Sciences, Wuhan Textile University Wuhan 430200 P. R. China
- Tianshou Zhou
  - Key Laboratory of Computational Mathematics, Guangdong Province, School of Mathematics, Sun Yat-sen University Guangzhou 510006 P. R. China
- Guo-Wei Wei
  - Department of Mathematics, Michigan State University East Lansing Michigan 48824 USA
  - Department of Electrical and Computer Engineering, Michigan State University East Lansing Michigan 48824 USA
  - Department of Biochemistry and Molecular Biology, Michigan State University East Lansing Michigan 48824 USA
2
Shukla D, Martin J, Morcos F, Potoyan DA. Thermal Adaptation of Cytosolic Malate Dehydrogenase Revealed by Deep Learning and Coevolutionary Analysis. J Chem Theory Comput 2025; 21:3277-3287. [PMID: 40079215 PMCID: PMC11948321 DOI: 10.1021/acs.jctc.4c01774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2024] [Revised: 03/06/2025] [Accepted: 03/07/2025] [Indexed: 03/14/2025]
Abstract
Protein evolution has shaped enzymes that maintain stability and function across diverse thermal environments. While sequence variation, thermal stability and conformational dynamics are known to influence an enzyme's thermal adaptation, how these factors collectively govern stability and function across diverse temperatures remains unresolved. Cytosolic malate dehydrogenase (cMDH), a citric acid cycle enzyme, is an ideal model for studying these mechanisms due to its temperature-sensitive flexibility and broad presence in species from diverse thermal environments. In this study, we employ techniques inspired by deep learning and statistical mechanics to uncover how sequence variation and conformational dynamics shape patterns of cMDH's thermal adaptation. By integrating coevolutionary models with variational autoencoders (VAE), we generate a latent generative landscape (LGL) of the cMDH sequence space, enabling us to explore mutational pathways and predict fitness using direct coupling analysis (DCA). Structure predictions via AlphaFold and molecular dynamics simulations further illuminate how variations in hydrophobic interactions and conformational flexibility contribute to the thermal stability of warm- and cold-adapted cMDH orthologs. Notably, we identify the ratio of hydrophobic contacts between two regions as a predictive order parameter for thermal stability features, providing a quantitative metric for understanding cMDH dynamics across temperatures. The integrative computational framework employed in this study provides mechanistic insights into protein adaptation at both sequence and structural levels, offering unique perspectives on the evolution of thermal stability and creating avenues for the rational design of proteins with optimized thermal properties.
Affiliation(s)
- Divyanshu Shukla
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa 50011, United States
- Jonathan Martin
  - Department of Biological Sciences, UT Dallas, Richardson, TX 75080, United States
- Faruck Morcos
  - Department of Biological Sciences, UT Dallas, Richardson, TX 75080, United States
  - Departments of Bioengineering and Physics, UT Dallas, Richardson, TX 75080, United States
  - Center for Systems Biology, UT Dallas, Richardson, TX 75080, United States
- Davit A. Potoyan
  - Department of Chemistry, Iowa State University, Ames, Iowa 50011, United States
  - Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa 50011, United States
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa 50011, United States
3
Jiang Y, Wang Z, Scheuring S. A structural biology compatible file format for atomic force microscopy. Nat Commun 2025; 16:1671. [PMID: 39955301 PMCID: PMC11829953 DOI: 10.1038/s41467-025-56760-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2024] [Accepted: 01/30/2025] [Indexed: 02/17/2025] Open
Abstract
Cryogenic electron microscopy (cryo-EM), X-ray crystallography, and nuclear magnetic resonance (NMR) contribute structural data that are interchangeable, cross-verifiable, and visualizable on common platforms, making them powerful tools for our understanding of protein structures. Unfortunately, atomic force microscopy (AFM) has so far failed to interface with these structural biology methods, despite the recent development of localization AFM (LAFM) that allows extracting high-resolution structural information from AFM data. Here, we build on LAFM and develop a pipeline that transforms AFM data into 3D-density files (.afm) that are readable by programs commonly used to visualize, analyze, and interpret structural data. We show that 3D-LAFM densities can serve as force fields to steer molecular dynamics flexible fitting (MDFF) to obtain structural models of previously unresolved states based on AFM observations in close-to-native environment. Besides, the .afm format enables direct 3D or 2D visualization and analysis of conventional AFM images. We anticipate that the file format will find wide usage and embed AFM in the repertoire of methods routinely used by the structural biology community, allowing AFM researchers to deposit data in repositories in a format that allows comparison and cross-verification with data from other techniques.
Affiliation(s)
- Yining Jiang
  - Biochemistry & Structural Biology, Cell & Developmental Biology, and Molecular Biology (BCMB) Program, Weill Cornell Graduate School of Medical Sciences, New York, NY, USA
  - Weill Cornell Medicine, Department of Anesthesiology, New York, NY, USA
- Zhaokun Wang
  - Weill Cornell Medicine, Department of Anesthesiology, New York, NY, USA
  - Physiology, Biophysics and Systems Biology Graduate Program, Weill Cornell Graduate School of Medical Sciences, New York, NY, USA
- Simon Scheuring
  - Weill Cornell Medicine, Department of Anesthesiology, New York, NY, USA
  - Weill Cornell Medicine, Department of Physiology and Biophysics, New York, NY, USA
4
Ojha AA, Votapka LW, Amaro RE. Advances and Challenges in Milestoning Simulations for Drug-Target Kinetics. J Chem Theory Comput 2024; 20:9759-9769. [PMID: 39508322 PMCID: PMC11603602 DOI: 10.1021/acs.jctc.4c01108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2024] [Revised: 10/30/2024] [Accepted: 10/31/2024] [Indexed: 11/15/2024]
Abstract
Molecular dynamics simulations have become indispensable for exploring complex biological processes, yet their limitations in capturing rare events hinder our understanding of drug-target kinetics. In this Perspective, we investigate the domain of milestoning simulations to understand this challenge. The milestoning approach divides the phase space of the drug-target complex into discrete cells, offering extended time scale insights. This Perspective traces the history, applications, and future potential of milestoning simulations in the context of drug-target kinetics. It explores the fundamental principles of milestoning, highlighting the importance of probabilistic transitions and transition time independence. Markovian milestoning with Voronoi tessellations is revisited to address the traditional milestoning challenges. While observing the advancements in this field, this Perspective also addresses impending challenges in estimating drug-target unbinding rate constants through milestoning simulations, paving the way for more effective drug design strategies.
Affiliation(s)
- Anupam Anand Ojha
  - Department of Chemistry and Biochemistry, University of California San Diego, La Jolla, California 92093, United States
  - Center for Computational Biology and Center for Computational Mathematics, Flatiron Institute, New York, New York 10010, United States
- Lane W. Votapka
  - Department of Chemistry and Biochemistry, University of California San Diego, La Jolla, California 92093, United States
- Rommie E. Amaro
  - Department of Molecular Biology, University of California San Diego, La Jolla, California 92093, United States
5
Henderson R, Anasti K, Manne K, Stalls V, Saunders C, Bililign Y, Williams A, Bubphamala P, Montani M, Kachhap S, Li J, Jaing C, Newman A, Cain DW, Lu X, Venkatayogi S, Berry M, Wagh K, Korber B, Saunders KO, Tian M, Alt F, Wiehe K, Acharya P, Alam SM, Haynes BF. Engineering immunogens that select for specific mutations in HIV broadly neutralizing antibodies. Nat Commun 2024; 15:9503. [PMID: 39489734 PMCID: PMC11532496 DOI: 10.1038/s41467-024-53120-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Accepted: 09/27/2024] [Indexed: 11/05/2024] Open
Abstract
Vaccine development targeting rapidly evolving pathogens such as HIV-1 requires induction of broadly neutralizing antibodies (bnAbs) with conserved paratopes and mutations, and in some cases, the same Ig-heavy chains. The current trial-and-error search for immunogen modifications that improve selection for specific bnAb mutations is imprecise. Here, to precisely engineer bnAb boosting immunogens, we use molecular dynamics simulations to examine encounter states that form when antibodies collide with the HIV-1 Envelope (Env). By mapping how bnAbs use encounter states to find their bound states, we identify Env mutations predicted to select for specific antibody mutations in two HIV-1 bnAb B cell lineages. The Env mutations encode antibody affinity gains and select for desired antibody mutations in vivo. These results demonstrate proof-of-concept that Env immunogens can be designed to directly select for specific antibody mutations at residue-level precision by vaccination, thus demonstrating the feasibility of sequential bnAb-inducing HIV-1 vaccine design.
Affiliation(s)
- Rory Henderson
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
  - Department of Medicine, Duke University Medical Center, Durham, NC, USA
- Kara Anasti
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Kartik Manne
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Victoria Stalls
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Carrie Saunders
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Yishak Bililign
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Ashliegh Williams
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Pimthada Bubphamala
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Maya Montani
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Sangita Kachhap
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Jingjing Li
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Chuancang Jaing
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Amanda Newman
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Derek W Cain
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
  - Department of Medicine, Duke University Medical Center, Durham, NC, USA
- Xiaozhi Lu
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Sravani Venkatayogi
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Madison Berry
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Kshitij Wagh
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
- Bette Korber
  - Theoretical Biology and Biophysics Group, Los Alamos National Laboratory, Los Alamos, NM, USA
  - The New Mexico Consortium, Los Alamos, NM, USA
- Kevin O Saunders
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
  - Department of Surgery, Duke University Medical Center, Durham, NC, USA
- Ming Tian
  - Program in Cellular and Molecular Medicine, Boston Children's Hospital, Boston, MA, USA
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
- Fred Alt
  - Program in Cellular and Molecular Medicine, Boston Children's Hospital, Boston, MA, USA
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
- Kevin Wiehe
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
  - Department of Medicine, Duke University Medical Center, Durham, NC, USA
- Priyamvada Acharya
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
  - Department of Surgery, Duke University Medical Center, Durham, NC, USA
  - Department of Biochemistry, Duke University, Durham, NC, USA
- S Munir Alam
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
  - Department of Medicine, Duke University Medical Center, Durham, NC, USA
  - Department of Pathology, Duke University School of Medicine, Durham, NC, USA
- Barton F Haynes
  - Duke Human Vaccine Institute, Duke University Medical Center, Durham, NC, USA
  - Department of Immunology, Duke University Medical Center, Durham, NC, USA
6
Kacher J, Sokolova OS, Tarek M. A Deep Learning Approach to Uncover Voltage-Gated Ion Channels' Intermediate States. J Phys Chem B 2024; 128:8724-8736. [PMID: 39213618 DOI: 10.1021/acs.jpcb.4c03182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/04/2024]
Abstract
Owing to recent advancements in cryo-electron microscopy, voltage-gated ion channels have gained a greater comprehension of their structural characteristics. However, a significant enigma remains unsolved for a large majority of these channels: their gating mechanism. This mechanism, which encompasses the conformational changes between open and closed states, is pivotal to their proper functioning. Beyond the binary states of open and closed, an ensemble of intermediate states defines the transition path in-between. Due to the lack of experimental data, one might resort to molecular dynamics simulations as an alternative to decipher these states and the transitions between them. However, the high-energy barriers and the colossal time scales involved hinder access to the latter. We present here an application of deep learning as a reliable pipeline for a comprehensive exploration of voltage-gated ion channel conformational rearrangements during gating. We showcase the pipeline performance specifically on the Kv1.2 voltage sensor domain and confront the results with existing data. We demonstrate how our physics-based deep learning approach contributes to the theoretical understanding of these channels and how it might provide further insights into the exploration of channelopathies.
Affiliation(s)
- Julia Kacher
  - Université de Lorraine, CNRS, LPCT, F-54000 Nancy, France
- Olga S Sokolova
  - Faculty of Biology, Lomonosov Moscow State University, 1-12 Leninskie Gory, 119234 Moscow, Russia
  - Shenzhen MSU-BIT University, 1 International University Park Road, Dayun New Town, Longgang District, Shenzhen 518172, China
- Mounir Tarek
  - Université de Lorraine, CNRS, LPCT, F-54000 Nancy, France
7
Wang J, Wang X, Chu Y, Li C, Li X, Meng X, Fang Y, No KT, Mao J, Zeng X. Exploring the Conformational Ensembles of Protein-Protein Complex with Transformer-Based Generative Model. J Chem Theory Comput 2024; 20:4469-4480. [PMID: 38816696 DOI: 10.1021/acs.jctc.4c00255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2024]
Abstract
Protein-protein interactions are the basis of many protein functions, and understanding the contact and conformational changes of protein-protein interactions is crucial for linking the protein structure to biological function. Although difficult to detect experimentally, molecular dynamics (MD) simulations are widely used to study the conformational ensembles and dynamics of protein-protein complexes, but there are significant limitations in sampling efficiency and computational costs. In this study, a generative neural network was trained on protein-protein complex conformations obtained from molecular simulations to directly generate novel conformations with physical realism. We demonstrated the use of a deep learning model based on the transformer architecture to explore the conformational ensembles of protein-protein complexes through MD simulations. The results showed that the learned latent space can be used to generate unsampled conformations of protein-protein complexes for obtaining new conformations complementing pre-existing ones, which can be used as an exploratory tool for the analysis and enhancement of molecular simulations of protein-protein complexes.
Affiliation(s)
- Jianmin Wang
  - The Interdisciplinary Graduate Program in Integrative Biotechnology, Yonsei University, Incheon 21983, Korea
- Xun Wang
  - School of Computer Science and Technology, China University of Petroleum, Qingdao, Shandong 266580, P. R. China
  - High Performance Computer Research Center, University of Chinese Academy of Sciences, Beijing 100190, P. R. China
- Yanyi Chu
  - Department of Pathology, Stanford University School of Medicine, Stanford, California 94305, United States
- Chunyan Li
  - School of Informatics, Yunnan Normal University, Kunming, Yunnan 650500, P. R. China
- Xue Li
  - School of Computer Science and Technology, China University of Petroleum, Qingdao, Shandong 266580, P. R. China
- Xiangyu Meng
  - School of Computer Science and Technology, China University of Petroleum, Qingdao, Shandong 266580, P. R. China
- Yitian Fang
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200030, P. R. China
- Kyoung Tai No
  - The Interdisciplinary Graduate Program in Integrative Biotechnology, Yonsei University, Incheon 21983, Korea
- Jiashun Mao
  - School of Medical Information and Engineering, Southwest Medical University, Luzhou, Sichuan 646000, P. R. China
- Xiangxiang Zeng
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan 410082, P. R. China
8
Mansoor S, Baek M, Park H, Lee GR, Baker D. Protein Ensemble Generation Through Variational Autoencoder Latent Space Sampling. J Chem Theory Comput 2024; 20:2689-2695. [PMID: 38547871 PMCID: PMC11008089 DOI: 10.1021/acs.jctc.3c01057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2023] [Revised: 02/25/2024] [Accepted: 02/26/2024] [Indexed: 04/10/2024]
Abstract
Mapping the ensemble of protein conformations that contribute to function and can be targeted by small molecule drugs remains an outstanding challenge. Here, we explore the use of variational autoencoders for reducing the challenge of dimensionality in the protein structure ensemble generation problem. We convert high-dimensional protein structural data into a continuous, low-dimensional representation, carry out a search in this space guided by a structure quality metric, and then use RoseTTAFold guided by the sampled structural information to generate 3D structures. We use this approach to generate ensembles for the cancer relevant protein K-Ras, train the VAE on a subset of the available K-Ras crystal structures and MD simulation snapshots, and assess the extent of sampling close to crystal structures withheld from training. We find that our latent space sampling procedure rapidly generates ensembles with high structural quality and is able to sample within 1 Å of held-out crystal structures, with a consistency higher than that of MD simulation or AlphaFold2 prediction. The sampled structures sufficiently recapitulate the cryptic pockets in the held-out K-Ras structures to allow for small molecule docking.
Affiliation(s)
- Sanaa Mansoor
  - Department of Biochemistry, University of Washington, Seattle, Washington 98195, United States
  - Institute for Protein Design, University of Washington, Seattle, Washington 98195, United States
  - Molecular Engineering Graduate Program, University of Washington, Seattle, Washington 98195, United States
- Minkyung Baek
  - Department of Biochemistry, University of Washington, Seattle, Washington 98195, United States
  - Institute for Protein Design, University of Washington, Seattle, Washington 98195, United States
  - School of Biological Sciences, Seoul National University, Seoul 08826, Republic of Korea
- Hahnbeom Park
  - Department of Biochemistry, University of Washington, Seattle, Washington 98195, United States
  - Institute for Protein Design, University of Washington, Seattle, Washington 98195, United States
  - Brain Science Institute, Korea Institute of Science and Technology, Seoul 02792, Republic of Korea
- Gyu Rie Lee
  - Department of Biochemistry, University of Washington, Seattle, Washington 98195, United States
  - Institute for Protein Design, University of Washington, Seattle, Washington 98195, United States
- David Baker
  - Department of Biochemistry, University of Washington, Seattle, Washington 98195, United States
  - Institute for Protein Design, University of Washington, Seattle, Washington 98195, United States
  - Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, United States
9
Floch A, Galochkina T, Pirenne F, Tournamille C, de Brevern AG. Molecular dynamics of the human RhD and RhAG blood group proteins. Front Chem 2024; 12:1360392. [PMID: 38566898 PMCID: PMC10985258 DOI: 10.3389/fchem.2024.1360392] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Accepted: 03/07/2024] [Indexed: 04/04/2024] Open
Abstract
Introduction: Blood group antigens of the RH system (formerly known as "Rhesus") play an important role in transfusion medicine because of the severe haemolytic consequences of antibodies to these antigens. No crystal structure is available for RhD proteins with its partner RhAG, and the precise stoichiometry of the trimer complex remains unknown.
Methods: To analyse their structural properties, the trimers formed by RhD and/or RhAG subunits were generated by protein modelling and molecular dynamics simulations were performed.
Results: No major differences in structural behaviour were found between trimers of different compositions. The conformation of the subunits is relatively constant during molecular dynamics simulations, except for three large disordered loops.
Discussion: This work makes it possible to propose a reasonable stoichiometry and demonstrates the potential of studying the structural behaviour of these proteins to investigate the hundreds of genetic variants relevant to transfusion medicine.
Affiliation(s)
- Aline Floch
  - University Paris Est Créteil, INSERM U955 Equipe Transfusion et Maladies du Globule Rouge, IMRB, Créteil, France
  - Laboratoire de Biologie Médicale de Référence en Immuno-Hématologie Moléculaire, Etablissement Français du Sang Ile-de-France, Créteil, France
- Tatiana Galochkina
  - Université Paris Cité and Université des Antilles and Université de la Réunion, Biologie Intégrée du Globule Rouge, UMR_S1134, BIGR, INSERM, DSIMB Bioinformatics team, Paris, France
- France Pirenne
  - University Paris Est Créteil, INSERM U955 Equipe Transfusion et Maladies du Globule Rouge, IMRB, Créteil, France
  - Laboratoire de Biologie Médicale de Référence en Immuno-Hématologie Moléculaire, Etablissement Français du Sang Ile-de-France, Créteil, France
- Christophe Tournamille
  - University Paris Est Créteil, INSERM U955 Equipe Transfusion et Maladies du Globule Rouge, IMRB, Créteil, France
  - Laboratoire de Biologie Médicale de Référence en Immuno-Hématologie Moléculaire, Etablissement Français du Sang Ile-de-France, Créteil, France
- Alexandre G. de Brevern
  - Université Paris Cité and Université des Antilles and Université de la Réunion, Biologie Intégrée du Globule Rouge, UMR_S1134, BIGR, INSERM, DSIMB Bioinformatics team, Paris, France
10
Chen J, Potlapalli R, Quan H, Chen L, Xie Y, Pouriyeh S, Sakib N, Liu L, Xie Y. Exploring DNA Damage and Repair Mechanisms: A Review with Computational Insights. BIOTECH 2024; 13:3. [PMID: 38247733 PMCID: PMC10801582 DOI: 10.3390/biotech13010003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2023] [Revised: 11/21/2023] [Accepted: 12/29/2023] [Indexed: 01/23/2024] Open
Abstract
DNA damage is a critical factor contributing to genetic alterations, directly affecting human health, including developing diseases such as cancer and age-related disorders. DNA repair mechanisms play a pivotal role in safeguarding genetic integrity and preventing the onset of these ailments. Over the past decade, substantial progress and pivotal discoveries have been achieved in DNA damage and repair. This comprehensive review paper consolidates research efforts, focusing on DNA repair mechanisms, computational research methods, and associated databases. Our work is a valuable resource for scientists and researchers engaged in computational DNA research, offering the latest insights into DNA-related proteins, diseases, and cutting-edge methodologies. The review addresses key questions, including the major types of DNA damage, common DNA repair mechanisms, the availability of reliable databases for DNA damage and associated diseases, and the predominant computational research methods for enzymes involved in DNA damage and repair.
Affiliation(s)
- Jiawei Chen
  - College of Letter and Science, University of California, Berkeley, CA 94720, USA
- Ravi Potlapalli
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA; (L.C.); (R.P.); (Y.X.); (S.P.); (N.S.)
- Heng Quan
  - Department of Civil and Urban Engineering, New York University, New York, NY 11201, USA
- Lingtao Chen
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA; (L.C.); (R.P.); (Y.X.); (S.P.); (N.S.)
- Ying Xie
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA; (L.C.); (R.P.); (Y.X.); (S.P.); (N.S.)
- Seyedamin Pouriyeh
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA; (L.C.); (R.P.); (Y.X.); (S.P.); (N.S.)
- Nazmus Sakib
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA; (L.C.); (R.P.); (Y.X.); (S.P.); (N.S.)
- Lichao Liu
  - Stanford Cardiovascular Institute, Stanford University School of Medicine, Palo Alto, CA 94304, USA
- Yixin Xie
  - College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA; (L.C.); (R.P.); (Y.X.); (S.P.); (N.S.)
11
Bernardi A, Bennett WFD, He S, Jones D, Kirshner D, Bennion BJ, Carpenter TS. Advances in Computational Approaches for Estimating Passive Permeability in Drug Discovery. MEMBRANES 2023; 13:851. [PMID: 37999336 PMCID: PMC10673305 DOI: 10.3390/membranes13110851] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2023] [Revised: 10/19/2023] [Accepted: 10/21/2023] [Indexed: 11/25/2023]
Abstract
Passive permeation of cellular membranes is a key feature of many therapeutics. The relevance of passive permeability spans all biological systems as they all employ biomembranes for compartmentalization. A variety of computational techniques are currently utilized and under active development to facilitate the characterization of passive permeability. These methods include lipophilicity relations, molecular dynamics simulations, and machine learning, which vary in accuracy, complexity, and computational cost. This review briefly introduces the underlying theories, such as the prominent inhomogeneous solubility diffusion model, and covers a number of recent applications. Various machine-learning applications, which have demonstrated good potential for high-volume, data-driven permeability predictions, are also discussed. Due to the confluence of novel computational methods and next-generation exascale computers, we anticipate an exciting future for computationally driven permeability predictions.
Affiliation(s)
| | | | | | | | | | | | - Timothy S. Carpenter
- Lawrence Livermore National Laboratory, Livermore, CA 94550, USA; (A.B.); (W.F.D.B.); (S.H.); (D.J.); (D.K.); (B.J.B.)
| |
|
12
|
Chen SH, Weiss KL, Stanley C, Bhowmik D. Structural characterization of an intrinsically disordered protein complex using integrated small-angle neutron scattering and computing. Protein Sci 2023; 32:e4772. [PMID: 37646172 PMCID: PMC10503416 DOI: 10.1002/pro.4772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2023] [Revised: 08/22/2023] [Accepted: 08/27/2023] [Indexed: 09/01/2023]
Abstract
Characterizing structural ensembles of intrinsically disordered proteins (IDPs) and intrinsically disordered regions (IDRs) of proteins is essential for studying structure-function relationships. Due to the different neutron scattering lengths of hydrogen and deuterium, selective labeling and contrast matching in small-angle neutron scattering (SANS) become an effective tool to study dynamic structures of disordered systems. However, experimental timescales typically capture measurements averaged over multiple conformations, leaving complex SANS data to be disentangled. Here we demonstrate an integrated method to elucidate the structural ensemble of a complex formed by two IDRs. We use data from both full contrast and contrast matching with residue-specific deuterium labeling SANS experiments, microsecond all-atom molecular dynamics (MD) simulations with four molecular mechanics force fields, and an autoencoder-based deep learning (DL) algorithm. From our combined approach, we show that selective deuteration provides additional information that helps characterize structural ensembles. We find that among the four force fields, a99SB-disp and CHARMM36m show the strongest agreement with SANS and NMR experiments. In addition, our DL algorithm not only complements conventional structural analysis methods but also successfully differentiates NMR and MD structures that are indistinguishable on the free energy surface. Lastly, we present an ensemble that describes experimental SANS and NMR data better than MD ensembles generated by a single force field and reveal three clusters of distinct conformations. Our results demonstrate a new integrated approach for characterizing structural ensembles of IDPs.
Affiliation(s)
- Serena H. Chen
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
| | - Kevin L. Weiss
- Neutron Scattering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
| | - Christopher Stanley
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
| | - Debsindhu Bhowmik
- Computational Sciences and Engineering Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
| |
|
13
|
Xiao S, Song Z, Tian H, Tao P. Assessments of Variational Autoencoder in Protein Conformation Exploration. Journal of Computational Biophysics and Chemistry 2023; 22:489-501. [PMID: 38826699 PMCID: PMC11138204 DOI: 10.1142/s2737416523500217] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/04/2024]
Abstract
Molecular dynamics (MD) simulations have been extensively used to study protein dynamics and, subsequently, protein function. However, MD simulations often cannot explore enough of the conformational space relevant to protein function within reachable timescales. Accordingly, many enhanced sampling methods, including variational autoencoder (VAE) based methods, have been developed to address this issue. The purpose of this study is to evaluate the feasibility of using VAE to assist in the exploration of protein conformational landscapes. Using three modeling systems, we showed that VAE could capture high-level hidden information which distinguishes protein conformations. These models could also be used to generate new physically plausible protein conformations for direct sampling in favorable conformational spaces. We also found that VAE worked better in interpolation than in extrapolation and that increasing the latent space dimension could lead to a trade-off between performance and complexity.
Affiliation(s)
- Sian Xiao
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75205, United States
| | - Zilin Song
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75205, United States
| | - Hao Tian
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75205, United States
| | - Peng Tao
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75205, United States
| |
|
14
|
Zheng LE, Barethiya S, Nordquist E, Chen J. Machine Learning Generation of Dynamic Protein Conformational Ensembles. Molecules 2023; 28:4047. [PMID: 37241789 PMCID: PMC10220786 DOI: 10.3390/molecules28104047] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 04/10/2023] [Revised: 05/04/2023] [Accepted: 05/09/2023] [Indexed: 05/28/2023] Open
Abstract
Machine learning has achieved remarkable success across a broad range of scientific and engineering disciplines, particularly in predicting native protein structures from sequence information alone. However, biomolecules are inherently dynamic, and there is a pressing need for accurate predictions of dynamic structural ensembles across multiple functional levels. These problems range from the relatively well-defined task of predicting conformational dynamics around the native state of a protein, which traditional molecular dynamics (MD) simulations are particularly adept at handling, to generating large-scale conformational transitions connecting distinct functional states of structured proteins or numerous marginally stable states within the dynamic ensembles of intrinsically disordered proteins. Machine learning has been increasingly applied to learn low-dimensional representations of protein conformational spaces, which can then be used to drive additional MD sampling or directly generate novel conformations. These methods promise to greatly reduce the computational cost of generating dynamic protein ensembles, compared to traditional MD simulations. In this review, we examine recent progress in machine learning approaches towards generative modeling of dynamic protein ensembles and emphasize the crucial importance of integrating advances in machine learning, structural data, and physical principles to achieve these ambitious goals.
Affiliation(s)
- Li-E Zheng
- Department of Gynecology, The First Affiliated Hospital of Fujian Medical University, Fuzhou 350005, China;
| | - Shrishti Barethiya
- Department of Chemistry, University of Massachusetts Amherst, Amherst, MA 01003, USA; (S.B.); (E.N.)
| | - Erik Nordquist
- Department of Chemistry, University of Massachusetts Amherst, Amherst, MA 01003, USA; (S.B.); (E.N.)
| | - Jianhan Chen
- Department of Chemistry, University of Massachusetts Amherst, Amherst, MA 01003, USA; (S.B.); (E.N.)
| |
|
15
|
Verkhivker G, Alshahrani M, Gupta G, Xiao S, Tao P. From Deep Mutational Mapping of Allosteric Protein Landscapes to Deep Learning of Allostery and Hidden Allosteric Sites: Zooming in on "Allosteric Intersection" of Biochemical and Big Data Approaches. Int J Mol Sci 2023; 24:7747. [PMID: 37175454 PMCID: PMC10178073 DOI: 10.3390/ijms24097747] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/06/2023] [Revised: 04/22/2023] [Accepted: 04/23/2023] [Indexed: 05/15/2023] Open
Abstract
The recent advances in artificial intelligence (AI) and machine learning have driven the design of new expert systems and automated workflows that are able to model complex chemical and biological phenomena. In recent years, machine learning approaches have been developed and actively deployed to facilitate computational and experimental studies of protein dynamics and allosteric mechanisms. In this review, we discuss in detail new developments along two major directions of allosteric research through the lens of data-intensive biochemical approaches and AI-based computational methods. Despite considerable progress in applications of AI methods for protein structure and dynamics studies, the intersection between allosteric regulation, the emerging structural biology technologies and AI approaches remains largely unexplored, calling for the development of AI-augmented integrative structural biology. In this review, we focus on the latest remarkable progress in deep high-throughput mining and comprehensive mapping of allosteric protein landscapes and allosteric regulatory mechanisms as well as on the new developments in AI methods for prediction and characterization of allosteric binding sites on the proteome level. We also discuss new AI-augmented structural biology approaches that expand our knowledge of the universe of protein dynamics and allostery. We conclude with an outlook and highlight the importance of developing an open science infrastructure for machine learning studies of allosteric regulation and validation of computational approaches using integrative studies of allosteric mechanisms. The development of community-accessible tools that uniquely leverage the existing experimental and simulation knowledgebase to enable interrogation of the allosteric functions can provide a much-needed boost to further innovation and integration of experimental and computational technologies empowered by the booming AI field.
Affiliation(s)
- Gennady Verkhivker
- Keck Center for Science and Engineering, Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA 92866, USA; (M.A.); (G.G.)
- Department of Biomedical and Pharmaceutical Sciences, Chapman University School of Pharmacy, Irvine, CA 92618, USA
| | - Mohammed Alshahrani
- Keck Center for Science and Engineering, Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA 92866, USA; (M.A.); (G.G.)
| | - Grace Gupta
- Keck Center for Science and Engineering, Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA 92866, USA; (M.A.); (G.G.)
| | - Sian Xiao
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, TX 75275, USA; (S.X.); (P.T.)
| | - Peng Tao
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, TX 75275, USA; (S.X.); (P.T.)
| |
|
16
|
Ziegler C, Martin J, Sinner C, Morcos F. Latent generative landscapes as maps of functional diversity in protein sequence space. Nat Commun 2023; 14:2222. [PMID: 37076519 PMCID: PMC10113739 DOI: 10.1038/s41467-023-37958-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/19/2022] [Accepted: 04/05/2023] [Indexed: 04/21/2023] Open
Abstract
Variational autoencoders are unsupervised learning models with generative capabilities; when applied to protein data, they classify sequences by phylogeny and generate de novo sequences that preserve statistical properties of protein composition. While previous studies focus on clustering and generative features, here we evaluate the underlying latent manifold in which sequence information is embedded. To investigate properties of the latent manifold, we utilize direct coupling analysis and a Potts Hamiltonian model to construct a latent generative landscape. We showcase how this landscape captures phylogenetic groupings and functional and fitness properties of several systems, including globins, β-lactamases, ion channels, and transcription factors. We show how the landscape helps us understand the effects of sequence variability observed in experimental data and provides insights into directed and natural protein evolution. We propose that combining the generative properties and functional predictive power of variational autoencoders with coevolutionary analysis could be beneficial in applications for protein engineering and design.
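A Potts Hamiltonian of the kind used in direct coupling analysis scores a sequence with per-site fields and pairwise couplings. The sketch below evaluates one common convention, H(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j). The function name `potts_energy`, the nested-list and dictionary layout of the parameters, and the toy values are assumptions made for illustration; the paper's own parameterization and sign convention may differ.

```python
def potts_energy(seq, fields, couplings):
    """Potts Hamiltonian H(s) = -sum_i h_i(s_i) - sum_{i<j} J_ij(s_i, s_j).

    seq       -- sequence encoded as integer state indices, one per site
    fields    -- fields[i][a] is the local field h_i(a)
    couplings -- couplings[(i, j)][a][b] is J_ij(a, b) for i < j
    The data layout is a hypothetical choice for this sketch.
    """
    energy = 0.0
    n_sites = len(seq)
    for i in range(n_sites):
        # Single-site contribution.
        energy -= fields[i][seq[i]]
        # Pairwise contributions, counting each pair once.
        for j in range(i + 1, n_sites):
            energy -= couplings[(i, j)][seq[i]][seq[j]]
    return energy


if __name__ == "__main__":
    # Toy two-site, two-state model with made-up parameters.
    h = [[1.0, 0.0], [0.0, 2.0]]
    J = {(0, 1): [[0.0, 0.5], [0.0, 0.0]]}
    print(potts_energy([0, 1], h, J))
```

Under this convention, lower energy corresponds to sequences the model considers more compatible with the family, which is how such a Hamiltonian can be evaluated over decoded latent coordinates to draw a landscape.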
Affiliation(s)
- Cheyenne Ziegler
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
| | - Jonathan Martin
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
| | - Claude Sinner
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA.
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX, 75080, USA.
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX, 75080, USA.
| |
|
17
|
Zhu JJ, Zhang NJ, Wei T, Chen HF. Enhancing Conformational Sampling for Intrinsically Disordered and Ordered Proteins by Variational Autoencoder. Int J Mol Sci 2023; 24:6896. [PMID: 37108059 PMCID: PMC10138423 DOI: 10.3390/ijms24086896] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/03/2023] [Revised: 03/26/2023] [Accepted: 03/27/2023] [Indexed: 04/29/2023] Open
Abstract
Intrinsically disordered proteins (IDPs), which have no fixed three-dimensional structure under physiological conditions, account for more than 50% of the human proteome and are closely associated with tumors, cardiovascular diseases, and neurodegeneration. Because of their conformational diversity, conventional experimental methods of structural biology, such as NMR, X-ray diffraction, and CryoEM, are unable to capture their conformational ensembles. Molecular dynamics (MD) simulation can sample dynamic conformations at the atomic level and has become an effective method for studying the structure and function of IDPs. However, the high computational cost prevents MD simulations from being widely used for IDP conformational sampling. In recent years, significant progress has been made in artificial intelligence, which makes it possible to solve the conformational reconstruction problem of IDPs with fewer computational resources. Here, based on short MD simulations of different IDP systems, we use variational autoencoders (VAEs) to achieve the generative reconstruction of IDP structures and include a wider range of sampled conformations from longer simulations. Compared with generative autoencoders (AEs), VAEs add an inference layer between the encoder and decoder in the latent space, which can cover the conformational landscape of IDPs more comprehensively and achieve the effect of enhanced sampling. Through experimental verification, the Cα RMSD between VAE-generated and MD-sampled conformations in the five IDP test systems was significantly lower than that of AE, and the Spearman correlation coefficient on the structure was higher than that of AE. VAE can also achieve excellent performance for structured proteins. In summary, VAEs can be used to effectively sample protein structures.
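The "inference layer" that distinguishes a VAE from a plain autoencoder is, in the standard formulation, a stochastic latent layer: the encoder outputs a mean and log-variance, a latent vector is sampled with the reparameterization trick, and a KL term pulls the latent distribution toward a standard normal. The sketch below shows only those two pieces in plain Python. The function names and list-based inputs are illustrative assumptions; this is not the authors' network, which would be built in a deep learning framework.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    mu, log_var -- encoder outputs (hypothetical lists of floats)
    rng         -- a random.Random instance, passed in for reproducibility
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -1.2], [-0.5, 0.1]
    z = reparameterize(mu, log_var, rng)
    print(z, kl_to_standard_normal(mu, log_var))
```

Decoding many samples of z drawn around encoded MD frames is what allows a VAE to propose conformations beyond the frames it was trained on, whereas a plain AE maps each frame to a single latent point.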
Affiliation(s)
- Jun-Jie Zhu
- State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Ning-Jie Zhang
- State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Ting Wei
- State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Hai-Feng Chen
- State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Shanghai Center for Bioinformation Technology, Shanghai 200240, China
| |
|
18
|
Yoo J, Kim TY, Joung I, Song SO. Industrializing AI/ML during the end-to-end drug discovery process. Curr Opin Struct Biol 2023; 79:102528. [PMID: 36736243 DOI: 10.1016/j.sbi.2023.102528] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 09/15/2022] [Revised: 12/16/2022] [Accepted: 12/20/2022] [Indexed: 02/04/2023]
Abstract
Drug discovery aims to select proper targets and drug candidates to address unmet clinical needs. The end-to-end drug discovery process includes all stages of drug discovery from target identification to drug candidate selection. Recently, several artificial intelligence and machine learning (AI/ML)-based drug discovery companies have attempted to build data-driven platforms spanning the end-to-end drug discovery process. The ability to identify elusive targets essentially leads to the diversification of discovery pipelines, thereby increasing the ability to address unmet needs. Modern ML technologies are complementing traditional computer-aided drug discovery by accelerating candidate optimization in innovative ways. This review summarizes recent developments in AI/ML methods from target identification to molecule optimization, and concludes with an overview of current industrial trends in end-to-end AI/ML platforms.
Affiliation(s)
- Jiho Yoo
- Standigm Inc., 3F, 70 Nonhyeon-ro 85-gil, Gangnam-gu, Seoul, South Korea, 06234 +82.2.501.8118
| | - Tae Yong Kim
- Standigm Inc., 3F, 70 Nonhyeon-ro 85-gil, Gangnam-gu, Seoul, South Korea, 06234 +82.2.501.8118
| | - InSuk Joung
- Standigm Inc., 3F, 70 Nonhyeon-ro 85-gil, Gangnam-gu, Seoul, South Korea, 06234 +82.2.501.8118
| | - Sang Ok Song
- Standigm Inc., 3F, 70 Nonhyeon-ro 85-gil, Gangnam-gu, Seoul, South Korea, 06234 +82.2.501.8118.
| |
|
19
|
Agajanian S, Alshahrani M, Bai F, Tao P, Verkhivker GM. Exploring and Learning the Universe of Protein Allostery Using Artificial Intelligence Augmented Biophysical and Computational Approaches. J Chem Inf Model 2023; 63:1413-1428. [PMID: 36827465 PMCID: PMC11162550 DOI: 10.1021/acs.jcim.2c01634] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Indexed: 02/26/2023]
Abstract
Allosteric mechanisms are commonly employed regulatory tools used by proteins to orchestrate complex biochemical processes and control communications in cells. The quantitative understanding and characterization of allosteric molecular events are among major challenges in modern biology and require integration of innovative computational and experimental approaches to obtain atomistic-level knowledge of the allosteric states, interactions, and dynamic conformational landscapes. The growing body of computational and experimental studies empowered by emerging artificial intelligence (AI) technologies has opened up new paradigms for exploring and learning the universe of protein allostery from first principles. In this review we analyze recent developments in high-throughput deep mutational scanning of allosteric protein functions; applications and latest adaptations of AlphaFold structural prediction methods for studies of protein dynamics and allostery; new frontiers in integrating machine learning and enhanced sampling techniques for characterization of allostery; and recent advances in structural biology approaches for studies of allosteric systems. We also highlight recent computational and experimental studies of the SARS-CoV-2 spike (S) proteins revealing an important and often hidden role of allosteric regulation driving functional conformational changes, binding interactions with the host receptor, and mutational escape mechanisms of S proteins which are critical for viral infection. We conclude with a summary and outlook of future directions suggesting that AI-augmented biophysical and computer simulation approaches are beginning to transform studies of protein allostery toward systematic characterization of allosteric landscapes, hidden allosteric states, and mechanisms which may bring about a new revolution in molecular biology and drug discovery.
Affiliation(s)
- Steve Agajanian
- Keck Center for Science and Engineering, Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, California 92866, United States
| | - Mohammed Alshahrani
- Keck Center for Science and Engineering, Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, California 92866, United States
| | - Fang Bai
- Shanghai Institute for Advanced Immunochemical Studies, School of Life Science and Technology and Information Science and Technology, Shanghai Tech University, 393 Middle Huaxia Road, Shanghai 201210, China
| | - Peng Tao
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75205, United States
| | - Gennady M Verkhivker
- Keck Center for Science and Engineering, Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, California 92866, United States
- Department of Biomedical and Pharmaceutical Sciences, Chapman University School of Pharmacy, Irvine, California 92618, United States
| |
|
20
|
Tian H, Jiang X, Xiao S, La Force H, Larson EC, Tao P. LAST: Latent Space-Assisted Adaptive Sampling for Protein Trajectories. J Chem Inf Model 2023; 63:67-75. [PMID: 36472885 PMCID: PMC9904845 DOI: 10.1021/acs.jcim.2c01213] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 12/12/2022]
Abstract
Molecular dynamics (MD) simulation is widely used to study protein conformations and dynamics. However, conventional simulations often become trapped in local energy minima that are hard to escape. Thus, most of the computational time is spent sampling regions that have already been visited. This leads to an inefficient sampling process and further hinders the exploration of protein movements in affordable simulation time. The advancement of deep learning provides new opportunities for protein sampling. Variational autoencoders are a class of deep learning models that learn a low-dimensional representation (referred to as the latent space) capturing the key features of the input data. Based on this characteristic, we proposed a new adaptive sampling method, latent space-assisted adaptive sampling for protein trajectories (LAST), to accelerate the exploration of protein conformational space. This method comprises cycles of (i) variational autoencoder training, (ii) seed structure selection on the latent space, and (iii) conformational sampling through additional MD simulations. The proposed approach is validated through the sampling of four structures of two protein systems: two metastable states of Escherichia coli adenosine kinase (ADK) and two native states of Vivid (VVD). In all four conformations, seed structures were shown to lie on the boundary of conformation distributions. Moreover, large conformational changes were observed in a shorter simulation time when compared with structural dissimilarity sampling (SDS) and conventional MD (cMD) simulations in both systems. In metastable ADK simulations, LAST explored two transition paths toward two stable states, while SDS explored only one and cMD explored neither. In VVD light state simulations, LAST was three times faster than cMD simulation with a similar conformational space. Overall, LAST is comparable to SDS and is a promising tool in adaptive sampling.
The LAST method is publicly available at https://github.com/smu-tao-group/LAST to facilitate related research.
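Step (ii) of the cycle described above, choosing seed structures on the latent space, can be illustrated with a simple outlier rule. The sketch below picks the latent points farthest from the centroid, which is one way to obtain seeds on the boundary of the sampled distribution. This rule is an assumption used for illustration only; the criterion actually implemented in LAST should be taken from the linked repository. The function name `select_seeds` and the tuple-based latent points are likewise hypothetical.

```python
import math


def select_seeds(latent_points, n_seeds):
    """Return indices of the n_seeds latent points farthest from the centroid.

    latent_points -- list of equal-length coordinate tuples, one per MD frame
    n_seeds       -- number of frames to restart new simulations from
    Assumed stand-in for a boundary-seeking selection rule, not LAST's own.
    """
    n_points = len(latent_points)
    dim = len(latent_points[0])

    # Centroid of the embedded frames.
    centroid = [sum(p[d] for p in latent_points) / n_points for d in range(dim)]

    # Rank frames by Euclidean distance from the centroid, farthest first.
    distances = [math.dist(p, centroid) for p in latent_points]
    ranked = sorted(range(n_points), key=lambda i: distances[i], reverse=True)
    return ranked[:n_seeds]


if __name__ == "__main__":
    embedded = [(0.0, 0.0), (0.1, 0.0), (-0.1, 0.0), (5.0, 0.0), (0.0, -4.0)]
    print(select_seeds(embedded, 2))
```

In a full cycle, the frames at the returned indices would be used as starting structures for the next round of short MD simulations, after which the autoencoder is retrained on the enlarged trajectory set.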
Affiliation(s)
- Hao Tian
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75206, United States
| | - Xi Jiang
- Department of Statistical Science, Southern Methodist University, Dallas, Texas 75206, United States
| | - Sian Xiao
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75206, United States
| | - Hunter La Force
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75206, United States
| | - Eric C Larson
- Department of Computer Science, Southern Methodist University, Dallas, Texas 75206, United States
| | - Peng Tao
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, Texas 75206, United States
| |
|
21
|
Tian H, Ketkar R, Tao P. ADMETboost: a web server for accurate ADMET prediction. J Mol Model 2022; 28:408. [PMID: 36454321 PMCID: PMC9903341 DOI: 10.1007/s00894-022-05373-8] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 09/13/2022] [Accepted: 10/31/2022] [Indexed: 12/03/2022]
Abstract
The absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties are important in drug discovery as they define efficacy and safety. In this work, we applied an ensemble of features, including fingerprints and descriptors, and a tree-based machine learning model, extreme gradient boosting, for accurate ADMET prediction. Our model performs well in the Therapeutics Data Commons ADMET benchmark group. Of 22 tasks, our model is ranked first in 18 and in the top 3 in 21. The trained machine learning models are integrated in ADMETboost, a web server that is publicly available at https://ai-druglab.smu.edu/admet.
Affiliation(s)
- Hao Tian
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, 75205, TX, USA
| | | | - Peng Tao
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, 75205, TX, USA.
| |
|
22
|
Avery C, Patterson J, Grear T, Frater T, Jacobs DJ. Protein Function Analysis through Machine Learning. Biomolecules 2022; 12:1246. [PMID: 36139085 PMCID: PMC9496392 DOI: 10.3390/biom12091246] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/16/2022] [Revised: 08/22/2022] [Accepted: 08/31/2022] [Indexed: 11/16/2022] Open
Abstract
Machine learning (ML) has been an important part of the computational biology arsenal for elucidating protein function for decades. With the recent burgeoning of novel ML methods and applications, new ML approaches have been incorporated into many areas of computational biology dealing with protein function. We examine how ML has been integrated into a wide range of computational models to improve prediction accuracy and gain a better understanding of protein function. The applications discussed are protein structure prediction; protein engineering using sequence modifications to achieve stability and druggability characteristics; molecular docking in terms of protein-ligand binding, including allosteric effects; protein-protein interactions; and protein-centric drug discovery. To quantify the mechanisms underlying protein function, a holistic approach that takes structure, flexibility, stability, and dynamics into account is required, as these aspects become inseparable through their interdependence. Another key component of protein function is conformational dynamics, which often manifest as protein kinetics. Computational methods that use ML to generate representative conformational ensembles and quantify differences in conformational ensembles important for function are included in this review. Future opportunities are highlighted for each of these topics.
Affiliation(s)
- Chris Avery
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - John Patterson
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - Tyler Grear
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
- Department of Physics and Optical Science, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - Theodore Frater
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| | - Donald J. Jacobs
- Department of Physics and Optical Science, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
| |
|
23
|
Nussinov R, Zhang M, Liu Y, Jang H. AlphaFold, Artificial Intelligence (AI), and Allostery. J Phys Chem B 2022; 126:6372-6383. [PMID: 35976160 PMCID: PMC9442638 DOI: 10.1021/acs.jpcb.2c04346] [Citation(s) in RCA: 55] [Impact Index Per Article: 18.3] [Received: 06/22/2022] [Revised: 08/03/2022] [Indexed: 02/08/2023]
Abstract
AlphaFold has burst into our lives. It is a powerful algorithm that underscores the strength of biological sequence data and artificial intelligence (AI). AlphaFold has appended projects and research directions. The database it has been creating promises an untold number of applications with vast potential impacts that are still difficult to surmise. AI approaches can revolutionize personalized treatments and usher in better-informed clinical trials. They promise to make giant leaps toward reshaping and revamping drug discovery strategies, selecting and prioritizing combinations of drug targets. Here, we briefly overview AI in structural biology, including in molecular dynamics simulations and prediction of microbiota-human protein-protein interactions. We highlight the advancements accomplished by the deep-learning-powered AlphaFold in protein structure prediction and their powerful impact on the life sciences. At the same time, AlphaFold does not resolve the decades-long protein folding challenge, nor does it identify the folding pathways. The models that AlphaFold provides do not capture conformational mechanisms like frustration and allostery, which are rooted in ensembles and controlled by their dynamic distributions. Allostery and signaling are properties of populations. AlphaFold also does not generate ensembles of intrinsically disordered proteins and regions, instead describing them by their low structural probabilities. Since AlphaFold generates single ranked structures, rather than conformational ensembles, it cannot elucidate the mechanisms of allosteric activating driver hotspot mutations nor of allosteric drug resistance. However, by capturing key features, deep learning techniques can use the single predicted conformation as the basis for generating a diverse ensemble.
Affiliation(s)
- Ruth Nussinov
  - Computational Structural Biology Section, Frederick National Laboratory for Cancer Research, Frederick, Maryland 21702, United States
  - Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
- Mingzhen Zhang
  - Computational Structural Biology Section, Frederick National Laboratory for Cancer Research, Frederick, Maryland 21702, United States
- Yonglan Liu
  - Cancer Innovation Laboratory, National Cancer Institute, Frederick, Maryland 21702, United States
- Hyunbum Jang
  - Computational Structural Biology Section, Frederick National Laboratory for Cancer Research, Frederick, Maryland 21702, United States
|
24
|
Rudden LSP, Hijazi M, Barth P. Deep learning approaches for conformational flexibility and switching properties in protein design. Front Mol Biosci 2022; 9:928534. [PMID: 36032687 PMCID: PMC9399439 DOI: 10.3389/fmolb.2022.928534] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/25/2022] [Accepted: 07/15/2022] [Indexed: 11/30/2022] Open
Abstract
Following the hugely successful application of deep learning methods to protein structure prediction, an increasing number of design methods seek to leverage generative models to design proteins with improved functionality over native proteins or novel structure and function. The inherent flexibility of proteins, from side-chain motion to larger conformational reshuffling, poses a challenge to design methods, where the ideal approach must consider both the spatial and temporal evolution of proteins in the context of their functional capacity. In this review, we highlight existing methods for protein design before discussing how methods at the forefront of deep learning-based design accommodate flexibility and where the field could evolve in the future.
Affiliation(s)
- Lucas S. P. Rudden
  - Institute of Bioengineering, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
- Patrick Barth
  - Institute of Bioengineering, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
|
25
|
Xiao S, Tian H, Tao P. PASSer2.0: Accurate Prediction of Protein Allosteric Sites Through Automated Machine Learning. Front Mol Biosci 2022; 9:879251. [PMID: 35898310 PMCID: PMC9309527 DOI: 10.3389/fmolb.2022.879251] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 02/19/2022] [Accepted: 05/23/2022] [Indexed: 11/16/2022] Open
Abstract
Allostery is a fundamental process in regulating protein activities. The discovery, design, and development of allosteric drugs demand better identification of allosteric sites. Several computational methods have been developed previously to predict allosteric sites using static pocket features and protein dynamics. Here, we define a baseline model for allosteric site prediction and present a computational model using automated machine learning. Our model, PASSer2.0, advanced the previous results and performed well across multiple indicators with 82.7% of allosteric pockets appearing among the top three positions. The trained machine learning model has been integrated with the Protein Allosteric Sites Server (PASSer) to facilitate allosteric drug discovery.
Affiliation(s)
- Hao Tian
  - Center for Research Computing, Center for Drug Discovery, Design and Delivery (CD4), Department of Chemistry, Southern Methodist University, Dallas, TX, United States
- Peng Tao
  - Center for Research Computing, Center for Drug Discovery, Design and Delivery (CD4), Department of Chemistry, Southern Methodist University, Dallas, TX, United States
|
26
|
Extra Proximal-Gradient Network with Learned Regularization for Image Compressive Sensing Reconstruction. J Imaging 2022; 8:jimaging8070178. [PMID: 35877622 PMCID: PMC9319865 DOI: 10.3390/jimaging8070178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2022] [Revised: 06/16/2022] [Accepted: 06/20/2022] [Indexed: 12/05/2022] Open
Abstract
Learned optimization algorithms are promising approaches to inverse problems by leveraging advanced numerical optimization schemes and deep neural network techniques in machine learning. In this paper, we propose a novel deep neural network architecture imitating an extra proximal gradient algorithm to solve a general class of inverse problems with a focus on applications in image reconstruction. The proposed network features learned regularization that incorporates adaptive sparsification mappings, robust shrinkage selections, and nonlocal operators to improve solution quality. Numerical results demonstrate the improved efficiency and accuracy of the proposed network over several state-of-the-art methods on a variety of test problems.
|
27
|
Trozzi F, Karki N, Song Z, Verma N, Kraka E, Zoltowski BD, Tao P. Allosteric control of ACE2 peptidase domain dynamics. Org Biomol Chem 2022; 20:3605-3618. [PMID: 35420112 PMCID: PMC9205182 DOI: 10.1039/d2ob00606e] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/15/2022]
Abstract
The Angiotensin Converting Enzyme 2 (ACE2) assists the regulation of blood pressure and is the main target of the coronaviruses responsible for SARS and COVID19. The catalytic function of ACE2 relies on the opening and closing motion of its peptidase domain (PD). In this study, we investigated the possibility of allosterically controlling the ACE2 PD functional dynamics. After confirming that ACE2 PD binding site opening-closing motion is dominant in characterizing its conformational landscape, we observed that few mutations in the viral receptor binding domain fragments were able to impart different effects on the binding site opening of ACE2 PD. This showed that binding to the solvent exposed area of ACE2 PD can effectively alter the conformational profile of the protein, and thus likely its catalytic function. Using a targeted machine learning model and relative entropy-based statistical analysis, we proposed the mechanism for the allosteric perturbation that regulates the ACE2 PD binding site dynamics at atomistic level. The key residues and the source of the allosteric regulation of ACE PD dynamics are also presented.
Affiliation(s)
- Francesco Trozzi
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, USA.
- Nischal Karki
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, USA.
- Zilin Song
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, USA.
- Niraj Verma
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, USA.
- Elfi Kraka
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, USA.
- Brian D Zoltowski
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, USA.
- Peng Tao
  - Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, USA.
|
28
|
Baltrukevich H, Podlewska S. From Data to Knowledge: Systematic Review of Tools for Automatic Analysis of Molecular Dynamics Output. Front Pharmacol 2022; 13:844293. [PMID: 35359865 PMCID: PMC8960308 DOI: 10.3389/fphar.2022.844293] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/27/2021] [Accepted: 01/26/2022] [Indexed: 12/02/2022] Open
Abstract
An increasing number of crystal structures available on one side, and the boost of computational power available for computer-aided drug design tasks on the other, have caused that the structure-based drug design tools are intensively used in the drug development pipelines. Docking and molecular dynamics simulations, key representatives of the structure-based approaches, provide detailed information about the potential interaction of a ligand with a target receptor. However, at the same time, they require a three-dimensional structure of a protein and a relatively high amount of computational resources. Nowadays, as both docking and molecular dynamics are much more extensively used, the amount of data output from these procedures is also growing. Therefore, there are also more and more approaches that facilitate the analysis and interpretation of the results of structure-based tools. In this review, we will comprehensively summarize approaches for handling molecular dynamics simulations output. It will cover both statistical and machine-learning-based tools, as well as various forms of depiction of molecular dynamics output.
Affiliation(s)
- Hanna Baltrukevich
  - Maj Institute of Pharmacology, Polish Academy of Sciences, Kraków, Poland
  - Faculty of Pharmacy, Chair of Technology and Biotechnology of Medical Remedies, Jagiellonian University Medical College in Krakow, Kraków, Poland
- Sabina Podlewska
  - Maj Institute of Pharmacology, Polish Academy of Sciences, Kraków, Poland
|