1. Li G, Zhou J, Luo J, Liang C. Accurate prediction of virulence factors using pre-train protein language model and ensemble learning. BMC Genomics 2025; 26:517. [PMID: 40399812 PMCID: PMC12093764 DOI: 10.1186/s12864-025-11694-8] [Received: 03/25/2025] [Accepted: 05/09/2025] [Indexed: 05/23/2025] Open
Abstract
BACKGROUND: As bacterial pathogens develop increasing resistance to antibiotics, strategies targeting virulence factors (VFs) have emerged as a promising and effective approach for treating bacterial infections. Existing methods rely mainly on sequence similarity, and remote homology relationships cannot be discovered by sequence analysis alone.
RESULTS: To address this limitation, we developed a protein language model and ensemble learning approach for VF identification (PLMVF). Specifically, we extracted features from protein sequences using ESM-2 and predicted their three-dimensional (3D) structures using ESMFold. We calculated the true TM-score of the proteins based on their 3D structures and trained a TM-predictor model to predict structural similarity, thereby capturing hidden remote homology information within the sequences. Subsequently, we concatenated the sequence-level features extracted by ESM-2 with the predicted TM-score features to form a comprehensive feature set for prediction. Extensive experimental validation demonstrated that PLMVF achieved an accuracy (ACC) of 86.1%, significantly outperforming existing models across multiple evaluation metrics. This study provides a tool for identifying novel targets in the development of anti-virulence therapies, offering promise for the effective prevention and control of pathogenic bacterial infections.
CONCLUSIONS: The proposed PLMVF model offers an efficient computational approach for VF identification.
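For readers unfamiliar with the TM-score that the abstract above uses as its structural-similarity target, the following is a minimal Python sketch of the standard TM-score formula evaluated for one fixed superposition. It is not taken from PLMVF: the function names are hypothetical, the clamp of the distance scale at 0.5 Å is an assumption borrowed from common practice, and the full score additionally maximizes over superpositions, which this sketch omits.

```python
def tm_score_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (angstroms) of the TM-score."""
    # Assumption: clamp d0 at 0.5 so that very short chains do not yield
    # a zero or negative scale.
    if target_length <= 15:
        return 0.5
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return max(d0, 0.5)


def tm_score(aligned_distances, target_length: int) -> float:
    """TM-score of one fixed superposition.

    aligned_distances: distances (angstroms) between aligned residue pairs.
    target_length: number of residues in the target (reference) protein.
    """
    d0 = tm_score_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in aligned_distances)
    # Normalizing by the target length (not the number of aligned pairs)
    # means unaligned residues lower the score.
    return total / target_length
```

Because the sum is divided by the target length, a perfect superposition that covers only half of the target scores 0.5 rather than 1.0.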
Affiliation(s)
- Guanghui Li
  - School of Information and Software Engineering, East China Jiaotong University, Nanchang, 330013, China
- Jian Zhou
  - School of Information and Software Engineering, East China Jiaotong University, Nanchang, 330013, China
- Jiawei Luo
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, 410082, China
- Cheng Liang
  - School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
2. Wang X, Zhang T, Liu G, Cui Z, Zeng Z, Long C, Zheng W, Yang J. LightRoseTTA: High-Efficient and Accurate Protein Structure Prediction Using a Light-Weight Deep Graph Model. Advanced Science (Weinheim, Baden-Wurttemberg, Germany) 2025; 12:e2309051. [PMID: 40134034 PMCID: PMC12097069 DOI: 10.1002/advs.202309051] [Received: 11/23/2023] [Revised: 04/17/2024] [Indexed: 03/27/2025]
Abstract
Accurately predicting protein structure, from sequences to 3D structures, is of great significance in biological research. To tackle this issue, a representative large deep model, RoseTTAFold, was proposed with promising success. Here, a light-weight deep graph network, named LightRoseTTA, is reported to achieve accurate and highly efficient prediction for proteins. Notably, LightRoseTTA has three highlights: i) highly accurate structure prediction for proteins, being competitive with RoseTTAFold on multiple popular datasets including CASP14 and CAMEO; ii) highly efficient training and inference with a light-weight model, costing only 1 week on a single NVIDIA 3090 GPU for model training (vs 30 days on 8 NVIDIA V100 GPUs for RoseTTAFold) and containing only 1.4M parameters (vs 130M in RoseTTAFold); iii) low dependency on multiple sequence alignment (MSA), achieving the best performance on three MSA-insufficient datasets: Orphan, De novo, and Orphan25. In addition, LightRoseTTA is transferable from general proteins to antibody data, as verified in the experiments. The time and resource costs of LightRoseTTA and RoseTTAFold are further discussed to demonstrate the feasibility of light-weight models for protein structure prediction, which may be crucial in resource-limited research for universities and academic institutions. The code and model are released to speed biological research (https://github.com/psp3dcg/LightRoseTTA).
Affiliation(s)
- Xudong Wang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Tong Zhang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Guangbu Liu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Zhen Cui
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
- Zhiyong Zeng
  - School of Automation, Nanjing University of Science and Technology, Nanjing, 210094, China
- Cheng Long
  - School of Computer Engineering, Nanyang Technological University, No. 50, Nanyang Avenue, Singapore, 639798, Singapore
- Wenming Zheng
  - School of Biological Science & Medical Engineering, Southeast University, Nanjing, 210096, China
- Jian Yang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China
3. Meng Y, Zhang Z, Zhou C, Tang X, Hu X, Tian G, Yang J, Yao Y. Protein structure prediction via deep learning: an in-depth review. Front Pharmacol 2025; 16:1498662. [PMID: 40248099 PMCID: PMC12003282 DOI: 10.3389/fphar.2025.1498662] [Received: 09/30/2024] [Accepted: 02/28/2025] [Indexed: 04/19/2025] Open
Abstract
The application of deep learning algorithms in protein structure prediction has greatly influenced drug discovery and development. Accurate protein structures are crucial for understanding biological processes and designing effective therapeutics. Traditionally, experimental methods like X-ray crystallography, nuclear magnetic resonance, and cryo-electron microscopy have been the gold standard for determining protein structures. However, these approaches are often costly, inefficient, and time-consuming. At the same time, the number of known protein sequences far exceeds the number of experimentally determined structures, creating a gap that necessitates the use of computational approaches. Deep learning has emerged as a promising solution to address this challenge over the past decade. This review provides a comprehensive guide to applying deep learning methodologies and tools in protein structure prediction. We initially outline the databases related to the protein structure prediction, then delve into the recently developed large language models as well as state-of-the-art deep learning-based methods. The review concludes with a perspective on the future of predicting protein structure, highlighting potential challenges and opportunities.
Affiliation(s)
- Yajie Meng
  - College of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China
- Zhuang Zhang
  - College of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China
- Chang Zhou
  - College of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China
- Xianfang Tang
  - College of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China
- Xinrong Hu
  - College of Computer Science and Artificial Intelligence, Wuhan Textile University, Wuhan, China
- Yuhua Yao
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
  - Key Laboratory of Data Science and Intelligence Education, Ministry of Education, Hainan Normal University, Haikou, China
  - Key Laboratory of Computational Science and Application of Hainan Province, Hainan Normal University, Haikou, China
4. Harihar B, Saravanan KM, Gromiha MM, Selvaraj S. Importance of Inter-residue Contacts for Understanding Protein Folding and Unfolding Rates, Remote Homology, and Drug Design. Mol Biotechnol 2025; 67:862-884. [PMID: 38498284 DOI: 10.1007/s12033-024-01119-4] [Received: 12/16/2023] [Accepted: 02/10/2024] [Indexed: 03/20/2024]
Abstract
Inter-residue interactions in protein structures provide valuable insights into protein folding and stability. Understanding these interactions can be helpful in many crucial applications, including rational design of therapeutic small molecules and biologics, locating functional protein sites, and predicting protein-protein and protein-ligand interactions. The process of developing machine learning models incorporating inter-residue interactions has been improved recently. This review highlights the theoretical models incorporating inter-residue interactions in predicting folding and unfolding rates of proteins. Utilizing contact maps to depict inter-residue interactions aids researchers in developing computer models for detecting remote homologs and interface residues within protein-protein complexes which, in turn, enhances our knowledge of the relationship between sequence and structure of proteins. Further, the application of contact maps derived from inter-residue interactions is highlighted in the field of drug discovery. Overall, this review presents an extensive assessment of the significant models that use inter-residue interactions to investigate folding rates, unfolding rates, remote homology, and drug development, providing potential future advancements in constructing efficient computational models in structural biology.
Affiliation(s)
- Balasubramanian Harihar
  - Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India
- Konda Mani Saravanan
  - Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
  - Department of Biotechnology, Bharath Institute of Higher Education and Research, Chennai, Tamil Nadu, 600073, India
- Michael M Gromiha
  - Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai, Tamil Nadu, 600036, India
- Samuel Selvaraj
  - Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli, Tamil Nadu, 620024, India
5. Culkins C, Adomanis R, Phan N, Robinson B, Slaton E, Lothrop E, Chen Y, Kimmel BR. Unlocking the Gates: Therapeutic Agents for Noninvasive Drug Delivery Across the Blood-Brain Barrier. Mol Pharm 2024; 21:5430-5454. [PMID: 39324552 DOI: 10.1021/acs.molpharmaceut.4c00604] [Indexed: 09/27/2024]
Abstract
The blood-brain barrier (BBB) is a highly selective network of various cell types that acts as a filter between the blood and the brain parenchyma. Because of this, the BBB remains a major obstacle for drug delivery to the central nervous system (CNS). In recent years, there has been a focus on developing various modifiable platforms, such as monoclonal antibodies (mAbs), nanobodies (Nbs), peptides, and nanoparticles, as both therapeutic agents and carriers for targeted drug delivery to treat brain cancers and diseases. Methods for bypassing the BBB can be invasive or noninvasive. Invasive techniques, such as transient disruption of the BBB using low pulse electrical fields and intracerebroventricular infusion, lack specificity and have numerous safety concerns. In this review, we focus on noninvasive transport mechanisms that offer high levels of biocompatibility, personalization, and specificity and are regarded as generally safer than their invasive counterparts. Modifiable platforms can be designed to noninvasively traverse the BBB through one or more of the following pathways: passive diffusion through a physio-pathologically disrupted BBB, adsorptive-mediated transcytosis, receptor-mediated transcytosis, shuttle-mediated transcytosis, and somatic gene transfer. Through understanding the noninvasive pathways, new applications, including Chimeric Antigen Receptor T-cell (CAR-T) therapy, and approaches for drug delivery across the BBB are emerging.
Affiliation(s)
- Courtney Culkins
  - Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, United States
- Roman Adomanis
  - Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, United States
- Nathan Phan
  - Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, United States
- Blaise Robinson
  - Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, United States
- Ethan Slaton
  - Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, United States
- Elijah Lothrop
  - Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, United States
- Yinuo Chen
  - Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, United States
- Blaise R Kimmel
  - Department of Chemical and Biomolecular Engineering, The Ohio State University, Columbus, Ohio 43210, United States
  - Center for Cancer Engineering, Ohio State University Comprehensive Cancer Center, The Ohio State University, Columbus, Ohio 43210, United States
  - Pelotonia Institute for Immuno-Oncology, Ohio State University Comprehensive Cancer Center, The Ohio State University, Columbus, Ohio 43210, United States
6. Nguyen VTD, Nguyen ND, Hy TS. ProteinReDiff: Complex-based ligand-binding proteins redesign by equivariant diffusion-based generative models. Structural Dynamics (Melville, N.Y.) 2024; 11:064102. [PMID: 39629167 PMCID: PMC11614476 DOI: 10.1063/4.0000271] [Received: 08/11/2024] [Accepted: 11/06/2024] [Indexed: 12/07/2024]
Abstract
Proteins, serving as the fundamental architects of biological processes, interact with ligands to perform a myriad of functions essential for life. Designing functional ligand-binding proteins is pivotal for advancing drug development and enhancing therapeutic efficacy. In this study, we introduce ProteinReDiff, a diffusion framework targeting the redesign of ligand-binding proteins. Using equivariant diffusion-based generative models, ProteinReDiff enables the creation of high-affinity ligand-binding proteins without the need for detailed structural information, leveraging instead the potential of initial protein sequences and ligand SMILES strings. Our evaluations across sequence diversity, structural preservation, and ligand binding affinity underscore ProteinReDiff's potential to advance computational drug discovery and protein engineering.
Affiliation(s)
- Nhan D Nguyen
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, USA
- Truong Son Hy
  - Department of Computer Science, University of Alabama at Birmingham, Birmingham, Alabama 35294, USA
7. Zhang F, Li Z, Zhao K, Zhao P, Zhang G. Prediction of Inter-Residue Multiple Distances and Exploration of Protein Multiple Conformations by Deep Learning. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:1731-1739. [PMID: 38857126 DOI: 10.1109/tcbb.2024.3411825] [Indexed: 06/12/2024]
Abstract
AlphaFold2 has achieved a major breakthrough in end-to-end prediction of static protein structures. However, protein conformational change is considered to be a key factor in protein biological function. Predicting multiple inter-residue distances is of great significance for exploring multiple protein conformations. In this study, we proposed an inter-residue multiple distances prediction method, DeepMDisPre, based on an improved network that integrates triangle update, axial attention, and ResNet to predict multiple distances of residue pairs. We built a dataset that contains proteins with a single structure and proteins with multiple conformations to train the network. We tested DeepMDisPre on 114 proteins with multiple conformations. The results show that the inter-residue distance distributions predicted by DeepMDisPre tend to have multiple peaks more often for flexible residue pairs than for rigid residue pairs. In two cases of proteins with multiple conformations, we modeled the multiple conformations relatively accurately by using the predicted inter-residue multiple distances. In addition, we tested the performance of DeepMDisPre on 279 proteins with a single structure. Experimental results demonstrate that the average contact accuracy of DeepMDisPre is higher than that of the comparative method. In terms of static protein modeling, the average TM-score of the 3D models built by DeepMDisPre is also improved compared with the comparative method.
8. Liu J, Guo Z, You H, Zhang C, Lai L. All-Atom Protein Sequence Design Based on Geometric Deep Learning. Angew Chem Int Ed Engl 2024:e202411461. [PMID: 39295564 DOI: 10.1002/anie.202411461] [Received: 06/18/2024] [Revised: 09/09/2024] [Accepted: 09/18/2024] [Indexed: 09/21/2024]
Abstract
Designing sequences for specific protein backbones is a key step in creating new functional proteins. Here, we introduce GeoSeqBuilder, a deep learning framework that integrates protein sequence generation with side chain conformation prediction to produce complete all-atom structures for designed sequences. GeoSeqBuilder uses spatial geometric features from protein backbones and explicitly includes three-body interactions of neighboring residues. GeoSeqBuilder achieves a native residue type recovery rate of 51.6%, comparable to ProteinMPNN and other leading methods, while accurately predicting side chain conformations. We first used GeoSeqBuilder to design sequences for thioredoxin and a hallucinated three-helical bundle protein. All 15 tested sequences expressed as soluble monomeric proteins with high thermal stability, and the 2 high-resolution crystal structures solved closely match the designed models. The generated protein sequences exhibit low similarity (minimum 23%) to the original sequences, with significantly altered hydrophobic cores. We further redesigned the hydrophobic core of glutathione peroxidase 4, and 3 of the 5 designs showed improved enzyme activity. Although further testing is needed, the high experimental success rate in our testing demonstrates that GeoSeqBuilder is a powerful tool for designing novel sequences for predefined protein structures with atomic details. GeoSeqBuilder is available at https://github.com/PKUliujl/GeoSeqBuilder.
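The native residue type recovery rate quoted in the abstract above is usually understood as the fraction of positions at which a designed sequence reproduces the native residue. A minimal Python sketch under that assumption follows; the function name is hypothetical, and GeoSeqBuilder's own evaluation script may differ (for example, in how it averages over proteins).

```python
def native_recovery_rate(native: str, designed: str) -> float:
    """Fraction of positions where the designed residue equals the native one.

    Both sequences use one-letter amino acid codes and must describe the
    same backbone, so they must have equal, non-zero length.
    """
    if not native or len(native) != len(designed):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)
```

Under this definition, a reported rate of 51.6% would mean that roughly half of the positions recover the native residue type.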
Affiliation(s)
- Jiale Liu
  - Center for Life Sciences Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Zheng Guo
  - Center for Life Sciences Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Hantian You
  - BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China
- Changsheng Zhang
  - BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China
- Luhua Lai
  - Center for Life Sciences Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
  - BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China
  - Center for Quantitative Biology Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
  - Chengdu Academy for Advanced Interdisciplinary Biotechnologies, Peking University, Chengdu, 510100, Sichuan, China
9. Basu S, Subedi U, Tonelli M, Afshinpour M, Tiwari N, Fuentes EJ, Chakravarty S. Assessing the functional roles of coevolving PHD finger residues. Protein Sci 2024; 33:e5065. [PMID: 38923615 PMCID: PMC11201814 DOI: 10.1002/pro.5065] [Received: 10/31/2023] [Revised: 04/21/2024] [Accepted: 05/16/2024] [Indexed: 06/28/2024]
Abstract
Although in silico folding based on coevolving residue constraints in the deep-learning era has transformed protein structure prediction, the contributions of coevolving residues to protein folding, stability, and other functions in physical contexts remain to be clarified and experimentally validated. Herein, the PHD finger module, a well-known histone reader with distinct subtypes containing subtype-specific coevolving residues, was used as a model to experimentally assess the contributions of coevolving residues and to clarify their specific roles. The results of the assessment, including proteolysis and thermal unfolding of wildtype and mutant proteins, suggested that coevolving residues have varying contributions, despite their large in silico constraints. Residue positions with large constraints were found to contribute to stability in one subtype but not others. Computational sequence design and generative model-based energy estimates of individual structures were also implemented to complement the experimental assessment. Sequence design and energy estimates distinguish coevolving residues that contribute to folding from those that do not. The results of proteolytic analysis of mutations at positions contributing to folding were consistent with those suggested by sequence design and energy estimation. Thus, we report a comprehensive assessment of the contributions of coevolving residues, as well as a strategy based on a combination of approaches that should enable detailed understanding of the residue contributions in other large protein families.
Affiliation(s)
- Shraddha Basu
  - Department of Chemistry & Biochemistry, South Dakota State University, Brookings, South Dakota, USA
- Ujwal Subedi
  - Department of Chemistry & Biochemistry, South Dakota State University, Brookings, South Dakota, USA
- Marco Tonelli
  - National Magnetic Resonance Facility at Madison (NMRFAM), University of Wisconsin-Madison, Madison, Wisconsin, USA
- Maral Afshinpour
  - Department of Chemistry & Biochemistry, South Dakota State University, Brookings, South Dakota, USA
- Nitija Tiwari
  - Department of Biochemistry & Molecular Biology, University of Iowa, Iowa City, Iowa, USA
- Ernesto J. Fuentes
  - Department of Biochemistry & Molecular Biology, University of Iowa, Iowa City, Iowa, USA
- Suvobrata Chakravarty
  - Department of Chemistry & Biochemistry, South Dakota State University, Brookings, South Dakota, USA
10. Si Y, Zou J, Gao Y, Chuai G, Liu Q, Chen L. Foundation models in molecular biology. Biophysics Reports 2024; 10:135-151. [PMID: 39027316 PMCID: PMC11252241 DOI: 10.52601/bpr.2024.240006] [Received: 01/23/2024] [Accepted: 03/04/2024] [Indexed: 07/20/2024] Open
Abstract
Determining correlations between molecules at various levels is an important topic in molecular biology. Large language models have demonstrated a remarkable ability to capture correlations from large amounts of data in natural language processing as well as image generation. Because the correlations they capture can also be applied to a wide range of specific tasks, large language models are also referred to as foundation models. The massive amount of data that exists in molecular biology provides an excellent basis for the development of foundation models, and the recent emergence of such models has pushed the entire field forward. We summarize the foundation models developed based on RNA sequence data, DNA sequence data, protein sequence data, single-cell transcriptome data, and spatial transcriptome data, respectively, and further discuss the research directions for the development of foundation models in molecular biology.
Affiliation(s)
- Yunda Si
  - Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
- Jiawei Zou
  - Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, China
- Yicheng Gao
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
  - Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China
- Guohui Chuai
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
  - Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China
- Qi Liu
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Frontier Science Center for Stem Cell Research, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
  - Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai 201804, China
- Luonan Chen
  - Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China
  - Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200031, China
11. Mazooji K, Shomorony I. Fast multiple sequence alignment via multi-armed bandits. Bioinformatics 2024; 40:i328-i336. [PMID: 38940160 PMCID: PMC11211838 DOI: 10.1093/bioinformatics/btae225] [Indexed: 06/29/2024] Open
Abstract
SUMMARY: Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of hidden Markov models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve alignment accuracy similar to that of UPP with a significant reduction in computation time, particularly for datasets with long sequences.
AVAILABILITY AND IMPLEMENTATION: The code used to produce the results in this paper is available on GitHub at: https://github.com/ilanshom/adaptiveMSA.
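To illustrate the general idea of adaptively estimating scores with a multi-armed bandit, the Python sketch below uses generic successive elimination with a Hoeffding-style confidence radius. It is not the authors' algorithm or scoring function: the name `pick_best_arm`, the batch size, and the modelling of each candidate's score as the mean of per-term values in [0, 1] are all assumptions made for illustration.

```python
import math
import random


def pick_best_arm(arms, batch=20, delta=0.05, seed=0):
    """Return (index of the arm with the highest mean, number of terms sampled).

    Each arm is a list of per-term values in [0, 1]; its true score is the
    mean of those terms. Generic successive elimination, for illustration only.
    """
    rng = random.Random(seed)
    active = list(range(len(arms)))
    sums = [0.0] * len(arms)
    n = 0        # samples drawn so far from every still-active arm
    sampled = 0  # total samples drawn across all arms
    max_len = max(len(a) for a in arms)
    while len(active) > 1:
        for i in active:
            sums[i] += sum(rng.choice(arms[i]) for _ in range(batch))
        n += batch
        sampled += batch * len(active)
        if n >= max_len:
            # Sampling is no longer cheaper than exact scoring: fall back.
            best = max(active, key=lambda i: sum(arms[i]) / len(arms[i]))
            return best, sampled
        # Hoeffding-style radius with a union bound over arms and rounds.
        radius = math.sqrt(math.log(2 * len(arms) * n * n / delta) / (2 * n))
        means = {i: sums[i] / n for i in active}
        best_lower = max(means.values()) - radius
        # Keep only arms whose upper bound still reaches the best lower bound.
        active = [i for i in active if means[i] + radius >= best_lower]
    return active[0], sampled
```

The saving comes from discarding clearly inferior candidates after only a few samples, so that most of the sampling effort goes to the candidates that are hard to tell apart.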
Affiliation(s)
- Kayvon Mazooji
  - Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Ilan Shomorony
  - Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
12. Wang L, Wen Z, Liu SW, Zhang L, Finley C, Lee HJ, Fan HJS. Overview of AlphaFold2 and breakthroughs in overcoming its limitations. Comput Biol Med 2024; 176:108620. [PMID: 38761500 DOI: 10.1016/j.compbiomed.2024.108620] [Received: 10/29/2023] [Revised: 05/01/2024] [Accepted: 05/14/2024] [Indexed: 05/20/2024]
Abstract
Predicting three-dimensional (3D) protein structures has been challenging for decades. The emergence of AlphaFold2 (AF2), a deep learning-based method developed by DeepMind, became a game changer in the protein folding community. AF2 can predict a protein's three-dimensional structure with high confidence based on its amino acid sequence. Accurate prediction of protein structures can dramatically accelerate our understanding of biological mechanisms and provide a solid foundation for reliable drug design. Although AF2 breaks through the barriers in predicting protein structures, much room remains for further study. This review provides a brief historical overview of the development of protein structure prediction, covering template-based, template-free, and machine learning-based methods. In addition to reviewing the potential benefits (Pros) and considerations (Cons) of using AF2, this review summarizes its diverse applications, including protein structure prediction, dynamic changes, point mutations, integration of language models and experimental data, protein complexes, and protein-peptide interactions. It underscores recent advancements in the efficiency, reliability, and broad application of AF2. This comprehensive review offers valuable insights into the applications of AF2 and AF2-inspired AI methods in structural biology and their potential for clinically significant drug target discovery.
Affiliation(s)
- Lei Wang
  - College of Chemical Engineering, Sichuan University of Science and Engineering, Zigong City, Sichuan Province, 64300, China
- Zehua Wen
  - College of Chemical Engineering, Sichuan University of Science and Engineering, Zigong City, Sichuan Province, 64300, China
- Shi-Wei Liu
  - College of Chemical Engineering, Sichuan University of Science and Engineering, Zigong City, Sichuan Province, 64300, China
- Lihong Zhang
  - Digestive Department, Binhai New Area Hospital of TCM Tianjin, Tianjin, 300451, China
- Cierra Finley
  - Department of Natural Sciences, Southwest Tennessee Community College, Memphis, TN, 38015, USA
- Ho-Jin Lee
  - Department of Natural Sciences, Southwest Tennessee Community College, Memphis, TN, 38015, USA
  - Division of Natural & Mathematical Sciences, LeMoyne-Owen College, Memphis, TN, 38126, USA
- Hua-Jun Shawn Fan
  - College of Chemical Engineering, Sichuan University of Science and Engineering, Zigong City, Sichuan Province, 64300, China
13
|
Si Y, Yan C. Protein language model-embedded geometric graphs power inter-protein contact prediction. eLife 2024; 12:RP92184. [PMID: 38564241 PMCID: PMC10987090 DOI: 10.7554/elife.92184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024] Open
Abstract
Accurate prediction of contacting residue pairs between interacting proteins is very useful for structural characterization of protein-protein interactions. Although significant improvement has been made in inter-protein contact prediction recently, there is still considerable room for improving the prediction accuracy. Here we present a new deep learning method referred to as PLMGraph-Inter for inter-protein contact prediction. Specifically, we employ rotationally and translationally invariant geometric graphs obtained from structures of interacting proteins to integrate multiple protein language models, which are successively transformed by graph encoders formed by geometric vector perceptrons and residual networks formed by dimensional hybrid residual blocks to predict inter-protein contacts. Extensive evaluation on multiple test sets illustrates that PLMGraph-Inter outperforms five top inter-protein contact prediction methods, including DeepHomo, GLINTER, CDPred, DeepHomo2, and DRN-1D2D_Inter, by large margins. In addition, we show that the prediction of PLMGraph-Inter can complement the result of AlphaFold-Multimer. Finally, we show that leveraging the contacts predicted by PLMGraph-Inter as constraints for protein-protein docking can dramatically improve its performance for protein complex structure prediction.
Affiliation(s)
- Yunda Si
  - School of Physics, Huazhong University of Science and Technology, Wuhan, China
- Chengfei Yan
  - School of Physics, Huazhong University of Science and Technology, Wuhan, China
14
Li H, Sun X, Cui W, Xu M, Dong J, Ekundayo BE, Ni D, Rao Z, Guo L, Stahlberg H, Yuan S, Vogel H. Computational drug development for membrane protein targets. Nat Biotechnol 2024; 42:229-242. [PMID: 38361054 DOI: 10.1038/s41587-023-01987-2] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Received: 01/13/2023] [Accepted: 09/13/2023] [Indexed: 02/17/2024]
Abstract
The application of computational biology in drug development for membrane protein targets has experienced a boost from recent developments in deep learning-driven structure prediction, increased speed and resolution of structure elucidation, machine learning structure-based design and the evaluation of big data. Recent protein structure predictions based on machine learning tools have delivered surprisingly reliable results for water-soluble and membrane proteins but have limitations for development of drugs that target membrane proteins. Structural transitions of membrane proteins have a central role during transmembrane signaling and are often influenced by therapeutic compounds. Resolving the structural and functional basis of dynamic transmembrane signaling networks, especially within the native membrane or cellular environment, remains a central challenge for drug development. Tackling this challenge will require an interplay between experimental and computational tools, such as super-resolution optical microscopy for quantification of the molecular interactions of cellular signaling networks and their modulation by potential drugs, cryo-electron microscopy for determination of the structural transitions of proteins in native cell membranes and entire cells, and computational tools for data analysis and prediction of the structure and function of cellular signaling networks, as well as generation of promising drug candidates.
Affiliation(s)
- Haijian Li
  - Center for Computer-Aided Drug Discovery, Faculty of Pharmaceutical Sciences, Shenzhen Institute of Advanced Technology/Chinese Academy of Sciences (SIAT/CAS), Shenzhen, China
- Xiaolin Sun
  - Center for Computer-Aided Drug Discovery, Faculty of Pharmaceutical Sciences, Shenzhen Institute of Advanced Technology/Chinese Academy of Sciences (SIAT/CAS), Shenzhen, China
- Wenqiang Cui
  - Center for Computer-Aided Drug Discovery, Faculty of Pharmaceutical Sciences, Shenzhen Institute of Advanced Technology/Chinese Academy of Sciences (SIAT/CAS), Shenzhen, China
  - University of Chinese Academy of Sciences, Beijing, China
- Marc Xu
  - Center for Computer-Aided Drug Discovery, Faculty of Pharmaceutical Sciences, Shenzhen Institute of Advanced Technology/Chinese Academy of Sciences (SIAT/CAS), Shenzhen, China
  - University of Chinese Academy of Sciences, Beijing, China
- Junlin Dong
  - Center for Computer-Aided Drug Discovery, Faculty of Pharmaceutical Sciences, Shenzhen Institute of Advanced Technology/Chinese Academy of Sciences (SIAT/CAS), Shenzhen, China
  - University of Chinese Academy of Sciences, Beijing, China
- Babatunde Edukpe Ekundayo
  - Laboratory of Biological Electron Microscopy, IPHYS, SB, EPFL and Department of Fundamental Microbiology, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
- Dongchun Ni
  - Laboratory of Biological Electron Microscopy, IPHYS, SB, EPFL and Department of Fundamental Microbiology, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
- Zhili Rao
  - Center for Computer-Aided Drug Discovery, Faculty of Pharmaceutical Sciences, Shenzhen Institute of Advanced Technology/Chinese Academy of Sciences (SIAT/CAS), Shenzhen, China
- Liwei Guo
  - Center for Computer-Aided Drug Discovery, Faculty of Pharmaceutical Sciences, Shenzhen Institute of Advanced Technology/Chinese Academy of Sciences (SIAT/CAS), Shenzhen, China
- Henning Stahlberg
  - Laboratory of Biological Electron Microscopy, IPHYS, SB, EPFL and Department of Fundamental Microbiology, Faculty of Biology and Medicine, University of Lausanne, Lausanne, Switzerland
- Shuguang Yuan
  - Center for Computer-Aided Drug Discovery, Faculty of Pharmaceutical Sciences, Shenzhen Institute of Advanced Technology/Chinese Academy of Sciences (SIAT/CAS), Shenzhen, China
- Horst Vogel
  - Center for Computer-Aided Drug Discovery, Faculty of Pharmaceutical Sciences, Shenzhen Institute of Advanced Technology/Chinese Academy of Sciences (SIAT/CAS), Shenzhen, China
  - Institut des Sciences et Ingénierie Chimiques (ISIC), Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
15
Zheng L, Shi S, Sun X, Lu M, Liao Y, Zhu S, Zhang H, Pan Z, Fang P, Zeng Z, Li H, Li Z, Xue W, Zhu F. MoDAFold: a strategy for predicting the structure of missense mutant protein based on AlphaFold2 and molecular dynamics. Brief Bioinform 2024; 25:bbae006. [PMID: 38305456 PMCID: PMC10835750 DOI: 10.1093/bib/bbae006] [Citation(s) in RCA: 17] [Impact Index Per Article: 17.0] [Received: 11/22/2023] [Revised: 12/26/2023] [Accepted: 01/01/2024] [Indexed: 02/03/2024] Open
Abstract
Protein structure prediction is a longstanding issue crucial for identifying new drug targets and providing a mechanistic understanding of protein functions. To enhance the progress in this field, a spectrum of computational methodologies has been cultivated. AlphaFold2 has exhibited exceptional precision in predicting wild-type protein structures, with performance exceeding that of other methods. However, predicting the structures of missense mutant proteins using AlphaFold2 remains challenging due to the intricate and substantial structural alterations caused by minor sequence variations in the mutant proteins. Molecular dynamics (MD) has been validated for precisely capturing changes in amino acid interactions attributed to protein mutations. Therefore, for the first time, a strategy entitled 'MoDAFold' was proposed to improve the accuracy and reliability of missense mutant protein structure prediction by combining AlphaFold2 with MD. Multiple case studies have confirmed the superior performance of MoDAFold compared to other methods, particularly AlphaFold2.
Affiliation(s)
- Lingyan Zheng
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou 330110, China
- Shuiyang Shi
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Xiuna Sun
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou 330110, China
- Mingkun Lu
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou 330110, China
- Yang Liao
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Sisi Zhu
  - Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou 311121, China
- Hongning Zhang
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Ziqi Pan
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
- Pan Fang
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou 330110, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
- Zhenyu Zeng
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou 330110, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
- Honglin Li
  - School of Pharmacy, East China University of Science and Technology, Shanghai 200237, China
- Zhaorong Li
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou 330110, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
- Weiwei Xue
  - School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
- Feng Zhu
  - College of Pharmaceutical Sciences, The Second Affiliated Hospital, Zhejiang University School of Medicine, Zhejiang University, Hangzhou 310058, China
  - Industry Solutions Research and Development, Alibaba Cloud Computing, Hangzhou 330110, China
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou 330110, China
16
Peng CX, Liang F, Xia YH, Zhao KL, Hou MH, Zhang GJ. Recent Advances and Challenges in Protein Structure Prediction. J Chem Inf Model 2024; 64:76-95. [PMID: 38109487 DOI: 10.1021/acs.jcim.3c01324] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 12/20/2023]
Abstract
Artificial intelligence has made significant advances in the field of protein structure prediction in recent years. In particular, DeepMind's end-to-end model, AlphaFold2, has demonstrated the capability to predict three-dimensional structures of numerous unknown proteins with accuracy levels comparable to those of experimental methods. This breakthrough has opened up new possibilities for understanding protein structure and function as well as accelerating drug discovery and other applications in the field of biology and medicine. Despite the remarkable achievements of artificial intelligence in the field, there are still some challenges and limitations. In this Review, we discuss the recent progress and some of the challenges in protein structure prediction. These challenges include predicting multidomain protein structures, protein complex structures, multiple conformational states of proteins, and protein folding pathways. Furthermore, we highlight directions in which further improvements can be conducted.
Affiliation(s)
- Chun-Xiang Peng
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Fang Liang
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Yu-Hao Xia
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Kai-Long Zhao
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Ming-Hua Hou
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Gui-Jun Zhang
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
17
Shishparenok AN, Gladilina YA, Zhdanov DD. Engineering and Expression Strategies for Optimization of L-Asparaginase Development and Production. Int J Mol Sci 2023; 24:15220. [PMID: 37894901 PMCID: PMC10607044 DOI: 10.3390/ijms242015220] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 09/21/2023] [Revised: 10/11/2023] [Accepted: 10/13/2023] [Indexed: 10/29/2023] Open
Abstract
Genetic engineering for heterologous expression has advanced in recent years. Model systems such as Escherichia coli, Bacillus subtilis and Pichia pastoris are often used as host microorganisms for the enzymatic production of L-asparaginase, an enzyme widely used in the clinic for the treatment of leukemia and in bakeries for the reduction of acrylamide. Newly developed recombinant L-asparaginase (L-ASNase) may have a low affinity for asparagine, reduced catalytic activity, low stability, and increased glutaminase activity or immunogenicity. Some successful commercial preparations of L-ASNase are now available. Therefore, obtaining novel L-ASNases with improved properties suitable for food or clinical applications remains a challenge. The combination of rational design and/or directed evolution and heterologous expression has been used to create enzymes with desired characteristics. Computer design, combined with other methods, could make it possible to generate mutant libraries of novel L-ASNases without costly and time-consuming efforts. In this review, we summarize the strategies and approaches for obtaining and developing L-ASNase with improved properties.
Affiliation(s)
- Anastasiya N. Shishparenok
  - Laboratory of Medical Biotechnology, Institute of Biomedical Chemistry, Pogodinskaya St. 10/8, 119121 Moscow, Russia
- Yulia A. Gladilina
  - Laboratory of Medical Biotechnology, Institute of Biomedical Chemistry, Pogodinskaya St. 10/8, 119121 Moscow, Russia
- Dmitry D. Zhdanov
  - Laboratory of Medical Biotechnology, Institute of Biomedical Chemistry, Pogodinskaya St. 10/8, 119121 Moscow, Russia
  - Department of Biochemistry, Peoples’ Friendship University of Russia named after Patrice Lumumba (RUDN University), Miklukho-Maklaya St. 6, 117198 Moscow, Russia
18
Huang B, Kong L, Wang C, Ju F, Zhang Q, Zhu J, Gong T, Zhang H, Yu C, Zheng WM, Bu D. Protein Structure Prediction: Challenges, Advances, and the Shift of Research Paradigms. Genomics Proteomics Bioinformatics 2023; 21:913-925. [PMID: 37001856 PMCID: PMC10928435 DOI: 10.1016/j.gpb.2022.11.014] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 10/15/2022] [Revised: 11/23/2022] [Accepted: 11/30/2022] [Indexed: 03/31/2023]
Abstract
Protein structure prediction is an interdisciplinary research topic that has attracted researchers from multiple fields, including biochemistry, medicine, physics, mathematics, and computer science. These researchers adopt various research paradigms to attack the same structure prediction problem: biochemists and physicists attempt to reveal the principles governing protein folding; mathematicians, especially statisticians, usually start from assuming a probability distribution of protein structures given a target sequence and then find the most likely structure, while computer scientists formulate protein structure prediction as an optimization problem - finding the structural conformation with the lowest energy or minimizing the difference between predicted structure and native structure. These research paradigms fall into the two statistical modeling cultures proposed by Leo Breiman, namely, data modeling and algorithmic modeling. Recently, we have also witnessed the great success of deep learning in protein structure prediction. In this review, we present a survey of the efforts for protein structure prediction. We compare the research paradigms adopted by researchers from different fields, with an emphasis on the shift of research paradigms in the era of deep learning. In short, the algorithmic modeling techniques, especially deep neural networks, have considerably improved the accuracy of protein structure prediction; however, theories interpreting the neural networks and knowledge on protein folding are still highly desired.
Affiliation(s)
- Bin Huang
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Lupeng Kong
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Changping Laboratory, Beijing 102206, China
- Chao Wang
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
- Fusong Ju
  - Microsoft Research AI4Science, Beijing 100080, China
- Qi Zhang
  - Huawei Noah's Ark Lab, Wuhan 430206, China
- Jianwei Zhu
  - Microsoft Research AI4Science, Beijing 100080, China
- Tiansu Gong
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China
- Haicang Zhang
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
- Chungong Yu
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
- Wei-Mou Zheng
  - Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
- Dongbo Bu
  - Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; University of Chinese Academy of Sciences, Beijing 100049, China; Zhongke Big Data Academy, Zhengzhou 450046, China
19
Lin P, Yan Y, Tao H, Huang SY. Deep transfer learning for inter-chain contact predictions of transmembrane protein complexes. Nat Commun 2023; 14:4935. [PMID: 37582780 PMCID: PMC10427616 DOI: 10.1038/s41467-023-40426-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/27/2023] [Accepted: 07/21/2023] [Indexed: 08/17/2023] Open
Abstract
Membrane proteins are encoded by approximately a quarter of human genes. Inter-chain residue-residue contact information is important for structure prediction of membrane protein complexes and valuable for understanding their molecular mechanism. Although many deep learning methods have been proposed to predict the intra-protein contacts or helix-helix interactions in membrane proteins, it is still challenging to accurately predict their inter-chain contacts due to the limited number of transmembrane proteins. Addressing the challenge, here we develop a deep transfer learning method for predicting inter-chain contacts of transmembrane protein complexes, named DeepTMP, by taking advantage of the knowledge pre-trained from a large data set of non-transmembrane proteins. DeepTMP utilizes a geometric triangle-aware module to capture the correct inter-chain interaction from the coevolution information generated by protein language models. DeepTMP is extensively evaluated on a test set of 52 self-associated transmembrane protein complexes, and compared with state-of-the-art methods including DeepHomo2.0, CDPred, GLINTER, DeepHomo, and DNCON2_Inter. It is shown that DeepTMP considerably improves the precision of inter-chain contact prediction and outperforms the existing approaches in both accuracy and robustness.
Affiliation(s)
- Peicong Lin
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Yumeng Yan
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Huanyu Tao
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Sheng-You Huang
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
20
Liu X, Duan Y, Hong X, Xie J, Liu S. Challenges in structural modeling of RNA-protein interactions. Curr Opin Struct Biol 2023; 81:102623. [PMID: 37301066 DOI: 10.1016/j.sbi.2023.102623] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2023] [Revised: 05/14/2023] [Accepted: 05/16/2023] [Indexed: 06/12/2023]
Abstract
In the past few years, the number of RNA-binding proteins (RBPs) and RNA-RBP interactions has increased significantly. Here, we review recent developments in the methodology for protein-RNA and protein-protein complex structure modeling with deep learning and co-evolution, and discuss the challenges and opportunities for building a reliable approach for protein-RNA complex structure modeling. Protein Data Bank (PDB) and cross-linking immunoprecipitation (CLIP) data could be combined and used to infer the 2D geometry of protein-RNA interactions by deep learning.
Affiliation(s)
- Xudong Liu
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Yingtian Duan
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Xu Hong
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Juan Xie
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
- Shiyong Liu
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China
21
Meng Q, Guo F, Tang J. Improved structure-related prediction for insufficient homologous proteins using MSA enhancement and pre-trained language model. Brief Bioinform 2023:bbad217. [PMID: 37321965 DOI: 10.1093/bib/bbad217] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 03/12/2023] [Revised: 04/18/2023] [Accepted: 05/21/2023] [Indexed: 06/17/2023] Open
Abstract
In recent years, protein structure problems have become a hotspot for understanding protein folding and function mechanisms. It has been observed that most of the protein structure works rely on and benefit from co-evolutionary information obtained by multiple sequence alignment (MSA). As an example, AlphaFold2 (AF2) is a typical MSA-based protein structure tool which is famous for its high accuracy. As a consequence, these MSA-based methods are limited by the quality of the MSAs. Especially for orphan proteins that have no homologous sequence, AlphaFold2 performs unsatisfactorily as MSA depth decreases, which may pose a barrier to its widespread application in protein mutation and design problems in which there are no rich homologous sequences and rapid prediction is needed. In this paper, we constructed two standard datasets for orphan and de novo proteins which have insufficient/none homology information, called Orphan62 and Design204, respectively, to fairly evaluate the performance of the various methods in this case. Then, depending on whether or not utilizing scarce MSA information, we summarized two approaches, MSA-enhanced and MSA-free methods, to effectively solve the issue without sufficient MSAs. MSA-enhanced model aims to improve poor MSA quality from the data source by knowledge distillation and generation models. MSA-free model directly learns the relationship between residues on enormous protein sequences from pre-trained models, bypassing the step of extracting the residue pair representation from MSA. Next, we evaluated the performance of four MSA-free methods (trRosettaX-Single, TRFold, ESMFold and ProtT5) and MSA-enhanced (Bagging MSA) method compared with a traditional MSA-based method AlphaFold2, in two protein structure-related prediction tasks, respectively. Comparison analyses show that trRosettaX-Single and ESMFold which belong to MSA-free method can achieve fast prediction ($\sim 40$ s) and comparable performance compared with AF2 in tertiary structure prediction, especially for short peptides, $\alpha$-helical segments and targets with few homologous sequences. Bagging MSA utilizing MSA enhancement improves the accuracy of our trained base model which is an MSA-based method when poor homology information exists in secondary structure prediction. Our study provides biologists an insight of how to select rapid and appropriate prediction tools for enzyme engineering and peptide drug development. CONTACT guofei@csu.edu.cn, jj.tang@siat.ac.cn.
Affiliation(s)
- Qiaozhen Meng
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Fei Guo
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Jijun Tang
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518000, China
22
Huang B, Fan T, Wang K, Zhang H, Yu C, Nie S, Qi Y, Zheng WM, Han J, Fan Z, Sun S, Ye S, Yang H, Bu D. Accurate and efficient protein sequence design through learning concise local environment of residues. Bioinformatics 2023; 39:btad122. [PMID: 36916746 PMCID: PMC10027430 DOI: 10.1093/bioinformatics/btad122] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 07/01/2022] [Revised: 01/30/2023] [Accepted: 02/19/2023] [Indexed: 03/15/2023] Open
Abstract
MOTIVATION Computational protein sequence design has been widely applied in rational protein engineering and increasing the design accuracy and efficiency is highly desired. RESULTS Here, we present ProDESIGN-LE, an accurate and efficient approach to protein sequence design. ProDESIGN-LE adopts a concise but informative representation of the residue's local environment and trains a transformer to learn the correlation between local environment of residues and their amino acid types. For a target backbone structure, ProDESIGN-LE uses the transformer to assign an appropriate residue type for each position based on its local environment within this structure, eventually acquiring a designed sequence with all residues fitting well with their local environments. We applied ProDESIGN-LE to design sequences for 68 naturally occurring and 129 hallucinated proteins within 20 s per protein on average. The designed proteins have their predicted structures perfectly resembling the target structures with a state-of-the-art average TM-score exceeding 0.80. We further experimentally validated ProDESIGN-LE by designing five sequences for an enzyme, chloramphenicol O-acetyltransferase type III (CAT III), and recombinantly expressing the proteins in Escherichia coli. Of these proteins, three exhibited excellent solubility, and one yielded monomeric species with circular dichroism spectra consistent with the natural CAT III protein. AVAILABILITY AND IMPLEMENTATION The source code of ProDESIGN-LE is available at https://github.com/bigict/ProDESIGN-LE.
Affiliation(s)
- Bin Huang
  - Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100110, China
- Tingwen Fan
  - Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
- Kaiyue Wang
  - Beijing Advanced Innovation Center for Big Data-based Precision Medicine, School of Engineering Medicine, Beihang University, Beijing 100083, China
  - Key Laboratory of Big Data-based Precision Medicine (Beihang University), Ministry of Industry and Information Technology of the People’s Republic of China, Beijing 100083, China
- Haicang Zhang
  - Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
- Chungong Yu
  - Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
- Shuyu Nie
  - Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
  - School of Life Sciences, Hebei University, Baoding, Hebei 071002, China
- Yangshuo Qi
  - Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
  - School of Life Sciences, Hebei University, Baoding, Hebei 071002, China
- Wei-Mou Zheng
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
- Jian Han
  - Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
- Zheng Fan
  - Institutional Center for Shared Technologies and Facilities, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
- Shiwei Sun
  - Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
- Sheng Ye
  - Beijing Advanced Innovation Center for Big Data-based Precision Medicine, School of Engineering Medicine, Beihang University, Beijing 100083, China
  - Key Laboratory of Big Data-based Precision Medicine (Beihang University), Ministry of Industry and Information Technology of the People’s Republic of China, Beijing 100083, China
- Huaiyi Yang
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Key Lab of Microbial Physiological & Metabolic Engineering, State Key Lab of Mycology, Institute of Microbiology, Chinese Academy of Sciences, Beijing 100101, China
- Dongbo Bu
  - Key Lab of Intelligent Information Processing, SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100110, China
  - Zhongke Big Data Academy, Zhengzhou, Henan 450046, China
| |
|
23
|
Banerjee A, Saha S, Tvedt NC, Yang LW, Bahar I. Mutually beneficial confluence of structure-based modeling of protein dynamics and machine learning methods. Curr Opin Struct Biol 2023; 78:102517. [PMID: 36587424 PMCID: PMC10038760 DOI: 10.1016/j.sbi.2022.102517] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 08/18/2022] [Revised: 11/19/2022] [Accepted: 11/22/2022] [Indexed: 12/31/2022]
Abstract
Proteins sample an ensemble of conformers under physiological conditions, having access to a spectrum of modes of motions, also called intrinsic dynamics. These motions ensure the adaptation to various interactions in the cell, and largely assist in, if not determine, viable mechanisms of biological function. In recent years, machine learning frameworks have proven uniquely useful in structural biology, and recent studies further provide evidence to the utility and/or necessity of considering intrinsic dynamics for increasing their predictive ability. Efficient quantification of dynamics-based attributes by recently developed physics-based theories and models such as elastic network models provides a unique opportunity to generate data on dynamics for training ML models towards inferring mechanisms of protein function, assessing pathogenicity, or estimating binding affinities.
Affiliation(s)
- Anupam Banerjee
- Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh PA 15261, USA
| | - Satyaki Saha
- Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh PA 15261, USA
| | - Nathan C Tvedt
- Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh PA 15261, USA; Computational and Applied Mathematics and Statistics, The College of William and Mary, Williamsburg, VA 23185, USA
| | - Lee-Wei Yang
- Institute of Bioinformatics and Structural Biology, and PhD Program in Biomedical Artificial Intelligence, National Tsing Hua University, Hsinchu 300044, Taiwan; Physics Division, National Center for Theoretical Sciences, Taipei 106319, Taiwan
| | - Ivet Bahar
- Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh PA 15261, USA.
| |
|
24
|
Liu J, Zhao K, Zhang G. Improved model quality assessment using sequence and structural information by enhanced deep neural networks. Brief Bioinform 2023; 24:6865134. [PMID: 36460624 DOI: 10.1093/bib/bbac507] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2022] [Revised: 10/02/2022] [Accepted: 10/24/2022] [Indexed: 12/04/2022] Open
Abstract
Protein model quality assessment plays an important role in protein structure prediction, protein design and drug discovery. In this work, DeepUMQA2, a substantially improved version of DeepUMQA for protein model quality assessment, is proposed. First, sequence features containing protein co-evolution information and structural features reflecting family information are extracted to complement model-dependent features. Second, a novel backbone network based on triangular multiplication update and axial attention mechanism is designed to enhance information exchange between inter-residue pairs. On the CASP13 and CASP14 datasets, the performance of DeepUMQA2 increases by 20.5% and 20.4% compared with DeepUMQA, respectively (measured by top 1 loss). Moreover, on the three-month CAMEO dataset (11 March to 04 June 2022), DeepUMQA2 outperforms DeepUMQA by 15.5% (measured by local AUC0,0.2) and ranks first among all competing server methods in the CAMEO blind test. Experimental results show that DeepUMQA2 outperforms state-of-the-art model quality assessment methods such as ProQ3D-LDDT, ModFOLD8 and DeepAccNet, and that DeepUMQA2 can select better models than the self-assessments provided by state-of-the-art protein structure prediction methods such as AlphaFold2, RoseTTAFold and I-TASSER.
Affiliation(s)
- Jun Liu
- College of Information Engineering, Zhejiang University of Technology
| | - Kailong Zhao
- College of Information Engineering, Zhejiang University of Technology
| | - Guijun Zhang
- College of Information Engineering, Zhejiang University of Technology
| |
|
25
|
Park M, Ivanovic S, Chu G, Shen C, Warnow T. UPP2: fast and accurate alignment of datasets with fragmentary sequences. Bioinformatics 2023; 39:6982552. [PMID: 36625535 PMCID: PMC9846425 DOI: 10.1093/bioinformatics/btad007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/26/2022] [Revised: 12/01/2022] [Accepted: 01/09/2023] [Indexed: 01/11/2023] Open
Abstract
MOTIVATION Multiple sequence alignment (MSA) is a basic step in many bioinformatics pipelines. However, achieving highly accurate alignments on large datasets, especially those with sequence length heterogeneity, is a challenging task. Ultra-large multiple sequence alignment using Phylogeny-aware Profiles (UPP) is a method for MSA estimation that builds an ensemble of Hidden Markov Models (eHMM) to represent an estimated alignment on the full-length sequences in the input, and then adds the remaining sequences into the alignment using selected HMMs in the ensemble. Although UPP provides good accuracy, it is computationally intensive on large datasets. RESULTS We present UPP2, a direct improvement on UPP. The main advance is a fast technique for selecting HMMs in the ensemble that allows us to achieve the same accuracy as UPP but with greatly reduced runtime. We show that UPP2 produces more accurate alignments compared to leading MSA methods on datasets exhibiting substantial sequence length heterogeneity and is among the most accurate otherwise. AVAILABILITY AND IMPLEMENTATION https://github.com/gillichu/sepp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Minhyuk Park
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA
| | - Stefan Ivanovic
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA
| | - Gillian Chu
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA
| | - Chengze Shen
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61820, USA
| | | |
|
26
|
Liu J, Zhang C, Lai L. GeoPacker: A novel deep learning framework for protein side-chain modeling. Protein Sci 2022; 31:e4484. [PMID: 36309961 PMCID: PMC9667900 DOI: 10.1002/pro.4484] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/11/2022] [Revised: 10/23/2022] [Accepted: 10/26/2022] [Indexed: 12/13/2022]
Abstract
Atomic interactions play essential roles in protein folding, structure stabilization, and function performance. Recent advances in deep learning-based methods have achieved impressive success not only in protein structure prediction, but also in protein sequence design. However, highly efficient and accurate protein side-chain prediction methods that can give detailed atomic interactions are still lacking. In the present study, we developed a deep learning based method, GeoPacker, that uses geometric deep learning coupled ResNet for protein side-chain modeling. GeoPacker explicitly represents atomic interactions with rotational and translational invariance for information extraction of relative locations. GeoPacker outperformed the state-of-the-art energy function-based methods in side-chain structure prediction accuracy and runs about 10 and 700 times faster than the deep learning-based method DLPacker and OPUS-rota4 with comparable prediction accuracy, respectively. The performance of GeoPacker does not depend on the secondary structures that the residues belong to. GeoPacker gives highly accurate predictions for buried residues in the protein core as well as protein-protein interface, making it a useful tool for protein structure modeling, protein, and interaction design.
Affiliation(s)
- Jiale Liu
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
| | - Changsheng Zhang
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
| | - Luhua Lai
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
| |
|
27
|
Protein structure prediction in the deep learning era. Curr Opin Struct Biol 2022; 77:102495. [PMID: 36371845 DOI: 10.1016/j.sbi.2022.102495] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 08/11/2022] [Revised: 10/03/2022] [Accepted: 10/04/2022] [Indexed: 11/11/2022]
Abstract
Significant advances have been achieved in protein structure prediction, especially with the recent development of the AlphaFold2 and RoseTTAFold systems. This article reviews the progress in deep learning-based protein structure prediction methods in the past two years. First, we divide the representative methods into two categories: the two-step approach and the end-to-end approach. Then, we show that the two-step approach can achieve accuracy similar to that of the state-of-the-art end-to-end approach AlphaFold2. Compared to the end-to-end approach, the two-step approach requires fewer computing resources. We conclude that it is valuable to keep developing both approaches. Finally, a few outstanding challenges in function-oriented protein structure prediction are pointed out for future development.
|
28
|
Becker F, Stanke M. learnMSA: learning and aligning large protein families. Gigascience 2022; 11:giac104. [PMID: 36399060 PMCID: PMC9673500 DOI: 10.1093/gigascience/giac104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Revised: 09/01/2022] [Accepted: 10/06/2022] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments. RESULTS We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum-Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU. CONCLUSIONS Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.
Affiliation(s)
- Felix Becker
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17489 Greifswald, Germany
| | - Mario Stanke
- Institute of Mathematics and Computer Science, University of Greifswald, Walther-Rathenau-Straße 47, 17489 Greifswald, Germany
| |
|
29
|
Terwilliger TC, Poon BK, Afonine PV, Schlicksup CJ, Croll TI, Millán C, Richardson JS, Read RJ, Adams PD. Improved AlphaFold modeling with implicit experimental information. Nat Methods 2022; 19:1376-1382. [PMID: 36266465 PMCID: PMC9636017 DOI: 10.1038/s41592-022-01645-6] [Citation(s) in RCA: 73] [Impact Index Per Article: 24.3] [Received: 02/02/2022] [Accepted: 09/09/2022] [Indexed: 12/02/2022]
Abstract
Machine-learning prediction algorithms such as AlphaFold and RoseTTAFold can create remarkably accurate protein models, but these models usually have some regions that are predicted with low confidence or poor accuracy. We hypothesized that by implicitly including new experimental information such as a density map, a greater portion of a model could be predicted accurately, and that this might synergistically improve parts of the model that were not fully addressed by either machine learning or experiment alone. An iterative procedure was developed in which AlphaFold models are automatically rebuilt on the basis of experimental density maps and the rebuilt models are used as templates in new AlphaFold predictions. We show that including experimental information improves prediction beyond the improvement obtained with simple rebuilding guided by the experimental data. This procedure for AlphaFold modeling with density has been incorporated into an automated procedure for interpretation of crystallographic and electron cryo-microscopy maps.
Affiliation(s)
- Thomas C Terwilliger
- New Mexico Consortium, Los Alamos, NM, USA.
- Los Alamos National Laboratory, Los Alamos, NM, USA.
| | - Billy K Poon
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Pavel V Afonine
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Christopher J Schlicksup
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Tristan I Croll
- Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
| | - Claudia Millán
- Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
| | | | - Randy J Read
- Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
| | - Paul D Adams
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Department of Bioengineering, University of California, Berkeley, CA, USA
| |
|
30
|
Wu C, Guo D. Computational Docking Reveals Co-Evolution of C4 Carbon Delivery Enzymes in Diverse Plants. Int J Mol Sci 2022; 23:12688. [PMID: 36293547 PMCID: PMC9604239 DOI: 10.3390/ijms232012688] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2022] [Revised: 10/14/2022] [Accepted: 10/19/2022] [Indexed: 11/16/2022] Open
Abstract
Proteins are modular functionalities regulating multiple cellular activities in prokaryotes and eukaryotes. As a consequence of higher plants adapting to arid and thermal conditions, C4 photosynthesis is the carbon fixation process involving multiple enzymes working in a coordinated fashion. However, how these enzymes interact with each other and whether they co-evolve in parallel to maintain interactions in different plants remain elusive to date. Here, we report our findings on the global protein co-evolution relationship and local dynamics of co-varying site shifts in key C4 photosynthetic enzymes. We found that in most of the selected key C4 photosynthetic enzymes, global pairwise co-evolution events exist to form functional couplings. In addition, protein-protein interactions between these enzymes may suggest their unknown functionalities in the carbon delivery process. For the PEPC and PPCK regulation pair, pocket formation at the interactive interface is not necessary for function. This feature is distinct from another well-known regulation pair in C4 photosynthesis, namely PPDK and PPDK-RP, where the pockets are necessary. Our findings facilitate the discovery of novel protein regulation types and contribute to expanding our knowledge about C4 photosynthesis.
Affiliation(s)
| | - Dianjing Guo
- State Key Laboratory of Agrobiotechnology, School of Life Sciences, The Chinese University of Hong Kong, Shatin, Hong Kong SAR, China
| |
|
31
|
Liu Z, Yu DJ. cpxDeepMSA: A Deep Cascade Algorithm for Constructing Multiple Sequence Alignments of Protein–Protein Interactions. Int J Mol Sci 2022; 23:ijms23158459. [PMID: 35955594 PMCID: PMC9369210 DOI: 10.3390/ijms23158459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2022] [Revised: 07/18/2022] [Accepted: 07/28/2022] [Indexed: 12/10/2022] Open
Abstract
Protein–protein interactions (PPIs) are fundamental to many biological processes. The coevolution-based prediction of interacting residues has made great strides in protein complexes that are known to interact. A multiple sequence alignment (MSA) is the basis of coevolution analysis. MSAs have recently made significant progress in protein monomer sequence analysis. However, no standard or efficient pipelines are available for sensitive protein complex MSA (cpxMSA) collection. How to generate a cpxMSA is one of the most challenging problems of sequence coevolution analysis. Although several methods have been developed to address this problem, no standalone program exists. Furthermore, the number of built-in properties is limited; hence, it is often difficult for users to analyze sequence coevolution according to their desired cpxMSA. In this article, we developed a novel cpxMSA approach (cpxDeepMSA). We used different protein monomer databases and incorporated three strategies (genomic distance, phylogeny information, and the STRING interaction network) to join the monomer MSA results of protein complexes, so that cpxMSA construction does not fail when a single strategy is unable to join the two monomer MSAs. We anticipate that the cpxDeepMSA algorithm will become a useful high-throughput tool in protein complex structure prediction, inter-protein residue-residue contact prediction, and biological sequence coevolution analysis.
|
32
|
Li Y, Zhang C, Yu DJ, Zhang Y. Deep learning geometrical potential for high-accuracy ab initio protein structure prediction. iScience 2022; 25:104425. [PMID: 35663033 PMCID: PMC9160776 DOI: 10.1016/j.isci.2022.104425] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/17/2022] [Revised: 05/02/2022] [Accepted: 05/11/2022] [Indexed: 11/22/2022] Open
Abstract
Ab initio protein structure prediction has been vastly boosted by the modeling of inter-residue contact/distance maps in recent years. We developed a new deep learning model, DeepPotential, which accurately predicts the distribution of a complementary set of geometric descriptors including a novel hydrogen-bonding potential defined by C-alpha atom coordinates. On 154 Free-Modeling/Hard targets from the CASP and CAMEO experiments, DeepPotential demonstrated a significant advantage in both geometrical feature prediction and full-length structure construction, with Top-L/5 contact accuracy and TM-score of full-length models 4.1% and 6.7% higher, respectively, than the best of the other deep-learning restraint prediction approaches. Detailed analyses showed that the major contributions to the TM-score/contact-map improvements come from the use of a multi-tasking network architecture and metagenome-based MSA collection assisted by confidence-based MSA selection, where hydrogen-bonding and inter-residue orientation predictions help improve the hydrogen-bonding network and secondary structure packing. These results demonstrated new progress in deep-learning restraint-guided ab initio protein structure prediction.
- Multi-tasking network architecture for multiple inter-residue geometries
- Novel deep learning model for improved hydrogen-bonding modeling
- Rapid and high-accuracy ab initio protein structure prediction
Affiliation(s)
- Yang Li
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 21000, China; Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Chengxin Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA
| | - Dong-Jun Yu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, 21000, China
| | - Yang Zhang
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biological Chemistry, University of Michigan, Ann Arbor, MI 48109, USA
| |
|
33
|
Weissenow K, Heinzinger M, Rost B. Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure 2022; 30:1169-1177.e4. [DOI: 10.1016/j.str.2022.05.001] [Citation(s) in RCA: 68] [Impact Index Per Article: 22.7] [Received: 09/09/2021] [Revised: 02/25/2022] [Accepted: 04/29/2022] [Indexed: 01/27/2023]
|
34
|
Kong L, Ju F, Zheng WM, Zhu J, Sun S, Xu J, Bu D. ProALIGN: Directly Learning Alignments for Protein Structure Prediction via Exploiting Context-Specific Alignment Motifs. J Comput Biol 2022; 29:92-105. [PMID: 35073170 PMCID: PMC8892980 DOI: 10.1089/cmb.2021.0430] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/03/2023] Open
Abstract
Template-based modeling (TBM), including homology modeling and protein threading, is one of the most reliable techniques for protein structure prediction. It predicts protein structure by building an alignment between the query sequence under prediction and the templates with solved structures. However, it is still very challenging to build the optimal sequence-template alignment, especially when only distantly related templates are available. Here we report a novel deep learning approach ProALIGN that can predict much more accurate sequence-template alignment. Like protein sequences consisting of sequence motifs, protein alignments are also composed of frequently occurring alignment motifs with characteristic patterns. Alignment motifs are context-specific as their characteristic patterns are tightly related to sequence contexts of the aligned regions. Inspired by this observation, we represent a protein alignment as a binary matrix (in which 1 denotes an aligned residue pair) and then use a deep convolutional neural network to predict the optimal alignment from the query protein and its template. The trained neural network implicitly but effectively encodes an alignment scoring function, which reduces inaccuracies in the handcrafted scoring functions widely used by the current threading approaches. For a query protein and a template, we apply the neural network to directly infer likelihoods of all possible residue pairs in their entirety, which could effectively consider the correlations among multiple residues. We further construct the alignment with maximum likelihood, and finally build a structure model according to the alignment. Tested on three independent data sets with a total of 6688 protein alignment targets and 80 CASP13 TBM targets, our method achieved much better alignments and 3D structure models than the existing methods, including HHpred, CNFpred, CEthreader, and DeepThreader. 
These results clearly demonstrate the effectiveness of exploiting the context-specific alignment motifs by deep learning for protein threading.
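The decoding step described in this abstract (a likelihood for every query-template residue pair, followed by construction of an alignment and its binary-matrix representation) can be illustrated with a small dynamic program. This is a minimal sketch and not the ProALIGN implementation: the function names, the 0.5 threshold, and the scoring rule (sum of likelihood minus threshold over aligned pairs, with no gap penalty) are assumptions chosen for illustration only.

```python
from typing import List, Tuple


def decode_alignment(likelihood: List[List[float]],
                     threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return an order-preserving list of (query, template) index pairs.

    Hypothetical scoring: each aligned pair contributes
    (likelihood - threshold); unaligned residues contribute 0.
    """
    m = len(likelihood)
    n = len(likelihood[0]) if m else 0
    # best[i][j]: best score using the first i query and first j template residues
    best = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = best[i - 1][j - 1] + likelihood[i - 1][j - 1] - threshold
            best[i][j] = max(match, best[i - 1][j], best[i][j - 1])

    # Trace back, preferring to skip a residue unless pairing is strictly better.
    pairs = []
    i, j = m, n
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return pairs


def to_binary_matrix(pairs: List[Tuple[int, int]], m: int, n: int) -> List[List[int]]:
    """Encode an alignment as an m x n matrix in which 1 marks an aligned pair."""
    matrix = [[0] * n for _ in range(m)]
    for i, j in pairs:
        matrix[i][j] = 1
    return matrix


if __name__ == "__main__":
    # Toy likelihoods for a 3-residue query against a 3-residue template.
    toy = [[0.9, 0.1, 0.1],
           [0.1, 0.2, 0.8],
           [0.1, 0.1, 0.3]]
    aligned = decode_alignment(toy)
    print(aligned)
    print(to_binary_matrix(aligned, 3, 3))
```

The monotonicity constraint (pairs must increase in both indices) is what distinguishes an alignment from an arbitrary thresholding of the likelihood matrix; a real threading pipeline would typically also model gap costs.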
Affiliation(s)
- Lupeng Kong
- Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Toyota Technological Institute, Chicago, Illinois, USA
| | - Fusong Ju
- Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Wei-mou Zheng
- Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing, China
| | | | - Shiwei Sun
- Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Jinbo Xu
- Toyota Technological Institute, Chicago, Illinois, USA
| | - Dongbo Bu
- Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| |
|
35
|
Hot spots-making directed evolution easier. Biotechnol Adv 2022; 56:107926. [DOI: 10.1016/j.biotechadv.2022.107926] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/24/2021] [Revised: 01/04/2022] [Accepted: 02/07/2022] [Indexed: 01/20/2023]
|
36
|
Tran NH, Xu J, Li M. A tale of solving two computational challenges in protein science: neoantigen prediction and protein structure prediction. Brief Bioinform 2022; 23:bbab493. [PMID: 34891158 PMCID: PMC8769896 DOI: 10.1093/bib/bbab493] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 09/08/2021] [Revised: 10/11/2021] [Accepted: 10/26/2021] [Indexed: 12/30/2022] Open
Abstract
In this article, we review two challenging computational questions in protein science: neoantigen prediction and protein structure prediction. Both topics have seen significant leaps forward by deep learning within the past five years, which immediately unlocked new developments of drugs and immunotherapies. We show that deep learning models offer unique advantages, such as representation learning and multi-layer architecture, which make them an ideal choice to leverage a huge amount of protein sequence and structure data to address those two problems. We also discuss the impact and future possibilities enabled by those two applications, especially how the data-driven approach by deep learning shall accelerate the progress towards personalized biomedicine.
Affiliation(s)
| | - Jinbo Xu
- Toyota Technological Institute at Chicago, USA
| | - Ming Li
- University of Waterloo, Canada
| |
|
37
|
Su H, Wang W, Du Z, Peng Z, Gao S, Cheng M, Yang J. Improved Protein Structure Prediction Using a New Multi-Scale Network and Homologous Templates. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2021; 8:e2102592. [PMID: 34719864 PMCID: PMC8693034 DOI: 10.1002/advs.202102592] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Received: 06/19/2021] [Revised: 09/12/2021] [Indexed: 06/04/2023]
Abstract
The accuracy of de novo protein structure prediction has been improved considerably in recent years, mostly due to the introduction of deep learning techniques. In this work, trRosettaX, an improved version of trRosetta for protein structure prediction, is presented. The improvement over trRosetta is twofold. The first is the application of a new multi-scale network, i.e., Res2Net, for improved prediction of inter-residue geometries, including distance and orientations. The second is an attention-based module that exploits multiple homologous templates to increase the accuracy further. Compared with trRosetta, trRosettaX improves the contact precision by 6% and 8% on the free modeling targets of CASP13 and CASP14, respectively. A preliminary version of trRosettaX was ranked as one of the top server groups in CASP14's blind test. An additional benchmark test on 161 targets from CAMEO (between Jun and Sep 2020) shows that trRosettaX achieves an average TM-score ≈0.8, outperforming the top groups in CAMEO. These data suggest the effectiveness of using the multi-scale network and the benefit of incorporating homologous templates into the network. The trRosettaX algorithm has been incorporated into the trRosetta server since Nov 2020. The web server and the training and inference code are available at: https://yanglab.nankai.edu.cn/trRosetta/.
Affiliation(s)
- Hong Su
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
| | - Wenkai Wang
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
| | - Zongyang Du
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
| | - Zhenling Peng
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| | - Shang‐Hua Gao
- College of Computer Science, Nankai University, Tianjin 300071, China
| | - Ming‐Ming Cheng
- College of Computer Science, Nankai University, Tianjin 300071, China
| | - Jianyi Yang
- Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
| |
|
38
|
Kong L, Ju F, Zhang H, Sun S, Bu D. FALCON2: a web server for high-quality prediction of protein tertiary structures. BMC Bioinformatics 2021; 22:439. [PMID: 34525939 PMCID: PMC8444573 DOI: 10.1186/s12859-021-04353-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2021] [Accepted: 09/01/2021] [Indexed: 11/21/2022] Open
Abstract
BACKGROUND Accurate prediction of protein tertiary structures is highly desired as the knowledge of protein structures provides invaluable insights into protein functions. We have designed two approaches to protein structure prediction, including a template-based modeling approach (called ProALIGN) and an ab initio prediction approach (called ProFOLD). Briefly speaking, ProALIGN aligns a target protein with templates through exploiting the patterns of context-specific alignment motifs and then builds the final structure with reference to the homologous templates. In contrast, ProFOLD uses an end-to-end neural network to estimate inter-residue distances of target proteins and builds structures that satisfy these distance constraints. These two approaches emphasize different characteristics of target proteins: ProALIGN exploits structure information of homologous templates of target proteins while ProFOLD exploits the co-evolutionary information carried by homologous protein sequences. Recent progress has shown that the combination of template-based modeling and ab initio approaches is promising. RESULTS In the study, we present FALCON2, a web server that integrates ProALIGN and ProFOLD to provide high-quality protein structure prediction service. For a target protein, FALCON2 executes ProALIGN and ProFOLD simultaneously to predict possible structures and selects the most likely one as the final prediction result. We evaluated FALCON2 on widely-used benchmarks, including 104 CASP13 (the 13th Critical Assessment of protein Structure Prediction) targets and 91 CASP14 targets. In-depth examination suggests that when high-quality templates are available, ProALIGN is superior to ProFOLD and in other cases, ProFOLD shows better performance. By integrating these two approaches with different emphasis, FALCON2 server outperforms the two individual approaches and also achieves state-of-the-art performance compared with existing approaches. 
CONCLUSIONS By integrating template-based modeling and ab initio approaches, FALCON2 provides an easy-to-use and high-quality protein structure prediction service for the community and we expect it to enable insights into a deep understanding of protein functions.
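The integration strategy described in this abstract (run a template-based and an ab initio predictor side by side, then keep the more likely result) can be sketched as follows. This is a minimal illustration, not the FALCON2 implementation: the abstract does not say how "most likely" is measured, so the scalar `confidence` field is an assumption, and `Prediction`, `predict_best`, `template_stub`, and `ab_initio_stub` are hypothetical names standing in for the real predictors.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Prediction:
    method: str        # which approach produced the model
    structure: str     # placeholder for real 3D coordinates
    confidence: float  # hypothetical quality estimate, higher is better


Predictor = Callable[[str], Prediction]


def predict_best(sequence: str, predictors: Sequence[Predictor]) -> Prediction:
    """Run every predictor concurrently and return the top-scoring model."""
    if not predictors:
        raise ValueError("at least one predictor is required")
    with ThreadPoolExecutor(max_workers=len(predictors)) as pool:
        results = list(pool.map(lambda predict: predict(sequence), predictors))
    return max(results, key=lambda result: result.confidence)


def template_stub(sequence: str) -> Prediction:
    # Stand-in for a template-based method; the score is made up.
    return Prediction("template-based", f"model({len(sequence)} residues)", 0.4)


def ab_initio_stub(sequence: str) -> Prediction:
    # Stand-in for an ab initio method; the score is made up.
    return Prediction("ab-initio", f"model({len(sequence)} residues)", 0.7)


best = predict_best("MKTAYIAK", [template_stub, ab_initio_stub])
```

With the made-up scores above, `best` is the ab initio model. Swapping the two confidence values would select the template-based model instead, which mirrors the abstract's observation that the better approach depends on template quality.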
Affiliation(s)
- Lupeng Kong
- Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China
- University of Chinese Academy of Sciences, 100049 Beijing, China
| | - Fusong Ju
- Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China
- University of Chinese Academy of Sciences, 100049 Beijing, China
| | - Haicang Zhang
- Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China
- University of Chinese Academy of Sciences, 100049 Beijing, China
| | - Shiwei Sun
- Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China
- University of Chinese Academy of Sciences, 100049 Beijing, China
| | - Dongbo Bu
- Key Lab of Intelligent Information Processing, Big-Data Academy, Institute of Computing Technology, Chinese Academy of Sciences, 100190 Beijing, China
- University of Chinese Academy of Sciences, 100049 Beijing, China
| |
|
39
|
Laine E, Eismann S, Elofsson A, Grudinin S. Protein sequence-to-structure learning: Is this the end(-to-end revolution)? Proteins 2021; 89:1770-1786. [PMID: 34519095 DOI: 10.1002/prot.26235] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 05/05/2021] [Revised: 08/16/2021] [Accepted: 09/03/2021] [Indexed: 01/08/2023]
Abstract
The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, that is, learning on representations such as graphs, three-dimensional (3D) Voronoi tessellations, and point clouds; (ii) pretrained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; and (vi) finally truly end-to-end architectures, that is, differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last 2 years and widely used in CASP14.
Affiliation(s)
- Elodie Laine
- Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
| | - Stephan Eismann
- Department of Computer Science and Applied Physics, Stanford University, Stanford, California, USA
| | - Arne Elofsson
- Department of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, Solna, Sweden
| | - Sergei Grudinin
- Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, Grenoble, France
| |
|
40
|
Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, Millán C, Park H, Adams C, Glassman CR, DeGiovanni A, Pereira JH, Rodrigues AV, van Dijk AA, Ebrecht AC, Opperman DJ, Sagmeister T, Buhlheller C, Pavkov-Keller T, Rathinaswamy MK, Dalwadi U, Yip CK, Burke JE, Garcia KC, Grishin NV, Adams PD, Read RJ, Baker D. Accurate prediction of protein structures and interactions using a three-track neural network. Science 2021; 373:871-876. [PMID: 34282049 PMCID: PMC7612213 DOI: 10.1126/science.abj8754] [Citation(s) in RCA: 2928] [Impact Index Per Article: 732.0] [Received: 06/07/2021] [Accepted: 07/07/2021] [Indexed: 01/17/2023]
Abstract
DeepMind presented notably accurate predictions at the recent 14th Critical Assessment of Structure Prediction (CASP14) conference. We explored network architectures that incorporate related ideas and obtained the best performance with a three-track network in which information at the one-dimensional (1D) sequence level, the 2D distance map level, and the 3D coordinate level is successively transformed and integrated. The three-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables the rapid solution of challenging x-ray crystallography and cryo-electron microscopy structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate protein-protein complex models from sequence information alone, short-circuiting traditional approaches that require modeling of individual subunits followed by docking. We make the method available to the scientific community to speed biological research.
Affiliation(s)
- Minkyung Baek
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Frank DiMaio
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Ivan Anishchenko
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Justas Dauparas
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Sergey Ovchinnikov
- Faculty of Arts and Sciences, Division of Science, Harvard University, Cambridge, MA 02138, USA
- John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA 02138, USA
| | - Gyu Rie Lee
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Jue Wang
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Qian Cong
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Lisa N Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - R Dustin Schaeffer
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Claudia Millán
- Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
| | - Hahnbeom Park
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Carson Adams
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
| | - Caleb R Glassman
- Program in Immunology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Molecular and Cellular Physiology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Andy DeGiovanni
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Jose H Pereira
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Andria V Rodrigues
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Alberdina A van Dijk
- Department of Biochemistry, Focus Area Human Metabolomics, North-West University, 2531 Potchefstroom, South Africa
| | - Ana C Ebrecht
- Department of Biochemistry, Focus Area Human Metabolomics, North-West University, 2531 Potchefstroom, South Africa
| | - Diederik J Opperman
- Department of Biotechnology, University of the Free State, 205 Nelson Mandela Drive, Bloemfontein 9300, South Africa
| | - Theo Sagmeister
- Institute of Molecular Biosciences, University of Graz, Humboldtstrasse 50, 8010 Graz, Austria
| | - Christoph Buhlheller
- Institute of Molecular Biosciences, University of Graz, Humboldtstrasse 50, 8010 Graz, Austria
- Medical University of Graz, Graz, Austria
| | - Tea Pavkov-Keller
- Institute of Molecular Biosciences, University of Graz, Humboldtstrasse 50, 8010 Graz, Austria
- BioTechMed-Graz, Graz, Austria
| | - Manoj K Rathinaswamy
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, Canada
| | - Udit Dalwadi
- Life Sciences Institute, Department of Biochemistry and Molecular Biology, The University of British Columbia, Vancouver, BC, Canada
| | - Calvin K Yip
- Life Sciences Institute, Department of Biochemistry and Molecular Biology, The University of British Columbia, Vancouver, BC, Canada
| | - John E Burke
- Department of Biochemistry and Microbiology, University of Victoria, Victoria, BC, Canada
| | - K Christopher Garcia
- Program in Immunology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Molecular and Cellular Physiology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Department of Structural Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
- Howard Hughes Medical Institute, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Paul D Adams
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Department of Bioengineering, University of California, Berkeley, Berkeley, CA 94720, USA
| | - Randy J Read
- Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Cambridge, UK
| | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA 98195, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98195, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA 98195, USA
| |
|