1
Zhou B, Zheng L, Wu B, Tan Y, Lv O, Yi K, Fan G, Hong L. Protein Engineering with Lightweight Graph Denoising Neural Networks. J Chem Inf Model 2024; 64:3650-3661. [PMID: 38630581 DOI: 10.1021/acs.jcim.4c00036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/19/2024]
Abstract
Protein engineering faces challenges in finding optimal mutants from a massive pool of candidate mutants. In this study, we introduce a deep-learning-based data-efficient fitness prediction tool to steer protein engineering. Our methodology establishes a lightweight graph neural network scheme for protein structures, which efficiently analyzes the microenvironment of amino acids in wild-type proteins and reconstructs the distribution of the amino acid sequences that are more likely to pass natural selection. This distribution serves as a general guidance for scoring proteins toward arbitrary properties on any order of mutations. Our proposed solution undergoes extensive wet-lab experimental validation spanning diverse physicochemical properties of various proteins, including fluorescence intensity, antigen-antibody affinity, thermostability, and DNA cleavage activity. More than 40% of ProtLGN-designed single-site mutants outperform their wild-type counterparts across all studied proteins and targeted properties. More importantly, our model can bypass the negative epistatic effect to combine single mutation sites and form deep mutants with up to seven mutation sites in a single round, whose physicochemical properties are significantly improved. This observation provides compelling evidence of the structure-based model's potential to guide deep mutations in protein engineering. Overall, our approach emerges as a versatile tool for protein engineering, benefiting both the computational and bioengineering communities.
Affiliation(s)
- Bingxin Zhou
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
- Lirong Zheng
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Banghao Wu
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Yang Tan
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
- Outongyi Lv
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
- Kai Yi
  - School of Mathematics and Statistics, University of New South Wales, Sydney 2052, Australia
- Guisheng Fan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
- Liang Hong
  - Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai 200240, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai 201203, China
2
Shuvo MH, Karim M, Bhattacharya D. iQDeep: an integrated web server for protein scoring using multiscale deep learning models. J Mol Biol 2023; 435:168057. [PMID: 37356909 PMCID: PMC10291203 DOI: 10.1016/j.jmb.2023.168057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2022] [Revised: 03/01/2023] [Accepted: 03/17/2023] [Indexed: 06/27/2023]
Abstract
The remarkable recent advances in protein structure prediction have enabled computational modeling of protein structures with considerably higher accuracy than ever before. While state-of-the-art structure prediction methods provide self-assessment confidence scores of their own predictions, an independent and open-access system for protein scoring is still needed that can be applied to a broad range of predictive modeling scenarios. Here, we present iQDeep, an integrated and highly customizable web server for protein scoring, freely available at http://fusion.cs.vt.edu/iQDeep. The underlying method of iQDeep employs multiscale deep residual neural networks (ResNets) to perform residue-level error classifications, and then probabilistically combines the error classifications for protein scoring. By adjusting the error resolutions, our method can reliably estimate the standard- or high-accuracy variants of the Global Distance Test metric for versatile protein scoring. The performance of the method has been extensively tested and compared against the state-of-the-art approaches in multiple rounds of Critical Assessment of Techniques for Protein Structure Prediction (CASP) experiments including benchmark assessment in CASP12 and CASP13 as well as blind evaluation in CASP14. The iQDeep web server offers a number of convenient features, including (i) the choice of individual and batch processing modes; (ii) an interactive and privacy-preserving web interface for automated job submission, tracking, and results retrieval; (iii) web-based quantitative and visual analyses of the results including overall estimated score and its residue-wise breakdown along with agreements between various sequence- and structural-level features; (iv) extensive help information on job submission and results interpretation via web-based tutorial and help tooltips.
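The idea of combining residue-level error classifications into a global score can be illustrated with a minimal sketch. It assumes, hypothetically, that a classifier has already produced, for each residue, the probability that its error lies within each distance cutoff. The function names and the simple averaging rule are illustrative assumptions and not iQDeep's actual implementation. The cutoffs are the conventional ones for GDT-TS (1, 2, 4, 8 Å) and GDT-HA (0.5, 1, 2, 4 Å). The true GDT also involves a superposition search, which is skipped here.

```python
import numpy as np

# Conventional distance cutoffs (in Angstrom) for the two GDT variants.
GDT_TS_THRESHOLDS = (1.0, 2.0, 4.0, 8.0)
GDT_HA_THRESHOLDS = (0.5, 1.0, 2.0, 4.0)


def estimate_gdt(prob_within: np.ndarray) -> float:
    """Expected GDT-like score in [0, 1] from per-residue probabilities.

    prob_within[i, k] is a hypothetical classifier output: the probability
    that residue i deviates by no more than the k-th cutoff. Averaging over
    residues gives the expected fraction within each cutoff; averaging over
    cutoffs gives the score. This rule is an assumption for illustration.
    """
    prob_within = np.asarray(prob_within, dtype=float)
    if prob_within.ndim != 2:
        raise ValueError("expected an array of shape (n_residues, n_cutoffs)")
    return float(prob_within.mean(axis=0).mean())


def gdt_from_errors(errors, thresholds) -> float:
    """Reference value computed from known per-residue errors (no superposition)."""
    errors = np.asarray(errors, dtype=float)
    return float(np.mean([(errors <= t).mean() for t in thresholds]))
```

Switching from `GDT_TS_THRESHOLDS` to `GDT_HA_THRESHOLDS` corresponds to the "error resolution" adjustment mentioned in the abstract: the same combination rule yields a stricter, high-accuracy score.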
Affiliation(s)
- Md Hossain Shuvo
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States. https://twitter.com/mzs0149
- Mohimenul Karim
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States. https://twitter.com/CaptainRafi97
3
Zhang P, Xia C, Shen HB. High-accuracy protein model quality assessment using attention graph neural networks. Brief Bioinform 2023; 24:7025462. [PMID: 36736352 DOI: 10.1093/bib/bbac614] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2022] [Revised: 11/23/2022] [Accepted: 12/12/2022] [Indexed: 02/05/2023] Open
Abstract
Great improvement has been brought to protein tertiary structure prediction through deep learning. It is important but very challenging to accurately rank and score decoy structures predicted by different models. CASP14 results show that existing quality assessment (QA) approaches lag behind the development of protein structure prediction methods, where almost all existing QA models degrade in accuracy when the target is a decoy of high quality. Accurately assessing high-accuracy decoys is particularly useful given the availability of accurate structure prediction methods. Here we propose a fast and effective single-model QA method, QATEN, which can evaluate decoys only by their topological characteristics and atomic types. Our model uses graph neural networks and attention mechanisms to evaluate global and amino acid level scores, and uses specific loss functions to constrain the network to focus more on high-precision decoys and protein domains. On the CASP14 evaluation decoys, QATEN performs better than other QA models under all correlation coefficients when targeting average LDDT. QATEN shows promising performance when considering only high-accuracy decoys. Compared to the embedded evaluation modules of predicted $C_{\alpha}$-RMSD (pRMSD) in RoseTTAFold and predicted LDDT (pLDDT) in AlphaFold2, QATEN is complementary and capable of achieving better evaluation on some decoy structures generated by AlphaFold2 and RoseTTAFold. These results suggest that the new QATEN approach can be used as a reliable independent assessment algorithm for high-accuracy protein structure decoys.
Affiliation(s)
- Peidong Zhang
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
- Chunqiu Xia
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
- Hong-Bin Shen
  - Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China
4
Bhattacharya S, Roche R, Shuvo MH, Moussad B, Bhattacharya D. Contact-Assisted Threading in Low-Homology Protein Modeling. Methods Mol Biol 2023; 2627:41-59. [PMID: 36959441 DOI: 10.1007/978-1-0716-2974-1_3] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 03/25/2023]
Abstract
The ability to successfully predict the three-dimensional structure of a protein from its amino acid sequence has made considerable progress in the recent past. The progress is propelled by the improved accuracy of deep learning-based inter-residue contact map predictors coupled with the rapid growth of protein sequence databases. A contact map encodes interatomic interaction information that can be exploited for highly accurate prediction of protein structures via contact map threading, even for query proteins that are not amenable to direct homology modeling. As such, contact-assisted threading has garnered considerable research effort. In this chapter, we provide an overview of existing contact-assisted threading methods while highlighting the recent advances and discussing some of the current limitations and future prospects in the application of contact-assisted threading for improving the accuracy of low-homology protein modeling.
Affiliation(s)
- Sutanu Bhattacharya
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, AL, USA
- Md Hossain Shuvo
  - Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
- Bernard Moussad
  - Department of Computer Science, Virginia Tech, Blacksburg, VA, USA
5
Mohseni Behbahani Y, Crouzet S, Laine E, Carbone A. Deep Local Analysis evaluates protein docking conformations with locally oriented cubes. Bioinformatics 2022; 38:4505-4512. [PMID: 35962985 PMCID: PMC9525006 DOI: 10.1093/bioinformatics/btac551] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/05/2022] [Revised: 07/04/2022] [Accepted: 08/08/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION With the recent advances in protein 3D structure prediction, protein interactions are becoming more central than ever before. Here, we address the problem of determining how proteins interact with one another. More specifically, we investigate the possibility of discriminating near-native protein complex conformations from incorrect ones by exploiting local environments around interfacial residues. RESULTS Deep Local Analysis (DLA)-Ranker is a deep learning framework applying 3D convolutions to a set of locally oriented cubes representing the protein interface. It explicitly considers the local geometry of the interfacial residues along with their neighboring atoms and the regions of the interface with different solvent accessibility. We assessed its performance on three docking benchmarks made of half a million acceptable and incorrect conformations. We show that DLA-Ranker successfully identifies near-native conformations from ensembles generated by molecular docking. It surpasses or competes with other deep learning-based scoring functions. We also showcase its usefulness to discover alternative interfaces. AVAILABILITY AND IMPLEMENTATION http://gitlab.lcqb.upmc.fr/dla-ranker/DLA-Ranker.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yasser Mohseni Behbahani
  - Sorbonne Université, CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Paris 75005, France
- Simon Crouzet
  - Sorbonne Université, CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Paris 75005, France
6
Kurniawan J, Ishida T. Protein Model Quality Estimation Using Molecular Dynamics Simulation. ACS Omega 2022; 7:24274-24281. [PMID: 35874260 PMCID: PMC9301944 DOI: 10.1021/acsomega.2c01475] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/15/2023]
Abstract
The estimation of protein model quality remains a challenging task and is important for protein structural model utilization. In the last decade, methods ranging from machine learning to deep learning have been developed and have shown progressive improvement. Despite utilizing more sophisticated techniques and introducing new features, none of these methods employ explicit protein structure stability information. Hypothetically, the quality of a protein model might be indicated by its structural stability in an in silico system, as revealed by the structural difference from its initial structure. One possible way to exploit such information is to run molecular dynamics simulations, which have been applied successfully in many research fields. We present a novel approach by introducing explicit protein structure stability information using molecular dynamics simulation. Despite using only simple features, a small amount of data with no training process, and a short molecular dynamics simulation time, our method shows performance comparable to that of the state-of-the-art deep learning-based method.
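The notion of "structural difference from the initial structure" can be sketched as an RMSD drift measure over simulation frames. The example below is a minimal illustration that assumes the trajectory is already available as a coordinate array. The function names and the use of the mean RMSD as a stability indicator are assumptions made here, not the authors' published feature set. Superposition uses the standard Kabsch algorithm.

```python
import numpy as np


def kabsch_rmsd(reference, mobile) -> float:
    """RMSD between two (n_atoms, 3) coordinate sets after optimal superposition."""
    ref = np.asarray(reference, dtype=float)
    mob = np.asarray(mobile, dtype=float)
    # Remove translation by centering both coordinate sets.
    p = mob - mob.mean(axis=0)
    q = ref - ref.mean(axis=0)
    # Kabsch: SVD of the covariance matrix gives the optimal rotation.
    u, _, vt = np.linalg.svd(p.T @ q)
    # Guard against an improper rotation (reflection).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    p_rot = p @ rotation.T
    return float(np.sqrt(((p_rot - q) ** 2).sum() / len(q)))


def stability_score(trajectory) -> float:
    """Mean RMSD of later frames relative to the first frame.

    trajectory has shape (n_frames, n_atoms, 3). A lower value means the
    model drifted less from its starting structure (illustrative indicator).
    """
    frames = np.asarray(trajectory, dtype=float)
    initial = frames[0]
    return float(np.mean([kabsch_rmsd(initial, f) for f in frames[1:]]))
```

Under the hypothesis stated in the abstract, a candidate model with a smaller drift would be ranked as more likely to be of high quality.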
7
Overview on the Development of Intelligent Methods for Mineral Resource Prediction under the Background of Geological Big Data. Minerals 2022. [DOI: 10.3390/min12050616] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 02/06/2023]
Abstract
In the age of big data, the prediction and evaluation of geological mineral resources have gradually entered a new stage, intelligent prospecting. This review briefly summarizes the research development of textual data mining and spatial data mining. It is considered that the current research on mineral resource prediction has integrated logical reasoning, theoretical models, computational simulations, and other scientific research models, and has gradually advanced toward a new model. This type of new model has tried to mine unknown and effective knowledge from big data by intelligent analysis methods. However, many challenges have come forward, including four aspects: (i) discovery of prospecting big data based on geological knowledge system; (ii) construction of the conceptual prospecting model by intelligent text mining; (iii) mineral prediction by intelligent spatial big data mining; (iv) sharing and visualization of the mineral prediction data. By extending the geological analysis in the process of prospecting prediction to the logical rules associated with expert knowledge points, the theory and methods of intelligent mineral prediction were preliminarily established based on geological big data. The core of the theory is to promote the flow, invocation, circulation, and optimization of the three key factors of “knowledge”, “model”, and “data”, and to preliminarily constitute the prototype of intelligent linkage mechanisms. It could be divided into four parts: intelligent datamation, intelligent informatization, intelligent knowledgeization, and intelligent servitization.
8
Akhter N, Kabir KL, Chennupati G, Vangara R, Alexandrov BS, Djidjev H, Shehu A. Improved Protein Decoy Selection via Non-Negative Matrix Factorization. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:1670-1682. [PMID: 33400654 DOI: 10.1109/tcbb.2020.3049088] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
A central challenge in protein modeling research and protein structure prediction in particular is known as decoy selection. The problem refers to selecting biologically-active/native tertiary structures among a multitude of physically-realistic structures generated by template-free protein structure prediction methods. Research on decoy selection is active. Clustering-based methods are popular, but they fail to identify good/near-native decoys on datasets where near-native decoys are severely under-sampled by a protein structure prediction method. Reasonable progress is reported by methods that additionally take into account the internal energy of a structure and employ it to identify basins in the energy landscape organizing the multitude of decoys. These methods, however, incur significant time costs for extracting basins from the landscape. In this paper, we propose a novel decoy selection method based on non-negative matrix factorization. We demonstrate that our method outperforms energy landscape-based methods. In particular, the proposed method addresses both the time cost issue and the challenge of identifying good decoys in a sparse dataset, successfully recognizing near-native decoys for both easy and hard protein targets.
9
Baek I, Kim SB. 3-Dimensional convolutional neural networks for predicting StarCraft Ⅱ results and extracting key game situations. PLoS One 2022; 17:e0264550. [PMID: 35239703 PMCID: PMC8893650 DOI: 10.1371/journal.pone.0264550] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2021] [Accepted: 02/11/2022] [Indexed: 11/19/2022] Open
Abstract
In real-time strategy games, players collect resources, control various units, and create strategies to win. The creation of winning strategies requires accurately analyzing previous games; therefore, it is important to be able to identify the key situations that determined the outcomes of those games. However, previous studies have mainly focused on predicting game results. In this study, we propose a methodology to predict outcomes and to identify information about the turning points that determine outcomes in StarCraft Ⅱ, one of the most popular real-time strategy games. We used replay data from StarCraft Ⅱ, which resemble video data in that they provide a continuous sequence of images. First, we trained a result prediction model using 3D-residual networks (3D-ResNet) and replay data to improve prediction performance by utilizing in-game spatiotemporal information. Second, we used gradient-weighted class activation mapping to extract information defining the key situations that significantly influenced the outcomes of the game. We then showed that the proposed method performs better by comparing 2D-residual networks (2D-ResNet), which use information from only one time point, with 3D-ResNet, which uses information from multiple time points. We verified the usefulness of our methodology on a 3D-ResNet with a gradient class activation map linked to a StarCraft Ⅱ replay dataset.
Affiliation(s)
- Insung Baek
  - School of Industrial and Management Engineering, Korea University, Seoul, Republic of Korea
- Seoung Bum Kim
  - School of Industrial and Management Engineering, Korea University, Seoul, Republic of Korea
10
Machine learning to estimate the local quality of protein crystal structures. Sci Rep 2021; 11:23599. [PMID: 34880321 PMCID: PMC8654820 DOI: 10.1038/s41598-021-02948-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/09/2021] [Accepted: 11/24/2021] [Indexed: 11/23/2022] Open
Abstract
Low-resolution electron density maps can pose a major obstacle in the determination and use of protein structures. Herein, we describe a novel method, called quality assessment based on an electron density map (QAEmap), which evaluates local protein structures determined by X-ray crystallography and could be applied to correct structural errors using low-resolution maps. QAEmap uses a three-dimensional deep convolutional neural network with electron density maps and their corresponding coordinates as input and predicts the correlation between the local structure and putative high-resolution experimental electron density map. This correlation could be used as a metric to modify the structure. Further, we propose that this method may be applied to evaluate ligand binding, which can be difficult to determine at low resolution.
11
Ovchinnikov S, Huang PS. Structure-based protein design with deep learning. Curr Opin Chem Biol 2021; 65:136-144. [PMID: 34547592 PMCID: PMC8671290 DOI: 10.1016/j.cbpa.2021.08.004] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 06/07/2021] [Accepted: 08/13/2021] [Indexed: 12/11/2022]
Abstract
Since the first revelation of proteins functioning as macromolecular machines through their three-dimensional structures, researchers have been intrigued by the marvelous ways in which biochemical processes are carried out by proteins. The aspiration to understand protein structures has fueled extensive efforts across different scientific disciplines. In recent years, it has been demonstrated that proteins with new functionality or shapes can be designed via structure-based modeling methods, and the design strategies have combined all available information - but largely piece-by-piece - from sequence-derived statistics to the detailed atomic-level modeling of chemical interactions. Despite the significant progress, incorporating data-derived approaches through the use of deep learning methods can be a game changer. In this review, we summarize current progress, compare the arc of developing the deep learning approaches with the conventional methods, and describe the motivation and concepts behind current strategies that may lead to potential future opportunities.
Affiliation(s)
- Sergey Ovchinnikov
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, 02138, USA.
- Po-Ssu Huang
  - Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA.
12
Li J, Yanagisawa K, Yoshikawa Y, Ohue M, Akiyama Y. Plasma protein binding prediction focusing on residue-level features and circularity of cyclic peptides by deep learning. Bioinformatics 2021; 38:1110-1117. [PMID: 34849593 PMCID: PMC8796384 DOI: 10.1093/bioinformatics/btab726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2021] [Revised: 09/22/2021] [Accepted: 10/11/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION In recent years, cyclic peptide drugs have been receiving increasing attention because they can target proteins that are difficult to tackle with conventional small-molecule drugs or antibody drugs. Plasma protein binding rate (%PPB) is a significant pharmacokinetic property of a compound in drug discovery and design. However, due to structural differences, previous computational prediction methods developed for small-molecule compounds cannot be successfully applied to cyclic peptides, and methods for predicting the PPB rate of cyclic peptides with high accuracy are not yet available. RESULTS Cyclic peptides are larger than small molecules, and their local structures have a considerable impact on PPB; thus, molecular descriptors expressing residue-level local features of cyclic peptides, instead of those expressing the entire molecule, as well as the circularity of the cyclic peptides should be considered. Therefore, we developed a prediction method named CycPeptPPB using deep learning that considers both factors. First, the macrocycle ring of cyclic peptides was decomposed residue by residue. The residue-based descriptors were arranged according to the sequence information of the cyclic peptide. Furthermore, the circular data augmentation method was used, and the circular convolution method CyclicConv was devised to express the cyclic structure. CycPeptPPB exhibited excellent performance, with a mean absolute error (MAE) of 4.79% and a correlation coefficient (R) of 0.92 for the public drug dataset, compared to the prediction performance of the existing PPB rate prediction software (MAE=15.08%, R=0.63). AVAILABILITY AND IMPLEMENTATION The data underlying this article are available in the online supplementary material. The source code of CycPeptPPB is available at https://github.com/akiyamalab/cycpeptppb. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
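Two ideas in this abstract, circular data augmentation and a convolution that respects the ring structure, can be sketched with plain NumPy. This is a hedged illustration only. The function names below are hypothetical and do not reproduce the authors' CyclicConv layer. It assumes that each residue is described by a fixed-length feature vector, and that wrap-around padding is an acceptable stand-in for the published operation. The two metric helpers compute the MAE and Pearson R statistics that the abstract reports.

```python
import numpy as np


def circular_shifts(features):
    """Circular augmentation: every rotation of a cyclic sequence.

    features has shape (n_residues, n_features). A ring has no canonical
    start, so each rotation is an equally valid view of the same peptide.
    """
    features = np.asarray(features)
    return [np.roll(features, -k, axis=0) for k in range(len(features))]


def circular_conv1d(features, kernel):
    """1D convolution (cross-correlation form) with wrap-around padding.

    features: (n_residues, n_features); kernel: (width, n_features), odd width.
    Returns one value per residue. The last residue is a neighbor of the first.
    """
    features = np.asarray(features, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    width = kernel.shape[0]
    if width % 2 == 0:
        raise ValueError("kernel width must be odd")
    half = width // 2
    padded = np.pad(features, ((half, half), (0, 0)), mode="wrap")
    return np.array([(padded[i:i + width] * kernel).sum()
                     for i in range(len(features))])


def mae(y_true, y_pred) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))))


def pearson_r(y_true, y_pred) -> float:
    """Pearson correlation coefficient."""
    return float(np.corrcoef(np.asarray(y_true, float), np.asarray(y_pred, float))[0, 1])
```

With wrap-around padding, rotating the input sequence simply rotates the output by the same amount, so the result does not depend on which residue is listed first.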
Affiliation(s)
- Jianan Li
  - Department of Computer Science, School of Computing, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan
  - AIST-TokyoTech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), National Institute of Advanced Industrial Science and Technology, Tsukuba, Ibaraki 305-8560, Japan
- Keisuke Yanagisawa
  - Department of Computer Science, School of Computing, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan
  - Middle-Molecule IT-based Drug Discovery Laboratory (MIDL), Tokyo Institute of Technology, Kawasaki, Kanagawa 210-0821, Japan
- Yasushi Yoshikawa
  - Department of Computer Science, School of Computing, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan
  - Middle-Molecule IT-based Drug Discovery Laboratory (MIDL), Tokyo Institute of Technology, Kawasaki, Kanagawa 210-0821, Japan
- Masahito Ohue
  - Department of Computer Science, School of Computing, Tokyo Institute of Technology, Meguro-ku, Tokyo 152-8550, Japan
  - Middle-Molecule IT-based Drug Discovery Laboratory (MIDL), Tokyo Institute of Technology, Kawasaki, Kanagawa 210-0821, Japan
13
Laine E, Eismann S, Elofsson A, Grudinin S. Protein sequence-to-structure learning: Is this the end(-to-end revolution)? Proteins 2021; 89:1770-1786. [PMID: 34519095 DOI: 10.1002/prot.26235] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 05/05/2021] [Revised: 08/16/2021] [Accepted: 09/03/2021] [Indexed: 01/08/2023]
Abstract
The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, that is, learning on representations such as graphs, three-dimensional (3D) Voronoi tessellations, and point clouds; (ii) pretrained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; and (vi) finally truly end-to-end architectures, that is, differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last 2 years and widely used in CASP14.
Affiliation(s)
- Elodie Laine
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
- Stephan Eismann
  - Department of Computer Science and Applied Physics, Stanford University, Stanford, California, USA
- Arne Elofsson
  - Department of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, Solna, Sweden
- Sergei Grudinin
  - Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, Grenoble, France
14
Igashov I, Pavlichenko N, Grudinin S. Spherical convolutions on molecular graphs for protein model quality assessment. Machine Learning: Science and Technology 2021. [DOI: 10.1088/2632-2153/abf856] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/20/2022] Open
Abstract
Processing information on three-dimensional (3D) objects requires methods that are stable to rigid-body transformations, in particular rotations, of the input data. In image processing tasks, convolutional neural networks achieve this property using rotation-equivariant operations. However, unlike images, graphs generally have irregular topology. This makes it challenging to define a rotation-equivariant convolution operation on these structures. In this work, we propose a spherical graph convolutional network that processes 3D models of proteins represented as molecular graphs. In a protein molecule, individual amino acids have common topological elements. This allows us to unambiguously associate each amino acid with a local coordinate system and construct rotation-equivariant spherical filters that operate on angular information between graph nodes. Within the framework of the protein model quality assessment problem, we demonstrate that the proposed spherical convolution method significantly improves the quality of model assessment compared to the standard message-passing approach. It is also comparable to state-of-the-art methods, as we demonstrate on Critical Assessment of Structure Prediction (CASP) benchmarks. The proposed technique operates only on geometric features of protein 3D models. This makes it universal and applicable to any other geometric-learning task where the graph structure allows constructing local coordinate systems. The method is available at https://team.inria.fr/nano-d/software/s-gcn/.
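The role of a per-residue local coordinate system can be illustrated with a short sketch. It assumes, as a hypothetical choice not taken from the paper, that the frame is built from the backbone N, CA, and C atom positions by Gram-Schmidt orthogonalization, and it expresses a neighboring node in spherical coordinates within that frame. The function names are illustrative.

```python
import numpy as np


def local_frame(n, ca, c):
    """Right-handed orthonormal frame from backbone atoms (assumed construction).

    Returns a 3x3 array whose rows are the local x, y, z axes.
    """
    n, ca, c = (np.asarray(v, dtype=float) for v in (n, ca, c))
    e1 = c - ca
    e1 = e1 / np.linalg.norm(e1)
    u = n - ca
    e2 = u - (u @ e1) * e1          # remove the component along e1
    e2 = e2 / np.linalg.norm(e2)
    e3 = np.cross(e1, e2)           # completes the right-handed frame
    return np.stack([e1, e2, e3])


def spherical_angles(frame, origin, point):
    """Distance, polar angle, and azimuth of `point` in the local frame."""
    v = frame @ (np.asarray(point, dtype=float) - np.asarray(origin, dtype=float))
    r = np.linalg.norm(v)
    theta = np.arccos(np.clip(v[2] / r, -1.0, 1.0))
    phi = np.arctan2(v[1], v[0])
    return r, theta, phi
```

Because the frame moves with the residue, these angular features are unchanged when the whole protein is rotated or translated, which is the property that makes filters defined on them independent of the global orientation of the input.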
15
Takei Y, Ishida T. P3CMQA: Single-Model Quality Assessment Using 3DCNN with Profile-Based Features. Bioengineering (Basel) 2021; 8:bioengineering8030040. [PMID: 33808604 PMCID: PMC8003382 DOI: 10.3390/bioengineering8030040] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 01/29/2021] [Revised: 03/12/2021] [Accepted: 03/16/2021] [Indexed: 11/16/2022] Open
Abstract
Model quality assessment (MQA), which selects near-native structures from structure models, is an important process in protein tertiary structure prediction. The three-dimensional convolution neural network (3DCNN) was applied to the task, but the performance was comparable to existing methods because it used only atom-type features as the input. Thus, we added sequence profile-based features, which are also used in other methods, to improve the performance. We developed a single-model MQA method for protein structures based on 3DCNN using sequence profile-based features, namely, P3CMQA. Performance evaluation using a CASP13 dataset showed that profile-based features improved the assessment performance, and the proposed method was better than currently available single-model MQA methods, including the previous 3DCNN-based method. We also implemented a web-interface of the method to make it more user-friendly.
Affiliation(s)
- Yuma Takei
- Department of Computer Science, School of Computing, Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo 152-8550, Japan;
- Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL), National Institute of Advanced Industrial Science and Technology (AIST), Aomi, Koto-ku, Tokyo 135-0064, Japan
| | - Takashi Ishida
- Department of Computer Science, School of Computing, Tokyo Institute of Technology, Ookayama, Meguro-ku, Tokyo 152-8550, Japan;
- Correspondence:
| |
|
16
|
Shuvo MH, Bhattacharya S, Bhattacharya D. QDeep: distance-based protein model quality estimation by residue-level ensemble error classifications using stacked deep residual neural networks. Bioinformatics 2021; 36:i285-i291. [PMID: 32657397 PMCID: PMC7355297 DOI: 10.1093/bioinformatics/btaa455] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Indexed: 12/31/2022]
Abstract
MOTIVATION: Protein model quality estimation, in many ways, informs protein structure prediction. Despite their tight coupling, existing model quality estimation methods do not leverage inter-residue distance information or the latest technological breakthrough in deep learning that has recently revolutionized protein structure prediction. RESULTS: We present a new distance-based single-model quality estimation method called QDeep by harnessing the power of stacked deep residual neural networks (ResNets). Our method first employs stacked deep ResNets to perform residue-level ensemble error classifications at multiple predefined error thresholds, and then combines the predictions from the individual error classifiers for estimating the quality of a protein structural model. Experimental results show that our method consistently outperforms existing state-of-the-art methods including ProQ2, ProQ3, ProQ3D, ProQ4, 3DCNN, MESHI, and VoroMQA in multiple independent test datasets across a wide range of accuracy measures, and that predicted distance information significantly contributes to the improved performance of QDeep. AVAILABILITY AND IMPLEMENTATION: https://github.com/Bhattacharya-Lab/QDeep. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Md Hossain Shuvo
- Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, USA
| | - Sutanu Bhattacharya
- Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, USA
| | - Debswapna Bhattacharya
- Department of Computer Science and Software Engineering, Auburn University, Auburn, AL 36849, USA; Department of Biological Sciences, Auburn University, Auburn, AL 36849, USA
| |
|
17
|
Jamasb AR, Day B, Cangea C, Liò P, Blundell TL. Deep Learning for Protein-Protein Interaction Site Prediction. Methods Mol Biol 2021; 2361:263-288. [PMID: 34236667 DOI: 10.1007/978-1-0716-1641-3_16] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 12/11/2022]
Abstract
Protein-protein interactions (PPIs) are central to cellular functions. Experimental methods for predicting PPIs are well developed but are time and resource expensive and suffer from high false-positive error rates at scale. Computational prediction of PPIs is highly desirable for a mechanistic understanding of cellular processes and offers the potential to identify highly selective drug targets. In this chapter, details of developing a deep learning approach to predicting which residues in a protein are involved in forming a PPI (a task known as PPI site prediction) are outlined. The key decisions to be made in defining a supervised machine learning project in this domain are here highlighted. Alternative training regimes for deep learning models to address shortcomings in existing approaches and provide starting points for further research are discussed. This chapter is written to serve as a companion to developing deep learning approaches to protein-protein interaction site prediction, and an introduction to developing geometric deep learning projects operating on protein structure graphs.
Affiliation(s)
- Arian R Jamasb
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK; Department of Biochemistry, University of Cambridge, Cambridge, UK
| | - Ben Day
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
| | - Cătălina Cangea
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
| | - Pietro Liò
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
| | - Tom L Blundell
- Department of Biochemistry, University of Cambridge, Cambridge, UK.
| |
|
18
|
Akhter N, Chennupati G, Djidjev H, Shehu A. Decoy selection for protein structure prediction via extreme gradient boosting and ranking. BMC Bioinformatics 2020; 21:189. [PMID: 33297949 PMCID: PMC7724862 DOI: 10.1186/s12859-020-3523-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/15/2020] [Accepted: 04/29/2020] [Indexed: 11/10/2022]
Abstract
Background: Identifying one or more biologically-active/native decoys from millions of non-native decoys is one of the major challenges in computational structural biology. The extreme lack of balance in positive and negative samples (native and non-native decoys) in a decoy set makes the problem even more complicated. Consensus methods show varied success in handling the challenge of decoy selection despite some issues associated with clustering large decoy sets and decoy sets that do not show much structural similarity. Recent investigations into energy landscape-based decoy selection approaches show promise. However, lack of generalization over varied test cases remains a bottleneck for these methods. Results: We propose a novel decoy selection method, ML-Select, a machine learning framework that exploits the energy landscape associated with the structure space probed through a template-free decoy generation. The proposed method outperforms both clustering and energy ranking-based methods, all the while consistently offering better performance on varied test cases. Moreover, ML-Select shows promising results even for the decoy sets consisting of mostly low-quality decoys. Conclusions: ML-Select is a useful method for decoy selection. This work suggests further research in finding more effective ways to adopt machine learning frameworks in achieving robust performance for decoy selection in template-free protein structure prediction.
Affiliation(s)
- Nasrin Akhter
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
| | - Gopinath Chennupati
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Bikini Atoll Rd., Los Alamos, NM 87545, USA
| | - Hristo Djidjev
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Bikini Atoll Rd., Los Alamos, NM 87545, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA; Department of Bioengineering, George Mason University, Fairfax, VA 22030, USA; School of Systems Biology, George Mason University, Manassas, VA 20110, USA
| |
|
19
|
Chew AK, Jiang S, Zhang W, Zavala VM, Van Lehn RC. Fast predictions of liquid-phase acid-catalyzed reaction rates using molecular dynamics simulations and convolutional neural networks. Chem Sci 2020; 11:12464-12476. [PMID: 34094451 PMCID: PMC8163029 DOI: 10.1039/d0sc03261a] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Indexed: 12/21/2022]
Abstract
The rates of liquid-phase, acid-catalyzed reactions relevant to the upgrading of biomass into high-value chemicals are highly sensitive to solvent composition and identifying suitable solvent mixtures is theoretically and experimentally challenging. We show that the complex atomistic configurations of reactant-solvent environments generated by classical molecular dynamics simulations can be exploited by 3D convolutional neural networks to enable accurate predictions of Brønsted acid-catalyzed reaction rates for model biomass compounds. We develop a 3D convolutional neural network, which we call SolventNet, and train it to predict acid-catalyzed reaction rates using experimental reaction data and corresponding molecular dynamics simulation data for seven biomass-derived oxygenates in water-cosolvent mixtures. We show that SolventNet can predict reaction rates for additional reactants and solvent systems an order of magnitude faster than prior simulation methods. This combination of machine learning with molecular dynamics enables the rapid, high-throughput screening of solvent systems and identification of improved biomass conversion conditions.
Affiliation(s)
- Alex K Chew
- Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA; DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI 53706, USA
| | - Shengli Jiang
- Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA
| | - Weiqi Zhang
- Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA
| | - Victor M Zavala
- Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA
| | - Reid C Van Lehn
- Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA; DOE Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI 53706, USA
| |
|
20
|
Grigas AT, Mei Z, Treado JD, Levine ZA, Regan L, O'Hern CS. Using physical features of protein core packing to distinguish real proteins from decoys. Protein Sci 2020; 29:1931-1944. [PMID: 32710566 PMCID: PMC7454528 DOI: 10.1002/pro.3914] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 04/05/2020] [Revised: 07/10/2020] [Accepted: 07/20/2020] [Indexed: 01/06/2023]
Abstract
The ability to consistently distinguish real protein structures from computationally generated model decoys is not yet a solved problem. One route to distinguish real protein structures from decoys is to delineate the important physical features that specify a real protein. For example, it has long been appreciated that the hydrophobic cores of proteins contribute significantly to their stability. We used two sources to obtain datasets of decoys to compare with real protein structures: submissions to the biennial Critical Assessment of protein Structure Prediction competition, in which researchers attempt to predict the structure of a protein only knowing its amino acid sequence, and also decoys generated by 3DRobot, which have user-specified global root-mean-squared deviations from experimentally determined structures. Our analysis revealed that both sets of decoys possess cores that do not recapitulate the key features that define real protein cores. In particular, the model structures appear more densely packed (because of energetically unfavorable atomic overlaps), contain too few residues in the core, and have improper distributions of hydrophobic residues throughout the structure. Based on these observations, we developed a feed-forward neural network, which incorporates key physical features of protein cores, to predict how well a computational model recapitulates the real protein structure without knowledge of the structure of the target sequence. By identifying the important features of protein structure, our method is able to rank decoy structures with similar accuracy to that obtained by state-of-the-art methods that incorporate many additional features. The small number of physical features makes our model interpretable, emphasizing the importance of protein packing and hydrophobicity in protein structure prediction.
Affiliation(s)
- Alex T. Grigas
- Graduate Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
- Integrated Graduate Program in Physical and Engineering Biology, Yale University, New Haven, Connecticut, USA
| | - Zhe Mei
- Integrated Graduate Program in Physical and Engineering Biology, Yale University, New Haven, Connecticut, USA
- Department of Chemistry, Yale University, New Haven, Connecticut, USA
| | - John D. Treado
- Integrated Graduate Program in Physical and Engineering Biology, Yale University, New Haven, Connecticut, USA
- Department of Mechanical Engineering and Materials Science, Yale University, New Haven, Connecticut, USA
| | - Zachary A. Levine
- Department of Pathology, Yale University, New Haven, Connecticut, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, USA
| | - Lynne Regan
- Institute of Quantitative Biology, Biochemistry and Biotechnology, Centre for Synthetic and Systems Biology, School of Biological Sciences, University of Edinburgh, Edinburgh, UK
| | - Corey S. O'Hern
- Graduate Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
- Integrated Graduate Program in Physical and Engineering Biology, Yale University, New Haven, Connecticut, USA
- Department of Mechanical Engineering and Materials Science, Yale University, New Haven, Connecticut, USA
- Department of Physics, Yale University, New Haven, Connecticut, USA
- Department of Applied Physics, Yale University, New Haven, Connecticut, USA
| |
|
21
|
Chen J, Siu SWI. Machine Learning Approaches for Quality Assessment of Protein Structures. Biomolecules 2020; 10:biom10040626. [PMID: 32316682 PMCID: PMC7226485 DOI: 10.3390/biom10040626] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 03/03/2020] [Revised: 04/07/2020] [Accepted: 04/09/2020] [Indexed: 11/16/2022]
Abstract
Protein structures play a very important role in biomedical research, especially in drug discovery and design, which require accurate protein structures in advance. However, experimental determinations of protein structure are prohibitively costly and time-consuming, and computational predictions of protein structures have not been perfected. Methods that assess the quality of protein models can help in selecting the most accurate candidates for further work. Driven by this demand, many structural bioinformatics laboratories have developed methods for estimating model accuracy (EMA). In recent years, EMA methods based on machine learning (ML) have consistently ranked among the top-performing methods in the community-wide CASP challenge. Accordingly, we systematically review all the major ML-based EMA methods developed within the past ten years. The methods are grouped by their employed ML approach (support vector machine, artificial neural networks, ensemble learning, or Bayesian learning), and their significance is discussed from a methodology viewpoint. To orient the reader, we also briefly describe the background of EMA, including the CASP challenge and its evaluation metrics, and introduce the major ML/DL techniques. Overall, this review provides an introductory guide to modern research on protein quality assessment and directions for future research in this area.
|
22
|
Akhter N, Chennupati G, Kabir KL, Djidjev H, Shehu A. Unsupervised and Supervised Learning over the Energy Landscape for Protein Decoy Selection. Biomolecules 2019; 9:E607. [PMID: 31615116 PMCID: PMC6843838 DOI: 10.3390/biom9100607] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 08/05/2019] [Revised: 10/03/2019] [Accepted: 10/04/2019] [Indexed: 11/17/2022]
Abstract
The energy landscape that organizes microstates of a molecular system and governs the underlying molecular dynamics exposes the relationship between molecular form/structure, changes to form, and biological activity or function in the cell. However, several challenges stand in the way of leveraging energy landscapes for relating structure and structural dynamics to function. Energy landscapes are high-dimensional, multi-modal, and often overly-rugged. Deep wells or basins in them do not always correspond to stable structural states but are instead the result of inherent inaccuracies in semi-empirical molecular energy functions. Due to these challenges, energetics is typically ignored in computational approaches addressing long-standing central questions in computational biology, such as protein decoy selection. In the latter, the goal is to determine over a possibly large number of computationally-generated three-dimensional structures of a protein those structures that are biologically-active/native. In recent work, we have recast our attention on the protein energy landscape and its role in helping us to advance decoy selection. Here, we summarize some of our successes so far in this direction via unsupervised learning. More importantly, we further advance the argument that the energy landscape holds valuable information to aid and advance the state of protein decoy selection via novel machine learning methodologies that leverage supervised learning. Our focus in this article is on decoy selection for the purpose of a rigorous, quantitative evaluation of how leveraging protein energy landscapes advances an important problem in protein modeling. However, the ideas and concepts presented here are generally useful to make discoveries in studies aiming to relate molecular structure and structural dynamics to function.
Affiliation(s)
- Nasrin Akhter
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA.
| | - Gopinath Chennupati
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
| | - Kazi Lutful Kabir
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA.
| | - Hristo Djidjev
- Information Sciences (CCS-3) Group, Los Alamos National Laboratory, Los Alamos, NM 87545, USA.
| | - Amarda Shehu
- Department of Computer Science, George Mason University, Fairfax, VA 22030, USA.
- Center for Adaptive Human-Machine Partnership, George Mason University, Fairfax, VA 22030, USA.
- Department of Bioengineering, George Mason University, Fairfax, VA 22030, USA.
- School of Systems Biology, George Mason University, Fairfax, VA 22030, USA.
| |
|