101
Hockenberry AJ, Wilke CO. Evolutionary couplings detect side-chain interactions. PeerJ 2019; 7:e7280. [PMID: 31328041 PMCID: PMC6622159 DOI: 10.7717/peerj.7280] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 03/14/2019] [Accepted: 06/09/2019] [Indexed: 12/19/2022] Open
Abstract
Patterns of amino acid covariation in large protein sequence alignments can inform the prediction of de novo protein structures, binding interfaces, and mutational effects. While algorithms that detect these so-called evolutionary couplings between residues have proven useful for practical applications, less is known about how and why these methods perform so well, and what insights into biological processes can be gained from their application. Evolutionary coupling algorithms are commonly benchmarked by comparison to true structural contacts derived from solved protein structures. However, the methods used to determine true structural contacts are not standardized and different definitions of structural contacts may have important consequences for interpreting the results from evolutionary coupling analyses and understanding their overall utility. Here, we show that evolutionary coupling analyses are significantly more likely to identify structural contacts between side-chain atoms than between backbone atoms. We use both simulations and empirical analyses to highlight that purely backbone-based definitions of true residue–residue contacts (i.e., based on the distance between Cα atoms) may underestimate the accuracy of evolutionary coupling algorithms by as much as 40% and that a commonly used reference point (Cβ atoms) underestimates the accuracy by 10–15%. These findings show that co-evolutionary outcomes differ according to which atoms participate in residue–residue interactions and suggest that accounting for different interaction types may lead to further improvements to contact-prediction methods.
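To illustrate how the choice of reference atoms changes what counts as a contact, the following Python sketch compares three distance definitions on two toy residues. The coordinates, the atom-name dictionary layout, the glycine fallback to Cα, and the 8 Å cutoff are assumptions made for this example; none of them are taken from the paper.

```python
import math

# Backbone atom names; everything else in a residue is treated as side chain.
BACKBONE = {"N", "CA", "C", "O"}


def ca_distance(res_i, res_j):
    """Distance between the C-alpha atoms of two residues."""
    return math.dist(res_i["CA"], res_j["CA"])


def cb_distance(res_i, res_j):
    """Distance between C-beta atoms, falling back to C-alpha (e.g. glycine)."""
    a = res_i.get("CB", res_i["CA"])
    b = res_j.get("CB", res_j["CA"])
    return math.dist(a, b)


def min_side_chain_distance(res_i, res_j):
    """Smallest distance between any pair of side-chain atoms, or None."""
    side_i = [xyz for name, xyz in res_i.items() if name not in BACKBONE]
    side_j = [xyz for name, xyz in res_j.items() if name not in BACKBONE]
    if not side_i or not side_j:
        return None
    return min(math.dist(a, b) for a in side_i for b in side_j)


def is_contact(distance, cutoff=8.0):
    """Hypothetical cutoff in angstroms; real benchmarks vary."""
    return distance is not None and distance < cutoff


# Toy residues whose side chains point toward each other (made-up coordinates).
res_a = {"CA": (0.0, 0.0, 0.0), "CB": (1.5, 0.0, 0.0), "CG": (3.0, 0.0, 0.0)}
res_b = {"CA": (10.0, 0.0, 0.0), "CB": (8.5, 0.0, 0.0), "CG": (7.0, 0.0, 0.0)}

for label, d in [
    ("CA", ca_distance(res_a, res_b)),
    ("CB", cb_distance(res_a, res_b)),
    ("side chain", min_side_chain_distance(res_a, res_b)),
]:
    print(label, d, is_contact(d))
```

In this toy case the same residue pair is not a contact under the Cα definition but is one under the Cβ and side-chain definitions, which is the kind of discrepancy the abstract describes.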
Affiliation(s)
- Adam J Hockenberry
- Department of Integrative Biology, The University of Texas at Austin, Austin, TX, USA
- Claus O Wilke
- Department of Integrative Biology, The University of Texas at Austin, Austin, TX, USA

102
Wu Q, Peng Z, Anishchenko I, Cong Q, Baker D, Yang J. Protein contact prediction using metagenome sequence data and residual neural networks. Bioinformatics 2019; 36:41-48. [PMID: 31173061 PMCID: PMC8792440 DOI: 10.1093/bioinformatics/btz477] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 11/20/2018] [Revised: 05/30/2019] [Accepted: 06/04/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION: Almost all protein residue contact prediction methods rely on the availability of deep multiple sequence alignments (MSAs). However, many proteins from poorly populated families do not have a sufficient number of homologs in the conventional UniProt database. Here we aim to solve this issue by exploring the rich sequence data from metagenome sequencing projects. RESULTS: Based on the improved MSA constructed from the metagenome sequence data, we developed MapPred, a new deep learning-based contact prediction method. MapPred consists of two component methods, DeepMSA and DeepMeta, both trained with residual neural networks. DeepMSA was inspired by the recent method DeepCov, which was trained on 441 matrices of covariance features. By considering the symmetry of the contact map, we reduced the number of matrices to 231, which makes the training more efficient in DeepMSA. Experiments show that DeepMSA outperforms DeepCov by 10-13% in precision. DeepMeta works by combining predicted contacts and other sequence profile features. Experiments on three benchmark datasets suggest that the contribution from the metagenome sequence data is significant, with P-values less than 4.04E-17. MapPred is shown to be complementary and comparable to state-of-the-art methods. The success of MapPred is attributed to three factors: the deeper MSA from the metagenome sequence data, improved feature design in DeepMSA, and optimized training by the residual neural networks. AVAILABILITY AND IMPLEMENTATION: http://yanglab.nankai.edu.cn/mappred/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
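The reduction from 441 to 231 matrices is consistent with simple counting, as the short Python check below shows. It assumes a 21-letter alphabet (20 amino acids plus a gap symbol), which the abstract does not state: ordered pairs of letters give 21 × 21 = 441, while unordered pairs (treating the pair a,b the same as b,a) give 21 × 22 / 2 = 231. The function names are hypothetical and are not part of MapPred or DeepCov.

```python
from itertools import combinations_with_replacement

ALPHABET_SIZE = 21  # assumption: 20 amino acids plus one gap symbol


def ordered_pair_count(n):
    """Number of (a, b) letter pairs when order matters."""
    return n * n


def unordered_pair_count(n):
    """Number of letter pairs when (a, b) and (b, a) are merged."""
    return sum(1 for _ in combinations_with_replacement(range(n), 2))


print(ordered_pair_count(ALPHABET_SIZE), unordered_pair_count(ALPHABET_SIZE))
```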
Affiliation(s)
- Qi Wu
- School of Mathematical Sciences, Nankai University, Tianjin 300071, China
- Zhenling Peng
- To whom correspondence should be addressed. E-mail: or
- Ivan Anishchenko
- Department of Biochemistry, Seattle, WA 98105, USA; Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
- Qian Cong
- Department of Biochemistry, Seattle, WA 98105, USA; Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
- David Baker
- Department of Biochemistry, Seattle, WA 98105, USA; Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
- Jianyi Yang
- To whom correspondence should be addressed. E-mail: or

103
Hou J, Wu T, Cao R, Cheng J. Protein tertiary structure modeling driven by deep learning and contact distance prediction in CASP13. Proteins 2019; 87:1165-1178. [PMID: 30985027 PMCID: PMC6800999 DOI: 10.1002/prot.25697] [Citation(s) in RCA: 104] [Impact Index Per Article: 17.3] [Received: 02/16/2019] [Revised: 04/04/2019] [Accepted: 04/12/2019] [Indexed: 12/28/2022]
Abstract
Predicting residue‐residue distance relationships (eg, contacts) has become the key direction for advancing protein structure prediction since the 2014 CASP11 experiment, while deep learning has revolutionized the technology for contact and distance distribution prediction since its debut in the 2012 CASP10 experiment. During the 2018 CASP13 experiment, we enhanced our MULTICOM protein structure prediction system with three major components: contact distance prediction based on deep convolutional neural networks, distance‐driven template‐free (ab initio) modeling, and protein model ranking empowered by deep learning and contact prediction. Our experiment demonstrates that contact distance prediction and deep learning methods are the key reasons that MULTICOM was ranked 3rd out of all 98 predictors in both template‐free and template‐based structure modeling in CASP13. Deep convolutional neural networks can utilize global information in pairwise residue‐residue features such as coevolution scores to substantially improve contact distance prediction, which played a decisive role in correctly folding some free modeling and hard template‐based modeling targets. Deep learning also successfully integrated one‐dimensional structural features, two‐dimensional contact information, and three‐dimensional structural quality scores to improve protein model quality assessment, where contact prediction was demonstrated for the first time to consistently enhance the ranking of protein models. The success of the MULTICOM system clearly shows that protein contact distance prediction and model selection driven by deep learning hold the key to solving the protein structure prediction problem. However, there are still challenges in accurately predicting protein contact distance when there are few homologous sequences, folding proteins from noisy contact distances, and ranking models of hard targets.
Affiliation(s)
- Jie Hou
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri
- Tianqi Wu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri
- Renzhi Cao
- Department of Computer Science, Pacific Lutheran University, Tacoma, Washington
- Jianlin Cheng
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri

104
Bhattacharya S, Bhattacharya D. Does inclusion of residue-residue contact information boost protein threading? Proteins 2019; 87:596-606. [PMID: 30882932 DOI: 10.1002/prot.25684] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 12/22/2018] [Revised: 02/20/2019] [Accepted: 03/13/2019] [Indexed: 12/26/2022]
Abstract
Template-based modeling is considered as one of the most successful approaches for protein structure prediction. However, reliably and accurately selecting optimal template proteins from a library of known protein structures having similar folds as the target protein and making correct alignments between the target sequence and the template structures, a template-based modeling technique known as threading, remains challenging, particularly for non- or distantly-homologous protein targets. With the recent advancement in protein residue-residue contact map prediction powered by sequence co-evolution and machine learning, here we systematically analyze the effect of inclusion of residue-residue contact information in improving the accuracy and reliability of protein threading. We develop a new threading algorithm by incorporating various sequential and structural features, and subsequently integrate residue-residue contact information as an additional scoring term for threading template selection. We show that the inclusion of contact information attains statistically significantly better threading performance compared to a baseline threading algorithm that does not utilize contact information when everything else remains the same. Experimental results demonstrate that our contact based threading approach outperforms popular threading method MUSTER, contact-assisted ab initio folding method CONFOLD2, and recent state-of-the-art contact-assisted protein threading methods EigenTHREADER and map_align on several benchmarks. Our study illustrates that the inclusion of contact maps is a promising avenue in protein threading to ultimately help to improve the accuracy of protein structure prediction.
Affiliation(s)
- Sutanu Bhattacharya
- Department of Computer Science and Software Engineering, Auburn University, Auburn, Alabama
- Debswapna Bhattacharya
- Department of Computer Science and Software Engineering, Auburn University, Auburn, Alabama

105
Luttrell J, Liu T, Zhang C, Wang Z. Predicting protein residue-residue contacts using random forests and deep networks. BMC Bioinformatics 2019; 20:100. [PMID: 30871477 PMCID: PMC6419322 DOI: 10.1186/s12859-019-2627-6] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: The ability to predict which pairs of amino acid residues in a protein are in contact with each other offers many advantages for various areas of research that focus on proteins. For example, contact prediction can be used to reduce the computational complexity of predicting the structure of proteins and even to help identify functionally important regions of proteins. These predictions are becoming especially important given the relatively low number of experimentally determined protein structures compared to the amount of available protein sequence data. RESULTS: Here we have developed and benchmarked a set of machine learning methods for performing residue-residue contact prediction, including random forests, direct-coupling analysis, support vector machines, and deep networks (stacked denoising autoencoders). These methods are able to predict contacting residue pairs given only the amino acid sequence of a protein. According to our own evaluations performed at a resolution of +/- two residues, the predictors we trained with the random forest algorithm were our top performing methods with average top 10 prediction accuracy scores of 85.13% (short range), 74.49% (medium range), and 54.49% (long range). Our ensemble models (stacked denoising autoencoders combined with support vector machines) were our best performing deep network predictors and achieved top 10 prediction accuracy scores of 75.51% (short range), 60.26% (medium range), and 43.85% (long range) using the same evaluation. These tests were blindly performed on targets from the CASP11 dataset; and the results suggested that our models achieved comparable performance to contact predictors developed by groups that participated in CASP11. CONCLUSIONS: Due to the challenging nature of contact prediction, it is beneficial to develop and benchmark a variety of different prediction methods. Our work has produced useful tools with a simple interface that can provide contact predictions to users without requiring a lengthy installation process. In addition to this, we have released our C++ implementation of the direct-coupling analysis method as a standalone software package. Both this tool and our RFcon web server are freely available to the public at http://dna.cs.miami.edu/RFcon/.
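A rough Python sketch of the kind of evaluation described in this abstract is given below. Several details are assumptions, not taken from the paper: the sequence-separation bins (short 6 to 11, medium 12 to 23, long 24 or more, a convention often used in CASP-style assessments), the reading of "top 10" as the ten highest-scoring pairs in each range (it could equally mean the top L/10 pairs), and the reading of "+/- two residues" as allowing each index of a predicted pair to be off by up to two positions. All function and variable names are hypothetical.

```python
def separation_class(i, j):
    """Classify a residue pair by sequence separation (assumed bins)."""
    sep = abs(i - j)
    if 6 <= sep <= 11:
        return "short"
    if 12 <= sep <= 23:
        return "medium"
    if sep >= 24:
        return "long"
    return None  # closer than 6 residues: not evaluated here


def top_n_accuracy(scored_pairs, true_contacts, range_name, n=10, tolerance=0):
    """Fraction of the n best-scoring pairs in a range that match a true contact.

    scored_pairs: iterable of (i, j, score) tuples.
    true_contacts: set of (a, b) tuples with a < b.
    tolerance: allowed offset (in residues) on each index of the pair.
    Returns None when no predicted pair falls in the requested range.
    """
    in_range = [p for p in scored_pairs if separation_class(p[0], p[1]) == range_name]
    top = sorted(in_range, key=lambda p: p[2], reverse=True)[:n]
    if not top:
        return None

    def is_hit(i, j):
        lo, hi = min(i, j), max(i, j)
        return any(
            abs(lo - a) <= tolerance and abs(hi - b) <= tolerance
            for a, b in true_contacts
        )

    hits = sum(1 for i, j, _ in top if is_hit(i, j))
    return hits / len(top)


# Made-up predictions and native contacts, for illustration only.
predictions = [(1, 30, 0.9), (2, 40, 0.8), (5, 12, 0.7)]
native = {(1, 31), (5, 12)}
print(top_n_accuracy(predictions, native, "long", tolerance=2))
```

With the toy data, the pair (1, 30) counts as correct only when the tolerance is at least one residue, which shows how a relaxed resolution can raise the reported accuracy.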
Affiliation(s)
- Joseph Luttrell
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, 118 College Drive, Hattiesburg, MS, 39406, USA
- Tong Liu
- Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL, 33124, USA
- Chaoyang Zhang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, 118 College Drive, Hattiesburg, MS, 39406, USA
- Zheng Wang
- Department of Computer Science, University of Miami, 1365 Memorial Drive, Coral Gables, FL, 33124, USA

106
Ji S, Oruç T, Mead L, Rehman MF, Thomas CM, Butterworth S, Winn PJ. DeepCDpred: Inter-residue distance and contact prediction for improved prediction of protein structure. PLoS One 2019; 14:e0205214. [PMID: 30620738 PMCID: PMC6324825 DOI: 10.1371/journal.pone.0205214] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 09/19/2018] [Accepted: 12/13/2018] [Indexed: 11/28/2022] Open
Abstract
Rapid, accurate prediction of protein structure from amino acid sequence would accelerate fields as diverse as drug discovery, synthetic biology and disease diagnosis. Massively improved prediction of protein structures has been driven by improving the prediction of the amino acid residues that contact in their 3D structure. For an average globular protein, around 92% of all residue pairs are non-contacting, therefore accurate prediction of only a small percentage of inter-amino acid distances could increase the number of constraints to guide structure determination. We have trained deep neural networks to predict inter-residue contacts and distances. Distances are predicted with an accuracy better than most contact prediction techniques. Addition of distance constraints improved de novo structure predictions for test sets of 158 protein structures, as compared to using the best contact prediction methods alone. Importantly, usage of distance predictions allows the selection of better models from the structure pool without a need for an external model assessment tool. The results also indicate how the accuracy of distance prediction methods might be improved further.
Affiliation(s)
- Shuangxi Ji
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
- Tuğçe Oruç
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
- Liam Mead
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
- Muhammad Fayyaz Rehman
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
- Sam Butterworth
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
- Division of Pharmacy and Optometry, School of Health Sciences, Manchester Academic Health Sciences Centre, University of Manchester, Manchester, M13 9PL, United Kingdom
- Peter James Winn
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
- * E-mail:

107
Dehghani T, Naghibzadeh M, Eghdami M. BetaDL: A protein beta-sheet predictor utilizing a deep learning model and independent set solution. Comput Biol Med 2019; 104:241-249. [PMID: 30530227 DOI: 10.1016/j.compbiomed.2018.11.021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/27/2018] [Revised: 11/23/2018] [Accepted: 11/27/2018] [Indexed: 10/27/2022]
Abstract
The sequence-based prediction of beta-residue contacts and beta-sheet structures provides key information for protein structure prediction. However, the determination of beta-sheet structures poses numerous challenges due to long-range beta-residue interactions and the huge number of possible beta-sheet structures. The prediction of residue contacts with deep learning models has recently gained attention, and its results have led to improvements in protein structure prediction. In addition, to reduce the computational complexity of determining beta-sheet structures, it has been suggested that this problem be transformed into graph-based solutions. Consequently, the current work proposes BetaDL, a combination of a deep learning model and a graph-based beta-sheet structure predictor. BetaDL adopts deep learning models to capture beta-residue contacts and improve beta-sheet structure predictions. In addition, a graph-based approach is presented to model the beta-sheet conformational space and a new score function is introduced to evaluate beta-sheets. Furthermore, the present study demonstrates that the beta-sheet structure can be predicted within an acceptable computational time by using a heuristic maximum weight independent set solution. When compared to state-of-the-art methods, experimental results from the BetaSheet916 and BetaSheet1452 datasets indicate that BetaDL improves the accuracy of beta-residue contact and beta-sheet structure prediction. Using BetaDL, beta-sheet structures are predicted with a 4% and 6% improvement in the F1-score at the residue and strand levels, respectively. BetaDL's source code and data are available at http://kerg.um.ac.ir/index.php/datasets/#BetaDL.
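For readers unfamiliar with the maximum weight independent set problem mentioned in this abstract, the Python sketch below shows one generic greedy heuristic. It is not BetaDL's algorithm, whose details the abstract does not give. In this illustration, vertices could stand for candidate strand pairings, weights for their scores, and edges for pairs of candidates that cannot coexist; that mapping, the function name, and the toy graph are all assumptions.

```python
def greedy_mwis(weights, edges):
    """Greedy heuristic for maximum weight independent set.

    weights: dict mapping vertex -> numeric weight.
    edges: iterable of (u, v) pairs marking vertices that conflict.
    Repeatedly takes the heaviest vertex that is still allowed and then
    excludes its neighbors. The result is an independent set, but it is
    not guaranteed to be optimal.
    """
    neighbors = {v: set() for v in weights}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    chosen = []
    blocked = set()
    # Heaviest first; ties broken by vertex name for reproducibility.
    for v in sorted(weights, key=lambda v: (-weights[v], v)):
        if v not in blocked:
            chosen.append(v)
            blocked.add(v)
            blocked |= neighbors[v]
    return chosen


# Toy conflict graph: a path a - b - c - d with made-up weights.
toy_weights = {"a": 5, "b": 4, "c": 3, "d": 2}
toy_edges = [("a", "b"), ("b", "c"), ("c", "d")]
selected = greedy_mwis(toy_weights, toy_edges)
print(selected, sum(toy_weights[v] for v in selected))
```

On this small path graph the greedy choice happens to be optimal, but in general a heuristic of this kind trades guaranteed optimality for speed, since the exact problem is NP-hard.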
Affiliation(s)
- Toktam Dehghani
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
- Mahmoud Naghibzadeh
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran
- Mahdie Eghdami
- Department of Computer Engineering, Ferdowsi University of Mashhad, Mashhad, Iran

108
Ding W, Mao W, Shao D, Zhang W, Gong H. DeepConPred2: An Improved Method for the Prediction of Protein Residue Contacts. Comput Struct Biotechnol J 2018; 16:503-510. [PMID: 30505403 PMCID: PMC6247404 DOI: 10.1016/j.csbj.2018.10.009] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 09/14/2018] [Revised: 10/16/2018] [Accepted: 10/18/2018] [Indexed: 12/18/2022] Open
Abstract
Information on residue-residue contacts is essential for understanding the mechanism of protein folding, and has been successfully applied as special topological restraints to simplify the conformational sampling in de novo protein structure prediction. Prediction of protein residue contacts has progressed remarkably rapidly, with prediction accuracy approaching impressively high levels in the past two years. In this work, we introduce a second version of our residue contact predictor, DeepConPred2, which exhibits substantially improved performance and sufficiently reduced running time after model re-optimization and feature updates. When tested on the CASP12 free modeling targets, our program reaches at least the same level of prediction accuracy as the best contact predictors so far and provides information complementary to other state-of-the-art methods in contact-assisted folding.
Affiliation(s)
- Wenze Ding
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China; Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
- Wenzhi Mao
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China; Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
- Di Shao
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China; Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
- Wenxuan Zhang
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China; Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China
- Haipeng Gong
- MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China; Beijing Innovation Center of Structural Biology, Tsinghua University, Beijing 100084, China