1. Aldossari RM, Ali A, Rehman MU, Rashid S, Ahmad SB. Computational Approaches for Identification of Potential Plant Bioactives as Novel G6PD Inhibitors Using Advanced Tools and Databases. Molecules 2023; 28:3018. [PMID: 37049781 PMCID: PMC10096328 DOI: 10.3390/molecules28073018] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/25/2023] [Revised: 02/26/2023] [Accepted: 03/06/2023] [Indexed: 03/31/2023] Open
Abstract
In glucose metabolism, the pentose phosphate pathway (PPP) is the major metabolic pathway that plays a crucial role in cancer growth and metastasis. Although blockade of the PPP has been proposed as a promising approach against cancer, effective anti-PPP agents are still not available in the clinical setting. Dysfunction of the G6PD enzyme in this pathway leads to cancer development, as this enzyme possesses oncogenic activity. In the present study, an attempt was made to identify bioactive compounds that can be developed as potential G6PD inhibitors. Eleven natural compounds and a control drug were evaluated. The physicochemical and toxicity properties of the compounds were determined via ADMET and ProTox-II analysis. Docking studies revealed that staurosporine was the most effective compound, with the most favorable binding energy of −9.2 kcal/mol when docked against G6PD. Validation of the homology model showed that 97.56% of the residues fell in the Ramachandran-favored region. The modeled protein gave a quality Z-score of −10.13 with the ProSA tool. The iMODS server provided insights into the mobility, stability and flexibility of the G6PD protein, describing its collective functional motion. The physical and functional interactions between proteins were determined by STRING, and the CASTp server determined the topological and geometric properties of the G6PD protein. The findings suggest that staurosporine could be developed as a potential G6PD inhibitor; however, in vitro and in vivo studies are needed to validate these results.
Affiliation(s)
- Rana M. Aldossari
  - Department of Pharmacology & Toxicology, College of Pharmacy, Prince Sattam Bin Abdulaziz University, P.O. Box 173, Al-Kharj 11942, Saudi Arabia
- Aarif Ali
  - Division of Veterinary Biochemistry, Faculty of Veterinary Science and Animal Husbandry, SKUAST-Kashmir, Alustang, Shuhama 190006, Jammu & Kashmir, India
- Muneeb U. Rehman
  - Department of Clinical Pharmacy, College of Pharmacy, King Saud University, P.O. Box 2457, Riyadh 11451, Saudi Arabia
- Summya Rashid
  - Department of Pharmacology & Toxicology, College of Pharmacy, Prince Sattam Bin Abdulaziz University, P.O. Box 173, Al-Kharj 11942, Saudi Arabia
  - Correspondence:
- Sheikh Bilal Ahmad
  - Division of Veterinary Biochemistry, Faculty of Veterinary Science and Animal Husbandry, SKUAST-Kashmir, Alustang, Shuhama 190006, Jammu & Kashmir, India
2. Zhang H, Huang Y, Bei Z, Ju Z, Meng J, Hao M, Zhang J, Zhang H, Xi W. Inter-Residue Distance Prediction From Duet Deep Learning Models. Front Genet 2022; 13:887491. [PMID: 35651930 PMCID: PMC9148999 DOI: 10.3389/fgene.2022.887491] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 03/01/2022] [Accepted: 03/30/2022] [Indexed: 12/04/2022] Open
Abstract
Residue distance prediction from the sequence is critical for many biological applications such as protein structure reconstruction, protein–protein interaction prediction, and protein design. However, prediction of fine-grained distances between residues with long sequence separations remains challenging. In this study, we propose DuetDis, a method based on duet feature sets and a deep residual network with squeeze-and-excitation (SE), for protein inter-residue distance prediction. DuetDis can learn and fuse features directly or indirectly extracted from whole-genome/metagenomic databases and therefore minimizes information loss by ensembling models trained on different feature sets. We evaluate DuetDis and 11 widely used peer methods on a large-scale test set (610 protein chains). The experimental results suggest that 1) prediction results from different feature sets show obvious differences; 2) ensembling different feature sets can improve the prediction performance; 3) high-quality multiple sequence alignment (MSA) used for both training and testing can greatly improve the prediction performance; and 4) DuetDis is more accurate than peer methods for the overall prediction, more reliable in terms of model prediction score, and more robust against shallow MSAs.
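As a rough illustration of the squeeze-and-excitation (SE) idea named in the abstract, the sketch below rescales each channel of a feature map by a gate computed from its global average. It is a generic SE block in plain Python, not the DuetDis network: the function name `squeeze_excite`, the weight matrices `w1` and `w2`, and the toy feature map are all assumptions made for illustration, and biases are omitted for brevity.

```python
import math

def squeeze_excite(fmap, w1, w2):
    """Generic SE block on a C x H x W feature map given as nested lists.

    w1 (hidden x C) and w2 (C x hidden) are hypothetical weights;
    a trained network would learn them. Biases are omitted.
    """
    # Squeeze: global average pooling yields one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excitation: bottleneck layer with ReLU, then expansion layer with sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, fmap)]
    return scaled, gates

# Toy input: two 2x2 channels and illustrative weights.
fmap = [[[1.0, 2.0], [3.0, 4.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, 1.0]]      # reduce 2 channels to 1 hidden unit
w2 = [[0.0], [1.0]]    # expand back to 2 channel gates
scaled, gates = squeeze_excite(fmap, w1, w2)
```

With these toy weights the first channel is halved (its gate is the sigmoid of zero), while the second channel passes through almost unchanged.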
Affiliation(s)
- Huiling Zhang
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - University of Chinese Academy of Sciences, Beijing, China
- Ying Huang
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - University of Chinese Academy of Sciences, Beijing, China
- Zhendong Bei
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - University of Chinese Academy of Sciences, Beijing, China
- Zhen Ju
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - University of Chinese Academy of Sciences, Beijing, China
- Jintao Meng
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - University of Chinese Academy of Sciences, Beijing, China
- Min Hao
  - College of Electronic and Information Engineering, Southwest University, Chongqing, China
- Jingjing Zhang
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - University of Chinese Academy of Sciences, Beijing, China
- Haiping Zhang
  - University of Chinese Academy of Sciences, Beijing, China
- Wenhui Xi
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - University of Chinese Academy of Sciences, Beijing, China
  - *Correspondence: Wenhui Xi,
3. Mehrabiani KM, Cheng RR, Onuchic JN. Expanding Direct Coupling Analysis to Identify Heterodimeric Interfaces from Limited Protein Sequence Data. J Phys Chem B 2021; 125:11408-11417. [PMID: 34618469 DOI: 10.1021/acs.jpcb.1c07145] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/28/2022]
Abstract
Direct coupling analysis (DCA) is a global statistical approach that uses information encoded in protein sequence data to predict spatial contacts in a three-dimensional structure of a folded protein. DCA has been widely used to predict the monomeric fold at amino acid resolution and to identify biologically relevant interaction sites within a folded protein. Going beyond single proteins, DCA has also been used to identify spatial contacts that stabilize the interaction in protein complex formation. However, extracting this higher order information necessary to predict dimer contacts presents a significant challenge. A DCA evolutionary signal is much stronger at the single protein level (intraprotein contacts) than at the protein-protein interface (interprotein contacts). Therefore, if DCA-derived information is to be used to predict the structure of these complexes, there is a need to identify statistically significant DCA predictions. We propose a simple Z-score measure that can filter good predictions despite noisy, limited data. This new methodology not only improves our prediction ability but also provides a quantitative measure for the validity of the prediction.
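The abstract does not spell out how its Z-score is defined, so the following is only a minimal sketch of the general idea: standardize each interprotein coupling score against the distribution of all scores and keep the pairs that stand far above the background. The function name `zscore_filter`, the threshold of 3.0, and the toy scores are assumptions for illustration, not the authors' procedure.

```python
from statistics import mean, pstdev

def zscore_filter(scores, threshold=3.0):
    """Keep residue pairs whose coupling score is at least `threshold`
    standard deviations above the mean of all scores.

    scores: dict mapping (residue_in_A, residue_in_B) -> coupling score.
    Returns a dict mapping the retained pairs to their Z-scores.
    """
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation of the background
    if sigma == 0:
        return {}           # all scores identical: nothing stands out
    z = {pair: (s - mu) / sigma for pair, s in scores.items()}
    return {pair: zs for pair, zs in z.items() if zs >= threshold}

# Hypothetical interprotein scores: a flat noisy background plus one strong pair.
scores = {(i, j): 0.01 for i in range(10) for j in range(10)}
scores[(3, 7)] = 0.50
significant = zscore_filter(scores, threshold=3.0)
```

In this toy case only the pair `(3, 7)` survives the filter; the returned Z-score doubles as a rough confidence measure for the prediction.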
Affiliation(s)
- Kareem M Mehrabiani
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, United States
  - Systems, Synthetic, and Physical Biology, Rice University, Houston, Texas 77005, United States
- Ryan R Cheng
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, United States
- José N Onuchic
  - Center for Theoretical Biological Physics, Rice University, Houston, Texas 77005, United States
  - Systems, Synthetic, and Physical Biology, Rice University, Houston, Texas 77005, United States
  - Department of Physics & Astronomy, Rice University, Houston, Texas 77005, United States
  - Department of Chemistry, Rice University, Houston, Texas 77005, United States
  - Department of Biosciences, Rice University, Houston, Texas 77005, United States
4. Laine E, Eismann S, Elofsson A, Grudinin S. Protein sequence-to-structure learning: Is this the end(-to-end revolution)? Proteins 2021; 89:1770-1786. [PMID: 34519095 DOI: 10.1002/prot.26235] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 05/05/2021] [Revised: 08/16/2021] [Accepted: 09/03/2021] [Indexed: 01/08/2023]
Abstract
The potential of deep learning has been recognized in the protein structure prediction community for some time, and became indisputable after CASP13. In CASP14, deep learning has boosted the field to unanticipated levels reaching near-experimental accuracy. This success comes from advances transferred from other machine learning areas, as well as methods specifically designed to deal with protein sequences and structures, and their abstractions. Novel emerging approaches include (i) geometric learning, that is, learning on representations such as graphs, three-dimensional (3D) Voronoi tessellations, and point clouds; (ii) pretrained protein language models leveraging attention; (iii) equivariant architectures preserving the symmetry of 3D space; (iv) use of large meta-genome databases; (v) combinations of protein representations; and (vi) finally truly end-to-end architectures, that is, differentiable models starting from a sequence and returning a 3D structure. Here, we provide an overview and our opinion of the novel deep learning approaches developed in the last 2 years and widely used in CASP14.
Affiliation(s)
- Elodie Laine
  - Sorbonne Université, CNRS, IBPS, Laboratoire de Biologie Computationnelle et Quantitative (LCQB), Paris, France
- Stephan Eismann
  - Department of Computer Science and Applied Physics, Stanford University, Stanford, California, USA
- Arne Elofsson
  - Department of Biochemistry and Biophysics and Science for Life Laboratory, Stockholm University, Solna, Sweden
- Sergei Grudinin
  - Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, Grenoble, France
5. Pearce R, Zhang Y. Toward the solution of the protein structure prediction problem. J Biol Chem 2021; 297:100870. [PMID: 34119522 PMCID: PMC8254035 DOI: 10.1016/j.jbc.2021.100870] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Received: 04/20/2021] [Revised: 06/07/2021] [Accepted: 06/09/2021] [Indexed: 11/20/2022] Open
Abstract
Since Anfinsen demonstrated that the information encoded in a protein's amino acid sequence determines its structure in 1973, solving the protein structure prediction problem has been the Holy Grail of structural biology. The goal of protein structure prediction approaches is to utilize computational modeling to determine the spatial location of every atom in a protein molecule starting from only its amino acid sequence. Depending on whether homologous structures can be found in the Protein Data Bank (PDB), structure prediction methods have been historically categorized as template-based modeling (TBM) or template-free modeling (FM) approaches. Until recently, TBM has been the most reliable approach to predicting protein structures, and in the absence of reliable templates, the modeling accuracy sharply declines. Nevertheless, the results of the most recent community-wide assessment of protein structure prediction experiment (CASP14) have demonstrated that the protein structure prediction problem can be largely solved through the use of end-to-end deep machine learning techniques, where correct folds could be built for nearly all single-domain proteins without using the PDB templates. Critically, the model quality exhibited little correlation with the quality of available template structures, as well as the number of sequence homologs detected for a given target protein. Thus, the implementation of deep-learning techniques has essentially broken through the 50-year-old modeling border between TBM and FM approaches and has made the success of high-resolution structure prediction significantly less dependent on template availability in the PDB library.
Affiliation(s)
- Robin Pearce
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan, USA
6. Zhang H, Bei Z, Xi W, Hao M, Ju Z, Saravanan KM, Zhang H, Guo N, Wei Y. Evaluation of residue-residue contact prediction methods: From retrospective to prospective. PLoS Comput Biol 2021; 17:e1009027. [PMID: 34029314 PMCID: PMC8177648 DOI: 10.1371/journal.pcbi.1009027] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 10/30/2020] [Revised: 06/04/2021] [Accepted: 04/28/2021] [Indexed: 12/31/2022] Open
Abstract
Sequence-based residue contact prediction plays a crucial role in protein structure reconstruction. In recent years, the combination of evolutionary coupling analysis (ECA) and deep learning (DL) techniques has made tremendous progress in residue contact prediction, so a comprehensive assessment of current methods on a large-scale benchmark data set is much needed. In this study, we evaluate 18 contact predictors on 610 non-redundant proteins and 32 CASP13 targets from a wide range of perspectives. The results show that different methods have different application scenarios: (1) DL methods based on multiple categories of inputs and large training sets are the best choices for low-contact-density proteins such as the intrinsically disordered ones and proteins with shallow multi-sequence alignments (MSAs). (2) With at least 5L (L is sequence length) effective sequences in the MSA, all methods perform best, and methods that rely only on the MSA as input achieve performance comparable to methods that adopt multi-source inputs. (3) For top L/5 and L/2 predictions, DL methods can predict more hydrophobic interactions while ECA methods predict more salt bridges and disulfide bonds. (4) ECA methods can detect more secondary structure interactions, while DL methods can accurately excavate more contact patterns and prune isolated false positives. In general, multi-input DL methods with large training sets dominate current approaches with the best overall performance. Despite the great success of current DL methods, there is still much room for further improvement: (1) With shallow MSAs, the performance will be greatly affected. (2) Current methods show lower precisions for inter-domain compared with intra-domain contact predictions, as well as very high imbalances in precisions between intra-domains. (3) Strong prediction similarities between DL methods indicate that more feature types and more diversified models need to be developed. (4) The runtime of most methods can be further optimized.

The amino acid sequence of a protein ultimately determines its tertiary structure, and the tertiary structure determines its function(s) and plays a key role in understanding biological processes and disease pathogenesis. Protein tertiary structure can be determined using experimental techniques such as cryo-electron microscopy, nuclear magnetic resonance and X-ray crystallography, which are very expensive and time-consuming. As an alternative, researchers are trying to use in silico methods to predict the 3D structures. Residue contact-assisted protein folding paves an avenue for sequence-based protein structure prediction and therefore has become one of the most challenging and promising problems in structural bioinformatics. Over the past years, contact prediction has undergone continuous evolution in techniques. Through a retrospective analysis of traditional machine learning, evolutionary coupling analysis, and consensus machine learning methods, and a multi-perspective study of recently developed deep learning methods, we explore the most advanced contact predictors, pursue application scenarios for different methods, and seek prospective directions for further improvement. We anticipate that our study will serve as a practical and useful guide for the development of future approaches to contact prediction.
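The "top L/5" and "top L/2" figures mentioned above refer to the precision of the highest-ranked predicted contacts, where L is the sequence length. A minimal sketch of that metric follows; the function name `top_k_precision`, the minimum sequence separation of 6, and the toy predictions are assumptions for illustration and may differ from the evaluation protocol used in the paper.

```python
def top_k_precision(pred, true_contacts, seq_len, k=5, min_sep=6):
    """Precision of the top L/k predicted contacts.

    pred: dict mapping (i, j) with i < j to a predicted contact probability.
    true_contacts: set of (i, j) pairs that are contacts in the native structure.
    min_sep: minimum sequence separation |i - j| for a pair to be evaluated
             (an assumed value; benchmarks use different separation ranges).
    """
    # Discard pairs that are too close in sequence, then rank by probability.
    ranked = sorted(
        ((p, pair) for pair, p in pred.items() if pair[1] - pair[0] >= min_sep),
        reverse=True,
    )
    n = max(1, seq_len // k)
    top = [pair for _, pair in ranked[:n]]
    if not top:
        return 0.0
    hits = sum(1 for pair in top if pair in true_contacts)
    return hits / len(top)

# Toy example for a 20-residue protein: the top L/5 = 4 pairs are scored.
pred = {(0, 10): 0.9, (1, 12): 0.8, (2, 15): 0.7, (3, 19): 0.6,
        (4, 18): 0.5, (5, 6): 0.99}   # (5, 6) is ignored: separation < 6
true_contacts = {(0, 10), (2, 15), (4, 18)}
precision = top_k_precision(pred, true_contacts, seq_len=20, k=5)
```

Here two of the four top-ranked pairs are true contacts, so the top L/5 precision is 0.5.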
Affiliation(s)
- Huiling Zhang
  - University of Chinese Academy of Sciences, Beijing, China
  - Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Zhendong Bei
  - Cloud Computing Department, Alibaba Group, Hangzhou, China
- Wenhui Xi
  - Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Min Hao
  - College of Electronic and Information Engineering, Southwest University, Chongqing, China
- Zhen Ju
  - University of Chinese Academy of Sciences, Beijing, China
  - Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Konda Mani Saravanan
  - Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Haiping Zhang
  - Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Ning Guo
  - Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Yanjie Wei
  - University of Chinese Academy of Sciences, Beijing, China
  - Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
  - * E-mail:
7. Roche R, Bhattacharya S, Bhattacharya D. Hybridized distance- and contact-based hierarchical structure modeling for folding soluble and membrane proteins. PLoS Comput Biol 2021; 17:e1008753. [PMID: 33621244 PMCID: PMC7935296 DOI: 10.1371/journal.pcbi.1008753] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/04/2020] [Revised: 03/05/2021] [Accepted: 01/31/2021] [Indexed: 11/18/2022] Open
Abstract
Crystallography and NMR system (CNS) is currently a widely used method for fragment-free ab initio protein folding from inter-residue distance or contact maps. Despite its widespread use in protein structure prediction, CNS is a decade-old macromolecular structure determination system that was originally developed for solving macromolecular geometry from experimental restraints as opposed to predictive modeling driven by interaction map data. As such, the adaptation of the CNS experimental structure determination protocol for ab initio protein folding is intrinsically anomalous and may undermine the folding accuracy of computational protein structure prediction. In this paper, we propose a new CNS-free hierarchical structure modeling method called DConStruct for folding both soluble and membrane proteins driven by distance and contact information. Rigorous experimental validation shows that DConStruct attains much better reconstruction accuracy than CNS when tested with the same input contact map at varying contact thresholds. The hierarchical modeling with iterative self-correction employed in DConStruct scales to a much higher degree of folding accuracy than CNS as contact thresholds increase, ultimately approaching near-optimal reconstruction accuracy at higher-thresholded contact maps. The folding accuracy of DConStruct can be further improved by exploiting distance-based hybrid interaction maps at tri-level thresholding, as demonstrated by the better performance of our method in folding free modeling targets from the 12th and 13th rounds of the Critical Assessment of techniques for protein Structure Prediction (CASP) experiments compared to popular CNS- and fragment-based approaches and energy-minimization protocols, some of which even use much finer-grained distance maps than ours. Additional large-scale benchmarking shows that DConStruct can significantly improve the folding accuracy of membrane proteins compared to a CNS-based approach. These results collectively demonstrate the feasibility of greatly improving the accuracy of ab initio protein folding by optimally exploiting the information encoded in inter-residue interaction maps beyond what is possible by CNS.

Predicting the folded and functional 3-dimensional structure of a protein molecule from its amino acid sequence is of central importance to structural biology. Recently, promising advances have been made in ab initio protein folding due to the reasonably accurate estimation of inter-residue interaction maps at increasingly higher resolutions that range from binary contacts to finer-grained distances. Despite the progress in predicting the interaction maps, approaches for turning the residue-residue interactions projected in these maps into their precise spatial positioning heavily rely on a decade-old experimental structure determination protocol that is not suitable for predictive modeling. This paper presents a new hierarchical structure modeling method, DConStruct, which can better exploit the information encoded in the interaction maps at multiple granularities, from binary contact maps to distance-based hybrid maps at tri-level thresholding, for improved ab initio folding. Multiple large-scale benchmarking experiments show that our proposed method can substantially improve the folding accuracy for both soluble and membrane proteins compared to state-of-the-art approaches. DConStruct is licensed under the GNU General Public License v3 and freely available at https://github.com/Bhattacharya-Lab/DConStruct.
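To make the notion of contact maps "at varying contact thresholds" concrete, the sketch below derives binary contact maps from a distance matrix at three cutoffs. This is one plausible reading of multi-level thresholding, not the DConStruct procedure: the function name `threshold_maps` and the cutoffs of 8, 10 and 12 Å are assumptions chosen for illustration, and the abstract does not state which thresholds the method uses.

```python
def threshold_maps(dist, cutoffs=(8.0, 10.0, 12.0)):
    """Build one binary contact map per distance cutoff.

    dist: symmetric N x N matrix (nested lists) of inter-residue distances in Å.
    cutoffs: assumed thresholds; a pair is a contact when its distance is at
             or below the cutoff. The diagonal is never counted as a contact.
    Returns a dict mapping each cutoff to an N x N map of 0/1 values.
    """
    n = len(dist)
    return {
        c: [[1 if i != j and dist[i][j] <= c else 0 for j in range(n)]
            for i in range(n)]
        for c in cutoffs
    }

# Toy 3-residue distance matrix (values in Å, purely illustrative).
dist = [
    [0.0, 7.5, 11.0],
    [7.5, 0.0, 9.0],
    [11.0, 9.0, 0.0],
]
maps = threshold_maps(dist)
```

Raising the cutoff adds contacts monotonically: the 8 Å map contains one residue pair, the 10 Å map two, and the 12 Å map all three.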
Affiliation(s)
- Rahmatullah Roche
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, Alabama, United States of America
- Sutanu Bhattacharya
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, Alabama, United States of America
- Debswapna Bhattacharya
  - Department of Computer Science and Software Engineering, Auburn University, Auburn, Alabama, United States of America
  - Department of Biological Sciences, Auburn University, Auburn, Alabama, United States of America
  - * E-mail:
8. Seffernick JT, Lindert S. Hybrid methods for combined experimental and computational determination of protein structure. J Chem Phys 2020; 153:240901. [PMID: 33380110 PMCID: PMC7773420 DOI: 10.1063/5.0026025] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 08/20/2020] [Accepted: 11/10/2020] [Indexed: 02/04/2023] Open
Abstract
Knowledge of protein structure is paramount to understanding biological function, developing new therapeutics, and making detailed mechanistic hypotheses. Therefore, methods to accurately elucidate three-dimensional structures of proteins are in high demand. A few experimental techniques, such as x-ray crystallography, nuclear magnetic resonance (NMR), and cryo-EM, can routinely provide high-resolution structures, but each has shortcomings and thus cannot be used in all cases. In addition, a large number of experimental techniques have been developed that provide some structural information, but not enough to assign atomic positions with high certainty. These methods offer sparse experimental data, which can also be noisy and inaccurate in some instances. In cases where it is not possible to determine the structure of a protein experimentally, computational structure prediction methods can be used as an alternative. Although computational methods can be performed without any experimental data, in a large number of studies the inclusion of sparse experimental data in these prediction methods has yielded significant improvement. In this Perspective, we cover many of the successes of integrative modeling, that is, computational modeling with experimental data, specifically for protein folding, protein-protein docking, and molecular dynamics simulations. We describe methods that incorporate sparse data from cryo-EM, NMR, mass spectrometry, electron paramagnetic resonance, small-angle x-ray scattering, Förster resonance energy transfer, and genetic sequence covariation. Finally, we highlight some of the major challenges in the field as well as possible future directions.
Affiliation(s)
- Justin T. Seffernick
  - Department of Chemistry and Biochemistry, Ohio State University, Columbus, Ohio 43210, USA
- Steffen Lindert
  - Department of Chemistry and Biochemistry, Ohio State University, Columbus, Ohio 43210, USA
9. Zhang GJ, Wang XQ, Ma LF, Wang LJ, Hu J, Zhou XG. Two-Stage Distance Feature-based Optimization Algorithm for De novo Protein Structure Prediction. IEEE/ACM Trans Comput Biol Bioinform 2020; 17:2119-2130. [PMID: 31107659 DOI: 10.1109/tcbb.2019.2917452] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/09/2023]
Abstract
De novo protein structure prediction can be treated as a conformational space optimization problem under the guidance of an energy function. However, designing an accurate energy function that ensures low-energy conformations are close to native structures remains a challenge. Fortunately, recent studies have shown that the accuracy of de novo protein structure prediction can be significantly improved by integrating residue-residue distance information. In this paper, a two-stage distance feature-based optimization algorithm (TDFO) for de novo protein structure prediction is proposed within the framework of an evolutionary algorithm. In TDFO, a similarity model is first designed using feature information extracted from distance profiles by the bisecting K-means algorithm. A similarity model-based selection strategy is then developed to guide the conformation search and thus improve the quality of the predicted models. Moreover, global and local mutation strategies are designed, and a state estimation strategy is proposed to strike a trade-off between exploration and exploitation of the search space. Experimental results on 35 benchmark proteins show that the proposed TDFO can improve prediction accuracy for a large portion of the test proteins.
10. Liu J, Zhou XG, Zhang Y, Zhang GJ. CGLFold: a contact-assisted de novo protein structure prediction using global exploration and loop perturbation sampling algorithm. Bioinformatics 2020; 36:2443-2450. [PMID: 31860059 DOI: 10.1093/bioinformatics/btz943] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 07/03/2019] [Revised: 12/10/2019] [Accepted: 12/18/2019] [Indexed: 12/27/2022] Open
Abstract
MOTIVATION: Regions that connect secondary structure elements in a protein are known as loops, in which a slight change can dramatically affect the entire topology. This study investigates whether the accuracy of protein structure prediction can be improved using a loop-specific sampling strategy.
RESULTS: A novel de novo protein structure prediction method that combines global exploration and loop perturbation is proposed in this study. In the global exploration phase, fragment recombination and assembly are used to explore the massive conformational space and generate native-like topology. In the loop perturbation phase, a loop-specific local perturbation model is designed to improve the accuracy of the conformation and is solved by a differential evolution algorithm. These two phases enable cooperation between global exploration and local exploitation. The filtered contact information is used to construct the conformation selection model for guiding the sampling. The proposed CGLFold is tested on 145 benchmark proteins, 14 free modeling (FM) targets of CASP13 and 29 FM targets of CASP12. The experimental results show that the loop-specific local perturbation can increase the structure diversity and success rate of conformational update and gradually improve conformation accuracy. CGLFold obtains template modeling score ≥ 0.5 models on 95 standard test proteins, 7 FM targets of CASP13 and 9 FM targets of CASP12.
AVAILABILITY AND IMPLEMENTATION: The source code and executable versions are freely available at https://github.com/iobio-zjut/CGLFold.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jun Liu
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Xiao-Gen Zhou
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109-2218, USA
- Gui-Jun Zhang
  - College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
11. Baldessari F, Capelli R, Carloni P, Giorgetti A. Coevolutionary data-based interaction networks approach highlighting key residues across protein families: The case of the G-protein coupled receptors. Comput Struct Biotechnol J 2020; 18:1153-1159. [PMID: 32489528 PMCID: PMC7260681 DOI: 10.1016/j.csbj.2020.05.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/27/2020] [Revised: 05/01/2020] [Accepted: 05/06/2020] [Indexed: 12/26/2022] Open
Abstract
We present an approach that, by integrating structural data with Direct Coupling Analysis, is able to pinpoint most of the interaction hotspots (i.e. key residues for the biological activity) across very sparse protein families in a single run. An application to the Class A G-protein coupled receptors (GPCRs), both in their active and inactive states, demonstrates the predictive power of our approach. The latter can be easily extended to any other kind of protein family, where it is expected to highlight most key sites involved in their functional activity.
Collapse
Affiliation(s)
- Filippo Baldessari
- Department of Biotechnology, Università di Verona, Ca Vignal 1, strada Le Grazie 15, I-37134 Verona, Italy
| | - Riccardo Capelli
- Computational Biomedicine Section, IAS-5/INM-9, Forschungszentrum Jülich, Wilhelm-Johnen-Straße, D-52425 Jülich, Germany
| | - Paolo Carloni
- Computational Biomedicine Section, IAS-5/INM-9, Forschungszentrum Jülich, Wilhelm-Johnen-Straße, D-52425 Jülich, Germany
| | - Alejandro Giorgetti
- Department of Biotechnology, Università di Verona, Ca Vignal 1, strada Le Grazie 15, I-37134 Verona, Italy
- Computational Biomedicine Section, IAS-5/INM-9, Forschungszentrum Jülich, Wilhelm-Johnen-Straße, D-52425 Jülich, Germany
| |
|
12
|
Zhang GJ, Ma LF, Wang XQ, Zhou XG. Secondary Structure and Contact Guided Differential Evolution for Protein Structure Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1068-1081. [PMID: 30295627 DOI: 10.1109/tcbb.2018.2873691] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Ab initio protein tertiary structure prediction is one of the long-standing problems in structural bioinformatics. With the help of residue-residue contact and secondary structure prediction information, the accuracy of ab initio structure prediction can be enhanced. In this study, an improved differential evolution with secondary structure and residue-residue contact information referred to as SCDE is proposed for protein structure prediction. In SCDE, two score models based on secondary structure and contact information are proposed, and two selection strategies, namely, secondary structure-based selection strategy and contact-based selection strategy, are designed to guide conformation space search. A probability distribution function is designed to balance these two selection strategies. Experimental results on a benchmark dataset with 28 proteins and four free model targets in CASP12 demonstrate that the proposed SCDE is effective and efficient.
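As a rough illustration of the differential evolution search that this abstract builds on, the sketch below runs a generic DE/rand/1/bin loop over a short vector of torsion-like angles. It is not the SCDE algorithm: the secondary-structure and contact score models and the two selection strategies are not described in enough detail here to reproduce, so `toy_energy`, its target angles, and all parameter values (`pop_size`, `f`, `cr`, `generations`) are hypothetical stand-ins.

```python
import random


def differential_evolution(objective, dim, bounds, pop_size=20,
                           f=0.5, cr=0.9, generations=300, seed=0):
    """Generic DE/rand/1/bin minimiser (illustrative, not SCDE itself)."""
    rng = random.Random(seed)
    low, high = bounds
    # Random initial population inside the bounds.
    pop = [[rng.uniform(low, high) for _ in range(dim)]
           for _ in range(pop_size)]
    scores = [objective(x) for x in pop]
    for _ in range(generations):
        for t in range(pop_size):
            # Pick three distinct individuals other than the target.
            a, b, c = rng.sample([k for k in range(pop_size) if k != t], 3)
            forced = rng.randrange(dim)  # at least one mutated coordinate
            trial = []
            for d in range(dim):
                if rng.random() < cr or d == forced:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    value = min(max(value, low), high)  # clamp to bounds
                else:
                    value = pop[t][d]
                trial.append(value)
            trial_score = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[t]:
                pop[t], scores[t] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def toy_energy(angles):
    """Hypothetical energy: squared distance to made-up target angles."""
    target = [-60.0, -45.0, -60.0, -45.0]
    return sum((x - y) ** 2 for x, y in zip(angles, target))


best_angles, best_energy = differential_evolution(
    toy_energy, dim=4, bounds=(-180.0, 180.0))
```

In a structure-prediction setting the objective would be replaced by an energy or score function over a full conformation, and the greedy selection step is where guidance such as the secondary-structure and contact scores mentioned above could enter.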
|
13
|
Badaczewska-Dawid AE, Kolinski A, Kmiecik S. Computational reconstruction of atomistic protein structures from coarse-grained models. Comput Struct Biotechnol J 2019; 18:162-176. [PMID: 31969975 PMCID: PMC6961067 DOI: 10.1016/j.csbj.2019.12.007] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2019] [Accepted: 12/10/2019] [Indexed: 01/02/2023] Open
Abstract
Three-dimensional protein structures, whether determined experimentally or theoretically, are often too low resolution. In this mini-review, we outline the computational methods for protein structure reconstruction from incomplete coarse-grained to all atomistic models. Typical reconstruction schemes can be divided into four major steps. Usually, the first step is reconstruction of the protein backbone chain starting from the C-alpha trace. This is followed by side-chains rebuilding based on protein backbone geometry. Subsequently, hydrogen atoms can be reconstructed. Finally, the resulting all-atom models may require structure optimization. Many methods are available to perform each of these tasks. We discuss the available tools and their potential applications in integrative modeling pipelines that can transfer coarse-grained information from computational predictions, or experiment, to all atomistic structures.
Affiliation(s)
| | | | - Sebastian Kmiecik
- Faculty of Chemistry, Biological and Chemical Research Center, University of Warsaw, Pasteura 1, 02-093 Warsaw, Poland
| |
|
14
|
Wozniak PP, Pelc J, Skrzypecki M, Vriend G, Kotulska M. Bio-knowledge-based filters improve residue-residue contact prediction accuracy. Bioinformatics 2019; 34:3675-3683. [PMID: 29850768 DOI: 10.1093/bioinformatics/bty416] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2017] [Accepted: 05/19/2018] [Indexed: 11/13/2022] Open
Abstract
Motivation Residue-residue contact prediction through direct coupling analysis has reached impressive accuracy, but yet higher accuracy will be needed to allow for routine modelling of protein structures. One way to improve the prediction accuracy is to filter predicted contacts using knowledge about the particular protein of interest or knowledge about protein structures in general. Results We focus on the latter and discuss a set of filters that can be used to remove false positive contact predictions. Each filter depends on one or a few cut-off parameters for which the filter performance was investigated. Combining all filters while using default parameters resulted for a test set of 851 protein domains in the removal of 29% of the predictions of which 92% were indeed false positives. Availability and implementation All data and scripts are available at http://comprec-lin.iiar.pwr.edu.pl/FPfilter/. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- P P Wozniak
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
| | - J Pelc
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
| | - M Skrzypecki
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
| | - G Vriend
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, Nijmegen, The Netherlands
| | - M Kotulska
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
| |
|
15
|
Vajdi A, Zarringhalam K, Haspel N. Patch-DCA: improved protein interface prediction by utilizing structural information and clustering DCA scores. Bioinformatics 2019; 36:1460-1467. [DOI: 10.1093/bioinformatics/btz791] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2019] [Revised: 09/30/2019] [Accepted: 10/15/2019] [Indexed: 01/07/2023] Open
Abstract
Motivation
Over the past decade, there have been impressive advances in determining the 3D structures of protein complexes. However, there are still many complexes with unknown structures, even when the structures of the individual proteins are known. The advent of protein sequence information provides an opportunity to leverage evolutionary information to enhance the accuracy of protein–protein interface prediction. To this end, several statistical and machine learning methods have been proposed. In particular, direct coupling analysis has recently emerged as a promising approach for identification of protein contact maps from sequential information. However, the ability of these methods to detect protein–protein inter-residue contacts remains relatively limited.
Results
In this work, we propose a method to integrate sequential and co-evolution information with structural and functional information to increase the performance of protein–protein interface prediction. Further, we present a post-processing clustering method that improves the average relative F1 score by 70% and 24% and the average relative precision by 80% and 36% in comparison with two state-of-the-art methods, PSICOV and GREMLIN.
Availability and implementation
https://github.com/BioMLBoston/PatchDCA
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Amir Vajdi
- Computer Science Department, University of Massachusetts Boston, Boston, MA, USA
- Department of Informatics and Analytics, Dana-Farber Cancer Institute, Boston, MA, USA
| | | | - Nurit Haspel
- Computer Science Department, University of Massachusetts Boston, Boston, MA, USA
| |
|
16
|
Accurate Classification of Biological and non-Biological Interfaces in Protein Crystal Structures using Subtle Covariation Signals. Sci Rep 2019; 9:12603. [PMID: 31471543 PMCID: PMC6717244 DOI: 10.1038/s41598-019-48913-8] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2017] [Accepted: 08/14/2019] [Indexed: 11/08/2022] Open
Abstract
Proteins often work as oligomers or multimers in vivo. Therefore, elucidating their oligomeric or multimeric form (quaternary structure) is crucially important to ascertain their function. X-ray crystal structures of numerous proteins have been accumulated, providing information related to their biological units. Extracting information of biological units from protein crystal structures represents a meaningful task for modern biology. Nevertheless, although many methods have been proposed for identifying biological units appearing in protein crystal structures, it is difficult to distinguish biological protein-protein interfaces from crystallographic ones. Therefore, our simple but highly accurate classifier was developed to infer biological units in protein crystal structures using large amounts of protein sequence information and a modern contact prediction method to exploit covariation signals (CSs) in proteins. We demonstrate that our proposed method is promising even for weak signals of biological interfaces. We also discuss the relation between classification accuracy and conservation of biological units, and illustrate how the selection of sequences included in multiple sequence alignments as sources for obtaining CSs affects the results. With increased amounts of sequence data, the proposed method is expected to become increasingly useful.
|
17
|
The role of coevolutionary signatures in protein interaction dynamics, complex inference, molecular recognition, and mutational landscapes. Curr Opin Struct Biol 2019; 56:179-186. [PMID: 31029927 DOI: 10.1016/j.sbi.2019.03.024] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2019] [Revised: 03/18/2019] [Accepted: 03/19/2019] [Indexed: 11/22/2022]
Abstract
Evolution imposes constraints at the interface of interacting biomolecules in order to preserve function or maintain fitness. This pressure may have a direct effect on the sequence composition of interacting biomolecules. As a result, statistical patterns of amino acid or nucleotide covariance that encode for physical and functional interactions are observed in sequences of extant organisms. In recent years, global pairwise models of amino acid and nucleotide coevolution from multiple sequence alignments have been developed and utilized to study molecular interactions in structural biology. In proteins, for which the energy landscape is funneled and minimally frustrated, a direct connection between the physical and sequence space landscapes can be established. Estimating coevolutionary information from sequences of interacting molecules has a broad impact in molecular biology. Applications include the accurate determination of 3D structures of molecular complexes, inference of protein interaction partners, models of protein-protein interaction specificity, the elucidation, and design of protein-nucleic acid recognition as well as the discovery of genome-wide epistatic effects. The current state of the art of coevolutionary analysis includes biomedical applications ranging from mutational landscapes and drug-design to vaccine development.
|
18
|
Haldane A, Levy RM. Influence of multiple-sequence-alignment depth on Potts statistical models of protein covariation. Phys Rev E 2019; 99:032405. [PMID: 30999494 PMCID: PMC6508952 DOI: 10.1103/physreve.99.032405] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2018] [Indexed: 02/02/2023]
Abstract
Potts statistical models have become a popular and promising way to analyze mutational covariation in protein multiple sequence alignments (MSAs) in order to understand protein structure, function, and fitness. But the statistical limitations of these models, which can have millions of parameters and are fit to MSAs of only thousands or hundreds of effective sequences using a procedure known as inverse Ising inference, are incompletely understood. In this work we predict how model quality degrades as a function of the number of sequences N, sequence length L, amino-acid alphabet size q, and the degree of conservation of the MSA, in different applications of the Potts models: in "fitness" predictions of individual protein sequences, in predictions of the effects of single-point mutations, in "double mutant cycle" predictions of epistasis, and in 3D contact prediction in protein structure. We show how, as MSA depth N decreases, an "overfitting" effect occurs such that sequences in the training MSA have overestimated fitness, and we predict the magnitude of this effect and discuss how regularization can help correct for it, using a regularization procedure motivated by statistical analysis of the effects of finite sampling. We find that as N decreases the quality of point-mutation effect predictions degrades least, fitness and epistasis predictions degrade more rapidly, and contact predictions are most affected. However, overfitting becomes negligible for MSA depths of more than a few thousand effective sequences, as often used in practice, and regularization becomes less necessary. We discuss the implications of these results for users of Potts covariation analysis.
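For orientation, the Potts model referred to above is usually written in the following standard form for a sequence S = (s_1, ..., s_L) over an alphabet of size q. The field and coupling symbols h_i and J_ij are the conventional notation and an assumption here, since the abstract does not spell them out.

```latex
P(S) = \frac{1}{Z}\exp\bigl(-E(S)\bigr), \qquad
E(S) = -\sum_{i=1}^{L} h_i(s_i) \;-\; \sum_{1 \le i < j \le L} J_{ij}(s_i, s_j)

% Parameter count before gauge fixing:
N_{\text{params}} = Lq + \binom{L}{2} q^{2}

% Illustrative values (assumed, not from the abstract): L = 200, q = 21
N_{\text{params}} = 200 \cdot 21 + 19900 \cdot 441 = 8\,780\,100
```

With these illustrative values the model has several million parameters, which matches the abstract's point that such models are fit to alignments containing far fewer effective sequences than parameters.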
Affiliation(s)
- Allan Haldane
- Center for Biophysics and Computational Biology, Department of Physics, and Institute for Computational Molecular Science, Temple University, Philadelphia, Pennsylvania 19122
| | - Ronald M. Levy
- Center for Biophysics and Computational Biology, Department of Chemistry, and Institute for Computational Molecular Science, Temple University, Philadelphia, Pennsylvania 19122
| |
|
19
|
Ji S, Oruç T, Mead L, Rehman MF, Thomas CM, Butterworth S, Winn PJ. DeepCDpred: Inter-residue distance and contact prediction for improved prediction of protein structure. PLoS One 2019; 14:e0205214. [PMID: 30620738 PMCID: PMC6324825 DOI: 10.1371/journal.pone.0205214] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2018] [Accepted: 12/13/2018] [Indexed: 11/28/2022] Open
Abstract
Rapid, accurate prediction of protein structure from amino acid sequence would accelerate fields as diverse as drug discovery, synthetic biology and disease diagnosis. Massively improved prediction of protein structures has been driven by improving the prediction of the amino acid residues that contact in their 3D structure. For an average globular protein, around 92% of all residue pairs are non-contacting, therefore accurate prediction of only a small percentage of inter-amino acid distances could increase the number of constraints to guide structure determination. We have trained deep neural networks to predict inter-residue contacts and distances. Distances are predicted with an accuracy better than most contact prediction techniques. Addition of distance constraints improved de novo structure predictions for test sets of 158 protein structures, as compared to using the best contact prediction methods alone. Importantly, usage of distance predictions allows the selection of better models from the structure pool without a need for an external model assessment tool. The results also indicate how the accuracy of distance prediction methods might be improved further.
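The statement that most residue pairs are non-contacting can be checked on any structure by building a contact map. The sketch below is a minimal version assuming one representative atom per residue and an 8 Å distance cutoff; both choices are common conventions rather than details given in this abstract, and the straight-chain coordinates are a made-up toy, not a real protein.

```python
from itertools import combinations
from math import dist


def non_contact_fraction(coords, cutoff=8.0):
    """Fraction of residue pairs whose distance exceeds `cutoff` (in Å).

    `coords` holds one (x, y, z) tuple per residue; using a single
    representative atom per residue is an assumption of this sketch.
    """
    pairs = list(combinations(range(len(coords)), 2))
    non_contacts = sum(
        1 for i, j in pairs if dist(coords[i], coords[j]) > cutoff)
    return non_contacts / len(pairs)


# Toy example: ten residues on a straight line, 3.8 Å apart.
toy_chain = [(3.8 * i, 0.0, 0.0) for i in range(10)]
toy_fraction = non_contact_fraction(toy_chain)
```

In the toy chain only residues one or two positions apart fall within the cutoff, so 28 of the 45 pairs are non-contacting. A folded globular protein has additional contacts between residues that are distant in sequence.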
Affiliation(s)
- Shuangxi Ji
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
| | - Tuğçe Oruç
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
| | - Liam Mead
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
| | - Muhammad Fayyaz Rehman
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
| | | | - Sam Butterworth
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
- Division of Pharmacy and Optometry, School of Health Sciences, Manchester Academic Health Sciences Centre, University of Manchester, Manchester, M13 9PL, United Kingdom
| | - Peter James Winn
- School of Biosciences, University of Birmingham, Edgbaston Birmingham, B15 2TT, United Kingdom
| |
|
20
|
Coevolutionary Signals and Structure-Based Models for the Prediction of Protein Native Conformations. Methods Mol Biol 2019; 1851:83-103. [PMID: 30298393 DOI: 10.1007/978-1-4939-8736-8_5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
The analysis of coevolutionary signals from families of evolutionarily related sequences is a recent conceptual framework that provides valuable information about unique intramolecular interactions and, therefore, can assist in the elucidation of biomolecular conformations. It is based on the idea that compensatory mutations at specific residue positions in a sequence help preserve stability of protein architecture and function and leave a statistical signature related to residue-residue interactions in the 3D structure of the protein. Consequently, statistical analysis of these correlated mutations in subsets of protein sequence alignments can be used to predict which residue pairs should be in spatial proximity in the native functional protein fold. These predicted signals can be then used to guide molecular dynamics (MD) simulations to predict the three-dimensional coordinates of a functional amino acid chain. In this chapter, we introduce a general and efficient methodology to perform coevolutionary analysis on protein sequences and to use this information in combination with computational physical models to predict the native 3D conformation of functional polypeptides. We present a step-by-step methodology that includes the description and application of software tools and databases required to infer tertiary structures of a protein fold. The general pipeline includes instructions on (1) how to obtain direct amino acid couplings from protein sequences using direct coupling analysis (DCA), (2) how to incorporate such signals as interaction potentials in Cα structure-based models (SBMs) to drive protein-folding MD simulations, (3) a procedure to estimate secondary structure and how to include such estimates in the topology files required in the MD simulations, and (4) how to build full atomic models based on the top Cα candidates selected in the pipeline. 
The information presented in this chapter is self-contained and sufficient to allow a computational scientist to predict structures of proteins using publicly available algorithms and databases.
|
21
|
de Oliveira SHP, Shi J, Deane CM. Comparing co-evolution methods and their application to template-free protein structure prediction. Bioinformatics 2018; 33:373-381. [PMID: 28171606 PMCID: PMC5860252 DOI: 10.1093/bioinformatics/btw618] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2016] [Revised: 09/19/2016] [Accepted: 09/22/2016] [Indexed: 02/01/2023] Open
Abstract
Motivation Co-evolution methods have been used as contact predictors to identify pairs of residues that share spatial proximity. Such contact predictors have been compared in terms of the precision of their predictions, but there is no study that compares their usefulness to model generation. Results We compared eight different co-evolution methods for a set of ∼3500 proteins and found that metaPSICOV stage 2 produces, on average, the most precise predictions. Precision of all the methods is dependent on SCOP class, with most methods predicting contacts in all α and membrane proteins poorly. The contact predictions were then used to assist in de novo model generation. We found that it was not the method with the highest average precision, but rather metaPSICOV stage 1 predictions that consistently led to the best models being produced. Our modelling results show a correlation between the proportion of predicted long range contacts that are satisfied on a model and its quality. We used this proportion to effectively classify models as correct/incorrect; discarding decoys classified as incorrect led to an enrichment in the proportion of good decoys in our final ensemble by a factor of seven. For 17 out of the 18 cases where correct answers were generated, the best models were not discarded by this approach. We were also able to identify eight cases where no correct decoy had been generated. Availability and Implementation Data are available for download from: http://opig.stats.ox.ac.uk/resources. Contact saulo.deoliveira@dtc.ox.ac.uk Supplementary Information Supplementary data are available at Bioinformatics online.
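The proportion of predicted long-range contacts satisfied by a model, which the abstract uses to separate correct from incorrect decoys, can be sketched as follows. The 8 Å cutoff, the sequence separation of at least 24 residues for "long range", and the use of one representative atom per residue are assumed conventions, not values taken from the paper, and the coordinates are a made-up toy.

```python
from math import dist


def satisfied_long_range_fraction(predicted, coords,
                                  cutoff=8.0, min_separation=24):
    """Fraction of predicted long-range contacts realised in a model.

    `predicted` is a list of (i, j) residue index pairs and `coords`
    holds one (x, y, z) tuple per residue. The cutoff and separation
    thresholds are assumptions of this sketch.
    """
    long_range = [(i, j) for i, j in predicted
                  if abs(i - j) >= min_separation]
    if not long_range:
        return 0.0
    satisfied = sum(
        1 for i, j in long_range if dist(coords[i], coords[j]) <= cutoff)
    return satisfied / len(long_range)


# Toy model: a straight chain whose last residue folds back onto the first.
toy_model = [(3.8 * i, 0.0, 0.0) for i in range(30)]
toy_model[29] = (0.0, 5.0, 0.0)
toy_predictions = [(0, 29), (1, 28), (0, 5)]
toy_score = satisfied_long_range_fraction(toy_predictions, toy_model)
```

Here the pair (0, 5) is ignored as short range, (0, 29) is satisfied and (1, 28) is not, so the score is 0.5. A decoy filter in the spirit of the abstract would discard models whose score falls below some threshold, which would have to be chosen empirically.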
Affiliation(s)
| | - Jiye Shi
- Department of Informatics, UCB Pharma, Slough SL1 3WE, UK; Shanghai Institute of Applied Physics, Chinese Academy of Sciences, Shanghai 201800, China
| | | |
|
22
|
dos Santos RN, Khan S, Morcos F. Characterization of C-ring component assembly in flagellar motors from amino acid coevolution. ROYAL SOCIETY OPEN SCIENCE 2018; 5:171854. [PMID: 29892378 PMCID: PMC5990795 DOI: 10.1098/rsos.171854] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/11/2017] [Accepted: 04/05/2018] [Indexed: 06/08/2023]
Abstract
Bacterial flagellar motility, an important virulence factor, is energized by a rotary motor localized within the flagellar basal body. The rotor module consists of a large framework (the C-ring), composed of the FliG, FliM and FliN proteins. FliN and FliM contact the FliG torque ring to control the direction of flagellar rotation. We report that structure-based models constrained only by residue coevolution can recover the binding interface of atomic X-ray dimer complexes with remarkable accuracy (approx. 1 Å RMSD). We propose a model for FliM-FliN heterodimerization, which agrees accurately with homologous interfaces as well as in situ cross-linking experiments, and hence supports a proposed architecture for the lower portion of the C-ring. Furthermore, this approach allowed the identification of two discrete and interchangeable homodimerization interfaces between FliM middle domains that agree with experimental measurements and might be associated with C-ring directional switching dynamics triggered upon binding of CheY signal protein. Our findings provide structural details of complex formation at the C-ring that have been difficult to obtain with previous methodologies and clarify the architectural principle that underpins the ultra-sensitive allostery exhibited by this ring assembly that controls the clockwise or counterclockwise rotation of flagella.
Affiliation(s)
- Ricardo Nascimento dos Santos
- Institute of Chemistry and Center for Computational Engineering and Science, University of Campinas, Campinas, SP, Brazil
| | - Shahid Khan
- Molecular Biology Consortium, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
| | - Faruck Morcos
- Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, USA
- Department of Bioengineering, University of Texas at Dallas, Richardson, TX, USA
- Center for Systems Biology, University of Texas at Dallas, Richardson, TX, USA
| |
|
23
|
Gil N, Fiser A. Identifying functionally informative evolutionary sequence profiles. Bioinformatics 2018; 34:1278-1286. [PMID: 29211823 PMCID: PMC5905606 DOI: 10.1093/bioinformatics/btx779] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2017] [Accepted: 11/29/2017] [Indexed: 01/06/2023] Open
Abstract
Motivation Multiple sequence alignments (MSAs) can provide essential input to many bioinformatics applications, including protein structure prediction and functional annotation. However, the optimal selection of sequences to obtain biologically informative MSAs for such purposes is poorly explored, and has traditionally been performed manually. Results We present Selection of Alignment by Maximal Mutual Information (SAMMI), an automated, sequence-based approach to objectively select an optimal MSA from a large set of alternatives sampled from a general sequence database search. The hypothesis of this approach is that the mutual information among MSA columns will be maximal for those MSAs that contain the most diverse set possible of the most structurally and functionally homogeneous protein sequences. SAMMI was tested to select MSAs for functional site residue prediction by analysis of conservation patterns on a set of 435 proteins obtained from protein-ligand (peptides, nucleic acids and small substrates) and protein-protein interaction databases. Availability and implementation: A freely accessible program, including source code, implementing SAMMI is available at https://github.com/nelsongil92/SAMMI.git. Contact andras.fiser@einstein.yu.edu. Supplementary information Supplementary data are available at Bioinformatics online.
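The quantity at the centre of this approach, mutual information between MSA columns, can be computed as in the sketch below. It shows only the plain column-pair definition plus a simple average over all pairs. How SAMMI samples, scores and ranks candidate alignments is not specified in the abstract, so `mean_pairwise_mi` is a hypothetical summary and not the published scoring function.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    pair_counts = Counter((seq[i], seq[j]) for seq in msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


def mean_pairwise_mi(msa):
    """Average MI over all column pairs (hypothetical summary score)."""
    length = len(msa[0])
    values = [column_mutual_information(msa, i, j)
              for i in range(length) for j in range(i + 1, length)]
    return sum(values) / len(values) if values else 0.0


# Toy alignments: perfectly covarying columns versus independent columns.
covarying = ["AC", "AC", "GT", "GT"]
independent = ["AC", "AT", "GC", "GT"]
mi_covarying = column_mutual_information(covarying, 0, 1)
mi_independent = column_mutual_information(independent, 0, 1)
```

The covarying toy alignment gives 1 bit and the independent one gives 0 bits. On real alignments, raw mutual information is inflated by small sample sizes and phylogenetic relatedness, so corrections are commonly applied before values are compared across alignments.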
Affiliation(s)
- Nelson Gil
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
| | - Andras Fiser
- Department of Systems & Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, USA
| |
|
24
|
Michel M, Menéndez Hurtado D, Uziela K, Elofsson A. Large-scale structure prediction by improved contact predictions and model quality assessment. Bioinformatics 2018; 33:i23-i29. [PMID: 28881974 PMCID: PMC5870574 DOI: 10.1093/bioinformatics/btx239] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Motivation Accurate contact predictions can be used for predicting the structure of proteins. Until recently these methods were limited to very big protein families, decreasing their utility. However, recent progress by combining direct coupling analysis with machine learning methods has made it possible to predict accurate contact maps for smaller families. To what extent these predictions can be used to produce accurate models of the families is not known. Results We present the PconsFold2 pipeline that uses contact predictions from PconsC3, the CONFOLD folding algorithm and model quality estimations to predict the structure of a protein. We show that the model quality estimation significantly increases the number of models that reliably can be identified. Finally, we apply PconsFold2 to 6379 Pfam families of unknown structure and find that PconsFold2 can, with an estimated 90% specificity, predict the structure of up to 558 Pfam families of unknown structure. Out of these, 415 have not been reported before. Availability and Implementation Datasets as well as models of all the 558 Pfam families are available at http://c3.pcons.net/. All programs used here are freely available.
Affiliation(s)
- Mirco Michel
- Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
| | - David Menéndez Hurtado
- Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
| | - Karolis Uziela
- Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
| | - Arne Elofsson
- Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, Solna, Sweden
| |
|
25
|
Nicoludis JM, Gaudet R. Applications of sequence coevolution in membrane protein biochemistry. BIOCHIMICA ET BIOPHYSICA ACTA. BIOMEMBRANES 2018; 1860:895-908. [PMID: 28993150 PMCID: PMC5807202 DOI: 10.1016/j.bbamem.2017.10.004] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/19/2017] [Revised: 09/28/2017] [Accepted: 10/02/2017] [Indexed: 12/22/2022]
Abstract
Recently, protein sequence coevolution analysis has matured into a predictive powerhouse for protein structure and function. Direct methods, which use global statistical models of sequence coevolution, have enabled the prediction of membrane and disordered protein structures, protein complex architectures, and the functional effects of mutations in proteins. The field of membrane protein biochemistry and structural biology has embraced these computational techniques, which provide functional and structural information in an otherwise experimentally-challenging field. Here we review recent applications of protein sequence coevolution analysis to membrane protein structure and function and highlight the promising directions and future obstacles in these fields. We provide insights and guidelines for membrane protein biochemists who wish to apply sequence coevolution analysis to a given experimental system.
Affiliation(s)
- John M Nicoludis
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, United States
| | - Rachelle Gaudet
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, 02138, United States.
| |
|
26
|
Schaarschmidt J, Monastyrskyy B, Kryshtafovych A, Bonvin AM. Assessment of contact predictions in CASP12: Co-evolution and deep learning coming of age. Proteins 2018; 86 Suppl 1:51-66. [PMID: 29071738 PMCID: PMC5820169 DOI: 10.1002/prot.25407] [Citation(s) in RCA: 126] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2017] [Revised: 10/06/2017] [Accepted: 10/24/2017] [Indexed: 12/20/2022]
Abstract
Following up on the encouraging results of residue-residue contact prediction in the CASP11 experiment, we present the analysis of predictions submitted for CASP12. The submissions include predictions of 34 groups for 38 domains classified as free modeling targets which are not accessible to homology-based modeling due to a lack of structural templates. CASP11 saw a rise of coevolution-based methods outperforming other approaches. The improvement of these methods, coupled with machine learning and sequence database growth, is most likely the main driver for a significant improvement in average precision from 27% in CASP11 to 47% in CASP12. In more than half of the targets, especially those with many homologous sequences accessible, precisions above 90% were achieved with the best predictors reaching a precision of 100% in some cases. We furthermore tested the impact of using these contacts as restraints in ab initio modeling of 14 single-domain free modeling targets using Rosetta. Adding contacts to the Rosetta calculations resulted in improvements of up to 26% in GDT_TS within the top five structures.
Affiliation(s)
- Joerg Schaarschmidt
- Faculty of Science - Chemistry, Computational Structural Biology Group, Bijvoet Center for Biomolecular Research, Utrecht University, Utrecht, The Netherlands
- Alexandre M.J.J. Bonvin
- Faculty of Science - Chemistry, Computational Structural Biology Group, Bijvoet Center for Biomolecular Research, Utrecht University, Utrecht, The Netherlands
27
Li B, Fooksa M, Heinze S, Meiler J. Finding the needle in the haystack: towards solving the protein-folding problem computationally. Crit Rev Biochem Mol Biol 2018; 53:1-28. [PMID: 28976219 PMCID: PMC6790072 DOI: 10.1080/10409238.2017.1380596] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2017] [Revised: 08/22/2017] [Accepted: 09/13/2017] [Indexed: 12/22/2022]
Abstract
Prediction of protein tertiary structures from amino acid sequence and understanding the mechanisms of how proteins fold, collectively known as "the protein folding problem," has been a grand challenge in molecular biology for over half a century. Theories have been developed that provide us with an unprecedented understanding of protein folding mechanisms. However, computational simulation of protein folding is still difficult, and prediction of protein tertiary structure from amino acid sequence is an unsolved problem. Progress toward a satisfying solution has been slow due to challenges in sampling the vast conformational space and deriving sufficiently accurate energy functions. Nevertheless, several techniques and algorithms have been adopted to overcome these challenges, and the last two decades have seen exciting advances in enhanced sampling algorithms, computational power and tertiary structure prediction methodologies. This review aims at summarizing these computational techniques, specifically conformational sampling algorithms and energy approximations that have been frequently used to study protein-folding mechanisms or to de novo predict protein tertiary structures. We hope that this review can serve as an overview on how the protein-folding problem can be studied computationally and, in cases where experimental approaches are prohibitive, help the researcher choose the most relevant computational approach for the problem at hand. We conclude with a summary of current challenges faced and an outlook on potential future directions.
Affiliation(s)
- Bian Li
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN, USA
- Michaela Fooksa
- Center for Structural Biology, Vanderbilt University, Nashville, TN, USA
- Chemical and Physical Biology Graduate Program, Vanderbilt University, Nashville, TN, USA
- Sten Heinze
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN, USA
- Jens Meiler
- Department of Chemistry, Vanderbilt University, Nashville, TN, USA
- Center for Structural Biology, Vanderbilt University, Nashville, TN, USA
28
Prediction of Structures and Interactions from Genome Information. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2018; 1105:123-152. [DOI: 10.1007/978-981-13-2200-6_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/27/2022]
29
Wozniak PP, Konopka BM, Xu J, Vriend G, Kotulska M. Forecasting residue-residue contact prediction accuracy. Bioinformatics 2017; 33:3405-3414. [PMID: 29036497 DOI: 10.1093/bioinformatics/btx416] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2017] [Accepted: 06/22/2017] [Indexed: 11/14/2022] Open
Abstract
Motivation: Apart from meta-predictors, most of today's methods for residue-residue contact prediction are based entirely on Direct Coupling Analysis (DCA) of correlated mutations in multiple sequence alignments (MSAs). These methods are on average ∼40% correct for the 100 strongest predicted contacts in each protein. The end-user who works on a single protein of interest will not know if predictions are either much more or much less correct than 40%, which is especially a problem if contacts are predicted to steer experimental research on that protein. Results: We designed a regression model that forecasts the accuracy of residue-residue contact prediction for individual proteins with an average error of 7 percentage points. Contacts were predicted with two DCA methods (gplmDCA and PSICOV). The models were built on parameters that describe the MSA, the predicted secondary structure, the predicted solvent accessibility and the contact prediction scores for the target protein. Results show that our models can be also applied to the meta-methods, which was tested on RaptorX. Availability and implementation: All data and scripts are available from http://comprec-lin.iiar.pwr.edu.pl/dcaQ/. Contact: malgorzata.kotulska@pwr.edu.pl. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- P P Wozniak
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
- B M Konopka
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
- J Xu
- Toyota Technological Institute at Chicago, Chicago, IL 60637, USA
- G Vriend
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, GA 6525, Nijmegen, The Netherlands
- M Kotulska
- Department of Biomedical Engineering, Faculty of Fundamental Problems of Technology, Wroclaw University of Science and Technology, Wroclaw, Poland
30
Abstract
Co-evolution techniques were originally conceived to assist in protein structure prediction by inferring pairs of residues that share spatial proximity. However, the functional relationships that can be extrapolated from co-evolution have also proven to be useful in a wide array of structural bioinformatics applications. These techniques are a powerful way to extract structural and functional information in a sequence-rich world.
31
Teixeira PL, Mendenhall JL, Heinze S, Weiner B, Skwark MJ, Meiler J. Membrane protein contact and structure prediction using co-evolution in conjunction with machine learning. PLoS One 2017; 12:e0177866. [PMID: 28542325 PMCID: PMC5443516 DOI: 10.1371/journal.pone.0177866] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2016] [Accepted: 05/04/2017] [Indexed: 11/18/2022] Open
Abstract
De novo membrane protein structure prediction is limited to small proteins due to the conformational search space quickly expanding with length. Long-range contacts (24+ amino acid separation), that is, residue positions distant in sequence but in close proximity in the structure, are arguably the most effective way to restrict this conformational space. Inverse methods for co-evolutionary analysis predict a global set of position-pair couplings that best explain the observed amino acid co-occurrences, thus distinguishing between evolutionarily explained co-variances and those arising from spurious transitive effects. Here, we show that applying machine learning approaches and custom descriptors improves evolutionary contact prediction accuracy, resulting in improvement of average precision by 6 percentage points for the top 1L non-local contacts. Further, we demonstrate that predicted contacts improve protein folding with BCL::Fold. The mean RMSD100 metric for the top 10 models folded was reduced by an average of 2 Å for a benchmark of 25 membrane proteins.
Affiliation(s)
- Pedro L. Teixeira
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America
- Jeff L. Mendenhall
- Department of Chemistry, Center for Structural Biology, Vanderbilt University, Nashville, Tennessee, United States of America
- Sten Heinze
- Department of Chemistry, Center for Structural Biology, Vanderbilt University, Nashville, Tennessee, United States of America
- Brian Weiner
- Department of Chemistry, Center for Structural Biology, Vanderbilt University, Nashville, Tennessee, United States of America
- Marcin J. Skwark
- Department of Chemistry, Center for Structural Biology, Vanderbilt University, Nashville, Tennessee, United States of America
- Jens Meiler
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America
- Department of Chemistry, Center for Structural Biology, Vanderbilt University, Nashville, Tennessee, United States of America
32
Garcia-Garcia J, Valls-Comamala V, Guney E, Andreu D, Muñoz FJ, Fernandez-Fuentes N, Oliva B. iFrag: A Protein–Protein Interface Prediction Server Based on Sequence Fragments. J Mol Biol 2017; 429:382-389. [DOI: 10.1016/j.jmb.2016.11.034] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2016] [Revised: 11/27/2016] [Accepted: 11/30/2016] [Indexed: 01/08/2023]
33
Wozniak PP, Vriend G, Kotulska M. Correlated mutations select misfolded from properly folded proteins. Bioinformatics 2017; 33:1497-1504. [PMID: 28203707 DOI: 10.1093/bioinformatics/btx013] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/12/2016] [Accepted: 01/11/2017] [Indexed: 11/14/2022] Open
Affiliation(s)
- P P Wozniak
- Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wrocław University of Science and Technology, Wrocław, Poland
- G Vriend
- Centre for Molecular and Biomolecular Informatics, Radboud University Medical Centre, Nijmegen, The Netherlands
- M Kotulska
- Faculty of Fundamental Problems of Technology, Department of Biomedical Engineering, Wrocław University of Science and Technology, Wrocław, Poland
34
Rawi R, Mall R, Kunji K, El Anbari M, Aupetit M, Ullah E, Bensmail H. COUSCOus: improved protein contact prediction using an empirical Bayes covariance estimator. BMC Bioinformatics 2016; 17:533. [PMID: 27978812 PMCID: PMC5159955 DOI: 10.1186/s12859-016-1400-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2016] [Accepted: 12/01/2016] [Indexed: 11/13/2022] Open
Abstract
Background: The post-genomic era with its wealth of sequences gave rise to a broad range of protein residue-residue contact detecting methods. Although various coevolution methods such as PSICOV, DCA and plmDCA provide correct contact predictions, they do not completely overlap. Hence, new approaches and improvements of existing methods are needed to motivate further development and progress in the field. We present a new contact detecting method, COUSCOus, by combining the best shrinkage approach, the empirical Bayes covariance estimator and GLasso. Results: Using the original PSICOV benchmark dataset, COUSCOus achieves mean accuracies of 0.74, 0.62 and 0.55 for the top L/10 predicted long, medium and short range contacts, respectively. In addition, COUSCOus attains mean areas under the precision-recall curves of 0.25, 0.29 and 0.30 for long, medium and short contacts and outperforms PSICOV. We also observed that COUSCOus outperforms PSICOV w.r.t. the Matthews correlation coefficient criterion on the full list of residue contacts. Furthermore, COUSCOus achieves on average 10% more gain in prediction accuracy compared to PSICOV on an independent test set composed of CASP11 protein targets. Finally, we showed that when using a simple random forest meta-classifier, by combining contact detecting techniques and sequence derived features, PSICOV predictions should be replaced by the more accurate COUSCOus predictions. Conclusion: We conclude that the consideration of superior covariance shrinkage approaches will boost several research fields that apply the GLasso procedure, among them the residue-residue contact prediction presented here as well as fields such as gene network reconstruction. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1400-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Reda Rawi
- Computational Science and Engineering, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Raghvendra Mall
- Computational Science and Engineering, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Khalid Kunji
- Computational Science and Engineering, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Mohammed El Anbari
- Division of Biomedical Informatics, Sidra Medical and Research Center, Doha, Qatar
- Michael Aupetit
- Computational Science and Engineering, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Ehsan Ullah
- Computational Science and Engineering, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Halima Bensmail
- Computational Science and Engineering, Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
35
Assessing Predicted Contacts for Building Protein Three-Dimensional Models. Methods Mol Biol 2016. [PMID: 27787823 DOI: 10.1007/978-1-4939-6406-2_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]
Abstract
Recent successes of contact-guided protein structure prediction methods have revived interest in solving the long-standing problem of ab initio protein structure prediction. With homology modeling failing for many protein sequences that do not have templates, contact-guided structure prediction has shown promise, and consequently, contact prediction has gained a lot of interest recently. Although a few dozen contact prediction tools are already currently available as web servers and downloadables, not enough research has been done towards using existing measures like precision and recall to evaluate these contacts with the goal of building three-dimensional models. Moreover, when we do not have a native structure for a set of predicted contacts, the only analysis we can perform is a simple contact map visualization of the predicted contacts. A wider and more rigorous assessment of the predicted contacts is needed, in order to build tertiary structure models. This chapter discusses instructions and protocols for using tools and applying techniques in order to assess predicted contacts for building three-dimensional models.
36
Monastyrskyy B, D'Andrea D, Fidelis K, Tramontano A, Kryshtafovych A. New encouraging developments in contact prediction: Assessment of the CASP11 results. Proteins 2016; 84 Suppl 1:131-44. [PMID: 26474083 PMCID: PMC4834069 DOI: 10.1002/prot.24943] [Citation(s) in RCA: 69] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2015] [Revised: 09/15/2015] [Accepted: 10/11/2015] [Indexed: 12/27/2022]
Abstract
This article provides a report on the state-of-the-art in the prediction of intra-molecular residue-residue contacts in proteins based on the assessment of the predictions submitted to the CASP11 experiment. The assessment emphasis is placed on the accuracy in predicting long-range contacts. Twenty-nine groups participated in contact prediction in CASP11. At least eight of them used the recently developed evolutionary coupling techniques, with the top group (CONSIP2) reaching precision of 27% on target proteins that could not be modeled by homology. This result indicates a breakthrough in the development of methods based on the correlated mutation approach. Successful prediction of contacts was shown to be practically helpful in modeling three-dimensional structures; in particular target T0806 was modeled exceedingly well with accuracy not yet seen for ab initio targets of this size (>250 residues). Proteins 2016; 84(Suppl 1):131-144. © 2015 Wiley Periodicals, Inc.
Affiliation(s)
- Daniel D'Andrea
- Department of Physics, Sapienza-University of Rome, Rome, 00185, Italy
- Anna Tramontano
- Department of Physics, Sapienza-University of Rome, Rome, 00185, Italy
- Istituto Pasteur-Fondazione Cenci Bolognetti-University of Rome, Rome, 00185, Italy
37
Simkovic F, Thomas JMH, Keegan RM, Winn MD, Mayans O, Rigden DJ. Residue contacts predicted by evolutionary covariance extend the application of ab initio molecular replacement to larger and more challenging protein folds. IUCRJ 2016; 3:259-70. [PMID: 27437113 PMCID: PMC4937781 DOI: 10.1107/s2052252516008113] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 02/10/2016] [Accepted: 05/18/2016] [Indexed: 05/05/2023]
Abstract
For many protein families, the deluge of new sequence information together with new statistical protocols now allow the accurate prediction of contacting residues from sequence information alone. This offers the possibility of more accurate ab initio (non-homology-based) structure prediction. Such models can be used in structure solution by molecular replacement (MR) where the target fold is novel or is only distantly related to known structures. Here, AMPLE, an MR pipeline that assembles search-model ensembles from ab initio structure predictions ('decoys'), is employed to assess the value of contact-assisted ab initio models to the crystallographer. It is demonstrated that evolutionary covariance-derived residue-residue contact predictions improve the quality of ab initio models and, consequently, the success rate of MR using search models derived from them. For targets containing β-structure, decoy quality and MR performance were further improved by the use of a β-strand contact-filtering protocol. Such contact-guided decoys achieved 14 structure solutions from 21 attempted protein targets, compared with nine for simple Rosetta decoys. Previously encountered limitations were superseded in two key respects. Firstly, much larger targets of up to 221 residues in length were solved, which is far larger than the previously benchmarked threshold of 120 residues. Secondly, contact-guided decoys significantly improved success with β-sheet-rich proteins. Overall, the improved performance of contact-guided decoys suggests that MR is now applicable to a significantly wider range of protein targets than were previously tractable, and points to a direct benefit to structural biology from the recent remarkable advances in sequencing.
Affiliation(s)
- Felix Simkovic
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Jens M. H. Thomas
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Ronan M. Keegan
- Research Complex at Harwell, STFC Rutherford Appleton Laboratory, Didcot OX11 0FA, England
- Martyn D. Winn
- Science and Technology Facilities Council, Daresbury Laboratory, Warrington WA4 4AD, England
- Olga Mayans
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
- Daniel J. Rigden
- Institute of Integrative Biology, University of Liverpool, Liverpool L69 7ZB, England
38
Bhattacharya D, Cao R, Cheng J. UniCon3D: de novo protein structure prediction using united-residue conformational search via stepwise, probabilistic sampling. Bioinformatics 2016; 32:2791-9. [PMID: 27259540 PMCID: PMC5018369 DOI: 10.1093/bioinformatics/btw316] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2016] [Accepted: 05/15/2016] [Indexed: 12/20/2022] Open
Abstract
MOTIVATION Recent experimental studies have suggested that proteins fold via stepwise assembly of structural units named 'foldons' through the process of sequential stabilization. Alongside, latest developments on computational side based on probabilistic modeling have shown promising direction to perform de novo protein conformational sampling from continuous space. However, existing computational approaches for de novo protein structure prediction often randomly sample protein conformational space as opposed to experimentally suggested stepwise sampling. RESULTS Here, we develop a novel generative, probabilistic model that simultaneously captures local structural preferences of backbone and side chain conformational space of polypeptide chains in a united-residue representation and performs experimentally motivated conditional conformational sampling via stepwise synthesis and assembly of foldon units that minimizes a composite physics and knowledge-based energy function for de novo protein structure prediction. The proposed method, UniCon3D, has been found to (i) sample lower energy conformations with higher accuracy than traditional random sampling in a small benchmark of 6 proteins; (ii) perform comparably with the top five automated methods on 30 difficult target domains from the 11th Critical Assessment of Protein Structure Prediction (CASP) experiment and on 15 difficult target domains from the 10th CASP experiment; and (iii) outperform two state-of-the-art approaches and a baseline counterpart of UniCon3D that performs traditional random sampling for protein modeling aided by predicted residue-residue contacts on 45 targets from the 10th edition of CASP. 
AVAILABILITY AND IMPLEMENTATION Source code, executable versions, manuals and example data of UniCon3D for Linux and OSX are freely available to non-commercial users at http://sysbio.rnet.missouri.edu/UniCon3D/ CONTACT: chengji@missouri.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jianlin Cheng
- Department of Computer Science, Informatics Institute, University of Missouri, Columbia, MO 65211, USA
39
Zhang H, Huang Q, Bei Z, Wei Y, Floudas CA. COMSAT: Residue contact prediction of transmembrane proteins based on support vector machines and mixed integer linear programming. Proteins 2016; 84:332-48. [PMID: 26756402 DOI: 10.1002/prot.24979] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2015] [Revised: 11/19/2015] [Accepted: 12/10/2015] [Indexed: 12/28/2022]
Abstract
In this article, we present COMSAT, a hybrid framework for residue contact prediction of transmembrane (TM) proteins, integrating a support vector machine (SVM) method and a mixed integer linear programming (MILP) method. COMSAT consists of two modules: COMSAT_SVM which is trained mainly on position-specific scoring matrix features, and COMSAT_MILP which is an ab initio method based on optimization models. Contacts predicted by the SVM model are ranked by SVM confidence scores, and a threshold is trained to improve the reliability of the predicted contacts. For TM proteins with no contacts above the threshold, COMSAT_MILP is used. The proposed hybrid contact prediction scheme was tested on two independent TM protein sets based on the contact definition of 14 Å between Cα-Cα atoms. First, using a rigorous leave-one-protein-out cross validation on the training set of 90 TM proteins, an accuracy of 66.8%, a coverage of 12.3%, a specificity of 99.3% and a Matthews' correlation coefficient (MCC) of 0.184 were obtained for residue pairs that are at least six amino acids apart. Second, when tested on a test set of 87 TM proteins, the proposed method showed a prediction accuracy of 64.5%, a coverage of 5.3%, a specificity of 99.4% and a MCC of 0.106. COMSAT shows satisfactory results when compared with 12 other state-of-the-art predictors, and is more robust in terms of prediction accuracy as the length and complexity of TM protein increase. COMSAT is freely accessible at http://hpcc.siat.ac.cn/COMSAT/.
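The four figures reported for COMSAT can all be derived from a confusion matrix over residue pairs. The sketch below assumes, as is common in the contact-prediction literature but not stated in the abstract, that "accuracy" means the fraction of predicted contacts that are correct and "coverage" means the fraction of native contacts that are recovered. The function name and dictionary keys are illustrative only.

```python
import math
from typing import Dict


def contact_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Summary statistics for a binary contact prediction.

    tp/fp/tn/fn are counts of residue pairs (true/false positives/negatives).
    """
    def ratio(num: float, den: float) -> float:
        # Return 0.0 instead of raising when a class is empty.
        return num / den if den else 0.0

    # Denominator of the Matthews correlation coefficient (MCC).
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp, tp + fp),      # assumed: correct / predicted contacts
        "coverage": ratio(tp, tp + fn),      # assumed: correct / native contacts
        "specificity": ratio(tn, tn + fp),   # true non-contacts correctly rejected
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }
```

Because non-contacts vastly outnumber contacts, specificity is close to 1 for almost any predictor, which is why accuracy, coverage and MCC are the more informative numbers in results like those above.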
Affiliation(s)
- Huiling Zhang
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Qingsheng Huang
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Zhendong Bei
- Center for Cloud Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Yanjie Wei
- Centre for High Performance Computing, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518055, China
- Christodoulos A Floudas
- Department of Chemical Engineering, Texas A&M University, College Station, Texas, 77843
- Texas A&M Energy Institute, Texas A&M University, College Station, Texas, 77843
40
Braun T, Koehler Leman J, Lange OF. Combining Evolutionary Information and an Iterative Sampling Strategy for Accurate Protein Structure Prediction. PLoS Comput Biol 2015; 11:e1004661. [PMID: 26713437 PMCID: PMC4694711 DOI: 10.1371/journal.pcbi.1004661] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2015] [Accepted: 11/17/2015] [Indexed: 12/18/2022] Open
Abstract
Recent work has shown that the accuracy of ab initio structure prediction can be significantly improved by integrating evolutionary information in form of intra-protein residue-residue contacts. Following this seminal result, much effort is put into the improvement of contact predictions. However, there is also a substantial need to develop structure prediction protocols tailored to the type of restraints gained by contact predictions. Here, we present a structure prediction protocol that combines evolutionary information with the resolution-adapted structural recombination approach of Rosetta, called RASREC. Compared to the classic Rosetta ab initio protocol, RASREC achieves improved sampling, better convergence and higher robustness against incorrect distance restraints, making it the ideal sampling strategy for the stated problem. To demonstrate the accuracy of our protocol, we tested the approach on a diverse set of 28 globular proteins. Our method is able to converge for 26 out of the 28 targets and improves the average TM-score of the entire benchmark set from 0.55 to 0.72 when compared to the top ranked models obtained by the EVFold web server using identical contact predictions. Using a smaller benchmark, we furthermore show that the prediction accuracy of our method is only slightly reduced when the contact prediction accuracy is comparatively low. This observation is of special interest for protein sequences that only have a limited number of homologs.
Author Summary
Recently, a breakthrough has been achieved in modeling the atomic 3D structures of proteins from their sequence alone without requiring any experimental work on the protein itself. To achieve this goal, a database of evolutionary related sequences is analyzed to find co-evolving residues, giving insight into which residues are in close proximity to each other. These residue-residue contacts can help to drive a computer simulation with an atomic-scale physical model of the protein structure from a random starting conformation to a native-like 3D conformation. Although much effort is being put into the improvement of residue-residue contact predictions, their accuracy will always be limited. Therefore, structure prediction protocols with a high tolerance against incorrect distance restraints are needed. Here, we present a structure prediction protocol that combines evolutionary information with the iterative sampling approach of the molecular modeling suite Rosetta, called RASREC. RASREC has been shown to converge faster to near-native models and to be more robust against incorrect distance restraints than standard prediction protocols. It is therefore perfectly suited for restraints obtained from predicted residue-residue contacts with limited accuracy. We show that our protocol outperforms other currently published structure prediction methods and is able to achieve accurate structures, even if the accuracy of predicted contacts is low.
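The TM-score used to compare the benchmark results is a length-normalised similarity measure between a model and a reference structure. A minimal sketch follows, assuming the widely used definition with the scale d0 = 1.24 (L - 15)^(1/3) - 1.8 Å. It scores one given superposition only, whereas the full TM-score is the maximum over all superpositions, so the value returned here is a lower bound. The function names and the 0.5 Å floor on d0 for short chains are assumptions of this example.

```python
from typing import Sequence


def tm_d0(target_len: int) -> float:
    """Length-dependent distance scale d0 in angstroms."""
    if target_len <= 15:
        return 0.5  # assumed floor; the cube-root formula is not usable here
    return max(0.5, 1.24 * (target_len - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_fixed_superposition(distances: Sequence[float], target_len: int) -> float:
    """TM-score contribution of aligned residue pairs for one superposition.

    distances: C-alpha distances (angstroms) between aligned model and
    reference residues after the two structures have been superposed.
    target_len: number of residues in the reference structure.
    """
    d0 = tm_d0(target_len)
    # Each aligned pair contributes between 0 and 1; unaligned residues add 0.
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_len
```

A perfect model scores 1.0, and the normalisation by the reference length means that leaving residues unmodelled lowers the score.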
Affiliation(s)
- Tatjana Braun
- Biomolecular NMR and Munich Center for Integrated Protein Science, Department Chemie, Technische Universität München, Garching, Germany
- Julia Koehler Leman
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, Maryland, United States of America
- Oliver F. Lange
- Biomolecular NMR and Munich Center for Integrated Protein Science, Department Chemie, Technische Universität München, Garching, Germany
41
Ma J, Wang S, Wang Z, Xu J. Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning. Bioinformatics 2015; 31:3506-13. [PMID: 26275894 DOI: 10.1093/bioinformatics/btv472] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2014] [Accepted: 08/08/2015] [Indexed: 02/07/2023] Open
Abstract
MOTIVATION Protein contact prediction is important for protein structure and functional study. Both evolutionary coupling (EC) analysis and supervised machine learning methods have been developed, making use of different information sources. However, contact prediction is still challenging especially for proteins without a large number of sequence homologs. RESULTS This article presents a group graphical lasso (GGL) method for contact prediction that integrates joint multi-family EC analysis and supervised learning to improve accuracy on proteins without many sequence homologs. Different from existing single-family EC analysis that uses residue coevolution information in only the target protein family, our joint EC analysis uses residue coevolution in both the target family and its related families, which may have divergent sequences but similar folds. To implement this, we model a set of related protein families using Gaussian graphical models and then coestimate their parameters by maximum-likelihood, subject to the constraint that these parameters shall be similar to some degree. Our GGL method can also integrate supervised learning methods to further improve accuracy. Experiments show that our method outperforms existing methods on proteins without thousands of sequence homologs, and that our method performs better on both conserved and family-specific contacts. AVAILABILITY AND IMPLEMENTATION See http://raptorx.uchicago.edu/ContactMap/ for a web server implementing the method. CONTACT j3xu@ttic.edu SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jianzhu Ma: Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, Illinois 60637, USA
- Sheng Wang: Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, Illinois 60637, USA
- Zhiyong Wang: Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, Illinois 60637, USA
- Jinbo Xu: Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave., Chicago, Illinois 60637, USA
|
42
|
Pietal MJ, Bujnicki JM, Kozlowski LP. GDFuzz3D: a method for protein 3D structure reconstruction from contact maps, based on a non-Euclidean distance function. Bioinformatics 2015; 31:3499-505. [PMID: 26130575 DOI: 10.1093/bioinformatics/btv390] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 10/12/2014] [Accepted: 06/23/2015] [Indexed: 11/13/2022]
Abstract
MOTIVATION To date, only a few distinct successful approaches have been introduced to reconstruct a protein 3D structure from a map of contacts between its amino acid residues (a 2D contact map). Current algorithms can infer structures from information-rich contact maps that contain a limited fraction of erroneous predictions. However, it is difficult to reconstruct 3D structures from predicted contact maps that usually contain a high fraction of false contacts. RESULTS We describe a new, multi-step protocol that predicts protein 3D structures from the predicted contact maps. The method is based on a novel distance function acting on a fuzzy residue proximity graph, which predicts a 2D distance map from a 2D predicted contact map. The application of a Multi-Dimensional Scaling algorithm transforms that predicted 2D distance map into a coarse 3D model, which is further refined by typical modeling programs into an all-atom representation. We tested our approach on contact maps predicted de novo by MULTICOM, the top contact map predictor according to CASP10. We show that our method outperforms FT-COMAR, the state-of-the-art method for 3D structure reconstruction from 2D maps. For all predicted 2D contact maps of relatively low sensitivity (60-84%), GDFuzz3D generates more accurate 3D models, with the average improvement of 4.87 Å in terms of RMSD. AVAILABILITY AND IMPLEMENTATION GDFuzz3D server and standalone version are freely available at http://iimcb.genesilico.pl/gdserver/GDFuzz3D/. CONTACT iamb@genesilico.pl SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
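As a rough illustration of deriving a distance map from a contact map, the sketch below computes hop counts along the residue contact graph. This shortest-path distance is a stand-in chosen for simplicity and is not the GDFuzz3D distance function; the helper names `symmetric_contacts` and `graph_distance_map` are hypothetical, and treating sequence neighbours as contacts is an assumption.

```python
from math import inf


def symmetric_contacts(n, pairs):
    """Build an n x n boolean contact map from (i, j) residue pairs.

    Sequence neighbours (i, i+1) are always marked as contacts, an
    assumption that keeps the chain connected.
    """
    contacts = [[False] * n for _ in range(n)]
    for i in range(n - 1):
        contacts[i][i + 1] = contacts[i + 1][i] = True
    for i, j in pairs:
        contacts[i][j] = contacts[j][i] = True
    return contacts


def graph_distance_map(contacts):
    """All-pairs shortest-path hop counts on the contact graph (Floyd-Warshall)."""
    n = len(contacts)
    dist = [
        [0 if i == j else (1 if contacts[i][j] else inf) for j in range(n)]
        for i in range(n)
    ]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = dist[i][k] + dist[k][j]
                if via_k < dist[i][j]:
                    dist[i][j] = via_k
    return dist


if __name__ == "__main__":
    # Six-residue chain with one predicted long-range contact between 0 and 5.
    cmap = symmetric_contacts(6, [(0, 5)])
    for row in graph_distance_map(cmap):
        print(row)
```

A multi-dimensional scaling step, as described in the abstract, would then embed such a distance map in three dimensions.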
Affiliation(s)
- Michal J Pietal: Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland; Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, Poland
- Janusz M Bujnicki: Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland; Bioinformatics Laboratory, Institute of Molecular Biology and Biotechnology, Adam Mickiewicz University, Poznan, Poland
- Lukasz P Kozlowski: Laboratory of Bioinformatics and Protein Engineering, International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland
|
43
|
Mao W, Kaya C, Dutta A, Horovitz A, Bahar I. Comparative study of the effectiveness and limitations of current methods for detecting sequence coevolution. Bioinformatics 2015; 31:1929-37. [PMID: 25697822 PMCID: PMC4481699 DOI: 10.1093/bioinformatics/btv103] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.8] [Received: 08/24/2014] [Accepted: 02/02/2015] [Indexed: 01/02/2023]
Abstract
Motivation: With rapid accumulation of sequence data on several species, extracting rational and systematic information from multiple sequence alignments (MSAs) is becoming increasingly important. Currently, there is a plethora of computational methods for investigating coupled evolutionary changes in pairs of positions along the amino acid sequence, and making inferences on structure and function. Yet, the significance of coevolution signals remains to be established. Also, a large number of false positives (FPs) arise from insufficient MSA size, phylogenetic background and indirect couplings. Results: Here, a set of 16 pairs of non-interacting proteins is thoroughly examined to assess the effectiveness and limitations of different methods. The analysis shows that recent computationally expensive methods designed to remove biases from indirect couplings outperform others in detecting tertiary structural contacts as well as eliminating intermolecular FPs; whereas traditional methods such as mutual information benefit from refinements such as shuffling, while being highly efficient. Computations repeated with 2,330 pairs of protein families from the Negatome database corroborated these results. Finally, using a training dataset of 162 families of proteins, we propose a combined method that outperforms existing individual methods. Overall, the study provides simple guidelines towards the choice of suitable methods and strategies based on available MSA size and computing resources. Availability and implementation: Software is freely available through the Evol component of ProDy API. Contact:bahar@pitt.edu Supplementary information:Supplementary data are available at Bioinformatics online.
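The mutual information measure named in the abstract can be sketched for a pair of alignment columns as follows. This is a minimal plain-Python illustration, not the ProDy Evol implementation; the function names `column` and `mutual_information` and the toy alignment are hypothetical, and no shuffling or other refinement is applied.

```python
from collections import Counter
from math import log2


def column(msa, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in msa]


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    MI = sum over (a, b) of p(a, b) * log2(p(a, b) / (p(a) * p(b))),
    with probabilities estimated from raw counts.
    """
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        # p(a,b) / (p(a) p(b)) simplifies to c * n / (count_a * count_b).
        mi += (c / n) * log2(c * n / (count_a[a] * count_b[b]))
    return mi


if __name__ == "__main__":
    # Toy alignment: columns 0 and 1 covary perfectly, column 2 is independent.
    msa = ["AKL", "AKV", "GEL", "GEV"]
    print(mutual_information(column(msa, 0), column(msa, 1)))
    print(mutual_information(column(msa, 0), column(msa, 2)))
```

With only four sequences the estimate is purely illustrative; the abstract notes that small alignments are one source of false positives.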
Affiliation(s)
- Wenzhi Mao: Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA, Department of Pharmacology, School of Medicine, Tsinghua University, Beijing 100084, China and Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
- Cihan Kaya: Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA, Department of Pharmacology, School of Medicine, Tsinghua University, Beijing 100084, China and Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
- Anindita Dutta: Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA, Department of Pharmacology, School of Medicine, Tsinghua University, Beijing 100084, China and Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
- Amnon Horovitz: Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA, Department of Pharmacology, School of Medicine, Tsinghua University, Beijing 100084, China and Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
- Ivet Bahar: Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15260, USA, Department of Pharmacology, School of Medicine, Tsinghua University, Beijing 100084, China and Department of Structural Biology, Weizmann Institute of Science, Rehovot 76100, Israel
|
44
|
Li G, Theys K, Verheyen J, Pineda-Peña AC, Khouri R, Piampongsant S, Eusébio M, Ramon J, Vandamme AM. A new ensemble coevolution system for detecting HIV-1 protein coevolution. Biol Direct 2015; 10:1. [PMID: 25564011 PMCID: PMC4332441 DOI: 10.1186/s13062-014-0031-8] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 11/28/2014] [Accepted: 12/02/2014] [Indexed: 12/31/2022]
Abstract
BACKGROUND A key challenge in the field of HIV-1 protein evolution is the identification of coevolving amino acids at the molecular level. In the past decades, many sequence-based methods have been designed to detect position-specific coevolution within and between different proteins. However, an ensemble coevolution system that integrates different methods to improve the detection of HIV-1 protein coevolution has not been developed. RESULTS We integrated 27 sequence-based prediction methods published between 2004 and 2013 into an ensemble coevolution system. This system allowed combinations of different sequence-based methods for coevolution predictions. Using HIV-1 protein structures and experimental data, we evaluated the performance of individual and combined sequence-based methods in the prediction of HIV-1 intra- and inter-protein coevolution. We showed that sequence-based methods clustered according to their methodology, and a combination of four methods outperformed any of the 27 individual methods. This four-method combination estimated that HIV-1 intra-protein coevolving positions were mainly located in functional domains and physically contacted with each other in the protein tertiary structures. In the analysis of HIV-1 inter-protein coevolving positions between Gag and protease, protease drug resistance positions near the active site mostly coevolved with Gag cleavage positions (V128, S373-T375, A431, F448-P453) and Gag C-terminal positions (S489-Q500) under selective pressure of protease inhibitors. CONCLUSIONS This study presents a new ensemble coevolution system which detects position-specific coevolution using combinations of 27 different sequence-based methods. Our findings highlight key coevolving residues within HIV-1 structural proteins and between Gag and protease, shedding light on HIV-1 intra- and inter-protein coevolution.
Affiliation(s)
- Guangdi Li: KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium
- Kristof Theys: KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium
- Jens Verheyen: Institute of Virology, University hospital, University Duisburg-Essen, Essen, Germany
- Andrea-Clemencia Pineda-Peña: KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium; Clinical and Molecular Infectious Disease Group, Faculty of Sciences and Mathematics, Universidad del Rosario, Bogotá, Colombia
- Ricardo Khouri: KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium
- Supinya Piampongsant: KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium
- Mónica Eusébio: Centro de Malária e Outras Doenças Tropicais and Unidade de Microbiologia, Instituto de Higiene e Medicina Tropical, Universidade Nova de Lisboa, Lisboa, Portugal
- Jan Ramon: Department of Computer Science, KU Leuven - University of Leuven, Leuven, Belgium
- Anne-Mieke Vandamme: KU Leuven - University of Leuven, Department of Microbiology and Immunology, Rega Institute for Medical Research, Clinical and Epidemiological Virology, Leuven, Belgium; Centro de Malária e Outras Doenças Tropicais and Unidade de Microbiologia, Instituto de Higiene e Medicina Tropical, Universidade Nova de Lisboa, Lisboa, Portugal
|
45
|
Abstract
Recent advances in identifying residue-residue contacts from large multiple sequence alignments have enabled impressive gains to be made in the field of protein structure prediction. In this chapter, we discuss these advances and provide a step-by-step guide to applying the latest tools to the de novo modelling of alpha-helical transmembrane proteins. As a practical example, we demonstrate the process of building an accurate 3D model of a G protein-coupled receptor, correctly orientated in the membrane, using only its primary protein sequence.
Affiliation(s)
- Timothy Nugent: Bioinformatics Group, Department of Computer Science, University College London, Office: 8.11, Desk: 206, Gower Street, London, WC1E 6BT, UK
|
46
|
Jones DT, Singh T, Kosciolek T, Tetchner S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 2014; 31:999-1006. [PMID: 25431331 PMCID: PMC4382908 DOI: 10.1093/bioinformatics/btu791] [Citation(s) in RCA: 237] [Impact Index Per Article: 21.5] [Received: 09/25/2014] [Accepted: 11/22/2014] [Indexed: 12/13/2022]
Abstract
Motivation: Recent developments of statistical techniques to infer direct evolutionary couplings between residue pairs have rendered covariation-based contact prediction a viable means for accurate 3D modelling of proteins, with no information other than the sequence required. To extend the usefulness of contact prediction, we have designed a new meta-predictor (MetaPSICOV) which combines three distinct approaches for inferring covariation signals from multiple sequence alignments, considers a broad range of other sequence-derived features and, uniquely, a range of metrics which describe both the local and global quality of the input multiple sequence alignment. Finally, we use a two-stage predictor, where the second stage filters the output of the first stage. This two-stage predictor is additionally evaluated on its ability to accurately predict the long range network of hydrogen bonds, including correctly assigning the donor and acceptor residues. Results: Using the original PSICOV benchmark set of 150 protein families, MetaPSICOV achieves a mean precision of 0.54 for top-L predicted long range contacts—around 60% higher than PSICOV, and around 40% better than CCMpred. In de novo protein structure prediction using FRAGFOLD, MetaPSICOV is able to improve the TM-scores of models by a median of 0.05 compared with PSICOV. Lastly, for predicting long range hydrogen bonding, MetaPSICOV-HB achieves a precision of 0.69 for the top-L/10 hydrogen bonds compared with just 0.26 for the baseline MetaPSICOV. Availability and implementation: MetaPSICOV is available as a freely available web server at http://bioinf.cs.ucl.ac.uk/MetaPSICOV. Raw data (predicted contact lists and 3D models) and source code can be downloaded from http://bioinf.cs.ucl.ac.uk/downloads/MetaPSICOV. Contact:d.t.jones@ucl.ac.uk Supplementary information:Supplementary data are available at Bioinformatics online.
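The "precision for top-L predicted long range contacts" metric quoted above can be computed along the following lines. This is a minimal sketch, not the MetaPSICOV evaluation code: the function name `top_l_precision` is hypothetical, and the default minimum sequence separation of 24 residues for "long range" is an assumption based on a common convention rather than a value given in the abstract.

```python
def top_l_precision(predictions, true_contacts, length, min_sep=24, divisor=1):
    """Precision of the top L/divisor long-range contact predictions.

    predictions   : iterable of (i, j, score) residue-pair predictions
    true_contacts : set of (i, j) pairs with i < j observed in the structure
    length        : sequence length L
    min_sep       : minimum |i - j| for a pair to count as long range (assumed)
    divisor       : 1 for top-L, 10 for top-L/10, and so on
    """
    long_range = [p for p in predictions if abs(p[0] - p[1]) >= min_sep]
    # Highest-scoring predictions first.
    long_range.sort(key=lambda p: p[2], reverse=True)
    top = long_range[: max(1, length // divisor)]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if (min(i, j), max(i, j)) in true_contacts)
    return hits / len(top)


if __name__ == "__main__":
    # Toy case with L = 4 and a reduced separation threshold of 2.
    preds = [(0, 3, 0.9), (0, 2, 0.8), (1, 3, 0.7), (0, 1, 0.99)]
    truth = {(0, 3), (1, 3)}
    print(top_l_precision(preds, truth, length=4, min_sep=2))
    print(top_l_precision(preds, truth, length=4, min_sep=2, divisor=2))
```

When fewer than L/divisor long-range predictions exist, the sketch divides by the number actually available, which is one of several possible conventions.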
Affiliation(s)
- David T Jones: Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
- Tanya Singh: Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
- Tomasz Kosciolek: Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
- Stuart Tetchner: Bioinformatics Group, Department of Computer Science, University College London, London WC1E 6BT, UK
|
47
|
Improved contact predictions using the recognition of protein like contact patterns. PLoS Comput Biol 2014; 10:e1003889. [PMID: 25375897 PMCID: PMC4222596 DOI: 10.1371/journal.pcbi.1003889] [Citation(s) in RCA: 132] [Impact Index Per Article: 12.0] [Received: 05/01/2014] [Accepted: 09/03/2014] [Indexed: 11/23/2022]
Abstract
Given sufficient large protein families, and using a global statistical inference approach, it is possible to obtain sufficient accuracy in protein residue contact predictions to predict the structure of many proteins. However, these approaches do not consider the fact that the contacts in a protein are neither randomly, nor independently distributed, but actually follow precise rules governed by the structure of the protein and thus are interdependent. Here, we present PconsC2, a novel method that uses a deep learning approach to identify protein-like contact patterns to improve contact predictions. A substantial enhancement can be seen for all contacts independently on the number of aligned sequences, residue separation or secondary structure type, but is largest for β-sheet containing proteins. In addition to being superior to earlier methods based on statistical inferences, in comparison to state of the art methods using machine learning, PconsC2 is superior for families with more than 100 effective sequence homologs. The improved contact prediction enables improved structure prediction.
Here, we introduce a novel protein contact prediction method PconsC2 that, to the best of our knowledge, outperforms earlier methods. PconsC2 is based on our earlier method, PconsC, as it utilizes the same set of contact predictions from plmDCA and PSICOV. However, in contrast to PconsC, where each residue pair is analysed independently, the initial predictions are analysed in context of neighbouring residue pairs using a deep learning approach, inspired by earlier work. We find that for each layer the deep learning procedure improves the predictions. At the end, after five layers of deep learning and inclusion of a few extra features provides the best performance. An improvement can be seen for all types of proteins, independent on length, number of homologous sequences and structural class. However, the improvement is largest for β-sheet containing proteins. Most importantly the improvement brings for the first time sufficiently accurate predictions to some protein families with less than 1000 homologous sequences. PconsC2 outperforms as well state of the art machine learning based predictors for protein families larger than 100 effective sequences. PconsC2 is licensed under the GNU General Public License v3 and freely available from http://c2.pcons.net/.
|
48
|
Wozniak PP, Kotulska M. Characteristics of protein residue-residue contacts and their application in contact prediction. J Mol Model 2014; 20:2497. [PMID: 25374390 PMCID: PMC4221654 DOI: 10.1007/s00894-014-2497-9] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/27/2014] [Accepted: 10/09/2014] [Indexed: 11/28/2022]
Abstract
Contact sites between amino acids characterize important structural features of a protein. We investigated characteristics of contact sites in a representative set of proteins and their relations between protein class or topology. For this purpose, we used a non-redundant set of 5872 protein domains, identically categorized by CATH and SCOP databases. The proteins represented alpha, beta, and alpha+beta classes. Contact maps of protein structures were obtained for a selected set of physical distances in the main backbone and separations in protein sequences. For each set a dependency between contact degree and distance parameters was quantified. We indicated residues forming contact sites most frequently and unique amino acid pairs which created contact sites most often within each structural class. Contact characteristics of specific topologies were compared to the characteristics of their protein classes showing protein groups with a distinguished contact characteristic. We showed that our results could be used to improve the performance of recent top contact predictor — direct coupling analysis. Our work provides values of contact site propensities that can be involved in bioinformatic databases.
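A contact map defined by a physical distance cutoff and a minimum sequence separation, together with the per-residue contact degree, can be sketched as below. This is an illustrative sketch only: the function names, the 8 Å default cutoff, and the use of one coordinate per residue (for example the C-alpha atom) are assumptions, not parameters taken from the study.

```python
from math import dist


def contact_map(coords, cutoff=8.0, min_sep=1):
    """Boolean contact map from one 3D coordinate per residue.

    Residues i and j are in contact when their distance is at most
    `cutoff` (in angstroms) and |i - j| >= min_sep (min_sep must be >= 1).
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_sep, n):
            if dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap


def contact_degree(cmap):
    """Number of contacts formed by each residue."""
    return [sum(row) for row in cmap]


if __name__ == "__main__":
    # Four residues on a straight line, 3.8 angstroms apart.
    coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
    print(contact_degree(contact_map(coords)))
    print(contact_degree(contact_map(coords, min_sep=2)))
```

Varying `cutoff` and `min_sep` over a grid gives the kind of dependency between contact degree and distance parameters that the abstract describes.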
Affiliation(s)
- Pawel P Wozniak: Institute of Biomedical Engineering and Instrumentation, Wroclaw University of Technology, Wybrzeże Wyspiańskiego 27, 50-370, Wroclaw, Poland
|
49
|
Feinauer C, Skwark MJ, Pagnani A, Aurell E. Improving contact prediction along three dimensions. PLoS Comput Biol 2014; 10:e1003847. [PMID: 25299132 PMCID: PMC4191875 DOI: 10.1371/journal.pcbi.1003847] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Received: 02/28/2014] [Accepted: 08/07/2014] [Indexed: 11/18/2022]
Abstract
Correlation patterns in multiple sequence alignments of homologous proteins can be exploited to infer information on the three-dimensional structure of their members. The typical pipeline to address this task, which we in this paper refer to as the three dimensions of contact prediction, is to (i) filter and align the raw sequence data representing the evolutionarily related proteins; (ii) choose a predictive model to describe a sequence alignment; (iii) infer the model parameters and interpret them in terms of structural properties, such as an accurate contact map. We show here that all three dimensions are important for overall prediction success. In particular, we show that it is possible to improve significantly along the second dimension by going beyond the pair-wise Potts models from statistical physics, which have hitherto been the focus of the field. These (simple) extensions are motivated by multiple sequence alignments often containing long stretches of gaps which, as a data feature, would be rather untypical for independent samples drawn from a Potts model. Using a large test set of proteins we show that the combined improvements along the three dimensions are as large as any reported to date.
Proteins are large molecules that living cells make by stringing together building blocks called amino acids or peptides, following their blue-prints in the DNA. Freshly made proteins are typically long, structure-less chains of peptides, but shortly afterwards most of them fold into characteristic structures. Proteins execute many functions in the cell, for which they need to have the right structure, which is therefore very important in determining what the proteins can do. The structure of a protein can be determined by X-ray diffraction and other experimental approaches which are all, to this day, somewhat labor-intensive and difficult. On the other hand, the order of the peptides in a protein can be read off from the DNA blue-print, and such protein sequences are today routinely produced in large numbers. In this paper we show that many similar protein sequences can be used to find information about the structure. The basic approach is to construct a probabilistic model for sequence variability, and then to use the parameters of that model to predict structure in three-dimensional space. The main technical novelty compared to previous contributions in the same general direction is that we use models more directly matched to the data.
Affiliation(s)
- Christoph Feinauer: DISAT and Center for Computational Sciences, Politecnico Torino, Torino, Italy
- Marcin J. Skwark: Department of Information and Computer Science, Aalto University, Aalto, Finland; Aalto Science Institute (AScI), Aalto University, Aalto, Finland
- Andrea Pagnani: DISAT and Center for Computational Sciences, Politecnico Torino, Torino, Italy; Human Genetics Foundation-Torino, Molecular Biotechnology Center, Torino, Italy
- Erik Aurell: Department of Information and Computer Science, Aalto University, Aalto, Finland; Aalto Science Institute (AScI), Aalto University, Aalto, Finland; Department of Computational Biology, Royal Institute of Technology, AlbaNova University Centre, Stockholm, Sweden
|
50
|
Michel M, Hayat S, Skwark MJ, Sander C, Marks DS, Elofsson A. PconsFold: improved contact predictions improve protein models. Bioinformatics 2014; 30:i482-8. [PMID: 25161237 PMCID: PMC4147911 DOI: 10.1093/bioinformatics/btu458] [Citation(s) in RCA: 85] [Impact Index Per Article: 7.7] [Indexed: 01/25/2023]
Abstract
MOTIVATION Recently it has been shown that the quality of protein contact prediction from evolutionary information can be improved significantly if direct and indirect information is separated. Given sufficiently large protein families, the contact predictions contain sufficient information to predict the structure of many protein families. However, since the first studies contact prediction methods have improved. Here, we ask how much the final models are improved if improved contact predictions are used. RESULTS In a small benchmark of 15 proteins, we show that the TM-scores of top-ranked models are improved by on average 33% using PconsFold compared with the original version of EVfold. In a larger benchmark, we find that the quality is improved with 15-30% when using PconsC in comparison with earlier contact prediction methods. Further, using Rosetta instead of CNS does not significantly improve global model accuracy, but the chemistry of models generated with Rosetta is improved. AVAILABILITY PconsFold is a fully automated pipeline for ab initio protein structure prediction based on evolutionary information. PconsFold is based on PconsC contact prediction and uses the Rosetta folding protocol. Due to its modularity, the contact prediction tool can be easily exchanged. The source code of PconsFold is available on GitHub at https://www.github.com/ElofssonLab/pcons-fold under the MIT license. PconsC is available from http://c.pcons.net/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mirco Michel: Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Sikander Hayat: Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Marcin J Skwark: Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Chris Sander: Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Debora S Marks: Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
- Arne Elofsson: Department of Biochemistry and Biophysics, Stockholm University, 10691 Stockholm, Sweden, Science for Life Laboratory, Stockholm University, Box 1031, 17121 Solna, Sweden, Department of Systems Biology, Harvard Medical School, Boston, MA, USA, Department of Information and Computer Science, Aalto University, PO Box 15400, FI-00076 Aalto, Finland and Computational Biology, Memorial Sloan-Kettering Cancer Center, New York, NY, USA
|