1. Fiorote J, Alves J, Stock L, Treptow W. Investigating Statistical Conditions of Coevolutionary Signals that Enable Algorithmic Predictions of Protein Partners. J Chem Inf Model 2025; 65:4107-4115. [PMID: 40232741 PMCID: PMC12042258 DOI: 10.1021/acs.jcim.5c00052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2025] [Revised: 04/01/2025] [Accepted: 04/02/2025] [Indexed: 04/16/2025]
Abstract
This study examines the statistical conditions of coevolutionary signals that allow algorithmic predictions of protein partners based on amino acid sequences rather than 3D structures. It introduces a Markov stochastic model that predicts the number of correct protein partners based on coevolutionary information. The model defines state probabilities using a Poisson mixture of normal distributions, with key parameters including the total number of protein sequences M, the coevolutionary information gap α, and variance σ₀². The model suggests that algorithmic approaches that maximize coevolutionary information cannot effectively resolve partners in protein families with a large number of sequences (M ≥ 100). The model shows that true-positive (TP) rates can be enhanced by disregarding mismatches among similar sequences. This approach allows a distinction, in terms of {α, σ₀²}, between optimized solutions with trivial errors and other degenerate solutions. Our findings enable the a priori classification of protein families where partners can be reliably predicted by ignoring trivial errors between similar sequences, advancing the understanding of coevolutionary models for large protein data sets.
Affiliation(s)
- José Fiorote
  - Laboratório de Biologia Teórica e Computacional (LBTC), Universidade de Brasília, Brasilia, DF 70910-900, Brasil
- João Alves
  - Laboratório de Biologia Teórica e Computacional (LBTC), Universidade de Brasília, Brasilia, DF 70910-900, Brasil
- Letícia Stock
  - Ben May Department for Cancer Research, University of Chicago, Chicago, Illinois 60637, United States
- Werner Treptow
  - Laboratório de Biologia Teórica e Computacional (LBTC), Universidade de Brasília, Brasilia, DF 70910-900, Brasil
2. Lakshman AH, Wright ES. EvoWeaver: large-scale prediction of gene functional associations from coevolutionary signals. Nat Commun 2025; 16:3878. [PMID: 40274827 PMCID: PMC12022180 DOI: 10.1038/s41467-025-59175-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2025] [Accepted: 04/09/2025] [Indexed: 04/26/2025]
Abstract
The known universe of uncharacterized proteins is expanding far faster than our ability to annotate their functions through laboratory study. Computational annotation approaches rely on similarity to previously studied proteins, thereby ignoring unstudied proteins. Coevolutionary approaches hold promise for injecting new information into our knowledge of the protein universe by linking proteins through 'guilt-by-association'. However, existing coevolutionary algorithms have insufficient accuracy and scalability to connect the entire universe of proteins. We present EvoWeaver, a method that weaves together 12 signals of coevolution to quantify the degree of shared evolution between genes. EvoWeaver accurately identifies proteins involved in protein complexes or separate steps of a biochemical pathway. We show the merits of EvoWeaver by partly reconstructing known biochemical pathways without any prior knowledge other than that available from genomic sequences. Applying EvoWeaver to 1545 gene groups from 8564 genomes reveals missing connections in popular databases and potentially undiscovered links between proteins.
Affiliation(s)
- Aidan H Lakshman
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Erik S Wright
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
  - Center for Evolutionary Biology and Medicine, Pittsburgh, PA, USA
3. Lupo U, Sgarbossa D, Milighetti M, Bitbol AF. DiffPaSS-high-performance differentiable pairing of protein sequences using soft scores. Bioinformatics 2024; 41:btae738. [PMID: 39672677 PMCID: PMC11676329 DOI: 10.1093/bioinformatics/btae738] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2024] [Revised: 12/05/2024] [Accepted: 12/11/2024] [Indexed: 12/15/2024]
Abstract
MOTIVATION: Identifying interacting partners from two sets of protein sequences has important applications in computational biology. Interacting partners share similarities across species due to their common evolutionary history, and feature correlations in amino acid usage due to the need to maintain complementary interaction interfaces. Thus, the problem of finding interacting pairs can be formulated as searching for a pairing of sequences that maximizes a sequence similarity or a coevolution score. Several methods have been developed to address this problem, applying different approximate optimization methods to different scores.
RESULTS: We introduce Differentiable Pairing using Soft Scores (DiffPaSS), a differentiable framework for flexible, fast, and hyperparameter-free optimization for pairing interacting biological sequences, which can be applied to a wide variety of scores. We apply it to a benchmark prokaryotic dataset, using mutual information and neighbor graph alignment scores. DiffPaSS outperforms existing algorithms for optimizing the same scores. We demonstrate the usefulness of our paired alignments for the prediction of protein complex structure. DiffPaSS does not require sequences to be aligned, and we also apply it to nonaligned sequences from T-cell receptors.
AVAILABILITY AND IMPLEMENTATION: A PyTorch implementation and installable Python package are available at https://github.com/Bitbol-Lab/DiffPaSS.
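The formulation in this abstract, searching for the pairing of sequences that maximizes a score, can be illustrated with a toy example. The sketch below is not DiffPaSS and does not use its API: it assumes a hypothetical, additive score matrix `toy_score` and simply enumerates every one-to-one pairing. The factorial cost of this enumeration suggests why approximate optimization methods are used in practice.

```python
from itertools import permutations


def best_pairing(score):
    """Exhaustively find the one-to-one pairing with the highest total score.

    score[i][j] is a hypothetical score for pairing sequence i of family A
    with sequence j of family B (an assumption made for illustration only).
    """
    n = len(score)
    best_perm, best_total = None, float("-inf")
    # Enumerate all n! pairings; feasible only for very small n.
    for perm in permutations(range(n)):
        total = sum(score[i][perm[i]] for i in range(n))
        if total > best_total:
            best_perm, best_total = perm, total
    return best_perm, best_total


# Toy 3 x 3 score matrix (made-up numbers, not real data).
toy_score = [
    [0.9, 0.1, 0.2],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.3],
]
pairing, total = best_pairing(toy_score)
print(pairing, round(total, 2))
```

Here `pairing[i]` is the index of the family B sequence assigned to sequence `i` of family A. Real coevolution scores such as mutual information depend on the whole pairing rather than on independent pair terms, so this additive toy is a simplification.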
Affiliation(s)
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
- Damiano Sgarbossa
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
- Martina Milighetti
  - Division of Infection and Immunity, University College London, London WC1E 6BT, United Kingdom
  - Cancer Institute, University College London, London WC1E 6DD, United Kingdom
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
4. Lupo U, Sgarbossa D, Bitbol AF. Pairing interacting protein sequences using masked language modeling. Proc Natl Acad Sci U S A 2024; 121:e2311887121. [PMID: 38913900 PMCID: PMC11228504 DOI: 10.1073/pnas.2311887121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2023] [Accepted: 12/18/2023] [Indexed: 06/26/2024]
Abstract
Predicting which proteins interact together from amino acid sequences is an important task. We develop a method to pair interacting protein sequences which leverages the power of protein language models trained on multiple sequence alignments (MSAs), such as MSA Transformer and the EvoFormer module of AlphaFold. We formulate the problem of pairing interacting partners among the paralogs of two protein families in a differentiable way. We introduce a method called Differentiable Pairing using Alignment-based Language Models (DiffPALM) that solves it by exploiting the ability of MSA Transformer to fill in masked amino acids in multiple sequence alignments using the surrounding context. MSA Transformer encodes coevolution between functionally or structurally coupled amino acids within protein chains. It also captures inter-chain coevolution, despite being trained on single-chain data. Relying on MSA Transformer without fine-tuning, DiffPALM outperforms existing coevolution-based pairing methods on difficult benchmarks of shallow multiple sequence alignments extracted from ubiquitous prokaryotic protein datasets. It also outperforms an alternative method based on a state-of-the-art protein language model trained on single sequences. Paired alignments of interacting protein sequences are a crucial ingredient of supervised deep learning methods to predict the three-dimensional structure of protein complexes. Starting from sequences paired by DiffPALM substantially improves the structure prediction of some eukaryotic protein complexes by AlphaFold-Multimer. It also achieves competitive performance with using orthology-based pairing.
Affiliation(s)
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
- Damiano Sgarbossa
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne, Lausanne CH-1015, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne CH-1015, Switzerland
5. Zhao H, Petrey D, Murray D, Honig B. ZEPPI: Proteome-scale sequence-based evaluation of protein-protein interaction models. Proc Natl Acad Sci U S A 2024; 121:e2400260121. [PMID: 38743624 PMCID: PMC11127014 DOI: 10.1073/pnas.2400260121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 04/18/2024] [Indexed: 05/16/2024]
Abstract
We introduce ZEPPI (Z-score Evaluation of Protein-Protein Interfaces), a framework to evaluate structural models of a complex based on sequence coevolution and conservation involving residues in protein-protein interfaces. The ZEPPI score is calculated by comparing metrics for an interface to those obtained from randomly chosen residues. Since contacting residues are defined by the structural model, this obviates the need to account for indirect interactions. Further, although ZEPPI relies on species-paired multiple sequence alignments, its focus on interfacial residues allows it to leverage quite shallow alignments. ZEPPI can be implemented on a proteome-wide scale and is applied here to millions of structural models of dimeric complexes in the Escherichia coli and human interactomes found in the PrePPI database. PrePPI's scoring function is based primarily on the evaluation of protein-protein interfaces, and ZEPPI adds a new feature to this analysis through the incorporation of evolutionary information. ZEPPI performance is evaluated through applications to experimentally determined complexes and to decoys from the CASP-CAPRI experiment. As we discuss, the standard CAPRI scores used to evaluate docking models are based on model quality and not on the ability to give yes/no answers as to whether two proteins interact. ZEPPI is able to detect weak signals from PPI models that the CAPRI scores define as incorrect and, similarly, to identify potential PPIs defined as low confidence by the current PrePPI scoring function. A number of examples that illustrate how the combination of PrePPI and ZEPPI can yield functional hypotheses are provided.
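The idea of scoring an interface by comparing its metric with that of randomly chosen residues can be sketched as a generic Z-score calculation. The code below is an illustration only, not the ZEPPI implementation: the function name `interface_z_score`, the use of a mean over residue pairs, and the sampling scheme are all assumptions, and the toy scores are made up.

```python
import random
from statistics import mean, pstdev


def interface_z_score(scores, interface_pairs, n_samples=1000, seed=0):
    """Z-score of the mean interface score against random residue pairs.

    scores maps (i, j) residue pairs to a coevolution-like score
    (hypothetical input format chosen for this sketch).
    """
    rng = random.Random(seed)
    all_pairs = sorted(scores)
    k = len(interface_pairs)
    observed = mean(scores[p] for p in interface_pairs)
    # Background: mean score of k randomly chosen pairs, repeated n_samples times.
    background = [
        mean(scores[p] for p in rng.sample(all_pairs, k))
        for _ in range(n_samples)
    ]
    spread = pstdev(background)
    if spread == 0:
        raise ValueError("background distribution has zero variance")
    return (observed - mean(background)) / spread


# Toy data: a 10 x 10 grid of residue pairs with a low baseline score,
# and three assumed interface pairs carrying a high score.
interface = [(0, 0), (1, 1), (2, 2)]
toy_scores = {(i, j): 0.1 for i in range(10) for j in range(10)}
for pair in interface:
    toy_scores[pair] = 1.0

z = interface_z_score(toy_scores, interface)
print(z > 0)
```

A strongly positive value means the assumed interface pairs score well above what random pairs achieve, which is the kind of signal the abstract describes.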
Affiliation(s)
- Haiqing Zhao
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Donald Petrey
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Diana Murray
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
- Barry Honig
  - Department of Systems Biology, Columbia University Irving Medical Center, New York, NY 10032
  - Department of Biochemistry and Molecular Biophysics, Columbia University Irving Medical Center, New York, NY 10032
  - Department of Medicine, Columbia University, New York, NY 10032
  - Zuckerman Institute, Columbia University, New York, NY 10027
6. Fang T, Szklarczyk D, Hachilif R, von Mering C. Enhancing coevolutionary signals in protein-protein interaction prediction through clade-wise alignment integration. Sci Rep 2024; 14:6009. [PMID: 38472223 PMCID: PMC10933411 DOI: 10.1038/s41598-024-55655-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Accepted: 02/26/2024] [Indexed: 03/14/2024]
Abstract
Protein-protein interactions (PPIs) play essential roles in most biological processes. The binding interfaces between interacting proteins impose evolutionary constraints that have successfully been employed to predict PPIs from multiple sequence alignments (MSAs). To construct MSAs, critical choices have to be made: how to ensure the reliable identification of orthologs, and how to optimally balance the need for large alignments versus sufficient alignment quality. Here, we propose a divide-and-conquer strategy for MSA generation: instead of building a single, large alignment for each protein, multiple distinct alignments are constructed under distinct clades in the tree of life. Coevolutionary signals are searched separately within these clades, and are only subsequently integrated using machine learning techniques. We find that this strategy markedly improves overall prediction performance, concomitant with better alignment quality. Using the popular DCA algorithm to systematically search pairs of such alignments, a genome-wide all-against-all interaction scan in a bacterial genome is demonstrated. Given the recent successes of AlphaFold in predicting direct PPIs at atomic detail, a discover-and-refine approach is proposed: our method could provide a fast and accurate strategy for pre-screening the entire genome, submitting to AlphaFold only promising interaction candidates, thus reducing false positives as well as computation time.
Affiliation(s)
- Tao Fang
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Damian Szklarczyk
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Radja Hachilif
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
- Christian von Mering
  - Department of Molecular Life Sciences, University of Zurich, 8057, Zurich, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015, Lausanne, Switzerland
7. Gandarilla-Pérez CA, Pinilla S, Bitbol AF, Weigt M. Combining phylogeny and coevolution improves the inference of interaction partners among paralogous proteins. PLoS Comput Biol 2023; 19:e1011010. [PMID: 36996234 PMCID: PMC10089317 DOI: 10.1371/journal.pcbi.1011010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Revised: 04/11/2023] [Accepted: 03/08/2023] [Indexed: 04/01/2023]
Abstract
Predicting protein-protein interactions from sequences is an important goal of computational biology. Various sources of information can be used to this end. Starting from the sequences of two interacting protein families, one can use phylogeny or residue coevolution to infer which paralogs are specific interaction partners within each species. We show that these two signals can be combined to improve the performance of the inference of interaction partners among paralogs. For this, we first align the sequence-similarity graphs of the two families through simulated annealing, yielding a robust partial pairing. We next use this partial pairing to seed a coevolution-based iterative pairing algorithm. This combined method improves performance over either separate method. The improvement obtained is striking in the difficult cases where the average number of paralogs per species is large or where the total number of sequences is modest.
Affiliation(s)
- Carlos A Gandarilla-Pérez
  - Facultad de Física, Universidad de la Habana, San Lázaro y L, Vedado, Habana, Cuba
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), Paris, France
- Sergio Pinilla
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), Paris, France
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire Jean Perrin (UMR 8237), Paris, France
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Martin Weigt
  - Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), Paris, France
8. Xie J, Zhang W, Zhu X, Deng M, Lai L. Coevolution-based prediction of key allosteric residues for protein function regulation. eLife 2023; 12:81850. [PMID: 36799896 PMCID: PMC9981151 DOI: 10.7554/elife.81850] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 07/13/2022] [Accepted: 02/16/2023] [Indexed: 02/18/2023]
Abstract
Allostery is fundamental to many biological processes. Due to the distant regulation nature, how allosteric mutations, modifications, and effector binding impact protein function is difficult to forecast. In protein engineering, remote mutations cannot be rationally designed without large-scale experimental screening. Allosteric drugs have raised much attention due to their high specificity and possibility of overcoming existing drug-resistant mutations. However, optimization of allosteric compounds remains challenging. Here, we developed a novel computational method KeyAlloSite to predict allosteric site and to identify key allosteric residues (allo-residues) based on the evolutionary coupling model. We found that protein allosteric sites are strongly coupled to orthosteric site compared to non-functional sites. We further inferred key allo-residues by pairwise comparing the difference of evolutionary coupling scores of each residue in the allosteric pocket with the functional site. Our predicted key allo-residues are in accordance with previous experimental studies for typical allosteric proteins like BCR-ABL1, Tar, and PDZ3, as well as key cancer mutations. We also showed that KeyAlloSite can be used to predict key allosteric residues distant from the catalytic site that are important for enzyme catalysis. Our study demonstrates that weak coevolutionary couplings contain important information of protein allosteric regulation function. KeyAlloSite can be applied in studying the evolution of protein allosteric regulation, designing and optimizing allosteric drugs, and performing functional protein design and enzyme engineering.
Affiliation(s)
- Juan Xie
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Weilin Zhang
  - BNLMS, Peking-Tsinghua Center for Life Sciences at the College of Chemistry and Molecular Engineering, Peking University, Beijing, China
- Xiaolei Zhu
  - School of Sciences, Anhui Agricultural University, Hefei, China
- Minghua Deng
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - School of Mathematical Sciences, Peking University, Beijing, China
  - Center for Statistical Science, Peking University, Beijing, China
- Luhua Lai
  - Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
  - BNLMS, Peking-Tsinghua Center for Life Sciences at the College of Chemistry and Molecular Engineering, Peking University, Beijing, China
  - Research Unit of Drug Design Method, Chinese Academy of Medical Sciences (2021RU014), Beijing, China
9. Dietler N, Lupo U, Bitbol AF. Impact of phylogeny on structural contact inference from protein sequence data. J R Soc Interface 2023; 20:20220707. [PMID: 36751926 PMCID: PMC9905998 DOI: 10.1098/rsif.2022.0707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Accepted: 01/09/2023] [Indexed: 02/09/2023]
Abstract
Local and global inference methods have been developed to infer structural contacts from multiple sequence alignments of homologous proteins. They rely on correlations in amino acid usage at contacting sites. Because homologous proteins share a common ancestry, their sequences also feature phylogenetic correlations, which can impair contact inference. We investigate this effect by generating controlled synthetic data from a minimal model where the importance of contacts and of phylogeny can be tuned. We demonstrate that global inference methods, specifically Potts models, are more resilient to phylogenetic correlations than local methods, based on covariance or mutual information. This holds whether or not phylogenetic corrections are used, and may explain the success of global methods. We analyse the roles of selection strength and of phylogenetic relatedness. We show that sites that mutate early in the phylogeny yield false positive contacts. We consider natural data and realistic synthetic data, and our findings generalize to these cases. Our results highlight the impact of phylogeny on contact prediction from protein sequences and illustrate the interplay between the rich structure of biological data and inference.
Affiliation(s)
- Nicola Dietler
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Umberto Lupo
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), 1015 Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
10. Ward KM, Pickett BD, Ebbert MTW, Kauwe JSK, Miller JB. Web-Based Protein Interactions Calculator Identifies Likely Proteome Coevolution with Alzheimer’s Disease-Associated Proteins. Genes (Basel) 2022; 13:genes13081346. [PMID: 36011253 PMCID: PMC9407263 DOI: 10.3390/genes13081346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2022] [Revised: 07/22/2022] [Accepted: 07/23/2022] [Indexed: 11/19/2022]
Abstract
Protein–protein functional interactions arise from either transitory or permanent biomolecular associations and often lead to the coevolution of the interacting residues. Although mutual information has traditionally been used to identify coevolving residues within the same protein, its application between coevolving proteins remains largely uncharacterized. Therefore, we developed the Protein Interactions Calculator (PIC) to efficiently identify coevolving residues between two protein sequences using mutual information. We verified the algorithm using 2102 known human protein interactions and 233 known bacterial protein interactions, with a respective 1975 and 252 non-interacting protein controls. The average PIC score for known human protein interactions was 4.5 times higher than non-interacting proteins (p = 1.03 × 10⁻¹⁰⁸) and 1.94 times higher in bacteria (p = 1.22 × 10⁻³⁵). We then used the PIC scores to determine the probability that two proteins interact. Using those probabilities, we paired 37 Alzheimer’s disease-associated proteins with 8608 other proteins and determined the likelihood that each pair interacts, which we report through a web interface. The PIC had significantly higher sensitivity and residue-specific resolution not available in other algorithms. Therefore, we propose that the PIC can be used to prioritize potential protein interactions, which can lead to a better understanding of biological processes and additional therapeutic targets belonging to protein interaction groups.
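Mutual information between two alignment columns, the quantity this abstract builds on, can be computed directly from residue counts. The sketch below shows plain mutual information in bits on toy columns; it is not the PIC scoring function, and it assumes the two columns are already paired row by row (for example by species).

```python
from collections import Counter
from math import log2


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two equally long alignment columns.

    Row k of col_a and row k of col_b are assumed to come from the same
    organism; that pairing is an assumption of this sketch.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of rows")
    n = len(col_a)
    count_a = Counter(col_a)
    count_b = Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) written in terms of raw counts.
        mi += p_ab * log2(c * n / (count_a[a] * count_b[b]))
    return mi


# Toy columns (made-up residues): the first pair covaries perfectly,
# the second pair varies independently.
perfectly_coupled = mutual_information("AAKK", "DDEE")
independent = mutual_information("AAKK", "DEDE")
print(perfectly_coupled, independent)
```

On real alignments, finite-sample and phylogenetic effects inflate raw mutual information, so corrections are usually applied before scores are compared; none are included here.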
Affiliation(s)
- Katrisa M. Ward
  - Department of Biology, Brigham Young University, Provo, UT 84602, USA
- Brandon D. Pickett
  - Department of Biology, Brigham Young University, Provo, UT 84602, USA
- Mark T. W. Ebbert
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY 40536, USA
  - Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY 40506, USA
  - Department of Neuroscience, University of Kentucky, Lexington, KY 40506, USA
- John S. K. Kauwe
  - Department of Biology, Brigham Young University, Provo, UT 84602, USA
- Justin B. Miller
  - Sanders-Brown Center on Aging, University of Kentucky, Lexington, KY 40536, USA
  - Division of Biomedical Informatics, Department of Internal Medicine, University of Kentucky, Lexington, KY 40506, USA
  - Department of Pathology and Laboratory Medicine, University of Kentucky, Lexington, KY 40506, USA
  - Correspondence: Tel.: +1-859-562-0333
11. Si Y, Yan C. Protein complex structure prediction powered by multiple sequence alignments of interologs from multiple taxonomic ranks and AlphaFold2. Brief Bioinform 2022; 23:6596987. [PMID: 35649388 DOI: 10.1093/bib/bbac208] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/25/2022] [Revised: 04/17/2022] [Accepted: 05/05/2022] [Indexed: 12/19/2022]
Abstract
AlphaFold2 can predict protein complex structures as long as a multiple sequence alignment (MSA) of the interologs of the target protein-protein interaction (PPI) can be provided. In this study, a simplified phylogeny-based approach was applied to generate the MSA of interologs, which was then used as the input to AlphaFold2 for protein complex structure prediction. In this extensively benchmarked protocol on nonredundant PPI dataset, including 107 bacterial PPIs and 442 eukaryotic PPIs, we show complex structures of 79.5% of the bacterial PPIs and 49.8% of the eukaryotic PPIs can be successfully predicted, which yielded significantly better performance than the application of MSA of interologs prepared by two existing approaches. Considering PPIs may not be conserved in species with long evolutionary distances, we further restricted interologs in the MSA to different taxonomic ranks of the species of the target PPI in protein complex structure prediction. We found that the success rates can be increased to 87.9% for the bacterial PPIs and 56.3% for the eukaryotic PPIs if interologs in the MSA are restricted to a specific taxonomic rank of the species of each target PPI. Finally, we show that the optimal taxonomic ranks for protein complex structure prediction can be selected with the application of the predicted template modeling (TM) scores of the output models.
Affiliation(s)
- Yunda Si
  - School of Physics, Huazhong University of Science and Technology, China
- Chengfei Yan
  - School of Physics, Huazhong University of Science and Technology, China
12. Gerardos A, Dietler N, Bitbol AF. Correlations from structure and phylogeny combine constructively in the inference of protein partners from sequences. PLoS Comput Biol 2022; 18:e1010147. [PMID: 35576238 PMCID: PMC9135348 DOI: 10.1371/journal.pcbi.1010147] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/22/2021] [Revised: 05/26/2022] [Accepted: 04/27/2022] [Indexed: 11/19/2022]
Abstract
Inferring protein-protein interactions from sequences is an important task in computational biology. Recent methods based on Direct Coupling Analysis (DCA) or Mutual Information (MI) allow to find interaction partners among paralogs of two protein families. Does successful inference mainly rely on correlations from structural contacts or from phylogeny, or both? Do these two types of signal combine constructively or hinder each other? To address these questions, we generate and analyze synthetic data produced using a minimal model that allows us to control the amounts of structural constraints and phylogeny. We show that correlations from these two sources combine constructively to increase the performance of partner inference by DCA or MI. Furthermore, signal from phylogeny can rescue partner inference when signal from contacts becomes less informative, including in the realistic case where inter-protein contacts are restricted to a small subset of sites. We also demonstrate that DCA-inferred couplings between non-contact pairs of sites improve partner inference in the presence of strong phylogeny, while deteriorating it otherwise. Moreover, restricting to non-contact pairs of sites preserves inference performance in the presence of strong phylogeny. In a natural data set, as well as in realistic synthetic data based on it, we find that non-contact pairs of sites contribute positively to partner inference performance, and that restricting to them preserves performance, evidencing an important role of phylogeny.
Affiliation(s)
- Andonis Gerardos
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Nicola Dietler
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
13. Extracting phylogenetic dimensions of coevolution reveals hidden functional signals. Sci Rep 2022; 12:820. [PMID: 35039514 PMCID: PMC8764114 DOI: 10.1038/s41598-021-04260-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 09/02/2021] [Accepted: 12/17/2021] [Indexed: 11/08/2022]
Abstract
Despite the structural and functional information contained in the statistical coupling between pairs of residues in a protein, coevolution associated with function is often obscured by artifactual signals such as genetic drift, which shapes a protein's phylogenetic history and gives rise to concurrent variation between protein sequences that is not driven by selection for function. Here, we introduce a background model for phylogenetic contributions of statistical coupling that separates the coevolution signal due to inter-clade and intra-clade sequence comparisons and demonstrate that coevolution can be measured on multiple phylogenetic timescales within a single protein. Our method, nested coevolution (NC), can be applied as an extension to any coevolution metric. We use NC to demonstrate that poorly conserved residues can nonetheless have important roles in protein function. Moreover, NC improved the structural-contact predictions of several coevolution-based methods, particularly in subsampled alignments with fewer sequences. NC also lowered the noise in detecting functional sectors of collectively coevolving residues. Sectors of coevolving residues identified after application of NC were more spatially compact and phylogenetically distinct from the rest of the protein, and strongly enriched for mutations that disrupt protein activity. Thus, our conceptualization of the phylogenetic separation of coevolution provides the potential to further elucidate relationships among protein evolution, function, and genetic diseases.
|
14
|
Behdenna A, Godfroid M, Petot P, Pothier J, Lambert A, Achaz G. A minimal yet flexible likelihood framework to assess correlated evolution. Syst Biol 2021; 71:823-838. [PMID: 34792608 DOI: 10.1093/sysbio/syab092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2020] [Revised: 11/04/2021] [Accepted: 11/09/2021] [Indexed: 11/14/2022] Open
Abstract
An evolutionary process is reflected in the sequence of changes of any trait (e.g. morphological or molecular) through time. Yet, a better understanding of evolution would be gained by characterizing correlated evolution, that is, cases in which two or more evolutionary processes interact. Previously developed parametric methods often require significant computing time as they rely on the estimation of many parameters. Here we propose a minimal likelihood framework modelling the joint evolution of two traits on a known phylogenetic tree. The type and strength of correlated evolution are characterized by a few parameters tuning mutation rates of each trait and interdependencies between these rates. The framework can be applied to study any discrete trait or character ranging from nucleotide substitution to gain or loss of a biological function. More specifically, it can be used to 1) test for independence between two evolutionary processes, 2) identify the type of interaction between them and 3) estimate parameter values of the most likely model of interaction. In the current implementation, the method takes as input a phylogenetic tree with discrete evolutionary events mapped on its branches. The method then maximizes the likelihood for one or several chosen scenarios. The strengths and limits of the method, as well as its relative power compared to a few other methods, are assessed using both simulations and data from 16S rRNA sequences in a sample of 54 γ-enterobacteria. We show that, even with datasets of fewer than 100 species, the method performs well in parameter estimation and in evolutionary model selection.
Affiliation(s)
- Abdelkader Behdenna
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum National d'Histoire Naturelle, CNRS UMR 7205, Sorbonne Université, École Pratique des Hautes Études, Université des Antilles, 45 rue Buffon, 75005 Paris, France
- SMILE Group, Center for Interdisciplinary Research in Biology (CIRB), Collège de France, CNRS, INSERM, Université PSL, 11, place Marcellin Berthelot, 75005 Paris, France
- Epigene Labs, 7 Square Gabriel Fauré, 75017 Paris, France
| | - Maxime Godfroid
- SMILE Group, Center for Interdisciplinary Research in Biology (CIRB), Collège de France, CNRS, INSERM, Université PSL, 11, place Marcellin Berthelot, 75005 Paris, France
| | - Patrice Petot
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum National d'Histoire Naturelle, CNRS UMR 7205, Sorbonne Université, École Pratique des Hautes Études, Université des Antilles, 45 rue Buffon, 75005 Paris, France
- SMILE Group, Center for Interdisciplinary Research in Biology (CIRB), Collège de France, CNRS, INSERM, Université PSL, 11, place Marcellin Berthelot, 75005 Paris, France
| | - Joël Pothier
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum National d'Histoire Naturelle, CNRS UMR 7205, Sorbonne Université, École Pratique des Hautes Études, Université des Antilles, 45 rue Buffon, 75005 Paris, France
| | - Amaury Lambert
- SMILE Group, Center for Interdisciplinary Research in Biology (CIRB), Collège de France, CNRS, INSERM, Université PSL, 11, place Marcellin Berthelot, 75005 Paris, France
- Laboratoire de Probabilités, Statistique et Modélisation (LPSM), Sorbonne Université, CNRS UMR 8001, Université de Paris, 4, place Jussieu, 75005 Paris, France
| | - Guillaume Achaz
- Institut de Systématique, Évolution, Biodiversité (ISYEB), Muséum National d'Histoire Naturelle, CNRS UMR 7205, Sorbonne Université, École Pratique des Hautes Études, Université des Antilles, 45 rue Buffon, 75005 Paris, France
- SMILE Group, Center for Interdisciplinary Research in Biology (CIRB), Collège de France, CNRS, INSERM, Université PSL, 11, place Marcellin Berthelot, 75005 Paris, France
- Éco-anthropologie, Muséum National d'Histoire Naturelle, CNRS UMR 7206, Université de Paris, place du Trocadéro, 75016 Paris, France
| |
|
15
|
Pozzati G, Zhu W, Bassot C, Lamb J, Kundrotas P, Elofsson A. Limits and potential of combined folding and docking. Bioinformatics 2021; 38:954-961. [PMID: 34788800 PMCID: PMC8796369 DOI: 10.1093/bioinformatics/btab760] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 06/16/2021] [Revised: 09/23/2021] [Accepted: 11/02/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION In the last decade, de novo protein structure prediction accuracy for individual proteins has improved significantly by utilising deep learning (DL) methods for harvesting the co-evolution information from large multiple sequence alignments (MSAs). The same approach can, in principle, also be used to extract information about evolutionary-based contacts across protein-protein interfaces. However, most earlier studies have not used the latest DL methods for inter-chain contact distance prediction. This article introduces a fold-and-dock method based on predicted residue-residue distances with trRosetta. RESULTS The method can simultaneously predict the tertiary and quaternary structure of a protein pair, even when the structures of the monomers are not known. The straightforward application of this method to a standard dataset for protein-protein docking yielded limited success. However, using alternative methods for generating MSAs allowed us to accurately dock significantly more proteins. We also introduced a novel scoring function, PconsDock, that accurately separates 98% of correctly and incorrectly folded and docked proteins. The average performance of the method is comparable to the use of traditional, template-based or ab initio shape-complementarity-only docking methods. Moreover, the results of conventional and fold-and-dock approaches are complementary, and thus a combined docking pipeline could increase overall docking success significantly. This methodology contributed to the best model for one of the CASP14 oligomeric targets, H1065. AVAILABILITY AND IMPLEMENTATION All scripts for predictions and analysis are available from https://github.com/ElofssonLab/bioinfo-toolbox/ and https://gitlab.com/ElofssonLab/benchmark5/. All models, joined alignments, and evaluation results are available from the following figshare repository: https://doi.org/10.6084/m9.figshare.14654886.v2.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
| | | | | | - John Lamb
- Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, 171 21 Solna, Sweden
| | - Petras Kundrotas
- Science for Life Laboratory and Department of Biochemistry and Biophysics, Stockholm University, 171 21 Solna, Sweden,Center for Computational Biology, The University of Kansas, Lawrence, KS 66047, USA
| | | |
|
16
|
Jiang XL, Dimas RP, Chan CTY, Morcos F. Coevolutionary methods enable robust design of modular repressors by reestablishing intra-protein interactions. Nat Commun 2021; 12:5592. [PMID: 34552074 PMCID: PMC8458406 DOI: 10.1038/s41467-021-25851-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 05/12/2021] [Accepted: 09/03/2021] [Indexed: 11/23/2022] Open
Abstract
Genetic sensors with unique combinations of DNA recognition and allosteric response can be created by hybridizing DNA-binding modules (DBMs) and ligand-binding modules (LBMs) from distinct transcriptional repressors. This module swapping approach is limited by incompatibility between DBMs and LBMs from different proteins, due to the loss of critical module-module interactions after hybridization. We determine a design strategy for restoring key interactions between DBMs and LBMs by using a computational model informed by coevolutionary traits in the LacI family. This model predicts the influence of proposed mutations on protein structure and function, quantifying the feasibility of each mutation for rescuing hybrid repressors. We accurately predict which hybrid repressors can be rescued by mutating residues to reinstall relevant module-module interactions. Experimental results confirm that dynamic ranges of gene expression induction were improved significantly in these mutants. This approach enhances the molecular and mechanistic understanding of LacI family proteins, and advances the ability to design modular genetic parts.
Affiliation(s)
- Xian-Li Jiang
- Department of Biological Sciences, The University of Texas at Dallas, Dallas, TX, USA
- Department of Bioinformatics and Computational Biology, The University of Texas M.D. Anderson Cancer Center, Houston, TX, USA
| | - Rey P Dimas
- Department of Biology, The University of Texas at Tyler, Tyler, TX, USA
| | - Clement T Y Chan
- Department of Biomedical Engineering, University of North Texas, Denton, TX, USA.
- BioDiscovery Institute, University of North Texas, Denton, TX, USA.
| | - Faruck Morcos
- Department of Biological Sciences, The University of Texas at Dallas, Dallas, TX, USA.
- Department of Bioengineering, The University of Texas at Dallas, Dallas, TX, USA.
- Center for Systems Biology, The University of Texas at Dallas, Dallas, TX, USA.
| |
|
17
|
Suh D, Lee JW, Choi S, Lee Y. Recent Applications of Deep Learning Methods on Evolution- and Contact-Based Protein Structure Prediction. Int J Mol Sci 2021; 22:6032. [PMID: 34199677 PMCID: PMC8199773 DOI: 10.3390/ijms22116032] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 05/16/2021] [Revised: 05/29/2021] [Accepted: 05/29/2021] [Indexed: 01/23/2023] Open
Abstract
The new advances in deep learning methods have influenced many aspects of scientific research, including the study of the protein system. The prediction of proteins' 3D structural components is now heavily dependent on machine learning techniques that interpret how protein sequences and their homology govern the inter-residue contacts and structural organization. In particular, methods employing deep neural networks have had a significant impact on the recent CASP13 and CASP14 competitions. Here, we explore the recent applications of deep learning methods in the protein structure prediction area. We also look at the potential opportunities for deep learning methods to identify unknown protein structures and functions to be discovered and help guide drug-target interactions. Although significant problems still need to be addressed, we expect these techniques in the near future to play crucial roles in protein structural bioinformatics as well as in drug discovery.
Affiliation(s)
- Donghyuk Suh
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea; (D.S.); (J.W.L.); (S.C.)
| | - Jai Woo Lee
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea; (D.S.); (J.W.L.); (S.C.)
| | - Sun Choi
- Global AI Drug Discovery Center, School of Pharmaceutical Sciences, College of Pharmacy and Graduate, Ewha Womans University, Seoul 03760, Korea; (D.S.); (J.W.L.); (S.C.)
| | - Yoonji Lee
- College of Pharmacy, Chung-Ang University, Seoul 06974, Korea
| |
|
18
|
Information Theory in Molecular Evolution: From Models to Structures and Dynamics. Entropy 2021; 23:e23040482. [PMID: 33921557 PMCID: PMC8073717 DOI: 10.3390/e23040482] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2021] [Accepted: 04/15/2021] [Indexed: 11/27/2022]
|
19
|
Trivial and nontrivial error sources account for misidentification of protein partners in mutual information approaches. Sci Rep 2021; 11:6902. [PMID: 33767294 PMCID: PMC7994710 DOI: 10.1038/s41598-021-86455-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2020] [Accepted: 03/15/2021] [Indexed: 12/01/2022] Open
Abstract
The problem of finding the correct set of partners for a given pair of interacting protein families based on multi-sequence alignments (MSAs) has received great attention over the years. Recently, the native contacts of two interacting proteins were shown to store the strongest mutual information (MI) signal to discriminate MSA concatenations with the largest fraction of correct pairings. Although that signal might be of practical relevance in the search for an effective heuristic to solve the problem, the number of MSA concatenations with near-native MI is large, imposing severe limitations. Here, a Genetic Algorithm that explores possible MSA concatenations according to an MI maximization criterion is shown to find degenerate solutions with two error sources, arising from mismatches among (i) similar and (ii) non-similar sequences. If mistakes made among similar sequences are disregarded, type-(i) solutions are found to resolve correct pairings at best true positive (TP) rates of 70%, far above the very same estimates in type-(ii) solutions. A machine learning classification algorithm helps to show further that differences between optimized solutions based on TP rates are not artificial and may have biological meaning associated with the three-dimensional distribution of the MI signal. Type-(i) solutions may therefore correspond to reliable results for predictive purposes, found here to be more likely obtained via MI maximization across protein systems having a minimum critical number of amino acid contacts on their interaction surfaces (N > 200).
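The idea of disregarding mistakes among similar sequences can be made concrete with a small scoring sketch. It assumes that sequence similarity is available as a cluster label per family-B sequence; the function name, identifiers, and cluster labels are hypothetical and only illustrate the counting rule, not the cited study's pipeline.

```python
def tp_rate(predicted, native, similar=None):
    """Fraction of family-A sequences paired with the right family-B partner.

    predicted, native: dicts mapping a family-A sequence id to a family-B id.
    similar: optional dict mapping a family-B id to a cluster label; when given,
    a mismatch inside the same cluster is counted as correct (a trivial error).
    """
    hits = 0
    for a, b_true in native.items():
        b_pred = predicted.get(a)
        if b_pred == b_true:
            hits += 1
        elif (similar is not None and b_pred in similar
              and similar[b_pred] == similar.get(b_true)):
            hits += 1
    return hits / len(native)

# Hypothetical pairing of five sequences; b2 and b3 are near-identical (same cluster).
native = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4", "a5": "b5"}
predicted = {"a1": "b1", "a2": "b3", "a3": "b2", "a4": "b5", "a5": "b4"}
clusters = {"b1": 0, "b2": 1, "b3": 1, "b4": 2, "b5": 3}

strict = tp_rate(predicted, native)             # only exact matches count
relaxed = tp_rate(predicted, native, clusters)  # swaps within a cluster are forgiven
```

Here the swap between b2 and b3 is a type-(i) error and is forgiven under the relaxed rule, while the swap between b4 and b5 is a type-(ii) error and still counts against the solution.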
|
20
|
Green AG, Elhabashy H, Brock KP, Maddamsetti R, Kohlbacher O, Marks DS. Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences. Nat Commun 2021; 12:1396. [PMID: 33654096 PMCID: PMC7925567 DOI: 10.1038/s41467-021-21636-z] [Citation(s) in RCA: 68] [Impact Index Per Article: 17.0] [Received: 01/22/2020] [Accepted: 01/27/2021] [Indexed: 12/28/2022] Open
Abstract
Increasing numbers of protein interactions have been identified in high-throughput experiments, but only a small proportion have solved structures. Recently, sequence coevolution-based approaches have led to a breakthrough in predicting monomer protein structures and protein interaction interfaces. Here, we address the challenges of large-scale interaction prediction at residue resolution with a fast alignment concatenation method and a probabilistic score for the interaction of residues. Importantly, this method (EVcomplex2) is able to assess the likelihood of a protein interaction, as we show here applied to large-scale experimental datasets where the pairwise interactions are unknown. We predict 504 interactions de novo in the E. coli membrane proteome, including 243 that are newly discovered. While EVcomplex2 does not require available structures, coevolving residue pairs can be used to produce structural models of protein interactions, as done here for membrane complexes including the Flagellar Hook-Filament Junction and the Tol/Pal complex. Our understanding of the residue-level details of protein interactions remains incomplete. Here, the authors show sequence coevolution can be used to infer interacting proteins with residue-level details, including predicting 467 interactions de novo in the Escherichia coli cell envelope proteome.
Affiliation(s)
- Anna G Green
- Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
| | - Hadeer Elhabashy
- Biomolecular Interactions, Max Planck Institute for Developmental Biology, 72076, Tübingen, Germany.,Institute for Bioinformatics and Medical Informatics, University of Tübingen, Sand 14, 72076, Tübingen, Germany.,Department of Computer Science, University of Tübingen, WSI/ZBIT, Sand 14, 72076, Tübingen, Germany
| | - Kelly P Brock
- Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
| | - Rohan Maddamsetti
- Department of Systems Biology, Harvard Medical School, Boston, MA, 02115, USA
| | - Oliver Kohlbacher
- Biomolecular Interactions, Max Planck Institute for Developmental Biology, 72076, Tübingen, Germany.,Institute for Bioinformatics and Medical Informatics, University of Tübingen, Sand 14, 72076, Tübingen, Germany.,Department of Computer Science, University of Tübingen, WSI/ZBIT, Sand 14, 72076, Tübingen, Germany.,Quantitative Biology Center, University of Tübingen, Auf der Morgenstelle 8, 72076, Tübingen, Germany.,Institute for Translational Bioinformatics, University Hospital Tübingen, Sand 14, 72076, Tübingen, Germany.
| | - Debora S Marks
- Institute for Bioinformatics and Medical Informatics, University of Tübingen, Sand 14, 72076, Tübingen, Germany.,Broad Institute of Harvard and MIT, Cambridge, MA, 02142, USA.
| |
|
21
|
Salmanian S, Pezeshk H, Sadeghi M. Inter-protein residue covariation information unravels physically interacting protein dimers. BMC Bioinformatics 2020; 21:584. [PMID: 33334319 PMCID: PMC7745481 DOI: 10.1186/s12859-020-03930-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 08/12/2020] [Accepted: 12/09/2020] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Predicting physical interaction between proteins is one of the greatest challenges in computational biology. There is a considerable variety of protein interactions and a huge number of protein sequences and synthetic peptides with unknown interacting counterparts. Most co-evolutionary methods discover a combination of physical interplays and functional associations. However, there are only a handful of approaches which specifically infer physical interactions. Hybrid co-evolutionary methods exploit inter-protein residue coevolution to unravel specific physical interacting proteins. In this study, we introduce a hybrid co-evolutionary-based approach to predict physical interplays between pairs of protein families, starting from protein sequences only. RESULTS In the present analysis, pairs of multiple sequence alignments are constructed for each dimer and the covariation between residues in those pairs is calculated by CCMpred (Contacts from Correlated Mutations predicted) and three mutual information based approaches for ten accessible surface area threshold groups. Then, whole residue couplings between proteins of each dimer are unified into a single Frobenius norm value. Norms of residue contact matrices of all dimers in different accessible surface area thresholds are fed into support vector machine as single or multiple feature models. The results of training the classifiers by single features show no apparent differences in accuracy between methods for different accessible surface area thresholds. Nevertheless, the mutual information product and context likelihood of relatedness procedures may have roughly higher and lower overall performance, respectively, than the other two methods for different accessible surface area cut-offs. The results also demonstrate that training support vector machine with multiple norm features for several accessible surface area thresholds leads to a considerable improvement of prediction performance. In this context, CCMpred roughly achieves an overall better performance than mutual information based approaches. The best accuracy, sensitivity, specificity, precision and negative predictive value for that method are 0.98, 1, 0.962, 0.96, and 0.962, respectively. CONCLUSIONS In this paper, by feeding norm values of protein dimers into support vector machines in different accessible surface area thresholds, we demonstrate that even a small number of proteins in pairs of multiple alignments could allow one to accurately discriminate between positive and negative dimers.
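The step of collapsing all inter-protein residue couplings into a single Frobenius norm can be sketched as follows. The function name, the list-of-lists layout (protein A columns first, then protein B), and the toy scores are assumptions; the accessible-surface-area filtering described in the abstract is deliberately left out.

```python
from math import sqrt

def interprotein_frobenius_norm(couplings, len_a):
    """Frobenius norm of the inter-protein block of a residue-residue score matrix.

    couplings: square list-of-lists over the concatenated alignment (A, then B).
    len_a: number of columns that belong to protein A.
    """
    total = 0.0
    for i in range(len_a):                       # residues of protein A
        for j in range(len_a, len(couplings)):   # residues of protein B
            total += couplings[i][j] ** 2
    return sqrt(total)

# Hypothetical 4x4 symmetric score matrix: two residues in A, two in B.
scores = [
    [0.0, 9.0, 3.0, 0.0],
    [9.0, 0.0, 0.0, 4.0],
    [3.0, 0.0, 0.0, 7.0],
    [0.0, 4.0, 7.0, 0.0],
]
feature = interprotein_frobenius_norm(scores, len_a=2)
```

Only the A-B block (entries 3 and 4 here) contributes, so intra-protein couplings such as 9 and 7 do not affect the feature that would be passed to a classifier.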
Affiliation(s)
- Sara Salmanian
- Department of Bioinformatics, Institute of Biochemistry and Biophysics, University of Tehran, Tehran, Iran
| | - Hamid Pezeshk
- School of Mathematics, Statistics and Computer Science, College of Science, University of Tehran, Tehran, Iran
- Present Address: Department of Mathematics and Statistics, Concordia University, Montreal, Canada
- School of Biological Sciences, Institute for Research in Fundamental Sciences, Tehran, Iran
| | - Mehdi Sadeghi
- National Institute of Genetic Engineering and Biotechnology, Tehran, Iran
| |
|
22
|
Nath A, Leier A. Improved cytokine-receptor interaction prediction by exploiting the negative sample space. BMC Bioinformatics 2020; 21:493. [PMID: 33129275 PMCID: PMC7603689 DOI: 10.1186/s12859-020-03835-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/04/2020] [Accepted: 10/23/2020] [Indexed: 01/19/2023] Open
Abstract
Background Cytokines act by binding to specific receptors in the plasma membrane of target cells. Knowledge of cytokine–receptor interaction (CRI) is very important for understanding the pathogenesis of various human diseases—notably autoimmune, inflammatory and infectious diseases—and identifying potential therapeutic targets. Recently, machine learning algorithms have been used to predict CRIs. “Gold Standard” negative datasets are still lacking and strong biases in negative datasets can significantly affect the training of learning algorithms and their evaluation. To mitigate the unrepresentativeness and bias inherent in the negative sample selection (non-interacting proteins), we propose a clustering-based approach for representative negative sample selection. Results We used deep autoencoders to investigate the effect of different sampling approaches for non-interacting pairs on the training and the performance of machine learning classifiers. By using the anomaly detection capabilities of deep autoencoders we deduced the effects of different categories of negative samples on the training of learning algorithms. Random sampling for selecting non-interacting pairs results in either over- or under-representation of hard or easy to classify instances. When K-means based sampling of negative datasets is applied to mitigate the inadequacies of random sampling, random forest (RF) together with the combined feature set of atomic composition, physicochemical-2grams and two different representations of evolutionary information performs best. Average model performances based on leave-one-out cross validation (loocv) over ten different negative sample sets that each model was trained with, show that RF models significantly outperform the previous best CRI predictor in terms of accuracy (+ 5.1%), specificity (+ 13%), mcc (+ 0.1) and g-means value (+ 5.1). Evaluations using tenfold cv and training/testing splits confirm the competitive performance. 
Conclusions A comparative analysis was performed to assess the effect of three different sampling methods (random, K-means and uniform sampling) on the training of learning algorithms using different evaluation methods. Models trained on K-means sampled datasets generally show a significantly improved performance compared to those trained on random selections—with RF seemingly benefiting most in our particular setting. Our findings on the sampling are highly relevant and apply to many applications of supervised learning approaches in bioinformatics.
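For readers unfamiliar with the evaluation measures quoted above, the following sketch derives accuracy, specificity, MCC, and the g-mean from a binary confusion matrix. It assumes the common definition of the g-mean as the geometric mean of sensitivity and specificity; the function name and the counts are hypothetical and are not taken from the cited study.

```python
from math import sqrt

def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when any marginal count is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    g_mean = sqrt(sensitivity * specificity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
        "g_mean": g_mean,
    }

# Hypothetical counts for interacting (positive) and non-interacting (negative) pairs.
metrics = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

Because specificity and the g-mean depend directly on how negatives are classified, they are the measures most sensitive to how the negative set was sampled.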
Affiliation(s)
- Abhigyan Nath
- Department of Biochemistry, Pt. Jawahar Lal Nehru Memorial Medical College, Raipur, 492001, India.
| | - André Leier
- Department of Genetics, Department of Cell Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA.
| |
|
23
|
Russ WP, Figliuzzi M, Stocker C, Barrat-Charlaix P, Socolich M, Kast P, Hilvert D, Monasson R, Cocco S, Weigt M, Ranganathan R. An evolution-based model for designing chorismate mutase enzymes. Science 2020; 369:440-445. [PMID: 32703877 DOI: 10.1126/science.aba3304] [Citation(s) in RCA: 151] [Impact Index Per Article: 30.2] [Received: 11/24/2019] [Accepted: 05/13/2020] [Indexed: 02/02/2023]
Abstract
The rational design of enzymes is an important goal for both fundamental and practical reasons. Here, we describe a process to learn the constraints for specifying proteins purely from evolutionary sequence data, design and build libraries of synthetic genes, and test them for activity in vivo using a quantitative complementation assay. For chorismate mutase, a key enzyme in the biosynthesis of aromatic amino acids, we demonstrate the design of natural-like catalytic function with substantial sequence diversity. Further optimization focuses the generative model toward function in a specific genomic context. The data show that sequence-based statistical models suffice to specify proteins and provide access to an enormous space of functional sequences. This result provides a foundation for a general process for evolution-based design of artificial proteins.
Affiliation(s)
- William P Russ
- University of Texas Southwestern Medical Center, Dallas, TX, USA
| | - Matteo Figliuzzi
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Laboratoire de Biologie Computationnelle and Quantitative, Paris, France
| | | | - Pierre Barrat-Charlaix
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Laboratoire de Biologie Computationnelle and Quantitative, Paris, France.,Biozentrum, University of Basel, Basel, Switzerland
| | - Michael Socolich
- Center for Physics of Evolving Systems, Biochemistry and Molecular Biology and the Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL, USA
| | - Peter Kast
- Laboratory of Organic Chemistry, ETH Zurich, Switzerland
| | - Donald Hilvert
- Laboratory of Organic Chemistry, ETH Zurich, Switzerland
| | - Remi Monasson
- Laboratoire de Physique de l'Ecole Normale Supérieure, PSL and CNRS, Paris, France
| | - Simona Cocco
- Laboratoire de Physique de l'Ecole Normale Supérieure, PSL and CNRS, Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris Seine, Laboratoire de Biologie Computationnelle and Quantitative, Paris, France.
| | - Rama Ranganathan
- Center for Physics of Evolving Systems, Biochemistry and Molecular Biology and the Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL, USA.
| |
|
24
|
Gandarilla-Pérez CA, Mergny P, Weigt M, Bitbol AF. Statistical physics of interacting proteins: Impact of dataset size and quality assessed in synthetic sequences. Phys Rev E 2020; 101:032413. [PMID: 32290011 DOI: 10.1103/physreve.101.032413] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 12/23/2019] [Accepted: 03/04/2020] [Indexed: 11/07/2022]
Abstract
Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g., direct coupling analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins and identifying pairs of residues which form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and interblock couplings protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets of known interaction partners are available and that an iterative pairing algorithm allows predictions to be made even without a training set. Noise in the training data deteriorates performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly when increasing the dataset. This implies that it is generally good to incorporate more data even if its quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
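One possible reading of the scaling statement above can be written out explicitly. The symbols are assumptions introduced here, not notation from the abstract: N is the number of training pairs and ρ the fraction of them that are correctly paired. If correct pairs add signal coherently while all pairs add independent fluctuations, then

```latex
\text{signal} \propto \rho N, \qquad
\text{noise} \propto \sqrt{N}, \qquad
\frac{\text{signal}}{\text{noise}} \propto \rho \sqrt{N}
\;\Longrightarrow\;
N \propto \rho^{-2} \quad \text{at fixed signal-to-noise ratio.}
```

Under these assumptions, halving the quality ρ would be compensated by roughly four times as much data, which is consistent with the abstract's conclusion that adding imperfect data is generally beneficial.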
Affiliation(s)
- Carlos A Gandarilla-Pérez
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), F-75005 Paris, France.,Facultad de Física, Universidad de la Habana, San Lázaro y L, Vedado, Habana 4, CP-10400, Cuba
| | - Pierre Mergny
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), F-75005 Paris, France.,Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire Jean Perrin (LJP, UMR 8237), F-75005 Paris, France
| | - Martin Weigt
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire de Biologie Computationnelle et Quantitative (LCQB, UMR 7238), F-75005 Paris, France
| | - Anne-Florence Bitbol
- Sorbonne Université, CNRS, Institut de Biologie Paris-Seine, Laboratoire Jean Perrin (LJP, UMR 8237), F-75005 Paris, France.,Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015 Lausanne, Switzerland
| |
|
25
|
Pensar J, Puranen S, Arnold B, MacAlasdair N, Kuronen J, Tonkin-Hill G, Pesonen M, Xu Y, Sipola A, Sánchez-Busó L, Lees JA, Chewapreecha C, Bentley SD, Harris SR, Parkhill J, Croucher NJ, Corander J. Genome-wide epistasis and co-selection study using mutual information. Nucleic Acids Res 2019; 47:e112. [PMID: 31361894 PMCID: PMC6765119 DOI: 10.1093/nar/gkz656] [Citation(s) in RCA: 28] [Impact Index Per Article: 4.7] [Received: 02/04/2019] [Revised: 07/09/2019] [Accepted: 07/19/2019] [Indexed: 01/19/2023] Open
Abstract
Covariance-based discovery of polymorphisms under co-selective pressure or epistasis has received considerable recent attention in population genomics. Both statistical modeling of the population-level covariation of alleles across the chromosome and model-free testing of dependencies between pairs of polymorphisms have been shown to successfully uncover patterns of selection in bacterial populations. Here we introduce a model-free method, SpydrPick, whose computational efficiency enables analysis at the scale of pan-genomes of many bacteria. SpydrPick incorporates an efficient correction for population structure, which adjusts for the phylogenetic signal in the data without requiring an explicit phylogenetic tree. We also introduce a new type of visualization of the results similar to the Manhattan plots used in genome-wide association studies, which enables rapid exploration of the identified signals of co-evolution. Simulations demonstrate the usefulness of our method and give some insight into when this type of analysis is most likely to be successful. Application of the method to large population genomic datasets of two major human pathogens, Streptococcus pneumoniae and Neisseria meningitidis, revealed both previously identified and novel putative targets of co-selection related to virulence and antibiotic resistance, highlighting the potential of this approach to drive molecular discoveries, even in the absence of phenotypic data.
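As a rough illustration of the model-free pairwise dependency testing mentioned in this abstract, the sketch below computes a plug-in mutual information estimate between two alignment columns. It is not SpydrPick's implementation: the function name `mutual_information` and the toy allele columns are assumptions made for illustration, and the sketch omits the population-structure correction and any significance thresholding.

```python
from collections import Counter
from math import log


def mutual_information(col_a, col_b):
    """Plug-in mutual information (in nats) between two equal-length allele columns.

    Hypothetical helper for illustration only; no sequence reweighting,
    pseudocounts, or population-structure correction is applied.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of sequences")
    n = len(col_a)
    if n == 0:
        raise ValueError("columns must not be empty")

    count_a = Counter(col_a)                # marginal allele counts at site A
    count_b = Counter(col_b)                # marginal allele counts at site B
    count_ab = Counter(zip(col_a, col_b))   # joint allele-pair counts

    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        # p_ab * log(p_ab / (p_a * p_b)), written with raw counts
        mi += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return mi


# Toy columns (assumed data): one fully linked pair and one independent pair.
linked = mutual_information("AATT", "CCGG")
independent = mutual_information("AATT", "CGCG")
```

With these toy columns, the fully linked pair gives log 2 nats and the independent pair gives zero, which is the contrast a pairwise scan ranks on before any correction for shared ancestry.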
Affiliation(s)
- Johan Pensar
- Department of Mathematics and Statistics, Helsinki Institute for Information Technology (HIIT), Faculty of Science, University of Helsinki, FI-00014 Helsinki, Finland
| | - Santeri Puranen
- Department of Mathematics and Statistics, Helsinki Institute for Information Technology (HIIT), Faculty of Science, University of Helsinki, FI-00014 Helsinki, Finland.,Department of Computer Science, Aalto University, Espoo, FI-00014, Finland
| | - Brian Arnold
- Division of Informatics, Faculty of Arts and Sciences, Harvard University, Cambridge, MA 02138, USA
| | - Neil MacAlasdair
- Parasites and Microbes, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
| | - Juri Kuronen
- Department of Biostatistics, University of Oslo, Oslo, 0317, Norway
| | - Gerry Tonkin-Hill
- Parasites and Microbes, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
| | - Maiju Pesonen
- Department of Mathematics and Statistics, Helsinki Institute for Information Technology (HIIT), Faculty of Science, University of Helsinki, FI-00014 Helsinki, Finland.,Department of Computer Science, Aalto University, Espoo, FI-00014, Finland
| | - Yingying Xu
- Department of Mathematics and Statistics, Helsinki Institute for Information Technology (HIIT), Faculty of Science, University of Helsinki, FI-00014 Helsinki, Finland.,Department of Computer Science, Aalto University, Espoo, FI-00014, Finland
| | - Aleksi Sipola
- Department of Mathematics and Statistics, Helsinki Institute for Information Technology (HIIT), Faculty of Science, University of Helsinki, FI-00014 Helsinki, Finland
| | | | - John A Lees
- Department of Microbiology, New York University School of Medicine, New York, NY 10016, USA
| | - Claire Chewapreecha
- Department of Medicine, University of Cambridge, Cambridge CB2 0QQ, UK.,Bioinformatics & Systems Biology program, King Mongkut's University of Technology Thonburi, Bangkok 10150, Thailand
| | - Stephen D Bentley
- Parasites and Microbes, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
| | - Simon R Harris
- Parasites and Microbes, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK
| | - Julian Parkhill
- Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge, CB3 0ES, UK
| | - Nicholas J Croucher
- MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, St. Mary's Campus, Imperial College London, London, W2 1PG, UK
| | - Jukka Corander
- Department of Mathematics and Statistics, Helsinki Institute for Information Technology (HIIT), Faculty of Science, University of Helsinki, FI-00014 Helsinki, Finland.,Parasites and Microbes, Wellcome Sanger Institute, Cambridge, CB10 1SA, UK.,Department of Biostatistics, University of Oslo, Oslo, 0317, Norway
| |
|
26
|
Coevolutive, evolutive and stochastic information in protein-protein interactions. Comput Struct Biotechnol J 2019; 17:1429-1435. [PMID: 31871588 PMCID: PMC6906720 DOI: 10.1016/j.csbj.2019.10.005] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 08/01/2019] [Revised: 10/19/2019] [Accepted: 10/22/2019] [Indexed: 11/24/2022] Open
Abstract
Here, we investigate the contributions of coevolutive, evolutive and stochastic information in determining protein-protein interactions (PPIs) based on primary sequences of two interacting protein families A and B. Specifically, under the assumption that coevolutive information is imprinted on the interacting amino acids of two proteins, in contrast to other (evolutive and stochastic) sources spread over their sequences, we dissect those contributions in terms of compensatory mutations at physically coupled and uncoupled amino acids of A and B. We find that physically coupled amino acids at short-range distances store the largest per-contact mutual information content, with a significant fraction of that content resulting from coevolutive sources alone. The information stored in coupled amino acids is further shown to discriminate multi-sequence alignments (MSAs) with the largest expectation fraction of PPI matches, a conclusion that holds against various definitions of intermolecular contacts and binding modes. When compared to the informational content resulting from evolution at long-range interactions, the mutual information in physically coupled amino acids is the strongest signal to distinguish PPIs derived from cospeciation and likely the unique indication in the case of molecular coevolution in independent genomes, as the evolutive information must vanish for uncorrelated proteins.
|
27
|
Phylogenetic correlations can suffice to infer protein partners from sequences. PLoS Comput Biol 2019; 15:e1007179. [PMID: 31609984 PMCID: PMC6812855 DOI: 10.1371/journal.pcbi.1007179] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 06/10/2019] [Revised: 10/24/2019] [Accepted: 09/25/2019] [Indexed: 12/30/2022] Open
Abstract
Determining which proteins interact together is crucial to a systems-level understanding of the cell. Recently, algorithms based on Direct Coupling Analysis (DCA) pairwise maximum-entropy models have made it possible to identify interaction partners among paralogous proteins from sequence data. This success of DCA at predicting protein-protein interactions could be mainly based on its known ability to identify pairs of residues that are in contact in the three-dimensional structure of protein complexes and that coevolve to remain physicochemically complementary. However, interacting proteins possess similar evolutionary histories. What is the role of purely phylogenetic correlations in the performance of DCA-based methods to infer interaction partners? To address this question, we employ controlled synthetic data that only involve phylogeny and no interactions or contacts. We find that DCA accurately identifies the pairs of synthetic sequences that share evolutionary history. While phylogenetic correlations confound the identification of contacting residues by DCA, they are thus useful to predict interacting partners among paralogs. We find that DCA performs as well as phylogenetic methods to this end, and slightly better than them with large and accurate training sets. Employing DCA or phylogenetic methods within an Iterative Pairing Algorithm (IPA) allows the prediction of pairs of evolutionary partners without a training set. We further demonstrate the ability of these various methods to correctly predict pairings among real paralogous proteins with genome proximity but no known direct physical interaction, illustrating the importance of phylogenetic correlations in natural data. However, for physically interacting and strongly coevolving proteins, DCA and mutual information outperform phylogenetic methods. We finally discuss how to distinguish physically interacting proteins from proteins that only share a common evolutionary history.
Many biologically important protein-protein interactions are conserved over evolutionary time scales. This leads to two different signals that can be used to computationally predict interactions between protein families and to identify specific interaction partners. First, the shared evolutionary history leads to highly similar phylogenetic relationships between interacting proteins of the two families. Second, the need to keep the interaction surfaces of partner proteins biophysically compatible causes a correlated amino-acid usage of interface residues. Employing simulated data, we show that the shared history alone can be used to detect partner proteins. Similar accuracies are achieved by algorithms comparing phylogenetic relationships and by methods based on Direct Coupling Analysis (DCA), which are primarily known for their ability to detect the second type of signal. Using natural sequence data, we show that in cases with shared evolutionary history but without known physical interactions, both methods work with similar accuracy, while for some physically interacting systems, DCA and mutual information outperform phylogenetic methods. We propose methods allowing both to predict interactions between protein families and to find interacting partners among paralogs.
|
28
|
Fongang B, Cunningham KA, Rowicka M, Kudlicki A. Coevolution of Residues Provides Evidence of a Functional Heterodimer of 5-HT2AR and 5-HT2CR Involving Both Intracellular and Extracellular Domains. Neuroscience 2019; 412:48-59. [PMID: 31158438 PMCID: PMC7299066 DOI: 10.1016/j.neuroscience.2019.05.013] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 01/28/2019] [Revised: 05/02/2019] [Accepted: 05/07/2019] [Indexed: 10/26/2022]
Abstract
Serotonin is a neurotransmitter that plays a role in regulating activities such as sleep, appetite, mood and substance abuse disorders; serotonin receptors 5-HT2AR and 5-HT2CR are active within pathways associated with substance abuse. It has been suggested that 5-HT2AR and 5-HT2CR may form a dimer that affects behavioral processes. Here we study the coevolution of residues in 5-HT2AR and 5-HT2CR to identify potential interactions between residues in both proteins. Coevolution studies can detect protein interactions, and since the thus uncovered interactions are subject to evolutionary pressure, they are likely functional. We assessed the significance of the 5-HT2AR/5-HT2CR interactions using randomized phylogenetic trees and found the coevolution significant (p-value = 0.01). We also discuss how co-expression of the receptors suggests the predicted interaction is functional. Finally, we analyze how several single nucleotide polymorphisms for the 5-HT2AR and 5-HT2CR genes affect their interaction. Our findings are the first to characterize the binding interface of 5-HT2AR/5-HT2CR and indicate a correlation between this interface and location of SNPs in both proteins.
MESH Headings
- Animals
- Databases, Genetic
- Evolution, Molecular
- Papio anubis
- Phosphorylation
- Receptor, Serotonin, 5-HT2A/genetics
- Receptor, Serotonin, 5-HT2A/metabolism
- Receptor, Serotonin, 5-HT2C/genetics
- Receptor, Serotonin, 5-HT2C/metabolism
- Transcriptome
Affiliation(s)
- Bernard Fongang
- Department of Biochemistry and Molecular Biology, University of Texas Medical Branch, Galveston, TX 77555, USA; Glenn Biggs Institute for Alzheimer's & Neurodegenerative Diseases, UTHSCSA, San Antonio, TX 78229, USA; Department of Biochemistry and Structural Biology, UTHSCSA, San Antonio, TX 78229, USA; Department of Epidemiology and Biostatistics, UTHSCSA, San Antonio, TX 78229, USA.
| | - Kathryn A Cunningham
- Center for Addiction Research and Department of Pharmacology and Toxicology, University of Texas Medical Branch, Galveston, TX 77555, USA
| | - Maga Rowicka
- Department of Biochemistry and Molecular Biology, University of Texas Medical Branch, Galveston, TX 77555, USA; Institute for Translational Sciences, University of Texas Medical Branch, Galveston, TX 77555, USA; Sealy Center for Molecular Medicine, University of Texas Medical Branch, Galveston, TX 77555, USA
| | - Andrzej Kudlicki
- Department of Biochemistry and Molecular Biology, University of Texas Medical Branch, Galveston, TX 77555, USA; Institute for Translational Sciences, University of Texas Medical Branch, Galveston, TX 77555, USA; Sealy Center for Molecular Medicine, University of Texas Medical Branch, Galveston, TX 77555, USA.
| |
|
29
|
Cong Q, Anishchenko I, Ovchinnikov S, Baker D. Protein interaction networks revealed by proteome coevolution. Science 2019; 365:185-189. [PMID: 31296772 PMCID: PMC6948103 DOI: 10.1126/science.aaw6718] [Citation(s) in RCA: 128] [Impact Index Per Article: 21.3] [Received: 01/15/2019] [Accepted: 06/07/2019] [Indexed: 01/19/2023]
Abstract
Residue-residue coevolution has been observed across a number of protein-protein interfaces, but the extent of residue coevolution between protein families on the whole-proteome scale has not been systematically studied. We investigate coevolution between 5.4 million pairs of proteins in Escherichia coli and between 3.9 million pairs in Mycobacterium tuberculosis. We find strong coevolution for binary complexes involved in metabolism and weaker coevolution for larger complexes playing roles in genetic information processing. We take advantage of this coevolution, in combination with structure modeling, to predict protein-protein interactions (PPIs) with an accuracy that benchmark studies suggest is considerably higher than that of proteome-wide two-hybrid and mass spectrometry screens. We identify hundreds of previously uncharacterized PPIs in E. coli and M. tuberculosis that both add components to known protein complexes and networks and establish the existence of new ones.
Affiliation(s)
- Qian Cong
- Department of Biochemistry, University of Washington, Seattle, WA 98105, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
| | - Ivan Anishchenko
- Department of Biochemistry, University of Washington, Seattle, WA 98105, USA
- Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
| | - Sergey Ovchinnikov
- John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA 02138, USA
| | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA 98105, USA.
- Institute for Protein Design, University of Washington, Seattle, WA 98105, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA 98105, USA
| |
|
30
|
The role of coevolutionary signatures in protein interaction dynamics, complex inference, molecular recognition, and mutational landscapes. Curr Opin Struct Biol 2019; 56:179-186. [PMID: 31029927 DOI: 10.1016/j.sbi.2019.03.024] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 03/10/2019] [Revised: 03/18/2019] [Accepted: 03/19/2019] [Indexed: 11/22/2022]
Abstract
Evolution imposes constraints at the interface of interacting biomolecules in order to preserve function or maintain fitness. This pressure may have a direct effect on the sequence composition of interacting biomolecules. As a result, statistical patterns of amino acid or nucleotide covariance that encode physical and functional interactions are observed in sequences of extant organisms. In recent years, global pairwise models of amino acid and nucleotide coevolution from multiple sequence alignments have been developed and utilized to study molecular interactions in structural biology. In proteins, for which the energy landscape is funneled and minimally frustrated, a direct connection between the physical and sequence space landscapes can be established. Estimating coevolutionary information from sequences of interacting molecules has a broad impact in molecular biology. Applications include the accurate determination of 3D structures of molecular complexes, inference of protein interaction partners, models of protein-protein interaction specificity, the elucidation and design of protein-nucleic acid recognition, as well as the discovery of genome-wide epistatic effects. The current state of the art of coevolutionary analysis includes biomedical applications ranging from mutational landscapes and drug design to vaccine development.
|