1
|
Epistasis facilitates functional evolution in an ancient transcription factor. eLife 2024; 12:RP88737. [PMID: 38767330 PMCID: PMC11105156 DOI: 10.7554/elife.88737] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
A protein's genetic architecture - the set of causal rules by which its sequence produces its functions - also determines its possible evolutionary trajectories. Prior research has proposed that the genetic architecture of proteins is very complex, with pervasive epistatic interactions that constrain evolution and make function difficult to predict from sequence. Most of this work has analyzed only the direct paths between two proteins of interest - excluding the vast majority of possible genotypes and evolutionary trajectories - and has considered only a single protein function, leaving unaddressed the genetic architecture of functional specificity and its impact on the evolution of new functions. Here, we develop a new method based on ordinal logistic regression to directly characterize the global genetic determinants of multiple protein functions from 20-state combinatorial deep mutational scanning (DMS) experiments. We use it to dissect the genetic architecture and evolution of a transcription factor's specificity for DNA, using data from a combinatorial DMS of an ancient steroid hormone receptor's capacity to activate transcription from two biologically relevant DNA elements. We show that the genetic architecture of DNA recognition consists of a dense set of main and pairwise effects that involve virtually every possible amino acid state in the protein-DNA interface, but higher-order epistasis plays only a tiny role. Pairwise interactions enlarge the set of functional sequences and are the primary determinants of specificity for different DNA elements. They also massively expand the number of opportunities for single-residue mutations to switch specificity from one DNA target to another. By bringing variants with different functions close together in sequence space, pairwise epistasis therefore facilitates rather than constrains the evolution of new functions.
|
2
|
Chemical codes promote selective compartmentalization of proteins. bioRxiv 2024:2024.04.15.589616. [PMID: 38659952 PMCID: PMC11042338 DOI: 10.1101/2024.04.15.589616] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/26/2024]
Abstract
Cells have evolved mechanisms to distribute ~10 billion protein molecules to subcellular compartments where diverse proteins involved in shared functions must efficiently assemble. Such assembly is presumed to unfold as a result of specific interactions between biomolecules; however, recent evidence suggests that distinctive chemical environments within subcellular compartments may also play an important role. Here, we test the hypothesis that protein groups with shared functions also share codes that guide them to compartment destinations. To test our hypothesis, we developed a transformer large language model, called ProtGPS, that predicts with high performance the compartment localization of human proteins excluded from the training set. We then demonstrate ProtGPS can be used for guided generation of novel protein sequences that selectively assemble into specific compartments in cells. Furthermore, ProtGPS predictions were sensitive to disease-associated mutations that produce changes in protein compartmentalization, suggesting that this type of pathogenic dysfunction can be discovered in silico. Our results indicate that protein sequences contain not only a folding code, but also a previously unrecognized chemical code governing their distribution in specific cellular compartments.
|
3
|
Cryptic genetic variation shapes the fate of gene duplicates in a protein interaction network. bioRxiv 2024:2024.02.23.581840. [PMID: 38464075 PMCID: PMC10925128 DOI: 10.1101/2024.02.23.581840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Paralogous genes are often redundant for long periods of time before they diverge in function. While their functions are preserved, paralogous proteins can accumulate mutations that, through epistasis, could impact their fate in the future. By quantifying the impact of all single-amino acid substitutions on the binding of two myosin proteins to their interaction partners, we find that the future evolution of these proteins is highly contingent on their regulatory divergence and the mutations that have silently accumulated in their protein binding domains. Differences in the promoter strength of the two paralogs amplify the impact of mutations on binding in the lowly expressed one. While some mutations would be sufficient to non-functionalize one paralog, they would have minimal impact on the other. Our results reveal how functionally equivalent protein domains could be destined to specific fates by regulatory and cryptic coding sequence changes that currently have little to no functional impact.
|
4
|
Kinesin-7 CENP-E in tumorigenesis: Chromosome instability, spindle assembly checkpoint, and applications. Front Mol Biosci 2024; 11:1366113. [PMID: 38560520 PMCID: PMC10978661 DOI: 10.3389/fmolb.2024.1366113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2024] [Accepted: 03/04/2024] [Indexed: 04/04/2024]
Abstract
Kinesin motors are a large family of molecular motors that walk along microtubules to fulfill many roles in intracellular transport, microtubule organization, and chromosome alignment. Kinesin-7 CENP-E (Centromere protein E) is a chromosome scaffold-associated protein that is located in the corona layer of centromeres and participates in kinetochore-microtubule attachment, chromosome alignment, and the spindle assembly checkpoint. Over the past 3 decades, CENP-E has attracted great interest as a promising new mitotic target for cancer therapy and drug development. In this review, we describe expression patterns of CENP-E in multiple tumors and highlight the functions of CENP-E in cancer cell proliferation. We summarize recent advances in structural domains, roles, and functions of CENP-E in cell division. Notably, we describe the dual functions of CENP-E in inhibiting and promoting tumorigenesis. We summarize the mechanisms by which CENP-E affects tumorigenesis through chromosome instability and spindle assembly checkpoints. Finally, we review the CENP-E-specific inhibitors, mechanisms of drug resistance, and their applications.
|
5
|
Accurate top protein variant discovery via low-N pick-and-validate machine learning. Cell Syst 2024; 15:193-203.e6. [PMID: 38340729 DOI: 10.1016/j.cels.2024.01.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 10/11/2023] [Accepted: 01/18/2024] [Indexed: 02/12/2024]
Abstract
A strategy to obtain the greatest number of best-performing variants with the least amount of experimental effort over the vast combinatorial mutational landscape would have enormous utility in boosting resource producibility for protein engineering. Toward this goal, we present a simple and effective machine learning-based strategy that outperforms other state-of-the-art methods. Our strategy integrates zero-shot prediction and multi-round sampling to direct active learning via experimenting with only a few predicted top variants. We find that four rounds of low-N pick-and-validate sampling of 12 variants for machine learning yielded the best accuracy of up to 92.6% in selecting the true top 1% variants in combinatorial mutant libraries, whereas two rounds of 24 variants can also be used. We demonstrate our strategy in successfully discovering high-performance protein variants from diverse families including the CRISPR-based genome editors, supporting its generalizable application for solving protein engineering tasks. A record of this paper's transparent peer review process is included in the supplemental information.
|
6
|
The simplicity of protein sequence-function relationships. bioRxiv 2024:2023.09.02.556057. [PMID: 37732229 PMCID: PMC10508729 DOI: 10.1101/2023.09.02.556057] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 09/22/2023]
Abstract
How complicated is the genetic architecture of proteins - the set of causal effects by which sequence determines function? High-order epistatic interactions among residues are thought to be pervasive, making a protein's function difficult to predict or understand from its sequence. Most studies, however, used methods that overestimate epistasis, because they analyze genetic architecture relative to a designated reference sequence - causing measurement noise and small local idiosyncrasies to propagate into pervasive high-order interactions - or have not effectively accounted for global nonlinearity in the sequence-function relationship. Here we present a new reference-free method that jointly estimates global nonlinearity and specific epistatic interactions across a protein's entire genotype-phenotype map. This method yields a maximally efficient explanation of a protein's genetic architecture and is more robust than existing methods to measurement noise, partial sampling, and model misspecification. We reanalyze 20 combinatorial mutagenesis experiments from a diverse set of proteins and find that additive and pairwise effects, along with a simple nonlinearity to account for limited dynamic range, explain a median of 96% of total variance in measured phenotypes (and >92% in every case). Only a tiny fraction of genotypes are strongly affected by third- or higher-order epistasis. Genetic architecture is also sparse: the number of terms required to explain the vast majority of variance is smaller than the number of genotypes by many orders of magnitude. The sequence-function relationship in most proteins is therefore far simpler than previously thought, opening the way for new and tractable approaches to characterize it.
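The idea of measuring effects relative to the global average rather than to a designated reference sequence can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a complete, noise-free landscape, omits the global nonlinearity the abstract mentions, and the function names and toy phenotype values below are hypothetical.

```python
import itertools
import numpy as np

def reference_free_effects(phenotypes, n_sites, n_states):
    """Zero-, first- and second-order effects measured against the global mean.

    Assumes `phenotypes` maps every possible genotype tuple to a value
    (a complete landscape); this is an illustrative simplification.
    """
    genotypes = list(itertools.product(range(n_states), repeat=n_sites))
    y = np.array([phenotypes[g] for g in genotypes], dtype=float)
    G = np.array(genotypes)
    mean = y.mean()
    # First-order effect: mean phenotype of genotypes carrying a state, minus global mean.
    first = {(i, a): y[G[:, i] == a].mean() - mean
             for i in range(n_sites) for a in range(n_states)}
    # Second-order effect: mean of genotypes carrying a pair of states,
    # minus the global mean and the two first-order effects.
    pair = {}
    for i, j in itertools.combinations(range(n_sites), 2):
        for a, b in itertools.product(range(n_states), repeat=2):
            mask = (G[:, i] == a) & (G[:, j] == b)
            pair[(i, a, j, b)] = y[mask].mean() - mean - first[(i, a)] - first[(j, b)]
    return mean, first, pair, genotypes, y

def variance_explained(phenotypes, n_sites, n_states):
    """Fraction of phenotypic variance captured by additive and additive+pairwise models."""
    mean, first, pair, genotypes, y = reference_free_effects(phenotypes, n_sites, n_states)
    site_pairs = list(itertools.combinations(range(n_sites), 2))
    additive = np.array([mean + sum(first[(i, g[i])] for i in range(n_sites))
                         for g in genotypes])
    pairwise = additive + np.array([sum(pair[(i, g[i], j, g[j])] for i, j in site_pairs)
                                    for g in genotypes])
    total = ((y - mean) ** 2).sum()
    r2_additive = 1 - ((y - additive) ** 2).sum() / total
    r2_pairwise = 1 - ((y - pairwise) ** 2).sum() / total
    return r2_additive, r2_pairwise

# Hypothetical toy landscape: three binary sites, additive terms plus one pairwise interaction.
toy = {g: 1.0 * g[0] + 0.5 * g[1] - 0.3 * g[2] + 0.8 * g[0] * g[1]
       for g in itertools.product((0, 1), repeat=3)}
r2_additive, r2_pairwise = variance_explained(toy, n_sites=3, n_states=2)
print(f"additive: {r2_additive:.3f}, additive + pairwise: {r2_pairwise:.3f}")
```

Because the toy landscape contains no third-order term, the additive plus pairwise model should reproduce it exactly, while the additive model alone leaves a small residual. Real data would additionally need handling for missing genotypes and measurement noise.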
|
7
|
Evolution shapes interaction patterns for epistasis and specific protein binding in a two-component signaling system. Commun Chem 2024; 7:13. [PMID: 38233668 PMCID: PMC10794238 DOI: 10.1038/s42004-024-01098-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 01/05/2024] [Indexed: 01/19/2024]
Abstract
The elegant design of protein sequence/structure/function relationships arises from the interaction patterns between amino acid positions. A central question is how evolutionary forces shape the interaction patterns that encode long-range epistasis and binding specificity. Here, we combined family-wide evolutionary analysis of natural homologous sequences and structure-oriented evolution simulation for a two-component signaling (TCS) system. The magnitude-frequency relationship of coupling conservation between positions manifests a power-law-like distribution, and the positions with high coupling conservation are sparse but concentrated on the binding surfaces and hydrophobic core. The structure-specific interaction pattern involves further optimization of local frustrations at or near the binding surface to adapt to the binding partner. The construction of family-wide conserved interaction patterns and structure-specific ones demonstrates that binding specificity is modulated by both direct intermolecular interactions and long-range epistasis across the binding complex. Evolution sculpts the interaction patterns via sequence variations at both family-wide and structure-specific levels for the TCS system.
|
8
|
Evolutionary paths that link orthogonal pairs of binding proteins. Research Square 2023:rs.3.rs-2836905. [PMID: 37131620 PMCID: PMC10153392 DOI: 10.21203/rs.3.rs-2836905/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Some protein binding pairs exhibit extreme specificities that functionally insulate them from homologs. Such pairs evolve mostly by accumulating single-point mutations, and mutants are selected if their affinity exceeds the threshold required for function [1-4]. Thus, homologous and high-specificity binding pairs bring to light an evolutionary conundrum: how does a new specificity evolve while maintaining the required affinity in each intermediate [5,6]? Until now, a fully functional single-mutation path that connects two orthogonal pairs has only been described where the pairs were mutationally close, thus enabling experimental enumeration of all intermediates [2]. We present an atomistic and graph-theoretical framework for discovering low molecular strain single-mutation paths that connect two extant pairs, enabling enumeration beyond experimental capability. We apply it to two orthogonal bacterial colicin endonuclease-immunity pairs separated by 17 interface mutations [7]. We were not able to find a strain-free and functional path in the sequence space defined by the two extant pairs. But including mutations that bridge amino acids that cannot be exchanged through single-nucleotide mutations led us to a strain-free 19-mutation trajectory that is completely viable in vivo. Our experiments show that the specificity switch is remarkably abrupt, resulting from only one radical mutation on each partner. Furthermore, each of the critical specificity-switch mutations increases fitness, demonstrating that functional divergence could be driven by positive Darwinian selection. These results reveal how even radical functional changes in an epistatic fitness landscape may evolve.
|
9
|
STAR: A Web Server for Assisting Directed Protein Evolution with Machine Learning. ACS Omega 2023; 8:44751-44756. [PMID: 38046324 PMCID: PMC10688154 DOI: 10.1021/acsomega.3c04832] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Revised: 10/10/2023] [Accepted: 10/12/2023] [Indexed: 12/05/2023]
Abstract
Protein engineering has made significant contributions to industries such as agriculture, food, and pharmaceuticals. In recent years, directed evolution combined with artificial intelligence has emerged as a cutting-edge R&D approach. However, the application of machine learning techniques can be challenging for those without relevant experience and coding skills. To address this issue, we have developed a web-based protein sequence recommendation system: STAR (Sequence recommendaTion via ARtificial intelligence). Our system utilizes Bayesian optimization as its backbone and includes a filtering step using a regression model to enhance the success rate of recommended sequences. Additionally, we have incorporated an in silico-directed evolution approach to expand the exploration of the protein space. The Web site can be accessed at https://www.FindProteinStar.com/.
|
10
|
Deep-learning-assisted Sort-Seq enables high-throughput profiling of gene expression characteristics with high precision. Science Advances 2023; 9:eadg5296. [PMID: 37939173 PMCID: PMC10631719 DOI: 10.1126/sciadv.adg5296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2023] [Accepted: 10/06/2023] [Indexed: 11/10/2023]
Abstract
Owing to the nondeterministic and nonlinear nature of gene expression, the steady-state intracellular protein abundance of a clonal population forms a distribution. The characteristics of this distribution, including expression strength and noise, are closely related to cellular behavior. However, quantitative description of these characteristics has so far relied on arrayed methods, which are time-consuming and labor-intensive. To address this issue, we propose a deep-learning-assisted Sort-Seq approach (dSort-Seq) in this work, enabling high-throughput profiling of expression properties with high precision. We demonstrated the validity of dSort-Seq for large-scale assaying of the dose-response relationships of biosensors. In addition, we comprehensively investigated the contribution of transcription and translation to noise production in Escherichia coli, from which we found that the expression noise is strongly coupled with the mean expression level. We also found that the transcriptional interference caused by overlapping RpoD-binding sites contributes to noise production, which suggested the existence of a simple and feasible noise control strategy in E. coli.
|
11
|
Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief Bioinform 2023; 24:bbad289. [PMID: 37580175 PMCID: PMC10516362 DOI: 10.1093/bib/bbad289] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 05/08/2023] [Revised: 07/14/2023] [Accepted: 07/26/2023] [Indexed: 08/16/2023]
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
|
12
|
Learning protein fitness landscapes with deep mutational scanning data from multiple sources. Cell Syst 2023; 14:706-721.e5. [PMID: 37591206 DOI: 10.1016/j.cels.2023.07.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2023] [Revised: 05/30/2023] [Accepted: 07/18/2023] [Indexed: 08/19/2023]
Abstract
One of the key points of machine learning-assisted directed evolution (MLDE) is the accurate learning of the fitness landscape, a conceptual mapping from sequence variants to the desired function. Here, we describe a multi-protein training scheme that leverages the existing deep mutational scanning data from diverse proteins to aid in understanding the fitness landscape of a new protein. Proof-of-concept trials are designed to validate this training scheme in three aspects: random and positional extrapolation for single-variant effects, zero-shot fitness predictions for new proteins, and extrapolation for higher-order variant effects from single-variant effects. Moreover, our study identified previously overlooked strong baselines, and their unexpectedly good performance brings our attention to the pitfalls of MLDE. Overall, these results may improve our understanding of the association between different protein fitness profiles and shed light on developing better machine learning-assisted approaches to the directed evolution of proteins. A record of this paper's transparent peer review process is included in the supplemental information.
|
13
|
Jointly modeling deep mutational scans identifies shifted mutational effects among SARS-CoV-2 spike homologs. bioRxiv 2023:2023.07.31.551037. [PMID: 37577604 PMCID: PMC10418112 DOI: 10.1101/2023.07.31.551037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2023]
Abstract
Deep mutational scanning (DMS) is a high-throughput experimental technique that measures the effects of thousands of mutations to a protein. These experiments can be performed on multiple homologs of a protein or on the same protein selected under multiple conditions. It is often of biological interest to identify mutations with shifted effects across homologs or conditions. However, it is challenging to determine if observed shifts arise from biological signal or experimental noise. Here, we describe a method for jointly inferring mutational effects across multiple DMS experiments while also identifying mutations that have shifted in their effects among experiments. A key aspect of our method is to regularize the inferred shifts, so that they are nonzero only when strongly supported by the data. We apply this method to DMS experiments that measure how mutations to spike proteins from SARS-CoV-2 variants (Delta, Omicron BA.1, and Omicron BA.2) affect cell entry. Most mutational effects are conserved between these spike homologs, but a fraction have markedly shifted. We experimentally validate a subset of the mutations inferred to have shifted effects, and confirm differences of > 1,000-fold in the impact of the same mutation on spike-mediated viral infection across spikes from different SARS-CoV-2 variants. Overall, our work establishes a general approach for comparing sets of DMS experiments to identify biologically important shifts in mutational effects.
|
14
|
Environment-dependent epistasis increases phenotypic diversity in gene regulatory networks. Science Advances 2023; 9:eadf1773. [PMID: 37224262 DOI: 10.1126/sciadv.adf1773] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2022] [Accepted: 04/17/2023] [Indexed: 05/26/2023]
Abstract
Mutations to gene regulatory networks can be maladaptive or a source of evolutionary novelty. Epistasis confounds our understanding of how mutations affect the expression patterns of gene regulatory networks, a challenge exacerbated by the dependence of epistasis on the environment. We used the toolkit of synthetic biology to systematically assay the effects of pairwise and triplet combinations of mutant genotypes on the expression pattern of a gene regulatory network expressed in Escherichia coli that interprets an inducer gradient across a spatial domain. We uncovered a preponderance of epistasis that can switch in magnitude and sign across the inducer gradient to produce a greater diversity of expression pattern phenotypes than would be possible in the absence of such environment-dependent epistasis. We discuss our findings in the context of the evolution of hybrid incompatibilities and evolutionary novelties.
|
15
|
Robustness and innovation in synthetic genotype networks. Nat Commun 2023; 14:2454. [PMID: 37117168 PMCID: PMC10147661 DOI: 10.1038/s41467-023-38033-3] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/11/2022] [Accepted: 04/13/2023] [Indexed: 04/30/2023]
Abstract
Genotype networks are sets of genotypes connected by small mutational changes that share the same phenotype. They facilitate evolutionary innovation by enabling the exploration of different neighborhoods in genotype space. Genotype networks, first suggested by theoretical models, have been empirically confirmed for proteins and RNAs. Comparative studies also support their existence for gene regulatory networks (GRNs), but direct experimental evidence is lacking. Here, we report the construction of three interconnected genotype networks of synthetic GRNs producing three distinct phenotypes in Escherichia coli. Our synthetic GRNs contain three nodes regulating each other by CRISPR interference and governing the expression of fluorescent reporters. The genotype networks, composed of over twenty different synthetic GRNs, provide robustness in the face of mutations while enabling transitions to innovative phenotypes. Through realistic mathematical modeling, we quantify robustness and evolvability for the complete genotype-phenotype map and link these features mechanistically to GRN motifs. Our work thereby exemplifies how GRN evolution along genotype networks might be driving evolutionary innovation.
|
16
|
Structural features of sensory two component systems: a synthetic biology perspective. Biochem J 2023; 480:127-140. [PMID: 36688908 DOI: 10.1042/bcj20210798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2022] [Revised: 01/05/2023] [Accepted: 01/06/2023] [Indexed: 01/24/2023]
Abstract
All living organisms include a set of signaling devices that confer the ability to dynamically perceive and adapt to the fluctuating environment. Two-component systems are part of this sensory machinery that regulates the execution of different genetic and/or biochemical programs in response to specific physical or chemical signals. In the last two decades, there has been tremendous progress in our molecular understanding of how signals are detected, the allosteric mechanisms that control intramolecular information transmission, and the specificity determinants that guarantee correct wiring. All this information is starting to be exploited in the development of new synthetic networks. Connecting multiple molecular players, analogous to programming lines of code, can provide the resources to build new sophisticated biocomputing systems. The Synthetic Biology field is starting to revolutionize several scientific fields, such as biomedicine and agriculture, propelling the development of new solutions. Expanding the spectrum of available nanodevices in the toolbox is key to unleashing its full potential. This review aims to discuss, from a structural perspective, how to take advantage of the vast array of sensor and effector protein modules involved in two-component systems for the construction of new synthetic circuits.
|
17
|
Protein engineering via Bayesian optimization-guided evolutionary algorithm and robotic experiments. Brief Bioinform 2023; 24:6958505. [PMID: 36562723 DOI: 10.1093/bib/bbac570] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/30/2022] [Revised: 11/14/2022] [Accepted: 11/22/2022] [Indexed: 12/24/2022]
Abstract
Directed protein evolution applies repeated rounds of genetic mutagenesis and phenotypic screening and is often limited by experimental throughput. Through in silico prioritization of mutant sequences, machine learning has been applied to reduce wet lab burden to a level practical for human researchers. On the other hand, robotics permits large batches and rapid iterations for protein engineering cycles, but such capacities have not been well exploited in existing machine learning-assisted directed evolution approaches. Here, we report a scalable and batched method, Bayesian Optimization-guided EVOlutionary (BO-EVO) algorithm, to guide multiple rounds of robotic experiments to explore protein fitness landscapes of combinatorial mutagenesis libraries. We first examined various design specifications based on an empirical landscape of protein G domain B1. Then, BO-EVO was successfully generalized to another empirical landscape of an Escherichia coli kinase PhoQ, as well as simulated NK landscapes with up to moderate epistasis. This approach was then applied to guide robotic library creation and screening to engineer enzyme specificity of RhlA, a key biosynthetic enzyme for rhamnolipid biosurfactants. A 4.8-fold improvement in producing a target rhamnolipid congener was achieved after examining less than 1% of all possible mutants after four iterations. Overall, BO-EVO proves to be an efficient and general approach to guide combinatorial protein engineering without prior knowledge.
|
18
|
Promiscuity of response regulators for thioredoxin steers bacterial virulence. Nat Commun 2022; 13:6210. [PMID: 36266276 PMCID: PMC9584953 DOI: 10.1038/s41467-022-33983-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/12/2022] [Accepted: 10/11/2022] [Indexed: 12/24/2022]
Abstract
The exquisite specificity between a sensor kinase and its cognate response regulator ensures faithful partner selectivity within two-component pairs concurrently firing in a single bacterium, minimizing crosstalk with other members of this conserved family of paralogous proteins. We show that conserved hydrophobic and charged residues on the surface of thioredoxin serve as a docking station for structurally diverse response regulators. Using the OmpR protein, we identify residues in the flexible linker and the C-terminal β-hairpin that enable associations of this archetypical response regulator with thioredoxin, but are dispensable for interactions of this transcription factor with its cognate sensor kinase EnvZ, DNA or RNA polymerase. Here we show that the promiscuous interactions of response regulators with thioredoxin foster the flow of information through otherwise highly dedicated two-component signaling systems, thereby enabling both the transcription of Salmonella pathogenicity island-2 genes and the growth of this intracellular bacterium in macrophages and mice.
|
19
|
Evolution avoids a pathological stabilizing interaction in the immune protein S100A9. Proc Natl Acad Sci U S A 2022; 119:e2208029119. [PMID: 36194634 PMCID: PMC9565474 DOI: 10.1073/pnas.2208029119] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/09/2022] [Accepted: 09/07/2022] [Indexed: 01/03/2023]
Abstract
Stability constrains evolution. While much is known about constraints on destabilizing mutations, less is known about the constraints on stabilizing mutations. We recently identified a mutation in the innate immune protein S100A9 that provides insight into such constraints. When introduced into human S100A9, M63F simultaneously increases the stability of the protein and disrupts its natural ability to activate Toll-like receptor 4. Using chemical denaturation, we found that M63F stabilizes a calcium-bound conformation of hS100A9. We then used NMR to solve the structure of the mutant protein, revealing that the mutation distorts the hydrophobic binding surface of hS100A9, explaining its deleterious effect on function. Hydrogen-deuterium exchange (HDX) experiments revealed stabilization of the region around M63F in the structure, notably Phe37. In the structure of the M63F mutant, the Phe37 and Phe63 sidechains are in contact, plausibly forming an edge-face π-stack. Mutating Phe37 to Leu abolished the stabilizing effect of M63F as probed by both chemical denaturation and HDX. It also restored the biological activity of S100A9 disrupted by M63F. These findings reveal that Phe63 creates a molecular staple with Phe37 that stabilizes a nonfunctional conformation of the protein, thus disrupting function. Using a bioinformatic analysis, we found that S100A9 proteins from different organisms rarely have Phe at both positions 37 and 63, suggesting that avoiding a pathological stabilizing interaction indeed constrains S100A9 evolution. This work highlights an important evolutionary constraint on stabilizing mutations, namely, that they must avoid inappropriately stabilizing nonfunctional protein conformations.
|
20
|
Intragenic compensation through the lens of deep mutational scanning. Biophys Rev 2022; 14:1161-1182. [PMID: 36345285 PMCID: PMC9636336 DOI: 10.1007/s12551-022-01005-w] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/14/2022] [Accepted: 09/26/2022] [Indexed: 12/20/2022] Open
Abstract
A significant fraction of mutations in proteins are deleterious and result in adverse consequences for protein function, stability, or interaction with other molecules. Intragenic compensation is a specific case of positive epistasis when a neutral missense mutation cancels the effect of a deleterious mutation in the same protein. Permissive compensatory mutations facilitate protein evolution, since without them all sequences would be extremely conserved. Understanding compensatory mechanisms is an important scientific challenge at the intersection of protein biophysics and evolution. In human genetics, intragenic compensatory interactions are important since they may result in variable penetrance of pathogenic mutations or fixation of pathogenic human alleles in orthologous proteins from related species. The latter phenomenon complicates computational and clinical inference of an allele's pathogenicity. Deep mutational scanning is a relatively new technique that enables experimental studies of functional effects of thousands of mutations in proteins. We review the important aspects of the field and discuss existing limitations of current datasets. We reviewed ten published DMS datasets with quantified functional effects of single and double mutations and described rates and patterns of intragenic compensation in eight of them. Supplementary Information: The online version contains supplementary material available at 10.1007/s12551-022-01005-w.
|
21
|
Abstract
One core goal of genetics is to systematically understand the mapping between the DNA sequence of an organism (genotype) and its measurable characteristics (phenotype). Understanding this mapping is often challenging because of interactions between mutations, where the result of combining several different mutations can be very different than the sum of their individual effects. Here we provide a statistical framework for modeling complex genetic interactions of this type. The key idea is to ask how fast the effects of mutations change when introducing the same mutation in increasingly distant genetic backgrounds. We then propose a model for phenotypic prediction that takes into account this tendency for the effects of mutations to be more similar in nearby genetic backgrounds. Contemporary high-throughput mutagenesis experiments are providing an increasingly detailed view of the complex patterns of genetic interaction that occur between multiple mutations within a single protein or regulatory element. By simultaneously measuring the effects of thousands of combinations of mutations, these experiments have revealed that the genotype–phenotype relationship typically reflects not only genetic interactions between pairs of sites but also higher-order interactions among larger numbers of sites. However, modeling and understanding these higher-order interactions remains challenging. Here we present a method for reconstructing sequence-to-function mappings from partially observed data that can accommodate all orders of genetic interaction. The main idea is to make predictions for unobserved genotypes that match the type and extent of epistasis found in the observed data. 
This information on the type and extent of epistasis can be extracted by considering how phenotypic correlations change as a function of mutational distance, which is equivalent to estimating the fraction of phenotypic variance due to each order of genetic interaction (additive, pairwise, three-way, etc.). Using these estimated variance components, we then define an empirical Bayes prior that in expectation matches the observed pattern of epistasis and reconstruct the genotype–phenotype mapping by conducting Gaussian process regression under this prior. To demonstrate the power of this approach, we present an application to the antibody-binding domain GB1 and also provide a detailed exploration of a dataset consisting of high-throughput measurements for the splicing efficiency of human pre-mRNA 5′ splice sites, for which we also validate our model predictions via additional low-throughput experiments.
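The idea of reading the extent of epistasis from how phenotypic correlations change with mutational distance can be illustrated with a small sketch. This is not the authors' implementation: the function names and the toy landscape are assumptions made for illustration, and the code only computes the empirical distance-correlation function, not the variance components or the Gaussian process regression described above.

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length genotypes differ."""
    return sum(x != y for x, y in zip(a, b))


def distance_correlation(phenotypes):
    """Correlation of phenotypes between genotype pairs, grouped by Hamming distance.

    phenotypes: dict mapping genotype string -> measured phenotype.
    Returns a dict {distance: correlation}.
    """
    values = list(phenotypes.values())
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    sums, counts = {}, {}
    for (g1, f1), (g2, f2) in combinations(phenotypes.items(), 2):
        d = hamming(g1, g2)
        sums[d] = sums.get(d, 0.0) + (f1 - mean) * (f2 - mean)
        counts[d] = counts.get(d, 0) + 1
    return {d: sums[d] / counts[d] / var for d in sorted(sums)}


# Hypothetical toy landscape: 4 biallelic sites, purely additive phenotype
# (the phenotype is simply the number of '1' alleles).
landscape = {"".join(g): float(g.count("1")) for g in product("01", repeat=4)}
rho = distance_correlation(landscape)
for d, r in rho.items():
    print(d, round(r, 3))
```

On this purely additive toy landscape the correlation falls off linearly with distance; landscapes with pairwise or higher-order interactions would show a different decay, which is the kind of signal the abstract describes exploiting.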
|
22
|
CLADE 2.0: Evolution-Driven Cluster Learning-Assisted Directed Evolution. J Chem Inf Model 2022; 62:4629-4641. [PMID: 36154171 DOI: 10.1021/acs.jcim.2c01046] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Directed evolution, a revolutionary biotechnology in protein engineering, optimizes protein fitness by searching an astronomical mutational space via expensive experiments. The cluster learning-assisted directed evolution (CLADE) efficiently explores the mutational space via a combination of unsupervised hierarchical clustering and supervised learning. However, the initial-stage sampling in CLADE treats all clusters equally despite many clusters containing a large portion of non-functional mutations. Recent statistical and deep learning tools enable evolutionary density modeling to access protein fitness in an unsupervised manner. In this work, we construct an ensemble of multiple evolutionary scores to guide the initial sampling in CLADE. The resulting evolutionary score-enhanced CLADE, called CLADE 2.0, efficiently selects a training set within a small informative space using the evolution-driven clustering sampling. CLADE 2.0 is validated by using two benchmark libraries, each having 160,000 sequences from four-site mutational combinations. Extensive computational experiments and comparisons with existing cutting-edge methods indicate that CLADE 2.0 is a new state-of-the-art tool for machine learning-assisted directed evolution.
|
23
|
Abstract
Epistatic interactions can make the outcomes of evolution unpredictable, but no comprehensive data are available on the extent and temporal dynamics of changes in the effects of mutations as protein sequences evolve. Here, we use phylogenetic deep mutational scanning to measure the functional effect of every possible amino acid mutation in a series of ancestral and extant steroid receptor DNA binding domains. Across 700 million years of evolution, epistatic interactions caused the effects of most mutations to become decorrelated from their initial effects and their windows of evolutionary accessibility to open and close transiently. Most effects changed gradually and without bias at rates that were largely constant across time, indicating a neutral process caused by many weak epistatic interactions. Our findings show that protein sequences drift inexorably into contingency and unpredictability, but that the process is statistically predictable, given sufficient phylogenetic and experimental data.
|
24
|
Abstract
Widespread availability of protein sequence-fitness data would revolutionize both our biochemical understanding of proteins and our ability to engineer them. Unfortunately, even though thousands of protein variants are generated and evaluated for fitness during a typical protein engineering campaign, most are never sequenced, leaving a wealth of potential sequence-fitness information untapped. Primarily, this is because sequencing is unnecessary for many protein engineering strategies; the added cost and effort of sequencing are thus unjustified. It also results from the fact that, even though many lower-cost sequencing strategies have been developed, they often require at least some access to and experience with sequencing or computational resources, both of which can be barriers to access. Here, we present every variant sequencing (evSeq), a method and collection of tools/standardized components for sequencing a variable region within every variant gene produced during a protein engineering campaign at a cost of cents per variant. evSeq was designed to democratize low-cost sequencing for protein engineers and, indeed, anyone interested in engineering biological systems. Execution of its wet-lab component is simple, requires no sequencing experience to perform, relies only on resources and services typically available to biology labs, and slots neatly into existing protein engineering workflows. Analysis of evSeq data is likewise made simple by its accompanying software (found at github.com/fhalab/evSeq, documentation at fhalab.github.io/evSeq), which can be run on a personal laptop and was designed to be accessible to users with no computational experience. Low-cost and easy-to-use, evSeq makes the collection of extensive protein variant sequence-fitness data practical.
|
25
|
Droplet-based screening of phosphate transfer catalysis reveals how epistasis shapes MAP kinase interactions with substrates. Nat Commun 2022; 13:844. [PMID: 35149678 PMCID: PMC8837617 DOI: 10.1038/s41467-022-28396-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/07/2021] [Accepted: 01/10/2022] [Indexed: 11/20/2022] Open
Abstract
The combination of ultrahigh-throughput screening and sequencing informs on function and intragenic epistasis within combinatorial protein mutant libraries. Establishing a droplet-based, in vitro compartmentalised approach for robust expression and screening of protein kinase cascades (>10^7 variants/day) allowed us to dissect the intrinsic molecular features of the MKK-ERK signalling pathway, without interference from endogenous cellular components. In a six-residue combinatorial library of the MKK1 docking domain, we identified 29,563 sequence permutations that allow MKK1 to efficiently phosphorylate and activate its downstream target kinase ERK2. A flexibly placed hydrophobic sequence motif emerges which is defined by higher order epistatic interactions between six residues, suggesting synergy that enables high connectivity in the sequence landscape. Through positive epistasis, MKK1 maintains function during mutagenesis, establishing the importance of co-dependent residues in mammalian protein kinase-substrate interactions, and creating a scenario for the evolution of diverse human signalling networks. Here, the authors use a droplet-based screen for phosphate transfer catalysis, testing variants of the human protein kinase MKK1 for its ability to activate its downstream target ERK2. Data reveal a flexible motif in the MKK1 docking domain that promotes efficient activation of ERK2, and suggest epistasis between the residues within that sequence.
|
26
|
Deep Mutational Scanning of Protein-Protein Interactions Between Partners Expressed from Their Endogenous Loci In Vivo. Methods Mol Biol 2022; 2477:237-259. [PMID: 35524121 DOI: 10.1007/978-1-0716-2257-5_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Deep mutational scanning (DMS) generates mutants of a protein of interest in a comprehensive manner. CRISPR-Cas9 technology enables large-scale genome editing with high efficiency. Using both DMS and CRISPR-Cas9 therefore allows us to investigate the effects of thousands of mutations inserted directly in the genome. Combined with protein-fragment complementation assay (PCA), which enables the quantitative measurement of protein-protein interactions (PPIs) in vivo, these methods allow for the systematic assessment of the effects of mutations on PPIs in living cells. Here, we describe a method leveraging DMS, CRISPR-Cas9, and PCA to study the effect of point mutations on PPIs mediated by protein domains in yeast.
|
27
|
Abstract
Evolution is the hallmark of life. Descriptions of the evolution of microorganisms have provided a wealth of information, but knowledge regarding "what happened" has precluded a deeper understanding of "how" evolution has proceeded, as in the case of antimicrobial resistance. The difficulty in answering the "how" question lies in the multihierarchical dimensions of evolutionary processes, nested in complex networks, encompassing all units of selection, from genes to communities and ecosystems. At the simplest ontological level (as resistance genes), evolution proceeds by random (mutation and drift) and directional (natural selection) processes; however, sequential pathways of adaptive variation can occasionally be observed, and under fixed circumstances (particular fitness landscapes), evolution is predictable. At the highest level (such as that of plasmids, clones, species, microbiotas), the systems' degrees of freedom increase dramatically, related to the variable dispersal, fragmentation, relatedness, or coalescence of bacterial populations, depending on heterogeneous and changing niches and selective gradients in complex environments. Evolutionary trajectories of antibiotic resistance find their way in these changing landscapes subjected to random variations, becoming highly entropic and therefore unpredictable. However, experimental, phylogenetic, and ecogenetic analyses reveal preferential frequented paths (highways) where antibiotic resistance flows and propagates, allowing some understanding of evolutionary dynamics, modeling and designing interventions. Studies on antibiotic resistance have an applied aspect in improving individual health, One Health, and Global Health, as well as an academic value for understanding evolution. Most importantly, they have a heuristic significance as a model to reduce the negative influence of anthropogenic effects on the environment.
|
28
|
Physics of biomolecular recognition and conformational dynamics. Rep Prog Phys 2021; 84:126601. [PMID: 34753115 DOI: 10.1088/1361-6633/ac3800] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/26/2021] [Accepted: 11/09/2021] [Indexed: 06/13/2023]
Abstract
Biomolecular recognition usually leads to the formation of binding complexes, often accompanied by large-scale conformational changes. This process is fundamental to biological functions at the molecular and cellular levels. Uncovering the physical mechanisms of biomolecular recognition and quantifying the key biomolecular interactions are vital to understand these functions. The recently developed energy landscape theory has been successful in quantifying recognition processes and revealing the underlying mechanisms. Recent studies have shown that in addition to affinity, specificity is also crucial for biomolecular recognition. The proposed physical concept of intrinsic specificity based on the underlying energy landscape theory provides a practical way to quantify the specificity. Optimization of affinity and specificity can be adopted as a principle to guide the evolution and design of molecular recognition. This approach can also be used in practice for drug discovery using multidimensional screening to identify lead compounds. The energy landscape topography of molecular recognition is important for revealing the underlying flexible binding or binding-folding mechanisms. In this review, we first introduce the energy landscape theory for molecular recognition and then address four critical issues related to biomolecular recognition and conformational dynamics: (1) specificity quantification of molecular recognition; (2) evolution and design in molecular recognition; (3) flexible molecular recognition; (4) chromosome structural dynamics. The results described here and the discussions of the insights gained from the energy landscape topography can provide valuable guidance for further computational and experimental investigations of biomolecular recognition and conformational dynamics.
|
29
|
Abstract
Directed evolution, a strategy for protein engineering, optimizes protein properties (i.e., fitness) by expensive and time-consuming screening or selection of large mutational sequence space. Machine learning-assisted directed evolution (MLDE), which screens sequence properties in silico, can accelerate the optimization and reduce the experimental burden. This work introduces an MLDE framework, cluster learning-assisted directed evolution (CLADE), that combines hierarchical unsupervised clustering sampling and supervised learning to guide protein engineering. The clustering sampling selectively picks and screens variants in targeted subspaces, which guides the subsequent generation of diverse training sets. In the last stage, accurate predictions via supervised learning models improve final outcomes. By sequentially screening 480 sequences out of 160,000 in a four-site combinatorial library with five equal experimental batches, CLADE achieves global maximal fitness hit rates of up to 91.0% and 34.0% for the GB1 and PhoQ datasets, respectively, improved from 18.6% and 7.2% obtained by random-sampling-based MLDE.
|
30
|
Climbing Up and Down Binding Landscapes through Deep Mutational Scanning of Three Homologous Protein-Protein Complexes. J Am Chem Soc 2021; 143:17261-17275. [PMID: 34609866 PMCID: PMC8532158 DOI: 10.1021/jacs.1c08707] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 12/18/2022]
Abstract
Protein-protein interactions (PPIs) have evolved to display binding affinities that can support their function. As such, cognate and noncognate PPIs could be highly similar structurally but exhibit huge differences in binding affinities. To understand this phenomenon, we study three homologous protease-inhibitor PPIs that span 9 orders of magnitude in binding affinity. Using state-of-the-art methodology that combines protein randomization, affinity sorting, deep sequencing, and data normalization, we report quantitative binding landscapes consisting of ΔΔG_bind values for the three PPIs, gleaned from tens of thousands of single and double mutations. We show that binding landscapes of the three complexes are strikingly different and depend on the PPI evolutionary optimality. We observe different patterns of couplings between mutations for the three PPIs with negative and positive epistasis appearing most frequently at hot-spot and cold-spot positions, respectively. The evolutionary trends observed here are likely to be universal to other biological complexes in the cell.
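Coupling between two mutations is commonly quantified as the deviation of the double mutant's binding free energy change from the sum of the two single-mutant changes. The sketch below assumes that additive null model; the mutation names, the numbers, and the sign convention are hypothetical and are not taken from the study.

```python
def pairwise_epistasis(ddg_single, ddg_double):
    """Epistasis per mutation pair under an additive null model.

    ddg_single: dict mutation -> binding free energy change of the single mutant
    ddg_double: dict (mutation_a, mutation_b) -> change for the double mutant
    Returns dict (mutation_a, mutation_b) -> observed minus additive expectation.
    """
    eps = {}
    for (m1, m2), ddg12 in ddg_double.items():
        eps[(m1, m2)] = ddg12 - (ddg_single[m1] + ddg_single[m2])
    return eps


# Hypothetical values in kcal/mol, for illustration only.
singles = {"A10G": 1.0, "K15R": 0.5, "L20V": 1.5}
doubles = {("A10G", "K15R"): 2.5, ("K15R", "L20V"): 1.25}

for pair, e in pairwise_epistasis(singles, doubles).items():
    print(pair, e)
```

Whether a positive deviation counts as "positive" or "negative" epistasis depends on whether larger values mean weaker or stronger binding, so the sign convention has to be fixed before results like those above are compared across datasets.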
|
31
|
Epistasis shapes the fitness landscape of an allosteric specificity switch. Nat Commun 2021; 12:5562. [PMID: 34548494 PMCID: PMC8455584 DOI: 10.1038/s41467-021-25826-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 11/29/2020] [Accepted: 09/03/2021] [Indexed: 11/08/2022] Open
Abstract
Epistasis is a major determinant in the emergence of novel protein function. In allosteric proteins, direct interactions between inducer-binding mutations propagate through the allosteric network, manifesting as epistasis at the level of biological function. Elucidating this relationship between local interactions and their global effects is essential to understanding evolution of allosteric proteins. We integrate computational design, structural and biophysical analysis to characterize the emergence of novel inducer specificity in an allosteric transcription factor. Adaptive landscapes of different inducers of the designed mutant show that a few strong epistatic interactions constrain the number of viable sequence pathways, revealing ridges in the fitness landscape leading to new specificity. The structure of the designed mutant shows that a striking change in inducer orientation still retains allosteric function. Comparing biophysical and functional properties suggests a nonlinear relationship between inducer binding affinity and allostery. Our results highlight the functional and evolutionary complexity of allosteric proteins. Epistasis plays an important role in the evolution of novel protein functions because it determines the mutational path a protein takes. Here, the authors combine functional, structural and biophysical analyses to characterize epistasis in a computationally redesigned ligand-inducible allosteric transcription factor and found that epistasis creates distinct biophysical and biological functional landscapes.
|
32
|
The search for universality in evolutionary landscapes: Comment on "From genotypes to organisms: State-of-the-art and perspectives of a cornerstone in evolutionary dynamics" by Susanna Manrubia, José A. Cuesta, et al. Phys Life Rev 2021; 39:76-78. [PMID: 34507904 DOI: 10.1016/j.plrev.2021.08.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/19/2021] [Accepted: 08/19/2021] [Indexed: 11/21/2022]
|
33
|
Binding affinity landscapes constrain the evolution of broadly neutralizing anti-influenza antibodies. eLife 2021; 10:e71393. [PMID: 34491198 PMCID: PMC8476123 DOI: 10.7554/elife.71393] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Received: 06/17/2021] [Accepted: 09/05/2021] [Indexed: 12/12/2022] Open
Abstract
Over the past two decades, several broadly neutralizing antibodies (bnAbs) that confer protection against diverse influenza strains have been isolated. Structural and biochemical characterization of these bnAbs has provided molecular insight into how they bind distinct antigens. However, our understanding of the evolutionary pathways leading to bnAbs, and thus how best to elicit them, remains limited. Here, we measure equilibrium dissociation constants of combinatorially complete mutational libraries for two naturally isolated influenza bnAbs (CR9114, 16 heavy-chain mutations; CR6261, 11 heavy-chain mutations), reconstructing all possible evolutionary intermediates back to the unmutated germline sequences. We find that these two libraries exhibit strikingly different patterns of breadth: while many variants of CR6261 display moderate affinity to diverse antigens, those of CR9114 display appreciable affinity only in specific, nested combinations. By examining the extensive pairwise and higher order epistasis between mutations, we find key sites with strong synergistic interactions that are highly similar across antigens for CR6261 and different for CR9114. Together, these features of the binding affinity landscapes strongly favor sequential acquisition of affinity to diverse antigens for CR9114, while the acquisition of breadth to more similar antigens for CR6261 is less constrained. These results, if generalizable to other bnAbs, may explain the molecular basis for the widespread observation that sequential exposure favors greater breadth, and such mechanistic insight will be essential for predicting and eliciting broadly protective immune responses.
|
34
|
How the PhoP/PhoQ System Controls Virulence and Mg2+ Homeostasis: Lessons in Signal Transduction, Pathogenesis, Physiology, and Evolution. Microbiol Mol Biol Rev 2021; 85:e0017620. [PMID: 34191587 PMCID: PMC8483708 DOI: 10.1128/mmbr.00176-20] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Indexed: 01/21/2023] Open
Abstract
The PhoP/PhoQ two-component system governs virulence, Mg2+ homeostasis, and resistance to a variety of antimicrobial agents, including acidic pH and cationic antimicrobial peptides, in several Gram-negative bacterial species. Best understood in Salmonella enterica serovar Typhimurium, the PhoP/PhoQ system consists o[...]-regulated gene products alter PhoP-P amounts, even under constant inducing conditions. PhoP-P controls the abundance of hundreds of proteins both directly, by having transcriptional effects on the corresponding genes, and indirectly, by modifying the abundance, activity, or stability of other transcription factors, regulatory RNAs, protease regulators, and metabolites. The investigation of PhoP/PhoQ has uncovered novel forms of signal transduction and the physiological consequences of regulon evolution.
|
35
|
Abstract
Duplication and divergence is a major mechanism by which new proteins and functions emerge in biology. Consequently, most organisms, in all domains of life, have genomes that encode large paralogous families of proteins. For recently duplicated pathways to acquire different, independent functions, the two paralogs must acquire mutations that effectively insulate them from one another. For instance, paralogous signaling proteins must acquire mutations that endow them with different interaction specificities such that they can participate in different signaling pathways without disruptive cross-talk. Although duplicated genes undoubtedly shape each other's evolution as they diverge and attain new functions, it is less clear how other paralogs impact or constrain gene duplication. Does the establishment of a new pathway by duplication and divergence require the system-wide optimization of all paralogs? The answer has profound implications for molecular evolution and our ability to engineer biological systems. Here, we discuss models, experiments, and approaches for tackling this question, and for understanding how new proteins and pathways are born.
|
36
|
New binding specificities evolve via point mutation in an invertebrate allorecognition gene. iScience 2021; 24:102811. [PMID: 34296075 PMCID: PMC8282982 DOI: 10.1016/j.isci.2021.102811] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/30/2020] [Revised: 06/16/2021] [Accepted: 06/28/2021] [Indexed: 01/04/2023] Open
Abstract
Many organisms use genetic self-recognition systems to distinguish themselves from conspecifics. In the cnidarian, Hydractinia symbiolongicarpus, self-recognition is partially controlled by allorecognition 2 (Alr2). Alr2 encodes a highly polymorphic transmembrane protein that discriminates self from nonself by binding in trans to other Alr2 proteins with identical or similar sequences. Here, we focused on the N-terminal domain of Alr2, which can determine its binding specificity. We pair ancestral sequence reconstruction and experimental assays to show that amino acid substitutions can create sequences with novel binding specificities either directly (via one mutation) or via sequential mutations and intermediates with relaxed specificities. We also show that one side of the domain has experienced positive selection and likely forms the binding interface. Our results provide direct evidence that point mutations can generate Alr2 proteins with novel binding specificities. This provides a plausible mechanism for the generation and maintenance of functional variation in nature.
|
37
|
Diversification of DNA-Binding Specificity by Permissive and Specificity-Switching Mutations in the ParB/Noc Protein Family. Cell Rep 2020; 32:107928. [PMID: 32698006 PMCID: PMC7383237 DOI: 10.1016/j.celrep.2020.107928] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 08/12/2019] [Revised: 03/25/2020] [Accepted: 06/26/2020] [Indexed: 12/17/2022] Open
Abstract
Specific interactions between proteins and DNA are essential to many biological processes. Yet, it remains unclear how the diversification in DNA-binding specificity was brought about, and the mutational paths that led to changes in specificity are unknown. Using a pair of evolutionarily related DNA-binding proteins, each with a different DNA preference (ParB [Partitioning Protein B] and Noc [Nucleoid Occlusion Factor], which both play roles in bacterial chromosome maintenance), we show that specificity is encoded by a set of four residues at the protein-DNA interface. Combining X-ray crystallography and deep mutational scanning of the interface, we suggest that permissive mutations must be introduced before specificity-switching mutations to reprogram specificity and that mutational paths to new specificity do not necessarily involve dual-specificity intermediates. Overall, our results provide insight into the possible evolutionary history of ParB and Noc and, in a broader context, might be useful for understanding the evolution of other classes of DNA-binding proteins. DNA-binding specificity for parS and NBS is conserved within ParB and Noc family. Specificity is encoded by a set of four residues at the protein-DNA interface. Mutations must be introduced in a defined order to reprogram specificity.
|
38
|
Characterizing the portability of phage-encoded homologous recombination proteins. Nat Chem Biol 2021; 17:394-402. [PMID: 33462496 PMCID: PMC7990699 DOI: 10.1038/s41589-020-00710-5] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 04/14/2020] [Revised: 11/02/2020] [Accepted: 11/13/2020] [Indexed: 01/29/2023]
Abstract
Efficient genome editing methods are essential for biotechnology and fundamental research. Homologous recombination (HR) is the most versatile method of genome editing, but techniques that rely on host RecA-mediated pathways are inefficient and laborious. Phage-encoded single-stranded DNA annealing proteins (SSAPs) improve HR 1,000-fold above endogenous levels. However, they are not broadly functional. Using Escherichia coli, Lactococcus lactis, Mycobacterium smegmatis, Lactobacillus rhamnosus and Caulobacter crescentus, we investigated the limited portability of SSAPs. We find that these proteins specifically recognize the C-terminal tail of the host's single-stranded DNA-binding protein (SSB) and are portable between species only if compatibility with this host domain is maintained. Furthermore, we find that co-expressing SSAPs with SSBs can significantly improve genome editing efficiency, in some species enabling SSAP functionality even without host compatibility. Finally, we find that high-efficiency HR far surpasses the mutational capacity of commonly used random mutagenesis methods, generating exceptional phenotypes that are inaccessible through sequential nucleotide conversions.
|
39
|
Neutral quasispecies evolution and the maximal entropy random walk. Sci Adv 2021; 7:eabb2376. [PMID: 33853768 PMCID: PMC8046360 DOI: 10.1126/sciadv.abb2376] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/10/2020] [Accepted: 02/24/2021] [Indexed: 06/12/2023]
Abstract
Even if they have no impact on phenotype, neutral mutations are not equivalent in the eyes of evolution: A robust neutral variant (one which remains functional after further mutations) is more likely to spread in a large, diverse population than a fragile one. Quasispecies theory shows that the equilibrium frequency of a genotype is proportional to its eigenvector centrality in the neutral network. This paper explores the link between the selection for mutational robustness and the navigability of neutral networks. I show that sequences of neutral mutations follow a "maximal entropy random walk," a canonical Markov chain on graphs with nonlocal, nondiffusive dynamics. I revisit M. Smith's word-game model of evolution in this light, finding that the likelihood of certain sequences of substitutions can decrease with the population size. These counterintuitive results underscore the fertility of the interface between evolutionary dynamics, information theory, and physics.
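The maximal entropy random walk has a compact textbook definition that can be sketched on a toy graph. With adjacency matrix A, leading eigenvalue λ, and leading eigenvector ψ (the eigenvector centrality), the transition probabilities are P_ij = A_ij ψ_j / (λ ψ_i) and the stationary distribution is proportional to ψ_i². The four-node path graph and the function names below are assumptions for illustration, not the paper's neutral networks or code.

```python
def leading_eigenpair(adj, iters=2000):
    """Power iteration on (A + I) for a connected, undirected graph.

    Adding the identity shifts the spectrum so the iteration also
    converges on bipartite graphs. Returns (lambda, psi) for A itself,
    with psi normalized to unit Euclidean length.
    """
    n = len(adj)
    psi = [1.0] * n
    for _ in range(iters):
        new = [psi[i] + sum(adj[i][j] * psi[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in new) ** 0.5
        psi = [x / norm for x in new]
    a_psi = [sum(adj[i][j] * psi[j] for j in range(n)) for i in range(n)]
    lam = sum(p * q for p, q in zip(psi, a_psi))  # Rayleigh quotient
    return lam, psi


def merw(adj):
    """Transition matrix and stationary distribution of the maximal entropy random walk."""
    lam, psi = leading_eigenpair(adj)
    n = len(adj)
    P = [[adj[i][j] * psi[j] / (lam * psi[i]) for j in range(n)] for i in range(n)]
    pi = [p * p for p in psi]  # sums to 1 because psi has unit length
    return P, pi


# Toy neutral network: a path of four genotypes, 0 - 1 - 2 - 3.
adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
P, pi = merw(adj)
print([round(x, 3) for x in pi])
```

On this path the walk concentrates on the two central, better-connected nodes more strongly than an ordinary degree-weighted random walk would, a small-scale picture of the preference for robust genotypes the abstract describes.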
|
40
|
A large-scale survey of pairwise epistasis reveals a mechanism for evolutionary expansion and specialization of PDZ domains. Proteins 2021; 89:899-914. [PMID: 33620761 DOI: 10.1002/prot.26067] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 09/15/2020] [Revised: 02/02/2021] [Accepted: 02/18/2021] [Indexed: 12/21/2022]
Abstract
Deep mutational scanning (DMS) facilitates data-driven models of protein structure and function. Here, we adapted Saturated Programmable Insertion Engineering (SPINE) as a programmable DMS technique. We validate SPINE with a reference single mutant dataset in the PSD95 PDZ3 domain and then characterize most pairwise double mutants to study epistasis. We observe widespread proximal negative epistasis, which we attribute to mutations affecting thermodynamic stability, and strong long-range positive epistasis, which is enriched in an evolutionarily conserved and function-defining network of "sector" and clade-specifying residues. Conditional neutrality of mutations in clade-specifying residues compensates for deleterious mutations in sector positions. This suggests that epistatic interactions between these position pairs facilitated the evolutionary expansion and specialization of PDZ domains. We propose that SPINE provides easy experimental access to reveal epistasis signatures in proteins that will improve our understanding of the structural basis for protein function and adaptation.
|
41
|
Uncovering the basis of protein-protein interaction specificity with a combinatorially complete library. eLife 2020; 9:e60924. [PMID: 33107822 PMCID: PMC7669267 DOI: 10.7554/elife.60924] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 07/10/2020] [Accepted: 10/26/2020] [Indexed: 12/27/2022] Open
Abstract
Protein-protein interaction specificity is often encoded at the primary sequence level. However, the contributions of individual residues to specificity are usually poorly understood and often obscured by mutational robustness, sequence degeneracy, and epistasis. Using bacterial toxin-antitoxin systems as a model, we screened a combinatorially complete library of antitoxin variants at three key positions against two toxins. This library enabled us to measure the effect of individual substitutions on specificity in hundreds of genetic backgrounds. These distributions allow inferences about the general nature of interface residues in promoting specificity. We find that positive and negative contributions to specificity are neither inherently coupled nor mutually exclusive. Further, a wild-type antitoxin appears optimized for specificity as no substitutions improve discrimination between cognate and non-cognate partners. By comparing crystal structures of paralogous complexes, we provide a rationale for our observations. Collectively, this work provides a generalizable approach to understanding the logic of molecular recognition.
|
42
|
Functional effects of protein variants. Biochimie 2020; 180:104-120. [PMID: 33164889 DOI: 10.1016/j.biochi.2020.10.009] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 01/30/2020] [Revised: 10/15/2020] [Accepted: 10/19/2020] [Indexed: 12/11/2022]
Abstract
Genetic and other variations frequently affect protein functions. Scientific articles can contain confusing descriptions about which function or property is affected, and in many cases the statements are pure speculation without any experimental evidence. To clarify functional effects of protein variations of genetic or non-genetic origin, a systematic conceptualisation and framework are introduced. This framework describes protein functional effects on abundance, activity, specificity and affinity, along with countermeasures, which allow cells, tissues and organisms to tolerate, avoid, repair, attenuate or resist (TARAR) the effects. Effects on abundance discussed include gene dosage, restricted expression, mis-localisation and degradation. Enzymopathies, effects on kinetics, allostery and regulation of protein activity are subtopics for the effects of variants on activity. Variation outcomes on specificity and affinity comprise promiscuity, specificity, affinity and moonlighting. TARAR mechanisms redress variations with active and passive processes including chaperones, redundancy, robustness, canalisation and metabolic and signalling rewiring. A framework for pragmatic protein function analysis and presentation is introduced. All of the mechanisms and effects are described along with representative examples, most often in relation to diseases. In addition, protein function is discussed from an evolutionary point of view. Application of the presented framework facilitates unambiguous, detailed and specific description of functional effects and their systematic study.
|
43
|
Amalgamated cross-species transcriptomes reveal organ-specific propensity in gene expression evolution. Nat Commun 2020; 11:4459. [PMID: 32900997 PMCID: PMC7479108 DOI: 10.1038/s41467-020-18090-8] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 10/16/2019] [Accepted: 07/29/2020] [Indexed: 12/24/2022] Open
Abstract
The origins of multicellular physiology are tied to evolution of gene expression. Genes can shift expression as organisms evolve, but how ancestral expression influences altered descendant expression is not well understood. To examine this, we amalgamate 1,903 RNA-seq datasets from 182 research projects, including 6 organs in 21 vertebrate species. Quality control eliminates project-specific biases, and expression shifts are reconstructed using gene-family-wise phylogenetic Ornstein-Uhlenbeck models. Expression shifts following gene duplication result in more drastic changes in expression properties than shifts without gene duplication. The expression properties are tightly coupled with protein evolutionary rate, depending on whether and how gene duplication occurred. Fluxes in expression patterns among organs are nonrandom, forming modular connections that are reshaped by gene duplication. Thus, if expression shifts, ancestral expression in some organs induces a strong propensity for expression in particular organs in descendants. Regardless of whether the shifts are adaptive or not, this supports a major role for what might be termed preadaptive pathways of gene expression evolution.
|
44
|
Distinct Mechanisms of Resistance to a CENP-E Inhibitor Emerge in Near-Haploid and Diploid Cancer Cells. Cell Chem Biol 2020; 27:850-857.e6. [PMID: 32442423 DOI: 10.1016/j.chembiol.2020.05.003] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/21/2019] [Revised: 04/03/2020] [Accepted: 05/04/2020] [Indexed: 12/20/2022]
Abstract
Aberrant chromosome numbers in cancer cells may impose distinct constraints on the emergence of drug resistance, a major factor limiting the long-term efficacy of molecularly targeted therapeutics. However, for most anticancer drugs we lack analyses of drug-resistance mechanisms in cells with different karyotypes. Here, we focus on GSK923295, a mitotic kinesin CENP-E inhibitor that was evaluated in clinical trials as a cancer therapeutic. We performed unbiased selections to isolate inhibitor-resistant clones in diploid and near-haploid cancer cell lines. In diploid cells we identified single-point mutations that can suppress inhibitor binding. In contrast, transcriptome analyses revealed that the C-terminus of CENP-E was disrupted in GSK923295-resistant near-haploid cells. While chemical inhibition of CENP-E is toxic to near-haploid cells, knockout of the CENPE gene does not suppress haploid cell proliferation, suggesting that deletion of the CENP-E C-terminus can confer resistance to GSK923295. Together, these findings indicate that different chromosome copy numbers in cells can alter epistatic dependencies and lead to distinct modes of chemotype-specific resistance.
|
45
|
Global fitness landscapes of the Shine-Dalgarno sequence. Genome Res 2020; 30:711-723. [PMID: 32424071 PMCID: PMC7263185 DOI: 10.1101/gr.260182.119] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 12/13/2019] [Accepted: 04/21/2020] [Indexed: 01/06/2023]
Abstract
Shine-Dalgarno sequences (SD) in prokaryotic mRNA facilitate protein translation by pairing with rRNA in ribosomes. Although conventionally defined as AG-rich motifs, recent genomic surveys reveal great sequence diversity, questioning how SD functions. Here, we determined the molecular fitness (i.e., translation efficiency) of 4^9 (262,144) synthetic 9-nt SD genotypes in three distinct mRNA contexts in Escherichia coli. We uncovered generic principles governing the SD fitness landscapes: (1) Guanine contents, rather than canonical SD motifs, best predict the fitness of both synthetic and endogenous SD; (2) the genotype-fitness correlation of SD promotes its evolvability by steadily supplying beneficial mutations across fitness landscapes; and (3) the frequency and magnitude of deleterious mutations increase with background fitness, and adjacent nucleotides in SD show stronger epistasis. Epistasis results from disruption of the continuous base pairing between SD and rRNA. This “chain-breaking” epistasis creates sinkholes in SD fitness landscapes and may profoundly impact the evolution and function of prokaryotic translation initiation and other RNA-mediated processes. Collectively, our work yields functional insights into the SD sequence variation in prokaryotic genomes, identifies a simple design principle to guide bioengineering and bioinformatic analysis of SD, and illuminates the fundamentals of fitness landscapes and molecular evolution.
|
46
|
Abstract
The limits of evolution have long fascinated biologists. However, the causes of evolutionary constraint have remained elusive due to a poor mechanistic understanding of studied phenotypes. Recently, a range of innovative approaches have leveraged mechanistic information on regulatory networks and cellular biology. These methods combine systems biology models with population and single-cell quantification and with new genetic tools, and they have been applied to a range of complex cellular functions and engineered networks. In this article, we review these developments, which are revealing the mechanistic causes of epistasis at different levels of biological organization (in molecular recognition, within a single regulatory network, and between different networks), providing first indications of predictable features of evolutionary constraint.
|
47
|
Structural prediction of protein interactions and docking using conservation and coevolution. Wiley Interdisciplinary Reviews: Computational Molecular Science 2020. [DOI: 10.1002/wcms.1470] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/12/2022]
|
48
|
Minimum epistasis interpolation for sequence-function relationships. Nat Commun 2020; 11:1782. [PMID: 32286265 PMCID: PMC7156698 DOI: 10.1038/s41467-020-15512-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 08/02/2019] [Accepted: 03/12/2020] [Indexed: 12/17/2022] Open
Abstract
Massively parallel phenotyping assays have provided unprecedented insight into how multiple mutations combine to determine biological function. While such assays can measure phenotypes for thousands to millions of genotypes in a single experiment, in practice these measurements are not exhaustive, so that there is a need for techniques to impute values for genotypes whose phenotypes have not been directly assayed. Here, we present an imputation method based on inferring the least epistatic possible sequence-function relationship compatible with the data. In particular, we infer the reconstruction where mutational effects change as little as possible across adjacent genetic backgrounds. The resulting models can capture complex higher-order genetic interactions near the data, but approach additivity where data is sparse or absent. We apply the method to high-throughput transcription factor binding assays and use it to explore a fitness landscape for protein G.
|
49
|
Phylogenetic Analyses of Sites in Different Protein Structural Environments Result in Distinct Placements of the Metazoan Root. Biology 2020; 9:E64. [PMID: 32231097 PMCID: PMC7235752 DOI: 10.3390/biology9040064] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 10/24/2019] [Revised: 03/09/2020] [Accepted: 03/20/2020] [Indexed: 12/23/2022]
Abstract
Phylogenomics, the use of large datasets to examine phylogeny, has revolutionized the study of evolutionary relationships. However, genome-scale data have not been able to resolve all relationships in the tree of life; this could reflect, at least in part, the poor fit of the models used to analyze heterogeneous datasets. Some of the heterogeneity may reflect the different patterns of selection on proteins based on their structures. To test that hypothesis, we developed a pipeline to divide phylogenomic protein datasets into subsets based on secondary structure and relative solvent accessibility. We then tested whether amino acids in different structural environments had distinct signals for the topology of the deepest branches in the metazoan tree. We focused on a dataset that appeared to have a mixture of signals and we found that the most striking difference in phylogenetic signal reflected relative solvent accessibility. Analyses of exposed sites (residues located on the surface of proteins) yielded a tree that placed ctenophores sister to all other animals whereas sites buried inside proteins yielded a tree with a sponge+ctenophore clade. These differences in phylogenetic signal were not ameliorated when we conducted analyses using a set of maximum-likelihood profile mixture models. These models are very similar to the Bayesian CAT model, which has been used in many analyses of deep metazoan phylogeny. In contrast, analyses conducted after recoding amino acids to limit the impact of deviations from compositional stationarity increased the congruence in the estimates of phylogeny for exposed and buried sites; after recoding, amino acid trees estimated using the exposed and buried sites both supported placement of ctenophores sister to all other animals.
Although the central conclusion of our analyses is that sites in different structural environments yield distinct trees when analyzed using models of protein evolution, our amino acid recoding analyses also have implications for metazoan evolution. Specifically, our results add to the evidence that ctenophores are the sister group of all other animals and they further suggest that the placozoa+cnidaria clade found in some other studies deserves more attention. Taken as a whole, these results provide striking evidence that it is necessary to achieve a better understanding of the constraints due to protein structure to improve phylogenetic estimation.
|
50
|
Selecting for Altered Substrate Specificity Reveals the Evolutionary Flexibility of ATP-Binding Cassette Transporters. Curr Biol 2020; 30:1689-1702.e6. [PMID: 32220325 PMCID: PMC7243462 DOI: 10.1016/j.cub.2020.02.077] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 10/10/2019] [Revised: 01/20/2020] [Accepted: 02/24/2020] [Indexed: 12/12/2022]
Abstract
ATP-binding cassette (ABC) transporters are the largest family of ATP-hydrolyzing transporters, which import or export substrates across membranes, and have members in every sequenced genome. Structural studies and biochemistry highlight the contrast between the global structural similarity of homologous transporters and the enormous diversity of their substrates. How do ABC transporters evolve to carry such diverse molecules and what variations in their amino acid sequence alter their substrate selectivity? We mutagenized the transmembrane domains of a conserved fungal ABC transporter that exports a mating pheromone and selected for mutants that export a non-cognate pheromone. Mutations that alter export selectivity cover a region that is larger than expected for a localized substrate-binding site. Individual selected clones have multiple mutations, which have broadly additive contributions to specific transport activity. Our results suggest that multiple positions influence substrate selectivity, leading to alternative evolutionary paths toward selectivity for particular substrates and explaining the number and diversity of ABC transporters. Srikant et al. find that mutations at many different positions in an ABC transporter of fungal mating pheromone have roughly additive effects on substrate recognition. This helps explain the evolvability of ABC transporters to transport a remarkable variety of substrates and their presence as the largest protein family across all domains of life.
|