1. Posfai A, Zhou J, McCandlish DM, Kinney JB. Gauge fixing for sequence-function relationships. PLoS Comput Biol 2025; 21:e1012818. [PMID: 40111986 PMCID: PMC11957564 DOI: 10.1371/journal.pcbi.1012818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2024] [Accepted: 01/22/2025] [Indexed: 03/22/2025]
Abstract
Quantitative models of sequence-function relationships are ubiquitous in computational biology, e.g., for modeling the DNA binding of transcription factors or the fitness landscapes of proteins. Interpreting these models, however, is complicated by the fact that the values of model parameters can often be changed without affecting model predictions. Before the values of model parameters can be meaningfully interpreted, one must remove these degrees of freedom (called "gauge freedoms" in physics) by imposing additional constraints (a process called "fixing the gauge"). However, strategies for fixing the gauge of sequence-function relationships have received little attention. Here we derive an analytically tractable family of gauges for a large class of sequence-function relationships. These gauges are derived in the context of models with all-order interactions, but an important subset of these gauges can be applied to diverse types of models, including additive models, pairwise-interaction models, and models with higher-order interactions. Many commonly used gauges are special cases of gauges within this family. We demonstrate the utility of this family of gauges by showing how different choices of gauge can be used both to explore complex activity landscapes and to reveal simplified models that are approximately correct within localized regions of sequence space. The results provide practical gauge-fixing strategies and demonstrate the utility of gauge-fixing for model exploration and interpretation.
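To make the idea of a gauge freedom concrete, here is a minimal Python sketch. It is not code from the paper. It assumes a hypothetical two-position additive model over the alphabet ACGT with made-up parameter values, and it uses the zero-sum convention (each position's parameters sum to zero) as one illustrative choice of gauge. The names `predict` and `to_zero_sum_gauge` are invented for this example.

```python
import itertools

ALPHABET = "ACGT"

def predict(theta0, theta, seq):
    """Additive model: a constant term plus one parameter per (position, character)."""
    return theta0 + sum(theta[pos][char] for pos, char in enumerate(seq))

def to_zero_sum_gauge(theta0, theta):
    """Shift each position's mean effect into the constant term.

    The shifted parameters sum to zero at every position, and the
    predictions are unchanged because the constant term absorbs the means.
    """
    new_theta0 = theta0
    new_theta = []
    for params in theta:
        mean = sum(params.values()) / len(params)
        new_theta0 += mean
        new_theta.append({char: value - mean for char, value in params.items()})
    return new_theta0, new_theta

# Made-up parameters (assumption, for illustration only).
theta0 = 0.5
theta = [
    {"A": 1.0, "C": 0.0, "G": -1.0, "T": 2.0},
    {"A": 0.3, "C": 0.3, "G": 0.3, "T": 0.3},
]
fixed0, fixed = to_zero_sum_gauge(theta0, theta)

# Largest disagreement between the two parameterizations over all 16 sequences.
max_diff = max(
    abs(predict(theta0, theta, seq) - predict(fixed0, fixed, seq))
    for seq in itertools.product(ALPHABET, repeat=len(theta))
)
```

In this sketch `max_diff` is zero up to floating-point error, so the two parameter sets describe the same model even though their individual values differ. Only after a gauge is chosen do the individual values carry an unambiguous meaning.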
Affiliation(s)
- Anna Posfai
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Juannan Zhou
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
  - Department of Biology, University of Florida, Gainesville, Florida, United States of America
- David M. McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Justin B. Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America

2. Martí-Gómez C, Zhou J, Chen WC, Kinney JB, McCandlish DM. Inference and visualization of complex genotype-phenotype maps with gpmap-tools. bioRxiv 2025:2025.03.09.642267. [PMID: 40161830 PMCID: PMC11952336 DOI: 10.1101/2025.03.09.642267] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
Multiplex assays of variant effect (MAVEs) allow the functional characterization of an unprecedented number of sequence variants in both gene regulatory regions and protein coding sequences. This has enabled the study of nearly complete combinatorial libraries of mutational variants and revealed the widespread influence of higher-order genetic interactions that arise when multiple mutations are combined. However, the lack of appropriate tools for exploratory analysis of this high-dimensional data limits our overall understanding of the main qualitative properties of complex genotype-phenotype maps. To fill this gap, we have developed gpmap-tools (https://github.com/cmarti/gpmap-tools), a Python library that integrates Gaussian process models for inference, phenotypic imputation, and error estimation from incomplete and noisy MAVE data and collections of natural sequences, together with methods for summarizing patterns of higher-order epistasis and non-linear dimensionality reduction techniques that allow visualization of genotype-phenotype maps containing up to millions of genotypes. Here, we used gpmap-tools to study the genotype-phenotype map of the Shine-Dalgarno sequence, a motif that modulates binding of the 16S rRNA to the 5' untranslated region (UTR) of mRNAs through base pair complementarity during translation initiation in prokaryotes. We inferred full combinatorial landscapes containing 262,144 different sequences from the sequences of 5,311 5'UTRs in the E. coli genome and from experimental MAVE data. Visualizations of the inferred landscapes were largely consistent with each other, and unveiled a simple molecular mechanism underlying the highly epistatic genotype-phenotype map of the Shine-Dalgarno sequence.
Affiliation(s)
- Carlos Martí-Gómez
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Juannan Zhou
  - Department of Biology, University of Florida, Gainesville, FL, 32611
- Wei-Chia Chen
  - Department of Physics, National Chung Cheng University, Chiayi 62102, Taiwan, Republic of China
- Justin B Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- David M McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724

3. Ripley DM, Garner T, Stevens A. Developing the 'omic toolkit of comparative physiologists. Comp Biochem Physiol Part D Genomics Proteomics 2024; 52:101287. [PMID: 38972179 DOI: 10.1016/j.cbd.2024.101287] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/23/2024] [Revised: 06/22/2024] [Accepted: 07/01/2024] [Indexed: 07/09/2024]
Abstract
Typical 'omic analyses reduce complex biological systems to simple lists of supposedly independent variables, failing to account for changes in the wider transcriptional landscape. In this commentary, we discuss the utility of network approaches for incorporating this wider context into the study of physiological phenomena. We highlight opportunities to build on traditional network tools by utilising cutting-edge techniques to account for higher order interactions (i.e. beyond pairwise associations) within datasets, allowing for more accurate models of complex 'omic systems. Finally, we show examples of previous works utilising network approaches to gain additional insight into their organisms of interest. As 'omics grow in both their popularity and breadth of application, so does the requirement for flexible analytical tools capable of interpreting and synthesising complex datasets.
Affiliation(s)
- Daniel M Ripley
  - Marine Biology Laboratory, Division of Science, New York University Abu Dhabi, United Arab Emirates. https://twitter.com/@ElasmoDan
- Terence Garner
  - Division of Developmental Biology and Medicine, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, UK
- Adam Stevens
  - Division of Developmental Biology and Medicine, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester, UK

4. Schmidlin K, Ogbunugafor CB, Alexander S, Geiler-Samerotte K. Environment by environment interactions (ExE) differ across genetic backgrounds (ExExG). bioRxiv 2024:2024.05.08.593194. [PMID: 38766025 PMCID: PMC11100745 DOI: 10.1101/2024.05.08.593194] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/22/2024]
Abstract
While the terms "gene-by-gene interaction" (GxG) and "gene-by-environment interaction" (GxE) are widely recognized in the fields of quantitative and evolutionary genetics, "environment-by-environment interaction" (ExE) is a term used less often. In this study, we find that environment-by-environment interactions are a meaningful driver of phenotypes, and moreover, that they differ across different genotypes (suggestive of ExExG). To support this conclusion, we analyzed a large dataset of roughly 1,000 mutant yeast strains with varying degrees of resistance to different antifungal drugs. Our findings reveal that the effectiveness of a drug combination, relative to single drugs, often differs across drug-resistant mutants. Remarkably, even mutants that differ by only a single nucleotide change can have dramatically different drug × drug (ExE) interactions. We also introduce a new framework that more accurately predicts the direction and magnitude of ExE interactions for some mutants. Understanding how ExE interactions change across genotypes (ExExG) is crucial not only for modeling the evolution of pathogenic microbes, but also for enhancing our knowledge of the underlying cell biology and the sources of phenotypic variance within populations. While the significance of ExExG interactions has been overlooked in evolutionary and population genetics, these fields and others stand to benefit from understanding how these interactions shape the complex behavior of living systems.
Affiliation(s)
- Kara Schmidlin
  - Biodesign Center for Mechanisms of Evolution, Arizona State University, Tempe, AZ, 85287
  - School of Life Sciences, Arizona State University, Tempe, AZ, 85287
- C. Brandon Ogbunugafor
  - Department of Ecology & Evolutionary Biology, Yale University, New Haven, CT, 06511
  - Santa Fe Institute, Santa Fe, NM, 87501
- Sastokas Alexander
  - Biodesign Center for Mechanisms of Evolution, Arizona State University, Tempe, AZ, 85287
  - School of Life Sciences, Arizona State University, Tempe, AZ, 85287
- Kerry Geiler-Samerotte
  - Biodesign Center for Mechanisms of Evolution, Arizona State University, Tempe, AZ, 85287
  - School of Life Sciences, Arizona State University, Tempe, AZ, 85287

5. Park Y, Metzger BPH, Thornton JW. The simplicity of protein sequence-function relationships. Nat Commun 2024; 15:7953. [PMID: 39261454 PMCID: PMC11390738 DOI: 10.1038/s41467-024-51895-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2024] [Accepted: 08/20/2024] [Indexed: 09/13/2024]
Abstract
How complex are the rules by which a protein's sequence determines its function? High-order epistatic interactions among residues are thought to be pervasive, suggesting an idiosyncratic and unpredictable sequence-function relationship. But many prior studies may have overestimated epistasis, because they analyzed sequence-function relationships relative to a single reference sequence (which causes measurement noise and local idiosyncrasies to snowball into high-order epistasis) or because they did not fully account for global nonlinearities. Here we present a reference-free method that jointly infers specific epistatic interactions and global nonlinearity using a bird's-eye view of sequence space. This technique yields the simplest explanation of sequence-function relationships and is more robust than existing methods to measurement noise, missing data, and model misspecification. We reanalyze 20 experimental datasets and find that context-independent amino acid effects and pairwise interactions, along with a simple nonlinearity to account for limited dynamic range, explain a median of 96% of phenotypic variance and over 92% in every case. Only a tiny fraction of genotypes are strongly affected by higher-order epistasis. Sequence-function relationships are also sparse: a minuscule fraction of amino acids and interactions account for 90% of phenotypic variance. Sequence-function causality across these datasets is therefore simple, opening the way for tractable approaches to characterize proteins' genetic architecture.
Affiliation(s)
- Yeonwoo Park
  - Committee on Genetics, Genomics, and Systems Biology, University of Chicago, Chicago, IL, USA
  - Center for RNA Research, Institute for Basic Science, Seoul, Republic of Korea
- Brian P H Metzger
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
- Joseph W Thornton
  - Department of Ecology and Evolution, University of Chicago, Chicago, IL, USA
  - Department of Human Genetics, University of Chicago, Chicago, IL, USA

6. Dietler N, Abbara A, Choudhury S, Bitbol AF. Impact of phylogeny on the inference of functional sectors from protein sequence data. PLoS Comput Biol 2024; 20:e1012091. [PMID: 39312591 PMCID: PMC11449291 DOI: 10.1371/journal.pcbi.1012091] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 10/03/2024] [Accepted: 09/10/2024] [Indexed: 09/25/2024]
Abstract
Statistical analysis of multiple sequence alignments of homologous proteins has revealed groups of coevolving amino acids called sectors. These groups of amino-acid sites feature collective correlations in their amino-acid usage, and they are associated with functional properties. Modeling showed that nonlinear selection on an additive functional trait of a protein is generically expected to give rise to a functional sector. These modeling results motivated a principled method, called ICOD, which is designed to identify functional sectors, as well as mutational effects, from sequence data. However, a challenge for all methods aiming to identify sectors from multiple sequence alignments is that correlations in amino-acid usage can also arise from the mere fact that homologous sequences share common ancestry, i.e. from phylogeny. Here, we generate controlled synthetic data from a minimal model comprising both phylogeny and functional sectors. We use this data to dissect the impact of phylogeny on sector identification and on mutational effect inference by different methods. We find that ICOD is most robust to phylogeny, but that conservation is also quite robust. Next, we consider natural multiple sequence alignments of protein families for which deep mutational scan experimental data is available. We show that in this natural data, conservation and ICOD best identify sites with strong functional roles, in agreement with our results on synthetic data. Importantly, these two methods have different premises, since they respectively focus on conservation and on correlations. Thus, their joint use can reveal complementary information.
Affiliation(s)
- Nicola Dietler
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Alia Abbara
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Subham Choudhury
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Anne-Florence Bitbol
  - Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
  - SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland

7. Van Gelder K, Lindner SN, Hanson AD, Zhou J. Strangers in a foreign land: 'Yeastizing' plant enzymes. Microb Biotechnol 2024; 17:e14525. [PMID: 39222378 PMCID: PMC11368087 DOI: 10.1111/1751-7915.14525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2024] [Accepted: 07/02/2024] [Indexed: 09/04/2024]
Abstract
Expressing plant metabolic pathways in microbial platforms is an efficient, cost-effective solution for producing many desired plant compounds. As eukaryotic organisms, yeasts are often the preferred platform. However, expression of plant enzymes in a yeast frequently leads to failure because the enzymes are poorly adapted to the foreign yeast cellular environment. Here, we first summarize the current engineering approaches for optimizing performance of plant enzymes in yeast. A critical limitation of these approaches is that they are labour-intensive and must be customized for each individual enzyme, which significantly hinders the establishment of plant pathways in cellular factories. In response to this challenge, we propose the development of a cost-effective computational pipeline to redesign plant enzymes for better adaptation to the yeast cellular milieu. This proposition is underpinned by compelling evidence that plant and yeast enzymes exhibit distinct sequence features that are generalizable across enzyme families. Consequently, we introduce a data-driven machine learning framework designed to extract 'yeastizing' rules from natural protein sequence variations, which can be broadly applied to all enzymes. Additionally, we discuss the potential to integrate the machine learning model into a full design-build-test cycle.
Affiliation(s)
- Kristen Van Gelder
  - Horticultural Sciences Department, University of Florida, Gainesville, Florida, USA
- Steffen N. Lindner
  - Department of Systems and Synthetic Metabolism, Max Planck Institute of Molecular Plant Physiology, Potsdam, Germany
  - Department of Biochemistry, Charité Universitätsmedizin Berlin, Freie Universität Berlin and Humboldt‐Universität, Berlin, Germany
- Andrew D. Hanson
  - Horticultural Sciences Department, University of Florida, Gainesville, Florida, USA
- Juannan Zhou
  - Department of Biology, University of Florida, Gainesville, Florida, USA

8. Seitz EE, McCandlish DM, Kinney JB, Koo PK. Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models. Nat Mach Intell 2024; 6:701-713. [PMID: 39950082 PMCID: PMC11823438 DOI: 10.1038/s42256-024-00851-5] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 11/27/2023] [Accepted: 05/09/2024] [Indexed: 02/16/2025]
Abstract
Deep neural networks (DNNs) have greatly advanced the ability to predict genome function from sequence. However, elucidating underlying biological mechanisms from genomic DNNs remains challenging. Existing interpretability methods, such as attribution maps, have their origins in non-biological machine learning applications and therefore have the potential to be improved by incorporating domain-specific interpretation strategies. Here we introduce SQUID, a genomic DNN interpretability framework based on domain-specific surrogate modeling. SQUID approximates genomic DNNs in user-specified regions of sequence space using surrogate models, i.e., simpler quantitative models that have inherently interpretable mathematical forms. SQUID leverages domain knowledge to model cis-regulatory mechanisms in genomic DNNs, in particular by removing the confounding effects that nonlinearities and heteroscedastic noise in functional genomics data can have on model interpretation. Benchmarking analysis on multiple genomic DNNs shows that SQUID, when compared to established interpretability methods, identifies motifs that are more consistent across genomic loci and yields improved single-nucleotide variant-effect predictions. SQUID also supports surrogate models that quantify epistatic interactions within and between cis-regulatory elements, as well as global explanations of cis-regulatory mechanisms across sequence contexts. SQUID thus advances the ability to mechanistically interpret genomic DNNs.
Affiliation(s)
- Evan E Seitz
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- David M McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Justin B Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA

9. Rozhoňová H, Martí-Gómez C, McCandlish DM, Payne JL. Robust genetic codes enhance protein evolvability. PLoS Biol 2024; 22:e3002594. [PMID: 38754362 PMCID: PMC11098591 DOI: 10.1371/journal.pbio.3002594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2023] [Accepted: 03/19/2024] [Indexed: 05/18/2024]
Abstract
The standard genetic code defines the rules of translation for nearly every life form on Earth. It also determines the amino acid changes accessible via single-nucleotide mutations, thus influencing protein evolvability, that is, the ability of mutation to bring forth adaptive variation in protein function. One of the most striking features of the standard genetic code is its robustness to mutation, yet it remains an open question whether such robustness facilitates or frustrates protein evolvability. To answer this question, we use data from massively parallel sequence-to-function assays to construct and analyze 6 empirical adaptive landscapes under hundreds of thousands of rewired genetic codes, including those of codon compression schemes relevant to protein engineering and synthetic biology. We find that robust genetic codes tend to enhance protein evolvability by rendering smooth adaptive landscapes with few peaks, which are readily accessible from throughout sequence space. However, the standard genetic code is rarely exceptional in this regard, because many alternative codes render smoother landscapes than the standard code. By constructing low-dimensional visualizations of these landscapes, which each comprise more than 16 million mRNA sequences, we show that such alternative codes radically alter the topological features of the network of high-fitness genotypes. Whereas the genetic codes that optimize evolvability depend to some extent on the detailed relationship between amino acid sequence and protein function, we also uncover general design principles for engineering nonstandard genetic codes for enhanced and diminished evolvability, which may facilitate directed protein evolution experiments and the bio-containment of synthetic organisms, respectively.
Affiliation(s)
- Hana Rozhoňová
  - Institute of Integrative Biology, ETH Zürich, Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Carlos Martí-Gómez
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- David M. McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Joshua L. Payne
  - Institute of Integrative Biology, ETH Zürich, Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland

10. Faure AJ, Lehner B, Miró Pina V, Serrano Colome C, Weghorn D. An extension of the Walsh-Hadamard transform to calculate and model epistasis in genetic landscapes of arbitrary shape and complexity. PLoS Comput Biol 2024; 20:e1012132. [PMID: 38805561 PMCID: PMC11161127 DOI: 10.1371/journal.pcbi.1012132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 06/07/2024] [Accepted: 05/04/2024] [Indexed: 05/30/2024]
Abstract
Accurate models describing the relationship between genotype and phenotype are necessary in order to understand and predict how mutations to biological sequences affect the fitness and evolution of living organisms. The apparent abundance of epistasis (genetic interactions), both between and within genes, complicates this task, and how to build mechanistic models that incorporate epistatic coefficients (genetic interaction terms) is an open question. The Walsh-Hadamard transform represents a rigorous computational framework for calculating and modeling epistatic interactions at the level of individual genotypic values (known as genetical, biological or physiological epistasis), and can therefore be used to address fundamental questions related to sequence-to-function encodings. However, one of its main limitations is that it can only accommodate two alleles (amino acid or nucleotide states) per sequence position. In this paper we provide an extension of the Walsh-Hadamard transform that allows the calculation and modeling of background-averaged epistasis (also known as ensemble epistasis) in genetic landscapes with an arbitrary number of states per position (20 for amino acids, 4 for nucleotides, etc.). We also provide a recursive formula for the inverse matrix and then derive formulae to directly extract any element of either matrix without having to rely on the computationally intensive task of constructing or inverting large matrices. Finally, we demonstrate the utility of our theory by using it to model epistasis within both simulated and empirical multiallelic fitness landscapes, revealing that both pairwise and higher-order genetic interactions are enriched between physically interacting positions.
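For orientation, the classic two-allele Walsh-Hadamard transform that this abstract takes as its starting point can be sketched in a few lines of Python. This is an illustrative sketch only: it covers the standard biallelic case, not the multiallelic extension the paper introduces, and the unnormalised form, the sign convention, the genotype ordering (00, 01, 10, 11) and the phenotype values are all assumptions made for this example. The function name `fwht` is likewise invented here.

```python
def fwht(values):
    """Unnormalised fast Walsh-Hadamard transform of a list of length 2**n.

    Uses the standard butterfly scheme, so the full 2**n x 2**n matrix
    is never built explicitly.
    """
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out

# Hypothetical phenotypes for genotypes 00, 01, 10, 11 (made-up numbers).
additive = [0, 1, 2, 3]    # the double mutant equals the sum of single effects
epistatic = [0, 1, 2, 5]   # the double mutant exceeds the additive expectation

additive_coeffs = fwht(additive)    # last entry is the pairwise term
epistatic_coeffs = fwht(epistatic)
```

Under these assumptions the last coefficient is 0 for the additive landscape and 2 for the epistatic one, which is the sense in which the transform isolates interaction terms. Applying `fwht` twice returns the input scaled by the number of genotypes, so the transform is easy to invert in the two-allele case.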
Affiliation(s)
- Andre J. Faure
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain
- Ben Lehner
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - ICREA, Pg. Lluis Companys 23, Barcelona 08010, Spain
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, United Kingdom
- Verónica Miró Pina
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain
- Claudia Serrano Colome
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain
- Donate Weghorn
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain

11. Meger AT, Spence MA, Sandhu M, Matthews D, Chen J, Jackson CJ, Raman S. Rugged fitness landscapes minimize promiscuity in the evolution of transcriptional repressors. Cell Syst 2024; 15:374-387.e6. [PMID: 38537640 PMCID: PMC11299162 DOI: 10.1016/j.cels.2024.03.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/26/2023] [Revised: 09/08/2023] [Accepted: 03/05/2024] [Indexed: 04/20/2024]
Abstract
How a protein's function influences the shape of its fitness landscape, smooth or rugged, is a fundamental question in evolutionary biochemistry. Smooth landscapes arise when incremental mutational steps lead to a progressive change in function, as commonly seen in enzymes and binding proteins. On the other hand, rugged landscapes are poorly understood because of the inherent unpredictability of how sequence changes affect function. Here, we experimentally characterize the entire sequence phylogeny, comprising 1,158 extant and ancestral sequences, of the DNA-binding domain (DBD) of the LacI/GalR transcriptional repressor family. Our analysis revealed an extremely rugged landscape with rapid switching of specificity, even between adjacent nodes. Further, the ruggedness arises due to the necessity of the repressor to simultaneously evolve specificity for asymmetric operators and disfavors potentially adverse regulatory crosstalk. Our study provides fundamental insight into evolutionary, molecular, and biophysical rules of genetic regulation through the lens of fitness landscapes.
Affiliation(s)
- Anthony T Meger
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
- Matthew A Spence
  - Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
- Mahakaran Sandhu
  - Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
- Dana Matthews
  - Research School of Biology, Australian National University, Canberra, ACT 2601, Australia
- Jackie Chen
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
- Colin J Jackson
  - Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Synthetic Biology, Research School of Chemistry, Australian National University, Canberra, ACT 2601, Australia
- Srivatsan Raman
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI 53706, USA
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI 53706, USA

12. Wang X, Li A, Li X, Cui H. Empowering Protein Engineering through Recombination of Beneficial Substitutions. Chemistry 2024; 30:e202303889. [PMID: 38288640 DOI: 10.1002/chem.202303889] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/04/2024] [Indexed: 02/24/2024]
Abstract
Directed evolution stands as a seminal technology for generating novel protein functionalities, a cornerstone in biocatalysis, metabolic engineering, and synthetic biology. Today, with the development of various mutagenesis methods and advanced analytical machines, the challenge of diversity generation and high-throughput screening platforms is largely solved, and one of the remaining challenges is: how to empower the potential of single beneficial substitutions with recombination to achieve the epistatic effect. This review overviews experimental and computer-assisted recombination methods in protein engineering campaigns. In addition, integrated and machine learning-guided strategies were highlighted to discuss how these recombination approaches contribute to generating the screening library with better diversity, coverage, and size. A decision tree was finally summarized to guide the further selection of proper recombination strategies in practice, which was beneficial for accelerating protein engineering.
Affiliation(s)
- Xinyue Wang
  - School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
- Anni Li
  - School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
- Xiujuan Li
  - School of Food Science and Pharmaceutical Engineering, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
- Haiyang Cui
  - School of Life Sciences, Nanjing Normal University, No. 2 Xuelin Road, Nanjing, 210097, China
Collapse
|
13
|
Seitz EE, McCandlish DM, Kinney JB, Koo PK. Interpreting cis-regulatory mechanisms from genomic deep neural networks using surrogate models. bioRxiv 2024:2023.11.14.567120. [PMID: 38013993 PMCID: PMC10680760 DOI: 10.1101/2023.11.14.567120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Deep neural networks (DNNs) have greatly advanced the ability to predict genome function from sequence. Interpreting genomic DNNs in terms of biological mechanisms, however, remains difficult. Here we introduce SQUID, a genomic DNN interpretability framework based on surrogate modeling. SQUID approximates genomic DNNs in user-specified regions of sequence space using surrogate models, i.e., simpler models that are mechanistically interpretable. Importantly, SQUID removes the confounding effects that nonlinearities and heteroscedastic noise in functional genomics data can have on model interpretation. Benchmarking analysis on multiple genomic DNNs shows that SQUID, when compared to established interpretability methods, identifies motifs that are more consistent across genomic loci and yields improved single-nucleotide variant-effect predictions. SQUID also supports surrogate models that quantify epistatic interactions within and between cis-regulatory elements. SQUID thus advances the ability to mechanistically interpret genomic DNNs.
Affiliation(s)
- Evan E Seitz
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- David M McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Justin B Kinney
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
|
14
|
Charest N, Shen Y, Lai YC, Chen IA, Shea JE. Discovering pathways through ribozyme fitness landscapes using information theoretic quantification of epistasis. RNA (New York, N.Y.) 2023; 29:1644-1657. [PMID: 37580126 PMCID: PMC10578471 DOI: 10.1261/rna.079541.122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2022] [Accepted: 07/29/2023] [Indexed: 08/16/2023]
Abstract
The identification of catalytic RNAs is typically achieved through primarily experimental means. However, only a small fraction of sequence space can be analyzed even with high-throughput techniques. Methods to extrapolate from a limited data set to predict additional ribozyme sequences, particularly in a human-interpretable fashion, could be useful both for designing new functional RNAs and for generating greater understanding about a ribozyme fitness landscape. Using information theory, we express the effects of epistasis (i.e., deviations from additivity) on a ribozyme. This representation was incorporated into a simple model of the epistatic fitness landscape, which identified potentially exploitable combinations of mutations. We used this model to theoretically predict mutants of high activity for a self-aminoacylating ribozyme, identifying potentially active triple and quadruple mutants beyond the experimental data set of single and double mutants. The predictions were validated experimentally, with nine out of nine sequences being accurately predicted to have high activity. This set of sequences included mutants that form a previously unknown evolutionary "bridge" between two ribozyme families that share a common motif. Individual steps in the method could be examined, understood, and guided by a human, combining interpretability and performance in a simple model to predict ribozyme sequences by extrapolation.
Affiliation(s)
- Nathaniel Charest
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, California 93106, USA
- Yuning Shen
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, California 93106, USA
- Yei-Chen Lai
- Department of Chemistry, National Chung Hsing University, Taichung City 40227, Taiwan
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, California 90095, USA
- Irene A Chen
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, California 93106, USA
- Department of Chemical and Biomolecular Engineering, University of California, Los Angeles, California 90095, USA
- Joan-Emma Shea
- Department of Chemistry and Biochemistry, University of California, Santa Barbara, California 93106, USA
|
15
|
Stamp J, DenAdel A, Weinreich D, Crawford L. Leveraging the genetic correlation between traits improves the detection of epistasis in genome-wide association studies. G3 (Bethesda, Md.) 2023; 13:jkad118. [PMID: 37243672 PMCID: PMC10484060 DOI: 10.1093/g3journal/jkad118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Revised: 01/11/2023] [Accepted: 05/23/2023] [Indexed: 05/29/2023]
Abstract
Epistasis, commonly defined as the interaction between genetic loci, is known to play an important role in the phenotypic variation of complex traits. As a result, many statistical methods have been developed to identify genetic variants that are involved in epistasis, and nearly all of these approaches carry out this task by focusing on analyzing one trait at a time. Previous studies have shown that jointly modeling multiple phenotypes can often dramatically increase statistical power for association mapping. In this study, we present the "multivariate MArginal ePIstasis Test" (mvMAPIT), a multioutcome generalization of a recently proposed epistatic detection method that seeks to detect marginal epistasis, or the combined pairwise interaction effects between a given variant and all other variants. By searching for marginal epistatic effects, one can identify genetic variants that are involved in epistasis without the need to identify the exact partners with which the variants interact, thus potentially alleviating much of the statistical and computational burden associated with conventional explicit search-based methods. Our proposed mvMAPIT builds upon this strategy by taking advantage of correlation structure between traits to improve the identification of variants involved in epistasis. We formulate mvMAPIT as a multivariate linear mixed model and develop a multitrait variance component estimation algorithm for efficient parameter inference and P-value computation. Together with reasonable model approximations, our proposed approach is scalable to moderately sized genome-wide association studies. With simulations, we illustrate the benefits of mvMAPIT over univariate (or single-trait) epistatic mapping strategies. We also apply the mvMAPIT framework to protein sequence data from two broadly neutralizing anti-influenza antibodies and approximately 2,000 heterogeneous stock mice from the Wellcome Trust Centre for Human Genetics.
The mvMAPIT R package can be downloaded at https://github.com/lcrawlab/mvMAPIT.
Affiliation(s)
- Julian Stamp
- Center for Computational Molecular Biology, Brown University, Providence, RI 02906, USA
- Alan DenAdel
- Center for Computational Molecular Biology, Brown University, Providence, RI 02906, USA
- Daniel Weinreich
- Center for Computational Molecular Biology, Brown University, Providence, RI 02906, USA
- Department of Ecology, Evolution, and Organismal Biology, Brown University, Providence, RI 02906, USA
- Lorin Crawford
- Center for Computational Molecular Biology, Brown University, Providence, RI 02906, USA
- Department of Biostatistics, Brown University, Providence, RI 02903, USA
- Microsoft Research New England, Cambridge, MA 02142, USA
|
16
|
Frioux C, Ansorge R, Özkurt E, Ghassemi Nedjad C, Fritscher J, Quince C, Waszak SM, Hildebrand F. Enterosignatures define common bacterial guilds in the human gut microbiome. Cell Host Microbe 2023; 31:1111-1125.e6. [PMID: 37339626 DOI: 10.1016/j.chom.2023.05.024] [Citation(s) in RCA: 38] [Impact Index Per Article: 19.0] [Received: 01/26/2023] [Revised: 04/03/2023] [Accepted: 05/23/2023] [Indexed: 06/22/2023]
Abstract
The human gut microbiome composition is generally in a stable dynamic equilibrium, but it can deteriorate into dysbiotic states detrimental to host health. To disentangle the inherent complexity and capture the ecological spectrum of microbiome variability, we used 5,230 gut metagenomes to characterize signatures of bacteria commonly co-occurring, termed enterosignatures (ESs). We find five generalizable ESs dominated by either Bacteroides, Firmicutes, Prevotella, Bifidobacterium, or Escherichia. This model confirms key ecological characteristics known from previous enterotype concepts, while enabling the detection of gradual shifts in community structures. Temporal analysis implies that the Bacteroides-associated ES is "core" in the resilience of westernized gut microbiomes, while combinations with other ESs often complement the functional spectrum. The model reliably detects atypical gut microbiomes correlated with adverse host health conditions and/or the presence of pathobionts. ESs provide an interpretable and generic model that enables an intuitive characterization of gut microbiome composition in health and disease.
Affiliation(s)
- Clémence Frioux
- Food, Microbiome, and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, NR4 7UQ Norwich, Norfolk, UK; Digital Biology, Earlham Institute NR4 7UZ Norwich, Norfolk, UK; Inria, University of Bordeaux, INRAE, 33400 Talence, France.
- Rebecca Ansorge
- Food, Microbiome, and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, NR4 7UQ Norwich, Norfolk, UK; Digital Biology, Earlham Institute NR4 7UZ Norwich, Norfolk, UK
- Ezgi Özkurt
- Food, Microbiome, and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, NR4 7UQ Norwich, Norfolk, UK; Digital Biology, Earlham Institute NR4 7UZ Norwich, Norfolk, UK
- Joachim Fritscher
- Food, Microbiome, and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, NR4 7UQ Norwich, Norfolk, UK; Digital Biology, Earlham Institute NR4 7UZ Norwich, Norfolk, UK
- Christopher Quince
- Food, Microbiome, and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, NR4 7UQ Norwich, Norfolk, UK; Digital Biology, Earlham Institute NR4 7UZ Norwich, Norfolk, UK
- Sebastian M Waszak
- Centre for Molecular Medicine Norway (NCMM), Nordic EMBL Partnership, University of Oslo and Oslo University Hospital, Oslo 0318, Norway; Department of Neurology, University of California, San Francisco, San Francisco, CA 94148, USA; Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg 69117, Germany
- Falk Hildebrand
- Food, Microbiome, and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, NR4 7UQ Norwich, Norfolk, UK; Digital Biology, Earlham Institute NR4 7UZ Norwich, Norfolk, UK.
|
17
|
Johnson MS, Reddy G, Desai MM. Epistasis and evolution: recent advances and an outlook for prediction. BMC Biol 2023; 21:120. [PMID: 37226182 PMCID: PMC10206586 DOI: 10.1186/s12915-023-01585-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Received: 01/06/2023] [Accepted: 03/30/2023] [Indexed: 05/26/2023] Open
Abstract
As organisms evolve, the effects of mutations change as a result of epistatic interactions with other mutations accumulated along the line of descent. This can lead to shifts in adaptability or robustness that ultimately shape subsequent evolution. Here, we review recent advances in measuring, modeling, and predicting epistasis along evolutionary trajectories, both in microbial cells and single proteins. We focus on simple patterns of global epistasis that emerge in this data, in which the effects of mutations can be predicted by a small number of variables. The emergence of these patterns offers promise for efforts to model epistasis and predict evolution.
Affiliation(s)
- Milo S Johnson
- Department of Integrative Biology, University of California, Berkeley, CA, USA
- Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Gautam Reddy
- Physics & Informatics Laboratories, NTT Research, Inc., Sunnyvale, CA, USA
- Center for Brain Science, Harvard University, Cambridge, MA, USA
- Michael M Desai
- Department of Organismic and Evolutionary Biology and Department of Physics, Harvard University, Cambridge, MA, USA.
|
18
|
Baier F, Gauye F, Perez-Carrasco R, Payne JL, Schaerli Y. Environment-dependent epistasis increases phenotypic diversity in gene regulatory networks. Science Advances 2023; 9:eadf1773. [PMID: 37224262 DOI: 10.1126/sciadv.adf1773] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/07/2022] [Accepted: 04/17/2023] [Indexed: 05/26/2023]
Abstract
Mutations to gene regulatory networks can be maladaptive or a source of evolutionary novelty. Epistasis confounds our understanding of how mutations affect the expression patterns of gene regulatory networks, a challenge exacerbated by the dependence of epistasis on the environment. We used the toolkit of synthetic biology to systematically assay the effects of pairwise and triplet combinations of mutant genotypes on the expression pattern of a gene regulatory network expressed in Escherichia coli that interprets an inducer gradient across a spatial domain. We uncovered a preponderance of epistasis that can switch in magnitude and sign across the inducer gradient to produce a greater diversity of expression pattern phenotypes than would be possible in the absence of such environment-dependent epistasis. We discuss our findings in the context of the evolution of hybrid incompatibilities and evolutionary novelties.
Affiliation(s)
- Florian Baier
- Department of Fundamental Microbiology, University of Lausanne, Biophore Building, 1015 Lausanne, Switzerland
- Florence Gauye
- Department of Fundamental Microbiology, University of Lausanne, Biophore Building, 1015 Lausanne, Switzerland
- Joshua L Payne
- Institute of Integrative Biology, ETH Zurich, 8092 Zurich, Switzerland
- Yolanda Schaerli
- Department of Fundamental Microbiology, University of Lausanne, Biophore Building, 1015 Lausanne, Switzerland
|