1
|
Long Y, Mora A, Li FZ, Gürsoy E, Johnston KE, Arnold FH. LevSeq: Rapid Generation of Sequence-Function Data for Directed Evolution and Machine Learning. ACS Synth Biol 2025; 14:230-238. [PMID: 39719062 DOI: 10.1021/acssynbio.4c00625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/26/2024]
Abstract
Sequence-function data provides valuable information about the protein functional landscape but is rarely obtained during directed evolution campaigns. Here, we present Long-read every variant Sequencing (LevSeq), a pipeline that combines a dual barcoding strategy with nanopore sequencing to rapidly generate sequence-function data for entire protein-coding genes. LevSeq integrates into existing protein engineering workflows and comes with open-source software for data analysis and visualization. The pipeline facilitates data-driven protein engineering by consolidating sequence-function data to inform directed evolution and provide the requisite data for machine learning-guided protein engineering (MLPE). LevSeq enables quality control of mutagenesis libraries prior to screening, which reduces time and resource costs. Simulation studies demonstrate LevSeq's ability to accurately detect variants under various experimental conditions. Finally, we show LevSeq's utility in engineering protoglobins for new-to-nature chemistry. Widespread adoption of LevSeq and sharing of the data will enhance our understanding of protein sequence-function landscapes and empower data-driven directed evolution.
Collapse
Affiliation(s)
- Yueming Long
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California91125, United States
| | - Ariane Mora
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California91125, United States
| | - Francesca-Zhoufan Li
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, California91125, United States
| | - Emre Gürsoy
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California91125, United States
| | - Kadina E Johnston
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, California91125, United States
| | - Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California91125, United States
- Division of Biology and Bioengineering, California Institute of Technology, Pasadena, California91125, United States
| |
Collapse
|
2
|
Maciá Valero A, Prins RC, de Vroet T, Billerbeck S. Combining Oligo Pools and Golden Gate Cloning to Create Protein Variant Libraries or Guide RNA Libraries for CRISPR Applications. Methods Mol Biol 2025; 2850:265-295. [PMID: 39363077 DOI: 10.1007/978-1-0716-4220-7_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/05/2024]
Abstract
Oligo pools are array-synthesized, user-defined mixtures of single-stranded oligonucleotides that can be used as a source of synthetic DNA for library cloning. While currently offering the most affordable source of synthetic DNA, oligo pools also come with limitations such as a maximum synthesis length (approximately 350 bases), a higher error rate compared to alternative synthesis methods, and the presence of truncated molecules in the pool due to incomplete synthesis. Here, we provide users with a comprehensive protocol that details how oligo pools can be used in combination with Golden Gate cloning to create user-defined protein mutant libraries, as well as single-guide RNA libraries for CRISPR applications. Our methods are optimized to work within the Yeast Toolkit Golden Gate scheme, but are in principle compatible with any other Golden Gate-based modular cloning toolkit and extendable to other restriction enzyme-based cloning methods beyond Golden Gate. Our methods yield high-quality, affordable, in-house variant libraries.
Collapse
Affiliation(s)
- Alicia Maciá Valero
- Molecular Microbiology, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Groningen, The Netherlands
| | - Rianne C Prins
- Molecular Microbiology, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Groningen, The Netherlands
| | - Thijs de Vroet
- Molecular Microbiology, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Groningen, The Netherlands
| | - Sonja Billerbeck
- Molecular Microbiology, Groningen Biomolecular Sciences and Biotechnology Institute, University of Groningen, Groningen, The Netherlands.
| |
Collapse
|
3
|
DelRosso N, Suzuki PH, Griffith D, Lotthammer JM, Novak B, Kocalar S, Sheth MU, Holehouse AS, Bintu L, Fordyce P. High-throughput affinity measurements of direct interactions between activation domains and co-activators. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.08.19.608698. [PMID: 39229005 PMCID: PMC11370418 DOI: 10.1101/2024.08.19.608698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 09/05/2024]
Abstract
Sequence-specific activation by transcription factors is essential for gene regulation1,2. Key to this are activation domains, which often fall within disordered regions of transcription factors3,4 and recruit co-activators to initiate transcription5. These interactions are difficult to characterize via most experimental techniques because they are typically weak and transient6,7. Consequently, we know very little about whether these interactions are promiscuous or specific, the mechanisms of binding, and how these interactions tune the strength of gene activation. To address these questions, we developed a microfluidic platform for expression and purification of hundreds of activation domains in parallel followed by direct measurement of co-activator binding affinities (STAMMPPING, for Simultaneous Trapping of Affinity Measurements via a Microfluidic Protein-Protein INteraction Generator). By applying STAMMPPING to quantify direct interactions between eight co-activators and 204 human activation domains (>1,500 K ds), we provide the first quantitative map of these interactions and reveal 334 novel binding pairs. We find that the metazoan-specific co-activator P300 directly binds >100 activation domains, potentially explaining its widespread recruitment across the genome to influence transcriptional activation. Despite sharing similar molecular properties (e.g. enrichment of negative and hydrophobic residues), activation domains utilize distinct biophysical properties to recruit certain co-activator domains. Co-activator domain affinity and occupancy are well-predicted by analytical models that account for multivalency, and in vitro affinities quantitatively predict activation in cells with an ultrasensitive response. Not only do our results demonstrate the ability to measure affinities between even weak protein-protein interactions in high throughput, but they also provide a necessary resource of over 1,500 activation domain/co-activator affinities which lays the foundation for understanding the molecular basis of transcriptional activation.
Collapse
Affiliation(s)
| | - Peter H Suzuki
- Department of Bioengineering, Stanford University, Stanford, CA, USA
| | - Daniel Griffith
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
- Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
| | - Jeffrey M Lotthammer
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
- Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
| | - Borna Novak
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
- Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
| | - Selin Kocalar
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Maya U Sheth
- Department of Bioengineering, Stanford University, Stanford, CA, USA
| | - Alex S Holehouse
- Department of Biochemistry and Molecular Biophysics, Washington University School of Medicine, St. Louis, MO, USA
- Center for Biomolecular Condensates, Washington University in St. Louis, St. Louis, MO, USA
| | - Lacramioara Bintu
- Biophysics Program, Stanford University, Stanford, CA, USA
- Department of Bioengineering, Stanford University, Stanford, CA, USA
| | - Polly Fordyce
- Biophysics Program, Stanford University, Stanford, CA, USA
- Department of Bioengineering, Stanford University, Stanford, CA, USA
- Sarafan ChEM-H Institute, Stanford University, Stanford, CA, USA
- Chan Zuckerberg Biohub San Francisco, CA, USA
| |
Collapse
|
4
|
Li W, Miller D, Liu X, Tosi L, Chkaiban L, Mei H, Hung PH, Parekkadan B, Sherlock G, Levy S. Arrayed in vivo barcoding for multiplexed sequence verification of plasmid DNA and demultiplexing of pooled libraries. Nucleic Acids Res 2024; 52:e47. [PMID: 38709890 PMCID: PMC11162764 DOI: 10.1093/nar/gkae332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2023] [Revised: 02/23/2024] [Accepted: 04/16/2024] [Indexed: 05/08/2024] Open
Abstract
Sequence verification of plasmid DNA is critical for many cloning and molecular biology workflows. To leverage high-throughput sequencing, several methods have been developed that add a unique DNA barcode to individual samples prior to pooling and sequencing. However, these methods require an individual plasmid extraction and/or in vitro barcoding reaction for each sample processed, limiting throughput and adding cost. Here, we develop an arrayed in vivo plasmid barcoding platform that enables pooled plasmid extraction and library preparation for Oxford Nanopore sequencing. This method has a high accuracy and recovery rate, and greatly increases throughput and reduces cost relative to other plasmid barcoding methods or Sanger sequencing. We use in vivo barcoding to sequence verify >45 000 plasmids and show that the method can be used to transform error-containing dispersed plasmid pools into sequence-perfect arrays or well-balanced pools. In vivo barcoding does not require any specialized equipment beyond a low-overhead Oxford Nanopore sequencer, enabling most labs to flexibly process hundreds to thousands of plasmids in parallel.
Collapse
Affiliation(s)
- Weiyi Li
- SLAC National Accelerator Laboratory, Stanford University, Stanford, CA, USA
| | - Darach Miller
- SLAC National Accelerator Laboratory, Stanford University, Stanford, CA, USA
| | - Xianan Liu
- SLAC National Accelerator Laboratory, Stanford University, Stanford, CA, USA
| | - Lorenzo Tosi
- Department of Biomedical Engineering, Rutgers University, Piscataway, NJ, USA
| | - Lamia Chkaiban
- Department of Biomedical Engineering, Rutgers University, Piscataway, NJ, USA
| | - Han Mei
- SLAC National Accelerator Laboratory, Stanford University, Stanford, CA, USA
| | - Po-Hsiang Hung
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Biju Parekkadan
- Department of Biomedical Engineering, Rutgers University, Piscataway, NJ, USA
| | - Gavin Sherlock
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Sasha F Levy
- SLAC National Accelerator Laboratory, Stanford University, Stanford, CA, USA
| |
Collapse
|
5
|
Johnston KE, Fannjiang C, Wittmann BJ, Hie BL, Yang KK, Wu Z. Machine Learning for Protein Engineering. ARXIV 2023:arXiv:2305.16634v1. [PMID: 37292483 PMCID: PMC10246115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Directed evolution of proteins has been the most effective method for protein engineering. However, a new paradigm is emerging, fusing the library generation and screening approaches of traditional directed evolution with computation through the training of machine learning models on protein sequence fitness data. This chapter highlights successful applications of machine learning to protein engineering and directed evolution, organized by the improvements that have been made with respect to each step of the directed evolution cycle. Additionally, we provide an outlook for the future based on the current direction of the field, namely in the development of calibrated models and in incorporating other modalities, such as protein structure.
Collapse
Affiliation(s)
| | | | - Bruce J Wittmann
- work done while at California Institute of Technology, now at Microsoft
| | | | | | | |
Collapse
|
6
|
Wittmann BJ, Johnston KE, Almhjell PJ, Arnold FH. evSeq: Cost-Effective Amplicon Sequencing of Every Variant in a Protein Library. ACS Synth Biol 2022; 11:1313-1324. [PMID: 35172576 DOI: 10.1021/acssynbio.1c00592] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022]
Abstract
Widespread availability of protein sequence-fitness data would revolutionize both our biochemical understanding of proteins and our ability to engineer them. Unfortunately, even though thousands of protein variants are generated and evaluated for fitness during a typical protein engineering campaign, most are never sequenced, leaving a wealth of potential sequence-fitness information untapped. Primarily, this is because sequencing is unnecessary for many protein engineering strategies; the added cost and effort of sequencing are thus unjustified. It also results from the fact that, even though many lower-cost sequencing strategies have been developed, they often require at least some access to and experience with sequencing or computational resources, both of which can be barriers to access. Here, we present every variant sequencing (evSeq), a method and collection of tools/standardized components for sequencing a variable region within every variant gene produced during a protein engineering campaign at a cost of cents per variant. evSeq was designed to democratize low-cost sequencing for protein engineers and, indeed, anyone interested in engineering biological systems. Execution of its wet-lab component is simple, requires no sequencing experience to perform, relies only on resources and services typically available to biology labs, and slots neatly into existing protein engineering workflows. Analysis of evSeq data is likewise made simple by its accompanying software (found at github.com/fhalab/evSeq, documentation at fhalab.github.io/evSeq), which can be run on a personal laptop and was designed to be accessible to users with no computational experience. Low-cost and easy-to-use, evSeq makes the collection of extensive protein variant sequence-fitness data practical.
Collapse
Affiliation(s)
- Bruce J. Wittmann
- Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, California 91125, United States
| | - Kadina E. Johnston
- Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, California 91125, United States
| | - Patrick J. Almhjell
- Division of Chemistry and Chemical Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, California 91125, United States
| | - Frances H. Arnold
- Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, California 91125, United States
- Division of Chemistry and Chemical Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, California 91125, United States
| |
Collapse
|