1
|
Tang Z, Koo PK. Evaluating the representational power of pre-trained DNA language models for regulatory genomics. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.29.582810. [PMID: 38464101 PMCID: PMC10925287 DOI: 10.1101/2024.02.29.582810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/12/2024]
Abstract
The emergence of genomic language models (gLMs) offers an unsupervised approach to learn a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown pre-trained gLMs can be leveraged to improve prediction performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that current gLMs do not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major limitation with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
Collapse
Affiliation(s)
- Ziqi Tang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
| | - Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
| |
Collapse
|
2
|
Liu J, Ashuach T, Inoue F, Ahituv N, Yosef N, Kreimer A. Optimizing sequence design strategies for perturbation MPRAs: a computational evaluation framework. Nucleic Acids Res 2024; 52:1613-1627. [PMID: 38296821 PMCID: PMC10939410 DOI: 10.1093/nar/gkae012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2023] [Revised: 12/26/2023] [Accepted: 01/12/2024] [Indexed: 02/02/2024] Open
Abstract
The advent of perturbation-based massively parallel reporter assays (MPRAs) technique has facilitated the delineation of the roles of non-coding regulatory elements in orchestrating gene expression. However, computational efforts remain scant to evaluate and establish guidelines for sequence design strategies for perturbation MPRAs. In this study, we propose a framework for evaluating and comparing various perturbation strategies for MPRA experiments. Within this framework, we benchmark three different perturbation approaches from the perspectives of alteration in motif-based profiles, consistency of MPRA outputs, and robustness of models that predict the activities of putative regulatory motifs. While our analyses show very similar results across multiple benchmarking metrics, the predictive modeling for the approach involving random nucleotide shuffling shows significant robustness compared with the other two approaches. Thus, we recommend designing sequences by randomly shuffling the nucleotides of the perturbed site in perturbation-MPRA, followed by a coherence check to prevent the introduction of other variations of the target motifs. In summary, our evaluation framework and the benchmarking findings create a resource of computational pipelines and highlight the potential of perturbation-MPRA in predicting non-coding regulatory activities.
Collapse
Affiliation(s)
- Jiayi Liu
- Graduate Program in Cell & Developmental Biology, Rutgers, The State University of New Jersey, 604 Allison Rd, Piscataway, NJ 08854, USA
- Department of Biochemistry and Molecular Biology, Rutgers, The State University of New Jersey, 604 Allison Road, Piscataway, NJ 08854, USA
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, Piscataway, NJ 08854, USA
| | - Tal Ashuach
- Department of Electrical Engineering and Computer Sciences and Center for Computational Biology, University of California, Berkeley, 387 Soda Hall, Berkeley, CA 94720, USA
| | - Fumitaka Inoue
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Faculty of Medicine Building B, Yoshidatachibanacho, Sakyo Ward, Kyoto 606-8303, Japan
| | - Nadav Ahituv
- Department of Bioengineering and Therapeutic Sciences, University of California, 1700 4th Street, San Francisco, CA 94158, USA
- Institute for Human Genetics, University of California, 513 Parnassus Ave, San Francisco, CA 94143, USA
| | - Nir Yosef
- Department of Systems Immunology, Weizmann Institute of Science, 234 Herzl Street, Rehovot 7610001, Israel
- Chan-Zuckerberg Biohub, 499 Illinois St, San Francisco, CA 94158, USA
- Department of Systems Immunology, Ragon Institute of MGH, MIT, and Harvard Institute of Science, 400 Technology Square, Cambridge, MA 02139, USA
| | - Anat Kreimer
- Department of Biochemistry and Molecular Biology, Rutgers, The State University of New Jersey, 604 Allison Road, Piscataway, NJ 08854, USA
- Center for Advanced Biotechnology and Medicine, Rutgers, The State University of New Jersey, 679 Hoes Lane West, Piscataway, Piscataway, NJ 08854, USA
| |
Collapse
|
3
|
DaSilva LF, Senan S, Patel ZM, Janardhan Reddy A, Gabbita S, Nussbaum Z, Valdez Córdova CM, Wenteler A, Weber N, Tunjic TM, Ahmad Khan T, Li Z, Smith C, Bejan M, Karmel Louis L, Cornejo P, Connell W, Wong ES, Meuleman W, Pinello L. DNA-Diffusion: Leveraging Generative Models for Controlling Chromatin Accessibility and Gene Expression via Synthetic Regulatory Elements. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.01.578352. [PMID: 38352499 PMCID: PMC10862870 DOI: 10.1101/2024.02.01.578352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/25/2024]
Abstract
The challenge of systematically modifying and optimizing regulatory elements for precise gene expression control is central to modern genomics and synthetic biology. Advancements in generative AI have paved the way for designing synthetic sequences with the aim of safely and accurately modulating gene expression. We leverage diffusion models to design context-specific DNA regulatory sequences, which hold significant potential toward enabling novel therapeutic applications requiring precise modulation of gene expression. Our framework uses a cell type-specific diffusion model to generate synthetic 200 bp regulatory elements based on chromatin accessibility across different cell types. We evaluate the generated sequences based on key metrics to ensure they retain properties of endogenous sequences: transcription factor binding site composition, potential for cell type-specific chromatin accessibility, and capacity for sequences generated by DNA diffusion to activate gene expression in different cell contexts using state-of-the-art prediction models. Our results demonstrate the ability to robustly generate DNA sequences with cell type-specific regulatory potential. DNA-Diffusion paves the way for revolutionizing a regulatory modulation approach to mammalian synthetic biology and precision gene therapy.
Collapse
Affiliation(s)
- Lucas Ferreira DaSilva
- Department of Pathology, Harvard Medical School, Boston, MA, USA
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
| | - Simon Senan
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Zain Munir Patel
- Department of Pathology, Harvard Medical School, Boston, MA, USA
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Aniketh Janardhan Reddy
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
| | - Sameer Gabbita
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Johns Hopkins University, Baltimore, MD, USA
| | | | | | | | | | | | | | - Zelun Li
- Victor Chang Cardiac Institute, Darlinghurst, New South Wales, Australia
- School of Biotechnology and Biomolecular Sciences, Faculty of Science, UNSW Sydney, Sydney, Australia
| | - Cameron Smith
- Department of Pathology, Harvard Medical School, Boston, MA, USA
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | | | - Lithin Karmel Louis
- Victor Chang Cardiac Institute, Darlinghurst, New South Wales, Australia
- School of Biotechnology and Biomolecular Sciences, Faculty of Science, UNSW Sydney, Sydney, Australia
| | - Paola Cornejo
- Victor Chang Cardiac Institute, Darlinghurst, New South Wales, Australia
- School of Biotechnology and Biomolecular Sciences, Faculty of Science, UNSW Sydney, Sydney, Australia
| | | | - Emily S. Wong
- Victor Chang Cardiac Institute, Darlinghurst, New South Wales, Australia
- School of Biotechnology and Biomolecular Sciences, Faculty of Science, UNSW Sydney, Sydney, Australia
| | - Wouter Meuleman
- Altius Institute for Biomedical Sciences, Seattle, WA, USA
- Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
| | - Luca Pinello
- Department of Pathology, Harvard Medical School, Boston, MA, USA
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
| |
Collapse
|
4
|
Schubach M, Maass T, Nazaretyan L, Röner S, Kircher M. CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to improve genome-wide variant predictions. Nucleic Acids Res 2024; 52:D1143-D1154. [PMID: 38183205 PMCID: PMC10767851 DOI: 10.1093/nar/gkad989] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2023] [Revised: 10/14/2023] [Accepted: 10/17/2023] [Indexed: 01/07/2024] Open
Abstract
Machine Learning-based scoring and classification of genetic variants aids the assessment of clinical findings and is employed to prioritize variants in diverse genetic studies and analyses. Combined Annotation-Dependent Depletion (CADD) is one of the first methods for the genome-wide prioritization of variants across different molecular functions and has been continuously developed and improved since its original publication. Here, we present our most recent release, CADD v1.7. We explored and integrated new annotation features, among them state-of-the-art protein language model scores (Meta ESM-1v), regulatory variant effect predictions (from sequence-based convolutional neural networks) and sequence conservation scores (Zoonomia). We evaluated the new version on data sets derived from ClinVar, ExAC/gnomAD and 1000 Genomes variants. For coding effects, we tested CADD on 31 Deep Mutational Scanning (DMS) data sets from ProteinGym and, for regulatory effect prediction, we used saturation mutagenesis reporter assay data of promoter and enhancer sequences. The inclusion of new features further improved the overall performance of CADD. As with previous releases, all data sets, genome-wide CADD v1.7 scores, scripts for on-site scoring and an easy-to-use webserver are readily provided via https://cadd.bihealth.org/ or https://cadd.gs.washington.edu/ to the community.
Collapse
Affiliation(s)
- Max Schubach
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
| | - Thorben Maass
- Institute of Human Genetics, University Hospital Schleswig-Holstein, University of Lübeck, Lübeck, Germany
| | - Lusiné Nazaretyan
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
| | - Sebastian Röner
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
| | - Martin Kircher
- Exploratory Diagnostic Sciences, Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Berlin, Germany
- Institute of Human Genetics, University Hospital Schleswig-Holstein, University of Lübeck, Lübeck, Germany
| |
Collapse
|
5
|
Bajwa A, Rastogi R, Kathail P, Shuai RW, Ioannidis NM. Characterizing uncertainty in predictions of genomic sequence-to-activity models. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.21.572730. [PMID: 38187742 PMCID: PMC10769392 DOI: 10.1101/2023.12.21.572730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/09/2024]
Abstract
Genomic sequence-to-activity models are increasingly utilized to understand gene regulatory syntax and probe the functional consequences of regulatory variation. Current models make accurate predictions of relative activity levels across the human reference genome, but their performance is more limited for predicting the effects of genetic variants, such as explaining gene expression variation across individuals. To better understand the causes of these shortcomings, we examine the uncertainty in predictions of genomic sequence-to-activity models using an ensemble of Basenji2 model replicates. We characterize prediction consistency on four types of sequences: reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences. We observe that models tend to make high-confidence predictions on reference sequences, even when incorrect, and low-confidence predictions on sequences with variants. For eQTLs and personal genome sequences, we find that model replicates make inconsistent predictions in >50% of cases. Our findings suggest strategies to improve performance of these models.
Collapse
Affiliation(s)
- Ayesha Bajwa
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Ruchir Rastogi
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Pooja Kathail
- Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
| | - Richard W Shuai
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Nilah M Ioannidis
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| |
Collapse
|
6
|
Zhao J, Baltoumas FA, Konnaris MA, Mouratidis I, Liu Z, Sims J, Agarwal V, Pavlopoulos GA, Georgakopoulos-Soares I, Ahituv N. MPRAbase: A Massively Parallel Reporter Assay Database. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.11.19.567742. [PMID: 38045264 PMCID: PMC10690217 DOI: 10.1101/2023.11.19.567742] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/05/2023]
Abstract
Massively parallel reporter assays (MPRAs) represent a set of high-throughput technologies that measure the functional effects of thousands of sequences/variants on gene regulatory activity. There are several different variations of MPRA technology and they are used for numerous applications, including regulatory element discovery, variant effect measurement, saturation mutagenesis, synthetic regulatory element generation or characterization of evolutionary gene regulatory differences. Despite their many designs and uses, there is no comprehensive database that incorporates the results of these experiments. To address this, we developed MPRAbase, a manually curated database that currently harbors 129 experiments, encompassing 17,718,677 elements tested across 35 cell types and 4 organisms. The MPRAbase web interface ( http://www.mprabase.com ) serves as a centralized user-friendly repository to download existing MPRA data for independent analysis and is designed with the ability to allow researchers to share their published data for rapid dissemination to the community.
Collapse
|
7
|
Moyers BA, Partridge EC, Mackiewicz M, Betti MJ, Darji R, Meadows SK, Newberry KM, Brandsmeier LA, Wold BJ, Mendenhall EM, Myers RM. Characterization of human transcription factor function and patterns of gene regulation in HepG2 cells. Genome Res 2023; 33:gr.278205.123. [PMID: 37852782 PMCID: PMC10760452 DOI: 10.1101/gr.278205.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/03/2023] [Accepted: 10/13/2023] [Indexed: 10/20/2023]
Abstract
Transcription factors (TFs) are trans-acting proteins that bind cis-regulatory elements (CREs) in DNA to control gene expression. Here, we analyzed the genomic localization profiles of 529 sequence-specific TFs and 151 cofactors and chromatin regulators in the human cancer cell line HepG2, for a total of 680 broadly termed DNA-associated proteins (DAPs). We used this deep collection to model each TF's impact on gene expression, and identified a cohort of 26 candidate transcriptional repressors. We examine high occupancy target (HOT) sites in the context of three-dimensional genome organization and show biased motif placement in distal-promoter connections involving HOT sites. We also found a substantial number of closed chromatin regions with multiple DAPs bound, and explored their properties, finding that a MAFF/MAFK TF pair correlates with transcriptional repression. Altogether, these analyses provide novel insights into the regulatory logic of the human cell line HepG2 genome and show the usefulness of large genomic analyses for elucidation of individual TF functions.
Collapse
Affiliation(s)
- Belle A Moyers
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
| | | | - Mark Mackiewicz
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
| | - Michael J Betti
- Vanderbilt University Medical Center, Nashville, Tennessee 37232, USA
| | - Roshan Darji
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
| | - Sarah K Meadows
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
| | | | | | - Barbara J Wold
- Merkin Institute for Translational Research, California Institute of Technology, Pasadena, California 91125, USA
| | - Eric M Mendenhall
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA;
| | - Richard M Myers
- HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA;
| |
Collapse
|
8
|
Liu J, Ashuach T, Inoue F, Ahituv N, Yosef N, Kreimer A. Best practices for perturbation MPRA-a computational evaluation framework of sequence design strategies. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.09.27.559768. [PMID: 37808807 PMCID: PMC10557651 DOI: 10.1101/2023.09.27.559768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/10/2023]
Abstract
The advent of the perturbation-based massively parallel reporter assays (MPRAs) technique has enabled delineating of the roles of non-coding regulatory elements in orchestrating gene expression. However, computational efforts remain scant to evaluate and establish guidelines for sequence design strategies for perturbation MPRAs. Here, we propose a framework for evaluating and comparing various perturbation strategies for MPRA experiments. Under this framework, we benchmark three different perturbation approaches from the perspectives of alteration in motif-based profiles, consistency of MPRA outputs, and robustness of models that predict the activities of putative regulatory motifs. Although our analyses show similar while significant results in multiple metrics, the method of randomly shuffling nucleotides outperform the other two methods. Thus, we still recommend designing sequences by randomly shuffling the nucleotides of the perturbed site in perturbation-MPRA. The evaluation framework, together with the benchmarking findings in our work, creates a resource of computational pipelines and illustrates the promise of perturbation-MPRA for predicting non-coding regulatory activities.
Collapse
Affiliation(s)
- Jiayi Liu
- Graduate Programs in Molecular Biosciences, Rutgers, The State
University of New Jersey, 604 Allison Rd, Piscataway, NJ, 08854, USA
- Department of Biochemistry and Molecular Biology, Rutgers, The
State University of New Jersey, 604 Allison Road, Piscataway, NJ, 08854, USA
- Center for Advanced Biotechnology and Medicine, Rutgers, The
State University of New Jersey, 679 Hoes Lane West, Piscataway, Piscataway, NJ, 08854,
USA
| | - Tal Ashuach
- Department of Electrical Engineering and Computer Sciences and
Center for Computational Biology, University of California, Berkeley, 387 Soda Hall,
Berkeley, CA, 94720, USA
| | - Fumitaka Inoue
- Institute for the Advanced Study of Human Biology (WPI-ASHBi),
Kyoto University, Faculty of Medicine Building B, Yoshidatachibanacho, Sakyo Ward, Kyoto,
606-8303, Japan
| | - Nadav Ahituv
- Department of Bioengineering and Therapeutic Sciences, University
of California, San Francisco, 513 Parnassus Ave, CA, 94143, USA
- Institute for Human Genetics, University of California, San
Francisco, 513 Parnassus Ave, CA, 94143, USA
| | - Nir Yosef
- Department of Systems Immunology, Weizmann Institute of Science,
234 Herzl Street, Rehovot 7610001 Israel
- Chan-Zuckerberg Biohub, 499 Illinois St, San Francisco, CA,
94158, USA
- Department of Systems Immunology, Ragon Institute of MGH, MIT,
and Harvard Institute of Science, 400 Technology Square, Cambridge, MA, 02139, USA
| | - Anat Kreimer
- Department of Biochemistry and Molecular Biology, Rutgers, The
State University of New Jersey, 604 Allison Road, Piscataway, NJ, 08854, USA
- Center for Advanced Biotechnology and Medicine, Rutgers, The
State University of New Jersey, 679 Hoes Lane West, Piscataway, Piscataway, NJ, 08854,
USA
| |
Collapse
|
9
|
Friedman RZ, Ramu A, Lichtarge S, Myers CA, Granas DM, Gause M, Corbo JC, Cohen BA, White MA. Active learning of enhancer and silencer regulatory grammar in photoreceptors. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.21.554146. [PMID: 37662358 PMCID: PMC10473580 DOI: 10.1101/2023.08.21.554146] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/05/2023]
Abstract
Cis-regulatory elements (CREs) direct gene expression in health and disease, and models that can accurately predict their activities from DNA sequences are crucial for biomedicine. Deep learning represents one emerging strategy to model the regulatory grammar that relates CRE sequence to function. However, these models require training data on a scale that exceeds the number of CREs in the genome. We address this problem using active machine learning to iteratively train models on multiple rounds of synthetic DNA sequences assayed in live mammalian retinas. During each round of training the model actively selects sequence perturbations to assay, thereby efficiently generating informative training data. We iteratively trained a model that predicts the activities of sequences containing binding motifs for the photoreceptor transcription factor Cone-rod homeobox (CRX) using an order of magnitude less training data than current approaches. The model's internal confidence estimates of its predictions are reliable guides for designing sequences with high activity. The model correctly identified critical sequence differences between active and inactive sequences with nearly identical transcription factor binding sites, and revealed order and spacing preferences for combinations of motifs. Our results establish active learning as an effective method to train accurate deep learning models of cis-regulatory function after exhausting naturally occurring training examples in the genome.
Collapse
Affiliation(s)
- Ryan Z. Friedman
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Avinash Ramu
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Sara Lichtarge
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Connie A. Myers
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - David M. Granas
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Maria Gause
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Joseph C. Corbo
- Department of Pathology and Immunology, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Barak A. Cohen
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| | - Michael A. White
- The Edison Family Center for Genome Sciences & Systems Biology, Washington University School of Medicine, Saint Louis, MO, 63110
- Department of Genetics, Washington University School of Medicine, Saint Louis, MO, 63110
| |
Collapse
|
10
|
Gosai SJ, Castro RI, Fuentes N, Butts JC, Kales S, Noche RR, Mouri K, Sabeti PC, Reilly SK, Tewhey R. Machine-guided design of synthetic cell type-specific cis-regulatory elements. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.08.552077. [PMID: 37609287 PMCID: PMC10441439 DOI: 10.1101/2023.08.08.552077] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/24/2023]
Abstract
Cis-regulatory elements (CREs) control gene expression, orchestrating tissue identity, developmental timing, and stimulus responses, which collectively define the thousands of unique cell types in the body. While there is great potential for strategically incorporating CREs in therapeutic or biotechnology applications that require tissue specificity, there is no guarantee that an optimal CRE for an intended purpose has arisen naturally through evolution. Here, we present a platform to engineer and validate synthetic CREs capable of driving gene expression with programmed cell type specificity. We leverage innovations in deep neural network modeling of CRE activity across three cell types, efficient in silico optimization, and massively parallel reporter assays (MPRAs) to design and empirically test thousands of CREs. Through in vitro and in vivo validation, we show that synthetic sequences outperform natural sequences from the human genome in driving cell type-specific expression. Synthetic sequences leverage unique sequence syntax to promote activity in the on-target cell type and simultaneously reduce activity in off-target cells. Together, we provide a generalizable framework to prospectively engineer CREs and demonstrate the required literacy to write regulatory code that is fit-for-purpose in vivo across vertebrates.
Collapse
Affiliation(s)
- SJ Gosai
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Harvard Graduate Program in Biological and Biomedical Science, Boston MA
- Department Of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - RI Castro
- The Jackson Laboratory, Bar Harbor, ME, USA
| | - N Fuentes
- The Jackson Laboratory, Bar Harbor, ME, USA
- Harvard College, Harvard University, Cambridge, MA, USA
| | - JC Butts
- The Jackson Laboratory, Bar Harbor, ME, USA
- Graduate School of Biomedical Sciences and Engineering, University of Maine, Orono, ME, USA
| | - S Kales
- The Jackson Laboratory, Bar Harbor, ME, USA
| | - RR Noche
- Department of Comparative Medicine, Yale School of Medicine, New Haven, CT, USA
- Yale Zebrafish Research Core, Yale School of Medicine, New Haven, CT, USA
| | - K Mouri
- The Jackson Laboratory, Bar Harbor, ME, USA
| | - PC Sabeti
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department Of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - SK Reilly
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
- Wu Tsai Institute, Yale University, New Haven, CT, USA
| | - R Tewhey
- The Jackson Laboratory, Bar Harbor, ME, USA
- Graduate School of Biomedical Sciences and Engineering, University of Maine, Orono, ME, USA
- Graduate School of Biomedical Sciences, Tufts University School of Medicine, Boston, MA, USA
| |
Collapse
|
11
|
Li G, Su G, Wang Y, Wang W, Shi J, Li D, Sui G. Integrative genomic analyses of promoter G-quadruplexes reveal their selective constraint and association with gene activation. Commun Biol 2023; 6:625. [PMID: 37301913 PMCID: PMC10257653 DOI: 10.1038/s42003-023-05015-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2023] [Accepted: 06/05/2023] [Indexed: 06/12/2023] Open
Abstract
G-quadruplexes (G4s) regulate DNA replication and gene transcription, and are enriched in promoters without fully appreciated functional relevance. Here we show high selection pressure on putative G4 (pG4) forming sequences in promoters through investigating genetic and genomic data. Analyses of 76,156 whole-genome sequences reveal that G-tracts and connecting loops in promoter pG4s display lower or higher allele frequencies, respectively, than pG4-flanking regions, and central guanines (Gs) in G-tracts show higher selection pressure than other Gs. Additionally, pG4-promoters produce over 72.4% of transcripts, and promoter G4-containing genes are expressed at relatively high levels. Most genes repressed by TMPyP4, a G4-ligand, regulate epigenetic processes, and promoter G4s are enriched with gene activation histone marks, chromatin remodeler and transcription factor binding sites. Consistently, cis-expression quantitative trait loci (cis-eQTLs) are enriched in promoter pG4s and their G-tracts. Overall, our study demonstrates selective constraint of promoter G4s and reinforces their stimulative role in gene expression.
Collapse
Affiliation(s)
- Guangyue Li
- College of Life Science, Northeast Forestry University, Harbin, 150040, China
| | - Gongbo Su
- College of Life Science, Northeast Forestry University, Harbin, 150040, China
| | - Yunxuan Wang
- Department of Medical Oncology, Harbin Medical University Cancer Hospital, Harbin, 150081, China
| | - Wenmeng Wang
- College of Life Science, Northeast Forestry University, Harbin, 150040, China
| | - Jinming Shi
- College of Life Science, Northeast Forestry University, Harbin, 150040, China
| | - Dangdang Li
- College of Life Science, Northeast Forestry University, Harbin, 150040, China
| | - Guangchao Sui
- College of Life Science, Northeast Forestry University, Harbin, 150040, China.
| |
Collapse
|
12
|
Zahm AM, Owens WS, Himes SR, Rondem KE, Fallon BS, Gormick AN, Bloom JS, Kosuri S, Chan H, English JG. Discovery and Validation of Context-Dependent Synthetic Mammalian Promoters. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.11.539703. [PMID: 37214829 PMCID: PMC10197685 DOI: 10.1101/2023.05.11.539703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/24/2023]
Abstract
Cellular transcription enables cells to adapt to various stimuli and maintain homeostasis. Transcription factors bind to transcription response elements (TREs) in gene promoters, initiating transcription. Synthetic promoters, derived from natural TREs, can be engineered to control exogenous gene expression using endogenous transcription machinery. This technology has found extensive use in biological research for applications including reporter gene assays, biomarker development, and programming synthetic circuits in living cells. However, a reliable and precise method for selecting minimally-sized synthetic promoters with desired background, amplitude, and stimulation response profiles has been elusive. In this study, we introduce a massively parallel reporter assay library containing 6184 synthetic promoters, each less than 250 bp in length. This comprehensive library allows for rapid identification of promoters with optimal transcriptional output parameters across multiple cell lines and stimuli. We showcase this library's utility to identify promoters activated in unique cell types, and in response to metabolites, mitogens, cellular toxins, and agonism of both aminergic and non-aminergic GPCRs. We further show these promoters can be used in luciferase reporter assays, eliciting 50-100 fold dynamic ranges in response to stimuli. Our platform is effective, easily implemented, and provides a solution for selecting short-length promoters with precise performance for a multitude of applications.
Collapse
Affiliation(s)
- Adam M. Zahm
- Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
| | | | - Samuel R. Himes
- Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
| | - Kathleen E. Rondem
- Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
| | - Braden S. Fallon
- Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
| | - Alexa N. Gormick
- Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
| | | | | | | | - Justin G. English
- Department of Biochemistry, University of Utah School of Medicine, Salt Lake City, UT, USA
| |
Collapse
|
13
|
Georgakopoulos-Soares I, Deng C, Agarwal V, Chan CSY, Zhao J, Inoue F, Ahituv N. Transcription factor binding site orientation and order are major drivers of gene regulatory activity. Nat Commun 2023; 14:2333. [PMID: 37087538 PMCID: PMC10122648 DOI: 10.1038/s41467-023-37960-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2022] [Accepted: 04/06/2023] [Indexed: 04/24/2023] Open
Abstract
The gene regulatory code and grammar remain largely unknown, precluding our ability to link phenotype to genotype in regulatory sequences. Here, using a massively parallel reporter assay (MPRA) of 209,440 sequences, we examine all possible pair and triplet combinations, permutations and orientations of eighteen liver-associated transcription factor binding sites (TFBS). We find that TFBS orientation and order have a major effect on gene regulatory activity. Corroborating these results with genomic analyses, we find clear human promoter TFBS orientation biases and similar TFBS orientation and order transcriptional effects in an MPRA that tested 164,307 liver candidate regulatory elements. Additionally, by adding TFBS orientation to a model that predicts expression from sequence we improve performance by 7.7%. Collectively, our results show that TFBS orientation and order have a significant effect on gene regulatory activity and need to be considered when analyzing the functional effect of variants on the activity of these sequences.
Collapse
Affiliation(s)
- Ilias Georgakopoulos-Soares
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA.
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA.
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA.
| | - Chengyu Deng
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
| | - Vikram Agarwal
- mRNA Center of Excellence, Sanofi Pasteur Inc., Waltham, MA, USA
| | - Candace S Y Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
| | - Jingjing Zhao
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
| | - Fumitaka Inoue
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Nadav Ahituv
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA.
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA.
| |
Collapse
|