1
|
Barbadilla-Martínez L, Klaassen N, van Steensel B, de Ridder J. Predicting gene expression from DNA sequence using deep learning models. Nat Rev Genet 2025:10.1038/s41576-025-00841-2. [PMID: 40360798 DOI: 10.1038/s41576-025-00841-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 04/01/2025] [Indexed: 05/15/2025]
Abstract
Transcription of genes is regulated by DNA elements such as promoters and enhancers, the activity of which are in turn controlled by many transcription factors. Owing to the highly complex combinatorial logic involved, it has been difficult to construct computational models that predict gene activity from DNA sequence. Recent advances in deep learning techniques applied to data from epigenome mapping and high-throughput reporter assays have made substantial progress towards addressing this complexity. Such models can capture the regulatory grammar with remarkable accuracy and show great promise in predicting the effects of non-coding variants, uncovering detailed molecular mechanisms of gene regulation and designing synthetic regulatory elements for biotechnology. Here, we discuss the principles of these approaches, the types of training data sets that are available and the strengths and limitations of different approaches.
Collapse
Affiliation(s)
- Lucía Barbadilla-Martínez
- Oncode Institute, Utrecht, The Netherlands
- Center for Molecular Medicine, UMC Utrecht, Utrecht, The Netherlands
| | - Noud Klaassen
- Oncode Institute, Utrecht, The Netherlands
- Division of Molecular Genetics, Netherlands Cancer Institute, Amsterdam, The Netherlands
| | - Bas van Steensel
- Oncode Institute, Utrecht, The Netherlands.
- Division of Molecular Genetics, Netherlands Cancer Institute, Amsterdam, The Netherlands.
| | - Jeroen de Ridder
- Oncode Institute, Utrecht, The Netherlands.
- Center for Molecular Medicine, UMC Utrecht, Utrecht, The Netherlands.
| |
Collapse
|
2
|
Martyn GE, Montgomery MT, Jones H, Guo K, Doughty BR, Linder J, Bisht D, Xia F, Cai XS, Chen Z, Cochran K, Lawrence KA, Munson G, Pampari A, Fulco CP, Sahni N, Kelley DR, Lander ES, Kundaje A, Engreitz JM. Rewriting regulatory DNA to dissect and reprogram gene expression. Cell 2025:S0092-8674(25)00352-6. [PMID: 40245860 DOI: 10.1016/j.cell.2025.03.034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2023] [Revised: 12/16/2024] [Accepted: 03/19/2025] [Indexed: 04/19/2025]
Abstract
Regulatory DNA provides a platform for transcription factor binding to encode cell-type-specific patterns of gene expression. However, the effects and programmability of regulatory DNA sequences remain difficult to map or predict. Here, we develop variant effects from flow-sorting experiments with CRISPR targeting screens (Variant-EFFECTS) to introduce hundreds of designed edits to endogenous regulatory DNA and quantify their effects on gene expression. We systematically dissect and reprogram 3 regulatory elements for 2 genes in 2 cell types. These data reveal endogenous binding sites with effects specific to genomic context, transcription factor motifs with cell-type-specific activities, and limitations of computational models for predicting the effect sizes of variants. We identify small edits that can tune gene expression over a large dynamic range, suggesting new possibilities for prime-editing-based therapeutics targeting regulatory DNA. Variant-EFFECTS provides a generalizable tool to dissect regulatory DNA and to identify genome editing reagents that tune gene expression in an endogenous context.
Collapse
Affiliation(s)
- Gabriella E Martyn
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA; Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA 94305, USA
| | - Michael T Montgomery
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA; Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA 94305, USA
| | - Hank Jones
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA; Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA 94305, USA
| | - Katherine Guo
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA; Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA 94305, USA
| | - Benjamin R Doughty
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Johannes Linder
- Calico Life Sciences LLC, South San Francisco, CA 94080, USA
| | - Deepa Bisht
- Department of Genitourinary Medical Oncology, Division of Cancer Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA; Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA; Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX 77230, USA
| | - Fan Xia
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA; Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA 94305, USA
| | - Xiangmeng S Cai
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA; Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA 94305, USA; Department of Bioengineering, Stanford University, Stanford, CA 94305, USA
| | - Ziwei Chen
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Kelly Cochran
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Kathryn A Lawrence
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Glen Munson
- Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Gene Regulation Observatory, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Anusri Pampari
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Charles P Fulco
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Nidhi Sahni
- Department of Epigenetics and Molecular Carcinogenesis, The University of Texas MD Anderson Cancer Center, Houston, TX 77230, USA; Department of Bioinformatics and Computational Biology, The University of Texas MD Anderson Cancer Center, Houston, TX 77230, USA; Quantitative and Computational Biosciences Program, Baylor College of Medicine, Houston, TX 77030, USA
| | - David R Kelley
- Calico Life Sciences LLC, South San Francisco, CA 94080, USA
| | - Eric S Lander
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Department of Biology, MIT, Cambridge, MA 02139, USA; Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
| | - Anshul Kundaje
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA; Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Jesse M Engreitz
- Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, USA; Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA 94305, USA; Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Gene Regulation Observatory, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Stanford Cardiovascular Institute, Stanford University, Stanford, CA 94305, USA.
| |
Collapse
|
3
|
Linder J, Srivastava D, Yuan H, Agarwal V, Kelley DR. Predicting RNA-seq coverage from DNA sequence as a unifying model of gene regulation. Nat Genet 2025; 57:949-961. [PMID: 39779956 PMCID: PMC11985352 DOI: 10.1038/s41588-024-02053-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2023] [Accepted: 12/04/2024] [Indexed: 01/11/2025]
Abstract
Sequence-based machine-learning models trained on genomics data improve genetic variant interpretation by providing functional predictions describing their impact on the cis-regulatory code. However, current tools do not predict RNA-seq expression profiles because of modeling challenges. Here, we introduce Borzoi, a model that learns to predict cell-type-specific and tissue-specific RNA-seq coverage from DNA sequence. Using statistics derived from Borzoi's predicted coverage, we isolate and accurately score DNA variant effects across multiple layers of regulation, including transcription, splicing and polyadenylation. Evaluated on quantitative trait loci, Borzoi is competitive with and often outperforms state-of-the-art models trained on individual regulatory functions. By applying attribution methods to the derived statistics, we extract cis-regulatory motifs driving RNA expression and post-transcriptional regulation in normal tissues. The wide availability of RNA-seq data across species, conditions and assays profiling specific aspects of regulation emphasizes the potential of this approach to decipher the mapping from DNA sequence to regulatory function.
Collapse
Affiliation(s)
| | | | - Han Yuan
- Calico Life Sciences LLC, South San Francisco, CA, USA
| | - Vikram Agarwal
- mRNA Center of Excellence, Sanofi Pasteur Inc., Cambridge, MA, USA
| | | |
Collapse
|
4
|
Hingerl JC, Martens LD, Karollus A, Manz T, Buenrostro JD, Theis FJ, Gagneur J. scooby: Modeling multi-modal genomic profiles from DNA sequence at single-cell resolution. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.09.19.613754. [PMID: 39345504 PMCID: PMC11429888 DOI: 10.1101/2024.09.19.613754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/01/2024]
Abstract
Understanding how regulatory DNA elements shape gene expression across individual cells is a fundamental challenge in genomics. Joint RNA-seq and epigenomic profiling provides opportunities to build unifying models of gene regulation capturing sequence determinants across steps of gene expression. However, current models, developed primarily for bulk omics data, fail to capture the cellular heterogeneity and dynamic processes revealed by single-cell multi-modal technologies. Here, we introduce scooby, the first framework to model scRNA-seq coverage and scATAC-seq insertion profiles along the genome from sequence at single-cell resolution. For this, we leverage the pre-trained multi-omics profile predictor Borzoi as a foundation model, equip it with a cell-specific decoder, and fine-tune its sequence embeddings. Specifically, we condition the decoder on the cell position in a precomputed single-cell embedding resulting in strong generalization capability. Applied to a hematopoiesis dataset, scooby recapitulates cell-specific expression levels of held-out genes, and identifies regulators and their putative target genes through in silico motif deletion. Moreover, accurate variant effect prediction with scooby allows for breaking down bulk eQTL effects into single-cell effects and delineating their impact on chromatin accessibility and gene expression. We anticipate scooby to aid unraveling the complexities of gene regulation at the resolution of individual cells.
Collapse
Affiliation(s)
- Johannes C Hingerl
- School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
| | - Laura D Martens
- School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
- Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
| | - Alexander Karollus
- School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
| | - Trevor Manz
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Jason D Buenrostro
- Gene Regulation Observatory, Broad Institute of MIT and Harvard, Cambridge, MA 02142 USA
- Department of Stem Cell and Regenerative Biology, Harvard University, Cambridge, MA, USA
| | - Fabian J Theis
- School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
- Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
| | - Julien Gagneur
- School of Computation, Information and Technology, Technical University of Munich, Munich, Germany
- Munich Center for Machine Learning, Munich, Germany
- Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
- Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
| |
Collapse
|
5
|
Qiu Y, Liu L, Yan J, Xiang X, Wang S, Luo Y, Deng K, Xu J, Jin M, Wu X, Liwei Cheng, Zhou Y, Xie W, Liu HJ, Fernie AR, Hu X, Yan J. Precise engineering of gene expression by editing plasticity. Genome Biol 2025; 26:51. [PMID: 40065399 PMCID: PMC11892124 DOI: 10.1186/s13059-025-03516-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2024] [Accepted: 02/25/2025] [Indexed: 03/14/2025] Open
Abstract
BACKGROUND Identifying transcriptional cis-regulatory elements (CREs) and understanding their role in gene expression are essential for the precise manipulation of gene expression and associated phenotypes. This knowledge is fundamental for advancing genetic engineering and improving crop traits. RESULTS We here demonstrate that CREs can be accurately predicted and utilized to precisely regulate gene expression beyond the range of natural variation. We firstly build two sequence-to-expression deep learning models to respectively identify distal and proximal CREs by combining them with interpretability methods in multiple crops. A large number of distal CREs are verified for enhancer activity in vitro using UMI-STARR-seq on 12,000 synthesized sequences. These comprehensively characterized CREs and their precisely predicted effects further contribute to the design of in silico editing schemes for precise engineering of gene expression. We introduce a novel concept of "editingplasticity" to evaluate the potential of promoter editing to alter expression of each gene. As a proof of concept, both exhaustive prediction and random knockout mutants are analyzed within the promoter region of ZmVTE4, a key gene affecting α-tocopherol content in maize. A high degree of agreement between predicted and observed expression is observed, extending the range of natural variation and thereby allowing the creation of an optimal phenotype. CONCLUSIONS Our study provides a robust computational framework that advances knowledge-guided gene editing for precise regulation of gene expression and crop improvement. By reliably predicting and validating CREs, we offer a tool for targeted genetic modifications, enhancing desirable traits in crops.
Collapse
Affiliation(s)
- Yang Qiu
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Hongshan Laboratory, Wuhan, 430070, China
| | - Lifen Liu
- Hubei Hongshan Laboratory, Wuhan, 430070, China
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan, 430070, China
| | - Jiali Yan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Hongshan Laboratory, Wuhan, 430070, China
| | - Xianglei Xiang
- Hubei Hongshan Laboratory, Wuhan, 430070, China
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan, 430070, China
| | - Shouzhe Wang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Hongshan Laboratory, Wuhan, 430070, China
| | - Yun Luo
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Hongshan Laboratory, Wuhan, 430070, China
| | - Kaixuan Deng
- Hubei Hongshan Laboratory, Wuhan, 430070, China
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan, 430070, China
| | - Jieting Xu
- WIMI Biotechnology Co., Ltd., Sanya, Hainan, 572000, China
| | - Minliang Jin
- WIMI Biotechnology Co., Ltd., Sanya, Hainan, 572000, China
| | - Xiaoyu Wu
- WIMI Biotechnology Co., Ltd., Sanya, Hainan, 572000, China
| | - Liwei Cheng
- WIMI Biotechnology Co., Ltd., Sanya, Hainan, 572000, China
| | - Ying Zhou
- Institute of Agricultural Sciences of Xishuangbanna Prefecture of Yunnan Province, Jinghong, Yunnan, 666100, China
- The Expert Workstation of Jianbing Yan in Yunnan Province, Jinghong, Yunnan, 666100, China
| | - Weibo Xie
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Hongshan Laboratory, Wuhan, 430070, China
| | - Hai-Jun Liu
- Yazhouwan National Laboratory, Sanya, 572024, China
| | - Alisdair R Fernie
- Department of Molecular Physiology, Max Planck Institute of Molecular Plant Physiology, Potsdam-Golm, 14476, Germany
| | - Xuehai Hu
- Hubei Hongshan Laboratory, Wuhan, 430070, China.
- College of Informatics, Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, Wuhan, 430070, China.
| | - Jianbing Yan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China.
- Hubei Hongshan Laboratory, Wuhan, 430070, China.
- Yazhouwan National Laboratory, Sanya, 572024, China.
| |
Collapse
|
6
|
Agarwal V, Inoue F, Schubach M, Penzar D, Martin BK, Dash PM, Keukeleire P, Zhang Z, Sohota A, Zhao J, Georgakopoulos-Soares I, Noble WS, Yardımcı GG, Kulakovskiy IV, Kircher M, Shendure J, Ahituv N. Massively parallel characterization of transcriptional regulatory elements. Nature 2025; 639:411-420. [PMID: 39814889 PMCID: PMC11903340 DOI: 10.1038/s41586-024-08430-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2023] [Accepted: 11/20/2024] [Indexed: 01/18/2025]
Abstract
The human genome contains millions of candidate cis-regulatory elements (cCREs) with cell-type-specific activities that shape both health and many disease states1. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these cCREs. Here we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of more than 680,000 sequences, representing an extensive set of annotated cCREs among three cell types (HepG2, K562 and WTC11), and found that 41.7% of these sequences were active. By testing sequences in both orientations, we find promoters to have strand-orientation biases and their 200-nucleotide cores to function as non-cell-type-specific 'on switches' that provide similar expression levels to their associated gene. By contrast, enhancers have weaker orientation biases, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we develop sequence-based models to predict cCRE function and variant effects with high accuracy, delineate regulatory motifs and model their combinatorial effects. Testing a lentiMPRA library encompassing 60,000 cCREs in all three cell types further identified factors that determine cell-type specificity. Collectively, our work provides an extensive catalogue of functional CREs in three widely used cell lines and showcases how large-scale functional measurements can be used to dissect regulatory grammar.
Collapse
Affiliation(s)
- Vikram Agarwal
- Department of Genome Sciences, University of Washington, Seattle, WA, USA.
- mRNA Center of Excellence, Sanofi, Waltham, MA, USA.
| | - Fumitaka Inoue
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Max Schubach
- Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
| | - Dmitry Penzar
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
- Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow, Russia
| | - Beth K Martin
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Pyaree Mohan Dash
- Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
| | - Pia Keukeleire
- Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, Lübeck, Germany
| | - Zicong Zhang
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Ajuni Sohota
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
| | - Jingjing Zhao
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA
| | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - William S Noble
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
| | - Galip Gürkan Yardımcı
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Knight Cancer Institute, Oregon Health and Science University, Portland, OR, USA
- Cancer Early Detection Advanced Research Center, Oregon Health and Science University, Portland, OR, USA
| | - Ivan V Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
- Life Improvement by Future Technologies (LIFT) Center, Moscow, Russia
| | - Martin Kircher
- Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
- Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, Lübeck, Germany
| | - Jay Shendure
- Department of Genome Sciences, University of Washington, Seattle, WA, USA.
- Howard Hughes Medical Institute, Seattle, WA, USA.
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA, USA.
- Seattle Hub for Synthetic Biology, Seattle, Washington, USA.
| | - Nadav Ahituv
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA.
- Institute for Human Genetics, University of California, San Francisco, San Francisco, CA, USA.
| |
Collapse
|
7
|
Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, Lopez Carranza N, Grzywaczewski AH, Oteri F, Dallago C, Trop E, de Almeida BP, Sirelkhatim H, Richard G, Skwark M, Beguir K, Lopez M, Pierrot T. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat Methods 2025; 22:287-297. [PMID: 39609566 PMCID: PMC11810778 DOI: 10.1038/s41592-024-02523-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2023] [Accepted: 10/19/2024] [Indexed: 11/30/2024]
Abstract
The prediction of molecular phenotypes from DNA sequences remains a longstanding challenge in genomics, often driven by limited annotated data and the inability to transfer learnings between tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named Nucleotide Transformer, ranging from 50 million up to 2.5 billion parameters and integrating information from 3,202 human genomes and 850 genomes from diverse species. These transformer models yield context-specific representations of nucleotide sequences, which allow for accurate predictions even in low-data settings. We show that the developed models can be fine-tuned at low cost to solve a variety of genomics applications. Despite no supervision, the models learned to focus attention on key genomic elements and can be used to improve the prioritization of genetic variants. The training and application of foundational models in genomics provides a widely applicable approach for accurate molecular phenotype prediction from DNA sequence.
Collapse
Affiliation(s)
| | | | | | | | | | | | - Christian Dallago
- Nvidia, Santa Clara, CA, USA
- Technical University of Munich, Munich, Germany
| | | | | | | | | | | | | | | | | |
Collapse
|
8
|
Li S, Noroozizadeh S, Moayedpour S, Kogler-Anele L, Xue Z, Zheng D, Montoya FU, Agarwal V, Bar-Joseph Z, Jager S. mRNA-LM: full-length integrated SLM for mRNA analysis. Nucleic Acids Res 2025; 53:gkaf044. [PMID: 39898548 PMCID: PMC11962594 DOI: 10.1093/nar/gkaf044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2024] [Revised: 12/07/2024] [Accepted: 01/20/2025] [Indexed: 02/04/2025] Open
Abstract
The success of SARS-CoV-2 (severe acute respiratory syndrome coronavirus 2) messenger RNA (mRNA) vaccine has led to increased interest in the design and use of mRNA for vaccines and therapeutics. Still, selecting the most appropriate mRNA sequence for a protein remains a challenge. Several recent studies have shown that the specific mRNA sequence can have a significant impact on the translation efficiency, half-life, degradation rates, and other issues that play a major role in determining vaccine efficiency. To enable the selection of the most appropriate sequence, we developed mRNA-LM, an integrated small language model for modeling the entire mRNA sequence. mRNA-LM uses the contrastive language-image pretraining integration technology to combine three separate language models for the different mRNA segments. We trained mRNA-LM on millions of diverse mRNA sequences from several different species. The unsupervised model was able to learn meaningful biology related to evolution and host-pathogen interactions. Fine-tuning of mRNA-LM allowed us to use it in several mRNA property prediction tasks. As we show, using the full-length integrated model led to accurate predictions, improving on prior methods proposed for this task.
Collapse
Affiliation(s)
- Sizhen Li
- Digital R&D, Sanofi, Cambridge, MA 02141, United States
| | - Shahriar Noroozizadeh
- Digital R&D, Sanofi, Cambridge, MA 02141, United States
- Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Heinz College, Carnegie Mellon University, Pittsburgh, PA 15213, United States
| | | | | | - Zexin Xue
- Digital R&D, Sanofi, Cambridge, MA 02141, United States
- Department of Computer Science, University of Toronto, Toronto, ON M5S 2E4, Canada
| | - Dinghai Zheng
- mRNA Center of Excellence, Sanofi, Waltham, MA 02451, United States
| | | | - Vikram Agarwal
- mRNA Center of Excellence, Sanofi, Waltham, MA 02451, United States
| | | | - Sven Jager
- Digital R&D, Sanofi, Cambridge, MA 02141, United States
| |
Collapse
|
9
|
Shen Y, Kudla G, Oyarzún DA. Improving the generalization of protein expression models with mechanistic sequence information. Nucleic Acids Res 2025; 53:gkaf020. [PMID: 39873269 PMCID: PMC11773361 DOI: 10.1093/nar/gkaf020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2024] [Revised: 12/12/2024] [Accepted: 01/08/2025] [Indexed: 01/30/2025] Open
Abstract
The growing demand for biological products drives many efforts to maximize expression of heterologous proteins. Advances in high-throughput sequencing can produce data suitable for building sequence-to-expression models with machine learning. The most accurate models have been trained on one-hot encodings, a mechanism-agnostic representation of nucleotide sequences. Moreover, studies have consistently shown that training on mechanistic sequence features leads to much poorer predictions, even with features that are known to correlate with expression, such as DNA sequence motifs, codon usage, or properties of mRNA secondary structures. However, despite their excellent local accuracy, current sequence-to-expression models can fail to generalize predictions far away from the training data. Through a comparative study across datasets in Escherichia coli and Saccharomyces cerevisiae, here we show that mechanistic sequence features can provide gains on model generalization, and thus improve their utility for predictive sequence design. We explore several strategies to integrate one-hot encodings and mechanistic features into a single predictive model, including feature stacking, ensemble model stacking, and geometric stacking, a novel architecture based on graph convolutional neural networks. Our work casts new light on mechanistic sequence features, underscoring the importance of domain-knowledge and feature engineering for accurate prediction of protein expression levels.
Collapse
Affiliation(s)
- Yuxin Shen
- School of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3JH, United Kingdom
| | - Grzegorz Kudla
- Institute for Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, United Kingdom
| | - Diego A Oyarzún
- School of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3JH, United Kingdom
- School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, United Kingdom
| |
Collapse
|
10
|
Sidi T, Bahiri-Elitzur S, Tuller T, Kolodny R. Predicting gene sequences with AI to study codon usage patterns. Proc Natl Acad Sci U S A 2025; 122:e2410003121. [PMID: 39739812 PMCID: PMC11725940 DOI: 10.1073/pnas.2410003121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2024] [Accepted: 11/27/2024] [Indexed: 01/02/2025] Open
Abstract
Selective pressure acts on the codon use, optimizing multiple, overlapping signals that are only partially understood. We trained AI models to predict codons given their amino acid sequence in the eukaryotes Saccharomyces cerevisiae and Schizosaccharomyces pombe and the bacteria Escherichia coli and Bacillus subtilis to study the extent to which we can learn patterns in naturally occurring codons to improve predictions. We trained our models on a subset of the proteins and evaluated their predictions on large, separate sets of proteins of varying lengths and expression levels. Our models significantly outperformed naïve frequency-based approaches, demonstrating that there are learnable dependencies in evolutionary-selected codon usage. The prediction accuracy advantage of our models is greater for highly expressed genes and is greater in bacteria than eukaryotes, supporting the hypothesis that there is a monotonic relationship between selective pressure for complex codon patterns and effective population size. In S. cerevisiae and bacteria, our models were more accurate for longer proteins, suggesting that the learned patterns may be related to cotranslational folding. Gene functionality and conservation were also important determinants that affect the performance of our models. Finally, we showed that using information encoded in homologous proteins has only a minor effect on prediction accuracy, perhaps due to complex codon-usage codes in genes undergoing rapid evolution. Our study employing contemporary AI methods offers a unique perspective and a deep-learning-based prediction tool for evolutionary-selected codons. We hope that these can be useful to optimize codon usage in endogenous and heterologous proteins.
Collapse
Affiliation(s)
- Tomer Sidi
- Department of Computer Science, University of Haifa, Haifa3303221, Israel
| | - Shir Bahiri-Elitzur
- Department of Biomedical Engineering, Tel-Aviv University, Tel Aviv6139001, Israel
| | - Tamir Tuller
- Department of Biomedical Engineering, Tel-Aviv University, Tel Aviv6139001, Israel
- The Sagol School of Neuroscience, Tel-Aviv University, Tel Aviv6139001, Israel
| | - Rachel Kolodny
- Department of Computer Science, University of Haifa, Haifa3303221, Israel
| |
Collapse
|
11
|
Xu L, Li C, Liao R, Xiao Q, Wang X, Zhao Z, Zhang W, Ding X, Cao Y, Cai L, Rosenecker J, Guan S, Tang J. From Sequence to System: Enhancing IVT mRNA Vaccine Effectiveness through Cutting-Edge Technologies. Mol Pharm 2025; 22:81-102. [PMID: 39601789 DOI: 10.1021/acs.molpharmaceut.4c00863] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2024]
Abstract
The COVID-19 pandemic has spotlighted the potential of in vitro transcribed (IVT) mRNA vaccines with their demonstrated efficacy, safety, cost-effectiveness, and rapid manufacturing. Numerous IVT mRNA vaccines are now under clinical trials for a range of targets, including infectious diseases, cancers, and genetic disorders. Despite their promise, IVT mRNA vaccines face hurdles such as limited expression levels, nonspecific targeting beyond the liver, rapid degradation, and unintended immune activation. Overcoming these challenges is crucial to harnessing the full therapeutic potential of IVT mRNA vaccines for global health advancement. This review provides a comprehensive overview of the latest research progress and optimization strategies for IVT mRNA molecules and delivery systems, including the application of artificial intelligence (AI) models and deep learning techniques for IVT mRNA structure optimization and mRNA delivery formulation design. We also discuss recent development of the delivery platforms, such as lipid nanoparticles (LNPs), polymers, and exosomes, which aim to address challenges related to IVT mRNA protection, cellular uptake, and targeted delivery. Lastly, we offer insights into future directions for improving IVT mRNA vaccines, with the hope to spur further progress in IVT mRNA vaccine research and development.
Collapse
Affiliation(s)
- Lifeng Xu
- National Engineering Research Center of Immunological Products, Third Military Medical University, Chongqing 400038, China
| | - Chao Li
- National Engineering Research Center of Immunological Products, Third Military Medical University, Chongqing 400038, China
| | - Rui Liao
- National Engineering Research Center of Immunological Products, Third Military Medical University, Chongqing 400038, China
| | - Qin Xiao
- National Engineering Research Center of Immunological Products, Third Military Medical University, Chongqing 400038, China
| | - Xiaoran Wang
- Department of Pharmacy, The First Affiliated Hospital of Xinjiang Medical University, Urumqi 830000, China
| | - Zhuo Zhao
- National Engineering Research Center of Immunological Products, Third Military Medical University, Chongqing 400038, China
| | - Weijun Zhang
- National Engineering Research Center of Immunological Products, Third Military Medical University, Chongqing 400038, China
| | - Xiaoyan Ding
- Department of Pediatrics, Ludwig-Maximilians University of Munich, Munich 80337, Germany
| | - Yuxue Cao
- Drug Delivery, Disposition and Dynamics, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, VIC 3052, Australia
| | - Larry Cai
- Drug Delivery, Disposition and Dynamics, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, VIC 3052, Australia
| | - Joseph Rosenecker
- Department of Pediatrics, Ludwig-Maximilians University of Munich, Munich 80337, Germany
| | - Shan Guan
- National Engineering Research Center of Immunological Products, Third Military Medical University, Chongqing 400038, China
| | - Jie Tang
- Drug Delivery, Disposition and Dynamics, Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, VIC 3052, Australia
| |
Collapse
|
12
|
Wang Z, Yuan H, Yan J, Liu J. Identification, characterization, and design of plant genome sequences using deep learning. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2025; 121:e17190. [PMID: 39666835 DOI: 10.1111/tpj.17190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/03/2024] [Revised: 11/11/2024] [Accepted: 11/23/2024] [Indexed: 12/14/2024]
Abstract
Due to its excellent performance in processing large amounts of data and capturing complex non-linear relationships, deep learning has been widely applied in many fields of plant biology. Here we first review the application of deep learning in analyzing genome sequences to predict gene expression, chromatin interactions, and epigenetic features (open chromatin, transcription factor binding sites, and methylation sites) in plants. Then, current motif mining and functional component design and synthesis based on generative adversarial networks, large models, and attention mechanisms are elaborated in detail. The progress of protein structure and function prediction, genomic prediction, and large model applications based on deep learning is also discussed. Finally, this work provides prospects for the future development of deep learning in plants with regard to multiple omics data, algorithm optimization, large language models, sequence design, and intelligent breeding.
Collapse
Affiliation(s)
- Zhenye Wang
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Hao Yuan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
| | - Jianbing Yan
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Hongshan Laboratory, Wuhan, 430070, China
| | - Jianxiao Liu
- National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Hubei Hongshan Laboratory, Wuhan, 430070, China
| |
Collapse
|
13
|
Buric F, Viknander S, Fu X, Lemke O, Carmona OG, Zrimec J, Szyrwiel L, Mülleder M, Ralser M, Zelezniak A. Amino acid sequence encodes protein abundance shaped by protein stability at reduced synthesis cost. Protein Sci 2025; 34:e5239. [PMID: 39665261 PMCID: PMC11635393 DOI: 10.1002/pro.5239] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2024] [Revised: 10/11/2024] [Accepted: 11/14/2024] [Indexed: 12/13/2024]
Abstract
Understanding what drives protein abundance is essential to biology, medicine, and biotechnology. Driven by evolutionary selection, an amino acid sequence is tailored to meet the required abundance of a proteome, underscoring the intricate relationship between sequence and functional demand. Yet, the specific role of amino acid sequences in determining proteome abundance remains elusive. Here we show that the amino acid sequence alone encodes over half of protein abundance variation across all domains of life, ranging from bacteria to mouse and human. With an attempt to go beyond predictions, we trained a manageable-size Transformer model to interpret latent factors predictive of protein abundances. Intuitively, the model's attention focused on the protein's structural features linked to stability and metabolic costs related to protein synthesis. To probe these relationships, we introduce MGEM (Mutation Guided by an Embedded Manifold), a methodology for guiding protein abundance through sequence modifications. We find that mutations which increase predicted abundance have significantly altered protein polarity and hydrophobicity, underscoring a connection between protein structural features and abundance. Through molecular dynamics simulations we revealed that abundance-enhancing mutations possibly contribute to protein thermostability by increasing rigidity, which occurs at a lower synthesis cost.
Collapse
Affiliation(s)
- Filip Buric
- Department of Biology and Biological EngineeringChalmers University of TechnologyGothenburgSweden
| | - Sandra Viknander
- Department of Biology and Biological EngineeringChalmers University of TechnologyGothenburgSweden
| | - Xiaozhi Fu
- Department of Biology and Biological EngineeringChalmers University of TechnologyGothenburgSweden
| | - Oliver Lemke
- Department of BiochemistryCharité – Universitätsmedizin BerlinBerlinGermany
| | - Oriol Gracia Carmona
- Randall Centre for Cell & Molecular BiophysicsKing's College LondonLondonUK
- Institute of Structural and Molecular BiologyUniversity College LondonLondonUK
| | - Jan Zrimec
- Department of Biology and Biological EngineeringChalmers University of TechnologyGothenburgSweden
- Department of Biotechnology and Systems BiologyNational Institute of BiologyLjubljanaSlovenia
| | - Lukasz Szyrwiel
- Department of BiochemistryCharité – Universitätsmedizin BerlinBerlinGermany
| | - Michael Mülleder
- Core Facility High Throughput Mass SpectrometryCharité – Universitätsmedizin BerlinBerlinGermany
| | - Markus Ralser
- Department of BiochemistryCharité – Universitätsmedizin BerlinBerlinGermany
| | - Aleksej Zelezniak
- Department of Biology and Biological EngineeringChalmers University of TechnologyGothenburgSweden
- Randall Centre for Cell & Molecular BiophysicsKing's College LondonLondonUK
- Institute of Biotechnology, Life Sciences CentreVilnius UniversityVilniusLithuania
| |
Collapse
|
14
|
Wang Y, Liang N, Gao G. Quantifying the regulatory potential of genetic variants via a hybrid sequence-oriented model with SVEN. Nat Commun 2024; 15:10917. [PMID: 39738063 PMCID: PMC11685779 DOI: 10.1038/s41467-024-55392-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2024] [Accepted: 12/06/2024] [Indexed: 01/01/2025] Open
Abstract
Deciphering how noncoding DNA determines gene expression is critical for decoding the functional genome. Understanding the transcription effects of noncoding genetic variants are still major unsolved problems, which is critical for downstream applications in human genetics and precision medicine. Here, we integrate regulatory-specific neural networks and tissue-specific gradient-boosting trees to build SVEN: a hybrid sequence-oriented architecture that can accurately predict tissue-specific gene expression level and quantify the tissue-specific transcriptomic impacts of structural variants across more than 350 tissues and cell lines. We further systematically screen a large-scale structural variants dataset derived from 3622 individuals and clinical structural variants from ClinVar, and provide an overview of transcriptomic impacts of structural variants in population. As a sequence-oriented model, SVEN is also able to predict regulatory effects for small noncoding variants. We expect that SVEN will enable more effective in silico analysis and interpretation of human genome-wide disease-related genetic variants.
Collapse
Affiliation(s)
- Yu Wang
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) and Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, 100871, Beijing, China
- Changping Laboratory, 102206, Beijing, China
| | - Nan Liang
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) and Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, 100871, Beijing, China
| | - Ge Gao
- State Key Laboratory of Protein and Plant Gene Research, School of Life Sciences, Biomedical Pioneering Innovative Center (BIOPIC) and Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), Peking University, 100871, Beijing, China.
- Changping Laboratory, 102206, Beijing, China.
| |
Collapse
|
15
|
Bréhélin L. Advancing Regulatory Genomics With Machine Learning. Bioinform Biol Insights 2024; 18:11779322241249562. [PMID: 39735654 PMCID: PMC11672376 DOI: 10.1177/11779322241249562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2023] [Accepted: 04/09/2024] [Indexed: 12/31/2024] Open
Abstract
In recent years, several machine learning (ML) approaches have been proposed to predict gene expression signal and chromatin features from the DNA sequence alone. These models are often used to deduce and, to some extent, assess putative new biological insights about gene regulation, and they have led to very interesting advances in regulatory genomics. This article reviews a selection of these methods, ranging from linear models to random forests, kernel methods, and more advanced deep learning models. Specifically, we detail the different techniques and strategies that can be used to extract new gene-regulation hypotheses from these models. Furthermore, because these putative insights need to be validated with wet-lab experiments, we emphasize that it is important to have a measure of confidence associated with the extracted hypotheses. We review the procedures that have been proposed to measure this confidence for the different types of ML models, and we discuss the fact that they do not provide the same kind of information.
Collapse
|
16
|
Cohen S, Bergman S, Lynn N, Tuller T. A tool for CRISPR-Cas9 sgRNA evaluation based on computational models of gene expression. Genome Med 2024; 16:152. [PMID: 39716183 DOI: 10.1186/s13073-024-01420-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2023] [Accepted: 12/02/2024] [Indexed: 12/25/2024] Open
Abstract
BACKGROUND CRISPR is widely used to silence genes by inducing mutations expected to nullify their expression. While numerous computational tools have been developed to design single-guide RNAs (sgRNAs) with high cutting efficiency and minimal off-target effects, only a few tools focus specifically on predicting gene knockouts following CRISPR. These tools consider factors like conservation, amino acid composition, and frameshift likelihood. However, they neglect the impact of CRISPR on gene expression, which can dramatically affect the success of CRISPR-induced gene silencing attempts. Furthermore, information regarding gene expression can be useful even when the objective is not to silence a gene. Therefore, a tool that considers gene expression when predicting CRISPR outcomes is lacking. RESULTS We developed EXPosition, the first computational tool that combines models predicting gene knockouts after CRISPR with models that forecast gene expression, offering more accurate predictions of gene knockout outcomes. EXPosition leverages deep-learning models to predict key steps in gene expression: transcription, splicing, and translation initiation. We showed our tool performs better at predicting gene knockout than existing tools across 6 datasets, 4 cell types and ~207k sgRNAs. We also validated our gene expression models using the ClinVar dataset by showing enrichment of pathogenic mutations in high-scoring mutations according to our models. CONCLUSIONS We believe EXPosition will enhance both the efficiency and accuracy of genome editing projects, by directly predicting CRISPR's effect on various aspects of gene expression. EXPosition is available at http://www.cs.tau.ac.il/~tamirtul/EXPosition . The source code is available at https://github.com/shaicoh3n/EXPosition .
Collapse
Affiliation(s)
- Shai Cohen
- Department of Biomedical Engineering, Tel Aviv University, Tel-Aviv, 6997801, Israel
| | - Shaked Bergman
- Department of Biomedical Engineering, Tel Aviv University, Tel-Aviv, 6997801, Israel
| | - Nicolas Lynn
- Department of Biomedical Engineering, Tel Aviv University, Tel-Aviv, 6997801, Israel
| | - Tamir Tuller
- Department of Biomedical Engineering, Tel Aviv University, Tel-Aviv, 6997801, Israel.
- Sagol School of Neuroscience, Tel Aviv University, Tel-Aviv, 6997801, Israel.
| |
Collapse
|
17
|
Dong G, Wu Y, Huang L, Li F, Zhou F. TExCNN: Leveraging Pre-Trained Models to Predict Gene Expression from Genomic Sequences. Genes (Basel) 2024; 15:1593. [PMID: 39766860 PMCID: PMC11675716 DOI: 10.3390/genes15121593] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2024] [Revised: 12/02/2024] [Accepted: 12/10/2024] [Indexed: 01/11/2025] Open
Abstract
BACKGROUND/OBJECTIVES Understanding the relationship between DNA sequences and gene expression levels is of significant biological importance. Recent advancements have demonstrated the ability of deep learning to predict gene expression levels directly from genomic data. However, traditional methods are limited by basic word encoding techniques, which fail to capture the inherent features and patterns of DNA sequences. METHODS We introduce TExCNN, a novel framework that integrates the pre-trained models DNABERT and DNABERT-2 to generate word embeddings for DNA sequences. We partitioned the DNA sequences into manageable segments and computed their respective embeddings using the pre-trained models. These embeddings were then utilized as inputs to our deep learning framework, which was based on convolutional neural network. RESULTS TExCNN outperformed current state-of-the-art models, achieving an average R2 score of 0.622, compared to the 0.596 score achieved by the DeepLncLoc model, which is based on the Word2Vec model and a text convolutional neural network. Furthermore, when the sequence length was extended from 10,500 bp to 50,000 bp, TExCNN achieved an even higher average R2 score of 0.639. The prediction accuracy improved further when additional biological features were incorporated. CONCLUSIONS Our experimental results demonstrate that the use of pre-trained models for word embedding generation significantly improves the accuracy of predicting gene expression. The proposed TExCNN pipeline performes optimally with longer DNA sequences and is adaptable for both cell-type-independent and cell-type-dependent predictions.
Collapse
Affiliation(s)
- Guohao Dong
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; (G.D.); (Y.W.); (L.H.); (F.L.)
- College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Yuqian Wu
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; (G.D.); (Y.W.); (L.H.); (F.L.)
- College of Software, Jilin University, Changchun 130012, China
| | - Lan Huang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; (G.D.); (Y.W.); (L.H.); (F.L.)
- College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Fei Li
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; (G.D.); (Y.W.); (L.H.); (F.L.)
- College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Fengfeng Zhou
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China; (G.D.); (Y.W.); (L.H.); (F.L.)
- College of Computer Science and Technology, Jilin University, Changchun 130012, China
| |
Collapse
|
18
|
Tahir M, Norouzi M, Khan SS, Davie JR, Yamanaka S, Ashraf A. Artificial intelligence and deep learning algorithms for epigenetic sequence analysis: A review for epigeneticists and AI experts. Comput Biol Med 2024; 183:109302. [PMID: 39500240 DOI: 10.1016/j.compbiomed.2024.109302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2024] [Revised: 09/22/2024] [Accepted: 10/17/2024] [Indexed: 11/20/2024]
Abstract
Epigenetics encompasses mechanisms that can alter the expression of genes without changing the underlying genetic sequence. The epigenetic regulation of gene expression is initiated and sustained by several mechanisms such as DNA methylation, histone modifications, chromatin conformation, and non-coding RNA. The changes in gene regulation and expression can manifest in the form of various diseases and disorders such as cancer and congenital deformities. Over the last few decades, high-throughput experimental approaches have been used to identify and understand epigenetic changes, but these laboratory experimental approaches and biochemical processes are time-consuming and expensive. To overcome these challenges, machine learning and artificial intelligence (AI) approaches have been extensively used for mapping epigenetic modifications to their phenotypic manifestations. In this paper we provide a narrative review of published research on AI models trained on epigenomic data to address a variety of problems such as prediction of disease markers, gene expression, enhancer-promoter interaction, and chromatin states. The purpose of this review is twofold as it is addressed to both AI experts and epigeneticists. For AI researchers, we provided a taxonomy of epigenetics research problems that can benefit from an AI-based approach. For epigeneticists, given each of the above problems we provide a list of candidate AI solutions in the literature. We have also identified several gaps in the literature, research challenges, and recommendations to address these challenges.
Collapse
Affiliation(s)
- Muhammad Tahir
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
| | - Mahboobeh Norouzi
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada
| | - Shehroz S Khan
- College of Engineering and Technology, American University of the Middle East, Kuwait
| | - James R Davie
- Department of Biochemistry and Medical Genetics, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
| | - Soichiro Yamanaka
- Graduate School of Science, Department of Biophysics and Biochemistry, University of Tokyo, Japan
| | - Ahmed Ashraf
- Department of Electrical and Computer Engineering, University of Manitoba, Winnipeg, R3T 5V6, MB, Canada.
| |
Collapse
|
19
|
Rafi AM, Nogina D, Penzar D, Lee D, Lee D, Kim N, Kim S, Kim D, Shin Y, Kwak IY, Meshcheryakov G, Lando A, Zinkevich A, Kim BC, Lee J, Kang T, Vaishnav ED, Yadollahpour P, Kim S, Albrecht J, Regev A, Gong W, Kulakovskiy IV, Meyer P, de Boer CG. A community effort to optimize sequence-based deep learning models of gene regulation. Nat Biotechnol 2024:10.1038/s41587-024-02414-w. [PMID: 39394483 DOI: 10.1038/s41587-024-02414-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2024] [Accepted: 08/29/2024] [Indexed: 10/13/2024]
Abstract
A systematic evaluation of how model architectures and training strategies impact genomics model performance is needed. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. All top-performing models used neural networks but diverged in architectures and training strategies. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide models into modular building blocks. We tested all possible combinations for the top three models, further improving their performance. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets, demonstrating the progress that can be driven by gold-standard genomics datasets.
Collapse
Affiliation(s)
| | - Daria Nogina
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
| | - Dmitry Penzar
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- AIRI, Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
| | - Dohoon Lee
- Seoul National University, Seoul, South Korea
| | | | - Nayeon Kim
- Seoul National University, Seoul, South Korea
| | | | - Dohyeon Kim
- Seoul National University, Seoul, South Korea
| | - Yeojin Shin
- Seoul National University, Seoul, South Korea
| | | | | | | | - Arsenii Zinkevich
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
| | | | - Juhyun Lee
- Chung-Ang University, Seoul, South Korea
| | - Taein Kang
- Chung-Ang University, Seoul, South Korea
| | - Eeshit Dhaval Vaishnav
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Sequome, Inc., South San Francisco, CA, USA
| | | | - Sun Kim
- Seoul National University, Seoul, South Korea
| | | | - Aviv Regev
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Genentech, San Francisco, CA, USA
| | - Wuming Gong
- University of Minnesota, Minneapolis, MN, USA
| | - Ivan V Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
| | - Pablo Meyer
- Health Care and Life Sciences, IBM Research, New York, NY, USA
| | - Carl G de Boer
- University of British Columbia, Vancouver, British Columbia, Canada.
| |
Collapse
|
20
|
Ramprasad P, Pai N, Pan W. Enhancing personalized gene expression prediction from DNA sequences using genomic foundation models. HGG ADVANCES 2024; 5:100347. [PMID: 39205391 PMCID: PMC11416237 DOI: 10.1016/j.xhgg.2024.100347] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/07/2024] [Revised: 08/23/2024] [Accepted: 08/23/2024] [Indexed: 09/04/2024] Open
Abstract
Artificial intelligence (AI)/deep learning (DL) models that predict molecular phenotypes like gene expression directly from DNA sequences have recently emerged. While these models have proven effective at capturing the variation across genes, their ability to explain inter-individual differences has been limited. We hypothesize that the performance gap can be narrowed through the use of pre-trained embeddings from the Nucleotide Transformer, a large foundation model trained on 3,000+ genomes. We train a transformer model using the pre-trained embeddings and compare its predictive performance to Enformer, the current state-of-the-art model, using genotype and expression data from 290 individuals. Our model significantly outperforms Enformer in terms of correlation across individuals, and narrows the performance gap with an elastic net regression approach that uses just the genetic variants as predictors. Although simple regression models have their advantages in personalized prediction tasks, DL approaches based on foundation models pre-trained on diverse genomes have unique strengths in flexibility and interpretability. With further methodological and computational improvements with more training data, these models may eventually predict molecular phenotypes from DNA sequences with an accuracy surpassing that of regression-based approaches. Our work demonstrates the potential for large pre-trained AI/DL models to advance functional genomics.
Collapse
Affiliation(s)
- Pratik Ramprasad
- Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, Minneapolis, MN, USA
| | - Nidhi Pai
- Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, Minneapolis, MN, USA
| | - Wei Pan
- Division of Biostatistics and Health Data Science, University of Minnesota, Minneapolis, Minneapolis, MN, USA.
| |
Collapse
|
21
|
Sasse A, Chikina M, Mostafavi S. Quick and effective approximation of in silico saturation mutagenesis experiments with first-order taylor expansion. iScience 2024; 27:110807. [PMID: 39286491 PMCID: PMC11404212 DOI: 10.1016/j.isci.2024.110807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2024] [Revised: 08/08/2024] [Accepted: 08/20/2024] [Indexed: 09/19/2024] Open
Abstract
To understand the decision process of genomic sequence-to-function models, explainable AI algorithms determine the importance of each nucleotide in a given input sequence to the model's predictions and enable discovery of cis-regulatory motifs for gene regulation. The most commonly applied method is in silico saturation mutagenesis (ISM) because its per-nucleotide importance scores can be intuitively understood as the computational counterpart to in vivo saturation mutagenesis experiments. While ISM is highly interpretable, it is computationally challenging to perform for many sequences, and becomes prohibitive as the length of the input sequences and size of the model grows. Here, we use the first-order Taylor approximation to approximate ISM values from the model's gradient, which reduces its computation cost to a single forward pass for an input sequence. We show that the Taylor ISM (TISM) approximation is robust across different model ablations, random initializations, training parameters, and dataset sizes.
Collapse
Affiliation(s)
- Alexander Sasse
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
| | - Maria Chikina
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 16354, USA
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
- Canadian Institute for Advanced Research, Toronto, ON MG5 1ZB, Canada
| |
Collapse
|
22
|
Read DF, Booth GT, Daza RM, Jackson DL, Gladden RG, Srivatsan SR, Ewing B, Franks JM, Spurrell CH, Gomes AR, O'Day D, Gogate AA, Martin BK, Larson H, Pfleger C, Starita L, Lin Y, Shendure J, Lin S, Trapnell C. Single-cell analysis of chromatin and expression reveals age- and sex-associated alterations in the human heart. Commun Biol 2024; 7:1052. [PMID: 39187646 PMCID: PMC11347658 DOI: 10.1038/s42003-024-06582-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2022] [Accepted: 07/11/2024] [Indexed: 08/28/2024] Open
Abstract
Sex differences and age-related changes in the human heart at the tissue, cell, and molecular level have been well-documented and many may be relevant for cardiovascular disease. However, how molecular programs within individual cell types vary across individuals by age and sex remains poorly characterized. To better understand this variation, we performed single-nucleus combinatorial indexing (sci) ATAC- and RNA-Seq in human heart samples from nine donors. We identify hundreds of differentially expressed genes by age and sex and find epigenetic signatures of variation in ATAC-Seq data in this discovery cohort. We then scale up our single-cell RNA-Seq analysis by combining our data with five recently published single nucleus RNA-Seq datasets of healthy adult hearts. We find variation such as metabolic alterations by sex and immune changes by age in differential expression tests, as well as alterations in abundance of cardiomyocytes by sex and neurons with age. In addition, we compare our adult-derived ATAC-Seq profiles to analogous fetal cell types to identify putative developmental-stage-specific regulatory factors. Finally, we train predictive models of cell-type-specific RNA expression levels utilizing ATAC-Seq profiles to link distal regulatory sequences to promoters, quantifying the predictive value of a simple TF-to-expression regulatory grammar and identifying cell-type-specific TFs. Our analysis represents the largest single-cell analysis of cardiac variation by age and sex to date and provides a resource for further study of healthy cardiac variation and transcriptional regulation at single-cell resolution.
Collapse
Affiliation(s)
- David F Read
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Gregory T Booth
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Riza M Daza
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Dana L Jackson
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Rula Green Gladden
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Sanjay R Srivatsan
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Brent Ewing
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Jennifer M Franks
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | | | | | - Diana O'Day
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
| | - Aishwarya A Gogate
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
- Seattle Children's Research Institute, Seattle, WA, USA
| | - Beth K Martin
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Haleigh Larson
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
| | - Christian Pfleger
- University of Washington School of Medicine, Division of Cardiology, Seattle, WA, USA
| | - Lea Starita
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
| | - Yiing Lin
- Department of Surgery, Washington University, St Louis, MO, USA
| | - Jay Shendure
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA.
- Seattle Children's Research Institute, Seattle, WA, USA.
- Howard Hughes Medical Institute, Seattle, WA, USA.
- Allen Discovery Center for Cell Lineage Tracing, Seattle, WA, USA.
| | - Shin Lin
- University of Washington School of Medicine, Division of Cardiology, Seattle, WA, USA.
| | - Cole Trapnell
- Brotman Baty Institute for Precision Medicine, Seattle, WA, USA.
| |
Collapse
|
23
|
Li S, Moayedpour S, Li R, Bailey M, Riahi S, Kogler-Anele L, Miladi M, Miner J, Pertuy F, Zheng D, Wang J, Balsubramani A, Tran K, Zacharia M, Wu M, Gu X, Clinton R, Asquith C, Skaleski J, Boeglin L, Chivukula S, Dias A, Strugnell T, Montoya FU, Agarwal V, Bar-Joseph Z, Jager S. CodonBERT large language model for mRNA vaccines. Genome Res 2024; 34:1027-1035. [PMID: 38951026 PMCID: PMC11368176 DOI: 10.1101/gr.278870.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2023] [Accepted: 06/25/2024] [Indexed: 07/03/2024]
Abstract
mRNA-based vaccines and therapeutics are gaining popularity and usage across a wide range of conditions. One of the critical issues when designing such mRNAs is sequence optimization. Even small proteins or peptides can be encoded by an enormously large number of mRNAs. The actual mRNA sequence can have a large impact on several properties, including expression, stability, immunogenicity, and more. To enable the selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained using more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts. CodonBERT can also be extended to perform prediction tasks for various mRNA properties. CodonBERT outperforms previous mRNA prediction methods, including on a new flu vaccine data set.
Collapse
Affiliation(s)
- Sizhen Li
- Digital R&D, Sanofi, Cambridge, Massachusetts 02141, USA
| | | | - Ruijiang Li
- Digital R&D, Sanofi, Cambridge, Massachusetts 02141, USA
| | - Michael Bailey
- Digital R&D, Sanofi, Cambridge, Massachusetts 02141, USA
| | - Saleh Riahi
- Digital R&D, Sanofi, Cambridge, Massachusetts 02141, USA
| | | | - Milad Miladi
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Jacob Miner
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Fabien Pertuy
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Dinghai Zheng
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Jun Wang
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | | | - Khang Tran
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Minnie Zacharia
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Monica Wu
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Xiaobo Gu
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Ryan Clinton
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Carla Asquith
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Joseph Skaleski
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Lianne Boeglin
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Sudha Chivukula
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Anusha Dias
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Tod Strugnell
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | | | - Vikram Agarwal
- mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
| | - Ziv Bar-Joseph
- Digital R&D, Sanofi, Cambridge, Massachusetts 02141, USA;
| | - Sven Jager
- Digital R&D, Sanofi, Cambridge, Massachusetts 02141, USA
| |
Collapse
|
24
|
Chauhan V, Baptista ISC, Arsh AM, Jagadeesan R, Dash S, Ribeiro AS. Transcription Attenuation in Synthetic Promoters in Nonoverlapping Tandem Formation. Biochemistry 2024; 63:2009-2022. [PMID: 38997112 PMCID: PMC11339919 DOI: 10.1021/acs.biochem.4c00012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2024] [Revised: 07/01/2024] [Accepted: 07/01/2024] [Indexed: 07/14/2024]
Abstract
Closely spaced promoters are ubiquitous in prokaryotic and eukaryotic genomes. How their structure and dynamics relate remains unclear, particularly for tandem formations. To study their transcriptional interference, we engineered two pairs and one trio of synthetic promoters in nonoverlapping, tandem formation, in single-copy plasmids transformed into Escherichia coli cells. From in vivo measurements, we found that these promoters in tandem formation can have attenuated transcription rates. The attenuation strength can be widely fine-tuned by the promoters' positioning, natural regulatory mechanisms, and other factors, including the antibiotic rifampicin, which is known to hamper RNAP promoter escape. From this, and supported by in silico models, we concluded that the attenuation in these constructs emerges from premature terminations generated by collisions between RNAPs elongating from upstream promoters and RNAPs occupying downstream promoters. Moreover, we found that these collisions can cause one or both RNAPs to falloff. Finally, the broad spectrum of possible, externally regulated, attenuation strengths observed in our synthetic tandem promoters suggests that they could become useful as externally controllable regulators of future synthetic circuits.
Collapse
Affiliation(s)
- Vatsala Chauhan
- Faculty
of Medicine and Health Technology, Tampere
University, 33520 Tampere, Finland
- Department
of Cell and Molecular Biology (ICM), Uppsala
University, 751 24 Uppsala, Sweden
| | - Ines S. C. Baptista
- Faculty
of Medicine and Health Technology, Tampere
University, 33520 Tampere, Finland
| | - Amir M. Arsh
- Faculty
of Medicine and Health Technology, Tampere
University, 33520 Tampere, Finland
| | - Rahul Jagadeesan
- Faculty
of Medicine and Health Technology, Tampere
University, 33520 Tampere, Finland
| | - Suchintak Dash
- Faculty
of Medicine and Health Technology, Tampere
University, 33520 Tampere, Finland
| | - Andre S. Ribeiro
- Faculty
of Medicine and Health Technology, Tampere
University, 33520 Tampere, Finland
| |
Collapse
|
25
|
Xu L, Liu Y. Identification, Design, and Application of Noncoding Cis-Regulatory Elements. Biomolecules 2024; 14:945. [PMID: 39199333 PMCID: PMC11352686 DOI: 10.3390/biom14080945] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2024] [Revised: 07/25/2024] [Accepted: 07/30/2024] [Indexed: 09/01/2024] Open
Abstract
Cis-regulatory elements (CREs) play a pivotal role in orchestrating interactions with trans-regulatory factors such as transcription factors, RNA-binding proteins, and noncoding RNAs. These interactions are fundamental to the molecular architecture underpinning complex and diverse biological functions in living organisms, facilitating a myriad of sophisticated and dynamic processes. The rapid advancement in the identification and characterization of these regulatory elements has been marked by initiatives such as the Encyclopedia of DNA Elements (ENCODE) project, which represents a significant milestone in the field. Concurrently, the development of CRE detection technologies, exemplified by massively parallel reporter assays, has progressed at an impressive pace, providing powerful tools for CRE discovery. The exponential growth of multimodal functional genomic data has necessitated the application of advanced analytical methods. Deep learning algorithms, particularly large language models, have emerged as invaluable tools for deconstructing the intricate nucleotide sequences governing CRE function. These advancements facilitate precise predictions of CRE activity and enable the de novo design of CREs. A deeper understanding of CRE operational dynamics is crucial for harnessing their versatile regulatory properties. Such insights are instrumental in refining gene therapy techniques, enhancing the efficacy of selective breeding programs, pushing the boundaries of genetic innovation, and opening new possibilities in microbial synthetic biology.
Collapse
Affiliation(s)
- Lingna Xu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China;
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
| | - Yuwen Liu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China;
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Foshan 528226, China
| |
Collapse
|
26
|
Sokolova K, Chen KM, Hao Y, Zhou J, Troyanskaya OG. Deep Learning Sequence Models for Transcriptional Regulation. Annu Rev Genomics Hum Genet 2024; 25:105-122. [PMID: 38594933 DOI: 10.1146/annurev-genom-021623-024727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/11/2024]
Abstract
Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.
Collapse
Affiliation(s)
- Ksenia Sokolova
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA; , ,
| | - Kathleen M Chen
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA; , ,
| | - Yun Hao
- Flatiron Institute, Simons Foundation, New York, NY, USA;
| | - Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA;
| | - Olga G Troyanskaya
- Princeton Precision Health, Princeton University, Princeton, New Jersey, USA
- Flatiron Institute, Simons Foundation, New York, NY, USA;
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA; , ,
| |
Collapse
|
27
|
Kathail P, Shuai RW, Chung R, Ye CJ, Loeb GB, Ioannidis NM. Current genomic deep learning models display decreased performance in cell type-specific accessible regions. Genome Biol 2024; 25:202. [PMID: 39090688 PMCID: PMC11293111 DOI: 10.1186/s13059-024-03335-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2023] [Accepted: 07/10/2024] [Indexed: 08/04/2024] Open
Abstract
BACKGROUND A number of deep learning models have been developed to predict epigenetic features such as chromatin accessibility from DNA sequence. Model evaluations commonly report performance genome-wide; however, cis regulatory elements (CREs), which play critical roles in gene regulation, make up only a small fraction of the genome. Furthermore, cell type-specific CREs contain a large proportion of complex disease heritability. RESULTS We evaluate genomic deep learning models in chromatin accessibility regions with varying degrees of cell type specificity. We assess two modeling directions in the field: general purpose models trained across thousands of outputs (cell types and epigenetic marks) and models tailored to specific tissues and tasks. We find that the accuracy of genomic deep learning models, including two state-of-the-art general purpose models-Enformer and Sei-varies across the genome and is reduced in cell type-specific accessible regions. Using accessibility models trained on cell types from specific tissues, we find that increasing model capacity to learn cell type-specific regulatory syntax-through single-task learning or high capacity multi-task models-can improve performance in cell type-specific accessible regions. We also observe that improving reference sequence predictions does not consistently improve variant effect predictions, indicating that novel strategies are needed to improve performance on variants. CONCLUSIONS Our results provide a new perspective on the performance of genomic deep learning models, showing that performance varies across the genome and is particularly reduced in cell type-specific accessible regions. We also identify strategies to maximize performance in cell type-specific accessible regions.
Collapse
Affiliation(s)
- Pooja Kathail
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA.
| | - Richard W Shuai
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
| | - Ryan Chung
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
| | - Chun Jimmie Ye
- Division of Rheumatology, Department of Medicine, University of California, San Francisco, CA, USA
- Institute for Human Genetics, University of California, San Francisco, CA, USA
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, USA
- Parker Institute for Cancer Immunotherapy, San Francisco, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - Gabriel B Loeb
- Division of Nephrology, Department of Medicine, University of California, San Francisco, CA, USA.
- Cardiovascular Research Institute, University of California, San Francisco, CA, USA.
| | - Nilah M Ioannidis
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA.
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, USA.
| |
Collapse
|
28
|
Romero R, Menichelli C, Vroland C, Marin JM, Lèbre S, Lecellier CH, Bréhélin L. TFscope: systematic analysis of the sequence features involved in the binding preferences of transcription factors. Genome Biol 2024; 25:187. [PMID: 38987807 PMCID: PMC11514967 DOI: 10.1186/s13059-024-03321-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2022] [Accepted: 06/24/2024] [Indexed: 07/12/2024] Open
Abstract
Characterizing the binding preferences of transcription factors (TFs) in different cell types and conditions is key to understand how they orchestrate gene expression. Here, we develop TFscope, a machine learning approach that identifies sequence features explaining the binding differences observed between two ChIP-seq experiments targeting either the same TF in two conditions or two TFs with similar motifs (paralogous TFs). TFscope systematically investigates differences in the core motif, nucleotide environment and co-factor motifs, and provides the contribution of each key feature in the two experiments. TFscope was applied to > 305 ChIP-seq pairs, and several examples are discussed.
Collapse
Affiliation(s)
- Raphaël Romero
- LIRMM, Univ Montpellier, CNRS, Montpellier, France
- IMAG, Univ Montpellier, CNRS, Montpellier, France
| | | | - Christophe Vroland
- LIRMM, Univ Montpellier, CNRS, Montpellier, France
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
| | | | - Sophie Lèbre
- IMAG, Univ Montpellier, CNRS, Montpellier, France.
- AMIS, Université Paul-Valéry-Montpellier 3, Montpellier, France.
| | - Charles-Henri Lecellier
- LIRMM, Univ Montpellier, CNRS, Montpellier, France.
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France.
| | | |
Collapse
|
29
|
Moeckel C, Mouratidis I, Chantzi N, Uzun Y, Georgakopoulos-Soares I. Advances in computational and experimental approaches for deciphering transcriptional regulatory networks: Understanding the roles of cis-regulatory elements is essential, and recent research utilizing MPRAs, STARR-seq, CRISPR-Cas9, and machine learning has yielded valuable insights. Bioessays 2024; 46:e2300210. [PMID: 38715516 PMCID: PMC11444527 DOI: 10.1002/bies.202300210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 04/22/2024] [Accepted: 04/23/2024] [Indexed: 05/16/2024]
Abstract
Understanding the influence of cis-regulatory elements on gene regulation poses numerous challenges given complexities stemming from variations in transcription factor (TF) binding, chromatin accessibility, structural constraints, and cell-type differences. This review discusses the role of gene regulatory networks in enhancing understanding of transcriptional regulation and covers construction methods ranging from expression-based approaches to supervised machine learning. Additionally, key experimental methods, including MPRAs and CRISPR-Cas9-based screening, which have significantly contributed to understanding TF binding preferences and cis-regulatory element functions, are explored. Lastly, the potential of machine learning and artificial intelligence to unravel cis-regulatory logic is analyzed. These computational advances have far-reaching implications for precision medicine, therapeutic target discovery, and the study of genetic variations in health and disease.
Collapse
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Yasin Uzun
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Department of Pediatrics, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| |
Collapse
|
30
|
Reddy AJ, Geng X, Herschl MH, Kolli S, Kumar A, Hsu PD, Levine S, Ioannidis NM. Designing Cell-Type-Specific Promoter Sequences Using Conservative Model-Based Optimization. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.23.600232. [PMID: 38948874 PMCID: PMC11213138 DOI: 10.1101/2024.06.23.600232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/02/2024]
Abstract
Gene therapies have the potential to treat disease by delivering therapeutic genetic cargo to disease-associated cells. One limitation to their widespread use is the lack of short regulatory sequences, or promoters, that differentially induce the expression of delivered genetic cargo in target cells, minimizing side effects in other cell types. Such cell-type-specific promoters are difficult to discover using existing methods, requiring either manual curation or access to large datasets of promoter-driven expression from both targeted and untargeted cells. Model-based optimization (MBO) has emerged as an effective method to design biological sequences in an automated manner, and has recently been used in promoter design methods. However, these methods have only been tested using large training datasets that are expensive to collect, and focus on designing promoters for markedly different cell types, overlooking the complexities associated with designing promoters for closely related cell types that share similar regulatory features. Therefore, we introduce a comprehensive framework for utilizing MBO to design promoters in a data-efficient manner, with an emphasis on discovering promoters for similar cell types. We use conservative objective models (COMs) for MBO and highlight practical considerations such as best practices for improving sequence diversity, getting estimates of model uncertainty, and choosing the optimal set of sequences for experimental validation. Using three relatively similar blood cancer cell lines (Jurkat, K562, and THP1), we show that our approach discovers many novel cell-type-specific promoters after experimentally validating the designed sequences. For K562 cells, in particular, we discover a promoter that has 75.85% higher cell-type-specificity than the best promoter from the initial dataset used to train our models.
Collapse
|
31
|
Gonzalez-Avalos E, Onodera A, Samaniego-Castruita D, Rao A, Ay F. Predicting gene expression state and prioritizing putative enhancers using 5hmC signal. Genome Biol 2024; 25:142. [PMID: 38825692 PMCID: PMC11145787 DOI: 10.1186/s13059-024-03273-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2023] [Accepted: 05/11/2024] [Indexed: 06/04/2024] Open
Abstract
BACKGROUND Like its parent base 5-methylcytosine (5mC), 5-hydroxymethylcytosine (5hmC) is a direct epigenetic modification of cytosines in the context of CpG dinucleotides. 5hmC is the most abundant oxidized form of 5mC, generated through the action of TET dioxygenases at gene bodies of actively-transcribed genes and at active or lineage-specific enhancers. Although such enrichments are reported for 5hmC, to date, predictive models of gene expression state or putative regulatory regions for genes using 5hmC have not been developed. RESULTS Here, by using only 5hmC enrichment in genic regions and their vicinity, we develop neural network models that predict gene expression state across 49 cell types. We show that our deep neural network models distinguish high vs low expression state utilizing only 5hmC levels and these predictive models generalize to unseen cell types. Further, in order to leverage 5hmC signal in distal enhancers for expression prediction, we employ an Activity-by-Contact model and also develop a graph convolutional neural network model with both utilizing Hi-C data and 5hmC enrichment to prioritize enhancer-promoter links. These approaches identify known and novel putative enhancers for key genes in multiple immune cell subsets. CONCLUSIONS Our work highlights the importance of 5hmC in gene regulation through proximal and distal mechanisms and provides a framework to link it to genome function. With the recent advances in 6-letter DNA sequencing by short and long-read techniques, profiling of 5mC and 5hmC may be done routinely in the near future, hence, providing a broad range of applications for the methods developed here.
Collapse
Affiliation(s)
- Edahi Gonzalez-Avalos
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA
| | - Atsushi Onodera
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA
- Department of Immunology, Graduate School of Medicine, Chiba University, Chiba, 260-8670, Japan
| | - Daniela Samaniego-Castruita
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA
- Biological Sciences Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA
| | - Anjana Rao
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA.
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA.
- Department of Pharmacology, University of California San Diego, La Jolla, CA, 92093, USA.
- Sanford Consortium for Regenerative Medicine, La Jolla, CA, 92093, USA.
- Moores Cancer Center, University of California San Diego, La Jolla, CA, 92093, USA.
| | - Ferhat Ay
- La Jolla Institute for Immunology, 9420 Athena Circle, La Jolla, CA, 92037, USA.
- Bioinformatics and Systems Biology Graduate Program, University of California San Diego, La Jolla, CA, 92093, USA.
- Moores Cancer Center, University of California San Diego, La Jolla, CA, 92093, USA.
- Department of Pediatrics, University of California San Diego, La Jolla, CA, 92093, USA.
| |
Collapse
|
32
|
Gjoni K, Pollard KS. SuPreMo: a computational tool for streamlining in silico perturbation using sequence-based predictive models. Bioinformatics 2024; 40:btae340. [PMID: 38796686 PMCID: PMC11153836 DOI: 10.1093/bioinformatics/btae340] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2023] [Revised: 05/04/2024] [Accepted: 05/24/2024] [Indexed: 05/28/2024] Open
Abstract
SUMMARY The increasing development of sequence-based machine learning models has raised the demand for manipulating sequences for this application. However, existing approaches to edit and evaluate genome sequences using models have limitations, such as incompatibility with structural variants, challenges in identifying responsible sequence perturbations, and the need for vcf file inputs and phased data. To address these bottlenecks, we present Sequence Mutator for Predictive Models (SuPreMo), a scalable and comprehensive tool for performing and supporting in silico mutagenesis experiments. We then demonstrate how pairs of reference and perturbed sequences can be used with machine learning models to prioritize pathogenic variants or discover new functional sequences. AVAILABILITY AND IMPLEMENTATION SuPreMo was written in Python, and can be run using only one line of code to generate both sequences and 3D genome disruption scores. The codebase, instructions for installation and use, and tutorials are on the GitHub page: https://github.com/ketringjoni/SuPreMo.
Collapse
Affiliation(s)
- Ketrin Gjoni
- Institute of Data Science and Biotechnology, Gladstone Institutes, 1650 Owens Street, San Francisco, CA 94158, United States
- Department of Epidemiology & Biostatistics, University of California, San Francisco, CA 94158, United States
| | - Katherine S Pollard
- Institute of Data Science and Biotechnology, Gladstone Institutes, 1650 Owens Street, San Francisco, CA 94158, United States
- Department of Epidemiology & Biostatistics, University of California, San Francisco, CA 94158, United States
- Chan Zuckerberg Biohub, San Francisco, CA 94158, United States
| |
Collapse
|
33
|
Hwang H, Jeon H, Yeo N, Baek D. Big data and deep learning for RNA biology. Exp Mol Med 2024; 56:1293-1321. [PMID: 38871816 PMCID: PMC11263376 DOI: 10.1038/s12276-024-01243-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2024] [Revised: 02/27/2024] [Accepted: 03/05/2024] [Indexed: 06/15/2024] Open
Abstract
The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these DL concepts to various problems in RB by demonstrating successful examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.
Collapse
Affiliation(s)
- Hyeonseo Hwang
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| | - Hyeonseong Jeon
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
- Genome4me Inc., Seoul, Republic of Korea
| | - Nagyeong Yeo
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
| | - Daehyun Baek
- School of Biological Sciences, Seoul National University, Seoul, Republic of Korea.
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea.
- Genome4me Inc., Seoul, Republic of Korea.
| |
Collapse
|
34
|
Wang S, Wang W. Interpretable prediction of mRNA abundance from promoter sequence using contextual regression models. NAR Genom Bioinform 2024; 6:lqae055. [PMID: 38807713 PMCID: PMC11131020 DOI: 10.1093/nargab/lqae055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2023] [Revised: 04/08/2024] [Accepted: 05/12/2024] [Indexed: 05/30/2024] Open
Abstract
While machine learning models have been successfully applied to predicting gene expression from promoter sequences, it remains a great challenge to derive intuitive interpretation of the model and reveal DNA motif grammar such as motif cooperation and distance constraint between motif sites. Previous interpretation approaches are often time-consuming or have difficulty to learn the combinatory rules. In this work, we designed interpretable neural network models to predict the mRNA expression levels from DNA sequences. By applying the Contextual Regression framework we developed, we extracted weighted features to cluster samples into different groups, which have different gene expression levels. We performed motif analysis in each cluster and found motifs with active or repressive regulation on gene expression. By comparing the co-occurrence locations of discovered motifs, we also uncovered multiple grammars of motif combination including communities of cooperative motifs and distance constraints between motif pairs. These results revealed new insights of the regulatory architecture of promoter sequences.
Collapse
Affiliation(s)
- Song Wang
- Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359, USA
| | - Wei Wang
- Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359, USA
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA 92093-0359, USA
| |
Collapse
|
35
|
Reddy AJ, Herschl MH, Geng X, Kolli S, Lu AX, Kumar A, Hsu PD, Levine S, Ioannidis NM. Strategies for effectively modelling promoter-driven gene expression using transfer learning. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.02.24.529941. [PMID: 36909524 PMCID: PMC10002662 DOI: 10.1101/2023.02.24.529941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/03/2023]
Abstract
The ability to deliver genetic cargo to human cells is enabling rapid progress in molecular medicine, but designing this cargo for precise expression in specific cell types is a major challenge. Expression is driven by regulatory DNA sequences within short synthetic promoters, but relatively few of these promoters are cell-type-specific. The ability to design cell-type-specific promoters using model-based optimization would be impactful for research and therapeutic applications. However, models of expression from short synthetic promoters (promoter-driven expression) are lacking for most cell types due to insufficient training data in those cell types. Although there are many large datasets of both endogenous expression and promoter-driven expression in other cell types, which provide information that could be used for transfer learning, transfer strategies remain largely unexplored for predicting promoter-driven expression. Here, we propose a variety of pretraining tasks, transfer strategies, and model architectures for modelling promoter-driven expression. To thoroughly evaluate various methods, we propose two benchmarks that reflect data-constrained and large dataset settings. In the data-constrained setting, we find that pretraining followed by transfer learning is highly effective, improving performance by 24-27%. In the large dataset setting, transfer learning leads to more modest gains, improving performance by up to 2%. We also propose the best architecture to model promoter-driven expression when training from scratch. The methods we identify are broadly applicable for modelling promoter-driven expression in understudied cell types, and our findings will guide the choice of models that are best suited to designing promoters for gene delivery applications using model-based optimization. Our code and data are available at https://github.com/anikethjr/promoter_models.
Collapse
|
36
|
Wang J, Agarwal V. How DNA encodes the start of transcription. Science 2024; 384:382-383. [PMID: 38662850 DOI: 10.1126/science.adp0869] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/03/2024]
Abstract
A deep-learning model reveals the rules that define transcription initiation.
Collapse
Affiliation(s)
- Jun Wang
- mRNA Center of Excellence, Sanofi Pasteur, Inc., Waltham, MA, USA
| | - Vikram Agarwal
- mRNA Center of Excellence, Sanofi Pasteur, Inc., Waltham, MA, USA
| |
Collapse
|
37
|
Dudnyk K, Cai D, Shi C, Xu J, Zhou J. Sequence basis of transcription initiation in the human genome. Science 2024; 384:eadj0116. [PMID: 38662817 PMCID: PMC11223672 DOI: 10.1126/science.adj0116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/01/2023] [Accepted: 02/28/2024] [Indexed: 05/03/2024]
Abstract
Transcription initiation is a process that is essential to ensuring the proper function of any gene, yet we still lack a unified understanding of sequence patterns and rules that explain most transcription start sites in the human genome. By predicting transcription initiation at base-pair resolution from sequences with a deep learning-inspired explainable model called Puffin, we show that a small set of simple rules can explain transcription initiation at most human promoters. We identify key sequence patterns that contribute to human promoter activity, each activating transcription with distinct position-specific effects. Furthermore, we explain the sequence basis of bidirectional transcription at promoters, identify the links between promoter sequence and gene expression variation across cell types, and explore the conservation of sequence determinants of transcription initiation across mammalian species.
Collapse
Affiliation(s)
- Kseniia Dudnyk
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
| | - Donghong Cai
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
- Center of Excellence for Leukemia Studies (CELS), Department of Pathology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
| | - Chenlai Shi
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
| | - Jian Xu
- Center of Excellence for Leukemia Studies (CELS), Department of Pathology, St. Jude Children’s Research Hospital, Memphis, Tennessee, United States of America
| | - Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center; Dallas, Texas, United States of America
| |
Collapse
|
38
|
Zhou Y, Qi T, Pan M, Tu J, Zhao X, Ge Q, Lu Z. Deep-Cloud: A Deep Neural Network-Based Approach for Analyzing Differentially Expressed Genes of RNA-seq Data. J Chem Inf Model 2024; 64:2302-2310. [PMID: 37682833 DOI: 10.1021/acs.jcim.3c00766] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/10/2023]
Abstract
Presently, the field of analyzing differentially expressed genes (DEGs) of RNA-seq data is still in its infancy, with new approaches constantly being proposed. Taking advantage of deep neural networks to explore gene expression information on RNA-seq data can provide a novel possibility in the biomedical field. In this study, a novel approach based on a deep learning algorithm and cloud model was developed, named Deep-Cloud. Its main advantage is not only using a convolutional neural network and long short-term memory to extract original data features and estimate gene expression of RNA-seq data but also combining the statistical method of the cloud model to quantify the uncertainty and carry out in-depth analysis of the DEGs between the disease groups and the control groups. Compared with traditional analysis software of DEGs, the Deep-cloud model further improves the sensitivity and accuracy of obtaining DEGs from RNA-seq data. Overall, the proposed new approach Deep-cloud paves a new pathway for mining RNA-seq data in the biomedical field.
Collapse
Affiliation(s)
- Ying Zhou
- State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University, Nanjing 210096, China
| | - Ting Qi
- State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University, Nanjing 210096, China
| | - Min Pan
- School of Medicine, Southeast University, Nanjing 210097, China
| | - Jing Tu
- State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University, Nanjing 210096, China
| | - Xiangwei Zhao
- State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University, Nanjing 210096, China
| | - Qinyu Ge
- State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University, Nanjing 210096, China
| | - Zuhong Lu
- State Key Laboratory of Bioelectronics, School of Biological Science & Medical Engineering, Southeast University, Nanjing 210096, China
| |
Collapse
|
39
|
Robson ES, Ioannidis NM. GUANinE v1.0: Benchmark Datasets for Genomic AI Sequence-to-Function Models. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.10.12.562113. [PMID: 37904945 PMCID: PMC10614795 DOI: 10.1101/2023.10.12.562113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/01/2023]
Abstract
Computational genomics increasingly relies on machine learning methods for genome interpretation, and the recent adoption of neural sequence-to-function models highlights the need for rigorous model specification and controlled evaluation, problems familiar to other fields of AI. Research strategies that have greatly benefited other fields - including benchmarking, auditing, and algorithmic fairness - are also needed to advance the field of genomic AI and to facilitate model development. Here we propose a genomic AI benchmark, GUANinE, for evaluating model generalization across a number of distinct genomic tasks. Compared to existing task formulations in computational genomics, GUANinE is large-scale, de-noised, and suitable for evaluating pretrained models. GUANinE v1.0 primarily focuses on functional genomics tasks such as functional element annotation and gene expression prediction, and it also draws upon connections to evolutionary biology through sequence conservation tasks. The current GUANinE tasks provide insight into the performance of existing genomic AI models and non-neural baselines, with opportunities to be refined, revisited, and broadened as the field matures. Finally, the GUANinE benchmark allows us to evaluate new self-supervised T5 models and explore the tradeoffs between tokenization and model performance, while showcasing the potential for self-supervision to complement existing pretraining procedures.
Collapse
Affiliation(s)
- Eyes S Robson
- Center for Computational Biology, UC Berkeley, Berkeley, CA 94720
| | - Nilah M Ioannidis
- Department of Electrical Engineering and Computer Sciences, UC Berkeley, Berkeley, CA 94720
| |
Collapse
|
40
|
Michielsen L, Reinders MJT, Mahfouz A. Predicting cell population-specific gene expression from genomic sequence. FRONTIERS IN BIOINFORMATICS 2024; 4:1347276. [PMID: 38501113 PMCID: PMC10944912 DOI: 10.3389/fbinf.2024.1347276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2023] [Accepted: 01/23/2024] [Indexed: 03/20/2024] Open
Abstract
Most regulatory elements, especially enhancer sequences, are cell population-specific. One could even argue that a distinct set of regulatory elements is what defines a cell population. However, discovering which non-coding regions of the DNA are essential in which context, and as a result, which genes are expressed, is a difficult task. Some computational models tackle this problem by predicting gene expression directly from the genomic sequence. These models are currently limited to predicting bulk measurements and mainly make tissue-specific predictions. Here, we present a model that leverages single-cell RNA-sequencing data to predict gene expression. We show that cell population-specific models outperform tissue-specific models, especially when the expression profile of a cell population and the corresponding tissue are dissimilar. Further, we show that our model can prioritize GWAS variants and learn motifs of transcription factor binding sites. We envision that our model can be useful for delineating cell population-specific regulatory elements.
Collapse
Affiliation(s)
- Lieke Michielsen
- Department of Human Genetics, Leiden University Medical Center, Leiden, Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, Netherlands
- Delft Bioinformatics Lab, Delft University of Technology, Delft, Netherlands
| | - Marcel J. T. Reinders
- Department of Human Genetics, Leiden University Medical Center, Leiden, Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, Netherlands
- Delft Bioinformatics Lab, Delft University of Technology, Delft, Netherlands
| | - Ahmed Mahfouz
- Department of Human Genetics, Leiden University Medical Center, Leiden, Netherlands
- Leiden Computational Biology Center, Leiden University Medical Center, Leiden, Netherlands
- Delft Bioinformatics Lab, Delft University of Technology, Delft, Netherlands
| |
Collapse
|
41
|
Luthra I, Jensen C, Chen XE, Salaudeen AL, Rafi AM, de Boer CG. Regulatory activity is the default DNA state in eukaryotes. Nat Struct Mol Biol 2024; 31:559-567. [PMID: 38448573 DOI: 10.1038/s41594-024-01235-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/29/2023] [Accepted: 01/29/2024] [Indexed: 03/08/2024]
Abstract
Genomes encode for genes and non-coding DNA, both capable of transcriptional activity. However, unlike canonical genes, many transcripts from non-coding DNA have limited evidence of conservation or function. Here, to determine how much biological noise is expected from non-genic sequences, we quantify the regulatory activity of evolutionarily naive DNA using RNA-seq in yeast and computational predictions in humans. In yeast, more than 99% of naive DNA bases were transcribed. Unlike the evolved transcriptome, naive transcripts frequently overlapped with opposite sense transcripts, suggesting selection favored coherent gene structures in the yeast genome. In humans, regulation-associated chromatin activity is predicted to be common in naive dinucleotide-content-matched randomized DNA. Here, naive and evolved DNA have similar co-occurrence and cell-type specificity of chromatin marks, challenging these as indicators of selection. However, in both yeast and humans, extreme high activities were rare in naive DNA, suggesting they result from selection. Overall, basal regulatory activity seems to be the default, which selection can hone to evolve a function or, if detrimental, repress.
Collapse
Affiliation(s)
- Ishika Luthra
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
| | - Cassandra Jensen
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
| | - Xinyi E Chen
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
| | - Asfar Lathif Salaudeen
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
| | - Abdul Muntakim Rafi
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada
| | - Carl G de Boer
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada.
| |
Collapse
|
42
|
Rafi AM, Nogina D, Penzar D, Lee D, Lee D, Kim N, Kim S, Kim D, Shin Y, Kwak IY, Meshcheryakov G, Lando A, Zinkevich A, Kim BC, Lee J, Kang T, Vaishnav ED, Yadollahpour P, Kim S, Albrecht J, Regev A, Gong W, Kulakovskiy IV, Meyer P, de Boer C. Evaluation and optimization of sequence-based gene regulatory deep learning models. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.04.26.538471. [PMID: 38405704 PMCID: PMC10888977 DOI: 10.1101/2023.04.26.538471] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/27/2024]
Abstract
Neural networks have emerged as immensely powerful tools in predicting functional genomic regions, notably evidenced by recent successes in deciphering gene regulatory logic. However, a systematic evaluation of how model architectures and training strategies impact genomics model performance is lacking. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast, to best capture the relationship between regulatory DNA and gene expression. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. While some benchmarks produced similar results across the top-performing models, others differed substantially. All top-performing models used neural networks, but diverged in architectures and novel training strategies, tailored to genomics sequence data. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide any given model into logically equivalent building blocks. We tested all possible combinations for the top three models and observed performance improvements for each. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets. Overall, we demonstrate that high-quality gold-standard genomics datasets can drive significant progress in model development.
Collapse
Affiliation(s)
| | - Daria Nogina
- Lomonosov Moscow State University, Moscow, Russia
| | - Dmitry Penzar
- Lomonosov Moscow State University, Moscow, Russia
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
| | - Dohoon Lee
- Seoul National University, Seoul, South Korea
| | | | - Nayeon Kim
- Seoul National University, Seoul, South Korea
| | | | - Dohyeon Kim
- Seoul National University, Seoul, South Korea
| | - Yeojin Shin
- Seoul National University, Seoul, South Korea
| | | | | | | | - Arsenii Zinkevich
- Lomonosov Moscow State University, Moscow, Russia
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
| | | | - Juhyun Lee
- Chung-Ang University, Seoul, South Korea
| | - Taein Kang
- Chung-Ang University, Seoul, South Korea
| | | | | | - Sun Kim
- Seoul National University, Seoul, South Korea
| | | | - Aviv Regev
- Broad Institute of MIT and Harvard, Massachusetts, United States
- Genentech, South San Francisco, CA, USA
| | - Wuming Gong
- University of Minnesota, Minneapolis, United States
| | - Ivan V Kulakovskiy
- Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
| | | | - Carl de Boer
- University of British Columbia, Vancouver, BC, Canada
| |
Collapse
|
43
|
Fukuda K. The role of transposable elements in human evolution and methods for their functional analysis: current status and future perspectives. Genes Genet Syst 2024; 98:289-304. [PMID: 37866889 DOI: 10.1266/ggs.23-00140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/24/2023] Open
Abstract
Transposable elements (TEs) are mobile DNA sequences that can insert themselves into various locations within the genome, causing mutations that may provide advantages or disadvantages to individuals and species. The insertion of TEs can result in genetic variation that may affect a wide range of human traits including genetic disorders. Understanding the role of TEs in human biology is crucial for both evolutionary and medical research. This review discusses the involvement of TEs in human traits and disease susceptibility, as well as methods for functional analysis of TEs.
Collapse
Affiliation(s)
- Kei Fukuda
- Integrative Genomics Unit, The University of Melbourne
| |
Collapse
|
44
|
Romo L, Findlay SD, Burge CB. Regulatory features aid interpretation of 3'UTR variants. Am J Hum Genet 2024; 111:350-363. [PMID: 38237594 PMCID: PMC10870128 DOI: 10.1016/j.ajhg.2023.12.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2023] [Revised: 12/13/2023] [Accepted: 12/14/2023] [Indexed: 01/30/2024] Open
Abstract
Our ability to determine the clinical impact of variants in 3' untranslated regions (UTRs) of genes remains poor. We provide a thorough analysis of 3' UTR variants from several datasets. Variants in putative regulatory elements, including RNA-binding protein motifs, eCLIP peaks, and microRNA sites, are up to 16 times more likely than variants not in these elements to have gene expression and phenotype associations. Variants in regulatory motifs result in allele-specific protein binding in cell lines and allele-specific gene expression differences in population studies. In addition, variants in shared regions of alternatively polyadenylated isoforms and those proximal to polyA sites are more likely to affect gene expression and phenotype. Finally, pathogenic 3' UTR variants in ClinVar are up to 20 times more likely than benign variants to fall in a regulatory site. We incorporated these findings into RegVar, a software tool that interprets regulatory elements and annotations for any 3' UTR variant and predicts whether the variant is likely to affect gene expression or phenotype. This tool will help prioritize variants for experimental studies and identify pathogenic variants in individuals.
Collapse
Affiliation(s)
- Lindsay Romo
- Harvard Medical Genetics Training Program, Boston Children's Hospital, Boston, MA 02115, USA.
| | - Scott D Findlay
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
| | - Christopher B Burge
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02142, USA.
| |
Collapse
|
45
|
Palazzo AF, Qiu Y, Kang YM. mRNA nuclear export: how mRNA identity features distinguish functional RNAs from junk transcripts. RNA Biol 2024; 21:1-12. [PMID: 38091265 PMCID: PMC10732640 DOI: 10.1080/15476286.2023.2293339] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Revised: 11/27/2023] [Accepted: 12/05/2023] [Indexed: 12/18/2023] Open
Abstract
The division of the cellular space into nucleoplasm and cytoplasm promotes quality control mechanisms that prevent misprocessed mRNAs and junk RNAs from gaining access to the translational machinery. Here, we explore how properly processed mRNAs are distinguished from both misprocessed mRNAs and junk RNAs by the presence or absence of various 'identity features'.
Collapse
Affiliation(s)
| | - Yi Qiu
- Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada
| | - Yoon Mo Kang
- Department of Biochemistry, University of Toronto, Toronto, Ontario, Canada
| |
Collapse
|
46
|
de Boer CG, Taipale J. Hold out the genome: a roadmap to solving the cis-regulatory code. Nature 2024; 625:41-50. [PMID: 38093018 DOI: 10.1038/s41586-023-06661-w] [Citation(s) in RCA: 27] [Impact Index Per Article: 27.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/17/2023] [Accepted: 09/20/2023] [Indexed: 01/05/2024]
Abstract
Gene expression is regulated by transcription factors that work together to read cis-regulatory DNA sequences. The 'cis-regulatory code' - how cells interpret DNA sequences to determine when, where and how much genes should be expressed - has proven to be exceedingly complex. Recently, advances in the scale and resolution of functional genomics assays and machine learning have enabled substantial progress towards deciphering this code. However, the cis-regulatory code will probably never be solved if models are trained only on genomic sequences; regions of homology can easily lead to overestimation of predictive performance, and our genome is too short and has insufficient sequence diversity to learn all relevant parameters. Fortunately, randomly synthesized DNA sequences enable testing a far larger sequence space than exists in our genomes, and designed DNA sequences enable targeted queries to maximally improve the models. As the same biochemical principles are used to interpret DNA regardless of its source, models trained on these synthetic data can predict genomic activity, often better than genome-trained models. Here we provide an outlook on the field, and propose a roadmap towards solving the cis-regulatory code by a combination of machine learning and massively parallel assays using synthetic DNA.
Collapse
Affiliation(s)
- Carl G de Boer
- School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia, Canada.
| | - Jussi Taipale
- Applied Tumor Genomics Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland.
- Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Stockholm, Sweden.
- Department of Biochemistry, University of Cambridge, Cambridge, UK.
| |
Collapse
|
47
|
Rashid A, Al-Obeidat F, Hafez W, Benakatti G, Malik RA, Koutentis C, Sharief J, Brierley J, Quraishi N, Malik ZA, Anwary A, Alkhzaimi H, Zaki SA, Khilnani P, Kadwa R, Phatak R, Schumacher M, Shaikh MG, Al-Dubai A, Hussain A. ADVANCING THE UNDERSTANDING OF CLINICAL SEPSIS USING GENE EXPRESSION-DRIVEN MACHINE LEARNING TO IMPROVE PATIENT OUTCOMES. Shock 2024; 61:4-18. [PMID: 37752080 PMCID: PMC11841734 DOI: 10.1097/shk.0000000000002227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2023] [Accepted: 09/05/2023] [Indexed: 09/28/2023]
Abstract
ABSTRACT Sepsis remains a major challenge that necessitates improved approaches to enhance patient outcomes. This study explored the potential of machine learning (ML) techniques to bridge the gap between clinical data and gene expression information to better predict and understand sepsis. We discuss the application of ML algorithms, including neural networks, deep learning, and ensemble methods, to address key evidence gaps and overcome the challenges in sepsis research. The lack of a clear definition of sepsis is highlighted as a major hurdle, but ML models offer a workaround by focusing on endpoint prediction. We emphasize the significance of gene transcript information and its use in ML models to provide insights into sepsis pathophysiology and biomarker identification. Temporal analysis and integration of gene expression data further enhance the accuracy and predictive capabilities of ML models for sepsis. Although challenges such as interpretability and bias exist, ML research offers exciting prospects for addressing critical clinical problems, improving sepsis management, and advancing precision medicine approaches. Collaborative efforts between clinicians and data scientists are essential for the successful implementation and translation of ML models into clinical practice. Machine learning has the potential to revolutionize our understanding of sepsis and significantly improve patient outcomes. Further research and collaboration between clinicians and data scientists are needed to fully understand the potential of ML in sepsis management.
Collapse
Affiliation(s)
- Asrar Rashid
- School of Computing, Edinburgh Napier University, Edinburgh, UK
- NMC Royal Hospital, Khalifa, Abu Dhabi, UAE
| | - Feras Al-Obeidat
- College of Technological Innovation Zayed University, Abu Dhabi, UAE
| | - Wael Hafez
- NMC Royal Hospital, Khalifa, Abu Dhabi, UAE
- Internal Medicine Department, The Medical Research Division, The National Research Centre, Cairo, Egypt
| | | | - Rayaz A. Malik
- Institute of Cardiovascular Science, University of Manchester, Manchester, UK
- Weill Cornell Medicine-Qatar, Doha, Qatar
| | - Christos Koutentis
- Department of Anesthesiology, SUNY Downstate Medical Center, Brooklyn, New York
| | | | - Joe Brierley
- University College London, NIHR Great Ormond Street Hospital Biomedical Research Centre, London, UK
| | - Nasir Quraishi
- Centre for Spinal Studies & Surgery, Queen’s Medical Centre; The University of Nottingham, Nottingham, UK
| | - Zainab A. Malik
- College of Medicine, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, U.A.E
| | - Arif Anwary
- School of Computing, Edinburgh Napier University, Edinburgh, UK
| | | | - Syed Ahmed Zaki
- All India Institute of Medical Sciences, Bibinagar, Hyderabad, India
| | | | - Raziya Kadwa
- Department of Anesthesiology, SUNY Downstate Medical Center, Brooklyn, New York
| | - Rajesh Phatak
- Pediatric Intensive Care, Burjeel Hospital, Najda, Abu Dhabi
| | | | - M. Guftar Shaikh
- Department of Paediatric Endocrinology, Royal Hospital for Children, Glasgow, UK
| | - Ahmed Al-Dubai
- School of Computing, Edinburgh Napier University, Edinburgh, UK
| | - Amir Hussain
- School of Computing, Edinburgh Napier University, Edinburgh, UK
| |
Collapse
|
48
|
Bajwa A, Rastogi R, Kathail P, Shuai RW, Ioannidis NM. Characterizing uncertainty in predictions of genomic sequence-to-activity models. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.21.572730. [PMID: 38187742 PMCID: PMC10769392 DOI: 10.1101/2023.12.21.572730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/09/2024]
Abstract
Genomic sequence-to-activity models are increasingly utilized to understand gene regulatory syntax and probe the functional consequences of regulatory variation. Current models make accurate predictions of relative activity levels across the human reference genome, but their performance is more limited for predicting the effects of genetic variants, such as explaining gene expression variation across individuals. To better understand the causes of these shortcomings, we examine the uncertainty in predictions of genomic sequence-to-activity models using an ensemble of Basenji2 model replicates. We characterize prediction consistency on four types of sequences: reference genome sequences, reference genome sequences perturbed with TF motifs, eQTLs, and personal genome sequences. We observe that models tend to make high-confidence predictions on reference sequences, even when incorrect, and low-confidence predictions on sequences with variants. For eQTLs and personal genome sequences, we find that model replicates make inconsistent predictions in >50% of cases. Our findings suggest strategies to improve performance of these models.
Collapse
Affiliation(s)
- Ayesha Bajwa
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Ruchir Rastogi
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Pooja Kathail
- Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
| | - Richard W Shuai
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
| | - Nilah M Ioannidis
- Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| |
Collapse
|
49
|
Martyn GE, Montgomery MT, Jones H, Guo K, Doughty BR, Linder J, Chen Z, Cochran K, Lawrence KA, Munson G, Pampari A, Fulco CP, Kelley DR, Lander ES, Kundaje A, Engreitz JM. Rewriting regulatory DNA to dissect and reprogram gene expression. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.12.20.572268. [PMID: 38187584 PMCID: PMC10769263 DOI: 10.1101/2023.12.20.572268] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/09/2024]
Abstract
Regulatory DNA sequences within enhancers and promoters bind transcription factors to encode cell type-specific patterns of gene expression. However, the regulatory effects and programmability of such DNA sequences remain difficult to map or predict because we have lacked scalable methods to precisely edit regulatory DNA and quantify the effects in an endogenous genomic context. Here we present an approach to measure the quantitative effects of hundreds of designed DNA sequence variants on gene expression, by combining pooled CRISPR prime editing with RNA fluorescence in situ hybridization and cell sorting (Variant-FlowFISH). We apply this method to mutagenize and rewrite regulatory DNA sequences in an enhancer and the promoter of PPIF in two immune cell lines. Of 672 variant-cell type pairs, we identify 497 that affect PPIF expression. These variants appear to act through a variety of mechanisms including disruption or optimization of existing transcription factor binding sites, as well as creation of de novo sites. Disrupting a single endogenous transcription factor binding site often led to large changes in expression (up to -40% in the enhancer, and -50% in the promoter). The same variant often had different effects across cell types and states, demonstrating a highly tunable regulatory landscape. We use these data to benchmark performance of sequence-based predictive models of gene regulation, and find that certain types of variants are not accurately predicted by existing models. Finally, we computationally design 185 small sequence variants (≤10 bp) and optimize them for specific effects on expression in silico. 84% of these rationally designed edits showed the intended direction of effect, and some had dramatic effects on expression (-100% to +202%). Variant-FlowFISH thus provides a powerful tool to map the effects of variants and transcription factor binding sites on gene expression, test and improve computational models of gene regulation, and reprogram regulatory DNA.
Collapse
Affiliation(s)
- Gabriella E Martyn
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA, USA
| | - Michael T Montgomery
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA, USA
| | - Hank Jones
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA, USA
| | - Katherine Guo
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA, USA
| | - Benjamin R Doughty
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | | | - Ziwei Chen
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Kelly Cochran
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Kathryn A Lawrence
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
| | - Glen Munson
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Gene Regulation Observatory, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Anusri Pampari
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Charles P Fulco
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Present Address: Sanofi, Cambridge, MA, USA
| | | | - Eric S Lander
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Biology, MIT, Cambridge, MA, USA
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | - Anshul Kundaje
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Jesse M Engreitz
- Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
- Basic Science and Engineering Initiative, Stanford Children's Health, Betty Irene Moore Children's Heart Center, Stanford, CA, USA
- The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Gene Regulation Observatory, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Stanford Cardiovascular Institute, Stanford University, Stanford, CA, USA
| |
Collapse
|
50
|
Tang Z, Toneyan S, Koo PK. Current approaches to genomic deep learning struggle to fully capture human genetic variation. Nat Genet 2023; 55:2021-2022. [PMID: 38036789 DOI: 10.1038/s41588-023-01517-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2023]
Affiliation(s)
- Ziqi Tang
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Shushan Toneyan
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| |
Collapse
|