1
|
Lal A, Gunsalus L, Gupta A, Biancalani T, Eraslan G. Polygraph: a software framework for the systematic assessment of synthetic regulatory DNA elements. Genome Biol 2025; 26:114. [PMID: 40329378 PMCID: PMC12054168 DOI: 10.1186/s13059-025-03584-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2024] [Accepted: 04/23/2025] [Indexed: 05/08/2025] Open
Abstract
The design of regulatory elements is pivotal in gene and cell therapy, where DNA sequences are engineered to drive elevated and cell-type specific expression. However, the systematic assessment of synthetic DNA sequences without robust metrics and easy-to-use software remains challenging. Here, we introduce Polygraph, a Python framework that evaluates synthetic DNA elements, based on features like diversity, motif and k-mer composition, similarity to endogenous sequences, and screening with predictive and foundational models. Polygraph is the first instrument for assessing synthetic regulatory sequences, enabling faster progress in therapeutic interventions and improving our understanding of gene regulatory mechanisms.
Collapse
Affiliation(s)
- Avantika Lal
- Biology Research | AI Development, gRED Computational Sciences, Genentech, South San Francisco, USA
| | - Laura Gunsalus
- Biology Research | AI Development, gRED Computational Sciences, Genentech, South San Francisco, USA
| | - Anay Gupta
- College of Computing, Georgia Institute of Technology, Atlanta, GA, USA
| | - Tommaso Biancalani
- Biology Research | AI Development, gRED Computational Sciences, Genentech, South San Francisco, USA
| | - Gokcen Eraslan
- Biology Research | AI Development, gRED Computational Sciences, Genentech, South San Francisco, USA.
| |
Collapse
|
2
|
James JS, Dai J, Chew WL, Cai Y. The design and engineering of synthetic genomes. Nat Rev Genet 2025; 26:298-319. [PMID: 39506144 DOI: 10.1038/s41576-024-00786-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 09/23/2024] [Indexed: 11/08/2024]
Abstract
Synthetic genomics seeks to design and construct entire genomes to mechanistically dissect fundamental questions of genome function and to engineer organisms for diverse applications, including bioproduction of high-value chemicals and biologics, advanced cell therapies, and stress-tolerant crops. Recent progress has been fuelled by advancements in DNA synthesis, assembly, delivery and editing. Computational innovations, such as the use of artificial intelligence to provide prediction of function, also provide increasing capabilities to guide synthetic genome design and construction. However, translating synthetic genome-scale projects from idea to implementation remains highly complex. Here, we aim to streamline this implementation process by comprehensively reviewing the strategies for design, construction, delivery, debugging and tailoring of synthetic genomes as well as their potential applications.
Collapse
Affiliation(s)
- Joshua S James
- Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
| | - Junbiao Dai
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Shenzhen Key Laboratory of Agricultural Synthetic Biology, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Shenzhen Key Laboratory of Synthetic Genomics, Guangdong Provincial Key Laboratory of Synthetic Genomics, Shenzhen Institute of Synthetic Biology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Wei Leong Chew
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), Singapore, Republic of Singapore
| | - Yizhi Cai
- Manchester Institute of Biotechnology, University of Manchester, Manchester, UK.
| |
Collapse
|
3
|
Garg S, Kim M, Romero-Suarez D. Current advancements in fungal engineering technologies for Sustainable Development Goals. Trends Microbiol 2025; 33:285-301. [PMID: 39645481 DOI: 10.1016/j.tim.2024.11.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2024] [Revised: 10/18/2024] [Accepted: 11/06/2024] [Indexed: 12/09/2024]
Abstract
Fungi are emerging as key organisms in tackling global challenges related to agricultural and food productivity, environmental sustainability, and climate change. This review delves into the transformative potential of fungal genomics and metabolic engineering, two forefront fields in modern biotechnology. Fungal genomics entails the thorough analysis and manipulation of fungal genetic material to enhance desirable traits, such as pest resistance, nutrient absorption, and stress tolerance. Metabolic engineering focuses on altering the biochemical pathways within fungi to optimize the production of valuable compounds, including biofuels, pharmaceuticals, and industrial enzymes. By artificial intelligence (AI)-driven integration of genetic and metabolic engineering techniques, we can harness the unique capabilities of both filamentous and mycorrhizal fungi to develop sustainable agricultural practices, enhance soil health, and promote ecosystem restoration. This review explores the current state of research, technological advancements, and practical applications, offering insights into scalability challenges on how integrative fungal genomics and metabolic engineering can deliver innovative solutions for a sustainable future.
Collapse
Affiliation(s)
- Shilpa Garg
- Technical University of Denmark, 2800 Kongens Lyngby, Denmark; University of Manchester, Manchester M13 9PT, United Kingdom.
| | - Minji Kim
- Technical University of Denmark, 2800 Kongens Lyngby, Denmark
| | - David Romero-Suarez
- ARC Center of Excellence in Synthetic Biology, Australian Genome Foundry, and School of Natural Sciences, Macquarie University, Sydney, Australia
| |
Collapse
|
4
|
Li Z, Wang X, Hu G, Li X, Song W, Wei W, Liu L, Gao C. Engineering metabolic flux for the microbial synthesis of aromatic compounds. Metab Eng 2025; 88:94-112. [PMID: 39724940 DOI: 10.1016/j.ymben.2024.12.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2024] [Revised: 10/10/2024] [Accepted: 12/21/2024] [Indexed: 12/28/2024]
Abstract
Microbial cell factories have emerged as a sustainable alternative to traditional chemical synthesis and plant extraction methods for producing aromatic compounds. However, achieving economically viable production of these compounds in microbial systems remains a significant challenge. This review summarizes the latest advancements in metabolic flux regulation during the microbial production of aromatic compounds, providing an overview of its applications and practical outcomes. Various strategies aimed at improving the utilization of extracellular substrates, enhancing the efficiency of synthetic pathways for target products, and rewiring intracellular metabolic networks to boost the titer, yield, and productivity of aromatic compounds are discussed. Additionally, the persistent challenges in this field and potential solutions are highlighted.
Collapse
Affiliation(s)
- Zhendong Li
- School of Biotechnology and Key Laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi, 214122, China
| | - Xianghe Wang
- School of Biotechnology and Key Laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi, 214122, China
| | - Guipeng Hu
- School of Life Sciences and Health Engineering, Jiangnan University, Wuxi, 214122, China
| | - Xiaomin Li
- School of Biotechnology and Key Laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi, 214122, China
| | - Wei Song
- School of Life Sciences and Health Engineering, Jiangnan University, Wuxi, 214122, China
| | - Wanqing Wei
- School of Biotechnology and Key Laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi, 214122, China
| | - Liming Liu
- School of Biotechnology and Key Laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi, 214122, China
| | - Cong Gao
- School of Biotechnology and Key Laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi, 214122, China.
| |
Collapse
|
5
|
Shen Y, Kudla G, Oyarzún DA. Improving the generalization of protein expression models with mechanistic sequence information. Nucleic Acids Res 2025; 53:gkaf020. [PMID: 39873269 PMCID: PMC11773361 DOI: 10.1093/nar/gkaf020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2024] [Revised: 12/12/2024] [Accepted: 01/08/2025] [Indexed: 01/30/2025] Open
Abstract
The growing demand for biological products drives many efforts to maximize expression of heterologous proteins. Advances in high-throughput sequencing can produce data suitable for building sequence-to-expression models with machine learning. The most accurate models have been trained on one-hot encodings, a mechanism-agnostic representation of nucleotide sequences. Moreover, studies have consistently shown that training on mechanistic sequence features leads to much poorer predictions, even with features that are known to correlate with expression, such as DNA sequence motifs, codon usage, or properties of mRNA secondary structures. However, despite their excellent local accuracy, current sequence-to-expression models can fail to generalize predictions far away from the training data. Through a comparative study across datasets in Escherichia coli and Saccharomyces cerevisiae, here we show that mechanistic sequence features can provide gains on model generalization, and thus improve their utility for predictive sequence design. We explore several strategies to integrate one-hot encodings and mechanistic features into a single predictive model, including feature stacking, ensemble model stacking, and geometric stacking, a novel architecture based on graph convolutional neural networks. Our work casts new light on mechanistic sequence features, underscoring the importance of domain-knowledge and feature engineering for accurate prediction of protein expression levels.
Collapse
Affiliation(s)
- Yuxin Shen
- School of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3JH, United Kingdom
| | - Grzegorz Kudla
- Institute for Genetics and Cancer, University of Edinburgh, Edinburgh, EH4 2XU, United Kingdom
| | - Diego A Oyarzún
- School of Biological Sciences, University of Edinburgh, Edinburgh, EH9 3JH, United Kingdom
- School of Informatics, University of Edinburgh, Edinburgh, EH8 9AB, United Kingdom
| |
Collapse
|
6
|
Cherednichenko O, Poptsova M. Data augmentation with generative models improves detection of Non-B DNA structures. Comput Biol Med 2025; 184:109440. [PMID: 39550912 DOI: 10.1016/j.compbiomed.2024.109440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/11/2024] [Revised: 11/10/2024] [Accepted: 11/12/2024] [Indexed: 11/19/2024]
Abstract
Non-B DNA structures, or flipons, are important functional elements that regulate a large spectrum of cellular programs. Experimental technologies for flipon detection are limited to the subsets that are active at the time of an experiment and cannot capture whole-genome functional set. Thus, the task of generating reliable whole-genome annotations of non-B DNA structures is put on deep learning models, however their quality depends on the available experimental data for training. The data augmentation approach as the combination of synthetic and real data is widely used in various fields. Deep generative models demonstrated promising results in data augmentation improving classifiers' performance. Here we aimed at testing performance of diffusion models in comparison to other generative models in generating synthetic non-B DNA structures for data augmentation approach. We tested denoising diffusion probabilistic and implicit models (DDPM and DDIM), Wasserstein generative adversarial network (WGAN), vector quantised variational autoencoder (VQ-VAE) and showed that data augmentation improves the quality of classifiers. Diffusion models overall show the best results, but when considering three criteria of generative trilemma - quality of generated samples, diversity and sampling speed, we conclude that trade-off is possible between generative diffusion model and other architectures such as WGAN and VQ-VAE.
Collapse
Affiliation(s)
| | - Maria Poptsova
- International Laboratory of Bioinformatics, HSE University, Moscow, Russia.
| |
Collapse
|
7
|
Li Z, Zhang Y, Peng B, Qin S, Zhang Q, Chen Y, Chen C, Bao Y, Zhu Y, Hong Y, Liu B, Liu Q, Xu L, Chen X, Ma X, Wang H, Xie L, Yao Y, Deng B, Li J, De B, Chen Y, Wang J, Li T, Liu R, Tang Z, Cao J, Zuo E, Mei C, Zhu F, Shao C, Wang G, Sun T, Wang N, Liu G, Ni JQ, Liu Y. A novel interpretable deep learning-based computational framework designed synthetic enhancers with broad cross-species activity. Nucleic Acids Res 2024; 52:13447-13468. [PMID: 39420601 PMCID: PMC11602155 DOI: 10.1093/nar/gkae912] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2024] [Revised: 09/25/2024] [Accepted: 10/03/2024] [Indexed: 10/19/2024] Open
Abstract
Enhancers play a critical role in dynamically regulating spatial-temporal gene expression and establishing cell identity, underscoring the significance of designing them with specific properties for applications in biosynthetic engineering and gene therapy. Despite numerous high-throughput methods facilitating genome-wide enhancer identification, deciphering the sequence determinants of their activity remains challenging. Here, we present the DREAM (DNA cis-Regulatory Elements with controllable Activity design platforM) framework, a novel deep learning-based approach for synthetic enhancer design. Proficient in uncovering subtle and intricate patterns within extensive enhancer screening data, DREAM achieves cutting-edge sequence-based enhancer activity prediction and highlights critical sequence features implicating strong enhancer activity. Leveraging DREAM, we have engineered enhancers that surpass the potency of the strongest enhancer within the Drosophila genome by approximately 3.6-fold. Remarkably, these synthetic enhancers exhibited conserved functionality across species that have diverged more than billion years, indicating that DREAM was able to learn highly conserved enhancer regulatory grammar. Additionally, we designed silencers and cell line-specific enhancers using DREAM, demonstrating its versatility. Overall, our study not only introduces an interpretable approach for enhancer design but also lays out a general framework applicable to the design of other types of cis-regulatory elements.
Collapse
Affiliation(s)
- Zhaohong Li
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Yuanyuan Zhang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Bo Peng
- Gene Regulatory Lab, School of Basic Medical Sciences, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- State Key Laboratory of Molecular Oncology, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
| | - Shenghua Qin
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Qian Zhang
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, NO.1 Beichen West Road, Chaoyang District, Beijing 100101, China
| | - Yun Chen
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Choulin Chen
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Yongzhou Bao
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Yuqi Zhu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
| | - Yi Hong
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
| | - Binghua Liu
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
| | - Qian Liu
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
| | - Lingna Xu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Xi Chen
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Xinhao Ma
- College of Grassland Agriculture, National Beef Cattle Improvement Center, College of Animal Science and Technology, Northwest A&F University, NO. 3 Taicheng Road, Yangling District, Yangling, Shaanxi 712100, China
| | - Hongyan Wang
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
| | - Long Xie
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Yilong Yao
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
| | - Biao Deng
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Jiaying Li
- Department of Ophthalmology, Beijing Institute of Ophthalmology, Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Dongjiaomin lane No1, Dongcheng District, Beijing 100101, China
| | - Baojun De
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
| | - Yuting Chen
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
| | - Jing Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Tian Li
- College of JUNCAO Science and Ecology, Haixia Institute of Science and Technology, National Engineering Research Center of JUNCAO, Fujian Agriculture and Forestry University (FAFU), NO.15 Shangxiadian Road, Cangshan District, Fuzhou 0350002, China
| | - Ranran Liu
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Yuanmingyuan West Road NO. 2, Haidian District, Beijing 100193, China
| | - Zhonglin Tang
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
| | - Junwei Cao
- College of Life Sciences, Inner Mongolia Autonomous Region Key Laboratory of Biomanufacturing, Inner Mongolia Agricultural University, NO. 306 Zhaowuda Road, Saihan District, Hohhot 010018, China
| | - Erwei Zuo
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Chugang Mei
- College of Grassland Agriculture, National Beef Cattle Improvement Center, College of Animal Science and Technology, Northwest A&F University, NO. 3 Taicheng Road, Yangling District, Yangling, Shaanxi 712100, China
| | - Fangjie Zhu
- College of JUNCAO Science and Ecology, Haixia Institute of Science and Technology, National Engineering Research Center of JUNCAO, Fujian Agriculture and Forestry University (FAFU), NO.15 Shangxiadian Road, Cangshan District, Fuzhou 0350002, China
| | - Changwei Shao
- State Key Laboratory of Maricultural Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, NO.106 Nanjing Road, Shinan District, Qingdao, Shandong 266071, China
| | - Guirong Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Synthetic Biology, Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
| | - Tongjun Sun
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, NO. 7 Pengfei Road, Dapeng District, Shenzhen 518124, China
| | - Ningli Wang
- Department of Ophthalmology, Beijing Institute of Ophthalmology, Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University, Dongjiaomin lane No1, Dongcheng District, Beijing 100101, China
| | - Gang Liu
- State Key Laboratory of Mycology, Institute of Microbiology, Chinese Academy of Sciences, NO.1 Beichen West Road, Chaoyang District, Beijing 100101, China
| | - Jian-Quan Ni
- Gene Regulatory Lab, School of Basic Medical Sciences, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- State Key Laboratory of Molecular Oncology, Tsinghua University, NO. 30 Shuangqing road, Haidian district, Beijing 100084, China
- SXMU-Tsinghua Collaborative Innovation Center for Frontier Medicine, Shanxi Medical University, NO. 56 Xinjian South Road, Yingze District, Taiyuan 030001, China
| | - Yuwen Liu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Buxin Road NO. 97, Dapeng District, Shenzhen 518124, China
- Green Healthy Aquaculture Research Center, Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Building 26 Lihe Technology Park, Auxiliary Road of Xinxi Avenue South, Nanhai District, Foshan 528226, China
| |
Collapse
|
8
|
Du Q, Poon MN, Zeng X, Zhang P, Wei Z, Wang H, Wang Y, Wei L, Wang X. Synthetic promoter design in Escherichia coli based on multinomial diffusion model. iScience 2024; 27:111207. [PMID: 39524356 PMCID: PMC11550136 DOI: 10.1016/j.isci.2024.111207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2024] [Revised: 10/03/2024] [Accepted: 10/15/2024] [Indexed: 11/16/2024] Open
Abstract
Generative design of promoters has enhanced the efficiency of de novo creation of functional sequences. Though several deep generative models have been employed in biological sequence generation, including variational autoencoder (VAE) or Wasserstein generative adversarial network (WGAN), these models might struggle with mode collapse and low sample diversity. In this study, we introduce the multinomial diffusion model (MDM) for promoter sequence design and propose a structured set of criteria for effectively comparing the performance of generative models. In silico experiments demonstrate that MDM outperforms existing generative AI approaches. MDM demonstrates superior performance in various computational evaluations, remains robust during the training process, and exhibits a strong ability in capturing weak signals. In addition, we experimentally validated that the majority of our model designed promoters have expression activities in vivo, indicating the practicality and potential of MDM for bioengineering.
Collapse
Affiliation(s)
- Qixiu Du
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - May Nee Poon
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xiaocheng Zeng
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Pengcheng Zhang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Zheng Wei
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Haochen Wang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Ye Wang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Lei Wei
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xiaowo Wang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division, Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China
| |
Collapse
|
9
|
Zhang P, Wei L, Li J, Wang X. Artificial intelligence-guided strategies for next-generation biological sequence design. Natl Sci Rev 2024; 11:nwae343. [PMID: 39606146 PMCID: PMC11601974 DOI: 10.1093/nsr/nwae343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2024] [Revised: 09/20/2024] [Accepted: 09/25/2024] [Indexed: 11/29/2024] Open
Affiliation(s)
- Pengcheng Zhang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, China
| | - Lei Wei
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, China
| | - Jiaqi Li
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, China
| | - Xiaowo Wang
- Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, China
| |
Collapse
|
10
|
La Fleur A, Shi Y, Seelig G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev 2024; 38:843-865. [PMID: 39362779 PMCID: PMC11535156 DOI: 10.1101/gad.351800.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/05/2024]
Abstract
Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set. Models can provide a quantitative understanding of cis-regulatory codes controlling gene expression, enable variant stratification, and guide the design of synthetic regulatory elements for applications from synthetic biology to mRNA and gene therapy. This review focuses on cis-regulatory MPRAs, particularly those that interrogate cotranscriptional and post-transcriptional processes: alternative splicing, cleavage and polyadenylation, translation, and mRNA decay.
Collapse
Affiliation(s)
- Alyssa La Fleur
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
| | - Yongsheng Shi
- Department of Microbiology and Molecular Genetics, School of Medicine, University of California, Irvine, Irvine, California 92697, USA;
| | - Georg Seelig
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA;
- Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, USA
| |
Collapse
|
11
|
Mantena S, Pillai PP, Petros BA, Welch NL, Myhrvold C, Sabeti PC, Metsky HC. Model-directed generation of artificial CRISPR-Cas13a guide RNA sequences improves nucleic acid detection. Nat Biotechnol 2024:10.1038/s41587-024-02422-w. [PMID: 39394482 DOI: 10.1038/s41587-024-02422-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2023] [Accepted: 09/04/2024] [Indexed: 10/13/2024]
Abstract
CRISPR guide RNA sequences deriving exactly from natural sequences may not perform optimally in every application. Here we implement and evaluate algorithms for designing maximally fit, artificial CRISPR-Cas13a guides with multiple mismatches to natural sequences that are tailored for diagnostic applications. These guides offer more sensitive detection of diverse pathogens and discrimination of pathogen variants compared with guides derived directly from natural sequences and illuminate design principles that broaden Cas13a targeting.
Collapse
Affiliation(s)
- Sreekar Mantena
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Statistics, Harvard University, Cambridge, MA, USA
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
| | | | - Brittany A Petros
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Division of Health Sciences and Technology, Harvard Medical School and Massachusetts Institute of Technology, Cambridge, MA, USA
- MD-PhD Program, Harvard/Massachusetts Institute of Technology, Boston, MA, USA
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
| | | | - Cameron Myhrvold
- Department of Molecular Biology, Princeton University, Princeton, NJ, USA
- Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ, USA
- Omenn-Darling Bioengineering Institute, Princeton University, Princeton, NJ, USA
- Department of Chemistry, Princeton University, Princeton, NJ, USA
| | - Pardis C Sabeti
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Howard Hughes Medical Institute, Chevy Chase, MD, USA.
- Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA.
- Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
| | | |
Collapse
|
12
|
Gosai SJ, Castro RI, Fuentes N, Butts JC, Mouri K, Alasoadura M, Kales S, Nguyen TTL, Noche RR, Rao AS, Joy MT, Sabeti PC, Reilly SK, Tewhey R. Machine-guided design of cell-type-targeting cis-regulatory elements. Nature 2024; 634:1211-1220. [PMID: 39443793 PMCID: PMC11525185 DOI: 10.1038/s41586-024-08070-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Accepted: 09/18/2024] [Indexed: 10/25/2024]
Abstract
Cis-regulatory elements (CREs) control gene expression, orchestrating tissue identity, developmental timing and stimulus responses, which collectively define the thousands of unique cell types in the body1-3. While there is great potential for strategically incorporating CREs in therapeutic or biotechnology applications that require tissue specificity, there is no guarantee that an optimal CRE for these intended purposes has arisen naturally. Here we present a platform to engineer and validate synthetic CREs capable of driving gene expression with programmed cell-type specificity. We take advantage of innovations in deep neural network modelling of CRE activity across three cell types, efficient in silico optimization and massively parallel reporter assays to design and empirically test thousands of CREs4-8. Through large-scale in vitro validation, we show that synthetic sequences are more effective at driving cell-type-specific expression in three cell lines compared with natural sequences from the human genome and achieve specificity in analogous tissues when tested in vivo. Synthetic sequences exhibit distinct motif vocabulary associated with activity in the on-target cell type and a simultaneous reduction in the activity of off-target cells. Together, we provide a generalizable framework to prospectively engineer CREs from massively parallel reporter assay models and demonstrate the required literacy to write fit-for-purpose regulatory code.
Collapse
Affiliation(s)
- Sager J Gosai
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Harvard Graduate Program in Biological and Biomedical Science, Boston, MA, USA.
- Department Of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA.
- Howard Hughes Medical Institute, Chevy Chase, MD, USA.
| | | | - Natalia Fuentes
- The Jackson Laboratory, Bar Harbor, ME, USA
- Harvard College, Harvard University, Cambridge, MA, USA
| | - John C Butts
- The Jackson Laboratory, Bar Harbor, ME, USA
- Graduate School of Biomedical Sciences and Engineering, University of Maine, Orono, ME, USA
| | | | | | | | | | - Ramil R Noche
- Department of Comparative Medicine, Yale School of Medicine, New Haven, CT, USA
- Yale Zebrafish Research Core, Yale School of Medicine, New Haven, CT, USA
| | - Arya S Rao
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Harvard Medical School, Boston, MA, USA
| | - Mary T Joy
- The Jackson Laboratory, Bar Harbor, ME, USA
| | - Pardis C Sabeti
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department Of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
- Department of Immunology and Infectious Diseases, Harvard T H Chan School of Public Health, Harvard University, Boston, MA, USA
| | - Steven K Reilly
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA.
- Wu Tsai Institute, Yale University, New Haven, CT, USA.
| | - Ryan Tewhey
- The Jackson Laboratory, Bar Harbor, ME, USA.
- Graduate School of Biomedical Sciences and Engineering, University of Maine, Orono, ME, USA.
- Graduate School of Biomedical Sciences, Tufts University School of Medicine, Boston, MA, USA.
| |
Collapse
|
13
|
Genna V, Reyes-Fraile L, Iglesias-Fernandez J, Orozco M. Nucleic acids in modern molecular therapies: A realm of opportunities for strategic drug design. Curr Opin Struct Biol 2024; 87:102838. [PMID: 38759298 DOI: 10.1016/j.sbi.2024.102838] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2024] [Revised: 04/10/2024] [Accepted: 04/23/2024] [Indexed: 05/19/2024]
Abstract
RNA vaccines have made evident to society what was already known by the scientific community: nucleic acids will be the "drugs of the future." By modifying the genome, interfering in transcription or translation, and by introducing new catalysts into the cell or by mimicking antibody effects, nucleic acids can generate therapeutic activities that are not accessible by any other therapeutic agents. There are, however, challenges that need to be solved in the next few years to make nucleic acids usable in a wide range of therapeutic scenarios. This review illustrates how simulation methods can help achieve this goal.
Collapse
Affiliation(s)
- Vito Genna
- NBD|Nostrum Biodiscovery, Josep Tarradellas 8-10, Barcelona 08019, Spain. https://twitter.com/_VitoGenna_
| | - Laura Reyes-Fraile
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac 10-12, Barcelona 08028, Spain; Sixfold Bioscience Ltd, Translational & Innovation Hub, 84 Wood Ln, London W12 0BZ, United Kingdom
| | | | - Modesto Orozco
- Institute for Research in Biomedicine (IRB Barcelona), The Barcelona Institute of Science and Technology, Baldiri Reixac 10-12, Barcelona 08028, Spain; Department of Biochemistry and Biomedicine, University of Barcelona, Barcelona 08028, Spain.
| |
Collapse
|
14
|
Ding N, Yuan Z, Ma Z, Wu Y, Yin L. AI-Assisted Rational Design and Activity Prediction of Biological Elements for Optimizing Transcription-Factor-Based Biosensors. Molecules 2024; 29:3512. [PMID: 39124917 PMCID: PMC11313831 DOI: 10.3390/molecules29153512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2024] [Revised: 07/22/2024] [Accepted: 07/24/2024] [Indexed: 08/12/2024] Open
Abstract
The rational design, activity prediction, and adaptive application of biological elements (bio-elements) are crucial research fields in synthetic biology. Currently, a major challenge in the field is efficiently designing desired bio-elements and accurately predicting their activity using vast datasets. The advancement of artificial intelligence (AI) technology has enabled machine learning and deep learning algorithms to excel in uncovering patterns in bio-element data and predicting their performance. This review explores the application of AI algorithms in the rational design of bio-elements, activity prediction, and the regulation of transcription-factor-based biosensor response performance using AI-designed elements. We discuss the advantages, adaptability, and biological challenges addressed by the AI algorithms in various applications, highlighting their powerful potential in analyzing biological data. Furthermore, we propose innovative solutions to the challenges faced by AI algorithms in the field and suggest future research directions. By consolidating current research and demonstrating the practical applications and future potential of AI in synthetic biology, this review provides valuable insights for advancing both academic research and practical applications in biotechnology.
Collapse
Affiliation(s)
- Nana Ding
- State Key Laboratory of Subtropical Silviculture, Zhejiang A&F University, Hangzhou 311300, China;
- Zhejiang Provincial Key Laboratory of Resources Protection and Innovation of Traditional Chinese Medicine, Zhejiang A&F University, Hangzhou 311300, China
| | - Zenan Yuan
- State Key Laboratory of Subtropical Silviculture, Zhejiang A&F University, Hangzhou 311300, China;
- Zhejiang Provincial Key Laboratory of Resources Protection and Innovation of Traditional Chinese Medicine, Zhejiang A&F University, Hangzhou 311300, China
| | - Zheng Ma
- Zhejiang Provincial Key Laboratory of Biometrology and Inspection & Quarantine, College of Life Sciences, China Jiliang University, Hangzhou 310018, China;
| | - Yefei Wu
- Zhejiang Qianjiang Biochemical Co., Ltd., Haining 314400, China;
| | - Lianghong Yin
- State Key Laboratory of Subtropical Silviculture, Zhejiang A&F University, Hangzhou 311300, China;
- Zhejiang Provincial Key Laboratory of Resources Protection and Innovation of Traditional Chinese Medicine, Zhejiang A&F University, Hangzhou 311300, China
| |
Collapse
|
15
|
Moeckel C, Mouratidis I, Chantzi N, Uzun Y, Georgakopoulos-Soares I. Advances in computational and experimental approaches for deciphering transcriptional regulatory networks: Understanding the roles of cis-regulatory elements is essential, and recent research utilizing MPRAs, STARR-seq, CRISPR-Cas9, and machine learning has yielded valuable insights. Bioessays 2024; 46:e2300210. [PMID: 38715516 PMCID: PMC11444527 DOI: 10.1002/bies.202300210] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 04/22/2024] [Accepted: 04/23/2024] [Indexed: 05/16/2024]
Abstract
Understanding the influence of cis-regulatory elements on gene regulation poses numerous challenges given complexities stemming from variations in transcription factor (TF) binding, chromatin accessibility, structural constraints, and cell-type differences. This review discusses the role of gene regulatory networks in enhancing understanding of transcriptional regulation and covers construction methods ranging from expression-based approaches to supervised machine learning. Additionally, key experimental methods, including MPRAs and CRISPR-Cas9-based screening, which have significantly contributed to understanding TF binding preferences and cis-regulatory element functions, are explored. Lastly, the potential of machine learning and artificial intelligence to unravel cis-regulatory logic is analyzed. These computational advances have far-reaching implications for precision medicine, therapeutic target discovery, and the study of genetic variations in health and disease.
Collapse
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Yasin Uzun
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Department of Pediatrics, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| |
Collapse
|
16
|
Li T, Xu H, Teng S, Suo M, Bahitwa R, Xu M, Qian Y, Ramstein GP, Song B, Buckler ES, Wang H. Modeling 0.6 million genes for the rational design of functional cis-regulatory variants and de novo design of cis-regulatory sequences. Proc Natl Acad Sci U S A 2024; 121:e2319811121. [PMID: 38889146 PMCID: PMC11214048 DOI: 10.1073/pnas.2319811121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/12/2023] [Accepted: 05/14/2024] [Indexed: 06/20/2024] Open
Abstract
Rational design of plant cis-regulatory DNA sequences without expert intervention or prior domain knowledge is still a daunting task. Here, we developed PhytoExpr, a deep learning framework capable of predicting both mRNA abundance and plant species using the proximal regulatory sequence as the sole input. PhytoExpr was trained over 17 species representative of major clades of the plant kingdom to enhance its generalizability. Via input perturbation, quantitative functional annotation of the input sequence was achieved at single-nucleotide resolution, revealing an abundance of predicted high-impact nucleotides in conserved noncoding sequences and transcription factor binding sites. Evaluation of maize HapMap3 single-nucleotide polymorphisms (SNPs) by PhytoExpr demonstrates an enrichment of predicted high-impact SNPs in cis-eQTL. Additionally, we provided two algorithms that harnessed the power of PhytoExpr in designing functional cis-regulatory variants, and de novo creation of species-specific cis-regulatory sequences through in silico evolution of random DNA sequences. Our model represents a general and robust approach for functional variant discovery in population genetics and rational design of regulatory sequences for genome editing and synthetic biology.
Collapse
Affiliation(s)
- Tianyi Li
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing100193, People’s Republic of China
| | - Hui Xu
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing100193, People’s Republic of China
| | - Shouzhen Teng
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing100193, People’s Republic of China
| | - Mingrui Suo
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing100193, People’s Republic of China
| | - Revocatus Bahitwa
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing100193, People’s Republic of China
- Legumes Research Program, Research and Innovation Division, Tanzania Agricultural Research Institute, Ilonga, Kilosa, Morogoro67410, Tanzania
| | - Mingchi Xu
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing100193, People’s Republic of China
| | - Yiheng Qian
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing100193, People’s Republic of China
| | - Guillaume P. Ramstein
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus8000, Denmark
| | - Baoxing Song
- National Key Laboratory of Wheat Improvement, Peking University Institute of Advanced Agricultural Sciences, Shandong Laboratory of Advanced Agriculture Sciences in Weifang, Weifang, Shandong261325, People’s Republic of China
- Key Laboratory of Maize Biology and Genetic Breeding in Arid Area of Northwest Region of the Ministry of Agriculture, College of Agronomy, Northwest A&F University, Yangling, Shaanxi712100, People’s Republic of China
| | - Edward S. Buckler
- Institute for Genomic Diversity, Cornell University, Ithaca, NY14853
- Agricultural Research Service, United States Department of Agriculture, Ithaca, NY14853
| | - Hai Wang
- State Key Laboratory of Maize Bio-breeding, National Maize Improvement Center, Frontiers Science Center for Molecular Design Breeding, Department of Plant Genetics and Breeding, China Agricultural University, Beijing100193, People’s Republic of China
- Center for Crop Functional Genomics and Molecular Breeding, China Agricultural University, Beijing100193, People’s Republic of China
- Sanya Institute of China Agricultural University, Sanya572025, People’s Republic of China
| |
Collapse
|
17
|
Xia Y, Du X, Liu B, Guo S, Huo YX. Species-specific design of artificial promoters by transfer-learning based generative deep-learning model. Nucleic Acids Res 2024; 52:6145-6157. [PMID: 38783063 PMCID: PMC11194083 DOI: 10.1093/nar/gkae429] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2024] [Revised: 04/04/2024] [Accepted: 05/08/2024] [Indexed: 05/25/2024] Open
Abstract
Native prokaryotic promoters share common sequence patterns, but are species dependent. For understudied species with limited data, it is challenging to predict the strength of existing promoters and generate novel promoters. Here, we developed PromoGen, a collection of nucleotide language models to generate species-specific functional promoters, across dozens of species in a data and parameter efficient way. Twenty-seven species-specific models in this collection were finetuned from the pretrained model which was trained on multi-species promoters. When systematically compared with native promoters, the Escherichia coli- and Bacillus subtilis-specific artificial PromoGen-generated promoters (PGPs) were demonstrated to hold all distribution patterns of native promoters. A regression model was developed to score generated either by PromoGen or by another competitive neural network, and the overall score of PGPs is higher. Encouraged by in silico analysis, we further experimentally characterized twenty-two B. subtilis PGPs, results showed that four of tested PGPs reached the strong promoter level while all were active. Furthermore, we developed a user-friendly website to generate species-specific promoters for 27 different species by PromoGen. This work presented an efficient deep-learning strategy for de novo species-specific promoter generation even with limited datasets, providing valuable promoter toolboxes especially for the metabolic engineering of understudied microorganisms.
Collapse
Affiliation(s)
- Yan Xia
- Key Laboratory of Molecular Medicine and Biotherapy, School of Life Science, Beijing Institute of Technology, Beijing 100081, China
| | - Xiaowen Du
- Key Laboratory of Molecular Medicine and Biotherapy, School of Life Science, Beijing Institute of Technology, Beijing 100081, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Shuyuan Guo
- Key Laboratory of Molecular Medicine and Biotherapy, School of Life Science, Beijing Institute of Technology, Beijing 100081, China
| | - Yi-Xin Huo
- Key Laboratory of Molecular Medicine and Biotherapy, School of Life Science, Beijing Institute of Technology, Beijing 100081, China
- Tangshan Research Institute, Beijing Institute of Technology, Hebei 063611, China
| |
Collapse
|
18
|
Wenteler A, Cabrera CP, Wei W, Neduva V, Barnes MR. AI approaches for the discovery and validation of drug targets. CAMBRIDGE PRISMS. PRECISION MEDICINE 2024; 2:e7. [PMID: 39258224 PMCID: PMC11383977 DOI: 10.1017/pcm.2024.4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/14/2023] [Revised: 05/04/2024] [Accepted: 05/08/2024] [Indexed: 09/12/2024]
Abstract
Artificial intelligence (AI) holds immense promise for accelerating and improving all aspects of drug discovery, not least target discovery and validation. By integrating a diverse range of biological data modalities, AI enables the accurate prediction of drug target properties, ultimately illuminating biological mechanisms of disease and guiding drug discovery strategies. Despite the indisputable potential of AI in drug target discovery, there are many challenges and obstacles yet to be overcome, including dealing with data biases, model interpretability and generalisability, and the validation of predicted drug targets, to name a few. By exploring recent advancements in AI, this review showcases current applications of AI for drug target discovery and offers perspectives on the future of AI for the discovery and validation of drug targets, paving the way for the generation of novel and safer pharmaceuticals.
Collapse
Affiliation(s)
- Aaron Wenteler
- Digital Environment Research Institute, Queen Mary University of London, London, United Kingdom
- Centre for Translational Bioinformatics, William Harvey Research Institute, Queen Mary University of London, London, United Kingdom
- MSD Discovery Centre, London, United Kingdom
| | - Claudia P Cabrera
- Digital Environment Research Institute, Queen Mary University of London, London, United Kingdom
- Centre for Translational Bioinformatics, William Harvey Research Institute, Queen Mary University of London, London, United Kingdom
- NIHR Barts Cardiovascular Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
| | - Wei Wei
- MSD Discovery Centre, London, United Kingdom
| | | | - Michael R Barnes
- Digital Environment Research Institute, Queen Mary University of London, London, United Kingdom
- Centre for Translational Bioinformatics, William Harvey Research Institute, Queen Mary University of London, London, United Kingdom
- NIHR Barts Cardiovascular Biomedical Research Centre, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, United Kingdom
- The Alan Turing Institute, London, United Kingdom
| |
Collapse
|
19
|
Si D, Teng X, Xiong B, Chen L, Shi J. Electrocatalytic functional group conversion-based carbon resource upgrading. Chem Sci 2024; 15:6269-6284. [PMID: 38699249 PMCID: PMC11062096 DOI: 10.1039/d4sc00175c] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2024] [Accepted: 03/23/2024] [Indexed: 05/05/2024] Open
Abstract
The conversions of carbon resources, such as alcohols, aldehydes/ketones, and ethers, have been being one of the hottest topics most recently for the goal of carbon neutralization. The emerging electrocatalytic upgrading has been regarded as a promising strategy aiming to convert carbon resources into value-added chemicals. Although exciting progress has been made and reviewed recently in this area by mostly focusing on the explorations of valuable anodic oxidation or cathodic reduction reactions individually, however, the reaction rules of these reactions are still missing, and how to purposely find or rationally design novel but efficient reactions in batches is still challenging. The properties and transformations of key functional groups in substrate molecules play critically important roles in carbon resources conversion reactions, which have been paid more attention to and may offer hidden keys to achieve the above goal. In this review, the properties of functional groups are addressed and discussed in detail, and the reported electrocatalytic upgrading reactions are summarized in four categories based on the types of functional groups of carbon resources. Possible reaction pathways closely related to functional groups will be summarized from the aspects of activation, cleavage and formation of chemical bonds. The current challenges and future opportunities of electrocatalytic upgrading of carbon resources are discussed at the end of this review.
Collapse
Affiliation(s)
- Di Si
- Shanghai Key Laboratory of Green Chemistry and Chemical Processes, State Key Laboratory of Petroleum Molecular and Process Engineering, School of Chemistry and Molecular Engineering, East China Normal University Shanghai 200062 China
| | - Xue Teng
- Shanghai Key Laboratory of Green Chemistry and Chemical Processes, State Key Laboratory of Petroleum Molecular and Process Engineering, School of Chemistry and Molecular Engineering, East China Normal University Shanghai 200062 China
| | - Bingyan Xiong
- Shanghai Tenth People's Hospital, Shanghai Frontiers Science Center of Nanocatalytic Medicine, School of Medicine, Tongji University Shanghai 200072 P. R. China
| | - Lisong Chen
- Shanghai Key Laboratory of Green Chemistry and Chemical Processes, State Key Laboratory of Petroleum Molecular and Process Engineering, School of Chemistry and Molecular Engineering, East China Normal University Shanghai 200062 China
- Institute of Eco-Chongming Shanghai 202162 China
| | - Jianlin Shi
- Shanghai Institute of Ceramics, Chinese Academy of Sciences Shanghai 200050 P. R. China
| |
Collapse
|
20
|
Kim YA, Mousavi K, Yazdi A, Zwierzyna M, Cardinali M, Fox D, Peel T, Coller J, Aggarwal K, Maruggi G. Computational design of mRNA vaccines. Vaccine 2024; 42:1831-1840. [PMID: 37479613 DOI: 10.1016/j.vaccine.2023.07.024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Revised: 06/23/2023] [Accepted: 07/10/2023] [Indexed: 07/23/2023]
Abstract
mRNA technology has emerged as a successful vaccine platform that offered a swift response to the COVID-19 pandemic. Accumulating evidence shows that vaccine efficacy, thermostability, and other important properties, are largely impacted by intrinsic properties of the mRNA molecule, such as RNA sequence and structure, both of which can be optimized. Designing mRNA sequence for vaccines presents a combinatorial problem due to an extremely large selection space. For instance, due to the degeneracy of the genetic code, there are over 10632 possible mRNA sequences that could encode the spike protein, the COVID-19 vaccines' target. Moreover, designing different elements of the mRNA sequence simultaneously against multiple objectives such as translational efficiency, reduced reactogenicity, and improved stability requires an efficient and sophisticated optimization strategy. Recently, there has been a growing interest in utilizing computational tools to redesign mRNA sequences to improve vaccine characteristics and expedite discovery timelines. In this review, we explore important biophysical features of mRNA to be considered for vaccine design and discuss how computational approaches can be applied to rapidly design mRNA sequences with desirable characteristics.
Collapse
Affiliation(s)
| | | | | | | | | | | | | | - Jeff Coller
- Johns Hopkins University, Baltimore, MD, USA
| | | | | |
Collapse
|
21
|
Wang H, Du Q, Wang Y, Xu H, Wei Z, Wang X. GPro: generative AI-empowered toolkit for promoter design. Bioinformatics 2024; 40:btae123. [PMID: 38429953 PMCID: PMC10937896 DOI: 10.1093/bioinformatics/btae123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2023] [Revised: 02/19/2024] [Accepted: 02/28/2024] [Indexed: 03/03/2024] Open
Abstract
MOTIVATION Promoters with desirable properties are crucial in biotechnological applications. Generative AI (GenAI) has demonstrated potential in creating novel synthetic promoters with significantly enhanced functionality. However, these methods' reliance on various programming frameworks and specific task-oriented contexts limits their flexibilities. Overcoming these limitations is essential for researchers to fully leverage the power of GenAI to design promoters for their tasks. RESULTS Here, we introduce GPro (Generative AI-empowered toolkit for promoter design), a user-friendly toolkit that integrates a collection of cutting-edge GenAI-empowered approaches for promoter design. This toolkit provides a standardized pipeline covering essential promoter design processes, including training, optimization, and evaluation. Several detailed demos are provided to reproduce state-of-the-art promoter design pipelines. GPro's user-friendly interface makes it accessible to a wide range of users including non-AI experts. It also offers a variety of optional algorithms for each design process, and gives users the flexibility to compare methods and create customized pipelines. AVAILABILITY AND IMPLEMENTATION GPro is released as an open-source software under the MIT license. The source code for GPro is available on GitHub for Linux, macOS, and Windows: https://github.com/WangLabTHU/GPro, and is available for download via Zenodo repository at https://zenodo.org/doi/10.5281/zenodo.10681733.
Collapse
Affiliation(s)
- Haochen Wang
- Ministry of Education Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
- Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
- Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
- Department of Automation, Tsinghua University, Beijing 100084, China
| | - Qixiu Du
- Ministry of Education Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
- Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
- Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
- Department of Automation, Tsinghua University, Beijing 100084, China
| | - Ye Wang
- Ministry of Education Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
- Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
- Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
- Department of Automation, Tsinghua University, Beijing 100084, China
| | - Hanwen Xu
- Ministry of Education Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
- Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
- Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
- Department of Automation, Tsinghua University, Beijing 100084, China
| | - Zheng Wei
- Ministry of Education Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
- Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
- Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
- Department of Automation, Tsinghua University, Beijing 100084, China
| | - Xiaowo Wang
- Ministry of Education Key Laboratory of Bioinformatics, Tsinghua University, Beijing 100084, China
- Center for Synthetic and Systems Biology, Tsinghua University, Beijing 100084, China
- Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
- Department of Automation, Tsinghua University, Beijing 100084, China
| |
Collapse
|
22
|
DaSilva LF, Senan S, Patel ZM, Janardhan Reddy A, Gabbita S, Nussbaum Z, Valdez Córdova CM, Wenteler A, Weber N, Tunjic TM, Ahmad Khan T, Li Z, Smith C, Bejan M, Karmel Louis L, Cornejo P, Connell W, Wong ES, Meuleman W, Pinello L. DNA-Diffusion: Leveraging Generative Models for Controlling Chromatin Accessibility and Gene Expression via Synthetic Regulatory Elements. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.01.578352. [PMID: 38352499 PMCID: PMC10862870 DOI: 10.1101/2024.02.01.578352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 02/25/2024]
Abstract
The challenge of systematically modifying and optimizing regulatory elements for precise gene expression control is central to modern genomics and synthetic biology. Advancements in generative AI have paved the way for designing synthetic sequences with the aim of safely and accurately modulating gene expression. We leverage diffusion models to design context-specific DNA regulatory sequences, which hold significant potential toward enabling novel therapeutic applications requiring precise modulation of gene expression. Our framework uses a cell type-specific diffusion model to generate synthetic 200 bp regulatory elements based on chromatin accessibility across different cell types. We evaluate the generated sequences based on key metrics to ensure they retain properties of endogenous sequences: transcription factor binding site composition, potential for cell type-specific chromatin accessibility, and capacity for sequences generated by DNA diffusion to activate gene expression in different cell contexts using state-of-the-art prediction models. Our results demonstrate the ability to robustly generate DNA sequences with cell type-specific regulatory potential. DNA-Diffusion paves the way for revolutionizing a regulatory modulation approach to mammalian synthetic biology and precision gene therapy.
Collapse
Affiliation(s)
- Lucas Ferreira DaSilva
- Department of Pathology, Harvard Medical School, Boston, MA, USA
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
| | - Simon Senan
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Zain Munir Patel
- Department of Pathology, Harvard Medical School, Boston, MA, USA
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Aniketh Janardhan Reddy
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA, USA
| | - Sameer Gabbita
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Johns Hopkins University, Baltimore, MD, USA
| | | | | | | | | | | | | | - Zelun Li
- Victor Chang Cardiac Institute, Darlinghurst, New South Wales, Australia
- School of Biotechnology and Biomolecular Sciences, Faculty of Science, UNSW Sydney, Sydney, Australia
| | - Cameron Smith
- Department of Pathology, Harvard Medical School, Boston, MA, USA
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | | | - Lithin Karmel Louis
- Victor Chang Cardiac Institute, Darlinghurst, New South Wales, Australia
- School of Biotechnology and Biomolecular Sciences, Faculty of Science, UNSW Sydney, Sydney, Australia
| | - Paola Cornejo
- Victor Chang Cardiac Institute, Darlinghurst, New South Wales, Australia
- School of Biotechnology and Biomolecular Sciences, Faculty of Science, UNSW Sydney, Sydney, Australia
| | | | - Emily S. Wong
- Victor Chang Cardiac Institute, Darlinghurst, New South Wales, Australia
- School of Biotechnology and Biomolecular Sciences, Faculty of Science, UNSW Sydney, Sydney, Australia
| | - Wouter Meuleman
- Altius Institute for Biomedical Sciences, Seattle, WA, USA
- Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
| | - Luca Pinello
- Department of Pathology, Harvard Medical School, Boston, MA, USA
- Molecular Pathology Unit, Center for Cancer Research, Massachusetts General Hospital, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
| |
Collapse
|
23
|
Taskiran II, Spanier KI, Dickmänken H, Kempynck N, Pančíková A, Ekşi EC, Hulselmans G, Ismail JN, Theunis K, Vandepoel R, Christiaens V, Mauduit D, Aerts S. Cell-type-directed design of synthetic enhancers. Nature 2024; 626:212-220. [PMID: 38086419 PMCID: PMC10830415 DOI: 10.1038/s41586-023-06936-2] [Citation(s) in RCA: 38] [Impact Index Per Article: 38.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2022] [Accepted: 12/05/2023] [Indexed: 01/19/2024]
Abstract
Transcriptional enhancers act as docking stations for combinations of transcription factors and thereby regulate spatiotemporal activation of their target genes1. It has been a long-standing goal in the field to decode the regulatory logic of an enhancer and to understand the details of how spatiotemporal gene expression is encoded in an enhancer sequence. Here we show that deep learning models2-6, can be used to efficiently design synthetic, cell-type-specific enhancers, starting from random sequences, and that this optimization process allows detailed tracing of enhancer features at single-nucleotide resolution. We evaluate the function of fully synthetic enhancers to specifically target Kenyon cells or glial cells in the fruit fly brain using transgenic animals. We further exploit enhancer design to create 'dual-code' enhancers that target two cell types and minimal enhancers smaller than 50 base pairs that are fully functional. By examining the state space searches towards local optima, we characterize enhancer codes through the strength, combination and arrangement of transcription factor activator and transcription factor repressor motifs. Finally, we apply the same strategies to successfully design human enhancers, which adhere to enhancer rules similar to those of Drosophila enhancers. Enhancer design guided by deep learning leads to better understanding of how enhancers work and shows that their code can be exploited to manipulate cell states.
Collapse
Affiliation(s)
- Ibrahim I Taskiran
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Katina I Spanier
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Hannah Dickmänken
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Niklas Kempynck
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Alexandra Pančíková
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
- VIB-KULeuven Center for Cancer Biology, Leuven, Belgium
| | - Eren Can Ekşi
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Gert Hulselmans
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Joy N Ismail
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
- UK Dementia Research Institute at Imperial College London, London, UK
| | - Koen Theunis
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Roel Vandepoel
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Valerie Christiaens
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - David Mauduit
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium
- Department of Human Genetics, KU Leuven, Leuven, Belgium
| | - Stein Aerts
- Laboratory of Computational Biology, VIB Center for AI & Computational Biology (VIB.AI), Leuven, Belgium.
- VIB-KULeuven Center for Brain & Disease Research, Leuven, Belgium.
- Department of Human Genetics, KU Leuven, Leuven, Belgium.
| |
Collapse
|
24
|
Salimi A, Jang JH, Lee JY. Leveraging attention-enhanced variational autoencoders: Novel approach for investigating latent space of aptamer sequences. Int J Biol Macromol 2024; 255:127884. [PMID: 37926303 DOI: 10.1016/j.ijbiomac.2023.127884] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2023] [Revised: 10/27/2023] [Accepted: 11/02/2023] [Indexed: 11/07/2023]
Abstract
Aptamers are increasingly recognized as potent alternatives to antibodies for diagnostic and therapeutic applications. The application of deep learning, particularly attention-based models, for aptamer (DNA/RNA) sequences is an innovative field. The ongoing advancements in aptamer sequencing technologies coupled with machine learning algorithms have resulted in novel developments. Further research is required to investigate the full potential of deep learning models and address the challenges associated with the generation of sequences, like the large search space of possible sequences. In this study, we propose a workflow that integrates an attention mechanism within a framework of a generative variational autoencoder, to generate novel sequences by expanding latent memory. They show 100 % novelty compared with the dataset, and approximately 88 % of them show negative values for the minimum free energy, which may indicate the likelihood of an RNA sequence folding into a functional structure. Because the field of aptamer discovery is affected by data scarcity, advanced strategies that facilitate the generation of diverse and superior sequences are necessitated. The utilization of our workflow can result in novel aptamers. Thus, investigations such as the present study can address the abovementioned challenge. Our research is anticipated to facilitate further discoveries and advancements in aptamer fields.
Collapse
Affiliation(s)
- Abbas Salimi
- Department of Chemistry, Sungkyunkwan University, Suwon 16419, Republic of Korea
| | - Jee Hwan Jang
- School of Materials Science and Engineering, Sungkyunkwan University, Suwon 16419, Republic of Korea; Ucaretron Inc., No. 3508, 40, Simin-daero 365 beon-gil, Dongan-gu, Anyang-si, Gyeonggi-do, Republic of Korea.
| | - Jin Yong Lee
- Department of Chemistry, Sungkyunkwan University, Suwon 16419, Republic of Korea.
| |
Collapse
|
25
|
Fu ZH, He SZ, Wu Y, Zhao GR. Design and deep learning of synthetic B-cell-specific promoters. Nucleic Acids Res 2023; 51:11967-11979. [PMID: 37889080 PMCID: PMC10681721 DOI: 10.1093/nar/gkad930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2023] [Revised: 09/20/2023] [Accepted: 10/10/2023] [Indexed: 10/28/2023] Open
Abstract
Synthetic biology and deep learning synergistically revolutionize our ability for decoding and recoding DNA regulatory grammar. The B-cell-specific transcriptional regulation is intricate, and unlock the potential of B-cell-specific promoters as synthetic elements is important for B-cell engineering. Here, we designed and pooled synthesized 23 640 B-cell-specific promoters that exhibit larger sequence space, B-cell-specific expression, and enable diverse transcriptional patterns in B-cells. By MPRA (Massively parallel reporter assays), we deciphered the sequence features that regulate promoter transcriptional, including motifs and motif syntax (their combination and distance). Finally, we built and trained a deep learning model capable of predicting the transcriptional strength of the immunoglobulin V gene promoter directly from sequence. Prediction of thousands of promoter variants identified in the global human population shows that polymorphisms in promoters influence the transcription of immunoglobulin V genes, which may contribute to individual differences in adaptive humoral immune responses. Our work helps to decipher the transcription mechanism in immunoglobulin genes and offers thousands of non-similar promoters for B-cell engineering.
Collapse
Affiliation(s)
- Zong-Heng Fu
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
| | - Si-Zhe He
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
| | - Yi Wu
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
| | - Guang-Rong Zhao
- Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (Ministry of Education), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China
- Frontiers Research Institute for Synthetic Biology, Tianjin University, Tianjin 300072, China
| |
Collapse
|
26
|
Merzbacher C, Oyarzún DA. Applications of artificial intelligence and machine learning in dynamic pathway engineering. Biochem Soc Trans 2023; 51:1871-1879. [PMID: 37656433 PMCID: PMC10657174 DOI: 10.1042/bst20221542] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/24/2023] [Revised: 08/07/2023] [Accepted: 08/21/2023] [Indexed: 09/02/2023]
Abstract
Dynamic pathway engineering aims to build metabolic production systems embedded with intracellular control mechanisms for improved performance. These control systems enable host cells to self-regulate the temporal activity of a production pathway in response to perturbations, using a combination of biosensors and feedback circuits for controlling expression of heterologous enzymes. Pathway design, however, requires assembling together multiple biological parts into suitable circuit architectures, as well as careful calibration of the function of each component. This results in a large design space that is costly to navigate through experimentation alone. Methods from artificial intelligence (AI) and machine learning are gaining increasing attention as tools to accelerate the design cycle, owing to their ability to identify hidden patterns in data and rapidly screen through large collections of designs. In this review, we discuss recent developments in the application of machine learning methods to the design of dynamic pathways and their components. We cover recent successes and offer perspectives for future developments in the field. The integration of AI into metabolic engineering pipelines offers great opportunities to streamline design and discover control systems for improved production of high-value chemicals.
Collapse
Affiliation(s)
| | - Diego A. Oyarzún
- School of Informatics, University of Edinburgh, Edinburgh, U.K
- The Alan Turing Institute, London, U.K
- School of Biological Sciences, University of Edinburgh, Edinburgh, U.K
| |
Collapse
|
27
|
Hara K, Iwano N, Fukunaga T, Hamada M. DeepRaccess: high-speed RNA accessibility prediction using deep learning. FRONTIERS IN BIOINFORMATICS 2023; 3:1275787. [PMID: 37881622 PMCID: PMC10597636 DOI: 10.3389/fbinf.2023.1275787] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2023] [Accepted: 09/29/2023] [Indexed: 10/27/2023] Open
Abstract
RNA accessibility is a useful RNA secondary structural feature for predicting RNA-RNA interactions and translation efficiency in prokaryotes. However, conventional accessibility calculation tools, such as Raccess, are computationally expensive and require considerable computational time to perform transcriptome-scale analysis. In this study, we developed DeepRaccess, which predicts RNA accessibility based on deep learning methods. DeepRaccess was trained to take artificial RNA sequences as input and to predict the accessibility of these sequences as calculated by Raccess. Simulation and empirical dataset analyses showed that the accessibility predicted by DeepRaccess was highly correlated with the accessibility calculated by Raccess. In addition, we confirmed that DeepRaccess could predict protein abundance in E.coli with moderate accuracy from the sequences around the start codon. We also demonstrated that DeepRaccess achieved tens to hundreds of times software speed-up in a GPU environment. The source codes and the trained models of DeepRaccess are freely available at https://github.com/hmdlab/DeepRaccess.
Collapse
Affiliation(s)
- Kaisei Hara
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
- Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, Tokyo, Japan
| | - Natsuki Iwano
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
| | - Tsukasa Fukunaga
- Waseda Institute for Advanced Study, Waseda University, Tokyo, Japan
| | - Michiaki Hamada
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo, Japan
- Computational Bio Big-Data Open Innovation Laboratory, AIST-Waseda University, Tokyo, Japan
- Graduate School of Medicine, Nippon Medical School, Tokyo, Japan
| |
Collapse
|
28
|
Zhang P, Wang H, Xu H, Wei L, Liu L, Hu Z, Wang X. Deep flanking sequence engineering for efficient promoter design using DeepSEED. Nat Commun 2023; 14:6309. [PMID: 37813854 PMCID: PMC10562447 DOI: 10.1038/s41467-023-41899-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2023] [Accepted: 09/20/2023] [Indexed: 10/11/2023] Open
Abstract
Designing promoters with desirable properties is essential in synthetic biology. Human experts are skilled at identifying strong explicit patterns in small samples, while deep learning models excel at detecting implicit weak patterns in large datasets. Biologists have described the sequence patterns of promoters via transcription factor binding sites (TFBSs). However, the flanking sequences of cis-regulatory elements, have long been overlooked and often arbitrarily decided in promoter design. To address this limitation, we introduce DeepSEED, an AI-aided framework that efficiently designs synthetic promoters by combining expert knowledge with deep learning techniques. DeepSEED has demonstrated success in improving the properties of Escherichia coli constitutive, IPTG-inducible, and mammalian cell doxycycline (Dox)-inducible promoters. Furthermore, our results show that DeepSEED captures the implicit features in flanking sequences, such as k-mer frequencies and DNA shape features, which are crucial for determining promoter properties.
Collapse
Affiliation(s)
- Pengcheng Zhang
- Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing, China
| | - Haochen Wang
- Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing, China
| | - Hanwen Xu
- Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing, China
| | - Lei Wei
- Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing, China
| | - Liyang Liu
- Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing, China
| | - Zhirui Hu
- Center for Statistical Science, Tsinghua University, Beijing, China
| | - Xiaowo Wang
- Ministry of Education Key Laboratory of Bioinformatics; Center for Synthetic and Systems Biology; Bioinformatics Division, Beijing National Research Center for Information Science and Technology; Department of Automation, Tsinghua University, Beijing, China.
| |
Collapse
|
29
|
Huang L, Song M, Shen H, Hong H, Gong P, Deng HW, Zhang C. Deep Learning Methods for Omics Data Imputation. BIOLOGY 2023; 12:1313. [PMID: 37887023 PMCID: PMC10604785 DOI: 10.3390/biology12101313] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/08/2023] [Revised: 09/28/2023] [Accepted: 10/02/2023] [Indexed: 10/28/2023]
Abstract
One common problem in omics data analysis is missing values, which can arise due to various reasons, such as poor tissue quality and insufficient sample volumes. Instead of discarding missing values and related data, imputation approaches offer an alternative means of handling missing data. However, the imputation of missing omics data is a non-trivial task. Difficulties mainly come from high dimensionality, non-linear or non-monotonic relationships within features, technical variations introduced by sampling methods, sample heterogeneity, and the non-random missingness mechanism. Several advanced imputation methods, including deep learning-based methods, have been proposed to address these challenges. Due to its capability of modeling complex patterns and relationships in large and high-dimensional datasets, many researchers have adopted deep learning models to impute missing omics data. This review provides a comprehensive overview of the currently available deep learning-based methods for omics imputation from the perspective of deep generative model architectures such as autoencoder, variational autoencoder, generative adversarial networks, and Transformer, with an emphasis on multi-omics data imputation. In addition, this review also discusses the opportunities that deep learning brings and the challenges that it might face in this field.
Collapse
Affiliation(s)
- Lei Huang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
| | - Meng Song
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
| | - Hui Shen
- Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
| | - Ping Gong
- Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS 39180, USA
| | - Hong-Wen Deng
- Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Chaoyang Zhang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
| |
Collapse
|
30
|
Shi M, Li X, Li M, Si Y. Attention-based generative adversarial networks improve prognostic outcome prediction of cancer from multimodal data. Brief Bioinform 2023; 24:bbad329. [PMID: 37756592 DOI: 10.1093/bib/bbad329] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2023] [Revised: 08/20/2023] [Accepted: 08/28/2023] [Indexed: 09/29/2023] Open
Abstract
The prediction of prognostic outcome is critical for the development of efficient cancer therapeutics and potential personalized medicine. However, due to the heterogeneity and diversity of multimodal data of cancer, data integration and feature selection remain a challenge for prognostic outcome prediction. We proposed a deep learning method with generative adversarial network based on sequential channel-spatial attention modules (CSAM-GAN), a multimodal data integration and feature selection approach, for accomplishing prognostic stratification tasks in cancer. Sequential channel-spatial attention modules equipped with an encoder-decoder are applied for the input features of multimodal data to accurately refine selected features. A discriminator network was proposed to make the generator and discriminator learning in an adversarial way to accurately describe the complex heterogeneous information of multiple modal data. We conducted extensive experiments with various feature selection and classification methods and confirmed that the CSAM-GAN via the multilayer deep neural network (DNN) classifier outperformed these baseline methods on two different multimodal data sets with miRNA expression, mRNA expression and histopathological image data: lower-grade glioma and kidney renal clear cell carcinoma. The CSAM-GAN via the multilayer DNN classifier bridges the gap between heterogenous multimodal data and prognostic outcome prediction.
Collapse
Affiliation(s)
- Mingguang Shi
- School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui 230009, China
| | - Xuefeng Li
- School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui 230009, China
| | - Mingna Li
- School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui 230009, China
| | - Yichong Si
- School of Electrical Engineering and Automation, Hefei University of Technology, Hefei, Anhui 230009, China
| |
Collapse
|
31
|
Abstract
Following the widespread use of deep learning for genomics, deep generative modeling is also becoming a viable methodology for the broad field. Deep generative models (DGMs) can learn the complex structure of genomic data and allow researchers to generate novel genomic instances that retain the real characteristics of the original dataset. Aside from data generation, DGMs can also be used for dimensionality reduction by mapping the data space to a latent space, as well as for prediction tasks via exploitation of this learned mapping or supervised/semi-supervised DGM designs. In this review, we briefly introduce generative modeling and two currently prevailing architectures, we present conceptual applications along with notable examples in functional and evolutionary genomics, and we provide our perspective on potential challenges and future directions.
Collapse
Affiliation(s)
- Burak Yelmen
- Laboratoire Interdisciplinaire des Sciences du Numérique, CNRS UMR 9015, INRIA, Université Paris-Saclay, Orsay, France;
- Institute of Genomics, University of Tartu, Tartu, Estonia
| | - Flora Jay
- Laboratoire Interdisciplinaire des Sciences du Numérique, CNRS UMR 9015, INRIA, Université Paris-Saclay, Orsay, France;
| |
Collapse
|
32
|
Gosai SJ, Castro RI, Fuentes N, Butts JC, Kales S, Noche RR, Mouri K, Sabeti PC, Reilly SK, Tewhey R. Machine-guided design of synthetic cell type-specific cis-regulatory elements. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.08.552077. [PMID: 37609287 PMCID: PMC10441439 DOI: 10.1101/2023.08.08.552077] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/24/2023]
Abstract
Cis-regulatory elements (CREs) control gene expression, orchestrating tissue identity, developmental timing, and stimulus responses, which collectively define the thousands of unique cell types in the body. While there is great potential for strategically incorporating CREs in therapeutic or biotechnology applications that require tissue specificity, there is no guarantee that an optimal CRE for an intended purpose has arisen naturally through evolution. Here, we present a platform to engineer and validate synthetic CREs capable of driving gene expression with programmed cell type specificity. We leverage innovations in deep neural network modeling of CRE activity across three cell types, efficient in silico optimization, and massively parallel reporter assays (MPRAs) to design and empirically test thousands of CREs. Through in vitro and in vivo validation, we show that synthetic sequences outperform natural sequences from the human genome in driving cell type-specific expression. Synthetic sequences leverage unique sequence syntax to promote activity in the on-target cell type and simultaneously reduce activity in off-target cells. Together, we provide a generalizable framework to prospectively engineer CREs and demonstrate the required literacy to write regulatory code that is fit-for-purpose in vivo across vertebrates.
Collapse
Affiliation(s)
- SJ Gosai
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Harvard Graduate Program in Biological and Biomedical Science, Boston MA
- Department Of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - RI Castro
- The Jackson Laboratory, Bar Harbor, ME, USA
| | - N Fuentes
- The Jackson Laboratory, Bar Harbor, ME, USA
- Harvard College, Harvard University, Cambridge, MA, USA
| | - JC Butts
- The Jackson Laboratory, Bar Harbor, ME, USA
- Graduate School of Biomedical Sciences and Engineering, University of Maine, Orono, ME, USA
| | - S Kales
- The Jackson Laboratory, Bar Harbor, ME, USA
| | - RR Noche
- Department of Comparative Medicine, Yale School of Medicine, New Haven, CT, USA
- Yale Zebrafish Research Core, Yale School of Medicine, New Haven, CT, USA
| | - K Mouri
- The Jackson Laboratory, Bar Harbor, ME, USA
| | - PC Sabeti
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department Of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
- Howard Hughes Medical Institute, Chevy Chase, MD, USA
| | - SK Reilly
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
- Wu Tsai Institute, Yale University, New Haven, CT, USA
| | - R Tewhey
- The Jackson Laboratory, Bar Harbor, ME, USA
- Graduate School of Biomedical Sciences and Engineering, University of Maine, Orono, ME, USA
- Graduate School of Biomedical Sciences, Tufts University School of Medicine, Boston, MA, USA
| |
Collapse
|
33
|
Penzar D, Nogina D, Noskova E, Zinkevich A, Meshcheryakov G, Lando A, Rafi AM, de Boer C, Kulakovskiy IV. LegNet: a best-in-class deep learning model for short DNA regulatory regions. Bioinformatics 2023; 39:btad457. [PMID: 37490428 PMCID: PMC10400376 DOI: 10.1093/bioinformatics/btad457] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2022] [Revised: 05/28/2023] [Accepted: 07/24/2023] [Indexed: 07/27/2023] Open
Abstract
MOTIVATION The increasing volume of data from high-throughput experiments including parallel reporter assays facilitates the development of complex deep-learning approaches for modeling DNA regulatory grammar. RESULTS Here, we introduce LegNet, an EfficientNetV2-inspired convolutional network for modeling short gene regulatory regions. By approaching the sequence-to-expression regression problem as a soft classification task, LegNet secured first place for the autosome.org team in the DREAM 2022 challenge of predicting gene expression from gigantic parallel reporter assays. Using published data, here, we demonstrate that LegNet outperforms existing models and accurately predicts gene expression per se as well as the effects of single-nucleotide variants. Furthermore, we show how LegNet can be used in a diffusion network manner for the rational design of promoter sequences yielding the desired expression level. AVAILABILITY AND IMPLEMENTATION https://github.com/autosome-ru/LegNet. The GitHub repository includes Jupyter Notebook tutorials and Python scripts under the MIT license to reproduce the results presented in the study.
Collapse
Affiliation(s)
- Dmitry Penzar
- Vavilov Institute of General Genetics, Moscow 119991, Russia
- Institute of Protein Research, Pushchino 142290, Russia
- Institute of Translational Medicine, Pirogov Russian National Research Medical University, Moscow 117997, Russia
| | - Daria Nogina
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow 119991, Russia
| | - Elizaveta Noskova
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow 119991, Russia
| | - Arsenii Zinkevich
- Vavilov Institute of General Genetics, Moscow 119991, Russia
- Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow 119991, Russia
| | | | | | - Abdul Muntakim Rafi
- School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
| | - Carl de Boer
- School of Biomedical Engineering, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
| | - Ivan V Kulakovskiy
- Vavilov Institute of General Genetics, Moscow 119991, Russia
- Institute of Protein Research, Pushchino 142290, Russia
- Laboratory of Regulatory Genomics, Institute of Fundamental Medicine and Biology, Kazan Federal University, Kazan 420008, Russia
| |
Collapse
|
34
|
Avdeyev P, Shi C, Tan Y, Dudnyk K, Zhou J. Dirichlet Diffusion Score Model for Biological Sequence Generation. ARXIV 2023:arXiv:2305.10699v2. [PMID: 37292476 PMCID: PMC10246113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Designing biological sequences is an important challenge that requires satisfying complex constraints and thus is a natural problem to address with deep generative modeling. Diffusion generative models have achieved considerable success in many applications. Score-based generative stochastic differential equations (SDE) model is a continuous-time diffusion model framework that enjoys many benefits, but the originally proposed SDEs are not naturally designed for modeling discrete data. To develop generative SDE models for discrete data such as biological sequences, here we introduce a diffusion process defined in the probability simplex space with stationary distribution being the Dirichlet distribution. This makes diffusion in continuous space natural for modeling discrete data. We refer to this approach as Dirchlet diffusion score model. We demonstrate that this technique can generate samples that satisfy hard constraints using a Sudoku generation task. This generative model can also solve Sudoku, including hard puzzles, without additional training. Finally, we applied this approach to develop the first human promoter DNA sequence design model and showed that designed sequences share similar properties with natural promoter sequences.
Collapse
Affiliation(s)
- Pavel Avdeyev
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, USA
| | - Chenlai Shi
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, USA
| | - Yuhao Tan
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, USA
| | - Kseniia Dudnyk
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, USA
| | - Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, USA
| |
Collapse
|
35
|
Vert JP. How will generative AI disrupt data science in drug discovery? Nat Biotechnol 2023:10.1038/s41587-023-01789-6. [PMID: 37156917 DOI: 10.1038/s41587-023-01789-6] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/10/2023]
|
36
|
Nikolados EM, Oyarzún DA. Deep learning for optimization of protein expression. Curr Opin Biotechnol 2023; 81:102941. [PMID: 37087839 DOI: 10.1016/j.copbio.2023.102941] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2022] [Revised: 02/02/2023] [Accepted: 03/17/2023] [Indexed: 04/25/2023]
Abstract
Recent progress in high-throughput DNA synthesis and sequencing has enabled the development of massively parallel reporter assays for strain characterization. These datasets map a large number of DNA sequences to protein expression levels, sparking increased interest in data-driven methods for sequence-to-expression modeling. Here, we highlight advances in deep learning models of protein expression and their potential for optimizing strains engineered to produce recombinant proteins. We review recent works that built highly accurate models and discuss challenges that hinder adoption by end users. There is a need to better align this technology with the constraints encountered in strain engineering, particularly the cost of acquiring large amounts of data and the requirement for interpretable models that generalize beyond the training data. Overcoming these barriers will help to incentivize academic and industrial laboratories to tap into a new era of data-centric strain engineering.
Collapse
Affiliation(s)
| | - Diego A Oyarzún
- School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JH, UK; School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK; The Alan Turing Institute, London NW1 2DB, UK.
| |
Collapse
|
37
|
Yasmeen E, Wang J, Riaz M, Zhang L, Zuo K. Designing artificial synthetic promoters for accurate, smart, and versatile gene expression in plants. PLANT COMMUNICATIONS 2023:100558. [PMID: 36760129 PMCID: PMC10363483 DOI: 10.1016/j.xplc.2023.100558] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/20/2022] [Revised: 01/30/2023] [Accepted: 02/06/2023] [Indexed: 06/18/2023]
Abstract
With the development of high-throughput biology techniques and artificial intelligence, it has become increasingly feasible to design and construct artificial biological parts, modules, circuits, and even whole systems. To overcome the limitations of native promoters in controlling gene expression, artificial promoter design aims to synthesize short, inducible, and conditionally controlled promoters to coordinate the expression of multiple genes in diverse plant metabolic and signaling pathways. Synthetic promoters are versatile and can drive gene expression accurately with smart responses; they show potential for enhancing desirable traits in crops, thereby improving crop yield, nutritional quality, and food security. This review first illustrates the importance of synthetic promoters, then introduces promoter architecture and thoroughly summarizes advances in synthetic promoter construction. Restrictions to the development of synthetic promoters and future applications of such promoters in synthetic plant biology and crop improvement are also discussed.
Collapse
Affiliation(s)
- Erum Yasmeen
- Single Cell Research Center, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Jin Wang
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Muhammad Riaz
- Single Cell Research Center, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Lida Zhang
- Single Cell Research Center, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Kaijing Zuo
- Single Cell Research Center, School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China.
| |
Collapse
|
38
|
Deep learning in regulatory genomics: from identification to design. Curr Opin Biotechnol 2023; 79:102887. [PMID: 36640453 DOI: 10.1016/j.copbio.2022.102887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2022] [Revised: 11/12/2022] [Accepted: 12/14/2022] [Indexed: 01/14/2023]
Abstract
Genomics and deep learning are a natural match since both are data-driven fields. Regulatory genomics refers to functional noncoding DNA regulating gene expression. In recent years, deep learning applications on regulatory genomics have achieved remarkable advances so-much-so that it has revolutionized the rules of the game of the computational methods in this field. Here, we review two emerging trends: (i) the modeling of very long input sequence (up to 200 kb), which requires self-matched modularization of model architecture; (ii) on the balance of model predictability and model interpretability because the latter is more able to meet biological demands. Finally, we discuss how to employ these two routes to design synthetic regulatory DNA, as a promising strategy for optimizing crop agronomic properties.
Collapse
|
39
|
Accuracy and data efficiency in deep learning models of protein expression. Nat Commun 2022; 13:7755. [PMID: 36517468 PMCID: PMC9751117 DOI: 10.1038/s41467-022-34902-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2022] [Accepted: 11/10/2022] [Indexed: 12/23/2022] Open
Abstract
Synthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain optimization. But such models need large and costly training data that create steep entry barriers for many laboratories. Here we study the relation between accuracy and data efficiency in an atlas of machine learning models trained on datasets of varied size and sequence diversity. We show that deep learning can achieve good prediction accuracy with much smaller datasets than previously thought. We demonstrate that controlled sequence diversity leads to substantial gains in data efficiency and employed Explainable AI to show that convolutional neural networks can finely discriminate between input DNA sequences. Our results provide guidelines for designing genotype-phenotype screens that balance cost and quality of training data, thus helping promote the wider adoption of deep learning in the biotechnology sector.
Collapse
|