101
He Y, Zhou F, Bai J, Gao Y, Huang X, Wang Y. ViTax: adaptive hierarchical viral taxonomy classification with a taxonomy belief tree on a foundation model. Brief Bioinform 2024; 26:bbaf041. [PMID: 39921398 PMCID: PMC11805961 DOI: 10.1093/bib/bbaf041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2024] [Revised: 12/18/2024] [Accepted: 01/20/2025] [Indexed: 02/10/2025] Open
Abstract
Viruses exert a profound influence on both human health and the global ecosystem, yet they remain largely unexplored. Precise taxonomic classification of viral sequences is essential for discovering novel viruses, elucidating their functions, and assessing their implications for public health and environmental monitoring. Traditional taxonomy methods based on genome references are limited by the vast number of unexplored viruses, rapid mutation rates, and high genetic diversity. Additionally, highly imbalanced species distribution and significant variances in inter-species genomic distances across taxonomic units pose challenges to classifier training. Conceptualizing genomic sequences as sentences in a natural language, large language models provide novel approaches for extracting intrinsic viral genome characteristics. In this study, we introduce ViTax, a virus taxonomy classification tool powered by HyenaDNA, a large language foundation model for long-range genomic sequences at single nucleotide resolution. ViTax integrates supervised prototypical contrastive learning to address the highly imbalanced distributions across various taxonomic clades and demonstrates superior performance to current leading methods in virus taxonomy, particularly significant for long sequences. Moreover, ViTax designs a belief mapping tree using the Lowest Common Ancestor algorithm to adaptively assign a sequence to the lowest taxonomy clade with confidence. For the open-set problem, where sequences belong to novel and unexplored genera, ViTax can adaptively assign them to a higher level of known taxonomy with outstanding performance. These capabilities make ViTax a robust tool for advancing the accuracy and reliability of viral taxonomy classification. The code is available at https://github.com/Ying-Lab/ViTax.
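The adaptive assignment described in this abstract can be illustrated with a minimal sketch. It is not the ViTax implementation: the function name `assign_clade`, the 0.8 threshold, the placeholder clade names, and the rule of summing genus probabilities up each lineage are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical lineages, listed from the highest rank down to the genus.
LINEAGES = {
    "GenusA": ["Order1", "Family1", "GenusA"],
    "GenusB": ["Order1", "Family1", "GenusB"],
    "GenusC": ["Order1", "Family2", "GenusC"],
}

def assign_clade(genus_probs, lineages, threshold=0.8):
    """Return the deepest clade whose accumulated belief reaches threshold.

    Belief for a clade is the sum of the probabilities of all genera below it
    (an assumed aggregation rule). Returns None if no clade qualifies.
    """
    belief = defaultdict(float)
    depth = {}
    for genus, prob in genus_probs.items():
        for level, clade in enumerate(lineages[genus]):
            belief[clade] += prob
            depth[clade] = level
    candidates = [c for c, b in belief.items() if b >= threshold]
    if not candidates:
        return None
    # Prefer the lowest (deepest) rank; break ties by belief.
    return max(candidates, key=lambda c: (depth[c], belief[c]))

# Probability is split between two genera of the same family, so the
# sequence is placed at family level rather than forced into one genus.
ambiguous = {"GenusA": 0.5, "GenusB": 0.4, "GenusC": 0.1}
print(assign_clade(ambiguous, LINEAGES))  # Family1
```

When one genus alone exceeds the threshold, the same function returns that genus, which mirrors the idea of assigning to the lowest clade that can be called with confidence.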
Affiliation(s)
- YuShuang He
- Department of Automation, Xiamen University, Xiamen, Fujian 361005, China
- Feng Zhou
- Department of Automation, Xiamen University, Xiamen, Fujian 361005, China
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361005, China
- JiaXing Bai
- Department of Automation, Xiamen University, Xiamen, Fujian 361005, China
- YiChun Gao
- Department of Automation, Xiamen University, Xiamen, Fujian 361005, China
- Xiaobing Huang
- Department of Medical Oncology, Fuzhou First Hospital Affiliated with Fujian Medical University, Fuzhou, Fujian 350108, China
- Ying Wang
- Department of Automation, Xiamen University, Xiamen, Fujian 361005, China
- National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361005, China
- State Key Laboratory of Mariculture Breeding, Xiamen Key Laboratory of Big Data Intelligent Analysis and Decision, Xiamen University, Xiamen, Fujian 350108, China
102
Lu Q, Xu J, Zhang R, Liu H, Wang M, Liu X, Yue Z, Gao Y. RiceSNP-ABST: a deep learning approach to identify abiotic stress-associated single nucleotide polymorphisms in rice. Brief Bioinform 2024; 26:bbae702. [PMID: 39757606 PMCID: PMC11962596 DOI: 10.1093/bib/bbae702] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2024] [Revised: 11/16/2024] [Accepted: 12/23/2024] [Indexed: 01/07/2025] Open
Abstract
Given the adverse effects faced by rice due to abiotic stresses, the precise and rapid identification of single nucleotide polymorphisms (SNPs) associated with abiotic stress traits (ABST-SNPs) in rice is crucial for developing resistant rice varieties. The scarcity of high-quality data related to abiotic stress in rice has hindered the development of computational models and constrained research efforts aimed at rice improvement and breeding. Genome-wide association studies provide a better statistical power to consider ABST-SNPs in rice. Meanwhile, deep learning methods have shown their capability in predicting disease- or phenotype-associated loci, but have primarily focused on human species. Therefore, developing predictive models for identifying ABST-SNPs in rice is both urgent and valuable. In this paper, a model called RiceSNP-ABST is proposed for predicting ABST-SNPs in rice. Firstly, six training datasets were generated using a novel strategy for negative sample construction. Secondly, four feature encoding methods were proposed based on DNA sequence fragments, followed by feature selection. Finally, convolutional neural networks with residual connections were used to determine whether the sequences contained rice ABST-SNPs. RiceSNP-ABST outperformed traditional machine learning and state-of-the-art methods on the benchmark dataset and demonstrated consistent generalization on an independent dataset and cross-species datasets. Notably, multi-granularity causal structure learning was employed to elucidate the relationships among DNA structural features, aiming to identify key genetic variants more effectively. The web-based tool for the RiceSNP-ABST can be accessed at http://rice-snp-abst.aielab.cc.
Affiliation(s)
- Quan Lu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Jiajun Xu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Renyi Zhang
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Hangcheng Liu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Meng Wang
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Xiaoshuang Liu
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Zhenyu Yue
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Yujia Gao
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
103
Wang Y, Kong S, Zhou C, Wang Y, Zhang Y, Fang Y, Li G. A review of deep learning models for the prediction of chromatin interactions with DNA and epigenomic profiles. Brief Bioinform 2024; 26:bbae651. [PMID: 39708837 DOI: 10.1093/bib/bbae651] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2024] [Revised: 10/29/2024] [Accepted: 12/03/2024] [Indexed: 12/23/2024] Open
Abstract
Advances in three-dimensional (3D) genomics have revealed the spatial characteristics of chromatin interactions in gene expression regulation, which is crucial for understanding molecular mechanisms in biological processes. High-throughput technologies like ChIA-PET, Hi-C, and their derivative methods have greatly enhanced our knowledge of 3D chromatin architecture. However, the chromatin interaction mechanisms remain largely unexplored. Deep learning, with its powerful feature extraction and pattern recognition capabilities, offers a promising approach for integrating multi-omics data to build accurate predictive models of chromatin interaction matrices. This review systematically summarizes recent advances in chromatin interaction matrix prediction models. By integrating DNA sequences and epigenetic signals, we investigate the latest developments in these methods. This article details various models, focusing on how one-dimensional (1D) information transforms into the 3D structure of chromatin interactions, and how the integration of different deep learning modules specifically affects model accuracy. Additionally, we discuss the critical role of DNA sequence information and epigenetic markers in shaping 3D genome interaction patterns. Finally, this review addresses the challenges in predicting chromatin interaction matrices, with the aims of improving the precise mapping between chromatin interaction matrices and DNA sequence and supporting the transformation and theoretical development of 3D genomics across biological systems.
Affiliation(s)
- Yunlong Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
- Siyuan Kong
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
- Cong Zhou
- Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Yanfang Wang
- State Key Laboratory of Animal Biotech Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), No. 2 West Yuanmingyuan Rd, Haidian District, Beijing 100193, China
- Yubo Zhang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
- Sequencing Facility, Frederick National Laboratory for Cancer Research, 8560 Progress Drive, Frederick, MD 21701, United States
- Yaping Fang
- Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Guoliang Li
- Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
104
Ding Z, Wei R, Xia J, Mu Y, Wang J, Lin Y. Exploring the potential of large language model-based chatbots in challenges of ribosome profiling data analysis: a review. Brief Bioinform 2024; 26:bbae641. [PMID: 39668339 PMCID: PMC11638007 DOI: 10.1093/bib/bbae641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2024] [Revised: 11/02/2024] [Accepted: 11/27/2024] [Indexed: 12/14/2024] Open
Abstract
Ribosome profiling (Ribo-seq) provides transcriptome-wide insights into protein synthesis dynamics, yet its analysis poses challenges, particularly for nonbioinformatics researchers. Large language model-based chatbots offer promising solutions by leveraging natural language processing. This review explores their convergence, highlighting opportunities for synergy. We discuss challenges in Ribo-seq analysis and how chatbots mitigate them, facilitating scientific discovery. Through case studies, we illustrate chatbots' potential contributions, including data analysis and result interpretation. Despite the absence of applied examples, existing software underscores the value of chatbots and the large language model. We anticipate their pivotal role in future Ribo-seq analysis, overcoming limitations. Challenges such as model bias and data privacy require attention, but emerging trends offer promise. The integration of large language models and Ribo-seq analysis holds immense potential for advancing translational regulation and gene expression understanding.
Affiliation(s)
- Zheyu Ding
- School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Rong Wei
- School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Jianing Xia
- School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Yonghao Mu
- School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Jiahuan Wang
- School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Yingying Lin
- School of Pharmacy, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
- Key Laboratory of Elemene Class Anti-Cancer Chinese Medicines, Engineering Laboratory of Development and Application of Traditional Chinese Medicines, Collaborative Innovation Center of Traditional Chinese Medicines of Zhejiang Province, Hangzhou Normal University, Hangzhou, Zhejiang 311121, China
105
Cho HN, Jun TJ, Kim YH, Kang H, Ahn I, Gwon H, Kim Y, Seo J, Choi H, Kim M, Han J, Kee G, Park S, Ko S. Task-Specific Transformer-Based Language Models in Health Care: Scoping Review. JMIR Med Inform 2024; 12:e49724. [PMID: 39556827 PMCID: PMC11612605 DOI: 10.2196/49724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 07/10/2023] [Accepted: 10/21/2024] [Indexed: 11/20/2024] Open
Abstract
BACKGROUND Transformer-based language models have shown great potential to revolutionize health care by advancing clinical decision support, patient interaction, and disease prediction. However, despite their rapid development, the implementation of transformer-based language models in health care settings remains limited. This is partly due to the lack of a comprehensive review, which hinders a systematic understanding of their applications and limitations. Without clear guidelines and consolidated information, both researchers and physicians face difficulties in using these models effectively, resulting in inefficient research efforts and slow integration into clinical workflows. OBJECTIVE This scoping review addresses this gap by examining studies on medical transformer-based language models and categorizing them into 6 tasks: dialogue generation, question answering, summarization, text classification, sentiment analysis, and named entity recognition. METHODS We conducted a scoping review following the Cochrane scoping review protocol. A comprehensive literature search was performed across databases, including Google Scholar and PubMed, covering publications from January 2017 to September 2024. Studies involving transformer-derived models in medical tasks were included. Data were categorized into 6 key tasks. RESULTS Our key findings revealed both advancements and critical challenges in applying transformer-based models to health care tasks. For example, models like MedPIR involving dialogue generation show promise but face privacy and ethical concerns, while question-answering models like BioBERT improve accuracy but struggle with the complexity of medical terminology. The BioBERTSum summarization model aids clinicians by condensing medical texts but needs better handling of long sequences. 
CONCLUSIONS This review attempted to provide a consolidated understanding of the role of transformer-based language models in health care and to guide future research directions. By addressing current challenges and exploring the potential for real-world applications, we envision significant improvements in health care informatics. Addressing the identified challenges and implementing proposed solutions can enable transformer-based language models to significantly improve health care delivery and patient outcomes. Our review provides valuable insights for future research and practical applications, setting the stage for transformative advancements in medical informatics.
Affiliation(s)
- Ha Na Cho
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Tae Joon Jun
- Big Data Research Center, Asan Institute for Life Sciences, Asan Medical Center, Seoul, Republic of Korea
- Young-Hak Kim
- Division of Cardiology, Department of Information Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Heejun Kang
- Division of Cardiology, Asan Medical Center, Seoul, Republic of Korea
- Imjin Ahn
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Hansle Gwon
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Yunha Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Jiahn Seo
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Heejung Choi
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Minkyoung Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Jiye Han
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Gaeun Kee
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Seohyun Park
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Soyoung Ko
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
106
Nguyen E, Poli M, Durrant MG, Kang B, Katrekar D, Li DB, Bartie LJ, Thomas AW, King SH, Brixi G, Sullivan J, Ng MY, Lewis A, Lou A, Ermon S, Baccus SA, Hernandez-Boussard T, Ré C, Hsu PD, Hie BL. Sequence modeling and design from molecular to genome scale with Evo. Science 2024; 386:eado9336. [PMID: 39541441 PMCID: PMC12057570 DOI: 10.1126/science.ado9336] [Citation(s) in RCA: 34] [Impact Index Per Article: 34.0] [Received: 02/27/2024] [Accepted: 09/09/2024] [Indexed: 11/16/2024]
Abstract
The genome is a sequence that encodes the DNA, RNA, and proteins that orchestrate an organism's function. We present Evo, a long-context genomic foundation model with a frontier architecture trained on millions of prokaryotic and phage genomes, and report scaling laws on DNA to complement observations in language and vision. Evo generalizes across DNA, RNA, and proteins, enabling zero-shot function prediction competitive with domain-specific language models and the generation of functional CRISPR-Cas and transposon systems, representing the first examples of protein-RNA and protein-DNA codesign with a language model. Evo also learns how small mutations affect whole-organism fitness and generates megabase-scale sequences with plausible genomic architecture. These prediction and generation capabilities span molecular to genomic scales of complexity, advancing our understanding and control of biology.
Affiliation(s)
- Eric Nguyen
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, Stanford University, Stanford, CA, USA
- Michael Poli
- Department of Computer Science, Stanford University, Stanford, CA, USA
- TogetherAI, San Francisco, CA, USA
- Brian Kang
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, Stanford University, Stanford, CA, USA
- David B. Li
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, Stanford University, Stanford, CA, USA
- Armin W. Thomas
- Stanford Data Science, Stanford University, Stanford, CA, USA
- Samuel H. King
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering, Stanford University, Stanford, CA, USA
- Garyk Brixi
- Arc Institute, Palo Alto, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
- Madelena Y. Ng
- Stanford Center for Biomedical Informatics Research, Stanford, CA, USA
- Ashley Lewis
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Aaron Lou
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Stefano Ermon
- Department of Computer Science, Stanford University, Stanford, CA, USA
- CZ Biohub, San Francisco, CA, USA
- Christopher Ré
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Patrick D. Hsu
- Arc Institute, Palo Alto, CA, USA
- Department of Bioengineering and Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Brian L. Hie
- Arc Institute, Palo Alto, CA, USA
- Stanford Data Science, Stanford University, Stanford, CA, USA
- Department of Chemical Engineering, Stanford University, Stanford, CA, USA
107
Theodoris CV. Learning the language of DNA. Science 2024; 386:729-730. [PMID: 39541478 DOI: 10.1126/science.adt3007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024]
Abstract
A genomic foundation model broadly enables sequence modeling, prediction, and design.
Affiliation(s)
- Christina V Theodoris
- Gladstone Institute of Cardiovascular Disease, San Francisco, CA, USA
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA
- Department of Pediatrics, Institute for Human Genetics, and Cardiovascular Research Institute, University of California, San Francisco, San Francisco, CA, USA
108
Cao C, Wang C, Dai Q, Zou Q, Wang T. CRBPSA: CircRNA-RBP interaction sites identification using sequence structural attention model. BMC Biol 2024; 22:260. [PMID: 39543602 PMCID: PMC11566611 DOI: 10.1186/s12915-024-02055-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2024] [Accepted: 10/30/2024] [Indexed: 11/17/2024] Open
Abstract
BACKGROUND Due to the ability of circRNA to bind with corresponding RBPs and play a critical role in gene regulation and disease prevention, numerous identification algorithms have been developed. Nevertheless, most of the current mainstream methods primarily capture one-dimensional sequence features through various descriptors, while neglecting the effective extraction of secondary structure features. Moreover, as the number of introduced descriptors increases, the issues of sparsity and ineffective representation also rise, causing a significant burden on computational models and leaving room for improvement in predictive performance. RESULTS Based on this, we focused on capturing the features of secondary structure in sequences and developed a new architecture called CRBPSA, which is based on a sequence-structure attention mechanism. Firstly, a base-pairing matrix is generated by calculating the matching probability between each base, with a Gaussian function introduced as a weight to construct the secondary structure. Then, a Structure_Transformer is employed to extract base-pairing information and spatial positional dependencies, enabling the identification of binding sites through deeper feature extraction. Experimental results using the same set of hyperparameters on 37 circRNA datasets, totaling 671,952 samples, show that the CRBPSA algorithm achieves an average AUC of 99.93%, surpassing all existing prediction methods. CONCLUSIONS CRBPSA is a lightweight and efficient prediction tool for circRNA-RBP, which can capture structural features of sequences with minimal computational resources and accurately predict protein-binding sites. This tool facilitates a deeper understanding of the biological processes and mechanisms underlying circRNA and protein interactions.
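The base-pairing matrix mentioned in this abstract can be pictured with a small sketch. It follows one scheme used in some RNA secondary-structure predictors, in which each candidate pair is scored together with its stacked neighbours, down-weighted by a Gaussian of the stacking offset. The pair scores (A-U 2.0, G-C 3.0, G-U 0.8), the Gaussian form, and the names `pair_score`, `gaussian`, and `pairing_matrix` are assumptions for illustration and may differ from what CRBPSA actually computes.

```python
import math

# Assumed scores for canonical and wobble pairs; all other pairs score 0.
PAIR_SCORE = {frozenset("AU"): 2.0, frozenset("GC"): 3.0, frozenset("GU"): 0.8}

def pair_score(a, b):
    """Score for bases a and b forming a pair (0.0 if they cannot pair)."""
    return PAIR_SCORE.get(frozenset((a, b)), 0.0)

def gaussian(t):
    """Weight for a stacked neighbour t positions away from the pair."""
    return math.exp(-0.5 * t * t)

def pairing_matrix(seq, max_stack=30):
    """Symmetric matrix of Gaussian-weighted pairing scores for an RNA string."""
    n = len(seq)
    mat = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = pair_score(seq[i], seq[j])
            if total == 0.0:
                continue
            # step = -1 walks outward (i-t, j+t); step = +1 walks inward (i+t, j-t).
            for step in (-1, 1):
                for t in range(1, max_stack):
                    a, b = i + step * t, j - step * t
                    if a < 0 or b >= n or a >= b:
                        break
                    s = pair_score(seq[a], seq[b])
                    if s == 0.0:
                        break  # the run of stacked pairs ends here
                    total += s * gaussian(t)
            mat[i][j] = mat[j][i] = total
    return mat

matrix = pairing_matrix("GGAACC")
```

In this toy sequence the pair between positions 0 and 5 is reinforced by the stacked pair at positions 1 and 4, so it scores higher than the isolated pair between positions 0 and 4.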
Affiliation(s)
- Chao Cao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Chunyu Wang
- Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang, China
- Qi Dai
- College of Life Science and Medicine, Zhejiang Sci-Tech University, Hangzhou, China
- Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Tao Wang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
109
Kumar A, Dixit S, Srinivasan K, M D, Vincent PMDR. Personalized cancer vaccine design using AI-powered technologies. Front Immunol 2024; 15:1357217. [PMID: 39582860 PMCID: PMC11581883 DOI: 10.3389/fimmu.2024.1357217] [Citation(s) in RCA: 12] [Impact Index Per Article: 12.0] [Received: 12/17/2023] [Accepted: 09/24/2024] [Indexed: 11/26/2024] Open
Abstract
Immunotherapy has ushered in a new era of cancer treatment, yet cancer remains a leading cause of global mortality. Among various therapeutic strategies, cancer vaccines have shown promise by activating the immune system to specifically target cancer cells. While current cancer vaccines are primarily prophylactic, advancements in targeting tumor-associated antigens (TAAs) and neoantigens have paved the way for therapeutic vaccines. The integration of artificial intelligence (AI) into cancer vaccine development is revolutionizing the field by enhancing various aspects of design and delivery. This review explores how AI facilitates precise epitope design, optimizes mRNA and DNA vaccine instructions, and enables personalized vaccine strategies by predicting patient responses. By utilizing AI technologies, researchers can navigate complex biological datasets and uncover novel therapeutic targets, thereby improving the precision and efficacy of cancer vaccines. Despite the promise of AI-powered cancer vaccines, significant challenges remain, such as tumor heterogeneity and genetic variability, which can limit the effectiveness of neoantigen prediction. Moreover, ethical and regulatory concerns surrounding data privacy and algorithmic bias must be addressed to ensure responsible AI deployment. The future of cancer vaccine development lies in the seamless integration of AI to create personalized immunotherapies that offer targeted and effective cancer treatments. This review underscores the importance of interdisciplinary collaboration and innovation in overcoming these challenges and advancing cancer vaccine development.
Affiliation(s)
- Anant Kumar
- School of Bioscience and Technology, Vellore Institute of Technology, Vellore, India
- Shriniket Dixit
- School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
- Kathiravan Srinivasan
- School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India
- Dinakaran M
- School of Computer Science and Engineering, Vellore Institute of Technology, Chennai, India
- P. M. Durai Raj Vincent
- School of Computer Science Engineering and Information Systems, Vellore Institute of Technology, Vellore, India
|
110
|
Liu J, Shen H, Yang Y, Yang M, Zhang Q, Chen K, Li X. Transformer-based representation learning and multiple-instance learning for cancer diagnosis exclusively from raw sequencing fragments of bisulfite-treated plasma cell-free DNA. Mol Oncol 2024; 18:2755-2769. [PMID: 39380154 PMCID: PMC11547222 DOI: 10.1002/1878-0261.13745] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 07/31/2024] [Accepted: 09/24/2024] [Indexed: 10/10/2024] Open
Abstract
Early cancer diagnosis from bisulfite-treated cell-free DNA (cfDNA) fragments requires tedious data analytical procedures. Here, we present a deep-learning-based approach for early cancer interception and diagnosis (DECIDIA) that can achieve accurate cancer diagnosis exclusively from bisulfite-treated cfDNA sequencing fragments. DECIDIA relies on transformer-based representation learning of DNA fragments and weakly supervised multiple-instance learning for classification. We systematically evaluate the performance of DECIDIA for cancer diagnosis and cancer type prediction on a curated dataset of 5389 samples that consist of colorectal cancer (CRC; n = 1574), hepatocellular carcinoma (HCC; n = 1181), lung cancer (n = 654), and non-cancer control (n = 1980). DECIDIA achieved an area under the receiver operating characteristic curve (AUROC) of 0.980 (95% CI, 0.976-0.984) in 10-fold cross-validation settings on the CRC dataset by differentiating cancer patients from cancer-free controls, outperforming benchmarked methods that are based on methylation intensities. Notably, DECIDIA achieved an AUROC of 0.910 (95% CI, 0.896-0.924) on the externally independent HCC testing set in distinguishing HCC patients from cancer-free controls, although no HCC data were used in model development. In the settings of cancer-type classification, we observed that DECIDIA achieved a micro-average AUROC of 0.963 (95% CI, 0.960-0.966) and an overall accuracy of 82.8% (95% CI, 81.8-83.9). In addition, we distilled four sequence signatures from the raw sequencing reads that exhibited differential patterns in cancer versus control and among different cancer types. Our approach represents a new paradigm towards eliminating the tedious data analytical procedures for liquid biopsy that uses bisulfite-treated cfDNA methylome.
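The abstract above does not say how DECIDIA aggregates fragment-level information into a sample-level call. As a rough illustration of weakly supervised multiple-instance learning in general, the sketch below pools per-fragment scores into one bag-level score with softmax attention weights. The function name `attention_pool` and its inputs are hypothetical and are not taken from the paper.

```python
import math

def attention_pool(instance_scores, attention_logits):
    """Combine per-fragment scores into one sample-level score.

    instance_scores: per-fragment cancer scores (assumed inputs).
    attention_logits: unnormalized importance of each fragment (assumed inputs).
    Returns (bag_score, weights), where weights sum to 1.
    """
    if len(instance_scores) != len(attention_logits) or not instance_scores:
        raise ValueError("need equal-length, non-empty inputs")
    # Numerically stable softmax over the attention logits.
    peak = max(attention_logits)
    exps = [math.exp(x - peak) for x in attention_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Bag-level score is the attention-weighted average of fragment scores.
    bag_score = sum(w * s for w, s in zip(weights, instance_scores))
    return bag_score, weights
```

With equal logits the pooling reduces to a plain mean; a fragment with a much larger logit dominates the sample-level score.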
Affiliation(s)
- Jilei Liu
- Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
| | - Hongru Shen
- Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
| | - Yichen Yang
- Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
| | - Meng Yang
- Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
| | - Qiang Zhang
- Department of Maxillofacial and Otorhinolaryngology Oncology, Tianjin's Clinical Research Center for Cancer, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
| | - Kexin Chen
- Department of Epidemiology and Biostatistics, Key Laboratory of Molecular Cancer Epidemiology of Tianjin, Tianjin's Clinical Research Center for Cancer, Key Laboratory of Prevention and Control of Major Diseases in the Population Ministry of Education, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
| | - Xiangchun Li
- Tianjin Cancer Institute, Tianjin's Clinical Research Center for Cancer, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin Medical University Cancer Institute and Hospital, Tianjin Medical University, China
|
111
|
Romeijn L, Bernatavicius A, Vu D. MycoAI: Fast and accurate taxonomic classification for fungal ITS sequences. Mol Ecol Resour 2024; 24:e14006. [PMID: 39152642 DOI: 10.1111/1755-0998.14006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2024] [Revised: 07/12/2024] [Accepted: 08/06/2024] [Indexed: 08/19/2024]
Abstract
Efficient and accurate classification of DNA barcode data is crucial for large-scale fungal biodiversity studies. However, existing methods are either computationally expensive or lack accuracy. Previous research has demonstrated the potential of deep learning in this domain, successfully training neural networks for biological sequence classification. We introduce the MycoAI Python package, featuring various deep learning models such as BERT and CNN tailored for fungal Internal Transcribed Spacer (ITS) sequences. We explore different neural architecture designs and encoding methods to identify optimal models. By employing a multi-head output architecture and multi-level hierarchical label smoothing, MycoAI effectively generalizes across the taxonomic hierarchy. Using over 5 million labelled sequences from the UNITE database, we develop two models: MycoAI-BERT and MycoAI-CNN. While we emphasize the necessity of verifying classification results by AI models due to insufficient reference data, MycoAI still exhibits substantial potential. When benchmarked against existing classifiers such as DNABarcoder and RDP on two independent test sets with labels present in the training dataset, MycoAI models demonstrate high accuracy at the genus and higher taxonomic levels, with MycoAI-CNN being the fastest and most accurate. In terms of efficiency, MycoAI models can classify over 300,000 sequences within 5 min. We publicly release the MycoAI models, enabling mycologists to classify their ITS barcode data efficiently. Additionally, MycoAI serves as a platform for developing further deep learning-based classification methods. The source code for MycoAI is available under the MIT Licence at https://github.com/MycoAI/MycoAI.
Affiliation(s)
- Luuk Romeijn
- Leiden Institute of Advanced Computer Science, Leiden University, Leiden, Netherlands
| | - Andrius Bernatavicius
- Leiden Institute of Advanced Computer Science, Leiden University, Leiden, Netherlands
- Leiden Academic Centre for Drug Research, Leiden University, Leiden, Netherlands
| | - Duong Vu
- Westerdijk Fungal Biodiversity Institute, Utrecht, Netherlands
|
112
|
Song T, Song H, Pan Z, Gao Y, Dai H, Wang X. DeepDualEnhancer: A Dual-Feature Input DNABert Based Deep Learning Method for Enhancer Recognition. Int J Mol Sci 2024; 25:11744. [PMID: 39519295 PMCID: PMC11546905 DOI: 10.3390/ijms252111744] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2024] [Revised: 10/23/2024] [Accepted: 10/28/2024] [Indexed: 11/16/2024] Open
Abstract
Enhancers are cis-regulatory DNA sequences that are widely distributed throughout the genome. They can precisely regulate the expression of target genes. Since the features of enhancer segments are difficult to detect, we propose DeepDualEnhancer, a DNABert-based method that uses a multi-scale convolutional neural network and a BiLSTM for enhancer identification. We first designed the DeepDualEnhancer method based only on the DNA sequence input. It mainly consists of a multi-scale convolutional neural network and a BiLSTM, which extract features from the DNABert representation and the embedding, respectively. Meanwhile, we collected new datasets from the enhancer-promoter interaction field and designed the method DeepDualEnhancer-genomic, which takes both DNA sequences and genomic signals as input and incorporates transformer sequence attention. Extensive comparisons of our method with 20 other excellent methods through 5-fold cross-validation, ablation experiments, and an independent test demonstrated that DeepDualEnhancer achieves the best performance. We also found that including genomic signals improves enhancer recognition.
Affiliation(s)
| | - Xun Wang
- Qingdao Institute of Software, College of Computer Science and Technology, China University of Petroleum, Qingdao 266555, China; (T.S.); (H.S.); (Z.P.); (Y.G.); (H.D.)
|
113
|
Xu R, Li D, Yang W, Wang G, Li Y. Improving ncRNA family prediction using multi-modal contrastive learning of sequence and structure. Bioinformatics 2024; 40:btae640. [PMID: 39460948 PMCID: PMC11639665 DOI: 10.1093/bioinformatics/btae640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Revised: 10/15/2024] [Accepted: 10/22/2024] [Indexed: 10/28/2024] Open
Abstract
MOTIVATION: Recent advancements in high-throughput sequencing technology have significantly increased the focus on non-coding RNA (ncRNA) research within the life sciences. Despite this, the functions of many ncRNAs remain poorly understood. Research suggests that ncRNAs within the same family typically share similar functions, underlining the importance of understanding their roles. There are two primary methods for predicting ncRNA families: biological and computational. Traditional biological methods are not suitable for large-scale data prediction due to the significant human and resource requirements. Concurrently, most existing computational methods either rely solely on ncRNA sequence data or are exclusively based on the secondary structure of ncRNA molecules. These methods fail to fully utilize the rich multimodal information available from ncRNAs, thereby preventing them from learning more comprehensive and in-depth feature representations. RESULTS: To tackle these problems, we proposed MM-ncRNAFP, a multi-modal contrastive learning framework for ncRNA family prediction. We first used a pre-trained language model to encode the primary sequences of a large mammalian ncRNA dataset. Then, we adopted a contrastive learning framework with an attention mechanism to fuse the secondary structure information obtained by graph neural networks. The MM-ncRNAFP method can effectively fuse multi-modal information. Experimental comparisons with several competitive baselines demonstrated that MM-ncRNAFP can achieve more comprehensive representations of ncRNA features by integrating both sequence and structural information. This integration significantly enhances the performance of ncRNA family prediction. Ablation experiments and qualitative analyses were performed to verify the effectiveness of each component in our model. Moreover, since our model is pre-trained on a large amount of ncRNA data, it has the potential to bring significant improvements to other ncRNA-related tasks. AVAILABILITY AND IMPLEMENTATION: MM-ncRNAFP and the datasets are available at https://github.com/xuruiting2/MM-ncRNAFP.
Affiliation(s)
- Ruiting Xu
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
| | - Dan Li
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
| | - Wen Yang
- International Medical Center, Shenzhen University General Hospital, SZU 518055, China
| | - Guohua Wang
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
| | - Yang Li
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
|
114
|
Li C, Wang H, Wen Y, Yin R, Zeng X, Li K. GenoM7GNet: An Efficient N7-Methylguanosine Site Prediction Approach Based on a Nucleotide Language Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2024; 21:2258-2268. [PMID: 39302806 DOI: 10.1109/tcbb.2024.3459870] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2024]
Abstract
N7-methylguanosine (m7G), one of the mainstream post-transcriptional RNA modifications, occupies an exceedingly significant place in medical treatments. However, classic approaches for identifying m7G sites are costly in both time and equipment. Meanwhile, the existing machine learning methods extract limited hidden information from RNA sequences, thus making it difficult to improve the accuracy. Therefore, we put forward a deep learning network, called "GenoM7GNet," for m7G site identification. This model utilizes a Bidirectional Encoder Representation from Transformers (BERT) and is pretrained on nucleotide sequence data to capture hidden patterns from RNA sequences for m7G site prediction. Moreover, through detailed comparative experiments with various deep learning models, we discovered that the one-dimensional convolutional neural network (CNN) exhibits outstanding performance in sequence feature learning and classification. The proposed GenoM7GNet model achieved 0.953 in accuracy, 0.932 in sensitivity, 0.976 in specificity, 0.907 in Matthews Correlation Coefficient and 0.984 in Area Under the receiver operating characteristic Curve on performance evaluation. Extensive experimental results further prove that our GenoM7GNet model markedly surpasses other state-of-the-art models in predicting m7G sites, exhibiting high computing performance.
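For readers unfamiliar with the threshold-based metrics reported above, the following sketch shows how accuracy, sensitivity, specificity and the Matthews Correlation Coefficient are computed from a binary confusion matrix. These are the standard definitions; the function name `binary_metrics` and the counts in the usage note are illustrative and are not the paper's data.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return standard binary classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # Sensitivity: fraction of true sites that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-sites correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }
```

For example, `binary_metrics(tp=90, fp=20, tn=80, fn=10)` gives an accuracy of 0.85, a sensitivity of 0.90 and a specificity of 0.80. AUROC is not covered here because it requires ranked prediction scores rather than a single confusion matrix.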
|
115
|
Yu Z, Zhang Y. Foundation model for comprehensive transcriptional regulation analysis. Natl Sci Rev 2024; 11:nwae355. [PMID: 39555104 PMCID: PMC11565239 DOI: 10.1093/nsr/nwae355] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2024] [Revised: 09/22/2024] [Accepted: 10/11/2024] [Indexed: 11/19/2024] Open
Affiliation(s)
- Zhaowei Yu
- State Key Laboratory of Cardiovascular Diseases and Medical Innovation Center, Institute for Regenerative Medicine, Department of Neurosurgery, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, China
| | - Yong Zhang
- State Key Laboratory of Cardiovascular Diseases and Medical Innovation Center, Institute for Regenerative Medicine, Department of Neurosurgery, Shanghai East Hospital, Shanghai Key Laboratory of Signaling and Disease Research, Frontier Science Center for Stem Cell Research, School of Life Sciences and Technology, Tongji University, China
|
116
|
Yu X, Yani C, Wang Z, Long H, Zeng R, Liu X, Anas B, Ren J. iDNA-ITLM: An interpretable and transferable learning model for identifying DNA methylation. PLoS One 2024; 19:e0301791. [PMID: 39480834 PMCID: PMC11527195 DOI: 10.1371/journal.pone.0301791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2023] [Accepted: 03/20/2024] [Indexed: 11/02/2024] Open
Abstract
In this study, from the perspective of image processing, we propose the iDNA-ITLM model for identifying DNA methylation sites. It uses a novel data enhancement strategy that continuously self-replicates a short DNA sequence into a longer DNA sequence and then embeds it into a high-dimensional matrix to enlarge the receptive field. Our model consistently outperforms the current state-of-the-art sequence-based DNA methylation site recognition methods when evaluated on 17 benchmark datasets that cover multiple species and include three DNA methylation modifications (4mC, 5hmC, and 6mA). The experimental results demonstrate the robustness and superior performance of our model across these datasets. In addition, our model transfers to RNA methylation sequences and produces good results without modifying the hyperparameters in the model. The proposed iDNA-ITLM model can be considered a universal predictor across DNA and RNA methylation species.
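The self-replication step described above can be pictured with a small sketch. It repeats a short sequence cyclically until a target length is reached and then one-hot encodes the result into a matrix. The target length, the one-hot encoding and both function names are assumptions made for illustration; the abstract does not specify how iDNA-ITLM builds its high-dimensional matrix.

```python
from itertools import cycle, islice

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def self_replicate(seq, target_len):
    """Repeat `seq` cyclically and truncate to exactly `target_len` bases."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    return "".join(islice(cycle(seq), target_len))

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T).

    Unknown symbols such as N are encoded as an all-zero row.
    """
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        if base in BASE_INDEX:
            row[BASE_INDEX[base]] = 1
        rows.append(row)
    return rows

# A 4-base fragment stretched to 10 bases, then encoded as a 10 x 4 matrix.
extended = self_replicate("ACGT", 10)
matrix = one_hot(extended)
```

Because the extended sequence is periodic, a fixed-width convolution over the resulting matrix sees every position of the original fragment in several local contexts, which is one plausible reading of "enlarging the receptive field".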
Affiliation(s)
- Xia Yu
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Cui Yani
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
| | - Zhichao Wang
- Unit 32033, The People’s Liberation Army, Beijing, China
| | - Haixia Long
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Rao Zeng
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Xiling Liu
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Bilal Anas
- Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, Hainan, China
| | - Jia Ren
- School of Information and Communication Engineering, Hainan University, Haikou, Hainan, China
|
117
|
Jyoti, Ritu, Gupta S, Shankar R. Comprehensive analysis of computational approaches in plant transcription factors binding regions discovery. Heliyon 2024; 10:e39140. [PMID: 39640721 PMCID: PMC11620080 DOI: 10.1016/j.heliyon.2024.e39140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/11/2024] [Revised: 08/23/2024] [Accepted: 10/08/2024] [Indexed: 12/07/2024] Open
Abstract
Transcription factors (TFs) are regulatory proteins that bind to specific DNA regions, known as transcription factor binding regions (TFBRs), to regulate the rate of transcription. The identification of TFBRs has been made possible by a number of experimental and computational techniques established during the past few years. The process of TFBR identification involves peak identification in the binding data, followed by the identification of motif characteristics. Using the same binding data, attempts have been made to build computational models that identify such binding regions, which could save the time and resources spent on binding experiments. These computational approaches depend heavily on what they learn and how they learn it. Existing approaches are skewed heavily towards human TFBR discovery, while plants have a drastically different genomic setup for regulation, which these approaches have largely ignored. Here, we provide a comprehensive study of the current state of plant-specific TF discovery algorithms. While doing so, we encountered several issues in software tools that rendered them unusable for researchers. We fixed them and have provided the corrected scripts for such tools. We expect this study to serve as a guide for better understanding of software tools' approaches to plant-specific TFBR discovery and the care to be taken while applying them, especially during cross-species applications. The corrected scripts of these software tools are made available at https://github.com/SCBB-LAB/Comparative-analysis-of-plant-TFBS-software.
Affiliation(s)
- Jyoti
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
| | - Ritu
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
| | - Sagar Gupta
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
| | - Ravi Shankar
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
|
118
|
Shao B, Yan J. A long-context language model for deciphering and generating bacteriophage genomes. Nat Commun 2024; 15:9392. [PMID: 39477977 PMCID: PMC11525655 DOI: 10.1038/s41467-024-53759-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2024] [Accepted: 10/22/2024] [Indexed: 11/02/2024] Open
Abstract
Inspired by the success of large language models (LLMs), we develop a long-context generative model for genomes. Our multiscale transformer model, megaDNA, is pre-trained on unannotated bacteriophage genomes with nucleotide-level tokenization. We demonstrate the foundational capabilities of our model including the prediction of essential genes, genetic variant effects, regulatory element activity and taxonomy of unannotated sequences. Furthermore, it generates de novo sequences up to 96 K base pairs, which contain potential regulatory elements and annotated proteins with phage-related functions.
Affiliation(s)
- Bin Shao
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, 100081, China.
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, 02138, USA.
| | - Jiawei Yan
- Independent researcher, 100 N Gushan Rd, Shanghai, 200135, China
|
119
|
Nahali S, Safari L, Khanteymoori A, Huang J. StructmRNA a BERT based model with dual level and conditional masking for mRNA representation. Sci Rep 2024; 14:26043. [PMID: 39472486 PMCID: PMC11522565 DOI: 10.1038/s41598-024-77172-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/22/2024] [Accepted: 10/21/2024] [Indexed: 11/02/2024] Open
Abstract
In this study, we introduce StructmRNA, a new BERT-based model that was designed for the detailed analysis of mRNA sequences and structures. The success of DNABERT in understanding the intricate language of non-coding DNA with bidirectional encoder representations is extended to mRNA with StructmRNA. This new model uses a special dual-level masking technique that covers both sequence and structure, along with conditional masking. This enables StructmRNA to adeptly generate meaningful embeddings for mRNA sequences, even in the absence of explicit structural data, by capitalizing on the intricate sequence-structure correlations learned during extensive pre-training on vast datasets. Compared to well-known models like those in the Stanford OpenVaccine project, StructmRNA performs better in important tasks such as predicting RNA degradation. Thus, StructmRNA can inform better RNA-based treatments by predicting the secondary structures and biological functions of unseen mRNA sequences. The proficiency of this model is further confirmed by rigorous evaluations, revealing its unprecedented ability to generalize across various organisms and conditions, thereby marking a significant advance in the predictive analysis of mRNA for therapeutic design. With this work, we aim to set a new standard for mRNA analysis, contributing to the broader field of genomics and therapeutic development.
Affiliation(s)
- Sepideh Nahali
- Information Retrieval and Knowledge Management Research Lab, York University, Toronto, Ontario, Canada.
- Department of Computer Engineering, University of Zanjan, Zanjan, Iran.
| | - Leila Safari
- Department of Computer Engineering, University of Zanjan, Zanjan, Iran
| | - Jimmy Huang
- Information Retrieval and Knowledge Management Research Lab, York University, Toronto, Ontario, Canada
|
120
|
Kabir A, Bhattarai M, Peterson S, Najman-Licht Y, Rasmussen K, Shehu A, Bishop A, Alexandrov B, Usheva A. DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors. Nucleic Acids Res 2024; 52:e91. [PMID: 39271116 PMCID: PMC11514457 DOI: 10.1093/nar/gkae783] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2024] [Revised: 08/21/2024] [Accepted: 08/29/2024] [Indexed: 09/15/2024] Open
Abstract
It was previously shown that DNA breathing, thermodynamic stability, transcriptional activity and transcription factor (TF) binding are functionally correlated. To ascertain the precise relationship between TF binding and DNA breathing, we developed the multi-modal deep learning model EPBDxDNABERT-2, which is based on the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA dynamics model. To train EPBDxDNABERT-2, we used chromatin immunoprecipitation sequencing (ChIP-seq) data comprising 690 ChIP-seq experimental results encompassing 161 distinct TFs and 91 human cell types. EPBDxDNABERT-2 significantly improves the prediction of over 660 TF-DNA, with an increase in the area under the receiver operating characteristic (AUROC) metric of up to 9.6% when compared to the baseline model that does not leverage DNA biophysical properties. We expanded our analysis to an in vitro high-throughput Systematic Evolution of Ligands by Exponential enrichment (HT-SELEX) dataset of 215 TFs from 27 families, comparing EPBD with established frameworks. The integration of the DNA breathing features with the DNABERT-2 foundational model greatly enhanced TF-binding predictions. Notably, EPBDxDNABERT-2, trained on large-scale multi-species genomes with a cross-attention mechanism, improved predictive power, shedding light on the mechanisms underlying disease-related non-coding variants discovered in genome-wide association studies.
Affiliation(s)
- Anowarul Kabir
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
- Department of Computer Science, George Mason University, 4400 University Dr, 22030 VA, USA
| | - Manish Bhattarai
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
| | - Selma Peterson
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
| | - Kim Ø Rasmussen
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
| | - Amarda Shehu
- Department of Computer Science, George Mason University, 4400 University Dr, 22030 VA, USA
| | - Alan R Bishop
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
| | - Boian Alexandrov
- Theoretical Division, Los Alamos National Laboratory, Los Alamos, 87544 NM, USA
| | - Anny Usheva
- Department of Surgery, Brown University, 69 Brown St Box 1822, 02912 RI, USA
|
121
|
Zhao H, Song G. Antiviral Peptide-Generative Pre-Trained Transformer (AVP-GPT): A Deep Learning-Powered Model for Antiviral Peptide Design with High-Throughput Discovery and Exceptional Potency. Viruses 2024; 16:1673. [PMID: 39599788 PMCID: PMC11599114 DOI: 10.3390/v16111673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2024] [Revised: 10/20/2024] [Accepted: 10/23/2024] [Indexed: 11/29/2024] Open
Abstract
Traditional antiviral peptide (AVP) discovery is a time-consuming and expensive process. This study introduces AVP-GPT, a novel deep learning method utilizing transformer-based language models and multimodal architectures specifically designed for AVP design. AVP-GPT demonstrated exceptional efficiency, generating 10,000 unique peptides and identifying potential AVPs within two days on a GPU system. Pre-trained on a respiratory syncytial virus (RSV) dataset, AVP-GPT successfully adapted to influenza A virus (INFVA) and other respiratory viruses. Compared to state-of-the-art models like LSTM and SVM, AVP-GPT achieved significantly lower perplexity (2.09 vs. 16.13) and higher AUC (0.90 vs. 0.82), indicating superior peptide sequence prediction and AVP classification. AVP-GPT generated a diverse set of peptides with excellent novelty and identified candidates with remarkably higher antiviral success rates than conventional design methods. Notably, AVP-GPT generated novel peptides against RSV and INFVA with exceptional potency, including four peptides exhibiting EC50 values around 0.02 µM, the strongest anti-RSV activity reported to date. These findings highlight AVP-GPT's potential to revolutionize AVP discovery and development, accelerating the creation of novel antiviral drugs. Future studies could explore the application of AVP-GPT to other viral targets and investigate alternative AVP design strategies.
Affiliation(s)
| | - Gengshen Song
- Beijing Youcare Kechuang Pharmaceutical Technology Co., Ltd., Beijing 100176, China
|
122
|
La Fleur A, Shi Y, Seelig G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev 2024; 38:843-865. [PMID: 39362779 PMCID: PMC11535156 DOI: 10.1101/gad.351800.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2024]
Abstract
Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set. Models can provide a quantitative understanding of cis-regulatory codes controlling gene expression, enable variant stratification, and guide the design of synthetic regulatory elements for applications from synthetic biology to mRNA and gene therapy. This review focuses on cis-regulatory MPRAs, particularly those that interrogate cotranscriptional and post-transcriptional processes: alternative splicing, cleavage and polyadenylation, translation, and mRNA decay.
Affiliation(s)
- Alyssa La Fleur
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
- Yongsheng Shi
  - Department of Microbiology and Molecular Genetics, School of Medicine, University of California, Irvine, Irvine, California 92697, USA
- Georg Seelig
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
  - Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, USA
123
Chiliński M, Plewczynski D. HiCDiffusion - diffusion-enhanced, transformer-based prediction of chromatin interactions from DNA sequences. BMC Genomics 2024; 25:964. [PMID: 39407104 PMCID: PMC11481779 DOI: 10.1186/s12864-024-10885-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2024] [Accepted: 10/09/2024] [Indexed: 10/19/2024] Open
Abstract
Prediction of chromatin interactions from DNA sequence has been a significant research challenge in the last couple of years. Several solutions have been proposed, most of which are based on an encoder-decoder architecture, where the 1D sequence is convolved, encoded into a latent representation, and then decoded using 2D convolutions into the Hi-C pairwise chromatin spatial proximity matrix. Those methods, while obtaining high correlation scores and improved metrics, produce Hi-C matrices that are artificial: they are blurred due to the deep learning model architecture. In our study, we propose HiCDiffusion, a sequence-only model that addresses this problem. We first train the encoder-decoder neural network and then use it as a component of the diffusion model, where we guide the diffusion using a latent representation of the sequence, as well as the final output from the encoder-decoder. That way, we obtain high-resolution Hi-C matrices that not only better resemble the experimental results, improving the Fréchet inception distance by an average of 11 times (with the highest improvement of 56 times), but also obtain classic metrics similar to those of current state-of-the-art encoder-decoder architectures used for the task.
Affiliation(s)
- Mateusz Chiliński
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, 00-662, Poland
  - Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Copenhagen, Denmark
  - Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, 02-097, Poland
- Dariusz Plewczynski
  - Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Warsaw, 00-662, Poland
  - Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Warsaw, 02-097, Poland
124
Bunne C, Roohani Y, Rosen Y, Gupta A, Zhang X, Roed M, Alexandrov T, AlQuraishi M, Brennan P, Burkhardt DB, Califano A, Cool J, Dernburg AF, Ewing K, Fox EB, Haury M, Herr AE, Horvitz E, Hsu PD, Jain V, Johnson GR, Kalil T, Kelley DR, Kelley SO, Kreshuk A, Mitchison T, Otte S, Shendure J, Sofroniew NJ, Theis F, Theodoris CV, Upadhyayula S, Valer M, Wang B, Xing E, Yeung-Levy S, Zitnik M, Karaletsos T, Regev A, Lundberg E, Leskovec J, Quake SR. How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities. ARXIV 2024:arXiv:2409.11654v2. [PMID: 39398201 PMCID: PMC11468656] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2024]
Abstract
The cell is arguably the most fundamental unit of life and is central to understanding biology. Accurate modeling of cells is important for this understanding as well as for determining the root causes of disease. Recent advances in artificial intelligence (AI), combined with the ability to generate large-scale experimental data, present novel opportunities to model cells. Here we propose a vision of leveraging advances in AI to construct virtual cells, high-fidelity simulations of cells and cellular systems under different conditions that are directly learned from biological data across measurements and scales. We discuss desired capabilities of such AI Virtual Cells, including generating universal representations of biological entities across scales, and facilitating interpretable in silico experiments to predict and understand their behavior using Virtual Instruments. We further address the challenges, opportunities and requirements to realize this vision including data needs, evaluation strategies, and community standards and engagement to ensure biological accuracy and broad utility. We envision a future where AI Virtual Cells help identify new drug targets, predict cellular responses to perturbations, as well as scale hypothesis exploration. With open science collaborations across the biomedical ecosystem that includes academia, philanthropy, and the biopharma and AI industries, a comprehensive predictive understanding of cell mechanisms and interactions has come into reach.
Affiliation(s)
- Charlotte Bunne
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Genentech, South San Francisco, CA, USA
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
  - School of Computer and Communication Sciences and School of Life Sciences, EPFL, Lausanne, Switzerland
- Yusuf Roohani
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
  - Arc Institute, Palo Alto, CA, USA
- Yanay Rosen
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
- Ankit Gupta
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
  - KTH Royal Institute of Technology, Science for Life Laboratory, Department of Protein Science, Stockholm, Sweden
- Xikun Zhang
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
- Marcel Roed
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
- Theo Alexandrov
  - Department of Pharmacology, University of California, San Diego, CA, USA
  - Department of Bioengineering, University of California, San Diego, CA, USA
- Andrea Califano
  - Department of Systems Biology, Columbia University, New York, NY, USA
  - Vagelos College of Physicians and Surgeons, Columbia University Irving Medical Center, New York, NY, USA
  - Chan Zuckerberg Biohub New York, NY, USA
- Jonah Cool
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
- Abby F Dernburg
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
- Kirsty Ewing
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
- Emily B Fox
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Department of Statistics, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Biohub San Francisco, CA, USA
- Matthias Haury
  - Chan Zuckerberg Institute for Advanced Biological Imaging, Redwood City, CA, USA
- Amy E Herr
  - Chan Zuckerberg Biohub San Francisco, CA, USA
  - Department of Bioengineering, University of California, Berkeley, CA, USA
- Patrick D Hsu
  - Arc Institute, Palo Alto, CA, USA
  - Department of Bioengineering, University of California, Berkeley, CA, USA
  - Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Shana O Kelley
  - Chan Zuckerberg Biohub Chicago, IL, USA
  - Northwestern University, Evanston, IL, USA
- Anna Kreshuk
  - Cell Biology and Biophysics Unit, European Molecular Biology Laboratory, Heidelberg, Germany
- Tim Mitchison
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Stephani Otte
  - Chan Zuckerberg Institute for Advanced Biological Imaging, Redwood City, CA, USA
- Jay Shendure
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
  - Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
  - Seattle Hub for Synthetic Biology, Seattle, WA, USA
  - Howard Hughes Medical Institute, Seattle, WA, USA
- Fabian Theis
  - Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  - School of Computing, Information and Technology, Technical University of Munich, Munich, Germany
  - TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
- Christina V Theodoris
  - Gladstone Institute of Cardiovascular Disease, Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA
  - Department of Pediatrics, University of California, San Francisco, CA, USA
- Srigokul Upadhyayula
  - Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
  - Chan Zuckerberg Biohub San Francisco, CA, USA
  - Molecular Biophysics and Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Marc Valer
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
- Bo Wang
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  - Vector Institute, Toronto, Ontario, Canada
- Eric Xing
  - Carnegie Mellon University, School of Computer Science, Pittsburgh, PA, USA
  - Mohamed Bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
- Serena Yeung-Levy
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Marinka Zitnik
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Emma Lundberg
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
  - KTH Royal Institute of Technology, Science for Life Laboratory, Department of Protein Science, Stockholm, Sweden
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
  - Department of Pathology, Stanford University, Stanford, CA, USA
- Jure Leskovec
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
- Stephen R Quake
  - Chan Zuckerberg Initiative, Redwood City, CA, USA
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
  - Department of Applied Physics, Stanford University, Stanford, CA, USA
125
Naghipourfar M, Chen S, Howard MK, Macdonald CB, Saberi A, Hagen T, Mofrad MRK, Coyote-Maestas W, Goodarzi H. A Suite of Foundation Models Captures the Contextual Interplay Between Codons. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.10.10.617568. [PMID: 39416097 PMCID: PMC11482952 DOI: 10.1101/2024.10.10.617568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2024]
Abstract
In the canonical genetic code, many amino acids are assigned more than one codon. Work by us and others has shown that the choice among these synonymous codons is not random and carries regulatory and functional consequences. Existing protein foundation models ignore this context-dependent role of coding sequence in shaping the protein landscape of the cell. To address this gap, we introduce cdsFM, a suite of codon-resolution large language models, including both EnCodon and DeCodon models, with up to 1B parameters. Pre-trained on 60 million protein-coding sequences from more than 5,000 species, our models effectively learn the relationship between codons and amino acids, recapitulating the overall structure of the genetic code. In addition to outperforming state-of-the-art genomic foundation models in a variety of zero-shot and few-shot learning tasks, the larger pre-trained models were superior in predicting the choice of synonymous codons. To systematically assess the impact of synonymous codon choices on protein expression and our models' ability to capture these effects, we generated a large dataset measuring overall and surface expression levels of three proteins as a function of changes in their synonymous codons. We showed that our EnCodon models could be readily fine-tuned to predict the contextual consequences of synonymous codon choices. Armed with this knowledge, we applied EnCodon to existing clinical datasets of synonymous variants, and we identified a large number of synonymous codons that are likely pathogenic, several of which we experimentally confirmed in a cell-based model. Together, our findings establish the cdsFM suite as a powerful tool for decoding the complex functional grammar underlying the choice of synonymous codons.
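To make the notion of synonymous codons in this abstract concrete, the sketch below builds the standard genetic code and groups codons by the amino acid they encode. It is independent of cdsFM and uses no part of that software; the names `CODON_TABLE`, `SYNONYMS`, and `synonymous_alternatives` are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering
# ('*' marks stop codons).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group codons by the amino acid they encode.
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    SYNONYMS[aa].append(codon)


def synonymous_alternatives(codon):
    """Return the other codons that encode the same amino acid as `codon`."""
    return [c for c in SYNONYMS[CODON_TABLE[codon]] if c != codon]


print(len(SYNONYMS["L"]))              # leucine has six codons
print(SYNONYMS["M"])                   # methionine has a single codon, ATG
print(synonymous_alternatives("GAT"))  # aspartate: GAT and GAC are synonymous
```

A synonymous variant swaps a codon for one of its alternatives, leaving the protein sequence unchanged; the abstract's claim is that such swaps can still alter expression.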
Affiliation(s)
- Mohsen Naghipourfar
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, Berkeley, CA, USA
  - Arc Institute, Palo Alto, CA, USA
- Mathew K. Howard
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
  - Tetrad Graduate Program, UCSF, San Francisco, CA, USA
- Christian B. Macdonald
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
- Ali Saberi
  - Department of Electrical and Computer Engineering, McGill University, Montreal, Canada
  - Victor P. Dahdaleh Institute of Genomic Medicine, Montreal, QC, Canada
- Mohammad R. K. Mofrad
  - Molecular Cell Biomechanics Laboratory, Departments of Bioengineering and Mechanical Engineering, University of California, Berkeley, Berkeley, CA, USA
- Willow Coyote-Maestas
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA 94158, USA
  - Quantitative Biosciences Institute, University of California, San Francisco, USA
- Hani Goodarzi
  - Arc Institute, Palo Alto, CA, USA
  - Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, CA, USA
  - Department of Urology, University of California, San Francisco, San Francisco, CA, USA
126
Lal A, Garfield D, Biancalani T, Eraslan G. Designing realistic regulatory DNA with autoregressive language models. Genome Res 2024; 34:1411-1420. [PMID: 39322281 PMCID: PMC11529870 DOI: 10.1101/gr.279142.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 08/19/2024] [Indexed: 09/27/2024]
Abstract
Cis-regulatory elements (CREs), such as promoters and enhancers, are DNA sequences that regulate the expression of genes. The activity of a CRE is influenced by the order, composition, and spacing of sequence motifs that are bound by proteins called transcription factors (TFs). Synthetic CREs with specific properties are needed for biomanufacturing as well as for many therapeutic applications including cell and gene therapy. Here, we present regLM, a framework to design synthetic CREs with desired properties, such as high, low, or cell type-specific activity, using autoregressive language models in conjunction with supervised sequence-to-function models. We used our framework to design synthetic yeast promoters and cell type-specific human enhancers. We demonstrate that the synthetic CREs generated by our approach are not only predicted to have the desired functionality but also contain biological features similar to experimentally validated CREs. regLM thus facilitates the design of realistic regulatory DNA elements while providing insights into the cis-regulatory code.
Affiliation(s)
- Avantika Lal
  - Biology Research|AI Development, gRED Computational Sciences, Genentech, South San Francisco, California 94080, USA
- David Garfield
  - OMNI Bioinformatics and Department of Regenerative Medicine, Genentech, South San Francisco, California 94080, USA
- Tommaso Biancalani
  - Biology Research|AI Development, gRED Computational Sciences, Genentech, South San Francisco, California 94080, USA
- Gokcen Eraslan
  - Biology Research|AI Development, gRED Computational Sciences, Genentech, South San Francisco, California 94080, USA
127
Zhang G, Xie H, Dai X. DeepIndel: An Interpretable Deep Learning Approach for Predicting CRISPR/Cas9-Mediated Editing Outcomes. Int J Mol Sci 2024; 25:10928. [PMID: 39456711 PMCID: PMC11507043 DOI: 10.3390/ijms252010928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2024] [Revised: 10/01/2024] [Accepted: 10/08/2024] [Indexed: 10/28/2024] Open
Abstract
CRISPR/Cas9 has been applied to edit the genome of various organisms, but our understanding of editing outcomes at specific sites after Cas9-mediated DNA cleavage is still limited. Several deep learning-based methods have been proposed for repair outcome prediction; however, there is still room for improvement in terms of performance regarding frameshifts and model interpretability. Here, we present DeepIndel, an end-to-end multi-label regression model for predicting repair outcomes based on the BERT-base module. We demonstrate that our model outperforms existing methods in terms of accuracy and generalizability across various metrics. Furthermore, we utilized Deep SHAP to visualize the importance of nucleotides at various positions for DNA sequence and found that mononucleotides and trinucleotides in DNA sequences surrounding the cut site play a significant role in repair outcome prediction.
Affiliation(s)
- Guishan Zhang
  - College of Engineering, Shantou University, Shantou 515063, China
- Huanzeng Xie
  - College of Engineering, Shantou University, Shantou 515063, China
- Xianhua Dai
  - School of Cyber Science and Technology, Sun Yat-sen University, Shenzhen 518107, China
128
Rafi AM, Nogina D, Penzar D, Lee D, Lee D, Kim N, Kim S, Kim D, Shin Y, Kwak IY, Meshcheryakov G, Lando A, Zinkevich A, Kim BC, Lee J, Kang T, Vaishnav ED, Yadollahpour P, Kim S, Albrecht J, Regev A, Gong W, Kulakovskiy IV, Meyer P, de Boer CG. A community effort to optimize sequence-based deep learning models of gene regulation. Nat Biotechnol 2024:10.1038/s41587-024-02414-w. [PMID: 39394483 DOI: 10.1038/s41587-024-02414-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Accepted: 08/29/2024] [Indexed: 10/13/2024]
Abstract
A systematic evaluation of how model architectures and training strategies impact genomics model performance is needed. To address this gap, we held a DREAM Challenge where competitors trained models on a dataset of millions of random promoter DNA sequences and corresponding expression levels, experimentally determined in yeast. For a robust evaluation of the models, we designed a comprehensive suite of benchmarks encompassing various sequence types. All top-performing models used neural networks but diverged in architectures and training strategies. To dissect how architectural and training choices impact performance, we developed the Prix Fixe framework to divide models into modular building blocks. We tested all possible combinations for the top three models, further improving their performance. The DREAM Challenge models not only achieved state-of-the-art results on our comprehensive yeast dataset but also consistently surpassed existing benchmarks on Drosophila and human genomic datasets, demonstrating the progress that can be driven by gold-standard genomics datasets.
Affiliation(s)
- Daria Nogina
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
- Dmitry Penzar
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
  - AIRI, Moscow, Russia
  - Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
- Dohoon Lee
  - Seoul National University, Seoul, South Korea
- Nayeon Kim
  - Seoul National University, Seoul, South Korea
- Dohyeon Kim
  - Seoul National University, Seoul, South Korea
- Yeojin Shin
  - Seoul National University, Seoul, South Korea
- Arsenii Zinkevich
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, Moscow, Russia
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
- Juhyun Lee
  - Chung-Ang University, Seoul, South Korea
- Taein Kang
  - Chung-Ang University, Seoul, South Korea
- Eeshit Dhaval Vaishnav
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Sequome, Inc., South San Francisco, CA, USA
- Sun Kim
  - Seoul National University, Seoul, South Korea
- Aviv Regev
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Genentech, San Francisco, CA, USA
- Wuming Gong
  - University of Minnesota, Minneapolis, MN, USA
- Ivan V Kulakovskiy
  - Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
  - Institute of Protein Research, Russian Academy of Sciences, Pushchino, Russia
- Pablo Meyer
  - Health Care and Life Sciences, IBM Research, New York, NY, USA
- Carl G de Boer
  - University of British Columbia, Vancouver, British Columbia, Canada
129
Wang J. Deep Learning in Hematology: From Molecules to Patients. Clin Hematol Int 2024; 6:19-42. [PMID: 39417017 PMCID: PMC11477942 DOI: 10.46989/001c.124131] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2024] [Accepted: 06/29/2024] [Indexed: 10/19/2024] Open
Abstract
Deep learning (DL), a subfield of machine learning, has made remarkable strides across various aspects of medicine. This review examines DL's applications in hematology, spanning from molecular insights to patient care. The review begins by providing a straightforward introduction to the basics of DL tailored for those without prior knowledge, touching on essential concepts, principal architectures, and prevalent training methods. It then discusses the applications of DL in hematology, concentrating on elucidating the models' architecture, their applications, performance metrics, and inherent limitations. For example, at the molecular level, DL has improved the analysis of multi-omics data and protein structure prediction. For cells and tissues, DL enables the automation of cytomorphology analysis, interpretation of flow cytometry data, and diagnosis from whole slide images. At the patient level, DL's utility extends to analyzing curated clinical data, electronic health records, and clinical notes through large language models. While DL has shown promising results in various hematology applications, challenges remain in model generalizability and explainability. Moreover, the integration of novel DL architectures into hematology has been relatively slow in comparison to that in other medical fields.
Affiliation(s)
- Jiasheng Wang
  - Division of Hematology, Department of Medicine, The Ohio State University Comprehensive Cancer Center
130
Crombie TA, Rajaei M, Saxena AS, Johnson LM, Saber S, Tanny RE, Ponciano JM, Andersen EC, Zhou J, Baer CF. Direct inference of the distribution of fitness effects of spontaneous mutations from recombinant inbred Caenorhabditis elegans mutation accumulation lines. Genetics 2024; 228:iyae136. [PMID: 39139098 PMCID: PMC12098947 DOI: 10.1093/genetics/iyae136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Revised: 07/30/2024] [Accepted: 08/02/2024] [Indexed: 08/15/2024] Open
Abstract
The distribution of fitness effects of new mutations plays a central role in evolutionary biology. Estimates of the distribution of fitness effects from experimental mutation accumulation lines are compromised by the complete linkage disequilibrium between mutations in different lines. To reduce the linkage disequilibrium, we constructed 2 sets of recombinant inbred lines from a cross of 2 Caenorhabditis elegans mutation accumulation lines. One set of lines ("RIAILs") was intercrossed for 10 generations prior to 10 generations of selfing; the second set of lines ("RILs") omitted the intercrossing. Residual linkage disequilibrium in the RIAILs is much less than in the RILs, which affects the inferred distribution of fitness effects when the sets of lines are analyzed separately. The best-fit model estimated from all lines (RIAILs + RILs) infers a large fraction of mutations with positive effects (∼40%); models that constrain mutations to have negative effects fit much worse. The conclusion is the same using only the RILs. For the RIAILs, however, models that constrain mutations to have negative effects fit nearly as well as models that allow positive effects. When mutations in high linkage disequilibrium are pooled into haplotypes, the inferred distribution of fitness effects becomes increasingly negative-skewed and leptokurtic. We conclude that the conventional wisdom (most mutations have effects near 0, a handful of mutations have effects that are substantially negative, and mutations with positive effects are very rare) is likely correct, and that unless it can be shown otherwise, estimates of the distribution of fitness effects that infer a substantial fraction of mutations with positive effects are likely confounded by linkage disequilibrium.
Affiliation(s)
- Timothy A Crombie
  - Department of Biology, University of Florida, Gainesville, FL 32611, USA
  - Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
- Moein Rajaei
  - Department of Biology, University of Florida, Gainesville, FL 32611, USA
- Lindsay M Johnson
  - Department of Biology, University of Florida, Gainesville, FL 32611, USA
- Sayran Saber
  - Department of Biology, University of Florida, Gainesville, FL 32611, USA
- Robyn E Tanny
  - Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
  - Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Erik C Andersen
  - Department of Molecular Biosciences, Northwestern University, Evanston, IL 60208, USA
  - Department of Biology, Johns Hopkins University, Baltimore, MD 21218, USA
- Juannan Zhou
  - Department of Biology, University of Florida, Gainesville, FL 32611, USA
- Charles F Baer
  - Department of Biology, University of Florida, Gainesville, FL 32611, USA
  - University of Florida Genetics Institute, Gainesville, FL 32611, USA
131
Yang Y, Li G, Pang K, Cao W, Zhang Z, Li X. Deciphering 3'UTR Mediated Gene Regulation Using Interpretable Deep Representation Learning. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2024; 11:e2407013. [PMID: 39159140 PMCID: PMC11497048 DOI: 10.1002/advs.202407013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2024] [Revised: 07/23/2024] [Indexed: 08/21/2024]
Abstract
The 3' untranslated regions (3'UTRs) of messenger RNAs contain many important cis-regulatory elements that are under functional and evolutionary constraints. It is hypothesized that these constraints are similar to grammars and syntaxes in human languages and can be modeled by advanced natural language techniques such as Transformers, which have been very effective in modeling complex protein sequences and structures. Here 3UTRBERT is described, which implements an attention-based language model, i.e., Bidirectional Encoder Representations from Transformers (BERT). 3UTRBERT is pre-trained on aggregated 3'UTR sequences of human mRNAs in a task-agnostic manner; the pre-trained model is then fine-tuned for specific downstream tasks such as identifying RBP binding sites and m6A RNA modification sites, and predicting RNA sub-cellular localizations. Benchmark results show that 3UTRBERT generally outperformed other contemporary methods in each of these tasks. More importantly, the self-attention mechanism within 3UTRBERT allows direct visualization of the semantic relationship between sequence elements and effectively identifies regions with important regulatory potential. It is expected that the 3UTRBERT model can serve as a foundational tool to analyze various sequence labeling tasks within the 3'UTR field, thus enhancing the decipherability of post-transcriptional regulatory mechanisms.
Affiliation(s)
- Yuning Yang
  - School of Information Science and Technology, Northeast Normal University, Changchun, Jilin 130117, China
- Gen Li
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Kuan Pang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Wuxinhao Cao
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
- Zhaolei Zhang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Computer Science, University of Toronto, Toronto, ON M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, ON M5S 3E1, Canada
- Xiangtao Li
  - School of Artificial Intelligence, Jilin University, Changchun, Jilin 130012, China
132
Lam HYI, Ong XE, Mutwil M. Large language models in plant biology. TRENDS IN PLANT SCIENCE 2024; 29:1145-1155. [PMID: 38797656 DOI: 10.1016/j.tplants.2024.04.013] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 01/30/2024] [Revised: 04/29/2024] [Accepted: 04/30/2024] [Indexed: 05/29/2024]
Abstract
Large language models (LLMs), such as ChatGPT, have taken the world by storm. However, LLMs are not limited to human language and can be used to analyze sequential data, such as DNA, protein, and gene expression. The resulting foundation models can be repurposed to identify the complex patterns within the data, resulting in powerful, multipurpose prediction tools able to predict the state of cellular systems. This review outlines the different types of LLMs and showcases their recent uses in biology. Since LLMs have not yet been embraced by the plant community, we also cover how these models can be deployed for the plant kingdom.
Affiliation(s)
- Hilbert Yuen In Lam
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- Xing Er Ong
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
- Marek Mutwil
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore, 637551, Singapore
133
Zhang Y, Mao M, Zhang R, Liao YT, Wu VCH. DeepPL: A deep-learning-based tool for the prediction of bacteriophage lifecycle. PLoS Comput Biol 2024; 20:e1012525. [PMID: 39418300 PMCID: PMC11521287 DOI: 10.1371/journal.pcbi.1012525] [Received: 04/10/2024] [Revised: 10/29/2024] [Accepted: 09/30/2024] [Indexed: 10/19/2024]
Abstract
Bacteriophages (phages) are viruses that infect bacteria and can be classified into two different lifecycles. Virulent phages (or lytic phages) have a lytic cycle that can lyse the bacterial host after their infection. Temperate phages (or lysogenic phages) can integrate their phage genomes into bacterial chromosomes and replicate with bacterial hosts via the lysogenic cycle. Identifying phage lifecycles is a crucial step in developing suitable applications for phages. Compared to the complicated traditional biological experiments, several tools have been designed for predicting phage lifecycle using different algorithms, such as random forest (RF), linear support-vector classifier (SVC), and convolutional neural network (CNN). In this study, we developed a natural language processing (NLP)-based tool, DeepPL, for predicting phage lifecycles via nucleotide sequences. The test results showed that our DeepPL had an accuracy of 94.65% with a sensitivity of 92.24% and a specificity of 95.91%. Moreover, DeepPL had 100% accuracy in lifecycle prediction on the phages we isolated and biologically verified previously in the lab. Additionally, a mock phage community metagenomic dataset was used to test the potential usage of DeepPL in viral metagenomic research. DeepPL displayed a 100% accuracy for individual phage complete genomes and high accuracies ranging from 71.14% to 100% on phage contigs produced by various next-generation sequencing technologies. Overall, our study indicates that DeepPL has a reliable performance on phage lifecycle prediction using the most fundamental nucleotide sequences and can be applied to future phage and metagenomic research.
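The accuracy, sensitivity, and specificity figures quoted in this abstract are standard confusion-matrix metrics for a two-class problem. The following is a minimal sketch of how such figures are computed, not DeepPL's own evaluation code. The function name `binary_metrics`, the label strings, and the choice of "temperate" as the positive class are assumptions made for illustration; the abstract does not state which lifecycle was treated as positive.

```python
def binary_metrics(y_true, y_pred, positive="temperate"):
    """Accuracy, sensitivity, and specificity for a two-class labeling.

    The positive class defaults to "temperate"; this is an illustrative
    assumption, not something stated in the abstract.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Sensitivity: fraction of true positives that were recovered.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: fraction of true negatives that were recovered.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Toy labels, invented for this example only.
truth = ["temperate", "temperate", "virulent", "virulent", "virulent"]
preds = ["temperate", "virulent", "virulent", "virulent", "temperate"]
print(binary_metrics(truth, preds))
```

Swapping the positive class exchanges sensitivity and specificity, so the two numbers are only interpretable once that choice is known.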
Affiliation(s)
- Yujie Zhang
- Produce Safety and Microbiology Research Unit, U.S. Department of Agriculture, Agricultural Research Service, Western Regional Research Center, Albany, California, United States of America
- Mark Mao
- Clowit, LLC, Burlingame, California, United States of America
- Robert Zhang
- Clowit, LLC, Burlingame, California, United States of America
- Yen-Te Liao
- Produce Safety and Microbiology Research Unit, U.S. Department of Agriculture, Agricultural Research Service, Western Regional Research Center, Albany, California, United States of America
- Vivian C. H. Wu
- Produce Safety and Microbiology Research Unit, U.S. Department of Agriculture, Agricultural Research Service, Western Regional Research Center, Albany, California, United States of America
134
Hou A, Luo H, Liu H, Luo L, Ding P. Multi-scale DNA language model improves 6 mA binding sites prediction. Comput Biol Chem 2024; 112:108129. [PMID: 39067351 DOI: 10.1016/j.compbiolchem.2024.108129] [Received: 03/04/2024] [Revised: 06/05/2024] [Accepted: 06/10/2024] [Indexed: 07/30/2024]
Abstract
DNA methylation at the N6 position of adenine (N6-methyladenine, 6 mA), which refers to the attachment of a methyl group to the N6 site of the adenine (A) of DNA, is an important epigenetic modification in prokaryotic and eukaryotic genomes. Accurately predicting the 6 mA binding sites can provide crucial insights into gene regulation, DNA repair, disease development and so on. Wet experiments are commonly used for analyzing 6 mA binding sites. However, they suffer from high cost and expensive time. Therefore, various deep learning methods have been widely used to predict 6 mA binding sites recently. In this study, we develop a framework based on multi-scale DNA language model named "iDNA6mA-MDL". "iDNA6mA-MDL" integrates multiple kmers and the nucleotide property and frequency method for feature embedding, which can capture a full range of DNA sequence context information. At the prediction stage, it also leverages DNABERT to compensate for the incomplete capture of global DNA information. Experiments show that our framework obtains average AUC of 0.981 on a classic 6 mA rice gene dataset, going beyond all existing advanced models under fivefold cross-validations. Moreover, "iDNA6mA-MDL" outperforms most of the popular state-of-the-art methods on another 11 6 mA datasets, demonstrating its effectiveness in 6 mA binding sites prediction.
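The "multiple kmers" idea mentioned in this abstract can be illustrated by tokenizing one sequence at several k-mer sizes. This is only a sketch of multi-scale k-mer tokenization in general, not the iDNA6mA-MDL pipeline. The function names and the default sizes `(3, 4, 5)` are assumptions, since the abstract does not state which values of k are used.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def multi_scale_kmers(seq, ks=(3, 4, 5)):
    """Tokenize one DNA sequence at several k-mer sizes.

    The sizes in `ks` are illustrative defaults, not values from the paper.
    Returns a dict mapping each k to its list of k-mer tokens.
    """
    seq = seq.upper()
    return {k: kmers(seq, k) for k in ks}


# Toy sequence, invented for this example only.
tokens = multi_scale_kmers("ACGTAC", ks=(3, 4))
print(tokens[3])
print(tokens[4])
```

Each token list could then be embedded separately and the embeddings combined, which is one common way to expose both short and longer sequence context to a model.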
Affiliation(s)
- Anlin Hou
- School of Computer Science, University of South China, Hengyang 421001, China
- Hanyu Luo
- School of Computer Science, University of South China, Hengyang 421001, China
- Huan Liu
- School of Computer Science, University of South China, Hengyang 421001, China
- Lingyun Luo
- School of Computer Science, University of South China, Hengyang 421001, China
- Pingjian Ding
- School of Computer Science, University of South China, Hengyang 421001, China
135
Kumar Halder A, Agarwal A, Jodkowska K, Plewczynski D. A systematic analyses of different bioinformatics pipelines for genomic data and its impact on deep learning models for chromatin loop prediction. Brief Funct Genomics 2024; 23:538-548. [PMID: 38555493 DOI: 10.1093/bfgp/elae009] [Received: 11/28/2023] [Revised: 02/07/2024] [Accepted: 03/04/2024] [Indexed: 04/02/2024]
Abstract
Genomic data analysis has witnessed a surge in complexity and volume, primarily driven by the advent of high-throughput technologies. In particular, studying chromatin loops and structures has become pivotal in understanding gene regulation and genome organization. This systematic investigation explores the realm of specialized bioinformatics pipelines designed specifically for the analysis of chromatin loops and structures. Our investigation incorporates two protein (CTCF and Cohesin) factor-specific loop interaction datasets from six distinct pipelines, amassing a comprehensive collection of 36 diverse datasets. Through a meticulous review of existing literature, we offer a holistic perspective on the methodologies, tools and algorithms underpinning the analysis of this multifaceted genomic feature. We illuminate the vast array of approaches deployed, encompassing pivotal aspects such as data preparation pipeline, preprocessing, statistical features and modelling techniques. Beyond this, we rigorously assess the strengths and limitations inherent in these bioinformatics pipelines, shedding light on the interplay between data quality and the performance of deep learning models, ultimately advancing our comprehension of genomic intricacies.
Affiliation(s)
- Anup Kumar Halder
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Abhishek Agarwal
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Karolina Jodkowska
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Dariusz Plewczynski
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
136
İhtiyar MN, Özgür A. Generative language models on nucleotide sequences of human genes. Sci Rep 2024; 14:22204. [PMID: 39333252 PMCID: PMC11437190 DOI: 10.1038/s41598-024-72512-x] [Received: 10/17/2023] [Accepted: 09/09/2024] [Indexed: 09/29/2024]
Abstract
Language models, especially transformer-based ones, have achieved colossal success in natural language processing. To be precise, studies like BERT for natural language understanding and works like GPT-3 for natural language generation are very important. If we consider DNA sequences as a text written with an alphabet of four letters representing the nucleotides, they are similar in structure to natural languages. This similarity has led to the development of discriminative language models such as DNABERT in the field of DNA-related bioinformatics. To our knowledge, however, the generative side of the coin is still largely unexplored. Therefore, we have focused on the development of an autoregressive generative language model such as GPT-3 for DNA sequences. Since working with whole DNA sequences is challenging without extensive computational resources, we decided to conduct our study on a smaller scale and focus on nucleotide sequences of human genes, i.e. unique parts of DNA with specific functions, rather than the whole DNA. This decision has not significantly changed the structure of the problem, as both DNA and genes can be considered as 1D sequences consisting of four different nucleotides without losing much information and without oversimplification. First of all, we systematically studied an almost entirely unexplored problem and observed that recurrent neural networks (RNNs) perform best, while simple techniques such as N-grams are also promising. Another beneficial point was learning how to work with generative models on languages we do not understand, unlike natural languages. The importance of using real-world tasks beyond classical metrics such as perplexity was noted. In addition, we examined whether the data-hungry nature of these models can be altered by selecting a language with minimal vocabulary size, four due to four different types of nucleotides. The reason for reviewing this was that choosing such a language might make the problem easier. 
However, in this study, we found that this did not change the amount of data required very much.
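The abstract names N-grams as a simple baseline and perplexity as a classical metric. A minimal sketch of both on a four-letter nucleotide alphabet follows. It is not the authors' implementation: the function names, the trigram order, and the add-one (Laplace) smoothing are assumptions chosen to keep the example short.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # vocabulary size of four, as described in the abstract


def train_ngram(seqs, n=3):
    """Count n-grams and their (n-1)-length contexts over training sequences."""
    ngram_counts, context_counts = Counter(), Counter()
    for s in seqs:
        for i in range(len(s) - n + 1):
            gram = s[i:i + n]
            ngram_counts[gram] += 1
            context_counts[gram[:-1]] += 1
    return ngram_counts, context_counts


def perplexity(seq, ngram_counts, context_counts, n=3):
    """Perplexity of seq under an add-one smoothed n-gram model.

    Smoothing choice is an assumption for illustration; unseen contexts
    fall back to a uniform distribution over the four nucleotides.
    """
    steps = len(seq) - n + 1
    if steps <= 0:
        raise ValueError("sequence is shorter than the n-gram order")
    log_prob = 0.0
    for i in range(steps):
        gram = seq[i:i + n]
        p = (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + len(ALPHABET))
        log_prob += math.log(p)
    return math.exp(-log_prob / steps)


# Toy training data, invented for this example only.
counts = train_ngram(["ACGTACGTACGT"])
print(perplexity("ACGTACGT", *counts))
```

With no training data every prediction is uniform over four symbols, so the perplexity is 4; a model that has learned anything useful about the sequence should score below that.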
Affiliation(s)
- Musa Nuri İhtiyar
- Department of Computer Engineering, Boğaziçi University, 34342, Istanbul, Turkey
- Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University, 34342, Istanbul, Turkey
137
AlSaad R, Abd-Alrazaq A, Boughorbel S, Ahmed A, Renault MA, Damseh R, Sheikh J. Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook. J Med Internet Res 2024; 26:e59505. [PMID: 39321458 PMCID: PMC11464944 DOI: 10.2196/59505] [Received: 04/13/2024] [Revised: 08/07/2024] [Accepted: 08/20/2024] [Indexed: 09/27/2024]
Abstract
In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to processing unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data-driven medical practice. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.
Affiliation(s)
- Rawan AlSaad
- Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
- Sabri Boughorbel
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Arfan Ahmed
- Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
- Rafat Damseh
- Department of Computer Science and Software Engineering, United Arab Emirates University, Al Ain, United Arab Emirates
- Javaid Sheikh
- Weill Cornell Medicine-Qatar, Education City, Doha, Qatar
138
Yang K, Islas N, Jewell S, Jha A, Radens CM, Pleiss JA, Lynch KW, Barash Y, Choi PS. Machine learning-optimized targeted detection of alternative splicing. bioRxiv 2024:2024.09.20.614162. [PMID: 39386495 PMCID: PMC11463589 DOI: 10.1101/2024.09.20.614162] [Indexed: 10/12/2024]
Abstract
RNA-sequencing (RNA-seq) is widely adopted for transcriptome analysis but has inherent biases which hinder the comprehensive detection and quantification of alternative splicing. To address this, we present an efficient targeted RNA-seq method that greatly enriches for splicing-informative junction-spanning reads. Local Splicing Variation sequencing (LSV-seq) utilizes multiplexed reverse transcription from highly scalable pools of primers anchored near splicing events of interest. Primers are designed using Optimal Prime, a novel machine learning algorithm trained on the performance of thousands of primer sequences. In experimental benchmarks, LSV-seq achieves high on-target capture rates and concordance with RNA-seq, while requiring significantly lower sequencing depth. Leveraging deep learning splicing code predictions, we used LSV-seq to target events with low coverage in GTEx RNA-seq data and newly discover hundreds of tissue-specific splicing events. Our results demonstrate the ability of LSV-seq to quantify splicing of events of interest at high-throughput and with exceptional sensitivity.
Affiliation(s)
- Kevin Yang
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
- Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Division of Cancer Pathobiology, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Nathaniel Islas
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
- San Jewell
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
- Anupama Jha
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Caleb M. Radens
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
- Jeffrey A. Pleiss
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
- Kristen W. Lynch
- Department of Biochemistry and Biophysics, University of Pennsylvania, Philadelphia, PA, USA
- Yoseph Barash
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
- Peter S. Choi
- Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Division of Cancer Pathobiology, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
139
Fu L, Shi J, Huang B. Binning Metagenomic Contigs Using Contig Embedding and Decomposed Tetranucleotide Frequency. Biology 2024; 13:755. [PMID: 39452065 PMCID: PMC11505167 DOI: 10.3390/biology13100755] [Received: 07/28/2024] [Revised: 09/15/2024] [Accepted: 09/23/2024] [Indexed: 10/26/2024]
Abstract
Metagenomic binning is a crucial step in metagenomic research. It can aggregate the genome sequences belonging to the same microbial species into independent bins. Most existing methods ignore the semantic information of contigs and lack effective processing of tetranucleotide frequency, resulting in insufficient and complex feature information extracted for binning and poor binning results. To address the above problems, we propose CedtBin, a metagenomic binning method based on contig embedding and decomposed tetranucleotide frequency. First, the improved BERT model is used to learn the contigs to obtain their embedding representation. Secondly, the tetranucleotide frequencies are decomposed using a non-negative matrix factorization (NMF) algorithm. After that, the two features are spliced and input into the clustering algorithm for binning. Considering the sensitivity of the DBSCAN clustering algorithm to input parameters, in order to solve the drawbacks of manual parameter input, we also propose an Annoy-DBSCAN algorithm that can adaptively determine the parameters of the DBSCAN algorithm. This algorithm uses Approximate Nearest Neighbors Oh Yeah (Annoy) and combines it with a grid search strategy to find the optimal parameters of the DBSCAN algorithm. On simulated and real datasets, CedtBin achieves better binning results than mainstream methods and can reconstruct more genomes, indicating that the proposed method is effective.
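The tetranucleotide frequency (TNF) feature referred to in this abstract is a normalized count of all 4-mers in a contig. The sketch below shows only that feature, under stated assumptions, and is not CedtBin's code: the function name is hypothetical, windows containing non-ACGT characters are skipped, and reverse-complement tetramers are not merged (some binning tools do merge them). The NMF decomposition, the BERT-based embedding, and the Annoy-DBSCAN clustering described in the abstract are not reproduced here.

```python
from collections import Counter
from itertools import product

# All 256 tetramers in a fixed lexicographic order (AAAA, AAAC, ..., TTTT).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequency(contig):
    """Return a 256-dimensional vector of tetramer frequencies for a contig.

    Windows containing characters other than A, C, G, T (for example N)
    are ignored. A contig with no valid window yields an all-zero vector.
    """
    contig = contig.upper()
    counts = Counter(contig[i:i + 4] for i in range(len(contig) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[t] / total for t in TETRAMERS]


# Toy contig, invented for this example only.
vector = tetranucleotide_frequency("ACGTACGTNACGT")
print(len(vector), round(sum(vector), 6))
```

Stacking one such vector per contig gives a non-negative matrix, which is the kind of input a non-negative matrix factorization step expects.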
Affiliation(s)
- Long Fu
- School of Computer and Electronic Information, Guangxi University, Nanning 530004, China
- Jiabin Shi
- School of Computer and Electronic Information, Guangxi University, Nanning 530004, China
- Baohua Huang
- School of Computer and Electronic Information, Guangxi University, Nanning 530004, China
- Guangxi Key Laboratory of Digital Infrastructure, Guangxi Zhuang Autonomous Region Information Center, Nanning 530004, China
140
Todhunter ME, Jubair S, Verma R, Saqe R, Shen K, Duffy B. Artificial intelligence and machine learning applications for cultured meat. Front Artif Intell 2024; 7:1424012. [PMID: 39381621 PMCID: PMC11460582 DOI: 10.3389/frai.2024.1424012] [Received: 04/26/2024] [Accepted: 08/21/2024] [Indexed: 10/10/2024]
Abstract
Cultured meat has the potential to provide a complementary meat industry with reduced environmental, ethical, and health impacts. However, major technological challenges remain which require time- and resource-intensive research and development efforts. Machine learning has the potential to accelerate cultured meat technology by streamlining experiments, predicting optimal results, and reducing experimentation time and resources. However, the use of machine learning in cultured meat is in its infancy. This review covers the work available to date on the use of machine learning in cultured meat and explores future possibilities. We address four major areas of cultured meat research and development: establishing cell lines, cell culture media design, microscopy and image analysis, and bioprocessing and food processing optimization. In addition, we have included a survey of datasets relevant to cultured meat research. This review aims to provide the foundation necessary for both cultured meat and machine learning scientists to identify research opportunities at the intersection between cultured meat and machine learning.
Affiliation(s)
- Sheikh Jubair
- Alberta Machine Intelligence Institute, Edmonton, AB, Canada
- Ruchika Verma
- Alberta Machine Intelligence Institute, Edmonton, AB, Canada
- Rikard Saqe
- Department of Biology, University of Waterloo, Waterloo, ON, Canada
- Kevin Shen
- Department of Mathematics, University of Waterloo, Waterloo, ON, Canada
141
Phan H, Brouard C, Mourad R. Semi-supervised learning with pseudo-labeling compares favorably with large language models for regulatory sequence prediction. Brief Bioinform 2024; 25:bbae560. [PMID: 39489607 PMCID: PMC11531863 DOI: 10.1093/bib/bbae560] [Received: 07/04/2024] [Revised: 09/13/2024] [Accepted: 10/17/2024] [Indexed: 11/05/2024]
Abstract
Predicting molecular processes using deep learning is a promising approach to provide biological insights for non-coding single nucleotide polymorphisms identified in genome-wide association studies. However, most deep learning methods rely on supervised learning, which requires DNA sequences associated with functional data, and whose amount is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate the limitations of supervised learning, we propose a novel semi-supervised learning (SSL) based on pseudo-labeling, which allows to exploit unlabeled DNA sequences from numerous genomes during model pre-training. We further improved it incorporating principles from the Noisy Student algorithm to predict the confidence in pseudo-labeled data used for pre-training, which showed improvements for transcription factor with very few binding (very small training data). The approach is very flexible and can be used to train any neural architecture including state-of-the-art models, and shows in most cases strong predictive performance improvements compared to standard supervised learning. Moreover, small models trained by SSL showed similar or better performance than large language model DNABERT2.
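Pseudo-labeling with a confidence filter, the general family of techniques this abstract draws on, can be sketched as follows. This is a generic illustration and not the authors' procedure: the function name, the `0.9` threshold, and the toy GC-content scorer standing in for a trained model are all assumptions. How the paper assigns pseudo-labels and estimates their confidence is not specified in the abstract.

```python
def select_pseudo_labels(unlabeled, predict_proba, threshold=0.9):
    """Keep only unlabeled sequences that a model scores confidently.

    predict_proba: callable returning the probability of the positive class.
    threshold: hypothetical cutoff; scores >= threshold become label 1,
    scores <= 1 - threshold become label 0, and the rest are discarded.
    """
    selected = []
    for seq in unlabeled:
        p = predict_proba(seq)
        if p >= threshold:
            selected.append((seq, 1))
        elif p <= 1.0 - threshold:
            selected.append((seq, 0))
    return selected


def gc_fraction(seq):
    """Toy stand-in for a trained model: fraction of G or C in the sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq) if seq else 0.5


# Toy unlabeled sequences, invented for this example only.
pool = ["GGGC", "ATAT", "ACGT"]
print(select_pseudo_labels(pool, gc_fraction))
```

The retained pairs would be added to the labeled set for pre-training, and the process can be repeated with the retrained model.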
Affiliation(s)
- Han Phan
- INRAE, MIAT, 31326 Castanet-Tolosan, France
- Raphaël Mourad
- INRAE, MIAT, 31326 Castanet-Tolosan, France
- University of Toulouse, UPS, 31062 Toulouse, France
142
Li Q, Hu Z, Wang Y, Li L, Fan Y, King I, Jia G, Wang S, Song L, Li Y. Progress and opportunities of foundation models in bioinformatics. Brief Bioinform 2024; 25:bbae548. [PMID: 39461902 PMCID: PMC11512649 DOI: 10.1093/bib/bbae548] [Received: 05/30/2024] [Revised: 08/20/2024] [Accepted: 10/12/2024] [Indexed: 10/29/2024]
Abstract
Bioinformatics has undergone a paradigm shift in artificial intelligence (AI), particularly through foundation models (FMs), which address longstanding challenges in bioinformatics such as limited annotated data and data noise. These AI techniques have demonstrated remarkable efficacy across various downstream validation tasks, effectively representing diverse biological entities and heralding a new era in computational biology. The primary goal of this survey is to conduct a general investigation and summary of FMs in bioinformatics, tracing their evolutionary trajectory, current research landscape, and methodological frameworks. Our primary focus is on elucidating the application of FMs to specific biological problems, offering insights to guide the research community in choosing appropriate FMs for tasks like sequence analysis, structure prediction, and function annotation. Each section delves into the intricacies of the targeted challenges, contrasting the architectures and advancements of FMs with conventional methods and showcasing their utility across different biological domains. Further, this review scrutinizes the hurdles and constraints encountered by FMs in biology, including issues of data noise, model interpretability, and potential biases. This analysis provides a theoretical groundwork for understanding the circumstances under which certain FMs may exhibit suboptimal performance. Lastly, we outline prospective pathways and methodologies for the future development of FMs in biological research, facilitating ongoing innovation in the field. This comprehensive examination not only serves as an academic reference but also as a roadmap for forthcoming explorations and applications of FMs in biology.
Affiliation(s)
- Qing Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Zhihang Hu
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Yixuan Wang
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Lei Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Yimin Fan
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Irwin King
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
- Gengjie Jia
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Guangdong, 518120, China
- Sheng Wang
- Shanghai Zelixir Biotech Company Ltd., Shanghai, 200030, China
- Shenzhen Institute of Advanced Technology, Xueyuan Avenue, Shenzhen University Town, Nanshan District, Shenzhen, Guangdong, 518055, China
- Le Song
- BioMap, Zhongguancun Life Science Park, Haidian District, Beijing, 100085, China
- Yu Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, 999077, China
143
Xu J, Gao Y, Lu Q, Zhang R, Gui J, Liu X, Yue Z. RiceSNP-BST: a deep learning framework for predicting biotic stress-associated SNPs in rice. Brief Bioinform 2024; 25:bbae599. [PMID: 39562160 PMCID: PMC11576077 DOI: 10.1093/bib/bbae599] [Received: 08/08/2024] [Revised: 10/07/2024] [Accepted: 11/04/2024] [Indexed: 11/21/2024]
Abstract
Rice consistently faces significant threats from biotic stresses, such as fungi, bacteria, pests, and viruses. Consequently, accurately and rapidly identifying previously unknown single-nucleotide polymorphisms (SNPs) in the rice genome is a critical challenge for rice research and the development of resistant varieties. However, the limited availability of high-quality rice genotype data has hindered this research. Deep learning has transformed biological research by facilitating the prediction and analysis of SNPs in biological sequence data. Convolutional neural networks are especially effective in extracting structural and local features from DNA sequences, leading to significant advancements in genomics. Nevertheless, the expanding catalog of genome-wide association studies provides valuable biological insights for rice research. Expanding on this idea, we introduce RiceSNP-BST, an automatic architecture search framework designed to predict SNPs associated with rice biotic stress traits (BST-associated SNPs) by integrating multidimensional features. Notably, the model successfully innovates the datasets, offering more precision than state-of-the-art methods while demonstrating good performance on an independent test set and cross-species datasets. Additionally, we extracted features from the original DNA sequences and employed causal inference to enhance the biological interpretability of the model. This study highlights the potential of RiceSNP-BST in advancing genome prediction in rice. Furthermore, a user-friendly web server for RiceSNP-BST (http://rice-snp-bst.aielab.cc) has been developed to support broader genome research.
Affiliation(s)
- Jiajun Xu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Yujia Gao
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Quan Lu
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Renyi Zhang
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Jianfeng Gui
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Xiaoshuang Liu
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Zhenyu Yue
- School of Information and Artificial Intelligence, Anhui Provincial Engineering Research Center for Beidou Precision Agriculture Information, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
- Research Center for Biological Breeding Technology, Advance Academy, Anhui Agricultural University, 130, Changjiang West Road, Hefei, Anhui Province 230036, China
144
Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic Language Models: Opportunities and Challenges. arXiv 2024:arXiv:2407.11435v2. [PMID: 39070037 PMCID: PMC11275703] [Indexed: 07/30/2024]
Abstract
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of Natural Language Processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic Language Models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
Affiliation(s)
- Gonzalo Benegas
- Computer Science Division, University of California, Berkeley
| | - Chengzhong Ye
- Department of Statistics, University of California, Berkeley
| | - Carlos Albors
- Computer Science Division, University of California, Berkeley
| | - Jianan Canal Li
- Computer Science Division, University of California, Berkeley
| | - Yun S. Song
- Computer Science Division, University of California, Berkeley
- Department of Statistics, University of California, Berkeley
- Center for Computational Biology, University of California, Berkeley
|
145
|
Chao KH, Mao A, Salzberg SL, Pertea M. Splam: a deep-learning-based splice site predictor that improves spliced alignments. Genome Biol 2024; 25:243. [PMID: 39285451 PMCID: PMC11406845 DOI: 10.1186/s13059-024-03379-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 08/11/2023] [Accepted: 08/28/2024] [Indexed: 09/19/2024] Open
Abstract
The process of splicing messenger RNA to remove introns plays a central role in creating genes and gene variants. We describe Splam, a novel method for predicting splice junctions in DNA using deep residual convolutional neural networks. Unlike previous models, Splam looks at a 400-base-pair window flanking each splice site, reflecting the biological splicing process that relies primarily on signals within this window. Splam also trains on donor and acceptor pairs together, mirroring how the splicing machinery recognizes both ends of each intron. Compared to SpliceAI, Splam is consistently more accurate, achieving 96% accuracy in predicting human splice junctions.
Affiliation(s)
- Kuan-Hao Chao
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21211, USA
| | - Alan Mao
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21211, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Steven L Salzberg
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21211, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, 21205, USA
| | - Mihaela Pertea
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, 21218, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21211, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, 21218, USA
|
146
|
Sanabria M, Hirsch J, Poetsch AR. Distinguishing word identity and sequence context in DNA language models. BMC Bioinformatics 2024; 25:301. [PMID: 39272021 PMCID: PMC11395559 DOI: 10.1186/s12859-024-05869-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2023] [Accepted: 07/12/2024] [Indexed: 09/15/2024] Open
Abstract
Transformer-based large language models (LLMs) are well suited to biological sequence data because of its analogies to natural language. Complex relationships can be learned because a concept of "words" can be generated through tokenization. Trained with masked token prediction, the models learn both token sequence identity and larger sequence context. We developed methodology to interrogate model learning, which is relevant both for the interpretability of the model and for evaluating its potential for specific tasks. We used DNABERT, a DNA language model trained on the human genome with overlapping k-mers as tokens. To gain insight into the model's learning, we interrogated how the model performs predictions, extracted token embeddings, and defined a fine-tuning benchmarking task to predict the next tokens of different sizes without overlaps. This task evaluates foundation models without interrogating specific genome biology; it does not depend on tokenization strategies, vocabulary size, the dictionary, or the number of training parameters. Lastly, there is no leakage of information from token identity into the prediction task, which makes it particularly useful for evaluating the learning of sequence context. We discovered that the model with overlapping k-mers struggles to learn larger sequence context. Instead, the learned embeddings largely represent token sequence. Still, good performance is achieved for genome-biology-inspired fine-tuning tasks. Models with overlapping tokens may be used for tasks where a larger sequence context is of less relevance, but the token sequence directly represents the desired learning features. This emphasizes the need to interrogate knowledge representation in biological LLMs.
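The difference between overlapping k-mer tokens and non-overlapping tokens, which the abstract above turns on, can be illustrated with a minimal Python sketch. The function names `overlapping_kmers` and `nonoverlapping_kmers` and the example sequence are illustrative assumptions; this is not code from DNABERT or from the paper.

```python
def overlapping_kmers(seq: str, k: int) -> list[str]:
    """Slide a window of width k along the sequence one base at a time (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def nonoverlapping_kmers(seq: str, k: int) -> list[str]:
    """Cut the sequence into consecutive blocks of width k (stride k).

    A trailing block shorter than k is dropped.
    """
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


if __name__ == "__main__":
    seq = "ACGTACGTAC"  # hypothetical example sequence
    print(overlapping_kmers(seq, 3))
    print(nonoverlapping_kmers(seq, 3))
```

With stride 1, each token shares k-1 bases with its neighbour, so much of a masked token's identity is already visible in the adjacent tokens. This is one plausible reading of the information leakage that the non-overlapping benchmark task is designed to avoid.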
Affiliation(s)
- Melissa Sanabria
- Biomedical Genomics, Biotechnology Center, Center for Molecular and Cellular Bioengineering, Technische Universität Dresden, Dresden, Germany
| | - Jonas Hirsch
- Biomedical Genomics, Biotechnology Center, Center for Molecular and Cellular Bioengineering, Technische Universität Dresden, Dresden, Germany
| | - Anna R Poetsch
- Biomedical Genomics, Biotechnology Center, Center for Molecular and Cellular Bioengineering, Technische Universität Dresden, Dresden, Germany
- National Center for Tumor Diseases, Partner site Dresden, German Cancer Research Center, Dresden, Germany
|
147
|
Bhattacharya M, Pal S, Chatterjee S, Lee SS, Chakraborty C. Large language model to multimodal large language model: A journey to shape the biological macromolecules to biological sciences and medicine. Molecular Therapy. Nucleic Acids 2024; 35:102255. [PMID: 39377065 PMCID: PMC11456558 DOI: 10.1016/j.omtn.2024.102255] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/09/2024]
Abstract
After ChatGPT was released, large language models (LLMs) became more popular. Academics use ChatGPT and other LLMs for different purposes, and their use is spreading from medical science to diverse areas. Recently, the multimodal LLM (MLLM) has also become popular. Therefore, we comprehensively illustrate LLMs and MLLMs for a complete understanding. We also aim for a simple and extended review of LLMs and MLLMs for a broad category of readers, such as researchers, students in diverse fields, and other academics. The review article illustrates LLMs and MLLMs, their working principles, and their applications in diverse fields. First, we describe the technical concept of LLMs, their working principle, the black box, and the evolution of LLMs. To explain the working principle, we discuss the tokenization process, token representation, and token relationships. We also extensively describe the application of LLMs in biological macromolecules, medical science, biological science, and other areas. We illustrate the multimodal applications of LLMs or MLLMs. Finally, we illustrate the limitations, challenges, and future prospects of LLMs. The review acts as a booster dose for clinicians, a primer for molecular biologists, and a catalyst for scientists, and also benefits academics in diverse fields.
Affiliation(s)
- Manojit Bhattacharya
- Department of Zoology, Fakir Mohan University, Vyasa Vihar, Balasore, Odisha 756020, India
| | - Soumen Pal
- School of Mechanical Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu 632014, India
| | - Srijan Chatterjee
- Institute for Skeletal Aging & Orthopedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon, Gangwon-Do 24252, Republic of Korea
| | - Sang-Soo Lee
- Institute for Skeletal Aging & Orthopedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon, Gangwon-Do 24252, Republic of Korea
| | - Chiranjib Chakraborty
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal 700126, India
|
148
|
Sereshki S, Lonardi S. Predicting Differentially Methylated Cytosines in TET and DNMT3 Knockout Mutants via a Large Language Model. bioRxiv 2024:2024.05.02.592257. [PMID: 39282350 PMCID: PMC11398415 DOI: 10.1101/2024.05.02.592257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2024]
Abstract
DNA cytosine methylation is an epigenetic marker that regulates many cellular processes. Mammalian genomes typically maintain consistent methylation patterns over time, except in specific regulatory regions like promoters and certain types of enhancers. The dynamics of DNA methylation are controlled by a complex cellular machinery, in which the enzymes DNMT3 and TET play a major role. This study explores the identification of differentially methylated cytosines (DMCs) in TET and DNMT3 knockout mutants in mouse and human embryonic stem cells. We investigate (i) whether a large language model can be trained to recognize DMCs in human and mouse from the sequence surrounding the cytosine of interest, (ii) whether a classifier trained on human knockout data can predict DMCs in the mouse genome (and vice versa), and (iii) whether a classifier trained on DNMT3 knockout can predict DMCs for TET knockout (and vice versa). Our study identifies statistically significant motifs associated with the prediction of DMCs in each mutant, casting new light on the understanding of DNA methylation dynamics in stem cells. Our software tool is available at https://github.com/ucrbioinfo/dmc_prediction.
Affiliation(s)
- Saleh Sereshki
- Department of Computer Science and Engineering, University of California, Riverside, 900 University Ave, Riverside, 92521, CA, United States
| | - Stefano Lonardi
- Department of Computer Science and Engineering, University of California, Riverside, 900 University Ave, Riverside, 92521, CA, United States
|
149
|
Yu CQ, Wang XF, Li LP, You ZH, Ren ZH, Chu P, Guo F, Wang ZY. RBNE-CMI: An Efficient Method for Predicting circRNA-miRNA Interactions via Multiattribute Incomplete Heterogeneous Network Embedding. J Chem Inf Model 2024. [PMID: 39231016 DOI: 10.1021/acs.jcim.4c01118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/06/2024]
Abstract
Circular RNA (circRNA)-microRNA (miRNA) interaction (CMI) plays crucial roles in cellular regulation, offering promising perspectives for disease diagnosis and therapy. Therefore, it is necessary to employ computational methods for the rapid and cost-effective prediction of potential circRNA-miRNA interactions. However, the existing methods are limited by incomplete data; therefore, it is difficult to model molecules with different attributes on a large scale, which greatly hinders the efficiency and performance of prediction. In this study, we propose an effective method for predicting circRNA-miRNA interactions, called RBNE-CMI, and introduce a framework that can embed incomplete multiattribute CMI heterogeneous networks. Using the proposed method, we integrate different data sets in the CMI prediction field into one incomplete network for modeling, achieving superior performance in 5-fold cross-validation. Moreover, in the prediction task based on complete data, the proposed method still achieves better performance than the known model. In addition, in the case study, we successfully predicted 18 of the 20 potential cancer biomarkers. The data and source code can be found at https://github.com/1axin/RBNE-CMI.
Affiliation(s)
- Chang-Qing Yu
- School of Information Engineering, Xijing University, Xi'an 710123 China
| | - Xin-Fei Wang
- College of Computer Science and Technology, Jilin University, Changchun 130012 China
| | - Li-Ping Li
- Yizhi School of Agriculture and Forestry, Xiangyang Polytechnic Institute, Xianyang 712000, China
| | - Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
| | - Zhong-Hao Ren
- College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
| | - Peng Chu
- School of Information Engineering, Xijing University, Xi'an 710123 China
| | - Feng Guo
- School of Information Engineering, Xijing University, Xi'an 710123 China
| | - Zhen-Yu Wang
- School of Telecommunications, Lanzhou University of Technology, Lanzhou 730000, China
|
150
|
Boshar S, Trop E, de Almeida BP, Copoiu L, Pierrot T. Are genomic language models all you need? Exploring genomic language models on protein downstream tasks. Bioinformatics 2024; 40:btae529. [PMID: 39212609 PMCID: PMC11399231 DOI: 10.1093/bioinformatics/btae529] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 08/20/2024] [Accepted: 08/28/2024] [Indexed: 09/04/2024]
Abstract
MOTIVATION: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, because few tasks pair proteins with the coding DNA sequences (CDS) that can be processed by gLMs. RESULTS: In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts. The best performance was achieved using the retrieved CDS compared to sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that they capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data, and in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics. AVAILABILITY AND IMPLEMENTATION: We make our inference code, 3mer pre-trained model weights and datasets available.
Affiliation(s)
- Sam Boshar
- InstaDeep, Cambridge, MA 02142, United States
| | - Evan Trop
- InstaDeep, Cambridge, MA 02142, United States
|