1
Lee S, Park JS, Hong JH, Woo H, Lee CH, Yoon JH, Lee KB, Chung S, Yoon DS, Lee JH. Artificial intelligence in bacterial diagnostics and antimicrobial susceptibility testing: Current advances and future prospects. Biosens Bioelectron 2025; 280:117399. [PMID: 40184880 DOI: 10.1016/j.bios.2025.117399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/16/2024] [Revised: 03/14/2025] [Accepted: 03/18/2025] [Indexed: 04/07/2025]
Abstract
Recently, artificial intelligence (AI) has emerged as a transformative tool, enhancing the speed, accuracy, and scalability of bacterial diagnostics. This review explores the role of AI in revolutionizing bacterial detection and antimicrobial susceptibility testing (AST) by leveraging machine learning models, including Random Forest, Support Vector Machines (SVM), and deep learning architectures such as Convolutional Neural Networks (CNNs) and transformers. The integration of AI into these methods promises to address the current limitations of traditional techniques, offering a path toward more efficient, accessible, and reliable diagnostic solutions. In particular, AI-based approaches have demonstrated significant potential in resource-limited settings by enabling cost-effective and portable diagnostic solutions, reducing dependency on specialized infrastructure, and facilitating remote bacterial detection through smartphone-integrated platforms and telemedicine applications. This review highlights AI's transformative role in automating data analysis, minimizing human error, and delivering real-time diagnostic results, ultimately improving patient outcomes and optimizing healthcare efficiency. In addition, we not only examine the current advances in machine learning and deep learning but also review their applications in plate counting, mass spectrometry, morphology-based and motion-based microscopic detection, holographic microscopy, colorimetric and fluorescence detection, electrochemical sensors, Raman and Surface-Enhanced Raman Spectroscopy (SERS), and Atomic Force Microscopy (AFM) for bacterial diagnostics and AST. Finally, we discuss the future directions and potential advancements in AI-driven bacterial diagnostics.
Affiliation(s)
- Seungmin Lee
  - KU-KIST Graduate School of Converging Science and Technology, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea; School of Biomedical Engineering, Korea University, 145 Anam-ro, Seongbuk, Seoul, 02841, Republic of Korea
- Jeong Soo Park
  - KU-KIST Graduate School of Converging Science and Technology, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea; School of Mechanical Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Ji Hye Hong
  - KU-KIST Graduate School of Converging Science and Technology, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea; School of Biomedical Engineering, Korea University, 145 Anam-ro, Seongbuk, Seoul, 02841, Republic of Korea
- Hyowon Woo
  - KU-KIST Graduate School of Converging Science and Technology, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Chang-Hyun Lee
  - Department of Electrical Engineering, Kwangwoon University, 20 Kwangwoon-ro, Nowon, Seoul, 01897, Republic of Korea
- Ju Hwan Yoon
  - KU-KIST Graduate School of Converging Science and Technology, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea; Department of Electrical Engineering, Kwangwoon University, 20 Kwangwoon-ro, Nowon, Seoul, 01897, Republic of Korea
- Ki-Baek Lee
  - Department of Electrical Engineering, Kwangwoon University, 20 Kwangwoon-ro, Nowon, Seoul, 01897, Republic of Korea
- Seok Chung
  - School of Mechanical Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Dae Sung Yoon
  - School of Biomedical Engineering, Korea University, 145 Anam-ro, Seongbuk, Seoul, 02841, Republic of Korea; Interdisciplinary Program in Precision Public Health, Korea University, Seoul, 02841, Republic of Korea; Astrion Inc, Seoul, 02841, Republic of Korea
- Jeong Hoon Lee
  - KU-KIST Graduate School of Converging Science and Technology, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea; Department of Integrative Energy Engineering, College of Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
2
Wang T, Liu Z. m6A-SPP: Identification of RNA N6-methyladenosine modification sites through multi-source biological features and a hybrid deep learning architecture. Int J Biol Macromol 2025; 316:144789. [PMID: 40449782 DOI: 10.1016/j.ijbiomac.2025.144789] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2025] [Revised: 05/20/2025] [Accepted: 05/28/2025] [Indexed: 06/03/2025]
Abstract
The N6-methyladenosine (m6A) modification plays crucial regulatory roles in various biological processes including gene expression regulation, RNA stability, splicing, and translation. Accurate prediction of m6A modification sites is essential for understanding their biological functions and implications in diseases. To address this, we introduce m6A-SPP, a novel deep learning framework for predicting m6A modification sites effectively. The model integrates both sequence features and physicochemical properties of RNA through two specialized modules. The sequence feature module leverages a pretrained Bidirectional Encoder Representations from Transformers (BERT) module (DNABERT), combined with convolutional neural networks (CNN), to provide refined processing of RNA sequence representations. The physicochemical feature module, on the other hand, computes feature embeddings by incorporating three crucial physicochemical properties. The feature matrices from both modules are then concatenated and passed through fully connected layers to produce predictions of m6A modification sites. Comprehensive evaluations were performed on a dataset with single-nucleotide resolution for m6A, encompassing eight cell lines (such as HEK293T and HeLa) and three tissue types (Brain, Liver, and Kidney). The experimental results demonstrate that m6A-SPP surpasses existing methods in predicting m6A modification sites.
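The original DNABERT takes overlapping k-mers, not single bases, as input tokens. The sketch below shows only that preprocessing step. It assumes RNA input is first mapped to the DNA alphabet (U to T) and uses k = 3; the function name `kmer_tokens` and the example sequence are illustrative and are not taken from m6A-SPP.

```python
def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Split a nucleotide sequence into overlapping k-mers (stride 1)."""
    if k < 1:
        raise ValueError("k must be positive")
    # Mapping U -> T is an assumption of this sketch, so that RNA input
    # matches the DNA alphabet a DNA-trained encoder expects.
    dna = seq.upper().replace("U", "T")
    # A sequence shorter than k yields no tokens.
    return [dna[i:i + k] for i in range(len(dna) - k + 1)]


# Hypothetical 5-nt RNA fragment, used only to show the token layout.
tokens = kmer_tokens("GGACU", k=3)
print(tokens)
```

Each token overlaps its neighbour by k - 1 bases, so a sequence of length n produces n - k + 1 tokens.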
Affiliation(s)
- Tong Wang
  - School of Computer and Information Engineering, Institute for Artificial Intelligence, Shanghai Polytechnic University, Shanghai 201209, China
- Zhendong Liu
  - School of Computer and Information Engineering, Institute for Artificial Intelligence, Shanghai Polytechnic University, Shanghai 201209, China
3
M P A, K S A, Madhavan M. Transformer-based models for uncovering genetic mutations in cancerous and non-cancerous genomes. Gene 2025; 963:149460. [PMID: 40441324 DOI: 10.1016/j.gene.2025.149460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2024] [Revised: 03/06/2025] [Accepted: 03/28/2025] [Indexed: 06/02/2025]
Abstract
Genetic mutations, arising from permanent changes in DNA sequences due to replication errors, environmental factors, or exposure to chemicals and radiation, are crucial for uncovering the underlying mechanism that drives cancer. Key mutation types, such as Single Nucleotide Variants (SNVs), Indels and Duplications, affect the genomes in distinct ways, emphasizing the need for reliable and accurate detection techniques tailored to each mutation type. This study advances mutation classification by leveraging transformer-based models to analyze genetic mutations across cancerous and non-cancerous genomes. The proposed classification framework utilizes pre-trained DNABERT-2 and Nucleotide Transformer models to accurately categorize mutation types. To comprehensively evaluate model performance, we employ three distinct datasets: a real-world genomic dataset and two synthetic datasets generated using a WGAN-GP model, effectively addressing data imbalance by creating diverse and well-represented mutation samples. Results from experiments based on F1 score, recall, accuracy, and precision demonstrate that the proposed classifier outperforms existing models in distinguishing mutation types. These findings highlight the potential of our approach to enhance genetic analysis, contributing to improved mutation classification for personalized treatment strategies.
Affiliation(s)
- Anjana M P
  - Department of Computer Applications, Cochin University of Science and Technology, Cochin 682022, India
- Arun K S
  - Department of Computer Applications, Cochin University of Science and Technology, Cochin 682022, India
- Manu Madhavan
  - Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, 686635, India
4
Kawasaki J, Suzuki T, Hamada M. Hidden challenges in evaluating spillover risk of zoonotic viruses using machine learning models. Communications Medicine 2025; 5:187. [PMID: 40394176 PMCID: PMC12092720 DOI: 10.1038/s43856-025-00903-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2024] [Accepted: 05/09/2025] [Indexed: 05/22/2025] Open
Abstract
BACKGROUND: Machine learning models have been deployed to assess the zoonotic spillover risk of viruses by identifying their potential for human infectivity. However, the lack of comprehensive datasets for viral infectivity poses a major challenge, limiting the predictable range of viruses.
METHODS: In this study, we address this limitation through two key strategies: constructing expansive datasets across 26 viral families and developing the BERT-infect model, which leverages large language models pre-trained on extensive nucleotide sequences.
RESULTS: Here we show that our approach substantially boosts model performance. This enhancement is particularly notable in segmented RNA viruses, which are involved with severe zoonoses but have been overlooked due to limited data availability. Our model also exhibits high predictive performance even with partial viral sequences, such as high-throughput sequencing reads or contig sequences from de novo sequence assemblies, indicating the model's applicability for mining zoonotic viruses from virus metagenomic data. Furthermore, models trained on data up to 2018 demonstrate robust predictive capability for most viruses identified post-2018. Nonetheless, high-resolution evaluation based on phylogenetic analysis reveals general limitations in current machine learning models: the difficulty in alerting the human infectious risk in specific zoonotic viral lineages, including SARS-CoV-2.
CONCLUSIONS: Our study provides a comprehensive benchmark for viral infectivity prediction models and highlights unresolved issues in fully exploiting machine learning to prepare for future zoonotic threats.
Affiliation(s)
- Junna Kawasaki
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
  - Department of Infectious Disease Pathobiology, Graduate School of Medicine, Chiba University, Chiba, Japan
- Tadaki Suzuki
  - Department of Infectious Disease Pathobiology, Graduate School of Medicine, Chiba University, Chiba, Japan
  - Department of Infectious Disease Pathology, National Institute of Infectious Diseases, Japan Institute for Health Security, Tokyo, Japan
- Michiaki Hamada
  - Faculty of Science and Engineering, Waseda University, Tokyo, Japan
  - Cellular and Molecular Biotechnology Research Institute (CMB), National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
  - Graduate School of Medicine, Nippon Medical School, Tokyo, Japan
5
Yoosefzadeh-Najafabadi M. From text to traits: exploring the role of large language models in plant breeding. Frontiers in Plant Science 2025; 16:1583344. [PMID: 40438742 PMCID: PMC12116590 DOI: 10.3389/fpls.2025.1583344] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2025] [Accepted: 04/18/2025] [Indexed: 06/01/2025]
Abstract
Modern plant breeders regularly deal with the intricate patterns within biological data in order to better understand the biological background behind a trait of interest and speed up the breeding process. Recently, Large Language Models (LLMs) have gained widespread adoption in everyday contexts, showcasing remarkable capabilities in understanding and generating human-like text. By harnessing the capabilities of LLMs, foundational models can be repurposed to uncover intricate patterns within biological data, leading to the development of robust and flexible predictive tools that provide valuable insights into complex plant breeding systems. Despite the significant progress made in utilizing LLMs in various scientific domains, their adoption within plant breeding remains unexplored, presenting a significant opportunity for innovation. This review paper explores how LLMs, initially designed for natural language tasks, can be adapted to address specific challenges in plant breeding, such as identifying novel genetic interactions, predicting performance of a trait of interest, and integrating diverse datasets such as multi-omics, phenotypic, and environmental sources. Compared to conventional breeding methods, LLMs offer the potential to enhance the discovery of genetic relationships, improve trait prediction accuracy, and facilitate informed decision-making. This review aims to bridge this gap by highlighting current advancements, challenges, and future directions for integrating LLMs into plant breeding, ultimately contributing to sustainable agriculture and improved global food security.
6
Barbadilla-Martínez L, Klaassen N, van Steensel B, de Ridder J. Predicting gene expression from DNA sequence using deep learning models. Nat Rev Genet 2025:10.1038/s41576-025-00841-2. [PMID: 40360798 DOI: 10.1038/s41576-025-00841-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/01/2025] [Indexed: 05/15/2025]
Abstract
Transcription of genes is regulated by DNA elements such as promoters and enhancers, the activity of which is in turn controlled by many transcription factors. Owing to the highly complex combinatorial logic involved, it has been difficult to construct computational models that predict gene activity from DNA sequence. Recent advances in deep learning techniques applied to data from epigenome mapping and high-throughput reporter assays have made substantial progress towards addressing this complexity. Such models can capture the regulatory grammar with remarkable accuracy and show great promise in predicting the effects of non-coding variants, uncovering detailed molecular mechanisms of gene regulation and designing synthetic regulatory elements for biotechnology. Here, we discuss the principles of these approaches, the types of training data sets that are available and the strengths and limitations of different approaches.
Affiliation(s)
- Lucía Barbadilla-Martínez
  - Oncode Institute, Utrecht, The Netherlands
  - Center for Molecular Medicine, UMC Utrecht, Utrecht, The Netherlands
- Noud Klaassen
  - Oncode Institute, Utrecht, The Netherlands
  - Division of Molecular Genetics, Netherlands Cancer Institute, Amsterdam, The Netherlands
- Bas van Steensel
  - Oncode Institute, Utrecht, The Netherlands
  - Division of Molecular Genetics, Netherlands Cancer Institute, Amsterdam, The Netherlands
- Jeroen de Ridder
  - Oncode Institute, Utrecht, The Netherlands
  - Center for Molecular Medicine, UMC Utrecht, Utrecht, The Netherlands
7
Hu G, Zhou T, Zhou P, Yau SST. Novel natural vector with asymmetric covariance for classifying biological sequences. Gene 2025; 962:149532. [PMID: 40367998 DOI: 10.1016/j.gene.2025.149532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2025] [Revised: 04/07/2025] [Accepted: 04/23/2025] [Indexed: 05/16/2025]
Abstract
The genome sequences of organisms form a large and complex landscape, presenting a significant challenge in bioinformatics: how to utilize mathematical tools to describe and analyze this space effectively. The ability to compare relationships between different organisms depends on creating a rational mapping rule that can uniformly encode genome sequences of varying lengths as vectors in a measurable space. This mapping would enable researchers to apply modern mathematical and machine learning techniques to otherwise challenging genomic comparisons. The natural vector method has been proposed as a concise and effective approach to accomplish this. However, its various iterations have certain limitations. In response, we carefully analyze the strengths and weaknesses of these natural vector methods and propose an improved version: an asymmetric covariance natural vector method (ACNV). This new method incorporates k-mer information alongside covariance computations with asymmetric properties between base positions. We tested ACNV on microbial genome sequence datasets, including bacterial, fungal, and viral sequences, evaluating its performance in terms of classification accuracy and convex hull separation. The results demonstrate that ACNV effectively captures sequence characteristics, showcasing its robust sequence representation capabilities and highlighting its elegant geometric properties.
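The mapping from variable-length sequences to fixed-length vectors can be illustrated with the classic 12-dimensional natural vector, which ACNV builds on. The sketch below covers the baseline method only; the k-mer and asymmetric covariance terms of ACNV are not reproduced, since the abstract does not define them. The function name `natural_vector` is illustrative.

```python
def natural_vector(seq: str, alphabet: str = "ACGT") -> list[float]:
    """Classic natural vector: per-base counts, mean positions, and
    normalized second central moments (12 numbers for A, C, G, T)."""
    n = len(seq)
    s = seq.upper()
    counts, means, moments = [], [], []
    for base in alphabet:
        # Positions are 1-indexed, following the usual definition.
        pos = [i + 1 for i, ch in enumerate(s) if ch == base]
        n_k = len(pos)
        # Bases that never occur contribute zeros to all three parts.
        mu_k = sum(pos) / n_k if n_k else 0.0
        d2_k = sum((p - mu_k) ** 2 for p in pos) / (n_k * n) if n_k else 0.0
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


# Two toy sequences of different lengths map to vectors of the same size.
v_short = natural_vector("ACGTA")
v_long = natural_vector("ACGTACGTTT")
print(len(v_short), len(v_long))
```

Because every sequence lands in the same 12-dimensional space, ordinary distances and classifiers can be applied without alignment.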
Affiliation(s)
- Guoqing Hu
  - Beijing Institute of Mathematical Sciences and Applications (BIMSA), 101408, Beijing, China
- Tao Zhou
  - Department of Mathematical Sciences, Tsinghua University, 100084, Beijing, China
- Piyu Zhou
  - Beijing Institute of Mathematical Sciences and Applications (BIMSA), 101408, Beijing, China; State Key Laboratory of Mathematical Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, 100190, Beijing, China; University of Chinese Academy of Sciences, 100049, Beijing, China
- Stephen Shing-Toung Yau
  - Beijing Institute of Mathematical Sciences and Applications (BIMSA), 101408, Beijing, China; Department of Mathematical Sciences, Tsinghua University, 100084, Beijing, China
8
Yang X, Liao M, Ye B, Xia J, Zhao J. iEnhancer-GDM: A Deep Learning Framework Based on Generative Adversarial Network and Multi-head Attention Mechanism to Identify Enhancers and Their Strength. Interdiscip Sci 2025:10.1007/s12539-025-00703-9. [PMID: 40335860 DOI: 10.1007/s12539-025-00703-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Revised: 03/13/2025] [Accepted: 03/13/2025] [Indexed: 05/09/2025]
Abstract
Enhancers are short DNA fragments capable of significantly increasing the frequency of gene transcription. They often exert their effects on targeted genes over long distances, either in cis or in trans configurations. Identifying enhancers poses a challenge due to their variable position and sensitivities. Genetic variants within enhancer regions have been implicated in human diseases, highlighting the critical importance of enhancer identification and strength prediction. Here, we develop a two-layer predictor named iEnhancer-GDM to identify enhancers and to predict enhancer strength. To address the challenges posed by the limited size of the enhancer training dataset, which could cause issues such as model overfitting and low classification accuracy, we introduce a Wasserstein generative adversarial network with gradient penalty (WGAN-GP) to augment the dataset. We employ a dna2vec embedding layer to encode raw DNA sequences into numerical feature representations, and then integrate a multi-scale convolutional neural network, a bidirectional long short-term memory network and a multi-head attention mechanism for feature representation and classification. Our results validate the effectiveness of data augmentation with WGAN-GP. Our model iEnhancer-GDM achieves superior performance on an independent test dataset, and outperforms existing models with improvements of 2.45% for enhancer identification and 11.5% for enhancer strength prediction. iEnhancer-GDM advances precise enhancer identification and strength prediction, thereby helping to understand the functions of enhancers and their associations on genomics.
Affiliation(s)
- Xiaomei Yang
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, 830017, China
- Meng Liao
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, 830017, China
- Bin Ye
  - Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, China
  - School of Computer Science and Technology, Anhui University, Hefei, 230601, China
- Junfeng Xia
  - Institutes of Physical Science and Information Technology, Anhui University, Hefei, 230601, China
- Jianping Zhao
  - College of Mathematics and System Sciences, Xinjiang University, Urumqi, 830017, China
9
Chhibbar P, Das J. Machine learning approaches enable the discovery of therapeutics across domains. Mol Ther 2025; 33:2269-2278. [PMID: 40186352 DOI: 10.1016/j.ymthe.2025.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2025] [Revised: 03/21/2025] [Accepted: 04/01/2025] [Indexed: 04/07/2025] Open
Abstract
Multi-modal datasets have grown exponentially in the last decade. This has created an enormous demand for machine learning models that can predict complex outcomes by leveraging cellular, molecular, and humoral profiles. Corresponding inference of mechanisms can help to uncover new therapeutic targets. Here, we discuss how biological principles guide the design of predictive models and how interpretable machine learning can lead to novel mechanistic insights. We provide descriptions of multiple learning techniques and how suited they are to domain adaptations. Finally, we talk about broad learning capabilities of foundation models on large datasets and whether they can be used to provide meaningful inference about biological datasets.
Affiliation(s)
- Prabal Chhibbar
  - Centre for Systems Immunology, Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA; Integrative Systems Biology PhD Program, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA
- Jishnu Das
  - Centre for Systems Immunology, Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA; Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA
10
Kehl KL. Use of Large Language Models in Clinical Cancer Research. JCO Clin Cancer Inform 2025; 9:e2500027. [PMID: 40388683 DOI: 10.1200/cci-25-00027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2025] [Revised: 03/25/2025] [Accepted: 04/16/2025] [Indexed: 05/21/2025] Open
Abstract
Artificial intelligence (AI) is increasingly being applied to clinical cancer research, driving precision oncology objectives by gathering clinical data at scales that were not previously possible. Although small, domain-specific models have been used toward this end for several years, general-purpose large language models (LLMs) now enable scalable data extraction and analysis without the need for large, labeled training data sets. These models support several applications, including building clinico-omic databases, matching patients to clinical trials, and developing multimodal foundation models that integrate text, imaging, and molecular data. LLMs can also streamline research workflows, from automating documentation to accelerating clinical decision making. However, data privacy, hallucination risks, computational costs, regulatory requirements, and validation standards remain significant considerations. Careful implementation of AI tools will therefore be an important task for cancer researchers in coming years.
11
Pope Q, Varma R, Tataru C, David MM, Fern X. Learning a deep language model for microbiomes: The power of large scale unlabeled microbiome data. PLoS Comput Biol 2025; 21:e1011353. [PMID: 40334224 PMCID: PMC12058177 DOI: 10.1371/journal.pcbi.1011353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Accepted: 03/24/2025] [Indexed: 05/09/2025] Open
Abstract
We use open source human gut microbiome data to learn a microbial "language" model by adapting techniques from Natural Language Processing (NLP). Our microbial "language" model is trained in a self-supervised fashion (i.e., without additional external labels) to capture the interactions among different microbial taxa and the common compositional patterns in microbial communities. The learned model produces contextualized taxon representations that allow a single microbial taxon to be represented differently according to the specific microbial environment in which it appears. The model further provides a sample representation by collectively interpreting different microbial taxa in the sample and their interactions as a whole. We demonstrate that, while our sample representation performs comparably to baseline models in in-domain prediction tasks such as predicting Inflammatory Bowel Disease (IBD) and diet patterns, it significantly outperforms them when generalizing to test data from independent studies, even in the presence of substantial distribution shifts. Through a variety of analyses, we further show that the pre-trained, context-sensitive embedding captures meaningful biological information, including taxonomic relationships, correlations with biological pathways, and relevance to IBD expression, despite the model never being explicitly exposed to such signals.
Affiliation(s)
- Quintin Pope
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America
- Rohan Varma
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America
- Christine Tataru
  - Department of Pathology, Brigham and Women’s Hospital, Boston, Massachusetts, United States of America
- Maude M David
  - Department of Pharmaceutical Sciences, Oregon State University, Corvallis, Oregon, United States of America
- Xiaoli Fern
  - School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, United States of America
12
Yamada K, Suga K, Abe N, Hashimoto K, Tsutsumi S, Inagaki M, Hashiya F, Abe H, Hamada M. Multi-objective computational optimization of human 5' UTR sequences. Brief Bioinform 2025; 26:bbaf225. [PMID: 40413870 DOI: 10.1093/bib/bbaf225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2024] [Revised: 03/26/2025] [Accepted: 04/07/2025] [Indexed: 05/27/2025] Open
Abstract
The computational design of messenger RNA (mRNA) sequences is a critical technology for both scientific research and industrial applications. Recent advances in prediction and optimization models have enabled the automatic scoring and optimization of 5' UTR sequences, key upstream elements of mRNA. However, fully automated design of 5' UTR sequences with more than two objective scores has not yet been explored. In this study, we present a computational pipeline that optimizes human 5' UTR sequences in a multi-objective framework, addressing up to four distinct and conflicting objectives. Our work represents an important advancement in the multi-objective computational design of mRNA sequences, paving the way for more sophisticated mRNA engineering.
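When objectives conflict, no single candidate is best on every score, so candidates are usually compared by Pareto dominance. The abstract does not say how the pipeline ranks sequences; the sketch below assumes plain Pareto filtering with all objectives maximized. The candidate names and scores are hypothetical.

```python
def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if a is at least as good as b on every objective and strictly
    better on at least one (all objectives maximized in this sketch)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: dict[str, tuple[float, ...]]) -> list[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, s in scores.items()
        if not any(dominates(t, s) for other, t in scores.items() if other != name)
    ]


# Hypothetical candidates with two made-up objective scores each.
candidates = {
    "utr_a": (0.9, 0.2),
    "utr_b": (0.5, 0.5),
    "utr_c": (0.4, 0.4),  # worse than utr_b on both objectives
    "utr_d": (0.1, 0.8),
}
front = pareto_front(candidates)
print(front)
```

The same filter works unchanged for three or four objectives, because `dominates` compares tuples of any length.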
Affiliation(s)
- Keisuke Yamada
  - Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
  - Department of Bioengineering, University of Pennsylvania, 210 South 33rd Street, Philadelphia, PA 19104, United States
- Kanta Suga
  - Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
- Naoko Abe
  - Department of Chemistry, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
- Koji Hashimoto
  - Department of Chemistry, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
  - Graduate School of Arts and Sciences, The University of Tokyo, 3-8-1 Komaba, Meguro-ku, Tokyo 153-8902, Japan
- Susumu Tsutsumi
  - Department of Chemistry, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
- Masahito Inagaki
  - Department of Chemistry, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
- Fumitaka Hashiya
  - Research Center for Materials Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
- Hiroshi Abe
  - Department of Chemistry, Graduate School of Science, Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Aichi, Japan
  - Institute for Glyco-core Research (iGCORE), Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8601, Aichi, Japan
- Michiaki Hamada
  - Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1, Okubo Shinjuku-ku, Tokyo 169-8555, Japan
  - Cellular and Molecular Biotechnology Research Institute (CMB), National Institute of Advanced Industrial Science and Technology (AIST), 2-4-7, Aomi, Koto-ku, Tokyo 135-0064, Japan
  - Graduate School of Medicine, Nippon Medical School, 1-1-5, Sendagi, Bunkyo-ku, Tokyo 113-8602, Japan
13
Ali S, Qadri YA, Ahmad K, Lin Z, Leung MF, Kim SW, Vasilakos AV, Zhou T. Large Language Models in Genomics-A Perspective on Personalized Medicine. Bioengineering (Basel) 2025; 12:440. [PMID: 40428059 PMCID: PMC12108693 DOI: 10.3390/bioengineering12050440] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2025] [Revised: 04/21/2025] [Accepted: 04/22/2025] [Indexed: 05/29/2025] Open
Abstract
Integrating artificial intelligence (AI), particularly large language models (LLMs), into the healthcare industry is revolutionizing the field of medicine. LLMs possess the capability to analyze the scientific literature and genomic data by comprehending and producing human-like text. This enhances the accuracy, precision, and efficiency of extensive genomic analyses through contextualization. LLMs have made significant advancements in their ability to understand complex genetic terminology and accurately predict medical outcomes. These capabilities allow for a more thorough understanding of genetic influences on health issues and the creation of more effective therapies. This review emphasizes LLMs' significant impact on healthcare, evaluates their triumphs and limitations in genomic data processing, and makes recommendations for addressing these limitations in order to enhance the healthcare system. It explores the latest advancements in LLMs for genomic analysis, focusing on enhancing disease diagnosis and treatment accuracy by taking into account an individual's genetic composition. It also anticipates a future in which AI-driven genomic analysis is commonplace in clinical practice, suggesting potential research areas. To effectively leverage LLMs' potential in personalized medicine, it is vital to actively support innovation across multiple sectors, ensuring that AI developments directly contribute to healthcare solutions tailored to individual patients.
Affiliation(s)
- Shahid Ali
  - School of Cyberspace Security, Hainan University, Haikou 570228, China; (S.A.); (Z.L.)
- Yazdan Ahmad Qadri
  - School of Computer Science and Engineering, Yeungnam University, 280, Daehak-ro, Gyeongsan-si 38541, Gyeongsangbuk-do, Republic of Korea; (Y.A.Q.); (S.W.K.)
- Khurshid Ahmad
  - Department of Health Informatics, College of Applied Medical Sciences, Qassim University, Buraydah 51452, Saudi Arabia
- Zhizhe Lin
  - School of Cyberspace Security, Hainan University, Haikou 570228, China; (S.A.); (Z.L.)
- Man-Fai Leung
  - School of Computing and Information Science, Anglia Ruskin University, Cambridge CB1 1PT, UK
- Sung Won Kim
  - School of Computer Science and Engineering, Yeungnam University, 280, Daehak-ro, Gyeongsan-si 38541, Gyeongsangbuk-do, Republic of Korea; (Y.A.Q.); (S.W.K.)
- Athanasios V. Vasilakos
  - Department of Information and Communication Technology, University of Agder, 4879 Grimstad, Norway
- Teng Zhou
  - School of Cyberspace Security, Hainan University, Haikou 570228, China; (S.A.); (Z.L.)
14
Xiao Y, Zhang Y. deep-Sep: a deep learning-based method for fast and accurate prediction of selenoprotein genes in bacteria. mSystems 2025; 10:e0125824. [PMID: 40062874 PMCID: PMC12013277 DOI: 10.1128/msystems.01258-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2024] [Accepted: 02/07/2025] [Indexed: 04/23/2025]
Abstract
Selenoproteins are a special group of proteins with major roles in cellular antioxidant defense. They contain the 21st amino acid selenocysteine (Sec) in the active sites, which is encoded by an in-frame UGA codon. Compared to eukaryotes, identification of selenoprotein genes in bacteria remains challenging due to the absence of an effective strategy for distinguishing the Sec-encoding UGA codon from a normal stop signal. In this study, we have developed a deep learning-based algorithm, deep-Sep, for quickly and precisely identifying selenoprotein genes in bacterial genomic sequences. This algorithm uses a Transformer-based neural network architecture to construct an optimal model for detecting Sec-encoding UGA codons and a homology search-based strategy to remove additional false positives. During the training and testing stages, deep-Sep has demonstrated commendable performance, including an F1 score of 0.939 and an area under the receiver operating characteristic curve of 0.987. Furthermore, when applied to 20 bacterial genomes as independent test data sets, deep-Sep exhibited remarkable capability in identifying both known and new selenoprotein genes, which significantly outperforms the existing state-of-the-art method. Our algorithm has proved to be a powerful tool for comprehensively characterizing selenoprotein genes in bacterial genomes, which should not only assist in accurate annotation of selenoprotein genes in genome sequencing projects but also provide new insights for a deeper understanding of the roles of selenium in bacteria.
IMPORTANCE
Selenium is an essential micronutrient present in selenoproteins in the form of Sec, which is a rare amino acid encoded by the opal stop codon UGA. Identification of all selenoproteins is of vital importance for investigating the functions of selenium in nature. Previous strategies for predicting selenoprotein genes mainly relied on the identification of a special cis-acting Sec insertion sequence (SECIS) element within mRNAs. However, due to the complexity and variability of SECIS elements, recognition of all selenoprotein genes in bacteria is still a major challenge in the annotation of bacterial genomes. We have developed a deep learning-based algorithm to predict selenoprotein genes in bacterial genomic sequences, which demonstrates superior performance compared to currently available methods. This algorithm can be utilized in either web-based or local (standalone) modes, serving as a promising tool for identifying the complete set of selenoprotein genes in bacteria.
Affiliation(s)
- Yao Xiao
  - Shenzhen Key Laboratory of Marine Bioresources and Ecology, Brain Disease and Big Data Research Institute, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen, Guangdong, China
- Yan Zhang
  - Shenzhen Key Laboratory of Marine Bioresources and Ecology, Brain Disease and Big Data Research Institute, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen, Guangdong, China
  - Shenzhen-Hong Kong Institute of Brain Science-Shenzhen Fundamental Research Institutions, Shenzhen, Guangdong, China
15
Sheng N, Qiao J, Wei L, Shi H, Guo H, Yang C. Computational models for prediction of m6A sites using deep learning. Methods 2025; 240:113-124. [PMID: 40268153 DOI: 10.1016/j.ymeth.2025.04.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 04/02/2025] [Accepted: 04/07/2025] [Indexed: 04/25/2025]
Abstract
RNA modifications play a crucial role in enhancing the structural and functional diversity of RNA molecules and regulating various stages of the RNA life cycle. Among these modifications, N6-Methyladenosine (m6A) is the most common internal modification in eukaryotic mRNAs and has been extensively studied over the past decade. Accurate identification of m6A modification sites is essential for understanding their function and underlying mechanisms. Traditional methods predominantly rely on machine learning techniques to recognize m6A sites, which often fail to capture the contextual features of these sites comprehensively. In this study, we comprehensively summarize previously published methods based on machine learning and deep learning. We also validate multiple deep learning approaches on a benchmark dataset, including previously underutilized methods in m6A site prediction, pre-trained models specifically designed for biological sequences, and other basic deep learning methods. Additionally, we analyze the dataset features and interpret the model's predictions to enhance understanding. Our experimental results clearly demonstrate the effectiveness of the deep learning models, elucidating their strong potential in accurately recognizing m6A modification sites.
Affiliation(s)
- Nan Sheng
  - School of Software, Shandong University, Jinan 250101, PR China
- Jianbo Qiao
  - School of Software, Shandong University, Jinan 250101, PR China
- Leyi Wei
  - School of Software, Shandong University, Jinan 250101, PR China
- Hua Shi
  - School of Opto-electronic and Communication Engineering, Xiamen University of Technology, Xiamen, PR China
- Huannan Guo
  - Beidahuang Industry Group General Hospital, PR China.
- Changshun Yang
  - Department of Gastrointestinal Surgery, Fuzhou University Affiliated Provincial Hospital, Fuzhou 350004, PR China.
16
Ji L, Hou W, Zhou H, Xiong L, Liu C, Yuan Z, Li L. EBMGP: a deep learning model for genomic prediction based on Elastic Net feature selection and bidirectional encoder representations from transformer's embedding and multi-head attention pooling. Theor Appl Genet 2025; 138:103. [PMID: 40253568 PMCID: PMC12009238 DOI: 10.1007/s00122-025-04894-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2024] [Accepted: 03/27/2025] [Indexed: 04/21/2025]
Abstract
Enhancing early selection through genomic estimated breeding values is pivotal for reducing generation intervals and accelerating breeding programs. Recently, deep learning (DL) approaches have gained prominence in genomic prediction (GP). Here, we introduce a novel DL framework for GP based on Elastic Net feature selection and bidirectional encoder representations from transformer's embedding and multi-head attention pooling (EBMGP). EBMGP applies Elastic Net for the selection of features, thereby diminishing the computational burden and bolstering the predictive accuracy. In EBMGP, SNPs are treated as "words," and groups of adjacent SNPs with similar LD levels are considered "sentences." By applying bidirectional encoder representations from transformers embeddings, this method models SNPs in a manner analogous to human language, capturing complex genetic interactions at both the "word" and "sentence" scales. This flexible representation seamlessly integrates into any DL network and demonstrates a marked improvement in predictive performance for EBMGP and SoyDNGP compared to the widely used one-hot representation. We propose multi-head attention pooling, which can adaptively assign weights to features while learning features from multiple subspaces through multi-heads for a high level of semantic understanding. In a comprehensive comparative analysis across four diverse plant and animal datasets, EBMGP outperformed competing models in 13 out of 16 tasks, achieving accuracy gains ranging from 0.74 to 9.55% over the second-best model. These results underscore EBMGP's robustness in genomic prediction and highlight its potential for deep learning applications in life sciences.
Affiliation(s)
- Lu Ji
  - Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
  - Basic Biology Laboratory, Hunan First Normal University, Changsha, 410205, China
- Wei Hou
  - College of Bioscience and Biotechnology, Hunan Agricultural University, Changsha, 410128, China
- Heng Zhou
  - Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
- Liwen Xiong
  - College of Life Sciences, University of Chinese Academy of Sciences, Beijing, Beijing, 100049, China
- Chunhai Liu
  - Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China
- Zheming Yuan
  - Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China.
- Lanzhi Li
  - Hunan Engineering and Technology Research Center for Agricultural Big Data Analysis and Decision-Making, Hunan Agricultural University, Changsha, 410128, China.
17
Consens ME, Li B, Poetsch AR, Gilbert S. Genomic language models could transform medicine but not yet. NPJ Digit Med 2025; 8:212. [PMID: 40251342 PMCID: PMC12008430 DOI: 10.1038/s41746-025-01603-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2025] [Accepted: 03/31/2025] [Indexed: 04/20/2025]
Affiliation(s)
- Micaela Elisa Consens
  - Department of Computer Science, University of Toronto, Toronto, ON, Canada
  - Vector Institute for Artificial Intelligence, Toronto, ON, Canada
  - Peter Munk Cardiac Center, University Health Network, Toronto, ON, Canada
- Ben Li
  - Division of Vascular Surgery, University of Toronto, Toronto, ON, Canada
  - Temerty Centre for Artificial Intelligence Research and Education in Medicine, University of Toronto, Toronto, ON, Canada
- Anna R Poetsch
  - Biomedical Genomics, Biotechnology Center, Center for Molecular and Cellular Bioengineering, Technische Universität, Dresden, Germany
  - National Center for Tumor Diseases (NCT) partner site Dresden, German Cancer Research Center (DKFZ), Dresden, Germany
- Stephen Gilbert
  - Carl Gustav Carus University Hospital Dresden, Dresden University of Technology, Dresden, Germany.
  - Else Kröner Fresenius Center for Digital Health, TUD Dresden University of Technology, Dresden, Germany.
18
Rausch T, Marschall T, Korbel JO. The impact of long-read sequencing on human population-scale genomics. Genome Res 2025; 35:593-598. [PMID: 40228902 PMCID: PMC12047236 DOI: 10.1101/gr.280120.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/16/2025]
Abstract
Long-read sequencing technologies, particularly those from Pacific Biosciences and Oxford Nanopore Technologies, are revolutionizing genome research by providing high-resolution insights into complex and repetitive regions of the human genome that were previously inaccessible. These advances have been particularly enabling for the comprehensive detection of genomic structural variants (SVs), which is critical for linking genotype to phenotype in population-scale and rare disease studies, as well as in cancer. Recent developments in sequencing throughput and computational methods, such as pangenome graphs and haplotype-resolved assemblies, are paving the way for the future inclusion of long-read sequencing in clinical cohort studies and disease diagnostics. DNA methylation signals directly obtained from long reads enhance the utility of single-molecule long-read sequencing technologies by enabling molecular phenotypes to be interpreted, and by allowing the identification of the parent of origin of de novo mutations. Despite this recent progress, challenges remain in scaling long-read technologies to large populations due to cost, computational complexity, and the lack of tools to facilitate the efficient interpretation of SVs in graphs. This perspective provides a succinct review on the current state of long-read sequencing in genomics by highlighting its transformative potential and key hurdles, and emphasizing future opportunities for advancing the understanding of human genetic diversity and diseases through population-scale long-read analysis.
Affiliation(s)
- Tobias Rausch
  - European Molecular Biology Laboratory (EMBL), Genome Biology Unit, 69117 Heidelberg, Germany
- Tobias Marschall
  - Institute for Medical Biometry and Bioinformatics, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University, 40225 Düsseldorf, Germany
  - Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany
- Jan O Korbel
  - European Molecular Biology Laboratory (EMBL), Genome Biology Unit, 69117 Heidelberg, Germany
19
Wang T, Gao M. Utilizing a deep learning model based on BERT for identifying enhancers and their strength. PLoS One 2025; 20:e0320085. [PMID: 40203028 PMCID: PMC11981215 DOI: 10.1371/journal.pone.0320085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2024] [Accepted: 02/12/2025] [Indexed: 04/11/2025]
Abstract
An enhancer is a specific DNA sequence typically located within a gene at an upstream or downstream position and serves as a pivotal element in the regulation of eukaryotic gene transcription. Therefore, the recognition of enhancers is highly significant for comprehending gene expression regulatory systems. While some useful predictive models have been proposed, there are still deficiencies in these models. To address current limitations, we propose a model, DNABERT2-Enhancer, based on transformer architecture and deep learning, designed for the recognition of enhancers (classified as either enhancer or non-enhancer) and the identification of their activity (strong or weak enhancers). More specifically, DNABERT2-Enhancer is composed of a BERT model for extracting features and a CNN model for enhancer classification. Parameters of the BERT model are initialized from a pre-trained DNABERT-2 language model. The enhancer recognition task is then fine-tuned through transfer learning to convert the original sequence into feature vectors. Subsequently, the CNN network is employed to learn the feature vector generated by BERT and produce the prediction results. In comparison with existing predictors utilizing the identical dataset, our approach demonstrates superior performance. This suggests that the model will be a useful instrument for academic research on enhancer recognition.
Affiliation(s)
- Tong Wang
  - School of Computer and Information Engineering, Shanghai Polytechnic University, Shanghai, China
- Mengqi Gao
  - School of Computer and Information Engineering, Shanghai Polytechnic University, Shanghai, China
20
Asim MN, Ibrahim MA, Zaib A, Dengel A. DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models. Front Med (Lausanne) 2025; 12:1503229. [PMID: 40265190 PMCID: PMC12011883 DOI: 10.3389/fmed.2025.1503229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Accepted: 03/10/2025] [Indexed: 04/24/2025]
Abstract
Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors, including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genetic blueprint and understanding the factors that can modify it. This analysis helps in early detection of genetic diseases and the design of targeted therapies. DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for comprehensive literature that bridges the gap between both fields, the contributions of this paper are manifold: It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI paradigms, namely, classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating information on 36 diverse biological databases that can be used to develop benchmark datasets for 44 different DNA sequence analysis tasks. To ensure performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to 44 distinct DNA sequence analysis tasks. It presents word embedding and language model applications across 44 distinct DNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of the performance values of predictive pipelines based on 39 word embeddings and 67 language models, as well as top-performing traditional sequence encoding-based predictors and their performance across 44 DNA sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Arooj Zaib
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
21
Cui H, Tejada-Lapuerta A, Brbić M, Saez-Rodriguez J, Cristea S, Goodarzi H, Lotfollahi M, Theis FJ, Wang B. Towards multimodal foundation models in molecular cell biology. Nature 2025; 640:623-633. [PMID: 40240854 DOI: 10.1038/s41586-025-08710-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Accepted: 01/29/2025] [Indexed: 04/18/2025]
Abstract
The rapid advent of high-throughput omics technologies has created an exponential growth in biological data, often outpacing our ability to derive molecular insights. Large-language models have shown a way out of this data deluge in natural language processing by integrating massive datasets into a joint model with manifold downstream use cases. Here we envision developing multimodal foundation models, pretrained on diverse omics datasets, including genomics, transcriptomics, epigenomics, proteomics, metabolomics and spatial profiling. These models are expected to exhibit unprecedented potential for characterizing the molecular states of cells across a broad continuum, thereby facilitating the creation of holistic maps of cells, genes and tissues. Context-specific transfer learning of the foundation models can empower diverse applications from novel cell-type recognition, biomarker discovery and gene regulation inference, to in silico perturbations. This new paradigm could launch an era of artificial intelligence-empowered analyses, one that promises to unravel the intricate complexities of molecular cell biology, to support experimental design and, more broadly, to profoundly extend our understanding of life sciences.
Affiliation(s)
- Haotian Cui
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
  - Peter Munk Cardiac Center, University Health Network, Toronto, Ontario, Canada
- Alejandro Tejada-Lapuerta
  - Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
  - School of Computing, Information and Technology, Technical University of Munich, Munich, Germany
- Maria Brbić
  - School of Computer and Communication Sciences, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
  - School of Life Sciences, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Julio Saez-Rodriguez
  - Institute for Computational Biomedicine, Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Heidelberg, Germany
  - European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
- Simona Cristea
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
  - Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Hani Goodarzi
  - Arc Institute, Palo Alto, CA, USA
  - Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
- Mohammad Lotfollahi
  - Wellcome Sanger Institute, Cambridge, UK
  - Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
- Fabian J Theis
  - Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany.
  - School of Computing, Information and Technology, Technical University of Munich, Munich, Germany.
  - TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.
- Bo Wang
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.
  - Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada.
  - Peter Munk Cardiac Center, University Health Network, Toronto, Ontario, Canada.
  - Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada.
22
Li H, Meng J, Wang Z, Luan Y. PmiProPred: A novel method towards plant miRNA promoter prediction based on CNN-Transformer network and convolutional block attention mechanism. Int J Biol Macromol 2025; 302:140630. [PMID: 39909261 DOI: 10.1016/j.ijbiomac.2025.140630] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2024] [Revised: 01/31/2025] [Accepted: 02/01/2025] [Indexed: 02/07/2025]
Abstract
It is crucial to understand the transcription mechanisms of miRNAs, especially considering the presence of peptides encoded by miRNAs. Since promoters function as the switch for gene transcription, precisely identifying these regions is essential for fully annotating miRNA transcripts. Nonetheless, existing computational methods still have room for improvement in the characterization of promoter regions. Here, we present PmiProPred, an advanced tool designed for predicting plant miRNA promoters from a wide spectrum of genomes. It consists of two core components: multi-stream deep feature extraction (MsDFE) and multi-stream deep feature refinement (MsDFR). The MsDFE utilizes Transformer and CNN to gather multi-view features, while the MsDFR focuses on aligning and refining them using channel and spatial attention mechanisms. Ultimately, a multi-layer perceptron is employed to accomplish the miRNA promoter identification task. Across three independent testing datasets, PmiProPred achieves accuracies of 94.630%, 96.659%, and 92.480%, respectively, substantially surpassing the latest methods. Additionally, PmiProPred is employed to identify potential core promoters in the upstream 2-kb regions of intergenic miRNAs in five plant species. We further conduct cis-regulatory elements mining on the predicted promoters and perform an in-depth analysis of the identified motifs. Altogether, PmiProPred is a robust and effective tool for discovering plant miRNA promoters.
Affiliation(s)
- Haibin Li
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Jun Meng
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Zhaowei Wang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
- Yushi Luan
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning 116024, China.
23
Fang L, Teng J, Lin Q, Bai Z, Liu S, Guan D, Li B, Gao Y, Hou Y, Gong M, Pan Z, Yu Y, Clark EL, Smith J, Rawlik K, Xiang R, Chamberlain AJ, Goddard ME, Littlejohn M, Larson G, MacHugh DE, O'Grady JF, Sørensen P, Sahana G, Lund MS, Jiang Z, Pan X, Gong W, Zhang H, He X, Zhang Y, Gao N, He J, Yi G, Liu Y, Tang Z, Zhao P, Zhou Y, Fu L, Wang X, Hao D, Liu L, Chen S, Young RS, Shen X, Xia C, Cheng H, Ma L, Cole JB, Baldwin RL, Li CJ, Van Tassell CP, Rosen BD, Bhowmik N, Lunney J, Liu W, Guan L, Zhao X, Ibeagha-Awemu EM, Luo Y, Lin L, Canela-Xandri O, Derks MFL, Crooijmans RPMA, Gòdia M, Madsen O, Groenen MAM, Koltes JE, Tuggle CK, McCarthy FM, Rocha D, Giuffra E, Amills M, Clop A, Ballester M, Tosser-Klopp G, Li J, Fang C, Fang M, Wang Q, Hou Z, Wang Q, Zhao F, Jiang L, Zhao G, Zhou Z, Zhou R, Liu H, Deng J, Jin L, Li M, Mo D, Liu X, Chen Y, Yuan X, Li J, Zhao S, Zhang Y, Ding X, Sun D, Sun HZ, Li C, Wang Y, Jiang Y, Wu D, Wang W, Fan X, Zhang Q, Li K, Zhang H, Yang N, Hu X, Huang W, Song J, Wu Y, Yang J, Wu W, Kasper C, Liu X, Yu X, Cui L, Zhou X, Kim S, Li W, Im HK, Buckler ES, Ren B, Schatz MC, Li JJ, Palmer AA, Frantz L, Zhou H, Zhang Z, Liu GE. The Farm Animal Genotype-Tissue Expression (FarmGTEx) Project. Nat Genet 2025; 57:786-796. [PMID: 40097783 DOI: 10.1038/s41588-025-02121-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2024] [Accepted: 02/06/2025] [Indexed: 03/19/2025]
Abstract
Genetic mutation and drift, coupled with natural and human-mediated selection and migration, have produced a wide variety of genotypes and phenotypes in farmed animals. We here introduce the Farm Animal Genotype-Tissue Expression (FarmGTEx) Project, which aims to elucidate the genetic determinants of gene expression across 16 terrestrial and aquatic domestic species under diverse biological and environmental contexts. For each species, we aim to collect multiomics data, particularly genomics and transcriptomics, from 50 tissues of 1,000 healthy adults and 200 additional animals representing a specific context. This Perspective provides an overview of the priorities of FarmGTEx and advocates for coordinated strategies of data analysis and resource-sharing initiatives. FarmGTEx aims to serve as a platform for investigating context-specific regulatory effects, which will deepen our understanding of molecular mechanisms underlying complex phenotypes. The knowledge and insights provided by FarmGTEx will contribute to improving sustainable agriculture-based food systems, comparative biology and eventual human biomedicine.
Affiliation(s)
- Lingzhao Fang
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark.
- Jinyan Teng
  - State Key Laboratory of Swine and Poultry Breeding Industry, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Qing Lin
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
  - State Key Laboratory of Swine and Poultry Breeding Industry, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Zhonghao Bai
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
- Shuli Liu
  - Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, China
  - School of Life Sciences, Westlake University, Hangzhou, China
- Dailu Guan
  - Department of Animal Science, University of California, Davis, Davis, CA, USA
- Bingjie Li
  - Department of Animal and Veterinary Sciences, Scotland's Rural College, Midlothian, UK
- Yahui Gao
  - State Key Laboratory of Swine and Poultry Breeding Industry, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
- Yali Hou
  - Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Mian Gong
  - Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
  - Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Zhangyuan Pan
  - Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
- Ying Yu
  - National Engineering Laboratory for Animal Breeding, State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing, China
- Emily L Clark
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush, Midlothian, UK
- Jacqueline Smith
  - The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, Easter Bush, Midlothian, UK
- Konrad Rawlik
  - Baillie Gifford Pandemic Science Hub, Centre for Inflammation Research, Institute for Regeneration and Repair, the University of Edinburgh, Edinburgh, UK
- Ruidong Xiang
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, Victoria, Australia
  - Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - School of Agriculture, Food and Ecosystem Sciences, the University of Melbourne, Parkville, Victoria, Australia
- Amanda J Chamberlain
  - Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, Victoria, Australia
- School of Applied Systems Biology, La Trobe University, Bundoora, Victoria, Australia
| | - Michael E Goddard
- Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, Victoria, Australia
- School of Agriculture, Food and Ecosystem Sciences, the University of Melbourne, Parkville, Victoria, Australia
| | - Mathew Littlejohn
- Research and Development, Livestock Improvement Corporation, Hamilton, New Zealand
- AL Rae Centre for Genetics and Breeding, Massey University, Palmerston North, New Zealand
| | - Greger Larson
- The Palaeogenomics and Bio-Archaeology Research Network, School of Archaeology, University of Oxford, Oxford, UK
| | - David E MacHugh
- UCD School of Agriculture and Food Science, University College Dublin, Belfield, Dublin, Ireland
- UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Belfield, Dublin, Ireland
- UCD One Health Centre, University College Dublin, Belfield, Dublin, Ireland
| | - John F O'Grady
- UCD School of Agriculture and Food Science, University College Dublin, Belfield, Dublin, Ireland
| | - Peter Sørensen
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
| | - Goutam Sahana
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
| | - Mogens Sandø Lund
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
| | - Zhihua Jiang
- Department of Animal Sciences and Center for Reproductive Biology, Washington State University, Pullman, WA, USA
| | - Xiangchun Pan
- State Key Laboratory of Swine and Poultry Breeding Industry, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
| | - Wentao Gong
- State Key Laboratory of Swine and Poultry Breeding Industry, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
| | - Haihan Zhang
- College of Animal Science and Technology, Hunan Agricultural University, Changsha, China
| | - Xi He
- College of Animal Science and Technology, Hunan Agricultural University, Changsha, China
| | - Yuebo Zhang
- College of Animal Science and Technology, Hunan Agricultural University, Changsha, China
| | - Ning Gao
- College of Animal Science and Technology, Hunan Agricultural University, Changsha, China
| | - Jun He
- College of Animal Science and Technology, Hunan Agricultural University, Changsha, China
| | - Guoqiang Yi
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Yuwen Liu
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Zhonglin Tang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Pengju Zhao
- Hainan Institute, Zhejiang University, Yongyou Industry Park, Yazhou Bay Sci-Tech City, Sanya, China
| | - Yang Zhou
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of the Ministry of Education, Huazhong Agricultural University, Wuhan, China
- Yazhouwan National Laboratory, Sanya, China
| | - Liangliang Fu
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of the Ministry of Education, Huazhong Agricultural University, Wuhan, China
| | - Xiao Wang
- Institute of Animal Science and Veterinary Medicine, Shandong Academy of Agricultural Sciences, Jinan, China
| | - Dan Hao
- Poultry Institute, Shandong Academy of Agricultural Sciences, Jinan, China
| | - Lei Liu
- Yazhouwan National Laboratory, Sanya, China
| | - Siqian Chen
- National Engineering Laboratory for Animal Breeding, State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Robert S Young
- Usher Institute, University of Edinburgh, Edinburgh, UK
- Zhejiang University-University of Edinburgh Institute, Zhejiang University, Haining, P. R. China
| | - Xia Shen
- Usher Institute, University of Edinburgh, Edinburgh, UK
- State Key Laboratory of Genetic Engineering, Center for Evolutionary Biology, School of Life Sciences, Fudan University, Shanghai, China
- Center for Intelligent Medicine Research, Greater Bay Area Institute of Precision Medicine (Guangzhou), Fudan University, Guangzhou, China
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Charley Xia
- Lothian Birth Cohort studies, University of Edinburgh, Edinburgh, UK
- Department of Psychology, University of Edinburgh, Edinburgh, UK
| | - Hao Cheng
- Department of Animal Science, University of California, Davis, Davis, CA, USA
| | - Li Ma
- Department of Animal and Avian Sciences, University of Maryland, College Park, MD, USA
| | - John B Cole
- Council on Dairy Cattle Breeding, Bowie, MD, USA
- Department of Animal Sciences, Donald Henry Barron Reproductive and Perinatal Biology Research Program and the Genetics Institute, University of Florida, Gainesville, FL, USA
- Department of Animal Science, North Carolina State University, Raleigh, NC, USA
| | - Ransom L Baldwin
- Animal Genomics and Improvement Laboratory, Henry A. Wallace Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, USA
| | - Cong-Jun Li
- Animal Genomics and Improvement Laboratory, Henry A. Wallace Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, USA
| | - Curtis P Van Tassell
- Animal Genomics and Improvement Laboratory, Henry A. Wallace Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, USA
| | - Benjamin D Rosen
- Animal Genomics and Improvement Laboratory, Henry A. Wallace Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, USA
| | - Nayan Bhowmik
- Animal Genomics and Improvement Laboratory, Henry A. Wallace Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, USA
| | - Joan Lunney
- Animal Parasitic Diseases Laboratory, BARC, NEA, ARS, USDA, Beltsville, MD, USA
| | - Wansheng Liu
- Department of Animal Science, Center for Reproductive Biology and Health, College of Agricultural Sciences, the Pennsylvania State University, University Park, PA, USA
| | - Leluo Guan
- Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, Alberta, Canada
- Faculty of Land and Food Systems, University of British Columbia, Vancouver, British Columbia, Canada
| | - Xin Zhao
- Department of Animal Science, McGill University, Sainte-Anne-de-Bellevue, Quebec, Canada
| | - Eveline M Ibeagha-Awemu
- Sherbrooke Research and Development Centre, Agriculture and Agri-Food Canada, Sherbrooke, Quebec, Canada
| | - Yonglun Luo
- Department of Biomedicine, Aarhus University, Aarhus, Denmark
- Steno Diabetes Center Aarhus, Aarhus University Hospital, Aarhus, Denmark
| | - Lin Lin
- Department of Biomedicine, Aarhus University, Aarhus, Denmark
- Steno Diabetes Center Aarhus, Aarhus University Hospital, Aarhus, Denmark
| | - Oriol Canela-Xandri
- MRC Human Genetics Unit at the Institute of Genetics and Cancer, the University of Edinburgh, Edinburgh, UK
| | - Martijn F L Derks
- Animal Breeding and Genomics, Wageningen University & Research, Wageningen, the Netherlands
| | | | - Marta Gòdia
- Animal Breeding and Genomics, Wageningen University & Research, Wageningen, the Netherlands
| | - Ole Madsen
- Animal Breeding and Genomics, Wageningen University & Research, Wageningen, the Netherlands
| | - Martien A M Groenen
- Animal Breeding and Genomics, Wageningen University & Research, Wageningen, the Netherlands
| | - James E Koltes
- Department of Animal Science, Iowa State University, Ames, IA, USA
| | | | | | - Dominique Rocha
- GABI, AgroParisTech, INRAE, Paris-Saclay University, Jouy-en-Josas, France
| | - Elisabetta Giuffra
- GABI, AgroParisTech, INRAE, Paris-Saclay University, Jouy-en-Josas, France
| | - Marcel Amills
- Department of Animal Genetics, Centre for Research in Agricultural Genomics, CSIC-IRTA-UAB-UB, Campus de la Universitat Autònoma de Barcelona, Bellaterra, Spain
- Departament de Ciència Animal i dels Aliments, Universitat Autònoma de Barcelona, Bellaterra, Spain
| | - Alex Clop
- Department of Animal Genetics, Centre for Research in Agricultural Genomics, CSIC-IRTA-UAB-UB, Campus de la Universitat Autònoma de Barcelona, Bellaterra, Spain
- Consejo Superior de Investigaciones Científicas, Barcelona, Spain
| | - Maria Ballester
- Animal Breeding and Genetics Programme, Institut de Recerca i Tecnologia Agroalimentàries (IRTA), Torre Marimon, Caldes de Montbui, Spain
| | | | - Jing Li
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
- School of Agriculture and Life Sciences, Kunming University, Kunming, China
| | - Chao Fang
- LC-Bio Technologies, Co., Ltd, Hangzhou, China
| | - Ming Fang
- Key Laboratory of Healthy Mariculture for the East China Sea, Ministry of Agriculture and Rural Affairs, Jimei University, Xiamen, China
| | - Qishan Wang
- College of Animal Sciences, Zhejiang University, Hangzhou, China
| | - Zhuocheng Hou
- National Engineering Laboratory for Animal Breeding, State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Qin Wang
- National Engineering Laboratory for Animal Breeding, State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Fuping Zhao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Lin Jiang
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Guiping Zhao
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Zhengkui Zhou
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Rong Zhou
- Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Hehe Liu
- College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
| | - Juan Deng
- College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
| | - Long Jin
- College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
| | - Mingzhou Li
- College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
| | - Delin Mo
- State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
| | - Xiaohong Liu
- State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
| | - Yaosheng Chen
- State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
| | - Xiaolong Yuan
- State Key Laboratory of Swine and Poultry Breeding Industry, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
| | - Jiaqi Li
- State Key Laboratory of Swine and Poultry Breeding Industry, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China
| | - Shuhong Zhao
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of the Ministry of Education, Huazhong Agricultural University, Wuhan, China
- Yazhouwan National Laboratory, Sanya, China
| | - Yi Zhang
- National Engineering Laboratory for Animal Breeding, State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Xiangdong Ding
- National Engineering Laboratory for Animal Breeding, State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Dongxiao Sun
- National Engineering Laboratory for Animal Breeding, State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Hui-Zeng Sun
- Key Laboratory of Dairy Cow Genetic Improvement and Milk Quality Research of Zhejiang Province, College of Animal Sciences, Zhejiang University, Hangzhou, China
| | - Cong Li
- College of Animal Science and Technology, Northwest A&F University, Yangling, China
| | - Yu Wang
- College of Animal Science and Technology, Northwest A&F University, Yangling, China
| | - Yu Jiang
- College of Animal Science and Technology, Northwest A&F University, Yangling, China
| | - Dongdong Wu
- Key Laboratory of Genetic Evolution and Animal Models, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
| | - Wenwen Wang
- Shandong Provincial Key Laboratory for Livestock Germplasm Innovation and Utilization, College of Animal Science, Shandong Agricultural University, Tai'an, China
| | - Xinzhong Fan
- Shandong Provincial Key Laboratory for Livestock Germplasm Innovation and Utilization, College of Animal Science, Shandong Agricultural University, Tai'an, China
| | - Qin Zhang
- Shandong Provincial Key Laboratory for Livestock Germplasm Innovation and Utilization, College of Animal Science, Shandong Agricultural University, Tai'an, China
| | - Kui Li
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Hao Zhang
- National Engineering Laboratory for Animal Breeding, State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Ning Yang
- National Engineering Laboratory for Animal Breeding, State Key Laboratory of Animal Biotech Breeding, Key Laboratory of Animal Genetics, Breeding and Reproduction of the Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, China Agricultural University, Beijing, China
| | - Xiaoxiang Hu
- State Key Laboratory of Animal Biotech Breeding, College of Biological Sciences, China Agricultural University, Beijing, China
| | - Wen Huang
- Department of Animal Science, Michigan State University, East Lansing, MI, USA
| | - Jiuzhou Song
- Department of Animal and Avian Sciences, University of Maryland, College Park, MD, USA
| | - Yang Wu
- Institute of Rare Diseases, West China Hospital of Sichuan University, Chengdu, China
| | - Jian Yang
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, China
- School of Life Sciences, Westlake University, Hangzhou, China
| | - Weiwei Wu
- Institute of Animal Science, Xinjiang Academy of Animal Science, Ürümqi City, China
| | - Claudia Kasper
- Animal GenoPhenomics, Animal Production Systems and Animal Health, Agroscope Posieux, Fribourg, Switzerland
| | - Xinfeng Liu
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, Denmark
- State Key Laboratory of Herbage Improvement and Grassland Agro-ecosystem, College of Ecology, Lanzhou University, Lanzhou, China
| | - Xiaofei Yu
- College of Marine Life Sciences, Ocean University of China, Qingdao, China
| | - Leilei Cui
- School of Life Sciences, Nanchang University, Nanchang, China
- Jiangxi Province Key Laboratory of Aging and Disease, Human Aging Research Institute and School of Life Science, Nanchang University, Jiangxi, China
| | - Xiang Zhou
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
| | - Seyoung Kim
- Department of Epidemiology, School of Public Health, University of Pittsburgh, Pittsburgh, PA, USA
| | - Wei Li
- Division of Computational Biomedicine, Department of Biological Chemistry, School of Medicine, University of California, Irvine, Irvine, CA, USA
| | - Hae Kyung Im
- Department of Medicine and Human Genetics, the University of Chicago, Chicago, IL, USA
| | - Edward S Buckler
- Section of Plant Breeding and Genetics, Cornell University, Ithaca, NY, USA
- Institute for Genomic Diversity, Cornell University, Ithaca, NY, USA
- Agricultural Research Service, United States Department of Agriculture, Ithaca, NY, USA
| | - Bing Ren
- Department of Cellular and Molecular Medicine, Center for Epigenomics, Moores Cancer Center and Institute of Genomic Medicine, University of California San Diego, School of Medicine, La Jolla, CA, USA
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Jingyi Jessica Li
- Department of Statistics and Data Science, University of California, Los Angeles, Los Angeles, CA, USA.
| | - Abraham A Palmer
- Department of Psychiatry, University of California San Diego, La Jolla, CA, USA.
- Institute for Genomic Medicine, University of California San Diego, La Jolla, CA, USA.
| | - Laurent Frantz
- Palaeogenomics Group, Institute of Palaeoanatomy, Domestication Research and the History of Veterinary Medicine, Ludwig-Maximilians-Universität, Munich, Germany.
- School of Biological and Behavioural Sciences, Queen Mary University of London, London, UK.
| | - Huaijun Zhou
- Department of Animal Science, University of California, Davis, Davis, CA, USA.
| | - Zhe Zhang
- State Key Laboratory of Swine and Poultry Breeding Industry, National Engineering Research Center for Breeding Swine Industry, Guangdong Provincial Key Lab of Agro-Animal Genomics and Molecular Breeding, College of Animal Science, South China Agricultural University, Guangzhou, China.
| | - George E Liu
- Animal Genomics and Improvement Laboratory, Henry A. Wallace Beltsville Agricultural Research Center, Agricultural Research Service, USDA, Beltsville, MD, USA.
|
24
|
Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic language models: opportunities and challenges. Trends Genet 2025; 41:286-302. [PMID: 39753409 DOI: 10.1016/j.tig.2024.11.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/16/2024] [Revised: 11/21/2024] [Accepted: 11/21/2024] [Indexed: 04/10/2025]
Abstract
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of natural language processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic language models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
Affiliation(s)
- Gonzalo Benegas
- Computer Science Division, University of California, Berkeley, CA, USA
| | - Chengzhong Ye
- Department of Statistics, University of California, Berkeley, CA, USA
| | - Carlos Albors
- Computer Science Division, University of California, Berkeley, CA, USA
| | - Jianan Canal Li
- Computer Science Division, University of California, Berkeley, CA, USA
| | - Yun S Song
- Computer Science Division, University of California, Berkeley, CA, USA; Department of Statistics, University of California, Berkeley, CA, USA; Center for Computational Biology, University of California, Berkeley, CA, USA.
|
25
|
Cheng S, Wei Y, Zhou Y, Xu Z, Wright DN, Liu J, Peng Y. Deciphering genomic codes using advanced natural language processing techniques: a scoping review. J Am Med Inform Assoc 2025; 32:761-772. [PMID: 39998912 PMCID: PMC12005631 DOI: 10.1093/jamia/ocaf029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2024] [Revised: 01/20/2025] [Accepted: 02/05/2025] [Indexed: 02/27/2025] [Open Access]
Abstract
OBJECTIVES: The vast and complex nature of human genomic sequencing data presents challenges for effective analysis. This review aims to investigate the application of natural language processing (NLP) techniques, particularly large language models (LLMs) and transformer architectures, in deciphering genomic codes, focusing on tokenization, transformer models, and regulatory annotation prediction. The goal of this review is to assess data and model accessibility in the most recent literature, gaining a better understanding of the existing capabilities and constraints of these tools in processing genomic sequencing data.
MATERIALS AND METHODS: Following Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, our scoping review was conducted across PubMed, Medline, Scopus, Web of Science, Embase, and ACM Digital Library. Studies were included if they focused on NLP methodologies applied to genomic sequencing data analysis, without restrictions on publication date or article type.
RESULTS: A total of 26 studies published between 2021 and April 2024 were selected for review. The review highlights that tokenization and transformer models enhance the processing and understanding of genomic data, with applications in predicting regulatory annotations like transcription-factor binding sites and chromatin accessibility.
DISCUSSION: The application of NLP and LLMs to genomic sequencing data interpretation is a promising field that can help streamline the processing of large-scale genomic data while also providing a better understanding of its complex structures. It has the potential to drive advancements in personalized medicine by offering more efficient and scalable solutions for genomic analysis. Further research is also needed to discuss and overcome current limitations, enhancing model transparency and applicability.
CONCLUSION: This review highlights the growing role of NLP, particularly LLMs, in genomic sequencing data analysis. While these models improve data processing and regulatory annotation prediction, challenges remain in accessibility and interpretability. Further research is needed to refine their application in genomics.
Affiliation(s)
- Shuyan Cheng
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| | - Yishu Wei
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| | - Yiliang Zhou
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| | - Zihan Xu
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| | - Drew N Wright
- Samuel J. Wood Library & C.V. Starr Biomedical Information Center, Weill Cornell Medicine, New York, NY 10065, United States
| | - Jinze Liu
- School of Public Health, Virginia Commonwealth University, Richmond, VA 23219, United States
| | - Yifan Peng
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
|
26
|
Guo F, Guan R, Li Y, Liu Q, Wang X, Yang C, Wang J. Foundation models in bioinformatics. Natl Sci Rev 2025; 12:nwaf028. [PMID: 40078374 PMCID: PMC11900445 DOI: 10.1093/nsr/nwaf028] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/30/2024] [Revised: 12/17/2024] [Accepted: 01/08/2025] [Indexed: 03/14/2025] [Open Access]
Abstract
With the adoption of foundation models (FMs), artificial intelligence (AI) has become increasingly significant in bioinformatics and has successfully addressed many historical challenges, such as pre-training frameworks, model evaluation and interpretability. FMs demonstrate notable proficiency in managing large-scale, unlabeled datasets, which is valuable because experimental procedures are costly and labor intensive. In various downstream tasks, FMs have consistently achieved noteworthy results, demonstrating high levels of accuracy in representing biological entities. The application of FMs has ushered in a new era in computational biology, focusing on both general and specific biological issues. In this review, we introduce recent advancements in bioinformatics FMs employed in a variety of downstream tasks, including genomics, transcriptomics, proteomics, drug discovery and single-cell analysis. Our aim is to assist scientists in selecting appropriate FMs in bioinformatics, according to four model types: language FMs, vision FMs, graph FMs and multimodal FMs. In addition to understanding molecular landscapes, AI technology can establish the theoretical and practical foundation for continued innovation in molecular biology.
Affiliation(s)
- Fei Guo
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Xiangjiang Laboratory, Changsha 410083, China
| | - Renchu Guan
- Key Laboratory for Symbol Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Yaohang Li
- Department of Computer Science, Old Dominion University, Norfolk 23529, USA
| | - Qi Liu
- School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Xiaowo Wang
- Department of Automation, Tsinghua University, Beijing 100084, China
| | - Can Yang
- Department of Mathematics, State Key Laboratory of Molecular Neuroscience, and Big Data Bio-Intelligence Lab, The Hong Kong University of Science and Technology, Hong Kong, China
| | - Jianxin Wang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Xiangjiang Laboratory, Changsha 410083, China
|
27
|
Yu T, Cheng L, Khalitov R, Yang Z. A sparse and wide neural network model for DNA sequences. Neural Netw 2025; 184:107040. [PMID: 39709643 DOI: 10.1016/j.neunet.2024.107040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Revised: 10/23/2024] [Accepted: 12/07/2024] [Indexed: 12/24/2024]
Abstract
Accurate modeling of DNA sequences requires capturing distant semantic relationships between nucleotide bases. Most existing deep neural network models face two challenges: (1) they are limited to short DNA fragments and cannot capture long-range interactions, and (2) they require many supervised labels, which are often expensive to obtain in practice. We propose a new neural network model called SwanDNA to address these challenges. By using a sparse and wide network architecture, our model enables inference over very long DNA sequences. By incorporating the neural network into a self-supervised learning framework, our method can give accurate predictions while using fewer supervised labels. We evaluate SwanDNA on three DNA sequence inference tasks: human variant effect prediction, open chromatin region detection in plant genes, and GenomicBenchmarks. SwanDNA outperforms all competitors in the first two tasks and achieves state-of-the-art results on seven of eight datasets in GenomicBenchmarks. Our code is available at https://github.com/wiedersehne/SwanDNA.
Affiliation(s)
- Tong Yu
- Norwegian University of Science and Technology, Trondheim, Norway.
| | - Lei Cheng
- Norwegian University of Science and Technology, Trondheim, Norway
| | - Ruslan Khalitov
- Norwegian University of Science and Technology, Trondheim, Norway
| | - Zhirong Yang
- Norwegian University of Science and Technology, Trondheim, Norway; Jinhua Institute of Zhejiang University, Hangzhou, China
|
28
|
Romer AS, Grisnik M, Dallas JW, Sutton W, Murray CM, Hardman RH, Blanchard T, Hanscom RJ, Clark RW, Godwin C, Alexander NR, Moe KC, Cobb VA, Eaker J, Colvin R, Thames D, Ogle C, Campbell J, Frost C, Brubaker RL, Snyder SD, Rurik AJ, Cummins CE, Ludwig DW, Phillips JL, Walker DM. Effects of snake fungal disease (ophidiomycosis) on the skin microbiome across two major experimental scales. Conservation Biology 2025; 39:e14411. [PMID: 39530499 PMCID: PMC11959348 DOI: 10.1111/cobi.14411] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Revised: 06/26/2024] [Accepted: 07/29/2024] [Indexed: 11/16/2024]
Abstract
Emerging infectious diseases are increasingly recognized as a significant threat to global biodiversity conservation. Elucidating the relationship between pathogens and the host microbiome could lead to novel approaches for mitigating disease impacts. Pathogens can alter the host microbiome by inducing dysbiosis, an ecological state characterized by a reduction in bacterial alpha diversity, an increase in pathobionts, or a shift in beta diversity. We used the snake fungal disease (SFD; ophidiomycosis) system to examine how an emerging pathogen may induce dysbiosis across two experimental scales. We used quantitative polymerase chain reaction, bacterial amplicon sequencing, and a deep learning neural network to characterize the skin microbiome of free-ranging snakes across a broad phylogenetic and spatial extent. Habitat suitability models were used to find variables associated with fungal presence on the landscape. We also conducted a laboratory study of northern watersnakes to examine temporal changes in the skin microbiome following inoculation with Ophidiomyces ophidiicola. Patterns characteristic of dysbiosis were found at both scales, as were nonlinear changes in alpha diversity and alterations in beta diversity, although structural-level and dispersion changes differed between field and laboratory contexts. The neural network was far more accurate (99.8% positive predictive value [PPV]) in predicting disease state than other analytic techniques (36.4% PPV). The genus Pseudomonas was characteristic of disease-negative microbiomes, whereas positive snakes were characterized by the pathobionts Chryseobacterium, Paracoccus, and Sphingobacterium. Geographic regions suitable for O. ophidiicola had high pathogen loads (>0.66 maximum sensitivity + specificity).
We found that pathogen-induced dysbiosis of the microbiome followed predictable trends, that disease state could be classified with neural network analyses, and that habitat suitability models predicted habitat for the SFD pathogen.
Affiliation(s)
- Alexander S. Romer
- Department of Biology, Middle Tennessee State University, Murfreesboro, Tennessee, USA
| | - Matthew Grisnik
- Department of Biology, Coastal Carolina University, Conway, South Carolina, USA
| | - Jason W. Dallas
- Department of Biology, Middle Tennessee State University, Murfreesboro, Tennessee, USA
| | - William Sutton
- Department of Agricultural and Environmental Sciences, Tennessee State University, Nashville, Tennessee, USA
| | - Christopher M. Murray
- Department of Biological Sciences, Southeastern Louisiana University, Hammond, Louisiana, USA
| | | | - Tom Blanchard
- Department of Biological Sciences, University of Tennessee at Martin, Martin, Tennessee, USA
| | - Ryan J. Hanscom
- Department of Biology, San Diego State University, San Diego, California, USA
| | - Rulon W. Clark
- Department of Biology, San Diego State University, San Diego, California, USA
| | - Cody Godwin
- Department of Natural Sciences, Santa Fe College, Gainesville, Florida, USA
| | - N. Reed Alexander
- Department of Biology, Middle Tennessee State University, Murfreesboro, Tennessee, USA
| | - Kylie C. Moe
- Department of Biology, Middle Tennessee State University, Murfreesboro, Tennessee, USA
| | - Vincent A. Cobb
- Department of Biology, Middle Tennessee State University, Murfreesboro, Tennessee, USA
| | - Jesse Eaker
- Department of Natural Sciences, Santa Fe College, Gainesville, Florida, USA
| | - Rob Colvin
- Tennessee Wildlife Resources Agency, Nashville, Tennessee, USA
| | - Dustin Thames
- Tennessee Wildlife Resources Agency, Nashville, Tennessee, USA
| | - Chris Ogle
- Tennessee Wildlife Resources Agency, Nashville, Tennessee, USA
| | - Josh Campbell
- Tennessee Wildlife Resources Agency, Nashville, Tennessee, USA
| | - Carlin Frost
- Department of Biology, Coastal Carolina University, Conway, South Carolina, USA
| | | | - Shawn D. Snyder
- Department of Wildlife, Fisheries and Conservation Biology, University of Maine, Orono, Maine, USA
| | - Alexander J. Rurik
- Department of Biology, Middle Tennessee State University, Murfreesboro, Tennessee, USA
| | - Chloe E. Cummins
- Department of Biology, Middle Tennessee State University, Murfreesboro, Tennessee, USA
| | - David W. Ludwig
- Department of Computer Science, Middle Tennessee State University, Murfreesboro, Tennessee, USA
| | - Joshua L. Phillips
- Department of Computer Science, Middle Tennessee State University, Murfreesboro, Tennessee, USA
| | - Donald M. Walker
- Department of Biology, Middle Tennessee State University, Murfreesboro, Tennessee, USA
|
29
|
González M, Durán RE, Seeger M, Araya M, Jara N. Negative dataset selection impacts machine learning-based predictors for multiple bacterial species promoters. Bioinformatics 2025; 41:btaf135. [PMID: 40152247 PMCID: PMC11993300 DOI: 10.1093/bioinformatics/btaf135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2024] [Revised: 03/13/2025] [Accepted: 03/25/2025] [Indexed: 03/29/2025] Open
Abstract
MOTIVATION: Advances in bacterial promoter predictors based on machine learning have greatly improved identification metrics. However, existing models have overlooked the impact of negative datasets, which was previously identified as GC-content discrepancies between positive and negative datasets in single-species models. This study aims to investigate whether multiple-species models for promoter classification are inherently biased due to the selection criteria of negative datasets. We further explore whether the generation of synthetic random sequences (SRS) that mimic the GC-content distribution of promoters can partly reduce this bias. RESULTS: Multiple-species predictors exhibited GC-content bias when using CDS as a negative dataset, as suggested by specificity and sensitivity metrics in a species-specific manner and investigated by dimensionality reduction. We demonstrated a reduction in this bias by using the SRS dataset, with less detection of background noise in real genomic data. In both scenarios, DNABERT showed the best metrics. These findings suggest that GC-balanced datasets can enhance the generalizability of promoter predictors across Bacteria. AVAILABILITY AND IMPLEMENTATION: The source code of the experiments is freely available at https://github.com/maigonzalezh/MultispeciesPromoterClassifier.
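The idea of a GC-balanced negative set can be illustrated with a short Python sketch. The abstract does not describe how the authors generate their SRS, so the per-sequence matching below (same length and same G+C count as each promoter) is an assumption made for illustration, as are the function names and the toy sequences.

```python
import random


def gc_content(seq: str) -> float:
    """Fraction of G or C bases in a DNA sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def gc_matched_random_sequence(template: str, rng: random.Random) -> str:
    """Random sequence with the same length and G+C count as `template`.

    Hypothetical helper: one plausible way to mimic a promoter's GC content,
    not necessarily the procedure used in the paper.
    """
    n_gc = sum(base in "GC" for base in template.upper())
    n_at = len(template) - n_gc
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AT") for _ in range(n_at)]
    rng.shuffle(bases)  # destroy any positional signal
    return "".join(bases)


def make_srs_negatives(promoters, seed=0):
    """Build one GC-matched synthetic negative per positive sequence."""
    rng = random.Random(seed)
    return [gc_matched_random_sequence(p, rng) for p in promoters]


# Toy inputs for illustration only, not real promoter sequences.
promoters = ["TTGACATATAATGCGC", "AATTTTATAATGGCAT"]
negatives = make_srs_negatives(promoters)
```

Because each negative shares its template's G+C count, a classifier trained on such pairs cannot separate the classes by GC content alone, which is the bias the abstract attributes to CDS-derived negatives.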
Affiliation(s)
- Marcelo González
- Departamento de Electrónica, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso 2390123, Chile
| | - Roberto E Durán
- Laboratorio de Microbiología Molecular y Biotecnología Ambiental, Department of Chemistry & Center of Biotechnology Daniel Alkalay Lowitt, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso 2390123, Chile
- Millennium Nucleus Bioproducts, Genomics and Environmental Microbiology (BioGEM), Avenida España 1680, Valparaíso 2390123, Chile
| | - Michael Seeger
- Laboratorio de Microbiología Molecular y Biotecnología Ambiental, Department of Chemistry & Center of Biotechnology Daniel Alkalay Lowitt, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso 2390123, Chile
- Millennium Nucleus Bioproducts, Genomics and Environmental Microbiology (BioGEM), Avenida España 1680, Valparaíso 2390123, Chile
| | - Mauricio Araya
- Departamento de Electrónica, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso 2390123, Chile
| | - Nicolás Jara
- Departamento de Electrónica, Universidad Técnica Federico Santa María, Avenida España 1680, Valparaíso 2390123, Chile
|
30
|
Refahi M, Sokhansanj BA, Mell JC, Brown JR, Yoo H, Hearne G, Rosen GL. Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization. Commun Biol 2025; 8:517. [PMID: 40155693 PMCID: PMC11953366 DOI: 10.1038/s42003-025-07902-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Accepted: 03/07/2025] [Indexed: 04/01/2025] Open
Abstract
Analysis of genomic and metagenomic sequences is inherently more challenging than that of amino acid sequences due to the higher divergence among evolutionarily related nucleotide sequences, variable k-mer and codon usage within and among genomes of diverse species, and poorly understood selective constraints. We introduce Scorpio (Sequence Contrastive Optimization for Representation and Predictive Inference on DNA), a versatile framework designed for nucleotide sequences that employs contrastive learning to improve embeddings. By leveraging pre-trained genomic language models and k-mer frequency embeddings, Scorpio demonstrates competitive performance in diverse applications, including taxonomic and gene classification, antimicrobial resistance (AMR) gene identification, and promoter detection. A key strength of Scorpio is its ability to generalize to novel DNA sequences and taxa, addressing a significant limitation of alignment-based methods. Scorpio has been tested on multiple datasets with DNA sequences of varying lengths (long and short) and shows robust inference capabilities. Additionally, we provide an analysis of the biological information underlying this representation, including correlations between codon adaptation index as a gene expression factor, sequence similarity, and taxonomy, as well as the functional and structural information of genes.
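As a rough illustration of what a contrastive objective on embeddings looks like, the Python sketch below computes a triplet margin loss. The abstract does not state which contrastive loss Scorpio uses, so the choice of triplet loss, the Euclidean distance, the margin value, and the toy vectors are all assumptions.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    return max(0.0, l2_distance(anchor, positive)
               - l2_distance(anchor, negative) + margin)


# Toy 2-D embeddings (illustrative values only).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # e.g. a sequence from the same gene or taxon
far_negative = [3.0, 4.0]    # already well separated: contributes no loss
near_negative = [0.0, 0.5]   # closer than the positive: penalized

loss_far = triplet_loss(anchor, positive, far_negative)
loss_near = triplet_loss(anchor, positive, near_negative)
```

Minimizing a loss of this kind pulls related sequences together and pushes unrelated ones apart in embedding space, which is the general mechanism by which contrastive training can improve embeddings.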
Affiliation(s)
| | - Bahrad A Sokhansanj
- Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
| | - Joshua C Mell
- College of Medicine, Drexel University, Philadelphia, PA, USA
| | - James R Brown
- Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
| | - Hyunwoo Yoo
- Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
| | - Gavin Hearne
- Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
| | - Gail L Rosen
- Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA.
|
31
|
Tyagi N, Vahab N, Tyagi S. Genome language modeling (GLM): a beginner's cheat sheet. Biol Methods Protoc 2025; 10:bpaf022. [PMID: 40370585 PMCID: PMC12077296 DOI: 10.1093/biomethods/bpaf022] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/07/2025] [Revised: 02/17/2025] [Accepted: 03/23/2025] [Indexed: 05/16/2025] Open
Abstract
Integrating genomics with diverse data modalities has the potential to revolutionize personalized medicine. However, this integration poses significant challenges due to the fundamental differences in data types and structures. The vast size of the genome necessitates transformation into a condensed representation containing key biomarkers and relevant features to ensure interoperability with other modalities. This commentary explores both conventional and state-of-the-art approaches to genome language modeling (GLM), with a focus on representing and extracting meaningful features from genomic sequences. We focus on the latest trends in applying language modeling techniques to genomic sequence data, treating it as a text modality. Effective feature extraction is essential for enabling machine learning models to analyze large genomic datasets, particularly within multimodal frameworks. We first provide a step-by-step guide to various genomic sequence preprocessing and tokenization techniques. Then we explore feature extraction methods for the transformation of tokens using frequency, embedding, and neural network-based approaches. Finally, we discuss machine learning (ML) applications in genomics, focusing on classification, regression, language processing algorithms, and multimodal integration. Additionally, we explore the role of GLM in functional annotation, emphasizing how advanced ML models, such as Bidirectional Encoder Representations from Transformers (BERT), enhance the interpretation of genomic data. To the best of our knowledge, we compile the first end-to-end analytic guide to convert complex genomic data into biologically interpretable information using GLM, thereby facilitating the development of novel data-driven hypotheses.
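The tokenization and frequency-based feature extraction steps mentioned in the abstract can be sketched in a few lines of Python. This is a minimal illustration and not code from the commentary: the function names, the stride parameter, and the fixed A/C/G/T vocabulary are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_tokenize(seq, k, stride=1):
    """Split a DNA sequence into k-mers.

    stride=1 gives overlapping tokens; stride=k gives non-overlapping tokens.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_frequency_vector(seq, k):
    """Relative frequency of every possible A/C/G/T k-mer, in lexicographic order.

    Tokens containing other symbols (for example N) are ignored.
    """
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(kmer_tokenize(seq, k))
    total = sum(counts[kmer] for kmer in vocab)
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


overlapping = kmer_tokenize("ATGCGT", k=3)               # ATG, TGC, GCG, CGT
non_overlapping = kmer_tokenize("ATGCGT", k=3, stride=3)  # ATG, CGT
features = kmer_frequency_vector("AACG", k=1)             # order: A, C, G, T
```

A frequency vector has a fixed length of 4^k regardless of sequence length, which makes it convenient as input to classical classifiers; token lists, by contrast, are what embedding and transformer-based models consume.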
Affiliation(s)
- Navya Tyagi
- AI and Data Science, Indian Institute of Technology, Madras, Chennai 600036, Tamil Nadu, India
- Amity Institute of Integrative Health Sciences, Amity University, Gurugram 122412, Haryana, India
| | - Naima Vahab
- School of Computing Technologies, Royal Melbourne Institute of Technology (RMIT) University, 3001 Melbourne, Australia
| | - Sonika Tyagi
- School of Computing Technologies, Royal Melbourne Institute of Technology (RMIT) University, 3001 Melbourne, Australia
|
32
|
Citu C, Chang L, Manuel AM, Enduru N, Zhao Z. Identification and catalog of viral transcriptional regulators in human diseases. iScience 2025; 28:112081. [PMID: 40124487 PMCID: PMC11928865 DOI: 10.1016/j.isci.2025.112081] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2024] [Revised: 01/02/2025] [Accepted: 02/18/2025] [Indexed: 03/25/2025] Open
Abstract
Viral genomes encode viral transcriptional regulators (vTRs) that manipulate host gene expression to facilitate replication and evade immune detection. Nevertheless, their role in non-cancerous diseases remains largely underexplored. Here, we unveiled 268 new candidate vTRs from 14 of the 20 viral families we investigated. We mapped vTRs' genome-wide binding profiles and identified their potential human targets, which were enriched in immune-mediated pathways, neurodegenerative disorders, and cancers. Through vTR DNA-binding preference analysis, 283 virus-specific and human-like motifs were identified. Prioritized Epstein-Barr virus (EBV) vTR target genes were associated with multiple sclerosis (MS), rheumatoid arthritis, and systemic lupus erythematosus. The partitioned heritability study among 19 diseases indicated significant enrichment of these diseases in EBV vTR-binding sites, implicating EBV vTRs' roles in immune-mediated disorders. Finally, drug repurposing analysis pinpointed candidate drugs for MS, asthma, and Alzheimer disease. This study enhances our understanding of vTRs in diverse human diseases and identifies potential therapeutic targets for future investigation.
Affiliation(s)
- Citu Citu
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Le Chang
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Astrid M. Manuel
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Nitesh Enduru
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Zhongming Zhao
- Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
|
33
|
Xie H, Wang L, Qian Y, Ding Y, Guo F. Methyl-GP: accurate generic DNA methylation prediction based on a language model and representation learning. Nucleic Acids Res 2025; 53:gkaf223. [PMID: 40156859 PMCID: PMC11952970 DOI: 10.1093/nar/gkaf223] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2024] [Revised: 01/24/2025] [Accepted: 03/12/2025] [Indexed: 04/01/2025] Open
Abstract
Accurate prediction of DNA methylation remains a challenge. Identifying DNA methylation is important for understanding its functions and elucidating its role in gene regulation mechanisms. In this study, we propose Methyl-GP, a general predictor that accurately predicts three types of DNA methylation from DNA sequences. We found that the conservation of sequence patterns among different species contributes to enhancing the generalizability of the model. By fine-tuning a language model on a dataset comprising multiple species with similar sequence patterns and employing a fusion module to integrate embeddings into a high-quality comprehensive representation, Methyl-GP demonstrates satisfactory predictive performance in methylation identification. Experiments on 17 benchmark datasets for three types of DNA methylation (4mC, 5hmC, and 6mA) demonstrate the superiority of Methyl-GP over existing predictors. Furthermore, by utilizing the attention mechanism, we have visualized the sequence patterns learned by the model, which may help us to gain a deeper understanding of methylation patterns across various species.
Affiliation(s)
- Hao Xie
- School of Computer Science and Engineering, Central South University, Hunan, Changsha 410000, China
| | - Leyao Wang
- College of Intelligence and Computing, Tianjin University, Tianjin, Tianjin 300350, China
| | - Yuqing Qian
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Sichuan, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Zhejiang, Quzhou 324000, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Zhejiang, Quzhou 324000, China
| | - Fei Guo
- School of Computer Science and Engineering, Central South University, Hunan, Changsha 410000, China
|
34
|
Sathian R, Dutta P, Ay F, Davuluri RV. Genomic Language Model for Predicting Enhancers and Their Allele-Specific Activity in the Human Genome. bioRxiv 2025:2025.03.18.644040. [PMID: 40166250 PMCID: PMC11957021 DOI: 10.1101/2025.03.18.644040] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
Predicting and deciphering the regulatory logic of enhancers is a challenging problem, due to the intricate sequence features and lack of consistent genetic or epigenetic signatures that can accurately discriminate enhancers from other genomic regions. Recent machine-learning based methods have spotlighted the importance of extracting the nucleotide composition of enhancers but fail to learn the sequence context and perform suboptimally. Motivated by advances in genomic language models, we developed DNABERT-Enhancer, a novel enhancer prediction method, by applying the DNABERT pre-trained language model to the human genome. We trained two different models, using a large collection of enhancers curated from the ENCODE registry of candidate cis-Regulatory Elements. The best fine-tuned model achieved 88.05% accuracy with a Matthews correlation coefficient of 76% on independent set-aside data. Further, we present the analysis of the predicted enhancers for all chromosomes of the human genome by comparing them with the enhancer regions reported in publicly available databases. Finally, we applied DNABERT-Enhancer along with other DNABERT-based regulatory genomic region prediction models to predict candidate SNPs with allele-specific enhancer and transcription factor binding activity. The genome-wide enhancer annotations and candidate loss-of-function genetic variants predicted by DNABERT-Enhancer provide valuable resources for genome interpretation in functional and clinical genomics studies.
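For readers unfamiliar with the two metrics reported above, the following Python sketch computes accuracy and the Matthews correlation coefficient (MCC) from a binary confusion matrix using their standard definitions. The confusion-matrix counts are made-up illustrative numbers, not results from the paper.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for an enhancer / non-enhancer classifier.
tp, tn, fp, fn = 45, 40, 10, 5
acc = accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so it is typically lower than accuracy and is less easily inflated by class imbalance, which is why the two numbers are often reported together.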
|
35
|
Su Q, Phan LT, Pham NT, Wei L, Manavalan B. MST-m6A: A Novel Multi-Scale Transformer-based Framework for Accurate Prediction of m6A Modification Sites Across Diverse Cellular Contexts. J Mol Biol 2025; 437:168856. [PMID: 39510345 DOI: 10.1016/j.jmb.2024.168856] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2024] [Revised: 10/23/2024] [Accepted: 11/02/2024] [Indexed: 11/15/2024]
Abstract
N6-methyladenosine (m6A) modification, a prevalent epigenetic mark in eukaryotic cells, is crucial in regulating gene expression and RNA metabolism. Accurately identifying m6A modification sites is essential for understanding their functions within biological processes and the intricate mechanisms that regulate them. Recent advances in high-throughput sequencing technologies have enabled the generation of extensive datasets characterizing m6A modification sites at single-nucleotide resolution, leading to the development of computational methods for identifying m6A RNA modification sites. However, most current methods focus on specific cell lines, limiting their generalizability and practical application across diverse biological contexts. To address the limitation, we propose MST-m6A, a novel approach for identifying m6A modification sites with higher accuracy across various cell lines and tissues. MST-m6A utilizes a multi-scale transformer-based architecture, employing dual k-mer tokenization to capture rich feature representations and global contextual information from RNA sequences at multiple levels of granularity. These representations are then effectively combined using a channel fusion mechanism and further processed by a convolutional neural network to enhance prediction accuracy. Rigorous validation demonstrates that MST-m6A significantly outperforms conventional machine learning models, deep learning models, and state-of-the-art predictors. We anticipate that the high precision and cross-cell-type adaptability of MST-m6A will provide valuable insights into m6A biology and facilitate advancements in related fields. The proposed approach is available at https://github.com/cbbl-skku-org/MST-m6A/ for prediction and reproducibility purposes.
Affiliation(s)
- Qiaosen Su
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Le Thi Phan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Nhat Truong Pham
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea
| | - Leyi Wei
- Faculty of Applied Sciences, Macao Polytechnic University, Macau
| | - Balachandran Manavalan
- Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Republic of Korea.
|
36
|
Nagarajan V, Shi G, Horai R, Yu CR, Gopalakrishnan J, Yadav M, Liew MH, Gentilucci C, Caspi RR. IAN: An Intelligent System for Omics Data Analysis and Discovery. bioRxiv 2025:2025.03.06.640921. [PMID: 40161796 PMCID: PMC11952324 DOI: 10.1101/2025.03.06.640921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
IAN is an R package that addresses the challenge of integrating, analyzing, and interpreting high-throughput "omics" data using a multi-agent artificial intelligence (AI) system. IAN leverages popular pathway and regulatory datasets (KEGG, WikiPathways, Reactome, GO, ChEA) and the STRING database for protein-protein interactions to perform standard enrichment analysis. The individual enrichment results are then used to generate summaries for each of the datasets, using a large language model (LLM) through a multi-agent architecture. These summaries are then contextually integrated and interpreted by the LLM, guided by carefully engineered prompts and grounding instructions, to provide explanations, a system overview, key regulators, and novel observations, among other outputs. We demonstrate IAN's potential to facilitate biological discovery from complex omics data by reanalyzing two previously published datasets and evaluating the results. We also show that IAN performs remarkably well at avoiding hallucination. The IAN package, along with installation instructions and example usage, is available at https://github.com/NIH-NEI/IAN.
Affiliation(s)
| | - Guangpu Shi
- Laboratory of Immunology, National Eye Institute, NIH, Bethesda 20892, USA
| | - Reiko Horai
- Laboratory of Immunology, National Eye Institute, NIH, Bethesda 20892, USA
| | - Cheng-Rong Yu
- Molecular Immunology Section, National Eye Institute, NIH, Bethesda 20892, USA
| | | | - Manoj Yadav
- Molecular Immunology Section, National Eye Institute, NIH, Bethesda 20892, USA
| | - Michael H Liew
- Neuro-Immune Regulome Unit, National Eye Institute, NIH, Bethesda 20892, USA
| | - Calla Gentilucci
- Laboratory of Immunology, National Eye Institute, NIH, Bethesda 20892, USA
| | - Rachel R Caspi
- Laboratory of Immunology, National Eye Institute, NIH, Bethesda 20892, USA
|
37
|
Cherednichenko O, Herbert A, Poptsova M. Benchmarking DNA large language models on quadruplexes. Comput Struct Biotechnol J 2025; 27:992-1000. [PMID: 40160857 PMCID: PMC11953744 DOI: 10.1016/j.csbj.2025.03.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/16/2025] [Revised: 03/03/2025] [Accepted: 03/04/2025] [Indexed: 04/02/2025] Open
Abstract
Large language models (LLMs) in genomics have successfully predicted various functional genomic elements. While their performance is typically evaluated using genomic benchmark datasets, it remains unclear which LLM is best suited for specific downstream tasks, particularly for generating whole-genome annotations. Current LLMs in genomics fall into three main categories: transformer-based models, long convolution-based models, and state-space models (SSMs). In this study, we benchmarked three different types of LLM architectures for generating whole-genome maps of G-quadruplexes (GQ), a type of flipon, or non-B DNA structure, characterized by distinctive patterns and functional roles in diverse regulatory contexts. Although GQ forms from folding guanosine residues into tetrads, the computational task is challenging as the bases involved may be on different strands, separated by a large number of nucleotides, or made from RNA rather than DNA. All LLMs performed comparably well, with DNABERT-2 and HyenaDNA achieving superior results based on F1 and MCC. Analysis of whole-genome annotations revealed that HyenaDNA recovered more quadruplexes in distal enhancers and intronic regions. The models were better suited to detecting large GQ arrays that likely contribute to the nuclear condensates involved in gene transcription and chromosomal scaffolds. HyenaDNA and Caduceus formed a separate grouping in the generated de novo quadruplexes, while transformer-based models clustered together. Overall, our findings suggest that different types of LLMs complement each other. Genomic architectures with varying context lengths can detect distinct functional regulatory elements, underscoring the importance of selecting the appropriate model based on the specific genomic task. The code and data underlying this article are available at https://github.com/powidla/G4s-FMs.
Affiliation(s)
| | - Alan Herbert
- International Laboratory of Bioinformatics, HSE University, Moscow, Russia
- InsideOutBio, Charlestown, MA, USA
| | - Maria Poptsova
- International Laboratory of Bioinformatics, HSE University, Moscow, Russia
|
38
|
Holur P, Enevoldsen KC, Rajesh S, Mboning L, Georgiou T, Bouchard LS, Pellegrini M, Roychowdhury V. Embed-Search-Align: DNA sequence alignment using Transformer models. Bioinformatics 2025; 41:btaf041. [PMID: 39913380 PMCID: PMC11919449 DOI: 10.1093/bioinformatics/btaf041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Revised: 12/05/2024] [Accepted: 02/03/2025] [Indexed: 03/20/2025] Open
Abstract
MOTIVATION: DNA sequence alignment, an important genomic task, involves assigning short DNA reads to the most probable locations on an extensive reference genome. Conventional methods tackle this challenge in two steps: genome indexing followed by efficient search to locate likely positions for given reads. Building on the success of Large Language Models in encoding text into embeddings, where the distance metric captures semantic similarity, recent efforts have encoded DNA sequences into vectors using Transformers and have shown promising results in tasks involving classification of short DNA sequences. Performance at sequence classification tasks does not, however, guarantee sequence alignment, where it is necessary to conduct a genome-wide search to align every read successfully, a significantly longer-range task by comparison. RESULTS: We bridge this gap by developing an "Embed-Search-Align" (ESA) framework, where a novel Reference-Free DNA Embedding (RDE) Transformer model generates vector embeddings of reads and fragments of the reference in a shared vector space; the read-fragment distance metric is then used as a surrogate for sequence similarity. ESA introduces: (i) a contrastive loss for self-supervised training of DNA sequence representations, facilitating rich reference-free, sequence-level embeddings, and (ii) a DNA vector store to enable search across fragments on a global scale. RDE is 99% accurate when aligning 250-length reads onto a human reference genome of 3 gigabases (single-haploid), rivaling conventional algorithmic sequence alignment methods such as Bowtie and BWA-Mem. RDE far exceeds the performance of six recent DNA-Transformer model baselines, such as Nucleotide Transformer and Hyena-DNA, and shows task transfer across chromosomes and species. AVAILABILITY AND IMPLEMENTATION: Please see https://anonymous.4open.science/r/dna2vec-7E4E/readme.md.
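The embed-then-search pattern described in the abstract can be sketched on a toy scale in Python. Everything specific here is an assumption for illustration: a normalized 3-mer count vector stands in for the learned RDE Transformer embedding, a plain list with brute-force cosine search stands in for the DNA vector store, and the reference string, fragment length, and stride are arbitrary.

```python
import math
from collections import Counter
from itertools import product

K = 3
VOCAB = ["".join(p) for p in product("ACGT", repeat=K)]


def embed(seq):
    """Stand-in embedding: L2-normalized k-mer count vector.

    This is NOT the RDE Transformer; it only plays the same role here.
    """
    counts = Counter(seq[i:i + K] for i in range(len(seq) - K + 1))
    vec = [float(counts[kmer]) for kmer in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_store(reference, frag_len, stride):
    """Embed sliding-window fragments of the reference: (start, vector) pairs."""
    return [(start, embed(reference[start:start + frag_len]))
            for start in range(0, len(reference) - frag_len + 1, stride)]


def search(read, store):
    """Return the start of the fragment whose embedding is most similar
    (highest cosine similarity) to the read's embedding."""
    q = embed(read)
    best = max(store, key=lambda item: sum(a * b for a, b in zip(q, item[1])))
    return best[0]


# Toy reference; a real genome would need an approximate nearest-neighbor index.
reference = "ACGTTGCAAGGCTTACCGATAGCTAGGATCCTGAATCGGA"
store = build_store(reference, frag_len=12, stride=4)
read = reference[8:20]  # a read taken exactly from position 8
hit = search(read, store)
```

In the framework described above, the retrieved fragments would then be handed to an alignment step to place the read precisely; this sketch stops at retrieval.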
Affiliation(s)
- Pavan Holur
- Department of Electrical and Computer Engineering, UCLA, Los Angeles, California, 90024, United States
| | - K C Enevoldsen
- Center for Humanities Computing, Aarhus University, Aarhus, 8000, Denmark
- Center for Quantitative Genetics and Genomics, Aarhus University, Aarhus, 8000, Denmark
| | - Shreyas Rajesh
- Department of Electrical and Computer Engineering, UCLA, Los Angeles, California, 90024, United States
| | - Lajoyce Mboning
- Department of Chemistry and Biochemistry, UCLA, Los Angeles, California, 90024, United States
| | - Thalia Georgiou
- Department of Biochemistry, Biophysics, and Structural Biology (MBIDP), UCLA, Los Angeles, California, 90024, United States
| | - Louis-S Bouchard
- Department of Chemistry and Biochemistry, UCLA, Los Angeles, California, 90024, United States
| | - Matteo Pellegrini
- Department of Molecular, Cell, and Developmental Biology, UCLA, Los Angeles, California, 90024, United States
| | - Vwani Roychowdhury
- Department of Electrical and Computer Engineering, UCLA, Los Angeles, California, 90024, United States
| |
|
39
|
Wang T, Cui Y, Sun T, Li H, Wang C, Hou Y, Wang M, Chen L, Wu J. A Feature Engineering Method for Whole-Genome DNA Sequence with Nucleotide Resolution. Int J Mol Sci 2025; 26:2281. [PMID: 40076901 PMCID: PMC11899767 DOI: 10.3390/ijms26052281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2024] [Revised: 01/17/2025] [Accepted: 03/01/2025] [Indexed: 03/14/2025]
Abstract
Feature engineering for whole-genome DNA sequences plays a critical role in predicting plant phenotypic traits. However, due to limitations in the models' analytical capabilities and computational resources, the existing methods are predominantly confined to SNP-based approaches, which typically extract genetic variation sites for dimensionality reduction before feature extraction. These methods not only suffer from incomplete locus coverage and insufficient genetic information but also overlook the relationships between nucleotides, thereby restricting the accuracy of phenotypic trait prediction. Inspired by the parallels between gene sequences and natural language, the emergence of large language models (LLMs) offers novel approaches for addressing the challenge of constructing genome-wide feature representations with nucleotide granularity. This study proposes FE-WDNA, a whole-genome DNA sequence feature engineering method that fine-tunes HyenaDNA on whole-genome data from 1000 soybean samples. This provides deep insight into the contextual and long-range dependencies among nucleotide sites, from which comprehensive genome-wide feature vectors are derived. We further evaluated the application of FE-WDNA in agronomic trait prediction, examining factors such as the context window length of the DNA input, feature vector dimensions, and trait prediction methods, achieving significant improvements compared to the existing SNP-based approaches. FE-WDNA provides a means of high-quality DNA sequence feature engineering at nucleotide resolution, which can be transferred to other plants and directly applied to various computational breeding tasks.
Affiliation(s)
- Ting Wang
- Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; (T.W.); (H.L.); (Y.H.); (M.W.); (L.C.); (J.W.)
- Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
| | - Yunpeng Cui
- Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; (T.W.); (H.L.); (Y.H.); (M.W.); (L.C.); (J.W.)
- Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
| | - Tan Sun
- Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; (T.W.); (H.L.); (Y.H.); (M.W.); (L.C.); (J.W.)
- Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
| | - Huan Li
- Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; (T.W.); (H.L.); (Y.H.); (M.W.); (L.C.); (J.W.)
- Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
| | - Chao Wang
- Digital Agriculture and Rural Research Institute, Chinese Academy of Agricultural Sciences, Zibo 255035, China
| | - Ying Hou
- Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; (T.W.); (H.L.); (Y.H.); (M.W.); (L.C.); (J.W.)
- Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
| | - Mo Wang
- Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; (T.W.); (H.L.); (Y.H.); (M.W.); (L.C.); (J.W.)
- Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
| | - Li Chen
- Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; (T.W.); (H.L.); (Y.H.); (M.W.); (L.C.); (J.W.)
- Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
| | - Jinming Wu
- Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China; (T.W.); (H.L.); (Y.H.); (M.W.); (L.C.); (J.W.)
- Key Laboratory of Agricultural Big Data, Ministry of Agriculture and Rural Affairs, Beijing 100081, China
| |
|
40
|
Sereshki S, Lonardi S. Predicting differentially methylated cytosines in TET and DNMT3 knockout mutants via a large language model. Brief Bioinform 2025; 26:bbaf092. [PMID: 40079264 PMCID: PMC11904404 DOI: 10.1093/bib/bbaf092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2024] [Revised: 02/03/2025] [Accepted: 02/18/2025] [Indexed: 03/15/2025]
Abstract
DNA methylation is an epigenetic marker that directly or indirectly regulates several critical cellular processes. While cytosines in mammalian genomes generally maintain stable methylation patterns over time, other cytosines that belong to specific regulatory regions, such as promoters and enhancers, can exhibit dynamic changes. These changes in methylation are driven by a complex cellular machinery, in which the enzymes DNMT3 and TET play key roles. The objective of this study is to design a machine learning model capable of accurately predicting which cytosines have a fluctuating methylation level [hereafter called differentially methylated cytosines (DMCs)] from the surrounding DNA sequence. Here, we introduce L-MAP, a transformer-based large language model that is trained on DNMT3-knockout and TET-knockout data in human and mouse embryonic stem cells. Our extensive experimental results demonstrate the high accuracy of L-MAP in predicting DMCs. Our experiments also explore whether a classifier trained on human knockout data could predict DMCs in the mouse genome (and vice versa), and whether a classifier trained on DNMT3 knockout data could predict DMCs in TET knockouts (and vice versa). L-MAP enables the identification of sequence motifs associated with the enzymatic activity of DNMT3 and TET, which include known motifs but also novel binding sites that could provide new insights into DNA methylation in stem cells. L-MAP is available at https://github.com/ucrbioinfo/dmc_prediction.
Affiliation(s)
- Saleh Sereshki
- Department of Computer Science and Engineering, University of California, Riverside, 900 University Ave, Riverside, CA 92521, United States
| | - Stefano Lonardi
- Department of Computer Science and Engineering, University of California, Riverside, 900 University Ave, Riverside, CA 92521, United States
| |
|
41
|
Albors C, Li JC, Benegas G, Ye C, Song YS. A Phylogenetic Approach to Genomic Language Modeling. arXiv 2025:arXiv:2503.03773v1. [PMID: 40093357 PMCID: PMC11908359] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2025]
Abstract
Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model's applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.
Affiliation(s)
- Carlos Albors
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
| | - Jianan Canal Li
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
| | - Gonzalo Benegas
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
| | - Chengzhong Ye
- Department of Statistics, University of California, Berkeley, CA 94720, USA
| | - Yun S Song
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720, USA
- Department of Statistics, University of California, Berkeley, CA 94720, USA
| |
|
42
|
Bai Z, Zhang YZ, Pang Y, Imoto S. PharaCon: a new framework for identifying bacteriophages via conditional representation learning. Bioinformatics 2025; 41:btaf085. [PMID: 39992229 PMCID: PMC11928753 DOI: 10.1093/bioinformatics/btaf085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2024] [Revised: 01/08/2025] [Accepted: 02/20/2025] [Indexed: 02/25/2025]
Abstract
MOTIVATION Identifying bacteriophages (phages) within metagenomic sequences is essential for understanding microbial community dynamics. Transformer-based foundation models have been successfully employed to address various biological challenges. However, these models are typically pre-trained with self-supervised tasks that do not consider label variance in the pre-training data. This presents a challenge for phage identification as pre-training on mixed bacterial and phage data may lead to information bias due to the imbalance between bacterial and phage samples. RESULTS To overcome this limitation, we proposed a novel conditional BERT framework that incorporates label classes as special tokens during pre-training. Specifically, our conditional BERT model attaches labels directly during tokenization, introducing label constraints into the model's input. Additionally, we introduced a new fine-tuning scheme that enables the conditional BERT to be effectively utilized for classification tasks. This framework allows the BERT model to acquire label-specific contextual representations from mixed sequence data during pre-training and applies the conditional BERT as a classifier during fine-tuning. We named the fine-tuned model PharaCon. We evaluated PharaCon against several existing methods on both simulated sequence datasets and real metagenomic contig datasets. The results demonstrate PharaCon's effectiveness and efficiency in phage identification, highlighting the advantages of incorporating label information during both pre-training and fine-tuning. AVAILABILITY AND IMPLEMENTATION The source code and associated data can be accessed at https://github.com/Celestial-Bai/PharaCon.
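The idea of attaching a class label as a special token during tokenization can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token names (`[PHAGE]`, `[BACT]`), the placement of the label token right after `[CLS]`, and the overlapping k-mer tokenizer are all hypothetical choices, not details confirmed by the abstract.

```python
# Hypothetical label tokens; the actual vocabulary used by PharaCon is not
# given in the abstract.
LABEL_TOKENS = {"phage": "[PHAGE]", "bacteria": "[BACT]"}


def kmer_tokens(seq, k=6):
    """Split a DNA sequence into overlapping k-mers (stride 1)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def conditional_tokens(seq, label, k=6):
    """Prepend a class-label token so the model input carries the label."""
    return ["[CLS]", LABEL_TOKENS[label]] + kmer_tokens(seq, k) + ["[SEP]"]


tokens = conditional_tokens("ACGTACGT", "phage")
# tokens == ["[CLS]", "[PHAGE]", "ACGTAC", "CGTACG", "GTACGT", "[SEP]"]
```

During masked pre-training, a model that sees the label token alongside the sequence tokens can condition its predictions on the class, which is one plausible way to read "label-specific contextual representations".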
Affiliation(s)
- Zeheng Bai
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1, Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
| | - Yao-zhong Zhang
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1, Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
| | - Yuxuan Pang
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1, Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
| | - Seiya Imoto
- Division of Health Medical Intelligence, Human Genome Center, The Institute of Medical Science, The University of Tokyo, 4-6-1, Shirokanedai, Minato-ku, Tokyo, 108-8639, Japan
- Collaborative Research Institute for Innovative Microbiology, The University of Tokyo, 1-1-1, Yayoi, Bunkyo-ku, Tokyo, 113-8657, Japan
| |
|
43
|
Cherednichenko O, Poptsova M. Kolmogorov-Arnold networks for genomic tasks. Brief Bioinform 2025; 26:bbaf129. [PMID: 40163820 PMCID: PMC11957273 DOI: 10.1093/bib/bbaf129] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2024] [Revised: 02/12/2025] [Accepted: 03/05/2025] [Indexed: 04/02/2025]
Abstract
Kolmogorov-Arnold networks (KANs) emerged as a promising alternative to multilayer perceptrons (MLPs) in dense fully connected networks. Multiple attempts have been made to integrate KANs into various deep learning architectures in the domains of computer vision and natural language processing. Integrating KANs into deep learning models for genomic tasks has not been explored. Here, we tested linear KANs (LKANs) and convolutional KANs (CKANs) as a replacement for MLPs in baseline deep learning architectures for classification and generation of genomic sequences. We used three genomic benchmark datasets: Genomic Benchmarks, Genome Understanding Evaluation, and Flipon Benchmark. We demonstrated that LKANs outperformed both the baseline and CKANs on almost all datasets. CKANs can achieve comparable results but struggle to scale to a large number of parameters. Ablation analysis demonstrated that the number of KAN layers correlates with model performance. Overall, linear KANs show promising results in improving the performance of deep learning models with a relatively small number of parameters. Unleashing the potential of KANs in the different state-of-the-art deep learning architectures currently used in genomics requires further research.
Affiliation(s)
- Oleksandr Cherednichenko
- International Laboratory of Bioinformatics, HSE University, 11 Pokrovsky Bulvar, Moscow, 109028, Russia
| | - Maria Poptsova
- International Laboratory of Bioinformatics, HSE University, 11 Pokrovsky Bulvar, Moscow, 109028, Russia
| |
|
44
|
Duan C, Zang Z, Xu Y, He H, Li S, Liu Z, Lei Z, Zheng JS, Li SZ. FGeneBERT: function-driven pre-trained gene language model for metagenomics. Brief Bioinform 2025; 26:bbaf149. [PMID: 40211978 PMCID: PMC11986344 DOI: 10.1093/bib/bbaf149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/25/2024] [Revised: 02/22/2025] [Accepted: 03/14/2025] [Indexed: 04/14/2025]
Abstract
Metagenomic data, comprising mixed multi-species genomes, are prevalent in diverse environments like oceans and soils, significantly impacting human health and ecological functions. However, current research relies on k-mer representations, which limit the capture of structurally and functionally relevant gene contexts. Moreover, these approaches struggle with encoding biologically meaningful genes and fail to address the one-to-many and many-to-one relationships inherent in metagenomic data. To overcome these challenges, we introduce FGeneBERT, a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware and structure-relevant tokenizer. FGeneBERT incorporates masked gene modeling to enhance the understanding of inter-gene contextual relationships and triplet-enhanced metagenomic contrastive learning to elucidate gene sequence-function relationships. Pre-trained on over 100 million metagenomic sequences, FGeneBERT demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental levels and ranging from 1 to 213 k input sequences. Case studies of ATP synthase and gene operons highlight FGeneBERT's capability for functional recognition and its biological relevance in metagenomic research.
Affiliation(s)
- Chenrui Duan
- College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
- School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
| | - Zelin Zang
- Centre for Artificial Intelligence and Robotics (CAIR), HKISI-CAS Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong 310000, China
| | - Yongjie Xu
- College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
- School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
| | - Hang He
- School of Medicine and School of Life Sciences, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
| | - Siyuan Li
- College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
- School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
| | - Zihan Liu
- College of Computer Science and Technology, Zhejiang University, No. 866, Yuhangtang Road, 310058 Zhejiang, P. R. China
- School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
| | - Zhen Lei
- Centre for Artificial Intelligence and Robotics (CAIR), HKISI-CAS Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong 310000, China
- State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences (CASIA), Beijing 100190, China
- School of Artificial Intelligence, University of Chinese Academy of Sciences (UCAS), Beijing 100049, China
| | - Ju-Sheng Zheng
- School of Medicine and School of Life Sciences, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
| | - Stan Z Li
- School of Engineering, Westlake University, No. 600 Dunyu Road, 310030 Zhejiang, P. R. China
| |
|
45
|
Creux C, Zehraoui F, Radvanyi F, Tahi F. MMnc: multi-modal interpretable representation for non-coding RNA classification and class annotation. Bioinformatics 2025; 41:btaf051. [PMID: 39891346 PMCID: PMC11890286 DOI: 10.1093/bioinformatics/btaf051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2024] [Revised: 01/16/2025] [Accepted: 01/29/2025] [Indexed: 02/03/2025]
Abstract
MOTIVATION As the biological roles and disease implications of non-coding RNAs continue to emerge, the need to thoroughly characterize previously unexplored non-coding RNAs becomes increasingly urgent. These molecules hold potential as biomarkers and therapeutic targets. However, the vast and complex nature of non-coding RNA data presents a challenge. We introduce MMnc, an interpretable deep-learning approach designed to classify non-coding RNAs into functional groups. MMnc leverages multiple data sources (such as sequence, secondary structure, and expression) using attention-based multi-modal data integration. This ensures the learning of meaningful representations while accounting for missing sources in some samples. RESULTS Our findings demonstrate that MMnc achieves high classification accuracy across diverse non-coding RNA classes. The method's modular architecture allows for the consideration of multiple types of modalities, whereas other tools consider two at most. MMnc is resilient to missing data, ensuring that all available information is effectively utilized. Importantly, the generated attention scores offer interpretable insights into the underlying patterns of the different non-coding RNA classes, potentially driving future non-coding RNA research and applications. AVAILABILITY AND IMPLEMENTATION Data and source code can be found at EvryRNA.ibisc.univ-evry.fr/EvryRNA/MMnc.
Affiliation(s)
- Constance Creux
- Université Paris-Saclay, Univ Evry, IBISC, Evry-Courcouronnes 91020, France
- Molecular Oncology, PSL Research University, CNRS, UMR 144, Institut Curie, Paris 75248, France
| | - Farida Zehraoui
- Université Paris-Saclay, Univ Evry, IBISC, Evry-Courcouronnes 91020, France
| | - François Radvanyi
- Molecular Oncology, PSL Research University, CNRS, UMR 144, Institut Curie, Paris 75248, France
| | - Fariza Tahi
- Université Paris-Saclay, Univ Evry, IBISC, Evry-Courcouronnes 91020, France
| |
|
46
|
Yu T, Cheng L, Khalitov R, Olsson EB, Yang Z. Self-distillation improves self-supervised learning for DNA sequence inference. Neural Netw 2025; 183:106978. [PMID: 39667220 DOI: 10.1016/j.neunet.2024.106978] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2024] [Revised: 10/28/2024] [Accepted: 11/26/2024] [Indexed: 12/14/2024]
Abstract
Self-supervised Learning (SSL) has been recognized as a method to enhance prediction accuracy in various downstream tasks. However, its efficacy for DNA sequences remains somewhat constrained. This limitation stems primarily from the fact that most existing SSL approaches in genomics focus on masked language modeling of individual sequences, neglecting the crucial aspect of encoding statistics across multiple sequences. To overcome this challenge, we introduce an innovative deep neural network model, which incorporates collaborative learning between a 'student' and a 'teacher' subnetwork. In this model, the student subnetwork employs masked learning on nucleotides and progressively adapts its parameters to the teacher subnetwork through an exponential moving average approach. Concurrently, both subnetworks engage in contrastive learning, deriving insights from two augmented representations of the input sequences. This self-distillation process enables our model to effectively assimilate both contextual information from individual sequences and distributional data across the sequence population. We validated our approach with preliminary pretraining using the human reference genome, followed by applying it to 20 downstream inference tasks. The empirical results from these experiments demonstrate that our novel method significantly boosts inference performance across the majority of these tasks. Our code is available at https://github.com/wiedersehne/FinDNA.
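The exponential moving average (EMA) coupling between the two subnetworks mentioned in this abstract can be written as a one-line update rule. The sketch below is a generic EMA blend over flat parameter lists, not the FinDNA code: the function name `ema_update`, the `momentum` value, and the choice of which network plays `target` versus `source` are assumptions, since the abstract does not spell out the update equation.

```python
def ema_update(target, source, momentum=0.99):
    """Move each target parameter a small step toward the source parameter.

    new_target = momentum * target + (1 - momentum) * source
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(target, source)]


# Illustrative values only: two "parameters" per network.
target_params = [1.0, 0.0]
source_params = [0.0, 1.0]
for _ in range(3):
    target_params = ema_update(target_params, source_params, momentum=0.9)
# After three steps the target has moved 1 - 0.9**3 = 27.1% of the way
# toward the source.
```

A momentum close to 1 makes the target network change slowly, which gives the other network a stable signal to learn from.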
Affiliation(s)
- Tong Yu
- Norwegian University of Science and Technology, Trondheim, Norway
| | - Lei Cheng
- Norwegian University of Science and Technology, Trondheim, Norway
| | - Ruslan Khalitov
- Norwegian University of Science and Technology, Trondheim, Norway
| | - Erland B Olsson
- Norwegian University of Science and Technology, Trondheim, Norway
| | - Zhirong Yang
- Norwegian University of Science and Technology, Trondheim, Norway
| |
|
47
|
Li H, Meng J, Wang Z, Luan Y. misORFPred: A Novel Method to Mine Translatable sORFs in Plant Pri-miRNAs Using Enhanced Scalable k-mer and Dynamic Ensemble Voting Strategy. Interdiscip Sci 2025; 17:114-133. [PMID: 39397199 DOI: 10.1007/s12539-024-00661-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2023] [Revised: 09/18/2024] [Accepted: 09/22/2024] [Indexed: 10/15/2024]
Abstract
Primary microRNAs (pri-miRNAs) have been observed to contain translatable small open reading frames (sORFs) that can encode peptides as independent elements. Relevant studies have shown that these sORFs play a significant role in regulating the expression of biological traits. Existing methods for predicting the coding potential of sORFs frequently overlook these data or categorize them as negative samples, impeding the identification of additional translatable sORFs in pri-miRNAs. In light of this, a novel method named misORFPred is proposed. Specifically, an enhanced scalable k-mer (ESKmer) that simultaneously integrates the composition information within a sequence and distance information between sequences is designed to extract the nucleotide sequence features. After feature selection, the optimal features and several machine learning classifiers are combined to construct the ensemble model, where a newly devised dynamic ensemble voting strategy (DEVS) dynamically adjusts the weights of base classifiers and adaptively selects the optimal base classifiers for each unlabeled sample. Cross-validation results suggest that ESKmer and DEVS are essential for this classification task and boost model performance. Independent testing results indicate that misORFPred outperforms the state-of-the-art methods. Furthermore, we run misORFPred on the genomes of various plant species and perform a thorough analysis of the predicted outcomes. Taken together, misORFPred is a powerful tool for identifying the translatable sORFs in plant pri-miRNAs and can provide highly trusted candidates for subsequent biological experiments.
Affiliation(s)
- Haibin Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
| | - Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
| | - Zhaowei Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
| | - Yushi Luan
- School of Bioengineering, Dalian University of Technology, Dalian, 116024, China
| |
|
48
|
Aspromonte MC, Del Conte A, Zhu S, Tan W, Shen Y, Zhang Y, Li Q, Wang MH, Babbi G, Bovo S, Martelli PL, Casadio R, Althagafi A, Toonsi S, Kulmanov M, Hoehndorf R, Katsonis P, Williams A, Lichtarge O, Xian S, Surento W, Pejaver V, Mooney SD, Sunderam U, Srinivasan R, Murgia A, Piovesan D, Tosatto SCE, Leonardi E. CAGI6 ID panel challenge: assessment of phenotype and variant predictions in 415 children with neurodevelopmental disorders (NDDs). Hum Genet 2025; 144:227-242. [PMID: 39786577 PMCID: PMC11976362 DOI: 10.1007/s00439-024-02722-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Accepted: 12/13/2024] [Indexed: 01/12/2025]
Abstract
The Genetics of Neurodevelopmental Disorders Lab in Padua provided a new intellectual disability (ID) Panel challenge for computational methods to predict patient phenotypes and their causal variants in the context of the Critical Assessment of Genome Interpretation, 6th edition (CAGI6). Eight research teams submitted a total of 30 models to predict phenotypes based on the sequences of 74 genes (VCF format) in 415 pediatric patients affected by Neurodevelopmental Disorders (NDDs). NDDs are clinically and genetically heterogeneous conditions, with onset in infancy. Here, we assess the ability and accuracy of computational methods to predict comorbid phenotypes based on clinical features described in each patient and their causal variants. We also evaluated predictions for possible genetic causes in patients without a clear genetic diagnosis. As in the previous ID Panel challenge in CAGI5, seven clinical features (ID, ASD, ataxia, epilepsy, microcephaly, macrocephaly, hypotonia) and variants (Pathogenic/Likely Pathogenic, Variants of Uncertain Significance and Risk Factors) were provided. The phenotypic traits and variant data of 150 patients from the CAGI5 ID Panel Challenge were provided as a training set for predictors. The CAGI6 challenge confirms the CAGI5 finding that predicting phenotypes from gene panel data is highly challenging, with AUC values close to random and no method able to predict relevant variants with both high accuracy and precision. However, a significant improvement is noted for the best method, with recall increasing from 66% to 82%. Several groups also successfully predicted difficult-to-detect variants, emphasizing the importance of variants initially excluded by the Padua NDD Lab.
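Because the abstract contrasts recall with precision, a short worked example may help show how the two can diverge. The counts below are made up purely for illustration and are not taken from the CAGI6 assessment; the function name `precision_recall` is likewise an assumption.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from confusion-matrix counts."""
    predicted = true_pos + false_pos   # everything the method flagged
    actual = true_pos + false_neg      # everything that was truly causal
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Hypothetical counts: a method that finds 82 of 100 causal variants
# but also flags 41 variants that are not causal.
precision, recall = precision_recall(true_pos=82, false_pos=41, false_neg=18)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this made-up case recall is 0.82 while precision is only about 0.67, which shows how a method can recover most causal variants and still produce many false positives.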
Affiliation(s)
- Maria Cristina Aspromonte
- Department of Biomedical Sciences, University of Padova, Padova, Italy
- Department of Women's and Children's Health, University of Padova, Padova, Italy
| | - Alessio Del Conte
- Department of Biomedical Sciences, University of Padova, Padova, Italy
| | - Shaowen Zhu
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | - Wuwei Tan
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | - Yexian Zhang
- CUHK Shenzhen Research Institute, Shenzhen, China
- JC School of Public Health and Primary Care, Chinese University of Hong Kong, Hong Kong, SAR, China
| | - Qi Li
- CUHK Shenzhen Research Institute, Shenzhen, China
- JC School of Public Health and Primary Care, Chinese University of Hong Kong, Hong Kong, SAR, China
| | - Maggie Haitian Wang
- CUHK Shenzhen Research Institute, Shenzhen, China
- JC School of Public Health and Primary Care, Chinese University of Hong Kong, Hong Kong, SAR, China
| | - Giulia Babbi
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Samuele Bovo
- Department of Agricultural and Food Sciences, University of Bologna, Bologna, Italy
| | - Pier Luigi Martelli
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Rita Casadio
- Biocomputing Group, Department of Pharmacy and Biotechnology, University of Bologna, Bologna, Italy
| | - Azza Althagafi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
- Computer Science Department, College of Computers and Information Technology, Taif University, Taif, 26571, Saudi Arabia
| | - Sumyyah Toonsi
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Maxat Kulmanov
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Robert Hoehndorf
- Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences & Engineering Division (CEMSE), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Saudi Arabia
| | - Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA
| | - Amanda Williams
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA
| | - Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, TX, 77030, USA
| | - Su Xian
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, 98195, USA
| | - Wesley Surento
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, 98195, USA
| | - Vikas Pejaver
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, 10029, USA
| | - Sean D Mooney
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, 98195, USA
| | - Uma Sunderam
- Innovation Labs, Tata Consultancy Services, Hyderabad, India
| | | | - Alessandra Murgia
- Department of Women's and Children's Health, University of Padova, Padova, Italy
| | - Damiano Piovesan
- Department of Biomedical Sciences, University of Padova, Padova, Italy
| | - Silvio C E Tosatto
- Department of Biomedical Sciences, University of Padova, Padova, Italy.
- Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, National Research Council (CNR-IBIOM), Bari, Italy.
| | - Emanuela Leonardi
- Department of Biomedical Sciences, University of Padova, Padova, Italy.
- Department of Women's and Children's Health, University of Padova, Padova, Italy.
| |
|
49
|
Gao Y, Shi R, Yu G, Huang Y, Yang Y. ZeRPI: A graph neural network model for zero-shot prediction of RNA-protein interactions. Methods 2025; 235:45-52. [PMID: 39892680 DOI: 10.1016/j.ymeth.2025.01.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2024] [Revised: 12/29/2024] [Accepted: 01/16/2025] [Indexed: 02/04/2025]
Abstract
RNA-protein interactions are crucial for biological functions across multiple levels. RNA-binding proteins (RBPs) intricately engage in diverse biological processes through specific RNA molecule interactions. Previous studies have revealed the indispensable role of RBPs in both health and disease development. With the growth of experimental data, machine-learning methods have been widely used to predict RNA-protein interactions. However, most current methods either train models for individual RBPs or develop multi-task models for a fixed set of multiple RBPs. These approaches are incapable of predicting interactions with previously unseen RBPs. In this study, we present ZeRPI, a zero-shot method for predicting RNA-protein interactions. Based on a graph neural network model, ZeRPI integrates RNA and protein information to generate detailed representations, using a novel loss function based on contrastive learning principles to augment the alignment between interacting pairs in feature space. ZeRPI demonstrates competitive performance in predicting RNA-protein interactions across a wide array of RBPs. Notably, our model exhibits remarkable versatility in accurately predicting interactions for unseen RBPs, demonstrating its capacity to transfer knowledge learned from known RBPs.
Affiliation(s)
- Yifei Gao
- SJTU Paris Elite Institute of Technology (SPEIT), Shanghai, 200240, China; Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Runhan Shi
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Gufeng Yu
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yuyang Huang
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China
| | - Yang Yang
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China.
| |
|
50
|
Brady S, Auge G, Ayalew M, Balasubramanian S, Hamann T, Inze D, Saito K, Brychkova G, Berardini TZ, Friesner J, Ho C, Hauser M, Kobayashi M, Lepiniec L, Mähönen AP, Mutwil M, May S, Parry G, Rigas S, Stepanova AN, Williams M, Provart NJ. Arabidopsis research in 2030: Translating the computable plant. The Plant Journal: For Cell and Molecular Biology 2025; 121:e70047. [PMID: 40028766 PMCID: PMC11874203 DOI: 10.1111/tpj.70047] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/27/2024] [Accepted: 01/29/2025] [Indexed: 03/05/2025]
Abstract
Plants are essential for human survival. Over the past three decades, work with the reference plant Arabidopsis thaliana has significantly advanced plant biology research. One key event was the sequencing of its genome 25 years ago, which fostered many subsequent research technologies and datasets. Arabidopsis has been instrumental in elucidating plant-specific aspects of biology, developing research tools, and translating findings to crop improvement. It serves as a model not only for understanding plant biology but also for biology in other fields: discoveries in Arabidopsis have led to applications in human health, including insights into immunity, protein degradation, and circadian rhythms. Arabidopsis research has also fostered the development of tools useful for the wider biological research community, such as optogenetic systems and auxin-based degrons. This 4th Multinational Arabidopsis Steering Committee Roadmap outlines future directions, with emphasis on computational approaches, research support, translation to crops, conference accessibility, coordinated research efforts, climate change mitigation, sustainable production, and fundamental research. Arabidopsis will remain a nexus for discovery, innovation, and application, driving advances in both plant and human biology to the year 2030 and beyond.
Affiliation(s)
- Siobhan Brady
- Howard Hughes Medical Institute, University of California Davis, Davis, California, USA
| | - Gabriela Auge
- Institute for Agrobiotechnology and Molecular Biology, Instituto Nacional de Tecnología Agropecuaria (INTA) ‐ Consejo Nacional de Investigaciones Científicas y Tecnológicas (CONICET), Buenos Aires, Argentina
| | | | | | - Thorsten Hamann
- Department of Biology, Faculty of Natural Sciences, Norwegian University of Science and Technology, Trondheim, Norway
| | - Dirk Inze
- University of Gent Center for Plant Systems Biology, Ghent, Belgium
| | - Kazuki Saito
- RIKEN Center for Sustainable Resource Science, Yokohama, Japan
| | - Galina Brychkova
- School of Biological & Chemical Sciences, Ryan Institute, University of Galway, Galway, Ireland
| | - Tanya Z. Berardini
- The Arabidopsis Information Resource/Phoenix Bioinformatics, Newark, California, USA
| | - Joanna Friesner
- North American Arabidopsis Steering Committee, Corvallis, Oregon, USA
| | - Cheng‐Hsun Ho
- Agricultural Biotechnology Research Centre, Academia Sinica, Taipei, Taiwan
| | | | | | - Loic Lepiniec
- AgroParisTech, Institut Jean‐Pierre Bourgin for Plant Sciences (IJPB), Universite Paris‐Saclay, INRAE, Versailles, 78000, France
| | - Ari Pekka Mähönen
- Faculty of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland
| | - Marek Mutwil
- Nanyang Technological University, Singapore, Singapore
| | - Sean May
- University of Nottingham, Nottingham, UK
| | | | | | - Anna N. Stepanova
- Department of Plant and Microbial Biology, Genetics and Genomics Academy, North Carolina State University, Raleigh, 27695, North Carolina, USA
| | - Mary Williams
- American Society of Plant Biology, Rockville, Maryland, USA
| | - Nicholas J. Provart
- Department of Cell & Systems Biology/Centre for the Analysis of Genome Evolution and Function, University of Toronto, Toronto, Ontario, Canada
| |
|