1
Nayak DSK, Das R, Sahoo SK, Swarnkar T. ARGai 1.0: A GAN augmented in silico approach for identifying resistant genes and strains in E. coli using vision transformer. Comput Biol Chem 2025; 115:108342. [PMID: 39813877 DOI: 10.1016/j.compbiolchem.2025.108342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2024] [Revised: 11/08/2024] [Accepted: 01/03/2025] [Indexed: 01/18/2025]
Abstract
The emergence of infectious disease and antibiotic resistance in bacteria like Escherichia coli (E. coli) shows the necessity for novel computational techniques for identifying essential genes that contribute to resistance. Identifying resistant strains and multi-drug patterns in E. coli is a major challenge with whole genome sequencing (WGS) and next-generation sequencing (NGS) data. To address this issue, we propose ARGai 1.0, a deep learning architecture enhanced with generative adversarial networks (GANs). We mitigate data scarcity by augmenting limited experimental datasets with synthetic data generated by GANs. Our in-silico method (augmentation with feature selection) improves the identification of resistance genes in E. coli by using feature extraction techniques to identify valuable features from actual and GAN-generated data. Through comprehensive validation, we demonstrate the effectiveness of ARGai 1.0 in precisely identifying the informative and resistant genes. In addition, ARGai 1.0 identifies the resistant strains with a classification accuracy of 98.96% on Deep Convolutional Generative Adversarial Network (DCGAN) augmented data, and achieves more than 98% sensitivity and specificity. We also benchmark ARGai 1.0 against several state-of-the-art AI models for resistant strain classification. In the fight against antibiotic resistance, ARGai 1.0 offers a promising avenue for computational genomics. With implications for research and clinical practice, this work shows the potential of deep networks with GAN augmentation as a practical and successful method for gene identification in E. coli.
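For readers unfamiliar with the two metrics quoted in this abstract, the sketch below shows how sensitivity and specificity are conventionally computed from binary predictions. It is a generic illustration, not the authors' evaluation code; the function name `sensitivity_specificity` and the toy labels are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative true label")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (hypothetical): 1 = resistant strain, 0 = susceptible strain.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reporting both values matters for resistance screening because accuracy alone can hide a classifier that misses the minority class.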
Affiliation(s)
- Debasish Swapnesh Kumar Nayak
  - Department of Computer Science and Engineering, Siksha 'O' Anusandhan (Deemed to be University), Odisha, India; Department of Computer Science and Engineering, Centurion University of Technology and Management, Bhubaneswar, Odisha, India.
- Ruchika Das
  - Department of Computer Science and Engineering, Siksha 'O' Anusandhan (Deemed to be University), Odisha, India.
- Santanu Kumar Sahoo
  - Department of Electronics and Communication Engineering, Siksha 'O' Anusandhan (Deemed to be University), Odisha, India.
- Tripti Swarnkar
  - Department of Computer Application, National Institute of Technology, Raipur, India.
2
Huang K, Tian J, Sun L, Hu H, Huang X, Zhou S, Deng A, Zhou Z, Jiang M, Li G, Xie P, Wang Y, Jiang X. TransGeneSelector: using a transformer approach to mine key genes from small transcriptomic datasets in plant responses to various environments. BMC Genomics 2025; 26:259. [PMID: 40098114 PMCID: PMC11912617 DOI: 10.1186/s12864-025-11434-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2024] [Accepted: 03/04/2025] [Indexed: 03/19/2025] Open
Abstract
Gene mining is crucial for understanding the regulatory mechanisms underlying complex biological processes, particularly in plants responding to environmental conditions. Traditional machine learning methods, while useful, often overlook important gene relationships due to their reliance on manual feature selection and limited ability to capture complex inter-gene regulatory dynamics. Deep learning approaches, while powerful, are often unsuitable for small sample sizes. This study introduces TransGeneSelector, the first deep learning framework specifically designed for mining key genes from small transcriptomic datasets. By integrating a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) for sample generation and a Transformer-based network for classification, TransGeneSelector efficiently addresses the challenges of small-sample transcriptomic data, capturing both global gene regulatory interactions and specific biological processes. Evaluated in Arabidopsis thaliana, the model achieved high classification accuracy in predicting seed germination and heat stress conditions, outperforming traditional methods like Random Forest and Support Vector Machines (SVM). Moreover, Shapley Additive Explanations (SHAP) analysis and gene regulatory network construction revealed that TransGeneSelector effectively identified genes that appear to have upstream regulatory functions based on our analyses, enriching them in multiple key pathways which are critical for seed germination and heat stress response. RT-qPCR validation further confirmed the model's gene selection accuracy, demonstrating consistent expression patterns across varying germination conditions. The findings underscore the potential of TransGeneSelector as a robust tool for gene mining, offering deeper insights into gene regulation and organism adaptation under diverse environmental conditions. 
This work provides a framework that leverages deep learning for key gene identification in small transcriptomic datasets.
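For context, the WGAN-GP objective named in this abstract is usually written as the critic loss below, following the standard formulation of the method. This is the generic form, not a formula taken from the TransGeneSelector paper; the penalty weight $\lambda$ is a hyperparameter whose value is not given in the abstract.

```latex
L_D = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
    - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
    + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}
      \Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

Here $P_r$ is the distribution of real samples, $P_g$ the generator's distribution, and $\hat{x}$ a random interpolation between a real and a generated sample. The penalty term pushes the critic's gradient norm toward 1, which tends to stabilize training when few real samples are available.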
Affiliation(s)
- Kerui Huang
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Jianhong Tian
  - College of Life Sciences, Hunan Normal University, Changsha, 410081, China
- Lei Sun
  - Key Laboratory of Research and Utilization of Ethnomedicinal Plant Resources of Hunan Province, College of Biological and Food Engineering, Huaihua University, Huaihua, 418000, China
- Haoliang Hu
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Xuebin Huang
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Shiqi Zhou
  - Rice Research Institute of Jiangxi Academy of Agricultural Sciences, Nanchang, 330000, China
- Aihua Deng
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Zhibo Zhou
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Ming Jiang
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Guiwu Li
  - College of Life Sciences, Hunan Normal University, Changsha, 410081, China
- Peng Xie
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China.
- Yun Wang
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China.
- Xiaocheng Jiang
  - College of Life Sciences, Hunan Normal University, Changsha, 410081, China.
3
Sefer E. DRGAT: Predicting Drug Responses Via Diffusion-Based Graph Attention Network. J Comput Biol 2025; 32:330-350. [PMID: 39639802 DOI: 10.1089/cmb.2024.0807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/07/2024] Open
Abstract
Accurately predicting drug response depending on a patient's genomic profile is critical for advancing personalized medicine. The rise of deep learning approaches, and especially of graph neural networks leveraging large-scale omics datasets, has been a key driver of research in this area. However, these biological datasets, which are typically high dimensional but have small sample sizes, present challenges such as overfitting and poor generalization in predictive models. As a complicating matter, gene expression (GE) data must capture complex inter-gene relationships, exacerbating these issues. In this article, we tackle these challenges by introducing a drug response prediction method, called drug response graph attention network (DRGAT), which combines a denoising diffusion implicit model for data augmentation with a prediction module based on a recently introduced graph attention network (GAT) with high-order neighbor propagation (HO-GATs). Our proposed approach achieved almost 5% improvement in the area under the receiver operating characteristic curve compared with state-of-the-art models for the many studied drugs, indicating our method's reasonable generalization capabilities. Moreover, our experiments confirm the potential of diffusion-based generative models, a core component of our method, to mitigate the inherent limitations of omics datasets by effectively augmenting GE data.
Affiliation(s)
- Emre Sefer
  - Artificial Intelligence and Data Engineering Department, Ozyegin University, Istanbul, Turkey
4
Vidanagamachchi SM, Waidyarathna KMGTR. Opportunities, challenges and future perspectives of using bioinformatics and artificial intelligence techniques on tropical disease identification using omics data. Front Digit Health 2024; 6:1471200. [PMID: 39654982 PMCID: PMC11625773 DOI: 10.3389/fdgth.2024.1471200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Accepted: 11/06/2024] [Indexed: 12/12/2024] Open
Abstract
Tropical diseases are often caused by viruses, bacteria, parasites, and fungi, and can be spread by vectors. Analysis of multiple omics data types can provide comprehensive insights into biological system functions and disease progression. To this end, bioinformatics tools and diverse AI techniques are pivotal in identifying and understanding tropical diseases through the analysis of omics data. In this article, we provide a thorough review of the opportunities, challenges, and future directions of utilizing bioinformatics tools and AI-assisted models for tropical disease identification using various omics data types. The review covers 2015 to 2024 and draws on reliable databases of peer-reviewed journal and conference articles. Several keywords were used for the article search, and around 40 articles were reviewed. According to the review, the use of omics data with bioinformatics tools like BLAST and Clustal Omega can yield significant outcomes in tropical disease identification. Further, the integration of multiple omics data improves biomarker identification and disease prediction, including disease outbreak prediction. Moreover, AI-assisted models can improve the precision, cost-effectiveness, and efficiency of CRISPR-based gene editing, optimizing gRNA design and supporting advanced genetic correction. Several AI-assisted models, including XAI, can be used to identify diseases and repurpose therapeutic targets and biomarkers efficiently. Furthermore, recent advancements, including Transformer-based models such as BERT and GPT-4, have been mainly applied for sequence analysis and functional genomics. Finally, the most recent GeneViT model, utilizing Vision Transformers, and other AI techniques like Generative Adversarial Networks, Federated Learning, Transfer Learning, Reinforcement Learning, Automated ML and Attention Mechanism have shown significant performance in disease classification using omics data.
Affiliation(s)
- S. M. Vidanagamachchi
  - Department of Computer Science, Faculty of Science, University of Ruhuna, Matara, Sri Lanka
- K. M. G. T. R. Waidyarathna
  - Department of Information Technology, Sri Lanka Institute of Advanced Technological Education, Galle, Sri Lanka
5
Nahali S, Safari L, Khanteymoori A, Huang J. StructmRNA a BERT based model with dual level and conditional masking for mRNA representation. Sci Rep 2024; 14:26043. [PMID: 39472486 PMCID: PMC11522565 DOI: 10.1038/s41598-024-77172-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2024] [Accepted: 10/21/2024] [Indexed: 11/02/2024] Open
Abstract
In this study, we introduce StructmRNA, a new BERT-based model that was designed for the detailed analysis of mRNA sequences and structures. The success of DNABERT in understanding the intricate language of non-coding DNA with bidirectional encoder representations is extended to mRNA with StructmRNA. This new model uses a special dual-level masking technique that covers both sequence and structure, along with conditional masking. This enables StructmRNA to adeptly generate meaningful embeddings for mRNA sequences, even in the absence of explicit structural data, by capitalizing on the intricate sequence-structure correlations learned during extensive pre-training on vast datasets. Compared to well-known models like those in the Stanford OpenVaccine project, StructmRNA performs better in important tasks such as predicting RNA degradation. Thus, StructmRNA can inform better RNA-based treatments by predicting the secondary structures and biological functions of unseen mRNA sequences. The proficiency of this model is further confirmed by rigorous evaluations, revealing its unprecedented ability to generalize across various organisms and conditions, thereby marking a significant advance in the predictive analysis of mRNA for therapeutic design. With this work, we aim to set a new standard for mRNA analysis, contributing to the broader field of genomics and therapeutic development.
Affiliation(s)
- Sepideh Nahali
  - Information Retrieval and Knowledge Management Research Lab, York University, Toronto, Ontario, Canada.
  - Department of Computer Engineering, University of Zanjan, Zanjan, Iran.
- Leila Safari
  - Department of Computer Engineering, University of Zanjan, Zanjan, Iran
- Jimmy Huang
  - Information Retrieval and Knowledge Management Research Lab, York University, Toronto, Ontario, Canada
6
Wang Y, Chen Q, Shao H, Zhang R, Shen H. Generating bulk RNA-Seq gene expression data based on generative deep learning models and utilizing it for data augmentation. Comput Biol Med 2024; 169:107828. [PMID: 38101117 DOI: 10.1016/j.compbiomed.2023.107828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/19/2023] [Revised: 11/22/2023] [Accepted: 12/04/2023] [Indexed: 12/17/2023]
Abstract
Large-scale high-throughput transcriptome sequencing data holds significant value in biomedical research. However, practical challenges such as difficulty in sample acquisition often limit the availability of large sample sizes, leading to decreased reliability of the analysis results. In practice, generative deep learning models, such as Generative Adversarial Networks (GANs) and Diffusion Models (DMs), have been proven to generate realistic data and may be used to solve this problem. In this study, we utilized bulk RNA-Seq gene expression data to construct different generative models with two data preprocessing methods: Min-Max-GAN, Z-Score-GAN, Min-Max-DM, and Z-Score-DM. We demonstrated that the generated data from the Min-Max-GAN model exhibited high similarity to real data, surpassing the performance of the other models significantly. Furthermore, we trained the models on the largest dataset available to date, achieving MMD (Maximum Mean Discrepancy) of 0.030 and 0.033 on the training and independent datasets, respectively. Through SHAP (SHapley Additive exPlanations) explanations of our generative model, we also enhanced our model's credibility. Finally, we applied the generated data to data augmentation and observed a significant improvement in the performance of classification models. In summary, this study establishes a GAN-based approach for generating bulk RNA-Seq gene expression data, which contributes to enhancing the performance and reliability of downstream tasks in high-throughput transcriptome analysis.
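The two preprocessing schemes and the MMD score mentioned in this abstract can be illustrated with a short NumPy sketch. It is a minimal example under stated assumptions, not the authors' pipeline: the abstract does not say which kernel, bandwidth, or estimator was used, so the RBF kernel, the `gamma` value, and the biased squared-MMD estimator below are all illustrative choices, and the random matrices merely stand in for expression data.

```python
import numpy as np


def min_max_scale(x):
    """Scale each column (gene) to [0, 1]; constant columns map to 0."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (x - lo) / span


def z_score_scale(x):
    """Standardize each column to zero mean and unit variance."""
    mu, sd = x.mean(axis=0), x.std(axis=0)
    sd = np.where(sd > 0, sd, 1.0)
    return (x - mu) / sd


def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared MMD between samples x and y (assumed estimator)."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())


# Stand-in data (hypothetical): rows are samples, columns are genes.
rng = np.random.default_rng(0)
real = rng.normal(size=(50, 5))
generated_good = rng.normal(size=(50, 5))          # same distribution as `real`
generated_bad = rng.normal(loc=3.0, size=(50, 5))  # shifted distribution

print("MMD^2 real vs good:", mmd2(real, generated_good, gamma=0.1))
print("MMD^2 real vs bad: ", mmd2(real, generated_bad, gamma=0.1))
```

A lower value indicates that the generated samples are harder to distinguish from the real ones under the chosen kernel, which is how a score such as the 0.030 reported above is typically read. Values are only comparable when the kernel and preprocessing are held fixed.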
Affiliation(s)
- Yinglun Wang
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
- Qiurui Chen
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
- Hongwei Shao
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China
- Rongxin Zhang
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China.
- Han Shen
  - School of Life Sciences and Biopharmaceutics, Guangdong Pharmaceutical University, Guangzhou, 51006, PR China.
7
Ung CY, Correia C, Li H, Adams CM, Westendorf JJ, Zhu S. Multiorgan locked-state model of chronic diseases and systems pharmacology opportunities. Drug Discov Today 2024; 29:103825. [PMID: 37967790 PMCID: PMC11109989 DOI: 10.1016/j.drudis.2023.103825] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/17/2023] [Revised: 10/29/2023] [Accepted: 11/08/2023] [Indexed: 11/17/2023]
Abstract
With increasing human life expectancy, the global medical burden of chronic diseases is growing. Hence, chronic diseases are a pressing health concern and will continue to be in decades to come. Chronic diseases often involve multiple malfunctioning organs in the body. An imminent question is how interorgan crosstalk contributes to the etiology of chronic diseases. We conceived the locked-state model (LoSM), which illustrates how interorgan communication can give rise to body-wide memory-like properties that 'lock' healthy or pathological conditions. Next, we propose cutting-edge systems biology and artificial intelligence strategies to decipher chronic multiorgan locked states. Finally, we discuss the clinical implications of the LoSM and assess the power of systems-based therapies to dismantle pathological multiorgan locked states while improving treatments for chronic diseases.
Affiliation(s)
- Choong Yong Ung
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN, USA
- Cristina Correia
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN, USA
- Hu Li
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN, USA
- Christopher M Adams
  - Division of Endocrinology, Diabetes, Metabolism and Nutrition, Mayo Clinic, Rochester, MN, USA; Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA
- Jennifer J Westendorf
  - Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA; Department of Orthopedic Surgery, Mayo Clinic, Rochester, MN, USA
- Shizhen Zhu
  - Department of Molecular Pharmacology and Experimental Therapeutics, Mayo Clinic, Rochester, MN, USA; Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, USA.