1
Asim MN, Asif T, Hassan F, Dengel A. Protein Sequence Analysis landscape: A Systematic Review of Task Types, Databases, Datasets, Word Embeddings Methods, and Language Models. Database (Oxford) 2025; 2025:baaf027. [PMID: 40448683 DOI: 10.1093/database/baaf027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Revised: 02/06/2025] [Accepted: 03/26/2025] [Indexed: 06/02/2025]
Abstract
Protein sequence analysis examines the order of amino acids within protein sequences to unlock a wealth of knowledge about biological processes and genetic disorders. It helps in forecasting disease susceptibility by finding unique protein signatures, or biomarkers, that are linked to particular disease states. Protein sequence analysis through wet-lab experiments is expensive, time-consuming, and error-prone. To facilitate large-scale proteomics sequence analysis, the biological community is striving to use AI to transition from wet-lab to computer-aided applications. However, proteomics and AI are two distinct fields, and the development of AI-driven protein sequence analysis applications requires knowledge of both domains. To bridge the gap between the two fields, various review articles have been written. However, these articles focus on a few individual tasks or specific applications rather than providing a comprehensive overview of the wide range of tasks and applications. To address the need for comprehensive literature that presents a holistic view of this wide array of tasks and applications, the contributions of this manuscript are manifold: It bridges the gap between the proteomics and AI fields by presenting a comprehensive array of AI-driven applications for 63 distinct protein sequence analysis tasks. It equips AI researchers with the biological foundations of 63 protein sequence analysis tasks. It supports the development of AI-driven protein sequence analysis applications by providing comprehensive details of 68 protein databases. It presents a rich data landscape, encompassing 627 benchmark datasets for 63 diverse protein sequence analysis tasks. It highlights the use of 25 unique word embedding methods and 13 language models in AI-driven protein sequence analysis applications. It accelerates the development of AI-driven applications by reporting current state-of-the-art performance across 63 protein sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Tayyaba Asif
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Faiza Hassan
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence, Kaiserslautern 67663, Germany
  - Department of Computer Science, Rheinland Pfälzische Technische Universität, Kaiserslautern 67663, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
2
Hu B, Su Y, Tian X, Chen C, Chen C, Lv X. GMAMDA: Predicting Metabolite-Disease Associations Based on Adaptive Hardness Negative Sampling and Adaptive Graph Multiple Convolution. J Chem Inf Model 2025; 65:5242-5254. [PMID: 40372801 DOI: 10.1021/acs.jcim.5c00694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/17/2025]
Abstract
Metabolites are small molecules produced during organism metabolism, with their abnormal concentrations closely linked to the onset and progression of various diseases. Accurate prediction of metabolite-disease associations is crucial for early diagnosis, mechanistic exploration, and treatment optimization. However, existing algorithms often overlook the integration of node features and neglect the impact of different hop domains on nodes in the processing of heterogeneous graphs. Furthermore, current methods solely rely on random sampling for selecting negative samples without considering their reliability, thereby compromising model stability. A novel metabolite-disease association prediction model, GMAMDA, is proposed to address these challenges. GMAMDA integrates adaptive hardness negative sampling, adaptive graph multiple convolution techniques, and a multiheterogeneous graph fusion strategy to forecast potential metabolite-disease associations. Initially, by computing multisource similarity information for metabolites and diseases, multiple heterogeneous graph networks are established for metabolite-disease association networks. Subsequently, the adaptive graph's multiconvolution mechanism is employed to generate feature-rich node representations across various heterogeneous graphs by dynamically leveraging information from different hop neighborhoods. The model then utilizes an adaptive hardness negative sampling approach based on principal component analysis to select negative samples with the highest information content for training, enabling the prediction of potential associations between new metabolites and diseases. Experimental findings demonstrate that GMAMDA outperforms state-of-the-art methods across various evaluation metrics, including AUC (0.9962 ± 0.0014), AUPR (0.9967 ± 0.0009), and accuracy (0.9733 ± 0.0042). 
Case studies focusing on Alzheimer's disease and kidney disease further validate GMAMDA's clinical potential in predicting metabolite markers.
Affiliation(s)
- Binglu Hu
  - College of Software, Xinjiang University, Urumqi 830046, Xinjiang, China
- Ying Su
  - College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
  - Xinjiang Aiqiside Detection Technology Co, Ltd, Urumqi 830063, China
- Xuecong Tian
  - College of Information Science and Engineering, Xinjiang University, Urumqi 830046, China
- Chen Chen
  - College of Software, Xinjiang University, Urumqi 830046, Xinjiang, China
- Cheng Chen
  - College of Software, Xinjiang University, Urumqi 830046, Xinjiang, China
- Xiaoyi Lv
  - College of Software, Xinjiang University, Urumqi 830046, Xinjiang, China
3
Asim MN, Ibrahim MA, Zaib A, Dengel A. DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models. Front Med (Lausanne) 2025; 12:1503229. [PMID: 40265190 PMCID: PMC12011883 DOI: 10.3389/fmed.2025.1503229] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/28/2024] [Accepted: 03/10/2025] [Indexed: 04/24/2025] Open
Abstract
Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors, including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genetic blueprint and in understanding the factors that can modify it. This analysis helps in the early detection of genetic diseases and the design of targeted therapies. DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for comprehensive literature that bridges the gap between the two fields, the contributions of this paper are manifold: It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI paradigms, namely classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating information on 36 diverse biological databases that can be used to develop benchmark datasets for 44 different DNA sequence analysis tasks. To enable performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to 44 distinct DNA sequence analysis tasks. It presents applications of word embeddings and language models across 44 distinct DNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of the performance values of predictive pipelines based on 39 word embeddings and 67 language models, as well as the top-performing traditional sequence encoding-based predictors and their performance across 44 DNA sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Arooj Zaib
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
4
Asim MN, Asif T, Mehmood F, Dengel A. Peptide classification landscape: An in-depth systematic literature review on peptide types, databases, datasets, predictors architectures and performance. Comput Biol Med 2025; 188:109821. [PMID: 39987697 DOI: 10.1016/j.compbiomed.2025.109821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Revised: 02/03/2025] [Accepted: 02/05/2025] [Indexed: 02/25/2025]
Abstract
Peptides are gaining significant attention in diverse fields; the pharmaceutical market, for example, has seen a steady rise in peptide-based therapeutics over the past six decades. Peptides have been used in the development of distinct applications, including inhibitors of SARS-CoV-2 and treatments for conditions such as cancer and diabetes. Distinct types of peptides possess unique characteristics, and the development of peptide-specific applications requires discriminating one peptide type from others. To the best of our knowledge, approximately 230 Artificial Intelligence (AI) driven applications have been developed for 22 distinct types of peptides, yet there remains significant room for the development of new predictors. This comprehensive review addresses this critical gap by providing a consolidated platform for the development of AI-driven peptide classification applications. The paper offers several key contributions. It presents the biological foundations of 22 unique peptide types and categorizes them into four main classes: Regulatory, Therapeutic, Nutritional, and Delivery Peptides. It offers an in-depth overview of 47 databases that have been used to develop peptide classification benchmark datasets. It summarizes details of 288 benchmark datasets that are used in the development of diverse types of AI-driven peptide classification applications. It provides a detailed summary of 197 sequence representation learning methods and 94 classifiers that have been used to develop 230 distinct AI-driven peptide classification applications. Across 22 distinct types of peptide classification tasks related to 288 benchmark datasets, it reports the performance values of 230 AI-driven peptide classification applications. It summarizes the experimental settings and evaluation measures that have been employed to assess the performance of AI-driven peptide classification applications. The primary focus of this manuscript is to consolidate scattered information into a single comprehensive platform. This resource will greatly assist researchers who are interested in developing new AI-driven peptide classification applications.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Tayyaba Asif
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
- Faiza Mehmood
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - Institute of Data Sciences, University of Engineering and Technology, Lahore, Pakistan
- Andreas Dengel
  - German Research Center for Artificial Intelligence, Kaiserslautern, 67663, Germany
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
5
Abbasi AF, Asim MN, Dengel A. Transitioning from wet lab to artificial intelligence: a systematic review of AI predictors in CRISPR. J Transl Med 2025; 23:153. [PMID: 39905452 PMCID: PMC11796103 DOI: 10.1186/s12967-024-06013-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/27/2024] [Accepted: 12/18/2024] [Indexed: 02/06/2025] Open
Abstract
The revolutionary CRISPR-Cas9 system leverages a programmable guide RNA (gRNA) and Cas9 proteins to precisely cleave problematic regions within DNA sequences. This groundbreaking technology holds immense potential for the development of targeted therapies for a wide range of diseases, including cancers, genetic disorders, and hereditary diseases. CRISPR-Cas9 based genome editing is a multi-step process that includes designing a precise gRNA, selecting the appropriate Cas protein, and thoroughly evaluating both the on-target and off-target activity of the Cas9-gRNA complex. To ensure the accuracy and effectiveness of the CRISPR-Cas9 system, the process also requires careful analysis, after the targeted DNA cleavage, of the resulting outcomes such as indels and deletions. Following the success of artificial intelligence (AI) in various fields, researchers are now leveraging AI algorithms to accelerate and optimize the multi-step process of the CRISPR-Cas9 system. To achieve this goal, AI-driven applications are being integrated into each step, but existing AI predictors have limited performance, and many steps still rely on expensive and time-consuming wet-lab experiments. The primary reason behind the low performance of AI predictors is the gap between the CRISPR and AI fields. Effective integration of AI into the multi-step CRISPR-Cas9 system demands comprehensive knowledge of both domains. This paper bridges the knowledge gap between AI and CRISPR-Cas9 research. It offers a unique platform for AI researchers to gain a deep understanding of the biological foundations behind each step in the CRISPR-Cas9 multi-step process. Furthermore, it provides details of 80 available CRISPR-Cas9 system-related datasets that can be used to develop AI-driven applications. Within the landscape of AI predictors in the CRISPR-Cas9 multi-step process, it provides insights into representation learning methods, trends in machine and deep learning methods, and the performance values of 50 existing predictive pipelines. In the context of representation learning methods and classifiers/regressors, a thorough analysis of existing predictive pipelines is used to derive recommendations for developing more robust and precise predictive pipelines.
Affiliation(s)
- Ahtisham Fazeel Abbasi
  - Smart Data and Knowledge Services, German Research Center for Artificial Intelligence, 67663, Kaiserslautern, Germany
  - Department of Computer Science, Rhineland-Palatinate Technical University Kaiserslautern-Landau, 67663, Kaiserslautern, Germany
- Muhammad Nabeel Asim
  - Department of Computer Science, Rhineland-Palatinate Technical University Kaiserslautern-Landau, 67663, Kaiserslautern, Germany
- Andreas Dengel
  - Smart Data and Knowledge Services, German Research Center for Artificial Intelligence, 67663, Kaiserslautern, Germany
  - Department of Computer Science, Rhineland-Palatinate Technical University Kaiserslautern-Landau, 67663, Kaiserslautern, Germany
6
Asim MN, Ibrahim MA, Asif T, Dengel A. RNA sequence analysis landscape: A comprehensive review of task types, databases, datasets, word embedding methods, and language models. Heliyon 2025; 11:e41488. [PMID: 39897847 PMCID: PMC11783440 DOI: 10.1016/j.heliyon.2024.e41488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Revised: 12/23/2024] [Accepted: 12/24/2024] [Indexed: 02/04/2025] Open
Abstract
Deciphering the information in RNA sequences reveals their diverse roles in living organisms, including gene regulation and protein synthesis. Aberrations in RNA sequences, such as dysregulation and mutations, can drive a diverse spectrum of diseases, including cancers, genetic disorders, and neurodegenerative conditions. Furthermore, researchers are harnessing RNA's therapeutic potential to transform traditional treatment paradigms into personalized therapies through the development of RNA-based drugs and gene therapies. To gain insights into biological functions, detect diseases at early stages, and develop potent therapeutics, researchers are performing diverse types of RNA sequence analysis tasks. RNA sequence analysis through conventional wet-lab methods is expensive, time-consuming, and error-prone. To enable large-scale RNA sequence analysis, augmenting wet-lab experimental methods with Artificial Intelligence (AI) applications requires scientists to have comprehensive knowledge of both the RNA and AI fields. While molecular biologists encounter challenges in understanding AI methods, computer scientists often lack the basic foundations of RNA sequence analysis tasks. Considering the absence of comprehensive literature that bridges this research gap and promotes the development of AI-driven RNA sequence analysis applications, the contributions of this manuscript are manifold: It equips AI researchers with the biological foundations of 47 distinct RNA sequence analysis tasks. It sets the stage for the development of benchmark datasets related to 47 distinct RNA sequence analysis tasks by summarizing the key details of 64 different biological databases. It presents applications of word embeddings and language models across 47 distinct RNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of the performance values of predictive pipelines based on 58 word embeddings and 70 language models, as well as the top-performing traditional sequence encoding-based predictors and their performance across 47 RNA sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
- Tayyaba Asif
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, 67663, Germany
  - Department of Computer Science, Rhineland-Palatinate Technical University of Kaiserslautern-Landau, Kaiserslautern, 67663, Germany