1. Asim MN, Ibrahim MA, Zaib A, Dengel A. DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models. Front Med (Lausanne) 2025; 12:1503229. [PMID: 40265190 PMCID: PMC12011883 DOI: 10.3389/fmed.2025.1503229] [Received: 09/28/2024] [Accepted: 03/10/2025] [Indexed: 04/24/2025]
Abstract
Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors, including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the information embedded within an organism's genetic blueprint and in understanding the factors that can modify it. This analysis helps in the early detection of genetic diseases and the design of targeted therapies. DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for a comprehensive review that bridges the gap between the two fields, the contributions of this paper are manifold. It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI paradigms, namely classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating information on 36 diverse biological databases that can be used to develop benchmark datasets for the 44 tasks. To support performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to the 44 tasks. It presents applications of word embeddings and language models across the 44 tasks. It streamlines the development of new predictors by providing a comprehensive survey of the performance values of predictive pipelines based on 39 word embeddings and 67 language models, as well as the top-performing traditional sequence encoding-based predictors and their performance across the 44 tasks.
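Word embedding methods need a DNA sequence to be split into "words" first. The abstract does not say how the surveyed methods tokenize sequences, so the following is only a minimal sketch assuming overlapping k-mers, a common choice; the function names, `k`, and `stride` are illustrative and not taken from the paper.

```python
from collections import Counter


def kmer_tokens(sequence, k=3, stride=1):
    """Split a DNA sequence into k-mers (assumed tokenization scheme)."""
    seq = sequence.upper()
    # Slide a window of width k over the sequence in steps of `stride`.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def kmer_counts(sequence, k=3):
    """Count how often each k-mer occurs (a simple bag-of-words encoding)."""
    return Counter(kmer_tokens(sequence, k))


if __name__ == "__main__":
    print(kmer_tokens("ATGCGTA", k=3))
    print(kmer_counts("AAAA", k=2))
```

The resulting token list could then be passed to an embedding or language model of the reader's choice.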
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Arooj Zaib
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
2. Zhou Y, Wang P, Zhang H, Wang T, Han S, Ma X, Liang S, Bai M, Fan P, Wang L, Wang J, Wang Q. Prediction of influenza virus infection based on deep learning and peripheral blood proteomics: A diagnostic study. J Adv Res 2025:S2090-1232(25)00211-5. [PMID: 40158620 DOI: 10.1016/j.jare.2025.03.051] [Received: 01/05/2025] [Revised: 03/26/2025] [Accepted: 03/26/2025] [Indexed: 04/02/2025]
Abstract
INTRODUCTION: Influenza viruses cause seasonal epidemics almost every year, and influenza is difficult to diagnose quickly and accurately. Machine learning and peripheral blood proteomics have brought new ideas to research on clinical markers. OBJECTIVES: To predict key molecular markers of influenza virus infection using an established machine learning model and peripheral blood proteomics. METHODS: This study used the testing data of 850 patients (including influenza, COVID-19, and mixed infections) and 265 healthy individuals to establish and validate a diagnostic prediction model for influenza infection, and verified the potential value of this model in the differential diagnosis of influenza, COVID-19, and healthy people. RESULTS: The overall analysis showed significant differences in 9 clinical features in the influenza group. Principal component analysis could effectively group samples based on these clinical features. The random forest and LASSO regression models showed that the selected features are clinical indicators that can accurately distinguish influenza patients. We performed proteome sequencing combined with machine learning and found a total of 26 differentially expressed proteins (DEPs). Through protein-protein interaction (PPI) analysis and weighted gene co-expression network analysis (WGCNA), we identified several genes related to the proportion of monocytes. We then analyzed the correlation of these factors with immune cell proportions and found that SAA1 and SAA2 were highly correlated with various vital immunocytes. ROC curve analysis showed that SERPINA3 can distinguish influenza, COVID-19, mixed infection, and healthy people; SAA1 can distinguish COVID-19, mixed infection, and healthy people; and SAA2 can distinguish influenza and healthy people. In influenza, high expression of SERPINA3, SAA1, and SAA2 is associated with higher risk. Finally, we used ELISA to confirm that SAA2 protein can be used as an auxiliary diagnostic indicator for influenza infection. CONCLUSIONS: Preliminary results showed that SAA2 is an important molecular marker specific to influenza infection.
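The ROC curve analysis mentioned in the results summarizes how well a single marker separates two groups. As a rough illustration only, the area under the ROC curve can be computed as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The labels and marker values below are made up for the example and are not data from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical marker levels: 1 = infected, 0 = healthy.
    labels = [1, 1, 0, 0]
    marker = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(labels, marker))
```

An AUC near 0.5 would indicate a marker with no discriminative value; values near 1.0 indicate near-perfect separation.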
Affiliation(s)
- Yumei Zhou
  - National Institute of TCM Constitution and Preventive Treatment of Disease, Wangqi Academy of Beijing University of Chinese Medicine, Beijing University of Chinese Medicine, Beijing 100029, PR China
- Pengbo Wang
  - Xinxiang Medical University, Xinxiang, Henan 453003, PR China
- Haiyun Zhang
  - National Institute of TCM Constitution and Preventive Treatment of Disease, Wangqi Academy of Beijing University of Chinese Medicine, Beijing University of Chinese Medicine, Beijing 100029, PR China
  - Medical Laboratory Center, Dalian Municipal Women and Children's Medical Center (Group), Dalian, Liaoning 116033, PR China
- Taihao Wang
  - Capital Medical University, Beijing 100069, PR China
- Shuai Han
  - Inner Mongolia Medical University, Hohhot, Inner Mongolia 010070, PR China
- Xin Ma
  - China Railway Construction Corporation, Beijing Tiejian Hospital, Beijing 100039, PR China
- Shuang Liang
  - Department of Radiology, The Second Affiliated Hospital to Mudanjiang Medical University, Mudanjiang, Heilongjiang 157000, PR China
- Minghua Bai
  - National Institute of TCM Constitution and Preventive Treatment of Disease, Wangqi Academy of Beijing University of Chinese Medicine, Beijing University of Chinese Medicine, Beijing 100029, PR China
- Pengbei Fan
  - National Institute of TCM Constitution and Preventive Treatment of Disease, Wangqi Academy of Beijing University of Chinese Medicine, Beijing University of Chinese Medicine, Beijing 100029, PR China
- Lei Wang
  - Hubei Shizhen Laboratory, Hubei University of Chinese Medicine, Wuhan, Hubei 430065, PR China
- Ji Wang
  - National Institute of TCM Constitution and Preventive Treatment of Disease, Wangqi Academy of Beijing University of Chinese Medicine, Beijing University of Chinese Medicine, Beijing 100029, PR China
  - Hubei Shizhen Laboratory, Hubei University of Chinese Medicine, Wuhan, Hubei 430065, PR China
- Qi Wang
  - National Institute of TCM Constitution and Preventive Treatment of Disease, Wangqi Academy of Beijing University of Chinese Medicine, Beijing University of Chinese Medicine, Beijing 100029, PR China
  - Hubei Shizhen Laboratory, Hubei University of Chinese Medicine, Wuhan, Hubei 430065, PR China
3. Dlugas H, Kim S. A Comparative Study of Network-Based Machine Learning Approaches for Binary Classification in Metabolomics. Metabolites 2025; 15:174. [PMID: 40137139 PMCID: PMC11944042 DOI: 10.3390/metabo15030174] [Received: 01/29/2025] [Revised: 02/21/2025] [Accepted: 02/27/2025] [Indexed: 03/27/2025]
Abstract
Background/Objectives: Metabolomics has recently emerged as a key tool in the biological sciences, offering insights into metabolic pathways and processes. Over the last decade, network-based machine learning approaches have gained significant popularity and application across various fields. While several studies have utilized metabolomics profiles for sample classification, many network-based machine learning approaches remain unexplored for metabolomic-based classification tasks. This study aims to compare the performance of various network-based machine learning approaches, including recently developed methods, in metabolomics-based classification. Methods: A standard data preprocessing procedure was applied to 17 metabolomic datasets, and Bayesian neural network (BNN), convolutional neural network (CNN), feedforward neural network (FNN), Kolmogorov-Arnold network (KAN), and spiking neural network (SNN) were evaluated on each dataset. The datasets varied widely in size, mass spectrometry method, and response variable. Results: With respect to AUC on test data, BNN, CNN, FNN, KAN, and SNN were the top-performing models in 4, 1, 5, 3, and 4 of the 17 datasets, respectively. Regarding F1-score, the top-performing models were BNN (3 datasets), CNN (3 datasets), FNN (4 datasets), KAN (4 datasets), and SNN (3 datasets). For accuracy, BNN, CNN, FNN, KAN, and SNN performed best in 4, 1, 4, 4, and 4 datasets, respectively. Conclusions: No network-based modeling approach consistently outperformed others across the metrics of AUC, F1-score, or accuracy. Our results indicate that while no single network-based modeling approach is superior for metabolomics-based classification tasks, BNN, KAN, and SNN may be underappreciated and underutilized relative to the more commonly used CNN and FNN.
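The per-metric counts in the results come from asking, for each dataset, which model scored highest. A minimal sketch of that tally is given below; the dataset names and AUC values are invented for illustration, and counting every tied model as a winner is an assumption, since the abstract does not say how ties were handled.

```python
from collections import Counter


def tally_top_models(results):
    """Count, per model, the datasets on which it reached the best score.

    `results` maps dataset name -> {model name: score}. Tied models each
    receive a win (an assumption made for this sketch).
    """
    wins = Counter()
    for scores in results.values():
        best = max(scores.values())
        for model, score in scores.items():
            if score == best:
                wins[model] += 1
    return wins


if __name__ == "__main__":
    # Hypothetical test-set AUC values, not results from the study.
    example_auc = {
        "dataset_1": {"BNN": 0.91, "CNN": 0.88, "FNN": 0.90},
        "dataset_2": {"BNN": 0.75, "CNN": 0.80, "FNN": 0.79},
        "dataset_3": {"BNN": 0.83, "CNN": 0.70, "FNN": 0.83},
    }
    print(tally_top_models(example_auc))
```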
Affiliation(s)
- Hunter Dlugas
  - Biostatistics and Bioinformatics Core, Karmanos Cancer Institute, Detroit, MI 48201, USA
  - Department of Oncology, Wayne State University School of Medicine, Detroit, MI 48201, USA
- Seongho Kim
  - Biostatistics and Bioinformatics Core, Karmanos Cancer Institute, Detroit, MI 48201, USA
  - Department of Oncology, Wayne State University School of Medicine, Detroit, MI 48201, USA
4. Achterberg T, de Jong A. ProPr54 web server: predicting σ54 promoters and regulon with a hybrid convolutional and recurrent deep neural network. NAR Genom Bioinform 2025; 7:lqae188. [PMID: 39781509 PMCID: PMC11704786 DOI: 10.1093/nargab/lqae188] [Received: 06/07/2024] [Revised: 11/19/2024] [Accepted: 12/23/2024] [Indexed: 01/12/2025]
Abstract
σ54 serves as an unconventional sigma factor with a distinct mechanism of transcription initiation, which depends on the involvement of a transcription activator. This unique sigma factor σ54 is indispensable for orchestrating the transcription of genes crucial to nitrogen regulation, flagella biosynthesis, motility, chemotaxis and various other essential cellular processes. Currently, no comprehensive tools are available to determine σ54 promoters and regulon in bacterial genomes. Here, we report a σ54 promoter prediction method ProPr54, based on a convolutional neural network trained on a set of 446 validated σ54 binding sites derived from 33 bacterial species. Model performance was tested and compared with respect to bacterial intergenic regions, demonstrating robust applicability. ProPr54 exhibits high performance when tested on various bacterial species, highly surpassing other available σ54 regulon identification methods. Furthermore, analysis on bacterial genomes, which have no experimentally validated σ54 binding sites, demonstrates the generalization of the model. ProPr54 is the first reliable in silico method for predicting σ54 binding sites, making it a valuable tool to support experimental studies on σ54. In conclusion, ProPr54 offers a reliable, broadly applicable tool for predicting σ54 promoters and regulon genes in bacterial genome sequences. A web server is freely accessible at http://propr54.molgenrug.nl.
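Convolutional networks operate on numeric arrays, so a promoter sequence has to be encoded before it can be fed to a model. The abstract does not describe the encoding ProPr54 uses; the sketch below assumes plain one-hot encoding over A, C, G, and T, with ambiguous bases mapped to an all-zero row, purely as an illustration.

```python
def one_hot(sequence, alphabet="ACGT"):
    """One-hot encode a DNA sequence as a list of rows, one row per base.

    Bases outside `alphabet` (for example N) become all-zero rows; this
    handling is an assumption of the sketch.
    """
    index = {base: i for i, base in enumerate(alphabet)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(alphabet)
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in one_hot("ACGTN"):
        print(row)
```

A sequence of length L becomes an L x 4 matrix, which is the usual input shape for a one-dimensional convolution over sequence positions.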
Affiliation(s)
- Tristan Achterberg
  - Department of Molecular Genetics, Groningen, Biomolecular Sciences and Biotechnology Institute, University of Groningen, Nijenborgh 7, 9747 AG Groningen, the Netherlands
- Anne de Jong
  - Department of Molecular Genetics, Groningen, Biomolecular Sciences and Biotechnology Institute, University of Groningen, Nijenborgh 7, 9747 AG Groningen, the Netherlands
5. Liu H, Zhang X, Liu Q. A review of AI-based radiogenomics in neurodegenerative disease. Front Big Data 2025; 8:1515341. [PMID: 40052173 PMCID: PMC11882605 DOI: 10.3389/fdata.2025.1515341] [Received: 10/22/2024] [Accepted: 01/31/2025] [Indexed: 03/09/2025]
Abstract
Neurodegenerative diseases are chronic, progressive conditions that cause irreversible damage to the nervous system, particularly in aging populations. Early diagnosis is a critical challenge, as these diseases often develop slowly and without clear symptoms until significant damage has occurred. Recent advances in radiomics and genomics have provided valuable insights into the mechanisms of these diseases by identifying specific imaging features and genomic patterns. Radiogenomics enhances diagnostic capabilities by linking genomics with imaging phenotypes, offering a more comprehensive understanding of disease progression. The growing field of artificial intelligence (AI), including machine learning and deep learning, opens new opportunities for improving the accuracy and timeliness of these diagnoses. This review examines the application of AI-based radiogenomics in neurodegenerative diseases, summarizing key model designs, performance metrics, publicly available data resources, significant findings, and future research directions. It provides a starting point and guidance for those seeking to explore this emerging area of study.
Affiliation(s)
- Huanjing Liu
  - The Department of Applied Computer Science, Faculty of Science, University of Winnipeg, Winnipeg, MB, Canada
- Xiao Zhang
  - The Department of Biochemistry and Medical Genetics, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
- Qian Liu
  - The Department of Applied Computer Science, Faculty of Science, University of Winnipeg, Winnipeg, MB, Canada
  - The Department of Biochemistry and Medical Genetics, Max Rady College of Medicine, Rady Faculty of Health Sciences, University of Manitoba, Winnipeg, MB, Canada
6. Qingge L, Badal K, Annan R, Sturtz J, Liu X, Zhu B. Generative AI Models for the Protein Scaffold Filling Problem. J Comput Biol 2025; 32:127-142. [PMID: 39441716 DOI: 10.1089/cmb.2024.0510] [Indexed: 10/25/2024]
Abstract
De novo protein sequencing is an important problem in proteomics, playing a crucial role in understanding protein functions, drug discovery, design and evolutionary studies, etc. Top-down and bottom-up tandem mass spectrometry are popular approaches used in the field of mass spectrometry to analyze and sequence proteins. However, these approaches often produce incomplete protein sequences with gaps, namely scaffolds. The protein scaffold filling problem refers to filling the missing amino acids in the gaps of a scaffold to infer the complete protein sequence. In this article, we tackle the protein scaffold filling problem based on generative AI techniques, such as convolutional denoising autoencoder, transformer, and generative pretrained transformer (GPT) models, to complete the protein sequences and compare our results with recently developed convolutional long short-term memory-based sequence model. We evaluate the model performance both on a real dataset and generated datasets. All proposed models show outstanding prediction accuracy. Notably, the GPT-2 model achieves 100% gap-filling accuracy and 100% full sequence accuracy on the MabCampth protein scaffold, which outperforms the other models.
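The two reported figures, gap-filling accuracy and full sequence accuracy, can be read as position-wise agreement with the true sequence, measured either over the gap positions only or over the whole sequence. The abstract does not define them, so the sketch below assumes exactly that reading; the `-` gap marker and the toy sequences are hypothetical.

```python
def gap_filling_accuracy(scaffold, predicted, target, gap="-"):
    """Fraction of gap positions in `scaffold` that `predicted` fills correctly."""
    if not (len(scaffold) == len(predicted) == len(target)):
        raise ValueError("all three sequences must have the same length")
    gaps = [i for i, residue in enumerate(scaffold) if residue == gap]
    if not gaps:
        return 1.0  # nothing to fill
    correct = sum(1 for i in gaps if predicted[i] == target[i])
    return correct / len(gaps)


def full_sequence_accuracy(predicted, target):
    """Fraction of all positions where `predicted` matches `target`."""
    if len(predicted) != len(target) or not target:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for p, t in zip(predicted, target) if p == t)
    return matches / len(target)


if __name__ == "__main__":
    scaffold = "MK--LV-A"   # toy scaffold with three missing residues
    target = "MKTALVSA"     # toy ground truth
    predicted = "MKTGLVSA"  # toy model output with one wrong residue
    print(gap_filling_accuracy(scaffold, predicted, target))
    print(full_sequence_accuracy(predicted, target))
```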
Affiliation(s)
- Letu Qingge
  - Department of Computer Science, North Carolina A&T State University, Greensboro, North Carolina, USA
- Kushal Badal
  - Department of Computer Science, North Carolina A&T State University, Greensboro, North Carolina, USA
- Richard Annan
  - Department of Computer Science, North Carolina A&T State University, Greensboro, North Carolina, USA
- Jordan Sturtz
  - Department of Computer Science, North Carolina A&T State University, Greensboro, North Carolina, USA
- Xiaowen Liu
  - John W. Deming Department of Medicine, Tulane University, New Orleans, Louisiana, USA
- Binhai Zhu
  - Gianforte School of Computing, Montana State University, Bozeman, Montana, USA
7. Wilcox A, Griffith M, Griffith O. Looking Forward to AI and Medicine: Where Are We, and Where Are We Going? Missouri Medicine 2025; 122:34-38. [PMID: 39958602 PMCID: PMC11827648] [Indexed: 02/18/2025]
Abstract
Artificial intelligence (AI) has emerged as a significant area of interest in medicine, with the potential to influence various aspects of health care. However, the real benefits of applications of AI to medicine often become obscured by the considerable attention, and hype, around its capabilities. To better understand AI's role in medicine, it is important to contextualize its development and review the stages of AI evolution that have contributed to its present state.
Affiliation(s)
- Adam Wilcox
  - Director, Center for Applied Health Informatics, Professor of Medicine, Division of General Medicine & Geriatrics, Washington University School of Medicine, St. Louis, Missouri
- Malachi Griffith
  - Associate Professors of Medicine (Oncology) and Genetics and Assistant Director of the McDonnell Genome Institute at Washington University, St. Louis, Missouri
- Obi Griffith
  - Associate Professors of Medicine (Oncology) and Genetics and Assistant Director of the McDonnell Genome Institute at Washington University, St. Louis, Missouri
8. Kihlman R, Launonen I, Sillanpää MJ, Waldmann P. Sub-sampling graph neural networks for genomic prediction of quantitative phenotypes. G3 (Bethesda, Md.) 2024; 14:jkae216. [PMID: 39250757 PMCID: PMC11540326 DOI: 10.1093/g3journal/jkae216] [Received: 05/23/2024] [Accepted: 09/03/2024] [Indexed: 09/11/2024]
Abstract
In genomics, the use of deep learning (DL) is growing rapidly, and DL has demonstrated its ability to uncover complex relationships in large biological and biomedical data sets. With the development of high-throughput sequencing techniques, genomic markers can now be allocated to large sections of a genome. By analyzing allele sharing between individuals, one may calculate realized genomic relationships from single-nucleotide polymorphism (SNP) data rather than relying on known pedigree relationships under a polygenic model. Traditional approaches to genome-wide prediction (GWP) of quantitative phenotypes use genomic relationships in fixed global covariance modeling, possibly with some nonlinear kernel mapping (for example, Gaussian processes). On the other hand, the DL approaches proposed so far for GWP fail to take into account the non-Euclidean graph structure of relationships between individuals over several generations. In this paper, we propose one global convolutional neural network (GCN) and one local sub-sampling architecture (GCN-RS) that are specifically designed to perform regression analysis based on genomic relationship information. A GCN is tailored to non-Euclidean spaces and consists of several layers of graph convolutions. The GCN-RS architecture is designed to further improve the GCN's performance by sub-sampling the graph to reduce the dimensionality of the input data. Through these graph convolutional layers, the GCN maps input genomic markers to their quantitative phenotype values. The graphs are constructed using an iterative nearest neighbor approach. Comparisons show that the GCN-RS outperforms the popular Genomic Best Linear Unbiased Predictor method on one simulated and three real datasets from wheat, mice, and pigs, with a predictive improvement of 4.4% to 49.4% in terms of test mean squared error. This indicates that GCN-RS is a promising tool for genomic predictions in plants and animals. Furthermore, GCN-RS is computationally efficient, making it a viable option for large-scale applications.
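A graph convolution updates each individual's features by mixing them with those of its neighbors in the relationship graph. The abstract does not give the exact layer used, so the sketch below assumes the widely used symmetric normalization with self-loops followed by a ReLU; the adjacency matrix, features, and weights are toy values, and pure Python lists are used only to keep the example dependency-free.

```python
import math


def normalized_adjacency(adj):
    """Return D^(-1/2) (A + I) D^(-1/2) for an adjacency matrix given as nested lists."""
    n = len(adj)
    # Add self-loops so every node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [
        [a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
        for i in range(n)
    ]


def graph_conv(adj, features, weights):
    """One graph convolution: ReLU(normalized_adjacency @ features @ weights)."""
    norm = normalized_adjacency(adj)
    n, f_in, f_out = len(adj), len(weights), len(weights[0])
    # Step 1: aggregate neighbor features.
    agg = [
        [sum(norm[i][k] * features[k][f] for k in range(n)) for f in range(f_in)]
        for i in range(n)
    ]
    # Step 2: linear transform followed by ReLU.
    return [
        [max(0.0, sum(agg[i][f] * weights[f][o] for f in range(f_in))) for o in range(f_out)]
        for i in range(n)
    ]


if __name__ == "__main__":
    adjacency = [[0, 1], [1, 0]]   # two related individuals (toy graph)
    markers = [[1.0], [3.0]]       # one toy feature per individual
    weights = [[1.0]]              # hypothetical 1x1 weight matrix
    print(graph_conv(adjacency, markers, weights))
```

Stacking several such layers and ending with a single output unit per node would give a regression model over the graph, in the spirit of what the abstract describes.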
Affiliation(s)
- Ragini Kihlman
  - Research Unit of Mathematical Sciences, University of Oulu, FI-90014 University of Oulu, Finland
- Ilkka Launonen
  - Research Unit of Mathematical Sciences, University of Oulu, FI-90014 University of Oulu, Finland
- Mikko J Sillanpää
  - Research Unit of Mathematical Sciences, University of Oulu, FI-90014 University of Oulu, Finland
- Patrik Waldmann
  - Research Unit of Mathematical Sciences, University of Oulu, FI-90014 University of Oulu, Finland
9. Jaiswal S, Murthy HA, Narayanan M. SpecGMM: Integrating Spectral analysis and Gaussian Mixture Models for taxonomic classification and identification of discriminative DNA regions. Bioinformatics Advances 2024; 4:vbae171. [PMID: 39659586 PMCID: PMC11631429 DOI: 10.1093/bioadv/vbae171] [Received: 05/27/2024] [Revised: 10/17/2024] [Accepted: 11/01/2024] [Indexed: 12/12/2024]
Abstract
Motivation: Genomic signal processing (GSP), which transforms biomolecular sequences into discrete signals for spectral analysis, has provided valuable insights into DNA sequence, structure, and evolution. However, challenges persist with spectral representations of variable-length sequences for tasks like species classification and in interpreting these spectra to identify discriminative DNA regions. Results: We introduce SpecGMM, a novel framework that integrates sliding window-based Spectral analysis with a Gaussian Mixture Model to transform variable-length DNA sequences into fixed-dimensional spectral representations for taxonomic classification. SpecGMM's hyperparameters were selected using a dataset of plant sequences and applied unchanged across diverse datasets, including mitochondrial DNA, viral and bacterial genomes, and 16S rRNA sequences. Across these datasets, SpecGMM outperformed a baseline method, with 9.45% average and 35.55% maximum improvement in test accuracies for a Linear Discriminant classifier. Regarding interpretability, SpecGMM revealed discriminative hypervariable regions in 16S rRNA sequences, particularly V3/V4 for discriminating higher taxa and V2/V3 for lower taxa, corroborating their known classification relevance. SpecGMM's spectrogram video analysis helped visualize species-specific DNA signatures. SpecGMM thus provides a robust and interpretable method for spectral DNA analysis, opening new avenues in GSP research. Availability and implementation: SpecGMM's source code is available at https://github.com/BIRDSgroup/SpecGMM.
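The sliding-window spectral step can be pictured as follows: map each nucleotide to a number, cut the resulting signal into windows, and take the magnitude of the discrete Fourier transform of each window. This is only a rough sketch of that first stage, not SpecGMM's implementation; the numeric mapping, window size, and step are hypothetical, and the Gaussian Mixture Model stage that would turn the variable number of window spectra into a fixed-length vector is omitted.

```python
import cmath

# Hypothetical nucleotide-to-number mapping; the paper's mapping is not given in the abstract.
MAPPING = {"A": 1.0, "C": -1.0, "G": 2.0, "T": -2.0}


def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier transform coefficient of `signal`."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]


def sliding_spectra(sequence, window=4, step=2):
    """Magnitude spectrum of every window of the numerically mapped sequence."""
    signal = [MAPPING.get(base, 0.0) for base in sequence.upper()]
    return [
        dft_magnitudes(signal[start:start + window])
        for start in range(0, len(signal) - window + 1, step)
    ]


if __name__ == "__main__":
    for spectrum in sliding_spectra("ACGTACGT", window=4, step=2):
        print([round(value, 3) for value in spectrum])
```

Longer sequences simply yield more windows, which is why a second stage is needed to obtain a representation of fixed dimension.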
Affiliation(s)
- Saish Jaiswal
  - Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai 600036, India
- Hema A Murthy
  - Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai 600036, India
  - Department of Computer Science and Engineering, Shiv Nadar University, Chennai 603110, India
- Manikandan Narayanan
  - Department of Computer Science and Engineering, Indian Institute of Technology (IIT) Madras, Chennai 600036, India
  - Center for Integrative Biology and Systems Medicine, IIT Madras, Chennai 600036, India
  - Robert Bosch Centre for Data Science and Artificial Intelligence, IIT Madras, Chennai 600036, India
10. Tripathi S, Gabriel K, Tripathi PK, Kim E. Large language models reshaping molecular biology and drug development. Chem Biol Drug Des 2024; 103:e14568. [PMID: 38898381 DOI: 10.1111/cbdd.14568] [Received: 02/13/2024] [Revised: 05/18/2024] [Accepted: 06/04/2024] [Indexed: 06/21/2024]
Abstract
The use of large language models (LLMs) has become a significant advancement in the domains of medicine and clinical informatics, offering revolutionary potential for scientific breakthroughs and customized therapies. LLMs are trained on large datasets and exhibit the capacity to comprehend and analyze intricate biological data, encompassing genomic sequences, protein structures, and clinical health records. By drawing on their comprehension of the language of biology, they can reveal concealed patterns and insights that may evade human researchers. LLMs have been shown to positively impact various aspects of molecular biology, including genomic analysis, drug development, precision medicine, biomarker development, experimental design, collaborative research, and accessibility to specialized expertise. However, it is imperative to acknowledge and tackle the obstacles and ethical implications involved. Careful consideration of data bias and generalization, data privacy and security, explainability and interpretability, and ethical concerns around responsible application is vital. Successfully resolving these obstacles will enable us to fully utilize the capabilities of LLMs, leading to substantial progress in the fields of molecular biology and pharmaceutical research. This progress also has the potential to benefit both the individual and the broader community.
Affiliation(s)
- Satvik Tripathi
  - Drexel University, Philadelphia, Pennsylvania, USA
  - Harvard Medical School, Boston, Massachusetts, USA
- Kyla Gabriel
  - Harvard Medical School, Boston, Massachusetts, USA
- Edward Kim
  - Drexel University, Philadelphia, Pennsylvania, USA
11. Marchi F, Bellini E, Iandelli A, Sampieri C, Peretti G. Exploring the landscape of AI-assisted decision-making in head and neck cancer treatment: a comparative analysis of NCCN guidelines and ChatGPT responses. Eur Arch Otorhinolaryngol 2024; 281:2123-2136. [PMID: 38421392 DOI: 10.1007/s00405-024-08525-z] [Received: 12/14/2023] [Accepted: 02/02/2024] [Indexed: 03/02/2024]
Abstract
PURPOSE: Recent breakthroughs in natural language processing and machine learning, exemplified by ChatGPT, have spurred a paradigm shift in healthcare. Released by OpenAI in November 2022, ChatGPT rapidly gained global attention. Trained on massive text datasets, this large language model holds immense potential to revolutionize healthcare. However, existing literature often overlooks the need for rigorous validation and real-world applicability. METHODS: This head-to-head comparative study assesses ChatGPT's capabilities in providing therapeutic recommendations for head and neck cancers. Simulating every NCCN Guidelines scenario, ChatGPT is queried on primary treatment, adjuvant treatment, and follow-up, with responses compared to the NCCN Guidelines. Performance metrics, including sensitivity, specificity, and F1 score, are employed for assessment. RESULTS: The study includes 68 hypothetical cases and 204 clinical scenarios. ChatGPT exhibits promising capabilities in addressing NCCN-related queries, achieving high sensitivity and overall accuracy across primary treatment, adjuvant treatment, and follow-up. The study's metrics showcase robustness in providing relevant suggestions. However, a few inaccuracies are noted, especially in primary treatment scenarios. CONCLUSION: Our study highlights the proficiency of ChatGPT in providing treatment suggestions. The model's alignment with the NCCN Guidelines sets the stage for a nuanced exploration of AI's evolving role in oncological decision support. However, challenges related to the interpretability of AI in clinical decision-making and the importance of clinicians understanding the underlying principles of AI models remain unexplored. As AI continues to advance, collaborative efforts between models and medical experts are deemed essential for unlocking new frontiers in personalized cancer care.
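Sensitivity, specificity, and F1 score all follow from the four cells of a confusion matrix. The sketch below shows the standard formulas; the counts in the example are invented and do not come from the study, and how the authors mapped ChatGPT responses to true or false positives is not described in the abstract.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (0.0 when undefined)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + sensitivity:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    else:
        f1 = 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(classification_metrics(tp=8, fp=2, tn=6, fn=4))
```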
Affiliation(s)
- Filippo Marchi
  - Unit of Otorhinolaryngology-Head and Neck Surgery, IRCCS Ospedale Policlinico San Martino, Largo Rosanna Benzi, 10, 16132, Genoa, Italy
  - Department of Surgical Sciences and Integrated Diagnostics (DISC), University of Genoa, 16132, Genoa, Italy
- Elisa Bellini
  - Unit of Otorhinolaryngology-Head and Neck Surgery, IRCCS Ospedale Policlinico San Martino, Largo Rosanna Benzi, 10, 16132, Genoa, Italy
  - Department of Surgical Sciences and Integrated Diagnostics (DISC), University of Genoa, 16132, Genoa, Italy
- Andrea Iandelli
  - Unit of Otorhinolaryngology-Head and Neck Surgery, IRCCS Ospedale Policlinico San Martino, Largo Rosanna Benzi, 10, 16132, Genoa, Italy
- Claudio Sampieri
  - Department of Experimental Medicine (DIMES), University of Genoa, Genoa, Italy
  - Department of Otolaryngology-Hospital Cliníc, Barcelona, Spain
  - Functional Unit of Head and Neck Tumors-Hospital Cliníc, Barcelona, Spain
- Giorgio Peretti
  - Unit of Otorhinolaryngology-Head and Neck Surgery, IRCCS Ospedale Policlinico San Martino, Largo Rosanna Benzi, 10, 16132, Genoa, Italy
  - Department of Surgical Sciences and Integrated Diagnostics (DISC), University of Genoa, 16132, Genoa, Italy