1
Zhu Y, Yu W, Li X. A Multi-objective transfer learning framework for time series forecasting with Concept Echo State Networks. Neural Netw 2025; 186:107272. [PMID: 39999532 DOI: 10.1016/j.neunet.2025.107272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/14/2024] [Revised: 01/22/2025] [Accepted: 02/11/2025] [Indexed: 02/27/2025]
Abstract
This paper introduces a novel transfer learning framework for time series forecasting that uses a Concept Echo State Network (CESN) and a multi-objective optimization strategy. Our approach addresses the challenges of feature extraction and knowledge transfer in heterogeneous data environments. By optimizing a CESN for each data source, we extract targeted features that capture the unique characteristics of individual datasets. Additionally, our multi-network architecture enables effective knowledge sharing among different ESNs, leading to improved forecasting performance. To further enhance efficiency, CESN reduces the need for extensive hyperparameter tuning by optimizing only the concept matrix and output weights. Our proposed framework offers a promising solution for forecasting problems where data is diverse, limited, or missing.
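The abstract gives no implementation details, so the following is only a minimal sketch of a plain echo state network, not the authors' CESN. It illustrates the general idea of keeping the reservoir fixed and fitting only a linear readout. The function names, reservoir size, spectral radius, ridge penalty, and sine-wave data are all illustrative assumptions, and the concept matrix is not modelled here.

```python
import numpy as np

def make_reservoir(n_inputs, n_units, spectral_radius=0.9, seed=0):
    """Draw fixed random input and recurrent weights (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, size=(n_units, n_inputs))
    w = rng.uniform(-0.5, 0.5, size=(n_units, n_units))
    # Rescale so the largest eigenvalue magnitude equals spectral_radius.
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def run_reservoir(w_in, w, inputs):
    """Collect reservoir states x(t) = tanh(W_in u(t) + W x(t-1))."""
    states = np.zeros((len(inputs), w.shape[0]))
    x = np.zeros(w.shape[0])
    for t, u in enumerate(inputs):
        x = np.tanh(w_in @ u + w @ x)
        states[t] = x
    return states

def fit_readout(states, targets, ridge=1e-6):
    """Ridge regression for the output weights; the reservoir stays untrained."""
    a = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(a, states.T @ targets)

# Toy one-step-ahead forecast of a sine wave (illustrative data only).
t = np.arange(0, 60, 0.1)
series = np.sin(t)[:, None]
w_in, w = make_reservoir(n_inputs=1, n_units=50)
states = run_reservoir(w_in, w, series[:-1])
washout = 50  # discard early states that still depend on the zero initial state
w_out = fit_readout(states[washout:], series[1 + washout:])
pred = states[washout:] @ w_out
mse = float(np.mean((pred - series[1 + washout:]) ** 2))
print(f"training MSE: {mse:.2e}")
```

Only `w_out` is learned in this sketch, which is what keeps training cheap compared with backpropagating through a recurrent network.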
Affiliation(s)
- Yingqin Zhu
  - CINVESTAV-IPN Departamento de Control Automático, Av. IPN 2508, Mexico City, 07360, Mexico
- Wen Yu
  - CINVESTAV-IPN Departamento de Control Automático, Av. IPN 2508, Mexico City, 07360, Mexico
- Xiaoou Li
  - CINVESTAV-IPN Departamento de Computación, Av. IPN 2508, Mexico City, 07360, Mexico
2
Liu Y, Li C, Shen LC, Yan H, Wei G, Gasser RB, Hu X, Song J, Yu DJ. scRCA: A Siamese network-based pipeline for annotating cell types using noisy single-cell RNA-seq reference data. Comput Biol Med 2025; 190:110068. [PMID: 40158457 DOI: 10.1016/j.compbiomed.2025.110068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/07/2024] [Revised: 03/19/2025] [Accepted: 03/20/2025] [Indexed: 04/02/2025]
Abstract
Accurate cell type annotation is critical for single-cell RNA sequencing (scRNA-seq) data analysis, providing insight into tissue-specific cell heterogeneity and cell state transitions. Cell type annotation is usually conducted by comparative analysis with known data (i.e., a reference) that is presumed to represent cell types accurately. However, this assumption is often problematic, as factors such as human errors in wet-lab experiments and methodological limitations can introduce annotation errors in the reference dataset. As current pipelines for single-cell transcriptomic analysis do not adequately consider this challenge, there is a strong need for a computational pipeline that achieves high-quality cell type annotation using reference datasets containing inherent errors (referred to as "noise" in this study). Here, we built a Siamese network-based pipeline, termed scRCA, to accurately annotate cell types based on noisy reference data. To help users evaluate the reliability of scRCA annotations, an interpreter was also developed to explore the factors underlying the model's predictions. Our experiments demonstrate that, across 14 datasets, scRCA outperformed other widely adopted reference-based methods for cell type annotation. Using an independent dataset of four multiple myeloma patients, we further illustrated that scRCA can distinguish cancerous cells based on gene expression levels and identify genes closely associated with multiple myeloma through its interpretable module, providing significant information for subsequent clinical treatments. With these advancements, we anticipate that scRCA will serve as a practical reference-based approach for accurate cell type annotation.
Affiliation(s)
- Yan Liu
  - Department of Computer Science, Yangzhou University, Yangzhou, 225100, China
- Chen Li
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, 3800, Australia
- Long-Chen Shen
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- He Yan
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
- Guo Wei
  - School of Life Sciences, Nanjing University, Nanjing, 210023, China
- Robin B Gasser
  - Monash Data Futures Institute, Monash University, Melbourne, Victoria, 3800, Australia
- Xiaohua Hu
  - Information Department, The First Affiliated Hospital of Naval Military Medical University, Changhai Road 168, Shanghai, 200433, China
- Jiangning Song
  - Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, 3800, Australia; Monash Data Futures Institute, Monash University, Melbourne, Victoria, 3800, Australia
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, 200 Xiaolingwei, Nanjing, 210094, China
3
Lu Y, Liu D, Liang Z, Liu R, Chen P, Liu Y, Li J, Feng Z, Li LM, Sheng B, Jia W, Chen L, Li H, Wang Y. A pretrained transformer model for decoding individual glucose dynamics from continuous glucose monitoring data. Natl Sci Rev 2025; 12:nwaf039. [PMID: 40191259 PMCID: PMC11970253 DOI: 10.1093/nsr/nwaf039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2025] [Revised: 01/22/2025] [Accepted: 02/05/2025] [Indexed: 04/09/2025] Open
Abstract
Continuous glucose monitoring (CGM) technology has grown rapidly to track real-time blood glucose levels and trends with improved sensor accuracy. The ease of use and wide availability of CGM will facilitate safe and effective decision making for diabetes management. Here, we developed an attention-based deep learning model, CGMformer, pretrained on a well-controlled and diverse corpus of CGM data to represent an individual's intrinsic metabolic state and enable clinical applications. During pretraining, CGMformer encodes glucose dynamics, including glucose level, fluctuation, hyperglycemia, and hypoglycemia, into latent space with self-supervised learning. It shows generalizability in imputing glucose values across five external datasets with different populations and metabolic states (MAE = 3.7 mg/dL). We then fine-tuned CGMformer towards a diverse panel of downstream tasks in the screening of diabetes and its complications using task-specific data, which demonstrated consistently higher predictive accuracy than direct fine-tuning on a single task (AUROC = 0.914 for type 2 diabetes (T2D) screening and 0.741 for complication screening). By learning an intrinsic representation of an individual's glucose dynamics, CGMformer classifies non-diabetic individuals into six clusters with elevated T2D risks, and identifies a specific cluster with lean body shape but high risk of glucose metabolism disorders, which is overlooked by traditional glucose measurements. Furthermore, CGMformer achieves high accuracy in predicting an individual's postprandial glucose response with dietary modelling (Pearson correlation coefficient = 0.763) and supports personalized dietary recommendations.
Overall, CGMformer pretrains a transformer neural network architecture to learn an intrinsic representation by borrowing information from a large amount of daily glucose profiles, and demonstrates predictive capabilities when fine-tuned towards a broad range of downstream applications, holding promise for the early warning of T2D and recommendations for lifestyle modification in diabetes management.
Affiliation(s)
- Yurun Lu
  - Center for Excellence in Mathematical Sciences, National Center for Mathematics and Interdisciplinary Sciences, Hua Loo-Keng Center for Mathematical Sciences, Key Laboratory of Management, Decision and Information System, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - School of Mathematics, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing 100049, China
- Dan Liu
  - Department of Endocrinology and Metabolism, Shanghai Sixth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai Diabetes Institute, Shanghai Clinical Center for Diabetes, Shanghai Key Laboratory of Diabetes Mellitus, Shanghai 200233, China
- Zhongming Liang
  - Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
  - BGI-Research, Hangzhou 310030, China
- Rui Liu
  - School of Mathematics, South China University of Technology, Guangzhou 510640, China
- Pei Chen
  - School of Mathematics, South China University of Technology, Guangzhou 510640, China
- Yitong Liu
  - Center for Excellence in Mathematical Sciences, National Center for Mathematics and Interdisciplinary Sciences, Hua Loo-Keng Center for Mathematical Sciences, Key Laboratory of Management, Decision and Information System, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - School of Mathematics, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing 100049, China
- Jiachen Li
  - Center for Excellence in Mathematical Sciences, National Center for Mathematics and Interdisciplinary Sciences, Hua Loo-Keng Center for Mathematical Sciences, Key Laboratory of Management, Decision and Information System, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - School of Mathematics, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing 100049, China
- Zhanying Feng
  - Center for Excellence in Mathematical Sciences, National Center for Mathematics and Interdisciplinary Sciences, Hua Loo-Keng Center for Mathematical Sciences, Key Laboratory of Management, Decision and Information System, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - Department of Statistics, Department of Biomedical Data Science, Bio-X Program, Stanford University, Stanford CA 94305, USA
- Lei M Li
  - Center for Excellence in Mathematical Sciences, National Center for Mathematics and Interdisciplinary Sciences, Hua Loo-Keng Center for Mathematical Sciences, Key Laboratory of Management, Decision and Information System, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
- Bin Sheng
  - Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
- Weiping Jia
  - Department of Endocrinology and Metabolism, Shanghai Sixth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai Diabetes Institute, Shanghai Clinical Center for Diabetes, Shanghai Key Laboratory of Diabetes Mellitus, Shanghai 200233, China
- Luonan Chen
  - State Key Laboratory of Cell Biology, Center for Excellence in Molecular Cell Science, Shanghai Institute of Biochemistry and Cell Biology, Chinese Academy of Sciences, Shanghai 200031, China
  - Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
  - Guangdong Institute of Intelligence Science and Technology, Zhuhai 519031, China
  - Pazhou Laboratory (Huangpu), Guangzhou 510555, China
- Huating Li
  - Department of Endocrinology and Metabolism, Shanghai Sixth People's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai Diabetes Institute, Shanghai Clinical Center for Diabetes, Shanghai Key Laboratory of Diabetes Mellitus, Shanghai 200233, China
- Yong Wang
  - Center for Excellence in Mathematical Sciences, National Center for Mathematics and Interdisciplinary Sciences, Hua Loo-Keng Center for Mathematical Sciences, Key Laboratory of Management, Decision and Information System, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
  - School of Mathematics, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing 100049, China
  - Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou 310024, China
4
Zhang Y, Wang J, Li C, Duan H, Wang W. Attention-based deep learning models for predicting anomalous shock of wastewater treatment plants. Water Research 2025; 275:123192. [PMID: 39893907 DOI: 10.1016/j.watres.2025.123192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Revised: 12/24/2024] [Accepted: 01/22/2025] [Indexed: 02/04/2025]
Abstract
Quickly estimating influent water quality indicators (WQIs) that are time-consuming to measure, such as total nitrogen (TN) and total phosphorus (TP), is an essential prerequisite for wastewater treatment plants (WWTPs) to respond promptly to sudden shock loads. Soft detection methods based on machine learning models, especially deep learning models, perform well in predicting the normal fluctuations of these time-consuming WQIs but struggle to predict their sudden fluctuations, mainly because extreme-fluctuation data for model training are scarce. This work employs attention mechanisms to aid deep learning models in learning patterns of anomalous water quality. The lack of interpretability has long hindered the optimization of deep learning models for different application scenarios. Therefore, local and global sensitivity analyses are performed based on the best-performing attention-based deep learning and ordinary machine learning models, respectively, allowing for reliable feature importance quantification with a small computational burden. In the case study, three types of attention-based deep learning models were developed: an attention-based multilayer perceptron (A-MLP), a Transformer composed of a stacked A-MLP encoder and A-MLP decoder, and a feature-temporal attention-based long short-term memory (FTA-LSTM) neural network with encoder-decoder architecture. These attention-based deep learning models consistently outperform the corresponding baseline models in predicting the testing set of TN, TP, and chemical oxygen demand (COD) time series and the anomalous values therein, clearly demonstrating the positive effect of the integrated attention mechanism. Among them, FTA-LSTM outperforms A-MLP and Transformer (2.01-38.48% higher R², 0-85.14% higher F1-score, 0-62.57% higher F2-score).
Predicting anomalous water quality using attention-based deep learning models is a novel attempt that drives WWTP operation towards being safer, cleaner, and more cost-efficient.
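The abstract reports F1- and F2-scores for the anomalous values but does not define them. As a hedged illustration, the sketch below assumes they are the standard F-beta scores with beta = 1 and beta = 2, where F_beta = (1 + beta²)·TP / ((1 + beta²)·TP + beta²·FN + FP). The function name `f_beta` and the toy labels are hypothetical and are not taken from the paper.

```python
def f_beta(y_true, y_pred, beta=1.0):
    """F-beta score for binary anomaly labels (1 = anomalous, 0 = normal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)      # anomalies caught
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)  # false alarms
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)  # anomalies missed
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

# Toy example: 4 true anomalies, 2 detected, 1 false alarm.
y_true = [0, 0, 1, 1, 1, 0, 0, 1]
y_pred = [0, 1, 1, 0, 1, 0, 0, 0]
f1 = f_beta(y_true, y_pred, beta=1.0)  # 4/7
f2 = f_beta(y_true, y_pred, beta=2.0)  # 10/19
print(round(f1, 4), round(f2, 4))
```

With beta = 2, missed anomalies are penalised more heavily than false alarms, which is a common reason to report F2 alongside F1 when rare shock events matter most.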
Affiliation(s)
- Yituo Zhang
  - School of Ecology and Environment, Harbin Institute of Technology, Shenzhen, 518055, China
- Jihong Wang
  - School of Ecology and Environment, Harbin Institute of Technology, Shenzhen, 518055, China
- Chaolin Li
  - School of Ecology and Environment, Harbin Institute of Technology, Shenzhen, 518055, China; State Key Laboratory of Urban Water Resource and Environment, Harbin Institute of Technology, Harbin, 150090, China
- Hengpan Duan
  - School of Ecology and Environment, Harbin Institute of Technology, Shenzhen, 518055, China
- Wenhui Wang
  - School of Ecology and Environment, Harbin Institute of Technology, Shenzhen, 518055, China
5
Joshi CP, Baldi A, Kumar N, Pradhan J. Harnessing network pharmacology in drug discovery: an integrated approach. Naunyn-Schmiedeberg's Archives of Pharmacology 2025; 398:4689-4703. [PMID: 39621088 DOI: 10.1007/s00210-024-03625-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/22/2024] [Accepted: 11/09/2024] [Indexed: 04/11/2025]
Abstract
The traditional drug discovery approach is based on a one drug-one target paradigm, which is associated with very lengthy timelines, high costs and very low success rates. Network pharmacology (NP) is a novel method of drug design that is based on a multiple-target approach. NP integrates systems biology, pharmacology and computational techniques to address the limitations of traditional methods of drug discovery. By mapping biological networks, it provides deep insights into the interactions of biological molecules and enhances our understanding of drug mechanisms, polypharmacology and disease etiology. This review explores the theoretical framework of network pharmacology, discussing the principles and methodologies that enable the construction of drug-target and disease-gene networks. It highlights how data mining, bioinformatics tools and computational models are utilised to predict drug behaviour, repurpose existing drugs and identify novel therapeutic targets. Applications of network pharmacology in the treatment of complex diseases (such as cancer, neurodegenerative disorders, cardiovascular diseases and infectious diseases) are extensively covered, demonstrating its potential to identify multi-target drugs for multifaceted disease mechanisms. Despite the promising results, NP faces challenges due to incomplete and low-quality biological data, computational complexities and biological system redundancy. It also faces regulatory challenges in drug approval, demanding revision of regulatory guidelines towards multi-target therapies. Advancements in AI and machine learning, dynamic network modelling and global collaboration can further enhance the efficacy of network pharmacology. This integrative approach has the potential to revolutionise drug discovery, offering new solutions for personalised medicine, drug repurposing and tackling the complexities of modern diseases.
Affiliation(s)
- Chandra Prakash Joshi
  - Department of Pharmaceutical Sciences, Mohanlal Sukhadia University, Udaipur, Rajasthan, India
- Ashish Baldi
  - Pharma Innovation Lab, Department of Pharmaceutical Sciences and Technology, Maharaja Ranjit Singh Punjab Technical University, Bathinda, Punjab, India
- Neeraj Kumar
  - B N College of Pharmacy, B. N. University, Udaipur, Rajasthan, India
- Joohee Pradhan
  - Department of Pharmaceutical Sciences, Mohanlal Sukhadia University, Udaipur, Rajasthan, India
6
Pekayvaz K, Heinig M, Stark K. Predictive cardio-omics: translating single-cell multiomics into tools for personalized medicine. Nat Rev Cardiol 2025; 22:305-306. [PMID: 39900732 DOI: 10.1038/s41569-025-01132-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2025]
Affiliation(s)
- Kami Pekayvaz
  - Medizinische Klinik und Poliklinik I, LMU University Hospital, Munich, Germany
  - DZHK (German Centre for Cardiovascular Research), Partner Site Munich Heart Alliance, Munich, Germany
- Matthias Heinig
  - DZHK (German Centre for Cardiovascular Research), Partner Site Munich Heart Alliance, Munich, Germany
  - Institute of Computational Biology, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany
  - Department of Computer Science, TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- Konstantin Stark
  - Medizinische Klinik und Poliklinik I, LMU University Hospital, Munich, Germany
  - DZHK (German Centre for Cardiovascular Research), Partner Site Munich Heart Alliance, Munich, Germany
7
Csendes G, Sanz G, Szalay KZ, Szalai B. Benchmarking foundation cell models for post-perturbation RNA-seq prediction. BMC Genomics 2025; 26:393. [PMID: 40269681 PMCID: PMC12016270 DOI: 10.1186/s12864-025-11600-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2024] [Accepted: 04/14/2025] [Indexed: 04/25/2025] Open
Abstract
Accurately predicting cellular responses to perturbations is essential for understanding cell behaviour in both healthy and diseased states. While perturbation data is ideal for building such predictive models, its availability is considerably lower than that of baseline (non-perturbed) cellular data. To address this limitation, several foundation cell models have been developed using large-scale single-cell gene expression data. These models are fine-tuned after pre-training for specific tasks, such as predicting post-perturbation gene expression profiles, and are considered state-of-the-art for these problems. However, proper benchmarking of these models remains an unsolved challenge. In this study, we benchmarked two recently published foundation models, scGPT and scFoundation, against baseline models. Surprisingly, we found that even the simplest baseline model, which takes the mean of the training examples, outperformed scGPT and scFoundation. Furthermore, basic machine learning models that incorporate biologically meaningful features outperformed scGPT by a large margin. Additionally, we identified that the current Perturb-Seq benchmark datasets exhibit low perturbation-specific variance, making them suboptimal for evaluating such models. Our results highlight important limitations in current benchmarking approaches and provide insights into more effectively evaluating post-perturbation gene expression prediction models.
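The "mean of training examples" baseline mentioned in the abstract can be illustrated with a short sketch. This is an assumption about its general form, not the authors' benchmark code: the helper names and the toy expression values are hypothetical, and the paper's exact preprocessing and metrics are not given in the abstract.

```python
def mean_baseline(train_profiles):
    """Predict the per-gene mean of all training post-perturbation profiles."""
    n = len(train_profiles)
    n_genes = len(train_profiles[0])
    return [sum(profile[g] for profile in train_profiles) / n for g in range(n_genes)]

def mean_squared_error(predicted, observed):
    """Average squared difference across genes."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

# Toy data: three training perturbations, three genes each (made-up values).
train = [
    [1.0, 0.0, 2.0],
    [3.0, 1.0, 2.0],
    [2.0, 2.0, 2.0],
]
held_out = [2.0, 1.0, 3.0]  # profile of an unseen perturbation

prediction = mean_baseline(train)  # same prediction for every unseen perturbation
error = mean_squared_error(prediction, held_out)
print(prediction, error)
```

Because this baseline ignores which gene was perturbed, it can only score well when perturbation-specific variance is low, which is consistent with the dataset limitation the abstract describes.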
8
Kedzierska KZ, Crawford L, Amini AP, Lu AX. Zero-shot evaluation reveals limitations of single-cell foundation models. Genome Biol 2025; 26:101. [PMID: 40251685 PMCID: PMC12007350 DOI: 10.1186/s13059-025-03574-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Accepted: 04/09/2025] [Indexed: 04/20/2025] Open
Abstract
Foundation models such as scGPT and Geneformer have not been rigorously evaluated in a setting where they are used without any further training (i.e., zero-shot). Understanding the performance of models in zero-shot settings is critical to applications that exclude the ability to fine-tune, such as discovery settings where labels are unknown. Our evaluation of the zero-shot performance of Geneformer and scGPT suggests that, in some cases, these models may face reliability challenges and could be outperformed by simpler methods. Our findings underscore the importance of zero-shot evaluations in development and deployment of foundation models in single-cell research.
Affiliation(s)
- Alex X Lu
  - Microsoft Research, Cambridge, MA, USA
9
da Silva WM, Cazella SC, Rech RS. Deep learning algorithms to assist in imaging diagnosis in individuals with disc herniation or spondylolisthesis: A scoping review. Int J Med Inform 2025; 201:105933. [PMID: 40252304 DOI: 10.1016/j.ijmedinf.2025.105933] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2025] [Revised: 04/13/2025] [Accepted: 04/16/2025] [Indexed: 04/21/2025]
Abstract
BACKGROUND: Deep learning applications in medical imaging have advanced significantly, supporting the diagnosis of spinal disorders such as disc herniation and spondylolisthesis. This study aimed to review deep learning algorithms used in diagnostic imaging for these conditions.
METHODS: A scoping review was conducted following PRISMA-ScR guidelines and registered in the Open Science Framework. Literature searches were performed in PubMed, Lilacs, ScienceDirect, Web of Science, Wiley Online Library, Embase, IEEE Xplore, and Google Scholar. Studies published in the last ten years in English, Portuguese, or Spanish applying deep learning to lumbar spine imaging were included. Exclusions comprised reviews, expert opinions, and studies not focusing on lumbar imaging. Of 258 identified records, 71 duplicates were removed, leaving 187 for screening. After full-text assessment, 18 met eligibility criteria.
RESULTS: Nine studies investigated disc herniation, primarily using magnetic resonance imaging (MRI), while the remaining nine focused on spondylolisthesis based on X-ray imaging. Convolutional neural networks (CNNs), particularly ResNet-based architectures, were the most frequently used models, demonstrating high accuracy and sensitivity in classification tasks. MRI was predominant for disc herniation, while X-ray was preferred for spondylolisthesis. However, limitations included small dataset sizes, lack of external validation, and challenges in generalizing findings across populations.
CONCLUSION: While deep learning holds promise for enhancing diagnostic accuracy and efficiency, further research is needed to standardize evaluation methods, expand dataset diversity, and improve model robustness for real-world clinical applications.
Affiliation(s)
- William Moraes da Silva
  - Universidade Federal de Ciências da Saúde de Porto Alegre - UFCSPA, Porto Alegre/RS, Brazil
- Silvio César Cazella
  - Universidade Federal de Ciências da Saúde de Porto Alegre - UFCSPA, Porto Alegre/RS, Brazil
- Rafaela Soares Rech
  - Universidade Federal de Ciências da Saúde de Porto Alegre - UFCSPA, Porto Alegre/RS, Brazil
10
Kalfon J, Samaran J, Peyré G, Cantini L. scPRINT: pre-training on 50 million cells allows robust gene network predictions. Nat Commun 2025; 16:3607. [PMID: 40240364 PMCID: PMC12003772 DOI: 10.1038/s41467-025-58699-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2024] [Accepted: 03/24/2025] [Indexed: 04/18/2025] Open
Abstract
A cell is governed by the interaction of myriads of macromolecules. Inferring such a network of interactions has remained an elusive milestone in cellular biology. Building on recent advances in large foundation models and their ability to learn without supervision, we present scPRINT, a large cell model for the inference of gene networks pre-trained on more than 50 million cells from the cellxgene database. Using innovative pretraining tasks and model architecture, scPRINT pushes large transformer models towards more interpretability and usability when uncovering the complex biology of the cell. Based on our atlas-level benchmarks, scPRINT demonstrates superior performance in gene network inference to the state of the art, as well as competitive zero-shot abilities in denoising, batch effect correction, and cell label prediction. On an atlas of benign prostatic hyperplasia, scPRINT highlights the profound connections between ion exchange, senescence, and chronic inflammation.
Affiliation(s)
- Jérémie Kalfon
  - Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics group, F-75015, Paris, France
- Jules Samaran
  - Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics group, F-75015, Paris, France
- Gabriel Peyré
  - CNRS and DMA de l'Ecole Normale Supérieure, CNRS, Ecole Normale Supérieure, Université PSL, 75005, Paris, France
- Laura Cantini
  - Institut Pasteur, Université Paris Cité, CNRS UMR 3738, Machine Learning for Integrative Genomics group, F-75015, Paris, France
11
Yates J, Van Allen EM. New horizons at the interface of artificial intelligence and translational cancer research. Cancer Cell 2025; 43:708-727. [PMID: 40233719 PMCID: PMC12007700 DOI: 10.1016/j.ccell.2025.03.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2025] [Revised: 03/04/2025] [Accepted: 03/12/2025] [Indexed: 04/17/2025]
Abstract
Artificial intelligence (AI) is increasingly being utilized in cancer research as a computational strategy for analyzing multiomics datasets. Advances in single-cell and spatial profiling technologies have contributed significantly to our understanding of tumor biology, and AI methodologies are now being applied to accelerate translational efforts, including target discovery, biomarker identification, patient stratification, and therapeutic response prediction. Despite these advancements, the integration of AI into clinical workflows remains limited, presenting both challenges and opportunities. This review discusses AI applications in multiomics analysis and translational oncology, emphasizing their role in advancing biological discoveries and informing clinical decision-making. Key areas of focus include cellular heterogeneity, tumor microenvironment interactions, and AI-aided diagnostics. Challenges such as reproducibility, interpretability of AI models, and clinical integration are explored, with attention to strategies for addressing these hurdles. Together, these developments underscore the potential of AI and multiomics to enhance precision oncology and contribute to advancements in cancer care.
Affiliation(s)
- Josephine Yates
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA; Institute for Machine Learning, Department of Computer Science, ETH Zürich, Zurich, Switzerland; ETH AI Center, ETH Zurich, Zurich, Switzerland; Swiss Institute for Bioinformatics (SIB), Lausanne, Switzerland
- Eliezer M Van Allen
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA; Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA; Division of Medical Sciences, Harvard University, Boston, MA, USA; Parker Institute for Cancer Immunotherapy, Dana-Farber Cancer Institute, Boston, MA, USA
12
McDermott M, Mehta R, Roussos Torres ET, MacLean AL. Modeling the dynamics of EMT reveals genes associated with pan-cancer intermediate states and plasticity. NPJ Syst Biol Appl 2025; 11:31. [PMID: 40210876 PMCID: PMC11986130 DOI: 10.1038/s41540-025-00512-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2024] [Accepted: 03/28/2025] [Indexed: 04/12/2025] Open
Abstract
Epithelial-mesenchymal transition (EMT) is a cell state transition co-opted by cancer that drives metastasis via stable intermediate states. Here we study EMT dynamics to identify marker genes of highly metastatic intermediate cells via mathematical modeling with single-cell RNA sequencing (scRNA-seq) data. Across multiple tumor types and stimuli, we identified genes consistently upregulated in EMT intermediate states, many previously unrecognized as EMT markers. Bayesian parameter inference of a simple EMT mathematical model revealed tumor-specific transition rates, providing a framework to quantify EMT progression. Consensus analysis of differential expression, RNA velocity, and model-derived dynamics highlighted SFN and NRG1 as key regulators of intermediate EMT. Independent validation confirmed SFN as an intermediate state marker. Our approach integrates modeling and inference to identify genes associated with EMT dynamics, offering biomarkers and therapeutic targets to modulate tumor-promoting cell state transitions driven by EMT.
Affiliation(s)
- MeiLu McDermott
- Department of Quantitative and Computational Biology, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA, USA
| | - Riddhee Mehta
- Department of Quantitative and Computational Biology, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA, USA
| | - Evanthia T Roussos Torres
- Department of Medicine, Division of Medical Oncology, Keck School of Medicine, Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, CA, USA
| | - Adam L MacLean
- Department of Quantitative and Computational Biology, Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA, USA.
| |
|
13
|
Haber E, Deshpande A, Ma J, Krieger S. Unified integration of spatial transcriptomics across platforms. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.03.31.646238. [PMID: 40236180 PMCID: PMC11996334 DOI: 10.1101/2025.03.31.646238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2025]
Abstract
Spatial transcriptomics (ST) has transformed our understanding of tissue architecture and cellular interactions, but integrating ST data across platforms remains challenging due to differences in gene panels, data sparsity, and technical variability. Here, we introduce LLOKI, a novel framework for integrating imaging-based ST data from diverse platforms without requiring shared gene panels. LLOKI addresses ST integration through two key alignment tasks: feature alignment across technologies and batch alignment across datasets. Feature alignment constructs a graph based on spatial proximity and gene expression to propagate features and impute missing values. Optimal transport adjusts data sparsity to match scRNA-seq references, enabling single-cell foundation models such as scGPT to generate unified features. Batch alignment then refines scGPT-transformed embeddings, mitigating batch effects while preserving biological variability. Evaluations on mouse brain samples from five different technologies demonstrate that LLOKI outperforms existing methods and is effective for cross-technology spatial gene program identification and tissue slice alignment. Applying LLOKI to five ovarian cancer datasets, we identify an integrated gene program indicative of tumor-infiltrating T cells across gene panels. Together, LLOKI provides a robust foundation for cross-platform ST studies, with the potential to scale to large atlas datasets, enabling deeper insights into cellular organization and tissue environments.
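The graph-based feature propagation mentioned in this abstract can be illustrated with a minimal sketch. It is not LLOKI's implementation: the graph here is a k-nearest-neighbor graph built from spatial coordinates only (the abstract states that gene expression is also used), and the function names, `k`, and the mixing weight `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def knn_indices(coords, k):
    """Indices of the k spatially nearest neighbors of each cell (brute force)."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise squared distances; O(n^2) memory, fine for a small illustration.
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a cell is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]

def propagate(expr, coords, k=2, alpha=0.5):
    """Blend each cell's expression with the mean of its spatial neighbors.

    alpha is an assumed mixing weight: 0 keeps the raw values,
    1 replaces them with the neighborhood mean.
    """
    expr = np.asarray(expr, dtype=float)
    nbrs = knn_indices(coords, k)            # shape (n_cells, k)
    neighbor_mean = expr[nbrs].mean(axis=1)  # shape (n_cells, n_genes)
    return (1.0 - alpha) * expr + alpha * neighbor_mean

# Toy data (hypothetical): three cells on a line, two genes.
coords = [[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]]
expr = [[4.0, 0.0], [0.0, 0.0], [2.0, 2.0]]
smoothed = propagate(expr, coords, k=1, alpha=0.5)
print(smoothed)
```

In this toy case the second cell, which has no measured counts, receives half of its nearest neighbor's expression, which is the sense in which propagation over a graph can fill in sparse or missing values.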
|
14
|
Chhibbar P, Das J. Machine learning approaches enable the discovery of therapeutics across domains. Mol Ther 2025:S1525-0016(25)00275-8. [PMID: 40186352 DOI: 10.1016/j.ymthe.2025.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2025] [Revised: 03/21/2025] [Accepted: 04/01/2025] [Indexed: 04/07/2025] Open
Abstract
Multi-modal datasets have grown exponentially in the last decade. This has created an enormous demand for machine learning models that can predict complex outcomes by leveraging cellular, molecular, and humoral profiles. Corresponding inference of mechanisms can help to uncover new therapeutic targets. Here, we discuss how biological principles guide the design of predictive models and how interpretable machine learning can lead to novel mechanistic insights. We provide descriptions of multiple learning techniques and how suited they are to domain adaptations. Finally, we talk about broad learning capabilities of foundation models on large datasets and whether they can be used to provide meaningful inference about biological datasets.
Affiliation(s)
- Prabal Chhibbar
- Centre for Systems Immunology, Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA; Integrative Systems Biology PhD Program, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA.
| | - Jishnu Das
- Centre for Systems Immunology, Department of Immunology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA; Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA.
| |
|
15
|
Lin M, Guo J, Gu Z, Tang W, Tao H, You S, Jia D, Sun Y, Jia P. Machine learning and multi-omics integration: advancing cardiovascular translational research and clinical practice. J Transl Med 2025; 23:388. [PMID: 40176068 PMCID: PMC11966820 DOI: 10.1186/s12967-025-06425-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2024] [Accepted: 03/25/2025] [Indexed: 04/04/2025] Open
Abstract
The global burden of cardiovascular diseases continues to rise, making their prevention, diagnosis and treatment increasingly critical. With advancements and breakthroughs in omics technologies such as high-throughput sequencing, multi-omics approaches can offer a closer reflection of the complex physiological and pathological changes in the body from a molecular perspective, providing new microscopic insights into cardiovascular disease research. However, due to the vast volume and complexity of data, accurately describing, utilising, and translating these biomedical data demands substantial effort. Researchers and clinicians are actively developing artificial intelligence (AI) methods for data-driven knowledge discovery and causal inference using various omics data. These AI approaches, integrated with multi-omics research, have shown promising outcomes in cardiovascular studies. In this review, we outline the methods for integrating machine learning, one of the most successful applications of AI, with omics data and summarise representative AI models that leverage various omics data to facilitate the exploration of cardiovascular diseases from underlying mechanisms to clinical practice. Particular emphasis is placed on the effectiveness of using AI to extract potential molecular information to address current knowledge gaps. We discuss the challenges and opportunities of integrating omics with AI into routine diagnostic and therapeutic practices and anticipate the future development of novel AI models for wider application in the field of cardiovascular diseases.
Affiliation(s)
- Mingzhi Lin
- Department of Cardiology, The First Hospital of China Medical University, 155 Nanjing North Street, Heping District, Shenyang, 110001, People's Republic of China
| | - Jiuqi Guo
- Department of Cardiology, The First Hospital of China Medical University, 155 Nanjing North Street, Heping District, Shenyang, 110001, People's Republic of China
| | - Zhilin Gu
- Department of Cardiology, The First Hospital of China Medical University, 155 Nanjing North Street, Heping District, Shenyang, 110001, People's Republic of China
| | - Wenyi Tang
- Department of Cardiology, The First Hospital of China Medical University, 155 Nanjing North Street, Heping District, Shenyang, 110001, People's Republic of China
| | - Hongqian Tao
- Department of Cardiology, The First Hospital of China Medical University, 155 Nanjing North Street, Heping District, Shenyang, 110001, People's Republic of China
| | - Shilong You
- Department of Cardiology, The First Hospital of China Medical University, 155 Nanjing North Street, Heping District, Shenyang, 110001, People's Republic of China
| | - Dalin Jia
- Department of Cardiology, The First Hospital of China Medical University, 155 Nanjing North Street, Heping District, Shenyang, 110001, People's Republic of China.
| | - Yingxian Sun
- Department of Cardiology, The First Hospital of China Medical University, 155 Nanjing North Street, Heping District, Shenyang, 110001, People's Republic of China.
- Key Laboratory of Environmental Stress and Chronic Disease Control and Prevention, Ministry of Education, China Medical University, Shenyang, Liaoning, China.
| | - Pengyu Jia
- Department of Cardiology, The First Hospital of China Medical University, 155 Nanjing North Street, Heping District, Shenyang, 110001, People's Republic of China.
| |
|
16
|
Liang H, Berger B, Singh R. Tracing the Shared Foundations of Gene Expression and Chromatin Structure. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.03.31.646349. [PMID: 40235997 PMCID: PMC11996408 DOI: 10.1101/2025.03.31.646349] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2025]
Abstract
The three-dimensional organization of chromatin into topologically associating domains (TADs) may impact gene regulation by bringing distant genes into contact. However, many questions about TADs' function and their influence on transcription remain unresolved due to technical limitations in defining TAD boundaries and measuring the direct effect that TADs have on gene expression. Here, we develop consensus TAD maps for human and mouse with a novel "bag-of-genes" approach for defining the gene composition within TADs. This approach enables new functional interpretations of TADs by providing a way to capture species-level differences in chromatin organization. We also leverage a generative AI foundation model computed from 33 million transcriptomes to define contextual similarity, an embedding-based metric that is more powerful than co-expression at representing functional gene relationships. Our analytical framework directly leads to testable hypotheses about chromatin organization across cellular states. We find that TADs play an active role in facilitating gene co-regulation, possibly through a mechanism involving transcriptional condensates. We also discover that the TAD-linked enhancement of transcriptional context is strongest in early developmental stages and systematically declines with aging. Investigation of cancer cells shows distinct patterns of TAD usage that shift with chemotherapy treatment, suggesting specific roles for TAD-mediated regulation in cellular development and plasticity. Finally, we develop "TAD signatures" to improve statistical analysis of single-cell transcriptomic data sets in predicting cancer cell-line drug response. These findings reshape our understanding of cellular plasticity in development and disease, indicating that chromatin organization acts through probabilistic mechanisms rather than deterministic rules. Software availability: https://singhlab.net/tadmap.
|
17
|
Li D, Zhu Y, Mehmood A, Liu Y, Qin X, Dong Q. Intelligent identification of foodborne pathogenic bacteria by self-transfer deep learning and ensemble prediction based on single-cell Raman spectrum. Talanta 2025; 285:127268. [PMID: 39644671 DOI: 10.1016/j.talanta.2024.127268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 11/12/2024] [Accepted: 11/21/2024] [Indexed: 12/09/2024]
Abstract
Foodborne pathogenic infections pose a significant threat to human health. Accurate detection of foodborne diseases is essential in preventing disease transmission. This study proposes an AI model for precisely identifying foodborne pathogenic bacteria based on single-cell Raman spectra. Self-transfer deep learning and ensemble prediction algorithms were incorporated into the model framework to improve training efficiency and predictive performance, which significantly improved prediction results. The model simultaneously identifies the Gram type (negative or positive), genus, and species of foodborne pathogenic bacteria with an accuracy above 99.99%, and recognizes the strain with an accuracy above 99.49%. At all four classification levels, the model achieved unprecedented predictive performance. This advancement holds practical significance for medical detection and diagnosis of foodborne diseases by reducing false negatives.
Affiliation(s)
- Daixi Li
- Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China; Peng Cheng National Laboratory, Vanke Cloud City Phase I Building 8, Xili Street, Nanshan District, Shenzhen, Guangdong, 518055, China.
| | - Yuqi Zhu
- Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
| | - Aamir Mehmood
- State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences and School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
| | - Yangtai Liu
- Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
| | - Xiaojie Qin
- Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
| | - Qingli Dong
- Institute of Biothermal Engineering, University of Shanghai for Science and Technology, Shanghai, 20093, China
| |
|
18
|
Sun Y, Tan W, Gu Z, He R, Chen S, Pang M, Yan B. A data-efficient strategy for building high-performing medical foundation models. Nat Biomed Eng 2025; 9:539-551. [PMID: 40044818 DOI: 10.1038/s41551-025-01365-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Accepted: 02/04/2025] [Indexed: 04/04/2025]
Abstract
Foundation models are pretrained on massive datasets. However, collecting medical datasets is expensive and time-consuming, and raises privacy concerns. Here we show that synthetic data generated via conditioning with disease labels can be leveraged for building high-performing medical foundation models. We pretrained a retinal foundation model, first with approximately one million synthetic retinal images with physiological structures and feature distribution consistent with real counterparts, and then with only 16.7% of the 904,170 real-world colour fundus photography images required in a recently reported retinal foundation model (RETFound). The data-efficient model performed as well as or better than RETFound across nine public datasets and four diagnostic tasks; and for diabetic-retinopathy grading, it used only 40% of the expert-annotated training data used by RETFound. We also support the generalizability of the data-efficient strategy by building a classifier for the detection of tuberculosis on chest X-ray images. The text-conditioned generation of synthetic data may enhance the performance and generalization of medical foundation models.
Affiliation(s)
- Yuqi Sun
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
| | - Weimin Tan
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
| | - Zhuoyao Gu
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
| | - Ruian He
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
| | - Siyuan Chen
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
| | - Miao Pang
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
| | - Bo Yan
- Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China.
| |
|
19
|
Cui H, Tejada-Lapuerta A, Brbić M, Saez-Rodriguez J, Cristea S, Goodarzi H, Lotfollahi M, Theis FJ, Wang B. Towards multimodal foundation models in molecular cell biology. Nature 2025; 640:623-633. [PMID: 40240854 DOI: 10.1038/s41586-025-08710-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Accepted: 01/29/2025] [Indexed: 04/18/2025]
Abstract
The rapid advent of high-throughput omics technologies has created an exponential growth in biological data, often outpacing our ability to derive molecular insights. Large-language models have shown a way out of this data deluge in natural language processing by integrating massive datasets into a joint model with manifold downstream use cases. Here we envision developing multimodal foundation models, pretrained on diverse omics datasets, including genomics, transcriptomics, epigenomics, proteomics, metabolomics and spatial profiling. These models are expected to exhibit unprecedented potential for characterizing the molecular states of cells across a broad continuum, thereby facilitating the creation of holistic maps of cells, genes and tissues. Context-specific transfer learning of the foundation models can empower diverse applications from novel cell-type recognition, biomarker discovery and gene regulation inference, to in silico perturbations. This new paradigm could launch an era of artificial intelligence-empowered analyses, one that promises to unravel the intricate complexities of molecular cell biology, to support experimental design and, more broadly, to profoundly extend our understanding of life sciences.
Affiliation(s)
- Haotian Cui
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
- Peter Munk Cardiac Center, University Health Network, Toronto, Ontario, Canada
| | - Alejandro Tejada-Lapuerta
- Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
- School of Computing, Information and Technology, Technical University of Munich, Munich, Germany
| | - Maria Brbić
- School of Computer and Communication Sciences, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
- School of Life Sciences, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Julio Saez-Rodriguez
- Institute for Computational Biomedicine, Heidelberg University, Faculty of Medicine, Heidelberg University Hospital, Heidelberg, Germany
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
| | - Simona Cristea
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Hani Goodarzi
- Arc Institute, Palo Alto, CA, USA
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
| | - Mohammad Lotfollahi
- Wellcome Sanger Institute, Cambridge, UK
- Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
| | - Fabian J Theis
- Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany.
- School of Computing, Information and Technology, Technical University of Munich, Munich, Germany.
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany.
| | - Bo Wang
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.
- Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada.
- Peter Munk Cardiac Center, University Health Network, Toronto, Ontario, Canada.
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada.
| |
|
20
|
El Hachimi C, Belaqziz S, Khabba S, Daccache A, Ait Hssaine B, Karjoun H, Ouassanouan Y, Sebbar B, Kharrou MH, Er-Raki S, Chehbouni A. Physics-informed neural networks for enhanced reference evapotranspiration estimation in Morocco: Balancing semi-physical models and deep learning. CHEMOSPHERE 2025; 374:144238. [PMID: 39983624 DOI: 10.1016/j.chemosphere.2025.144238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2024] [Revised: 01/22/2025] [Accepted: 02/16/2025] [Indexed: 02/23/2025]
Abstract
Reference evapotranspiration (ETo) is essential for agricultural water management, crop productivity, and irrigation systems. The Penman-Monteith (PM) equation is the standard method for estimating ETo, but its data-intensive nature makes it impractical, especially in situations where the cost of a full standardized weather station is prohibitive, maintenance is inadequate, or data quality and continuity are compromised. To overcome these limitations, various semi-physical (SP) and empirical models with limited weather parameters were developed. In this context, artificial intelligence methods for ETo estimation are gaining more attention, balancing simplicity, minimal data requirements, and high accuracy. However, their data-driven nature raises concerns regarding explainability, trustworthiness, adherence to bio-physical laws, and reliability in operational settings. To address this issue, this paper, inspired by the emerging field of Physics-Informed Neural Networks (PINNs), evaluates the integration of SP models into the loss function during the learning process. The new residual loss combines two losses (the data-driven loss and the loss from SP) through a θ parameter, allowing for a convex combination. In-situ agrometeorological data were collected at four automatic weather stations in the Tensift Watershed in Morocco, including air temperature (Ta), solar radiation (Rs), relative humidity (RH), and wind speed (Ws). The study integrates Priestley-Taylor (PT), Makkink (MK), Hargreaves-Samani (HS), and Abtew (AB), under four scenarios of data availability levels: (1) Ta, Rs and RH; (2) Ta and Rs; (3) only Ta; and (4) only Rs. The investigation begins with quality-controlling the data and studying the driving factors of ETo. Next, the SP models were calibrated using the CMA-ES optimization algorithm. The proposed PINN was trained and evaluated, first, for the equal contribution scenario (θ = 0.5) and then for θ in the interval [0, 1] with a step of 0.2, thus analyzing the impact of θ on the PINN performance. For the equal contribution, the results showed that the integration had improved the PINN performance in all scenarios in terms of the RMSE and R², surpassing the fully data-driven model (θ = 0) and the baseline model (θ = 1). Additionally, for all θ within the interval [0.2, 0.8], the PINN required less training to reach optimal values. Finally, the optimal θ values were determined for each scenario using CMA-ES and were 0.258, 0.771, 0.7226 and 0.169 for PT, MK, HS and AB, respectively. While PINNs demonstrated a promising approach for accurate ETo estimation and consequently improved water resource management, the study also represents a step towards implementing controlled, trustworthy, and physics-informed AI in environmental science.
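The θ-weighted loss described in this abstract can be sketched as a convex combination of two terms. The sketch below is an illustration under stated assumptions, not the authors' implementation: mean squared error is assumed for both terms (the abstract does not name the error measure), the function names are hypothetical, and the toy numbers are invented only to exercise the code. The convention that θ = 0 is fully data-driven and θ = 1 relies only on the semi-physical (SP) model follows the abstract.

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two equally shaped arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.mean((a - b) ** 2))

def combined_loss(eto_pred, eto_obs, eto_sp, theta):
    """Convex combination of a data-driven loss and a semi-physical loss.

    theta = 0 -> purely data-driven; theta = 1 -> purely semi-physical.
    MSE for both terms is an assumption made for this sketch.
    """
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    data_loss = mse(eto_pred, eto_obs)  # mismatch with observed ETo
    sp_loss = mse(eto_pred, eto_sp)     # mismatch with the SP model estimate
    return (1.0 - theta) * data_loss + theta * sp_loss

# Toy values (hypothetical, mm/day) used only to exercise the function.
pred = [4.0, 5.0, 6.0]
obs = [4.0, 5.0, 8.0]  # data loss = 4/3
sp = [5.0, 6.0, 7.0]   # SP loss = 1
for theta in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(theta, round(combined_loss(pred, obs, sp, theta), 4))
```

Because the weights (1 − θ) and θ are non-negative and sum to one, the combined value always lies between the two individual losses, which is what makes θ interpretable as a balance between data and physics.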
Affiliation(s)
- Chouaib El Hachimi
- Center for Remote Sensing Applications (CRSA), Mohammed VI Polytechnic University (UM6P), Benguerir, Morocco; Department of Biological and Agricultural Engineering, University of California, Davis, CA, 95616, USA.
| | - Salwa Belaqziz
- Center for Remote Sensing Applications (CRSA), Mohammed VI Polytechnic University (UM6P), Benguerir, Morocco; LabSIV Laboratory, Faculty of Science, Department of Computer Science, Ibn Zohr University, Agadir, Morocco
| | - Saïd Khabba
- Center for Remote Sensing Applications (CRSA), Mohammed VI Polytechnic University (UM6P), Benguerir, Morocco; LMFE, Department of Physics, Faculty of Sciences Semlalia (FSSM), Cadi Ayyad University (UCA), Marrakesh, Morocco
| | - Andre Daccache
- Department of Biological and Agricultural Engineering, University of California, Davis, CA, 95616, USA
| | - Bouchra Ait Hssaine
- Center for Remote Sensing Applications (CRSA), Mohammed VI Polytechnic University (UM6P), Benguerir, Morocco
| | - Hasan Karjoun
- Lab. Computer Science, Artificial Intelligence and Cyber Security (2IACS), ENSET, Hassan II University of Casablanca, Morocco
| | - Youness Ouassanouan
- Center for Remote Sensing Applications (CRSA), Mohammed VI Polytechnic University (UM6P), Benguerir, Morocco
| | - Badreddine Sebbar
- Center for Remote Sensing Applications (CRSA), Mohammed VI Polytechnic University (UM6P), Benguerir, Morocco; Centre d'Etudes Spatiales de la Biosphère (CESBIO), Université de Toulouse, CNES, CNRS, IRD, UPS, 31400, Toulouse, France
| | - Mohamed Hakim Kharrou
- International Water Research Institute (IWRI), Mohammed VI Polytechnic University (UM6P), Benguerir, Morocco
| | - Salah Er-Raki
- Center for Remote Sensing Applications (CRSA), Mohammed VI Polytechnic University (UM6P), Benguerir, Morocco; ProcEDE/AgroBiotech Center, Department of Physics, Faculty of Sciences and Technics (FSTM), Cadi Ayyad University (UCA), Marrakesh, Morocco
| | - Abdelghani Chehbouni
- Center for Remote Sensing Applications (CRSA), Mohammed VI Polytechnic University (UM6P), Benguerir, Morocco; Centre d'Etudes Spatiales de la Biosphère (CESBIO), Université de Toulouse, CNES, CNRS, IRD, UPS, 31400, Toulouse, France
| |
|
21
|
Dutta S, Goswami S, Debnath S, Adhikary S, Majumder A. MusicalBSI - musical genres responses to fMRI signals analysis with prototypical model agnostic meta-learning for brain state identification in data scarce environment. Comput Biol Med 2025; 188:109795. [PMID: 39946786 DOI: 10.1016/j.compbiomed.2025.109795] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Revised: 12/04/2024] [Accepted: 02/02/2025] [Indexed: 03/05/2025]
Abstract
Functional magnetic resonance imaging is a popular non-invasive brain-computer interfacing technique to monitor brain activities corresponding to several physical or neurological responses by measuring blood flow changes in different parts of the brain. Recent studies have shown that blood flow within the brain can have signature activity patterns in response to various musical genres. However, few studies address the automated recognition of musical genres from functional magnetic resonance imaging. This is because such data are difficult to obtain, and currently available open-source data are insufficient to build an accurate deep-learning model. To solve this, we propose a prototypical model-agnostic meta-learning framework for accurately classifying musical genres by studying blood flow dynamics using functional magnetic resonance imaging. A test with open-source data collected, with consent, from 20 human subjects for 6 different mental states resulted in up to 97.25 ± 1.38% accuracy when training with only 30 samples, surpassing state-of-the-art methods. Further, a detailed evaluation of the performance confirms the model's reliability.
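The "prototypical" part of such a framework can be illustrated with a minimal nearest-prototype classifier, a common building block in few-shot learning. This sketch covers only that step: it does not include the model-agnostic meta-learning loop, it operates on ready-made feature vectors rather than fMRI signals, and the function names and toy data are hypothetical rather than taken from the paper.

```python
import numpy as np

def class_prototypes(support_x, support_y):
    """Mean feature vector (prototype) of each class in the support set."""
    support_x = np.asarray(support_x, dtype=float)
    support_y = np.asarray(support_y)
    labels = np.unique(support_y)  # sorted class labels
    protos = np.stack([support_x[support_y == c].mean(axis=0) for c in labels])
    return labels, protos

def predict(query_x, labels, protos):
    """Assign each query sample to the class of its nearest prototype."""
    query_x = np.asarray(query_x, dtype=float)
    # Squared Euclidean distance from every query to every prototype.
    d2 = ((query_x[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    return labels[d2.argmin(axis=1)]

# Toy support set (hypothetical): two classes, two samples each.
support_x = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]]
support_y = [0, 0, 1, 1]
labels, protos = class_prototypes(support_x, support_y)
predictions = predict([[1.0, 1.0], [9.0, 9.0]], labels, protos)
print(predictions)
```

Because each class is summarized by a single mean vector, only a handful of labeled samples per class are needed, which is why prototype-based methods are often used when data are scarce.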
Affiliation(s)
- Subhayu Dutta
- Department of Computer Science & Engineering, Dr. B.C. Roy Engineering College, Durgapur, 713206, West Bengal, India.
| | - Saptiva Goswami
- Department of Computer Science & Engineering, Dr. B.C. Roy Engineering College, Durgapur, 713206, West Bengal, India.
| | - Sonali Debnath
- Department of Computer Science & Engineering, Dr. B.C. Roy Engineering College, Durgapur, 713206, West Bengal, India.
| | - Subhrangshu Adhikary
- Department of Research & Development, Spiraldevs Automation Industries Pvt. Ltd., Raiganj, 733123, West Bengal, India.
| | - Anandaprova Majumder
- Department of Computer Science & Engineering, Dr. B.C. Roy Engineering College, Durgapur, 713206, West Bengal, India.
| |
|
22
|
Sheinin R, Sharan R, Madi A. scNET: learning context-specific gene and cell embeddings by integrating single-cell gene expression data with protein-protein interactions. Nat Methods 2025; 22:708-716. [PMID: 40097811 PMCID: PMC11978505 DOI: 10.1038/s41592-025-02627-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2024] [Accepted: 02/07/2025] [Indexed: 03/19/2025]
Abstract
Recent advances in single-cell RNA sequencing (scRNA-seq) techniques have provided unprecedented insights into the heterogeneity of various tissues. However, gene expression data alone often fails to capture and identify changes in cellular pathways and complexes, as they are more discernible at the protein level. Moreover, analyzing scRNA-seq data presents further challenges due to inherent characteristics such as high noise levels and zero inflation. In this study, we propose an approach to address these limitations by integrating scRNA-seq datasets with a protein-protein interaction network. Our method utilizes a unique dual-view architecture based on graph neural networks, enabling joint representation of gene expression and protein-protein interaction network data. This approach models gene-to-gene relationships under specific biological contexts and refines cell-cell relations using an attention mechanism. Next, through comprehensive evaluations, we demonstrate that scNET better captures gene annotation, pathway characterization and gene-gene relationship identification, while improving cell clustering and pathway analysis across diverse cell types and biological conditions.
Affiliation(s)
- Ron Sheinin
- Blavatnik School of Computer Science and AI, Tel Aviv University, Tel Aviv, Israel
| | - Roded Sharan
- Blavatnik School of Computer Science and AI, Tel Aviv University, Tel Aviv, Israel.
| | - Asaf Madi
- Department of Pathology, Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
| |
|
23
|
Khodaee F, Zandie R, Edelman ER. Multimodal learning for mapping genotype-phenotype dynamics. NATURE COMPUTATIONAL SCIENCE 2025; 5:333-344. [PMID: 39875699 DOI: 10.1038/s43588-024-00765-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Accepted: 12/20/2024] [Indexed: 01/30/2025]
Abstract
How complex phenotypes emerge from intricate gene expression patterns is a fundamental question in biology. Integrating high-content genotyping approaches such as single-cell RNA sequencing and advanced learning methods such as language models offers an opportunity for dissecting this complex relationship. Here we present a computational integrated genetics framework designed to analyze and interpret the high-dimensional landscape of genotypes and their associated phenotypes simultaneously. We applied this approach to develop a multimodal foundation model to explore the genotype-phenotype relationship manifold for human transcriptomics at the cellular level. Analyzing this joint manifold showed a refined resolution of cellular heterogeneity, uncovered potential cross-tissue biomarkers and provided contextualized embeddings to investigate the polyfunctionality of genes, as shown for the von Willebrand factor (VWF) gene in endothelial cells. Overall, this study advances our understanding of the dynamic interplay between gene expression and phenotypic manifestation and demonstrates the potential of integrated genetics in uncovering new dimensions of cellular function and complexity.
Affiliation(s)
- Farhan Khodaee
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA.
| | - Rohola Zandie
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Elazer R Edelman
- Institute for Medical Engineering and Science, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Medicine (Cardiovascular Medicine), Brigham and Women's Hospital, Boston, MA, USA
| |
|
24
|
Chen Y, Zou J. Simple and effective embedding model for single-cell biology built from ChatGPT. Nat Biomed Eng 2025; 9:483-493. [PMID: 39643729 DOI: 10.1038/s41551-024-01284-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2023] [Accepted: 10/16/2024] [Indexed: 12/09/2024]
Abstract
Large-scale gene-expression data are being leveraged to pretrain models that implicitly learn gene and cellular functions. However, such models require extensive data curation and training. Here we explore a much simpler alternative: leveraging ChatGPT embeddings of genes based on the literature. We used GPT-3.5 to generate gene embeddings from text descriptions of individual genes and then to generate single-cell embeddings by averaging the gene embeddings weighted by each gene's expression level. We also created a sentence embedding for each cell by using only the gene names ordered by their expression level. On many downstream tasks used to evaluate pretrained single-cell embedding models, particularly gene-property and cell-type classification tasks, our model, which we named GenePT, achieved comparable or better performance than models pretrained from gene-expression profiles of millions of cells. GenePT shows that large-language-model embeddings of the literature provide a simple and effective path to encoding single-cell biological knowledge.
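The two cell representations described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the gene embeddings here are toy 2-dimensional vectors standing in for the GPT-3.5 text embeddings, the function names `cell_embedding` and `cell_sentence` are hypothetical, and dropping unexpressed genes from the sentence is an assumption made for the illustration.

```python
import numpy as np

def cell_embedding(expression, gene_embeddings):
    """Average the gene embeddings, weighting each gene by its expression level."""
    expression = np.asarray(expression, dtype=float)
    gene_embeddings = np.asarray(gene_embeddings, dtype=float)
    total = expression.sum()
    if total == 0:
        # A cell with no expressed genes gets an all-zero embedding (assumption).
        return np.zeros(gene_embeddings.shape[1])
    weights = expression / total
    return weights @ gene_embeddings

def cell_sentence(expression, gene_names):
    """List gene names in order of decreasing expression, skipping unexpressed genes."""
    expression = np.asarray(expression, dtype=float)
    order = np.argsort(-expression, kind="stable")
    return " ".join(gene_names[i] for i in order if expression[i] > 0)

# Toy data: three genes with made-up 2-dimensional embeddings.
gene_names = ["CD3E", "MS4A1", "LYZ"]
gene_embeddings = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
expression = np.array([3.0, 0.0, 1.0])

emb = cell_embedding(expression, gene_embeddings)
sentence = cell_sentence(expression, gene_names)
```

In the approach the abstract describes, the resulting sentence would then be passed to a text-embedding model; that step is omitted here because it depends on an external service.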
Affiliation(s)
- Yiqun Chen
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
| | - James Zou
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA.
- Department of Electrical Engineering, Stanford University, Stanford, CA, USA.
- Department of Computer Science, Stanford University, Stanford, CA, USA.
| |
|
25
|
Zou Z, Liu Y, Bai Y, Luo J, Zhang Z. scTrans: Sparse attention powers fast and accurate cell type annotation in single-cell RNA-seq data. PLoS Comput Biol 2025; 21:e1012904. [PMID: 40184563 PMCID: PMC11970913 DOI: 10.1371/journal.pcbi.1012904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2024] [Accepted: 02/24/2025] [Indexed: 04/06/2025] Open
Abstract
Cell type annotation is crucial in single-cell RNA sequencing data analysis because it enables significant biological discoveries and deepens our understanding of tissue biology. Given the high-dimensional and highly sparse nature of single-cell RNA sequencing data, most existing annotation tools focus on highly variable genes to reduce dimensionality and computational load. However, this approach inevitably results in information loss, potentially weakening the model's generalization performance and adaptability to novel datasets. To mitigate this issue, we developed scTrans, a single-cell Transformer-based model, which employs sparse attention to utilize all non-zero genes, thereby effectively reducing the input data dimensionality while minimizing information loss. We validated the speed and accuracy of scTrans by performing cell type annotation on 31 different tissues within the Mouse Cell Atlas. Remarkably, even with datasets nearing a million cells, scTrans efficiently performs cell type annotation with limited computational resources. Furthermore, scTrans demonstrates strong generalization capabilities, accurately annotating cells in novel datasets and generating high-quality latent representations, which are essential for precise clustering and trajectory analysis.
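The idea of attending only over a cell's non-zero genes can be illustrated with a short sketch. This is not the scTrans implementation: the helper names, the truncation to the first `max_len` expressed genes, the random embedding table, and scaling each gene embedding by its expression value are all assumptions chosen to keep the example small.

```python
import numpy as np

def nonzero_gene_tokens(expression, max_len):
    """Return (gene_ids, values, mask) for a cell's expressed genes, padded to max_len."""
    expression = np.asarray(expression, dtype=float)
    idx = np.flatnonzero(expression)[:max_len]  # keep at most max_len expressed genes
    n = idx.size
    gene_ids = np.full(max_len, -1, dtype=int)  # -1 marks padding
    values = np.zeros(max_len)
    mask = np.zeros(max_len, dtype=bool)
    gene_ids[:n] = idx
    values[:n] = expression[idx]
    mask[:n] = True
    return gene_ids, values, mask

def masked_self_attention(x, mask):
    """Scaled dot-product self-attention that ignores padded positions as keys.

    Assumes at least one position in `mask` is True.
    """
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask[None, :], scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

# Toy cell with 8 genes, 3 of them expressed.
expression = np.array([0.0, 2.0, 0.0, 0.0, 5.0, 0.0, 1.0, 0.0])
gene_ids, values, mask = nonzero_gene_tokens(expression, max_len=4)

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(8, 4))        # hypothetical gene embedding table
x = embedding_table[gene_ids] * values[:, None]  # padded rows are zeroed by value 0
out = masked_self_attention(x, mask)
```

Because the attention matrix is built over expressed genes only, its size scales with the number of non-zero entries per cell rather than with the full gene panel, which is the source of the saving the abstract refers to.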
Affiliation(s)
- Zhiyi Zou
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
| | - Ying Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
| | - Yuting Bai
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
| | - Jiawei Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
| | - Zhaolei Zhang
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| |
|
26
|
Chen SF, Steele RJ, Hocky GM, Lemeneh B, Lad SP, Oermann EK. Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions. arXiv 2025:arXiv:2408.16245v3. [PMID: 40236839 PMCID: PMC11998858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2025]
Abstract
The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually DNA/RNA or proteins. These models have seen incredible success in downstream tasks in each domain, and have achieved particularly noteworthy breakthroughs in sequence modeling and structural modeling. However, these single-omic models are naturally incapable of efficiently modeling multi-omic tasks, one of the most biologically critical being protein-nucleic acid interactions. We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks, namely predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, we provide evidence that multi-omic biosequence models are in many cases superior to foundation models trained on single-omics distributions, both in performance-per-FLOP and absolute performance, suggesting a more generalized or foundational approach to building these models for biology.
Affiliation(s)
- Sully F Chen
- Duke University School of Medicine, Durham, NC 27710, USA
| | | | - Glen M Hocky
- Department of Chemistry and Simons Center for Computational Physical Chemistry, New York University, New York, NY 10012, USA
| | | | - Shivanand P Lad
- Duke University School of Medicine, Department of Neurological Surgery, Durham, NC 27710, USA
| | - Eric K Oermann
- NYU Langone Health, Department of Neurological Surgery, New York, NY 10016, USA
| |
|
27
|
Lei Y, Tsang JS. Systems Human Immunology and AI: Immune Setpoint and Immune Health. Annu Rev Immunol 2025; 43:693-722. [PMID: 40279304 DOI: 10.1146/annurev-immunol-090122-042631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/27/2025]
Abstract
The immune system, critical for human health and implicated in many diseases, defends against pathogens, monitors physiological stress, and maintains tissue and organismal homeostasis. It exhibits substantial variability both within and across individuals and populations. Recent technological and conceptual progress in systems human immunology has provided predictive insights that link personal immune states to intervention responses and disease susceptibilities. Artificial intelligence (AI), particularly machine learning (ML), has emerged as a powerful tool for analyzing complex immune data sets, revealing hidden patterns across biological scales, and enabling predictive models for individualistic immune responses and potentially personalized interventions. This review highlights recent advances in deciphering human immune variation and predicting outcomes, particularly through the concepts of immune setpoint, immune health, and use of the immune system as a window for measuring health. We also provide a brief history of AI; review ML modeling approaches, including their applications in systems human immunology; and explore the potential of AI to develop predictive models and personal immune state embeddings to detect early signs of disease, forecast responses to interventions, and guide personalized health strategies.
Affiliation(s)
- Yona Lei
- Yale Center for Systems and Engineering Immunology and Department of Immunobiology, Yale University School of Medicine, New Haven, Connecticut, USA;
| | - John S Tsang
- Yale Center for Systems and Engineering Immunology and Department of Immunobiology, Yale University School of Medicine, New Haven, Connecticut, USA;
- Department of Biomedical Engineering, Yale University, New Haven, Connecticut, USA
- Chan Zuckerberg Biohub NY, New Haven, Connecticut, USA
| |
|
28
|
Zhu J, Meng Y, Gao W, Yang S, Zhu W, Ji X, Zhai X, Liu WQ, Luo Y, Ling S, Li J, Liu Y. AI-driven high-throughput droplet screening of cell-free gene expression. Nat Commun 2025; 16:2720. [PMID: 40108186 PMCID: PMC11923291 DOI: 10.1038/s41467-025-58139-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2024] [Accepted: 03/13/2025] [Indexed: 03/22/2025] Open
Abstract
Cell-free gene expression (CFE) systems enable transcription and translation using crude cellular extracts, offering a versatile platform for synthetic biology by eliminating the need to maintain living cells. However, such systems are constrained by cumbersome composition, high costs, and limited yields due to the numerous additional components required to maintain biocatalytic efficiency. Here, we introduce DropAI, a droplet-based, AI-driven screening strategy designed to optimize CFE systems with high throughput and economic efficiency. DropAI employs microfluidics to generate picoliter reactors and utilizes a fluorescent color-coding system to address and screen massive chemical combinations. The in-droplet screening is complemented by in silico optimization, where experimental results train a machine-learning model to estimate the contribution of the components and predict high-yield combinations. By applying DropAI, we significantly simplified the composition of an Escherichia coli-based CFE system, achieving a fourfold reduction in the unit cost of expressed superfolder green fluorescent protein (sfGFP). This optimized formulation was further validated across 12 different proteins. Notably, the established E. coli model is successfully adapted to a Bacillus subtilis-based system through transfer learning, leading to doubled yield through prediction. Beyond CFE, DropAI offers a high-throughput and scalable solution for combinatorial screening and optimization of biochemical systems.
Affiliation(s)
- Jiawei Zhu
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China
| | - Yaru Meng
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China
| | - Wenli Gao
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China
| | - Shuo Yang
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China
| | - Wenjie Zhu
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China
| | - Xiangyang Ji
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China
| | - Xuanpei Zhai
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China
| | - Wan-Qiu Liu
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China
| | - Yuan Luo
- State Key Laboratory of Transducer Technology, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences, Shanghai, China
| | - Shengjie Ling
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China.
- State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China.
- Shanghai Clinical Research and Trial Center, Shanghai, China.
- State Key Laboratory of Molecular Engineering of Polymers, Department of Macromolecular Science, Laboratory of Advanced Materials, Fudan University, Shanghai, China.
| | - Jian Li
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China.
- State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China.
- Shanghai Clinical Research and Trial Center, Shanghai, China.
| | - Yifan Liu
- School of Physical Science and Technology, ShanghaiTech University, Shanghai, China.
- State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China.
- Shanghai Clinical Research and Trial Center, Shanghai, China.
| |
|
29
|
Song L, Chen W, Hou J, Guo M, Yang J. Spatially resolved mapping of cells associated with human complex traits. Nature 2025:10.1038/s41586-025-08757-x. [PMID: 40108460 DOI: 10.1038/s41586-025-08757-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2024] [Accepted: 02/07/2025] [Indexed: 03/22/2025]
Abstract
Depicting spatial distributions of disease-relevant cells is crucial for understanding disease pathology [1,2]. Here we present genetically informed spatial mapping of cells for complex traits (gsMap), a method that integrates spatial transcriptomics data with summary statistics from genome-wide association studies to map cells to human complex traits, including diseases, in a spatially resolved manner. Using embryonic spatial transcriptomics datasets covering 25 organs, we benchmarked gsMap through simulation and by corroborating known trait-associated cells or regions in various organs. Applying gsMap to brain spatial transcriptomics data, we reveal that the spatial distribution of glutamatergic neurons associated with schizophrenia more closely resembles that for cognitive traits than that for mood traits such as depression. The schizophrenia-associated glutamatergic neurons were distributed near the dorsal hippocampus, with upregulated expression of calcium signalling and regulation genes, whereas depression-associated glutamatergic neurons were distributed near the deep medial prefrontal cortex, with upregulated expression of neuroplasticity and psychiatric drug target genes. Our study provides a method for spatially resolved mapping of trait-associated cells and demonstrates the gain of biological insights (such as the spatial distribution of trait-relevant cells and related signature genes) through these maps.
Affiliation(s)
- Liyang Song
- School of Life Sciences, Westlake University, Hangzhou, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, China
| | - Wenhao Chen
- School of Life Sciences, Westlake University, Hangzhou, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, China
| | - Junren Hou
- School of Life Sciences, Westlake University, Hangzhou, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, China
| | - Minmin Guo
- School of Life Sciences, Westlake University, Hangzhou, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, China
| | - Jian Yang
- School of Life Sciences, Westlake University, Hangzhou, China.
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, China.
| |
|
30
|
Liu T, Lin Y, Luo X, Sun Y, Zhao H. VISTA Uncovers Missing Gene Expression and Spatial-induced Information for Spatial Transcriptomic Data Analysis. bioRxiv 2025:2024.08.26.609718. [PMID: 40166134 PMCID: PMC11957009 DOI: 10.1101/2024.08.26.609718] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
Characterizing cell activities within a spatially resolved context is essential to enhance our understanding of spatially induced cellular states and features. While single-cell RNA-seq (scRNA-seq) offers comprehensive profiling of cells within a tissue, it fails to capture spatial context. Conversely, subcellular spatial transcriptomics (SST) technologies provide high-resolution spatial profiles of gene expression, yet their utility is constrained by the limited number of genes they can simultaneously profile. To address this limitation, we introduce VISTA, a novel approach tailored to SST data that predicts the expression levels of unobserved genes. VISTA jointly models scRNA-seq data and SST data based on variational inference and geometric deep learning, and incorporates uncertainty quantification. Using four SST datasets, we demonstrate VISTA's superior performance in imputation and in analyzing large-scale SST datasets with satisfactory time efficiency and memory consumption. The imputation of VISTA enables a multitude of downstream applications, including the detection of new spatially variable genes, the discovery of novel ligand-receptor interactions, the inference of spatial RNA velocity, the generation of spatial transcriptomics data with in silico perturbation, and an improved decomposition of spatial and intrinsic variations.
Affiliation(s)
- Tianyu Liu
- Interdepartmental Program in Computational Biology & Bioinformatics, Yale University, New Haven, 06511, CT, USA
| | - Yingxin Lin
- Department of Biostatistics, Yale University, New Haven, 06511, CT, USA
| | - Xiao Luo
- Department of Computer Science, University of California, Los Angeles, Los Angeles, 90095, CA, USA
| | - Yizhou Sun
- Department of Computer Science, University of California, Los Angeles, Los Angeles, 90095, CA, USA
| | - Hongyu Zhao
- Interdepartmental Program in Computational Biology & Bioinformatics, Yale University, New Haven, 06511, CT, USA
- Department of Biostatistics, Yale University, New Haven, 06511, CT, USA
| |
|
31
|
Hutchins NT, Meziane M, Lu C, Mitalipova M, Fischer D, Li P. Reconstructing signaling histories of single cells via perturbation screens and transfer learning. bioRxiv 2025:2025.03.16.643448. [PMID: 40166200 PMCID: PMC11957020 DOI: 10.1101/2025.03.16.643448] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
Manipulating the signaling environment is an effective approach to alter cellular states for broad-ranging applications, from engineering tissues to treating diseases. Such manipulation requires knowing the signaling states and histories of the cells in situ, for which high-throughput discovery methods are lacking. Here, we present an integrated experimental-computational framework that learns signaling response signatures from a high-throughput in vitro perturbation atlas and infers combinatorial signaling activities in in vivo cell types with high accuracy and temporal resolution. Specifically, we generated a signaling perturbation atlas across diverse cell types/states through multiplexed sequential combinatorial screens on human pluripotent stem cells. Using the atlas to train IRIS, a neural network-based model, and predicting on a mouse embryo scRNA-seq atlas, we discovered global features of combinatorial signaling code usage over time, identified biologically meaningful heterogeneity of signaling states within each cell type, and reconstructed signaling histories along diverse cell lineages. We further demonstrated that IRIS greatly accelerates the optimization of stem cell differentiation protocols by drastically reducing the combinatorial space that needs to be tested. This framework reveals that different cell types share robust signal response signatures, and provides a scalable solution for mapping complex signaling interactions in vivo to guide targeted interventions.
|
32
|
Ran R, Uslu M, Siddiqui MF, Brubaker DK, Trapecar M. Single-Cell Analysis Reveals Tissue-Specific T Cell Adaptation and Clonal Distribution Across the Human Gut-Liver-Blood Axis. bioRxiv 2025:2025.03.11.642626. [PMID: 40161783 PMCID: PMC11952442 DOI: 10.1101/2025.03.11.642626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/02/2025]
Abstract
Understanding T cell clonal relationships and tissue-specific adaptations is crucial for deciphering human immune responses, particularly within the gut-liver axis. We performed paired single-cell RNA and T cell receptor sequencing on matched colon (epithelium, lamina propria), liver, and blood T cells from the same human donors. This approach tracked clones across sites and assessed microenvironmental impacts on T cell phenotype. While some clones were shared between blood and tissues, colonic intraepithelial lymphocytes (IELs) exhibited limited overlap with lamina propria T cells, suggesting a largely resident population. Furthermore, tissue-resident memory T cells (TRM) in the colon and liver displayed distinct transcriptional profiles. Notably, our analysis suggested that factors enriched in the liver microenvironment may influence the phenotype of colon lamina propria TRM. This integrated single-cell analysis maps T cell clonal distribution and adaptation across the gut-liver-blood axis, highlighting a potential liver role in shaping colonic immunity.
Affiliation(s)
- Ran Ran
- Center for Global Health and Diseases, Department of Pathology, Case Western Reserve University, Cleveland, OH
| | - Merve Uslu
- Department of Medicine, Johns Hopkins University School of Medicine, Institute for Fundamental Biomedical Research, Johns Hopkins All Children’s Hospital, St. Petersburg, FL, USA
| | - Mohd Farhan Siddiqui
- Department of Medicine, Johns Hopkins University School of Medicine, Institute for Fundamental Biomedical Research, Johns Hopkins All Children’s Hospital, St. Petersburg, FL, USA
| | - Douglas K. Brubaker
- Center for Global Health and Diseases, Department of Pathology, Case Western Reserve University, Cleveland, OH
- The Blood, Heart, Lung, and Immunology Research Center, Case Western Reserve University, University Hospitals of Cleveland, Cleveland, OH
| | - Martin Trapecar
- Department of Medicine, Johns Hopkins University School of Medicine, Institute for Fundamental Biomedical Research, Johns Hopkins All Children’s Hospital, St. Petersburg, FL, USA
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| |
|
33
|
Yu JL, Zhou C, Ning XL, Mou J, Meng FB, Wu JW, Chen YT, Tang BD, Liu XG, Li GB. Knowledge-guided diffusion model for 3D ligand-pharmacophore mapping. Nat Commun 2025; 16:2269. [PMID: 40050649 PMCID: PMC11885826 DOI: 10.1038/s41467-025-57485-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2024] [Accepted: 02/21/2025] [Indexed: 03/09/2025] Open
Abstract
Pharmacophores are abstractions of essential chemical interaction patterns, holding an irreplaceable position in drug discovery. Despite the availability of many pharmacophore tools, the adoption of deep learning for pharmacophore-guided drug discovery remains relatively rare. We herein propose a knowledge-guided diffusion framework for 'on-the-fly' 3D ligand-pharmacophore mapping, named DiffPhore. It leverages ligand-pharmacophore matching knowledge to guide ligand conformation generation, while utilizing calibrated sampling to mitigate the exposure bias of the iterative conformation search process. By training on two self-established datasets of 3D ligand-pharmacophore pairs, DiffPhore achieves state-of-the-art performance in predicting ligand binding conformations, surpassing traditional pharmacophore tools and several advanced docking methods. It also shows superior virtual screening power for lead discovery and target fishing. Using DiffPhore, we successfully identify structurally distinct inhibitors for human glutaminyl cyclases, and their binding modes are further validated through co-crystallographic analysis. We believe this work will advance AI-enabled pharmacophore-guided drug discovery techniques.
Affiliation(s)
- Jun-Lin Yu
- Key Laboratory of Drug Targeting and Drug Delivery System of Ministry of Education, Department of Medicinal Chemistry, West China School of Pharmacy, Sichuan University, Chengdu, Sichuan, China
| | - Cong Zhou
- Key Laboratory of Drug Targeting and Drug Delivery System of Ministry of Education, Department of Medicinal Chemistry, West China School of Pharmacy, Sichuan University, Chengdu, Sichuan, China
| | - Xiang-Li Ning
- Key Laboratory of Drug Targeting and Drug Delivery System of Ministry of Education, Department of Medicinal Chemistry, West China School of Pharmacy, Sichuan University, Chengdu, Sichuan, China
| | - Jun Mou
- Key Laboratory of Drug Targeting and Drug Delivery System of Ministry of Education, Department of Medicinal Chemistry, West China School of Pharmacy, Sichuan University, Chengdu, Sichuan, China
| | - Fan-Bo Meng
- Key Laboratory of Drug Targeting and Drug Delivery System of Ministry of Education, Department of Medicinal Chemistry, West China School of Pharmacy, Sichuan University, Chengdu, Sichuan, China
| | - Jing-Wei Wu
- Key Laboratory of Drug Targeting and Drug Delivery System of Ministry of Education, Department of Medicinal Chemistry, West China School of Pharmacy, Sichuan University, Chengdu, Sichuan, China
| | - Yi-Ting Chen
- Key Laboratory of Drug Targeting and Drug Delivery System of Ministry of Education, Department of Medicinal Chemistry, West China School of Pharmacy, Sichuan University, Chengdu, Sichuan, China
| | - Biao-Dan Tang
- Key Laboratory of Drug Targeting and Drug Delivery System of Ministry of Education, Department of Medicinal Chemistry, West China School of Pharmacy, Sichuan University, Chengdu, Sichuan, China
| | - Xiang-Gen Liu
- College of Computer Science, Sichuan University, Chengdu, Sichuan, China.
| | - Guo-Bo Li
- Key Laboratory of Drug Targeting and Drug Delivery System of Ministry of Education, Department of Medicinal Chemistry, West China School of Pharmacy, Sichuan University, Chengdu, Sichuan, China.
| |
|
34
|
Li J, Zhang J, Guo R, Dai J, Niu Z, Wang Y, Wang T, Jiang X, Hu W. Progress of machine learning in the application of small molecule druggability prediction. Eur J Med Chem 2025; 285:117269. [PMID: 39808972 DOI: 10.1016/j.ejmech.2025.117269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2024] [Revised: 01/07/2025] [Accepted: 01/08/2025] [Indexed: 01/16/2025]
Abstract
Machine learning (ML) has become an important tool for predicting the pharmaceutical properties of small molecules. Recent advancements in ML algorithms enable the rapid and accurate evaluation of solubility, activity, toxicity, pharmacokinetics, and other molecular properties through ML-based models. By conducting virtual screening of drug targets and elucidating drug-target protein interactions, researchers can conduct preliminary evaluations of the activity and safety of compounds from ultra-large compound libraries, thereby accelerating the screening process for lead compounds. Moreover, ML leverages existing experimental data to train and generate new datasets, addressing the challenge of limited compound and protein target data. This review provides a concise overview of ML applications in predicting small molecule properties, focusing on model construction principles, molecular feature selection, and other essential aspects. It also discusses the potential applications of ML in the screening of pharmaceutical small molecules.
Affiliation(s)
- Junyao Li
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou, China; School of Life Sciences, Huaiyin Normal University, Huaian, 223300, China; Institute of Translational Medicine, School of Medicine, Yangzhou University, Yangzhou, 225009, China
| | - Jianmei Zhang
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou, China
| | - Rui Guo
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou, China; Institute of Translational Medicine, School of Medicine, Yangzhou University, Yangzhou, 225009, China
| | - Jiawei Dai
- Institute of Translational Medicine, School of Medicine, Yangzhou University, Yangzhou, 225009, China
| | - Zhiqiang Niu
- Institute of Translational Medicine, School of Medicine, Yangzhou University, Yangzhou, 225009, China
| | - Yan Wang
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou, China
| | - Taoyun Wang
- School of Chemistry and Life Sciences, Suzhou University of Science and Technology, Suzhou, China.
| | - Xiaojian Jiang
- School of Life Sciences, Huaiyin Normal University, Huaian, 223300, China.
| | - Weicheng Hu
- Institute of Translational Medicine, School of Medicine, Yangzhou University, Yangzhou, 225009, China.
| |
|
35
|
Li S, Hua H, Chen S. Graph neural networks for single-cell omics data: a review of approaches and applications. Brief Bioinform 2025; 26:bbaf109. [PMID: 40091193 PMCID: PMC11911123 DOI: 10.1093/bib/bbaf109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/04/2024] [Revised: 02/09/2025] [Accepted: 02/25/2025] [Indexed: 03/19/2025] Open
Abstract
Rapid advancement of sequencing technologies now allows for the utilization of precise signals at single-cell resolution in various omics studies. However, the massive volume, ultra-high dimensionality, and high sparsity of single-cell data have introduced substantial difficulties for traditional computational methods. The intricate non-Euclidean networks of intracellular and intercellular signaling molecules within single-cell datasets, coupled with the complex, multimodal structures arising from multi-omics joint analysis, pose significant challenges to conventional deep learning operations reliant on Euclidean geometries. Graph neural networks (GNNs) have extended deep learning to non-Euclidean data, allowing cells and their features in single-cell datasets to be modeled as nodes within a graph structure. GNNs have been successfully applied across a broad range of tasks in single-cell data analysis. In this survey, we systematically review 107 successful applications of GNNs and their six variants in various single-cell omics tasks. We begin by outlining the fundamental principles of GNNs and their six variants, followed by a systematic review of GNN-based models applied in single-cell epigenomics, transcriptomics, spatial transcriptomics, proteomics, and multi-omics. In each section dedicated to a specific omics type, we summarize the publicly available single-cell datasets commonly utilized in the articles reviewed in that section, totaling 77 datasets. Finally, we summarize the potential shortcomings of current research and explore directions for future studies. We anticipate that this review will serve as a guiding resource for researchers to deepen the application of GNNs in single-cell omics.
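As a rough illustration of what modeling cells as graph nodes can look like, the sketch below builds a k-nearest-neighbour cell graph from feature vectors and runs one mean-aggregation message-passing step. It is a generic example rather than any specific model from the review; the function names, the Euclidean distance, and the toy data are assumptions.

```python
import numpy as np

def knn_adjacency(x, k):
    """Symmetric k-nearest-neighbour adjacency (with self-loops) from Euclidean distances."""
    n = x.shape[0]
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)           # a cell is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbours.ravel()] = 1.0
    adj = np.maximum(adj, adj.T)             # make the graph undirected
    np.fill_diagonal(adj, 1.0)               # add self-loops
    return adj

def mean_aggregate(adj, h):
    """One message-passing step: each node takes the mean of its neighbours' features."""
    return (adj / adj.sum(axis=1, keepdims=True)) @ h

# Four toy cells forming two well-separated pairs.
cells = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
adj = knn_adjacency(cells, k=1)
smoothed = mean_aggregate(adj, cells)
```

A trainable GNN layer would additionally multiply the aggregated features by a learned weight matrix and apply a nonlinearity; those parts are left out to keep the graph construction and aggregation visible.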
Affiliation(s)
- Sijie Li
- School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
| | - Heyang Hua
- School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
| | - Shengquan Chen
- School of Mathematical Sciences and The Key Laboratory of Pure Mathematics and Combinatorics, Ministry of Education (LPMC), Nankai University, No. 94 Weijin Road, Nankai District, Tianjin 300071, China
| |
|
36
|
Ge S, Sun S, Xu H, Cheng Q, Ren Z. Deep learning in single-cell and spatial transcriptomics data analysis: advances and challenges from a data science perspective. Brief Bioinform 2025; 26:bbaf136. [PMID: 40185158 PMCID: PMC11970898 DOI: 10.1093/bib/bbaf136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2024] [Revised: 02/17/2025] [Accepted: 03/05/2025] [Indexed: 04/07/2025] Open
Abstract
The development of single-cell and spatial transcriptomics has revolutionized our capacity to investigate cellular properties, functions, and interactions in both cellular and spatial contexts. Despite this progress, the analysis of single-cell and spatial omics data remains challenging. First, single-cell sequencing data are high-dimensional and sparse, and are often contaminated by noise and uncertainty, obscuring the underlying biological signal. Second, these data often encompass multiple modalities, including gene expression, epigenetic modifications, metabolite levels, and spatial locations. Integrating these diverse data modalities is crucial for enhancing prediction accuracy and biological interpretability. Third, while the scale of single-cell sequencing has expanded to millions of cells, high-quality annotated datasets are still limited. Fourth, the complex correlations of biological tissues make it difficult to accurately reconstruct cellular states and spatial contexts. Traditional feature engineering approaches struggle with the complexity of biological networks, while deep learning, with its ability to handle high-dimensional data and automatically identify meaningful patterns, has shown great promise in overcoming these challenges. Besides systematically reviewing the strengths and weaknesses of advanced deep learning methods, we have curated 21 datasets from nine benchmarks to evaluate the performance of 58 computational methods. Our analysis reveals that model performance can vary significantly across different benchmark datasets and evaluation metrics, providing a useful perspective for selecting the most appropriate approach based on a specific application scenario. We highlight three key areas for future development, offering valuable insights into how deep learning can be effectively applied to transcriptomic data analysis in biological, medical, and clinical settings.
Affiliation(s)
- Shuang Ge
- Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, China
- Pengcheng Laboratory, 6001 Shahe West Road, Nanshan District, Shenzhen 518055, Guangdong, China
| | - Shuqing Sun
- Shenzhen International Graduate School, Tsinghua University, 2279 Lishui Road, Nanshan District, Shenzhen 518055, Guangdong, China
| | - Huan Xu
- School of Public Health, Anhui University of Science and Technology, 15 Fengxia Road, Changfeng County, Hefei 231131, Anhui, China
| | - Qiang Cheng
- Department of Computer Science, University of Kentucky, 329 Rose Street, Lexington 40506, Kentucky, USA
- Institute for Biomedical Informatics, University of Kentucky, 800 Rose Street, Lexington 40506, Kentucky, USA
| | - Zhixiang Ren
- Pengcheng Laboratory, 6001 Shahe West Road, Nanshan District, Shenzhen 518055, Guangdong, China
| |
|
37
|
Hu Y, Li X, Yi Y, Huang Y, Wang G, Wang D. Deep learning-driven survival prediction in pan-cancer studies by integrating multimodal histology-genomic data. Brief Bioinform 2025; 26:bbaf121. [PMID: 40116660 PMCID: PMC11926983 DOI: 10.1093/bib/bbaf121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2024] [Revised: 02/10/2025] [Accepted: 02/28/2025] [Indexed: 03/23/2025] Open
Abstract
Accurate cancer prognosis is essential for personalized clinical management, guiding treatment strategies and predicting patient survival. Conventional methods, which depend on the subjective evaluation of histopathological features, exhibit significant inter-observer variability and limited predictive power. To overcome these limitations, we developed the cross-attention transformer-based multimodal fusion network (CATfusion), a deep learning framework that integrates multimodal histology-genomic data for comprehensive cancer survival prediction. CATfusion employs a self-supervised learning strategy with TabAE for feature extraction and uses cross-attention mechanisms to fuse diverse data types, including mRNA-seq, miRNA-seq, copy number variation, DNA methylation variation, mutation data, and histopathological images. By successfully integrating this multi-tiered patient information, CATfusion has become an advanced survival prediction model that utilizes the most diverse data types across various cancer types. CATfusion's architecture, which includes a bidirectional multimodal attention mechanism and a self-attention block, is adept at synchronizing the learning and integration of representations from various modalities. CATfusion achieves superior predictive performance over traditional and unimodal models, as demonstrated by enhanced C-index and survival area under the curve scores. The model's high accuracy in stratifying patients into distinct risk groups is a boon for personalized medicine, enabling tailored treatment plans. Moreover, CATfusion's interpretability, enabled by attention-based visualization, offers insights into the biological underpinnings of cancer prognosis, underscoring its potential as a transformative tool in oncology.
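The C-index mentioned in this abstract measures how often a model ranks patients' risks in the same order as their observed survival. The following is a minimal sketch of Harrell's concordance index in plain Python; the function name `concordance_index` and the toy inputs are illustrative assumptions and are not taken from the CATfusion evaluation code.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable patient pairs whose predicted
    risks are ordered consistently with their observed survival times.

    times  -- follow-up time per patient
    events -- True/1 if death was observed, False/0 if censored
    risks  -- predicted risk score (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that `a` is the patient with the shorter time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only when the times differ and the shorter
        # time is an observed event rather than a censoring time.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0   # higher risk for the earlier event
        elif risks[a] == risks[b]:
            concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort (hypothetical values): the third patient is censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.5, 0.1])
reversed_ranking = concordance_index(times, events, [0.1, 0.5, 0.7, 0.9])
```

A value of 1.0 indicates perfectly ordered risks, 0.5 is what a constant prediction yields, and 0.0 indicates a fully reversed ordering.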
Affiliation(s)
- Yongfei Hu
- Dermatology Hospital, Southern Medical University, No. 2, Lujing Road, Yuexiu District, Guangzhou 510091, China
| | - Xinyu Li
- Department of Bioinformatics, School of Basic Medical Sciences, Guangdong Province Key Laboratory of Molecular Tumor Pathology, Southern Medical University, 1023 Shatai South Road, Baiyun District, Guangzhou 510515, China
| | - Ying Yi
- Dermatology Hospital, Southern Medical University, No. 2, Lujing Road, Yuexiu District, Guangzhou 510091, China
| | - Yan Huang
- Cancer Research Institute, School of Basic Medical Sciences, Southern Medical University, 1023 Shatai South Road, Baiyun District, Guangzhou 510515, China
| | - Guangyu Wang
- Department of Gastrointestinal Medical Oncology, Harbin Medical University Cancer Hospital, No. 150 Haping Road, Nangang District, Harbin 150000, China
| | - Dong Wang
- Dermatology Hospital, Southern Medical University, No. 2, Lujing Road, Yuexiu District, Guangzhou 510091, China
- Department of Bioinformatics, School of Basic Medical Sciences, Guangdong Province Key Laboratory of Molecular Tumor Pathology, Southern Medical University, 1023 Shatai South Road, Baiyun District, Guangzhou 510515, China
| |
|
38
|
Li CY, Hong YJ, Li B, Zhang XF. Benchmarking single-cell cross-omics imputation methods for surface protein expression. Genome Biol 2025; 26:46. [PMID: 40038818 DOI: 10.1186/s13059-025-03514-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2024] [Accepted: 02/24/2025] [Indexed: 03/06/2025] Open
Abstract
BACKGROUND Recent advances in single-cell multimodal omics sequencing have facilitated the simultaneous profiling of transcriptomes and surface proteomes within individual cells, offering insights into cellular functions and heterogeneity. However, the high costs and technical complexity of protocols like CITE-seq and REAP-seq constrain large-scale dataset generation. To overcome this limitation, surface protein data imputation methods have emerged to predict protein abundances from scRNA-seq data. RESULTS We present a comprehensive benchmark of twelve state-of-the-art imputation methods across eleven datasets and six scenarios. Our analysis evaluates the methods' accuracy, sensitivity to training data size, robustness across experiments, and usability in terms of running time, memory usage, popularity, and user-friendliness. With benchmark experiments in diverse scenarios and a comprehensive evaluation framework of the results, our study offers valuable insights into the performance and applicability of surface protein data imputation methods in single-cell omics research. CONCLUSIONS Based on our results, Seurat v4 (PCA) and Seurat v3 (PCA) demonstrate exceptional performance, offering promising avenues for further research in single-cell omics.
Affiliation(s)
- Chen-Yang Li
- School of Mathematics and Statistics, and Hubei Key Lab-Math. Sci., Central China Normal University, Wuhan, 430079, China
| | - Yong-Jia Hong
- School of Mathematics and Statistics, and Hubei Key Lab-Math. Sci., Central China Normal University, Wuhan, 430079, China
| | - Bo Li
- School of Mathematics and Statistics, and Hubei Key Lab-Math. Sci., Central China Normal University, Wuhan, 430079, China
- Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan, 430079, China
| | - Xiao-Fei Zhang
- School of Mathematics and Statistics, and Hubei Key Lab-Math. Sci., Central China Normal University, Wuhan, 430079, China.
- Key Laboratory of Nonlinear Analysis & Applications (Ministry of Education), Central China Normal University, Wuhan, 430079, China.
| |
|
39
|
Wei S, Lu Y, Wang P, Li Q, Shuai J, Zhao Q, Lin H, Peng Y. Investigation of cell development and tissue structure network based on natural language processing of scRNA-seq data. J Transl Med 2025; 23:264. [PMID: 40038714 DOI: 10.1186/s12967-025-06263-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2024] [Accepted: 02/14/2025] [Indexed: 03/06/2025] Open
Abstract
BACKGROUND Single-cell multi-omics technologies, particularly single-cell RNA sequencing (scRNA-seq), have revolutionized our understanding of cellular heterogeneity and development by providing insights into gene expression at the single-cell level. Investigating the influence of genes on cellular behavior is crucial for elucidating cell fate determination and differentiation, cell development processes, and disease mechanisms. METHODS Inspired by NLP, we present a novel scRNA-seq analysis method that treats genes as analogous to words. Using word2vec to embed gene sequences derived from gene networks, we generate vector representations of genes, which are then used to represent cells by summing gene vectors and subsequently tissues by aggregating cell vectors. RESULTS Our NLP-based approach analyzes scRNA-seq data by generating vector representations of genes, cells, and tissues. This multi-scale analysis includes mapping cell states in vector space to reveal developmental trajectories, quantifying cell similarity using Euclidean distance, and constructing inter-tissue relationship networks from aggregated cell vectors. CONCLUSIONS This method offers a computationally efficient approach for analyzing scRNA-seq data by constructing embedding representations similar to those used in large language model pre-training, but without requiring high-performance computing clusters. By generating gene embeddings that capture functional relationships, this method facilitates the study of cell development trajectories, the impact of gene perturbations, cell clustering, and the construction and analysis of tissue networks. This provides a valuable tool for single-cell data analysis.
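The gene-to-cell-to-tissue construction described in this abstract can be sketched in a few lines. The example below is a hedged illustration only: the two-dimensional gene vectors are hand-written stand-ins for word2vec output, the gene and function names are hypothetical, and averaging is merely one assumed way to "aggregate" cell vectors into a tissue vector, since the abstract does not specify the operation.

```python
import math


def sum_vectors(vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def cell_vector(expressed_genes, gene_vectors):
    """Represent a cell as the sum of the vectors of its expressed genes."""
    return sum_vectors([gene_vectors[g] for g in expressed_genes])


def tissue_vector(cell_vectors):
    """Aggregate cell vectors into a tissue vector (assumption: the mean)."""
    n = len(cell_vectors)
    return [x / n for x in sum_vectors(cell_vectors)]


def cell_distance(u, v):
    """Euclidean distance between two cell vectors."""
    return math.dist(u, v)


# Toy embeddings standing in for trained word2vec gene vectors.
gene_vectors = {
    "GeneA": [1.0, 0.0],
    "GeneB": [0.0, 1.0],
    "GeneC": [1.0, 1.0],
}

cell_1 = cell_vector(["GeneA", "GeneB"], gene_vectors)  # [1.0, 1.0]
cell_2 = cell_vector(["GeneC"], gene_vectors)           # [1.0, 1.0]
cell_3 = cell_vector(["GeneA", "GeneC"], gene_vectors)  # [2.0, 1.0]

tissue = tissue_vector([cell_1, cell_3])                # [1.5, 1.0]
```

In this toy setting `cell_1` and `cell_2` coincide in vector space despite expressing different genes, which illustrates why the quality of the gene embeddings determines how meaningful the downstream cell distances are.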
Affiliation(s)
- Suwen Wei
- Oujiang Laboratory (Zhejiang Lab for Regenerative Medicine, Vision and Brain Health), Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, 325001, Zhejiang, P. R. China
| | - Yuer Lu
- Oujiang Laboratory (Zhejiang Lab for Regenerative Medicine, Vision and Brain Health), Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, 325001, Zhejiang, P. R. China
| | - Peng Wang
- Oujiang Laboratory (Zhejiang Lab for Regenerative Medicine, Vision and Brain Health), Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, 325001, Zhejiang, P. R. China
- Postgraduate Training Base Alliance of Wenzhou Medical University, Wenzhou, 325001, Zhejiang, P. R. China
| | - Qichao Li
- Oujiang Laboratory (Zhejiang Lab for Regenerative Medicine, Vision and Brain Health), Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, 325001, Zhejiang, P. R. China
- Postgraduate Training Base Alliance of Wenzhou Medical University, Wenzhou, 325001, Zhejiang, P. R. China
| | - Jianwei Shuai
- Oujiang Laboratory (Zhejiang Lab for Regenerative Medicine, Vision and Brain Health), Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, 325001, Zhejiang, P. R. China
| | - Qi Zhao
- Oujiang Laboratory (Zhejiang Lab for Regenerative Medicine, Vision and Brain Health), Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, 325001, Zhejiang, P. R. China.
- School of Computer Science and Software Engineering, University of Science and Technology Liaoning, Anshan, 114051, P.R. China.
| | - Hai Lin
- Oujiang Laboratory (Zhejiang Lab for Regenerative Medicine, Vision and Brain Health), Wenzhou Institute, University of Chinese Academy of Sciences, Wenzhou, 325001, Zhejiang, P. R. China.
| | - Yuming Peng
- Department of General Practice, Central Hospital of Karamay, Xinjiang, 834000, P. R. China.
| |
|
40
|
Wang B, Zhang T, Liu Q, Sutcharitchan C, Zhou Z, Zhang D, Li S. Elucidating the role of artificial intelligence in drug development from the perspective of drug-target interactions. J Pharm Anal 2025; 15:101144. [PMID: 40099205 PMCID: PMC11910364 DOI: 10.1016/j.jpha.2024.101144] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/05/2024] [Revised: 10/29/2024] [Accepted: 11/08/2024] [Indexed: 03/19/2025] Open
Abstract
Drug development remains a critical issue in the field of biomedicine. With the rapid advancement of information technologies such as artificial intelligence (AI) and the advent of the big data era, AI-assisted drug development has become a new trend, particularly in predicting drug-target associations. To address the challenge of drug-target prediction, AI-driven models have emerged as powerful tools, offering innovative solutions by effectively extracting features from complex biological data, accurately modeling molecular interactions, and precisely predicting potential drug-target outcomes. Traditional machine learning (ML), network-based, and advanced deep learning architectures such as convolutional neural networks (CNNs), graph convolutional networks (GCNs), and transformers play a pivotal role. This review systematically compiles and evaluates AI algorithms for drug- and drug combination-target predictions, highlighting their theoretical frameworks, strengths, and limitations. CNNs effectively identify spatial patterns and molecular features critical for drug-target interactions. GCNs provide deep insights into molecular interactions via relational data, whereas transformers increase prediction accuracy by capturing complex dependencies within biological sequences. Network-based models offer a systematic perspective by integrating diverse data sources, and traditional ML efficiently handles large datasets to improve overall predictive accuracy. Collectively, these AI-driven methods are transforming drug-target predictions and advancing the development of personalized therapy. This review summarizes the application of AI in drug development, particularly in drug-target prediction, and offers recommendations on models and algorithms for researchers engaged in biomedical research. It also provides typical cases to better illustrate how AI can further accelerate development in the fields of biomedicine and drug discovery.
Affiliation(s)
- Boyang Wang
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRist, Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Tingyu Zhang
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRist, Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Qingyuan Liu
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRist, Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Chayanis Sutcharitchan
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRist, Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Ziyi Zhou
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRist, Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Dingfan Zhang
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRist, Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Shao Li
- Institute for TCM-X, MOE Key Laboratory of Bioinformatics, Bioinformatics Division, BNRist, Department of Automation, Tsinghua University, Beijing, 100084, China
| |
|
41
|
Li B, Liu W, Xu J, Huang X, Yang L, Xu F. Decoding maize meristems maintenance and differentiation: integrating single-cell and spatial omics. J Genet Genomics 2025; 52:319-333. [PMID: 39921079 DOI: 10.1016/j.jgg.2025.01.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2024] [Revised: 01/22/2025] [Accepted: 01/22/2025] [Indexed: 02/10/2025]
Abstract
All plant organs are derived from stem cell-containing meristems. In maize, the shoot apical meristem (SAM) is responsible for generating all above-ground structures, including the male and female inflorescence meristems (IMs), which give rise to tassel and ear, respectively. Forward and reverse genetic studies on maize meristem mutants have driven forward our fundamental understanding of meristem maintenance and differentiation mechanisms. However, the high genetic redundancy of the maize genome has impeded progress in functional genomics. This review comprehensively summarizes recent advancements in understanding maize meristem development, with a focus on the integration of single-cell and spatial technologies. We discuss the mechanisms governing stem cell maintenance and differentiation in SAM and IM, emphasizing the roles of gene regulatory networks, hormonal pathways, and cellular omics insights into stress responses and adaptation. Future directions include cross-species comparisons, multi-omics integration, and the application of these technologies to precision breeding and stress adaptation research, with the ultimate goal of translating our understanding of meristem into the development of higher yield varieties.
Affiliation(s)
- Bin Li
- The Key Laboratory of Plant Development and Environmental Adaptation Biology, Ministry of Education, Shandong Key Laboratory of Precision Molecular Crop Design and Breeding School of Life Sciences, Shandong University, Qingdao, Shandong 266237, China
| | - Wenhao Liu
- The Key Laboratory of Plant Development and Environmental Adaptation Biology, Ministry of Education, Shandong Key Laboratory of Precision Molecular Crop Design and Breeding School of Life Sciences, Shandong University, Qingdao, Shandong 266237, China
| | - Jie Xu
- Housing and Urban Rural Development Bureau of Jimo District, Qingdao, Shandong 266200, China
| | - Xuxu Huang
- The Key Laboratory of Plant Development and Environmental Adaptation Biology, Ministry of Education, Shandong Key Laboratory of Precision Molecular Crop Design and Breeding School of Life Sciences, Shandong University, Qingdao, Shandong 266237, China
| | - Long Yang
- Agricultural Big-Data Research Center and College of Plant Protection, Shandong Agricultural University, Tai'an, Shandong 271018, China
| | - Fang Xu
- The Key Laboratory of Plant Development and Environmental Adaptation Biology, Ministry of Education, Shandong Key Laboratory of Precision Molecular Crop Design and Breeding School of Life Sciences, Shandong University, Qingdao, Shandong 266237, China.
| |
|
42
|
Ito K, Hirakawa T, Shigenobu S, Fujiyoshi H, Yamashita T. Mouse-Geneformer: A deep learning model for mouse single-cell transcriptome and its cross-species utility. PLoS Genet 2025; 21:e1011420. [PMID: 40106407 PMCID: PMC11964219 DOI: 10.1371/journal.pgen.1011420] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/10/2024] [Revised: 04/02/2025] [Accepted: 02/17/2025] [Indexed: 03/22/2025] Open
Abstract
Deep learning techniques are increasingly utilized to analyze large-scale single-cell RNA sequencing (scRNA-seq) data, offering valuable insights from complex transcriptome datasets. Geneformer, a pre-trained model using a Transformer Encoder architecture and human scRNA-seq datasets, has demonstrated remarkable success in human transcriptome analysis. However, given the prominence of the mouse, Mus musculus, as a primary mammalian model in biological and medical research, there is an acute need for a mouse-specific version of Geneformer. In this study, we developed a mouse-specific Geneformer (mouse-Geneformer) by constructing a large transcriptome dataset consisting of 21 million mouse scRNA-seq profiles and pre-training Geneformer on this dataset. The mouse-Geneformer effectively models the mouse transcriptome and, upon fine-tuning for downstream tasks, enhances the accuracy of cell type classification. In silico perturbation experiments using mouse-Geneformer successfully identified disease-causing genes that have been validated in in vivo experiments. These results demonstrate the feasibility of analyzing mouse data with mouse-Geneformer and highlight the robustness of the Geneformer architecture, applicable to any species with large-scale transcriptome data available. Furthermore, we found that mouse-Geneformer can analyze human transcriptome data in a cross-species manner. After the ortholog-based gene name conversion, the analysis of human scRNA-seq data using mouse-Geneformer, followed by fine-tuning with human data, achieved cell type classification accuracy comparable to that obtained using the original human Geneformer. In in silico simulation experiments using human disease models, we obtained results similar to human-Geneformer for the myocardial infarction model but only partially consistent results for the COVID-19 model, a trait unique to humans (laboratory mice are not susceptible to SARS-CoV-2). 
These findings suggest the potential for cross-species application of the Geneformer model while emphasizing the importance of species-specific models for capturing the full complexity of disease mechanisms. Despite the existence of the original Geneformer tailored for humans, human research could benefit from mouse-Geneformer due to its inclusion of samples that are ethically or technically inaccessible for humans, such as embryonic tissues and certain disease models. Additionally, this cross-species approach indicates potential use for non-model organisms, where obtaining large-scale single-cell transcriptome data is challenging.
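The ortholog-based gene name conversion step mentioned in this abstract can be illustrated with a small lookup. This is a sketch under stated assumptions: the function name `convert_gene_names`, the three-entry ortholog table, and the choices to drop genes without an ortholog and to sum values that map to the same target gene are all hypothetical and are not taken from the mouse-Geneformer pipeline.

```python
def convert_gene_names(expression, ortholog_map):
    """Rename the genes of one expression profile using an ortholog table.

    expression   -- dict mapping source-species gene symbol to expression
    ortholog_map -- dict mapping source-species symbol to target symbol
    Genes without an entry in `ortholog_map` are dropped (assumption), and
    values mapping to the same target gene are summed (assumption).
    """
    converted = {}
    for gene, value in expression.items():
        target = ortholog_map.get(gene)
        if target is None:
            continue  # no known ortholog: the model has no token for it
        converted[target] = converted.get(target, 0.0) + value
    return converted


# Illustrative human-to-mouse ortholog table (three well-known pairs).
human_to_mouse = {"TP53": "Trp53", "GAPDH": "Gapdh", "ACTB": "Actb"}

# Hypothetical human profile; "HUMAN_ONLY_GENE" is a placeholder symbol.
human_cell = {"TP53": 3.0, "GAPDH": 10.0, "HUMAN_ONLY_GENE": 2.0}
mouse_style_cell = convert_gene_names(human_cell, human_to_mouse)
```

After conversion the profile uses only symbols the mouse-trained model would recognise, at the cost of discarding any gene that lacks a mapped ortholog.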
Affiliation(s)
- Keita Ito
- Graduate School of Engineering, Chubu University, Kasugai, Aichi, Japan
| | - Tsubasa Hirakawa
- Department of Artificial Intelligence and Robotics, Center for Mathematical Science and Artificial Intelligence, Chubu University, Kasugai, Aichi, Japan
| | - Shuji Shigenobu
- Trans-Scale Biology Center, National Institute for Basic Biology, Okazaki, Aichi, Japan
- Life Science Center for Survival Dynamics, Tsukuba Advanced Research Alliance (TARA), University of Tsukuba, Tsukuba, Ibaraki, Japan
| | | | - Takayoshi Yamashita
- Department of Artificial Intelligence and Robotics, Chubu University, Kasugai, Aichi, Japan
| |
|
43
|
Reina-Campos M, Monell A, Ferry A, Luna V, Cheung KP, Galletti G, Scharping NE, Takehara KK, Quon S, Challita PP, Boland B, Lin YH, Wong WH, Indralingam CS, Neadeau H, Alarcón S, Yeo GW, Chang JT, Heeg M, Goldrath AW. Tissue-resident memory CD8 T cell diversity is spatiotemporally imprinted. Nature 2025; 639:483-492. [PMID: 39843748 PMCID: PMC11903307 DOI: 10.1038/s41586-024-08466-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Accepted: 11/27/2024] [Indexed: 01/24/2025]
Abstract
Tissue-resident memory CD8 T (TRM) cells provide protection from infection at barrier sites. In the small intestine, TRM cells are found in at least two distinct subpopulations: one with higher expression of effector molecules and another with greater memory potential [1]. However, the origins of this diversity remain unknown. Here we proposed that distinct tissue niches drive the phenotypic heterogeneity of TRM cells. To test this, we leveraged spatial transcriptomics of human samples, a mouse model of acute systemic viral infection and a newly established strategy for pooled optically encoded gene perturbations to profile the locations, interactions and transcriptomes of pathogen-specific TRM cell differentiation at single-transcript resolution. We developed computational approaches to capture cellular locations along three anatomical axes of the small intestine and to visualize the spatiotemporal distribution of cell types and gene expression. Our study reveals that the regionalized signalling of the intestinal architecture supports two distinct TRM cell states: differentiated TRM cells and progenitor-like TRM cells, located in the upper villus and lower villus, respectively. This diversity is mediated by distinct ligand-receptor activities, cytokine gradients and specialized cellular contacts. Blocking TGFβ or CXCL9 and CXCL10 sensing by antigen-specific CD8 T cells revealed a model consistent with anatomically delineated, early fate specification. Ultimately, our framework for the study of tissue immune networks reveals that T cell location and functional state are fundamentally intertwined.
Affiliation(s)
- Miguel Reina-Campos
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
- La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Alexander Monell
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
| | - Amir Ferry
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
| | - Vida Luna
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
| | - Kitty P Cheung
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
| | - Giovanni Galletti
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
| | - Nicole E Scharping
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
| | - Kennidy K Takehara
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
| | - Sara Quon
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
| | - Peter P Challita
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA
| | - Brigid Boland
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
| | - Yun Hsuan Lin
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
| | - William H Wong
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
| | | | | | - Suzie Alarcón
- La Jolla Institute for Immunology, La Jolla, CA, USA
| | - Gene W Yeo
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA
| | - John T Chang
- Department of Medicine, University of California, San Diego, La Jolla, CA, USA
- Department of Medicine, Veteran Affairs San Diego Healthcare System, San Diego, CA, USA
| | - Maximilian Heeg
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA.
- Allen Institute for Immunology, Seattle, WA, USA.
| | - Ananda W Goldrath
- School of Biological Sciences, Department of Molecular Biology, University of California, San Diego, La Jolla, CA, USA.
- Allen Institute for Immunology, Seattle, WA, USA.
| |
|
44
|
Stock M, Losert C, Zambon M, Popp N, Lubatti G, Hörmanseder E, Heinig M, Scialdone A. Leveraging prior knowledge to infer gene regulatory networks from single-cell RNA-sequencing data. Mol Syst Biol 2025; 21:214-230. [PMID: 39939367 DOI: 10.1038/s44320-025-00088-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2024] [Revised: 01/29/2025] [Accepted: 01/30/2025] [Indexed: 02/14/2025] Open
Abstract
Many studies have used single-cell RNA sequencing (scRNA-seq) to infer gene regulatory networks (GRNs), which are crucial for understanding complex cellular regulation. However, the inherent noise and sparsity of scRNA-seq data present significant challenges to accurate GRN inference. This review explores one promising approach that has been proposed to address these challenges: integrating prior knowledge into the inference process to enhance the reliability of the inferred networks. We categorize common types of prior knowledge, such as experimental data and curated databases, and discuss methods for representing priors, particularly through graph structures. In addition, we classify recent GRN inference algorithms based on their ability to incorporate these priors and assess their performance in different contexts. Finally, we propose a standardized benchmarking framework to evaluate algorithms more fairly, ensuring biologically meaningful comparisons. This review provides guidance for researchers selecting GRN inference methods and offers insights for developers looking to improve current approaches and foster innovation in the field.
Affiliation(s)
- Marco Stock
- Helmholtz Center Munich Institute of Epigenetics and Stem Cells, Munich, Germany
- Helmholtz Center Munich Institute of Computational Biology, Munich, Germany
- Helmholtz Center Munich Institute of Functional Epigenetics, Munich, Germany
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
| | - Corinna Losert
- Helmholtz Center Munich Institute of Computational Biology, Munich, Germany
- Department of Computer Science, TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
| | - Matteo Zambon
- Helmholtz Center Munich Institute of Epigenetics and Stem Cells, Munich, Germany
- Helmholtz Center Munich Institute of Computational Biology, Munich, Germany
- Helmholtz Center Munich Institute of Functional Epigenetics, Munich, Germany
| | - Niclas Popp
- Helmholtz Center Munich Institute of Epigenetics and Stem Cells, Munich, Germany
- Helmholtz Center Munich Institute of Computational Biology, Munich, Germany
- Helmholtz Center Munich Institute of Functional Epigenetics, Munich, Germany
| | - Gabriele Lubatti
- Helmholtz Center Munich Institute of Epigenetics and Stem Cells, Munich, Germany
- Helmholtz Center Munich Institute of Computational Biology, Munich, Germany
- Helmholtz Center Munich Institute of Functional Epigenetics, Munich, Germany
| | - Eva Hörmanseder
- Helmholtz Center Munich Institute of Epigenetics and Stem Cells, Munich, Germany
| | - Matthias Heinig
- Helmholtz Center Munich Institute of Computational Biology, Munich, Germany
- Department of Computer Science, TUM School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
- German Centre for Cardiovascular Research (DZHK), Munich Heart Alliance, Partner Site Munich, Berlin, Germany
| | - Antonio Scialdone
- Helmholtz Center Munich Institute of Epigenetics and Stem Cells, Munich, Germany.
- Helmholtz Center Munich Institute of Computational Biology, Munich, Germany.
- Helmholtz Center Munich Institute of Functional Epigenetics, Munich, Germany.
| |
|
45
|
Ding S. Therapeutic Reprogramming toward Regenerative Medicine. Chem Rev 2025; 125:1805-1822. [PMID: 39907153 DOI: 10.1021/acs.chemrev.4c00332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/06/2025]
Abstract
Therapeutic reprogramming represents a transformative paradigm in regenerative medicine, developing new approaches in cell therapy, small molecule drugs, biologics, and gene therapy to address unmet medical challenges. This paradigm encompasses the precise modulation of cellular fate and function to either generate safe and functional cells ex vivo for cell-based therapies or to directly reprogram endogenous cells in vivo or in situ for tissue repair and regeneration. Building on the discovery of induced pluripotent stem cells (iPSCs), advancements in chemical modulation and CRISPR-based gene editing have propelled a new iterative medicine paradigm, focusing on developing scalable, standardized cell therapy products from universal starting materials and enabling iterative improvements for more effective therapeutic profiles. Beyond cell-based therapies, non-cell-based therapeutic strategies targeting endogenous cells may offer a less invasive, more convenient, accessible, and cost-effective alternative for treating a broad range of diseases, potentially rejuvenating tissues and extending healthspan.
Affiliation(s)
- Sheng Ding
- New Cornerstone Science Laboratory, School of Pharmaceutical Sciences, Tsinghua University, Beijing 100084, China
- Tsinghua-Peking Joint Center for Life Sciences, Tsinghua University, Beijing 100084, China
- Global Health Drug Discovery Institute, Beijing 100192, China
- CRE Life Institute, Beijing 100192, China
| |
|
46
|
Nadig A, Thoutam A, Hughes M, Gupta A, Navia AW, Fusi N, Raghavan S, Winter PS, Amini AP, Crawford L. Consequences of training data composition for deep learning models in single-cell biology. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.19.639127. [PMID: 40060416 PMCID: PMC11888162 DOI: 10.1101/2025.02.19.639127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/17/2025]
Abstract
Foundation models for single-cell transcriptomics have the potential to augment (or replace) purpose-built tools for a variety of common analyses, especially when data are sparse. Recent work with large language models has shown that training data composition greatly shapes performance; however, to date, single-cell foundation models have ignored this aspect, opting instead to train on the largest possible corpus. We systematically investigate the consequences of training dataset composition on the behavior of deep learning models of single-cell transcriptomics, focusing on human hematopoiesis as a tractable model system and including cells from adult and developing tissues, disease states, and perturbation atlases. We find that (1) these models generalize poorly to unseen cell types, (2) adding malignant cells to a healthy cell training corpus does not necessarily improve modeling of unseen malignant cells, and (3) including an embryonic stem cell differentiation atlas during training improves performance on out-of-distribution tasks. Our results emphasize the importance of diverse training data and suggest strategies to optimize future single-cell foundation models.
Affiliation(s)
- Ajay Nadig
- Harvard Medical School, Boston, MA, USA
- Massachusetts General Hospital, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | | | | - Anay Gupta
- Georgia Institute of Technology, Atlanta, GA, USA
| | | | | | - Srivatsan Raghavan
- Harvard Medical School, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Brigham and Women's Hospital, Boston, MA, USA
- Dana-Farber Cancer Institute, Boston, MA, USA
| | | | | | | |
|
47
|
Gan D, Li J. Small, Open-Source Text-Embedding Models as Substitutes to OpenAI Models for Gene Analysis. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.15.638462. [PMID: 40027770 PMCID: PMC11870524 DOI: 10.1101/2025.02.15.638462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2025]
Abstract
While foundation transformer-based models developed for gene expression data analysis can be costly to train and operate, a recent approach known as GenePT offers a low-cost and highly efficient alternative. GenePT utilizes OpenAI's text-embedding function to encode background information, which is in textual form, about genes. However, the closed-source, online nature of OpenAI's text-embedding service raises concerns regarding data privacy, among other issues. In this paper, we explore the possibility of replacing OpenAI's models with open-source transformer-based text-embedding models. We identified ten models from Hugging Face that are small in size, easy to install, and light in computation. Across all four gene classification tasks we considered, some of these models have outperformed OpenAI's, demonstrating their potential as viable, or even superior, alternatives. Additionally, we find that fine-tuning these models often does not lead to significant improvements in performance.
|
48
|
Saunders RA, Allen WE, Pan X, Sandhu J, Lu J, Lau TK, Smolyar K, Sullivan ZA, Dulac C, Weissman JS, Zhuang X. A platform for multimodal in vivo pooled genetic screens reveals regulators of liver function. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.11.18.624217. [PMID: 39605605 PMCID: PMC11601512 DOI: 10.1101/2024.11.18.624217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Organ function requires coordinated activities of thousands of genes in distinct, spatially organized cell types. Understanding the basis of emergent tissue function requires approaches to dissect the genetic control of diverse cellular and tissue phenotypes in vivo. Here, we develop paired imaging and sequencing methods to construct large-scale, multi-modal genotype-phenotype maps in tissue with pooled genetic perturbations. Using imaging, we identify genetic perturbations in individual cells while simultaneously measuring their gene expression and subcellular morphology. Using single-cell sequencing, we measure transcriptomic responses to the same genetic perturbations. We apply this approach to study hundreds of genetic perturbations in the mouse liver. Our study reveals regulators of hepatocyte zonation and liver unfolded protein response, as well as distinct pathways that cause hepatocyte steatosis. Our approach enables new ways of interrogating the genetic basis of complex cellular and organismal physiology and provides crucial training data for emerging machine-learning models of cellular function.
Affiliation(s)
- Reuben A. Saunders
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Whitehead Institute, Cambridge, MA 02139, USA
- University of California, San Francisco, San Francisco, CA 94158, USA
- Present address: Society of Fellows, Harvard University, MA 02138, USA
- These authors contributed equally
| | - William E. Allen
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Society of Fellows, Harvard University, Cambridge, MA 02138, USA
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
- Present address: Department of Developmental Biology, Stanford University School of Medicine, Stanford, CA 94305; Arc Institute, Palo Alto, CA 94304
- These authors contributed equally
- Lead contact
| | - Xingjie Pan
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
- Lead AI Scientist
| | - Jaspreet Sandhu
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Whitehead Institute, Cambridge, MA 02139, USA
- Division of Gastroenterology, Massachusetts General Hospital, Boston, MA 02114, USA
| | - Jiaqi Lu
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
| | - Thomas K. Lau
- Department of Statistics, Stanford University, Stanford, CA 94305
| | - Karina Smolyar
- Whitehead Institute, Cambridge, MA 02139, USA
- Department of Biology, MIT, Cambridge, MA 02139, USA
| | - Zuri A. Sullivan
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
| | - Catherine Dulac
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA 02138, USA
| | - Jonathan S. Weissman
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Whitehead Institute, Cambridge, MA 02139, USA
- Department of Biology, MIT, Cambridge, MA 02139, USA
| | - Xiaowei Zhuang
- Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA
- Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
- Department of Physics, Harvard University, Cambridge, MA 02138, USA
| |
|
49
|
Dibaeinia P, Ojha A, Sinha S. Interpretable AI for inference of causal molecular relationships from omics data. SCIENCE ADVANCES 2025; 11:eadk0837. [PMID: 39951525 PMCID: PMC11827637 DOI: 10.1126/sciadv.adk0837] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 01/14/2025] [Indexed: 02/16/2025]
Abstract
The discovery of molecular relationships from high-dimensional data is a major open problem in bioinformatics. Machine learning and feature attribution models have shown great promise in this context but lack causal interpretation. Here, we show that a popular feature attribution model, under certain assumptions, estimates an average of a causal quantity reflecting the direct influence of one variable on another. We leverage this insight to propose a precise definition of a gene regulatory relationship and implement a new tool, CIMLA (Counterfactual Inference by Machine Learning and Attribution Models), to identify differences in gene regulatory networks between biological conditions, a problem that has received great attention in recent years. Using extensive benchmarking on simulated data, we show that CIMLA is more robust to confounding variables and is more accurate than leading methods. Last, we use CIMLA to analyze a previously published single-cell RNA sequencing dataset from subjects with and without Alzheimer's disease (AD), discovering several potential regulators of AD.
Affiliation(s)
- Payam Dibaeinia
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Abhishek Ojha
- Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
| | - Saurabh Sinha
- Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
- H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
| |
|
50
|
Chakraborty C, Bhattacharya M, Pal S, Chatterjee S, Das A, Lee SS. AI-enabled language models (LMs) to large language models (LLMs) and multimodal large language models (MLLMs) in drug discovery and development. J Adv Res 2025:S2090-1232(25)00109-2. [PMID: 39952319 DOI: 10.1016/j.jare.2025.02.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2024] [Revised: 01/03/2025] [Accepted: 02/08/2025] [Indexed: 02/17/2025] Open
Abstract
BACKGROUND: Due to the recent revolution in artificial intelligence (AI), AI-enabled large language models (LLMs) have flourished and started to be applied in various sectors of science and medicine. Drug discovery and development are time-consuming, complex processes that require high investment. The conventional method of drug discovery is costly and has a high failure rate. AI-enabled LLMs are used in various steps of drug discovery to address the challenges of time and cost. AIM OF REVIEW: The article aims to provide a comprehensive understanding of AI-enabled LLMs and their use in various steps of drug discovery to ease these challenges. KEY SCIENTIFIC CONCEPTS OF REVIEW: The review provides an overview of LLMs and their current state-of-the-art applications in structure-based drug molecule design and de novo drug design. Different applications of AI-enabled LLMs are illustrated, such as drug target identification, validation, interaction, and ADME/ADMET. Several domain-specific LLMs have been developed and applied in drug discovery and development to speed up the process. We discuss these domain-specific models and their applications in this field. Finally, we illustrate the challenges and future perspectives on the applications of AI-enabled LLMs in drug discovery and development.
Affiliation(s)
- Chiranjib Chakraborty
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal 700126, India.
| | - Manojit Bhattacharya
- Department of Zoology, Fakir Mohan University, Vyasa Vihar, Balasore 756020, Odisha, India
| | - Soumen Pal
- School of Mechanical Engineering, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India
| | - Srijan Chatterjee
- Institute for Skeletal Aging & Orthopedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon, Gangwon-Do, 24252, Republic of Korea
| | - Arpita Das
- Department of Biotechnology, School of Life Science and Biotechnology, Adamas University, Kolkata, West Bengal 700126, India
| | - Sang-Soo Lee
- Institute for Skeletal Aging & Orthopedic Surgery, Hallym University-Chuncheon Sacred Heart Hospital, Chuncheon, Gangwon-Do, 24252, Republic of Korea.
| |
|