1
|
Yu H, Zheng Y, Yang X. scDM: A deep generative method for cell surface protein prediction with diffusion model. J Mol Biol 2024; 436:168610. [PMID: 38754773 DOI: 10.1016/j.jmb.2024.168610] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/03/2024] [Revised: 05/06/2024] [Accepted: 05/09/2024] [Indexed: 05/18/2024]
Abstract
The executors of organismal functions are proteins, and the transition from RNA to protein is subject to post-transcriptional regulation; therefore, considering both RNA and surface protein expression simultaneously can provide additional evidence of biological processes. Cellular indexing of transcriptomes and epitopes by sequencing (CITE-Seq) technology can measure both RNA and protein expression in single cells, but these experiments are expensive and time-consuming. Due to the lack of computational tools for predicting surface proteins, we used datasets obtained with CITE-seq technology to design a deep generative prediction method based on diffusion models and to find biological discoveries through the prediction results. In our method, the scDM, which predicts protein expression values from RNA expression values of individual cells, uses a novel way of encoding the data into a model and generates predicted samples by introducing Gaussian noise to gradually remove the noise to learn the data distribution during the modelling process. Comprehensive evaluation across different datasets demonstrated that our predictions yielded satisfactory results and further demonstrated the effectiveness of incorporating information from single-cell multiomics data into diffusion models for biological studies. We also found that new directions for discovering therapeutic drug targets could be provided by jointly analysing the predictive value of surface protein expression and cancer cell drug scores.
Collapse
Affiliation(s)
- Hanlei Yu
- School of Information Science and Engineering, Shandong Normal University, Jinan 250358, China
| | - Yuanjie Zheng
- School of Information Science and Engineering, Shandong Normal University, Jinan 250358, China.
| | - Xinbo Yang
- School of Information Science and Engineering, Shandong Normal University, Jinan 250358, China
| |
Collapse
|
2
|
Sharma A, López Y, Jia S, Lysenko A, Boroevich KA, Tsunoda T. Enhanced analysis of tabular data through Multi-representation DeepInsight. Sci Rep 2024; 14:12851. [PMID: 38834670 DOI: 10.1038/s41598-024-63630-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2023] [Accepted: 05/30/2024] [Indexed: 06/06/2024] Open
Abstract
Tabular data analysis is a critical task in various domains, enabling us to uncover valuable insights from structured datasets. While traditional machine learning methods can be used for feature engineering and dimensionality reduction, they often struggle to capture the intricate relationships and dependencies within real-world datasets. In this paper, we present Multi-representation DeepInsight (MRep-DeepInsight), a novel extension of the DeepInsight method designed to enhance the analysis of tabular data. By generating multiple representations of samples using diverse feature extraction techniques, our approach is able to capture a broader range of features and reveal deeper insights. We demonstrate the effectiveness of MRep-DeepInsight on single-cell datasets, Alzheimer's data, and artificial data, showcasing an improved accuracy over the original DeepInsight approach and machine learning methods like random forest, XGBoost, LightGBM, FT-Transformer and L2-regularized logistic regression. Our results highlight the value of incorporating multiple representations for robust and accurate tabular data analysis. By leveraging the power of diverse representations, MRep-DeepInsight offers a promising new avenue for advancing decision-making and scientific discovery across a wide range of fields.
Collapse
Affiliation(s)
- Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia.
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
| | | | - Shangru Jia
- Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
| | - Artem Lysenko
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan
| | - Keith A Boroevich
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
- Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan.
| |
Collapse
|
3
|
Lv Z, Wei X, Hu S, Lin G, Qiu W. iSUMO-RsFPN: A predictor for identifying lysine SUMOylation sites based on multi-features and feature pyramid networks. Anal Biochem 2024; 687:115460. [PMID: 38191118 DOI: 10.1016/j.ab.2024.115460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Revised: 01/02/2024] [Accepted: 01/03/2024] [Indexed: 01/10/2024]
Abstract
SUMOylation is a protein post-translational modification that plays an essential role in cellular functions. For predicting SUMO sites, numerous researchers have proposed advanced methods based on ordinary machine learning algorithms. These reported methods have shown excellent predictive performance, but there is room for improvement. In this study, we constructed a novel deep neural network Residual Pyramid Network (RsFPN), and developed an ensemble deep learning predictor called iSUMO-RsFPN. Initially, three feature extraction methods were employed to extract features from samples. Following this, weak classifiers were trained based on RsFPN for each feature type. Ultimately, the weak classifiers were integrated to construct the final classifier. Moreover, the predictor underwent systematically testing on an independent test dataset, where the results demonstrated a significant improvement over the existing state-of-the-art predictors. The code of iSUMO-RsFPN is free and available at https://github.com/454170054/iSUMO-RsFPN.
Collapse
Affiliation(s)
- Zhe Lv
- School of Mega Data, Jiangxi Institute of Fashion Technology, 330201, Nanchang, Jiangxi, China
| | - Xin Wei
- Business School, Jiangxi Institute of Fashion Technology, 330201, Nanchang, Jiangxi, China
| | - Siqin Hu
- School of Mega Data, Jiangxi Institute of Fashion Technology, 330201, Nanchang, Jiangxi, China
| | - Gang Lin
- School of Mega Data, Jiangxi Institute of Fashion Technology, 330201, Nanchang, Jiangxi, China
| | - Wangren Qiu
- Computer Department, Jingdezhen Ceramic University, 333403, Jingdezhen, Jiangxi, China.
| |
Collapse
|
4
|
Sharma A, Lysenko A, Jia S, Boroevich KA, Tsunoda T. Advances in AI and machine learning for predictive medicine. J Hum Genet 2024:10.1038/s10038-024-01231-y. [PMID: 38424184 DOI: 10.1038/s10038-024-01231-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Revised: 02/04/2024] [Accepted: 02/12/2024] [Indexed: 03/02/2024]
Abstract
The field of omics, driven by advances in high-throughput sequencing, faces a data explosion. This abundance of data offers unprecedented opportunities for predictive modeling in precision medicine, but also presents formidable challenges in data analysis and interpretation. Traditional machine learning (ML) techniques have been partly successful in generating predictive models for omics analysis but exhibit limitations in handling potential relationships within the data for more accurate prediction. This review explores a revolutionary shift in predictive modeling through the application of deep learning (DL), specifically convolutional neural networks (CNNs). Using transformation methods such as DeepInsight, omics data with independent variables in tabular (table-like, including vector) form can be turned into image-like representations, enabling CNNs to capture latent features effectively. This approach not only enhances predictive power but also leverages transfer learning, reducing computational time, and improving performance. However, integrating CNNs in predictive omics data analysis is not without challenges, including issues related to model interpretability, data heterogeneity, and data size. Addressing these challenges requires a multidisciplinary approach, involving collaborations between ML experts, bioinformatics researchers, biologists, and medical doctors. This review illuminates these complexities and charts a course for future research to unlock the full predictive potential of CNNs in omics data analysis and related fields.
Collapse
Affiliation(s)
- Alok Sharma
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Institute for Integrated and Intelligent Systems, Griffith University, Queensland, Australia.
| | - Artem Lysenko
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
| | - Shangru Jia
- Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
| | - Keith A Boroevich
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Tatsuhiko Tsunoda
- Laboratory for Medical Science Mathematics, Department of Biological Sciences, School of Science, The University of Tokyo, Tokyo, Japan.
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan.
- Laboratory for Medical Science Mathematics, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan.
| |
Collapse
|
5
|
Wang X, Chai Z, Li S, Liu Y, Li C, Jiang Y, Liu Q. CTISL: a dynamic stacking multi-class classification approach for identifying cell types from single-cell RNA-seq data. Bioinformatics 2024; 40:btae063. [PMID: 38317054 PMCID: PMC10873586 DOI: 10.1093/bioinformatics/btae063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2024] [Revised: 02/15/2024] [Accepted: 02/15/2024] [Indexed: 02/07/2024] Open
Abstract
MOTIVATION Effective identification of cell types is of critical importance in single-cell RNA-sequencing (scRNA-seq) data analysis. To date, many supervised machine learning-based predictors have been implemented to identify cell types from scRNA-seq datasets. Despite the technical advances of these state-of-the-art tools, most existing predictors were single classifiers, of which the performances can still be significantly improved. It is therefore highly desirable to employ the ensemble learning strategy to develop more accurate computational models for robust and comprehensive identification of cell types on scRNA-seq datasets. RESULTS We propose a two-layer stacking model, termed CTISL (Cell Type Identification by Stacking ensemble Learning), which integrates multiple classifiers to identify cell types. In the first layer, given a reference scRNA-seq dataset with known cell types, CTISL dynamically combines multiple cell-type-specific classifiers (i.e. support-vector machine and logistic regression) as the base learners to deliver the outcomes for the input of a meta-classifier in the second layer. We conducted a total of 24 benchmarking experiments on 17 human and mouse scRNA-seq datasets to evaluate and compare the prediction performance of CTISL and other state-of-the-art predictors. The experiment results demonstrate that CTISL achieves superior or competitive performance compared to these state-of-the-art approaches. We anticipate that CTISL can serve as a useful and reliable tool for cost-effective identification of cell types from scRNA-seq datasets. AVAILABILITY AND IMPLEMENTATION The webserver and source code are freely available at http://bigdata.biocie.cn/CTISLweb/home and https://zenodo.org/records/10568906, respectively.
Collapse
Affiliation(s)
- Xiao Wang
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Ziyi Chai
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Shaohua Li
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
| | - Yan Liu
- School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
| | - Chen Li
- Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
| | - Yu Jiang
- Department of Animal Genetics, Breeding and Reproduction, College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China
| | - Quanzhong Liu
- Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Shaanxi Engineering Research Center of Agricultural Information Intelligent Perception and Analysis, Northwest A&F University, Yangling 712100, China
| |
Collapse
|
6
|
Alsaggaf I, Buchan D, Wan C. Improving cell type identification with Gaussian noise-augmented single-cell RNA-seq contrastive learning. Brief Funct Genomics 2024:elad059. [PMID: 38242863 DOI: 10.1093/bfgp/elad059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2023] [Revised: 12/14/2023] [Accepted: 12/18/2023] [Indexed: 01/21/2024] Open
Abstract
Cell type identification is an important task for single-cell RNA-sequencing (scRNA-seq) data analysis. Many prediction methods have recently been proposed, but the predictive accuracy of difficult cell type identification tasks is still low. In this work, we proposed a novel Gaussian noise augmentation-based scRNA-seq contrastive learning method (GsRCL) to learn a type of discriminative feature representations for cell type identification tasks. A large-scale computational evaluation suggests that GsRCL successfully outperformed other state-of-the-art predictive methods on difficult cell type identification tasks, while the conventional random genes masking augmentation-based contrastive learning method also improved the accuracy of easy cell type identification tasks in general.
Collapse
Affiliation(s)
- Ibrahim Alsaggaf
- School of Computing and Mathematical Sciences, Birkbeck, University of London, Malet Street, WC1E 7HX, London, United Kingdom
| | - Daniel Buchan
- Department of Computer Science, University College London, Gower Street, WC1E 6BT, London, United Kingdom
| | - Cen Wan
- School of Computing and Mathematical Sciences, Birkbeck, University of London, Malet Street, WC1E 7HX, London, United Kingdom
| |
Collapse
|
7
|
Yin Q, Chen L. CellTICS: an explainable neural network for cell-type identification and interpretation based on single-cell RNA-seq data. Brief Bioinform 2023; 25:bbad449. [PMID: 38061196 PMCID: PMC10703497 DOI: 10.1093/bib/bbad449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2023] [Revised: 10/30/2023] [Accepted: 11/14/2023] [Indexed: 12/18/2023] Open
Abstract
Identifying cell types is crucial for understanding the functional units of an organism. Machine learning has shown promising performance in identifying cell types, but many existing methods lack biological significance due to poor interpretability. However, it is of the utmost importance to understand what makes cells share the same function and form a specific cell type, motivating us to propose a biologically interpretable method. CellTICS prioritizes marker genes with cell-type-specific expression, using a hierarchy of biological pathways for neural network construction, and applying a multi-predictive-layer strategy to predict cell and sub-cell types. CellTICS usually outperforms existing methods in prediction accuracy. Moreover, CellTICS can reveal pathways that define a cell type or a cell type under specific physiological conditions, such as disease or aging. The nonlinear nature of neural networks enables us to identify many novel pathways. Interestingly, some of the pathways identified by CellTICS exhibit differential expression "variability" rather than differential expression across cell types, indicating that expression stochasticity within a pathway could be an important feature characteristic of a cell type. Overall, CellTICS provides a biologically interpretable method for identifying and characterizing cell types, shedding light on the underlying pathways that define cellular heterogeneity and its role in organismal function. CellTICS is available at https://github.com/qyyin0516/CellTICS.
Collapse
Affiliation(s)
- Qingyang Yin
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, United States
| | - Liang Chen
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, United States
| |
Collapse
|
8
|
Awuah WA, Ahluwalia A, Ghosh S, Roy S, Tan JK, Adebusoye FT, Ferreira T, Bharadwaj HR, Shet V, Kundu M, Yee ALW, Abdul-Rahman T, Atallah O. The molecular landscape of neurological disorders: insights from single-cell RNA sequencing in neurology and neurosurgery. Eur J Med Res 2023; 28:529. [PMID: 37974227 PMCID: PMC10652629 DOI: 10.1186/s40001-023-01504-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2023] [Accepted: 11/03/2023] [Indexed: 11/19/2023] Open
Abstract
Single-cell ribonucleic acid sequencing (scRNA-seq) has emerged as a transformative technology in neurological and neurosurgical research, revolutionising our comprehension of complex neurological disorders. In brain tumours, scRNA-seq has provided valuable insights into cancer heterogeneity, the tumour microenvironment, treatment resistance, and invasion patterns. It has also elucidated the brain tri-lineage cancer hierarchy and addressed limitations of current models. Neurodegenerative diseases such as Alzheimer's disease, Parkinson's disease, and amyotrophic lateral sclerosis have been molecularly subtyped, dysregulated pathways have been identified, and potential therapeutic targets have been revealed using scRNA-seq. In epilepsy, scRNA-seq has explored the cellular and molecular heterogeneity underlying the condition, uncovering unique glial subpopulations and dysregulation of the immune system. ScRNA-seq has characterised distinct cellular constituents and responses to spinal cord injury in spinal cord diseases, as well as provided molecular signatures of various cell types and identified interactions involved in vascular remodelling. Furthermore, scRNA-seq has shed light on the molecular complexities of cerebrovascular diseases, such as stroke, providing insights into specific genes, cell-specific expression patterns, and potential therapeutic interventions. This review highlights the potential of scRNA-seq in guiding precision medicine approaches, identifying clinical biomarkers, and facilitating therapeutic discovery. However, challenges related to data analysis, standardisation, sample acquisition, scalability, and cost-effectiveness need to be addressed. Despite these challenges, scRNA-seq has the potential to transform clinical practice in neurological and neurosurgical research by providing personalised insights and improving patient outcomes.
Collapse
Affiliation(s)
- Wireko Andrew Awuah
- Faculty of Medicine, Sumy State University, Zamonstanksya 7, Sumy, 40007, Ukraine
| | | | - Shankaneel Ghosh
- Institute of Medical Sciences and SUM Hospital, Bhubaneswar, India
| | - Sakshi Roy
- School of Medicine, Queen's University Belfast, Belfast, UK
| | | | | | - Tomas Ferreira
- Department of Clinical Neurosciences, School of Clinical Medicine, University of Cambridge, Cambridge, UK
| | | | - Vallabh Shet
- Faculty of Medicine, Bangalore Medical College and Research Institute, Bangalore, Karnataka, India
| | - Mrinmoy Kundu
- Institute of Medical Sciences and SUM Hospital, Bhubaneswar, India
| | | | - Toufik Abdul-Rahman
- Faculty of Medicine, Sumy State University, Zamonstanksya 7, Sumy, 40007, Ukraine
| | - Oday Atallah
- Department of Neurosurgery, Hannover Medical School, Carl-Neuberg-Strasse 1, 30625, Hannover, Germany
| |
Collapse
|
9
|
Li W, Xiang B, Yang F, Rong Y, Yin Y, Yao J, Zhang H. scMHNN: a novel hypergraph neural network for integrative analysis of single-cell epigenomic, transcriptomic and proteomic data. Brief Bioinform 2023; 24:bbad391. [PMID: 37930028 DOI: 10.1093/bib/bbad391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2023] [Revised: 09/09/2023] [Accepted: 10/11/2023] [Indexed: 11/07/2023] Open
Abstract
Technological advances have now made it possible to simultaneously profile the changes of epigenomic, transcriptomic and proteomic at the single cell level, allowing a more unified view of cellular phenotypes and heterogeneities. However, current computational tools for single-cell multi-omics data integration are mainly tailored for bi-modality data, so new tools are urgently needed to integrate tri-modality data with complex associations. To this end, we develop scMHNN to integrate single-cell multi-omics data based on hypergraph neural network. After modeling the complex data associations among various modalities, scMHNN performs message passing process on the multi-omics hypergraph, which can capture the high-order data relationships and integrate the multiple heterogeneous features. Followingly, scMHNN learns discriminative cell representation via a dual-contrastive loss in self-supervised manner. Based on the pretrained hypergraph encoder, we further introduce the pre-training and fine-tuning paradigm, which allows more accurate cell-type annotation with only a small number of labeled cells as reference. Benchmarking results on real and simulated single-cell tri-modality datasets indicate that scMHNN outperforms other competing methods on both cell clustering and cell-type annotation tasks. In addition, we also demonstrate scMHNN facilitates various downstream tasks, such as cell marker detection and enrichment analysis.
Collapse
Affiliation(s)
- Wei Li
- College of Artificial Intelligence, Nankai University, Tongyan Road, 300350 Tianjin, China
- AI Lab, Tencent, Gaoxin 9th South Road, 518000 Shenzhen, China
| | - Bin Xiang
- CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Yueyang Road, 200031 Shanghai, China
| | - Fan Yang
- AI Lab, Tencent, Gaoxin 9th South Road, 518000 Shenzhen, China
| | - Yu Rong
- AI Lab, Tencent, Gaoxin 9th South Road, 518000 Shenzhen, China
| | - Yanbin Yin
- Department of Food Science and Technology, University of Nebraska - Lincoln, 1400 R Street, 68588 Nebraska, USA
| | - Jianhua Yao
- AI Lab, Tencent, Gaoxin 9th South Road, 518000 Shenzhen, China
| | - Han Zhang
- College of Artificial Intelligence, Nankai University, Tongyan Road, 300350 Tianjin, China
| |
Collapse
|