1
Rollo C, Pancotti C, Sartori F, Caranzano I, D'Amico S, Carota L, Casadei F, Birolo G, Lanino L, Sauta E, Asti G, Buizza A, Delleani M, Zazzetti E, Bicchieri M, Maggioni G, Fenaux P, Platzbecker U, Diez-Campelo M, Haferlach T, Castellani G, Della Porta MG, Fariselli P, Sanavia T. VAE-Surv: A novel approach for genetic-based clustering and prognosis prediction in myelodysplastic syndromes. Comput Methods Programs Biomed 2025; 261:108605. [PMID: 39874934 DOI: 10.1016/j.cmpb.2025.108605] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Revised: 12/13/2024] [Accepted: 01/12/2025] [Indexed: 01/30/2025]
Abstract
BACKGROUND AND OBJECTIVES Several computational pipelines for biomedical data have been proposed to stratify patients and to predict their prognosis through survival analysis. However, these analyses are usually performed independently, without integrating the information derived from each of them. Clustering of survival data is an underexplored problem, and current approaches are of limited use in biomedical applications, where data are usually heterogeneous and multimodal, because they scale poorly to high-dimensional inputs.
METHODS We introduce VAE-Surv, a multimodal computational framework for patient stratification and prognosis prediction. VAE-Surv integrates a Variational Autoencoder (VAE), which reduces the high-dimensional space characterizing the molecular data, with a deep survival model, which combines the embedded information with the clinical features. The VAE embedding step prioritizes local coherence within the feature space to detect potential nonlinear relationships among the molecular markers. The latent representation is then used to perform K-means clustering. To test the clinical robustness of the algorithm, VAE-Surv was applied to the GenoMed4All cohort of Myelodysplastic Syndromes (MDS), comparing the identified subtypes with the World Health Organization (WHO) classification. The survival outcome was compared with the state-of-the-art Cox model and its penalized versions. Finally, to assess the generalizability of the results, the method was also validated on an external MDS cohort.
RESULTS Tested on 2,043 patients in the GenoMed4All cohort, VAE-Surv achieved a median C-index of 0.78, outperforming classical approaches. In addition, the latent space enhanced the clustering performance compared to a traditional approach that applies the clustering directly to the input data. Compared to the WHO 2016 MDS subtypes, the analysis of the identified clusters showed that the proposed framework can capture existing clinical categorizations while also suggesting novel, data-driven patient groups. Even when tested on an external MDS cohort of 2,384 patients, VAE-Surv achieved good prediction performance (median C-index = 0.74), preserving the interpretability of the main clinical and genetic features.
CONCLUSIONS VAE-Surv enables automatic identification of patient clusters while also outperforming the traditional CoxPH model in survival prediction. Applied to the MDS use case, the obtained genetic-based clusters exhibit a clear survival stratification, and the inclusion of clinical information yielded high performance in prognosis prediction.
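The C-index values reported in this abstract measure how well predicted risks rank patients by survival time. As a reader's aid, here is a minimal sketch of Harrell's concordance index in plain Python. It is not taken from the VAE-Surv paper: the function name `concordance_index` and its argument names are illustrative assumptions, and pairs with tied follow-up times are simply skipped.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Harrell's C-index: share of comparable pairs that are ranked correctly.

    times: observed follow-up times
    events: 1 if the event was observed, 0 if the subject was censored
    risk_scores: model output, where a higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that subject `a` has the shorter follow-up time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the shorter
        # follow-up ended in an observed event (simplifying assumption:
        # tied times are skipped).
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half-correct
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives a scale for reading the reported medians of 0.78 and 0.74.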
Affiliation(s)
- Cesare Rollo
  - Computational Biomedicine Unit, Department of Medical Sciences, University of Torino, Via Santena 19, 10126, Torino, Italy
- Corrado Pancotti
  - Computational Biomedicine Unit, Department of Medical Sciences, University of Torino, Via Santena 19, 10126, Torino, Italy
- Flavio Sartori
  - Computational Biomedicine Unit, Department of Medical Sciences, University of Torino, Via Santena 19, 10126, Torino, Italy
- Isabella Caranzano
  - Computational Biomedicine Unit, Department of Medical Sciences, University of Torino, Via Santena 19, 10126, Torino, Italy
- Saverio D'Amico
  - IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy; Train s.r.l., via Alessandro Manzoni 56, 20089 Rozzano - Milan, Italy
- Luciana Carota
  - Department of Medical and Surgical Sciences (DIMEC), University of Bologna, 40126 Bologna, Italy
- Francesco Casadei
  - IRCCS Istituto delle Scienze Neurologiche di Bologna, 40138 Bologna, Italy
- Giovanni Birolo
  - Computational Biomedicine Unit, Department of Medical Sciences, University of Torino, Via Santena 19, 10126, Torino, Italy
- Luca Lanino
  - IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy
- Elisabetta Sauta
  - IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy
- Gianluca Asti
  - IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy
- Alessandro Buizza
  - IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy
- Mattia Delleani
  - IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy; Train s.r.l., via Alessandro Manzoni 56, 20089 Rozzano - Milan, Italy
- Elena Zazzetti
  - IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy
- Marilena Bicchieri
  - IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy
- Giulia Maggioni
  - IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy
- Pierre Fenaux
  - Hematology and Bone Marrow Transplantation, Hôpital Saint-Louis/University Paris 7, Paris, France
- Uwe Platzbecker
  - Medical Clinic and Policlinic 1, Hematology and Cellular Therapy, University Hospital Leipzig, Germany
- Maria Diez-Campelo
  - Hematology Department, Hospital Universitario de Salamanca, Salamanca, Spain
- Torsten Haferlach
  - MLL Munich Leukemia Laboratory, Max-Lebsche-Platz 31, 81377 Munich, Germany
- Gastone Castellani
  - Department of Medical and Surgical Sciences (DIMEC), University of Bologna, 40126 Bologna, Italy; IRCCS Azienda Ospedaliero-Universitaria di Bologna S.Orsola, 40138 Bologna, Italy
- Matteo Giovanni Della Porta
  - IRCCS Humanitas Research Hospital, via Manzoni 56, 20089 Rozzano - Milan, Italy; Department of Biomedical Sciences, Humanitas University, via Montalcini 4, 20072 Pieve Emanuele - Milan, Italy
- Piero Fariselli
  - Computational Biomedicine Unit, Department of Medical Sciences, University of Torino, Via Santena 19, 10126, Torino, Italy
- Tiziana Sanavia
  - Computational Biomedicine Unit, Department of Medical Sciences, University of Torino, Via Santena 19, 10126, Torino, Italy
2
Wen G, Li L. Federated transfer learning with differential privacy for multi-omics survival analysis. Brief Bioinform 2025; 26:bbaf166. [PMID: 40230038 PMCID: PMC11996627 DOI: 10.1093/bib/bbaf166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2024] [Revised: 02/19/2025] [Accepted: 03/19/2025] [Indexed: 04/16/2025] Open
Abstract
Multi-omics data often suffer from the "big $p$, small $n$" problem, where the number of features is much larger than the sample size, making the integration of multi-omics data for survival analysis of a specific cancer particularly challenging. One common strategy is to share multi-omics data from other related cancers across multiple institutions and leverage the abundant data from these cancers to enhance survival predictions for the target cancer. However, due to data privacy and data-sharing regulations, it is challenging to aggregate multi-omics data of related cancers from multiple institutions into a centralized database to learn more accurate and robust models for the target cancer. To address this limitation, we propose a multi-omics survival prediction model with a self-attention mechanism (MOSAHit), trained within a federated transfer learning framework with differential privacy. This approach enables the learning of a more robust multi-omics survival prediction model for a local target cancer with limited training data by effectively leveraging multi-omics data of related cancers distributed across multiple institutions while preserving individual privacy. Results from comprehensive experiments on real-world datasets show that the proposed method effectively alleviates data insufficiency and significantly improves the generalization performance of the multi-omics survival prediction model for a target cancer while avoiding the direct sharing of multi-omics data for related cancers.
Affiliation(s)
- Gang Wen
  - School of Mathematics and Statistics, Xi’an Jiaotong University, 28 Xianning West, Xi’an 710049, Shaanxi, China
- Limin Li
  - School of Mathematics and Statistics, Xi’an Jiaotong University, 28 Xianning West, Xi’an 710049, Shaanxi, China
3
Qiu W, Dincer AB, Janizek JD, Celik S, Pittet MJ, Naxerova K, Lee SI. Deep profiling of gene expression across 18 human cancers. Nat Biomed Eng 2025; 9:333-355. [PMID: 39690287 DOI: 10.1038/s41551-024-01290-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/15/2022] [Accepted: 10/23/2024] [Indexed: 12/19/2024]
Abstract
Clinical and biological information in large datasets of gene expression across cancers could be tapped with unsupervised deep learning. However, difficulties associated with biological interpretability and methodological robustness have made this impractical. Here we describe an unsupervised deep-learning framework for the generation of low-dimensional latent spaces for gene-expression data from 50,211 transcriptomes across 18 human cancers. The framework, which we named DeepProfile, outperformed dimensionality-reduction methods with respect to biological interpretability and allowed us to unveil that genes that are universally important in defining latent spaces across cancer types control immune cell activation, whereas cancer-type-specific genes and pathways define molecular disease subtypes. By linking latent variables in DeepProfile to secondary characteristics of tumours, we discovered that mutation burden is closely associated with the expression of cell-cycle-related genes, and that the activity of biological pathways for DNA-mismatch repair and MHC class II antigen presentation are consistently associated with patient survival. We also found that tumour-associated macrophages are a source of survival-correlated MHC class II transcripts. Unsupervised learning can facilitate the discovery of biological insight from gene-expression data.
Affiliation(s)
- Wei Qiu
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Ayse B Dincer
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Joseph D Janizek
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
  - Medical Scientist Training Program, University of Washington, Seattle, WA, USA
- Safiye Celik
  - Recursion Pharmaceuticals, Salt Lake City, UT, USA
- Mikael J Pittet
  - Department of Pathology and Immunology, University of Geneva, Geneva, Switzerland
  - Ludwig Institute for Cancer Research, Lausanne Branch, Lausanne, Switzerland
  - Department of Oncology, Geneva University Hospitals, Geneva, Switzerland
  - AGORA Cancer Research Center and Swiss Cancer Center Leman, Lausanne, Switzerland
- Kamila Naxerova
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
  - Center for Systems Biology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Su-In Lee
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
4
Ghofrani A, Taherdoost H. Biomedical data analytics for better patient outcomes. Drug Discov Today 2025; 30:104280. [PMID: 39732322 DOI: 10.1016/j.drudis.2024.104280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Revised: 12/16/2024] [Accepted: 12/20/2024] [Indexed: 12/30/2024]
Abstract
Medical professionals today have access to immense amounts of data, which enables them to make decisions that enhance patient care and treatment efficacy. This innovative strategy can improve global health care by bridging the divide between clinical practice and medical research. This paper reviews biomedical developments aimed at improving patient outcomes by addressing three main questions regarding techniques, data sources and challenges. The review includes peer-reviewed articles from 2018 to 2023, found via systematic searches in PubMed, Scopus and Google Scholar. The results show diverse disease-specific applications. Challenges such as data quality and ethics are discussed, underscoring data analytics' potential for patient-focused health care. The review concludes that successful implementation requires addressing gaps, collaboration and innovation in biomedical science and data analytics.
Affiliation(s)
- Hamed Taherdoost
  - Hamta Business Corporation, Vancouver, Canada; University Canada West, Vancouver, Canada; Westcliff University, Irvine, USA; GUS Institute | Global University Systems, London, UK
5
Wen G, Li L. MMOSurv: meta-learning for few-shot survival analysis with multi-omics data. Bioinformatics 2024; 41:btae684. [PMID: 39563482 DOI: 10.1093/bioinformatics/btae684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Revised: 10/14/2024] [Accepted: 11/16/2024] [Indexed: 11/21/2024]
Abstract
MOTIVATION High-throughput techniques have produced a large amount of high-dimensional multi-omics data, which holds promise for predicting patient survival outcomes more accurately. Recent work has shown the superiority of multi-omics data in survival analysis. However, it remains challenging to integrate multi-omics data to solve the few-shot survival prediction problem, in which only a few training samples are available, especially for rare cancers.
RESULTS In this work, we propose a meta-learning framework for multi-omics few-shot survival analysis, namely MMOSurv, which learns an effective multi-omics survival prediction model from very few training samples of a specific cancer type by using meta-knowledge across tasks from relevant cancer types. By assuming a deep Cox survival model with multiple omics, MMOSurv first learns an adaptable parameter initialization for the multi-omics survival model from abundant data of relevant cancers, and then adapts the parameters quickly and efficiently for the target cancer task with very few training samples. Our experiments on eleven cancer types in The Cancer Genome Atlas datasets show that, compared to single-omics meta-learning methods, MMOSurv can better utilize the meta-information of similarities and relationships between different omics data from relevant cancer datasets to improve survival prediction of the target cancer with very few multi-omics training samples. Furthermore, MMOSurv achieves better prediction performance than other state-of-the-art strategies such as multitask learning and pretraining.
AVAILABILITY AND IMPLEMENTATION MMOSurv is freely available at https://github.com/LiminLi-xjtu/MMOSurv.
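The abstract mentions a deep Cox survival model without stating its loss. Such models are commonly trained by minimizing the negative Cox partial log-likelihood, so a minimal sketch of that standard loss in plain Python follows. This is an assumption for illustration, not MMOSurv's published implementation: the function name, argument names, and the lack of any tie correction are all choices made here.

```python
import math


def neg_cox_partial_log_likelihood(times, events, log_risks):
    """Negative Cox partial log-likelihood, averaged over observed events.

    times: observed follow-up times
    events: 1 if the event was observed, 0 if censored
    log_risks: network output per subject (log relative hazard)
    """
    total = 0.0
    n_events = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects contribute only through risk sets
        # Risk set: every subject still under observation at time t_i.
        risk_set = [log_risks[j] for j in range(len(times)) if times[j] >= times[i]]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(risk_set)
        log_denominator = m + math.log(sum(math.exp(r - m) for r in risk_set))
        total += log_risks[i] - log_denominator
        n_events += 1
    if n_events == 0:
        raise ValueError("no observed events")
    return -total / n_events
```

In a meta-learning setting of the kind described, a loss like this would be evaluated both on the source-cancer tasks that shape the initialization and on the few target-cancer samples used for adaptation.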
Affiliation(s)
- Gang Wen
  - School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
- Limin Li
  - School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, Shaanxi 710049, China
6
Spirina Menand E, De Vries-Brilland M, Tessier L, Dauvé J, Campone M, Verrièle V, Jrad N, Marion JM, Chauvet P, Passot C, Morel A. Learning to Train and to Explain a Deep Survival Model with Large-Scale Ovarian Cancer Transcriptomic Data. Biomedicines 2024; 12:2881. [PMID: 39767787 PMCID: PMC11673231 DOI: 10.3390/biomedicines12122881] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Revised: 12/05/2024] [Accepted: 12/06/2024] [Indexed: 01/11/2025] Open
Abstract
Background/Objectives: Ovarian cancer is a complex disease with poor outcomes that affects women worldwide. The lack of successful therapeutic options for this malignancy has led to the need to identify novel biomarkers for patient stratification. Here, we aim to develop outcome predictors based on gene expression data, as they may serve to identify categories of patients who are more likely to respond to certain therapies.
Methods: We used The Cancer Genome Atlas (TCGA) ovarian cancer transcriptomic data from 372 patients and approximately 16,600 genes to train and evaluate deep learning survival models. In addition, we collected an in-house validation dataset of 12 patients to assess the performance of the trained survival models for their direct use in clinical practice. Despite disappointing generalization capabilities, we demonstrated how our model can be interpreted to uncover biological processes associated with survival. We calculated the contributions of the input genes to the output of the best trained model and derived the corresponding molecular pathways.
Results: These pathways allowed us to stratify the TCGA patients into high-risk and low-risk groups (p-value 0.025). We validated the stratification ability of the identified pathways on the in-house dataset consisting of 12 patients (p-value 0.229) and on an external clinical and molecular dataset consisting of 274 patients (p-value 0.006).
Conclusions: Deep learning-based models for survival prediction with RNA-seq data could be used to detect and interpret the gene sets associated with survival in ovarian cancer patients and open a new avenue for future research.
Affiliation(s)
- Elena Spirina Menand
  - Laboratoire Angevin de Recherche en Ingénierie des Systèmes (EA7315), Université d’Angers, 49035 Angers, France
  - Unité de Génomique Fonctionnelle, Institut de Cancérologie de l’Ouest Nantes-Angers, 49055 Angers, France
- Manon De Vries-Brilland
  - Unité de Génomique Fonctionnelle, Institut de Cancérologie de l’Ouest Nantes-Angers, 49055 Angers, France
  - Département d’Oncologie Médicale, Institut de Cancérologie de l’Ouest Nantes-Angers, 49055 Angers, France
- Leslie Tessier
  - Unité de Génomique Fonctionnelle, Institut de Cancérologie de l’Ouest Nantes-Angers, 49055 Angers, France
- Jonathan Dauvé
  - Unité de Génomique Fonctionnelle, Institut de Cancérologie de l’Ouest Nantes-Angers, 49055 Angers, France
- Mario Campone
  - Institut de Cancérologie de l’Ouest Nantes-Angers, 49055 Angers, France
  - Univ Angers, Nantes Université, Inserm, CNRS, CRCI2NA, SFR ICAT, 49035 Angers, France
- Véronique Verrièle
  - Département d’Anatomie et de Cytologie Pathologiques, Institut de Cancérologie de l’Ouest Nantes-Angers, 49055 Angers, France
- Nisrine Jrad
  - Laboratoire Angevin de Recherche en Ingénierie des Systèmes (EA7315), Université d’Angers, 49035 Angers, France
- Jean-Marie Marion
  - Laboratoire Angevin de Recherche en Ingénierie des Systèmes (EA7315), Université d’Angers, 49035 Angers, France
- Pierre Chauvet
  - Laboratoire Angevin de Recherche en Ingénierie des Systèmes (EA7315), Université d’Angers, 49035 Angers, France
- Christophe Passot
  - Unité de Génomique Fonctionnelle, Institut de Cancérologie de l’Ouest Nantes-Angers, 49055 Angers, France
- Alain Morel
  - Unité de Génomique Fonctionnelle, Institut de Cancérologie de l’Ouest Nantes-Angers, 49055 Angers, France
  - Univ Angers, Nantes Université, Inserm, CNRS, CRCI2NA, SFR ICAT, 49035 Angers, France
7
Mesinovic M, Watkinson P, Zhu T. DySurv: dynamic deep learning model for survival analysis with conditional variational inference. J Am Med Inform Assoc 2024:ocae271. [PMID: 39569428 DOI: 10.1093/jamia/ocae271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Revised: 09/24/2024] [Accepted: 10/11/2024] [Indexed: 11/22/2024] Open
Abstract
OBJECTIVE Machine learning applications for longitudinal electronic health records often forecast the risk of events at fixed time points, whereas survival analysis achieves dynamic risk prediction by estimating time-to-event distributions. Here, we propose a novel conditional variational autoencoder-based method, DySurv, which uses a combination of static and longitudinal measurements from electronic health records to estimate the individual risk of death dynamically.
MATERIALS AND METHODS DySurv directly estimates the cumulative risk incidence function without making any parametric assumptions on the underlying stochastic process of the time-to-event. We evaluate DySurv on 6 time-to-event benchmark datasets in healthcare, as well as 2 real-world intensive care unit (ICU) electronic health records (EHR) datasets extracted from the eICU Collaborative Research Database (eICU) and the Medical Information Mart for Intensive Care database (MIMIC-IV).
RESULTS DySurv outperforms other existing statistical and deep learning approaches to time-to-event analysis across concordance and other metrics. It achieves time-dependent concordance of over 60% in the eICU case. It is also over 12% more accurate and 22% more sensitive than in-use ICU scores like Acute Physiology and Chronic Health Evaluation (APACHE) and Sequential Organ Failure Assessment (SOFA) scores. The predictive capacity of DySurv is consistent and the survival estimates remain disentangled across different datasets.
DISCUSSION Our interdisciplinary framework successfully incorporates deep learning, survival analysis, and intensive care to create a novel method for time-to-event prediction from longitudinal health records. We test our method on several held-out test sets from a variety of healthcare datasets and compare it to existing in-use clinical risk scoring benchmarks.
CONCLUSION While our method leverages non-parametric extensions to deep learning-guided estimations of the survival distribution, further deep learning paradigms could be explored.
Affiliation(s)
- Munib Mesinovic
  - Department of Engineering Science, University of Oxford, Oxford OX1 3PJ, United Kingdom
- Peter Watkinson
  - Critical Care Research Group, Nuffield Department of Clinical Neurosciences, University of Oxford, Oxford OX3 7JX, United Kingdom
- Tingting Zhu
  - Department of Engineering Science, University of Oxford, Oxford OX1 3PJ, United Kingdom
8
Qiu W, Dincer AB, Janizek JD, Celik S, Pittet M, Naxerova K, Lee SI. A deep profile of gene expression across 18 human cancers. bioRxiv 2024:2024.03.17.585426. [PMID: 38559197 PMCID: PMC10980029 DOI: 10.1101/2024.03.17.585426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Clinically and biologically valuable information may reside untapped in large cancer gene expression data sets. Deep unsupervised learning has the potential to extract this information with unprecedented efficacy but has thus far been hampered by a lack of biological interpretability and robustness. Here, we present DeepProfile, a comprehensive framework that addresses current challenges in applying unsupervised deep learning to gene expression profiles. We use DeepProfile to learn low-dimensional latent spaces for 18 human cancers from 50,211 transcriptomes. DeepProfile outperforms existing dimensionality reduction methods with respect to biological interpretability. Using DeepProfile interpretability methods, we show that genes that are universally important in defining the latent spaces across all cancer types control immune cell activation, while cancer type-specific genes and pathways define molecular disease subtypes. By linking DeepProfile latent variables to secondary tumor characteristics, we discover that tumor mutation burden is closely associated with the expression of cell cycle-related genes. DNA mismatch repair and MHC class II antigen presentation pathway expression, on the other hand, are consistently associated with patient survival. We validate these results through Kaplan-Meier analyses and nominate tumor-associated macrophages as an important source of survival-correlated MHC class II transcripts. Our results illustrate the power of unsupervised deep learning for discovery of cancer biology from existing gene expression data.
Affiliation(s)
- Wei Qiu
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
- Ayse B. Dincer
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
- Joseph D. Janizek
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
  - Medical Scientist Training Program, University of Washington, Seattle, WA
- Mikael Pittet
  - Department of Pathology and Immunology, University of Geneva, Switzerland
  - Ludwig Institute for Cancer Research, Lausanne Branch, Switzerland
- Kamila Naxerova
  - Department of Genetics, Harvard Medical School, Boston, MA, USA
  - Center for Systems Biology, Massachusetts General Hospital and Harvard Medical School, Boston, Massachusetts, USA
- Su-In Lee
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA
9
Sun A, Franzmann EJ, Chen Z, Cai X. Deep contrastive learning for predicting cancer prognosis using gene expression values. Brief Bioinform 2024; 25:bbae544. [PMID: 39471411 PMCID: PMC11521346 DOI: 10.1093/bib/bbae544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2024] [Revised: 09/09/2024] [Accepted: 10/18/2024] [Indexed: 11/01/2024] Open
Abstract
Recent advancements in image classification have demonstrated that contrastive learning (CL) can aid downstream learning tasks by acquiring good feature representations from a limited number of data samples. In this paper, we applied CL to tumor transcriptomes and clinical data to learn feature representations in a low-dimensional space. We then utilized these learned features to train a classifier to categorize tumors into a high- or low-risk group of recurrence. Using data from The Cancer Genome Atlas (TCGA), we demonstrated that CL can significantly improve classification accuracy. Specifically, our CL-based classifiers achieved an area under the receiver operating characteristic curve (AUC) greater than 0.8 for 14 types of cancer, and an AUC greater than 0.9 for 3 types of cancer. We also developed CL-based Cox (CLCox) models for predicting cancer prognosis. Our CLCox models trained with the TCGA data significantly outperformed existing methods in predicting the prognosis of the 19 types of cancer under consideration. The performance of CLCox models and CL-based classifiers trained with TCGA lung and prostate cancer data was validated using data from two independent cohorts. We also show that the CLCox model trained with the whole transcriptome significantly outperforms the Cox model trained with the 16 genes of Oncotype DX, which is in clinical use for breast cancer patients. The trained models and the Python code are publicly accessible and provide a valuable resource that will potentially find clinical applications for many types of cancer.
Affiliation(s)
- Anchen Sun
  - Department of Electrical and Computer Engineering, University of Miami, Miami, FL 33146, United States
- Elizabeth J Franzmann
  - Department of Otolaryngology, University of Miami, Miami, FL 33146, United States
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, FL 33146, United States
- Zhibin Chen
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, FL 33146, United States
  - Department of Microbiology and Immunology, University of Miami, Miami, FL 33146, United States
- Xiaodong Cai
  - Department of Electrical and Computer Engineering, University of Miami, Miami, FL 33146, United States
  - Sylvester Comprehensive Cancer Center, University of Miami, Miami, FL 33146, United States
10
Wang FA, Zhuang Z, Gao F, He R, Zhang S, Wang L, Liu J, Li Y. TMO-Net: an explainable pretrained multi-omics model for multi-task learning in oncology. Genome Biol 2024; 25:149. [PMID: 38845006 PMCID: PMC11157742 DOI: 10.1186/s13059-024-03293-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/14/2023] [Accepted: 05/29/2024] [Indexed: 06/09/2024] Open
Abstract
Cancer is a complex disease comprising systemic alterations at multiple scales. In this study, we develop the Tumor Multi-Omics pre-trained Network (TMO-Net), which integrates multi-omics pan-cancer datasets for model pre-training, facilitating cross-omics interactions and enabling joint representation learning and incomplete omics inference. This model enhances multi-omics sample representation and empowers various downstream oncology tasks with incomplete multi-omics datasets. By employing interpretable learning, we characterize the contributions of distinct omics features to clinical outcomes. The TMO-Net model serves as a versatile framework for cross-modal multi-omics learning in oncology, paving the way for tumor omics-specific foundation models.
Affiliation(s)
- Feng-Ao Wang
- Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, 310024, China
- Guangzhou National Laboratory, Guangzhou, 510005, China
| | - Zhenfeng Zhuang
- Department of Computer Science at the School of Informatics, Xiamen University, Xiamen, 361005, China
| | - Feng Gao
- Department of Colorectal Surgery, The Sixth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510655, China
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200433, China
- Biomedical Innovation Center, The Sixth Affiliated Hospital, Sun Yat-Sen University, Guangzhou, 510655, China
| | - Ruikun He
- BYHEALTH Institute of Nutrition & Health, Guangzhou, 510000, China
| | - Shaoting Zhang
- Shanghai Artificial Intelligence Laboratory, Shanghai, 200433, China
| | - Liansheng Wang
- Department of Computer Science at the School of Informatics, Xiamen University, Xiamen, 361005, China.
| | - Junwei Liu
- Guangzhou National Laboratory, Guangzhou, 510005, China.
| | - Yixue Li
- Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, 310024, China.
- Guangzhou National Laboratory, Guangzhou, 510005, China.
- Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, 200030, China.
- GZMU-GIBH Joint School of Life Sciences, The Guangdong-Hong Kong-Macau Joint Laboratory for Cell Fate Regulation and Diseases, Guangzhou Medical University, Guangzhou, 511436, China.
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China.
- Collaborative Innovation Center for Genetics and Development, Fudan University, Shanghai, 200433, China.
- Shanghai Institute for Biomedical and Pharmaceutical Technologies, Shanghai, 200032, China.
|
11
|
Feng X, Shu W, Li M, Li J, Xu J, He M. Pathogenomics for accurate diagnosis, treatment, prognosis of oncology: a cutting edge overview. J Transl Med 2024; 22:131. [PMID: 38310237 PMCID: PMC10837897 DOI: 10.1186/s12967-024-04915-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 01/20/2024] [Indexed: 02/05/2024]
Abstract
The capability to gather heterogeneous data, alongside the increasing power of artificial intelligence to examine it, is leading a revolution in harnessing multimodal data in the life sciences. However, most approaches are limited to unimodal data, leaving integrated approaches across modalities relatively underdeveloped in computational pathology. Pathogenomics, as an invasive method to integrate advanced molecular diagnostics from genomic data, morphological information from histopathological imaging, and codified clinical data, enables the discovery of new multimodal cancer biomarkers to propel the field of precision oncology in the coming decade. In this perspective, we offer our opinions on synthesizing complementary modalities of data with emerging multimodal artificial intelligence methods in pathogenomics. These include correlation between the pathological and genomic profiles of cancer, and fusion of the histology and genomics profiles of cancer. We also present challenges, opportunities, and avenues for future work.
Affiliation(s)
- Xiaobing Feng
- College of Electrical and Information Engineering, Hunan University, Changsha, China
- Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
| | - Wen Shu
- College of Electrical and Information Engineering, Hunan University, Changsha, China
- Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
| | - Mingya Li
- College of Electrical and Information Engineering, Hunan University, Changsha, China
- Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
| | - Junyu Li
- College of Electrical and Information Engineering, Hunan University, Changsha, China
- Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
| | - Junyao Xu
- Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China
| | - Min He
- College of Electrical and Information Engineering, Hunan University, Changsha, China.
- Zhejiang Cancer Hospital, Hangzhou Institute of Medicine (HIM), Chinese Academy of Sciences, Hangzhou, 310022, Zhejiang, China.
|
12
|
Kumar N, Skubleny D, Parkes M, Verma R, Davis S, Kumar L, Aissiou A, Greiner R. Learning Individual Survival Models from PanCancer Whole Transcriptome Data. Clin Cancer Res 2023; 29:3924-3936. [PMID: 37463063 PMCID: PMC10543961 DOI: 10.1158/1078-0432.ccr-22-3493] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/19/2022] [Revised: 02/11/2023] [Accepted: 07/11/2023] [Indexed: 07/20/2023]
Abstract
PURPOSE Personalized medicine attempts to predict survival time for each patient, based on their individual tumor molecular profile. We investigate whether our survival learner in combination with a dimension reduction method can produce useful survival estimates for a variety of patients with cancer. EXPERIMENTAL DESIGN This article provides a method that learns a model for predicting the survival time for individual patients with cancer from the PanCancer Atlas: given the (16,335 dimensional) gene expression profiles from 10,173 patients, each having one of 33 cancers, this method uses unsupervised nonnegative matrix factorization (NMF) to reexpress the gene expression data for each patient in terms of 100 learned NMF factors. It then feeds these 100 factors into the Multi-Task Logistic Regression (MTLR) learner to produce cancer-specific models for each of 20 cancers (with >50 uncensored instances); this produces "individual survival distributions" (ISD), which provide survival probabilities at each future time for each individual patient, which provides a patient's risk score and estimated survival time. RESULTS Our NMF-MTLR concordance indices outperformed the VAECox benchmark by 14.9% overall. We achieved optimal survival prediction using pan-cancer NMF in combination with cancer-specific MTLR models. We provide biological interpretation of the NMF model and clinical implications of ISDs for prognosis and therapeutic response prediction. CONCLUSIONS NMF-MTLR provides many benefits over other models: superior model discrimination, superior calibration, meaningful survival time estimates, and accurate probabilistic estimates of survival over time for each individual patient. We advocate for the adoption of these cancer survival models in clinical and research settings.
Affiliation(s)
- Neeraj Kumar
- Alberta Machine Intelligence Institute, Edmonton, Alberta, Canada
| | - Daniel Skubleny
- Department of Surgery, University of Alberta, Edmonton, Alberta, Canada
| | - Michael Parkes
- Computing Science Department, University of Alberta, Edmonton, Alberta, Canada
| | - Ruchika Verma
- Alberta Machine Intelligence Institute, Edmonton, Alberta, Canada
| | - Sacha Davis
- Alberta Machine Intelligence Institute, Edmonton, Alberta, Canada
| | - Luke Kumar
- Microsoft, Vancouver, British Columbia, Canada
| | - Russell Greiner
- Alberta Machine Intelligence Institute, Edmonton, Alberta, Canada
- Computing Science Department, University of Alberta, Edmonton, Alberta, Canada
|
13
|
Wen G, Li L. FGCNSurv: dually fused graph convolutional network for multi-omics survival prediction. Bioinformatics 2023; 39:btad472. [PMID: 37522887 PMCID: PMC10412406 DOI: 10.1093/bioinformatics/btad472] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2022] [Revised: 05/24/2023] [Accepted: 07/29/2023] [Indexed: 08/01/2023]
Abstract
MOTIVATION Survival analysis is an important tool for modeling time-to-event data, e.g. to predict the survival time of a patient after a cancer diagnosis or a certain treatment. While deep neural networks work well in standard prediction tasks, it is still unclear how to best utilize these deep models in survival analysis due to the difficulty of modeling right-censored data, especially for multi-omics data. Although existing methods have shown the advantage of multi-omics integration in survival prediction, it remains challenging to extract complementary information from different omics and improve the prediction accuracy. RESULTS In this work, we propose a novel multi-omics deep survival prediction approach based on a dually fused graph convolutional network (GCN), named FGCNSurv. Our FGCNSurv is a complete generative model from multi-omics data to survival outcome of patients, including feature fusion by a factorized bilinear model, graph fusion of multiple graphs, higher-level feature extraction by GCN and survival prediction by a Cox proportional hazard model. The factorized bilinear model makes it possible to capture cross-omics features and quantify complex relations from multi-omics data. By fusing single-omics features and the cross-omics features, and simultaneously fusing multiple graphs from different omics, GCN with the generated dually fused graph could capture higher-level features for computing the survival loss in the Cox-PH model. Comprehensive experimental results on real-world datasets with gene expression and microRNA expression data show that the proposed FGCNSurv method outperforms existing survival prediction methods, and imply its ability to extract complementary information for survival prediction from multi-omics data. AVAILABILITY AND IMPLEMENTATION The code is freely available at https://github.com/LiminLi-xjtu/FGCNSurv.
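The survival loss of a Cox proportional hazards model, mentioned in the abstract above, can be sketched in plain Python as the negative log partial likelihood. This is an illustration only and is not taken from the FGCNSurv repository: the function name `cox_neg_log_partial_likelihood`, the argument layout, and the simple treatment of tied times are assumptions. In a network like the one described, `risk_scores` would be the model output for each patient.

```python
import math


def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Average negative log partial likelihood of a Cox model (illustrative).

    risk_scores: predicted log-hazard ratios, one per patient
    times:       observed follow-up times
    events:      1 if the event was observed, 0 if right-censored
    """
    n_events = sum(events)
    if n_events == 0:
        return 0.0  # no observed events, so nothing contributes to the loss
    loss = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients only contribute through risk sets
        # Risk set: every patient still under observation at time t_i.
        risk_set = [risk_scores[j] for j, t_j in enumerate(times) if t_j >= t_i]
        # Numerically stable log-sum-exp over the risk set.
        m = max(risk_set)
        log_sum_exp = m + math.log(sum(math.exp(r - m) for r in risk_set))
        loss -= risk_scores[i] - log_sum_exp
    return loss / n_events
```

The loss is lower when patients who experience the event earlier receive higher risk scores; censored patients still enter the risk sets of earlier events, which is how right-censored data contribute.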
Affiliation(s)
- Gang Wen
- School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
| | - Limin Li
- School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, Shaanxi 710049, China
|
14
|
Lin SH, Chien CH, Chang KP, Lu MF, Chen YT, Chu YW. SaBrcada: Survival Intervals Prediction for Breast Cancer Patients by Dimension Raising and Age Stratification. Cancers (Basel) 2023; 15:3690. [PMID: 37509351 PMCID: PMC10378351 DOI: 10.3390/cancers15143690] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 07/03/2023] [Accepted: 07/18/2023] [Indexed: 07/30/2023]
Abstract
(1) Background: Breast cancer is the second leading cause of cancer death among women. The accurate prediction of survival intervals will help physicians make informed decisions about treatment strategies or the use of palliative care. (2) Methods: Gene expression is predictive and correlates to patient prognosis. To establish a reliable prediction tool, we collected a total of 1187 RNA-seq data points from breast cancer patients (median age 58 years) in Fragments Per Kilobase Million (FPKM) format from the TCGA database. Among them, we selected 144 patients with date of death information to establish the SaBrcada-AD dataset. We first normalized the SaBrcada-AD dataset to TPM to build the survival prediction model SaBrcada. After normalization and dimension raising, we used the differential gene expression data to test eight different deep learning architectures. Considering the effect of age on prognosis, we also performed a stratified random sampling test on all ages between the lower and upper quartiles of patient age, 48 and 69 years. (3) Results: Stratifying by age 61, the performance of SaBrcada built by GoogLeNet was improved to a highest accuracy of 0.798. We also built a free website tool to provide five predicted survival periods: within six months, six months to one year, one to three years, three to five years, or over five years, for clinician reference. (4) Conclusions: We built the prediction model, SaBrcada, and the website tool of the same name for breast cancer survival analysis. Through these models and tools, clinicians will be provided with survival interval information as a basis for formulating precision medicine.
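Two steps named in the abstract above can be illustrated in a few lines of Python: rescaling FPKM values to TPM within one sample, and mapping a survival time onto the five reported periods. This is a sketch, not the SaBrcada code. The function names, the month cut-offs (6, 12, 36, 60), and the choice to place boundary values in the earlier interval are assumptions.

```python
def fpkm_to_tpm(fpkm):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(fpkm)
    if total <= 0:
        raise ValueError("FPKM values must have a positive sum")
    return [value / total * 1e6 for value in fpkm]


# Upper bound in months (inclusive, by assumption) and label of each period.
SURVIVAL_PERIODS = [
    (6, "within six months"),
    (12, "six months to one year"),
    (36, "one to three years"),
    (60, "three to five years"),
]


def survival_period(months):
    """Return the label of the survival period containing `months`."""
    if months < 0:
        raise ValueError("survival time cannot be negative")
    for upper_bound, label in SURVIVAL_PERIODS:
        if months <= upper_bound:
            return label
    return "over five years"
```

Turning the continuous survival time into ordered classes lets a standard image-classification architecture be used as the predictor, at the cost of losing resolution inside each period.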
Affiliation(s)
- Shih-Huan Lin
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung 40227, Taiwan
| | - Ching-Hsuan Chien
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung 40227, Taiwan
| | - Kai-Po Chang
- Department of Pathology, China Medical University Hospital, Taichung 404327, Taiwan
| | - Min-Fang Lu
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung 40227, Taiwan
| | - Yu-Ting Chen
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung 40227, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung 40227, Taiwan
- Biotechnology Center, National Chung Hsing University, Taichung 40227, Taiwan
- Agricultural Biotechnology Center, National Chung Hsing University, Taichung 40227, Taiwan
| | - Yen-Wei Chu
- Ph.D. Program in Medical Biotechnology, National Chung Hsing University, Taichung 40227, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung 40227, Taiwan
- Biotechnology Center, National Chung Hsing University, Taichung 40227, Taiwan
- Agricultural Biotechnology Center, National Chung Hsing University, Taichung 40227, Taiwan
- Institute of Molecular Biology, National Chung Hsing University, Taichung 40227, Taiwan
- Smart Sustainable New Agriculture Research Center (SMARTer), Taichung 40227, Taiwan
|
15
|
Duan M, Wang Y, Zhao D, Liu H, Zhang G, Li K, Zhang H, Huang L, Zhang R, Zhou F. Orchestrating information across tissues via a novel multitask GAT framework to improve quantitative gene regulation relation modeling for survival analysis. Brief Bioinform 2023; 24:bbad238. [PMID: 37427963 DOI: 10.1093/bib/bbad238] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2023] [Revised: 05/29/2023] [Accepted: 06/08/2023] [Indexed: 07/11/2023]
Abstract
Survival analysis is critical to cancer prognosis estimation. High-throughput technologies facilitate the increase in the dimension of genic features, but the number of clinical samples in cohorts is relatively small due to various reasons, including difficulties in participant recruitment and high data-generation costs. Transcriptome is one of the most abundantly available OMIC (referring to the high-throughput data, including genomic, transcriptomic, proteomic and epigenomic) data types. This study introduced a multitask graph attention network (GAT) framework, DQSurv, for the survival analysis task. We first used a large dataset of healthy tissue samples to pretrain the GAT-based HealthModel for the quantitative measurement of the gene regulatory relations. The multitask survival analysis framework DQSurv used the idea of transfer learning to initialize the GAT model with the pretrained HealthModel and further fine-tuned this model using two tasks, i.e. the main task of survival analysis and the auxiliary task of gene expression prediction. This refined GAT was denoted as DiseaseModel. We fused the original transcriptomic features with the difference vector between the latent features encoded by the HealthModel and DiseaseModel for the final task of survival analysis. The proposed DQSurv model consistently outperformed the existing models for the survival analysis of 10 benchmark cancer types and an independent dataset. The ablation study also supported the necessity of the main modules. We released the code and the pretrained HealthModel to facilitate the feature encodings and survival analysis of transcriptome-based future studies, especially on small datasets. The model and the code are available at http://www.healthinformaticslab.org/supp/.
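The fusion step described in the abstract above can be pictured with a short Python sketch. The abstract does not say how the vectors are combined, so the concatenation used here, the sign of the difference (DiseaseModel minus HealthModel), and the function name `fuse_features` are all assumptions made for illustration, not the DQSurv implementation.

```python
def fuse_features(original, health_latent, disease_latent):
    """Append the latent difference vector to the original features.

    original:       transcriptomic features of one sample
    health_latent:  latent features from the pretrained (healthy-tissue) encoder
    disease_latent: latent features from the fine-tuned encoder
    """
    if len(health_latent) != len(disease_latent):
        raise ValueError("latent vectors must have the same length")
    # Assumed direction: how far fine-tuning moved the sample's encoding.
    difference = [d - h for h, d in zip(health_latent, disease_latent)]
    # Assumed fusion operator: plain concatenation.
    return list(original) + difference
```

Under these assumptions the survival model sees both the raw expression profile and a compact summary of how the disease-specific encoder departs from the healthy baseline.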
Affiliation(s)
- Meiyu Duan
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, China, 130012
| | - Yueying Wang
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, China, 130012
| | - Dong Zhao
- School of Biology and Engineering, and Engineering Research Center of Medical Biotechnology, Guizhou Medical University, Guiyang, Guizhou 550025, China
| | - Hongmei Liu
- School of Biology and Engineering, and Engineering Research Center of Medical Biotechnology, Guizhou Medical University, Guiyang, Guizhou 550025, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China, 130012
| | - Gongyou Zhang
- School of Biology and Engineering, and Engineering Research Center of Medical Biotechnology, Guizhou Medical University, Guiyang, Guizhou 550025, China
| | - Kewei Li
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, China, 130012
| | - Haotian Zhang
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, China, 130012
| | - Lan Huang
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, China, 130012
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China, 130012
| | - Ruochi Zhang
- School of Artificial Intelligence, Jilin University, Changchun, China, 130012
| | - Fengfeng Zhou
- College of Computer Science and Technology, Jilin University, Changchun, Jilin, China, 130012
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, Jilin, China, 130012
|
16
|
Lacan A, Sebag M, Hanczar B. GAN-based data augmentation for transcriptomics: survey and comparative assessment. Bioinformatics 2023; 39:i111-i120. [PMID: 37387181 DOI: 10.1093/bioinformatics/btad239] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 07/01/2023]
Abstract
MOTIVATION Transcriptomics data are becoming more accessible due to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting deep learning models' full predictive power for phenotype prediction. Artificially enhancing the training sets, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as generative adversarial networks (GANs) have been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes. RESULTS This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields accuracies of 94% and 70% for binary and tissue classification, respectively. In comparison, we achieved accuracies of 98% and 94% when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN return better augmentation performance and generated data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly. AVAILABILITY AND IMPLEMENTATION All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics.
Affiliation(s)
- Alice Lacan
- IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France
| | - Michèle Sebag
- TAU, CNRS-INRIA-LISN, University Paris-Saclay, Gif-sur-Yvette 91190, France
| | - Blaise Hanczar
- IBISC, University Paris-Saclay (Univ. Evry), Evry 91000, France
|
17
|
Hao Y, Jing XY, Sun Q. Cancer survival prediction by learning comprehensive deep feature representation for multiple types of genetic data. BMC Bioinformatics 2023; 24:267. [PMID: 37380946 DOI: 10.1186/s12859-023-05392-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Accepted: 06/19/2023] [Indexed: 06/30/2023]
Abstract
BACKGROUND Cancer is one of the leading causes of death around the world. Accurate prediction of survival time is important because it can help clinicians choose appropriate therapeutic schemes. Cancer data can be characterized by varied molecular features, clinical behaviors and morphological appearances. However, the cancer heterogeneity problem usually makes patient samples with different risks (i.e., short and long survival time) inseparable, thereby causing unsatisfactory prediction results. Clinical studies have shown that genetic data tend to contain more molecular biomarkers associated with cancer, and hence integrating multi-type genetic data may be a feasible way to deal with cancer heterogeneity. Although multi-type gene data have been used in existing work, how to learn more effective features for cancer survival prediction has not been well studied. RESULTS To this end, we propose a deep learning approach to reduce the negative impact of cancer heterogeneity and improve cancer survival prediction. It represents each type of genetic data as shared and specific features, which can capture the consensus and complementary information among all types of data. We collect mRNA expression, DNA methylation and microRNA expression data for four cancers to conduct experiments. CONCLUSIONS Experimental results demonstrate that our approach substantially outperforms established integrative methods and is effective for cancer survival prediction. AVAILABILITY AND IMPLEMENTATION https://github.com/githyr/ComprehensiveSurvival.
Affiliation(s)
- Yaru Hao
- School of Computer Science, Wuhan University, Wuhan, China.
| | - Xiao-Yuan Jing
- School of Computer Science, Wuhan University, Wuhan, China.
- School of Computer, Guangdong University of Petrochemical Technology, Maoming, China.
- State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China.
| | - Qixing Sun
- School of Computer Science, Wuhan University, Wuhan, China
|
18
|
Manganaro L, Bianco S, Bironzo P, Cipollini F, Colombi D, Corà D, Corti G, Doronzo G, Errico L, Falco P, Gandolfi L, Guerrera F, Monica V, Novello S, Papotti M, Parab S, Pittaro A, Primo L, Righi L, Sabbatini G, Sandri A, Vattakunnel S, Bussolino F, Scagliotti GV. Consensus clustering methodology to improve molecular stratification of non-small cell lung cancer. Sci Rep 2023; 13:7759. [PMID: 37173325 PMCID: PMC10182023 DOI: 10.1038/s41598-023-33954-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/14/2022] [Accepted: 04/21/2023] [Indexed: 05/15/2023]
Abstract
Recent advances in machine learning research, combined with the reduced sequencing costs enabled by modern next-generation sequencing, paved the way to the implementation of precision medicine through routine multi-omics molecular profiling of tumours. Thus, there is an emerging need for reliable models exploiting such data to retrieve clinically useful information. Here, we introduce an original consensus clustering approach, overcoming the intrinsic instability of common clustering methods based on molecular data. This approach is applied to the case of non-small cell lung cancer (NSCLC), integrating data of an ongoing clinical study (PROMOLE) with those made available by The Cancer Genome Atlas, to define a molecular-based stratification of the patients beyond, but still preserving, histological subtyping. The resulting subgroups are biologically characterized by well-defined mutational and gene-expression profiles and are significantly related to disease-free survival (DFS). Interestingly, it was observed that (1) cluster B, characterized by a short DFS, is enriched in KEAP1 and SKP2 mutations, which makes it an ideal candidate for further studies with inhibitors, and (2) over- and under-representation of inflammation and immune system pathways in squamous-cell carcinoma subgroups could potentially be exploited to stratify patients treated with immunotherapy.
Affiliation(s)
- L Manganaro
- aizoOn Technology Consulting S.R.L, Torino, Italy
| | - S Bianco
- aizoOn Technology Consulting S.R.L, Torino, Italy
| | - P Bironzo
- Medical Oncology Division at San Luigi Hospital, Department of Oncology, University of Torino, Orbassano (TO), Italy
| | - F Cipollini
- aizoOn Technology Consulting S.R.L, Torino, Italy
| | - D Colombi
- aizoOn Technology Consulting S.R.L, Torino, Italy
| | - D Corà
- Department of Translational Medicine, Piemonte Orientale University, Novara, Italy
- Center for Translational Research on Autoimmune and Allergic Diseases-CAAD, Novara, Italy
| | - G Corti
- Department of Oncology, University of Torino, 10060, Candiolo, Italy
- Candiolo Cancer Institute-IRCCS-FPO, 10060, Candiolo, Italy
| | - G Doronzo
- Department of Oncology, University of Torino, 10060, Candiolo, Italy
- Candiolo Cancer Institute-IRCCS-FPO, 10060, Candiolo, Italy
| | - L Errico
- Division of Thoracic Surgery at AOU San Luigi, Department of Oncology, University of Torino, Orbassano (TO), Italy
| | - P Falco
- aizoOn Technology Consulting S.R.L, Torino, Italy
| | - L Gandolfi
- Department of Oncology, University of Torino, 10060, Candiolo, Italy
- Candiolo Cancer Institute-IRCCS-FPO, 10060, Candiolo, Italy
| | - F Guerrera
- Division of Thoracic Surgery at AOU Città della Salute e della Scienza, Department of Surgical Sciences, University of Torino, Torino, Italy
| | - V Monica
- Department of Oncology, University of Torino, 10060, Candiolo, Italy
- Candiolo Cancer Institute-IRCCS-FPO, 10060, Candiolo, Italy
| | - S Novello
- Medical Oncology Division at San Luigi Hospital, Department of Oncology, University of Torino, Orbassano (TO), Italy
| | - M Papotti
- Pathology Division at AOU Città della Salute e della Scienza, Department of Oncology, University of Torino, Torino, Italy
| | - S Parab
- Department of Oncology, University of Torino, 10060, Candiolo, Italy
- Candiolo Cancer Institute-IRCCS-FPO, 10060, Candiolo, Italy
| | - A Pittaro
- Pathology Division at AOU Città della Salute e della Scienza, Department of Oncology, University of Torino, Torino, Italy
| | - L Primo
- Department of Oncology, University of Torino, 10060, Candiolo, Italy
- Candiolo Cancer Institute-IRCCS-FPO, 10060, Candiolo, Italy
| | - L Righi
- Pathology Division at AOU San Luigi, Department of Oncology, University of Torino, Orbassano (TO), Italy
| | - G Sabbatini
- aizoOn Technology Consulting S.R.L, Torino, Italy
| | - A Sandri
- Division of Thoracic Surgery at AOU San Luigi, Department of Oncology, University of Torino, Orbassano (TO), Italy
| | - F Bussolino
- Department of Oncology, University of Torino, 10060, Candiolo, Italy
- Candiolo Cancer Institute-IRCCS-FPO, 10060, Candiolo, Italy
| | - G V Scagliotti
- Medical Oncology Division at San Luigi Hospital, Department of Oncology, University of Torino, Orbassano (TO), Italy.
|
19
|
Hao Y, Jing XY, Sun Q. Joint learning sample similarity and correlation representation for cancer survival prediction. BMC Bioinformatics 2022; 23:553. [PMID: 36536289 PMCID: PMC9761951 DOI: 10.1186/s12859-022-05110-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Accepted: 12/13/2022] [Indexed: 12/23/2022]
Abstract
BACKGROUND As a highly aggressive disease, cancer has become a leading cause of death around the world. Accurate prediction of survival expectancy for cancer patients is important because it can help clinicians choose appropriate therapeutic schemes. With high-throughput sequencing technology becoming more and more cost-effective, integrating multi-type genome-wide data has been a promising method in cancer survival prediction. Based on these genomic data, some data-integration methods for cancer survival prediction have been proposed. However, existing methods fail to simultaneously utilize feature information and structure information of multi-type genome-wide data. RESULTS We propose a Multi-type Data Joint Learning (MDJL) approach based on multi-type genome-wide data, which comprehensively exploits feature information and structure information. Specifically, MDJL exploits correlation representations between any two data types by cross-correlation calculation for learning discriminant features. Moreover, based on the learned multiple correlation representations, MDJL constructs sample similarity matrices for capturing global and local structures across different data types. With the learned discriminant representation matrix and fused similarity matrix, MDJL constructs a graph convolutional network with Cox loss for survival prediction. CONCLUSIONS Experimental results demonstrate that our approach substantially outperforms established integrative methods and is effective for cancer survival prediction.
Affiliation(s)
- Yaru Hao
- School of Computer Science, Wuhan University, Wuhan, China
| | - Xiao-Yuan Jing
- School of Computer Science, Wuhan University, Wuhan, China
- Guangdong Provincial Key Laboratory of Petrochemical Equipment Fault Diagnosis and School of Computer, Guangdong University of Petrochemical Technology, Maoming, China
- State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
| | - Qixing Sun
- School of Computer Science, Wuhan University, Wuhan, China
|
20
|
Deepa P, Gunavathi C. A systematic review on machine learning and deep learning techniques in cancer survival prediction. PROGRESS IN BIOPHYSICS AND MOLECULAR BIOLOGY 2022; 174:62-71. [PMID: 35933043 DOI: 10.1016/j.pbiomolbio.2022.07.004] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 06/07/2022] [Revised: 07/13/2022] [Accepted: 07/19/2022] [Indexed: 06/15/2023]
Abstract
Cancer is a disease characterised by the unusual and uncontrollable growth of body cells. This usually happens asymptomatically and spreads to other parts of the body. The major problem in treating cancer is that its progress is not monitored once it is diagnosed. The progress, or prognosis, can be assessed through survival analysis. Survival analysis is the branch of statistics that deals with predicting the time until an event occurs. In the case of cancer prognosis, the event is the death of the patient, with time measured from the onset of the disease, or the recurrence of the disease after treatment. This study aims to survey the machine learning and deep learning models used to provide a prognosis for cancer patients.
Affiliation(s)
- Deepa P
- School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
| | - Gunavathi C
- School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, India.
|
21
|
Yin Q, Chen W, Zhang C, Wei Z. A convolutional neural network model for survival prediction based on prognosis-related cascaded Wx feature selection. Lab Invest 2022; 102:1064-1074. [PMID: 35810236 DOI: 10.1038/s41374-022-00801-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 02/17/2022] [Revised: 04/22/2022] [Accepted: 04/26/2022] [Indexed: 12/14/2022]
Abstract
Great advances in deep learning have provided effective solutions for prediction tasks in the biomedical field. However, accurate prognosis prediction using cancer genomics data remains challenging because of the severe overfitting caused by the curse of dimensionality inherent to high-throughput sequencing data. Moreover, survival analysis poses unique challenges, arising from the difficulty of utilizing censored samples whose events of interest are not observed. Convolutional neural network (CNN) models offer the opportunity to extract meaningful hierarchical features to characterize cancer subtypes and prognosis outcomes. On the other hand, feature selection can mitigate overfitting and reduce the computational burden of subsequent model training by screening out significant genes from redundant genes. To accomplish model simplification, we developed a concise and efficient survival analysis model, named the CNN-Cox model, which combines a special CNN framework with the prognosis-related feature selection method cascaded Wx and lowers the computational demand by using few trainable parameters. Experimental results show that the CNN-Cox model achieved consistently higher C-index values and better survival prediction performance than the existing state-of-the-art survival analysis methods across seven cancer type datasets in The Cancer Genome Atlas cohort: bladder carcinoma, head and neck squamous cell carcinoma, kidney renal cell carcinoma, brain low-grade glioma, lung adenocarcinoma (LUAD), lung squamous cell carcinoma, and skin cutaneous melanoma. As an illustration of model interpretation, we examined potential prognostic gene signatures of the LUAD dataset using the proposed CNN-Cox model. We conducted protein-protein interaction network analysis to identify potential prognostic genes and further analyzed the biological function of 13 hub genes, including ANLN, RACGAP1, KIF4A, KIF20A, KIF14, ASPM, CDK1, SPC25, NCAPG, MKI67, HJURP, EXO1, and HMMR, whose high expression is significantly associated with poor survival of LUAD patients. These findings confirmed that the CNN-Cox model is effective in extracting not only prognosis factors but also biologically meaningful gene features. The code is available on GitHub: https://github.com/wangwangCCChen/CNN-Cox .
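The C-index that this abstract uses to compare models can be illustrated with a minimal sketch. It is a generic pure-Python version of Harrell's concordance index for right-censored data, not code taken from the CNN-Cox repository; the function name `concordance_index` and the toy values in the usage note are assumptions made for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored survival data.

    times  : observed follow-up time of each sample
    events : 1 if the event was observed, 0 if the sample is censored
    risks  : predicted risk score (higher means an earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            # A censored sample cannot be the earlier member of a pair,
            # because its true event time is unknown.
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0   # ordering predicted correctly
                elif risks[i] == risks[j]:
                    concordant += 0.5   # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

For example, `concordance_index([2, 4, 6, 8], [1, 0, 1, 1], [0.9, 0.7, 0.8, 0.1])` compares four pairs and ranks all of them correctly. The censored sample still contributes as the later member of a pair, which is one simple way censored data enter the evaluation.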
Affiliation(s)
- Qingyan Yin
  - School of Science, Xi'an University of Architecture and Technology, Xi'an, Shaanxi, 710055, China
- Wangwang Chen
  - School of Science, Xi'an University of Architecture and Technology, Xi'an, Shaanxi, 710055, China
- Chunxia Zhang
  - School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, Shaanxi, 710049, China
- Zhi Wei
  - Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, 07102, USA
|
22
|
Treppner M, Binder H, Hess M. Interpretable generative deep learning: an illustration with single cell gene expression data. Hum Genet 2022; 141:1481-1498. [PMID: 34988661 PMCID: PMC9360114 DOI: 10.1007/s00439-021-02417-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/14/2021] [Accepted: 08/06/2021] [Indexed: 11/26/2022]
Abstract
Deep generative models can learn the underlying structure, such as pathways or gene programs, from omics data. We provide an introduction as well as an overview of such techniques, specifically illustrating their use with single-cell gene expression data. For example, the low-dimensional latent representations offered by various approaches, such as variational auto-encoders, are useful for better understanding the relations between observed gene expression and experimental factors or phenotypes. Furthermore, by providing a generative model for the latent and observed variables, deep generative models can generate synthetic observations, which allow us to assess the uncertainty in the learned representations. While deep generative models are useful for learning the structure of high-dimensional omics data by efficiently capturing non-linear dependencies between genes, they are sometimes difficult to interpret because of their neural network building blocks. More precisely, it is difficult to understand the relationship between learned latent variables and observed variables, e.g., gene transcript abundances and external phenotypes. Therefore, we also illustrate current approaches that allow us to infer the relationship between learned latent variables and observed variables as well as external phenotypes, thereby rendering deep learning approaches more interpretable. In an application with single-cell gene expression data, we demonstrate the utility of the discussed methods.
Affiliation(s)
- Martin Treppner
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Stefan-Meier-Str. 26, Freiburg, 79104, Germany
- Harald Binder
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, 79104, Germany
- Moritz Hess
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, 79104, Germany
|
23
|
Hanczar B, Bourgeais V, Zehraoui F. Assessment of deep learning and transfer learning for cancer prediction based on gene expression data. BMC Bioinformatics 2022; 23:262. [PMID: 35786378 PMCID: PMC9250744 DOI: 10.1186/s12859-022-04807-7] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 02/11/2022] [Accepted: 06/15/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND Machine learning is now a standard tool for cancer prediction based on gene expression data. However, deep learning is still new for this task, and there is no clear consensus about its performance and utility. Few experimental works have evaluated deep neural networks and compared them with state-of-the-art machine learning. Moreover, their conclusions are not consistent. RESULTS We extensively evaluate the deep learning approach on 22 cancer prediction tasks based on gene expression data. We measure the impact of the main hyper-parameters and compare the performance of neural networks with the state of the art. We also investigate the effectiveness of several transfer learning schemes in different experimental setups. CONCLUSION Based on our experiments, we provide several recommendations to optimize the construction and training of a neural network model. We show that neural networks outperform the state-of-the-art methods only for very large training sets. For a small training set, we show that transfer learning is possible and may strongly improve the model performance in some cases.
Affiliation(s)
- Blaise Hanczar
  - IBISC, Université Paris-Saclay (Univ. Evry), 23 boulevard de France, 91034, Evry, France
- Victoria Bourgeais
  - IBISC, Université Paris-Saclay (Univ. Evry), 23 boulevard de France, 91034, Evry, France
- Farida Zehraoui
  - IBISC, Université Paris-Saclay (Univ. Evry), 23 boulevard de France, 91034, Evry, France
|
24
|
A Novel Attention-Mechanism Based Cox Survival Model by Exploiting Pan-Cancer Empirical Genomic Information. Cells 2022; 11:1421. [PMID: 35563727 PMCID: PMC9100007 DOI: 10.3390/cells11091421] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/22/2022] [Revised: 04/15/2022] [Accepted: 04/19/2022] [Indexed: 01/27/2023]
Abstract
Cancer prognosis is an essential goal for early diagnosis, biomarker selection, and medical therapy. In the past decade, deep learning has successfully solved a variety of biomedical problems. However, because of the high dimensionality of human cancer transcriptome data and the small number of training samples, there is still no mature deep learning-based survival analysis model that fully addresses overfitting during training while providing accurate prognosis. Given these problems, we introduced a novel framework called SAVAE-Cox for survival analysis of high-dimensional transcriptome data. This model adopts a novel attention mechanism and takes full advantage of the adversarial transfer learning strategy. We trained the model on 16 types of TCGA cancer RNA-seq data sets. Experiments show that our model outperformed state-of-the-art survival analysis models such as the Cox proportional hazard model (Cox-ph), Cox-lasso, Cox-ridge, Cox-nnet, and VAECox on the concordance index. In addition, we carried out feature analysis experiments. Based on the experimental results, we concluded that our model is helpful for revealing cancer-related genes and biological functions.
|
25
|
Ebbehoj A, Thunbo MØ, Andersen OE, Glindtvad MV, Hulman A. Transfer learning for non-image data in clinical research: A scoping review. PLOS DIGITAL HEALTH 2022; 1:e0000014. [PMID: 36812540 PMCID: PMC9931256 DOI: 10.1371/journal.pdig.0000014] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 11/05/2021] [Accepted: 12/15/2021] [Indexed: 01/14/2023]
Abstract
BACKGROUND Transfer learning is a form of machine learning where a model pre-trained on a specific task is reused as a starting point and tailored to another task in a different dataset. While transfer learning has garnered considerable attention in medical image analysis, its use for clinical non-image data is not well studied. Therefore, the objective of this scoping review was to explore the use of transfer learning for non-image data in the clinical literature. METHODS AND FINDINGS We systematically searched medical databases (PubMed, EMBASE, CINAHL) for peer-reviewed clinical studies that used transfer learning on human non-image data. We included 83 studies in the review. More than half of the studies (63%) were published within 12 months of the search. Transfer learning was most often applied to time series data (61%), followed by tabular data (18%), audio (12%) and text (8%). Thirty-three (40%) studies applied an image-based model to non-image data after transforming data into images (e.g. spectrograms). Twenty-nine (35%) studies did not have any authors with a health-related affiliation. Many studies used publicly available datasets (66%) and models (49%), but fewer shared their code (27%). CONCLUSIONS In this scoping review, we have described current trends in the use of transfer learning for non-image data in the clinical literature. We found that the use of transfer learning has grown rapidly within the last few years. We have identified studies and demonstrated the potential of transfer learning in clinical research in a wide range of medical specialties. More interdisciplinary collaborations and the wider adoption of reproducible research principles are needed to increase the impact of transfer learning in clinical research.
Affiliation(s)
- Andreas Ebbehoj
  - Department of Endocrinology and Internal Medicine, Aarhus University Hospital, Denmark
  - Department of Clinical Medicine, Aarhus University, Denmark
- Adam Hulman
  - Steno Diabetes Center Aarhus, Aarhus University Hospital, Denmark
|
26
|
Ma C, Wu M, Ma S. Analysis of cancer omics data: a selective review of statistical techniques. Brief Bioinform 2022; 23:6510158. [PMID: 35039832 DOI: 10.1093/bib/bbab585] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/20/2021] [Revised: 12/19/2021] [Accepted: 12/20/2021] [Indexed: 11/13/2022]
Abstract
Cancer is an omics disease. The development in high-throughput profiling has fundamentally changed cancer research and clinical practice. Compared with clinical, demographic and environmental data, the analysis of omics data, which has higher dimensionality, weaker signals and more complex distributional properties, is much more challenging. Developments in the literature are often 'scattered', with individual studies focused on one or a few closely related methods. The goal of this review is to assist cancer researchers with limited statistical expertise in establishing the 'overall framework' of cancer omics data analysis. To facilitate understanding, we mainly focus on intuition, concepts and key steps, and refer readers to the original publications for mathematical details. This review broadly covers unsupervised and supervised analysis, as well as individual-gene-based, gene-set-based and gene-network-based analysis. We also briefly discuss 'special topics' including interaction analysis, multi-datasets analysis and multi-omics analysis.
Affiliation(s)
- Chenjin Ma
  - College of Statistics and Data Science, Faculty of Science, Beijing University of Technology, Beijing, China
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
|
27
|
A gradient tree boosting and network propagation derived pan-cancer survival network of the tumor microenvironment. iScience 2022; 25:103617. [PMID: 35106465 PMCID: PMC8786644 DOI: 10.1016/j.isci.2021.103617] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/09/2021] [Revised: 11/12/2021] [Accepted: 12/09/2021] [Indexed: 12/22/2022]
Abstract
Predicting cancer survival from molecular data is an important aspect of biomedical research because it allows quantifying patient risks and thus individualizing therapy. We introduce XGBoost tree ensemble learning to predict survival from transcriptome data of 8,024 patients from 25 different cancer types and show highly competitive performance with state-of-the-art methods. To further improve the plausibility of the machine learning approach, we conducted two additional steps. In the first step, we applied pan-cancer training and showed that it substantially improves prognosis prediction compared with cancer subtype-specific training. In the second step, we applied network propagation and inferred a pan-cancer survival network consisting of 103 genes. This network highlights cross-cohort features and is predictive for the tumor microenvironment and immune status of the patients. Our work demonstrates that pan-cancer learning combined with network propagation generalizes over multiple cancer types and identifies biologically plausible features that can serve as biomarkers for monitoring cancer survival. Highlights: highly performing cancer survival prediction with XGBoost; pan-cancer training outperforms single-cohort training; combined approach consisting of machine learning and network propagation; tumor microenvironment is most strongly involved in cancer survival prediction.
|
28
|
Malenová G, Rowson D, Boeva V. Exploring Pathway-Based Group Lasso for Cancer Survival Analysis: A Special Case of Multi-Task Learning. Front Genet 2021; 12:771301. [PMID: 34912376 PMCID: PMC8667553 DOI: 10.3389/fgene.2021.771301] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2021] [Accepted: 10/27/2021] [Indexed: 11/22/2022]
Abstract
Motivation: Cox proportional hazards models are widely used in the study of cancer survival. However, these models often face challenges such as the large number of features and small sample sizes of cancer data sets. While this issue can be partially solved by applying regularization techniques such as lasso, the models still suffer from unsatisfactory predictive power and low stability. Methods: Here, we investigated two methods to improve survival models. Firstly, we leveraged the biological knowledge that groups of genes act together in pathways and regularized both at the group and gene level using a latent group lasso penalty term. Secondly, we designed and applied a multi-task learning penalty that allowed us to leverage the relationship between survival models for different cancers. Results: We observed modest improvements over the simple lasso model with the inclusion of the latent group lasso penalty for six of the 16 cancer types tested. The addition of a multi-task penalty, which penalized coefficients in pairs of cancers for diverging too greatly, significantly improved accuracy for a single cancer, lung squamous cell carcinoma, while having minimal effect on other cancer types. Conclusion: While the use of pathway information and multi-tasking shows some promise, these methods do not provide a substantial improvement when compared with standard methods.
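The idea of regularizing at both the gene and the pathway level can be sketched with a simplified penalty. Note the swap: the sketch below is the sparse group lasso for non-overlapping groups, not the latent group lasso the authors used (which handles genes shared by several pathways). The function name, the pathway labels and the weights `lam_gene` and `lam_group` are all hypothetical.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam_gene, lam_group):
    """Gene-level L1 term plus a group-level L2 term.

    beta   : list of regression coefficients, one per gene
    groups : dict mapping a pathway name to the gene indices it contains
             (assumed non-overlapping in this simplified sketch)
    """
    # Gene level: the plain lasso term shrinks individual coefficients.
    l1_term = sum(abs(b) for b in beta)
    # Group level: an unsquared L2 norm per pathway can zero out whole groups.
    group_term = 0.0
    for indices in groups.values():
        norm = math.sqrt(sum(beta[k] ** 2 for k in indices))
        group_term += math.sqrt(len(indices)) * norm  # weight by group size
    return lam_gene * l1_term + lam_group * group_term
```

In a penalized Cox model this quantity would be added to the negative partial log-likelihood. Setting `lam_group` to zero recovers the simple lasso baseline that the abstract compares against.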
Affiliation(s)
- Gabriela Malenová
  - Department of Computer Science, Institute for Machine Learning, ETH Zurich, Zürich, Switzerland
- Daniel Rowson
  - Department of Computer Science, Institute for Machine Learning, ETH Zurich, Zürich, Switzerland
- Valentina Boeva
  - Department of Computer Science, Institute for Machine Learning, ETH Zurich, Zürich, Switzerland
  - Swiss Institute for Bioinformatics (SIB), Zürich, Switzerland
  - Institut Cochin, Inserm U1016, CNRS UMR 8104, Université de Paris UMR-S1016, Paris, France
|
29
|
Lai X, Zhou J, Wessely A, Heppt M, Maier A, Berking C, Vera J, Zhang L. A disease network-based deep learning approach for characterizing melanoma. Int J Cancer 2021; 150:1029-1044. [PMID: 34716589 DOI: 10.1002/ijc.33860] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/26/2021] [Revised: 10/08/2021] [Accepted: 10/19/2021] [Indexed: 12/12/2022]
Abstract
Multiple types of genomic variations are present in cutaneous melanoma, and some of the genomic features may have an impact on the prognosis of the disease. Access to genomics data via public repositories such as The Cancer Genome Atlas (TCGA) allows for a better understanding of melanoma at the molecular level, making it possible to characterize the substantial heterogeneity in melanoma patients. Here, we proposed an approach that integrates genomics data, a disease network, and a deep learning model to classify melanoma patients for prognosis, assess the impact of genomic features on the classification and interpret the impactful features. We integrated genomics data into a melanoma network and applied an autoencoder model to identify subgroups in TCGA melanoma patients. The model utilizes communities identified in the network to effectively reduce the dimensionality of genomics data into a patient score profile. Based on the score profile, we identified three patient subtypes that show different survival times. Furthermore, we quantified and ranked the impact of genomic features on the patient score profile using a machine-learning technique. Follow-up analysis of the top-ranking features provided a biological interpretation of them at both pathway and molecular levels, such as their mutation and interactome profiles in melanoma and their involvement in pathways associated with signal transduction, the immune system and the cell cycle. Taken together, we demonstrated the ability of the approach to identify disease subgroups using a deep learning model that captures the most relevant information of genomics data in the melanoma network.
Affiliation(s)
- Xin Lai
  - Department of Dermatology, Universitätsklinikum Erlangen and Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
  - Deutsches Zentrum Immuntherapie, Erlangen, Germany
  - Comprehensive Cancer Center Erlangen, Erlangen, Germany
- Jinfei Zhou
  - College of Computer Science, Sichuan University, Chengdu, China
- Anja Wessely
  - Department of Dermatology, Universitätsklinikum Erlangen and Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
  - Deutsches Zentrum Immuntherapie, Erlangen, Germany
  - Comprehensive Cancer Center Erlangen, Erlangen, Germany
- Markus Heppt
  - Department of Dermatology, Universitätsklinikum Erlangen and Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
  - Deutsches Zentrum Immuntherapie, Erlangen, Germany
  - Comprehensive Cancer Center Erlangen, Erlangen, Germany
- Andreas Maier
  - Pattern Recognition Lab, Department of Computer Science, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
- Carola Berking
  - Department of Dermatology, Universitätsklinikum Erlangen and Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
  - Deutsches Zentrum Immuntherapie, Erlangen, Germany
  - Comprehensive Cancer Center Erlangen, Erlangen, Germany
- Julio Vera
  - Department of Dermatology, Universitätsklinikum Erlangen and Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
  - Deutsches Zentrum Immuntherapie, Erlangen, Germany
  - Comprehensive Cancer Center Erlangen, Erlangen, Germany
- Le Zhang
  - College of Computer Science, Sichuan University, Chengdu, China
|
30
|
Kuruc F, Binder H, Hess M. Stratified neural networks in a time-to-event setting. Brief Bioinform 2021; 23:6377517. [PMID: 34585236 DOI: 10.1093/bib/bbab392] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/03/2021] [Revised: 08/25/2021] [Accepted: 08/30/2021] [Indexed: 12/28/2022]
Abstract
Deep neural networks are frequently employed to predict survival conditional on omics-type biomarkers, e.g., by employing the partial likelihood of the Cox proportional hazards model as the loss function. Because the number of observations in clinical studies is generally limited, combining different data sets has been proposed to improve learning of network parameters. However, if baseline hazards differ between the studies, the assumptions of the Cox proportional hazards model are violated. Based on high-dimensional transcriptome profiles from different tumor entities, we demonstrate how using a stratified partial likelihood as the loss function accounts for the different baseline hazards in a deep learning framework. Additionally, we compare the partial likelihood with the ranking loss, which is frequently employed as a loss function in machine learning approaches because of its seeming simplicity. Using RNA-seq data from The Cancer Genome Atlas (TCGA), we show that the use of stratified loss functions leads to overall better discriminatory power and lower prediction error compared with their non-stratified counterparts. We investigate which genes are identified as having the greatest marginal impact on the prediction of survival when using different loss functions. We find that while similar genes are identified, known prognostic genes in particular receive higher importance from stratified loss functions. Taken together, pooling data from different sources for improved parameter learning of deep neural networks benefits largely from employing stratified loss functions that consider potentially varying baseline hazards. For easy application, we provide PyTorch code for stratified loss functions and an explanatory Jupyter notebook in a GitHub repository.
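The stratified partial likelihood described in this abstract can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the authors' PyTorch code: the function names are hypothetical, ties are handled in the simple Breslow-like way of keeping all tied samples in the risk set, and `scores` stands for the network's predicted log-risk outputs.

```python
import math
from collections import defaultdict


def neg_partial_log_likelihood(times, events, scores):
    """Negative Cox partial log-likelihood for one group of samples."""
    loss = 0.0
    for i, (t_i, observed) in enumerate(zip(times, events)):
        if not observed:
            continue  # censored samples only appear inside risk sets
        # Risk set: every sample still under observation at time t_i.
        risk_sum = sum(math.exp(s) for t, s in zip(times, scores) if t >= t_i)
        loss -= scores[i] - math.log(risk_sum)
    return loss


def stratified_neg_partial_log_likelihood(times, events, scores, strata):
    """Sum of per-stratum losses, so risk sets never mix strata.

    Each stratum (e.g. a study or tumor entity) keeps its own baseline
    hazard, while the scores come from one shared model.
    """
    members = defaultdict(list)
    for index, label in enumerate(strata):
        members[label].append(index)
    total = 0.0
    for indices in members.values():
        total += neg_partial_log_likelihood(
            [times[k] for k in indices],
            [events[k] for k in indices],
            [scores[k] for k in indices],
        )
    return total
```

With four uncensored samples, all scores equal to zero and two strata of two samples each, the stratified loss is log 4, whereas the pooled loss is log 24, because pooling forces samples from different strata into the same risk sets.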
Affiliation(s)
- Fabrizio Kuruc
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany
- Harald Binder
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany
- Moritz Hess
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany
|