1
Jilani M, Degras D, Haspel N. Elucidating Cancer Subtypes by Using the Relationship between DNA Methylation and Gene Expression. Genes (Basel) 2024; 15:631. [PMID: 38790260 PMCID: PMC11121157 DOI: 10.3390/genes15050631] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Revised: 05/10/2024] [Accepted: 05/14/2024] [Indexed: 05/26/2024] Open
Abstract
Advancements in the field of next generation sequencing (NGS) have generated vast amounts of data for the same set of subjects. The challenge that arises is how to combine and reconcile results from different omics studies, such as epigenome and transcriptome, to improve the classification of disease subtypes. In this study, we introduce sCClust (sparse canonical correlation analysis with clustering), a technique to combine high-dimensional omics data using sparse canonical correlation analysis (sCCA), such that the correlation between datasets is maximized. This stage is followed by clustering the integrated data in a lower-dimensional space. We apply sCClust to gene expression and DNA methylation data for three cancer genomics datasets from The Cancer Genome Atlas (TCGA) to distinguish between underlying subtypes. We evaluate the identified subtypes using Kaplan-Meier plots and hazard ratio analysis on three types of cancer: GBM (glioblastoma multiforme), lung cancer, and colon cancer. Comparison with subtypes identified by both single- and multi-omics studies implies improved clinical association. We also perform pathway over-representation analysis in order to identify up-regulated and down-regulated genes as tentative drug targets. The main goal of the paper is twofold: the integration of epigenomic and transcriptomic datasets followed by elucidating subtypes in the latent space. The significance of this study lies in the enhanced categorization of cancer data, which is crucial to precision medicine.
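To make the integration step in this abstract more concrete, here is a minimal sketch of a rank-one sparse CCA using alternating soft-thresholded power iterations. It is not the authors' sCClust implementation: the function names, the relative threshold `frac`, and the toy matrices `expr` and `meth` are all assumptions chosen for illustration, and the within-dataset covariances are treated as identity, a common simplification.

```python
import math

def soft_threshold(vec, frac):
    """Shrink entries toward zero by frac * max|entry| (an assumed, scale-free L1 step)."""
    t = frac * max(abs(a) for a in vec)
    return [math.copysign(max(abs(a) - t, 0.0), a) for a in vec]

def normalize(vec):
    """Scale a vector to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(a * a for a in vec))
    return [a / norm for a in vec] if norm > 0 else vec

def center_columns(mat):
    """Subtract the column mean from every column."""
    n = len(mat)
    means = [sum(col) / n for col in zip(*mat)]
    return [[a - m for a, m in zip(row, means)] for row in mat]

def sparse_cca_rank1(x, y, frac=0.3, iters=20):
    """Return sparse weight vectors (u, v) so that x @ u and y @ v co-vary strongly."""
    x, y = center_columns(x), center_columns(y)
    p, q = len(x[0]), len(y[0])
    # Cross-product matrix C = X^T Y, of size p x q.
    c = [[sum(xr[i] * yr[j] for xr, yr in zip(x, y)) for j in range(q)]
         for i in range(p)]
    u = [0.0] * p
    v = normalize([1.0] * q)  # uniform start; assumes C @ v is not all zeros
    for _ in range(iters):
        u = normalize(soft_threshold(
            [sum(c[i][j] * v[j] for j in range(q)) for i in range(p)], frac))
        v = normalize(soft_threshold(
            [sum(c[i][j] * u[i] for i in range(p)) for j in range(q)], frac))
    return u, v

# Hypothetical toy data: 5 samples; column 0 carries a shared signal, column 1 is noise.
expr = [[-2.0, 0.1], [-1.0, -0.2], [0.0, 0.2], [1.0, -0.2], [2.0, 0.1]]
meth = [[-1.0, 0.1], [-0.5, -0.1], [0.0, 0.0], [0.5, -0.1], [1.0, 0.1]]
u, v = sparse_cca_rank1(expr, meth)
print(u, v)  # the noise columns receive zero weight in both vectors
```

In a pipeline of the kind the abstract describes, the samples would then be projected onto a few such weight vectors and a clustering algorithm run on those low-dimensional scores.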
Affiliation(s)
- Muneeba Jilani
  - Department of Computer Science, University of Massachusetts Boston, Boston, MA 02125, USA
- David Degras
  - Department of Mathematics, University of Massachusetts Boston, Boston, MA 02125, USA
- Nurit Haspel
  - Department of Computer Science, University of Massachusetts Boston, Boston, MA 02125, USA
2
Cao H, Jia C, Li Z, Yang H, Fang R, Zhang Y, Cui Y. wMKL: multi-omics data integration enables novel cancer subtype identification via weight-boosted multi-kernel learning. Br J Cancer 2024; 130:1001-1012. [PMID: 38278975 PMCID: PMC10951206 DOI: 10.1038/s41416-024-02587-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Revised: 01/09/2024] [Accepted: 01/15/2024] [Indexed: 01/28/2024] Open
Abstract
BACKGROUND Cancer is a heterogeneous disease driven by complex molecular alterations. Cancer subtypes determined from multi-omics data can provide novel insight into personalised precision treatment. It is recognised that incorporating prior weight knowledge into multi-omics data integration can improve disease subtyping. METHODS We develop a weighted method, termed weight-boosted Multi-Kernel Learning (wMKL) which incorporates heterogeneous data types as well as flexible weight functions, to boost subtype identification. Given a series of weight functions, we propose an omnibus combination strategy to integrate different weight-related P-values to improve subtyping precision. RESULTS wMKL models each data type with multiple kernel choices, thus alleviating the sensitivity and robustness issue due to selecting kernel parameters. Furthermore, wMKL integrates different data types by learning weights of different kernels derived from each data type, recognising the heterogeneous contribution of different data types to the final subtyping performance. The proposed wMKL outperforms existing weighted and non-weighted methods. The utility and advantage of wMKL are illustrated through extensive simulations and applications to two TCGA datasets. Novel subtypes are identified followed by extensive downstream bioinformatics analysis to understand the molecular mechanisms differentiating different subtypes. CONCLUSIONS The proposed wMKL method provides a novel strategy for disease subtyping. The wMKL is freely available at https://github.com/biostatcao/wMKL .
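The idea of modelling each data type with several kernels and combining them with weights can be sketched as follows. This is only an illustration, not the wMKL code from the linked repository: the Gaussian kernel, the bandwidths, the fixed `weights`, and the toy arrays `expr` and `meth` are assumptions, whereas the method in the abstract learns the kernel weights from the data.

```python
import math

def gaussian_kernel(data, sigma):
    """Gaussian (RBF) similarity matrix between the rows of `data`."""
    n = len(data)
    return [[math.exp(-sum((a - b) ** 2 for a, b in zip(data[i], data[j]))
                      / (2 * sigma ** 2))
             for j in range(n)] for i in range(n)]

def combine_kernels(kernels, weights):
    """Weighted average of same-sized kernel matrices (weights are rescaled to sum to 1)."""
    total = sum(weights)
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels)) / total
             for j in range(n)] for i in range(n)]

# Hypothetical toy data: 4 patients, two data types, one feature each.
expr = [[0.0], [0.1], [5.0], [5.2]]
meth = [[1.0], [1.1], [3.0], [3.1]]

# Several bandwidths per data type, so no single kernel parameter has to be chosen.
kernels = [gaussian_kernel(expr, s) for s in (0.5, 2.0)]
kernels += [gaussian_kernel(meth, s) for s in (0.5, 2.0)]

# Fixed illustrative weights; a learned weighting would replace this list.
weights = [0.3, 0.3, 0.2, 0.2]
combined = combine_kernels(kernels, weights)
print(combined[0][1] > combined[0][2])  # patients 0 and 1 are more similar than 0 and 2
```

The combined matrix could then be passed to any kernel-based clustering method to obtain subtype labels.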
Affiliation(s)
- Hongyan Cao
  - Division of Health Statistics, Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Shanxi Medical University, 030001, Taiyuan, Shanxi, China
  - MOE Key Laboratory of Coal Environmental Pathogenicity and Prevention, Shanxi Medical University, 030001, Taiyuan, Shanxi, China
  - Division of Mathematics, School of Basic Medical Science, Shanxi Medical University, 030001, Taiyuan, Shanxi, China
- Congcong Jia
  - Division of Health Statistics, Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Shanxi Medical University, 030001, Taiyuan, Shanxi, China
- Zhi Li
  - Department of Hematology, Taiyuan Central Hospital of Shanxi Medical University, 030001, Taiyuan, Shanxi, China
- Haitao Yang
  - Division of Health Statistics, School of Public Health, Hebei Medical University, 050017, Shijiazhuang, China
- Ruiling Fang
  - Division of Health Statistics, Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Shanxi Medical University, 030001, Taiyuan, Shanxi, China
- Yanbo Zhang
  - Division of Health Statistics, Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, Shanxi Medical University, 030001, Taiyuan, Shanxi, China
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI, 48824, USA
3
Hao Y, Jing XY, Sun Q. Cancer survival prediction by learning comprehensive deep feature representation for multiple types of genetic data. BMC Bioinformatics 2023; 24:267. [PMID: 37380946 DOI: 10.1186/s12859-023-05392-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Accepted: 06/19/2023] [Indexed: 06/30/2023] Open
Abstract
BACKGROUND Cancer is one of the leading causes of death worldwide. Accurate prediction of patient survival time is important because it can help clinicians choose appropriate treatment plans. Cancer data can be characterized by varied molecular features, clinical behaviors and morphological appearances. However, the cancer heterogeneity problem usually makes patient samples with different risks (i.e., short and long survival time) inseparable, thereby causing unsatisfactory prediction results. Clinical studies have shown that genetic data tends to contain more molecular biomarkers associated with cancer, and hence integrating multi-type genetic data may be a feasible way to deal with cancer heterogeneity. Although multi-type gene data have been used in existing work, how to learn more effective features for cancer survival prediction has not been well studied. RESULTS To this end, we propose a deep learning approach to reduce the negative impact of cancer heterogeneity and improve cancer survival prediction. It represents each type of genetic data as shared and specific features, which can capture the consensus and complementary information among all types of data. We collect mRNA expression, DNA methylation and microRNA expression data for four cancers to conduct experiments. CONCLUSIONS Experimental results demonstrate that our approach substantially outperforms established integrative methods and is effective for cancer survival prediction. AVAILABILITY AND IMPLEMENTATION https://github.com/githyr/ComprehensiveSurvival .
Affiliation(s)
- Yaru Hao
  - School of Computer Science, Wuhan University, Wuhan, China
- Xiao-Yuan Jing
  - School of Computer Science, Wuhan University, Wuhan, China
  - School of Computer, Guangdong University of Petrochemical Technology, Maoming, China
  - State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
- Qixing Sun
  - School of Computer Science, Wuhan University, Wuhan, China
4
An Analysis of Transcriptomic Burden Identifies Biological Progression Roadmaps for Hematological Malignancies and Solid Tumors. Biomedicines 2022; 10:biomedicines10112720. [DOI: 10.3390/biomedicines10112720] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/05/2022] [Accepted: 10/24/2022] [Indexed: 11/16/2022] Open
Abstract
Biological paths of tumor progression are difficult to predict without time-series data. Using median shift and abacus transformation in the analysis of RNA sequencing data sets, natural patient stratifications were found based on their transcriptomic burden (TcB). Using gene-behavior analysis, TcB groups were evaluated further to discover biological courses of tumor progression. We found that solid tumors and hematological malignancies (n = 4179) share conserved biological patterns, and biological network complexity decreases at increasing TcB levels. An analysis of gene expression datasets including pediatric leukemia patients revealed TcB patterns with biological directionality and survival implications. A prospective interventional study with PI3K targeted therapy in canine lymphomas proved that directional biological responses are dynamic. To conclude, TcB-enriched biological mechanisms detected the existence of biological trajectories within tumors. This prognostically informative informatics method, which can be applied to tumor transcriptomes and progressive diseases, inspires the design of progression-specific therapeutic approaches.
5
Cong Y, Endo T. Multi-Omics and Artificial Intelligence-Guided Drug Repositioning: Prospects, Challenges, and Lessons Learned from COVID-19. OMICS: A Journal of Integrative Biology 2022; 26:361-371. [PMID: 35759424 DOI: 10.1089/omi.2022.0068] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/15/2023]
Abstract
Drug repurposing is of interest for therapeutics innovation in many human diseases including coronavirus disease 2019 (COVID-19). Methodological innovations in drug repurposing are currently being empowered by convergence of omics systems science and digital transformation of life sciences. This expert review article offers a systematic summary of the application of artificial intelligence (AI), particularly machine learning (ML), to drug repurposing and classifies and introduces the common clustering, dimensionality reduction, and other methods. We highlight, as a present-day high-profile example, the involvement of AI/ML-based drug discovery in the COVID-19 pandemic and discuss the collection and sharing of diverse data types, and the possible futures awaiting drug repurposing in an era of AI/ML and digital technologies. The article provides new insights on convergence of multi-omics and AI-based drug repurposing. We conclude with reflections on the various pathways to expedite innovation in drug development through drug repurposing for prompt responses to the current COVID-19 pandemic and future ecological crises in the 21st century.
Affiliation(s)
- Yi Cong
  - Laboratory of Information Biology, Information Science and Technology, Hokkaido University, Sapporo, Japan
- Toshinori Endo
  - Laboratory of Information Biology, Information Science and Technology, Hokkaido University, Sapporo, Japan
6
Khan A, Maji P. Selective Update of Relevant Eigenspaces for Integrative Clustering of Multimodal Data. IEEE Transactions on Cybernetics 2022; 52:947-959. [PMID: 32452799 DOI: 10.1109/tcyb.2020.2990112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
One of the major problems in cancer subtype discovery from multimodal omic data is that all the available modalities may not encode relevant and homogeneous information about the subtypes. Moreover, the high-dimensional nature of the modalities makes sample clustering computationally expensive. In this regard, a novel algorithm is proposed to extract a low-rank joint subspace of the integrated data matrix. The proposed algorithm first evaluates the quality of subtype information provided by each of the modalities, and then judiciously selects only relevant ones to construct the joint subspace. The problem of incrementally updating the singular value decomposition of a data matrix is formulated for the multimodal data framework. The analytical formulation enables efficient construction of the joint subspace of integrated data from low-rank subspaces of the individual modalities. The construction of joint subspace by the proposed method is shown to be computationally more efficient compared to performing the principal component analysis (PCA) on the integrated data matrix. Some new quantitative indices are introduced to measure theoretically the accuracy of subspace construction by the proposed approach with respect to the principal subspace extracted by the PCA. The efficacy of clustering on the joint subspace constructed by the proposed algorithm is established over existing integrative clustering approaches on several real-life multimodal cancer data sets.
7
Nguyen H, Tran D, Tran B, Roy M, Cassell A, Dascalu S, Draghici S, Nguyen T. SMRT: Randomized Data Transformation for Cancer Subtyping and Big Data Analysis. Front Oncol 2021; 11:725133. [PMID: 34745946 PMCID: PMC8563705 DOI: 10.3389/fonc.2021.725133] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/14/2021] [Accepted: 09/28/2021] [Indexed: 12/25/2022] Open
Abstract
Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. The treatment options, as well as treatment success, are highly dependent on the correct subtyping of individual patients. With the advancement of high-throughput platforms, we have the opportunity to differentiate among cancer subtypes from a holistic perspective that takes into consideration phenomena at different molecular levels (mRNA, methylation, etc.). This demands powerful integrative methods to leverage large multi-omics datasets for a better subtyping. Here we introduce Subtyping Multi-omics using a Randomized Transformation (SMRT), a new method for multi-omics integration and cancer subtyping. SMRT offers the following advantages over existing approaches: (i) the scalable analysis pipeline allows researchers to integrate multi-omics data and analyze hundreds of thousands of samples in minutes, (ii) the ability to integrate data types with different numbers of patients, (iii) the ability to analyze un-matched data of different types, and (iv) the ability to offer users a convenient data analysis pipeline through a web application. We also improve the efficiency of our ensemble-based, perturbation clustering to support analysis on machines with memory constraints. In an extensive analysis, we compare SMRT with eight state-of-the-art subtyping methods using 37 TCGA and two METABRIC datasets comprising a total of almost 12,000 patient samples from 28 different types of cancer. We also performed a number of simulation studies. We demonstrate that SMRT outperforms other methods in identifying subtypes with significantly different survival profiles. In addition, SMRT is extremely fast, being able to analyze hundreds of thousands of samples in minutes. The web application is available at http://SMRT.tinnguyen-lab.com. 
The R package will be deposited to CRAN as part of our PINSPlus software suite.
Affiliation(s)
- Hung Nguyen
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Duc Tran
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Bang Tran
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Monikrishna Roy
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Adam Cassell
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Sergiu Dascalu
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
- Sorin Draghici
  - Department of Computer Science, Wayne State University, Detroit, MI, United States
- Tin Nguyen
  - Department of Computer Science and Engineering, University of Nevada Reno, Reno, NV, United States
8
Shi K, Lin W, Zhao XM. Identifying Molecular Biomarkers for Diseases With Machine Learning Based on Integrative Omics. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2514-2525. [PMID: 32305934 DOI: 10.1109/tcbb.2020.2986387] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 06/11/2023]
Abstract
Molecular biomarkers are molecules or sets of molecules that can help in the diagnosis or prognosis of diseases or disorders. In the past decades, thanks to the advances in high-throughput technologies, a huge amount of molecular 'omics' data, e.g., transcriptomics and proteomics, has been accumulated. The availability of these omics data makes it possible to screen biomarkers for diseases or disorders. Accordingly, a number of computational approaches have been developed to identify biomarkers by exploring the omics data. In this review, we present a comprehensive survey on the recent progress of identification of molecular biomarkers with machine learning approaches. Specifically, we categorize the machine learning approaches into supervised, unsupervised and recommendation approaches, where the biomarkers include single genes, gene sets and small gene networks. In addition, we further discuss potential problems underlying biomedical data that may pose challenges for machine learning, and provide possible directions for future biomarker identification.
9
Sun Y, Ou-Yang L, Dai DQ. WMLRR: A Weighted Multi-View Low Rank Representation to Identify Cancer Subtypes From Multiple Types of Omics Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2891-2897. [PMID: 33656995 DOI: 10.1109/tcbb.2021.3063284] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 06/12/2023]
Abstract
The identification of cancer subtypes is of great importance for understanding the heterogeneity of tumors and providing patients with more accurate diagnoses and treatments. However, it is still a challenge to effectively integrate multiple omics data to establish cancer subtypes. In this paper, we propose an unsupervised integration method, named weighted multi-view low rank representation (WMLRR), to identify cancer subtypes from multiple types of omics data. Given a group of patients described by multiple omics data matrices, we first learn a unified affinity matrix which encodes the similarities among patients by exploring the sparsity-consistent low-rank representations from the joint decompositions of multiple omics data matrices. Unlike existing subtype identification methods that treat each omics data matrix equally, we assign a weight to each omics data matrix and learn these weights automatically through the optimization process. Finally, we apply spectral clustering on the learned affinity matrix to identify cancer subtypes. Experiment results show that the survival times between our identified cancer subtypes are significantly different, and our predicted survivals are more accurate than other state-of-the-art methods. In addition, some clinical analyses of the diseases also demonstrate the effectiveness of our method in identifying molecular subtypes with biological significance and clinical relevance.
10
Deep Learning for Integrated Analysis of Insulin Resistance with Multi-Omics Data. J Pers Med 2021; 11:jpm11020128. [PMID: 33671853 PMCID: PMC7918166 DOI: 10.3390/jpm11020128] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/08/2021] [Revised: 01/25/2021] [Accepted: 02/10/2021] [Indexed: 02/06/2023] Open
Abstract
Technological advances in next-generation sequencing (NGS) have made it possible to uncover extensive and dynamic alterations in diverse molecular components and biological pathways across healthy and diseased conditions. Large amounts of multi-omics data originating from emerging NGS experiments require feature engineering, which is a crucial step in the process of predictive modeling. The underlying relationship among multi-omics features in terms of insulin resistance is not well understood. In this study, using the multi-omics data of type II diabetes from the Integrative Human Microbiome Project, from 10,783 features, we conducted a data analytic approach to elucidate the relationship between insulin resistance and multi-omics features, including microbiome data. To better explain the impact of microbiome features on insulin classification, we used a developed deep neural network interpretation algorithm for each microbiome feature’s contribution to the discriminative model output in the samples.
11
A topological approach for cancer subtyping from gene expression data. J Biomed Inform 2020; 102:103357. [PMID: 31893527 DOI: 10.1016/j.jbi.2019.103357] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 05/29/2019] [Revised: 11/27/2019] [Accepted: 12/12/2019] [Indexed: 12/27/2022]
Abstract
BACKGROUND Gene expression data contains key information which can be used for subtyping cancer patients. However, computational methods suffer from the 'curse of dimensionality' due to the very high dimensionality of omics data and therefore are not able to clearly distinguish between the discovered subtypes in terms of separation of survival plots. METHODS To address this we propose a framework based on the Topological Mapper algorithm. The novelty of this work is that we suggest a method for defining the filter function on which the mapper algorithm heavily depends. Survival analysis of the discovered cancer subtypes is carried out and evaluated in terms of minimum pairwise separation between the Kaplan-Meier plots. Furthermore, we present a method to measure the separation between the discovered subtypes based on hazard ratios. RESULTS Five cancer genomics datasets obtained from The Cancer Genome Atlas portal have been used for comparisons with the Robust Sparse Correlation-Otrimle (RSC-Otrimle) algorithm and Similarity Network Fusion (SNF). Comparisons show that the minimum pairwise life expectancy difference (in days) between the discovered subtypes for lung, colon, breast, glioblastoma and kidney cancers is 107, 204, 20, 88 and 425 days, respectively, for the proposed methodology whereas it is only 69, 43, 6, 61 and 282 days for RSC-Otrimle and 9, 95, 18, 60 and 148 days for SNF. Hazard ratio analysis also shows that the proposed methodology performs better in four of the five datasets. A visual inspection of Kaplan-Meier plots reveals that the proposed methodology achieves less overlap in Kaplan-Meier plots, especially for lung, breast and kidney cases. Furthermore, relevant genetic pathways for each subtype have been obtained and pathways which can be possible targets for treatment have been discussed. CONCLUSION The significance of this work lies in individualized understanding of cancer from patient to patient which is the backbone of Precision Medicine.
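Because this abstract evaluates subtypes by the separation of their Kaplan-Meier plots, a minimal sketch of the Kaplan-Meier estimator may help. The function name and the toy follow-up times below are assumptions made for illustration; they are not data or code from the paper.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct time with an event.

    times  -- follow-up time per patient (e.g. days)
    events -- 1 if the event (death) was observed, 0 if the patient was censored
    """
    order = sorted(zip(times, events))
    at_risk = len(order)
    surv = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(order) and order[i][0] == t:
            deaths += order[i][1]
            removed += 1
            i += 1
        if deaths:
            # Multiply by the fraction of at-risk patients surviving past t.
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

# Hypothetical toy subtype: five patients, two of them censored.
curve = kaplan_meier([5, 8, 8, 12, 20], [1, 1, 0, 1, 0])
print(curve)  # survival drops to about 0.8, 0.6 and 0.3 at days 5, 8 and 12
```

Computing one such curve per discovered subtype and comparing them pairwise is one simple way to look at the kind of separation the abstract reports.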
12
Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review and cancer benchmark. Nucleic Acids Res 2019; 46:10546-10562. [PMID: 30295871 PMCID: PMC6237755 DOI: 10.1093/nar/gky889] [Citation(s) in RCA: 268] [Impact Index Per Article: 44.7] [Received: 07/18/2018] [Accepted: 09/20/2018] [Indexed: 12/18/2022] Open
Abstract
Recent high throughput experimental methods have been used to collect large biomedical omics datasets. Clustering of single omic datasets has proven invaluable for biological and medical research. The decreasing cost and development of additional high throughput methods now enable measurement of multi-omic data. Clustering multi-omic data has the potential to reveal further systems-level insights, but raises computational and biological challenges. Here, we review algorithms for multi-omics clustering, and discuss key issues in applying these algorithms. Our review covers methods developed specifically for omic data as well as generic multi-view methods developed in the machine learning community for joint clustering of multiple data types. In addition, using cancer data from TCGA, we perform an extensive benchmark spanning ten different cancer types, providing the first systematic comparison of leading multi-omics and multi-view clustering algorithms. The results highlight key issues regarding the use of single- versus multi-omics, the choice of clustering strategy, the power of generic multi-view methods and the use of approximated p-values for gauging solution quality. Due to the growing use of multi-omics data, we expect these issues to be important for future progress in the field.
Affiliation(s)
- Nimrod Rappoport
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- Ron Shamir
  - Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel