1
Ghareyazi A, Kazemi A, Hamidieh K, Dashti H, Tahaei MS, Rabiee HR, Alinejad-Rokny H, Dehzangi I. Pan-cancer integrative analysis of whole-genome De novo somatic point mutations reveals 17 cancer types. BMC Bioinformatics 2022; 23:298. [PMID: 35879674 PMCID: PMC9316662 DOI: 10.1186/s12859-022-04840-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2022] [Accepted: 07/14/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The advent of high-throughput sequencing has enabled researchers to systematically evaluate genetic variation in cancer and to identify many cancer-associated genes. Although cancers arising in the same tissue are usually grouped together, their mutational profiles differ considerably. Hence, there is no definitive treatment for most cancer types. This underlines the importance of developing new pipelines that accurately identify cancer-associated genes and re-classify patients with similar mutational profiles. Classifying cancer patients by mutational profile may help discover patient subtypes that might benefit from specific treatment types.
RESULTS: In this study, we propose a new machine learning pipeline that identifies protein-coding genes mutated in many samples and uses them to identify cancer subtypes. We apply our pipeline to 12,270 samples collected from the International Cancer Genome Consortium (ICGC), covering 19 cancer types. As a result, we identify 17 different cancer subtypes. Comprehensive phenotypic and genotypic analysis indicates that the subtypes have distinguishable properties, including unique cancer-related signaling pathways.
CONCLUSIONS: This new subtyping approach offers a novel opportunity for cancer drug development based on the mutational profile of patients. Additionally, we analyze the mutational signatures of the samples in each subtype, which provides important insight into their active molecular mechanisms. Some of the pathways we identified in most subtypes, including the cell cycle and axon guidance pathways, are frequently observed in cancer. Interestingly, we also identified several mutated genes and different rates of mutation in multiple cancer subtypes. In addition, our study of "gene-motif" suggests the importance of considering both the context of the mutations and the mutational processes when identifying cancer-associated genes.
The source code for our proposed clustering pipeline and analysis is publicly available at: https://github.com/bcb-sut/Pan-Cancer
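The first step the abstract describes, finding genes mutated in many samples, might be sketched as below. This is only an illustration under stated assumptions, not the authors' pipeline: the function name `recurrently_mutated`, the `min_fraction` threshold, and the toy sample-to-gene mapping are all hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def recurrently_mutated(sample_genes, min_fraction=0.5):
    """Return genes mutated in at least `min_fraction` of samples, sorted by name.

    `sample_genes` maps a sample ID to the list of genes mutated in it.
    The 0.5 default is an arbitrary illustrative threshold.
    """
    # Count each gene at most once per sample, even if it carries several mutations.
    counts = Counter(g for genes in sample_genes.values() for g in set(genes))
    n_samples = len(sample_genes)
    return sorted(g for g, c in counts.items() if c / n_samples >= min_fraction)


# Toy input invented for the example; these are not results from the study.
samples = {
    "s1": ["TP53", "KRAS"],
    "s2": ["TP53"],
    "s3": ["TP53", "BRAF", "KRAS"],
    "s4": ["PTEN"],
}
frequent = recurrently_mutated(samples)
print(frequent)
```

A per-sample binary vector over the selected genes could then serve as the input to whatever clustering method is chosen; that step is left out here because the abstract does not specify it.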
Affiliation(s)
- Amin Ghareyazi: Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran
- Amirreza Kazemi: Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran; Department of Computer Engineering, Simon Fraser University, Burnaby, BC, 1S6, Canada
- Kimia Hamidieh: Department of Computer Science, University of Toronto, Toronto, ON, M5S 3H2, Canada
- Hamed Dashti: Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran
- Maedeh Sadat Tahaei: Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran
- Hamid R Rabiee: Bioinformatics and Computational Biology Lab, Department of Computer Engineering, Sharif University of Technology, Tehran, 11365, Iran
- Hamid Alinejad-Rokny: BioMedical Machine Learning Lab (BML), The Graduate School of Biomedical Engineering, UNSW Sydney, Sydney, NSW, 2052, Australia; UNSW Data Science Hub, The University of New South Wales (UNSW Sydney), Sydney, NSW, 2052, Australia; AI-Enabled Processes (AIP) Research Centre, Macquarie University, Sydney, 2109, Australia
- Iman Dehzangi: Department of Computer Science, Rutgers University, Camden, NJ, 08102, USA; Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, 08102, USA
2
Li S, Jiang L, Tang J, Gao N, Guo F. Kernel Fusion Method for Detecting Cancer Subtypes via Selecting Relevant Expression Data. Front Genet 2020; 11:979. [PMID: 33133130 PMCID: PMC7511763 DOI: 10.3389/fgene.2020.00979] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/08/2020] [Accepted: 08/03/2020] [Indexed: 12/19/2022]
Abstract
Cancer has recently been characterized as a heterogeneous disease composed of many different subtypes. Early diagnosis of cancer subtypes is an important topic in cancer research and can be of tremendous help to patients after treatment. In this paper, we first extract a novel dataset containing gene expression, miRNA expression, and isoform expression of five cancers from The Cancer Genome Atlas (TCGA). Next, to limit the effect of noise among the 60,483 genes, we select a small number of genes using LASSO applied to the gene expression and survival time of patients. Then, we construct one similarity kernel for each type of expression data using the Chebyshev distance. We use Similarity Kernel Fusion (SKF) to fuse the three similarity matrices (gene, isoform, and miRNA) and finally cluster the fused similarity matrix with spectral clustering. In the experimental results, our method achieves better P-values in the Cox model than other methods on 10 cancer datasets from the Jiang Dataset and the Novel Dataset. We draw survival curves for the different cancers and find that some genes play a key role in cancer. For breast cancer, we find that HSPA2A, RNASE1, CLIC6, and IFITM1 are highly expressed in some specific groups. For lung cancer, we find that C4BPA, SESN3, and IRS1 are highly expressed in some specific groups. The code and all supporting data files are available from https://github.com/guofei-tju/Uncovering-Cancer-Subtypes-via-LASSO.
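The kernel-construction and fusion steps in this abstract can be illustrated with a short sketch. It rests on several assumptions that are not in the source: the Gaussian form of the similarity, the `sigma` bandwidth, and the plain element-wise averaging used for fusion are stand-ins chosen for simplicity, not the authors' SKF procedure, and the toy values are invented. The spectral clustering step is omitted.

```python
import math


def chebyshev(u, v):
    """Chebyshev (L-infinity) distance: the largest coordinate-wise gap."""
    return max(abs(a - b) for a, b in zip(u, v))


def similarity_kernel(rows, sigma=1.0):
    """Patient-by-patient similarity from pairwise Chebyshev distances.

    The Gaussian transform exp(-d^2 / (2 * sigma^2)) is an assumed choice.
    """
    n = len(rows)
    return [
        [math.exp(-chebyshev(rows[i], rows[j]) ** 2 / (2 * sigma ** 2)) for j in range(n)]
        for i in range(n)
    ]


def average_kernels(kernels):
    """Element-wise mean of equally sized kernels (a simple stand-in, not SKF)."""
    n = len(kernels[0])
    return [
        [sum(k[i][j] for k in kernels) / len(kernels) for j in range(n)]
        for i in range(n)
    ]


# Toy data: three patients measured in two hypothetical data types.
gene = [[0.0, 1.0], [0.1, 1.2], [5.0, 4.0]]
mirna = [[1.0], [1.1], [3.0]]
fused = average_kernels([similarity_kernel(gene), similarity_kernel(mirna)])

for row in fused:
    print([round(x, 3) for x in row])
```

In this toy example the first two patients end up far more similar to each other than either is to the third, which is the kind of block structure a clustering step on the fused matrix would then pick up.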
Affiliation(s)
- Shuhao Li: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Limin Jiang: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Jijun Tang: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China; Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States
- Nan Gao: School of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China
- Fei Guo: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
3
Xi J, Li A, Wang M. HetRCNA: A Novel Method to Identify Recurrent Copy Number Alternations from Heterogeneous Tumor Samples Based on Matrix Decomposition Framework. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:422-434. [PMID: 29994262 DOI: 10.1109/tcbb.2018.2846599] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Indexed: 06/08/2023]
Abstract
A common strategy for discovering cancer-associated copy number aberrations (CNAs) from a cohort of cancer samples is to detect recurrent CNAs (RCNAs). Although previous methods can successfully identify communal RCNAs shared by nearly all tumor samples, they cannot detect subgroup-specific RCNAs and their related subgroup samples in heterogeneous cancer cohorts. In this paper, we introduce a novel integrated method called HetRCNA, which identifies statistically significant subgroup-specific RCNAs and their related subgroup samples. Based on a matrix decomposition framework with a weight constraint, HetRCNA identifies the subgroup samples from the coefficients of the left vectors under the weight constraint, and the subgroup-specific RCNAs from the coefficients of the right vectors together with a significance test. On a simulated dataset, HetRCNA gives the best performance among the competing methods and is robust to the noise factors of the simulated data. Applied to a real breast cancer dataset, our approach identifies a number of RCNA regions, and the result is highly correlated with the results of the other two investigated approaches. Notably, the genomic regions identified by HetRCNA harbor many breast cancer-related genes reported in previous studies.
4
Jiang L, Xiao Y, Ding Y, Tang J, Guo F. Discovering Cancer Subtypes via an Accurate Fusion Strategy on Multiple Profile Data. Front Genet 2019; 10:20. [PMID: 30804977 PMCID: PMC6370730 DOI: 10.3389/fgene.2019.00020] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 10/30/2018] [Accepted: 01/15/2019] [Indexed: 01/03/2023]
Abstract
Discovering cancer subtypes is useful for guiding the clinical treatment of multiple cancers. Advances in tissue profiling technologies have accumulated diverse types of data. Based on these types of expression data, various computational methods have been proposed to predict cancer subtypes, and it is crucial to study how best to integrate these multiple data profiles. In this paper, we collect multiple profiles of data for five cancers from The Cancer Genome Atlas (TCGA). We then construct three similarity kernels for all patients with the same cancer from gene expression, miRNA expression, and isoform expression data. We also propose a novel unsupervised multiple kernel fusion method, Similarity Kernel Fusion (SKF), to integrate the three similarity kernels into one combined kernel. Finally, we apply spectral clustering to the integrated kernel to predict cancer subtypes. In the experiments, P-values from the Cox regression model and survival curve analysis are used to evaluate the predicted subtypes on three datasets. Our kernel fusion method, SKF, outperforms single-kernel and other multiple kernel fusion strategies, which demonstrates that it identifies more accurate subtypes across various kinds of cancer. Our cancer subtype prediction method can identify essential genes and biomarkers for disease diagnosis and prognosis, and we also discuss the possible side effects of therapies and treatment.
Affiliation(s)
- Limin Jiang: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yongkang Xiao: School of Chemical Engineering and Technology, Tianjin University, Tianjin, China
- Yijie Ding: School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Jijun Tang: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China; Department of Computer Science and Engineering, University of South Carolina, Columbia, SC, United States
- Fei Guo: School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
5
A filter feature selection method based on the Maximal Information Coefficient and Gram-Schmidt Orthogonalization for biomedical data mining. Comput Biol Med 2017; 89:264-274. [PMID: 28850898 DOI: 10.1016/j.compbiomed.2017.08.021] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 03/29/2017] [Revised: 08/19/2017] [Accepted: 08/20/2017] [Indexed: 12/22/2022]
6
Implicit feature selection for omics data phenotype discrimination. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2013.10.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
7
Wong G, Chan J, Kingwell BA, Leckie C, Meikle PJ. LICRE: unsupervised feature correlation reduction for lipidomics. Bioinformatics 2014; 30:2832-3. [PMID: 24930143 DOI: 10.1093/bioinformatics/btu381] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Indexed: 11/12/2022]
Abstract
MOTIVATION: Recent advances in high-throughput lipid profiling by liquid chromatography electrospray ionization tandem mass spectrometry (LC-ESI-MS/MS) have made it possible to quantify hundreds of individual molecular lipid species (e.g. fatty acyls, glycerolipids, glycerophospholipids, sphingolipids) in a single experimental run for hundreds of samples. This enables the lipidome of large cohorts of subjects to be profiled to identify lipid biomarkers significantly associated with disease risk, progression and treatment response. Clinically, these lipid biomarkers can be used to construct classification models for the purpose of disease screening or diagnosis. However, the inclusion of a large number of highly correlated biomarkers within a model may reduce classification performance, unnecessarily inflate associated costs of a diagnosis or a screen and reduce the feasibility of clinical translation. An unsupervised feature reduction approach can reduce feature redundancy in lipidomic biomarkers by limiting the number of highly correlated lipids while retaining informative features to achieve good classification performance for various clinical outcomes. Good predictive models based on a reduced number of biomarkers are also more cost effective and feasible from a clinical translation perspective.
RESULTS: The application of LICRE to various lipidomic datasets in diabetes and cardiovascular disease demonstrated superior discrimination in terms of the area under the receiver operator characteristic curve while using fewer lipid markers when predicting various clinical outcomes.
AVAILABILITY AND IMPLEMENTATION: The MATLAB implementation of LICRE is available from http://ww2.cs.mu.oz.au/~gwong/LICRE
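The general idea of unsupervised correlation reduction, limiting the number of highly correlated features without looking at outcome labels, can be sketched as a greedy filter. The abstract does not spell out LICRE's actual algorithm, so this is a generic illustration and not LICRE itself: the function names, the 0.9 threshold, and the toy lipid measurements are assumptions made for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes non-constant inputs)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def drop_correlated(features, threshold=0.9):
    """Greedy filter: keep a feature only if |r| with every kept feature is <= threshold.

    Features are visited in insertion order, so earlier features take priority.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Toy measurements invented for the example (four samples per hypothetical lipid).
lipids = {
    "lipid_a": [1.0, 2.0, 3.0, 4.0],
    "lipid_b": [2.1, 3.9, 6.2, 8.0],  # nearly proportional to lipid_a
    "lipid_c": [4.0, 1.0, 3.0, 2.0],
}
selected = drop_correlated(lipids)
print(selected)
```

Because the filter never uses class labels, the reduced feature set can be reused across several clinical outcomes, which matches the motivation given in the abstract; how the retained features are ranked or chosen in LICRE itself is not stated there.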
Affiliation(s)
- Gerard Wong: Baker IDI Heart and Diabetes Institute, Melbourne, Victoria 3004, Australia and Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria 3010, Australia
- Jeffrey Chan: Baker IDI Heart and Diabetes Institute, Melbourne, Victoria 3004, Australia and Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria 3010, Australia
- Bronwyn A Kingwell: Baker IDI Heart and Diabetes Institute, Melbourne, Victoria 3004, Australia and Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria 3010, Australia
- Christopher Leckie: Baker IDI Heart and Diabetes Institute, Melbourne, Victoria 3004, Australia and Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria 3010, Australia
- Peter J Meikle: Baker IDI Heart and Diabetes Institute, Melbourne, Victoria 3004, Australia and Department of Computing and Information Systems, The University of Melbourne, Parkville, Victoria 3010, Australia
8
Multiclass Prediction for Cancer Microarray Data Using Various Variables Range Selection Based on Random Forest. Lecture Notes in Computer Science 2013. [DOI: 10.1007/978-3-642-40319-4_22] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/27/2022]