1
Liu H, Qiu C, Wang B, Bing P, Tian G, Zhang X, Ma J, He B, Yang J. Evaluating DNA Methylation, Gene Expression, Somatic Mutation, and Their Combinations in Inferring Tumor Tissue-of-Origin. Front Cell Dev Biol 2021; 9:619330. [PMID: 34012960 PMCID: PMC8126648 DOI: 10.3389/fcell.2021.619330] [Citation(s) in RCA: 70] [Impact Index Per Article: 23.3] [Received: 10/20/2020] [Accepted: 03/22/2021] [Indexed: 12/18/2022]
Abstract
Carcinoma of unknown primary (CUP) is a metastatic cancer whose primary tumor site cannot be identified. CUP accounts for approximately 5% of cancer cases in the United States and usually carries an unfavorable prognosis, making it a serious threat to public health. Traditional methods for identifying the tissue of origin (TOO) of CUP, such as immunohistochemistry, succeed in only around 20% of CUP patients. In recent years, a growing number of studies have suggested that the problem can be addressed by combining machine learning techniques with large biomedical datasets covering multiple types of biomarkers, including epigenetic profiles (such as DNA methylation), genetic profiles, and gene expression profiles. Different biomarkers play different roles in cancer research: for example, genomic mutations in a patient’s tumor can point to specific anticancer drugs for treatment, whereas DNA methylation and copy number variation can reveal the tumor tissue of origin and molecular classification. However, there has been no systematic comparison of which biomarker is better at identifying the cancer type and site of origin. In addition, it might be possible to further improve inference accuracy by integrating multiple types of biomarkers. In this study, we used primary tumor data rather than metastatic tumor data. Although the use of primary tumors may introduce some bias into our classification model, their tissues of origin are known. Moreover, previous studies have suggested that CUP prediction models built from primary tumors can efficiently predict the TOO of metastatic cancers (Lal et al., 2013; Brachtel et al., 2016). We systematically compared the performance of three types of biomarkers (DNA methylation, gene expression profile, and somatic mutation), as well as their combinations, in inferring the TOO of CUP patients.
First, we downloaded gene expression profiles, somatic mutation data, and DNA methylation data for 7,224 tumor samples across 21 common cancer types from The Cancer Genome Atlas (TCGA) and generated seven feature matrices from the three data types and their combinations. Second, we performed feature selection using the Pearson correlation method. The selected features for each matrix were then used to build an XGBoost multi-label classification model, an algorithm shown to be effective in several previous studies, to infer the cancer TOO. The performance of each biomarker and combination was compared using 10-fold cross-validation. Our results showed that TOO tracing accuracy was highest with gene expression profiles, followed by DNA methylation, while somatic mutation performed worst. We also found that simply combining multiple biomarkers did little to improve prediction accuracy.
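The matrix construction and feature selection steps described above can be illustrated with a minimal sketch. It is not the authors' code. It assumes that the seven matrices correspond to the seven non-empty subsets of the three biomarker types (2^3 - 1 = 7), and that features are ranked by the absolute Pearson correlation between each feature and an integer class code, which treats the classes as ordinal. The names `biomarker_combinations`, `combine`, and `select_features`, and the toy values, are hypothetical. The XGBoost step is left out so that the example runs with the standard library alone.

```python
from itertools import combinations
from math import sqrt

BIOMARKERS = ("expression", "methylation", "mutation")

def biomarker_combinations(names=BIOMARKERS):
    """All non-empty subsets of the biomarker types (7 for three types)."""
    return [combo
            for r in range(1, len(names) + 1)
            for combo in combinations(names, r)]

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def combine(blocks, combo):
    """Merge the feature columns of the chosen biomarker types into one matrix."""
    merged = {}
    for name in combo:
        for feature, values in blocks[name].items():
            merged[f"{name}:{feature}"] = values
    return merged

def select_features(columns, labels, k):
    """Keep the k features with the largest |Pearson r| against the labels.

    Assumption: labels are integer class codes; the paper does not state
    how the correlation was computed for a multi-class target.
    """
    ranked = sorted(columns,
                    key=lambda f: abs(pearson(columns[f], labels)),
                    reverse=True)
    return ranked[:k]

# Toy data: six samples, three classes (all values are made up).
labels = [0, 0, 1, 1, 2, 2]
blocks = {
    "expression": {"g1": [1.0, 1.2, 2.1, 2.0, 3.1, 2.9],
                   "g2": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]},
    "methylation": {"cg1": [0.9, 0.8, 0.5, 0.4, 0.1, 0.2]},
    "mutation": {"m1": [0, 1, 0, 1, 0, 1]},
}

combos = biomarker_combinations()
selected = {combo: select_features(combine(blocks, combo), labels, k=2)
            for combo in combos}
for combo, features in selected.items():
    print("+".join(combo), "->", features)
```

In a full pipeline, each selected matrix would then be passed to a classifier and scored with cross-validation, so that the seven configurations can be compared on equal terms.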
Affiliation(s)
- Haiyan Liu
  - Academician Workstation, Changsha Medical University, Changsha, China; College of Information Engineering, Changsha Medical University, Changsha, China
- Chun Qiu
  - Department of Oncology, Hainan General Hospital, Haikou, China
- Bo Wang
  - Geneis Beijing Co., Ltd., Beijing, China; Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
- Pingping Bing
  - Academician Workstation, Changsha Medical University, Changsha, China
- Geng Tian
  - Geneis Beijing Co., Ltd., Beijing, China; Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
- Xueliang Zhang
  - Department of Oncology, Jiamusi Cancer Hospital, Jiamusi, China
- Jun Ma
  - College of Information Engineering, Changsha Medical University, Changsha, China
- Bingsheng He
  - Academician Workstation, Changsha Medical University, Changsha, China
- Jialiang Yang
  - Academician Workstation, Changsha Medical University, Changsha, China; Geneis Beijing Co., Ltd., Beijing, China; Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China
2
Chen S, Zhou W, Tu J, Li J, Wang B, Mo X, Tian G, Lv K, Huang Z. A Novel XGBoost Method to Infer the Primary Lesion of 20 Solid Tumor Types From Gene Expression Data. Front Genet 2021; 12:632761. [PMID: 33613644 PMCID: PMC7886791 DOI: 10.3389/fgene.2021.632761] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 11/24/2020] [Accepted: 01/06/2021] [Indexed: 11/19/2022]
Abstract
Purpose: To establish a machine learning model, based on an ensemble learning approach, that identifies the primary lesion of metastatic tumors and thereby improves the accuracy and efficiency of primary-lesion diagnosis.
Methods: After removing features whose expression level fell below a threshold, we applied two feature selection methods and used XGBoost for classification. The optimal model was selected through 10-fold cross-validation and then verified on an independent test set.
Results: With around 800 genes selected as training features, the R2-score reached 96.38% in 10-fold cross-validation on the training data and 83.3% on the test data.
Conclusion: These findings suggest that combining tumor data with machine learning methods yields a classification accuracy for each cancer type and can be used to predict the primary site of metastatic tumors. The machine-learning-based method can serve as an orthogonal diagnostic method for comparing the model’s predictions with the actual clinical pathology.
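The filtering and cross-validation steps in the Methods can be sketched as follows. This is an illustration under stated assumptions, not the published pipeline: the low-expression criterion is assumed to be the mean expression across samples, the folds are assumed to be stratified by class, and a simple nearest-centroid classifier stands in for XGBoost so that the example needs only the standard library. The two feature selection methods and the independent test set are omitted. All function names and toy values are hypothetical.

```python
import random
from collections import defaultdict

def filter_low_expression(samples, threshold):
    """Return indices of genes whose mean expression is >= threshold.

    Assumption: the abstract does not say how the threshold is applied;
    the per-gene mean across samples is used here for illustration.
    """
    n_genes = len(samples[0])
    return [g for g in range(n_genes)
            if sum(s[g] for s in samples) / len(samples) >= threshold]

def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, i in enumerate(indices):
            folds[pos % k].append(i)
    return folds

def fit_centroids(X, y):
    """Stand-in classifier (not XGBoost): one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] += 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def predict(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2
                                 for a, b in zip(row, centroids[c])))

def cross_val_accuracy(X, y, k=10, seed=0):
    """Train on k-1 folds, score on the held-out fold, pool the results."""
    correct = 0
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in held_out)
    return correct / len(y)

# Toy data: 40 samples, 2 classes, 3 genes; gene 1 is barely expressed.
rng = random.Random(1)
labels = [0] * 20 + [1] * 20
samples = [[5.0 + 4.0 * y + rng.random(), 0.01, 3.0 - 2.0 * y + rng.random()]
           for y in labels]

keep = filter_low_expression(samples, threshold=1.0)
X = [[s[g] for g in keep] for s in samples]
acc = cross_val_accuracy(X, labels, k=10)
print("kept genes:", keep, "10-fold CV accuracy:", acc)
```

Replacing `fit_centroids` and `predict` with a gradient-boosted classifier would bring the sketch closer to the method described, and a final evaluation on samples never used during cross-validation would correspond to the independent test set.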
Affiliation(s)
- Sijie Chen
  - Department of Mathematics, Ocean University of China, Qingdao, China
- Wenjing Zhou
  - Department of Oncology, Hiser Medical Center of Qingdao, Qingdao, China
- Jinghui Tu
  - Department of Mathematics, Ocean University of China, Qingdao, China
- Jian Li
  - Department of Mathematics, Ocean University of China, Qingdao, China
- Bo Wang
  - Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China; Geneis Beijing Co., Ltd., Beijing, China
- Xiaofei Mo
  - Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China; Geneis Beijing Co., Ltd., Beijing, China
- Geng Tian
  - Qingdao Geneis Institute of Big Data Mining and Precision Medicine, Qingdao, China; Geneis Beijing Co., Ltd., Beijing, China
- Kebo Lv
  - Department of Mathematics, Ocean University of China, Qingdao, China
- Zhijian Huang
  - Department of Breast Surgical Oncology, Fujian Cancer Hospital & Fujian Medical University Cancer Hospital, Fuzhou, China