1. Eledkawy A, Hamza T, El-Metwally S. Towards precision oncology: a multi-level cancer classification system integrating liquid biopsy and machine learning. BioData Min 2025; 18:29. [PMID: 40217526 PMCID: PMC11987386 DOI: 10.1186/s13040-025-00439-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2025] [Accepted: 03/10/2025] [Indexed: 04/14/2025]
Abstract
BACKGROUND Millions of people die from cancer every year. Early cancer detection is crucial for ensuring higher survival rates, as it provides an opportunity for timely medical interventions. This paper proposes a multi-level cancer classification system that uses plasma cfDNA/ctDNA mutations and protein biomarkers to identify seven distinct cancer types: colorectal, breast, upper gastrointestinal, lung, pancreas, ovarian, and liver. RESULTS The proposed system employs a multi-stage binary classification framework where each stage is customized for a specific cancer type. A majority vote feature selection process is employed by combining six feature selectors: Information Value, Chi-Square, Random Forest Feature Importance, Extra Tree Feature Importance, Recursive Feature Elimination, and L1 Regularization. Following the feature selection process, classifiers (including eXtreme Gradient Boosting, Random Forest, Extra Tree, and Quadratic Discriminant Analysis) are customized for each cancer type, either individually or in an ensemble soft voting setup, to optimize predictive accuracy. The proposed system outperformed previously published results, achieving an AUC of 98.2% and an accuracy of 96.21%. To ensure reproducibility of the results, the trained models and the dataset used in this study are made publicly available via the GitHub repository (https://github.com/SaraEl-Metwally/Towards-Precision-Oncology). CONCLUSION The identified biomarkers enhance the interpretability of the diagnosis, facilitating more informed decision-making. The system's performance underscores its effectiveness in tissue localization, contributing to improved patient outcomes through timely medical interventions.
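The majority vote over six feature selectors described in this abstract can be illustrated with a minimal sketch. The abstract does not state the vote threshold, so the strict-majority default below is an assumption, and the feature names and selector outputs are hypothetical placeholders rather than data from the paper.

```python
from collections import Counter

def majority_vote_features(selections, min_votes=None):
    """Keep features chosen by at least `min_votes` selectors.

    `selections` is a list with one collection of feature names per selector.
    The default threshold (strict majority) is an assumption for illustration.
    """
    if min_votes is None:
        min_votes = len(selections) // 2 + 1
    # Count each feature at most once per selector.
    votes = Counter(f for chosen in selections for f in set(chosen))
    return sorted(f for f, n in votes.items() if n >= min_votes)

# Hypothetical outputs of six selectors (placeholder feature names).
selections = [
    ["feat_a", "feat_b", "feat_c"],
    ["feat_a", "feat_b", "feat_d"],
    ["feat_a", "feat_c", "feat_e"],
    ["feat_a", "feat_b", "feat_c"],
    ["feat_b", "feat_d", "feat_e"],
    ["feat_a", "feat_b", "feat_f"],
]
selected = majority_vote_features(selections)
print(selected)
```

With six selectors the default threshold is four votes, so a feature picked by only three selectors is dropped; lowering `min_votes` trades stability for recall.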
Affiliation(s)
- Amr Eledkawy
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Taher Hamza
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Sara El-Metwally
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
  - Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt
2. Tan WY, Nagabhyrava S, Ang-Olson O, Das P, Ladel L, Sailo B, He L, Sharma A, Ahuja N. Translation of Epigenetics in Cell-Free DNA Liquid Biopsy Technology and Precision Oncology. Curr Issues Mol Biol 2024; 46:6533-6565. [PMID: 39057032 PMCID: PMC11276574 DOI: 10.3390/cimb46070390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2024] [Revised: 06/21/2024] [Accepted: 06/23/2024] [Indexed: 07/28/2024]
Abstract
Technological advancements in cell-free DNA (cfDNA) liquid biopsy have triggered exponential growth in numerous clinical applications. While cfDNA-based liquid biopsy has made significant strides in personalizing cancer treatment, the exploration and translation of epigenetics in liquid biopsy to clinical practice is still nascent. This comprehensive review seeks to provide a broad yet in-depth narrative of the present status of epigenetics in cfDNA liquid biopsy and its associated challenges. It highlights the potential of epigenetics in cfDNA liquid biopsy technologies with the hope of enhancing its clinical translation. The momentum of cfDNA liquid biopsy technologies in recent years has propelled epigenetics to the forefront of molecular biology. We have only begun to reveal the true potential of epigenetics, both in our understanding of disease and in the diagnostic and therapeutic domains. Recent clinical applications of epigenetics-based cfDNA liquid biopsy revolve around DNA methylation in screening and early cancer detection, leading to the development of multi-cancer early detection tests and the capability to pinpoint tissues of origin. The clinical applications of epigenetics in cfDNA liquid biopsy in minimal residual disease, monitoring, and surveillance are at their initial stages. A notable advancement in fragmentation pattern analysis has created a new avenue for epigenetic biomarkers. However, the widespread application of cfDNA liquid biopsy faces many challenges, including biomarker sensitivity and specificity, logistics (infrastructure and personnel), data processing and handling, results interpretation, accessibility, and cost effectiveness. Exploring and translating epigenetics in cfDNA liquid biopsy technology can transform our understanding and perception of cancer prevention and management.
cfDNA liquid biopsy has great potential in precision oncology to revolutionize conventional ways of early cancer detection, monitoring residual disease, treatment response, surveillance, and drug development. Adapting the implementation of liquid biopsy workflow to the local policy worldwide and developing point-of-care testing holds great potential to overcome global cancer disparity and improve cancer outcomes.
Affiliation(s)
- Wan Ying Tan
  - Department of Surgery, Yale School of Medicine, New Haven, CT 06520-8000, USA; (W.Y.T.); (P.D.); (L.L.); (B.S.); (L.H.)
  - Department of Internal Medicine, Norwalk Hospital, Norwalk, CT 06850, USA
  - Hematology & Oncology, Neag Comprehensive Cancer Center, UConn Health, Farmington, CT 06030, USA
- Olivia Ang-Olson
  - Department of Surgery, Yale School of Medicine, New Haven, CT 06520-8000, USA; (W.Y.T.); (P.D.); (L.L.); (B.S.); (L.H.)
- Paromita Das
  - Department of Surgery, Yale School of Medicine, New Haven, CT 06520-8000, USA; (W.Y.T.); (P.D.); (L.L.); (B.S.); (L.H.)
- Luisa Ladel
  - Department of Surgery, Yale School of Medicine, New Haven, CT 06520-8000, USA; (W.Y.T.); (P.D.); (L.L.); (B.S.); (L.H.)
  - Department of Internal Medicine, Norwalk Hospital, Norwalk, CT 06850, USA
- Bethsebie Sailo
  - Department of Surgery, Yale School of Medicine, New Haven, CT 06520-8000, USA; (W.Y.T.); (P.D.); (L.L.); (B.S.); (L.H.)
- Linda He
  - Department of Surgery, Yale School of Medicine, New Haven, CT 06520-8000, USA; (W.Y.T.); (P.D.); (L.L.); (B.S.); (L.H.)
- Anup Sharma
  - Department of Surgery, Yale School of Medicine, New Haven, CT 06520-8000, USA; (W.Y.T.); (P.D.); (L.L.); (B.S.); (L.H.)
- Nita Ahuja
  - Department of Surgery, Yale School of Medicine, New Haven, CT 06520-8000, USA; (W.Y.T.); (P.D.); (L.L.); (B.S.); (L.H.)
  - Department of Pathology, Yale School of Medicine, New Haven, CT 06520-8000, USA
  - Biological and Biomedical Sciences Program (BBS), Yale University, New Haven, CT 06520-8084, USA
3. Eledkawy A, Hamza T, El-Metwally S. Precision cancer classification using liquid biopsy and advanced machine learning techniques. Sci Rep 2024; 14:5841. [PMID: 38462648 PMCID: PMC10925597 DOI: 10.1038/s41598-024-56419-1] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 11/09/2023] [Accepted: 03/06/2024] [Indexed: 03/12/2024]
Abstract
Cancer presents a significant global health burden, resulting in millions of annual deaths. Timely detection is critical for improving survival rates, offering a crucial window for medical intervention. Liquid biopsy, which analyzes genetic variations and mutations in circulating cell-free DNA and circulating tumor DNA (cfDNA/ctDNA) or molecular biomarkers, has emerged as a tool for early detection. This study focuses on cancer detection using mutations in plasma cfDNA/ctDNA and protein biomarker concentrations. The proposed system initially calculates the correlation coefficient to identify correlated features, while mutual information assesses each feature's relevance to the target variable, eliminating redundant features to improve efficiency. The eXtreme Gradient Boosting (XGBoost) feature importance method iteratively selects the top ten features, resulting in a 60% dataset dimensionality reduction. The Light Gradient Boosting Machine (LGBM) model is employed for classification, optimizing its performance through a random search for hyper-parameters. Final predictions are obtained by ensembling the LGBM models from tenfold cross-validation: their outputs are weighted by their respective balanced accuracy and averaged. Applying this methodology, the proposed system achieves 99.45% accuracy and 99.95% AUC for detecting the presence of cancer while achieving 93.94% accuracy and 97.81% AUC for cancer-type classification. Our methodology leads to enhanced healthcare outcomes for cancer patients.
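The final ensembling step in this abstract (fold models weighted by balanced accuracy, then averaged) can be sketched as follows. This is a plain-Python illustration under stated assumptions: the function names are hypothetical, the example uses two folds with made-up probabilities instead of ten LGBM models, and the weights are normalized by their sum, which the abstract does not specify.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, which is robust to class imbalance."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def weighted_average_proba(fold_probas, fold_weights):
    """Average per-sample probabilities across fold models.

    Each fold's contribution is scaled by its weight; dividing by the
    weight sum keeps the result a valid probability (an assumption here).
    """
    total = sum(fold_weights)
    n_samples = len(fold_probas[0])
    return [
        sum(w * p[i] for w, p in zip(fold_weights, fold_probas)) / total
        for i in range(n_samples)
    ]

# Hypothetical validation result for one fold model.
score = balanced_accuracy([0, 0, 1, 1], [0, 1, 1, 1])

# Hypothetical positive-class probabilities from two fold models on two test samples.
fold_probas = [[0.8, 0.2], [0.4, 0.6]]
fold_weights = [0.75, 0.25]
ensemble = weighted_average_proba(fold_probas, fold_weights)
print(score, ensemble)
```

Weighting by balanced accuracy rather than plain accuracy gives less influence to fold models that only do well on the majority class.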
Affiliation(s)
- Amr Eledkawy
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Taher Hamza
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
- Sara El-Metwally
  - Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt
  - Biomedical Informatics Department, Faculty of Computer Science and Engineering, New Mansoura University, Gamasa, 35712, Egypt
4. Baul S, Ahmed KT, Filipek J, Zhang W. omicsGAT: Graph Attention Network for Cancer Subtype Analyses. Int J Mol Sci 2022; 23:10220. [PMID: 36142140 PMCID: PMC9499656 DOI: 10.3390/ijms231810220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2022] [Revised: 08/14/2022] [Accepted: 08/31/2022] [Indexed: 12/01/2022]
Abstract
The use of high-throughput omics technologies is becoming increasingly popular in all facets of biomedical science. The mRNA sequencing (RNA-seq) method reports quantitative measures of more than tens of thousands of biological features. It provides a more comprehensive molecular perspective of studied cancer mechanisms compared to traditional approaches. Graph-based learning models have been proposed to learn important hidden representations from gene expression data and network structure to improve cancer outcome prediction, patient stratification, and cell clustering. However, these graph-based methods cannot rank the importance of the different neighbors for a particular sample in the downstream cancer subtype analyses. In this study, we introduce omicsGAT, a graph attention network (GAT) model to integrate graph-based learning with an attention mechanism for RNA-seq data analysis. The multi-head attention mechanism in omicsGAT can more effectively secure information of a particular sample by assigning different attention coefficients to its neighbors. Comprehensive experiments on The Cancer Genome Atlas (TCGA) breast cancer and bladder cancer bulk RNA-seq data and two single-cell RNA-seq datasets validate that (1) the proposed model can effectively integrate neighborhood information of a sample and learn an embedding vector to improve disease phenotype prediction, cancer patient stratification, and cell clustering of the sample and (2) the attention matrix generated from the multi-head attention coefficients provides more useful information compared to the sample correlation-based adjacency matrix. From the results, we can conclude that some neighbors play a more important role than others in cancer subtype analyses of a particular sample based on the attention coefficient.
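The idea of assigning different attention coefficients to a sample's neighbors can be illustrated with a single-head sketch. In a graph attention network the raw scores come from a learned scoring function; here they are passed in as hypothetical numbers, and the function names and embeddings are assumptions for illustration, not the omicsGAT implementation.

```python
import math

def attention_coefficients(scores):
    """Softmax over one sample's neighbor scores, so coefficients sum to 1."""
    m = max(scores.values())  # subtract the max for numerical stability
    exp = {k: math.exp(v - m) for k, v in scores.items()}
    z = sum(exp.values())
    return {k: e / z for k, e in exp.items()}

def aggregate(embeddings, coeffs):
    """Attention-weighted sum of neighbor embedding vectors."""
    dim = len(next(iter(embeddings.values())))
    return [sum(coeffs[k] * embeddings[k][d] for k in coeffs) for d in range(dim)]

# Hypothetical raw attention scores for two neighbors of one sample.
coeffs = attention_coefficients({"n1": 2.0, "n2": 0.0})

# With equal scores, each neighbor contributes equally to the embedding.
equal = attention_coefficients({"n1": 0.0, "n2": 0.0})
embedding = aggregate({"n1": [1.0, 0.0], "n2": [0.0, 1.0]}, equal)
print(coeffs, embedding)
```

A multi-head version would repeat this with several independent score sets and concatenate or average the resulting vectors.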
Affiliation(s)
- Sudipto Baul
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
  - Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
- Khandakar Tanvir Ahmed
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
  - Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
- Joseph Filipek
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
  - Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
- Wei Zhang
  - Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
  - Genomics and Bioinformatics Cluster, University of Central Florida, Orlando, FL 32816, USA
5. Upadhyay P, Ray S. A Regularized Multi-Task Learning Approach for Cell Type Detection in Single-Cell RNA Sequencing Data. Front Genet 2022; 13:788832. [PMID: 35495159 PMCID: PMC9043858 DOI: 10.3389/fgene.2022.788832] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 10/03/2021] [Accepted: 02/16/2022] [Indexed: 11/29/2022]
Abstract
Cell type prediction is one of the most challenging goals in single-cell RNA sequencing (scRNA-seq) data. Existing methods use unsupervised learning to identify signature genes in each cluster, followed by a literature survey to look up those genes for assigning cell types. However, finding potential marker genes in each cluster is cumbersome, which impedes the systematic analysis of single-cell RNA sequencing data. To address this challenge, we proposed a framework based on regularized multi-task learning (RMTL) that enables us to simultaneously learn the subpopulation associated with a particular cell type. Learning the structure of subpopulations is treated as a separate task in the multi-task learner. Regularization is used to modulate the multi-task model (e.g., W1, W2, … Wt) jointly, according to the specific prior. For validating our model, we trained it with reference data constructed from a single-cell RNA sequencing experiment and applied it to a query dataset. We also predicted completely independent data (the query dataset) from the reference data which are used for training. We have checked the efficacy of the proposed method by comparing it with other state-of-the-art techniques well known for cell type detection. Results revealed that the proposed method performed accurately in detecting the cell type in scRNA-seq data and thus can be utilized as a useful tool in the scRNA-seq pipeline.
Affiliation(s)
- Piu Upadhyay
  - B.P. Poddar Institute of Management and Technology, Kolkata, India
- Sumanta Ray
  - Department of Computer Science and Engineering, Aliah University, Kolkata, India
  - Health Analytics Network, Pittsburgh, PA, United States
  - *Correspondence: Sumanta Ray
6. Wang L, Tan Y, Yang X, Kuang L, Ping P. Review on predicting pairwise relationships between human microbes, drugs and diseases: from biological data to computational models. Brief Bioinform 2022; 23:6553604. [PMID: 35325024 DOI: 10.1093/bib/bbac080] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 12/14/2021] [Revised: 02/14/2022] [Accepted: 02/15/2022] [Indexed: 12/11/2022]
Abstract
In recent years, with the rapid development of techniques in bioinformatics and life science, a considerable quantity of biomedical data has been accumulated, based on which researchers have developed various computational approaches to discover potential associations between human microbes, drugs and diseases. This paper provides a comprehensive overview of recent advances in predicting potential correlations between microbes, drugs and diseases, from biological data to computational models. First, we introduce in detail the widely used datasets relevant to identifying potential relationships between microbes, drugs and diseases. We then divide a series of representative computational models into five major categories (network, matrix factorization, matrix completion, regularization and artificial neural network) for in-depth discussion and comparison. Finally, we analyse possible challenges and opportunities in this research area and outline some suggestions for further improving predictive performance.
Affiliation(s)
- Lei Wang
  - College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
  - Key Laboratory of Hunan Province for Internet of Things and Information Security, Xiangtan University, Xiangtan, 411105, Hunan, China
- Yaqin Tan
  - College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
  - Key Laboratory of Hunan Province for Internet of Things and Information Security, Xiangtan University, Xiangtan, 411105, Hunan, China
- Xiaoyu Yang
  - College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
  - Key Laboratory of Hunan Province for Internet of Things and Information Security, Xiangtan University, Xiangtan, 411105, Hunan, China
- Linai Kuang
  - Key Laboratory of Hunan Province for Internet of Things and Information Security, Xiangtan University, Xiangtan, 411105, Hunan, China
- Pengyao Ping
  - College of Computer Engineering & Applied Mathematics, Changsha University, Changsha, 410022, Hunan, China
7. Learning with joint cross-document information via multi-task learning for named entity recognition. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.08.015] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/18/2022]
8. Xu Y, Cui X, Wang Y. Pan-Cancer Metastasis Prediction Based on Graph Deep Learning Method. Front Cell Dev Biol 2021; 9:675978. [PMID: 34179004 PMCID: PMC8220811 DOI: 10.3389/fcell.2021.675978] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/04/2021] [Accepted: 04/12/2021] [Indexed: 11/29/2022]
Abstract
Tumor metastasis is the major cause of mortality from cancer. From this perspective, detecting cancer gene expression and transcriptome changes is important for exploring the molecular mechanisms and cellular events of tumor metastasis. Precisely estimating a patient’s cancer state and prognosis is the key challenge in developing a patient’s therapeutic schedule. In recent years, a variety of machine learning techniques have contributed widely to analyzing real-world gene expression data and predicting tumor outcomes. In this area, data mining and machine learning techniques have contributed to gene expression data analysis by supplying computational models to support decision-making on real-world data. Nevertheless, the limitations of real-world data severely restrict model predictive performance, and the complexity of the data makes it difficult to extract vital features. Moreover, the efficacy of standard machine learning pipelines is far from satisfactory, even though diverse feature selection strategies have been applied. To address these problems, we developed a directed relation-graph convolutional network to provide an advanced feature extraction strategy. We first constructed a gene regulation network and extracted gene expression features based on the relational graph convolutional network method. The high-dimensional features of each sample were regarded as an image pixel, and a convolutional neural network was implemented to predict the risk of metastasis for each patient. Ten cross-validations on 1,779 cases from The Cancer Genome Atlas show that our model’s performance (area under the curve, AUC = 0.837; area under precision recall curve, AUPRC = 0.717) surpasses that of an existing network-based method (AUC = 0.707, AUPRC = 0.555).
Affiliation(s)
- Yining Xu
  - Department of Computer Science, Harbin Institute of Technology, Harbin, China
- Xinran Cui
  - Department of Computer Science, Harbin Institute of Technology, Harbin, China
- Yadong Wang
  - Department of Computer Science, Harbin Institute of Technology, Harbin, China
9. Dizaji KG, Chen W, Huang H. Deep Large-Scale Multitask Learning Network for Gene Expression Inference. J Comput Biol 2021; 28:485-500. [PMID: 34014778 DOI: 10.1089/cmb.2020.0438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
Gene expression profiling makes it possible to conduct many biological studies in a variety of fields due to its thorough characterization of cellular states under various experimental conditions. Despite recent advances in high-throughput technology, profiling an entire set of genomes is still difficult and expensive. Due to the high correlation between expression patterns of different genes, the aforementioned problem can be solved with a cost-effective approach that collects only a small subset of genes, called landmark genes, representing the entire set of genes, and infers the remaining genes, called target genes, using a computational model. There are several shallow and deep regression models in the literature to estimate the expressions of target genes from the landmark genes. However, the shallow models mostly have limited capacity to learn the nonlinear and complex gene expression data and are prone to underfitting, while the deep models generally do not take advantage of the correlation among target genes in the learning process and suffer from overfitting. Considering gene expression inference as a multitask learning problem, we propose a new deep multitask learning algorithm to tackle these issues. Our learning framework automatically learns the correlation between target genes and uses this knowledge to improve its generalization. Specifically, we utilize a subnetwork with low-dimensional latent variables to discover the relationships between target genes and apply a seamless, easy-to-implement regularization to our deep regression model. Unlike the existing multitask learning methods that can only deal with dozens or hundreds of tasks, our algorithm is able to efficiently learn the relationships between ∼10,000 target genes and, thus, is scalable to a large number of tasks.
Our proposed method outperforms the shallow and deep regression models for gene expression inference and alternative multitask learning algorithms on two large-scale datasets regardless of the network architecture.
Affiliation(s)
- Kamran Ghasedi Dizaji
  - Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
- Wei Chen
  - Department of Pediatrics, UPMC Children's Hospital of Pittsburgh, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
- Heng Huang
  - Department of Electrical and Computer Engineering, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
  - Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
10. Sun C, Hong S, Song M, Li H, Wang Z. Predicting COVID-19 disease progression and patient outcomes based on temporal deep learning. BMC Med Inform Decis Mak 2021; 21:45. [PMID: 33557818 PMCID: PMC7869774 DOI: 10.1186/s12911-020-01359-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 07/17/2020] [Accepted: 11/30/2020] [Indexed: 01/08/2023]
Abstract
BACKGROUND The coronavirus disease 2019 (COVID-19) pandemic has caused health concerns worldwide since December 2019. From the beginning of infection, patients progress through different symptom stages, such as fever, dyspnea or even death. Identifying disease progression and predicting patient outcome at an early stage helps target treatment and resource allocation. However, there is no clear COVID-19 stage definition, and few studies have addressed characterizing COVID-19 progression, making the need for this study evident. METHODS We proposed a temporal deep learning method based on a time-aware long short-term memory (T-LSTM) neural network and used an online open dataset, including blood samples of 485 patients from Wuhan, China, to train the model. Our method can capture the dynamic relations in irregularly sampled time series, which are ignored by existing works. Specifically, our method predicted the outcome of COVID-19 patients by considering both the biomarkers and the irregular time intervals. Then, we used the patient representations extracted from T-LSTM units to subtype the patient stages and describe the disease progression of COVID-19. RESULTS Using our method, outcome prediction accuracy was more than 90% at 12 days, and 98%, 95% and 93% at 3, 6 and 9 days, respectively. Most importantly, we found 4 stages of COVID-19 progression with different patient statuses and mortality risks. We ranked 40 biomarkers related to the disease and gave their reference values for each stage. The top 5 were Lymph, LDH, hs-CRP, Indirect Bilirubin and Creatinine. In addition, we found 3 complications: myocardial injury, liver function injury and renal function injury. Predicting which of the 4 stages the patient is currently in can help doctors better assess and treat the patient.
CONCLUSIONS To combat the COVID-19 epidemic, this paper aims to help clinicians better assess and treat infected patients, provide relevant researchers with potential disease progression patterns, and enable more effective use of medical resources. Our method predicted patient outcomes with high accuracy and identified a four-stage disease progression. We hope that the obtained results and patterns will aid in fighting the disease.
Affiliation(s)
- Chenxi Sun
  - School of Electronics Engineering and Computer Science, Peking University, Beijing, People's Republic of China
  - Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, People's Republic of China
- Shenda Hong
  - National Institute of Health Data Science, Peking University, Beijing, People's Republic of China
  - Institute of Medical Technology, Health Science Center of Peking University, Beijing, People's Republic of China
- Moxian Song
  - School of Electronics Engineering and Computer Science, Peking University, Beijing, People's Republic of China
  - Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, People's Republic of China
- Hongyan Li
  - School of Electronics Engineering and Computer Science, Peking University, Beijing, People's Republic of China
  - Key Laboratory of Machine Perception (Ministry of Education), Peking University, Beijing, People's Republic of China
- Zhenjie Wang
  - Institute of Population Research, Peking University, No.5 Yiheyuan Road, Beijing, 100871, People's Republic of China
11. Rahaman S, Li X, Yu J, Wong KC. CancerEMC: frontline non-invasive cancer screening from circulating protein biomarkers and mutations in cell-free DNA. Bioinformatics 2021; 37:3319-3327. [PMID: 33515231 DOI: 10.1093/bioinformatics/btab044] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/27/2020] [Revised: 12/19/2020] [Accepted: 01/20/2021] [Indexed: 12/24/2022]
Abstract
MOTIVATION The early detection of cancer through accessible blood tests can foster early patient interventions. Although there are developments in cancer detection from cell-free DNA (cfDNA), its accuracy remains speculative. Given its central importance and broad impact, we aspire to address the challenge. METHODS A bagging Ensemble Meta Classifier (CancerEMC) is proposed for early cancer detection based on circulating protein biomarkers and mutations in cfDNA from the blood. CancerEMC is designed for both binary cancer detection and multi-class cancer type localization. It can address the class imbalance problem in multi-analyte blood test data based on robust oversampling and adaptive synthesis techniques. RESULTS Based on the clinical blood test data, we observe that the proposed CancerEMC outperforms other algorithms and state-of-the-art studies (including CancerSEEK published in Science, 2018) for cancer detection. The results reveal that our proposed method (i.e., CancerEMC) achieves the best performance for both binary cancer classification with 99.1748% accuracy (AUC = 0.999) and localized multiple cancer detection with 74.1214% accuracy (AUC = 0.938). When the data imbalance issue is addressed with oversampling techniques, the accuracy can be increased to 91.4966% (AUC = 0.992), where the state-of-the-art method can only be estimated at 69.64% (AUC = 0.921). Similar results can also be observed on independent and isolated testing data. AVAILABILITY https://github.com/saifurcubd/Cancer-Detection.
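The class-imbalance handling mentioned in this abstract can be illustrated with the simplest member of the oversampling family. The abstract refers to "robust oversampling and adaptive synthesis techniques" without naming them, so the sketch below deliberately swaps in plain random duplication of minority-class samples as a stand-in; it does not synthesize new samples, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """Duplicate minority-class samples at random until all classes are equal in size.

    This is a stand-in for the synthesis-based techniques the abstract
    alludes to; it only repeats existing samples.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

# Hypothetical imbalanced toy data: four samples of one class, one of another.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = ["normal", "normal", "normal", "normal", "cancer"]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling should be applied to the training split only, so that duplicated samples never leak into the data used for evaluation.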
Affiliation(s)
- Saifur Rahaman
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Xiangtao Li
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Jun Yu
  - Institute of Digestive Diseases and The Department of Medicine and Therapeutics, State Key Laboratory of Digestive Disease, Li Ka Shing Institute of Health Sciences, CUHK Shenzhen Research Institute, The Chinese University of Hong Kong, Hong Kong SAR
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
12. Ahmed KT, Park S, Jiang Q, Yeu Y, Hwang T, Zhang W. Network-based drug sensitivity prediction. BMC Med Genomics 2020; 13:193. [PMID: 33371891 PMCID: PMC7771088 DOI: 10.1186/s12920-020-00829-3] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 11/10/2020] [Accepted: 11/17/2020] [Indexed: 12/15/2022]
Abstract
Background Drug sensitivity prediction and drug responsive biomarker selection on high-throughput genomic data is a critical step in drug discovery. Many computational methods have been developed to serve this purpose including several deep neural network models. However, the modular relations among genomic features have been largely ignored in these methods. To overcome this limitation, the role of the gene co-expression network on drug sensitivity prediction is investigated in this study. Methods In this paper, we first introduce a network-based method to identify representative features for drug response prediction by using the gene co-expression network. Then, two graph-based neural network models are proposed and both models integrate gene network information directly into neural network for outcome prediction. Next, we present a large-scale comparative study among the proposed network-based methods, canonical prediction algorithms (i.e., Elastic Net, Random Forest, Partial Least Squares Regression, and Support Vector Regression), and deep neural network models for drug sensitivity prediction. All the source code and processed datasets in this study are available at https://github.com/compbiolabucf/drug-sensitivity-prediction. Results In the comparison of different feature selection methods and prediction methods on a non-small cell lung cancer (NSCLC) cell line RNA-seq gene expression dataset with 50 different drug treatments, we found that (1) the network-based feature selection method improves the prediction performance compared to Pearson correlation coefficients; (2) Random Forest outperforms all the other canonical prediction algorithms and deep neural network models; (3) the proposed graph-based neural network models show better prediction performance compared to deep neural network model; (4) the prediction performance is drug dependent and it may relate to the drug’s mechanism of action. 
Conclusions The network-based feature selection method and prediction models improve the performance of drug response prediction. In high-dimensional, low-sample-size genomic datasets, the relations between genomic features are more robust and stable than the correlation between each individual genomic feature and the drug response.
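The abstract does not spell out how the co-expression network is used for feature selection, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes a thresholded Pearson co-expression graph and a simple neighbor-averaging step; the function names, the `threshold` and `alpha` parameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def coexpression_adjacency(X, threshold=0.5):
    """Binary gene-gene adjacency from |Pearson r| >= threshold (assumed rule).

    X has shape (samples, genes).
    """
    corr = np.corrcoef(X, rowvar=False)
    adj = (np.abs(corr) >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj

def pearson_scores(X, y):
    """Absolute Pearson correlation of each gene with the drug response y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc / denom)

def network_smoothed_scores(scores, adj, alpha=0.5):
    """Blend each gene's score with the mean score of its network neighbors.

    Genes without neighbors keep their original score.
    """
    degree = adj.sum(axis=1)
    neighbor_mean = np.where(
        degree > 0, (adj @ scores) / np.maximum(degree, 1.0), scores
    )
    return (1 - alpha) * scores + alpha * neighbor_mean

def select_top_k(scores, k):
    """Indices of the k highest-scoring genes."""
    return np.argsort(scores)[::-1][:k]

# Synthetic illustration only: 40 samples, 30 genes, response driven by genes 0 and 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 30))
y_demo = X_demo[:, 0] + X_demo[:, 1] + 0.1 * rng.normal(size=40)

adj_demo = coexpression_adjacency(X_demo, threshold=0.3)
raw = pearson_scores(X_demo, y_demo)
smoothed = network_smoothed_scores(raw, adj_demo)
print("Top genes by Pearson score:", select_top_k(raw, 5))
print("Top genes by smoothed score:", select_top_k(smoothed, 5))
```

The smoothing step is meant to reflect the abstract's claim that relations between features are more stable than individual feature-response correlations when samples are scarce: a gene's score borrows strength from the genes it is co-expressed with.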
Affiliation(s)
- Khandakar Tanvir Ahmed
  - Department of Computer Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL, 32816, USA
- Sunho Park
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, 9211 Euclid Ave, Cleveland, OH, 44106, USA
- Qibing Jiang
  - Department of Computer Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL, 32816, USA
- Yunku Yeu
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, 9211 Euclid Ave, Cleveland, OH, 44106, USA
- TaeHyun Hwang
  - Department of Quantitative Health Sciences, Lerner Research Institute, Cleveland Clinic, 9211 Euclid Ave, Cleveland, OH, 44106, USA
- Wei Zhang
  - Department of Computer Science, University of Central Florida, 4000 Central Florida Blvd, Orlando, FL, 32816, USA
|
13
|
Momenzadeh M, Sehhati M, Rabbani H. Using hidden Markov model to predict recurrence of breast cancer based on sequential patterns in gene expression profiles. J Biomed Inform 2020; 111:103570. [PMID: 32961308 DOI: 10.1016/j.jbi.2020.103570] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/06/2020] [Revised: 09/06/2020] [Accepted: 09/10/2020] [Indexed: 12/16/2022]
Abstract
A new approach is presented to predict breast cancer recurrence from gene expression profiles using hidden Markov models (HMMs). First, 322 genes were selected from 44 published gene lists related to breast cancer prognosis. Then, using gene set enrichment analysis, 922 gene sets were identified from subsets of genes with the same biological meaning. To extract sequential patterns from the gene expression data, we ranked the gene sets using appropriate criteria and used an HMM in which the ranked gene sets were treated as observation sequences and the hidden states represented the priority of gene sets for discriminating between expression profiles. Seven publicly available microarray datasets, comprising 1271 breast tumor samples, were used to classify cancer patients into two groups according to risk of recurrence. Our experiments indicated higher performance and greater robustness for the proposed model compared with other widely used classification methods.
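As a rough illustration of how an HMM can assign an observation sequence to one of two groups, the sketch below scores a discrete sequence with the scaled forward algorithm under two toy models and picks the more likely one. The abstract does not describe the training or decision rule, so the one-model-per-class setup, the model parameters, and the names `low_risk` and `high_risk` are assumptions made purely for illustration.

```python
import numpy as np

def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM.

    Uses the forward algorithm with per-step rescaling to avoid underflow.
    start: (S,) initial state probabilities
    trans: (S, S) transition matrix, rows sum to 1
    emit:  (S, V) emission matrix, rows sum to 1
    """
    alpha = start * emit[:, obs[0]]
    log_lik = 0.0
    for symbol in obs[1:]:
        scale = alpha.sum()
        log_lik += np.log(scale)
        # Normalize, propagate one step, then weight by the emission probability.
        alpha = ((alpha / scale) @ trans) * emit[:, symbol]
    return log_lik + np.log(alpha.sum())

def classify(obs, models):
    """Return the name of the model with the highest sequence log-likelihood."""
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))

# Toy parameters (hypothetical): two hidden states, three observable symbols.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
models = {
    "low_risk": (start, trans, np.array([[0.7, 0.2, 0.1],
                                         [0.5, 0.3, 0.2]])),
    "high_risk": (start, trans, np.array([[0.1, 0.2, 0.7],
                                          [0.2, 0.3, 0.5]])),
}

sequence = [2, 2, 1, 2]  # stand-in for a ranked gene-set sequence encoded as symbols
print(classify(sequence, models))
```

In practice the model parameters would be estimated from labeled training sequences (for example with Baum-Welch) rather than set by hand as they are here.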
Affiliation(s)
- Mohammadreza Momenzadeh
  - Department of Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Mohammadreza Sehhati
  - Department of Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran; Medical Image and Signal Processing Research Center, Isfahan University of Medical Sciences, Isfahan, Iran; Department of Bioinformatics, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran
- Hossein Rabbani
  - Department of Biomedical Engineering, School of Advanced Technologies in Medicine, Isfahan University of Medical Sciences, Isfahan, Iran; Medical Image and Signal Processing Research Center, Isfahan University of Medical Sciences, Isfahan, Iran
|