1
Zeng Y, Zhang Y, Xiao Z, Sui H. A multi-classification deep neural network for cancer type identification from high-dimension, small-sample and imbalanced gene microarray data. Sci Rep 2025; 15:5239. [PMID: 39939378 PMCID: PMC11822135 DOI: 10.1038/s41598-025-89475-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2024] [Accepted: 02/05/2025] [Indexed: 02/14/2025] Open
Abstract
Gene microarray technology provides an efficient way to diagnose cancer. However, microarray gene expression data face the challenges of high dimensionality, small sample size, and multi-class imbalance. The coupling of these challenges leads to inaccurate results when using traditional feature selection and classification algorithms. Owing to their fast learning speed and good classification performance, deep neural networks such as the generative adversarial network have proven to be among the best classification algorithms, especially in the bioinformatics domain. However, the generative adversarial network is limited to binary applications and is inefficient at processing high-dimensional sparse features. This paper proposes a multi-classification generative adversarial network model combined with feature bundling (MGAN-FB) to handle the coupling of high dimensionality, small sample size, and multi-class imbalance for gene microarray data classification at both the feature and algorithmic levels. At the feature level, a deep encoder structure combining a feature bundling (FB) mechanism and a squeeze and excite (SE) mechanism is designed for the generator, so that the sparsity, correlation and consequence of high-dimensional features are all taken into consideration for adaptive feature extraction. It achieves effective dimensionality reduction without transitional information loss. At the algorithmic level, a softmax module coupled with a multi-classifier is introduced into the discriminator, and a new objective function is designed for the proposed MGAN-FB model, considering encode loss, reconstruction loss, discrimination loss and multi-classification loss. We extend the generative adversarial framework from binary classification to the multi-classification field. Experiments on eight open-source gene microarray datasets, evaluated by classification performance, running time and non-parametric tests, demonstrate that the proposed method has obvious advantages over the seven compared methods.
Affiliation(s)
- Yifu Zeng
  - Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
  - Department of Information Technology, The Second Affiliated Hospital of Fujian Medical University, Quanzhou, China
- Yixiang Zhang
  - Department of Infectious Diseases, The Second Affiliated Hospital of Fujian Medical University, Quanzhou, China
- Zikai Xiao
  - Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
- He Sui
  - College of Aeronautical Engineering, Civil Aviation University of China, Tianjin, 300300, China
  - Information Security Evaluation Center, Civil Aviation University of China, Tianjin, 300300, China
2
Li R, Yi H, Ma S. A Selective Review of Network Analysis Methods for Gene Expression Data. Methods Mol Biol 2025; 2880:293-307. [PMID: 39900765 DOI: 10.1007/978-1-0716-4276-4_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2025]
Abstract
With the development of high-throughput profiling techniques, gene expression data have drawn significant attention due to their important biological implications, widespread availability, and promising biological findings. The complex interactions and regulations among genes naturally lead to a network structure, which can provide a global view of molecular mechanisms and biological processes. This chapter provides a selective overview of constructing gene expression networks and utilizing them in downstream analysis. It also includes a demonstrative example.
Affiliation(s)
- Rong Li
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Huangdi Yi
  - Servier Pharmaceuticals, Boston, MA, USA
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
3
Khullar S, Huang X, Ramesh R, Svaren J, Wang D. NetREm: Network Regression Embeddings reveal cell-type transcription factor coordination for gene regulation. Bioinformatics Advances 2024; 5:vbae206. [PMID: 40260118 PMCID: PMC12011367 DOI: 10.1093/bioadv/vbae206] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 10/22/2024] [Accepted: 12/18/2024] [Indexed: 04/23/2025]
Abstract
Motivation: Transcription factor (TF) coordination plays a key role in gene regulation via direct and/or indirect protein-protein interactions (PPIs) and co-binding to regulatory elements on DNA. Single-cell technologies facilitate gene expression measurement for individual cells and cell-type identification, yet the connection between TF-TF coordination and target gene (TG) regulation of various cell types remains unclear. Results: To address this, we introduce our innovative computational approach, Network Regression Embeddings (NetREm), to reveal cell-type TF-TF coordination activities for TG regulation. NetREm leverages network-constrained regularization, using prior knowledge of PPIs among TFs, to analyze single-cell gene expression data, uncovering cell-type coordinating TFs and identifying novel TF-TG candidate regulatory network links. NetREm's performance is validated using simulation studies and benchmarked across several datasets in humans, mice, and yeast. Further, we showcase NetREm's ability to prioritize valid novel human TF-TF coordination links in 9 peripheral blood mononuclear and 42 immune cell sub-types. We apply NetREm to examine cell-type networks in central and peripheral nervous systems (e.g. neuronal, glial, Schwann cells) and in Alzheimer's disease versus Controls. Top predictions are validated with experimental data from rat, mouse, and human models. Additional functional genomics data help link genetic variants to our TF-TG regulatory and TF-TF coordination networks. Availability and implementation: https://github.com/SaniyaKhullar/NetREm.
Affiliation(s)
- Saniya Khullar
  - Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, United States
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53076, United States
- Xiang Huang
  - Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, United States
- Raghu Ramesh
  - Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, United States
  - Comparative Biomedical Sciences Training Program, University of Wisconsin-Madison, Madison, WI 53706, United States
- John Svaren
  - Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, United States
  - Department of Comparative Biosciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI 53706, United States
- Daifeng Wang
  - Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, United States
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53076, United States
  - Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53706, United States
4
Sun C, Liu ZP. Discovering explainable biomarkers for breast cancer anti-PD1 response via network Shapley value analysis. Computer Methods and Programs in Biomedicine 2024; 257:108481. [PMID: 39488042 DOI: 10.1016/j.cmpb.2024.108481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2024] [Revised: 10/20/2024] [Accepted: 10/24/2024] [Indexed: 11/04/2024]
Abstract
BACKGROUND AND OBJECTIVE: Immunotherapy holds promise in enhancing pathological complete response rates in breast cancer, albeit confined to a select cohort of patients. Consequently, pinpointing factors predictive of treatment responsiveness is of paramount importance. Gene expression and regulation, inherently operating within intricate networks, constitute fundamental molecular machinery for cellular processes and often serve as robust biomarkers. Nevertheless, contemporary feature selection approaches grapple with two key challenges: opacity in modeling and scarcity in accounting for gene-gene interactions. METHODS: To address these limitations, we devise a novel feature selection methodology grounded in cooperative game theory, harmoniously integrating with sophisticated machine learning models. This approach identifies interconnected gene regulatory network biomarker modules with a priori genetic linkage architecture. Specifically, we leverage Shapley values on the network to quantify feature importance, while strategically constraining their integration based on network expansion principles and nodal adjacency, thereby fostering enhanced interpretability in feature selection. We apply our methods to a publicly available single-cell RNA sequencing dataset of breast cancer immunotherapy responses, using the identified feature gene set as biomarkers. Functional enrichment analysis with independent validations further illustrates their effective predictive performance. RESULTS: We demonstrate the sophistication and excellence of the proposed method in data with network structure. It unveiled a cohesive biomarker module encompassing 27 genes for immunotherapy response. Notably, this module proves adept at precisely predicting anti-PD1 therapeutic outcomes in breast cancer patients with classification accuracy of 0.905 and AUC value of 0.971, underscoring its unique capacity to illuminate gene functionalities. CONCLUSION: The proposed method is effective for identifying network module biomarkers, and the detected anti-PD1 response biomarkers can enrich our understanding of the underlying physiological mechanisms of immunotherapy, which have a promising application for realizing precision medicine.
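As a reading aid for the Shapley-value idea mentioned in this abstract, the following is a minimal sketch of the textbook (unconstrained) Shapley value, computed exactly by averaging marginal contributions over all player orderings. It is not the authors' network-constrained variant; the gene labels and the `toy_value` payoff function are hypothetical and chosen only so the result can be checked by hand.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payoff: genes A and B help only jointly, C helps alone."""
    score = 0.0
    if {"A", "B"} <= coalition:
        score += 0.6
    if "C" in coalition:
        score += 0.2
    return score


phi = shapley_values(["A", "B", "C"], toy_value)
```

In this toy game the joint contribution of A and B is split evenly between them, and the values sum to the payoff of the full coalition. Exact enumeration grows factorially with the number of features, so practical methods rely on sampling or structural constraints.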
Affiliation(s)
- Chenxi Sun
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
- Zhi-Ping Liu
  - Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, Shandong 250061, China
5
Liang H, Luo H, Sang Z, Jia M, Jiang X, Wang Z, Cong S, Yao X. GREMI: An Explainable Multi-Omics Integration Framework for Enhanced Disease Prediction and Module Identification. IEEE J Biomed Health Inform 2024; 28:6983-6996. [PMID: 39110558 DOI: 10.1109/jbhi.2024.3439713] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/10/2024]
Abstract
Multi-omics integration has demonstrated promising performance in complex disease prediction. However, existing research typically focuses on maximizing prediction accuracy, while often neglecting the essential task of discovering meaningful biomarkers. This issue is particularly important in biomedicine, as molecules often interact rather than function individually to influence disease outcomes. To this end, we propose a two-phase framework named GREMI to assist multi-omics classification and explanation. In the prediction phase, we propose to improve prediction performance by employing a graph attention architecture on sample-wise co-functional networks to incorporate biomolecular interaction information for enhanced feature representation, followed by the integration of a joint-late mixed strategy and the true-class-probability block to adaptively evaluate classification confidence at both feature and omics levels. In the interpretation phase, we propose a multi-view approach to explain disease outcomes from the interaction module perspective, providing a more intuitive understanding and biomedical rationale. We incorporate Monte Carlo tree search (MCTS) to explore local-view subgraphs and pinpoint modules that highly contribute to disease characterization from the global-view. Extensive experiments demonstrate that the proposed framework outperforms state-of-the-art methods in seven different classification tasks, and our model effectively addresses data mutual interference when the number of omics types increases. We further illustrate the functional- and disease-relevance of the identified modules, as well as validate the classification performance of discovered modules using an independent cohort.
6
Jagadesh P, Khan AH, Priya BS, Asheeka A, Zoubir Z, Magbool HM, Alam S, Bakather OY. Artificial neural network, machine learning modelling of compressive strength of recycled coarse aggregate based self-compacting concrete. PLoS One 2024; 19:e0303101. [PMID: 38739642 PMCID: PMC11090367 DOI: 10.1371/journal.pone.0303101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2024] [Accepted: 04/15/2024] [Indexed: 05/16/2024] Open
Abstract
This research study aims to understand the application of Artificial Neural Networks (ANNs) to forecast the Self-Compacting Recycled Coarse Aggregate Concrete (SCRCAC) compressive strength. From different literature, 602 available data sets from SCRCAC mix designs are collected, and the data are rearranged, reconstructed, trained and tested for the ANN model development. The models were established using seven input variables: the mass of cementitious content, water, natural coarse aggregate content, natural fine aggregate content, recycled coarse aggregate content, chemical admixture and mineral admixture used in the SCRCAC mix designs. Two normalization techniques are used for data normalization to visualize the data distribution. For each normalization technique, three transfer functions are used for modelling. In total, six different types of models were run in MATLAB and used to estimate the 28th day SCRCAC compressive strength. Normalization technique 2 performs better than 1 and TANSIG is the best transfer function. The best k-fold cross-validation fold is k = 7. The coefficient of determination for predicted and actual compressive strength is 0.78 for training and 0.86 for testing. The impact of the number of neurons and layers on the model was examined. Inputs from standards are used to forecast the 28th day compressive strength. Apart from ANN, Machine Learning (ML) techniques like random forest, extra trees, extreme boosting and light gradient boosting techniques are adopted to predict the 28th day compressive strength of SCRCAC. Compared to ML, ANN prediction shows better results in terms of sensitivity analysis. The study was also extended to determine the 28th day compressive strength from experimental work and compare it with the 28th day compressive strength from the best ANN model. Standard and ANN mix designs have similar fresh and hardened properties. The average compressive strengths from the ANN model and the experimental results are 39.067 and 38.36 MPa, respectively, with a correlation coefficient of 1. It appears that ANN can validly predict the compressive strength of concrete.
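For readers unfamiliar with the two evaluation tools named in this abstract, here is a minimal sketch of k-fold index splitting and the coefficient of determination in plain Python. The function names and the contiguous-fold strategy are assumptions for illustration; the abstract does not say how the authors partitioned their 602 records or whether they shuffled them.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    # The first (n_samples % k) folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


# 602 records and k = 7 are the figures quoted in the abstract.
folds = list(kfold_indices(602, 7))
```

With these figures each fold holds out 86 records and trains on the remaining 516. A model that always predicts the mean of the targets scores an R² of 0, and a perfect model scores 1.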
Affiliation(s)
- P. Jagadesh
  - Department of Civil Engineering, Coimbatore Institute of Technology, Coimbatore, Tamil Nadu
- Afzal Hussain Khan
  - Civil Engineering Department, College of Engineering, Jazan University, Jazan, Saudi Arabia
- B. Shanmuga Priya
  - Department of Civil Engineering, Coimbatore Institute of Technology, Coimbatore, Tamil Nadu
- A. Asheeka
  - Department of Civil Engineering, Coimbatore Institute of Technology, Coimbatore, Tamil Nadu
- Zineb Zoubir
  - Green Energy Park (IRESEN, UM6P), km2 R206, Benguerir, Morocco
- Hassan M. Magbool
  - Civil Engineering Department, College of Engineering, Jazan University, Jazan, Saudi Arabia
- Shamshad Alam
  - Civil Engineering Department, College of Engineering, Jazan University, Jazan, Saudi Arabia
- Omer Y. Bakather
  - Department of Chemical Engineering, College of Engineering, Jazan University, Jazan, Saudi Arabia
7
Chereda H, Leha A, Beißbarth T. Stable feature selection utilizing Graph Convolutional Neural Network and Layer-wise Relevance Propagation for biomarker discovery in breast cancer. Artif Intell Med 2024; 151:102840. [PMID: 38658129 DOI: 10.1016/j.artmed.2024.102840] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 03/05/2024] [Accepted: 03/10/2024] [Indexed: 04/26/2024]
Abstract
High-throughput technologies are becoming increasingly important in discovering prognostic biomarkers and in identifying novel drug targets. With Mammaprint, Oncotype DX, and many other prognostic molecular signatures, breast cancer is one of the paradigmatic examples of the utility of high-throughput data to deliver prognostic biomarkers that can be represented in the form of a rather short gene list. Such gene lists can be obtained as a set of features (genes) that are important for the decisions of a Machine Learning (ML) method applied to high-dimensional gene expression data. Several studies have identified predictive gene lists for patient prognosis in breast cancer, but these lists are unstable and have only a few genes in common. Instability of feature selection impedes biological interpretability: genes that are relevant for cancer pathology should be members of any predictive gene list obtained for the same clinical type of patients. Stability and interpretability of selected features can be improved by including information on molecular networks in ML methods. Graph Convolutional Neural Network (GCNN) is a contemporary deep learning approach applicable to gene expression data structured by a prior knowledge molecular network. Layer-wise Relevance Propagation (LRP) and SHapley Additive exPlanations (SHAP) are methods to explain individual decisions of deep learning models. We used both GCNN+LRP and GCNN+SHAP techniques to construct feature sets by aggregating individual explanations. We suggest a methodology to systematically and quantitatively analyze the stability, the impact on the classification performance, and the interpretability of the selected feature sets. We used this methodology to compare GCNN+LRP to GCNN+SHAP and to more classical ML-based feature selection approaches. Utilizing a large breast cancer gene expression dataset we show that, while feature selection with SHAP is useful in applications where selected features have to be impactful for classification performance, among all studied methods GCNN+LRP delivers the most stable (reproducible) and interpretable gene lists.
Affiliation(s)
- Hryhorii Chereda
  - Medical Bioinformatics, University Medical Center Göttingen, Goldschmidtstraße 1, Göttingen, 37077, Germany
- Andreas Leha
  - Medical Bioinformatics, University Medical Center Göttingen, Goldschmidtstraße 1, Göttingen, 37077, Germany
  - Medical Statistics, University Medical Center Göttingen, Humboldtallee 32, Göttingen, 37073, Germany
  - Scientific Core Facility Medical Biometry and Statistical Bioinformatics, University Medical Center Göttingen, Humboldtallee 32, Göttingen, 37073, Germany
- Tim Beißbarth
  - Medical Bioinformatics, University Medical Center Göttingen, Goldschmidtstraße 1, Göttingen, 37077, Germany
  - Campus-Institute Data Science (CIDAS), University of Göttingen, Goldschmidtstraße 1, Göttingen, 37077, Germany
8
Nissar I, Alam S, Masood S, Kashif M. MOB-CBAM: A dual-channel attention-based deep learning generalizable model for breast cancer molecular subtypes prediction using mammograms. Computer Methods and Programs in Biomedicine 2024; 248:108121. [PMID: 38531147 DOI: 10.1016/j.cmpb.2024.108121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 02/15/2024] [Accepted: 03/06/2024] [Indexed: 03/28/2024]
Abstract
BACKGROUND AND OBJECTIVE: Deep Learning models have emerged as a significant tool in generating efficient solutions for complex problems including cancer detection, as they can analyze large amounts of data with high efficiency and performance. Recent medical studies highlight the significance of molecular subtype detection in breast cancer, aiding the development of personalized treatment plans as different subtypes of cancer respond better to different therapies. METHODS: In this work, we propose a novel lightweight dual-channel attention-based deep learning model MOB-CBAM that utilizes the backbone of MobileNet-V3 architecture with a Convolutional Block Attention Module to make highly accurate and precise predictions about breast cancer. We used the CMMD mammogram dataset to evaluate the proposed model in our study. Nine distinct data subsets were created from the original dataset to perform coarse and fine-grained predictions, enabling the model to identify masses, calcifications, benign and malignant tumors, and molecular subtypes of cancer, including Luminal A, Luminal B, HER-2 Positive, and Triple Negative. The pipeline incorporates several image pre-processing techniques, including filtering, enhancement, and normalization, for enhancing the model's generalization ability. RESULTS: While identifying benign versus malignant tumors, i.e., coarse-grained classification, the MOB-CBAM model produced exceptional results with 99% accuracy, precision, recall, and F1-score values of 0.99 and MCC of 0.98. In terms of fine-grained classification, the MOB-CBAM model has proven to be highly efficient in the mass (benign/malignant) and calcification (benign/malignant) classification tasks, with an accuracy of 98%. We have also cross-validated the efficiency of the proposed MOB-CBAM deep learning architecture on two datasets: MIAS and CBIS-DDSM. On the MIAS dataset, an accuracy of 97% was reported for the task of classifying benign, malignant, and normal images, while on the CBIS-DDSM dataset, an accuracy of 98% was achieved for the classification of masses as benign or malignant and of calcifications as benign or malignant. CONCLUSION: This study presents lightweight MOB-CBAM, a novel deep learning framework, to address breast cancer diagnosis and subtype prediction. The model's innovative incorporation of the CBAM enhances precise predictions. The extensive evaluation on the CMMD dataset and cross-validation on other datasets affirm the model's efficacy.
Affiliation(s)
- Iqra Nissar
  - Department of Computer Engineering, Jamia Millia Islamia (A Central University), New Delhi, 110025, India
- Shahzad Alam
  - Department of Computer Engineering, Jamia Millia Islamia (A Central University), New Delhi, 110025, India
- Sarfaraz Masood
  - Department of Computer Engineering, Jamia Millia Islamia (A Central University), New Delhi, 110025, India
- Mohammad Kashif
  - Department of Computer Engineering, Jamia Millia Islamia (A Central University), New Delhi, 110025, India
9
Yang B, Wang L, Bao W. Identify Diabetes-related Targets based on ForgeNet_GPC. Curr Comput Aided Drug Des 2024; 20:1042-1054. [PMID: 38173214 DOI: 10.2174/0115734099258183230929173855] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Revised: 08/06/2023] [Accepted: 08/15/2023] [Indexed: 01/05/2024]
Abstract
BACKGROUND: Research on potential therapeutic targets and new mechanisms of action can greatly improve the efficiency of new drug development. AIMS: Polygenic genetic diseases, such as diabetes, are caused by the interaction of multiple gene loci and environmental factors. OBJECTIVES: In this study, a disease target identification algorithm based on protein recognition is proposed. MATERIALS AND METHODS: In this method, the related and unrelated targets are collected from literature databases for treating diabetes. The transcribed proteins corresponding to each target are queried in order to construct a protein dataset. Six protein feature extraction algorithms (AAC, CKSAAGP, DDE, DPC, GAAP, and TPC) are utilized to obtain the feature vectors of each protein, which are merged into the full feature vectors. RESULTS: A novel classifier (forgeNet_GPC) based on forgeNet and Gaussian process classifier (GPC) is proposed to classify the proteins. CONCLUSION: In forgeNet_GPC, forgeNet is utilized to select the important features, and GPC is utilized to solve the classification problem. The experimental results reveal that forgeNet_GPC performs better than 22 classifiers in terms of ROC-AUC, PR-AUC, MCC, Youden Index, and Kappa.
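To make the feature-extraction step concrete, the sketch below computes two of the six descriptors named in the abstract under their usual definitions: amino acid composition (AAC, 20 residue frequencies) and dipeptide composition (DPC, 400 adjacent-pair frequencies), then concatenates them. The function names and the example peptide are hypothetical, and the other four descriptors and the authors' exact preprocessing are not covered.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
DIPEPTIDES = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def aac(seq):
    """Amino acid composition: frequency of each standard residue."""
    if not seq:
        raise ValueError("sequence must not be empty")
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: frequency of each adjacent residue pair."""
    if len(seq) < 2:
        raise ValueError("sequence needs at least two residues")
    total = len(seq) - 1  # number of overlapping adjacent pairs
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[p] / total for p in DIPEPTIDES]


def full_features(seq):
    """Concatenate the per-descriptor vectors into one feature vector."""
    return aac(seq) + dpc(seq)


features = full_features("MKTAYIAKQR")  # hypothetical 10-residue peptide
```

The merged vector here has 420 entries (20 + 400). Adding further descriptors extends the same concatenation, which is why a feature selection stage becomes useful downstream.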
Affiliation(s)
- Bin Yang
  - School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
- Linlin Wang
  - School of Information Science and Engineering, Zaozhuang University, Zaozhuang, 277160, China
- Wenzheng Bao
  - School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou, 221018, China
10
Tian L, Yu T. An integrated deep learning framework for the interpretation of untargeted metabolomics data. Brief Bioinform 2023; 24:bbad244. [PMID: 37369636 DOI: 10.1093/bib/bbad244] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2023] [Revised: 06/02/2023] [Accepted: 06/12/2023] [Indexed: 06/29/2023] Open
Abstract
Untargeted metabolomics is gaining widespread applications. The key aspects of the data analysis include modeling complex activities of the metabolic network, selecting metabolites associated with clinical outcome and finding critical metabolic pathways to reveal biological mechanisms. One of the key roadblocks in data analysis is not well addressed: the problem of matching uncertainty between data features and known metabolites. Given the limitations of the experimental technology, the identities of data features cannot be directly revealed in the data. The predominant approach for mapping features to metabolites is to match the mass-to-charge ratio (m/z) of data features to those derived from theoretical values of known metabolites. The relationship between features and metabolites is not one-to-one since some metabolites share molecular composition, and various adduct ions can be derived from the same metabolite. This matching uncertainty causes unreliable metabolite selection and functional analysis results. Here we introduce an integrated deep learning framework for metabolomics data that takes matching uncertainty into consideration. The model is devised with a gradual sparsification neural network based on the known metabolic network and the annotation relationship between features and metabolites. This architecture characterizes metabolomics data and reflects the modular structure of biological systems. Three goals can be achieved simultaneously without requiring much complex inference and additional assumptions: (1) evaluate metabolite importance, (2) infer feature-metabolite matching likelihood and (3) select disease sub-networks. When applied to a COVID metabolomics dataset and an aging mouse brain dataset, our method found metabolic sub-networks that were easily interpretable.
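The matching uncertainty described in this abstract can be illustrated with a small sketch that compares an observed m/z against theoretical adduct masses within a ppm tolerance. The tiny lookup tables, the 10 ppm default, and the function name are assumptions for illustration; the masses are approximate monoisotopic values, and this is not the authors' annotation pipeline.

```python
# Approximate monoisotopic masses in Da; illustrative, not a curated database.
METABOLITE_MASS = {
    "glucose": 180.06339,
    "fructose": 180.06339,  # isomer of glucose: same formula, same mass
    "citric acid": 192.02700,
}

# Approximate mass shifts for two common positive-mode adducts.
ADDUCT_SHIFT = {"[M+H]+": 1.007276, "[M+Na]+": 22.989218}


def candidate_matches(feature_mz, tol_ppm=10.0):
    """Return every (metabolite, adduct) whose theoretical m/z is within tolerance."""
    matches = []
    for name, mass in METABOLITE_MASS.items():
        for adduct, shift in ADDUCT_SHIFT.items():
            theoretical = mass + shift
            ppm_error = abs(feature_mz - theoretical) / theoretical * 1e6
            if ppm_error <= tol_ppm:
                matches.append((name, adduct))
    return matches


sodium_hits = candidate_matches(203.0526)   # ambiguous: two isomers match
citrate_hits = candidate_matches(193.0343)  # unambiguous in this toy table
```

The first query returns both glucose and fructose as sodium adducts, which is the one-to-many situation the abstract describes; mass alone cannot separate them.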
Affiliation(s)
- Leqi Tian
  - School of Data Science, The Chinese University of Hong Kong - Shenzhen, Guangdong, China
  - Shenzhen Research Institute of Big Data, Guangdong, China
- Tianwei Yu
  - School of Data Science, The Chinese University of Hong Kong - Shenzhen, Guangdong, China
  - Shenzhen Research Institute of Big Data, Guangdong, China
  - Guangdong Provincial Key Laboratory of Big Data Computing, Guangdong, China
11
Tian L, Wu W, Yu T. Graph Random Forest: A Graph Embedded Algorithm for Identifying Highly Connected Important Features. Biomolecules 2023; 13:1153. [PMID: 37509188 PMCID: PMC10377046 DOI: 10.3390/biom13071153] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 05/14/2023] [Revised: 06/26/2023] [Accepted: 06/30/2023] [Indexed: 07/30/2023] Open
Abstract
Random Forest (RF) is a widely used machine learning method with good performance on classification and regression tasks. It works well under low sample size situations, which benefits applications in the field of biology. For example, gene expression data often involve much larger numbers of features (p) compared to the size of samples (n). Though the predictive accuracy using RF is often high, there are some problems when selecting important genes using RF. The important genes selected by RF are usually scattered on the gene network, which conflicts with the biological assumption of functional consistency between effective features. To improve feature selection by incorporating external topological information between genes, we propose the Graph Random Forest (GRF) for identifying highly connected important features by involving the known biological network when constructing the forest. The algorithm can identify effective features that form highly connected sub-graphs and achieve equivalent classification accuracy to RF. To evaluate the capability of our proposed method, we conducted simulation experiments and applied the method to two real datasets: non-small cell lung cancer RNA-seq data from The Cancer Genome Atlas and a human embryonic stem cell RNA-seq dataset (GSE93593). The resulting high classification accuracy, connectivity of selected sub-graphs, and interpretable feature selection results suggest the method is a helpful addition to graph-based classification models and feature selection procedures.
Affiliation(s)
- Leqi Tian
  - School of Data Science, The Chinese University of Hong Kong, Shenzhen 518172, China
  - Shenzhen Research Institute of Big Data, Shenzhen 518172, China
- Wenbin Wu
  - School of Data Science, The Chinese University of Hong Kong, Shenzhen 518172, China
- Tianwei Yu
  - School of Data Science, The Chinese University of Hong Kong, Shenzhen 518172, China
  - Shenzhen Research Institute of Big Data, Shenzhen 518172, China
  - Guangdong Provincial Key Laboratory of Big Data Computing, Shenzhen 518172, China
12
Lee S, Jung H, Park J, Ahn J. Accurate Prediction of Cancer Prognosis by Exploiting Patient-Specific Cancer Driver Genes. Int J Mol Sci 2023; 24:6445. [PMID: 37047418 PMCID: PMC10095073 DOI: 10.3390/ijms24076445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2023] [Revised: 03/17/2023] [Accepted: 03/28/2023] [Indexed: 04/03/2023] Open
Abstract
Accurate prediction of the prognoses of cancer patients and identification of prognostic biomarkers are both important for the improved treatment of cancer patients, in addition to enhanced anticancer drugs. Many previous bioinformatic studies have been carried out to achieve this goal; however, there remains room for improvement in terms of accuracy. In this study, we demonstrated that patient-specific cancer driver genes could be used to predict cancer prognoses more accurately. To identify patient-specific cancer driver genes, we first generated patient-specific gene networks before using modified PageRank to generate feature vectors that represented the impacts genes had on the patient-specific gene network. Subsequently, the feature vectors of the good and poor prognosis groups were used to train the deep feedforward network. For the 11 cancer types in the TCGA data, the proposed method showed a significantly better prediction performance than the existing state-of-the-art methods for three cancer types (BRCA, CESC and PAAD), better performance for five cancer types (COAD, ESCA, HNSC, KIRC and STAD), and a similar or slightly worse performance for the remaining three cancer types (BLCA, LIHC and LUAD). Furthermore, the case study for the identified breast cancer and cervical squamous cell carcinoma prognostic genes and their subnetworks included several pathways associated with the progression of breast cancer and cervical squamous cell carcinoma. These results suggested that heterogeneous cancer driver information may be associated with cancer prognosis.
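The abstract relies on a modified PageRank whose details are not given here. As background only, the sketch below implements standard PageRank by power iteration on a small directed network; the damping factor of 0.85, the function name, and the four-gene toy network are assumptions, and the authors' modification is not reproduced.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Standard PageRank by power iteration.

    adjacency maps each node to the list of nodes it points to;
    every target must also appear as a key.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Every node receives the uniform "teleport" share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = adjacency[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank uniformly over all nodes.
                for t in nodes:
                    new_rank[t] += damping * rank[v] / n
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy network: three genes point at a hub, which points back at two.
network = {
    "hub": ["geneA", "geneB"],
    "geneA": ["hub"],
    "geneB": ["hub"],
    "geneC": ["hub"],
}
scores = pagerank(network)
```

The resulting scores sum to 1 and can be read as a per-gene influence vector. In this toy network the hub ranks highest, and `geneC`, which has no incoming edges, keeps only the teleport share.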
Affiliation(s)
- Suyeon Lee
  - Department of Computer Science and Engineering, Incheon National University, Incheon 22012, Republic of Korea
- Heewon Jung
  - Samsung Electronics Company Ltd., Suwon 16677, Republic of Korea
- Jiwoo Park
  - Department of Computer Science and Engineering, Incheon National University, Incheon 22012, Republic of Korea
- Jaegyoon Ahn
  - Department of Computer Science and Engineering, Incheon National University, Incheon 22012, Republic of Korea
|
13
|
Alharbi F, Vakanski A. Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review. Bioengineering (Basel) 2023; 10:bioengineering10020173. [PMID: 36829667 PMCID: PMC9952758 DOI: 10.3390/bioengineering10020173] [Citation(s) in RCA: 46] [Impact Index Per Article: 23.0] [Received: 12/22/2022] [Revised: 01/24/2023] [Accepted: 01/26/2023] [Indexed: 01/31/2023] Open
Abstract
Cancer is a term that denotes a group of diseases caused by the abnormal growth of cells that can spread in different parts of the body. According to the World Health Organization (WHO), cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer, as it is indicative of the biochemical processes in tissue and cells, as well as the genetic characteristics of an organism. Deoxyribonucleic acid (DNA) microarrays and ribonucleic acid (RNA)-sequencing methods for gene expression data allow quantifying the expression levels of genes and produce valuable data for computational analysis. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods. Both conventional and deep learning-based approaches are reviewed, with an emphasis on the application of deep learning models due to their comparative advantages for identifying gene patterns that are distinctive for various types of cancers. Relevant works that employ the most commonly used deep neural network architectures are covered, including multi-layer perceptrons, as well as convolutional, recurrent, graph, and transformer networks. This survey also presents an overview of the data collection methods for gene expression analysis and lists important datasets that are commonly used for supervised machine learning for this task. Furthermore, we review pertinent techniques for feature engineering and data preprocessing that are typically used to handle the high dimensionality of gene expression data, caused by a large number of genes present in data samples. The paper concludes with a discussion of future research directions for machine learning-based gene expression analysis for cancer classification.
|
14
|
Hou X, Hou J, Huang G. Bi-dimensional principal gene feature selection from big gene expression data. PLoS One 2022; 17:e0278583. [PMID: 36477666 PMCID: PMC9728919 DOI: 10.1371/journal.pone.0278583] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2022] [Accepted: 11/20/2022] [Indexed: 12/12/2022] Open
Abstract
Gene expression sample data, which usually contain massive expression profiles of genes, are commonly used for disease-related gene analysis. The selection of relevant genes from a huge number of genes is always a fundamental process in applications of gene expression data. As more and more genes have been detected, the size of gene expression data becomes larger and larger; this challenges the computing efficiency for extracting the relevant and important genes from gene expression data. In this paper, we provide a novel Bi-dimensional Principal Feature Selection (BPFS) method for efficiently extracting critical genes from big gene expression data. It applies the principal component analysis (PCA) method on sample and gene domains successively, aiming at extracting the relevant gene features and reducing redundancies while losing less information. The experimental results on four real-world cancer gene expression datasets show that the proposed BPFS method greatly reduces the data size and achieves a nearly double processing speed compared to the counterpart methods, while maintaining better accuracy and effectiveness.
Affiliation(s)
- Xiaoqian Hou
  - School of Information Technology, Deakin University, Melbourne, Victoria, Australia
- Jingyu Hou
  - School of Information Technology, Deakin University, Melbourne, Victoria, Australia
- Guangyan Huang
  - School of Information Technology, Deakin University, Melbourne, Victoria, Australia
|
15
|
Wang C, Lye X, Kaalia R, Kumar P, Rajapakse JC. Deep learning and multi-omics approach to predict drug responses in cancer. BMC Bioinformatics 2022; 22:632. [PMID: 36443676 PMCID: PMC9703655 DOI: 10.1186/s12859-022-04964-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2022] [Accepted: 09/25/2022] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND Cancers are genetically heterogeneous, so anticancer drugs show varying degrees of effectiveness on patients due to their differing genetic profiles. Knowledge of patients' responses to numerous cancer drugs is needed for personalized cancer treatment. By using molecular profiles of cancer cell lines available from the Cancer Cell Line Encyclopedia (CCLE) and anticancer drug responses available in the Genomics of Drug Sensitivity in Cancer (GDSC), we build computational models to predict anticancer drug responses from molecular features. RESULTS We propose a novel deep neural network model that integrates multi-omics data available as gene expressions, copy number variations, gene mutations, reverse phase protein array expressions, and metabolomics expressions, in order to predict cellular responses to known anti-cancer drugs. We employ a novel graph embedding layer that incorporates interactome data as prior information for prediction. Moreover, we propose a novel attention layer that effectively combines different omics features, taking their interactions into account. The network outperformed feedforward neural networks and reported 0.90 for [Formula: see text] values for prediction of drug responses from cancer cell lines data available in CCLE and GDSC. CONCLUSION The outstanding results of our experiments demonstrate that the proposed method is capable of capturing the interactions of genes and proteins, and integrating multi-omics features effectively. Furthermore, both the results of ablation studies and the investigations of the attention layer imply that gene mutation has a greater influence on the prediction of drug responses than other omics data types. Therefore, we conclude that our approach not only predicts the anti-cancer drug response precisely but also provides insights into reaction mechanisms of cancer cell lines and drugs.
Affiliation(s)
- Conghao Wang
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798 Singapore
- Xintong Lye
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798 Singapore
- Rama Kaalia
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798 Singapore
- Parvin Kumar
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798 Singapore
- Jagath C. Rajapakse
  - School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798 Singapore
|
16
|
Sparse multi-label feature selection via dynamic graph manifold regularization. INT J MACH LEARN CYB 2022. [DOI: 10.1007/s13042-022-01679-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2022]
|
17
|
Guo X, Han J, Song Y, Yin Z, Liu S, Shang X. Using expression quantitative trait loci data and graph-embedded neural networks to uncover genotype–phenotype interactions. Front Genet 2022; 13:921775. [PMID: 36046233 PMCID: PMC9421127 DOI: 10.3389/fgene.2022.921775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2022] [Accepted: 07/04/2022] [Indexed: 11/13/2022] Open
Abstract
Motivation: A central goal of current biology is to establish a complete functional link between the genotype and phenotype, the so-called genotype–phenotype map. With the continuous development of high-throughput technology and the decline in sequencing costs, multi-omics analysis has become more widely employed. While this gives us new opportunities to uncover the correlation mechanisms between single-nucleotide polymorphisms (SNPs), genes, and phenotypes, multi-omics still faces certain challenges, specifically: 1) when the sample size is large enough, the number of omics types is often not large enough to meet the requirements of multi-omics analysis; 2) each omics’ internal correlations are often unclear, such as the correlation between genes in genomics; 3) when analyzing a large number of traits (p), the sample size (n) is often smaller than p, n << p, hindering the application of machine learning methods in the classification of disease outcomes. Results: To solve these issues with multi-omics and build a robust classification model, we propose a graph-embedded deep neural network (G-EDNN) based on expression quantitative trait loci (eQTL) data, which achieves sparse connectivity between network layers to prevent overfitting. The correlation within each omics is also considered such that the model more closely resembles biological reality. To verify the capabilities of this method, we conducted experimental analysis using the GSE28127 and GSE95496 data sets from the Gene Expression Omnibus (GEO) database, tested various neural network architectures, and used prior data for feature selection and graph embedding. Results show that the proposed method could achieve a high classification accuracy and easy-to-interpret feature selection. This method represents an extended application of genotype–phenotype association analysis in deep learning networks.
Affiliation(s)
- Xinpeng Guo
  - School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, China
  - School of Air and Missile Defense, Air Force Engineering University, Xi’an, China
- Jinyu Han
  - School of Economics and Management, Chang’an University, Xi’an, China
- Yafei Song
  - School of Air and Missile Defense, Air Force Engineering University, Xi’an, China
- Zhilei Yin
  - School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, China
- Shuaichen Liu
  - School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, China
- Xuequn Shang
  - School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an, China
- Correspondence: Xuequn Shang
|
18
|
Yang B, Bao W, Hong S. Alzheimer-Compound Identification Based on Data Fusion and forgeNet_SVM. Front Aging Neurosci 2022; 14:931729. [PMID: 35959292 PMCID: PMC9357977 DOI: 10.3389/fnagi.2022.931729] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/29/2022] [Accepted: 05/24/2022] [Indexed: 11/17/2022] Open
Abstract
Rapid screening and identification of potential candidate compounds are very important for understanding the mechanism of drugs for the treatment of Alzheimer's disease (AD) and greatly promote the development of new drugs. In order to greatly improve the success rate of screening and reduce the cost and workload of research and development, this study proposes a novel Alzheimer-related compound identification algorithm named forgeNet_SVM. First, Alzheimer-related and unrelated compounds are collected from the literature databases using a data mining method. Three molecular descriptors (ECFP6, MACCS, and RDKit) are utilized to obtain the feature sets of compounds, which are fused into the all_feature set. The all_feature set is input to forgeNet_SVM, in which forgeNet is utilized to provide the importance of each feature and select the important features for feature extraction. The selected features are input to the support vector machine (SVM) algorithm to identify the new compounds in Traditional Chinese Medicine (TCM) prescriptions. The experimental results show that the selected feature set performs better than the all_feature set and three single feature sets (ECFP6, MACCS, and RDKit). The performances of TPR, FPR, Precision, Specificity, F1, and AUC reveal that forgeNet_SVM could identify Alzheimer-related compounds more accurately than other classical classifiers.
Affiliation(s)
- Bin Yang
  - School of Information Science and Engineering, Zaozhuang University, Zaozhuang, China
- Wenzheng Bao
  - School of Information and Electrical Engineering, Xuzhou University of Technology, Xuzhou, China
- Shichai Hong
  - Department of Vascular Surgery, Zhongshan Hospital (Xiamen), Fudan University, Xiamen, China
|
19
|
Kumar R, Khatri A, Acharya V. Deep learning uncovers distinct behavior of rice network to pathogens response. iScience 2022; 25:104546. [PMID: 35754717 PMCID: PMC9218438 DOI: 10.1016/j.isci.2022.104546] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/20/2022] [Revised: 05/06/2022] [Accepted: 06/02/2022] [Indexed: 12/15/2022] Open
Abstract
Rice, apart from abiotic stress, is prone to attack from multiple pathogens. In particular, two rice pathogens, the bacterial Xanthomonas oryzae (Xoo) and the hemibiotrophic fungus Magnaporthe oryzae, have been extensively explored for more than a decade. However, because holistic studies are lacking, we designed a deep learning-based rice network model (DLNet) that explores the quantitative differences resulting in the distinct rice network architecture. Validation studies on rice in response to biotic stresses show that DLNet outperforms other machine learning methods. The current finding indicates the compactness of the rice PTI network and the rise of independent modules in the rice ETI network, resulting in similar patterns of the plant immune response. The results also show more independent network modules and minimum structural disorderness in the rice-M. oryzae model as compared to the rice-Xoo model, revealing the different adaptation strategies of the rice plant to evade pathogen effectors.
Affiliation(s)
- Ravi Kumar
  - Functional Genomics and Complex System Lab, Biotechnology Division, The Himalayan Centre for High-throughput Computational Biology (HiCHiCoB, A BIC Supported by DBT, India), CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
- Abhishek Khatri
  - Functional Genomics and Complex System Lab, Biotechnology Division, The Himalayan Centre for High-throughput Computational Biology (HiCHiCoB, A BIC Supported by DBT, India), CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh, India
- Vishal Acharya
  - Functional Genomics and Complex System Lab, Biotechnology Division, The Himalayan Centre for High-throughput Computational Biology (HiCHiCoB, A BIC Supported by DBT, India), CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh, India
  - Academy of Scientific and Innovative Research (AcSIR), Ghaziabad 201002, India
|
20
|
Rezaee K, Jeon G, Khosravi MR, Attar HH, Sabzevari A. Deep learning‐based microarray cancer classification and ensemble gene selection approach. IET Syst Biol 2022; 16:120-131. [PMID: 35790076 PMCID: PMC9290776 DOI: 10.1049/syb2.12044] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 01/31/2022] [Revised: 04/04/2022] [Accepted: 05/31/2022] [Indexed: 12/19/2022] Open
Abstract
Malignancies and diseases of various genetic origins can be diagnosed and classified with microarray data. There are many obstacles to overcome due to the large number of genes and the small number of samples in microarray data. A combination strategy for gene expression in a variety of diseases is described in this paper, consisting of two steps: identifying the most effective genes via soft ensembling and classifying them with a novel deep neural network. The feature selection approach combines three strategies to select wrapper genes and rank them according to the k‐nearest neighbour algorithm, resulting in a very generalisable model with low error levels. Using soft ensembling, the most effective subsets of genes were identified from three microarray datasets of diffuse large cell lymphoma, leukaemia, and prostate cancer. A stacked deep neural network was used to classify all three datasets, achieving an average accuracy of 97.51%, 99.6%, and 96.34%, respectively. In addition, two previously unreported datasets from small, round blue cell tumors (SRBCTs) and multiple sclerosis‐related brain tissue lesions were examined to show the generalisability of the model.
Affiliation(s)
- Khosro Rezaee
  - Department of Biomedical Engineering, Meybod University, Meybod, Iran
- Gwanggil Jeon
  - Department of Embedded Systems Engineering, College of Information Technology, Incheon National University, Incheon, Korea
- Hani H. Attar
  - Department of Energy Engineering, Zarqa University, Zarqa, Jordan
|
21
|
EGFAFS: A Novel Feature Selection Algorithm Based on Explosion Gravitation Field Algorithm. ENTROPY 2022; 24:e24070873. [PMID: 35885095 PMCID: PMC9322764 DOI: 10.3390/e24070873] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2022] [Revised: 06/15/2022] [Accepted: 06/22/2022] [Indexed: 02/04/2023]
Abstract
Feature selection (FS) is a vital step in data mining and machine learning, especially for analyzing data in a high-dimensional feature space. Gene expression data usually consist of a few samples characterized by a high-dimensional feature space. As a result, they are not suitable to be processed by simple methods, such as the filter-based method. In this study, we propose a novel feature selection algorithm based on the Explosion Gravitation Field Algorithm, called EGFAFS. To reduce the dimensions of the feature space to acceptable dimensions, we constructed a recommended feature pool by a series of Random Forests based on the Gini index. Furthermore, by paying more attention to the features in the recommended feature pool, we can find the best subset more efficiently. To verify the performance of EGFAFS for FS, we tested EGFAFS on eight gene expression datasets compared with four heuristic-based FS methods (GA, PSO, SA, and DE) and four other FS methods (Boruta, HSICLasso, DNN-FS, and EGSG). The results show that, in terms of evaluation metrics, EGFAFS performs better for FS on gene expression data than the other eight FS algorithms. The genes selected by EGFAFS play an essential role in the differential co-expression network and in some biological functions, which further demonstrates the success of EGFAFS for solving FS problems on gene expression data.
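The Gini-based pooling idea mentioned in this abstract can be sketched in miniature. The code below is not the EGFAFS procedure and does not build a Random Forest; it only computes the Gini impurity decrease of the best single threshold split per feature, which is the quantity tree ensembles accumulate when ranking features by Gini importance. The feature names, the toy values, and the pool threshold of 0.25 are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split_gain(values, labels):
    """Largest Gini impurity decrease over all threshold splits of one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini(labels)
    best = 0.0
    for i in range(1, n):
        # A threshold can only be placed between two distinct values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = parent - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)
        best = max(best, gain)
    return best


# Toy expression values for two hypothetical features over four samples.
labels = ["a", "a", "b", "b"]
features = {
    "informative": [0.1, 0.2, 0.8, 0.9],
    "noisy": [0.1, 0.8, 0.2, 0.9],
}
gains = {name: best_split_gain(vals, labels) for name, vals in features.items()}

# Keep features whose gain reaches an (assumed) cut-off of 0.25.
ranked = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
pool = [name for name, gain in ranked if gain >= 0.25]
print(gains, pool)
```

A real forest would average such impurity decreases over many randomized trees and splits; the single-split version here is only meant to make the scoring rule concrete.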
|
22
|
Xing X, Yang F, Li H, Zhang J, Zhao Y, Gao M, Huang J, Yao J. Multi-level attention graph neural network based on co-expression gene modules for disease diagnosis and prognosis. Bioinformatics 2022; 38:2178-2186. [PMID: 35157021 DOI: 10.1093/bioinformatics/btac088] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 09/14/2021] [Revised: 01/29/2022] [Accepted: 02/09/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Advanced deep learning techniques have been widely applied in disease diagnosis and prognosis with clinical omics, especially gene expression data. In the regulation of biological processes and disease progression, genes often work interactively rather than individually. Therefore, investigating gene association information and co-functional gene modules can facilitate disease state prediction. RESULTS To explore the gene modules and inter-gene relational information contained in the omics data, we propose a novel multi-level attention graph neural network (MLA-GNN) for disease diagnosis and prognosis. Specifically, we format omics data into co-expression graphs via weighted correlation network analysis, then construct multi-level graph features, and finally fuse them through a well-designed multi-level graph feature fully fusion module to conduct predictions. For model interpretation, a novel full-gradient graph saliency mechanism is developed to identify the disease-relevant genes. MLA-GNN achieves state-of-the-art performance on transcriptomic data from TCGA-LGG/TCGA-GBM and proteomic data from coronavirus disease 2019 (COVID-19)/non-COVID-19 patient sera. More importantly, the relevant genes selected by our model are interpretable and are consistent with the clinical understanding. AVAILABILITY AND IMPLEMENTATION The codes are available at https://github.com/TencentAILabHealthcare/MLA-GNN.
Affiliation(s)
- Xiaohan Xing
  - Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong 999077, China
  - AI Lab, Tencent, Shenzhen 518000, China
- Fan Yang
  - AI Lab, Tencent, Shenzhen 518000, China
- Hang Li
  - AI Lab, Tencent, Shenzhen 518000, China
  - School of Informatics, Xiamen University, Xiamen 361005, China
- Jun Zhang
  - AI Lab, Tencent, Shenzhen 518000, China
- Yu Zhao
  - AI Lab, Tencent, Shenzhen 518000, China
- Mingxuan Gao
  - AI Lab, Tencent, Shenzhen 518000, China
  - School of Informatics, Xiamen University, Xiamen 361005, China
|
23
|
Tan K, Huang W, Liu X, Hu J, Dong S. A multi-modal fusion framework based on multi-task correlation learning for cancer prognosis prediction. Artif Intell Med 2022; 126:102260. [DOI: 10.1016/j.artmed.2022.102260] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/11/2021] [Revised: 01/07/2022] [Accepted: 02/16/2022] [Indexed: 12/30/2022]
|
24
|
Jin Z, Kang J, Yu T. Feature selection and classification over the network with missing node observations. Stat Med 2022; 41:1242-1262. [PMID: 34816464 PMCID: PMC9773124 DOI: 10.1002/sim.9267] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/16/2021] [Revised: 09/14/2021] [Accepted: 10/29/2021] [Indexed: 12/25/2022]
Abstract
Jointly analyzing transcriptomic data and the existing biological networks can yield more robust and informative feature selection results, as well as a better understanding of the biological mechanisms. Selecting and classifying node features over genome-scale networks has become increasingly important in genomic biology and genomic medicine. Existing methods have some critical drawbacks. The first is that they do not allow flexible modeling of different subtypes of selected nodes. The second is that they ignore nodes with missing values, which is very likely to increase bias in estimation. To address these limitations, we propose a general modeling framework for Bayesian node classification (BNC) with missing values. A new prior model is developed for the class indicators incorporating the network structure. For posterior computation, we resort to the Swendsen-Wang algorithm for efficiently updating class indicators. BNC can naturally handle missing values in the Bayesian modeling framework, which improves the node classification accuracy and reduces the bias in estimating gene effects. We demonstrate the advantages of our methods via extensive simulation studies and the analysis of the cutaneous melanoma dataset from The Cancer Genome Atlas.
Affiliation(s)
- Jian Kang
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan
- Tianwei Yu
  - School of Data Science and Warshel Institute, The Chinese University of Hong Kong - Shenzhen, and Shenzhen Research Institute of Big Data, Shenzhen, China
|
25
|
|
26
|
Li L, Liu ZP. A connected network-regularized logistic regression model for feature selection. APPL INTELL 2022. [DOI: 10.1007/s10489-021-02877-3] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/28/2022]
|
27
|
Zhang Y, Ma Y, Yang X. Multi-label feature selection based on logistic regression and manifold learning. APPL INTELL 2022. [DOI: 10.1007/s10489-021-03008-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/02/2022]
|
28
|
Qiao C, Yang L, Shi Y, Fang H, Kang Y. Deep belief networks with self-adaptive sparsity. APPL INTELL 2022. [DOI: 10.1007/s10489-021-02361-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
29
|
|
30
|
Li C, Gao Z, Su B, Xu G, Lin X. Data analysis methods for defining biomarkers from omics data. Anal Bioanal Chem 2021; 414:235-250. [PMID: 34951658 DOI: 10.1007/s00216-021-03813-7] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 09/15/2021] [Revised: 11/26/2021] [Accepted: 11/29/2021] [Indexed: 02/01/2023]
Abstract
Omics mainly includes genomics, epigenomics, transcriptomics, proteomics and metabolomics. The rapid development of omics technology has opened up new ways to study disease diagnosis and prognosis and to define prospective information of complex diseases. Since omics data are usually large and complex, the method used to analyze the data and to define important information is crucial in omics study. In this review, we focus on advances in biomarker discovery methods based on omics data in the last decade, and categorize them as individual feature analysis, combinatorial feature analysis and network analysis. We also discuss the challenges and perspectives in this field.
Affiliation(s)
- Chao Li
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China
  - CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, Liaoning, China
- Zhenbo Gao
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China
- Benzhe Su
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China
- Guowang Xu
  - CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, Liaoning, China
- Xiaohui Lin
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China
|
31
|
Yu K, Xie W, Wang L, Zhang S, Li W. Determination of biomarkers from microarray data using graph neural network and spectral clustering. Sci Rep 2021; 11:23828. [PMID: 34903818 PMCID: PMC8668890 DOI: 10.1038/s41598-021-03316-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/15/2021] [Accepted: 12/02/2021] [Indexed: 11/26/2022] Open
Abstract
In bioinformatics, the rapid development of gene sequencing technology has produced an increasing amount of microarray data. This type of data shares the typical characteristics of small sample size and high feature dimensions. Searching microarray data for biomarkers, which are expression features of various diseases, is essential for disease classification. Feature selection, which aims to remove irrelevant and redundant features, has therefore become fundamental for the analysis of microarray data. There are a large number of redundant and irrelevant features in microarray data, which severely degrade the classification effectiveness. We propose an innovative feature selection method with the goal of obtaining feature dependencies from a priori knowledge and removing redundant features using spectral clustering. In this paper, the graph structure is first constructed by using the gene interaction network as a priori knowledge, and then a link prediction method based on a graph neural network is proposed to enhance the graph structure data. Finally, a feature selection method based on spectral clustering is proposed to determine biomarkers. The classification accuracy on DLBCL and Prostate can be improved by 10.90% and 16.22% compared to traditional methods. Link prediction provides an average classification accuracy improvement of 1.96% and 1.31%, and is up to 16.98% higher than the published method. The results show that the proposed method can make full use of a priori knowledge to effectively select disease prediction biomarkers with high classification accuracy.
Affiliation(s)
- Kun Yu
  - College of Medicine and Bioinformation Engineering, Northeastern University, Shenyang, China
- Weidong Xie
  - School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Linjie Wang
  - School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Shoujia Zhang
  - School of Computer Science and Engineering, Northeastern University, Shenyang, China
- Wei Li
  - Key Laboratory of Intelligent Computing in Medical Image MIIC, Northeastern University, Ministry of Education, Shenyang, China
|
32
|
Ma W, Su K, Wu H. Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction. Genome Biol 2021; 22:264. [PMID: 34503564 PMCID: PMC8427961 DOI: 10.1186/s13059-021-02480-2] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 05/10/2021] [Accepted: 08/25/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset. RESULTS In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data. CONCLUSIONS Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and using a multi-layer perceptron (MLP) as the classifier, along with the F-test as the feature selection method. All the code used for our analysis is available on GitHub ( https://github.com/marvinquiet/RefConstruction_supervisedCelltyping ).
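As an illustration of the recommended combination (F-test feature selection followed by an MLP classifier), the following is a minimal sketch using scikit-learn on synthetic data. It is not the authors' code: the dataset, the number of selected features (`k=50`), the hidden layer size, and the variable names `X_ref` and `X_target` are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an expression matrix: 600 "cells", 500 "genes", 4 "cell types".
X, y = make_classification(
    n_samples=600, n_features=500, n_informative=20,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Hypothetical split into a labelled reference set and a target set to annotate.
X_ref, X_target, y_ref, y_target = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# F-test (one-way ANOVA F-statistic per feature) keeps the 50 top-scoring features,
# then an MLP is trained on the selected features only.
model = make_pipeline(
    StandardScaler(),
    SelectKBest(score_func=f_classif, k=50),
    MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0),
)
model.fit(X_ref, y_ref)

accuracy = model.score(X_target, y_target)
print(f"target-set accuracy: {accuracy:.3f}")
```

Placing the selector inside the pipeline means the F-test is fitted on the reference data only, so no label information from the target set leaks into feature selection.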
Affiliation(s)
- Wenjing Ma
- Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA
| | - Kenong Su
- Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA
| | - Hao Wu
- Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA.
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA, 30322, USA.
|
33
|
Nguyen ND, Jin T, Wang D. Varmole: a biologically drop-connect deep neural network model for prioritizing disease risk variants and genes. Bioinformatics 2021; 37:1772-1775. [PMID: 33031552 PMCID: PMC8289382 DOI: 10.1093/bioinformatics/btaa866] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 03/02/2020] [Revised: 09/07/2020] [Accepted: 09/23/2020] [Indexed: 12/23/2022] Open
Abstract
SUMMARY Population studies such as genome-wide association study have identified a variety of genomic variants associated with human diseases. To further understand potential mechanisms of disease variants, recent statistical methods associate functional omic data (e.g. gene expression) with genotype and phenotype and link variants to individual genes. However, how to interpret molecular mechanisms from such associations, especially across omics, is still challenging. To address this problem, we developed an interpretable deep learning method, Varmole, to simultaneously reveal genomic functions and mechanisms while predicting phenotype from genotype. In particular, Varmole embeds multi-omic networks into a deep neural network architecture and prioritizes variants, genes and regulatory linkages via biological drop-connect without needing prior feature selections. AVAILABILITY AND IMPLEMENTATION Varmole is available as a Python tool on GitHub at https://github.com/daifengwanglab/Varmole. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Nam D Nguyen
- Department of Computer Science, Stony Brook University, Stony Brook, NY 11794, USA.,Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, USA
| | - Ting Jin
- Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, USA.,Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA
| | - Daifeng Wang
- Waisman Center, University of Wisconsin-Madison, Madison, WI 53705, USA.,Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI 53726, USA
|
34
|
Tan K, Huang W, Liu X, Hu J, Dong S. A Hierarchical Graph Convolution Network for Representation Learning of Gene Expression Data. IEEE J Biomed Health Inform 2021; 25:3219-3229. [PMID: 33449889 DOI: 10.1109/jbhi.2021.3052008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
Abstract
The curse of dimensionality, which is caused by high-dimensionality and low-sample-size, is a major challenge in gene expression data analysis. However, the real situation is even worse: labelling data is laborious and time-consuming, so only a small part of the limited samples will be labelled. Having so few labelled samples further increases the difficulty of training deep learning models. Interpretability is an important requirement in biomedicine. Many existing deep learning methods try to provide interpretability, but are rarely applied to gene expression data. Recent semi-supervised graph convolution network methods try to address these problems by smoothing the label information over a graph. However, to the best of our knowledge, these methods only utilize graphs in either the feature space or the sample space, which restricts their performance. We propose a transductive semi-supervised representation learning method called a hierarchical graph convolution network (HiGCN) to aggregate the information of gene expression data in both feature and sample spaces. HiGCN first utilizes external knowledge to construct a feature graph and a similarity kernel to construct a sample graph. Then, two spatial-based GCNs are used to aggregate information on these graphs. To validate the model's performance, synthetic and real datasets are provided to lend empirical support. Compared with two recent models and three traditional models, HiGCN learns better representations of gene expression data, and these representations improve the performance of downstream tasks, especially when the model is trained on a few labelled samples. Important features can be extracted from our model to provide reliable interpretability.
|
35
|
Wang X, Dong Y, Zheng Y, Chen Y. Multiomics metabolic and epigenetics regulatory network in cancer: A systems biology perspective. J Genet Genomics 2021; 48:520-530. [PMID: 34362682 DOI: 10.1016/j.jgg.2021.05.008] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 01/18/2021] [Revised: 05/07/2021] [Accepted: 05/11/2021] [Indexed: 12/21/2022]
Abstract
Genetic, epigenetic, and metabolic alterations are all hallmarks of cancer. However, the epigenome and metabolome are both highly complex and dynamic biological networks in vivo. The interplay between the epigenome and metabolome contributes to a biological system that is responsive to the tumor microenvironment and possesses a wealth of unknown biomarkers and targets of cancer therapy. From this perspective, we first review the state of high-throughput biological data acquisition (i.e. multiomics data) and analysis (i.e. computational tools) and then propose a conceptual in silico metabolic and epigenetic regulatory network (MER-Net) that is based on these current high-throughput methods. The conceptual MER-Net is aimed at linking metabolomic and epigenomic networks through observation of biological processes, omics data acquisition, analysis of network information, and integration with validated database knowledge. Thus, MER-Net could be used to reveal new potential biomarkers and therapeutic targets using deep learning models to integrate and analyze large multiomics networks. We propose that MER-Net can serve as a tool to guide integrated metabolomics and epigenomics research or can be modified to answer other complex biological and clinical questions using multiomics data.
Affiliation(s)
- Xuezhu Wang
- The State Key Laboratory of Medical Molecular Biology, Department of Biochemistry and Molecular Biology, Institute of Basic Medical Sciences, School of Basic Medicine, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100005, China
| | - Yucheng Dong
- The State Key Laboratory of Medical Molecular Biology, Department of Biochemistry and Molecular Biology, Institute of Basic Medical Sciences, School of Basic Medicine, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100005, China
| | - Yongchang Zheng
- Department of Liver Surgery, Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100730, China
| | - Yang Chen
- The State Key Laboratory of Medical Molecular Biology, Department of Biochemistry and Molecular Biology, Institute of Basic Medical Sciences, School of Basic Medicine, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100005, China.
|
36
|
Yang H, Zhuang Z, Pan W. A graph convolutional neural network for gene expression data analysis with multiple gene networks. Stat Med 2021; 40:5547-5564. [PMID: 34258781 DOI: 10.1002/sim.9140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2020] [Revised: 04/07/2021] [Accepted: 06/21/2021] [Indexed: 02/01/2023]
Abstract
Spectral graph convolutional neural networks (GCN) are proposed to incorporate important information contained in graphs such as gene networks. In a standard spectral GCN, there is only one gene network to describe the relationships among genes. However, for genomic applications, due to condition- or tissue-specific gene function and regulation, multiple gene networks may be available; it is unclear how to apply GCNs to disease classification with multiple networks. Besides, which gene networks may provide more effective prior information for a given learning task is unknown a priori and is not straightforward to discover in many cases. A deep multiple graph convolutional neural network is therefore developed here to meet the challenge. The new approach not only computes a feature of a gene as the weighted average of those of itself and its neighbors through spectral GCNs, but also extracts features from gene-specific expression (or other feature) profiles via a feed-forward neural network (FNN). We also provide two measures, the importance of a given gene and the relative importance score of each gene network, for the genes' and gene networks' contributions, respectively, to the learning task. To evaluate the new method, we conduct real data analyses using several breast cancer and diffuse large B-cell lymphoma datasets and incorporating multiple gene networks obtained from "GIANT 2.0". Compared with the standard FNN, GCN, and random forest, the new method not only yields high classification accuracy but also prioritizes the most important genes confirmed to be highly associated with cancer, strongly suggesting the usefulness of the new method in incorporating multiple gene networks.
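The idea of computing a gene's feature as a weighted average of its own value and those of its neighbors can be sketched with a single propagation step on a toy graph. This is only an illustration under stated assumptions, not the paper's model: the 4-gene network `A`, the expression values `x`, and the choice of row normalization (equal weights over a gene and its neighbors) are hypothetical, and no learned weights or nonlinearity are included.

```python
import numpy as np

# Hypothetical undirected gene network on 4 genes (1 = the two genes are linked).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# One expression value per gene for a single sample (illustrative numbers).
x = np.array([[1.0], [2.0], [3.0], [10.0]])

# Add self-loops so each gene contributes to its own smoothed feature.
A_hat = A + np.eye(A.shape[0])

# Row-normalize: each row sums to 1, so P @ x is an average over a gene and its neighbors.
P = A_hat / A_hat.sum(axis=1, keepdims=True)

h = P @ x
print(h.ravel())  # [2.  2.  4.  6.5]
```

For example, gene 3 is linked only to gene 2, so its smoothed value is (3 + 10) / 2 = 6.5. A trainable layer would additionally multiply by a weight matrix and apply a nonlinearity; symmetric normalization is a common alternative to the row normalization assumed here.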
Affiliation(s)
- Hu Yang
- School of Information, Central University of Finance and Economics, Beijing, China
| | - Zhong Zhuang
- Department of EECE, University of Minnesota, Minneapolis, Minnesota, USA
| | - Wei Pan
- Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota, USA
|
37
|
Yang S, Zhu F, Ling X, Liu Q, Zhao P. Intelligent Health Care: Applications of Deep Learning in Computational Medicine. Front Genet 2021; 12:607471. [PMID: 33912213 PMCID: PMC8075004 DOI: 10.3389/fgene.2021.607471] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 09/17/2020] [Accepted: 03/05/2021] [Indexed: 12/24/2022] Open
Abstract
With the progress of medical technology, the biomedical field has entered the era of big data, and computational medicine, driven by artificial intelligence technology, has emerged on this basis. The effective information contained in these big biomedical data needs to be extracted to promote the development of precision medicine. Traditionally, machine learning methods are used to mine biomedical data for features; these methods generally rely on feature engineering and the domain knowledge of experts, and require tremendous time and human resources. Unlike traditional approaches, deep learning, as a cutting-edge branch of machine learning, can automatically learn complex and robust features from raw data without the need for feature engineering. The applications of deep learning in medical imaging, electronic health records, genomics, and drug development are studied, and the results suggest that deep learning has an obvious advantage in making full use of biomedical data and improving the level of medical care. Deep learning plays an increasingly important role in the field of medical health and has broad application prospects. However, problems and challenges for deep learning in computational medical health still exist, including insufficient data, interpretability, data privacy, and heterogeneity. Analysis and discussion of these problems provide a reference for improving the application of deep learning in medical health.
Affiliation(s)
- Sijie Yang
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Fei Zhu
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Xinghong Ling
- School of Computer Science and Technology, Soochow University, Suzhou, China
- WenZheng College of Soochow University, Suzhou, China
| | - Quan Liu
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Peiyao Zhao
- School of Computer Science and Technology, Soochow University, Suzhou, China
|
38
|
Feng J, Jiang L, Li S, Tang J, Wen L. Multi-Omics Data Fusion via a Joint Kernel Learning Model for Cancer Subtype Discovery and Essential Gene Identification. Front Genet 2021; 12:647141. [PMID: 33747053 PMCID: PMC7969795 DOI: 10.3389/fgene.2021.647141] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/29/2020] [Accepted: 02/02/2021] [Indexed: 01/17/2023] Open
Abstract
The multiple sources of cancer determine its multiple causes, and the same cancer can be composed of many different subtypes. Identification of cancer subtypes is a key part of personalized cancer treatment and provides an important reference for clinical diagnosis and treatment. Some studies have shown that there are significant differences in the genetic and epigenetic profiles among different cancer subtypes during carcinogenesis and development. In this study, we first collect seven cancer datasets from the Broad Institute GDAC Firehose, including gene expression profile, isoform expression profile, DNA methylation expression data, and survival information correspondingly. Furthermore, we employ kernel principal component analysis (PCA) to extract features for each expression profile, convert them into three similarity kernel matrices by Gaussian kernel function, and then fuse these matrices as a global kernel matrix. Finally, we apply it to spectral clustering algorithm to get the clustering results of different cancer subtypes. In the experimental results, besides using the P-value from the Cox regression model and survival analysis as the primary evaluation measures, we also introduce statistical indicators such as Rand index (RI) and adjusted RI (ARI) to verify the performance of clustering. Then combining with gene expression profile, we obtain the differential expression of genes among different subtypes by gene set enrichment analysis. For lung cancer, GMPS, EPHA10, C10orf54, and MAGEA6 are highly expressed in different subtypes; for liver cancer, CMYA5, DEPDC6, FAU, VPS24, RCBTB2, LOC100133469, and SLC35B4 are significantly expressed in different subtypes.
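The pipeline described above (kernel PCA per data type, a Gaussian kernel per reduced profile, fusion into a global kernel, then spectral clustering scored with ARI) can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' joint kernel learning model: the three random "views", the unweighted mean used for fusion, the number of components, and the default kernel bandwidths are all assumptions and are untuned.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.decomposition import KernelPCA
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
n_groups, n_per_group = 3, 30
labels_true = np.repeat(np.arange(n_groups), n_per_group)

def make_view(n_features, shift=3.0):
    """Synthetic omics view: a group-specific mean profile plus Gaussian noise."""
    centers = rng.normal(scale=shift, size=(n_groups, n_features))
    return centers[labels_true] + rng.normal(size=(labels_true.size, n_features))

# Hypothetical stand-ins for gene expression, isoform expression and methylation.
views = [make_view(50), make_view(40), make_view(60)]

kernels = []
for X in views:
    # Kernel PCA extracts a low-dimensional representation of each view.
    Z = KernelPCA(n_components=5, kernel="rbf").fit_transform(X)
    # Gaussian (RBF) kernel turns the representation into a sample similarity matrix.
    kernels.append(rbf_kernel(Z))

# Assumed fusion rule: plain average of the per-view kernels.
K = np.mean(kernels, axis=0)

# Spectral clustering on the fused kernel, used directly as the affinity matrix.
pred = SpectralClustering(
    n_clusters=n_groups, affinity="precomputed", random_state=0
).fit_predict(K)

ari = adjusted_rand_score(labels_true, pred)
print(f"adjusted Rand index: {ari:.3f}")
```

ARI needs reference labels, which exist here only because the data are simulated; on real cohorts without known subtypes, survival-based measures such as the Cox regression P-value mentioned in the abstract would be used instead.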
Affiliation(s)
- Jie Feng
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Limin Jiang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Shuhao Li
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.,School of Computational Science and Engineering, University of South Carolina, Columbia, SC, United States.,Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin, China
| | - Lan Wen
- Changsha Municipal Center of Disease Control, Changsha, China
|
39
|
GVES: machine learning model for identification of prognostic genes with a small dataset. Sci Rep 2021; 11:439. [PMID: 33431999 PMCID: PMC7801384 DOI: 10.1038/s41598-020-79889-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 08/11/2020] [Accepted: 12/08/2020] [Indexed: 12/16/2022] Open
Abstract
Machine learning may be a powerful approach to more accurate identification of genes that may serve as prognosticators of cancer outcomes using various types of omics data. However, to date, machine learning approaches have shown limited prediction accuracy for cancer outcomes, primarily owing to small sample numbers and relatively large number of features. In this paper, we provide a description of GVES (Gene Vector for Each Sample), a proposed machine learning model that can be efficiently leveraged even with a small sample size, to increase the accuracy of identification of genes with prognostic value. GVES, an adaptation of the continuous bag of words (CBOW) model, generates vector representations of all genes for all samples by leveraging gene expression and biological network data. GVES clusters samples using their gene vectors, and identifies genes that divide samples into good and poor outcome groups for the prediction of cancer outcomes. Because GVES generates gene vectors for each sample, the sample size effect is reduced. We applied GVES to six cancer types and demonstrated that GVES outperformed existing machine learning methods, particularly for cancer datasets with a small number of samples. Moreover, the genes identified as prognosticators were shown to reside within a number of significant prognostic genetic pathways associated with pancreatic cancer.
|
40
|
Liu J, Su R, Zhang J, Wei L. Classification and gene selection of triple-negative breast cancer subtype embedding gene connectivity matrix in deep neural network. Brief Bioinform 2021; 22:6067882. [PMID: 33415328 DOI: 10.1093/bib/bbaa395] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 10/05/2020] [Revised: 11/16/2020] [Accepted: 12/01/2020] [Indexed: 12/13/2022] Open
Abstract
Triple-negative breast cancer (TNBC) has been a challenging breast cancer subtype for oncological therapy. Normally, it can be classified into different molecular subtypes. Accurate and stable classification of the six subtypes is essential for personalized treatment of TNBC. In this study, we proposed a new framework to distinguish the six subtypes of TNBC, and this is one of only a handful of studies that have completed the classification based on mRNA and long noncoding RNA expression data. In particular, we developed a gene selection approach named DGGA, which takes correlation information between genes into account when measuring gene importance and then effectively removes redundant genes. A gene scoring approach that combines GeneRank scores with gene importance generated by a deep neural network (DNN), taking inter-subtype discrimination and inner-gene correlations into account, was devised to improve gene selection performance. More importantly, we embedded a gene connectivity matrix in the DNN for sparse learning, which additionally takes weight changes during training into account when measuring the relative importance of each gene. Finally, a Genetic Algorithm was used to simulate the natural evolutionary process to search for the optimal gene subset for TNBC subtype classification. We validated the proposed method through cross-validation, and the results demonstrate that it can use fewer genes to obtain more accurate classification results. The implementation for the proposed method is available at https://github.com/RanSuLab/TNBC.
Affiliation(s)
- Jin Liu
- School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Ran Su
- School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Jiahang Zhang
- School of Computer Software, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Leyi Wei
- School of Software, Shandong University, Shandong, China
|
41
|
Liu T, Huang J, Liao T, Pu R, Liu S, Peng Y. A Hybrid Deep Learning Model for Predicting Molecular Subtypes of Human Breast Cancer Using Multimodal Data. Ing Rech Biomed 2021. [DOI: 10.1016/j.irbm.2020.12.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 10/22/2022]
|
42
|
Lee S, Lim S, Lee T, Sung I, Kim S. Cancer subtype classification and modeling by pathway attention and propagation. Bioinformatics 2020; 36:3818-3824. [PMID: 32207514 DOI: 10.1093/bioinformatics/btaa203] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 09/17/2019] [Revised: 01/13/2020] [Accepted: 03/19/2020] [Indexed: 01/04/2023] Open
Abstract
MOTIVATION Biological pathways are an important form of curated knowledge of biological processes. Thus, cancer subtype classification based on pathways will be very useful for understanding differences in biological mechanisms among cancer subtypes. However, pathways include only a fraction of the entire gene set (only one-third of human genes in KEGG), and pathways are fragmented. For this reason, there are few computational methods that use pathways for cancer subtype classification. RESULTS We present an explainable deep-learning model with an attention mechanism and network propagation for cancer subtype classification. Each pathway is modeled by a graph convolutional network. Then, a multi-attention-based ensemble model combines several hundred pathways in an explainable manner. Lastly, network propagation on the pathway-gene network explains why gene expression profiles in subtypes are different. In experiments with five TCGA cancer datasets, our method achieved very good classification accuracies and, additionally, identified subtype-specific pathways and biological functions. AVAILABILITY AND IMPLEMENTATION The source code is available at http://biohealth.snu.ac.kr/software/GCN_MAE. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Sangseon Lee
- Department of Computer Science and Engineering, Institute of Engineering Research
| | | | - Taeheon Lee
- Department of Computer Science and Engineering, Institute of Engineering Research
| | - Inyoung Sung
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, Republic of Korea
| | - Sun Kim
- Department of Computer Science and Engineering, Institute of Engineering Research.,Bioinformatics Institute.,Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul 08826, Republic of Korea
|
43
|
Kong Y, Yu T. forgeNet: a graph deep neural network model using tree-based ensemble classifiers for feature graph construction. Bioinformatics 2020; 36:3507-3515. [PMID: 32163118 DOI: 10.1093/bioinformatics/btaa164] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 05/27/2019] [Revised: 02/07/2020] [Accepted: 03/08/2020] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION A unique challenge in predictive model building for omics data has been the small number of samples (n) versus the large number of features (p). This 'n≪p' property brings difficulties for disease outcome classification using deep learning techniques. Sparse learning by incorporating known functional relationships between the biological units, such as the graph-embedded deep feedforward network (GEDFN) model, has been a solution to this issue. However, such methods require an existing feature graph, and potential mis-specification of the feature graph can be harmful to classification and feature selection. RESULTS To address this limitation and develop a robust classification model without relying on external knowledge, we propose a forest graph-embedded deep feedforward network (forgeNet) model, which integrates the GEDFN architecture with a forest feature graph extractor, so that the feature graph can be learned in a supervised manner and specifically constructed for a given prediction task. To validate the method's capability, we tested the forgeNet model on both synthetic and real datasets. The resulting high classification accuracy suggests that the method is a valuable addition to sparse deep learning models for omics data. AVAILABILITY AND IMPLEMENTATION The method is available at https://github.com/yunchuankong/forgeNet. CONTACT tianwei.yu@emory.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yunchuan Kong
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
| | - Tianwei Yu
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
|
44
|
Gallins P, Saghapour E, Zhou YH. Exploring the Limits of Combined Image/'omics Analysis for Non-cancer Histological Phenotypes. Front Genet 2020; 11:555886. [PMID: 33193632 PMCID: PMC7644963 DOI: 10.3389/fgene.2020.555886] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/26/2020] [Accepted: 09/09/2020] [Indexed: 11/13/2022] Open
Abstract
The last several years have witnessed an explosion of methods and applications for combining image data with 'omics data, and for prediction of clinical phenotypes. Much of this research has focused on cancer histology, for which genetic perturbations are large, and the signal to noise ratio is high. Related research on chronic, complex diseases is limited by tissue sample availability, lower genomic signal strength, and the less extreme and tissue-specific nature of intermediate histological phenotypes. Data from the GTEx Consortium provides a unique opportunity to investigate the connections among phenotypic histological variation, imaging data, and 'omics profiling, from multiple tissue-specific phenotypes at the sub-clinical level. Investigating histological designations in multiple tissues, we survey the evidence for genomic association and prediction of histology, and use the results to test the limits of prediction accuracy using machine learning methods applied to the imaging data, genomics data, and their combination. We find that expression data has similar or superior accuracy for pathology prediction as our use of imaging data, despite the fact that pathological determination is made from the images themselves. A variety of machine learning methods have similar performance, while network embedding methods offer at best limited improvements. These observations hold across a range of tissues and predictor types. The results are supportive of the use of genomic measurements for prediction, and in using the same target tissue in which pathological phenotyping has been performed. Although this last finding is sensible, to our knowledge our study is the first to demonstrate this fact empirically. Even while prediction accuracy remains a challenge, the results show clear evidence of pathway and tissue-specific biology.
Affiliation(s)
- Paul Gallins
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
| | - Ehsan Saghapour
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
- Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
| | - Yi-Hui Zhou
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
- Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
|
45
|
Xu D, Zhang J, Xu H, Zhang Y, Chen W, Gao R, Dehmer M. Multi-scale supervised clustering-based feature selection for tumor classification and identification of biomarkers and targets on genomic data. BMC Genomics 2020; 21:650. [PMID: 32962626 PMCID: PMC7510277 DOI: 10.1186/s12864-020-07038-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 03/18/2020] [Accepted: 08/30/2020] [Indexed: 12/19/2022] Open
Abstract
Background The small number of samples and the curse of dimensionality hamper the application of deep learning techniques to disease classification. Additionally, the performance of clustering-based feature selection algorithms is still far from satisfactory because they are limited to unsupervised learning methods. To enhance interpretability and overcome this problem, we developed a novel feature selection algorithm. Meanwhile, complex genomic data pose great challenges for the identification of biomarkers and therapeutic targets. Some current feature selection methods suffer from low sensitivity and specificity in this field. Results In this article, we designed a multi-scale clustering-based feature selection algorithm named MCBFS, which simultaneously performs feature selection and model learning for genomic data analysis. The experimental results demonstrated that MCBFS is robust and effective by comparing it with seven benchmark and six state-of-the-art supervised methods on eight data sets. The visualization results and the statistical test showed that MCBFS can capture the informative genes and improve the interpretability and visualization of tumor gene expression and single-cell sequencing data. Additionally, we developed a general framework named McbfsNW, which uses gene expression data and protein interaction data to identify robust biomarkers and therapeutic targets for the diagnosis and therapy of diseases. The framework incorporates the MCBFS algorithm, a network recognition ensemble algorithm and a feature selection wrapper. McbfsNW has been applied to the lung adenocarcinoma (LUAD) data sets. The preliminary results demonstrated that better prediction results can be attained with the identified biomarkers on the independent LUAD data set, and we also constructed a drug-target network which may be useful for LUAD therapy.
Conclusions The proposed novel feature selection method is robust and effective for gene selection, classification, and visualization. The framework McbfsNW is practical and helpful for the identification of biomarkers and targets on genomic data. It is believed that the same methods and principles are extensible and applicable to other different kinds of data sets.
Affiliation(s)
- Da Xu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Jialin Zhang
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Hanxiao Xu
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Yusen Zhang
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Wei Chen
  - School of Mathematics and Statistics, Shandong University, Weihai, 264209, China
- Rui Gao
  - School of Control Science and Engineering, Shandong University, Jinan, 250061, China
- Matthias Dehmer
  - Institute for Intelligent Production, Faculty for Management, University of Applied Sciences Upper Austria, Steyr Campus, Steyr, Austria
  - College of Computer and Control Engineering, Nankai University, Tianjin, 300071, China
  - Department of Mechatronics and Biomedical Computer Science, UMIT, Hall in Tyrol, Austria
|
46
|
Nakashima S, Nacher JC, Song J, Akutsu T. An Overview of Bioinformatics Methods for Analyzing Autism Spectrum Disorders. Curr Pharm Des 2020; 25:4552-4559. [PMID: 31713477 DOI: 10.2174/1381612825666191111154837] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 09/27/2019] [Accepted: 11/07/2019] [Indexed: 02/06/2023]
Abstract
Autism Spectrum Disorders (ASD) are a group of neurodevelopmental disorders that are well recognized to be biologically heterogeneous, with various associated factors, including genetic, metabolic, and environmental ones. Despite their high prevalence, only a few drugs have been approved for the treatment of ASD. Therefore, extensive studies have been conducted to identify ASD risk genes and novel drug targets. Since many genes and many other factors are associated with ASD, various bioinformatics methods have also been developed for the analysis of ASD. In this paper, we review bioinformatics methods for analyzing ASD data with the focus on computational aspects. We classify existing methods into two categories: (i) methods based on genomic variants and gene expression data, and (ii) methods using biological networks, which include gene co-expression networks and protein-protein interaction networks. Next, for each method, we provide an overall flow and elaborate on the computational techniques used. We also briefly review other approaches and discuss possible future directions and strategies for developing bioinformatics approaches to analyze ASD.
Affiliation(s)
- Shogo Nakashima
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
- Jose C Nacher
  - Department of Information Science, Faculty of Science, Toho University, Kyoto, Japan
- Jiangning Song
  - Monash Biomedicine Discovery Institute, Monash University, Clayton VIC 3800, Australia
- Tatsuya Akutsu
  - Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan
|
47
|
Hu J, Li Y, Gao W, Zhang P. Robust multi-label feature selection with dual-graph regularization. Knowl Based Syst 2020. [DOI: 10.1016/j.knosys.2020.106126] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 02/06/2023]
|
48
|
A supervised machine learning-based methodology for analyzing dysregulation in splicing machinery: An application in cancer diagnosis. Artif Intell Med 2020; 108:101950. [PMID: 32972670 DOI: 10.1016/j.artmed.2020.101950] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/15/2019] [Revised: 08/15/2020] [Accepted: 08/18/2020] [Indexed: 02/06/2023]
Abstract
Deregulated splicing machinery components have been shown to be associated with the development of several types of cancer and, therefore, the determination of such alterations can help the development of tumor-specific molecular targets for early prognosis and therapy. Determining such splicing components, however, is not a straightforward task, mainly due to the heterogeneity of tumors, the variability across samples, and the fat-short characteristic of genomic datasets. In this work, a supervised machine learning-based methodology is proposed, allowing the determination of subsets of relevant splicing components that best discriminate samples. The methodology comprises three main phases: first, a ranking of features is determined by applying feature weighting algorithms that compute the importance of each splicing component; second, the best subset of features that allows the induction of an accurate classifier is determined by conducting an effective heuristic search; third, the confidence in the induced classifier is assessed by explaining its individual predictions and its global behavior. Finally, an extensive experimental study was conducted on a large collection of transcript-based datasets, illustrating the utility and benefit of the proposed methodology for analyzing dysregulation in splicing machinery.
|
49
|
Li J, Ping Y, Li H, Li H, Liu Y, Liu B, Wang Y. Prognostic prediction of carcinoma by a differential-regulatory-network-embedded deep neural network. Comput Biol Chem 2020; 88:107317. [PMID: 32622180 DOI: 10.1016/j.compbiolchem.2020.107317] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 05/28/2020] [Accepted: 06/21/2020] [Indexed: 02/04/2023]
Abstract
Accurate prognostic prediction is essential for precise diagnosis and treatment of carcinoma. In addition to clinical survival prediction methods, many computational methods based on transcriptomic data have been proposed to build prediction models and study the prognosis of cancer patients. We propose a differential-regulatory-network-embedded deep neural network (DRE-DNN) method that integrates differential regulatory analysis based on a gene co-expression network with a deep neural network (DNN). From three public hepatocellular carcinoma (HCC) datasets, we derive a differential regulatory network and embed the regulatory information into the DNN. By employing 1869 differential regulatory genes and survival data, we apply DRE-DNN to build a prediction model. We compare our method with a normal DNN that uses all gene features, and the results show that our method has better generalization ability and accuracy. We modify the normal DNN and develop an efficient method to predict the prognosis of HCC from gene expression data. Our method decreases the inconsistency caused by overfitting when the training sample size is small. DRE-DNN is also extendable to prognostic prediction of other cancers.
Affiliation(s)
- Junyi Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Yuan Ping
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Hong Li
  - CAS Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institute of Nutrition and Health, Shanghai Institutes for Biological Sciences, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Huinian Li
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Ying Liu
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
- Bo Liu
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong 518055, China
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
|
50
|
Zhou X, Chai H, Zhao H, Luo CH, Yang Y. Imputing missing RNA-sequencing data from DNA methylation by using a transfer learning-based neural network. Gigascience 2020; 9:giaa076. [PMID: 32649756 PMCID: PMC7350980 DOI: 10.1093/gigascience/giaa076] [Citation(s) in RCA: 28] [Impact Index Per Article: 5.6] [Received: 12/20/2019] [Revised: 04/23/2020] [Accepted: 06/24/2020] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Gene expression plays a key intermediate role in linking molecular features at the DNA level and phenotype. However, owing to various limitations in experiments, RNA-seq data are missing in many samples for which high-quality DNA methylation data exist. Because DNA methylation is an important epigenetic modification that regulates gene expression, it can be used to predict RNA-seq data. For this purpose, many methods have been developed. A common limitation of these methods is that they mainly focus on a single cancer dataset and do not fully utilize information from large pan-cancer datasets.
RESULTS Here, we have developed a novel method to impute missing gene expression data from DNA methylation data through a transfer learning-based neural network, namely, TDimpute. In the method, the pan-cancer dataset from The Cancer Genome Atlas (TCGA) was utilized for training a general model, which was then fine-tuned on the specific cancer dataset. By testing on 16 cancer datasets, we found that our method significantly outperforms other state-of-the-art methods in imputation accuracy with a 7-11% improvement under different missing rates. The imputed gene expression was further shown to be useful for downstream analyses, including the identification of both methylation-driving and prognosis-related genes, clustering analysis, and survival analysis on the TCGA dataset. More importantly, our method was shown to be generally applicable by an independent test on the Wilms tumor dataset from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project.
CONCLUSIONS TDimpute is an effective method for RNA-seq imputation with limited training samples.
Affiliation(s)
- Xiang Zhou
  - School of Data and Computer Science, Sun Yat-sen University, 132 East Waihuan Road, Guangzhou 510006, China
- Hua Chai
  - School of Data and Computer Science, Sun Yat-sen University, 132 East Waihuan Road, Guangzhou 510006, China
- Huiying Zhao
  - Sun Yat-sen Memorial Hospital, Sun Yat-sen University, 107 Yan Jiang West Road, Guangzhou 510120, China
- Ching-Hsing Luo
  - School of Data and Computer Science, Sun Yat-sen University, 132 East Waihuan Road, Guangzhou 510006, China
- Yuedong Yang
  - School of Data and Computer Science, Sun Yat-sen University, 132 East Waihuan Road, Guangzhou 510006, China
  - Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, 132 East Waihuan Road, Guangzhou 510006, China
|