1
|
Li Q, Wang P, Yuan J, Zhou Y, Mei Y, Ye M. A two-stage hybrid gene selection algorithm combined with machine learning models to predict the rupture status in intracranial aneurysms. Front Neurosci 2022; 16:1034971. [PMID: 36340761 PMCID: PMC9631203 DOI: 10.3389/fnins.2022.1034971] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/02/2022] [Accepted: 09/30/2022] [Indexed: 07/31/2023]
Abstract
An intracranial aneurysm (IA) is an abnormal swelling of cerebral vessels, and a subset of IAs can rupture, causing aneurysmal subarachnoid hemorrhage (aSAH), which often results in death or severe disability. Few studies have combined an appropriate feature selection method with machine learning to analyze transcriptomic sequencing data and identify new molecular biomarkers. Following gene ontology (GO) and enrichment analysis of all 913 differentially expressed genes, we found that the distinct status of IAs could lead to differential innate immune responses. Because many of these genes are irrelevant or redundant, we propose a mixed filter- and wrapper-based feature selection. First, we used the Fast Correlation-Based Filter (FCBF) algorithm to filter out a large number of irrelevant and redundant genes in the raw dataset. We then applied a wrapper feature selection method based on the Multi-layer Perceptron (MLP) neural network and Particle Swarm Optimization (PSO), with accuracy (ACC) and mean square error (MSE) as the evaluation criteria. Finally, we constructed a novel 10-gene signature (YIPF1, RAB32, WDR62, ANPEP, LRRCC1, AADAC, GZMK, WBP2NL, PBX1, and TOR1B) with the proposed two-stage hybrid algorithm FCBF-MLP-PSO and used different machine learning models to predict the rupture status in IAs. The highest ACC value increased from 0.817 to 0.919 (12.5% increase), the highest area under the ROC curve (AUC) value increased from 0.87 to 0.94 (8.0% increase), and all evaluation metrics improved by approximately 10% after processing by our proposed gene selection algorithm. Therefore, these 10 informative genes used to predict the rupture status of IAs can complement imaging examinations in the clinic. Meanwhile, the selected gene signature also provides new targets and approaches for the treatment of ruptured IAs.
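The filter stage named in this abstract can be illustrated with a minimal sketch of a Fast Correlation-Based Filter, which ranks features by symmetric uncertainty, SU(X, Y) = 2 · I(X; Y) / (H(X) + H(Y)), and drops features that are more strongly tied to an already selected feature than to the class. This is not the authors' implementation: the function names, the `threshold` parameter, and the assumption that expression values have already been discretized are all illustrative.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def fcbf(features, labels, threshold=0.0):
    """Return feature names kept by a basic FCBF pass.

    features: dict mapping a (hypothetical) gene name to its discretized values.
    labels:   class label per sample (e.g. ruptured vs. unruptured).
    """
    relevance = {name: symmetric_uncertainty(col, labels)
                 for name, col in features.items()}
    # Keep only features whose relevance exceeds the threshold, most relevant first.
    candidates = sorted((n for n in relevance if relevance[n] > threshold),
                        key=lambda n: -relevance[n])
    selected = []
    while candidates:
        best = candidates.pop(0)
        selected.append(best)
        # Drop candidates that are at least as correlated with `best`
        # as they are with the class label (treated as redundant).
        candidates = [n for n in candidates
                      if symmetric_uncertainty(features[best], features[n]) < relevance[n]]
    return selected
```

In a two-stage pipeline such as the one described, the surviving features would then be passed to the wrapper stage; that stage is not sketched here.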
Affiliation(s)
- Qingqing Li
  - School of Medical Information, Wannan Medical College, Wuhu, Anhui, China
  - Research Center of Health Big Data Mining and Applications, Wannan Medical College, Wuhu, Anhui, China
- Peipei Wang
  - School of Medical Information, Wannan Medical College, Wuhu, Anhui, China
  - Research Center of Health Big Data Mining and Applications, Wannan Medical College, Wuhu, Anhui, China
- Jinlong Yuan
  - Department of Neurosurgery, Yijishan Hospital of Wannan Medical College, Wannan Medical College, Wuhu, Anhui, China
- Yunfeng Zhou
  - Department of Radiology, Yijishan Hospital of Wannan Medical College, Wannan Medical College, Wuhu, Anhui, China
- Yaxin Mei
  - School of Medical Information, Wannan Medical College, Wuhu, Anhui, China
  - Research Center of Health Big Data Mining and Applications, Wannan Medical College, Wuhu, Anhui, China
- Mingquan Ye
  - School of Medical Information, Wannan Medical College, Wuhu, Anhui, China
  - Research Center of Health Big Data Mining and Applications, Wannan Medical College, Wuhu, Anhui, China
|
2
|
Liu X, Zhang T, Tan Z, Warden AR, Li S, Cheung E, Ding X. A Hashing-Based Framework for Enhancing Cluster Delineation of High-Dimensional Single-Cell Profiles. Phenomics (Cham, Switzerland) 2022; 2:323-335. [PMID: 36939755 PMCID: PMC9590516 DOI: 10.1007/s43657-022-00056-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2022] [Revised: 04/08/2022] [Accepted: 04/15/2022] [Indexed: 10/18/2022]
Abstract
Although many methods have been developed to explore the function of cells by clustering high-dimensional (HD) single-cell omics data, the inconspicuously differential expressions of biomarkers of proteins or genes across all cells disturb the cell cluster delineation and downstream analysis. Here, we introduce a hashing-based framework to improve the delineation of cell clusters, which is based on the hypothesis that one variable with no significant differences can be decomposed into more diversely latent variables to distinguish cells. By projecting the original data into a sparse HD space, fly and densefly hashing preprocessing retain the local structure of data, and improve the cluster delineation of existing clustering methods, such as PhenoGraph. Moreover, the analyses on mass cytometry dataset show that our hashing-based framework manages to unveil new hidden heterogeneities in cell clusters. The proposed framework promotes the utilization of cell biomarkers and enriches the biological findings by introducing more latent variables. Supplementary Information: The online version contains supplementary material available at 10.1007/s43657-022-00056-z.
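The idea of projecting data into a sparse, higher-dimensional space can be sketched with a basic fly-hash-style transform: each output unit sums a small random subset of the inputs, and only the most active units are kept (winner-take-all). The sketch below is a generic binary variant under stated assumptions, not the framework from this paper; `make_projection`, `fly_hash`, and all parameter names are hypothetical.

```python
import random


def make_projection(n_inputs, n_outputs, n_samples_per_output, seed=0):
    """For each output unit, pick a random subset of input indices to sum."""
    rng = random.Random(seed)
    return [rng.sample(range(n_inputs), n_samples_per_output)
            for _ in range(n_outputs)]


def fly_hash(x, projection, top_k):
    """Expand x into len(projection) dimensions and keep the top_k activations.

    Returns a binary code with exactly top_k ones (assumed variant; other
    formulations keep the winning activation values instead of ones).
    """
    activations = [sum(x[i] for i in idx) for idx in projection]
    winners = sorted(range(len(activations)),
                     key=lambda j: activations[j], reverse=True)[:top_k]
    code = [0] * len(activations)
    for j in winners:
        code[j] = 1
    return code


# Example: expand a 5-marker profile into a 40-dimensional sparse code.
projection = make_projection(n_inputs=5, n_outputs=40, n_samples_per_output=2)
code = fly_hash([0.1, 0.9, 0.3, 0.5, 0.2], projection, top_k=4)
```

The resulting sparse codes, rather than the raw marker values, would then be handed to a clustering method.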
Affiliation(s)
- Xiao Liu
  - Institute of Personalized Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Ting Zhang
  - Institute of Personalized Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Ziyang Tan
  - Institute of Personalized Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Antony R. Warden
  - Institute of Personalized Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Shanhe Li
  - Institute of Personalized Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
- Edwin Cheung
  - Cancer Centre, Faculty of Health Sciences, University of Macau, Taipa, 999078 China
  - Centre of Precision Medicine Research and Training, Faculty of Health Sciences, University of Macau, Taipa, 999078 China
- Xianting Ding
  - Institute of Personalized Medicine, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, 200030 China
|
3
|
A combinatory algorithm for identifying genes in childhood acute lymphoblastic leukemia. Gene Reports 2022. [DOI: 10.1016/j.genrep.2021.101433] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
4
|
Wang CY, Yu N, Wu MJ, Gao YL, Liu JX, Wang J. Dual Hyper-Graph Regularized Supervised NMF for Selecting Differentially Expressed Genes and Tumor Classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:2375-2383. [PMID: 32086220 DOI: 10.1109/tcbb.2020.2975173] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/10/2023]
Abstract
Non-negative matrix factorization (NMF) is a dimensionality reduction technique based on high-dimensional mapping. It can learn part-based representations effectively. In this paper, we propose a method called Dual Hyper-graph Regularized Supervised Non-negative Matrix Factorization (HSNMF). To encode the geometric information of the data, the hyper-graph is introduced into the model as a regularization term. The advantage of hyper-graph learning is that it finds higher-order relationships in the data and thereby enhances data relevance. This method constructs the data hyper-graph and the feature hyper-graph to find the data manifold and the feature manifold simultaneously. The application of hyper-graph theory to cancer datasets can effectively find pathogenic genes. Discriminative information is further introduced into the objective function to obtain more information about the data. Supervised learning with label information greatly improves the classification performance. Furthermore, real cancer datasets usually contain sparse noise, so the L2,1-norm is applied to enhance the robustness of the HSNMF algorithm. Experiments on The Cancer Genome Atlas (TCGA) datasets verify the feasibility of the HSNMF method.
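As background for the factorization this abstract builds on, the following is a minimal sketch of plain NMF with the classic multiplicative update rules, approximating a non-negative matrix V by W · H. It deliberately omits the hyper-graph regularizers, label information, and L2,1-norm that distinguish HSNMF; the helper names and defaults (`n_iter`, `eps`, `seed`) are illustrative assumptions.

```python
import random


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout, which is what gives NMF its part-based interpretation.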
|
5
|
Zheng X, Zhang C. Gene selection for microarray data classification via dual latent representation learning. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.07.047] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/26/2022]
|
6
|
Gene Correlation Guided Gene Selection for Microarray Data Classification. BioMed Research International 2021; 2021:6490118. [PMID: 34435048 PMCID: PMC8382518 DOI: 10.1155/2021/6490118] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/18/2021] [Accepted: 08/09/2021] [Indexed: 12/14/2022]
Abstract
The microarray cancer data obtained by DNA microarray technology play an important role in cancer prevention, diagnosis, and treatment. However, predicting the different types of tumors is a challenging task since the sample size in microarray data is often small but the dimensionality is very high. Gene selection is an effective means of mitigating the curse of dimensionality and can boost the classification accuracy of microarray data. However, many previous gene selection methods focus on model design but neglect the correlation between different genes. In this paper, we introduce a novel unsupervised gene selection method that takes the gene correlation into consideration, named gene correlation guided gene selection (G3CS). Specifically, we calculate the covariance of different gene dimension pairs and embed it into our unsupervised gene selection model to regularize the gene selection coefficient matrix. In such a manner, redundant genes can be effectively excluded. In addition, we utilize a matrix factorization term to exploit the cluster structure of the original microarray data to assist the learning process. We design an iterative updating algorithm with a convergence guarantee to solve the resultant optimization problem. Experiments on six publicly available microarray datasets validate the efficacy of our proposed method.
|
7
|
Mahendran N, Durai Raj Vincent PM, Srinivasan K, Chang CY. Machine Learning Based Computational Gene Selection Models: A Survey, Performance Evaluation, Open Issues, and Future Research Directions. Front Genet 2020; 11:603808. [PMID: 33362861 PMCID: PMC7758324 DOI: 10.3389/fgene.2020.603808] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/08/2020] [Accepted: 10/29/2020] [Indexed: 12/20/2022]
Abstract
Gene expression is the process of determining the physical characteristics of living beings by generating the necessary proteins. Gene expression takes place in two steps, transcription and translation. It is the flow of information from DNA to RNA with the help of enzymes, and the end products are proteins and other biochemical molecules. Many technologies can capture gene expression from DNA or RNA. One such technique is the DNA microarray. Other than being expensive, the main issue with DNA microarrays is that they generate high-dimensional data with a minimal sample size. The issue in handling such a heavyweight dataset is that the learning model will be over-fitted. This problem should be addressed by substantially reducing the dimension of the data source. In recent years, machine learning has gained popularity in the field of genomic studies. In the literature, many machine learning-based gene selection approaches have been discussed, which were proposed to improve dimensionality reduction precision. This paper extensively reviews the various works on machine learning-based gene selection in recent years, along with their performance analysis. The study categorizes various feature selection algorithms under supervised, unsupervised, and semi-supervised learning. Recent works on reducing the features for diagnosing tumors are discussed in detail. Furthermore, the performance of several methods discussed in the literature is analyzed. This study also lists and briefly discusses the open issues in handling high-dimensional, small-sample-size data.
Affiliation(s)
- Nivedhitha Mahendran
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- P. M. Durai Raj Vincent
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Kathiravan Srinivasan
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore, India
- Chuan-Yu Chang
  - Department of Computer Science and Information Engineering, National Yunlin University of Science and Technology, Douliu, Taiwan
|
8
|
Davoudi A, Mahmoodian H. Stable gene selection by self-representation method in fuzzy sample classification. Med Biol Eng Comput 2020; 58:1213-1223. [PMID: 32212053 DOI: 10.1007/s11517-020-02160-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/11/2019] [Accepted: 03/12/2020] [Indexed: 11/27/2022]
Abstract
In recent years, microarray technology and gene expression profiles have been widely used to detect, predict, or classify the samples of various diseases. The large number of genes in these profiles and the small number of samples are known challenges in this field and have been widely considered in previous papers. Other topics, such as the noise of microarray data or the dependence of selected genes on samples, have received less attention in previous studies. Therefore, we have tried to address these two issues by using a fuzzy classifier and a stability index of selected genes, respectively. The proposed method is based on the regression function between the genes and class labels, which is determined by the self-representation method. This regression function is determined individually for each class of the database. To minimize the effect of noise in microarray data, a fuzzy classifier is applied in the proposed model. Four databases of gene expression profiles are examined in this article, and the results indicate that the proposed model has a relative advantage over the previous methods.
Affiliation(s)
- Armaghan Davoudi
  - Electrical Engineering Faculty, Najafabad Branch, Islamic Azad University, Najafabad, Iran
- Hamid Mahmoodian
  - Electrical Engineering Faculty, Najafabad Branch, Islamic Azad University, Najafabad, Iran
  - Digital Processing and Machine Vision Research Center, Najafabad Branch, Islamic Azad University, Najafabad, Iran
|
9
|
Unsupervised feature selection via adaptive hypergraph regularized latent representation learning. Neurocomputing 2020. [DOI: 10.1016/j.neucom.2019.10.018] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 10/25/2022]
|
11
|
Li J, Ge W, Wei Y, An D. Supervised discriminative manifold learning with subsidiary-view information for near infrared spectroscopic classification of crop seeds. Pattern Recognit Lett 2019. [DOI: 10.1016/j.patrec.2019.05.016] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 01/17/2023]
|
12
|
Gene selection for microarray data classification via adaptive hypergraph embedded dictionary learning. Gene 2019; 706:188-200. [DOI: 10.1016/j.gene.2019.04.060] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 10/27/2018] [Revised: 04/03/2019] [Accepted: 04/22/2019] [Indexed: 01/19/2023]
|
13
|
Tang C, Bian M, Liu X, Li M, Zhou H, Wang P, Yin H. Unsupervised feature selection via latent representation learning and manifold regularization. Neural Netw 2019; 117:163-178. [PMID: 31170576 DOI: 10.1016/j.neunet.2019.04.015] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 07/13/2018] [Revised: 04/16/2019] [Accepted: 04/22/2019] [Indexed: 01/17/2023]
Abstract
With the rapid development of multimedia technology, massive unlabelled data with high dimensionality need to be processed. As a means of dimensionality reduction, unsupervised feature selection has been widely recognized as an important and challenging pre-step for many machine learning and data mining tasks. Traditional unsupervised feature selection algorithms usually assume that the data instances are identically distributed and there is no dependency between them. However, the data instances are not only associated with high dimensional features but also inherently interconnected with each other. Furthermore, the inevitable noises mixed in data could degenerate the performances of previous methods which perform feature selection in original data space. Without label information, the connection information between data instances can be exploited and could help select relevant features. In this work, we propose a robust unsupervised feature selection method which embeds the latent representation learning into feature selection. Instead of measuring the feature importances in original data space, the feature selection is carried out in the learned latent representation space which is more robust to noises. The latent representation is modelled by non-negative matrix factorization of the affinity matrix which explicitly reflects the relationships of data instances. Meanwhile, the local manifold structure of original data space is preserved by a graph based manifold regularization term in the transformed feature space. An efficient alternating algorithm is developed to optimize the proposed model. Experimental results on eight benchmark datasets demonstrate the effectiveness of the proposed method.
Affiliation(s)
- Chang Tang
  - School of Computer Science, China University of Geosciences, Wuhan 430074, China
- Meiru Bian
  - Department of Hematology, The Affiliated Huai'an Hospital of Xuzhou Medical University, Huai'an 223002, China
- Xinwang Liu
  - School of Computer Science, National University of Defense Technology, Changsha 410073, China
- Miaomiao Li
  - School of Computer Science, National University of Defense Technology, Changsha 410073, China
- Hua Zhou
  - Department of Hematology, The Affiliated Huai'an Hospital of Xuzhou Medical University, Huai'an 223002, China
- Pichao Wang
  - Alibaba Group (U.S.) Inc. Bellevue, WA, 98004, USA
- Hailin Yin
  - Department of Oncology, People's Hospital of Lian'shui County, Huai'an 223300, China
|
14
|
Tang C, Zhou H, Zheng X, Zhang Y, Sha X. Dual Laplacian regularized matrix completion for microRNA-disease associations prediction. RNA Biol 2019; 16:601-611. [PMID: 30676207 PMCID: PMC6546388 DOI: 10.1080/15476286.2019.1570811] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/14/2018] [Revised: 11/30/2018] [Accepted: 01/03/2019] [Indexed: 01/21/2023]
Abstract
Since many miRNA-disease associations have been verified, it is meaningful to discover more such associations to support the diagnosis and prevention of complex human diseases. However, it is not practical to identify potential associations using traditional biological experimental methods since the process is expensive and time consuming. Therefore, it is necessary to develop efficient computational methods to accomplish this task. In this work, we introduced a matrix completion model with dual Laplacian regularization (DLRMC) to infer unknown miRNA-disease associations in heterogeneous omics data. Specifically, DLRMC transformed the task of miRNA-disease association prediction into a matrix completion problem, in which the potential missing entries of the miRNA-disease association matrix were calculated; the missing associations can then be obtained based on the prediction scores after the completion procedure. Meanwhile, the miRNA functional similarity and the disease semantic similarity were fully exploited to serve the miRNA-disease association matrix completion by using a dual Laplacian regularization term. In the experiments, we conducted global and local Leave-One-Out Cross Validation (LOOCV) and case studies to evaluate the efficacy of DLRMC on the human miRNA-disease associations dataset obtained from the HMDDv2.0 database. As a result, the AUCs of DLRMC are 0.9174 and 0.8289 in global LOOCV and local LOOCV, respectively, which significantly outperform a variety of previous methods. In addition, in the case studies on four significant diseases related to human health, including colon neoplasms, kidney neoplasms, lymphoma, and prostate neoplasms, 90%, 92%, 92% and 94% of the top 50 predicted miRNAs have been confirmed, respectively.
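The role of a Laplacian regularization term can be illustrated with a small, generic sketch: given a symmetric similarity matrix W, the graph Laplacian L = D − W yields the penalty fᵀLf = ½ Σᵢⱼ wᵢⱼ (fᵢ − fⱼ)², which is small when similar items receive similar scores. This is a single-vector illustration only, not the DLRMC objective, which applies such terms along both the miRNA and disease dimensions of the association matrix; all names below are hypothetical.

```python
def graph_laplacian(weights):
    """L = D - W for a symmetric similarity matrix given as a list of rows."""
    n = len(weights)
    return [[(sum(weights[i]) if i == j else 0.0) - weights[i][j]
             for j in range(n)] for i in range(n)]


def smoothness_penalty(scores, weights):
    """Quadratic form f^T L f for a score vector f."""
    lap = graph_laplacian(weights)
    n = len(scores)
    return sum(scores[i] * lap[i][j] * scores[j]
               for i in range(n) for j in range(n))


def pairwise_penalty(scores, weights):
    """Equivalent form: 0.5 * sum_ij w_ij * (f_i - f_j)^2."""
    n = len(scores)
    return 0.5 * sum(weights[i][j] * (scores[i] - scores[j]) ** 2
                     for i in range(n) for j in range(n))


# Example: three items where the first two are highly similar.
similarity = [[0.0, 1.0, 0.2],
              [1.0, 0.0, 0.5],
              [0.2, 0.5, 0.0]]
penalty = smoothness_penalty([1.0, 0.9, 0.1], similarity)
```

Adding such a penalty to a completion objective discourages predictions in which highly similar items receive very different association scores.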
Affiliation(s)
- Chang Tang
  - School of Computer Science, China University of Geosciences, Wuhan, China
- Hua Zhou
  - Department of Hematology, The Affiliated Huai’an Hospital of Xuzhou Medical University, Huai’an, China
- Xiao Zheng
  - Wuhan University of Technology Hospital, Wuhan University of Technology, Wuhan, China
- Yanming Zhang
  - Department of Hematology, The Affiliated Huai’an Hospital of Xuzhou Medical University, Huai’an, China
- Xiaofeng Sha
  - Department of Oncology, Huai’an Hongze District People’s Hospital, Huai’an, China
|