1
Ding X, Yang F, Zhong Y, Cao J. A Novel Recursive Gene Selection Method Based on Least Square Kernel Extreme Learning Machine. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2026-2038. [PMID: 33764877 DOI: 10.1109/tcbb.2021.3068846] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/12/2023]
Abstract
This paper presents a recursive feature elimination (RFE) mechanism to select the most informative genes with a least square kernel extreme learning machine (LSKELM) classifier. Relating the generalization ability of LSKELM to a small norm of weights, we propose a ranking criterion that evaluates the importance of genes by the norm of weights obtained by LSKELM. The proposed method, called LSKELM-RFE, first employs the original genes to build an LSKELM classifier, then ranks the genes according to their importance given by the norm of output weights of LSKELM, and finally removes a "least important" gene. Benefiting from the random mapping mechanism of the extreme learning machine (ELM) kernel, no parameter of LSKELM-RFE needs to be manually tuned. A comparative study of our proposed algorithm and two other well-known RFE algorithms shows that LSKELM-RFE outperforms them in both computational cost and generalization ability.
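The train, rank, and remove loop described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' LSKELM-RFE: a small ridge-regularized linear model fitted by gradient descent stands in for the LSKELM classifier, and the absolute weight of each gene stands in for the weight-norm ranking criterion. The names `fit_linear` and `rfe`, the hyperparameters, and the toy data are all hypothetical.

```python
def fit_linear(X, y, steps=200, lr=0.05, l2=0.01):
    """Stand-in model: ridge-regularized least squares via gradient descent."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [l2 * wj for wj in w]
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def rfe(X, y, n_keep):
    """Retrain on the surviving genes and drop the lowest-ranked one each round."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w = fit_linear(X_sub, y)
        # Ranking criterion (assumption): smallest absolute weight is least important.
        worst = min(range(len(remaining)), key=lambda k: abs(w[k]))
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Toy data: gene 0 tracks the label, gene 1 is unrelated, gene 2 is a weak copy of gene 0.
X = [[1.0, 1.0, 0.1], [1.0, -1.0, 0.1], [-1.0, 1.0, -0.1], [-1.0, -1.0, -0.1]]
y = [1.0, 1.0, -1.0, -1.0]
kept, dropped = rfe(X, y, n_keep=1)
```

Because the model is refitted after every removal, the ranking of the surviving genes can change from round to round, which is what distinguishes RFE from a one-shot filter.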
2
Gene Correlation Guided Gene Selection for Microarray Data Classification. BIOMED RESEARCH INTERNATIONAL 2021; 2021:6490118. [PMID: 34435048 PMCID: PMC8382518 DOI: 10.1155/2021/6490118] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/18/2021] [Accepted: 08/09/2021] [Indexed: 12/14/2022]
Abstract
The microarray cancer data obtained by DNA microarray technology play an important role in cancer prevention, diagnosis, and treatment. However, predicting the different types of tumors is a challenging task, since the sample size in microarray data is often small while the dimensionality is very high. Gene selection is an effective means of mitigating the curse of dimensionality and can boost the classification accuracy of microarray data. However, many previous gene selection methods focus on model design but neglect the correlation between different genes. In this paper, we introduce a novel unsupervised gene selection method that takes gene correlation into consideration, named gene correlation guided gene selection (G3CS). Specifically, we calculate the covariance of different gene dimension pairs and embed it into our unsupervised gene selection model to regularize the gene selection coefficient matrix. In this manner, redundant genes can be effectively excluded. In addition, we utilize a matrix factorization term to exploit the cluster structure of the original microarray data to assist the learning process. We design an iterative updating algorithm with a convergence guarantee to solve the resultant optimization problem. Experiments on six publicly available microarray datasets validate the efficacy of our proposed method.
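The pairwise gene covariance mentioned in this abstract can be computed as in the sketch below. It covers only the covariance step, not the G3CS regularizer or its optimization; the function name `gene_covariance` and the toy matrix are hypothetical, and rows are assumed to be samples with one column per gene.

```python
def gene_covariance(X):
    """Sample covariance matrix between gene columns (requires at least 2 samples)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for a in range(d):
        for b in range(a, d):
            s = sum((row[a] - means[a]) * (row[b] - means[b]) for row in X)
            cov[a][b] = cov[b][a] = s / (n - 1)
    return cov


# Gene 1 is exactly twice gene 0 (redundant); gene 2 is constant.
X = [[1.0, 2.0, 5.0], [2.0, 4.0, 5.0], [3.0, 6.0, 5.0]]
cov = gene_covariance(X)
```

A large off-diagonal entry relative to the two variances flags a pair of genes that carry overlapping information, which is the kind of redundancy a correlation-aware selector aims to penalize.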
3
Gupta M, Gupta B. A novel gene expression test method of minimizing breast cancer risk in reduced cost and time by improving SVM-RFE gene selection method combined with LASSO. J Integr Bioinform 2020; 18:139-153. [PMID: 34171941 PMCID: PMC7856389 DOI: 10.1515/jib-2019-0110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/31/2019] [Accepted: 11/12/2020] [Indexed: 01/26/2023]
Abstract
Breast cancer is a leading cause of death in women. It is induced by a genetic mutation in breast cancer cells. Genetic testing has become popular for detecting mutations in genes, but the test is relatively expensive for many patients in developing countries like India. A genetic test takes between 2 and 4 weeks to determine the presence of cancer. This delay harms the prognosis, because some patients have a high rate of cancerous cell growth. In this research work, a cost- and time-efficient method is proposed to predict the gene expression level on the basis of the clinical outcomes of the patient by using machine learning techniques. An improved SVM-RFE_MI gene selection technique is proposed to find the most significant genes related to breast cancer; afterward, explained-variance statistical analysis is applied to extract the genes with high variance. Least Absolute Shrinkage and Selection Operator (LASSO) and Ridge regression techniques are used to predict the gene expression level. The proposed method predicts the expression of significant genes with reduced Root Mean Square Error and an acceptable adjusted R-square value. According to the study, analysis of these selected genes is beneficial for diagnosing breast cancer at an earlier stage with reduced cost and time.
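The two evaluation metrics named in this abstract, Root Mean Square Error and adjusted R-square, follow standard definitions and can be sketched as below. The function names and the toy values are hypothetical; `n_predictors` is the number of predictor variables in the fitted regression, and the adjusted value is only defined when the sample count exceeds `n_predictors + 1`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def adjusted_r2(y_true, y_pred, n_predictors):
    """R-square penalized for the number of predictors in the model."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Toy expression levels and predictions from a one-predictor model.
observed = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 7.0, 10.0]
error = rmse(observed, predicted)
fit = adjusted_r2(observed, predicted, n_predictors=1)
```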
Affiliation(s)
- Madhuri Gupta: Department of Computer Engineering and Information Technology, ABES Engineering College, Ghaziabad, Uttar Pradesh, India
- Bharat Gupta: Department of CS&IT, Jaypee Institute of Information Technology, Noida, Uttar Pradesh, India
4
Zhou M, Bian K, Hu F, Lai W. A New Method Based on CEEMD Combined With Iterative Feature Reduction for Aided Diagnosis of Epileptic EEG. Front Bioeng Biotechnol 2020; 8:669. [PMID: 32695761 PMCID: PMC7338793 DOI: 10.3389/fbioe.2020.00669] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 04/17/2020] [Accepted: 05/28/2020] [Indexed: 11/26/2022]
Abstract
In the clinical diagnosis of epileptic diseases, the intelligent diagnosis of epileptic electroencephalogram (EEG) signals has become a research focus in the field of brain diseases. Because manual diagnosis is time-consuming and easily influenced by subjective human factors, artificial intelligence pattern recognition algorithms have been applied to EEG signal recognition. However, the commonly used empirical mode decomposition (EMD) signal decomposition algorithm does not consider the problem of mode aliasing, and the EEG features obtained by feature extraction may be mixed with unimportant features that affect the classification accuracy. In this paper, we propose a new method based on complementary ensemble empirical mode decomposition (CEEMD) combined with iterative feature reduction for aided diagnosis of epileptic EEG. First, the evaluation indexes for decomposing and reconstructing signals by several methods were compared, and CEEMD was selected as the decomposition method. Then, support vector machine recursive feature elimination (SVM-RFE) was used to reduce the 9 features extracted from the EEG data. A gray wolf optimizer support vector classification (GWO-SVC) recognition model was established for different feature subsets. By comparing the classification accuracy on the training set and test set for different feature subsets, and considering the model complexity reflected by the number of features selected by SVM-RFE, the analysis showed that the 6-feature subset, with fewer features and higher classification accuracy, could reflect the key information of epileptic EEG. The classification accuracy was 99.38% on the training set and as high as 100% on the test set. The recognition time was only 1.6551 s.
Finally, to verify the reliability of the proposed algorithm, it was compared with the classification model established on the raw EEG signals and with optimization models established by other intelligent optimization algorithms. The proposed algorithm was found to have higher classification accuracy and faster recognition time than the other processing methods. The experimental results show that CEEMD combined with SVM-RFE is feasible for rapid and accurate recognition of EEG signals, which provides a theoretical basis for the aided diagnosis of epilepsy.
Affiliation(s)
- Mengran Zhou: School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, China; State Key Laboratory of Mining Response and Disaster Prevention and Control in Deep Coal Mines, Anhui University of Science and Technology, Huainan, China
- Kai Bian: School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, China
- Feng Hu: School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, China
- Wenhao Lai: School of Electrical and Information Engineering, Anhui University of Science and Technology, Huainan, China
5
Gene selection for microarray data classification via adaptive hypergraph embedded dictionary learning. Gene 2019; 706:188-200. [DOI: 10.1016/j.gene.2019.04.060] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/27/2018] [Revised: 04/03/2019] [Accepted: 04/22/2019] [Indexed: 01/19/2023]
6
Okuwobi IP, Fan W, Yu C, Yuan S, Liu Q, Zhang Y, Loza B, Chen Q. Automated segmentation of hyperreflective foci in spectral domain optical coherence tomography with diabetic retinopathy. J Med Imaging (Bellingham) 2018; 5:014002. [PMID: 29430477 DOI: 10.1117/1.jmi.5.1.014002] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Received: 06/14/2017] [Accepted: 01/11/2018] [Indexed: 11/14/2022]
Abstract
We propose an automated segmentation method to detect, segment, and quantify hyperreflective foci (HFs) in three-dimensional (3-D) spectral domain optical coherence tomography (SD-OCT). The algorithm is divided into three stages: preprocessing, layer segmentation, and HF segmentation. In this paper, a supervised classifier (random forest) was used to produce the set of boundary probabilities in which an optimal graph search method was then applied to identify and produce the layer segmentation using the Sobel edge algorithm. An automated grow-cut algorithm was applied to segment the HFs. The proposed algorithm was tested on 20 3-D SD-OCT volumes from 20 patients diagnosed with proliferative diabetic retinopathy (PDR) and diabetic macular edema (DME). The average dice similarity coefficient and correlation coefficient ([Formula: see text]) are 62.30%, 96.90% for PDR, and 63.80%, 97.50% for DME, respectively. The proposed algorithm can provide clinicians with accurate quantitative information, such as the size and volume of the HFs. This can assist in clinical diagnosis, treatment, disease monitoring, and progression.
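The Dice similarity coefficient reported in this abstract has a standard definition, twice the overlap divided by the combined size of the two segmentations. A minimal sketch follows, assuming each segmentation is given as a collection of voxel coordinates; the function name, the convention for two empty masks, and the toy voxels are hypothetical.

```python
def dice_coefficient(predicted, reference):
    """Dice similarity between two sets of segmented voxel coordinates."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # assumption: two empty segmentations agree perfectly
    return 2.0 * len(predicted & reference) / (len(predicted) + len(reference))


# Toy 3-D voxel sets: two of three voxels overlap.
auto_mask = {(0, 0, 0), (0, 0, 1), (0, 1, 1)}
manual_mask = {(0, 0, 1), (0, 1, 1), (1, 1, 1)}
score = dice_coefficient(auto_mask, manual_mask)
```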
Affiliation(s)
- Idowu Paul Okuwobi: Nanjing University of Science and Technology, School of Computer Science and Engineering, Xiaolingwei, Nanjing, China
- Wen Fan: The First Affiliated Hospital with Nanjing Medical University, Department of Ophthalmology, Nanjing, China
- Chenchen Yu: Nanjing University of Science and Technology, School of Computer Science and Engineering, Xiaolingwei, Nanjing, China
- Songtao Yuan: The First Affiliated Hospital with Nanjing Medical University, Department of Ophthalmology, Nanjing, China
- Qinghuai Liu: The First Affiliated Hospital with Nanjing Medical University, Department of Ophthalmology, Nanjing, China
- Yuhan Zhang: Nanjing University of Science and Technology, School of Computer Science and Engineering, Xiaolingwei, Nanjing, China
- Bekalo Loza: Nanjing University of Science and Technology, School of Computer Science and Engineering, Xiaolingwei, Nanjing, China
- Qiang Chen: Nanjing University of Science and Technology, School of Computer Science and Engineering, Xiaolingwei, Nanjing, China
7
Huang X, Lin X, Zeng J, Wang L, Yin P, Zhou L, Hu C, Yao W. A Computational Method of Defining Potential Biomarkers based on Differential Sub-Networks. Sci Rep 2017; 7:14339. [PMID: 29085035 PMCID: PMC5662748 DOI: 10.1038/s41598-017-14682-5] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 03/24/2017] [Accepted: 10/16/2017] [Indexed: 01/05/2023]
Abstract
Analyzing omics data from a network-based perspective can facilitate biomarker discovery. To improve disease diagnosis and identify prospective information indicating the onset of complex disease, a computational method for identifying potential biomarkers based on differential sub-networks (PB-DSN) is developed. In PB-DSN, Pearson correlation coefficient (PCC) is used to measure the relationship between feature ratios and to infer potential networks. A differential sub-network is extracted to identify crucial information for discriminating different groups and indicating the emergence of complex diseases. Subsequently, PB-DSN defines potential biomarkers based on the topological analysis of these differential sub-networks. In this study, PB-DSN is applied to handle a static genomics dataset of small, round blue cell tumors and a time-series metabolomics dataset of hepatocellular carcinoma. PB-DSN is compared with support vector machine-recursive feature elimination, multivariate empirical Bayes statistics, analyzing time-series data based on dynamic networks, molecular networks based on PCC, PinnacleZ, graph-based iterative group analysis, KeyPathwayMiner and BioNet. The better performance of PB-DSN not only demonstrates its effectiveness for the identification of discriminative features that facilitate disease classification, but also shows its potential for the identification of warning signals.
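The idea of inferring a network from Pearson correlation coefficients and then comparing networks between groups can be sketched as below. This is a simplified illustration, not the PB-DSN procedure: the abstract does not state how edges are thresholded or how the differential sub-network is extracted, so the fixed cutoff and the "edge present in exactly one group" rule are assumptions, as are all function names and the toy data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def pcc_edges(features, threshold=0.9):
    """Connect feature pairs whose absolute PCC reaches the (assumed) cutoff."""
    return {(i, j) for i, j in combinations(range(len(features)), 2)
            if abs(pearson(features[i], features[j])) >= threshold}


def differential_edges(group_a, group_b, threshold=0.9):
    """Assumed rule: keep edges that appear in exactly one of the two groups."""
    return pcc_edges(group_a, threshold) ^ pcc_edges(group_b, threshold)


# Each inner list holds one feature measured across the samples of a group.
group_a = [[1, 2, 3, 4], [2, 4, 6, 8], [1, -1, 1, -1]]
group_b = [[1, 2, 3, 4], [4, 1, 3, 2], [1, 2, 3, 4]]
changed = differential_edges(group_a, group_b)
```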
Affiliation(s)
- Xin Huang: School of Computer Science & Technology, Dalian University of Technology, 116024, Dalian, China
- Xiaohui Lin: School of Computer Science & Technology, Dalian University of Technology, 116024, Dalian, China
- Jun Zeng: CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, China
- Lichao Wang: CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, China
- Peiyuan Yin: CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, China
- Lina Zhou: CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, China
- Chunxiu Hu: CAS Key Laboratory of Separation Science for Analytical Chemistry, Dalian Institute of Chemical Physics, Chinese Academy of Sciences, Dalian, 116023, China
- Weihong Yao: School of Computer Science & Technology, Dalian University of Technology, 116024, Dalian, China
8
Meng J, Zhang J, Luan YS, He XY, Li LS, Zhu YF. Parallel gene selection and dynamic ensemble pruning based on Affinity Propagation. Comput Biol Med 2017; 87:8-21. [PMID: 28544912 DOI: 10.1016/j.compbiomed.2017.05.016] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/13/2017] [Revised: 05/13/2017] [Accepted: 05/13/2017] [Indexed: 12/01/2022]
Abstract
Gene selection and sample classification based on gene expression data are important research areas in bioinformatics. Selecting important genes closely related to classification is a challenging task due to high dimensionality and small sample size of microarray data. Extended rough set based on neighborhood has been successfully applied to gene selection, as it can select attributes without redundancy and deal with numerical attributes directly. However, the computation of approximations in rough set is extremely time consuming. In this paper, in order to accelerate the process of gene selection, a parallel computation method is proposed to calculate approximations of intersection neighborhood rough set. Furthermore, a novel dynamic ensemble pruning approach based on Affinity Propagation clustering and dynamic pruning framework is proposed to reduce memory usage and computational cost. Experimental results on three Arabidopsis thaliana biotic and abiotic stress response datasets demonstrate that the proposed method can obtain better classification performance than ensemble method with gene pre-selection.
Affiliation(s)
- Jun Meng: School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, China
- Jing Zhang: School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, China
- Yu-Shi Luan: School of Life Science and Biotechnology, Dalian University of Technology, Dalian 116023, China
- Xin-Yu He: School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, China
- Li-Shuang Li: School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, China
- Yuan-Feng Zhu: BorderX Lab Inc, Silicon Valley, California, 94086, USA
9
Gene selection for tumor classification using neighborhood rough sets and entropy measures. J Biomed Inform 2017; 67:59-68. [PMID: 28215562 DOI: 10.1016/j.jbi.2017.02.007] [Citation(s) in RCA: 69] [Impact Index Per Article: 8.6] [Received: 06/02/2016] [Revised: 01/25/2017] [Accepted: 02/09/2017] [Indexed: 01/04/2023]
Abstract
With the development of bioinformatics, tumor classification from gene expression data has become an important technology for cancer diagnosis. Since a gene expression dataset often contains thousands of genes and a small number of samples, gene selection becomes a key step for tumor classification. Attribute reduction based on rough sets has been successfully applied to gene selection, as it is data-driven and requires no additional information. However, the traditional rough set method deals with discrete data only. Gene expression data containing real-valued or noisy measurements are usually discretized in preprocessing, which may result in poor classification accuracy. In this paper, we propose a novel gene selection method based on the neighborhood rough set model, which can deal with real-valued data while maintaining the original gene classification information. Moreover, this paper introduces an entropy measure within the framework of neighborhood rough sets to tackle the uncertainty and noise of gene expression data. Using this measure can lead to the discovery of compact gene subsets. Finally, a gene selection algorithm is designed based on neighborhood granules and the entropy measure. Experiments on two gene expression datasets show that the proposed gene selection method is effective for improving the accuracy of tumor classification.
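How a neighborhood rough set handles real-valued data without discretization can be sketched with the usual neighborhood, positive-region, and dependency definitions. This is a minimal illustration only: it omits the entropy measure and the selection algorithm of the paper, and the Euclidean distance, the radius `delta`, the function names, and the toy data are assumptions.

```python
import math


def neighborhood(X, i, delta):
    """Indices of all samples within distance delta of sample i (including i)."""
    return [j for j in range(len(X)) if math.dist(X[i], X[j]) <= delta]


def positive_region(X, y, delta):
    """Samples whose whole neighborhood shares their class label."""
    return [i for i in range(len(X))
            if all(y[j] == y[i] for j in neighborhood(X, i, delta))]


def dependency(X, y, delta):
    """Fraction of samples classified consistently by the chosen genes."""
    return len(positive_region(X, y, delta)) / len(X)


# One real-valued gene; the two middle samples sit close together across the class boundary.
X = [[0.0], [0.1], [0.45], [0.55], [0.9], [1.0]]
y = [0, 0, 0, 1, 1, 1]
consistent = positive_region(X, y, delta=0.15)
gamma = dependency(X, y, delta=0.15)
```

A gene subset with higher dependency separates the classes more cleanly at the chosen radius, so dependency can serve as a significance score when comparing candidate genes.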
10
A New Strategy for Analyzing Time-Series Data Using Dynamic Networks: Identifying Prospective Biomarkers of Hepatocellular Carcinoma. Sci Rep 2016; 6:32448. [PMID: 27578360 PMCID: PMC5006023 DOI: 10.1038/srep32448] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 10/02/2015] [Accepted: 08/08/2016] [Indexed: 02/08/2023]
Abstract
Time-series metabolomics studies can provide insight into the dynamics of disease development and facilitate the discovery of prospective biomarkers. To improve the performance of early risk identification, a new strategy for analyzing time-series data based on dynamic networks (ATSD-DN) in a systematic time dimension is proposed. In ATSD-DN, the non-overlapping ratio was applied to measure the changes in feature ratios during the process of disease development and to construct dynamic networks. Dynamic concentration analysis and network topological structure analysis were performed to extract early warning information. This strategy was applied to the study of time-series lipidomics data from a stepwise hepatocarcinogenesis rat model. A ratio of lyso-phosphatidylcholine (LPC) 18:1/free fatty acid (FFA) 20:5 was identified as the potential biomarker for hepatocellular carcinoma (HCC). It can be used to classify HCC and non-HCC rats, and the area under the curve values in the discovery and external validation sets were 0.980 and 0.972, respectively. This strategy was also compared with a weighted relative difference accumulation algorithm (wRDA), multivariate empirical Bayes statistics (MEBA) and support vector machine-recursive feature elimination (SVM-RFE). The better performance of ATSD-DN suggests its potential for a more complete presentation of time-series changes and effective extraction of early warning information.
11
Spetale FE, Bulacio P, Guillaume S, Murillo J, Tapia E. A spectral envelope approach towards effective SVM-RFE on infrared data. Pattern Recognit Lett 2016. [DOI: 10.1016/j.patrec.2015.12.007] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 10/22/2022]
12
Nguyen T, Khosravi A, Creighton D, Nahavandi S. A novel aggregate gene selection method for microarray data classification. Pattern Recognit Lett 2015. [DOI: 10.1016/j.patrec.2015.03.018] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Indexed: 11/29/2022]
13
Meng J, Zhang J, Luan Y. Gene Selection Integrated with Biological Knowledge for Plant Stress Response Using Neighborhood System and Rough Set Theory. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:433-444. [PMID: 26357229 DOI: 10.1109/tcbb.2014.2361329] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 06/05/2023]
Abstract
Mining knowledge from gene expression data is an active research topic in bioinformatics. Gene selection and sample classification are significant research directions, owing to the large number of genes and the small number of samples in gene expression data. Rough set theory has been successfully applied to gene selection, as it can select attributes without redundancy. To improve the interpretability of the selected genes, some researchers have introduced biological knowledge. In this paper, we first employ a neighborhood system to deal directly with the new information table formed by integrating gene expression data with biological knowledge, which can simultaneously present the information from multiple perspectives and does not weaken the information of individual genes for selection and classification. Then, we give a novel framework for gene selection and propose a significant gene selection method based on this framework by employing a reduction algorithm from rough set theory. The proposed method is applied to the analysis of plant stress response. Experimental results on three data sets show that the proposed method is effective, as it can select significant gene subsets without redundancy and achieve high classification accuracy. Biological analysis of the results shows that the interpretability is good.
14
Gene selection using rough set based on neighborhood for the analysis of plant stress response. Appl Soft Comput 2014. [DOI: 10.1016/j.asoc.2014.09.013] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Indexed: 11/19/2022]