1
Sagkrioti E, Biz GM, Takan I, Asfa S, Nikitaki Z, Zanni V, Kars RH, Hellweg CE, Azzam EI, Logotheti S, Pavlopoulou A, Georgakilas AG. Radiation Type- and Dose-Specific Transcriptional Responses across Healthy and Diseased Mammalian Tissues. Antioxidants (Basel) 2022; 11:2286. [PMID: 36421472 PMCID: PMC9687520 DOI: 10.3390/antiox11112286] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 08/17/2022] [Revised: 11/12/2022] [Accepted: 11/15/2022] [Indexed: 08/30/2023]
Abstract
Ionizing radiation (IR) is a genuine genotoxic agent and a major modality in cancer treatment. IR disrupts DNA sequences and exerts mutagenic and/or cytotoxic properties that not only alter critical cellular functions but also impact tissues proximal and distal to the irradiated site. Unveiling the molecular events governing the diverse effects of IR at the cellular and organismal levels is relevant for both radiotherapy and radiation protection. Herein, we address changes in the expression of mammalian genes induced after the exposure of a wide range of tissues to various radiation types with distinct biophysical characteristics. First, we constructed a publicly available database, termed RadBioBase, which will be updated at regular intervals. RadBioBase includes comprehensive transcriptomes of mammalian cells across healthy and diseased tissues that respond to a range of radiation types and doses. Pertinent information was derived from a hybrid analysis based on stringent literature mining and transcriptomic studies. An integrative bioinformatics methodology, including functional enrichment analysis and machine learning techniques, was employed to unveil the characteristic biological pathways related to specific radiation types and their association with various diseases. We found that the effects of high linear energy transfer (LET) radiation on cell transcriptomes differ significantly from those caused by low LET radiation and are consistent with immunomodulation, inflammation, oxidative stress responses and cell death. The transcriptome changes also depend on the dose: low doses up to 0.5 Gy are associated with cytokine cascades, while higher doses are associated with ROS metabolism. We additionally identified distinct gene signatures for different types of radiation. Overall, our data suggest that different radiation types and doses can trigger distinct trajectories of cell-intrinsic and cell-extrinsic pathways that hold promise to be manipulated toward improving radiotherapy efficiency and reducing systemic radiotoxicities.
Affiliation(s)
- Eftychia Sagkrioti
  - DNA Damage Laboratory, Physics Department, School of Applied Mathematical and Physical Sciences, National Technical University of Athens (NTUA), Zografou, 15780 Athens, Greece
  - Biology Department, National and Kapodistrian University of Athens (NKUA), 15784 Athens, Greece
- Gökay Mehmet Biz
  - Department of Technical Programs, Izmir Vocational School, Dokuz Eylül University, Buca, Izmir 35380, Turkey
- Işıl Takan
  - Izmir Biomedicine and Genome Center (IBG), Balcova, Izmir 35340, Turkey
  - Izmir International Biomedicine and Genome Institute, Dokuz Eylül University, Balcova, Izmir 35340, Turkey
- Seyedehsadaf Asfa
  - Izmir Biomedicine and Genome Center (IBG), Balcova, Izmir 35340, Turkey
  - Izmir International Biomedicine and Genome Institute, Dokuz Eylül University, Balcova, Izmir 35340, Turkey
- Zacharenia Nikitaki
  - DNA Damage Laboratory, Physics Department, School of Applied Mathematical and Physical Sciences, National Technical University of Athens (NTUA), Zografou, 15780 Athens, Greece
- Vassiliki Zanni
  - DNA Damage Laboratory, Physics Department, School of Applied Mathematical and Physical Sciences, National Technical University of Athens (NTUA), Zografou, 15780 Athens, Greece
- Rumeysa Hanife Kars
  - Department of Biomedical Engineering, Istanbul Medipol University, Istanbul 34810, Turkey
- Christine E. Hellweg
  - German Aerospace Center (DLR), Institute of Aerospace Medicine, Radiation Biology, Linder Höhe, D-51147 Köln, Germany
- Stella Logotheti
  - DNA Damage Laboratory, Physics Department, School of Applied Mathematical and Physical Sciences, National Technical University of Athens (NTUA), Zografou, 15780 Athens, Greece
- Athanasia Pavlopoulou
  - Izmir Biomedicine and Genome Center (IBG), Balcova, Izmir 35340, Turkey
  - Izmir International Biomedicine and Genome Institute, Dokuz Eylül University, Balcova, Izmir 35340, Turkey
- Alexandros G. Georgakilas
  - DNA Damage Laboratory, Physics Department, School of Applied Mathematical and Physical Sciences, National Technical University of Athens (NTUA), Zografou, 15780 Athens, Greece
2
Elitist random swapped particle swarm optimization embedded with variable k-nearest neighbour classification: a new PSO variant applied to gene identification. Soft comput 2022. [DOI: 10.1007/s00500-022-07515-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2022]
3
Jayanthi S, Rene Robin CR. Analysis of Microarray Data by Empirical Wavelet Transform for Cancer Classification Using Block by Block Method. JOURNAL OF MEDICAL IMAGING AND HEALTH INFORMATICS 2021. [DOI: 10.1166/jmihi.2021.3318] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/23/2022]
Abstract
In this study, DNA microarray data are analyzed from a signal processing perspective for cancer classification. An adaptive wavelet transform, the Empirical Wavelet Transform (EWT), is applied in a block-by-block procedure to characterize microarray data. The EWT wavelet basis depends on the input data rather than being predetermined as in conventional wavelets. Thus, EWT gives sparser representations than conventional wavelets. The characterization of microarray data is made by a block-by-block procedure with predefined block sizes in powers of 2, ranging from 128 to 2048. After characterization, a statistical hypothesis test is employed to select the informative EWT coefficients. Only the selected coefficients are used for Microarray Data Classification (MDC) by a Support Vector Machine (SVM). Computational experiments are conducted on five microarray datasets (colon, breast, leukemia, CNS, and ovarian) to test the developed cancer classification system. The obtained results demonstrate that EWT coefficients with SVM are an effective approach, with no misclassification for the MDC system.
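The block-by-block step in the abstract above can be pictured with a minimal Python sketch. It only shows how a one-dimensional expression profile might be cut into blocks of the stated sizes; it does not implement EWT, the hypothesis test, or the SVM. The function name `split_into_blocks`, the sample profile, and the handling of a trailing partial block are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Sequence

# Block sizes named in the abstract: powers of 2 from 128 to 2048.
BLOCK_SIZES = [2 ** k for k in range(7, 12)]


def split_into_blocks(signal: Sequence[float], block_size: int) -> List[List[float]]:
    """Cut a 1-D expression profile into consecutive, non-overlapping blocks.

    Assumption: a trailing remainder shorter than block_size is kept as a
    final, shorter block; the paper may pad or discard it instead.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [list(signal[i:i + block_size]) for i in range(0, len(signal), block_size)]


if __name__ == "__main__":
    # Hypothetical sample with 2000 expression values.
    profile = [float(i % 7) for i in range(2000)]
    for size in BLOCK_SIZES:
        blocks = split_into_blocks(profile, size)
        print(size, len(blocks), len(blocks[-1]))
```

Each block would then be passed to the transform separately, so the block size controls how local the resulting coefficients are.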
Affiliation(s)
- S. Jayanthi
  - Research Scholar, Anna University, 600025, Tamilnadu, India; Department of Computer Science and Engineering, Agni College of Technology, 600130, Tamilnadu, India
- C. R. Rene Robin
  - Department of Computer Science and Engineering, Jerusalem College of Engineering, 600100, Tamilnadu, India
4
Gupta M, Gupta B. A novel gene expression test method of minimizing breast cancer risk in reduced cost and time by improving SVM-RFE gene selection method combined with LASSO. J Integr Bioinform 2020; 18:139-153. [PMID: 34171941 PMCID: PMC7856389 DOI: 10.1515/jib-2019-0110] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 12/31/2019] [Accepted: 11/12/2020] [Indexed: 01/26/2023]
Abstract
Breast cancer is a leading cause of death in women. It is induced by genetic mutation in breast cancer cells. Genetic testing has become popular for detecting gene mutations, but the test cost is relatively high for many patients in developing countries such as India. A genetic test takes between 2 and 4 weeks to determine the cancer. This delay harms prognosis because some patients have a high rate of cancerous cell growth. In this work, a cost- and time-efficient method is proposed to predict gene expression levels on the basis of the patient's clinical outcomes by using machine learning techniques. An improved SVM-RFE_MI gene selection technique is proposed to find the most significant genes related to breast cancer; afterward, explained-variance statistical analysis is applied to extract the genes with high variance. Least Absolute Shrinkage and Selection Operator (LASSO) and Ridge regression techniques are used to predict the gene expression level. The proposed method predicts the expression of significant genes with reduced Root Mean Square Error and an acceptable adjusted R-square value. According to the study, analysis of these selected genes is beneficial for diagnosing breast cancer at an earlier stage at reduced cost and time.
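The abstract above reports its regression results with two standard metrics, Root Mean Square Error and adjusted R-square. A minimal Python sketch of their textbook definitions follows; the function names and the sample values are hypothetical, and nothing here reproduces the paper's models or data.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def adjusted_r2(y_true: Sequence[float], y_pred: Sequence[float],
                n_predictors: int) -> float:
    """Adjusted R-square: R-square penalized for the number of predictors."""
    n = len(y_true)
    if n - n_predictors - 1 <= 0:
        raise ValueError("need more samples than predictors + 1")
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("observed values have zero variance")
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


if __name__ == "__main__":
    # Hypothetical observed and predicted expression levels.
    observed = [1.0, 2.0, 3.0, 4.0, 5.0]
    predicted = [1.1, 1.9, 3.2, 3.8, 5.0]
    print(rmse(observed, predicted), adjusted_r2(observed, predicted, n_predictors=1))
```

The adjustment matters in gene expression settings because adding predictors never lowers plain R-square, whereas the adjusted value drops when an extra predictor adds little.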
Affiliation(s)
- Madhuri Gupta
  - Department of Computer Engineering and Information Technology, ABES Engineering College, Ghaziabad, Uttar Pradesh, India
- Bharat Gupta
  - Department of CS&IT, Jaypee Institute of Information Technology, Noida, Uttar Pradesh, India
5
Nguyen TTH, Nguyen PV, Tran QV, Vo NX, Vo TQ. Cancer classification from microarray data for genomic disorder research using optimal discriminant independent component analysis and kernel extreme learning machine. INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN BIOMEDICAL ENGINEERING 2020; 36:e3372. [PMID: 32453470 DOI: 10.1002/cnm.3372] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/10/2020] [Revised: 05/08/2020] [Accepted: 05/13/2020] [Indexed: 06/11/2023]
Abstract
One of the challenging tasks in the medical field is genomic disorder investigation and its classification from microarray datasets. Microarray dataset reorganization and classification are complex and expensive in biomedical research because of the large number of features in the microarray dataset. In this paper, we construct a hybrid feature selection method that combines the t test, Fisher ratio, and Bayesian logistic regression to select genes and reduce the time cost. The top-ranked features are selected via the best hybrid rank method. Thereafter, the features are extracted using modified firefly optimization-based discriminant independent component analysis (MF-DICA). In particular, the modified firefly optimization algorithm improves the search efficiency of DICA. From the high-dimensional microarray dataset, MF-DICA is used to obtain the best features within the entire search space. The kernel extreme learning machine classifies the gene features into the most relevant class. Experimentally, six datasets, namely Leukemia, Diffuse Large B-cell Lymphoma, Lung cancer, Breast cancer, Prostate tumor, and Colon, are chosen to evaluate the performance of the proposed approaches. Finally, the experimental data demonstrate that the proposed method is well suited to classifying microarray data.
Affiliation(s)
- Tram Thi Huyen Nguyen
  - Department of Pharmacy, Ear - Nose - Throat Hospital in Ho Chi Minh city, Ho Chi Minh City, Vietnam
- Pol Van Nguyen
  - Department of Economic and Administrative Pharmacy, Faculty of Pharmacy, Pham Ngoc Thach University of Medicine, Ho Chi Minh City, Vietnam
- Quang Vinh Tran
  - Department of Economic and Administrative Pharmacy, Faculty of Pharmacy, Pham Ngoc Thach University of Medicine, Ho Chi Minh City, Vietnam
- Nam Xuan Vo
  - Department of Economic and Administrative Pharmacy, Faculty of Pharmacy, Ton Duc Thang University, Ho Chi Minh City, Vietnam
- Trung Quang Vo
  - Department of Economic and Administrative Pharmacy, Faculty of Pharmacy, Pham Ngoc Thach University of Medicine, Ho Chi Minh City, Vietnam
6
Kilicarslan S, Adem K, Celik M. Diagnosis and classification of cancer using hybrid model based on ReliefF and convolutional neural network. Med Hypotheses 2020; 137:109577. [DOI: 10.1016/j.mehy.2020.109577] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 12/04/2019] [Revised: 01/04/2020] [Accepted: 01/16/2020] [Indexed: 10/25/2022]
7
A Wrapper Feature Subset Selection Method Based on Randomized Search and Multilayer Structure. BIOMED RESEARCH INTERNATIONAL 2019; 2019:9864213. [PMID: 31828154 PMCID: PMC6885241 DOI: 10.1155/2019/9864213] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 05/08/2019] [Revised: 08/10/2019] [Accepted: 08/27/2019] [Indexed: 12/11/2022]
Abstract
The identification of discriminative features from information-rich data with the goal of clinical diagnosis is crucial in the field of biomedical science. In this context, many machine-learning techniques have been widely applied and have achieved remarkable results. However, disease, especially cancer, is often caused by a group of features with complex interactions. Unlike traditional feature selection methods, which focus only on finding single discriminative features, a multilayer feature subset selection method (MLFSSM), which employs randomized search and a multilayer structure to select a discriminative subset, is proposed herein. In each layer of this method, many feature subsets are generated to ensure the diversity of the combinations, and the weights of features are evaluated based on the performance of the subsets. The weight of a feature increases if the feature is selected into more subsets with better performance compared with other features on the current layer. In this manner, the values of feature weights are revised layer by layer; the precision of feature weights is constantly improved; and better subsets are repeatedly constructed from the features with higher weights. Finally, the topmost feature subset of the last layer is returned. The experimental results based on five public gene datasets showed that the subsets selected by MLFSSM were more discriminative than the results of traditional feature selection methods, including LVW (a feature subset selection method that uses the Las Vegas method as its randomized search strategy), GAANN (a feature subset selection method based on a genetic algorithm (GA)), and support vector machine recursive feature elimination (SVM-RFE). Furthermore, MLFSSM showed higher classification performance than some state-of-the-art methods that select feature pairs or groups, including top scoring pair (TSP), k-top scoring pairs (K-TSP), and relative simplicity-based direct classifier (RS-DC).
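The layer-by-layer idea described in the abstract above can be sketched in a few lines of Python. This is a loose illustration and not the MLFSSM algorithm itself: the weight update (mean score of the subsets that contain a feature), the function names `weighted_sample` and `layered_subset_search`, all parameter defaults, and the toy scoring function are assumptions introduced here. In practice `evaluate` would be a cross-validated classifier score.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Subset = Tuple[int, ...]


def weighted_sample(features: Sequence[int], weights: Dict[int, float],
                    k: int, rng: random.Random) -> Subset:
    """Draw k distinct features; each draw is proportional to current weight."""
    pool = list(features)
    chosen: List[int] = []
    for _ in range(min(k, len(pool))):
        pick = rng.choices(pool, weights=[weights[f] for f in pool], k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return tuple(sorted(chosen))


def layered_subset_search(features: Sequence[int],
                          evaluate: Callable[[Subset], float],
                          subset_size: int = 2,
                          subsets_per_layer: int = 50,
                          layers: int = 5,
                          seed: int = 0) -> Tuple[Subset, float, Dict[int, float]]:
    """Randomized search in which feature weights are revised after each layer."""
    rng = random.Random(seed)
    weights = {f: 1.0 for f in features}
    best: Tuple[float, Subset] = (float("-inf"), ())
    for _layer in range(layers):
        scored = []
        for _draw in range(subsets_per_layer):
            subset = weighted_sample(features, weights, subset_size, rng)
            scored.append((evaluate(subset), subset))
        # Assumed update rule: a feature's weight becomes the mean score of
        # the subsets it appeared in; unsampled features keep their weight.
        for f in features:
            hits = [score for score, subset in scored if f in subset]
            if hits:
                weights[f] = max(sum(hits) / len(hits), 1e-6)
        # Only the top subset of the most recent layer is kept.
        best = max(scored)
    return best[1], best[0], weights


def toy_score(subset: Subset) -> float:
    """Hypothetical evaluator: features 0 and 1 are the informative ones."""
    return len(set(subset) & {0, 1}) / len(subset)


if __name__ == "__main__":
    print(layered_subset_search(list(range(10)), toy_score))
```

Because later layers sample in proportion to the revised weights, features that repeatedly appear in well-scoring subsets are drawn more often, which is the feedback loop the abstract describes.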
8
Nagpal A, Singh V. Feature selection from high dimensional data based on iterative qualitative mutual information. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS 2019. [DOI: 10.3233/jifs-181665] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/15/2022]
Affiliation(s)
- Arpita Nagpal
  - Department of Computer Science and Engineering, The NorthCap University, Sector-23A, Gurugram, India
- Vijendra Singh
  - Department of Computer Science and Engineering, The NorthCap University, Sector-23A, Gurugram, India
9
Dif N, Elberrichi Z. An Enhanced Recursive Firefly Algorithm for Informative Gene Selection. INTERNATIONAL JOURNAL OF SWARM INTELLIGENCE RESEARCH 2019. [DOI: 10.4018/ijsir.2019040102] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 11/09/2022]
Abstract
Feature selection is the process of identifying well-performing combinations of significant features among many possibilities. This preprocessing step improves classification accuracy and facilitates the learning task. For this optimization problem, the authors have used a metaheuristic approach. Their main objective is to propose an enhanced version of the firefly algorithm as a wrapper approach by adding a recursive behavior to improve the search for the optimal solution. They applied an SVM classifier to evaluate the proposed method. For their experiments, the authors used benchmark microarray datasets. The results show that the new enhanced recursive FA (RFA) outperforms the standard version with a reduction of dimensionality for all the datasets. As an example, for the leukemia microarray dataset, they obtained a perfect performance score of 100% with only 18 informative selected genes among the 7,129 of the original dataset. The RFA was competitive with other state-of-the-art approaches and achieved the best results for the CNS, Ovarian cancer, MLL, prostate, Leukemia_4c, and lymphoma datasets.
Affiliation(s)
- Nassima Dif
  - EEDIS Laboratory, Djillali Liabes University, Sidi Belabbes, Algeria
10
Yan Y, Dai T, Yang M, Du X, Zhang Y, Zhang Y. Classifying Incomplete Gene-Expression Data: Ensemble Learning with Non-Pre-Imputation Feature Filtering and Best-First Search Technique. Int J Mol Sci 2018; 19:ijms19113398. [PMID: 30380746 PMCID: PMC6274900 DOI: 10.3390/ijms19113398] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 10/11/2018] [Revised: 10/20/2018] [Accepted: 10/23/2018] [Indexed: 01/09/2023]
Abstract
(1) Background: Gene-expression data usually contain missing values (MVs). Numerous methods focused on how to estimate MVs have been proposed in the past few years. Recent studies show that those imputation algorithms made little difference in classification. Thus, some scholars believe that how to select the informative genes for downstream classification is more important than how to impute MVs. However, most feature-selection (FS) algorithms need beforehand imputation, and the impact of beforehand MV imputation on downstream FS performance is seldom considered. (2) Method: A modified chi-square test-based FS is introduced for gene-expression data. To deal with the challenge of a small sample size of gene-expression data, a heuristic method called recursive element aggregation is proposed in this study. Our approach can directly handle incomplete data without any imputation methods or missing-data assumptions. The most informative genes can be selected through a threshold. After that, the best-first search strategy is utilized to find optimal feature subsets for classification. (3) Results: We compare our method with several FS algorithms. Evaluation is performed on twelve original incomplete cancer gene-expression datasets. We demonstrate that MV imputation on an incomplete dataset impacts subsequent FS in terms of classification tasks. Through directly conducting FS on incomplete data, our method can avoid potential disturbances on subsequent FS procedures caused by MV imputation. An experiment on small, round blue cell tumor (SRBCT) dataset showed that our method found additional genes besides many common genes with the two compared existing methods.
Affiliation(s)
- Yuanting Yan
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
- Tao Dai
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
- Meili Yang
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
- Xiuquan Du
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
- Yiwen Zhang
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
- Yanping Zhang
  - School of Computer Science and Technology, Anhui University, Hefei 230601, China.
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, China.
11
WiFi Indoor Localization with CSI Fingerprinting-Based Random Forest. SENSORS 2018; 18:s18092869. [PMID: 30200285 PMCID: PMC6164737 DOI: 10.3390/s18092869] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 07/03/2018] [Revised: 08/27/2018] [Accepted: 08/29/2018] [Indexed: 11/17/2022]
Abstract
WiFi fingerprinting indoor positioning systems have extensive applied prospects. However, a vast amount of data in a particular environment has to be gathered to establish a fingerprinting database. Deficiencies of these systems are the lack of universality of multipath effects and a burden of heavy workload on fingerprint storage. Thus, this paper presents a novel Random Forest fingerprinting localization (RFFP) method using channel state information (CSI), which utilizes the Random Forest model trained in the offline stage as fingerprints in order to economize memory space and possess a good anti-multipath characteristic. Furthermore, a series of specific experiments are conducted in a microwave anechoic chamber and an office to detail the localization performance of RFFP with different wireless channel circumstances, system parameters, algorithms, and input datasets. In addition, compared with other algorithms including K-Nearest-Neighbor (KNN), Weighted K-Nearest-Neighbor (WKNN), REPTree, CART, and J48, the RFFP method provides far greater classification accuracy as well as lower mean location error. The proposed method offers outstanding comprehensive performance including accuracy, robustness, low workload, and better anti-multipath-fading.
12
Anand D, Pandey B, Pandey DK. Facioscapulohumeral Muscular Dystrophy Diagnosis Using Hierarchical Clustering Algorithm and K-Nearest Neighbor Based Methodology. INTERNATIONAL JOURNAL OF E-HEALTH AND MEDICAL COMMUNICATIONS 2017. [DOI: 10.4018/ijehmc.2017040103] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
The genetic diagnosis of neuromuscular disorders is an active area of research. Microarrays are used to detect changes in genes for accurate diagnosis. Unfortunately, the number of genes in gene expression data is very large compared to the number of samples. The number of genes needs to be reduced for correct diagnosis. In the present paper, the authors have built an intelligent integrated model for clustering and diagnosis of neuromuscular diseases. The Wilcoxon signed rank test is used to preselect the genes. K-means and hierarchical clustering algorithms with different distance metrics are employed to cluster the genes. Three classifiers, namely linear discriminant analysis, quadratic discriminant analysis and k-nearest neighbor, are used. To apply the integrated techniques, a balanced facioscapulohumeral muscular dystrophy dataset is taken. A comparative analysis of the above integrated algorithms is presented, which demonstrates that the integration of the cosine distance metric hierarchical clustering algorithm with k-nearest neighbor gave the best performance measures.
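Two ingredients of the best-performing combination named in the abstract above, the cosine distance metric and k-nearest-neighbor voting, can be shown together in a minimal Python sketch. This simplification applies cosine distance directly inside the classifier; the abstract does not say how the clustering output feeds the classifier, so the pairing shown here, the function names, and the toy data are assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train: List[Sequence[float]], labels: List[str],
                query: Sequence[float], k: int = 3) -> str:
    """Majority vote among the k training samples closest to the query."""
    ranked = sorted(range(len(train)), key=lambda i: cosine_distance(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-gene expression profiles for two diagnostic classes.
    samples = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    classes = ["affected", "affected", "control", "control"]
    print(knn_predict(samples, classes, [1.0, 0.2], k=3))
```

Cosine distance compares the direction of expression profiles rather than their magnitude, so two samples with the same relative pattern but different overall intensity are treated as close.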
Affiliation(s)
- Divya Anand
  - Department of Computer Science and Engineering, Lovely Professional University, Phagwara, India
- Babita Pandey
  - Department of Computer Applications, Lovely Professional University, Phagwara, India
13
A Novel Hybrid Feature Selection Model for Classification of Neuromuscular Dystrophies Using Bhattacharyya Coefficient, Genetic Algorithm and Radial Basis Function Based Support Vector Machine. Interdiscip Sci 2016; 10:244-250. [PMID: 27637476 DOI: 10.1007/s12539-016-0183-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/07/2016] [Revised: 08/07/2016] [Accepted: 08/30/2016] [Indexed: 10/21/2022]
Abstract
An accurate classification of neuromuscular disorders is important in providing proper treatment to patients. Recently, microarray technology has been employed to monitor the level of activity or expression of a large number of genes simultaneously. The gene expression data derived from a microarray experiment usually involve a large number of genes but very few samples. There is a need to reduce the dimension of gene expression data, with the aim of finding a small set of discriminative genes that accurately classifies samples of various kinds of diseases. Our goal is therefore to find a small subset of genes that ensures the accurate classification of neuromuscular disorders. In the present paper, we propose a novel hybrid feature selection model for classification of neuromuscular disorders. Feature selection is done in two phases by integrating the Bhattacharyya coefficient and a genetic algorithm (GA). In the first phase, we compute the Bhattacharyya coefficient to choose a candidate gene subset by removing the most redundant genes. In the second phase, the target gene subset is created by selecting the most discriminative gene subset with a GA in which the fitness function is calculated using a radial basis function support vector machine (RBF SVM). The proposed hybrid algorithm is applied to two publicly available microarray neuromuscular disorder datasets. The results are compared with two individual feature selection techniques, namely the Bhattacharyya coefficient and GA, and one integrated technique, i.e., Bhattacharyya-GA in which the fitness function of the GA is calculated using four other classifiers. The comparison shows that the proposed integrated method gives better classification accuracy.
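The Bhattacharyya coefficient used in the first phase has a simple discrete form, the sum over bins of sqrt(p_i * q_i), which the following minimal Python sketch computes for one gene from per-class histograms. The abstract does not state how the coefficient is turned into a redundancy filter, so this shows only the coefficient itself; the function names, the binning scheme, and the sample values are assumptions.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp hi into last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


def bhattacharyya_coefficient(p: Sequence[float], q: Sequence[float]) -> float:
    """Overlap of two discrete distributions: 1 if identical, 0 if disjoint."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def gene_overlap(class_a: Sequence[float], class_b: Sequence[float],
                 bins: int = 5) -> float:
    """Bhattacharyya coefficient of one gene's expression across two classes."""
    lo = min(min(class_a), min(class_b))
    hi = max(max(class_a), max(class_b))
    if hi == lo:
        return 1.0  # constant expression: the two distributions coincide
    return bhattacharyya_coefficient(histogram(class_a, lo, hi, bins),
                                     histogram(class_b, lo, hi, bins))


if __name__ == "__main__":
    # Hypothetical expression values of one gene in two sample groups.
    print(gene_overlap([0.0, 0.1, 0.2], [0.8, 0.9, 1.0]))
```

A coefficient near 1 means the gene's expression distributions in the two classes largely coincide, while a value near 0 means they barely overlap.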
14
Banerjee S, Anura A, Chakrabarty J, Sengupta S, Chatterjee J. Identification and functional assessment of novel gene sets towards better understanding of dysplasia associated oral carcinogenesis. GENE REPORTS 2016. [DOI: 10.1016/j.genrep.2016.04.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 12/11/2022]
15
Abstract
BACKGROUND Development of biologically relevant models from gene expression data notably, microarray data has become a topic of great interest in the field of bioinformatics and clinical genetics and oncology. Only a small number of gene expression data compared to the total number of genes explored possess a significant correlation with a certain phenotype. Gene selection enables researchers to obtain substantial insight into the genetic nature of the disease and the mechanisms responsible for it. Besides improvement of the performance of cancer classification, it can also cut down the time and cost of medical diagnoses. METHODS This study presents a modified Artificial Bee Colony Algorithm (ABC) to select minimum number of genes that are deemed to be significant for cancer along with improvement of predictive accuracy. The search equation of ABC is believed to be good at exploration but poor at exploitation. To overcome this limitation we have modified the ABC algorithm by incorporating the concept of pheromones which is one of the major components of Ant Colony Optimization (ACO) algorithm and a new operation in which successive bees communicate to share their findings. RESULTS The proposed algorithm is evaluated using a suite of ten publicly available datasets after the parameters are tuned scientifically with one of the datasets. Obtained results are compared to other works that used the same datasets. The performance of the proposed method is proved to be superior. CONCLUSION The method presented in this paper can provide subset of genes leading to more accurate classification results while the number of selected genes is smaller. Additionally, the proposed modified Artificial Bee Colony Algorithm could conceivably be applied to problems in other areas as well.
Affiliation(s)
- Rameen Shakur
  - Wellcome Trust - Medical Research Council Cambridge Stem Cell Institute, University of Cambridge, Cambridge, UK
- Mohammad Kaykobad
  - A ℓEDA Group, Department of CSE, BUET, Dhaka-1205, Dhaka, Bangladesh
16
Mundra PA, Rajapakse JC. Gene and sample selection using T-score with sample selection. J Biomed Inform 2016; 59:31-41. [DOI: 10.1016/j.jbi.2015.11.003] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Received: 11/10/2014] [Revised: 10/13/2015] [Accepted: 11/04/2015] [Indexed: 10/22/2022]
17
Johnson GR, Li J, Shariff A, Rohde GK, Murphy RF. Automated Learning of Subcellular Variation among Punctate Protein Patterns and a Generative Model of Their Relation to Microtubules. PLoS Comput Biol 2015; 11:e1004614. [PMID: 26624011 PMCID: PMC4704559 DOI: 10.1371/journal.pcbi.1004614] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 05/06/2015] [Accepted: 10/19/2015] [Indexed: 12/23/2022]
Abstract
Characterizing the spatial distribution of proteins directly from microscopy images is a difficult problem with numerous applications in cell biology (e.g. identifying motor-related proteins) and clinical research (e.g. identification of cancer biomarkers). Here we describe the design of a system that provides automated analysis of punctate protein patterns in microscope images, including quantification of their relationships to microtubules. We constructed the system using confocal immunofluorescence microscopy images from the Human Protein Atlas project for 11 punctate proteins in three cultured cell lines. These proteins have previously been characterized as being primarily located in punctate structures, but their images had all been annotated by visual examination as being simply "vesicular". We were able to show that these patterns could be distinguished from each other with high accuracy, and we were able to assign to one of these subclasses hundreds of proteins whose subcellular localization had not previously been well defined. In addition to providing these novel annotations, we built a generative approach to modeling of punctate distributions that captures the essential characteristics of the distinct patterns. Such models are expected to be valuable for representing and summarizing each pattern and for constructing systems biology simulations of cell behaviors.
Affiliation(s)
- Gregory R. Johnson
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Center for Bioimage Informatics, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Jieyue Li
  - Center for Bioimage Informatics, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Aabid Shariff
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Center for Bioimage Informatics, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Gustavo K. Rohde
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Center for Bioimage Informatics, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
- Robert F. Murphy
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Center for Bioimage Informatics, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Departments of Biological Sciences and Machine Learning, Carnegie Mellon University, Pittsburgh, Pennsylvania, United States of America
  - Faculty of Biology and Freiburg Institute for Advanced Studies, Albert Ludwig University of Freiburg, Freiburg, Germany
18
|
Hybrid Classification Techniques for Microarray Data. NATIONAL ACADEMY SCIENCE LETTERS-INDIA 2015. [DOI: 10.1007/s40009-015-0390-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/14/2023]
|
19
|
Sachnev V, Saraswathi S, Niaz R, Kloczkowski A, Suresh S. Multi-class BCGA-ELM based classifier that identifies biomarkers associated with hallmarks of cancer. BMC Bioinformatics 2015; 16:166. [PMID: 25986937 PMCID: PMC4448565 DOI: 10.1186/s12859-015-0565-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 02/02/2015] [Accepted: 03/31/2015] [Indexed: 12/05/2022] Open
Abstract
Background Traditional cancer treatments have centered on cytotoxic drugs and general purpose chemotherapy that may not be tailored to treat specific cancers. Identification of molecular markers that are related to different types of cancers might lead to discovery of drugs that are patient and disease specific. This study aims to use microarray gene expression cancer data to identify biomarkers that are indicative of different types of cancers. Our aim is to provide a multi-class cancer classifier that can simultaneously differentiate between cancers and identify type-specific biomarkers, through the application of the Binary Coded Genetic Algorithm (BCGA) and a neural network based Extreme Learning Machine (ELM) algorithm. Results BCGA and ELM are combined and used to select a subset of genes that are present in the Global Cancer Mapping (GCM) data set. This set of candidate genes contains over 52 biomarkers that are related to multiple cancers, according to the literature. They include APOA1, VEGFC, YWHAZ, B2M, EIF2S1, CCR9 and many other genes that have been associated with the hallmarks of cancer. BCGA-ELM is tested on several cancer data sets and the results are compared to other classification methods. BCGA-ELM matches or exceeds other algorithms in terms of accuracy. We were also able to show that over 50% of genes selected by BCGA-ELM on GCM data are cancer related biomarkers. Conclusions We were able to simultaneously differentiate between 14 different types of cancers, using only 92 genes, to achieve a multi-class classification accuracy of 95.4%, which is between 21.6% and 38% higher than other results in the literature for multi-class cancer classification. Our findings suggest that computational algorithms such as BCGA-ELM can facilitate biomarker-driven integrated cancer research that can lead to a detailed understanding of the complexities of cancer.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0565-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Vasily Sachnev
  - Department of Information, Communication and Electronics Engineering, Catholic University of Korea, Bucheon, Republic of Korea.
- Saras Saraswathi
  - Battelle Center for Mathematical Medicine at The Research Institute at Nationwide Children's Hospital; currently at Sidra Medical and Research Center, Doha, Qatar.
- Rashid Niaz
  - Department of Medical Informatics, Sidra Medical and Research Center, Doha, Qatar.
- Andrzej Kloczkowski
  - Battelle Center for Mathematical Medicine at The Research Institute at Nationwide Children's Hospital; Department of Pediatrics, College of Medicine, The Ohio State University, Columbus, USA.
- Sundaram Suresh
  - School of Computer Science, Nanyang Technological University, Nanyang, Singapore.
|
20
|
Dessì N, Pes B, Cannas LM. An Evolutionary Approach for Balancing Effectiveness and Representation Level in Gene Selection. JOURNAL OF INFORMATION TECHNOLOGY RESEARCH 2015. [DOI: 10.4018/jitr.2015040102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
As data mining develops and expands to new application areas, feature selection also reveals various aspects to be considered. This paper underlines two aspects that seem to categorize the large body of available feature selection algorithms: the effectiveness and the representation level. The effectiveness deals with selecting the minimum set of variables that maximizes the accuracy of a classifier, and the representation level concerns discovering how relevant the variables are for the domain of interest. To balance these aspects, the paper proposes an evolutionary framework for feature selection that expresses a hybrid method organized in layers, each of which exploits a specific search strategy. Extensive experiments on gene selection from DNA-microarray datasets are presented and discussed. Results indicate that the framework compares well with different hybrid methods proposed in the literature, as it is capable of finding well-suited subsets of informative features while improving classification accuracy.
Affiliation(s)
- Nicoletta Dessì
  - Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
- Barbara Pes
  - Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
- Laura Maria Cannas
  - Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
|
21
|
García V, Salvador Sánchez J. Mapping microarray gene expression data into dissimilarity spaces for tumor classification. Inf Sci (N Y) 2015. [DOI: 10.1016/j.ins.2014.09.064] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 10/24/2022]
|
22
|
Classification of Microarray Data Using Kernel Fuzzy Inference System. INTERNATIONAL SCHOLARLY RESEARCH NOTICES 2014; 2014:769159. [PMID: 27433543 PMCID: PMC4897118 DOI: 10.1155/2014/769159] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 03/28/2014] [Revised: 05/28/2014] [Accepted: 06/12/2014] [Indexed: 12/02/2022]
Abstract
The DNA microarray classification technique has gained popularity in both research and practice. In real data analysis, such as microarray data, the dataset contains a huge number of insignificant and irrelevant features that tend to obscure useful information. Classes with high relevance and feature sets with high significance are generally preferred for the selected features, which determine the classification of samples into their respective classes. In this paper, the kernel fuzzy inference system (K-FIS) algorithm is applied to classify microarray data (leukemia) using the t-test as a feature selection method. Kernel functions are used to map original data points into a higher-dimensional (possibly infinite-dimensional) feature space defined by a (usually nonlinear) function ϕ through a mathematical process called the kernel trick. This paper also presents a comparative study of classification using K-FIS and the support vector machine (SVM) for different sets of features (genes). Performance parameters available in the literature, such as precision, recall, specificity, F-measure, ROC curve, and accuracy, are considered to analyze the efficiency of the classification model. From the proposed approach, it is apparent that the K-FIS model obtains results similar to those of the SVM model. This is an indication that the proposed approach relies on the kernel function.
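The t-test feature selection step mentioned in this abstract can be illustrated with a minimal sketch. It assumes a two-class problem and uses Welch's unequal-variance t-statistic; the abstract does not say which t-test variant the authors used, and the function names, gene names, and toy values below are hypothetical.

```python
import math


def welch_t(a, b):
    """Welch's t-statistic for two samples given as lists of floats."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    # A zero standard error means both groups are constant; treat as no signal.
    return (mean_a - mean_b) / se if se > 0 else 0.0


def rank_genes_by_t(expr, labels, top_k):
    """Return the top_k gene names ranked by |t| between class 0 and class 1.

    expr maps a gene name to its expression values (one per sample);
    labels holds the 0/1 class of each sample in the same order.
    """
    scores = {}
    for gene, values in expr.items():
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(class0, class1))
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: three genes measured on six samples.
expr = {
    "gene_a": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # strongly class-dependent
    "gene_b": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # same mean in both classes
    "gene_c": [3.0, 3.5, 2.5, 4.0, 4.5, 3.5],  # moderately class-dependent
}
labels = [0, 0, 0, 1, 1, 1]
top = rank_genes_by_t(expr, labels, top_k=2)
```

The selected genes would then be passed to the classifier (K-FIS or SVM in the abstract); that stage is not sketched here.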
|
23
|
A discrete wavelet based feature extraction and hybrid classification technique for microarray data analysis. ScientificWorldJournal 2014; 2014:195470. [PMID: 25162043 PMCID: PMC4138760 DOI: 10.1155/2014/195470] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 01/22/2014] [Revised: 06/20/2014] [Accepted: 07/02/2014] [Indexed: 11/18/2022] Open
Abstract
Historically, cancer classification by doctors and radiologists was based on morphological and clinical features and had limited diagnostic ability. The recent arrival of DNA microarray technology has enabled the concurrent monitoring of thousands of gene expressions on a single chip, which has stimulated progress in cancer classification. In this paper, we propose a hybrid approach for microarray data classification based on k-nearest neighbor (KNN), naive Bayes, and support vector machine (SVM). Feature selection prior to classification plays a vital role, and a feature selection technique that combines the discrete wavelet transform (DWT) and the moving window technique (MWT) is used. The performance of the proposed method is compared with conventional classifiers such as support vector machine, nearest neighbor, and naive Bayes. Experiments have been conducted on both real and benchmark datasets, and the results indicate that the ensemble approach produces higher classification accuracy than conventional classifiers. The proposed method serves as an automated system for cancer classification that doctors can apply in real cases, which would be a boon to the medical community. This work further reduces the misclassification of cancers, which is unacceptable in cancer detection.
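Two ingredients named in this abstract, a discrete wavelet transform for feature extraction and an ensemble of classifiers, can be sketched in a few lines. The sketch assumes a single-level Haar wavelet and a simple majority vote; the abstract specifies neither the wavelet family nor the combination rule, so both choices (and all names below) are illustrative assumptions rather than the authors' method.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT on an even-length list of floats.

    Returns (approximation, detail) coefficient lists, each half as long
    as the input. The approximation coefficients are a smoothed,
    lower-dimensional view that could serve as compact features.
    """
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def majority_vote(predictions):
    """Combine the labels predicted by several classifiers for one sample.

    On a tie, the label that was seen first wins.
    """
    counts = {}
    for label in predictions:
        counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get)


# Hypothetical expression profile of one sample across four genes.
approx, detail = haar_dwt([4.0, 2.0, 5.0, 5.0])

# Hypothetical outputs of KNN, naive Bayes, and SVM for the same sample.
decision = majority_vote(["tumor", "normal", "tumor"])
```

The orthonormal Haar step preserves the energy of the input, so the approximation and detail coefficients together carry the same information as the original profile.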
|
24
|
Han F, Sun W, Ling QH. A novel strategy for gene selection of microarray data based on gene-to-class sensitivity information. PLoS One 2014; 9:e97530. [PMID: 24844313 PMCID: PMC4028211 DOI: 10.1371/journal.pone.0097530] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Received: 12/06/2013] [Accepted: 04/21/2014] [Indexed: 11/19/2022] Open
Abstract
To obtain predictive genes with lower redundancy and better interpretability, a hybrid gene selection method encoding prior information is proposed in this paper. To begin with, the prior information, referred to as gene-to-class sensitivity (GCS), of all genes from microarray data is extracted by a single-hidden-layer feedforward neural network (SLFN). Then, to select more representative and less redundant genes, all genes are grouped into clusters by the K-means method, and low-sensitivity genes are filtered out according to their GCS values. Finally, a modified binary particle swarm optimization (BPSO) encoding the GCS information is proposed to perform further gene selection from the remaining genes. By taking the GCS information into account, the proposed method selects genes that are highly correlated with the sample classes. Thus, the low-redundancy gene subsets obtained by the proposed method also help improve classification accuracy on microarray data. Experimental results on several open microarray datasets verify the effectiveness and efficiency of the proposed approach.
Affiliation(s)
- Fei Han
  - School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
- Wei Sun
  - School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
- Qing-Hua Ling
  - School of Computer Science and Communication Engineering, Jiangsu University, Zhenjiang, China
  - School of Computer Science and Engineering, Jiangsu University of Science and Technology, Zhenjiang, China
|
25
|
Cai H, Ruan P, Ng M, Akutsu T. Feature weight estimation for gene selection: a local hyperlinear learning approach. BMC Bioinformatics 2014; 15:70. [PMID: 24625071 PMCID: PMC4007530 DOI: 10.1186/1471-2105-15-70] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 10/15/2013] [Accepted: 03/06/2014] [Indexed: 11/10/2022] Open
Abstract
Background Modeling high-dimensional data involving thousands of variables is particularly important for gene expression profiling experiments; nevertheless, it remains a challenging task. One of the challenges is to implement an effective method for selecting a small set of relevant genes buried in high-dimensional irrelevant noise. RELIEF is a popular and widely used approach for feature selection owing to its low computational cost and high accuracy. However, RELIEF-based methods suffer from instability, especially in the presence of noisy and/or high-dimensional outliers. Results We propose an innovative feature weighting algorithm, called LHR, to select informative genes from highly noisy data. LHR is based on RELIEF for feature weighting using classical margin maximization. The key idea of LHR is to estimate the feature weights through local approximation rather than global measurement, which is typically used in existing methods. The weights obtained by our method are very robust to degradation by noisy features, even those with vast dimensions. To demonstrate the performance of our method, extensive experiments involving classification tests have been carried out on both synthetic and real microarray benchmark datasets by combining the proposed technique with standard classifiers, including the support vector machine (SVM), k-nearest neighbor (KNN), hyperplane k-nearest neighbor (HKNN), linear discriminant analysis (LDA) and naive Bayes (NB). Conclusion Experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed feature selection method combined with supervised learning in three aspects: 1) high classification accuracy, 2) excellent robustness to noise and 3) good stability across various classification algorithms.
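For readers unfamiliar with RELIEF, the baseline that LHR builds on, the following is a minimal sketch of the classical two-class Relief weight update. It is not the LHR algorithm from the paper. It assumes range-normalized Manhattan distances, visits every sample once instead of sampling at random, and requires at least two samples per class; the function name and toy data are hypothetical.

```python
def relief_weights(X, y):
    """Classical Relief feature weights for a two-class data set.

    X is a list of samples (lists of floats) and y the class labels.
    A feature gains weight when it differs at the nearest sample of the
    other class (nearest miss) and loses weight when it differs at the
    nearest sample of the same class (nearest hit).
    """
    n, d = len(X), len(X[0])

    # Per-feature value range, used to scale differences into [0, 1].
    spans = []
    for f in range(d):
        column = [row[f] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            weights[f] += (diff(f, X[i], X[miss]) - diff(f, X[i], X[hit])) / n
    return weights


# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
w = relief_weights(X, y)
```

Because each update depends on a single nearest hit and miss, one noisy neighbor can swing the weights, which is the instability the abstract refers to.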
Affiliation(s)
- Hongmin Cai
  - School of Computer Science and Engineering, South China University of Technology, Guangdong, China.
|
26
|
Wang H, Zhang H, Dai Z, Chen MS, Yuan Z. TSG: a new algorithm for binary and multi-class cancer classification and informative genes selection. BMC Med Genomics 2013; 6 Suppl 1:S3. [PMID: 23445528 PMCID: PMC3552704 DOI: 10.1186/1755-8794-6-s1-s3] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.5] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND One of the challenges in classification of cancer tissue samples based on gene expression data is to establish an effective method that can select a parsimonious set of informative genes. The Top Scoring Pair (TSP), k-Top Scoring Pairs (k-TSP), Support Vector Machines (SVM), and prediction analysis of microarrays (PAM) are four popular classifiers that have comparable performance on multiple cancer datasets. SVM and PAM tend to use a large number of genes, and TSP and k-TSP always use an even number of genes. In addition, the selection of distinct gene pairs in k-TSP simply combines the pairs of top-ranking genes without considering the fact that the gene set with the best discrimination power may not be the combined pairs. The k-TSP algorithm also needs the user to specify an upper bound for the number of gene pairs. Here we introduce a computational algorithm to address these problems. The algorithm is named the Chisquare-statistic-based Top Scoring Genes (Chi-TSG) classifier, abbreviated as TSG. RESULTS The TSG classifier starts with the top two genes and sequentially adds genes to the candidate gene set to perform informative gene selection. The algorithm automatically reports the total number of informative genes selected with cross validation. We provide the algorithm for both binary and multi-class cancer classification. The algorithm was applied to 9 binary and 10 multi-class gene expression datasets involving human cancers. The TSG classifier outperforms TSP family classifiers by a large margin in most of the 19 datasets. In addition to improved accuracy, our classifier shares all the advantages of the TSP family classifiers, including easy interpretation, invariance to monotone transformations, selection of a small number of informative genes that allows follow-up studies, and resistance to sampling variations owing to within-sample operations.
CONCLUSIONS Redefining the scores for gene sets and the classification rules in TSP family classifiers by incorporating the sample size information can lead to better selection of informative genes and classification accuracy. The resulting TSG classifier offers a useful tool for cancer classification based on numerical molecular data.
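The Top Scoring Pair (TSP) baseline that this abstract compares against can be sketched briefly. This is the standard TSP scoring rule, not the chi-square-based TSG classifier proposed in the paper: a gene pair (i, j) is scored by how much the frequency of the ordering "gene i below gene j" differs between the two classes. The function name and toy data are hypothetical.

```python
from itertools import combinations


def top_scoring_pair(X, y):
    """Return ((i, j), score) for the best gene pair under the TSP rule.

    X is a list of samples, each a list of expression values; y holds the
    two-class labels. The score of a pair is
    |P(x_i < x_j | class A) - P(x_i < x_j | class B)|, which depends only
    on within-sample orderings.
    """
    classes = sorted(set(y))
    if len(classes) != 2:
        raise ValueError("TSP as sketched here needs exactly two classes")
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(len(X[0])), 2):
        probs = []
        for c in classes:
            rows = [x for x, label in zip(X, y) if label == c]
            probs.append(sum(x[i] < x[j] for x in rows) / len(rows))
        score = abs(probs[0] - probs[1])
        if score > best_score:
            best_pair, best_score = (i, j), score
    return best_pair, best_score


# Hypothetical toy data: gene 0 is below gene 1 only in class 0.
X = [[1, 5, 3], [2, 6, 9], [1, 4, 0], [7, 2, 3], [8, 3, 9], [6, 1, 0]]
y = [0, 0, 0, 1, 1, 1]
pair, score = top_scoring_pair(X, y)

# The rule only compares values within a sample, so a monotone increasing
# transformation such as cubing leaves the selected pair unchanged.
pair_cubed, _ = top_scoring_pair([[v ** 3 for v in row] for row in X], y)
```

This within-sample comparison is the source of the interpretability and transformation invariance that the abstract lists as shared advantages of the TSP family.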
Affiliation(s)
- Haiyan Wang
  - Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
|
27
|
Zhang H, Wang H, Dai Z, Chen MS, Yuan Z. Improving accuracy for cancer classification with a new algorithm for genes selection. BMC Bioinformatics 2012; 13:298. [PMID: 23148517 PMCID: PMC3562261 DOI: 10.1186/1471-2105-13-298] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Received: 04/09/2012] [Accepted: 09/24/2012] [Indexed: 12/21/2022] Open
Abstract
Background Even though the classification of cancer tissue samples based on gene expression data has advanced considerably in recent years, improving accuracy remains a great challenge. One of the challenges is to establish an effective method that can select a parsimonious set of relevant genes. So far, most methods for gene selection in the literature focus on screening individual genes or pairs of genes without considering the possible interactions among genes. Here we introduce a new computational method named the Binary Matrix Shuffling Filter (BMSF). It not only overcomes the difficulty associated with the search schemes of traditional wrapper methods and the overfitting problem in a high-dimensional search space but also takes potential gene interactions into account during gene selection. This method, coupled with a Support Vector Machine (SVM) for implementation, often selects a very small number of genes, which makes the model easy to interpret. Results We applied our method to 9 two-class gene expression datasets involving human cancers. During the gene selection process, the set of genes to be kept in the model was recursively refined and repeatedly updated according to the effect of a given gene on the contributions of other genes in reference to their usefulness in cancer classification. The small number of informative genes selected from each dataset leads to significantly improved leave-one-out cross-validation (LOOCV) classification accuracy across all 9 datasets for multiple classifiers. Our method also exhibits broad generalization in the genes selected, since multiple commonly used classifiers achieved either equivalent or much higher LOOCV accuracy than those reported in the literature. Conclusions A gene's contribution to binary cancer classification is better evaluated after adjusting for the joint effect of a large number of other genes.
A computationally efficient search scheme was provided to perform effective search in the extensive feature space that includes possible interactions of many genes. Performance of the algorithm applied to 9 datasets suggests that it is possible to improve the accuracy of cancer classification by a large margin when joint effects of many genes are considered.
Affiliation(s)
- Hongyan Zhang
  - Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
|
28
|
Lazar C, Taminau J, Meganck S, Steenhoff D, Coletta A, Molter C, de Schaetzen V, Duque R, Bersini H, Nowé A. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:1106-19. [PMID: 22350210 DOI: 10.1109/tcbb.2012.33] [Citation(s) in RCA: 219] [Impact Index Per Article: 16.8] [Indexed: 05/20/2023]
Abstract
A plenitude of feature selection (FS) methods is available in the literature, most of them arising from the need to analyze data of very high dimension, usually hundreds or thousands of variables. Such data sets are now available in various application areas like combinatorial chemistry, text mining, multivariate imaging, or bioinformatics. As a generally accepted rule, these methods are grouped into filters, wrappers, and embedded methods. More recently, a new group of methods has been added to the general framework of FS: ensemble techniques. The focus in this survey is on filter feature selection methods for informative feature discovery in gene expression microarray (GEM) analysis, which is also known as differentially expressed genes (DEGs) discovery, gene prioritization, or biomarker discovery. We present them in a unified framework, using standardized notations in order to reveal their technical details and to highlight their common characteristics as well as their particularities.
Affiliation(s)
- Cosmin Lazar
  - Computational Modeling Group, Department of Computer Science, Vrije Universiteit Brussel, Pleinlaan 2, Brussels 1050, Belgium.
|
29
|
Zhang JG, Li J, Tang W, Deng HW. Fusing Gene Interaction to Improve Disease Discrimination on Classification Analysis. ADVANCES IN GENETICS 2012; 1:1000102. [PMID: 23814698 DOI: 10.4172/age.1000102] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
Abstract
Strong statistical interactions among genes are commonly observed in association with diseases of public health importance. Gene interactions can potentially contribute to the improvement of disease classification accuracy. Especially when gene expression differences across classes are not large enough, it becomes more important to make use of gene interactions in disease classification analyses. However, most gene selection algorithms in classification analyses focus merely on genes whose expression levels differ across classes, and ignore the discriminatory information from gene interactions. In this study, we develop a two-stage algorithm that can take gene interaction into account during a gene selection procedure. Its biggest advantage is that it can exploit discriminatory information from gene interactions as well as gene expression differences, by using "Bayes error" as a gene selection criterion. Using simulated and real microarray data sets, we demonstrate the ability of gene interactions to improve classification accuracy, and show that the proposed algorithm can yield small informative sets of genes while leading to highly accurate classification results. Thus, our study may offer a novel insight for future gene selection algorithms for human disease discrimination.
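To give a sense of what a "Bayes error" criterion measures, the sketch below computes it in the simplest closed-form case: one feature, two equally likely classes, each Gaussian with the same standard deviation. Under those assumptions the Bayes error is Φ(−|μ0 − μ1| / (2σ)), where Φ is the standard normal CDF. The abstract does not describe how the authors estimate the Bayes error for genes or gene interactions, so this is an illustration of the quantity, not of their estimator; the function name is hypothetical.

```python
import math


def bayes_error_equal_var(mu0, mu1, sigma):
    """Bayes error of two equiprobable 1-D Gaussian classes with shared sigma.

    The optimal decision boundary is the midpoint of the two means, and the
    error is the tail mass beyond it: Phi(-|mu0 - mu1| / (2 * sigma)).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = abs(mu0 - mu1) / (2.0 * sigma)
    # Phi(-z) written with the complementary error function.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# Identical class means: no classifier can beat a coin flip.
no_signal = bayes_error_equal_var(0.0, 0.0, 1.0)

# Means two standard deviations apart: the error drops to Phi(-1).
some_signal = bayes_error_equal_var(0.0, 2.0, 1.0)
```

A lower Bayes error means the feature, or a derived feature such as an interaction term, separates the classes better, which is why it can act as a selection criterion.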
Affiliation(s)
- Ji-Gang Zhang
  - Center for Bioinformatics and Genomics, Department of Biostatistics and Bioinformatics, School of Public Health and Tropical Medicine, Tulane University, USA
|
30
|
Cancer classification based on microarray gene expression data using a principal component accumulation method. Sci China Chem 2011. [DOI: 10.1007/s11426-011-4263-5] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Indexed: 11/25/2022]
|
31
|
Dagliyan O, Uney-Yuksektepe F, Kavakli IH, Turkay M. Optimization based tumor classification from microarray gene expression data. PLoS One 2011; 6:e14579. [PMID: 21326602 PMCID: PMC3033885 DOI: 10.1371/journal.pone.0014579] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.9] [Received: 07/21/2010] [Accepted: 12/23/2010] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND An important use of data obtained from microarray measurements is the classification of tumor types with respect to genes that are either up- or down-regulated in specific cancer types. A number of algorithms have been proposed to obtain such classifications. These algorithms usually require parameter optimization to obtain accurate results depending on the type of data. Additionally, it is critical to find an optimal set of markers among those up- or down-regulated genes that can be clinically utilized to build assays for the diagnosis or to follow the progression of specific cancer types. In this paper, we employ a mixed integer programming based classification algorithm named the hyper-box enclosure method (HBE) for the classification of some cancer types with a minimal set of predictor genes. This optimization-based method, which is a user-friendly and efficient classifier, may allow clinicians to diagnose and follow the progression of certain cancer types. METHODOLOGY/PRINCIPAL FINDINGS We apply the HBE algorithm to some well-known data sets, such as leukemia, prostate cancer, diffuse large B-cell lymphoma (DLBCL), and small round blue cell tumors (SRBCT), to find predictor genes that can be utilized for diagnosis and prognosis in a robust manner with high accuracy. Our approach does not require any modification or parameter optimization for each data set. Additionally, information gain attribute evaluator, relief attribute evaluator and correlation-based feature selection methods are employed for the gene selection. The results are compared with those from other studies, and the biological roles of selected genes in the corresponding cancer type are described. CONCLUSIONS/SIGNIFICANCE The overall performance of our algorithm was better than that of the other algorithms reported in the literature and the classifiers found in the WEKA data-mining package.
Since it does not require parameter optimization and consistently achieves a very high prediction rate on different types of data sets, the HBE method is an effective and consistent tool for cancer type prediction with a small number of gene markers.
MESH Headings
- Algorithms
- Calibration
- Electronic Data Processing/standards
- Gene Expression Profiling/methods
- Gene Expression Profiling/standards
- Gene Expression Regulation, Neoplastic
- Humans
- Leukemia/classification
- Leukemia/diagnosis
- Leukemia/genetics
- Lymphoma, Large B-Cell, Diffuse/classification
- Lymphoma, Large B-Cell, Diffuse/diagnosis
- Lymphoma, Large B-Cell, Diffuse/genetics
- Male
- Microarray Analysis/methods
- Microarray Analysis/standards
- Models, Theoretical
- Neoplasms/classification
- Neoplasms/diagnosis
- Neoplasms/genetics
- Pattern Recognition, Automated/methods
- Pattern Recognition, Automated/standards
- Prognosis
- Prostatic Neoplasms/classification
- Prostatic Neoplasms/diagnosis
- Prostatic Neoplasms/genetics
Affiliation(s)
- Onur Dagliyan
  - Department of Chemical and Biological Engineering, Koc University, Istanbul, Turkey
- I. Halil Kavakli
  - Department of Chemical and Biological Engineering, Koc University, Istanbul, Turkey
- Metin Turkay
  - Department of Industrial Engineering, Koc University, Istanbul, Turkey
|
32
|
|
33
|
|
34
|
Chuang LY, Ke CH, Chang HW, Yang CH. A Two-Stage Feature Selection Method for Gene Expression Data. OMICS-A JOURNAL OF INTEGRATIVE BIOLOGY 2009; 13:127-37. [DOI: 10.1089/omi.2008.0083] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022]
Affiliation(s)
- Li-Yeh Chuang
  - Institute of Biotechnology and Chemical Engineering, I-Shou University, Kaohsiung, Taiwan, Republic of China
- Chao-Hsuan Ke
  - Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, Republic of China
- Hsueh-Wei Chang
  - Faculty of Biomedical Science and Environmental Biology, Kaohsiung Medical University, Taiwan, Republic of China
  - Graduate Institute of Natural Products, College of Pharmacy, Kaohsiung Medical University, Kaohsiung, Taiwan, Republic of China
  - Center of Excellence for Environmental Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan, Republic of China
- Cheng-Hong Yang
  - Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan, Republic of China
|
35
|
The Impact of Gene Selection on Imbalanced Microarray Expression Data. BIOINFORMATICS AND COMPUTATIONAL BIOLOGY 2009. [DOI: 10.1007/978-3-642-00727-9_25] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.7] [Indexed: 02/12/2023]
|
36
|
Gadgil M. A Population Proportion approach for ranking differentially expressed genes. BMC Bioinformatics 2008; 9:380. [PMID: 18801167 PMCID: PMC2566584 DOI: 10.1186/1471-2105-9-380] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 06/12/2008] [Accepted: 09/18/2008] [Indexed: 11/14/2022] Open
Abstract
Background DNA microarrays are used to investigate differences in gene expression between two or more classes of samples. Most currently used approaches compare mean expression levels between classes and are not geared to find genes whose expression is significantly different in only a subset of samples in a class. However, biological variability can lead to situations where key genes are differentially expressed in only a subset of samples. To facilitate the identification of such genes, a new method is reported. Methods The key difference between the Population Proportion Ranking Method (PPRM) presented here and almost all other methods currently used is in the quantification of variability. PPRM quantifies variability in terms of inter-sample ratios and can be used to calculate the relative merit of differentially expressed genes with a specified difference in expression level between at least some samples in the two classes, which at the same time have lower than a specified variability within each class. Results PPRM is tested on simulated data and on three publicly available cancer data sets. It is compared to the t test, PPST, COPA, OS, ORT and MOST using the simulated data. Under the conditions tested, it performs as well as or better than the other methods under low intra-class variability, and better than the t test, PPST, COPA and OS when a gene is differentially expressed in only a subset of samples. It performs better than ORT and MOST in recognizing non-differentially expressed genes with high variability in expression levels across all samples. For biological data, the success of the identified predictor genes in appropriately classifying an independent sample is reported.
Affiliation(s)
- Mugdha Gadgil
- Chemical Engineering and Process Development, National Chemical Laboratory, Pune, India.
|
37
|
Su Z, Hong H, Fang H, Shi L, Perkins R, Tong W. Very Important Pool (VIP) genes--an application for microarray-based molecular signatures. BMC Bioinformatics 2008; 9 Suppl 9:S9. [PMID: 18793473 PMCID: PMC2537560 DOI: 10.1186/1471-2105-9-s9-s9] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Advances in DNA microarray technology portend that microarray-derived molecular signatures will eventually be used in clinical environments and personalized medicine. Derivation of biomarkers is a large step beyond hypothesis generation and imposes considerably more stringency for accuracy in identifying informative gene subsets to differentiate phenotypes. The inherent nature of microarray data, with few samples and replicates compared to the large number of genes, requires identifying informative genes prior to classifier construction. However, improving the ability to identify differentiating genes remains a challenge in bioinformatics. RESULTS A new hybrid gene selection approach was investigated and tested with nine publicly available microarray datasets. The new method identifies a Very Important Pool (VIP) of genes from the broad patterns of gene expression data. The method uses a bagging sampling principle, where the re-sampled arrays are used to identify the most informative genes. Frequency of selection is used in a repetitive process to identify the VIP genes. The putative informative genes are selected using two methods, t-statistic and discriminatory analysis. In the t-statistic method, the informative genes are identified based on p-values. In the discriminatory analysis, disjoint Principal Component Analyses (PCAs) are conducted for each class of samples, and genes with high discrimination power (DP) are identified. The VIP gene selection approach was compared with the p-value ranking approach. The genes identified by the VIP method but not by the p-value ranking approach are also related to the disease investigated. More importantly, these genes are part of the pathways derived from the common genes shared by both the VIP and p-value ranking methods. Moreover, the binary classifiers built from these genes are statistically equivalent to those built from the top 50 p-value ranked genes in distinguishing different types of samples.
CONCLUSION The VIP gene selection approach could identify additional subsets of informative genes that would not always be selected by the p-value ranking method. These genes are likely to be additional true positives since they are a part of pathways identified by the p-value ranking method and expected to be related to the relevant biology. Therefore, these additional genes derived from the VIP method potentially provide valuable biological insights.
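The bagging-and-frequency idea described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: genes are ranked by the absolute Welch t-statistic instead of by p-values, the discriminatory-analysis (PCA) branch is omitted, and the names `welch_t`, `vip_genes`, `top_k`, `n_rounds` and `min_freq` are hypothetical.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two lists of expression values."""
    va, vb = statistics.variance(a), statistics.variance(b)
    se = (va / len(a) + vb / len(b)) ** 0.5
    # A degenerate resample with zero variance in both classes scores 0.
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def vip_genes(class_a, class_b, top_k=2, n_rounds=200, min_freq=0.8, seed=0):
    """Return genes ranked in the top_k in at least min_freq of bootstrap rounds.

    class_a, class_b: dict mapping gene name -> list of expression values,
    one value per array (sample) in that class.
    """
    rng = random.Random(seed)
    genes = list(class_a)
    n_a = len(next(iter(class_a.values())))
    n_b = len(next(iter(class_b.values())))
    counts = {g: 0 for g in genes}
    for _ in range(n_rounds):
        # Resample arrays with replacement within each class (bagging).
        idx_a = [rng.randrange(n_a) for _ in range(n_a)]
        idx_b = [rng.randrange(n_b) for _ in range(n_b)]
        scores = {
            g: abs(welch_t([class_a[g][i] for i in idx_a],
                           [class_b[g][i] for i in idx_b]))
            for g in genes
        }
        # Count how often each gene lands among the top_k of this round.
        for g in sorted(genes, key=scores.get, reverse=True)[:top_k]:
            counts[g] += 1
    return sorted(g for g in genes if counts[g] / n_rounds >= min_freq)
```

Selecting by frequency across resamples, rather than by a single ranking on the full data, is what lets a gene that is informative in most resamples enter the pool even when it would not reach the top of a one-off p-value list.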
Affiliation(s)
- Zhenqiang Su
- Center for Toxicoinformatics, National Center for Toxicological Research (NCTR), U.S. Food and Drug Administration (FDA), 3900 NCTR Road, Jefferson, AR 72079, USA.
|
38
|
Jiang W, Li X, Rao S, Wang L, Du L, Li C, Wu C, Wang H, Wang Y, Yang B. Constructing disease-specific gene networks using pair-wise relevance metric: application to colon cancer identifies interleukin 8, desmin and enolase 1 as the central elements. BMC Systems Biology 2008; 2:72. [PMID: 18691435 PMCID: PMC2535780 DOI: 10.1186/1752-0509-2-72] [Citation(s) in RCA: 49] [Impact Index Per Article: 2.9] [Received: 12/12/2007] [Accepted: 08/10/2008] [Indexed: 12/11/2022]
Abstract
Background With the advance of large-scale omics technologies, it is now feasible to reverse-engineer the underlying genetic networks that describe the complex interplay of molecular elements leading to complex diseases. Current networking approaches mainly focus on building genetic networks at large without probing the interaction mechanisms specific to a physiological or disease condition. The aim of this study was thus to develop a novel networking approach based on the relevance concept, which is ideal for revealing the integrative effects of multiple genes in the underlying genetic circuit for complex diseases. Results The approach started with the identification of multiple disease pathways, called a gene forest, whose genes were extracted from the decision forest constructed by supervised learning of the genome-wide transcriptional profiles of patient and normal samples. Based on the newly identified disease mechanisms, a novel pair-wise relevance metric, the adjusted frequency value, was used to define the degree of genetic relationship between two molecular determinants. We applied the proposed method to analyze a publicly available microarray dataset for colon cancer. The results demonstrated that the colon cancer-specific gene network captured the most important genetic interactions in several cellular processes, such as proliferation, apoptosis, differentiation, mitogenesis and immunity, which are known to be pivotal for tumourigenesis. Further analysis of the topological architecture of the network identified three known hub cancer genes [interleukin 8 (IL8) (p ≈ 0), desmin (DES) (p = 2.71 × 10⁻⁶) and enolase 1 (ENO1) (p = 4.19 × 10⁻⁵)], while two novel hub genes [RNA binding motif protein 9 (RBM9) (p = 1.50 × 10⁻⁴) and ribosomal protein L30 (RPL30) (p = 1.50 × 10⁻⁴)] may define new central elements in the gene network specific to colon cancer.
Gene Ontology (GO) based analysis of the colon cancer-specific gene network and of the sub-network consisting of three-way gene interactions suggested that tumourigenesis in colon cancer resulted from dysfunction in protein biosynthesis and in categories associated with the ribonucleoprotein complex, which is well supported by multiple lines of experimental evidence. Conclusion This study demonstrated that IL8, DES and ENO1 act as the central elements in colon cancer susceptibility, and that protein biosynthesis and the ribosome-associated function categories largely account for colon cancer tumourigenesis. Thus, the newly developed relevance-based networking approach offers a powerful means to reverse-engineer the disease-specific network and is a promising tool for the systematic dissection of complex diseases.
Affiliation(s)
- Wei Jiang
- College of Bioinformatics Science and Technology and Bio-pharmaceutical Key Laboratory of Heilongjiang Province, Harbin Medical University, Harbin 150081, PR China.
|
39
|
Abstract
The increasing use of gene expression microarrays, and the depositing of the resulting data into public repositories, means that more investigators are interested in using the technology either directly or through meta-analysis of the publicly available data. The tools available for data analysis have generally been developed for use by experts in the field, making them difficult for the general research community to use. For those interested in entering the field, especially those without a background in statistics, it is difficult to understand why experimental results can be so variable. The purpose of this review is to go through the workflow of a typical microarray experiment and to show that decisions made at each step, from choice of platform through statistical analysis methods to biological interpretation, are all sources of this variability.
|
40
|
|