1
Asim MN, Ibrahim MA, Zaib A, Dengel A. DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models. Front Med (Lausanne) 2025; 12:1503229. [PMID: 40265190 PMCID: PMC12011883 DOI: 10.3389/fmed.2025.1503229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Accepted: 03/10/2025] [Indexed: 04/24/2025] Open
Abstract
Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs the development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors, including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genetic blueprint and in understanding the factors that can modify it. This analysis helps in the early detection of genetic diseases and the design of targeted therapies. DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for a comprehensive review that bridges the gap between the two fields, the contributions of this paper are manifold. It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with three distinct AI paradigms, namely classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating information on 36 diverse biological databases that can be used to develop benchmark datasets for 44 different DNA sequence analysis tasks. To enable performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to 44 distinct DNA sequence analysis tasks. It presents applications of word embeddings and language models across 44 distinct DNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of the performance of predictive pipelines based on 39 word embeddings and 67 language models, as well as the top-performing traditional sequence-encoding-based predictors and their performance across 44 DNA sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Muhammad Ali Ibrahim
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Arooj Zaib
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
- Andreas Dengel
  - German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
  - Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
  - Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
2
Alhenawi E, Al-Sayyed R, Hudaib A, Mirjalili S. Feature selection methods on gene expression microarray data for cancer classification: A systematic review. Comput Biol Med 2022; 140:105051. [PMID: 34839186 DOI: 10.1016/j.compbiomed.2021.105051] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 07/09/2021] [Revised: 11/01/2021] [Accepted: 11/15/2021] [Indexed: 11/29/2022]
Abstract
This systematic review provides researchers interested in feature selection (FS) for microarray data with comprehensive information about the main research directions in gene expression classification over the past seven years. A set of 132 studies published by three different publishers is reviewed. The papers are categorized into nine directions based on their objectives, and the level of attention each FS direction received is summarized. The review revealed that 'propose hybrid FS methods' attracted the most interest, accounting for 34.9% of the papers, while the other directions ranged from 13.6% down to 3%. This helps researchers select the most competitive research direction. Papers in each category are thoroughly reviewed from six perspectives: method(s), classifier(s), dataset(s), dataset dimension(s) range, performance metric(s), and result(s) achieved.
Affiliation(s)
- Esra'a Alhenawi
  - King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Rizik Al-Sayyed
  - King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Amjad Hudaib
  - King Abdullah II School for Information Technology, The University of Jordan, Amman, Jordan
- Seyedali Mirjalili
  - Center for Artificial Intelligence Research and Optimization, Torrens University Australia, Fortitude Valley, Brisbane, 4006, QLD, Australia
  - Yonsei Frontier Lab, Yonsei University, Seoul, South Korea
3
Gul S, Rahim F, Isin S, Yilmaz F, Ozturk N, Turkay M, Kavakli IH. Structure-based design and classifications of small molecules regulating the circadian rhythm period. Sci Rep 2021; 11:18510. [PMID: 34531414 PMCID: PMC8445970 DOI: 10.1038/s41598-021-97962-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 11/03/2020] [Accepted: 08/27/2021] [Indexed: 11/09/2022] Open
Abstract
Circadian rhythm is an important mechanism that controls behavior and biochemical events based on 24 h rhythmicity. Ample evidence indicates that disturbance of this mechanism is associated with different diseases such as cancer, mood disorders, and familial delayed phase sleep disorder. Therefore, drug discovery studies have been initiated using high throughput screening. Recently, the crystal structures of the core clock proteins (CLOCK/BMAL1, Cryptochromes (CRY), Periods) responsible for generating circadian rhythm have been solved. The availability of these structures makes the core clock proteins amenable to the in silico design of molecules that regulate their activity. In addition, classifying molecules by features related to their toxicity and activity will improve the accuracy of the drug discovery process. Here, we identified 171 molecules that target functional domains of a core clock protein, CRY1, using structure-based drug design methods. We experimentally determined that 115 molecules were nontoxic, and 21 molecules significantly lengthened the period of circadian rhythm in U2OS cells. We then performed a machine learning study to classify these molecules and identify the features that make them toxic or lengthen the circadian period. Decision tree classifiers (DTC) identified 13 molecular descriptors that predict the toxicity of molecules with a mean accuracy of 79.53% using tenfold cross-validation. Gradient boosting classifiers (XGBC) identified 10 molecular descriptors that predict an increase in circadian period length with a mean accuracy of 86.56% using tenfold cross-validation. Our results suggest that these features can be used in QSAR studies to design novel nontoxic molecules that exhibit period-lengthening activity.
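The mean accuracies quoted in this abstract come from tenfold cross-validation. As a rough illustration of that evaluation protocol only (not the authors' pipeline), the sketch below splits samples into k folds and averages the held-out accuracy. The function names and the majority-class baseline are hypothetical; the folds are contiguous and unshuffled, which is an assumption made to keep the example deterministic.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(X, y, k, fit, predict):
    """Return the mean held-out accuracy over k folds.

    fit(X_train, y_train) returns a model; predict(model, X_test) returns labels.
    """
    scores = []
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        correct = sum(p == y[i] for p, i in zip(preds, test_idx))
        scores.append(correct / len(test_idx))
    return mean(scores)


# Hypothetical stand-in classifier: always predicts the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, X_test):
    return [model] * len(X_test)
```

In practice the samples would be shuffled (and often stratified by class) before splitting; with contiguous folds on label-sorted data, a fold can contain only labels the model never saw during training.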
Affiliation(s)
- Seref Gul
  - Department of Chemical and Biological Engineering, Koc University, Rumelifeneri Yolu, Sariyer, Istanbul, Turkey
- Fatih Rahim
  - Department of Industrial Engineering, Koc University, Rumelifeneri Yolu, Sariyer, Istanbul, Turkey
- Safak Isin
  - Department of Molecular Biology and Genetics, Rumelifeneri Yolu, Sariyer, Istanbul, Turkey
- Fatma Yilmaz
  - Department of Molecular Biology and Genetics, Gebze Technical University, Gebze, 41400, Kocaeli, Turkey
- Nuri Ozturk
  - Department of Molecular Biology and Genetics, Gebze Technical University, Gebze, 41400, Kocaeli, Turkey
- Metin Turkay
  - Department of Industrial Engineering, Koc University, Rumelifeneri Yolu, Sariyer, Istanbul, Turkey
- Ibrahim Halil Kavakli
  - Department of Chemical and Biological Engineering, Koc University, Rumelifeneri Yolu, Sariyer, Istanbul, Turkey
  - Department of Molecular Biology and Genetics, Rumelifeneri Yolu, Sariyer, Istanbul, Turkey
4
Gumaei A, Sammouda R, Al-Rakhami M, AlSalman H, El-Zaart A. Feature selection with ensemble learning for prostate cancer diagnosis from microarray gene expression. Health Informatics J 2021; 27:1460458221989402. [PMID: 33570011 DOI: 10.1177/1460458221989402] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 12/15/2022]
Abstract
Cancer diagnosis using machine learning algorithms is one of the main research topics in computer-based medical science. Prostate cancer is considered one of the leading causes of death worldwide. Analysis of microarray gene expression data using machine learning and soft computing algorithms is a useful tool for detecting prostate cancer in medical diagnosis. Even though traditional machine learning methods have been successfully applied to prostate cancer detection, the large number of attributes combined with the small sample size of microarray data remains a challenge that limits their effectiveness for medical diagnosis. Selecting a subset of relevant features and choosing an appropriate machine learning method can exploit the information in microarray data to improve detection accuracy. In this paper, we propose to use a correlation feature selection (CFS) method with random committee (RC) ensemble learning to detect prostate cancer from microarray gene expression data. A set of experiments is conducted on a public benchmark dataset using 10-fold cross-validation to evaluate the proposed approach. The experimental results revealed that the proposed approach attains 95.098% accuracy, which is higher than that of related methods on the same dataset.
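Correlation feature selection scores a candidate subset by rewarding features that correlate with the class and penalizing features that correlate with each other. A minimal sketch of the standard CFS merit heuristic follows, assuming absolute Pearson correlation as the correlation measure (the paper may use a different one); the helper names are hypothetical, and the search over subsets and the random committee ensemble are not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cfs_merit(columns, target):
    """CFS merit of a non-empty feature subset: k*r_cf / sqrt(k + k*(k-1)*r_ff).

    columns: list of feature columns (one list of values per gene)
    target:  class labels encoded as numbers
    r_cf is the mean feature-class correlation, r_ff the mean feature-feature correlation.
    """
    k = len(columns)
    r_cf = sum(abs(pearson(c, target)) for c in columns) / k
    pairs = list(combinations(columns, 2))
    r_ff = sum(abs(pearson(a, b)) for a, b in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)
```

Under this heuristic, adding an exact copy of an informative gene leaves the merit unchanged, while adding a gene uncorrelated with the class lowers it.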
Affiliation(s)
- Abdu Gumaei
  - Research Chair of Pervasive and Mobile Computing, King Saud University, Saudi Arabia
  - Taiz University, Yemen
- Mabrook Al-Rakhami
  - Research Chair of Pervasive and Mobile Computing, King Saud University, Saudi Arabia
5
Ghosh M, Sen S, Sarkar R, Maulik U. Quantum squirrel inspired algorithm for gene selection in methylation and expression data of prostate cancer. Appl Soft Comput 2021. [DOI: 10.1016/j.asoc.2021.107221] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 12/11/2022]
6
Haznedar B, Arslan MT, Kalinli A. Optimizing ANFIS using simulated annealing algorithm for classification of microarray gene expression cancer data. Med Biol Eng Comput 2021; 59:497-509. [PMID: 33543413 DOI: 10.1007/s11517-021-02331-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 07/12/2020] [Accepted: 01/29/2021] [Indexed: 11/28/2022]
Abstract
In the medical field, successful classification of microarray gene expression data is of major importance for cancer diagnosis. However, because of the very large number of genes, the performance of statistical algorithms in classifying DNA microarray gene expression data is often limited. Recently, there has been a notable increase in studies that use artificial intelligence methods to classify large-scale data. In this context, a hybrid approach based on the adaptive neuro-fuzzy inference system (ANFIS), fuzzy c-means clustering (FCM), and the simulated annealing (SA) algorithm is proposed in this study. The proposed method is applied to classify five different cancer datasets (i.e., lung cancer, central nervous system cancer, brain cancer, endometrial cancer, and prostate cancer). For comparison, the backpropagation algorithm, the hybrid algorithm, the genetic algorithm, and other statistical methods such as Bayesian network, support vector machine, and J48 decision tree are also evaluated. The results show that training the FCM-based ANFIS with the SA algorithm is the most successful approach for classifying all the cancer datasets, with an average accuracy of 96.28%, while the results of the other methods are also satisfactory. The proposed method gives more effective results than the others for classifying DNA microarray cancer gene expression data. [Graphical abstract: basic structure of the proposed method.]
Affiliation(s)
- Bulent Haznedar
  - Department of Computer Engineering, Hasan Kalyoncu University, 27100, Gaziantep, Turkey
- Mustafa Turan Arslan
  - Department of Computer Technology, Mustafa Kemal University, 31440, Hatay, Turkey
- Adem Kalinli
  - Department of Computer Engineering, Erciyes University, 38039, Kayseri, Turkey
  - Presidency Office, Rectorate, Middle East Technical University, 06800, Ankara, Turkey
7
Cancer molecular subtype classification from hypervolume-based discrete evolutionary optimization. Neural Comput Appl 2020. [DOI: 10.1007/s00521-020-04846-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
8
Liu XY, Wang S, Zhang H, Zhang H, Yang ZY, Liang Y. Novel Regularization Method for Biomarker Selection and Cancer Classification. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2020; 17:1329-1340. [PMID: 30716046 DOI: 10.1109/tcbb.2019.2897301] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 06/09/2023]
Abstract
Variable selection has attracted increasing attention in the big data and machine learning fields. In high-dimensional data analysis, many relevant variables or variable groups are commonly found. For example, researchers pay increasing attention to biological pathways or regulatory networks in microarray gene expression data. In recent years, regularization methods have become commonly used approaches for variable selection. Existing regularization methods generally use the L2 penalty to capture the grouping effect and an Lq penalty with a fixed value of q to enforce variable sparsity. These methods typically perform well with high efficiency, but they often require the data to satisfy a certain probability distribution. In this paper, we propose a novel complex harmonic regularization (CHR) penalty function, which can approximate the combination of Lp and Lq regularizations with adjustable p and q to select groups of relevant variables. The CHR penalty function can be effectively solved by a direct path seeking algorithm. We demonstrate that the proposed CHR penalty function performs better than state-of-the-art regularization methods in selecting groups of relevant variables and in classification.
9
SGL-SVM: A novel method for tumor classification via support vector machine with sparse group Lasso. J Theor Biol 2020; 486:110098. [DOI: 10.1016/j.jtbi.2019.110098] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Received: 05/15/2019] [Revised: 11/27/2019] [Accepted: 11/28/2019] [Indexed: 02/07/2023]
10
Rodrigues V, Deusdado S. Deterministic Classifiers Accuracy Optimization for Cancer Microarray Data. Practical Applications of Computational Biology and Bioinformatics, 13th International Conference 2020. [DOI: 10.1007/978-3-030-23873-5_19] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/22/2022]
11
Sharma A, Rani R. C-HMOSHSSA: Gene selection for cancer classification using multi-objective meta-heuristic and machine learning methods. Computer Methods and Programs in Biomedicine 2019; 178:219-235. [PMID: 31416551 DOI: 10.1016/j.cmpb.2019.06.029] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 02/10/2019] [Revised: 06/24/2019] [Accepted: 06/27/2019] [Indexed: 05/21/2023]
Abstract
BACKGROUND AND OBJECTIVE: Over the last two decades, DNA microarray technology has emerged as a powerful tool for early cancer detection and prevention. It helps to provide a detailed overview of the complex microenvironment of the disease. Moreover, the online availability of thousands of gene expression assays has made microarray data classification an active research area. A common goal is to find a minimum subset of genes while maximizing classification accuracy. METHODS: In pursuit of a similar objective, we propose a framework (C-HMOSHSSA) for gene selection using the multi-objective spotted hyena optimizer (MOSHO) and the salp swarm algorithm (SSA). Real-life optimization problems with more than one objective usually face the challenge of maintaining both convergence and diversity. SSA maintains diversity but suffers from the overhead of maintaining the necessary information. MOSHO, on the other hand, requires little computational effort and is therefore used for maintaining the necessary information. The proposed algorithm is thus a hybrid that uses the features of both SSA and MOSHO to improve its exploration and exploitation capability. RESULTS: Four different classifiers are trained on seven high-dimensional datasets using a subset of features (genes) obtained by applying the proposed hybrid gene selection algorithm. The results show that the proposed technique significantly outperforms existing state-of-the-art techniques. CONCLUSION: New sets of informative and biologically relevant genes are successfully identified by the proposed technique. The proposed approach can also be applied to other problem domains that involve feature selection.
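The two competing goals named in this abstract (fewer genes, higher accuracy) are typically compared through Pareto dominance. The sketch below illustrates only that comparison step and is not the C-HMOSHSSA algorithm; the `(n_genes, accuracy)` tuple layout and the function names are assumptions made for illustration.

```python
def dominates(a, b):
    """Return True if solution a dominates solution b.

    Each solution is a tuple (n_genes, accuracy): fewer genes is better,
    higher accuracy is better. a dominates b when it is no worse on both
    objectives and strictly better on at least one.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates (input order preserved)."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]
```

A multi-objective optimizer would maintain such a non-dominated set while it searches; how candidates are generated and updated is where methods such as MOSHO and SSA differ.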
Affiliation(s)
- Aman Sharma
  - Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
- Rinkle Rani
  - Computer Science and Engineering Department, Thapar Institute of Engineering & Technology, Patiala, Punjab, India
12
Kertmen A, Przysiecka Ł, Coy E, Popenda Ł, Andruszkiewicz R, Jurga S, Milewski S. Emerging Anticancer Activity of Candidal Glucoseamine-6-Phosphate Synthase Inhibitors upon Nanoparticle-Mediated Delivery. Langmuir: the ACS Journal of Surfaces and Colloids 2019; 35:5281-5293. [PMID: 30912436 DOI: 10.1021/acs.langmuir.8b04250] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 06/09/2023]
Abstract
Numerous glutamine analogues have been reported as irreversible inhibitors of the glucosamine-6-phosphate (GlcN-6-P) synthase in pathogenic Candida albicans in the last 3.5 decades. Among the reported inhibitors, the most effective N3-(4-methoxyfumaroyl)-l-2,3-diaminopropanoic acid (FMDP) has been extensively studied in order to develop its more active analogues. Several peptide-FMDP conjugates were tested to deliver FMDP to its subcellularly located GlcN-6-P synthase target. However, the rapid development of fungal resistance to FMDP-peptides required development of different therapeutic approaches to tackle antifungal resistance. In the current state of the global antifungal resistance, subcellular delivery of FMDP via free diffusion or endocytosis has become crucial. In this study, we report on in vitro nanomedical applications of FMDP and one of its ketoacid analogues, N3- trans-4-oxo-4-phenyl-2-butenoyl-l-2,3-diaminopropanoic acid (BADP). FMDP and BADP covalently attached to polyethylene glycol-coated iron oxide/silica core-shell nanoparticles are tested against intrinsically multidrug-resistant C. albicans. Three different human cancer cell lines potentially overexpressing the GlcN-6-P synthase enzyme are tested to demonstrate the immediate inhibitory effects of nanoparticle conjugates against mammalian cells. It is shown that nanoparticle-mediated delivery transforms FMDP and BADP into strong anticancer agents by inhibiting the growth of the tested cancer cells, whereas their anti-Candidal activity is decreased. This study discusses the emerging inhibitory effect of the FMDP/BADP-nanoparticle conjugates based on their cellular internalization efficiency and biocompatibility.
Affiliation(s)
- Ahmet Kertmen
  - Department of Pharmaceutical Technology and Biochemistry, Gdansk University of Technology, G. Narutowicza 11/12, 80-233 Gdansk, Poland
- Ryszard Andruszkiewicz
  - Department of Pharmaceutical Technology and Biochemistry, Gdansk University of Technology, G. Narutowicza 11/12, 80-233 Gdansk, Poland
- Sławomir Milewski
  - Department of Pharmaceutical Technology and Biochemistry, Gdansk University of Technology, G. Narutowicza 11/12, 80-233 Gdansk, Poland
13
Kang C, Huo Y, Xin L, Tian B, Yu B. Feature selection and tumor classification for microarray data using relaxed Lasso and generalized multi-class support vector machine. J Theor Biol 2019; 463:77-91. [DOI: 10.1016/j.jtbi.2018.12.010] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Received: 06/02/2018] [Revised: 11/03/2018] [Accepted: 12/06/2018] [Indexed: 02/08/2023]
14
Dashtban M, Balafar M, Suravajhala P. Gene selection for tumor classification using a novel bio-inspired multi-objective approach. Genomics 2017; 110:10-17. [PMID: 28780377 DOI: 10.1016/j.ygeno.2017.07.010] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 03/19/2017] [Revised: 07/12/2017] [Accepted: 07/30/2017] [Indexed: 12/21/2022]
Abstract
Identifying informative genes has always been a major step in microarray data analysis, and the complexity of various cancer datasets keeps this problem challenging. In this paper, a novel bio-inspired multi-objective algorithm is proposed for gene selection in microarray data classification, specifically in the binary domain of feature selection. The presented method extends the traditional Bat Algorithm with refined formulations, effective multi-objective operators, and novel local search strategies that employ social learning concepts in designing random walks. A hybrid model using the Fisher criterion is then applied to three widely used microarray cancer datasets to explore significant biomarkers, which reveals the effectiveness of the proposed method for genomic analysis. Experimental results unveil new combinations of informative biomarkers that are associated with findings of other studies.
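The Fisher criterion mentioned in this abstract is commonly used as a filter that ranks each gene by how well it separates two classes. A minimal sketch of one common two-class form, (mean difference squared) / (sum of class variances), is given below; the use of population variance, the 0/1 label encoding, and the function names are assumptions, and the paper's exact formulation may differ.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher score of one gene.

    values: expression values of the gene across samples
    labels: 0/1 class label per sample (both classes must be present)
    """
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    denom = pvariance(pos) + pvariance(neg)
    diff = mean(pos) - mean(neg)
    if denom == 0:
        # Both classes are constant: perfectly separating or uninformative.
        return float("inf") if diff != 0 else 0.0
    return diff ** 2 / denom


def rank_genes(expression, labels):
    """Return gene names sorted from most to least discriminative.

    expression: dict mapping gene name -> list of expression values
    """
    return sorted(expression, key=lambda g: fisher_score(expression[g], labels), reverse=True)
```

A filter of this kind is usually applied first to shrink the gene pool before a more expensive search, such as the multi-objective Bat Algorithm variant described above.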
Affiliation(s)
- M Dashtban
  - Department of Computer Engineering, Faculty of Electrical & Computer Engineering, University of Tabriz, Iran
- Mohammadali Balafar
  - Department of Computer Engineering, Faculty of Electrical & Computer Engineering, University of Tabriz, Iran
- Prashanth Suravajhala
  - Birla Institute of Scientific Research, Statue Circle, Jaipur 302001, Rajasthan, India
  - Bioclues.org, Kukatpally, Hyderabad 500072, Telangana, India
15
Chen H, Zhang Y, Gutman I. A kernel-based clustering method for gene selection with gene expression data. J Biomed Inform 2016; 62:12-20. [DOI: 10.1016/j.jbi.2016.05.007] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Received: 08/09/2015] [Revised: 05/08/2016] [Accepted: 05/19/2016] [Indexed: 12/21/2022]
16
Huang HH, Liu XY, Liang Y. Feature Selection and Cancer Classification via Sparse Logistic Regression with the Hybrid L1/2 +2 Regularization. PLoS One 2016; 11:e0149675. [PMID: 27136190 PMCID: PMC4852916 DOI: 10.1371/journal.pone.0149675] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 09/18/2015] [Accepted: 02/02/2016] [Indexed: 11/18/2022] Open
Abstract
Cancer classification and feature (gene) selection play an important role in knowledge discovery in genomic data. Although logistic regression is one of the most popular classification methods, it does not perform feature selection. In this paper, we present a new hybrid L1/2 +2 regularization (HLR) function, a linear combination of L1/2 and L2 penalties, to select the relevant genes in logistic regression. The HLR approach inherits desirable characteristics from the L1/2 penalty (sparsity) and the L2 penalty (a grouping effect in which highly correlated variables enter or leave a model together). We also propose a novel univariate HLR thresholding approach to update the estimated coefficients and develop a coordinate descent algorithm for the HLR penalized logistic regression model. The empirical results and simulations indicate that the proposed method is highly competitive with several state-of-the-art methods.
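The abstract describes HLR as a linear combination of L1/2 and L2 penalties added to logistic regression. One way to write such an objective, by analogy with the elastic net, is sketched below. The symbols λ (overall penalty strength), α (mixing weight), and ℓ(β) (logistic log-likelihood) are notational assumptions for illustration and may not match the parameterization used in the paper.

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
\Big\{ -\ell(\beta)
  \;+\; \lambda \Big[ \alpha \sum_{j=1}^{p} |\beta_j|^{1/2}
  \;+\; (1-\alpha) \sum_{j=1}^{p} \beta_j^{2} \Big] \Big\},
\qquad \lambda \ge 0,\quad 0 \le \alpha \le 1 .
```

In this form, α = 1 leaves only the sparsity-inducing L1/2 term, while α = 0 leaves only the L2 term that produces the grouping effect.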
Affiliation(s)
- Hai-Hui Huang
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau, 999078, China
- Xiao-Ying Liu
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau, 999078, China
- Yong Liang
  - Faculty of Information Technology & State Key Laboratory of Quality Research in Chinese Medicines, Macau University of Science and Technology, Avenida Wai Long, Taipa, Macau, 999078, China
17
A Unified Framework for Reservoir Computing and Extreme Learning Machines based on a Single Time-delayed Neuron. Sci Rep 2015; 5:14945. [PMID: 26446303 PMCID: PMC4597340 DOI: 10.1038/srep14945] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.8] [Received: 11/07/2014] [Accepted: 07/21/2015] [Indexed: 11/25/2022] Open
Abstract
In this paper we present a unified framework for extreme learning machines and reservoir computing (echo state networks), which can be physically implemented using a single nonlinear neuron subject to delayed feedback. The reservoir is built within the delay line, employing a number of “virtual” neurons. These virtual neurons receive random projections from the input layer containing the information to be processed. One key advantage of this approach is that it can be implemented efficiently in hardware. We show that the reservoir computing implementation, in this case optoelectronic, is also capable of realizing extreme learning machines, demonstrating the unified framework for both schemes in software as well as in hardware.
18
Sachnev V, Saraswathi S, Niaz R, Kloczkowski A, Suresh S. Multi-class BCGA-ELM based classifier that identifies biomarkers associated with hallmarks of cancer. BMC Bioinformatics 2015; 16:166. [PMID: 25986937 PMCID: PMC4448565 DOI: 10.1186/s12859-015-0565-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 02/02/2015] [Accepted: 03/31/2015] [Indexed: 12/05/2022] Open
Abstract
Background: Traditional cancer treatments have centered on cytotoxic drugs and general-purpose chemotherapy that may not be tailored to treat specific cancers. Identification of molecular markers that are related to different types of cancers might lead to the discovery of drugs that are patient and disease specific. This study aims to use microarray gene expression cancer data to identify biomarkers that are indicative of different types of cancers. Our aim is to provide a multi-class cancer classifier that can simultaneously differentiate between cancers and identify type-specific biomarkers, through the application of the Binary Coded Genetic Algorithm (BCGA) and a neural network based Extreme Learning Machine (ELM) algorithm. Results: BCGA and ELM are combined and used to select a subset of genes that are present in the Global Cancer Mapping (GCM) data set. This set of candidate genes contains over 52 biomarkers that are related to multiple cancers, according to the literature. They include APOA1, VEGFC, YWHAZ, B2M, EIF2S1, CCR9 and many other genes that have been associated with the hallmarks of cancer. BCGA-ELM is tested on several cancer data sets and the results are compared to other classification methods. BCGA-ELM matches or exceeds other algorithms in terms of accuracy. We were also able to show that over 50% of the genes selected by BCGA-ELM on GCM data are cancer-related biomarkers. Conclusions: We were able to simultaneously differentiate between 14 different types of cancers, using only 92 genes, to achieve a multi-class classification accuracy of 95.4%, which is between 21.6% and 38% higher than other results in the literature for multi-class cancer classification. Our findings suggest that computational algorithms such as BCGA-ELM can facilitate biomarker-driven integrated cancer research that can lead to a detailed understanding of the complexities of cancer.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0565-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Vasily Sachnev
  - Department of Information, Communication and Electronics Engineering, Catholic University of Korea, Bucheon, Republic of Korea
- Saras Saraswathi
  - Battelle Center for Mathematical Medicine at The Research Institute at Nationwide Children's Hospital; currently at Sidra Medical and Research Center, Doha, Qatar
- Rashid Niaz
  - Department of Medical Informatics, Sidra Medical and Research Center, Doha, Qatar
- Andrzej Kloczkowski
  - Battelle Center for Mathematical Medicine at The Research Institute at Nationwide Children's Hospital
  - Department of Pediatrics, College of Medicine, The Ohio State University, Columbus, USA
- Sundaram Suresh
  - School of Computer Science, Nanyang Technological University, Nanyang, Singapore
19
|
Dessì N, Pes B, Cannas LM. An Evolutionary Approach for Balancing Effectiveness and Representation Level in Gene Selection. JOURNAL OF INFORMATION TECHNOLOGY RESEARCH 2015. [DOI: 10.4018/jitr.2015040102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
As data mining develops and expands to new application areas, feature selection also reveals various aspects to be considered. This paper underlines two aspects that seem to categorize the large body of available feature selection algorithms: effectiveness and representation level. Effectiveness deals with selecting the minimum set of variables that maximizes the accuracy of a classifier, while the representation level concerns discovering how relevant the variables are for the domain of interest. To balance these aspects, the paper proposes an evolutionary framework for feature selection that expresses a hybrid method organized in layers, each of which exploits a specific search strategy. Extensive experiments on gene selection from DNA-microarray datasets are presented and discussed. Results indicate that the framework compares well with different hybrid methods proposed in the literature, as it is capable of finding well-suited subsets of informative features while improving classification accuracy.
Affiliation(s)
- Nicoletta Dessì
- Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
- Barbara Pes
- Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
- Laura Maria Cannas
- Department of Mathematics and Computer Science, Università degli Studi di Cagliari, Cagliari, Italy
|
20
|
Hou D, Koyutürk M. Comprehensive evaluation of composite gene features in cancer outcome prediction. Cancer Inform 2015; 13:93-104. [PMID: 25780335 PMCID: PMC4345828 DOI: 10.4137/cin.s14028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 06/24/2014] [Revised: 09/29/2014] [Accepted: 10/04/2014] [Indexed: 11/24/2022] Open
Abstract
Owing to the heterogeneous and continuously evolving nature of cancers, classifiers based on the expression of individual genes usually do not result in robust prediction of cancer outcome. As an alternative, composite gene features that combine functionally related genes have been proposed. It is expected that such features can be more robust and reproducible since they can capture the alterations in relevant biological processes as a whole and may be less sensitive to fluctuations in the expression of individual genes. Various algorithms have been developed for the identification of composite features and inference of composite gene feature activity, all of which claim to improve prediction accuracy. However, because of the limitations of the test datasets used in each individual study and inconsistent test procedures, the results of these studies are sometimes conflicting and irreproducible. For this reason, it is difficult to have a comprehensive understanding of the prediction performance of composite gene features, particularly across different cancers, cancer subtypes, and cohorts. In this study, we implement various algorithms for the identification of composite gene features and their utilization in cancer outcome prediction, and perform extensive comparison and evaluation using seven microarray datasets covering two cancer types and three different phenotypes. Our results show that, while some algorithms outperform others for certain classification tasks, no single algorithm consistently outperforms other algorithms and individual gene features.
Affiliation(s)
- Dezhi Hou
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA
- Mehmet Koyutürk
- Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH, USA; Center for Proteomics and Bioinformatics, Case Western Reserve University, Cleveland, OH, USA
|
21
|
Yang L, Ainali C, Kittas A, Nestle FO, Papageorgiou LG, Tsoka S. Pathway-level disease data mining through hyper-box principles. Math Biosci 2015; 260:25-34. [DOI: 10.1016/j.mbs.2014.09.005] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/12/2014] [Revised: 09/11/2014] [Accepted: 09/13/2014] [Indexed: 01/16/2023]
|
22
|
Yang L, Ainali C, Tsoka S, Papageorgiou LG. Pathway activity inference for multiclass disease classification through a mathematical programming optimisation framework. BMC Bioinformatics 2014; 15:390. [PMID: 25475756 PMCID: PMC4269079 DOI: 10.1186/s12859-014-0390-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 07/04/2014] [Accepted: 11/19/2014] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Applying machine learning methods to microarray gene expression profiles for disease classification problems is a popular way to derive biomarkers, i.e. sets of genes that can predict disease state or outcome. Traditional approaches, in which the expression of each gene was treated independently, suffer from low prediction accuracy and difficulty of biological interpretation. Current research efforts focus on integrating information on protein interactions through biochemical pathway datasets with expression profiles to propose pathway-based classifiers that can enhance disease diagnosis and prognosis. As most of the pathway activity inference methods in the literature are either unsupervised or applied to two-class datasets, there is good scope to address such limitations by proposing novel methodologies. RESULTS A supervised multiclass pathway activity inference method using optimisation techniques is reported. For each pathway expression dataset, patterns of its constituent genes are summarised into one composite feature, termed pathway activity, and a novel mathematical programming model is proposed to infer this feature as a weighted linear summation of the expression of its constituent genes. Gene weights are determined by the optimisation model in such a way that the resulting pathway activity has the optimal discriminative power with regard to disease phenotypes. Classification is then performed on the resulting low-dimensional pathway activity profile. CONCLUSIONS The model was evaluated on a variety of published gene expression profiles that cover different types of disease. We show that not only does it improve classification accuracy, but it can also perform well on multiclass disease datasets, a limitation of other approaches from the literature. Desirable features of the model include the ability to control the maximum number of genes that may participate in determining pathway activity, which may be pre-specified by the user.
Overall, this work highlights the potential of building pathway-based multi-phenotype classifiers for accurate disease diagnosis and prognosis problems.
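The "weighted linear summation" described in this abstract can be illustrated with a minimal sketch. It only shows how a pathway activity value is computed once gene weights are available; the weights, gene names, and expression values below are invented for illustration, and the mathematical programming model that the authors use to choose the weights is not reproduced here.

```python
def pathway_activity(expression, weights):
    """Per-sample pathway activity as a weighted sum of gene expression.

    expression: dict mapping gene name -> list of per-sample values
    weights:    dict mapping gene name -> weight; genes absent from this
                dict contribute nothing (weight 0)
    """
    n_samples = len(next(iter(expression.values())))
    activity = [0.0] * n_samples
    for gene, weight in weights.items():
        for sample, value in enumerate(expression[gene]):
            activity[sample] += weight * value
    return activity


# Hypothetical pathway with three genes measured in three samples.
expression = {
    "G1": [1.0, 2.0, 3.0],
    "G2": [0.5, 0.5, 1.5],
    "G3": [4.0, 0.0, 2.0],
}
# Hypothetical weights: only two genes participate, which mimics a cap on
# the number of genes allowed to determine the pathway activity.
weights = {"G1": 0.5, "G3": -0.25}

activity = pathway_activity(expression, weights)
```

Each sample is reduced to a single number per pathway, so a classifier would then operate on one feature per pathway rather than one per gene.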
Affiliation(s)
- Lingjian Yang
- Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK.
- Chrysanthi Ainali
- Department of Informatics, School of Natural and Mathematical Sciences, King's College London, London, WC2R 2LS, UK.
- Sophia Tsoka
- Department of Informatics, School of Natural and Mathematical Sciences, King's College London, London, WC2R 2LS, UK.
- Lazaros G Papageorgiou
- Centre for Process Systems Engineering, Department of Chemical Engineering, University College London, London, WC1E 7JE, UK.
|
23
|
Yang JF, Ding XF, Chen L, Mat WK, Xu MZ, Chen JF, Wang JM, Xu L, Poon WS, Kwong A, Leung GKK, Tan TC, Yu CH, Ke YB, Xu XY, Ke XY, Ma RC, Chan JC, Wan WQ, Zhang LW, Kumar Y, Tsang SY, Li S, Wang HY, Xue H. Copy number variation analysis based on AluScan sequences. J Clin Bioinforma 2014; 4:15. [PMID: 25558350 PMCID: PMC4273479 DOI: 10.1186/s13336-014-0015-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 09/03/2014] [Accepted: 11/12/2014] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND AluScan combines inter-Alu PCR using multiple Alu-based primers with opposite orientations and next-generation sequencing to capture a huge number of Alu-proximal genomic sequences for investigation. Its requirement of only sub-microgram quantities of DNA facilitates the examination of large numbers of samples. However, the special features of AluScan data rendered difficult the calling of copy number variation (CNV) directly using the calling algorithms designed for whole genome sequencing (WGS) or exome sequencing. RESULTS In this study, an AluScanCNV package has been assembled for efficient CNV calling from AluScan sequencing data employing a Geary-Hinkley transformation (GHT) of read-depth ratios between either paired test-control samples, or between test samples and a reference template constructed from reference samples, to call the localized CNVs, followed by use of a GISTIC-like algorithm to identify recurrent CNVs and circular binary segmentation (CBS) to reveal large extended CNVs. To evaluate the utility of CNVs called from AluScan data, the AluScans from 23 non-cancer and 38 cancer genomes were analyzed in this study. The glioma samples analyzed yielded the familiar extended copy-number losses on chromosomes 1p and 9. Also, the recurrent somatic CNVs identified from liver cancer samples were similar to those reported for liver cancer WGS with respect to a striking enrichment of copy-number gains in chromosomes 1q and 8q. When localized or recurrent CNV-features capable of distinguishing between liver and non-liver cancer samples were selected by correlation-based machine learning, a highly accurate separation of the liver and non-liver cancer classes was attained. CONCLUSIONS The results obtained from non-cancer and cancerous tissues indicated that the AluScanCNV package can be employed to call localized, recurrent and extended CNVs from AluScan sequences. 
Moreover, both the localized and recurrent CNVs identified by this method could be subjected to machine-learning selection to yield distinguishing CNV-features that were capable of separating between liver cancers and other types of cancers. Since the method is applicable to any human DNA sample with or without the availability of a paired control, it can also be employed to analyze the constitutional CNVs of individuals.
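As a rough illustration of the Geary-Hinkley transformation mentioned in this abstract, the sketch below converts a ratio of two approximately normal quantities into an approximately standard normal score. The formula is the standard textbook form of the transformation; the function name, the parameter values, and the independence assumption (`rho=0.0`) are illustrative only, and how AluScanCNV estimates these quantities from read depths is not stated in the abstract.

```python
import math


def geary_hinkley(ratio, mu_x, mu_y, sigma_x, sigma_y, rho=0.0):
    """Geary-Hinkley score for ratio = X / Y.

    X ~ N(mu_x, sigma_x^2) is the numerator (e.g. test read depth),
    Y ~ N(mu_y, sigma_y^2) is the denominator (e.g. control read depth),
    rho is their correlation. The result is approximately N(0, 1) when
    Y is unlikely to be close to zero.
    """
    numerator = mu_y * ratio - mu_x
    denominator = math.sqrt(
        sigma_y ** 2 * ratio ** 2
        - 2.0 * rho * sigma_x * sigma_y * ratio
        + sigma_x ** 2
    )
    return numerator / denominator


# Invented numbers: both depths centred on 100 reads with sd 10.
t_neutral = geary_hinkley(1.0, 100.0, 100.0, 10.0, 10.0)  # ratio 1.0
t_gain = geary_hinkley(1.5, 100.0, 100.0, 10.0, 10.0)     # ratio 1.5
```

A window whose score lies far from zero would be a candidate gain or loss; the threshold used for calling is a separate choice that this sketch does not make.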
Affiliation(s)
- Jian-Feng Yang
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
- Xiao-Fan Ding
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
- Lei Chen
- National Center for Liver Cancer Research and Eastern Hepatobiliary Surgery Hospital, 225 Changhai Road, Shanghai, 200438 China
- Wai-Kin Mat
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
- Michelle Zhi Xu
- Department of Oncology, Nanjing First Hospital, No. 68 Changle Road, Nanjing, 210006 China
- Jin-Fei Chen
- Department of Oncology, Nanjing First Hospital, No. 68 Changle Road, Nanjing, 210006 China
- Jian-Min Wang
- Department of Hematology, Changhai Hospital, Second Military Medical University, 174 Changhai Road, Shanghai, 200433 China
- Lin Xu
- Department of Thoracic Surgery, Jiangsu Key Laboratory of Molecular and Translational Cancer Research, Nanjing Medical University Affiliated Cancer Hospital, Cancer Institute of Jiangsu Province, Baiziting 42, Nanjing, 210009 China
- Wai-Sang Poon
- Division of Neurosurgery, Department of Surgery, Prince of Wales Hospital, Chinese University of Hong Kong, 30-32 Ngan Shing Street, Sha Tin, Hong Kong, China
- Ava Kwong
- Division of Neurosurgery, Department of Surgery, Li Ka Shing Faculty of Medicine, University of Hong Kong, Queen Mary Hospital, 102 Pokfulam Road, Hong Kong, China
- Gilberto Ka-Kit Leung
- Division of Neurosurgery, Department of Surgery, Li Ka Shing Faculty of Medicine, University of Hong Kong, Queen Mary Hospital, 102 Pokfulam Road, Hong Kong, China
- Tze-Ching Tan
- Department of Neurosurgery, Queen Elizabeth Hospital, 30 Gascoigne Road, Kowloon, Hong Kong, China
- Chi-Hung Yu
- Department of Neurosurgery, Queen Elizabeth Hospital, 30 Gascoigne Road, Kowloon, Hong Kong, China
- Yue-Bin Ke
- Shenzhen Center for Disease Control and Prevention, No 8 Longyuan Road, Nanshan district, Shenzhen City, 518055 China
- Xin-Yun Xu
- Shenzhen Center for Disease Control and Prevention, No 8 Longyuan Road, Nanshan district, Shenzhen City, 518055 China
- Xiao-Yan Ke
- Nanjing Brain Hospital and Nanjing Institute of Neuropsychiatry, Nanjing Medical University, Nanjing, 210029 China
- Ronald Cw Ma
- Department of Medicine and Therapeutics, 9th floor, Clinical Sciences Building, The Prince of Wales Hospital, Shatin, Hong Kong
- Juliana Cn Chan
- Department of Medicine and Therapeutics, 9th floor, Clinical Sciences Building, The Prince of Wales Hospital, Shatin, Hong Kong
- Wei-Qing Wan
- Department of Neurosurgery, Beijing Tiantan Hospital, 6 Tiantan Xili, Dongcheng District, Capital Medical University, Beijing, 100050 China
- Li-Wei Zhang
- Department of Neurosurgery, Beijing Tiantan Hospital, 6 Tiantan Xili, Dongcheng District, Capital Medical University, Beijing, 100050 China
- Yogesh Kumar
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
- Shui-Ying Tsang
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
- Shao Li
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST, Department of Automation, Tsinghua University, Beijing, 100084 China
- Hong-Yang Wang
- National Center for Liver Cancer Research and Eastern Hepatobiliary Surgery Hospital, 225 Changhai Road, Shanghai, 200438 China; International Cooperation Laboratory on Signal Transduction, Eastern Hepatobiliary Surgery Hospital, 225 Changhai Road, Shanghai, 200438 China
- Hong Xue
- Division of Life Science and Applied Genomics Centre, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China
|
24
|
A comparative analysis of swarm intelligence techniques for feature selection in cancer classification. ScientificWorldJournal 2014; 2014:693831. [PMID: 25157377 PMCID: PMC4137534 DOI: 10.1155/2014/693831] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.6] [Received: 03/20/2014] [Accepted: 06/18/2014] [Indexed: 11/17/2022] Open
Abstract
Feature selection in cancer classification is a central area of research in bioinformatics and is used to select the informative genes from the thousands of genes on a microarray. The genes are ranked based on t-statistics, signal-to-noise ratio (SNR), and F-test values. The swarm intelligence (SI) technique finds the informative genes from the top-m ranked genes. These selected genes are used for classification. In this paper, shuffled frog leaping with Lévy flight (SFLLF) is proposed for feature selection. In SFLLF, the Lévy flight is included to avoid premature convergence of the shuffled frog leaping (SFL) algorithm. SI techniques such as particle swarm optimization (PSO), cuckoo search (CS), SFL, and SFLLF are used for feature selection, which identifies informative genes for classification. The k-nearest neighbour (k-NN) technique is used to classify the samples. The proposed work is applied to 10 different benchmark datasets and examined alongside the other SI techniques. The experimental results show that the k-NN classifier with SFLLF feature selection outperforms PSO, CS, and SFL.
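The ranking step that precedes the swarm search can be sketched as follows. This is a minimal illustration for a two-class problem that assumes a commonly used SNR definition, (mean of class 0 minus mean of class 1) divided by the sum of the two class standard deviations; the abstract does not give the exact formula, and the gene names and values are invented. The swarm intelligence search itself is not shown.

```python
from statistics import mean, stdev


def snr(values_a, values_b):
    """Signal-to-noise ratio between two classes (assumed definition).

    Raises ZeroDivisionError if both classes have zero spread.
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def top_m_genes(expression, labels, m):
    """Rank genes by |SNR| and keep the m highest-scoring ones."""
    scores = {}
    for gene, values in expression.items():
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(snr(class0, class1))
    return sorted(scores, key=scores.get, reverse=True)[:m]


# Hypothetical expression values for six samples (three per class).
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "geneA": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # strongly separated
    "geneB": [5.0, 6.0, 7.0, 5.0, 6.0, 7.0],  # no class difference
    "geneC": [1.0, 3.0, 5.0, 4.0, 6.0, 8.0],  # weakly separated
}

selected = top_m_genes(expression, labels, 2)
```

In a pipeline of the kind described, a search procedure would then pick a subset of these top-m genes rather than using all of them.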
|
25
|
Ding X, Tsang SY, Ng SK, Xue H. Application of Machine Learning to Development of Copy Number Variation-based Prediction of Cancer Risk. GENOMICS INSIGHTS 2014. [PMID: 26203258 PMCID: PMC4504076 DOI: 10.4137/gei.s15002] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 02/06/2023]
Abstract
In the present study, recurrent copy number variations (CNVs) were examined in non-tumor blood cell DNAs from two ethnic cohorts: Caucasian non-cancer subjects and glioma, myeloma, and colorectal cancer patients; and Korean non-cancer subjects and hepatocellular carcinoma, gastric cancer, and colorectal cancer patients. In each cohort, cancer patients and controls differed highly significantly in the number of CN-losses and the size distribution of CN-gains, suggesting the existence of recurrent constitutional CNV-features useful for predicting predisposition to cancer. Upon identification by machine learning, such CNV-features could extensively discriminate between cancer-patient and control DNAs. When the CNV-features selected from a learning group of Caucasian or Korean mixed DNAs, consisting of both cancer-patient and control DNAs, were employed to make predictions on the cancer predisposition of an unseen test group of mixed DNAs, the average prediction accuracy was 93.6% for the Caucasian cohort and 86.5% for the Korean cohort.
Affiliation(s)
- Xiaofan Ding
- Applied Genomics Center and Division of Life Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
- Shui-Ying Tsang
- Applied Genomics Center and Division of Life Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
- Siu-Kin Ng
- Applied Genomics Center and Division of Life Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
- Hong Xue
- Applied Genomics Center and Division of Life Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong
|
26
|
Tian Y, Liu G, Wu C, Rong G, Sun A. Spring: A Method for Identifying Differentially Expressed Genes in Microarray Data. BIOTECHNOL BIOTEC EQ 2014. [DOI: 10.5504/bbeq.2013.0083] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/27/2022] Open
|
27
|
Cai H, Ruan P, Ng M, Akutsu T. Feature weight estimation for gene selection: a local hyperlinear learning approach. BMC Bioinformatics 2014; 15:70. [PMID: 24625071 PMCID: PMC4007530 DOI: 10.1186/1471-2105-15-70] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.6] [Received: 10/15/2013] [Accepted: 03/06/2014] [Indexed: 11/10/2022] Open
Abstract
Background: Modeling high-dimensional data involving thousands of variables is particularly important for gene expression profiling experiments; nevertheless, it remains a challenging task. One of the challenges is to implement an effective method for selecting a small set of relevant genes buried in high-dimensional irrelevant noise. RELIEF is a popular and widely used approach for feature selection owing to its low computational cost and high accuracy. However, RELIEF-based methods suffer from instability, especially in the presence of noisy and/or high-dimensional outliers. Results: We propose an innovative feature weighting algorithm, called LHR, to select informative genes from highly noisy data. LHR is based on RELIEF for feature weighting using classical margin maximization. The key idea of LHR is to estimate the feature weights through local approximation rather than global measurement, which is typically used in existing methods. The weights obtained by our method are very robust to degradation by noisy features, even those with vast dimensions. To demonstrate the performance of our method, extensive experiments involving classification tests have been carried out on both synthetic and real microarray benchmark datasets by combining the proposed technique with standard classifiers, including the support vector machine (SVM), k-nearest neighbor (KNN), hyperplane k-nearest neighbor (HKNN), linear discriminant analysis (LDA) and naive Bayes (NB). Conclusion: Experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed feature selection method combined with supervised learning in three aspects: 1) high classification accuracy, 2) excellent robustness to noise and 3) good stability across various classification algorithms.
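For readers unfamiliar with the baseline that LHR builds on, the following is a minimal sketch of the classic RELIEF weight update for a two-class problem. It is not the LHR algorithm from this abstract: it uses a single nearest hit and nearest miss per sample and visits every sample once, and the toy data are invented.

```python
def relief_weights(X, y):
    """Classic RELIEF feature weights (two-class, one nearest hit/miss).

    A feature gains weight when it differs between a sample and its
    nearest neighbour of the other class (miss) and loses weight when it
    differs from its nearest neighbour of the same class (hit).
    """
    n, d = len(X), len(X[0])
    # Per-feature range, used to put all features on a comparable scale.
    ranges = [
        (max(row[f] for row in X) - min(row[f] for row in X)) or 1.0
        for f in range(d)
    ]

    def distance(a, b):
        return sum(abs(a[f] - b[f]) / ranges[f] for f in range(d))

    weights = [0.0] * d
    for i in range(n):
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        hit = min(hits, key=lambda j: distance(X[i], X[j]))
        miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(d):
            diff_miss = abs(X[i][f] - X[miss][f]) / ranges[f]
            diff_hit = abs(X[i][f] - X[hit][f]) / ranges[f]
            weights[f] += (diff_miss - diff_hit) / n
    return weights


# Toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [1.0, 0.8], [0.9, 0.2], [1.1, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)
```

Because every update depends on a single nearest neighbour, one noisy sample can change which neighbour is chosen, which is one way to picture the instability the abstract refers to.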
Affiliation(s)
- Hongmin Cai
- School of Computer Science and Engineering, South China University of Technology, Guangdong, China.
|
28
|
Applications of Bayesian gene selection and classification with mixtures of generalized singular g-priors. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2014; 2013:420412. [PMID: 24382981 PMCID: PMC3870637 DOI: 10.1155/2013/420412] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 09/04/2013] [Revised: 11/10/2013] [Accepted: 11/10/2013] [Indexed: 11/17/2022]
Abstract
Recent advances in microarray technologies have led to the collection of an enormous number of genetic markers in disease association studies, yet scientists are interested in selecting a smaller set of genes to explore the relation between genes and disease. Current approaches either adopt a single-marker test, which ignores possible interactions among genes, or consider a multistage procedure that reduces the large number of genes before evaluating the association. Among the latter, Bayesian analysis can further accommodate the correlation between genes through the specification of a multivariate prior distribution and estimate the probabilities of association through latent variables. The covariance matrix, however, depends on an unknown parameter. In this research, we suggested a reference hyperprior distribution for such uncertainty, outlined the implementation of its computation, and illustrated this fully Bayesian approach with a colon cancer and leukemia study. Comparison with other existing methods was also conducted. The classification accuracy of our proposed model is higher with a smaller set of selected genes. The results not only replicated findings in several earlier studies, but also provided the strength of association with posterior probabilities.
|
29
|
Mao Z, Cai W, Shao X. Selecting significant genes by randomization test for cancer classification using gene expression data. J Biomed Inform 2013; 46:594-601. [DOI: 10.1016/j.jbi.2013.03.009] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 07/03/2012] [Revised: 01/30/2013] [Accepted: 03/28/2013] [Indexed: 12/30/2022]
|
30
|
Paparountas T, Nikolaidou-Katsaridou MN, Rustici G, Aidinis V. Data Mining and Meta-Analysis on DNA Microarray Data. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022] Open
Abstract
Microarray technology enables high-throughput parallel gene expression analysis, and its use has grown exponentially thanks to the development of a variety of applications for expression, genetic and epigenetic studies. A wealth of data is now available from public repositories, providing unprecedented opportunities for meta-analysis approaches, which could generate new biological information unrelated to the original scope of individual studies. This study provides guidelines for identifying the biological significance of statistically selected, differentially expressed genes derived from gene expression arrays and suggests further analysis pathways. The authors review the prerequisites for data mining and meta-analysis, summarize the conceptual methods to derive biological information from microarray data and suggest software for each category of data mining or meta-analysis.
Affiliation(s)
- Gabriella Rustici
- European Molecular Biology Laboratory-European Bioinformatics Institute, UK
- Vasilis Aidinis
- Biomedical Sciences Research Center “Alexander Fleming”, Greece
|
31
|
Liu Z, Chen D, Sheng L, Liu AY. Class prediction and feature selection with linear optimization for metagenomic count data. PLoS One 2013; 8:e53253. [PMID: 23555553 PMCID: PMC3608598 DOI: 10.1371/journal.pone.0053253] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 04/04/2012] [Accepted: 11/27/2012] [Indexed: 11/29/2022] Open
Abstract
The amount of metagenomic data is growing rapidly, while the computational methods for metagenome analysis are still in their infancy. It is important to develop novel statistical learning tools for predicting associations between bacterial communities and disease phenotypes and for detecting differentially abundant features. In this study, we present a novel statistical learning method for simultaneous association prediction and feature selection with metagenomic samples from two or more treatment populations on the basis of count data. We developed a linear programming based support vector machine with L(1) and joint L(1,∞) penalties for binary and multiclass classification with metagenomic count data (metalinprog). We evaluated the performance of our method on several real and simulated datasets. The proposed method can simultaneously identify features and predict classes with metagenomic count data.
Affiliation(s)
- Zhenqiu Liu
- University of Maryland Greenebaum Cancer Center, Baltimore, Maryland, USA.
|
32
|
Ramani RG, Jacob SG. Improved classification of lung cancer tumors based on structural and physicochemical properties of proteins using data mining models. PLoS One 2013; 8:e58772. [PMID: 23505559 PMCID: PMC3591381 DOI: 10.1371/journal.pone.0058772] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.0] [Received: 12/22/2012] [Accepted: 02/06/2013] [Indexed: 11/22/2022] Open
Abstract
Detecting divergence between oncogenic tumors plays a pivotal role in cancer diagnosis and therapy. This research work was focused on designing a computational strategy to predict the class of lung cancer tumors from the structural and physicochemical properties (1497 attributes) of protein sequences obtained from genes defined by microarray analysis. The proposed methodology involved the use of hybrid feature selection techniques (gain ratio and correlation based subset evaluators with Incremental Feature Selection) followed by Bayesian Network prediction to discriminate lung cancer tumors as Small Cell Lung Cancer (SCLC), Non-Small Cell Lung Cancer (NSCLC) and the COMMON classes. Moreover, this methodology eliminated the need for extensive data cleansing strategies on the protein properties and revealed the optimal and minimal set of features that contributed to lung cancer tumor classification with an improved accuracy compared to previous work. We also attempted to predict via supervised clustering the possible clusters in the lung tumor data. Our results revealed that supervised clustering algorithms exhibited poor performance in differentiating the lung tumor classes. Hybrid feature selection identified the distribution of solvent accessibility, polarizability and hydrophobicity as the highest ranked features with Incremental feature selection and Bayesian Network prediction generating the optimal Jack-knife cross validation accuracy of 87.6%. Precise categorization of oncogenic genes causing SCLC and NSCLC based on the structural and physicochemical properties of their protein sequences is expected to unravel the functionality of proteins that are essential in maintaining the genomic integrity of a cell and also act as an informative source for drug design, targeting essential protein properties and their composition that are found to exist in lung cancer tumors.
Affiliation(s)
- R. Geetha Ramani
- Department of Information Science and Technology, College of Engineering, Guindy, Anna University, Chennai, Tamilnadu, India
- Shomona Gracia Jacob
- Faculty of Information and Communication Engineering, Anna University, Chennai, Tamilnadu, India
|
33
|
Zhang H, Wang H, Dai Z, Chen MS, Yuan Z. Improving accuracy for cancer classification with a new algorithm for genes selection. BMC Bioinformatics 2012; 13:298. [PMID: 23148517 PMCID: PMC3562261 DOI: 10.1186/1471-2105-13-298] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.5] [Received: 04/09/2012] [Accepted: 09/24/2012] [Indexed: 12/21/2022] Open
Abstract
Background: Even though the classification of cancer tissue samples based on gene expression data has advanced considerably in recent years, improving accuracy remains a great challenge. One of the challenges is to establish an effective method that can select a parsimonious set of relevant genes. So far, most methods for gene selection in the literature focus on screening individual genes or pairs of genes without considering the possible interactions among genes. Here we introduce a new computational method named the Binary Matrix Shuffling Filter (BMSF). It not only overcomes the difficulty associated with the search schemes of traditional wrapper methods and the overfitting problem in a high-dimensional search space, but also takes potential gene interactions into account during gene selection. This method, coupled with a Support Vector Machine (SVM) for implementation, often selects a very small number of genes, which makes the model easy to interpret. Results: We applied our method to 9 two-class gene expression datasets involving human cancers. During the gene selection process, the set of genes to be kept in the model was recursively refined and repeatedly updated according to the effect of a given gene on the contributions of other genes in reference to their usefulness in cancer classification. The small number of informative genes selected from each dataset leads to significantly improved leave-one-out cross-validation (LOOCV) classification accuracy across all 9 datasets for multiple classifiers. Our method also exhibits broad generalization in the genes selected, since multiple commonly used classifiers achieved either equivalent or much higher LOOCV accuracy than those reported in the literature. Conclusions: A gene's contribution to binary cancer classification is better evaluated after adjusting for the joint effect of a large number of other genes.
A computationally efficient search scheme was provided to perform effective search in the extensive feature space that includes possible interactions of many genes. Performance of the algorithm applied to 9 datasets suggests that it is possible to improve the accuracy of cancer classification by a big margin when joint effects of many genes are considered.
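The leave-one-out evaluation mentioned in this abstract can be illustrated with a minimal sketch. It is not BMSF and does not use an SVM: a nearest-centroid classifier stands in so the example needs only the standard library, and the function names and toy expression values are hypothetical.

```python
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x):
    """Return the label whose class centroid is closest to x (squared Euclidean distance)."""
    centroids = {}
    for label in set(train_y):
        rows = [row for row, y in zip(train_X, train_y) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]

    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    # sorted() makes tie-breaking deterministic
    return min(sorted(centroids), key=lambda label: dist(centroids[label]))


def loocv_accuracy(X, y):
    """Hold out each sample once, train on the rest, and report the fraction predicted correctly."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
            correct += 1
    return correct / len(X)


# Hypothetical toy data: rows are samples, columns are two selected genes.
X = [[1.0, 0.2], [1.2, 0.1], [0.9, 0.3], [3.0, 2.1], [3.2, 1.9], [2.9, 2.2]]
y = ["normal", "normal", "normal", "tumor", "tumor", "tumor"]
accuracy = loocv_accuracy(X, y)
```

In a real study, any gene selection step would have to be repeated inside each held-out fold; selecting genes on the full dataset first would bias the LOOCV estimate upward.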
Affiliation(s)
- Hongyan Zhang
- Hunan Provincial Key Laboratory of Crop Germplasm Innovation and Utilization, Changsha 410128, China
|
34
|
Novoselova N, Tom I. Entropy-based cluster validation and estimation of the number of clusters in gene expression data. J Bioinform Comput Biol 2012; 10:1250011. [PMID: 22849366 DOI: 10.1142/s0219720012500114] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Many external and internal validity measures have been proposed to estimate the number of clusters in gene expression data, but as a rule they do not analyze the stability of the groupings produced by a clustering algorithm. Building on an approach that assesses the predictive power or stability of a partitioning, we propose a new cluster validation measure and a selection procedure to determine a suitable number of clusters. The validity measure is based on estimating the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme (consensus clustering). According to the proposed selection procedure, the stable clustering result is determined with reference to the validity measure under the null hypothesis, which encodes the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to estimate the clustering results on several datasets. The proposed procedure produced an accurate and robust estimate of the number of clusters, in agreement with biological knowledge and gold standards of cluster quality.
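The consensus matrix described in this abstract can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: the clusterer is a tiny one-dimensional 2-means, the "clearness" summary is a mean binary entropy chosen here for illustration (the abstract does not give the formula), and all names and data are hypothetical.

```python
import math
import random


def two_means(values):
    """Tiny 1-D k-means with k=2; returns a 0/1 label for each value."""
    # Deterministic start: the first value and the value farthest from it.
    centers = [values[0], max(values, key=lambda v: abs(v - values[0]))]
    labels = None
    for _ in range(100):
        new_labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                      for v in values]
        if new_labels == labels:
            break
        labels = new_labels
        for k in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == k]
            if members:
                centers[k] = sum(members) / len(members)
    return labels


def consensus_matrix(data, n_resamples=50, sample_size=4, seed=0):
    """Fraction of resamples in which items i and j, when drawn together, share a cluster."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    for _ in range(n_resamples):
        idx = rng.sample(range(n), sample_size)
        labels = two_means([data[i] for i in idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                if labels[a] == labels[b]:
                    together[i][j] += 1
    return [[together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def mean_binary_entropy(matrix):
    """Assumed clearness summary: 0 when every off-diagonal entry is exactly 0 or 1."""
    n = len(matrix)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            p = matrix[i][j]
            for q in (p, 1 - p):
                if q > 0:
                    total -= q * math.log2(q)
            count += 1
    return total / count


# Hypothetical toy data with two well-separated groups of three items.
data = [0.1, 0.3, 0.2, 9.8, 10.1, 10.0]
consensus = consensus_matrix(data)
entropy = mean_binary_entropy(consensus)
```

Entries near 0 or 1 indicate pairs that are stably separated or stably grouped. The abstract's selection procedure additionally compares the validity measure against permuted data, which this sketch leaves out.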
Affiliation(s)
- Natalia Novoselova
- Department of Bioinformatics, United Institute of Informatics Problems, Surganova Street 6, Minsk 220012, Belarus.
|
35
|
Chen Z, Padmanabhan K, Rocha AM, Shpanskaya Y, Mihelcic JR, Scott K, Samatova NF. SPICE: discovery of phenotype-determining component interplays. BMC Systems Biology 2012; 6:40. [PMID: 22583800 PMCID: PMC3515406 DOI: 10.1186/1752-0509-6-40] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 10/06/2011] [Accepted: 04/17/2012] [Indexed: 01/17/2023]
Abstract
Background: The latent behavior of a biological cell is complex. Deriving the underlying simplicity, or the fundamental rules governing this behavior, has been the Holy Grail of systems biology. Data-driven prediction of the system components, and of the component interplays responsible for the target system’s phenotype, is a key and challenging step in this endeavor. Results: The proposed approach, which we call System Phenotype-related Interplaying Components Enumerator (Spice), iteratively enumerates statistically significant system components that are hypothesized (1) to play an important role in defining the specificity of the target system’s phenotype(s); (2) to exhibit a functionally coherent behavior, namely, to act in a coordinated manner to perform the phenotype-specific function; and (3) to improve the predictive skill of the system’s phenotype(s) when used collectively in an ensemble of predictive models. Spice can be applied to both instance-based data and network-based data. When validated, Spice effectively identified system components related to three target phenotypes: biohydrogen production, motility, and cancer. Manual curation of the results agreed with the known phenotype-related system components reported in the literature. Additionally, using the identified system components as discriminatory features improved the prediction accuracy by 10% on the phenotype-classification task when compared to a number of state-of-the-art methods applied to eight benchmark microarray data sets. Conclusion: We formulate a problem, the enumeration of phenotype-determining system component interplays, and propose an effective methodology (Spice) to address it. Spice improved identification of cancer-related groups of genes from various microarray data sets and detected groups of genes associated with microbial biohydrogen production and motility, many of which were reported in the literature. Spice also improved the predictive skill of the system’s phenotype determination compared to individual classifiers and/or other ensemble methods, such as bagging, boosting, random forest, nearest shrunken centroid, and the random forest variable selection method.
Affiliation(s)
- Zhengzhang Chen
- Department of Computer Science, North Carolina State University, Raleigh, NC 27695, USA
|
36
|
Chen Y, Arthur PF, Barchia IM, Quinn K, Parnell PF, Herd RM. Using gene expression information obtained by quantitative real-time PCR to evaluate Angus bulls divergently selected for feed efficiency. Animal Production Science 2012. [DOI: 10.1071/an12098] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Indexed: 11/23/2022]
Abstract
Residual feed intake (RFI) is a measure of feed efficiency in beef cattle. Young Angus bulls from lines of cattle divergently selected for RFI were used in a gene expression profiling study of the liver. A quantitative real-time PCR (qPCR) assay was used to quantify the differentially expressed genes, and the information was used to examine the relationships between the genes and RFI and to classify the bulls into their respective RFI groups. Expression of 21 genes in liver biopsies from 22 low RFI and 22 high RFI bulls was measured by qPCR. Expression of 14 of the 21 genes was significantly correlated with RFI. The gene expression values were used in a principal component analysis from which five components were extracted. The five principal components explained 70% of the variation in the dependency structure. The first component was highly correlated (correlation coefficient of 0.69) with RFI. The genes of the glutathione S-transferase Mu family (GSTM1, GSTM2, GSTM4), protocadherin 19 (PCDH19), ATP-binding cassette transporter C4 (ABCC4) and superoxide dismutase 3 (SOD3) are in the xenobiotic pathway and were the key factors in the first principal component. This highlights the important relationship between this pathway and variation in RFI. The second and third principal components were also correlated with RFI, with correlation coefficients of –0.28 and –0.20, respectively. Two of the four important genes of the second principal component work coordinately in the signalling pathways that inhibit the insulin-stimulated insulin receptor and regulate energy metabolism. This is consistent with the observation that a positive genetic correlation exists between RFI and fatness. The important genes in the third principal component are related to extracellular matrix activity, with low RFI bulls showing high extracellular matrix activity.
|
37
|
Yang H, Cheng C, Zhang W. Average rank-based score to measure deregulation of molecular pathway gene sets. PLoS One 2011; 6:e27579. [PMID: 22096597 PMCID: PMC3212578 DOI: 10.1371/journal.pone.0027579] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 07/18/2011] [Accepted: 10/19/2011] [Indexed: 12/04/2022] Open
Abstract
Background: Deregulation of biological pathways has been shown to be involved in the tumorigenesis of a variety of cancers. The co-regulation of pathways in tumor and normal tissues has not been studied in a systematic manner. Results: In this study we propose a novel statistic named AR-score (average rank-based score) to measure pathway activities based on microarray gene expression profiles. We calculate and compare the AR-scores of pathways in microarray datasets containing expression profiles for a wide range of cancer types as well as the corresponding normal tissues. We find that many pathways undergo significant activity changes in tumors with respect to normal tissues. AR-scores for a small subset of pathways are capable of distinguishing tumor from normal tissues or classifying tumor subtypes. In normal tissues many pathways are highly correlated in their activities, whereas their correlations are significantly reduced in tumors and cancer cell lines. The co-expression of genes in the same pathways was also significantly perturbed in tumors. Conclusions: The co-regulation of genes in the same pathways and the co-regulation of different pathways are significantly perturbed in tumors versus normal tissues. Our method provides a useful tool for better understanding the mechanistic changes in tumors and can also be used for exploring other biological problems.
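The idea of an average rank-based pathway score can be illustrated with a short sketch. The abstract does not give the exact AR-score definition, so the version below is an assumption: genes in one sample are ranked by ascending expression, and the score is the mean rank of the pathway's genes. Tie handling and any normalization are ignored, and the function name, gene names, and values are hypothetical.

```python
def average_rank_score(expression, pathway_genes):
    """Mean within-sample rank of the pathway genes (rank 1 = lowest expression).

    `expression` maps gene name to expression value for a single sample.
    Genes absent from the sample are skipped; ties are broken by sort order.
    """
    ordered = sorted(expression, key=expression.get)
    rank = {gene: position + 1 for position, gene in enumerate(ordered)}
    members = [gene for gene in pathway_genes if gene in rank]
    if not members:
        raise ValueError("no pathway genes found in this sample")
    return sum(rank[gene] for gene in members) / len(members)


# Hypothetical single-sample profile and pathway gene set.
sample = {"g1": 5.0, "g2": 1.0, "g3": 9.0, "g4": 3.0}
pathway = ["g1", "g3"]
score = average_rank_score(sample, pathway)
```

Because the score depends only on ranks within a sample, it is insensitive to monotone rescaling of the expression values, which is one reason rank-based summaries are often used to compare samples from different arrays.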
Affiliation(s)
- Huan Yang
- Department of Reproductive Endocrinology, Obstetrics and Gynecology Hospital, Fudan University, Shanghai, China
- Chao Cheng
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut, United States of America
- * E-mail: (WZ); (CC)
- Wei Zhang
- Department of Reproductive Endocrinology, Obstetrics and Gynecology Hospital, Fudan University, Shanghai, China
- * E-mail: (WZ); (CC)
|
38
|
Bassel GW, Glaab E, Marquez J, Holdsworth MJ, Bacardit J. Functional network construction in Arabidopsis using rule-based machine learning on large-scale data sets. The Plant Cell 2011; 23:3101-16. [PMID: 21896882 PMCID: PMC3203449 DOI: 10.1105/tpc.111.088153] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Received: 06/14/2011] [Revised: 08/01/2011] [Accepted: 08/25/2011] [Indexed: 05/17/2023]
Abstract
The meta-analysis of large-scale postgenomics data sets within public databases promises to provide important novel biological knowledge. Statistical approaches including correlation analyses in coexpression studies of gene expression have emerged as tools to elucidate gene function using these data sets. Here, we present a powerful and novel alternative methodology to computationally identify functional relationships between genes from microarray data sets using rule-based machine learning. This approach, termed "coprediction," is based on the collective ability of groups of genes co-occurring within rules to accurately predict the developmental outcome of a biological system. We demonstrate the utility of coprediction as a powerful analytical tool using publicly available microarray data generated exclusively from Arabidopsis thaliana seeds to compute a functional gene interaction network, termed Seed Co-Prediction Network (SCoPNet). SCoPNet predicts functional associations between genes acting in the same developmental and signal transduction pathways irrespective of the similarity in their respective gene expression patterns. Using SCoPNet, we identified four novel regulators of seed germination (ALTERED SEED GERMINATION5, 6, 7, and 8), and predicted interactions at the level of transcript abundance between these novel and previously described factors influencing Arabidopsis seed germination. An online Web tool to query SCoPNet has been developed as a community resource to dissect seed biology and is available at http://www.vseed.nottingham.ac.uk/.
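The notion of genes "co-occurring within rules" can be sketched as a simple edge-counting step. This is a simplified illustration, not the SCoPNet pipeline: it assumes each learned rule has already been reduced to the list of genes it uses, it ignores rule accuracy, and the function name, threshold, and gene identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(rules, min_count=2):
    """Count how often each gene pair appears in the same rule; keep pairs seen at least min_count times."""
    counts = Counter()
    for genes in rules:
        # sorted(set(...)) gives each unordered pair one canonical key per rule
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return {pair: count for pair, count in counts.items() if count >= min_count}


# Hypothetical rules, each listed as the genes appearing in its conditions.
rules = [
    ["geneA", "geneB", "geneC"],
    ["geneA", "geneB"],
    ["geneB", "geneC"],
    ["geneA", "geneD"],
]
edges = cooccurrence_edges(rules)
```

Unlike a coexpression network, edges built this way link genes that are jointly useful for prediction, so two connected genes need not have similar expression profiles.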
Affiliation(s)
- George W Bassel
- Division of Plant and Crop Sciences, University of Nottingham, Loughborough, Leicestershire, UK.
|