551
Oh JH, Gao J. Fast kernel discriminant analysis for classification of liver cancer mass spectra. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:1522-1534. [PMID: 20479503 DOI: 10.1109/tcbb.2010.42] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/29/2023]
Abstract
The classification of serum samples based on mass spectrometry (MS) has been increasingly used for monitoring disease progression and for diagnosing early disease. However, the classification task in mass spectrometry data is extremely challenging because of the very large number of peaks (features) in mass spectra. Linear discriminant analysis (LDA) has been widely used for dimension reduction and feature extraction in many applications. However, conventional LDA suffers from the singularity problem when dealing with high-dimensional features. Another critical limitation is its linearity, which causes it to fail on classification problems over nonlinearly clustered data sets. To overcome these problems, we develop a new fast kernel discriminant analysis (FKDA) that computes the optimal discriminant vectors quickly. FKDA is applied to the classification of liver cancer mass spectrometry data, originally analyzed by Ressom et al., that consist of three categories: hepatocellular carcinoma, cirrhosis, and healthy. We demonstrate the superiority and effectiveness of FKDA when compared to other classification techniques.
Affiliation(s)
- Jung Hun Oh
- Division of Bioinformatics and Outcomes Research, Department of Radiation Oncology, Washington University School of Medicine, St. Louis, MO 63110, USA
552
Zheng S, Liu W. An experimental comparison of gene selection by Lasso and Dantzig selector for cancer classification. Comput Biol Med 2011; 41:1033-40. [DOI: 10.1016/j.compbiomed.2011.08.011] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.1] [Received: 04/10/2011] [Revised: 08/29/2011] [Accepted: 08/30/2011] [Indexed: 01/28/2023]
553
McLachlan GJ, Rathnayake SI. Testing for Group Structure in High-Dimensional Data. J Biopharm Stat 2011; 21:1113-25. [DOI: 10.1080/10543406.2011.608342] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/16/2022]
Affiliation(s)
- G. J. McLachlan
- Department of Mathematics, University of Queensland, St. Lucia, Queensland, Australia
- Institute of Molecular Bioscience, University of Queensland, St. Lucia, Queensland, Australia
- Suren I. Rathnayake
- Department of Mathematics, University of Queensland, St. Lucia, Queensland, Australia
554
Vu TN, Valkenborg D, Smets K, Verwaest KA, Dommisse R, Lemière F, Verschoren A, Goethals B, Laukens K. An integrated workflow for robust alignment and simplified quantitative analysis of NMR spectrometry data. BMC Bioinformatics 2011; 12:405. [PMID: 22014236 PMCID: PMC3217056 DOI: 10.1186/1471-2105-12-405] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.5] [Received: 07/28/2011] [Accepted: 10/20/2011] [Indexed: 11/24/2022] Open
Abstract
Background Nuclear magnetic resonance spectroscopy (NMR) is a powerful technique to reveal and compare quantitative metabolic profiles of biological tissues. However, chemical and physical sample variations make the analysis of the data challenging, and typically require the application of a number of preprocessing steps prior to data interpretation. For example, noise reduction, normalization, baseline correction, peak picking, spectrum alignment and statistical analysis are indispensable components in any NMR analysis pipeline. Results We introduce a novel suite of informatics tools for the quantitative analysis of NMR metabolomic profile data. The core of the processing cascade is a novel peak alignment algorithm, called hierarchical Cluster-based Peak Alignment (CluPA). The algorithm aligns a target spectrum to the reference spectrum in a top-down fashion by building a hierarchical cluster tree from peak lists of reference and target spectra and then dividing the spectra into smaller segments based on the most distant clusters of the tree. To reduce the computational time to estimate the spectral misalignment, the method makes use of Fast Fourier Transformation (FFT) cross-correlation. Since the method returns a high-quality alignment, we can propose a simple methodology to study the variability of the NMR spectra. For each aligned NMR data point the ratio of the between-group and within-group sum of squares (BW-ratio) is calculated to quantify the difference in variability between and within predefined groups of NMR spectra. This differential analysis is related to the calculation of the F-statistic or a one-way ANOVA, but without distributional assumptions. Statistical inference based on the BW-ratio is achieved by bootstrapping the null distribution from the experimental data. Conclusions The workflow performance was evaluated using a previously published dataset. 
Correlation maps, spectral and grey scale plots show clear improvements in comparison to other methods, and the down-to-earth quantitative analysis works well for the CluPA-aligned spectra. The whole workflow is embedded into a modular and statistically sound framework that is implemented as an R package called "speaq" ("spectrum alignment and quantitation"), which is freely available from http://code.google.com/p/speaq/.
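The BW-ratio described in this abstract can be illustrated with a short sketch. The Python below is not taken from the speaq package. The function names `bw_ratio` and `null_bw_ratios`, the resampling scheme (drawing intensities with replacement from the pooled values while keeping the group labels fixed), and the default settings are all assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence


def bw_ratio(values: Sequence[float], groups: Sequence[Hashable]) -> float:
    """Between-group over within-group sum of squares for one data point."""
    if len(values) != len(groups):
        raise ValueError("values and groups must have the same length")
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    overall_mean = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for members in by_group.values():
        group_mean = sum(members) / len(members)
        # Each group contributes its size times the squared mean offset.
        between += len(members) * (group_mean - overall_mean) ** 2
        within += sum((v - group_mean) ** 2 for v in members)
    if within == 0.0:
        # No within-group spread: infinite ratio unless there is no spread at all.
        return float("inf") if between > 0.0 else 0.0
    return between / within


def null_bw_ratios(
    values: Sequence[float],
    groups: Sequence[Hashable],
    n_resamples: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Hypothetical null distribution: resample pooled values, ignoring groups."""
    rng = random.Random(seed)
    pool = list(values)
    return [
        bw_ratio([rng.choice(pool) for _ in pool], groups)
        for _ in range(n_resamples)
    ]


intensities = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ["control", "control", "control", "treated", "treated", "treated"]
observed = bw_ratio(intensities, labels)
null = null_bw_ratios(intensities, labels, n_resamples=200)
# Share of resampled ratios at least as large as the observed one.
p_value = sum(r >= observed for r in null) / len(null)
```

In an aligned data set this calculation would be repeated for every spectral data point, which is why the abstract stresses that a high-quality alignment has to come first.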
Affiliation(s)
- Trung N Vu
- Department of Mathematics and Computer Science, University of Antwerp, Belgium.
555
Kim SK, Yun SJ, Kim J, Lee OJ, Bae SC, Kim WJ. Identification of gene expression signature modulated by nicotinamide in a mouse bladder cancer model. PLoS One 2011; 6:e26131. [PMID: 22028816 PMCID: PMC3189956 DOI: 10.1371/journal.pone.0026131] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 06/02/2011] [Accepted: 09/20/2011] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND Urinary bladder cancer is often a result of exposure to chemical carcinogens such as those in cigarette smoke. Because of histological similarity, chemically induced rodent cancer models have been widely used for human bladder cancer studies. Previous investigations have suggested that nicotinamide, a water-soluble form of vitamin B3, may play a key role in cancer prevention through its activities in cellular repair. However, to date, evidence identifying the genetic alterations associated with nicotinamide in cancer prevention has not been provided. Here, we search for the molecular signatures of cancer prevention by nicotinamide using an N-butyl-N-(4-hydroxybutyl)-nitrosamine (BBN)-induced urinary bladder cancer model in mice. METHODOLOGY/PRINCIPAL FINDINGS Via microarray gene expression profiling of 20 mice and 233 human bladder samples, we performed various statistical analyses and immunohistochemical staining for validation. The expression patterns of 893 genes associated with nicotinamide activity in cancer prevention were identified by microarray data analysis. Gene network analyses of these 893 genes revealed that Myc and its associated genes may be the most important regulators of bladder cancer prevention, and the gene expression signature correlated well with protein expression data. Comparison of gene expression between human and mouse revealed that BBN-induced mouse bladder cancers exhibited gene expression profiles that were more similar to those of invasive human bladder cancers than to those of non-invasive human bladder cancers. CONCLUSIONS/SIGNIFICANCE This study demonstrates that nicotinamide plays an important role as a chemo-preventive and therapeutic agent in bladder cancer through the regulation of the Myc oncogenic signature. Nicotinamide may represent a promising therapeutic modality in patients with muscle-invasive bladder cancer.
Affiliation(s)
- Seon-Kyu Kim
- Department of Urology, Chungbuk National University College of Medicine, Cheongju, Korea
- BK21 Chungbuk Biomedical Science Center, Chungbuk National University School of Medicine, Cheongju, Korea
- Seok-Joong Yun
- Department of Urology, Chungbuk National University College of Medicine, Cheongju, Korea
- BK21 Chungbuk Biomedical Science Center, Chungbuk National University School of Medicine, Cheongju, Korea
- Jiyeon Kim
- Department of Pharmacology and Cancer Biology, Duke University Medical School, Durham, North Carolina, United States of America
- Ok-Jun Lee
- Department of Pathology, Chungbuk National University College of Medicine, Cheongju, Korea
- Suk-Chul Bae
- Department of Biochemistry, Chungbuk National University College of Medicine, Cheongju, Korea
- Wun-Jae Kim
- Department of Urology, Chungbuk National University College of Medicine, Cheongju, Korea
- BK21 Chungbuk Biomedical Science Center, Chungbuk National University School of Medicine, Cheongju, Korea
556
Wang X, Simon R. Microarray-based cancer prediction using single genes. BMC Bioinformatics 2011; 12:391. [PMID: 21982331 PMCID: PMC3228540 DOI: 10.1186/1471-2105-12-391] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.9] [Received: 05/24/2011] [Accepted: 10/07/2011] [Indexed: 11/23/2022] Open
Abstract
Background Although numerous methods of using microarray data analysis for cancer classification have been proposed, most utilize many genes to achieve accurate classification. This can hamper interpretability of the models and ease of translation to other assay platforms. We explored the use of single genes to construct classification models. We first identified the genes with the most powerful univariate class discrimination ability and then constructed simple classification rules for class prediction using the single genes. Results We applied our model development algorithm to eleven cancer gene expression datasets and compared classification accuracy to that for standard methods including Diagonal Linear Discriminant Analysis, k-Nearest Neighbor, Support Vector Machine and Random Forest. The single gene classifiers provided classification accuracy comparable to or better than those obtained by existing methods in most cases. We analyzed the factors that determined when simple single gene classification is effective and when more complex modeling is warranted. Conclusions For most of the datasets examined, the single-gene classification methods appear to work as well as more standard methods, suggesting that simple models could perform well in microarray-based cancer prediction.
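The two steps named in this abstract (rank genes by univariate discrimination, then build a simple rule on a single gene) can be sketched as follows. The abstract does not state which ranking statistic or rule the authors used, so this sketch substitutes a plain threshold rule ("decision stump") scored by training accuracy. The names `best_stump`, `fit_single_gene`, and `predict` are hypothetical, and class labels are assumed to be 0 and 1.

```python
from typing import Sequence, Tuple

Model = Tuple[int, float, int]  # (gene index, cut point, class predicted above the cut)


def best_stump(expr: Sequence[float], labels: Sequence[int]) -> Tuple[float, int, float]:
    """Best cut point for one gene: returns (cut, class_above, training accuracy)."""
    ordered = sorted(set(expr))
    # Candidate cut points are midpoints between consecutive distinct values.
    cuts = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    best = (ordered[0], 1, 0.0)
    for cut in cuts:
        for class_above in (0, 1):
            hits = sum(
                (class_above if x > cut else 1 - class_above) == y
                for x, y in zip(expr, labels)
            )
            accuracy = hits / len(labels)
            if accuracy > best[2]:
                best = (cut, class_above, accuracy)
    return best


def fit_single_gene(X: Sequence[Sequence[float]], labels: Sequence[int]) -> Model:
    """Pick the gene whose stump has the highest training accuracy."""
    best_gene, best_rule = 0, (0.0, 1, -1.0)
    for gene in range(len(X[0])):
        rule = best_stump([row[gene] for row in X], labels)
        if rule[2] > best_rule[2]:
            best_gene, best_rule = gene, rule
    return best_gene, best_rule[0], best_rule[1]


def predict(model: Model, sample: Sequence[float]) -> int:
    """Apply the single-gene threshold rule to one sample."""
    gene, cut, class_above = model
    return class_above if sample[gene] > cut else 1 - class_above


# Rows are samples, columns are genes; gene 1 separates the two classes.
X = [[5.0, 1.0], [5.1, 2.0], [4.9, 8.0], [5.2, 9.0]]
y = [0, 0, 1, 1]
model = fit_single_gene(X, y)
```

Because the gene and cut point are both chosen on the training data, any accuracy estimate for such a rule would need to come from held-out samples or cross-validation that repeats the selection inside each fold.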
Affiliation(s)
- Xiaosheng Wang
- Biometric Research Branch, National Cancer Institute, National Institutes of Health, Rockville, MD 20852, USA
557
Lai Y, Wu B, Zhao H. A permutation test approach to the choice of size k for the nearest neighbors classifier. J Appl Stat 2011. [DOI: 10.1080/02664763.2010.547565] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
558
Huang GH, Wang SM, Hsu CC. Optimization-Based Model Fitting for Latent Class and Latent Profile Analyses. Psychometrika 2011; 76:584-611. [PMID: 27519682 DOI: 10.1007/s11336-011-9227-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/10/2010] [Revised: 03/21/2011] [Indexed: 06/06/2023]
Abstract
Statisticians typically estimate the parameters of latent class and latent profile models using the Expectation-Maximization algorithm. This paper proposes an alternative two-stage approach to model fitting. The first stage uses modified k-means and hierarchical clustering algorithms to identify the latent classes that best satisfy the conditional independence assumption underlying the latent variable model. The second stage then uses mixture modeling, treating the class membership as known. The proposed approach is theoretically justifiable, directly checks the conditional independence assumption, and converges much faster than the full likelihood approach when analyzing high-dimensional data. This paper also develops a new classification rule based on latent variable models. The proposed classification procedure reduces the dimensionality of measured data and explicitly recognizes the heterogeneous nature of the complex disease, which makes it well suited to analyzing high-throughput genomic data. Simulation studies and real data analysis demonstrate the advantages of the proposed method.
Affiliation(s)
- Guan-Hua Huang
- Institute of Statistics, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu, 30010, Taiwan
- Su-Mei Wang
- Institute of Statistics, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu, 30010, Taiwan
- Chung-Chu Hsu
- Institute of Statistics, National Chiao Tung University, 1001 Ta Hsueh Road, Hsinchu, 30010, Taiwan
560
Tong DL, Schierz AC. Hybrid genetic algorithm-neural network: Feature extraction for unpreprocessed microarray data. Artif Intell Med 2011; 53:47-56. [DOI: 10.1016/j.artmed.2011.06.008] [Citation(s) in RCA: 48] [Impact Index Per Article: 3.4] [Received: 04/05/2010] [Revised: 05/11/2011] [Accepted: 06/26/2011] [Indexed: 12/22/2022]
561
Tsai YS, Aguan K, Pal NR, Chung IF. Identification of single- and multiple-class specific signature genes from gene expression profiles by group marker index. PLoS One 2011; 6:e24259. [PMID: 21909426 PMCID: PMC3164723 DOI: 10.1371/journal.pone.0024259] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 03/18/2011] [Accepted: 08/06/2011] [Indexed: 01/06/2023] Open
Abstract
Informative genes from microarray data can be used to construct prediction models and investigate biological mechanisms. Differentially expressed genes, the main targets of most gene selection methods, can be classified as single- and multiple-class specific signature genes. Here, we present a novel gene selection algorithm based on a Group Marker Index (GMI), which is intuitive, of low computational complexity, and efficient in identification of both types of genes. Most gene selection methods identify only single-class specific signature genes and cannot identify multiple-class specific signature genes easily. Our algorithm can detect de novo certain conditions of multiple-class specificity of a gene and makes use of a novel non-parametric indicator to assess the discrimination ability between classes. Our method is effective even when the sample size is small as well as when the class sizes are significantly different. To compare the effectiveness and robustness, we formulate an intuitive template-based method and use four well-known datasets. We demonstrate that our algorithm outperforms the template-based method in difficult cases with unbalanced distribution. Moreover, the multiple-class specific genes are good biomarkers and play important roles in biological pathways. Our literature survey supports that the proposed method identifies unique multiple-class specific marker genes (not reported earlier to be related to cancer) in the Central Nervous System data. It also discovers unique biomarkers indicating the intrinsic difference between subtypes of lung cancer. We also associate the pathway information with the multiple-class specific signature genes and cross-reference to published studies. We find that the identified genes participate in the pathways directly involved in cancer development in leukemia data. Our method gives a promising way to find genes that may be involved in pathways of multiple diseases and hence opens up the possibility of using an existing drug on other diseases as well as designing a single drug for multiple diseases.
Affiliation(s)
- Yu-Shuen Tsai
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan
- Kripamoy Aguan
- Department of Biotechnology & Bioinformatics, North Eastern Hill University, Shillong, India
- Nikhil R. Pal
- Electronics & Communication Sciences Unit, Indian Statistical Institute, Calcutta, India
- I-Fang Chung
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan
- Center for Systems and Synthetic Biology, National Yang-Ming University, Taipei, Taiwan
- * E-mail: (I-FC); (NRP)
562
Zheng CH, Zhang L, Ng TY, Shiu SCK, Huang DS. Metasample-based sparse representation for tumor classification. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:1273-1282. [PMID: 21282864 DOI: 10.1109/tcbb.2011.20] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 05/30/2023]
Abstract
Reliable and accurate identification of tumor type is crucial to the proper treatment of cancers. In recent years, it has been shown that sparse representation (SR) by l1-norm minimization is robust to noise, outliers and even incomplete measurements, and SR has been successfully used for classification. This paper presents a new SR-based method for tumor classification using gene expression data. A set of metasamples is extracted from the training samples, and then an input testing sample is represented as a linear combination of these metasamples by the l1-regularized least squares method. Classification is achieved by using a discriminating function defined on the representation coefficients. Since l1-norm minimization leads to a sparse solution, the proposed method is called metasample-based SR classification (MSRC). Extensive experiments on publicly available gene expression data sets show that MSRC is efficient for tumor classification, achieving higher accuracy than many existing representative schemes.
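The representation-then-classification idea in this abstract can be sketched in a few lines. The abstract specifies neither how the metasamples are extracted nor the exact discriminating function, so this sketch assumes the metasamples are already given as the columns of a matrix `A`. It solves the l1-regularized least squares problem with plain ISTA (iterative soft thresholding) and uses the class-wise reconstruction residual as the decision rule, which is the usual choice in sparse representation classifiers. The function names and the value of `lam` are hypothetical.

```python
import math
from typing import Hashable, List, Sequence


def soft(v: float, t: float) -> float:
    """Soft-thresholding operator, the proximal map of t * |v|."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def lasso_ista(A: Sequence[Sequence[float]], y: Sequence[float],
               lam: float = 0.05, n_iter: int = 500) -> List[float]:
    """Minimise 0.5 * ||y - A x||^2 + lam * ||x||_1; A is a list of rows."""
    n_rows, n_cols = len(A), len(A[0])
    # The squared Frobenius norm bounds the gradient's Lipschitz constant
    # (assumes A is not all zeros).
    step = 1.0 / sum(a * a for row in A for a in row)
    x = [0.0] * n_cols
    for _ in range(n_iter):
        resid = [sum(A[i][j] * x[j] for j in range(n_cols)) - y[i]
                 for i in range(n_rows)]
        grad = [sum(A[i][j] * resid[i] for i in range(n_rows))
                for j in range(n_cols)]
        x = [soft(x[j] - step * grad[j], step * lam) for j in range(n_cols)]
    return x


def classify(A: Sequence[Sequence[float]], meta_classes: Sequence[Hashable],
             y: Sequence[float], lam: float = 0.05) -> Hashable:
    """Assign y to the class whose metasamples reconstruct it best."""
    x = lasso_ista(A, y, lam)
    best_class, best_resid = None, math.inf
    for c in sorted(set(meta_classes)):
        # Reconstruct y using only the coefficients of class c.
        recon = [sum(A[i][j] * x[j] for j in range(len(x)) if meta_classes[j] == c)
                 for i in range(len(A))]
        resid = math.sqrt(sum((yi - ri) ** 2 for yi, ri in zip(y, recon)))
        if resid < best_resid:
            best_class, best_resid = c, resid
    return best_class


# Two metasamples (columns), one per class, over three genes (rows).
A = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
meta_classes = [0, 1]
label = classify(A, meta_classes, [0.9, 0.1, 0.0])
```

ISTA is used here only because it fits in a few lines of standard-library Python; a real analysis would normally rely on a tested l1 solver.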
Affiliation(s)
- Chun-Hou Zheng
- College of Information and Communication Technology, Qufu Normal University, Rizhao, Shandong 276826, China.
563
Unimodal transform of variables selected by interval segmentation purity for classification tree modeling of high-dimensional microarray data. Talanta 2011; 85:1689-94. [DOI: 10.1016/j.talanta.2011.06.076] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/06/2011] [Revised: 06/28/2011] [Accepted: 06/30/2011] [Indexed: 11/20/2022]
564
Strategy to find molecular signatures in a small series of rare cancers: validation for radiation-induced breast and thyroid tumors. PLoS One 2011; 6:e23581. [PMID: 21853153 PMCID: PMC3154936 DOI: 10.1371/journal.pone.0023581] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Received: 10/07/2010] [Accepted: 07/21/2011] [Indexed: 11/28/2022] Open
Abstract
Methods of classification using transcriptome analysis for case-by-case tumor diagnosis could be limited by tumor heterogeneity and masked information in the gene expression profiles, especially when the number of tumors is small. We propose a new strategy, EMts_2PCA, based on: 1) The identification of a gene expression signature with a great potential for discriminating subgroups of tumors (EMts stage), which includes: a) a learning step, based on an expectation-maximization (EM) algorithm, to select sets of candidate genes whose expressions discriminate two subgroups, b) a training step to select from the sets of candidate genes those with the highest potential to classify training tumors, c) the compilation of genes selected during the training step, and standardization of their levels of expression to finalize the signature. 2) The predictive classification of independent prospective tumors, according to the two subgroups of interest, by the definition of a validation space based on a two-step principal component analysis (2PCA). The present method was evaluated by classifying three series of tumors, and its robustness, in terms of tumor clustering and prediction, was further compared with that of three classification methods (Gene expression bar code, Top-scoring pair(s) and a PCA-based method). Results showed that EMts_2PCA was very efficient in tumor classification and prediction, with scores always better than those obtained by the most common methods of tumor clustering. Specifically, EMts_2PCA permitted identification of highly discriminating molecular signatures to differentiate post-Chernobyl thyroid or post-radiotherapy breast tumors from their sporadic counterparts that were previously unsuccessfully classified or classified with errors.
565
Wang H, van der Laan MJ. Dimension reduction with gene expression data using targeted variable importance measurement. BMC Bioinformatics 2011; 12:312. [PMID: 21849016 PMCID: PMC3166941 DOI: 10.1186/1471-2105-12-312] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Received: 06/02/2011] [Accepted: 07/29/2011] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND When a large number of candidate variables are present, a dimension reduction procedure is usually conducted to reduce the variable space before the subsequent analysis is carried out. The goal of dimension reduction is to find a list of candidate genes of a more manageable length, ideally including all the relevant genes. Leaving many uninformative genes in the analysis can lead to biased estimates and reduced power. Therefore, dimension reduction is often considered a necessary predecessor of the analysis because it can not only reduce the cost of handling numerous variables, but also has the potential to improve the performance of the downstream analysis algorithms. RESULTS We propose a TMLE-VIM dimension reduction procedure based on the variable importance measurement (VIM) in the framework of targeted maximum likelihood estimation (TMLE). TMLE is an extension of maximum likelihood estimation targeting the parameter of interest. TMLE-VIM is a two-stage procedure. The first stage resorts to a machine learning algorithm, and the second stage improves the first-stage estimation with respect to the parameter of interest. CONCLUSIONS We demonstrate with simulations and data analyses that our approach not only enjoys the prediction power of machine learning algorithms, but also accounts for the correlation structures among variables and therefore produces better variable rankings. When utilized in dimension reduction, TMLE-VIM can help to obtain the shortest possible list with the most truly associated variables.
Affiliation(s)
- Hui Wang
- Department of Pediatrics, Stanford University, MSOB X111, Stanford, CA 94305, USA.
566
Zheng CH, Ng TY, Zhang L, Shiu CK, Wang HQ. Tumor classification based on non-negative matrix factorization using gene expression data. IEEE Trans Nanobioscience 2011; 10:86-93. [PMID: 21742573 DOI: 10.1109/tnb.2011.2144998] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.2] [Indexed: 11/09/2022]
Abstract
This paper presents a new method for tumor classification using gene expression data. In the proposed method, we first select genes using nonnegative matrix factorization (NMF) or sparse NMF (SNMF), and then we extract features from the selected genes, again using NMF or SNMF. Finally, we apply support vector machines (SVM) to classify the tumor samples using the extracted features. To improve classification, a modified SNMF algorithm is also proposed. The experimental results on three benchmark microarray data sets validate that the proposed method is efficient. Moreover, the biological meaning of the selected genes is also analyzed.
Affiliation(s)
- Chun-Hou Zheng
- College of Electrical Engineering and Automation, Anhui University, Hefei, Anhui, China.
567
Peters T, Bulger DW, Loi TH, Yang JYH, Ma D. Two-step cross-entropy feature selection for microarrays—power through complementarity. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:1148-1151. [PMID: 21321369 DOI: 10.1109/tcbb.2011.30] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/30/2023]
Abstract
Current feature selection methods for supervised classification of tissue samples from microarray data generally fail to exploit complementary discriminatory power that can be found in sets of features. Using a feature selection method with the computational architecture of the cross-entropy method, including an additional preliminary step ensuring a lower bound on the number of times any feature is considered, we show when testing on a human lymph node data set that there are a significant number of genes that perform well when their complementary power is assessed, but “pass under the radar” of popular feature selection methods that only assess genes individually on a given classification tool. We also show that this phenomenon becomes more apparent as diagnostic specificity of the tissue samples analysed increases.
Affiliation(s)
- Tim Peters
- Department of Statistics, Macquarie University, Sydney, NSW 2109, Australia.
568
Zhao X, Cheung LWK. Multiclass kernel-imbedded Gaussian processes for microarray data analysis. IEEE/ACM Trans Comput Biol Bioinform 2011; 8:1041-1053. [PMID: 20805625 DOI: 10.1109/tcbb.2010.85] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/29/2023]
Abstract
Identifying significant differentially expressed genes of a disease can help understand the disease at the genomic level. A hierarchical statistical model named multiclass kernel-imbedded Gaussian process (mKIGP) is developed under a Bayesian framework for a multiclass classification problem using microarray gene expression data. Specifically, based on a multinomial probit regression setting, an empirically adaptive algorithm with a cascading structure is designed to find appropriate featuring kernels, to discover potentially significant genes, and to make optimal tumor/cancer class predictions. A Gibbs sampler is adopted as the core of the algorithm to perform Bayesian inferences. A prescreening procedure is implemented to alleviate the computational complexity. The simulated examples show that mKIGP performed very close to the Bayesian bound and outperformed the referred state-of-the-art methods in a linear case, a nonlinear case, and a case with a mislabeled training sample. It shows great promise for problems on which linear-model-based methods are unsatisfactory. The mKIGP was also applied to four published real microarray data sets, and it was very effective for identifying significant differentially expressed genes and predicting classes in all of these data sets.
Affiliation(s)
- Xin Zhao
- Sanjole Inc., 2800 Woodlawn Dr., Suite 271, Honolulu, HI 96822, USA.
569
Drake JI, Bogaard HJ, Mizuno S, Clifton B, Xie B, Gao Y, Dumur CI, Fawcett P, Voelkel NF, Natarajan R. Molecular signature of a right heart failure program in chronic severe pulmonary hypertension. Am J Respir Cell Mol Biol 2011; 45:1239-47. [PMID: 21719795 DOI: 10.1165/rcmb.2010-0412oc] [Citation(s) in RCA: 169] [Impact Index Per Article: 12.1] [Indexed: 12/27/2022] Open
Abstract
Right heart failure is the cause of death of most patients with severe pulmonary arterial hypertensive (PAH) disorders, yet little is known about the cellular and molecular causes of right ventricular failure (RVF). We first showed a differential gene expression pattern between normal rat right and left ventricles, and postulated the existence of a molecular right heart failure program that distinguishes RVF from adaptive right ventricular hypertrophy (RVH), and that may differ in some respects from a left heart failure program. By means of microarrays and transcriptional sequencing strategies, we used two models of adaptive RVH to characterize a gene expression pattern reflective of growth and the maintenance of myocardial structure. Moreover, two models of RVF were associated with fibrosis, capillary rarefaction, the decreased expression of genes encoding the angiogenesis factors vascular endothelial growth factor, insulin-like growth factor 1, apelin, and angiopoietin-1, and the increased expression of genes encoding a set of glycolytic enzymes. The treatment of established RVF with a β-adrenergic receptor blocker reversed RVF, and partly reversed the molecular RVF program. We conclude that normal right and left ventricles demonstrate clearly discernible differences in the expression of mRNA and microRNA, and that RVH and RVF are characterized by distinct patterns of gene expression that relate to cell growth, angiogenesis, and energy metabolism.
Affiliation(s)
- Jennifer I Drake
- Department of Microbiology, Virginia Commonwealth University, Richmond, USA
570
Novak D, Mihelj M, Ziherl J, Olenšek A, Munih M. Psychophysiological measurements in a biocooperative feedback loop for upper extremity rehabilitation. IEEE Trans Neural Syst Rehabil Eng 2011; 19:400-10. [PMID: 21708507 DOI: 10.1109/tnsre.2011.2160357] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.9] [Indexed: 11/10/2022]
Abstract
This paper examines the usefulness of psychophysiological measurements in a biocooperative feedback loop that adjusts the difficulty of an upper extremity rehabilitation task. Psychophysiological measurements (heart rate, skin conductance, respiration, and skin temperature) were used both by themselves and in combination with task performance and biomechanics. Data fusion was performed with discriminant analysis, and a special adaptive version was implemented that can gradually adapt to a subject. Both healthy subjects and hemiparetic patients participated in the study. The accuracy of the biocooperative controller was defined as the percentage of times it matched the subjects' preferences. The highest accuracy rate was obtained for task performance (approximately 82% for both healthy subjects and patients), with psychophysiological measurements yielding relatively low accuracy (approximately 60%). The adaptive approach increased accuracy of psychophysiological measurements to 76.4% for healthy subjects and 68.8% for patients. Combining psychophysiology with task performance yielded an accuracy rate of 84.7% for healthy subjects and 89.4% for patients. Results suggest that psychophysiological measurements are not reliable as a primary data source in motor rehabilitation, but can provide supplementary information. However, it is questionable whether the amount of additional information justifies the increased complexity of the system.
Affiliation(s)
- Domen Novak
- Faculty of Electrical Engineering, University of Ljubljana, SI-1000 Ljubljana, Slovenia.
|
571
|
Lê Cao KA, Boitard S, Besse P. Sparse PLS discriminant analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC Bioinformatics 2011; 12:253. [PMID: 21693065 PMCID: PMC3133555 DOI: 10.1186/1471-2105-12-253] [Citation(s) in RCA: 601] [Impact Index Per Article: 42.9] [Received: 11/03/2010] [Accepted: 06/22/2011] [Indexed: 11/24/2022] Open
Abstract
Background: Variable selection on high-throughput biological data, such as gene expression or single nucleotide polymorphisms (SNPs), is necessary to select relevant information and, therefore, to better characterize diseases or assess genetic structure. There are different ways to perform variable selection in large data sets. Statistical tests are commonly used to identify differentially expressed features for explanatory purposes, whereas Machine Learning wrapper approaches can be used for predictive purposes. In the case of multiple highly correlated variables, another option is to use multivariate exploratory approaches to give more insight into cell biology, biological pathways or complex traits. Results: A simple extension of a sparse PLS exploratory approach is proposed to perform variable selection in a multiclass classification framework. Conclusions: sPLS-DA has a classification performance similar to other wrapper or sparse discriminant analysis approaches on public microarray and SNP data sets. More importantly, sPLS-DA is clearly competitive in terms of computational efficiency and superior in terms of interpretability of the results via valuable graphical outputs. sPLS-DA is available in the R package mixOmics, which is dedicated to the analysis of large biological data sets.
Affiliation(s)
- Kim-Anh Lê Cao
- Queensland Facility for Advanced Bioinformatics, University of Queensland, 4072 St Lucia, QLD, Australia.
|
572
|
Nie F, Xu D, Li X, Xiang S. Semisupervised Dimensionality Reduction and Classification Through Virtual Label Regression. IEEE Trans Syst Man Cybern B Cybern 2011; 41:675-85. [DOI: 10.1109/tsmcb.2010.2085433] [Citation(s) in RCA: 83] [Impact Index Per Article: 5.9] [Indexed: 11/08/2022]
|
573
|
Park SY, Liu Y. Robust penalized logistic regression with truncated loss functions. Can J Stat 2011; 39:300-323. [PMID: 22162902 DOI: 10.1002/cjs.10105] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Indexed: 11/07/2022]
Abstract
Penalized logistic regression (PLR) is a powerful statistical tool for classification and has been commonly used in many practical problems. Despite its success, the loss function of the PLR is unbounded, so the resulting classifiers can be sensitive to outliers. To build more robust classifiers, we propose the robust PLR (RPLR), which uses truncated logistic loss functions, and suggest three schemes to estimate conditional class probabilities. Connections of the RPLR with some other existing work on robust logistic regression are discussed. Our theoretical results indicate that the RPLR is Fisher consistent and more robust to outliers. Moreover, we develop estimated generalized approximate cross validation (EGACV) for the tuning parameter selection. Through numerical examples, we demonstrate that truncating the loss function indeed yields better performance in terms of classification accuracy and class probability estimation.
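The idea of bounding the logistic loss can be illustrated with a short sketch. The form used here, capping the loss at its value at a truncation point `s`, is an assumption chosen for illustration and may differ from the exact definition in the paper; the function names and the value of `s` are hypothetical.

```python
import numpy as np

def logistic_loss(u):
    """Standard logistic loss log(1 + exp(-u)) of the margin u = y * f(x)."""
    return np.logaddexp(0.0, -np.asarray(u, dtype=float))

def truncated_logistic_loss(u, s=-1.0):
    """Logistic loss capped at its value at margin s (assumed truncation form)."""
    return np.minimum(logistic_loss(u), logistic_loss(s))

# Margins ranging from a gross outlier (-10) to a well-classified point (2).
margins = np.array([-10.0, -1.0, 0.0, 2.0])
plain = logistic_loss(margins)
capped = truncated_logistic_loss(margins, s=-1.0)
```

With this cap, a badly misclassified point contributes no more than a point at margin `s`, which is what limits the influence of outliers; the loss is unchanged for margins at or above `s`.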
|
574
|
Cheng Q, Zhou H, Cheng J. The Fisher-Markov selector: fast selecting maximally separable feature subset for multiclass classification with applications to high-dimensional data. IEEE Transactions on Pattern Analysis and Machine Intelligence 2011; 33:1217-1233. [PMID: 21493968 DOI: 10.1109/tpami.2010.195] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.3] [Indexed: 05/30/2023]
Abstract
Selecting features for multiclass classification is a critically important task for pattern recognition and machine learning applications. Especially challenging is selecting an optimal subset of features from high-dimensional data, which typically have many more variables than observations and contain significant noise, missing components, or outliers. Existing methods either cannot handle high-dimensional data efficiently or scalably, or can only obtain a local optimum rather than the global optimum. Toward the efficient selection of the globally optimal subset of features, we introduce a new selector, which we call the Fisher-Markov selector, to identify those features that are the most useful in describing essential differences among the possible groups. In particular, in this paper we present a way to represent essential discriminating characteristics together with the sparsity as an optimization objective. With properly identified measures for the sparseness and discriminativeness in possibly high-dimensional settings, we take a systematic approach for optimizing the measures to choose the best feature subset. We use Markov random field optimization techniques to solve the formulated objective functions for simultaneous feature selection. Our results are noncombinatorial, and they can achieve the exact global optimum of the objective function for some special kernels. The method is fast; in particular, it can be linear in the number of features and quadratic in the number of observations. We apply our procedure to a variety of real-world data, including a mid-dimensional optical handwritten digit data set and high-dimensional microarray gene expression data sets. The effectiveness of our method is confirmed by experimental results. In pattern recognition and from a model selection viewpoint, our procedure says that it is possible to select the most discriminating subset of variables by solving a very simple unconstrained objective function which in fact can be obtained with an explicit expression.
Affiliation(s)
- Qiang Cheng
- Department of Computer Science, Faner Hall, Mailcode 4511, Southern Illinois University Carbondale, 1000 Faner Drive, Carbondale, IL 62901, USA.
|
575
|
Saintigny P, Zhang L, Fan YH, El-Naggar AK, Papadimitrakopoulou VA, Feng L, Lee JJ, Kim ES, Ki Hong W, Mao L. Gene expression profiling predicts the development of oral cancer. Cancer Prev Res (Phila) 2011; 4:218-29. [PMID: 21292635 DOI: 10.1158/1940-6207.capr-10-0155] [Citation(s) in RCA: 106] [Impact Index Per Article: 7.6] [Indexed: 11/16/2022]
Abstract
Patients with oral premalignant lesions (OPL) have a high risk of developing oral cancer. Although certain risk factors, such as smoking status and histology, are known, our ability to predict oral cancer risk remains poor. The study objective was to determine the value of gene expression profiling in predicting oral cancer development. Gene expression profiles were measured in 86 of 162 OPL patients who were enrolled in a clinical chemoprevention trial that used the incidence of oral cancer development as a prespecified endpoint. The median follow-up time was 6.08 years and 35 of the 86 patients developed oral cancer over the course. Gene expression profiles were associated with oral cancer-free survival and used to develop multivariate predictive models for oral cancer prediction. We developed a 29-transcript predictive model which showed marked improvement in terms of prediction accuracy (with an 8% prediction error rate) over the models using previously known clinicopathologic risk factors. On the basis of the gene expression profile data, we also identified 2,182 transcripts significantly associated with oral cancer risk-associated genes (P value < 0.01; univariate Cox proportional hazards model). Functional pathway analysis revealed proteasome machinery, MYC, and ribosomal components as the top gene sets associated with oral cancer risk. In multiple independent data sets, the expression profiles of the genes can differentiate head and neck cancer from normal mucosa. Our results show that gene expression profiles may improve the prediction of oral cancer risk in OPL patients and the significant genes identified may serve as potential targets for oral cancer chemoprevention.
Affiliation(s)
- Pierre Saintigny
- Department of Thoracic/Head and Neck Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, USA
|
576
|
Ashton-Chess J, Cervino AC. Development of commercial gene-expression-based signatures: review of the scientific strategies. Per Med 2011; 8:253-269. [PMID: 29783527 DOI: 10.2217/pme.10.84] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
Many scientific articles have been published that use gene-expression-based technologies to discriminate a trait of interest, typically a disease subgroup, within a patient population. However, few gene-expression-based signatures have at present reached the market and become a financially and clinically successful product. The technological, scientific and medical challenges, the regulatory environment and the financial considerations are all essential parts of the development process. Here we discuss the scientific aspects of successfully developing a gene-expression-based signature and review the global strategy of six products that made it to the market. We also present a point-to-point guide that should help researchers to successfully develop genomic signatures, thus paving the way towards personalized medicine.
Affiliation(s)
- Joanna Ashton-Chess
- TcLand Expression, Halle 13, Bio-Ouest Ile de Nantes, 21 Rue de la Noue Bras de Fer, 44200 Nantes, France
|
577
|
Fisher TJ, Sun X. Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Comput Stat Data Anal 2011. [DOI: 10.1016/j.csda.2010.12.006] [Citation(s) in RCA: 61] [Impact Index Per Article: 4.4] [Indexed: 11/26/2022]
|
578
|
Ghorai S, Mukherjee A, Sengupta S, Dutta PK. Cancer classification from gene expression data by NPPC ensemble. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:659-671. [PMID: 20479504 DOI: 10.1109/tcbb.2010.36] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.5] [Indexed: 05/29/2023]
Abstract
The most important application of microarray in gene expression analysis is to classify the unknown tissue samples according to their gene expression levels with the help of known sample expression levels. In this paper, we present a nonparallel plane proximal classifier (NPPC) ensemble that achieves higher classification accuracy on test samples in a computer-aided diagnosis (CAD) framework than a single NPPC model. For each data set, only a few genes are selected by using a mutual information criterion. Then a genetic algorithm-based simultaneous feature and model selection scheme is used to train a number of NPPC expert models in multiple subspaces by maximizing cross-validation accuracy. The members of the ensemble are selected by the performance of the trained models on a validation set. Besides the usual majority voting method, we introduce a minimum average proximity-based decision combiner for the NPPC ensemble. The effectiveness of the NPPC ensemble and the proposed new approach of combining decisions for cancer diagnosis are studied and compared with a support vector machine (SVM) classifier in a similar framework. Experimental results on cancer data sets show that the NPPC ensemble offers testing accuracy comparable to that of an SVM ensemble with reduced training time on average.
Affiliation(s)
- Santanu Ghorai
- Department of Electronics and Communication Engineering, MCKV Institute of Engineering, 243, G.T. Road (N), Liluah, Howrah.
|
579
|
Taylor SL, Kim K. A jackknife and voting classifier approach to feature selection and classification. Cancer Inform 2011; 10:133-47. [PMID: 21584263 PMCID: PMC3091410 DOI: 10.4137/cin.s7111] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 11/05/2022] Open
Abstract
With technological advances now allowing measurement of thousands of genes, proteins and metabolites, researchers are using this information to develop diagnostic and prognostic tests and discern the biological pathways underlying diseases. Often, an investigator's objective is to develop a classification rule to predict group membership of unknown samples based on a small set of features and that could ultimately be used in a clinical setting. While common classification methods such as random forest and support vector machines are effective at separating groups, they do not directly translate into a clinically-applicable classification rule based on a small number of features. We present a simple feature selection and classification method for biomarker detection that is intuitively understandable and can be directly extended for application to a clinical setting. We first use a jackknife procedure to identify important features and then, for classification, we use voting classifiers which are simple and easy to implement. We compared our method to random forest and support vector machines using three benchmark cancer 'omics datasets with different characteristics. We found our jackknife procedure and voting classifier to perform comparably to these two methods in terms of accuracy. Further, the jackknife procedure yielded stable feature sets. Voting classifiers in combination with a robust feature selection method such as our jackknife procedure offer an effective, simple and intuitive approach to feature selection and classification with a clear extension to clinical applications.
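A minimal sketch of the two ideas named in this abstract, jackknife-based feature selection followed by a voting classifier, is given below. It is an illustration under stated assumptions, not the authors' procedure: the ranking statistic (a standardized mean difference), the rule of keeping features that rank in the top `top_k` in every leave-one-out resample, and the nearest-class-mean vote are all hypothetical choices, and the data are synthetic.

```python
import numpy as np

def separation_scores(X, y):
    """Absolute standardized mean difference per feature (labels 0 and 1)."""
    a, b = X[y == 0], X[y == 1]
    pooled = np.sqrt(0.5 * (a.var(axis=0, ddof=1) + b.var(axis=0, ddof=1))) + 1e-12
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / pooled

def jackknife_select(X, y, top_k):
    """Keep features that rank in the top_k in every leave-one-out resample."""
    n = X.shape[0]
    counts = np.zeros(X.shape[1], dtype=int)
    for i in range(n):
        keep = np.arange(n) != i                     # drop sample i
        scores = separation_scores(X[keep], y[keep])
        counts[np.argsort(scores)[-top_k:]] += 1     # credit the top_k features
    return np.flatnonzero(counts == n)

def vote_predict(X_train, y_train, X_new, features):
    """Each selected feature votes for the class whose training mean is closer."""
    m0 = X_train[y_train == 0][:, features].mean(axis=0)
    m1 = X_train[y_train == 1][:, features].mean(axis=0)
    Z = X_new[:, features]
    votes_for_1 = (np.abs(Z - m1) < np.abs(Z - m0)).sum(axis=1)
    return (votes_for_1 * 2 > len(features)).astype(int)  # ties go to class 0

# Synthetic data: 40 samples, 50 features, only the first three informative.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 50))
X[y == 1, :3] += 3.0

selected = jackknife_select(X, y, top_k=3)
pred = vote_predict(X, y, X, selected)
```

Requiring a feature to survive every resample is a strict stability criterion; a softer threshold on `counts` would be an equally reasonable variant.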
Affiliation(s)
- Sandra L Taylor
- Division of Biostatistics, Department of Public Health Sciences, University of California School of Medicine, Davis, CA, USA
|
580
|
Mueller RS, Dill BD, Pan C, Belnap CP, Thomas BC, VerBerkmoes NC, Hettich RL, Banfield JF. Proteome changes in the initial bacterial colonist during ecological succession in an acid mine drainage biofilm community. Environ Microbiol 2011; 13:2279-92. [PMID: 21518216 DOI: 10.1111/j.1462-2920.2011.02486.x] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.1] [Indexed: 12/28/2022]
Abstract
Proteomes of acid mine drainage biofilms at different stages of ecological succession were examined to understand microbial responses to changing community membership. We evaluated the degree of reproducibility of the community proteomes between samples of the same growth stage and found stable and predictable protein abundance patterns across time and sampling space, allowing for a set of 50 classifier proteins to be identified for use in predicting growth stages of undefined communities. Additionally, physiological changes in the dominant species, Leptospirillum Group II, were analysed as biofilms mature. During early growth stages, this population responds to abiotic stresses related to growth on the acid mine drainage solution. Enzymes involved in protein synthesis, cell division and utilization of 1- and 2-carbon compounds were more abundant in early growth stages, suggesting rapid growth and a reorganization of metabolism during biofilm initiation. As biofilms thicken and diversify, external stresses arise from competition for dwindling resources, which may inhibit cell division of Leptospirillum Group II through the SOS response. This population also represses translation and synthesizes more complex carbohydrates and amino acids in mature biofilms. These findings provide unprecedented insight into the physiological changes that may result from competitive interactions within communities in natural environments.
Affiliation(s)
- Ryan S Mueller
- University of California, Berkeley, California 94720, USA.
|
581
|
Shen Y, Lin Z, Zhu J. Penalized Independence Rule for Testing High-Dimensional Hypotheses. Commun Stat-Theor M 2011. [DOI: 10.1080/03610926.2010.484160] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/18/2022]
|
582
|
Lee C, Nkounkou B, Huang CH. Comparison of LDA and SPRT on Clinical Dataset Classifications. Biomedical Informatics Insights 2011; 4:1-7. [PMID: 21949476 PMCID: PMC3178328 DOI: 10.4137/bii.s6935] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 12/05/2022]
Abstract
In this work, we investigate the well-known classification algorithm LDA as well as its close relative SPRT. SPRT affords many theoretical advantages over LDA. It allows specification of desired classification error rates α and β and is expected to be faster in predicting the class label of a new instance. However, SPRT is not as widely used as LDA in the pattern recognition and machine learning community. For this reason, we investigate LDA, SPRT and a modified SPRT (MSPRT) empirically using clinical datasets from Parkinson’s disease, colon cancer, and breast cancer. We make the same normality assumption as LDA and propose variants of the two SPRT algorithms based on the order in which the components of an instance are sampled. Leave-one-out cross-validation is used to assess and compare the performance of the methods. The results indicate that two variants, SPRT-ordered and MSPRT-ordered, are superior to LDA in terms of prediction accuracy. Moreover, on average SPRT-ordered and MSPRT-ordered examine fewer components than LDA before arriving at a decision. These advantages imply that SPRT-ordered and MSPRT-ordered are the preferred algorithms over LDA when the normality assumption can be justified for a dataset.
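The sequential decision rule described here can be sketched with Wald's classical SPRT thresholds. The sketch assumes two classes with independent Gaussian components and a shared per-component variance; the function name, the parameter values, and the forced decision when no threshold is crossed are illustrative assumptions rather than details taken from the paper.

```python
import math

def sprt_classify(x, mean0, mean1, var, alpha=0.05, beta=0.05):
    """Examine components one at a time; return (label, components_examined)."""
    upper = math.log((1 - beta) / alpha)   # crossing this decides class 1
    lower = math.log(beta / (1 - alpha))   # crossing this decides class 0
    llr = 0.0
    for j, (xj, m0, m1, v) in enumerate(zip(x, mean0, mean1, var), start=1):
        # Gaussian log-likelihood ratio of class 1 versus class 0, shared variance v
        llr += ((xj - m0) ** 2 - (xj - m1) ** 2) / (2 * v)
        if llr >= upper:
            return 1, j
        if llr <= lower:
            return 0, j
    # Assumption: if no threshold is crossed, decide by the sign of the ratio.
    return (1 if llr > 0 else 0), len(x)

mean0, mean1, var = [0.0] * 10, [2.0] * 10, [1.0] * 10
clear_one = sprt_classify([2.0] * 10, mean0, mean1, var)
clear_zero = sprt_classify([0.0] * 10, mean0, mean1, var)
ambiguous = sprt_classify([1.0] * 10, mean0, mean1, var)
```

In this toy setting the clear-cut instances stop after two components, while the ambiguous one runs through all ten, which illustrates why a sequential test can examine fewer components than a rule that always uses every feature.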
Affiliation(s)
- Chih Lee
- Computer Science and Engineering Department, University of Connecticut, Storrs, CT 06269, USA
|
583
|
Huang S, Tong T, Zhao H. Bias-corrected diagonal discriminant rules for high-dimensional classification. Biometrics 2011; 66:1096-106. [PMID: 20222939 DOI: 10.1111/j.1541-0420.2010.01395.x] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.4] [Indexed: 11/30/2022]
Abstract
Diagonal discriminant rules have been successfully used for high-dimensional classification problems, but suffer from the serious drawback of biased discriminant scores. In this article, we propose improved diagonal discriminant rules with bias-corrected discriminant scores for high-dimensional classification. We show that the proposed discriminant scores dominate the standard ones under the quadratic loss function. Analytical results on why the bias-corrected rules can potentially improve the prediction accuracy are also provided. Finally, we demonstrate the improvement of the proposed rules over the original ones through extensive simulation studies and real case studies.
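For orientation, the standard (uncorrected) diagonal linear discriminant score that such rules start from can be sketched as follows. This shows only the baseline rule with a pooled per-feature variance; the bias correction proposed in the paper is not reproduced, and the function names and toy data are hypothetical.

```python
import numpy as np

def dlda_fit(X, y):
    """Estimate class means, pooled per-feature variances, and class priors."""
    classes = np.unique(y)
    means = np.array([X[y == k].mean(axis=0) for k in classes])
    resid = X - means[np.searchsorted(classes, y)]       # within-class residuals
    pooled_var = (resid ** 2).sum(axis=0) / (len(y) - len(classes))
    priors = np.array([(y == k).mean() for k in classes])
    return classes, means, pooled_var, priors

def dlda_scores(X_new, means, pooled_var, priors):
    """d_k(x) = sum_j (x_j - mean_kj)^2 / s_j^2 - 2 log(pi_k); smaller is better."""
    diff = X_new[:, None, :] - means[None, :, :]          # shape (n, K, p)
    return (diff ** 2 / pooled_var).sum(axis=2) - 2 * np.log(priors)

def dlda_predict(X_new, classes, means, pooled_var, priors):
    return classes[np.argmin(dlda_scores(X_new, means, pooled_var, priors), axis=1)]

X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
y = np.array([0, 0, 1, 1])
classes, means, pooled_var, priors = dlda_fit(X, y)
labels = dlda_predict(np.array([[0.2, 0.3], [4.8, 5.1]]), classes, means, pooled_var, priors)
```

Ignoring the off-diagonal covariance terms is what keeps the rule usable when the number of features exceeds the number of samples, since no covariance matrix has to be inverted.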
Affiliation(s)
- Song Huang
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA.
|
584
|
Identification by random forest method of HLA class I amino acid substitutions associated with lower survival at day 100 in unrelated donor hematopoietic cell transplantation. Bone Marrow Transplant 2011; 47:217-26. [PMID: 21441965 PMCID: PMC3128239 DOI: 10.1038/bmt.2011.56] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.1] [Indexed: 11/26/2022]
Abstract
The identification of important amino acid substitutions associated with low survival in hematopoietic cell transplantation (HCT) is hampered by the large number of observed substitutions compared to the small number of patients available for analysis. Random forest analysis is designed to address these limitations. We studied 2,107 HCT recipients with good or intermediate risk hematologic malignancies to identify HLA class I amino acid substitutions associated with reduced survival at day 100 post-transplant. Random forest analysis and traditional univariate and multivariate analyses were used. Random forest analysis identified amino acid substitutions in 33 positions that were associated with reduced 100 day survival, including HLA-A 9, 43, 62, 63, 76, 77, 95, 97, 114, 116, 152, 156, 166, and 167; HLA-B 97, 109, 116, and 156; and HLA-C 6, 9, 11, 14, 21, 66, 77, 80, 95, 97, 99, 116, 156, 163, and 173. Thirteen had been previously reported by other investigators using classical biostatistical approaches. Using the same dataset, traditional multivariate logistic regression identified only 5 amino acid substitutions associated with lower day 100 survival. Random forest analysis is a novel statistical methodology for analysis of HLA-mismatching and outcome studies, capable of identifying important amino acid substitutions missed by other methods.
|
585
|
Derivation of cancer diagnostic and prognostic signatures from gene expression data. Bioanalysis 2011; 2:855-62. [PMID: 21083217 DOI: 10.4155/bio.10.35] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 02/02/2023] Open
Abstract
The ability to compare genome-wide expression profiles in human tissue samples has the potential to add an invaluable molecular pathology aspect to the detection and evaluation of multiple diseases. Applications include initial diagnosis, evaluation of disease subtype, monitoring of response to therapy and the prediction of disease recurrence. The derivation of molecular signatures that can predict tumor recurrence in breast cancer has been a particularly intense area of investigation and a number of studies have shown that molecular signatures can outperform currently used clinicopathologic factors in predicting relapse in this disease. However, many of these predictive models have been derived using relatively simple computational algorithms and whether these models are at a stage of development worthy of large-cohort clinical trial validation is currently a subject of debate. In this review, we focus on the derivation of optimal molecular signatures from high-dimensional data and discuss some of the expected future developments in the field.
|
586
|
Nikulin V, Huang TH, McLachlan GJ. Classification of high-dimensional microarray data with a two-step procedure via a Wilcoxon criterion and multilayer perceptron. International Journal of Computational Intelligence and Applications 2011. [DOI: 10.1142/s1469026811002969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
The method presented in this paper is novel as a natural combination of two mutually dependent steps. Feature selection is a key element (first step) in our classification system, which was employed during the 2010 International RSCTC data mining (bioinformatics) Challenge. The second step may be implemented using any suitable classifier such as linear regression, support vector machine or neural networks. We conducted leave-one-out (LOO) experiments with several feature selection techniques and classifiers. Based on the LOO evaluations, we decided to use feature selection with the separation type Wilcoxon-based criterion for all final submissions. The method presented in this paper was tested successfully during the RSCTC data mining Challenge, where we achieved the top score in the Basic track.
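A Wilcoxon-type separation criterion for the feature selection step can be sketched as follows. The score used here, the distance of the normalized Mann-Whitney U statistic from 0.5, is an assumed stand-in for the paper's criterion, and the function names and toy data are hypothetical.

```python
import numpy as np

def wilcoxon_separation(X, y):
    """Per-feature |U / (n0 * n1) - 0.5| for binary labels 0 and 1."""
    a, b = X[y == 0], X[y == 1]
    # U counts pairs (class-1 sample, class-0 sample) with the class-1 value larger;
    # tied pairs count one half.
    greater = (b[:, None, :] > a[None, :, :]).sum(axis=(0, 1))
    ties = (b[:, None, :] == a[None, :, :]).sum(axis=(0, 1))
    u = greater + 0.5 * ties
    return np.abs(u / (len(a) * len(b)) - 0.5)

def select_top(X, y, k):
    """Indices of the k features with the largest separation score."""
    return np.argsort(wilcoxon_separation(X, y))[::-1][:k]

X = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 6.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
scores = wilcoxon_separation(X, y)
best = select_top(X, y, 1)
```

Because the score depends only on ranks, it is insensitive to monotone rescaling of a feature, which is one reason rank-based criteria are attractive for expression data. The selected features would then be passed to whichever classifier is used in the second step.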
Affiliation(s)
- Tian-Hsiang Huang
- Institute of Information Management, National Cheng Kung University, Taiwan
- Geoffrey J. McLachlan
- Department of Mathematics and Institute for Molecular Bioscience, University of Queensland, Australia
|
587
|
Chakraborty S, Guo R. A Bayesian hybrid Huberized support vector machine and its applications in high-dimensional medical data. Comput Stat Data Anal 2011. [DOI: 10.1016/j.csda.2010.09.024] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 10/19/2022]
|
588
|
Tapia E, Ornella L, Bulacio P, Angelone L. Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics 2011; 12:59. [PMID: 21342522 PMCID: PMC3056725 DOI: 10.1186/1471-2105-12-59] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 07/21/2010] [Accepted: 02/22/2011] [Indexed: 01/05/2023] Open
Abstract
Background: Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. Results: A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. Conclusions: Comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.
Affiliation(s)
- Elizabeth Tapia
- CIFASIS-Conicet Institute, Bv. 27 de Febrero 210 Bis, Rosario, Argentina.
|
589
|
Yao C, Zhang M, Zou J, Li H, Wang D, Zhu J, Guo Z. Functional modules with disease discrimination abilities for various cancers. Science China Life Sciences 2011; 54:189-93. [PMID: 21318490 DOI: 10.1007/s11427-010-4129-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 02/06/2009] [Accepted: 09/22/2009] [Indexed: 12/13/2022]
Abstract
Selecting differentially expressed genes (DEGs) is one of the most important tasks in microarray applications for studying multi-factor diseases including cancers. However, the small samples typically used in current microarray studies may only partially reflect the widely altered gene expressions in complex diseases, which would introduce low reproducibility of gene lists selected by statistical methods. Here, by analyzing seven cancer datasets, we showed that, in each cancer, a wide range of functional modules have altered gene expressions and thus have high disease classification abilities. The results also showed that seven modules are shared across diverse cancers, suggesting hints about the common mechanisms of cancers. Therefore, instead of relying on a few individual genes whose selection is hardly reproducible in current microarray experiments, we may use functional modules as functional signatures to study core mechanisms of cancers and build robust diagnostic classifiers.
Affiliation(s)
- Chen Yao
- Bioinformatics Centre, School of Life Science, University of Electronic Science and Technology of China, Chengdu 610054, China
|
590
|
De Bin R, Risso D. A novel approach to the clustering of microarray data via nonparametric density estimation. BMC Bioinformatics 2011; 12:49. [PMID: 21303507 PMCID: PMC3042915 DOI: 10.1186/1471-2105-12-49] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 06/21/2010] [Accepted: 02/08/2011] [Indexed: 11/21/2022] Open
Abstract
Background: Cluster analysis is a crucial tool in several biological and medical studies dealing with microarray data. Such studies pose challenging statistical problems due to dimensionality issues, since the number of variables can be much higher than the number of observations. Results: Here, we present a general framework to deal with the clustering of microarray data, based on a three-step procedure: (i) gene filtering; (ii) dimensionality reduction; (iii) clustering of observations in the reduced space. Via a nonparametric model-based clustering approach we obtain promising results both in simulated and real data. Conclusions: The proposed algorithm is a simple and effective tool for the clustering of microarray data, in an unsupervised setting.
Affiliation(s)
- Riccardo De Bin
- Department of Statistical Sciences, University of Padova, Padova, Italy
|
591
|
Dagliyan O, Uney-Yuksektepe F, Kavakli IH, Turkay M. Optimization based tumor classification from microarray gene expression data. PLoS One 2011; 6:e14579. [PMID: 21326602 PMCID: PMC3033885 DOI: 10.1371/journal.pone.0014579] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.9] [Received: 07/21/2010] [Accepted: 12/23/2010] [Indexed: 11/20/2022] Open
Abstract
BACKGROUND: An important use of data obtained from microarray measurements is the classification of tumor types with respect to genes that are either up or down regulated in specific cancer types. A number of algorithms have been proposed to obtain such classifications. These algorithms usually require parameter optimization to obtain accurate results depending on the type of data. Additionally, it is highly critical to find an optimal set of markers among those up or down regulated genes that can be clinically utilized to build assays for the diagnosis or to follow progression of specific cancer types. In this paper, we employ a mixed integer programming based classification algorithm named hyper-box enclosure method (HBE) for the classification of some cancer types with a minimal set of predictor genes. This optimization based method, which is a user-friendly and efficient classifier, may allow clinicians to diagnose and follow progression of certain cancer types. METHODOLOGY/PRINCIPAL FINDINGS: We apply the HBE algorithm to some well known data sets such as leukemia, prostate cancer, diffuse large B-cell lymphoma (DLBCL), and small round blue cell tumors (SRBCT) to find some predictor genes that can be utilized for diagnosis and prognosis in a robust manner with a high accuracy. Our approach does not require any modification or parameter optimization for each data set. Additionally, information gain attribute evaluator, relief attribute evaluator and correlation-based feature selection methods are employed for the gene selection. The results are compared with those from other studies and biological roles of selected genes in corresponding cancer type are described. CONCLUSIONS/SIGNIFICANCE: The performance of our algorithm overall was better than the other algorithms reported in the literature and classifiers found in the WEKA data-mining package. Since it does not require parameter optimization and consistently achieves a very high prediction rate on different types of data sets, the HBE method is an effective and consistent tool for cancer type prediction with a small number of gene markers.
MESH Headings
- Algorithms
- Calibration
- Electronic Data Processing/standards
- Gene Expression Profiling/methods
- Gene Expression Profiling/standards
- Gene Expression Regulation, Neoplastic
- Humans
- Leukemia/classification
- Leukemia/diagnosis
- Leukemia/genetics
- Lymphoma, Large B-Cell, Diffuse/classification
- Lymphoma, Large B-Cell, Diffuse/diagnosis
- Lymphoma, Large B-Cell, Diffuse/genetics
- Male
- Microarray Analysis/methods
- Microarray Analysis/standards
- Models, Theoretical
- Neoplasms/classification
- Neoplasms/diagnosis
- Neoplasms/genetics
- Pattern Recognition, Automated/methods
- Pattern Recognition, Automated/standards
- Prognosis
- Prostatic Neoplasms/classification
- Prostatic Neoplasms/diagnosis
- Prostatic Neoplasms/genetics
Affiliation(s)
- Onur Dagliyan
- Department of Chemical and Biological Engineering, Koc University, Istanbul, Turkey
- I. Halil Kavakli
- Department of Chemical and Biological Engineering, Koc University, Istanbul, Turkey
- Metin Turkay
- Department of Industrial Engineering, Koc University, Istanbul, Turkey
|
592
|
Mining gene expression profiles: an integrated implementation of kernel principal component analysis and singular value decomposition. GENOMICS PROTEOMICS & BIOINFORMATICS 2011; 8:200-10. [PMID: 20970748 PMCID: PMC5054124 DOI: 10.1016/s1672-0229(10)60022-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.1] [Indexed: 11/21/2022]
Abstract
The detection of genes that show similar profiles under different experimental conditions is often an initial step in inferring the biological significance of such genes. Visualization tools are used to identify genes with similar profiles in microarray studies. Given the large number of genes recorded in microarray experiments, gene expression data are generally displayed on a low dimensional plot, based on linear methods. However, microarray data show nonlinearity, due to high-order terms of interaction between genes, so alternative approaches, such as kernel methods, may be more appropriate. We introduce a technique that combines kernel principal component analysis (KPCA) and Biplot to visualize gene expression profiles. Our approach relies on the singular value decomposition of the input matrix and incorporates an additional step that involves KPCA. The main properties of our method are the extraction of nonlinear features and the preservation of the input variables (genes) in the output display. We apply this algorithm to colon tumor, leukemia and lymphoma datasets. Our approach reveals the underlying structure of the gene expression profiles and provides a more intuitive understanding of the gene and sample association.
|
593
|
Zheng CH, Chong YW, Wang HQ. Gene selection using independent variable group analysis for tumor classification. Neural Comput Appl 2011. [DOI: 10.1007/s00521-010-0513-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/24/2022]
|
594
|
Chuang LY, Yang CH, Li JC, Yang CH. A hybrid BPSO-CGA approach for gene selection and classification of microarray data. J Comput Biol 2011; 19:68-82. [PMID: 21210743 DOI: 10.1089/cmb.2010.0064] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.9] [Indexed: 11/13/2022] Open
Abstract
Microarray analysis promises to detect variations in gene expression and changes in the transcription rates of an entire genome in vivo. Microarray gene expression profiles indicate the relative abundance of mRNA corresponding to the genes. The selection of relevant genes from microarray data poses a formidable challenge to researchers due to the high dimensionality of the features, the multiclass categories involved, and the usually small sample size. A classification process is often employed which decreases the dimensionality of the microarray data. In order to correctly analyze microarray data, the goal is to find an optimal subset of features (genes) which adequately represents the original set of features. A hybrid method of binary particle swarm optimization (BPSO) and a combat genetic algorithm (CGA) is used to perform the microarray data selection. The K-nearest neighbor (K-NN) method with leave-one-out cross-validation (LOOCV) served as a classifier. The proposed BPSO-CGA approach is tested on ten microarray data sets from the literature. The experimental results indicate that the proposed method not only effectively reduces the number of gene expression levels, but also achieves a low classification error rate.
Affiliation(s)
- Li-Yeh Chuang
- Department of Chemical Engineering, and Institute of Biotechnology and Chemical Engineering, I-Shou University Kaohsiung, Taiwan
|
595
|
|
596
|
Mao KZ, Tang W. Recursive Mahalanobis separability measure for gene subset selection. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:266-272. [PMID: 20479500 DOI: 10.1109/tcbb.2010.43] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 05/29/2023]
Abstract
The Mahalanobis class separability measure provides an effective evaluation of the discriminative power of a feature subset, and is widely used in feature selection. However, this measure is computationally intensive or even prohibitive when it is applied to gene expression data. In this study, a recursive approach to evaluating the Mahalanobis measure is proposed, with the goal of reducing computational overhead. Instead of evaluating the Mahalanobis measure directly in high-dimensional space, the recursive approach evaluates the measure through successive evaluations in 2D space. Because of its recursive nature, this approach is extremely efficient when it is combined with a forward search procedure. In addition, it is noted that gene subsets selected by the Mahalanobis measure tend to overfit training data and generalize unsatisfactorily on unseen test data, due to the small sample size in gene expression problems. To alleviate the overfitting problem, a regularized recursive Mahalanobis measure is proposed in this study, and guidelines on the determination of regularization parameters are provided. Experimental studies on five gene expression problems show that the regularized recursive Mahalanobis measure substantially outperforms the nonregularized Mahalanobis measures and the benchmark recursive feature elimination (RFE) algorithm in all five problems.
Affiliation(s)
- K Z Mao
- School of Electrical and Electronic Engineering, Block S2.1, Nanyang Technological University, Singapore 639798.
|
597
|
|
598
|
Top Scoring Pair Decision Tree for Gene Expression Data Analysis. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2011; 696:27-35. [DOI: 10.1007/978-1-4419-7046-6_3] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 01/12/2023]
|
599
|
Wang Q, Li HD, Xu QS, Liang YZ. Noise incorporated subwindow permutation analysis for informative gene selection using support vector machines. Analyst 2011; 136:1456-63. [DOI: 10.1039/c0an00667j] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Indexed: 11/21/2022]
|
600
|
Pinto da Costa JF, Alonso H, Roque L. A weighted principal component analysis and its application to gene expression data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:246-252. [PMID: 21071812 DOI: 10.1109/tcbb.2009.61] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 05/30/2023]
Abstract
In the first part of this work, we introduce new developments in Principal Component Analysis (PCA), and in the second part, a new method to select variables (genes in our application). Our focus is on problems where the values taken by each variable do not all have the same importance and where the data may be contaminated with noise and contain outliers, as is the case with microarray data. The usual PCA is not appropriate for this kind of problem. In this context, we propose the use of a new correlation coefficient as an alternative to Pearson's. This leads to a so-called weighted PCA (WPCA). In order to illustrate the features of our WPCA and compare it with the usual PCA, we consider the problem of analyzing gene expression data sets. In the second part of this work, we propose a new PCA-based algorithm to iteratively select the most important genes in a microarray data set. We show that this algorithm produces better results when our WPCA is used instead of the usual PCA. Furthermore, by using Support Vector Machines, we show that it can compete with the Significance Analysis of Microarrays algorithm.
Affiliation(s)
- Joaquim F Pinto da Costa
- Departamento de Matemática, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 687, 4169-007 Porto, Portugal.
|