751
|
Cai R, Hao Z, Yang X, Wen W. An efficient gene selection algorithm based on mutual information. Neurocomputing 2009. [DOI: 10.1016/j.neucom.2008.04.005] [Citation(s) in RCA: 60] [Impact Index Per Article: 3.8] [Indexed: 12/01/2022]
|
752
|
Lamnisos D, Griffin JE, Steel MFJ. Transdimensional Sampling Algorithms for Bayesian Variable Selection in Classification Problems With Many More Variables Than Observations. J Comput Graph Stat 2009. [DOI: 10.1198/jcgs.2009.08027] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Indexed: 11/21/2022]
|
753
|
Kodell RL, Pearce BA, Baek S, Moon H, Ahn H, Young JF, Chen JJ. A model-free ensemble method for class prediction with application to biomedical decision making. Artif Intell Med 2008; 46:267-76. [PMID: 19081231 DOI: 10.1016/j.artmed.2008.11.001] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 04/17/2008] [Revised: 10/30/2008] [Accepted: 11/03/2008] [Indexed: 11/25/2022]
Abstract
OBJECTIVE A classification algorithm that utilizes two-dimensional convex hulls of training-set samples is presented. METHODS AND MATERIAL For each pair of predictor variables, separate convex hulls of positive and negative samples in the training set are formed, and these convex hulls are used to classify test points according to a nearest-neighbor criterion. An ensemble of these two-dimensional convex-hull classifiers is formed by trimming the C(m, 2) possible classifiers derived from the m predictors to a set of classifiers comprised of only unique predictor variables. Because only two-dimensional spaces are required to be populated by training-set samples, the "curse of dimensionality" is not an issue. At the same time, the power of ensemble voting is exploited by combining the classifications of the unique two-dimensional classifiers to reach a final classification. RESULTS The algorithm is illustrated by application to three publicly available biomedical data sets with genomic predictors and is shown to have prediction accuracy that is competitive with a number of published classification procedures. CONCLUSION Because of its superior performance in terms of sensitivity and negative predictive value compared to its competitors, the convex-hull ensemble classifier demonstrates good potential for medical screening, where often the major emphasis is placed on having reliable negative predictions.
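The single-pair building block described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of Euclidean distance to the hull, and the tie rule (ties go to the positive class) are all assumptions, and the trimming and voting steps of the ensemble are not shown.

```python
from math import hypot

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        t = 0.0
    else:
        t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
        t = max(0.0, min(1.0, t))
    return hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))

def hull_distance(p, hull):
    """0.0 if p lies inside or on the hull, else distance to the nearest edge."""
    n = len(hull)
    if n == 1:
        return hypot(p[0] - hull[0][0], p[1] - hull[0][1])
    if n >= 3 and all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n)):
        return 0.0
    return min(segment_distance(p, hull[i], hull[(i + 1) % n]) for i in range(n))

def classify(p, positive_pts, negative_pts):
    """Assign p to the class whose training hull is nearer (assumed tie rule: positive)."""
    d_pos = hull_distance(p, convex_hull(positive_pts))
    d_neg = hull_distance(p, convex_hull(negative_pts))
    return "positive" if d_pos <= d_neg else "negative"
```

An ensemble in the spirit of the abstract would call `classify` once per retained pair of predictors and take a majority vote; how the pairs are trimmed is not specified here.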
Affiliation(s)
- Ralph L Kodell
- Department of Biostatistics, University of Arkansas for Medical Sciences, Little Rock, 72205, United States.
|
754
|
Meta-analysis of gene expression profiles related to relapse-free survival in 1,079 breast cancer patients. Breast Cancer Res Treat 2008; 118:433-41. [PMID: 19052860 DOI: 10.1007/s10549-008-0242-8] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.3] [Received: 08/08/2008] [Accepted: 10/28/2008] [Indexed: 10/21/2022]
Abstract
The transcriptome of breast cancers has been extensively screened with microarrays, and large sets of genes associated with clinical features have been established. The aim of this study was to validate original gene sets on a large cohort of raw breast cancer microarray data with known clinical follow-up. We recovered 20 publications and matched them to Affymetrix HGU133A annotations. Raw Affymetrix HGU133A microarray data were extracted from GEO and MAS5 normalized. For classifying patients using the selected gene sets, we applied prediction analysis of microarrays and constructed Kaplan-Meier plots. A new classification including all patients was generated using supervised principal components analysis. Seven studies including 1,470 patients were downloaded from GEO. Notably, we uncovered 641 microarrays representing 251 individual tumor specimens among them, which were repeatedly described under independent GEO identifiers. We excluded all redundant data and used the remaining 1,079 samples. Eight of the 20 gene sets were able to predict response at a significance of P < 0.05. The discrimination of good and poor prognosis groups exclusively relying on gene expression data resulted in high significance (P = 1.8E-12). A model including genes fitted by both gene expression and clinical covariates (lymph node status and grade) contains 44 genes and can predict response at P = 9.5E-7. The outcome provides a ranking of the gene lists regarding applicability on an independent dataset. We established a consensus predictor combining the available clinical and gene expression data. The database comprising expression profiles of 1,079 breast cancers can be used to classify individual patients.
|
755
|
|
756
|
Schwender H, Ickstadt K, Rahnenführer J. Classification with High-Dimensional Genetic Data: Assigning Patients and Genetic Features to Known Classes. Biom J 2008; 50:911-26. [DOI: 10.1002/bimj.200810475] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 01/05/2023]
|
757
|
Sparse optimal scoring for multiclass cancer diagnosis and biomarker detection using microarray data. Comput Biol Chem 2008; 32:417-25. [DOI: 10.1016/j.compbiolchem.2008.07.015] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Received: 04/17/2007] [Revised: 03/23/2008] [Accepted: 07/06/2008] [Indexed: 11/22/2022]
|
758
|
Dobbin KK. A method for constructing a confidence bound for the actual error rate of a prediction rule in high dimensions. Biostatistics 2008; 10:282-96. [PMID: 19039030 DOI: 10.1093/biostatistics/kxn035] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022] Open
Abstract
Constructing a confidence interval for the actual, conditional error rate of a prediction rule from multivariate data is problematic because this error rate is not a population parameter in the traditional sense--it is a functional of the training set. When the training set changes, so does this "parameter." A valid method for constructing confidence intervals for the actual error rate had been previously developed by McLachlan. However, McLachlan's method cannot be applied in many cancer research settings because it requires the number of samples to be much larger than the number of dimensions (n >> p), and it assumes that no dimension-reducing feature selection step is performed. Here, an alternative to McLachlan's method is presented that can be applied when p >> n, with an additional adjustment in the presence of feature selection. Coverage probabilities of the new method are shown to be nominal or conservative over a wide range of scenarios. The new method is relatively simple to implement and not computationally burdensome.
Affiliation(s)
- Kevin K Dobbin
- National Cancer Institute, 6130 Executive Boulevard, EPN Room 8124, Rockville, MD 20892, USA.
|
759
|
Abstract
Many cancer treatments benefit only a minority of patients who receive them. This results in an enormous burden on patients and on the health care system. The problem will become even greater with the increasing use of molecularly targeted agents whose benefits are likely to be more selective unless the drug development process is modified to include co-development of companion diagnostics. Whole genome biotechnology and decreasing costs of genome sequencing make it increasingly possible to achieve an era of predictive medicine in oncology therapeutics. The challenges are numerous and substantial but are not primarily technological. They involve organizing publicly funded diagnostics of deregulated pathways, adopting new paradigms for drug development, and developing incentives for industry to incur the complexity and expense of co-development of drugs and companion diagnostics. This article reviews some designs for phase III clinical trials that may facilitate movement to a more predictive oncology.
Affiliation(s)
- Richard Simon
- Biometric Research Branch, National Cancer Institute, 9000 Rockville Pike, Bethesda, MD 20892, USA.
|
760
|
Lee YJ, Chang CC, Chao CH. Incremental forward feature selection with application to microarray gene expression data. J Biopharm Stat 2008; 18:827-40. [PMID: 18781519 DOI: 10.1080/10543400802277868] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 10/21/2022]
Abstract
In this study, the authors propose a new feature selection scheme, incremental forward feature selection, which is inspired by incremental reduced support vector machines. In their method, a new feature is added to the currently selected feature subset if it brings in the most extra information. This information is measured by the distance between the new feature vector and the column space spanned by the current feature subset. The incremental forward feature selection scheme can exclude highly linearly correlated features that provide redundant information and might degrade the efficiency of learning algorithms. The method is compared with the weight score approach and the 1-norm support vector machine on two well-known microarray gene expression data sets, the acute leukemia and colon cancer data sets. These two data sets have very few observations but a huge number of genes. The linear smooth support vector machine was applied to the feature subsets selected by each of these three schemes and obtained slightly better classification results with the 1-norm support vector machine and incremental forward feature selection. Finally, the authors claim that the remaining genes still contain some useful information. The previously selected features are iteratively removed from the data sets, and the feature selection and classification steps are repeated for four rounds. The results show that there are many distinct feature subsets that can provide enough information for classification tasks in these two microarray gene expression data sets.
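The selection criterion in this abstract (distance from a candidate feature vector to the column space of the features already chosen) can be illustrated with a small Gram-Schmidt sketch. The function names, the stopping tolerance `tol`, and the choice to start from the feature with the largest norm are assumptions made for illustration; the paper's actual procedure may differ.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def residual(vec, basis):
    """Part of vec orthogonal to the span of an orthonormal basis."""
    r = list(vec)
    for q in basis:
        c = dot(r, q)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return r

def incremental_forward_selection(features, k, tol=1e-9):
    """Greedily pick up to k features, each time taking the candidate that lies
    farthest from the column space of the features selected so far.

    features: dict mapping a feature name to its values across samples.
    tol: hypothetical threshold below which a candidate counts as redundant.
    """
    selected, basis = [], []
    remaining = dict(features)
    while remaining and len(selected) < k:
        best_name, best_res, best_dist = None, None, tol
        for name, vec in remaining.items():
            r = residual(vec, basis)
            d = sqrt(dot(r, r))
            if d > best_dist:
                best_name, best_res, best_dist = name, r, d
        if best_name is None:
            break  # every remaining feature is (numerically) in the span
        selected.append(best_name)
        basis.append([x / best_dist for x in best_res])  # extend orthonormal basis
        del remaining[best_name]
    return selected
```

With features `g1 = [1, 0, 0]`, `g2 = [2, 0, 0]` and `g3 = [0, 1, 0]`, the sketch picks `g2` and `g3` and skips `g1`, which is a scalar multiple of `g2` and so adds no new direction.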
Affiliation(s)
- Yuh-Jye Lee
- Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan.
|
761
|
Baek S, Moon H, Ahn H, Kodell RL, Lin CJ, Chen JJ. Identifying high-dimensional biomarkers for personalized medicine via variable importance ranking. J Biopharm Stat 2008; 18:853-68. [PMID: 18781521 DOI: 10.1080/10543400802278023] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Abstract
We apply robust classification algorithms to high-dimensional genomic data to find biomarkers, by analyzing variable importance, that enable a better diagnosis of disease, an earlier intervention, or a more effective assignment of therapies. The goal is to use variable importance ranking to isolate a set of important genes that can be used to classify life-threatening diseases with respect to prognosis or type, in order to maximize efficacy or minimize toxicity in personalized treatment of such diseases. A ranking method is proposed, several other methods for selecting a set of important genes to use as genomic biomarkers are presented, and the performance of the selection procedures in patient classification is evaluated by cross-validation. The various selection algorithms are applied to published high-dimensional genomic data sets using several well-known classification methods. For each data set, the set of genes selected on the basis of variable importance that performed best in classification is reported. The classification algorithm with the proposed ranking method is shown to be competitive with other selection methods for discovering genomic biomarkers underlying both adverse and efficacious outcomes for improving individualized treatment of patients for life-threatening diseases.
Affiliation(s)
- Songjoon Baek
- Division of Personalized Nutrition and Medicine-Biometry Branch, National Center for Toxicological Research, FDA, Jefferson, Arkansas, USA
|
762
|
Gormley M, Tozeren A. Expression profiles of switch-like genes accurately classify tissue and infectious disease phenotypes in model-based classification. BMC Bioinformatics 2008; 9:486. [PMID: 19014681 PMCID: PMC2620272 DOI: 10.1186/1471-2105-9-486] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 06/10/2008] [Accepted: 11/17/2008] [Indexed: 12/16/2022] Open
Abstract
Background Large-scale compilation of gene expression microarray datasets across diverse biological phenotypes provided a means of gathering a priori knowledge in the form of identification and annotation of bimodal genes in the human and mouse genomes. These switch-like genes consist of 15% of known human genes, and are enriched with genes coding for extracellular and membrane proteins. It is of interest to determine the prediction potential of bimodal genes for class discovery in large-scale datasets. Results Use of a model-based clustering algorithm accurately classified more than 400 microarray samples into 19 different tissue types on the basis of bimodal gene expression. Bimodal expression patterns were also highly effective in differentiating between infectious diseases in model-based clustering of microarray data. Supervised classification with feature selection restricted to switch-like genes also recognized tissue specific and infectious disease specific signatures in independent test datasets reserved for validation. Determination of "on" and "off" states of switch-like genes in various tissues and diseases allowed for the identification of activated/deactivated pathways. Activated switch-like genes in neural, skeletal muscle and cardiac muscle tissue tend to have tissue-specific roles. A majority of activated genes in infectious disease are involved in processes related to the immune response. Conclusion Switch-like bimodal gene sets capture genome-wide signatures from microarray data in health and infectious disease. A subset of bimodal genes coding for extracellular and membrane proteins are associated with tissue specificity, indicating a potential role for them as biomarkers provided that expression is altered in the onset of disease. Furthermore, we provide evidence that bimodal genes are involved in temporally and spatially active mechanisms including tissue-specific functions and response of the immune system to invading pathogens.
Affiliation(s)
- Michael Gormley
- School of Biomedical Engineering, Drexel University, Philadelphia, PA, USA.
|
763
|
Yang H, Harrington CA, Vartanian K, Coldren CD, Hall R, Churchill GA. Randomization in laboratory procedure is key to obtaining reproducible microarray results. PLoS One 2008; 3:e3724. [PMID: 19009020 PMCID: PMC2579585 DOI: 10.1371/journal.pone.0003724] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Received: 09/16/2008] [Accepted: 10/26/2008] [Indexed: 12/04/2022] Open
Abstract
The quality of gene expression microarray data has improved dramatically since the first arrays were introduced in the late 1990s. However, the reproducibility of data generated at multiple laboratory sites remains a matter of concern, especially for scientists who are attempting to combine and analyze data from public repositories. We have carried out a study in which a common set of RNA samples was assayed five times in four different laboratories using Affymetrix GeneChip arrays. We observed dramatic differences in the results across laboratories and identified batch effects in array processing as one of the primary causes for these differences. When batch processing of samples is confounded with experimental factors of interest it is not possible to separate their effects, and lists of differentially expressed genes may include many artifacts. This study demonstrates the substantial impact of sample processing on microarray analysis results and underscores the need for randomization in the laboratory as a means to avoid confounding of biological factors with procedural effects.
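One way to avoid confounding processing batch with an experimental factor, as this abstract recommends, is to randomize samples to batches within each experimental group. The sketch below is only an illustration under assumptions: the function name, the stratified (blocked) design, and the fixed seed are choices made here, not a procedure taken from the paper.

```python
import random

def randomize_batches(sample_groups, n_batches, seed=0):
    """Assign samples to processing batches at random, stratified by group.

    sample_groups: dict mapping sample id -> experimental group label.
    Returns a dict mapping sample id -> batch index in range(n_batches).
    Shuffling within each group and dealing samples out in turn keeps every
    group spread as evenly as possible across batches.
    """
    rng = random.Random(seed)  # fixed seed so the layout can be reproduced
    by_group = {}
    for sample, group in sample_groups.items():
        by_group.setdefault(group, []).append(sample)
    assignment = {}
    offset = 0
    for group in sorted(by_group):
        members = by_group[group]
        rng.shuffle(members)
        for i, sample in enumerate(members):
            assignment[sample] = (offset + i) % n_batches
        offset += len(members)
    return assignment
```

For example, eight samples (four treated, four control) dealt into two batches end up with two treated and two control samples in each batch, so a batch effect cannot masquerade as a treatment effect.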
Affiliation(s)
- Hyuna Yang
- The Jackson Laboratory, Bar Harbor, Maine, United States of America
- Christina A. Harrington
- Gene Microarray Shared Resource, OHSU Cancer Institute, Oregon Health and Science University, Portland, Oregon, United States of America
- Kristina Vartanian
- Gene Microarray Shared Resource, OHSU Cancer Institute, Oregon Health and Science University, Portland, Oregon, United States of America
- Christopher D. Coldren
- Pulmonary Sciences and Critical Care Medicine University of Colorado Health Sciences Center, Denver, Colorado, United States of America
- Rob Hall
- Center for Array Technologies, Department of Microbiology, University of Washington, Seattle, Washington, United States of America
- Gary A. Churchill
- The Jackson Laboratory, Bar Harbor, Maine, United States of America
|
764
|
Zhang HH. Discussion of "Sure Independence Screening for Ultra-High Dimensional Feature Space". J R Stat Soc Series B Stat Methodol 2008; 70:903. [PMID: 19603084 DOI: 10.1111/j.1467-9868.2008.00674.x] [Citation(s) in RCA: 1166] [Impact Index Per Article: 68.6] [Indexed: 11/27/2022]
Affiliation(s)
- Hao Helen Zhang
- Campus Box 8203, North Carolina State University, Raleigh, NC 27695-8203, U.S.A
|
765
|
Rizzi F, Belloni L, Crafa P, Lazzaretti M, Remondini D, Ferretti S, Cortellini P, Corti A, Bettuzzi S. A novel gene signature for molecular diagnosis of human prostate cancer by RT-qPCR. PLoS One 2008; 3:e3617. [PMID: 18974881 PMCID: PMC2570792 DOI: 10.1371/journal.pone.0003617] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.3] [Received: 06/03/2008] [Accepted: 10/02/2008] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Prostate cancer (CaP) is one of the most relevant causes of cancer death in Western countries. Although detection of CaP at an early, curable stage is highly desirable, current screening methods have limitations and new molecular approaches are needed. Gene expression analysis increases our knowledge about the biology of CaP and may yield novel molecular tools, but the identification of accurate biomarkers for reliable molecular diagnosis is a real challenge. We describe here the diagnostic power of a novel 8-gene signature: ornithine decarboxylase (ODC), ornithine decarboxylase antizyme (OAZ), adenosylmethionine decarboxylase (AdoMetDC), spermidine/spermine N(1)-acetyltransferase (SSAT), histone H3 (H3), growth arrest specific gene (GAS1), glyceraldehyde 3-phosphate dehydrogenase (GAPDH) and Clusterin (CLU) in tumour detection/classification of human CaP. METHODOLOGY/PRINCIPAL FINDINGS The 8-gene signature was detected by retrotranscription real-time quantitative PCR (RT-qPCR) in frozen prostate surgical specimens obtained from 41 patients diagnosed with CaP and recommended to undergo radical prostatectomy (RP). No therapy was given to patients at any time before RP. The bio-bank used for the study consisted of 66 specimens: 44 were benign-CaP paired from the same patient. Thirty-five were classified as benign and 31 as CaP after final pathological examination. Only molecular data were used for classification of specimens. The Nearest Neighbour (NN) classifier was used to discriminate CaP from benign tissue. Final results were validated with a 10-fold cross-validation procedure. CaP versus benign specimens were discriminated with (80 ± 5)% accuracy, (81 ± 6)% sensitivity and (78 ± 7)% specificity. The method also correctly classified 71% of patients with Gleason score <7 versus ≥7, an important predictor of final outcome.
CONCLUSIONS/SIGNIFICANCE The method showed high sensitivity in a collection of specimens in which a significant portion of the total (13/31, equal to 42%) was considered CaP on the basis of having less than 15% of cancer cells. This result supports the notion of the "cancer field effect", in which transformed cells extend beyond morphologically evident tumour. The molecular diagnosis method described here is objective and less subject to human error. Although further confirmation is needed, this method has the potential to enhance conventional diagnosis.
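The evaluation protocol named in this abstract (a nearest-neighbour classifier scored by 10-fold cross-validation, reporting accuracy, sensitivity and specificity) can be sketched as below. Everything specific here is an assumption for illustration: the Euclidean metric, the fold construction, the seed, and the label strings. The sketch also assumes both classes are present, otherwise the ratios are undefined.

```python
import random
from math import dist

def nearest_neighbour(train, x):
    """train: list of (features, label); returns the label of the closest point."""
    return min(train, key=lambda item: dist(item[0], x))[1]

def cross_validate(data, k=10, seed=0, positive="CaP"):
    """k-fold cross-validation of a 1-nearest-neighbour classifier.

    data: list of (features, label) pairs; `positive` names the disease class.
    Returns pooled accuracy, sensitivity and specificity over all folds.
    """
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal, disjoint folds
    tp = tn = fp = fn = 0
    for fold in folds:
        held_out = set(fold)
        train = [data[i] for i in idx if i not in held_out]
        for i in fold:
            features, truth = data[i]
            pred = nearest_neighbour(train, features)
            if truth == positive:
                tp += pred == positive
                fn += pred != positive
            else:
                tn += pred != positive
                fp += pred == positive
    return {
        "accuracy": (tp + tn) / len(data),
        "sensitivity": tp / (tp + fn),  # true positives among all positives
        "specificity": tn / (tn + fp),  # true negatives among all negatives
    }
```

In this sketch each specimen would be a tuple of the eight expression values paired with its pathology label; the abstract's ± figures presumably come from repeating or summarising over folds, which is not reproduced here.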
Affiliation(s)
- Federica Rizzi
- Department of Medicina Sperimentale, University of Parma, Parma, Italy
- Istituto Nazionale Biostrutture e Biosistemi (I.N.B.B.), Roma, Italy
- Lucia Belloni
- Department of Medicina Sperimentale, University of Parma, Parma, Italy
- Istituto Nazionale Biostrutture e Biosistemi (I.N.B.B.), Roma, Italy
- Pellegrino Crafa
- Department of Patologia e Medicina di laboratorio, University of Parma, Parma, Italy
- Mirca Lazzaretti
- Department of Patologia e Medicina di laboratorio, University of Parma, Parma, Italy
- Stefania Ferretti
- Urology Operative Unit, Azienda Ospedaliera-Universitaria of Parma, Parma, Italy
- Piero Cortellini
- Urology Operative Unit, Azienda Ospedaliera-Universitaria of Parma, Parma, Italy
- Arnaldo Corti
- Department of Scienze Biomediche, University of Modena, Modena, Italy
- Saverio Bettuzzi
- Department of Medicina Sperimentale, University of Parma, Parma, Italy
- Istituto Nazionale Biostrutture e Biosistemi (I.N.B.B.), Roma, Italy
|
766
|
Affiliation(s)
- Karla V Ballman
- Division of Biostatistics, Mayo Clinic, Rochester, MN 55905, USA.
|
767
|
Valentini G, Tagliaferri R, Masulli F. Computational intelligence and machine learning in bioinformatics. Artif Intell Med 2008; 45:91-6. [PMID: 18929473 DOI: 10.1016/j.artmed.2008.08.014] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 12/27/2022]
|
768
|
Tsai YS, Lin CT, Tseng GC, Chung IF, Pal NR. Discovery of dominant and dormant genes from expression data using a novel generalization of SNR for multi-class problems. BMC Bioinformatics 2008; 9:425. [PMID: 18842155 PMCID: PMC2620271 DOI: 10.1186/1471-2105-9-425] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 01/16/2008] [Accepted: 10/09/2008] [Indexed: 12/14/2022] Open
Abstract
Background The Signal-to-Noise-Ratio (SNR) is often used for identification of biomarkers for two-class problems and no formal and useful generalization of SNR is available for multiclass problems. We propose innovative generalizations of SNR for multiclass cancer discrimination through introduction of two indices, Gene Dominant Index and Gene Dormant Index (GDIs). These two indices lead to the concepts of dominant and dormant genes with biological significance. We use these indices to develop methodologies for discovery of dominant and dormant biomarkers with interesting biological significance. The dominancy and dormancy of the identified biomarkers and their excellent discriminating power are also demonstrated pictorially using the scatterplot of individual gene and 2-D Sammon's projection of the selected set of genes. Using information from the literature we have shown that the GDI based method can identify dominant and dormant genes that play significant roles in cancer biology. These biomarkers are also used to design diagnostic prediction systems. Results and discussion To evaluate the effectiveness of the GDIs, we have used four multiclass cancer data sets (Small Round Blue Cell Tumors, Leukemia, Central Nervous System Tumors, and Lung Cancer). For each data set we demonstrate that the new indices can find biologically meaningful genes that can act as biomarkers. We then use six machine learning tools, Nearest Neighbor Classifier (NNC), Nearest Mean Classifier (NMC), Support Vector Machine (SVM) classifier with linear kernel, and SVM classifier with Gaussian kernel, where both SVMs are used in conjunction with one-vs-all (OVA) and one-vs-one (OVO) strategies. We found GDIs to be very effective in identifying biomarkers with strong class specific signatures. 
With all six tools and for all data sets we could achieve better or comparable prediction accuracies usually with fewer marker genes than results reported in the literature using the same computational protocols. The dominant genes are usually easy to find while good dormant genes may not always be available as dormant genes require stronger constraints to be satisfied; but when they are available, they can be used for authentication of diagnosis. Conclusion Since GDI based schemes can find a small set of dominant/dormant biomarkers that is adequate to design diagnostic prediction systems, it opens up the possibility of using real-time qPCR assays or antibody based methods such as ELISA for an easy and low cost diagnosis of diseases. The dominant and dormant genes found by GDIs can be used in different ways to design more reliable diagnostic prediction systems.
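For context, the two-class signal-to-noise ratio that this abstract sets out to generalize is commonly computed as the difference of class means divided by the sum of class standard deviations. The sketch below shows only that two-class baseline; it does not implement the Gene Dominant or Gene Dormant Index, whose definitions are not given here. The function names and the use of the sample (n - 1) standard deviation are assumptions.

```python
from statistics import mean, stdev

def snr(class_a, class_b):
    """Two-class signal-to-noise ratio: (mean_a - mean_b) / (sd_a + sd_b).

    Uses the sample standard deviation; each class needs at least two values
    and the two standard deviations must not both be zero.
    """
    return (mean(class_a) - mean(class_b)) / (stdev(class_a) + stdev(class_b))

def rank_genes(expression, labels, a, b):
    """Order gene names by |SNR| between classes a and b, strongest first.

    expression: dict mapping gene name -> expression values, one per sample.
    labels: class label of each sample, in the same order as the values.
    """
    scores = {}
    for gene, values in expression.items():
        in_a = [v for v, lab in zip(values, labels) if lab == a]
        in_b = [v for v, lab in zip(values, labels) if lab == b]
        scores[gene] = snr(in_a, in_b)
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)
```

A gene with values 5, 6, 7 in one class and 1, 2, 3 in the other has means 6 and 2 and standard deviations of 1 each, giving an SNR of (6 - 2) / (1 + 1) = 2.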
Affiliation(s)
- Yu-Shuen Tsai
- Institute of Biomedical Informatics, National Yang-Ming University, Taipei, Taiwan.
|
769
|
Kubokawa T, Srivastava MS. Estimation of the precision matrix of a singular Wishart distribution and its application in high-dimensional data. J Multivar Anal 2008. [DOI: 10.1016/j.jmva.2008.01.016] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Indexed: 10/22/2022]
|
770
|
Gadgil M. A Population Proportion approach for ranking differentially expressed genes. BMC Bioinformatics 2008; 9:380. [PMID: 18801167 PMCID: PMC2566584 DOI: 10.1186/1471-2105-9-380] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 06/12/2008] [Accepted: 09/18/2008] [Indexed: 11/14/2022] Open
Abstract
Background DNA microarrays are used to investigate differences in gene expression between two or more classes of samples. Most currently used approaches compare mean expression levels between classes and are not geared to find genes whose expression is significantly different in only a subset of samples in a class. However, biological variability can lead to situations where key genes are differentially expressed in only a subset of samples. To facilitate the identification of such genes, a new method is reported. Methods The key difference between the Population Proportion Ranking Method (PPRM) presented here and almost all other methods currently used is in the quantification of variability. PPRM quantifies variability in terms of inter-sample ratios and can be used to calculate the relative merit of differentially expressed genes with a specified difference in expression level between at least some samples in the two classes, which at the same time have lower than a specified variability within each class. Results PPRM is tested on simulated data and on three publicly available cancer data sets. It is compared to the t test, PPST, COPA, OS, ORT and MOST using the simulated data. Under the conditions tested, it performs as well or better than the other methods tested under low intra-class variability and better than t test, PPST, COPA and OS when a gene is differentially expressed in only a subset of samples. It performs better than ORT and MOST in recognizing non differentially expressed genes with high variability in expression levels across all samples. For biological data, the success of predictor genes identified in appropriately classifying an independent sample is reported.
Affiliation(s)
- Mugdha Gadgil
- Chemical Engineering and Process Development, National Chemical Laboratory, Pune, India.
|
771
|
Li GZ, Bu HL, Yang MQ, Zeng XQ, Yang JY. Selecting subsets of newly extracted features from PCA and PLS in microarray data analysis. BMC Genomics 2008; 9 Suppl 2:S24. [PMID: 18831790 PMCID: PMC2559889 DOI: 10.1186/1471-2164-9-s2-s24] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Indexed: 11/11/2022] Open
Abstract
Background Dimension reduction is a critical issue in the analysis of microarray data, because the high dimensionality of gene expression microarray data sets hurts the generalization performance of classifiers. It consists of two types of methods, i.e. feature selection and feature extraction. Principal component analysis (PCA) and partial least squares (PLS) are two frequently used feature extraction methods, and in previous works, the top several components of PCA or PLS are selected for modeling according to the descending order of eigenvalues. In this paper, by contrast, we prove that not all the top features are useful, and that features should instead be selected from all the components by feature selection methods. Results We demonstrate a framework for selecting feature subsets from all the newly extracted components, leading to reduced classification error rates on the gene expression microarray data. Here we have considered both an unsupervised method, PCA, and a supervised method, PLS, for extracting new components, genetic algorithms for feature selection, and support vector machines and k nearest neighbor for classification. Experimental results illustrate that our proposed framework is effective at selecting feature subsets and reducing classification error rates. Conclusion It is not only the top features newly extracted by PCA or PLS that are important; therefore, feature selection should be performed to select subsets from the new features to improve the generalization performance of classifiers.
Affiliation(s)
- Guo-Zheng Li
- Department of Control Science & Engineering, Tongji University, Shanghai 201804, PR China.
|
772
|
Abstract
BACKGROUND Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims at finding a set of genes that best discriminates biological samples of different types. In this paper, we present a two-stage selection algorithm that combines ReliefF and mRMR: in the first stage, ReliefF is applied to find a candidate gene set; in the second stage, the mRMR method is applied to directly and explicitly reduce redundancy, selecting a compact yet effective gene subset from the candidate set. RESULTS We perform comprehensive experiments to compare the mRMR-ReliefF selection algorithm with ReliefF, mRMR and other feature selection methods, using two classifiers (SVM and Naive Bayes) on seven different datasets. We also provide all source code and datasets for sharing with others. CONCLUSION The experimental results show that the mRMR-ReliefF gene selection algorithm is very effective.
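The two-stage structure described in this abstract (a relevance filter that shortlists candidates, followed by a greedy step that penalizes redundancy) can be sketched with simplified stand-ins. This is not the authors' mRMR-ReliefF code: stage 1 below is the basic binary Relief with one nearest hit and one nearest miss rather than ReliefF, and stage 2 uses absolute Pearson correlation in place of mutual information. All function names and parameters are hypothetical.

```python
from math import sqrt

def pearson(u, v):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return 0.0 if su == 0 or sv == 0 else cov / (su * sv)

def relief_scores(X, y):
    """Basic Relief weights for binary labels (each class needs >= 2 samples)."""
    m = len(X[0])
    spans = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(m)]

    def distance(a, b):
        return sum(abs(a[j] - b[j]) / spans[j] for j in range(m))

    w = [0.0] * m
    for i, xi in enumerate(X):
        hits = [x for k, x in enumerate(X) if k != i and y[k] == y[i]]
        misses = [x for k, x in enumerate(X) if y[k] != y[i]]
        hit = min(hits, key=lambda x: distance(xi, x))
        miss = min(misses, key=lambda x: distance(xi, x))
        for j in range(m):
            # Reward features that separate classes, penalize within-class spread.
            w[j] += (abs(xi[j] - miss[j]) - abs(xi[j] - hit[j])) / spans[j]
    return [wj / len(X) for wj in w]

def two_stage_select(X, y, n_candidates, n_final):
    """Stage 1: keep the top Relief-scored features. Stage 2: greedily add the
    candidate with the best relevance minus mean redundancy to those chosen."""
    scores = relief_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    candidates = ranked[:n_candidates]
    cols = {j: [row[j] for row in X] for j in candidates}
    relevance = {j: abs(pearson(cols[j], y)) for j in candidates}
    selected = []
    while candidates and len(selected) < n_final:
        def criterion(j):
            if not selected:
                return relevance[j]
            redundancy = sum(abs(pearson(cols[j], cols[s])) for s in selected)
            return relevance[j] - redundancy / len(selected)
        best = max(candidates, key=criterion)
        selected.append(best)
        candidates.remove(best)
    return selected
```

On a toy matrix where the second column duplicates the first, the second stage is what keeps the duplicate out: its redundancy with the already chosen column cancels its relevance, so a less correlated column is preferred.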
Affiliation(s)
- Yi Zhang
- School of Computer Science, Florida International University, 11200 SW 8th Street, Miami, FL, 33199, USA
- Chris Ding
- Department of Computer Science and Engineering, University of Texas at Arlington, 416 Yates Street, Arlington, TX, 76019, USA
- Tao Li
- School of Computer Science, Florida International University, 11200 SW 8th Street, Miami, FL, 33199, USA
|
773
|
Kianmehr K, Alhajj R. CARSVM: A class association rule-based classification framework and its application to gene expression data. Artif Intell Med 2008; 44:7-25. [DOI: 10.1016/j.artmed.2008.05.002] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 12/12/2007] [Revised: 05/10/2008] [Accepted: 05/13/2008] [Indexed: 12/01/2022]
|
774
|
Prasad NB, Somervell H, Tufano RP, Dackiw APB, Marohn MR, Califano JA, Wang Y, Westra WH, Clark DP, Umbricht CB, Libutti SK, Zeiger MA. Identification of genes differentially expressed in benign versus malignant thyroid tumors. Clin Cancer Res 2008; 14:3327-37. [PMID: 18519760 DOI: 10.1158/1078-0432.ccr-07-4495] [Citation(s) in RCA: 68] [Impact Index Per Article: 4.0] [Indexed: 11/16/2022]
Abstract
PURPOSE Although fine-needle aspiration biopsy is the most useful diagnostic tool in evaluating a thyroid nodule, preoperative diagnosis of thyroid nodules is frequently imprecise, with up to 30% of fine-needle aspiration biopsy cytology samples reported as "suspicious" or "indeterminate." Therefore, other adjuncts, such as molecular-based diagnostic approaches are needed in the preoperative distinction of these lesions. EXPERIMENTAL DESIGN In an attempt to identify diagnostic markers for the preoperative distinction of these lesions, we chose to study by microarray analysis the eight different thyroid tumor subtypes that can present a diagnostic challenge to the clinician. RESULTS Our microarray-based analysis of 94 thyroid tumors identified 75 genes that are differentially expressed between benign and malignant tumor subtypes. Of these, 33 were overexpressed and 42 were underexpressed in malignant compared with benign thyroid tumors. Statistical analysis of these genes, using nearest-neighbor classification, showed a 73% sensitivity and 82% specificity in predicting malignancy. Real-time reverse transcription-PCR validation for 12 of these genes was confirmatory. Western blot and immunohistochemical analyses of one of the genes, high mobility group AT-hook 2, further validated the microarray and real-time reverse transcription-PCR data. CONCLUSIONS Our results suggest that these 12 genes could be useful in the development of a panel of markers to differentiate benign from malignant tumors and thus serve as an important first step in solving the clinical problem associated with suspicious thyroid lesions.
Affiliation(s)
- Nijaguna B Prasad
- Department of Surgery, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
|
775
|
Wray NR, Goddard ME, Visscher PM. Prediction of individual genetic risk of complex disease. Curr Opin Genet Dev 2008; 18:257-63. [PMID: 18682292 DOI: 10.1016/j.gde.2008.07.006] [Citation(s) in RCA: 134] [Impact Index Per Article: 7.9] [Received: 02/25/2008] [Accepted: 07/08/2008] [Indexed: 01/21/2023]
Abstract
Most common diseases are caused by multiple genetic and environmental factors. In the last 2 years, genome-wide association studies (GWAS) have identified polymorphisms that are associated with risk to common disease, but the effect of any one risk allele is typically small. By combining information from many risk variants, will it be possible to predict accurately each individual person's genetic risk for a disease? In this review we consider the lessons from GWAS and the implications for genetic risk prediction to common disease. We conclude that with larger GWAS sample sizes or by combining studies, accurate prediction of genetic risk will be possible, even if the causal mutations or the mechanisms by which they affect susceptibility are unknown.
Affiliation(s)
- Naomi R Wray
- Genetic Epidemiology and Queensland Statistical Genetics, Queensland Institute of Medical Research, Brisbane, Australia.
|
776
|
Abstract
In this paper, we propose and evaluate methodologies for the classification of images from thin-layer chromatography. Each individual sample is characterized by an intensity profile that is further represented in a feature space. The first steps of this process aim at obtaining a robust estimate of the intensity profile by filtering noise, reducing the influence of background changes, and fitting a mixture of Gaussians. The resulting profiles are represented by a set of appropriate features trying to characterize the state of nature, here spread out over four classes, one for normal subjects and the other three corresponding to lysosomal diseases, which are disorders responsible for severe nerve degeneration. For classification purposes, a novel solution based on a hierarchical structure is proposed. The main conclusion of this paper is that an automatically generated decision tree presents better results than more conventional solutions, being able to deal with the natural imbalance of the data that, as a consequence of the rarity of lysosomal disorders, has very few representative cases in the disease classes when compared with the normal population.
|
777
|
Cardiovascular genetic medicine: genomic assessment of prognosis and diagnosis in patients with cardiomyopathy and heart failure. J Cardiovasc Transl Res 2008; 1:225-31. [PMID: 20559924 DOI: 10.1007/s12265-008-9044-y] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Received: 04/15/2008] [Accepted: 06/12/2008] [Indexed: 12/22/2022]
Abstract
In the last half century, epidemiologic studies and basic science investigations revealed that hypertension (Kannel et al., Ann Intern Med 55:33-50, 1961), hyperlipidemia (Dawber et al., Am J Public Health Nations Health 49:1349-1356, 1959), diabetes (Kannel et al., Am J Cardiol 34(1):29-34, 1974), smoking (Dawber et al., Am J Public Health Nations Health 49:1349-1356, 1959), and inflammation (Rossmann et al., Exp Gerontol 43(3):229-237, 2008) posed increased risk for cardiovascular disease. These associations served both as risk factors and offered insight into disease pathophysiology. Currently, it is increasingly appreciated that polygenic factors may also play a role as etiologic or risk factors (Chakravarti and Little, Nature 421(6921):412-414, 2003; Dorn and Molkentin, Circulation 109(2):150-158, 2004). Recent technologic advances in genomic screening make the search for these factors possible, and robust technologies are now available for both entire genome screening for expression or single nucleotide polymorphisms. In this paper, we review the basic principles of gene expression and molecular signature analysis in the context of potential clinical applications of transcriptomics.
|
778
|
Mise N, Fuchikami T, Sugimoto M, Kobayakawa S, Ike F, Ogawa T, Tada T, Kanaya S, Noce T, Abe K. Differences and similarities in the developmental status of embryo-derived stem cells and primordial germ cells revealed by global expression profiling. Genes Cells 2008; 13:863-77. [DOI: 10.1111/j.1365-2443.2008.01211.x] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Indexed: 11/28/2022]
|
779
|
Andreeff M, Ruvolo V, Gadgil S, Zeng C, Coombes K, Chen W, Kornblau S, Barón AE, Drabkin HA. HOX expression patterns identify a common signature for favorable AML. Leukemia 2008; 22:2041-7. [PMID: 18668134 DOI: 10.1038/leu.2008.198] [Citation(s) in RCA: 108] [Impact Index Per Article: 6.4] [Indexed: 01/06/2023]
Abstract
Deregulated HOX expression, by chromosomal translocations and myeloid-lymphoid leukemia (MLL) rearrangements, is causal in some types of leukemia. Using real-time reverse transcription-PCR, we examined the expression of 43 clustered HOX, polycomb, MLL and FLT3 genes in 119 newly diagnosed adult acute myeloid leukemias (AMLs) selected from all major cytogenetic groups. Downregulated HOX expression was a consistent feature of favorable AMLs and, among these cases, inv(16) cases had a distinct expression profile. Using a 17-gene predictor in 44 additional samples, we observed a 94.7% specificity for classifying favorable vs intermediate/unfavorable cytogenetic groups. Among other AMLs, HOX overexpression was associated with nucleophosmin (NPM) mutations and we also identified a phenotypically similar subset with wt-NPM. In many unfavorable and other intermediate cytogenetic AMLs, HOX levels resembled those in normal CD34+ cells, except that the homogeneity characteristic of normal samples was not present. We also observed that HOXA9 levels were significantly inversely correlated with survival and that BMI-1 was overexpressed in cases with 11q23 rearrangements, suggesting that p19(ARF) suppression may be involved in MLL-associated leukemia. These results underscore the close relationship between HOX expression patterns and certain forms of AML and emphasize the need to determine whether these differences play a role in the disease process.
Affiliation(s)
- M Andreeff
- Department of Stem Cell Transplantation, Section of Molecular Hematology and Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
|
780
|
|
781
|
Abstract
Although the random forest classification procedure works well in datasets with many features, when the number of features is huge and the percentage of truly informative features is small, such as with DNA microarray data, its performance tends to decline significantly. In such instances, the procedure can be improved by reducing the contribution of trees whose nodes are populated by non-informative features. To some extent, this can be achieved by prefiltering, but we propose a novel, yet simple, adjustment that has demonstrably superior performance: choose the eligible subsets at each node by weighted random sampling instead of simple random sampling, with the weights tilted in favor of the informative features. This results in an 'enriched random forest'. We illustrate the superior performance of this procedure in several actual microarray datasets.
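The node-level change described in this abstract, drawing the eligible feature subset by weighted rather than simple random sampling, can be sketched as follows. This is a hedged sketch, not the authors' code: `weighted_feature_subset` is a hypothetical helper, and how the weights are derived is left to the caller, since the abstract only says they are tilted in favor of informative features.

```python
import random


def weighted_feature_subset(weights, k, rng):
    """Draw k distinct feature indices; each draw is proportional to the
    weights of the features not yet chosen (hypothetical helper)."""
    if k > sum(1 for w in weights if w > 0):
        raise ValueError("not enough features with positive weight")
    remaining = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        current = [weights[i] for i in remaining]
        # random.Random.choices performs one weighted draw with replacement;
        # removing the pick afterwards makes the overall sample without replacement.
        pick = rng.choices(remaining, weights=current, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# Made-up weights: features 0 and 2 are treated as non-informative (weight 0).
rng = random.Random(0)
subset = weighted_feature_subset([0.0, 5.0, 0.0, 1.0, 3.0], k=3, rng=rng)
print(sorted(subset))  # prints [1, 3, 4]
```

Setting all weights equal recovers the simple random sampling used by a standard random forest, so the weighting is the only change at the node level.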
Affiliation(s)
- Dhammika Amaratunga
- Department of Nonclinical Biostatistics, Johnson & Johnson PRD LLC, Raritan, NJ 08869, USA.
|
782
|
Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics 2008; 9:319. [PMID: 18647401 PMCID: PMC2492881 DOI: 10.1186/1471-2105-9-319] [Citation(s) in RCA: 298] [Impact Index Per Article: 17.5] [Received: 01/24/2008] [Accepted: 07/22/2008] [Indexed: 12/17/2022]
Abstract
Background Cancer diagnosis and clinical outcome prediction are among the most important emerging applications of gene expression microarray technology with several molecular signatures on their way toward clinical deployment. Use of the most accurate classification algorithms available for microarray gene expression data is a critical ingredient in order to develop the best possible molecular signatures for patient care. As suggested by a large body of literature to date, support vector machines can be considered "best of class" algorithms for classification of such data. Recent work, however, suggests that random forest classifiers may outperform support vector machines in this domain. Results In the present paper we identify methodological biases of prior work comparing random forests and support vector machines and conduct a new rigorous evaluation of the two algorithms that corrects these limitations. Our experiments use 22 diagnostic and prognostic datasets and show that support vector machines outperform random forests, often by a large margin. Our data also underlines the importance of sound research design in benchmarking and comparison of bioinformatics algorithms. Conclusion We found that both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines both in the settings when no gene selection is performed and when several popular gene selection methods are used.
Affiliation(s)
- Alexander Statnikov
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA.
|
783
|
Wong AKC, Au WH, Chan KCC. Discovering high-order patterns of gene expression levels. J Comput Biol 2008; 15:625-37. [PMID: 18631025 DOI: 10.1089/cmb.2007.0147] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/13/2022]
Abstract
This paper reports the discovery of statistically significant association patterns of gene expression levels from microarray data. By association patterns, we mean certain gene expression intensity intervals having statistically significant associations among themselves and with the tissue classes, such as cancerous and normal tissues. We describe how the significance of the associations among gene expression levels can be evaluated using a statistical measure in an objective manner. If an association is found to be significant based on the measure, we say that it is statistically significant. Given a gene expression data set, we first cluster the entire gene pool comprising all the genes into groups by optimizing the correlation (or more precisely, interdependence) among the gene expression levels within gene groups. From each group, we select one or several genes that are most correlated with other genes within that group to form a smaller gene pool. This gene pool then constitutes the most representative genes from the original pool. Our pattern discovery algorithm is then used, for the first time, to discover the significant association patterns of gene expression levels among the genes from the small pool. With our method, it is more effective to discover and express the associations in terms of their intensity intervals. Hence, we discretize each gene expression level into intervals maximizing the interdependence between the gene expression and the tissue classes. From this data set of gene expression intervals, we discover the association patterns representing statistically significant associations, some positive and some negative, with different tissue classes. We apply our pattern discovery methodology to the colon-cancer microarray gene expression data set. It consists of 2000 genes and 62 samples taken from colon cancer or normal subjects.
The statistically significant combinations of gene expression levels that repress or activate colon cancer are revealed in the colon-cancer data set. The discovered association patterns are ranked according to their statistical significance and displayed for interpretation and further analysis.
Affiliation(s)
- Andrew K C Wong
- Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada
|
784
|
Haibe-Kains B, Desmedt C, Sotiriou C, Bontempi G. A comparative study of survival models for breast cancer prognostication based on microarray data: does a single gene beat them all? Bioinformatics 2008; 24:2200-8. [PMID: 18635567 PMCID: PMC2553442 DOI: 10.1093/bioinformatics/btn374] [Citation(s) in RCA: 169] [Impact Index Per Article: 9.9] [Indexed: 12/20/2022]
Abstract
Motivation: Survival prediction of breast cancer (BC) patients independently of treatment, also known as prognostication, is a complex task since clinically similar breast tumors, in addition to being molecularly heterogeneous, may exhibit different clinical outcomes. In recent years, the analysis of gene expression profiles by means of sophisticated data mining tools emerged as a promising technology to bring additional insights into BC biology and to improve the quality of prognostication. The aim of this work is to assess quantitatively the accuracy of prediction obtained with state-of-the-art data analysis techniques for BC microarray data through an independent and thorough framework. Results: Due to the large number of variables, the reduced number of samples and the high degree of noise, complex prediction methods are highly exposed to performance degradation despite the use of cross-validation techniques. Our analysis shows that the most complex methods are not significantly better than the simplest one, a univariate model relying on a single proliferation gene. This result suggests that proliferation might be the most relevant biological process for BC prognostication and that the loss of interpretability deriving from the use of overcomplex methods may not be sufficiently counterbalanced by an improvement in the quality of prediction. Availability: The comparison study is implemented in an R package called survcomp and is available from http://www.ulb.ac.be/di/map/bhaibeka/software/survcomp/. Contact: bhaibeka@ulb.ac.be Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- B Haibe-Kains
- Machine Learning Group, Department of Computer Science, Institut Jules Bordet, Université Libre de Bruxelles, Brussels, Belgium.
|
785
|
Song JJ, Deng W, Lee HJ, Kwon D. Optimal classification for time-course gene expression data using functional data analysis. Comput Biol Chem 2008; 32:426-32. [PMID: 18755633 DOI: 10.1016/j.compbiolchem.2008.07.007] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.3] [Received: 09/18/2007] [Revised: 05/16/2008] [Accepted: 07/06/2008] [Indexed: 11/15/2022]
Abstract
Classification problems have received considerable attention in biological and medical applications. In particular, classification methods combined with microarray technology play an important role in diagnosing and predicting disease, such as cancer, in medical research. The primary objective in classification is to build an optimal classifier based on the training sample in order to predict the unknown class in the test sample. In this paper, we propose a unified approach for optimal gene classification in conjunction with functional principal component analysis (FPCA) in a functional data analysis (FNDA) framework to classify time-course gene expression profiles based on information from the patterns. To derive an optimal classifier in FNDA, we also propose to find the optimal number of bases in the smoothing step and of functional principal components in FPCA using a cross-validation technique, and compare the performance of some popular classification techniques in the proposed setting. We illustrate the proposed method with a simulation study and a real-world data analysis.
Affiliation(s)
- Joon Jin Song
- Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA.
|
786
|
Abstract
MOTIVATION The classification methods typically used in bioinformatics classify all examples, even if the classification is ambiguous, for instance, when the example is close to the separating hyperplane in linear classification. For medical applications, it may be better to classify an example only when there is a sufficiently high degree of accuracy, rather than classify all examples with decent accuracy. Moreover, when all examples are classified, the classification rule has no control over the accuracy of the classifier; the algorithm just aims to produce a classifier with the smallest error rate possible. In our approach, we fix the accuracy of the classifier and thereby choose a desired risk of error. RESULTS Our method consists of defining a rejection region in the feature space. This region contains the examples for which classification is ambiguous. These are rejected by the classifier. The accuracy of the classifier becomes a user-defined parameter of the classification rule. The task of the classification rule is to minimize the rejection region with the constraint that the error rate of the classifier be bounded by the chosen target error. This approach is also used in the feature-selection step. The results computed on both synthetic and real data show that classifier accuracy is significantly improved. AVAILABILITY Companion Website. http://gsp.tamu.edu/Publications/rejectoption/
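A one-dimensional illustration of the reject option described in this abstract is sketched below. It is a simplification under stated assumptions, not the authors' algorithm: the rejection region is assumed to be a symmetric band `|score| <= t` around the separating hyperplane, the function name `smallest_reject_threshold` is hypothetical, and the threshold is chosen on the same examples it is evaluated on, whereas in practice it would be estimated on held-out data.

```python
def smallest_reject_threshold(scores, labels, target_error):
    """Return (t, rejected_fraction) for the smallest band half-width t such
    that the error rate among accepted examples (|score| > t) is at most
    target_error. scores are signed decision values; labels are +1 or -1."""
    n = len(scores)
    # Only 0 and the observed |score| values can change the accepted set.
    candidates = [0.0] + sorted({abs(s) for s in scores})
    for t in candidates:
        accepted = [(s, y) for s, y in zip(scores, labels) if abs(s) > t]
        if not accepted:
            # Everything is rejected: the error bound holds trivially.
            return t, 1.0
        errors = sum(1 for s, y in accepted if (s > 0) != (y > 0))
        if errors / len(accepted) <= target_error:
            return t, 1.0 - len(accepted) / n
    return candidates[-1], 1.0


# Made-up decision values: the two misclassified examples lie close to the hyperplane.
scores = [-2.0, -1.0, -0.2, 0.1, 0.3, 1.5, 2.5]
labels = [-1, -1, 1, -1, 1, 1, 1]
t, rejected = smallest_reject_threshold(scores, labels, target_error=0.1)
print(t, round(rejected, 3))  # prints 0.2 0.286
```

Here the target error acts as the user-defined parameter, and the price of meeting it is the fraction of examples left unclassified.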
Affiliation(s)
- Blaise Hanczar
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
|
787
|
Abstract
Using a question and answer format we describe important aspects of using genomic technologies in cancer research. The main challenges are not managing the mass of data, but rather the design, analysis, and accurate reporting of studies that result in increased biological knowledge and medical utility. Many analysis issues address the use of expression microarrays but are also applicable to other whole genome assays. Microarray-based clinical investigations have generated both unrealistic hype and excessive skepticism. Genomic technologies are tremendously powerful and will play instrumental roles in elucidating the mechanisms of oncogenesis and in bringing on an era of predictive medicine in which treatments are tailored to individual tumors. Achieving these goals involves challenges in rethinking many paradigms for the conduct of basic and clinical cancer research and for the organization of interdisciplinary collaboration.
Affiliation(s)
- Richard Simon
- Biometric Research Branch, National Cancer Institute, Bethesda, MD 20892-7434, USA.
|
788
|
Heidecker B, Kasper EK, Wittstein IS, Champion HC, Breton E, Russell SD, Kittleson MM, Baughman KL, Hare JM. Transcriptomic biomarkers for individual risk assessment in new-onset heart failure. Circulation 2008; 118:238-46. [PMID: 18591436 DOI: 10.1161/circulationaha.107.756544] [Citation(s) in RCA: 65] [Impact Index Per Article: 3.8] [Indexed: 01/06/2023]
Abstract
BACKGROUND Prediction of prognosis remains a major unmet need in new-onset heart failure (HF). Although several clinical tests are in use, none accurately distinguish between patients with poor versus excellent survival. We hypothesized that a transcriptomic signature, generated from a single endomyocardial biopsy, could serve as a novel prognostic biomarker in HF. METHODS AND RESULTS Endomyocardial biopsy samples and clinical data were collected from all patients presenting with new-onset HF from 1997 to 2006. Among a total of 350 endomyocardial biopsy samples, 180 were identified as idiopathic dilated cardiomyopathy. Patients with phenotypic extremes in survival were selected: good prognosis (event-free survival for at least 5 years; n=25) and poor prognosis (events [death, requirement for left ventricular assist device, or cardiac transplant] within the first 2 years of presentation with HF symptoms; n=18). We used human U133 Plus 2.0 microarrays (Affymetrix) and analyzed the data with significance analysis of microarrays and prediction analysis of microarrays. We identified 46 overexpressed genes in patients with good versus poor prognosis, of which 45 genes were selected by prediction analysis of microarrays for prediction of prognosis in a train set (n=29) with subsequent validation in test sets (n=14 each). The biomarker performed with 74% sensitivity (95% CI 69% to 79%) and 90% specificity (95% CI 87% to 93%) after 50 random partitions. CONCLUSIONS These findings suggest the potential of transcriptomic biomarkers to predict prognosis in patients with new-onset HF from a single endomyocardial biopsy sample. In addition, our findings offer potential novel therapeutic targets for HF and cardiomyopathy.
Affiliation(s)
- Bettina Heidecker
- Miller School of Medicine, University of Miami Division of Cardiology, CRB, 1120 NW 14th St, Suite 1124, Miami, FL 33136, USA
|
789
|
Oh JH, Kim YB, Gurnani P, Rosenblatt KP, Gao JX. Biomarker selection and sample prediction for multi-category disease on MALDI-TOF data. Bioinformatics 2008; 24:1812-8. [DOI: 10.1093/bioinformatics/btn316] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 11/12/2022]
|
790
|
Ma S, Huang J. Penalized feature selection and classification in bioinformatics. Brief Bioinform 2008; 9:392-403. [PMID: 18562478 DOI: 10.1093/bib/bbn027] [Citation(s) in RCA: 130] [Impact Index Per Article: 7.6] [Indexed: 11/14/2022]
Abstract
In bioinformatics studies, supervised classification with high-dimensional input variables is frequently encountered. Examples routinely arise in genomic, epigenetic and proteomic studies. Feature selection can be employed along with classifier construction to avoid over-fitting, to generate more reliable classifiers and to provide more insights into the underlying causal relationships. In this article, we provide a review of several recently developed penalized feature selection and classification techniques, which belong to the family of embedded feature selection methods, for bioinformatics studies with high-dimensional input. Classification objective functions, penalty functions and computational algorithms are discussed. Our goal is to make interested researchers aware of these feature selection and classification methods that are applicable to high-dimensional bioinformatics data.
Affiliation(s)
- Shuangge Ma
- Department of Epidemiology and Public Health, Yale University, USA.
|
791
|
Zhu M, Martinez AM. Using the information embedded in the testing sample to break the limits caused by the small sample size in microarray-based classification. BMC Bioinformatics 2008; 9:280. [PMID: 18554411 PMCID: PMC2443146 DOI: 10.1186/1471-2105-9-280] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 12/19/2007] [Accepted: 06/14/2008] [Indexed: 12/25/2022]
Abstract
Background Microarray-based tumor classification is characterized by a very large number of features (genes) and a small number of samples. In such cases, statistical techniques cannot determine which genes are correlated to each tumor type. A popular solution is the use of a subset of pre-specified genes. However, molecular variations are generally correlated to a large number of genes. A gene that is not correlated to some disease may, by combination with other genes, express itself. Results In this paper, we propose a new classification strategy that can reduce the effect of over-fitting without the need to pre-select a small subset of genes. Our solution works by taking advantage of the information embedded in the testing samples. We note that a well-defined classification algorithm works best when the data is properly labeled. Hence, our classification algorithm will discriminate all samples best when the testing sample is assumed to belong to the correct class. We compare our solution with several well-known alternatives for tumor classification on a variety of publicly available data-sets. Our approach consistently leads to better classification results. Conclusion Studies indicate that thousands of samples may be required to extract useful statistical information from microarray data. Herein, it is shown that this problem can be circumvented by using the information embedded in the testing samples.
Affiliation(s)
- Manli Zhu
- Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH 43210, USA.
|
792
|
Lin G, Cai Z, Wu J, Wan XF, Xu L, Goebel R. Identifying a few foot-and-mouth disease virus signature nucleotide strings for computational genotyping. BMC Bioinformatics 2008; 9:279. [PMID: 18554404 PMCID: PMC2438327 DOI: 10.1186/1471-2105-9-279] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Received: 03/12/2008] [Accepted: 06/13/2008] [Indexed: 11/19/2022]
Abstract
Background Serotypes of the Foot-and-Mouth disease viruses (FMDVs) were generally determined by biological experiments. The computational genotyping is not well studied even with the availability of whole viral genomes, due to uneven evolution among genes as well as frequent genetic recombination. Naively using sequence comparison for genotyping is only able to achieve a limited extent of success. Results We used 129 FMDV strains with known serotype as training strains to select as many as 140 most serotype-specific nucleotide strings. We then constructed a linear-kernel Support Vector Machine classifier using these 140 strings. Under the leave-one-out cross validation scheme, this classifier was able to assign correct serotype to 127 of these 129 strains, achieving 98.45% accuracy. It also assigned serotype correctly to an independent test set of 83 other FMDV strains downloaded separately from NCBI GenBank. Conclusion Computational genotyping is much faster and much cheaper than the wet-lab based biological experiments, upon the availability of the detailed molecular sequences. The high accuracy of our proposed method suggests the potential of utilizing a few signature nucleotide strings instead of whole genomes to determine the serotypes of novel FMDV strains.
Affiliation(s)
- Guohui Lin
- Department of Computing Science, University of Alberta, Edmonton, Alberta T6G 2E8, Canada.
|
793
|
Schachtner R, Lutter D, Knollmüller P, Tomé AM, Theis FJ, Schmitz G, Stetter M, Vilda PG, Lang EW. Knowledge-based gene expression classification via matrix factorization. Bioinformatics 2008; 24:1688-97. [PMID: 18535085 PMCID: PMC2638868 DOI: 10.1093/bioinformatics/btn245] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.8] [Indexed: 12/02/2022]
Abstract
Motivation: Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools which are currently explored to analyze gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). These extracted features are considered indicative of underlying regulatory processes. They can also be applied to the classification of gene expression datasets by grouping samples into different categories for diagnostic purposes or by grouping genes into functional categories for further investigation of related metabolic pathways and regulatory networks. Results: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. The latter monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories. The latter correspond to either monocytes versus macrophages or healthy vs Niemann Pick C disease patients. Supplementary information: Supplementary data are available at Bioinformatics online. Contact: elmar.lang@biologie.uni-regensburg.de
Affiliation(s)
- R Schachtner
- CIML/Biophysics, University of Regensburg, D-93040 Regensburg, Germany
|
794
|
Yang JY, Li GZ, Meng HH, Yang MQ, Deng Y. Improving prediction accuracy of tumor classification by reusing genes discarded during gene selection. BMC Genomics 2008; 9 Suppl 1:S3. [PMID: 18366616 PMCID: PMC2386068 DOI: 10.1186/1471-2164-9-s1-s3] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022] Open
Abstract
Background: Since the high dimensionality of gene expression microarray data sets degrades the generalization performance of classifiers, feature selection, which selects relevant features and discards irrelevant and redundant features, has been widely used in the bioinformatics field. Multi-task learning is a novel technique to improve prediction accuracy of tumor classification by using information contained in such discarded redundant features, but which features should be discarded or used as input or output remains an open issue. Results: We demonstrate a framework for automatically selecting features to be input, output, and discarded by using a genetic algorithm, and propose two algorithms: GA-MTL (Genetic algorithm based multi-task learning) and e-GA-MTL (an enhanced version of GA-MTL). Experimental results demonstrate that this framework is effective at selecting features for multi-task learning, and that GA-MTL and e-GA-MTL perform better than other heuristic methods. Conclusions: Genetic algorithms are a powerful technique to select features for multi-task learning automatically; GA-MTL and e-GA-MTL are shown to improve generalization performance of classifiers on microarray data sets.
Affiliation(s)
- Jack Y Yang
- Harvard Medical School, Harvard University, Cambridge, Massachusetts 02140-0888 USA.
|
795
|
Halvorsen OJ. Molecular and prognostic markers in prostate cancer. APMIS 2008. [DOI: 10.1111/j.1600-0463.2008.0s123.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
796
|
Tárraga J, Medina I, Carbonell J, Huerta-Cepas J, Minguez P, Alloza E, Al-Shahrour F, Vegas-Azcárate S, Goetz S, Escobar P, Garcia-Garcia F, Conesa A, Montaner D, Dopazo J. GEPAS, a web-based tool for microarray data analysis and interpretation. Nucleic Acids Res 2008; 36:W308-14. [PMID: 18508806 PMCID: PMC2447723 DOI: 10.1093/nar/gkn303] [Citation(s) in RCA: 69] [Impact Index Per Article: 4.1] [Indexed: 12/02/2022] Open
Abstract
Gene Expression Profile Analysis Suite (GEPAS) is one of the most complete and extensively used web-based packages for microarray data analysis. During its more than 5 years of activity it has continuously been updated to keep pace with the state of the art in the changing microarray data analysis arena. GEPAS offers diverse analysis options that include well-established as well as novel algorithms for normalization, gene selection, class prediction, clustering and functional profiling of the experiment. New options for time-course (or dose-response) experiments, microarray-based class prediction, new clustering methods and new tests for differential expression have been included. The new pipeliner module allows automating the execution of sequential analysis steps by means of a simple but powerful graphic interface. An extensive re-engineering of GEPAS has been carried out which includes the use of web services and Web 2.0 technology features, a new user interface with persistent sessions and a new extended database of gene identifiers. GEPAS is nowadays the most quoted web tool in its field; it is extensively used by researchers in many countries, and its records indicate an average usage rate of 500 experiments per day. GEPAS is available at http://www.gepas.org.
Affiliation(s)
- Joaquín Tárraga
- Bioinformatics Department, Centro de Investigación Príncipe Felipe (CIPF), Autopista del Saler 16, E46013, Valencia, Spain
|
797
|
Identifying subset of genes that have influential impacts on cancer progression: a new approach to analyze cancer microarray data. Funct Integr Genomics 2008; 8:361-73. [DOI: 10.1007/s10142-008-0084-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Received: 01/02/2008] [Revised: 04/10/2008] [Accepted: 04/20/2008] [Indexed: 01/30/2023]
|
798
|
Prediction of the tissue-specificity of selective estrogen receptor modulators by using a single biochemical method. Proc Natl Acad Sci U S A 2008; 105:7171-6. [PMID: 18474858 DOI: 10.1073/pnas.0710802105] [Citation(s) in RCA: 73] [Impact Index Per Article: 4.3] [Indexed: 02/07/2023] Open
Abstract
Here, we demonstrate that a single biochemical assay is able to predict the tissue-selective pharmacology of an array of selective estrogen receptor modulators (SERMs). We describe an approach to classify estrogen receptor (ER) modulators based on dynamics of the receptor-ligand complex as probed with hydrogen/deuterium exchange (HDX) mass spectrometry. Differential HDX mapping coupled with cluster and discriminant analysis effectively predicted tissue-selective function in most, but not all, cases tested. We demonstrate that analysis of dynamics of the receptor-ligand complex facilitates binning of ER modulators into distinct groups based on structural dynamics. Importantly, we were able to differentiate small structural changes within ER ligands of the same chemotype. In addition, HDX revealed differentially stabilized regions within the ligand-binding pocket that may contribute to the different pharmacology phenotypes of the compounds independent of helix 12 positioning. In summary, HDX provides a sensitive and rapid approach to classify modulators of the estrogen receptor that correlates with their pharmacological profile.
|
799
|
Julka PK, Chacko RT, Nag S, Parshad R, Nair A, Oh DS, Hu Z, Koppiker CB, Nair S, Dawar R, Dhindsa N, Miller ID, Ma D, Lin B, Awasthy B, Perou CM. A phase II study of sequential neoadjuvant gemcitabine plus doxorubicin followed by gemcitabine plus cisplatin in patients with operable breast cancer: prediction of response using molecular profiling. Br J Cancer 2008; 98:1327-35. [PMID: 18382427 PMCID: PMC2361717 DOI: 10.1038/sj.bjc.6604322] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.1] [Received: 09/26/2007] [Revised: 02/25/2008] [Accepted: 02/26/2008] [Indexed: 02/07/2023] Open
Abstract
This study examined the pathological complete response (pCR) rate and safety of sequential gemcitabine-based combinations in breast cancer. We also examined gene expression profiles from tumour biopsies to identify biomarkers predictive of response. Indian women with large or locally advanced breast cancer received 4 cycles of gemcitabine 1200 mg/m² plus doxorubicin 60 mg/m² (Gem+Dox), then 4 cycles of gemcitabine 1000 mg/m² plus cisplatin 70 mg/m² (Gem+Cis), and surgery. Three alternate dosing sequences were used during cycle 1 to examine dynamic changes in molecular profiles. Of 65 women treated, 13 (24.5% of 53 patients with surgery) had a pCR and 22 (33.8%) had a complete clinical response. Patients administered Gem d1, 8 and Dox d2 in cycle 1 (20 of 65) reported more toxicities, with G3/4 neutropenic infection/febrile neutropenia (7 of 20) as the most common cycle-1 event. Four drug-related deaths occurred. In 46 of 65 patients, 10-fold cross-validated supervised analyses identified gene expression patterns that predicted with ≥73% accuracy (1) clinical complete response after eight cycles, (2) overall clinical complete response, and (3) pCR. This regimen shows strong activity. Patients receiving Gem d1, 8 and Dox d2 experienced unacceptable toxicity, whereas patients on other sequences had manageable safety profiles. Gene expression patterns may predict benefit from gemcitabine-containing neoadjuvant therapy.
Affiliation(s)
- P K Julka
- Department of Radiotherapy and Oncology, AIIMS, New Delhi 110029, India
- R T Chacko
- Department of Medical Oncology, Christian Medical College, Vellore, Tamil Nadu 632004, India
- S Nag
- Department of Medical Oncology, HCJMRI, Pune, Maharashtra 411001, India
- R Parshad
- Department of Radiotherapy and Oncology, AIIMS, New Delhi 110029, India
- A Nair
- Department of Medical Oncology, Christian Medical College, Vellore, Tamil Nadu 632004, India
- D S Oh
- Departments of Genetics and Pathology and Laboratory Medicine, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Z Hu
- Departments of Genetics and Pathology and Laboratory Medicine, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- C B Koppiker
- Department of Medical Oncology, HCJMRI, Pune, Maharashtra 411001, India
- S Nair
- Department of Medical Oncology, Christian Medical College, Vellore, Tamil Nadu 632004, India
- R Dawar
- Department of Radiotherapy and Oncology, AIIMS, New Delhi 110029, India
- N Dhindsa
- Eli Lilly and Company (India) Pvt. Ltd., Gurgaon, Haryana 122001, India
- I D Miller
- Department of Pathology, Aberdeen Royal Infirmary, Foresterhill, Aberdeen AB25 2ZD, UK
- D Ma
- Eli Lilly and Company, Indianapolis, IN 46285, USA
- B Lin
- Eli Lilly and Company, Indianapolis, IN 46285, USA
- B Awasthy
- Health Care Global Enterprises, Curie Centre of Oncology, St John's Hospital Campus, Koramangala, Bangalore 560034, India
- C M Perou
- Departments of Genetics and Pathology and Laboratory Medicine, Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
|
800
|
Abstract
Genomic classifiers using DNA microarrays are becoming powerful tools in the medical community with the potential to revolutionize the diagnosis and treatment of disease. However, despite the tremendous interest in using these classifiers in diagnosis and the management of disease, few genomic classifiers have made it into clinical practice. Some of the major challenges for the development and validation of genomic classifiers will be discussed in this article together with some of their difficulties.
Affiliation(s)
- Samir Lababidi
- CDRH, U.S. Food and Drug Administration, Rockville, Maryland 20850, USA.
|