801
|
Statistical data processing in clinical proteomics. J Chromatogr B Analyt Technol Biomed Life Sci 2008; 866:77-88. [DOI: 10.1016/j.jchromb.2007.10.042] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.0] [Received: 07/20/2007] [Revised: 10/17/2007] [Accepted: 10/18/2007] [Indexed: 01/12/2023]
|
802
|
Lee S. Mistakes in validating the accuracy of a prediction classifier in high-dimensional but small-sample microarray data. Stat Methods Med Res 2008; 17:635-42. [PMID: 18375459 DOI: 10.1177/0962280207084839] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Indexed: 12/18/2022]
Abstract
A major interest in gene expression microarray studies is to develop an accurate classifier that can be adopted in clinical practice. Using large numbers of genes with small samples may lead to overfitting in classification and generate promising but often nonreproducible results. Therefore, assessing the reproducibility of a classifier is necessary. Appropriate methods for validating a developed classifier and estimating its prediction accuracy are discussed. In addition, some mistakes that can arise in the cross-validation process are reviewed using articles published in prominent medical journals, so that inconclusive classifier results do not lead to inappropriate treatment.
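One cross-validation mistake commonly discussed in this literature is selecting genes on the full data set before cross-validating, which leaks information from the held-out sample. The sketch below is illustrative only and is not taken from the article: the mean-difference gene ranking, the nearest-centroid classifier, and all function names are assumptions chosen to keep the example dependency-free.

```python
import random


def select_features(X, y, k):
    """Rank features by absolute difference of class means; return top-k indices."""
    p = len(X[0])
    scores = []
    for j in range(p):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]


def nearest_centroid_predict(X, y, feats, x_new):
    """Predict the label (0 or 1) whose class centroid is closest on the chosen features."""
    best_label, best_dist = None, float("inf")
    for lab in (0, 1):
        rows = [row for row, l in zip(X, y) if l == lab]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in feats]
        dist = sum((x_new[j] - c) ** 2 for j, c in zip(feats, centroid))
        if dist < best_dist:
            best_label, best_dist = lab, dist
    return best_label


def loocv_accuracy(X, y, k, select_inside):
    """Leave-one-out accuracy; select_inside=True repeats gene selection in every fold."""
    # Selecting once on all samples (select_inside=False) is the mistake being illustrated.
    fixed = None if select_inside else select_features(X, y, k)
    hits = 0
    for i in range(len(X)):
        X_tr = X[:i] + X[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        feats = select_features(X_tr, y_tr, k) if select_inside else fixed
        hits += nearest_centroid_predict(X_tr, y_tr, feats, X[i]) == y[i]
    return hits / len(X)


if __name__ == "__main__":
    # Pure-noise data: no gene is truly related to the labels.
    rng = random.Random(0)
    n, p = 20, 500
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    print("selection outside CV:", loocv_accuracy(X, y, 10, select_inside=False))
    print("selection inside CV: ", loocv_accuracy(X, y, 10, select_inside=True))
```

On noise data like this, selection outside the loop tends to report accuracy well above chance, while selection inside the loop tends to stay near 50%; the exact figures depend on the random seed.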
Affiliation(s)
- Sunho Lee
- Department of Applied Mathematics, Sejong University, Seoul, South Korea.
|
803
|
Baker SG, Kramer BS. Using microarrays to study the microenvironment in tumor biology: the crucial role of statistics. Semin Cancer Biol 2008; 18:305-10. [PMID: 18455427 DOI: 10.1016/j.semcancer.2008.03.001] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 03/14/2008] [Accepted: 03/18/2008] [Indexed: 11/30/2022]
Abstract
Microarrays represent a potentially powerful tool for better understanding the role of the microenvironment on tumor biology. To make the best use of microarray data and avoid incorrect or unsubstantiated conclusions, care must be taken in the statistical analysis. To illustrate the statistical issues involved we discuss three microarray studies related to the microenvironment and tumor biology involving: (i) prostatic stroma cells in cancer and non-cancer tissues; (ii) breast stroma and epithelial cells in breast cancer patients and non-cancer patients; and (iii) serum associated with wound response and stroma in cancer patients. Using these examples we critically discuss three types of analyses: differential gene expression, cluster analysis, and class prediction. We also discuss design issues.
Affiliation(s)
- Stuart G Baker
- Division of Cancer Prevention, National Cancer Institute, Bethesda, MD 20892-7354, USA.
|
804
|
Kim C, Cheon M, Kang M, Chang I. A simple and exact Laplacian clustering of complex networking phenomena: application to gene expression profiles. Proc Natl Acad Sci U S A 2008; 105:4083-7. [PMID: 18337496 PMCID: PMC2393820 DOI: 10.1073/pnas.0708598105] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 09/11/2007] [Indexed: 11/18/2022]
Abstract
Unraveling of the unified networking characteristics of complex networking phenomena is of great interest yet a formidable task. There is currently no simple strategy with a rigorous framework. Using an analogy to the exact algebraic property for a transition matrix of a master equation in statistical physics, we propose a method based on a Laplacian matrix for the discovery and prediction of new classes in the unsupervised complex networking phenomena where the class of each sample is completely unknown. Using this proposed Laplacian approach, we can simultaneously discover different classes and determine the identity of each class. Through an illustrative test of the Laplacian approach applied to real datasets of gene expression profiles, leukemia data [Golub TR, et al. (1999) Science 286:531-537], and lymphoma data [Alizadeh AA, et al. (2000) Nature 403:503-511], we demonstrate that this approach is accurate and robust with a mathematical and physical realization. It offers a general framework for characterizing any kind of complex networking phenomenon in broad areas irrespective of whether they are supervised or unsupervised.
Affiliation(s)
- Mookyung Cheon
- National Research Laboratory for Computational Proteomics and Biophysics, Department of Physics, and
- Minho Kang
- Interdisciplinary Research Program of Bioinformatics, Pusan National University, Busan 609-735, Korea
- Iksoo Chang
- National Research Laboratory for Computational Proteomics and Biophysics, Department of Physics, and
|
805
|
Grant GR, Manduchi E, Stoeckert CJ. Analysis and management of microarray gene expression data. Curr Protoc Mol Biol 2008; Chapter 19:Unit 19.6. [PMID: 18265395 DOI: 10.1002/0471142727.mb1906s77] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Indexed: 11/11/2022]
Abstract
Microarray experiments require careful planning and choice of analysis tools in order to get the most out of the data generated, especially considering the associated significant cost and effort. Microarray experiments also require careful documentation, often residing in local databases and/or submitted to public repositories. An often bewildering assortment of choices is available for experimental design, data preprocessing, data analysis (e.g., differential gene expression, classification), and data management. This unit covers the basic steps and common applications for planning, data processing, and data management of microarray experiments, and provides guidance to making choices based on the goals and practical realities of the experiment, as well as the authors' experience in this area.
Affiliation(s)
- Gregory R Grant
- University of Pennsylvania School of Medicine, Philadelphia, Pennsylvania, USA
|
806
|
Mostafavi S, Baranzini S, Oksenberg J, Mousavi P. Predictive modeling of therapy response in multiple sclerosis using gene expression data. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2008; 2006:5519-22. [PMID: 17946311 DOI: 10.1109/iembs.2006.259681] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Abstract
Transcription profiling studies reveal important insights with regard to molecular events that manifest in phenotypic outcomes such as response to drug therapy. Construction of computational models that accurately predict therapy response is only possible when precise data measurements, robust feature/gene selection, and advanced computational modeling methods are combined with stringent statistical validation and large-scale verification of results. Due to the large number of gene expression measurements in transcriptional profiling studies, feature selection represents a bottleneck when constructing computational models. The degree of compromise between selection of the optimal feature set and computational efficiency results in many choices for candidate gene sets, which leads to a wide range of classification accuracies. Furthermore, constructing a classification model using a larger-than-necessary gene set along with a small number of samples may cause over-fitting of the data, resulting in highly optimistic classification accuracies. In this study we present OSeMA, a fast, robust and accurate gene selection-classification framework which results in construction of classification models that are highly predictive of the rIFNB therapy response in multiple sclerosis patients. We assess the performance of OSeMA on held-out test data. Additionally, we extensively evaluate OSeMA by comparing it to an exhaustive combinatorial gene selection-classification approach.
Affiliation(s)
- Sara Mostafavi
- School of Computing, Queen's University, Kingston, ON, Canada.
|
807
|
|
808
|
Boulesteix AL, Strobl C, Augustin T, Daumer M. Evaluating microarray-based classifiers: an overview. Cancer Inform 2008; 6:77-97. [PMID: 19259405 PMCID: PMC2623308 DOI: 10.4137/cin.s408] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.4] [Indexed: 01/23/2023]
Abstract
For the last eight years, microarray-based class prediction has been the subject of numerous publications in medicine, bioinformatics and statistics journals. However, in many articles, the assessment of classification accuracy is carried out using suboptimal procedures and receives little attention. In this paper, we carefully review various statistical aspects of classifier evaluation and validation from a practical point of view. The main topics addressed are accuracy measures, error rate estimation procedures, variable selection, choice of classifiers and validation strategy.
Affiliation(s)
- A-L Boulesteix
- Sylvia Lawry Centre for MS Research (SLC), Hohenlindenerstr. 1, Munich, Germany
|
809
|
Jiang W, Simon R. A comparison of bootstrap methods and an adjusted bootstrap approach for estimating the prediction error in microarray classification. Stat Med 2008; 26:5320-34. [PMID: 17624926 DOI: 10.1002/sim.2968] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.7] [Indexed: 11/06/2022]
Abstract
This paper first provides a critical review on some existing methods for estimating the prediction error in classifying microarray data where the number of genes greatly exceeds the number of specimens. Special attention is given to the bootstrap-related methods. When the sample size n is small, we find that all the reviewed methods suffer from either substantial bias or variability. We introduce a repeated leave-one-out bootstrap (RLOOB) method that predicts for each specimen in the sample using bootstrap learning sets of size ln. We then propose an adjusted bootstrap (ABS) method that fits a learning curve to the RLOOB estimates calculated with different bootstrap learning set sizes. The ABS method is robust across the situations we investigate and provides a slightly conservative estimate for the prediction error. Even with small samples, it does not suffer from large upward bias as the leave-one-out bootstrap and the 0.632+ bootstrap, and it does not suffer from large variability as the leave-one-out cross-validation in microarray applications.
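For orientation, the plain leave-one-out bootstrap that the abstract uses as a baseline can be sketched as follows. This is not the RLOOB or ABS procedure proposed by the authors. The one-dimensional nearest-centroid classifier, the rule of skipping resamples that miss a class, and every function name are assumptions made for illustration.

```python
import random


def nearest_centroid(train_x, train_y, x_new):
    """Classify a 1-D value by the closest class mean in the training data."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        vals = [x for x, y in zip(train_x, train_y) if y == label]
        dist = abs(x_new - sum(vals) / len(vals))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_bootstrap_error(xs, ys, n_boot=200, seed=0):
    """Average, over specimens, of the error rate on resamples that exclude the specimen."""
    rng = random.Random(seed)
    n = len(xs)
    errors = [0] * n  # misclassifications per specimen
    counts = [0] * n  # number of resamples in which the specimen was left out
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot_y = [ys[i] for i in idx]
        if len(set(boot_y)) < len(set(ys)):
            continue  # assumption: skip resamples that miss a whole class
        boot_x = [xs[i] for i in idx]
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:
                counts[i] += 1
                errors[i] += nearest_centroid(boot_x, boot_y, xs[i]) != ys[i]
    rates = [e / c for e, c in zip(errors, counts) if c > 0]
    if not rates:
        raise ValueError("no specimen was ever left out of a resample")
    return sum(rates) / len(rates)
```

Because each resample contains only about 63% of the distinct specimens, the classifier is trained on less information than the full sample provides, which is one source of the upward bias the abstract mentions.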
Affiliation(s)
- Wenyu Jiang
- Biometric Research Branch, Division of Cancer Treatment and Diagnosis, National Cancer Institute, National Institutes of Health, 6130 Executive Boulevard, Rockville, MD 20852, USA.
|
810
|
Li J, Fine JP. ROC analysis with multiple classes and multiple tests: methodology and its application in microarray studies. Biostatistics 2008; 9:566-76. [DOI: 10.1093/biostatistics/kxm050] [Citation(s) in RCA: 116] [Impact Index Per Article: 6.8] [Indexed: 11/14/2022]
|
811
|
Sanden SV, Lin D, Burzykowski T. Performance of Gene Selection and Classification Methods in a Microarray Setting: A Simulation Study. COMMUN STAT-SIMUL C 2008. [DOI: 10.1080/03610910701792554] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/22/2022]
|
812
|
|
813
|
Wong HS, Wang HQ. Constructing the gene regulation-level representation of microarray data for cancer classification. J Biomed Inform 2008; 41:95-105. [PMID: 17499026 DOI: 10.1016/j.jbi.2007.04.002] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 10/29/2006] [Revised: 02/06/2007] [Accepted: 04/03/2007] [Indexed: 11/21/2022]
Abstract
In this paper, we propose a regulation-level representation for microarray data and optimize it using genetic algorithms (GAs) for cancer classification. Compared with the traditional expression-level features, this representation can greatly reduce the dimensionality of microarray data and accommodate noise and variability such that many statistical machine-learning methods now become applicable and efficient for cancer classification. Experimental results on real-world microarray datasets show that the regulation-level representation can consistently converge at a solution with three regulation levels. This verifies the existence of the three regulation levels (up-regulation, down-regulation and non-significant regulation) associated with a particular biological phenotype. The ternary regulation-level representation not only improves the cancer classification capability but also facilitates the visualization of microarray data.
Affiliation(s)
- Hau-San Wong
- Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
|
814
|
Simon R. Development and Validation of Biomarker Classifiers for Treatment Selection. J Stat Plan Inference 2008; 138:308-320. [PMID: 19190712 DOI: 10.1016/j.jspi.2007.06.010] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.0] [Indexed: 10/23/2022]
Abstract
Many syndromes traditionally viewed as individual diseases are heterogeneous in molecular pathogenesis and treatment responsiveness. This often leads to the conduct of large clinical trials to identify small average treatment benefits for heterogeneous groups of patients. Drugs that demonstrate effectiveness in such trials may subsequently be used broadly, resulting in ineffective treatment of many patients. New genomic and proteomic technologies provide powerful tools for the selection of patients likely to benefit from a therapeutic without unacceptable adverse events. In spite of the large literature on developing predictive biomarkers, there is considerable confusion about the development and validation of biomarker based diagnostic classifiers for treatment selection. In this paper we attempt to clarify some of these issues and to provide guidance on the design of clinical trials for evaluating the clinical utility and robustness of pharmacogenomic classifiers.
Affiliation(s)
- Richard Simon
- Richard Simon, D.Sc., Biometric Research Branch, National Cancer Institute, 9000 Rockville Pike, Bethesda MD 20892-7434, U.S.A. 301.496-0975 (tel), 301.402-0560 (fax),
|
815
|
Park M, Lee JW, Bok Lee J, Heun Song S. Several biplot methods applied to gene expression data. J Stat Plan Inference 2008. [DOI: 10.1016/j.jspi.2007.06.019] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 10/23/2022]
|
816
|
|
817
|
Jen CH, Yang TP, Tung CY, Su SH, Lin CH, Hsu MT, Wang HW. Signature Evaluation Tool (SET): a Java-based tool to evaluate and visualize the sample discrimination abilities of gene expression signatures. BMC Bioinformatics 2008; 9:58. [PMID: 18221568 PMCID: PMC2248562 DOI: 10.1186/1471-2105-9-58] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 10/17/2007] [Accepted: 01/28/2008] [Indexed: 11/10/2022]
Abstract
BACKGROUND: The identification of specific gene expression signatures for distinguishing sample groups is a dominant field in cancer research. Although a number of tools have been developed to identify optimal gene expression signatures, the number of signature genes obtained is often too large to be applied clinically. Furthermore, experimental verification is sometimes limited by the availability of wet-lab materials such as antibodies and reagents. A tool to evaluate the discrimination power of candidate genes is therefore in high demand by clinical researchers. RESULTS: Signature Evaluation Tool (SET) is a Java-based tool that adopts Golub's weighted voting algorithm and incorporates a visual presentation of prediction strength for each array sample. SET provides a flexible and easy-to-follow platform to evaluate the discrimination power of a gene signature. Here, we demonstrated the application of SET for several purposes: (1) for signatures consisting of a large number of genes, SET offers the ability to rapidly narrow down the number of genes; (2) for a given signature (from third party analyses or user-defined), SET can re-evaluate and re-adjust its discrimination power by selecting/de-selecting genes repeatedly; (3) for multiple microarray datasets, SET can evaluate the classification capability of a signature among datasets; and (4) by providing a module to visualize the prediction strength for each sample, SET allows users to re-evaluate the discrimination power on mis-grouped or less-certain samples. Information obtained from the above applications could be useful in prognostic analyses or clinical management decisions. CONCLUSION: Here we present SET to evaluate and visualize the sample-discrimination ability of a given gene expression signature. This tool provides a filtration function for signature identification and lies between clinical analyses and class prediction (or feature selection) tools. The simplicity, flexibility and brevity of SET could make it an invaluable tool for marker identification in clinical research.
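The weighted voting scheme and prediction strength mentioned in the abstract can be sketched roughly as below, following the general form described by Golub et al. (1999): each gene votes with a signal-to-noise weight times its distance from the midpoint of the class means. This is not SET's Java implementation; the function names and the use of the sample standard deviation are assumptions.

```python
from statistics import mean, stdev


def fit_weighted_voting(X1, X2):
    """Return a (weight, boundary) pair per gene from class-1 and class-2 training rows."""
    params = []
    for g in range(len(X1[0])):
        a = [row[g] for row in X1]
        b = [row[g] for row in X2]
        # Signal-to-noise weight; assumes the two standard deviations are not both zero.
        weight = (mean(a) - mean(b)) / (stdev(a) + stdev(b))
        boundary = (mean(a) + mean(b)) / 2
        params.append((weight, boundary))
    return params


def predict(params, x):
    """Return (predicted class, prediction strength) for one expression profile x."""
    v1 = v2 = 0.0
    for (w, b), xg in zip(params, x):
        vote = w * (xg - b)
        if vote > 0:
            v1 += vote   # vote for class 1
        else:
            v2 += -vote  # vote for class 2
    total = v1 + v2
    strength = abs(v1 - v2) / total if total else 0.0
    return (1 if v1 >= v2 else 2), strength
```

A prediction strength near 1 means nearly all of the vote weight went to one class, while a value near 0 marks the less-certain samples the abstract refers to.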
Affiliation(s)
- Chih-Hung Jen
- Microarray & Gene Expression Analysis Core Facility, VGH National Yang-Ming University Genome Research Center, Taipei, Taiwan.
|
818
|
Molloy TJ, Bosma AJ, van't Veer LJ. Towards an optimized platform for the detection, enrichment, and semi-quantitation circulating tumor cells. Breast Cancer Res Treat 2008; 112:297-307. [PMID: 18213476 DOI: 10.1007/s10549-007-9872-5] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.9] [Received: 12/13/2007] [Accepted: 12/14/2007] [Indexed: 11/25/2022]
Abstract
Metastasis describes the process of migration of a frequently clinically occult circulating tumor cell (CTC) from the primary lesion to a new location and the subsequent formation of an overt growth. We and others have shown that the detection and quantitation of these cells has significant prognostic value; however, there is still no consensus as to the optimal methods to achieve this. The work described herein therefore considered various techniques, from storage and sample processing to data acquisition and analysis, to find an optimal combination of methods for an effective and practical platform for the detection of CTCs in peripheral blood. A dual-antigen epithelial cell enrichment procedure followed by a multi-marker QPCR analysis demonstrated the highest sensitivity and specificity, with the ability to detect as few as 10 tumor cells from a background of 10^6 peripheral blood mononuclear cells. Using these techniques in conjunction with a quadratic linear discriminant analysis (QDA) resulted in a platform able to generate these data and then combine them into a single score for each patient, in which positivity reflected tumor cell presence and negativity represented tumor cell absence. This assay was able to correctly determine tumor cell presence or absence in 100% of healthy controls and 84% of metastatic patients in a validation cohort of 39 individuals. This platform represents a highly sensitive and specific assay which could augment current routine assays for CTCs in the clinic.
Affiliation(s)
- T J Molloy
- Division of Experimental Therapy, The Netherlands Cancer Institute, Amsterdam, The Netherlands
|
819
|
Tchagang AB, Tewfik AH, DeRycke MS, Skubitz KM, Skubitz AP. Early detection of ovarian cancer using group biomarkers. Mol Cancer Ther 2008; 7:27-37. [DOI: 10.1158/1535-7163.mct-07-0565] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.7] [Indexed: 11/16/2022]
|
820
|
Lottaz C, Kostka D, Markowetz F, Spang R. Computational diagnostics with gene expression profiles. Methods Mol Biol 2008; 453:281-296. [PMID: 18712310 DOI: 10.1007/978-1-60327-429-6_15] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 05/26/2023]
Abstract
Gene expression profiling using micro-arrays is a modern approach for molecular diagnostics. In clinical micro-array studies, researchers aim to predict disease type, survival, or treatment response using gene expression profiles. In this process, they encounter a series of obstacles and pitfalls. This chapter reviews fundamental issues from machine learning and recommends a procedure for the computational aspects of a clinical micro-array study.
Affiliation(s)
- Claudio Lottaz
- Max Planck Institute for Molecular Genetics and Berlin Center for Genome-Based Bioinformatics, Berlin, Germany
|
821
|
Korshunova Y, Maloney RK, Lakey N, Citek RW, Bacher B, Budiman A, Ordway JM, McCombie WR, Leon J, Jeddeloh JA, McPherson JD. Massively parallel bisulphite pyrosequencing reveals the molecular complexity of breast cancer-associated cytosine-methylation patterns obtained from tissue and serum DNA. Genome Res 2008; 18:19-29. [PMID: 18032725 PMCID: PMC2134785 DOI: 10.1101/gr.6883307] [Citation(s) in RCA: 96] [Impact Index Per Article: 5.6] [Received: 07/16/2007] [Accepted: 09/18/2007] [Indexed: 01/06/2023]
Abstract
Cytosine-methylation changes are stable and thought to be among the earliest events in tumorigenesis. Theoretically, DNA carrying tumor-specifying methylation patterns escape the tumors and may be found circulating in the sera from cancer patients, thus providing the basis for development of noninvasive clinical tests for early cancer detection. Indeed, using methylation-specific PCR-based techniques, several groups reported the detection of tumor-associated methylated DNA in the sera from cancer patients with varying clinical success. However, by design, such analytical approaches allow assessment of the presence of molecules with only one methylation pattern, leaving the bigger picture unexplored. The limited knowledge about circulating DNA methylation patterns hinders the efficient development of clinical methylation tests and testing platforms. Here, we report the results of a comprehensive methylation pattern analysis from breast cancer clinical tissues and sera obtained using massively parallel bisulphite pyrosequencing. The four loci studied were recently discovered by our group, and demonstrated to be powerful epigenetic biomarkers of breast cancer. The detailed analysis of more than 700,000 DNA fragments derived from more than 50 individuals (cancer and cancer-free) revealed an unappreciated complexity of genomic cytosine-methylation patterns in both tissue derived and circulating DNAs. Both tumor and cancer-free tissues (as well as sera) contained molecules with nearly every conceivable cytosine-methylation pattern at each locus. Tumor samples displayed more variation in methylation level than normal samples. Importantly, by establishing the methylation landscape within circulating DNA, this study has better defined the development challenges facing DNA methylation-based cancer-detection tests.
Affiliation(s)
- Nathan Lakey
- Orion Genomics, LLC, St. Louis, Missouri 63108, USA
- Jorge Leon
- Orion Genomics, LLC, St. Louis, Missouri 63108, USA
- Jeffrey A. Jeddeloh
- Orion Genomics, LLC, St. Louis, Missouri 63108, USA
- Roche NimbleGen, Madison, Wisconsin 53719, USA
- John D. McPherson
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
|
822
|
|
823
|
Yoo C, Gernaey KV. Classification and Diagnostic Output Prediction of Cancer Using Gene Expression Profiling and Supervised Machine Learning Algorithms. JOURNAL OF CHEMICAL ENGINEERING OF JAPAN 2008. [DOI: 10.1252/jcej.08we042] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 11/11/2022]
Affiliation(s)
- Changkyoo Yoo
- College of Environment and Applied Chemistry, Green Energy Center/Center for Environmental Studies, Kyung Hee University
- Krist V. Gernaey
- Department of Chemical Engineering, Technical University of Denmark
|
824
|
Clarke J, West M. Bayesian Weibull tree models for survival analysis of clinico-genomic data. STATISTICAL METHODOLOGY 2008; 5:238-262. [PMID: 18618012 PMCID: PMC2447923 DOI: 10.1016/j.stamet.2007.09.003] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 01/05/2023]
Abstract
An important goal of research involving gene expression data for outcome prediction is to establish the ability of genomic data to define clinically relevant risk factors. Recent studies have demonstrated that microarray data can successfully cluster patients into low- and high-risk categories. However, the need exists for models which examine how genomic predictors interact with existing clinical factors and provide personalized outcome predictions. We have developed clinico-genomic tree models for survival outcomes which use recursive partitioning to subdivide the current data set into homogeneous subgroups of patients, each with a specific Weibull survival distribution. These trees can provide personalized predictive distributions of the probability of survival for individuals of interest. Our strategy is to fit multiple models; within each model we adopt a prior on the Weibull scale parameter and update this prior via Empirical Bayes whenever the sample is split at a given node. The decision to split is based on a Bayes factor criterion. The resulting trees are weighted according to their relative likelihood values and predictions are made by averaging over models. In a pilot study of survival in advanced stage ovarian cancer we demonstrate that clinical and genomic data are complementary sources of information relevant to survival, and we use the exploratory nature of the trees to identify potential genomic biomarkers worthy of further study.
Affiliation(s)
- Jennifer Clarke
- Department of Epidemiology and Public Health, Leonard M. Miller School of Medicine, University of Miami, Miami, FL 33136, USA
- Mike West
- Department of Statistical Science, Duke University, Durham, NC 27705, USA
|
825
|
|
826
|
Istepanian RSH, Sungoor A, Nebel JC. Fractal dimension and wavelet decomposition for robust microarray data clustering. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2008; 2008:4106-4109. [PMID: 19163615 DOI: 10.1109/iembs.2008.4650112] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/27/2023]
Abstract
Microarrays are now established technologies that are considered key to gene expression analysis. Their study is usually achieved by using clustering techniques. Genomic signal processing is a new area of research that combines genomics with digital signal processing methodologies. In this paper, we present a comparative analysis of two genomic signal processing methods for robust microarray data clustering. Techniques based on Fractal Dimension and Discrete Wavelet Decomposition with Vector Quantization are validated for standard data sets. Comparative analysis of the results indicates that these methods provide improved clustering accuracy compared to some conventional clustering techniques. Moreover, these classifiers do not require any prior training procedures.
Affiliation(s)
- Robert S H Istepanian
- Mobile Information and Network Technologies Research Centre (MINT), Kingston University, London KT1 2EE, UK.
|
827
|
Clarke R, Ressom HW, Wang A, Xuan J, Liu MC, Gehan EA, Wang Y. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nat Rev Cancer 2008; 8:37-49. [PMID: 18097463 PMCID: PMC2238676 DOI: 10.1038/nrc2294] [Citation(s) in RCA: 322] [Impact Index Per Article: 18.9] [Indexed: 02/07/2023]
Abstract
High-throughput genomic and proteomic technologies are widely used in cancer research to build better predictive models of diagnosis, prognosis and therapy, to identify and characterize key signalling networks and to find new targets for drug development. These technologies present investigators with the task of extracting meaningful statistical and biological information from high-dimensional data spaces, wherein each sample is defined by hundreds or thousands of measurements, usually concurrently obtained. The properties of high dimensionality are often poorly understood or overlooked in data modelling and analysis. From the perspective of translational science, this Review discusses the properties of high-dimensional data spaces that arise in genomic and proteomic studies and the challenges they can pose for data analysis and interpretation.
Affiliation(s)
- Robert Clarke
- Department of Oncology and Lombardi Comprehensive Cancer Center, Georgetown University School of Medicine, 3970 Reservoir Road NW, Washington, DC 20057, USA
|
828
|
Abstract
Classification using high-dimensional features arises frequently in many contemporary statistical studies such as tumor classification using microarray or other high-throughput data. The impact of dimensionality on classification is poorly understood. In a seminal paper, Bickel and Levina (2004) show that the Fisher discriminant performs poorly due to diverging spectra, and they propose using the independence rule to overcome the problem. We first demonstrate that even for the independence classification rule, classification using all the features can be as poor as random guessing due to noise accumulation in estimating population centroids in high-dimensional feature space. In fact, we demonstrate further that almost all linear discriminants can perform as poorly as random guessing. Thus, it is of paramount importance to select a subset of important features for high-dimensional classification, resulting in Features Annealed Independence Rules (FAIR). The conditions under which all the important features can be selected by the two-sample t-statistic are established. The choice of the optimal number of features, or equivalently, the threshold value of the test statistics, is proposed based on an upper bound of the classification error. Simulation studies and real data analysis support our theoretical results and demonstrate convincingly the advantage of our new classification procedure.
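The two ingredients named in the abstract, ranking features by the two-sample t-statistic and classifying with an independence (diagonal-covariance) rule, can be sketched as below. This is a hedged illustration rather than the authors' FAIR procedure: the number of retained features `m` is supplied by the caller instead of being chosen from the error bound, and the function names and the pooled-variance scaling are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def t_statistics(X1, X2):
    """Two-sample t-statistic per feature (unequal-variance form)."""
    n1, n2 = len(X1), len(X2)
    stats = []
    for j in range(len(X1[0])):
        a = [r[j] for r in X1]
        b = [r[j] for r in X2]
        # Assumes at least one class has non-zero variance for each feature.
        stats.append((mean(a) - mean(b)) / sqrt(variance(a) / n1 + variance(b) / n2))
    return stats


def fit_independence_rule(X1, X2, m):
    """Keep the m features with the largest |t| and store their means and pooled variance."""
    t = t_statistics(X1, X2)
    keep = sorted(range(len(t)), key=lambda j: abs(t[j]), reverse=True)[:m]
    n1, n2 = len(X1), len(X2)
    model = []
    for j in keep:
        a = [r[j] for r in X1]
        b = [r[j] for r in X2]
        pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
        model.append((j, mean(a), mean(b), pooled))
    return model


def classify(model, x):
    """Independence rule: sum of per-feature discriminants, ignoring correlations."""
    score = sum((x[j] - (m1 + m2) / 2) * (m1 - m2) / s2 for j, m1, m2, s2 in model)
    return 1 if score > 0 else 2
```

Restricting the sum to a few high-|t| features is what limits the noise accumulation described in the abstract; using every feature would add one noisy centroid estimate per dimension.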
|
829
|
|
830
|
Zhang HH, Liu Y, Wu Y, Zhu J. Variable selection for the multicategory SVM via adaptive sup-norm regularization. Electron J Stat 2008. [DOI: 10.1214/08-ejs122] [Citation(s) in RCA: 54] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
831
|
Xie B, Pan W, Shen X. Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables. Electron J Stat 2008; 2:168-212. [PMID: 19920875 DOI: 10.1214/08-ejs194] [Citation(s) in RCA: 58] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
Clustering analysis is one of the most widely used statistical tools in many emerging areas such as microarray data analysis. For microarray and other high-dimensional data, the presence of many noise variables may mask underlying clustering structures. Hence removing noise variables via variable selection is necessary. For simultaneous variable selection and parameter estimation, existing penalized likelihood approaches in model-based clustering analysis all assume a common diagonal covariance matrix across clusters, which however may not hold in practice. To analyze high-dimensional data, particularly those with relatively low sample sizes, this article introduces a novel approach that shrinks the variances together with means, in a more general situation with cluster-specific (diagonal) covariance matrices. Furthermore, selection of grouped variables via inclusion or exclusion of a group of variables altogether is permitted by a specific form of penalty, which facilitates incorporating subject-matter knowledge, such as gene functions in clustering microarray samples for disease subtype discovery. For implementation, EM algorithms are derived for parameter estimation, in which the M-steps clearly demonstrate the effects of shrinkage and thresholding. Numerical examples, including an application to acute leukemia subtype discovery with microarray gene expression data, are provided to demonstrate the utility and advantage of the proposed method.
Affiliation(s)
- Benhuai Xie
- Division of Biostatistics, School of Public Health, University of Minnesota,
|
832
|
Sarkar A, Chakraborty A, Chaudhuri A. A Method of Finding Predictor Genes for a Particular Disease Using a Clustering Algorithm. COMMUN STAT-SIMUL C 2007. [DOI: 10.1080/03610910701724037] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
833
|
Xie B, Pan W, Shen X. Variable Selection in Penalized Model‐Based Clustering Via Regularization on Grouped Parameters. Biometrics 2007; 64:921-930. [DOI: 10.1111/j.1541-0420.2007.00955.x] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Affiliation(s)
- Benhuai Xie
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455 U.S.A
| | - Wei Pan
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota 55455 U.S.A
| | - Xiaotong Shen
- School of Statistics, University of Minnesota, Minneapolis, Minnesota 55455 U.S.A
|
834
|
Ye J, Liu H, Kirmiz C, Lebrilla CB, Rocke DM. On the analysis of glycomics mass spectrometry data via the regularized area under the ROC curve. BMC Bioinformatics 2007; 8:477. [PMID: 18076765 PMCID: PMC2211327 DOI: 10.1186/1471-2105-8-477] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2007] [Accepted: 12/12/2007] [Indexed: 11/10/2022] Open
Abstract
Background Novel molecular and statistical methods are in rising demand for disease diagnosis and prognosis with the help of recent advanced biotechnology. High-resolution mass spectrometry (MS) is one of those biotechnologies that are highly promising to improve health outcomes. Previous studies have identified some proteomics biomarkers that can distinguish healthy patients from cancer patients using MS data. In this paper, an MS study is demonstrated which uses glycomics to identify ovarian cancer. Glycomics is the study of glycans and glycoproteins. The glycans on the proteins may deviate between a cancer cell and a normal cell and may be visible in the blood. High-resolution MS has been applied to measure relative abundances of potential glycan biomarkers in human serum. Multiple potential glycan biomarkers are measured in MS spectra. With the objective of maximizing the empirical area under the ROC curve (AUC), an analysis method was considered which combines potential glycan biomarkers for the diagnosis of cancer. Results Maximizing the empirical AUC of glycomics MS data is a large-dimensional optimization problem. The technical difficulty is that the empirical AUC function is not continuous. Instead, it is in fact an empirical 0–1 loss function with a large number of linear predictors. An approach was investigated that regularizes the area under the ROC curve while replacing the 0–1 loss function with a smooth surrogate function. The constrained threshold gradient descent regularization algorithm was applied, where the regularization parameters were chosen by the cross-validation method, and the confidence intervals of the regression parameters were estimated by the bootstrap method. The method is called the TGDR-AUC algorithm. The properties of the approach were studied through a numerical simulation study, which incorporates the positive values of mass spectrometry data with the correlations between measurements within person. The simulation demonstrated the asymptotic property that the estimated AUC approaches the true AUC. Finally, mass spectrometry data of serum glycan for ovarian cancer diagnosis was analyzed. The optimal combination based on the TGDR-AUC algorithm yields a plausible result and the detected biomarkers are confirmed based on biological evidence. Conclusion The TGDR-AUC algorithm relaxes the normality and independence assumptions of previous studies. In addition to its flexibility and easy interpretability, the algorithm yields good performance in combining potential biomarkers and is computationally feasible. Thus, the approach of TGDR-AUC is a plausible algorithm to classify disease status on the basis of multiple biomarkers.
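The idea of replacing the discontinuous pairwise 0–1 comparison in the empirical AUC with a smooth surrogate can be illustrated as follows. This is only a sketch under stated assumptions: the abstract does not name the surrogate, so the logistic (sigmoid) function and the bandwidth parameter `sigma` are choices made here for illustration, and the threshold gradient descent regularization step itself is not shown.

```python
import numpy as np

def empirical_auc(scores_pos, scores_neg):
    """Fraction of (case, control) pairs ranked correctly; ties count as one half."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.mean((diff > 0) + 0.5 * (diff == 0))

def smoothed_auc(scores_pos, scores_neg, sigma=0.1):
    """Differentiable approximation: a sigmoid replaces the step function.

    The sigmoid and `sigma` are illustrative assumptions, not taken from the paper.
    """
    diff = scores_pos[:, None] - scores_neg[None, :]
    return np.mean(1.0 / (1.0 + np.exp(-diff / sigma)))
```

As `sigma` shrinks, the smoothed value approaches the empirical AUC whenever no case and control share the same score, while for larger `sigma` the function stays smooth enough for gradient-based optimization over the coefficients of a linear score.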
Affiliation(s)
- Jingjing Ye
- Department of Statistics, University of California, Davis, Davis, CA, 95616, USA.
|
835
|
Zhao H, Liew AWC, Xie X, Yan H. A new geometric biclustering algorithm based on the Hough transform for analysis of large-scale microarray data. J Theor Biol 2007; 251:264-74. [PMID: 18199458 DOI: 10.1016/j.jtbi.2007.11.030] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2007] [Revised: 10/17/2007] [Accepted: 11/29/2007] [Indexed: 11/30/2022]
Abstract
Biclustering is an important tool in microarray analysis when only a subset of genes is co-regulated under a subset of conditions. Unlike standard clustering analyses, biclustering performs simultaneous classification in both the gene and condition directions of a microarray data matrix. However, the biclustering problem is inherently intractable and computationally complex. In this paper, we present a new biclustering algorithm based on the geometrical viewpoint of coherent gene expression profiles. In this method, we perform pattern identification based on the Hough transform in a column-pair space. The algorithm is especially suitable for the biclustering analysis of large-scale microarray data. Our studies show that the approach can discover significant biclusters even as the noise level and regulatory complexity increase. Furthermore, we also test the ability of our method to locate biologically verifiable biclusters within an annotated set of genes.
Affiliation(s)
- Hongya Zhao
- Department of Electronic Engineering, City University of Hong Kong, Kowloon, Hong Kong.
|
836
|
Gene Selection Based on Support Vector Machine using Bootstrap. KOREAN JOURNAL OF APPLIED STATISTICS 2007. [DOI: 10.5351/kjas.2007.20.3.531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
|
837
|
Microarray-based expression profiling and informatics. Curr Opin Biotechnol 2007; 19:26-9. [PMID: 18053704 DOI: 10.1016/j.copbio.2007.10.008] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2007] [Accepted: 10/14/2007] [Indexed: 11/23/2022]
Abstract
Microarray-based expression profiling is a powerful technology for studying biological mechanisms and for developing clinically valuable predictive classifiers. The high-dimensional read-out for each sample assayed makes it possible to do new kinds of studies but also increases the risks of misleading conclusions. We review here the current state-of-the-art for design and analysis of microarray-based investigations.
|
838
|
Pittelkow YE, Wilson SR. Visualisation of “High p, Small n” data. Comput Stat 2007. [DOI: 10.1007/s00180-007-0060-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/22/2022]
|
839
|
Gormley M, Dampier W, Ertel A, Karacali B, Tozeren A. Prediction potential of candidate biomarker sets identified and validated on gene expression data from multiple datasets. BMC Bioinformatics 2007; 8:415. [PMID: 17963508 PMCID: PMC2211325 DOI: 10.1186/1471-2105-8-415] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2007] [Accepted: 10/26/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Independently derived expression profiles of the same biological condition often have few genes in common. In this study, we created populations of expression profiles from publicly available microarray datasets of cancer (breast, lymphoma and renal) samples linked to clinical information with an iterative machine learning algorithm. ROC curves were used to assess the prediction error of each profile for classification. We compared the prediction error of profiles correlated with molecular phenotype against profiles correlated with relapse-free status. Prediction error of profiles identified with supervised univariate feature selection algorithms was compared to that of profiles selected randomly from a) all genes on the microarray platform and b) a list of known disease-related genes (a priori selection). We also determined the relevance of expression profiles on test arrays from independent datasets, measured on either the same or different microarray platforms. RESULTS Highly discriminative expression profiles were produced on both simulated gene expression data and expression data from breast cancer and lymphoma datasets on the basis of ER and BCL-6 expression, respectively. Use of relapse-free status to identify profiles for prognosis prediction resulted in poorly discriminative decision rules. Supervised feature selection resulted in more accurate classifications than random or a priori selection; however, the difference in prediction error decreased as the number of features increased. These results held when decision rules were applied across datasets to samples profiled on the same microarray platform. CONCLUSION Our results show that many gene sets predict molecular phenotypes accurately. Given this, expression profiles identified using different training datasets should be expected to show little agreement. In addition, we demonstrate the difficulty in predicting relapse directly from microarray data using supervised machine learning approaches. These findings are relevant to the use of molecular profiling for the identification of candidate biomarker panels.
Affiliation(s)
- Michael Gormley
- School of Biomedical Engineering, Drexel University, Philadelphia, PA, USA.
|
840
|
Isolation rearing impairs wound healing and is associated with increased locomotion and decreased immediate early gene expression in the medial prefrontal cortex of juvenile rats. Neuroscience 2007; 151:589-603. [PMID: 18063315 DOI: 10.1016/j.neuroscience.2007.10.014] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2007] [Revised: 09/28/2007] [Accepted: 10/18/2007] [Indexed: 11/20/2022]
Abstract
In addition to its maladaptive effects on psychiatric function, psychosocial deprivation impairs recovery from physical illness. Previously, we found that psychosocial deprivation, modeled by isolation rearing, depressed immediate early gene (IEG) expression in the medial prefrontal cortex (mPFC) and increased locomotion in the open field test [Levine JB, Youngs RM, et al. (2007) Isolation rearing and hyperlocomotion are associated with reduced immediate early gene expression levels in the medial prefrontal cortex. Neuroscience 145(1):42-55]. In the present study, we examined whether similar changes in behavior and gene expression are associated with the maladaptive effects of psychosocial deprivation on physical injury healing. After weaning, anesthetized rats were subjected to a 20% total body surface area third degree burn injury and were subsequently either group or isolation reared. After 4 weeks of either isolation or group rearing (a period that encompasses post-weaning and early adolescence), rats were killed, and their healing and gene expression in the mPFC were assessed. Locomotion in the open field test was examined at 3 weeks post-burn injury. We found that: 1) gross wound healing was significantly impaired in isolation-reared rats compared with group-reared rats, 2) locomotion was increased and IEG expression was suppressed for isolation-reared rats during burn injury healing, 3) the decreased activity in the open field and increased IEG expression was greater for burn injury healing group-reared rats than for uninjured group-reared rats, 4) the degree of hyperactivity and IEG suppression was relatively similar between isolation-reared rats during burn injury compared with uninjured isolation-reared rats. Thus, in the mPFC, behavioral hyperactivity to novelty (the open field test) along with IEG suppression may constitute a detectable biomarker of isolation rearing during traumatic physical injury. Implications of the findings for understanding, assessing, and treating the maladaptive effects of psychosocial deprivation on physical healing during childhood are discussed.
|
841
|
Wang LY, Tu Z. Lung tumor diagnosis and subtype discovery by gene expression profiling. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2007; 2006:5868-71. [PMID: 17947173 DOI: 10.1109/iembs.2006.259539] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
The optimal treatment of patients with complex diseases, such as cancers, depends on accurate diagnosis using a combination of clinical and histopathological data. In many scenarios, diagnosis becomes extremely difficult because of the limitations in clinical presentation and histopathology. To accurately diagnose complex diseases, molecular classification based on gene or protein expression profiles is indispensable for modern medicine. Moreover, many heterogeneous diseases comprise various potential molecular subtypes that differ remarkably in their response to therapies. It is critical to accurately predict subgroups from disease gene expression profiles. More fundamental knowledge of the molecular basis and classification of disease could aid in the prediction of patient outcome, the informed selection of therapies, and identification of novel molecular targets for therapy. In this paper, we propose a new disease diagnostic method, the probabilistic boosting tree (PB tree) method, applied to gene expression profiles of lung tumors. It enables accurate disease classification and subtype discovery in disease. It automatically constructs a tree in which each node combines a number of weak classifiers into a strong classifier. Also, subtype discovery is naturally embedded in the learning process. Our algorithm achieves excellent diagnostic performance, and it is also capable of detecting the disease subtype based on the gene expression profile.
Affiliation(s)
- Lu-yong Wang
- Integrated Data Syst. Dept., Siemens Corp. Res., Princeton, NJ 08540, USA.
|
842
|
Yang TY. The simple classification of multiple cancer types using a small number of significant genes. Mol Diagn Ther 2007; 11:265-75. [PMID: 17705581 DOI: 10.1007/bf03256248] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022]
Abstract
BACKGROUND AND OBJECTIVE The problems involved in the classification of cancers have recently received a great deal of attention in the context of DNA microarrays. We propose a simple procedure for classifying or predicting the cancer types of test samples when multiple cancer types and many genes are present. METHOD The procedure sequentially combines a gene-sort algorithm and a predictive likelihood-based classifier. Genes that have homogeneous patterns of expression measurements across cancer types are of limited interest. Therefore, this algorithm orders genes on the basis of strong heterogeneous patterns. The proposed classifier then selects the first few genes, which are sufficient to classify most training samples correctly via cross validation. Test samples were classified using only the selected genes. RESULTS AND CONCLUSION This predictive likelihood-based classifier performs well and is simple to understand. Empirical examination revealed good classification accuracy using relatively few genes.
Affiliation(s)
- Tae Young Yang
- Department of Mathematics, Myongji University, Yongin, Kyonggi, Republic of Korea.
|
843
|
Hanczar B, Zucker JD, Henegar C, Saitta L. Feature construction from synergic pairs to improve microarray-based classification. Bioinformatics 2007; 23:2866-72. [DOI: 10.1093/bioinformatics/btm429] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
|
844
|
Zhang JG, Deng HW. Gene selection for classification of microarray data based on the Bayes error. BMC Bioinformatics 2007; 8:370. [PMID: 17915022 PMCID: PMC2089123 DOI: 10.1186/1471-2105-8-370] [Citation(s) in RCA: 63] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2007] [Accepted: 10/03/2007] [Indexed: 11/10/2022] Open
Abstract
Background With DNA microarray data, selecting a compact subset of discriminative genes from thousands of genes is a critical step for accurate classification of phenotypes for, e.g., disease diagnosis. Several widely used gene selection methods often select top-ranked genes according to their individual discriminative power in classifying samples into distinct categories, without considering correlations among genes. A limitation of these gene selection methods is that they may result in gene sets with some redundancy and yield an unnecessarily large number of candidate genes for classification analyses. Some recent studies show that incorporating gene-to-gene correlations into gene selection can remove redundant genes and improve classification accuracy. Results In this study, we propose a new method, Based Bayes error Filter (BBF), to select relevant genes and remove redundant genes in classification analyses of microarray data. The effectiveness and accuracy of this method are demonstrated through analyses of five publicly available microarray datasets. The results show that our gene selection method is capable of achieving better accuracies than previous studies, while being able to effectively select relevant genes, remove redundant genes and obtain efficient and small gene sets for sample classification purposes. Conclusion The proposed method can effectively identify a compact set of genes with high classification accuracy. This study also indicates that application of the Bayes error is a feasible and effective way for removing redundant genes in gene selection.
Affiliation(s)
- Ji-Gang Zhang
- Departments of Orthopedic Surgery and Basic Medical Science, School of Medicine, University of Missouri-Kansas City, 2411 Holmes Street, Kansas City, MO 64108, USA
| | - Hong-Wen Deng
- Laboratory of Molecular and Statistical Genetics, College of Life Sciences, Hunan Normal University, Changsha, Hunan 410081, P. R. China
- The Key Laboratory of Biomedical Information Engineering of Ministry of Education and Institute of Molecular Genetics, School of Life Science and Technology, Xi'an Jiaotong University, Xi'an 710049, P. R. China
- Departments of Orthopedic Surgery and Basic Medical Science, School of Medicine, University of Missouri-Kansas City, 2411 Holmes Street, Kansas City, MO 64108, USA
|
845
|
Dabney AR, Storey JD. Optimality driven nearest centroid classification from genomic data. PLoS One 2007; 2:e1002. [PMID: 17912341 PMCID: PMC1991588 DOI: 10.1371/journal.pone.0001002] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2007] [Accepted: 09/07/2007] [Indexed: 11/19/2022] Open
Abstract
Nearest-centroid classifiers have recently been successfully employed in high-dimensional applications, such as in genomics. A necessary step when building a classifier for high-dimensional data is feature selection. Feature selection is frequently carried out by computing univariate scores for each feature individually, without consideration for how a subset of features performs as a whole. We introduce a new feature selection approach for high-dimensional nearest centroid classifiers that instead is based on the theoretically optimal choice of a given number of features, which we determine directly here. This allows us to develop a new greedy algorithm to estimate this optimal nearest-centroid classifier with a given number of features. In addition, whereas the centroids are usually formed from maximum likelihood estimates, we investigate the applicability of high-dimensional shrinkage estimates of centroids. We apply the proposed method to clinical classification based on gene-expression microarrays, demonstrating that the proposed method can outperform existing nearest centroid classifiers.
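For orientation, a plain nearest-centroid classifier restricted to a chosen feature subset can be sketched as below. This is a generic illustration, not the authors' method: the function names are hypothetical, the centroids are ordinary class means rather than shrinkage estimates, distances are unscaled Euclidean, and the feature subset is supplied by the caller rather than found by the greedy search the abstract describes.

```python
import numpy as np

def fit_centroids(X, y, features):
    """Compute one mean vector per class over the selected feature columns."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c][:, features].mean(axis=0) for c in classes])
    return classes, centroids

def predict_nearest_centroid(X, classes, centroids, features):
    """Assign each sample to the class whose centroid is closest (squared Euclidean)."""
    Z = X[:, features]
    d2 = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[d2.argmin(axis=1)]
```

Because the prediction depends only on the columns listed in `features`, different candidate subsets can be compared by refitting and scoring on held-out samples.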
Affiliation(s)
- Alan R. Dabney
- Department of Statistics, Texas A&M University, College Station, Texas, United States of America
- * To whom correspondence should be addressed. E-mail: (AD); (JS)
| | - John D. Storey
- Department of Biostatistics, University of Washington, Seattle, Washington, United States of America
- Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- * To whom correspondence should be addressed. E-mail: (AD); (JS)
|
846
|
|
847
|
Escalera S, Pujol O, Radeva P. Boosted Landmarks of Contextual Descriptors and Forest-ECOC: A novel framework to detect and classify objects in cluttered scenes. Pattern Recognit Lett 2007. [DOI: 10.1016/j.patrec.2007.05.007] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
848
|
Xiong H, Zhang Y, Chen XW. Data-dependent kernel machines for microarray data classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2007; 4:583-595. [PMID: 17975270 DOI: 10.1109/tcbb.2007.1048] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/25/2023]
Abstract
One important application of gene expression analysis is to classify tissue samples according to their gene expression levels. Gene expression data are typically characterized by high dimensionality and small sample size, which makes the classification task quite challenging. In this paper, we present a data-dependent kernel for microarray data classification. This kernel function is engineered so that the class separability of the training data is maximized. A bootstrapping-based resampling scheme is introduced to reduce the possible training bias. The effectiveness of this adaptive kernel for microarray data classification is illustrated with a k-Nearest Neighbor (KNN) classifier. Our experimental study shows that the data-dependent kernel leads to a significant improvement in the accuracy of KNN classifiers. Furthermore, this kernel-based KNN scheme has been demonstrated to be competitive with, if not better than, more sophisticated classifiers such as Support Vector Machines (SVMs) and the Uncorrelated Linear Discriminant Analysis (ULDA) for classifying gene expression data.
|
849
|
Hendriks MMWB, Smit S, Akkermans WLMW, Reijmers TH, Eilers PHC, Hoefsloot HCJ, Rubingh CM, de Koster CG, Aerts JM, Smilde AK. How to distinguish healthy from diseased? Classification strategy for mass spectrometry-based clinical proteomics. Proteomics 2007; 7:3672-80. [PMID: 17880000 DOI: 10.1002/pmic.200700046] [Citation(s) in RCA: 48] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
SELDI-TOF-MS is rapidly gaining popularity as a screening tool for clinical applications of proteomics. Application of adequate statistical techniques in all the stages from measurement to information is obligatory. One of the statistical methods often used in proteomics is classification: the assignment of subjects to discrete categories, for example healthy or diseased. Lately, many new classification methods have been developed, often specifically for the analysis of X-omics data. For proteomics studies a good strategy for evaluating classification results is of prime importance, because usually the number of objects will be small and it would be wasteful to set aside part of these as a 'mere' test set. The present paper offers such a strategy in the form of a protocol which can be used for choosing among different statistical classification methods and obtaining figures of merit of their performance. This paper also illustrates the usefulness of proteomics in a clinical setting, using serum samples from Gaucher disease patients, when combined with an appropriate classification method.
|
850
|
|