1. Wong G, Leckie C, Kowalczyk A. FSR: feature set reduction for scalable and accurate multi-class cancer subtype classification based on copy number. Bioinformatics 2011; 28:151-9. [PMID: 22110244 DOI: 10.1093/bioinformatics/btr644] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 01/08/2023]
Abstract
MOTIVATION: Feature selection is a key concept in machine learning for microarray datasets, where the number of features, represented by probesets, is typically several orders of magnitude larger than the available sample size. Computational tractability is a key challenge for feature selection algorithms handling very high-dimensional datasets with more than a hundred thousand features, such as those produced on single nucleotide polymorphism (SNP) microarrays. In this article, we present a novel feature set reduction approach that enables scalable feature selection on datasets with hundreds of thousands of features and beyond. Our approach enables more efficient handling of higher-resolution datasets to achieve better disease subtype classification of samples for potentially more accurate diagnosis and prognosis, which allows clinicians to make more informed decisions regarding patient treatment options.
RESULTS: We applied our feature set reduction approach to several publicly available cancer SNP array datasets and evaluated its performance in terms of its multiclass predictive classification accuracy over different cancer subtypes, its speedup in execution, and its scalability with respect to sample size and array resolution. Feature Set Reduction (FSR) was able to reduce the dimensions of an SNP array dataset by more than two orders of magnitude while achieving at least equal, and in most cases superior, predictive classification performance compared with that achieved on features selected by existing feature selection methods alone. An examination of the biological relevance of frequently selected features from FSR-reduced feature sets revealed strong enrichment in association with cancer.
AVAILABILITY: FSR was implemented in MATLAB R2010b and is available at http://ww2.cs.mu.oz.au/~gwong/FSR.
Affiliation(s)
- Gerard Wong
  - National ICT Australia, Victoria Research Laboratory, Parkville, Australia.
2. Lytkin NI, McVoy L, Weitkamp JH, Aliferis CF, Statnikov A. Expanding the understanding of biases in development of clinical-grade molecular signatures: a case study in acute respiratory viral infections. PLoS One 2011; 6:e20662. [PMID: 21673802 PMCID: PMC3105991 DOI: 10.1371/journal.pone.0020662] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 12/03/2010] [Accepted: 05/06/2011] [Indexed: 01/21/2023] [Open access]
Abstract
BACKGROUND: The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease on an individual patient basis. These functions are predominantly enabled by molecular signatures, which are computational models for predicting phenotypes and other responses of interest from high-throughput assay data. Data analytics is a central component of molecular signature development and can jeopardize the entire process if conducted incorrectly. While exploratory data analysis may tolerate suboptimal protocols, clinical-grade molecular signatures are subject to vastly stricter requirements. Closing the gap between standards for exploratory versus clinically successful molecular signatures entails a thorough understanding of possible biases in the data analysis phase and developing strategies to avoid them.
METHODOLOGY AND PRINCIPAL FINDINGS: Using a recently introduced data-analytic protocol as a case study, we provide an in-depth examination of the poorly studied biases of data-analytic protocols related to signature multiplicity, biomarker redundancy, data preprocessing, and validation of signature reproducibility. The methodology and results presented in this work are aimed at expanding the understanding of these data-analytic biases that affect development of clinically robust molecular signatures.
CONCLUSIONS AND SIGNIFICANCE: Several recommendations follow from the current study. First, all molecular signatures of a phenotype should be extracted to the extent possible, in order to provide comprehensive and accurate grounds for understanding disease pathogenesis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Third, data preprocessing procedures should be designed so as not to bias biomarker selection. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.
Affiliation(s)
- Nikita I. Lytkin
  - Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America
- Lauren McVoy
  - Department of Pathology, New York University School of Medicine, New York, New York, United States of America
- Jörn-Hendrik Weitkamp
  - Division of Neonatology, Department of Pediatrics, Vanderbilt University School of Medicine and Monroe Carell Jr. Children's Hospital at Vanderbilt, Nashville, Tennessee, United States of America
- Constantin F. Aliferis
  - Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America
  - Department of Pathology, New York University School of Medicine, New York, New York, United States of America
  - Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, United States of America
- Alexander Statnikov
  - Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America
  - Department of Medicine, New York University School of Medicine, New York, New York, United States of America
3. Gray-McGuire C, Guda K, Adrianto I, Lin CP, Natale L, Potter JD, Newcomb P, Poole EM, Ulrich CM, Lindor N, Goode EL, Fridley BL, Jenkins R, Marchand LL, Casey G, Haile R, Hopper J, Jenkins M, Young J, Buchanan D, Gallinger S, Adams M, Lewis S, Willis J, Elston R, Markowitz SD, Wiesner GL. Confirmation of linkage to and localization of familial colon cancer risk haplotype on chromosome 9q22. Cancer Res 2010; 70:5409-18. [PMID: 20551049 PMCID: PMC2896448 DOI: 10.1158/0008-5472.can-10-0188] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.7] [Indexed: 12/18/2022]
Abstract
Genetic risk factors are important contributors to the development of colorectal cancer. Following the definition of a linkage signal at 9q22-31, we fine mapped this region in an independent collection of colon cancer families. We used a custom array of single-nucleotide polymorphisms (SNP) densely spaced across the candidate region, performing both single-SNP and moving-window association analyses to identify a colon neoplasia risk haplotype. Through this approach, we isolated the association effect to a five-SNP haplotype centered at 98.15 Mb on chromosome 9q. This haplotype is in strong linkage disequilibrium with the haplotype block containing HABP4 and may be a surrogate for the effect of this CD30 Ki-1 antigen. It is also in close proximity to GALNT12, also recently shown to be altered in colon tumors. We used a predictive modeling algorithm to show the contribution of this risk haplotype and surrounding candidate genes in distinguishing between colon cancer cases and healthy controls. The ability to replicate this finding, the strength of the haplotype association (odds ratio, 3.68), and the accuracy of our prediction model (approximately 60%) all strongly support the presence of a locus for familial colon cancer on chromosome 9q.
Affiliation(s)
- Courtney Gray-McGuire
  - Department of Arthritis and Immunology, Oklahoma Medical Research Foundation, Oklahoma City, Oklahoma
- Kishore Guda
  - Department of Medicine, Case Western Reserve University, Cleveland, Ohio
  - Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, Ohio
  - Howard Hughes Medical Institute, Chevy Chase, Maryland
- Indra Adrianto
  - Department of Arthritis and Immunology, Oklahoma Medical Research Foundation, Oklahoma City, Oklahoma
- Chee Paul Lin
  - Department of Arthritis and Immunology, Oklahoma Medical Research Foundation, Oklahoma City, Oklahoma
- Leanna Natale
  - Department of Medicine, Case Western Reserve University, Cleveland, Ohio
- John D. Potter
  - Cancer Prevention Program, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Polly Newcomb
  - Cancer Prevention Program, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Elizabeth M. Poole
  - Cancer Prevention Program, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Cornelia M. Ulrich
  - Cancer Prevention Program, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Noralane Lindor
  - Department of Medical Genetics, Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- Ellen L. Goode
  - Department of Health Science Research, Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- Brooke L. Fridley
  - Department of Health Science Research, Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- Robert Jenkins
  - Department of Medical Genetics, Laboratory Medicine and Pathology, Mayo Clinic, Rochester, Minnesota
- Loic Le Marchand
  - Epidemiology Program, Cancer Research Center of Hawaii, University of Hawaii, Honolulu, Hawaii
- Graham Casey
  - Keck School of Medicine, USC/Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, California
- Robert Haile
  - Department of Preventive Medicine, USC/Norris Comprehensive Cancer Center, University of Southern California, Los Angeles, California
- John Hopper
  - Melbourne School of Population Health, University of Melbourne, Victoria, Australia
- Mark Jenkins
  - Melbourne School of Population Health, University of Melbourne, Victoria, Australia
- Joanne Young
  - Division of Genetics and Population Health, Queensland Institute of Medical Research, Brisbane, Queensland, Australia
- Daniel Buchanan
  - Division of Genetics and Population Health, Queensland Institute of Medical Research, Brisbane, Queensland, Australia
- Steve Gallinger
  - Samuel Lunenfeld Research Institute, Toronto General Hospital, Toronto, Ontario
- Mark Adams
  - Department of Genetics, Case Western Reserve University, Cleveland, Ohio
- Susan Lewis
  - Department of Genetics, Case Western Reserve University, Cleveland, Ohio
- Joseph Willis
  - Department of Pathology, Case Western Reserve University, Cleveland, Ohio
  - Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, Ohio
  - Department of Pathology, and University Hospitals Case Medical Center, Cleveland, Ohio
- Robert Elston
  - Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, Ohio
  - Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, Ohio
- Sanford D. Markowitz
  - Department of Medicine, Case Western Reserve University, Cleveland, Ohio
  - Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, Ohio
  - Howard Hughes Medical Institute, Chevy Chase, Maryland
  - Department of Medicine, Hematology Oncology, and University Hospitals Case Medical Center, Cleveland, Ohio
- Georgia L. Wiesner
  - Department of Medicine, Case Western Reserve University, Cleveland, Ohio
  - Department of Genetics, Case Western Reserve University, Cleveland, Ohio
  - Case Comprehensive Cancer Center, Case Western Reserve University, Cleveland, Ohio
  - Center for Human Genetics, and University Hospitals Case Medical Center, Cleveland, Ohio
4. Lagani V, Montesanto A, Di Cianni F, Moreno V, Landi S, Conforti D, Rose G, Passarino G. A novel similarity-measure for the analysis of genetic data in complex phenotypes. BMC Bioinformatics 2009; 10 Suppl 6:S24. [PMID: 19534750 PMCID: PMC2697648 DOI: 10.1186/1471-2105-10-s6-s24] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/18/2022] [Open access]
Abstract
Background: Recent technological advances in DNA sequencing and genotyping have led to the accumulation of a remarkable quantity of data on genetic polymorphisms. However, the development of new statistical and computational tools for effective processing of these data has not been equally fast. In particular, the machine learning literature is limited to relatively few papers focused on the development and application of data mining methods for the analysis of genetic variability. Moreover, these papers apply to genetic data procedures that were developed for a different kind of analysis and do not take into account the peculiarities of population genetics. The aim of our study was to define a new similarity measure, specifically conceived for measuring the similarity between the genetic profiles of two groups of subjects (i.e., cases and controls), taking into account that genetic profiles are usually distributed in a population group according to the Hardy-Weinberg equilibrium.
Results: We set up a new kernel function consisting of a similarity measure between groups of subjects genotyped for numerous genetic loci. This measure weighs different genetic profiles according to the estimates of gene frequencies at Hardy-Weinberg equilibrium in the population. We named this function the "Hardy-Weinberg kernel". The effectiveness of the Hardy-Weinberg kernel was compared to the performance of the well-established linear kernel. We found that the Hardy-Weinberg kernel significantly outperformed the linear kernel in a number of experiments where we used either simulated data or real data.
Conclusion: The "Hardy-Weinberg kernel" reported here represents one of the first attempts at incorporating genetic knowledge into the definition of a kernel function designed for the analysis of genetic data. We show that the best performance of the "Hardy-Weinberg kernel" is observed when rare genotypes have different frequencies in cases and controls. The ability to capture the effect of rare genotypes on phenotypic traits might be a very important and useful feature, as most of the current statistical tools lose most of their statistical power when rare genotypes are involved in the susceptibility to the trait under study.
Affiliation(s)
- Vincenzo Lagani
  - Department of Electronics, Informatics and Systems, University of Calabria, Via Ponte Pietro Bucci 41C, 87036, Rende, Italy.