1
Whalen S, Schreiber J, Noble WS, Pollard KS. Navigating the pitfalls of applying machine learning in genomics. Nat Rev Genet 2022; 23:169-81. [PMID: 34837041 DOI: 10.1038/s41576-021-00434-9] [Citation(s) in RCA: 62] [Impact Index Per Article: 31.0] [Accepted: 10/28/2021] [Indexed: 11/08/2022]
Abstract
The scale of genetic, epigenomic, transcriptomic, cheminformatic and proteomic data available today, coupled with easy-to-use machine learning (ML) toolkits, has propelled the application of supervised learning in genomics research. However, the assumptions behind the statistical models and performance evaluations in ML software frequently are not met in biological systems. In this Review, we illustrate the impact of several common pitfalls encountered when applying supervised ML in genomics. We explore how the structure of genomics data can bias performance evaluations and predictions. To address the challenges associated with applying cutting-edge ML methods to genomics, we describe solutions and appropriate use cases where ML modelling shows great potential.
2
Abstract
Microarray is a high throughput discovery tool that has been broadly used for genomic research. Probe-target hybridization is the central concept of this technology to determine the relative abundance of nucleic acid sequences through fluorescence-based detection. In microarray experiments, variations of expression measurements can be attributed to many different sources that influence the stability and reproducibility of microarray platforms. Normalization is an essential step to reduce non-biological errors and to convert raw image data from multiple arrays (channels) to quality data for further analysis. In general, for the traditional microarray analysis, most established normalization methods are based on two assumptions: (1) the total number of target genes is large enough (>10,000); and (2) the expression level of the majority of genes is kept constant. However, microRNA (miRNA) arrays are usually spotted in low density, due to the fact that the total number of miRNAs is less than 2,000 and the majority of miRNAs are weakly or not expressed. As a result, normalization methods based on the above two assumptions are not applicable to miRNA profiling studies. In this review, we discuss a few representative microarray platforms on the market for miRNA profiling and compare the traditional methods with a few novel strategies specific for miRNA microarrays.
Affiliation(s)
- Bin Wang
- Department of Mathematics and Statistics, University of South Alabama, 411 University BLVD N, Room 325, Mobile, AL 36688, USA; E-Mail:
- Yaguang Xi
- Mitchell Cancer Institute, University of South Alabama, 1660 Springhill Avenue, Mobile, AL 36604, USA
- Author to whom correspondence should be addressed; E-Mail: ; Tel.: 1-251-445-9857; Fax: 1-251-460-6994
3
Abstract
High-throughput methods based on mass spectrometry (proteomics, metabolomics, lipidomics, etc.) produce a wealth of data that cannot be analyzed without computational methods. The impact of the choice of method on the overall result of a biological study is often underappreciated, but different methods can result in very different biological findings. It is thus essential to evaluate and compare the correctness and relative performance of computational methods. The volume of the data as well as the complexity of the algorithms render unbiased comparisons challenging. This paper discusses some problems and challenges in testing and validation of computational methods. We discuss the different types of data (simulated and experimental validation data) as well as different metrics to compare methods. We also introduce a new public repository for mass spectrometric reference data sets ( http://compms.org/RefData ) that contains a collection of publicly available data sets for performance evaluation for a wide range of different methods.
Affiliation(s)
- Laurent Gatto
- Computational Proteomics Unit and Cambridge Centre for Proteomics, University of Cambridge, Cambridge CB2 1QR, United Kingdom
- Kasper D Hansen
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland 21205, United States; Institute of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21205, United States
- Michael R Hoopmann
- Institute for Systems Biology, Seattle, Washington 98109, United States
- Henning Hermjakob
- European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom; National Center for Protein Sciences, Beijing, China
- Oliver Kohlbacher
- Quantitative Biology Center, Universität Tübingen, Auf der Morgenstelle 10, 72076 Tübingen, Germany; Center for Bioinformatics, Universität Tübingen, Sand 14, 72076 Tübingen, Germany; Dept. of Computer Science, Universität Tübingen, Sand 14, 72076 Tübingen, Germany; Biomolecular Interactions, Max Planck Institute for Developmental Biology, Spemannstr. 35, 72076 Tübingen, Germany
- Andreas Beyer
- CECAD, University of Cologne, 50931 Cologne, Germany
4
Pirim H, Ekşioğlu B, Perkins AD. Clustering high throughput biological data with B-MST, a minimum spanning tree based heuristic. Comput Biol Med 2015; 62:94-102. [DOI: 10.1016/j.compbiomed.2015.03.031] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 09/03/2014] [Revised: 03/04/2015] [Accepted: 03/31/2015] [Indexed: 10/23/2022]
5
Affiliation(s)
- Anne-Laure Boulesteix
- Institute for Medical Informatics, Biometry and Epidemiology, Ludwig Maximilians University, Munich, Germany
- * E-mail:
6
Yang A, Li Y, Tang N, Lin J. Bayesian variable selection in multinomial probit model for classifying high-dimensional data. Comput Stat 2015; 30:399-418. [DOI: 10.1007/s00180-014-0540-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/26/2022]
7
Reboiro-Jato M, Arrais JP, Oliveira JL, Fdez-Riverola F. geneCommittee: a web-based tool for extensively testing the discriminatory power of biologically relevant gene sets in microarray data classification. BMC Bioinformatics 2014; 15:31. [PMID: 24475928 PMCID: PMC3909759 DOI: 10.1186/1471-2105-15-31] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 11/18/2012] [Accepted: 01/27/2014] [Indexed: 11/10/2022]
Abstract
BACKGROUND The diagnosis and prognosis of several diseases can be shortened through the use of different large-scale genome experiments. In this context, microarrays can generate expression data for a huge set of genes. However, to obtain solid statistical evidence from the resulting data, it is necessary to train and validate many classification techniques in order to find the best discriminative method. This is a time-consuming process that normally depends on intricate statistical tools. RESULTS geneCommittee is a web-based interactive tool for routinely evaluating the discriminative classification power of custom hypotheses in the form of biologically relevant gene sets. While the user can work with different gene set collections and several microarray data files to configure specific classification experiments, the tool is able to run several tests in parallel. Provided with a straightforward and intuitive interface, geneCommittee is able to render valuable information for diagnostic analyses and clinical management decisions based on systematically evaluating custom hypotheses over different data sets using complementary classifiers, a key aspect in clinical research. CONCLUSIONS geneCommittee allows the enrichment of raw microarray data with gene functional annotations, producing integrated datasets that simplify the construction of better discriminative hypotheses, and allows the creation of a set of complementary classifiers. The trained committees can then be used for clinical research and diagnosis. Full documentation including common use cases and guided analysis workflows is freely available at http://sing.ei.uvigo.es/GC/.
Affiliation(s)
- Florentino Fdez-Riverola
- Escuela Superior de Ingeniería Informática, Universidade de Vigo, Campus Universitario As Lagoas s/n, 32004 Ourense, Spain.
8
Tong M, Li X, Wegener Parfrey L, Roth B, Ippoliti A, Wei B, Borneman J, McGovern DPB, Frank DN, Li E, Horvath S, Knight R, Braun J. A modular organization of the human intestinal mucosal microbiota and its association with inflammatory bowel disease. PLoS One 2013; 8:e80702. [PMID: 24260458 PMCID: PMC3834335 DOI: 10.1371/journal.pone.0080702] [Citation(s) in RCA: 127] [Impact Index Per Article: 11.5] [Received: 08/21/2013] [Accepted: 10/07/2013] [Indexed: 02/08/2023]
Abstract
Abnormalities of the intestinal microbiota are implicated in the pathogenesis of Crohn's disease (CD) and ulcerative colitis (UC), two spectra of inflammatory bowel disease (IBD). However, the high complexity and low inter-individual overlap of intestinal microbial composition are formidable barriers to identifying microbial taxa representing this dysbiosis. These difficulties might be overcome by an ecologic analytic strategy to identify modules of interacting bacteria (rather than individual bacteria) as quantitative reproducible features of microbial composition in normal and IBD mucosa. We sequenced 16S ribosomal RNA genes from 179 endoscopic lavage samples from different intestinal regions in 64 subjects (32 controls, 16 CD and 16 UC patients in clinical remission). CD and UC patients showed a reduction in phylogenetic diversity and shifts in microbial composition, comparable to previous studies using conventional mucosal biopsies. Analysis of the weighted co-occurrence network revealed 5 microbial modules. These modules were unprecedented, as they were detectable in all individuals, and their composition and abundance were recapitulated in an independent, biopsy-based mucosal dataset. Two modules were associated with healthy, CD, or UC disease states. Imputed metagenome analysis indicated that these modules displayed distinct metabolic functionality, specifically the enrichment of oxidative response and glycan metabolism pathways relevant to host-pathogen interaction in the disease-associated modules. The highly preserved microbial modules accurately classified IBD status of individual patients during disease quiescence, suggesting that microbial dysbiosis in IBD may be an underlying disorder independent of disease activity. Microbial modules thus provide an integrative view of microbial ecology relevant to IBD.
Affiliation(s)
- Maomeng Tong
- Department of Molecular and Medical Pharmacology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Xiaoxiao Li
- Cedars-Sinai F. Widjaja Inflammatory Bowel and Immunobiology Research Institute, Los Angeles, California, United States of America
- Laura Wegener Parfrey
- Department of Chemistry & Biochemistry, University of Colorado, Boulder, Colorado, United States of America
- Bennett Roth
- Department of Medicine, Division of Digestive Disease, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Andrew Ippoliti
- Cedars-Sinai F. Widjaja Inflammatory Bowel and Immunobiology Research Institute, Los Angeles, California, United States of America
- Bo Wei
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- James Borneman
- Department of Plant Pathology and Microbiology, University of California Riverside, Riverside, California, United States of America
- Dermot P. B. McGovern
- Cedars-Sinai F. Widjaja Inflammatory Bowel and Immunobiology Research Institute, Los Angeles, California, United States of America
- Daniel N. Frank
- Division of Infectious Diseases, University of Colorado, School of Medicine, Aurora, Colorado, United States of America
- Union Council, Denver Microbiome Research Consortium (MiRC), University of Colorado, School of Medicine, Aurora, Colorado, United States of America
- Ellen Li
- Department of Medicine, Stony Brook University, Stony Brook, New York, United States of America
- Steve Horvath
- Department of Human Genetics and Biostatistics, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Rob Knight
- Department of Chemistry & Biochemistry, University of Colorado, Boulder, Colorado, United States of America
- Howard Hughes Medical Institute, University of Colorado, Boulder, Colorado, United States of America
- Jonathan Braun
- Department of Molecular and Medical Pharmacology, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America
9
Khondoker M, Dobson R, Skirrow C, Simmons A, Stahl D. A comparison of machine learning methods for classification using simulation with multiple real data examples from mental health studies. Stat Methods Med Res 2013; 25:1804-1823. [PMID: 24047600 PMCID: PMC5081132 DOI: 10.1177/0962280213502437] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 11/16/2022]
Abstract
Background Recent literature on the comparison of machine learning methods has raised questions about the neutrality, unbiasedness and utility of many comparative studies. Reporting of results on favourable datasets and sampling error in the estimated performance measures based on single samples are thought to be the major sources of bias in such comparisons. Better performance in one or a few instances does not necessarily imply better performance on average or at the population level, and simulation studies may be a better alternative for objectively comparing the performances of machine learning algorithms. Methods We compare the classification performance of a number of important and widely used machine learning algorithms, namely Random Forests (RF), Support Vector Machines (SVM), Linear Discriminant Analysis (LDA) and k-Nearest Neighbour (kNN). Using massively parallel processing on high-performance supercomputers, we compare the generalisation errors at various combinations of levels of several factors: number of features, training sample size, biological variation, experimental variation, effect size, replication and correlation between features. Results For a smaller number of correlated features, with the number of features not exceeding approximately half the sample size, LDA was found to be the method of choice in terms of average generalisation errors as well as stability (precision) of error estimates. SVM (with RBF kernel) outperforms LDA as well as RF and kNN by a clear margin as the feature set gets larger, provided the sample size is not too small (at least 20). The performance of kNN also improves as the number of features grows and outplays that of LDA and RF unless the data variability is too high and/or effect sizes are too small. RF was found to outperform only kNN in some instances where the data are more variable and have smaller effect sizes, in which cases it also provides more stable error estimates than kNN and LDA. Applications to a number of real datasets supported the findings from the simulation study.
Affiliation(s)
- Mizanur Khondoker
- King's College London, Institute of Psychiatry, Department of Biostatistics, London, UK; King's College London, Institute of Psychiatry, NIHR Biomedical Research Centre for Mental Health at the South London and Maudsley NHS Foundation Trust, London, UK
- Richard Dobson
- King's College London, Institute of Psychiatry, NIHR Biomedical Research Centre for Mental Health at the South London and Maudsley NHS Foundation Trust, London, UK; King's College London, Institute of Psychiatry, NIHR Biomedical Research Unit for Dementia at the South London and Maudsley NHS Foundation Trust, London, UK
- Caroline Skirrow
- King's College London, Institute of Psychiatry, MRC Social, Genetic and Developmental Psychiatry Centre, UK
- Andrew Simmons
- King's College London, Institute of Psychiatry, NIHR Biomedical Research Centre for Mental Health at the South London and Maudsley NHS Foundation Trust, London, UK; King's College London, Institute of Psychiatry, NIHR Biomedical Research Unit for Dementia at the South London and Maudsley NHS Foundation Trust, London, UK
- Daniel Stahl
- King's College London, Institute of Psychiatry, Department of Biostatistics, London, UK
10
11
Boulesteix AL. On representative and illustrative comparisons with real data in bioinformatics: response to the letter to the editor by Smith et al. Bioinformatics 2013; 29:2664-6. [PMID: 23929033 DOI: 10.1093/bioinformatics/btt458] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 11/14/2022]
Affiliation(s)
- Anne-Laure Boulesteix
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, 81377 Munich, Germany
12
McHardy IH, Goudarzi M, Tong M, Ruegger PM, Schwager E, Weger JR, Graeber TG, Sonnenburg JL, Horvath S, Huttenhower C, McGovern DPB, Fornace AJ, Borneman J, Braun J. Integrative analysis of the microbiome and metabolome of the human intestinal mucosal surface reveals exquisite inter-relationships. Microbiome 2013; 1:17. [PMID: 24450808 PMCID: PMC3971612 DOI: 10.1186/2049-2618-1-17] [Citation(s) in RCA: 196] [Impact Index Per Article: 17.8] [Received: 03/31/2013] [Accepted: 05/12/2013] [Indexed: 05/10/2023]
Abstract
BACKGROUND Consistent compositional shifts in the gut microbiota are observed in IBD and other chronic intestinal disorders and may contribute to pathogenesis. The identities of microbial biomolecular mechanisms and metabolic products responsible for disease phenotypes remain to be determined, as do the means by which such microbial functions may be therapeutically modified. RESULTS The composition of the microbiota and metabolites in gut microbiome samples in 47 subjects were determined. Samples were obtained by endoscopic mucosal lavage from the cecum and sigmoid colon regions, and each sample was sequenced using the 16S rRNA gene V4 region (Illumina-HiSeq 2000 platform) and assessed by UPLC mass spectroscopy. Spearman correlations were used to identify widespread, statistically significant microbial-metabolite relationships. Metagenomes for identified microbial OTUs were imputed using PICRUSt, and KEGG metabolic pathway modules for imputed genes were assigned using HUMAnN. The resulting metabolic pathway abundances were mostly concordant with metabolite data. Analysis of the metabolome-driven distribution of OTU phylogeny and function revealed clusters of clades that were both metabolically and metagenomically similar. CONCLUSIONS The results suggest that microbes are syntropic with mucosal metabolome composition and therefore may be the source of and/or dependent upon gut epithelial metabolites. The consistent relationship between inferred metagenomic function and assayed metabolites suggests that metagenomic composition is predictive to a reasonable degree of microbial community metabolite pools. The finding that certain metabolites strongly correlate with microbial community structure raises the possibility of targeting metabolites for monitoring and/or therapeutically manipulating microbial community function in IBD and other chronic diseases.
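The Spearman correlation step named in this abstract can be illustrated with a minimal, standard-library sketch. It is not the authors' pipeline: the function names (`average_ranks`, `spearman`) and the toy abundance values are hypothetical, and the coefficient is computed in the usual way as a Pearson correlation of tie-averaged ranks.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)

# Hypothetical toy data: one OTU's relative abundance and one metabolite's
# intensity measured across the same five samples.
otu_abundance = [0.10, 0.40, 0.35, 0.80, 0.05]
metabolite_level = [1.2, 3.1, 2.9, 5.0, 0.7]
rho = spearman(otu_abundance, metabolite_level)
```

Because only ranks enter the calculation, the coefficient is unchanged by any monotone rescaling of either variable, which is one reason rank correlations are a common choice for abundance data measured on very different scales.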
Affiliation(s)
- Ian H McHardy
- Pathology and Laboratory Medicine UCLA, Los Angeles, CA, USA
- Maryam Goudarzi
- Biochemistry and Molecular and Cellular Biology, Georgetown University, Washington, DC, USA
- Maomeng Tong
- Molecular and Medical Pharmacology, UCLA, Los Angeles, CA, USA
- John R Weger
- Plant Pathology, UC Riverside, Riverside, CA, USA
- Dermot PB McGovern
- The F. Widjaja Family Foundation Inflammatory Bowel and Immunobiology Research Institute, Cedar's Sinai Medical Center, Los Angeles, CA, USA
- Albert J Fornace
- Biochemistry and Molecular and Cellular Biology, Georgetown University, Washington, DC, USA
- Jonathan Braun
- Pathology and Laboratory Medicine UCLA, Los Angeles, CA, USA
13
Abstract
MOTIVATION Complex diseases induce perturbations to interaction and regulation networks in living systems, resulting in dynamic equilibrium states that differ between diseases and from normal states. Thus identifying gene expression patterns corresponding to different equilibrium states is of great benefit to the diagnosis and treatment of complex diseases. However, it remains a major challenge to deal with the high dimensionality and small size of available complex disease gene expression datasets currently used for discovering gene expression patterns. RESULTS Here we present a phase-only correlation (POC) based classification method for recognizing the type of complex diseases. First, a virtual sample template is constructed for each subclass by averaging all samples of each subclass in a training dataset. Then the label of a test sample is determined by measuring the similarity between the test sample and each template. This novel method can detect the similarity of overall patterns emerging from the differentially expressed genes or proteins while ignoring small mismatches. CONCLUSIONS The experimental results obtained on seven publicly available complex disease datasets including microarray and protein array data demonstrate that the proposed POC-based disease classification method is effective and robust for diagnosing complex diseases with regard to the number of initially selected features, and its recognition accuracy is better than or comparable to other state-of-the-art machine learning methods. In addition, the proposed method does not require parameter tuning and data scaling, which can effectively reduce the occurrence of over-fitting and bias.
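The two steps described in this abstract (averaging each subclass into a template, then labelling a test sample by its similarity to each template) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: it uses a one-dimensional phase-only correlation, takes the peak of the correlation function as the similarity score, and relies on a naive DFT that is only suitable for short toy vectors. All function names and data values are hypothetical.

```python
import cmath

def dft(signal):
    """Naive O(n^2) discrete Fourier transform (fine for short toy vectors)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def inverse_dft(spectrum):
    """Naive inverse DFT matching dft() above."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def poc_peak(a, b, eps=1e-12):
    """Peak of the phase-only correlation between two equal-length profiles."""
    cross = [fa * fb.conjugate() for fa, fb in zip(dft(a), dft(b))]
    # Keep only the phase of each frequency component; drop near-zero components.
    phase = [c / abs(c) if abs(c) > eps else 0j for c in cross]
    return max(v.real for v in inverse_dft(phase))

def class_templates(training):
    """Average all training samples of each class into one template."""
    templates = {}
    for label, samples in training.items():
        n = len(samples)
        templates[label] = [sum(col) / n for col in zip(*samples)]
    return templates

def classify(sample, templates):
    """Assign the label whose template has the highest POC peak with the sample."""
    return max(templates, key=lambda label: poc_peak(sample, templates[label]))

# Hypothetical toy expression profiles (4 features per sample).
training = {
    "tumor": [[8, 1, 3, 0], [6, 3, 1, 2]],
    "normal": [[1, 2, 6, 7], [1, 4, 8, 7]],
}
templates = class_templates(training)
# The test sample is the tumor template multiplied by 2: only the phase is
# compared, so a positive rescaling does not change the score.
predicted = classify([14, 4, 4, 2], templates)
```

Discarding the magnitude of each frequency component is what makes the score insensitive to overall scaling of a profile, which is consistent with the abstract's remark that the method does not require data scaling.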
Affiliation(s)
- Shu-Lin Wang
- Applied Bioinformatics Laboratory, the University of Kansas, 2034 Becker Drive, Lawrence, KS 66047, USA
- Yaping Fang
- Applied Bioinformatics Laboratory, the University of Kansas, 2034 Becker Drive, Lawrence, KS 66047, USA
- Jianwen Fang
- Applied Bioinformatics Laboratory, the University of Kansas, 2034 Becker Drive, Lawrence, KS 66047, USA
14
Wang D, Zhang Y, Huang Y, Li P, Wang M, Wu R, Cheng L, Zhang W, Zhang Y, Li B, Wang C, Guo Z. Comparison of different normalization assumptions for analyses of DNA methylation data from the cancer genome. Gene 2012; 506:36-42. [PMID: 22771920 DOI: 10.1016/j.gene.2012.06.075] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 10/21/2011] [Revised: 06/21/2012] [Accepted: 06/22/2012] [Indexed: 01/02/2023]
Abstract
Some researchers normalize DNA methylation array data to remove the technical artifacts introduced by experimental differences in sample preparation, array processing and other factors. Others analyze DNA methylation arrays without normalization, reasoning that current normalization methods may distort real differences between normal and cancer samples, because cancer genomes may be extensively hypomethylated and the total amount of CpG methylation might differ substantially among samples. In this study, using eight datasets from the Infinium HumanMethylation27 assay, we systematically analyzed the global distribution of DNA methylation changes in cancer compared to normal controls and its effect on data normalization for selecting differentially methylated (DM) genes. We showed that more DM genes could be found in the Quantile/Lowess-normalized data than in the non-normalized data. We found that the DM genes additionally selected in the Quantile/Lowess-normalized data showed significantly consistent methylation states in another independent dataset for the same cancer, indicating that these extra DM genes were effective biological signals related to the disease. These results suggested that normalization can increase the power of detecting DM genes in the context of diagnostic markers, which are usually characterized by relatively large effect sizes. In addition, we evaluated the reproducibility of DM discoveries for a particular cancer type and found that most of the DM genes additionally detected in one dataset showed the same methylation directions in the other dataset for the same cancer type, indicating that these DM genes were effective biological signals in the other dataset. Furthermore, we showed that some DM genes detected from different studies for a particular cancer type were significantly reproducible at the functional level.
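For readers unfamiliar with the quantile normalization referred to in this abstract, the following is a minimal standard-library sketch of the general technique. It is not the code used in the study; the function name `quantile_normalize` and the toy values are hypothetical, and ties are broken by original order rather than averaged, which is a simplification.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length samples (one list of probe values per array).

    After normalization every sample has the same empirical distribution:
    the mean of the k-th smallest values across all samples.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same number of probes")
    # Reference distribution: average of the sorted samples, position by position.
    sorted_samples = [sorted(s) for s in samples]
    reference = [sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)]
    normalized = []
    for s in samples:
        # Rank each probe within its own sample (ties broken by original order).
        order = sorted(range(n), key=lambda i: s[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical toy data: two arrays, three probes each.
arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(arrays)
```

The sketch also makes the concern raised in the abstract concrete: because every array is forced onto one shared distribution, a genuine global shift between groups (for example, widespread hypomethylation in cancer samples) would be removed along with the technical variation.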
Affiliation(s)
- Dong Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China.
15
Abstract
We introduce a graph-theoretic approach to extract clusters and hierarchies in complex data-sets in an unsupervised and deterministic manner, without the use of any prior information. This is achieved by building topologically embedded networks containing the subset of most significant links and analyzing the network structure. For a planar embedding, this method provides both the intra-cluster hierarchy, which describes the way clusters are composed, and the inter-cluster hierarchy, which describes how clusters gather together. We discuss the performance, robustness and reliability of this method by first investigating several artificial data-sets, finding that it can significantly outperform other established approaches. Then we show that our method can successfully differentiate meaningful clusters and hierarchies in a variety of real data-sets. In particular, we find that the application to gene expression patterns of lymphoma samples uncovers biologically significant groups of genes which play key roles in diagnosis, prognosis and treatment of some of the most relevant human lymphoid malignancies.
Affiliation(s)
- Won-Min Song
- Applied Mathematics, Research School of Physics and Engineering, The Australian National University, Canberra, Australia
- T. Di Matteo
- Applied Mathematics, Research School of Physics and Engineering, The Australian National University, Canberra, Australia
- Department of Mathematics, King's College London, London, United Kingdom
- Tomaso Aste
- Applied Mathematics, Research School of Physics and Engineering, The Australian National University, Canberra, Australia
- School of Physical Sciences, University of Kent, Kent, United Kingdom
16
17
Webb-Robertson BJM, Matzke MM, Jacobs JM, Pounds JG, Waters KM. A statistical selection strategy for normalization procedures in LC-MS proteomics experiments through dataset-dependent ranking of normalization scaling factors. Proteomics 2011; 11:4736-41. [PMID: 22038874 DOI: 10.1002/pmic.201100078] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.2] [Received: 02/09/2011] [Revised: 08/04/2011] [Accepted: 10/03/2011] [Indexed: 11/07/2022]
Abstract
Quantification of LC-MS peak intensities assigned during peptide identification in a typical comparative proteomics experiment will deviate from run-to-run of the instrument due to both technical and biological variation. Thus, normalization of peak intensities across an LC-MS proteomics dataset is a fundamental step in pre-processing. However, the downstream analysis of LC-MS proteomics data can be dramatically affected by the normalization method selected. Current normalization procedures for LC-MS proteomics data are presented in the context of normalization values derived from subsets of the full collection of identified peptides. The distribution of these normalization values is unknown a priori. If they are not independent from the biological factors associated with the experiment the normalization process can introduce bias into the data, possibly affecting downstream statistical biomarker discovery. We present a novel approach to evaluate normalization strategies, which includes the peptide selection component associated with the derivation of normalization values. Our approach evaluates the effect of normalization on the between-group variance structure in order to identify the most appropriate normalization methods that improve the structure of the data without introducing bias into the normalized peak intensities.
18
Wang SL, Zhu YH, Jia W, Huang DS. Robust classification method of tumor subtype by using correlation filters. IEEE/ACM Trans Comput Biol Bioinform 2011; 9:580-591. [PMID: 22025761 DOI: 10.1109/tcbb.2011.135] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Indexed: 05/31/2023]
Abstract
Tumor classification based on gene expression profiles, which is of great benefit to the accurate diagnosis and personalized treatment of different types of tumor, has drawn great attention in recent years. This paper proposes a novel tumor classification method based on correlation filters to identify the overall pattern of tumor subtype hidden in differentially expressed genes. Concretely, two correlation filters, i.e., Minimum Average Correlation Energy (MACE) and Optimal Tradeoff Synthetic Discriminant Function (OTSDF), are introduced to determine whether a test sample matches the templates synthesized for each subclass. The experiments on six publicly available datasets indicate that the proposed method is robust to noise and can more effectively avoid the effects of the curse of dimensionality. Compared with many model-based methods, the correlation filter based method can achieve better performance when balanced training sets are exploited to synthesize the templates. In particular, the proposed method can detect the similarity of the overall pattern while ignoring small mismatches between the test sample and the synthesized template, and it performs well even if only a few training samples are available. More importantly, the experimental results can be visually represented, which is helpful for the further analysis of results.
Collapse
|
19
|
Zou J, Hong G, Guo X, Zhang L, Yao C, Wang J, Guo Z. Reproducible cancer biomarker discovery in SELDI-TOF MS using different pre-processing algorithms. PLoS One 2011; 6:e26294. [PMID: 22022591 DOI: 10.1371/journal.pone.0026294] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 06/07/2011] [Accepted: 09/24/2011] [Indexed: 12/14/2022]
Abstract
Background: There has been much interest in differentiating diseased and normal samples using biomarkers derived from mass spectrometry (MS) studies. However, biomarker identification for specific diseases has been hindered by irreproducibility. Specifically, a peak profile extracted from a dataset for biomarker identification depends on a data pre-processing algorithm. Until now, no widely accepted agreement has been reached. Results: In this paper, we investigated the consistency of biomarker identification using differentially expressed (DE) peaks from peak profiles produced by three widely used average spectrum-dependent pre-processing algorithms based on SELDI-TOF MS data for prostate and breast cancers. Our results revealed two important factors that affect the consistency of DE peak identification using different algorithms. One factor is that some DE peaks selected from one peak profile were not detected as peaks in other profiles, and the second factor is that the statistical power of identifying DE peaks in large peak profiles with many peaks may be low due to the large scale of the tests and small number of samples. Furthermore, we demonstrated that the DE peak detection power in large profiles could be improved by the stratified false discovery rate (FDR) control approach and that the reproducibility of DE peak detection could thereby be increased. Conclusions: Comparing and evaluating pre-processing algorithms in terms of reproducibility can elucidate the relationship among different algorithms and also help in selecting a pre-processing algorithm. The DE peaks selected from small peak profiles with few peaks for a dataset tend to be reproducibly detected in large peak profiles, which suggests that a suitable pre-processing algorithm should be able to produce peaks sufficient for identifying useful and reproducible biomarkers.
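The stratified FDR idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes the standard Benjamini-Hochberg step-up procedure applied separately within each stratum; the function names, the strata labels, and the p-values are hypothetical and are not taken from the paper, whose exact stratification scheme is not described here.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return sorted indices of hypotheses rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    n_reject = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value falls under its BH threshold.
        if pvalues[idx] <= rank * alpha / m:
            n_reject = rank
    return sorted(order[:n_reject])


def stratified_bh(pvalues, strata, alpha=0.05):
    """Apply BH independently inside each stratum (hypothetical helper)."""
    groups = {}
    for i, label in enumerate(strata):
        groups.setdefault(label, []).append(i)
    rejected = []
    for members in groups.values():
        local = benjamini_hochberg([pvalues[i] for i in members], alpha)
        rejected.extend(members[j] for j in local)
    return sorted(rejected)


# Made-up p-values: two peaks in a small stratum, eight in a large one.
pvals = [0.004, 0.03, 0.6, 0.7, 0.8, 0.9, 0.5, 0.65, 0.75, 0.85]
strata = ["small", "small"] + ["large"] * 8
pooled_hits = benjamini_hochberg(pvals)
stratified_hits = stratified_bh(pvals, strata)
```

In this toy input the pooled analysis rejects only the first hypothesis, while the stratified analysis also rejects the second, because the small stratum carries a lighter multiple-testing burden. Whether stratification helps on real data depends on how the strata are chosen.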
|
20
|
Abstract
MOTIVATION In small-sample settings, bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a non-parametric manner from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate for high-dimensional feature spaces. RESULTS This article computes an optimal kernel variance depending on the classification rule, sample size, model and feature space, both the original number and the number remaining after feature selection. A key point is that the optimal variance is robust relative to the model. This allows us to develop a method for selecting a suitable variance to use in real-world applications where the model is not known, but the other factors in determining the optimal kernel are known. AVAILABILITY Companion website at http://compbio.tgen.org/paper_supp/high_dim_bolstering. CONTACT edward@mail.ece.tamu.edu.
Affiliation(s)
- Chao Sima
- Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ, USA
|
21
|
Matzke MM, Waters KM, Metz TO, Jacobs JM, Sims AC, Baric RS, Pounds JG, Webb-Robertson BJM. Improved quality control processing of peptide-centric LC-MS proteomics data. Bioinformatics 2011; 27:2866-72. [PMID: 21852304 PMCID: PMC3187650 DOI: 10.1093/bioinformatics/btr479] [Citation(s) in RCA: 74] [Impact Index Per Article: 5.7] [Indexed: 01/12/2023]
Abstract
Motivation: In the analysis of differential peptide peak intensities (i.e. abundance measures), LC-MS analyses with poor quality peptide abundance data can bias downstream statistical analyses and hence the biological interpretation for an otherwise high-quality dataset. Although considerable effort has been placed on assuring the quality of the peptide identification with respect to spectral processing, to date quality assessment of the subsequent peptide abundance data matrix has been limited to a subjective visual inspection of run-by-run correlation or individual peptide components. Identifying statistical outliers is a critical step in the processing of proteomics data as many of the downstream statistical analyses [e.g. analysis of variance (ANOVA)] rely upon accurate estimates of sample variance, and their results are influenced by extreme values. Results: We describe a novel multivariate statistical strategy for the identification of LC-MS runs with extreme peptide abundance distributions. Comparison with the current method (run-by-run correlation) demonstrates a significantly better rate of identification of outlier runs by the multivariate strategy. Simulation studies also suggest that this strategy significantly outperforms correlation alone in the identification of statistically extreme liquid chromatography-mass spectrometry (LC-MS) runs. Availability: https://www.biopilot.org/docs/Software/RMD.php Contact: bj@pnl.gov Supplementary information: Supplementary material is available at Bioinformatics online.
|
22
|
Abstract
MOTIVATION There is growing discussion in the bioinformatics community concerning overoptimism of reported results. Two approaches contributing to overoptimism in classification are (i) the reporting of results on datasets for which a proposed classification rule performs well and (ii) the comparison of multiple classification rules on a single dataset that purports to show the advantage of a certain rule. RESULTS This article provides a careful probabilistic analysis of the second issue and the 'multiple-rule bias', resulting from choosing a classification rule having minimum estimated error on the dataset. It quantifies this bias corresponding to estimating the expected true error of the classification rule possessing minimum estimated error and it characterizes the bias from estimating the true comparative advantage of the chosen classification rule relative to the others by the estimated comparative advantage on the dataset. The analysis is applied to both synthetic and real data using a number of classification rules and error estimators. AVAILABILITY We have implemented in C code the synthetic data distribution model, classification rules, feature selection routines and error estimation methods. The code for multiple-rule analysis is implemented in MATLAB. The source code is available at http://gsp.tamu.edu/Publications/supplementary/yousefi11a/. Supplementary simulation results are also included.
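The multiple-rule bias described in this abstract can be made concrete with a small simulation. This is only a sketch under simplifying assumptions that are not from the paper: every rule has the same true error, and each rule's error estimate is an independent binomial proportion. The function and parameter names are hypothetical.

```python
import random


def min_estimated_error(n_rules, n_samples, true_error, rng):
    """Estimated error of the rule that looks best on one simulated dataset."""
    estimates = []
    for _ in range(n_rules):
        # Count simulated misclassifications for this rule.
        mistakes = sum(rng.random() < true_error for _ in range(n_samples))
        estimates.append(mistakes / n_samples)
    return min(estimates)


def average_min_error(n_rules=10, n_samples=40, true_error=0.3,
                      n_trials=2000, seed=0):
    """Average, over many simulated datasets, of the minimum estimated error."""
    rng = random.Random(seed)
    total = sum(
        min_estimated_error(n_rules, n_samples, true_error, rng)
        for _ in range(n_trials)
    )
    return total / n_trials


single_rule = average_min_error(n_rules=1)
best_of_ten = average_min_error(n_rules=10)
```

With one rule the average estimate stays close to the true error of 0.3, whereas reporting the best of ten equally good rules gives a noticeably lower average. That gap is the optimistic bias introduced purely by selecting the minimum.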
Affiliation(s)
- Mohammadmahdi R Yousefi
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, USA
|
23
|
Ma S, Kosorok MR, Huang J, Dai Y. Incorporating higher-order representative features improves prediction in network-based cancer prognosis analysis. BMC Med Genomics 2011; 4:5. [PMID: 21226928 PMCID: PMC3037289 DOI: 10.1186/1755-8794-4-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.2] [Received: 09/07/2010] [Accepted: 01/12/2011] [Indexed: 01/30/2023]
Abstract
BACKGROUND In cancer prognosis studies with gene expression measurements, an important goal is to construct gene signatures with predictive power. In this study, we describe the coordination among genes using the weighted coexpression network, where nodes represent genes and nodes are connected if the corresponding genes have similar expression patterns across samples. There are subsets of nodes, called modules, that are tightly connected to each other. In several published studies, it has been suggested that the first principal components of individual modules, also referred to as "eigengenes", may sufficiently represent the corresponding modules. RESULTS In this article, we refer to principal components and their functions as "representative features". We investigate higher-order representative features, which include the principal components other than the first ones and second order terms (quadratics and interactions). Two gradient thresholding methods are adopted for regularized estimation and feature selection. Analysis of six prognosis studies on lymphoma and breast cancer shows that incorporating higher-order representative features improves prediction performance over using eigengenes only. A simulation study further shows that prediction performance can be less satisfactory if the representative feature set is not properly chosen. CONCLUSIONS This study introduces multiple ways of defining the representative features and effective thresholding regularized estimation approaches. It provides convincing evidence that the higher-order representative features may have important implications for the prediction of cancer prognosis.
Affiliation(s)
- Shuangge Ma
- School of Public Health, Yale University, New Haven, CT, USA
- Michael R Kosorok
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Jian Huang
- Departments of Statistics and Actuarial Science, and Biostatistics, University of Iowa, Iowa City, IA, USA
- Ying Dai
- School of Public Health, Yale University, New Haven, CT, USA
|
24
|
Li Y, Wang N, Perkins EJ, Zhang C, Gong P. Identification and optimization of classifier genes from multi-class earthworm microarray dataset. PLoS One 2010; 5:e13715. [PMID: 21060837 DOI: 10.1371/journal.pone.0013715] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 06/28/2010] [Accepted: 10/06/2010] [Indexed: 11/19/2022]
Abstract
Monitoring, assessment and prediction of environmental risks that chemicals pose demand rapid and accurate diagnostic assays. A variety of toxicological effects have been associated with the explosive compounds TNT and RDX. One important goal of microarray experiments is to discover novel biomarkers for toxicity evaluation. We have developed an earthworm microarray containing 15,208 unique oligo probes and have used it to profile gene expression in 248 earthworms exposed to TNT, RDX or neither. We assembled a new machine learning pipeline consisting of several well-established feature filtering/selection and classification techniques to analyze the 248-array dataset in order to construct classifier models that can separate earthworm samples into three groups: control, TNT-treated, and RDX-treated. First, a total of 869 genes differentially expressed in response to TNT or RDX exposure were identified using a univariate statistical algorithm of class comparison. Then, decision tree-based algorithms were applied to select a subset of 354 classifier genes, which were ranked by their overall weight of significance. A multiclass support vector machine (MC-SVM) method and an unsupervised K-means clustering method were applied to independently refine the classifier, producing smaller subsets of 39 and 30 classifier genes, respectively, with 11 common genes being potential biomarkers. The combined 58 genes were considered the refined subset and used to build MC-SVM and clustering models with classification accuracy of 83.5% and 56.9%, respectively. This study demonstrates that the machine learning approach can be used to identify and optimize a small subset of classifier/biomarker genes from high dimensional datasets and generate classification models of acceptable precision for multiple classes.
|
25
|
Baker SG. Simple and flexible classification of gene expression microarrays via Swirls and Ripples. BMC Bioinformatics 2010; 11:452. [PMID: 20825641 PMCID: PMC2949887 DOI: 10.1186/1471-2105-11-452] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 06/10/2010] [Accepted: 09/08/2010] [Indexed: 11/23/2022]
Abstract
Background: A simple classification rule with few genes and parameters is desirable when applying a classification rule to new data. One popular simple classification rule, diagonal discriminant analysis, yields linear or curved classification boundaries, called Ripples, that are optimal when gene expression levels are normally distributed with the appropriate variance, but may yield poor classification in other situations. Results: A simple modification of diagonal discriminant analysis yields smooth highly nonlinear classification boundaries, called Swirls, that sometimes outperform Ripples. In particular, if the data are normally distributed with different variances in each class, Swirls substantially outperforms Ripples when using a pooled variance to reduce the number of parameters. The proposed classification rule for two classes selects either Swirls or Ripples after parsimoniously selecting the number of genes and distance measures. Applications to five cancer microarray data sets identified predictive genes related to the tissue organization theory of carcinogenesis. Conclusion: The parsimonious selection of classifiers coupled with the selection of either Swirls or Ripples provides a good basis for formulating a simple, yet flexible, classification rule. Open source software is available for download.
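As a point of reference for the diagonal discriminant analysis mentioned in this abstract, the following minimal sketch implements only the pooled-variance (linear boundary) case with equal class priors. It is not the Swirls modification and not the author's software; the function names and toy expression values are hypothetical.

```python
from statistics import mean


def fit_dlda(samples, labels):
    """Fit class centroids and a pooled per-gene variance (diagonal covariance)."""
    classes = sorted(set(labels))
    n_genes = len(samples[0])
    centroids = {
        c: [mean(x[j] for x, y in zip(samples, labels) if y == c)
            for j in range(n_genes)]
        for c in classes
    }
    # Pooled within-class variance, one value per gene.
    dof = len(samples) - len(classes)
    pooled = []
    for j in range(n_genes):
        ss = sum((x[j] - centroids[y][j]) ** 2 for x, y in zip(samples, labels))
        pooled.append(ss / dof)
    return centroids, pooled


def predict_dlda(model, x):
    """Assign x to the class with the smallest variance-scaled squared distance."""
    centroids, pooled = model

    def score(c):
        return sum((xj - mj) ** 2 / vj
                   for xj, mj, vj in zip(x, centroids[c], pooled))

    return min(centroids, key=score)


# Toy data: two genes, two classes that differ mainly in the first gene.
train = [[1.0, 5.0], [1.2, 5.5], [0.8, 4.5], [3.0, 5.2], [3.2, 4.8], [2.8, 5.0]]
classes = ["a", "a", "a", "b", "b", "b"]
model = fit_dlda(train, classes)
```

Calling `predict_dlda(model, [1.1, 5.1])` assigns the sample to class `"a"`. The sketch assumes every gene has nonzero pooled variance and that there are more samples than classes.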
Affiliation(s)
- Stuart G Baker
- Biometry Research Group, Division of Cancer Prevention, National Cancer Institute, EPN 3131, 6130 Executive Blvd MSC 7354, Bethesda, MD 20892-7354, USA.
|
26
|
Abstract
Development of high-throughput technologies makes it possible to survey the whole genome. Genomic studies have been extensively conducted, searching for markers with predictive power for prognosis of complex diseases such as cancer, diabetes and obesity. Most existing statistical analyses are focused on developing marker selection techniques, while little attention is paid to the underlying prognosis models. In this article, we review three commonly used prognosis models, namely the Cox, additive risk and accelerated failure time models. We conduct simulations and show that gene identification can be unsatisfactory under model misspecification. We analyze three cancer prognosis studies under the three models, and show that the gene identification results, prediction performance of all identified genes combined, and reproducibility of each identified gene are model-dependent. We suggest that in practical data analysis, more attention should be paid to the model assumption, and multiple models may need to be considered.
Affiliation(s)
- Shuangge Ma
- School of Public Health, Yale University, USA.
|
27
|
Abstract
MOTIVATION In statistical bioinformatics research, different optimization mechanisms potentially lead to 'over-optimism' in published papers. So far, however, a systematic critical study concerning the various sources underlying this over-optimism is lacking. RESULTS We present an empirical study on over-optimism using high-dimensional classification as example. Specifically, we consider a 'promising' new classification algorithm, namely linear discriminant analysis incorporating prior knowledge on gene functional groups through an appropriate shrinkage of the within-group covariance matrix. While this approach yields poor results in terms of error rate, we quantitatively demonstrate that it can artificially seem superior to existing approaches if we 'fish for significance'. The investigated sources of over-optimism include the optimization of datasets, of settings, of competing methods and, most importantly, of the method's characteristics. We conclude that, if the improvement of a quantitative criterion such as the error rate is the main contribution of a paper, the superiority of new algorithms should always be demonstrated on independent validation data. AVAILABILITY The R codes and relevant data can be downloaded from http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/overoptimism/, such that the study is completely reproducible.
Affiliation(s)
- Monika Jelizarow
- Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Munich, Germany
|
28
|
Bandyopadhyay N, Kahveci T, Goodison S, Sun Y, Ranka S. Pathway-Based Feature Selection Algorithm for Cancer Microarray Data. Adv Bioinformatics 2010; 2009:532989. [PMID: 20204186 PMCID: PMC2831238 DOI: 10.1155/2009/532989] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Received: 08/24/2009] [Accepted: 11/30/2009] [Indexed: 01/09/2023]
Abstract
Classification of cancers based on gene expression produces better accuracy than classification based on clinical markers. Feature selection improves the accuracy of these classification algorithms by reducing the chance of overfitting that happens due to the large number of features. We develop a new feature selection method called Biological Pathway-based Feature Selection (BPFS) for microarray data. Unlike most of the existing methods, our method integrates signaling and gene regulatory pathways with gene expression data to minimize the chance of overfitting of the method and to improve the test accuracy. Thus, BPFS selects a biologically meaningful feature set that is minimally redundant. Our experiments on published breast cancer datasets demonstrate that all of the top 20 genes found by our method are associated with cancer. Furthermore, the classification accuracy of our signature is up to 18% better than that of van't Veer's 70-gene signature, and up to 8% better than that of the best published feature selection method, I-RELIEF.
Affiliation(s)
- Nirmalya Bandyopadhyay
- Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
- Tamer Kahveci
- Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
- Steve Goodison
- Anderson Cancer Center Orlando, Cancer Research Institute Orlando, FL 32827, USA
- Y. Sun
- Interdisciplinary Center for Biotechnology Research, University of Florida, Gainesville, FL 32611, USA
- Sanjay Ranka
- Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
|
29
|
Abstract
Background: Prognosis is of critical interest in breast cancer research. Biomedical studies suggest that genomic measurements may have independent predictive power for prognosis. Gene profiling studies have been conducted to search for predictive genomic measurements. Genes have an inherent pathway structure, where pathways are composed of multiple genes with coordinated functions. The goal of this study is to identify gene pathways with predictive power for breast cancer prognosis. Since our goal is fundamentally different from that of existing studies, a new pathway analysis method is proposed. Results: The new method advances beyond existing alternatives in the following respects. First, it can assess the predictive power of gene pathways, whereas existing methods tend to focus on model fitting accuracy only. Second, it can account for the joint effects of multiple genes in a pathway, whereas existing methods tend to focus on the marginal effects of genes. Third, it can accommodate multiple heterogeneous datasets, whereas existing methods analyze a single dataset only. We analyze four breast cancer prognosis studies and identify 97 pathways with significant predictive power for prognosis. Important pathways missed by alternative methods are identified. Conclusions: The proposed method provides a useful alternative to existing pathway analysis methods. Identified pathways can provide further insights into breast cancer prognosis.
Affiliation(s)
- Shuangge Ma
- School of Public Health, Yale University, New Haven, CT 06520, USA.
|
30
|
|
31
|
|