1
Diaz-Garelli F, Johnson TR, Rahbar MH, Bernstam EV. Exploring the Hazards of Scaling Up Clinical Data Analyses: A Drug Side Effect Discovery Case Report. AMIA Joint Summits on Translational Science Proceedings 2021; 2021:180-189. [PMID: 34457132 PMCID: PMC8378643] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
We assessed the scalability of a pharmacological signal detection use case from a single-site clinical data warehouse (CDW) to a large aggregated clinical data warehouse (single-site database with 754,214 distinct patient IDs vs. multisite database with 49.8M). We aimed to explore whether a larger clinical dataset would provide clearer signals for secondary analyses such as detecting the known relationship between prednisone and weight. We found a significant weight gain rate using the single-site data but not using the aggregated data (0.0104 kg/day, p<0.0001 vs. -0.050 kg/day, p<0.0001). This rate was also found more consistently across 30 age and gender subgroups using the single-site data than in the aggregated data (26 vs. 18 significant weight gain findings). Contrary to our expectations, analyses of much larger aggregated clinical datasets did not yield stronger signals. Researchers must check the underlying model assumptions and account for greater heterogeneity when analyzing aggregated multisite data to ensure reliable findings.
Affiliation(s)
- Todd R Johnson
- The University of Texas Health Science Center at Houston, TX
2
Saxe GN, Ma S, Ren J, Aliferis C. Machine learning methods to predict child posttraumatic stress: a proof of concept study. BMC Psychiatry 2017; 17:223. [PMID: 28689495 PMCID: PMC5502325 DOI: 10.1186/s12888-017-1384-1] [Citation(s) in RCA: 43] [Impact Index Per Article: 5.4] [Received: 05/16/2016] [Accepted: 06/09/2017] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND The care of traumatized children would benefit significantly from accurate predictive models for Posttraumatic Stress Disorder (PTSD), using information available around the time of trauma. Machine Learning (ML) computational methods have yielded strong results in recent applications across many diseases and data types, yet they have not been previously applied to childhood PTSD. Since these methods have not been applied to this complex and debilitating disorder, there is a great deal that remains to be learned about their application. The first step is to prove the concept: Can ML methods - as applied in other fields - produce predictive classification models for childhood PTSD? Additionally, we seek to determine if specific variables can be identified - from the aforementioned predictive classification models - with putative causal relations to PTSD. METHODS ML predictive classification methods - with causal discovery feature selection - were applied to a data set of 163 children hospitalized with an injury; PTSD was determined three months after hospital discharge. At the time of hospitalization, 105 risk factor variables were collected spanning a range of biopsychosocial domains. RESULTS Seven percent of subjects had a high level of PTSD symptoms. A predictive classification model was discovered with significant predictive accuracy. A predictive model constructed based on subsets of potentially causally relevant features achieves similar predictivity compared to the best predictive model constructed with all variables. Causal Discovery feature selection methods identified 58 variables of which 10 were identified as most stable. CONCLUSIONS In this first proof-of-concept application of ML methods to predict childhood Posttraumatic Stress we were able to determine both predictive classification models for childhood PTSD and identify several causal variables. This set of techniques has great potential for enhancing the methodological toolkit in the field and future studies should seek to replicate, refine, and extend the results produced in this study.
Affiliation(s)
- Glenn N. Saxe
- Department of Child and Adolescent Psychiatry, New York University School of Medicine, One Park Avenue, New York, NY 10016 USA
- Sisi Ma
- Institute for Health Informatics and Department of Medicine, University of Minnesota, 330 Diehl Hall, MMC912, 420 Delaware Street S.E., Minneapolis, MN 55455 USA
- Jiwen Ren
- Department of Child and Adolescent Psychiatry and Center for Health Informatics and Bioinformatics, New York University School of Medicine, One Park Avenue, New York, NY 10016 USA
- Constantin Aliferis
- Institute for Health Informatics, Department of Medicine, and Data Science Program, University of Minnesota, Minneapolis, MN USA
- Department of Biostatistics, Vanderbilt University, Nashville, TN USA
3
Chang HW, Chiu YH, Kao HY, Yang CH, Ho WH. Comparison of classification algorithms with wrapper-based feature selection for predicting osteoporosis outcome based on genetic factors in a Taiwanese women population. Int J Endocrinol 2013; 2013:850735. [PMID: 23401685 PMCID: PMC3557627 DOI: 10.1155/2013/850735] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 10/26/2012] [Revised: 12/21/2012] [Accepted: 12/27/2012] [Indexed: 11/18/2022] Open
Abstract
An essential task in a genomic analysis of a human disease is limiting the number of strongly associated genes when studying susceptibility to the disease. The goal of this study was to compare computational tools with and without feature selection for predicting osteoporosis outcome in Taiwanese women based on genetic factors such as single nucleotide polymorphisms (SNPs). To elucidate relationships between osteoporosis and SNPs in this population, three classification algorithms were applied: multilayer feedforward neural network (MFNN), naive Bayes, and logistic regression. A wrapper-based feature selection method was also used to identify a subset of major SNPs. Experimental results showed that the MFNN model with the wrapper-based approach was the best predictive model for inferring disease susceptibility based on the complex relationship between osteoporosis and SNPs in Taiwanese women. The findings suggest that patients and doctors can use the proposed tool to enhance decision making based on clinical factors such as SNP genotyping data.
Affiliation(s)
- Hsueh-Wei Chang
- Department of Biomedical Science and Environmental Biology, Graduate Institute of Natural Products, College of Pharmacy, Cancer Center, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung 807, Taiwan
- Yu-Hsien Chiu
- Department of Healthcare Administration and Medical Informatics, Kaohsiung Medical University, Kaohsiung 807, Taiwan
- Hao-Yun Kao
- Department of Healthcare Administration and Medical Informatics, Kaohsiung Medical University, Kaohsiung 807, Taiwan
- Cheng-Hong Yang
- Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung 807, Taiwan
- Wen-Hsien Ho
- Department of Healthcare Administration and Medical Informatics, Kaohsiung Medical University, Kaohsiung 807, Taiwan
4
Statnikov A, Alekseyenko AV, Li Z, Henaff M, Perez-Perez GI, Blaser MJ, Aliferis CF. Microbiomic signatures of psoriasis: feasibility and methodology comparison. Sci Rep 2013; 3:2620. [PMID: 24018484 PMCID: PMC3965359 DOI: 10.1038/srep02620] [Citation(s) in RCA: 59] [Impact Index Per Article: 4.9] [Received: 07/01/2013] [Accepted: 08/22/2013] [Indexed: 01/21/2023] Open
Abstract
Psoriasis is a common chronic inflammatory disease of the skin. We sought to use bacterial community abundance data to assess the feasibility of developing multivariate molecular signatures for differentiation of cutaneous psoriatic lesions, clinically unaffected contralateral skin from psoriatic patients, and similar cutaneous loci in matched healthy control subjects. Using 16S rRNA high-throughput DNA sequencing, we assayed the cutaneous microbiome for 51 such matched specimen triplets including subjects of both genders, different age groups, ethnicities and multiple body sites. None of the subjects had recently received relevant treatments or antibiotics. We found that molecular signatures for the diagnosis of psoriasis result in significant accuracy ranging from 0.75 to 0.89 AUC, depending on the classification task. We also found a significant effect of DNA sequencing and downstream analysis protocols on the accuracy of molecular signatures. Our results demonstrate that it is feasible to develop accurate molecular signatures for the diagnosis of psoriasis from microbiomic data.
Affiliation(s)
- Alexander Statnikov
- Center for Health Informatics and Bioinformatics (CHIBI), New York University Langone Medical Center, New York, New York
- Department of Medicine, New York University School of Medicine, New York, New York
- Alexander V. Alekseyenko
- Center for Health Informatics and Bioinformatics (CHIBI), New York University Langone Medical Center, New York, New York
- Department of Medicine, New York University School of Medicine, New York, New York
- Zhiguo Li
- Center for Health Informatics and Bioinformatics (CHIBI), New York University Langone Medical Center, New York, New York
- Mikael Henaff
- Center for Health Informatics and Bioinformatics (CHIBI), New York University Langone Medical Center, New York, New York
- Guillermo I. Perez-Perez
- Department of Medicine, New York University School of Medicine, New York, New York
- Department of Microbiology, New York University School of Medicine, New York, New York
- Martin J. Blaser
- Department of Medicine, New York University School of Medicine, New York, New York
- Department of Microbiology, New York University School of Medicine, New York, New York
- Medical Service, Department of Veterans Affairs New York Harbor Healthcare System, New York, New York
- Constantin F. Aliferis
- Center for Health Informatics and Bioinformatics (CHIBI), New York University Langone Medical Center, New York, New York
- Department of Pathology, New York University School of Medicine, New York, New York
5
Regression of atherosclerosis is characterized by broad changes in the plaque macrophage transcriptome. PLoS One 2012; 7:e39790. [PMID: 22761902 PMCID: PMC3384622 DOI: 10.1371/journal.pone.0039790] [Citation(s) in RCA: 89] [Impact Index Per Article: 6.8] [Received: 01/19/2012] [Accepted: 05/29/2012] [Indexed: 01/19/2023] Open
Abstract
We have developed a mouse model of atherosclerotic plaque regression in which an atherosclerotic aortic arch from a hyperlipidemic donor is transplanted into a normolipidemic recipient, resulting in rapid elimination of cholesterol and monocyte-derived macrophage cells (CD68+) from transplanted vessel walls. To gain a comprehensive view of the differences in gene expression patterns in macrophages associated with regressing compared with progressing atherosclerotic plaque, we compared mRNA expression patterns in CD68+ macrophages extracted from plaque in aortic arches transplanted into normolipidemic or into hyperlipidemic recipients. In CD68+ cells from regressing plaque we observed that genes associated with the contractile apparatus responsible for cellular movement (e.g. actin and myosin) were up-regulated whereas genes related to cell adhesion (e.g. cadherins, vinculin) were down-regulated. In addition, CD68+ cells from regressing plaque were characterized by enhanced expression of genes associated with an anti-inflammatory M2 macrophage phenotype, including arginase I, CD163 and the C-lectin receptor. Our analysis suggests that in regressing plaque CD68+ cells preferentially express genes that reduce cellular adhesion, enhance cellular motility, and overall act to suppress inflammation.
6
Lytkin NI, McVoy L, Weitkamp JH, Aliferis CF, Statnikov A. Expanding the understanding of biases in development of clinical-grade molecular signatures: a case study in acute respiratory viral infections. PLoS One 2011; 6:e20662. [PMID: 21673802 PMCID: PMC3105991 DOI: 10.1371/journal.pone.0020662] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Received: 12/03/2010] [Accepted: 05/06/2011] [Indexed: 01/21/2023] Open
Abstract
BACKGROUND The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease, on an individual patient basis. These functions are predominantly enabled by molecular signatures, which are computational models for predicting phenotypes and other responses of interest from high-throughput assay data. Data-analytics is a central component of molecular signature development and can jeopardize the entire process if conducted incorrectly. While exploratory data analysis may tolerate suboptimal protocols, clinical-grade molecular signatures are subject to vastly stricter requirements. Closing the gap between standards for exploratory versus clinically successful molecular signatures entails a thorough understanding of possible biases in the data analysis phase and developing strategies to avoid them. METHODOLOGY AND PRINCIPAL FINDINGS Using a recently introduced data-analytic protocol as a case study, we provide an in-depth examination of the poorly studied biases of the data-analytic protocols related to signature multiplicity, biomarker redundancy, data preprocessing, and validation of signature reproducibility. The methodology and results presented in this work are aimed at expanding the understanding of these data-analytic biases that affect development of clinically robust molecular signatures. CONCLUSIONS AND SIGNIFICANCE Several recommendations follow from the current study. First, all molecular signatures of a phenotype should be extracted to the extent possible, in order to provide comprehensive and accurate grounds for understanding disease pathogenesis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Third, data preprocessing procedures should be designed so as not to bias biomarker selection. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.
Affiliation(s)
- Nikita I. Lytkin
- Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America
- Lauren McVoy
- Department of Pathology, New York University School of Medicine, New York, New York, United States of America
- Jörn-Hendrik Weitkamp
- Division of Neonatology, Department of Pediatrics, Vanderbilt University School of Medicine and Monroe Carell Jr. Children's Hospital at Vanderbilt, Nashville, Tennessee, United States of America
- Constantin F. Aliferis
- Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America
- Department of Pathology, New York University School of Medicine, New York, New York, United States of America
- Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, United States of America
- Alexander Statnikov
- Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, New York, United States of America
- Department of Medicine, New York University School of Medicine, New York, New York, United States of America
7
Alekseyenko AV, Lytkin NI, Ai J, Ding B, Padyukov L, Aliferis CF, Statnikov A. Causal graph-based analysis of genome-wide association data in rheumatoid arthritis. Biol Direct 2011; 6:25. [PMID: 21592391 PMCID: PMC3118953 DOI: 10.1186/1745-6150-6-25] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.2] [Received: 12/13/2010] [Accepted: 05/18/2011] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND GWAS owe their popularity to the expectation that they will make a major impact on diagnosis, prognosis and management of disease by uncovering genetics underlying clinical phenotypes. The dominant paradigm in GWAS data analysis so far consists of extensive reliance on methods that emphasize contribution of individual SNPs to statistical association with phenotypes. Multivariate methods, however, can extract more information by considering associations of multiple SNPs simultaneously. Recent advances in other genomics domains pinpoint multivariate causal graph-based inference as a promising principled analysis framework for high-throughput data. Designed to discover biomarkers in the local causal pathway of the phenotype, these methods lead to accurate and highly parsimonious multivariate predictive models. In this paper, we investigate the applicability of the causal graph-based method TIE* to analysis of GWAS data. To test the utility of TIE*, we focus on anti-CCP positive rheumatoid arthritis (RA) GWAS datasets, where there is a general consensus in the community about the major genetic determinants of the disease. RESULTS Application of TIE* to the North American Rheumatoid Arthritis Cohort (NARAC) GWAS data results in six SNPs, mostly from the MHC locus. Using these SNPs we develop two predictive models that can classify cases and disease-free controls with an accuracy of 0.81 area under the ROC curve, as verified in independent testing data from the same cohort. The predictive performance of these models generalizes reasonably well to Swedish subjects from the closely related but not identical Epidemiological Investigation of Rheumatoid Arthritis (EIRA) cohort with 0.71-0.78 area under the ROC curve. Moreover, the SNPs identified by the TIE* method render many other previously known SNP associations conditionally independent of the phenotype. CONCLUSIONS Our experiments demonstrate that application of TIE* captures the maximum amount of genetic information about RA in the data and recapitulates the major consensus findings about the genetic factors of this disease. In addition, TIE* yields reproducible markers and signatures of RA. This suggests that a principled multivariate causal and predictive framework for GWAS analysis empowers the community with a new tool for high-quality and more efficient discovery. REVIEWERS This article was reviewed by Prof. Anthony Almudevar, Dr. Eugene V. Koonin, and Prof. Marianthi Markatou.
Affiliation(s)
- Alexander V Alekseyenko
- Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, NY 10016, USA.
8
Tapia E, Ornella L, Bulacio P, Angelone L. Multiclass classification of microarray data samples with a reduced number of genes. BMC Bioinformatics 2011; 12:59. [PMID: 21342522 PMCID: PMC3056725 DOI: 10.1186/1471-2105-12-59] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 07/21/2010] [Accepted: 02/22/2011] [Indexed: 01/05/2023] Open
Abstract
Background Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. Results A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. Conclusions A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.
Affiliation(s)
- Elizabeth Tapia
- CIFASIS-Conicet Institute, Bv. 27 de Febrero 210 Bis, Rosario, Argentina.
9
Statnikov A, Lytkin NI, McVoy L, Weitkamp JH, Aliferis CF. Using gene expression profiles from peripheral blood to identify asymptomatic responses to acute respiratory viral infections. BMC Res Notes 2010; 3:264. [PMID: 20961438 PMCID: PMC2975649 DOI: 10.1186/1756-0500-3-264] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 06/07/2010] [Accepted: 10/20/2010] [Indexed: 12/01/2022] Open
Abstract
Background A recent study reported that gene expression profiles from peripheral blood samples of healthy subjects prior to viral inoculation were indistinguishable from profiles of subjects who received viral challenge but remained asymptomatic and uninfected. If true, this implies that the host immune response does not have a molecular signature. Given the high sensitivity of microarray technology, we were intrigued by this result and hypothesized that it was an artifact of data analysis. Findings Using acute respiratory viral challenge microarray data, we developed a molecular signature that for the first time allowed for an accurate differentiation between uninfected subjects prior to viral inoculation and subjects who remained asymptomatic after the viral challenge. Conclusions Our findings suggest that molecular signatures can be used to characterize immune responses to viruses and may improve our understanding of susceptibility to viral infection with possible implications for vaccine development.
Affiliation(s)
- Alexander Statnikov
- Center for Health Informatics and Bioinformatics, New York University School of Medicine, New York, NY 10016, USA.
10
Meder B, Keller A, Vogel B, Haas J, Sedaghat-Hamedani F, Kayvanpour E, Just S, Borries A, Rudloff J, Leidinger P, Meese E, Katus HA, Rottbauer W. MicroRNA signatures in total peripheral blood as novel biomarkers for acute myocardial infarction. Basic Res Cardiol 2010; 106:13-23. [PMID: 20886220 DOI: 10.1007/s00395-010-0123-2] [Citation(s) in RCA: 198] [Impact Index Per Article: 13.2] [Received: 05/28/2010] [Revised: 09/20/2010] [Accepted: 09/22/2010] [Indexed: 12/20/2022]
Abstract
MicroRNAs (miRNAs) are important regulators of adaptive and maladaptive responses in cardiovascular diseases and hence are considered to be potential therapeutic targets. However, their role as novel biomarkers for the diagnosis of cardiovascular diseases still needs to be systematically evaluated. We assessed here for the first time whole-genome miRNA expression in peripheral total blood samples of patients with acute myocardial infarction (AMI). We identified 121 miRNAs that are significantly dysregulated in AMI patients in comparison to healthy controls. Among these, miR-1291 and miR-663b show the highest sensitivity and specificity for the discrimination of cases from controls. Using a novel self-learning pattern recognition algorithm, we identified a unique signature of 20 miRNAs that predicts AMI with even higher power (specificity 96%, sensitivity 90%, and accuracy 93%). In addition, we show that miR-30c and miR-145 levels correlate with infarct sizes estimated by Troponin T release. The study presented here shows that single miRNAs, and especially miRNA signatures derived from peripheral blood, could be valuable novel biomarkers for cardiovascular diseases.
Affiliation(s)
- Benjamin Meder
- Department of Internal Medicine III, University of Heidelberg, Heidelberg, Germany
11
Guo Y, Graber A, McBurney RN, Balasubramanian R. Sample size and statistical power considerations in high-dimensionality data settings: a comparative study of classification algorithms. BMC Bioinformatics 2010; 11:447. [PMID: 20815881 PMCID: PMC2942858 DOI: 10.1186/1471-2105-11-447] [Citation(s) in RCA: 46] [Impact Index Per Article: 3.1] [Received: 02/18/2010] [Accepted: 09/03/2010] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Data generated using 'omics' technologies are characterized by high dimensionality, where the number of features measured per subject vastly exceeds the number of subjects in the study. In this paper, we consider issues relevant in the design of biomedical studies in which the goal is the discovery of a subset of features and an associated algorithm that can predict a binary outcome, such as disease status. We compare the performance of four commonly used classifiers (K-Nearest Neighbors, Prediction Analysis for Microarrays, Random Forests and Support Vector Machines) in high-dimensionality data settings. We evaluate the effects of varying levels of signal-to-noise ratio in the dataset, imbalance in class distribution and choice of metric for quantifying performance of the classifier. To guide study design, we present a summary of the key characteristics of 'omics' data profiled in several human or animal model experiments utilizing high-content mass spectrometry and multiplexed immunoassay based techniques. RESULTS The analysis of data from seven 'omics' studies revealed that the average magnitude of effect size observed in human studies was markedly lower when compared to that in animal studies. The data measured in human studies were characterized by higher biological variation and the presence of outliers. The results from simulation studies indicated that the classifier Prediction Analysis for Microarrays (PAM) had the highest power when the class conditional feature distributions were Gaussian and outcome distributions were balanced. Random Forests was optimal when feature distributions were skewed and when class distributions were unbalanced. We provide a free open-source R statistical software library (MVpower) that implements the simulation strategy proposed in this paper. CONCLUSION No single classifier had optimal performance under all settings. Simulation studies provide useful guidance for the design of biomedical studies involving high-dimensionality data.
Affiliation(s)
- Yu Guo
- BG Medicine, Inc., 610 Lincoln St., Waltham, MA 02451, USA
- Armin Graber
- Institute for Bioinformatics and Translational Research, UMIT, Eduard Wallnoefer Zentrum 1, 6060 Hall in Tyrol, Austria
- Robert N McBurney
- Optimal Medicine Ltd., Warwick Enterprise Park, Wellesbourne, Warwick CV35 9EF, UK
- Raji Balasubramanian
- Division of Biostatistics and Epidemiology, University of Massachusetts - Amherst, 715 North Pleasant Street, Amherst, MA 01003, USA
12
Statnikov A, McVoy L, Lytkin N, Aliferis CF. Improving development of the molecular signature for diagnosis of acute respiratory viral infections. Cell Host Microbe 2010; 7:100-1; author reply 102. [PMID: 20159615 PMCID: PMC2824607 DOI: 10.1016/j.chom.2010.01.003] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 11/12/2009] [Revised: 12/28/2009] [Accepted: 01/12/2010] [Indexed: 01/15/2023]
13
Huang LC, Hsu SY, Lin E. A comparison of classification methods for predicting Chronic Fatigue Syndrome based on genetic data. J Transl Med 2009; 7:81. [PMID: 19772600 PMCID: PMC2765429 DOI: 10.1186/1479-5876-7-81] [Citation(s) in RCA: 43] [Impact Index Per Article: 2.7] [Received: 06/23/2009] [Accepted: 09/22/2009] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND In genomic studies, it is essential to select a small number of genes that are more significant than the others for association studies of disease susceptibility. In this work, our goal was to compare computational tools with and without feature selection for predicting chronic fatigue syndrome (CFS) using genetic factors such as single nucleotide polymorphisms (SNPs). METHODS We employed the dataset from the previous study by the CDC Chronic Fatigue Syndrome Research Group. To uncover relationships between CFS and SNPs, we applied three classification algorithms: naive Bayes, the support vector machine algorithm, and the C4.5 decision tree algorithm. Furthermore, we utilized feature selection methods to identify a subset of influential SNPs. One was the hybrid feature selection approach combining the chi-squared and information-gain methods. The other was the wrapper-based feature selection method. RESULTS The naive Bayes model with the wrapper-based approach performed best among the predictive models for inferring disease susceptibility given the complex relationship between CFS and SNPs. CONCLUSION We demonstrated that our approach is a promising method to assess the associations between CFS and SNPs.
Affiliation(s)
- Lung-Cheng Huang
- Department of Psychiatry, National Taiwan University Hospital Yun-Lin Branch, Taiwan
- Graduate Institute of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Sen-Yen Hsu
- Department of Psychiatry, Chi Mei Medical Center, Liouying, Tainan, Taiwan
- Eugene Lin
- Vita Genomics, Inc, 7 Fl, No 6, Sec 1, Jung-Shing Road, Wugu Shiang, Taipei, Taiwan