1
Chang D, Gupta VK, Hur B, Cobo-López S, Cunningham KY, Han NS, Lee I, Kronzer VL, Teigen LM, Karnatovskaia LV, Longbrake EE, Davis JM, Nelson H, Sung J. Gut Microbiome Wellness Index 2 enhances health status prediction from gut microbiome taxonomic profiles. Nat Commun 2024; 15:7447. [PMID: 39198444 PMCID: PMC11358288 DOI: 10.1038/s41467-024-51651-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2023] [Accepted: 08/09/2024] [Indexed: 09/01/2024] Open
Abstract
Recent advancements in translational gut microbiome research have revealed its crucial role in shaping predictive healthcare applications. Herein, we introduce the Gut Microbiome Wellness Index 2 (GMWI2), an enhanced version of our original GMWI prototype, designed as a standardized disease-agnostic health status indicator based on gut microbiome taxonomic profiles. Our analysis involves pooling existing 8069 stool shotgun metagenomes from 54 published studies across a global demographic landscape (spanning 26 countries and six continents) to identify gut taxonomic signals linked to disease presence or absence. GMWI2 achieves a cross-validation balanced accuracy of 80% in distinguishing healthy (no disease) from non-healthy (diseased) individuals and surpasses 90% accuracy for samples with higher confidence (i.e., outside the "reject option"). This performance exceeds that of the original GMWI model and traditional species-level α-diversity indices, indicating a more robust gut microbiome signature for differentiating between healthy and non-healthy phenotypes across multiple diseases. When assessed through inter-study validation and external validation cohorts, GMWI2 maintains an average accuracy of nearly 75%. Furthermore, by reevaluating previously published datasets, GMWI2 offers new insights into the effects of diet, antibiotic exposure, and fecal microbiota transplantation on gut health. Available as an open-source command-line tool, GMWI2 represents a timely, pivotal resource for evaluating health using an individual's unique gut microbial composition.
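The two accuracy figures quoted in this abstract (balanced accuracy on all samples, and accuracy on samples outside the "reject option") can be illustrated with a small sketch. It assumes, purely for illustration, that each sample receives a signed score where a positive value predicts "healthy", and that scores whose magnitude falls below a cutoff are treated as too uncertain to call. The function names, the sign convention, the cutoff, and the toy data are hypothetical and are not taken from the GMWI2 tool.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (True = healthy).

    Assumes both classes are present; an empty class would divide by zero.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t]
    neg = [p for t, p in zip(y_true, y_pred) if not t]
    sensitivity = sum(1 for p in pos if p) / len(pos)
    specificity = sum(1 for p in neg if not p) / len(neg)
    return (sensitivity + specificity) / 2


def evaluate_with_reject(y_true, scores, cutoff):
    """Balanced accuracy on all samples and on those outside the reject band.

    Assumed convention: score > 0 predicts healthy, and |score| < cutoff
    places the sample inside the reject option.
    """
    preds = [s > 0 for s in scores]
    overall = balanced_accuracy(y_true, preds)
    kept = [(t, p) for t, p, s in zip(y_true, preds, scores) if abs(s) >= cutoff]
    confident = balanced_accuracy([t for t, _ in kept], [p for _, p in kept])
    return overall, confident, len(kept)


# Toy labels and scores, invented for illustration only.
labels = [True, True, True, True, False, False, False, False]
scores = [2.1, 0.3, -0.2, 1.5, -1.8, 0.4, -0.1, -2.2]
overall, confident, n_kept = evaluate_with_reject(labels, scores, cutoff=1.0)
print(overall, confident, n_kept)
```

On this toy data, restricting the evaluation to confident samples raises the balanced accuracy while reducing the number of samples that receive a call, which is the trade-off a reject option introduces.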
Affiliation(s)
- Daniel Chang
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, USA
- Vinod K Gupta
  - Microbiomics Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Benjamin Hur
  - Microbiomics Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
- Sergio Cobo-López
  - Viral Information Institute, San Diego State University, San Diego, CA, USA
- Kevin Y Cunningham
  - Bioinformatics and Computational Biology Program, University of Minnesota, Minneapolis, MN, USA
- Nam Soo Han
  - Brain Korea 21 Center for Bio-Health Industry, Department of Food Science and Biotechnology, Chungbuk National University, Cheongju, South Korea
- Insuk Lee
  - Department of Biotechnology, Yonsei University, Seoul, South Korea
- Vanessa L Kronzer
  - Division of Rheumatology, Department of Medicine, Mayo Clinic, Rochester, MN, USA
- Levi M Teigen
  - Department of Food Science and Nutrition, University of Minnesota, St. Paul, MN, USA
- John M Davis
  - Division of Rheumatology, Department of Medicine, Mayo Clinic, Rochester, MN, USA
- Heidi Nelson
  - Emeritus, Department of Surgery, Mayo Clinic, Rochester, MN, USA
- Jaeyun Sung
  - Microbiomics Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, USA
  - Division of Rheumatology, Department of Medicine, Mayo Clinic, Rochester, MN, USA
  - Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
2
Chanda D, De D. Meta-analysis reveals obesity associated gut microbial alteration patterns and reproducible contributors of functional shift. Gut Microbes 2024; 16:2304900. [PMID: 38265338 PMCID: PMC10810176 DOI: 10.1080/19490976.2024.2304900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Accepted: 01/09/2024] [Indexed: 01/25/2024] Open
Abstract
Cohort-specific studies associating gut microbiota with obesity are often contradictory; thus, the replicability of the signature remains questionable. Moreover, the species that drive obesity-associated functional shifts and their replicability remain unexplored. We therefore aimed to address these questions by analyzing gut microbial metagenome sequencing data to develop an in-depth understanding of obese host-gut microbiota interactions using 3329 samples (Obese, n = 1494; Control, n = 1835) from 17 different countries, including both 16S rRNA gene and metagenomic sequence data. Fecal metagenomic data from diverse geographical locations were curated, profiled, and pooled using a machine learning-based approach to identify robust global signatures of obesity. Furthermore, gut microbial species and pathways were systematically integrated through the genomic content of the species to identify contributors to obesity-associated functional shifts. The community structure of the obese gut microbiome was evaluated, and a reproducible depletion of diversity was observed in the obese compared to the lean gut. From this, we infer that the loss of diversity in the obese gut is responsible for perturbations in the healthy microbial functional repertoire. We identified 25 highly predictive species and 37 pathway associations as signatures of obesity, which were validated with high accuracy (AUC, species: 0.85; pathway: 0.80) on an independent validation dataset. We observed a reduction in short-chain fatty acid (SCFA) producers (several Alistipes species, Odoribacter splanchnicus, etc.) and depletion of promoters of gut barrier integrity (Akkermansia muciniphila and Bifidobacterium longum) in obese guts. Our analysis underlines enrichment of SCFA and purine/pyrimidine biosynthesis and carbohydrate metabolism pathways in control individuals, and of amino acid, enzyme cofactor, and peptidoglycan biosynthesis pathways in obese individuals. We also mapped the contributors to important obesity-associated functional shifts and observed that these are both dataset-specific and shared across the datasets. In summary, a comprehensive analysis of diverse datasets unveils species specifically contributing to functional shifts and consistent gut microbial patterns associated with obesity.
Affiliation(s)
- Deep Chanda
  - Laboratory of Cellular Differentiation & Metabolic Disorder, Department of Biotechnology, National Institute of Technology, Durgapur, India
- Debojyoti De
  - Laboratory of Cellular Differentiation & Metabolic Disorder, Department of Biotechnology, National Institute of Technology, Durgapur, India
3
Chang D, Gupta VK, Hur B, Cobo-López S, Cunningham KY, Han NS, Lee I, Kronzer VL, Teigen LM, Karnatovskaia LV, Longbrake EE, Davis JM, Nelson H, Sung J. Gut Microbiome Wellness Index 2 for Enhanced Health Status Prediction from Gut Microbiome Taxonomic Profiles. bioRxiv 2023:2023.09.30.560294. [PMID: 37873265 PMCID: PMC10592848 DOI: 10.1101/2023.09.30.560294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Recent advancements in human gut microbiome research have revealed its crucial role in shaping innovative predictive healthcare applications. We introduce Gut Microbiome Wellness Index 2 (GMWI2), an advanced iteration of our original GMWI prototype, designed as a robust, disease-agnostic health status indicator based on gut microbiome taxonomic profiles. Our analysis involved pooling existing 8069 stool shotgun metagenome data across a global demographic landscape to effectively capture biological signals linking gut taxonomies to health. GMWI2 achieves a cross-validation balanced accuracy of 80% in distinguishing healthy (no disease) from non-healthy (diseased) individuals and surpasses 90% accuracy for samples with higher confidence (i.e., outside the "reject option"). The enhanced classification accuracy of GMWI2 outperforms both the original GMWI model and traditional species-level α-diversity indices, suggesting a more reliable tool for differentiating between healthy and non-healthy phenotypes using gut microbiome data. Furthermore, by reevaluating and reinterpreting previously published data, GMWI2 provides fresh insights into the established understanding of how diet, antibiotic exposure, and fecal microbiota transplantation influence gut health. Looking ahead, GMWI2 represents a timely pivotal tool for evaluating health based on an individual's unique gut microbial composition, paving the way for the early screening of adverse gut health shifts. GMWI2 is offered as an open-source command-line tool, ensuring it is both accessible to and adaptable for researchers interested in the translational applications of human gut microbiome science.
Affiliation(s)
- Daniel Chang
  - Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA
- Vinod K Gupta
  - Microbiome Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN 55905, USA
  - Division of Surgery Research, Department of Surgery, Mayo Clinic, Rochester, MN 55905, USA
- Benjamin Hur
  - Microbiome Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN 55905, USA
  - Division of Surgery Research, Department of Surgery, Mayo Clinic, Rochester, MN 55905, USA
- Sergio Cobo-López
  - Viral Information Institute, San Diego State University, San Diego, CA 92182, USA
- Kevin Y Cunningham
  - Bioinformatics and Computational Biology Program, University of Minnesota, Minneapolis, MN 55455, USA
- Nam Soo Han
  - Brain Korea 21 Center for Bio-Health Industry, Department of Food Science and Biotechnology, Chungbuk National University, Cheongju, South Korea
- Insuk Lee
  - Department of Biotechnology, Yonsei University, Seoul 03722, South Korea
- Vanessa L Kronzer
  - Division of Rheumatology, Department of Medicine, Mayo Clinic, Rochester, MN 55905, USA
- Levi M Teigen
  - Department of Food Science and Nutrition, University of Minnesota, St. Paul, MN 55108, USA
- Erin E Longbrake
  - Department of Neurology, Yale University, New Haven, CT 06510, USA
- John M Davis
  - Division of Rheumatology, Department of Medicine, Mayo Clinic, Rochester, MN 55905, USA
- Heidi Nelson
  - Emeritus, Department of Surgery, Mayo Clinic, Rochester, MN 55905, USA
- Jaeyun Sung
  - Microbiome Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN 55905, USA
  - Division of Surgery Research, Department of Surgery, Mayo Clinic, Rochester, MN 55905, USA
  - Division of Rheumatology, Department of Medicine, Mayo Clinic, Rochester, MN 55905, USA
4
Zhang Y, Patil P, Johnson WE, Parmigiani G. Robustifying genomic classifiers to batch effects via ensemble learning. Bioinformatics 2021; 37:1521-1527. [PMID: 33245114 DOI: 10.1093/bioinformatics/btaa986] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 02/15/2020] [Revised: 10/20/2020] [Accepted: 11/13/2020] [Indexed: 01/08/2023] Open
Abstract
MOTIVATION: Genomic data are often produced in batches due to practical restrictions, which may lead to unwanted variation in data caused by discrepancies across batches. Such 'batch effects' often have a negative impact on downstream biological analysis and need careful consideration. In practice, batch effects are usually addressed by specifically designed software, which merges the data from different batches, then estimates batch effects and removes them from the data. Here, we focus on classification and prediction problems, and propose a different strategy based on ensemble learning. We first develop prediction models within each batch, then integrate them through ensemble weighting methods.
RESULTS: We provide a systematic comparison between these two strategies using studies targeting diverse populations infected with tuberculosis. In one study, we simulated increasing levels of heterogeneity across random subsets of the study, which we treat as simulated batches. We then use the two methods to develop a genomic classifier for the binary indicator of disease status. We evaluate the accuracy of prediction in another independent study targeting a different population cohort. We observed that in independent validation, while merging followed by batch adjustment provides better discrimination at low levels of heterogeneity, our ensemble learning strategy achieves more robust performance, especially at high severity of batch effects. These observations provide practical guidelines for handling batch effects in the development and evaluation of genomic classifiers.
AVAILABILITY AND IMPLEMENTATION: The data underlying this article are available in the article and in its online supplementary material. Processed data are available in the GitHub repository with implementation code, at https://github.com/zhangyuqing/bea_ensemble.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
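The strategy described in this abstract, fitting one prediction model per batch and then combining the models through ensemble weights, can be sketched as follows. This is a minimal illustration under stated assumptions: the learner is a one-feature nearest-centroid classifier and each model's weight is proportional to its accuracy on the other batches. Both choices, and all function names and data, are hypothetical and are not the learners or weighting methods used in the paper.

```python
from statistics import mean


def fit_centroids(batch):
    """Fit a one-feature nearest-centroid model: the mean of x for each class."""
    return {label: mean(x for x, y in batch if y == label) for label in (0, 1)}


def predict_proba(model, x):
    """Soft vote for class 1: relative closeness of x to the class-1 centroid."""
    d0, d1 = abs(x - model[0]), abs(x - model[1])
    return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)


def accuracy(model, data):
    """Fraction of samples whose thresholded vote matches the label."""
    return mean(int((predict_proba(model, x) > 0.5) == bool(y)) for x, y in data)


def fit_ensemble(batches):
    """Train one model per batch and weight it by accuracy on the other batches.

    Assumes at least one model scores above zero, so the weights can be normalized.
    """
    models = [fit_centroids(b) for b in batches]
    raw = []
    for i, model in enumerate(models):
        others = [s for j, b in enumerate(batches) if j != i for s in b]
        raw.append(accuracy(model, others))
    total = sum(raw)
    return models, [r / total for r in raw]


def ensemble_predict(models, weights, x):
    """Weighted average of the per-batch soft votes, thresholded at 0.5."""
    return sum(w * predict_proba(m, x) for m, w in zip(models, weights)) > 0.5


# Two toy batches of (feature, label) pairs; batch_b is shifted by +1 to mimic a batch effect.
batch_a = [(1.0, 0), (1.2, 0), (3.0, 1), (3.2, 1)]
batch_b = [(2.0, 0), (2.2, 0), (4.0, 1), (4.2, 1)]
models, weights = fit_ensemble([batch_a, batch_b])
print(weights, ensemble_predict(models, weights, 3.8))
```

No batch is merged or adjusted here; each model only sees its own batch, and the batch shift is absorbed by averaging the models' votes rather than by correcting the data.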
Affiliation(s)
- Yuqing Zhang
  - Clinical Bioinformatics, Gilead Sciences, Inc., Foster City, CA 94404, USA
- Prasad Patil
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
- W Evan Johnson
  - Department of Biostatistics, Boston University School of Public Health, Boston, MA 02118, USA
  - Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA 02118, USA
- Giovanni Parmigiani
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
5
Zhang Y, Bernau C, Parmigiani G, Waldron L. The impact of different sources of heterogeneity on loss of accuracy from genomic prediction models. Biostatistics 2020; 21:253-268. [PMID: 30202918 DOI: 10.1093/biostatistics/kxy044] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/08/2018] [Revised: 07/22/2018] [Accepted: 08/04/2018] [Indexed: 11/13/2022] Open
Abstract
Cross-study validation (CSV) of prediction models is an alternative to traditional cross-validation (CV) in domains where multiple comparable datasets are available. Although many studies have noted potential sources of heterogeneity in genomic studies, to our knowledge none have systematically investigated their intertwined impacts on prediction accuracy across studies. We employ a hybrid parametric/non-parametric bootstrap method to realistically simulate publicly available compendia of microarray, RNA-seq, and whole metagenome shotgun microbiome studies of health outcomes. Three types of heterogeneity between studies are manipulated and studied: (i) imbalances in the prevalence of clinical and pathological covariates, (ii) differences in gene covariance that could be caused by batch, platform, or tumor purity effects, and (iii) differences in the "true" model that associates gene expression and clinical factors to outcome. We assess model accuracy, while altering these factors. Lower accuracy is seen in CSV than in CV. Surprisingly, heterogeneity in known clinical covariates and differences in gene covariance structure have very limited contributions in the loss of accuracy when validating in new studies. However, forcing identical generative models greatly reduces the within/across study difference. These results, observed consistently for multiple disease outcomes and omics platforms, suggest that the most easily identifiable sources of study heterogeneity are not necessarily the primary ones that undermine the ability to accurately replicate the accuracy of omics prediction models in new studies. Unidentified heterogeneity, such as could arise from unmeasured confounding, may be more important.
Affiliation(s)
- Yuqing Zhang
  - Graduate Program in Bioinformatics, Boston University, 24 Cummington Mall, Boston, MA, USA
- Christoph Bernau
  - Department of Medical Informatics, Biometry and Epidemiology, University of Munich, Marchioninistr. 15, Munich, Germany
- Giovanni Parmigiani
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, 3 Blackfan Cir, Boston, MA, USA
  - Department of Biostatistics, Harvard TH Chan School of Public Health, 677 Huntington Ave, Boston, MA, USA
- Levi Waldron
  - Graduate School of Public Health and Health Policy, Institute for Implementation Science in Population Health, City University of New York, 55 W 125th St, New York, NY, USA
6
Gupta VK, Kim M, Bakshi U, Cunningham KY, Davis JM, Lazaridis KN, Nelson H, Chia N, Sung J. A predictive index for health status using species-level gut microbiome profiling. Nat Commun 2020; 11:4635. [PMID: 32934239 PMCID: PMC7492273 DOI: 10.1038/s41467-020-18476-8] [Citation(s) in RCA: 119] [Impact Index Per Article: 29.8] [Received: 02/22/2020] [Accepted: 08/19/2020] [Indexed: 12/26/2022] Open
Abstract
Providing insight into one’s health status from a gut microbiome sample is an important clinical goal in current human microbiome research. Herein, we introduce the Gut Microbiome Health Index (GMHI), a biologically-interpretable mathematical formula for predicting the likelihood of disease independent of the clinical diagnosis. GMHI is formulated upon 50 microbial species associated with healthy gut ecosystems. These species are identified through a multi-study, integrative analysis on 4347 human stool metagenomes from 34 published studies across healthy and 12 different nonhealthy conditions, i.e., disease or abnormal bodyweight. When demonstrated on our population-scale meta-dataset, GMHI is the most robust and consistent predictor of disease presence (or absence) compared to α-diversity indices. Validation on 679 samples from 9 additional studies results in a balanced accuracy of 73.7% in distinguishing healthy from non-healthy groups. Our findings suggest that gut taxonomic signatures can predict health status, and highlight how data sharing efforts can provide broadly applicable discoveries. A biologically-interpretable and robust metric that provides insight into one’s health status from a gut microbiome sample is an important clinical goal in current human microbiome research. Herein, the authors introduce a species-level index that predicts the likelihood of having a disease.
Affiliation(s)
- Vinod K Gupta
  - Microbiome Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
  - Division of Surgery Research, Department of Surgery, Mayo Clinic, Rochester, MN, 55905, USA
- Minsuk Kim
  - Microbiome Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
  - Division of Surgery Research, Department of Surgery, Mayo Clinic, Rochester, MN, 55905, USA
- Utpal Bakshi
  - Microbiome Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
  - Division of Surgery Research, Department of Surgery, Mayo Clinic, Rochester, MN, 55905, USA
- Kevin Y Cunningham
  - Graduate Research Education Program (GREP), Mayo Clinic, Rochester, MN, 55905, USA
  - Department of Computer Science and Engineering, University of Minnesota Twin-Cities, Minneapolis, MN, 55455, USA
- John M Davis
  - Division of Rheumatology, Department of Medicine, Mayo Clinic, Rochester, MN, 55905, USA
- Konstantinos N Lazaridis
  - Division of Gastroenterology and Hepatology, Mayo Clinic College of Medicine and Science, Rochester, MN, 55905, USA
- Heidi Nelson
  - Emeritus Chair, Department of Surgery, Mayo Clinic, Rochester, MN, 55905, USA
- Nicholas Chia
  - Microbiome Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
  - Division of Surgery Research, Department of Surgery, Mayo Clinic, Rochester, MN, 55905, USA
- Jaeyun Sung
  - Microbiome Program, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
  - Division of Surgery Research, Department of Surgery, Mayo Clinic, Rochester, MN, 55905, USA
  - Division of Rheumatology, Department of Medicine, Mayo Clinic, Rochester, MN, 55905, USA
7
Affiliation(s)
- Lo-Bin Chang
  - Department of Statistics, The Ohio State University, Columbus, OH 43210-1326, U.S.A.
8
Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2019; 46:D8-D13. [PMID: 29140470 PMCID: PMC5753372 DOI: 10.1093/nar/gkx1095] [Citation(s) in RCA: 908] [Impact Index Per Article: 181.6] [Received: 09/26/2017] [Accepted: 11/09/2017] [Indexed: 12/26/2022] Open
Abstract
The National Center for Biotechnology Information (NCBI) provides a large suite of online resources for biological information and data, including the GenBank® nucleic acid sequence database and the PubMed database of citations and abstracts for published life science journals. The Entrez system provides search and retrieval operations for most of these data from 39 distinct databases. The E-utilities serve as the programming interface for the Entrez system. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. New resources released in the past year include PubMed Data Management, RefSeq Functional Elements, genome data download, variation services API, Magic-BLAST, QuickBLASTp, and Identical Protein Groups. Resources that were updated in the past year include the genome data viewer, a human genome resources page, Gene, virus variation, OSIRIS, and PubChem. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.
9
Affiliation(s)
- Meng Pan
  - Department of Optoelectronic Engineering, College of Science and Engineering, Jinan University, Guangzhou, Guangdong, PR China
- Jie Zhang
  - Department of Physics, College of Science and Engineering, Jinan University, Guangzhou, Guangdong, PR China
10
Abstract
This article considers replicability of the performance of predictors across studies. We suggest a general approach to investigating this issue, based on ensembles of prediction models trained on different studies. We quantify how the common practice of training on a single study accounts in part for the observed challenges in replicability of prediction performance. We also investigate whether ensembles of predictors trained on multiple studies can be combined, using unique criteria, to design robust ensemble learners trained upfront to incorporate replicability into different contexts and populations.
Affiliation(s)
- Prasad Patil
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02215
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115
- Giovanni Parmigiani
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02215
  - Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA 02115
11
Ghosh D, Funk CC, Caballero J, Shah N, Rouleau K, Earls JC, Soroceanu L, Foltz G, Cobbs CS, Price ND, Hood L. A Cell-Surface Membrane Protein Signature for Glioblastoma. Cell Syst 2017; 4:516-529.e7. [PMID: 28365151 DOI: 10.1016/j.cels.2017.03.004] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.9] [Received: 09/16/2015] [Revised: 09/08/2016] [Accepted: 03/03/2017] [Indexed: 02/08/2023]
Abstract
We present a systems strategy that facilitated the development of a molecular signature for glioblastoma (GBM), composed of 33 cell-surface transmembrane proteins. This molecular signature, GBMSig, was developed through the integration of cell-surface proteomics and transcriptomics from patient tumors in the REMBRANDT (n = 228) and TCGA (n = 547) datasets and can separate GBM patients from control individuals with a Matthews correlation coefficient of 0.87 in a lock-down test. Functionally, 17/33 GBMSig proteins are associated with transforming growth factor β signaling pathways, including CD47, SLC16A1, HMOX1, and MRC2. Knockdown of these genes impaired GBM invasion, reflecting their role in disease-perturbed changes in GBM. ELISA assays for a subset of GBMSig (CD44, VCAM1, HMOX1, and BIGH3) on 84 plasma specimens from multiple clinical sites revealed a high degree of separation of GBM patients from healthy control individuals (area under the receiver operating characteristic curve of 0.98). In addition, a classifier based on these four proteins differentiated blood samples taken before and after tumor resection, demonstrating potential clinical value as biomarkers.
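For readers unfamiliar with the Matthews correlation coefficient reported in this abstract, the standard definition for a binary confusion matrix can be computed as in the sketch below. The counts are hypothetical and chosen only to give a round result; they are not data from the study.

```python
from math import sqrt


def matthews_corrcoef(tp, fp, tn, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (chance level) to +1
    (perfect prediction). Returns 0.0 when any marginal total is zero,
    a common convention for the otherwise undefined case.
    """
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts for illustration only.
mcc = matthews_corrcoef(tp=45, fp=5, tn=45, fn=5)
print(round(mcc, 3))
```

Unlike plain accuracy, the coefficient uses all four cells of the confusion matrix, so it remains informative when patient and control groups differ in size.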
Affiliation(s)
- Cory C Funk
  - Institute for Systems Biology, Seattle, WA 98109, USA
- Nameeta Shah
  - The Ben and Catherine Ivy Center for Advanced Brain Tumor Treatment, Swedish Neuroscience Institute, Seattle, WA 98122, USA
- John C Earls
  - Institute for Systems Biology, Seattle, WA 98109, USA
  - Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
- Liliana Soroceanu
  - California Pacific Medical Center Research Institute, San Francisco, CA 94107, USA
- Greg Foltz
  - The Ben and Catherine Ivy Center for Advanced Brain Tumor Treatment, Swedish Neuroscience Institute, Seattle, WA 98122, USA
- Charles S Cobbs
  - The Ben and Catherine Ivy Center for Advanced Brain Tumor Treatment, Swedish Neuroscience Institute, Seattle, WA 98122, USA
- Nathan D Price
  - Institute for Systems Biology, Seattle, WA 98109, USA
  - Department of Computer Science and Engineering, University of Washington, Seattle, WA 98195, USA
- Leroy Hood
  - Institute for Systems Biology, Seattle, WA 98109, USA
12
Kim S, Jhong JH, Lee J, Koo JY. Meta-analytic support vector machine for integrating multiple omics data. BioData Min 2017; 10:2. [PMID: 28149325 PMCID: PMC5270233 DOI: 10.1186/s13040-017-0126-8] [Citation(s) in RCA: 82] [Impact Index Per Article: 11.7] [Received: 08/01/2016] [Accepted: 01/11/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: High-throughput microarray and sequencing data have recently been used extensively to monitor biomarkers and biological processes related to many diseases. In this setting, the support vector machine (SVM) has been widely and successfully used for gene selection in many applications. Despite the benefits of SVMs, analysis of a single small- or mid-sized dataset inevitably runs into problems of low reproducibility and statistical power. To address this problem, we propose a meta-analytic support vector machine (Meta-SVM) that can accommodate multiple omics data, making it possible to detect consensus genes associated with diseases across studies.
RESULTS: Experimental studies show that the Meta-SVM is superior to the existing meta-analysis method in detecting true signal genes. In real data applications, diverse omics data of breast cancer (TCGA) and mRNA expression data of lung disease (idiopathic pulmonary fibrosis; IPF) were analyzed. As a result, we identified gene sets consistently associated with the diseases across studies. In particular, the gene set identified from the TCGA omics data was significantly enriched in the ABC transporter pathways, which are well known to be critical for the breast cancer mechanism.
CONCLUSION: The Meta-SVM effectively achieves the purpose of meta-analysis by jointly leveraging multiple omics data, and facilitates identifying potential biomarkers and elucidating the disease process.
Affiliation(s)
- SungHwan Kim
  - Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea
  - Department of Statistics, Keimyung University, Dalseoku, Daegu, 42601 South Korea
- Jae-Hwan Jhong
  - Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea
- JungJun Lee
  - Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea
- Ja-Yong Koo
  - Department of Statistics, Korea University, Anam-dong, Seoul, 136-701 South Korea
13
Biales AD, Kostich MS, Batt AL, See MJ, Flick RW, Gordon DA, Lazorchak JM, Bencic DC. Initial development of a multigene 'omics-based exposure biomarker for pyrethroid pesticides. Aquat Toxicol 2016; 179:27-35. [PMID: 27564377 DOI: 10.1016/j.aquatox.2016.08.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/12/2016] [Revised: 08/02/2016] [Accepted: 08/05/2016] [Indexed: 06/06/2023]
Abstract
Omics technologies have long promised to address a number of long-standing issues related to environmental regulation. Despite considerable resource investment, there are few examples where these tools have been adopted by the regulatory community, in part because most studies focus on discovery rather than assay development. The current work describes the initial development of an omics-based assay using 48 h Pimephales promelas (fathead minnow, FHM) larvae for identifying aquatic exposures to pyrethroid pesticides. Larval FHM were exposed to seven concentrations of each of four pyrethroids (permethrin, cypermethrin, esfenvalerate and bifenthrin) in order to establish dose-response curves. Then, in three separate identical experiments, FHM were exposed to a single equitoxic concentration of each pyrethroid, corresponding to 33% of the calculated LC50. All exposures were separated by weeks and all materials were either cleaned or replaced between runs in an attempt to maintain independence among exposure experiments. Gene expression classifiers were developed using the random forest algorithm for each exposure and evaluated first by cross-validation using hold-out organisms from the same exposure experiment and then against test sets of each pyrethroid from separate exposure experiments. Bifenthrin-exposed organisms generated the highest quality classifier, demonstrating an empirical Area Under the Curve (eAUC) of 0.97 when tested against bifenthrin-exposed organisms from other exposure experiments and 0.91 against organisms exposed to any of the pyrethroids. An eAUC of 1.0 represents perfect classification with no false positives or negatives. Additionally, the bifenthrin classifier was able to successfully classify organisms from all other pyrethroid exposures at multiple concentrations, suggesting a potential utility for detecting cumulative exposures. Considerable run-to-run variability was observed both in exposure concentrations and molecular responses of exposed fish across exposure experiments. The application of a calibration step in analysis successfully corrected this, resulting in a significantly improved classifier. Classifier evaluation suggested the importance of considering a number of aspects of experimental design when developing an expression-based tool for general use in ecological monitoring and risk assessment, such as the inclusion of multiple experimental runs and high replicate numbers.
Affiliation(s)
- Adam D Biales
- US Environmental Protection Agency, National Exposure Research Laboratory, Cincinnati, OH 45268, United States
- Mitchell S Kostich
- US Environmental Protection Agency, National Exposure Research Laboratory, Cincinnati, OH 45268, United States
- Angela L Batt
- US Environmental Protection Agency, National Exposure Research Laboratory, Cincinnati, OH 45268, United States
- Mary J See
- US Environmental Protection Agency, National Exposure Research Laboratory, Cincinnati, OH 45268, United States
- Robert W Flick
- US Environmental Protection Agency, National Exposure Research Laboratory, Cincinnati, OH 45268, United States
- Denise A Gordon
- US Environmental Protection Agency, National Exposure Research Laboratory, Cincinnati, OH 45268, United States
- Jim M Lazorchak
- US Environmental Protection Agency, National Exposure Research Laboratory, Cincinnati, OH 45268, United States
- David C Bencic
- US Environmental Protection Agency, National Exposure Research Laboratory, Cincinnati, OH 45268, United States
|
14
|
Triple-layer dissection of the lung adenocarcinoma transcriptome: regulation at the gene, transcript, and exon levels. Oncotarget 2016; 6:28755-73. [PMID: 26356813 PMCID: PMC4745690 DOI: 10.18632/oncotarget.4810] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 02/27/2015] [Accepted: 08/21/2015] [Indexed: 12/30/2022]
Abstract
Lung adenocarcinoma is one of the deadliest human diseases. However, the molecular mechanisms underlying this disease, particularly RNA splicing, have remained underexplored. Here, we report a triple-level (gene-, transcript-, and exon-level) analysis of lung adenocarcinoma transcriptomes from 77 paired tumor and normal tissues, as well as an analysis pipeline to overcome genetic variability for accurate differentiation between tumor and normal tissues. We report three major results. First, more than 5,000 differentially expressed transcripts/exonic regions occur repeatedly in lung adenocarcinoma patients. These transcripts/exonic regions are enriched in nicotine metabolism and ribosomal functions in addition to the pathways enriched for differentially expressed genes (cell cycle, extracellular matrix receptor interaction, and axon guidance). Second, classification models based on rationally selected transcripts or exonic regions can reach accuracies of 0.93 to 1.00 in differentiating tumor from normal tissues. Of the 28 selected exonic regions, 26 correspond to alternative exons located in regulators such as a tumor suppressor (GDF10), a signal receptor (LYVE1), a vascular-specific regulator (RASIP1), a ubiquitination mediator (RNF5), and a transcriptional repressor (TRIM27). Third, classification systems based on 13 to 14 differentially expressed genes yield accuracies near 100%. Genes selected by both detection methods include C16orf59, DAP3, ETV4, GABARAPL1, PPAR, RADIL, RSPO1, SERTM1, SRPK1, ST6GALNAC6, and TNXB. Our findings imply a multilayered lung adenocarcinoma regulome in which transcript-/exon-level regulation may be dissociated from gene-level regulation. The method described here may be used to identify potentially important genes/transcripts/exonic regions for the tumorigenesis of lung adenocarcinoma and to construct accurate tumor vs. normal classification systems for this disease.
|
15
|
Kim S, Lin CW, Tseng GC. MetaKTSP: a meta-analytic top scoring pair method for robust cross-study validation of omics prediction analysis. Bioinformatics 2016; 32:1966-73. [PMID: 27153719 DOI: 10.1093/bioinformatics/btw115] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 02/18/2015] [Accepted: 02/19/2016] [Indexed: 01/08/2023]
Abstract
MOTIVATION: Supervised machine learning is widely applied to transcriptomic data to predict disease diagnosis, prognosis or survival. Robust and interpretable classifiers with high accuracy are usually favored for their clinical and translational potential. The top scoring pair (TSP) algorithm is one example: it applies a simple rank-based procedure to identify rank-altered gene pairs for classifier construction. Although many classification methods perform well in cross-validation within a single expression profile data set, performance usually drops sharply in cross-study validation (i.e. the prediction model is established in the training study and applied to an independent test study) for all machine learning methods, including TSP. The failure of cross-study validation has largely diminished the potential translational and clinical value of the models. The purpose of this article is to develop a meta-analytic top scoring pair (MetaKTSP) framework that combines multiple transcriptomic studies and generates a robust prediction model applicable to independent test studies.
RESULTS: We proposed two frameworks, one averaging TSP scores and one combining P-values from individual studies, to select the top gene pairs for model construction. We applied the proposed methods to simulated data sets and to three large-scale real applications in breast cancer, idiopathic pulmonary fibrosis and pan-cancer methylation. The results showed superior cross-study validation accuracy and biomarker selection for the new meta-analytic framework. In conclusion, combining multiple omics data sets in the public domain increases the robustness and accuracy of the classification model, which will ultimately improve disease understanding and clinical treatment decisions to benefit patients.
AVAILABILITY AND IMPLEMENTATION: An R package, MetaKTSP, is available online (http://tsenglab.biostat.pitt.edu/software.htm).
CONTACT: ctseng@pitt.edu
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
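The rank-based idea behind TSP, and the score-averaging variant of the meta-analytic framework, can be sketched roughly as follows. This is a simplified illustration and not the MetaKTSP package's implementation: the function names, the toy data, and the choice to average the signed score across studies (so that studies disagreeing in direction cancel out) are all assumptions made for the example.

```python
from itertools import combinations


def tsp_score(samples, labels, i, j):
    """Signed TSP score: P(x_i < x_j | label 0) - P(x_i < x_j | label 1)."""
    freq = {}
    for cls in (0, 1):
        rows = [s for s, y in zip(samples, labels) if y == cls]
        if not rows:
            raise ValueError("each class needs at least one sample")
        # Only the within-sample ordering of the two genes is used.
        freq[cls] = sum(1 for s in rows if s[i] < s[j]) / len(rows)
    return freq[0] - freq[1]


def top_pair_across_studies(studies, n_genes):
    """Return the gene pair with the largest absolute mean score over studies.

    `studies` is a list of (samples, labels) tuples; averaging the signed
    score is an illustrative assumption, not the published definition.
    """
    best_pair, best_score = None, -1.0
    for i, j in combinations(range(n_genes), 2):
        mean = sum(tsp_score(x, y, i, j) for x, y in studies) / len(studies)
        if abs(mean) > best_score:
            best_pair, best_score = (i, j), abs(mean)
    return best_pair, best_score


# Toy expression matrices (rows = samples, columns = genes); values are made up.
study_a = ([[1, 5, 3], [2, 6, 1], [7, 2, 4], [8, 1, 5]], [0, 0, 1, 1])
study_b = ([[0, 4, 9], [3, 9, 2], [6, 3, 1], [9, 0, 8]], [0, 0, 1, 1])
pair, score = top_pair_across_studies([study_a, study_b], n_genes=3)
print(pair, score)
```

Because the score depends only on the relative ordering of two genes within each sample, it is unaffected by monotone per-sample normalization, which is one reason rank-based pairs are attractive for cross-study use.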
Affiliation(s)
- SungHwan Kim
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA; Department of Statistics, Korea University, Seoul, South Korea
- Chien-Wei Lin
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA
- George C Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA, USA; Department of Computational and Systems Biology; Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA, USA
|