1
Kępińska AP, Johnson JS, Huckins LM. Open Science Practices in Psychiatric Genetics: A Primer. BIOLOGICAL PSYCHIATRY GLOBAL OPEN SCIENCE 2024; 4:110-119. [PMID: 38298792 PMCID: PMC10829621 DOI: 10.1016/j.bpsgos.2023.08.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2023] [Revised: 08/04/2023] [Accepted: 08/11/2023] [Indexed: 02/02/2024]
Abstract
Open science ensures that research is transparently reported and freely accessible for all to assess and collaboratively build on. Psychiatric genetics has led among the health sciences in implementing some open science practices in common study designs, such as replication as part of genome-wide association studies. However, thorough open science implementation guidelines are limited and largely not specific to data, privacy, and research conduct challenges in psychiatric genetics. Here, we present a primer of open science practices, including selection of a research topic with patients/nonacademic collaborators, equitable authorship and citation practices, design of replicable, reproducible studies, preregistrations, open data, and privacy issues. We provide tips for informative figures and inclusive, precise reporting. We discuss considerations in working with nonacademic collaborators and distributing research through preprints, blogs, social media, and accessible lecture materials. Finally, we provide extra resources to support every step of the research process.
Affiliation(s)
- Adrianna P. Kępińska
  - Pamela Sklar Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, New York, New York
  - Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York
  - Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York
  - Department of Psychosis Studies, Institute of Psychiatry, Psychology & Neuroscience, King’s College London, London, United Kingdom
- Jessica S. Johnson
  - Pamela Sklar Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, New York, New York
  - Psychiatry Department, The University of North Carolina at Chapel Hill School of Medicine, Chapel Hill, North Carolina
- Laura M. Huckins
  - Pamela Sklar Division of Psychiatric Genomics, Icahn School of Medicine at Mount Sinai, New York, New York
  - Department of Psychiatry, Icahn School of Medicine at Mount Sinai, New York, New York
  - Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York
  - Department of Psychiatry, Yale University, New Haven, Connecticut
2
Cortés AJ, López-Hernández F, Blair MW. Genome-Environment Associations, an Innovative Tool for Studying Heritable Evolutionary Adaptation in Orphan Crops and Wild Relatives. Front Genet 2022; 13:910386. [PMID: 35991553 PMCID: PMC9389289 DOI: 10.3389/fgene.2022.910386] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 04/01/2022] [Accepted: 05/30/2022] [Indexed: 11/23/2022]
Abstract
Leveraging innovative tools to speed up prebreeding and discovery of genotypic sources of adaptation from landraces, crop wild relatives, and orphan crops is a key prerequisite to accelerate genetic gain of abiotic stress tolerance in annual crops such as legumes and cereals, many of which are still orphan species despite advances in major row crops. Here, we review a novel, interdisciplinary approach to combine ecological climate data with evolutionary genomics under the paradigm of a new field of study: genome-environment associations (GEAs). We first exemplify how GEA utilizes in situ georeferencing from genotypically characterized, gene bank accessions to pinpoint genomic signatures of natural selection. We later discuss the necessity to update the current GEA models to predict both regional- and local- or micro-habitat-based adaptation with mechanistic ecophysiological climate indices and cutting-edge GWAS-type genetic association models. Furthermore, to account for polygenic evolutionary adaptation, we encourage the community to start gathering genomic estimated adaptive values (GEAVs) for genomic prediction (GP) and multi-dimensional machine learning (ML) models. The latter two should ideally be weighted by de novo GWAS-based GEA estimates and optimized for a scalable marker subset. We end the review by envisioning avenues to make adaptation inferences more robust through the merging of high-resolution data sources, such as environmental remote sensing and summary statistics of the genomic site frequency spectrum, with the epigenetic molecular functionality responsible for plastic inheritance in the wild. Ultimately, we believe that coupling evolutionary adaptive predictions with innovations in ecological genomics such as GEA will help capture hidden genetic adaptations to abiotic stresses based on crop germplasm resources to assist responses to climate change. 
"I shall endeavor to find out how nature's forces act upon one another, and in what manner the geographic environment exerts its influence on animals and plants. In short, I must find out about the harmony in nature." Alexander von Humboldt, letter to Karl Freiesleben, June 1799.
Affiliation(s)
- Andrés J. Cortés
  - Corporacion Colombiana de Investigacion Agropecuaria AGROSAVIA, C.I. La Selva, Rionegro, Colombia
- Felipe López-Hernández
  - Corporacion Colombiana de Investigacion Agropecuaria AGROSAVIA, C.I. La Selva, Rionegro, Colombia
- Matthew W. Blair
  - Department of Agricultural & Environmental Sciences, Tennessee State University, Nashville, TN, United States
3
Yang JJ, Grissa D, Lambert CG, Bologa CG, Mathias SL, Waller A, Wild DJ, Jensen LJ, Oprea TI. TIGA: target illumination GWAS analytics. Bioinformatics 2021; 37:3865-3873. [PMID: 34086846 PMCID: PMC11025677 DOI: 10.1093/bioinformatics/btab427] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 10/31/2020] [Revised: 05/12/2021] [Accepted: 06/03/2021] [Indexed: 12/31/2022]
Abstract
MOTIVATION Genome-wide association studies can reveal important genotype-phenotype associations; however, data quality and interpretability issues must be addressed. For drug discovery scientists seeking to prioritize targets based on the available evidence, these issues go beyond the single study. RESULTS Here, we describe rational ranking, filtering and interpretation of inferred gene-trait associations and data aggregation across studies by leveraging existing curation and harmonization efforts. Each gene-trait association is evaluated for confidence, with scores derived solely from aggregated statistics, linking a protein-coding gene and phenotype. We propose a method for assessing confidence in gene-trait associations from evidence aggregated across studies, including a bibliometric assessment of scientific consensus based on the iCite relative citation ratio, and meanRank scores, to aggregate multivariate evidence. This method, intended for drug target hypothesis generation, scoring and ranking, has been implemented as an analytical pipeline, available as open source, with public datasets of results, and a web application designed for usability by drug discovery scientists. AVAILABILITY AND IMPLEMENTATION Web application, datasets and source code via https://unmtid-shinyapps.net/tiga/. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jeremy J Yang
  - Division of Translational Informatics, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA
  - Integrative Data Science Laboratory, School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN 47408, USA
- Dhouha Grissa
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark
- Christophe G Lambert
  - Division of Translational Informatics, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA
- Cristian G Bologa
  - Division of Translational Informatics, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA
- Stephen L Mathias
  - Division of Translational Informatics, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA
- Anna Waller
  - Department of Pathology, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA
- David J Wild
  - Integrative Data Science Laboratory, School of Informatics, Computing and Engineering, Indiana University, Bloomington, IN 47408, USA
- Lars Juhl Jensen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark
- Tudor I Oprea
  - Division of Translational Informatics, Department of Internal Medicine, University of New Mexico Health Sciences Center, Albuquerque, NM 87131, USA
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen 2200, Denmark
4
Sinke L, Cats D, Heijmans BT. Omixer: multivariate and reproducible sample randomization to proactively counter batch effects in omics studies. Bioinformatics 2021; 37:3051-3052. [PMID: 33693546 DOI: 10.1093/bioinformatics/btab159] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/29/2020] [Revised: 02/02/2021] [Accepted: 03/04/2021] [Indexed: 02/02/2023]
Abstract
MOTIVATION Batch effects heavily impact results in omics studies, causing bias and false positive results, but software to control them preemptively is lacking. Sample randomization prior to measurement is vital for minimizing these effects, but current approaches are often ad hoc, poorly documented and ill-equipped to handle multiple batches and outcomes. RESULTS We developed Omixer, a Bioconductor package implementing multivariate and reproducible sample randomization for omics studies. It proactively counters correlations between technical factors and biological variables of interest by optimizing sample distribution across batches. AVAILABILITY AND IMPLEMENTATION Omixer is available from Bioconductor at http://bioconductor.org/packages/release/bioc/html/Omixer.html. Scripts and data used to generate figures available upon request. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
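The general idea in this abstract, seeded randomization that keeps batch assignment uncorrelated with biological variables, can be illustrated with a minimal Python sketch. This is not Omixer's implementation or API (Omixer is an R/Bioconductor package). The function names `pearson` and `randomize_samples`, the use of Pearson correlation on a numeric batch index, and the example covariates are all assumptions made for illustration. Treating the batch index as numeric is a simplification that is most defensible with two batches.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / (sxx * syy) ** 0.5


def randomize_samples(covariates, n_batches, n_iter=1000, seed=0):
    """Return (layout, score): the shuffled batch assignment with the smallest
    maximum absolute correlation between batch index and any covariate.

    covariates: dict mapping a (hypothetical) variable name to a list of
    numeric values, one per sample. A fixed seed makes the result reproducible.
    """
    n = len(next(iter(covariates.values())))
    rng = random.Random(seed)
    base = [i % n_batches for i in range(n)]  # near-equal batch sizes
    best_layout, best_score = None, float("inf")
    for _ in range(n_iter):
        layout = base[:]
        rng.shuffle(layout)
        # Worst-case association between batch and any biological variable
        score = max(abs(pearson(layout, v)) for v in covariates.values())
        if score < best_score:
            best_layout, best_score = layout, score
    return best_layout, best_score


# Illustrative data only: eight samples, two covariates, two batches
covariates = {
    "age": [34, 51, 47, 29, 62, 55, 41, 38],
    "sex": [0, 1, 0, 1, 1, 0, 0, 1],
}
layout, score = randomize_samples(covariates, n_batches=2, n_iter=500, seed=42)
print(layout, round(score, 3))
```

Scoring each candidate layout by its worst covariate correlation is one simple choice. A real tool would likely use tests suited to categorical batches and support blocking constraints.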
Affiliation(s)
- Lucy Sinke
  - Molecular Epidemiology, Department of Biomedical Data Science, Leiden University Medical Centre, Leiden 2333 ZC, The Netherlands
- Davy Cats
  - Molecular Epidemiology, Department of Biomedical Data Science, Leiden University Medical Centre, Leiden 2333 ZC, The Netherlands
- Bastiaan T Heijmans
  - Molecular Epidemiology, Department of Biomedical Data Science, Leiden University Medical Centre, Leiden 2333 ZC, The Netherlands
5
Cortés AJ, López-Hernández F. Harnessing Crop Wild Diversity for Climate Change Adaptation. Genes (Basel) 2021; 12:783. [PMID: 34065368 PMCID: PMC8161384 DOI: 10.3390/genes12050783] [Citation(s) in RCA: 49] [Impact Index Per Article: 12.3] [Received: 03/29/2021] [Revised: 04/28/2021] [Accepted: 05/19/2021] [Indexed: 12/20/2022]
Abstract
Warming and drought are reducing global crop production with a potential to substantially worsen global malnutrition. As with the green revolution in the last century, plant genetics may offer concrete opportunities to increase yield and crop adaptability. However, the rate at which the threat is happening requires powering new strategies in order to meet the global food demand. In this review, we highlight major recent 'big data' developments from both empirical and theoretical genomics that may speed up the identification, conservation, and breeding of exotic and elite crop varieties with the potential to feed humans. We first emphasize the major bottlenecks to capture and utilize novel sources of variation in abiotic stress (i.e., heat and drought) tolerance. We argue that adaptation of crop wild relatives to dry environments could be informative on how plant phenotypes may react to a drier climate because natural selection has already tested more options than humans ever will. Because isolated pockets of cryptic diversity may still persist in remote semi-arid regions, we encourage new habitat-based population-guided collections for genebanks. We continue discussing how to systematically study abiotic stress tolerance in these crop collections of wild and landraces using geo-referencing and extensive environmental data. By uncovering the genes that underlie the tolerance adaptive trait, natural variation has the potential to be introgressed into elite cultivars. However, unlocking adaptive genetic variation hidden in related wild species and early landraces remains a major challenge for complex traits that, as abiotic stress tolerance, are polygenic (i.e., regulated by many low-effect genes). Therefore, we finish prospecting modern analytical approaches that will serve to overcome this issue. 
Concretely, genomic prediction, machine learning, and multi-trait gene editing, all offer innovative alternatives to speed up more accurate pre- and breeding efforts toward the increase in crop adaptability and yield, while matching future global food demands in the face of increased heat and drought. In order for these 'big data' approaches to succeed, we advocate for a trans-disciplinary approach with open-source data and long-term funding. The recent developments and perspectives discussed throughout this review ultimately aim to contribute to increased crop adaptability and yield in the face of heat waves and drought events.
Affiliation(s)
- Andrés J. Cortés
  - Corporación Colombiana de Investigación Agropecuaria AGROSAVIA, C.I. La Selva, Km 7 Vía Rionegro, Las Palmas, Rionegro 054048, Colombia
  - Departamento de Ciencias Forestales, Facultad de Ciencias Agrarias, Universidad Nacional de Colombia, Sede Medellín, Medellín 050034, Colombia
- Felipe López-Hernández
  - Corporación Colombiana de Investigación Agropecuaria AGROSAVIA, C.I. La Selva, Km 7 Vía Rionegro, Las Palmas, Rionegro 054048, Colombia
6
Lin X. Learning Lessons on Reproducibility and Replicability in Large Scale Genome-Wide Association Studies. HARVARD DATA SCIENCE REVIEW 2020; 2:10.1162/99608f92.33703976. [PMID: 38362534 PMCID: PMC10869125 DOI: 10.1162/99608f92.33703976] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/17/2024]
Abstract
Reproducibility and replicability play a pivotal role in science. The article reflects on reproducibility and replicability as they figure in large scale genome-wide association studies. Overall, we emphasize the importance of enhancing data reproducibility, analysis reproducibility, and result replicability. We make recommendations pertaining to the development of study designs that address 1) batch effects and selection bias, 2) the incorporation of discrete discovery and replication phases, and 3) the procurement of a large sample size. We emphasize the importance of systematic and transparent data generation, processing, and quality control pipelines, as well as a rigorous field-specific standardized analysis protocol. We offer guidance with respect to collaborative frameworks, open access analysis tools, and software, and the use of supporting mandates, infrastructure, and repositories for data and resource sharing. Finally, we identify the role of incentives and culture in fueling the production of reproducible and replicable research through partnerships of researchers, funding agencies, and journals.
Affiliation(s)
- Xihong Lin
  - Department of Biostatistics and Department of Statistics, Harvard University
7
Cortés AJ, López-Hernández F, Osorio-Rodriguez D. Predicting Thermal Adaptation by Looking Into Populations' Genomic Past. Front Genet 2020; 11:564515. [PMID: 33101385 PMCID: PMC7545011 DOI: 10.3389/fgene.2020.564515] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Received: 05/21/2020] [Accepted: 08/24/2020] [Indexed: 12/18/2022]
Abstract
Molecular evolution offers an insightful theory to interpret the genomic consequences of thermal adaptation to previous events of climate change beyond range shifts. However, disentangling often mixed footprints of selective and demographic processes from those due to lineage sorting, recombination rate variation, and genomic constraints is not trivial. Therefore, here we condense current and historical population genomic tools to study thermal adaptation and outline key developments (genomic prediction, machine learning) that might assist their utilization for improving forecasts of populations' responses to thermal variation. We start by summarizing how recent thermal-driven selective and demographic responses can be inferred by coalescent methods and in turn how quantitative genetic theory offers suitable multi-trait predictions over a few generations via the breeder's equation. We later assume that enough generations have passed as to display genomic signatures of divergent selection to thermal variation and describe how these footprints can be reconstructed using genome-wide association and selection scans or, alternatively, may be used for forward prediction over multiple generations under an infinitesimal genomic prediction model. Finally, we move deeper in time to comprehend the genomic consequences of thermal shifts at an evolutionary time scale by relying on phylogeographic approaches that allow for reticulate evolution and ecological parapatric speciation, and end by envisioning the potential of modern machine learning techniques to better inform long-term predictions. We conclude that foreseeing future thermal adaptive responses requires bridging the multiple spatial scales of historical and predictive environmental change research under modern cohesive approaches such as genomic prediction and machine learning frameworks.
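For readers unfamiliar with it, the breeder's equation mentioned in this abstract is usually written in the standard quantitative-genetics form below. The symbols are the conventional textbook ones and are not taken from the paper itself: $R$ is the response to selection, $h^2$ the narrow-sense heritability, $S$ the selection differential, $\mathbf{G}$ and $\mathbf{P}$ the additive genetic and phenotypic covariance matrices, and $\boldsymbol{\beta}$ the selection gradient.

```latex
% Univariate breeder's equation: per-generation response to selection
R = h^{2} S

% Multi-trait form: change in the vector of trait means per generation
\Delta \bar{\mathbf{z}} = \mathbf{G}\,\boldsymbol{\beta},
\qquad
\boldsymbol{\beta} = \mathbf{P}^{-1}\mathbf{S}
```

As a quick arithmetic check of the univariate form, a trait with $h^2 = 0.4$ under a selection differential of $S = 2$ units would be expected to shift by $R = 0.8$ units in one generation.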
Affiliation(s)
- Andrés J Cortés
  - Corporación Colombiana de Investigación Agropecuaria AGROSAVIA, C.I. La Selva, Rionegro, Colombia
  - Departamento de Ciencias Forestales, Facultad de Ciencias Agrarias, Universidad Nacional de Colombia - Sede Medellín, Medellín, Colombia
- Felipe López-Hernández
  - Corporación Colombiana de Investigación Agropecuaria AGROSAVIA, C.I. La Selva, Rionegro, Colombia
- Daniela Osorio-Rodriguez
  - Division of Geological and Planetary Sciences, California Institute of Technology (Caltech), Pasadena, CA, United States
8
Novel Bead-Based Epitope Assay is a sensitive and reliable tool for profiling epitope-specific antibody repertoire in food allergy. Sci Rep 2019; 9:18425. [PMID: 31804555 PMCID: PMC6895130 DOI: 10.1038/s41598-019-54868-7] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 05/17/2019] [Accepted: 11/09/2019] [Indexed: 12/16/2022]
Abstract
Identification of allergenic IgE epitopes is instrumental for the development of novel diagnostic and prognostic methods in food allergy. In this work, we present the quantification and validation of a Bead-Based Epitope Assay (BBEA) that through multiplexing of epitopes and multiple sample processing enables completion of large experiments in a short period of time, using minimal quantities of patients’ blood. Peptides that are uniquely coupled to beads are incubated with serum or plasma samples, and after a secondary fluorophore-labeled antibody is added, the level of fluorescence is quantified with a Luminex reader. The signal is then normalized and converted to epitope-specific antibody binding values. We show that the effect of technical artifacts, i.e. well position or reading order, is minimal; and batch effects - different individual microplate runs - can be easily estimated and eliminated from the data. Epitope-specific antibody binding quantified with BBEA is highly reliable, reproducible and has greater sensitivity of epitope detection compared to peptide microarrays. IgE directed at allergenic epitopes is a sensitive biomarker of food allergy and can be used to predict allergy severity and phenotypes; and quantification of the relationship between epitope-specific IgE and IgG4 can further improve our understanding of the immune mechanisms behind allergic sensitization.
9
Abstract
The scientific method has been guiding biological research for a long time. It not only prescribes the order and types of activities that give a scientific study validity and a stamp of approval but also has substantially shaped how we collectively think about the endeavor of investigating nature. The advent of high-throughput data generation, data mining, and advanced computational modeling has thrown the formerly undisputed, monolithic status of the scientific method into turmoil. On the one hand, the new approaches are clearly successful and expect the same acceptance as the traditional methods, but on the other hand, they replace much of the hypothesis-driven reasoning with inductive argumentation, which philosophers of science consider problematic. Intrigued by the enormous wealth of data and the power of machine learning, some scientists have even argued that significant correlations within datasets could make the entire quest for causation obsolete. Many of these issues have been passionately debated during the past two decades, often with scant agreement. It is proffered here that hypothesis-driven, data-mining-inspired, and "allochthonous" knowledge acquisition, based on mathematical and computational models, are vectors spanning a 3D space of an expanded scientific method. The combination of methods within this space will most certainly shape our thinking about nature, with implications for experimental design, peer review and funding, sharing of results, education, medical diagnostics, and even questions of litigation.
Affiliation(s)
- Eberhard O. Voit
  - Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, Georgia, United States of America
10
Aruoma OI, Hausman-Cohen S, Pizano J, Schmidt MA, Minich DM, Joffe Y, Brandhorst S, Evans SJ, Brady DM. Personalized Nutrition: Translating the Science of NutriGenomics Into Practice: Proceedings From the 2018 American College of Nutrition Meeting. J Am Coll Nutr 2019; 38:287-301. [DOI: 10.1080/07315724.2019.1582980] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 12/27/2022]
Affiliation(s)
- Okezie I Aruoma
  - California State University Los Angeles, Los Angeles, California, USA
  - Southern California University of Health Sciences, Whittier, California, USA
- Jessica Pizano
  - Nutritional Genomics Institute, SNPed, and OmicsDX, Chasterfield, Virginia, USA
- Michael A. Schmidt
  - Advanced Pattern Analysis & Countermeasures Group, Boulder, Colorado, USA
  - Sovaris Aerospace, Boulder, Colorado, USA
- Deanna M. Minich
  - University of Western States, Portland, Oregon, USA
  - Institute for Functional Medicine, Federal Way, Washington, USA
- Yael Joffe
  - 3X4 Genetics and Manuka Science, Cape Town, South Africa
- David M. Brady
  - University of Bridgeport, Bridgeport, Connecticut, USA
  - Whole Body Medicine, Fairfield, Connecticut, USA
11
Suprun M, Suárez-Fariñas M. PlateDesigner: a web-based application for the design of microplate experiments. Bioinformatics 2019; 35:1605-1607. [PMID: 30304481 PMCID: PMC6821189 DOI: 10.1093/bioinformatics/bty853] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 05/24/2018] [Revised: 09/09/2018] [Accepted: 10/08/2018] [Indexed: 11/12/2022]
Abstract
SUMMARY In biological assays, systematic variability, known as a batch effect, can often confound the effects of true biological conditions and has been well documented for a variety of high-throughput technologies. In microplate-based multiplex experiments, such as Luminex or OLINK assays, researchers need to consider both position and plate effects. Those effects can be easily accounted for if the experiments are properly designed, which includes randomization of the samples across multiple experimental runs. However, doing the ad hoc randomization becomes challenging when handling multiple samples. PlateDesigner is the first web-based application that provides randomization for microplate experiments, ensuring that the main principles of the experimental design, such as grouping samples from the same biological units and balancing the distribution of experimental conditions, are applied. Creating randomizations with PlateDesigner is simple and the results can be exported in a variety of formats, and easily integrated with microplate readers and statistical analysis software. AVAILABILITY AND IMPLEMENTATION PlateDesigner is written in R/Shiny and is hosted online by the Center of Biostatistics at the Icahn School of Medicine at Mount Sinai. This application is freely available at platedesigner.net.
Affiliation(s)
- Maria Suprun
  - Department of Pediatrics, Allergy and Immunology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Mayte Suárez-Fariñas
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
12
Gradin R, Lindstedt M, Johansson H. Batch adjustment by reference alignment (BARA): Improved prediction performance in biological test sets with batch effects. PLoS One 2019; 14:e0212669. [PMID: 30794641 PMCID: PMC6386283 DOI: 10.1371/journal.pone.0212669] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/05/2018] [Accepted: 02/07/2019] [Indexed: 12/15/2022]
Abstract
Many biological data acquisition platforms suffer from inadvertent inclusion of biologically irrelevant variance in analyzed data, collectively termed batch effects. Batch effects can lead to difficulties in downstream analysis by lowering the power to detect biologically interesting differences and can in certain instances lead to false discoveries. They are especially troublesome in predictive modelling where samples in training sets and test sets are often completely correlated with batches. In this article, we present BARA, a normalization method for adjusting batch effects in predictive modelling. BARA utilizes a few reference samples to adjust for batch effects in a compressed data space spanned by the training set. We evaluate BARA using a collection of publicly available datasets and three different prediction models, and compare its performance to already existing methods developed for similar purposes. The results show that data normalized with BARA generates high and consistent prediction performances. Further, they suggest that BARA produces reliable performances independent of the examined classifiers. We therefore conclude that BARA has great potential to facilitate the development of predictive assays where test sets and training sets are correlated with batch.
Affiliation(s)
- Malin Lindstedt
  - Department of Immunotechnology, Lund University, Lund, Sweden
13
Taylor DL, Gough A, Schurdak ME, Vernetti L, Chennubhotla CS, Lefever D, Pei F, Faeder JR, Lezon TR, Stern AM, Bahar I. Harnessing Human Microphysiology Systems as Key Experimental Models for Quantitative Systems Pharmacology. Handb Exp Pharmacol 2019; 260:327-367. [PMID: 31201557 PMCID: PMC6911651 DOI: 10.1007/164_2019_239] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 02/06/2023]
Abstract
Two technologies that have emerged in the last decade offer a new paradigm for modern pharmacology, as well as drug discovery and development. Quantitative systems pharmacology (QSP) is a complementary approach to traditional, target-centric pharmacology and drug discovery and is based on an iterative application of computational and systems biology methods with multiscale experimental methods, both of which include models of ADME-Tox and disease. QSP has emerged as a new approach due to the low efficiency of success in developing therapeutics based on the existing target-centric paradigm. Likewise, human microphysiology systems (MPS) are experimental models complementary to existing animal models and are based on the use of human primary cells, adult stem cells, and/or induced pluripotent stem cells (iPSCs) to mimic human tissues and organ functions/structures involved in disease and ADME-Tox. Human MPS experimental models have been developed to address the relatively low concordance of human disease and ADME-Tox with engineered, experimental animal models of disease. The integration of the QSP paradigm with the use of human MPS has the potential to enhance the process of drug discovery and development.
Affiliation(s)
- D Lansing Taylor
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Albert Gough
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Mark E Schurdak
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Lawrence Vernetti
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Chakra S Chennubhotla
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Daniel Lefever
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
- Fen Pei
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- James R Faeder
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Timothy R Lezon
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Andrew M Stern
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
- Ivet Bahar
  - University of Pittsburgh Drug Discovery Institute, Pittsburgh, PA, USA
  - Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA
|
14
|
San-Jose LM, Roulin A. Genomics of coloration in natural animal populations. Philos Trans R Soc Lond B Biol Sci 2018; 372:rstb.2016.0337. [PMID: 28533454 DOI: 10.1098/rstb.2016.0337] [Citation(s) in RCA: 37] [Impact Index Per Article: 5.3] [Accepted: 02/15/2017] [Indexed: 12/28/2022] Open
Abstract
Animal coloration has traditionally been the target of genetic and evolutionary studies. However, until very recently, the study of the genetic basis of animal coloration has been mainly restricted to model species, whereas research on non-model species has been either neglected or mainly based on candidate approaches, and thereby limited by the knowledge obtained in model species. Recent high-throughput sequencing technologies allow us to overcome previous limitations, and open new avenues to study the genetic basis of animal coloration in a broader number of species and colour traits, and to address the general relevance of different genetic structures and their implications for the evolution of colour. In this review, we highlight aspects where genome-wide studies could be of major utility to fill in the gaps in our understanding of the biology and evolution of animal coloration. The new genomic approaches have been promptly adopted to study animal coloration although substantial work is still needed to consider a larger range of species and colour traits, such as those exhibiting continuous variation or based on reflective structures. We argue that a robust advancement in the study of animal coloration will also require large efforts to validate the functional role of the genes and variants discovered using genome-wide tools. This article is part of the themed issue 'Animal coloration: production, perception, function and application'.
Affiliation(s)
- Luis M San-Jose
- Department of Ecology and Evolution, University of Lausanne, Building Le Biophore, 1015 Lausanne, Switzerland
| | - Alexandre Roulin
- Department of Ecology and Evolution, University of Lausanne, Building Le Biophore, 1015 Lausanne, Switzerland
| |
|
15
|
Bálint M, Márton O, Schatz M, Düring R, Grossart H. Proper experimental design requires randomization/balancing of molecular ecology experiments. Ecol Evol 2018; 8:1786-1793. [PMID: 29435253 PMCID: PMC5792580 DOI: 10.1002/ece3.3687] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 04/04/2017] [Revised: 09/21/2017] [Accepted: 10/26/2017] [Indexed: 12/12/2022] Open
Abstract
Properly designed (randomized and/or balanced) experiments are standard in ecological research. Molecular methods are increasingly used in ecology, but studies generally do not report the detailed design of sample processing in the laboratory. This may strongly influence the interpretability of results if the laboratory procedures do not account for the confounding effects of unexpected laboratory events. We demonstrate this with a simple experiment in which unexpected differences in the laboratory processing of samples would have biased the results had randomization in the DNA extraction and PCR steps not provided safeguards. We emphasize the need for proper experimental design and reporting of the laboratory phase of molecular ecology research to ensure the reliability and interpretability of results.
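The randomized and balanced laboratory design this abstract argues for can be illustrated with a minimal sketch. It is not the authors' protocol: the function name `randomized_balanced_batches`, the sample labels, and the batch count are hypothetical, and the sketch assumes that each sample carries a single treatment label and that samples are processed in fixed-size laboratory batches (for example, DNA extraction runs).

```python
import random
from collections import defaultdict


def randomized_balanced_batches(samples, n_batches, seed=0):
    """Assign (sample_id, treatment) pairs to laboratory batches.

    Treatments are spread as evenly as possible across batches (balancing),
    and the processing order inside each batch is shuffled (randomization).
    All names here are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)

    # Group sample ids by treatment so each treatment can be dealt out evenly.
    by_treatment = defaultdict(list)
    for sample_id, treatment in samples:
        by_treatment[treatment].append(sample_id)

    batches = [[] for _ in range(n_batches)]
    next_batch = 0
    for treatment in sorted(by_treatment):
        ids = by_treatment[treatment]
        rng.shuffle(ids)  # random choice of which samples land in which batch
        for sample_id in ids:
            batches[next_batch % n_batches].append((sample_id, treatment))
            next_batch += 1

    # Randomize the processing order within each batch.
    for batch in batches:
        rng.shuffle(batch)
    return batches


# Hypothetical example: 6 control and 6 treated samples, 3 extraction batches.
example_samples = [(f"S{i}", "control" if i < 6 else "treated") for i in range(12)]
example_batches = randomized_balanced_batches(example_samples, n_batches=3, seed=1)
```

With this layout, an unexpected event that affects one batch touches both treatments equally, so it adds noise rather than a treatment-correlated bias. Recording the seed alongside the batch assignment lets the laboratory design be reported and reproduced.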
Affiliation(s)
- Miklós Bálint
- Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
| | - Orsolya Márton
- Senckenberg Biodiversity and Climate Research Centre, Frankfurt am Main, Germany
- Institute for Soil Sciences and Agricultural Chemistry, Centre for Agricultural Research, Hungarian Academy of Sciences, Budapest, Hungary
| | | | | | - Hans‐Peter Grossart
- Leibniz Institute for Freshwater Ecology and Inland Fisheries, Stechlin, Germany
- Institute of Biochemistry and Biology, Potsdam University, Potsdam, Germany
| |
|
16
|
Tom JA, Reeder J, Forrest WF, Graham RR, Hunkapiller J, Behrens TW, Bhangale TR. Identifying and mitigating batch effects in whole genome sequencing data. BMC Bioinformatics 2017; 18:351. [PMID: 28738841 PMCID: PMC5525370 DOI: 10.1186/s12859-017-1756-z] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Received: 01/12/2017] [Accepted: 07/12/2017] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Large sample sets of whole genome sequencing with deep coverage are being generated; however, assembling datasets from different sources inevitably introduces batch effects. These batch effects are not well understood and can be due to changes in the sequencing protocol or bioinformatics tools used to process the data. No systematic algorithms or heuristics exist to detect and filter batch effects or remove associations impacted by batch effects in whole genome sequencing data. RESULTS We describe key quality metrics, provide a freely available software package to compute them, and demonstrate that identification of batch effects is aided by principal components analysis of these metrics. To mitigate batch effects, we developed new site-specific filters that identified and removed variants that falsely associated with the phenotype due to batch effect. These include filtering based on: a haplotype based genotype correction, a differential genotype quality test, and removing sites with missing genotype rate greater than 30% after setting genotypes with quality scores less than 20 to missing. This method removed 96.1% of unconfirmed genome-wide significant SNP associations and 97.6% of unconfirmed genome-wide significant indel associations. We performed analyses to demonstrate that: 1) these filters impacted variants known to be disease associated, as 2 out of 16 confirmed associations in an AMD candidate SNP analysis were filtered, representing a reduction in power of 12.5%; 2) in the absence of batch effects, these filters removed only a small proportion of variants across the genome (type I error rate of 3%); and 3) in an independent dataset, the method removed 90.2% of unconfirmed genome-wide SNP associations and 89.8% of unconfirmed genome-wide indel associations. CONCLUSIONS Researchers currently do not have effective tools to identify and mitigate batch effects in whole genome sequencing data. We developed and validated methods and filters to address this deficiency.
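One of the filters described in this abstract (set genotypes with quality scores below 20 to missing, then remove sites whose missing genotype rate exceeds 30%) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' software package: the data layout (a dict mapping a site identifier to a list of `(genotype, quality)` pairs) and the names `mask_low_quality`, `missing_rate`, and `filter_sites` are hypothetical, and the haplotype-based correction and differential genotype quality test are not covered.

```python
GQ_MIN = 20         # genotypes with quality below this are set to missing
MAX_MISSING = 0.30  # sites with a missing rate above this are removed


def mask_low_quality(calls, gq_min=GQ_MIN):
    """Return genotypes with low-quality calls replaced by None (missing)."""
    return [gt if gt is not None and gq >= gq_min else None for gt, gq in calls]


def missing_rate(genotypes):
    """Fraction of missing genotypes at one site (1.0 for an empty site)."""
    if not genotypes:
        return 1.0
    return sum(gt is None for gt in genotypes) / len(genotypes)


def filter_sites(sites, gq_min=GQ_MIN, max_missing=MAX_MISSING):
    """Keep sites whose missing rate, after quality masking, is <= max_missing."""
    kept = {}
    for site_id, calls in sites.items():
        masked = mask_low_quality(calls, gq_min)
        if missing_rate(masked) <= max_missing:
            kept[site_id] = masked
    return kept


# Hypothetical toy data: each call is (alternate-allele count, genotype quality).
example_sites = {
    "site_a": [(0, 35), (1, 40), (2, 50), (1, 15)],    # one low-quality call
    "site_b": [(0, 10), (1, 12), (None, 0), (2, 45)],  # mostly low quality
}
kept = filter_sites(example_sites)
# kept == {"site_a": [0, 1, 2, None]}
```

In the toy data, `site_a` ends with a 25% missing rate and is kept, while `site_b` reaches 75% and is removed. The order of operations matters: masking before computing the missing rate is what lets poor genotype quality, and not only absent calls, trigger removal of a site.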
Affiliation(s)
- Jennifer A Tom
- Bioinformatics and Computational Biology Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA.
| | - Jens Reeder
- Bioinformatics and Computational Biology Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| | - William F Forrest
- Bioinformatics and Computational Biology Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| | - Robert R Graham
- Human Genetics Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| | - Julie Hunkapiller
- Human Genetics Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| | - Timothy W Behrens
- Human Genetics Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| | - Tushar R Bhangale
- Bioinformatics and Computational Biology Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
- Human Genetics Department, Genentech Inc, 1 DNA Way, South San Francisco, CA, 94080, USA
| |
|
17
|
Cuccaro D, De Marco EV, Cittadella R, Cavallaro S. Copy Number Variants in Alzheimer's Disease. J Alzheimers Dis 2017; 55:37-52. [PMID: 27662298 PMCID: PMC5115612 DOI: 10.3233/jad-160469] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.0] [Accepted: 08/14/2016] [Indexed: 12/18/2022]
Abstract
Alzheimer's disease (AD) is a devastating disease mainly afflicting elderly people, characterized by decreased cognition, loss of memory, and eventually death. Although risk and deterministic genes are known, major genetics research programs are underway to gain further insights into the inheritance of AD. In recent years, in particular, new developments in genome-wide scanning methodologies have enabled the association of a number of previously uncharacterized copy number variants (CNVs, gain or loss of DNA) with AD. Because of the exceedingly large number of studies performed, it has become difficult for geneticists as well as clinicians to systematically follow, evaluate, and interpret the growing number of (sometimes conflicting) CNVs implicated in AD. In this review, after a brief introduction of this type of structural variation, and a description of available databases, computational analyses, and technologies involved, we provide a systematic review of all published data showing statistical and scientific significance of pathogenic CNVs and discuss the role they might play in AD.
Affiliation(s)
- Denis Cuccaro
- Institute of Neurological Sciences, National Research Council, Section of Catania, Italy
| | | | - Rita Cittadella
- Institute of Neurological Sciences, National Research Council, Section of Mangone, Italy
| | - Sebastiano Cavallaro
- Institute of Neurological Sciences, National Research Council, Section of Catania, Italy
- Institute of Neurological Sciences, National Research Council, Section of Mangone, Italy
| |
|
18
|
Pranavchand R, Reddy BM. Genomics era and complex disorders: Implications of GWAS with special reference to coronary artery disease, type 2 diabetes mellitus, and cancers. J Postgrad Med 2016; 62:188-98. [PMID: 27424552 PMCID: PMC4970347 DOI: 10.4103/0022-3859.186390] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Indexed: 12/25/2022] Open
Abstract
The Human Genome Project (HGP) has identified millions of single nucleotide polymorphisms (SNPs) and their association with several diseases, apart from successfully characterizing the Mendelian/monogenic diseases. However, the dissection of the precise etiology of complex genetic disorders still poses a challenge for human geneticists. This review outlines the landmark results of genome-wide association studies (GWAS) with respect to major complex diseases: coronary artery disease (CAD), type 2 diabetes mellitus (T2DM), and predominant cancers. A brief account of the current Indian scenario is also given. All the relevant publications till mid-2015 were accessed through web databases such as PubMed and Google. Several databases providing genetic information related to these diseases were tabulated and, in particular, a list of the most significant SNPs identified through GWAS was made, which may be useful for designing functional validation studies. Post-GWAS implications and emerging concepts such as epigenomics and pharmacogenomics were also discussed.
Affiliation(s)
- R Pranavchand
- Molecular Anthropology Group, Biological Anthropology Unit, Indian Statistical Institute, Hyderabad, Andhra Pradesh, India
| | - B M Reddy
- Molecular Anthropology Group, Biological Anthropology Unit, Indian Statistical Institute, Hyderabad, Andhra Pradesh, India
| |
|
19
|
Manimaran S, Selby HM, Okrah K, Ruberman C, Leek JT, Quackenbush J, Haibe-Kains B, Bravo HC, Johnson WE. BatchQC: interactive software for evaluating sample and batch effects in genomic data. Bioinformatics 2016; 32:3836-3838. [PMID: 27540268 PMCID: PMC5167063 DOI: 10.1093/bioinformatics/btw538] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.9] [Received: 04/03/2016] [Revised: 08/10/2016] [Accepted: 08/10/2016] [Indexed: 12/02/2022] Open
Abstract
Sequencing and microarray samples often are collected or processed in multiple batches or at different times. This often produces technical biases that can lead to incorrect results in the downstream analysis. There are several existing batch adjustment tools for ‘-omics’ data, but they do not indicate a priori whether adjustment needs to be conducted or how correction should be applied. We present a software pipeline, BatchQC, which addresses these issues using interactive visualizations and statistics that evaluate the impact of batch effects in a genomic dataset. BatchQC can also apply existing adjustment tools and allow users to evaluate their benefits interactively. We used the BatchQC pipeline on both simulated and real data to demonstrate the effectiveness of this software toolkit. Availability and Implementation: BatchQC is available through Bioconductor: http://bioconductor.org/packages/BatchQC and GitHub: https://github.com/mani2012/BatchQC. Contact: wej@bu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Solaiappan Manimaran
- Department of Biostatistics, Boston University, Boston, MA
- Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA
| | | | - Kwame Okrah
- gRED Oncology Biostatistics, Genentech, South San Francisco, CA
| | - Claire Ruberman
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
| | - Jeffrey T Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD
| | - John Quackenbush
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
| | - Benjamin Haibe-Kains
- Departments of Medical Biophysics and Computer Science, University of Toronto, Toronto, Ontario, Canada
- Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Ontario Institute of Cancer Research, Toronto, Ontario, Canada
| | - Hector Corrada Bravo
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD
| | - W Evan Johnson
- Department of Biostatistics, Boston University, Boston, MA
- Division of Computational Biomedicine, Boston University School of Medicine, Boston, MA
- Bioinformatics Program, Boston University, Boston, MA
| |
|
20
|
Abstract
Systems medicine promotes a range of approaches and strategies to study human health and disease at a systems level with the aim of improving the overall well-being of (healthy) individuals, and preventing, diagnosing, or curing disease. In this chapter we discuss how bioinformatics critically contributes to systems medicine. First, we explain the role of bioinformatics in the management and analysis of data. In particular we show the importance of publicly available biological and clinical repositories to support systems medicine studies. Second, we discuss how the integration and analysis of multiple types of omics data through integrative bioinformatics may facilitate the determination of more predictive and robust disease signatures, lead to a better understanding of (patho)physiological molecular mechanisms, and facilitate personalized medicine. Third, we focus on network analysis and discuss how gene networks can be constructed from omics data and how these networks can be decomposed into smaller modules. We discuss how the resulting modules can be used to generate experimentally testable hypotheses, provide insight into disease mechanisms, and lead to predictive models. Throughout, we provide several examples demonstrating how bioinformatics contributes to systems medicine and discuss future challenges in bioinformatics that need to be addressed to enable the advancement of systems medicine.
Affiliation(s)
- Ulf Schmitz
- Dept of Systems Biology & Bioinformatics, University of Rostock, Rostock, Germany
| | - Olaf Wolkenhauer
- Dept of Systems Biology & Bioinformatics, University of Rostock, Rostock, Germany
| |
|
21
|
Jaffe AE, Hyde T, Kleinman J, Weinbergern DR, Chenoweth JG, McKay RD, Leek JT, Colantuoni C. Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis. BMC Bioinformatics 2015; 16:372. [PMID: 26545828 PMCID: PMC4636836 DOI: 10.1186/s12859-015-0808-5] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 04/29/2015] [Accepted: 10/30/2015] [Indexed: 12/26/2022] Open
Abstract
Background Genomic data production is at its highest level and continues to increase, making available novel primary data and existing public data to researchers for exploration. Here we explore the consequences of “batch” correction for biological discovery in two publicly available expression datasets. We consider this to include the estimation of and adjustment for wide-spread systematic heterogeneity in genomic measurements that is unrelated to the effects under study, whether it be technical or biological in nature. Methods We present three illustrative data analyses using surrogate variable analysis (SVA) and describe how to perform artifact discovery in light of natural heterogeneity within biological groups, secondary biological questions of interest, and non-linear treatment effects in a dataset profiling differentiating pluripotent cells (GSE32923) and another from human brain tissue (GSE30272). Results Careful specification of biological effects of interest is very important to factor-based approaches like SVA. We demonstrate greatly sharpened global and gene-specific differential expression across treatment groups in stem cell systems. Similarly, we demonstrate how to preserve major non-linear effects of age across the lifespan in the brain dataset. However, the gains in precisely defining known effects of interest come at the cost of much other information in the “cleaned” data, including sex, common copy number effects and sample or cell line-specific molecular behavior. Conclusions Our analyses indicate that data “cleaning” can be an important component of high-throughput genomic data analysis when interrogating explicitly defined effects in the context of data affected by robust technical artifacts. However, caution should be exercised to avoid removing biological signal of interest. 
It is also important to note that open data exploration is not possible after such supervised “cleaning”, because effects beyond those stipulated by the researcher may have been removed. With the goal of making these statistical algorithms more powerful and transparent to researchers in the biological sciences, we provide exploratory plots and accompanying R code for identifying and guiding the “cleaning” process (https://github.com/andrewejaffe/StemCellSVA). The impact of these methods is significant enough that we have made newly processed data available for the brain data set at http://braincloud.jhmi.edu/plots/ and GSE30272. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0808-5) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Andrew E Jaffe
- Lieber Institute for Brain Development, 855 N Wolfe St, Ste 300, Baltimore, MD, 21205, USA
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD, 21205, USA
| | - Thomas Hyde
- Lieber Institute for Brain Development, 855 N Wolfe St, Ste 300, Baltimore, MD, 21205, USA
- Department of Neurology, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Department of Psychiatry, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
| | - Joel Kleinman
- Lieber Institute for Brain Development, 855 N Wolfe St, Ste 300, Baltimore, MD, 21205, USA
- Department of Neurology, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
| | - Daniel R Weinbergern
- Lieber Institute for Brain Development, 855 N Wolfe St, Ste 300, Baltimore, MD, 21205, USA
- Department of Neurology, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Department of Psychiatry, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, Maryland, 21205, USA
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland, 21205, USA
| | - Joshua G Chenoweth
- Lieber Institute for Brain Development, 855 N Wolfe St, Ste 300, Baltimore, MD, 21205, USA
| | - Ronald D McKay
- Lieber Institute for Brain Development, 855 N Wolfe St, Ste 300, Baltimore, MD, 21205, USA
| | - Jeffrey T Leek
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N Wolfe St, Baltimore, MD, 21205, USA
| | - Carlo Colantuoni
- Lieber Institute for Brain Development, 855 N Wolfe St, Ste 300, Baltimore, MD, 21205, USA
- Department of Neurology, Johns Hopkins School of Medicine, Baltimore, MD, 21205, USA
- Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, Maryland, 21205, USA
| |
|
22
|
Evans DS, Cailotto F, Parimi N, Valdes AM, Castaño-Betancourt MC, Liu Y, Kaplan RC, Bidlingmaier M, Vasan RS, Teumer A, Tranah GJ, Nevitt MC, Cummings SR, Orwoll ES, Barrett-Connor E, Renner JB, Jordan JM, Doherty M, Doherty SA, Uitterlinden AG, van Meurs JB, Spector TD, Lories RJ, Lane NE. Genome-wide association and functional studies identify a role for IGFBP3 in hip osteoarthritis. Ann Rheum Dis 2015; 74:1861-7. [PMID: 24928840 PMCID: PMC4449305 DOI: 10.1136/annrheumdis-2013-205020] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.9] [Received: 12/04/2013] [Accepted: 05/22/2014] [Indexed: 01/10/2023]
Abstract
OBJECTIVES To identify genetic associations with hip osteoarthritis (HOA), we performed a meta-analysis of genome-wide association studies (GWAS) of HOA. METHODS The GWAS meta-analysis included approximately 2.5 million imputed HapMap single nucleotide polymorphisms (SNPs). HOA cases and controls defined radiographically and by total hip replacement were selected from the Osteoporotic Fractures in Men (MrOS) Study and the Study of Osteoporotic Fractures (SOF) (654 cases and 4697 controls, combined). Replication of genome-wide significant SNP associations (p ≤ 5×10⁻⁸) was examined in five studies (3243 cases and 6891 controls, combined). Functional studies were performed using in vitro models of chondrogenesis and osteogenesis. RESULTS The A allele of rs788748, located 65 kb upstream of the IGFBP3 gene, was associated with lower HOA odds at the genome-wide significance level in the discovery stage (OR 0.71, p = 2×10⁻⁸). The association replicated in five studies (OR 0.92, p = 0.020), but the joint analysis of discovery and replication results was not genome-wide significant (p = 1×10⁻⁶). In separate study populations, the rs788748 A allele was also associated with lower circulating IGFBP3 protein levels (p = 4×10⁻¹³), suggesting that this SNP or a variant in linkage disequilibrium could be an IGFBP3 regulatory variant. Results from functional studies were consistent with association results. Chondrocyte hypertrophy, a deleterious event in OA pathogenesis, was largely prevented upon IGFBP3 knockdown in chondrocytes. Furthermore, IGFBP3 overexpression induced cartilage catabolism and osteogenic differentiation. CONCLUSIONS Results from GWAS and functional studies provided suggestive links between IGFBP3 and HOA.
Affiliation(s)
- Daniel S. Evans
- California Pacific Medical Center Research Institute, San Francisco, CA, USA
| | - Frederic Cailotto
- Laboratory of Tissue Homeostasis and Disease, Skeletal Biology and Engineering Research Center, Department of Development and Regeneration, KU Leuven, Belgium
| | - Neeta Parimi
- California Pacific Medical Center Research Institute, San Francisco, CA, USA
| | - Ana M. Valdes
- Academic Rheumatology, University of Nottingham, Nottingham City Hospital, Hucknall Road, Nottingham, NG5 1PB, UK
| | - Martha C. Castaño-Betancourt
- Department of Internal Medicine, Erasmus Medical Center, Rotterdam, The Netherlands
- The Netherlands Genomics Initiative-sponsored Netherlands Consortium for Healthy Aging (NGI-NCHA), Rotterdam/Leiden, The Netherlands
| | - Youfang Liu
- Thurston Arthritis Research Center, Department of Medicine, and Department of Orthopedics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | | | - Martin Bidlingmaier
- Medizinische Klinik und Poliklinik IV, Ludwig-Maximilians Universität München, Munich, Germany
| | - Ramachandran S. Vasan
- Boston University School of Medicine, Section of Preventive Medicine & Epidemiology, Boston, MA, USA
| | - Alexander Teumer
- Institute of Functional Genomics, Ernst Moritz Arndt University, University of Greifswald, Greifswald, Germany
| | - Gregory J. Tranah
- California Pacific Medical Center Research Institute, San Francisco, CA, USA
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA
| | - Michael C. Nevitt
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA, USA
| | - Steven R. Cummings
- California Pacific Medical Center Research Institute, San Francisco, CA, USA
| | - Eric S. Orwoll
- School of Medicine, Oregon Health & Science University, Portland, OR, USA
| | - Elizabeth Barrett-Connor
- Division of Epidemiology, Departments of Family and Preventive Medicine and Medicine, University of California San Diego, La Jolla, CA, USA
| | - Jordan B. Renner
- Thurston Arthritis Research Center, Department of Medicine, and Department of Radiology, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Joanne M. Jordan
- Thurston Arthritis Research Center, Department of Medicine, and Department of Orthopedics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | - Michael Doherty
- Academic Rheumatology, University of Nottingham, Nottingham City Hospital, Hucknall Road, Nottingham, NG5 1PB, UK
| | - Sally A. Doherty
- Academic Rheumatology, University of Nottingham, Nottingham City Hospital, Hucknall Road, Nottingham, NG5 1PB, UK
| | - Andre G. Uitterlinden
- Department of Internal Medicine, Erasmus Medical Center, Rotterdam, The Netherlands
- The Netherlands Genomics Initiative-sponsored Netherlands Consortium for Healthy Aging (NGI-NCHA), Rotterdam/Leiden, The Netherlands
- Department of Epidemiology, Erasmus Medical Center, Rotterdam, The Netherlands
| | - Joyce B.J. van Meurs
- Department of Internal Medicine, Erasmus Medical Center, Rotterdam, The Netherlands
| | - Tim D. Spector
- Department of Twin Research and Genetic Epidemiology Unit, King’s College London, London, UK
| | - Rik J. Lories
- Laboratory of Tissue Homeostasis and Disease, Skeletal Biology and Engineering Research Center, Department of Development and Regeneration, KU Leuven, Belgium
- Division of Rheumatology, University Hospitals Leuven, Leuven, Belgium
| | - Nancy E. Lane
- University of California at Davis, Sacramento, CA, USA
| |
|
23
|
Masca NGD, Hensor EMA, Cornelius VR, Buffa FM, Marriott HM, Eales JM, Messenger MP, Anderson AE, Boot C, Bunce C, Goldin RD, Harris J, Hinchliffe RF, Junaid H, Kingston S, Martin-Ruiz C, Nelson CP, Peacock J, Seed PT, Shinkins B, Staples KJ, Toombs J, Wright AKA, Teare MD. RIPOSTE: a framework for improving the design and analysis of laboratory-based research. eLife 2015; 4:e05519. [PMID: 25951517 PMCID: PMC4461852 DOI: 10.7554/elife.05519] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.2] [Received: 11/07/2014] [Accepted: 05/01/2015] [Indexed: 12/17/2022] Open
Abstract
Lack of reproducibility is an ongoing problem in some areas of the biomedical sciences. Poor experimental design and a failure to engage with experienced statisticians at key stages in the design and analysis of experiments are two factors that contribute to this problem. The RIPOSTE (Reducing IrreProducibility in labOratory STudiEs) framework has been developed to support early and regular discussions between scientists and statisticians in order to improve the design, conduct and analysis of laboratory studies and, therefore, to reduce irreproducibility. This framework is intended for use during the early stages of a research project, when specific questions or hypotheses are proposed. The essential points within the framework are explained and illustrated using three examples (a medical equipment test, a macrophage study and a gene expression study). Sound study design minimises the possibility of bias being introduced into experiments and leads to higher quality research with more reproducible results.
Affiliation(s)
- Nicholas GD Masca
- Cardiovascular Biomedical Research Unit, University of Leicester, Leicester, United Kingdom
| | - Elizabeth MA Hensor
- Leeds Institute of Rheumatic and Musculoskeletal Medicine, University of Leeds, Leeds, United Kingdom; Leeds Institute of Rheumatic and Musculoskeletal Medicine, NIHR Leeds Musculoskeletal Biomedical Research Unit, Leeds, United Kingdom
| | - Victoria R Cornelius
- Department of Primary Care and Public Health Sciences, King's College London, London, United Kingdom
| | - Francesca M Buffa
- Applied Computational Genomics, University of Oxford, Oxford, United Kingdom
| | - Helen M Marriott
- Department of Infection and Immunity, University of Sheffield, Sheffield, United Kingdom; The Florey Institute, University of Sheffield, Sheffield, United Kingdom
| | - James M Eales
- Department of Cardiovascular Sciences, University of Leicester, Leicester, United Kingdom
| | - Michael P Messenger
- NIHR Diagnostic Evidence Co-Operative Leeds, Leeds Teaching Hospitals NHS Trust, Leeds, United Kingdom
| | - Amy E Anderson
- Musculoskeletal Research Group, Institute of Cellular Medicine, University of Newcastle, Newcastle, United Kingdom
| | - Chris Boot
- Newcastle Hospitals NHS Trust, Newcastle, United Kingdom
| | - Catey Bunce
- NIHR Biomedical Research Centre at Moorfields Eye Hospital NHS Foundation Trust and UCL Institute of Ophthalmology, London, United Kingdom; London School of Hygiene and Tropical Medicine, London, United Kingdom
| | - Robert D Goldin
- Centre for Pathology, Imperial College, London, United Kingdom
| | - Jessica Harris
- Clinical Trials and Evaluation Unit, School of Clinical Sciences, University of Bristol, Bristol, United Kingdom
| | - Rod F Hinchliffe
- Department of Paediatric Haematology, Sheffield Children's NHS Foundation Trust, Sheffield, United Kingdom
| | - Hiba Junaid
- Royal London Hospital, London, United Kingdom
| | - Shaun Kingston
- Respiratory Biomedical Research Unit, Royal Brompton and Harefield NHS Trust, London, United Kingdom
| | - Carmen Martin-Ruiz
- Institute for Ageing and Health, Newcastle University, Newcastle, United Kingdom
| | - Christopher P Nelson
- Department of Cardiovascular Sciences, NIHR Leicester Cardiovascular Biomedical Research Unit, University of Leicester, Leicester, United Kingdom
| | - Janet Peacock
- Division of Health and Social Care Research, Kings College London, London, United Kingdom; NIHR Biomedical Research Centre at Guy's and St Thomas' NHS Foundation, London, United Kingdom
| | - Paul T Seed
- Division of Women's Health, King's College London, London, United Kingdom
| | - Bethany Shinkins
- Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, United Kingdom
| | - Karl J Staples
- Clinical and Experimental Sciences, University of Southampton and NIHR Southampton Respiratory Biomedical Research Unit, Southampton General Hospital, Southampton, United Kingdom
| | - Jamie Toombs
- Department of Molecular Neuroscience, Institute of Neurology, University College London, London, United Kingdom
| | - Adam KA Wright
- Institute of Lung Health, Respiratory Biomedical Unit, University Hospitals of Leicester NHS Trust, Leicester, United Kingdom
| | - M Dawn Teare
- Sheffield School of Health and Related Research, University of Sheffield, Sheffield, United Kingdom
| |
Collapse
|
24
|
Ma L, Keinan A, Clark AG. Biological knowledge-driven analysis of epistasis in human GWAS with application to lipid traits. Methods Mol Biol 2015; 1253:35-45. [PMID: 25403526 DOI: 10.1007/978-1-4939-2155-3_3] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.9] [Indexed: 01/05/2023]
Abstract
While the importance of epistasis is well established, specific gene-gene interactions have rarely been identified in human genome-wide association studies (GWAS), mainly due to low power associated with such interaction tests. In this chapter, we integrate biological knowledge and human GWAS data to reveal epistatic interactions underlying quantitative lipid traits, which are major risk factors for coronary artery disease. To increase power to detect interactions, we only tested pairs of SNPs filtered by prior biological knowledge, including GWAS results, protein-protein interactions (PPIs), and pathway information. Using published GWAS and 9,713 European Americans (EA) from the Atherosclerosis Risk in Communities (ARIC) study, we identified an interaction between HMGCR and LIPC affecting high-density lipoprotein cholesterol (HDL-C) levels. We then validated this interaction in additional multiethnic cohorts from ARIC, the Framingham Heart Study, and the Multi-Ethnic Study of Atherosclerosis. Both HMGCR and LIPC are involved in the metabolism of lipids and lipoproteins, and LIPC itself has been marginally associated with HDL-C. Furthermore, no significant interaction was detected using PPI and pathway information, mainly due to the stringent significance level required after correcting for the large number of tests conducted. These results suggest the potential of biological knowledge-driven approaches to detect epistatic interactions in human GWAS, which may hold the key to exploring the role gene-gene interactions play in connecting genotypes and complex phenotypes in future GWAS.
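The power argument in this abstract (testing only SNP pairs filtered by prior knowledge) can be illustrated with simple arithmetic. The sketch below is not taken from the chapter: it assumes a plain Bonferroni correction over all unordered SNP pairs, and the panel sizes and the function name `pairwise_bonferroni_threshold` are hypothetical values chosen only for illustration.

```python
from math import comb


def pairwise_bonferroni_threshold(n_snps: int, alpha: float = 0.05) -> float:
    """Per-test threshold when every unordered SNP pair is tested exactly once."""
    n_pairs = comb(n_snps, 2)  # number of unordered pairs among n_snps
    if n_pairs == 0:
        raise ValueError("need at least two SNPs to form a pair")
    return alpha / n_pairs


# Hypothetical panel sizes (assumptions, not figures from the study).
genome_wide = pairwise_bonferroni_threshold(500_000)  # exhaustive pairwise scan
filtered = pairwise_bonferroni_threshold(1_000)       # knowledge-filtered panel

print(f"exhaustive scan threshold: {genome_wide:.2e}")
print(f"filtered panel threshold:  {filtered:.2e}")
```

Shrinking the candidate panel relaxes the per-test threshold by several orders of magnitude, which is the sense in which filtering can raise power to detect an interaction of a given effect size.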
Affiliation(s)
- Li Ma
  - Department of Animal and Avian Sciences, University of Maryland, Bldg 142, College Park, MD 20742, USA
25
Leek JT. svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res 2014; 42:gku864. [PMID: 25294822 PMCID: PMC4245966 DOI: 10.1093/nar/gku864] [Citation(s) in RCA: 383] [Impact Index Per Article: 34.8] [Received: 06/24/2014] [Revised: 08/20/2014] [Accepted: 09/08/2014] [Indexed: 11/17/2022] Open
Abstract
It is now known that unwanted noise and unmodeled artifacts such as batch effects can dramatically reduce the accuracy of statistical inference in genomic experiments. These sources of noise must be modeled and removed to accurately measure biological variability and to obtain correct statistical inference when performing high-throughput genomic analysis. We introduced surrogate variable analysis (sva) for estimating these artifacts by (i) identifying the part of the genomic data only affected by artifacts and (ii) estimating the artifacts with principal components or singular vectors of the subset of the data matrix. The resulting estimates of artifacts can be used in subsequent analyses as adjustment factors to correct analyses. Here I describe a version of the sva approach specifically created for count data or FPKMs from sequencing experiments based on appropriate data transformation. I also describe the addition of supervised sva (ssva) for using control probes to identify the part of the genomic data only affected by artifacts. I present a comparison between these versions of sva and other methods for batch effect estimation on simulated data, real count-based data and FPKM-based data. These updates are available through the sva Bioconductor package and I have made fully reproducible analysis using these methods available from: https://github.com/jtleek/svaseq.
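The core idea in this abstract, estimating artifacts from singular vectors of the data after a transformation suited to counts, can be sketched in a few lines of NumPy. This is a deliberately simplified illustration and not the sva or svaseq algorithm: it skips step (i), identifying the subset of features affected only by artifacts, and instead takes singular vectors of the residuals from all features. The `log(count + 1)` transform, the function name, and the simulated data are assumptions made for the example.

```python
import numpy as np


def surrogate_variables_sketch(counts, design, n_sv=1):
    """Top singular vectors of the residuals after removing known covariates.

    counts: (n_features, n_samples) array of non-negative counts
    design: (n_samples, n_covariates) known covariates, including an intercept
    Returns an (n_samples, n_sv) array of estimated artifact variables.
    """
    log_expr = np.log(np.asarray(counts, dtype=float) + 1.0)  # assumed transform
    design = np.asarray(design, dtype=float)
    # Least-squares fit of every feature on the known covariates.
    coef, *_ = np.linalg.lstsq(design, log_expr.T, rcond=None)
    residuals = log_expr.T - design @ coef  # shape (n_samples, n_features)
    # Left singular vectors of the residuals vary across samples.
    u, _, _ = np.linalg.svd(residuals, full_matrices=False)
    return u[:, :n_sv]


# Simulated example: a hidden batch effect that is balanced across groups.
rng = np.random.default_rng(0)
n_features, n_samples = 200, 12
group = np.repeat([0, 1], 6)  # known biological variable
batch = np.tile([0, 1], 6)    # hidden artifact, unknown to the analyst
design = np.column_stack([np.ones(n_samples), group])
counts = rng.poisson(lam=50 * np.exp(0.8 * batch), size=(n_features, n_samples))

sv = surrogate_variables_sketch(counts, design, n_sv=1)
```

The returned columns could then be appended to the design matrix as adjustment factors in a downstream model, in the spirit of the adjustment step the abstract describes.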
Affiliation(s)
- Jeffrey T Leek
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21212, US
26
Sokolowski M, Wasserman J, Wasserman D. Genome-wide association studies of suicidal behaviors: a review. Eur Neuropsychopharmacol 2014; 24:1567-77. [PMID: 25219938 DOI: 10.1016/j.euroneuro.2014.08.006] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Received: 03/21/2014] [Revised: 07/24/2014] [Accepted: 08/10/2014] [Indexed: 11/17/2022]
Abstract
Suicidal behaviors represent a fatal dimension of mental ill-health, involving both environmental and heritable (genetic) influences. The putative genetic components of suicidal behaviors have until recent years been mainly investigated by hypothesis-driven research (of "candidate genes"). But technological progress in genotyping has opened the possibilities towards (hypothesis-generating) genomic screens and novel opportunities to explore polygenetic perspectives, now spanning a wide array of possible analyses falling under the term Genome-Wide Association Study (GWAS). Here we introduce and discuss broadly some apparent limitations but also certain developing opportunities of GWAS. We summarize the results from all eight GWAS conducted to date that focused on suicidality outcomes: treatment-emergent suicidal ideation (3 studies), suicide attempts (4 studies) and completed suicides (1 study). Clearly, there are few (if any) genome-wide significant and reproducible findings yet to be demonstrated. We then discuss and pinpoint certain future considerations in relation to sample sizes, the units of genetic associations used, study designs and outcome definitions, psychiatric diagnoses or biological measures, as well as the use of genomic sequencing. We conclude that GWAS should have a lot more potential to show in the case of suicidal outcomes than what has yet been realized.
Affiliation(s)
- Marcus Sokolowski
  - National Centre for Suicide Research and Prevention of Mental Ill-Health (NASP), Karolinska Institute (KI), S-171 77 Stockholm, Sweden
- Jerzy Wasserman
  - National Centre for Suicide Research and Prevention of Mental Ill-Health (NASP), Karolinska Institute (KI), S-171 77 Stockholm, Sweden
- Danuta Wasserman
  - National Centre for Suicide Research and Prevention of Mental Ill-Health (NASP), Karolinska Institute (KI), S-171 77 Stockholm, Sweden
27
Parker HS, Corrada Bravo H, Leek JT. Removing batch effects for prediction problems with frozen surrogate variable analysis. PeerJ 2014; 2:e561. [PMID: 25332844 PMCID: PMC4179553 DOI: 10.7717/peerj.561] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Received: 06/20/2014] [Accepted: 08/15/2014] [Indexed: 01/06/2023] Open
Abstract
Batch effects are responsible for the failure of promising genomic prognostic signatures, major ambiguities in published genomic results, and retractions of widely-publicized findings. Batch effect corrections have been developed to remove these artifacts, but they are designed to be used in population studies. But genomic technologies are beginning to be used in clinical applications where samples are analyzed one at a time for diagnostic, prognostic, and predictive applications. There are currently no batch correction methods that have been developed specifically for prediction. In this paper, we propose a new method called frozen surrogate variable analysis (fSVA) that borrows strength from a training set for individual sample batch correction. We show that fSVA improves prediction accuracy in simulations and in public genomic studies. fSVA is available as part of the sva Bioconductor package.
Affiliation(s)
- Hilary S. Parker
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Héctor Corrada Bravo
  - Center for Bioinformatics and Computational Biology, Department of Computer Science, University of Maryland, College Park, MD, USA
- Jeffrey T. Leek
  - Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
29
Lee S, Abecasis G, Boehnke M, Lin X. Rare-variant association analysis: study designs and statistical tests. Am J Hum Genet 2014; 95:5-23. [PMID: 24995866 DOI: 10.1016/j.ajhg.2014.06.009] [Citation(s) in RCA: 721] [Impact Index Per Article: 65.5] [Received: 01/02/2014] [Indexed: 12/30/2022] Open
Abstract
Despite the extensive discovery of trait- and disease-associated common variants, much of the genetic contribution to complex traits remains unexplained. Rare variants can explain additional disease risk or trait variability. An increasing number of studies are underway to identify trait- and disease-associated rare variants. In this review, we provide an overview of statistical issues in rare-variant association studies with a focus on study designs and statistical tests. We present the design and analysis pipeline of rare-variant studies and review cost-effective sequencing designs and genotyping platforms. We compare various gene- or region-based association tests, including burden tests, variance-component tests, and combined omnibus tests, in terms of their assumptions and performance. Also discussed are the related topics of meta-analysis, population-stratification adjustment, genotype imputation, follow-up studies, and heritability due to rare variants. We provide guidelines for analysis and discuss some of the challenges inherent in these studies and future research directions.
30
Deane T, Nomme K, Jeffery E, Pollock C, Birol G. Development of the Biological Experimental Design Concept Inventory (BEDCI). CBE LIFE SCIENCES EDUCATION 2014; 13:540-51. [PMID: 25185236 PMCID: PMC4152214 DOI: 10.1187/cbe.13-11-0218] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.1] [Indexed: 05/08/2023]
Abstract
Interest in student conception of experimentation inspired the development of a fully validated 14-question inventory on experimental design in biology (BEDCI) by following established best practices in concept inventory (CI) design. This CI can be used to diagnose specific examples of non-expert-like thinking in students and to evaluate the success of teaching strategies that target conceptual changes. We used BEDCI to diagnose non-expert-like student thinking in experimental design at the pre- and posttest stage in five courses (total n = 580 students) at a large research university in western Canada. Calculated difficulty and discrimination metrics indicated that BEDCI questions are able to effectively capture learning changes at the undergraduate level. A high correlation (r = 0.84) between responses by students in similar courses and at the same stage of their academic career, also suggests that the test is reliable. Students showed significant positive learning changes by the posttest stage, but some non-expert-like responses were widespread and persistent. BEDCI is a reliable and valid diagnostic tool that can be used in a variety of life sciences disciplines.
Affiliation(s)
- Thomas Deane
  - Departments of Botany and Zoology, Biology Program, Faculty of Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Kathy Nomme
  - Departments of Botany and Zoology, Biology Program, Faculty of Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Erica Jeffery
  - Departments of Botany and Zoology, Biology Program, Faculty of Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Carol Pollock
  - Department of Zoology, Faculty of Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Gülnur Birol
  - Science Centre for Learning and Teaching, Faculty of Science, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
31
Spratt H, Ju H, Brasier AR. A structured approach to predictive modeling of a two-class problem using multidimensional data sets. Methods 2013; 61:73-85. [PMID: 23321025 PMCID: PMC3661737 DOI: 10.1016/j.ymeth.2013.01.002] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.4] [Received: 06/22/2012] [Revised: 01/03/2013] [Accepted: 01/07/2013] [Indexed: 02/09/2023] Open
Abstract
Biological experiments in the post-genome era can generate a staggering amount of complex data that challenges experimentalists to extract meaningful information. Increasingly, the success of an appropriately controlled experiment relies on a robust data analysis pipeline. In this paper, we present a structured approach to the analysis of multidimensional data that relies on a close, two-way communication between the bioinformatician and experimentalist. A sequential approach employing data exploration (visualization, graphical and analytical study), pre-processing, feature reduction and supervised classification using machine learning is presented. This standardized approach is illustrated by an example from a proteomic data analysis that has been used to predict the risk of infectious disease outcome. Strategies for model selection and post hoc model diagnostics are presented and applied to the case illustration. We discuss some of the practical lessons we have learned applying supervised classification to multidimensional data sets, one of which is the importance of feature reduction in achieving optimal modeling performance.
Affiliation(s)
- Heidi Spratt
  - Department of Preventive Medicine and Community Health, University of Texas Medical Branch (UTMB), Galveston, TX, USA
  - Sealy Center for Molecular Medicine, UTMB, Galveston, TX, USA
  - Institute for Translational Sciences, UTMB, Galveston, TX, USA
- Hyunsu Ju
  - Department of Preventive Medicine and Community Health, University of Texas Medical Branch (UTMB), Galveston, TX, USA
  - Institute for Translational Sciences, UTMB, Galveston, TX, USA
- Allan R. Brasier
  - Sealy Center for Molecular Medicine, UTMB, Galveston, TX, USA
  - Institute for Translational Sciences, UTMB, Galveston, TX, USA
32
Okser S, Pahikkala T, Aittokallio T. Genetic variants and their interactions in disease risk prediction - machine learning and network perspectives. BioData Min 2013; 6:5. [PMID: 23448398 PMCID: PMC3606427 DOI: 10.1186/1756-0381-6-5] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.7] [Received: 10/05/2012] [Accepted: 02/11/2013] [Indexed: 12/31/2022] Open
Abstract
A central challenge in systems biology and medical genetics is to understand how interactions among genetic loci contribute to complex phenotypic traits and human diseases. While most studies have so far relied on statistical modeling and association testing procedures, machine learning and predictive modeling approaches are increasingly being applied to mining genotype-phenotype relationships, also among those associations that do not necessarily meet statistical significance at the level of individual variants, yet still contributing to the combined predictive power at the level of variant panels. Network-based analysis of genetic variants and their interaction partners is another emerging trend by which to explore how sub-network level features contribute to complex disease processes and related phenotypes. In this review, we describe the basic concepts and algorithms behind machine learning-based genetic feature selection approaches, their potential benefits and limitations in genome-wide setting, and how physical or genetic interaction networks could be used as a priori information for providing improved predictive power and mechanistic insights into the disease networks. These developments are geared toward explaining a part of the missing heritability, and when combined with individual genomic profiling, such systems medicine approaches may also provide a principled means for tailoring personalized treatment strategies in the future.
33
Hoffman S, Podgurski A. The use and misuse of biomedical data: is bigger really better? AMERICAN JOURNAL OF LAW & MEDICINE 2013; 39:497-538. [PMID: 24494442 DOI: 10.1177/009885881303900401] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.6] [Indexed: 06/03/2023]
Abstract
Very large biomedical research databases, containing electronic health records (EHR) and genomic data from millions of patients, have been heralded recently for their potential to accelerate scientific discovery and produce dramatic improvements in medical treatments. Research enabled by these databases may also lead to profound changes in law, regulation, social policy, and even litigation strategies. Yet, is "big data" necessarily better data? This paper makes an original contribution to the legal literature by focusing on what can go wrong in the process of biomedical database research and what precautions are necessary to avoid critical mistakes. We address three main reasons for approaching such research with care and being cautious in relying on its outcomes for purposes of public policy or litigation. First, the data contained in biomedical databases is surprisingly likely to be incorrect or incomplete. Second, systematic biases, arising from both the nature of the data and the preconceptions of investigators, are serious threats to the validity of research results, especially in answering causal questions. Third, data mining of biomedical databases makes it easier for individuals with political, social, or economic agendas to generate ostensibly scientific but misleading research findings for the purpose of manipulating public opinion and swaying policymakers. In short, this paper sheds much-needed light on the problems of credulous and uninformed acceptance of research results derived from biomedical databases. An understanding of the pitfalls of big data analysis is of critical importance to anyone who will rely on or dispute its outcomes, including lawyers, policymakers, and the public at large. The Article also recommends technical, methodological, and educational interventions to combat the dangers of database errors and abuses.
Affiliation(s)
- Sharona Hoffman
  - Law-Medicine Center, Case Western Reserve University School of Law, USA
34
Yan L, Ma C, Wang D, Hu Q, Qin M, Conroy JM, Sucheston LE, Ambrosone CB, Johnson CS, Wang J, Liu S. OSAT: a tool for sample-to-batch allocations in genomics experiments. BMC Genomics 2012; 13:689. [PMID: 23228338 PMCID: PMC3548766 DOI: 10.1186/1471-2164-13-689] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.1] [Received: 07/10/2012] [Accepted: 12/04/2012] [Indexed: 12/31/2022] Open
Abstract
Background: Batch effects are one type of variability that is not of primary interest but is ubiquitous in sizable genomic experiments. To minimize the impact of batch effects, an ideal experimental design should ensure the even distribution of biological groups and confounding factors across batches. However, due to practical complications, the final collection of samples available in a genomics study may be unbalanced and incomplete, which, without appropriate attention to sample-to-batch allocation, could lead to drastic batch effects. Therefore, it is necessary to develop an effective and handy tool to assign collected samples across batches in an appropriate way in order to minimize the impact of batch effects. Results: We describe OSAT (Optimal Sample Assignment Tool), a Bioconductor package designed for automated sample-to-batch allocations in genomics experiments. Conclusions: OSAT was developed to facilitate the allocation of collected samples to different batches in genomics studies. By optimizing the even distribution of samples in groups of biological interest across batches, it can reduce the confounding or correlation between batches and the biological variables of interest. It can also optimize the homogeneous distribution of confounding factors across batches. It can handle challenging instances where incomplete and unbalanced sample collections are involved as well as ideally balanced designs.
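The design goal stated in this abstract, spreading each biological group evenly across batches, can be shown with a small deterministic sketch. This is not OSAT's optimization procedure or its API: it is a plain round-robin allocation, and the function name `allocate_round_robin` and the sample labels are hypothetical. A real allocation would normally also randomize sample order and balance additional confounders.

```python
from collections import Counter, defaultdict


def allocate_round_robin(sample_groups, n_batches):
    """Assign samples to batches so each group is spread as evenly as possible.

    sample_groups: dict mapping sample ID -> biological group label
    Returns a dict mapping sample ID -> batch index (0-based).
    """
    by_group = defaultdict(list)
    for sample_id, group in sample_groups.items():
        by_group[group].append(sample_id)

    assignment = {}
    next_batch = 0  # carried across groups so total batch sizes stay balanced too
    for group in sorted(by_group):
        for sample_id in sorted(by_group[group]):
            assignment[sample_id] = next_batch
            next_batch = (next_batch + 1) % n_batches
    return assignment


# Hypothetical unbalanced collection: 7 cases and 5 controls, 3 batches.
samples = {f"s{i:02d}": ("case" if i < 7 else "control") for i in range(12)}
assignment = allocate_round_robin(samples, n_batches=3)

# Tabulate group-by-batch counts to inspect the balance.
table = Counter((samples[s], b) for s, b in assignment.items())
for (group, batch_index), n in sorted(table.items()):
    print(group, batch_index, n)
```

Because consecutive samples of a group go to consecutive batches, the per-batch counts within any group differ by at most one, even when the groups themselves are of unequal size.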
Affiliation(s)
- Li Yan
  - Department of Biostatistics and Bioinformatics, Roswell Park Cancer Institute, Buffalo, NY 14263, USA
35
Hsu YH, Kiel DP. Clinical review: Genome-wide association studies of skeletal phenotypes: what we have learned and where we are headed. J Clin Endocrinol Metab 2012; 97:E1958-77. [PMID: 22965941 PMCID: PMC3674343 DOI: 10.1210/jc.2012-1890] [Citation(s) in RCA: 81] [Impact Index Per Article: 6.2] [Received: 04/05/2012] [Accepted: 07/09/2012] [Indexed: 02/07/2023]
Abstract
CONTEXT: The primary goals of genome-wide association studies (GWAS) are to discover new molecular and biological pathways involved in the regulation of bone metabolism that can be leveraged for drug development. In addition, the identified genetic determinants may be used to enhance current risk factor profiles. EVIDENCE ACQUISITION: There have been more than 40 published GWAS on skeletal phenotypes, predominantly focused on dual-energy x-ray absorptiometry-derived bone mineral density (BMD) of the hip and spine. EVIDENCE SYNTHESIS: Sixty-six BMD loci have been replicated across all the published GWAS, confirming the highly polygenic nature of BMD variation. Only seven of the 66 previously reported genes (LRP5, SOST, ESR1, TNFRSF11B, TNFRSF11A, TNFSF11, PTH) from candidate gene association studies have been confirmed by GWAS. Among 59 novel BMD GWAS loci that have not been reported by previous candidate gene association studies, some have been shown to be involved in key biological pathways involving the skeleton, particularly Wnt signaling (AXIN1, LRP5, CTNNB1, DKK1, FOXC2, HOXC6, LRP4, MEF2C, PTHLH, RSPO3, SFRP4, TGFBR3, WLS, WNT3, WNT4, WNT5B, WNT16), bone development: ossification (CLCN7, CSF1, MEF2C, MEPE, PKDCC, PTHLH, RUNX2, SOX6, SOX9, SPP1, SP7), mesenchymal-stem-cell differentiation (FAM3C, MEF2C, RUNX2, SOX4, SOX9, SP7), osteoclast differentiation (JAG1, RUNX2), and TGF-signaling (FOXL1, SPTBN1, TGFBR3). There are still 30 BMD GWAS loci without prior molecular or biological evidence of their involvement in skeletal phenotypes. Other skeletal phenotypes that either have been or are being studied include hip geometry, bone ultrasound, quantitative computed tomography, high-resolution peripheral quantitative computed tomography, biochemical markers, and fractures such as vertebral, nonvertebral, hip, and forearm. CONCLUSIONS: Although several challenges lie ahead as GWAS moves into the next generation, there are prospects of new discoveries in skeletal biology. This review integrates findings from previous GWAS and provides a roadmap for future directions building on current GWAS successes.
Affiliation(s)
- Yi-Hsiang Hsu
  - Hebrew SeniorLife Institute for Aging Research, 1200 Centre Street, Boston, Massachusetts 02131, USA
36
Baker SG. Paradoxes in Carcinogenesis Should Spur New Avenues of Research: An Historical Perspective. Disruptive Science and Technology 2012. [DOI: 10.1089/dst.2012.0011] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.6] [Indexed: 11/13/2022]