1
Abstract
Spatially resolved genomic technologies have allowed us to study the physical organization of cells and tissues, and promise an understanding of local interactions between cells. However, it remains difficult to precisely align spatial observations across slices, samples, scales, individuals and technologies. Here, we propose a probabilistic model that aligns spatially-resolved samples onto a known or unknown common coordinate system (CCS) with respect to phenotypic readouts (for example, gene expression). Our method, Gaussian Process Spatial Alignment (GPSA), consists of a two-layer Gaussian process: the first layer maps observed samples' spatial locations onto a CCS, and the second layer maps from the CCS to the observed readouts. Our approach enables complex downstream spatially aware analyses that are impossible or inaccurate with unaligned data, including an analysis of variance, creation of a dense three-dimensional (3D) atlas from sparse two-dimensional (2D) slices or association tests across data modalities.
Affiliation(s)
- Andrew Jones
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
- F William Townes
  - Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Didong Li
  - Department of Biostatistics, University of North Carolina, Chapel Hill, NC, USA
- Barbara E Engelhardt
  - Gladstone Institutes, San Francisco, CA, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
2
Jones A, Cai D, Li D, Engelhardt BE. Optimizing the design of spatial genomic studies. bioRxiv 2023:2023.01.29.526115. [PMID: 36778332 PMCID: PMC9915499 DOI: 10.1101/2023.01.29.526115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2023]
Abstract
Spatially-resolved genomic technologies have shown promise for studying the relationship between the structural arrangement of cells and their functional behavior. While numerous sequencing and imaging platforms exist for performing spatial transcriptomics and spatial proteomics profiling, these experiments remain expensive and labor-intensive. Thus, when performing spatial genomics experiments using multiple tissue slices, there is a need to select the tissue cross sections that will be maximally informative for the purposes of the experiment. In this work, we formalize the problem of experimental design for spatial genomics experiments, which we generalize into a problem class that we call structured batch experimental design. We propose approaches for optimizing these designs in two types of spatial genomics studies: one in which the goal is to construct a spatially-resolved genomic atlas of a tissue and another in which the goal is to localize a region of interest in a tissue, such as a tumor. We demonstrate the utility of these optimal designs, where each slice is a two-dimensional plane, on several spatial genomics datasets.
Affiliation(s)
- Andrew Jones
  - Department of Computer Science, Princeton University
- Diana Cai
  - Department of Computer Science, Princeton University
- Didong Li
  - Department of Biostatistics, University of North Carolina at Chapel Hill
3
Abstract
Nonnegative matrix factorization (NMF) is widely used to analyze high-dimensional count data because, in contrast to real-valued alternatives such as factor analysis, it produces an interpretable parts-based representation. However, in applications such as spatial transcriptomics, NMF fails to incorporate known structure between observations. Here, we present nonnegative spatial factorization (NSF), a spatially-aware probabilistic dimension reduction model based on transformed Gaussian processes that naturally encourages sparsity and scales to tens of thousands of observations. NSF recovers ground truth factors more accurately than real-valued alternatives such as MEFISTO in simulations, and has lower out-of-sample prediction error than probabilistic NMF on three spatial transcriptomics datasets from mouse brain and liver. Since not all patterns of gene expression have spatial correlations, we also propose a hybrid extension of NSF that combines spatial and nonspatial components, enabling quantification of spatial importance for both observations and features. A TensorFlow implementation of NSF is available from https://github.com/willtownes/nsf-paper.
Affiliation(s)
- F. William Townes
  - Present Address: Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Barbara E. Engelhardt
  - Present Address: Data Science and Biotechnology Institute, Gladstone Institutes, San Francisco, CA, USA
  - Present Address: Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
4
Fitzgerald T, Jones A, Engelhardt BE. A Poisson reduced-rank regression model for association mapping in sequencing data. BMC Bioinformatics 2022; 23:529. [PMID: 36482321 PMCID: PMC9733401 DOI: 10.1186/s12859-022-05054-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Accepted: 11/14/2022] [Indexed: 12/13/2022]
Abstract
BACKGROUND Single-cell RNA-sequencing (scRNA-seq) technologies allow for the study of gene expression in individual cells. Often, it is of interest to understand how transcriptional activity is associated with cell-specific covariates, such as cell type, genotype, or measures of cell health. Traditional approaches for this type of association mapping assume independence between the outcome variables (or genes), and perform a separate regression for each. However, these methods are computationally costly and ignore the substantial correlation structure of gene expression. Furthermore, count-based scRNA-seq data pose challenges for traditional models based on Gaussian assumptions. RESULTS We aim to resolve these issues by developing a reduced-rank regression model that identifies low-dimensional linear associations between a large number of cell-specific covariates and high-dimensional gene expression readouts. Our probabilistic model uses a Poisson likelihood in order to account for the unique structure of scRNA-seq counts. We demonstrate the performance of our model using simulations, and we apply our model to a scRNA-seq dataset, a spatial gene expression dataset, and a bulk RNA-seq dataset to show its behavior in three distinct analyses. CONCLUSION We show that our statistical modeling approach, which is based on reduced-rank regression, captures associations between gene expression and cell- and sample-specific covariates by leveraging low-dimensional representations of transcriptional states.
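As a rough illustration of the model class described in this abstract, the sketch below evaluates a Poisson log-likelihood in which the log-rate matrix is the low-rank product of covariates and two factor matrices. The log link, the absence of size factors or offsets, and the names `matmul`, `poisson_rrr_loglik`, `A`, and `B` are assumptions made for illustration; the paper's actual parameterization and inference procedure are not shown here.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def poisson_rrr_loglik(X, A, B, Y):
    """Poisson log-likelihood with log-rate matrix X @ A @ B.

    X: n x p covariates, A: p x r, B: r x q, Y: n x q counts.
    The rank r (columns of A) limits the dimension of the association.
    """
    log_rate = matmul(matmul(X, A), B)
    total = 0.0
    for i, row in enumerate(Y):
        for j, y in enumerate(row):
            eta = log_rate[i][j]
            # log Poisson pmf: y * eta - exp(eta) - log(y!)
            total += y * eta - math.exp(eta) - math.lgamma(y + 1)
    return total
```

In this sketch a small rank `r` means that all genes share the same few linear combinations of covariates, which is the sense in which the regression is "reduced-rank".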
Affiliation(s)
- Tiana Fitzgerald
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
- Andrew Jones
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
- Barbara E. Engelhardt
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
  - Data Science and Biotechnology Institute, Gladstone Institutes, San Francisco, CA, USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
5
Jones A, Townes FW, Li D, Engelhardt BE. Contrastive latent variable modeling with application to case-control sequencing experiments. Ann Appl Stat 2022. [DOI: 10.1214/21-aoas1534] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Andrew Jones
  - Department of Computer Science, Princeton University
- Didong Li
  - Department of Computer Science, Princeton University
6
Gewirtz AD, Townes FW, Engelhardt BE. Telescoping bimodal latent Dirichlet allocation to identify expression QTLs across tissues. Life Sci Alliance 2022; 5:e202101297. [PMID: 35977827 PMCID: PMC9387650 DOI: 10.26508/lsa.202101297] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/11/2021] [Revised: 07/15/2022] [Accepted: 07/18/2022] [Indexed: 11/24/2022]
Abstract
Expression quantitative trait loci (eQTLs), or single-nucleotide polymorphisms that affect average gene expression levels, provide important insights into context-specific gene regulation. Classic eQTL analyses use one-to-one association tests, which test gene-variant pairs individually and ignore correlations induced by gene regulatory networks and linkage disequilibrium. Probabilistic topic models, such as latent Dirichlet allocation, estimate latent topics for a collection of count observations. Prior multimodal frameworks that bridge genotype and expression data assume matched sample numbers between modalities. However, many data sets have a nested structure where one individual has several associated gene expression samples and a single germline genotype vector. Here, we build a telescoping bimodal latent Dirichlet allocation (TBLDA) framework to learn shared topics across gene expression and genotype data that allows multiple RNA sequencing samples to correspond to a single individual's genotype. By using raw count data, our model avoids possible adulteration via normalization procedures. Ancestral structure is captured in a genotype-specific latent space, effectively removing it from shared components. Using GTEx v8 expression data across 10 tissues and genotype data, we show that the estimated topics capture meaningful and robust biological signal in both modalities and identify associations within and across tissue types. We identify 4,645 cis-eQTLs and 995 trans-eQTLs by conducting eQTL mapping between the most informative features in each topic. Our TBLDA model is able to identify associations using raw sequencing count data when the samples in two separate data modalities are matched one-to-many, as is often the case in biological data. Our code is freely available at https://github.com/gewirtz/TBLDA.
Affiliation(s)
- Ariel Dh Gewirtz
  - Lewis-Sigler Institute of Integrative Genomics, Princeton University, Princeton, NJ, USA
- F William Townes
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
- Barbara E Engelhardt
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
  - Gladstone Institutes, San Francisco, CA, USA
7
Cui S, Yoo EC, Li D, Laudanski K, Engelhardt BE. Hierarchical Gaussian Processes and Mixtures of Experts to Model COVID-19 Patient Trajectories. Pac Symp Biocomput 2022; 27:266-277. [PMID: 34890155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
Gaussian processes (GPs) are a versatile nonparametric model for nonlinear regression and have been widely used to study spatiotemporal phenomena. However, standard GPs offer limited interpretability and generalizability for datasets with naturally occurring hierarchies. With large-scale, rapidly-updating electronic health record (EHR) data, we want to study patient trajectories across diverse patient cohorts while preserving patient subgroup structure. In this work, we partition our cohort of over 2000 COVID-19 patients by sex and ethnicity. We develop and apply a hierarchical Gaussian process and a mixture of experts (MOE) hierarchical GP model to fit patient trajectories on clinical markers of disease progression. A case study for albumin, an effective predictor of COVID-19 patient outcomes, highlights the predictive performance of these models. These hierarchical spatiotemporal models of EHR data bring us a step closer toward our goal of building flexible approaches to capture patient data that can be used in real-time systems.
Affiliation(s)
- Sunny Cui
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
8
Wu A, Nastase SA, Baldassano CA, Turk-Browne NB, Norman KA, Engelhardt BE, Pillow JW. Brain kernel: A new spatial covariance function for fMRI data. Neuroimage 2021; 245:118580. [PMID: 34740792 DOI: 10.1016/j.neuroimage.2021.118580] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2021] [Revised: 07/30/2021] [Accepted: 09/14/2021] [Indexed: 11/20/2022]
Abstract
A key problem in functional magnetic resonance imaging (fMRI) is to estimate spatial activity patterns from noisy high-dimensional signals. Spatial smoothing provides one approach to regularizing such estimates. However, standard smoothing methods ignore the fact that correlations in neural activity may fall off at different rates in different brain areas, or exhibit discontinuities across anatomical or functional boundaries. Moreover, such methods do not exploit the fact that widely separated brain regions may exhibit strong correlations due to bilateral symmetry or the network organization of brain regions. To capture this non-stationary spatial correlation structure, we introduce the brain kernel, a continuous covariance function for whole-brain activity patterns. We define the brain kernel in terms of a continuous nonlinear mapping from 3D brain coordinates to a latent embedding space, parametrized with a Gaussian process (GP). The brain kernel specifies the prior covariance between voxels as a function of the distance between their locations in embedding space. The GP mapping warps the brain nonlinearly so that highly correlated voxels are close together in latent space, and uncorrelated voxels are far apart. We estimate the brain kernel using resting-state fMRI data, and we develop an exact, scalable inference method based on block coordinate descent to overcome the challenges of high dimensionality (10-100K voxels). Finally, we illustrate the brain kernel's usefulness with applications to brain decoding and factor analysis with multiple task-based fMRI datasets.
Affiliation(s)
- Anqi Wu
  - Center for Theoretical Neuroscience, Columbia University, New York City, NY, USA
- Samuel A Nastase
  - Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA
  - Department of Psychology, Princeton University, Princeton, NJ, USA
- Kenneth A Norman
  - Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA
  - Department of Psychology, Princeton University, Princeton, NJ, USA
- Jonathan W Pillow
  - Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA
  - Department of Psychology, Princeton University, Princeton, NJ, USA
9
Abstract
Multicellular organisms rely on spatial signaling among cells to drive their organization, development, and response to stimuli. Several models have been proposed to capture the behavior of spatial signaling in multicellular systems, but existing approaches fail to capture both the autonomous behavior of single cells and the interactions of a cell with its neighbors simultaneously. We propose a spatiotemporal model of dynamic cell signaling based on Hawkes processes (self-exciting point processes) that model the signaling processes within a cell and spatial couplings between cells. With this cellular point process (CPP), we capture both the single-cell pathway activation rate and the magnitude and duration of signaling between cells relative to their spatial location. Furthermore, our model captures tissues composed of heterogeneous cell types with different bursting rates and signaling behaviors across multiple signaling proteins. We apply our model to epithelial cell systems that exhibit a range of autonomous and spatial signaling behaviors basally and under pharmacological exposure. Our model identifies known drug-induced signaling deficits, characterizes signaling changes across a wound front, and generalizes to multichannel observations.
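To make the self-exciting idea in this abstract concrete, the sketch below evaluates the conditional intensity of a univariate Hawkes process: a baseline rate plus a decaying contribution from each past event. The exponential kernel, the function name `hawkes_intensity`, and the parameters `mu`, `alpha`, and `beta` are assumptions chosen for illustration; the cellular point process in the paper additionally couples neighboring cells by spatial location and is not reproduced here.

```python
import math


def hawkes_intensity(t, event_times, mu, alpha, beta):
    """Intensity at time t: mu + sum of alpha * exp(-beta * (t - t_i)) over past events.

    mu: baseline (autonomous) rate
    alpha: jump in intensity caused by each event
    beta: decay rate of that excitation
    Only events strictly before t contribute.
    """
    excitation = sum(
        alpha * math.exp(-beta * (t - t_i))
        for t_i in event_times
        if t_i < t
    )
    return mu + excitation
```

With no past events the intensity equals the baseline `mu`; each earlier event raises it temporarily, which is how bursts of signaling can arise in a model of this kind.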
Affiliation(s)
- Archit Verma
  - Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ 08544
- Siddhartha G Jena
  - Department of Molecular Biology, Princeton University, Princeton, NJ 08544
- Danielle R Isakov
  - Department of Molecular Biology, Princeton University, Princeton, NJ 08544
- Kazuhiro Aoki
  - National Institute of Basic Biology, National Institutes of Natural Sciences, Okazaki 444-8585, Japan
  - Exploratory Research Center on Life and Living Systems, National Institutes of Natural Sciences, Okazaki 444-8787, Japan
  - International Research Collaboration Center, National Institutes of Natural Sciences, Tokyo 105-0001, Japan
- Jared E Toettcher
  - Department of Molecular Biology, Princeton University, Princeton, NJ 08544
- Barbara E Engelhardt
  - Department of Computer Science, Princeton University, Princeton, NJ 08540
  - Center for Statistics and Machine Learning, Princeton University, Princeton, NJ 08540
10
Dumitrascu B, Villar S, Mixon DG, Engelhardt BE. Optimal marker gene selection for cell type discrimination in single cell analyses. Nat Commun 2021; 12:1186. [PMID: 33608535 PMCID: PMC7895823 DOI: 10.1038/s41467-021-21453-4] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 04/18/2019] [Accepted: 01/27/2021] [Indexed: 11/17/2022]
Abstract
Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of disassociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a data set given a hierarchy of cell types as labels, the markers found by our method improve the recovery of the cell type hierarchy with fewer markers than existing methods using a computationally efficient and principled optimization. The selection of a small set of cellular labels to distinguish a subpopulation of cells from a complex mixture is an important task in cell biology. Here the authors propose a method for supervised genetic marker selection using linear programming and provide a Python package scGeneFit that implements this approach.
Affiliation(s)
- Bianca Dumitrascu
  - Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
- Soledad Villar
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA
  - Mathematical Institute for Data Science, Johns Hopkins University, Baltimore, MD, USA
- Dustin G Mixon
  - Department of Mathematics, The Ohio State University, Columbus, OH, USA
- Barbara E Engelhardt
  - Department of Computer Science, Princeton University, Princeton, NJ, USA
  - Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, USA
11
Lu J, Dumitrascu B, McDowell IC, Jo B, Barrera A, Hong LK, Leichter SM, Reddy TE, Engelhardt BE. Causal network inference from gene transcriptional time-series response to glucocorticoids. PLoS Comput Biol 2021; 17:e1008223. [PMID: 33513136 PMCID: PMC7875426 DOI: 10.1371/journal.pcbi.1008223] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/04/2019] [Revised: 02/10/2021] [Accepted: 08/07/2020] [Indexed: 11/19/2022]
Abstract
Gene regulatory network inference is essential to uncover complex relationships among gene pathways and inform downstream experiments, ultimately enabling regulatory network re-engineering. Network inference from transcriptional time-series data requires accurate, interpretable, and efficient determination of causal relationships among thousands of genes. Here, we develop Bootstrap Elastic net regression from Time Series (BETS), a statistical framework based on Granger causality for the recovery of a directed gene network from transcriptional time-series data. BETS uses elastic net regression and stability selection from bootstrapped samples to infer causal relationships among genes. BETS is highly parallelized, enabling efficient analysis of large transcriptional data sets. We show competitive accuracy on a community benchmark, the DREAM4 100-gene network inference challenge, where BETS is one of the fastest among methods of similar performance and additionally infers whether causal effects are activating or inhibitory. We apply BETS to transcriptional time-series data of differentially-expressed genes from A549 cells exposed to glucocorticoids over a period of 12 hours. We identify a network of 2768 genes and 31,945 directed edges (FDR ≤ 0.2). We validate inferred causal network edges using two external data sources: Overexpression experiments on the same glucocorticoid system, and genetic variants associated with inferred edges in primary lung tissue in the Genotype-Tissue Expression (GTEx) v6 project. BETS is available as an open source software package at https://github.com/lujonathanh/BETS.
Affiliation(s)
- Jonathan Lu
  - Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Bianca Dumitrascu
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Ian C. McDowell
  - Element Genomics, A UCB Company, Durham, North Carolina, United States of America
- Brian Jo
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Alejandro Barrera
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina, United States of America
  - Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, North Carolina, United States of America
- Linda K. Hong
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina, United States of America
- Sarah M. Leichter
  - Center for Genomic and Computational Biology, Duke University, Durham, North Carolina, United States of America
- Timothy E. Reddy
  - Department of Genome Sciences, Duke University, Durham, North Carolina, United States of America
- Barbara E. Engelhardt
  - Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
  - Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
12
Camerlenghi F, Dumitrascu B, Ferrari F, Engelhardt BE, Favaro S. Nonparametric Bayesian multiarmed bandits for single-cell experiment design. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1370] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022]
13
Gill D, Arvanitis M, Carter P, Hernández Cordero AI, Jo B, Karhunen V, Larsson SC, Li X, Lockhart SM, Mason A, Pashos E, Saha A, Tan VY, Zuber V, Bossé Y, Fahle S, Hao K, Jiang T, Joubert P, Lunt AC, Ouwehand WH, Roberts DJ, Timens W, van den Berge M, Watkins NA, Battle A, Butterworth AS, Danesh J, Di Angelantonio E, Engelhardt BE, Peters JE, Sin DD, Burgess S. ACE inhibition and cardiometabolic risk factors, lung ACE2 and TMPRSS2 gene expression, and plasma ACE2 levels: a Mendelian randomization study. R Soc Open Sci 2020; 7:200958. [PMID: 33391794 PMCID: PMC7735342 DOI: 10.1098/rsos.200958] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/29/2020] [Accepted: 11/03/2020] [Indexed: 05/14/2023]
Abstract
Angiotensin-converting enzyme 2 (ACE2) and serine protease TMPRSS2 have been implicated in cell entry for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus responsible for coronavirus disease 2019 (COVID-19). The expression of ACE2 and TMPRSS2 in the lung epithelium might have implications for the risk of SARS-CoV-2 infection and severity of COVID-19. We use human genetic variants that proxy angiotensin-converting enzyme (ACE) inhibitor drug effects and cardiovascular risk factors to investigate whether these exposures affect lung ACE2 and TMPRSS2 gene expression and circulating ACE2 levels. We observed no consistent evidence of an association of genetically predicted serum ACE levels with any of our outcomes. There was weak evidence for an association of genetically predicted serum ACE levels with ACE2 gene expression in the Lung eQTL Consortium (p = 0.014), but this finding did not replicate. There was evidence of a positive association of genetic liability to type 2 diabetes mellitus with lung ACE2 gene expression in the Gene-Tissue Expression (GTEx) study (p = 4 × 10⁻⁴) and with circulating plasma ACE2 levels in the INTERVAL study (p = 0.03), but not with lung ACE2 expression in the Lung eQTL Consortium study (p = 0.68). There were no associations of genetically proxied liability to the other cardiometabolic traits with any outcome. This study does not provide consistent evidence to support an effect of serum ACE levels (as a proxy for ACE inhibitors) or cardiometabolic risk factors on lung ACE2 and TMPRSS2 expression or plasma ACE2 levels.
Affiliation(s)
- Dipender Gill
  - Department of Epidemiology and Biostatistics, St Mary's Hospital, Imperial College London, Medical School Building, London, UK
- Marios Arvanitis
  - Department of Medicine, Division of Cardiology, Johns Hopkins University, Baltimore, MD, USA
- Paul Carter
  - Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Ana I. Hernández Cordero
  - The University of British Columbia Centre for Heart Lung Innovation, St Paul's Hospital, Vancouver, BC, Canada
- Brian Jo
  - Program in Quantitative and Computational Biology, Lewis Sigler Institute for Integrative Biology, Princeton, NJ, USA
- Ville Karhunen
  - Department of Epidemiology and Biostatistics, St Mary's Hospital, Imperial College London, Medical School Building, London, UK
- Susanna C. Larsson
  - Unit of Cardiovascular and Nutritional Epidemiology, Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden
  - Department of Surgical Sciences, Uppsala University, Uppsala, Sweden
- Xuan Li
  - The University of British Columbia Centre for Heart Lung Innovation, St Paul's Hospital, Vancouver, BC, Canada
- Sam M. Lockhart
  - Medical Research Council Metabolic Diseases Unit, Wellcome Trust-Medical Research Council Institute of Metabolic Science, University of Cambridge, Cambridge, UK
- Amy Mason
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - National Institute for Health Research Cambridge Biomedical Research Centre, University of Cambridge and Cambridge University Hospitals, Cambridge, UK
- Evanthia Pashos
  - Internal Medicine Research Unit, Pfizer Worldwide Research, Development & Medical, Cambridge, MA, USA
- Ashis Saha
  - Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
- Vanessa Y. Tan
  - Medical Research Council Integrative Epidemiology Unit, University of Bristol, Bristol, UK
  - Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
- Verena Zuber
  - Department of Epidemiology and Biostatistics, St Mary's Hospital, Imperial College London, Medical School Building, London, UK
  - Medical Research Council Biostatistics Unit, Cambridge Institute of Public Health, University of Cambridge, Cambridge, UK
- Yohan Bossé
  - Institut universitaire de cardiologie et de pneumologie de Québec – Université Laval, Quebec, Canada
- Sarah Fahle
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge, UK
  - National Institute for Health Research Cambridge Biomedical Research Centre, University of Cambridge and Cambridge University Hospitals, Cambridge, UK
- Ke Hao
  - Department of Genetics and Genomic Sciences, Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Tao Jiang
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Philippe Joubert
  - Institut universitaire de cardiologie et de pneumologie de Québec – Université Laval, Quebec, Canada
- Alan C. Lunt
  - Department of Epidemiology and Biostatistics, St Mary's Hospital, Imperial College London, Medical School Building, London, UK
- Willem Hendrik Ouwehand
  - Department of Haematology, University of Cambridge, Cambridge Biomedical Campus, Cambridge, UK
  - NHS Blood and Transplant, Cambridge Biomedical Campus, Cambridge, UK
  - Wellcome Sanger Institute, Cambridge, UK
- David J. Roberts
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge, UK
  - NHS Blood and Transplant-Oxford Centre, Level 2, John Radcliffe Hospital, Oxford, UK
  - Radcliffe Department of Medicine, University of Oxford, John Radcliffe Hospital, Oxford, UK
- Wim Timens
  - Department of Pathology and Medical Biology and Groningen Research Institute for Asthma and COPD, University of Groningen, Groningen, The Netherlands
- Maarten van den Berge
  - Department of Pulmonology and Groningen Research Institute for Asthma and COPD, University of Groningen, Groningen, The Netherlands
- Nicholas A. Watkins
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge, UK
  - NHS Blood and Transplant, Cambridge Biomedical Campus, Cambridge, UK
- Alexis Battle
  - Department of Biomedical Engineering and Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Adam S. Butterworth
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge, UK
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
  - National Institute for Health Research Cambridge Biomedical Research Centre, University of Cambridge and Cambridge University Hospitals, Cambridge, UK
  - Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- John Danesh
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge, UK
  - British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
  - National Institute for Health Research Cambridge Biomedical Research Centre, University of Cambridge and Cambridge University Hospitals, Cambridge, UK
  - Wellcome Sanger Institute, Cambridge, UK
  - Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- Emanuele Di Angelantonio
  - British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
  - National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
- National Institute for Health Research Cambridge Biomedical Research Centre, University of Cambridge and Cambridge University Hospitals, Cambridge, UK
- NHS Blood and Transplant, Cambridge Biomedical Campus, Cambridge, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
| | - Barbara E. Engelhardt
- Computer Science Department and Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, USA
| | - James E. Peters
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK
- Department of Immunology and Inflammation, Faculty of Medicine, Imperial College London, London, UK
| | - Don D. Sin
- The University of British Columbia Centre for Heart Lung Innovation, St Paul's Hospital, Vancouver, BC, Canada
| | - Stephen Burgess
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
- Medical Research Council Biostatistics Unit, Cambridge Institute of Public Health, University of Cambridge, Cambridge, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge, UK
- Homerton College, University of Cambridge, Cambridge, UK
- National Institute for Health Research Cambridge Biomedical Research Centre, University of Cambridge and Cambridge University Hospitals, Cambridge, UK
|
14
|
Oliva M, Muñoz-Aguirre M, Kim-Hellmuth S, Wucher V, Gewirtz ADH, Cotter DJ, Parsana P, Kasela S, Balliu B, Viñuela A, Castel SE, Mohammadi P, Aguet F, Zou Y, Khramtsova EA, Skol AD, Garrido-Martín D, Reverter F, Brown A, Evans P, Gamazon ER, Payne A, Bonazzola R, Barbeira AN, Hamel AR, Martinez-Perez A, Soria JM, Pierce BL, Stephens M, Eskin E, Dermitzakis ET, Segrè AV, Im HK, Engelhardt BE, Ardlie KG, Montgomery SB, Battle AJ, Lappalainen T, Guigó R, Stranger BE. The impact of sex on gene expression across human tissues. Science 2020; 369:eaba3066. [PMID: 32913072 DOI: 10.1126/science.aba3066] [Citation(s) in RCA: 276] [Impact Index Per Article: 69.0] [Received: 11/25/2019] [Accepted: 08/03/2020] [Indexed: 12/12/2022]
Abstract
Many complex human phenotypes exhibit sex-differentiated characteristics. However, the molecular mechanisms underlying these differences remain largely unknown. We generated a catalog of sex differences in gene expression and in the genetic regulation of gene expression across 44 human tissue sources surveyed by the Genotype-Tissue Expression project (GTEx, v8 release). We demonstrate that sex influences gene expression levels and cellular composition of tissue samples across the human body. A total of 37% of all genes exhibit sex-biased expression in at least one tissue. We identify cis expression quantitative trait loci (eQTLs) with sex-differentiated effects and characterize their cellular origin. By integrating sex-biased eQTLs with genome-wide association study data, we identify 58 gene-trait associations that are driven by genetic regulation of gene expression in a single sex. These findings provide an extensive characterization of sex differences in the human transcriptome and its genetic regulation.
Affiliation(s)
- Meritxell Oliva
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA.,Institute for Genomics and Systems Biology, University of Chicago, Chicago, IL, USA.,Department of Public Health Sciences, University of Chicago, Chicago, IL, USA
| | - Manuel Muñoz-Aguirre
- Centre for Genomic Regulation, Barcelona Institute for Science and Technology, Barcelona, Catalonia, Spain.,Department of Statistics and Operations Research, Universitat Politècnica de Catalunya, Barcelona, Catalonia, Spain
| | - Sarah Kim-Hellmuth
- Statistical Genetics, Max Planck Institute of Psychiatry, Munich, Germany.,New York Genome Center, New York, NY, USA.,Department of Systems Biology, Columbia University, New York, NY, USA
| | - Valentin Wucher
- Centre for Genomic Regulation, Barcelona Institute for Science and Technology, Barcelona, Catalonia, Spain
| | - Ariel D H Gewirtz
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
| | - Daniel J Cotter
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Princy Parsana
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | - Silva Kasela
- New York Genome Center, New York, NY, USA.,Department of Systems Biology, Columbia University, New York, NY, USA
| | - Brunilda Balliu
- Department of Computational Medicine, University of California, Los Angeles, CA, USA
| | - Ana Viñuela
- Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland
| | - Stephane E Castel
- New York Genome Center, New York, NY, USA.,Department of Systems Biology, Columbia University, New York, NY, USA
| | - Pejman Mohammadi
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, Scripps Research Translational Institute, La Jolla, CA, USA
| | | | - Yuxin Zou
- Department of Statistics, University of Chicago, Chicago, IL, USA
| | - Ekaterina A Khramtsova
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA.,Computational Sciences, Janssen Pharmaceuticals, Spring House, PA, USA
| | - Andrew D Skol
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA.,Institute for Genomics and Systems Biology, University of Chicago, Chicago, IL, USA.,Center for Translational Data Science, University of Chicago, Chicago, IL, USA.,Department of Pathology and Laboratory Medicine, Ann and Robert H. Lurie Children's Hospital of Chicago, Chicago, IL, USA
| | - Diego Garrido-Martín
- Centre for Genomic Regulation, Barcelona Institute for Science and Technology, Barcelona, Catalonia, Spain
| | - Ferran Reverter
- Department of Genetics, Microbiology and Statistics, Faculty of Biology, University of Barcelona, Barcelona, Spain
| | | | - Patrick Evans
- Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Eric R Gamazon
- Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, USA.,Clare Hall, University of Cambridge, Cambridge, UK
| | - Anthony Payne
- Wellcome Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, UK
| | - Rodrigo Bonazzola
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
| | - Alvaro N Barbeira
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
| | - Andrew R Hamel
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
| | - Angel Martinez-Perez
- Genomics of Complex Diseases Group, Research Institute Hospital de la Sant Creu i Sant Pau, IIB Sant Pau, Barcelona, Spain
| | - José Manuel Soria
- Genomics of Complex Diseases Group, Research Institute Hospital de la Sant Creu i Sant Pau, IIB Sant Pau, Barcelona, Spain
| | | | - Brandon L Pierce
- Department of Public Health Sciences, University of Chicago, Chicago, IL, USA
| | - Matthew Stephens
- Department of Statistics, University of Chicago, Chicago, IL, USA.,Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Eleazar Eskin
- Departments of Computational Medicine, Computer Science, and Human Genetics, University of California, Los Angeles, CA, USA
| | - Emmanouil T Dermitzakis
- Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland
| | - Ayellet V Segrè
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.,Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
| | - Hae Kyung Im
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA
| | - Barbara E Engelhardt
- Department of Computer Science, Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, USA.,Genomics plc, Oxford, UK
| | | | - Stephen B Montgomery
- Department of Genetics, Stanford University, Stanford, CA, USA.,Department of Pathology, Stanford University, Stanford, CA, USA
| | - Alexis J Battle
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.,Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Tuuli Lappalainen
- New York Genome Center, New York, NY, USA.,Department of Systems Biology, Columbia University, New York, NY, USA
| | - Roderic Guigó
- Centre for Genomic Regulation, Barcelona Institute for Science and Technology, Barcelona, Catalonia, Spain.,Universitat Pompeu Fabra, Barcelona, Catalonia, Spain
| | - Barbara E Stranger
- Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, USA.,Institute for Genomics and Systems Biology, University of Chicago, Chicago, IL, USA.,Center for Translational Data Science, University of Chicago, Chicago, IL, USA.,Center for Genetic Medicine, Department of Pharmacology, Northwestern University, Chicago, IL, USA
|
15
|
Abstract
BACKGROUND: Modern developments in single-cell sequencing technologies enable broad insights into cellular state. Single-cell RNA sequencing (scRNA-seq) can be used to explore cell types, states, and developmental trajectories to broaden our understanding of cellular heterogeneity in tissues and organs. Analysis of these sparse, high-dimensional experimental results requires dimension reduction. Several methods have been developed to estimate low-dimensional embeddings for filtered and normalized single-cell data. However, methods have yet to be developed for unfiltered and unnormalized count data that estimate uncertainty in the low-dimensional space. We present a nonlinear latent variable model with robust, heavy-tailed error and adaptive kernel learning to estimate low-dimensional nonlinear structure in scRNA-seq data.
RESULTS: Gene expression in a single cell is modeled as a noisy draw from a Gaussian process in high dimensions from low-dimensional latent positions. This model is called the Gaussian process latent variable model (GPLVM). We model residual errors with a heavy-tailed Student's t-distribution to estimate a manifold that is robust to technical and biological noise found in normalized scRNA-seq data. We compare our approach to common dimension reduction tools across a diverse set of scRNA-seq data sets to highlight our model's ability to enable important downstream tasks such as clustering, inferring cell developmental trajectories, and visualizing high throughput experiments on available experimental data.
CONCLUSION: We show that our adaptive robust statistical approach to estimate a nonlinear manifold is well suited for raw, unfiltered gene counts from high-throughput sequencing technologies for visualization, exploration, and uncertainty estimation of cell states.
Affiliation(s)
- Archit Verma
- Chemical and Biological Engineering, Princeton University, 50-70 Olden Street, Princeton, NJ 08540, USA
| | - Barbara E. Engelhardt
- Computer Science, Center for Statistics and Machine Learning, 35 Olden Street, Princeton, NJ 08540, USA
|
16
|
Cheng LF, Dumitrascu B, Darnell G, Chivers C, Draugelis M, Li K, Engelhardt BE. Sparse multi-output Gaussian processes for online medical time series prediction. BMC Med Inform Decis Mak 2020; 20:152. [PMID: 32641134 PMCID: PMC7341595 DOI: 10.1186/s12911-020-1069-4] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 11/09/2018] [Accepted: 03/05/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: For real-time monitoring of hospital patients, high-quality inference of patients' health status using all information available from clinical covariates and lab test results is essential to enable successful medical interventions and improve patient outcomes. Developing a computational framework that can learn from observational large-scale electronic health records (EHRs) and make accurate real-time predictions is a critical step. In this work, we develop and explore a Bayesian nonparametric model based on multi-output Gaussian process (GP) regression for hospital patient monitoring.
METHODS: We propose MedGP, a statistical framework that incorporates 24 clinical covariates and supports a rich reference data set from which relationships between observed covariates may be inferred and exploited for high-quality inference of patient state over time. To do this, we develop a highly structured sparse GP kernel to enable tractable computation over tens of thousands of time points while estimating correlations among clinical covariates, patients, and periodicity in patient observations. MedGP has a number of benefits over current methods, including (i) not requiring an alignment of the time series data, (ii) quantifying confidence regions in the predictions, (iii) exploiting a vast and rich database of patients, and (iv) inferring interpretable relationships among clinical covariates.
RESULTS: We evaluate and compare results from MedGP on the task of online prediction for three patient subgroups from two medical data sets across 8,043 patients. We find MedGP improves online prediction over baseline and state-of-the-art methods for nearly all covariates across different disease subgroups and hospitals.
CONCLUSIONS: The MedGP framework is robust and efficient in estimating the temporal dependencies from sparse and irregularly sampled medical time series data for online prediction. The publicly available code is at https://github.com/bee-hive/MedGP.
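To illustrate the kind of GP regression this abstract builds on, here is a minimal sketch of single-output GP prediction on an irregularly sampled time series, with an approximate 95% confidence region. It is not MedGP: the RBF kernel, the fixed hyperparameters, and the helper names `rbf_kernel` and `gp_posterior` are assumptions chosen for illustration, whereas the paper describes a structured sparse multi-output kernel.

```python
import numpy as np

def rbf_kernel(t1, t2, length_scale=1.0, variance=1.0):
    # Squared-exponential covariance between two sets of time points.
    diff = t1[:, None] - t2[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(t_obs, y_obs, t_new, noise=0.1, length_scale=1.0, variance=1.0):
    # Posterior mean and standard deviation of the latent function at t_new.
    K = rbf_kernel(t_obs, t_obs, length_scale, variance) + noise**2 * np.eye(len(t_obs))
    K_s = rbf_kernel(t_obs, t_new, length_scale, variance)
    K_ss = rbf_kernel(t_new, t_new, length_scale, variance)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Irregularly spaced observation times; no alignment to a grid is needed.
t_obs = np.array([0.0, 0.4, 1.1, 2.7, 3.0, 4.6])
y_obs = np.sin(t_obs)
t_new = np.array([0.4, 2.0, 8.0])

mean, sd = gp_posterior(t_obs, y_obs, t_new)
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd  # approximate 95% region
```

The uncertainty band is narrow near observed times and widens toward the prior far from the data, which is the behavior that makes confidence regions informative for sparse clinical measurements.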
Affiliation(s)
- Li-Fang Cheng
- Department of Electrical Engineering, Princeton University, Princeton, USA
| | | | - Gregory Darnell
- Lewis-Sigler Institute, Princeton University, Princeton, NJ USA
| | - Corey Chivers
- University of Pennsylvania Health System, Philadelphia, PA USA
| | | | - Kai Li
- Department of Computer Science, Princeton University, Princeton, NJ USA
| | - Barbara E Engelhardt
- Department of Computer Science, Princeton University, Princeton, NJ USA
- Center for Statistics and Machine Learning, Princeton University, Princeton, NJ USA
|
17
|
Salganik MJ, Lundberg I, Kindel AT, Ahearn CE, Al-Ghoneim K, Almaatouq A, Altschul DM, Brand JE, Carnegie NB, Compton RJ, Datta D, Davidson T, Filippova A, Gilroy C, Goode BJ, Jahani E, Kashyap R, Kirchner A, McKay S, Morgan AC, Pentland A, Polimis K, Raes L, Rigobon DE, Roberts CV, Stanescu DM, Suhara Y, Usmani A, Wang EH, Adem M, Alhajri A, AlShebli B, Amin R, Amos RB, Argyle LP, Baer-Bositis L, Büchi M, Chung BR, Eggert W, Faletto G, Fan Z, Freese J, Gadgil T, Gagné J, Gao Y, Halpern-Manners A, Hashim SP, Hausen S, He G, Higuera K, Hogan B, Horwitz IM, Hummel LM, Jain N, Jin K, Jurgens D, Kaminski P, Karapetyan A, Kim EH, Leizman B, Liu N, Möser M, Mack AE, Mahajan M, Mandell N, Marahrens H, Mercado-Garcia D, Mocz V, Mueller-Gastell K, Musse A, Niu Q, Nowak W, Omidvar H, Or A, Ouyang K, Pinto KM, Porter E, Porter KE, Qian C, Rauf T, Sargsyan A, Schaffner T, Schnabel L, Schonfeld B, Sender B, Tang JD, Tsurkov E, van Loon A, Varol O, Wang X, Wang Z, Wang J, Wang F, Weissman S, Whitaker K, Wolters MK, Woon WL, Wu J, Wu C, Yang K, Yin J, Zhao B, Zhu C, Brooks-Gunn J, Engelhardt BE, Hardt M, Knox D, Levy K, Narayanan A, Stewart BM, Watts DJ, McLanahan S. Measuring the predictability of life outcomes with a scientific mass collaboration. Proc Natl Acad Sci U S A 2020; 117:8398-8403. [PMID: 32229555 PMCID: PMC7165437 DOI: 10.1073/pnas.1915006117] [Citation(s) in RCA: 58] [Impact Index Per Article: 14.5] [Indexed: 11/18/2022] Open
Abstract
How predictable are life trajectories? We investigated this question with a scientific mass collaboration using the common task method; 160 teams built predictive models for six life outcomes using data from the Fragile Families and Child Wellbeing Study, a high-quality birth cohort study. Despite using a rich dataset and applying machine-learning methods optimized for prediction, the best predictions were not very accurate and were only slightly better than those from a simple benchmark model. Within each outcome, prediction error was strongly associated with the family being predicted and weakly associated with the technique used to generate the prediction. Overall, these results suggest practical limits to the predictability of life outcomes in some settings and illustrate the value of mass collaborations in the social sciences.
Affiliation(s)
| | - Ian Lundberg
- Department of Sociology, Princeton University, Princeton, NJ 08544
| | | | - Caitlin E Ahearn
- Department of Sociology, University of California, Los Angeles, CA 90095
| | | | - Abdullah Almaatouq
- Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA 02142
- Media Lab, Massachusetts Institute of Technology, Cambridge, MA 02139
| | - Drew M Altschul
- Mental Health Data Science Scotland, Department of Psychology, The University of Edinburgh, Edinburgh EH8 9JZ, United Kingdom
| | - Jennie E Brand
- Department of Sociology, University of California, Los Angeles, CA 90095
- Department of Statistics, University of California, Los Angeles, CA 90095
| | | | - Ryan James Compton
- Human Computer Interaction Lab, University of California, Santa Cruz, CA 95064
| | - Debanjan Datta
- Discovery Analytics Center, Virginia Polytechnic Institute and State University, Arlington, VA 22203
| | - Thomas Davidson
- Department of Sociology, Cornell University, Ithaca, NY 14853
| | | | - Connor Gilroy
- Department of Sociology, University of Washington, Seattle, WA 98105
| | - Brian J Goode
- Social and Decision Analytics Laboratory, Fralin Life Sciences Institute, Virginia Polytechnic Institute and State University, Arlington, VA 22203
| | - Eaman Jahani
- Institute for Data, Systems and Society, Massachusetts Institute of Technology, Cambridge, MA 02139
| | - Ridhi Kashyap
- Department of Sociology, University of Oxford, Oxford OX1 1JD, United Kingdom
- Nuffield College, University of Oxford, Oxford OX1 1NF, United Kingdom
- School of Anthropology and Museum Ethnography, University of Oxford, Oxford OX2 6PE, United Kingdom
| | - Antje Kirchner
- Program for Research in Survey Methodology, Survey Research Division, RTI International, Research Triangle Park, NC 27709
| | - Stephen McKay
- School of Social and Political Sciences, University of Lincoln, Brayford Pool, Lincoln LN6 7TS, United Kingdom
| | - Allison C Morgan
- Department of Computer Science, University of Colorado, Boulder, CO 80309
| | - Alex Pentland
- Media Lab, Massachusetts Institute of Technology, Cambridge, MA 02139
| | - Kivan Polimis
- Center for the Study of Demography and Ecology, University of Washington, Seattle, WA 98105
| | - Louis Raes
- Department of Economics, Tilburg School of Economics and Management, Tilburg University, 5037 AB Tilburg, The Netherlands
| | - Daniel E Rigobon
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544
| | - Claudia V Roberts
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Diana M Stanescu
- Department of Politics, Princeton University, Princeton, NJ 08544
| | - Yoshihiko Suhara
- Media Lab, Massachusetts Institute of Technology, Cambridge, MA 02139
| | - Adaner Usmani
- Department of Sociology, Harvard University, Cambridge, MA 02138
| | - Erik H Wang
- Department of Politics, Princeton University, Princeton, NJ 08544
| | - Muna Adem
- Department of Sociology, Indiana University, Bloomington, IN 47405
| | - Abdulla Alhajri
- Department of Nuclear Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139
| | - Bedoor AlShebli
- Computational Social Science Lab, Social Science Division, New York University Abu Dhabi, 129188 Abu Dhabi, United Arab Emirates
| | - Redwane Amin
- Bendheim Center for Finance, Princeton University, Princeton, NJ 08544
| | - Ryan B Amos
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Lisa P Argyle
- Department of Political Science, Brigham Young University, Provo, UT 84602
| | | | - Moritz Büchi
- Department of Communication and Media Research, University of Zurich, Zurich, Switzerland, ZH-8050
| | - Bo-Ryehn Chung
- Center for Statistics & Machine Learning, Princeton University, Princeton, NJ 08544
| | - William Eggert
- Department of Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ 08544
| | - Gregory Faletto
- Statistics Group, Department of Data Sciences and Operations, Marshall School of Business, University of Southern California, Los Angeles, CA 90089
| | - Zhilin Fan
- Department of Statistics, Columbia University, New York, NY 10027
| | - Jeremy Freese
- Department of Sociology, Stanford University, Stanford, CA 94305
| | - Tejomay Gadgil
- Center for Data Science, New York University, New York, NY 10011
| | - Josh Gagné
- Department of Sociology, Stanford University, Stanford, CA 94305
| | - Yue Gao
- Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027
| | | | - Sonia P Hashim
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Sonia Hausen
- Department of Sociology, Stanford University, Stanford, CA 94305
| | - Guanhua He
- Department of Molecular Biology, Princeton University, Princeton, NJ 08544
| | - Kimberly Higuera
- Department of Sociology, Stanford University, Stanford, CA 94305
| | - Bernie Hogan
- Oxford Internet Institute, University of Oxford, Oxford OX1 3JS, United Kingdom
| | - Ilana M Horwitz
- Graduate School of Education, Stanford University, Stanford, CA 94305
| | - Lisa M Hummel
- Department of Sociology, Stanford University, Stanford, CA 94305
| | - Naman Jain
- Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544
| | - Kun Jin
- Department of Computer Science, Ohio State University, Columbus, OH 43210
| | - David Jurgens
- School of Information, University of Michigan, Ann Arbor, MI 48104
| | - Patrick Kaminski
- Department of Sociology, Indiana University, Bloomington, IN 47405
- Center for Complex Networks and Systems Research, Indiana University, Bloomington, IN 47405
| | - Areg Karapetyan
- Department of Computer Science, Masdar Institute, Khalifa University, 127788 Abu Dhabi, United Arab Emirates
- Research Institute for Mathematical Sciences, Kyoto University, Kyoto 606-8502, Japan
| | - E H Kim
- Department of Sociology, Stanford University, Stanford, CA 94305
| | - Ben Leizman
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Naijia Liu
- Department of Politics, Princeton University, Princeton, NJ 08544
| | - Malte Möser
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Andrew E Mack
- Department of Politics, Princeton University, Princeton, NJ 08544
| | - Mayank Mahajan
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Noah Mandell
- Department of Astrophysical Sciences, Princeton University, Princeton, NJ 08544
| | - Helge Marahrens
- Department of Sociology, Indiana University, Bloomington, IN 47405
| | | | - Viola Mocz
- Department of Neuroscience, Princeton University, Princeton, NJ 08544
| | | | - Ahmed Musse
- Department of Electrical Engineering, Princeton University, Princeton, NJ 08544
| | - Qiankun Niu
- Bendheim Center for Finance, Princeton University, Princeton, NJ 08544
| | | | - Hamidreza Omidvar
- Department of Civil and Environmental Engineering, Princeton University, Princeton, NJ 08544
| | - Andrew Or
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Karen Ouyang
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Katy M Pinto
- Department of Sociology, California State University, Dominguez Hills, Carson, CA 90747
| | - Ethan Porter
- School of Media and Public Affairs, George Washington University, Washington, DC 20052
| | | | - Crystal Qian
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Tamkinat Rauf
- Department of Sociology, Stanford University, Stanford, CA 94305
| | - Anahit Sargsyan
- Social Science Division, New York University Abu Dhabi, 129188 Abu Dhabi, United Arab Emirates
| | - Thomas Schaffner
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Landon Schnabel
- Department of Sociology, Stanford University, Stanford, CA 94305
| | - Bryan Schonfeld
- Department of Politics, Princeton University, Princeton, NJ 08544
| | - Ben Sender
- Department of Economics, Princeton University, Princeton, NJ 08544
| | - Jonathan D Tang
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Emma Tsurkov
- Department of Sociology, Stanford University, Stanford, CA 94305
| | - Austin van Loon
- Department of Sociology, Stanford University, Stanford, CA 94305
| | - Onur Varol
- Center for Complex Network Research, Northeastern University Networks Science Institute, Boston, MA 02115
- Luddy School of Informatics, Computing, & Engineering, Indiana University, Bloomington, IN 47408
| | - Xiafei Wang
- School of Social Work, David B. Falk College of Sport and Human Dynamics, Syracuse University, NY 13244
| | - Zhi Wang
- Luddy School of Informatics, Computing, & Engineering, Indiana University, Bloomington, IN 47408
- School of Public Health, Indiana University, Bloomington, IN 47408
| | - Julia Wang
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Flora Wang
- Department of Economics, Princeton University, Princeton, NJ 08544
| | - Samantha Weissman
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Kirstie Whitaker
- The Alan Turing Institute, London NW1 2DB, United Kingdom
- Department of Psychiatry, University of Cambridge, Cambridge CB2 0SZ, United Kingdom
| | - Maria K Wolters
- School of Informatics, University of Edinburgh, Edinburgh EH8 9AB, United Kingdom
| | - Wei Lee Woon
- Department of Marketplaces & Yield Data Science, Expedia Group, Seattle, WA 98119
| | - James Wu
- Department of the Applied Statistics, Social Science, and Humanities, New York University, New York, NY 10003
| | - Catherine Wu
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | - Kengran Yang
- Department of Civil and Environmental Engineering, Princeton University, Princeton, NJ 08544
| | - Jingwen Yin
- Department of Statistics, Columbia University, New York, NY 10027
| | - Bingyu Zhao
- Department of Engineering, University of Cambridge, Cambridge CB2 1PZ, United Kingdom
| | - Chenyun Zhu
- Department of Statistics, Columbia University, New York, NY 10027
| | - Jeanne Brooks-Gunn
- Department of Human Development, Teachers College, Columbia University, New York, NY 10027
- Department of Pediatrics, Vagelos College of Physicians and Surgeons, Columbia University, New York, NY 10032
| | - Barbara E Engelhardt
- Department of Computer Science, Princeton University, Princeton, NJ 08544
- Center for Statistics & Machine Learning, Princeton University, Princeton, NJ 08544
| | - Moritz Hardt
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94720
| | - Dean Knox
- Department of Politics, Princeton University, Princeton, NJ 08544
| | - Karen Levy
- Department of Information Science, Cornell University, Ithaca, NY 14853
| | - Arvind Narayanan
- Department of Computer Science, Princeton University, Princeton, NJ 08544
| | | | - Duncan J Watts
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104
- Annenberg School of Communication, University of Pennsylvania, Philadelphia, PA 19104
- Operations, Information and Decisions Department, University of Pennsylvania, Philadelphia, PA 19104
| | - Sara McLanahan
- Department of Sociology, Princeton University, Princeton, NJ 08544
|
18
|
Elyanow R, Dumitrascu B, Engelhardt BE, Raphael BJ. netNMF-sc: leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis. Genome Res 2020; 30:195-204. [PMID: 31992614 PMCID: PMC7050525 DOI: 10.1101/gr.251603.119] [Citation(s) in RCA: 36] [Impact Index Per Article: 9.0] [Received: 04/18/2019] [Accepted: 11/19/2019] [Indexed: 02/06/2023]
Abstract
Single-cell RNA-sequencing (scRNA-seq) enables high-throughput measurement of RNA expression in single cells. However, because of technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells in a lower-dimensional space, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc learns a low-dimensional representation of scRNA-seq transcript counts using network-regularized non-negative matrix factorization. The network regularization takes advantage of prior knowledge of gene–gene interactions, encouraging pairs of genes with known interactions to be nearby each other in the low-dimensional representation. The resulting matrix factorization imputes gene abundance for both zero and nonzero counts and can be used to cluster cells into meaningful subpopulations. We show that netNMF-sc outperforms existing methods at clustering cells and estimating gene–gene covariance using both simulated and real scRNA-seq data, with increasing advantages at higher dropout rates (e.g., >60%). We also show that the results from netNMF-sc are robust to variation in the input network, with more representative networks leading to greater performance gains.
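The idea of network-regularized non-negative matrix factorization can be sketched in a few lines. The version below is an illustrative assumption, not the netNMF-sc implementation: it uses a squared-error loss with the standard multiplicative updates for graph-regularized NMF, and the function names, toy data, and parameters (`lam`, `d`) are hypothetical. It factors a genes-by-cells matrix `X` as `W @ H` while penalizing `lam * trace(W.T @ L @ W)`, where `L` is the Laplacian of a gene-gene adjacency matrix `A`, so that interacting genes receive similar low-dimensional representations.

```python
import numpy as np

def graph_regularized_nmf(X, A, d=2, lam=1.0, n_iter=200, seed=0, eps=1e-10):
    # X: genes x cells non-negative matrix; A: symmetric gene-gene adjacency.
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    W = rng.random((n_genes, d)) + 0.1   # gene factors
    H = rng.random((d, n_cells)) + 0.1   # cell factors
    D = np.diag(A.sum(axis=1))           # degree matrix, so L = D - A
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        W *= (X @ H.T + lam * (A @ W)) / (W @ (H @ H.T) + lam * (D @ W) + eps)
        H *= (W.T @ X) / ((W.T @ W) @ H + eps)
    return W, H

def objective(X, A, W, H, lam):
    # Squared reconstruction error plus the graph smoothness penalty.
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.norm(X - W @ H) ** 2 + lam * np.trace(W.T @ L @ W)

# Toy count matrix (6 genes, 8 cells) and a network linking genes 0-1 and 2-3.
rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(6, 8)).astype(float)
A = np.zeros((6, 6))
A[0, 1] = A[1, 0] = 1.0
A[2, 3] = A[3, 2] = 1.0

W, H = graph_regularized_nmf(X, A, d=2, lam=1.0)
imputed = W @ H  # dense estimate for every entry, including observed zeros
```

The product `W @ H` supplies a value for every gene in every cell, which is how a low-rank factorization can act as an imputation step for dropout-affected counts.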
Affiliation(s)
- Rebecca Elyanow
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island 02912, USA; Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
| | - Bianca Dumitrascu
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey 08540, USA
| | - Barbara E Engelhardt
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA; Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey 08540, USA
| | - Benjamin J Raphael
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA
|
19
|
Dumitrascu B, Darnell G, Ayroles J, Engelhardt BE. Statistical tests for detecting variance effects in quantitative trait studies. Bioinformatics 2019; 35:200-210. [PMID: 29982387 PMCID: PMC6330007 DOI: 10.1093/bioinformatics/bty565] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 10/20/2017] [Accepted: 07/04/2018] [Indexed: 11/17/2022] Open
Abstract
Motivation: Identifying variants, both discrete and continuous, that are associated with quantitative traits, or QTs, is the primary focus of quantitative genetics. Most current methods are limited to identifying mean effects, or associations between genotype or covariates and the mean value of a quantitative trait. It is possible, however, that a variant may affect the variance of the quantitative trait in lieu of, or in addition to, affecting the trait mean. Here, we develop a general methodology to identify covariates with variance effects on a quantitative trait using a Bayesian heteroskedastic linear regression model (BTH). We compare BTH with existing methods to detect variance effects across a large range of simulations drawn from scenarios common to the analysis of quantitative traits.
Results: We find that BTH and a double generalized linear model (dglm) outperform classical tests used for detecting variance effects in recent genomic studies. We show BTH and dglm are less likely to generate spurious discoveries through simulations and application to identifying methylation variance QTs and expression variance QTs. We identify four variance effects of sex in the Cardiovascular and Pharmacogenetics study. Our work is the first to offer a comprehensive view of variance-identifying methodology. We identify shortcomings in previously used methodology and provide a more conservative and robust alternative. We extend variance effect analysis to a wide array of covariates, which enables a new statistical dimension in the study of sex- and age-specific quantitative trait effects.
Availability and implementation: https://github.com/b2du/bth.
Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Bianca Dumitrascu
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
| | - Gregory Darnell
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
| | - Julien Ayroles
- Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ, USA
| | - Barbara E Engelhardt
- Department of Computer Science, Princeton University, Princeton, NJ, USA; Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, USA
|
20
|
Cheng LF, Prasad N, Engelhardt BE. An Optimal Policy for Patient Laboratory Tests in Intensive Care Units. Pac Symp Biocomput 2019; 24:320-331. [PMID: 30864333 PMCID: PMC6417830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Laboratory testing is an integral tool in the management of patient care in hospitals, particularly in intensive care units (ICUs). There exists an inherent trade-off in the selection and timing of lab tests between considerations of the expected utility in clinical decision-making of a given test at a specific time, and the associated cost or risk it poses to the patient. In this work, we introduce a framework that learns policies for ordering lab tests that optimize this trade-off. Our approach uses batch off-policy reinforcement learning with a composite reward function based on clinical imperatives, applied to data that include examples of clinicians ordering labs for patients. To this end, we develop and extend principles of Pareto optimality to improve the selection of actions based on multiple reward function components while respecting typical procedural considerations and prioritization of clinical goals in the ICU. Our experiments show that we can estimate a policy that reduces the frequency of lab tests and optimizes timing to minimize information redundancy. We also find that the estimated policies typically suggest ordering lab tests well ahead of critical onsets, such as mechanical ventilation or dialysis, that depend on the lab results. We evaluate our approach by quantifying how these policies may initiate earlier onset of treatment.
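The Pareto-optimality idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' algorithm: the function name `pareto_optimal`, the two reward components, and the numbers are hypothetical, chosen only to show how dominated actions are filtered out when several reward components are compared at once.

```python
import numpy as np

def pareto_optimal(rewards):
    """Return a boolean mask of actions that no other action dominates.

    rewards: array of shape (n_actions, n_components); larger is better.
    Action j dominates action i if j is at least as good on every
    component and strictly better on at least one.
    """
    rewards = np.asarray(rewards, dtype=float)
    keep = np.ones(rewards.shape[0], dtype=bool)
    for i in range(rewards.shape[0]):
        at_least_as_good = np.all(rewards >= rewards[i], axis=1)
        strictly_better = np.any(rewards > rewards[i], axis=1)
        if np.any(at_least_as_good & strictly_better):
            keep[i] = False
    return keep

# Hypothetical per-action reward components:
# column 0 = information gain, column 1 = negative cost.
q = np.array([[0.9, -1.0],
              [0.5, -0.2],
              [0.4, -0.5],
              [0.9, -1.5]])
mask = pareto_optimal(q)
print(mask)
```

In this toy example the third and fourth actions are each dominated by another action, so only the first two remain as candidates for a downstream selection rule.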
Affiliation(s)
- Li-Fang Cheng
- Department of Electrical Engineering, Princeton University, USA. *These authors contributed equally to this work
|
21
|
Zhao S, Engelhardt BE, Mukherjee S, Dunson DB. Fast Moment Estimation for Generalized Latent Dirichlet Models. J Am Stat Assoc 2018; 113:1528-1540. [DOI: 10.1080/01621459.2017.1341839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Shiwen Zhao
- Department of Statistical Science, Duke University, Durham, NC
| | - Barbara E. Engelhardt
- Department of Computer Science and Center for Statistics and Machine Learning, Princeton University, Princeton, NJ
| | - Sayan Mukherjee
- Department of Statistical Science, Duke University, Durham, NC
| | - David B. Dunson
- Department of Statistical Science, Duke University, Durham, NC
|
22
|
McDowell IC, Barrera A, D'Ippolito AM, Vockley CM, Hong LK, Leichter SM, Bartelt LC, Majoros WH, Song L, Safi A, Koçak DD, Gersbach CA, Hartemink AJ, Crawford GE, Engelhardt BE, Reddy TE. Glucocorticoid receptor recruits to enhancers and drives activation by motif-directed binding. Genome Res 2018; 28:1272-1284. [PMID: 30097539 PMCID: PMC6120625 DOI: 10.1101/gr.233346.117] [Citation(s) in RCA: 68] [Impact Index Per Article: 11.3] [Received: 12/04/2017] [Accepted: 07/05/2018] [Indexed: 12/22/2022]
Abstract
Glucocorticoids are potent steroid hormones that regulate immunity and metabolism by activating the transcription factor (TF) activity of glucocorticoid receptor (GR). Previous models have proposed that DNA binding motifs and sites of chromatin accessibility predetermine GR binding and activity. However, there are vast excesses of both features relative to the number of GR binding sites. Thus, these features alone are unlikely to account for the specificity of GR binding and activity. To identify genomic and epigenetic contributions to GR binding specificity and the downstream changes resultant from GR binding, we performed hundreds of genome-wide measurements of TF binding, epigenetic state, and gene expression across a 12-h time course of glucocorticoid exposure. We found that glucocorticoid treatment induces GR to bind to nearly all pre-established enhancers within minutes. However, GR binds to only a small fraction of the set of accessible sites that lack enhancer marks. Once GR is bound to enhancers, a combination of enhancer motif composition and interactions between enhancers then determines the strength and persistence of GR binding, which consequently correlates with dramatic shifts in enhancer activation. Over the course of several hours, highly coordinated changes in TF binding and histone modification occupancy occur specifically within enhancers, and these changes correlate with changes in the expression of nearby genes. Following GR binding, changes in the binding of other TFs precede changes in chromatin accessibility, suggesting that other TFs are also sensitive to genomic features beyond that of accessibility.
Affiliation(s)
- Ian C McDowell
- Graduate Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27708, USA; Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
| | - Alejandro Barrera
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, North Carolina 27708, USA
| | - Anthony M D'Ippolito
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; University Program in Genetics and Genomics, Duke University, Durham, North Carolina 27708, USA
| | - Christopher M Vockley
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, North Carolina 27708, USA
| | - Linda K Hong
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27708, USA
| | - Sarah M Leichter
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, North Carolina 27708, USA
| | - Luke C Bartelt
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, North Carolina 27708, USA
| | - William H Majoros
- Graduate Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27708, USA; Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
| | - Lingyun Song
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27708, USA
| | - Alexias Safi
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27708, USA
| | - D Dewran Koçak
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Biomedical Engineering, Duke University, Durham, North Carolina 27708, USA
| | - Charles A Gersbach
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; University Program in Genetics and Genomics, Duke University, Durham, North Carolina 27708, USA; Department of Biomedical Engineering, Duke University, Durham, North Carolina 27708, USA; Department of Orthopaedic Surgery, Duke University Medical Center, Durham, North Carolina 27708, USA
| | - Alexander J Hartemink
- Graduate Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27708, USA; Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Computer Science, Duke University, Durham, North Carolina 27708, USA; Department of Biology, Duke University, Durham, North Carolina 27708, USA
| | - Gregory E Crawford
- Graduate Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27708, USA; Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27708, USA; Department of Molecular Genetics and Microbiology, Duke University, Durham, North Carolina 27708, USA
| | - Barbara E Engelhardt
- Department of Computer Science, Princeton University, Princeton, New Jersey 08540, USA; Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey 08540, USA
| | - Timothy E Reddy
- Graduate Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27708, USA; Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA; Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, North Carolina 27708, USA; University Program in Genetics and Genomics, Duke University, Durham, North Carolina 27708, USA; Department of Biomedical Engineering, Duke University, Durham, North Carolina 27708, USA; Department of Molecular Genetics and Microbiology, Duke University, Durham, North Carolina 27708, USA
|
23
|
Aguiar D, Cheng LF, Dumitrascu B, Mordelet F, Pai AA, Engelhardt BE. Bayesian nonparametric discovery of isoforms and individual specific quantification. Nat Commun 2018; 9:1681. [PMID: 29703885 PMCID: PMC5923247 DOI: 10.1038/s41467-018-03402-w] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 04/10/2017] [Accepted: 02/11/2018] [Indexed: 12/18/2022] Open
Abstract
Most human protein-coding genes can be transcribed into multiple distinct mRNA isoforms. These alternative splicing patterns encourage molecular diversity, and dysregulation of isoform expression plays an important role in disease etiology. However, isoforms are difficult to characterize from short-read RNA-seq data because they share identical subsequences and occur in different frequencies across tissues and samples. Here, we develop biisq, a Bayesian nonparametric model for isoform discovery and individual specific quantification from short-read RNA-seq data. biisq does not require isoform reference sequences but instead estimates an isoform catalog shared across samples. We use stochastic variational inference for efficient posterior estimates and demonstrate superior precision and recall for simulations compared to state-of-the-art isoform reconstruction methods. biisq shows the most gains for low abundance isoforms, with 36% more isoforms correctly inferred at low coverage versus a multi-sample method and 170% more versus single-sample methods. We estimate isoforms in the GEUVADIS RNA-seq data and validate inferred isoforms by associating genetic variants with isoform ratios. Alternative splicing leads to transcript isoform diversity. Here, Aguiar et al. develop biisq, a Bayesian nonparametric approach to discover and quantify isoforms from RNA-seq data.
Affiliation(s)
- Derek Aguiar
- Department of Computer Science, Princeton University, Princeton, NJ, 08540, USA.
| | - Li-Fang Cheng
- Department of Electrical Engineering, Princeton University, Princeton, NJ, 08540, USA
| | - Bianca Dumitrascu
- Lewis-Sigler Institute, Princeton University, Princeton, NJ, 08544, USA
| | - Fantine Mordelet
- Institute for Genome Sciences and Policy, Duke University, Durham, NC, 27708, USA
| | - Athma A Pai
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA; RNA Therapeutics Institute, University of Massachusetts Medical School, Worcester, MA, 01605, USA
| | - Barbara E Engelhardt
- Department of Computer Science, Princeton University, Princeton, NJ, 08540, USA; Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, 08540, USA.
|
24
|
Abstract
Bayesian sparse factor models have proven useful for characterizing dependence in multivariate data, but scaling computation to large numbers of samples and dimensions is problematic. We propose expandable factor analysis for scalable inference in factor models when the number of factors is unknown. The method relies on a continuous shrinkage prior for efficient maximum a posteriori estimation of a low-rank and sparse loadings matrix. The structure of the prior leads to an estimation algorithm that accommodates uncertainty in the number of factors. We propose an information criterion to select the hyperparameters of the prior. Expandable factor analysis has better false discovery rates and true positive rates than its competitors across diverse simulation settings. We apply the proposed approach to a gene expression study of ageing in mice, demonstrating superior results relative to four competing methods.
Affiliation(s)
- Sanvesh Srivastava
- Department of Statistics and Actuarial Science, University of Iowa, 241 Schaeffer Hall, 20 East Washington Street, Iowa City, Iowa 52242
| | - Barbara E Engelhardt
- Department of Computer Science, Center for Statistics and Machine Learning, Princeton University, 35 Olden Street, Princeton, New Jersey 08540
| | - David B Dunson
- Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27708
|
25
|
McDowell IC, Manandhar D, Vockley CM, Schmid AK, Reddy TE, Engelhardt BE. Clustering gene expression time series data using an infinite Gaussian process mixture model. PLoS Comput Biol 2018; 14:e1005896. [PMID: 29337990 PMCID: PMC5786324 DOI: 10.1371/journal.pcbi.1005896] [Citation(s) in RCA: 83] [Impact Index Per Article: 13.8] [Received: 05/03/2017] [Revised: 01/26/2018] [Accepted: 11/25/2017] [Indexed: 12/24/2022] Open
Abstract
Transcriptome-wide time series expression profiling is used to characterize the cellular response to environmental perturbations. The first step to analyzing transcriptional response data is often to cluster genes with similar responses. Here, we present a nonparametric model-based method, Dirichlet process Gaussian process mixture model (DPGP), which jointly models data clusters with a Dirichlet process and temporal dependencies with Gaussian processes. We demonstrate the accuracy of DPGP in comparison to state-of-the-art approaches using hundreds of simulated data sets. To further test our method, we apply DPGP to published microarray data from a microbial model organism exposed to stress and to novel RNA-seq data from a human cell line exposed to the glucocorticoid dexamethasone. We validate our clusters by examining local transcription factor binding and histone modifications. Our results demonstrate that jointly modeling cluster number and temporal dependencies can reveal shared regulatory mechanisms. DPGP software is freely available online at https://github.com/PrincetonUniversity/DP_GP_cluster.
Affiliation(s)
- Ian C. McDowell
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina, United States of America
- Center for Genomic & Computational Biology, Duke University, Durham, North Carolina, United States of America
| | - Dinesh Manandhar
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina, United States of America
- Center for Genomic & Computational Biology, Duke University, Durham, North Carolina, United States of America
| | - Christopher M. Vockley
- Center for Genomic & Computational Biology, Duke University, Durham, North Carolina, United States of America
- Department of Biostatistics & Bioinformatics, Duke University Medical Center, Durham, North Carolina, United States of America
| | - Amy K. Schmid
- Center for Genomic & Computational Biology, Duke University, Durham, North Carolina, United States of America
- Biology Department, Duke University, Durham, North Carolina, United States of America
| | - Timothy E. Reddy
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina, United States of America
- Center for Genomic & Computational Biology, Duke University, Durham, North Carolina, United States of America
- Department of Biostatistics & Bioinformatics, Duke University Medical Center, Durham, North Carolina, United States of America
| | - Barbara E. Engelhardt
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
|
26
|
Saha A, Kim Y, Gewirtz ADH, Jo B, Gao C, McDowell IC, Engelhardt BE, Battle A. Co-expression networks reveal the tissue-specific regulation of transcription and splicing. Genome Res 2017; 27:1843-1858. [PMID: 29021288 PMCID: PMC5668942 DOI: 10.1101/gr.216721.116] [Citation(s) in RCA: 119] [Impact Index Per Article: 17.0] [Received: 09/30/2016] [Accepted: 08/22/2017] [Indexed: 11/24/2022]
Abstract
Gene co-expression networks capture biologically important patterns in gene expression data, enabling functional analyses of genes, discovery of biomarkers, and interpretation of genetic variants. Most network analyses to date have been limited to assessing correlation between total gene expression levels in a single tissue or small sets of tissues. Here, we built networks that additionally capture the regulation of relative isoform abundance and splicing, along with tissue-specific connections unique to each of a diverse set of tissues. We used the Genotype-Tissue Expression (GTEx) project v6 RNA sequencing data across 50 tissues and 449 individuals. First, we developed a framework called Transcriptome-Wide Networks (TWNs) for combining total expression and relative isoform levels into a single sparse network, capturing the interplay between the regulation of splicing and transcription. We built TWNs for 16 tissues and found that hubs in these networks were strongly enriched for splicing and RNA binding genes, demonstrating their utility in unraveling regulation of splicing in the human transcriptome. Next, we used a Bayesian biclustering model that identifies network edges unique to a single tissue to reconstruct Tissue-Specific Networks (TSNs) for 26 distinct tissues and 10 groups of related tissues. Finally, we found genetic variants associated with pairs of adjacent nodes in our networks, supporting the estimated network structures and identifying 20 genetic variants with distant regulatory impact on transcription and splicing. Our networks provide an improved understanding of the complex relationships of the human transcriptome across tissues.
|
27
|
Abstract
Characterization of the molecular function of the human genome and its variation across individuals is essential for identifying the cellular mechanisms that underlie human genetic traits and diseases. The Genotype-Tissue Expression (GTEx) project aims to characterize variation in gene expression levels across individuals and diverse tissues of the human body, many of which are not easily accessible. Here we describe genetic effects on gene expression levels across 44 human tissues. We find that local genetic variation affects gene expression levels for the majority of genes, and we further identify inter-chromosomal genetic effects for 93 genes and 112 loci. On the basis of the identified genetic effects, we characterize patterns of tissue specificity, compare local and distal effects, and evaluate the functional properties of the genetic effects. We also demonstrate that multi-tissue, multi-individual data can be used to identify genes and pathways affected by human disease-associated variation, enabling a mechanistic interpretation of gene regulation and the genetic basis of disease.
Affiliation(s)
- Alexis Battle
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
| | - Christopher D Brown
- Department of Genetics and Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA
| | - Barbara E Engelhardt
- Department of Computer Science and Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey 08540, USA
| | - Stephen B Montgomery
- Department of Genetics, Stanford University, Stanford, California 94305, USA
- Department of Pathology, Stanford University, Stanford, California 94305, USA
|
28
|
Tonner PD, Darnell CL, Engelhardt BE, Schmid AK. Detecting differential growth of microbial populations with Gaussian process regression. Genome Res 2017; 27:320-333. [PMID: 27864351 PMCID: PMC5287237 DOI: 10.1101/gr.210286.116] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Received: 05/24/2016] [Accepted: 11/15/2016] [Indexed: 02/06/2023]
Abstract
Microbial growth curves are used to study differential effects of media, genetics, and stress on microbial population growth. Consequently, many modeling frameworks exist to capture microbial population growth measurements. However, current models are designed to quantify growth under conditions for which growth has a specific functional form. Extensions to these models are required to quantify the effects of perturbations, which often exhibit nonstandard growth curves. Rather than assume specific functional forms for experimental perturbations, we developed a general and robust model of microbial population growth curves using Gaussian process (GP) regression. GP regression modeling of high-resolution time-series growth data enables accurate quantification of population growth and allows explicit control of effects from other covariates such as genetic background. This framework substantially outperforms commonly used microbial population growth models, particularly when modeling growth data from environmentally stressed populations. We apply the GP growth model and develop statistical tests to quantify the differential effects of environmental perturbations on microbial growth across a large compendium of genotypes in archaea and yeast. This method accurately identifies known transcriptional regulators and implicates novel regulators of growth under standard and stress conditions in the model archaeal organism Halobacterium salinarum. For yeast, our method correctly identifies known phenotypes for a diversity of genetic backgrounds under cycloheximide stress and also detects previously unidentified oxidative stress sensitivity across a subset of strains. Together, these results demonstrate that the GP models are interpretable, recapitulating biological knowledge of growth response while providing new insights into the relevant parameters affecting microbial population growth.
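To make the idea of GP regression on a growth curve concrete, the following is a minimal sketch using only NumPy. It is not the model or the statistical test from the paper: the squared-exponential kernel, the fixed hyperparameters (`lengthscale`, `variance`, `noise`), the helper names, and the synthetic logistic curve are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=2.0, variance=1.0):
    """Squared-exponential covariance between 1-D time points a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(t_train, y_train, t_test, noise=0.05, **kern):
    """Posterior mean and pointwise variance of a zero-mean GP at t_test."""
    K = rbf_kernel(t_train, t_train, **kern) + noise ** 2 * np.eye(len(t_train))
    Ks = rbf_kernel(t_test, t_train, **kern)
    Kss = rbf_kernel(t_test, t_test, **kern)
    L = np.linalg.cholesky(K)                      # K = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.diag(Kss) - np.sum(v ** 2, axis=0)
    return mean, var

# Synthetic (hypothetical) growth measurements: logistic curve plus noise.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 24.0, 25)                     # hours
true = 1.0 / (1.0 + np.exp(-(t - 12.0) / 2.0))     # noiseless curve
y = true + 0.05 * rng.standard_normal(t.size)

# Center the data so the zero-mean prior is reasonable, then predict
# on a finer time grid and add the mean back.
t_new = np.linspace(0.0, 24.0, 97)
mean, var = gp_posterior(t, y - y.mean(), t_new)
mean = mean + y.mean()
```

Because no functional form is imposed, the same code would fit a curve with a lag, a plateau, or a decline; in practice the hyperparameters would be estimated from the data rather than fixed as they are here.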
Affiliation(s)
- Peter D Tonner
- Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27708, USA
- Biology Department, Duke University, Durham, North Carolina 27708, USA
| | - Cynthia L Darnell
- Biology Department, Duke University, Durham, North Carolina 27708, USA
| | - Barbara E Engelhardt
- Computer Science Department, Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey 08540, USA
| | - Amy K Schmid
- Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27708, USA
- Biology Department, Duke University, Durham, North Carolina 27708, USA
|
29
|
Gao C, McDowell IC, Zhao S, Brown CD, Engelhardt BE. Context Specific and Differential Gene Co-expression Networks via Bayesian Biclustering. PLoS Comput Biol 2016; 12:e1004791. [PMID: 27467526 PMCID: PMC4965098 DOI: 10.1371/journal.pcbi.1004791] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 05/14/2015] [Accepted: 02/03/2016] [Indexed: 01/15/2023] Open
Abstract
Identifying latent structure in high-dimensional genomic data is essential for exploring biological processes. Here, we consider recovering gene co-expression networks from gene expression data, where each network encodes relationships between genes that are co-regulated by shared biological mechanisms. To do this, we develop a Bayesian statistical model for biclustering to infer subsets of co-regulated genes that covary in all of the samples or in only a subset of the samples. Our biclustering method, BicMix, allows overcomplete representations of the data, computational tractability, and joint modeling of unknown confounders and biological signals. Compared with state-of-the-art biclustering methods, BicMix recovers latent structure with higher precision across diverse simulation scenarios. Further, we develop a principled method to recover context-specific gene co-expression networks from the estimated sparse biclustering matrices. We apply BicMix to breast cancer gene expression data and to gene expression data from a cardiovascular study cohort, and we recover gene co-expression networks that are differential across ER+ and ER- samples and across male and female samples. We apply BicMix to the Genotype-Tissue Expression (GTEx) pilot data, and we find tissue-specific gene networks. We validate these findings by using our tissue-specific networks to identify trans-eQTLs specific to one of four primary tissues.
Affiliation(s)
- Chuan Gao
- Department of Statistical Science, Duke University, Durham, North Carolina, United States of America
| | - Ian C. McDowell
- Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina, United States of America
| | - Shiwen Zhao
- Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina, United States of America
| | - Christopher D. Brown
- Department of Genetics, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
| | - Barbara E. Engelhardt
- Department of Computer Science, Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, United States of America
|
30
|
Affiliation(s)
- Barbara E Engelhardt
- Department of Computer Science, Center for Statistics and Machine Learning, Princeton University, Princeton, New Jersey, USA
| | - Christopher D Brown
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
|
31
|
van den Berg SM, de Moor MHM, Verweij KJH, Krueger RF, Luciano M, Arias Vasquez A, Matteson LK, Derringer J, Esko T, Amin N, Gordon SD, Hansell NK, Hart AB, Seppälä I, Huffman JE, Konte B, Lahti J, Lee M, Miller M, Nutile T, Tanaka T, Teumer A, Viktorin A, Wedenoja J, Abdellaoui A, Abecasis GR, Adkins DE, Agrawal A, Allik J, Appel K, Bigdeli TB, Busonero F, Campbell H, Costa PT, Smith GD, Davies G, de Wit H, Ding J, Engelhardt BE, Eriksson JG, Fedko IO, Ferrucci L, Franke B, Giegling I, Grucza R, Hartmann AM, Heath AC, Heinonen K, Henders AK, Homuth G, Hottenga JJ, Iacono WG, Janzing J, Jokela M, Karlsson R, Kemp JP, Kirkpatrick MG, Latvala A, Lehtimäki T, Liewald DC, Madden PAF, Magri C, Magnusson PKE, Marten J, Maschio A, Mbarek H, Medland SE, Mihailov E, Milaneschi Y, Montgomery GW, Nauck M, Nivard MG, Ouwens KG, Palotie A, Pettersson E, Polasek O, Qian Y, Pulkki-Råback L, Raitakari OT, Realo A, Rose RJ, Ruggiero D, Schmidt CO, Slutske WS, Sorice R, Starr JM, St Pourcain B, Sutin AR, Timpson NJ, Trochet H, Vermeulen S, Vuoksimaa E, Widen E, Wouda J, Wright MJ, Zgaga L, Porteous D, Minelli A, Palmer AA, Rujescu D, Ciullo M, Hayward C, Rudan I, Metspalu A, Kaprio J, Deary IJ, Räikkönen K, Wilson JF, Keltikangas-Järvinen L, Bierut LJ, Hettema JM, Grabe HJ, Penninx BWJH, van Duijn CM, Evans DM, Schlessinger D, Pedersen NL, Terracciano A, McGue M, Martin NG, Boomsma DI. Meta-analysis of Genome-Wide Association Studies for Extraversion: Findings from the Genetics of Personality Consortium. Behav Genet 2016; 46:170-82. [PMID: 26362575 PMCID: PMC4751159 DOI: 10.1007/s10519-015-9735-5] [Citation(s) in RCA: 149] [Impact Index Per Article: 18.6] [Received: 10/30/2014] [Accepted: 08/10/2015] [Indexed: 11/26/2022]
Abstract
Extraversion is a relatively stable and heritable personality trait associated with numerous psychosocial, lifestyle and health outcomes. Despite its substantial heritability, no genetic variants have been detected in previous genome-wide association (GWA) studies, which may be due to relatively small sample sizes of those studies. Here, we report on a large meta-analysis of GWA studies for extraversion in 63,030 subjects in 29 cohorts. Extraversion item data from multiple personality inventories were harmonized across inventories and cohorts. No genome-wide significant associations were found at the single nucleotide polymorphism (SNP) level but there was one significant hit at the gene level for a long non-coding RNA site (LOC101928162). Genome-wide complex trait analysis in two large cohorts showed that the additive variance explained by common SNPs was not significantly different from zero, but polygenic risk scores, weighted using linkage information, significantly predicted extraversion scores in an independent cohort. These results show that extraversion is a highly polygenic personality trait, with an architecture possibly different from other complex human traits, including other personality traits. Future studies are required to further determine which genetic variants, by what modes of gene action, constitute the heritable nature of extraversion.
Affiliation(s)
- Stéphanie M van den Berg
- Department of Research Methodology, Measurement and Data-Analysis (OMD), Faculty of Behavioural, Management, and Social Sciences, University of Twente, PO Box 217, 7500 AE, Enschede, The Netherlands.
| | - Marleen H M de Moor
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
- Department of Clinical Child and Family Studies, VU University Amsterdam, Amsterdam, The Netherlands
- Department of Methods, VU University Amsterdam, Amsterdam, The Netherlands
| | - Karin J H Verweij
- QIMR Berghofer Medical Research Institute, Brisbane, Australia
- Department of Developmental Psychology and EMGO Institute for Health and Care Research, VU University Amsterdam, Amsterdam, The Netherlands
| | - Robert F Krueger
- Department of Psychology, University of Minnesota, Minneapolis, USA
| | - Michelle Luciano
- Department of Psychology, University of Edinburgh, Edinburgh, UK
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Alejandro Arias Vasquez
- Donders Institute for Cognitive Neuroscience, Radboud University Nijmegen, Nijmegen, The Netherlands
- Department of Psychiatry, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
- Department of Human Genetics, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
- Department of Cognitive Neuroscience, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
| | | | - Jaime Derringer
- Department of Psychology, University of Illinois at Urbana-Champaign, Champaign, IL, USA
| | - Tõnu Esko
- Estonian Genome Center, University of Tartu, Tartu, Estonia
| | - Najaf Amin
- Department of Epidemiology, Erasmus University Medical Center, Rotterdam, The Netherlands
| | - Scott D Gordon
- QIMR Berghofer Medical Research Institute, Brisbane, Australia
| | | | - Amy B Hart
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Ilkka Seppälä
- Department of Clinical Chemistry, Fimlab Laboratories and School of Medicine, University of Tampere, Tampere, Finland
| | - Jennifer E Huffman
- MRC Human Genetics Unit, MRC IGMM, Western General Hospital, University of Edinburgh, Edinburgh, UK
| | - Bettina Konte
- Department of Psychiatry, University of Halle, Halle, Germany
| | - Jari Lahti
- Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
- Folkhälsan Research Center, Helsinki, Finland
| | - Minyoung Lee
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, USA
| | - Mike Miller
- Department of Psychology, University of Minnesota, Minneapolis, USA
| | - Teresa Nutile
- Institute of Genetics and Biophysics "A. Buzzati-Traverso" - CNR, Naples, Italy
| | | | - Alexander Teumer
- Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany
| | - Alexander Viktorin
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Juho Wedenoja
- Department of Public Health, University of Helsinki, Helsinki, Finland
| | - Abdel Abdellaoui
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | - Goncalo R Abecasis
- Department of Biostatistics, Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI, USA
| | - Daniel E Adkins
- Pharmacotherapy & Outcomes Science, Virginia Commonwealth University, Richmond, VA, USA
| | - Arpana Agrawal
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
| | - Jüri Allik
- Department of Psychology, University of Tartu, Tartu, Estonia
- Estonian Academy of Sciences, Tallinn, Estonia
| | - Katja Appel
- Department of Psychiatry and Psychotherapy, University Medicine Greifswald, Greifswald, Germany
| | - Timothy B Bigdeli
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, USA
| | - Fabio Busonero
- Istituto di Ricerca Genetica e Biomedica (IRGB), CNR, Monserrato, Italy
| | - Harry Campbell
- Usher Institute for Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK
| | - Paul T Costa
- Behavioral Medicine Research Center, Duke University School of Medicine, Durham, NC, USA
| | - George Davey Smith
- Medical Research Council Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol, UK
| | - Gail Davies
- Department of Psychology, University of Edinburgh, Edinburgh, UK
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Harriet de Wit
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, USA
| | - Jun Ding
- Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Barbara E Engelhardt
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
| | - Johan G Eriksson
- Folkhälsan Research Center, Helsinki, Finland
- National Institute for Health and Welfare (THL), Helsinki, Finland
- Department of General Practice and Primary Health Care, University of Helsinki, Helsinki, Finland
- Unit of General Practice and Primary Health Care, University of Helsinki, Helsinki, Finland
- Vasa Central Hospital, Vaasa, Finland
| | - Iryna O Fedko
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | | | - Barbara Franke
- Donders Institute for Cognitive Neuroscience, Radboud University Nijmegen, Nijmegen, The Netherlands
- Department of Psychiatry, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
- Department of Human Genetics, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
| | - Ina Giegling
- Department of Psychiatry, University of Halle, Halle, Germany
| | - Richard Grucza
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
| | | | - Andrew C Heath
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
| | - Kati Heinonen
- Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
| | | | - Georg Homuth
- Interfaculty Institute for Genetics and Functional Genomics, University of Greifswald, Greifswald, Germany
| | - Jouke-Jan Hottenga
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | - William G Iacono
- Department of Psychology, University of Minnesota, Minneapolis, USA
| | - Joost Janzing
- Department of Psychiatry, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
| | - Markus Jokela
- Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
| | - Robert Karlsson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - John P Kemp
- Medical Research Council Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol, UK
- Translational Research Institute, University of Queensland Diamantina Institute, Brisbane, Australia
| | - Matthew G Kirkpatrick
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, USA
| | - Antti Latvala
- Department of Public Health, University of Helsinki, Helsinki, Finland
- National Institute for Health and Welfare (THL), Helsinki, Finland
| | - Terho Lehtimäki
- Department of Clinical Chemistry, Fimlab Laboratories and School of Medicine, University of Tampere, Tampere, Finland
| | - David C Liewald
- Department of Psychology, University of Edinburgh, Edinburgh, UK
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Pamela A F Madden
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
| | - Chiara Magri
- Department of Molecular and Translational Medicine, University of Brescia, Brescia, Italy
| | - Patrik K E Magnusson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Jonathan Marten
- MRC Human Genetics Unit, MRC IGMM, Western General Hospital, University of Edinburgh, Edinburgh, UK
| | - Andrea Maschio
- Istituto di Ricerca Genetica e Biomedica (IRGB), CNR, Monserrato, Italy
| | - Hamdi Mbarek
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | - Sarah E Medland
- QIMR Berghofer Medical Research Institute, Brisbane, Australia
| | - Evelin Mihailov
- Estonian Genome Center, University of Tartu, Tartu, Estonia
- Department of Biotechnology, University of Tartu, Tartu, Estonia
| | - Yuri Milaneschi
- Department of Psychiatry, EMGO+ Institute, Neuroscience Campus Amsterdam, VU University Medical Center, Amsterdam, The Netherlands
| | | | - Matthias Nauck
- Institute of Clinical Chemistry and Laboratory Medicine, University Medicine Greifswald, Greifswald, Germany
| | - Michel G Nivard
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | - Klaasjan G Ouwens
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | - Aarno Palotie
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
| | - Erik Pettersson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Ozren Polasek
- Department of Public Health, Faculty of Medicine, University of Split, Split, Croatia
| | - Yong Qian
- Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Laura Pulkki-Råback
- Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
| | - Olli T Raitakari
- Department of Clinical Physiology and Nuclear Medicine, Turku University Hospital, Turku, Finland
- Research Centre of Applied and Preventive Cardiovascular Medicine, University of Turku, Turku, Finland
| | - Anu Realo
- Department of Psychology, University of Tartu, Tartu, Estonia
| | - Richard J Rose
- Department of Psychological & Brain Sciences, Indiana University, Bloomington, IN, USA
| | - Daniela Ruggiero
- Institute of Genetics and Biophysics "A. Buzzati-Traverso" - CNR, Naples, Italy
| | - Carsten O Schmidt
- Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany
| | - Wendy S Slutske
- Department of Psychological Sciences and Missouri Alcoholism Research Center, University of Missouri, Columbia, MO, USA
| | - Rossella Sorice
- Institute of Genetics and Biophysics "A. Buzzati-Traverso" - CNR, Naples, Italy
| | - John M Starr
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
- Alzheimer Scotland Dementia Research Centre, University of Edinburgh, Edinburgh, UK
| | - Beate St Pourcain
- Medical Research Council Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol, UK
- School of Oral and Dental Sciences, University of Bristol, Bristol, UK
- School of Experimental Psychology, University of Bristol, Bristol, UK
| | - Angelina R Sutin
- National Institute on Aging, NIH, Baltimore, MD, USA
- College of Medicine, Florida State University, Tallahassee, FL, USA
| | - Nicholas J Timpson
- Medical Research Council Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol, UK
| | - Holly Trochet
- MRC Human Genetics Unit, MRC IGMM, Western General Hospital, University of Edinburgh, Edinburgh, UK
| | - Sita Vermeulen
- Department of Human Genetics, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
- Department for Health Evidence, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Eero Vuoksimaa
- Department of Public Health, University of Helsinki, Helsinki, Finland
| | - Elisabeth Widen
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
| | - Jasper Wouda
- Department of Research Methodology, Measurement and Data-Analysis (OMD), Faculty of Behavioural, Management, and Social Sciences, University of Twente, PO Box 217, 7500 AE, Enschede, The Netherlands
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | | | - Lina Zgaga
- Usher Institute for Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK
- Department of Public Health and Primary Care, Trinity College Dublin, Dublin, Ireland
| | - David Porteous
- Medical Genetics Section, Centre for Genomics and Experimental Medicine, Institute of Genetics and Molecular Medicine, Western General Hospital, The University of Edinburgh, Edinburgh, UK
| | - Alessandra Minelli
- Department of Molecular and Translational Medicine, University of Brescia, Brescia, Italy
| | - Abraham A Palmer
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, USA
| | - Dan Rujescu
- Department of Psychiatry, University of Halle, Halle, Germany
| | - Marina Ciullo
- Institute of Genetics and Biophysics "A. Buzzati-Traverso" - CNR, Naples, Italy
| | - Caroline Hayward
- MRC Human Genetics Unit, MRC IGMM, Western General Hospital, University of Edinburgh, Edinburgh, UK
| | - Igor Rudan
- Usher Institute for Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK
| | - Andres Metspalu
- Estonian Genome Center, University of Tartu, Tartu, Estonia
- Estonian Academy of Sciences, Tallinn, Estonia
| | - Jaakko Kaprio
- Department of Public Health, University of Helsinki, Helsinki, Finland
- National Institute for Health and Welfare (THL), Helsinki, Finland
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
| | - Ian J Deary
- Department of Psychology, University of Edinburgh, Edinburgh, UK
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Katri Räikkönen
- Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
| | - James F Wilson
- MRC Human Genetics Unit, MRC IGMM, Western General Hospital, University of Edinburgh, Edinburgh, UK
- Usher Institute for Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, UK
| | | | - Laura J Bierut
- Department of Psychiatry, Washington University School of Medicine, St. Louis, MO, USA
| | - John M Hettema
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, VA, USA
| | - Hans J Grabe
- Department of Psychiatry and Psychotherapy, University Medicine Greifswald, Greifswald, Germany
- Department of Psychiatry and Psychotherapy, HELIOS Hospital Stralsund, Stralsund, Germany
| | - Brenda W J H Penninx
- Department of Psychiatry, EMGO+ Institute, Neuroscience Campus Amsterdam, VU University Medical Center, Amsterdam, The Netherlands
| | - Cornelia M van Duijn
- Department of Epidemiology, Erasmus University Medical Center, Rotterdam, The Netherlands
| | - David M Evans
- Medical Research Council Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol, UK
| | - David Schlessinger
- Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Nancy L Pedersen
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Antonio Terracciano
- Folkhälsan Research Center, Helsinki, Finland
- National Institute on Aging, NIH, Baltimore, MD, USA
| | - Matt McGue
- Department of Psychology, University of Minnesota, Minneapolis, USA
- Institute of Public Health, University of Southern Denmark, Odense, Denmark
| | | | - Dorret I Boomsma
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| |
|
32
|
de Moor MH, van den Berg SM, Verweij KJ, Krueger RF, Luciano M, Vasquez AA, Matteson LK, Derringer J, Esko T, Amin N, Gordon SD, Hansell NK, Hart AB, Seppälä I, Huffman JE, Konte B, Lahti J, Lee M, Miller M, Nutile T, Tanaka T, Teumer A, Viktorin A, Wedenoja J, Abecasis GR, Adkins DE, Agrawal A, Allik J, Appel K, Bigdeli TB, Busonero F, Campbell H, Costa PT, Smith GD, Davies G, de Wit H, Ding J, Engelhardt BE, Eriksson JG, Fedko IO, Ferrucci L, Franke B, Giegling I, Grucza R, Hartmann AM, Heath AC, Heinonen K, Henders AK, Homuth G, Hottenga JJ, Janzing J, Jokela M, Karlsson R, Kemp JP, Kirkpatrick MG, Latvala A, Lehtimäki T, Liewald DC, Madden PA, Magri C, Magnusson PK, Marten J, Maschio A, Medland SE, Mihailov E, Milaneschi Y, Montgomery GW, Nauck M, Ouwens KG, Palotie A, Pettersson E, Polasek O, Qian Y, Pulkki-Råback L, Raitakari OT, Realo A, Rose RJ, Ruggiero D, Schmidt CO, Slutske WS, Sorice R, Starr JM, Pourcain BS, Sutin AR, Timpson NJ, Trochet H, Vermeulen S, Vuoksimaa E, Widen E, Wouda J, Wright MJ, Zgaga L, Scotland G, Porteous D, Minelli A, Palmer AA, Rujescu D, Ciullo M, Hayward C, Rudan I, Metspalu A, Kaprio J, Deary IJ, Räikkönen K, Wilson JF, Keltikangas-Järvinen L, Bierut LJ, Hettema JM, Grabe HJ, van Duijn CM, Evans DM, Schlessinger D, Pedersen NL, Terracciano A, McGue M, Penninx BW, Martin NG, Boomsma DI. Meta-analysis of Genome-wide Association Studies for Neuroticism, and the Polygenic Association With Major Depressive Disorder. JAMA Psychiatry 2015; 72:642-50. [PMID: 25993607 PMCID: PMC4667957 DOI: 10.1001/jamapsychiatry.2015.0554] [Indexed: 11/14/2022]
Abstract
IMPORTANCE Neuroticism is a pervasive risk factor for psychiatric conditions. It genetically overlaps with major depressive disorder (MDD) and is therefore an important phenotype for psychiatric genetics. The Genetics of Personality Consortium has created a resource for genome-wide association analyses of personality traits in more than 63,000 participants (including MDD cases). OBJECTIVES To identify genetic variants associated with neuroticism by performing a meta-analysis of genome-wide association results based on 1000 Genomes imputation; to evaluate whether common genetic variants as assessed by single-nucleotide polymorphisms (SNPs) explain variation in neuroticism by estimating SNP-based heritability; and to examine whether SNPs that predict neuroticism also predict MDD. DESIGN, SETTING, AND PARTICIPANTS Genome-wide association meta-analysis of 30 cohorts with genome-wide genotype, personality, and MDD data from the Genetics of Personality Consortium. The study included 63,661 participants from 29 discovery cohorts and 9786 participants from a replication cohort. Participants came from Europe, the United States, or Australia. Analyses were conducted between 2012 and 2014. MAIN OUTCOMES AND MEASURES Neuroticism scores harmonized across all 29 discovery cohorts by item response theory analysis, and clinical MDD case-control status in 2 of the cohorts. RESULTS A genome-wide significant SNP was found on 3p14 in MAGI1 (rs35855737; P = 9.26 × 10⁻⁹ in the discovery meta-analysis). This association was not replicated (P = .32), but the SNP was still genome-wide significant in the meta-analysis of all 30 cohorts (P = 2.38 × 10⁻⁸). Common genetic variants explain 15% of the variance in neuroticism. Polygenic scores based on the meta-analysis of neuroticism in 27 cohorts significantly predicted neuroticism (1.09 × 10⁻¹² < P < .05) and MDD (4.02 × 10⁻⁹ < P < .05) in the 2 other cohorts. CONCLUSIONS AND RELEVANCE This study identifies a novel locus for neuroticism. The variant is located in a known gene that has been associated with bipolar disorder and schizophrenia in previous studies. In addition, the study shows that neuroticism is influenced by many genetic variants of small effect that are either common or tagged by common variants. These genetic variants also influence MDD. Future studies should confirm the role of the MAGI1 locus for neuroticism and further investigate the association of MAGI1 and the polygenic association to a range of other psychiatric disorders that are phenotypically correlated with neuroticism.
Affiliation(s)
- Marleen H.M. de Moor
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
- Department of Clinical Child and Family Studies, VU University Amsterdam, Amsterdam, The Netherlands
- Department of Methods, VU University Amsterdam, Amsterdam, The Netherlands
| | - Stéphanie M. van den Berg
- Department of Research Methodology, Measurement and Data-Analysis, University of Twente, Enschede, The Netherlands
| | - Karin J.H. Verweij
- QIMR Berghofer Medical Research Institute, Herston, Brisbane, Australia
- Department of Developmental Psychology and EMGO Institute for Health and Care Research, VU University Amsterdam, Amsterdam, The Netherlands
| | | | - Michelle Luciano
- Department of Psychology, University of Edinburgh, Edinburgh, UK
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Alejandro Arias Vasquez
- Donders Institute for Cognitive Neuroscience, Radboud University Nijmegen, Nijmegen, The Netherlands
- Department of Psychiatry, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
- Department of Human Genetics, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
- Department of Cognitive Neuroscience, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
| | | | - Jaime Derringer
- Department of Psychology, University of Illinois at Urbana-Champaign, Champaign, IL, USA
| | - Tõnu Esko
- Estonian Genome Center, University of Tartu, Tartu, Estonia
| | - Najaf Amin
- Department of Epidemiology, Erasmus University Medical Center, Rotterdam, The Netherlands
| | - Scott D. Gordon
- QIMR Berghofer Medical Research Institute, Herston, Brisbane, Australia
| | | | - Amy B. Hart
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
| | - Ilkka Seppälä
- Department of Clinical Chemistry, Fimlab Laboratories and School of Medicine, University of Tampere, Finland
| | - Jennifer E. Huffman
- MRC Human Genetics, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, Scotland, UK
| | - Bettina Konte
- Department of Psychiatry, University of Halle, Halle, Germany
| | - Jari Lahti
- Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
- Folkhälsan Research Center, Helsinki, Finland
| | - Minyoung Lee
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, Virginia, USA
| | - Mike Miller
- Department of Psychology, University of Minnesota, Minneapolis, USA
| | - Teresa Nutile
- Institute of Genetics and Biophysics “A. Buzzati-Traverso” – CNR, Naples, Italy
| | | | - Alexander Teumer
- Institute for Community Medicine, University Medicine Greifswald, Greifswald, Germany
| | - Alexander Viktorin
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Juho Wedenoja
- Department of Public Health, Hjelt Institute, University of Helsinki, Helsinki, Finland
| | - Goncalo R. Abecasis
- Center for Statistical Genetics, Department of Biostatistics, University of Michigan School of Public Health, Ann Arbor, Michigan, USA
| | - Daniel E. Adkins
- Pharmacotherapy & Outcomes Science, Virginia Commonwealth University, Richmond, Virginia, USA
| | - Arpana Agrawal
- Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri, USA
| | - Jüri Allik
- Department of Psychology, University of Tartu, Tartu, Estonia
- Estonian Academy of Sciences, Tallinn, Estonia
| | - Katja Appel
- Department of Psychiatry and Psychotherapy, University Medicine Greifswald, Greifswald, Germany
| | - Timothy B. Bigdeli
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, Virginia, USA
| | - Fabio Busonero
- Istituto di Ricerca Genetica e Biomedica (IRGB), CNR, Monserrato, Italy
| | - Harry Campbell
- Centre for Population Health Sciences, Medical School, University of Edinburgh, Edinburgh, UK
| | - Paul T. Costa
- Behavioral Medicine Research Center, Duke University School of Medicine, Durham, NC, USA
| | - George Davey Smith
- Medical Research Council Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol, UK
| | - Gail Davies
- Department of Psychology, University of Edinburgh, Edinburgh, UK
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Harriet de Wit
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, USA
| | - Jun Ding
- Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | | | - Johan G. Eriksson
- Folkhälsan Research Center, Helsinki, Finland
- National Institute for Health and Welfare (THL), Helsinki, Finland
- Department of General Practice and Primary Health Care, University of Helsinki, Helsinki, Finland
- Unit of General Practice and Primary Health Care, University of Helsinki, Helsinki, Finland
- Vasa Central Hospital, Vasa, Finland
| | - Iryna O. Fedko
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | | | - Barbara Franke
- Donders Institute for Cognitive Neuroscience, Radboud University Nijmegen, Nijmegen, The Netherlands
- Department of Psychiatry, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
- Department of Human Genetics, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
| | - Ina Giegling
- Department of Psychiatry, University of Halle, Halle, Germany
| | - Richard Grucza
- Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri, USA
| | | | - Andrew C. Heath
- Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri, USA
| | - Kati Heinonen
- Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
| | - Anjali K. Henders
- QIMR Berghofer Medical Research Institute, Herston, Brisbane, Australia
| | - Georg Homuth
- Interfaculty Institute for Genetics and Functional Genomics, University of Greifswald, Germany
| | - Jouke-Jan Hottenga
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | - Joost Janzing
- Department of Psychiatry, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
| | - Markus Jokela
- Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
| | - Robert Karlsson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - John P. Kemp
- Medical Research Council Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol, UK
- University of Queensland Diamantina Institute, Translational Research Institute, Brisbane, Australia
| | | | - Antti Latvala
- Department of Public Health, Hjelt Institute, University of Helsinki, Helsinki, Finland
- National Institute for Health and Welfare (THL), Helsinki, Finland
| | - Terho Lehtimäki
- Department of Clinical Chemistry, Fimlab Laboratories and School of Medicine, University of Tampere, Finland
| | - David C. Liewald
- Department of Psychology, University of Edinburgh, Edinburgh, UK
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Pamela A.F. Madden
- Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri, USA
| | - Chiara Magri
- Department of Molecular and Translational Medicine, University of Brescia, Italy
| | - Patrik K.E. Magnusson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Jonathan Marten
- MRC Human Genetics, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, Scotland, UK
| | - Andrea Maschio
- Istituto di Ricerca Genetica e Biomedica (IRGB), CNR, Monserrato, Italy
| | - Sarah E. Medland
- QIMR Berghofer Medical Research Institute, Herston, Brisbane, Australia
| | - Evelin Mihailov
- Estonian Genome Center, University of Tartu, Tartu, Estonia
- Department of Biotechnology, University of Tartu, Tartu, Estonia
| | - Yuri Milaneschi
- Department of Psychiatry, EMGO+ Institute, Neuroscience Campus Amsterdam, VU University Medical Center, Amsterdam, The Netherlands
| | | | - Matthias Nauck
- Institute of Clinical Chemistry and Laboratory Medicine, University Medicine Greifswald, Greifswald, Germany
| | - Klaasjan G. Ouwens
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
| | - Aarno Palotie
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, UK
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
| | - Erik Pettersson
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Ozren Polasek
- Department of Public Health, Faculty of Medicine, University of Split, Split, Croatia
| | - Yong Qian
- Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Laura Pulkki-Råback
- Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
| | - Olli T. Raitakari
- Department of Clinical Physiology and Nuclear Medicine, Turku University Hospital, Turku, Finland
- Research Centre of Applied and Preventive Cardiovascular Medicine, University of Turku, Turku, Finland
| | - Anu Realo
- Department of Psychology, University of Tartu, Tartu, Estonia
| | - Richard J. Rose
- Department of Psychological & Brain Sciences, Indiana University, Bloomington, IN, USA
| | - Daniela Ruggiero
- Institute of Genetics and Biophysics “A. Buzzati-Traverso” – CNR, Naples, Italy
| | - Carsten O. Schmidt
- Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
| | - Wendy S. Slutske
- Department of Psychological Sciences and Missouri Alcoholism Research Center, University of Missouri, Columbia, Missouri, USA
| | - Rossella Sorice
- Institute of Genetics and Biophysics “A. Buzzati-Traverso” – CNR, Naples, Italy
| | - John M. Starr
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
- Alzheimer Scotland Dementia Research Centre, University of Edinburgh, Edinburgh, UK
- Geriatric Medicine Royal Victoria Hospital, Edinburgh, UK
| | - Beate St Pourcain
- Medical Research Council Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol, UK
- School of Oral and Dental Sciences, University of Bristol, Bristol, UK
- School of Experimental Psychology, University of Bristol, Bristol, UK
| | - Angelina R. Sutin
- National Institute on Aging, NIH, Baltimore, MD, USA
- College of Medicine, Florida State University, Tallahassee, FL, USA
| | - Nicholas J. Timpson
- Medical Research Council Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol, UK
| | - Holly Trochet
- MRC Human Genetics, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, Scotland, UK
| | - Sita Vermeulen
- Department of Human Genetics, Radboud University Nijmegen Medical Center, Nijmegen, The Netherlands
- Department for Health Evidence, Radboud University Medical Center, Nijmegen, The Netherlands
| | - Eero Vuoksimaa
- Department of Public Health, Hjelt Institute, University of Helsinki, Helsinki, Finland
| | - Elisabeth Widen
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
| | - Jasper Wouda
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
- Department of Research Methodology, Measurement and Data-Analysis, University of Twente, Enschede, The Netherlands
| | | | - Lina Zgaga
- Centre for Population Health Sciences, Medical School, University of Edinburgh, Edinburgh, UK
- Department of Public Health and Primary Care, Trinity College Dublin, Dublin, Ireland
| | - Generation Scotland
- Generation Scotland, A Collaboration between the University Medical Schools and NHS, Aberdeen, Dundee, Edinburgh and Glasgow, UK
| | - David Porteous
- Medical Genetics Section, The University of Edinburgh, Centre for Genomics and Experimental Medicine, Institute of Genetics and Molecular Medicine, Western General Hospital, Edinburgh, UK
| | - Alessandra Minelli
- Department of Molecular and Translational Medicine, University of Brescia, Italy
| | - Abraham A. Palmer
- Department of Human Genetics, University of Chicago, Chicago, IL, USA
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, USA
| | - Dan Rujescu
- Department of Psychiatry, University of Halle, Halle, Germany
| | - Marina Ciullo
- Institute of Genetics and Biophysics “A. Buzzati-Traverso” – CNR, Naples, Italy
| | - Caroline Hayward
- MRC Human Genetics, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Western General Hospital, Edinburgh, Scotland, UK
| | - Igor Rudan
- Centre for Population Health Sciences, Medical School, University of Edinburgh, Edinburgh, UK
| | - Andres Metspalu
- Estonian Genome Center, University of Tartu, Tartu, Estonia
- Estonian Academy of Sciences, Tallinn, Estonia
| | - Jaakko Kaprio
- Department of Public Health, Hjelt Institute, University of Helsinki, Helsinki, Finland
- National Institute for Health and Welfare (THL), Helsinki, Finland
- Institute for Molecular Medicine Finland (FIMM), University of Helsinki, Helsinki, Finland
| | - Ian J. Deary
- Department of Psychology, University of Edinburgh, Edinburgh, UK
- Centre for Cognitive Ageing and Cognitive Epidemiology, University of Edinburgh, Edinburgh, UK
| | - Katri Räikkönen
- Institute of Behavioural Sciences, University of Helsinki, Helsinki, Finland
| | - James F. Wilson
- Centre for Population Health Sciences, Medical School, University of Edinburgh, Edinburgh, UK
| | | | - Laura J. Bierut
- Department of Psychiatry, Washington University School of Medicine, St. Louis, Missouri, USA
| | - John M. Hettema
- Department of Psychiatry, Virginia Institute for Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, Virginia, USA
| | - Hans J. Grabe
- Department of Psychiatry and Psychotherapy, University Medicine Greifswald, Greifswald, Germany
- Department of Psychiatry and Psychotherapy, HELIOS Hospital Stralsund, Stralsund, Germany
| | - Cornelia M. van Duijn
- Department of Epidemiology, Erasmus University Medical Center, Rotterdam, The Netherlands
| | - David M. Evans
- Medical Research Council Integrative Epidemiology Unit, School of Social and Community Medicine, University of Bristol, Bristol, UK
- University of Queensland Diamantina Institute, Translational Research Institute, Brisbane, Australia
| | - David Schlessinger
- Laboratory of Genetics, National Institute on Aging, National Institutes of Health, Baltimore, MD, USA
| | - Nancy L. Pedersen
- Institute of Genetics and Biophysics “A. Buzzati-Traverso” – CNR, Naples, Italy
| | - Antonio Terracciano
- Folkhälsan Research Center, Helsinki, Finland
- College of Medicine, Florida State University, Tallahassee, FL, USA
| | - Matt McGue
- Department of Psychology, University of Minnesota, Minneapolis, USA
- Institute of Public Health, University of Southern Denmark, Odense, Denmark
| | - Brenda W.J.H. Penninx
- Department of Psychiatry, EMGO+ Institute, Neuroscience Campus Amsterdam, VU University Medical Center, Amsterdam, The Netherlands
| | | | - Dorret I. Boomsma
- Department of Biological Psychology, VU University Amsterdam, Amsterdam, The Netherlands
33
Zhang W, Spector TD, Deloukas P, Bell JT, Engelhardt BE. Predicting genome-wide DNA methylation using methylation marks, genomic position, and DNA regulatory elements. Genome Biol 2015; 16:14. [PMID: 25616342 PMCID: PMC4389802 DOI: 10.1186/s13059-015-0581-9] [Citation(s) in RCA: 125] [Impact Index Per Article: 13.9] [Received: 08/07/2013] [Accepted: 01/02/2015] [Indexed: 02/07/2023]
Abstract
BACKGROUND: Recent assays for individual-specific genome-wide DNA methylation profiles have enabled epigenome-wide association studies to identify specific CpG sites associated with a phenotype. Computational prediction of CpG site-specific methylation levels is critical to enable genome-wide analyses, but current approaches tackle average methylation within a locus and are often limited to specific genomic regions.
RESULTS: We characterize genome-wide DNA methylation patterns, and show that correlation among CpG sites decays rapidly, making predictions solely based on neighboring sites challenging. We built a random forest classifier to predict methylation levels at CpG site resolution using features including neighboring CpG site methylation levels and genomic distance, co-localization with coding regions, CpG islands (CGIs), and regulatory elements from the ENCODE project. Our approach achieves 92% prediction accuracy of genome-wide methylation levels at single-CpG-site precision. The accuracy increases to 98% when restricted to CpG sites within CGIs and is robust across platform and cell-type heterogeneity. Our classifier outperforms other types of classifiers and identifies features that contribute to prediction accuracy: neighboring CpG site methylation, CGIs, co-localized DNase I hypersensitive sites, transcription factor binding sites, and histone modifications were found to be most predictive of methylation levels.
CONCLUSIONS: Our observations of DNA methylation patterns led us to develop a classifier to predict DNA methylation levels at CpG site resolution with high accuracy. Furthermore, our method identified genomic features that interact with DNA methylation, suggesting mechanisms involved in DNA methylation modification and regulation, and linking diverse epigenetic processes.
Affiliation(s)
- Weiwei Zhang
- Department of Molecular Genetics and Microbiology, Duke University, Durham, NC, USA.
| | - Tim D Spector
- Department of Twin Research and Genetic Epidemiology, King's College London, London, UK.
| | - Panos Deloukas
- William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK.
- Princess Al-Jawhara Al-Brahim Centre of Excellence in Research of Hereditary Disorders (PACER-HD), King Abdulaziz University, Jeddah, 21589, Saudi Arabia.
| | - Jordana T Bell
- Department of Twin Research and Genetic Epidemiology, King's College London, London, UK.
34
Affiliation(s)
| | - Barbara E Engelhardt
- [1] Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA [2] Present address: Biostatistics and Bioinformatics Department and Department of Statistical Science, Duke University, Durham, North Carolina 27708, USA
| | - Matthew Stephens
- [1] Department of Human Genetics, University of Chicago, Chicago, Illinois 60637, USA [2] Department of Statistics, University of Chicago, Chicago, Illinois 60637, USA
| | - Ronald M Krauss
- Children's Hospital Research Institute, Oakland, California 94609, USA
35
Mordelet F, Horton J, Hartemink AJ, Engelhardt BE, Gordân R. Stability selection for regression-based models of transcription factor-DNA binding specificity. Bioinformatics 2013; 29:i117-25. [PMID: 23812975 PMCID: PMC3694650 DOI: 10.1093/bioinformatics/btt221] [Citation(s) in RCA: 38] [Impact Index Per Article: 3.5] [Indexed: 01/11/2023]
Abstract
Motivation: The DNA binding specificity of a transcription factor (TF) is typically represented using a position weight matrix model, which implicitly assumes that individual bases in a TF binding site contribute independently to the binding affinity, an assumption that does not always hold. For this reason, more complex models of binding specificity have been developed. However, these models have their own caveats: they typically have a large number of parameters, which makes them hard to learn and interpret.
Results: We propose novel regression-based models of TF–DNA binding specificity, trained using high resolution in vitro data from custom protein-binding microarray (PBM) experiments. Our PBMs are specifically designed to cover a large number of putative DNA binding sites for the TFs of interest (yeast TFs Cbf1 and Tye7, and human TFs c-Myc, Max and Mad2) in their native genomic context. These high-throughput quantitative data are well suited for training complex models that take into account not only independent contributions from individual bases, but also contributions from di- and trinucleotides at various positions within or near the binding sites. To ensure that our models remain interpretable, we use feature selection to identify a small number of sequence features that accurately predict TF–DNA binding specificity. To further illustrate the accuracy of our regression models, we show that even in the case of paralogous TF with highly similar position weight matrices, our new models can distinguish the specificities of individual factors. Thus, our work represents an important step toward better sequence-based models of individual TF–DNA binding specificity.
Availability: Our code is available at http://genome.duke.edu/labs/gordan/ISMB2013. The PBM data used in this article are available in the Gene Expression Omnibus under accession number GSE47026.
Contact: raluca.gordan@duke.edu
Affiliation(s)
- Fantine Mordelet
- Institute for Genome Sciences and Policy, Duke University, Durham, NC 27708, USA
36
Brown CD, Mangravite LM, Engelhardt BE. Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs. PLoS Genet 2013; 9:e1003649. [PMID: 23935528 PMCID: PMC3731231 DOI: 10.1371/journal.pgen.1003649] [Citation(s) in RCA: 106] [Impact Index Per Article: 9.6] [Received: 10/04/2012] [Accepted: 06/04/2013] [Indexed: 12/11/2022]
Abstract
Genetic variants in cis-regulatory elements or trans-acting regulators frequently influence the quantity and spatiotemporal distribution of gene transcription. Recent interest in expression quantitative trait locus (eQTL) mapping has paralleled the adoption of genome-wide association studies (GWAS) for the analysis of complex traits and disease in humans. Under the hypothesis that many GWAS associations tag non-coding SNPs with small effects, and that these SNPs exert phenotypic control by modifying gene expression, it has become common to interpret GWAS associations using eQTL data. To fully exploit the mechanistic interpretability of eQTL-GWAS comparisons, an improved understanding of the genetic architecture and causal mechanisms of cell type specificity of eQTLs is required. We address this need by performing an eQTL analysis in three parts: first we identified eQTLs from eleven studies on seven cell types; then we integrated eQTL data with cis-regulatory element (CRE) data from the ENCODE project; finally we built a set of classifiers to predict the cell type specificity of eQTLs. The cell type specificity of eQTLs is associated with eQTL SNP overlap with hundreds of cell type specific CRE classes, including enhancer, promoter, and repressive chromatin marks, regions of open chromatin, and many classes of DNA binding proteins. These associations provide insight into the molecular mechanisms generating the cell type specificity of eQTLs and the mode of regulation of corresponding eQTLs. Using a random forest classifier with cell specific CRE-SNP overlap as features, we demonstrate the feasibility of predicting the cell type specificity of eQTLs. We then demonstrate that CREs from a trait-associated cell type can be used to annotate GWAS associations in the absence of eQTL data for that cell type. 
We anticipate that such integrative, predictive modeling of cell specificity will improve our ability to understand the mechanistic basis of human complex phenotypic variation.

When interpreting genome-wide association studies showing that specific genetic variants are associated with disease risk, scientists look for a link between the genetic variant and a biological mechanism behind that disease. One functional mechanism is that the genetic variant may influence gene transcription via a co-localized genomic regulatory element, such as a transcription factor binding site within an open chromatin region. Often this type of regulation occurs in some cell types but not others. In this study, we look across eleven gene expression studies with seven cell types and consider how genetic transcription regulators, or eQTLs, replicate within and between cell types. We identify pervasive allelic heterogeneity, or transcriptional control of a single gene by multiple, independent eQTLs. We integrate extensive data on cell type specific regulatory elements from ENCODE to identify general methods of transcription regulation through enrichment of eQTLs within regulatory elements. We also build a classifier to predict eQTL replication across cell types. The results in this paper present a path to an integrative, predictive approach to improve our ability to understand the mechanistic basis of human phenotypic variation.
Affiliation(s)
- Christopher D. Brown
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, United States of America
- * E-mail: (CDB); (BEE)
| | | | - Barbara E. Engelhardt
- Biostatistics & Bioinformatics Department, Duke University, Durham, North Carolina, United States of America
- Department of Statistical Science, Duke University, Durham, North Carolina, United States of America
- Institute for Genome Sciences & Policy, Duke University, Durham, North Carolina, United States of America
- * E-mail: (CDB); (BEE)
37
Muratore KE, Engelhardt BE, Srouji JR, Jordan MI, Brenner SE, Kirsch JF. Molecular function prediction for a family exhibiting evolutionary tendencies toward substrate specificity swapping: recurrence of tyrosine aminotransferase activity in the Iα subfamily. Proteins 2013; 81:1593-609. [PMID: 23671031 PMCID: PMC3823064 DOI: 10.1002/prot.24318] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 11/21/2012] [Revised: 04/11/2013] [Accepted: 04/19/2013] [Indexed: 11/17/2022]
Abstract
The subfamily Iα aminotransferases are typically categorized as having narrow specificity toward carboxylic amino acids (AATases), or broad specificity that includes aromatic amino acid substrates (TATases). Because of their general role in central metabolism and, more specifically, their association with liver-related diseases in humans, this subfamily is biologically interesting. The substrate specificities for only a few members of this subfamily have been reported, and the reliable prediction of substrate specificity from protein sequence has remained elusive. In this study, a diverse set of aminotransferases was chosen for characterization based on a scoring system that measures the sequence divergence of the active site. The enzymes that were experimentally characterized include both narrow-specificity AATases and broad-specificity TATases, as well as AATases with broader specificity and TATases with narrower specificity than the previously known family members. Molecular function and phylogenetic analyses underscored the complexity of this family's evolution, as the TATase function does not follow a single evolutionary thread, but rather appears independently multiple times during the evolution of the subfamily. The additional functional characterizations described in this article, alongside a detailed sequence and phylogenetic analysis, provide some novel clues to understanding the evolutionary mechanisms at work in this family.
Affiliation(s)
- Kathryn E Muratore
- Department of Molecular and Cell Biology, University of California, Berkeley, California
38
Hart AB, Engelhardt BE, Wardle MC, Sokoloff G, Stephens M, de Wit H, Palmer AA. Genome-wide association study of d-amphetamine response in healthy volunteers identifies putative associations, including cadherin 13 (CDH13). PLoS One 2012; 7:e42646. [PMID: 22952603 PMCID: PMC3429486 DOI: 10.1371/journal.pone.0042646] [Citation(s) in RCA: 68] [Impact Index Per Article: 5.7] [Received: 06/11/2012] [Accepted: 07/11/2012] [Indexed: 12/11/2022]
Abstract
Both the subjective response to d-amphetamine and the risk for amphetamine addiction are known to be heritable traits. Because subjective responses to drugs may predict drug addiction, identifying alleles that influence acute response may also provide insight into the genetic risk factors for drug abuse. We performed a Genome Wide Association Study (GWAS) for the subjective responses to amphetamine in 381 non-drug abusing healthy volunteers. Responses to amphetamine were measured using a double-blind, placebo-controlled, within-subjects design. We used sparse factor analysis to reduce the dimensionality of the data to ten factors. We identified several putative associations; the strongest was between a positive subjective drug-response factor and a SNP (rs3784943) in the 8th intron of cadherin 13 (CDH13; P = 4.58×10⁻⁸), a gene previously associated with a number of psychiatric traits including methamphetamine dependence. Additionally, we observed a putative association between a factor representing the degree of positive affect at baseline and a SNP (rs472402) in the 1st intron of steroid-5-alpha-reductase-α-polypeptide-1 (SRD5A1; P = 2.53×10⁻⁷), a gene whose protein product catalyzes the rate-limiting step in synthesis of the neurosteroid allopregnanolone. This SNP belongs to an LD-block that has been previously associated with the expression of SRD5A1 and differences in SRD5A1 enzymatic activity. The purpose of this study was to begin to explore the genetic basis of subjective responses to stimulant drugs using a GWAS approach in a modestly sized sample. Our approach provides a case study for analysis of high-dimensional intermediate pharmacogenomic phenotypes, which may be more tractable than clinical diagnoses.
Affiliation(s)
- Amy B. Hart
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
| | - Barbara E. Engelhardt
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Department of Computer Science, University of Chicago, Chicago, Illinois, United States of America
| | - Margaret C. Wardle
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, Illinois, United States of America
| | - Greta Sokoloff
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
| | - Matthew Stephens
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Department of Statistics, University of Chicago, Chicago, Illinois, United States of America
| | - Harriet de Wit
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, Illinois, United States of America
| | - Abraham A. Palmer
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Department of Psychiatry and Behavioral Neuroscience, University of Chicago, Chicago, Illinois, United States of America
39
Engelhardt BE, Jordan MI, Srouji JR, Brenner SE. Genome-scale phylogenetic function annotation of large and diverse protein families. Genome Res 2011; 21:1969-80. [PMID: 21784873 PMCID: PMC3205580 DOI: 10.1101/gr.104687.109] [Citation(s) in RCA: 47] [Impact Index Per Article: 3.6] [Received: 12/29/2009] [Accepted: 07/11/2011] [Indexed: 11/25/2022]
Abstract
The Statistical Inference of Function Through Evolutionary Relationships (SIFTER) framework uses a statistical graphical model that applies phylogenetic principles to automate precise protein function prediction. Here we present a revised approach (SIFTER version 2.0) that enables annotations on a genomic scale. SIFTER 2.0 produces equivalently precise predictions compared to the earlier version on a carefully studied family and on a collection of 100 protein families. We have added an approximation method to SIFTER 2.0 and show a 500-fold improvement in speed with minimal impact on prediction results in the functionally diverse sulfotransferase protein family. On the Nudix protein family, previously inaccessible to the SIFTER framework because of the 66 possible molecular functions, SIFTER achieved 47.4% accuracy on experimental data (where BLAST achieved 34.0%). Finally, we used SIFTER to annotate all of the Schizosaccharomyces pombe proteins with experimental functional characterizations, based on annotations from proteins in 46 fungal genomes. SIFTER precisely predicted molecular function for 45.5% of the characterized proteins in this genome, as compared with four current function prediction methods that precisely predicted function for 62.6%, 30.6%, 6.0%, and 5.7% of these proteins. We use both precision-recall curves and ROC analyses to compare these genome-scale predictions across the different methods and to assess performance on different types of applications. SIFTER 2.0 is capable of predicting protein molecular function for large and functionally diverse protein families using an approximate statistical model, enabling phylogenetics-based protein function prediction for genome-wide analyses. The code for SIFTER and protein family data are available at http://sifter.berkeley.edu.
Affiliation(s)
- Barbara E Engelhardt
- Electrical Engineering and Computer Science Department, University of California, Berkeley, California 94720, USA.
40
Abstract
It is now easier to discover thousands of protein sequences in a new microbial genome than it is to biochemically characterize the specific activity of a single protein of unknown function. The molecular functions of protein sequences have typically been predicted using homology-based computational methods, which rely on the principle that homologous proteins share a similar function. However, some protein families include groups of proteins with different molecular functions. A phylogenetic approach for predicting molecular function (sometimes called "phylogenomics") is an effective means to predict protein molecular function. These methods incorporate functional evidence from all members of a family that have functional characterizations using the evolutionary history of the protein family to make robust predictions for the uncharacterized proteins. However, they are often difficult to apply on a genome-wide scale because of the time-consuming step of reconstructing the phylogenies of each protein to be annotated. Our automated approach for function annotation using phylogeny, the SIFTER (Statistical Inference of Function Through Evolutionary Relationships) methodology, uses a statistical graphical model to compute the probabilities of molecular functions for unannotated proteins. Our benchmark tests showed that SIFTER provides accurate functional predictions on various protein families, outperforming other available methods.
41
Abstract
We present a statistical graphical model to infer specific molecular function for unannotated protein sequences using homology. Based on phylogenomic principles, SIFTER (Statistical Inference of Function Through Evolutionary Relationships) accurately predicts molecular function for members of a protein family given a reconciled phylogeny and available function annotations, even when the data are sparse or noisy. Our method produced specific and consistent molecular function predictions across 100 Pfam families in comparison to the Gene Ontology annotation database, BLAST, GOtcha, and Orthostrapper. We performed a more detailed exploration of functional predictions on the adenosine-5'-monophosphate/adenosine deaminase family and the lactate/malate dehydrogenase family, in the former case comparing the predictions against a gold standard set of published functional characterizations. Given function annotations for 3% of the proteins in the deaminase family, SIFTER achieves 96% accuracy in predicting molecular function for experimentally characterized proteins as reported in the literature. The accuracy of SIFTER on this dataset is a significant improvement over other currently available methods such as BLAST (75%), GeneQuiz (64%), GOtcha (89%), and Orthostrapper (11%). We also experimentally characterized the adenosine deaminase from Plasmodium falciparum, confirming SIFTER's prediction. The results illustrate the predictive power of exploiting a statistical model of function evolution in phylogenomic problems. A software implementation of SIFTER is available from the authors.
Affiliation(s)
- Barbara E Engelhardt
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California, United States of America.