1
Abstract
We consider data-analysis settings where data are missing not at random. In these cases, the two basic modeling approaches are 1) pattern-mixture models, with separate distributions for missing data and observed data, and 2) selection models, with a distribution for the data before observation and a missing-data mechanism that selects which data are observed. These two modeling approaches lead to distinct factorizations of the joint distribution of the data and the missing-data indicators. In this paper, we explore a third approach, apparently originally proposed by J. W. Tukey as a remark in a discussion between Rubin and Hartigan, and reported by Holland in a two-page note, which has so far been neglected. Data analyses typically rely upon assumptions about the missingness mechanisms that lead to observed versus missing data, and these assumptions are typically unassessable. We explore an approach where the joint distribution of observed data and missing data is specified in a nonstandard way. In this formulation, which traces back to a representation of the joint distribution of the data and missingness mechanism, apparently first proposed by J. W. Tukey, the modeling assumptions about the distributions are either assessable or are designed to allow relatively easy incorporation of substantive knowledge about the problem at hand, thereby offering a possibly realistic portrayal of the data, both observed and missing. We develop Tukey’s representation for exponential-family models, propose a computationally tractable approach to inference in this class of models, and offer some general theoretical comments. We then illustrate the utility of this approach with an example in systems biology.
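For orientation, the three factorizations contrasted in this abstract can be sketched as follows. The notation is an assumption made here for illustration and does not come from the abstract: $Y$ denotes the data, $R$ the missing-data indicator, with $R = 1$ for observed and $R = 0$ for missing values.

```latex
\begin{align*}
\text{Selection model:} \quad
  & f(y, r) = f(y)\, P(R = r \mid y) \\
\text{Pattern-mixture model:} \quad
  & f(y, r) = f(y \mid R = r)\, P(R = r) \\
\text{Tukey's representation:} \quad
  & f(y \mid R = 0) = f(y \mid R = 1)\,
    \frac{P(R = 0 \mid y)}{P(R = 1 \mid y)}\,
    \frac{P(R = 1)}{P(R = 0)}
\end{align*}
```

The third line follows from applying Bayes' rule to $f(y \mid R = 0)$ and $f(y \mid R = 1)$ and taking their ratio. It expresses the missing-data density through the observed-data density, which can be checked against data, and the selection odds, which is where substantive knowledge about the mechanism would enter.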
2
Affiliation(s)
- Xiangyu Luo
  - Department of Statistics, The Chinese University of Hong Kong, Ma Liu Shui, Hong Kong
- Yingying Wei
  - Department of Statistics, The Chinese University of Hong Kong, Ma Liu Shui, Hong Kong
3
Ho B, Baryshnikova A, Brown GW. Unification of Protein Abundance Datasets Yields a Quantitative Saccharomyces cerevisiae Proteome. Cell Syst 2018; 6:192-205.e3. [PMID: 29361465 DOI: 10.1016/j.cels.2017.12.004] [Received: 05/24/2017] [Revised: 10/10/2017] [Accepted: 12/08/2017]
Abstract
Protein activity is the ultimate arbiter of function in most cellular pathways, and protein concentration is fundamentally connected to protein action. While the proteome of yeast has been subjected to the most comprehensive analysis of any eukaryote, existing datasets are difficult to compare, and there is no consensus abundance value for each protein. We evaluated 21 quantitative analyses of the S. cerevisiae proteome, normalizing and converting all measurements of protein abundance into the intuitive measurement of absolute molecules per cell. We estimate the cellular abundance of 92% of the proteins in the yeast proteome and assess the variation in each abundance measurement. Using our protein abundance dataset, we find that a global response to diverse environmental stresses is not detected at the level of protein abundance, we find that protein tags have only a modest effect on protein abundance, and we identify proteins that are differentially regulated at the mRNA abundance, mRNA translation, and protein abundance levels.
Affiliation(s)
- Brandon Ho
  - Department of Biochemistry and Donnelly Center, University of Toronto, Toronto, ON M5S 1A8, Canada
- Anastasia Baryshnikova
  - Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- Grant W Brown
  - Department of Biochemistry and Donnelly Center, University of Toronto, Toronto, ON M5S 1A8, Canada
4
Abstract
Transcriptional and post-transcriptional regulation shape tissue-type-specific proteomes, but their relative contributions remain contested. Estimates of the factors determining protein levels in human tissues do not distinguish between (i) the factors determining the variability between the abundances of different proteins, i.e., mean-level-variability, and (ii) the factors determining the physiological variability of the same protein across different tissue types, i.e., across-tissues variability. We sought to estimate the contribution of transcript levels to these two orthogonal sources of variability, and found that scaled mRNA levels can account for most of the mean-level-variability but not necessarily for across-tissues variability. The reliable quantification of the latter estimate is limited by substantial measurement noise. However, protein-to-mRNA ratios exhibit substantial across-tissues variability that is functionally concerted and reproducible across different datasets, suggesting extensive post-transcriptional regulation. These results caution against estimating protein fold-changes from mRNA fold-changes between different cell types, and highlight the contribution of post-transcriptional regulation to shaping tissue-type-specific proteomes.
Affiliation(s)
- Alexander Franks
  - Department of Statistics, University of Washington, Seattle, WA 98195, USA
- Edoardo Airoldi
  - Department of Statistics, Harvard University, Cambridge, MA 02138, USA
  - Broad Institute of MIT and Harvard University, Cambridge, MA 02142, USA
- Nikolai Slavov
  - Department of Bioengineering, Northeastern University, Boston, MA 02115, USA
  - Department of Biology, Northeastern University, Boston, MA 02115, USA
5
Re A, Waldron L, Quattrone A. Control of Gene Expression by RNA Binding Protein Action on Alternative Translation Initiation Sites. PLoS Comput Biol 2016; 12:e1005198. [PMID: 27923063 PMCID: PMC5140048 DOI: 10.1371/journal.pcbi.1005198] [Received: 10/04/2015] [Accepted: 10/13/2016]
Abstract
Transcript levels do not faithfully predict protein levels, due to post-transcriptional regulation of gene expression mediated by RNA binding proteins (RBPs) and non-coding RNAs. We developed a multivariate linear regression model integrating RBP levels and predicted RBP-mRNA regulatory interactions from matched transcript and protein datasets. RBPs significantly improved the accuracy in predicting protein abundance of a portion of the total modeled mRNAs in three panels of tissues and cells and for different methods employed in the detection of mRNA and protein. The presence of upstream translation initiation sites (uTISs) at the mRNA 5’ untranslated regions was strongly associated with improvement in predictive accuracy. On the basis of these observations, we propose that the recently discovered widespread uTISs in the human genome can be a previously unappreciated substrate of translational control mediated by RBPs.
Gene expression is a dynamic program by which the information stored in the genome is rendered functional by production and degradation of two types of macromolecules, RNAs and proteins. mRNAs are templates for proteins; therefore we expect correspondence between quantities of mRNAs and proteins. Genome-wide studies instead indicate a marked discrepancy between them, when considering their steady-state levels or their variations across different conditions. We employed linear regression approaches with paired mRNA/protein datasets in order to develop a model predicting the protein level of a gene from both the mRNA level and the protein levels of RBPs inferred to bind the mRNA untranslated regions. The results of our analyses restricted the utility of RBPs to improve accuracy of predicted protein abundance to a small fraction of the total modelled genes, and identified a novel association of the improvement induced by RBPs with the presence of upstream translation sites. This finding suggests a new avenue of experimental studies aimed at exploring the hypothesis that RBPs could influence protein abundance by changing the preference for certain translation initiation sites.
Affiliation(s)
- Angela Re
  - Laboratory of Translational Genomics, Centre for Integrative Biology, University of Trento, Polo Scientifico e Tecnologico Fabio Ferrari, Trento, Italy
- Levi Waldron
  - City University of New York Graduate School of Public Health and Health Policy, New York, New York, United States of America
- Alessandro Quattrone
  - Laboratory of Translational Genomics, Centre for Integrative Biology, University of Trento, Polo Scientifico e Tecnologico Fabio Ferrari, Trento, Italy
6
Csárdi G, Franks A, Choi DS, Airoldi EM, Drummond DA. Accounting for experimental noise reveals that mRNA levels, amplified by post-transcriptional processes, largely determine steady-state protein levels in yeast. PLoS Genet 2015; 11:e1005206. [PMID: 25950722 PMCID: PMC4423881 DOI: 10.1371/journal.pgen.1005206] [Received: 06/24/2014] [Accepted: 04/10/2015]
Abstract
Cells respond to their environment by modulating protein levels through mRNA transcription and post-transcriptional control. Modest observed correlations between global steady-state mRNA and protein measurements have been interpreted as evidence that mRNA levels determine roughly 40% of the variation in protein levels, indicating dominant post-transcriptional effects. However, the techniques underlying these conclusions, such as correlation and regression, yield biased results when data are noisy, missing systematically, and collinear (all properties of mRNA and protein measurements), which motivated us to revisit this subject. Noise-robust analyses of 24 studies of budding yeast reveal that mRNA levels explain more than 85% of the variation in steady-state protein levels. Protein levels are not proportional to mRNA levels, but rise much more rapidly. Regulation of translation suffices to explain this nonlinear effect, revealing post-transcriptional amplification of, rather than competition with, transcriptional signals. These results substantially revise widely credited models of protein-level regulation, and introduce multiple noise-aware approaches essential for proper analysis of many biological phenomena.
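The claim that correlation is biased by measurement noise can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the authors' analysis: the slope of 1.5, the noise levels, the sample size, and the function names `pearson` and `simulate` are all hypothetical choices made for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=5000, noise_sd=1.0, seed=0):
    """Return (latent correlation, correlation of noisy measurements)."""
    rng = random.Random(seed)
    # Hypothetical latent log-levels: protein tracks mRNA closely.
    mrna = [rng.gauss(0.0, 1.0) for _ in range(n)]
    protein = [1.5 * m + rng.gauss(0.0, 0.3) for m in mrna]
    # Independent measurement noise added to both quantities.
    mrna_obs = [m + rng.gauss(0.0, noise_sd) for m in mrna]
    protein_obs = [p + rng.gauss(0.0, noise_sd) for p in protein]
    return pearson(mrna, protein), pearson(mrna_obs, protein_obs)


true_r, observed_r = simulate()
print(f"latent r^2:   {true_r ** 2:.2f}")
print(f"observed r^2: {observed_r ** 2:.2f}")
```

Under these assumed settings the squared correlation of the noisy measurements is far lower than that of the latent values, even though the underlying relationship is unchanged. This is the attenuation effect that noise-aware methods are meant to correct.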
Affiliation(s)
- Gábor Csárdi
  - Dept. of Statistics, Harvard University, Cambridge, Massachusetts, United States of America
- Alexander Franks
  - Dept. of Statistics, Harvard University, Cambridge, Massachusetts, United States of America
- David S. Choi
  - Dept. of Statistics, Harvard University, Cambridge, Massachusetts, United States of America
- Edoardo M. Airoldi
  - Dept. of Statistics, Harvard University, Cambridge, Massachusetts, United States of America
  - The Broad Institute of Harvard & MIT, Cambridge, Massachusetts, United States of America
- D. Allan Drummond
  - Dept. of Biochemistry & Molecular Biology, University of Chicago, Chicago, Illinois, United States of America
  - Dept. of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
7
Csárdi G, Franks A, Choi DS, Airoldi EM, Drummond DA. Accounting for experimental noise reveals that mRNA levels, amplified by post-transcriptional processes, largely determine steady-state protein levels in yeast. PLoS Genet 2015. [PMID: 25950722 DOI: 10.5061/dryad.d644f]
8
Abstract
We consider the problem of quantifying temporal coordination between multiple high-dimensional responses. We introduce a family of multi-way stochastic blockmodels suited for this problem, which avoids pre-processing steps such as binning and thresholding that are commonly adopted for this type of problem in biology. We develop two inference procedures based on collapsed Gibbs sampling and variational methods. We provide a thorough evaluation of the proposed methods on simulated data, in terms of membership and blockmodel estimation, out-of-sample prediction, and run-time. We also quantify the effects of censoring procedures such as binning and thresholding on the estimation tasks. We use these models to carry out an empirical analysis of the functional mechanisms driving the coordination between gene expression and metabolite concentrations during carbon and nitrogen starvation in S. cerevisiae.