1
Kreimer A, Zeng H, Edwards MD, Guo Y, Tian K, Shin S, Welch R, Wainberg M, Mohan R, Sinnott-Armstrong NA, Li Y, Eraslan G, Amin TB, Goke J, Mueller NS, Kellis M, Kundaje A, Beer MA, Keles S, Gifford DK, Yosef N. Predicting gene expression in massively parallel reporter assays: A comparative study. Hum Mutat 2017; 38:1240-1250. [PMID: 28220625 PMCID: PMC5560998 DOI: 10.1002/humu.23197] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Received: 08/26/2016] [Revised: 01/19/2017] [Accepted: 02/12/2017] [Indexed: 02/03/2023]
Abstract
In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role "coded" in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.
Affiliation(s)
- Anat Kreimer
- Department of Electrical Engineering and Computer Science and Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Department of Bioengineering and Therapeutic Sciences, Institute for Human Genetics, University of California, San Francisco, San Francisco, California, USA
- Haoyang Zeng
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Matthew D. Edwards
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Yuchun Guo
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Kevin Tian
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Sunyoung Shin
- Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Rene Welch
- Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- Michael Wainberg
- Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
- Rahul Mohan
- Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
- Nicholas A. Sinnott-Armstrong
- Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
- Yue Li
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, Massachusetts 02139, USA
- Gökcen Eraslan
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
- Talal Bin Amin
- Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672, Singapore
- Jonathan Goke
- Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672, Singapore
- Nikola S. Mueller
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstr. 1, 85764 Neuherberg, Germany
- Manolis Kellis
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, Massachusetts 02139, USA
- Anshul Kundaje
- Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
- Michael A Beer
- McKusick-Nathans Institute of Genetic Medicine, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Sunduz Keles
- Department of Statistics, Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, Wisconsin, USA
- David K. Gifford
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
- Nir Yosef
- Department of Electrical Engineering and Computer Science and Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Ragon Institute of Massachusetts General Hospital, MIT and Harvard, Cambridge, MA 02139
2
Davidson NR, Brazma A, Brooks AN, Calabrese C, Fonseca NA, Goke J, He Y, Hu X, Kahles A, Lehmann KV, Liu F, Rätsch G, Li S, Schwarz RF, Yang M, Zhang Z, Zhang F, Zheng L. Abstract 389: Integrating diverse transcriptomic alterations to identify cancer-relevant genes. Cancer Res 2017. [DOI: 10.1158/1538-7445.am2017-389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2022]
Abstract
Introduction:
We present a novel method to identify cancer driver genes that jointly examines any number of diverse transcriptomic alterations with the goal to uncover highly recurrent and heterogeneous patterns in 1190 samples across 26 cancer types as part of the PanCancer Analysis of Whole Genomes (PCAWG) of the International Cancer Genome Consortium (ICGC).
Motivation:
Previous pan-cancer genomic studies have focused on the analysis of somatic mutations as the driver of phenotypic changes. Here, we propose a method to integrate a wide variety of RNA and DNA changes to redefine the concept of driver events and account for the transcriptome’s role in tumorigenesis. PTK2 provides a motivating example, since it has many RNA alterations that correlate with patient survival, such as overexpression, exon-skips, and alternative promoter usage.
In our analysis, we integrate an unprecedented amount of various alterations including gene fusions, RNA editing, alternative splicing, expression outliers, alternative promoters, allele specific expression, and somatic mutations. This enables us to also identify mutually exclusive (MutE) and co-occurring (CoO) patterns between different types of alterations within a gene.
Methods:
Our method has 3 main strengths: flexibility to handle any number or type of alteration, sensitivity to different frequencies of alterations so rare events are not lost in the recurrence analysis, and diversity of ranking such that genes with multiple alterations are prioritized. Our method is summarized in two steps:
1) Identify genes that are both recurrently and heterogeneously altered across many samples by calculating a rank-based score for each gene.
2) Identify MutE and CoO patterns between alteration types for the genes identified in the previous step.
To ensure that alterations were comparable, we applied a thresholding model to binarize all alterations for gene-sample pairs, allowing us to account for the properties of the different modalities involved.
Step 1 of our method calculates a score for each gene that takes into account: 1) the number of alterations to a gene across all samples, 2) the rarity of each alteration, and 3) how many types of alterations are observed per gene. The score is then used to rank the genes and top genes are considered for MutE and CoO analyses.
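For illustration only: the abstract names the three ingredients of the Step 1 score but does not give the formula. The Python sketch below shows one way such a score could be assembled from binarized (gene, sample, alteration type) events. The function names `score_genes` and `rank_genes`, the inverse-frequency rarity weight, and the multiplication by the number of alteration types are all assumptions made for this sketch, not the authors' scoring function.

```python
from collections import defaultdict


def score_genes(alterations):
    """Score genes from binarized events given as (gene, sample, alteration_type).

    Hypothetical scoring rule (not taken from the abstract):
      score = (sum of rarity-weighted events for the gene)
              * (number of distinct alteration types seen in the gene)
    """
    events = set(alterations)  # binarized: each gene-sample-type event counts once
    n_events = len(events)

    # How often each alteration type occurs overall; rarer types weigh more.
    type_totals = defaultdict(int)
    for _gene, _sample, kind in events:
        type_totals[kind] += 1

    weighted = defaultdict(float)   # ingredients 1 and 2: recurrence and rarity
    kinds_seen = defaultdict(set)   # ingredient 3: diversity of alteration types
    for gene, _sample, kind in events:
        weighted[gene] += n_events / type_totals[kind]  # assumed rarity weight
        kinds_seen[gene].add(kind)

    return {gene: weighted[gene] * len(kinds_seen[gene]) for gene in weighted}


def rank_genes(alterations):
    """Return genes ordered from highest to lowest score (ties broken by name)."""
    scores = score_genes(alterations)
    return sorted(scores, key=lambda gene: (-scores[gene], gene))
```

With this rule, a gene carrying one rare fusion plus two splicing events in different samples would outrank a gene carrying only two splicing events, which mirrors the stated aim of prioritizing genes with diverse and rare alterations. The top-ranked genes from `rank_genes` would then be the candidates passed on to the MutE and CoO analyses.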
Results:
Our top 100 ranked genes were highly enriched for cancer census genes (adjusted p-value: 2.06e-9), indicating that we identify cancer relevant genes. Our top five ranked cancer census genes were IGF2, ERBB2, RARA, CREBBP, and ARID1A; all of which had at least 4 of 7 possible alterations, showing our scoring method prioritizes genes with diverse alterations. We also found that alternative promoter usage and alternative splicing were highly co-occurring alterations, with PTK2 having the highest co-occurrence between them. In summary, we propose a new method to analyze various RNA disruptions and show it can yield new insights beyond genomic variation.
Citation Format: Natalie R. Davidson, PanCancer Analysis of Whole Genomes 3 (PCAWG-3) for ICGC, Alvis Brazma, Angela N. Brooks, Claudia Calabrese, Nuno A. Fonseca, Jonathan Goke, Yao He, Xueda Hu, Andre Kahles, Kjong-Van Lehmann, Fenglin Liu, Gunnar Rätsch, Siliang Li, Roland F. Schwarz, Mingyu Yang, Zemin Zhang, Fan Zhang, Liangtao Zheng. Integrating diverse transcriptomic alterations to identify cancer-relevant genes [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2017; 2017 Apr 1-5; Washington, DC. Philadelphia (PA): AACR; Cancer Res 2017;77(13 Suppl):Abstract nr 389. doi:10.1158/1538-7445.AM2017-389
Affiliation(s)
- Alvis Brazma
- European Molecular Biology Laboratory - European Bioinformatics Institute, Hinxton, United Kingdom
- Claudia Calabrese
- European Molecular Biology Laboratory - European Bioinformatics Institute, Hinxton, United Kingdom
- Nuno A. Fonseca
- European Molecular Biology Laboratory - European Bioinformatics Institute, Hinxton, United Kingdom
- Jonathan Goke
- Genome Institute of Singapore, Singapore, Singapore
- Yao He
- Peking-Tsinghua Center for Life Sciences, Beijing, China
- Xueda Hu
- Peking-Tsinghua Center for Life Sciences, Beijing, China
- Fenglin Liu
- Peking-Tsinghua Center for Life Sciences, Beijing, China
- Mingyu Yang
- Peking-Tsinghua Center for Life Sciences, Beijing, China
- Zemin Zhang
- Peking-Tsinghua Center for Life Sciences, Beijing, China
- Fan Zhang
- Peking-Tsinghua Center for Life Sciences, Beijing, China
- Liangtao Zheng
- Peking-Tsinghua Center for Life Sciences, Beijing, China
3
Kosta K, Sabroe I, Goke J, Nibbs RJ, Tsanakas J, Whyte MK, Teare MD. A Bayesian approach to copy-number-polymorphism analysis in nuclear pedigrees. Am J Hum Genet 2007; 81:808-12. [PMID: 17847005 PMCID: PMC2227930 DOI: 10.1086/520096] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 03/06/2007] [Accepted: 05/09/2007] [Indexed: 02/03/2023]
Abstract
Segmental copy-number polymorphisms (CNPs) represent a significant component of human genetic variation and are likely to contribute to disease susceptibility. These potentially multiallelic and highly polymorphic systems present new challenges to family-based genetic-analysis tools that commonly assume codominant markers and allow for no genotyping error. The copy-number quantitation (CNP phenotype) represents the total number of segmental copies present in an individual and provides a means to infer, rather than to observe, the underlying allele segregation. We present an integrated approach to meet these challenges, in the form of a graphical model in which we infer the underlying CNP phenotype from the (single or replicate) quantitative measure within the analysis while assuming an allele-based system segregating through the pedigree. This approach can be readily applied to the study of any form of genetic measure, and the construction permits extension to a wide variety of hypothesis tests. We have implemented the basic model for use with nuclear families, and we illustrate its application through an analysis of the CNP located in gene CCL3L1 in 201 families with asthma.
Affiliation(s)
- Konstantina Kosta
- School of Medicine and Biomedical Sciences, University of Sheffield, Sheffield, UK