1
|
Wang J, Liang H, Zhang Q, Ma S. Replicability in cancer omics data analysis: measures and empirical explorations. Brief Bioinform 2022; 23:bbac304. [PMID: 35876281 PMCID: PMC9487717 DOI: 10.1093/bib/bbac304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/09/2022] [Revised: 06/30/2022] [Accepted: 07/06/2022] [Indexed: 02/05/2023] Open
Abstract
In biomedical research, the replicability of findings across studies is highly desired. In this study, we focus on cancer omics data, for which the examination of replicability has been mostly focused on important omics variables identified in different studies. In published literature, although there have been extensive attention and ad hoc discussions, there is insufficient quantitative research looking into replicability measures and their properties. The goal of this study is to fill this important knowledge gap. In particular, we consider three sensible replicability measures, for which we examine distributional properties and develop a way of making inference. Applying them to three The Cancer Genome Atlas (TCGA) datasets reveals in general low replicability and significant across-data variations. To further comprehend such findings, we resort to simulation, which confirms the validity of the findings with the TCGA data and further informs the dependence of replicability on signal level (or equivalently sample size). Overall, this study can advance our understanding of replicability for cancer omics and other studies that have identification as a key goal.
Collapse
Affiliation(s)
- Jiping Wang
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
| | - Hongmin Liang
- Department of Statistics, School of Economics, Xiamen University, Xiamen, Fujian, China
| | - Qingzhao Zhang
- Department of Statistics, School of Economics, Xiamen University, Xiamen, Fujian, China
- The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, Fujian, China
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
| |
Collapse
|
2
|
Kerr CH, Skinnider MA, Andrews DDT, Madero AM, Chan QWT, Stacey RG, Stoynov N, Jan E, Foster LJ. Dynamic rewiring of the human interactome by interferon signaling. Genome Biol 2020; 21:140. [PMID: 32539747 PMCID: PMC7294662 DOI: 10.1186/s13059-020-02050-y] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2019] [Accepted: 05/20/2020] [Indexed: 12/12/2022] Open
Abstract
BACKGROUND The type I interferon (IFN) response is an ancient pathway that protects cells against viral pathogens by inducing the transcription of hundreds of IFN-stimulated genes. Comprehensive catalogs of IFN-stimulated genes have been established across species and cell types by transcriptomic and biochemical approaches, but their antiviral mechanisms remain incompletely characterized. Here, we apply a combination of quantitative proteomic approaches to describe the effects of IFN signaling on the human proteome, and apply protein correlation profiling to map IFN-induced rearrangements in the human protein-protein interaction network. RESULTS We identify > 26,000 protein interactions in IFN-stimulated and unstimulated cells, many of which involve proteins associated with human disease and are observed exclusively within the IFN-stimulated network. Differential network analysis reveals interaction rewiring across a surprisingly broad spectrum of cellular pathways in the antiviral response. We identify IFN-dependent protein-protein interactions mediating novel regulatory mechanisms at the transcriptional and translational levels, with one such interaction modulating the transcriptional activity of STAT1. Moreover, we reveal IFN-dependent changes in ribosomal composition that act to buffer IFN-stimulated gene protein synthesis. CONCLUSIONS Our map of the IFN interactome provides a global view of the complex cellular networks activated during the antiviral response, placing IFN-stimulated genes in a functional context, and serves as a framework to understand how these networks are dysregulated in autoimmune or inflammatory disease.
Collapse
Affiliation(s)
- Craig H Kerr
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
- Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada
- Current Address: Department of Genetics, Stanford University, Stanford, CA, 94305, USA
| | - Michael A Skinnider
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Daniel D T Andrews
- Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada
| | - Angel M Madero
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Queenie W T Chan
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - R Greg Stacey
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Nikolay Stoynov
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Eric Jan
- Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada
| | - Leonard J Foster
- Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada.
- Department of Biochemistry and Molecular Biology, University of British Columbia, Vancouver, BC, V6T 1Z3, Canada.
| |
Collapse
|
3
|
Ballouz S, Pavlidis P, Gillis J. Using predictive specificity to determine when gene set analysis is biologically meaningful. Nucleic Acids Res 2018; 45:e20. [PMID: 28204549 PMCID: PMC5389513 DOI: 10.1093/nar/gkw957] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2016] [Revised: 10/04/2016] [Accepted: 10/10/2016] [Indexed: 11/14/2022] Open
Abstract
Gene set analysis, which translates gene lists into enriched functions, is among the most common bioinformatic methods. Yet few would advocate taking the results at face value. Not only is there no agreement on the algorithms themselves, there is no agreement on how to benchmark them. In this paper, we evaluate the robustness and uniqueness of enrichment results as a means of assessing methods even where correctness is unknown. We show that heavily annotated (‘multifunctional’) genes are likely to appear in genomics study results and drive the generation of biologically non-specific enrichment results as well as highly fragile significances. By providing a means of determining where enrichment analyses report non-specific and non-robust findings, we are able to assess where we can be confident in their use. We find significant progress in recent bias correction methods for enrichment and provide our own software implementation. Our approach can be readily adapted to any pre-existing package.
Collapse
Affiliation(s)
- Sara Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
| | - Paul Pavlidis
- Department of Psychiatry and Michael Smith Laboratories, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | - Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
| |
Collapse
|
4
|
Ballouz S, Weber M, Pavlidis P, Gillis J. EGAD: ultra-fast functional analysis of gene networks. Bioinformatics 2018; 33:612-614. [PMID: 27993773 DOI: 10.1093/bioinformatics/btw695] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2016] [Accepted: 11/03/2016] [Indexed: 12/25/2022] Open
Abstract
Summary Evaluating gene networks with respect to known biology is a common task but often a computationally costly one. Many computational experiments are difficult to apply exhaustively in network analysis due to run-times. To permit high-throughput analysis of gene networks, we have implemented a set of very efficient tools to calculate functional properties in networks based on guilt-by-association methods. ( xtending ' uilt-by- ssociation' by egree) allows gene networks to be evaluated with respect to hundreds or thousands of gene sets. The methods predict novel members of gene groups, assess how well a gene network groups known sets of genes, and determines the degree to which generic predictions drive performance. By allowing fast evaluations, whether of random sets or real functional ones, provides the user with an assessment of performance which can easily be used in controlled evaluations across many parameters. Availability and Implementation The software package is freely available at https://github.com/sarbal/EGAD and implemented for use in R and Matlab. The package is also freely available under the LGPL license from the Bioconductor web site ( http://bioconductor.org ). Contact JGillis@cshl.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Sara Ballouz
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
| | - Melanie Weber
- Department of Mathematics and Computer Science, University of Leipzig, Leipzig, Germany
| | - Paul Pavlidis
- Department of Psychiatry and Michael Smith Laboratories, University of British Columbia, Vancouver, Canada
| | - Jesse Gillis
- Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Woodbury, NY 11797, USA
| |
Collapse
|
5
|
Amar D, Shamir R, Yekutieli D. Extracting replicable associations across multiple studies: Empirical Bayes algorithms for controlling the false discovery rate. PLoS Comput Biol 2017; 13:e1005700. [PMID: 28821015 PMCID: PMC5576761 DOI: 10.1371/journal.pcbi.1005700] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2017] [Revised: 08/30/2017] [Accepted: 07/24/2017] [Indexed: 12/03/2022] Open
Abstract
In almost every field in genomics, large-scale biomedical datasets are used to report associations. Extracting associations that recur across multiple studies while controlling the false discovery rate is a fundamental challenge. Here, we propose a new method to allow joint analysis of multiple studies. Given a set of p-values obtained from each study, the goal is to identify associations that recur in at least k > 1 studies while controlling the false discovery rate. We propose several new algorithms that differ in how the study dependencies are modeled, and compare them and extant methods under various simulated scenarios. The top algorithm, SCREEN (Scalable Cluster-based REplicability ENhancement), is our new algorithm that works in three stages: (1) clustering an estimated correlation network of the studies, (2) learning replicability (e.g., of genes) within clusters, and (3) merging the results across the clusters. When we applied SCREEN to two real datasets it greatly outperformed the results obtained via standard meta-analysis. First, on a collection of 29 case-control gene expression cancer studies, we detected a large set of consistently up-regulated genes related to proliferation and cell cycle regulation. These genes are both consistently up-regulated across many cancer studies, and are well connected in known gene networks. Second, on a recent pan-cancer study that examined the expression profiles of patients with and without mutations in the HLA complex, we detected a large active module of up-regulated genes that are both related to immune responses and are well connected in known gene networks. This module covers thrice more genes as compared to the original study at a similar false discovery rate, demonstrating the high power of SCREEN. An implementation of SCREEN is available in the supplement. When analyzing results from multiple studies, extracting replicated associations is the first step towards making new discoveries. The standard approach for this task is to use meta-analysis methods, which usually make an underlying null hypothesis that a gene has no effect in all studies. On the other hand, in replicability analysis we explicitly require that the gene will manifest a recurring pattern of effects. In this study we develop new algorithms for replicability analysis that are both scalable (i.e., can handle many studies) and allow controlling the false discovery rate. We show that our main algorithm called SCREEN (Scalable Cluster-based REplicability ENhancement) outperforms the other methods in simulated scenarios. Moreover, when applied to real datasets, SCREEN greatly extended the results of the meta-analysis, and can even facilitate detection of new biological results.
Collapse
Affiliation(s)
- David Amar
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
| | - Ron Shamir
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
- * E-mail:
| | - Daniel Yekutieli
- Department of Statistics and OR, Tel Aviv University, Tel Aviv, Israel
| |
Collapse
|
6
|
Ballouz S, Gillis J. Strength of functional signature correlates with effect size in autism. Genome Med 2017; 9:64. [PMID: 28687074 PMCID: PMC5501949 DOI: 10.1186/s13073-017-0455-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/20/2017] [Accepted: 06/23/2017] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND Disagreements over genetic signatures associated with disease have been particularly prominent in the field of psychiatric genetics, creating a sharp divide between disease burdens attributed to common and rare variation, with study designs independently targeting each. Meta-analysis within each of these study designs is routine, whether using raw data or summary statistics, but combining results across study designs is atypical. However, tests of functional convergence are used across all study designs, where candidate gene sets are assessed for overlaps with previously known properties. This suggests one possible avenue for combining not study data, but the functional conclusions that they reach. METHOD In this work, we test for functional convergence in autism spectrum disorder (ASD) across different study types, and specifically whether the degree to which a gene is implicated in autism is correlated with the degree to which it drives functional convergence. Because different study designs are distinguishable by their differences in effect size, this also provides a unified means of incorporating the impact of study design into the analysis of convergence. RESULTS We detected remarkably significant positive trends in aggregate (p < 2.2e-16) with 14 individually significant properties (false discovery rate <0.01), many in areas researchers have targeted based on different reasoning, such as the fragile X mental retardation protein (FMRP) interactor enrichment (false discovery rate 0.003). We are also able to detect novel technical effects and we see that network enrichment from protein-protein interaction data is heavily confounded with study design, arising readily in control data. CONCLUSIONS We see a convergent functional signal for a subset of known and novel functions in ASD from all sources of genetic variation. Meta-analytic approaches explicitly accounting for different study designs can be adapted to other diseases to discover novel functional associations and increase statistical power.
Collapse
Affiliation(s)
- Sara Ballouz
- The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA
| | - Jesse Gillis
- The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724 USA
| |
Collapse
|
7
|
Lazzarini N, Widera P, Williamson S, Heer R, Krasnogor N, Bacardit J. Functional networks inference from rule-based machine learning models. BioData Min 2016; 9:28. [PMID: 27597880 PMCID: PMC5011349 DOI: 10.1186/s13040-016-0106-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2016] [Accepted: 08/11/2016] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND Functional networks play an important role in the analysis of biological processes and systems. The inference of these networks from high-throughput (-omics) data is an area of intense research. So far, the similarity-based inference paradigm (e.g. gene co-expression) has been the most popular approach. It assumes a functional relationship between genes which are expressed at similar levels across different samples. An alternative to this paradigm is the inference of relationships from the structure of machine learning models. These models are able to capture complex relationships between variables, that often are different/complementary to the similarity-based methods. RESULTS We propose a protocol to infer functional networks from machine learning models, called FuNeL. It assumes, that genes used together within a rule-based machine learning model to classify the samples, might also be functionally related at a biological level. The protocol is first tested on synthetic datasets and then evaluated on a test suite of 8 real-world datasets related to human cancer. The networks inferred from the real-world data are compared against gene co-expression networks of equal size, generated with 3 different methods. The comparison is performed from two different points of view. We analyse the enriched biological terms in the set of network nodes and the relationships between known disease-associated genes in a context of the network topology. The comparison confirms both the biological relevance and the complementary character of the knowledge captured by the FuNeL networks in relation to similarity-based methods and demonstrates its potential to identify known disease associations as core elements of the network. Finally, using a prostate cancer dataset as a case study, we confirm that the biological knowledge captured by our method is relevant to the disease and consistent with the specialised literature and with an independent dataset not used in the inference process. AVAILABILITY The implementation of our network inference protocol is available at: http://ico2s.org/software/funel.html.
Collapse
Affiliation(s)
- Nicola Lazzarini
- Interdisciplinary Computing and Complex BioSystems (ICOS) research group, School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
| | - Paweł, Widera
- Interdisciplinary Computing and Complex BioSystems (ICOS) research group, School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
| | - Stuart Williamson
- Clinical and Experimental Pharmacology Group, Cancer Research UK Manchester Institute, University of Manchester, Manchester, UK
| | - Rakesh Heer
- Northern Institute for Cancer Research, Medical School, Newcastle University, Newcastle upon Tyne, UK
| | - Natalio Krasnogor
- Interdisciplinary Computing and Complex BioSystems (ICOS) research group, School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
| | - Jaume Bacardit
- Interdisciplinary Computing and Complex BioSystems (ICOS) research group, School of Computing Science, Newcastle University, Newcastle upon Tyne, UK
| |
Collapse
|
8
|
O’Meara MJ, Ballouz S, Shoichet BK, Gillis J. Ligand Similarity Complements Sequence, Physical Interaction, and Co-Expression for Gene Function Prediction. PLoS One 2016; 11:e0160098. [PMID: 27467773 PMCID: PMC4965129 DOI: 10.1371/journal.pone.0160098] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2016] [Accepted: 07/13/2016] [Indexed: 12/13/2022] Open
Abstract
The expansion of protein-ligand annotation databases has enabled large-scale networking of proteins by ligand similarity. These ligand-based protein networks, which implicitly predict the ability of neighboring proteins to bind related ligands, may complement biologically-oriented gene networks, which are used to predict functional or disease relevance. To quantify the degree to which such ligand-based protein associations might complement functional genomic associations, including sequence similarity, physical protein-protein interactions, co-expression, and disease gene annotations, we calculated a network based on the Similarity Ensemble Approach (SEA: sea.docking.org), where protein neighbors reflect the similarity of their ligands. We also measured the similarity with functional genomic networks over a common set of 1,131 genes, and found that the networks had only small overlaps, which were significant only due to the large scale of the data. Consistent with the view that the networks contain different information, combining them substantially improved Molecular Function prediction within GO (from AUROC~0.63–0.75 for the individual data modalities to AUROC~0.8 in the aggregate). We investigated the boost in guilt-by-association gene function prediction when the networks are combined and describe underlying properties that can be further exploited.
Collapse
Affiliation(s)
- Matthew J. O’Meara
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, 94158–2550, United States of America
| | - Sara Ballouz
- Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, 500 Sunnyside Boulevard, Woodbury, NY, 11797, United States of America
| | - Brian K. Shoichet
- Department of Pharmaceutical Chemistry, University of California San Francisco, San Francisco, California, 94158–2550, United States of America
- * E-mail: (BKS); (JG)
| | - Jesse Gillis
- Cold Spring Harbor Laboratory, Stanley Institute for Cognitive Genomics, 500 Sunnyside Boulevard, Woodbury, NY, 11797, United States of America
- * E-mail: (BKS); (JG)
| |
Collapse
|