1
|
Mbatchou J, McPeek MS. JASPER: Fast, powerful, multitrait association testing in structured samples gives insight on pleiotropy in gene expression. Am J Hum Genet 2024; 111:1750-1769. [PMID: 39025064 PMCID: PMC11339629 DOI: 10.1016/j.ajhg.2024.06.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2023] [Revised: 06/19/2024] [Accepted: 06/20/2024] [Indexed: 07/20/2024] Open
Abstract
Joint association analysis of multiple traits with multiple genetic variants can provide insight into genetic architecture and pleiotropy, improve trait prediction, and increase power for detecting association. Furthermore, some traits are naturally high-dimensional, e.g., images, networks, or longitudinally measured traits. Assessing significance for multitrait genetic association can be challenging, especially when the sample has population sub-structure and/or related individuals. Failure to adequately adjust for sample structure can lead to power loss and inflated type 1 error, and commonly used methods for assessing significance can work poorly with a large number of traits or be computationally slow. We developed JASPER, a fast, powerful, robust method for assessing significance of multitrait association with a set of genetic variants, in samples that have population sub-structure, admixture, and/or relatedness. In simulations, JASPER has higher power, better type 1 error control, and faster computation than existing methods, with the power and speed advantage of JASPER increasing with the number of traits. JASPER is potentially applicable to a wide range of association testing applications, including for multiple disease traits, expression traits, image-derived traits, and microbiome abundances. It allows for covariates, ascertainment, and rare variants and is robust to phenotype model misspecification. We apply JASPER to analyze gene expression in the Framingham Heart Study, where, compared to alternative approaches, JASPER finds more significant associations, including several that indicate pleiotropic effects, most of which replicate previous results, while others have not previously been reported. Our results demonstrate the promise of JASPER for powerful multitrait analysis in structured samples.
Affiliation(s)
- Joelle Mbatchou
- Regeneron Genetics Center, Tarrytown, NY 10591, USA; Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
- Mary Sara McPeek
- Department of Statistics, The University of Chicago, Chicago, IL 60637, USA; Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA
|
2
|
Mbatchou J, McPeek MS. JASPER: fast, powerful, multitrait association testing in structured samples gives insight on pleiotropy in gene expression. bioRxiv 2023:2023.12.18.571948. [PMID: 38187553 PMCID: PMC10769254 DOI: 10.1101/2023.12.18.571948] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Joint association analysis of multiple traits with multiple genetic variants can provide insight into genetic architecture and pleiotropy, improve trait prediction and increase power for detecting association. Furthermore, some traits are naturally high-dimensional, e.g., images, networks or longitudinally measured traits. Assessing significance for multitrait genetic association can be challenging, especially when the sample has population sub-structure and/or related individuals. Failure to adequately adjust for sample structure can lead to power loss and inflated type 1 error, and commonly used methods for assessing significance can work poorly with a large number of traits or be computationally slow. We developed JASPER, a fast, powerful, robust method for assessing significance of multitrait association with a set of genetic variants, in samples that have population sub-structure, admixture and/or relatedness. In simulations, JASPER has higher power, better type 1 error control, and faster computation than existing methods, with the power and speed advantage of JASPER increasing with the number of traits. JASPER is potentially applicable to a wide range of association testing applications, including for multiple disease traits, expression traits, image-derived traits and microbiome abundances. It allows for covariates, ascertainment and rare variants and is robust to phenotype model misspecification. We apply JASPER to analyze gene expression in the Framingham Heart Study, where, compared to alternative approaches, JASPER finds more significant associations, including several that indicate pleiotropic effects, some of which replicate previous results, while others have not previously been reported. Our results demonstrate the promise of JASPER for powerful multitrait analysis in structured samples.
Affiliation(s)
- Joelle Mbatchou
- Regeneron Genetics Center, Tarrytown, NY 10591, USA
- Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
- Mary Sara McPeek
- Department of Statistics, The University of Chicago, Chicago, IL 60637, USA
- Department of Human Genetics, The University of Chicago, Chicago, IL 60637, USA
|
3
|
Muñoz A, Martos G, Gonzalez J. Level Sets Semimetrics for Probability Measures with Applications in Hypothesis Testing. Methodol Comput Appl Probab 2023. [DOI: 10.1007/s11009-023-09990-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/13/2023]
|
4
|
Olshen AB, Segal MR. Does multi-way, long-range chromatin contact data advance 3D genome reconstruction? BMC Bioinformatics 2023; 24:64. [PMID: 36829114 PMCID: PMC9951495 DOI: 10.1186/s12859-023-05170-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2022] [Accepted: 02/02/2023] [Indexed: 02/26/2023] Open
Abstract
BACKGROUND Methods for inferring the three-dimensional (3D) configuration of chromatin from conformation capture assays that provide strictly pairwise interactions, notably Hi-C, utilize the attendant contact matrix as input. More recent assays, in particular split-pool recognition of interactions by tag extension (SPRITE), capture multi-way interactions instead of solely pairwise contacts. These assays yield contacts that straddle appreciably greater genomic distances than Hi-C, in addition to instances of exceptionally high-order chromatin interaction. Such attributes are anticipated to be consequential with respect to 3D genome reconstruction, a task yet to be undertaken with multi-way contact data. However, performing such 3D reconstruction using distance-based reconstruction techniques requires framing multi-way contacts as (pairwise) distances. Comparing approaches for so doing, and assessing the resultant impact of long-range and multi-way contacts, are the objectives of this study. RESULTS We obtained 3D reconstructions via multi-dimensional scaling under a variety of weighting schemes for mapping SPRITE multi-way contacts to pairwise distances. Resultant configurations were compared following Procrustes alignment and relationships were assessed between associated Procrustes root mean square errors and key features such as the extent of multi-way and/or long-range contacts. We found that these features had surprisingly limited influence on 3D reconstruction, a finding we attribute to their influence being diminished by the preponderance of pairwise contacts. CONCLUSION Distance-based 3D genome reconstruction using SPRITE multi-way contact data is not appreciably affected by the weighting scheme used to convert multi-way interactions to pairwise distances.
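The conversion step described in this abstract, framing multi-way contacts as pairwise quantities under different weighting schemes, can be illustrated with a minimal sketch. The abstract does not name the schemes that were compared, so the three weights below (`unweighted`, `two_over_n`, `one_over_pairs`), the function names, and the power-law map from contact counts to distances are all illustrative assumptions rather than the authors' choices.

```python
from collections import defaultdict
from itertools import combinations


def pairwise_contacts(clusters, weight):
    """Accumulate weighted pairwise contact counts from multi-way clusters.

    clusters: iterable of collections of locus identifiers observed together.
    weight:   function mapping the cluster size n to the weight of each pair.
    """
    counts = defaultdict(float)
    for cluster in clusters:
        loci = sorted(set(cluster))  # ignore repeated loci within a cluster
        n = len(loci)
        if n < 2:
            continue  # a singleton carries no contact information
        w = weight(n)
        for a, b in combinations(loci, 2):
            counts[(a, b)] += w
    return dict(counts)


def contacts_to_distances(counts, alpha=1.0):
    """Hypothetical power-law map from contact count c to distance c ** -alpha."""
    return {pair: c ** (-alpha) for pair, c in counts.items() if c > 0}


# Hypothetical weighting schemes, keyed by an illustrative name.
WEIGHTS = {
    "unweighted": lambda n: 1.0,                       # every pair counts fully
    "two_over_n": lambda n: 2.0 / n,                   # down-weight large clusters
    "one_over_pairs": lambda n: 2.0 / (n * (n - 1)),   # each cluster contributes 1 in total
}
```

The resulting dictionary of pairwise distances could then be arranged into a matrix and passed to a multi-dimensional scaling routine; comparing the outputs across entries of `WEIGHTS` mirrors the kind of sensitivity check the abstract describes.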
Affiliation(s)
- Adam B. Olshen
- Department of Epidemiology and Biostatistics and Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, CA USA
- Mark R. Segal
- Department of Epidemiology and Biostatistics, University of California, San Francisco, CA USA
|
5
|
Song H, Liu H, Wu MC. A fast kernel independence test for cluster-correlated data. Sci Rep 2022; 12:21659. [PMID: 36522522 PMCID: PMC9755291 DOI: 10.1038/s41598-022-26278-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Accepted: 12/13/2022] [Indexed: 12/23/2022] Open
Abstract
Cluster-correlated data receive considerable attention in biomedical and longitudinal studies, and it is of interest to assess general dependence between two multivariate variables under a cluster-correlated structure. The Hilbert-Schmidt independence criterion (HSIC) is a powerful kernel-based test statistic that captures various forms of dependence between two random vectors and can be applied to arbitrary non-Euclidean domains. However, the existing HSIC is not directly applicable to cluster-correlated data. Therefore, we propose an HSIC-based test of independence for cluster-correlated data. The new test statistic combines kernel information so that the dependence structure in each cluster is fully considered, and it exhibits good performance in high dimensions. Moreover, a rapid p-value approximation makes the new test quickly applicable to large datasets. Numerical studies show that the new approach performs well on both synthetic and real-world data.
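For orientation, the standard HSIC that this abstract builds on can be sketched as follows. This is the usual version for independent observations with a plain permutation p-value, not the cluster-aware statistic or the fast p-value approximation the authors propose; the Gaussian kernel, the fixed bandwidth, the 1/n² normalization, and all function names are assumptions made for illustration.

```python
import math
import random


def gaussian_gram(points, bandwidth):
    """Gram matrix of a Gaussian kernel for a list of equal-length vectors."""
    n = len(points)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            gram[i][j] = math.exp(-sq / (2.0 * bandwidth ** 2))
    return gram


def center(gram):
    """Double-center a Gram matrix: H K H with H = I - 11^T / n."""
    n = len(gram)
    row = [sum(r) / n for r in gram]
    col = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[gram[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def hsic(k_gram, l_gram):
    """Biased empirical HSIC, trace(K H L H) / n^2 (one common normalization)."""
    n = len(k_gram)
    kc, lc = center(k_gram), center(l_gram)
    return sum(kc[i][j] * lc[i][j] for i in range(n) for j in range(n)) / n ** 2


def permutation_pvalue(x, y, bandwidth=1.0, n_perm=200, seed=0):
    """Permutation p-value for HSIC, valid only for independent observations."""
    rng = random.Random(seed)
    k = gaussian_gram(x, bandwidth)
    l = gaussian_gram(y, bandwidth)
    observed = hsic(k, l)
    n = len(x)
    exceed = 0
    for _ in range(n_perm):
        perm = list(range(n))
        rng.shuffle(perm)
        # Permuting rows and columns of L together relabels the y observations.
        l_perm = [[l[perm[i]][perm[j]] for j in range(n)] for i in range(n)]
        if hsic(k, l_perm) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Shuffling individual observations, as `permutation_pvalue` does, ignores within-cluster correlation, which is one way to see why the abstract says the existing HSIC is not directly applicable to cluster-correlated data.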
Affiliation(s)
- Hoseung Song
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
- Hongjiao Liu
- Department of Biostatistics, University of Washington, Seattle, WA, 98195, USA
- Michael C Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, 98109, USA
|
6
|
Li M, Tyx RE, Rivera AJ, Zhao N, Satten GA. What Can We Learn about the Bias of Microbiome Studies from Analyzing Data from Mock Communities? Genes (Basel) 2022; 13:1758. [PMID: 36292643 PMCID: PMC9601962 DOI: 10.3390/genes13101758] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2022] [Revised: 09/19/2022] [Accepted: 09/21/2022] [Indexed: 11/17/2022] Open
Abstract
It is known that data from both 16S and shotgun metagenomics studies are subject to biases that cause the observed relative abundances of taxa to differ from their true values. Model community analyses, in which the relative abundances of all taxa in the sample are known by construction, seem to offer the hope that these biases can be measured. However, it is unclear whether the bias we measure in a mock community analysis is the same as the bias we measure in a sample in which taxa are spiked in at known relative abundance, or whether the bias we measure in spike-in samples is the same as the bias we would measure in a real (e.g., biological) sample. Here, we consider these questions in the context of 16S rRNA measurements on three sets of samples: the commercially available Zymo cells model community; the Zymo model community mixed with Swedish Snus, a smokeless tobacco product that is virtually bacteria-free; and a set of commercially available smokeless tobacco products. Each set of samples was subject to four different extraction protocols. The goal of our analysis is to determine whether the patterns of bias observed in each set of samples are the same, i.e., can we learn about the bias in the commercially available smokeless tobacco products by studying the Zymo cells model community?
Affiliation(s)
- Mo Li
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
- Robert E. Tyx
- Division of Laboratory Sciences, National Center for Environmental Health, Centers for Disease Control and Prevention, Atlanta, GA 30341, USA
- Angel J. Rivera
- Division of Laboratory Sciences, National Center for Environmental Health, Centers for Disease Control and Prevention, Atlanta, GA 30341, USA
- Ni Zhao
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
- Glen A. Satten
- Department of Gynecology and Obstetrics, School of Medicine, Emory University, Atlanta, GA 30322, USA
|
7
|
Segal MR. Can 3D diploid genome reconstruction from unphased Hi-C data be salvaged? NAR Genom Bioinform 2022; 4:lqac038. [PMID: 35571676 PMCID: PMC9097817 DOI: 10.1093/nargab/lqac038] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Revised: 03/31/2022] [Accepted: 04/29/2022] [Indexed: 11/13/2022] Open
Abstract
The three-dimensional (3D) configuration of chromatin impacts numerous cellular processes. However, directly observing chromatin architecture at high resolution is challenging. Accordingly, inferring 3D structure utilizing chromatin conformation capture assays, notably Hi-C, has received considerable attention, with a multitude of reconstruction algorithms advanced. While these have enhanced appreciation of chromatin organization, most suffer from a serious shortcoming when faced with diploid genomes: inability to disambiguate contacts between corresponding loci on homologous chromosomes, making attendant reconstructions potentially meaningless. Three recent proposals offer a computational way forward at the expense of strong assumptions. Here, we show that making plausible assumptions about the components of homologous chromosome contacts provides a basis for rescuing conventional consensus-based, unphased reconstruction. This would be consequential since not only are the assumptions needed for diploid reconstruction considerable, but the sophistication of select unphased algorithms affords substantive advantages with regard to resolution and folding complexity. Rather than presuming that the requisite salvaging assumptions are met, we exploit a recent imaging technology, in situ genome sequencing (IGS), to comprehensively evaluate their reasonableness. We analogously use IGS to assess assumptions underpinning diploid reconstruction algorithms. Results convincingly demonstrate that, in all instances, assumptions are not met, making further algorithm development, potentially informed by IGS data, essential.
Affiliation(s)
- Mark R Segal
- Department of Epidemiology and Biostatistics, University of California, 550 16th Street, San Francisco, CA 94143-0560, USA
|
8
|
|
9
|
Guo B, Wu B. Reader reaction on the fast small-sample kernel independence test for microbiome community-level association analysis. Biometrics 2017; 74:1120-1124. [PMID: 29192963 DOI: 10.1111/biom.12823] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/01/2017] [Revised: 08/01/2017] [Accepted: 09/01/2017] [Indexed: 01/11/2023]
Abstract
Zhan et al. (2017) presented a kernel RV coefficient (KRV) test to evaluate the overall association between host gene expression and microbiome composition, and showed its competitive performance compared to existing methods. In this article, we clarify the close relation of KRV to the existing generalized RV (GRV) coefficient, and show that KRV and GRV have very similar performance. Although the KRV test could control the type I error rate well at the 1% and 5% levels, we show that it could substantially underestimate p-values at small significance levels, leading to significantly inflated type I errors. As a partial remedy, we propose an alternative p-value calculation, which is efficient and more accurate than the KRV p-value at small significance levels. We recommend that small KRV test p-values always be accompanied and verified by the permutation p-value in practice. In addition, we analytically show that KRV can be written as a form of correlation coefficient, which can dramatically expedite its computation and make permutation p-value calculation more efficient.
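The closing remark, that KRV can be written as a form of correlation coefficient, can be illustrated with a small sketch. It assumes the usual definition of KRV as the normalized trace inner product of two double-centered kernel matrices. Because the entries of a double-centered matrix sum to zero, that ratio coincides with the Pearson correlation of the vectorized centered matrices. Whether this is exactly the form the authors derive is not stated in the abstract, and the function names and the linear kernel used here are illustrative assumptions.

```python
import math


def linear_gram(rows):
    """Linear-kernel Gram matrix X X^T for a list of feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in rows] for u in rows]


def center(gram):
    """Double-center a Gram matrix: H K H with H = I - 11^T / n."""
    n = len(gram)
    row = [sum(r) / n for r in gram]
    col = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[gram[i][j] - row[i] - col[j] + total for j in range(n)]
            for i in range(n)]


def frobenius_inner(a, b):
    """Sum of elementwise products, equal to trace(A B) for symmetric A and B."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def krv(k_gram, l_gram):
    """KRV as a normalized trace inner product of centered kernel matrices."""
    kc, lc = center(k_gram), center(l_gram)
    return frobenius_inner(kc, lc) / math.sqrt(
        frobenius_inner(kc, kc) * frobenius_inner(lc, lc))


def pearson(u, v):
    """Plain Pearson correlation of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


def krv_as_correlation(k_gram, l_gram):
    """Same quantity, computed as a correlation of vectorized centered matrices."""
    kc, lc = center(k_gram), center(l_gram)
    return pearson([x for r in kc for x in r], [x for r in lc for x in r])
```

In a permutation test only one of the two vectorized matrices needs to be reordered per replicate, so the centering can be done once up front, which is consistent with the efficiency gain the abstract mentions.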
Affiliation(s)
- Bin Guo
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A
- Baolin Wu
- Division of Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A
|
10
|
Xu Z, Xu G, Pan W, for the Alzheimer's Disease Neuroimaging Initiative. Adaptive testing for association between two random vectors in moderate to high dimensions. Genet Epidemiol 2017; 41:599-609. [PMID: 28714590 PMCID: PMC5643233 DOI: 10.1002/gepi.22059] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 02/27/2017] [Revised: 04/26/2017] [Accepted: 05/17/2017] [Indexed: 01/09/2023]
Abstract
Testing for association between two random vectors is a common and important task in many fields. However, existing tests, such as Escoufier's RV test, are suitable only for low-dimensional data, not for high-dimensional data. In moderate to high dimensions, it is necessary to consider sparse signals, which are often expected when only a few variables are associated with each other. We generalize the RV test to moderate-to-high dimensions. The key idea is to adaptively weight each variable pair based on its empirical association. As a consequence, the proposed test is adaptive, alleviating the effects of noise accumulation in high-dimensional data and thus maintaining power for both dense and sparse alternative hypotheses. We show the connections between the proposed test and several existing tests, such as a generalized estimating equations-based adaptive test, multivariate kernel machine regression (KMR), and kernel distance methods. Furthermore, we modify the proposed adaptive test so that it can be powerful for nonlinear or nonmonotonic associations. We use both real data and simulated data to demonstrate the advantages and usefulness of the proposed new test. The new test is freely available in R package aSPC on CRAN at https://cran.r-project.org/web/packages/aSPC/index.html and https://github.com/jasonzyx/aSPC.
Affiliation(s)
- Zhiyuan Xu
- Division of Biostatistics, University of Minnesota
- Gongjun Xu
- Department of Statistics, University of Michigan
- Wei Pan
- Division of Biostatistics, University of Minnesota
|
11
|
Powerful Genetic Association Analysis for Common or Rare Variants with High-Dimensional Structured Traits. Genetics 2017. [PMID: 28642271 DOI: 10.1534/genetics.116.199646] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Indexed: 01/03/2023] Open
Abstract
Many genetic association studies collect a wide range of complex traits. As these traits may be correlated and share a common genetic mechanism, joint analysis can be statistically more powerful and biologically more meaningful. However, most existing tests for multiple traits cannot be used for high-dimensional and possibly structured traits, such as network-structured transcriptomic pathway expressions. To overcome potential limitations, in this article we propose the dual kernel-based association test (DKAT) for testing the association between multiple traits and multiple genetic variants, both common and rare. In DKAT, two individual kernels are used to describe the phenotypic and genotypic similarity, respectively, between pairwise subjects. Using kernels allows for capturing structure while accommodating dimensionality. Then, the association between traits and genetic variants is summarized by a coefficient which measures the association between two kernel matrices. Finally, DKAT evaluates the hypothesis of nonassociation with an analytical P-value calculation without any computationally expensive resampling procedures. By collapsing information in both traits and genetic variants using kernels, the proposed DKAT is shown to have a correct type-I error rate and higher power than other existing methods in both simulation studies and application to a study of genetic regulation of pathway gene expressions.
|
12
|
Zhan X, Plantinga A, Zhao N, Wu MC. A fast small-sample kernel independence test for microbiome community-level association analysis. Biometrics 2017; 73:1453-1463. [PMID: 28295177 DOI: 10.1111/biom.12684] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 04/01/2016] [Revised: 02/01/2017] [Accepted: 02/01/2017] [Indexed: 12/13/2022]
Abstract
To fully understand the role of microbiome in human health and diseases, researchers are increasingly interested in assessing the relationship between microbiome composition and host genomic data. The dimensionality of the data as well as complex relationships between microbiota and host genomics pose considerable challenges for analysis. In this article, we apply a kernel RV coefficient (KRV) test to evaluate the overall association between host gene expression and microbiome composition. The KRV statistic can capture nonlinear correlations and complex relationships among the individual data types and between gene expression and microbiome composition through measuring general dependency. Testing proceeds via a similar route as existing tests of the generalized RV coefficients and allows for rapid p-value calculation. Strategies to allow adjustment for confounding effects, which is crucial for avoiding misleading results, and to alleviate the problem of selecting the most favorable kernel are considered. Simulation studies show that KRV is useful in testing statistical independence with finite samples given the kernels are appropriately chosen, and can powerfully identify existing associations between microbiome composition and host genomic data while protecting type I error. We apply the KRV to a microbiome study examining the relationship between host transcriptome and microbiome composition within the context of inflammatory bowel disease and are able to derive new biological insights and provide formal inference on prior qualitative observations.
Affiliation(s)
- Xiang Zhan
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A
- Anna Plantinga
- Department of Biostatistics, University of Washington, Seattle, Washington 98195, U.S.A
- Ni Zhao
- Department of Biostatistics, Johns Hopkins University, Baltimore, Maryland 21205, U.S.A
- Michael C Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 98109, U.S.A
|
13
|
Abstract
Simple correlation coefficients between two variables have been generalized to measure association between two matrices in many ways. Coefficients such as the RV coefficient, the distance covariance (dCov) coefficient and kernel-based coefficients are being used by different research communities. Scientists use these coefficients to test whether two random vectors are linked. Once testing has established that such an association exists, a next step, often ignored, is to explore and uncover the association's underlying patterns. This article provides a survey of various measures of dependence between random vectors and tests of independence and emphasizes the connections and differences between the various approaches. After providing definitions of the coefficients and associated tests, we present the recent improvements that enhance their statistical properties and ease of interpretation. We summarize multi-table approaches and provide scenarios where the indices can provide useful summaries of heterogeneous multi-block data. We illustrate these different strategies on several examples of real data and suggest directions for future research.
Affiliation(s)
- Julie Josse
- Department of Statistics, Agrocampus Ouest - INRIA, Saclay Paris Sud University, France
- Susan Holmes
- Department of Statistics, Stanford University, California, USA
|
14
|
Wang Z, Curry E, Montana G. Network-guided regression for detecting associations between DNA methylation and gene expression. Bioinformatics 2014; 30:2693-701. [PMID: 24919878 DOI: 10.1093/bioinformatics/btu361] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
MOTIVATION High-throughput profiling in biological research has resulted in the availability of a wealth of data cataloguing the genetic, epigenetic and transcriptional states of cells. These data could yield discoveries that may lead to breakthroughs in the diagnosis and treatment of human disease, but require statistical methods designed to find the most relevant patterns from millions of potential interactions. Aberrant DNA methylation is often a feature of cancer, and has been proposed as a therapeutic target. However, the relationship between DNA methylation and gene expression remains poorly understood. RESULTS We propose Network-sparse Reduced-Rank Regression (NsRRR), a multivariate regression framework capable of using prior biological knowledge expressed as gene interaction networks to guide the search for associations between gene expression and DNA methylation signatures. We use simulations to show the advantage of our proposed model in terms of variable selection accuracy over alternative models that do not use prior network information. We discuss an application of NsRRR to The Cancer Genome Atlas datasets on primary ovarian tumours. AVAILABILITY AND IMPLEMENTATION R code implementing the NsRRR model is available at http://www2.imperial.ac.uk/∼gmontana CONTACT giovanni.montana@kcl.ac.uk SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zi Wang
- Department of Mathematics, Imperial College London, London SW7 2AZ, Division of Cancer, Imperial College London, Hammersmith Hospital, London W12 0NN and Department of Biomedical Engineering, King's College London, St Thomas' Hospital, London SE1 7EH, UK
- Edward Curry
- Department of Mathematics, Imperial College London, London SW7 2AZ, Division of Cancer, Imperial College London, Hammersmith Hospital, London W12 0NN and Department of Biomedical Engineering, King's College London, St Thomas' Hospital, London SE1 7EH, UK
- Giovanni Montana
- Department of Mathematics, Imperial College London, London SW7 2AZ, Division of Cancer, Imperial College London, Hammersmith Hospital, London W12 0NN and Department of Biomedical Engineering, King's College London, St Thomas' Hospital, London SE1 7EH, UK
|
15
|
Segal MR, Xiong H, Capurso D, Vazquez M, Arsuaga J. Reproducibility of 3D chromatin configuration reconstructions. Biostatistics 2014; 15:442-56. [PMID: 24519450 DOI: 10.1093/biostatistics/kxu003] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Indexed: 11/14/2022] Open
Abstract
It is widely recognized that the three-dimensional (3D) architecture of eukaryotic chromatin plays an important role in processes such as gene regulation and cancer-driving gene fusions. Observing or inferring this 3D structure at even modest resolutions had been problematic, since genomes are highly condensed and traditional assays are coarse. However, recently devised high-throughput molecular techniques have changed this situation. Notably, the development of a suite of chromatin conformation capture (CCC) assays has enabled elicitation of contacts (spatially close chromosomal loci), which have provided insights into chromatin architecture. Most analysis of CCC data has focused on the contact level, with less effort directed toward obtaining 3D reconstructions and evaluating the accuracy and reproducibility thereof. While questions of accuracy must be addressed experimentally, questions of reproducibility can be addressed statistically, which is the purpose of this paper. We use a constrained optimization technique to reconstruct chromatin configurations for a number of closely related yeast datasets and assess reproducibility using four metrics that measure the distance between 3D configurations. The first of these, Procrustes fitting, measures configuration closeness after applying reflection, rotation, translation, and scaling-based alignment of the structures. The others base comparisons on the within-configuration inter-point distance matrix. Inferential results for these metrics rely on suitable permutation approaches. Results indicate that distance matrix-based approaches are preferable to Procrustes analysis, not because of the metrics per se but rather on account of the ability to customize permutation schemes to handle within-chromosome contiguity. It has recently been emphasized that constrained optimization approaches to 3D architecture reconstruction are prone to being trapped in local minima. Our methods of reproducibility assessment provide a means for comparing 3D reconstruction solutions so that we can discern between local and global optima by contrasting solutions under perturbed inputs.
Affiliation(s)
- Mark R Segal
- Department of Epidemiology and Biostatistics, Center for Bioinformatics and Molecular Biostatistics, University of California, San Francisco, CA 94143, USA; Department of Mathematics, San Francisco State University, San Francisco, CA 94132, USA
- Hao Xiong
- Department of Epidemiology and Biostatistics, Center for Bioinformatics and Molecular Biostatistics, University of California, San Francisco, CA 94143, USA; Department of Mathematics, San Francisco State University, San Francisco, CA 94132, USA
- Daniel Capurso
- Department of Epidemiology and Biostatistics, Center for Bioinformatics and Molecular Biostatistics, University of California, San Francisco, CA 94143, USA; Department of Mathematics, San Francisco State University, San Francisco, CA 94132, USA
- Mariel Vazquez
- Department of Epidemiology and Biostatistics, Center for Bioinformatics and Molecular Biostatistics, University of California, San Francisco, CA 94143, USA; Department of Mathematics, San Francisco State University, San Francisco, CA 94132, USA
- Javier Arsuaga
- Department of Epidemiology and Biostatistics, Center for Bioinformatics and Molecular Biostatistics, University of California, San Francisco, CA 94143, USA; Department of Mathematics, San Francisco State University, San Francisco, CA 94132, USA
|