1. Stone K, Platig J, Quackenbush J, Fagny M. The Importance of Regulatory Network Structure for Complex Trait Heritability and Evolution. bioRxiv 2024:2024.02.27.582063. [PMID: 38464142 PMCID: PMC10925220 DOI: 10.1101/2024.02.27.582063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Complex traits are determined by many loci (mostly regulatory elements) that, through combinatorial interactions, can affect multiple traits. Such high levels of epistasis and pleiotropy have been proposed in the omnigenic model and may explain why such a large part of complex trait heritability is usually missed by genome-wide association studies while raising questions about the possibility for such traits to evolve in response to environmental constraints. To explore the molecular bases of complex traits and understand how they can adapt, we systematically analyzed the distribution of SNP heritability for ten traits across 29 tissue-specific Expression Quantitative Trait Locus (eQTL) networks. We find that heritability is clustered in a small number of tissue-specific, functionally relevant SNP-gene modules and that the greatest heritability occurs in local "hubs" that are both the cornerstone of the network's modules and tissue-specific regulatory elements. The network structure could thus both amplify the genotype-phenotype connection and buffer the deleterious effect of the genetic variations on other traits. We confirm that this structure has allowed complex traits to evolve in response to environmental constraints, with the local "hubs" being the preferential targets of past and ongoing directional selection. Together, these results provide a conceptual framework for understanding complex trait architecture and evolution.
Affiliation(s)
- Katherine Stone
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Department of Data Science and Center for Cancer Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
- John Platig
  - Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia, USA
  - Department of Public Health Sciences, University of Virginia, Charlottesville, Virginia, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, Virginia, USA
- John Quackenbush
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Department of Data Science and Center for Cancer Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
  - Channing Division of Network Medicine, Brigham and Women's Hospital, Boston, Massachusetts, United States
- Maud Fagny
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, Massachusetts, USA
  - Department of Data Science and Center for Cancer Computational Biology, Dana-Farber Cancer Institute, Boston, Massachusetts, USA
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, Genetique Quantitative et Evolution - Le Moulon, Gif-sur-Yvette 91190 France
2. Rudra P, Zhou YH, Nobel A, Wright FA. Control of false discoveries in grouped hypothesis testing for eQTL data. BMC Bioinformatics 2024; 25:147. [PMID: 38605284 PMCID: PMC11007981 DOI: 10.1186/s12859-024-05736-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Accepted: 03/08/2024] [Indexed: 04/13/2024] Open
Abstract
BACKGROUND Expression quantitative trait locus (eQTL) analysis aims to detect the genetic variants that influence the expression of one or more genes. Gene-level eQTL testing forms a natural grouped-hypothesis testing strategy with clear biological importance. Methods to control family-wise error rate or false discovery rate for group testing have been proposed earlier, but may not be powerful or easily applicable to eQTL data, for which certain structured alternatives may be defensible and may enable the researcher to avoid overly conservative approaches. RESULTS In an empirical Bayesian setting, we propose a new method to control the false discovery rate (FDR) for grouped hypotheses. Here, each gene forms a group, with SNPs annotated to the gene corresponding to individual hypotheses. The heterogeneity of effect sizes in different groups is considered by the introduction of a random effects component. Our method, entitled Random Effects model and testing procedure for Group-level FDR control (REG-FDR), assumes a model for alternative hypotheses for the eQTL data and controls the FDR by adaptive thresholding. As a convenient alternate approach, we also propose Z-REG-FDR, an approximate version of REG-FDR, that uses only Z-statistics of association between genotype and expression for each gene-SNP pair. The performance of Z-REG-FDR is evaluated using both simulated and real data. Simulations demonstrate that Z-REG-FDR performs similarly to REG-FDR, but with much improved computational speed. CONCLUSION Our results demonstrate that the Z-REG-FDR method performs favorably compared to other methods in terms of statistical power and control of FDR. It can be of great practical use for grouped hypothesis testing for eQTL analysis or similar problems in statistical genomics due to its fast computation and ability to be fit using only summary data.
Affiliation(s)
- Pratyaydipta Rudra
  - Department of Statistics, Oklahoma State University, Stillwater, OK, USA.
- Yi-Hui Zhou
  - Bioinformatics Research Center, Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, NC, USA
- Andrew Nobel
  - Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, NC, USA
- Fred A Wright
  - Bioinformatics Research Center, Departments of Statistics and Biological Sciences, North Carolina State University, Raleigh, NC, USA.
3. Keller MP, Hudkins KL, Shalev A, Bhatnagar S, Kebede MA, Merrins MJ, Davis DB, Alpers CE, Kimple ME, Attie AD. What the BTBR/J mouse has taught us about diabetes and diabetic complications. iScience 2023; 26:107036. [PMID: 37360692 PMCID: PMC10285641 DOI: 10.1016/j.isci.2023.107036] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 06/28/2023] Open
Abstract
Human and mouse genetics have delivered numerous diabetogenic loci, but it is mainly through the use of animal models that the pathophysiological basis for their contribution to diabetes has been investigated. More than 20 years ago, we serendipitously identified a mouse strain that could serve as a model of obesity-prone type 2 diabetes, the BTBR (Black and Tan Brachyury) mouse (BTBR T+ Itpr3tf/J, 2018) carrying the Lepob mutation. We went on to discover that the BTBR-Lepob mouse is an excellent model of diabetic nephropathy and is now widely used by nephrologists in academia and the pharmaceutical industry. In this review, we describe the motivation for developing this animal model, the many genes identified and the insights about diabetes and diabetes complications derived from >100 studies conducted in this remarkable animal model.
Affiliation(s)
- Mark P. Keller
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
- Kelly L. Hudkins
  - Department of Pathology, University of Washington Medical Center, Seattle, WA 98195, USA
- Anath Shalev
  - Department of Medicine, Division of Endocrinology, Diabetes, and Metabolism, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Sushant Bhatnagar
  - Department of Medicine, Division of Endocrinology, Diabetes, and Metabolism, University of Alabama at Birmingham, Birmingham, AL 35294, USA
- Melkam A. Kebede
  - School of Medical Sciences, Faculty of Medicine and Health, Charles Perkins Centre, University of Sydney, Camperdown, Sydney, NSW 2006, Australia
- Matthew J. Merrins
  - Department of Medicine, Division of Endocrinology, Diabetes, and Metabolism, University of Wisconsin School of Medicine and Public Health, Madison, WI 53705, USA
  - William S. Middleton Memorial Veterans Hospital, Madison, WI 53705, USA
- Dawn Belt Davis
  - Department of Medicine, Division of Endocrinology, Diabetes, and Metabolism, University of Wisconsin School of Medicine and Public Health, Madison, WI 53705, USA
  - William S. Middleton Memorial Veterans Hospital, Madison, WI 53705, USA
- Charles E. Alpers
  - Department of Pathology, University of Washington Medical Center, Seattle, WA 98195, USA
- Michelle E. Kimple
  - Department of Medicine, Division of Endocrinology, Diabetes, and Metabolism, University of Wisconsin School of Medicine and Public Health, Madison, WI 53705, USA
  - William S. Middleton Memorial Veterans Hospital, Madison, WI 53705, USA
- Alan D. Attie
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
  - Department of Medicine, Division of Endocrinology, Diabetes, and Metabolism, University of Wisconsin School of Medicine and Public Health, Madison, WI 53705, USA
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI 53706, USA
4. Marrella MA, Biase FH. Robust identification of regulatory variants (eQTLs) using a differential expression framework developed for RNA-sequencing. J Anim Sci Biotechnol 2023; 14:62. [PMID: 37143150 PMCID: PMC10161580 DOI: 10.1186/s40104-023-00861-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2022] [Accepted: 03/05/2023] [Indexed: 05/06/2023] Open
Abstract
BACKGROUND A gap currently exists between genetic variants and the underlying cell and tissue biology of a trait, and expression quantitative trait loci (eQTL) studies provide important information to help close that gap. However, two concerns that arise with eQTL analyses using RNA-sequencing data are normalization of data across samples and the data not following a normal distribution. Multiple pipelines have been suggested to address this. For instance, the most recent analysis of the human and farm Genotype-Tissue Expression (GTEx) project proposes using trimmed means of M-values (TMM) to normalize the data followed by an inverse normal transformation. RESULTS In this study, we reasoned that eQTL analysis could be carried out using the same framework used for differential gene expression (DGE), which uses a negative binomial model, a statistical test feasible for count data. Using the GTEx framework, we identified 35 significant eQTLs (P < 5 × 10⁻⁸) following the ANOVA model and 39 significant eQTLs (P < 5 × 10⁻⁸) following the additive model. Using a differential gene expression framework, we identified 930 and six significant eQTLs (P < 5 × 10⁻⁸) following an analytical framework equivalent to the ANOVA and additive model, respectively. When we compared the two approaches, there was no overlap of significant eQTLs between the two frameworks. Because we defined specific contrasts, we identified trans eQTLs that more closely resembled what we expect from genetic variants showing complete dominance between alleles. Yet, these were not identified by the GTEx framework. CONCLUSIONS Our results show that transforming RNA-sequencing data to fit a normal distribution prior to eQTL analysis is not required when the DGE framework is employed. Our proposed approach detected biologically relevant variants that otherwise would not have been identified due to data transformation to fit a normal distribution.
Affiliation(s)
- Mackenzie A Marrella
  - School of Animal Sciences, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA
- Fernando H Biase
  - School of Animal Sciences, Virginia Polytechnic Institute and State University, Blacksburg, VA, USA.
5. Brown M, Greenwood E, Zeng B, Powell JE, Gibson G. Effect of all-but-one conditional analysis for eQTL isolation in peripheral blood. Genetics 2023; 223:iyac162. [PMID: 36321965 PMCID: PMC9836021 DOI: 10.1093/genetics/iyac162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2022] [Accepted: 10/13/2022] [Indexed: 11/13/2022] Open
Abstract
Expression quantitative trait locus detection has become increasingly important for understanding how noncoding variants contribute to disease susceptibility and complex traits. The major challenges in expression quantitative trait locus fine-mapping and causal variant discovery relate to the impact of linkage disequilibrium on signals due to one or multiple functional variants that lie within a credible set. We perform expression quantitative trait locus fine-mapping using the all-but-one approach, conditioning each signal on all others detected in an interval, on the Consortium for the Architecture of Gene Expression cohorts of microarray-based peripheral blood gene expression in 2,138 European-ancestry human adults. We contrast these results with traditional forward stepwise conditional analysis and a Bayesian localization method. All-but-one conditioning significantly modifies effect-size estimates for 51% of 2,351 expression quantitative trait locus peaks, but only modestly affects credible set size and location. On the other hand, both conditioning approaches result in unexpectedly low overlap with Bayesian credible sets, with just 57% peak concordance and between 50% and 70% SNP sharing, leading us to caution against the assumption that any one localization method is superior to another. We also cross reference our results with ATAC-seq data, cell-type-specific expression quantitative trait locus, and activity-by-contact-enhancers, leading to the proposal of a 5-tier approach to further reduce credible set sizes and prioritize likely causal variants for all known inflammatory bowel disease risk loci active in immune cells.
Affiliation(s)
- Margaret Brown
  - Center for Integrative Genomics, School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Emily Greenwood
  - Center for Integrative Genomics, School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA
- Biao Zeng
  - Present address for Biao Zeng: Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA
- Joseph E Powell
  - Present address for Joseph E Powell: Garvan-Weizmann Center for Cellular Genomics, Sydney, NSW 2010, Australia
- Greg Gibson
  - Center for Integrative Genomics, School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, USA
6. Fitzgerald T, Jones A, Engelhardt BE. A Poisson reduced-rank regression model for association mapping in sequencing data. BMC Bioinformatics 2022; 23:529. [PMID: 36482321 PMCID: PMC9733401 DOI: 10.1186/s12859-022-05054-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Accepted: 11/14/2022] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Single-cell RNA-sequencing (scRNA-seq) technologies allow for the study of gene expression in individual cells. Often, it is of interest to understand how transcriptional activity is associated with cell-specific covariates, such as cell type, genotype, or measures of cell health. Traditional approaches for this type of association mapping assume independence between the outcome variables (or genes), and perform a separate regression for each. However, these methods are computationally costly and ignore the substantial correlation structure of gene expression. Furthermore, count-based scRNA-seq data pose challenges for traditional models based on Gaussian assumptions. RESULTS We aim to resolve these issues by developing a reduced-rank regression model that identifies low-dimensional linear associations between a large number of cell-specific covariates and high-dimensional gene expression readouts. Our probabilistic model uses a Poisson likelihood in order to account for the unique structure of scRNA-seq counts. We demonstrate the performance of our model using simulations, and we apply our model to a scRNA-seq dataset, a spatial gene expression dataset, and a bulk RNA-seq dataset to show its behavior in three distinct analyses. CONCLUSION We show that our statistical modeling approach, which is based on reduced-rank regression, captures associations between gene expression and cell- and sample-specific covariates by leveraging low-dimensional representations of transcriptional states.
Affiliation(s)
- Tiana Fitzgerald
  - Department of Computer Science, Princeton University, Princeton, NJ USA
- Andrew Jones
  - Department of Computer Science, Princeton University, Princeton, NJ USA
- Barbara E. Engelhardt
  - Department of Computer Science, Princeton University, Princeton, NJ USA
  - Data Science and Biotechnology Institute, Gladstone Institutes, San Francisco, CA USA
  - Department of Biomedical Data Science, Stanford University, Stanford, CA USA
7. Gaynor SM, Fagny M, Lin X, Platig J, Quackenbush J. Connectivity in eQTL networks dictates reproducibility and genomic properties. Cell Reports Methods 2022; 2:100218. [PMID: 35637906 PMCID: PMC9142682 DOI: 10.1016/j.crmeth.2022.100218] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/22/2021] [Revised: 02/08/2022] [Accepted: 04/25/2022] [Indexed: 01/11/2023]
Abstract
Expression quantitative trait locus (eQTL) analysis associates SNPs with gene expression; these relationships can be represented as a bipartite network with association strength as "edge weights" between SNPs and genes. However, most eQTL networks use binary edge weights based on thresholded FDR estimates: definitions that influence reproducibility and downstream analyses. We constructed twenty-nine tissue-specific eQTL networks using GTEx data and evaluated a comprehensive set of network specifications based on false discovery rates, test statistics, and p values, focusing on the degree centrality, a metric of an SNP or gene node's potential network influence. We found that a thresholded Benjamini-Hochberg q value weighted by the Z-statistic balances metric reproducibility and computational efficiency. Our estimated gene degrees positively correlate with gene degrees in gene regulatory networks, demonstrating that these networks are complementary in understanding regulation. Gene degrees also correlate with genetic diversity, and heritability analyses show that highly connected nodes are enriched for tissue-relevant traits.
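The edge definition described in this abstract (keep an SNP-gene edge when its Benjamini-Hochberg q value passes a threshold, and weight it by the Z-statistic) can be illustrated with a small sketch. This is not the authors' code: the function names, the 0.05 cutoff, the use of |Z| as the weight, and the two-sided normal p value are all assumptions made for illustration.

```python
from collections import defaultdict
from math import erfc, sqrt


def bh_qvalues(pvals):
    """Benjamini-Hochberg adjusted p values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q


def weighted_degrees(assoc, q_cut=0.05):
    """assoc: list of (snp, gene, z). q_cut is a hypothetical threshold.

    Returns (snp_degree, gene_degree): sums of |z| over retained edges.
    """
    # Two-sided p value for a standard normal Z-statistic.
    pvals = [erfc(abs(z) / sqrt(2.0)) for _, _, z in assoc]
    qvals = bh_qvalues(pvals)
    snp_deg, gene_deg = defaultdict(float), defaultdict(float)
    for (snp, gene, z), q in zip(assoc, qvals):
        if q <= q_cut:  # thresholded edge, weighted by |Z|
            snp_deg[snp] += abs(z)
            gene_deg[gene] += abs(z)
    return dict(snp_deg), dict(gene_deg)


toy = [("rs1", "G1", 5.0), ("rs1", "G2", 4.0), ("rs2", "G1", 0.5), ("rs3", "G3", 0.1)]
snp_degree, gene_degree = weighted_degrees(toy)
```

In the toy input only the two strong associations survive the threshold, so `rs1` is the only SNP with a nonzero degree.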
Affiliation(s)
- Sheila M. Gaynor
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
  - Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02115, USA
- Maud Fagny
  - Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02115, USA
  - Université Paris-Saclay, INRAE, CNRS, AgroParisTech, GQE - Le Moulon, 91190 Gif-sur-Yvette, France
- Xihong Lin
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
  - Department of Statistics, Harvard University, Cambridge, MA 02138, USA
- John Platig
  - Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02115, USA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA 02115, USA
  - Department of Medicine, Harvard Medical School, Boston, MA 02115, USA
- John Quackenbush
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
  - Department of Biostatistics and Computational Biology and Center for Cancer Computational Biology, Dana-Farber Cancer Institute, Boston, MA 02115, USA
  - Channing Division of Network Medicine, Brigham and Women’s Hospital, Boston, MA 02115, USA
8. Gao C, Wei H, Zhang K. LORSEN: Fast and Efficient eQTL Mapping With Low Rank Penalized Regression. Front Genet 2021; 12:690926. [PMID: 34868194 PMCID: PMC8636089 DOI: 10.3389/fgene.2021.690926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2021] [Accepted: 10/08/2021] [Indexed: 12/02/2022] Open
Abstract
Characterization of genetic variations that are associated with gene expression levels is essential to understand cellular mechanisms that underlie human complex traits. Expression quantitative trait loci (eQTL) mapping attempts to identify genetic variants, such as single nucleotide polymorphisms (SNPs), that affect the expression of one or more genes. With the availability of a large volume of gene expression data, it is necessary and important to develop fast and efficient statistical and computational methods to perform eQTL mapping for such large-scale data. In this paper, we proposed a new method, the low rank penalized regression method (LORSEN), for eQTL mapping. We evaluated and compared the performance of LORSEN with two existing methods for eQTL mapping using extensive simulations as well as real data from the HapMap3 project. Simulation studies showed that our method outperformed two commonly used methods for eQTL mapping, LORS and FastLORS, in many scenarios in terms of area under the curve (AUC). We illustrated the usefulness of our method by applying it to SNP variants data and gene expression levels on four chromosomes from the HapMap3 Project.
Affiliation(s)
- Cheng Gao
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, United States
- Hairong Wei
  - College of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI, United States
- Kui Zhang
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, United States
9. Yang F, Liu Z, Zhao M, Mu Q, Che T, Xie Y, Ma L, Mi L, Li J, Zhao Y. Skin transcriptome reveals the periodic changes in genes underlying cashmere (ground hair) follicle transition in cashmere goats. BMC Genomics 2020; 21:392. [PMID: 32503427 PMCID: PMC7275469 DOI: 10.1186/s12864-020-06779-5] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 07/18/2019] [Accepted: 05/13/2020] [Indexed: 02/06/2023] Open
Abstract
Background Cashmere goats make an outstanding contribution to the livestock textile industry and their cashmere is famous for its slenderness and softness and has been extensively studied. However, there are few reports on the molecular regulatory mechanisms of the secondary hair follicle growth cycle in cashmere goats. In order to explore the regular transition through the follicle cycle and the role of key genes in this cycle, we used a transcriptome sequencing technique to sequence the skin of Inner Mongolian cashmere goats during different months. We analyzed the variation and difference in genes throughout the whole hair follicle cycle. We then verified the regulatory mechanism of the cashmere goat secondary hair follicle growth cycle using fluorescence quantitative PCR. Results The growth cycle of cashmere hair could be divided into three distinct periods: a growth period (March–September), a regression period (September–December), and a resting period (December–March). The results of differential gene analyses showed that March was the most significant month. Cluster analysis of gene expression throughout the whole growth cycle further supported the key nodes of the three periods of cashmere growth, and the differential gene expression of keratin corresponding to the cashmere (ground hair) growth cycle further supported the results from tissue slices. Quantitative fluorescence analysis showed that KAP3–1, KRTAP 8–1, and KRTAP 24–1 genes had a close positive correlation with the cashmere growth cycle, and their regulation was consistent with the growth cycle of cashmere. Conclusion The growth cycle of cashmere could be divided into three distinct periods: a growth period (March–September), a regression period (September–December) and a resting period (December–March). March was considered to be the beginning of the cycle. KAP and KRTAP showed a close positive correlation with the secondary hair follicle cashmere growth cycle, and their regulation was consistent with the cashmere growth cycle. But hair follicle development-related genes are expressed earlier than cashmere growth, indicating that cycle regulation could alter the temporal growth of cashmere. This study laid a theoretical foundation for the study of the cashmere development cycle and provided evidence for key genes during transition through the cashmere cycle. Our study provides a theoretical basis for cashmere goat breeding.
Affiliation(s)
- Feng Yang
  - College of Animal Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Zhihong Liu
  - College of Animal Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Meng Zhao
  - College of Animal Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Qing Mu
  - College of Animal Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Tianyu Che
  - College of Animal Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Yuchun Xie
  - College of Animal Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Lina Ma
  - College of Animal Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Lu Mi
  - College of Animal Science, Inner Mongolia Agricultural University, Hohhot, 010018, China
- Jinquan Li
  - College of Animal Science, Inner Mongolia Agricultural University, Hohhot, 010018, China.
- Yanhong Zhao
  - College of Animal Science, Inner Mongolia Agricultural University, Hohhot, 010018, China.
10. Rhyne J, Jeng XJ, Chi EC, Tzeng J. FastLORS: Joint modelling for expression quantitative trait loci mapping in R. Stat (Int Stat Inst) 2020. [DOI: 10.1002/sta4.265] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/11/2022]
Affiliation(s)
- Jacob Rhyne
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- X. Jessie Jeng
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Eric C. Chi
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Jung-Ying Tzeng
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
11. Jeng XJ, Rhyne J, Zhang T, Tzeng JY. Effective SNP ranking improves the performance of eQTL mapping. Genet Epidemiol 2020; 44:611-619. [PMID: 32216117 DOI: 10.1002/gepi.22293] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/28/2019] [Revised: 02/21/2020] [Accepted: 03/11/2020] [Indexed: 11/06/2022]
Abstract
Genome-wide expression quantitative trait loci (eQTLs) mapping explores the relationship between gene expression and DNA variants, such as single-nucleotide polymorphisms (SNPs), to understand the genetic basis of human diseases. Due to the large number of genes and SNPs that need to be assessed, current methods for eQTL mapping often suffer from low detection power, especially for identifying trans-eQTLs. In this paper, we propose the idea of performing SNP ranking based on the higher criticism statistic, a summary statistic developed in large-scale signal detection. We illustrate how the HC-based SNP ranking can effectively prioritize eQTL signals over noise, greatly reduce the burden of joint modeling, and improve the power for eQTL mapping. Numerical results in simulation studies demonstrate the superior performance of our method compared to existing methods. The proposed method is also evaluated in HapMap eQTL data analysis and the results are compared to a database of known eQTLs.
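To make the idea of HC-based SNP ranking concrete, the sketch below computes a commonly used form of the higher criticism statistic from a SNP's p values across genes and sorts SNPs by it. Whether the paper uses exactly this variant is an assumption, as are the function names, the `alpha0` search fraction, and the toy p values.

```python
from math import sqrt


def higher_criticism(pvals, alpha0=0.5):
    """Higher criticism over the smallest alpha0 fraction of sorted p values.

    HC = max_i sqrt(m) * (i/m - p_(i)) / sqrt(p_(i) * (1 - p_(i))).
    p values equal to 0 or 1 are skipped to avoid division by zero.
    """
    m = len(pvals)
    ps = sorted(pvals)
    upper = max(1, int(alpha0 * m))
    best = float("-inf")
    for i in range(1, upper + 1):
        p = ps[i - 1]
        if 0.0 < p < 1.0:
            best = max(best, sqrt(m) * (i / m - p) / sqrt(p * (1.0 - p)))
    return best


def rank_snps(pvals_by_snp):
    """Return SNP ids sorted from strongest to weakest HC signal."""
    return sorted(pvals_by_snp,
                  key=lambda s: higher_criticism(pvals_by_snp[s]),
                  reverse=True)


# Hypothetical p values of two SNPs against four genes.
toy = {
    "rsB": [0.3, 0.5, 0.7, 0.9],    # looks like noise
    "rsA": [0.001, 0.4, 0.6, 0.9],  # one strong association
}
ranking = rank_snps(toy)
```

A SNP with even one unusually small p value among many genes receives a large HC score, which is why such a ranking can place likely eQTL SNPs ahead of noise before any joint model is fitted.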
Affiliation(s)
- X Jessie Jeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Jacob Rhyne
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Teng Zhang
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Jung-Ying Tzeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina
  - Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
  - Division of Biostatistics, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
12. Lutz SM, Thwing A, Fingerlin T. eQTL mapping of rare variant associations using RNA-seq data: An evaluation of approaches. PLoS One 2019; 14:e0223273. [PMID: 31581212 PMCID: PMC6776318 DOI: 10.1371/journal.pone.0223273] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 03/24/2019] [Accepted: 09/17/2019] [Indexed: 11/19/2022] Open
Abstract
Expression quantitative trait loci (eQTL) provide insight into transcription regulation and illuminate the molecular basis of phenotypic outcomes. High-throughput RNA sequencing (RNA-seq) is becoming a popular technique to measure gene expression abundance. Traditional eQTL mapping methods for microarray expression data often assume the expression data follow a normal distribution. As a result, for RNA-seq data, total read count measurements can be normalized by normal quantile transformation in order to fit the data using a linear regression. Other approaches model the total read counts using a negative binomial regression. While these methods work well for common variants (minor allele frequencies > 5% or 1%), an extension of existing methodology is needed to accommodate a collection of rare variants in RNA-seq data. Here, we examine 2 approaches that are direct applications of existing methodology and apply these approaches to RNA-seq studies: 1) collapsing the rare variants in the region and using either negative binomial regression or Poisson regression and 2) using the normalized read counts with the Sequence Kernel Association Test (SKAT), the burden test for SKAT (SKAT-Burden), or an optimal combination of these two tests (SKAT-O). We evaluated these approaches via simulation studies under numerous scenarios and applied these approaches to the 1000 Genomes Project.
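The normal quantile transformation mentioned in this abstract can be sketched as a rank-based inverse normal transform: each gene's read counts are replaced by normal quantiles of their ranks before a linear regression is fitted. The offset `c = 3/8` (Blom's choice), the average-rank handling of ties, and the function name are assumptions for illustration, not details taken from the paper.

```python
from statistics import NormalDist


def inverse_normal_transform(values, c=0.375):
    """Map values to standard normal quantiles of their (tie-averaged) ranks.

    Uses the quantile (rank - c) / (n - 2c + 1); c = 3/8 is an assumed default.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    nd = NormalDist()
    return [nd.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


# Hypothetical, heavily skewed read counts for one gene across five samples.
counts = [10, 200, 30, 5000, 7]
transformed = inverse_normal_transform(counts)
```

The transform keeps only the ordering of the samples, so the extreme count of 5000 no longer dominates a downstream linear model; that loss of magnitude information is the trade-off count-based models such as negative binomial regression avoid.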
Affiliation(s)
- Sharon Marie Lutz
- Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care, Boston, MA, United States of America
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States of America
- Annie Thwing
- Department of Biostatistics and Informatics, University of Colorado, Anschutz Medical Campus, Aurora, CO, United States of America
- Tasha Fingerlin
- Department of Biostatistics and Informatics, University of Colorado, Anschutz Medical Campus, Aurora, CO, United States of America
- Center for Genes, Environment, and Health, National Jewish Health, Denver, CO, United States of America
13
Sun W, Bunn P, Jin C, Little P, Zhabotynsky V, Perou CM, Hayes DN, Chen M, Lin DY. The association between copy number aberration, DNA methylation and gene expression in tumor samples. Nucleic Acids Res 2019. [PMID: 29529299 PMCID: PMC5887505 DOI: 10.1093/nar/gky131] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Indexed: 12/28/2022] Open
Abstract
We systematically studied the association between somatic copy number aberration (SCNA), DNA methylation and gene expression using -omic data from The Cancer Genome Atlas (TCGA) on six cancer types: breast cancer, colon cancer, glioblastoma, leukemia, lower-grade glioma and prostate cancer. A major challenge for such an integrated study is that the association between DNA methylation and gene expression is severely confounded by tumor purity and cell type composition, which are often unobserved and difficult to estimate. To overcome this challenge, we developed a method to remove confounding effects by calculating the principal components that span the space of the latent factors. Another intriguing finding of our study is that there could be both positive and negative associations between SCNA and DNA methylation, while the CpGs with negative/positive associations with SCNA are often located around CpG islands/oceans, respectively. A joint study of SCNA, DNA methylation, and gene expression suggests that SCNA often affects DNA methylation and gene expression independently.
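The idea of removing confounding by projecting out principal components can be sketched as follows. This is a simplified illustration, not the method developed in the paper: it removes only the single leading component, finds it by power iteration, and uses hypothetical function names and simulated data throughout.

```python
import math
import random


def center_rows(matrix):
    """Subtract each feature's (row's) mean across samples."""
    return [[v - sum(row) / len(row) for v in row] for row in matrix]


def top_sample_component(matrix, n_iter=200, seed=0):
    """Leading unit eigenvector of X^T X (samples x samples) by power iteration.

    `matrix` is features x samples and is assumed to be row-centered.
    """
    n_samples = len(matrix[0])
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(n_samples)]
    for _ in range(n_iter):
        # scores = X v (one value per feature), then w = X^T scores.
        scores = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        w = [
            sum(matrix[i][j] * scores[i] for i in range(len(matrix)))
            for j in range(n_samples)
        ]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def remove_component(matrix, v):
    """Project each feature onto the orthogonal complement of unit vector v."""
    residuals = []
    for row in matrix:
        coef = sum(a * b for a, b in zip(row, v))
        residuals.append([a - coef * b for a, b in zip(row, v)])
    return residuals


# Simulated example: 20 features measured on 30 samples, all driven by one
# unobserved factor (a stand-in for something like tumor purity) plus noise.
rng = random.Random(1)
latent = [rng.gauss(0, 1) for _ in range(30)]
data = [[(1 + 0.1 * g) * z + rng.gauss(0, 0.1) for z in latent] for g in range(20)]

centered = center_rows(data)
pc1 = top_sample_component(centered)
adjusted = remove_component(centered, pc1)
```

In this simulation `pc1` tracks the hidden factor closely, so `adjusted` retains mostly the noise term. Real data would usually call for several components and a rule for choosing how many.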
Affiliation(s)
- Wei Sun
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, USA
- Paul Bunn
- Department of Biostatistics, University of North Carolina, Chapel Hill, USA
- Chong Jin
- Department of Biostatistics, University of North Carolina, Chapel Hill, USA
- Paul Little
- Department of Biostatistics, University of North Carolina, Chapel Hill, USA
- Vasyl Zhabotynsky
- Department of Biostatistics, University of North Carolina, Chapel Hill, USA
- Charles M Perou
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, USA; Department of Genetics, University of North Carolina, Chapel Hill, USA
- David Neil Hayes
- Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, USA; Department of Medicine, Division of Hematology/Oncology, University of North Carolina, Chapel Hill, USA
- Mengjie Chen
- Department of Medicine, University of Chicago, USA; Department of Human Genetics, University of Chicago, USA
- Dan-Yu Lin
- Department of Biostatistics, University of North Carolina, Chapel Hill, USA; Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, USA
14
Jiang F. Sufficient direction factor model and its application to gene expression quantitative trait loci discovery. Biometrika 2019; 106:417-432. [PMID: 31097835 PMCID: PMC6508038 DOI: 10.1093/biomet/asz010] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/28/2018] [Indexed: 02/06/2023] Open
Abstract
Rapid improvement in technology has made it relatively cheap to collect genetic data; however, statistical analysis of existing data is still much cheaper. Thus, secondary analysis of single-nucleotide polymorphism (SNP) data, i.e., reanalysing existing data in an effort to extract more information, is an attractive and cost-effective alternative to collecting new data. We study the relationship between gene expression and SNPs through a combination of factor analysis and dimension reduction estimation. To take advantage of the flexibility in traditional factor models where the latent factors are not required to be normal, we recommend using semiparametric sufficient dimension reduction methods in the joint estimation of the combined model. The resulting estimator is flexible and has superior performance relative to the existing estimator, which relies on additional assumptions on the latent factors. We quantify the asymptotic performance of the proposed parameter estimator and perform inference by assessing the estimation variability and by constructing confidence intervals. The new results enable us to identify, for the first time, statistically significant SNPs concerning gene-SNP relations in lung tissue from genotype-tissue expression data.
Affiliation(s)
- F Jiang
- Department of Statistics, The University of Hong Kong, Pokfulam Road, Hong Kong
15
Lan T, Yang B, Zhang X, Wang T, Lu Q. Statistical Methods and Software for Substance Use and Dependence Genetic Research. Curr Genomics 2019; 20:172-183. [PMID: 31929725 PMCID: PMC6935956 DOI: 10.2174/1389202920666190617094930] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/06/2019] [Revised: 05/16/2019] [Accepted: 05/24/2019] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Substantial substance use disorders and related health conditions emerged during the mid-20th century and continue to represent a remarkable 21st century global burden of disease. This burden is largely driven by the substance-dependence process, which is a complex process and is influenced by both genetic and environmental factors. During the past few decades, a great deal of progress has been made in identifying genetic variants associated with Substance Use and Dependence (SUD) through linkage, candidate gene association, genome-wide association and sequencing studies. METHODS Various statistical methods and software have been employed in different types of SUD genetic studies, facilitating the identification of new SUD-related variants. CONCLUSION In this article, we review statistical methods and software that are currently available for SUD genetic studies, and discuss their strengths and limitations.
Affiliation(s)
- Tong Wang
- Address correspondence to these authors at the Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China; Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, USA; Tel/Fax: +1-517-353-8623
- Qing Lu
- Address correspondence to these authors at the Department of Health Statistics, School of Public Health, Shanxi Medical University, Taiyuan, Shanxi, China; Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, MI, USA; Tel/Fax: +1-517-353-8623
16
Chaturvedi N, Menezes RXD, Goeman JJ, Wieringen WV. A test for detecting differential indirect trans effects between two groups of samples. Stat Appl Genet Mol Biol 2018; 17:/j/sagmb.ahead-of-print/sagmb-2017-0058/sagmb-2017-0058.xml. [PMID: 30059350 DOI: 10.1515/sagmb-2017-0058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Integrative analysis of copy number and gene expression data can help in understanding the cis and trans effect of copy number aberrations on transcription levels of genes involved in a pathway. To analyse how these copy number mediated gene-gene interactions differ between groups of samples we propose a new method, named dNET. Our method uses ridge regression to model the network topology involving one gene's expression level, its gene dosage and the expression levels of other genes in the network. The interaction parameters are estimated by fitting the model per gene for all samples together. However, instead of testing for differential network topology per gene, dNET tests for an overall difference in estimated parameters between two groups of samples and produces a single p-value. With the help of several simulation studies, we show that dNET can detect differential network nodes with high accuracy and low rate of false positives even in the presence of differential cis effects. We also apply dNET to publicly available TCGA cancer datasets and identify pathways where copy number mediated gene-gene interactions differ between samples with cancer stage lower than stage 3 and samples with cancer stage 3 or above.
Affiliation(s)
- Nimisha Chaturvedi
- Afdeling Epidemiologie en Biostatistiek, Amsterdam Public Health Research Institute, Medische Faculteit (F-vleugel), VU Medisch Centrum, 1007 MB Amsterdam, The Netherlands
- Netherlands Bioinformatics Center, 260 NBIC, 6500 HB Nijmegen, The Netherlands
- Renée X de Menezes
- Afdeling Epidemiologie en Biostatistiek, Amsterdam Public Health Research Institute, Medische Faculteit (F-vleugel), VU Medisch Centrum, 1007 MB Amsterdam, The Netherlands
- Netherlands Bioinformatics Center, 260 NBIC, 6500 HB Nijmegen, The Netherlands
- Jelle J Goeman
- Department of Biomedical Data Sciences, Room Number S5-P, LUMC Main Building, Leiden University Medical Center, Albinusdreef 2, 2333 ZA Leiden, The Netherlands
- Wessel van Wieringen
- Afdeling Epidemiologie en Biostatistiek, Amsterdam Public Health Research Institute, Medische Faculteit (F-vleugel), VU Medisch Centrum, 1007 MB Amsterdam, The Netherlands
- Department of Mathematics, Amsterdam Public Health Research Institute, Faculty of Sciences, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands
17
Abstract
MOTIVATION It remains a challenge to detect associations between genotypes and phenotypes because of insufficient sample sizes and complex underlying mechanisms involved in associations. Fortunately, it is becoming more feasible to obtain gene expression data in addition to genotypes and phenotypes, giving us new opportunities to detect true genotype-phenotype associations while unveiling their association mechanisms. RESULTS In this article, we propose a novel method, NETAM, that accurately detects associations between SNPs and phenotypes, as well as gene traits involved in such associations. We take a network-driven approach: NETAM first constructs an association network, where nodes represent SNPs, gene traits or phenotypes, and edges represent the strength of association between two nodes. NETAM assigns a score to each path from an SNP to a phenotype, and then identifies significant paths based on the scores. In our simulation study, we show that NETAM finds significantly more phenotype-associated SNPs than traditional genotype-phenotype association analysis under false positive control, taking advantage of gene expression data. Furthermore, we applied NETAM on late-onset Alzheimer's disease data and identified 477 significant path associations, among which we analyzed paths related to beta-amyloid, estrogen, and nicotine pathways. We also provide hypothetical biological pathways to explain our findings. AVAILABILITY AND IMPLEMENTATION Software is available at http://www.sailing.cs.cmu.edu/ CONTACT : epxing@cs.cmu.edu.
Affiliation(s)
- Seunghak Lee
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Soonho Kong
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Eric P Xing
- School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
18
Abstract
The aim of expression Quantitative Trait Locus (eQTL) mapping is the identification of DNA sequence variants that explain variation in gene expression. Given the recent yield of trait-associated genetic variants identified by large-scale genome-wide association analyses (GWAS), eQTL mapping has become a useful tool to understand the functional context where these variants operate and eventually narrow down functional gene targets for disease. Despite its extensive application to complex (polygenic) traits and disease, the majority of eQTL studies still rely on univariate data modeling strategies, i.e., testing for association of all transcript-marker pairs. However, these "one-at-a-time" strategies are (1) unable to control the number of false positives when an intricate Linkage Disequilibrium structure is present and (2) often underpowered to detect the full spectrum of trans-acting regulatory effects. Here we present our viewpoint on the most recent advances in eQTL mapping approaches, with a focus on Bayesian methodology. We review the advantages of the Bayesian approach over frequentist methods and provide an empirical example of polygenic eQTL mapping to illustrate the different properties of frequentist and Bayesian methods. Finally, we discuss how multivariate eQTL mapping approaches have distinctive features with respect to detection of polygenic effects, accuracy, and interpretability of the results.
Affiliation(s)
- Martha Imprialou
- Centre for Complement and Inflammation Research, Imperial College London, Hammersmith Hospital, Du Cane Road, London, W12 0NN, UK
- Enrico Petretto
- Duke-NUS Medical School, 8 College Road, Singapore, 169857, Singapore.
- Leonardo Bottolo
- Department of Medical Genetics, University of Cambridge, Box 238, Lv 6 Addenbrooke's Treatment Centre, Addenbrooke's Hospital, Cambridge Biomedical Campus, Cambridge, CB2 0QQ, UK.
- Department of Mathematics, Imperial College London, 180 Queen's Gate, London, SW7 2AZ, UK.
19
Moreno-Moral A, Pesce F, Behmoaras J, Petretto E. Systems Genetics as a Tool to Identify Master Genetic Regulators in Complex Disease. Methods Mol Biol 2017; 1488:337-362. [PMID: 27933533 DOI: 10.1007/978-1-4939-6427-7_16] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 12/12/2022]
Abstract
Systems genetics stems from systems biology and similarly employs integrative modeling approaches to describe the perturbations and phenotypic effects observed in a complex system. However, in the case of systems genetics the main source of perturbation is naturally occurring genetic variation, which can be analyzed at the systems-level to explain the observed variation in phenotypic traits. In contrast with conventional single-variant association approaches, the success of systems genetics has been in the identification of gene networks and molecular pathways that underlie complex disease. In addition, systems genetics has proven useful in the discovery of master trans-acting genetic regulators of functional networks and pathways, which in many cases revealed unexpected gene targets for disease. Here we detail the central components of a fully integrated systems genetics approach to complex disease, starting from assessment of genetic and gene expression variation, linking DNA sequence variation to mRNA (expression QTL mapping), gene regulatory network analysis and mapping the genetic control of regulatory networks. By summarizing a few illustrative (and successful) examples, we highlight how different data-modeling strategies can be effectively integrated in a systems genetics study.
Affiliation(s)
- Aida Moreno-Moral
- Duke-NUS Medical School, 8 College Road, Singapore, 169857, Singapore
- Francesco Pesce
- National Heart and Lung Institute, Faculty of Medicine, Imperial College London, Hammersmith Campus, Imperial Centre for Translational and Experimental Medicine, London, UK
- Jacques Behmoaras
- Centre for Complement and Inflammation Research, Imperial College London, Hammersmith Hospital, Du Cane Road, London, W12 0NN, UK
- Enrico Petretto
- Duke-NUS Medical School, 8 College Road, Singapore, 169857, Singapore.
20
Richardson S, Tseng GC, Sun W. Statistical Methods in Integrative Genomics. ANNUAL REVIEW OF STATISTICS AND ITS APPLICATION 2016; 3:181-209. [PMID: 27482531 PMCID: PMC4963036 DOI: 10.1146/annurev-statistics-041715-033506] [Citation(s) in RCA: 67] [Impact Index Per Article: 7.4] [Indexed: 05/28/2023]
Abstract
Statistical methods in integrative genomics aim to answer important biology questions by jointly analyzing multiple types of genomic data (vertical integration) or aggregating the same type of data across multiple studies (horizontal integration). In this article, we introduce different types of genomic data and data resources, and then review statistical methods of integrative genomics, with emphasis on the motivation and rationale of these methods. We conclude with some summary points and future research directions.
Affiliation(s)
- Sylvia Richardson
- MRC Biostatistics Unit, Cambridge Institute of Public Health, University of Cambridge, CB2 0SR, United Kingdom
- George C. Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261
- Wei Sun
- Department of Biostatistics, Department of Genetics, University of North Carolina, Chapel Hill, NC 27599
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington 27516
21
Chaturvedi N, de Menezes RX, Goeman JJ. A global × global test for testing associations between two large sets of variables. Biom J 2016; 59:145-158. [PMID: 27225065 DOI: 10.1002/bimj.201500106] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 06/05/2015] [Revised: 01/06/2016] [Accepted: 03/07/2016] [Indexed: 12/30/2022]
Abstract
In high-dimensional omics studies where multiple molecular profiles are obtained for each set of patients, there is often interest in identifying complex multivariate associations, for example, copy number regulated expression levels in a certain pathway or in a genomic region. To detect such associations, we present a novel approach to test for association between two sets of variables. Our approach generalizes the global test, which tests for association between a group of covariates and a single univariate response, to allow high-dimensional multivariate response. We apply the method to several simulated datasets as well as two publicly available datasets, where we compare the performance of multivariate global test (G2) with univariate global test. The method is implemented in R and will be available as a part of the globaltest package in R.
Affiliation(s)
- Nimisha Chaturvedi
- Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands.,Netherlands Bioinformatics Center, Nijmegen, The Netherlands
- Renée X de Menezes
- Epidemiology and Biostatistics, VU University Medical Center, Amsterdam, The Netherlands.,Netherlands Bioinformatics Center, Nijmegen, The Netherlands
- Jelle J Goeman
- Biostatistics, Department for Health Evidence, Radboud University Medical Center, Nijmegen, The Netherlands.,Department of Medical Statistics and Bioinformatics, Leiden University Medical Center, Leiden, The Netherlands
22
Botzman M, Nachshon A, Brodt A, Gat-Viks I. POEM: Identifying Joint Additive Effects on Regulatory Circuits. Front Genet 2016; 7:48. [PMID: 27148351 PMCID: PMC4835676 DOI: 10.3389/fgene.2016.00048] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/17/2015] [Accepted: 03/17/2016] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Expression Quantitative Trait Locus (eQTL) mapping tackles the problem of identifying variations in DNA sequence that have an effect on the transcriptional regulatory network. Major computational efforts are aimed at characterizing the joint effects of several eQTLs acting in concert to govern the expression of the same genes. Yet, progress toward a comprehensive prediction of such joint effects is limited. For example, existing eQTL methods commonly discover interacting loci affecting the expression levels of a module of co-regulated genes. Such "modularization" approaches, however, are focused on epistatic relations and thus have limited utility for the case of additive (non-epistatic) effects. RESULTS Here we present POEM (Pairwise effect On Expression Modules), a methodology for identifying pairwise eQTL effects on gene modules. POEM is specifically designed to achieve high performance in the case of additive joint effects. We applied POEM to transcription profiles measured in bone marrow-derived dendritic cells across a population of genotyped mice. Our study reveals widespread additive, trans-acting pairwise effects on gene modules, characterizes their organizational principles, and highlights high-order interconnections between modules within the immune signaling network. These analyses elucidate the central role of additive pairwise effects in regulatory circuits, and provide computational tools for future investigations into the interplay between eQTLs. AVAILABILITY The software described in this article is available at csgi.tau.ac.il/POEM/.
Affiliation(s)
- Maya Botzman
- Department of Cell Research and Immunology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Aharon Nachshon
- Department of Cell Research and Immunology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Avital Brodt
- Department of Cell Research and Immunology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Irit Gat-Viks
- Department of Cell Research and Immunology, The George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
23
Wang N, Gosik K, Li R, Lindsay B, Wu R. A block mixture model to map eQTLs for gene clustering and networking. Sci Rep 2016; 6:21193. [PMID: 26892775 PMCID: PMC4759821 DOI: 10.1038/srep21193] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/13/2015] [Accepted: 01/19/2016] [Indexed: 01/13/2023] Open
Abstract
To study how genes function in a cellular and physiological process, a general procedure is to classify gene expression profiles into categories based on their similarity and reconstruct a regulatory network for functional elements. However, this procedure has not been implemented with the genetic mechanisms that underlie the organization of gene clusters and networks, despite much effort made to map expression quantitative trait loci (eQTLs) that affect the expression of individual genes. Here we address this issue by developing a computational approach that integrates gene clustering and network reconstruction with genetic mapping into a unifying framework. The approach can not only identify specific eQTLs that control how genes are clustered and organized toward biological functions, but also enable the investigation of the biological mechanisms that individual eQTLs perturb in a signaling pathway. We applied the new approach to characterize the effects of eQTLs on the structure and organization of gene clusters in Caenorhabditis elegans. This study provides the first characterization, to our knowledge, of the effects of genetic variants on the regulatory network of gene expression. The approach developed can also facilitate the genetic dissection of other dynamic processes, including development, physiology and disease progression in any organism.
Affiliation(s)
- Ningtao Wang
- Department of Biostatistics, University of Texas School of Public Health, Houston, TX 77030, USA.,Department of Public Health Sciences, The Pennsylvania State University, Hershey, PA 17033, USA
- Kirk Gosik
- Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA
- Runze Li
- Department of Biostatistics, University of Texas School of Public Health, Houston, TX 77030, USA.,Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA
- Bruce Lindsay
- Department of Biostatistics, University of Texas School of Public Health, Houston, TX 77030, USA
- Rongling Wu
- Department of Biostatistics, University of Texas School of Public Health, Houston, TX 77030, USA.,Department of Statistics, The Pennsylvania State University, University Park, PA 16802, USA
24
Stell L, Sabatti C. Genetic Variant Selection: Learning Across Traits and Sites. Genetics 2016; 202:439-55. [PMID: 26680660 PMCID: PMC4788227 DOI: 10.1534/genetics.115.184572] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 11/06/2015] [Accepted: 11/30/2015] [Indexed: 11/18/2022] Open
Abstract
We consider resequencing studies of associated loci and the problem of prioritizing sequence variants for functional follow-up. Working within the multivariate linear regression framework helps us to account for the joint effects of multiple genes, and adopting a Bayesian approach leads to posterior probabilities that coherently incorporate all information about the variants' function. We describe two novel prior distributions that facilitate learning the role of each variable site by borrowing evidence across phenotypes and across mutations in the same gene. We illustrate their potential advantages with simulations and by reanalyzing a data set of sequencing variants.
Affiliation(s)
- Laurel Stell
- Department of Health Research and Policy, Stanford University, Stanford, California 94305
- Chiara Sabatti
- Department of Health Research and Policy, Stanford University, Stanford, California 94305; Department of Statistics, Stanford University, Stanford, California 94305
25
Howrylak JA, Moll M, Weiss ST, Raby BA, Wu W, Xing EP. Gene expression profiling of asthma phenotypes demonstrates molecular signatures of atopy and asthma control. J Allergy Clin Immunol 2016; 137:1390-1397.e6. [PMID: 26792209 DOI: 10.1016/j.jaci.2015.09.058] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.6] [Received: 03/02/2015] [Revised: 08/13/2015] [Accepted: 09/30/2015] [Indexed: 12/13/2022]
Abstract
BACKGROUND Recent studies have used cluster analysis to identify phenotypic clusters of asthma with differences in clinical traits, as well as differences in response to therapy with anti-inflammatory medications. However, the correspondence between different phenotypic clusters and differences in the underlying molecular mechanisms of asthma pathogenesis remains unclear. OBJECTIVE We sought to determine whether clinical differences among children with asthma in different phenotypic clusters corresponded to differences in levels of gene expression. METHODS We explored differences in gene expression profiles of CD4(+) lymphocytes isolated from the peripheral blood of 299 young adult participants in the Childhood Asthma Management Program study. We obtained gene expression profiles from study subjects between 9 and 14 years of age after they participated in a randomized, controlled longitudinal study examining the effects of inhaled anti-inflammatory medications over a 48-month study period, and we evaluated the correspondence between our earlier phenotypic cluster analysis and subsequent follow-up clinical and molecular profiles. RESULTS We found that differences in clinical characteristics observed between subjects assigned to different phenotypic clusters persisted into young adulthood and that these clinical differences were associated with differences in gene expression patterns between subjects in different clusters. We identified a subset of genes associated with atopic status, validated the presence of an atopic signature among these genes in an independent cohort of asthmatic subjects, and identified the presence of common transcription factor binding sites corresponding to glucocorticoid receptor binding. CONCLUSION These findings suggest that phenotypic clusters are associated with differences in the underlying pathobiology of asthma. Further experiments are necessary to confirm these findings.
Affiliation(s)
- Judie A Howrylak
- Division of Pulmonary, Allergy and Critical Care Medicine, Department of Medicine, Penn State Milton S. Hershey Medical Center, Hershey, Pa.
- Matthew Moll
- Department of Medicine, Boston University, Boston, Mass
- Scott T Weiss
- Harvard Medical School, Boston, Mass; Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, Mass; Division of Pulmonary and Critical Care Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, Mass
- Benjamin A Raby
- Harvard Medical School, Boston, Mass; Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, Mass; Division of Pulmonary and Critical Care Medicine, Department of Medicine, Brigham and Women's Hospital, Boston, Mass
- Wei Wu
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa
- Eric P Xing
- Department of Machine Learning, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pa
26
He H, Lin D, Zhang J, Wang Y, Deng HW. Biostatistics, Data Mining and Computational Modeling. TRANSLATIONAL BIOINFORMATICS 2016. [DOI: 10.1007/978-94-017-7543-4_2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 12/21/2022]
27
Xiong L, Kuan PF, Tian J, Keles S, Wang S. Multivariate Boosting for Integrative Analysis of High-Dimensional Cancer Genomic Data. Cancer Inform 2015; 13:123-31. [PMID: 26609213 PMCID: PMC4648611 DOI: 10.4137/cin.s16353] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 09/18/2014] [Revised: 03/16/2015] [Accepted: 03/20/2015] [Indexed: 12/29/2022] Open
Abstract
In this paper, we propose a novel multivariate component-wise boosting method for fitting multivariate response regression models under the high-dimension, low sample size setting. Our method is motivated by modeling the association among different biological molecules based on multiple types of high-dimensional genomic data. Particularly, we are interested in two applications: studying the influence of DNA copy number alterations on RNA transcript levels and investigating the association between DNA methylation and gene expression. For this purpose, we model the dependence of the RNA expression levels on DNA copy number alterations and the dependence of gene expression on DNA methylation through multivariate regression models and utilize a boosting-type method to handle the high dimensionality as well as to model possible nonlinear associations. The performance of the proposed method is demonstrated through simulation studies. Finally, our multivariate boosting method is applied to two breast cancer studies.
Affiliation(s)
- Lie Xiong
- Department of Statistics, University of Wisconsin, Madison, WI, USA
- Pei-Fen Kuan
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, USA
- Jianan Tian
- Department of Statistics, University of Wisconsin, Madison, WI, USA
- Sunduz Keles
- Department of Statistics, University of Wisconsin, Madison, WI, USA; Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
- Sijian Wang
- Department of Statistics, University of Wisconsin, Madison, WI, USA; Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI, USA
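The component-wise boosting idea named in the abstract above can be illustrated with a minimal sketch. This is a generic single-response L2 boosting loop, not the authors' multivariate method: at each step it updates only the one predictor that most reduces the residual sum of squares. The function name, the step size `nu`, the number of steps, and the toy data are all assumptions made for illustration.

```python
def componentwise_l2_boost(X, y, n_steps=200, nu=0.1):
    """Generic component-wise L2 boosting for one response (illustrative only).

    X: list of rows (each a list of p predictor values), assumed centered.
    y: list of responses, assumed centered.
    Returns a list of p coefficients.
    """
    n, p = len(X), len(X[0])
    coef = [0.0] * p
    resid = list(y)
    # Precompute the sum of squares of every predictor column.
    col_ss = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_steps):
        best_j, best_beta, best_gain = None, 0.0, 0.0
        for j in range(p):
            if col_ss[j] == 0.0:
                continue
            # Least-squares slope of the current residual on predictor j.
            xr = sum(X[i][j] * resid[i] for i in range(n))
            beta = xr / col_ss[j]
            gain = beta * xr  # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_beta, best_gain = j, beta, gain
        if best_j is None:
            break  # no predictor improves the fit
        # Take a small (shrunken) step along the best single predictor.
        coef[best_j] += nu * best_beta
        resid = [resid[i] - nu * best_beta * X[i][best_j] for i in range(n)]
    return coef


# Toy data (made up): y depends only on the first of three predictors.
X_toy = [[-2.0, 1.0, 0.5], [-1.0, -1.0, -0.5], [0.0, 0.0, 1.0],
         [1.0, 1.0, -1.0], [2.0, -1.0, 0.0]]
y_toy = [2.0 * row[0] for row in X_toy]
coef_toy = componentwise_l2_boost(X_toy, y_toy)
```

On this toy input only the first coefficient is ever updated, which is the variable-selection behavior that makes component-wise boosting attractive when predictors far outnumber samples.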
|
28
|
Statistical and Computational Methods for Genetic Diseases: An Overview. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2015; 2015:954598. [PMID: 26106440 PMCID: PMC4464008 DOI: 10.1155/2015/954598] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 09/16/2014] [Accepted: 04/23/2015] [Indexed: 12/19/2022]
Abstract
The causes of genetic diseases have been identified through approaches of increasing complexity. Innovation in genetic methodologies produces large amounts of data that need the support of statistical and computational methods to be processed correctly. The aim of this paper is to provide an overview of statistical and computational methods, with particular attention to methods for sequence analysis and complex diseases.
|
29
|
Rodrigues LCDS, Holmes KE, Thompson V, Piskun CM, Lana SE, Newton MA, Stein TJ. Osteosarcoma tissues and cell lines from patients with differing serum alkaline phosphatase concentrations display minimal differences in gene expression patterns. Vet Comp Oncol 2015; 14:e58-69. [PMID: 25643733 DOI: 10.1111/vco.12132] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 07/16/2014] [Revised: 11/19/2014] [Accepted: 11/19/2014] [Indexed: 12/17/2022]
Abstract
Serum alkaline phosphatase (ALP) concentration is a prognostic factor for osteosarcoma in multiple studies, although its biological significance remains incompletely understood. To determine whether gene expression patterns differed in osteosarcoma from patients with differing serum ALP concentrations, microarray analysis was performed on 18 primary osteosarcoma samples and six osteosarcoma cell lines from dogs with normal and increased serum ALP concentration. No differences in gene expression patterns were noted between tumours or cell lines with differing serum ALP concentration using a gene-specific two-sample t-test. Using a more sensitive empirical Bayes procedure, defective in cullin neddylation 1 domain containing 1 (DCUN1D1) was increased in both the tissue and cell lines of the normal ALP group. Using quantitative PCR (qPCR), differences in DCUN1D1 expression between the two groups failed to reach significance. The homogeneity of gene expression patterns of osteosarcoma associated with differing serum ALP concentrations is consistent with previous studies suggesting that serum ALP concentration is not associated with intrinsic differences of osteosarcoma cells.
Affiliation(s)
- L C de Sá Rodrigues
- Department of Medical Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI, USA
- K E Holmes
- Department of Medical Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI, USA
- V Thompson
- Department of Medical Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI, USA
- C M Piskun
- Department of Medical Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI, USA
- S E Lana
- Flint Animal Cancer Center, College of Veterinary Medicine and Biomedical Sciences, Colorado State University, Ft Collins, CO, USA
- M A Newton
- Departments of Statistics and of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA; Carbone Cancer Center, University of Wisconsin-Madison, Madison, WI, USA
- T J Stein
- Department of Medical Sciences, School of Veterinary Medicine, University of Wisconsin-Madison, Madison, WI, USA; Carbone Cancer Center, University of Wisconsin-Madison, Madison, WI, USA; Institute for Clinical & Translational Research, University of Wisconsin-Madison, Madison, WI, USA
|
30
|
Jiang B, Liu JS. Bayesian Partition Models for Identifying Expression Quantitative Trait Loci. J Am Stat Assoc 2015; 110:1350-1361. [PMID: 29056798 DOI: 10.1080/01621459.2015.1049746] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/23/2022]
Abstract
Expression quantitative trait loci (eQTLs) are genomic locations associated with changes of expression levels of certain genes. By assaying gene expressions and genetic variations simultaneously on a genome-wide scale, scientists wish to discover genomic loci responsible for expression variations of a set of genes. The task can be viewed as a multivariate regression problem with variable selection on both responses (gene expression) and covariates (genetic variations), including also multi-way interactions among covariates. Instead of learning a predictive model of quantitative trait given combinations of genetic markers, we adopt an inverse modeling perspective to model the distribution of genetic markers conditional on gene expression traits. A particular strength of our method is its ability to detect interactive effects of genetic variations with high power even when their marginal effects are weak, addressing a key weakness of many existing eQTL mapping methods. Furthermore, we introduce a hierarchical model to capture the dependence structure among correlated genes. Through simulation studies and a real data example in yeast, we demonstrate how our Bayesian hierarchical partition model achieves a significantly improved power in detecting eQTLs compared to existing methods.
Affiliation(s)
- Bo Jiang
- Harvard University, Cambridge, MA 02138
- Jun S Liu
- Department of Statistics, Harvard University, Cambridge, MA 02138
|
31
|
Abstract
Expression quantitative trait loci (eQTL) mapping constitutes a challenging problem due to, among other reasons, the high-dimensional multivariate nature of gene-expression traits. In addition to the expression heterogeneity produced by confounding factors and other sources of unwanted variation, indirect effects spread throughout genes as a result of genetic, molecular, and environmental perturbations. From a multivariate perspective, one would like to adjust for the effect of all of these factors to end up with a network of direct associations connecting the path from genotype to phenotype. In this article we approach this challenge with mixed graphical Markov models, higher-order conditional independences, and q-order correlation graphs. These models show that additive genetic effects propagate through the network as a function of gene-gene correlations. Our estimation of the eQTL network underlying a well-studied yeast data set leads to a sparse structure with more direct genetic and regulatory associations that enable a straightforward comparison of the genetic control of gene expression across chromosomes. Interestingly, it also reveals that eQTLs explain most of the expression variability of network hub genes.
|
32
|
Ho YY, Baechler EC, Ortmann W, Behrens TW, Graham RR, Bhangale TR, Pan W. Using gene expression to improve the power of genome-wide association analysis. Hum Hered 2014; 78:94-103. [PMID: 25096029 PMCID: PMC4152945 DOI: 10.1159/000362837] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 11/22/2013] [Accepted: 04/14/2014] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND/AIMS Genome-wide association (GWA) studies have reported susceptibility regions in the human genome for many common diseases and traits; however, these loci explain only a minority of trait heritability. To boost the power of a GWA study, substantial research effort has been focused on integrating other available genomic information into the analysis. Advances in high-throughput technologies have generated a wealth of genomic data and made it feasible to combine SNP and gene expression data. RESULTS In this paper, we propose a novel procedure to incorporate gene expression information into GWA analysis. This procedure uses weights constructed from gene expression measurements to adjust p values from a GWA analysis. RESULTS from simulation analyses indicate that the proposed procedures may achieve substantial power gains while controlling family-wise type I error rates at the nominal level. To demonstrate the implementation of our proposed approach, we apply the weight adjustment procedure to a GWA study on serum interferon-regulated chemokine levels in systemic lupus erythematosus patients. The study results can provide valuable insights for the functional interpretation of GWA signals. AVAILABILITY The R source code for implementing the proposed weighting procedure is available at http://www.biostat.umn.edu/~yho/research.html.
Affiliation(s)
- Yen-Yi Ho
- Division of Biostatistics, University of Minnesota
- Wei Pan
- Division of Biostatistics, University of Minnesota
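The abstract above describes adjusting GWA p values with weights while controlling the family-wise error rate. One standard rule of that general shape is weighted Bonferroni, sketched below. It is an assumption that this resembles the paper's procedure; the abstract does not give the formula, and the function name, the weights, and the p values here are made up for illustration.

```python
def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Reject H_i when p_i / w_i <= alpha / m, with weights rescaled to mean 1.

    Generic weighted Bonferroni rule (illustrative; not the paper's exact procedure).
    """
    m = len(p_values)
    if m == 0 or m != len(weights):
        raise ValueError("need one weight per p-value")
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative and not all zero")
    # Rescale so the weights average to 1; this keeps the family-wise error at alpha.
    scaled = [w * m / total for w in weights]
    threshold = alpha / m
    return [w > 0 and p / w <= threshold for p, w in zip(p_values, scaled)]


# Hypothetical SNP p-values and expression-derived weights (made-up numbers).
p_snps = [0.02, 0.02, 0.03, 0.5]
w_expr = [3.0, 0.5, 0.25, 0.25]
rejected = weighted_bonferroni(p_snps, w_expr)
unweighted = weighted_bonferroni(p_snps, [1.0] * 4)
```

With these numbers the first SNP is rejected only when its weight is large, which shows how up-weighting hypotheses supported by expression data can recover signals that plain Bonferroni misses, at the price of less power for down-weighted SNPs.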
|
33
|
Chakraborty A, Jiang G, Boustani M, Liu Y, Skaar T, Li L. Simultaneous inferences based on empirical Bayes methods and false discovery rates in eQTL data analysis. BMC Genomics 2014; 14 Suppl 8:S8. [PMID: 24564682 PMCID: PMC4042241 DOI: 10.1186/1471-2164-14-s8-s8] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 12/11/2022] Open
Abstract
BACKGROUND Genome-wide association studies (GWAS) have identified hundreds of genetic variants associated with complex human diseases, clinical conditions and traits. Genetic mapping of expression quantitative trait loci (eQTLs) is providing us with novel functional effects of thousands of single nucleotide polymorphisms (SNPs). In a classical quantitative trait loci (QTL) mapping problem, multiple tests are done to assess whether one trait is associated with a number of loci. In contrast to QTL studies, thousands of traits are measured along with thousands of gene expressions in an eQTL study. For such a study, a huge number of tests have to be performed (~10^6). This extreme multiplicity gives rise to many computational and statistical problems. In this paper we address these issues using two closely related inferential approaches: an empirical Bayes method that has a Bayesian flavor without requiring much a priori knowledge, and the frequentist method of false discovery rates. A three-component t-mixture model has been used for the parametric empirical Bayes (PEB) method. Inferences have been obtained using the Expectation/Conditional Maximization Either (ECME) algorithm. A simulation study has also been performed and compared with a nonparametric empirical Bayes (NPEB) alternative. RESULTS The results show that PEB has an edge over NPEB. The proposed methodology has been applied to human liver cohort (LHC) data. Our method discovers more significant SNPs with FDR<10% than the previous study by Yang et al. (Genome Research, 2010). CONCLUSIONS In contrast to previously available methods based on p-values, the empirical Bayes method uses the local false discovery rate (lfdr) as the threshold. This method controls the false positive rate.
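The local false discovery rate mentioned in the abstract above is, for a mixture model, the posterior probability that a test statistic comes from the null component. The sketch below evaluates it for a three-component location-scale t mixture whose parameters are simply assumed; in the paper they are estimated with ECME, which is not shown here. The mixture weights, degrees of freedom, locations, and the 0.2 cutoff are hypothetical values chosen for illustration.

```python
import math


def t_pdf(x, df, loc=0.0, scale=1.0):
    """Density of a location-scale Student t distribution."""
    z = (x - loc) / scale
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi) - math.log(scale))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(z * z / df))


def local_fdr(x, components):
    """components: list of (weight, df, loc, scale); the first entry is the null."""
    dens = [w * t_pdf(x, df, loc, scale) for w, df, loc, scale in components]
    # Posterior probability of the null component given the statistic x.
    return dens[0] / sum(dens)


# Hypothetical mixture: null, negative-effect and positive-effect components.
mix = [(0.90, 10, 0.0, 1.0), (0.05, 10, -3.0, 1.0), (0.05, 10, 3.0, 1.0)]
stats = [0.1, 2.0, 4.5]
lfdr_values = [local_fdr(s, mix) for s in stats]
# Call a statistic a discovery when its lfdr falls below an assumed cutoff.
discoveries = [s for s, v in zip(stats, lfdr_values) if v < 0.2]
```

Unlike a p-value cutoff, the lfdr depends on the estimated proportion and shape of the non-null components, so the same statistic can be called or not called depending on how much signal the data set contains overall.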
|
34
|
Ray P, Zheng L, Lucas J, Carin L. Bayesian joint analysis of heterogeneous genomics data. Bioinformatics 2014; 30:1370-6. [DOI: 10.1093/bioinformatics/btu064] [Citation(s) in RCA: 54] [Impact Index Per Article: 4.9] [Indexed: 12/12/2022] Open
|
35
|
An information-theoretic machine learning approach to expression QTL analysis. PLoS One 2013; 8:e67899. [PMID: 23825689 PMCID: PMC3692482 DOI: 10.1371/journal.pone.0067899] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 12/13/2012] [Accepted: 05/21/2013] [Indexed: 11/19/2022] Open
Abstract
Expression Quantitative Trait Locus (eQTL) analysis is a powerful tool to study the biological mechanisms linking the genotype with gene expression. Such analyses can identify genomic locations where genotypic variants influence the expression of genes, both in close proximity to the variant (cis-eQTL) and on other chromosomes (trans-eQTL). Many traditional eQTL methods are based on a linear regression model. In this study, we propose a novel method to identify eQTL associations with information theory and machine learning approaches. Mutual Information (MI) is used to describe the association between genetic marker and gene expression. MI can detect both linear and non-linear associations. It can also capture the heterogeneity of the population. Advanced feature selection methods, Maximum Relevance Minimum Redundancy (mRMR) and Incremental Feature Selection (IFS), were applied to optimize the selection of the genes affected by the genetic marker. When we applied our method to a study of apoE-deficient mice, we found that cis-acting eQTLs are stronger than trans-acting eQTLs, but that there are more trans-acting eQTLs than cis-acting eQTLs. We compared our results (mRMR.eQTL) with R/qtl and MatrixEQTL (modelLINEAR and modelANOVA). In female mice, 67.9% of mRMR.eQTL results can be confirmed by at least two other methods, while only 14.4% of R/qtl results can be confirmed by at least two other methods. In male mice, 74.1% of mRMR.eQTL results can be confirmed by at least two other methods, while only 18.2% of R/qtl results can be confirmed by at least two other methods. Our methods provide a new way to identify the association between genetic markers and gene expression. Our software is available from the supporting information.
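The claim that mutual information picks up non-linear genotype-expression associations can be illustrated with a small plug-in estimator. This is a minimal sketch, not the mRMR.eQTL software: the equal-frequency binning of expression values, the function names, and the toy data are assumptions, and the feature-selection steps (mRMR, IFS) are not shown.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2(p(x,y) / (p(x) p(y))), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def discretize(values, n_bins=3):
    """Equal-frequency binning of expression values (a hypothetical preprocessing choice)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Toy data (made up): expression is highest for genotype 1 and lowest for genotype 2,
# so the relationship is strong but not monotone in allele count.
genotype = [0, 0, 0, 1, 1, 1, 2, 2, 2]
expression = [1.1, 0.9, 1.0, 2.1, 1.9, 2.0, 0.2, 0.1, 0.3]
mi_bits = mutual_information(genotype, discretize(expression))
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Here each genotype maps to its own expression bin, so the estimate reaches its maximum of log2(3) bits even though an additive linear model would fit this pattern poorly. Plug-in estimates are biased upward in small samples, so a real analysis would need a permutation or similar null calibration.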
|
36
|
Bhadra A, Mallick BK. Joint High-Dimensional Bayesian Variable and Covariance Selection with an Application to eQTL Analysis. Biometrics 2013; 69:447-57. [DOI: 10.1111/biom.12021] [Citation(s) in RCA: 50] [Impact Index Per Article: 4.2] [Received: 10/01/2011] [Revised: 10/01/2012] [Accepted: 12/01/2012] [Indexed: 01/29/2023]
Affiliation(s)
- Anindya Bhadra
- Department of Statistics, Purdue University, West Lafayette, Indiana 47907-2066, U.S.A.
- Bani K. Mallick
- Department of Statistics, Texas A&M University, College Station, Texas 77843-3143, U.S.A.
|
37
|
Systems genetics in "-omics" era: current and future development. Theory Biosci 2012; 132:1-16. [PMID: 23138757 DOI: 10.1007/s12064-012-0168-x] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Received: 03/26/2012] [Accepted: 10/25/2012] [Indexed: 02/06/2023]
Abstract
Systems genetics is an emerging discipline that integrates high-throughput expression profiling technology and systems biology approaches to reveal the molecular mechanisms of complex traits. It will improve our understanding of gene functions in biochemical pathways and of genetic interactions between biological molecules. With the rapid advances of microarray analysis technologies, bioinformatics is extensively used in studies of gene functions, SNP-SNP genetic interactions, LD block-block interactions, miRNA-mRNA interactions, DNA-protein interactions, protein-protein interactions, and functional mapping for LD blocks. Building on a bioinformatics panel that can integrate "-omics" datasets to extract systems knowledge and useful information for explaining the molecular mechanisms of complex traits, systems genetics aims to enhance our understanding of biological processes. Systems biology has provided systems-level recognition of various biological phenomena and has constructed the scientific background for the development of systems genetics. In addition, next-generation sequencing technology and post-genome-wide association studies empower the discovery of new genes and rare variants. The integration of different strategies will help to propose novel hypotheses and refine the theoretical framework of systems genetics, which will contribute to the future development of systems genetics and open up a whole new area of genetics.
|
38
|
Ackermann M, Clément-Ziza M, Michaelson JJ, Beyer A. Teamwork: improved eQTL mapping using combinations of machine learning methods. PLoS One 2012; 7:e40916. [PMID: 22911718 PMCID: PMC3404069 DOI: 10.1371/journal.pone.0040916] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 02/08/2012] [Accepted: 06/14/2012] [Indexed: 12/30/2022] Open
Abstract
Expression quantitative trait loci (eQTL) mapping is a widely used technique to uncover regulatory relationships between genes. A range of methodologies have been developed to map links between expression traits and genotypes. The DREAM (Dialogue on Reverse Engineering Assessments and Methods) initiative is a community project to objectively assess the relative performance of different computational approaches for solving specific systems biology problems. The goal of one of the DREAM5 challenges was to reverse-engineer genetic interaction networks from synthetic genetic variation and gene expression data, which simulates the problem of eQTL mapping. In this framework, we proposed an approach whose originality resides in the use of a combination of existing machine learning algorithms (committee). Although it was not the best performer, this method was by far the most precise on average. After the competition, we continued in this direction by evaluating other committees using the DREAM5 data and developed a method that relies on Random Forests and LASSO. It achieved a much higher average precision than the DREAM best performer at the cost of slightly lower average sensitivity.
Affiliation(s)
- Marit Ackermann
- Biotechnology Center, Technical University Dresden, Dresden, Germany
- Andreas Beyer
- Biotechnology Center, Technical University Dresden, Dresden, Germany
- Center for Regenerative Therapy Dresden, Dresden, Germany
|
39
|
Wright FA, Shabalin AA, Rusyn I. Computational tools for discovery and interpretation of expression quantitative trait loci. Pharmacogenomics 2012; 13:343-52. [PMID: 22304583 DOI: 10.2217/pgs.11.185] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 01/08/2023] Open
Abstract
Expression quantitative trait locus (eQTL) analysis is rapidly moving from a cutting-edge concept in genomics to a mature area of investigation, with important connections to genome-wide association studies for human disease, pharmacogenomics and toxicogenomics. Despite the importance of the topic, many investigators must develop their own code or use tools not specifically suited for eQTL analysis. Convenient computational tools are becoming available, but they are not widely publicized, and investigators who are interested in the discovery of eQTL, or in using them to interpret genome-wide association study results, may have difficulty navigating the available resources. The purpose of this review is to help investigators find appropriate programs for eQTL analysis and interpretation.
Affiliation(s)
- Fred A Wright
- Department of Biostatistics, University of North Carolina, Chapel Hill, NC 27599, USA
|
40
|
Scott-Boyer MP, Imholte GC, Tayeb A, Labbe A, Deschepper CF, Gottardo R. An integrated hierarchical Bayesian model for multivariate eQTL mapping. Stat Appl Genet Mol Biol 2012; 11:/j/sagmb.2012.11.issue-4/1544-6115.1760/1544-6115.1760.xml. [PMID: 22850063 PMCID: PMC4627701 DOI: 10.1515/1544-6115.1760] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Indexed: 11/15/2022]
Abstract
Recently, expression quantitative loci (eQTL) mapping studies, where expression levels of thousands of genes are viewed as quantitative traits, have been used to provide greater insight into the biology of gene regulation. Originally, eQTLs were detected by applying standard QTL detection tools (using a "one gene at-a-time" approach), but this method ignores many possible interactions between genes. Several other methods have been proposed to overcome these limitations, but each of them has some specific disadvantages. In this paper, we present an integrated hierarchical Bayesian model that jointly models all genes and SNPs to detect eQTLs. We propose a model (named iBMQ) that is specifically designed to handle a large number G of gene expressions, a large number S of regressors (genetic markers) and a small number n of individuals in what we call a "large G, large S, small n" paradigm. This method incorporates genotypic and gene expression data into a single model while 1) specifically coping with the high dimensionality of eQTL data (large number of genes), 2) borrowing strength from all gene expression data for the mapping procedures, and 3) controlling the number of false positives to a desirable level. To validate our model, we performed simulation studies and showed that it outperforms other popular methods for eQTL detection, including QTLBIM, R-QTL, remMap and M-SPLS. Finally, we used our model to analyze a real expression dataset obtained in a panel of mice BXD Recombinant Inbred (RI) strains. Analysis of these data with iBMQ revealed the presence of multiple hotspots showing significant enrichment in genes belonging to one or more annotation categories.
|
41
|
Mollah MMH, Mollah MNH, Kishino H. β-empirical Bayes inference and model diagnosis of microarray data. BMC Bioinformatics 2012; 13:135. [PMID: 22713095 PMCID: PMC3464654 DOI: 10.1186/1471-2105-13-135] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 12/21/2011] [Accepted: 04/23/2012] [Indexed: 12/04/2022] Open
Abstract
Background Microarray data enables the high-throughput survey of mRNA expression profiles at the genomic level; however, the data presents a challenging statistical problem because of the large number of transcripts with small sample sizes that are obtained. To reduce the dimensionality, various Bayesian or empirical Bayes hierarchical models have been developed. However, because of the complexity of the microarray data, no model can explain the data fully. It is generally difficult to scrutinize the irregular patterns of expression that are not expected by the usual statistical gene by gene models. Results As an extension of empirical Bayes (EB) procedures, we have developed the β-empirical Bayes (β-EB) approach based on a β-likelihood measure which can be regarded as an 'evidence-based' weighted (quasi-) likelihood inference. The weight of a transcript t is described as a power function of its likelihood, f^β(y_t|θ). Genes with low likelihoods have unexpected expression patterns and low weights. By assigning low weights to outliers, the inference becomes robust. The value of β, which controls the balance between the robustness and efficiency, is selected by maximizing the predictive β_0-likelihood by cross-validation. The proposed β-EB approach identified six significant (p < 10^-5) contaminated transcripts as differentially expressed (DE) in normal/tumor tissues from the head and neck of cancer patients. These six genes were all confirmed to be related to cancer; they were not identified as DE genes by the classical EB approach. When applied to the eQTL analysis of Arabidopsis thaliana, the proposed β-EB approach identified some potential master regulators that were missed by the EB approach. Conclusions The simulation data and real gene expression data showed that the proposed β-EB method was robust against outliers. The distribution of the weights was used to scrutinize the irregular patterns of expression and diagnose the model statistically. When β-weights outside the range of the predicted distribution were observed, a detailed inspection of the data was carried out. The β-weights described here can be applied to other likelihood-based statistical models for diagnosis, and may serve as a useful tool for transcriptome and proteome studies.
Affiliation(s)
- Mohammad Manir Hossain Mollah
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan.
|
42
|
Mutshinda CM, Noykova N, Sillanpää MJ. A hierarchical Bayesian approach to multi-trait clinical quantitative trait locus modeling. Front Genet 2012; 3:97. [PMID: 22685451 PMCID: PMC3368303 DOI: 10.3389/fgene.2012.00097] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 12/30/2011] [Accepted: 05/12/2012] [Indexed: 02/04/2023] Open
Abstract
Recent advances in high-throughput genotyping and transcript profiling technologies have enabled the inexpensive production of genome-wide dense marker maps in tandem with huge amounts of expression profiles. These large-scale data encompass valuable information about the genetic architecture of important phenotypic traits. Comprehensive models that combine molecular markers and gene transcript levels are increasingly advocated as an effective approach to dissecting the genetic architecture of complex phenotypic traits. The simultaneous utilization of marker and gene expression data to explain the variation in clinical quantitative trait, known as clinical quantitative trait locus (cQTL) mapping, poses challenges that are both conceptual and computational. Nonetheless, the hierarchical Bayesian (HB) modeling approach, in combination with modern computational tools such as Markov chain Monte Carlo (MCMC) simulation techniques, provides much versatility for cQTL analysis. Sillanpää and Noykova (2008) developed a HB model for single-trait cQTL analysis in inbred line cross-data using molecular markers, gene expressions, and marker-gene expression pairs. However, clinical traits generally relate to one another through environmental correlations and/or pleiotropy. A multi-trait approach can improve on the power to detect genetic effects and on their estimation precision. A multi-trait model also provides a framework for examining a number of biologically interesting hypotheses. In this paper we extend the HB cQTL model for inbred line crosses proposed by Sillanpää and Noykova to a multi-trait setting. We illustrate the implementation of our new model with simulated data, and evaluate the multi-trait model performance with regard to its single-trait counterpart. 
The data simulation process was based on the multi-trait cQTL model, assuming three traits with uncorrelated and correlated cQTL residuals, with the simulated data under uncorrelated cQTL residuals serving as our test set for comparing the performances of the multi-trait and single-trait models. The simulated data under correlated cQTL residuals were essentially used to assess how well our new model can estimate the cQTL residual covariance structure. The model fitting to the data was carried out by MCMC simulation through OpenBUGS. The multi-trait model outperformed its single-trait counterpart in identifying cQTLs, with a consistently lower false discovery rate. Moreover, the covariance matrix of cQTL residuals was typically estimated to an appreciable degree of precision under the multi-trait cQTL model, making our new model a promising approach to addressing a wide range of issues facing the analysis of correlated clinical traits.
Affiliation(s)
- Crispin M Mutshinda
- Department of Mathematics and Statistics, University of Helsinki, Helsinki, Finland
|
43
|
Bottolo L, Petretto E, Blankenberg S, Cambien F, Cook SA, Tiret L, Richardson S. Bayesian detection of expression quantitative trait loci hot spots. Genetics 2011; 189:1449-59. [PMID: 21926303 PMCID: PMC3241411 DOI: 10.1534/genetics.111.131425] [Citation(s) in RCA: 34] [Impact Index Per Article: 2.4] [Received: 07/05/2011] [Accepted: 08/23/2011] [Indexed: 12/21/2022] Open
Abstract
High-throughput genomics allows genome-wide quantification of gene expression levels in tissues and cell types and, when combined with sequence variation data, permits the identification of genetic control points of expression (expression QTL or eQTL). Clusters of eQTL influenced by single genetic polymorphisms can inform on hotspots of regulation of pathways and networks, although very few hotspots have been robustly detected, replicated, or experimentally verified. Here we present a novel modeling strategy to estimate the propensity of a genetic marker to influence several expression traits at the same time, based on a hierarchical formulation of related regressions. We implement this hierarchical regression model in a Bayesian framework using a stochastic search algorithm, HESS, that efficiently probes sparse subsets of genetic markers in a high-dimensional data matrix to identify hotspots and to pinpoint the individual genetic effects (eQTL). Simulating complex regulatory scenarios, we demonstrate that our method outperforms current state-of-the-art approaches, in particular when the number of transcripts is large. We also illustrate the applicability of HESS to diverse real-case data sets, in mouse and human genetic settings, and show that it provides new insights into regulatory hotspots that were not detected by conventional methods. The results suggest that the combination of our modeling strategy and algorithmic implementation provides significant advantages for the identification of functional eQTL hotspots, revealing key regulators underlying pathways.
Affiliation(s)
- Leonardo Bottolo
- MRC Clinical Sciences Centre, Imperial College, London W12 0NN United Kingdom
- Department of Epidemiology and Biostatistics, Imperial College, London W2 1PG, United Kingdom
- Enrico Petretto
- MRC Clinical Sciences Centre, Imperial College, London W12 0NN United Kingdom
- Department of Epidemiology and Biostatistics, Imperial College, London W2 1PG, United Kingdom
- François Cambien
- INSERM UMRS 937, Pierre and Marie Curie University, 75013 Paris, France
- Stuart A. Cook
- MRC Clinical Sciences Centre, Imperial College, London W12 0NN United Kingdom
- National Heart and Lung Institute, Imperial College, London W2 1PG, United Kingdom
- Laurence Tiret
- INSERM UMRS 937, Pierre and Marie Curie University, 75013 Paris, France
- Sylvia Richardson
- Department of Epidemiology and Biostatistics, Imperial College, London W2 1PG, United Kingdom
- MRC–HPA Centre for Environment and Health, Imperial College, London-Harefield Hospital, Harefield, Middlesex UB9 6JH, United Kingdom
| |
Collapse
|
44
|
Yin J, Li H. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. Ann Appl Stat 2011; 5:2630-2650. [PMID: 22905077 DOI: 10.1214/11-aoas494] [Citation(s) in RCA: 90] [Impact Index Per Article: 6.4] [Indexed: 01/01/2023]
Abstract
Genetical genomics experiments have now been routinely conducted to measure both the genetic markers and gene expression data on the same subjects. The gene expression levels are often treated as quantitative traits and are subject to standard genetic analysis in order to identify the expression quantitative trait loci (eQTL). However, the genetic architecture for many gene expressions may be complex, and poorly estimated genetic architecture may compromise the inferences of the dependency structures of the genes at the transcriptional level. In this paper, we introduce a sparse conditional Gaussian graphical model for studying the conditional independent relationships among a set of gene expressions adjusting for possible genetic effects where the gene expressions are modeled with seemingly unrelated regressions. We present an efficient coordinate descent algorithm to obtain the penalized estimation of both the regression coefficients and sparse concentration matrix. The corresponding graph can be used to determine the conditional independence among a group of genes while adjusting for shared genetic effects. Simulation experiments and asymptotic convergence rates and sparsistency are used to justify our proposed methods. By sparsistency, we mean the property that all parameters that are zero are actually estimated as zero with probability tending to one. We apply our methods to the analysis of a yeast eQTL data set and demonstrate that the conditional Gaussian graphical model leads to a more interpretable gene network than the standard Gaussian graphical model based on gene expression data alone.
Affiliation(s)
- Jianxin Yin
- University of Pennsylvania School of Medicine
|
45
|
Abstract
RNA-seq may replace gene expression microarrays in the near future. Using RNA-seq, the expression of a gene can be estimated using the total number of sequence reads mapped to that gene, known as the total read count (TReC). Traditional expression quantitative trait locus (eQTL) mapping methods, such as linear regression, can be applied to TReC measurements after they are properly normalized. In this article, we show that eQTL mapping, by directly modeling TReC using discrete distributions, has higher statistical power than the two-step approach: data normalization followed by linear regression. In addition, RNA-seq provides information on allele-specific expression (ASE) that is not available from microarrays. By combining the information from TReC and ASE, we can computationally distinguish cis- and trans-eQTL and further improve the power of cis-eQTL mapping. Both simulation and real data studies confirm the improved power of our new methods. We also discuss the design issues of RNA-seq experiments. Specifically, we show that by combining TReC and ASE measurements, it is possible to minimize cost and retain the statistical power of cis-eQTL mapping by reducing sample size while increasing the number of sequence reads per sample. In addition to RNA-seq data, our method can also be employed to study the genetic basis of other types of sequencing data, such as chromatin immunoprecipitation followed by DNA sequencing data. In this article, we focus on eQTL mapping of a single gene using the association-based method. However, our method establishes a statistical framework for future developments of eQTL mapping methods using RNA-seq data (e.g., linkage-based eQTL mapping), and the joint study of multiple genetic markers and/or multiple genes.
Affiliation(s)
- Wei Sun
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina 27599, USA.
|
46
|
Abstract
We describe a method for integrating gene expression information into genome scans and show that this can substantially increase the statistical power of QTL mapping. The method has three stages. First, standard clustering methods identify small (size 5-20) groups of genes with similar expression patterns. Second, each gene group is tested for a causative genetic locus shared with the clinical trait of interest. This is done using an EM algorithm approach that treats genotype at the putative causative locus as an unobserved variable and combines expression information from all of the genes in the group to infer genotype information at the locus. Finally, expression QTL (eQTL) are mapped for each gene group that shares a causative locus with the clinical trait. Such eQTL are candidates for the causative locus. Simulation results show that this method has far superior power to standard QTL mapping techniques in many circumstances. We applied this method to existing data on mouse obesity. Our method identified 27 putative body weight QTL, whereas standard QTL mapping produced only one. Furthermore, most gene groups with body weight QTL included cis genes, so candidate genes could be immediately identified. Eleven body weight QTL produced 16 candidate genes that have been previously associated with body weight or body weight-related traits, thus validating our method. In addition, 15 of the 16 other loci produced 32 candidate genes that have not been associated with body weight. Thus, this method shows great promise for finding new causative loci for complex traits.
|
47
|
Newton MA, Chung LM. Gamma-based clustering via ordered means with application to gene-expression analysis. Ann Stat 2010; 38:3217-3244. [PMID: 21113321 PMCID: PMC2990889 DOI: 10.1214/10-aos805] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/19/2022]
Abstract
Discrete mixture models provide a well-known basis for effective clustering algorithms, although technical challenges have limited their scope. In the context of gene-expression data analysis, a model is presented that mixes over a finite catalog of structures, each one representing equality and inequality constraints among latent expected values. Computations depend on the probability that independent gamma-distributed variables attain each of their possible orderings. Each ordering event is equivalent to an event in independent negative-binomial random variables, and this finding guides a dynamic-programming calculation. The structuring of mixture-model components according to constraints among latent means leads to strict concavity of the mixture log likelihood. In addition to its beneficial numerical properties, the clustering method shows promising results in an empirical study.
Affiliation(s)
- Michael A. Newton
- Department of Statistics, University of Wisconsin, Madison, 1300 University Ave, Madison, Wisconsin 53706-1532, USA
- Lisa M. Chung
- Department of Statistics, University of Wisconsin, Madison, 1300 University Ave, Madison, Wisconsin 53706-1532, USA
|
48
|
Abstract
Identifying the genetic basis of complex traits remains an important and challenging problem with the potential to affect a broad range of biological endeavors. A number of statistical methods are available for mapping quantitative trait loci (QTL), but their application to high-throughput phenotypes has been limited as most require user input and interaction. Recently, methods have been developed specifically for expression QTL (eQTL) mapping, but they too are limited in that they do not allow for interactions and QTL of moderate effect. We here propose an automated model-selection-based approach that identifies multiple eQTL in experimental populations, allowing for eQTL of moderate effect and interactions. Output can be used to identify groups of transcripts that are likely coregulated, as demonstrated in a study of diabetes in mouse.
|
49
|
Zhan H, Chen X, Xu S. A stochastic expectation and maximization algorithm for detecting quantitative trait-associated genes. Bioinformatics 2010; 27:63-9. [DOI: 10.1093/bioinformatics/btq558] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.1] [Indexed: 11/14/2022]
|
50
|
Systems genetics, bioinformatics and eQTL mapping. Genetica 2010; 138:915-24. [DOI: 10.1007/s10709-010-9480-x] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.7] [Received: 01/10/2010] [Accepted: 07/30/2010] [Indexed: 12/15/2022]
|