1
Liu W, Lin H, Liu L, Ma Y, Wei Y, Li Y. Supervised structural learning of semiparametric regression on high-dimensional correlated covariates with applications to eQTL studies. Stat Med 2023; 42:3145-3163. [PMID: 37458069 DOI: 10.1002/sim.9769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2022] [Revised: 02/18/2023] [Accepted: 04/26/2023] [Indexed: 07/18/2023]
Abstract
Expression quantitative trait loci (eQTL) studies utilize regression models to explain the variance of gene expressions with genetic loci or single nucleotide polymorphisms (SNPs). However, regression models for eQTL are challenged by the presence of high dimensional non-sparse and correlated SNPs with small effects, and nonlinear relationships between responses and SNPs. Principal component analyses are commonly conducted for dimension reduction without considering responses. Because of that, this non-supervised learning method often does not work well when the focus is on discovery of the response-covariate relationship. We propose a new supervised structural dimensional reduction method for semiparametric regression models with high dimensional and correlated covariates; we extract low-dimensional latent features from a vast number of correlated SNPs while accounting for their relationships, possibly nonlinear, with gene expressions. Our model identifies important SNPs associated with gene expressions and estimates the association parameters via a likelihood-based algorithm. A GTEx data application on a cancer related gene is presented with 18 novel eQTLs detected by our method. In addition, extensive simulations show that our method outperforms the other competing methods in bias, efficiency, and computational cost.
Affiliation(s)
- Wei Liu
  - Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics, Chengdu, China
- Huazhen Lin
  - Center of Statistical Research and School of Statistics, Southwestern University of Finance and Economics, Chengdu, China
  - New Cornerstone Science Laboratory, Shenzhen, China
- Li Liu
  - School of Mathematics and Statistics, Wuhan University, Wuhan, China
- Yanyuan Ma
  - Department of Statistics, Penn State University, University Park, State College, Pennsylvania
- Ying Wei
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York, USA
- Yi Li
  - Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, USA
2
Treppner M, Binder H, Hess M. Interpretable generative deep learning: an illustration with single cell gene expression data. Hum Genet 2022; 141:1481-1498. [PMID: 34988661 PMCID: PMC9360114 DOI: 10.1007/s00439-021-02417-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/14/2021] [Accepted: 08/06/2021] [Indexed: 11/26/2022]
Abstract
Deep generative models can learn the underlying structure, such as pathways or gene programs, from omics data. We provide an introduction as well as an overview of such techniques, specifically illustrating their use with single-cell gene expression data. For example, the low dimensional latent representations offered by various approaches, such as variational auto-encoders, are useful to get a better understanding of the relations between observed gene expressions and experimental factors or phenotypes. Furthermore, by providing a generative model for the latent and observed variables, deep generative models can generate synthetic observations, which allow us to assess the uncertainty in the learned representations. While deep generative models are useful to learn the structure of high-dimensional omics data by efficiently capturing non-linear dependencies between genes, they are sometimes difficult to interpret due to their neural network building blocks. More precisely, to understand the relationship between learned latent variables and observed variables, e.g., gene transcript abundances and external phenotypes, is difficult. Therefore, we also illustrate current approaches that allow us to infer the relationship between learned latent variables and observed variables as well as external phenotypes. Thereby, we render deep learning approaches more interpretable. In an application with single-cell gene expression data, we demonstrate the utility of the discussed methods.
Affiliation(s)
- Martin Treppner
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Stefan-Meier-Str. 26, Freiburg, 79104, Germany
- Harald Binder
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, 79104, Germany
- Moritz Hess
  - Freiburg Center for Data Analysis and Modeling, University of Freiburg, Freiburg, 79104, Germany
3
Wang Y, Sun F, Lin W, Zhang S. AC-PCoA: Adjustment for confounding factors using principal coordinate analysis. PLoS Comput Biol 2022; 18:e1010184. [PMID: 35830390 PMCID: PMC9278763 DOI: 10.1371/journal.pcbi.1010184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2021] [Accepted: 05/08/2022] [Indexed: 12/01/2022] Open
Abstract
Confounding factors exist widely in various biological data owing to technical variations, population structures and experimental conditions. Such factors may mask the true signals and lead to spurious associations in the respective biological data, making it necessary to adjust confounding factors accordingly. However, existing confounder correction methods were mainly developed based on the original data or the pairwise Euclidean distance, either one of which is inadequate for analyzing different types of data, such as sequencing data. In this work, we proposed a method called Adjustment for Confounding factors using Principal Coordinate Analysis, or AC-PCoA, which reduces data dimension and extracts the information from different distance measures using principal coordinate analysis, and adjusts confounding factors across multiple datasets by minimizing the associations between lower-dimensional representations and confounding variables. Application of the proposed method was further extended to classification and prediction. We demonstrated the efficacy of AC-PCoA on three simulated datasets and five real datasets. Compared to the existing methods, AC-PCoA shows better results in visualization, statistical testing, clustering, and classification. With today’s unprecedented amount of data, researchers are challenged by the need to enhance meaningful signals without the interference of unwanted confounders hidden inside the data. Data visualization is an important step toward exploring and explaining data in order to intuitively identify the dominant patterns. Principal coordinate analysis (PCoA), as a visualization tool, allows flexible ways to define pairwise distances and project the samples into lower dimensions without changing the distances. However, when visualizing large-scale biological datasets, the true patterns are often hindered by unwanted confounding variations, either biologically or technically in origin. 
To eliminate these confounding factors and recover underlying signals, we proposed a method called Adjustment for Confounding factors using Principal Coordinate Analysis, or AC-PCoA, and showed that it significantly outperforms existing methods in visualization through three simulation studies and five real datasets. We further showed that the low-dimensional representations given by AC-PCoA provide promising results in statistical testing, clustering, and classification as well.
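As a rough illustration of the plain PCoA step that AC-PCoA builds on, the sketch below implements classical principal coordinate analysis in Python with NumPy: the squared distance matrix is double-centered and then eigendecomposed. The function name `pcoa` and its parameters are illustrative assumptions and are not taken from the AC-PCoA software; the confounder adjustment described in the abstract is not implemented here.

```python
import numpy as np


def pcoa(D, k=2):
    """Classical PCoA of an n x n distance matrix D into k coordinates.

    Hypothetical helper for illustration only; it performs no
    confounder adjustment.
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    # Centering matrix J = I - (1/n) * 1 1^T
    J = np.eye(n) - np.ones((n, n)) / n
    # Double-centered Gram matrix recovered from squared distances
    B = -0.5 * J @ (D ** 2) @ J
    # B is symmetric, so eigh applies; it returns ascending eigenvalues
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:k]
    # Non-Euclidean distances can give negative eigenvalues; clip them to 0
    top_vals = np.clip(vals[order], 0.0, None)
    return vecs[:, order] * np.sqrt(top_vals)


# Usage: embed six random 2-D points from their pairwise Euclidean distances
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))
dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
coords = pcoa(dist, k=2)
```

Because any pairwise distance matrix can be passed in, the same function applies to non-Euclidean measures, which is the flexibility the abstract attributes to PCoA.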
Affiliation(s)
- Yu Wang
  - School of Mathematical Sciences, Fudan University, Shanghai, China
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
- Fengzhu Sun
  - Quantitative and Computational Biology Department, University of Southern California, Los Angeles, California, United States of America
- Wei Lin
  - School of Mathematical Sciences, Fudan University, Shanghai, China
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai, China
  - State Key Laboratory of Medical Neurobiology, MOE Frontiers Center for Brain Science, and Institutes of Brain Science, Fudan University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
  - Key Laboratory of Mathematics for Nonlinear Science (Fudan University), Ministry of Education, Shanghai, China
  - Shanghai Key Laboratory for Contemporary Applied Mathematics (Fudan University), Shanghai, China
- Shuqin Zhang
  - School of Mathematical Sciences, Fudan University, Shanghai, China
  - Key Laboratory of Mathematics for Nonlinear Science (Fudan University), Ministry of Education, Shanghai, China
  - Shanghai Key Laboratory for Contemporary Applied Mathematics (Fudan University), Shanghai, China
  - * E-mail:
4
Gao C, Wei H, Zhang K. LORSEN: Fast and Efficient eQTL Mapping With Low Rank Penalized Regression. Front Genet 2021; 12:690926. [PMID: 34868194 PMCID: PMC8636089 DOI: 10.3389/fgene.2021.690926] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2021] [Accepted: 10/08/2021] [Indexed: 12/02/2022] Open
Abstract
Characterization of genetic variations that are associated with gene expression levels is essential to understand cellular mechanisms that underlie human complex traits. Expression quantitative trait loci (eQTL) mapping attempts to identify genetic variants, such as single nucleotide polymorphisms (SNPs), that affect the expression of one or more genes. With the availability of a large volume of gene expression data, it is necessary and important to develop fast and efficient statistical and computational methods to perform eQTL mapping for such large scale data. In this paper, we proposed a new method, the low rank penalized regression method (LORSEN), for eQTL mapping. We evaluated and compared the performance of LORSEN with two existing methods for eQTL mapping using extensive simulations as well as real data from the HapMap3 project. Simulation studies showed that our method outperformed two commonly used methods for eQTL mapping, LORS and FastLORS, in many scenarios in terms of area under the curve (AUC). We illustrated the usefulness of our method by applying it to SNP variants data and gene expression levels on four chromosomes from the HapMap3 Project.
Affiliation(s)
- Cheng Gao
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, United States
- Hairong Wei
  - College of Forest Resources and Environmental Science, Michigan Technological University, Houghton, MI, United States
- Kui Zhang
  - Department of Mathematical Sciences, Michigan Technological University, Houghton, MI, United States
5
Zhou X, Cai X. Joint eQTL mapping and inference of gene regulatory network improves power of detecting both cis- and trans-eQTLs. Bioinformatics 2021; 38:149-156. [PMID: 34487140 PMCID: PMC8696109 DOI: 10.1093/bioinformatics/btab609] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/23/2020] [Revised: 07/19/2021] [Accepted: 08/25/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Genetic variations of expression quantitative trait loci (eQTLs) play a critical role in influencing complex traits and disease development. Two main factors that affect the statistical power of detecting eQTLs are: (i) relatively small size of samples available, and (ii) heavy burden of multiple testing due to a very large number of variants to be tested. The latter issue is particularly severe when one tries to identify trans-eQTLs that are far away from the genes they influence. If one can exploit co-expressed genes jointly in eQTL-mapping, effective sample size can be increased. Furthermore, using the structure of the gene regulatory network (GRN) may help to identify trans-eQTLs without increasing multiple testing burden. RESULTS In this article, we use the structure equation model (SEM) to model both GRN and effect of eQTLs on gene expression, and then develop a novel algorithm, named sparse SEM for eQTL mapping (SSEMQ), to conduct joint eQTL mapping and GRN inference. The SEM can exploit co-expressed genes jointly in eQTL mapping and also use GRN to determine trans-eQTLs. Computer simulations demonstrate that our SSEMQ significantly outperforms nine existing eQTL mapping methods. SSEMQ is further used to analyze two real datasets of human breast and whole blood tissues, yielding a number of cis- and trans-eQTLs. AVAILABILITY AND IMPLEMENTATION R package ssemQr is available at https://github.com/Ivis4ml/ssemQr.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Xin Zhou
  - Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL 33146, USA
6
Gerard D, Stephens M. Unifying and generalizing methods for removing unwanted variation based on negative controls. Stat Sin 2021; 31:1145-1166. [PMID: 38148787 PMCID: PMC10751021 DOI: 10.5705/ss.202018.0345] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/28/2023]
Abstract
Unwanted variation, including hidden confounding, is a well-known problem in many fields, but particularly in large-scale gene expression studies. Recent proposals to use control genes, genes assumed to be unassociated with the covariates of interest, have led to new methods to deal with this problem. Several versions of these removing unwanted variation (RUV) methods have been proposed, including RUV1, RUV2, RUV4, RUVinv, RUVrinv, and RUVfun. Here, we introduce a general framework, RUV*, that both unites and generalizes these approaches. This unifying framework helps clarify the connections between existing methods. In particular, we provide conditions under which RUV2 and RUV4 are equivalent. The RUV* framework preserves an advantage of the RUV approaches, namely, their modularity, which facilitates the development of novel methods based on existing matrix imputation algorithms. We illustrate this by implementing RUVB, a version of RUV* based on Bayesian factor analysis. In realistic simulations based on real data, we found RUVB to be competitive with existing methods in terms of both power and calibration. However, providing a consistently reliable calibration among the data sets remains challenging.
Affiliation(s)
- David Gerard
  - Department of Mathematics and Statistics, American University, Washington, DC 20016, USA
- Matthew Stephens
  - Departments of Human Genetics and Statistics, University of Chicago, Chicago, IL 60637, USA
7
8
Abstract
BACKGROUND With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, as real data often exhibit violations from theoretical models, this can result in unsubstantiated claims of a method's performance. RESULTS Rather than generate data from a theoretical model, in this paper we develop methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. Our procedures may be applied to both single-cell and bulk RNA-seq. We show that our simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. We also demonstrate our approach by comparing various factor analysis techniques on RNA-seq datasets. CONCLUSIONS Using data simulated from a theoretical model can substantially impact the results of a study. We developed more realistic simulation techniques for RNA-seq data. Our tools are available in the seqgendiff R package on the Comprehensive R Archive Network: https://cran.r-project.org/package=seqgendiff.
Affiliation(s)
- David Gerard
  - Department of Mathematics and Statistics, American University, Massachusetts Ave NW, Washington, DC, 20016, USA
9
Rhyne J, Jeng XJ, Chi EC, Tzeng J. FastLORS: Joint modelling for expression quantitative trait loci mapping in R. Stat (Int Stat Inst) 2020. [DOI: 10.1002/sta4.265] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 11/11/2022]
Affiliation(s)
- Jacob Rhyne
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- X. Jessie Jeng
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Eric C. Chi
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
- Jung-Ying Tzeng
  - Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA
10
Jeng XJ, Rhyne J, Zhang T, Tzeng JY. Effective SNP ranking improves the performance of eQTL mapping. Genet Epidemiol 2020; 44:611-619. [PMID: 32216117 DOI: 10.1002/gepi.22293] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/28/2019] [Revised: 02/21/2020] [Accepted: 03/11/2020] [Indexed: 11/06/2022]
Abstract
Genome-wide expression quantitative trait loci (eQTLs) mapping explores the relationship between gene expression and DNA variants, such as single-nucleotide polymorphisms (SNPs), to understand the genetic basis of human diseases. Due to the large number of genes and SNPs that need to be assessed, current methods for eQTL mapping often suffer from low detection power, especially for identifying trans-eQTLs. In this paper, we propose the idea of performing SNP ranking based on the higher criticism statistic, a summary statistic developed in large-scale signal detection. We illustrate how the HC-based SNP ranking can effectively prioritize eQTL signals over noise, greatly reduce the burden of joint modeling, and improve the power for eQTL mapping. Numerical results in simulation studies demonstrate the superior performance of our method compared to existing methods. The proposed method is also evaluated in HapMap eQTL data analysis and the results are compared to a database of known eQTLs.
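To make the ranking idea concrete, the following Python sketch computes a higher criticism (HC) score for each SNP from its p-values across genes and sorts SNPs by that score. It assumes the commonly cited Donoho-Jin form of the statistic, HC = max over i of sqrt(n) (i/n - p_(i)) / sqrt(p_(i) (1 - p_(i))), where p_(i) are the sorted p-values; the abstract does not state which HC variant or which per-SNP p-values the authors use, so the function names and the input layout (rows are SNPs, columns are genes) are assumptions made for illustration.

```python
import numpy as np


def higher_criticism(pvalues):
    """HC score of one set of p-values (standard Donoho-Jin form, assumed)."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    # Keep p strictly inside (0, 1) so the denominator never vanishes
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    i = np.arange(1, n + 1)
    z = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1.0 - p))
    return float(z.max())


def rank_snps_by_hc(pvalue_matrix):
    """Rank SNPs (rows) by HC score over genes (columns), largest first.

    Hypothetical helper: returns (order, scores).
    """
    scores = np.array([higher_criticism(row) for row in pvalue_matrix])
    return np.argsort(-scores), scores


# Usage: one SNP with evenly spread p-values, one with a few very small ones
null_row = (np.arange(1, 101) - 0.5) / 100
signal_row = null_row.copy()
signal_row[:5] = 1e-6
order, scores = rank_snps_by_hc(np.vstack([null_row, signal_row]))
```

A SNP whose p-values contain a handful of very small entries gets a much larger score than one whose p-values look uniform, which is how the ranking can prioritize sparse signals before any joint modeling.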
Affiliation(s)
- X Jessie Jeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Jacob Rhyne
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Teng Zhang
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina
- Jung-Ying Tzeng
  - Department of Statistics, North Carolina State University, Raleigh, North Carolina
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina
  - Department of Statistics, National Cheng-Kung University, Tainan, Taiwan
  - Division of Biostatistics, Institute of Epidemiology and Preventive Medicine, National Taiwan University, Taipei, Taiwan
11
A Multi-Omics Perspective of Quantitative Trait Loci in Precision Medicine. Trends Genet 2020; 36:318-336. [PMID: 32294413 DOI: 10.1016/j.tig.2020.01.009] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 06/26/2019] [Revised: 01/05/2020] [Accepted: 01/21/2020] [Indexed: 02/07/2023]
Abstract
Quantitative trait loci (QTL) analysis is an important approach to investigate the effects of genetic variants identified through an increasing number of large-scale, multidimensional 'omics data sets. In this 'big data' era, the research community has identified a significant number of molecular QTLs (molQTLs) and increased our understanding of their effects. Herein, we review multiple categories of molQTLs, including those associated with transcriptome, post-transcriptional regulation, epigenetics, proteomics, metabolomics, and the microbiome. We summarize approaches to identify molQTLs and to infer their causal effects. We further discuss the integrative analysis of molQTLs through a multi-omics perspective. Our review highlights future opportunities to better understand the functional significance of genetic variants and to utilize the discovery of molQTLs in precision medicine.
12
Liu J, Wan X, Wang C, Yang C, Zhou X, Yang C. LLR: a latent low-rank approach to colocalizing genetic risk variants in multiple GWAS. Bioinformatics 2018; 33:3878-3886. [PMID: 28961754 DOI: 10.1093/bioinformatics/btx512] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 08/29/2016] [Accepted: 08/09/2017] [Indexed: 12/30/2022] Open
Abstract
Motivation Genome-wide association studies (GWAS), which genotype millions of single nucleotide polymorphisms (SNPs) in thousands of individuals, are widely used to identify the risk SNPs underlying complex human phenotypes (quantitative traits or diseases). Most conventional statistical methods in GWAS only investigate one phenotype at a time. However, an increasing number of reports suggest the ubiquity of pleiotropy, i.e. many complex phenotypes sharing common genetic bases. This motivated us to leverage pleiotropy to develop new statistical approaches to joint analysis of multiple GWAS. Results In this study, we propose a latent low-rank (LLR) approach to colocalizing genetic risk variants using summary statistics. In the presence of pleiotropy, there exist risk loci that affect multiple phenotypes. To leverage pleiotropy, we introduce a low-rank structure to modulate the probabilities of the latent association statuses between loci and phenotypes. Regarding the computational efficiency of LLR, a novel expectation-maximization-path (EM-path) algorithm has been developed to greatly reduce the computational cost and facilitate model selection and inference. We demonstrate the advantages of LLR over competing approaches through simulation studies and joint analysis of 18 GWAS datasets. Availability and implementation The LLR software is available on https://sites.google.com/site/liujin810822. Contact macyang@ust.hk.edu. Supplementary information Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jin Liu
  - Center for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
- Xiang Wan
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Chaolong Wang
  - Genome Institute of Singapore, A*STAR, Singapore, Singapore
- Xiaowei Zhou
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
- Can Yang
  - Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong, China
  - Department of Mathematics, Hong Kong Baptist University, Hong Kong, China
13
Sun J, Herazo-Maya JD, Huang X, Kaminski N, Zhao H. Distance-correlation based gene set analysis in longitudinal studies. Stat Appl Genet Mol Biol 2018; 17:sagmb-2017-0053. [PMID: 29397393 DOI: 10.1515/sagmb-2017-0053] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Longitudinal gene expression profiles of subjects are collected in some clinical studies to monitor disease progression and understand disease etiology. The identification of gene sets that have coordinated changes with relevant clinical outcomes over time from these data could provide significant insights into the molecular basis of disease progression and lead to better treatments. In this article, we propose a Distance-Correlation based Gene Set Analysis (dcGSA) method for longitudinal gene expression data. dcGSA is a non-parametric approach, statistically robust, and can capture both linear and nonlinear relationships between gene sets and clinical outcomes. In addition, dcGSA is able to identify related gene sets in cases where the effects of gene sets on clinical outcomes differ across subjects due to the subject heterogeneity, remove the confounding effects of some unobserved time-invariant covariates, and allow the assessment of associations between gene sets and multiple related outcomes simultaneously. Through extensive simulation studies, we demonstrate that dcGSA is more powerful in detecting relevant genes than other commonly used gene set analysis methods. When dcGSA is applied to a real dataset on systemic lupus erythematosus, we are able to identify more disease-related gene sets than other methods.
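The claim that distance correlation captures nonlinear as well as linear relationships can be illustrated with a short Python sketch of the sample distance correlation (the Székely-style estimator built from double-centered pairwise distance matrices). This is only the basic cross-sectional statistic; the longitudinal and heterogeneity-handling parts of dcGSA are not reproduced, and the function names are illustrative assumptions rather than the dcGSA package API.

```python
import numpy as np


def _centered_distances(x):
    """Double-centered matrix of pairwise Euclidean distances."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a vector as n observations of one variable
    d = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1))
    row_mean = d.mean(axis=1, keepdims=True)
    col_mean = d.mean(axis=0, keepdims=True)
    return d - row_mean - col_mean + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between x and y (same number of rows)."""
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2 = (A * B).mean()       # squared sample distance covariance
    dvar2_x = (A * A).mean()     # squared sample distance variance of x
    dvar2_y = (B * B).mean()     # squared sample distance variance of y
    denom = np.sqrt(dvar2_x * dvar2_y)
    if denom == 0.0:
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


# Usage: a purely quadratic relationship with near-zero Pearson correlation
x = np.linspace(-1.0, 1.0, 50)
y = x ** 2
pearson = float(np.corrcoef(x, y)[0, 1])
dcor = distance_correlation(x, y)
```

In this example the Pearson correlation is essentially zero because the relationship is symmetric around zero, while the distance correlation is clearly positive.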
Affiliation(s)
- Jiehuan Sun
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT 06510, USA
- Jose D Herazo-Maya
  - Internal Medicine: Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT 06519, USA
- Xiu Huang
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT 06520, USA
- Naftali Kaminski
  - Internal Medicine: Pulmonary, Critical Care and Sleep Medicine, Yale School of Medicine, New Haven, CT 06519, USA
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT 06510, USA
14
Abstract
Many statistical learning methods such as matrix completion, matrix regression, and multiple response regression estimate a matrix of parameters. The nuclear norm regularization is frequently employed to achieve shrinkage and low rank solutions. To minimize a nuclear norm regularized loss function, a vital and most time-consuming step is singular value thresholding, which seeks the singular values of a large matrix exceeding a threshold and their associated singular vectors. Currently MATLAB lacks a function for singular value thresholding. Its built-in svds function computes the top r singular values/vectors by Lanczos iterative method but is only efficient for sparse matrix input, while aforementioned statistical learning algorithms perform singular value thresholding on dense but structured matrices. To address this issue, we provide a MATLAB wrapper function svt that implements singular value thresholding. It encompasses both top singular value decomposition and thresholding, handles both large sparse matrices and structured matrices, and reduces the computation cost in matrix learning algorithms.
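For readers unfamiliar with the operation, singular value thresholding can be sketched in a few lines of Python with NumPy: compute an SVD, keep only the singular values above the threshold, shrink them by the threshold, and rebuild the matrix. This is a minimal dense illustration under the assumption of soft thresholding (the proximal operator of the nuclear norm); it uses a full SVD, which is the cost the MATLAB `svt` wrapper described above is meant to avoid, and the function name `singular_value_threshold` is hypothetical rather than part of that wrapper.

```python
import numpy as np


def singular_value_threshold(A, lam):
    """Soft-threshold the singular values of A at level lam.

    Hypothetical dense reference version: full SVD, no sparsity or
    structure exploited.
    """
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    keep = s > lam                 # singular values exceeding the threshold
    s_shrunk = s[keep] - lam       # shrink the retained values by lam
    # Rebuild from the retained singular triplets only
    return (U[:, keep] * s_shrunk) @ Vt[keep, :]


# Usage: thresholding at 1.0 removes the smallest singular value entirely
A = np.diag([5.0, 2.0, 0.5])
X = singular_value_threshold(A, 1.0)
rank_after = np.linalg.matrix_rank(X)
```

Only the singular triplets above the threshold contribute to the result, which is why an iterative solver that computes just the top few singular values can replace the full decomposition for large matrices.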
Affiliation(s)
- Cai Li
  - North Carolina State University
- Hua Zhou
  - University of California, Los Angeles
15
Controlling for Confounding Effects in Single Cell RNA Sequencing Studies Using both Control and Target Genes. Sci Rep 2017; 7:13587. [PMID: 29051597 PMCID: PMC5648789 DOI: 10.1038/s41598-017-13665-w] [Citation(s) in RCA: 26] [Impact Index Per Article: 3.3] [Received: 05/18/2017] [Accepted: 09/29/2017] [Indexed: 11/24/2022] Open
Abstract
Single cell RNA sequencing (scRNAseq) technique is becoming increasingly popular for unbiased and high-resolution transcriptome analysis of heterogeneous cell populations. Despite its many advantages, scRNAseq, like any other genomic sequencing technique, is susceptible to the influence of confounding effects. Controlling for confounding effects in scRNAseq data is a crucial step for accurate downstream analysis. Here, we present a novel statistical method, which we refer to as scPLS (single cell partial least squares), for robust and accurate inference of confounding effects. scPLS takes advantage of the fact that genes in a scRNAseq study often can be naturally classified into two sets: a control set of genes that are free of effects of the predictor variables and a target set of genes that are of primary interest. By modeling the two sets of genes jointly using the partial least squares regression, scPLS is capable of making full use of the data to improve the inference of confounding effects. With extensive simulations and comparisons with other methods, we demonstrate the effectiveness of scPLS. Finally, we apply scPLS to analyze two scRNAseq data sets to illustrate its benefits in removing technical confounding effects as well as for removing cell cycle effects.
16
Yuan L, Zhu L, Guo WL, Zhou X, Zhang Y, Huang Z, Huang DS. Nonconvex Penalty Based Low-Rank Representation and Sparse Regression for eQTL Mapping. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2017; 14:1154-1164. [PMID: 28114074 DOI: 10.1109/tcbb.2016.2609420] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Indexed: 06/06/2023]
Abstract
This paper addresses the problem of accounting for confounding factors and expression quantitative trait loci (eQTL) mapping in the study of SNP-gene associations. The existing convex penalty based algorithm has limited capacity to retain the main information of a matrix while reducing its rank. We present an algorithm, which uses nonconvex penalty based low-rank representation to account for confounding factors and makes use of sparse regression for eQTL mapping (NCLRS). The efficiency of the presented algorithm is evaluated by comparing the results of 18 synthetic datasets given by NCLRS and presented algorithm, respectively. The experimental results on biological datasets show that our approach is a more effective tool to account for non-genetic effects than currently existing methods.
17
Ju JH, Shenoy SA, Crystal RG, Mezey JG. An independent component analysis confounding factor correction framework for identifying broad impact expression quantitative trait loci. PLoS Comput Biol 2017; 13:e1005537. [PMID: 28505156 PMCID: PMC5448815 DOI: 10.1371/journal.pcbi.1005537] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Received: 11/22/2016] [Revised: 05/30/2017] [Accepted: 04/28/2017] [Indexed: 11/19/2022] Open
Abstract
Genome-wide expression Quantitative Trait Loci (eQTL) studies in humans have provided numerous insights into the genetics of both gene expression and complex diseases. While the majority of eQTL identified in genome-wide analyses impact a single gene, eQTL that impact many genes are particularly valuable for network modeling and disease analysis. To enable the identification of such broad impact eQTL, we introduce CONFETI: Confounding Factor Estimation Through Independent component analysis. CONFETI is designed to address two conflicting issues when searching for broad impact eQTL: the need to account for non-genetic confounding factors that can lower the power of the analysis or produce broad impact eQTL false positives, and the tendency of methods that account for confounding factors to model broad impact eQTL as non-genetic variation. The key advance of the CONFETI framework is the use of Independent Component Analysis (ICA) to identify variation likely caused by broad impact eQTL when constructing the sample covariance matrix used for the random effect in a mixed model. We show that CONFETI has better performance than other mixed model confounding factor methods when considering broad impact eQTL recovery from synthetic data. We also used the CONFETI framework and these same confounding factor methods to identify eQTL that replicate between matched twin pair datasets in the Multiple Tissue Human Expression Resource (MuTHER), the Depression Genes Networks study (DGN), the Netherlands Study of Depression and Anxiety (NESDA), and multiple tissue types in the Genotype-Tissue Expression (GTEx) consortium. These analyses identified both cis-eQTL and trans-eQTL impacting individual genes, and CONFETI had better or comparable performance to other mixed model confounding factor analysis methods when identifying such eQTL. In these analyses, we were able to identify and replicate a few broad impact eQTL although the overall number was small even when applying CONFETI. 
In light of these results, we discuss the broad impact eQTL that have been previously reported from the analysis of human data and suggest that considerable caution should be exercised when making biological inferences based on these reported eQTL.
Affiliation(s)
- Jin Hyun Ju
- Department of Genetic Medicine, Weill Cornell Medical College, New York, NY, United States of America
- Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, United States of America
- Sushila A. Shenoy
- Department of Genetic Medicine, Weill Cornell Medical College, New York, NY, United States of America
- Ronald G. Crystal
- Department of Genetic Medicine, Weill Cornell Medical College, New York, NY, United States of America
- Jason G. Mezey
- Department of Genetic Medicine, Weill Cornell Medical College, New York, NY, United States of America
- Institute for Computational Biomedicine, Weill Cornell Medical College, New York, NY, United States of America
- Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, NY, United States of America
|
18
|
Simultaneous dimension reduction and adjustment for confounding variation. Proc Natl Acad Sci U S A 2016; 113:14662-14667. [PMID: 27930330 DOI: 10.1073/pnas.1617317113] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.2] [Indexed: 01/12/2023] Open
Abstract
Dimension reduction methods are commonly applied to high-throughput biological datasets. However, the results can be hindered by confounding factors, either biological or technical in origin. In this study, we extend principal component analysis (PCA) to propose AC-PCA for simultaneous dimension reduction and adjustment for confounding (AC) variation. We show that AC-PCA can adjust for (i) variations across individual donors present in a human brain exon array dataset and (ii) variations of different species in a model organism ENCODE RNA sequencing dataset. Our approach is able to recover the anatomical structure of neocortical regions and to capture the shared variation among species during embryonic development. For gene selection purposes, we extend AC-PCA with sparsity constraints and propose and implement an efficient algorithm. The methods developed in this paper can also be applied to more general settings. The R package and MATLAB source code are available at https://github.com/linzx06/AC-PCA.
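As a rough illustration of adjusting PCA for confounding variation, the sketch below penalizes directions along which the projected data co-vary with a known confounder. It assumes a penalized objective of the form v^T X^T X v - lam * v^T X^T K X v with a linear kernel K built from the confounders. The abstract does not state the objective, so this form, the function name `ac_pca_sketch`, the parameter `lam`, and the toy data are illustrative assumptions, not the interface of the published R or MATLAB code.

```python
import numpy as np


def ac_pca_sketch(x, confounders, lam=1.0, n_components=2):
    """Confounder-penalized PCA (illustrative assumption, not the package API).

    x            : (n_samples, n_features) data matrix
    confounders  : (n_samples, n_confounders) known confounding variables
    lam          : penalty strength; lam = 0 reduces to ordinary PCA
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(confounders, dtype=float)
    x = x - x.mean(axis=0)              # center features
    y = y - y.mean(axis=0)              # center confounders
    kernel = y @ y.T                    # linear kernel on the confounders
    # Symmetric matrix whose leading eigenvectors maximize the penalized objective.
    target = x.T @ x - lam * (x.T @ kernel @ x)
    target = (target + target.T) / 2.0  # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(target)
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order]
    return x @ loadings, loadings


# Toy data: one shared signal plus a strong, hypothetical donor (batch) effect.
rng = np.random.default_rng(1)
batch = np.repeat([0.0, 1.0], 10)
signal = rng.normal(size=(20, 1))
x = (signal @ rng.normal(size=(1, 8))
     + 3.0 * batch[:, None] * rng.normal(size=(1, 8))
     + 0.1 * rng.normal(size=(20, 8)))
scores, loadings = ac_pca_sketch(x, batch[:, None], lam=5.0)
```

Setting `lam=0` recovers ordinary PCA, which makes it easy to compare the adjusted and unadjusted projections on the same data.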
|
19
|
Cheng W, Shi Y, Zhang X, Wang W. Sparse regression models for unraveling group and individual associations in eQTL mapping. BMC Bioinformatics 2016; 17:136. [PMID: 27000043 PMCID: PMC4802846 DOI: 10.1186/s12859-016-0986-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 08/20/2015] [Accepted: 03/10/2016] [Indexed: 11/18/2022] Open
Abstract
Background As a promising tool for dissecting the genetic basis of common diseases, the expression quantitative trait loci (eQTL) study has attracted increasing research interest. Traditional eQTL methods focus on testing the associations between individual single-nucleotide polymorphisms (SNPs) and gene expression traits. A major drawback of this approach is that it cannot model the joint effect of a set of SNPs on a set of genes, which may correspond to biological pathways. Results To alleviate this limitation, we propose geQTL, a sparse regression method that can detect both group-wise and individual associations between SNPs and expression traits. geQTL can also correct for the effects of potential confounders. Our method employs a computationally efficient technique and can therefore handle large-scale studies. Moreover, it can automatically infer the proper number of group-wise associations. We perform extensive experiments on both simulated and yeast datasets to demonstrate the effectiveness and efficiency of the proposed method. The results show that geQTL can effectively detect both individual and group-wise signals and outperforms state-of-the-art methods by a large margin. Conclusions This paper illustrates that decoupling individual and group-wise associations in association mapping can improve both eQTL mapping accuracy and the inference of individual and group-wise associations. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0986-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Wei Cheng
- Department of Computer Science, UNC at Chapel Hill, 201 S Columbia St., Chapel Hill, NC 27599, USA.
- Yu Shi
- Computer Science at the University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, IL 61801, USA
- Xiang Zhang
- Department of Elect. Eng. and Computer Science, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, OH 44106, USA
- Wei Wang
- Department of Computer Science, University of California, Los Angeles, 3531-G Boelter Hall, Los Angeles, CA 90095, USA
|
20
|
Albert FW, Kruglyak L. The role of regulatory variation in complex traits and disease. Nat Rev Genet 2015; 16:197-212. [DOI: 10.1038/nrg3891] [Citation(s) in RCA: 675] [Impact Index Per Article: 67.5] [Indexed: 02/07/2023]
|
21
|
Cheng W, Shi Y, Zhang X, Wang W. Fast and robust group-wise eQTL mapping using sparse graphical models. BMC Bioinformatics 2015; 16:2. [PMID: 25593000 PMCID: PMC4387667 DOI: 10.1186/s12859-014-0421-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 06/03/2014] [Accepted: 12/11/2014] [Indexed: 01/01/2023] Open
Abstract
Background Genome-wide expression quantitative trait loci (eQTL) studies have emerged as a powerful tool to understand the genetic basis of gene expression and complex traits. The traditional eQTL methods focus on testing the associations between individual single-nucleotide polymorphisms (SNPs) and gene expression traits. A major drawback of this approach is that it cannot model the joint effect of a set of SNPs on a set of genes, which may correspond to hidden biological pathways. Results We introduce a new approach to identify novel group-wise associations between sets of SNPs and sets of genes. Such associations are captured by hidden variables connecting SNPs and genes. Our model is a linear-Gaussian model and uses two types of hidden variables. One captures the set associations between SNPs and genes, and the other captures confounders. We develop an efficient optimization procedure which makes this approach suitable for large scale studies. Extensive experimental evaluations on both simulated and real datasets demonstrate that the proposed methods can effectively capture both individual and group-wise signals that cannot be identified by the state-of-the-art eQTL mapping methods. Conclusions Considering group-wise associations significantly improves the accuracy of eQTL mapping, and the successful multi-layer regression model opens a new approach to understand how multiple SNPs interact with each other to jointly affect the expression level of a group of genes. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0421-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Wei Cheng
- Department of Computer Science, UNC at Chapel Hill, 201 S Columbia St., Chapel Hill, 27599, NC, USA.
- Yu Shi
- Computer Science at the University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, 61801, IL, USA.
- Xiang Zhang
- Department of Elect. Eng. and Computer Science, Case Western Reserve University, 10900 Euclid Avenue, Cleveland, 44106, OH, USA.
- Wei Wang
- Department of Computer Science, University of California, Los Angeles, 3531-G Boelter Hall, Los Angeles, 90095, CA, USA.
|
22
|
Abstract
MOTIVATION As a promising tool for dissecting the genetic basis of complex traits, expression quantitative trait loci (eQTL) mapping has attracted increasing research interest. An important issue in eQTL mapping is how to effectively integrate networks representing interactions among genetic markers and genes. Recently, several Lasso-based methods have been proposed to leverage such network information. Despite their success, existing methods have three common limitations: (i) a preprocessing step is usually needed to cluster the networks; (ii) the incompleteness of the networks and the noise in them are not considered; (iii) other available information, such as the location of genetic markers and pathway information, is not integrated. RESULTS To address the limitations of the existing methods, we propose Graph-regularized Dual Lasso (GDL), a robust approach for eQTL mapping. GDL integrates the correlation structures among genetic markers and traits simultaneously. It also takes into account the incompleteness of the networks and is robust to noise. GDL utilizes graph-based regularizers to model the prior networks and does not require an explicit clustering step. Moreover, it enables further refinement of the partial and noisy networks. We further generalize GDL to incorporate the location of genetic markers and gene-pathway information. We perform extensive experimental evaluations using both simulated and real datasets. Experimental results demonstrate that the proposed methods can effectively integrate various kinds of available prior knowledge and significantly outperform the state-of-the-art eQTL mapping methods. AVAILABILITY Both C++ and Matlab versions of the software are available at http://www.cs.unc.edu/∼weicheng/.
Affiliation(s)
- Wei Cheng
- Department of Computer Science, UNC at Chapel Hill, Chapel Hill, NC 27599, Department of EECS, Case Western Reserve University, OH 44106, USA Department of Mathematics, University of Science and Technology of China, Hefei 23002, China and Department of Computer Science, University of California, Los Angeles, CA 90095, USA
- Xiang Zhang
- Department of Computer Science, UNC at Chapel Hill, Chapel Hill, NC 27599, Department of EECS, Case Western Reserve University, OH 44106, USA Department of Mathematics, University of Science and Technology of China, Hefei 23002, China and Department of Computer Science, University of California, Los Angeles, CA 90095, USA
- Zhishan Guo
- Department of Computer Science, UNC at Chapel Hill, Chapel Hill, NC 27599, Department of EECS, Case Western Reserve University, OH 44106, USA Department of Mathematics, University of Science and Technology of China, Hefei 23002, China and Department of Computer Science, University of California, Los Angeles, CA 90095, USA
- Yu Shi
- Department of Computer Science, UNC at Chapel Hill, Chapel Hill, NC 27599, Department of EECS, Case Western Reserve University, OH 44106, USA Department of Mathematics, University of Science and Technology of China, Hefei 23002, China and Department of Computer Science, University of California, Los Angeles, CA 90095, USA
- Wei Wang
- Department of Computer Science, UNC at Chapel Hill, Chapel Hill, NC 27599, Department of EECS, Case Western Reserve University, OH 44106, USA Department of Mathematics, University of Science and Technology of China, Hefei 23002, China and Department of Computer Science, University of California, Los Angeles, CA 90095, USA
|
23
|
Ho YY, Baechler EC, Ortmann W, Behrens TW, Graham RR, Bhangale TR, Pan W. Using gene expression to improve the power of genome-wide association analysis. Hum Hered 2014; 78:94-103. [PMID: 25096029 PMCID: PMC4152945 DOI: 10.1159/000362837] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 11/22/2013] [Accepted: 04/14/2014] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND/AIMS Genome-wide association (GWA) studies have reported susceptible regions in the human genome for many common diseases and traits; however, these loci only explain a minority of trait heritability. To boost the power of a GWA study, substantial research endeavors have been focused on integrating other available genomic information in the analysis. Advances in high-throughput technologies have generated a wealth of genomic data and made combining SNP and gene expression data feasible. RESULTS In this paper, we propose a novel procedure to incorporate gene expression information into GWA analysis. This procedure utilizes weights constructed from gene expression measurements to adjust p values from a GWA analysis. Results from simulation analyses indicate that the proposed procedure may achieve substantial power gains while controlling family-wise type I error rates at the nominal level. To demonstrate the implementation of our proposed approach, we apply the weight adjustment procedure to a GWA study on serum interferon-regulated chemokine levels in systemic lupus erythematosus patients. The study results can provide valuable insights for the functional interpretation of GWA signals. AVAILABILITY The R source code for implementing the proposed weighting procedure is available at http://www.biostat.umn.edu/∼yho/research.html.
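The general idea of adjusting p values with weights while controlling the family-wise error rate can be sketched with the standard weighted Bonferroni rule: each p value is divided by its weight, and the weights are rescaled to average 1. The abstract does not say how the expression-based weights are constructed, so the weights below, the function name `weighted_bonferroni`, and the example numbers are hypothetical and are not taken from the authors' R code.

```python
def weighted_bonferroni(p_values, weights, alpha=0.05):
    """Return indices of tests rejected by a weighted Bonferroni rule.

    Weights are rescaled to average 1, so the family-wise type I error
    rate stays at or below alpha when the weights do not depend on the
    p values themselves.
    """
    if len(p_values) != len(weights):
        raise ValueError("p_values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    m = len(p_values)
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    scaled = [w * m / total for w in weights]
    rejected = []
    for i, (p, w) in enumerate(zip(p_values, scaled)):
        # A zero weight means the test can never be rejected.
        if w > 0 and p / w <= alpha / m:
            rejected.append(i)
    return rejected


p_values = [0.02, 0.004, 0.03, 0.5]
weights = [2.0, 1.0, 0.5, 0.5]   # hypothetical expression-derived weights

weighted = weighted_bonferroni(p_values, weights)            # [0, 1]
unweighted = weighted_bonferroni(p_values, [1.0] * 4)        # [1]
```

In this toy example the up-weighted first test passes the threshold that it would miss under an unweighted Bonferroni correction, which is the kind of power gain a well-chosen weighting is meant to provide.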
Affiliation(s)
- Yen-Yi Ho
- Division of Biostatistics, University of Minnesota
- Wei Pan
- Division of Biostatistics, University of Minnesota
|
24
|
Wang Y, Wang L, Yang D, Deng M. Imputing missing values for genetic interaction data. Methods 2014; 67:269-77. [PMID: 24718098 DOI: 10.1016/j.ymeth.2014.03.032] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 10/02/2013] [Revised: 03/19/2014] [Accepted: 03/27/2014] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND Epistatic Miniarray Profiles (EMAP) are an important method for studying genetic interactions and constructing large-scale genetic interaction networks. However, a high proportion of missing values frequently poses problems in EMAP data analysis because such missing values hinder downstream analysis. Although some imputation approaches are available for EMAP data, we adopted an improved SVD modeling procedure to impute the missing values, which resulted in higher accuracy than existing methods. RESULTS The improved SVD imputation method applies an effective soft threshold to the SVD approach and was shown to be the best model for imputing genetic interaction data when compared with a number of advanced imputation methods. Imputation also improves the clustering results of EMAP datasets. Thus, after applying our imputation method to the EMAP dataset, more meaningful modules, known pathways and protein complexes could be detected. CONCLUSION While missing data unavoidably complicate EMAP analysis, our results showed that the Soft-SVD approach can complete the original dataset and accurately recover genetic interactions.
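A minimal sketch of soft-thresholded SVD imputation is given below, assuming the generic iterative scheme in which missing entries are repeatedly refilled from a low-rank reconstruction whose singular values have been shrunk. The abstract does not describe the specific improvements of the Soft-SVD procedure, so the function name `soft_svd_impute`, the `shrinkage` parameter, the stopping rule, and the toy matrix are assumptions for illustration only.

```python
import numpy as np


def soft_svd_impute(matrix, shrinkage=1.0, n_iter=100, tol=1e-6):
    """Fill NaN entries using iterative soft-thresholded SVD (generic sketch)."""
    x = np.asarray(matrix, dtype=float)
    missing = np.isnan(x)
    filled = np.where(missing, 0.0, x)        # start with zeros in the gaps
    for _ in range(n_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        s_shrunk = np.maximum(s - shrinkage, 0.0)   # soft-threshold singular values
        low_rank = (u * s_shrunk) @ vt
        # Keep observed entries fixed; refresh only the missing ones.
        updated = np.where(missing, low_rank, x)
        change = np.linalg.norm(updated - filled)
        filled = updated
        if change <= tol * max(1.0, np.linalg.norm(filled)):
            break
    return filled


# Toy rank-1 "interaction score" matrix with two entries removed.
rng = np.random.default_rng(0)
truth = np.outer(rng.normal(size=6), rng.normal(size=5))
observed = truth.copy()
observed[1, 2] = np.nan
observed[4, 0] = np.nan
completed = soft_svd_impute(observed, shrinkage=0.1)
```

Larger values of `shrinkage` push the reconstruction toward lower rank; in practice this parameter would be tuned, for example by holding out some observed entries.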
Affiliation(s)
- Yishu Wang
- Center for Quantitative Biology, Peking University, Beijing 100871, China
- Lin Wang
- Center for Quantitative Biology, Peking University, Beijing 100871, China
- Dejie Yang
- Institute of Computing Technology, Chinese Academy of Science, Beijing 100190, China
- Minghua Deng
- Center for Quantitative Biology, Peking University, Beijing 100871, China; School of Mathematical Sciences, Peking University, Beijing 100871, China; Center for Statistical Sciences, Peking University, Beijing 100871, China.
|
25
|
Gao C, Tignor NL, Salit J, Strulovici-Barel Y, Hackett NR, Crystal RG, Mezey JG. HEFT: eQTL analysis of many thousands of expressed genes while simultaneously controlling for hidden factors. Bioinformatics 2013; 30:369-76. [PMID: 24307700 DOI: 10.1093/bioinformatics/btt690] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.7] [Indexed: 01/04/2023]
Abstract
MOTIVATION Identification of expression Quantitative Trait Loci (eQTL), the genetic loci that contribute to heritable variation in gene expression, can be obstructed by factors that produce variation in expression profiles if these factors are unmeasured or hidden from direct analysis. METHODS We have developed a method for Hidden Expression Factor analysis (HEFT) that identifies individual and pleiotropic effects of eQTL in the presence of hidden factors. The HEFT model is a combined multivariate regression and factor analysis, where the complete likelihood of the model is used to derive a ridge estimator for simultaneous factor learning and detection of eQTL. HEFT requires no pre-estimation of hidden factor effects; it provides P-values and is extremely fast, requiring just a few hours to complete an eQTL analysis of thousands of expression variables when analyzing hundreds of thousands of single nucleotide polymorphisms on a standard 8 core 2.6 G desktop. RESULTS By analyzing simulated data, we demonstrate that HEFT can correct for an unknown number of hidden factors and significantly outperforms all related hidden factor methods for eQTL analysis when there are eQTL with univariate and multivariate (pleiotropic) effects. To demonstrate a real-world application, we applied HEFT to identify eQTL affecting gene expression in the human lung for a study that included presumptive hidden factors. HEFT identified all of the cis-eQTL found by other hidden factor methods and 91 additional cis-eQTL. HEFT also identified a number of eQTLs with direct relevance to lung disease that could not be found without a hidden factor analysis, including cis-eQTL for GTF2H1 and MTRR, genes that have been independently associated with lung cancer. AVAILABILITY Software is available at http://mezeylab.cb.bscb.cornell.edu/Software.aspx. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chuan Gao
- Biological Statistics and Computational Biology, Cornell University, Ithaca, NY 14850, USA and Department of Genetic Medicine, Weill Cornell Medical College, New York, NY 10021, USA
|