1
|
Kim K, Jun TH, Ha BK, Wang S, Sun H. New statistical selection method for pleiotropic variants associated with both quantitative and qualitative traits. BMC Bioinformatics 2023; 24:381. [PMID: 37817069 PMCID: PMC10563219 DOI: 10.1186/s12859-023-05505-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2023] [Accepted: 09/28/2023] [Indexed: 10/12/2023] Open
Abstract
BACKGROUND Identification of pleiotropic variants associated with multiple phenotypic traits has received increasing attention in genetic association studies. Overlapping genetic associations from multiple traits help to detect weak genetic associations missed by single-trait analyses. Many statistical methods were developed to identify pleiotropic variants with most of them being limited to quantitative traits when pleiotropic effects on both quantitative and qualitative traits have been observed. This is a statistically challenging problem because there does not exist an appropriate multivariate distribution to model both quantitative and qualitative data together. Alternatively, meta-analysis methods can be applied, which basically integrate summary statistics of individual variants associated with either a quantitative or a qualitative trait without accounting for correlations among genetic variants. RESULTS We propose a new statistical selection method based on a unified selection score quantifying how a genetic variant, i.e., a pleiotropic variant associates with both quantitative and qualitative traits. In our extensive simulation studies where various types of pleiotropic effects on both quantitative and qualitative traits were considered, we demonstrated that the proposed method outperforms the existing meta-analysis methods in terms of true positive selection. We also applied the proposed method to a peanut dataset with 6 quantitative and 2 qualitative traits, and a cowpea dataset with 2 quantitative and 6 qualitative traits. We were able to detect some potentially pleiotropic variants missed by the existing methods in both analyses. CONCLUSIONS The proposed method is able to locate pleiotropic variants associated with both quantitative and qualitative traits. It has been implemented into an R package 'UNISS', which can be downloaded from http://github.com/statpng/uniss.
Collapse
Affiliation(s)
- Kipoong Kim
- Department of Statistic, Pusan National University, 46241, Busan, Korea
| | - Tae-Hwan Jun
- Department of Plant Bioscience, Pusan National University, 50463, Miryang, Korea
| | - Bo-Keun Ha
- Department of Applied Plant Science, Chonnam National University, 61186, Gwangju, Korea
| | - Shuang Wang
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, 10032, USA
| | - Hokeun Sun
- Department of Statistic, Pusan National University, 46241, Busan, Korea.
| |
Collapse
|
2
|
Tang X, Mo Z, Chang C, Qian X. Group-shrinkage feature selection with a spatial network for mining DNA methylation data. Comput Biol Med 2023; 154:106573. [PMID: 36706568 DOI: 10.1016/j.compbiomed.2023.106573] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2022] [Revised: 01/05/2023] [Accepted: 01/22/2023] [Indexed: 01/25/2023]
Abstract
Identifying disease-related biomarkers from high-dimensional DNA methylation data helps in reducing early screening costs and inferring pathogenesis mechanisms. Good discovery results have been achieved through spatial correlation methods of methylation sites, group-based regularization, and network constraints. However, these methods still have some key limitations as they cannot exclude isolated differential sites and only consider adjacent site ordering. Therefore, we propose a group-shrinkage feature selection algorithm to encourage the selection of clustered sites and discourage the selection of isolated differential sites. Specifically, a network-guided group-shrinkage strategy is developed to penalize weakly-correlated isolated methylation sites through a network structure constraint. The spatial network is constructed based on spatial correlation information of DNA methylation sites, where this information accounts for the uneven site distribution. The experimental simulations and applications demonstrated that the proposed method outperforms the advanced regularization methods, especially in rejecting isolated methylation sites; hence this study provides an efficient and clinical-valuable method for biomarker candidate discovery in DNA methylation data. Additionally, the proposed method exhibits enhanced reliability due to introducing biological prior knowledge into a regularization-based feature selection framework and could promote more research in the integration between biological prior knowledge and classical feature selection methods, thus facilitating their clinical application. Our source codes will be released at https://github.com/SJTUBME-QianLab/Group-shrinkage-Spatial-Network once this manuscript is accepted for publication.
Collapse
Affiliation(s)
- Xinlu Tang
- Medical Image and Health Informatics Lab, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China.
| | - Zhanfeng Mo
- School of Computer Science and Engineering, Nanyang Technological University, Singapore.
| | - Cheng Chang
- Department of Nuclear Medicine, Shanghai, Chest Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai 200030, China.
| | - Xiaohua Qian
- Medical Image and Health Informatics Lab, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China.
| |
Collapse
|
3
|
Saddiki H, Colicino E, Lesseur C. Assessing Differential Variability of High-Throughput DNA Methylation Data. Curr Environ Health Rep 2022; 9:625-630. [PMID: 36040576 DOI: 10.1007/s40572-022-00374-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/18/2022] [Indexed: 01/31/2023]
Abstract
PURPOSE OF REVIEW DNA methylation (DNAm) is essential to human development and plays an important role as a biomarker due to its susceptibility to environmental exposures. This article reviews the current state of statistical methods developed for differential variability analysis focusing on DNAm data. RECENT FINDINGS With the advent of high-throughput technologies allowing for highly reliable and cost-effective measurements of DNAm, many epigenome studies have analyzed DNAm levels to uncover biological mechanisms underlying past environmental exposures and subsequent health outcomes. These studies typically focused on detecting sites or regions which differ in their mean DNAm levels among exposure groups. However, more recent studies highlighted the importance of identifying differentially variable sites or regions as biologically relevant features. Currently, the analysis of differentially variable DNAm sites has not yet gained widespread adoption in environmental studies; yet, it is important to examine the effects of environmental exposures on inter-individual epigenetic variability. In this article, we describe six of the most widely used statistical approaches for analyzing differential variability of DNAm levels and provide a discussion of their advantages and current limitations.
Collapse
Affiliation(s)
- Hachem Saddiki
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Elena Colicino
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
| | - Corina Lesseur
- Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA.
| |
Collapse
|
4
|
Oh M, Kim K, Sun H. Covariance thresholding to detect differentially co-expressed genes from microarray gene expression data. J Bioinform Comput Biol 2021; 18:2050002. [PMID: 32336254 DOI: 10.1142/s021972002050002x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Gene set analysis aims to identify differentially expressed or co-expressed genes within a biological pathway between two experimental conditions, so that it can eventually reveal biological processes and pathways involved in disease development. In the last few decades, various statistical and computational methods have been proposed to improve statistical power of gene set analysis. In recent years, much attention has been paid to differentially co-expressed genes since they can be potentially disease-related genes without significant difference in average expression levels between two conditions. In this paper, we propose a new statistical method to identify differentially co-expressed genes from microarray gene expression data. The proposed method first estimates co-expression levels of paired genes using covariance regularization by thresholding, and then significance of difference in covariance estimation between two conditions is evaluated. We demonstrated that the proposed method is more powerful than the existing main-stream methods to detect co-expressed genes through extensive simulation studies. Also, we applied it to various microarray gene expression datasets related with mutant p53 transcriptional activity, and epithelium and stroma breast cancer.
Collapse
Affiliation(s)
- Mingyu Oh
- Department of Statistics, Pusan National University, Busan, 46241, Korea
| | - Kipoong Kim
- Department of Statistics, Pusan National University, Busan, 46241, Korea
| | - Hokeun Sun
- Department of Statistics, Pusan National University, Busan, 46241, Korea
| |
Collapse
|
5
|
Zou K, Kim KS, Kim K, Kang D, Park YH, Sun H, Ha BK, Ha J, Jun TH. Genetic Diversity and Genome-Wide Association Study of Seed Aspect Ratio Using a High-Density SNP Array in Peanut ( Arachis hypogaea L.). Genes (Basel) 2020; 12:E2. [PMID: 33375051 PMCID: PMC7822046 DOI: 10.3390/genes12010002] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2020] [Revised: 12/09/2020] [Accepted: 12/17/2020] [Indexed: 12/12/2022] Open
Abstract
Peanut (Arachis hypogaea L.) is one of the important oil crops of the world. In this study, we aimed to evaluate the genetic diversity of 384 peanut germplasms including 100 Korean germplasms and 284 core collections from the United States Department of Agriculture (USDA) using an Axiom_Arachis array with 58K single-nucleotide polymorphisms (SNPs). We evaluated the evolutionary relationships among 384 peanut germplasms using a genome-wide association study (GWAS) of seed aspect ratio data processed by ImageJ software. In total, 14,030 filtered polymorphic SNPs were identified from the peanut 58K SNP array. We identified five SNPs with significant associations to seed aspect ratio on chromosomes Aradu.A09, Aradu.A10, Araip.B08, and Araip.B09. AX-177640219 on chromosome Araip.B08 was the most significantly associated marker in GAPIT and Regularization method. Phosphoenolpyruvate carboxylase (PEPC) was found among the eleven genes within a linkage disequilibrium (LD) of the significant SNPs on Araip.B08 and could have a strong causal effect in determining seed aspect ratio. The results of the present study provide information and methods that are useful for further genetic and genomic studies as well as molecular breeding programs in peanuts.
Collapse
Affiliation(s)
- Kunyan Zou
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea; (K.Z.); (D.K.); (Y.-H.P.)
| | | | - Kipoong Kim
- Department of Statistics, Pusan National University, Busan 46241, Korea; (K.K.); (H.S.)
| | - Dongwoo Kang
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea; (K.Z.); (D.K.); (Y.-H.P.)
| | - Yu-Hyeon Park
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea; (K.Z.); (D.K.); (Y.-H.P.)
| | - Hokeun Sun
- Department of Statistics, Pusan National University, Busan 46241, Korea; (K.K.); (H.S.)
| | - Bo-Keun Ha
- Department of Applied Plant Science, Chonnam National University, Gwangju 61186, Korea;
| | - Jungmin Ha
- Department of Plant Science, Gangneung-Wonju National University, Gangneung 25457, Korea;
| | - Tae-Hwan Jun
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea; (K.Z.); (D.K.); (Y.-H.P.)
- Life and Industry Convergence Research Institute, Pusan National University, Miryang 50463, Korea
| |
Collapse
|
6
|
Selection probability of multivariate regularization to identify pleiotropic variants in genetic association studies. COMMUNICATIONS FOR STATISTICAL APPLICATIONS AND METHODS 2020. [DOI: 10.29220/csam.2020.27.5.535] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022]
|
7
|
Kim K, Koo J, Sun H. An empirical threshold of selection probability for analysis of high-dimensional correlated data. J STAT COMPUT SIM 2020. [DOI: 10.1080/00949655.2020.1739286] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/31/2023]
Affiliation(s)
- Kipoong Kim
- Department of Statistics, Pusan National University, Busan, South Korea
| | - Jajoon Koo
- Department of Statistics, Pusan National University, Busan, South Korea
| | - Hokeun Sun
- Department of Statistics, Pusan National University, Busan, South Korea
| |
Collapse
|
8
|
Duan R, Ning Y, Wang S, Lindsay BG, Carroll RJ, Chen Y. A fast score test for generalized mixture models. Biometrics 2019; 76:811-820. [PMID: 31863595 DOI: 10.1111/biom.13204] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2018] [Revised: 11/15/2019] [Accepted: 12/04/2019] [Indexed: 11/30/2022]
Abstract
In biomedical studies, testing for homogeneity between two groups, where one group is modeled by mixture models, is often of great interest. This paper considers the semiparametric exponential family mixture model proposed by Hong et al. (2017) and studies the score test for homogeneity under this model. The score test is nonregular in the sense that nuisance parameters disappear under the null hypothesis. To address this difficulty, we propose a modification of the score test, so that the resulting test enjoys the Wilks phenomenon. In finite samples, we show that with fixed nuisance parameters the score test is locally most powerful. In large samples, we establish the asymptotic power functions under two types of local alternative hypotheses. Our simulation studies illustrate that the proposed score test is powerful and computationally fast. We apply the proposed score test to an UK ovarian cancer DNA methylation data for identification of differentially methylated CpG sites.
Collapse
Affiliation(s)
- Rui Duan
- Department of Biostatistics, Epidemiology, and Informatics, The University of Pennsylvania, Philadelphia, Pennsylvania
| | - Yang Ning
- Department of Statistical Science, Cornell University, Ithaca, New York
| | - Shuang Wang
- Department of Biostatistics, Columbia University, New York, New York
| | - Bruce G Lindsay
- Department of Statistics, Pennsylvania State University, State College, Pennsylvania
| | - Raymond J Carroll
- Department of Statistics, Texas A&M University, College Station, Texas
| | - Yong Chen
- Department of Biostatistics, Epidemiology, and Informatics, The University of Pennsylvania, Philadelphia, Pennsylvania
| |
Collapse
|
9
|
Kim K, Sun H. Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data. BMC Bioinformatics 2019; 20:510. [PMID: 31640538 PMCID: PMC6805595 DOI: 10.1186/s12859-019-3040-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/10/2019] [Accepted: 08/21/2019] [Indexed: 12/23/2022] Open
Abstract
Background In human genetic association studies with high-dimensional gene expression data, it has been well known that statistical selection methods utilizing prior biological network knowledge such as genetic pathways and signaling pathways can outperform other methods that ignore genetic network structures in terms of true positive selection. In recent epigenetic research on case-control association studies, relatively many statistical methods have been proposed to identify cancer-related CpG sites and their corresponding genes from high-dimensional DNA methylation array data. However, most of existing methods are not designed to utilize genetic network information although methylation levels between linked genes in the genetic networks tend to be highly correlated with each other. Results We propose new approach that combines data dimension reduction techniques with network-based regularization to identify outcome-related genes for analysis of high-dimensional DNA methylation data. In simulation studies, we demonstrated that the proposed approach overwhelms other statistical methods that do not utilize genetic network information in terms of true positive selection. We also applied it to the 450K DNA methylation array data of the four breast invasive carcinoma cancer subtypes from The Cancer Genome Atlas (TCGA) project. Conclusions The proposed variable selection approach can utilize prior biological network information for analysis of high-dimensional DNA methylation array data. It first captures gene level signals from multiple CpG sites using data a dimension reduction technique and then performs network-based regularization based on biological network graph information. It can select potentially cancer-related genes and genetic pathways that were missed by the existing methods. Electronic supplementary material The online version of this article (10.1186/s12859-019-3040-x) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Kipoong Kim
- Department of Statistic, Pusan National University, Busan, 46241, Korea
| | - Hokeun Sun
- Department of Statistic, Pusan National University, Busan, 46241, Korea.
| |
Collapse
|
10
|
Data-Driven-Based Approach to Identifying Differentially Methylated Regions Using Modified 1D Ising Model. BIOMED RESEARCH INTERNATIONAL 2019; 2018:1070645. [PMID: 30581840 PMCID: PMC6276520 DOI: 10.1155/2018/1070645] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/15/2018] [Revised: 10/15/2018] [Accepted: 10/31/2018] [Indexed: 12/19/2022]
Abstract
Background DNA methylation is essential for regulating gene expression, and the changes of DNA methylation status are commonly discovered in disease. Therefore, identification of differentially methylation patterns, especially differentially methylated regions (DMRs), in two different groups is important for understanding the mechanism of complex diseases. Few tools exist for DMR identification through considering features of methylation data, but there is no comprehensive integration of the characteristics of DNA methylation data in current methods. Results Accounting for the characteristics of methylation data, such as the correlation characteristics of neighboring CpG sites and the high heterogeneity of DNA methylation data, we propose a data-driven approach for DMR identification through evaluating the energy of single site using modified 1D Ising model. Applied to both simulated and publicly available datasets, our approach is compared with other popular methods in terms of performance. Simulated results show that our method is more sensitive than competing methods. Applied to the real data, our method can identify more common DMRs than DMRcate, ProbeLasso, and Wang's methods with a high overlapping ratio. Also, the necessity of integrating the heterogeneity and correlation characteristics in identifying DMR is shown through comparing results with only considering mean or variance signals and without considering relationship of neighboring CpG sites, respectively. Through analyzing the number of DMRs identified in real data located in different genomic regions, we find that about 90% DMRs are located in CGI which always regulates the expression of genes. It may help us understand the functional effect of DNA methylation on disease.
Collapse
|
11
|
Choi J, Kim K, Sun H. New variable selection strategy for analysis of high-dimensional DNA methylation data. J Bioinform Comput Biol 2018; 16:1850010. [PMID: 29954287 DOI: 10.1142/s0219720018500105] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023]
Abstract
In genetic association studies, regularization methods are often used due to their computational efficiency for analysis of high-dimensional genomic data. DNA methylation data generated from Infinium HumanMethylation450 BeadChip Kit have a group structure where an individual gene consists of multiple Cytosine-phosphate-Guanine (CpG) sites. Consequently, group-based regularization can precisely detect outcome-related CpG sites. Representative examples are sparse group lasso (SGL) and network-based regularization. The former is powerful when most of the CpG sites within the same gene are associated with a phenotype outcome. In contrast, the latter is preferred when only a few of the CpG sites within the same gene are related to the outcome. In this paper, we propose new variable selection strategy based on a selection probability that measures selection frequency of individual variables selected by both SGL and network-based regularization. In extensive simulation study, we demonstrated that the proposed strategy can show relatively outstanding selection performance under any situation, compared with both SGL and network-based regularization. Also, we applied the proposed strategy to identify differentially methylated CpG sites and their corresponding genes from ovarian cancer data.
Collapse
Affiliation(s)
- Jiyun Choi
- 1 Department of Statistics, Pusan National University, Busan 46241, Korea
| | - Kipoong Kim
- 1 Department of Statistics, Pusan National University, Busan 46241, Korea
| | - Hokeun Sun
- 1 Department of Statistics, Pusan National University, Busan 46241, Korea
| |
Collapse
|
12
|
Wang Y, Teschendorff AE, Widschwendter M, Wang S. Accounting for differential variability in detecting differentially methylated regions. Brief Bioinform 2017; 20:47-57. [DOI: 10.1093/bib/bbx097] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2017] [Indexed: 12/11/2022] Open
Affiliation(s)
- Ya Wang
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
| | - Andrew E Teschendorff
- Department of Women's Cancer, University College London, London, UK
- CAS Key Lab of Computational Biology, Shanghai Institute for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Statistical Cancer Genomics, UCL Cancer Institute, University College London, London, UK
| | | | - Shuang Wang
- Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, NY, USA
| |
Collapse
|