1
Sun W. Integrative functional logistic regression model for genome-wide association studies. Comput Biol Med 2025; 187:109766. [PMID: 39919666 DOI: 10.1016/j.compbiomed.2025.109766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2024] [Revised: 01/08/2025] [Accepted: 01/28/2025] [Indexed: 02/09/2025]
Abstract
BACKGROUND: Progress in rapid genomic sequencing techniques has transformed the field of disease biomarker identification by offering vast genetic information. The complexity of traits is influenced not only by single genetic loci but also by interactions among multiple genetic loci. When the dimensionality of SNP data is large, identifying a significant number of genetic variants associated with diseases becomes extremely challenging. To address these high-dimensionality issues, we employed functional data analysis techniques. METHODS: Because many ordered genetic variants are densely located within a small genomic region, the variants in such a region are treated as a continuous function rather than as discrete variables. This paper introduces a novel approach for analyzing the association of multiple genes within a region by employing an integrative functional logistic regression model. RESULTS: The proposed technique has shown promising results in both simulation and real data analysis, indicating its ability to generate smooth signals and accurately estimate the coefficient function while recognizing the null regions. CONCLUSIONS: The integrative functional logistic regression method adopts functional data analysis and assumes that high-dimensional genetic data follow a continuous process. It not only naturally accommodates correlations among adjacent SNPs but also avoids the unstable estimation of a large number of parameters. This is especially desirable given the rapidly increasing dimensions of SNP data and the still limited sample sizes. In summary, the suggested approach offers a valuable new avenue for identifying disease-related genetic variants in GWAS.
Affiliation(s)
- Wenyuan Sun
  - Department of Mathematics, College of Science, Yanbian University, Yanji, 133002, Jilin, China.
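The functional treatment of SNPs described in the abstract of entry 1 can be illustrated with a minimal sketch. It assumes SNP positions rescaled to [0, 1], a small cosine basis for the coefficient function, simulated genotypes, and plain unpenalized gradient descent. The basis choice, the function names, and the data are illustrative assumptions and do not reproduce the author's model or estimation procedure.

```python
import numpy as np

def cosine_basis(t, n_basis):
    """Evaluate n_basis cosine basis functions at positions t in [0, 1]."""
    k = np.arange(n_basis)
    return np.cos(np.pi * np.outer(t, k))  # shape (len(t), n_basis)

def functional_design(X, t, n_basis):
    """Approximate the integral of X_i(t) * phi_k(t) by an average over SNPs."""
    Phi = cosine_basis(t, n_basis)
    return X @ Phi / X.shape[1]  # shape (n_samples, n_basis)

def fit_logistic(Z, y, lr=0.5, n_iter=2000):
    """Unpenalized logistic regression by gradient descent (illustration only)."""
    Zb = np.column_stack([np.ones(len(Z)), Z])  # add an intercept column
    w = np.zeros(Zb.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-Zb @ w))
        w -= lr * Zb.T @ (prob - y) / len(y)
    return w

# Simulated data (hypothetical): 200 subjects, 500 ordered SNPs coded 0/1/2.
rng = np.random.default_rng(0)
n, p, K = 200, 500, 5
t = np.linspace(0.0, 1.0, p)
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)

# True coefficient function with a null region on [0.5, 1].
beta_true = np.where(t < 0.5, np.sin(2 * np.pi * t), 0.0)
eta = X @ beta_true / p
eta = 100.0 * (eta - eta.mean())  # arbitrary scaling to give a visible signal
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta))).astype(float)

# p = 500 SNP effects are replaced by K = 5 basis coefficients.
Z = functional_design(X, t, K)
w = fit_logistic(Z, y)
beta_hat = cosine_basis(t, K) @ w[1:]  # estimated coefficient function on the grid
```

The point of the sketch is the dimension reduction: the model estimates K basis coefficients instead of one parameter per SNP, and the smooth basis ties together the effects of adjacent SNPs.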
2
Li J, Zhang Q, Chen S, Fang K. Weighted multiple blockwise imputation method for high-dimensional regression with blockwise missing data. J Stat Comput Sim 2022. [DOI: 10.1080/00949655.2022.2109636] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2022]
Affiliation(s)
- Jingmao Li
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen, People's Republic of China
- Qingzhao Zhang
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen, People's Republic of China
  - The Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, People's Republic of China
- Song Chen
  - College of Micro-Finance, Taizhou University, Taizhou, People's Republic of China
- Kuangnan Fang
  - Department of Statistics and Data Science, School of Economics, Xiamen University, Xiamen, People's Republic of China
3
Wang JH, Wang KH, Chen YH. Overlapping group screening for detection of gene-environment interactions with application to TCGA high-dimensional survival genomic data. BMC Bioinformatics 2022; 23:202. [PMID: 35637439 PMCID: PMC9150322 DOI: 10.1186/s12859-022-04750-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2022] [Accepted: 05/25/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND: In the context of biomedical and epidemiological research, gene-environment (G-E) interaction is of great significance to the etiology and progression of many complex diseases. In high-dimensional genetic data, two general models, marginal and joint models, are proposed to identify important interaction factors. Most existing approaches for identifying G-E interactions are limited owing to the lack of robustness to outliers/contamination in response and predictor data. In particular, right-censored survival outcomes make the associated feature screening even more challenging. In this article, we utilize the overlapping group screening (OGS) approach to select important G-E interactions related to clinical survival outcomes by incorporating the gene pathway information under a joint modeling framework. RESULTS: Simulation studies under various scenarios are carried out to compare the performance of our proposed method with some commonly used methods. In the real data applications, we use our proposed method to identify G-E interactions related to the clinical survival outcomes of patients with head and neck squamous cell carcinoma and esophageal carcinoma in The Cancer Genome Atlas clinical survival genetic data, and further establish corresponding survival prediction models. Both simulation and real data studies show that our method performs well and outperforms existing methods in G-E interaction selection, effect estimation, and survival prediction accuracy. CONCLUSIONS: The OGS approach is useful for selecting important environmental factors, genes and G-E interactions in the ultra-high dimensional feature space. The prediction ability of OGS with the Lasso penalty is better than existing methods. The same idea of the OGS approach can be applied to other outcome models, such as the proportional odds survival time model, the logistic regression model for binary outcomes, and the multinomial logistic regression model for multi-class outcomes.
Affiliation(s)
- Jie-Huei Wang
  - Department of Statistics, Feng Chia University, Seatwen, Taichung, 40724, Taiwan.
- Kang-Hsin Wang
  - Department of Statistics, Feng Chia University, Seatwen, Taichung, 40724, Taiwan
- Yi-Hau Chen
  - Institute of Statistical Science, Academia Sinica, Nankang, Taipei, 11529, Taiwan
4
Li Z, Luo Z, Sun Y. Robust nonparametric integrative analysis to decipher heterogeneity and commonality across subgroups using sparse boosting. Stat Med 2022; 41:1658-1687. [PMID: 35072291 DOI: 10.1002/sim.9322] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/29/2021] [Revised: 11/20/2021] [Accepted: 01/03/2022] [Indexed: 11/09/2022]
Abstract
In many biomedical problems, data are often heterogeneous, with samples spanning multiple patient subgroups, where different subgroups may have different disease subtypes, stages, or other medical contexts. These subgroups may be related, but they are also expected to have differences with respect to the underlying biology. Such heterogeneous data present a precious opportunity to explore the heterogeneities and commonalities between related subgroups. Unfortunately, effective statistical analysis methods are still lacking. Recently, several novel methods based on integrative analysis have been proposed to tackle this challenging problem. Despite promising results, the existing studies are still limited because they ignore data contamination and make strict assumptions of linear effects of covariates on the response. As such, we develop a robust nonparametric integrative analysis approach to identify heterogeneity and commonality, as well as to select important covariates and estimate covariate effects. Possible data contamination is accommodated by adopting the Cauchy loss function, and a nonparametric model is built to accommodate nonlinear effects. The proposed approach is based on a sparse boosting technique. The advantages of the proposed approach are demonstrated in extensive simulations. The analysis of The Cancer Genome Atlas data on glioblastoma multiforme and lung adenocarcinoma shows that the proposed approach makes biologically meaningful findings with satisfactory prediction.
Affiliation(s)
- Zihan Li
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Ziye Luo
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Yifan Sun
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
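The robustness idea in the abstract of entry 4, accommodating contamination through the Cauchy loss, can be illustrated with a short sketch. The form used here, rho(r) = (c^2 / 2) * log(1 + (r / c)^2) with tuning constant c, is one common parameterization and is an assumption. The paper's exact loss and its sparse boosting procedure are not reproduced.

```python
import numpy as np

def cauchy_loss(r, c=1.0):
    """Cauchy loss (assumed form): grows logarithmically for large residuals."""
    r = np.asarray(r, dtype=float)
    return 0.5 * c**2 * np.log1p((r / c) ** 2)

def cauchy_grad(r, c=1.0):
    """Derivative with respect to r: r / (1 + (r/c)^2), bounded by c / 2."""
    r = np.asarray(r, dtype=float)
    return r / (1.0 + (r / c) ** 2)

# Compare with squared-error loss on residuals that include gross outliers.
residuals = np.array([0.0, 0.5, 1.0, 10.0, 100.0])
squared = 0.5 * residuals**2
cauchy = cauchy_loss(residuals)
grad = cauchy_grad(residuals)
```

Because the gradient is bounded, a single contaminated observation has limited influence on the fit, whereas under squared error its influence grows linearly with the residual.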
5
Wu M, Qin X, Ma S. GEInter: an R package for robust gene-environment interaction analysis. Bioinformatics 2021; 37:3691-3692. [PMID: 33961050 PMCID: PMC8545291 DOI: 10.1093/bioinformatics/btab318] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/12/2021] [Revised: 03/30/2021] [Accepted: 04/26/2021] [Indexed: 11/14/2022]
Abstract
SUMMARY: For understanding complex diseases, gene-environment (G-E) interactions have important implications beyond main G and E effects. Most of the existing analysis approaches and software packages cannot accommodate data contamination/long-tailed distribution. We develop GEInter, a comprehensive R package tailored to robust G-E interaction analysis. For both marginal and joint analysis, for data without and with missingness, for continuous and censored survival responses, it comprehensively conducts identification, estimation, visualization, and prediction. It can fill an important gap in the existing literature and enjoy broad applicability. AVAILABILITY AND IMPLEMENTATION: https://cran.r-project.org/web/packages/GEInter/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Xing Qin
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT 06520, USA
6
Zhou F, Ren J, Lu X, Ma S, Wu C. Gene-Environment Interaction: A Variable Selection Perspective. Methods Mol Biol 2021; 2212:191-223. [PMID: 33733358 DOI: 10.1007/978-1-0716-0947-7_13] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Indexed: 01/31/2023]
Abstract
Gene-environment interactions have important implications for elucidating the genetic basis of complex diseases beyond the joint function of multiple genetic factors and their interactions (or epistasis). In the past, G × E interaction studies have mainly been conducted within the framework of genetic association studies. The high dimensionality of G × E interactions, due to the complicated form of environmental effects and the presence of a large number of genetic factors including gene expressions and SNPs, has motivated the recent development of penalized variable selection methods for dissecting G × E interactions, which has been ignored in the majority of published reviews on genetic interaction studies. In this article, we first survey existing studies on both gene-environment and gene-gene interactions. Then, after a brief introduction to the variable selection methods, we review penalization and relevant variable selection methods in marginal and joint paradigms, respectively, under a variety of conceptual models. Discussions on strengths and limitations, as well as computational aspects of the variable selection methods tailored for G × E studies, are also provided.
Affiliation(s)
- Fei Zhou
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
- Jie Ren
  - Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
- Xi Lu
  - Department of Statistics, Kansas State University, Manhattan, KS, USA
- Shuangge Ma
  - Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, USA
- Cen Wu
  - Department of Statistics, Kansas State University, Manhattan, KS, USA.
7
Li Y, Wang F, Wu M, Ma S. Integrative functional linear model for genome-wide association studies with multiple traits. Biostatistics 2020; 23:574-590. [PMID: 33040145 DOI: 10.1093/biostatistics/kxaa043] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 02/20/2019] [Revised: 06/30/2020] [Accepted: 09/12/2020] [Indexed: 11/14/2022]
Abstract
In recent biomedical research, genome-wide association studies (GWAS) have demonstrated great success in investigating the genetic architecture of human diseases. For many complex diseases, multiple correlated traits have been collected. However, most of the existing GWAS are still limited because they analyze each trait separately without considering their correlations and suffer from a lack of sufficient information. Moreover, the high dimensionality of single nucleotide polymorphism (SNP) data still poses tremendous challenges to statistical methods, in both theoretical and practical aspects. In this article, we innovatively propose an integrative functional linear model for GWAS with multiple traits. This study is the first to approximate SNPs as functional objects in a joint model of multiple traits with penalization techniques. It effectively accommodates the high dimensionality of SNPs and correlations among multiple traits to facilitate information borrowing. Our extensive simulation studies demonstrate the satisfactory performance of the proposed method in the identification and estimation of disease-associated genetic variants, compared to four alternatives. The analysis of type 2 diabetes data leads to biologically meaningful findings with good prediction accuracy and selection stability.
Affiliation(s)
- Yang Li
  - Center for Applied Statistics, School of Statistics, and Statistical Consulting Center, Renmin University of China, Beijing 100872, China
- Fan Wang
  - Center for Applied Statistics, School of Statistics, and Statistical Consulting Center, Renmin University of China, Beijing 100872, China
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven 06520, USA
8
Penalized Variable Selection for Lipid-Environment Interactions in a Longitudinal Lipidomics Study. Genes (Basel) 2019; 10:genes10121002. [PMID: 31816972 PMCID: PMC6947406 DOI: 10.3390/genes10121002] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 11/07/2019] [Accepted: 11/26/2019] [Indexed: 12/20/2022]
Abstract
Lipid species are critical components of eukaryotic membranes. They play key roles in many biological processes such as signal transduction, cell homeostasis, and energy storage. Investigations of lipid-environment interactions, in addition to the lipid and environment main effects, have important implications in understanding the lipid metabolism and related changes in phenotype. In this study, we developed a novel penalized variable selection method to identify important lipid-environment interactions in a longitudinal lipidomics study. An efficient Newton-Raphson based algorithm was proposed within the generalized estimating equation (GEE) framework. We conducted extensive simulation studies to demonstrate the superior performance of our method over alternatives, in terms of both identification accuracy and prediction performance. As weight control via dietary calorie restriction and exercise has been demonstrated to prevent cancer in a variety of studies, analysis of the high-dimensional lipid datasets collected using 60 mice from the skin cancer prevention study identified meaningful markers that provide fresh insight into the underlying mechanism of cancer preventive effects.