1
Sun W. Integrative functional logistic regression model for genome-wide association studies. Comput Biol Med 2025; 187:109766. [PMID: 39919666 DOI: 10.1016/j.compbiomed.2025.109766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2024] [Revised: 01/08/2025] [Accepted: 01/28/2025] [Indexed: 02/09/2025]
Abstract
BACKGROUND: Progress in rapid genomic sequencing techniques has transformed the field of disease biomarker identification by offering vast genetic information. The complexity of traits is influenced not only by single genetic loci but also by interactions among multiple genetic loci. When the dimensionality of SNP data is large, identifying a significant number of genetic variants associated with diseases becomes extremely challenging. To address these high-dimensionality issues, we employed functional data analysis techniques. METHODS: Because many ordered genetic variants are spread across a small region, the genetic variants in such a region are treated as continuous data rather than as discrete variables. This paper introduces a novel approach for analyzing the association of multiple genes within a region by employing an integrative functional logistic regression model. RESULTS: The proposed technique has shown promising results in both simulation and real data analysis, indicating its ability to generate smooth signals and accurately estimate the coefficient function while recognizing the null regions. CONCLUSIONS: The integrative functional logistic regression method adopts functional data analysis and assumes that high-dimensional genetic data follow a continuous process. It not only naturally accommodates correlations among adjacent SNPs but also avoids the unstable estimation of a large number of parameters. This is especially desirable given the rapidly increasing dimensions of SNP data and the still limited sample sizes. In summary, the suggested approach offers a valuable new avenue for identifying disease-related genetic variants in GWAS.
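As a rough illustration of the functional view described in this abstract, the sketch below treats the SNP effects in a region as a smooth function of position, expands that function in a small basis, and passes the resulting scores through a logistic link. The Fourier basis, the function names (`fourier_basis`, `functional_scores`, `predict_probability`), and the rescaling of positions to [0, 1] are assumptions made for illustration only; the abstract does not state which basis or estimation procedure the paper uses.

```python
import math


def fourier_basis(t, n_basis):
    """Evaluate the first n_basis Fourier functions at position t in [0, 1].

    Assumed basis (illustrative): 1, sin(2*pi*t), cos(2*pi*t), sin(4*pi*t), ...
    """
    values = [1.0]
    k = 1
    while len(values) < n_basis:
        values.append(math.sin(2 * math.pi * k * t))
        if len(values) < n_basis:
            values.append(math.cos(2 * math.pi * k * t))
        k += 1
    return values


def functional_scores(genotypes, positions, n_basis):
    """Approximate z_k = integral of x(t) * phi_k(t) dt by an average over SNPs."""
    m = len(genotypes)
    scores = [0.0] * n_basis
    for x, t in zip(genotypes, positions):
        for k, phi in enumerate(fourier_basis(t, n_basis)):
            scores[k] += x * phi / m
    return scores


def predict_probability(scores, intercept, coefs):
    """Logistic link applied to the basis-expanded linear predictor."""
    eta = intercept + sum(c * z for c, z in zip(coefs, scores))
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical subject: six SNPs coded 0/1/2 at rescaled positions.
genotypes = [0, 1, 2, 1, 0, 0]
positions = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
z = functional_scores(genotypes, positions, n_basis=3)
p = predict_probability(z, intercept=-0.5, coefs=[0.8, -0.3, 0.1])
```

With only `n_basis` coefficients to estimate instead of one per SNP, the parameter count stays small as the number of SNPs grows, which is the motivation the abstract gives for the functional formulation.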
Affiliation(s)
- Wenyuan Sun
  - Department of Mathematics, College of Science, Yanbian University, Yanji, 133002, Jilin, China.
2
Wang F, Jia K, Li Y. Integrative deep learning with prior assisted feature selection. Stat Med 2024; 43:3792-3814. [PMID: 38923006 DOI: 10.1002/sim.10148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 04/23/2024] [Accepted: 06/07/2024] [Indexed: 06/28/2024]
Abstract
Integrative analysis has emerged as a prominent tool in biomedical research, offering a solution to the "small $n$ and large $p$" challenge. Leveraging the powerful capabilities of deep learning in extracting complex relationships between genes and diseases, our objective in this study is to incorporate deep learning into the framework of integrative analysis. Recognizing the redundancy within candidate features, we introduce a dedicated feature selection layer in the proposed integrative deep learning method. To further improve the performance of feature selection, the rich body of previous research is utilized by an ensemble learning method to identify "prior information". This leads to the proposed prior assisted integrative deep learning (PANDA) method. We demonstrate the superiority of the PANDA method through a series of simulation studies, showing its clear advantages over competing approaches in both feature selection and outcome prediction. Finally, a skin cutaneous melanoma (SKCM) dataset is extensively analyzed by the PANDA method to show its practical application.
Affiliation(s)
- Feifei Wang
  - Center for Applied Statistics, Renmin University of China, Beijing, China
  - School of Statistics, Renmin University of China, Beijing, China
- Ke Jia
  - School of Statistics, Renmin University of China, Beijing, China
- Yang Li
  - Center for Applied Statistics, Renmin University of China, Beijing, China
  - School of Statistics, Renmin University of China, Beijing, China
3
Mei B, Jiang Y, Sun Y. Unveiling Commonalities and Differences in Genetic Regulations via Two-Way Fusion. J Comput Biol 2024; 31:834-870. [PMID: 39133672 DOI: 10.1089/cmb.2023.0437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/10/2024] Open
Abstract
Understanding genetic regulation, for example, the regulation of gene expressions (GEs) by copy number variations and methylations, is crucial to uncovering the development and progression of complex diseases. Advancing from early studies that mostly focused on homogeneous groups of patients, some recent studies have shifted their focus toward different patient groups, explored their commonalities and differences, and led to insightful findings. However, the analysis can be very challenging, with one GE possibly regulated by multiple regulators and one regulator potentially regulating the expressions of multiple genes, leading to two distinct types of commonalities/differences in the patterns of genetic regulation. In addition, the high dimensionality of both sides of regulation poses challenges to computation. In this study, we develop a two-way fusion integrative analysis approach, which innovatively applies two fusion penalties to simultaneously identify commonalities/differences in the regulated pattern of GEs and the regulating pattern of regulators, and adopts a Huber loss function to accommodate possible data contamination. Moreover, a simple yet efficient iterative optimization algorithm is developed, which does not need to introduce any auxiliary variables or extra tuning parameters and is guaranteed to converge to a globally optimal solution. The advantages of the proposed approach are demonstrated in extensive simulations. The analysis of The Cancer Genome Atlas data on melanoma and lung cancer leads to interesting findings and satisfactory prediction performance.
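The two ingredients named in this abstract, a Huber loss and a fusion penalty, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names, the default `delta=1.345` (a commonly used tuning constant), and the simple all-pairs form of the fusion penalty are ours; the paper's two-way penalty structure and its optimization algorithm are not reproduced here.

```python
def huber_loss(residual, delta=1.345):
    """Huber loss: quadratic for small residuals, linear for large ones."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


def pairwise_fusion_penalty(coefs_by_group, lam):
    """lam times the sum of absolute differences of one coefficient across groups.

    A zero difference means two groups share the same effect (a commonality);
    a nonzero difference marks a group-specific effect (a difference).
    """
    total = 0.0
    n_groups = len(coefs_by_group)
    for i in range(n_groups):
        for j in range(i + 1, n_groups):
            total += abs(coefs_by_group[i] - coefs_by_group[j])
    return lam * total


# Hypothetical values: an outlying residual is penalized linearly, not quadratically.
small = huber_loss(1.0, delta=2.0)
large = huber_loss(3.0, delta=1.0)
# Groups 1 and 2 share a coefficient; group 3 differs.
penalty = pairwise_fusion_penalty([1.0, 1.0, 3.0], lam=0.5)
```

The linear tail of the Huber loss limits the influence of contaminated observations, while the absolute-difference penalty can shrink group differences exactly to zero.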
Affiliation(s)
- Biao Mei
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Yu Jiang
  - School of Public Health, University of Memphis, Memphis, Tennessee, USA
- Yifan Sun
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
  - Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, Beijing, China
4
Qin X, Hu J, Ma S, Wu M. Estimation of multiple networks with common structures in heterogeneous subgroups. J Multivariate Anal 2024; 202:105298. [PMID: 38433779 PMCID: PMC10907012 DOI: 10.1016/j.jmva.2024.105298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2024]
Abstract
Network estimation has been a critical component of high-dimensional data analysis and can provide an understanding of the underlying complex dependence structures. Among the existing studies, Gaussian graphical models have been highly popular. However, they still have limitations due to the homogeneous distribution assumption and the fact that they are only applicable to small-scale data. For example, cancers have various levels of unknown heterogeneity, and biological networks, which include thousands of molecular components, often differ across subgroups while also sharing some commonalities. In this article, we propose a new joint estimation approach for multiple networks with unknown sample heterogeneity, by decomposing the Gaussian graphical model (GGM) into a collection of sparse regression problems. A reparameterization technique and a composite minimax concave penalty are introduced to effectively accommodate the specific and common information across the networks of multiple subgroups. As a result, the proposed estimator advances significantly beyond existing heterogeneity network analysis based directly on the regularized likelihood of the GGM, and it enjoys scale-invariance, tuning-insensitivity, and optimization convexity properties. The proposed analysis can be effectively realized using parallel computing. The estimation and selection consistency properties are rigorously established. The proposed approach allows the theoretical studies to focus on independent network estimation only and has the significant advantage of being both theoretically and computationally applicable to large-scale data. Extensive numerical experiments with simulated data and the TCGA breast cancer data demonstrate the prominent performance of the proposed approach in both subgroup and network identification.
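For readers unfamiliar with the minimax concave penalty (MCP) mentioned in this abstract, the sketch below evaluates the standard scalar MCP. It is only the basic building block: the "composite" version used in the paper, and its reparameterization, are not shown, and the default `gamma=3.0` is an assumed illustrative value.

```python
def mcp_penalty(t, lam, gamma=3.0):
    """Scalar minimax concave penalty.

    Behaves like the lasso penalty lam * |t| near zero, then flattens to the
    constant gamma * lam**2 / 2 once |t| exceeds gamma * lam.
    """
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


# Hypothetical comparison with the lasso penalty for a large coefficient.
lasso_value = 1.0 * abs(5.0)
mcp_value = mcp_penalty(5.0, lam=1.0, gamma=3.0)
```

Because the penalty stops growing for large coefficients, strong signals are shrunk less than under the lasso, which is the usual reason for choosing MCP in sparse regression.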
Affiliation(s)
- Xing Qin
  - School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai, China
- Jianhua Hu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale School of Public Health, New Haven, USA
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
5
Wang X, Jiang Y, Sun Y. Revealing genomic heterogeneity and commonality: A penalized integrative analysis approach accounting for the adjacency structure of measurements. Genet Epidemiol 2024; 48:114-140. [PMID: 38317326 DOI: 10.1002/gepi.22549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Revised: 12/18/2023] [Accepted: 01/08/2024] [Indexed: 02/07/2024]
Abstract
Advancements in high-throughput genomic technologies have revolutionized the field of disease biomarker identification by providing large-scale genomic data. There is an increasing focus on understanding the relationships among diverse patient groups with distinct disease subtypes and characteristics. Complex diseases exhibit both heterogeneity and shared genomic factors, making it essential to investigate these patterns to accurately detect markers and comprehensively understand the diseases. Integrative analysis has emerged as a promising approach to address this challenge. However, existing studies have been limited by ignoring the adjacency structure of genomic measurements, such as single nucleotide polymorphisms (SNPs) and DNA methylations. In this study, we propose a structured integrative analysis method that incorporates a spline-type penalty to accommodate this adjacency structure. We utilize a fused-lasso-type penalty to identify both heterogeneity and commonality across the groups. Extensive simulations demonstrate its superiority over several directly competing methods. The analysis of The Cancer Genome Atlas melanoma data with DNA methylation measurements and GENEVA diabetes data with SNP measurements shows that the proposed analysis leads to meaningful findings with better prediction performance and higher selection stability.
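The two penalties described in this abstract can be illustrated with a small sketch. The exact penalty forms in the paper are not given in the abstract, so the squared second-difference penalty (a common spline-type smoothness penalty) and the elementwise absolute-difference fusion penalty below are assumptions, as are all function and parameter names.

```python
def second_difference_penalty(beta):
    """Spline-type smoothness penalty over adjacent measurements.

    Sums squared second differences, so coefficients that vary linearly
    along the genome incur no penalty.
    """
    return sum(
        (beta[j - 1] - 2.0 * beta[j] + beta[j + 1]) ** 2
        for j in range(1, len(beta) - 1)
    )


def group_fusion_penalty(beta_a, beta_b):
    """Fused-lasso-type penalty between two groups' coefficient vectors."""
    return sum(abs(a - b) for a, b in zip(beta_a, beta_b))


def total_penalty(beta_a, beta_b, lam_smooth, lam_fuse):
    """Combine within-group smoothness and between-group fusion."""
    smooth = second_difference_penalty(beta_a) + second_difference_penalty(beta_b)
    return lam_smooth * smooth + lam_fuse * group_fusion_penalty(beta_a, beta_b)


# Hypothetical coefficients for three adjacent SNPs in two patient groups.
value = total_penalty([0.0, 1.0, 0.0], [0.0, 1.0, 0.0], lam_smooth=1.0, lam_fuse=1.0)
```

In this toy example the two groups agree exactly, so the fusion term is zero and the whole penalty comes from the non-smooth spike at the middle SNP.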
Affiliation(s)
- Xindi Wang
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Yu Jiang
  - School of Public Health, The University of Memphis, Memphis, Tennessee, USA
- Yifan Sun
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
6
Wang P, Wang H, Li Q, Shen D, Liu Y. Joint and Individual Component Regression. J Comput Graph Stat 2023; 33:763-773. [PMID: 39526223 PMCID: PMC11545161 DOI: 10.1080/10618600.2023.2284227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Accepted: 10/30/2023] [Indexed: 11/16/2024]
Abstract
Multi-group data, which include the same set of variables on separate groups of samples, are commonly seen in practice. Such a data structure consists of data from multiple groups and can be challenging to analyze due to data heterogeneity. We propose a novel Joint and Individual Component Regression (JICO) model to analyze multi-group data. Our proposed model decomposes the response into shared and group-specific components, which are driven by low-rank approximations of joint and individual structures from the predictors, respectively. The joint structure has the same regression coefficients across multiple groups, whereas individual structures have group-specific regression coefficients. We formulate this framework under the representation of latent components and propose an iterative algorithm to solve for the joint and individual scores. We utilize Continuum Regression (CR) to estimate the latent scores, which provides a unified framework that covers Ordinary Least Squares (OLS), Partial Least Squares (PLS), and Principal Component Regression (PCR) as special cases. We show that JICO attains a good balance between global and group-specific models and remains flexible due to the use of CR. We conduct simulation studies and an analysis of an Alzheimer's disease dataset to further demonstrate the effectiveness of JICO. An R package for JICO is available online at https://CRAN.R-project.org/package=JICO.
Affiliation(s)
- Peiyao Wang
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill
- Haodong Wang
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill
- Quefeng Li
  - Department of Biostatistics, University of North Carolina at Chapel Hill
- Dinggang Shen
  - School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai 201210, China
  - Shanghai United Imaging Intelligence Co., Ltd., Shanghai 200230, China
  - Shanghai Clinical Research and Trial Center, Shanghai, 201210, China
- Yufeng Liu
  - Department of Statistics and Operations Research, University of North Carolina at Chapel Hill
  - Department of Biostatistics, University of North Carolina at Chapel Hill
  - Department of Genetics, University of North Carolina at Chapel Hill
7
Motwani K, Bacher R, Molstad AJ. Binned multinomial logistic regression for integrative cell-type annotation. Ann Appl Stat 2023; 17:3426-3449. [PMID: 40206429 PMCID: PMC11981643 DOI: 10.1214/23-aoas1769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/11/2025]
Abstract
Categorizing individual cells into one of many known cell type categories, also known as cell type annotation, is a critical step in the analysis of single-cell genomics data. The current process of annotation is time-intensive and subjective, which has led to different studies describing cell types with labels of varying degrees of resolution. While supervised learning approaches have provided automated solutions to annotation, there remains a significant challenge in fitting a unified model for multiple datasets with inconsistent labels. In this article, we propose a new multinomial logistic regression estimator which can be used to model cell type probabilities by integrating multiple datasets with labels of varying resolution. To compute our estimator, we solve a nonconvex optimization problem using a blockwise proximal gradient descent algorithm. We show through simulation studies that our approach estimates cell type probabilities more accurately than competitors in a wide variety of scenarios. We apply our method to ten single-cell RNA-seq datasets and demonstrate its utility in predicting fine resolution cell type labels on unlabeled data as well as refining cell type labels on data with existing coarse resolution annotations. Finally, we demonstrate that our method can lead to novel scientific insights in the context of a differential expression analysis comparing peripheral blood gene expression before and after treatment with interferon-β. An R package implementing the method is available at https://github.com/keshav-motwani/IBMR and the collection of datasets we analyze is available at https://github.com/keshav-motwani/AnnotatedPBMC.
Affiliation(s)
- Rhonda Bacher
  - Department of Biostatistics, University of Florida
  - Genetics Institute, University of Florida
- Aaron J. Molstad
  - Department of Statistics, University of Florida
  - Genetics Institute, University of Florida
8
Molstad AJ, Patra RK. Dimension reduction for integrative survival analysis. Biometrics 2023; 79:1610-1623. [PMID: 35964256 DOI: 10.1111/biom.13736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2021] [Accepted: 06/20/2022] [Indexed: 11/26/2022]
Abstract
We propose a constrained maximum partial likelihood estimator for dimension reduction in integrative (e.g., pan-cancer) survival analysis with high-dimensional predictors. We assume that for each population in the study, the hazard function follows a distinct Cox proportional hazards model. To borrow information across populations, we assume that each of the hazard functions depends only on a small number of linear combinations of the predictors (i.e., "factors"). We estimate these linear combinations using an algorithm based on "distance-to-set" penalties. This allows us to impose both low-rankness and sparsity on the regression coefficient matrix estimator. We derive asymptotic results that reveal that our estimator is more efficient than fitting a separate proportional hazards model for each population. Numerical experiments suggest that our method outperforms competitors under various data generating models. We use our method to perform a pan-cancer survival analysis relating protein expression to survival across 18 distinct cancer types. Our approach identifies six linear combinations, depending on only 20 proteins, which explain survival across the cancer types. Finally, to validate our fitted model, we show that our estimated factors can lead to better prediction than competitors on four external datasets.
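The objective underlying this abstract is the Cox partial likelihood, fitted separately per population and tied together through shared factors. The sketch below computes the log partial likelihood for a single population, assuming no tied event times; the function name and the toy data are hypothetical, and the low-rank and sparsity constraints ("distance-to-set" penalties) from the paper are not implemented.

```python
import math


def cox_partial_loglik(times, events, linear_predictors):
    """Log partial likelihood of a Cox model for one population (no ties assumed).

    times: observed follow-up times
    events: 1 if the event was observed, 0 if censored
    linear_predictors: x_i' beta for each subject
    """
    loglik = 0.0
    n = len(times)
    for i in range(n):
        if events[i]:
            # Risk set: subjects still under observation at time times[i].
            risk_sum = sum(
                math.exp(linear_predictors[j]) for j in range(n) if times[j] >= times[i]
            )
            loglik += linear_predictors[i] - math.log(risk_sum)
    return loglik


# Hypothetical toy data: three subjects, the last one censored, all effects zero.
value = cox_partial_loglik([1.0, 2.0, 3.0], [1, 1, 0], [0.0, 0.0, 0.0])
```

In an integrative setting one such term would be summed over populations, with the coefficient vectors constrained to share a small number of linear combinations of the predictors.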
Affiliation(s)
- Aaron J Molstad
  - Department of Statistics and Genetics Institute, University of Florida, Gainesville, Florida, USA
- Rohit K Patra
  - Department of Statistics, University of Florida, Gainesville, Florida, USA
9
Wang F, Liang D, Li Y, Ma S. Prior information-assisted integrative analysis of multiple datasets. Bioinformatics 2023; 39:btad452. [PMID: 37490475 PMCID: PMC10400378 DOI: 10.1093/bioinformatics/btad452] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2022] [Revised: 05/13/2023] [Accepted: 07/24/2023] [Indexed: 07/27/2023] Open
Abstract
MOTIVATION Analyzing genetic data to identify markers and construct predictive models is of great interest in biomedical research. However, limited by cost and sample availability, genetic studies often suffer from the "small sample size, high dimensionality" problem. To tackle this problem, an integrative analysis that collectively analyzes multiple datasets with compatible designs is often conducted. For regularizing estimation and selecting relevant variables, penalization and other regularization techniques are routinely adopted. "Blindly" searching over a vast number of variables may not be efficient. RESULTS We propose incorporating prior information to assist integrative analysis of multiple genetic datasets. To obtain accurate prior information, we adopt a convolutional neural network with an active learning strategy to label textual information from previous studies. Then the extracted prior information is incorporated using a group LASSO-based technique. We conducted a series of simulation studies that demonstrated the satisfactory performance of the proposed method. Finally, data on skin cutaneous melanoma are analyzed to establish practical utility. AVAILABILITY AND IMPLEMENTATION Code is available at https://github.com/ldz7/PAIA. The data that support the findings in this article are openly available in TCGA (The Cancer Genome Atlas) at https://portal.gdc.cancer.gov/.
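A group LASSO penalty, as mentioned in this abstract, selects or drops a variable's coefficients across all datasets at once. The sketch below shows the standard group soft-thresholding operator that such penalties rely on. How the paper weights variables using prior information is not described in the abstract; the idea in the usage example of giving prior-supported variables a smaller threshold is our assumption, as are the names and numbers.

```python
import math


def group_soft_threshold(group, threshold):
    """Proximal operator of threshold * (Euclidean norm of the group).

    Shrinks the whole group toward zero and sets it exactly to zero
    when its norm does not exceed the threshold.
    """
    norm = math.sqrt(sum(b * b for b in group))
    if norm <= threshold:
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * b for b in group]


# One gene's coefficients in two datasets (hypothetical values).
coefs = [3.0, 4.0]
# Assumed scheme: a gene supported by prior information gets a smaller threshold.
with_prior = group_soft_threshold(coefs, threshold=0.5)
without_prior = group_soft_threshold(coefs, threshold=2.5)
weak_signal = group_soft_threshold([0.3, 0.4], threshold=1.0)
```

Because the threshold acts on the norm of the whole group, a gene is either kept in all datasets or removed from all of them, which is what makes the penalty suitable for integrative analysis.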
Affiliation(s)
- Feifei Wang
  - Center for Applied Statistics, Renmin University of China, Beijing 100872, China
  - School of Statistics, Renmin University of China, Beijing 100872, China
  - Institute for Data Science in Health, Renmin University of China, Beijing 100872, China
- Dongzuo Liang
  - School of Statistics, Renmin University of China, Beijing 100872, China
  - RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing 100872, China
- Yang Li
  - Center for Applied Statistics, Renmin University of China, Beijing 100872, China
  - School of Statistics, Renmin University of China, Beijing 100872, China
  - RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing 100872, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT 06520, United States
10
Chang C, Dai Z, Oh J, Long Q. Integrative Learning of Structured High-Dimensional Data from Multiple Datasets. Stat Anal Data Min 2023; 16:120-134. [PMID: 37213790 PMCID: PMC10195070 DOI: 10.1002/sam.11601] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2021] [Accepted: 10/14/2022] [Indexed: 11/11/2022]
Abstract
Integrative learning of multiple datasets has the potential to mitigate the challenge of small n and large p that is often encountered in analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow heterogeneous sparsity structure where a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach which can not only aggregate important signals well in homogeneous sparsity structure, but also substantially alleviate the problem of losing weak important signals in heterogeneous sparsity structure. Our approach exploits a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances the power, while also accounting for the heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and analysis of gene expression data from ADNI.
Affiliation(s)
- Changgee Chang
  - Perelman School of Medicine, University of Pennsylvania, Pennsylvania, U.S.A
- Zongyu Dai
  - School of Arts and Science, University of Pennsylvania, Pennsylvania, U.S.A
- Jihwan Oh
  - Perelman School of Medicine, University of Pennsylvania, Pennsylvania, U.S.A
- Qi Long
  - Perelman School of Medicine, University of Pennsylvania, Pennsylvania, U.S.A
11
Li Z, Luo Z, Sun Y. Robust nonparametric integrative analysis to decipher heterogeneity and commonality across subgroups using sparse boosting. Stat Med 2022; 41:1658-1687. [PMID: 35072291 DOI: 10.1002/sim.9322] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/29/2021] [Revised: 11/20/2021] [Accepted: 01/03/2022] [Indexed: 11/09/2022]
Abstract
In many biomedical problems, data are often heterogeneous, with samples spanning multiple patient subgroups, where different subgroups may have different disease subtypes, stages, or other medical contexts. These subgroups may be related, but they are also expected to have differences with respect to the underlying biology. The heterogeneous data presents a precious opportunity to explore the heterogeneities and commonalities between related subgroups. Unfortunately, effective statistical analysis methods are still lacking. Recently, several novel methods based on integrative analysis have been proposed to tackle this challenging problem. Despite promising results, the existing studies are still limited by ignoring data contamination and making strict assumptions of linear effects of covariates on response. As such, we develop a robust nonparametric integrative analysis approach to identify heterogeneity and commonality, as well as select important covariates and estimate covariate effects. Possible data contamination is accommodated by adopting the Cauchy loss function, and a nonparametric model is built to accommodate nonlinear effects. The proposed approach is based on a sparse boosting technique. The advantages of the proposed approach are demonstrated in extensive simulations. The analysis of The Cancer Genome Atlas data on glioblastoma multiforme and lung adenocarcinoma shows that the proposed approach makes biologically meaningful findings with satisfactory prediction.
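The robustness argument in this abstract rests on the Cauchy loss growing only logarithmically in the residual. A minimal sketch comparing it with the squared loss follows; the scale parameter `c` and its default are assumed illustrative choices, and the sparse boosting procedure itself is not shown.

```python
import math


def cauchy_loss(residual, c=1.0):
    """Cauchy loss: (c**2 / 2) * log(1 + (residual / c)**2)."""
    return 0.5 * c * c * math.log(1.0 + (residual / c) ** 2)


def squared_loss(residual):
    """Ordinary least-squares loss for comparison."""
    return 0.5 * residual * residual


# Hypothetical contaminated observation with a residual of 10.
robust = cauchy_loss(10.0)
non_robust = squared_loss(10.0)
```

For small residuals the two losses nearly coincide, while for a contaminated observation the Cauchy loss is far smaller, so a single outlier has limited influence on the fit.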
Affiliation(s)
- Zihan Li
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Ziye Luo
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Yifan Sun
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
12
Bi X, Feng L, Li C, Zhang H. Modeling Pregnancy Outcomes through Sequentially Nested Regression Models. J Am Stat Assoc 2022; 117:602-616. [PMID: 36090951 PMCID: PMC9454338 DOI: 10.1080/01621459.2021.2006666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/07/2023]
Abstract
Polycystic ovary syndrome (PCOS) is one of the most common causes of infertility among women of reproductive age. Unfortunately, the etiology of PCOS is poorly understood. Large-scale clinical trials for Pregnancy in Polycystic Ovary Syndrome (PPCOS) were conducted to evaluate the effectiveness of treatments. Ovulation, pregnancy, and live birth are three sequentially nested binary outcomes, typically analyzed separately. However, the separate models may lose power in detecting the treatment effects and influential variables for live birth, due to decreased sample sizes and unbalanced event counts. It has been a long-held hypothesis among clinicians that some of the important variables for early pregnancy outcomes may continue to influence live birth. To consider this possibility, we develop an $\ell_0$-norm based regularization method that favors variables identified at an earlier stage. Our approach explicitly bridges the connections across nested outcomes through computationally easy algorithms and enjoys theoretical guarantees for estimation and variable selection. By analyzing the PPCOS data, we successfully uncover the hidden influence of risk factors on live birth, which confirms clinical experience. Moreover, we provide novel infertility treatment recommendations (e.g., letrozole vs clomiphene citrate) for women with PCOS to improve their chances of live birth.
13
14
Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief Bioinform 2021; 22:bbaa169. [PMID: 32793970 PMCID: PMC8138889 DOI: 10.1093/bib/bbaa169] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/24/2020] [Revised: 06/18/2020] [Accepted: 07/04/2020] [Indexed: 12/12/2022] Open
Abstract
Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a 'lack of information' problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics
- Huangdi Yi
  - Department of Biostatistics at Yale University
- Shuangge Ma
  - Department of Biostatistics at Yale University
15
Abstract
The finite mixture of regression (FMR) model is a popular tool for accommodating data heterogeneity. In the analysis of FMR models with high-dimensional covariates, it is necessary to conduct regularized estimation and identify important covariates rather than noises. In the literature, there has been a lack of attention paid to the differences among important covariates, which can lead to the underlying structure of covariate effects. Specifically, important covariates can be classified into two types: those that behave the same in different subpopulations and those that behave differently. It is of interest to conduct structured analysis to identify such structures, which will enable researchers to better understand covariates and their associations with outcomes. Specifically, the FMR model with high-dimensional covariates is considered. A structured penalization approach is developed for regularized estimation, selection of important variables, and, equally importantly, identification of the underlying covariate effect structure. The proposed approach can be effectively realized, and its statistical properties are rigorously established. Simulation demonstrates its superiority over alternatives. In the analysis of cancer gene expression data, interesting models/structures missed by the existing analysis are identified.
Affiliation(s)
- Mengque Liu
  - School of Journalism and New Media, Xi’an Jiaotong University
- Qingzhao Zhang
  - School of Economics, Xiamen University
  - Wang Yanan Institute for Studies in Economics, Xiamen University
- Shuangge Ma
  - School of Economics, Xiamen University
  - Department of Biostatistics, Yale University
16
Chang C, Oh J, Long Q. GRIA: Graphical Regularization for Integrative Analysis. Proceedings of the ... SIAM International Conference on Data Mining. SIAM International Conference on Data Mining 2020; 2020:604-612. [PMID: 32440369 PMCID: PMC7241091 DOI: 10.1137/1.9781611976236.68] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 06/11/2023]
Abstract
Integrative analysis jointly analyzes multiple data sets to overcome the curse of dimensionality. It can detect important but weak signals by jointly selecting features for all data sets, but unfortunately the sets of important features are not always the same for all data sets. Variations that allow a heterogeneous sparsity structure (a subset of data sets can have a zero coefficient for a selected feature) have been proposed, but they compromise the effect of integrative analysis, bringing back the problem of losing weak important signals. We propose a new integrative analysis approach which not only aggregates weak important signals well in the homogeneity setting but also substantially alleviates the problem of losing weak important signals in the heterogeneity setting. Our approach exploits the a priori known graphical structure of features by forcing joint selection of adjacent features, and integrating such information over multiple data sets can increase the power while taking into account the heterogeneity across data sets. We confirm the problem of existing approaches and demonstrate the superiority of our method through a simulation study and an application to gene expression data from ADNI.
Affiliation(s)
- Changgee Chang
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania
- Jihwan Oh
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania
- Qi Long
  - Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania
17
Wu M, Zhang Q, Ma S. Structured gene-environment interaction analysis. Biometrics 2019; 76:23-35. [PMID: 31424088 DOI: 10.1111/biom.13139] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 10/18/2018] [Accepted: 08/06/2019] [Indexed: 01/03/2023]
Abstract
For the etiology, progression, and treatment of complex diseases, gene-environment (G-E) interactions have important implications beyond the main G and E effects. G-E interaction analysis can be more challenging with higher dimensionality and need for accommodating the "main effects, interactions" hierarchy. In recent literature, an array of novel methods, many of which are based on the penalization technique, have been developed. In most of these studies, however, the structures of G measurements, for example, the adjacency structure of single nucleotide polymorphisms (SNPs; attributable to their physical adjacency on the chromosomes) and the network structure of gene expressions (attributable to their coordinated biological functions and correlated measurements) have not been well accommodated. In this study, we develop structured G-E interaction analysis, where such structures are accommodated using penalization for both the main G effects and interactions. Penalization is also applied for regularized estimation and selection. The proposed structured interaction analysis can be effectively realized. It is shown to have consistency properties under high-dimensional settings. Simulations and analysis of GENEVA diabetes data with SNP measurements and TCGA melanoma data with gene expression measurements demonstrate its competitive practical performance.
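As a rough illustration of how an adjacency structure among SNPs can enter a penalty, the sketch below combines a lasso term with squared differences between neighboring coefficients. This is a generic smoothness-type penalty chosen for illustration; the abstract does not state the exact penalty used in the paper, and the function name `adjacency_penalty` and the tuning parameters `lam_sparse` and `lam_smooth` are hypothetical.

```python
import numpy as np


def adjacency_penalty(beta, lam_sparse, lam_smooth):
    """Illustrative penalty: lasso term plus squared differences of adjacent effects.

    Assumption: SNPs are ordered by chromosomal position, so beta[j] and
    beta[j + 1] belong to physically adjacent SNPs.
    """
    beta = np.asarray(beta, dtype=float)
    sparsity = lam_sparse * np.abs(beta).sum()            # encourages selection
    smoothness = lam_smooth * np.sum(np.diff(beta) ** 2)  # encourages similar neighbors
    return sparsity + smoothness


# Two coefficient profiles with identical absolute sizes but different smoothness.
beta_smooth = [0.0, 0.5, 0.6, 0.5, 0.0]
beta_rough = [0.0, 0.5, -0.6, 0.5, 0.0]

print(adjacency_penalty(beta_smooth, 1.0, 1.0))
print(adjacency_penalty(beta_rough, 1.0, 1.0))
```

Both profiles have the same lasso term, so the difference in penalty comes only from the adjacency term, which is larger for the profile whose neighboring effects change sign.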
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
  - Department of Biostatistics, Yale University, New Haven, Connecticut
- Qingzhao Zhang
  - School of Economics and Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, Connecticut
18
Sun Y, Sun Z, Jiang Y, Li Y, Ma S. An integrative sparse boosting analysis of cancer genomic commonality and difference. Stat Methods Med Res 2019; 29:1325-1337. [PMID: 31282286 DOI: 10.1177/0962280219859026] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Indexed: 12/11/2022]
Abstract
In cancer research, high-throughput profiling has been extensively conducted. In recent studies, the integrative analysis of data on multiple cancer patient groups/subgroups has been conducted. Such analysis has the potential to reveal the genomic commonality as well as difference across groups/subgroups. However, in the existing literature, methods that pay special attention to the genomic commonality and difference are very limited. In this study, a novel estimation and marker selection method based on the sparse boosting technique is developed to address the commonality/difference problem. In terms of technical innovation, a new penalty and a new computation of increments are introduced. The proposed method can also effectively accommodate the grouping structure of covariates. Simulation shows that it can outperform direct competitors under a wide spectrum of settings. The analysis of two TCGA (The Cancer Genome Atlas) datasets is conducted, showing that the proposed analysis can identify markers with important biological implications and has satisfactory prediction and stability.
Affiliation(s)
- Yifan Sun
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Zhengyang Sun
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Yu Jiang
  - School of Public Health, University of Memphis, Tennessee, USA
- Yang Li
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
- Shuangge Ma
  - Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing, China
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
19
Li Y, Li R, Lin C, Qin Y, Ma S. Penalized integrative semiparametric interaction analysis for multiple genetic datasets. Stat Med 2019; 38:3221-3242. [PMID: 30993736 DOI: 10.1002/sim.8172] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 11/09/2018] [Revised: 02/08/2019] [Accepted: 03/27/2019] [Indexed: 12/19/2022]
Abstract
In this article, we consider a semiparametric additive partially linear interaction model for the integrative analysis of multiple genetic datasets. The goals are to identify important genetic predictors and gene-gene interactions and, at the same time, to estimate the nonparametric functions that describe the environmental effects. To find the similarities and differences of the genetic effects across different datasets, we impose a group structure on the regression coefficient matrix under the homogeneity assumption, i.e., models for different datasets share the same sparsity structure, but the coefficients may differ across datasets. We develop an iterative approach to estimate the parameters of main effects, interactions, and nonparametric functions, where a reparametrization of interaction parameters is implemented to meet the strong hierarchy assumption. We demonstrate the advantages of the proposed method in identification, estimation, and prediction in a series of numerical studies. We also apply the proposed method to the Skin Cutaneous Melanoma data and the lung cancer data from the Cancer Genome Atlas.
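The homogeneity assumption described above (a shared sparsity pattern, with coefficient values free to differ across datasets) is commonly encoded with a row-wise group penalty on the coefficient matrix. The sketch below uses a plain group-lasso norm as a stand-in; the abstract does not give the paper's exact penalty, and `row_group_penalty`, `selected_rows`, and the example matrix `B` are hypothetical.

```python
import numpy as np


def row_group_penalty(B, lam):
    """Illustrative group-lasso penalty on a (p predictors x M datasets) matrix.

    Each row collects one predictor's coefficients across all datasets.
    Penalizing the row's L2 norm tends to zero out the whole row at once,
    so a predictor is selected in all datasets or in none.
    """
    B = np.asarray(B, dtype=float)
    return lam * np.linalg.norm(B, axis=1).sum()


def selected_rows(B):
    """Indices of predictors with a nonzero effect in at least one dataset."""
    B = np.asarray(B, dtype=float)
    return np.flatnonzero(np.linalg.norm(B, axis=1) > 0)


# Three predictors, two datasets; the second predictor is excluded everywhere.
B = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [1.0, 0.0]])

print(row_group_penalty(B, lam=0.5))
print(selected_rows(B))
```

Under a pure row-wise group penalty, a row is either entirely zero or generally nonzero in every dataset; the third row of `B`, which has a zero entry inside a selected row, is shown only to make clear how the row norms are computed.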
Affiliation(s)
- Yang Li
  - Center for Applied Statistics, Renmin University of China, Beijing, China
  - School of Statistics, Renmin University of China, Beijing, China
  - Statistical Consulting Center, Renmin University of China, Beijing, China
- Rong Li
  - School of Statistics, Renmin University of China, Beijing, China
  - Statistical Consulting Center, Renmin University of China, Beijing, China
- Cunjie Lin
  - Center for Applied Statistics, Renmin University of China, Beijing, China
  - School of Statistics, Renmin University of China, Beijing, China
  - Statistical Consulting Center, Renmin University of China, Beijing, China
- Yichen Qin
  - Department of Operations, Business Analytics and Information Systems, University of Cincinnati, Cincinnati, Ohio
- Shuangge Ma
  - School of Statistics, Renmin University of China, Beijing, China
  - Department of Biostatistics, Yale University, New Haven, Connecticut
20
Shi X, Ma S, Huang Y. Promoting sign consistency in the cure model estimation and selection. Stat Methods Med Res 2019; 29:15-28. [PMID: 30600776 DOI: 10.1177/0962280218820356] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 11/15/2022]
Abstract
In survival analysis, when a subset of subjects has extremely long survival, the two-part cure rate model has been commonly adopted. In the two-part model, the first part is for a binary response and describes the probability of cure. The second part is for a survival response and describes the probability of survival. Despite their intuitive interconnections, most of the existing works estimate the two parts without any constraint. The existing works on proportionality promote similarity in magnitudes (i.e. quantitative similarity) and can be too restrictive. In this study, for the two-part cure rate model, we propose imposing a sign-based penalty to promote similarity in signs (i.e. qualitative similarity). The proposed strategy can be more informative than those that neglect the two-part interconnections and be less restrictive than the existing proportionality works. Penalty is also imposed to select relevant variables and accommodate high-dimensional data. Numerical studies, including simulation and two data analyses, demonstrate the advantageous performance of the proposed approach.
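To make the idea of promoting similarity in signs rather than in magnitudes concrete, the sketch below penalizes the squared difference between smoothed signs of the two parts' coefficients. It is only one possible construction under stated assumptions: the abstract does not specify the penalty's form, and `smooth_sign`, `sign_penalty`, the smoothing constant `tau`, and the coefficient names `alpha` (cure part) and `beta` (survival part) are all hypothetical.

```python
import numpy as np


def smooth_sign(x, tau=0.01):
    """Differentiable approximation of sign(x); approaches +/-1 when |x| >> tau."""
    x = np.asarray(x, dtype=float)
    return x / np.sqrt(x ** 2 + tau ** 2)


def sign_penalty(alpha, beta, lam, tau=0.01):
    """Illustrative sign-consistency penalty between two coefficient vectors.

    alpha: coefficients of the binary (cure) part, hypothetical name.
    beta:  coefficients of the survival part, hypothetical name.
    The penalty is near 0 when alpha[j] and beta[j] share a sign, whatever
    their magnitudes, and near 4 * lam per covariate when the signs disagree.
    """
    diff = smooth_sign(alpha, tau) - smooth_sign(beta, tau)
    return lam * np.sum(diff ** 2)


# Same signs with very different magnitudes: almost no penalty.
print(sign_penalty([1.0, 2.0], [0.2, 5.0], lam=1.0))

# Opposite signs with equal magnitudes: heavily penalized.
print(sign_penalty([1.0], [-1.0], lam=1.0))
```

A proportionality constraint would instead tie the magnitudes of `alpha` and `beta` together, which is the more restrictive alternative the abstract contrasts against.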
Affiliation(s)
- Xingjie Shi
  - Department of Statistics, Nanjing University of Finance and Economics, Nanjing, Jiangsu, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT, USA
- Yuan Huang
  - Department of Biostatistics, University of Iowa, Iowa City, IA, USA
21
Sun Y, Jiang Y, Li Y, Ma S. Identification of cancer omics commonality and difference via community fusion. Stat Med 2018; 38:1200-1212. [PMID: 30421444 DOI: 10.1002/sim.8027] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 04/03/2018] [Revised: 10/06/2018] [Accepted: 10/13/2018] [Indexed: 12/18/2022]
Abstract
The analysis of cancer omics data is a "classic" problem; however, it still remains challenging. Advancing from early studies that are mostly focused on a single type of cancer, some recent studies have analyzed data on multiple "related" cancer types/subtypes, examined their commonality and difference, and led to insightful findings. In this article, we consider the analysis of multiple omics datasets, with each dataset on one type/subtype of "related" cancers. A Community Fusion (CoFu) approach is developed, which conducts marker selection and model building using a novel penalization technique, informatively accommodates the network community structure of omics measurements, and automatically identifies the commonality and difference of cancer omics markers. Simulation demonstrates its superiority over direct competitors. The analysis of TCGA lung cancer and melanoma data leads to interesting findings.
Affiliation(s)
- Yifan Sun
  - Center for Applied Statistics, Renmin University of China, Beijing, China
  - School of Statistics, Renmin University of China, Beijing, China
- Yu Jiang
  - School of Public Health, The University of Memphis, Memphis, Tennessee
- Yang Li
  - Center for Applied Statistics, Renmin University of China, Beijing, China
  - School of Statistics, Renmin University of China, Beijing, China
  - Statistical Consulting Center, Renmin University of China, Beijing, China
- Shuangge Ma
  - School of Statistics, Renmin University of China, Beijing, China
  - Department of Biostatistics, Yale University, New Haven, Connecticut
22
Kontar R, Zhou S, Sankavaram C, Du X, Zhang Y. Nonparametric Modeling and Prognosis of Condition Monitoring Signals Using Multivariate Gaussian Convolution Processes. Technometrics 2018. [DOI: 10.1080/00401706.2017.1383310] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Indexed: 10/18/2022]
Affiliation(s)
- Raed Kontar
  - Department of Industrial and Systems Engineering, University of Wisconsin-Madison, Madison, WI
- Shiyu Zhou
  - Department of Industrial and Systems Engineering, University of Wisconsin-Madison, Madison, WI
- Xinyu Du
  - General Motors Research & Development, Warren, MI
- Yilu Zhang
  - General Motors Research & Development, Warren, MI