1
Ge Y, Li T, Feng X, Wu M, Liu H. Structured feature ranking for genomic marker identification accommodating multiple types of networks. Biometrics 2024; 80:ujae158. [PMID: 39745855 DOI: 10.1093/biomtc/ujae158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2023] [Revised: 10/04/2024] [Accepted: 12/12/2024] [Indexed: 01/04/2025]
Abstract
Numerous statistical methods have been developed to search for genomic markers associated with the development, progression, and response to treatment of complex diseases. Among them, feature ranking plays a vital role due to its intuitive formulation and computational efficiency. However, most existing methods are based on the marginal importance of molecular predictors and share a limitation: the dependence (network) structures among predictors are not well accommodated, even though a disease phenotype usually reflects various biological processes that interact in a complex network. In this paper, we propose a structured feature ranking method for identifying genomic markers, in which such network structures are accommodated using Laplacian regularization. The proposed method covers multiple network scenarios, where the networks can be known a priori or estimated from the data. In addition, we rigorously explore the noise and uncertainty in the networks and control their impact through proper selection of tuning parameters. These characteristics give the proposed method especially broad applicability. Theoretical results for our proposal are rigorously established. Compared to the original marginal measure, the proposed network-structured measure can achieve sure screening properties with a faster convergence rate under mild conditions. Extensive simulations and an analysis of The Cancer Genome Atlas melanoma data demonstrate the improved finite-sample performance and practical usefulness of the proposed method.
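As a rough illustration of the Laplacian regularization mentioned in this abstract, the sketch below computes the standard graph Laplacian quadratic form, which penalizes differences between the effects of connected genes. It is a generic textbook construction, not the ranking statistic of the paper; the function names, the 3-gene chain network, and the effect vectors are assumptions made up for the example.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Unnormalized Laplacian L = D - A of a symmetric (weighted) adjacency matrix."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency

def laplacian_penalty(beta, adjacency):
    """Quadratic form beta' L beta = sum over edges of a_ij * (beta_i - beta_j)^2."""
    beta = np.asarray(beta, dtype=float)
    return float(beta @ graph_laplacian(adjacency) @ beta)

# Hypothetical 3-gene chain network: gene 0 - gene 1 - gene 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])

smooth = laplacian_penalty([1.0, 1.0, 1.0], A)   # linked genes share the same effect
rough = laplacian_penalty([1.0, -1.0, 1.0], A)   # linked genes have opposite effects
print(smooth, rough)
```

The penalty is zero when connected genes carry identical effects and grows as their effects diverge, which is the sense in which such a term encourages estimates that vary smoothly over the network.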
Affiliation(s)
- Yeheng Ge
  - School of Statistics and Data Science, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai 200433, China
- Tao Li
  - School of Statistics and Data Science, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai 200433, China
- Xingdong Feng
  - School of Statistics and Data Science, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai 200433, China
- Mengyun Wu
  - School of Statistics and Data Science, Shanghai University of Finance and Economics, 777 Guoding Road, Shanghai 200433, China
- Hailong Liu
  - Department of Urology, Xinhua Hospital affiliated to Shanghai Jiao Tong University School of Medicine, 1665 Kongjiang Road, Shanghai 200092, China
2
Tan X, Zhang X, Cui Y, Liu X. Uncertainty quantification in high-dimensional linear models incorporating graphical structures with applications to gene set analysis. Bioinformatics 2024; 40:btae541. [PMID: 39254590 PMCID: PMC11434165 DOI: 10.1093/bioinformatics/btae541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2024] [Revised: 08/04/2024] [Accepted: 09/06/2024] [Indexed: 09/11/2024]
Abstract
MOTIVATION: The functions of genes in networks are typically correlated due to their functional connectivity. Variable selection methods have been developed to select important genes associated with a trait while incorporating network graphical information. However, no method has been proposed to quantify the uncertainty of individual genes in such settings. RESULTS: In this paper, we construct confidence intervals (CIs) and provide P-values for the parameters of a high-dimensional linear model incorporating graphical structures, where the number of variables p diverges with the number of observations. To incorporate the graphical information, we propose a graph-constrained desparsified LASSO (least absolute shrinkage and selection operator) (GCDL) estimator, which dramatically reduces the influence of high correlation among predictors and offers faster computation and higher accuracy than the desparsified LASSO. Theoretical results show that the GCDL estimator achieves asymptotic normality. A uniform convergence property is also established, from which an explicit expression for the uniform CI can be derived. Extensive numerical results indicate that the GCDL estimator and its (uniform) CI perform well even when predictors are highly correlated. AVAILABILITY AND IMPLEMENTATION: An R package implementing the proposed method is available at https://github.com/XiaoZhangryy/gcdl.
Affiliation(s)
- Xiangyong Tan
  - School of Statistics and Data Science, Jiangxi University of Finance and Economics, Nanchang 330013, China
- Xiao Zhang
  - School of Data Science, The Chinese University of Hong Kong, Shenzhen 518172, China
- Yuehua Cui
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, United States
- Xu Liu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai 200433, China
  - Yunnan Key Laboratory of Statistical Modeling and Data Analysis, Yunnan University, Kunming 650500, China
3
Zhu B, Zhang Z, Leung SY, Fan X. NetMIM: network-based multi-omics integration with block missingness for biomarker selection and disease outcome prediction. Brief Bioinform 2024; 25:bbae454. [PMID: 39288230 PMCID: PMC11407451 DOI: 10.1093/bib/bbae454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2023] [Revised: 07/24/2024] [Accepted: 08/30/2024] [Indexed: 09/19/2024] Open
Abstract
Compared with analyzing omics data from a single platform, an integrative analysis of multi-omics data provides a more comprehensive understanding of the regulatory relationships among biological features associated with complex diseases. However, most existing frameworks for integrative analysis overlook two crucial aspects of multi-omics data. First, they neglect the known dependencies among biological features that are recorded in highly credible biological databases. Second, to handle block missingness, most of them simply remove the subjects without full omics data, which reduces statistical power. To overcome these issues, we propose a network-based integrative Bayesian framework for biomarker selection and disease outcome prediction based on multi-omics data. Our framework uses a Dirac spike-and-slab variable selection prior to identify a small subset of biomarkers. The incorporation of gene pathway information improves the interpretability of feature selection. Furthermore, following the strategy of the FBM (which stands for "full Bayesian model with missingness") model, where missing omics data are augmented via a mechanistic model, our framework handles block missingness in multi-omics data via a data augmentation approach. A real-data application illustrates that our approach, which incorporates existing gene pathway information and includes subjects without DNA methylation data, yields more interpretable feature selection results and more accurate predictions.
Affiliation(s)
- Bencong Zhu
  - Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Zhen Zhang
  - Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
- Suet Yi Leung
  - Department of Pathology, School of Clinical Medicine, LKS Faculty of Medicine, The University of Hong Kong, Queen Mary Hospital, Hong Kong SAR, China
- Xiaodan Fan
  - Department of Statistics, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
4
Qu J, Cui Y. Gene set analysis with graph-embedded kernel association test. Bioinformatics 2021; 38:1560-1567. [PMID: 34935928 PMCID: PMC8896609 DOI: 10.1093/bioinformatics/btab851] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 09/24/2021] [Revised: 11/20/2021] [Accepted: 12/16/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION: The kernel-based association test (KAT) has been a popular approach for evaluating the association between the expression of a gene set (e.g. a pathway) and a phenotypic trait. KATs rely on kernel functions, which measure sample similarity across multiple features, to capture potential linear or non-linear relationships among the features in a gene set. When the kernel functions are calculated, no network graphical information about the features is considered. Because genes in a functional group (e.g. a pathway) are generally not independent due to regulatory interactions, incorporating regulatory network (or graph) information can potentially increase the power of KAT. In this work, we propose a graph-embedded kernel association test, termed gKAT, which incorporates prior pathway knowledge into hypothesis testing when constructing the kernel function. RESULTS: We apply a diffusion kernel to capture the graph structure in a gene set, then incorporate this information to build a kernel function for the subsequent association test. We illustrate the geometric meaning of the approach. Through extensive simulation studies, we show that the proposed gKAT algorithm can improve testing power compared with the test that does not consider graph structures. Application to a real dataset further demonstrates the utility of the method. AVAILABILITY AND IMPLEMENTATION: The R code used for the analysis can be accessed at https://github.com/JialinQu/gKAT. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
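For readers unfamiliar with diffusion kernels, here is a minimal sketch of the general idea: the kernel K = exp(-beta * L) is computed from the graph Laplacian L of a gene network and then used to weight a sample-by-sample similarity matrix. This is not the gKAT implementation (the authors' R code is at the URL above); the function name, the toy pathway, the expression values, the value of beta, and the choice of X K X' as the sample kernel are all assumptions made for illustration.

```python
import numpy as np

def diffusion_kernel(adjacency, beta=1.0):
    """Diffusion kernel K = exp(-beta * L), with L = D - A the graph Laplacian."""
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # L is symmetric, so the matrix exponential follows from its eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T

# Hypothetical pathway with 3 genes: gene 0 - gene 1 - gene 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
K_graph = diffusion_kernel(A, beta=0.5)

# Hypothetical expression matrix (4 samples x 3 genes), for illustration only.
X = np.array([[0.2, 1.1, -0.3],
              [1.0, 0.9, 0.8],
              [-0.5, 0.0, 0.4],
              [0.3, -1.2, 0.1]])

# One simple (assumed) way to obtain a graph-informed sample-by-sample kernel.
K_samples = X @ K_graph @ X.T
```

In this toy network the kernel assigns a larger similarity to directly connected genes (0 and 1) than to genes two steps apart (0 and 2), so the resulting sample kernel gives more weight to co-variation along edges of the pathway.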
Affiliation(s)
- Jialin Qu
  - Department of Statistics and Probability, Michigan State University, East Lansing, MI 48824, USA
- Yuehua Cui
  - To whom correspondence should be addressed.
5
Qin X, Ma S, Wu M. Gene-gene interaction analysis incorporating network information via a structured Bayesian approach. Stat Med 2021; 40:6619-6633. [PMID: 34542187 PMCID: PMC8595614 DOI: 10.1002/sim.9202] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/16/2021] [Revised: 08/22/2021] [Accepted: 08/30/2021] [Indexed: 01/14/2023]
Abstract
Increasing evidence has shown that gene-gene interactions have important effects in biological processes of human diseases. Due to the high dimensionality of genetic measurements, interaction analysis usually suffers from a lack of sufficient information and has unsatisfactory results. Biological network information has been massively accumulated, allowing researchers to identify biomarkers while taking a system perspective, conducting network selection (of functionally related biomarkers), and accommodating network structures. In main-effect-only analysis, network information has been incorporated. However, effort has been limited in interaction analysis. Recently, link networks that describe the relationships between genetic interactions have been demonstrated as effective for revealing multiscale hierarchical organizations in networks and providing interesting findings beyond node networks. In this study, we develop a novel structured Bayesian interaction analysis approach to effectively incorporate network information. This study is among the first to identify gene-gene interactions with the assistance of network selection, while simultaneously accommodating the underlying network structures of both main effects and interactions. It innovatively respects multiple hierarchies among main effects, interactions, and networks. The Bayesian technique is adopted, which may be more informative for estimation and prediction over some other techniques. An efficient variational Bayesian expectation-maximization algorithm is developed to explore the posterior distribution. Extensive simulation studies demonstrate the practical superiority of the proposed approach. The analysis of TCGA data on melanoma and lung cancer leads to biologically sensible findings with satisfactory prediction accuracy and selection stability.
Affiliation(s)
- Xing Qin
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT, USA
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
6
Le T, Zhong P. High‐dimensional Precision Matrix Estimation with a Known Graphical Structure. Stat (Int Stat Inst) 2021. [DOI: 10.1002/sta4.424] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Affiliation(s)
- Thien‐Minh Le
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Massachusetts, USA
- Ping‐Shou Zhong
  - Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago, Illinois, USA
7
Wu M, Yi H, Ma S. Vertical integration methods for gene expression data analysis. Brief Bioinform 2021; 22:bbaa169. [PMID: 32793970 PMCID: PMC8138889 DOI: 10.1093/bib/bbaa169] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/24/2020] [Revised: 06/18/2020] [Accepted: 07/04/2020] [Indexed: 12/12/2022] Open
Abstract
Gene expression data have played an essential role in many biomedical studies. When the number of genes is large and sample size is limited, there is a 'lack of information' problem, leading to low-quality findings. To tackle this problem, both horizontal and vertical data integrations have been developed, where vertical integration methods collectively analyze data on gene expressions as well as their regulators (such as mutations, DNA methylation and miRNAs). In this article, we conduct a selective review of vertical data integration methods for gene expression data. The reviewed methods cover both marginal and joint analysis and supervised and unsupervised analysis. The main goal is to provide a sketch of the vertical data integration paradigm without digging into too many technical details. We also briefly discuss potential pitfalls, directions for future developments and application notes.
Affiliation(s)
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics
- Huangdi Yi
  - Department of Biostatistics at Yale University
- Shuangge Ma
  - Department of Biostatistics at Yale University
8
Crawford J, Greene CS. Incorporating biological structure into machine learning models in biomedicine. Curr Opin Biotechnol 2020; 63:126-134. [PMID: 31962244 PMCID: PMC7308204 DOI: 10.1016/j.copbio.2019.12.021] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 10/17/2019] [Revised: 12/17/2019] [Accepted: 12/19/2019] [Indexed: 12/19/2022]
Abstract
In biomedical applications of machine learning, relevant information often has a rich structure that is not easily encoded as real-valued predictors. Examples of such data include DNA or RNA sequences, gene sets or pathways, gene interaction or coexpression networks, ontologies, and phylogenetic trees. We highlight recent examples of machine learning models that use structure to constrain model architecture or incorporate structured data into model training. For machine learning in biomedicine, where sample size is limited and model interpretability is crucial, incorporating prior knowledge in the form of structured data can be particularly useful. The area of research would benefit from performant open source implementations and independent benchmarking efforts.
Affiliation(s)
- Jake Crawford
  - Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Casey S Greene
  - Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States; Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, Philadelphia, PA, United States.