1
Qin X, Ma S, Wu M. Gene-gene interaction analysis incorporating network information via a structured Bayesian approach. Stat Med 2021; 40:6619-6633. [PMID: 34542187 PMCID: PMC8595614 DOI: 10.1002/sim.9202] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/16/2021] [Revised: 08/22/2021] [Accepted: 08/30/2021] [Indexed: 01/14/2023]
Abstract
Increasing evidence has shown that gene-gene interactions have important effects in biological processes of human diseases. Due to the high dimensionality of genetic measurements, interaction analysis usually suffers from a lack of sufficient information and has unsatisfactory results. Biological network information has been massively accumulated, allowing researchers to identify biomarkers while taking a system perspective, conducting network selection (of functionally related biomarkers), and accommodating network structures. In main-effect-only analysis, network information has been incorporated. However, effort has been limited in interaction analysis. Recently, link networks that describe the relationships between genetic interactions have been demonstrated as effective for revealing multiscale hierarchical organizations in networks and providing interesting findings beyond node networks. In this study, we develop a novel structured Bayesian interaction analysis approach to effectively incorporate network information. This study is among the first to identify gene-gene interactions with the assistance of network selection, while simultaneously accommodating the underlying network structures of both main effects and interactions. It innovatively respects multiple hierarchies among main effects, interactions, and networks. The Bayesian technique is adopted, which may be more informative for estimation and prediction over some other techniques. An efficient variational Bayesian expectation-maximization algorithm is developed to explore the posterior distribution. Extensive simulation studies demonstrate the practical superiority of the proposed approach. The analysis of TCGA data on melanoma and lung cancer leads to biologically sensible findings with satisfactory prediction accuracy and selection stability.
Affiliation(s)
- Xing Qin
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, CT, USA
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China
2
Nies HW, Mohamad MS, Zakaria Z, Chan WH, Remli MA, Nies YH. Enhanced Directed Random Walk for the Identification of Breast Cancer Prognostic Markers from Multiclass Expression Data. Entropy (Basel, Switzerland) 2021; 23:1232. [PMID: 34573857 PMCID: PMC8472068 DOI: 10.3390/e23091232] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/15/2021] [Revised: 09/14/2021] [Accepted: 09/16/2021] [Indexed: 12/12/2022]
Abstract
Artificial intelligence in healthcare can potentially identify the probability of contracting a particular disease more accurately. There are five common molecular subtypes of breast cancer: luminal A, luminal B, basal, ERBB2, and normal-like. Previous investigations showed that pathway-based microarray analysis could help in the identification of prognostic markers from gene expressions. For example, directed random walk (DRW) can infer a greater reproducibility power of the pathway activity between two classes of samples with a higher classification accuracy. However, most of the existing methods (including DRW) ignored the characteristics of different cancer subtypes and considered all of the pathways to contribute equally to the analysis. Therefore, an enhanced DRW (eDRW+) is proposed to identify breast cancer prognostic markers from multiclass expression data. An improved weight strategy using one-way ANOVA (F-test) and pathway selection based on the greatest reproducibility power is proposed in eDRW+. The experimental results show that the eDRW+ exceeds other methods in terms of AUC. Besides this, the eDRW+ identifies 294 gene markers and 45 pathway markers from the breast cancer datasets with better AUC. Therefore, the prognostic markers (pathway markers and gene markers) can identify drug targets and look for cancer subtypes with clinically distinct outcomes.
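The weighting idea mentioned in the abstract, scoring each gene with a one-way ANOVA F-test across several classes, can be sketched as follows. This is a minimal illustration only: the function name, the toy expression values, and the use of the raw F statistic as a weight are assumptions, not the eDRW+ implementation.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of sample groups.

    Each group is a list of expression values for one class (e.g. one subtype).
    Assumes at least two groups and nonzero within-group variance.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical expression values of two genes across three classes.
gene_a = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]  # class means differ
gene_b = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]  # class means identical

f_scores = {"gene_a": one_way_anova_f(gene_a), "gene_b": one_way_anova_f(gene_b)}
print(f_scores)
```

A gene whose mean expression differs across classes receives a large F statistic, while a gene with identical class means receives zero, which is the property a multiclass weighting scheme relies on.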
Affiliation(s)
- Hui Wen Nies
  - School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai 81310, Malaysia
- Mohd Saberi Mohamad
  - Health Data Science Lab, Department of Genetics and Genomics, College of Medical and Health Sciences, United Arab Emirates University, Al Ain 17666, United Arab Emirates
- Zalmiyah Zakaria
  - School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai 81310, Malaysia
- Weng Howe Chan
  - School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai 81310, Malaysia
- Muhammad Akmal Remli
  - Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan, Kota Bharu 16100, Malaysia
- Yong Hui Nies
  - Department of Anatomy, Faculty of Medicine, Universiti Kebangsaan Malaysia Medical Centre, Cheras, Kuala Lumpur 56000, Malaysia
3
Cai Q, Kang J, Yu T. Bayesian Network Marker Selection via the Thresholded Graph Laplacian Gaussian Prior. Bayesian Analysis 2020; 15:79-102. [PMID: 32802246 PMCID: PMC7428197 DOI: 10.1214/18-ba1142] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 05/06/2023]
Abstract
Selecting informative nodes over large-scale networks becomes increasingly important in many research areas. Most existing methods focus on the local network structure and incur heavy computational costs for the large-scale problem. In this work, we propose a novel prior model for Bayesian network marker selection in the generalized linear model (GLM) framework: the Thresholded Graph Laplacian Gaussian (TGLG) prior, which adopts the graph Laplacian matrix to characterize the conditional dependence between neighboring markers accounting for the global network structure. Under mild conditions, we show the proposed model enjoys the posterior consistency with a diverging number of edges and nodes in the network. We also develop a Metropolis-adjusted Langevin algorithm (MALA) for efficient posterior computation, which is scalable to large-scale networks. We illustrate the superiorities of the proposed method compared with existing alternatives via extensive simulation studies and an analysis of the breast cancer gene expression dataset in the Cancer Genome Atlas (TCGA).
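The two ingredients the abstract names, a graph Laplacian that encodes network structure and a thresholding step that selects markers, can be illustrated with a small sketch. It is not the TGLG prior itself: the unnormalized Laplacian L = D - A, the hard-threshold rule, the toy path network, and all function names are assumptions chosen for illustration.

```python
def graph_laplacian(n_nodes, edges):
    """Return the unnormalized graph Laplacian L = D - A as a list of lists."""
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j in edges:
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
        lap[i][i] += 1.0
        lap[j][j] += 1.0
    return lap


def quadratic_form(lap, x):
    """Compute x^T L x, which equals the sum of (x_i - x_j)^2 over all edges."""
    n = len(x)
    return sum(x[i] * lap[i][j] * x[j] for i in range(n) for j in range(n))


def hard_threshold(values, cutoff):
    """Zero out entries whose magnitude does not exceed the cutoff."""
    return [v if abs(v) > cutoff else 0.0 for v in values]


# Hypothetical 4-node path network: 0 - 1 - 2 - 3.
edges = [(0, 1), (1, 2), (2, 3)]
lap = graph_laplacian(4, edges)

smooth = [1.0, 1.0, 1.0, 1.0]    # neighbors agree
rough = [1.0, -1.0, 1.0, -1.0]   # neighbors disagree

print(quadratic_form(lap, smooth))          # small value: smooth over the network
print(quadratic_form(lap, rough))           # large value: rough over the network
print(hard_threshold([0.05, -0.8, 0.3], 0.1))
```

A Gaussian prior whose precision is built from the Laplacian favors coefficient vectors with a small quadratic form, that is, vectors in which connected markers take similar values; thresholding then sets weak markers exactly to zero.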
Affiliation(s)
- Qingpo Cai
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
- Jian Kang
  - Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA
- Tianwei Yu
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA
4
Zhao Y, Chang C, Long Q. Knowledge-Guided Statistical Learning Methods for Analysis of High-Dimensional -Omics Data in Precision Oncology. JCO Precis Oncol 2019; 3:PO.19.00018. [PMID: 35100722 PMCID: PMC9797232 DOI: 10.1200/po.19.00018] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Accepted: 03/25/2019] [Indexed: 12/31/2022]
Abstract
High-dimensional -omics data such as genomic, transcriptomic, and metabolomic data offer great promise in advancing precision medicine. In particular, such data have enabled the investigation of complex diseases such as cancer at an unprecedented scale and in multiple dimensions. However, a number of analytical challenges complicate analysis of high-dimensional -omics data. One is the growing recognition that complex diseases such as cancer are multifactorial and may be attributed to harmful changes on multiple -omics levels and on the pathway level. When individual genes in an important pathway have relatively weak signals, it can be challenging to detect them on their own, but the aggregated signal in the pathway can be considerably stronger and hence easier to detect with the same sample size. To address these challenges, there is a growing body of literature on knowledge-guided statistical learning methods for analysis of high-dimensional -omics data that can incorporate biological knowledge such as functional genomics and functional proteomics. These methods have been shown to improve prediction and classification accuracy and yield biologically more interpretable results compared with statistical learning methods that do not use biological knowledge. In this review, we survey current knowledge-guided statistical learning methods, including both supervised learning and unsupervised learning, and their applications to precision oncology, and we discuss future research directions.
Affiliation(s)
- Yize Zhao
  - Weill Cornell Medicine, New York, NY
- Changgee Chang
  - University of Pennsylvania Perelman School of Medicine, Philadelphia, PA
- Qi Long
  - University of Pennsylvania Perelman School of Medicine, Philadelphia, PA
5
Min W, Liu J, Zhang S. Network-Regularized Sparse Logistic Regression Models for Clinical Risk Prediction and Biomarker Discovery. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:944-953. [PMID: 28113328 DOI: 10.1109/tcbb.2016.2640303] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Indexed: 06/06/2023]
Abstract
Molecular profiling data (e.g., gene expression) has been used for clinical risk prediction and biomarker discovery. However, it is necessary to integrate other prior knowledge like biological pathways or gene interaction networks to improve the predictive ability and biological interpretability of biomarkers. Here, we first introduce a general regularized Logistic Regression (LR) framework with regularized term $\lambda \|w\|_1 + \eta\, w^\top M w$, which can reduce to different penalties, including Lasso, elastic net, and network-regularized terms with different $M$. This framework can be easily solved in a unified manner by a cyclic coordinate descent algorithm which can avoid inverse matrix operation and accelerate the computing speed. However, if those estimated $w_i$ and $w_j$ have opposite signs, then the traditional network-regularized penalty may not perform well. To address it, we introduce a novel network-regularized sparse LR model with a new penalty $\lambda \|w\|_1 + \eta\, |w|^\top M |w|$ to consider the difference between the absolute values of the coefficients. We develop two efficient algorithms to solve it. Finally, we test our methods and compare them with the related ones using simulated and real data to show their efficiency.
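The opposite-sign issue described in the abstract can be made concrete with a short sketch. It assumes the network term takes the common graph-Laplacian form, so that the quadratic term equals a sum of squared coefficient differences over network edges; the function names, the toy coefficients, and the tuning values `lam` and `eta` are hypothetical.

```python
def l1_norm(w):
    """Sparsity term: sum of absolute coefficient values."""
    return sum(abs(v) for v in w)


def network_term(w, edges):
    """Traditional network term: sum over edges of (w_i - w_j)^2."""
    return sum((w[i] - w[j]) ** 2 for i, j in edges)


def abs_network_term(w, edges):
    """Absolute-value variant: sum over edges of (|w_i| - |w_j|)^2."""
    return sum((abs(w[i]) - abs(w[j])) ** 2 for i, j in edges)


def penalty(w, edges, lam, eta, use_abs=False):
    """Combined sparse + network penalty for coefficient vector w."""
    smooth = abs_network_term(w, edges) if use_abs else network_term(w, edges)
    return lam * l1_norm(w) + eta * smooth


# Two connected genes with equal effect size but opposite signs (toy values).
w = [0.5, -0.5, 0.0]
edges = [(0, 1)]

print(penalty(w, edges, lam=0.1, eta=1.0))                # penalizes the sign difference
print(penalty(w, edges, lam=0.1, eta=1.0, use_abs=True))  # only the sparsity term remains
```

With the traditional term, two linked genes of equal magnitude and opposite sign are pushed toward each other (and hence toward zero), whereas the absolute-value variant treats them as already consistent.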
6
Wu MY, Zhang XF, Dai DQ, Ou-Yang L, Zhu Y, Yan H. Regularized logistic regression with network-based pairwise interaction for biomarker identification in breast cancer. BMC Bioinformatics 2016; 17:108. [PMID: 26921029 PMCID: PMC4769543 DOI: 10.1186/s12859-016-0951-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Received: 08/06/2015] [Accepted: 01/28/2016] [Indexed: 12/14/2022]
Abstract
BACKGROUND: To facilitate advances in personalized medicine, it is important to detect predictive, stable and interpretable biomarkers related to different clinical characteristics. These clinical characteristics may be heterogeneous with respect to underlying interactions between genes. Usually, traditional methods just focus on detection of differentially expressed genes without taking the interactions between genes into account. Moreover, due to the typical low reproducibility of the selected biomarkers, it is difficult to give a clear biological interpretation for a specific disease. Therefore, it is necessary to design a robust biomarker identification method that can predict disease-associated interactions with high reproducibility. RESULTS: In this article, we propose a regularized logistic regression model. Unlike previous methods, which focus on individual genes or modules, our model takes gene pairs, which are connected in a protein-protein interaction network, into account. A line graph is constructed to represent the adjacencies between pairwise interactions. Based on this line graph, we incorporate the degree information in the model via an adaptive elastic net, which makes our model less dependent on the expression data. Experimental results on six publicly available breast cancer datasets show that our method can not only achieve competitive performance in classification, but also retain great stability in variable selection. Therefore, our model is able to identify the diagnostic and prognostic biomarkers in a more robust way. Moreover, most of the biomarkers discovered by our model have been verified in biochemical or biomedical studies. CONCLUSIONS: The proposed method shows promise in the diagnosis of disease pathogenesis with different clinical characteristics. These advances lead to more accurate and stable biomarker discovery, which can monitor the functional changes that are perturbed by diseases. Based on these predictions, researchers may be able to provide suggestions for new therapeutic approaches.
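The line-graph construction mentioned in the abstract can be sketched as follows: each protein-protein interaction becomes a node, and two interactions are adjacent when they share a protein. The toy network, the function names, and the way degrees are reported are assumptions for illustration; how the degrees enter the adaptive elastic net is not specified here and is left out.

```python
from itertools import combinations


def line_graph(edges):
    """Map each interaction (edge) to the set of interactions sharing a protein with it."""
    # Normalize edge orientation and drop duplicates so (A, B) == (B, A).
    norm = sorted({tuple(sorted(e)) for e in edges})
    adjacency = {e: set() for e in norm}
    for e1, e2 in combinations(norm, 2):
        if set(e1) & set(e2):  # the two interactions share an endpoint
            adjacency[e1].add(e2)
            adjacency[e2].add(e1)
    return adjacency


def line_graph_degrees(edges):
    """Degree of each interaction in the line graph."""
    return {e: len(nbrs) for e, nbrs in line_graph(edges).items()}


# Hypothetical protein-protein interaction network.
ppi = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D")]
degrees = line_graph_degrees(ppi)
for interaction, degree in sorted(degrees.items()):
    print(interaction, degree)
```

Interactions with a high line-graph degree touch many other interactions, which is the kind of topological information a degree-based adaptive weight could draw on independently of the expression data.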
Affiliation(s)
- Meng-Yun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Guoding Road, Shanghai, 200433, China
  - Key Laboratory of Mathematical Economics SUFE, Ministry of Education, Guoding Road, Shanghai, 200433, China
- Xiao-Fei Zhang
  - School of Mathematics and Statistics & Hubei Key Laboratory of Mathematical Sciences, Central China Normal University, Luoyu Road, Wuhan, 430079, China
- Dao-Qing Dai
  - Intelligent Data Center and Department of Mathematics, Sun Yat-Sen University, Xingang West Road, Guangzhou, 510275, China
- Le Ou-Yang
  - College of Information Engineering, Shenzhen University, Nanhai Avenue, Shenzhen, 518060, China
- Yuan Zhu
  - School of Automation, China University of Geosciences, Lumo Road, Wuhan, 430074, China
- Hong Yan
  - Department of Electronic and Engineering, City University of Hong Kong, Tat Chee Avenue, Hong Kong, 999077, China
7
Barter RL, Schramm SJ, Mann GJ, Yang YH. Network-based biomarkers enhance classical approaches to prognostic gene expression signatures. BMC Systems Biology 2014; 8 Suppl 4:S5. [PMID: 25521200 PMCID: PMC4290694 DOI: 10.1186/1752-0509-8-s4-s5] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Indexed: 12/23/2022]
Abstract
BACKGROUND: Classical approaches to predicting patient clinical outcome via gene expression information are primarily based on differential expression of unrelated genes (single-gene approaches) or genes related by, for example, biologic pathway or function (gene-sets). Recently, network-based approaches utilising interaction information between genes have emerged. An open problem is whether such approaches add value to the more traditional methods of signature modelling. We explored this question via comparison of the most widely employed single-gene, gene-set, and network-based methods, using gene expression microarray data from two different cancers: melanoma and ovarian. We considered two kinds of network approaches. The first of these identifies informative genes using gene expression and network connectivity information combined, the latter drawn from prior knowledge of protein-protein interactions. The second approach focuses on identification of informative sub-networks (small networks of interacting proteins, again from prior knowledge networks). For all methods we performed 100 rounds of 5-fold cross-validation under 3 different classifiers. For network-based approaches, we considered two different protein-protein interaction networks. We quantified resulting patterns of misclassification and discussed the value of each method for the ongoing development of prognostic biomarkers. RESULTS: We found that single-gene, gene-set and network methods yielded similar error rates in melanoma and ovarian cancer data. Crucially, however, our novel and detailed patient-level analyses revealed that the different methods were correctly classifying alternate subsets of patients in each cohort. We also found that the network-based NetRank feature selection method was the most stable. CONCLUSIONS: Next-generation methods of gene expression signature modelling harness data from external networks and are foreshadowed as a standard mode of analysis. But what do they add to traditional approaches? Our findings indicate there is value in the way in which different subspaces of the patient sample are captured differently among the various methods, highlighting the possibility of 'combination' classifiers capable of identifying which patients will be more accurately classified by one particular method over another. We have seen this clearly for the first time because of our in-depth analysis at the level of individual patients.
Affiliation(s)
- Rebecca L Barter
  - School of Mathematics and Statistics at The University of Sydney, F07, The University of Sydney, NSW, 2006, Australia
- Sarah-Jane Schramm
  - Westmead Millennium Institute at The University of Sydney, 176 Hawkesbury Road, Westmead, NSW, 2145, Australia
  - Melanoma Institute Australia, 40 Rocklands Rd, North Sydney, NSW, 2060, Australia
- Graham J Mann
  - Westmead Millennium Institute at The University of Sydney, 176 Hawkesbury Road, Westmead, NSW, 2145, Australia
  - Melanoma Institute Australia, 40 Rocklands Rd, North Sydney, NSW, 2060, Australia
- Yee Hwa Yang
  - School of Mathematics and Statistics at The University of Sydney, F07, The University of Sydney, NSW, 2006, Australia
  - Melanoma Institute Australia, 40 Rocklands Rd, North Sydney, NSW, 2060, Australia