1
Park H, Miyano S. Computational network biology analysis revealed COVID-19 severity markers: Molecular interplay between HLA-II with CIITA. PLoS One 2025; 20:e0319205. [PMID: 40163554 PMCID: PMC11957389 DOI: 10.1371/journal.pone.0319205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2024] [Accepted: 01/28/2025] [Indexed: 04/02/2025] Open
Abstract
COVID-19, caused by severe acute respiratory syndrome coronavirus 2, spread rapidly worldwide. Severe and critical patients are expected to deteriorate rapidly. Although several studies have attempted to uncover the mechanisms underlying COVID-19 severity, most have focused on the perturbations of single genes. However, the complex mechanism of COVID-19 involves numerous perturbed genes in a molecular network rather than a single abnormal gene. Thus, we aimed to identify COVID-19 severity-specific markers in the Japanese population using gene network analysis. To reveal the severity-specific molecular interplays, we developed a novel computational network biology strategy that uses Kullback-Leibler divergence to measure dissimilarity between networks based on comprehensive gene network information (i.e., expression levels of genes and network structure). Monte Carlo simulations demonstrated the effectiveness of our strategy for differential gene network analysis. We applied this method to publicly available whole blood RNA-seq data from the Japan coronavirus disease 2019 Task Force and identified differentially regulated molecular interplays between 368 severe and 105 non-severe samples. Our analysis suggests the gene network between HLA class II, CIITA, and CD74 as a COVID-19 severity-specific molecular marker. Although the association between HLA class II and COVID-19 has been demonstrated, our data analysis revealed that the molecular interplay of HLA class II with its target and/or regulator is a crucial marker for COVID-19 severity. Our findings from computational network biology analysis suggest that suppression and activation of the molecular interplay between HLA class II, CIITA, and CD74 provide crucial clues to uncover the mechanisms of COVID-19 severity.
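The abstract does not spell out how the Kullback-Leibler divergence is evaluated, so the following is only a minimal sketch under an assumption: each condition's two-gene subnetwork is summarized as a bivariate Gaussian (mean expression plus covariance), and the standard closed-form Gaussian KL divergence is used. The function names and the toy numbers are hypothetical and are not taken from the paper.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    return a * d - b * c

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = det2(m)
    return [[d / det, -b / det], [-c / det, a / det]]

def gaussian_kl_2d(mu0, cov0, mu1, cov1):
    """KL(N(mu0, cov0) || N(mu1, cov1)) for two-dimensional Gaussians."""
    prec1 = inv2(cov1)
    # trace(cov1^{-1} cov0): captures differences in network structure
    trace = sum(prec1[i][j] * cov0[j][i] for i in range(2) for j in range(2))
    # Mahalanobis term: captures differences in expression levels
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    maha = sum(diff[i] * prec1[i][j] * diff[j] for i in range(2) for j in range(2))
    log_det = math.log(det2(cov1) / det2(cov0))
    return 0.5 * (trace + maha - 2 + log_det)

# Hypothetical toy values: same mean expression, different co-expression.
group_a = ([1.0, 2.0], [[1.0, 0.8], [0.8, 1.0]])  # strongly coupled gene pair
group_b = ([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])  # uncoupled gene pair

same = gaussian_kl_2d(*group_a, *group_a)
shift = gaussian_kl_2d(*group_a, *group_b)
```

In this toy case the mean expression levels are identical, so a purely expression-based comparison would see no difference, while the divergence is positive because the dependency structure differs.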
Affiliation(s)
- Heewon Park
- School of Mathematics, Statistics and Data Science, Sungshin Women’s University, Seoul, Republic of Korea
- M&D Data Science Center, Institute of Science Tokyo, Bunkyo-ku, Tokyo, Japan
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo, Japan
- Satoru Miyano
- M&D Data Science Center, Institute of Science Tokyo, Bunkyo-ku, Tokyo, Japan
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo, Japan
2
Bruner WS, Grant SFA. Translation of genome-wide association study: from genomic signals to biological insights. Front Genet 2024; 15:1375481. [PMID: 39421299 PMCID: PMC11484060 DOI: 10.3389/fgene.2024.1375481] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 09/24/2024] [Indexed: 10/19/2024] Open
Abstract
Since the turn of the 21st century, genome-wide association studies (GWAS) have successfully identified genetic signals associated with a myriad of common complex traits and diseases. Now that robust genetic associations with diverse phenotypes have been established, the central challenge is characterizing the underlying functional mechanisms driving these signals. Previous GWAS efforts have revealed multiple variants, each conferring relatively subtle susceptibility, that collectively contribute to the pathogenesis of various common diseases. Such variants can also be associated with multiple other traits and differ across ancestries. In addition, linkage disequilibrium makes it difficult to disentangle causal from non-causal variants, which complicates drawing direct biological conclusions. Combined with cellular context considerations, such challenges can reduce the capacity to definitively elucidate the biological significance of GWAS signals, limiting the potential to define mechanistic insights. This review details current and anticipated approaches for functional interpretation of GWAS signals, both in terms of characterizing the underlying causal variants and the corresponding effector genes.
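To illustrate why linkage disequilibrium complicates the search for causal variants, here is a small sketch that computes the standard r² statistic between two biallelic sites from phased haplotypes. The function name and the toy haplotypes are hypothetical, and the review itself does not prescribe this code.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic sites.

    `haplotypes` is a list of (allele_at_site_1, allele_at_site_2) pairs
    coded 0/1. Both sites must be polymorphic, otherwise the denominator
    is zero.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Two sites that are always inherited together: statistically indistinguishable.
perfect = ld_r2([(1, 1), (1, 1), (0, 0), (0, 0)])
# Two sites that segregate independently.
independent = ld_r2([(1, 1), (1, 0), (0, 1), (0, 0)])
```

When r² is close to 1, an association signal at one site is reproduced almost exactly at the other, so association statistics alone cannot tell which of the two is causal.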
Affiliation(s)
- Winter S. Bruner
- Center for Spatial and Functional Genomics, Children’s Hospital of Philadelphia, Philadelphia, PA, United States
- Division of Human Genetics, Children’s Hospital of Philadelphia, Philadelphia, PA, United States
- Struan F. A. Grant
- Center for Spatial and Functional Genomics, Children’s Hospital of Philadelphia, Philadelphia, PA, United States
- Division of Human Genetics, Children’s Hospital of Philadelphia, Philadelphia, PA, United States
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Division of Endocrinology and Diabetes, Children’s Hospital of Philadelphia, Philadelphia, PA, United States
3
Park H. Unveiling Gene Regulatory Networks That Characterize Difference of Molecular Interplays Between Gastric Cancer Drug Sensitive and Resistance Cell Lines. J Comput Biol 2024; 31:257-274. [PMID: 38394313 DOI: 10.1089/cmb.2023.0215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2024] Open
Abstract
Gastric cancer is a leading cause of cancer-related deaths globally, and chemotherapy is widely accepted as the standard treatment for gastric cancer. However, drug resistance in cancer cells poses a significant obstacle to the success of chemotherapy, limiting its effectiveness in treating gastric cancer. Although many studies have been conducted to unravel the mechanisms of acquired drug resistance, the existing studies were based on abnormalities of a single gene, that is, differential gene expression (DGE) analysis. Single gene-based analysis alone is insufficient to comprehensively understand the mechanisms of drug resistance in cancer cells, because the underlying processes involve perturbations of molecular interactions. To uncover the mechanism of acquired gastric cancer drug resistance, we identify differentially regulated gene networks between drug-sensitive and drug-resistant cell lines. We develop a computational strategy for identifying phenotype-specific gene networks by extending the existing method, CIdrgn, which quantifies the dissimilarity of gene networks based on comprehensive information of network structure, that is, regulatory effect between genes, structure of edge, and expression levels of genes. To enhance the efficiency of identifying differentially regulated gene networks and improve the biological relevance of our findings, we integrate additional information and incorporate knowledge of network biology, such as hubness of genes and weighted adjacency matrices. The capabilities of the developed strategy are validated through Monte Carlo simulations. By using our strategy, we uncover gene regulatory networks that specifically capture the molecular interplays distinguishing drug-sensitive and drug-resistant profiles in gastric cancer.
The reliability and significance of the identified drug-sensitive and resistance-specific gene networks, as well as their related markers, are verified through literature. Our analysis for differentially regulated gene network identification has the capacity to characterize the drug-sensitive and resistance-specific molecular interplays related to mechanisms of acquired drug resistance that cannot be revealed by analysis based solely on abnormalities of a single gene, for example, DGE analysis. Through our analysis and comprehensive examination of relevant literature, we suggest that targeting the suppressors of the identified drug-resistant markers, such as the Melanoma Antigen (MAGE) family, Trefoil Factor (TFF) family, and Ras-Associated Binding 25 (RAB25), while enhancing the expression of inducers of the drug sensitivity markers [e.g., Serum Amyloid A (SAA) family], could potentially reduce drug resistance and enhance the effectiveness of chemotherapy for gastric cancer. We expect that the developed strategy will serve as a useful tool for uncovering cancer-related phenotype-specific gene regulatory networks that provide essential clues for uncovering not only drug resistance mechanisms but also complex biological systems of cancer.
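The abstract mentions hubness of genes and weighted adjacency matrices without defining them. As a rough sketch only, one common reading is to score each gene by the total absolute weight of its incoming and outgoing edges; that definition, the gene labels, and the weights below are assumptions for illustration and may differ from what the paper uses.

```python
def weighted_hubness(adjacency):
    """Sum of absolute edge weights touching each gene.

    adjacency[i][j] is the (signed) regulatory effect of gene i on gene j,
    so both row i (outgoing) and column i (incoming) contribute.
    """
    n = len(adjacency)
    scores = []
    for i in range(n):
        outgoing = sum(abs(adjacency[i][j]) for j in range(n))
        incoming = sum(abs(adjacency[j][i]) for j in range(n))
        scores.append(outgoing + incoming)
    return scores

# Hypothetical three-gene network; labels are placeholders, not real markers.
genes = ["G1", "G2", "G3"]
adjacency = [
    [0.0, 0.9, -0.7],  # G1 activates G2 and represses G3
    [0.0, 0.0, 0.1],   # G2 weakly activates G3
    [0.0, 0.0, 0.0],   # G3 regulates nothing
]

hubness = weighted_hubness(adjacency)
top_hub = genes[hubness.index(max(hubness))]
```

Using absolute values keeps strong repressors from cancelling strong activators, which is one reason a weighted matrix carries more information than a binary edge list.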
Affiliation(s)
- Heewon Park
- School of Mathematics, Statistics and Data Science, Sungshin Women's University, Seoul, Korea
4
Sharma V, Singh A, Chauhan S, Sharma PK, Chaudhary S, Sharma A, Porwal O, Fuloria NK. Role of Artificial Intelligence in Drug Discovery and Target Identification in Cancer. Curr Drug Deliv 2024; 21:870-886. [PMID: 37670704 DOI: 10.2174/1567201821666230905090621] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 10/31/2022] [Revised: 03/08/2023] [Accepted: 03/24/2023] [Indexed: 09/07/2023]
Abstract
Drug discovery and development (DDD) is a highly complex process that necessitates precise monitoring and extensive data analysis at each stage. Furthermore, the DDD process is both time-consuming and costly. To tackle these concerns, artificial intelligence (AI) technology can be used, which facilitates rapid and precise analysis of extensive datasets within a limited timeframe. The pathophysiology of cancer is complicated and requires extensive research for novel drug discovery and development. The first stage in the process of drug discovery and development involves identifying targets. Cell structure and molecular functioning are complex due to the vast number of molecules that function constantly, performing various roles. Furthermore, scientists are continually discovering novel cellular mechanisms and molecules, expanding the range of potential targets. Accurately identifying the correct target is a crucial step in the preparation of a treatment strategy. Various forms of AI, such as machine learning, neural-based learning, deep learning, and network-based learning, are currently being utilised in applications, online services, and databases. These technologies facilitate the identification and validation of targets, ultimately contributing to the success of projects. This review focuses on the different types and subcategories of AI databases utilised in the field of drug discovery and target identification for cancer.
Affiliation(s)
- Vishal Sharma
- Department of Pharmacy, Galgotias University, Greater Noida, Uttar Pradesh, 201310, India
- Amit Singh
- Department of Pharmacy, Galgotias University, Greater Noida, Uttar Pradesh, 201310, India
- Sanjana Chauhan
- Department of Pharmacy, Galgotias University, Greater Noida, Uttar Pradesh, 201310, India
- Pramod Kumar Sharma
- Department of Pharmacy, Galgotias University, Greater Noida, Uttar Pradesh, 201310, India
- Shubham Chaudhary
- Department of Pharmacy, Galgotias University, Greater Noida, Uttar Pradesh, 201310, India
- Astha Sharma
- Department of Pharmacy, Galgotias University, Greater Noida, Uttar Pradesh, 201310, India
- Omji Porwal
- Department of Pharmacognosy, Faculty of Pharmacy, Tishk International University, Erbil 44001, Iraq
5
Zheng LL, Wang YR, Liu ZR, Wang ZH, Tao CC, Xiao YG, Zhang K, Wu AK, Li HY, Wu JX, Xiao T, Rong WQ. High spindle and kinetochore-associated complex subunit-3 expression predicts poor prognosis and correlates with adverse immune infiltration in hepatocellular carcinoma. World J Gastrointest Surg 2023; 15:1600-1614. [PMID: 37701707 PMCID: PMC10494596 DOI: 10.4240/wjgs.v15.i8.1600] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/13/2023] [Revised: 06/14/2023] [Accepted: 07/17/2023] [Indexed: 08/25/2023] Open
Abstract
BACKGROUND Spindle and kinetochore-associated complex subunit 3 (SKA3) is a malignancy-associated gene that plays a critical role in the regulation of chromosome separation and cell division. However, the molecular mechanism through which SKA3 regulates tumor cell proliferation in hepatocellular carcinoma (HCC) has not been fully elucidated. AIM To investigate the molecular mechanisms underlying the role of SKA3 in HCC. METHODS SKA3 expression, clinicopathological, and survival analyses were performed using multiple public database platforms, and the results were verified by Western blot and immunohistochemistry staining using collected clinical samples. Functional enrichment analyses were performed to evaluate the biological functions and molecular mechanisms of SKA3 in HCC. Furthermore, the Tumor Immune Estimation Resource and single-sample Gene Set Enrichment Analysis (ssGSEA) algorithms were utilized to investigate the abundance of tumor-infiltrating immune cells in HCC. The response to chemotherapeutic drugs was evaluated by the R package "pRRophetic". RESULTS We found that upregulated SKA3 expression was significantly correlated with poor prognosis in patients with HCC. Multivariable Cox regression analysis indicated that SKA3 was an independent risk factor for survival. GSEA revealed that SKA3 expression may facilitate proliferation and migratory processes by regulating the cell cycle and DNA repair. Moreover, patients with high SKA3 expression had significantly decreased ratios of CD8+ T cells, natural killer cells, and dendritic cells. Drug sensitivity analysis showed that the high SKA3 group was more sensitive to sorafenib, sunitinib, paclitaxel, doxorubicin, gemcitabine, and vx-680. CONCLUSION High SKA3 expression led to poor prognosis in patients with HCC by enhancing HCC proliferation and repressing immune cell infiltration surrounding HCC. SKA3 may be used as a biomarker for poor prognosis and as a therapeutic target in HCC.
Affiliation(s)
- Lin-Lin Zheng
- Department of Hepatobiliary Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
- Ya-Ru Wang
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
- Zhen-Rong Liu
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
- Zhi-Hao Wang
- Department of Hepatobiliary Hernia Surgery, Liaocheng Dongcangfu People's Hospital, Liaocheng 252000, Shandong Province, China
- Chang-Cheng Tao
- Department of Hepatobiliary Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
- Yong-Gang Xiao
- The Second Ward of Hepatobiliary Surgery, Qianxinan People's Hospital, Xingyi 562400, Guizhou Province, China
- Kai Zhang
- Department of Interventional Therapy, Tianjin Medical University Cancer Institute and Hospital, National Clinical Research Center for Cancer, Key Laboratory of Cancer Prevention and Therapy, Tianjin's Clinical Research Center for Cancer, Tianjin 300000, China
- An-Ke Wu
- Department of Hepatobiliary Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
- Hai-Yang Li
- Department of Hepatobiliary Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
- Jian-Xiong Wu
- Department of Hepatobiliary Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
- Ting Xiao
- State Key Laboratory of Molecular Oncology, Department of Etiology and Carcinogenesis, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
- Wei-Qi Rong
- Department of Hepatobiliary Surgery, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100021, China
6
Park H, Imoto S, Miyano S. Comprehensive information-based differential gene regulatory networks analysis (CIdrgn): Application to gastric cancer and chemotherapy-responsive gene network identification. PLoS One 2023; 18:e0286044. [PMID: 37610997 PMCID: PMC10446197 DOI: 10.1371/journal.pone.0286044] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/12/2023] [Accepted: 05/07/2023] [Indexed: 08/25/2023] Open
Abstract
Biological condition-responsive gene network analysis has attracted considerable research attention because of its ability to identify pathways or gene modules involved in the underlying mechanisms of diseases. Although many condition-specific gene network identification methods have been developed, they are based on partial or incomplete gene regulatory network information, with most studies only considering the differential expression levels or correlations among genes. However, a single gene-based analysis cannot effectively identify the molecular interactions involved in the mechanisms underlying diseases, which reflect perturbations in specific molecular network functions rather than disorders of a single gene. To comprehensively identify differentially regulated gene networks, we propose a novel computational strategy called comprehensive analysis of differential gene regulatory networks (CIdrgn). Our strategy incorporates comprehensive information on the networks between genes, including the expression levels, edge structures and regulatory effects, to measure the dissimilarity among networks. We extended the proposed CIdrgn to cell line characteristic-specific gene network analysis. Monte Carlo simulations showed the effectiveness of CIdrgn for identifying differentially regulated gene networks with different network structures and scales. Moreover, condition-responsive network identification in cell line characteristic-specific gene network analyses was verified. We applied CIdrgn to identify gastric cancer-specific networks and networks responsive to its chemotherapy (capecitabine and oxaliplatin) based on the Cancer Dependency Map. The CXC family of chemokines and cadherin gene family networks were identified as gastric cancer-specific gene regulatory networks, which was verified through a literature survey. The networks of the olfactory receptor family with the ASCL1/FOS family were identified as gene networks specific to capecitabine and oxaliplatin sensitivity.
We expect that the proposed CIdrgn method will be a useful tool for identifying crucial molecular interactions involved in the specific biological conditions of cancer cell lines, such as the cancer stage or acquired anticancer drug resistance.
Affiliation(s)
- Heewon Park
- School of Mathematics, Statistics and Data Science, Sungshin Women’s University, Seoul, Korea
- Seiya Imoto
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo, Japan
- Satoru Miyano
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Minato-ku, Tokyo, Japan
- M&D Data Science Center, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo, Japan
7
Amini P, Hajihosseini M, Pyne S, Dinu I. Geographically weighted linear combination test for gene-set analysis of a continuous spatial phenotype as applied to intratumor heterogeneity. Front Cell Dev Biol 2023; 11:1065586. [PMID: 36998245 PMCID: PMC10044624 DOI: 10.3389/fcell.2023.1065586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2022] [Accepted: 02/22/2023] [Indexed: 03/11/2023] Open
Abstract
Background: The impact of gene-sets on a spatial phenotype is not necessarily uniform across different locations of cancer tissue. This study introduces a computational platform, GWLCT, for combining gene set analysis with spatial data modeling to provide a new statistical test for location-specific association of phenotypes and molecular pathways in spatial single-cell RNA-seq data collected from an input tumor sample. Methods: The main advantage of GWLCT is that it goes beyond global significance, allowing the association between the gene-set and the phenotype to vary across the tumor space. At each location, the most significant linear combination is found using a geographically weighted shrunken covariance matrix and kernel function. Whether to use a fixed or adaptive bandwidth is determined by a cross-validation procedure. Our proposed method is compared to the global version of the linear combination test (LCT) and to bulk and random-forest based gene-set enrichment analyses, using data created by the Visium Spatial Gene Expression technique on an invasive breast cancer tissue sample as well as 144 different simulation scenarios. Results: In an illustrative example, the new geographically weighted linear combination test, GWLCT, identifies the cancer hallmark gene-sets that are significantly associated at each location with the five spatially continuous phenotypic contexts in the tumors defined by different well-known markers of cancer-associated fibroblasts. Scan statistics revealed clustering in the number of significant gene-sets. A spatial heatmap of combined significance over all selected gene-sets is also produced. Extensive simulation studies demonstrate that our proposed approach outperforms other methods in the considered scenarios, especially when the spatial association increases. Conclusion: Our proposed approach considers the spatial covariance of gene expression to detect the most significant gene-sets affecting a continuous phenotype.
It reveals spatially detailed information in tissue space and can thus play a key role in understanding the contextual heterogeneity of cancer cells.
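The geographic weighting idea can be sketched as follows. This is not the GWLCT implementation: it assumes a Gaussian kernel with a fixed bandwidth, omits the shrinkage step and the linear combination test itself, and uses made-up coordinates and values. It only shows how kernel weights let an association differ by location even when the global association is zero.

```python
import math

def gaussian_weights(coords, center, bandwidth):
    """Kernel weight of every spot relative to one focal location."""
    return [math.exp(-0.5 * (math.dist(c, center) / bandwidth) ** 2) for c in coords]

def weighted_cov(x, y, weights):
    """Weighted covariance between expression x and phenotype y."""
    total = sum(weights)
    mean_x = sum(w * a for w, a in zip(weights, x)) / total
    mean_y = sum(w * b for w, b in zip(weights, y)) / total
    return sum(w * (a - mean_x) * (b - mean_y) for w, a, b in zip(weights, x, y)) / total

# Hypothetical tissue: two clusters of three spots each along one axis.
coords = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
expression = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
phenotype = [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]  # positive link on the left, negative on the right

global_cov = weighted_cov(expression, phenotype, [1.0] * len(coords))
left_cov = weighted_cov(expression, phenotype, gaussian_weights(coords, (1, 0), bandwidth=1.0))
right_cov = weighted_cov(expression, phenotype, gaussian_weights(coords, (11, 0), bandwidth=1.0))
```

Here the unweighted covariance is zero because the two clusters cancel out, while the locally weighted covariances have opposite signs. A larger bandwidth would blend the clusters and pull both local values toward the global one, which is why bandwidth selection matters.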
Affiliation(s)
- Payam Amini
- Department of Biostatistics, School of Public Health, Iran University of Medical Sciences, Tehran, Iran
- School of Medicine, Keele University, Keele, Staffordshire, United Kingdom
- Morteza Hajihosseini
- School of Public Health, University of Alberta, Edmonton, AB, Canada
- Stanford Department of Urology, Center for Academic Medicine, Palo Alto, CA, United States
- Saumyadipta Pyne
- Health Analytics Network, Pittsburgh, PA, United States
- University of California, Santa Barbara, Santa Barbara, CA, United States
- *Correspondence: Saumyadipta Pyne; Irina Dinu
- Irina Dinu
- School of Public Health, University of Alberta, Edmonton, AB, Canada
- *Correspondence: Saumyadipta Pyne; Irina Dinu
8
Mubeen S, Tom Kodamullil A, Hofmann-Apitius M, Domingo-Fernández D. On the influence of several factors on pathway enrichment analysis. Brief Bioinform 2022; 23:bbac143. [PMID: 35453140 PMCID: PMC9116215 DOI: 10.1093/bib/bbac143] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/17/2022] [Revised: 03/21/2022] [Accepted: 03/30/2022] [Indexed: 02/01/2023] Open
Abstract
Pathway enrichment analysis has become a widely used knowledge-based approach for the interpretation of biomedical data. Its popularity has led to an explosion of both enrichment methods and pathway databases. While the elegance of pathway enrichment lies in its simplicity, multiple factors can impact the results of such an analysis, which may not be accounted for. Researchers may fail to give influential aspects their due, resorting instead to popular methods and gene set collections, or default settings. Despite ongoing efforts to establish set guidelines, meaningful results are still hampered by a lack of consensus or gold standards around how enrichment analysis should be conducted. Nonetheless, such concerns have prompted a series of benchmark studies specifically focused on evaluating the influence of various factors on pathway enrichment results. In this review, we organize and summarize the findings of these benchmarks to provide a comprehensive overview on the influence of these factors. Our work covers a broad spectrum of factors, spanning from methodological assumptions to those related to prior biological knowledge, such as pathway definitions and database choice. In doing so, we aim to shed light on how these aspects can lead to insignificant, uninteresting or even contradictory results. Finally, we conclude the review by proposing future benchmarks as well as solutions to overcome some of the challenges, which originate from the outlined factors.
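As a concrete instance of how an analysis choice can change enrichment results, the sketch below runs a one-sided hypergeometric over-representation test, one of the simplest enrichment methods, with two different background sizes. The function name and all counts are hypothetical, and the review covers many other methods and factors.

```python
from math import comb

def ora_p_value(universe, pathway, hits, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    universe: number of genes in the background
    pathway:  number of background genes annotated to the pathway
    hits:     number of genes in the query list
    overlap:  number of query genes that fall in the pathway
    """
    upper = min(pathway, hits)
    tail = sum(
        comb(pathway, k) * comb(universe - pathway, hits - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(universe, hits)

# Same pathway, same query list, same overlap; only the background differs.
p_small_background = ora_p_value(universe=20, pathway=5, hits=5, overlap=2)
p_large_background = ora_p_value(universe=200, pathway=5, hits=5, overlap=2)
```

With the small background the overlap of two genes is unremarkable, while with the large background the same overlap yields a much smaller p-value. The data did not change, only the assumed gene universe did.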
Affiliation(s)
- Sarah Mubeen
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, 53115 Bonn, Germany
- Fraunhofer Center for Machine Learning, Germany
- Alpha Tom Kodamullil
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin 53757, Germany
- Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, 53115 Bonn, Germany
- Daniel Domingo-Fernández
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin 53757, Germany
- Fraunhofer Center for Machine Learning, Germany
- Enveda Biosciences, Boulder, CO, 80301, USA
9
Comprehensive Analysis of CPA4 as a Poor Prognostic Biomarker Correlated with Immune Cells Infiltration in Bladder Cancer. Biology 2021; 10:biology10111143. [PMID: 34827136 PMCID: PMC8615209 DOI: 10.3390/biology10111143] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 09/25/2021] [Revised: 10/29/2021] [Accepted: 11/04/2021] [Indexed: 12/24/2022]
Abstract
Simple Summary: The overexpression of Carboxypeptidase A4 (CPA4) has been observed in many types of cancer and has been shown to promote tumor growth and invasion; however, its role in bladder urothelial carcinoma (BLCA) is still unclear. Therefore, we aimed to show the prognostic role of CPA4 and its relationship with immune infiltrates in BLCA. We confirmed that the overexpression of CPA4 is associated with shorter overall survival, disease-specific survival, and progression-free intervals, and with more death events. Moreover, we found that several infiltrating immune cell types (Th1 cell, Th2 cell, T cell exhaustion, and tumor-associated macrophage) were correlated with the expression of CPA4 in bladder cancer using TIMER2 and GEPIA2. In conclusion, CPA4 may be a novel and valuable prognostic biomarker in BLCA based on bioinformatic analysis.
Abstract: Carboxypeptidase A4 (CPA4) has shown the potential to be a biomarker in the early diagnosis of certain cancers. However, no previous research has linked CPA4 to therapeutic or prognostic significance in bladder cancer. Using data from The Cancer Genome Atlas (TCGA) database, we set out to determine the full extent of the link between CPA4 and BLCA. We further analyzed the interacting proteins of CPA4 and infiltrated immune cells via the TIMER2, STRING, and GEPIA2 databases. The expression of CPA4 in tumor and normal tissues was compared using the TCGA + GTEx database. The connection between CPA4 expression and clinicopathologic characteristics and overall survival (OS) was investigated using multivariate methods and Kaplan–Meier survival curves. The potential functions and pathways were investigated via gene set enrichment analysis. Furthermore, we analyzed the associations between CPA4 expression and infiltrated immune cells with their respective gene marker sets using the ssGSEA, TIMER2, and GEPIA2 databases. Compared with matching normal tissues, human CPA4 was found to be expressed at substantially higher levels. We confirmed that the overexpression of CPA4 is linked with shorter OS, disease-specific survival, and progression-free interval (PFI), and with increased diagnostic potential, using Kaplan–Meier and ROC analysis. The expression of CPA4 is related to T-bet, IL12RB2, CTLA4, and LAG3, among which T-bet and IL12RB2 are Th1 marker genes while CTLA4 and LAG3 are related to T cell exhaustion, which may be used to guide the application of checkpoint blockade and the adoption of T cell transfer therapy.
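Since the survival comparisons above rely on Kaplan–Meier curves, here is a minimal sketch of the product-limit estimator. It is a generic textbook version, not the authors' pipeline, and the follow-up times and event flags are invented for illustration.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up time for each patient
    events: 1 if the event (e.g. death) was observed, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # censored patients leave the risk set silently
    return curve

# Hypothetical cohort of four patients; the second one is censored at t = 2.
curve = kaplan_meier(times=[1, 2, 3, 4], events=[1, 0, 1, 1])
```

The censored patient does not lower the curve at t = 2 but does shrink the risk set, so the death at t = 3 removes half of the remaining survival probability rather than a third.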
10
Xie X, Kendzior MC, Ge X, Mainzer LS, Sinha S. VarSAn: associating pathways with a set of genomic variants using network analysis. Nucleic Acids Res 2021; 49:8471-8487. [PMID: 34313777 PMCID: PMC8421213 DOI: 10.1093/nar/gkab624] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2020] [Revised: 05/18/2021] [Accepted: 07/20/2021] [Indexed: 02/01/2023] Open
Abstract
There is a pressing need today to mechanistically interpret sets of genomic variants associated with diseases. Here we present a tool called ‘VarSAn’ that uses a network analysis algorithm to identify pathways relevant to a given set of variants. VarSAn analyzes a configurable network whose nodes represent variants, genes and pathways, using a Random Walk with Restarts algorithm to rank pathways for relevance to the given variants, and reports P-values for pathway relevance. It treats non-coding and coding variants differently, properly accounts for the number of pathways impacted by each variant and identifies relevant pathways even if many variants do not directly impact genes of the pathway. We use VarSAn to identify pathways relevant to variants related to cancer and several other diseases, as well as drug response variation. We find VarSAn's pathway ranking to be complementary to the standard approach of enrichment tests on genes related to the query set. We adopt a novel benchmarking strategy to quantify its advantage over this baseline approach. Finally, we use VarSAn to discover key pathways, including the VEGFA-VEGFR2 pathway, related to de novo variants in patients of Hypoplastic Left Heart Syndrome, a rare and severe congenital heart defect.
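The Random Walk with Restarts step can be sketched on a toy network. This is a generic power-iteration version under assumed settings (restart probability 0.5, unweighted undirected edges, a single seed variant); the node layout is hypothetical and VarSAn's actual network construction, edge weighting, and P-value computation are not reproduced here.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that restarts at the seeds.

    adj:   symmetric adjacency matrix as nested lists
    seeds: set of node indices where the walker restarts
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    # Column-normalize so each column is a transition distribution.
    w = [[adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
         for i in range(n)]
    p0 = [1.0 / len(seeds) if i in seeds else 0.0 for i in range(n)]
    p = p0[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(w[i][j] * p[j] for j in range(n)) + restart * p0[i]
               for i in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p

# Toy network: 0 = variant, 1 = gene X, 2 = pathway A, 3 = pathway B, 4 = gene Y.
# Edges: variant-geneX, geneX-pathwayA, geneX-geneY, geneY-pathwayB.
adj = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
]
scores = random_walk_with_restart(adj, seeds={0})
```

Pathway B still receives a nonzero score although the variant does not touch any of its genes directly, which mirrors the abstract's point about identifying pathways that are only indirectly connected. Pathway A, being closer to the seed, ranks higher.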
Affiliation(s)
- Xiaoman Xie
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Matthew C Kendzior
- National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Xiyu Ge
- Department of Molecular and Integrative Physiology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Liudmila S Mainzer
- National Center for Supercomputing Applications, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
- Saurabh Sinha
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA; Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA; Carl R. Woese Institute for Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA; Cancer Center of Illinois, University of Illinois Urbana-Champaign, Urbana, IL 61801, USA
|
11
|
Peng S, Yang Y, Liu W, Li F, Liao X. Discriminant Projection Shared Dictionary Learning for Classification of Tumors Using Gene Expression Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2021; 18:1464-1473. [PMID: 31675339 DOI: 10.1109/tcbb.2019.2950209] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 06/10/2023]
Abstract
Because tumors have many subtypes, personalized treatment requires identifying the subtype of a tumor as accurately as possible. The development of DNA microarrays provides an opportunity to predict tumor classification. One strategy is to use gene expression profiling to extend current biological insights into the disease. However, most machine learning methods overfit when classifying tumor gene expression profile data, which are characterized by high dimensionality, small sample sizes, and nonlinearity. As a new machine learning method, dictionary learning has become a more effective algorithm for gene expression profile classification. Here, a new method called discriminant projection shared dictionary learning (DPSDL) is proposed for classifying tumor subtypes using LINCS gene expression profile data. The method trains a shared dictionary and embeds Fisher discriminant criteria to obtain class-specific sub-dictionaries and coding coefficients. At the same time, a projection matrix is trained to widen the distance between different classes of samples. Experimental results show that our method classifies gene expression profiles better than other dictionary learning methods and machine learning methods.
|
12
|
Oh M, Kim K, Sun H. Covariance thresholding to detect differentially co-expressed genes from microarray gene expression data. J Bioinform Comput Biol 2021; 18:2050002. [PMID: 32336254 DOI: 10.1142/s021972002050002x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Gene set analysis aims to identify differentially expressed or co-expressed genes within a biological pathway between two experimental conditions, so that it can eventually reveal biological processes and pathways involved in disease development. In the last few decades, various statistical and computational methods have been proposed to improve the statistical power of gene set analysis. In recent years, much attention has been paid to differentially co-expressed genes, since they can be disease-related without showing a significant difference in average expression levels between two conditions. In this paper, we propose a new statistical method to identify differentially co-expressed genes from microarray gene expression data. The proposed method first estimates co-expression levels of paired genes using covariance regularization by thresholding, and then evaluates the significance of the difference in the covariance estimates between two conditions. Through extensive simulation studies, we demonstrated that the proposed method is more powerful than the existing mainstream methods at detecting co-expressed genes. We also applied it to various microarray gene expression datasets related to mutant p53 transcriptional activity and to epithelium and stroma breast cancer.
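The two steps named in this abstract (threshold each condition's covariance estimate, then compare the two) can be sketched on toy data. This is only an illustration under assumptions: hard thresholding, the threshold value, the sum-of-squared-differences statistic, and all function names are choices made for the example rather than the authors' estimator, and the significance evaluation is not shown.

```python
def sample_covariance(rows):
    """Unbiased sample covariance matrix; rows are samples, columns are genes."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def hard_threshold(cov, lam):
    """Zero every off-diagonal entry whose absolute value is below lam."""
    p = len(cov)
    return [[cov[i][j] if i == j or abs(cov[i][j]) >= lam else 0.0
             for j in range(p)] for i in range(p)]

def coexpression_difference(rows_a, rows_b, lam):
    """Sum of squared differences between thresholded off-diagonal covariances."""
    a = hard_threshold(sample_covariance(rows_a), lam)
    b = hard_threshold(sample_covariance(rows_b), lam)
    p = len(a)
    return sum((a[i][j] - b[i][j]) ** 2 for i in range(p) for j in range(i + 1, p))

# Toy data: genes 0 and 1 move together in condition A but not in condition B.
condition_a = [(1, 1, 5), (2, 2, 3), (3, 3, 4), (4, 4, 6)]
condition_b = [(1, 3, 5), (2, 1, 3), (3, 4, 4), (4, 2, 6)]
stat = coexpression_difference(condition_a, condition_b, lam=1.0)
```

In practice the statistic would be compared against a null distribution, for example by permuting condition labels, to obtain a significance level.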
Affiliation(s)
- Mingyu Oh
- Department of Statistics, Pusan National University, Busan, 46241, Korea
- Kipoong Kim
- Department of Statistics, Pusan National University, Busan, 46241, Korea
- Hokeun Sun
- Department of Statistics, Pusan National University, Busan, 46241, Korea
|
13
|
Liang L, Huang Q, Gan M, Jiang L, Yan H, Lin Z, Zhu H, Wang R, Hu K. High SEC61G expression predicts poor prognosis in patients with Head and Neck Squamous Cell Carcinomas. J Cancer 2021; 12:3887-3899. [PMID: 34093796 PMCID: PMC8176234 DOI: 10.7150/jca.51467] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 08/03/2020] [Accepted: 04/23/2021] [Indexed: 01/11/2023] Open
Abstract
Background: Overexpression of the membrane protein SEC61 translocon gamma subunit (SEC61G) has been observed in a variety of cancers; however, its role in head and neck squamous cell carcinomas (HNSCC) is unknown. This study aimed to elucidate the relationship between SEC61G and HNSCC based on data from The Cancer Genome Atlas (TCGA) database. Methods: Data for HNSCC patients were collected from TCGA and the expression level of SEC61G was compared between paired HNSCC and normal tissues using the Wilcoxon rank-sum test. The relationship between clinicopathologic features and SEC61G expression was also analyzed using the Wilcoxon rank-sum test and logistic regression. Receiver operating characteristic (ROC) curves were generated to evaluate the value of SEC61G as a binary classifier using the area under the curve (AUC value). The association of clinicopathologic characteristics with prognosis in HNSCC patients was assessed using Cox regression and the Kaplan-Meier methods. A nomogram, based on Cox multivariate analysis, was used to predict the impact of SEC61G on prognosis. Functional enrichment analysis was performed to determine the hallmark pathways associated with differentially expressed genes in HNSCC patients exhibiting high and low SEC61G expression. Results: The expression of SEC61G was significantly elevated in HNSCC tissues compared to normal tissues (P < 0.001). The high expression of SEC61G was significantly correlated with the T stage, M stage, clinical stage, TP53 mutation status, PIK3CA mutation status, primary therapy outcome, and cervical lymph node dissection (all P < 0.05). Meanwhile, ROC curves suggested the significant diagnostic ability of SEC61G for HNSCC (AUC = 0.923). Kaplan-Meier survival analysis showed that patients with HNSCC characterized by high SEC61G expression had a poorer prognosis than patients with low SEC61G expression (hazard ratio = 1.95, 95% confidence interval 1.48-2.56, P < 0.001). 
Univariate and multivariate analyses revealed that SEC61G was independently associated with overall survival (P = 0.027). Functional annotations indicated that SEC61G is involved in pathways related to translation and regulation of SLITs/ROBOs expression, SRP-dependent co-translational protein targeting to the membrane, nonsense-mediated decay, oxidative phosphorylation, and Parkinson's disease. Conclusion: SEC61G plays a vital role in HNSCC progression and prognosis; it may, therefore, serve as an effective biomarker for the prediction of patient survival.
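The AUC used in this abstract to judge a single gene as a binary classifier has a simple interpretation: it is the probability that a randomly chosen tumor sample has a higher expression value than a randomly chosen normal sample. A minimal sketch, using made-up expression values and a hypothetical function name:

```python
def auc_from_scores(positive, negative):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half a win (the Mann-Whitney formulation)."""
    wins = 0.0
    for sp in positive:
        for sn in negative:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(positive) * len(negative))

# Hypothetical expression values, not data from the study.
tumor = [5.1, 6.3, 4.8, 7.0]
normal = [3.9, 4.9, 4.1]
auc = auc_from_scores(tumor, normal)  # 11 of 12 pairs are ordered correctly
```

An AUC of 0.5 corresponds to a marker that does no better than chance, and 1.0 to perfect separation of the two groups.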
Affiliation(s)
- Leifeng Liang
- Department of Oncology, The Sixth Affiliated Hospital of Guangxi Medical University, Yulin, Guangxi, China
- Qingwen Huang
- Department of Pathology, The Sixth Affiliated Hospital of Guangxi Medical University, Yulin, Guangxi, China
- Mei Gan
- Department of Oncology, The Sixth Affiliated Hospital of Guangxi Medical University, Yulin, Guangxi, China
- Liujun Jiang
- Department of Oncology, The Sixth Affiliated Hospital of Guangxi Medical University, Yulin, Guangxi, China
- Haolin Yan
- Department of Oncology, The Sixth Affiliated Hospital of Guangxi Medical University, Yulin, Guangxi, China
- Zhan Lin
- Department of Oncology, The Sixth Affiliated Hospital of Guangxi Medical University, Yulin, Guangxi, China
- Haisheng Zhu
- Department of Oncology, The Sixth Affiliated Hospital of Guangxi Medical University, Yulin, Guangxi, China
- Rensheng Wang
- Department of Radiation Oncology, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
- Kai Hu
- Department of Radiation Oncology, The First Affiliated Hospital of Guangxi Medical University, Nanning, Guangxi, China
|
14
|
Yu J, Hu D, Cheng Y, Guo J, Wang Y, Tan Z, Peng J, Zhou H. Lipidomics and transcriptomics analyses of altered lipid species and pathways in oxaliplatin-treated colorectal cancer cells. J Pharm Biomed Anal 2021; 200:114077. [PMID: 33892396 DOI: 10.1016/j.jpba.2021.114077] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 12/25/2020] [Revised: 03/13/2021] [Accepted: 04/12/2021] [Indexed: 12/17/2022]
Abstract
Drug resistance and adverse reactions to oxaliplatin remain a considerable issue in clinical practice. Emerging evidence has suggested that alterations in lipid metabolism during drug therapy affect cancer cells. To gain insight into this process, we investigated the lipid and gene expression profile changes in HT29 cells treated with oxaliplatin. A total of 1403 lipid species from 16 lipid classes were identified by UHPLC-MS. Interestingly, phospholipids, including phosphatidylglycerol (PG), phosphatidic acid (PA), phosphatidylcholine (PC), and most phosphatidylethanolamine (PE) species with polyunsaturated fatty acid (PUFA) chains, were significantly elevated after oxaliplatin treatment, while triacylglycerols (TAGs) with a saturated or monounsaturated fatty acid chain were significantly downregulated. Gene Set Enrichment Analysis (GSEA) based on RNA sequencing data suggested that neutral lipid metabolism was enriched in the control group, whereas the phospholipid metabolic process was enriched in the oxaliplatin-treated group. Based on the network between the core lipid-related genes and lipid species, we observed that the altered lipid metabolism enzyme genes were involved in the synthesis and lipolysis of TAGs and in the Lands cycle pathway, which was further verified by qRT-PCR. In summary, our findings revealed that oxaliplatin imprinted a specific lipid profile signature and lipid transcriptional reprogramming in HT29 cells, which provides new insights into biomarker discovery and pathways for overcoming drug resistance and adverse reactions.
Affiliation(s)
- Jing Yu
- Department of Clinical Pharmacology, Xiangya Hospital, Central South University, 87 Xiangya Road, Changsha 410008, PR China; Institute of Clinical Pharmacology, Central South University, Hunan Key Laboratory of Pharmacogenetics, 110 Xiangya Road, Changsha 410078, PR China; Engineering Research Center of Applied Technology of Pharmacogenomics, Ministry of Education, 110 Xiangya Road, Changsha 410078, PR China; National Clinical Research Center for Geriatric Disorders, 87 Xiangya Road, Changsha 410008, Hunan, PR China
- Dongli Hu
- Department of Clinical Pharmacology, Xiangya Hospital, Central South University, 87 Xiangya Road, Changsha 410008, PR China; Institute of Clinical Pharmacology, Central South University, Hunan Key Laboratory of Pharmacogenetics, 110 Xiangya Road, Changsha 410078, PR China; Engineering Research Center of Applied Technology of Pharmacogenomics, Ministry of Education, 110 Xiangya Road, Changsha 410078, PR China; National Clinical Research Center for Geriatric Disorders, 87 Xiangya Road, Changsha 410008, Hunan, PR China
- Yu Cheng
- Department of Clinical Pharmacology, Xiangya Hospital, Central South University, 87 Xiangya Road, Changsha 410008, PR China; Institute of Clinical Pharmacology, Central South University, Hunan Key Laboratory of Pharmacogenetics, 110 Xiangya Road, Changsha 410078, PR China; Engineering Research Center of Applied Technology of Pharmacogenomics, Ministry of Education, 110 Xiangya Road, Changsha 410078, PR China; National Clinical Research Center for Geriatric Disorders, 87 Xiangya Road, Changsha 410008, Hunan, PR China
- Jiwei Guo
- Department of Clinical Pharmacology, Xiangya Hospital, Central South University, 87 Xiangya Road, Changsha 410008, PR China; Institute of Clinical Pharmacology, Central South University, Hunan Key Laboratory of Pharmacogenetics, 110 Xiangya Road, Changsha 410078, PR China; Engineering Research Center of Applied Technology of Pharmacogenomics, Ministry of Education, 110 Xiangya Road, Changsha 410078, PR China; National Clinical Research Center for Geriatric Disorders, 87 Xiangya Road, Changsha 410008, Hunan, PR China
- Yicheng Wang
- Department of Clinical Pharmacology, Xiangya Hospital, Central South University, 87 Xiangya Road, Changsha 410008, PR China; Institute of Clinical Pharmacology, Central South University, Hunan Key Laboratory of Pharmacogenetics, 110 Xiangya Road, Changsha 410078, PR China; Engineering Research Center of Applied Technology of Pharmacogenomics, Ministry of Education, 110 Xiangya Road, Changsha 410078, PR China; National Clinical Research Center for Geriatric Disorders, 87 Xiangya Road, Changsha 410008, Hunan, PR China
- Zhirong Tan
- Department of Clinical Pharmacology, Xiangya Hospital, Central South University, 87 Xiangya Road, Changsha 410008, PR China; Institute of Clinical Pharmacology, Central South University, Hunan Key Laboratory of Pharmacogenetics, 110 Xiangya Road, Changsha 410078, PR China; Engineering Research Center of Applied Technology of Pharmacogenomics, Ministry of Education, 110 Xiangya Road, Changsha 410078, PR China; National Clinical Research Center for Geriatric Disorders, 87 Xiangya Road, Changsha 410008, Hunan, PR China
- Jingbo Peng
- Department of Clinical Pharmacology, Xiangya Hospital, Central South University, 87 Xiangya Road, Changsha 410008, PR China; Institute of Clinical Pharmacology, Central South University, Hunan Key Laboratory of Pharmacogenetics, 110 Xiangya Road, Changsha 410078, PR China; Engineering Research Center of Applied Technology of Pharmacogenomics, Ministry of Education, 110 Xiangya Road, Changsha 410078, PR China; National Clinical Research Center for Geriatric Disorders, 87 Xiangya Road, Changsha 410008, Hunan, PR China.
- Honghao Zhou
- Department of Clinical Pharmacology, Xiangya Hospital, Central South University, 87 Xiangya Road, Changsha 410008, PR China; Institute of Clinical Pharmacology, Central South University, Hunan Key Laboratory of Pharmacogenetics, 110 Xiangya Road, Changsha 410078, PR China; Engineering Research Center of Applied Technology of Pharmacogenomics, Ministry of Education, 110 Xiangya Road, Changsha 410078, PR China; National Clinical Research Center for Geriatric Disorders, 87 Xiangya Road, Changsha 410008, Hunan, PR China.
|
15
|
Wang XJ, Carlson P, Chedid V, Maselli DB, Taylor AL, McKinzie S, Camilleri M. Differential mRNA Expression in Ileal Mucosal Biopsies of Patients With Diarrhea- or Constipation-Predominant Irritable Bowel Syndrome. Clin Transl Gastroenterol 2021; 12:e00329. [PMID: 33843785 PMCID: PMC8043738 DOI: 10.14309/ctg.0000000000000329] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/25/2020] [Accepted: 02/17/2021] [Indexed: 12/22/2022] Open
Abstract
INTRODUCTION Previous studies in patients with irritable bowel syndrome (IBS) showed immune activation, secretion, and barrier dysfunction in duodenal, jejunal, or colorectal mucosa. This study aimed to measure ileal mucosal expression of genes and proteins associated with mucosal functions. METHODS We measured by reverse transcription polymerase chain reaction messenger RNA (mRNA) expression of 78 genes (reflecting tight junction proteins, chemokines, innate immunity, ion channels, and transmitters) and 5 proteins (barrier, bile acid receptor, and ion exchanger) in terminal ileal mucosa from 11 patients with IBS-diarrhea (IBS-D), 17 patients with IBS-constipation (IBS-C), and 14 healthy controls. Fold changes in mRNA were calculated using the 2^(-ΔΔCT) formula. Group differences were measured using analysis of variance. Protein ratios relative to healthy controls were based on Western blot analysis. Nominal P values (P < 0.05) are reported. RESULTS In ileal mucosal biopsies, significant differences of mRNA expression in IBS-D relative to IBS-C were upregulation of barrier proteins (TJP1, FN1, CLDN1, and CLDN12), repair function (TFF1), and cellular functions. In ileal mucosal biopsies, mRNA expression in IBS-C relative to healthy controls was reduced GPBAR1 receptor, myosin light chain kinase (MYLK in barrier function), and innate immunity (TLR3), but increased mRNA expression of cadherin cell adhesion mechanisms (CTNNB1) and transport genes SLC9A1 (Na-H exchanger [NHE1]) and INADL (indirect effect on ion transport). DISCUSSION These data support a role of ileal mucosal dysfunction in IBS, including barrier dysfunction in IBS-D and alterations in absorption/secretion mechanisms in IBS-C.
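The fold-change formula mentioned in the methods is the standard comparative CT calculation: the target gene's cycle threshold (CT) is first normalized to a reference gene within each group, and the difference between groups is then converted to a ratio assuming the product doubles each cycle. A short sketch with made-up CT values (the function name and numbers are illustrative, not taken from the study):

```python
def fold_change_ddct(ct_target_case, ct_ref_case, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression of a target gene (case vs. control) by 2^(-ddCT)."""
    delta_case = ct_target_case - ct_ref_case   # dCT in the case group
    delta_ctrl = ct_target_ctrl - ct_ref_ctrl   # dCT in the control group
    delta_delta = delta_case - delta_ctrl       # ddCT
    return 2.0 ** (-delta_delta)

# Hypothetical CT values: the target crosses threshold 2 cycles earlier in cases
# (relative to the reference gene), which corresponds to a 4-fold increase.
fc = fold_change_ddct(ct_target_case=24.0, ct_ref_case=18.0,
                      ct_target_ctrl=26.0, ct_ref_ctrl=18.0)
```

A lower CT means more starting template, which is why the exponent carries a minus sign.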
Affiliation(s)
- Xiao Jing Wang
- Clinical Enteric Neuroscience Translational and Epidemiological Research (CENTER), Mayo Clinic, Rochester, Minnesota, USA
- Paula Carlson
- Clinical Enteric Neuroscience Translational and Epidemiological Research (CENTER), Mayo Clinic, Rochester, Minnesota, USA
- Victor Chedid
- Clinical Enteric Neuroscience Translational and Epidemiological Research (CENTER), Mayo Clinic, Rochester, Minnesota, USA
- Daniel B. Maselli
- Clinical Enteric Neuroscience Translational and Epidemiological Research (CENTER), Mayo Clinic, Rochester, Minnesota, USA
- Ann L. Taylor
- Clinical Enteric Neuroscience Translational and Epidemiological Research (CENTER), Mayo Clinic, Rochester, Minnesota, USA
- Sanna McKinzie
- Clinical Enteric Neuroscience Translational and Epidemiological Research (CENTER), Mayo Clinic, Rochester, Minnesota, USA
- Michael Camilleri
- Clinical Enteric Neuroscience Translational and Epidemiological Research (CENTER), Mayo Clinic, Rochester, Minnesota, USA
|
16
|
Abstract
Background: Gene set enrichment analyses (GSEA) provide a useful and powerful approach to identify differentially expressed gene sets with prior biological knowledge. Several GSEA algorithms have been proposed to perform enrichment analyses on groups of genes. However, many of these algorithms have focused on the identification of differentially expressed gene sets in a given phenotype. Objective: In this paper, we propose a gene set analytic framework, Gene Set Correlation Analysis (GSCoA), that simultaneously measures within and between gene sets variation to identify sets of genes enriched for differential expression and highly co-related pathways. Methods: We apply co-inertia analysis to the comparisons of cross-gene sets in gene expression data to measure the co-structure of expression profiles in pairs of gene sets. Co-inertia analysis (CIA) is one multivariate method to identify trends or co-relationships in multiple datasets, which contain the same samples. The objective of CIA is to seek ordinations (dimension reduction diagrams) of two gene sets such that the square covariance between the projections of the gene sets on successive axes is maximized. Simulation studies illustrate that CIA offers superior performance in identifying co-relationships between gene sets in all simulation settings when compared to correlation-based gene set methods. Result and Conclusion: We also combine between-gene set CIA and GSEA to discover the relationships between gene sets significantly associated with phenotypes. In addition, we provide a graphical technique for visualizing and simultaneously exploring the associations of between and within gene sets and their interaction and network. We then demonstrate integration of within and between gene sets variation using CIA and GSEA, applied to the p53 gene expression data using the c2 curated gene sets. Ultimately, the GSCoA approach provides an attractive tool for identification and visualization of novel associations between pairs of gene sets by integrating co-relationships between gene sets into gene set analysis.
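The quantity that co-inertia analysis decomposes can be illustrated without any ordination step. With uniform sample weights and identity metrics (assumptions made here for simplicity), the total co-inertia of two gene sets measured on the same samples is the sum of squared cross-covariances between their genes, and normalizing it gives the RV coefficient, a value between 0 and 1. The sketch below computes only these summaries, not the CIA axes, and all names and data are hypothetical.

```python
import math

def centered(rows):
    """Subtract each column (gene) mean; rows are samples."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def total_coinertia(x_rows, y_rows):
    """Sum of squared covariances between every gene in X and every gene in Y."""
    x, y = centered(x_rows), centered(y_rows)
    n = len(x)
    total = 0.0
    for j in range(len(x[0])):
        for k in range(len(y[0])):
            cov = sum(x[i][j] * y[i][k] for i in range(n)) / n
            total += cov * cov
    return total

def rv_coefficient(x_rows, y_rows):
    """Total co-inertia normalized by each table's own co-inertia."""
    return total_coinertia(x_rows, y_rows) / math.sqrt(
        total_coinertia(x_rows, x_rows) * total_coinertia(y_rows, y_rows))

# Two toy gene sets measured on the same four samples.
set_a = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]]
set_b = [[2.0, 0.0], [4.0, 1.0], [6.0, 0.0], [8.0, 1.0]]
rv_ab = rv_coefficient(set_a, set_b)
rv_aa = rv_coefficient(set_a, set_a)
```

A full CIA would additionally take a singular value decomposition of the cross-covariance matrix to obtain the successive axes on which the projected covariance is maximized.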
Affiliation(s)
- Chen-An Tsai
- Department of Agronomy, National Taiwan University, Taipei, Taiwan
- James J. Chen
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, FDA, Jefferson, AR 72079, United States
|
17
|
|
18
|
Karatzas E, Zachariou M, Bourdakou MM, Minadakis G, Oulas A, Kolios G, Delis A, Spyrou GM. PathWalks: identifying pathway communities using a disease-related map of integrated information. Bioinformatics 2020; 36:4070-4079. [PMID: 32369599 PMCID: PMC7332569 DOI: 10.1093/bioinformatics/btaa291] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 02/05/2020] [Revised: 04/11/2020] [Accepted: 04/27/2020] [Indexed: 12/17/2022] Open
Abstract
MOTIVATION Understanding the underlying biological mechanisms and respective interactions of a disease remains an elusive, time-consuming and costly task. Computational methodologies that propose pathway/mechanism communities and reveal respective relationships can be of great value as they can help expedite the process of identifying how perturbations in a single pathway can affect other pathways. RESULTS We present a random-walks-based methodology called PathWalks, where a walker crosses a pathway-to-pathway network under the guidance of a disease-related map. The latter is a gene network that we construct by integrating multi-source information regarding a specific disease. The most frequent trajectories highlight communities of pathways that are expected to be strongly related to the disease under study. We apply the PathWalks methodology to Alzheimer's disease and idiopathic pulmonary fibrosis and establish that it can highlight pathways that are also identified by other pathway analysis tools and are backed by bibliographic references. More importantly, PathWalks produces additional new pathways that are functionally connected with those already established, giving insight for further experimentation. AVAILABILITY AND IMPLEMENTATION https://github.com/vagkaratzas/PathWalks. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Evangelos Karatzas
- Department of Informatics and Telecommunications, University of Athens, Athens 15703, Greece
- Margarita Zachariou
- Department of Bioinformatics, The Cyprus Institute of Neurology and Genetics, Nicosia 2370, Cyprus; The Cyprus School of Molecular Medicine, The Cyprus Institute of Neurology and Genetics, Nicosia 2370, Cyprus
- Marilena M Bourdakou
- Department of Bioinformatics, The Cyprus Institute of Neurology and Genetics, Nicosia 2370, Cyprus; Department of Medicine, Laboratory of Pharmacology, Democritus University of Thrace, Komotini, Greece
- George Minadakis
- Department of Bioinformatics, The Cyprus Institute of Neurology and Genetics, Nicosia 2370, Cyprus; The Cyprus School of Molecular Medicine, The Cyprus Institute of Neurology and Genetics, Nicosia 2370, Cyprus
- Anastasis Oulas
- Department of Bioinformatics, The Cyprus Institute of Neurology and Genetics, Nicosia 2370, Cyprus; The Cyprus School of Molecular Medicine, The Cyprus Institute of Neurology and Genetics, Nicosia 2370, Cyprus
- George Kolios
- Department of Medicine, Laboratory of Pharmacology, Democritus University of Thrace, Komotini, Greece
- Alex Delis
- Department of Informatics and Telecommunications, University of Athens, Athens 15703, Greece
- George M Spyrou
- Department of Bioinformatics, The Cyprus Institute of Neurology and Genetics, Nicosia 2370, Cyprus; The Cyprus School of Molecular Medicine, The Cyprus Institute of Neurology and Genetics, Nicosia 2370, Cyprus
|
19
|
Schneider K, Venn B, Mühlhaus T. TMEA: A Thermodynamically Motivated Framework for Functional Characterization of Biological Responses to System Acclimation. Entropy 2020; 22:e22091030. [PMID: 33286800 PMCID: PMC7597090 DOI: 10.3390/e22091030] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/16/2020] [Revised: 09/07/2020] [Accepted: 09/11/2020] [Indexed: 12/16/2022]
Abstract
The objective of gene set enrichment analysis (GSEA) in modern biological studies is to identify functional profiles in huge sets of biomolecules generated by high-throughput measurements of genes, transcripts, metabolites, and proteins. GSEA is based on a two-stage process using classical statistical analysis to score the input data and subsequent testing for overrepresentation of the enrichment score within a given functionally coherent set. However, enrichment scores computed by different methods are merely statistically motivated and often elude direct biological interpretation. Here, we propose a novel approach, called Thermodynamically Motivated Enrichment Analysis (TMEA), to account for the energy investment in biologically relevant processes. TMEA is based on surprisal analysis, which offers a thermodynamic, free-energy-based representation of the biological steady state and of the biological change. The contribution of each biomolecule underlying the changes in free energy is used in a Monte Carlo resampling procedure, resulting in a functional characterization directly coupled to the thermodynamic characterization of biological responses to system perturbations. To illustrate the utility of our method on real experimental data, we benchmark our approach on plant acclimation to high light and compare the performance of TMEA with the most frequently used method for GSEA.
|
20
|
Fifteen Years of Gene Set Analysis for High-Throughput Genomic Data: A Review of Statistical Approaches and Future Challenges. Entropy 2020; 22:e22040427. [PMID: 33286201 PMCID: PMC7516904 DOI: 10.3390/e22040427] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 02/24/2020] [Revised: 03/18/2020] [Accepted: 04/03/2020] [Indexed: 12/22/2022]
Abstract
Over the last decade, gene set analysis has become the first choice for gaining insights into underlying complex biology of diseases through gene expression and gene association studies. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results. Although gene set analysis approaches are extensively used in gene expression and genome wide association data analysis, the statistical structure and steps common to these approaches have not yet been comprehensively discussed, which limits their utility. In this article, we provide a comprehensive overview, statistical structure and steps of gene set analysis approaches used for microarrays, RNA-sequencing and genome wide association data analysis. Further, we also classify the gene set analysis approaches and tools by the type of genomic study, null hypothesis, sampling model and nature of the test statistic, etc. Rather than reviewing the gene set analysis approaches individually, we provide the generation-wise evolution of such approaches for microarrays, RNA-sequencing and genome wide association studies and discuss their relative merits and limitations. Here, we identify the key biological and statistical challenges in current gene set analysis, which will be addressed by statisticians and biologists collectively in order to develop the next generation of gene set analysis approaches. Further, this study will serve as a catalog and provide guidelines to genome researchers and experimental biologists for choosing the proper gene set analysis approach based on several factors.
|
21
|
Geistlinger L, Csaba G, Santarelli M, Ramos M, Schiffer L, Turaga N, Law C, Davis S, Carey V, Morgan M, Zimmer R, Waldron L. Toward a gold standard for benchmarking gene set enrichment analysis. Brief Bioinform 2020; 22:545-556. [PMID: 32026945 PMCID: PMC7820859 DOI: 10.1093/bib/bbz158] [Citation(s) in RCA: 66] [Impact Index Per Article: 13.2] [Received: 08/12/2019] [Revised: 10/11/2019] [Accepted: 11/09/2019] [Indexed: 12/22/2022] Open
Abstract
MOTIVATION Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets. RESULTS We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance. AVAILABILITY http://bioconductor.org/packages/GSEABenchmarkeR. CONTACT ludwig.geistlinger@sph.cuny.edu.
Affiliation(s)
- Ludwig Geistlinger
- Graduate School of Public Health and Health Policy, City University of New York, New York, NY 10027, USA
| | - Gergely Csaba
- Institute for Implementation Science and Population Health, City University of New York, New York, NY 10027, USA
| | - Mara Santarelli
- Institute for Bioinformatics, Ludwig-Maximilians-Universität München, 80333 Munich, Germany
| | - Marcel Ramos
- Roswell Park Cancer Institute, Buffalo, NY 14203, USA
| | - Lucas Schiffer
- Graduate School of Arts and Sciences, Boston University, Boston, MA 02215, USA
| | - Nitesh Turaga
- Epigenetics and Development Division, The Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria 3052, Australia
| | - Charity Law
- Department of Medical Biology, The University of Melbourne, Parkville, Victoria 3010, Australia
| | - Sean Davis
- Center for Cancer Research, National Cancer Institute, Bethesda, MD 20892, USA
| | | | | | | | - Levi Waldron
- Graduate School of Public Health and Health Policy, City University of New York, New York, NY 10027, USA
22
Mandelboum S, Manber Z, Elroy-Stein O, Elkon R. Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias. PLoS Biol 2019; 17:e3000481. [PMID: 31714939 PMCID: PMC6850523 DOI: 10.1371/journal.pbio.3000481] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Received: 06/30/2019] [Accepted: 10/08/2019] [Indexed: 11/19/2022] Open
Abstract
Data normalization is a critical step in RNA sequencing (RNA-seq) analysis, aiming to remove systematic effects from the data to ensure that technical biases have minimal impact on the results. Analyzing numerous RNA-seq datasets, we detected a prevalent sample-specific length effect that leads to a strong association between gene length and fold-change estimates between samples. This stochastic sample-specific effect is not corrected by common normalization methods, including reads per kilobase of transcript length per million reads (RPKM), Trimmed Mean of M values (TMM), relative log expression (RLE), and quantile and upper-quartile normalization. Importantly, we demonstrate that this bias causes recurrent false positive calls by gene-set enrichment analysis (GSEA) methods, thereby leading to frequent functional misinterpretation of the data. Gene sets characterized by markedly short genes (e.g., ribosomal protein genes) or long genes (e.g., extracellular matrix genes) are particularly prone to such false calls. This sample-specific length bias is effectively removed by the conditional quantile normalization (cqn) and EDASeq methods, which allow the integration of gene length as a sample-specific covariate. Consequently, using these normalization methods led to substantial reduction in GSEA false results while retaining true ones. In addition, we found that application of gene-set tests that take into account gene–gene correlations attenuates false positive rates caused by the length bias, but statistical power is reduced as well. Our results advocate the inspection and correction of sample-specific length biases as default steps in RNA-seq analysis pipelines and reiterate the need to account for intergene correlations when performing gene-set enrichment tests to lessen false interpretation of transcriptomic data. 
Analysis of numerous RNA-seq datasets reveals a recurrent sample-specific length bias that causes frequent false positive calls by gene-set enrichment analyses, leading to functional misinterpretation of the data. Its removal requires methods that allow the integration of gene length as sample-specific covariate.
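The inspection step that this abstract advocates can be illustrated with a minimal sketch: a rank correlation between gene length and the log fold-change estimate for one sample comparison. The function names and the toy values below are hypothetical and are not taken from the paper, which uses cqn and EDASeq for the actual correction. A correlation far from zero would only be a hint that a sample-specific length effect is present.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def length_bias(gene_lengths, log2_fold_changes):
    """Spearman correlation between gene length and log2 fold change."""
    return pearson(average_ranks(gene_lengths), average_ranks(log2_fold_changes))


# Toy example (made-up numbers): longer genes show larger fold changes.
lengths = [500, 1200, 3000, 8000, 20000]
lfc = [-0.4, -0.1, 0.0, 0.3, 0.6]
print(round(length_bias(lengths, lfc), 3))
```

In practice the check would be run per sample pair on all expressed genes, since the abstract describes the effect as sample-specific.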
Affiliation(s)
- Shir Mandelboum
- School of Molecular Cell Biology and Biotechnology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- Zohar Manber
- Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel
- Orna Elroy-Stein
- School of Molecular Cell Biology and Biotechnology, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
- Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- * E-mail: (OE-S); (RE)
- Ran Elkon
- Sagol School of Neuroscience, Tel Aviv University, Tel Aviv, Israel
- Department of Human Molecular Genetics and Biochemistry, Sackler School of Medicine, Tel Aviv University, Tel Aviv, Israel
- * E-mail: (OE-S); (RE)
23
Kim K, Sun H. Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data. BMC Bioinformatics 2019; 20:510. [PMID: 31640538 PMCID: PMC6805595 DOI: 10.1186/s12859-019-3040-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 04/10/2019] [Accepted: 08/21/2019] [Indexed: 12/23/2022] Open
Abstract
Background In human genetic association studies with high-dimensional gene expression data, it is well known that statistical selection methods utilizing prior biological network knowledge, such as genetic pathways and signaling pathways, can outperform methods that ignore genetic network structures in terms of true positive selection. In recent epigenetic research on case-control association studies, many statistical methods have been proposed to identify cancer-related CpG sites and their corresponding genes from high-dimensional DNA methylation array data. However, most existing methods are not designed to utilize genetic network information, although methylation levels between linked genes in the genetic networks tend to be highly correlated with each other. Results We propose a new approach that combines data dimension reduction techniques with network-based regularization to identify outcome-related genes for analysis of high-dimensional DNA methylation data. In simulation studies, we demonstrated that the proposed approach outperforms other statistical methods that do not utilize genetic network information in terms of true positive selection. We also applied it to the 450K DNA methylation array data of the four breast invasive carcinoma cancer subtypes from The Cancer Genome Atlas (TCGA) project. Conclusions The proposed variable selection approach can utilize prior biological network information for analysis of high-dimensional DNA methylation array data. It first captures gene-level signals from multiple CpG sites using a data dimension reduction technique and then performs network-based regularization based on biological network graph information. It can select potentially cancer-related genes and genetic pathways that were missed by the existing methods.
Affiliation(s)
- Kipoong Kim
- Department of Statistics, Pusan National University, Busan, 46241, Korea
- Hokeun Sun
- Department of Statistics, Pusan National University, Busan, 46241, Korea
24
Nguyen TM, Shafi A, Nguyen T, Draghici S. Identifying significantly impacted pathways: a comprehensive review and assessment. Genome Biol 2019; 20:203. [PMID: 31597578 PMCID: PMC6784345 DOI: 10.1186/s13059-019-1790-4] [Citation(s) in RCA: 108] [Impact Index Per Article: 18.0] [Received: 02/15/2019] [Accepted: 08/13/2019] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Many high-throughput experiments compare two phenotypes such as disease vs. healthy, with the goal of understanding the underlying biological phenomena characterizing the given phenotype. Because of the importance of this type of analysis, more than 70 pathway analysis methods have been proposed so far. These can be categorized into two main categories: non-topology-based (non-TB) and topology-based (TB). Although some review papers discuss this topic from different aspects, there is no systematic, large-scale assessment of such methods. Furthermore, the majority of the pathway analysis approaches rely on the assumption of uniformity of p values under the null hypothesis, which is often not true. RESULTS This article presents the most comprehensive comparative study on pathway analysis methods available to date. We compare the actual performance of 13 widely used pathway analysis methods in over 1085 analyses. These comparisons were performed using 2601 samples from 75 human disease data sets and 121 samples from 11 knockout mouse data sets. In addition, we investigate the extent to which each method is biased under the null hypothesis. Together, these data and results constitute a reliable benchmark against which future pathway analysis methods could and should be tested. CONCLUSION Overall, the result shows that no method is perfect. In general, TB methods appear to perform better than non-TB methods. This is somewhat expected since the TB methods take into consideration the structure of the pathway which is meant to describe the underlying phenomena. We also discover that most, if not all, listed approaches are biased and can produce skewed results under the null.
Affiliation(s)
- Tuan-Minh Nguyen
- Department of Computer Science, Wayne State University, Detroit, 48202 USA
- Adib Shafi
- Department of Computer Science, Wayne State University, Detroit, 48202 USA
- Tin Nguyen
- Department of Computer Science and Engineering, University of Nevada, Reno, 89557 USA
- Sorin Draghici
- Department of Computer Science, Wayne State University, Detroit, 48202 USA
- Department of Obstetrics and Gynecology, Wayne State University, Detroit, 48202 USA
25
Vatanpour S, Pyne S, Leite AP, Dinu I. Gene set analysis and reduction for a continuous phenotype: Identifying markers of birth weight variation based on embryonic stem cells and immunologic signatures. Comput Biol Med 2019; 113:103389. [PMID: 31442861 DOI: 10.1016/j.compbiomed.2019.103389] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2019] [Revised: 08/14/2019] [Accepted: 08/15/2019] [Indexed: 10/26/2022]
Abstract
BACKGROUND Gene set analysis is a popular approach to examine the association between a predefined gene set and a phenotype. Few methods have been developed for a continuous phenotype. However, often not all the genes within a significant gene set contribute to its significance. There is no gene set reduction method developed for continuous phenotype. We developed a computationally efficient analytical tool, called linear combination test for gene set reduction (LCT-GSR) to identify core subsets of gene sets associated with a continuous phenotype. Identifying the core subset enhances our understanding of the biological mechanism and reduces costs of disease risk assessment, diagnosis and treatment. RESULTS We evaluated the performance of our analytical tool by applying it to two real microarray studies. In the first application, we analyzed pathway expression measurements in newborns' blood to discover core genes contributing to the variation in birth weight. On average, we were able to reduce the number of genes in the 33 significant gene sets of embryonic stem cell signatures by 84.3% resulting in 229 unique genes. Using immunologic signatures, on average we reduced the number of genes in the 210 significant gene sets by 89% leading to 1603 unique genes. There were 180 unique core genes overlapping across the two databases. In the second application, we analyzed pathway expression measurements in a cohort of lethal prostate cancer patients from Swedish Watchful Waiting cohort to identify main genes associated with tumor volume. On average, we were able to reduce the number of genes in the 17 gene sets by 90% resulting in 47 unique genes. CONCLUSIONS We conclude that LCT-GSR is a statistically sound analytical tool that can be used to extract core genes associated with a continuous phenotype. It can be applied to a wide range of studies in which dichotomizing the continuous phenotype is neither easy nor meaningful. 
Reduction to the most predictive genes is crucial in advancing our understanding of issues such as disease prevention, faster and more efficient diagnosis, intervention strategies and personalized medicine.
Affiliation(s)
- Saumyadipta Pyne
- Public Health Dynamics Laboratory, Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, PA, USA
- Irina Dinu
- School of Public Health, University of Alberta, AB, Canada
26
Saha S, Sengupta K, Chatterjee P, Basu S, Nasipuri M. Analysis of protein targets in pathogen-host interaction in infectious diseases: a case study on Plasmodium falciparum and Homo sapiens interaction network. Brief Funct Genomics 2019; 17:441-450. [PMID: 29028886 DOI: 10.1093/bfgp/elx024] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Indexed: 12/11/2022] Open
Abstract
Infection and disease progression are the outcome of protein interactions between pathogen and host. Pathogens, the key players in infection, are becoming a severe threat to life because of their adaptability to drugs and their evolutionary dynamism. Identifying protein targets by analyzing protein interactions between host and pathogen is therefore the key task. Proteins with a higher degree and with topologically significant graph-theoretical measures are found to be drug targets. On the other hand, exceptional nodes may be involved in the infection mechanism because of pathway processes and biologically unknown factors. In this article, we investigate the characteristics of host-pathogen protein interactions by presenting a comprehensive review of computational approaches applied to different infectious diseases. As an illustration, we analyze a case study on the infectious disease malaria, with its causative agent Plasmodium falciparum acting as 'Bait' and the host, Homo sapiens, acting as 'Prey'. In this pathogen-host interaction network, based on interconnectivity and centrality properties, proteins are viewed as central, peripheral, hub and non-hub nodes, and their significance for the infection process is assessed. In addition, because of the sparseness of the pathogen-host interaction network, there may be some topologically unimportant but biologically significant proteins, which can also act as Bait/Prey. Functional similarity or gene ontology mapping can help to identify these proteins.
Affiliation(s)
- Sovan Saha
- Department of Computer Science and Engineering at Dr Sudhir Chandra Sur Degree Engineering College, India
- Kaustav Sengupta
- Department of Computer Science and Engineering, Jadavpur University, India
- Piyali Chatterjee
- Department of Computer Science and Engineering, Netaji Subhash Engineering College, Garia, India
- Subhadip Basu
- Department of Computer Science and Engineering, Jadavpur University, India
- Mita Nasipuri
- Department of Computer Science and Engineering, Jadavpur University, India
27
Qin W, Wang X, Zhao H, Lu H. A Novel Joint Gene Set Analysis Framework Improves Identification of Enriched Pathways in Cross Disease Transcriptomic Analysis. Front Genet 2019; 10:293. [PMID: 31031796 PMCID: PMC6473067 DOI: 10.3389/fgene.2019.00293] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/21/2018] [Accepted: 03/19/2019] [Indexed: 12/25/2022] Open
Abstract
Motivation: Gene set enrichment analysis is a widely accepted expression analysis tool that aims to detect coordinated expression changes within pre-defined gene sets rather than in individual genes. The benefits of gene set analysis over individual differentially expressed (DE) gene analysis include more reproducible and interpretable results and the detection of small but consistent changes within a gene set that could not be detected by DE gene analysis. There have been many successful gene set analysis applications in human diseases. However, when the sample size of a disease study is small and no other public data sets of the same disease are available, there is a lack of power to detect pathways of importance to the disease. Results: We have developed a novel joint gene set analysis statistical framework that aims to improve the power of identifying enriched gene sets by integrating multiple similar disease data sets. Through comprehensive simulation studies, we demonstrated that our proposed framework obtained much better AUC scores than single data set analysis and another meta-analysis method in the identification of enriched pathways. When applied to two real data sets, the proposed framework retained the enriched gene sets identified by single data set analysis and exclusively obtained up to 200% more disease-related gene sets, demonstrating the improved identification power gained through information shared between similar diseases. We expect that the proposed framework will enable researchers to better explore public data sets when the sample size of their study is limited.
Affiliation(s)
- Wenyi Qin
- Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai Jiaotong University, Shanghai, China
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, United States
- Department of Genetics, School of Medicine, Yale University, New Haven, CT, United States
- Xujun Wang
- Department of Bioinformatics and Biostatistics, SJTU-Yale Joint Center for Biostatistics, Shanghai Jiaotong University, Shanghai, China
- Hongyu Zhao
- Department of Bioinformatics and Biostatistics, SJTU-Yale Joint Center for Biostatistics, Shanghai Jiaotong University, Shanghai, China
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, United States
- Hui Lu
- Center for Biomedical Informatics, Shanghai Children's Hospital, Shanghai Jiaotong University, Shanghai, China
- Department of Bioengineering, University of Illinois at Chicago, Chicago, IL, United States
- Department of Bioinformatics and Biostatistics, SJTU-Yale Joint Center for Biostatistics, Shanghai Jiaotong University, Shanghai, China
- Department of Biostatistics, School of Public Health, Yale University, New Haven, CT, United States
28
Tian S, Wang C, Wang B. Incorporating Pathway Information into Feature Selection towards Better Performed Gene Signatures. BIOMED RESEARCH INTERNATIONAL 2019; 2019:2497509. [PMID: 31073522 PMCID: PMC6470448 DOI: 10.1155/2019/2497509] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 07/23/2018] [Accepted: 03/07/2019] [Indexed: 12/29/2022]
Abstract
To analyze gene expression data with sophisticated grouping structures and to extract hidden patterns from such data, feature selection is of critical importance. It is well known that genes do not function in isolation but rather work together within various metabolic, regulatory, and signaling pathways. If the biological knowledge contained within these pathways is taken into account, the resulting method is a pathway-based algorithm. Studies have demonstrated that a pathway-based method usually outperforms its gene-based counterpart in which no biological knowledge is considered. In this article, pathway-based feature selection methods are first divided into three major categories, namely, pathway-level selection, bilevel selection, and pathway-guided gene selection. With bilevel selection methods regarded as a special case of the pathway-guided gene selection process, we discuss pathway-guided gene selection methods in detail and the importance of penalization in such methods. Last, we point out the potential use of pathway-guided gene selection in one active research avenue, namely, the analysis of longitudinal gene expression data. We believe this article provides valuable insights for computational biologists and biostatisticians so that they can make biology more computable.
Affiliation(s)
- Suyan Tian
- Division of Clinical Research, The First Hospital of Jilin University, 71 Xinmin Street, Changchun, Jilin 130021, China
- Chi Wang
- Department of Biostatistics, Markey Cancer Center, The University of Kentucky, 800 Rose St., Lexington, KY 40536, USA
- Bing Wang
- School of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin 130012, China
29
Evolving Genomics of Pulmonary Fibrosis. Respir Med 2019. [DOI: 10.1007/978-3-319-99975-3_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
30
Tian S, Wang C, Chang HH. A longitudinal feature selection method identifies relevant genes to distinguish complicated injury and uncomplicated injury over time. BMC Med Inform Decis Mak 2018; 18:115. [PMID: 30526581 PMCID: PMC6284265 DOI: 10.1186/s12911-018-0685-8] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Indexed: 01/30/2023] Open
Abstract
BACKGROUND Feature selection and gene set analysis are of increasing interest in the field of bioinformatics. While these two approaches have been developed for different purposes, we describe how some gene set analysis methods can be utilized to conduct feature selection. METHODS We adopted a gene set analysis method, the significance analysis of microarray gene set reduction (SAMGSR) algorithm, to carry out feature selection for longitudinal gene expression data. RESULTS Using a real-world application and simulated data, it is demonstrated that the proposed SAMGSR extension outperforms other relevant methods. In this study, we illustrate that a gene's expression profiles over time can be regarded as a gene set and then a suitable gene set analysis method can be utilized directly to select relevant genes associated with the phenotype of interest over time. CONCLUSIONS We believe this work will motivate more research to bridge feature selection and gene set analysis, with the development of novel algorithms capable of carrying out feature selection for longitudinal gene expression data.
Affiliation(s)
- Suyan Tian
- Division of Clinical Research, The First Hospital of Jilin University, 71 Xinmin Street, Changchun, 130021, Jilin, China
- Chi Wang
- Department of Biostatistics, Markey Cancer Center, The University of Kentucky, 800 Rose St, Lexington, KY, 40536, USA
- Howard H Chang
- Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA, 30322, USA
31
Association between bivariate expression of key oncogenes and metabolic phenotypes of patients with prostate cancer. Comput Biol Med 2018; 103:55-63. [PMID: 30340213 DOI: 10.1016/j.compbiomed.2018.09.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/08/2018] [Revised: 09/20/2018] [Accepted: 09/23/2018] [Indexed: 11/20/2022]
Abstract
BACKGROUND AKT and MYC are two of the most prevalent oncogenes associated with prostate cancer. The precise effects of overexpression of these two key oncogenes on the regulation of metabolic pathways in prostate cancer are under active investigation; however, few studies have investigated their bivariate oncogene-pair expressions in metabolic prostate cancer phenotypes. This is primarily due to the lack of a suitable statistical method to analyze the data in the presence of oncogene interactions and within-metabolite-set correlations. METHODS We analyzed data on the expressions of phosphorylated AKT1 and MYC and the concentrations of 228 metabolites from 60 human prostate tumor samples and 16 normal tissue samples. The metabolomic data allowed us to study not only the measurement of individual metabolites, which can exhibit a dynamic range, but the enriched phenotypes in terms of "metabolite sets" that come from known metabolic pathways. We studied 71 metabolite sets defined by KEGG annotation. We used a modification of linear combination test (LCT) for multiple continuous outcomes to find associations between metabolite sets and oncogenic expressions, after accounting for the correlation between AKT1 and MYC expressions and the correlation between metabolites in a metabolite set. The LCT performance was evaluated using a simulation study. RESULTS Through a comprehensive analytical method, our study linked oncogenomics and metabolomics data from patients to improve our understanding of the interconnected mechanisms underlying prostate cancer. This study showed that dysregulations of AKT1 and MYC significantly alter the metabolic pathways activated by nonglucose nutrient sources and their downstream targets. Our findings highlighted the role of MYC as the leading, but not the only, oncogene in prostate oncogenesis. In our simulation study, the LCT performed better than the known alternative method, gene-set enrichment analysis (GSEA). 
CONCLUSIONS Our study offers a solution for linking genomics and metabolomics, working directly with multiple continuous and correlated measurements.
32
Tian S, Wang C, Chang HH. To select relevant features for longitudinal gene expression data by extending a pathway analysis method. F1000Res 2018; 7:1166. [PMID: 30271585 PMCID: PMC6124382 DOI: 10.12688/f1000research.15357.1] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Accepted: 07/25/2018] [Indexed: 01/30/2023] Open
Abstract
The emerging field of pathway-based feature selection that incorporates biological information conveyed by gene sets/pathways to guide the selection of relevant genes has become increasingly popular and widespread. In this study, we adapt a gene set analysis method - the significance analysis of microarray gene set reduction (SAMGSR) algorithm to carry out feature selection for longitudinal microarray data, and propose a pathway-based feature selection algorithm - the two-level SAMGSR method. By using simulated data and a real-world application, we demonstrate that a gene's expression profiles over time can be considered as a gene set. Thus a suitable gene set analysis method can be utilized or modified to execute the selection of relevant genes for longitudinal omics data. We believe this work paves the way for more research to bridge feature selection and gene set analysis with the development of novel pathway-based feature selection algorithms.
Affiliation(s)
- Suyan Tian
- Division of Clinical Research, The First Hospital of Jilin University, Changchun, Jilin, 130021, China
- Chi Wang
- Department of Biostatistics, The University of Kentucky, Lexington, KY, 40536, USA
- Howard H. Chang
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, 30322, USA
33
Zhang Y, Topham DJ, Thakar J, Qiu X. FUNNEL-GSEA: FUNctioNal ELastic-net regression in time-course gene set enrichment analysis. Bioinformatics 2018; 33:1944-1952. [PMID: 28334094 PMCID: PMC5939227 DOI: 10.1093/bioinformatics/btx104] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 09/26/2016] [Accepted: 02/17/2017] [Indexed: 01/26/2023] Open
Abstract
Motivation Gene set enrichment analyses (GSEAs) are widely used in genomic research to identify underlying biological mechanisms (defined by the gene sets), such as Gene Ontology terms and molecular pathways. There are two caveats in the currently available methods: (i) they are typically designed for group comparisons or regression analyses, which do not utilize temporal information efficiently in time-series of transcriptomics measurements; and (ii) genes overlapping in multiple molecular pathways are considered multiple times in hypothesis testing. Results We propose an inferential framework for GSEA based on functional data analysis, which utilizes the temporal information based on functional principal component analysis, and disentangles the effects of overlapping genes by a functional extension of the elastic-net regression. Furthermore, the hypothesis testing for the gene sets is performed by an extension of Mann-Whitney U test which is based on weighted rank sums computed from correlated observations. By using both simulated datasets and a large-scale time-course gene expression data on human influenza infection, we demonstrate that our method has uniformly better receiver operating characteristic curves, and identifies more pathways relevant to immune-response to human influenza infection than the competing approaches. Availability and Implementation The methods are implemented in R package FUNNEL, freely and publicly available at: https://github.com/yunzhang813/FUNNEL-GSEA-R-Package. Supplementary information Supplementary data are available at Bioinformatics online.
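For orientation, the classical Mann-Whitney U statistic that this abstract says it extends can be sketched in a few lines. This is the plain, unweighted version for independent observations, not the weighted rank-sum test for correlated observations that the paper develops; the function names and the example scores are hypothetical.

```python
def mann_whitney_u(in_set_scores, background_scores):
    """Count (in-set, background) pairs where the in-set score is larger.

    Ties contribute 0.5, which is the usual convention for the U statistic.
    The pairwise loop is O(n1 * n2), which is fine for a small illustration.
    """
    u = 0.0
    for a in in_set_scores:
        for b in background_scores:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def common_language_effect(in_set_scores, background_scores):
    """U rescaled to [0, 1]: the chance an in-set gene outscores a background gene."""
    n_pairs = len(in_set_scores) * len(background_scores)
    return mann_whitney_u(in_set_scores, background_scores) / n_pairs


# Made-up gene-level scores: genes in the set tend to score higher.
gene_set = [2.1, 3.4, 1.9]
background = [0.2, 1.9, 0.7, 1.1]
print(mann_whitney_u(gene_set, background))
print(common_language_effect(gene_set, background))
```

A rescaled value near 0.5 indicates no shift between the gene set and the background, while values near 0 or 1 indicate a strong shift in one direction.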
Affiliation(s)
- Yun Zhang
- Department of Biostatistics and Computational Biology
- David J Topham
- Department of Microbiology and Immunology, University of Rochester, Rochester, NY 14642, USA
- Juilee Thakar
- Department of Biostatistics and Computational Biology; Department of Microbiology and Immunology, University of Rochester, Rochester, NY 14642, USA
- Xing Qiu
- Department of Biostatistics and Computational Biology
34
Zhang A, Tian S. Classification of early-stage non-small cell lung cancer by weighing gene expression profiles with connectivity information. Biom J 2017; 60:537-546. [PMID: 29206308 DOI: 10.1002/bimj.201700010] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 01/24/2017] [Revised: 09/10/2017] [Accepted: 10/22/2017] [Indexed: 11/11/2022]
Abstract
Pathway-based feature selection algorithms, which utilize biological information contained in pathways to guide which features/genes should be selected, have evolved quickly and become widespread in the field of bioinformatics. Based on how the pathway information is incorporated, we classify pathway-based feature selection algorithms into three major categories: penalty, stepwise forward, and weighting. Compared to the first two categories, the weighting methods have been underutilized even though they are usually the simplest ones. In this article, we constructed three different connectivity information-based weights for each gene and then conducted feature selection upon the resulting weighted gene expression profiles. Using both simulations and a real-world application, we have demonstrated that when the data-driven connectivity information constructed from the data of the specific disease under study is considered, the resulting weighted gene expression profiles slightly outperform the original expression profiles. In summary, a big challenge faced by the weighting method is how to estimate pathway knowledge-based weights more accurately and precisely. Only once this issue is resolved will wide utilization of the weighting methods be possible.
Affiliation(s)
- Ao Zhang
- Intensive Care Unit (ICU), The First Hospital of Jilin University, Changchun, 130021, China
- Suyan Tian
- Division of Clinical Research, The First Hospital of Jilin University, Changchun, 130021, China
| |
Collapse
|
35
|
Peng S, Yang S, Bo X, Li F. paraGSEA: a scalable approach for large-scale gene expression profiling. Nucleic Acids Res 2017; 45:e155. [PMID: 28973463 PMCID: PMC5737394 DOI: 10.1093/nar/gkx679] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 07/02/2017] [Accepted: 07/27/2017] [Indexed: 12/28/2022] Open
Abstract
A growing number of studies use gene expression similarity to identify functional connections among genes, diseases and drugs. Gene Set Enrichment Analysis (GSEA) is a powerful analytical method for interpreting gene expression data. However, because of the enormous computational overhead of its significance estimation and multiple hypothesis testing steps, its scalability and efficiency are poor on large-scale datasets. We proposed paraGSEA for efficient large-scale transcriptome data analysis. Through optimization, the overall time complexity of paraGSEA is reduced from O(mn) to O(m+n), where m is the length of the gene sets and n is the length of the gene expression profiles, which yields a more than 100-fold increase in performance compared with other popular GSEA implementations such as GSEA-P, SAM-GS and GSEA2. With further parallelization, a near-linear speed-up is gained on both workstations and clusters, with high scalability and performance on large-scale datasets. The analysis time for the whole LINCS phase I dataset (GSE92742) was reduced to nearly half an hour on a 1000-node cluster on Tianhe-2, or to within 120 hours on a 96-core workstation. The source code of paraGSEA is licensed under the GPLv3 and available at http://github.com/ysycloud/paraGSEA.
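To make the O(m+n) figure concrete, the following is a minimal sketch of the classic unweighted GSEA running-sum enrichment score computed in a single pass over an already ranked profile. It is not taken from paraGSEA; the abstract does not describe the actual optimization, so the function name `enrichment_score` and the unweighted step sizes are assumptions made for illustration only.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (illustrative, hypothetical helper).

    ranked_genes: gene identifiers already sorted by the ranking metric.
    gene_set: iterable of gene identifiers forming the set of interest.
    """
    hits = set(gene_set)                      # O(m) to build, O(1) lookups
    n = len(ranked_genes)
    n_hit = sum(1 for g in ranked_genes if g in hits)
    if n_hit == 0 or n_hit == n:
        raise ValueError("gene set must overlap the profile only partially")
    up = 1.0 / n_hit                          # step taken at a set member
    down = 1.0 / (n - n_hit)                  # step taken at a non-member
    running = 0.0
    best = 0.0
    for g in ranked_genes:                    # one O(n) pass over the profile
        running += up if g in hits else -down
        if abs(running) > abs(best):
            best = running                    # keep the largest deviation from zero
    return best
```

Because set membership is checked in constant time, the work is proportional to the set size plus the profile length, rather than their product as in a naive nested scan.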
Affiliation(s)
- Shaoliang Peng
  - College of Computer Science and Electronic Engineering & National Supercomputer Centre in Changsha, Hunan University, Changsha 410082, China
  - School of Computer Science, National University of Defense Technology, Changsha 410073, China
- Shunyun Yang
  - School of Computer Science, National University of Defense Technology, Changsha 410073, China
- Xiaochen Bo
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China
- Fei Li
  - Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China
|
36
|
Dinu I, Poudel S, Pyne S. Gene-Set Reduction for Analysis of Major and Minor Gleason Scores Based on Differential Gene-Set Expressions and Biological Pathways in Prostate Cancer. Cancer Inform 2017; 16:1176935117730016. [PMID: 28932104 PMCID: PMC5598806 DOI: 10.1177/1176935117730016] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 02/15/2017] [Accepted: 07/26/2017] [Indexed: 11/27/2022] Open
Abstract
The Gleason score (GS) plays an important role in prostate cancer detection and treatment. It is calculated as the sum of its major and minor components, each ranging from 1 to 5, assigned after examination of sample cells taken from each side of the prostate gland during biopsy. A total GS of at least 7 is associated with more aggressive prostate cancer. However, it is still unclear how prostate cancer outcomes differ for various distributions of GS between its major and minor components. This article applies Significance Analysis of Microarray for Gene-Set Reduction to a real microarray study of patients with prostate cancer. It identifies 13 core genes differentially expressed between patients with a major GS of 3 and a minor GS of 4, or (3,4), and patients with a combination of (4,3), starting from a less aggressive GS combination of (3,3) and moving toward a more aggressive one of (4,4) via the gray areas of (3,4) and (4,3). The resulting core genes may improve understanding of prostate cancer in patients with a total GS of 7, the most common grade and the most challenging with respect to prognosis.
Affiliation(s)
- Irina Dinu
  - School of Public Health, University of Alberta, Edmonton, AB, Canada
- Surya Poudel
  - School of Public Health, University of Alberta, Edmonton, AB, Canada
- Saumyadipta Pyne
  - Indian Institute of Public Health, Public Health Foundation of India, Hyderabad, India
|
37
|
Abstract
The analysis of gene sets (in the form of functionally related genes or pathways) has become the method of choice for extracting the strongest signals from omics data. The motivation behind using gene sets instead of individual genes is two-fold. First, this approach incorporates pre-existing biological knowledge into the analysis and facilitates the interpretation of experimental results. Second, it employs a statistical hypothesis testing framework. Here, we briefly review the main Gene Set Analysis (GSA) approaches for testing differential expression of gene sets and several GSA approaches for testing statistical hypotheses beyond differential expression that allow additional biological information to be extracted from the data. We distinguish three major types of GSA approaches, testing (1) differential expression (DE), (2) differential variability (DV), and (3) differential co-expression (DC) of gene sets between two phenotypes. We also present a comparative power analysis and Type I error rates for different approaches in each major type of GSA on simulated data. Our evaluation provides a concise guideline for selecting the GSA approaches that perform best under particular experimental settings. The value of the three major types of GSA approaches is illustrated with a real data example. When applied to the same data set, the major types of GSA approaches yield complementary biological information.
|
38
|
Abstract
Approaches to identify significant pathways from high-throughput quantitative data have been developed in recent years. Still, the analysis of proteomic data remains difficult because of limited sample size. This limitation also leads to the common practice of using a competitive null, which fundamentally treats genes or proteins as independent units. The independence assumption ignores the associations among biomolecules with similar functions or cellular localization, as well as the interactions among them manifested as changes in expression ratios. Consequently, these methods often underestimate the associations among biomolecules and cause false positives in practice. Some studies incorporate the sample covariance matrix into the calculation to address this issue. However, the sample covariance may not be a precise estimate if the sample size is very limited, which is usually the case for data produced by mass spectrometry. In this study, we introduce a multivariate test under a self-contained null to perform pathway analysis for quantitative proteomic data. The covariance matrix used in the test statistic is constructed from the confidence scores retrieved from the STRING database or the HitPredict database. We also design an integrating procedure to retain pathways with sufficient evidence as a pathway group. The performance of the proposed T2-statistic is demonstrated using five published experimental datasets: the T-cell activation, the cAMP/PKA signaling, the myoblast differentiation, and the effect of dasatinib on the BCR-ABL pathway are proteomic datasets produced by mass spectrometry; and the protective effect of myocilin via the MAPK signaling pathway is a gene expression dataset of limited sample size. Compared with other popular statistics, the proposed T2-statistic yields more accurate descriptions in agreement with the discussion of the original publication.
We implemented the T2-statistic in an R package, T2GA, which is available at https://github.com/roqe/T2GA. Pathway analysis is a common approach to quickly assess the pathways being regulated in an experiment. There are numerous statistics for pathway analysis; most of them assume that the genes or proteins are independent of each other for statistical ease. This assumption, however, is unrealistic for real biological systems and may cause false positives in practice. A standard way to address this issue is to measure the associations among genes or proteins. Unfortunately, estimating these associations requires a sufficient sample size, which is usually not available for proteomic data produced by mass spectrometry. In this study, we propose a T2-statistic, which estimates the associations among gene products, to perform pathway analysis for quantitative proteomic data. Instead of calculating the associations directly from data, we use the confidence scores retrieved from protein-protein interaction databases. We also design an integrating procedure to retain pathways with sufficient evidence as a regulated pathway group. We compare the proposed T2-statistic to other popular statistics using five published experimental datasets, and the T2-statistic yields more accurate descriptions in agreement with the discussion of the original papers.
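The idea of replacing a sample covariance with a knowledge-based one can be illustrated with a Hotelling-style quadratic form, T2 = x' Σ⁻¹ x. The sketch below is not the T2GA implementation. It assumes, purely for illustration, that x holds standardized expression changes for the proteins in one pathway and that Σ has ones on the diagonal and interaction confidence scores (between 0 and 1) off the diagonal; the helper names `solve` and `t2_statistic` are hypothetical.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's inputs are not modified.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):              # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def t2_statistic(changes, sigma):
    """Quadratic form changes' * inverse(sigma) * changes."""
    weighted = solve(sigma, changes)          # inverse(sigma) @ changes
    return sum(c * w for c, w in zip(changes, weighted))


# Two proteins that change together: the assumed association (0.5) lowers
# the evidence compared with treating them as independent units.
independent = t2_statistic([2.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
associated = t2_statistic([2.0, 2.0], [[1.0, 0.5], [0.5, 1.0]])
```

In this toy case the statistic is smaller when the two proteins are assumed to be associated, which is the mechanism by which accounting for associations can reduce false positives.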
|
39
|
Rahmatallah Y, Zybailov B, Emmert-Streib F, Glazko G. GSAR: Bioconductor package for Gene Set analysis in R. BMC Bioinformatics 2017; 18:61. [PMID: 28118818 PMCID: PMC5259853 DOI: 10.1186/s12859-017-1482-6] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 07/29/2016] [Accepted: 01/10/2017] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Gene set analysis (in the form of functionally related genes or pathways) has become the method of choice for analyzing omics data in general and gene expression data in particular. There are many statistical methods that either summarize gene-level statistics for a gene set or apply a multivariate statistic that accounts for intergene correlations. Most available methods detect complex departures from the null hypothesis but lack the ability to identify the specific alternative hypothesis that rejects the null. RESULTS GSAR (Gene Set Analysis in R) is an open-source R/Bioconductor software package for gene set analysis (GSA). It implements self-contained multivariate non-parametric statistical methods testing a complex null hypothesis against specific alternatives, such as differences in mean (shift), variance (scale), or net correlation structure. The package also provides a graphical visualization tool, based on the union of two minimum spanning trees, for correlation networks to examine the change in the correlation structures of a gene set between two conditions and highlight influential genes (hubs). CONCLUSIONS Package GSAR provides a set of multivariate non-parametric statistical methods that test a complex null hypothesis against specific alternatives. The methods in package GSAR are applicable to any type of omics data that can be represented in a matrix format. The package, with detailed instructions and examples, is freely available under the GPL (>= 2) license from the Bioconductor web site.
Affiliation(s)
- Yasir Rahmatallah
  - Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
- Boris Zybailov
  - Department of Biochemistry and Molecular Biology, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
- Frank Emmert-Streib
  - Computational Medicine and Statistical Learning Laboratory, Tampere University of Technology, Korkeakoulunkatu 1, Tampere, FI-33720, Finland
- Galina Glazko
  - Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
|
40
|
Zhang L, Wang L, Tian P, Tian S. Identification of Genes Discriminating Multiple Sclerosis Patients from Controls by Adapting a Pathway Analysis Method. PLoS One 2016; 11:e0165543. [PMID: 27846233 PMCID: PMC5112852 DOI: 10.1371/journal.pone.0165543] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 02/17/2016] [Accepted: 09/13/2016] [Indexed: 11/18/2022] Open
Abstract
The focus of analyzing data from microarray experiments has shifted from the identification of associated individual genes to that of associated biological pathways or gene sets. In bioinformatics, a feature selection algorithm is usually used to cope with the high dimensionality of microarray data. In addition to algorithms that use the biological information contained within a gene set as prior knowledge to facilitate feature selection, various gene set analysis methods can be applied directly or readily modified for the purpose of feature selection. The significance analysis of microarray to gene-set reduction analysis (SAM-GSR) algorithm, a novel direction in gene set analysis, is one such method. Here, we explore the feature selection property of SAM-GSR and provide a modification to better achieve the goal of feature selection. In a multiple sclerosis (MS) microarray data application, both SAM-GSR and our modification of SAM-GSR perform well. Our results show that SAM-GSR can indeed carry out feature selection, and that the modified SAM-GSR outperforms SAM-GSR. Given that pathway information is far from complete, a statistical method capable of constructing biologically meaningful gene networks is of interest. Consequently, both SAM-GSR algorithms will be continuously re-evaluated in our future work and thus better characterized.
Affiliation(s)
- Lei Zhang
  - College of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012
  - Department of Neurology, The Second Hospital of Jilin University, 218 Ziqiang Street, Changchun, Jilin, China, 130041
- Linlin Wang
  - College of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012
- Pu Tian
  - College of Life Science, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012
- Suyan Tian
  - Division of Clinical Research, The First Hospital of Jilin University, 71 Xinmin Street, Changchun, Jilin, China, 130021
|
41
|
Sensitivity analysis of gene ranking methods in phenotype prediction. J Biomed Inform 2016; 64:255-264. [PMID: 27793724 DOI: 10.1016/j.jbi.2016.10.012] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 02/12/2016] [Revised: 09/20/2016] [Accepted: 10/24/2016] [Indexed: 11/20/2022]
Abstract
INTRODUCTION It has become clear that noise generated during the assay and analytical processes can disrupt accurate interpretation of genomic studies. Not only does such noise affect the scientific validity and costs of studies, but when assessed in the context of clinically translatable indications such as phenotype prediction, it can lead to inaccurate conclusions that could ultimately affect patients. We applied a sequence of ranking methods to damp the noise associated with microarray outputs, and then tested the utility of the approach in three disease indications using publicly available datasets. MATERIALS AND METHODS This study was performed in three phases. We first theoretically analyzed the effect of noise in phenotype prediction problems, showing that it can be expressed as a modeling error that partially falsifies the pathways. Secondly, via synthetic modeling, we performed a sensitivity analysis of the main gene ranking methods to different types of noise. Finally, we studied the predictive accuracy of the gene lists provided by these ranking methods in synthetic data and in three different datasets related to cancer, rare and neurodegenerative diseases to better understand the translational aspects of our findings. RESULTS AND DISCUSSION In the case of synthetic modeling, we showed that Fisher's Ratio (FR) was the most robust gene ranking method in terms of precision for all types of noise at different levels. Significance Analysis of Microarrays (SAM) provided slightly lower performance, and the rest of the methods (fold change, entropy and maximum percentile distance) were much less precise and accurate. The predictive accuracy of the smallest set of highly discriminatory probes was similar for all the methods in the case of Gaussian and Log-Gaussian noise. In the case of class assignment noise, the predictive accuracy of SAM and FR was higher.
Finally, for real datasets (Chronic Lymphocytic Leukemia, Inclusion Body Myositis and Amyotrophic Lateral Sclerosis) we found that FR and SAM provided the highest predictive accuracies with the smallest number of genes. Biological pathways were found with an expanded list of genes whose discriminatory power had been established via FR. CONCLUSIONS We have shown that noise in expression data and class assignment partially falsifies the sets of discriminatory probes in phenotype prediction problems. FR and SAM better exploit the principle of parsimony and are able to find subsets with fewer highly discriminatory genes. Predictive accuracy and precision are two different metrics for selecting the important genes, since in the presence of noise the most predictive genes do not completely coincide with those that are related to the phenotype. Based on the synthetic results, FR and SAM are recommended to unravel the biological pathways that are involved in disease development.
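As an illustration of ranking genes by Fisher's Ratio, the sketch below uses one common definition: the squared difference of the class means divided by the sum of the class variances. The abstract does not state the exact formula the authors used, so this definition, the use of sample variances, and the helper names `fisher_ratio` and `rank_genes` are assumptions.

```python
from statistics import mean, variance


def fisher_ratio(class_a, class_b):
    """Squared mean difference over summed sample variances (assumed definition)."""
    numerator = (mean(class_a) - mean(class_b)) ** 2
    denominator = variance(class_a) + variance(class_b)
    if denominator == 0:
        raise ValueError("both classes have zero variance")
    return numerator / denominator


def rank_genes(expr_a, expr_b):
    """Return gene names sorted from most to least discriminatory.

    expr_a, expr_b: dicts mapping gene name -> list of expression values
    for the samples of each phenotype class.
    """
    scores = {gene: fisher_ratio(expr_a[gene], expr_b[gene]) for gene in expr_a}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data: g1 separates the two classes far better than g2.
ranking = rank_genes(
    {"g1": [1, 2, 3], "g2": [1, 2, 3]},
    {"g1": [4, 5, 6], "g2": [2, 3, 4]},
)
```

A gene scores highly when its class means are far apart relative to the within-class spread, so a large mean shift in a very noisy gene is ranked below a smaller shift in a stable one.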
|
42
|
Self-Contained Statistical Analysis of Gene Sets. PLoS One 2016; 11:e0163918. [PMID: 27711232 PMCID: PMC5053608 DOI: 10.1371/journal.pone.0163918] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2016] [Accepted: 09/17/2016] [Indexed: 11/25/2022] Open
Abstract
Microarrays are a powerful tool for studying differential gene expression. However, lists of many differentially expressed genes are often generated, and unraveling meaningful biological processes from the lists can be challenging. For this reason, investigators have sought to quantify the statistical probability of compiled gene sets rather than individual genes. The gene sets typically are organized around a biological theme or pathway. We compute correlations between different gene set tests and elect to use Fisher’s self-contained method for gene set analysis. We improve Fisher’s differential expression analysis of a gene set by limiting the p-value of an individual gene within the gene set to prevent a small percentage of genes from determining the statistical significance of the entire set. In addition, we also compute dependencies among genes within the set to determine which genes are statistically linked. The method is applied to T-ALL (T-lineage Acute Lymphoblastic Leukemia) to identify differentially expressed gene sets between T-ALL and normal patients and T-ALL and AML (Acute Myeloid Leukemia) patients.
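The effect of limiting individual p-values can be seen in a small sketch of Fisher's combined probability test, in which each gene-level p-value is clipped to a lower bound before combining. The bound used here (`floor`) is a hypothetical parameter; the abstract does not state what limit the authors chose. The closed-form chi-square tail is valid because Fisher's statistic has an even number of degrees of freedom (2k for k genes), and the test assumes independent p-values.

```python
import math


def fisher_combined(p_values, floor=0.0):
    """Fisher's combined p-value with per-gene p-values clipped below at `floor`."""
    if not p_values:
        raise ValueError("need at least one p-value")
    clipped = [max(p, floor) for p in p_values]
    if any(p <= 0.0 or p > 1.0 for p in clipped):
        raise ValueError("p-values must lie in (0, 1] after clipping")
    k = len(clipped)
    half_stat = -sum(math.log(p) for p in clipped)   # X / 2, where X = -2 * sum(ln p)
    # Chi-square survival function for 2k degrees of freedom:
    # exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


# One extreme gene dominates the unclipped test; clipping limits its influence.
unclipped = fisher_combined([1e-12, 0.5, 0.6, 0.7])
clipped = fisher_combined([1e-12, 0.5, 0.6, 0.7], floor=0.01)
```

With the floor in place, a single gene can contribute at most -2 ln(floor) to the statistic, so a set needs support from several genes to reach significance.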
|
43
|
Tian S, Chang HH, Wang C. Weighted-SAMGSR: combining significance analysis of microarray-gene set reduction algorithm with pathway topology-based weights to select relevant genes. Biol Direct 2016; 11:50. [PMID: 27681389 PMCID: PMC5041498 DOI: 10.1186/s13062-016-0152-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 05/13/2016] [Accepted: 09/20/2016] [Indexed: 01/15/2023] Open
Abstract
BACKGROUND It has been demonstrated that a pathway-based feature selection method that incorporates biological information within pathways during the process of feature selection usually outperforms a gene-based feature selection algorithm in terms of predictive accuracy and stability. The significance analysis of microarray-gene set reduction algorithm (SAMGSR), an extension of a gene set analysis method that further reduces the selected pathways to their respective core subsets, can be regarded as a pathway-based feature selection method. METHODS In SAMGSR, whether a gene is selected is mainly determined by its expression difference between the phenotypes, and partially by the number of pathways to which the gene belongs. It ignores the topology information among pathways. In this study, we propose a weighted version of the SAMGSR algorithm by constructing weights based on the connectivity among genes and then combining these weights with the test statistics. RESULTS Using both simulated and real-world data, we evaluate the performance of the proposed SAMGSR extension and demonstrate that the weighted version outperforms the original version. CONCLUSIONS To conclude, the additional gene connectivity information does facilitate feature selection. REVIEWERS This article was reviewed by Drs. Limsoon Wong, Lev Klebanov, and I. King Jordan.
Affiliation(s)
- Suyan Tian
  - Division of Clinical Research, The First Hospital of Jilin University, 71 Xinmin Street, Changchun, Jilin, China, 130021
  - School of Mathematics, Jilin University, 2699 Qianjin Street, Changchun, Jilin, China, 130012
- Howard H Chang
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA, 30322, USA
- Chi Wang
  - Department of Biostatistics, Markey Cancer Center, The University of Kentucky, 800 Rose St., Lexington, KY, 40536, USA
|
44
|
Zeng Y, Breheny P. Overlapping Group Logistic Regression with Applications to Genetic Pathway Selection. Cancer Inform 2016; 15:179-87. [PMID: 27679461 PMCID: PMC5026200 DOI: 10.4137/cin.s40043] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 04/27/2016] [Revised: 08/07/2016] [Accepted: 08/13/2016] [Indexed: 11/24/2022] Open
Abstract
Discovering important genes that account for the phenotype of interest has long been a challenge in genome-wide expression analysis. Analyses such as gene set enrichment analysis (GSEA) that incorporate pathway information have become widespread in hypothesis testing, but pathway-based approaches have been largely absent from regression methods due to the challenges of dealing with overlapping pathways and the resulting lack of available software. The R package grpreg is widely used to fit group lasso and other group-penalized regression models; in this study, we develop an extension, grpregOverlap, to allow for overlapping group structure using a latent variable approach. We compare this approach to the ordinary lasso and to GSEA using both simulated and real data. We find that incorporation of prior pathway information can substantially improve the accuracy of gene expression classifiers, and we shed light on several ways in which hypothesis-testing approaches such as GSEA differ from regression approaches with respect to the analysis of pathway data.
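The latent variable approach to overlapping groups is commonly realized by duplicating the column of every gene that belongs to more than one pathway, so that each copy belongs to exactly one group and an ordinary group-penalized fit can then be applied. The sketch below shows only that expansion step. It is written in Python for illustration and is not the grpregOverlap code; the function name `expand_overlapping_groups` and the data layout are assumptions.

```python
def expand_overlapping_groups(rows, groups):
    """Duplicate columns so that overlapping groups become disjoint.

    rows: list of samples, each a list of expression values (one per gene).
    groups: dict mapping pathway name -> list of column indices in that pathway.
    Returns (expanded_rows, labels), where labels[i] names the group that
    expanded column i belongs to.
    """
    # One expanded column per (pathway, gene) membership.
    columns = [(name, j) for name, indices in groups.items() for j in indices]
    expanded = [[row[j] for _, j in columns] for row in rows]
    labels = [name for name, _ in columns]
    return expanded, labels


# Gene 1 belongs to both pathways, so its column appears twice after expansion.
expanded, labels = expand_overlapping_groups(
    [[1.0, 2.0, 3.0]],
    {"A": [0, 1], "B": [1, 2]},
)
```

After fitting on the expanded matrix, the coefficient of a shared gene is the sum of the coefficients of its copies, which lets one pathway be selected without forcing the selection of every other pathway that shares a gene with it.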
Affiliation(s)
- Yaohui Zeng
  - Department of Biostatistics, University of Iowa, Iowa City, IA, USA
- Patrick Breheny
  - Department of Biostatistics, University of Iowa, Iowa City, IA, USA
|
45
|
Camilleri M, Carlson P, Valentin N, Acosta A, O'Neill J, Eckert D, Dyer R, Na J, Klee EW, Murray JA. Pilot study of small bowel mucosal gene expression in patients with irritable bowel syndrome with diarrhea. Am J Physiol Gastrointest Liver Physiol 2016; 311:G365-76. [PMID: 27445342 PMCID: PMC5076008 DOI: 10.1152/ajpgi.00037.2016] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Received: 01/27/2016] [Accepted: 07/09/2016] [Indexed: 02/08/2023]
Abstract
Prior studies in patients with irritable bowel syndrome with diarrhea (IBS-D) showed immune activation, secretion, and barrier dysfunction in jejunal or colorectal mucosa. We measured mRNA expression by RT-PCR of 91 genes reflecting tight junction proteins, chemokines, innate immunity, ion channels, transmitters, housekeeping genes, and controls for DNA contamination and PCR efficiency in small intestinal mucosa from 15 patients with IBS-D and 7 controls (biopsies negative for celiac disease). Fold change was calculated using the 2^(-ΔΔCT) formula. Nominal P values (P < 0.05) were interpreted with false discovery rate (FDR) correction (q value). Cluster analysis with Lens for Enrichment and Network Studies (LENS) explored connectivity of mechanisms. Upregulated genes (uncorrected P < 0.05) were related to ion transport (INADL, MAGI1, and SONS1), barrier (TJP1, 2, and 3 and CLDN) or immune functions (TLR3, IL15, and MAPKAPK5), or histamine metabolism (HNMT); downregulated genes were related to immune function (IL-1β, TGF-β1, and CCL20) or antigen detection (TLR1 and 8). The following genes were significantly upregulated (q < 0.05) in IBS-D: INADL, MAGI1, PPP2R5C, MAPKAPK5, TLR3, and IL-15. Among the 14 nominally upregulated genes, there was clustering of barrier and PDZ domains (TJP1, TJP2, TJP3, CLDN4, INADL, and MAGI1) and clustering of downregulated genes (CCL20, TLR1, IL1B, and TLR8). Protein expression of PPP2R5C in nuclear lysates was greater in patients with IBS-D than in controls. There was an increase in INADL protein (median 9.4 ng/ml) in patients with IBS-D relative to controls (median 5.8 ng/ml, P > 0.05). In conclusion, altered transcriptome (and, to a lesser extent, protein) expression of ion transport, barrier, immune, and mast cell mechanisms in the small bowel may reflect different alterations in function and deserves further study in IBS-D.
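The fold-change formula cited in the abstract is the standard comparative CT calculation, 2 raised to the power -ΔΔCT. A minimal sketch follows, assuming one target gene, one reference (housekeeping) gene, and a single CT value per group; the function name and the example CT values are hypothetical and are not taken from the study.

```python
def fold_change(ct_target_case, ct_ref_case, ct_target_control, ct_ref_control):
    """Relative expression of a target gene by the 2^(-ddCT) method."""
    delta_case = ct_target_case - ct_ref_case            # dCT in cases
    delta_control = ct_target_control - ct_ref_control   # dCT in controls
    delta_delta = delta_case - delta_control             # ddCT
    return 2.0 ** (-delta_delta)


# Hypothetical cycle thresholds: the target crosses threshold two cycles
# earlier in cases than in controls, while the reference gene is unchanged.
example = fold_change(24.0, 20.0, 26.0, 20.0)
```

Each cycle corresponds to a doubling of product under the method's assumption of roughly 100% amplification efficiency, so a two-cycle difference corresponds to a four-fold change.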
Affiliation(s)
- Michael Camilleri
  - Clinical Enteric Neuroscience Translational and Epidemiological Research (C.E.N.T.E.R.), Mayo Clinic, Rochester, Minnesota
- Paula Carlson
  - Clinical Enteric Neuroscience Translational and Epidemiological Research (C.E.N.T.E.R.), Mayo Clinic, Rochester, Minnesota
- Nelson Valentin
  - Clinical Enteric Neuroscience Translational and Epidemiological Research (C.E.N.T.E.R.), Mayo Clinic, Rochester, Minnesota
- Andres Acosta
  - Clinical Enteric Neuroscience Translational and Epidemiological Research (C.E.N.T.E.R.), Mayo Clinic, Rochester, Minnesota
- Jessica O'Neill
  - Clinical Enteric Neuroscience Translational and Epidemiological Research (C.E.N.T.E.R.), Mayo Clinic, Rochester, Minnesota
- Deborah Eckert
  - Clinical Enteric Neuroscience Translational and Epidemiological Research (C.E.N.T.E.R.), Mayo Clinic, Rochester, Minnesota
- Roy Dyer
  - Clinical Enteric Neuroscience Translational and Epidemiological Research (C.E.N.T.E.R.), Mayo Clinic, Rochester, Minnesota
- Jie Na
  - Clinical Enteric Neuroscience Translational and Epidemiological Research (C.E.N.T.E.R.), Mayo Clinic, Rochester, Minnesota
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
- Eric W Klee
  - Clinical Enteric Neuroscience Translational and Epidemiological Research (C.E.N.T.E.R.), Mayo Clinic, Rochester, Minnesota
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota
- Joseph A Murray
  - Clinical Enteric Neuroscience Translational and Epidemiological Research (C.E.N.T.E.R.), Mayo Clinic, Rochester, Minnesota
|
46
|
Deandrés-Galiana EJ, Fernández-Martínez JL, Saligan LN, Sonis ST. Impact of Microarray Preprocessing Techniques in Unraveling Biological Pathways. J Comput Biol 2016; 23:957-968. [PMID: 27494192 DOI: 10.1089/cmb.2016.0042] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 02/01/2023] Open
Abstract
To better understand the impact of microarray preprocessing and normalization techniques on the analysis of biological pathways in the prediction of chronic fatigue (CF) following radiation therapy, this study compared the lists of predictive genes found using the Robust Multiarray Averaging (RMA) and Affymetrix MAS5 methods with the list obtained by working with raw data (without any preprocessing). First, we modeled a spiked-in data set in which the differentially expressed genes were known and spiked in at different known concentrations, showing that the precisions achieved by different gene ranking methods were higher than when working with raw data. The results obtained from the spiked-in experiment were extrapolated to the CF data set to run learning and blind validation. RMA and MAS5 provided different sets of discriminatory genes that had a higher predictive accuracy in the learning phase but a lower predictive accuracy during the blind validation phase, suggesting that the genetic signatures generated using both preprocessing techniques may not generalize. The pathways found using the raw data set better described what is known a priori about CF. In addition, RMA produced more reliable pathways than MAS5. Understanding the strengths of these two preprocessing techniques in phenotype prediction is critical for precision medicine. In particular, this article concludes that biological pathways might be better unraveled by working with raw expression data. Moreover, the predictive gene profiles generated by RMA and MAS5 should be interpreted with caution. This is an important conclusion with a high translational impact that should be confirmed in other disease data sets.
|
47
|
Alvarez MJ, Shen Y, Giorgi FM, Lachmann A, Ding BB, Ye BH, Califano A. Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nat Genet 2016; 48:838-47. [PMID: 27322546 PMCID: PMC5040167 DOI: 10.1038/ng.3593] [Citation(s) in RCA: 576] [Impact Index Per Article: 64.0] [Received: 02/21/2016] [Accepted: 05/23/2016] [Indexed: 01/05/2023]
Abstract
Identifying the multiple dysregulated oncoproteins that contribute to tumorigenesis in a given patient is crucial for developing personalized treatment plans. However, accurate inference of aberrant protein activity in biological samples is still challenging, as genetic alterations are only partially predictive and direct measurements of protein activity are generally not feasible. To address this problem, we introduce and experimentally validate a new algorithm, virtual inference of protein activity by enriched regulon analysis (VIPER), for accurate assessment of protein activity from gene expression data. We used VIPER to evaluate the functional relevance of genetic alterations in regulatory proteins across all samples in The Cancer Genome Atlas (TCGA). In addition to accurately inferring aberrant protein activity induced by established mutations, we also identified a fraction of tumors with aberrant activity of druggable oncoproteins despite a lack of mutations, and vice versa. In vitro assays confirmed that VIPER-inferred protein activity outperformed mutational analysis in predicting sensitivity to targeted inhibitors.
Affiliation(s)
- Mariano J. Alvarez
  - Department of Systems Biology, Columbia University, New York, USA
  - DarwinHealth Inc., New York, USA
- Yao Shen
  - Department of Systems Biology, Columbia University, New York, USA
  - DarwinHealth Inc., New York, USA
- B. Belinda Ding
  - Department of Cell Biology, Albert Einstein College of Medicine, New York, USA
- B. Hilda Ye
  - Department of Cell Biology, Albert Einstein College of Medicine, New York, USA
- Andrea Califano
  - Department of Systems Biology, Columbia University, New York, USA
  - Department of Biomedical Informatics, Columbia University, New York, USA
  - Department of Biochemistry & Molecular Biophysics, Columbia University, New York, USA
  - Institute for Cancer Genetics, Columbia University, New York, USA
  - Motor Neuron Center, Columbia University, New York, USA
  - Columbia Initiative in Stem Cells, Columbia University, New York, USA
|
48
|
Classification of Non-Small Cell Lung Cancer Using Significance Analysis of Microarray-Gene Set Reduction Algorithm. BioMed Research International 2016; 2016:2491671. [PMID: 27446945 PMCID: PMC4944087 DOI: 10.1155/2016/2491671] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 12/03/2015] [Revised: 05/09/2016] [Accepted: 06/05/2016] [Indexed: 01/15/2023]
Abstract
Among non-small cell lung cancers (NSCLC), adenocarcinoma (AC) and squamous cell carcinoma (SCC) are the two major histological subtypes, accounting for roughly 40% and 30% of all lung cancer cases, respectively. Since AC and SCC differ in their cell of origin, location within the lung, and growth pattern, they are considered distinct diseases. Gene expression signatures have been demonstrated to be an effective tool for distinguishing AC and SCC. Gene set analysis is regarded as irrelevant to the identification of gene expression signatures. Nevertheless, we found that one specific gene set analysis method, significance analysis of microarray-gene set reduction (SAMGSR), can be adopted directly to select relevant features and to construct gene expression signatures. In this study, we applied SAMGSR to an NSCLC gene expression dataset. When compared with several novel feature selection algorithms, for example LASSO, SAMGSR has equivalent or better performance in terms of predictive ability and model parsimony. SAMGSR can therefore indeed serve as a feature selection algorithm. Additionally, we applied SAMGSR to the AC and SCC subtypes separately to discriminate their respective stages, that is, stage II versus stage I. The small overlap between the two resulting gene signatures illustrates that AC and SCC are technically distinct diseases. Therefore, stratified analyses on subtypes are recommended when diagnostic or prognostic signatures of these two NSCLC subtypes are constructed.
|
49
|
Lee S. Identifying statistically significant gene sets based on differential expression and differential coexpression. Korean Journal of Applied Statistics 2016. [DOI: 10.5351/kjas.2016.29.3.437] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/11/2022]
|
50
|
Hsueh HM, Tsai CA. Gene set analysis using sufficient dimension reduction. BMC Bioinformatics 2016; 17:74. [PMID: 26852017 PMCID: PMC4744442 DOI: 10.1186/s12859-016-0928-6] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 09/29/2015] [Accepted: 02/01/2016] [Indexed: 01/31/2023] Open
Abstract
BACKGROUND Gene set analysis (GSA) aims to evaluate the association between the expression of biological pathways, or a priori defined gene sets, and a particular phenotype. Numerous GSA methods have been proposed to assess the enrichment of sets of genes. However, most methods are developed with respect to a specific alternative scenario, such as a differential mean pattern or a differential coexpression. Moreover, a very limited number of methods can handle either binary, categorical, or continuous phenotypes. In this paper, we develop two novel GSA tests, called SDRs, based on the sufficient dimension reduction technique, which aims to capture sufficient information about the relationship between genes and the phenotype. The advantages of our proposed methods are that they allow for categorical and continuous phenotypes, and they are also able to identify a variety of enriched gene sets. RESULTS Through simulation studies, we compared the type I error and power of SDRs with existing GSA methods for binary, triple, and continuous phenotypes. We found that SDR methods adequately control the type I error rate at the pre-specified nominal level, and they have a satisfactory power to detect gene sets with differential coexpression and to test non-linear associations between gene sets and a continuous phenotype. In addition, the SDR methods were compared with seven widely-used GSA methods using two real microarray datasets for illustration. CONCLUSIONS We concluded that the SDR methods outperform the others because of their flexibility with regard to handling different kinds of phenotypes and their power to detect a wide range of alternative scenarios. Our real data analysis highlights the differences between GSA methods for detecting enriched gene sets.
Affiliation(s)
- Huey-Miin Hsueh
  - Department of Statistics, National Chengchi University, Zhinan Road, Taipei, 116, Taiwan.
- Chen-An Tsai
  - Department of Agronomy, National Taiwan University, No. 1, Section 4, Roosevelt Road, Taipei, 106, Taiwan.
|