26
|
Zou Y, Bui TT, Selvarajoo K. ABioTrans: A Biostatistical Tool for Transcriptomics Analysis. Front Genet 2019; 10:499. [PMID: 31214245 PMCID: PMC6555198 DOI: 10.3389/fgene.2019.00499] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 08/17/2018] [Accepted: 05/07/2019] [Indexed: 11/13/2022]
Abstract
Here we report a bio-statistical/informatics tool, ABioTrans, developed in R for gene expression analysis. The tool allows the user to directly read RNA-Seq data files deposited in the Gene Expression Omnibus (GEO) database. Operated using any web browser application, ABioTrans provides easy options for multiple statistical distribution fitting, Pearson and Spearman rank correlations, PCA, k-means and hierarchical clustering, differential expression (DE) analysis, Shannon entropy and noise (square of coefficient of variation) analyses, as well as Gene Ontology classifications.
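As a rough illustration of two quantities named in this abstract, the sketch below computes noise as the squared coefficient of variation and a Shannon entropy over a histogram of expression values. It is not ABioTrans code: the function names, the use of the sample variance, and the equal-width binning are assumptions made only for this example.

```python
import math

def noise_cv2(values):
    """Noise as the squared coefficient of variation: variance / mean^2."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)  # sample variance (assumed)
    return var / mean ** 2

def shannon_entropy(values, bins=4):
    """Shannon entropy in bits of an equal-width histogram (binning is an assumption)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum into the last bin
        counts[idx] += 1
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)
```

A gene with identical expression in every sample has zero entropy under this definition, while values spread evenly over the bins give the maximum of log2(bins) bits.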
|
27
|
Acosta JP, Restrepo S, Henao JD, López-Kleine L. Multivariate Method for Inferential Identification of Differentially Expressed Genes in Gene Expression Experiments. J Comput Biol 2019; 26:866-874. [PMID: 31063414 DOI: 10.1089/cmb.2018.0013] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Indexed: 12/30/2022]
Abstract
Microarray technology is widely recognized as one of the most important tools when it comes to understanding genetic expression in biological processes. In light of the thousands of gene expression level measurements (including measurements across a number of conditions), identifying differentially expressed genes necessarily implies data mining or large-scale multiple testing procedures. To date, advances with regard to this field have been multivariate-descriptive or inferential-univariate in nature and therefore have important limitations regarding the biological validity of detected genes. In the present article, we present a new multivariate inferential method designed to detect active differentially expressed genes in gene expression data. The proposed method estimates false discovery rates using artificial components. Our method excels when applied to the most common gene expression data structures, providing new insights into differentially expressed genes. The method described herein was programmed in an R-Bioconductor package called acde that has been available since 2015.
|
28
|
Guan L, Luo Q, Liang N, Liu H. A prognostic prediction system for hepatocellular carcinoma based on gene co-expression network. Exp Ther Med 2019; 17:4506-4516. [PMID: 31086582 PMCID: PMC6489019 DOI: 10.3892/etm.2019.7494] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 04/06/2018] [Accepted: 01/25/2019] [Indexed: 12/11/2022]
Abstract
In the present study, gene expression data of hepatocellular carcinoma (HCC) were analyzed by using a multi-step Bioinformatics approach to establish a novel prognostic prediction system. Gene expression profiles were downloaded from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO) databases. The overlapping differentially expressed genes (DEGs) between these two datasets were identified using the limma package in R. Prognostic genes were further identified by Cox regression using the survival package. The significantly co-expressed gene pairs were selected using the R function cor to construct the co-expression network. Functional and module analyses were also performed. Next, a prognostic prediction system was established by Bayes discriminant analysis using the discriminant.bayes function in the e1071 package, which was further validated in another independent GEO dataset. A total of 177 overlapping DEGs were identified from TCGA and the GEO dataset (GSE36376). Furthermore, 161 prognostic genes were selected and the top six were stanniocalcin 2, carbonic anhydrase 12, cell division cycle (CDC) 20, deoxyribonuclease 1 like 3, glucosylceramidase β3 and metallothionein 1G. A gene co-expression network involving 41 upregulated and 52 downregulated genes was constructed. SPC24, endothelial cell specific molecule 1, CDC20, CDCA3, cyclin (CCN) E1 and chromatin licensing and DNA replication factor 1 were significantly associated with cell division, mitotic cell cycle and positive regulation of cell proliferation. CCNB1, CCNE1, CCNB2 and stratifin were clearly associated with the p53 signaling pathway. A prognostic prediction system containing 55 signature genes was established and then validated in the GEO dataset GSE20140. In conclusion, the present study identified a number of prognostic genes and established a prediction system to assess the prognosis of HCC patients.
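The co-expression step described above (pairwise correlation followed by selection of strongly correlated gene pairs) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes Pearson correlation (the default of R's cor), and the gene names, the data layout and the 0.9 cut-off are hypothetical, since the abstract does not state the selection threshold.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpression_edges(expr, threshold=0.9):
    """Return (gene1, gene2, r) for every pair with |r| >= threshold.

    expr maps a gene name to its expression values across samples.
    The threshold is an illustrative assumption.
    """
    edges = []
    for g1, g2 in combinations(sorted(expr), 2):
        r = pearson(expr[g1], expr[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, round(r, 3)))
    return edges

# Hypothetical toy data: A and B are perfectly correlated, C is not.
toy = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 1, 3, 2]}
edges = coexpression_edges(toy)
```

In practice a significance test on each correlation, with multiple-testing correction, would usually replace the fixed cut-off used here.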
|
29
|
Teran Hidalgo SJ, Zhu T, Wu M, Ma S. Overlapping clustering of gene expression data using penalized weighted normalized cut. Genet Epidemiol 2018; 42:796-811. [PMID: 30302823 PMCID: PMC6239939 DOI: 10.1002/gepi.22164] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 06/12/2018] [Revised: 07/24/2018] [Accepted: 08/28/2018] [Indexed: 02/06/2023]
Abstract
Clustering has been widely conducted in the analysis of gene expression data. For complex diseases, it has played an important role in identifying unknown functions of genes, serving as the basis of other analyses, and more. A common limitation of most existing clustering approaches is the assumption that genes are separated into disjoint clusters. As genes often have multiple functions and thus can belong to more than one functional cluster, the disjoint clustering results can be unsatisfactory. In addition, due to the small sample sizes of genetic profiling studies and other factors, there may not be sufficient evidence to confirm the specific functions of some genes and cluster them definitively into disjoint clusters. In this study, we develop an effective overlapping clustering approach, which takes into account the multiplicity of gene functions and lack of certainty in practical analysis. A penalized weighted normalized cut (PWNCut) criterion is proposed based on the NCut technique and an $L_2$ norm constraint. It outperforms multiple competitors in simulation. The analysis of The Cancer Genome Atlas (TCGA) data on breast cancer and cervical cancer leads to biologically sensible findings which differ from those using the alternatives. To facilitate implementation, we develop the function pwncut in the R package NCutYX.
|
30
|
Makhijani RK, Raut SA, Purohit HJ. Fold change based approach for identification of significant network markers in breast, lung and prostate cancer. IET Syst Biol 2018; 12:213-218. [PMID: 30259866 PMCID: PMC8687202 DOI: 10.1049/iet-syb.2018.0012] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 02/12/2018] [Revised: 04/12/2018] [Accepted: 04/15/2018] [Indexed: 12/17/2022]
Abstract
Cancer belongs to a class of highly aggressive diseases and is a leading cause of death in the world. Of more than 100 types of cancer, breast, lung and prostate cancer remain the most common. To identify essential network markers (NMs) and therapeutic targets in these cancers, the authors present a novel approach which uses gene expression data from microarray and RNA-seq platforms and utilises the results from these data to evaluate a protein-protein interaction (PPI) network. Differentially expressed genes (DEGs) are extracted from microarray data using three different statistical methods in R, to produce a consistent set of genes. Also, DEGs are extracted from RNA-seq data for the same three cancer types. DEG sets found to be common in both platforms are obtained at three fold change (FC) cut-off levels to accurately identify the level of change in expression of these genes in all three cancers. A cancer network is built using PPI data characterising gene sets at log-FC (LFC)>1, LFC>1.5 and LFC>2, and interconnection between principal hub nodes of these networks is observed. The resulting network of hubs at three FC levels highlights prime NMs with high confidence in multiple cancers as validated by Gene Ontology functional enrichment and maximal complete subgraphs from CFinder.
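The three-level fold-change filtering described above can be illustrated with a short sketch. It is not the authors' code: the base-2 logarithm, the pseudocount, the use of the absolute LFC (to keep both up- and down-regulated genes) and the gene names are all assumptions introduced for the example; only the cut-offs 1, 1.5 and 2 come from the abstract.

```python
import math

def log2_fold_change(case_mean, control_mean, pseudocount=1.0):
    """log2 ratio of mean expression; the pseudocount (assumed) avoids division by zero."""
    return math.log2((case_mean + pseudocount) / (control_mean + pseudocount))

def degs_by_cutoff(means, cutoffs=(1.0, 1.5, 2.0)):
    """Map each cut-off to the sorted genes whose |LFC| exceeds it.

    means maps a gene name to a (case_mean, control_mean) pair.
    """
    lfc = {g: log2_fold_change(case, ctrl) for g, (case, ctrl) in means.items()}
    return {c: sorted(g for g, v in lfc.items() if abs(v) > c) for c in cutoffs}

# Hypothetical mean expression values (case, control) for four genes.
toy_means = {"G1": (7, 1), "G2": (3, 1), "G3": (31, 1), "G4": (1, 1)}
deg_sets = degs_by_cutoff(toy_means)
```

Because the thresholds are nested, each gene set at a stricter cut-off is a subset of the set at a looser one, which is what lets the hub networks at the three levels be compared directly.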
|
31
|
Li Y, Bie R, Teran Hidalgo SJ, Qin Y, Wu M, Ma S. Assisted gene expression-based clustering with AWNCut. Stat Med 2018; 37:4386-4403. [PMID: 30094873 DOI: 10.1002/sim.7928] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Received: 01/13/2018] [Revised: 05/15/2018] [Accepted: 07/05/2018] [Indexed: 01/06/2023]
Abstract
In the research on complex diseases, gene expression (GE) data have been extensively used for clustering samples. The clusters so generated can serve as the basis for disease subtype identification, risk stratification, and many other purposes. With the small sample sizes of genetic profiling studies and noisy nature of GE data, clustering analysis results are often unsatisfactory. In the most recent studies, a prominent trend is to conduct multidimensional profiling, which collects data on GEs and their regulators (copy number alterations, microRNAs, methylation, etc.) on the same subjects. With the regulation relationships, regulators contain important information on the properties of GEs. We develop a novel assisted clustering method, which effectively uses regulator information to improve clustering analysis using GE data. To account for the fact that not all GEs are informative, we propose a weighted strategy, where the weights are determined data-dependently and can discriminate informative GEs from noises. The proposed method is built on the NCut technique and effectively realized using a simulated annealing algorithm. Simulations demonstrate that it can well outperform multiple direct competitors. In the analysis of TCGA cutaneous melanoma and lung adenocarcinoma data, biologically sensible findings different from the alternatives are made.
|
32
|
Shahbeig S, Rahideh A, Helfroush MS, Kazemi K. Gene expression feature selection for prostate cancer diagnosis using a two-phase heuristic-deterministic search strategy. IET Syst Biol 2018; 12:162-169. [PMID: 33451186 DOI: 10.1049/iet-syb.2017.0044] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/29/2017] [Revised: 02/19/2018] [Accepted: 03/08/2018] [Indexed: 01/28/2023]
Abstract
Here, a two-phase search strategy is proposed to identify the biomarkers in gene expression data set for the prostate cancer diagnosis. A statistical filtering method is initially employed to remove the noisiest data. In the first phase of the search strategy, a multi-objective optimisation based on the binary particle swarm optimisation algorithm tuned by a chaotic method is proposed to select the optimal subset of genes with the minimum number of genes and the maximum classification accuracy. Finally, in the second phase of the search strategy, the cache-based modification of the sequential forward floating selection algorithm is used to find the most discriminant genes from the optimal subset of genes selected in the first phase. The results of applying the proposed algorithm on the available challenging prostate cancer data set demonstrate that the proposed algorithm can perfectly identify the informative genes such that the classification accuracy, sensitivity, and specificity of 100% are achieved with only nine biomarkers.
|
33
|
Irimie AI, Braicu C, Cojocneanu R, Magdo L, Onaciu A, Ciocan C, Mehterov N, Dudea D, Buduru S, Berindan-Neagoe I. Differential Effect of Smoking on Gene Expression in Head and Neck Cancer Patients. Int J Environ Res Public Health 2018; 15:ijerph15071558. [PMID: 30041465 PMCID: PMC6069101 DOI: 10.3390/ijerph15071558] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 04/27/2018] [Revised: 07/11/2018] [Accepted: 07/17/2018] [Indexed: 12/13/2022]
Abstract
Smoking is a well-known behavior that has an important negative impact on human health, and is considered to be a significant factor related to the development and progression of head and neck squamous cell carcinomas (HNSCCs). Use of high-dimensional datasets to discern novel HNSCC driver genes related to smoking represents an important challenge. The Cancer Genome Atlas (TCGA) analysis was performed in three co-existing groups of HNSCC in order to assess whether gene expression landscape is affected by tobacco smoking, having quit, or non-smoking status. We identified a set of differentially expressed genes that discriminate between smokers and non-smokers or based on human papilloma virus (HPV)16 status, or the co-occurrence of these two exposome components in HNSCC. Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways classification shows that most of the genes are specific to cellular metabolism, emphasizing metabolic detoxification pathways, metabolism of chemical carcinogenesis, or drug metabolism. In the case of HPV16-positive patients it has been demonstrated that the altered genes are related to cellular adhesion and inflammation. The correlation between smoking and the survival rate was not statistically significant. This emphasizes the importance of the complex environmental exposure and genetic factors in order to establish prevention assays and personalized care system for HNSCC, with the potential for being extended to other cancer types.
|
34
|
Xia CQ, Han K, Qi Y, Zhang Y, Yu DJ. A Self-Training Subspace Clustering Algorithm under Low-Rank Representation for Cancer Classification on Gene Expression Data. IEEE/ACM Trans Comput Biol Bioinform 2018; 15:1315-1324. [PMID: 28600258 PMCID: PMC5986621 DOI: 10.1109/tcbb.2017.2712607] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Indexed: 05/04/2023]
Abstract
Accurate identification of cancer types is essential to cancer diagnosis and treatment. Since cancer tissue and normal tissue have different gene expression, gene expression data can be used as an efficient feature source for cancer classification. However, accurate cancer classification directly using original gene expression profiles remains challenging due to the intrinsic high dimensionality of the features and the small size of the data samples. We proposed a new self-training subspace clustering algorithm under low-rank representation, called SSC-LRR, for cancer classification on gene expression data. Low-rank representation (LRR) is first applied to extract discriminative features from the high-dimensional gene expression data; the self-training subspace clustering (SSC) method is then used to generate the cancer classification predictions. SSC-LRR was tested on two separate benchmark datasets in comparison with four state-of-the-art classification methods. It generated cancer classification predictions with an overall accuracy of 89.7 percent and a general correlation of 0.920, which are 18.9 and 24.4 percent higher than those of the best control method, respectively. In addition, several genes (RNF114, HLA-DRB5, USP9Y, and PTPN20) were identified by SSC-LRR as new cancer identifiers that deserve further clinical investigation. Overall, the study demonstrated a new sensitive avenue to recognize cancer classifications from large-scale gene expression data.
|
35
|
Sheng Y, Tang J, Ren K, Manor LC, Cao H. Integrative computational approach to evaluate risk genes for postmenopausal osteoporosis. IET Syst Biol 2018; 12:118-122. [PMID: 29745905 PMCID: PMC8687217 DOI: 10.1049/iet-syb.2017.0043] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/25/2017] [Revised: 01/03/2018] [Accepted: 01/19/2018] [Indexed: 02/01/2024]
Abstract
In recent years, numerous studies have reported over a hundred genes playing roles in the etiology of postmenopausal osteoporosis (PO). However, many of these candidate genes lacked replication and results were not always consistent. Here, the authors proposed a computational workflow to curate and evaluate PO related genes. They integrated large-scale literature knowledge data and gene expression data (PO case/control: 10/10) for the marker evaluation. Pathway enrichment, sub-network enrichment, and gene-gene interaction analysis were conducted to study the pathogenic profile of the candidate genes, with four metrics proposed and validated for each gene. By using the authors' approach, a scalable PO genetic database was developed, including PO related genes, diseases, pathways, and the supporting references. The PO case/control classification supported the effectiveness of the four proposed metrics, which successfully identified eight well-studied top PO genes (TGFB1, IL6, IL1B, TNF, ESR2, IGF1, HIF1A, and COL1A1) and highlighted one recently reported PO gene (IFNG). The computational biology approach and the PO database developed in this study provide a valuable resource which may facilitate understanding the genetic profile of PO.
|
36
|
Li S, Liu X, Li H, Pan H, Acharya A, Deng Y, Yu Y, Haak R, Schmidt J, Schmalz G, Ziebolz D. Integrated analysis of long noncoding RNA-associated competing endogenous RNA network in periodontitis. J Periodontal Res 2018. [PMID: 29516510 DOI: 10.1111/jre.12539] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Indexed: 12/14/2022]
Abstract
BACKGROUND AND OBJECTIVES: Long noncoding RNAs (lncRNAs) play critical and complex roles in regulating various biological processes of periodontitis. This bioinformatic study aims to construct a putative competing endogenous RNA (ceRNA) network by integrating lncRNA, miRNA and mRNA expression, based on high-throughput RNA sequencing and microarray data on periodontitis. MATERIAL AND METHODS: Data from 1 miRNA and 3 mRNA expression profiles were obtained to construct the lncRNA-associated ceRNA network. Gene Ontology enrichment analysis and pathway analysis were performed using the Gene Ontology website and Kyoto Encyclopedia of Genes and Genomes. A protein-protein interaction network was constructed based on the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING). Transcription factors (TFs) of differentially expressed genes were identified based on the TRANSFAC database and then a regulatory network was constructed. RESULTS: Through constructing the dysregulated ceRNA network, 6 genes (HSPA4L, PANK3, YOD1, CTNNBIP1, EVI2B, ITGAL) and 3 miRNAs (miR-125a-3p, miR-200a, miR-142-3p) were detected. Three lncRNAs (MALAT1, TUG1, FGD5-AS1) were found to target both miR-125a-3p and miR-142-3p in this ceRNA network. Protein-protein interaction network analysis identified several hub genes, including VCAM1, ITGA4, UBC, LYN and SSX2IP. Three pathways (cytokine-cytokine receptor, cell adhesion molecules, chemokine signaling pathway) overlapped with the results of previous bioinformatics studies in periodontitis. Moreover, 2 TFs, FOS and EGR, were identified as involved in the regulatory network of differentially expressed genes and TFs in periodontitis. CONCLUSION: These findings suggest that 6 mRNAs (HSPA4L, PANK3, YOD1, CTNNBIP1, EVI2B, ITGAL), 3 miRNAs (hsa-miR-125a-3p, hsa-miR-200a, hsa-miR-142-3p) and 3 lncRNAs (MALAT1, TUG1, FGD5-AS1) might be involved in the lncRNA-associated ceRNA network of periodontitis. This study sought to illuminate further the genetic and epigenetic mechanisms of periodontitis through constructing an lncRNA-associated ceRNA network.
|
37
|
M P, R T. Informative Gene Selection for Cancer Classification with Microarray Data Using a Metaheuristic Framework. Asian Pac J Cancer Prev 2018; 19:561-564. [PMID: 29481013 PMCID: PMC5980950 DOI: 10.22034/apjcp.2018.19.2.561] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Accepted: 11/03/2017] [Indexed: 11/27/2022]
Abstract
Objective: Cancer diagnosis is one of the most vital emerging clinical applications of microarray data. Due to the high dimensionality, gene selection is an important step for improving expression data classification performance. There is therefore a need for effective methods to select informative genes for prediction and diagnosis of cancer. The main objective of this research was to derive a heuristic approach to select highly informative genes. Methods: A metaheuristic approach with a Genetic Algorithm with Levy Flight (GA-LV) was applied for classification of cancer genes in microarrays. The experimental results were analyzed with five major cancer gene expression benchmark datasets. Result: GA-LV proved superior to GA and statistical approaches, with 100% accuracy for the Leukemia, Lung and Lymphoma datasets. For the Prostate and Colon datasets, GA-LV was 99.5% and 99.2% accurate, respectively. Conclusion: The experimental results show that the proposed approach is suitable for effective gene selection with all benchmark datasets, removing irrelevant and redundant genes to improve classification accuracy.
|
38
|
M P, R B, N S. Cancer Detection in Microarray Data Using a Modified Cat Swarm Optimization Clustering Approach. Asian Pac J Cancer Prev 2017; 18:3451-3455. [PMID: 29286618 PMCID: PMC5980909 DOI: 10.22034/apjcp.2017.18.12.3451] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
Abstract
Objective: A better understanding of functional genomics can be obtained by extracting patterns hidden in gene expression data. This could have paramount implications for cancer diagnosis, gene treatments and other domains. Clustering may reveal natural structures and identify interesting patterns in underlying data. The main objective of this research was to derive a heuristic approach to detection of highly co-expressed genes related to cancer from gene expression data with minimum Mean Squared Error (MSE). Methods: A modified CSO algorithm using Harmony Search (MCSO-HS) for clustering cancer gene expression data was applied. Experimental results were analyzed using two cancer gene expression benchmark datasets, namely for leukaemia and for breast cancer. Result: In terms of MSE, MCSO-HS was better than HS and CSO by 13% and 9%, respectively, on the leukaemia dataset, and by 22% and 17%, respectively, on the breast cancer dataset. Conclusion: The results showed MCSO-HS to outperform HS and CSO with both benchmark datasets. To validate the clustering results, this work was tested with internal and external cluster validation indices. This work also points to biological validation of clusters with gene ontology in terms of function, process and component.
|
39
|
Liu J, Wang X, Cheng Y, Zhang L. Tumor gene expression data classification via sample expansion-based deep learning. Oncotarget 2017; 8:109646-109660. [PMID: 29312636 PMCID: PMC5752549 DOI: 10.18632/oncotarget.22762] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 08/31/2017] [Accepted: 10/29/2017] [Indexed: 12/15/2022]
Abstract
Since tumor is seriously harmful to human health, effective diagnosis measures are in urgent need for tumor therapy. Early detection of tumor is particularly important for better treatment of patients. A notable issue is how to effectively discriminate tumor samples from normal ones. Many classification methods, such as Support Vector Machines (SVMs), have been proposed for tumor classification. Recently, deep learning has achieved satisfactory performance in the classification task of many areas. However, the application of deep learning is rare in tumor classification due to insufficient training samples of gene expression data. In this paper, a Sample Expansion method is proposed to address the problem. Inspired by the idea of Denoising Autoencoder (DAE), a large number of samples are obtained by randomly cleaning partially corrupted input many times. The expanded samples can not only maintain the merits of corrupted data in DAE but also deal with the problem of insufficient training samples of gene expression data to a certain extent. Since Stacked Autoencoder (SAE) and Convolutional Neural Network (CNN) models show excellent performance in classification task, the applicability of SAE and 1-dimensional CNN (1DCNN) on gene expression data is analyzed. Finally, two deep learning models, Sample Expansion-Based SAE (SESAE) and Sample Expansion-Based 1DCNN (SE1DCNN), are designed to carry out tumor gene expression data classification by using the expanded samples. Experimental studies indicate that SESAE and SE1DCNN are very effective in tumor classification.
|
40
|
Liu M, Fan X, Fang K, Zhang Q, Ma S. Integrative sparse principal component analysis of gene expression data. Genet Epidemiol 2017; 41:844-865. [PMID: 29114920 DOI: 10.1002/gepi.22089] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 07/05/2017] [Revised: 10/03/2017] [Accepted: 10/04/2017] [Indexed: 12/16/2022]
Abstract
In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular one is perhaps the PCA (principal component analysis). To generate more reliable and more interpretable results, the SPCA (sparse PCA) technique has been developed. With the "small sample size, high dimensionality" characteristic of gene expression data, the analysis results generated from a single dataset are often unsatisfactory. Under contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis and other multidatasets techniques and single-dataset analysis. In this study, we conduct integrative analysis by developing the iSPCA (integrative SPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance.
|
41
|
Yang H, Liu X. Studies on the Clustering Algorithm for Analyzing Gene Expression Data with a Bidirectional Penalty. J Comput Biol 2017; 24:689-698. [PMID: 28489418 DOI: 10.1089/cmb.2017.0051] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022]
Abstract
This article reports a new clustering method based on the k-means algorithm for high-dimensional gene expression data. The proposed approach uses bidirectional penalties to constrain the number of clusters and the centroids of clusters, so as to simultaneously determine the unknown number of clusters and handle large amounts of noise in gene expression data. Numeric studies indicate that this algorithm not only performs better in clustering but is also comparable to other approaches in its ability to obtain the correct number of clusters and correct signal features. Finally, we apply the proposed approach to analyze two benchmark gene expression datasets. These analyses again indicate that the proposed algorithm performs well in clustering high-dimensional gene expression data with an unknown number of clusters.
|
42
|
Zang H, Li N, Pan Y, Hao J. Identification of upstream transcription factors (TFs) for expression signature genes in breast cancer. Gynecol Endocrinol 2017; 33:193-198. [PMID: 27809618 DOI: 10.1080/09513590.2016.1239253] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 01/11/2023]
Abstract
Breast cancer is a common malignancy among women with a rising incidence. Our intention was to detect transcription factors (TFs) for deeper understanding of the underlying mechanisms of breast cancer. Integrated analysis of gene expression datasets of breast cancer was performed. Then, functional annotation of differentially expressed genes (DEGs) was conducted, including Gene Ontology (GO) enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment. Furthermore, TFs were identified and a global transcriptional regulatory network was constructed. Seven publicly available GEO datasets were obtained, and a set of 1196 DEGs was identified (460 up-regulated and 736 down-regulated). Functional annotation results showed that cell cycle was the most significantly enriched pathway, which was consistent with the fact that cell cycle is closely related to various tumors. Fifty-three differentially expressed TFs were identified, and the regulatory networks consisted of 817 TF-target interactions between 46 TFs and 602 DEGs in the context of breast cancer. The top 10 TFs covering the most downstream DEGs were SOX10, NFATC2, ZNF354C, ARID3A, BRCA1, FOXO3, GATA3, ZEB1, HOXA5 and EGR1. The transcriptional regulatory networks could enable a better understanding of regulatory mechanisms of breast cancer pathology and provide an opportunity for the development of potential therapy.
|
43
|
Learning Parsimonious Classification Rules from Gene Expression Data Using Bayesian Networks with Local Structure. Data 2017; 2. [PMID: 28331847 PMCID: PMC5358670 DOI: 10.3390/data2010005] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 12/14/2022]
Abstract
The comprehensibility of good predictive models learned from high-dimensional gene expression data is attractive because it can lead to biomarker discovery. Several good classifiers provide comparable predictive performance but differ in their abilities to summarize the observed data. We extend a Bayesian Rule Learning (BRL-GSS) algorithm, previously shown to be a significantly better predictor than other classical approaches in this domain. It searches a space of Bayesian networks using a decision tree representation of its parameters with global constraints, and infers a set of IF-THEN rules. The number of parameters, and therefore the number of rules, is combinatorial in the number of predictor variables in the model. We relax these global constraints to a more generalizable local structure (BRL-LSS). BRL-LSS entails a more parsimonious set of rules because it does not have to generate all combinatorial rules. The search space of local structures is much richer than the space of global structures. We design BRL-LSS with the same worst-case time complexity as BRL-GSS while exploring a richer and more complex model space. We measure predictive performance using Area Under the ROC Curve (AUC) and accuracy. We measure model parsimony by noting the average number of rules and variables needed to describe the observed data. We evaluate the predictive and parsimony performance of BRL-GSS, BRL-LSS and the state-of-the-art C4.5 decision tree algorithm, across 10-fold cross-validation using ten microarray gene-expression diagnostic datasets. In these experiments, we observe that BRL-LSS is similar to BRL-GSS in terms of predictive performance, while generating a much more parsimonious set of rules to explain the same observed data. BRL-LSS also needs fewer variables than C4.5 to explain the data with similar predictive performance. We also conduct a feasibility study to demonstrate the general applicability of our BRL methods on the newer RNA sequencing gene-expression data.
|
44
|
Oyelade J, Isewon I, Oladipupo F, Aromolaran O, Uwoghiren E, Ameh F, Achas M, Adebiyi E. Clustering Algorithms: Their Application to Gene Expression Data. Bioinform Biol Insights 2016; 10:237-253. [PMID: 27932867 PMCID: PMC5135122 DOI: 10.4137/bbi.s38316] [Citation(s) in RCA: 64] [Impact Index Per Article: 8.0] [Received: 05/18/2016] [Revised: 09/05/2016] [Accepted: 09/09/2016] [Indexed: 12/17/2022] Open
Abstract
Gene expression data hide vital information required to understand the biological processes that take place in a particular organism in relation to its environment. Deciphering the hidden patterns in gene expression data offers a valuable opportunity to strengthen the understanding of functional genomics. The complexity of biological networks and the volume of genes present increase the challenges of comprehending and interpreting the resulting mass of data, which consists of millions of measurements; these data also exhibit vagueness, imprecision, and noise. Therefore, the use of clustering techniques is a first step toward addressing these challenges, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. The clustering of gene expression data has been proven to be useful in making known the natural structure inherent in gene expression data, understanding gene functions, cellular processes, and subtypes of cells, mining useful information from noisy data, and understanding gene regulation. The other benefit of clustering gene expression data is the identification of homology, which is very important in vaccine design. This review examines the various clustering algorithms applicable to gene expression data in order to discover and provide useful knowledge of the appropriate clustering technique that will guarantee stability and a high degree of accuracy in its analysis procedure.
|
45
|
Chang JG, Chen CC, Wu YY, Che TF, Huang YS, Yeh KT, Shieh GS, Yang PC. Uncovering synthetic lethal interactions for therapeutic targets and predictive markers in lung adenocarcinoma. Oncotarget 2016; 7:73664-73680. [PMID: 27655641 PMCID: PMC5342006 DOI: 10.18632/oncotarget.12046] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 03/18/2016] [Accepted: 08/24/2016] [Indexed: 12/28/2022] Open
Abstract
Two genes are called synthetic lethal (SL) if their simultaneous mutation leads to cell death, but mutation of either individual does not. Targeting SL partners of mutated cancer genes can selectively kill cancer cells, but leave normal cells intact. We present an integrated approach to uncover SL gene pairs as novel therapeutic targets of lung adenocarcinoma (LADC). Of 24 predicted SL pairs, PARP1-TP53 was validated by RNAi knockdown to have synergistic toxicity in H1975 and invasive CL1-5 LADC cells; additionally FEN1-RAD54B, BRCA1-TP53, BRCA2-TP53 and RB1-TP53 were consistent with the literature. While metastasis remains a bottleneck in cancer treatment and inhibitors of PARP1 have been developed, this result may have therapeutic potential for LADC, in which TP53 is commonly mutated. We also demonstrated that silencing PARP1 enhanced the cell death induced by the platinum-based chemotherapy drug carboplatin in lung cancer cells (CL1-5 and H1975). IHC of RAD54B↑, BRCA1↓-RAD54B↑, FEN1(N)↑-RAD54B↑ and PARP1↑-RAD54B↑ were shown to be prognostic markers for 131 Asian LADC patients, and all markers except BRCA1↓-RAD54B↑ were further confirmed by three independent gene expression data sets (a total of 426 patients) including The Cancer Genome Atlas (TCGA) cohort of LADC. Importantly, we identified POLB-TP53 and POLB as predictive markers for the TCGA cohort (230 subjects), independent of age and stage. Thus, POLB and POLB-TP53 may be used to stratify future non-Asian LADC patients for therapeutic strategies.
|
46
|
Abstract
In recent years several methods have been proposed to assign pairwise mechanism-based similarity scores to human diseases. Despite their differences in approach and performance, these methods work in a somewhat similar manner: first a set of biomolecules (genes, proteins, chemicals, etc.) is associated with each disease, and then a measure is defined to calculate the similarity between the sets assigned to a pair of diseases. Since the similarity score between two diseases is defined based on the underlying molecular processes, a high score may hint at a shared cause, and therefore a similar treatment, for both diseases. This is of great practical importance especially when a rare or newly-discovered disease, for which limited information is available, is found to be related to a disease with a known treatment. Thus, in this mini-review we briefly discuss the recently developed methods for computing mechanism-based disease-disease similarities.
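The two-step scheme described in this abstract (associate a set of biomolecules with each disease, then score the overlap between two sets) can be sketched as follows. The abstract does not name a specific measure, so the Jaccard index used here is an assumption chosen for simplicity, and the disease names and gene identifiers are hypothetical placeholders rather than data from the review.

```python
from itertools import combinations


def jaccard_similarity(set_a, set_b):
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def pairwise_similarities(associations):
    """Score every unordered pair of diseases by the overlap of their biomolecule sets."""
    return {
        (a, b): jaccard_similarity(associations[a], associations[b])
        for a, b in combinations(sorted(associations), 2)
    }


# Hypothetical disease-to-gene associations, for illustration only.
disease_genes = {
    "disease_x": {"G1", "G2", "G3", "G4"},
    "disease_y": {"G3", "G4", "G5"},
    "disease_z": {"G6"},
}

scores = pairwise_similarities(disease_genes)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

Other set-based or network-based measures could be substituted for `jaccard_similarity` without changing the surrounding structure.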
|
47
|
Song B, Du J, Deng N, Ren JC, Shu ZB. Comparative analysis of gene expression profiles of gastric cardia adenocarcinoma and gastric non-cardia adenocarcinoma. Oncol Lett 2016; 12:3866-3874. [PMID: 27895742 PMCID: PMC5104197 DOI: 10.3892/ol.2016.5161] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 04/24/2015] [Accepted: 08/04/2016] [Indexed: 12/17/2022] Open
Abstract
In the present study, gene expression profiles were analyzed to identify the molecular mechanisms underlying gastric cardia adenocarcinoma (GCA) and gastric non-cardia adenocarcinoma (GNCA). A gene expression dataset (accession number GSE29272) was downloaded from Gene Expression Omnibus, and consisted of 62 GCA samples and 62 normal controls, as well as 72 GNCA samples and 72 normal controls. The two groups of differentially-expressed genes (DEGs) were compared to obtain common and unique DEGs. A differential analysis was performed using the Linear Models for Microarray Data package in R. Functional enrichment analysis was conducted for the DEGs using the Database for Annotation, Visualization and Integrated Discovery. Protein-protein interaction (PPI) networks were constructed for the DEGs with information from the Search Tool for the Retrieval of Interacting Genes. Subnetworks were extracted from the whole network with Cytoscape. Compared with the control, 284 and 268 genes were differentially-expressed in GCA and GNCA, respectively, of which 194 DEGs were common between GCA and GNCA. Common DEGs [e.g., claudin (CLDN)7, CLDN4 and CLDN3] were associated with cell adhesion and digestion. GCA-unique DEGs [e.g., MAD1 mitotic arrest deficient like 1, cyclin (CCN)B1, CCNB2 and CCNE1] were associated with the cell cycle and the regulation of cell proliferation, while GNCA-unique DEGs (e.g., GATA binding protein 6 and hyaluronoglucosaminidase 1) were implicated in cell death. A PPI network with 141 nodes and 446 edges was obtained, from which two subnetworks were extracted. Genes [e.g., fibronectin 1, collagen type I α2 chain (COL1A2) and COL1A1] from the two subnetworks were implicated in extracellular matrix organization. These common DEGs could advance our understanding of the etiology of gastric cancer, while the unique DEGs in GCA and GNCA could better define the properties of specific cancers and provide potential biomarkers for diagnosis, prognosis or therapy.
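The comparison of the two DEG groups into common and unique genes amounts to set intersection and set difference. The sketch below illustrates this in Python; it is not the authors' R workflow. The toy gene sets contain only a few of the example genes named in the abstract (using their standard symbols) and are not the full DEG lists; the function name `compare_deg_sets` is hypothetical. The final lines only do arithmetic on the counts reported in the abstract.

```python
def compare_deg_sets(degs_a, degs_b):
    """Split two DEG collections into shared genes and genes unique to each side."""
    a, b = set(degs_a), set(degs_b)
    return {"common": a & b, "unique_a": a - b, "unique_b": b - a}


# Toy subsets built from example genes in the abstract (illustrative, not the full lists).
gca_degs = {"CLDN7", "CLDN4", "CLDN3", "CCNB1", "CCNB2", "CCNE1"}
gnca_degs = {"CLDN7", "CLDN4", "CLDN3", "GATA6", "HYAL1"}

result = compare_deg_sets(gca_degs, gnca_degs)
print(sorted(result["common"]))
print(sorted(result["unique_a"]))
print(sorted(result["unique_b"]))

# Counts reported in the abstract: 284 GCA DEGs, 268 GNCA DEGs, 194 in common.
gca_total, gnca_total, common_total = 284, 268, 194
gca_unique_count = gca_total - common_total    # DEGs found only in GCA
gnca_unique_count = gnca_total - common_total  # DEGs found only in GNCA
print(gca_unique_count, gnca_unique_count)
```

Under the reported counts, this leaves 90 DEGs unique to GCA and 74 unique to GNCA.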
|
48
|
Muetze T, Goenawan IH, Wiencko HL, Bernal-Llinares M, Bryan K, Lynn DJ. Contextual Hub Analysis Tool (CHAT): A Cytoscape app for identifying contextually relevant hubs in biological networks. F1000Res 2016; 5:1745. [PMID: 27853512 PMCID: PMC5105880 DOI: 10.12688/f1000research.9118.1] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Accepted: 08/26/2016] [Indexed: 07/30/2023] Open
Abstract
Highly connected nodes (hubs) in biological networks are topologically important to the structure of the network and have also been shown to be preferentially associated with a range of phenotypes of interest. The relative importance of a hub node, however, can change depending on the biological context. Here, we report a Cytoscape app, the Contextual Hub Analysis Tool (CHAT), which enables users to easily construct and visualize a network of interactions from a gene or protein list of interest, integrate contextual information, such as gene expression or mass spectrometry data, and identify hub nodes that are more highly connected to contextual nodes (e.g. genes or proteins that are differentially expressed) than expected by chance. In a case study, we use CHAT to construct a network of genes that are differentially expressed in Dengue fever, a viral infection. CHAT was used to identify and compare contextual and degree-based hubs in this network. The top 20 degree-based hubs were enriched in pathways related to the cell cycle and cancer, which is likely due to the fact that proteins involved in these processes tend to be highly connected in general. In comparison, the top 20 contextual hubs were enriched in pathways commonly observed in a viral infection including pathways related to the immune response to viral infection. This analysis shows that such contextual hubs are considerably more biologically relevant than degree-based hubs and that analyses which rely on the identification of hubs solely based on their connectivity may be biased towards nodes that are highly connected in general rather than in the specific context of interest. Availability: CHAT is available for Cytoscape 3.0+ and can be installed via the Cytoscape App Store (http://apps.cytoscape.org/apps/chat).
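One common way to ask whether a node is "more highly connected to contextual nodes than expected by chance" is a one-sided hypergeometric test on its neighbours. The sketch below assumes that formulation; the abstract does not state which test CHAT applies, so this should not be read as the app's implementation. The toy network, node names, and the helper names `hypergeom_upper_tail` and `contextual_hub_pvalue` are all hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from `population`
    items, of which `successes` are marked."""
    total = comb(population, draws)
    upper = min(draws, successes)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def contextual_hub_pvalue(node, adjacency, contextual_nodes):
    """Probability of seeing at least the observed number of contextual neighbours
    if the node's neighbours were a random sample of the other nodes."""
    others = set(adjacency) - {node}
    neighbours = adjacency[node] - {node}
    observed = len(neighbours & contextual_nodes)
    return hypergeom_upper_tail(
        observed, len(others), len(contextual_nodes & others), len(neighbours)
    )


# Hypothetical toy network: two hubs of equal degree, nine other nodes.
nodes = ["hub_a", "hub_b"] + [f"n{i}" for i in range(1, 10)]
edges = [("hub_a", "n1"), ("hub_a", "n2"), ("hub_a", "n3"),
         ("hub_b", "n4"), ("hub_b", "n5"), ("hub_b", "n6")]
adjacency = {n: set() for n in nodes}
for u, v in edges:
    adjacency[u].add(v)
    adjacency[v].add(u)

# Hypothetical contextual nodes (e.g. differentially expressed genes).
contextual = {"n1", "n2", "n3"}

p_a = contextual_hub_pvalue("hub_a", adjacency, contextual)
p_b = contextual_hub_pvalue("hub_b", adjacency, contextual)
print(p_a, p_b)
```

Both hubs have degree 3, so a purely degree-based ranking cannot separate them, while the contextual p-value does. In practice the p-values across all candidate hubs would also need a multiple-testing correction.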
|
49
|
Muetze T, Goenawan IH, Wiencko HL, Bernal-Llinares M, Bryan K, Lynn DJ. Contextual Hub Analysis Tool (CHAT): A Cytoscape app for identifying contextually relevant hubs in biological networks. F1000Res 2016; 5:1745. [PMID: 27853512 PMCID: PMC5105880 DOI: 10.12688/f1000research.9118.2] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Accepted: 08/26/2016] [Indexed: 01/21/2023] Open
Abstract
Highly connected nodes (hubs) in biological networks are topologically important to the structure of the network and have also been shown to be preferentially associated with a range of phenotypes of interest. The relative importance of a hub node, however, can change depending on the biological context. Here, we report a Cytoscape app, the Contextual Hub Analysis Tool (CHAT), which enables users to easily construct and visualize a network of interactions from a gene or protein list of interest, integrate contextual information, such as gene expression or mass spectrometry data, and identify hub nodes that are more highly connected to contextual nodes (e.g. genes or proteins that are differentially expressed) than expected by chance. In a case study, we use CHAT to construct a network of genes that are differentially expressed in Dengue fever, a viral infection. CHAT was used to identify and compare contextual and degree-based hubs in this network. The top 20 degree-based hubs were enriched in pathways related to the cell cycle and cancer, which is likely due to the fact that proteins involved in these processes tend to be highly connected in general. In comparison, the top 20 contextual hubs were enriched in pathways commonly observed in a viral infection including pathways related to the immune response to viral infection. This analysis shows that such contextual hubs are considerably more biologically relevant than degree-based hubs and that analyses which rely on the identification of hubs solely based on their connectivity may be biased towards nodes that are highly connected in general rather than in the specific context of interest. Availability: CHAT is available for Cytoscape 3.0+ and can be installed via the Cytoscape App Store (http://apps.cytoscape.org/apps/chat).
|
50
|
Jeyaswamidoss JE, Thangaraj K, Ramar K, Chitra M. A rough set based rational clustering framework for determining correlated genes. Acta Microbiol Immunol Hung 2016; 63:185-201. [PMID: 27352972 DOI: 10.1556/030.63.2016.2.4] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/19/2022]
Abstract
Cluster analysis plays a foremost role in identifying groups of genes that show similar behavior under a set of experimental conditions. Several clustering algorithms have been proposed to identify gene behaviors and understand their significance. The principal aim of this work is to develop an intelligent rough clustering technique, which will efficiently remove the irrelevant dimensions in a high-dimensional space and obtain appropriate meaningful clusters. This paper proposes a novel biclustering technique that is based on rough set theory. The proposed algorithm uses the correlation coefficient as a similarity measure to simultaneously cluster both the rows and columns of a gene expression data matrix and the mean squared residue to generate the initial biclusters. Furthermore, the biclusters are refined to form the lower and upper boundaries by determining the membership of the genes in the clusters using the mean squared residue. The algorithm is illustrated with yeast gene expression data and the experiment demonstrates the effectiveness of the method. The main advantage is that it overcomes the problem of selection of initial clusters and also the restriction of one object belonging to only one cluster by allowing overlapping of biclusters.
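The mean squared residue mentioned in this abstract is commonly defined, following Cheng and Church, as the average of the squared terms (a_ij - rowmean_i - colmean_j + overallmean) over all cells of a bicluster; a value of zero indicates a perfectly additive (coherent) pattern. The sketch below assumes that standard definition, which the abstract itself does not spell out, and it does not attempt the rough-set refinement step. The function name and example matrices are hypothetical.

```python
def mean_squared_residue(matrix):
    """Mean squared residue of a bicluster given as a list of equal-length rows
    (genes) over columns (conditions)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_means = [sum(row) / n_cols for row in matrix]
    col_means = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = sum(
        (matrix[i][j] - row_means[i] - col_means[j] + overall_mean) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


# Hypothetical expression submatrices, for illustration only.
additive_bicluster = [[1.0, 2.0, 3.0],
                      [2.0, 3.0, 4.0],
                      [4.0, 5.0, 6.0]]   # rows differ only by a constant shift
incoherent_block = [[1.0, 0.0],
                    [0.0, 1.0]]          # rows move in opposite directions

print(mean_squared_residue(additive_bicluster))
print(mean_squared_residue(incoherent_block))
```

A biclustering procedure would typically accept a candidate submatrix only when this score falls below a chosen threshold.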
|