1
|
Ghosh S, Mandal SD, Thakur S. Biomarker-driven drug repurposing for NAFLD-associated hepatocellular carcinoma using machine learning integrated ensemble feature selection. Frontiers in Bioinformatics 2025; 5:1522401. [PMID: 40313868 PMCID: PMC12043677 DOI: 10.3389/fbinf.2025.1522401] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2024] [Accepted: 04/04/2025] [Indexed: 05/03/2025] Open
Abstract
The incidence of non-alcoholic fatty liver disease (NAFLD), encompassing the more severe non-alcoholic steatohepatitis (NASH), is rising alongside the surges in diabetes and obesity. Increasing evidence indicates that NASH is responsible for a significant share of idiopathic hepatocellular carcinoma (HCC) cases, a fatal cancer with a 5-year survival rate below 22%. Biomarkers can facilitate early screening and monitoring of at-risk NAFLD/NASH patients and assist in identifying potential drug candidates for treatment. This study utilized an ensemble feature selection framework to analyze transcriptomic data, identifying biomarker genes associated with the stage-wise progression of NAFLD-related HCC. Seven machine learning algorithms were assessed for disease stage classification. Twelve feature selection methods, including correlation-based techniques, mutual information-based methods, and embedded techniques, were utilized to rank the top genes as features. Combining multiple feature selection methods in this way yielded more robust features important in this disease progression. Cox regression-based survival analysis was carried out to evaluate the biomarker potential of these genes. Furthermore, a multiphase drug repurposing strategy and molecular docking were employed to identify potential drug candidates against these biomarkers. Among the seven machine learning models initially evaluated, DISCR emerged as the most accurate disease stage classifier. Ensemble feature selection identified ten top genes, among which eight were recognized as potential biomarkers based on survival analysis. These include the genes ABAT, ABCB11, MBTPS1, and ZFP1, mostly involved in alanine and glutamate metabolism, butanoate metabolism, and ER protein processing.
Through drug repurposing, 81 candidate drugs were found to be effective against these marker genes, with Diosmin, Esculin, Lapatinib, and Phenelzine as the best candidates screened through molecular docking and MMGBSA. The consensus derived from multiple methods enhances the accuracy of identifying relevant robust biomarkers for NAFLD-associated HCC. The use of these biomarkers in a multiphase drug repurposing strategy highlights potential therapeutic options for early intervention, which is essential to stop disease progression and improve outcomes.
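The abstract above describes combining several feature selection methods into one consensus ranking but does not state the aggregation rule. The sketch below assumes a simple mean-rank (Borda-style) aggregation; the gene names, the scores, and the function names `rank_scores` and `ensemble_top_features` are hypothetical and are not taken from the paper.

```python
import numpy as np

def rank_scores(scores):
    """Convert importance scores to ranks (1 = most important)."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

def ensemble_top_features(score_table, names, k):
    """Average the per-method ranks and return the k best-ranked feature names."""
    ranks = np.array([rank_scores(s) for s in score_table])
    mean_rank = ranks.mean(axis=0)
    best = np.argsort(mean_rank, kind="stable")[:k]
    return [names[i] for i in best]

# Hypothetical scores from three selectors (rows) for four genes (columns).
genes = ["geneA", "geneB", "geneC", "geneD"]
scores = [
    [0.9, 0.1, 0.5, 0.3],  # e.g. absolute correlation with the stage label
    [0.8, 0.2, 0.7, 0.1],  # e.g. mutual information
    [0.6, 0.3, 0.9, 0.2],  # e.g. importance from an embedded model
]
top2 = ensemble_top_features(scores, genes, k=2)
```

Averaging ranks rather than raw scores keeps methods with different score scales comparable, which is one plausible reason to prefer it in an ensemble of heterogeneous selectors.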
Affiliation(s)
- Subhajit Ghosh
  - Department of Bioinformatics, University of North Bengal, Darjeeling, West Bengal, India
- Sukhen Das Mandal
  - Department of Computer Science and Engineering, Ghani Khan Choudhury Institute of Engineering and Technology (GKCIET), Malda, India
- Subarna Thakur
  - Department of Bioinformatics, University of North Bengal, Darjeeling, West Bengal, India
|
2
|
Bar N, Nikparvar B, Jayavelu ND, Roessler FK. Constrained Fourier estimation of short-term time-series gene expression data reduces noise and improves clustering and gene regulatory network predictions. BMC Bioinformatics 2022; 23:330. [PMID: 35945515 PMCID: PMC9364503 DOI: 10.1186/s12859-022-04839-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/25/2021] [Accepted: 07/12/2022] [Indexed: 01/15/2023] Open
Abstract
BACKGROUND Biological data suffers from noise that is inherent in the measurements. This is particularly true for time-series gene expression measurements. Nevertheless, to explore cellular dynamics, scientists employ such noisy measurements in predictive and clustering tools. However, noisy data can not only obscure the genes' temporal patterns, but applying predictive and clustering tools to noisy data may also yield inconsistent, and potentially incorrect, results. RESULTS To reduce the noise of short-term (< 48 h) time-series expression data, we relied on the three basic temporal patterns of gene expression: waves, impulses and sustained responses. We constrained the estimation of the true signals to these patterns by estimating the parameters of first and second-order Fourier functions and using the nonlinear least-squares trust-region optimization technique. Our approach lowered the noise in at least 85% of synthetic time-series expression data, significantly more than the spline method ([Formula: see text]). When the data contained a higher signal-to-noise ratio, our method allowed downstream network component analyses to calculate consistent and accurate predictions, particularly when the noise variance was high. Conversely, these tools led to erroneous results from untreated noisy data. Our results suggest that at least 5-7 time points are required to efficiently de-noise logarithmic scaled time-series expression data. Investing in sampling additional time points provides little benefit to clustering and prediction accuracy. CONCLUSIONS Our constrained Fourier de-noising method helps to cluster noisy gene expression and interpret dynamic gene networks more accurately. The benefit of noise reduction is large and can constitute the difference between a successful application and a failing one.
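As a rough illustration of fitting a first-order Fourier function to a short time series, the sketch below swaps in a simpler technique than the one the abstract names: it tries a small set of candidate periods and solves an ordinary linear least-squares problem for each, rather than running a nonlinear trust-region optimizer. The time points, the candidate periods, and the function name `fit_first_order_fourier` are assumptions made for the example.

```python
import numpy as np

def fit_first_order_fourier(t, y, periods):
    """Fit y ~ a0 + a1*cos(w*t) + b1*sin(w*t) for each candidate period.

    Returns the period with the smallest squared error and its fitted curve.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    best = None
    for period in periods:
        w = 2.0 * np.pi / period
        design = np.column_stack([np.ones_like(t), np.cos(w * t), np.sin(w * t)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        fitted = design @ coef
        sse = float(np.sum((y - fitted) ** 2))
        if best is None or sse < best[0]:
            best = (sse, period, fitted)
    return best[1], best[2]

# Seven hypothetical sampling times (hours) within a 48 h window.
t = np.array([0, 2, 4, 8, 12, 24, 48], dtype=float)
clean = 1.0 + 2.0 * np.sin(2.0 * np.pi * t / 48.0)
rng = np.random.default_rng(0)
noisy = clean + rng.normal(scale=0.3, size=t.size)
period, denoised = fit_first_order_fourier(t, noisy, periods=[24.0, 48.0, 96.0])
```

Restricting the fit to three coefficients per candidate period is what constrains the estimate to a smooth wave-like shape; a second-order term would add `cos(2*w*t)` and `sin(2*w*t)` columns to the design matrix.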
Affiliation(s)
- Nadav Bar
  - Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU), Sem Sælandsvei 4, Trondheim, NO-7491 Norway
- Bahareh Nikparvar
  - Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU), Sem Sælandsvei 4, Trondheim, NO-7491 Norway
- Naresh Doni Jayavelu
  - Division of Medical Genetics, Department of Medicine, University of Washington Seattle, Seattle, WA 98195-7720 USA
- Fabienne Krystin Roessler
  - Department of Chemical Engineering, Norwegian University of Science and Technology (NTNU), Sem Sælandsvei 4, Trondheim, NO-7491 Norway
|
3
|
Koyuncu E. Centroidal Clustering of Noisy Observations by Using rth Power Distortion Measures. IEEE Transactions on Neural Networks and Learning Systems 2022; PP:1430-1438. [PMID: 35731771 DOI: 10.1109/tnnls.2022.3183294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
We consider the problem of clustering a dataset through multiple noisy observations of its members. The goal is to obtain a clustering that is as faithful to the clustering of the original dataset as possible. We propose a centroidal approach whose distortion measure is the sum of rth powers of the distances between the cluster center and the noisy observations. For r = 2, our scheme boils down to the well-known approach of clustering the average of noisy samples. First, we provide a mathematical analysis of our clustering scheme. In particular, we find formulas for the average distortion and the spatial distribution of the cluster centers in the asymptotic regime where the number of centers is large. We then provide an algorithm to numerically optimize the cluster centers in the finite regime. We extend our method to automatically assign weights to noisy observations. Finally, we show that for various practical noise models, with a suitable choice of r, our algorithms can outperform several other existing techniques over various datasets.
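To make the distortion measure concrete, the sketch below computes a single center that minimizes the sum of r-th powers of Euclidean distances to a set of observations. It uses iteratively reweighted averaging, which is an assumption of this example and not the paper's algorithm, and it is intended only for 1 <= r <= 2. The names `rth_power_distortion` and `rth_power_center` and the sample data are hypothetical.

```python
import numpy as np

def rth_power_distortion(center, observations, r):
    """Sum of r-th powers of Euclidean distances from center to each observation."""
    dists = np.linalg.norm(observations - center, axis=1)
    return float(np.sum(dists ** r))

def rth_power_center(observations, r, iterations=100, eps=1e-9):
    """Approximate the minimizer of the r-th power distortion (1 <= r <= 2).

    Each step is a weighted mean with weights dist**(r - 2); for r = 2 the
    weights are all 1, so the result is the plain average.
    """
    obs = np.asarray(observations, dtype=float)
    center = obs.mean(axis=0)
    for _ in range(iterations):
        dists = np.maximum(np.linalg.norm(obs - center, axis=1), eps)
        weights = dists ** (r - 2.0)
        center = (weights[:, None] * obs).sum(axis=0) / weights.sum()
    return center

# Five noisy 1-D observations, one of them a gross outlier.
obs = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])
c2 = rth_power_center(obs, r=2.0)  # equals the sample mean
c1 = rth_power_center(obs, r=1.0)  # pulled far less by the outlier
```

Smaller values of r down-weight distant observations, which is one way to see why a suitable choice of r could help under heavy-tailed noise.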
|
4
|
Ding LJ, Schlüter HM, Szucs MJ, Ahmad R, Wu Z, Xu W. Comparison of Statistical Tests and Power Analysis for Phosphoproteomics Data. J Proteome Res 2020; 19:572-582. [PMID: 31789524 DOI: 10.1021/acs.jproteome.9b00280] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Abstract
Advances in protein tagging and mass spectrometry have enabled generation of large quantitative proteome and phosphoproteome data sets, for identifying differentially expressed targets in case-control studies. The power study of statistical tests is critical for designing strategies for effective target identification and control of experimental cost. Here, we develop a simulation framework to generate realistic phospho-peptide data with known changes between cases and controls. Using this framework, we quantify the performance of traditional t-tests, Bayesian tests, and the ranking-by-fold-change test. Bayesian tests, which share variance information among peptides, outperform the traditional t-tests. Although ranking-by-fold-change has similar power as the Bayesian tests, its type I error rate cannot be properly controlled without proper permutation analysis; therefore, simply relying on the ranking likely brings false positives. Two-sample Bayesian tests considering dependencies between intensity and variance are superior for data sets with complex variance. While increasing the sample size enhances the statistical tests' performance, balanced controls and cases are recommended over a one-side weighted group. Further, higher peptide standard deviations require higher fold changes to achieve the same statistical power. Together, these results highlight the importance of model-informed experimental design and principled statistical analyses when working with large-scale proteomics and phosphoproteomics data.
Affiliation(s)
- Hannah M Schlüter
  - Department of Computing, Imperial College London, South Kensington, London SW7 2AZ, United Kingdom
- Matthew J Szucs
  - Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, Massachusetts 02139, United States
- Rushdy Ahmad
  - Broad Institute of MIT and Harvard, 415 Main Street, Cambridge, Massachusetts 02139, United States
- Zheyang Wu
  - Department of Mathematical Sciences and Program of Bioinformatics and Computational Biology and Program of Data Science, Worcester Polytechnic Institute (WPI), 100 Institute Road, Worcester, Massachusetts 01609, United States
|
5
|
Fan J, Ke Y, Sun Q, Zhou WX. FarmTest: Factor-adjusted robust multiple testing with approximate false discovery control. J Am Stat Assoc 2019; 114:1880-1893. [PMID: 33033420 PMCID: PMC7539891 DOI: 10.1080/01621459.2018.1527700] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 07/24/2017] [Revised: 08/15/2018] [Accepted: 09/16/2018] [Indexed: 12/21/2022]
Abstract
Large-scale multiple testing with correlated and heavy-tailed data arises in a wide range of research areas, from genomics and medical imaging to finance. Conventional methods for estimating the false discovery proportion (FDP) often ignore the effect of heavy-tailedness and the dependence structure among test statistics, and thus may lead to inefficient or even inconsistent estimation. Also, the commonly imposed joint normality assumption is arguably too stringent for many applications. To address these challenges, in this paper we propose a Factor-Adjusted Robust Multiple Testing (FarmTest) procedure for large-scale simultaneous inference with control of the false discovery proportion. We demonstrate that robust factor adjustments are extremely important in both controlling the FDP and improving the power. We identify general conditions under which the proposed method produces a consistent estimate of the FDP. As a byproduct that is of independent interest, we establish an exponential-type deviation inequality for a robust U-type covariance estimator under the spectral norm. Extensive numerical experiments demonstrate the advantage of the proposed method over several state-of-the-art methods, especially when the data are generated from heavy-tailed distributions. The proposed procedures are implemented in the R-package FarmTest.
Affiliation(s)
- Jianqing Fan
  - Honorary Professor, School of Data Science, Fudan University, Shanghai, China and Frederick L. Moore '18 Professor of Finance, Department of Operations Research and Financial Engineering, Princeton University, NJ 08544
- Yuan Ke
  - Assistant Professor, Department of Statistics, University of Georgia, Athens, GA 30602
- Qiang Sun
  - Assistant Professor, Department of Statistical Sciences, University of Toronto, Toronto, ON M5S 3G3, Canada
- Wen-Xin Zhou
  - Assistant Professor, Department of Mathematics, University of California, San Diego, La Jolla, CA 92093
|
6
|
Abstract
We propose a general Principal Orthogonal complEment Thresholding (POET) framework for large-scale covariance matrix estimation based on the approximate factor model. A set of high level sufficient conditions for the procedure to achieve optimal rates of convergence under different matrix norms is established to better understand how POET works. Such a framework allows us to recover existing results for sub-Gaussian data in a more transparent way that only depends on the concentration properties of the sample covariance matrix. As a new theoretical contribution, for the first time, such a framework allows us to exploit conditional sparsity covariance structure for the heavy-tailed data. In particular, for the elliptical distribution, we propose a robust estimator based on the marginal and spatial Kendall's tau to satisfy these conditions. In addition, we study conditional graphical model under the same framework. The technical tools developed in this paper are of general interest to high dimensional principal component analysis. Thorough numerical results are also provided to back up the developed theory.
Affiliation(s)
- Jianqing Fan
  - Dept of Operations Research & Financial Engineering, Sherrerd Hall, Princeton University, Princeton, NJ 08544, USA
- Han Liu
  - Dept of Operations Research & Financial Engineering, Sherrerd Hall, Princeton University, Princeton, NJ 08544, USA
- Weichen Wang
  - Dept of Operations Research & Financial Engineering, Sherrerd Hall, Princeton University, Princeton, NJ 08544, USA
|
7
|
Lei L, Bickel PJ, El Karoui N. Asymptotics for high dimensional regression M-estimates: fixed design results. Probab Theory Relat Fields 2018. [DOI: 10.1007/s00440-017-0824-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Indexed: 11/30/2022]
|
8
|
Han F, Liu H. ECA: High-Dimensional Elliptical Component Analysis in Non-Gaussian Distributions. J Am Stat Assoc 2017. [DOI: 10.1080/01621459.2016.1246366] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Indexed: 10/20/2022]
Affiliation(s)
- Fang Han
  - Department of Statistics, University of Washington, Seattle, WA
- Han Liu
  - Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ
|
9
|
Wang M, Tsai TH, Di Poto C, Ferrarini A, Yu G, Ressom HW. Topic model-based mass spectrometric data analysis in cancer biomarker discovery studies. BMC Genomics 2016; 17 Suppl 4:545. [PMID: 27535232 PMCID: PMC5001243 DOI: 10.1186/s12864-016-2796-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/19/2022] Open
Abstract
Background A fundamental challenge in quantitation of biomolecules for cancer biomarker discovery arises from the heterogeneous nature of human biospecimens. Although this issue has been a subject of discussion in cancer genomic studies, it has not yet been rigorously investigated in mass spectrometry based proteomic and metabolomic studies. Purification of mass spectrometric data is highly desired prior to subsequent analysis, e.g., quantitative comparison of the abundance of biomolecules in biological samples. Methods We investigated topic models to computationally analyze mass spectrometric data considering both integrated peak intensities and scan-level features, i.e., extracted ion chromatograms (EICs). Probabilistic generative models enable flexible representation in data structure and infer sample-specific pure resources. Scan-level modeling helps alleviate information loss during data preprocessing. We evaluated the capability of the proposed models in capturing mixture proportions of contaminants and cancer profiles on LC-MS based serum proteomic and GC-MS based tissue metabolomic datasets acquired from patients with hepatocellular carcinoma (HCC) and liver cirrhosis as well as synthetic data we generated based on the serum proteomic data. Results The results we obtained by analysis of the synthetic data demonstrated that both intensity-level and scan-level purification models can accurately infer the mixture proportions and the underlying true cancerous sources with small average error ratios (<7 %) between estimation and ground truth. By applying the topic model-based purification to mass spectrometric data, we found more proteins and metabolites with significant changes between HCC cases and cirrhotic controls. Candidate biomarkers selected after purification yielded biologically meaningful pathway analysis results and improved disease discrimination power in terms of the area under ROC curve compared to the results found prior to purification.
Conclusions We investigated topic model-based inference methods to computationally address the heterogeneity issue in samples analyzed by LC/GC-MS. We observed that incorporation of scan-level features has the potential to lead to more accurate purification results by alleviating the loss in information as a result of integrating peaks. We believe cancer biomarker discovery studies that use mass spectrometric analysis of human biospecimens can greatly benefit from topic model-based purification of the data prior to statistical and pathway analyses.
Affiliation(s)
- Minkun Wang
  - Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA; Department of Electrical and Computer Engineering, Virginia Tech, 900 N Glebe Rd, Arlington, VA, USA
- Tsung-Heng Tsai
  - Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA
- Cristina Di Poto
  - Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA
- Alessia Ferrarini
  - Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA
- Guoqiang Yu
  - Department of Electrical and Computer Engineering, Virginia Tech, 900 N Glebe Rd, Arlington, VA, USA
- Habtom W Ressom
  - Department of Oncology, Georgetown University, 4000 Reservoir Rd NW, Washington D.C., USA
|
10
|
Mollah MMH, Jamal R, Mokhtar NM, Harun R, Mollah MNH. A Hybrid One-Way ANOVA Approach for the Robust and Efficient Estimation of Differential Gene Expression with Multiple Patterns. PLoS One 2015; 10:e0138810. [PMID: 26413858 PMCID: PMC4587675 DOI: 10.1371/journal.pone.0138810] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Received: 01/19/2015] [Accepted: 09/03/2015] [Indexed: 11/22/2022] Open
Abstract
Background Identifying genes that are differentially expressed (DE) between two or more conditions with multiple patterns of expression is one of the primary objectives of gene expression data analysis. Several statistical approaches, including one-way analysis of variance (ANOVA), are used to identify DE genes. However, most of these methods provide misleading results for two or more conditions with multiple patterns of expression in the presence of outlying genes. In this paper, an attempt is made to develop a hybrid one-way ANOVA approach that unifies the robustness and efficiency of estimation using the minimum β-divergence method to overcome some problems that arise in the existing robust methods for both small- and large-sample cases with multiple patterns of expression. Results The proposed method relies on a β-weight function, which produces values between 0 and 1. The β-weight function with β = 0.2 is used as a measure of outlier detection. It assigns smaller weights (≥ 0) to outlying expressions and larger weights (≤ 1) to typical expressions. The distribution of the β-weights is used to calculate the cut-off point, which is compared to the observed β-weight of an expression to determine whether that gene expression is an outlier. This weight function plays a key role in unifying the robustness and efficiency of estimation in one-way ANOVA. Conclusion Analyses of simulated gene expression profiles revealed that all eight methods (ANOVA, SAM, LIMMA, EBarrays, eLNN, KW, robust BetaEB and proposed) perform almost identically for m = 2 conditions in the absence of outliers. However, the robust BetaEB method and the proposed method exhibited considerably better performance than the other six methods in the presence of outliers. 
In this case, the BetaEB method exhibited slightly better performance than the proposed method for the small-sample cases, but the proposed method exhibited much better performance than the BetaEB method for both the small- and large-sample cases in the presence of more than 50% outlying genes. The proposed method also exhibited better performance than the other methods for m > 2 conditions with multiple patterns of expression, where the BetaEB was not extended for this condition. Therefore, the proposed approach would be more suitable and reliable on average for the identification of DE genes between two or more conditions with multiple patterns of expression.
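The abstract states that the β-weight function takes values between 0 and 1 and is used with β = 0.2, but it does not give the formula. The sketch below assumes a Gaussian-kernel form, exp(-(β/2)·((x − μ)/σ)²), with a median and MAD-based scale as robust location and spread; that functional form, the fixed cut-off of 0.2, and the name `beta_weights` are assumptions for illustration only. The paper derives its cut-off from the distribution of the weights instead.

```python
import numpy as np

def beta_weights(x, mu, sigma, beta=0.2):
    """Assumed Gaussian-kernel beta-weight: 1 at the center, near 0 far away."""
    z = (np.asarray(x, dtype=float) - mu) / sigma
    return np.exp(-0.5 * beta * z ** 2)

# Hypothetical expression values for one gene; the last one is an outlier.
expr = np.array([5.0, 5.2, 4.9, 5.1, 12.0])
mu = np.median(expr)
sigma = 1.4826 * np.median(np.abs(expr - mu))  # MAD-based scale estimate
w = beta_weights(expr, mu, sigma)

# Hypothetical fixed cut-off, used here only to flag the low-weight value.
outliers = w < 0.2
```

Typical expressions receive weights close to 1 and the outlying value receives a weight close to 0, matching the qualitative behavior the abstract describes.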
Affiliation(s)
- Mohammad Manir Hossain Mollah
  - Institut Perubatan Molekul UKM (UMBI), University Kebangsaan Malaysia (UKM), Jalan Ya’acob Latiff, Bandar Tun Razak, Cheras 56000 Kuala Lumpur, Malaysia
- Rahman Jamal
  - Institut Perubatan Molekul UKM (UMBI), University Kebangsaan Malaysia (UKM), Jalan Ya’acob Latiff, Bandar Tun Razak, Cheras 56000 Kuala Lumpur, Malaysia
- Norfilza Mohd Mokhtar
  - Institut Perubatan Molekul UKM (UMBI), University Kebangsaan Malaysia (UKM), Jalan Ya’acob Latiff, Bandar Tun Razak, Cheras 56000 Kuala Lumpur, Malaysia
  - Department of Physiology, Faculty of Medicine, Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Roslan Harun
  - Institut Perubatan Molekul UKM (UMBI), University Kebangsaan Malaysia (UKM), Jalan Ya’acob Latiff, Bandar Tun Razak, Cheras 56000 Kuala Lumpur, Malaysia
- Md. Nurul Haque Mollah
  - Laboratory of Bioinformatics, Department of Statistics, University of Rajshahi, Rajshahi-6205, Bangladesh
|
11
|
Ganjali M, Baghfalaki T, Berridge D. Robust modeling of differential gene expression data using normal/independent distributions: a Bayesian approach. PLoS One 2015; 10:e0123791. [PMID: 25910040 PMCID: PMC4409222 DOI: 10.1371/journal.pone.0123791] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 11/18/2014] [Accepted: 03/07/2015] [Indexed: 11/18/2022] Open
Abstract
In this paper, the problem of identifying differentially expressed genes under different conditions using gene expression microarray data, in the presence of outliers, is discussed. For this purpose, the robust modeling of gene expression data using some powerful distributions known as normal/independent distributions is considered. These distributions include the Student’s t and normal distributions which have been used previously, but also include extensions such as the slash, the contaminated normal and the Laplace distributions. The purpose of this paper is to identify differentially expressed genes by considering these distributional assumptions instead of the normal distribution. A Bayesian approach using the Markov Chain Monte Carlo method is adopted for parameter estimation. Two publicly available gene expression data sets are analyzed using the proposed approach. The use of the robust models for detecting differentially expressed genes is investigated. This investigation shows that the choice of model for differentiating gene expression data is very important. This is due to the small number of replicates for each gene and the existence of outlying data. Comparison of the performance of these models is made using different statistical criteria and the ROC curve. The method is illustrated using some simulation studies. We demonstrate the flexibility of these robust models in identifying differentially expressed genes.
Affiliation(s)
- Mojtaba Ganjali
  - School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
  - Department of Statistics, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
- Taban Baghfalaki
  - School of Biological Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
  - Department of Statistics, Faculty of Mathematical Sciences, Tarbiat Modares University, Tehran, Iran
- Damon Berridge
  - Farr Institute-CIPHER, College of Medicine, Swansea University, Swansea, Wales, U.K.
|
12
|
Thomas N, Sweeney K, Somayaji V. Meta-Analysis of Clinical Dose–Response in a Large Drug Development Portfolio. Stat Biopharm Res 2014. [DOI: 10.1080/19466315.2014.924876] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.9] [Indexed: 10/25/2022]
|
13
|
Abstract
Motivation: Gene expression assays allow for genome scale analyses of molecular biological mechanisms. State-of-the-art data analysis provides lists of involved genes, either by calculating significance levels of mRNA abundance or by Bayesian assessments of gene activity. A common problem of such approaches is the difficulty of interpreting the biological implication of the resulting gene lists. This has led to an increased interest in methods for inferring high-level biological information. A common approach for representing high-level information is by inferring gene ontology (GO) terms which may be attributed to the expression data experiment. Results: This article proposes a probabilistic model for GO term inference. Modelling assumes that gene annotations to GO terms are available and gene involvement in an experiment is represented by posterior probabilities over gene-specific indicator variables. Such probability measures result from many Bayesian approaches for expression data analysis. The proposed model combines these indicator probabilities in a probabilistic fashion and provides a probabilistic GO term assignment as a result. Experiments on synthetic and microarray data suggest that advantages of the proposed probabilistic GO term inference over statistical test-based approaches are in particular evident for sparsely annotated GO terms and in situations of large uncertainty about gene activity. Provided that appropriate annotations exist, the proposed approach is easily applied to inferring other high-level assignments like pathways. Availability: Source code under GPL license is available from the author. Contact: peter.sykacek@boku.ac.at
Affiliation(s)
- P Sykacek
  - Department of Biotechnology, BOKU University, Muthgasse 18, 1190 Vienna
|
14
|
PERT: a method for expression deconvolution of human blood samples from varied microenvironmental and developmental conditions. PLoS Comput Biol 2012; 8:e1002838. [PMID: 23284283 PMCID: PMC3527275 DOI: 10.1371/journal.pcbi.1002838] [Citation(s) in RCA: 82] [Impact Index Per Article: 6.3] [Received: 06/05/2012] [Accepted: 10/26/2012] [Indexed: 12/30/2022] Open
Abstract
The cellular composition of heterogeneous samples can be predicted using an expression deconvolution algorithm to decompose their gene expression profiles based on pre-defined, reference gene expression profiles of the constituent populations in these samples. However, the expression profiles of the actual constituent populations are often perturbed from those of the reference profiles due to gene expression changes in cells associated with microenvironmental or developmental effects. Existing deconvolution algorithms do not account for these changes and give incorrect results when benchmarked against those measured by well-established flow cytometry, even after batch correction was applied. We introduce PERT, a new probabilistic expression deconvolution method that detects and accounts for a shared, multiplicative perturbation in the reference profiles when performing expression deconvolution. We applied PERT and three other state-of-the-art expression deconvolution methods to predict cell frequencies within heterogeneous human blood samples that were collected under several conditions (uncultured mono-nucleated and lineage-depleted cells, and culture-derived lineage-depleted cells). Only PERT's predicted proportions of the constituent populations matched those assigned by flow cytometry. Genes associated with cell cycle processes were highly enriched among those with the largest predicted expression changes between the cultured and uncultured conditions. We anticipate that PERT will be widely applicable to expression deconvolution strategies that use profiles from reference populations that vary from the corresponding constituent populations in cellular state but not cellular phenotypic identity.
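The basic deconvolution idea in the abstract above, decomposing a mixed expression profile into proportions of reference profiles, can be sketched as a linear model: mixture ≈ reference × proportions. The example below is not PERT and does not model any perturbation of the reference profiles; it solves an ordinary least-squares problem, clips negative estimates, and renormalizes. The reference matrix, the proportions, and the name `estimate_proportions` are hypothetical.

```python
import numpy as np

def estimate_proportions(reference, mixture):
    """Least-squares estimate of cell-type proportions, clipped and normalized.

    reference: genes x cell-types matrix of reference expression profiles.
    mixture:   length-genes vector of expression measured in the mixed sample.
    """
    coef, *_ = np.linalg.lstsq(reference, mixture, rcond=None)
    coef = np.clip(coef, 0.0, None)  # proportions cannot be negative
    return coef / coef.sum()         # assumes at least one positive estimate

# Hypothetical reference profiles for two cell types over four genes.
reference = np.array([
    [10.0, 1.0],
    [2.0, 8.0],
    [5.0, 5.0],
    [1.0, 9.0],
])
true_props = np.array([0.3, 0.7])
mixture = reference @ true_props  # noiseless mixture for illustration
est = estimate_proportions(reference, mixture)
```

When the true constituent profiles differ from the references, this plain fit will be biased, which is the failure mode the abstract says PERT is designed to address.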
|
15
|
Non-gaussian distributions affect identification of expression patterns, functional annotation, and prospective classification in human cancer genomes. PLoS One 2012; 7:e46935. [PMID: 23118863 PMCID: PMC3485292 DOI: 10.1371/journal.pone.0046935] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 03/17/2012] [Accepted: 09/06/2012] [Indexed: 12/18/2022] Open
Abstract
INTRODUCTION Gene expression data is often assumed to be normally-distributed, but this assumption has not been tested rigorously. We investigate the distribution of expression data in human cancer genomes and study the implications of deviations from the normal distribution for translational molecular oncology research. METHODS We conducted a central moments analysis of five cancer genomes and performed empiric distribution fitting to examine the true distribution of expression data both on the complete-experiment and on the individual-gene levels. We used a variety of parametric and nonparametric methods to test the effects of deviations from normality on gene calling, functional annotation, and prospective molecular classification using a sixth cancer genome. RESULTS Central moments analyses reveal statistically-significant deviations from normality in all of the analyzed cancer genomes. We observe as much as 37% variability in gene calling, 39% variability in functional annotation, and 30% variability in prospective, molecular tumor subclassification associated with this effect. CONCLUSIONS Cancer gene expression profiles are not normally-distributed, either on the complete-experiment or on the individual-gene level. Instead, they exhibit complex, heavy-tailed distributions characterized by statistically-significant skewness and kurtosis. The non-Gaussian distribution of this data affects identification of differentially-expressed genes, functional annotation, and prospective molecular classification. These effects may be reduced in some circumstances, although not completely eliminated, by using nonparametric analytics. This analysis highlights two unreliable assumptions of translational cancer gene expression analysis: that "small" departures from normality in the expression data distributions are analytically-insignificant and that "robust" gene-calling algorithms can fully compensate for these effects.
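A central moments analysis of the kind mentioned above typically reports skewness and excess kurtosis, both of which are zero for a normal distribution. The sketch below computes the standard population (biased) versions of these two statistics; the function name and the toy inputs are assumptions for this example, and the paper's exact estimators and significance tests are not reproduced.

```python
import numpy as np

def skewness_and_excess_kurtosis(x):
    """Population skewness m3 / m2**1.5 and excess kurtosis m4 / m2**2 - 3."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)
    m3 = np.mean(centered ** 3)
    m4 = np.mean(centered ** 4)
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

# Symmetric toy data: zero skewness, lighter tails than a normal distribution.
skew, kurt = skewness_and_excess_kurtosis([1, 2, 3, 4, 5])

# One large value produces positive skewness.
skew_tail, kurt_tail = skewness_and_excess_kurtosis([0, 0, 0, 0, 10])
```

Applied per gene or to a whole experiment, clearly non-zero values of either statistic are the kind of deviation from normality the abstract reports.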
|
16
|
Li W. Volcano plots in analyzing differential expressions with mRNA microarrays. J Bioinform Comput Biol 2012; 10:1231003. [PMID: 23075208 DOI: 10.1142/s0219720012310038] [Citation(s) in RCA: 122] [Impact Index Per Article: 9.4] [Indexed: 12/20/2022]
Abstract
A volcano plot displays unstandardized signal (e.g. log-fold-change) against noise-adjusted/standardized signal (e.g. t-statistic or -log10(p-value) from the t-test). We review the basic and interactive use of the volcano plot and its crucial role in understanding the regularized t-statistic. The joint filtering gene selection criterion based on regularized statistics has a curved discriminant line in the volcano plot, as compared to the two perpendicular lines for the "double filtering" criterion. This review attempts to provide a unifying framework for discussions on alternative measures of differential expression, improved methods for estimating variance, and visual display of a microarray analysis result. We also discuss the possibility of applying volcano plots to other fields beyond microarray.
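The two axes of a volcano plot and the "double filtering" criterion can be sketched as follows. This is a minimal illustration under stated assumptions: inputs are taken to be log2 expression values, the standardized signal is an ordinary Welch t-statistic rather than the regularized statistic the review discusses, and the function names and cutoffs (`lfc_cut`, `t_cut`) are hypothetical.

```python
import statistics


def volcano_point(group_a, group_b):
    """Return (log-fold-change, Welch t-statistic) for one gene.

    Inputs are assumed to be log2 expression values, so the difference of
    group means is the log2 fold change. Each group needs at least two
    samples and non-zero variance.
    """
    lfc = statistics.fmean(group_b) - statistics.fmean(group_a)
    se = (
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    ) ** 0.5
    return lfc, lfc / se


def double_filter(lfc, t_stat, lfc_cut=1.0, t_cut=3.0):
    """Select a gene only if it passes both perpendicular cutoffs."""
    return abs(lfc) >= lfc_cut and abs(t_stat) >= t_cut


# Hypothetical log2 expression values: control vs. treated, two genes.
control = [5.0, 5.2, 4.8]
strong_change = [7.0, 7.2, 6.8]
weak_change = [5.1, 5.3, 4.9]

for treated in (strong_change, weak_change):
    lfc, t_stat = volcano_point(control, treated)
    print(round(lfc, 3), round(t_stat, 3), double_filter(lfc, t_stat))
```

Plotting `lfc` on the horizontal axis and `t_stat` (or -log10 of its p-value) on the vertical axis gives the volcano shape; the two cutoffs correspond to the perpendicular lines mentioned in the abstract.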
Affiliation(s)
- Wentian Li
- The Robert S. Boas Center for Genomics and Human Genetics, The Feinstein Institute for Medical Research, North Shore LIJ Health System, Manhasset, 350 Community Drive, NY 11030, USA.
|
17
|
Sloutsky R, Jimenez N, Swamidass SJ, Naegle KM. Accounting for noise when clustering biological data. Brief Bioinform 2012; 14:423-36. [DOI: 10.1093/bib/bbs057] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022] Open
|
18
|
Cooper-Knock J, Kirby J, Ferraiuolo L, Heath PR, Rattray M, Shaw PJ. Gene expression profiling in human neurodegenerative disease. Nat Rev Neurol 2012; 8:518-30. [PMID: 22890216 DOI: 10.1038/nrneurol.2012.156] [Citation(s) in RCA: 158] [Impact Index Per Article: 12.2] [Indexed: 12/12/2022]
Abstract
Transcriptome study in neurodegenerative disease has advanced considerably in the past 5 years. Increasing scientific rigour and improved analytical tools have led to more-reproducible data. Many transcriptome analysis platforms assay the expression of the entire genome, enabling a complete biological context to be captured. Gene expression profiling (GEP) is, therefore, uniquely placed to discover pathways of disease pathogenesis, potential therapeutic targets, and biomarkers. This Review summarizes microarray human GEP studies in the common neurodegenerative diseases amyotrophic lateral sclerosis (ALS), Parkinson disease (PD) and Alzheimer disease (AD). Several interesting reports have compared pathological gene expression in different patient groups, disease stages and anatomical areas. In all three diseases, GEP has revealed dysregulation of genes related to neuroinflammation. In ALS and PD, gene expression related to RNA splicing and protein turnover is disrupted, and several studies in ALS support involvement of the cytoskeleton. GEP studies have implicated the ubiquitin-proteasome system in PD pathogenesis, and have provided evidence of mitochondrial dysfunction in PD and AD. Lastly, in AD, a possible role for dysregulation of intracellular signalling pathways, including calcium signalling, has been highlighted. This Review also provides a discussion of methodological considerations in microarray sample preparation and data analysis.
Affiliation(s)
- Johnathan Cooper-Knock
- Academic Unit of Neurology, Sheffield Institute for Translational Neuroscience, University of Sheffield, 385A Glossop Road, Sheffield S10 2HQ, UK
|
19
|
Mollah MMH, Mollah MNH, Kishino H. β-empirical Bayes inference and model diagnosis of microarray data. BMC Bioinformatics 2012; 13:135. [PMID: 22713095 PMCID: PMC3464654 DOI: 10.1186/1471-2105-13-135] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 12/21/2011] [Accepted: 04/23/2012] [Indexed: 12/04/2022] Open
Abstract
Background: Microarray data enables the high-throughput survey of mRNA expression profiles at the genomic level; however, the data presents a challenging statistical problem because of the large number of transcripts with small sample sizes that are obtained. To reduce the dimensionality, various Bayesian or empirical Bayes hierarchical models have been developed. However, because of the complexity of the microarray data, no model can explain the data fully. It is generally difficult to scrutinize the irregular patterns of expression that are not expected by the usual statistical gene-by-gene models.
Results: As an extension of empirical Bayes (EB) procedures, we have developed the β-empirical Bayes (β-EB) approach based on a β-likelihood measure which can be regarded as an 'evidence-based' weighted (quasi-) likelihood inference. The weight of a transcript t is described as a power function of its likelihood, f^β(y_t|θ). Genes with low likelihoods have unexpected expression patterns and low weights. By assigning low weights to outliers, the inference becomes robust. The value of β, which controls the balance between the robustness and efficiency, is selected by maximizing the predictive β_0-likelihood by cross-validation. The proposed β-EB approach identified six significant (p < 10^-5) contaminated transcripts as differentially expressed (DE) in normal/tumor tissues from the head and neck of cancer patients. These six genes were all confirmed to be related to cancer; they were not identified as DE genes by the classical EB approach. When applied to the eQTL analysis of Arabidopsis thaliana, the proposed β-EB approach identified some potential master regulators that were missed by the EB approach.
Conclusions: The simulation data and real gene expression data showed that the proposed β-EB method was robust against outliers. The distribution of the weights was used to scrutinize the irregular patterns of expression and diagnose the model statistically. When β-weights outside the range of the predicted distribution were observed, a detailed inspection of the data was carried out. The β-weights described here can be applied to other likelihood-based statistical models for diagnosis, and may serve as a useful tool for transcriptome and proteome studies.
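The idea of weighting each observation by a power of its likelihood can be illustrated with a toy example. The sketch below is not the β-EB procedure from the paper: it assumes a plain normal model with fixed, known parameters (`mu`, `sigma`), applies a single weighting step rather than an iterative fit, and uses hypothetical function names and data.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of a normal distribution; stands in for the model likelihood."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def beta_weights(values, mu, sigma, beta):
    """Weight each observation by its likelihood raised to the power beta.

    beta = 0 gives every observation weight 1 (no down-weighting);
    larger beta down-weights low-likelihood observations more strongly.
    """
    return [normal_pdf(y, mu, sigma) ** beta for y in values]


def weighted_mean(values, weights):
    """Mean of the values with the given non-negative weights."""
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


# Hypothetical standardized expression values with one gross outlier.
values = [0.1, -0.2, 0.0, 0.3, 8.0]
weights = beta_weights(values, mu=0.0, sigma=1.0, beta=0.5)

print([round(w, 4) for w in weights])
print(sum(values) / len(values), weighted_mean(values, weights))
```

The outlier at 8.0 receives a weight close to zero, so the weighted mean stays near the bulk of the data while the ordinary mean is pulled upward. Inspecting which observations receive unusually small weights is the diagnostic use the abstract refers to.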
Affiliation(s)
- Mohammad Manir Hossain Mollah
- Graduate School of Agricultural and Life Sciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-8657, Japan.
|
20
|
Abstract
Gene expression data are influenced by multiple biological and technological factors leading to a wide range of dispersion scenarios, although skewed patterns are not commonly addressed in microarray analyses. In this study, the distribution pattern of several human transcriptomes has been studied on free-access microarray gene expression data. Our results showed that, even in previously normalized gene expression data, probe and differential expression within probe effects suffer from substantial departures from the commonly assumed symmetric Gaussian distribution. We developed a flexible mixed model for non-competitive microarray data analysis that accounted for asymmetric and heavy-tailed (Student's t distribution) dispersion processes. Random effects for gene expression data were modeled under asymmetric Student's t distributions where the asymmetry parameter (λ) took values from perfect symmetry (λ = 0) to right- (λ > 0) or left-side (λ < 0) over-expression patterns. This approach was applied to four free-access human data sets and revealed clearly better model performance compared with standard approaches that assume traditional symmetric Gaussian distribution patterns. Our analyses on human gene expression data revealed a substantial degree of right-hand asymmetry for probe effects, whereas differential gene expression addressed both symmetric and left-hand asymmetric patterns. Although these results cannot be extrapolated to all microarray experiments, they highlighted the incidence of skew dispersion patterns in the human transcriptome; moreover, we provided a new analytical approach to appropriately address this biological phenomenon. The source code of the program accommodating these analytical developments and additional information about practical aspects of running the program are freely available by request to the corresponding author of this article.
|