1
|
Rahmatallah Y, Glazko G. Improving data interpretability with new differential sample variance gene set tests. BMC Bioinformatics 2025; 26:103. [PMID: 40229677 PMCID: PMC11998189 DOI: 10.1186/s12859-025-06117-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2024] [Accepted: 03/20/2025] [Indexed: 04/16/2025] Open
Abstract
BACKGROUND Gene set analysis methods have played a major role in generating biological interpretations of omics data such as gene expression datasets. However, most methods focus on detecting homogenous pattern changes in mean expression while methods detecting pattern changes in variance remain poorly explored. While a few studies attempted to use gene-level variance analysis, such approach remains under-utilized. When comparing two phenotypes, gene sets with distinct changes in subgroups under one phenotype are overlooked by available methods although they reflect meaningful biological differences between two phenotypes. Multivariate sample-level variance analysis methods are needed to detect such pattern changes. RESULTS We used ranking schemes based on minimum spanning tree to generalize the Cramer-Von Mises and Anderson-Darling univariate statistics into multivariate gene set analysis methods to detect differential sample variance or mean. We characterized the detection power and Type I error rate of these methods in addition to two methods developed earlier using simulation results with different parameters. We applied the developed methods to microarray gene expression dataset of prednisolone-resistant and prednisolone-sensitive children diagnosed with B-lineage acute lymphoblastic leukemia and bulk RNA-sequencing gene expression dataset of benign hyperplastic polyps and potentially malignant sessile serrated adenoma/polyps. One or both of the two compared phenotypes in each of these datasets have distinct molecular subtypes that contribute to within phenotype variability and to heterogeneous differences between two compared phenotypes. Our results show that methods designed to detect differential sample variance provide meaningful biological interpretations by detecting specific hallmark gene sets associated with the two compared phenotypes as documented in available literature. CONCLUSIONS The results of this study demonstrate the usefulness of methods designed to detect differential sample variance in providing biological interpretations when biologically relevant but heterogeneous changes between two phenotypes are prevalent in specific signaling pathways. Software implementation of the methods is available with detailed documentation from Bioconductor package GSAR. The available methods are applicable to gene expression datasets in a normalized matrix form and could be used with other omics datasets in a normalized matrix form with available collection of feature sets.
Collapse
Affiliation(s)
- Yasir Rahmatallah
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA.
| | - Galina Glazko
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| |
Collapse
|
2
|
Rahmatallah Y, Glazko G. Improving data interpretability with new differential sample variance gene set tests. RESEARCH SQUARE 2024:rs.3.rs-4888767. [PMID: 39315246 PMCID: PMC11419169 DOI: 10.21203/rs.3.rs-4888767/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/25/2024]
Abstract
Background Gene set analysis methods have played a major role in generating biological interpretations from omics data such as gene expression datasets. However, most methods focus on detecting homogenous pattern changes in mean expression and methods detecting pattern changes in variance remain poorly explored. While a few studies attempted to use gene-level variance analysis, such approach remains under-utilized. When comparing two phenotypes, gene sets with distinct changes in subgroups under one phenotype are overlooked by available methods although they reflect meaningful biological differences between two phenotypes. Multivariate sample-level variance analysis methods are needed to detect such pattern changes. Results We use ranking schemes based on minimum spanning tree to generalize the Cramer-Von Mises and Anderson-Darling univariate statistics into multivariate gene set analysis methods to detect differential sample variance or mean. We characterize these methods in addition to two methods developed earlier using simulation results with different parameters. We apply the developed methods to microarray gene expression dataset of prednisolone-resistant and prednisolone-sensitive children diagnosed with B-lineage acute lymphoblastic leukemia and bulk RNA-sequencing gene expression dataset of benign hyperplastic polyps and potentially malignant sessile serrated adenoma/polyps. One or both of the two compared phenotypes in each of these datasets have distinct molecular subtypes that contribute to heterogeneous differences. Our results show that methods designed to detect differential sample variance are able to detect specific hallmark signaling pathways associated with the two compared phenotypes as documented in available literature. Conclusions The results in this study demonstrate the usefulness of methods designed to detect differential sample variance in providing biological interpretations when biologically relevant but heterogeneous changes between two phenotypes are prevalent in specific signaling pathways. Software implementation of the developed methods is available with detailed documentation from Bioconductor package GSAR. The available methods are applicable to gene expression datasets in a normalized matrix form and could be used with other omics datasets in a normalized matrix form with available collection of feature sets.
Collapse
Affiliation(s)
- Yasir Rahmatallah
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
| | - Galina Glazko
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
| |
Collapse
|
3
|
Kim K, Kim S, Ahn T, Kim H, Shin SJ, Choi CH, Park S, Kim YB, No JH, Suh DH. A differential diagnosis between uterine leiomyoma and leiomyosarcoma using transcriptome analysis. BMC Cancer 2023; 23:1215. [PMID: 38066476 PMCID: PMC10709939 DOI: 10.1186/s12885-023-11394-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2022] [Accepted: 09/11/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND The objective of this study was to estimate the accuracy of transcriptome-based classifier in differential diagnosis of uterine leiomyoma and leiomyosarcoma. We manually selected 114 normal uterine tissue and 31 leiomyosarcoma samples from publicly available transcriptome data in UCSC Xena as training/validation sets. We developed pre-processing procedure and gene selection method to sensitively find genes of larger variance in leiomyosarcoma than normal uterine tissues. Through our method, 17 genes were selected to build transcriptome-based classifier. The prediction accuracies of deep feedforward neural network (DNN), support vector machine (SVM), random forest (RF), and gradient boosting (GB) models were examined. We interpret the biological functionality of selected genes via network-based analysis using GeneMANIA. To validate the performance of trained model, we additionally collected 35 clinical samples of leiomyosarcoma and leiomyoma as a test set (18 + 17 as 1st and 2nd test sets). RESULTS We discovered genes expressed in a highly variable way in leiomyosarcoma while these genes are expressed in a conserved way in normal uterine samples. These genes were mainly associated with DNA replication. As gene selection and model training were made in leiomyosarcoma and uterine normal tissue, proving discriminant of ability between leiomyosarcoma and leiomyoma is necessary. Thus, further validation of trained model was conducted in newly collected clinical samples of leiomyosarcoma and leiomyoma. The DNN classifier performed sensitivity 0.88, 0.77 (8/9, 7/9) while the specificity 1.0 (8/8, 8/8) in two test data set supporting that the selected genes in conjunction with DNN classifier are well discriminating the difference between leiomyosarcoma and leiomyoma in clinical sample. CONCLUSION The transcriptome-based classifier accurately distinguished uterine leiomyosarcoma from leiomyoma. Our method can be helpful in clinical practice through the biopsy of sample in advance of surgery. Identification of leiomyosarcoma let the doctor avoid of laparoscopic surgery, thus it minimizes un-wanted tumor spread.
Collapse
Affiliation(s)
- Kidong Kim
- Department of Obstetrics and Gynecology, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
| | - Sarah Kim
- Department of Life Science, Handong Global University, Pohang, Republic of Korea
| | - TaeJin Ahn
- Department of Life Science, Handong Global University, Pohang, Republic of Korea.
| | - Hyojin Kim
- Department of Pathology, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
| | - So-Jin Shin
- Department of Gynecology and Obstetrics, School of Medicine, Keimyung University, Daegu, Republic of Korea
| | - Chel Hun Choi
- Department of Obstetrics and Gynecology, Sungkyunkwan University School of Medicine, Seoul, Republic of Korea
| | - Sungmin Park
- Department of Life Science, Handong Global University, Pohang, Republic of Korea
| | - Yong-Beom Kim
- Department of Obstetrics and Gynecology, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
| | - Jae Hong No
- Department of Obstetrics and Gynecology, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
| | - Dong Hoon Suh
- Department of Obstetrics and Gynecology, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
| |
Collapse
|
4
|
Xin R, Cheng Q, Chi X, Feng X, Zhang H, Wang Y, Duan M, Xie T, Song X, Yu Q, Fan Y, Huang L, Zhou F. Computational Characterization of Undifferentially Expressed Genes with Altered Transcription Regulation in Lung Cancer. Genes (Basel) 2023; 14:2169. [PMID: 38136991 PMCID: PMC10742656 DOI: 10.3390/genes14122169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2023] [Revised: 11/19/2023] [Accepted: 11/27/2023] [Indexed: 12/24/2023] Open
Abstract
A transcriptome profiles the expression levels of genes in cells and has accumulated a huge amount of public data. Most of the existing biomarker-related studies investigated the differential expression of individual transcriptomic features under the assumption of inter-feature independence. Many transcriptomic features without differential expression were ignored from the biomarker lists. This study proposed a computational analysis protocol (mqTrans) to analyze transcriptomes from the view of high-dimensional inter-feature correlations. The mqTrans protocol trained a regression model to predict the expression of an mRNA feature from those of the transcription factors (TFs). The difference between the predicted and real expression of an mRNA feature in a query sample was defined as the mqTrans feature. The new mqTrans view facilitated the detection of thirteen transcriptomic features with differentially expressed mqTrans features, but without differential expression in the original transcriptomic values in three independent datasets of lung cancer. These features were called dark biomarkers because they would have been ignored in a conventional differential analysis. The detailed discussion of one dark biomarker, GBP5, and additional validation experiments suggested that the overlapping long non-coding RNAs might have contributed to this interesting phenomenon. In summary, this study aimed to find undifferentially expressed genes with significantly changed mqTrans values in lung cancer. These genes were usually ignored in most biomarker detection studies of undifferential expression. However, their differentially expressed mqTrans values in three independent datasets suggested their strong associations with lung cancer.
Collapse
Affiliation(s)
- Ruihao Xin
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (R.X.); (Y.W.); (M.D.); (L.H.)
- Jilin Institute of Chemical Technology, College of Information and Control Engineering, Jilin 132000, China; (Q.C.); (X.C.); (H.Z.)
| | - Qian Cheng
- Jilin Institute of Chemical Technology, College of Information and Control Engineering, Jilin 132000, China; (Q.C.); (X.C.); (H.Z.)
| | - Xiaohang Chi
- Jilin Institute of Chemical Technology, College of Information and Control Engineering, Jilin 132000, China; (Q.C.); (X.C.); (H.Z.)
| | - Xin Feng
- School of Science, Jilin Institute of Chemical Technology, Jilin 132000, China;
- Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun 130012, China;
| | - Hang Zhang
- Jilin Institute of Chemical Technology, College of Information and Control Engineering, Jilin 132000, China; (Q.C.); (X.C.); (H.Z.)
| | - Yueying Wang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (R.X.); (Y.W.); (M.D.); (L.H.)
| | - Meiyu Duan
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (R.X.); (Y.W.); (M.D.); (L.H.)
| | - Tunyang Xie
- Centre for Mathematical Sciences, University of Cambridge, Wilberforce Road, Cambridge CB3 0WA, UK;
| | - Xiaonan Song
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Software, Jilin University, Changchun 130012, China;
| | - Qiong Yu
- Department of Epidemiology and Biostatistics, School of Public Health, Jilin University, Changchun 130012, China;
| | - Yusi Fan
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Software, Jilin University, Changchun 130012, China;
| | - Lan Huang
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (R.X.); (Y.W.); (M.D.); (L.H.)
| | - Fengfeng Zhou
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China; (R.X.); (Y.W.); (M.D.); (L.H.)
- School of Biology and Engineering, Guizhou Medical University, Guiyang 550025, China
| |
Collapse
|
5
|
Roberts AGK, Catchpoole DR, Kennedy PJ. Identification of differentially distributed gene expression and distinct sets of cancer-related genes identified by changes in mean and variability. NAR Genom Bioinform 2022; 4:lqab124. [PMID: 35047816 PMCID: PMC8759562 DOI: 10.1093/nargab/lqab124] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2021] [Revised: 11/19/2021] [Accepted: 12/16/2021] [Indexed: 12/13/2022] Open
Abstract
There is increasing evidence that changes in the variability or overall distribution of gene expression are important both in normal biology and in diseases, particularly cancer. Genes whose expression differs in variability or distribution without a difference in mean are ignored by traditional differential expression-based analyses. Using a Bayesian hierarchical model that provides tests for both differential variability and differential distribution for bulk RNA-seq data, we report here an investigation into differential variability and distribution in cancer. Analysis of eight paired tumour-normal datasets from The Cancer Genome Atlas confirms that differential variability and distribution analyses are able to identify cancer-related genes. We further demonstrate that differential variability identifies cancer-related genes that are missed by differential expression analysis, and that differential expression and differential variability identify functionally distinct sets of potentially cancer-related genes. These results suggest that differential variability analysis may provide insights into genetic aspects of cancer that would not be revealed by differential expression, and that differential distribution analysis may allow for more comprehensive identification of cancer-related genes than analyses based on changes in mean or variability alone.
Collapse
|
6
|
Abstract
Given the ever-increasing amount of high-dimensional and complex omics data becoming available, it is increasingly important to discover simple but effective methods of analysis. Divergence analysis transforms each entry of a high-dimensional omics profile into a digitized (binary or ternary) code based on the deviation of the entry from a given baseline population. This is a novel framework that is significantly different from existing omics data analysis methods: it allows digitization of continuous omics data at the univariate or multivariate level, facilitates sample level analysis, and is applicable on many different omics platforms. The divergence package, available on the R platform through the Bioconductor repository collection, provides easy-to-use functions for carrying out this transformation. Here we demonstrate how to use the package with data from the Cancer Genome Atlas.
Collapse
|
7
|
Davis-Marcisak EF, Sherman TD, Orugunta P, Stein-O'Brien GL, Puram SV, Roussos Torres ET, Hopkins AC, Jaffee EM, Favorov AV, Afsari B, Goff LA, Fertig EJ. Differential Variation Analysis Enables Detection of Tumor Heterogeneity Using Single-Cell RNA-Sequencing Data. Cancer Res 2019; 79:5102-5112. [PMID: 31337651 PMCID: PMC6844448 DOI: 10.1158/0008-5472.can-18-3882] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2018] [Revised: 05/13/2019] [Accepted: 07/19/2019] [Indexed: 12/20/2022]
Abstract
Tumor heterogeneity provides a complex challenge to cancer treatment and is a critical component of therapeutic response, disease recurrence, and patient survival. Single-cell RNA-sequencing (scRNA-seq) technologies have revealed the prevalence of intratumor and intertumor heterogeneity. Computational techniques are essential to quantify the differences in variation of these profiles between distinct cell types, tumor subtypes, and patients to fully characterize intratumor and intertumor molecular heterogeneity. In this study, we adapted our algorithm for pathway dysregulation, Expression Variation Analysis (EVA), to perform multivariate statistical analyses of differential variation of expression in gene sets for scRNA-seq. EVA has high sensitivity and specificity to detect pathways with true differential heterogeneity in simulated data. EVA was applied to several public domain scRNA-seq tumor datasets to quantify the landscape of tumor heterogeneity in several key applications in cancer genomics such as immunogenicity, metastasis, and cancer subtypes. Immune pathway heterogeneity of hematopoietic cell populations in breast tumors corresponded to the amount of diversity present in the T-cell repertoire of each individual. Cells from head and neck squamous cell carcinoma (HNSCC) primary tumors had significantly more heterogeneity across pathways than cells from metastases, consistent with a model of clonal outgrowth. Moreover, there were dramatic differences in pathway dysregulation across HNSCC basal primary tumors. Within the basal primary tumors, there was increased immune dysregulation in individuals with a high proportion of fibroblasts present in the tumor microenvironment. These results demonstrate the broad utility of EVA to quantify intertumor and intratumor heterogeneity from scRNA-seq data without reliance on low-dimensional visualization. SIGNIFICANCE: This study presents a robust statistical algorithm for evaluating gene expression heterogeneity within pathways or gene sets in single-cell RNA-seq data.
Collapse
Affiliation(s)
- Emily F Davis-Marcisak
- McKusick-Nathans Institute of the Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - Thomas D Sherman
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - Pranay Orugunta
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - Genevieve L Stein-O'Brien
- McKusick-Nathans Institute of the Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, Maryland
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - Sidharth V Puram
- Department of Otolaryngology-Head and Neck Surgery, Washington University School of Medicine, St. Louis, Missouri
- Department of Genetics, Washington University School of Medicine, St. Louis, Missouri
| | - Evanthia T Roussos Torres
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - Alexander C Hopkins
- Michigan Center for Translational Pathology, University of Michigan, Ann Arbor, Michigan
| | - Elizabeth M Jaffee
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - Alexander V Favorov
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, Maryland
- Laboratory of Systems Biology and Computational Genetics, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
| | - Bahman Afsari
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - Loyal A Goff
- McKusick-Nathans Institute of the Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, Maryland
- Solomon H. Snyder Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, Maryland
| | - Elana J Fertig
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, Maryland.
- Department of Applied Mathematics and Statistics, Johns Hopkins University Whiting School of Engineering, Baltimore, Maryland
- Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, Maryland
| |
Collapse
|
8
|
Roberts AGK, Catchpoole DR, Kennedy PJ. Variance-based Feature Selection for Classification of Cancer Subtypes Using Gene Expression Data. 2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN) 2018:1-8. [DOI: 10.1109/ijcnn.2018.8489279] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/06/2025]
|
9
|
Abstract
Technological advances enable increasingly comprehensive profiling of the molecular landscapes of cells, and these data can inform the personalized treatment of complex diseases. Two major obstacles are the complexity of these data and the high degree of person-to-person heterogeneity. We develop a highly simplified, personalized data representation by comparing the profile of an individual to the range of landscapes in a baseline population, thereby mimicking basic clinical diagnostic testing for departures of selected variables from normal levels. Moreover, our method can be applied to any data modality and at any level of granularity, from single features to any subset of features treated as a single entity, for example the gene expression levels in a pathway. Experiments involve both healthy human tissues and various cancer subtypes. Data collected from omics technologies have revealed pervasive heterogeneity and stochasticity of molecular states within and between phenotypes. A prominent example of such heterogeneity occurs between genome-wide mRNA, microRNA, and methylation profiles from one individual tumor to another, even within a cancer subtype. However, current methods in bioinformatics, such as detecting differentially expressed genes or CpG sites, are population-based and therefore do not effectively model intersample diversity. Here we introduce a unified theory to quantify sample-level heterogeneity that is applicable to a single omics profile. Specifically, we simplify an omics profile to a digital representation based on the omics profiles from a set of samples from a reference or baseline population (e.g., normal tissues). The state of any subprofile (e.g., expression vector for a subset of genes) is said to be “divergent” if it lies outside the estimated support of the baseline distribution and is consequently interpreted as “dysregulated” relative to that baseline. We focus on two cases: single features (e.g., individual genes) and distinguished subsets (e.g., regulatory pathways). Notably, since the divergence analysis is at the individual sample level, dysregulation can be analyzed probabilistically; for example, one can estimate the probability that a gene or pathway is divergent in some population. Finally, the reduction in complexity facilitates a more “personalized” and biologically interpretable analysis of variation, as illustrated by experiments involving tissue characterization, disease detection and progression, and disease–pathway associations.
Collapse
|
10
|
Patange S, Girvan M, Larson DR. Single-cell systems biology: probing the basic unit of information flow. ACTA ACUST UNITED AC 2017; 8:7-15. [PMID: 29552672 DOI: 10.1016/j.coisb.2017.11.011] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
Abstract
Gene expression varies across cells in a population or a tissue. This heterogeneity has come into sharp focus in recent years through developments in new imaging and sequencing technologies. However, our ability to measure variation has outpaced our ability to interpret it. Much of the variability may arise from random effects occurring in the processes of gene expression (transcription, RNA processing and decay, translation). The molecular basis of these effects is largely unknown. Likewise, a functional role of this variability in growth, differentiation and disease has only been elucidated in a few cases. In this review, we highlight recent experimental and theoretical advances for measuring and analyzing stochastic variation.
Collapse
Affiliation(s)
- Simona Patange
- Laboratory of Receptor Biology and Gene Expression, Center for Cancer Research, National Cancer Institute. Bethesda, MD 20892
- Institute for Physical Science and Technology, University of Maryland, College Park, MD
| | - Michelle Girvan
- Institute for Physical Science and Technology, University of Maryland, College Park, MD
- Department of Physics, University of Maryland. College Park, MD
| | - Daniel R Larson
- Laboratory of Receptor Biology and Gene Expression, Center for Cancer Research, National Cancer Institute. Bethesda, MD 20892
| |
Collapse
|
11
|
Abstract
The analysis of gene sets (in a form of functionally related genes or pathways) has become the method of choice for extracting the strongest signals from omics data. The motivation behind using gene sets instead of individual genes is two-fold. First, this approach incorporates pre-existing biological knowledge into the analysis and facilitates the interpretation of experimental results. Second, it employs a statistical hypotheses testing framework. Here, we briefly review main Gene Set Analysis (GSA) approaches for testing differential expression of gene sets and several GSA approaches for testing statistical hypotheses beyond differential expression that allow extracting additional biological information from the data. We distinguish three major types of GSA approaches testing: (1) differential expression (DE), (2) differential variability (DV), and (3) differential co-expression (DC) of gene sets between two phenotypes. We also present comparative power analysis and Type I error rates for different approaches in each major type of GSA on simulated data. Our evaluation presents a concise guideline for selecting GSA approaches best performing under particular experimental settings. The value of the three major types of GSA approaches is illustrated with real data example. While being applied to the same data set, major types of GSA approaches result in complementary biological information.
Collapse
|
12
|
Rahmatallah Y, Zybailov B, Emmert-Streib F, Glazko G. GSAR: Bioconductor package for Gene Set analysis in R. BMC Bioinformatics 2017; 18:61. [PMID: 28118818 PMCID: PMC5259853 DOI: 10.1186/s12859-017-1482-6] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2016] [Accepted: 01/10/2017] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Gene set analysis (in a form of functionally related genes or pathways) has become the method of choice for analyzing omics data in general and gene expression data in particular. There are many statistical methods that either summarize gene-level statistics for a gene set or apply a multivariate statistic that accounts for intergene correlations. Most available methods detect complex departures from the null hypothesis but lack the ability to identify the specific alternative hypothesis that rejects the null. RESULTS GSAR (Gene Set Analysis in R) is an open-source R/Bioconductor software package for gene set analysis (GSA). It implements self-contained multivariate non-parametric statistical methods testing a complex null hypothesis against specific alternatives, such as differences in mean (shift), variance (scale), or net correlation structure. The package also provides a graphical visualization tool, based on the union of two minimum spanning trees, for correlation networks to examine the change in the correlation structures of a gene set between two conditions and highlight influential genes (hubs). CONCLUSIONS Package GSAR provides a set of multivariate non-parametric statistical methods that test a complex null hypothesis against specific alternatives. The methods in package GSAR are applicable to any type of omics data that can be represented in a matrix format. The package, with detailed instructions and examples, is freely available under the GPL (> = 2) license from the Bioconductor web site.
Collapse
Affiliation(s)
- Yasir Rahmatallah
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA.
| | - Boris Zybailov
- Department of Biochemistry and Molecular Biology, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| | - Frank Emmert-Streib
- Computational Medicine and Statistical Learning Laboratory, Tampere University of Technology, Korkeakoulunkatu 1, Tampere, FI-33720, Finland
| | - Galina Glazko
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, 72205, USA
| |
Collapse
|
13
|
Feinberg AP, Koldobskiy MA, Göndör A. Epigenetic modulators, modifiers and mediators in cancer aetiology and progression. Nat Rev Genet 2016; 17:284-99. [PMID: 26972587 DOI: 10.1038/nrg.2016.13] [Citation(s) in RCA: 624] [Impact Index Per Article: 69.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
Abstract
This year is the tenth anniversary of the publication in this journal of a model suggesting the existence of 'tumour progenitor genes'. These genes are epigenetically disrupted at the earliest stages of malignancies, even before mutations, and thus cause altered differentiation throughout tumour evolution. The past decade of discovery in cancer epigenetics has revealed a number of similarities between cancer genes and stem cell reprogramming genes, widespread mutations in epigenetic regulators, and the part played by chromatin structure in cellular plasticity in both development and cancer. In the light of these discoveries, we suggest here a framework for cancer epigenetics involving three types of genes: 'epigenetic mediators', corresponding to the tumour progenitor genes suggested earlier; 'epigenetic modifiers' of the mediators, which are frequently mutated in cancer; and 'epigenetic modulators' upstream of the modifiers, which are responsive to changes in the cellular environment and often linked to the nuclear architecture. We suggest that this classification is helpful in framing new diagnostic and therapeutic approaches to cancer.
Collapse
Affiliation(s)
- Andrew P Feinberg
- Center for Epigenetics, Johns Hopkins University School of Medicine, 855 N. Wolfe Street, Rangos 570, Baltimore, Maryland 21205, USA
| | - Michael A Koldobskiy
- Center for Epigenetics, Johns Hopkins University School of Medicine, 855 N. Wolfe Street, Rangos 570, Baltimore, Maryland 21205, USA
| | - Anita Göndör
- Department of Microbiology, Tumour and Cell Biology, Nobels väg 16, Karolinska Institutet, S-171 77 Stockholm, Sweden
| |
Collapse
|