1
Piron A, Szymczak F, Papadopoulou T, Alvelos MI, Defrance M, Lenaerts T, Eizirik DL, Cnop M. RedRibbon: A new rank-rank hypergeometric overlap for gene and transcript expression signatures. Life Sci Alliance 2024; 7:e202302203. [PMID: 38081640 PMCID: PMC10709657 DOI: 10.26508/lsa.202302203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 11/28/2023] [Accepted: 11/29/2023] [Indexed: 12/18/2023]
Abstract
High-throughput omics technologies have generated a wealth of large protein, gene, and transcript datasets that have exacerbated the need for new methods to analyse and compare big datasets. Rank-rank hypergeometric overlap is an important threshold-free method to combine and visualize two ranked lists of P-values or fold-changes, usually from differential gene expression analyses. Here, we introduce a new rank-rank hypergeometric overlap-based method aimed at gene level and alternative splicing analyses at transcript or exon level, hitherto unreachable as transcript numbers are an order of magnitude larger than gene numbers. We tested the tool on synthetic and real datasets at gene and transcript levels to detect correlation and anticorrelation patterns and found it to be fast and accurate, even on very large datasets thanks to an evolutionary algorithm-based minimal P-value search. The tool comes with a ready-to-use permutation scheme allowing the computation of adjusted P-values at low time cost. The package compatibility mode is a drop-in replacement to previous packages. RedRibbon holds the promise to accurately extricate detailed information from large comparative analyses.
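The threshold-free idea behind rank-rank hypergeometric overlap can be illustrated with a minimal Python sketch. It assumes two rankings of the same gene set and scans every pair of top-i/top-j cutoffs exhaustively, which is the cost that the evolutionary minimal P-value search described above is meant to avoid. The function names `hypergeom_sf` and `rrho_min_pvalue` are hypothetical and are not the RedRibbon API; only the enrichment (one-sided) direction is tested here.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """Upper tail P(X >= k) of a hypergeometric distribution (exact, integer arithmetic)."""
    total = comb(population, draws)
    upper = min(successes, draws)
    tail = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(k, upper + 1)
    )
    return tail / total


def rrho_min_pvalue(rank_a, rank_b, step=1):
    """Exhaustive scan of all (i, j) cutoffs; returns (min p-value, i, j).

    rank_a and rank_b are assumed to hold the same gene identifiers,
    each ordered from most to least significant.
    """
    n = len(rank_a)
    best = (1.0, 0, 0)
    for i in range(step, n + 1, step):
        top_a = set(rank_a[:i])
        for j in range(step, n + 1, step):
            # Number of genes shared by the top i of list A and the top j of list B.
            overlap = len(top_a.intersection(rank_b[:j]))
            p = hypergeom_sf(overlap, n, i, j)
            if p < best[0]:
                best = (p, i, j)
    return best
```

The scan is quadratic in the list length, and each cell costs a hypergeometric tail sum, so this brute-force form only suits short lists; it is shown to make the statistic concrete, not as a practical implementation.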
Affiliation(s)
- Anthony Piron
  - https://ror.org/01r9htc13 ULB Center for Diabetes Research, Medical Faculty, Université Libre de Bruxelles, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels (IB2), Brussels, Belgium
  - https://ror.org/01r9htc13 Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium
- Florian Szymczak
  - https://ror.org/01r9htc13 ULB Center for Diabetes Research, Medical Faculty, Université Libre de Bruxelles, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels (IB2), Brussels, Belgium
- Theodora Papadopoulou
  - https://ror.org/01r9htc13 ULB Center for Diabetes Research, Medical Faculty, Université Libre de Bruxelles, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels (IB2), Brussels, Belgium
- Maria Inês Alvelos
  - https://ror.org/01r9htc13 ULB Center for Diabetes Research, Medical Faculty, Université Libre de Bruxelles, Brussels, Belgium
- Matthieu Defrance
  - Interuniversity Institute of Bioinformatics in Brussels (IB2), Brussels, Belgium
  - https://ror.org/01r9htc13 Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels (IB2), Brussels, Belgium
  - https://ror.org/01r9htc13 Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium
  - Artificial Intelligence Lab, Vrije Universiteit Brussel, Brussels, Belgium
- Décio L Eizirik
  - https://ror.org/01r9htc13 ULB Center for Diabetes Research, Medical Faculty, Université Libre de Bruxelles, Brussels, Belgium
- Miriam Cnop
  - https://ror.org/01r9htc13 ULB Center for Diabetes Research, Medical Faculty, Université Libre de Bruxelles, Brussels, Belgium
  - https://ror.org/01r9htc13 Division of Endocrinology, Erasmus Hospital, Université Libre de Bruxelles, Brussels, Belgium
2
Luo D, Zhang C, Fu L, Zhang Y, Hu YQ. A novel similarity score based on gene ranks to reveal genetic relationships among diseases. PeerJ 2021; 9:e10576. [PMID: 33505797 PMCID: PMC7796663 DOI: 10.7717/peerj.10576] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/16/2020] [Accepted: 11/24/2020] [Indexed: 12/12/2022]
Abstract
Knowledge of similarities among diseases can contribute to uncovering common genetic mechanisms. Several similarity measures based on ranked gene lists have been proposed in the literature. Noting that these may suffer from the need to choose a cutoff or from a heavy computational load, we propose a novel similarity score among diseases, SimSIP, based on gene ranks. Simulation studies under various scenarios demonstrate that SimSIP performs better than existing rank-based similarity measures. Application of SimSIP to gene expression data of 18 cancer types from The Cancer Genome Atlas shows that SimSIP is superior in clarifying the genetic relationships among diseases and tends to cluster histologically or anatomically related cancers together, which is analogous to the findings of pan-cancer studies. Moreover, SimSIP, with its simpler form and faster computation, is more robust to higher levels of noise than existing methods and provides a basis for future studies on genetic relationships among diseases. In addition, a measure, MAG, is developed to gauge the magnitude of association of an individual gene with diseases. Using MAG, the genes and biological processes significantly associated with colorectal cancer are detected.
Affiliation(s)
- Dongmei Luo
  - State Key Laboratory of Genetic Engineering, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China
  - Department of Information and Computing Science, School of Mathematics and Physics, Anhui University of Technology, Ma'anshan, Anhui Province, China
- Chengdong Zhang
  - Shanghai Public Health Clinical Center, Fudan University, Shanghai, China
- Liwan Fu
  - State Key Laboratory of Genetic Engineering, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China
- Yuening Zhang
  - SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China
- Yue-Qing Hu
  - State Key Laboratory of Genetic Engineering, Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China
  - Shanghai Center for Mathematical Sciences, Fudan University, Shanghai, China
3
Sánchez-Pla A, Salicrú M, Ocaña J. An equivalence approach to the integrative analysis of feature lists. BMC Bioinformatics 2019; 20:441. [PMID: 31455218 PMCID: PMC6712676 DOI: 10.1186/s12859-019-3008-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/10/2018] [Accepted: 07/29/2019] [Indexed: 11/24/2022]
Abstract
BACKGROUND: Although a few comparison methods based on the biological meaning of gene lists have been developed, the goProfiles approach is one of the few that are being used for that purpose. It consists of projecting lists of genes into predefined levels of the Gene Ontology, in such a way that a multinomial model can be used for estimation and testing. Of particular interest is the fact that it may be used for proving equivalence (in the sense of "enough similarity") between two lists, instead of proving differences between them, which seems conceptually better suited to the end goal of establishing similarity among gene lists. An equivalence method has been derived that uses a distance-based approach and the confidence interval inclusion principle. Equivalence is declared if the upper limit of a one-sided confidence interval for the distance between two profiles is below a pre-established equivalence limit. RESULTS: In this work, this method is extended to establish the equivalence of any number of gene lists. Additionally, an algorithm to obtain the smallest equivalence limit that would allow equivalence between two or more lists to be declared is presented. This algorithm is at the base of an iterative method of graphic visualization to represent the most to least equivalent gene lists. These methods deal adequately with the problem of adjusting for multiple testing. The applicability of these techniques is illustrated in two typical situations: (i) a collection of cancer-related gene lists, suggesting which of them are more reasonable to combine (as claimed by the authors), and (ii) a collection of pathogenesis-based transcript sets, showing which of these are more closely related. The methods developed are available in the goProfiles Bioconductor package. CONCLUSIONS: The method provides a simple yet powerful and statistically well-grounded way to classify a set of genes or other feature lists by establishing their equivalence at a given equivalence threshold. The classification results can be viewed using standard visualization methods. This may be applied to a variety of problems, from deciding whether a series of datasets generating the lists can be combined to the simplification of groups of lists.
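The confidence interval inclusion rule stated in the abstract can be sketched in a few lines of Python. This is only an illustration under a loud assumption: the distance estimate is treated as approximately normal with a known standard error, which the abstract does not specify. The function names are hypothetical and do not correspond to the goProfiles package.

```python
from statistics import NormalDist


def upper_confidence_limit(distance, std_error, confidence=0.95):
    """Upper limit of a one-sided confidence interval.

    Assumption (not from the source): the distance estimator is
    approximately normal with the given standard error.
    """
    z = NormalDist().inv_cdf(confidence)
    return distance + z * std_error


def is_equivalent(distance, std_error, equivalence_limit, confidence=0.95):
    """Declare equivalence when the upper limit falls below the equivalence limit."""
    return upper_confidence_limit(distance, std_error, confidence) < equivalence_limit
```

Under this reading, for a single pair of lists any equivalence limit larger than the value returned by `upper_confidence_limit` would lead to equivalence being declared, which gives some intuition for the "smallest equivalence limit" mentioned in the abstract; the multi-list case and the multiple-testing adjustment are not covered by this sketch.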
Affiliation(s)
- Alex Sánchez-Pla
  - Genetics, Microbiology and Statistics Department, Universitat de Barcelona, Avinguda Diagonal, 648, Barcelona, 08028 Spain
- Miquel Salicrú
  - Genetics, Microbiology and Statistics Department, Universitat de Barcelona, Avinguda Diagonal, 648, Barcelona, 08028 Spain
- Jordi Ocaña
  - Genetics, Microbiology and Statistics Department, Universitat de Barcelona, Avinguda Diagonal, 648, Barcelona, 08028 Spain
4
Donald MR, Wilson SR. Comparison and visualisation of agreement for paired lists of rankings. Stat Appl Genet Mol Biol 2017; 16:31-45. [PMID: 28284040 DOI: 10.1515/sagmb-2016-0036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Output from analysis of a high-throughput 'omics' experiment very often is a ranked list. One commonly encountered example is a ranked list of differentially expressed genes from a gene expression experiment, with a length of many hundreds of genes. There are numerous situations where interest is in the comparison of outputs following, say, two (or more) different experiments, or of different approaches to the analysis that produce different ranked lists. Rather than considering exact agreement between the rankings, following others, we consider two ranked lists to be in agreement if the rankings differ by some fixed distance. Generally only a relatively small subset of the k top-ranked items will be in agreement. So the aim is to find the point k at which the probability of agreement in rankings changes from being greater than 0.5 to being less than 0.5. We use penalized splines and a Bayesian logit model, to give a nonparametric smooth to the sequence of agreements, as well as pointwise credible intervals for the probability of agreement. Our approach produces a point estimate and a credible interval for k. R code is provided. The method is applied to rankings of genes from breast cancer microarray experiments.
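The notion of agreement "within some fixed distance" and the search for the change point k can be made concrete with a rough Python sketch. It assumes both lists rank the same items, and it replaces the penalized-spline Bayesian logit smoother described above with a plain moving average, so it yields only a crude point estimate and no credible interval. The names `agreement_sequence` and `estimate_k` are hypothetical and are not taken from the authors' R code.

```python
def agreement_sequence(rank_a, rank_b, max_distance):
    """1 where an item's positions in the two lists differ by at most max_distance, else 0.

    The sequence is ordered by position in rank_a; both lists are assumed
    to contain the same items.
    """
    position_in_b = {item: pos for pos, item in enumerate(rank_b)}
    return [
        1 if abs(pos - position_in_b[item]) <= max_distance else 0
        for pos, item in enumerate(rank_a)
    ]


def estimate_k(agreements, window=5):
    """First window start at which the moving-average agreement drops below 0.5.

    A moving average is a stand-in (assumption) for the spline-based
    smoother in the paper. Returns len(agreements) if no such window exists.
    """
    for start in range(len(agreements) - window + 1):
        proportion = sum(agreements[start:start + window]) / window
        if proportion < 0.5:
            return start
    return len(agreements)
```

The choice of `window` plays the role of the smoothing parameter: wider windows give a more stable but more delayed estimate of where agreement breaks down.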
5
Serra F, Romualdi C, Fogolari F. Similarity Measures Based on the Overlap of Ranked Genes Are Effective for Comparison and Classification of Microarray Data. J Comput Biol 2016; 23:603-14. [PMID: 27104372 DOI: 10.1089/cmb.2015.0057] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]
Abstract
Similarity (or conversely distance) measures are at the heart of most bioinformatic applications. When the similarity involves only a small subset of features out of many, global similarity measures may be significantly affected by noise. Selecting only a subset of (putatively relevant) features for comparison is a widespread solution to the problem, albeit one affected by arbitrariness and manual intervention. The problem is becoming more and more important due to the increasing amount of experimental data available. In recent years, measures based on ranking similarities between two datasets have been proposed. Here, we use one of the proposed rank similarity measures, sharing some aspects with the fraction enrichment score used for protein structure prediction and the gene set enrichment analysis, and test its performance in classifying experiments. The discrimination ability of the similarity measures based on the overlap of ranked genes tested here compares well or better with standard measures of similarity. This conclusion supports the use of rank-based proximity measures to gain further insight in dataset comparisons, particularly on expression data obtained by different technologies (e.g., RNA-seq and microarrays).
Affiliation(s)
- Fabrizio Serra
  - Department of Biomedical Sciences and Technologies, University of Udine, Udine, Italy
- Chiara Romualdi
  - Department of Biology, University of Padova, Padova, Italy
- Federico Fogolari
  - Department of Biomedical Sciences and Technologies, University of Udine, Udine, Italy
  - Istituto Nazionale Biostrutture e Biosistemi, Roma, Italy
6
Chen Q, Zhou XJ, Sun F. Finding genetic overlaps among diseases based on ranked gene lists. J Comput Biol 2015; 22:111-23. [PMID: 25684200 DOI: 10.1089/cmb.2014.0149] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 10/24/2022]
Abstract
To understand disease relationships in terms of their genetic mechanisms, it is important to study the common genetic basis among different diseases. Although discoveries on pleiotropic genes related to multiple diseases abound, methods flexibly applicable to various types of datasets generated from different studies or experiments are needed to gain big pictures on the genetic relationships among a large number of diseases. We develop a set of genetic similarity measures to gauge the genetic overlap between diseases, as well as several estimators of the number of overlapping disease genes between diseases. These methods are based on ranked gene lists so that they could be flexibly applied to different types of data. We first investigate the performance of the genetic similarity measure for evaluating the similarity between human diseases in simulation studies. Then we apply the method to diseases in the OMIM database. We show that our proposed genetic measure achieves superior performance in explaining phenotype similarities between diseases compared to simpler methods. Furthermore, we identified common genes underlying the genetic overlap between disease pairs. With an example of five vision-related diseases, we demonstrate how our methods can provide insights into the relationships among diseases based on their shared genetic mechanisms.
Affiliation(s)
- Quan Chen
  - Molecular and Computational Biology Program, University of Southern California, Los Angeles, California