1
Valle F, Caselle M, Osella M. Exploring the latent space of transcriptomic data with topic modeling. NAR Genom Bioinform 2025; 7:lqaf049. [PMID: 40264683 PMCID: PMC12012681 DOI: 10.1093/nargab/lqaf049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2024] [Revised: 04/03/2025] [Accepted: 04/11/2025] [Indexed: 04/24/2025] Open
Abstract
The availability of high-dimensional transcriptomic datasets is increasing at a tremendous pace, together with the need for suitable computational tools. Clustering and dimensionality reduction are popular go-to methods for identifying basic structures in these datasets. At the same time, different topic modeling techniques have been developed to organize the deluge of available natural language data using their latent topical structure. This paper leverages the statistical analogies between text and transcriptomic datasets to compare different topic modeling methods when applied to gene expression data. Specifically, we test their accuracy in discovering and reconstructing the tissue structure of the human transcriptome and in distinguishing healthy from cancerous tissues. We examine the properties of the latent space recovered by different methods and highlight their differences and their pros and cons across different tasks. We focus in particular on how different statistical priors can affect the results and their interpretability. Finally, we show that the latent topic space can be a useful low-dimensional embedding space, where a basic neural network classifier can annotate transcriptomic profiles with high accuracy.
Affiliation(s)
- Filippo Valle
  - Physics Department, University of Turin and INFN, Via Pietro Giuria 1, 10125 Torino, Italy
- Michele Caselle
  - Physics Department, University of Turin and INFN, Via Pietro Giuria 1, 10125 Torino, Italy
- Matteo Osella
  - Physics Department, University of Turin and INFN, Via Pietro Giuria 1, 10125 Torino, Italy
2
Pizzini L, Valle F, Osella M, Caselle M. Topic modeling analysis of the Allen Human Brain Atlas. Sci Rep 2025; 15:6928. [PMID: 40011617 DOI: 10.1038/s41598-025-91079-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2024] [Accepted: 02/18/2025] [Indexed: 02/28/2025] Open
Abstract
The human brain is a complex interconnected structure controlling all elementary and high-level cognitive tasks. It is composed of many regions that exhibit specific distributions of cell types and distinct patterns of functional connections. This complexity is rooted in differential transcription. The constituent cell types of different brain regions express distinctive combinations of genes as they develop and mature, ultimately shaping their functional state in adulthood. How precisely the genetic information of anatomical structures is connected to their underlying biological functions remains an open question in modern neuroscience. A major challenge is the identification of "universal patterns", which do not depend on the particular individual, but are instead basic structural properties shared by all brains. Despite the vast amount of gene expression data available at both the bulk and single-cell levels, this task remains challenging, mainly due to the lack of suitable data mining tools. In this paper, we propose an approach to address this issue based on a hierarchical version of Stochastic Block Modeling. Thanks to its specific choice of priors, the method is particularly effective in identifying these universal features. We test our algorithm on a dataset obtained from six independent human brains from the Allen Human Brain Atlas. We show that the proposed method is indeed able to identify universal patterns much better than more traditional algorithms such as Latent Dirichlet Allocation or Weighted Correlation Network Analysis. The probabilistic association between genes and samples that we find represents the known anatomical and functional brain organization well. Moreover, leveraging the peculiar "fuzzy" structure of the gene sets obtained with our method, we identify examples of transcriptional and post-transcriptional pathways associated with specific brain regions, highlighting the potential of our approach.
Affiliation(s)
- Letizia Pizzini
  - Department of Physics and INFN, University of Turin, via P.Giuria 1, 10125, Turin, Italy
- Filippo Valle
  - Department of Physics and INFN, University of Turin, via P.Giuria 1, 10125, Turin, Italy
- Matteo Osella
  - Department of Physics and INFN, University of Turin, via P.Giuria 1, 10125, Turin, Italy
- Michele Caselle
  - Department of Physics and INFN, University of Turin, via P.Giuria 1, 10125, Turin, Italy
3
Mangold L, Roth C. Quantifying metadata relevance to network block structure using description length. Communications Physics 2024; 7:331. [PMID: 39398491 PMCID: PMC11469959 DOI: 10.1038/s42005-024-01819-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Accepted: 09/30/2024] [Indexed: 10/15/2024]
Abstract
Network analysis is often enriched by including an examination of node metadata. In the context of understanding the mesoscale of networks it is often assumed that node groups based on metadata and node groups based on connectivity patterns are intrinsically linked. This assumption is increasingly being challenged, whereby metadata might be entirely unrelated to structure or, similarly, multiple sets of metadata might be relevant to the structure of a network in different ways. We propose the metablox tool to quantify the relationship between a network's node metadata and its mesoscale structure, measuring the strength of the relationship and the type of structural arrangement exhibited by the metadata. We show on a number of synthetic and empirical networks that our tool distinguishes relevant metadata and allows for this in a comparative setting, demonstrating that it can be used as part of systematic meta analyses for the comparison of networks from different domains.
Affiliation(s)
- Lena Mangold
  - Centre d’Analyse et de Mathématique Sociales (CNRS/EHESS), 54 Bd Raspail, 75006 Paris, France
  - Computational Social Science Team, Centre Marc Bloch (CNRS/MEAE), Friedrichstr. 191, 10117 Berlin, Germany
- Camille Roth
  - Centre d’Analyse et de Mathématique Sociales (CNRS/EHESS), 54 Bd Raspail, 75006 Paris, France
  - Computational Social Science Team, Centre Marc Bloch (CNRS/MEAE), Friedrichstr. 191, 10117 Berlin, Germany
4
Kritschgau J, Kaiser D, Alvarado Rodriguez O, Amburg I, Bolkema J, Grubb T, Lan F, Maleki S, Chodrow P, Kay B. Community detection in hypergraphs via mutual information maximization. Sci Rep 2024; 14:6933. [PMID: 38521798 PMCID: PMC10960844 DOI: 10.1038/s41598-024-55934-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2023] [Accepted: 02/29/2024] [Indexed: 03/25/2024] Open
Abstract
The hypergraph community detection problem seeks to identify groups of related vertices in hypergraph data. We propose an information-theoretic hypergraph community detection algorithm which compresses the observed data in terms of community labels and community-edge intersections. This algorithm can also be viewed as maximum-likelihood inference in a degree-corrected microcanonical stochastic blockmodel. We perform the compression/inference step via simulated annealing. Unlike several recent algorithms based on canonical models, our microcanonical algorithm does not require inference of statistical parameters such as vertex degrees or pairwise group connection rates. Through synthetic experiments, we find that our algorithm succeeds down to recently-conjectured thresholds for sparse random hypergraphs. We also find competitive performance in cluster recovery tasks on several hypergraph data sets.
Affiliation(s)
- Jürgen Kritschgau
  - Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA, 15213, USA
- Daniel Kaiser
  - Department of Informatics, Indiana University, Bloomington, IN, 47408, USA
- Ilya Amburg
  - Pacific Northwest National Laboratory, Richland, WA, 99354, USA
- Jessalyn Bolkema
  - Department of Mathematics, California State University, Dominguez Hills, Carson, CA, 90747, USA
- Thomas Grubb
  - University of California San Diego, San Diego, CA, 92093, USA
- Fangfei Lan
  - Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, 84112, USA
- Sepideh Maleki
  - Department of Computer Science, University of Texas at Austin, Austin, TX, 78712, USA
- Phil Chodrow
  - Department of Computer Science, Middlebury College, Middlebury, VT, 05753, USA
- Bill Kay
  - Pacific Northwest National Laboratory, Richland, WA, 99354, USA
5
Ruffle JK, Mohinta S, Pombo G, Gray R, Kopanitsa V, Lee F, Brandner S, Hyare H, Nachev P. Brain tumour genetic network signatures of survival. Brain 2023; 146:4736-4754. [PMID: 37665980 PMCID: PMC10629773 DOI: 10.1093/brain/awad199] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/21/2023] [Revised: 05/12/2023] [Accepted: 05/30/2023] [Indexed: 09/06/2023] Open
Abstract
Tumour heterogeneity is increasingly recognized as a major obstacle to therapeutic success across neuro-oncology. Gliomas are characterized by distinct combinations of genetic and epigenetic alterations, resulting in complex interactions across multiple molecular pathways. Predicting disease evolution and prescribing individually optimal treatment requires statistical models complex enough to capture the intricate (epi)genetic structure underpinning oncogenesis. Here, we formalize this task as the inference of distinct patterns of connectivity within hierarchical latent representations of genetic networks. Evaluating multi-institutional clinical, genetic and outcome data from 4023 glioma patients over 14 years, across 12 countries, we employ Bayesian generative stochastic block modelling to reveal a hierarchical network structure of tumour genetics spanning molecularly confirmed glioblastoma, IDH-wildtype; oligodendroglioma, IDH-mutant and 1p/19q codeleted; and astrocytoma, IDH-mutant. Our findings illuminate the complex dependence between features across the genetic landscape of brain tumours and show that generative network models reveal distinct signatures of survival with better prognostic fidelity than current gold standard diagnostic categories.
Affiliation(s)
- James K Ruffle
  - Queen Square Institute of Neurology, University College London, London WC1N 3BG, UK
- Samia Mohinta
  - Queen Square Institute of Neurology, University College London, London WC1N 3BG, UK
- Guilherme Pombo
  - Queen Square Institute of Neurology, University College London, London WC1N 3BG, UK
- Robert Gray
  - Queen Square Institute of Neurology, University College London, London WC1N 3BG, UK
- Valeriya Kopanitsa
  - Queen Square Institute of Neurology, University College London, London WC1N 3BG, UK
- Faith Lee
  - Queen Square Institute of Neurology, University College London, London WC1N 3BG, UK
- Sebastian Brandner
  - Division of Neuropathology and Department of Neurodegenerative Disease, Queen Square Institute of Neurology, University College London, London WC1N 3BG, UK
- Harpreet Hyare
  - Queen Square Institute of Neurology, University College London, London WC1N 3BG, UK
- Parashkev Nachev
  - Queen Square Institute of Neurology, University College London, London WC1N 3BG, UK
6
Peixoto TP. Ordered community detection in directed networks. Phys Rev E 2022; 106:024305. [PMID: 36109944 DOI: 10.1103/physreve.106.024305] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 04/13/2022] [Accepted: 08/02/2022] [Indexed: 06/15/2023]
Abstract
We develop a method to infer community structure in directed networks where the groups are ordered in a latent one-dimensional hierarchy that determines the preferred edge direction. Our nonparametric Bayesian approach is based on a modification of the stochastic block model (SBM), which can take advantage of rank alignment and coherence to produce parsimonious descriptions of networks that combine ordered hierarchies with arbitrary mixing patterns between groups. Since our model also includes directed degree correction, we can use it to distinguish nonlocal hierarchical structure from local in- and out-degree imbalance, thus removing a source of conflation present in most ranking methods. We also demonstrate how we can reliably compare with the results obtained with the unordered SBM variant to determine whether a hierarchical ordering is statistically warranted in the first place. We illustrate the application of our method on a wide variety of empirical networks across several domains.
Affiliation(s)
- Tiago P Peixoto
  - Department of Network and Data Science, Central European University, 1100 Vienna, Austria
7
Vaca-Ramírez F, Peixoto TP. Systematic assessment of the quality of fit of the stochastic block model for empirical networks. Phys Rev E 2022; 105:054311. [PMID: 35706168 DOI: 10.1103/physreve.105.054311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/06/2022] [Accepted: 04/19/2022] [Indexed: 06/15/2023]
Abstract
We perform a systematic analysis of the quality of fit of the stochastic block model (SBM) for 275 empirical networks spanning a wide range of domains and orders of magnitude in size. We employ posterior predictive model checking as a criterion to assess the quality of fit, which involves comparing networks generated by the inferred model with the empirical network, according to a set of network descriptors. We observe that the SBM is capable of providing an accurate description for the majority of networks considered, but falls short of saturating all modeling requirements. In particular, networks possessing a large diameter and slow-mixing random walks tend to be badly described by the SBM. However, contrary to what is often assumed, networks with a high abundance of triangles can be well described by the SBM in many cases. We demonstrate that simple network descriptors can be used to evaluate whether or not the SBM can provide a sufficiently accurate representation, potentially pointing to possible model extensions that can systematically improve the expressiveness of this class of models.
Affiliation(s)
- Felipe Vaca-Ramírez
  - Department of Network and Data Science, Central European University, 1100 Vienna, Austria
- Tiago P Peixoto
  - Department of Network and Data Science, Central European University, 1100 Vienna, Austria
8
Valle F, Osella M, Caselle M. Multiomics Topic Modeling for Breast Cancer Classification. Cancers (Basel) 2022; 14:1150. [PMID: 35267458 PMCID: PMC8909787 DOI: 10.3390/cancers14051150] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 02/11/2022] [Accepted: 02/18/2022] [Indexed: 12/04/2022] Open
Abstract
The integration of transcriptional data with other layers of information, such as the post-transcriptional regulation mediated by microRNAs, can be crucial to identify the driver genes and the subtypes of complex and heterogeneous diseases such as cancer. This paper presents an approach based on topic modeling to accomplish this integration task. More specifically, we show how an algorithm based on a hierarchical version of stochastic block modeling can be naturally extended to integrate any combination of 'omics data. We test this approach on breast cancer samples from the TCGA database, integrating data on messenger RNA, microRNAs, and copy number variations. We show that the inclusion of the microRNA layer significantly improves the accuracy of subtype classification. Moreover, some of the hidden structures or "topics" that the algorithm extracts correspond to genes and microRNAs involved in breast cancer development and are associated with survival probability.
Affiliation(s)
- Filippo Valle
  - Physics Department, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy; (M.O.); (M.C.)
9
Wang J, Li K. Community structure exploration considering latent link patterns in complex networks. Neurocomputing 2021. [DOI: 10.1016/j.neucom.2021.06.032] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/28/2022]
10
Chodrow PS, Veldt N, Benson AR. Generative hypergraph clustering: From blockmodels to modularity. Science Advances 2021; 7:eabh1303. [PMID: 34233880 PMCID: PMC11559555 DOI: 10.1126/sciadv.abh1303] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 02/17/2021] [Accepted: 05/24/2021] [Indexed: 06/13/2023]
Abstract
Hypergraphs are a natural modeling paradigm for networked systems with multiway interactions. A standard task in network analysis is the identification of closely related or densely interconnected nodes. We propose a probabilistic generative model of clustered hypergraphs with heterogeneous node degrees and edge sizes. Approximate maximum likelihood inference in this model leads to a clustering objective that generalizes the popular modularity objective for graphs. From this, we derive an inference algorithm that generalizes the Louvain graph community detection method, and a faster, specialized variant in which edges are expected to lie fully within clusters. Using synthetic and empirical data, we demonstrate that the specialized method is highly scalable and can detect clusters where graph-based methods fail. We also use our model to find interpretable higher-order structure in school contact networks, U.S. congressional bill cosponsorship and committees, product categories in copurchasing behavior, and hotel locations from web browsing sessions.
Affiliation(s)
- Philip S Chodrow
  - Department of Mathematics, University of California, Los Angeles, 520 Portola Plaza, Los Angeles, CA 90095, USA
- Nate Veldt
  - Center for Applied Mathematics, Cornell University, 657 Frank H.T. Rhodes Hall, Ithaca, NY 14853, USA
- Austin R Benson
  - Department of Computer Science, Cornell University, 413B Gates Hall, Ithaca, NY 14853, USA
11