1
Lazzardi S, Valle F, Mazzolini A, Scialdone A, Caselle M, Osella M. Emergent statistical laws in single-cell transcriptomic data. Phys Rev E 2023; 107:044403. [PMID: 37198814 DOI: 10.1103/physreve.107.044403] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/21/2022] [Accepted: 03/24/2023] [Indexed: 05/19/2023]
Abstract
Large-scale data on single-cell gene expression have the potential to unravel the specific transcriptional programs of different cell types. The structure of these expression datasets suggests a similarity with several other complex systems that can be analogously described through the statistics of their basic building blocks. Transcriptomes of single cells are collections of messenger RNA abundances transcribed from a common set of genes just as books are different collections of words from a shared vocabulary, genomes of different species are specific compositions of genes belonging to evolutionary families, and ecological niches can be described by their species abundances. Following this analogy, we identify several emergent statistical laws in single-cell transcriptomic data closely similar to regularities found in linguistics, ecology, or genomics. A simple mathematical framework can be used to analyze the relations between different laws and the possible mechanisms behind their ubiquity. Importantly, treatable statistical models can be useful tools in transcriptomics to disentangle the actual biological variability from general statistical effects present in most component systems and from the consequences of the sampling process inherent to the experimental technique.
Affiliation(s)
- Silvia Lazzardi
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Filippo Valle
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Andrea Mazzolini
  - Laboratoire de Physique de l'École Normale Supérieure (PSL University), CNRS, Sorbonne Université and Université de Paris, 75005 Paris, France
- Antonio Scialdone
  - Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Feodor-Lynen-Straße 21, 81377 München, Germany and Institute of Functional Epigenetics and Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
- Michele Caselle
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Matteo Osella
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
2
Roman S, Bertolotti F. A master equation for power laws. Royal Society Open Science 2022; 9:220531. [PMID: 36483760 PMCID: PMC9727680 DOI: 10.1098/rsos.220531] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/22/2022] [Accepted: 11/14/2022] [Indexed: 06/17/2023]
Abstract
We propose a new mechanism for generating power laws. Starting from a random walk, we first outline a simple derivation of the Fokker-Planck equation. By analogy, starting from a certain Markov chain, we derive a master equation for power laws that describes how the number of cascades changes over time (cascades are consecutive transitions that end when the initial state is reached). The partial differential equation has a closed form solution which gives an explicit dependence of the number of cascades on their size and on time. Furthermore, the power law solution has a natural cut-off, a feature often seen in empirical data. This is due to the finite size a cascade can have in a finite time horizon. The derivation of the equation provides a justification for an exponent equal to 2, which agrees well with several empirical distributions, including Richardson's Law on the size and frequency of deadly conflicts. Nevertheless, the equation can be solved for any exponent value. In addition, we propose an urn model where the number of consecutive ball extractions follows a power law. In all cases, the power law is manifest over the entire range of cascade sizes, as shown through log-log plots in the frequency and rank distributions.
Affiliation(s)
- Sabin Roman
  - Centre for the Study of Existential Risk, University of Cambridge, Cambridge, UK
  - Odyssean Institute, London, UK
3
Coccia M. Comparative Theories of the Evolution of Technology. Global Encyclopedia of Public Administration, Public Policy, and Governance 2022:2227-2234. [DOI: 10.1007/978-3-030-66252-3_3841] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/02/2023]
4
Aguilar-Valdez S, Morales JA, Paredes O. Unraveling the hCoV-19 Informational Architecture. Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference 2021; 2021:2392-2395. [PMID: 34891763 DOI: 10.1109/embc46164.2021.9630954] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
The hCoV-19 virus is continuously evolving into highly infectious and lethal variants. There is a latent risk that current vaccines will not be effective against these novel variants. This entails comprehending the genome-wide viral information to unveil mutagenic mechanisms of hCoV-19. To date, this virus is studied as a collection of non-related variants, making it challenging to forecast hotspots and their upcoming effects. In this work, we explore genome-wide information to disentangle informational mechanisms that lead to insights into viral mutagenicity. Towards this aim, we modeled informational compartments based on a topic-free-alignment workflow. These compartments illustrate that hCoV-19 has a complex informational architecture that addresses high-level virus phenomena, i.e., mutagenicity. This new framework represents the first step towards identifying the virus mutagenicity leading to the development of all-variants-effective vaccines.
5
Caetano-Anollés G. The Compressed Vocabulary of Microbial Life. Front Microbiol 2021; 12:655990. [PMID: 34305827 PMCID: PMC8292947 DOI: 10.3389/fmicb.2021.655990] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 01/19/2021] [Accepted: 04/27/2021] [Indexed: 12/22/2022] Open
Abstract
Communication is an undisputed central activity of life that requires an evolving molecular language. It conveys meaning through messages and vocabularies. Here, I explore the existence of a growing vocabulary in the molecules and molecular functions of the microbial world. There are clear correspondences between the lexicon, syntax, semantics, and pragmatics of language organization and the module, structure, function, and fitness paradigms of molecular biology. These correspondences are constrained by universal laws and engineering principles. Macromolecular structure, for example, follows quantitative linguistic patterns arising from statistical laws that are likely universal, including Zipf's law, a special case of the scale-free distribution; Heaps' law, describing the sublinear growth typical of economies of scale; and the Menzerath-Altmann law, which imposes size-dependent patterns of decreasing returns. Trade-off solutions between principles of economy, flexibility, and robustness define a "triangle of persistence" describing the impact of the environment on a biological system. The pragmatic landscape of the triangle interfaces with the syntax and semantics of molecular languages, which together with comparative and evolutionary genomic data can explain global patterns of diversification of cellular life. The vocabularies of proteins (proteomes) and functions (functionomes) revealed a significant universal lexical core supporting a universal common ancestor, an ancestral evolutionary link between Bacteria and Eukarya, and distinct reductive evolutionary strategies of language compression in Archaea and Bacteria. A "causal" word cloud strategy inspired by the dependency grammar paradigm used in catenae unfolded the evolution of lexical units associated with Gene Ontology terms at different levels of ontological abstraction. While Archaea holds the smallest, oldest, and most homogeneous vocabulary of all superkingdoms, Bacteria heterogeneously apportions a more complex vocabulary, and Eukarya pushes functional innovation through mechanisms of flexibility and robustness.
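As an illustration only, and not the analysis used in the article, the raw quantities behind two of the laws named in this abstract can be computed from any sequence of components (words, protein domains, transcripts). The sketch below assumes the data are already a flat list of tokens; the function names `rank_frequency` and `vocabulary_growth` and the toy sequence are hypothetical. Zipf's law concerns how the sorted counts decay with rank, and Heaps' law concerns how the number of distinct tokens grows with the amount of data read.

```python
from collections import Counter


def rank_frequency(tokens):
    """Counts of each distinct token, sorted from most to least frequent.

    Plotting these counts against their rank (1, 2, 3, ...) on log-log
    axes is the usual way to inspect a Zipf-like pattern.
    """
    counts = Counter(tokens)
    return sorted(counts.values(), reverse=True)


def vocabulary_growth(tokens):
    """Number of distinct tokens seen after reading each position.

    A sublinear rise of this curve with sequence length is the
    pattern described by Heaps' law.
    """
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth


# Toy sequence (hypothetical data, far too short to show any real law).
tokens = "a b a c a b d a".split()
freqs = rank_frequency(tokens)
growth = vocabulary_growth(tokens)
print(freqs)
print(growth)
```

Real datasets need many thousands of tokens before either curve is informative, and fitting an exponent reliably takes more care than a straight-line fit on log-log axes.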
Affiliation(s)
- Gustavo Caetano-Anollés
  - Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, and C. R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL, United States
6
Solvable Model for the Linear Separability of Structured Data. Entropy 2021; 23:e23030305. [PMID: 33806454 PMCID: PMC7999416 DOI: 10.3390/e23030305] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/02/2021] [Revised: 02/22/2021] [Accepted: 02/25/2021] [Indexed: 11/26/2022]
Abstract
Linear separability, a core concept in supervised machine learning, refers to whether the labels of a data set can be captured by the simplest possible machine: a linear classifier. In order to quantify linear separability beyond this single bit of information, one needs models of data structure that are parameterized by interpretable quantities and analytically tractable. Here, I address one class of models with these properties, and show how a combinatorial method allows for the computation, in a mean field approximation, of two useful descriptors of linear separability, one of which is closely related to the popular concept of storage capacity. I motivate the need for multiple metrics by quantifying linear separability in a simple synthetic data set with controlled correlations between the points and their labels, as well as in the benchmark data set MNIST, where the capacity alone paints an incomplete picture. The analytical results indicate a high degree of “universality”, or robustness with respect to the microscopic parameters controlling data structure.
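The "single bit of information" mentioned in this abstract, whether a labeled data set is linearly separable at all, can be probed with the classic perceptron rule. The sketch below is a generic illustration, not the combinatorial method of the article; the function name `perceptron_separates`, the epoch limit, and the toy data are assumptions. If the rule completes a full pass with no mistakes, a separating hyperplane has been found; if it runs out of epochs, the result is only inconclusive in general, although it can never succeed on non-separable labels such as XOR.

```python
def perceptron_separates(points, labels, max_epochs=1000):
    """Try to find a separating hyperplane with the perceptron rule.

    points: sequence of equal-length numeric tuples.
    labels: sequence of +1 / -1 values, one per point.
    Returns (weights, bias) if an epoch finishes with no mistakes,
    otherwise None (no separator found within max_epochs).
    """
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified or on the boundary
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            return weights, bias
    return None


# Toy data (hypothetical): the corners of the unit square.
corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
and_labels = [-1, -1, -1, 1]   # logical AND: linearly separable
xor_labels = [-1, 1, 1, -1]    # logical XOR: not linearly separable

and_result = perceptron_separates(corners, and_labels)
xor_result = perceptron_separates(corners, xor_labels)
print(and_result is not None, xor_result is None)
```

A yes/no answer of this kind is exactly what the abstract argues is too coarse, which is why it turns to descriptors such as the storage capacity.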
7
Valle F, Osella M, Caselle M. A Topic Modeling Analysis of TCGA Breast and Lung Cancer Transcriptomic Data. Cancers (Basel) 2020; 12:E3799. [PMID: 33339347 PMCID: PMC7766023 DOI: 10.3390/cancers12123799] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Received: 10/19/2020] [Revised: 12/07/2020] [Accepted: 12/11/2020] [Indexed: 01/18/2023] Open
Abstract
Topic modeling is a widely used technique to extract relevant information from large arrays of data. The problem of finding a topic structure in a dataset was recently recognized to be analogous to the community detection problem in network theory. Leveraging on this analogy, a new class of topic modeling strategies has been introduced to overcome some of the limitations of classical methods. This paper applies these recent ideas to TCGA transcriptomic data on breast and lung cancer. The established cancer subtype organization is well reconstructed in the inferred latent topic structure. Moreover, we identify specific topics that are enriched in genes known to play a role in the corresponding disease and are strongly related to the survival probability of patients. Finally, we show that a simple neural network classifier operating in the low dimensional topic space is able to predict with high accuracy the cancer subtype of a test expression sample.
Affiliation(s)
- Filippo Valle
  - Physics Department, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
8
Iacopini I, Di Bona G, Ubaldi E, Loreto V, Latora V. Interacting Discovery Processes on Complex Networks. Physical Review Letters 2020; 125:248301. [PMID: 33412072 DOI: 10.1103/physrevlett.125.248301] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 05/13/2020] [Revised: 10/22/2020] [Accepted: 11/05/2020] [Indexed: 06/12/2023]
Abstract
Innovation is the driving force of human progress. Recent urn models reproduce well the dynamics through which the discovery of a novelty may trigger further ones, in an expanding space of opportunities, but neglect the effects of social interactions. Here we focus on the mechanisms of collective exploration, and we propose a model in which many urns, representing different explorers, are coupled through the links of a social network and exploit opportunities coming from their contacts. We study different network structures showing, both analytically and numerically, that the pace of discovery of an explorer depends on its centrality in the social network. Our model sheds light on the role that social structures play in discovery processes.
Affiliation(s)
- Iacopo Iacopini
  - School of Mathematical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
  - Centre for Advanced Spatial Analysis, University College London, London W1T 4TJ, United Kingdom
  - The Alan Turing Institute, The British Library, London NW1 2DB, United Kingdom
- Gabriele Di Bona
  - School of Mathematical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
  - Scuola Superiore di Catania, Università di Catania, Via Valdisavoia 9, 95123 Catania, Italy
- Enrico Ubaldi
  - Sony Computer Science Laboratories, 6 Rue Amyot, 75005 Paris, France
- Vittorio Loreto
  - Sony Computer Science Laboratories, 6 Rue Amyot, 75005 Paris, France
  - Sapienza University of Rome, Physics Department, Piazzale Aldo Moro 5, 00185 Rome, Italy
  - Complexity Science Hub Vienna, A-1080 Vienna, Austria
- Vito Latora
  - School of Mathematical Sciences, Queen Mary University of London, London E1 4NS, United Kingdom
  - The Alan Turing Institute, The British Library, London NW1 2DB, United Kingdom
  - Complexity Science Hub Vienna, A-1080 Vienna, Austria
  - Dipartimento di Fisica ed Astronomia, Università di Catania and INFN, I-95123 Catania, Italy
9
Tovo A, Menzel P, Krogh A, Cosentino Lagomarsino M, Suweis S. Taxonomic classification method for metagenomics based on core protein families with Core-Kaiju. Nucleic Acids Res 2020; 48:e93. [PMID: 32633756 PMCID: PMC7498351 DOI: 10.1093/nar/gkaa568] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 01/14/2020] [Revised: 06/12/2020] [Accepted: 06/24/2020] [Indexed: 12/19/2022] Open
Abstract
Characterizing species diversity and composition of bacteria hosted by biota is revolutionizing our understanding of the role of symbiotic interactions in ecosystems. Determining microbiomes diversity implies the assignment of individual reads to taxa by comparison to reference databases. Although computational methods aimed at identifying the microbe(s) taxa are available, it is well known that inferences using different methods can vary widely depending on various biases. In this study, we first apply and compare different bioinformatics methods based on 16S ribosomal RNA gene and shotgun sequencing to three mock communities of bacteria, of which the compositions are known. We show that none of these methods can infer both the true number of taxa and their abundances. We thus propose a novel approach, named Core-Kaiju, which combines the power of shotgun metagenomics data with a more focused marker gene classification method similar to 16S, but based on emergent statistics of core protein domain families. We thus test the proposed method on various mock communities and we show that Core-Kaiju reliably predicts both number of taxa and abundances. Finally, we apply our method on human gut samples, showing how Core-Kaiju may give more accurate ecological characterization and a fresh view on real microbiomes.
Affiliation(s)
- Anna Tovo
  - Physics and Astronomy Department, LIPh Lab, University of Padova, Via Marzolo 8, 35131 Padova, Italy
  - Mathematics Department, University of Padova, via Trieste 63, 35121 Padova, Italy
- Peter Menzel
  - Labor Berlin Charité Vivantes GmbH, Sylter Str. 2, 13353 Berlin, Germany
- Anders Krogh
  - Department of Computer Science, University of Copenhagen, Universitetsparken 1, DK-2100 Copenhagen, Denmark
- Marco Cosentino Lagomarsino
  - IFOM, FIRC Institute of Molecular Oncology, Via Adamello 16, 20143 Milan, Italy
  - Physics Department, University of Milan, and I.N.F.N., Via Celoria 16, 20133 Milan, Italy
- Samir Suweis
  - Physics and Astronomy Department, LIPh Lab, University of Padova, Via Marzolo 8, 35131 Padova, Italy
  - Padova Neuroscience Center, University of Padova, Via Orus 2/B, 35131 Padova, Italy
10
Pastore M, Rotondo P, Erba V, Gherardi M. Statistical learning theory of structured data. Phys Rev E 2020; 102:032119. [PMID: 33075947 DOI: 10.1103/physreve.102.032119] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 05/24/2020] [Accepted: 08/13/2020] [Indexed: 11/07/2022]
Abstract
The traditional approach of statistical physics to supervised learning routinely assumes unrealistic generative models for the data: Usually inputs are independent random variables, uncorrelated with their labels. Only recently, statistical physicists started to explore more complex forms of data, such as equally labeled points lying on (possibly low-dimensional) object manifolds. Here we provide a bridge between this recently established research area and the framework of statistical learning theory, a branch of mathematics devoted to inference in machine learning. The overarching motivation is the inadequacy of the classic rigorous results in explaining the remarkable generalization properties of deep learning. We propose a way to integrate physical models of data into statistical learning theory and address, with both combinatorial and statistical mechanics methods, the computation of the Vapnik-Chervonenkis entropy, which counts the number of different binary classifications compatible with the loss class. As a proof of concept, we focus on kernel machines and on two simple realizations of data structure introduced in recent physics literature: k-dimensional simplexes with prescribed geometric relations and spherical manifolds (equivalent to margin classification). Entropy, contrary to what happens for unstructured data, is nonmonotonic in the sample size, in contrast with the rigorous bounds. Moreover, data structure induces a transition beyond the storage capacity, which we advocate as a proxy of the nonmonotonicity, and ultimately a cue of low generalization error. The identification of a synaptic volume vanishing at the transition allows a quantification of the impact of data structure within replica theory, applicable in cases where combinatorial methods are not available, as we demonstrate for margin learning.
Affiliation(s)
- Mauro Pastore
  - Dipartimento di Fisica, Università degli Studi di Milano and INFN, Via Celoria 16, I-20133 Milan, Italy
- Pietro Rotondo
  - Dipartimento di Fisica, Università degli Studi di Milano and INFN, Via Celoria 16, I-20133 Milan, Italy
- Vittorio Erba
  - Dipartimento di Fisica, Università degli Studi di Milano and INFN, Via Celoria 16, I-20133 Milan, Italy
- Marco Gherardi
  - Dipartimento di Fisica, Università degli Studi di Milano and INFN, Via Celoria 16, I-20133 Milan, Italy
11
Coccia M. Comparative Theories of the Evolution of Technology. Global Encyclopedia of Public Administration, Public Policy, and Governance 2019:1-8. [DOI: 10.1007/978-3-319-31816-5_3841-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 07/03/2019] [Accepted: 08/09/2019] [Indexed: 09/02/2023]