51
Baum K, Rajapakse JC, Azuaje F. Analysis of correlation-based biomolecular networks from different omics data by fitting stochastic block models. F1000Res 2019; 8:465. [PMID: 31559017 PMCID: PMC6743255 DOI: 10.12688/f1000research.18705.2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Accepted: 08/14/2019] [Indexed: 12/18/2022] Open
Abstract
Background: Biological entities such as genes, promoters, mRNA, metabolites or proteins do not act alone, but in concert in their network context. Modules, i.e., groups of nodes with similar topological properties in these networks characterize important biological functions of the underlying biomolecular system. Edges in such molecular networks represent regulatory and physical interactions, and comparing them between conditions provides valuable information on differential molecular mechanisms. However, biological data is inherently noisy and network reduction techniques can propagate errors particularly to the level of edges. We aim to improve the analysis of networks of biological molecules by deriving modules together with edge relevance estimations that are based on global network characteristics. Methods: The key challenge we address here is investigating the capability of stochastic block models (SBMs) for representing and analyzing different types of biomolecular networks. Fitting them to SBMs both delivers modules of the networks and enables the derivation of edge confidence scores, and it has not yet been investigated for analyzing biomolecular networks. We apply SBM-based analysis independently to three correlation-based networks of breast cancer data originating from high-throughput measurements of different molecular layers: either transcriptomics, proteomics, or metabolomics. The networks were reduced by thresholding for correlation significance or by requirements on scale-freeness. Results and discussion: We find that the networks are best represented by the hierarchical version of the SBM, and many of the predicted blocks have a biologically and phenotypically relevant functional annotation. The edge confidence scores are overall in concordance with the biological evidence given by the measurements. We conclude that biomolecular networks can be appropriately represented and analyzed by fitting SBMs. 
As the SBM-derived edge confidence scores are based on global network connectivity characteristics and potential hierarchies within the biomolecular networks are considered, they could be used as additional, integrated features in network-based data comparisons.
Affiliation(s)
- Katharina Baum
- Bioinformatics and Modelling, Luxembourg Institute of Health, Strassen, Luxembourg
- Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Jagath C. Rajapakse
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Francisco Azuaje
- Bioinformatics and Modelling, Luxembourg Institute of Health, Strassen, Luxembourg
52
Baum K, Rajapakse JC, Azuaje F. Analysis of correlation-based biomolecular networks from different omics data by fitting stochastic block models. F1000Res 2019; 8:465. [PMID: 31559017 PMCID: PMC6743255 DOI: 10.12688/f1000research.18705.1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Accepted: 04/03/2019] [Indexed: 10/15/2023] Open
Abstract
Background: Biological entities such as genes, promoters, mRNA, metabolites or proteins do not act alone, but in concert in their network context. Modules, i.e., groups of nodes with similar topological properties in these networks characterize important biological functions of the underlying biomolecular system. Edges in such molecular networks represent regulatory and physical interactions, and comparing them between conditions provides valuable information on differential molecular mechanisms. However, biological data is inherently noisy and network reduction techniques can propagate errors particularly to the level of edges. We aim to improve the analysis of networks of biological molecules by deriving modules together with edge relevance estimations that are based on global network characteristics. Methods: We propose to fit the networks to stochastic block models (SBM), a method that has not yet been investigated for the analysis of biomolecular networks. This procedure both delivers modules of the networks and enables the derivation of edge confidence scores. We apply it to correlation-based networks of breast cancer data originating from high-throughput measurements of diverse molecular layers such as transcriptomics, proteomics, and metabolomics. The networks were reduced by thresholding for correlation significance or by requirements on scale-freeness. Results and discussion: We find that the networks are best represented by the hierarchical version of the SBM, and many of the predicted blocks have a biological meaning according to functional annotation. The edge confidence scores are overall in concordance with the biological evidence given by the measurements. As they are based on global network connectivity characteristics and potential hierarchies within the biomolecular networks are taken into account, they could be used as additional, integrated features in network-based data comparisons. 
Their tight relationship to edge existence probabilities can be exploited to predict missing or spurious edges in order to improve the network representation of the underlying biological system.
Affiliation(s)
- Katharina Baum
- Bioinformatics and Modelling, Luxembourg Institute of Health, Strassen, Luxembourg
- Mathematical Modelling of Cellular Processes, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Jagath C. Rajapakse
- School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Francisco Azuaje
- Bioinformatics and Modelling, Luxembourg Institute of Health, Strassen, Luxembourg
53
Abstract
Community detection is a commonly used technique for identifying groups in a network based on similarities in connectivity patterns. To facilitate community detection in large networks, we recast the network as a smaller network of ‘super nodes’, where each super node comprises one or more nodes of the original network. We can then use this super node representation as the input into standard community detection algorithms. To define the seeds, or centers, of our super nodes, we apply the ‘CoreHD’ ranking, a technique applied in network dismantling and decycling problems. We test our approach through the analysis of two common methods for community detection: modularity maximization with the Louvain algorithm and maximum likelihood optimization for fitting a stochastic block model. Our results highlight that applying community detection to the compressed network of super nodes is significantly faster while successfully producing partitions that are more aligned with the local network connectivity and more stable across multiple (stochastic) runs within and between community detection algorithms, yet still overlap well with the results obtained using the full network.
54
Kawamoto T, Kabashima Y. Comparative analysis on the selection of number of clusters in community detection. Phys Rev E 2018; 97:022315. [PMID: 29548181 DOI: 10.1103/physreve.97.022315] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 06/12/2017] [Indexed: 11/07/2022]
Abstract
We conduct a comparative analysis on various estimates of the number of clusters in community detection. An exhaustive comparison requires testing of all possible combinations of frameworks, algorithms, and assessment criteria. In this paper we focus on the framework based on a stochastic block model, and investigate the performance of greedy algorithms, statistical inference, and spectral methods. For the assessment criteria, we consider modularity, map equation, Bethe free energy, prediction errors, and isolated eigenvalues. From the analysis, the tendency of overfit and underfit that the assessment criteria and algorithms have becomes apparent. In addition, we propose that the alluvial diagram is a suitable tool to visualize statistical inference results and can be useful to determine the number of clusters.
Affiliation(s)
- Tatsuro Kawamoto
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, 2-3-26 Aomi, Koto-ku, Tokyo, Japan
- Yoshiyuki Kabashima
- Department of Mathematical and Computing Science, Tokyo Institute of Technology, W8-45, 2-12-1 Ookayama, Meguro-ku, Tokyo, Japan
55
Abstract
We present a Bayesian formulation of weighted stochastic block models that can be used to infer the large-scale modular structure of weighted networks, including their hierarchical organization. Our method is nonparametric, and thus does not require the prior knowledge of the number of groups or other dimensions of the model, which are instead inferred from data. We give a comprehensive treatment of different kinds of edge weights (i.e., continuous or discrete, signed or unsigned, bounded or unbounded), as well as arbitrary weight transformations, and describe an unsupervised model selection approach to choose the best network description. We illustrate the application of our method to a variety of empirical weighted networks, such as global migrations, voting patterns in congress, and neural connections in the human brain.
Affiliation(s)
- Tiago P Peixoto
- Department of Mathematical Sciences and Centre for Networks and Collective Behaviour, University of Bath, Claverton Down, Bath BA2 7AY, United Kingdom and ISI Foundation, Via Alassio 11/c, 10126 Torino, Italy
56
Aguilar‐Rodríguez J, Peel L, Stella M, Wagner A, Payne JL. The architecture of an empirical genotype-phenotype map. Evolution 2018; 72:1242-1260. [PMID: 29676774 PMCID: PMC6055911 DOI: 10.1111/evo.13487] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 05/16/2017] [Accepted: 04/03/2018] [Indexed: 12/15/2022]
Abstract
Recent advances in high-throughput technologies are bringing the study of empirical genotype-phenotype (GP) maps to the fore. Here, we use data from protein-binding microarrays to study an empirical GP map of transcription factor (TF) -binding preferences. In this map, each genotype is a DNA sequence. The phenotype of this DNA sequence is its ability to bind one or more TFs. We study this GP map using genotype networks, in which nodes represent genotypes with the same phenotype, and edges connect nodes if their genotypes differ by a single small mutation. We describe the structure and arrangement of genotype networks within the space of all possible binding sites for 525 TFs from three eukaryotic species encompassing three kingdoms of life (animal, plant, and fungi). We thus provide a high-resolution depiction of the architecture of an empirical GP map. Among a number of findings, we show that these genotype networks are "small-world" and assortative, and that they ubiquitously overlap and interface with one another. We also use polymorphism data from Arabidopsis thaliana to show how genotype network structure influences the evolution of TF-binding sites in vivo. We discuss our findings in the context of regulatory evolution.
Affiliation(s)
- José Aguilar‐Rodríguez
- Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Current Address: Department of Biology, Stanford University, Stanford, CA, USA; Department of Chemical and Systems Biology, Stanford University, Stanford, CA, USA
- Leto Peel
- Institute of Information and Communication Technologies, Electronics and Applied Mathematics, Université Catholique de Louvain, Louvain‐la‐Neuve, Belgium
- Namur Center for Complex Systems, University of Namur, Namur, Belgium
- Massimo Stella
- Institute for Complex Systems Simulation, Department of Electronics and Computer Science, University of Southampton, Southampton, United Kingdom
- Andreas Wagner
- Department of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- The Santa Fe Institute, Santa Fe, New Mexico, USA
- Joshua L. Payne
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Institute for Integrative Biology, ETH, Zurich, Switzerland
57
Zhuo Z, Cai SM, Tang M, Lai YC. Accurate detection of hierarchical communities in complex networks based on nonlinear dynamical evolution. CHAOS (WOODBURY, N.Y.) 2018; 28:043119. [PMID: 31906645 DOI: 10.1063/1.5025646] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 06/10/2023]
Abstract
One of the most challenging problems in network science is to accurately detect communities at distinct hierarchical scales. Most existing methods are based on structural analysis and manipulation, which are NP-hard. We articulate an alternative, dynamical evolution-based approach to the problem. The basic principle is to computationally implement a nonlinear dynamical process on all nodes in the network with a general coupling scheme, creating a networked dynamical system. Under a proper system setting and with an adjustable control parameter, the community structure of the network would "come out" or emerge naturally from the dynamical evolution of the system. As the control parameter is systematically varied, the community hierarchies at different scales can be revealed. As a concrete example of this general principle, we exploit clustered synchronization as a dynamical mechanism through which the hierarchical community structure can be uncovered. In particular, for quite arbitrary choices of the nonlinear nodal dynamics and coupling scheme, decreasing the coupling parameter from the global synchronization regime, in which the dynamical states of all nodes are perfectly synchronized, can lead to a weaker type of synchronization organized as clusters. We demonstrate the existence of optimal choices of the coupling parameter for which the synchronization clusters encode accurate information about the hierarchical community structure of the network. We test and validate our method using a standard class of benchmark modular networks with two distinct hierarchies of communities and a number of empirical networks arising from the real world. Our method is computationally extremely efficient, eliminating completely the NP-hard difficulty associated with previous methods. 
The basic principle of exploiting dynamical evolution to uncover hidden community organizations at different scales represents a "game-change" type of approach to addressing the problem of community detection in complex networks.
Affiliation(s)
- Zhao Zhuo
- Web Sciences Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
- Shi-Min Cai
- Web Sciences Center, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
- Ming Tang
- Institute of Fundamental and Frontier Sciences and Big Data Research Center, University of Electronic Science and Technology of China, Chengdu 611731, China
- Ying-Cheng Lai
- School of Electrical Computer and Energy Engineering, Arizona State University, Tempe, Arizona 85287, USA
58
Kawamoto T. Algorithmic detectability threshold of the stochastic block model. Phys Rev E 2018; 97:032301. [PMID: 29776051 DOI: 10.1103/physreve.97.032301] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/30/2017] [Indexed: 06/08/2023]
Abstract
The assumption that the values of model parameters are known or correctly learned, i.e., the Nishimori condition, is one of the requirements for the detectability analysis of the stochastic block model in statistical inference. In practice, however, there is no example demonstrating that we can know the model parameters beforehand, and there is no guarantee that the model parameters can be learned accurately. In this study, we consider the expectation-maximization (EM) algorithm with belief propagation (BP) and derive its algorithmic detectability threshold. Our analysis is not restricted to the community structure but includes general modular structures. Because the algorithm cannot always learn the planted model parameters correctly, the algorithmic detectability threshold is qualitatively different from the one with the Nishimori condition.
Affiliation(s)
- Tatsuro Kawamoto
- Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology, 2-3-26 Aomi, Koto-ku, Tokyo, Japan
59
Modelling sequences and temporal networks with dynamic community structures. Nat Commun 2017; 8:582. [PMID: 28928409 PMCID: PMC5605535 DOI: 10.1038/s41467-017-00148-9] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 10/29/2016] [Accepted: 06/06/2017] [Indexed: 11/09/2022] Open
Abstract
In evolving complex systems such as air traffic and social organisations, collective effects emerge from their many components' dynamic interactions. While the dynamic interactions can be represented by temporal networks with nodes and links that change over time, they remain highly complex. It is therefore often necessary to use methods that extract the temporal networks' large-scale dynamic community structure. However, such methods are subject to overfitting or suffer from effects of arbitrary, a priori-imposed timescales, which should instead be extracted from data. Here we simultaneously address both problems and develop a principled data-driven method that determines relevant timescales and identifies patterns of dynamics that take place on networks, as well as shape the networks themselves. We base our method on an arbitrary-order Markov chain model with community structure, and develop a nonparametric Bayesian inference framework that identifies the simplest such model that can explain temporal interaction data. The description of temporal networks is usually simplified in terms of their dynamic community structures, whose identification however relies on a priori assumptions. Here the authors present a data-driven method that determines relevant timescales for the dynamics and uses it to identify communities.
60
Peel L, Larremore DB, Clauset A. The ground truth about metadata and community detection in networks. SCIENCE ADVANCES 2017; 3:e1602548. [PMID: 28508065 PMCID: PMC5415338 DOI: 10.1126/sciadv.1602548] [Citation(s) in RCA: 126] [Impact Index Per Article: 15.8] [Received: 10/17/2016] [Accepted: 03/08/2017] [Indexed: 05/30/2023]
Abstract
Across many scientific domains, there is a common need to automatically extract a simplified view or coarse-graining of how a complex system's components interact. This general task is called community detection in networks and is analogous to searching for clusters in independent vector data. It is common to evaluate the performance of community detection algorithms by their ability to find so-called ground truth communities. This works well in synthetic networks with planted communities because these networks' links are formed explicitly based on those known communities. However, there are no planted communities in real-world networks. Instead, it is standard practice to treat some observed discrete-valued node attributes, or metadata, as ground truth. We show that metadata are not the same as ground truth and that treating them as such induces severe theoretical and practical problems. We prove that no algorithm can uniquely solve community detection, and we prove a general No Free Lunch theorem for community detection, which implies that there can be no algorithm that is optimal for all possible community detection tasks. However, community detection remains a powerful tool and node metadata still have value, so a careful exploration of their relationship with network structure can yield insights of genuine worth. We illustrate this point by introducing two statistical techniques that can quantify the relationship between metadata and community structure for a broad class of models. We demonstrate these techniques using both synthetic and real-world networks, and for multiple types of metadata and community structures.
Affiliation(s)
- Leto Peel
- Institute of Information and Communication Technologies, Electronics and Applied Mathematics, Université Catholique de Louvain, Louvain-la-Neuve, Belgium
- naXys, Université de Namur, Namur, Belgium
- Aaron Clauset
- Santa Fe Institute, Santa Fe, NM 87501, USA
- Department of Computer Science, University of Colorado, Boulder, CO 80309, USA
- BioFrontiers Institute, University of Colorado, Boulder, CO 80309, USA
61
Peixoto TP. Nonparametric Bayesian inference of the microcanonical stochastic block model. Phys Rev E 2017; 95:012317. [PMID: 28208453 DOI: 10.1103/physreve.95.012317] [Citation(s) in RCA: 60] [Impact Index Per Article: 7.5] [Received: 10/20/2016] [Indexed: 11/07/2022]
Abstract
A principled approach to characterize the hidden structure of networks is to formulate generative models and then infer their parameters from data. When the desired structure is composed of modules or "communities," a suitable choice for this task is the stochastic block model (SBM), where nodes are divided into groups, and the placement of edges is conditioned on the group memberships. Here, we present a nonparametric Bayesian method to infer the modular structure of empirical networks, including the number of modules and their hierarchical organization. We focus on a microcanonical variant of the SBM, where the structure is imposed via hard constraints, i.e., the generated networks are not allowed to violate the patterns imposed by the model. We show how this simple model variation allows simultaneously for two important improvements over more traditional inference approaches: (1) deeper Bayesian hierarchies, with noninformative priors replaced by sequences of priors and hyperpriors, which not only remove limitations that seriously degrade the inference on large networks but also reveal structures at multiple scales; (2) a very efficient inference algorithm that scales well not only for networks with a large number of nodes and edges but also with an unlimited number of modules. We show also how this approach can be used to sample modular hierarchies from the posterior distribution, as well as to perform model selection. We discuss and analyze the differences between sampling from the posterior and simply finding the single parameter estimate that maximizes it. Furthermore, we expose a direct equivalence between our microcanonical approach and alternative derivations based on the canonical SBM.
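The generative idea in this abstract (nodes are divided into groups, and edge placement is conditioned on group memberships) can be illustrated with a minimal sketch. It uses the simpler canonical Bernoulli SBM, not the microcanonical variant the paper studies, and the names `sample_sbm`, `group_of`, and `edge_prob` are hypothetical, chosen only for this illustration.

```python
import random
from itertools import combinations


def sample_sbm(group_of, edge_prob, seed=0):
    """Sample an undirected simple graph from a canonical (Bernoulli) SBM.

    group_of:  list where group_of[i] is the group label of node i
    edge_prob: dict mapping a sorted pair of group labels (r, s) to the
               probability of an edge between a node in r and a node in s
    Returns a list of edges (i, j) with i < j.
    """
    rng = random.Random(seed)
    edges = []
    for i, j in combinations(range(len(group_of)), 2):
        # The edge probability depends only on the groups of the endpoints.
        r, s = sorted((group_of[i], group_of[j]))
        if rng.random() < edge_prob[(r, s)]:
            edges.append((i, j))
    return edges


# Two assortative groups of five nodes each (illustrative values only).
group_of = [0] * 5 + [1] * 5
edge_prob = {(0, 0): 0.8, (0, 1): 0.05, (1, 1): 0.8}
edges = sample_sbm(group_of, edge_prob, seed=42)
print(len(edges), "edges sampled")
```

Inference runs in the opposite direction: given only the edges, it recovers a plausible `group_of` and, in the nonparametric setting described above, the number of groups as well.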
Affiliation(s)
- Tiago P Peixoto
- Department of Mathematical Sciences and Centre for Networks and Collective Behaviour, University of Bath, Claverton Down, Bath BA2 7AY, United Kingdom and ISI Foundation, Via Alassio 11/c, 10126 Torino, Italy
62
From Relational Data to Graphs: Inferring Significant Links Using Generalized Hypergeometric Ensembles. LECTURE NOTES IN COMPUTER SCIENCE 2017. [DOI: 10.1007/978-3-319-67256-4_11] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 12/03/2022]
63
Kawamoto T, Kabashima Y. Detectability thresholds of general modular graphs. Phys Rev E 2017; 95:012304. [PMID: 28208358 DOI: 10.1103/physreve.95.012304] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 08/31/2016] [Indexed: 06/06/2023]
Abstract
We investigate the detectability thresholds of various modular structures in the stochastic block model. Our analysis reveals how the detectability threshold is related to the details of the modular pattern, including the hierarchy of the clusters. We show that certain planted structures are impossible to infer regardless of their fuzziness.
Affiliation(s)
- Tatsuro Kawamoto
- Department of Mathematical and Computing Science, Tokyo Institute of Technology, 4259-G5-22, Nagatsuta-cho, Midori-ku, Yokohama, Kanagawa 226-8502, Japan
- Yoshiyuki Kabashima
- Department of Mathematical and Computing Science, Tokyo Institute of Technology, 4259-G5-22, Nagatsuta-cho, Midori-ku, Yokohama, Kanagawa 226-8502, Japan
64
Abstract
This paper introduces a bibliometric, citation network-based method for assessing the social validation of novel research, and applies this method to the development of high-throughput toxicology research at the US Environmental Protection Agency. Social validation refers to the acceptance of novel research methods by a relevant scientific community; it is formally independent of the technical validation of methods, and is frequently studied in history, philosophy, and social studies of science using qualitative methods. The quantitative methods introduced here find that high-throughput toxicology methods are spread throughout a large and well-connected research community, which suggests high social validation. Further assessment of social validation involving mixed qualitative and quantitative methods is discussed in the conclusion.
Affiliation(s)
- Daniel J. Hicks
- Rotman Institute of Philosophy, University of Western Ontario, London, Ontario, Canada
- American Association for the Advancement of Science, Hosted in Office of Research and Development, United States Environmental Protection Agency, Washington, District of Columbia, United States of America
65
Newman MEJ. Equivalence between modularity optimization and maximum likelihood methods for community detection. Phys Rev E 2016; 94:052315. [PMID: 27967199 DOI: 10.1103/physreve.94.052315] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.9] [Received: 06/15/2016] [Indexed: 11/07/2022]
Abstract
We demonstrate an equivalence between two widely used methods of community detection in networks, the method of modularity maximization and the method of maximum likelihood applied to the degree-corrected stochastic block model. Specifically, we show an exact equivalence between maximization of the generalized modularity that includes a resolution parameter and the special case of the block model known as the planted partition model, in which all communities in a network are assumed to have statistically similar properties. Among other things, this equivalence provides a mathematically principled derivation of the modularity function, clarifies the conditions and assumptions of its use, and gives an explicit formula for the optimal value of the resolution parameter.
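For readers unfamiliar with the generalized modularity mentioned in this abstract, the following is a minimal sketch of the standard definition with resolution parameter gamma, Q(gamma) = sum over groups g of [e_g/m - gamma * (d_g/(2m))^2], where e_g is the number of edges inside group g, d_g is the total degree of its nodes, and m is the number of edges. The function name and the toy graph are assumptions made for illustration. The sketch assumes an undirected simple graph with at least one edge, and it does not reproduce the paper's formula for the optimal gamma.

```python
from collections import defaultdict


def generalized_modularity(edges, group_of, gamma=1.0):
    """Generalized modularity of a partition of an undirected simple graph.

    edges:    list of (i, j) pairs, no self-loops or duplicates
    group_of: list or dict giving the group label of each node
    gamma:    resolution parameter (gamma = 1 gives standard modularity)
    """
    m = len(edges)
    internal = defaultdict(int)    # e_g: edges with both endpoints in g
    degree_sum = defaultdict(int)  # d_g: total degree of nodes in g
    for i, j in edges:
        degree_sum[group_of[i]] += 1
        degree_sum[group_of[j]] += 1
        if group_of[i] == group_of[j]:
            internal[group_of[i]] += 1
    return sum(
        internal[g] / m - gamma * (degree_sum[g] / (2 * m)) ** 2
        for g in degree_sum
    )


# Two triangles joined by a single bridge edge (toy example).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = [0, 0, 0, 1, 1, 1]
one_group = [0, 0, 0, 0, 0, 0]
print(generalized_modularity(edges, two_groups))
print(generalized_modularity(edges, one_group))
```

Raising gamma penalizes large groups more strongly and so favors finer partitions, which is why a principled choice of its value matters.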
Affiliation(s)
- M E J Newman
- Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, Michigan 48109, USA
66
Peixoto TP. Inferring the mesoscale structure of layered, edge-valued, and time-varying networks. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2015; 92:042807. [PMID: 26565289 DOI: 10.1103/physreve.92.042807] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.8] [Received: 06/18/2015] [Indexed: 05/24/2023]
Abstract
Many network systems are composed of interdependent but distinct types of interactions, which cannot be fully understood in isolation. These different types of interactions are often represented as layers, attributes on the edges, or as a time dependence of the network structure. Although they are crucial for a more comprehensive scientific understanding, these representations offer substantial challenges. Namely, it is an open problem how to precisely characterize the large or mesoscale structure of network systems in relation to these additional aspects. Furthermore, the direct incorporation of these features invariably increases the effective dimension of the network description, and hence aggravates the problem of overfitting, i.e., the use of overly complex characterizations that mistake purely random fluctuations for actual structure. In this work, we propose a robust and principled method to tackle these problems, by constructing generative models of modular network structure, incorporating layered, attributed and time-varying properties, as well as a nonparametric Bayesian methodology to infer the parameters from data and select the most appropriate model according to statistical evidence. We show that the method is capable of revealing hidden structure in layered, edge-valued, and time-varying networks, and that the most appropriate level of granularity with respect to the additional dimensions can be reliably identified. We illustrate our approach on a variety of empirical systems, including a social network of physicians, the voting correlations of deputies in the Brazilian national congress, the global airport network, and a proximity network of high-school students.
Affiliation(s)
- Tiago P Peixoto
- Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany
67
Peixoto TP. Model Selection and Hypothesis Testing for Large-Scale Network Models with Overlapping Groups. PHYSICAL REVIEW X 2015; 5:011033. [DOI: 10.1103/physrevx.5.011033] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Indexed: 01/03/2025]
68
Herlau T, Schmidt MN, Mørup M. Infinite-degree-corrected stochastic block model. PHYSICAL REVIEW. E, STATISTICAL, NONLINEAR, AND SOFT MATTER PHYSICS 2014; 90:032819. [PMID: 25314493 DOI: 10.1103/physreve.90.032819] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 12/09/2013] [Indexed: 06/04/2023]
Abstract
In stochastic block models, which are among the most prominent statistical models for cluster analysis of complex networks, clusters are defined as groups of nodes with statistically similar link probabilities within and between groups. A recent extension by Karrer and Newman [Karrer and Newman, Phys. Rev. E 83, 016107 (2011)] incorporates a node degree correction to model degree heterogeneity within each group. Although this demonstrably leads to better performance on several networks, it is not obvious whether modeling node degree is always appropriate or necessary. We formulate the degree corrected stochastic block model as a nonparametric Bayesian model, incorporating a parameter to control the amount of degree correction that can then be inferred from data. Additionally, our formulation yields principled ways of inferring the number of groups as well as predicting missing links in the network that can be used to quantify the model's predictive performance. On synthetic data we demonstrate that including the degree correction yields better performance on both recovering the true group structure and predicting missing links when degree heterogeneity is present, whereas performance is on par for data with no degree heterogeneity within clusters. On seven real networks (with no ground truth group structure available) we show that predictive performance is about equal whether or not degree correction is included; however, for some networks significantly fewer clusters are discovered when correcting for degree, indicating that the data can be more compactly explained by clusters of heterogenous degree nodes.
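As background for the degree correction discussed in this abstract, the sketch below evaluates the profile log-likelihood of the Karrer and Newman degree-corrected SBM, L = sum over group pairs (r, s) of m_rs * log(m_rs / (kappa_r * kappa_s)), where m_rs counts edge ends between groups r and s (so m_rr is twice the number of internal edges) and kappa_r is the total degree of group r. This is the maximum-likelihood objective of the cited extension, not the nonparametric Bayesian model the paper itself proposes, and the function name and toy graph are hypothetical.

```python
import math
from collections import defaultdict


def dcsbm_log_likelihood(edges, group_of):
    """Profile log-likelihood of a partition under the degree-corrected SBM.

    edges:    list of (i, j) pairs of an undirected graph without self-loops
    group_of: list or dict giving the group label of each node
    Constants independent of the partition are dropped.
    """
    m_rs = defaultdict(int)   # edge ends between groups r and s
    kappa = defaultdict(int)  # total degree of each group
    for i, j in edges:
        r, s = group_of[i], group_of[j]
        # Counting both orientations makes m_rr twice the internal edge count.
        m_rs[(r, s)] += 1
        m_rs[(s, r)] += 1
        kappa[r] += 1
        kappa[s] += 1
    return sum(
        m * math.log(m / (kappa[r] * kappa[s]))
        for (r, s), m in m_rs.items()
    )


# Two triangles joined by a single bridge edge (toy example).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(dcsbm_log_likelihood(edges, [0, 0, 0, 1, 1, 1]))
print(dcsbm_log_likelihood(edges, [0, 0, 0, 0, 0, 0]))
```

Because this objective generally increases as groups are split further, it cannot select the number of groups on its own, which is one motivation for the Bayesian treatment described in the abstract.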
Affiliation(s)
- Tue Herlau
- Section for Cognitive Systems, DTU Compute, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
- Mikkel N Schmidt
- Section for Cognitive Systems, DTU Compute, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
- Morten Mørup
- Section for Cognitive Systems, DTU Compute, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark