1
Runghen R, Stouffer DB, Dalla Riva GV. Exploiting node metadata to predict interactions in bipartite networks using graph embedding and neural networks. Royal Society Open Science 2022; 9:220079. [PMID: 36016910 PMCID: PMC9399714 DOI: 10.1098/rsos.220079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2022] [Accepted: 08/02/2022] [Indexed: 06/15/2023]
Abstract
Networks are increasingly used in various fields to represent systems with the aim of understanding the underlying rules governing observed interactions, and hence predicting how the system is likely to behave in the future. Recent developments in network science highlight that accounting for node metadata improves both our understanding of how nodes interact with one another and the accuracy of link prediction. However, to predict interactions in a network within existing statistical and machine learning frameworks, we need to learn objects that rapidly grow in dimension with the number of nodes. Thus, the task becomes computationally and conceptually challenging for networks. Here, we present a new predictive procedure combining a statistical, low-rank graph embedding method with machine learning techniques, which substantially reduces the complexity of the learning task and allows us to efficiently predict interactions from node metadata in bipartite networks. To illustrate its application on real-world data, we apply it to a large dataset of tourist visits across a country. We found that our procedure accurately reconstructs existing interactions and predicts new interactions in the network. Overall, from both a network science and a data science perspective, our work offers a flexible and generalizable procedure for link prediction.
Affiliation(s)
- Rogini Runghen
  - Centre for Integrative Ecology, School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
  - The Roux Institute, Northeastern University, Boston, MA, USA
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
- Daniel B. Stouffer
  - Centre for Integrative Ecology, School of Biological Sciences, University of Canterbury, Christchurch, New Zealand
- Giulio V. Dalla Riva
  - School of Mathematics and Statistics, University of Canterbury, Christchurch, New Zealand
2
Young JG, Kirkley A, Newman MEJ. Clustering of heterogeneous populations of networks. Phys Rev E 2022; 105:014312. [PMID: 35193232 DOI: 10.1103/physreve.105.014312] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2021] [Accepted: 01/07/2022] [Indexed: 06/14/2023]
Abstract
Statistical methods for reconstructing networks from repeated measurements typically assume that all measurements are generated from the same underlying network structure. This need not be the case, however. People's social networks might be different on weekdays and weekends, for instance. Brain networks may differ between healthy patients and those with dementia or other conditions. Here we describe a Bayesian analysis framework for such data that allows for the fact that network measurements may be reflective of multiple possible structures. We define a finite mixture model of the measurement process and derive a Gibbs sampling procedure that samples exactly from the full posterior distribution of model parameters. The end result is a clustering of the measured networks into groups with similar structure. We demonstrate the method on both real and synthetic network populations.
Affiliation(s)
- Jean-Gabriel Young
  - Department of Mathematics and Statistics, University of Vermont, Burlington, Vermont 05405, USA
  - Vermont Complex Systems Center, University of Vermont, Burlington, Vermont 05405, USA
- Alec Kirkley
  - Department of Physics, University of Michigan, Ann Arbor, Michigan 48109, USA
  - School of Data Science, City University of Hong Kong, 999077, Hong Kong
- M E J Newman
  - Department of Physics, University of Michigan, Ann Arbor, Michigan 48109, USA
  - Center for the Study of Complex Systems, University of Michigan, Ann Arbor, Michigan 48109, USA
3
Pantazis K, Athreya A, Arroyo J, Frost WN, Hill ES, Lyzinski V. The Importance of Being Correlated: Implications of Dependence in Joint Spectral Inference across Multiple Networks. Journal of Machine Learning Research 2022; 23:141. [PMID: 37645242 PMCID: PMC10465120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/31/2023]
Abstract
Spectral inference on multiple networks is a rapidly-developing subfield of graph statistics. Recent work has demonstrated that joint, or simultaneous, spectral embedding of multiple independent networks can deliver more accurate estimation than individual spectral decompositions of those same networks. Such inference procedures typically rely heavily on independence assumptions across the multiple network realizations, and even in this case, little attention has been paid to the induced network correlation that can be a consequence of such joint embeddings. In this paper, we present a generalized omnibus embedding methodology and we provide a detailed analysis of this embedding across both independent and correlated networks, the latter of which significantly extends the reach of such procedures, and we describe how this omnibus embedding can itself induce correlation. This leads us to distinguish between inherent correlation (that is, the correlation that arises naturally in multisample network data) and induced correlation, which is an artifice of the joint embedding methodology. We show that the generalized omnibus embedding procedure is flexible and robust, and we prove both consistency and a central limit theorem for the embedded points. We examine how induced and inherent correlation can impact inference for network time series data, and we provide network analogues of classical questions such as the effective sample size for more generally correlated data. Further, we show how an appropriately calibrated generalized omnibus embedding can detect changes in real biological networks that previous embedding procedures could not discern, confirming that the effect of inherent and induced correlation can be subtle and transformative. By allowing for and deconstructing both forms of correlation, our methodology widens the scope of spectral techniques for network inference, with import in theory and practice.
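As a rough illustration of what a joint embedding starts from, the sketch below builds the classical omnibus matrix, in which block (i, j) is the average of the i-th and j-th adjacency matrices. This is an assumption made for illustration: the abstract does not spell out the generalized construction, and the function name `omnibus_matrix` and the toy graphs are hypothetical. A spectral embedding of the resulting matrix would be the next step and is not shown.

```python
def omnibus_matrix(graphs):
    """Build the classical omnibus matrix from m adjacency matrices of size n x n.

    Block (i, j) of the result is (A_i + A_j) / 2, so each diagonal
    block is simply A_i. This is NOT the generalized construction
    analysed in the paper, only the classical special case.
    """
    m = len(graphs)
    n = len(graphs[0])
    big = [[0.0] * (m * n) for _ in range(m * n)]
    for i, a in enumerate(graphs):
        for j, b in enumerate(graphs):
            for r in range(n):
                for c in range(n):
                    big[i * n + r][j * n + c] = (a[r][c] + b[r][c]) / 2.0
    return big


# Two hypothetical 2-node graphs: one with a single edge, one empty.
a1 = [[0, 1], [1, 0]]
a2 = [[0, 0], [0, 0]]
omni = omnibus_matrix([a1, a2])
```

Because the off-diagonal blocks mix pairs of graphs, the embedded points of different networks share information, which is one way to see how a joint embedding of this kind could induce correlation even between independent networks.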
Affiliation(s)
- Avanti Athreya
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD 21218
- Jesús Arroyo
  - Department of Statistics, Texas A&M University, College Station, TX 77843
- William N Frost
  - Cell Biology and Anatomy, and Center for Brain Function and Repair, Chicago Medical School, Rosalind Franklin University of Medicine and Science, Chicago, IL 60064-3905
- Evan S Hill
  - Cell Biology and Anatomy, and Center for Brain Function and Repair, Chicago Medical School, Rosalind Franklin University of Medicine and Science, Chicago, IL 60064-3905
- Vince Lyzinski
  - Department of Mathematics, University of Maryland, College Park, MD 20742
4
Xie F, Xu Y. Efficient Estimation for Random Dot Product Graphs via a One-Step Procedure. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1948419] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Affiliation(s)
- Fangzheng Xie
  - Department of Statistics, Indiana University, Bloomington, IN
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD
- Yanxun Xu
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD
5
Arroyo J, Athreya A, Cape J, Chen G, Priebe CE, Vogelstein JT. Inference for Multiple Heterogeneous Networks with a Common Invariant Subspace. Journal of Machine Learning Research 2021; 22:1-49. [PMID: 34650343 PMCID: PMC8513708] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
The development of models and methodology for the analysis of data from multiple heterogeneous networks is of importance both in statistical network theory and across a wide spectrum of application domains. Although single-graph analysis is well-studied, multiple graph inference is largely unexplored, in part because of the challenges inherent in appropriately modeling graph differences and yet retaining sufficient model simplicity to render estimation feasible. This paper addresses exactly this gap, by introducing a new model, the common subspace independent-edge multiple random graph model, which describes a heterogeneous collection of networks with a shared latent structure on the vertices but potentially different connectivity patterns for each graph. The model encompasses many popular network representations, including the stochastic blockmodel. The model is both flexible enough to meaningfully account for important graph differences, and tractable enough to allow for accurate inference in multiple networks. In particular, a joint spectral embedding of adjacency matrices (the multiple adjacency spectral embedding) leads to simultaneous consistent estimation of underlying parameters for each graph. Under mild additional assumptions, the estimates satisfy asymptotic normality and yield improvements for graph eigenvalue estimation. In both simulated and real data, the model and the embedding can be deployed for a number of subsequent network inference tasks, including dimensionality reduction, classification, hypothesis testing, and community detection. Specifically, when the embedding is applied to a data set of connectomes constructed through diffusion magnetic resonance imaging, the result is an accurate classification of brain scans by human subject and a meaningful determination of heterogeneity across scans of different individuals.
Affiliation(s)
- Jesús Arroyo
  - Department of Statistics, Texas A&M University, College Station, TX, 77843
- Avanti Athreya
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA
- Joshua Cape
  - Department of Statistics, University of Pittsburgh, Pittsburgh, PA 15260, USA
- Guodong Chen
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA
- Carey E Priebe
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, 21218, USA
- Joshua T Vogelstein
  - Department of Biomedical Engineering, Kavli Neuroscience Discovery Institute, Johns Hopkins University, Baltimore, MD, 21218, USA
6
Xie F, Xu Y. Optimal Bayesian estimation for random dot product graphs. Biometrika 2020. [DOI: 10.1093/biomet/asaa031] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Summary
We propose and prove the optimality of a Bayesian approach for estimating the latent positions in random dot product graphs, which we call posterior spectral embedding. Unlike classical spectral-based adjacency, or Laplacian spectral embedding, posterior spectral embedding is a fully likelihood-based graph estimation method that takes advantage of the Bernoulli likelihood information of the observed adjacency matrix. We develop a minimax lower bound for estimating the latent positions, and show that posterior spectral embedding achieves this lower bound in the following two senses: it both results in a minimax-optimal posterior contraction rate and yields a point estimator achieving the minimax risk asymptotically. The convergence results are subsequently applied to clustering in stochastic block models with positive semidefinite block probability matrices, strengthening an existing result concerning the number of misclustered vertices. We also study a spectral-based Gaussian spectral embedding as a natural Bayesian analogue of adjacency spectral embedding, but the resulting posterior contraction rate is suboptimal by an extra logarithmic factor. The practical performance of the proposed methodology is illustrated through extensive synthetic examples and the analysis of Wikipedia graph data.
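For readers unfamiliar with the model, the following minimal sketch shows the generative side of a random dot product graph: each vertex has a latent position, and an edge between two vertices appears independently with probability equal to the dot product of their positions, which is the Bernoulli likelihood the abstract refers to. The latent positions, function names, and seed are hypothetical and chosen only so that every dot product lies in [0, 1]; the estimation procedure itself is not reproduced here.

```python
import random


def edge_probabilities(positions):
    """Return the matrix P with P[i][j] = <x_i, x_j> for latent positions x."""
    n = len(positions)
    return [
        [sum(a * b for a, b in zip(positions[i], positions[j])) for j in range(n)]
        for i in range(n)
    ]


def sample_graph(probs, rng):
    """Sample a symmetric, loop-free adjacency matrix with independent Bernoulli edges."""
    n = len(probs)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < probs[i][j]:
                adj[i][j] = adj[j][i] = 1
    return adj


# Hypothetical two-dimensional latent positions for three vertices.
positions = [[0.8, 0.1], [0.7, 0.2], [0.1, 0.9]]
probs = edge_probabilities(positions)
adj = sample_graph(probs, random.Random(0))
```

Estimating the latent positions means going in the opposite direction: recovering `positions`, up to an orthogonal transformation, from an observed `adj`.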
Affiliation(s)
- Fangzheng Xie
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, 3400 North Charles Street, Baltimore, Maryland 21218, U.S.A.
- Yanxun Xu
  - Department of Applied Mathematics and Statistics, Johns Hopkins University, 3400 North Charles Street, Baltimore, Maryland 21218, U.S.A.