1
|
Zhou Y, Müller HG. Wasserstein regression with empirical measures and density estimation for sparse data. Biometrics 2024; 80:ujae127. [PMID: 39499238 DOI: 10.1093/biomtc/ujae127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2024] [Revised: 08/18/2024] [Accepted: 10/11/2024] [Indexed: 11/07/2024]
Abstract
The problem of modeling the relationship between univariate distributions and one or more explanatory variables lately has found increasing interest. Existing approaches proceed by substituting proxy estimated distributions for the typically unknown response distributions. These estimates are obtained from available data but are problematic when for some of the distributions only few data are available. Such situations are common in practice and cannot be addressed with currently available approaches, especially when one aims at density estimates. We show how this and other problems associated with density estimation such as tuning parameter selection and bias issues can be side-stepped when covariates are available. We also introduce a novel version of distribution-response regression that is based on empirical measures. By avoiding the preprocessing step of recovering complete individual response distributions, the proposed approach is applicable when the sample size available for each distribution varies and especially when it is small for some of the distributions but large for others. In this case, one can still obtain consistent distribution estimates even for distributions with only few data by gaining strength across the entire sample of distributions, while traditional approaches where distributions or densities are estimated individually fail, since sparsely sampled densities cannot be consistently estimated. The proposed model is demonstrated to outperform existing approaches through simulations and Environmental Influences on Child Health Outcomes data.
Affiliation(s)
- Yidong Zhou
  - Department of Statistics, University of California, Davis, CA 95616, United States
- Hans-Georg Müller
  - Department of Statistics, University of California, Davis, CA 95616, United States
|
2
|
Gertheiss J, Rügamer D, Liew BXW, Greven S. Functional Data Analysis: An Introduction and Recent Developments. Biom J 2024; 66:e202300363. [PMID: 39330918 DOI: 10.1002/bimj.202300363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Revised: 05/17/2024] [Accepted: 05/27/2024] [Indexed: 09/28/2024]
Abstract
Functional data analysis (FDA) is a statistical framework that allows for the analysis of curves, images, or functions on higher dimensional domains. The goals of FDA, such as descriptive analyses, classification, and regression, are generally the same as for statistical analyses of scalar-valued or multivariate data, but FDA brings additional challenges due to the high- and infinite dimensionality of observations and parameters, respectively. This paper provides an introduction to FDA, including a description of the most common statistical analysis techniques, their respective software implementations, and some recent developments in the field. The paper covers fundamental concepts such as descriptives and outliers, smoothing, amplitude and phase variation, and functional principal component analysis. It also discusses functional regression, statistical inference with functional data, functional classification and clustering, and machine learning approaches for functional data analysis. The methods discussed in this paper are widely applicable in fields such as medicine, biophysics, neuroscience, and chemistry and are increasingly relevant due to the widespread use of technologies that allow for the collection of functional data. Sparse functional data methods are also relevant for longitudinal data analysis. All presented methods are demonstrated using available software in R by analyzing a dataset on human motion and motor control. To facilitate the understanding of the methods, their implementation, and hands-on application, the code for these practical examples is made available through a code and data supplement and on GitHub.
Affiliation(s)
- Jan Gertheiss
  - Department of Mathematics and Statistics, School of Economics and Social Sciences, Helmut Schmidt University, Hamburg, Germany
- David Rügamer
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Bernard X W Liew
  - School of Sport, Rehabilitation and Exercise Sciences, University of Essex, Essex, UK
- Sonja Greven
  - Chair of Statistics, School of Business and Economics, Humboldt-Universität zu Berlin, Berlin, Germany
|
3
|
Zhu C, Müller HG. Autoregressive optimal transport models. J R Stat Soc Series B Stat Methodol 2023; 85:1012-1033. [PMID: 37521164 PMCID: PMC10376456 DOI: 10.1093/jrsssb/qkad051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2021] [Revised: 04/15/2023] [Accepted: 04/17/2023] [Indexed: 08/01/2023]
Abstract
Series of univariate distributions indexed by equally spaced time points are ubiquitous in applications and their analysis constitutes one of the challenges of the emerging field of distributional data analysis. To quantify such distributional time series, we propose a class of intrinsic autoregressive models that operate in the space of optimal transport maps. The autoregressive transport models that we introduce here are based on regressing optimal transport maps on each other, where predictors can be transport maps from an overall barycenter to a current distribution or transport maps between past consecutive distributions of the distributional time series. Autoregressive transport models and their associated distributional regression models specify the link between predictor and response transport maps by moving along geodesics in Wasserstein space. These models emerge as natural extensions of the classical autoregressive models in Euclidean space. Unique stationary solutions of autoregressive transport models are shown to exist under a geometric moment contraction condition of Wu & Shao [(2004) Limit theorems for iterated random functions. Journal of Applied Probability 41, 425-436], using properties of iterated random functions. We also discuss an extension to a varying coefficient model for first-order autoregressive transport models. In addition to simulations, the proposed models are illustrated with distributional time series of house prices across U.S. counties and annual summer temperature distributions.
Affiliation(s)
- Changbo Zhu
  - Address for correspondence: Changbo Zhu, Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, USA.
- Hans-Georg Müller
  - Department of Statistics, University of California, Davis, Davis, CA 95616, USA
|
4
|
Qiu J, Dai X, Zhu Z. Nonparametric Estimation of Repeated Densities with Heterogeneous Sample Sizes. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2104728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Affiliation(s)
- Jiaming Qiu
  - Department of Statistics, Iowa State University
|
5
|
Galasso B, Zemel Y, de Carvalho M. Bayesian semiparametric modelling of phase-varying point processes. Electron J Stat 2022. [DOI: 10.1214/21-ejs1973] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
Affiliation(s)
- Bastian Galasso
  - Department of Innovation and Digital Transformation, Coca-Cola Embonor, Chile
- Yoav Zemel
  - Statistical Laboratory, University of Cambridge, United Kingdom
|
6
|
OUP accepted manuscript. Biometrika 2022. [DOI: 10.1093/biomet/asac005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/13/2022]
|
7
|
Multimodal Bayesian registration of noisy functions using Hamiltonian Monte Carlo. Comput Stat Data Anal 2021. [DOI: 10.1016/j.csda.2021.107298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
8
|
Affiliation(s)
- Yaqing Chen
  - Department of Statistics, University of California, Davis, CA
- Zhenhua Lin
  - Department of Statistics and Applied Probability, National University of Singapore, Singapore, Singapore
|
9
|
Horváth L, Kokoszka P, Wang S. Monitoring for a change point in a sequence of distributions. Ann Stat 2021. [DOI: 10.1214/20-aos2036] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
|
10
|
Chen Y, Dubey P, Müller HG, Bruchhage M, Wang JL, Deoni S. Modeling sparse longitudinal data in early neurodevelopment. Neuroimage 2021; 237:118079. [PMID: 34000395 DOI: 10.1016/j.neuroimage.2021.118079] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/26/2020] [Revised: 04/09/2021] [Accepted: 04/12/2021] [Indexed: 11/15/2022]
Abstract
Early childhood is a period marked by rapid brain growth accompanied by cognitive and motor development. However, it remains unclear how early developmental skills relate to neuroanatomical growth across time with no growth quantile trajectories of typical brain development currently available to place and compare individual neuroanatomical development. Even though longitudinal neuroimaging data have become more common, they are often sparse, making dynamic analyses at subject level a challenging task. Using the Principal Analysis through Conditional Expectation (PACE) approach geared towards sparse longitudinal data, we investigate the evolution of gray matter, white matter and cerebrospinal fluid volumes in a cohort of 446 children between the ages of 1 and 120 months. For each child, we calculate their dynamic age-varying association between the growing brain and scores that assess cognitive functioning, applying the functional varying coefficient model. Using local Fréchet regression, we construct age-varying growth percentiles to reveal the evolution of brain development across the population. To further demonstrate its utility, we apply PACE to predict individual trajectories of brain development.
Affiliation(s)
- Yaqing Chen
  - Department of Statistics, University of California, Davis, Davis, CA, 95616, USA
- Paromita Dubey
  - Department of Statistics, Stanford University, Stanford, CA, 94305, USA
- Hans-Georg Müller
  - Department of Statistics, University of California, Davis, Davis, CA, 95616, USA
- Muriel Bruchhage
  - Advanced Baby Imaging Lab, Hasbro Children's Hospital, Rhode Island Hospital, Providence, RI, 02903, USA; Department of Pediatrics, Warren Alpert Medical School at Brown University, Providence, RI, 02912, USA
- Jane-Ling Wang
  - Department of Statistics, University of California, Davis, Davis, CA, 95616, USA
- Sean Deoni
  - Advanced Baby Imaging Lab, Hasbro Children's Hospital, Rhode Island Hospital, Providence, RI, 02903, USA; Department of Pediatrics, Warren Alpert Medical School at Brown University, Providence, RI, 02912, USA; Department of Radiology, Warren Alpert Medical School at Brown University, Providence, RI, 02912, USA; Maternal, Newborn, and Child Health Discovery & Tools, Bill & Melinda Gates Foundation, Seattle, WA, USA.
|
11
|
Chakraborty A, Panaretos VM. Functional registration and local variations: Identifiability, rank, and tuning. BERNOULLI 2021. [DOI: 10.3150/20-bej1267] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/19/2022]
Affiliation(s)
- Victor M. Panaretos
  - Institut de Mathématiques, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland
|
12
|
Gajardo Á, Müller HG. Point process models for COVID-19 cases and deaths. J Appl Stat 2021; 50:2294-2309. [PMID: 37529574 PMCID: PMC10388820 DOI: 10.1080/02664763.2021.1907839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/30/2020] [Accepted: 03/18/2021] [Indexed: 10/21/2022]
Abstract
The study of events distributed over time which can be quantified as point processes has attracted much interest over the years due to its wide range of applications. It has recently gained new relevance due to the COVID-19 case and death processes associated with SARS-CoV-2 that characterize the COVID-19 pandemic and are observed across different countries. It is of interest to study the behavior of these point processes and how they may be related to covariates such as mobility restrictions, gross domestic product per capita, and fraction of population of older age. As infections and deaths in a region are intrinsically events that arrive at random times, a point process approach is natural for this setting. We adopt techniques for conditional functional point processes that target point processes as responses with vector covariates as predictors, to study the interaction and optimal transport between case and death processes and doubling times conditional on covariates.
Affiliation(s)
- Álvaro Gajardo
  - Department of Statistics, University of California, Davis, CA, USA
|
13
|
Petersen A, Liu X, Divani AA. Wasserstein $F$-tests and confidence bands for the Fréchet regression of density response curves. Ann Stat 2021. [DOI: 10.1214/20-aos1971] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
|
14
|
Chen Y, Müller HG. Wasserstein gradients for the temporal evolution of probability distributions. Electron J Stat 2021. [DOI: 10.1214/21-ejs1883] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Yaqing Chen
  - Department of Statistics, University of California, Davis
|
15
|
Balzanella A, Irpino A. Spatial prediction and spatial dependence monitoring on georeferenced data streams. STAT METHOD APPL-GER 2020. [DOI: 10.1007/s10260-019-00462-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
|
16
|
Tameling C, Sommerfeld M, Munk A. Empirical optimal transport on countable metric spaces: Distributional limits and statistical applications. ANN APPL PROBAB 2019. [DOI: 10.1214/19-aap1463] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 11/19/2022]
|
17
|
Lila E, Aston JAD. Statistical Analysis of Functions on Surfaces, With an Application to Medical Imaging. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2019.1635479] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 10/26/2022]
Affiliation(s)
- Eardi Lila
  - Cambridge Centre for Analysis, University of Cambridge, Cambridge, UK
- John A. D. Aston
  - Statistical Laboratory, DPMMS, University of Cambridge, Cambridge, UK
|
18
|
Affiliation(s)
- Kyunghee Han
  - Department of Statistics, University of California, Davis
- Byeong U. Park
  - Department of Statistics, Seoul National University, Seoul, Republic of Korea
|
19
|
|
20
|
Affiliation(s)
- Alexander Petersen
  - Department of Statistics and Applied Probability, University of California, Santa Barbara, California 93106, U.S.A
- Hans-Georg Müller
  - Department of Statistics, University of California, One Shields Avenue, Davis, California 95616, U.S.A
|
21
|
|
22
|
Wrobel J, Zipunnikov V, Schrack J, Goldsmith J. Registration for exponential family functional data. Biometrics 2019; 75:48-57. [PMID: 30129091 PMCID: PMC10585654 DOI: 10.1111/biom.12963] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Received: 12/01/2017] [Revised: 08/01/2018] [Accepted: 08/01/2018] [Indexed: 12/01/2022]
Abstract
We introduce a novel method for separating amplitude and phase variability in exponential family functional data. Our method alternates between two steps: the first uses generalized functional principal components analysis to calculate template functions, and the second estimates smooth warping functions that map observed curves to templates. Existing approaches to registration have primarily focused on continuous functional observations, and the few approaches for discrete functional data require a pre-smoothing step; these methods are frequently computationally intensive. In contrast, we focus on the likelihood of the observed data and avoid the need for preprocessing, and we implement both steps of our algorithm in a computationally efficient way. Our motivation comes from the Baltimore Longitudinal Study on Aging, in which accelerometer data provides valuable insights into the timing of sedentary behavior. We analyze binary functional data with observations each minute over 24 hours for 592 participants, where values represent activity and inactivity. Diurnal patterns of activity are obscured due to misalignment in the original data but are clear after curves are aligned. Simulations designed to mimic the application indicate that the proposed methods outperform competing approaches in terms of estimation accuracy and computational efficiency. Code for our method and simulations is publicly available.
Affiliation(s)
- Julia Wrobel
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York, U.S.A
- Vadim Zipunnikov
  - Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, U.S.A
- Jennifer Schrack
  - Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, Maryland, U.S.A
  - Longitudinal Studies Section, Translational Gerontology Branch, National Institute on Aging, National Institutes of Health, Bethesda, Maryland, U.S.A
- Jeff Goldsmith
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, New York, U.S.A
|
23
|
Procrustes Metrics on Covariance Operators and Optimal Transportation of Gaussian Processes. SANKHYA A 2019. [DOI: 10.1007/s13171-018-0130-1] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Indexed: 10/16/2022]
|
24
|
Bigot J, Gouet R, Klein T, López A. Upper and lower risk bounds for estimating the Wasserstein barycenter of random measures on the real line. Electron J Stat 2018. [DOI: 10.1214/18-ejs1400] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Indexed: 11/19/2022]
|
25
|
Inconsistency of Template Estimation by Minimizing of the Variance/Pre-Variance in the Quotient Space. ENTROPY 2017. [DOI: 10.3390/e19060288] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Indexed: 11/16/2022]
|
26
|
|