1. Rineer J, Kruskamp N, Kery C, Jones K, Hilscher R, Bobashev G. A National Synthetic Populations Dataset for the United States. Sci Data 2025; 12:144. [PMID: 39863626 PMCID: PMC11762717 DOI: 10.1038/s41597-025-04380-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2024] [Accepted: 01/02/2025] [Indexed: 01/27/2025]
Abstract
Geospatially explicit and statistically accurate person and household data allow researchers to study community- and neighborhood-level effects and to design and test hypotheses that would not be possible without synthetic data. In this article, we demonstrate the workflow for generating spatially explicit household- and individual-level synthetic populations for the United States representing the year 2019. We use publicly available U.S. Census American Community Survey (ACS) 5-year estimates from the 2015-2019 ACS. We use Iterative Proportional Fitting (IPF) to create our synthetic population and use the resulting joint counts to sample representative households and people directly from microdata. Our dataset contains records for 120,754,708 households and 303,128,287 individuals across the United States. We spatially allocate households using the Environmental Protection Agency (EPA) Integrated Climate and Land Use Scenarios (ICLUS) project household distribution estimates to create a spatially explicit dataset. Our validation shows strong correlation with original census variables, with many categories reporting a greater than 0.99 Pearson's r correlation coefficient.
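The Iterative Proportional Fitting step named in this abstract can be illustrated with a minimal two-dimensional sketch. It is not the authors' implementation: the function name `ipf`, the seed table, and the target margins below are hypothetical, and a real workflow would fit more dimensions against ACS tables. The idea is to rescale a seed cross-tabulation alternately by rows and by columns until its margins match the known totals.

```python
def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Rescale a 2-D seed table until its row and column sums match the targets.

    Illustrative sketch only; names and defaults are assumptions.
    """
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Row step: scale each row so it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [v * target / total for v in table[i]]
        # Column step: scale each column so it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Columns now match exactly; stop once the rows still match too.
        err = max(abs(sum(table[i]) - t) for i, t in enumerate(row_targets))
        if err < tol:
            break
    return table


# Hypothetical example: a 2x2 seed (e.g. household size by income band)
# fitted to made-up marginal totals that share the same grand total (100).
fitted = ipf([[1.0, 2.0], [3.0, 4.0]], [40.0, 60.0], [30.0, 70.0])
for row in fitted:
    print([round(v, 3) for v in row])
```

The fitted cells act as joint counts; in a workflow like the one described, such counts would then determine how many microdata records of each type to sample.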
Affiliation(s)
- James Rineer: RTI International, 3040 Cornwallis Rd., P.O. Box 12194, Research Triangle Park, NC, 27709, USA
- Nicholas Kruskamp: RTI International, 3040 Cornwallis Rd., P.O. Box 12194, Research Triangle Park, NC, 27709, USA
- Caroline Kery: RTI International, 3040 Cornwallis Rd., P.O. Box 12194, Research Triangle Park, NC, 27709, USA
- Kasey Jones: RTI International, 3040 Cornwallis Rd., P.O. Box 12194, Research Triangle Park, NC, 27709, USA
- Rainer Hilscher: RTI International, 3040 Cornwallis Rd., P.O. Box 12194, Research Triangle Park, NC, 27709, USA
- Georgiy Bobashev: RTI International, 3040 Cornwallis Rd., P.O. Box 12194, Research Triangle Park, NC, 27709, USA
2. Prédhumeau M, Manley E. A synthetic population for agent-based modelling in Canada. Sci Data 2023; 10:148. [PMID: 36941294 PMCID: PMC10027812 DOI: 10.1038/s41597-023-02030-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Accepted: 02/17/2023] [Indexed: 03/23/2023]
Abstract
In order to anticipate the impact of local public policies, a synthetic population reflecting the characteristics of the local population provides a valuable test bed. While synthetic population datasets are now available for several countries, there is no open-source synthetic population for Canada. We propose an open-source synthetic population of individuals and households at a fine geographical level for Canada for the years 2021, 2023 and 2030. Based on 2016 census data and population projections, the synthetic individuals have detailed socio-demographic attributes, including age, sex, income, education level, employment status and geographic locations, and are related into households. A comparison of the 2021 synthetic population with 2021 census data over various geographical areas validates the reliability of the synthetic dataset. Users can extract populations from the dataset for specific zones, to explore 'what if' scenarios on present and future populations. They can extend the dataset using local survey data to add new characteristics to individuals. Users can also run the code to generate populations for years up to 2042.
Affiliation(s)
- Ed Manley: University of Leeds, School of Geography, Leeds, LS2 9JT, UK
3. Synthetic data in health care: A narrative review. PLOS Digital Health 2023; 2:e0000082. [PMID: 36812604 PMCID: PMC9931305 DOI: 10.1371/journal.pdig.0000082] [Citation(s) in RCA: 56] [Impact Index Per Article: 28.0] [Received: 06/29/2022] [Accepted: 12/06/2022] [Indexed: 01/09/2023]
Abstract
Data are central to research, public health, and the development of health information technology (IT) systems. Nevertheless, access to most data in health care is tightly controlled, which may limit innovation, development, and efficient implementation of new research, products, services, or systems. Using synthetic data is one of the many innovative ways that can allow organizations to share datasets with broader users. However, only limited literature explores its potential and applications in health care. In this review paper, we examined existing literature to bridge the gap and highlight the utility of synthetic data in health care. We searched PubMed, Scopus, and Google Scholar to identify peer-reviewed articles, conference papers, reports, theses, and dissertations related to the generation and use of synthetic datasets in health care. The review identified seven use cases of synthetic data in health care: a) simulation and prediction research, b) hypothesis, methods, and algorithm testing, c) epidemiology/public health research, d) health IT development, e) education and training, f) public release of datasets, and g) linking data. The review also identified readily and publicly accessible health care datasets, databases, and sandboxes containing synthetic data with varying degrees of utility for research, education, and software development. The review provided evidence that synthetic data are helpful in different aspects of health care and research. While the original real data remain the preferred choice, synthetic data hold possibilities in bridging data access gaps in research and evidence-based policymaking.
4. Møgelmose S, Neels K, Hens N. Incorporating human dynamic populations in models of infectious disease transmission: a systematic review. BMC Infect Dis 2022; 22:862. [DOI: 10.1186/s12879-022-07842-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2022] [Accepted: 11/04/2022] [Indexed: 11/19/2022]
Abstract
Background
An increasing number of infectious disease models consider demographic change in the host population, but the demographic methods and assumptions vary considerably. We carry out a systematic review of the methods and assumptions used to incorporate dynamic populations in infectious disease models.
Methods
We systematically searched PubMed and Web of Science for articles on infectious disease transmission in dynamic host populations. We screened the articles and extracted data in accordance with the guidelines of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA).
Results
We identified 46 articles containing 53 infectious disease models with dynamic populations. Population dynamics were modelled explicitly in 71% of the disease transmission models using cohort-component-based models (CCBMs) or individual-based models (IBMs), while 29% used population prospects as an external input. Fertility and mortality were in most cases age- or age-sex-specific, but several models used crude fertility rates (40%). Households were incorporated in 15% of the models, which were IBMs except for one model using external population prospects. Finally, 17% of the infectious disease models included demographic sensitivity analyses.
Conclusions
We find that most studies model fertility, mortality and migration explicitly. Moreover, population-level modelling was more common than IBMs. Demographic characteristics beyond age and sex are cumbersome to implement in population-level models and were for that reason only incorporated in IBMs. Several IBMs included households and networks, but the granularity of the underlying demographic processes was often similar to that of CCBMs. We describe the implications of the most common assumptions and discuss possible extensions.
5. McLure A, Graves PM, Lau C, Shaw C, Glass K. Modelling lymphatic filariasis elimination in American Samoa: GEOFIL predicts need for new targets and six rounds of mass drug administration. Epidemics 2022; 40:100591. [DOI: 10.1016/j.epidem.2022.100591] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/19/2020] [Revised: 06/07/2022] [Accepted: 06/07/2022] [Indexed: 11/03/2022]
6. Wangdi K, Sheel M, Fuimaono S, Graves PM, Lau CL. Lymphatic filariasis in 2016 in American Samoa: Identifying clustering and hotspots using non-spatial and three spatial analytical methods. PLoS Negl Trop Dis 2022; 16:e0010262. [PMID: 35344542 PMCID: PMC8989349 DOI: 10.1371/journal.pntd.0010262] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/24/2021] [Revised: 04/07/2022] [Accepted: 02/15/2022] [Indexed: 02/04/2023]
Abstract
Background
American Samoa completed seven rounds of mass drug administration from 2000-2006 as part of the Global Programme to Eliminate Lymphatic Filariasis (LF). However, resurgence was confirmed in 2016 through a WHO-recommended school-based transmission assessment survey and a community-based survey. This paper uses data from the 2016 community survey to compare different spatial and non-spatial methods to characterise clustering and hotspots of LF.
Method
Non-spatial clustering of infection markers (antigen [Ag], microfilaraemia [Mf], and antibodies [Ab: Wb123, Bm14, Bm33]) was assessed using intra-cluster correlation coefficients (ICC) at household and village levels. Spatial dependence, clustering and hotspots were examined using semivariograms, Kulldorff's scan statistic and Getis-Ord Gi* statistics based on locations of surveyed households.
Results
The survey included 2671 persons (750 households, 730 unique locations in 30 villages). ICCs were higher at household (0.20-0.69) than village levels (0.10-0.30) for all infection markers. Semivariograms identified significant spatial dependency for all markers (range 207-562 metres). Using Kulldorff's scan statistic, significant spatial clustering was observed in two previously known locations of ongoing transmission: for all markers in Fagali'i and all Abs in Vaitogi. The Getis-Ord Gi* statistic identified hotspots of all markers in Fagali'i, Vaitogi, and Pago Pago-Anua areas. A hotspot of Ag and Wb123 Ab was identified around the villages of Nua-Seetaga-Asili. Bm14 and Bm33 Ab hotspots were seen in Maleimi and Vaitogi-Ili'ili-Tafuna.
Conclusion
Our study demonstrated the utility of different non-spatial and spatial methods for investigating clustering and hotspots, the benefits of using multiple infection markers, and the value of triangulating results between methods.
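For readers unfamiliar with the hotspot statistic named in this abstract, the Getis-Ord Gi* statistic is commonly written in the standardized form below. This is the general textbook definition, not a formula quoted from the paper, and the notation is assumed here: $x_j$ is the value observed at location $j$, $w_{ij}$ is the spatial weight between locations $i$ and $j$ (with $w_{ii}$ included), and $n$ is the number of locations.

```latex
G_i^{*} =
  \frac{\sum_{j=1}^{n} w_{ij}\, x_j \;-\; \bar{X} \sum_{j=1}^{n} w_{ij}}
       {S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^{2} - \left( \sum_{j=1}^{n} w_{ij} \right)^{2}}{n - 1}}},
\qquad
\bar{X} = \frac{1}{n} \sum_{j=1}^{n} x_j,
\qquad
S = \sqrt{\frac{1}{n} \sum_{j=1}^{n} x_j^{2} - \bar{X}^{2}}
```

The result is a z-score: large positive values indicate a location surrounded by high values (a hotspot), and large negative values indicate a cold spot.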
Affiliation(s)
- Kinley Wangdi: Department of Global Health, Research School of Population Health, College of Health and Medicine, Australian National University, Acton, Canberra, Australia
- Meru Sheel: National Centre for Epidemiology and Population Health, Research School of Population Health, College of Health and Medicine, Australian National University, Acton, Canberra, Australia
- Patricia M. Graves: College of Public Health, Medical and Veterinary Sciences and Australian Institute of Tropical Health and Medicine, James Cook University, Cairns, Australia
- Colleen L. Lau: Department of Global Health, Research School of Population Health, College of Health and Medicine, Australian National University, Acton, Canberra, Australia; School of Public Health, Faculty of Medicine, The University of Queensland, Herston, Australia
7. Skrip L, Derra K, Kaboré M, Noori N, Gansané A, Valéa I, Tinto H, Brice BW, Gordon MV, Hagedorn B, Hien H, Althouse BM, Wenger EA, Ouédraogo AL. Clinical management and mortality among COVID-19 cases in sub-Saharan Africa: A retrospective study from Burkina Faso and simulated case analysis. Int J Infect Dis 2020; 101:194-200. [PMID: 32987177 PMCID: PMC7518969 DOI: 10.1016/j.ijid.2020.09.1432] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 08/19/2020] [Revised: 09/13/2020] [Accepted: 09/22/2020] [Indexed: 01/08/2023]
Abstract
Background
Absolute numbers of COVID-19 cases and deaths reported to date in the sub-Saharan Africa (SSA) region have been significantly lower than those across the Americas, Asia and Europe. As a result, there has been limited information about the demographic and clinical characteristics of deceased cases in the region, as well as the impacts of different case management strategies.
Methods
Data from deceased cases reported across SSA through 10 May 2020 and from hospitalized cases in Burkina Faso through 15 April 2020 were analyzed. Demographic, epidemiological and clinical information on deceased cases in SSA was derived through a line-list of publicly available information and, for cases in Burkina Faso, from aggregate records at the Centre Hospitalier Universitaire de Tengandogo in Ouagadougou. A synthetic case population was probabilistically derived using distributions of age, sex and underlying conditions from populations of West African countries to assess individual risk factors and treatment effect sizes. Logistic regression analysis was conducted to evaluate the adjusted odds of survival for patients receiving oxygen therapy or convalescent plasma, based on therapeutic effectiveness observed for other respiratory illnesses.
Results
Across SSA, deceased cases for which demographic data were available were predominantly male (63/103, 61.2%) and aged >50 years (59/75, 78.7%). In Burkina Faso, specifically, the majority of deceased cases either did not seek care at all or were hospitalized for a single day (59.4%, 19/32). Hypertension and diabetes were often reported as underlying conditions. After adjustment for sex, age and underlying conditions in the synthetic case population, the odds of mortality for cases not receiving oxygen therapy, such as due to disruptions to standard care, were significantly higher than for those receiving oxygen (OR 2.07; 95% CI 1.56-2.75). Cases receiving convalescent plasma had 50% lower odds of mortality than those who did not (95% CI 0.24-0.93).
Conclusions
Investment in sustainable production and maintenance of supplies for oxygen therapy, along with messaging around early and appropriate use for healthcare providers, caregivers and patients could reduce COVID-19 deaths in SSA. Further investigation into convalescent plasma is warranted until data on its effectiveness specifically in treating COVID-19 becomes available. The success of supportive or curative clinical interventions will depend on earlier treatment seeking, such that community engagement and risk communication will be critical components of the response.
Affiliation(s)
- Laura Skrip: Institute for Disease Modeling, Bellevue, WA, USA
- Karim Derra: IRSS-Clinical Research Unit of Nanoro, Burkina Faso
- Mikaila Kaboré: Ministry of Health, Teaching Hospital Yalgado Ouedraogo, Ouagadougou, Burkina Faso
- Adama Gansané: Centre National de Recherche et de Formation Sur le Paludisme, National Public Health Institute, Ouagadougou, Burkina Faso
- Bicaba W Brice: Centre des Operations de Réponses aux Urgences Sanitaires, Ouagadougou, National Public Health Institute, Burkina Faso
- Hervé Hien: Centre MURAZ, National Public Health Institute, Ouagadougou, Burkina Faso; IRSS, Programme de Recherche Sur les Politiques et les Systèmes de Santé, Bobo-Dioulasso, Burkina Faso
- Benjamin M Althouse: Institute for Disease Modeling, Bellevue, WA, USA; University of Washington, Seattle, WA, USA; New Mexico State University, Las Cruces, NM, USA
8. Uncovering temporal changes in Europe's population density patterns using a data fusion approach. Nat Commun 2020; 11:4631. [PMID: 32934205 PMCID: PMC7493994 DOI: 10.1038/s41467-020-18344-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 01/12/2020] [Accepted: 08/04/2020] [Indexed: 11/08/2022]
Abstract
The knowledge of the spatial and temporal distribution of human population is vital for the study of cities, disaster risk management or planning of infrastructure. However, information on the distribution of population is often based on place-of-residence statistics from official sources, thus ignoring the changing population densities resulting from human mobility. Existing assessments of spatio-temporal population are limited in their detail and geographical coverage, and the promising mobile-phone records are hindered by issues concerning availability and consistency. Here, we present a multi-layered dasymetric approach that combines official statistics with geospatial data from emerging sources to produce and validate a European Union-wide dataset of population grids taking into account intraday and monthly population variations at 1 km² resolution. The results reproduce and systematically quantify known insights concerning the spatio-temporal population density structure of large European cities, whose daytime population we estimate to be, on average, 1.9 times higher than night time in city centers.
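The core idea of dasymetric mapping mentioned in this abstract can be sketched in a few lines. This is a single-layer illustration under stated assumptions, not the multi-layered method of the paper: the function name `dasymetric_allocate`, the cell identifiers, and the weights (for example, built-up area per grid cell) are hypothetical. A zone's known population total is redistributed to grid cells in proportion to an ancillary weight.

```python
def dasymetric_allocate(zone_population, cell_weights):
    """Split one zone's population across grid cells in proportion to weights.

    cell_weights maps a cell identifier to a non-negative ancillary weight
    (hypothetical here, e.g. built-up area). Illustrative sketch only.
    """
    total_weight = sum(cell_weights.values())
    if total_weight <= 0:
        raise ValueError("at least one cell needs a positive weight")
    return {
        cell: zone_population * weight / total_weight
        for cell, weight in cell_weights.items()
    }


# Hypothetical zone of 1000 residents over three cells; the cell with
# zero weight (e.g. water or forest) receives no population.
grid = dasymetric_allocate(1000, {"cell_a": 3, "cell_b": 1, "cell_c": 0})
print(grid)
```

Because the shares are proportional, the cell values always sum back to the zone total, so the official statistic is preserved while the spatial detail comes from the ancillary layer. Temporal variation could be represented by swapping in different weights per time slice.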
9. Krauland MG, Frankeny RJ, Lewis J, Brink L, Hulsey EG, Roberts MS, Hacker KA. Development of a Synthetic Population Model for Assessing Excess Risk for Cardiovascular Disease Death. JAMA Netw Open 2020; 3:e2015047. [PMID: 32870312 PMCID: PMC7489828 DOI: 10.1001/jamanetworkopen.2020.15047] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 02/13/2020] [Accepted: 06/10/2020] [Indexed: 12/26/2022]
Abstract
Importance
Evaluating the association of social determinants of health with chronic diseases at the population level requires access to individual-level factors associated with disease, which are rarely available for large populations. Synthetic populations are a possible alternative for this purpose.
Objective
To construct and validate a synthetic population that statistically mimics the characteristics and spatial disease distribution of a real population, using real and synthetic data.
Design, Setting, and Participants
This population-based decision analytical model used data for Allegheny County, Pennsylvania, collected from January 2015 to December 2016, to build a semisynthetic population based on the synthetic population used by the modeling and simulation platform FRED (A Framework for Reconstructing Epidemiological Dynamics). Disease status was assigned to this population using health insurer claims data from the 3 major insurance providers in the county or from the National Health and Nutrition Examination Survey. Biological, social, and other variables were also obtained from the National Health Interview Survey, Allegheny County, and public databases. Data analysis was performed from November 2016 to February 2020.
Exposures
Risk of cardiovascular disease (CVD) death.
Main Outcomes and Measures
Difference between expected and observed CVD death risk. A validated risk equation was used to estimate CVD death risk.
Results
The synthetic population comprised 1 188 112 individuals with demographic characteristics similar to those of the 2010 census population in the same county. In the synthetic population, the mean (SD) age was 40.6 (23.3) years, and 622 997 were female individuals (52.4%). Mean (SD) observed 4-year rate of excess CVD death risk at the census tract level was -40 (523) per 100 000 persons. The correlation of social determinant data with difference between expected and observed CVD death risk indicated that income- and education-based social determinants were associated with risk. Estimating improved social determinants of health and biological factors associated with disease did not entirely remove the excess in CVD death rates. That is, a 20% improvement in the most significant determinants still resulted in 105 census tracts with excess CVD death risk, which represented 24% of the county population.
Conclusions and Relevance
The results of this study suggest that creating a geographically explicit synthetic population from real and synthetic data is feasible and that synthetic populations are useful for modeling disease in large populations and for estimating the outcome of interventions.
Affiliation(s)
- Mary G. Krauland: Department of Health Policy and Management, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania; Public Health Dynamics Laboratory, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania
- Robert J. Frankeny: Public Health Dynamics Laboratory, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania
- Josh Lewis: Allegheny County Department of Health, Pittsburgh, Pennsylvania
- LuAnn Brink: Allegheny County Department of Health, Pittsburgh, Pennsylvania
- Eric G. Hulsey: Allegheny County Department of Health, Pittsburgh, Pennsylvania
- Mark S. Roberts: Department of Health Policy and Management, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania; Public Health Dynamics Laboratory, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania
- Karen A. Hacker: Department of Health Policy and Management, University of Pittsburgh Graduate School of Public Health, Pittsburgh, Pennsylvania; Allegheny County Department of Health, Pittsburgh, Pennsylvania
10. Xu Z, Graves PM, Lau CL, Clements A, Geard N, Glass K. GEOFIL: A spatially-explicit agent-based modelling framework for predicting the long-term transmission dynamics of lymphatic filariasis in American Samoa. Epidemics 2018; 27:19-27. [PMID: 30611745 DOI: 10.1016/j.epidem.2018.12.003] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 08/03/2018] [Revised: 12/22/2018] [Accepted: 12/28/2018] [Indexed: 10/27/2022]
Abstract
In this study, a spatially-explicit agent-based modelling framework, GEOFIL, was developed to predict lymphatic filariasis (LF) transmission dynamics in American Samoa. GEOFIL included individual-level information on age, gender, disease status, household location, household members, workplace/school location and colleagues/schoolmates at each time step during the simulation. In American Samoa, annual mass drug administration from 2000 to 2006 dramatically reduced LF prevalence. However, GEOFIL predicted a continual increase in microfilaraemia prevalence in the absence of further intervention. Evidence from seroprevalence and transmission assessment surveys conducted from 2010 to 2016 indicated a resurgence of LF in American Samoa, corroborating GEOFIL's predictions. The microfilaraemia and antigenaemia prevalence in 6-7-year-old children were much lower than in the overall population. Mosquito biting rates were found to be a critical determinant of infection risk. Transmission hotspots are likely to disappear with lower biting rates. GEOFIL highlights current knowledge gaps, such as data on mosquito abundance, biting rates and within-host parasite dynamics, which are important for improving the accuracy of model predictions.
Affiliation(s)
- Zhijing Xu: Research School of Population Health, The Australian National University, Australia
- Patricia M Graves: College of Public Health, Medical and Veterinary Sciences, Division of Tropical Health and Medicine, James Cook University, Australia
- Colleen L Lau: Research School of Population Health, The Australian National University, Australia
- Nicholas Geard: School of Computing and Information Systems, The University of Melbourne, Australia; The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, Australia; Melbourne School of Population and Global Health, The University of Melbourne, Australia
- Kathryn Glass: Research School of Population Health, The Australian National University, Australia
11. Population Synthesis Handling Three Geographical Resolutions. ISPRS International Journal of Geo-Information 2018. [DOI: 10.3390/ijgi7050174] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 11/16/2022]