1
Lotspeich SC, Mullan AE, McGowan LD, Hepler SA. Combining Straight-Line and Map-Based Distances to Investigate the Connection Between Proximity to Healthy Foods and Disease. Stat Med 2025; 44:e70054. [PMID: 40226886 PMCID: PMC11995689 DOI: 10.1002/sim.70054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2024] [Revised: 01/20/2025] [Accepted: 02/26/2025] [Indexed: 04/15/2025]
Abstract
Healthy foods are essential for a healthy life, but accessing healthy food can be more challenging for some people than others. This disparity in food access may lead to disparities in well-being, potentially with disproportionate rates of diseases in communities that face more challenges in accessing healthy food (i.e., low-access communities). Identifying low-access, high-risk communities for targeted interventions is a public health priority, but current methods to quantify food access rely on distance measures that are either computationally simple (like the length of the shortest straight-line route) or accurate (like the length of the shortest map-based driving route), but not both. We propose a multiple imputation approach to combine these distance measures, allowing researchers to harness the computational ease of one with the accuracy of the other. The approach incorporates straight-line distances for all neighborhoods and map-based distances for just a subset, offering comparable estimates to the "gold standard" model using map-based distances for all neighborhoods and improved efficiency over the "complete case" model using map-based distances for just the subset. Through the adoption of a measurement error framework, information from the straight-line distances can be leveraged to compute informative placeholders (i.e., impute) for any neighborhoods without map-based distances. Using simulations and data for the Piedmont Triad region of North Carolina, we quantify and compare the associations between two health outcomes (diabetes and obesity) and neighborhood-level access to healthy foods. The imputation procedure also makes it possible to predict the full landscape of food access in an area without requiring map-based measurements for all neighborhoods.
Affiliation(s)
- Sarah C. Lotspeich
- Department of Statistical Sciences, Wake Forest University, Winston‐Salem, North Carolina, USA
- Ashley E. Mullan
- Department of Statistical Sciences, Wake Forest University, Winston‐Salem, North Carolina, USA
- Staci A. Hepler
- Department of Statistical Sciences, Wake Forest University, Winston‐Salem, North Carolina, USA
2
de Souza JS, Barbian MH, dos Reis RCP. Comparison of calibration methods in the analysis of 2013 Brazilian National Health Survey data. Revista Brasileira de Epidemiologia 2025; 28:e250005. [PMID: 40008745 PMCID: PMC11849994 DOI: 10.1590/1980-549720250005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2024] [Revised: 10/09/2024] [Accepted: 11/07/2024] [Indexed: 02/27/2025]
Abstract
OBJECTIVE: This study aims to compare calibration methods for weights in the Laboratory Exams subsample of the 2013 Brazilian National Health Survey (PNS), seeking to assess their representativeness and precision. METHODS: Two alternative sets of calibrated weights were constructed using post-stratification and raking methods. The weights provided for the Laboratory Exams subsample were compared with the two proposed sets of weights through parameter estimates using the 2013 PNS subsample data. Additionally, seven measures were used to assess the performance of the proposed weighting systems. RESULTS: The alternative post-stratification and raking weights produced generalizable estimates for the target population of the 2013 PNS, while the original weights did not. The alternative methods showed similar performance to the original method, with a slight advantage for raking in some evaluation measures. CONCLUSION: It is recommended that basic design weights be documented and included in the public-use data files of the PNS. It is also suggested that information be cross-referenced between the sample and subsample of the 2013 PNS to enable the exploration of methods such as data imputation, with the aim of obtaining more accurate and representative estimates. These improvements are essential to ensure the quality and usefulness of PNS data in epidemiological and public health studies.
Affiliation(s)
- Juliana Sena de Souza
- Universidade Federal do Rio Grande do Sul, Graduate Program in Epidemiology – Porto Alegre (RS), Brazil
- Márcia Helena Barbian
- Universidade Federal do Rio Grande do Sul, Department of Statistics – Porto Alegre (RS), Brazil
- Rodrigo Citton Padilha dos Reis
- Universidade Federal do Rio Grande do Sul, Graduate Program in Epidemiology – Porto Alegre (RS), Brazil
- Universidade Federal do Rio Grande do Sul, Department of Statistics – Porto Alegre (RS), Brazil
3
Levis AW, Mukherjee R, Wang R, Fischer H, Haneuse S. Double Sampling for Informatively Missing Data in Electronic Health Record-Based Comparative Effectiveness Research. Stat Med 2024; 43:6086-6098. [PMID: 39638313 PMCID: PMC11639654 DOI: 10.1002/sim.10298] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2023] [Revised: 07/14/2024] [Accepted: 11/21/2024] [Indexed: 12/07/2024]
Abstract
Missing data arise in most applied settings and are ubiquitous in electronic health records (EHR). When data are missing not at random (MNAR) with respect to measured covariates, sensitivity analyses are often considered. These solutions, however, are often unsatisfying in that they are not guaranteed to yield actionable conclusions. Motivated by an EHR-based study of long-term outcomes following bariatric surgery, we consider the use of double sampling as a means to mitigate MNAR outcome data when the statistical goals are estimation and inference regarding causal effects. We describe assumptions that are sufficient for the identification of the joint distribution of confounders, treatment, and outcome under this design. Additionally, we derive efficient and robust estimators of the average causal treatment effect under a nonparametric model and under a model assuming outcomes were, in fact, initially missing at random (MAR). We compare these in simulations to an approach that adaptively estimates based on evidence of violation of the MAR assumption. Finally, we also show that the proposed double sampling design can be extended to handle arbitrary coarsening mechanisms, and derive nonparametric efficient estimators of any smooth full data functional.
Affiliation(s)
- Alexander W. Levis
- Department of Statistics & Data Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
- Rajarshi Mukherjee
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts
- Rui Wang
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts
- Department of Population Medicine, Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, Massachusetts
- Heidi Fischer
- Department of Research and Evaluation, Kaiser Permanente, Pasadena, California, USA
- Sebastien Haneuse
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts
4
Hasler J, Ma Y, Wei Y, Parikh R, Chen J. A semiparametric method for risk prediction using integrated electronic health record data. Ann Appl Stat 2024; 18:3318-3337. [PMID: 40134753 PMCID: PMC11934126 DOI: 10.1214/24-aoas1938] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2025]
Abstract
When using electronic health records (EHRs) for clinical and translational research, additional data is often available from external sources to enrich the information extracted from EHRs. For example, academic biobanks have more granular data available, and patient reported data is often collected through small-scale surveys. It is common that the external data is available only for a small subset of patients who have EHR information. We propose efficient and robust methods for building and evaluating models for predicting the risk of binary outcomes using such integrated EHR data. Our method is built upon an idea derived from the two-phase design literature that modeling the availability of a patient's external data as a function of an EHR-based preliminary predictive score leads to effective utilization of the EHR data. Through both theoretical and simulation studies, we show that our method has high efficiency for estimating log-odds ratio parameters, the area under the ROC curve, as well as other measures for quantifying predictive accuracy. We apply our method to develop a model for predicting the short-term mortality risk of oncology patients, where the data was extracted from the University of Pennsylvania hospital system EHR and combined with survey-based patient reported outcome data.
Affiliation(s)
- Yanyuan Ma
- Department of Statistics, Pennsylvania State University
- Yizheng Wei
- Department of Statistics, University of South Carolina
- Ravi Parikh
- Departments of Medicine and Health Policy and Medicine, University of Pennsylvania
- Jinbo Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
5
Lotspeich SC, Amorim GGC, Shaw PA, Tao R, Shepherd BE. Optimal multiwave validation of secondary use data with outcome and exposure misclassification. Can J Stat 2024; 52:532-554. [PMID: 39629097 PMCID: PMC11610482 DOI: 10.1002/cjs.11772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/17/2022] [Accepted: 12/23/2022] [Indexed: 04/03/2023]
Abstract
Observational databases provide unprecedented opportunities for secondary use in biomedical research. However, these data can be error-prone and must be validated before use. It is usually unrealistic to validate the whole database because of resource constraints. A cost-effective alternative is a two-phase design that validates a subset of records enriched for information about a particular research question. We consider odds ratio estimation under differential outcome and exposure misclassification and propose optimal designs that minimize the variance of the maximum likelihood estimator. Our adaptive grid search algorithm can locate the optimal design in a computationally feasible manner. Because the optimal design relies on unknown parameters, we introduce a multiwave strategy to approximate the optimal design. We demonstrate the proposed design's efficiency gains through simulations and two large observational studies.
Affiliation(s)
- Sarah C. Lotspeich
- Department of Statistical Sciences, Wake Forest University, Winston-Salem, 27109, North Carolina, U.S.A.
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, 37203, Tennessee, U.S.A.
- Gustavo G. C. Amorim
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, 37203, Tennessee, U.S.A.
- Pamela A. Shaw
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, 19104, Pennsylvania, U.S.A.
- Biostatistics Unit, Kaiser Permanente Washington Health Research Institute, Seattle, 98101, Washington, U.S.A.
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, 37203, Tennessee, U.S.A.
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, 37232, Tennessee, U.S.A.
- Bryan E. Shepherd
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, 37203, Tennessee, U.S.A.
6
Amorim G, Tao R, Lotspeich S, Shaw PA, Lumley T, Patel RC, Shepherd BE. Three-phase generalized raking and multiple imputation estimators to address error-prone data. Stat Med 2024; 43:379-394. [PMID: 37987515 PMCID: PMC10842111 DOI: 10.1002/sim.9967] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Revised: 09/23/2023] [Accepted: 11/09/2023] [Indexed: 11/22/2023]
Abstract
Validation studies are often used to obtain more reliable information in settings with error-prone data. Validated data on a subsample of subjects can be used together with error-prone data on all subjects to improve estimation. In practice, more than one round of data validation may be required, and direct application of standard approaches for combining validation data into analyses may lead to inefficient estimators since the information available from intermediate validation steps is only partially considered or even completely ignored. In this paper, we present two novel extensions of multiple imputation and generalized raking estimators that make full use of all available data. We show through simulations that incorporating information from intermediate steps can lead to substantial gains in efficiency. This work is motivated by and illustrated in a study of contraceptive effectiveness among 83 671 women living with HIV, whose data were originally extracted from electronic medical records, of whom 4732 had their charts reviewed, and a subsequent 1210 also had a telephone interview to validate key study variables.
Affiliation(s)
- Gustavo Amorim
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Ran Tao
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Sarah Lotspeich
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Statistical Sciences, Wake Forest University, Winston-Salem, NC, USA
- Pamela A. Shaw
- Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- Thomas Lumley
- Department of Statistics, University of Auckland, Auckland, New Zealand
- Rena C. Patel
- Department of Medicine, University of Washington, Seattle, WA, USA
- Bryan E. Shepherd
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
7
Lotspeich SC, Shepherd BE, Kariuki MA, Wools-Kaloustian K, McGowan CC, Musick B, Semeere A, Crabtree Ramírez BE, Mkwashapi DM, Cesar C, Ssemakadde M, Machado DM, Ngeresa A, Ferreira FF, Lwali J, Marcelin A, Cardoso SW, Luque MT, Otero L, Cortés CP, Duda SN. Lessons learned from over a decade of data audits in international observational HIV cohorts in Latin America and East Africa. J Clin Transl Sci 2023; 7:e245. [PMID: 38033704 PMCID: PMC10685260 DOI: 10.1017/cts.2023.659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/02/2023] [Revised: 10/13/2023] [Accepted: 10/16/2023] [Indexed: 12/02/2023]
Abstract
Introduction: Routine patient care data are increasingly used for biomedical research, but such "secondary use" data have known limitations, including their quality. When leveraging routine care data for observational research, developing audit protocols that can maximize informational return and minimize costs is paramount. Methods: For more than a decade, the Latin America and East Africa regions of the International epidemiology Databases to Evaluate AIDS (IeDEA) consortium have been auditing the observational data drawn from participating human immunodeficiency virus clinics. Since our earliest audits, where external auditors used paper forms to record audit findings from paper medical records, we have streamlined our protocols to obtain more efficient and informative audits that keep up with advancing technology while reducing travel obligations and associated costs. Results: We present five key lessons learned from conducting data audits of secondary-use data from resource-limited settings for more than 10 years and share eight recommendations for other consortia looking to implement data quality initiatives. Conclusion: After completing multiple audit cycles in both the Latin America and East Africa regions of the IeDEA consortium, we have established a rich reference for data quality in our cohorts, as well as large, audited analytical datasets that can be used to answer important clinical questions with confidence. By sharing our audit processes and how they have been adapted over time, we hope that others can develop protocols informed by our lessons learned from more than a decade of experience in these large, diverse cohorts.
Affiliation(s)
- Sarah C. Lotspeich
- Department of Statistical Sciences, Wake Forest University, Winston-Salem, NC, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Bryan E. Shepherd
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
- Kara Wools-Kaloustian
- Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA
- Catherine C. McGowan
- Division of Infectious Diseases, Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Beverly Musick
- Department of Biostatistics, Indiana University School of Medicine, Indianapolis, IN, USA
- Aggrey Semeere
- Infectious Diseases Institute, Makerere University, Kampala, Uganda
- Brenda E. Crabtree Ramírez
- Department of Infectious Diseases, Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Mexico City, Mexico
- Denna M. Mkwashapi
- Sexual and Reproductive Health Program, National Institute for Medical Research Mwanza, United Republic of Tanzania, Mwanza, Tanzania
- Daisy Maria Machado
- Departamento de Pediatria, Universidade Federal de São Paulo, São Paulo, Brazil
- Antony Ngeresa
- Academic Model Providing Access to Health Care (AMPATH), Eldoret, Kenya
- Jerome Lwali
- Tumbi Hospital HIV Care and Treatment Clinic, United Republic of Tanzania, Kibaha, Tanzania
- Adias Marcelin
- Le Groupe Haïtien d’Etude du Sarcome de Kaposi et des Infections Opportunistes, Port-au-Prince, Haiti
- Marco Tulio Luque
- Instituto Hondureño de Seguridad Social and Hospital Escuela Universitario, Tegucigalpa, Honduras
- Larissa Otero
- Instituto de Medicina Tropical Alexander von Humboldt, Universidad Peruana Cayetano Heredia, Lima, Peru
- School of Medicine, Universidad Peruana Cayetano Heredia, Lima, Peru
- Stephany N. Duda
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
8
Shepherd BE, Han K, Chen T, Bian A, Pugh S, Duda SN, Lumley T, Heerman WJ, Shaw PA. Multiwave validation sampling for error-prone electronic health records. Biometrics 2023; 79:2649-2663. [PMID: 35775996 PMCID: PMC10525037 DOI: 10.1111/biom.13713] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/01/2022] [Accepted: 06/16/2022] [Indexed: 11/29/2022]
Abstract
Electronic health record (EHR) data are increasingly used for biomedical research, but these data have recognized data quality challenges. Data validation is necessary to use EHR data with confidence, but limited resources typically make complete data validation impossible. Using EHR data, we illustrate prospective, multiwave, two-phase validation sampling to estimate the association between maternal weight gain during pregnancy and the risks of her child developing obesity or asthma. The optimal validation sampling design depends on the unknown efficient influence functions of regression coefficients of interest. In the first wave of our multiwave validation design, we estimate the influence function using the unvalidated (phase 1) data to determine our validation sample; then in subsequent waves, we re-estimate the influence function using validated (phase 2) data and update our sampling. For efficiency, estimation combines obesity and asthma sampling frames while calibrating sampling weights using generalized raking. We validated 996 of 10,335 mother-child EHR dyads in six sampling waves. Estimated associations between childhood obesity/asthma and maternal weight gain, as well as other covariates, are compared to naïve estimates that only use unvalidated data. In some cases, estimates markedly differ, underscoring the importance of efficient validation sampling to obtain accurate estimates incorporating validated data.
Affiliation(s)
- Bryan E. Shepherd
- Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, USA
- Kyunghee Han
- Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago
- Tong Chen
- Department of Statistics, University of Auckland
- Aihua Bian
- Department of Biostatistics, Vanderbilt University, Nashville, Tennessee, USA
- Shannon Pugh
- Department of Emergency Medicine, Vanderbilt University Medical Center
- Stephany N. Duda
- Department of Biomedical Informatics, Vanderbilt University Medical Center
- Pamela A. Shaw
- Biostatistics Unit, Kaiser Permanente Washington Health Research Institute
9
Chen T, Lumley T. Optimal sampling for design-based estimators of regression models. Stat Med 2022; 41:1482-1497. [PMID: 34989429 PMCID: PMC8918008 DOI: 10.1002/sim.9300] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 06/15/2021] [Revised: 12/02/2021] [Accepted: 12/10/2021] [Indexed: 11/05/2022]
Abstract
Two-phase designs measure variables of interest on a subcohort where the outcome and covariates are readily available or cheap to collect on all individuals in the cohort. Given limited resource availability, it is of interest to find an optimal design that includes more informative individuals in the final sample. We explore the optimal designs and efficiencies for analyses by design-based estimators. Generalized raking estimators form an efficient class of design-based estimators that improve on the inverse-probability weighted (IPW) estimator by adjusting weights based on auxiliary information. We derive a closed-form solution of the optimal design for estimating regression coefficients from generalized raking estimators. We compare it with the optimal design for analysis via the IPW estimator and other two-phase designs in measurement-error settings. We consider general two-phase designs where the outcome variable and variables of interest can be continuous or discrete. Our results show that the optimal designs for analyses by the two classes of design-based estimators can be very different. The optimal design for analysis via the IPW estimator is optimal for IPW estimation and typically gives near-optimal efficiency for generalized raking estimation, though we show there is potential improvement in some settings.
Affiliation(s)
- Tong Chen
- Department of Statistics, University of Auckland, Auckland, New Zealand
- Thomas Lumley
- Department of Statistics, University of Auckland, Auckland, New Zealand
10
Shepherd BE, Shaw PA. Errors in multiple variables in human immunodeficiency virus (HIV) cohort and electronic health record data: statistical challenges and opportunities. Statistical Communications in Infectious Diseases 2020; 12:20190015. [PMID: 35880997 PMCID: PMC9204761 DOI: 10.1515/scid-2019-0015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/15/2019] [Accepted: 08/21/2020] [Indexed: 06/15/2023]
Abstract
Objectives: Observational data derived from patient electronic health records (EHR) are increasingly used for human immunodeficiency virus/acquired immunodeficiency syndrome (HIV/AIDS) research. There are challenges to using these data, in particular with regard to data quality; some are recognized, some unrecognized, and some recognized but ignored. There are great opportunities for the statistical community to improve inference by incorporating validation subsampling into analyses of EHR data. Methods: Methods to address measurement error, misclassification, and missing data are relevant, as are sampling designs such as two-phase sampling. However, many of the existing statistical methods for measurement error, for example, only address relatively simple settings, whereas the errors seen in these datasets span multiple variables (both predictors and outcomes), are correlated, and even affect who is included in the study. Results/Conclusion: We will discuss some preliminary methods in this area with a particular focus on time-to-event outcomes and outline areas of future research.
Affiliation(s)
- Bryan E. Shepherd
- Biostatistics, Vanderbilt University, 2525 West End, Suite 11000, 37203 Nashville, Tennessee, USA
- Pamela A. Shaw
- Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA