1
Salvatore M, Mondul AM, Friese CR, Hanauer D, Xu H, Pearce CL, Mukherjee B. Impacts of sample weighting on transferability of risk prediction models across EHR-linked biobanks with different recruitment strategies. J Biomed Inform 2025; 167:104853. PMID: 40398830. DOI: 10.1016/j.jbi.2025.104853. Received: 01/24/2025; Revised: 04/15/2025; Accepted: 05/18/2025.
Abstract
OBJECTIVE: To evaluate whether using poststratification (PS) weights when training risk prediction models improves transferability when the external test cohort has a different sampling strategy, a scenario commonly encountered when analyzing electronic health record (EHR)-linked biobanks.
METHODS: PS weights were calculated to align a health system-based biobank, the Michigan Genomics Initiative (MGI; n = 76,757), with a nationally recruited biobank, All of Us (AOU; n = 226,764), which oversamples underrepresented groups. Basic PS weights (PSBASIC) captured age, sex, and race/ethnicity; full PS weights (PSFULL) additionally included smoking, alcohol consumption, BMI, depression, hypertension, and the Charlson Comorbidity Index. Models for esophageal, liver, and pancreatic cancers were developed using EHR data from MGI at 0, 1, 2, and 5 years before diagnosis. Phenotype risk scores (PheRS) were constructed using six methods (e.g., regularized regression, random forest) and evaluated alongside covariates, risk factors, and symptoms. Evaluation metrics included the odds ratio (OR) for the top decile vs. the middle 40th-60th percentiles of the risk score distribution and the area under the receiver operating characteristic curve (AUC), both computed in the AOU test cohort for models trained with and without weighting.
RESULTS: Elastic net and random forest methods generally performed well in risk stratification, but no single PheRS construction method consistently outperformed the others. Applying PS weights did not consistently improve risk stratification performance. For example, for liver cancer at t = 1, the unweighted random forest PheRS yielded an OR of 13.73 (95% CI: 8.97, 21.01), compared with 14.55 (95% CI: 9.45, 22.42) with PSBASIC and 13.62 (95% CI: 8.90, 20.85) with PSFULL.
CONCLUSION: PS weights do not significantly enhance risk model transferability between biobanks. EHR-based PheRS are valuable for risk stratification and should be integrated with other multimodal data for improved risk prediction. Identifying high-risk populations early for diseases such as liver cancer through health history mining shows promise.
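The two core ingredients of this abstract, poststratification weights that match a training cohort's stratum mix to a target cohort, and the top-decile vs. middle 40th-60th percentile odds ratio, can be sketched as follows. This is a minimal illustration on synthetic data: the single stratification variable, the stratum proportions, and the risk score are hypothetical stand-ins, not the paper's variables or results (the paper's basic weights stratify on age, sex, and race/ethnicity).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic training cohort stratified on one hypothetical variable
n_train = 5000
train = pd.DataFrame({
    "stratum": rng.choice(["A", "B", "C"], size=n_train, p=[0.6, 0.3, 0.1]),
})
# Hypothetical stratum mix of the target (test) cohort
target_props = pd.Series({"A": 0.4, "B": 0.35, "C": 0.25})

# Poststratification weight = target proportion / training proportion,
# so the reweighted training cohort matches the target's stratum mix
train_props = train["stratum"].value_counts(normalize=True)
train["ps_weight"] = train["stratum"].map(target_props / train_props)

def top_decile_or(score, case):
    """OR comparing cases vs. controls in the top decile of `score`
    against the middle 40th-60th percentile band."""
    q = np.quantile(score, [0.4, 0.6, 0.9])
    top = score >= q[2]
    mid = (score >= q[0]) & (score < q[1])
    a, b = case[top].sum(), (~case[top]).sum()  # top decile: cases, controls
    c, d = case[mid].sum(), (~case[mid]).sum()  # middle band: cases, controls
    return (a * d) / (b * c)

# Hypothetical risk score in which cases score higher on average
case = rng.random(n_train) < 0.05
score = rng.normal(loc=case.astype(float), scale=1.0)
print(f"top-decile OR: {top_decile_or(score, case):.2f}")
```

After weighting, the weighted stratum proportions of the training cohort equal `target_props` by construction; the OR metric is then evaluated on the (weighted or unweighted) model's scores in the test cohort.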
Affiliation(s)
- Maxwell Salvatore
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA; Center for Precision Health Data Science, University of Michigan, Ann Arbor, MI, USA
- Alison M Mondul
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA; Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
- Christopher R Friese
- Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA; Department of Systems, Populations, and Leadership, School of Nursing, University of Michigan, Ann Arbor, MI, USA; Department of Health Management and Policy, University of Michigan, Ann Arbor, MI, USA
- David Hanauer
- Department of Learning Health Sciences, University of Michigan Medical School, Ann Arbor, MI, USA
- Hua Xu
- Department of Biomedical Informatics and Data Science, Yale University, New Haven, CT, USA
- Celeste Leigh Pearce
- Department of Epidemiology, University of Michigan, Ann Arbor, MI, USA; Rogel Cancer Center, University of Michigan, Ann Arbor, MI, USA
2
Yang Y, Dempsey W, Han P, Deshmukh Y, Richardson S, Tom B, Mukherjee B. Exploring the Big Data Paradox for various estimands using vaccination data from the global COVID-19 Trends and Impact Survey (CTIS). Science Advances 2024; 10:eadj0266. PMID: 38820165. PMCID: PMC11314312. DOI: 10.1126/sciadv.adj0266. Received: 06/23/2023; Accepted: 04/26/2024.
Abstract
Selection bias poses a substantial challenge to valid statistical inference in nonprobability samples. This study compared estimates of first-dose COVID-19 vaccination rates among Indian adults in 2021 from a large nonprobability sample, the COVID-19 Trends and Impact Survey (CTIS), and a small probability survey, the Center for Voting Opinions and Trends in Election Research (CVoter) survey, against national benchmark data from the COVID Vaccine Intelligence Network. Notably, CTIS exhibited a larger estimation error on average (0.37) than CVoter (0.14). We also explored the accuracy (in terms of mean squared error) of CTIS in estimating successive differences (over time) and subgroup differences (females versus males) in mean vaccine uptake. Compared with the overall vaccination rates, targeting these alternative estimands, which compare differences or relative differences of two means, increased the effective sample size. These results suggest that the Big Data Paradox can manifest in countries beyond the United States and may not apply equally to every estimand of interest.
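The estimation error and effective sample size discussed in this abstract can be illustrated with a short sketch. This uses a deliberately simplified, MSE-based definition of effective sample size for a proportion (the size of a simple random sample whose expected error would match the observed squared error), which is in the spirit of the Big Data Paradox literature but is not the paper's estimand-specific derivation; all numeric inputs below are hypothetical, not the study's data.

```python
def estimation_error(estimate, benchmark):
    """Absolute error of a survey estimate against the benchmark rate."""
    return abs(estimate - benchmark)

def effective_sample_size(estimate, benchmark, p_benchmark):
    """Simplified MSE-based effective sample size for a proportion:
    n_eff such that a simple random sample of that size would have
    variance p(1-p)/n_eff equal to this estimate's squared error."""
    mse = (estimate - benchmark) ** 2
    var_unit = p_benchmark * (1 - p_benchmark)  # per-draw Bernoulli variance
    return var_unit / mse

# Hypothetical numbers: a very large nonprobability sample far from the
# benchmark vs. a small probability sample close to it
benchmark = 0.55                 # hypothetical true first-dose rate
big_est, small_est = 0.92, 0.58  # hypothetical biased and near-unbiased estimates

print(round(estimation_error(big_est, benchmark), 2))
print(round(effective_sample_size(big_est, benchmark, benchmark), 1))
print(round(effective_sample_size(small_est, benchmark, benchmark), 1))
```

The point of the paradox is visible even in this toy setup: a badly selected sample of millions can carry the information content of a simple random sample of only a handful of respondents, while a small but well-designed probability sample retains a much larger effective size.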
Affiliation(s)
- Youqi Yang
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Walter Dempsey
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA
- Peisong Han
- Biostatistics Innovation Group, Gilead Sciences, Foster City, CA, USA
- Yashwant Deshmukh
- Center for Voting Opinions and Trends in Election Research, Noida, India
- Brian Tom
- MRC Biostatistics Unit, University of Cambridge, Cambridge, UK
- Bhramar Mukherjee
- Department of Biostatistics, University of Michigan, Ann Arbor, MI, USA