1
Moccia C, Moirano G, Popovic M, Pizzi C, Fariselli P, Richiardi L, Ekstrøm CT, Maule M. Machine learning in causal inference for epidemiology. Eur J Epidemiol 2024; 39:1097-1108. [PMID: 39535572 PMCID: PMC11599438 DOI: 10.1007/s10654-024-01173-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Accepted: 10/21/2024] [Indexed: 11/16/2024]
Abstract
In causal inference, parametric models are usually employed to address causal questions by estimating the effect of interest. However, parametric models rely on the assumption of correct model specification, which, if not met, leads to biased effect estimates. Correct model specification is challenging, especially in high-dimensional settings. Incorporating Machine Learning (ML) into causal analyses may reduce the bias arising from model misspecification, since ML methods do not require the specification of a functional form of the relationship between variables. However, when ML predictions are plugged directly into a predefined formula for the effect of interest, there is a risk of introducing a "plug-in bias" in the effect measure. To overcome this problem and to achieve useful asymptotic properties, new estimators that combine the predictive potential of ML and the ability of traditional statistical methods to make inference about population parameters have been proposed. For epidemiologists interested in taking advantage of ML for causal inference investigations, we provide an overview of three estimators that represent the current state of the art, namely Targeted Maximum Likelihood Estimation (TMLE), Augmented Inverse Probability Weighting (AIPW) and Double/Debiased Machine Learning (DML).
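To illustrate how a correction term is added to a plug-in estimate, the sketch below computes an AIPW estimate of the average treatment effect from nuisance predictions that are assumed to be already fitted. It is a minimal sketch and not the authors' implementation: the function name `aipw_ate` and the toy values are hypothetical, the propensity scores `ps` and outcome predictions `mu1` and `mu0` could come from any ML learner, and cross-fitting and variance estimation are omitted.

```python
from statistics import mean


def aipw_ate(y, a, ps, mu1, mu0):
    """AIPW estimate of the average treatment effect (ATE).

    y   : observed outcomes
    a   : binary exposure indicators (1 = exposed, 0 = unexposed)
    ps  : estimated propensity scores P(A=1 | W), strictly between 0 and 1
    mu1 : predicted outcomes under exposure, E[Y | A=1, W]
    mu0 : predicted outcomes under no exposure, E[Y | A=0, W]
    """
    terms = []
    for yi, ai, pi, m1, m0 in zip(y, a, ps, mu1, mu0):
        # Plug-in part: difference of the two outcome predictions.
        plug_in = m1 - m0
        # Correction part: inverse-probability-weighted residuals.
        correction = ai * (yi - m1) / pi - (1 - ai) * (yi - m0) / (1 - pi)
        terms.append(plug_in + correction)
    return mean(terms)


# Toy data (hypothetical values, for illustration only).
y = [1.0, 0.0, 1.0, 0.0]
a = [1, 1, 0, 0]
ps = [0.5, 0.5, 0.5, 0.5]
mu1 = [0.6, 0.6, 0.6, 0.6]
mu0 = [0.4, 0.4, 0.4, 0.4]
print(aipw_ate(y, a, ps, mu1, mu0))
```

In this toy example the outcome predictions alone would suggest an effect of 0.2, while the weighted residuals pull the estimate back toward the difference in observed means between the two exposure groups.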
Affiliation(s)
- Chiara Moccia
  - Cancer Epidemiology Unit, Department of Medical Sciences, University of Turin and CPO Piedmont, Via Santena 7, Turin, 10126, Italy
- Giovenale Moirano
  - Cancer Epidemiology Unit, Department of Medical Sciences, University of Turin and CPO Piedmont, Via Santena 7, Turin, 10126, Italy
- Maja Popovic
  - Cancer Epidemiology Unit, Department of Medical Sciences, University of Turin and CPO Piedmont, Via Santena 7, Turin, 10126, Italy
- Costanza Pizzi
  - Cancer Epidemiology Unit, Department of Medical Sciences, University of Turin and CPO Piedmont, Via Santena 7, Turin, 10126, Italy
- Piero Fariselli
  - Department of Medical Sciences, University of Turin, Turin, Italy
- Lorenzo Richiardi
  - Cancer Epidemiology Unit, Department of Medical Sciences, University of Turin and CPO Piedmont, Via Santena 7, Turin, 10126, Italy
- Claus Thorn Ekstrøm
  - Section of Biostatistics, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
- Milena Maule
  - Cancer Epidemiology Unit, Department of Medical Sciences, University of Turin and CPO Piedmont, Via Santena 7, Turin, 10126, Italy
2
Ghosh R, Gutierrez JP, de Jesús Ascencio-Montiel I, Juárez-Flores A, Bertozzi SM. SARS-CoV-2 infection by trimester of pregnancy and adverse perinatal outcomes: a Mexican retrospective cohort study. BMJ Open 2024; 14:e075928. [PMID: 38604636 PMCID: PMC11015228 DOI: 10.1136/bmjopen-2023-075928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Accepted: 03/04/2024] [Indexed: 04/13/2024] Open
Abstract
OBJECTIVE Conflicting evidence exists for the association between COVID-19 and adverse perinatal outcomes. This study examined the associations between maternal COVID-19 during pregnancy and adverse perinatal outcomes, including preterm birth (PTB), low birth weight (LBW), small-for-gestational age (SGA), large-for-gestational age (LGA) and fetal death, as well as whether the associations differ by trimester of infection.
DESIGN AND SETTING The study used a retrospective birth cohort from the Instituto Mexicano del Seguro Social (IMSS), Mexico, between January 2020 and November 2021.
PARTICIPANTS We used the social security administrative dataset from IMSS that had COVID-19 information and linked it with the IMSS routine hospitalisation dataset to identify deliveries in the study period with a test for SARS-CoV-2 during pregnancy.
OUTCOME MEASURES PTB, LBW, SGA, LGA and fetal death. We used targeted maximum likelihood estimators to quantify associations (risk ratio, RR) and CIs. We fit models for the overall COVID-19 sample, separately for those with mild or severe disease, and by trimester of infection. Additionally, we investigated potential bias induced by missing non-tested pregnancies.
RESULTS The overall sample comprised 17 340 singleton pregnancies, of which 30% tested positive. We found that those with mild COVID-19 had an RR of 0.89 (95% CI 0.80 to 0.99) for PTB and those with severe COVID-19 had an RR of 1.53 (95% CI 1.07 to 2.19) for LGA. COVID-19 in the first trimester was associated with fetal death, RR=2.36 (95% CI 1.04 to 5.36). Results also demonstrate that missing non-tested pregnancies might induce bias in the associations.
CONCLUSIONS In the overall sample, there was no evidence of an association between COVID-19 and adverse perinatal outcomes. However, the findings suggest that severe COVID-19 may increase the risk of some perinatal outcomes, with the first trimester potentially being a high-risk period.
Affiliation(s)
- Rakesh Ghosh
  - Institute for Global Health Sciences, University of California San Francisco, San Francisco, California, USA
  - School of Public Health, University of California Berkeley, Berkeley, California, USA
- Juan Pablo Gutierrez
  - Center for Policy, Population & Health Research, National Autonomous University of Mexico, Mexico City, Mexico
- Arturo Juárez-Flores
  - Center for Policy, Population & Health Research, National Autonomous University of Mexico, Mexico City, Mexico
- Stefano M Bertozzi
  - School of Public Health, University of California Berkeley, Berkeley, California, USA
  - University of Washington - Seattle Campus, Seattle, Washington, USA
  - National Institute of Public Health, Cuernavaca, Mexico
3
Smith MJ, Phillips RV, Luque-Fernandez MA, Maringe C. Application of targeted maximum likelihood estimation in public health and epidemiological studies: a systematic review. Ann Epidemiol 2023; 86:34-48.e28. [PMID: 37343734 DOI: 10.1016/j.annepidem.2023.06.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2023] [Revised: 05/24/2023] [Accepted: 06/06/2023] [Indexed: 06/23/2023]
Abstract
PURPOSE The targeted maximum likelihood estimation (TMLE) statistical data analysis framework integrates machine learning, statistical theory, and statistical inference to provide a least biased, efficient, and robust strategy for estimation and inference of a variety of statistical and causal parameters. We describe and evaluate the epidemiological applications that have benefited from recent methodological developments.
METHODS We conducted a systematic literature review in PubMed for articles that applied any form of TMLE in observational studies. We summarized the epidemiological discipline, geographical location, expertise of the authors, and TMLE methods over time. We used the Roadmap of Targeted Learning and Causal Inference to extract key methodological aspects of the publications. We showcase the contributions to the literature of these TMLE results.
RESULTS Of the 89 publications included, 33% originated from the University of California at Berkeley, where the framework was first developed by Professor Mark van der Laan. By 2022, 59% of the publications originated from outside the United States and explored up to seven different epidemiological disciplines in 2021-2022. Double-robustness, bias reduction, and model misspecification were the main motivations that drew researchers toward the TMLE framework. Through time, a wide variety of methodological, tutorial, and software-specific articles were cited, owing to the constant growth of methodological developments around TMLE.
CONCLUSIONS There is a clear dissemination trend of the TMLE framework to various epidemiological disciplines and to increasing numbers of geographical areas. The availability of R packages, publication of tutorial papers, and involvement of methodological experts in applied publications have contributed to an exponential increase in the number of studies that recognized the benefits of TMLE and adopted it.
Affiliation(s)
- Matthew J Smith
  - Inequalities in Cancer Outcomes Network, London School of Hygiene and Tropical Medicine, London, UK
- Rachael V Phillips
  - Division of Biostatistics, School of Public Health, University of California at Berkeley, Berkeley, CA
- Miguel Angel Luque-Fernandez
  - Inequalities in Cancer Outcomes Network, London School of Hygiene and Tropical Medicine, London, UK
  - Department of Statistics and Operations Research, University of Granada, Granada, Spain
- Camille Maringe
  - Inequalities in Cancer Outcomes Network, London School of Hygiene and Tropical Medicine, London, UK
4
Yu W, Li S, Ye T, Xu R, Song J, Guo Y. Deep Ensemble Machine Learning Framework for the Estimation of PM2.5 Concentrations. Environ Health Perspect 2022; 130:37004. [PMID: 35254864 PMCID: PMC8901043 DOI: 10.1289/ehp9752] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 06/02/2021] [Revised: 02/14/2022] [Accepted: 02/14/2022] [Indexed: 05/29/2023]
Abstract
BACKGROUND Accurate estimation of historical PM2.5 (particulate matter with an aerodynamic diameter of less than 2.5 μm) is essential for environmental health risk assessment.
OBJECTIVES The aim of this study was to develop a multiple-level stacked ensemble machine learning framework for improving the estimation of the daily ground-level PM2.5 concentrations.
METHODS An innovative deep ensemble machine learning framework (DEML) was developed to estimate the daily PM2.5 concentrations. The framework has a three-stage structure: At the first stage, four base models [gradient boosting machine (GBM), support vector machine (SVM), random forest (RF), and eXtreme gradient boosting (XGBoost)] were used to generate a new data set of PM2.5 concentrations for training the next-stage learners. At the second stage, three meta-models [RF, XGBoost, and Generalized Linear Model (GLM)] were used to estimate PM2.5 concentrations using a combination of the original data set and the predictions from the first-stage models. At the third stage, a nonnegative least squares (NNLS) algorithm was employed to obtain the optimal weights for PM2.5 estimation. We took the data from 133 monitoring stations in Italy as an example to implement the DEML to predict daily PM2.5 at each 1 km × 1 km grid cell from 2015 to 2019 across Italy. We evaluated the model performance by performing 10-fold cross-validation (CV) and compared it with five benchmark algorithms [GBM, SVM, RF, XGBoost, and Super Learner (SL)].
RESULTS The results revealed that the PM2.5 prediction performance of DEML [coefficient of determination (R²)=0.87 and root mean square error (RMSE)=5.38 μg/m³] was superior to that of all benchmark models (with R² of 0.51, 0.76, 0.83, 0.70, and 0.83 for GBM, SVM, RF, XGBoost, and SL approach, respectively). DEML displayed reliable performance in capturing the spatiotemporal variations of PM2.5 in Italy.
DISCUSSION The proposed DEML framework achieved an outstanding performance in PM2.5 estimation, which could be used as a tool for more accurate environmental exposure assessment. https://doi.org/10.1289/EHP9752.
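The final weighting stage described in the methods can be pictured with a small nonnegative least squares example. This is only a sketch under stated assumptions, not the DEML code: the helper names `nnls_weights` and `combine` and the toy predictions are hypothetical, only two models are combined, and the weights are found with a simple cyclic coordinate descent rather than whatever NNLS solver the authors used.

```python
def nnls_weights(preds, y, n_iter=2000):
    """Nonnegative least-squares weights for combining model predictions.

    preds : list of prediction lists, one list per model
    y     : observed values
    Uses cyclic coordinate descent: each update is the exact
    one-dimensional least-squares minimiser, clipped at zero.
    """
    k, n = len(preds), len(y)
    w = [0.0] * k
    for _ in range(n_iter):
        for j in range(k):
            num = den = 0.0
            for i in range(n):
                # Residual with model j's own contribution left out.
                r = y[i] - sum(w[m] * preds[m][i] for m in range(k) if m != j)
                num += preds[j][i] * r
                den += preds[j][i] ** 2
            w[j] = max(0.0, num / den) if den > 0 else 0.0
    return w


def combine(preds, w):
    """Weighted sum of the model predictions."""
    return [sum(w[m] * preds[m][i] for m in range(len(preds)))
            for i in range(len(preds[0]))]


# Hypothetical PM2.5 observations and two sets of model predictions.
observed = [10.0, 20.0, 30.0, 40.0]
model_a = [12.0, 18.0, 33.0, 37.0]
model_b = [8.0, 22.0, 27.0, 43.0]

weights = nnls_weights([model_a, model_b], observed)
print(weights)
print(combine([model_a, model_b], weights))
```

Constraining the weights to be nonnegative keeps a poorly performing model from entering the ensemble with a negative sign, which is one common reason for choosing NNLS in stacking.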
Affiliation(s)
- Wenhua Yu
  - Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, Australia
- Shanshan Li
  - Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, Australia
- Tingting Ye
  - Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, Australia
- Rongbin Xu
  - Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, Australia
- Jiangning Song
  - Monash Biomedicine Discovery Institute, Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Australia
- Yuming Guo
  - Climate, Air Quality Research Unit, School of Public Health and Preventive Medicine, Monash University, Melbourne, Australia
5
Smith MJ, Mansournia MA, Maringe C, Zivich PN, Cole SR, Leyrat C, Belot A, Rachet B, Luque-Fernandez MA. Introduction to computational causal inference using reproducible Stata, R, and Python code: A tutorial. Stat Med 2022; 41:407-432. [PMID: 34713468 PMCID: PMC11795351 DOI: 10.1002/sim.9234] [Citation(s) in RCA: 40] [Impact Index Per Article: 13.3] [Received: 03/03/2021] [Revised: 10/08/2021] [Accepted: 10/11/2021] [Indexed: 11/09/2022]
Abstract
The main purpose of many medical studies is to estimate the effects of a treatment or exposure on an outcome. However, it is not always possible to randomize the study participants to a particular treatment, therefore observational study designs may be used. There are major challenges with observational studies; one of which is confounding. Controlling for confounding is commonly performed by direct adjustment of measured confounders; although, sometimes this approach is suboptimal due to modeling assumptions and misspecification. Recent advances in the field of causal inference have dealt with confounding by building on classical standardization methods. However, these recent advances have progressed quickly with a relative paucity of computational-oriented applied tutorials contributing to some confusion in the use of these methods among applied researchers. In this tutorial, we show the computational implementation of different causal inference estimators from a historical perspective where new estimators were developed to overcome the limitations of the previous estimators (ie, nonparametric and parametric g-formula, inverse probability weighting, double-robust, and data-adaptive estimators). We illustrate the implementation of different methods using an empirical example from the Connors study based on intensive care medicine, and most importantly, we provide reproducible and commented code in Stata, R, and Python for researchers to adapt in their own observational study. The code can be accessed at https://github.com/migariane/Tutorial_Computational_Causal_Inference_Estimators.
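As a small illustration of the first estimator in that sequence, the nonparametric g-formula (standardization), the sketch below averages stratum-specific outcome differences over the distribution of a single discrete confounder. It is not taken from the tutorial's repository: the function name `g_formula_ate` and the toy records are hypothetical, and the example assumes one measured confounder and at least one exposed and one unexposed record in every stratum.

```python
from collections import defaultdict


def g_formula_ate(records):
    """Nonparametric g-formula (standardization) for a binary exposure.

    records : list of (w, a, y) tuples with a discrete confounder w,
              a binary exposure a and a numeric outcome y.
    Returns the standardized mean difference E[Y^1] - E[Y^0].
    """
    n = len(records)
    sums = defaultdict(float)    # (w, a) -> sum of outcomes
    counts = defaultdict(int)    # (w, a) -> number of records
    w_counts = defaultdict(int)  # w -> number of records in the stratum
    for w, a, y in records:
        sums[(w, a)] += y
        counts[(w, a)] += 1
        w_counts[w] += 1

    ate = 0.0
    for w, n_w in w_counts.items():
        # Stratum-specific mean outcomes under exposure and no exposure;
        # this divides by zero if a stratum lacks either exposure group.
        mean1 = sums[(w, 1)] / counts[(w, 1)]
        mean0 = sums[(w, 0)] / counts[(w, 0)]
        # Weight each stratum difference by the stratum's share of the sample.
        ate += (mean1 - mean0) * n_w / n
    return ate


# Hypothetical records: (confounder, exposure, outcome).
toy = [
    (0, 1, 1), (0, 1, 0), (0, 0, 0), (0, 0, 0),
    (1, 1, 1), (1, 0, 0),
]
print(g_formula_ate(toy))
```

With more confounders or continuous covariates the strata become too sparse for this tabulation, which is where the parametric and data-adaptive estimators listed in the abstract come in.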
Affiliation(s)
- Matthew J Smith
  - Inequalities in Cancer Outcomes Network, Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
- Mohammad A Mansournia
  - Department of Epidemiology and Biostatistics, Tehran University of Medical Sciences, Tehran, Iran
- Camille Maringe
  - Inequalities in Cancer Outcomes Network, Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
- Paul N Zivich
  - Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
  - Carolina Population Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Stephen R Cole
  - Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Clémence Leyrat
  - Inequalities in Cancer Outcomes Network, Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
- Aurélien Belot
  - Inequalities in Cancer Outcomes Network, Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
- Bernard Rachet
  - Inequalities in Cancer Outcomes Network, Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
- Miguel A Luque-Fernandez
  - Inequalities in Cancer Outcomes Network, Department of Non-communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, London, UK
  - Non-communicable Disease and Cancer Epidemiology Group, Instituto de Investigacion Biosanitaria de Granada (ibs.GRANADA), Andalusian School of Public Health, University of Granada, Granada, Spain
  - Biomedical Network Research Centers of Epidemiology and Public Health (CIBERESP), Madrid, Spain
6
Carbone JT, Clift J, Wyllie T, Smyth A. Housing Unit Type and Perceived Social Isolation Among Senior Housing Community Residents. Gerontologist 2021; 62:889-899. [PMID: 34919687 DOI: 10.1093/geront/gnab184] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/11/2021] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND AND OBJECTIVES Social isolation, and its associated health implications, is an important issue for older adults in the United States. To date, there has been limited study of the pathways that connect these two factors. The present study expands on previous models by linking factors related to the built environment (in the form of housing unit type) to perceived social isolation among those living independently in dedicated senior housing.
RESEARCH DESIGN AND METHODS The causal inference technique of inverse probability weighting with regression adjustment was employed to assess the impact of living in a townhome-style unit, as opposed to in an apartment building, on self-reported perceived social isolation (N = 1,160).
RESULTS Individuals who lived in townhome-style housing had a 10.4% lower probability of experiencing social isolation than those living in apartment building-style units.
DISCUSSION AND IMPLICATIONS The findings provide evidence for the conceptual model that characteristics specific to a given housing unit type may create conditions that exacerbate or buffer individuals from experiencing social isolation. This, in turn, has important implications for the targeting of interventions for social isolation. Policy considerations related to the type of affordable senior housing being built should also be informed by these findings. Additionally, future research should better explicate the role of housing unit type on mental and emotional health outcomes.
Affiliation(s)
- Jason T Carbone
  - School of Social Work, Wayne State University, Detroit, Michigan, USA
- Jennifer Clift
  - School of Social Work, Wayne State University, Detroit, Michigan, USA
- Tom Wyllie
  - Presbyterian Villages of Michigan, Southfield, Michigan, USA
- Amy Smyth
  - Presbyterian Villages of Michigan, Southfield, Michigan, USA
7
Duarte MBO, Leal F, Argenton JLP, Carvalheira JBC. Impact of androgen deprivation therapy on mortality of prostate cancer patients with COVID-19: a propensity score-based analysis. Infect Agent Cancer 2021; 16:66. [PMID: 34823563 PMCID: PMC8614632 DOI: 10.1186/s13027-021-00406-y] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/09/2021] [Accepted: 11/11/2021] [Indexed: 02/07/2023] Open
Abstract
BACKGROUND Previous studies hypothesized that androgen deprivation therapy (ADT) may reduce severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infectivity. However, it is unknown whether there is an association between ADT and higher survival in prostate cancer patients with COVID-19.
METHODS We performed a retrospective analysis of prostate cancer (PC) patients hospitalized to treat COVID-19 in Brazil's public health system. We compared patients with active use of ADT versus those with non-active (past) use. We constructed propensity score models of patients in active versus non-active use of ADT. All variables were used to estimate the propensity score in both models. For the first model, we pair-matched patients under active and non-active use of ADT on the propensity score. For the second model, we first performed multivariate backward elimination to select variables for a final inverse-weighted model with doubly robust estimation.
RESULTS We analyzed 199 PC patients with COVID-19 who received ADT. In total, 52.3% (95/199) of our patients were less than 75 years old, 78.4% (156/199) were on active ADT, and most were using a GnRH analog (80.1%; 125/156). Most patients were receiving palliative treatment (89.9%; 179/199). Also, 63.3% of our cohort died from COVID-19. Forty-eight patients under active ADT were pair matched against 48 controls (non-active ADT). All patients (199) were analyzed in the doubly robust model. Active ADT use was not a protective factor in either the inverse-weight-based propensity score model (OR 0.70, 95% CI 0.38-1.31, P = 0.263) or the pair-matched propensity score model (OR 0.67, 95% CI 0.27-1.63, P = 0.374). We noticed a significant imbalance in the propensity score between patients on active and non-active ADT, with important reductions in the differences after the adjustments.
CONCLUSIONS The active use of ADT was not associated with a reduced risk of death in patients with COVID-19.
Affiliation(s)
- Mateus Bringel Oliveira Duarte
  - Division of Oncology, Department of Anesthesiology, Oncology and Radiology, School of Medical Sciences, State University of Campinas (UNICAMP), Campinas, SP, Brazil
  - Uberlândia Cancer Hospital, Federal University of Uberlândia, UFU, Uberlândia, MG, Brazil
- Frederico Leal
  - Division of Oncology, Department of Anesthesiology, Oncology and Radiology, School of Medical Sciences, State University of Campinas (UNICAMP), Campinas, SP, Brazil
- José Barreto Campello Carvalheira
  - Division of Oncology, Department of Anesthesiology, Oncology and Radiology, School of Medical Sciences, State University of Campinas (UNICAMP), Campinas, SP, Brazil
8
Amusa L, Zewotir T, North D, Kharsany ABM, Lewis L. Association of medical male circumcision and sexually transmitted infections in a population-based study using targeted maximum likelihood estimation. BMC Public Health 2021; 21:1642. [PMID: 34496810 PMCID: PMC8425067 DOI: 10.1186/s12889-021-11705-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/28/2020] [Accepted: 08/29/2021] [Indexed: 11/20/2022] Open
Abstract
Background Epidemiological theory and many empirical studies support the hypothesis that there is a protective effect of male circumcision against some sexually transmitted infections (STIs). However, there is a paucity of randomized controlled trials (RCTs) to test this hypothesis in the South African population. Because conducting RCTs is often infeasible, estimating marginal or average treatment effects with observational data is of increasing interest. Using targeted maximum likelihood estimation (TMLE), a doubly robust estimation technique, we aim to provide evidence of an association between medical male circumcision (MMC) and two STI outcomes.
Methods HIV and HSV-2 status were the two primary outcomes for this study. We investigated the associations between MMC and these STI outcomes, using cross-sectional data from the HIV Incidence Provincial Surveillance System (HIPSS) study in KwaZulu-Natal, South Africa. HIV antibodies were tested from the blood samples collected in the study. For HSV-2, serum samples were tested for HSV-2 antibodies via an ELISA-based anti-HSV-2 IgG. We estimated marginal prevalence ratios (PR) using TMLE and compared estimates with those from propensity score full matching (PSFM) and inverse probability of treatment weighting (IPTW).
Results Of a total of 2850 male participants included in the analytic sample, the overall weighted prevalence of HIV was 32.4% (n = 941) and HSV-2 was 53.2% (n = 1529). TMLE estimates suggest that MMC was associated with 31% lower HIV prevalence (PR: 0.690; 95% CI: 0.614, 0.777) and 21.1% lower HSV-2 prevalence (PR: 0.789; 95% CI: 0.734, 0.848). The propensity score analyses also provided evidence of association of MMC with lower prevalence of HIV and HSV-2. For PSFM: HIV (PR: 0.689; 95% CI: 0.537, 0.885), and HSV-2 (PR: 0.832; 95% CI: 0.709, 0.975). For IPTW: HIV (PR: 0.708; 95% CI: 0.572, 0.875), and HSV-2 (PR: 0.837; 95% CI: 0.738, 0.949).
Conclusion Using a TMLE approach, we present further evidence of a protective association of MMC against HIV and HSV-2 in this hyper-endemic South African setting. TMLE has the potential to enhance the evidence base for recommendations that embrace the effect of public health interventions on health or disease outcomes.
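One of the comparison estimators named in the methods, IPTW, can be sketched for a marginal prevalence ratio as follows. This is a simplified illustration and not the study's analysis: the function name `iptw_prevalence_ratio` and the toy values are hypothetical, the propensity scores are assumed to be already estimated, and survey weighting and confidence intervals are ignored.

```python
def iptw_prevalence_ratio(y, a, ps):
    """Marginal prevalence ratio via inverse probability of treatment weighting.

    y  : binary outcome indicators (1 = outcome present)
    a  : binary exposure indicators (1 = exposed, 0 = unexposed)
    ps : estimated propensity scores P(A=1 | W), strictly between 0 and 1
    """
    num1 = den1 = num0 = den0 = 0.0
    for yi, ai, pi in zip(y, a, ps):
        if ai == 1:
            weight = 1.0 / pi          # exposed: weight by 1 / P(A=1 | W)
            num1 += weight * yi
            den1 += weight
        else:
            weight = 1.0 / (1.0 - pi)  # unexposed: weight by 1 / P(A=0 | W)
            num0 += weight * yi
            den0 += weight
    # Ratio of the weighted prevalences in the exposed and the unexposed.
    return (num1 / den1) / (num0 / den0)


# Hypothetical data, for illustration only.
y = [1, 0, 1, 1]
a = [1, 1, 0, 0]
ps = [0.5, 0.5, 0.5, 0.5]
print(iptw_prevalence_ratio(y, a, ps))
```

A ratio below 1 means the weighted prevalence is lower among the exposed, which is the direction reported for MMC in the results above.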
Affiliation(s)
- Lateef Amusa
  - Department of Statistics, School of Mathematics, Statistics and Computer Science, University of Kwazulu-Natal, Durban, South Africa
  - Department of Statistics, University of Ilorin, Ilorin, Nigeria
- Temesgen Zewotir
  - Department of Statistics, School of Mathematics, Statistics and Computer Science, University of Kwazulu-Natal, Durban, South Africa
- Delia North
  - Department of Statistics, School of Mathematics, Statistics and Computer Science, University of Kwazulu-Natal, Durban, South Africa
- Ayesha B M Kharsany
  - Centre for the AIDS Programme of Research in South Africa (CAPRISA), University of KwaZulu-Natal, Durban, South Africa
  - School of Laboratory Medicine & Medical Sciences, Nelson R Mandela School of Medicine, University of KwaZulu-Natal, Durban, South Africa
- Lara Lewis
  - Centre for the AIDS Programme of Research in South Africa (CAPRISA), University of KwaZulu-Natal, Durban, South Africa
9
Rose S. Intersections of machine learning and epidemiological methods for health services research. Int J Epidemiol 2021; 49:1763-1770. [PMID: 32236476 PMCID: PMC7825941 DOI: 10.1093/ije/dyaa035] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Accepted: 02/17/2020] [Indexed: 12/15/2022] Open
Abstract
The field of health services research is broad and seeks to answer questions about the health care system. It is inherently interdisciplinary, and epidemiologists have made crucial contributions. Parametric regression techniques remain standard practice in health services research with machine learning techniques currently having low penetrance in comparison. However, studies in several prominent areas, including health care spending, outcomes and quality, have begun deploying machine learning tools for these applications. Nevertheless, major advances in epidemiological methods are also as yet underleveraged in health services research. This article summarizes the current state of machine learning in key areas of health services research, and discusses important future directions at the intersection of machine learning and epidemiological methods for health services research.
Affiliation(s)
- Sherri Rose
  - Department of Health Care Policy, Harvard Medical School, 180 Longwood Ave, Boston, MA, 02115, USA
10
Schomaker M, Luque-Fernandez MA, Leroy V, Davies MA. Using longitudinal targeted maximum likelihood estimation in complex settings with dynamic interventions. Stat Med 2019; 38:4888-4911. [PMID: 31436859 DOI: 10.1002/sim.8340] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 02/14/2018] [Revised: 07/10/2019] [Accepted: 07/19/2019] [Indexed: 11/12/2022]
Abstract
Longitudinal targeted maximum likelihood estimation (LTMLE) has very rarely been used to estimate dynamic treatment effects in the context of time-dependent confounding affected by prior treatment when faced with long follow-up times, multiple time-varying confounders, and complex associational relationships simultaneously. Reasons for this include the potential computational burden, technical challenges, restricted modeling options for long follow-up times, and limited practical guidance in the literature. However, LTMLE has desirable asymptotic properties, ie, it is doubly robust, and can yield valid inference when used in conjunction with machine learning. It also has the advantage of easy-to-calculate analytic standard errors in contrast to the g-formula, which requires bootstrapping. We use a topical and sophisticated question from HIV treatment research to show that LTMLE can be used successfully in complex realistic settings, and we compare results to competing estimators. Our example illustrates the following practical challenges common to many epidemiological studies: (1) long follow-up time (30 months); (2) gradually declining sample size; (3) limited support for some intervention rules of interest; (4) a high-dimensional set of potential adjustment variables, increasing both the need and the challenge of integrating appropriate machine learning methods; and (5) consideration of collider bias. 
Our analyses, as well as simulations, shed new light on the application of LTMLE in complex and realistic settings: We show that (1) LTMLE can yield stable and good estimates, even when confronted with small samples and limited modeling options; (2) machine learning utilized with a small set of simple learners (if more complex ones cannot be fitted) can outperform a single, complex model, which is tailored to incorporate prior clinical knowledge; and (3) performance can vary considerably depending on interventions and their support in the data, and therefore critical quality checks should accompany every LTMLE analysis. We provide guidance for the practical application of LTMLE.
Affiliation(s)
- M Schomaker
  - Centre for Infectious Disease Epidemiology & Research, University of Cape Town, Cape Town, South Africa
  - Institute of Public Health, Medical Decision Making and Health Technology Assessment, Department of Public Health, Health Services Research and Health Technology Assessment, UMIT - University for Health Sciences, Medical Informatics and Technology, Hall in Tirol, Austria
- M A Luque-Fernandez
  - Biomedical Research Institute of Granada - Noncommunicable and Cancer Epidemiology Group, Andalusian School of Public Health, University of Granada, Granada, Spain
  - Department of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, UK
  - Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, Massachusetts
- M A Davies
  - Centre for Infectious Disease Epidemiology & Research, University of Cape Town, Cape Town, South Africa
11
Yu YH, Bodnar LM, Brooks MM, Himes KP, Naimi AI. Comparison of Parametric and Nonparametric Estimators for the Association Between Incident Prepregnancy Obesity and Stillbirth in a Population-Based Cohort Study. Am J Epidemiol 2019; 188:1328-1336. [PMID: 31111944 DOI: 10.1093/aje/kwz081] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 05/01/2018] [Revised: 01/09/2019] [Accepted: 01/10/2019] [Indexed: 11/13/2022] Open
Abstract
While prepregnancy obesity increases risk of stillbirth, few studies have evaluated the role of newly developed obesity independent of long-standing obesity. Additionally, researchers have relied almost exclusively on parametric models, which require correct specification of an unknown function for consistent estimation. We estimated the association between incident obesity and stillbirth in a cohort constructed from linked birth and death records in Pennsylvania (2003-2013). Incident obesity was defined as body mass index (weight (kg)/height (m)²) greater than or equal to 30. We used parametric G-computation, semiparametric inverse-probability weighting, and parametric/nonparametric targeted minimum loss-based estimation (TMLE) to estimate the association between incident prepregnancy obesity and stillbirth. Compared with pregnancies from women who stayed nonobese, women who became obese prior to their next pregnancy were estimated to have 2.0 (95% confidence interval (CI): 0.5, 3.5) more stillbirths per 1,000 pregnancies using parametric G-computation. However, despite well-behaved stabilized inverse probability weights, risk differences estimated from inverse-probability weighting, nonparametric TMLE, and parametric TMLE represented 6.9 (95% CI: 3.7, 10.0), 0.4 (95% CI: 0.1, 0.7), and 2.9 (95% CI: 1.5, 4.2) excess stillbirths per 1,000 pregnancies, respectively. These results, particularly those derived from nonparametric TMLE, were highly sensitive to covariates included in the propensity score models. Our results suggest that caution is warranted when using nonparametric estimators to quantify exposure effects.
Affiliation(s)
- Ya-Hui Yu
- Department of Epidemiology, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
- Lisa M Bodnar
- Department of Epidemiology, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
- Department of Obstetrics, Gynecology, and Reproductive Sciences, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
- Magee-Womens Research Institute, Pittsburgh, Pennsylvania
- Maria M Brooks
- Department of Epidemiology, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
- Katherine P Himes
- Department of Obstetrics, Gynecology, and Reproductive Sciences, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
- Magee-Womens Research Institute, Pittsburgh, Pennsylvania
- Ashley I Naimi
- Department of Epidemiology, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, Pennsylvania
12
Lim S, Tellez M, Ismail AI. Estimating a Dynamic Effect of Soda Intake on Pediatric Dental Caries Using Targeted Maximum Likelihood Estimation Method. Caries Res 2019; 53:532-540. [PMID: 30889593 DOI: 10.1159/000497359] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/09/2018] [Accepted: 01/25/2019] [Indexed: 11/19/2022]
Abstract
The effect of soda intake on dental caries in young children (birth to 5 years) may vary over time. Estimating a dynamic effect may be challenging due to time-varying confounding and loss to follow-up. The purpose of this paper is to demonstrate the utility of the targeted maximum likelihood estimation (TMLE) method in addressing longitudinal data analysis challenges and estimating a dynamic effect of soda intake on pediatric caries. Data came from the Detroit Dental Health Project, a 4-year cohort study of low-income African-American children and caregivers. The sample included 995 child-caregiver pairs who participated in 2002-03 (W1) and were followed up in 2004-05 (W2) and 2007 (W3). The outcome was the count of caries surfaces at W3, and the exposure was the child's soda intake at W1 and W2. Time-varying covariates included caregiver's smoking status, oral health fatalism, and social support. Forty-three percent of children consistently consumed soda at W1 and W2, whereas 21% were nonconsumers at both surveys. The remaining 35% switched intake status between W1 and W2. The association between soda intake patterns and caries was tested using TMLE. On average, children with consistent soda intake had 1.03 more caries lesions at W3 than those with consistently no soda intake (95% CI 0.09-1.97). If soda was consumed only at W1 or W2, the estimated effect of soda on caries development at W3 was no longer statistically significant. In conclusion, consistent soda intake during early childhood led to one additional carious tooth surface. The study highlights the utility of TMLE in pediatric caries research, as it can handle modeling challenges associated with longitudinal data.
Affiliation(s)
- Sungwoo Lim
- Department of Pediatric Dentistry and Community Oral Health Sciences, Kornberg School of Dentistry, Philadelphia, Pennsylvania, USA
- Marisol Tellez
- Department of Pediatric Dentistry and Community Oral Health Sciences, Kornberg School of Dentistry, Philadelphia, Pennsylvania, USA
- Amid I Ismail
- Department of Pediatric Dentistry and Community Oral Health Sciences, Kornberg School of Dentistry, Philadelphia, Pennsylvania, USA
13
Izano MA, Sofrygin OA, Picciotto S, Bradshaw PT, Eisen EA. Metalworking Fluids and Colon Cancer Risk: Longitudinal Targeted Minimum Loss-based Estimation. Environ Epidemiol 2019; 3:e035. [PMID: 33778333 PMCID: PMC7952104 DOI: 10.1097/ee9.0000000000000035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2018] [Accepted: 12/10/2018] [Indexed: 12/09/2022]
Abstract
Metalworking fluids (MWFs) are a class of complex mixtures of chemicals and oils, including several known carcinogens that may pose a cancer hazard to millions of workers. Reports on the relation between MWFs and incident colon cancer have been mixed. METHODS: We investigated the relation between exposure to straight, soluble, and synthetic MWFs and the incidence of colon cancer in a cohort of automobile manufacturing industry workers, adjusting for time-varying confounding affected by prior exposure to reduce healthy worker survivor bias. We used longitudinal targeted minimum loss-based estimation (TMLE) to estimate the difference in the cumulative incidence of colon cancer comparing counterfactual outcomes if always exposed above to always exposed below an exposure cutoff while at work. Exposure concentration cutoffs were selected a priori at the 90th percentile of total particulate matter for each fluid type: 0.034, 0.400, and 0.003 for straight, soluble, and synthetic MWFs, respectively. RESULTS: The estimated 25-year risk differences were 3.8% (95% confidence interval [CI] = 0.7, 7.0) for straight, 1.3% (95% CI = -2.3, 4.8) for soluble, and 0.2% (95% CI = -3.3, 3.7) for synthetic MWFs. The corresponding risk ratios were 2.39 (1.12, 5.08), 1.43 (0.67, 3.04), and 1.08 (0.51, 2.30) for straight, soluble, and synthetic MWFs, respectively. CONCLUSIONS: By controlling for time-varying confounding affected by prior exposure, a key feature of occupational cohorts, we were able to provide evidence for a causal effect of straight MWF exposure on colon cancer risk that was not found using standard analytical techniques in previous reports.
Affiliation(s)
- Monika A. Izano
- Division of Research, Kaiser Permanente Northern California, Oakland, California
- Department of Obstetrics/Gynecology and Reproductive Sciences, University of California, San Francisco, California
- Oleg A. Sofrygin
- Division of Research, Kaiser Permanente Northern California, Oakland, California
- Sally Picciotto
- School of Public Health, University of California, Berkeley, California
- Ellen A. Eisen
- School of Public Health, University of California, Berkeley, California
14
Naimi AI, Balzer LB. Stacked generalization: an introduction to super learning. Eur J Epidemiol 2018; 33:459-464. [PMID: 29637384 DOI: 10.1007/s10654-018-0390-z] [Citation(s) in RCA: 126] [Impact Index Per Article: 18.0] [Received: 11/22/2017] [Accepted: 03/28/2018] [Indexed: 12/24/2022]
Abstract
Stacked generalization is an ensemble method that allows researchers to combine several different prediction algorithms into one. Since its introduction in the early 1990s, the method has evolved several times into a host of methods among which is the "Super Learner". Super Learner uses V-fold cross-validation to build the optimal weighted combination of predictions from a library of candidate algorithms. Optimality is defined by a user-specified objective function, such as minimizing mean squared error or maximizing the area under the receiver operating characteristic curve. Although relatively simple in nature, use of Super Learner by epidemiologists has been hampered by limitations in understanding conceptual and technical details. We work step-by-step through two examples to illustrate concepts and address common concerns.
Affiliation(s)
- Ashley I Naimi
- Department of Epidemiology, University of Pittsburgh, 130 DeSoto Street 503 Parran Hall, Pittsburgh, PA, 15261, USA.
- Laura B Balzer
- Department of Biostatistics and Epidemiology, University of Massachusetts, Amherst, MA, USA
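
Illustrative sketch (not from the cited paper): the abstract of entry 14 describes Super Learner as using V-fold cross-validation to build a weighted combination of predictions from a library of candidate algorithms, with optimality defined by a user-specified objective. The Python sketch below is one minimal way to read that description, under stated assumptions: the three toy learners (`fit_mean`, `fit_linear`, `fit_quadratic`), all function names, the mean-squared-error objective, and the coarse grid search over convex weights are hypothetical choices made for illustration, not the authors' implementation or any package's API.

```python
import itertools
import numpy as np

# Hypothetical candidate library: each "fit" returns a prediction function.
def fit_mean(X, y):
    mu = y.mean()
    return lambda Xn: np.full(len(Xn), mu)

def fit_linear(X, y):
    design = lambda Z: np.column_stack([np.ones(len(Z)), Z])
    beta, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return lambda Xn: design(Xn) @ beta

def fit_quadratic(X, y):
    design = lambda Z: np.column_stack([np.ones(len(Z)), Z, Z ** 2])
    beta, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return lambda Xn: design(Xn) @ beta

LIBRARY = [fit_mean, fit_linear, fit_quadratic]

def cross_validated_predictions(X, y, library, n_folds=5, seed=0):
    """Column j holds learner j's out-of-fold predictions for every row."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    Z = np.empty((len(y), len(library)))
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        for j, fit in enumerate(library):
            Z[held_out, j] = fit(X[train], y[train])(X[held_out])
    return Z

def convex_weights(Z, y, steps=20):
    """Grid search over non-negative weights summing to 1, minimizing CV MSE."""
    best_w, best_mse = None, np.inf
    for counts in itertools.product(range(steps + 1), repeat=Z.shape[1]):
        if sum(counts) != steps:
            continue
        w = np.array(counts) / steps
        mse = np.mean((y - Z @ w) ** 2)
        if mse < best_mse:
            best_w, best_mse = w, mse
    return best_w, best_mse

def super_learner(X, y, library=LIBRARY, n_folds=5):
    """Choose weights by cross-validation, then refit every learner on all data."""
    Z = cross_validated_predictions(X, y, library, n_folds)
    w, cv_mse = convex_weights(Z, y)
    fitted = [fit(X, y) for fit in library]
    predict = lambda Xn: np.column_stack([f(Xn) for f in fitted]) @ w
    return predict, w, cv_mse

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 1))
    y = 1 + 2 * X[:, 0] + 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.5, size=200)
    predict, w, cv_mse = super_learner(X, y)
    print("weights:", w, "cross-validated MSE:", cv_mse)
```

Because the weight grid includes the corners that put all weight on a single learner, the ensemble's cross-validated error in this sketch can never exceed that of the best individual candidate. A production analysis would normally use an established implementation and a proper constrained optimizer rather than a grid.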