1. Rytgaard HCW, van der Laan MJ. Targeted maximum likelihood estimation for causal inference in survival and competing risks analysis. Lifetime Data Anal 2024; 30:4-33. [PMID: 36336732 DOI: 10.1007/s10985-022-09576-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/01/2021] [Accepted: 09/05/2022] [Indexed: 06/16/2023]
Abstract
Targeted maximum likelihood estimation (TMLE) provides a general methodology for estimation of causal parameters in presence of high-dimensional nuisance parameters. Generally, TMLE consists of a two-step procedure that combines data-adaptive nuisance parameter estimation with semiparametric efficiency and rigorous statistical inference obtained via a targeted update step. In this paper, we demonstrate the practical applicability of TMLE based causal inference in survival and competing risks settings where event times are not confined to take place on a discrete and finite grid. We focus on estimation of causal effects of time-fixed treatment decisions on survival and absolute risk probabilities, considering different univariate and multidimensional parameters. Besides providing a general guidance to using TMLE for survival and competing risks analysis, we further describe how the previous work can be extended with the use of loss-based cross-validated estimation, also known as super learning, of the conditional hazards. We illustrate the usage of the considered methods using publicly available data from a trial on adjuvant chemotherapy for colon cancer. R software code to implement all considered algorithms and to reproduce all analyses is available in an accompanying online appendix on Github.
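The two-step structure described in this abstract (an initial nuisance estimate followed by a targeted update) can be illustrated with a deliberately simplified sketch. It is not the continuous-time survival algorithm of the paper: it assumes a binary outcome, a single time point, and a randomized treatment with known probability 0.5. The simulated data, the crude initial estimate, and all function names (`simulate`, `tmle_ate`) are hypothetical choices made for illustration only.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def simulate(n, seed=1):
    """Hypothetical randomized trial: covariate W, treatment A, binary outcome Y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        w = rng.random()
        a = 1 if rng.random() < 0.5 else 0
        y = 1 if rng.random() < expit(-1.0 + w + 0.8 * a) else 0
        rows.append((w, a, y))
    return rows


def tmle_ate(rows, g1=0.5, tol=1e-12, max_iter=100):
    """Return (targeted ATE estimate, mean of the score after targeting)."""
    n = len(rows)
    # Step 1: initial outcome estimate. Deliberately crude here (the overall
    # mean of Y, ignoring A and W); in practice a data-adaptive learner such
    # as a super learner would be used instead.
    q_init = sum(y for _, _, y in rows) / n
    offset = logit(q_init)
    # Clever covariate for the average treatment effect, using the assumed
    # known treatment probability g1.
    h = [a / g1 - (1 - a) / (1.0 - g1) for _, a, _ in rows]
    # Step 2: targeting. Fit eps in logit Q*(A) = logit Q + eps * H(A)
    # by Newton's method on the logistic log-likelihood.
    eps = 0.0
    for _ in range(max_iter):
        p = [expit(offset + eps * hi) for hi in h]
        score = sum(hi * (y - pi) for hi, pi, (_, _, y) in zip(h, p, rows))
        if abs(score) / n < tol:
            break
        info = sum(hi * hi * pi * (1.0 - pi) for hi, pi in zip(h, p))
        eps += score / info
    # Plug-in (substitution) estimate from the updated outcome model. The
    # updated model does not depend on W here, so no averaging over W is needed.
    q1_star = expit(offset + eps / g1)
    q0_star = expit(offset - eps / (1.0 - g1))
    p = [expit(offset + eps * hi) for hi in h]
    mean_score = sum(hi * (y - pi) for hi, pi, (_, _, y) in zip(h, p, rows)) / n
    return q1_star - q0_star, mean_score


ate, mean_score = tmle_ate(simulate(2000))
print(f"targeted ATE estimate: {ate:.3f}, mean score after targeting: {mean_score:.2e}")
```

After the update, the empirical mean of the clever-covariate score is numerically zero, which is the sense in which the targeting step "solves" the efficient influence curve equation in this toy setting.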
2. Rytgaard HCW, Eriksson F, van der Laan MJ. Estimation of time-specific intervention effects on continuously distributed time-to-event outcomes by targeted maximum likelihood estimation. Biometrics 2023; 79:3038-3049. [PMID: 36988158 DOI: 10.1111/biom.13856] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2021] [Accepted: 02/22/2023] [Indexed: 03/30/2023]
Abstract
This work considers targeted maximum likelihood estimation (TMLE) of treatment effects on absolute risk and survival probabilities in classical time-to-event settings characterized by right-censoring and competing risks. TMLE is a general methodology combining flexible ensemble learning and semiparametric efficiency theory in a two-step procedure for substitution estimation of causal parameters. We specialize and extend the continuous-time TMLE methods for competing risks settings, proposing a targeting algorithm that iteratively updates cause-specific hazards to solve the efficient influence curve equation for the target parameter. As part of the work, we further detail and implement the recently proposed highly adaptive lasso estimator for continuous-time conditional hazards with L1-penalized Poisson regression. The resulting estimation procedure benefits from relying solely on very mild nonparametric restrictions on the statistical model, thus providing a novel tool for machine-learning-based semiparametric causal inference for continuous-time time-to-event data. We apply the methods to a publicly available dataset on follicular cell lymphoma where subjects are followed over time until disease relapse or death without relapse. The data display important time-varying effects that can be captured by the highly adaptive lasso. In our simulations that are designed to imitate the data, we compare our methods to a similar approach based on random survival forests and to the discrete-time TMLE.
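To make concrete why absolute risk under competing risks depends on all cause-specific hazards, the following sketch computes the cumulative incidence of cause 1 by numerically integrating the overall survival function times the cause-1 hazard. It assumes constant hazards purely for illustration; the hazard values and function names are hypothetical and the sketch is not the estimator proposed in the paper.

```python
import math


def cumulative_incidence(lam1, lam2, t, steps=10000):
    """Midpoint-rule approximation of F1(t) = integral_0^t S(s) * lam1 ds,
    where S(s) = exp(-(lam1 + lam2) * s) is the all-cause survival function."""
    ds = t / steps
    total = 0.0
    for k in range(steps):
        s = (k + 0.5) * ds
        surv = math.exp(-(lam1 + lam2) * s)
        total += surv * lam1 * ds
    return total


def closed_form(lam1, lam2, t):
    """Exact cumulative incidence for constant cause-specific hazards."""
    lam = lam1 + lam2
    return lam1 / lam * (1.0 - math.exp(-lam * t))


# Hypothetical hazards: 0.10 per year for relapse, 0.05 per year for death without relapse.
f1 = cumulative_incidence(0.10, 0.05, 5.0)
# Ignoring the competing event overstates the absolute risk of cause 1.
naive = 1.0 - math.exp(-0.10 * 5.0)
print(f"F1(5) = {f1:.4f}, naive 1 - exp(-lam1 * t) = {naive:.4f}")
```

Because the competing event removes subjects from the risk set, the cumulative incidence is smaller than the risk obtained from the cause-1 hazard alone, which is why a targeting step for absolute risk has to update every cause-specific hazard.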
Affiliation(s)
- Frank Eriksson
  - Section of Biostatistics, University of Copenhagen, Copenhagen, Denmark
- Mark J van der Laan
  - Division of Biostatistics, University of California, Berkeley, California, USA
3. Mertens A, Benjamin-Chung J, Colford JM, Hubbard AE, van der Laan MJ, Coyle J, Sofrygin O, Cai W, Jilek W, Rosete S, Nguyen A, Pokpongkiat NN, Djajadi S, Seth A, Jung E, Chung EO, Malenica I, Hejazi N, Li H, Hafen R, Subramoney V, Häggström J, Norman T, Christian P, Brown KH, Arnold BF. Author Correction: Child wasting and concurrent stunting in low- and middle-income countries. Nature 2023; 623:E1. [PMID: 37833391 PMCID: PMC10620077 DOI: 10.1038/s41586-023-06695-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2023]
Affiliation(s)
- Andrew Mertens
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA.
- Jade Benjamin-Chung
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
  - Department of Epidemiology and Population Health, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- John M Colford
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Alan E Hubbard
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Mark J van der Laan
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Jeremy Coyle
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Oleg Sofrygin
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Wilson Cai
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Wendy Jilek
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Sonali Rosete
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Anna Nguyen
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Nolan N Pokpongkiat
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Stephanie Djajadi
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Anmol Seth
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Esther Jung
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Esther O Chung
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Ivana Malenica
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Nima Hejazi
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Haodong Li
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Ryan Hafen
  - Hafen Consulting, West Richland, WA, USA
- Thea Norman
  - Quantitative Sciences, Bill & Melinda Gates Foundation, Seattle, WA, USA
- Parul Christian
  - Center for Human Nutrition, Department of International Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Kenneth H Brown
  - Department of Nutrition, University of California, Davis, Davis, CA, USA
- Benjamin F Arnold
  - Francis I. Proctor Foundation, University of California, San Francisco, CA, USA.
  - Department of Ophthalmology, University of California, San Francisco, CA, USA.
4. Benjamin-Chung J, Mertens A, Colford JM, Hubbard AE, van der Laan MJ, Coyle J, Sofrygin O, Cai W, Nguyen A, Pokpongkiat NN, Djajadi S, Seth A, Jilek W, Jung E, Chung EO, Rosete S, Hejazi N, Malenica I, Li H, Hafen R, Subramoney V, Häggström J, Norman T, Brown KH, Christian P, Arnold BF. Author Correction: Early-childhood linear growth faltering in low- and middle-income countries. Nature 2023; 623:E2. [PMID: 37833392 PMCID: PMC10620071 DOI: 10.1038/s41586-023-06703-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2023]
Affiliation(s)
- Jade Benjamin-Chung
  - Department of Epidemiology & Population Health, Stanford University, Stanford, CA, USA.
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA.
  - Chan Zuckerberg Biohub, San Francisco, CA, USA.
- Andrew Mertens
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- John M Colford
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Alan E Hubbard
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Mark J van der Laan
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Jeremy Coyle
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Oleg Sofrygin
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Wilson Cai
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Anna Nguyen
  - Department of Epidemiology & Population Health, Stanford University, Stanford, CA, USA
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Nolan N Pokpongkiat
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Stephanie Djajadi
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Anmol Seth
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Wendy Jilek
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Esther Jung
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Esther O Chung
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Sonali Rosete
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Nima Hejazi
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Ivana Malenica
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Haodong Li
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Ryan Hafen
  - Hafen Consulting, LLC, West Richland, WA, USA
- Thea Norman
  - Quantitative Sciences, Bill & Melinda Gates Foundation, Seattle, WA, USA
- Kenneth H Brown
  - Department of Nutrition, University of California, Davis, Davis, CA, USA
- Parul Christian
  - Center for Human Nutrition, Department of International Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Benjamin F Arnold
  - Francis I. Proctor Foundation, University of California, San Francisco, San Francisco, CA, USA.
  - Department of Ophthalmology, University of California, San Francisco, San Francisco, CA, USA.
5. Boileau P, Qi NT, van der Laan MJ, Dudoit S, Leng N. A flexible approach for predictive biomarker discovery. Biostatistics 2023; 24:1085-1105. [PMID: 35861622 DOI: 10.1093/biostatistics/kxac029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2022] [Revised: 06/01/2022] [Accepted: 06/27/2022] [Indexed: 11/14/2022] Open
Abstract
An endeavor central to precision medicine is predictive biomarker discovery; they define patient subpopulations which stand to benefit most, or least, from a given treatment. The identification of these biomarkers is often the byproduct of the related but fundamentally different task of treatment rule estimation. Using treatment rule estimation methods to identify predictive biomarkers in clinical trials where the number of covariates exceeds the number of participants often results in high false discovery rates. The higher than expected number of false positives translates to wasted resources when conducting follow-up experiments for drug target identification and diagnostic assay development. Patient outcomes are in turn negatively affected. We propose a variable importance parameter for directly assessing the importance of potentially predictive biomarkers and develop a flexible nonparametric inference procedure for this estimand. We prove that our estimator is double robust and asymptotically linear under loose conditions in the data-generating process, permitting valid inference about the importance metric. The statistical guarantees of the method are verified in a thorough simulation study representative of randomized control trials with moderate and high-dimensional covariate vectors. Our procedure is then used to discover predictive biomarkers from among the tumor gene expression data of metastatic renal cell carcinoma patients enrolled in recently completed clinical trials. We find that our approach more readily discerns predictive from nonpredictive biomarkers than procedures whose primary purpose is treatment rule estimation. An open-source software implementation of the methodology, the uniCATE R package, is briefly introduced.
Affiliation(s)
- Philippe Boileau
  - Graduate Group in Biostatistics and Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Nina Ting Qi
  - Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA
- Mark J van der Laan
  - Division of Biostatistics, Department of Statistics, Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Sandrine Dudoit
  - Division of Biostatistics, Department of Statistics, Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Ning Leng
  - Genentech Inc., 1 DNA Way, South San Francisco, CA 94080, USA
6. Benjamin-Chung J, Mertens A, Colford JM, Hubbard AE, van der Laan MJ, Coyle J, Sofrygin O, Cai W, Nguyen A, Pokpongkiat NN, Djajadi S, Seth A, Jilek W, Jung E, Chung EO, Rosete S, Hejazi N, Malenica I, Li H, Hafen R, Subramoney V, Häggström J, Norman T, Brown KH, Christian P, Arnold BF. Early-childhood linear growth faltering in low- and middle-income countries. Nature 2023; 621:550-557. [PMID: 37704719 PMCID: PMC10511325 DOI: 10.1038/s41586-023-06418-5] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Received: 06/09/2020] [Accepted: 07/10/2023] [Indexed: 09/15/2023]
Abstract
Globally, 149 million children under 5 years of age are estimated to be stunted (length more than 2 standard deviations below international growth standards) [1,2]. Stunting, a form of linear growth faltering, increases the risk of illness, impaired cognitive development and mortality. Global stunting estimates rely on cross-sectional surveys, which cannot provide direct information about the timing of onset or persistence of growth faltering, a key consideration for defining critical windows to deliver preventive interventions. Here we completed a pooled analysis of longitudinal studies in low- and middle-income countries (n = 32 cohorts, 52,640 children, ages 0-24 months), allowing us to identify the typical age of onset of linear growth faltering and to investigate recurrent faltering in early life. The highest incidence of stunting onset occurred from birth to the age of 3 months, with substantially higher stunting at birth in South Asia. From 0 to 15 months, stunting reversal was rare; children who reversed their stunting status frequently relapsed, and relapse rates were substantially higher among children born stunted. Early onset and low reversal rates suggest that improving children's linear growth will require life course interventions for women of childbearing age and a greater emphasis on interventions for children under 6 months of age.
Affiliation(s)
- Jade Benjamin-Chung
  - Department of Epidemiology & Population Health, Stanford University, Stanford, CA, USA.
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA.
  - Chan Zuckerberg Biohub, San Francisco, CA, USA.
- Andrew Mertens
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- John M Colford
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Alan E Hubbard
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Mark J van der Laan
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Jeremy Coyle
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Oleg Sofrygin
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Wilson Cai
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Anna Nguyen
  - Department of Epidemiology & Population Health, Stanford University, Stanford, CA, USA
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Nolan N Pokpongkiat
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Stephanie Djajadi
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Anmol Seth
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Wendy Jilek
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Esther Jung
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Esther O Chung
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Sonali Rosete
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Nima Hejazi
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Ivana Malenica
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Haodong Li
  - Division of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Ryan Hafen
  - Hafen Consulting, LLC, West Richland, WA, USA
- Thea Norman
  - Quantitative Sciences, Bill & Melinda Gates Foundation, Seattle, WA, USA
- Kenneth H Brown
  - Department of Nutrition, University of California, Davis, Davis, CA, USA
- Parul Christian
  - Center for Human Nutrition, Department of International Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Benjamin F Arnold
  - Francis I. Proctor Foundation, University of California, San Francisco, San Francisco, CA, USA.
  - Department of Ophthalmology, University of California, San Francisco, San Francisco, CA, USA.
7. Mertens A, Benjamin-Chung J, Colford JM, Coyle J, van der Laan MJ, Hubbard AE, Rosete S, Malenica I, Hejazi N, Sofrygin O, Cai W, Li H, Nguyen A, Pokpongkiat NN, Djajadi S, Seth A, Jung E, Chung EO, Jilek W, Subramoney V, Hafen R, Häggström J, Norman T, Brown KH, Christian P, Arnold BF. Causes and consequences of child growth faltering in low-resource settings. Nature 2023; 621:568-576. [PMID: 37704722 PMCID: PMC10511328 DOI: 10.1038/s41586-023-06501-x] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 06/09/2020] [Accepted: 07/31/2023] [Indexed: 09/15/2023]
Abstract
Growth faltering in children (low length for age or low weight for length) during the first 1,000 days of life (from conception to 2 years of age) influences short-term and long-term health and survival [1,2]. Interventions such as nutritional supplementation during pregnancy and the postnatal period could help prevent growth faltering, but programmatic action has been insufficient to eliminate the high burden of stunting and wasting in low- and middle-income countries. Identification of age windows and population subgroups on which to focus will benefit future preventive efforts. Here we use a population intervention effects analysis of 33 longitudinal cohorts (83,671 children, 662,763 measurements) and 30 separate exposures to show that improving maternal anthropometry and child condition at birth accounted for population increases in length-for-age z-scores of up to 0.40 and weight-for-length z-scores of up to 0.15 by 24 months of age. Boys had consistently higher risk of all forms of growth faltering than girls. Early postnatal growth faltering predisposed children to subsequent and persistent growth faltering. Children with multiple growth deficits exhibited higher mortality rates from birth to 2 years of age than children without growth deficits (hazard ratios 1.9 to 8.7). The importance of prenatal causes and severe consequences for children who experienced early growth faltering support a focus on pre-conception and pregnancy as a key opportunity for new preventive interventions.
Affiliation(s)
- Andrew Mertens
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA.
- Jade Benjamin-Chung
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
  - Department of Epidemiology and Population Health, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- John M Colford
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Jeremy Coyle
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Mark J van der Laan
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Alan E Hubbard
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Sonali Rosete
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Ivana Malenica
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Nima Hejazi
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Oleg Sofrygin
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Wilson Cai
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Haodong Li
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Anna Nguyen
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Nolan N Pokpongkiat
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Stephanie Djajadi
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Anmol Seth
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Esther Jung
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Esther O Chung
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Wendy Jilek
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Ryan Hafen
  - Hafen Consulting, West Richland, WA, USA
- Thea Norman
  - Quantitative Sciences, Bill & Melinda Gates Foundation, Seattle, WA, USA
- Kenneth H Brown
  - Department of Nutrition, University of California, Davis, Davis, CA, USA
- Parul Christian
  - Center for Human Nutrition, Department of International Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Benjamin F Arnold
  - Francis I. Proctor Foundation, University of California, San Francisco, San Francisco, CA, USA.
  - Department of Ophthalmology, University of California, San Francisco, San Francisco, CA, USA.
8. Mertens A, Benjamin-Chung J, Colford JM, Hubbard AE, van der Laan MJ, Coyle J, Sofrygin O, Cai W, Jilek W, Rosete S, Nguyen A, Pokpongkiat NN, Djajadi S, Seth A, Jung E, Chung EO, Malenica I, Hejazi N, Li H, Hafen R, Subramoney V, Häggström J, Norman T, Christian P, Brown KH, Arnold BF. Child wasting and concurrent stunting in low- and middle-income countries. Nature 2023; 621:558-567. [PMID: 37704720 PMCID: PMC10511327 DOI: 10.1038/s41586-023-06480-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/09/2020] [Accepted: 07/25/2023] [Indexed: 09/15/2023]
Abstract
Sustainable Development Goal 2.2 (to end malnutrition by 2030) includes the elimination of child wasting, defined as a weight-for-length z-score that is more than two standard deviations below the median of the World Health Organization standards for child growth [1]. Prevailing methods to measure wasting rely on cross-sectional surveys that cannot measure onset, recovery and persistence, key features that inform preventive interventions and estimates of disease burden. Here we analyse 21 longitudinal cohorts and show that wasting is a highly dynamic process of onset and recovery, with incidence peaking between birth and 3 months. Many more children experience an episode of wasting at some point during their first 24 months than prevalent cases at a single point in time suggest. For example, at the age of 24 months, 5.6% of children were wasted, but by the same age (24 months), 29.2% of children had experienced at least one wasting episode and 10.0% had experienced two or more episodes. Children who were wasted before the age of 6 months had a faster recovery and shorter episodes than did children who were wasted at older ages; however, early wasting increased the risk of later growth faltering, including concurrent wasting and stunting (low length-for-age z-score), and thus increased the risk of mortality. In diverse populations with high seasonal rainfall, the population average weight-for-length z-score varied substantially (more than 0.5 z in some cohorts), with the lowest mean z-scores occurring during the rainiest months; this indicates that seasonally targeted interventions could be considered. Our results show the importance of establishing interventions to prevent wasting from birth to the age of 6 months, probably through improved maternal nutrition, to complement current programmes that focus on children aged 6-59 months.
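The gap between the 5.6% point prevalence and the 29.2% cumulative incidence reported above comes from counting children at one visit versus across all visits. The sketch below shows the two calculations on a tiny made-up cohort; the children, z-scores, and function names are hypothetical, and the study's actual episode definition may include rules (such as recovery periods) that this sketch omits.

```python
WASTING_CUTOFF = -2.0  # weight-for-length z-score more than 2 SD below the median


def is_wasted(whz):
    return whz < WASTING_CUTOFF


def point_prevalence(cohort, age):
    """Share of children measured at `age` (months) who are wasted at that visit."""
    measured = [visits[age] for visits in cohort.values() if age in visits]
    return sum(is_wasted(z) for z in measured) / len(measured)


def cumulative_incidence(cohort, age):
    """Share of children with at least one wasted measurement at or before `age`."""
    ever = [
        any(is_wasted(z) for a, z in visits.items() if a <= age)
        for visits in cohort.values()
    ]
    return sum(ever) / len(ever)


# Hypothetical cohort: child -> {age in months: weight-for-length z-score}
cohort = {
    "child_a": {3: -2.4, 12: -1.0, 24: -0.8},  # wasted early, then recovered
    "child_b": {3: -0.5, 12: -2.3, 24: -1.9},  # wasted at 12 months only
    "child_c": {3: -1.2, 12: -1.1, 24: -2.1},  # wasted at 24 months
    "child_d": {3: 0.1, 12: -0.4, 24: -0.2},   # never wasted
}

print(point_prevalence(cohort, 24))      # one of four children wasted at 24 months
print(cumulative_incidence(cohort, 24))  # three of four ever wasted by 24 months
```

A cross-sectional survey at 24 months would see only the first quantity, which is why longitudinal follow-up is needed to measure onset and recovery.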
Affiliation(s)
- Andrew Mertens
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA.
- Jade Benjamin-Chung
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
  - Department of Epidemiology and Population Health, Stanford University, Stanford, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
- John M Colford
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Alan E Hubbard
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Mark J van der Laan
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Jeremy Coyle
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Oleg Sofrygin
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Wilson Cai
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Wendy Jilek
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Sonali Rosete
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Anna Nguyen
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Nolan N Pokpongkiat
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Stephanie Djajadi
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Anmol Seth
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Esther Jung
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Esther O Chung
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Ivana Malenica
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Nima Hejazi
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Haodong Li
  - Division of Epidemiology and Biostatistics, University of California, Berkeley, Berkeley, CA, USA
- Ryan Hafen
  - Hafen Consulting, West Richland, WA, USA
- Thea Norman
  - Quantitative Sciences, Bill & Melinda Gates Foundation, Seattle, WA, USA
- Parul Christian
  - Center for Human Nutrition, Department of International Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Kenneth H Brown
  - Department of Nutrition, University of California, Davis, Davis, CA, USA
- Benjamin F Arnold
  - Francis I. Proctor Foundation, University of California, San Francisco, CA, USA.
  - Department of Ophthalmology, University of California, San Francisco, CA, USA.
9. Wei W, Petersen M, van der Laan MJ, Zheng Z, Wu C, Wang J. Efficient targeted learning of heterogeneous treatment effects for multiple subgroups. Biometrics 2023; 79:1934-1946. [PMID: 36416173 DOI: 10.1111/biom.13800] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Accepted: 10/20/2022] [Indexed: 11/23/2022]
Abstract
In biomedical science, analyzing treatment effect heterogeneity plays an essential role in assisting personalized medicine. The main goals of analyzing treatment effect heterogeneity include estimating treatment effects in clinically relevant subgroups and predicting whether a patient subpopulation might benefit from a particular treatment. Conventional approaches often evaluate the subgroup treatment effects via parametric modeling and can thus be susceptible to model mis-specifications. In this paper, we take a model-free semiparametric perspective and aim to efficiently evaluate the heterogeneous treatment effects of multiple subgroups simultaneously under the one-step targeted maximum-likelihood estimation (TMLE) framework. When the number of subgroups is large, we further expand this path of research by looking at a variation of the one-step TMLE that is robust to the presence of small estimated propensity scores in finite samples. From our simulations, our method demonstrates substantial finite sample improvements compared to conventional methods. In a case study, our method unveils the potential treatment effect heterogeneity of rs12916-T allele (a proxy for statin usage) in decreasing Alzheimer's disease risk.
Affiliation(s)
- Waverly Wei
- Division of Biostatistics, University of California, Berkeley, California, USA
| | - Maya Petersen
- Division of Biostatistics, University of California, Berkeley, California, USA
| | - Mark J van der Laan
- Division of Biostatistics, University of California, Berkeley, California, USA
| | - Zeyu Zheng
- Department of Industrial Engineering and Operations Research, University of California, Berkeley, California, USA
| | - Chong Wu
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Texas, USA
| | - Jingshen Wang
- Division of Biostatistics, University of California, Berkeley, California, USA
| |
|
10
|
Benitez A, Petersen ML, van der Laan MJ, Santos N, Butrick E, Walker D, Ghosh R, Otieno P, Waiswa P, Balzer LB. Defining and estimating effects in cluster randomized trials: A methods comparison. Stat Med 2023; 42:3443-3466. [PMID: 37308115 PMCID: PMC10898620 DOI: 10.1002/sim.9813] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 01/25/2022] [Revised: 04/27/2023] [Accepted: 05/21/2023] [Indexed: 06/14/2023]
Abstract
Across research disciplines, cluster randomized trials (CRTs) are commonly implemented to evaluate interventions delivered to groups of participants, such as communities and clinics. Despite advances in the design and analysis of CRTs, several challenges remain. First, there are many possible ways to specify the causal effect of interest (eg, at the individual-level or at the cluster-level). Second, the theoretical and practical performance of common methods for CRT analysis remain poorly understood. Here, we present a general framework to formally define an array of causal effects in terms of summary measures of counterfactual outcomes. Next, we provide a comprehensive overview of CRT estimators, including the t-test, generalized estimating equations (GEE), augmented-GEE, and targeted maximum likelihood estimation (TMLE). Using finite sample simulations, we illustrate the practical performance of these estimators for different causal effects and when, as commonly occurs, there are limited numbers of clusters of different sizes. Finally, our application to data from the Preterm Birth Initiative (PTBi) study demonstrates the real-world impact of varying cluster sizes and targeting effects at the cluster-level or at the individual-level. Specifically, the relative effect of the PTBi intervention was 0.81 at the cluster-level, corresponding to a 19% reduction in outcome incidence, and was 0.66 at the individual-level, corresponding to a 34% reduction in outcome risk. Given its flexibility to estimate a variety of user-specified effects and ability to adaptively adjust for covariates for precision gains while maintaining Type-I error control, we conclude TMLE is a promising tool for CRT analysis.
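The distinction between cluster-level and individual-level effects in the abstract above can be illustrated with a small sketch. The function names and the toy cluster counts below are hypothetical and are not taken from the PTBi study; only the conversion of a relative effect into a percent reduction (0.81 to 19%, 0.66 to 34%) mirrors figures stated in the abstract.

```python
def percent_reduction(relative_effect):
    """Convert a relative effect (risk or incidence ratio) to a percent reduction."""
    return round((1.0 - relative_effect) * 100.0, 1)


def cluster_level_risk(clusters):
    """Average of cluster-specific risks: every cluster gets equal weight.

    clusters is a list of (events, size) pairs (hypothetical layout).
    """
    return sum(events / size for events, size in clusters) / len(clusters)


def individual_level_risk(clusters):
    """Pooled risk: every individual gets equal weight, so large clusters dominate."""
    total_events = sum(events for events, _ in clusters)
    total_size = sum(size for _, size in clusters)
    return total_events / total_size


if __name__ == "__main__":
    # Toy arm with two clusters of unequal size (made-up numbers).
    arm = [(20, 100), (90, 300)]
    print(cluster_level_risk(arm))     # equal weight per cluster
    print(individual_level_risk(arm))  # equal weight per participant
    print(percent_reduction(0.81), percent_reduction(0.66))
```

When cluster sizes vary and risk is related to cluster size, the two summaries differ, which is one reason the choice of target effect matters in such analyses.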
Affiliation(s)
| | - Maya L. Petersen
- School of Public Health, Biostatistics, University of California Berkeley, Berkeley, California
| | - Mark J. van der Laan
- School of Public Health, Biostatistics, University of California Berkeley, Berkeley, California
| | - Nicole Santos
- Institute for Global Health Sciences, University of California San Francisco, San Francisco, California
| | - Elizabeth Butrick
- Institute for Global Health Sciences, University of California San Francisco, San Francisco, California
| | - Dilys Walker
- Institute for Global Health Sciences, University of California San Francisco, San Francisco, California
| | - Rakesh Ghosh
- Institute for Global Health Sciences, University of California San Francisco, San Francisco, California
| | - Phelgona Otieno
- Center for Clinical Research, Kenya Medical Research Institute, Nairobi, Kenya
| | - Peter Waiswa
- Centre of Excellence for Maternal, Newborn and Child Health, Makerere University College of Health Sciences, Kampala, Uganda
| | - Laura B. Balzer
- School of Public Health, Biostatistics, University of California Berkeley, Berkeley, California
| |
|
11
|
Phillips RV, van der Laan MJ, Lee H, Gruber S. Practical considerations for specifying a super learner. Int J Epidemiol 2023; 52:1276-1285. [PMID: 36905602 DOI: 10.1093/ije/dyad023] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 04/01/2022] [Accepted: 02/23/2023] [Indexed: 03/12/2023] Open
Abstract
Common tasks encountered in epidemiology, including disease incidence estimation and causal inference, rely on predictive modelling. Constructing a predictive model can be thought of as learning a prediction function (a function that takes as input covariate data and outputs a predicted value). Many strategies for learning prediction functions from data (learners) are available, from parametric regressions to machine learning algorithms. It can be challenging to choose a learner, as it is impossible to know in advance which one is the most suitable for a particular dataset and prediction task. The super learner (SL) is an algorithm that alleviates concerns over selecting the one 'right' learner by providing the freedom to consider many, such as those recommended by collaborators, used in related research or specified by subject-matter experts. Also known as stacking, SL is an entirely prespecified and flexible approach for predictive modelling. To ensure the SL is well specified for learning the desired prediction function, the analyst does need to make a few important choices. In this educational article, we provide step-by-step guidelines for making these decisions, walking the reader through each of them and providing intuition along the way. In doing so, we aim to empower the analyst to tailor the SL specification to their prediction task, thereby ensuring their SL performs as well as possible. A flowchart provides a concise, easy-to-follow summary of key suggestions and heuristics, based on our accumulated experience and guided by SL optimality theory.
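As a rough illustration of the idea described in the abstract above, the following minimal sketch implements a discrete super learner, that is, a cross-validated selector that picks the single candidate with the lowest cross-validated risk. This is a simpler relative of the ensemble (stacked) super learner and is not the authors' implementation. The two candidate learners, the squared-error loss, the fold count and all function names are assumptions made for illustration.

```python
import numpy as np


def fit_mean(X, y):
    """Candidate 1 (hypothetical): predict the training mean for everyone."""
    mu = y.mean()
    return lambda X_new: np.full(len(X_new), mu)


def fit_ols(X, y):
    """Candidate 2 (hypothetical): ordinary least squares with an intercept."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return lambda X_new: np.column_stack([np.ones(len(X_new)), X_new]) @ beta


def cv_risks(X, y, learners, n_folds=5, seed=0):
    """Cross-validated mean squared error of each candidate learner."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    risks = {name: 0.0 for name in learners}
    for val_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[val_idx] = False
        for name, fit in learners.items():
            predict = fit(X[train], y[train])
            fold_mse = np.mean((y[val_idx] - predict(X[val_idx])) ** 2)
            risks[name] += fold_mse / n_folds
    return risks


def discrete_super_learner(X, y, learners, n_folds=5, seed=0):
    """Select the candidate with the lowest CV risk and refit it on all data."""
    risks = cv_risks(X, y, learners, n_folds=n_folds, seed=seed)
    best = min(risks, key=risks.get)
    return best, learners[best](X, y), risks
```

A call such as `discrete_super_learner(X, y, {"mean": fit_mean, "ols": fit_ols})` returns the name of the selected candidate, a prediction function refit on the full data, and the cross-validated risks. An ensemble super learner would instead fit a metalearner on the cross-validated predictions to weight the candidates.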
Affiliation(s)
- Rachael V Phillips
- Division of Biostatistics, School of Public Health, University of California at Berkeley, Berkeley, California, United States
| | - Mark J van der Laan
- Division of Biostatistics, School of Public Health, University of California at Berkeley, Berkeley, California, United States
| | - Hana Lee
- Office of Biostatistics, Center for Drug Evaluation and Research, United States Food and Drug Administration, Silver Spring, Maryland, United States
| | - Susan Gruber
- Putnam Data Sciences, LLC, Cambridge, Massachusetts, United States
| |
|
12
|
Chen D, Petersen ML, Rytgaard HC, Grøn R, Lange T, Rasmussen S, Pratley RE, Marso SP, Kvist K, Buse J, van der Laan MJ. Beyond the Cox Hazard Ratio: A Targeted Learning Approach to Survival Analysis in a Cardiovascular Outcome Trial Application. Stat Biopharm Res 2023. [DOI: 10.1080/19466315.2023.2173644] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2023]
Affiliation(s)
- David Chen
- Division of Biostatistics, UC Berkeley School of Public Health, Berkeley, CA
| | - Maya L. Petersen
- Division of Biostatistics, UC Berkeley School of Public Health, Berkeley, CA
| | | | | | - Theis Lange
- Section of Biostatistics, University of Copenhagen, Copenhagen, Denmark
| | | | | | - Steven P. Marso
- Midwest Heart and Vascular Institute, HCA Midwest Health, Overland Park, KS
| | | | - John Buse
- Division of Endocrinology and Metabolism, UNC School of Medicine, Chapel Hill, NC
| | | |
|
13
|
Malenica I, Phillips RV, Chambaz A, Hubbard AE, Pirracchio R, van der Laan MJ. Personalized online ensemble machine learning with applications for dynamic data streams. Stat Med 2023; 42:1013-1044. [PMID: 36897184 DOI: 10.1002/sim.9655] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2021] [Revised: 11/09/2022] [Accepted: 12/29/2022] [Indexed: 01/18/2023]
Abstract
In this work we introduce the personalized online super learner (POSL), an online personalizable ensemble machine learning algorithm for streaming data. POSL optimizes predictions with respect to baseline covariates, so personalization can vary from completely individualized, that is, optimization with respect to subject ID, to many individuals, that is, optimization with respect to common baseline covariates. As an online algorithm, POSL learns in real time. As a super learner, POSL is grounded in statistical optimality theory and can leverage a diversity of candidate algorithms, including online algorithms with different training and update times, fixed/offline algorithms that are not updated during POSL's fitting procedure, pooled algorithms that learn from many individuals' time series, and individualized algorithms that learn from within a single time series. POSL's ensembling of the candidates can depend on the amount of data collected, the stationarity of the time series, and the mutual characteristics of a group of time series. Depending on the underlying data-generating process and the information available in the data, POSL is able to adapt to learning across samples, through time, or both. For a range of simulations that reflect realistic forecasting scenarios and in a medical application, we examine the performance of POSL relative to other current ensembling and online learning methods. We show that POSL is able to provide reliable predictions for both short and long time series, and it's able to adjust to changing data-generating environments. We further cultivate POSL's practicality by extending it to settings where time series dynamically enter and exit.
Affiliation(s)
- Ivana Malenica
- Division of Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Rachael V Phillips
- Division of Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Antoine Chambaz
- Applied Mathematics at Paris 5 (MAP5), University of Paris, Paris, France
| | - Alan E Hubbard
- Division of Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Romain Pirracchio
- Department of Anesthesia and Perioperative Care, University of California, San Francisco, San Francisco, California, USA
| | - Mark J van der Laan
- Division of Biostatistics, University of California, Berkeley, Berkeley, California, USA
| |
|
14
|
Phillips RV, van der Laan MJ. Discussion on "Adaptive enrichment designs with a continuous biomarker" by Nigel Stallard. Biometrics 2023; 79:20-22. [PMID: 35332936 DOI: 10.1111/biom.13640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2021] [Accepted: 12/22/2021] [Indexed: 11/30/2022]
Affiliation(s)
- Rachael V Phillips
- Division of Biostatistics, University of California at Berkeley, Berkeley, California, USA
| | - Mark J van der Laan
- Division of Biostatistics, University of California at Berkeley, Berkeley, California, USA
| |
|
15
|
Hejazi NS, Boileau P, van der Laan MJ, Hubbard AE. A generalization of moderated statistics to data adaptive semiparametric estimation in high-dimensional biology. Stat Methods Med Res 2023; 32:539-554. [PMID: 36573044 PMCID: PMC11078029 DOI: 10.1177/09622802221146313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2022]
Abstract
The widespread availability of high-dimensional biological data has made the simultaneous screening of many biological characteristics a central problem in computational and high-dimensional biology. As the dimensionality of datasets continues to grow, so too does the complexity of identifying biomarkers linked to exposure patterns. The statistical analysis of such data often relies upon parametric modeling assumptions motivated by convenience, inviting opportunities for model misspecification. While estimation frameworks incorporating flexible, data adaptive regression strategies can mitigate this, their standard variance estimators are often unstable in high-dimensional settings, resulting in inflated Type-I error even after standard multiple testing corrections. We adapt a shrinkage approach compatible with parametric modeling strategies to semiparametric variance estimators of a family of efficient, asymptotically linear estimators of causal effects, defined by counterfactual exposure contrasts. Augmenting the inferential stability of these estimators in high-dimensional settings yields a data adaptive approach for robustly uncovering stable causal associations, even when sample sizes are limited. Our generalized variance estimator is evaluated against appropriate alternatives in numerical experiments, and an open source R/Bioconductor package, biotmle, is introduced. The proposal is demonstrated in an analysis of high-dimensional DNA methylation data from an observational study on the epigenetic effects of tobacco smoking.
Affiliation(s)
- Nima S Hejazi
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
| | - Philippe Boileau
- Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, CA, USA
| | - Mark J van der Laan
- Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, CA, USA
- Department of Statistics, University of California, Berkeley, CA, USA
| | - Alan E Hubbard
- Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, CA, USA
| |
|
16
|
Ogburn EL, Sofrygin O, Díaz I, van der Laan MJ. Causal Inference for Social Network Data. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2131557] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Affiliation(s)
- Elizabeth L. Ogburn
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
| | - Oleg Sofrygin
- Kaiser Permanente Division of Research, 2000 Broadway, Oakland, CA, 94612, USA
| | - Iván Díaz
- Division of Biostatistics and Epidemiology, Weill Cornell Medicine, New York, NY, USA
| | - Mark J. van der Laan
- Department of Biostatistics, University of California Berkeley, 2121 Berkeley Way, Berkeley, CA, 94720, USA
| |
|
17
|
Boileau P, Hejazi NS, van der Laan MJ, Dudoit S. Cross-Validated Loss-Based Covariance Matrix Estimator Selection in High Dimensions. J Comput Graph Stat 2022; 32:601-612. [PMID: 37273839 PMCID: PMC10237052 DOI: 10.1080/10618600.2022.2110883] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2021] [Accepted: 07/28/2022] [Indexed: 10/15/2022]
Abstract
The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience. Thus, a variety of estimators have been derived to overcome the shortcomings of the canonical estimator in such settings. Yet, selecting an optimal estimator from among the plethora available remains an open challenge. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. We propose a general class of loss functions for covariance matrix estimation and establish accompanying finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validation selector. In numerical experiments, we demonstrate the optimality of our proposed selector in moderate sample sizes and across diverse data-generating processes. The practical benefits of our procedure are highlighted in a dimension reduction application to single-cell transcriptome sequencing data.
Affiliation(s)
- Philippe Boileau
- Graduate Group in Biostatistics and Center for Computational Biology, UC Berkeley
| | - Nima S. Hejazi
- Division of Biostatistics, Department of Population Health Sciences, Weill Cornell Medicine
| | - Mark J. van der Laan
- Division of Biostatistics, Department of Statistics, and Center for Computational Biology, UC Berkeley
| | - Sandrine Dudoit
- Department of Statistics, Division of Biostatistics, and Center for Computational Biology, UC Berkeley
| |
|
18
|
Affiliation(s)
| | | | - Mark J. van der Laan
- Division of Biostatistics and Center for Targeted Machine Learning and Causal Inference, University of California, Berkeley
| |
|
19
|
Gruber S, Phillips RV, Lee H, van der Laan MJ. Data-Adaptive Selection of the Propensity Score Truncation Level for Inverse-Probability-Weighted and Targeted Maximum Likelihood Estimators of Marginal Point Treatment Effects. Am J Epidemiol 2022; 191:1640-1651. [PMID: 35512316 DOI: 10.1093/aje/kwac087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2021] [Revised: 03/22/2022] [Accepted: 04/29/2022] [Indexed: 01/29/2023] Open
Abstract
Inverse probability weighting (IPW) and targeted maximum likelihood estimation (TMLE) are methodologies that can adjust for confounding and selection bias and are often used for causal inference. Both estimators rely on the positivity assumption that within strata of confounders there is a positive probability of receiving treatment at all levels under consideration. Practical applications of IPW require finite inverse probability (IP) weights. TMLE requires that propensity scores (PS) be bounded away from 0 and 1. Although truncation can improve variance and finite sample bias, this artificial distortion of the IP weights and PS distribution introduces asymptotic bias. As sample size grows, truncation-induced bias eventually swamps variance, rendering nominal confidence interval coverage and hypothesis tests invalid. We present a simple truncation strategy based on the sample size, n, that sets the upper bound on IP weights at $\sqrt{n}\,\ln n/5$. For TMLE, the lower bound on the PS should be set to $5/(\sqrt{n}\,\ln n/5)$. Our strategy was designed to optimize the mean squared error of the parameter estimate. It naturally extends to data structures with missing outcomes. Simulation studies and a data analysis demonstrate our strategy's ability to minimize both bias and mean squared error in comparison with other common strategies, including the popular but flawed quantile-based heuristic.
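The sample-size-based truncation rule in the abstract above might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The weight cap is read here as sqrt(n) * ln(n) / 5, and the propensity score floor is taken to be the reciprocal of that cap. The typeset expression for the floor in the abstract is ambiguous, so the reciprocal reading is an assumption and the paper should be consulted for the exact bound. All function names are hypothetical.

```python
import math


def ip_weight_cap(n):
    """Upper bound on IP weights, read as sqrt(n) * ln(n) / 5 (assumed reading)."""
    if n < 2:
        raise ValueError("n must be at least 2 so that ln(n) is positive")
    return math.sqrt(n) * math.log(n) / 5.0


def ps_floor(n):
    """Lower bound on the propensity score.

    Assumption: the floor is the reciprocal of the weight cap,
    i.e. 5 / (sqrt(n) * ln(n)). The abstract's expression is ambiguous.
    """
    cap = ip_weight_cap(n)
    if cap <= 2.0:
        raise ValueError("n is too small for a floor below 0.5 under this rule")
    return 1.0 / cap


def truncate_ps(ps, n):
    """Raise any propensity score below the floor up to the floor."""
    floor = ps_floor(n)
    return [max(p, floor) for p in ps]
```

Under this reading the floor shrinks as n grows, so the truncation becomes less aggressive in larger samples, in contrast to a fixed cutoff.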
|
20
|
van der Laan MJ, Benkeser D, Cai W. Efficient estimation of pathwise differentiable target parameters with the undersmoothed highly adaptive lasso. Int J Biostat 2022:ijb-2019-0092. [DOI: 10.1515/ijb-2019-0092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2019] [Accepted: 05/09/2022] [Indexed: 11/15/2022]
Abstract
We consider estimation of a functional parameter of a realistically modeled data distribution based on observing independent and identically distributed observations. The highly adaptive lasso estimator of the functional parameter is defined as the minimizer of the empirical risk over a class of cadlag functions with finite sectional variation norm, where the functional parameter is parametrized in terms of such a class of functions. In this article we establish that this HAL estimator yields an asymptotically efficient estimator of any smooth feature of the functional parameter under a global undersmoothing condition. It is formally shown that the $L_1$-restriction in HAL does not obstruct it from solving the score equations along paths that do not enforce this condition. Therefore, from an asymptotic point of view, the only reason for undersmoothing is that the true target function might not be complex so that the HAL-fit leaves out key basis functions that are needed to span the desired efficient influence curve of the smooth target parameter. Nonetheless, in practice undersmoothing appears to be beneficial and a simple targeted method is proposed and practically verified to perform well. We demonstrate our general result for the HAL-estimator of a treatment-specific mean and of the integrated square density. We also present simulations for these two examples confirming the theory.
Affiliation(s)
| | - David Benkeser
- Department of Biostatistics and Bioinformatics, Emory University, Atlanta, USA
| | - Weixin Cai
- Division of Biostatistics, University of California, Berkeley, USA
| |
|
21
|
Ertefaie A, Hejazi NS, van der Laan MJ. Nonparametric inverse-probability-weighted estimators based on the highly adaptive lasso. Biometrics 2022:10.1111/biom.13719. [PMID: 35839293 PMCID: PMC9840713 DOI: 10.1111/biom.13719] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 11/25/2020] [Accepted: 05/12/2022] [Indexed: 01/18/2023]
Abstract
Inverse-probability-weighted estimators are the oldest and potentially most commonly used class of procedures for the estimation of causal effects. By adjusting for selection biases via a weighting mechanism, these procedures estimate an effect of interest by constructing a pseudopopulation in which selection biases are eliminated. Despite their ease of use, these estimators require the correct specification of a model for the weighting mechanism, are known to be inefficient, and suffer from the curse of dimensionality. We propose a class of nonparametric inverse-probability-weighted estimators in which the weighting mechanism is estimated via undersmoothing of the highly adaptive lasso, a nonparametric regression function proven to converge at nearly $n^{-1/3}$-rate to the true weighting mechanism. We demonstrate that our estimators are asymptotically linear with variance converging to the nonparametric efficiency bound. Unlike doubly robust estimators, our procedures require neither derivation of the efficient influence function nor specification of the conditional outcome model. Our theoretical developments have broad implications for the construction of efficient inverse-probability-weighted estimators in large statistical models and a variety of problem settings. We assess the practical performance of our estimators in simulation studies and demonstrate use of our proposed methodology with data from a large-scale epidemiologic study.
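For readers unfamiliar with the weighting idea that the abstract above builds on, here is a minimal sketch of a plain inverse-probability-weighted estimate of an average treatment effect for a binary treatment. It takes the propensity scores as given and does not implement the undersmoothed highly adaptive lasso that the paper proposes for estimating them. The function name and argument layout are hypothetical.

```python
import numpy as np


def ipw_ate(y, a, ps):
    """Plain IPW estimate of E[Y(1)] - E[Y(0)].

    y  : outcomes
    a  : binary treatment indicators (1 = treated, 0 = control)
    ps : propensity scores P(A = 1 | W), assumed already estimated
    """
    y, a, ps = (np.asarray(v, dtype=float) for v in (y, a, ps))
    if np.any((ps <= 0.0) | (ps >= 1.0)):
        raise ValueError("propensity scores must lie strictly between 0 and 1")
    mean_treated = np.mean(a * y / ps)                    # weighted mean under treatment
    mean_control = np.mean((1.0 - a) * y / (1.0 - ps))    # weighted mean under control
    return mean_treated - mean_control
```

Each observed unit is weighted by the inverse of the probability of the treatment it actually received, which is what creates the pseudopopulation the abstract refers to. The quality of the estimate depends entirely on how well `ps` is estimated.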
Affiliation(s)
- Ashkan Ertefaie
- Department of Biostatistics and Computational Biology, University of Rochester
| | - Nima S. Hejazi
- Division of Biostatistics, Department of Population Health Sciences, Weill Cornell Medicine
| | - Mark J. van der Laan
- Division of Biostatistics, School of Public Health, University of California, Berkeley; Department of Statistics, University of California, Berkeley
| |
|
22
|
Phillips RV, van der Laan MJ. Rachael V. Phillips and Mark J. van der Laan’s contribution to the Discussion of ‘Assumption‐lean inference for generalised linear model parameters’ by Vansteelandt and Dukes. J R Stat Soc Series B Stat Methodol 2022. [DOI: 10.1111/rssb.12529] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
|
23
|
Montoya LM, van der Laan MJ, Luedtke AR, Skeem JL, Coyle JR, Petersen ML. The optimal dynamic treatment rule superlearner: considerations, performance, and application to criminal justice interventions. Int J Biostat 2022:ijb-2020-0127. [PMID: 35708222 DOI: 10.1515/ijb-2020-0127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2020] [Accepted: 05/06/2022] [Indexed: 11/15/2022]
Abstract
The optimal dynamic treatment rule (ODTR) framework offers an approach for understanding which kinds of patients respond best to specific treatments - in other words, treatment effect heterogeneity. Recently, there has been a proliferation of methods for estimating the ODTR. One such method is an extension of the SuperLearner algorithm - an ensemble method to optimally combine candidate algorithms extensively used in prediction problems - to ODTRs. Following the "causal roadmap," we causally and statistically define the ODTR and provide an introduction to estimating it using the ODTR SuperLearner. Additionally, we highlight practical choices when implementing the algorithm, including choice of candidate algorithms, metalearners to combine the candidates, and risk functions to select the best combination of algorithms. Using simulations, we illustrate how estimating the ODTR using this SuperLearner approach can uncover treatment effect heterogeneity more effectively than traditional approaches based on fitting a parametric regression of the outcome on the treatment, covariates and treatment-covariate interactions. We investigate the implications of choices in implementing an ODTR SuperLearner at various sample sizes. Our results show the advantages of: (1) including a combination of both flexible machine learning algorithms and simple parametric estimators in the library of candidate algorithms; (2) using an ensemble metalearner to combine candidates rather than selecting only the best-performing candidate; (3) using the mean outcome under the rule as a risk function. Finally, we apply the ODTR SuperLearner to the "Interventions" study, an ongoing randomized controlled trial, to identify which justice-involved adults with mental illness benefit most from cognitive behavioral therapy to reduce criminal re-offending.
Affiliation(s)
- Lina M Montoya
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
| | | | | | - Jennifer L Skeem
- School of Social Work and Goldman School of Public Policy, University of California Berkeley, Berkeley, USA
| | - Jeremy R Coyle
- Division of Biostatistics, University of California Berkeley, Berkeley, USA
| | - Maya L Petersen
- Divisions of Biostatistics and Epidemiology, University of California Berkeley, Berkeley, USA
| |
|
24
|
Montoya LM, van der Laan MJ, Skeem JL, Petersen ML. Estimators for the value of the optimal dynamic treatment rule with application to criminal justice interventions. Int J Biostat 2022:ijb-2020-0128. [PMID: 35659857 DOI: 10.1515/ijb-2020-0128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2020] [Accepted: 05/06/2022] [Indexed: 11/15/2022]
Abstract
Given an (optimal) dynamic treatment rule, it may be of interest to evaluate that rule - that is, to ask the causal question: what is the expected outcome had every subject received treatment according to that rule? In this paper, we study the performance of estimators that approximate the true value of: (1) an a priori known dynamic treatment rule; (2) the true, unknown optimal dynamic treatment rule (ODTR); (3) an estimated ODTR, a so-called "data-adaptive parameter," whose true value depends on the sample. Using simulations of point-treatment data, we specifically investigate: (1) the impact of increasingly data-adaptive estimation of nuisance parameters and/or of the ODTR on performance; (2) the potential for improved efficiency and bias reduction through the use of semiparametric efficient estimators; and, (3) the importance of sample splitting based on the cross-validated targeted maximum likelihood estimator (CV-TMLE) for accurate inference. In the simulations considered, there was very little cost and many benefits to using CV-TMLE to estimate the value of the true and estimated ODTR; importantly, and in contrast to non cross-validated estimators, the performance of CV-TMLE was maintained even when highly data-adaptive algorithms were used to estimate both nuisance parameters and the ODTR. In addition, we apply these estimators for the value of the rule to the "Interventions" study, an ongoing randomized controlled trial, to identify whether assigning cognitive behavioral therapy (CBT) to criminal justice-involved adults with mental illness using an ODTR significantly reduces the probability of recidivism, compared to assigning CBT in a non-individualized way.
Affiliation(s)
- Lina M Montoya
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, 27599-7400, USA
| | | | - Jennifer L Skeem
- School of Social Work and Goldman School of Public Policy, University of California Berkeley, Berkeley, USA
| | - Maya L Petersen
- Divisions of Biostatistics and Epidemiology, University of California Berkeley, Berkeley, USA
| |
|
25
|
Li H, Rosete S, Coyle J, Phillips RV, Hejazi NS, Malenica I, Arnold BF, Benjamin-Chung J, Mertens A, Colford JM, van der Laan MJ, Hubbard AE. Evaluating the robustness of targeted maximum likelihood estimators via realistic simulations in nutrition intervention trials. Stat Med 2022; 41:2132-2165. [PMID: 35172378 PMCID: PMC10362909 DOI: 10.1002/sim.9348] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/09/2021] [Revised: 01/20/2022] [Accepted: 01/26/2022] [Indexed: 12/18/2022]
Abstract
Several recently developed methods have the potential to harness machine learning in the pursuit of target quantities inspired by causal inference, including inverse weighting, doubly robust estimating equations and substitution estimators like targeted maximum likelihood estimation. There are even more recent augmentations of these procedures that can increase robustness, by adding a layer of cross-validation (cross-validated targeted maximum likelihood estimation and double machine learning, as applied to substitution and estimating equation approaches, respectively). While these methods have been evaluated individually on simulated and experimental data sets, a comprehensive analysis of their performance across real-data-based simulations has yet to be conducted. In this work, we benchmark multiple widely used methods for estimation of the average treatment effect using data from ten different nutrition intervention studies. A nonparametric regression method, undersmoothed highly adaptive lasso, is used to generate the simulated distribution which preserves important features from the observed data and reproduces a set of true target parameters. For each simulated data set, we apply the methods above to estimate the average treatment effects as well as their standard errors and resulting confidence intervals. Based on the analytic results, a general recommendation is put forth for use of the cross-validated variants of both substitution and estimating equation estimators. We conclude that the additional layer of cross-validation helps in avoiding unintentional over-fitting of nuisance parameter functionals and leads to more robust inferences.
Affiliation(s)
- Haodong Li
- Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Sonali Rosete
- Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Jeremy Coyle
- Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Rachael V Phillips
- Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Nima S Hejazi
- Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Ivana Malenica
- Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Benjamin F Arnold
- Proctor Foundation, University of California, San Francisco, San Francisco, California, USA
| | - Jade Benjamin-Chung
- Epidemiology & Population Health, Stanford University, Stanford, California, USA
| | - Andrew Mertens
- Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - John M Colford
- Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Mark J van der Laan
- Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Alan E Hubbard
- Divisions of Epidemiology & Biostatistics, University of California, Berkeley, Berkeley, California, USA
| |
|
26
|
Hejazi NS, van der Laan MJ, Janes HE, Gilbert PB, Benkeser DC. Efficient nonparametric inference on the effects of stochastic interventions under two-phase sampling, with applications to vaccine efficacy trials. Biometrics 2021; 77:1241-1253. [PMID: 32949147 PMCID: PMC8016405 DOI: 10.1111/biom.13375] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/02/2020] [Revised: 08/02/2020] [Accepted: 09/01/2020] [Indexed: 12/17/2022]
Abstract
The advent and subsequent widespread availability of preventive vaccines has altered the course of public health over the past century. Despite this success, effective vaccines to prevent many high-burden diseases, including human immunodeficiency virus (HIV), have been slow to develop. Vaccine development can be aided by the identification of immune response markers that serve as effective surrogates for clinically significant infection or disease endpoints. However, measuring immune response marker activity is often costly, which has motivated the usage of two-phase sampling for immune response evaluation in clinical trials of preventive vaccines. In such trials, the measurement of immunological markers is performed on a subset of trial participants, where enrollment in this second phase is potentially contingent on the observed study outcome and other participant-level information. We propose nonparametric methodology for efficiently estimating a counterfactual parameter that quantifies the impact of a given immune response marker on the subsequent probability of infection. Along the way, we fill in theoretical gaps pertaining to the asymptotic behavior of nonparametric efficient estimators in the context of two-phase sampling, including a multiple robustness property enjoyed by our estimators. Techniques for constructing confidence intervals and hypothesis tests are presented, and an open source software implementation of the methodology, the txshift R package, is introduced. We illustrate the proposed techniques using data from a recent preventive HIV vaccine efficacy trial.
Affiliation(s)
- Nima S Hejazi
- Graduate Group in Biostatistics, University of California, Berkeley, California
- Center for Computational Biology, University of California, Berkeley, California
| | - Mark J van der Laan
- Division of Epidemiology & Biostatistics, School of Public Health, University of California, Berkeley, California
- Department of Statistics, University of California, Berkeley, California
| | - Holly E Janes
- Vaccine & Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Department of Biostatistics, University of Washington, Seattle, Washington
| | - Peter B Gilbert
- Vaccine & Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington
- Department of Biostatistics, University of Washington, Seattle, Washington
| | - David C Benkeser
- Department of Biostatistics & Computational Biology, Rollins School of Public Health, Emory University, Atlanta, Georgia
| |
|
27
|
Rudolph KE, Díaz I, Hejazi NS, van der Laan MJ, Luo SX, Shulman M, Campbell A, Rotrosen J, Nunes EV. Explaining differential effects of medication for opioid use disorder using a novel approach incorporating mediating variables. Addiction 2021; 116:2094-2103. [PMID: 33340181 DOI: 10.1111/add.15377] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/25/2020] [Revised: 10/28/2020] [Accepted: 12/09/2020] [Indexed: 01/29/2023]
Abstract
BACKGROUND AND AIMS: A recent study found that homeless individuals with opioid use disorder (OUD) had a lower risk of relapse on extended-release naltrexone (XR-NTX) versus buprenorphine-naloxone (BUP-NX), whereas non-homeless individuals had a lower risk of relapse on BUP-NX. This secondary study examined differences in mediation pathways to medication effect between homeless and non-homeless participants.
DESIGN: Secondary analysis of an open-label randomized controlled, 24-week comparative effectiveness trial, 2014-17.
SETTING: Eight community addiction treatment programs in the United States.
PARTICIPANTS: English-speaking adults with DSM-5 OUD, recruited during inpatient admission (n = 570).
INTERVENTION(S): Randomization to monthly injection of XR-NTX or daily sublingual BUP-NX.
MEASUREMENTS: Mediation analysis estimated the direct effect of XR-NTX versus BUP-NX on relapse and indirect effect through mediators of medication adherence, use of illicit opioids, depressive symptoms and pain, separately by homeless status.
FINDINGS: For the homeless subgroup, the protective indirect path contributed a 3.4 percentage point reduced risk of relapse [95% confidence interval (CI) = -12.0, 5.3] comparing XR-NTX to BUP-NX (explaining 21% of the total effect). For the non-homeless subgroup, the indirect path contributed a 9.4 percentage point increased risk of relapse (95% CI = 3.1, 15.7) comparing XR-NTX to BUP-NX (explaining 57% of the total effect).
CONCLUSIONS: A novel approach to mediation analysis shows that much of the difference in medication effectiveness (extended-release naltrexone versus buprenorphine-naloxone) on opioid relapse among non-homeless adults with opioid use disorder appears to be explained by mediators of adherence, illicit opioid use, depressive symptoms and pain.
Affiliation(s)
- Kara E Rudolph
- Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY, USA
| | - Iván Díaz
- Division of Biostatistics, Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
| | - Nima S Hejazi
- Division of Epidemiology and Biostatistics, School of Public Health, University of California, Berkeley, CA, USA
| | - Mark J van der Laan
- Division of Epidemiology and Biostatistics, School of Public Health, University of California, Berkeley, CA, USA
| | - Sean X Luo
- Department of Psychiatry, School of Medicine, Columbia University and New York State Psychiatric Institute, New York, NY, USA
| | - Matisyahu Shulman
- Department of Psychiatry, School of Medicine, Columbia University and New York State Psychiatric Institute, New York, NY, USA
| | - Aimee Campbell
- Department of Psychiatry, School of Medicine, Columbia University and New York State Psychiatric Institute, New York, NY, USA
| | - John Rotrosen
- Department of Psychiatry, School of Medicine, New York University, New York, NY, USA
| | - Edward V Nunes
- Department of Psychiatry, School of Medicine, Columbia University and New York State Psychiatric Institute, New York, NY, USA
| |
|
28
|
Rudolph KE, Levy J, van der Laan MJ. Transporting stochastic direct and indirect effects to new populations. Biometrics 2021; 77:197-211. [PMID: 32277465 PMCID: PMC7664994 DOI: 10.1111/biom.13274] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2019] [Revised: 02/24/2020] [Accepted: 03/23/2020] [Indexed: 12/01/2022]
Abstract
Transported mediation effects may contribute to understanding how interventions work differently when applied to new populations. However, we are not aware of any estimators for such effects. Thus, we propose two doubly robust, efficient estimators of transported stochastic (also called randomized interventional) direct and indirect effects. We demonstrate their finite sample properties in a simulation study. We then apply the preferred substitution estimator to longitudinal data from the Moving to Opportunity Study, a large-scale housing voucher experiment, to transport stochastic indirect effect estimates of voucher receipt in childhood on subsequent risk of mental health or substance use disorder mediated through parental employment across sites, thereby gaining understanding of drivers of the site differences.
Affiliation(s)
- Kara E Rudolph
- Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, New York
| | - Jonathan Levy
- Division of Biostatistics, University of California, Berkeley, California
| | | |
|
29
|
Abstract
In many problems, a sensible estimator of a possibly multivariate monotone function may fail to be monotone. We study the correction of such an estimator obtained via projection onto the space of functions monotone over a finite grid in the domain. We demonstrate that this corrected estimator has no worse supremal estimation error than the initial estimator, and that analogously corrected confidence bands contain the true function whenever the initial bands do, at no loss to band width. Additionally, we demonstrate that the corrected estimator is asymptotically equivalent to the initial estimator if the initial estimator satisfies a stochastic equicontinuity condition and the true function is Lipschitz and strictly monotone. We provide simple sufficient conditions in the special case that the initial estimator is asymptotically linear, and illustrate the use of these results for estimation of a G-computed distribution function. Our stochastic equicontinuity condition is weaker than standard uniform stochastic equicontinuity, which has been required for alternative correction procedures. This allows us to apply our results to the bivariate correction of the local linear estimator of a conditional distribution function known to be monotone in its conditioning argument. Our experiments suggest that the projection step can yield significant practical improvements.
Affiliation(s)
- Ted Westling
- Department of Mathematics and Statistics, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Mark J van der Laan
- Division of Biostatistics, University of California, Berkeley, Berkeley, California, USA
| | - Marco Carone
- Department of Biostatistics, University of Washington, Seattle, Washington, USA
| |
|
30
|
Benkeser D, Mertens A, Colford JM, Hubbard A, Arnold BF, Stein A, van der Laan MJ. A machine learning-based approach for estimating and testing associations with multivariate outcomes. Int J Biostat 2020; 17:7-21. [PMID: 32784265 DOI: 10.1515/ijb-2019-0061] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/22/2019] [Accepted: 06/18/2020] [Indexed: 11/15/2022]
Abstract
We propose a method for summarizing the strength of association between a set of variables and a multivariate outcome. Classical summary measures are appropriate when linear relationships exist between covariates and outcomes, while our approach provides an alternative that is useful in situations where complex relationships may be present. We utilize machine learning to detect nonlinear relationships and covariate interactions and propose a measure of association that captures these relationships. A hypothesis test about the proposed associative measure can be used to test the strong null hypothesis of no association between a set of variables and a multivariate outcome. Simulations demonstrate that this hypothesis test has greater power than existing methods against alternatives where covariates have nonlinear relationships with outcomes. We additionally propose measures of variable importance for groups of variables, which summarize each group's association with the outcome. We demonstrate our methodology using data from a birth cohort study on childhood health and nutrition in the Philippines.
Affiliation(s)
- David Benkeser
- Emory University, School of Public Health, Atlanta, 30322, USA
| | - Andrew Mertens
- Department of Epidemiology, University of California, Berkeley, Berkeley, USA
| | - John M Colford
- Department of Epidemiology, University of California, Berkeley, Berkeley, USA
| | - Alan Hubbard
- Department of Biostatistics, University of California, Berkeley, Berkeley, USA
| | - Benjamin F Arnold
- Francis I. Proctor Foundation, University of California, San Francisco, USA
| | - Aryeh Stein
- Hubert Department of Global Health, Emory University Rollins School of Public Health, Atlanta, USA
| | - Mark J van der Laan
- Department of Biostatistics, University of California, Berkeley, Berkeley, USA
| |
|
31
|
|
32
|
|
33
|
Kreif N, Sofrygin O, Schmittdiel JA, Adams AS, Grant RW, Zhu Z, van der Laan MJ, Neugebauer R. Exploiting nonsystematic covariate monitoring to broaden the scope of evidence about the causal effects of adaptive treatment strategies. Biometrics 2020; 77:329-342. [PMID: 32297311 DOI: 10.1111/biom.13271] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/12/2018] [Revised: 01/31/2020] [Accepted: 03/16/2020] [Indexed: 12/25/2022]
Abstract
In studies based on electronic health records (EHR), the frequency of covariate monitoring can vary by covariate type, across patients, and over time, which can limit the generalizability of inferences about the effects of adaptive treatment strategies. In addition, monitoring is a health intervention in itself with costs and benefits, and stakeholders may be interested in the effect of monitoring when adopting adaptive treatment strategies. This paper demonstrates how to exploit nonsystematic covariate monitoring in EHR-based studies both to improve the generalizability of causal inferences and to evaluate the health impact of monitoring when evaluating adaptive treatment strategies. Using a real world, EHR-based, comparative effectiveness research (CER) study of patients with type II diabetes mellitus, we illustrate how the evaluation of joint dynamic treatment and static monitoring interventions can improve CER evidence and describe two alternate estimation approaches based on inverse probability weighting (IPW). First, we demonstrate the poor performance of the standard estimator of the effects of joint treatment-monitoring interventions, due to a large decrease in data support and concerns over finite-sample bias from near-violations of the positivity assumption (PA) for the monitoring process. Second, we detail an alternate IPW estimator using a no direct effect assumption. We demonstrate that this estimator can improve efficiency but at the potential cost of an increase in bias from violations of the PA for the treatment process.
Affiliation(s)
- Noémi Kreif
- Centre for Health Economics, University of York, York, UK
| | - Oleg Sofrygin
- Division of Research, Kaiser Permanente Northern California, Oakland, California
| | - Julie A Schmittdiel
- Division of Research, Kaiser Permanente Northern California, Oakland, California
| | - Alyce S Adams
- Division of Research, Kaiser Permanente Northern California, Oakland, California
| | - Richard W Grant
- Division of Research, Kaiser Permanente Northern California, Oakland, California
| | - Zheng Zhu
- Division of Research, Kaiser Permanente Northern California, Oakland, California
| | - Mark J van der Laan
- Division of Biostatistics, School of Public Health, University of California, Berkeley, California
| | - Romain Neugebauer
- Division of Research, Kaiser Permanente Northern California, Oakland, California
| |
|
34
|
Rudolph KE, Sofrygin O, van der Laan MJ. Complier stochastic direct effects: identification and robust estimation. J Am Stat Assoc 2020; 116:1254-1264. [PMID: 34531623 PMCID: PMC8439556 DOI: 10.1080/01621459.2019.1704292] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 02/05/2019] [Revised: 10/10/2019] [Accepted: 12/07/2019] [Indexed: 01/28/2023]
Abstract
Mediation analysis is critical to understanding the mechanisms underlying exposure-outcome relationships. In this paper, we identify the instrumental variable-direct effect of the exposure on the outcome not through the mediator, using randomization of the instrument. We call this estimand the complier stochastic direct effect (CSDE). To our knowledge, such an estimand has not previously been considered or estimated. We propose and evaluate several estimators for the CSDE: a ratio of inverse-probability of treatment-weighted estimators (IPTW), a ratio of estimating equation estimators (EE), a ratio of targeted minimum loss-based estimators (TMLE), and a TMLE that targets the CSDE directly. These estimators are applicable for a variety of study designs, including randomized encouragement trials, like the Moving to Opportunity housing voucher experiment we consider as an illustrative example, treatment discontinuities, and Mendelian randomization. We found the IPTW estimator to be the most sensitive to finite sample bias, resulting in bias of over 40% even when all models were correctly specified in a sample size of N=100. In contrast, the EE estimator and TMLE that targets the CSDE directly were far less sensitive. The EE and TML estimators also have advantages in terms of efficiency and reduced reliance on correct parametric model specification.
Affiliation(s)
- Kara E Rudolph
- Department of Epidemiology, Columbia University, New York, New York
| | - Oleg Sofrygin
- Division of Biostatistics, University of California, Berkeley
| | | |
|
35
|
Abstract
When predicting an outcome is the scientific goal, one must decide on a metric by which to evaluate the quality of predictions. We consider the problem of measuring the performance of a prediction algorithm with the same data that were used to train the algorithm. Typical approaches involve bootstrapping or cross-validation. However, we demonstrate that bootstrap-based approaches often fail and standard cross-validation estimators may perform poorly. We provide a general study of cross-validation-based estimators that highlights the source of this poor performance, and propose an alternative framework for estimation using techniques from the efficiency theory literature. We provide a theorem establishing the weak convergence of our estimators. The general theorem is applied in detail to two specific examples and we discuss possible extensions to other parameters of interest. For the two explicit examples that we consider, our estimators demonstrate remarkable finite-sample improvements over standard approaches.
Affiliation(s)
- David Benkeser
- Department of Biostatistics and Bioinformatics, Emory University
| | - Maya Petersen
- Graduate Group in Biostatistics, University of California, Berkeley
| | - Mark J van der Laan
- Graduate Group in Biostatistics, University of California, Berkeley
- Department of Statistics, University of California, Berkeley
| |
|
36
|
Schnitzer ME, Sango J, Ferreira Guerra S, van der Laan MJ. Data-adaptive longitudinal model selection in causal inference with collaborative targeted minimum loss-based estimation. Biometrics 2019; 76:145-157. [PMID: 31397506 DOI: 10.1111/biom.13135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2017] [Accepted: 07/22/2019] [Indexed: 11/26/2022]
Abstract
Causal inference methods have been developed for longitudinal observational study designs where confounding is thought to occur over time. In particular, one may estimate and contrast the population mean counterfactual outcome under specific exposure patterns. In such contexts, confounders of the longitudinal treatment-outcome association are generally identified using domain-specific knowledge. However, this may leave an analyst with a large set of potential confounders that may hinder estimation. Previous approaches to data-adaptive model selection for this type of causal parameter were limited to the single time-point setting. We develop a longitudinal extension of a collaborative targeted minimum loss-based estimation (C-TMLE) algorithm that can be applied to perform variable selection in the models for the probability of treatment with the goal of improving the estimation of the population mean counterfactual outcome under a fixed exposure pattern. We investigate the properties of this method through a simulation study, comparing it to G-Computation and inverse probability of treatment weighting. We then apply the method in a real-data example to evaluate the safety of trimester-specific exposure to inhaled corticosteroids during pregnancy in women with mild asthma. The data for this study were obtained from the linkage of electronic health databases in the province of Quebec, Canada. The C-TMLE covariate selection approach allowed for a reduction of the set of potential confounders, which included baseline and longitudinal variables.
Affiliation(s)
| | - Joel Sango
- Statistics Canada, Ottawa, Ontario, Canada
- Department of Mathematics and Statistics, Université de Montréal, Montréal, Québec, Canada
| | | | - Mark J van der Laan
- Division of Biostatistics, School of Public Health, University of California, Berkeley, Berkeley, California
| |
|
37
|
Miles CH, Petersen M, van der Laan MJ. Causal inference when counterfactuals depend on the proportion of all subjects exposed. Biometrics 2019; 75:768-777. [PMID: 30714118 PMCID: PMC6679813 DOI: 10.1111/biom.13034] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 10/27/2017] [Accepted: 01/23/2019] [Indexed: 10/27/2022]
Abstract
The assumption that no subject's exposure affects another subject's outcome, known as the no-interference assumption, has long held a foundational position in the study of causal inference. However, this assumption may be violated in many settings, and in recent years has been relaxed considerably. Often this has been achieved with either the aid of a known underlying network, or the assumption that the population can be partitioned into separate groups, between which there is no interference, and within which each subject's outcome may be affected by all the other subjects in the group via the proportion exposed (the stratified interference assumption). In this article, we instead consider a complete interference setting, in which each subject affects every other subject's outcome. In particular, we make the stratified interference assumption for a single group consisting of the entire sample. We show that a targeted maximum likelihood estimator for the i.i.d. setting can be used to estimate a class of causal parameters that includes direct effects and overall effects under certain interventions. This estimator remains doubly-robust, semiparametric efficient, and continues to allow for incorporation of machine learning under our model. We conduct a simulation study, and present results from a data application where we study the effect of a nurse-based triage system on the outcomes of patients receiving HIV care in Kenyan health clinics.
Affiliation(s)
- Caleb H. Miles
- Department of Biostatistics, Columbia Mailman School of Public Health, New York, New York, U.S.A
| | - Maya Petersen
- Division of Biostatistics, University of California at Berkeley, Berkeley, California, U.S.A
- Division of Epidemiology, University of California at Berkeley, Berkeley, California, U.S.A
| | - Mark J. van der Laan
- Division of Biostatistics, University of California at Berkeley, Berkeley, California, U.S.A
- Department of Statistics, University of California at Berkeley, Berkeley, California, U.S.A
| |
|
38
|
Ju C, Benkeser D, van der Laan MJ. Robust inference on the average treatment effect using the outcome highly adaptive lasso. Biometrics 2019; 76:109-118. [PMID: 31350906 DOI: 10.1111/biom.13121] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 06/22/2018] [Accepted: 07/16/2019] [Indexed: 12/01/2022]
Abstract
Many estimators of the average effect of a treatment on an outcome require estimation of the propensity score, the outcome regression, or both. It is often beneficial to utilize flexible techniques, such as semiparametric regression or machine learning, to estimate these quantities. However, optimal estimation of these regressions does not necessarily lead to optimal estimation of the average treatment effect, particularly in settings with strong instrumental variables. A recent proposal addressed these issues via the outcome-adaptive lasso, a penalized regression technique for estimating the propensity score that seeks to minimize the impact of instrumental variables on treatment effect estimators. However, a notable limitation of this approach is that its application is restricted to parametric models. We propose a more flexible alternative that we call the outcome highly adaptive lasso. We discuss the large sample theory for this estimator and propose closed-form confidence intervals based on the proposed estimator. We show via simulation that our method offers benefits over several popular approaches.
Affiliation(s)
- Cheng Ju
- Division of Biostatistics, University of California, Berkeley, California
| | - David Benkeser
- Division of Biostatistics, Emory University, Atlanta, Georgia
| | | |
|
39
|
Balzer LB, Zheng W, van der Laan MJ, Petersen ML. A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a cluster-level exposure. Stat Methods Med Res 2019; 28:1761-1780. [PMID: 29921160 PMCID: PMC6173669 DOI: 10.1177/0962280218774936] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Indexed: 12/18/2022]
Abstract
We often seek to estimate the impact of an exposure naturally occurring or randomly assigned at the cluster-level. For example, the literature on neighborhood determinants of health continues to grow. Likewise, community randomized trials are applied to learn about real-world implementation, sustainability, and population effects of interventions with proven individual-level efficacy. In these settings, individual-level outcomes are correlated due to shared cluster-level factors, including the exposure, as well as social or biological interactions between individuals. To flexibly and efficiently estimate the effect of a cluster-level exposure, we present two targeted maximum likelihood estimators (TMLEs). The first TMLE is developed under a non-parametric causal model, which allows for arbitrary interactions between individuals within a cluster. These interactions include direct transmission of the outcome (i.e. contagion) and influence of one individual's covariates on another's outcome (i.e. covariate interference). The second TMLE is developed under a causal sub-model assuming the cluster-level and individual-specific covariates are sufficient to control for confounding. Simulations compare the alternative estimators and illustrate the potential gains from pairing individual-level risk factors and outcomes during estimation, while avoiding unwarranted assumptions. Our results suggest that estimation under the sub-model can result in bias and misleading inference in an observational setting. Incorporating working assumptions during estimation is more robust than assuming they hold in the underlying causal model. We illustrate our approach with an application to HIV prevention and treatment.
Affiliation(s)
- Laura B Balzer
- Department of Biostatistics & Epidemiology, School of Public Health & Health Sciences, University of Massachusetts, Amherst, MA, USA
| | | | - Mark J van der Laan
- Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA
| | - Maya L Petersen
- Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA
| |
|
40
|
Sofrygin O, Zhu Z, Schmittdiel JA, Adams AS, Grant RW, van der Laan MJ, Neugebauer R. Targeted learning with daily EHR data. Stat Med 2019; 38:3073-3090. [PMID: 31025411 DOI: 10.1002/sim.8164] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Received: 12/13/2017] [Revised: 01/11/2019] [Accepted: 03/22/2019] [Indexed: 11/10/2022]
Abstract
Electronic health records (EHR) data provide a cost- and time-effective opportunity to conduct cohort studies of the effects of multiple time-point interventions in the diverse patient population found in real-world clinical settings. Because the computational cost of analyzing EHR data at daily (or more granular) scale can be quite high, a pragmatic approach has been to partition the follow-up into coarser intervals of pre-specified length (eg, quarterly or monthly intervals). The feasibility and practical impact of analyzing EHR data at a granular scale has not been previously evaluated. We start filling these gaps by leveraging large-scale EHR data from a diabetes study to develop a scalable targeted learning approach that allows analyses with small intervals. We then study the practical effects of selecting different coarsening intervals on inferences by reanalyzing data from the same large-scale pool of patients. Specifically, we map daily EHR data into four analytic datasets using 90-, 30-, 15-, and 5-day intervals. We apply a semiparametric and doubly robust estimation approach, the longitudinal Targeted Minimum Loss-Based Estimation (TMLE), to estimate the causal effects of four dynamic treatment rules with each dataset, and compare the resulting inferences. To overcome the computational challenges presented by the size of these data, we propose a novel TMLE implementation, the "long-format TMLE," and rely on the latest advances in scalable data-adaptive machine-learning software, xgboost and h2o, for estimation of the TMLE nuisance parameters.
Affiliation(s)
- Oleg Sofrygin
  - Division of Research, Kaiser Permanente, Northern California, Oakland, California
  - Division of Biostatistics, University of California, Berkeley, California
- Zheng Zhu
  - Division of Research, Kaiser Permanente, Northern California, Oakland, California
- Julie A Schmittdiel
  - Division of Research, Kaiser Permanente, Northern California, Oakland, California
- Alyce S Adams
  - Division of Research, Kaiser Permanente, Northern California, Oakland, California
- Richard W Grant
  - Division of Research, Kaiser Permanente, Northern California, Oakland, California
- Romain Neugebauer
  - Division of Research, Kaiser Permanente, Northern California, Oakland, California
|
41
|
Ju C, Wyss R, Franklin JM, Schneeweiss S, Häggström J, van der Laan MJ. Collaborative-controlled LASSO for constructing propensity score-based estimators in high-dimensional data. Stat Methods Med Res 2019; 28:1044-1063. [PMID: 29226777 PMCID: PMC6039292 DOI: 10.1177/0962280217744588] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 11/15/2022]
Abstract
Propensity score-based estimators are increasingly used for causal inference in observational studies. However, model selection for propensity score estimation in high-dimensional data has received little attention. In these settings, propensity score models have traditionally been selected based on the goodness-of-fit for the treatment mechanism itself, without consideration of the causal parameter of interest. Collaborative minimum loss-based estimation is a novel methodology for causal inference that takes into account information on the causal parameter of interest when selecting a propensity score model. This "collaborative learning" considers variable associations with both treatment and outcome when selecting a propensity score model in order to minimize a bias-variance tradeoff in the estimated treatment effect. In this study, we introduce a novel approach for collaborative model selection when using the LASSO estimator for propensity score estimation in high-dimensional covariate settings. To demonstrate the importance of selecting the propensity score model collaboratively, we designed quasi-experiments based on a real electronic healthcare database, where only the potential outcomes were manually generated, and the treatment and baseline covariates remained unchanged. Results showed that the collaborative minimum loss-based estimation algorithm outperformed other competing estimators for both point estimation and confidence interval coverage. In addition, the propensity score model selected by collaborative minimum loss-based estimation could be applied to other propensity score-based estimators, which also resulted in substantive improvement for both point estimation and confidence interval coverage. We illustrate the discussed concepts through an empirical example comparing the effects of non-selective nonsteroidal anti-inflammatory drugs with selective COX-2 inhibitors on gastrointestinal complications in a population of Medicare beneficiaries.
Affiliation(s)
- Cheng Ju
  - Division of Biostatistics, University of California, USA
- Richard Wyss
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, USA
- Jessica M Franklin
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, USA
- Sebastian Schneeweiss
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, USA
|
42
|
Ju C, Combs M, Lendle SD, Franklin JM, Wyss R, Schneeweiss S, van der Laan MJ. Propensity score prediction for electronic healthcare databases using Super Learner and High-dimensional Propensity Score Methods. J Appl Stat 2019; 46:2216-2236. [PMID: 32843815 PMCID: PMC7444746 DOI: 10.1080/02664763.2019.1582614] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 05/02/2018] [Accepted: 02/08/2019] [Indexed: 02/06/2023]
Abstract
The optimal learner for prediction modeling varies depending on the underlying data-generating distribution. Super Learner (SL) is a generic ensemble learning algorithm that uses cross-validation to select among a "library" of candidate prediction models. While SL has been widely studied in a number of settings, it has not been thoroughly evaluated in large electronic healthcare databases that are common in pharmacoepidemiology and comparative effectiveness research. In this study, we applied and evaluated the performance of SL in its ability to predict the propensity score (PS), the conditional probability of treatment assignment given baseline covariates, using three electronic healthcare databases. We considered a library of algorithms that consisted of both nonparametric and parametric models. We also proposed a novel strategy for prediction modeling that combines SL with the high-dimensional propensity score (hdPS) variable selection algorithm. Predictive performance was assessed using three metrics: the negative log-likelihood, area under the curve (AUC), and time complexity. Results showed that the best individual algorithm, in terms of predictive performance, varied across datasets. The SL was able to adapt to the given dataset and optimize predictive performance relative to any individual learner. Combining the SL with the hdPS was the most consistent prediction method and may be promising for PS estimation and prediction modeling in electronic healthcare databases.
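The cross-validated selection idea in this abstract can be sketched in a few lines. The example below is a simplified "discrete" selector that picks the single candidate with the lowest cross-validated negative log-likelihood. It is not the weighted ensemble, the hdPS combination, or the library used in the study. The two toy learners, the synthetic data, and all function names are assumptions made for illustration.

```python
import math
from typing import Callable, Dict, Sequence, Tuple

Row = Tuple[int, int]  # (binary covariate x, binary treatment a)
Predictor = Callable[[int], float]
Learner = Callable[[Sequence[Row]], Predictor]


def clip(p: float, eps: float = 1e-6) -> float:
    """Keep probabilities away from 0 and 1 so the log-likelihood is finite."""
    return min(max(p, eps), 1.0 - eps)


def marginal_learner(train: Sequence[Row]) -> Predictor:
    """Candidate 1: ignore x and predict the overall treatment rate."""
    rate = sum(a for _, a in train) / len(train)
    return lambda x: clip(rate)


def stratified_learner(train: Sequence[Row]) -> Predictor:
    """Candidate 2: predict the treatment rate within each level of x."""
    overall = sum(a for _, a in train) / len(train)
    rates: Dict[int, float] = {}
    for level in (0, 1):
        group = [a for x, a in train if x == level]
        rates[level] = sum(group) / len(group) if group else overall
    return lambda x: clip(rates[x])


def cv_neg_log_lik(learner: Learner, data: Sequence[Row], n_folds: int = 5) -> float:
    """Average negative log-likelihood over held-out folds."""
    total = 0.0
    for fold in range(n_folds):
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        valid = [row for i, row in enumerate(data) if i % n_folds == fold]
        predict = learner(train)
        for x, a in valid:
            p = predict(x)
            total -= a * math.log(p) + (1 - a) * math.log(1.0 - p)
    return total / len(data)


def discrete_super_learner(library: Dict[str, Learner], data: Sequence[Row]):
    """Return the name of the candidate with the lowest CV risk, plus all risks."""
    risks = {name: cv_neg_log_lik(fit, data) for name, fit in library.items()}
    return min(risks, key=risks.get), risks


# Synthetic data: treatment equals x, except that every 7th row is flipped.
data = [(i % 2, (i % 2) if i % 7 else 1 - (i % 2)) for i in range(210)]
library = {"marginal": marginal_learner, "stratified": stratified_learner}
best, risks = discrete_super_learner(library, data)
print(best, risks)
```

Because treatment in the toy data depends on the covariate, the stratified candidate attains the lower cross-validated risk and is selected. With a different data-generating process the other candidate could win, which is the adaptivity the abstract describes.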
Affiliation(s)
- Cheng Ju
  - Division of Biostatistics, University of California, Berkeley
- Mary Combs
  - Division of Biostatistics, University of California, Berkeley
- Samuel D Lendle
  - Division of Biostatistics, University of California, Berkeley
- Jessica M Franklin
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School
- Richard Wyss
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School
- Sebastian Schneeweiss
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School
|
43
|
Gruber S, van der Laan MJ. Comment on “Automated Versus Do-It-Yourself Methods for Causal Inference: Lessons Learned from a Data Analysis Competition”. Stat Sci 2019. [DOI: 10.1214/18-sts689] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
|
44
|
Ju C, Gruber S, Lendle SD, Chambaz A, Franklin JM, Wyss R, Schneeweiss S, van der Laan MJ. Scalable collaborative targeted learning for high-dimensional data. Stat Methods Med Res 2019; 28:532-554. [PMID: 28936917 PMCID: PMC6086775 DOI: 10.1177/0962280217729845] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Indexed: 11/16/2022]
Abstract
Robust inference of a low-dimensional parameter in a large semi-parametric model relies on external estimators of infinite-dimensional features of the distribution of the data. Typically, only one of the latter is optimized for the sake of constructing a well-behaved estimator of the low-dimensional parameter of interest. Optimizing more than one of them for the sake of achieving a better bias-variance trade-off in the estimation of the parameter of interest is the core idea driving the general template of the collaborative targeted minimum loss-based estimation procedure. The original instantiation of the collaborative targeted minimum loss-based estimation template can be presented as a greedy forward stepwise collaborative targeted minimum loss-based estimation algorithm. It does not scale well when the number p of covariates increases drastically. This motivates the introduction of a novel instantiation of the collaborative targeted minimum loss-based estimation template where the covariates are pre-ordered. Its time complexity is O(p) as opposed to the original O(p^2), a remarkable gain. We propose two pre-ordering strategies and suggest a rule of thumb to develop other meaningful strategies. Because it is usually unclear a priori which pre-ordering strategy to choose, we also introduce another instantiation called SL-C-TMLE algorithm that enables the data-driven choice of the better pre-ordering strategy given the problem at hand. Its time complexity is O(p) as well. The computational burden and relative performance of these algorithms were compared in simulation studies involving fully synthetic data or partially synthetic data based on a real world large electronic health database; and in analyses of three real, large electronic health databases. In all analyses involving electronic health databases, the greedy collaborative targeted minimum loss-based estimation algorithm is unacceptably slow. Simulation studies seem to indicate that our scalable collaborative targeted minimum loss-based estimation and SL-C-TMLE algorithms work well. All C-TMLEs are publicly available in a Julia software package.
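The gap between the linear and quadratic complexities quoted in this abstract can be seen from a simple count. The count below rests on three assumptions made purely for illustration: cost is measured by the number of candidate estimators evaluated, the greedy algorithm at step k evaluates each of the p - k + 1 covariates not yet selected, and the pre-ordered variant evaluates a single candidate per step.

```latex
\[
\underbrace{\sum_{k=1}^{p} (p - k + 1)}_{\text{greedy forward selection}}
  = \frac{p(p+1)}{2} = O(p^{2}),
\qquad
\underbrace{\sum_{k=1}^{p} 1}_{\text{pre-ordered covariates}}
  = p = O(p).
\]
```

Under these assumptions, p = 100 covariates would mean 5050 candidate evaluations for the greedy algorithm against 100 for the pre-ordered one.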
Affiliation(s)
- Cheng Ju
  - University of California, Berkeley, CA, USA
- Susan Gruber
  - Harvard Pilgrim Health Care Institute and Harvard Medical School, Boston, MA, USA
- Antoine Chambaz
  - University of California, Berkeley, CA, USA
  - Modal’X, UPL, Univ Paris Nanterre, Nanterre, France
- Jessica M Franklin
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Richard Wyss
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Sebastian Schneeweiss
  - Division of Pharmacoepidemiology and Pharmacoeconomics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
|
45
|
Price BL, Gilbert PB, van der Laan MJ. Estimation of the optimal surrogate based on a randomized trial. Biometrics 2018; 74:1271-1281. [PMID: 29701875 PMCID: PMC6393111 DOI: 10.1111/biom.12879] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 08/01/2016] [Revised: 12/01/2017] [Accepted: 02/01/2018] [Indexed: 11/27/2022]
Abstract
A common scientific problem is to determine a surrogate outcome for a long-term outcome so that future randomized studies can restrict themselves to only collecting the surrogate outcome. We consider the setting that we observe n independent and identically distributed observations of a random variable consisting of baseline covariates, a treatment, a vector of candidate surrogate outcomes at an intermediate time point, and the final outcome of interest at a final time point. We assume the treatment is randomized, conditional on the baseline covariates. The goal is to use these data to learn a most-promising surrogate for use in future trials for inference about a mean contrast treatment effect on the final outcome. We define an optimal surrogate for the current study as the function of the data generating distribution collected by the intermediate time point that satisfies the Prentice definition of a valid surrogate endpoint and that optimally predicts the final outcome: this optimal surrogate is an unknown parameter. We show that this optimal surrogate is a conditional mean and present super-learner and targeted super-learner based estimators, whose predicted outcomes are used as the surrogate in applications. We demonstrate a number of desirable properties of this optimal surrogate and its estimators, and study the methodology in simulations and an application to dengue vaccine efficacy trials.
Affiliation(s)
- Brenda L Price
  - Department of Biostatistics, University of Washington, Seattle, Washington, 98109, U.S.A
- Peter B Gilbert
  - Department of Biostatistics, University of Washington, Seattle, Washington, 98109, U.S.A
  - Vaccine and Infectious Disease and Public Health Sciences Divisions, Fred Hutchinson Cancer Research Center, Seattle, Washington, 98109, U.S.A
- Mark J van der Laan
  - Division of Biostatistics, University of California, Berkeley, California, 94720, U.S.A
|
46
|
Luedtke AR, Carone M, van der Laan MJ. An omnibus non-parametric test of equality in distribution for unknown functions. J R Stat Soc Series B Stat Methodol 2018; 81:75-99. [PMID: 31024219 DOI: 10.1111/rssb.12299] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Indexed: 11/29/2022]
Abstract
We present a novel family of nonparametric omnibus tests of the hypothesis that two unknown but estimable functions are equal in distribution when applied to the observed data structure. We developed these tests, which represent a generalization of the maximum mean discrepancy tests described in Gretton et al. [2006], using recent developments from the higher-order pathwise differentiability literature. Despite their complex derivation, the associated test statistics can be expressed rather simply as U-statistics. We study the asymptotic behavior of the proposed tests under the null hypothesis and under both fixed and local alternatives. We provide examples to which our tests can be applied and show that they perform well in a simulation study. As an important special case, our proposed tests can be used to determine whether an unknown function, such as the conditional average treatment effect, is equal to zero almost surely.
Affiliation(s)
- Alexander R Luedtke
  - Vaccine and Infectious Disease Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
- Marco Carone
  - Department of Biostatistics, University of Washington, Seattle, WA, USA
- Mark J van der Laan
  - Division of Biostatistics, University of California, Berkeley, Berkeley, CA, USA
|
47
|
Carone M, Luedtke AR, van der Laan MJ. Toward computerized efficient estimation in infinite-dimensional models. J Am Stat Assoc 2018; 114:1174-1190. [PMID: 32405108 PMCID: PMC7219981 DOI: 10.1080/01621459.2018.1482752] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 08/01/2016] [Revised: 05/01/2018] [Indexed: 10/14/2022]
Abstract
Despite the risk of misspecification they are tied to, parametric models continue to be used in statistical practice because they are simple and convenient to use. In particular, efficient estimation procedures in parametric models are easy to describe and implement. Unfortunately, the same cannot be said of semiparametric and nonparametric models. While the latter often reflect the level of available scientific knowledge more appropriately, performing efficient inference in these models is generally challenging. The efficient influence function is a key analytic object from which the construction of asymptotically efficient estimators can potentially be streamlined. However, the theoretical derivation of the efficient influence function requires specialized knowledge and is often a difficult task, even for experts. In this paper, we present a novel representation of the efficient influence function and describe a numerical procedure for approximating its evaluation. The approach generalizes the nonparametric procedures of Frangakis et al. (2015) and Luedtke et al. (2015) to arbitrary models. We present theoretical results to support our proposal, and illustrate the method in the context of several semiparametric problems. The proposed approach is an important step toward automating efficient estimation in general statistical models, thereby rendering more accessible the use of realistic models in statistical analyses.
Affiliation(s)
- Marco Carone
  - Department of Biostatistics, University of Washington
|
48
|
Abstract
The positivity assumption, or the experimental treatment assignment (ETA) assumption, is important for identifiability in causal inference. Even if the positivity assumption holds, practical violations of this assumption may jeopardize the finite sample performance of the causal estimator. One of the consequences of practical violations of the positivity assumption is extreme values in the estimated propensity score (PS). A common practice to address this issue is truncating the PS estimate when constructing PS-based estimators. In this study, we propose a novel adaptive truncation method, Positivity-C-TMLE, based on the collaborative targeted maximum likelihood estimation (C-TMLE) methodology. We demonstrate the outstanding performance of our novel approach in a variety of simulations by comparing it with other commonly studied estimators. Results show that by adaptively truncating the estimated PS with a more targeted objective function, the Positivity-C-TMLE estimator achieves the best performance for both point estimation and confidence interval coverage among all estimators considered.
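The "common practice" of truncating the estimated propensity score, mentioned in this abstract, can be sketched as follows. This is a fixed-threshold truncation paired with a simple inverse-probability-weighted mean. It is not the adaptive Positivity-C-TMLE procedure the abstract proposes, and the function names, the threshold of 0.05, and the toy data are all assumptions for illustration.

```python
from typing import List, Sequence


def truncate_ps(ps: Sequence[float], lower: float = 0.05) -> List[float]:
    """Clip propensity scores to the interval [lower, 1 - lower]."""
    if not 0.0 < lower < 0.5:
        raise ValueError("lower must lie strictly between 0 and 0.5")
    upper = 1.0 - lower
    return [min(max(p, lower), upper) for p in ps]


def ipw_mean_treated(
    y: Sequence[float], a: Sequence[int], ps: Sequence[float]
) -> float:
    """Inverse-probability-weighted estimate of the mean outcome under treatment."""
    return sum(ai * yi / pi for yi, ai, pi in zip(y, a, ps)) / len(y)


# Toy data (hypothetical): the last unit is treated despite a tiny estimated PS.
y = [1.0, 0.0, 1.0, 1.0]
a = [1, 1, 0, 1]
ps_hat = [0.5, 0.8, 0.3, 0.001]

raw = ipw_mean_treated(y, a, ps_hat)
trunc = ipw_mean_treated(y, a, truncate_ps(ps_hat, lower=0.05))
print(raw, trunc)
```

In this toy example the single near-zero score dominates the untruncated estimate (about 250.5 for a binary outcome), while truncation at 0.05 caps that unit's weight at 20 and gives 5.5. Truncation trades variance for bias, and the choice of threshold is exactly what an adaptive method would select from the data rather than fix in advance.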
Affiliation(s)
- Cheng Ju
  - Division of Biostatistics, University of California, Berkeley, CA, USA
- Joshua Schwab
  - Division of Biostatistics, University of California, Berkeley, CA, USA
|
49
|
Abstract
In health and social sciences, research questions often involve systematic assessment of the modification of treatment causal effect by patient characteristics. In longitudinal settings, time-varying or post-intervention effect modifiers are also of interest. In this work, we investigate the robust and efficient estimation of the Counterfactual-History-Adjusted Marginal Structural Model (van der Laan MJ, Petersen M. Statistical learning of origin-specific statically optimal individualized treatment rules. Int J Biostat. 2007;3), which models the conditional intervention-specific mean outcome given a counterfactual modifier history in an ideal experiment. We establish the semiparametric efficiency theory for these models, and present a substitution-based, semiparametric efficient and doubly robust estimator using the targeted maximum likelihood estimation methodology (TMLE, e.g. van der Laan MJ, Rubin DB. Targeted maximum likelihood learning. Int J Biostat. 2006;2, van der Laan MJ, Rose S. Targeted learning: causal inference for observational and experimental data, 1st ed. Springer Series in Statistics. Springer, 2011). To facilitate implementation in applications where the effect modifier is high dimensional, our third contribution is a projected influence function (and the corresponding projected TMLE estimator), which retains most of the robustness of its efficient peer and can be easily implemented in applications where the use of the efficient influence function becomes taxing. We compare the projected TMLE estimator with an Inverse Probability of Treatment Weighted estimator (e.g. Robins JM. Marginal structural models. In: Proceedings of the American Statistical Association. Section on Bayesian Statistical Science, 1-10. 1997a, Hernan MA, Brumback B, Robins JM. Marginal structural models to estimate the causal effect of zidovudine on the survival of HIV-positive men. EPIDEMIOLOGY 2000;11:561-570), and a non-targeted G-computation estimator (Robins JM. A new approach to causal inference in mortality studies with sustained exposure periods - application to control of the healthy worker survivor effect. Math Modell. 1986;7:1393-1512.). The comparative performance of these estimators is assessed in a simulation study. The use of the projected TMLE estimator is illustrated in a secondary data analysis for the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) trial where effect modifiers are subject to missing at random.
Affiliation(s)
- Wenjing Zheng
  - Division of Biostatistics, University of California, Berkeley, USA
  - Center for Targeted Learning, University of California, Berkeley, USA
- Zhehui Luo
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, USA
- Mark J van der Laan
  - Division of Biostatistics, University of California, Berkeley, USA
  - Center for Targeted Learning, University of California, Berkeley, USA
|
50
|
Chambaz A, Hubbard A, van der Laan MJ. Special Issue on Data-Adaptive Statistical Inference. Int J Biostat 2018; 12:1. [PMID: 27227714 DOI: 10.1515/ijb-2016-0033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
|