1
Kragh Jørgensen RR, Jensen JF, El-Galaly T, Bøgsted M, Brøndum RF, Simonsen MR, Jakobsen LH. Development of time to event prediction models using federated learning. BMC Med Res Methodol 2025; 25:143. [PMID: 40419965 PMCID: PMC12105200 DOI: 10.1186/s12874-025-02598-y] [Received: 01/09/2025] [Accepted: 05/15/2025] [Indexed: 05/28/2025]
Abstract
BACKGROUND In a wide range of diseases, it is necessary to utilize multiple data sources to obtain enough data for model training. However, performing centralized pooling of multiple data sources, while protecting each patient's sensitive data, can require a cumbersome process involving many institutional bodies. Alternatively, federated learning (FL) can be utilized to train models based on data located at multiple sites. METHOD We propose two methods, relying on FL algorithms, for training time-to-event prediction models based on distributed data. Both approaches incorporate steps to allow prediction of individual-level survival curves without exposing individual-level event times. For Cox proportional hazards models, this is accomplished by using a kernel smoother for the baseline hazard function. The other proposed methodology is based on general parametric likelihood theory for right-censored data. We compared these two methods in four simulations and with one real-world dataset predicting the survival probability in patients with Hodgkin lymphoma (HL). RESULTS The simulations demonstrated that the FL models performed similarly to the non-distributed case in all four experiments, with only slight deviations in predicted survival probabilities compared to the true model. Our findings were similar in the real-world advanced-stage HL example where the FL models were compared to their non-distributed versions, revealing only small deviations in performance. CONCLUSION The proposed procedures enable training of time-to-event models using data distributed across sites, without direct sharing of individual-level data and event times, while retaining a predictive performance on par with undistributed approaches.
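The kernel-smoothing idea named in the METHOD section can be illustrated with a minimal sketch. It assumes that increments of the cumulative baseline hazard (for example, Breslow-type jumps) are already available at a set of time points. The function names, the Epanechnikov kernel, the bandwidth, and the input values are illustrative assumptions, not details taken from the paper, and no boundary correction is applied.

```python
from typing import Sequence


def epanechnikov(u: float) -> float:
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def smoothed_baseline_hazard(
    t: float,
    jump_times: Sequence[float],
    hazard_jumps: Sequence[float],
    bandwidth: float,
) -> float:
    """Kernel-smoothed hazard at time t: (1/b) * sum_i K((t - s_i) / b) * dLambda_i."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    if len(jump_times) != len(hazard_jumps):
        raise ValueError("jump_times and hazard_jumps must have equal length")
    weighted = sum(
        epanechnikov((t - s) / bandwidth) * d
        for s, d in zip(jump_times, hazard_jumps)
    )
    return weighted / bandwidth


# Hypothetical cumulative-hazard increments, for illustration only.
jump_times = [1.0, 2.0, 4.0]
hazard_jumps = [0.10, 0.05, 0.20]
print(smoothed_baseline_hazard(2.0, jump_times, hazard_jumps, bandwidth=2.0))
```

In a sketch like this, predictions are built from a smooth hazard curve rather than from a step function whose jumps sit exactly at observed event times, which is in the spirit of the abstract's aim of not exposing individual-level event times.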
Affiliation(s)
- Rasmus Rask Kragh Jørgensen
- Department of Hematology, Clinical Cancer Research Center, Aalborg University Hospital, Aalborg, Denmark.
- Center for Clinical Data Science, Aalborg University and Aalborg University Hospital, Aalborg, Denmark.
- Department of Clinical Medicine, Aalborg University, Aalborg, Denmark.
- Jonas Faartoft Jensen
- Department of Hematology, Clinical Cancer Research Center, Aalborg University Hospital, Aalborg, Denmark
- Tarec El-Galaly
- Department of Hematology, Clinical Cancer Research Center, Aalborg University Hospital, Aalborg, Denmark
- Department of Clinical Medicine, Aalborg University, Aalborg, Denmark
- Department of Medicine Solna, Clinical Epidemiology Division, Karolinska Institutet, Stockholm, Sweden
- Department of Hematology, Odense University Hospital, Odense, Denmark
- Martin Bøgsted
- Center for Clinical Data Science, Aalborg University and Aalborg University Hospital, Aalborg, Denmark
- Clinical Cancer Research Centre, Aalborg University Hospital, Aalborg, Denmark
- Rasmus Froberg Brøndum
- Center for Clinical Data Science, Aalborg University and Aalborg University Hospital, Aalborg, Denmark
- Clinical Cancer Research Centre, Aalborg University Hospital, Aalborg, Denmark
- Mikkel Runason Simonsen
- Department of Hematology, Clinical Cancer Research Center, Aalborg University Hospital, Aalborg, Denmark
- Department of Mathematical Sciences, Aalborg University, Aalborg, Denmark
- Lasse Hjort Jakobsen
- Department of Hematology, Clinical Cancer Research Center, Aalborg University Hospital, Aalborg, Denmark
2
Liang CJ, Luo C, Kranzler HR, Bian J, Chen Y. Communication-efficient federated learning of temporal effects on opioid use disorder with data from distributed research networks. J Am Med Inform Assoc 2025; 32:656-664. [PMID: 39864407 PMCID: PMC12005629 DOI: 10.1093/jamia/ocae313] [Received: 05/13/2024] [Revised: 11/30/2024] [Accepted: 12/20/2024] [Indexed: 01/28/2025]
Abstract
OBJECTIVE To develop a distributed algorithm to fit multi-center Cox regression models with time-varying coefficients to facilitate privacy-preserving data integration across multiple health systems. MATERIALS AND METHODS The Cox model with time-varying coefficients relaxes the proportional hazards assumption of the usual Cox model and is particularly useful to model time-to-event outcomes. We proposed a One-shot Distributed Algorithm to fit multi-center Cox regression models with Time-varying coefficients (ODACT). This algorithm constructed a surrogate likelihood function to approximate the Cox partial likelihood function, using patient-level data from a lead site and aggregated data from other sites. The performance of ODACT was demonstrated by simulation and a real-world study of opioid use disorder (OUD) using decentralized data from a large clinical research network across 5 sites with 69 163 subjects. RESULTS The ODACT method precisely estimated the time-varying effects. In the simulation study, ODACT always achieved estimation close to that of the pooled analysis, while the meta-estimator showed a considerable amount of bias. In the OUD study, the bias of the hazard ratios estimated by ODACT was smaller than that of the meta-estimator for all 7 risk factors at almost all time points from 0 to 2.5 years. The greatest bias of the meta-estimator was for the effects of age ≥65 years and smoking. CONCLUSION ODACT is a privacy-preserving and communication-efficient method for analyzing multi-center time-to-event data which allows the covariates' effects to be time-varying. ODACT provides estimates close to the pooled estimator and substantially outperforms the meta-analysis estimator. DISCUSSION The proposed ODACT is a privacy-preserving distributed algorithm for fitting Cox models with time-varying coefficients. Its limitations are that privacy preservation via aggregate data relies on a relatively large amount of data at each individual site, and that rigorous quantification of the risk of privacy leaks requires further investigation.
Affiliation(s)
- C Jason Liang
- Biostatistics Research Branch, National Institute of Allergy and Infectious Diseases, Bethesda, MD 20892, United States
- Chongliang Luo
- Division of Public Health Sciences, Washington University School of Medicine, St Louis, MO 63110, United States
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States
- Henry R Kranzler
- Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, United States
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL 32610, United States
- Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States
- Center for Health AI and Synthesis of Evidence, University of Pennsylvania, Philadelphia, PA 19104, United States
3
Zhou X, Chang L, Cao J. Communication-Efficient Nonconvex Federated Learning With Error Feedback for Uplink and Downlink. IEEE Transactions on Neural Networks and Learning Systems 2025; 36:1003-1014. [PMID: 37995164 DOI: 10.1109/tnnls.2023.3333804] [Indexed: 11/25/2023]
Abstract
In large-scale online learning, the reliance on sophisticated model architectures often leads to nonconvex distributed optimization, which is more challenging than convex problems. Online recruited workers, such as mobile phones, laptops, and desktop computers, often have narrower uplink bandwidths than downlink. In this article, we propose two communication-efficient nonconvex federated learning algorithms with error feedback 2021 (EF21) and lazily aggregated gradient (LAG) for adapting uplink and downlink communications. EF21 is a new and theoretically better EF, which consistently and substantially outperforms vanilla EF in practice. LAG is a gradient filtration technique for adapting communication. To reduce uplink communication costs, we design an effective LAG rule and then give the EF21 with LAG (EF-LAG) algorithm, which combines EF21 and our LAG rule. We also present a bidirectional EF-LAG (BiEF-LAG) algorithm for reducing both uplink and downlink communication costs. Theoretically, our proposed algorithms enjoy the same fast convergence rate as gradient descent (GD) for smooth nonconvex learning. That is, our algorithms greatly reduce communication costs without sacrificing the quality of learning. Numerical experiments on both synthetic data and deep learning benchmarks show significant empirical superiority of our algorithms in communication.
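The error-feedback mechanism that the abstract builds on can be sketched on a toy problem. The sketch below follows the commonly described EF21 update (each worker uploads only a compressed change to its stored gradient estimate) with a top-k compressor. It does not reproduce the paper's LAG rule or the bidirectional variant, which the abstract does not specify. The quadratic objectives, the function names, the step size, and the initialisation are all illustrative assumptions.

```python
from typing import List

Vector = List[float]


def top_k(v: Vector, k: int) -> Vector:
    """Keep the k largest-magnitude entries of v and zero the rest."""
    keep = set(sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k])
    return [v[j] if j in keep else 0.0 for j in range(len(v))]


def ef21(targets: List[Vector], step: float, k: int, rounds: int) -> Vector:
    """Minimise the average of f_i(x) = 0.5 * ||x - a_i||^2 with EF21-style updates."""
    n, d = len(targets), len(targets[0])
    x = [0.0] * d
    # Assumed initialisation: each worker stores its exact local gradient at x.
    g_local = [[x[j] - a[j] for j in range(d)] for a in targets]
    g = [sum(gi[j] for gi in g_local) / n for j in range(d)]
    for _ in range(rounds):
        # Server step using the current aggregated gradient estimate.
        x = [x[j] - step * g[j] for j in range(d)]
        for i, a in enumerate(targets):
            grad = [x[j] - a[j] for j in range(d)]
            # Uplink: send only the compressed change relative to the stored estimate.
            c = top_k([grad[j] - g_local[i][j] for j in range(d)], k)
            g_local[i] = [g_local[i][j] + c[j] for j in range(d)]
            g = [g[j] + c[j] / n for j in range(d)]
    return x


# Two hypothetical workers; the minimiser of the average objective is the mean of the targets.
solution = ef21([[1.0, 0.0], [3.0, 4.0]], step=0.1, k=1, rounds=2000)
print(solution)
```

Each worker here uploads a single coordinate per round, yet the iterate still approaches the mean of the two targets, which is the kind of communication saving without loss of solution quality that the abstract describes.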
4
Tong J, Li L, Reps JM, Lorman V, Jing N, Edmondson M, Lou X, Jhaveri R, Kelleher KJ, Pajor NM, Forrest CB, Bian J, Chu H, Chen Y. Advancing Interpretable Regression Analysis for Binary Data: A Novel Distributed Algorithm Approach. Stat Med 2024; 43:5573-5582. [PMID: 39489875 PMCID: PMC11588971 DOI: 10.1002/sim.10250] [Received: 05/05/2023] [Revised: 08/04/2024] [Accepted: 10/02/2024] [Indexed: 11/05/2024]
Abstract
Sparse data bias, where there is a lack of sufficient cases, is a common problem in data analysis, particularly when studying rare binary outcomes. Although a two-step meta-analysis approach may be used to lessen the bias by combining the summary statistics to increase the number of cases from multiple studies, this method does not completely eliminate bias in effect estimation. In this paper, we propose a one-shot distributed algorithm for estimating relative risk using a modified Poisson regression for binary data, named ODAP-B. We evaluate the performance of our method through both simulation studies and real-world case analyses of postacute sequelae of SARS-CoV-2 infection in children using data from 184 501 children across eight national academic medical centers. Compared with the meta-analysis method, our method provides closer estimates of the relative risk for all outcomes considered including syndromic and systemic outcomes. Our method is communication-efficient and privacy-preserving, requiring only aggregated data to obtain relatively unbiased effect estimates compared with two-step meta-analysis methods. Overall, ODAP-B is an effective distributed learning algorithm for Poisson regression to study rare binary outcomes. The method provides inference on adjusted relative risk with a robust variance estimator.
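The two-step meta-analysis comparator mentioned in this abstract can take several forms, and the abstract does not say which one was used. As a minimal sketch, assuming a fixed-effect inverse-variance combination of site-level log relative risks, the pooling step looks like the following. This is not ODAP-B itself; the function name and the site-level numbers are hypothetical.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_meta(
    estimates: Sequence[float], std_errors: Sequence[float]
) -> Tuple[float, float]:
    """Inverse-variance weighted average of site estimates and its standard error."""
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    weights = [1.0 / (se * se) for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


# Hypothetical site-level log relative risks and standard errors.
site_log_rr = [0.40, 0.10, 0.25]
site_se = [0.20, 0.50, 0.10]
log_rr, se = fixed_effect_meta(site_log_rr, site_se)
print(math.exp(log_rr), se)
```

Because each site's estimate is computed separately before pooling, a site with very few cases contributes an unstable estimate, which is one way to read the abstract's point that this approach does not fully remove sparse data bias.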
Affiliation(s)
- Jiayi Tong
- Center for Health AI and Synthesis of Evidence (CHASE), Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- Lu Li
- Center for Health AI and Synthesis of Evidence (CHASE), Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jenna Marie Reps
- Janssen Research and Development, Titusville, New Jersey, USA
- Observational Health Data Sciences and Informatics (OHDSI), New York, New York, USA
- Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Vitaly Lorman
- Applied Clinical Research Center, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Naimin Jing
- Biostatistics and Research Decision Sciences, Merck & Co., Inc, Rahway, New Jersey, USA
- Mackenzie Edmondson
- Biostatistics and Research Decision Sciences, Merck & Co., Inc, Rahway, New Jersey, USA
- Xiwei Lou
- Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
- Ravi Jhaveri
- Division of Infectious Diseases, Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, Illinois, USA
- Kelly J. Kelleher
- Center for Child Health Equity and Outcomes Research, The Abigail Wexner Research Institute at Nationwide Children's Hospital, Columbus, Ohio, USA
- Nathan M. Pajor
- Divisions of Pulmonary Medicine | Biomedical Informatics | James M. Anderson Center for Health Systems Excellence, Cincinnati Children's Hospital Medical Center and University of Cincinnati College of Medicine, Cincinnati, Ohio, USA
- Christopher B. Forrest
- Applied Clinical Research Center, Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, USA
- Jiang Bian
- Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
- Haitao Chu
- Statistical Research and Data Science Center, Pfizer Inc., New York, New York, USA
- Yong Chen
- Center for Health AI and Synthesis of Evidence (CHASE), Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Penn Institute for Biomedical Informatics (IBI), Philadelphia, Pennsylvania, USA
- Leonard Davis Institute of Health Economics, Philadelphia, Pennsylvania, USA
- Penn Medicine Center for Evidence-based Practice (CEP), Philadelphia, Pennsylvania, USA
5
Li R, Romano JD, Chen Y, Moore JH. Centralized and Federated Models for the Analysis of Clinical Data. Annu Rev Biomed Data Sci 2024; 7:179-199. [PMID: 38723657 PMCID: PMC11571052 DOI: 10.1146/annurev-biodatasci-122220-115746] [Indexed: 08/25/2024]
Abstract
The progress of precision medicine research hinges on the gathering and analysis of extensive and diverse clinical datasets. With the continued expansion of modalities, scales, and sources of clinical datasets, it becomes imperative to devise methods for aggregating information from these varied sources to achieve a comprehensive understanding of diseases. In this review, we describe two important approaches for the analysis of diverse clinical datasets, namely the centralized model and federated model. We compare and contrast the strengths and weaknesses inherent in each model and present recent progress in methodologies and their associated challenges. Finally, we present an outlook on the opportunities that both models hold for the future analysis of clinical data.
Affiliation(s)
- Ruowang Li
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, California, USA
- Joseph D Romano
- Institute for Biomedical Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Yong Chen
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jason H Moore
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, California, USA
6
Camirand Lemyre F, Lévesque S, Domingue MP, Herrmann K, Ethier JF. Distributed Statistical Analyses: A Scoping Review and Examples of Operational Frameworks Adapted to Health Analytics. JMIR Med Inform 2024; 12:e53622. [PMID: 39028684 PMCID: PMC11617597 DOI: 10.2196/53622] [Received: 10/12/2023] [Revised: 07/10/2024] [Accepted: 07/19/2024] [Indexed: 07/21/2024]
Abstract
BACKGROUND Data from multiple organizations are crucial for advancing learning health systems. However, ethical, legal, and social concerns may restrict the use of standard statistical methods that rely on pooling data. Although distributed algorithms offer alternatives, they may not always be suitable for health frameworks. OBJECTIVE This paper aims to support researchers and data custodians in three ways: (1) providing a concise overview of the literature on statistical inference methods for horizontally partitioned data; (2) describing the methods applicable to generalized linear models (GLM) and assessing their underlying distributional assumptions; (3) adapting existing methods to make them fully usable in health settings. METHODS A scoping review methodology was employed for the literature mapping, from which methods presenting a methodological framework for GLM analyses with horizontally partitioned data were identified and assessed from the perspective of applicability in health settings. Statistical theory was used to adapt methods and to derive the properties of the resulting estimators. RESULTS From the review, 41 articles were selected, and six approaches were extracted for conducting standard GLM-based statistical analysis. However, these approaches assumed evenly and identically distributed data across nodes. Consequently, statistical procedures were derived to accommodate uneven node sample sizes and heterogeneous data distributions across nodes. Workflows and detailed algorithms were developed to highlight information-sharing requirements and operational complexity. CONCLUSIONS This paper contributes to the field of health analytics by providing an overview of the methods that can be used with horizontally partitioned data, by adapting these methods to the context of heterogeneous health data and by clarifying the workflows and quantities exchanged by the methods discussed. 
Further analysis of the confidentiality preserved by these methods is needed to fully understand the risk associated with the sharing of summary statistics.
Affiliation(s)
- Félix Camirand Lemyre
- GRIIS, Université de Sherbrooke, 2500, Boul de l'Université, Sherbrooke, QC, J1K 2R1, Canada, 1-819-821-8000 ext 74977
- Département de mathématiques, Faculté des sciences, Université de Sherbrooke, Sherbrooke, QC, Canada
- Simon Lévesque
- GRIIS, Université de Sherbrooke, 2500, Boul de l'Université, Sherbrooke, QC, J1K 2R1, Canada, 1-819-821-8000 ext 74977
- Département de mathématiques, Faculté des sciences, Université de Sherbrooke, Sherbrooke, QC, Canada
- Health Data Research Network Canada, Vancouver, BC, Canada
- Marie-Pier Domingue
- GRIIS, Université de Sherbrooke, 2500, Boul de l'Université, Sherbrooke, QC, J1K 2R1, Canada, 1-819-821-8000 ext 74977
- Département de mathématiques, Faculté des sciences, Université de Sherbrooke, Sherbrooke, QC, Canada
- Chaire MEIE Québec - Le numérique au service des systèmes de santé apprenants, Université de Sherbrooke, Sherbrooke, QC, Canada
- Klaus Herrmann
- Département de mathématiques, Faculté des sciences, Université de Sherbrooke, Sherbrooke, QC, Canada
- Jean-François Ethier
- GRIIS, Université de Sherbrooke, 2500, Boul de l'Université, Sherbrooke, QC, J1K 2R1, Canada, 1-819-821-8000 ext 74977
- Health Data Research Network Canada, Vancouver, BC, Canada
- Département de médecine, Faculté de médecine et des sciences de la santé, Université de Sherbrooke, Sherbrooke, QC, Canada
7
Tong J, Shen Y, Xu A, He X, Luo C, Edmondson M, Zhang D, Lu Y, Yan C, Li R, Siegel L, Sun L, Shenkman EA, Morton SC, Malin BA, Bian J, Asch DA, Chen Y. Evaluating site-of-care-related racial disparities in kidney graft failure using a novel federated learning framework. J Am Med Inform Assoc 2024; 31:1303-1312. [PMID: 38713006 PMCID: PMC11105132 DOI: 10.1093/jamia/ocae075] [Received: 05/05/2023] [Revised: 01/09/2024] [Accepted: 03/26/2024] [Indexed: 05/08/2024]
Abstract
OBJECTIVES Racial disparities in kidney transplant access and posttransplant outcomes exist between non-Hispanic Black (NHB) and non-Hispanic White (NHW) patients in the United States, with the site of care being a key contributor. When multi-site data are used to examine the effect of site of care on racial disparities, the key challenge is that regulations protecting patients' privacy restrict the sharing of patient-level data. MATERIALS AND METHODS We developed a federated learning framework, named dGEM-disparity (decentralized algorithm for Generalized linear mixed Effect Model for disparity quantification). Consisting of 2 modules, dGEM-disparity first provides accurately estimated common effects and calibrated hospital-specific effects by requiring only aggregated data from each center and then adopts a counterfactual modeling approach to assess whether the graft failure rates would differ if NHB patients had been admitted at transplant centers in the same distribution as NHW patients. RESULTS Utilizing United States Renal Data System data from 39 043 adult patients across 73 transplant centers over 10 years, we found that if NHB patients had followed the distribution of NHW patients in admissions, there would be 38 fewer deaths or graft failures per 10 000 NHB patients (95% CI, 35-40) within 1 year of receiving a kidney transplant on average. DISCUSSION The proposed framework facilitates efficient collaborations in clinical research networks. Additionally, the framework, by using counterfactual modeling to calculate the event rate, allows us to investigate contributions to racial disparities that may occur at the level of site of care. CONCLUSIONS Our framework is broadly applicable to other decentralized datasets and disparities research related to differential access to care. Ultimately, our proposed framework will advance equity in human health by identifying and addressing hospital-level racial disparities.
Affiliation(s)
- Jiayi Tong
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Yishan Shen
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Applied Mathematics and Computational Science, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Alice Xu
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Washington University in St. Louis, St. Louis, MO 63130, United States
- Xing He
- Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States
- Chongliang Luo
- Division of Public Health Sciences, Department of Surgery, Washington University in St. Louis, St. Louis, MO 63110, United States
- Dazheng Zhang
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Yiwen Lu
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Chao Yan
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Ruowang Li
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA 90048, United States
- Lianne Siegel
- Division of Biostatistics and Health Data Science, School of Public Health, University of Minnesota, Minneapolis, MN 55414, United States
- Lichao Sun
- Department of Computer Science and Engineering, Lehigh University, Bethlehem, PA 18015, United States
- Elizabeth A Shenkman
- Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States
- Sally C Morton
- School of Mathematical and Statistical Sciences, Arizona State University, Tempe, AZ 85287, United States
- Bradley A Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Computer Science, Vanderbilt University, Nashville, TN 37212, United States
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Jiang Bian
- Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32611, United States
- David A Asch
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, PA 19104, United States
- Leonard Davis Institute of Health Economics, Philadelphia, PA 19104, United States
- Yong Chen
- The Center for Health AI and Synthesis of Evidence (CHASE), Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Applied Mathematics and Computational Science, The University of Pennsylvania, Philadelphia, PA 19104, United States
- Leonard Davis Institute of Health Economics, Philadelphia, PA 19104, United States
8
Zhang D, Tong J, Jing N, Yang Y, Luo C, Lu Y, Christakis DA, Güthe D, Hornig M, Kelleher KJ, Morse KE, Rogerson CM, Divers J, Carroll RJ, Forrest CB, Chen Y. Learning competing risks across multiple hospitals: one-shot distributed algorithms. J Am Med Inform Assoc 2024; 31:1102-1112. [PMID: 38456459 PMCID: PMC11031234 DOI: 10.1093/jamia/ocae027] [Received: 10/03/2023] [Revised: 12/30/2023] [Accepted: 02/03/2024] [Indexed: 03/09/2024]
Abstract
OBJECTIVES To characterize the complex interplay between multiple clinical conditions in a time-to-event analysis framework using data from multiple hospitals, we developed two novel one-shot distributed algorithms for competing risk models (ODACoR). By applying our algorithms to the EHR data from eight national children's hospitals, we quantified the impacts of a wide range of risk factors on the risk of post-acute sequelae of SARS-CoV-2 (PASC) among children and adolescents. MATERIALS AND METHODS Our ODACoR algorithms can be executed effectively owing to their simplicity and communication efficiency. We evaluated our algorithms via extensive simulation studies and an application quantifying the impacts of risk factors for PASC among children and adolescents, using data from eight children's hospitals, including the Children's Hospital of Philadelphia, Cincinnati Children's Hospital Medical Center, and Children's Hospital of Colorado, covering over 6.5 million pediatric patients. The accuracy of the estimation was assessed by comparing the results from our ODACoR algorithms with the estimators derived from the meta-analysis and the pooled data. RESULTS The meta-analysis estimator showed a high relative bias (∼40%) when the clinical condition was relatively rare (∼0.5%), whereas the ODACoR algorithms exhibited a substantially lower relative bias (∼0.2%). The estimated effects from our ODACoR algorithms were on par with the estimates from the pooled data, suggesting the high reliability of our federated learning algorithms. In contrast, the meta-analysis estimate failed to identify risk factors such as age, gender, chronic conditions history, and obesity, compared to the pooled data. DISCUSSION Our proposed ODACoR algorithms are communication-efficient, highly accurate, and suitable to characterize the complex interplay between multiple clinical conditions. CONCLUSION Our study demonstrates that our ODACoR algorithms are communication-efficient and can be widely applicable for analyzing multiple clinical conditions in a time-to-event analysis framework.
Affiliation(s)
- Dazheng Zhang
- The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Jiayi Tong
- The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Naimin Jing
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Biostatistics and Research Decision Sciences, Merck & Co., Inc, Rahway, NJ 07065, United States
- Yuchen Yang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Chongliang Luo
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Division of Public Health Sciences, Washington University School of Medicine in St Louis, St Louis, MO 63110, United States
- Yiwen Lu
- The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA 19104, United States
- Diana Güthe
- Survivor Corps, Washington, DC 20814, United States
- Mady Hornig
- Department of Epidemiology, Columbia University Mailman School of Public Health, New York, NY 10032, United States
- Kelly J Kelleher
- Research Institute at Nationwide Children’s Hospital, Columbus, OH 43205, United States
| | - Keith E Morse
- Division of Pediatric Hospital Medicine, Department of Pediatrics, Stanford University, Palo Alto, CA 94304, United States
| | - Colin M Rogerson
- Department of Pediatrics, Indiana University School of Medicine, Indianapolis, IN 46202, United States
| | - Jasmin Divers
- Department of Foundations of Medicine, New York University Long Island School of Medicine, Mineola, NY 11501, United States
| | - Raymond J Carroll
- Department of Statistics, Texas A&M University, College Station, TX 77843, United States
| | - Christopher B Forrest
- Applied Clinical Research Center, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, United States
| | - Yong Chen
- The Center for Health AI and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States
- The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA 19104, United States
- Penn Institute for Biomedical Informatics (IBI), Philadelphia, PA 19104, United States
- Leonard Davis Institute of Health Economics, Philadelphia, PA 19104, United States
- Penn Medicine Center for Evidence-based Practice (CEP), Philadelphia, PA 19104, United States
| |
Collapse
|
9
|
Kim C, Yu DH, Baek H, Cho J, You SC, Park RW. Data Resource Profile: Health Insurance Review and Assessment Service Covid-19 Observational Medical Outcomes Partnership (HIRA Covid-19 OMOP) database in South Korea. Int J Epidemiol 2024; 53:dyae062. [PMID: 38658170 DOI: 10.1093/ije/dyae062] [Received: 09/06/2023] [Accepted: 04/08/2024] [Indexed: 04/26/2024] Open
Affiliation(s)
- Chungsoo Kim
- Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
- Dong Han Yu
- Big Data Department, Health Insurance Assessment and Review Services, Wonju, Republic of Korea
- Hyeran Baek
- Big Data Department, Health Insurance Assessment and Review Services, Wonju, Republic of Korea
- Jaehyeong Cho
- Department of Research, Keimyung University Dongsan Medical Center, Daegu, Republic of Korea
- Seng Chan You
- Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
- Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Rae Woong Park
- Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
|
10
|
Rollo C, Pancotti C, Birolo G, Rossi I, Sanavia T, Fariselli P. SYNDSURV: A simple framework for survival analysis with data distributed across multiple institutions. Comput Biol Med 2024; 172:108288. [PMID: 38503094 DOI: 10.1016/j.compbiomed.2024.108288] [Received: 01/02/2024] [Revised: 02/17/2024] [Accepted: 03/12/2024] [Indexed: 03/21/2024]
Abstract
Data sharing among different institutions represents one of the major challenges in developing distributed machine learning approaches, especially when data is sensitive, such as in medical applications. Federated learning is a possible solution, but requires fast communications and flawless security. Here, we propose SYNDSURV (SYNthetic Distributed SURVival), an alternative approach that simplifies the current state-of-the-art paradigm by allowing different centres to generate local simulated instances from real data and then gather them into a centralised hub, where an Artificial Intelligence (AI) model can learn in a standard way. The main advantage of this procedure is that it is model-agnostic, therefore prediction models can be directly applied in distributed applications without requiring particular adaptations as the current federated approaches do. To show the validity of our approach for medical applications, we tested it on a survival analysis task, offering a viable alternative to train AI models on distributed data. While federated learning has been mainly optimised for gradient-based approaches so far, our framework works with any predictive method, proving to be a comparable way of performing distributed learning without being too demanding towards each participating institute in terms of infrastructural requirements.
Affiliation(s)
- Cesare Rollo
- University of Torino, Via Santena 19, Torino, 10126, Italy.
- Ivan Rossi
- University of Torino, Via Santena 19, Torino, 10126, Italy.
|
11
|
Zhang D, Tong J, Stein R, Lu Y, Jing N, Yang Y, Boland MR, Luo C, Baldassano RN, Carroll RJ, Forrest CB, Chen Y. One-shot distributed algorithms for addressing heterogeneity in competing risks data across clinical sites. J Biomed Inform 2024; 150:104595. [PMID: 38244958 PMCID: PMC11002871 DOI: 10.1016/j.jbi.2024.104595] [Received: 10/28/2023] [Revised: 12/15/2023] [Accepted: 01/15/2024] [Indexed: 01/22/2024]
Abstract
OBJECTIVE To characterize the interplay between multiple medical conditions across sites and account for the heterogeneity in patient population characteristics across sites within a distributed research network, we develop a one-shot algorithm that can efficiently utilize summary-level data from various institutions. By applying our proposed algorithm to a large pediatric cohort across four national children's hospitals, we replicated a recently published prospective cohort, the RISK study, and quantified the impact of the risk factors associated with the penetrating or stricturing behaviors of pediatric Crohn's disease (PCD). METHODS In this study, we introduce the ODACoRH algorithm, a one-shot distributed algorithm designed for the competing risks model with heterogeneity. Our approach considers the variability in baseline hazard functions of multiple endpoints of interest across different sites. To accomplish this, we build a surrogate likelihood function by combining patient-level data from the local site with aggregated data from other external sites. We validated our method through extensive simulation studies and replication of the RISK study to investigate the impact of risk factors on the PCD for adolescents and children from four children's hospitals within the PEDSnet, A National Pediatric Learning Health System. To evaluate our ODACoRH algorithm, we compared its results with those from meta-analysis as well as those derived from the pooled data. RESULTS The ODACoRH algorithm had the smallest relative bias to the gold standard method (-0.2%), outperforming the meta-analysis method (-11.4%). In the PCD association study, the estimated subdistribution hazard ratios obtained through the ODACoRH algorithm are on par with the results derived from pooled data, which demonstrates the high reliability of our federated learning algorithms. From a clinical standpoint, the identified risk factors for PCD align well with the RISK study published in the Lancet in 2017 and other published studies, supporting the validity of our findings. CONCLUSION With the ODACoRH algorithm, we demonstrate the capability of effectively integrating data from multiple sites in a decentralized data setting while accounting for between-site heterogeneity. Importantly, our study reveals several crucial clinical risk factors for PCD that merit further investigations.
Affiliation(s)
- Dazheng Zhang
- The Center for Health Analytics and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania; Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA. https://twitter.com/DazhengZ
- Jiayi Tong
- The Center for Health Analytics and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania; Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA. https://twitter.com/JiayiJessieTong
- Ronen Stein
- Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA, USA; Department of Pediatrics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
- Yiwen Lu
- The Center for Health Analytics and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania; The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA
- Naimin Jing
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Biostatistics and Research Decision Sciences, Merck & Co., Inc, NJ, USA
- Yuchen Yang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Mary R Boland
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Department of Mathematics, Saint Vincent College, Latrobe, PA, USA
- Chongliang Luo
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; Division of Public Health Sciences, Washington University School of Medicine in St Louis, St Louis, MO, USA
- Robert N Baldassano
- Department of Pediatrics, Children's Hospital of Philadelphia, Philadelphia, PA, USA; Department of Pediatrics, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA, USA
- Christopher B Forrest
- Applied Clinical Research Center, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Yong Chen
- The Center for Health Analytics and Synthesis of Evidence (CHASE), University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania; Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA; The Graduate Group in Applied Mathematics and Computational Science, School of Arts and Sciences, University of Pennsylvania, Philadelphia, PA, USA; Leonard Davis Institute of Health Economics, Philadelphia, PA, USA; Penn Medicine Center for Evidence-based Practice (CEP), Philadelphia, PA, USA; Penn Institute for Biomedical Informatics (IBI), Philadelphia, PA, USA.
|
12
|
Jing N, Liu X, Wu Q, Rao S, Mejias A, Maltenfort M, Schuchard J, Lorman V, Razzaghi H, Webb R, Zhou C, Jhaveri R, Lee GM, Pajor NM, Thacker D, Bailey LC, Forrest CB, Chen Y. Development and validation of a federated learning framework for detection of subphenotypes of multisystem inflammatory syndrome in children. medRxiv 2024:2024.01.26.24301827. [PMID: 38343837 PMCID: PMC10854314 DOI: 10.1101/2024.01.26.24301827] [Indexed: 02/19/2024]
Abstract
Background Multisystem inflammatory syndrome in children (MIS-C) is a severe post-acute sequela of SARS-CoV-2 infection. The highly diverse clinical features of MIS-C necessitate characterizing its features by subphenotypes for improved recognition and treatment. However, jointly identifying subphenotypes in multi-site settings can be challenging. We propose a distributed multi-site latent class analysis (dMLCA) approach to jointly learn MIS-C subphenotypes using data across multiple institutions. Methods We used data from the electronic health records (EHR) systems across nine U.S. children's hospitals. Among the 3,549,894 patients, we extracted 864 patients < 21 years of age who had received a diagnosis of MIS-C during an inpatient stay or up to one day before admission. Using MIS-C conditions, laboratory results, and procedure information as input features for the patients, we applied our dMLCA algorithm and identified three MIS-C subphenotypes. As validation, we characterized and compared more granular features across subphenotypes. To evaluate the specificity of the identified subphenotypes, we further compared them with the general subphenotypes identified in COVID-19-infected patients. Findings Subphenotype 1 (46.1%) represents patients with a mild manifestation of MIS-C not requiring intensive care, with minimal cardiac involvement. Subphenotype 2 (25.3%) is associated with a high risk of shock, cardiac and renal involvement, and an intermediate risk of respiratory symptoms. Subphenotype 3 (28.6%) represents patients requiring intensive care, with a high risk of shock and cardiac involvement, accompanied by a high risk of >4 organ systems being impacted. Importantly, for hospital-specific clinical decision-making, our algorithm also revealed a substantial heterogeneity in relative proportions of these three subtypes across hospitals. Properly accounting for such heterogeneity can lead to accurate characterization of the subphenotypes at the patient level. Interpretation The three MIS-C subphenotypes we identified have profound implications for personalized treatment strategies, potentially influencing clinical outcomes. Further, the proposed algorithm facilitates federated subphenotyping while accounting for the heterogeneity across hospitals.
Affiliation(s)
- Naimin Jing
- Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA
- Current affiliation: Biostatistics and Research Decision Sciences, Merck & Co., Inc, Kenilworth, NJ
- Xiaokang Liu
- Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA
- Qiong Wu
- Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA
- Suchitra Rao
- Department of Pediatrics, University of Colorado School of Medicine and Children’s Hospital Colorado, Aurora, CO
- Asuncion Mejias
- Division of Infectious Diseases, Department of Pediatrics, Nationwide Children’s Hospital and The Ohio State University, Columbus, OH
- Mitchell Maltenfort
- Applied Clinical Research Center, Children’s Hospital of Philadelphia, Philadelphia, PA
- Julia Schuchard
- Applied Clinical Research Center, Children’s Hospital of Philadelphia, Philadelphia, PA
- Vitaly Lorman
- Applied Clinical Research Center, Children’s Hospital of Philadelphia, Philadelphia, PA
- Hanieh Razzaghi
- Applied Clinical Research Center, Children’s Hospital of Philadelphia, Philadelphia, PA
- Ryan Webb
- Applied Clinical Research Center, Children’s Hospital of Philadelphia, Philadelphia, PA
- Chuan Zhou
- Center for Child Health, Behavior and Development, Seattle Children’s Hospital, Seattle, WA
- Ravi Jhaveri
- Division of Infectious Diseases, Ann & Robert H. Lurie Children’s Hospital of Chicago, Chicago, IL
- Grace M. Lee
- Department of Pediatrics (Infectious Diseases), Stanford University School of Medicine, Stanford, CA
- Nathan M. Pajor
- Division of Pulmonary Medicine, Cincinnati Children’s Hospital Medical Center and University of Cincinnati College of Medicine, Cincinnati, OH
- Deepika Thacker
- Division of Cardiology, Nemours Children’s Health, Wilmington, DE
- L. Charles Bailey
- Applied Clinical Research Center, Children’s Hospital of Philadelphia, Philadelphia, PA
- Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
- Christopher B. Forrest
- Applied Clinical Research Center, Children’s Hospital of Philadelphia, Philadelphia, PA
- Department of Pediatrics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
- Yong Chen
- Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA
|
13
|
Miao G, Yu L, Yang J, Bennett DA, Zhao J, Wu SS. Learning from vertically distributed data across multiple sites: An efficient privacy-preserving algorithm for Cox proportional hazards model with variable selection. J Biomed Inform 2024; 149:104581. [PMID: 38142903 PMCID: PMC10996392 DOI: 10.1016/j.jbi.2023.104581] [Received: 05/21/2023] [Revised: 08/24/2023] [Accepted: 12/19/2023] [Indexed: 12/26/2023]
Abstract
OBJECTIVE To develop a lossless distributed algorithm for regularized Cox proportional hazards model with variable selection to support federated learning for vertically distributed data. METHODS We propose a novel distributed algorithm for fitting regularized Cox proportional hazards model when data sharing among different data providers is restricted. Based on cyclical coordinate descent, the proposed algorithm computes intermediary statistics by each site and then exchanges them to update the model parameters in other sites without accessing individual patient-level data. We evaluate the performance of the proposed algorithm with (1) a simulation study and (2) a real-world data analysis predicting the risk of Alzheimer's dementia from the Religious Orders Study and Rush Memory and Aging Project (ROSMAP). Moreover, we compared the performance of our method with existing privacy-preserving models. RESULTS Our algorithm achieves privacy-preserving variable selection for time-to-event data in the vertically distributed setting, without degradation of accuracy compared with a centralized approach. Simulation demonstrates that our algorithm is highly efficient in analyzing high-dimensional datasets. Real-world data analysis reveals that our distributed Cox model yields higher accuracy in predicting the risk of Alzheimer's dementia than the conventional Cox model built by each data provider without data sharing. Moreover, our algorithm is computationally more efficient compared with existing privacy-preserving Cox models with or without regularization term. CONCLUSION The proposed algorithm is lossless, privacy-preserving and highly efficient to fit regularized Cox model for vertically distributed data. It provides a suitable and convenient approach for modeling time-to-event data in a distributed manner.
Affiliation(s)
- Guanhong Miao
- Department of Epidemiology, College of Public Health & Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA; Center for Genetic Epidemiology and Bioinformatics, University of Florida, Gainesville, FL, USA; Department of Biostatistics, College of Public Health & Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA
- Lei Yu
- Rush Alzheimer's Disease Center & Department of Neurological Sciences, Rush University Medical Center, Chicago, IL, USA
- Jingyun Yang
- Rush Alzheimer's Disease Center & Department of Neurological Sciences, Rush University Medical Center, Chicago, IL, USA
- David A Bennett
- Rush Alzheimer's Disease Center & Department of Neurological Sciences, Rush University Medical Center, Chicago, IL, USA
- Jinying Zhao
- Department of Epidemiology, College of Public Health & Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA; Center for Genetic Epidemiology and Bioinformatics, University of Florida, Gainesville, FL, USA
- Samuel S Wu
- Department of Biostatistics, College of Public Health & Health Professions and College of Medicine, University of Florida, Gainesville, FL, USA.
|
14
|
Li S, Cai T, Duan R. Targeting underrepresented populations in precision medicine: A federated transfer learning approach. Ann Appl Stat 2023; 17:2970-2992. [PMID: 39314265 PMCID: PMC11417462 DOI: 10.1214/23-aoas1747] [Indexed: 09/25/2024]
Abstract
The limited representation of minorities and disadvantaged populations in large-scale clinical and genomics research poses a significant barrier to translating precision medicine research into practice. Prediction models are likely to underperform in underrepresented populations due to heterogeneity across populations, thereby exacerbating known health disparities. To address this issue, we propose FETA, a two-way data integration method that leverages a federated transfer learning approach to integrate heterogeneous data from diverse populations and multiple healthcare institutions, with a focus on a target population of interest having limited sample sizes. We show that FETA achieves performance comparable to the pooled analysis, where individual-level data is shared across institutions, with only a small number of communications across participating sites. Our theoretical analysis and simulation study demonstrate how FETA's estimation accuracy is influenced by communication budgets, privacy restrictions, and heterogeneity across populations. We apply FETA to multisite data from the electronic Medical Records and Genomics (eMERGE) Network to construct genetic risk prediction models for extreme obesity. Compared to models trained using target data only, source data only, and all data without accounting for population-level differences, FETA shows superior predictive performance. FETA has the potential to improve estimation and prediction accuracy in underrepresented populations and reduce the gap in model performance across populations.
Affiliation(s)
- Sai Li
- Institute of Statistics and Big Data, Renmin University of China
- Tianxi Cai
- Department of Biostatistics, Harvard T.H. Chan School of Public Health
- Rui Duan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health
|
15
|
Li S, Liu P, Nascimento GG, Wang X, Leite FRM, Chakraborty B, Hong C, Ning Y, Xie F, Teo ZL, Ting DSW, Haddadi H, Ong MEH, Peres MA, Liu N. Federated and distributed learning applications for electronic health records and structured medical data: a scoping review. J Am Med Inform Assoc 2023; 30:2041-2049. [PMID: 37639629 PMCID: PMC10654866 DOI: 10.1093/jamia/ocad170] [Received: 05/19/2023] [Revised: 07/19/2023] [Indexed: 08/31/2023] Open
Abstract
OBJECTIVES Federated learning (FL) has gained popularity in clinical research in recent years to facilitate privacy-preserving collaboration. Structured data, one of the most prevalent forms of clinical data, has experienced significant growth in volume concurrently, notably with the widespread adoption of electronic health records in clinical practice. This review examines FL applications on structured medical data, identifies contemporary limitations, and discusses potential innovations. MATERIALS AND METHODS We searched 5 databases, SCOPUS, MEDLINE, Web of Science, Embase, and CINAHL, to identify articles that applied FL to structured medical data and reported results following the PRISMA guidelines. Each selected publication was evaluated from 3 primary perspectives, including data quality, modeling strategies, and FL frameworks. RESULTS Out of the 1193 papers screened, 34 met the inclusion criteria, with each article consisting of one or more studies that used FL to handle structured clinical/medical data. Of these, 24 utilized data acquired from electronic health records, with clinical predictions and association studies being the most common clinical research tasks that FL was applied to. Only one article exclusively explored the vertical FL setting, while the remaining 33 explored the horizontal FL setting, with only 14 discussing comparisons between single-site (local) and FL (global) analysis. CONCLUSIONS The existing FL applications on structured medical data lack sufficient evaluations of clinically meaningful benefits, particularly when compared to single-site analyses. Therefore, it is crucial for future FL applications to prioritize clinical motivations and develop designs and methodologies that can effectively support and aid clinical practice and research.
Affiliation(s)
- Siqi Li
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Pinyan Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Gustavo G Nascimento
- National Dental Research Institute Singapore, National Dental Centre Singapore, Singapore 168938, Singapore
- Oral Health Academic Clinical Programme, Duke-NUS Medical School, Singapore 169857, Singapore
- Xinru Wang
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Fabio Renato Manzolli Leite
- National Dental Research Institute Singapore, National Dental Centre Singapore, Singapore 168938, Singapore
- Oral Health Academic Clinical Programme, Duke-NUS Medical School, Singapore 169857, Singapore
- Bibhas Chakraborty
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Department of Statistics and Data Science, National University of Singapore, Singapore 117546, Singapore
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, United States
- Chuan Hong
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27708, United States
- Yilin Ning
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Feng Xie
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Zhen Ling Teo
- Singapore National Eye Centre, Singapore, Singapore Eye Research Institute, Singapore 168751, Singapore
- Daniel Shu Wei Ting
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Singapore National Eye Centre, Singapore, Singapore Eye Research Institute, Singapore 168751, Singapore
- Hamed Haddadi
- Department of Computing, Imperial College London, London SW7 2AZ, England, United Kingdom
- Marcus Eng Hock Ong
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Department of Emergency Medicine, Singapore General Hospital, Singapore 169608, Singapore
- Marco Aurélio Peres
- National Dental Research Institute Singapore, National Dental Centre Singapore, Singapore 168938, Singapore
- Oral Health Academic Clinical Programme, Duke-NUS Medical School, Singapore 169857, Singapore
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Nan Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore 169857, Singapore
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore 169857, Singapore
- Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
|
16
|
Li S, Ning Y, Ong MEH, Chakraborty B, Hong C, Xie F, Yuan H, Liu M, Buckland DM, Chen Y, Liu N. FedScore: A privacy-preserving framework for federated scoring system development. J Biomed Inform 2023; 146:104485. [PMID: 37660960 DOI: 10.1016/j.jbi.2023.104485] [Received: 03/29/2023] [Revised: 08/08/2023] [Accepted: 08/31/2023] [Indexed: 09/05/2023]
Abstract
OBJECTIVE We propose FedScore, a privacy-preserving federated learning framework for scoring system generation across multiple sites to facilitate cross-institutional collaborations. MATERIALS AND METHODS The FedScore framework includes five modules: federated variable ranking, federated variable transformation, federated score derivation, federated model selection and federated model evaluation. To illustrate usage and assess FedScore's performance, we built a hypothetical global scoring system for mortality prediction within 30 days after a visit to an emergency department using 10 simulated sites divided from a tertiary hospital in Singapore. We employed a pre-existing score generator to construct 10 local scoring systems independently at each site and we also developed a scoring system using centralized data for comparison. RESULTS We compared the acquired FedScore model's performance with that of other scoring models using receiver operating characteristic (ROC) analysis. The FedScore model achieved an average area under the curve (AUC) value of 0.763 across all sites, with a standard deviation (SD) of 0.020. We also calculated the average AUC values and SDs for each local model. The FedScore model showed promising accuracy and stability: its average AUC was the closest to that of the pooled model, and its SD was lower than that of most local models. CONCLUSION This study demonstrates that FedScore is a privacy-preserving scoring system generator with potentially good generalizability.
Affiliation(s)
- Siqi Li
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
| | - Yilin Ning
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
| | - Marcus Eng Hock Ong
- Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore; Health Services Research Centre, Singapore Health Services, Singapore, Singapore; Department of Emergency Medicine, Singapore General Hospital, Singapore, Singapore
| | - Bibhas Chakraborty
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore; Department of Statistics and Data Science, National University of Singapore, Singapore, Singapore; Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
| | - Chuan Hong
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
| | - Feng Xie
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore
| | - Han Yuan
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
| | - Mingxuan Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore
| | - Daniel M Buckland
- Department of Emergency Medicine, Duke University School of Medicine, Durham, NC, USA
| | - Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
| | - Nan Liu
- Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore; Institute of Data Science, National University of Singapore, Singapore, Singapore.
|
17
|
Wu Q, Schuemie MJ, Suchard MA, Ryan P, Hripcsak GM, Rohde CA, Chen Y. Padé approximant meets federated learning: A nearly lossless, one-shot algorithm for evidence synthesis in distributed research networks with rare outcomes. J Biomed Inform 2023; 145:104476. [PMID: 37598737 PMCID: PMC11056245 DOI: 10.1016/j.jbi.2023.104476] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 07/03/2023] [Accepted: 08/12/2023] [Indexed: 08/22/2023]
Abstract
OBJECTIVE We developed and evaluated a novel one-shot distributed algorithm for evidence synthesis in distributed research networks with rare outcomes. MATERIALS AND METHODS Fed-Padé, motivated by a classic mathematical tool, Padé approximants, reconstructs the multi-site data likelihood via Padé approximant whose key parameters can be computed distributively. Thanks to the simplicity of [2,2] Padé approximant, Fed-Padé requests an extremely simple task and low communication cost for data partners. Specifically, each data partner only needs to compute and share the log-likelihood and its first 4 gradients evaluated at an initial estimator. We evaluated the performance of our algorithm with extensive simulation studies and four observational healthcare databases. RESULTS Our simulation studies revealed that a [2,2]-Padé approximant can well reconstruct the multi-site likelihood so that Fed-Padé produces nearly identical estimates to the pooled analysis. Across all simulation scenarios considered, the median of relative bias and rate of instability of our Fed-Padé are both <0.1%, whereas meta-analysis estimates have bias up to 50% and instability up to 75%. Furthermore, the confidence intervals derived from the Fed-Padé algorithm showed better coverage of the truth than confidence intervals based on the meta-analysis. In real data analysis, the Fed-Padé has a relative bias of <1% for all three comparisons for risks of acute liver injury and decreased libido, whereas the meta-analysis estimates have a substantially higher bias (around 10%). CONCLUSION The Fed-Padé algorithm is nearly lossless, stable, communication-efficient, and easy to implement for models with rare outcomes. It provides an extremely suitable and convenient approach for synthesizing evidence in distributed research networks with rare outcomes.
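The abstract describes rebuilding a likelihood from its value and first four derivatives at an initial estimator by means of a [2,2] Padé approximant. The following is a minimal univariate sketch of that construction only, not the authors' implementation: the function names `pade_2_2` and `evaluate` are hypothetical, the expansion point is taken as 0 (so `x` plays the role of an offset from the initial estimator), and the multi-parameter, multi-site aggregation in the paper is not reproduced.

```python
from fractions import Fraction


def pade_2_2(derivs):
    """Return (p, q): coefficient lists of the [2,2] Pade approximant.

    derivs holds f(0), f'(0), f''(0), f'''(0), f''''(0).
    The approximant is (p0 + p1*x + p2*x^2) / (1 + q1*x + q2*x^2).
    """
    if len(derivs) != 5:
        raise ValueError("need the function value and its first four derivatives")
    factorials = [1, 1, 2, 6, 24]
    # Taylor coefficients c_k = f^(k)(0) / k!
    c = [Fraction(d) / k for d, k in zip(derivs, factorials)]

    # Matching the x^3 and x^4 terms gives a 2x2 linear system for q1, q2:
    #   c3 + c2*q1 + c1*q2 = 0
    #   c4 + c3*q1 + c2*q2 = 0
    det = c[2] * c[2] - c[1] * c[3]
    if det == 0:
        raise ZeroDivisionError("degenerate system: no unique [2,2] approximant")
    q1 = (c[1] * c[4] - c[2] * c[3]) / det
    q2 = (c[3] * c[3] - c[2] * c[4]) / det

    # Matching the x^0, x^1 and x^2 terms gives the numerator.
    p0 = c[0]
    p1 = c[1] + c[0] * q1
    p2 = c[2] + c[1] * q1 + c[0] * q2
    return [p0, p1, p2], [Fraction(1), q1, q2]


def evaluate(p, q, x):
    """Evaluate the rational function p(x) / q(x)."""
    num = p[0] + p[1] * x + p[2] * x * x
    den = q[0] + q[1] * x + q[2] * x * x
    return num / den


# Example: every derivative of exp(x) at 0 equals 1.
p, q = pade_2_2([1, 1, 1, 1, 1])
approx_at_half = float(evaluate(p, q, Fraction(1, 2)))
```

Only five numbers enter the construction, which is consistent with the low communication cost the abstract emphasises. In a multi-site setting one would presumably sum each site's derivatives before forming the approximant, but that step is an assumption here.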
Affiliation(s)
- Qiong Wu
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States of America
- Martijn J Schuemie
- Observational Health Data Sciences and Informatics, New York, NY, United States of America; Janssen Research & Development, Titusville, NJ, United States of America; Department of Biostatistics, University of California, Los Angeles, CA, United States of America
- Marc A Suchard
- Observational Health Data Sciences and Informatics, New York, NY, United States of America; Department of Biostatistics, University of California, Los Angeles, CA, United States of America; Department of Human Genetics, University of California, Los Angeles, CA, United States of America
- Patrick Ryan
- Observational Health Data Sciences and Informatics, New York, NY, United States of America; Janssen Research & Development, Titusville, NJ, United States of America; Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, United States of America
- George M Hripcsak
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, United States of America; Medical Informatics Services, New York-Presbyterian Hospital, New York, NY, United States of America
- Charles A Rohde
- Department of Biostatistics, Johns Hopkins University, Baltimore, MD, United States of America
- Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States of America; Observational Health Data Sciences and Informatics, New York, NY, United States of America.
|
18
|
Li D, Lu W, Shu D, Toh S, Wang R. Distributed Cox proportional hazards regression using summary-level information. Biostatistics 2023; 24:776-794. [PMID: 35195675 PMCID: PMC10345997 DOI: 10.1093/biostatistics/kxac006] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/12/2020] [Revised: 11/22/2021] [Accepted: 01/30/2022] [Indexed: 07/20/2023] Open
Abstract
Individual-level data sharing across multiple sites can be infeasible due to privacy and logistical concerns. This article proposes a general distributed methodology to fit Cox proportional hazards models without sharing individual-level data in multi-site studies. We make inferences on the log hazard ratios based on an approximated partial likelihood score function that uses only summary-level statistics. This approach can be applied to both stratified and unstratified models, accommodate both discrete and continuous exposure variables, and permit the adjustment of multiple covariates. In particular, the fitting of stratified Cox models can be carried out with only one file transfer of summary-level information. We derive the asymptotic properties of the proposed estimators and compare the proposed estimators with the maximum partial likelihood estimators using pooled individual-level data and meta-analysis methods through simulation studies. We apply the proposed method to a real-world data set to examine the effect of sleeve gastrectomy versus Roux-en-Y gastric bypass on the time to first postoperative readmission.
Affiliation(s)
- Wenbin Lu
- Department of Statistics, North Carolina State University, Raleigh, NC, 27695, USA
- Di Shu
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, 19104, USA, Department of Pediatrics, Children’s Hospital of Philadelphia, Philadelphia, PA, 19104, USA, and Center for Pediatric Clinical Effectiveness, Children’s Hospital of Philadelphia, Philadelphia, PA, 19104, USA
- Sengwee Toh
- Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA, 02215, USA
- Rui Wang
- Department of Population Medicine, Harvard Medical School and Harvard Pilgrim Health Care Institute, Boston, MA, 02215, USA and Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, 02215, USA
|
19
|
Yan Z, Zachrison KS, Schwamm LH, Estrada JJ, Duan R. A privacy-preserving and computation-efficient federated algorithm for generalized linear mixed models to analyze correlated electronic health records data. PLoS One 2023; 18:e0280192. [PMID: 36649349 PMCID: PMC9844867 DOI: 10.1371/journal.pone.0280192] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2022] [Accepted: 12/22/2022] [Indexed: 01/18/2023] Open
Abstract
Large collaborative research networks provide opportunities to jointly analyze multicenter electronic health record (EHR) data, which can improve the sample size, diversity of the study population, and generalizability of the results. However, there are challenges to analyzing multicenter EHR data including privacy protection, large-scale computation resource requirements, heterogeneity across sites, and correlated observations. In this paper, we propose a federated algorithm for generalized linear mixed models (Fed-GLMM), which can flexibly model multicenter longitudinal or correlated data while accounting for site-level heterogeneity. Fed-GLMM can be applied to both federated and centralized research networks to enable privacy-preserving data integration and improve computational efficiency. By communicating a limited amount of summary statistics, Fed-GLMM can achieve nearly identical results as the gold-standard method where the GLMM is directly fitted to the pooled dataset. We demonstrate the performance of Fed-GLMM in numerical experiments and an application to longitudinal EHR data from multiple healthcare facilities.
Affiliation(s)
- Zhiyu Yan
- Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Kori S. Zachrison
- Department of Emergency Medicine, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Harvard Medical School, Boston, Massachusetts, United States of America
- Lee H. Schwamm
- Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Harvard Medical School, Boston, Massachusetts, United States of America
- Mass General Brigham, Boston, Massachusetts, United States of America
- Juan J. Estrada
- Department of Neurology, Massachusetts General Hospital, Boston, Massachusetts, United States of America
- Rui Duan
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
|
20
|
Gu T, Lee PH, Duan R. COMMUTE: Communication-efficient transfer learning for multi-site risk prediction. J Biomed Inform 2023; 137:104243. [PMID: 36403757 PMCID: PMC9868117 DOI: 10.1016/j.jbi.2022.104243] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2022] [Revised: 09/20/2022] [Accepted: 11/06/2022] [Indexed: 11/19/2022]
Abstract
OBJECTIVES We propose a communication-efficient transfer learning approach (COMMUTE) that effectively incorporates multi-site healthcare data for training a risk prediction model in a target population of interest, accounting for challenges including population heterogeneity and data sharing constraints across sites. METHODS We first train population-specific source models locally within each site. Using data from a given target population, COMMUTE learns a calibration term for each source model, which adjusts for potential data heterogeneity through flexible distance-based regularizations. In a centralized setting where multi-site data can be directly pooled, all data are combined to train the target model after calibration. When individual-level data are not shareable in some sites, COMMUTE requests only the locally trained models from these sites, with which, COMMUTE generates heterogeneity-adjusted synthetic data for training the target model. We evaluate COMMUTE via extensive simulation studies and an application to multi-site data from the electronic Medical Records and Genomics (eMERGE) Network to predict extreme obesity. RESULTS Simulation studies show that COMMUTE outperforms methods without adjusting for population heterogeneity and methods trained in a single population over a broad spectrum of settings. Using eMERGE data, COMMUTE achieves an area under the receiver operating characteristic curve (AUC) around 0.80, which outperforms other benchmark methods with AUC ranging from 0.51 to 0.70. CONCLUSION COMMUTE improves the risk prediction in a target population with limited samples and safeguards against negative transfer when some source populations are highly different from the target. In a federated setting, it is highly communication efficient as it only requires each site to share model parameter estimates once, and no iterative communication or higher-order terms are needed.
Affiliation(s)
- Tian Gu
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States
- Phil H Lee
- Department of Psychiatry, Harvard Medical School, Boston, MA, United States; Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, United States; Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, United States
- Rui Duan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, United States.
|
21
|
Wang X, Zhang HG, Xiong X, Hong C, Weber GM, Brat GA, Bonzel CL, Luo Y, Duan R, Palmer NP, Hutch MR, Gutiérrez-Sacristán A, Bellazzi R, Chiovato L, Cho K, Dagliati A, Estiri H, García-Barrio N, Griffier R, Hanauer DA, Ho YL, Holmes JH, Keller MS, Klann MEng JG, L'Yi S, Lozano-Zahonero S, Maidlow SE, Makoudjou A, Malovini A, Moal B, Moore JH, Morris M, Mowery DL, Murphy SN, Neuraz A, Yuan Ngiam K, Omenn GS, Patel LP, Pedrera-Jiménez M, Prunotto A, Jebathilagam Samayamuthu M, Sanz Vidorreta FJ, Schriver ER, Schubert P, Serrano-Balazote P, South AM, Tan ALM, Tan BWL, Tibollo V, Tippmann P, Visweswaran S, Xia Z, Yuan W, Zöller D, Kohane IS, Avillach P, Guo Z, Cai T. SurvMaximin: Robust federated approach to transporting survival risk prediction models. J Biomed Inform 2022; 134:104176. [PMID: 36007785 PMCID: PMC9707637 DOI: 10.1016/j.jbi.2022.104176] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 02/01/2022] [Revised: 07/18/2022] [Accepted: 08/15/2022] [Indexed: 10/15/2022]
Abstract
OBJECTIVE For multi-center heterogeneous Real-World Data (RWD) with time-to-event outcomes and high-dimensional features, we propose the SurvMaximin algorithm to estimate Cox model feature coefficients for a target population by borrowing summary information from a set of health care centers without sharing patient-level information. MATERIALS AND METHODS For each of the centers from which we want to borrow information to improve the prediction performance for the target population, a penalized Cox model is fitted to estimate feature coefficients for the center. Using estimated feature coefficients and the covariance matrix of the target population, we then obtain a SurvMaximin estimated set of feature coefficients for the target population. The target population can be an entire cohort comprised of all centers, corresponding to federated learning, or a single center, corresponding to transfer learning. RESULTS Simulation studies and a real-world international electronic health records application study, with 15 participating health care centers across three countries (France, Germany, and the U.S.), show that the proposed SurvMaximin algorithm achieves comparable or higher accuracy compared with the estimator using only the information of the target site and other existing methods. The SurvMaximin estimator is robust to variations in sample sizes and estimated feature coefficients between centers, which amounts to significantly improved estimates for target sites with fewer observations. CONCLUSIONS The SurvMaximin method is well suited for both federated and transfer learning in the high-dimensional survival analysis setting. SurvMaximin only requires a one-time summary information exchange from participating centers. Estimated regression vectors can be very heterogeneous. SurvMaximin provides robust Cox feature coefficient estimates without outcome information in the target population and is privacy-preserving.
Affiliation(s)
- Xuan Wang
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Harrison G Zhang
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Xin Xiong
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Chuan Hong
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Griffin M Weber
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Gabriel A Brat
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Clara-Lea Bonzel
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Yuan Luo
- Department of Preventive Medicine Northwestern University, Chicago, IL, USA
- Rui Duan
- Department of Biostatistics, Harvard University, Boston, MA, USA
- Nathan P Palmer
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Meghan R Hutch
- Department of Preventive Medicine Northwestern University, Chicago, IL, USA
- Riccardo Bellazzi
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
- Luca Chiovato
- Unit of Internal Medicine and Endocrinology, Istituti Clinici Scientifici Maugeri SpA SB IRCCS, Pavia, Italy
- Kelly Cho
- Population Health and Data Science, VA Boston Healthcare System, Boston, MA, USA; Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA, USA
- Arianna Dagliati
- Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Pavia, Italy
- Hossein Estiri
- Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
- Romain Griffier
- IAM unit, Bordeaux University Hospital, Bordeaux, France; INSERM Bordeaux Population Health ERIAS TEAM, ERIAS - Inserm U1219 BPH, Bordeaux, France
- David A Hanauer
- Department of Learning Health Sciences, University of Michigan, Ann Arbor, MI, USA
- Yuk-Lam Ho
- Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA, USA
- John H Holmes
- Department of Biostatistics, Epidemiology, and Informatics University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Mark S Keller
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Sehi L'Yi
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Sara Lozano-Zahonero
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Sarah E Maidlow
- Michigan Institute for Clinical and Health Research (MICHR) Informatics, University of Michigan, Ann Arbor, MI, USA
- Adeline Makoudjou
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Alberto Malovini
- Laboratory of Informatics and Systems Engineering for Clinical Research, Istituti Clinici Scientifici Maugeri SpA SB IRCCS, Pavia, Italy
- Bertrand Moal
- IAM unit, Bordeaux University Hospital, Bordeaux, France
- Jason H Moore
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, USA
- Michele Morris
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Danielle L Mowery
- Department of Biostatistics, Epidemiology, and Informatics University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Shawn N Murphy
- Department of Neurology, Massachusetts General Hospital, Boston, MA, USA
- Antoine Neuraz
- Department of biomedical informatics, Hôpital Necker-Enfants Malade, Assistance Publique Hôpitaux de Paris (APHP), University of Paris, Paris, France
- Kee Yuan Ngiam
- Department of Biomedical informatics, WiSDM, National University Health Systems, Singapore
- Gilbert S Omenn
- Depts of Computational Medicine & Bioinformatics, Internal Medicine, Human Genetics, Public Health University of Michigan, Ann Arbor, MI, USA
- Lav P Patel
- Department of Internal Medicine, Division of Medical Informatics, University Of Kansas Medical Center
- Andrea Prunotto
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Emily R Schriver
- Data Analytics Center, University of Pennsylvania Health System, Philadelphia, PA, USA
- Petra Schubert
- Massachusetts Veterans Epidemiology Research and Information Center (MAVERIC), VA Boston Healthcare System, Boston, MA, USA
- Andrew M South
- Department of Pediatrics-Section of Nephrology, Brenner Children's, Wake Forest School of Medicine, Winston Salem, NC, USA
- Amelia L M Tan
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Byorn W L Tan
- Department of Medicine, National University Hospital, Singapore
- Valentina Tibollo
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Patric Tippmann
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Shyam Visweswaran
- Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
- Zongqi Xia
- Department of Neurology, University of Pittsburgh, Pittsburgh, PA, USA
- William Yuan
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Daniela Zöller
- Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Isaac S Kohane
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Paul Avillach
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Zijian Guo
- Department of Statistics, Rutgers, The State University of New Jersey, Piscataway, NJ, USA
- Tianxi Cai
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
|
22
|
Edmondson MJ, Luo C, Nazmul Islam M, Sheils NE, Buresh J, Chen Z, Bian J, Chen Y. Distributed Quasi-Poisson regression algorithm for modeling multi-site count outcomes in distributed data networks. J Biomed Inform 2022; 131:104097. [PMID: 35643272 PMCID: PMC11874216 DOI: 10.1016/j.jbi.2022.104097] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/28/2021] [Revised: 04/20/2022] [Accepted: 05/20/2022] [Indexed: 10/18/2022]
Abstract
BACKGROUND Observational studies incorporating real-world data from multiple institutions facilitate study of rare outcomes or exposures and improve generalizability of results. Due to privacy concerns surrounding patient-level data sharing across institutions, methods for performing regression analyses distributively are desirable. Meta-analysis of institution-specific estimates is commonly used, but has been shown to produce biased estimates in certain settings. While distributed regression methods are increasingly available, methods for analyzing count outcomes are currently limited. Count data in practice are commonly subject to overdispersion, exhibiting greater variability than expected under a given statistical model. OBJECTIVE We propose a novel computational method, a one-shot distributed algorithm for quasi-Poisson regression (ODAP), to distributively model count outcomes while accounting for overdispersion. METHODS ODAP incorporates a surrogate likelihood approach to perform distributed quasi-Poisson regression without requiring patient-level data sharing, only requiring sharing of aggregate data from each participating institution. ODAP requires at most three rounds of non-iterative communication among institutions to generate coefficient estimates and corresponding standard errors. In simulations, we evaluate ODAP under several data scenarios possible in multi-site analyses, comparing ODAP and meta-analysis estimates in terms of error relative to pooled regression estimates, considered the gold standard. In a proof-of-concept real-world data analysis, we similarly compare ODAP and meta-analysis in terms of relative error to pooled estimation using data from the OneFlorida Clinical Research Consortium, modeling length of stay in COVID-19 patients as a function of various patient characteristics.
In a second proof-of-concept analysis, using the same outcome and covariates, we incorporate data from the UnitedHealth Group Clinical Discovery Database together with the OneFlorida data in a distributed analysis to compare estimates produced by ODAP and meta-analysis. RESULTS In simulations, ODAP exhibited negligible error relative to pooled regression estimates across all settings explored. Meta-analysis estimates, while largely unbiased, were increasingly variable as heterogeneity in the outcome increased across institutions. When baseline expected count was 0.2, relative error for meta-analysis was above 5% in 25% of iterations (250/1000), while the largest relative error for ODAP in any iteration was 3.59%. In our proof-of-concept analysis using only OneFlorida data, ODAP estimates were closer to pooled regression estimates than those produced by meta-analysis for all 15 covariates. In our distributed analysis incorporating data from both OneFlorida and the UnitedHealth Group Clinical Discovery Database, ODAP and meta-analysis estimates were largely similar, while some differences in estimates (as large as 13.8%) could be indicative of bias in meta-analytic estimates. CONCLUSIONS ODAP performs privacy-preserving, communication-efficient distributed quasi-Poisson regression to analyze count outcomes using data stored within multiple institutions. Our method produces estimates nearly matching pooled regression estimates and sometimes more accurate than meta-analysis estimates, most notably in settings with relatively low counts and high outcome heterogeneity across institutions.
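The abstract refers to overdispersion and to sharing only aggregate data. As a rough illustration, the standard Pearson-based dispersion estimate for a quasi-Poisson model can be pooled from per-site sums. This is a minimal sketch under stated assumptions, not the ODAP procedure itself: the function names `site_pearson_summary` and `pooled_dispersion` are hypothetical, fitted means are taken as given, and whether ODAP aggregates in exactly this way is not stated in the abstract.

```python
def site_pearson_summary(counts, fitted_means):
    """Aggregate one site's data into (Pearson chi-square sum, sample size).

    Only these two numbers would leave the site; no patient-level rows.
    """
    chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(counts, fitted_means))
    return chi2, len(counts)


def pooled_dispersion(site_summaries, n_params):
    """Pearson dispersion estimate: total chi-square / (N - number of parameters)."""
    total_chi2 = sum(chi2 for chi2, _ in site_summaries)
    total_n = sum(n for _, n in site_summaries)
    return total_chi2 / (total_n - n_params)


# Toy example with two sites and a one-parameter model (assumed values).
site_a = site_pearson_summary([0, 2, 4], [1.0, 2.0, 2.0])
site_b = site_pearson_summary([3, 1], [1.0, 1.0])
phi_hat = pooled_dispersion([site_a, site_b], n_params=1)
```

A value of `phi_hat` above 1 indicates more variability than a plain Poisson model expects, which is the situation the quasi-Poisson model is meant to handle.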
Affiliation(s)
- Mackenzie J Edmondson
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Chongliang Luo
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- John Buresh
- Optum Labs at UnitedHealth Group, Minnetonka, MN, USA
- Zhaoyi Chen
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA; Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
- Yong Chen
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA.
|
23
|
Liu X, Duan R, Luo C, Ogdie A, Moore JH, Kranzler HR, Bian J, Chen Y. Multisite learning of high-dimensional heterogeneous data with applications to opioid use disorder study of 15,000 patients across 5 clinical sites. Sci Rep 2022; 12:11073. [PMID: 35773438 PMCID: PMC9245877 DOI: 10.1038/s41598-022-14029-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2022] [Accepted: 05/31/2022] [Indexed: 11/17/2022] Open
Abstract
Integrating data across institutions can improve learning efficiency. To integrate data efficiently while protecting privacy, we propose A one-shot, summary-statistics-based, Distributed Algorithm for fitting Penalized (ADAP) regression models across multiple datasets. ADAP utilizes patient-level data from a lead site and incorporates the first-order (ADAP1) and second-order gradients (ADAP2) of the objective function from collaborating sites to construct a surrogate objective function at the lead site, where model fitting is then completed with proper regularizations applied. We evaluate the performance of the proposed method using both simulation and a real-world application to study risk factors for opioid use disorder (OUD) using 15,000 patient data from the OneFlorida Clinical Research Consortium. Our results show that ADAP performs nearly the same as the pooled estimator but achieves higher estimation accuracy and better variable selection than the local and average estimators. Moreover, ADAP2 successfully handles heterogeneity in covariate distributions.
Affiliation(s)
- Xiaokang Liu
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, 423 Guardian Drive, Philadelphia, PA, 19104, USA
- Rui Duan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Chongliang Luo
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, 423 Guardian Drive, Philadelphia, PA, 19104, USA
- Division of Public Health Sciences, Washington University School of Medicine in St. Louis, St. Louis, MO, USA
- Alexis Ogdie
- Department of Medicine, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
- Jason H Moore
- Department of Computational Biomedicine, Cedars-Sinai Medical Center, Los Angeles, CA, 90096, USA
- Henry R Kranzler
- Department of Psychiatry, University of Pennsylvania Perelman School of Medicine and the VISN 4 MIRECC, Crescenz VAMC, Philadelphia, PA, USA
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, University of Florida Health Cancer Center, Gainesville, FL, USA
- Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, 423 Guardian Drive, Philadelphia, PA, 19104, USA.
|
24
|
Distributed learning for heterogeneous clinical data with application to integrating COVID-19 data across 230 sites. NPJ Digit Med 2022; 5:76. [PMID: 35701668 PMCID: PMC9198031 DOI: 10.1038/s41746-022-00615-8] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 08/09/2021] [Accepted: 05/19/2022] [Indexed: 11/09/2022] Open
Abstract
Integrating real-world data (RWD) from several clinical sites offers great opportunities to improve estimation with a more general population compared to analyses based on a single clinical site. However, sharing patient-level data across sites is practically challenging due to concerns about maintaining patient privacy. We develop a distributed algorithm to integrate heterogeneous RWD from multiple clinical sites without sharing patient-level data. The proposed distributed conditional logistic regression (dCLR) algorithm can effectively account for between-site heterogeneity and requires only one round of communication. Our simulation study and data application with the data of 14,215 COVID-19 patients from 230 clinical sites in the UnitedHealth Group Clinical Research Database demonstrate that the proposed distributed algorithm provides an estimator that is robust to heterogeneity in event rates when efficiently integrating data from multiple clinical sites. Our algorithm is therefore a practical alternative to both meta-analysis and existing distributed algorithms for modeling heterogeneous multi-site binary outcomes.
|
25
|
Luo C, Duan R, Naj AC, Kranzler HR, Bian J, Chen Y. ODACH: a one-shot distributed algorithm for Cox model with heterogeneous multi-center data. Sci Rep 2022; 12:6627. [PMID: 35459767 PMCID: PMC9033863 DOI: 10.1038/s41598-022-09069-0] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 07/17/2021] [Accepted: 02/28/2022] [Indexed: 11/08/2022] Open
Abstract
We developed a One-shot Distributed Algorithm for Cox proportional-hazards model to analyze Heterogeneous multi-center time-to-event data (ODACH) circumventing the need for sharing patient-level information across sites. This algorithm implements a surrogate likelihood function to approximate the Cox log-partial likelihood function that is stratified by site using patient-level data from a lead site and aggregated information from other sites, allowing the baseline hazard functions and the distribution of covariates to vary across sites. Simulation studies and application to a real-world opioid use disorder study showed that ODACH provides estimates close to the pooled estimator, which analyzes patient-level data directly from all sites via a stratified Cox model. Compared to the estimator from meta-analysis, the inverse variance-weighted average of the site-specific estimates, ODACH estimator demonstrates less susceptibility to bias, especially when the event is rare. ODACH is thus a valuable privacy-preserving and communication-efficient method for analyzing multi-center time-to-event data.
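The meta-analysis comparator named in this abstract, the inverse variance-weighted average of site-specific estimates, can be sketched as below. This is a minimal illustration and not code from the paper: the function name and the three example log hazard ratios with their standard errors are hypothetical, and the sketch covers only the fixed-effect comparator, not the ODACH surrogate likelihood itself.

```python
import math


def inverse_variance_pool(estimates, std_errors):
    """Fixed-effect meta-analysis: weight each site estimate by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical site-level log hazard ratios and their standard errors.
site_log_hr = [0.40, 0.10, 0.25]
site_se = [0.10, 0.20, 0.40]

pooled, pooled_se = inverse_variance_pool(site_log_hr, site_se)
print(f"pooled log HR = {pooled:.4f}, SE = {pooled_se:.4f}")
```

Sites with small standard errors dominate the average. When events are rare, each site's estimate and its normal-approximation standard error can be unreliable, which is the setting where the abstract reports that meta-analysis is more susceptible to bias.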
Affiliation(s)
- Chongliang Luo
- Division of Public Health Sciences, Washington University School of Medicine in St. Louis, St. Louis, MO, USA
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
| | - Rui Duan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Adam C Naj
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Henry R Kranzler
- Department of Psychiatry, Perelman School of Medicine, University of Pennsylvania and the VISN 4 MIRECC, Crescenz VAMC, Philadelphia, PA, USA
| | - Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA.
| |
|
26
|
Clegg A, Bandeen-Roche K, Farrin A, Forster A, Gill TM, Gladman J, Kerse N, Lindley R, McManus RJ, Melis R, Mujica-Mota R, Raina P, Rockwood K, Teh R, van der Windt D, Witham M. New horizons in evidence-based care for older people: individual participant data meta-analysis. Age Ageing 2022; 51:afac090. [PMID: 35460409 PMCID: PMC9034697 DOI: 10.1093/ageing/afac090] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/29/2021] [Revised: 02/10/2022] [Indexed: 11/13/2022] Open
Abstract
Evidence-based decisions on clinical and cost-effectiveness of interventions are ideally informed by meta-analyses of intervention trial data. However, when undertaken, such meta-analyses in ageing research have typically been conducted using standard methods whereby summary (aggregate) data are extracted from published trial reports. Although meta-analysis of aggregate data can provide useful insights into the average effect of interventions within a selected trial population, it has limitations regarding robust conclusions on which subgroups of people stand to gain the greatest benefit from an intervention or are at risk of experiencing harm. Future evidence synthesis using individual participant data from ageing research trials for meta-analysis could transform understanding of the effectiveness of interventions for older people, supporting evidence-based and sustainable commissioning. A major advantage of individual participant data meta-analysis (IPDMA) is that it enables examination of characteristics that predict treatment effects, such as frailty, disability, cognitive impairment, ethnicity, gender and other wider determinants of health. Key challenges of IPDMA relate to the complexity and resources needed for obtaining, managing and preparing datasets, requiring a meticulous approach involving experienced researchers, frequently with expertise in designing and analysing clinical trials. In anticipation of future IPDMA work in ageing research, we are establishing an international Ageing Research Trialists collective, to bring together trialists with a common focus on transforming care for older people as a shared ambition across nations.
Affiliation(s)
- Andrew Clegg
- Academic Unit for Ageing & Stroke Research, University of Leeds, Bradford Teaching Hospitals NHS Foundation Trust, Bradford, UK
| | - Karen Bandeen-Roche
- Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD, USA
| | - Amanda Farrin
- Leeds Institute for Clinical Trials Research, University of Leeds, Leeds, UK
| | - Anne Forster
- Academic Unit for Ageing & Stroke Research, University of Leeds, Bradford Teaching Hospitals NHS Foundation Trust, Bradford, UK
| | - Thomas M Gill
- Yale School of Medicine, Yale University, New Haven, CT, USA
| | | | - Ngaire Kerse
- Department of General Practice and Primary Health Care, University of Auckland School of Population Health, Auckland, New Zealand
| | - Richard Lindley
- Sydney Medical School, University of Sydney, Sydney, Australia
| | - Richard J McManus
- Nuffield Department of Primary Care Health Sciences, University of Oxford, Oxford, UK
| | | | - Ruben Mujica-Mota
- Academic Unit of Health Economics, Leeds Institute of Health Sciences, University of Leeds, Leeds, UK
| | - Parminder Raina
- Department of Health Evidence and Impact & McMaster Institute for Research on Aging, Faculty of Health Sciences, McMaster University, Hamilton, Canada
| | - Kenneth Rockwood
- Division of Geriatric Medicine, Dalhousie University, Halifax, Canada
| | - Ruth Teh
- Department of General Practice and Primary Health Care, University of Auckland School of Population Health, Auckland, New Zealand
| | | | - Miles Witham
- AGE Research Group, Newcastle University, Newcastle, UK
| |
|
27
|
Luo C, Islam MN, Sheils NE, Buresh J, Reps J, Schuemie MJ, Ryan PB, Edmondson M, Duan R, Tong J, Marks-Anglin A, Bian J, Chen Z, Duarte-Salles T, Fernández-Bertolín S, Falconer T, Kim C, Park RW, Pfohl SR, Shah NH, Williams AE, Xu H, Zhou Y, Lautenbach E, Doshi JA, Werner RM, Asch DA, Chen Y. DLMM as a lossless one-shot algorithm for collaborative multi-site distributed linear mixed models. Nat Commun 2022; 13:1678. [PMID: 35354802 PMCID: PMC8967932 DOI: 10.1038/s41467-022-29160-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 05/18/2021] [Accepted: 03/03/2022] [Indexed: 12/21/2022] Open
Abstract
Linear mixed models are commonly used in healthcare-based association analyses for analyzing multi-site data with heterogeneous site-specific random effects. Due to regulations for protecting patients' privacy, sensitive individual patient data (IPD) typically cannot be shared across sites. We propose an algorithm for fitting distributed linear mixed models (DLMMs) without sharing IPD across sites. This algorithm achieves results identical to those achieved using pooled IPD from multiple sites (i.e., the same effect size and standard error estimates), hence demonstrating the lossless property. The algorithm requires each site to contribute minimal aggregated data in only one round of communication. We demonstrate the lossless property of the proposed DLMM algorithm by investigating the associations between demographic and clinical characteristics and length of hospital stay in COVID-19 patients using administrative claims from the UnitedHealth Group Clinical Discovery Database. We extend this association study by incorporating 120,609 COVID-19 patients from 11 collaborative data sources worldwide.
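The "lossless" idea in this abstract, that aggregated data shared in one round can reproduce the pooled fit exactly, can be illustrated with a much simpler model. The sketch below uses ordinary least squares with a single covariate, not the linear mixed model of the paper; all function names and the toy data are hypothetical. Each site shares only five sums, and the fit from those sums matches the fit on the pooled rows.

```python
def site_summary(x, y):
    """Aggregates a site would share; no patient-level rows leave the site."""
    return {
        "n": len(x),
        "sx": sum(x),
        "sy": sum(y),
        "sxx": sum(v * v for v in x),
        "sxy": sum(a * b for a, b in zip(x, y)),
    }


def fit_from_summaries(summaries):
    """Solve the normal equations for y = a + b*x from summed aggregates."""
    n = sum(s["n"] for s in summaries)
    sx = sum(s["sx"] for s in summaries)
    sy = sum(s["sy"] for s in summaries)
    sxx = sum(s["sxx"] for s in summaries)
    sxy = sum(s["sxy"] for s in summaries)
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return intercept, slope


# Hypothetical data held at two sites.
sites = [
    ([1.0, 2.0, 3.0], [2.1, 3.9, 6.2]),
    ([4.0, 5.0], [8.1, 9.8]),
]

# Distributed fit: one round of communication, aggregates only.
distributed = fit_from_summaries([site_summary(x, y) for x, y in sites])

# Reference fit on pooled rows, for comparison only.
pooled_x = [v for x, _ in sites for v in x]
pooled_y = [v for _, y in sites for v in y]
pooled = fit_from_summaries([site_summary(pooled_x, pooled_y)])

print(distributed, pooled)
```

The two fits agree up to floating-point rounding because the least-squares solution depends on the data only through these sums. Extending this to site-specific random effects, as the paper does, requires additional aggregates that are not shown here.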
Affiliation(s)
- Chongliang Luo
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Division of Public Health Sciences, Washington University School of Medicine in St. Louis, St. Louis, MO, USA
| | | | | | | | - Jenna Reps
- Janssen Research and Development LLC, Titusville, NJ, USA
| | | | - Patrick B Ryan
- Janssen Research and Development LLC, Titusville, NJ, USA
| | - Mackenzie Edmondson
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
| | - Rui Duan
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Jiayi Tong
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
| | - Arielle Marks-Anglin
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
| | - Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Zhaoyi Chen
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Talita Duarte-Salles
- Fundacio Institut Universitari per a la recerca a l'Atencio Primaria de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain
| | - Sergio Fernández-Bertolín
- Fundacio Institut Universitari per a la recerca a l'Atencio Primaria de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain
| | - Thomas Falconer
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Chungsoo Kim
- Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
| | - Rae Woong Park
- Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
- Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
| | - Stephen R Pfohl
- Stanford Center for Biomedical Informatics Research, Stanford, CA, USA
| | - Nigam H Shah
- Stanford Center for Biomedical Informatics Research, Stanford, CA, USA
| | - Andrew E Williams
- Institute for Clinical Research and Health Policy Studies, Tufts University School of Medicine, Boston, MA, USA
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Yujia Zhou
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
| | - Ebbing Lautenbach
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Division of Infectious Diseases, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Center for Clinical Epidemiology and Biostatistics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Jalpa A Doshi
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Leonard Davis Institute of Health Economics, Philadelphia, PA, USA
| | - Rachel M Werner
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Leonard Davis Institute of Health Economics, Philadelphia, PA, USA
- Cpl Michael J Crescenz VA Medical Center, Philadelphia, PA, USA
| | - David A Asch
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Leonard Davis Institute of Health Economics, Philadelphia, PA, USA
| | - Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA.
| |
|
28
|
Chen AA, Luo C, Chen Y, Shinohara RT, Shou H, Alzheimer’s Disease Neuroimaging Initiative. Privacy-preserving harmonization via distributed ComBat. Neuroimage 2022; 248:118822. [PMID: 34958950 PMCID: PMC9802006 DOI: 10.1016/j.neuroimage.2021.118822] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 09/30/2021] [Accepted: 12/14/2021] [Indexed: 01/03/2023] Open
Abstract
Challenges in clinical data sharing and the need to protect data privacy have led to the development and popularization of methods that do not require directly transferring patient data. In neuroimaging, integration of data across multiple institutions also introduces unwanted biases driven by scanner differences. These scanner effects have been shown by several research groups to severely affect downstream analyses. To facilitate the need of removing scanner effects in a distributed data setting, we introduce distributed ComBat, an adaptation of a popular harmonization method for multivariate data that borrows information across features. We present our fast and simple distributed algorithm and show that it yields equivalent results using data from the Alzheimer's Disease Neuroimaging Initiative. Our method enables harmonization while ensuring maximal privacy protection, thus facilitating a broad range of downstream analyses in functional and structural imaging studies.
Affiliation(s)
- Andrew A. Chen
- Penn Statistics in Imaging and Visualization Center, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States; Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, PA 19104, United States (corresponding author)
| | - Chongliang Luo
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States
| | - Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States
| | - Russell T. Shinohara
- Penn Statistics in Imaging and Visualization Center, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States; Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, PA 19104, United States
| | - Haochang Shou
- Penn Statistics in Imaging and Visualization Center, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, United States; Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, PA 19104, United States
| | | |
|
29
|
Luo C, Islam MN, Sheils NE, Buresh J, Schuemie MJ, Doshi JA, Werner RM, Asch DA, Chen Y. dPQL: a lossless distributed algorithm for generalized linear mixed model with application to privacy-preserving hospital profiling. J Am Med Inform Assoc 2022; 29:1366-1371. [PMID: 35579348 PMCID: PMC9277633 DOI: 10.1093/jamia/ocac067] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/12/2022] [Revised: 04/19/2022] [Accepted: 05/10/2022] [Indexed: 11/14/2022] Open
Abstract
Objective To develop a lossless distributed algorithm for generalized linear mixed model (GLMM) with application to privacy-preserving hospital profiling. Materials and Methods The GLMM is often fitted to implement hospital profiling, using clinical or administrative claims data. Due to individual patient data (IPD) privacy regulations and the computational complexity of GLMM, a distributed algorithm for hospital profiling is needed. We develop a novel distributed penalized quasi-likelihood (dPQL) algorithm to fit GLMM when only aggregated data, rather than IPD, can be shared across hospitals. We also show that the standardized mortality rates, which are often reported as the results of hospital profiling, can also be calculated distributively without sharing IPD. We demonstrate the applicability of the proposed dPQL algorithm by ranking 929 hospitals for coronavirus disease 2019 (COVID-19) mortality or referral to hospice that have been previously studied. Results The proposed dPQL algorithm is mathematically proven to be lossless, that is, it obtains identical results as if IPD were pooled from all hospitals. In the example of hospital profiling regarding COVID-19 mortality, the dPQL algorithm reached convergence with only 5 iterations, and the estimation of fixed effects, random effects, and mortality rates were identical to that of the PQL from pooled data. Conclusion The dPQL algorithm is lossless, privacy-preserving and fast-converging for fitting GLMM. It provides an extremely suitable and convenient distributed approach for hospital profiling.
Affiliation(s)
- Chongliang Luo
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Division of Public Health Sciences, Washington University School of Medicine in St. Louis, St Louis, Missouri, USA
| | | | | | | | | | - Jalpa A Doshi
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Leonard Davis Institute of Health Economics, Philadelphia, Pennsylvania, USA
| | - Rachel M Werner
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Leonard Davis Institute of Health Economics, Philadelphia, Pennsylvania, USA
- Cpl Michael J Crescenz VA Medical Center, Philadelphia, Pennsylvania, USA
| | - David A Asch
- Division of General Internal Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Leonard Davis Institute of Health Economics, Philadelphia, Pennsylvania, USA
| | - Yong Chen
- Corresponding Author: Yong Chen, PhD, Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, 602 Blockley Hall, 423 Guardian Drive, Philadelphia, PA 19104, USA
| |
|
30
|
Schuemie MJ, Chen Y, Madigan D, Suchard MA. Combining Cox regressions across a heterogeneous distributed research network facing small and zero counts. Stat Methods Med Res 2021; 31:438-450. [PMID: 34841975 DOI: 10.1177/09622802211060518] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/16/2022]
Abstract
Studies of the effects of medical interventions increasingly take place in distributed research settings using data from multiple clinical data sources including electronic health records and administrative claims. In such settings, privacy concerns typically prohibit sharing of individual patient data, and instead, cross-network analyses can only utilize summary statistics from the individual databases such as hazard ratios and standard errors. In the specific but very common context of the Cox proportional hazards model, we show that combining such per site summary statistics into a single network-wide estimate using standard meta-analysis methods leads to substantial bias when outcome counts are small. This bias derives primarily from the normal approximations of the per site likelihood that the methods utilized. Here we propose and evaluate methods that eschew normal approximations in favor of three more flexible approximations: a skew-normal, a one-dimensional grid, and a custom parametric function that mimics the behavior of the Cox likelihood function. In extensive simulation studies, we demonstrate how these approximations impact bias in the context of both fixed-effects and (Bayesian) random-effects models. We then apply these approaches to three real-world studies of the comparative safety of antidepressants, each using data from four observational health care databases.
Affiliation(s)
- Martijn J Schuemie
- Observational Health Data Sciences and Informatics, New York, NY, USA; Janssen Research & Development, Titusville, NJ, USA; Department of Biostatistics, University of California, Los Angeles, CA, USA
| | - Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, USA
| | - David Madigan
- Observational Health Data Sciences and Informatics, New York, NY, USA; Khoury College of Computer Sciences, Northeastern University, Boston, MA, USA
| | - Marc A Suchard
- Observational Health Data Sciences and Informatics, New York, NY, USA; Department of Biostatistics, University of California, Los Angeles, CA, USA; Department of Human Genetics, University of California, Los Angeles, CA, USA
| |
|
31
|
Dolezel D, McLeod A, Fulton L. Examining Predictors of Myocardial Infarction. Int J Environ Res Public Health 2021; 18:11284. [PMID: 34769805 PMCID: PMC8583114 DOI: 10.3390/ijerph182111284] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/13/2021] [Revised: 10/18/2021] [Accepted: 10/21/2021] [Indexed: 02/05/2023]
Abstract
Cardiovascular diseases are the leading cause of death in the United States. This study analyzed predictors of myocardial infarction (MI) for those aged 35 and older based on demographic, socioeconomic, geographic, behavioral, and risk factors, as well as access to healthcare variables using the Centers for Disease Control and Prevention (CDC) Behavioral Risk Factor Surveillance System (BRFSS) survey for the year 2019. Multiple quasibinomial models were generated on an 80% training set hierarchically and then used to forecast the 20% test set. The final training model proved somewhat capable of prediction with a weighted F1-Score = 0.898. A complete model based on statistically significant variables using the entirety of the dataset was compared to the same model built on the training set. Models demonstrated coefficient stability. Similar to previous studies, age, gender, marital status, veteran status, income, home ownership, employment status, and education level were important demographic and socioeconomic predictors. The only geographic variable that remained in the model was associated with the West North Central Census Division (increased risk). Statistically important behavioral and risk factors as well as comorbidities included health status, smoking, alcohol consumption frequency, cholesterol, blood pressure, diabetes, stroke, chronic obstructive pulmonary disorder (COPD), kidney disease, and arthritis. Three access to healthcare variables proved statistically significant: lack of a primary care provider (Odds Ratio, OR = 0.853, p < 0.001), cost considerations prevented some care (OR = 1.232, p < 0.001), and lack of an annual checkup (OR = 0.807, p < 0.001). The directionality of these odds ratios is congruent with a marginal effects model and implies that those without MI are more likely not to have a primary provider or annual checkup, but those with MI are more likely to have missed care due to the cost of that care. Cost of healthcare for MI patients is associated with not receiving care after accounting for all other variables.
Affiliation(s)
- Diane Dolezel
- Health Information Management Department, Texas State University, San Marcos, TX 78666, USA
| | - Alexander McLeod
- Computer Information Systems & Quantitative Methods Department, Texas State University, San Marcos, TX 78666, USA
| | - Larry Fulton
- School of Health Administration, Texas State University, San Marcos, TX 78666, USA
| |
|
32
|
Hogan WR, Shenkman EA, Robinson T, Carasquillo O, Robinson PS, Essner RZ, Bian J, Lipori G, Harle C, Magoc T, Manini L, Mendoza T, White S, Loiacono A, Hall J, Nelson D. The OneFlorida Data Trust: a centralized, translational research data infrastructure of statewide scope. J Am Med Inform Assoc 2021; 29:686-693. [PMID: 34664656 PMCID: PMC8922180 DOI: 10.1093/jamia/ocab221] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 07/15/2021] [Revised: 09/03/2021] [Accepted: 09/29/2021] [Indexed: 01/22/2023] Open
Abstract
The OneFlorida Data Trust is a centralized research patient data repository created and managed by the OneFlorida Clinical Research Consortium ("OneFlorida"). It comprises structured electronic health record (EHR), administrative claims, tumor registry, death, and other data on 17.2 million individuals who received healthcare in Florida between January 2012 and the present. Ten healthcare systems in Miami, Orlando, Tampa, Jacksonville, Tallahassee, Gainesville, and rural areas of Florida contribute EHR data, covering the major metropolitan regions in Florida. Deduplication of patients is accomplished via privacy-preserving entity resolution (precision 0.97-0.99, recall 0.75), thereby linking patients' EHR, claims, and death data. Another unique feature is the establishment of mother-baby relationships via Florida vital statistics data. Research usage has been significant, including major studies launched in the National Patient-Centered Clinical Research Network ("PCORnet"), where OneFlorida is 1 of 9 clinical research networks. The Data Trust's robust, centralized, statewide data are a valuable and relatively unique research resource.
Affiliation(s)
- William R Hogan
- Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, Florida, USA. Corresponding Author: William R. Hogan, MD, MS, FACMI, Clinical & Translational Research Building, 2004 Mowry Road, PO Box 100219, Gainesville, FL 32610, USA
| | - Elizabeth A Shenkman
- Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, Florida, USA
| | | | | | | | | | - Jiang Bian
- Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, Florida, USA
| | | | - Christopher Harle
- Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, Florida, USA; UF Health, Gainesville, Florida, USA
| | | | - Lizabeth Manini
- Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, Florida, USA
| | - Tona Mendoza
- Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, Florida, USA
| | - Sonya White
- Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, Florida, USA
| | - Alex Loiacono
- Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, Florida, USA
| | - Jackie Hall
- Department of Health Outcomes & Biomedical Informatics, University of Florida, Gainesville, Florida, USA
| | | |
|
33
|
An efficient and accurate distributed learning algorithm for modeling multi-site zero-inflated count outcomes. Sci Rep 2021; 11:19647. [PMID: 34608222 PMCID: PMC8490431 DOI: 10.1038/s41598-021-99078-2] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 04/11/2021] [Accepted: 09/13/2021] [Indexed: 12/03/2022] Open
Abstract
Clinical research networks (CRNs), made up of multiple healthcare systems each with patient data from several care sites, are beneficial for studying rare outcomes and increasing generalizability of results. While CRNs encourage sharing aggregate data across healthcare systems, individual systems within CRNs often cannot share patient-level data due to privacy regulations, prohibiting multi-site regression which requires an analyst to access all individual patient data pooled together. Meta-analysis is commonly used to model data stored at multiple institutions within a CRN but can result in biased estimation, most notably in rare-event contexts. We present a communication-efficient, privacy-preserving algorithm for modeling multi-site zero-inflated count outcomes within a CRN. Our method, a one-shot distributed algorithm for performing hurdle regression (ODAH), models zero-inflated count data stored in multiple sites without sharing patient-level data across sites, resulting in estimates closely approximating those that would be obtained in a pooled patient-level data analysis. We evaluate our method through extensive simulations and two real-world data applications using electronic health records: examining risk factors associated with pediatric avoidable hospitalization and modeling serious adverse event frequency associated with a colorectal cancer therapy. In simulations, ODAH produced bias less than 0.1% across all settings explored while meta-analysis estimates exhibited bias up to 12.7%, with meta-analysis performing worst in settings with high zero-inflation or low event rates. Across both applied analyses, ODAH estimates had less than 10% bias for 18 of 20 coefficients estimated, while meta-analysis estimates exhibited substantially higher bias. Relative to existing methods for distributed data analysis, ODAH offers a highly accurate, computationally efficient method for modeling multi-site zero-inflated count data.
|
34
|
Holmes JH, Beinlich J, Boland MR, Bowles KH, Chen Y, Cook TS, Demiris G, Draugelis M, Fluharty L, Gabriel PE, Grundmeier R, Hanson CW, Herman DS, Himes BE, Hubbard RA, Kahn CE, Kim D, Koppel R, Long Q, Mirkovic N, Morris JS, Mowery DL, Ritchie MD, Urbanowicz R, Moore JH. Why Is the Electronic Health Record So Challenging for Research and Clinical Care? Methods Inf Med 2021; 60:32-48. [PMID: 34282602 DOI: 10.1055/s-0041-1731784] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Indexed: 10/20/2022]
Abstract
BACKGROUND The electronic health record (EHR) has become increasingly ubiquitous. At the same time, health professionals have been turning to this resource for access to data that is needed for the delivery of health care and for clinical research. There is little doubt that the EHR has made both of these functions easier than earlier days when we relied on paper-based clinical records. Coupled with modern database and data warehouse systems, high-speed networks, and the ability to share clinical data with others are a large number of challenges that arguably limit the optimal use of the EHR. OBJECTIVES Our goal was to provide an exhaustive reference for those who use the EHR in clinical and research contexts, but also for health information systems professionals as they design, implement, and maintain EHR systems. METHODS This study includes a panel of 24 biomedical informatics researchers, information technology professionals, and clinicians, all of whom have extensive experience in design, implementation, and maintenance of EHR systems, or in using the EHR as clinicians or researchers. All members of the panel are affiliated with Penn Medicine at the University of Pennsylvania and have experience with a variety of different EHR platforms and systems and how they have evolved over time. RESULTS Each of the authors has shared their knowledge and experience in using the EHR in a suite of 20 short essays, each representing a specific challenge and classified according to a functional hierarchy of interlocking facets such as usability and usefulness, data quality, standards, governance, data integration, clinical care, and clinical research. CONCLUSION We provide here a set of perspectives on the challenges posed by the EHR to clinical and research users.
Affiliation(s)
- John H Holmes
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- James Beinlich
- Information Technology Entity Services and Corporate Information Services, University of Pennsylvania Health System, Philadelphia, Pennsylvania, United States
- Mary R Boland
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Kathryn H Bowles
- Department of Biobehavioral Health Sciences, University of Pennsylvania School of Nursing, Philadelphia, Pennsylvania, United States
- Yong Chen
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Tessa S Cook
- Department of Radiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- George Demiris
- Department of Biobehavioral Health Sciences, University of Pennsylvania School of Nursing, Philadelphia, Pennsylvania, United States
- Michael Draugelis
- Department of Predictive Health Care, University of Pennsylvania Health System, Philadelphia, Pennsylvania, United States
- Laura Fluharty
- Clinical Research Operations, University of Pennsylvania Health System, Philadelphia, Pennsylvania, United States
- Peter E Gabriel
- Department of Radiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Robert Grundmeier
- Department of Pediatrics, The Children's Hospital of Philadelphia, Philadelphia, Pennsylvania, United States
- C William Hanson
- Department of Anesthesiology and Critical Care, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Daniel S Herman
- Department of Pathology and Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Blanca E Himes
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Rebecca A Hubbard
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Charles E Kahn
- Department of Radiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Dokyoon Kim
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Ross Koppel
- Department of Sociology, University of Pennsylvania, Philadelphia, Pennsylvania, United States
- Qi Long
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Nebojsa Mirkovic
- Department of Research Analytics, University of Pennsylvania Health System, Philadelphia, Pennsylvania, United States
- Jeffrey S Morris
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Danielle L Mowery
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Marylyn D Ritchie
- Department of Genetics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Ryan Urbanowicz
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
- Jason H Moore
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania, United States
|
35
|
Liu Z, Meng R, Yang Y, Li K, Yin Z, Ren J, Shen C, Feng Z, Zhan S. Active Vaccine Safety Surveillance: Global Trends and Challenges in China. Health Data Science 2021; 2021:9851067. [PMID: 38487501 PMCID: PMC10880162 DOI: 10.34133/2021/9851067] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/22/2020] [Accepted: 05/03/2021] [Indexed: 03/17/2024]
Abstract
Importance. The great success in vaccine-preventable diseases has been accompanied by vaccine safety concerns. This has made vaccine hesitancy one of the top 10 threats to global health. The comprehensive understanding of adverse events following immunization should be entirely based on clinical trials and postapproval surveillance. It has increasingly been recognized worldwide that the active surveillance of vaccine safety should be an essential part of immunization programs due to its complementary advantages to passive surveillance and clinical trials. Highlights. In the present study, the framework of vaccine safety surveillance was summarized to illustrate the importance of active surveillance and address vaccine hesitancy or safety concerns. Then, the global progress of active surveillance systems was reviewed, mainly focusing on population-based or hospital-based active surveillance. With these successful paradigms, the practical and reliable ways to create robust and similar systems in China were discussed and presented from the perspective of available databases, methodology challenges, policy supports, and ethical considerations. Conclusion. In the inevitable trend of the global vaccine safety ecosystem, the establishment of an active surveillance system for vaccine safety in China is urgent and feasible. This process can be accelerated with the consensus and cooperation of regulatory departments, research institutions, and data owners.
Affiliation(s)
- Zhike Liu
- Department of Epidemiology and Biostatistics, Peking University Health Science Center, Beijing, China
- Ruogu Meng
- National Institute of Health Data Science, Peking University, Beijing, China
- Yu Yang
- National Institute of Health Data Science, Peking University, Beijing, China
- Keli Li
- National Immunization Programme, Chinese Center for Disease Control and Prevention, Beijing, China
- Zundong Yin
- National Immunization Programme, Chinese Center for Disease Control and Prevention, Beijing, China
- Jingtian Ren
- Center for Drug Reevaluation, National Medical Products Administration, Beijing, China
- Chuanyong Shen
- Center for Drug Reevaluation, National Medical Products Administration, Beijing, China
- Zijian Feng
- National Immunization Programme, Chinese Center for Disease Control and Prevention, Beijing, China
- Siyan Zhan
- Department of Epidemiology and Biostatistics, Peking University Health Science Center, Beijing, China
|
36
|
Park JA, Sung MD, Kim HH, Park YR. Weight-Based Framework for Predictive Modeling of Multiple Databases With Noniterative Communication Without Data Sharing: Privacy-Protecting Analytic Method for Multi-Institutional Studies. JMIR Med Inform 2021; 9:e21043. [PMID: 33818396 PMCID: PMC8056295 DOI: 10.2196/21043] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/07/2020] [Revised: 11/16/2020] [Accepted: 03/03/2021] [Indexed: 01/22/2023] Open
Abstract
Background Securing the representativeness of study populations is crucial in biomedical research to ensure high generalizability. In this regard, using multi-institutional data has advantages in medicine. However, combining data physically is difficult as the confidential nature of biomedical data causes privacy issues. Therefore, a methodological approach is necessary when using multi-institution medical data for research to develop a model without sharing data between institutions. Objective This study aims to develop a weight-based integrated predictive model of multi-institutional data, which does not require iterative communication between institutions, to improve average predictive performance by increasing the generalizability of the model under privacy-preserving conditions without sharing patient-level data. Methods The weight-based integrated model generates a weight for each institutional model and builds an integrated model for multi-institutional data based on these weights. We performed 3 simulations to show the weight characteristics and to determine the number of repetitions of the weight required to obtain stable values. We also conducted an experiment using real multi-institutional data to verify the developed weight-based integrated model. We selected 10 hospitals (2845 intensive care unit [ICU] stays in total) from the electronic intensive care unit Collaborative Research Database to predict ICU mortality with 11 features. To evaluate the validity of our model, compared with a centralized model, which was developed by combining all the data of 10 hospitals, we used proportional overlap (ie, 0.5 or less indicates a significant difference at a level of .05; and 2 indicates 2 CIs overlapping completely). Standard and Firth logistic regression models were applied for the 2 simulations and the experiment.
Results The results of these simulations indicate that the weight of each institution is determined by 2 factors (ie, the data size of each institution and how well each institutional model fits into the overall institutional data) and that repeatedly generating 200 weights is necessary per institution. In the experiment, the estimated area under the receiver operating characteristic curve (AUC) and 95% CIs were 81.36% (79.37%-83.36%) and 81.95% (80.03%-83.87%) in the centralized model and weight-based integrated model, respectively. The proportional overlap of the CIs for AUC in both the weight-based integrated model and the centralized model was approximately 1.70, and that of overlap of the 11 estimated odds ratios was over 1, except for 1 case. Conclusions In the experiment where real multi-institutional data were used, our model showed similar results to the centralized model without iterative communication between institutions. In addition, our weight-based integrated model provided a weighted average model by integrating 10 models overfitted or underfitted, compared with the centralized model. The proposed weight-based integrated model is expected to provide an efficient distributed research approach as it increases the generalizability of the model and does not require iterative communication.
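The idea of a "weighted average model" can be illustrated with a minimal sketch. The abstract does not give the weight formula, so the weights here are simply inputs, and averaging logistic regression coefficients is an assumption made for illustration; the function name `integrate_site_models` is hypothetical and is not taken from the paper.

```python
import numpy as np

def integrate_site_models(site_coefs, site_weights):
    """Combine per-site coefficient vectors into one weighted-average vector.

    Hypothetical helper: the weights are supplied by the caller and are
    normalized to sum to 1. How the paper derives its weights is not shown.
    """
    coefs = np.asarray(site_coefs, dtype=float)      # shape (n_sites, n_features)
    weights = np.asarray(site_weights, dtype=float)  # shape (n_sites,)
    if coefs.ndim != 2 or weights.shape != (coefs.shape[0],):
        raise ValueError("need one weight per site and a 2-D coefficient array")
    if np.any(weights < 0) or weights.sum() <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    weights = weights / weights.sum()
    return weights @ coefs

def predict_proba(coef, X):
    """Logistic prediction from a coefficient vector (no separate intercept)."""
    return 1.0 / (1.0 + np.exp(-(np.asarray(X, dtype=float) @ coef)))

# Two sites, with weights chosen here in proportion to assumed sample sizes.
integrated = integrate_site_models([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

Only the fitted coefficients and one scalar weight per site are exchanged in this sketch, which is what makes a single, noniterative round of communication possible.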
Affiliation(s)
- Ji Ae Park
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
- Min Dong Sung
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
- Ho Heon Kim
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
- Yu Rang Park
- Department of Biomedical System Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
|
37
|
Duan R, Ning Y, Chen Y. Heterogeneity-aware and communication-efficient distributed statistical inference. Biometrika 2021. [DOI: 10.1093/biomet/asab007] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Indexed: 11/13/2022] Open
Abstract
Summary
In multicentre research, individual-level data are often protected against sharing across sites. To overcome the barrier of data sharing, many distributed algorithms, which only require sharing aggregated information, have been developed. The existing distributed algorithms usually assume the data are homogeneously distributed across sites. This assumption ignores the important fact that the data collected at different sites may come from various subpopulations and environments, which can lead to heterogeneity in the distribution of the data. Ignoring the heterogeneity may lead to erroneous statistical inference. We propose distributed algorithms which account for the heterogeneous distributions by allowing site-specific nuisance parameters. The proposed methods extend the surrogate likelihood approach (Wang et al. 2017; Jordan et al. 2018) to the heterogeneous setting by applying a novel density ratio tilting method to the efficient score function. The proposed algorithms maintain the same communication cost as existing communication-efficient algorithms. We establish a nonasymptotic risk bound for the proposed distributed estimator and its limiting distribution in the two-index asymptotic setting, which allows both sample size per site and the number of sites to go to infinity. In addition, we show that the asymptotic variance of the estimator attains the Cramér–Rao lower bound when the number of sites is smaller in rate than the sample size at each site. Finally, we use simulation studies and a real data application to demonstrate the validity and feasibility of the proposed methods.
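For orientation, the surrogate likelihood that this work extends can be written in its standard homogeneous form. The notation below is an assumption introduced here, not the paper's: $K$ sites of equal size $n$, $L_j$ the average log-likelihood at site $j$, and $\bar\theta$ an initial estimate. The density ratio tilting and site-specific nuisance parameters of the proposed method are not shown.

```latex
% Average log-likelihood at site j and across all K sites
L_j(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log f(x_{ij};\theta),
\qquad
L(\theta) = \frac{1}{K}\sum_{j=1}^{K} L_j(\theta).

% Surrogate log-likelihood built at site 1 from one round of gradient sharing
\tilde{L}(\theta) = L_1(\theta)
  + \bigl\langle \nabla L(\bar\theta) - \nabla L_1(\bar\theta),\, \theta \bigr\rangle .

% By construction the surrogate matches the global gradient at \bar\theta
\nabla \tilde{L}(\bar\theta) = \nabla L(\bar\theta).
```

Each site transmits only the gradient $\nabla L_j(\bar\theta)$, a vector whose length equals the number of parameters, which is why the communication cost does not grow with the sample size.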
|
38
|
Duan R, Chen Z, Tong J, Luo C, Lyu T, Tao C, Maraganore D, Bian J, Chen Y. Leverage Real-world Longitudinal Data in Large Clinical Research Networks for Alzheimer's Disease and Related Dementia (ADRD). AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2021; 2020:393-401. [PMID: 33936412 PMCID: PMC8075520] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
With vast amounts of patients' medical information, electronic health records (EHRs) are becoming one of the most important data sources in biomedical and health care research. Effectively integrating data from multiple clinical sites can help provide more generalized real-world evidence that is clinically meaningful. To analyze the clinical data from multiple sites, distributed algorithms have been developed to protect patient privacy without sharing individual-level medical information. In this paper, we applied the One-shot Distributed Algorithm for Cox proportional hazard model (ODAC) to the longitudinal data from the OneFlorida Clinical Research Consortium to demonstrate the feasibility of implementing the distributed algorithms in large research networks. We studied the associations between the clinical risk factors and Alzheimer's disease and related dementia (ADRD) onsets to advance our understanding of the complex risk factors of ADRD and ultimately improve the care of ADRD patients.
Affiliation(s)
- Rui Duan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Zhaoyi Chen
- Department of Epidemiology, College of Medicine & College of Public Health and Health Professions, University of Florida, Gainesville, FL, USA
- Jiayi Tong
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA
- Chongliang Luo
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA
- Tianchen Lyu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cui Tao
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Demetrius Maraganore
- Department of Neurology, College of Medicine, University of Florida, Gainesville, FL, USA
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA
|
39
|
Tong J, Chen Z, Duan R, Lo-Ciganic WH, Lyu T, Tao C, Merkel PA, Kranzler HR, Bian J, Chen Y. Identifying Clinical Risk Factors for Opioid Use Disorder using a Distributed Algorithm to Combine Real-World Data from a Large Clinical Data Research Network. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2021; 2020:1220-1229. [PMID: 33936498 PMCID: PMC8075517] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Because they contain detailed individual-level data on various patient characteristics including their medical conditions and treatment histories, electronic health record (EHR) systems have been widely adopted as an efficient source for health research. Compared to data from a single health system, real-world data (RWD) from multiple clinical sites provide a larger and more generalizable population for accurate estimation, leading to better decision making for health care. However, due to concerns over protecting patient privacy, it is challenging to share individual patient-level data across sites in practice. To tackle this issue, many distributed algorithms have been developed to transfer summary-level statistics to derive accurate estimates. Nevertheless, many of these algorithms require multiple rounds of communication to exchange intermediate results across different sites. Among them, the One-shot Distributed Algorithm for Logistic regression (termed ODAL) was developed to reduce communication overhead while protecting patient privacy. In this paper, we applied the ODAL algorithm to RWD from a large clinical data research network, the OneFlorida Clinical Research Consortium, and estimated the associations between risk factors and the diagnosis of opioid use disorder (OUD) among individuals who received at least one opioid prescription. The ODAL algorithm provided consistent findings of the associated risk factors and yielded better estimates than meta-analysis.
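A one-shot scheme of this kind can be sketched with a surrogate likelihood for logistic regression. This is a minimal illustration in the spirit of such methods, not the authors' ODAL implementation: the function names are hypothetical, the lead site's local estimate is assumed as the starting value, and a plain Newton iteration stands in for whatever optimizer a real implementation uses.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_loglik_grad(beta, X, y):
    """Gradient of the average logistic log-likelihood at beta."""
    return X.T @ (y - _sigmoid(X @ beta)) / len(y)

def avg_loglik_hess(beta, X):
    """Hessian of the average logistic log-likelihood (negative definite)."""
    p = _sigmoid(X @ beta)
    return -(X.T * (p * (1.0 - p))) @ X / len(p)

def newton_maximize(X, y, shift, beta0, n_iter=50, tol=1e-10):
    """Maximize avg_loglik(beta; X, y) + shift @ beta with Newton steps."""
    beta = beta0.copy()
    for _ in range(n_iter):
        grad = avg_loglik_grad(beta, X, y) + shift
        step = np.linalg.solve(avg_loglik_hess(beta, X), grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def one_shot_surrogate(sites):
    """sites: list of (X, y) pairs; sites[0] plays the role of the lead site."""
    X1, y1 = sites[0]
    d = X1.shape[1]
    # Step 1: the lead site fits a local model (assumed initial estimate).
    beta_bar = newton_maximize(X1, y1, np.zeros(d), np.zeros(d))
    # Step 2: every site shares only its gradient at beta_bar and its size.
    sizes = np.array([len(y) for _, y in sites], dtype=float)
    grads = np.array([avg_loglik_grad(beta_bar, X, y) for X, y in sites])
    global_grad = (sizes[:, None] * grads).sum(axis=0) / sizes.sum()
    # Step 3: the lead site maximizes its likelihood plus a linear correction.
    return newton_maximize(X1, y1, global_grad - grads[0], beta_bar)
```

Patient-level rows never leave a site in this sketch; the only quantities exchanged are one gradient vector and one sample size per site, in a single round.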
Affiliation(s)
- Jiayi Tong
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA
- Zhaoyi Chen
- Department of Epidemiology, College of Medicine & College of Public Health and Health Professions, University of Florida, Gainesville, FL, USA
- Rui Duan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Wei-Hsuan Lo-Ciganic
- Department of Pharmaceutical Outcomes & Policy, College of Pharmacy, University of Florida, Gainesville, FL, USA
- Tianchen Lyu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cui Tao
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Peter A Merkel
- Department of Medicine, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA
- Henry R Kranzler
- Department of Psychiatry, Perelman School of Medicine, The University of Pennsylvania, and VISN 4 MIRECC, Crescenz VAMC, Philadelphia, PA, USA
- Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Yong Chen
- Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, The University of Pennsylvania, Philadelphia, PA, USA
|
40
|
Li R, Duan R, Zhang X, Lumley T, Pendergrass S, Bauer C, Hakonarson H, Carrell DS, Smoller JW, Wei WQ, Carroll R, Velez Edwards DR, Wiesner G, Sleiman P, Denny JC, Mosley JD, Ritchie MD, Chen Y, Moore JH. Lossless integration of multiple electronic health records for identifying pleiotropy using summary statistics. Nat Commun 2021; 12:168. [PMID: 33420026 PMCID: PMC7794298 DOI: 10.1038/s41467-020-20211-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/28/2020] [Accepted: 11/13/2020] [Indexed: 11/22/2022] Open
Abstract
Increasingly, clinical phenotypes with matched genetic data from bio-bank linked electronic health records (EHRs) have been used for pleiotropy analyses. Thus far, pleiotropy analysis using individual-level EHR data has been limited to data from one site. However, it is desirable to integrate EHR data from multiple sites to improve the detection power and generalizability of the results. Due to privacy concerns, individual-level patients' data are not easily shared across institutions. As a result, we introduce Sum-Share, a method designed to efficiently integrate EHR and genetic data from multiple sites to perform pleiotropy analysis. Sum-Share requires only summary-level data and one round of communication from each site, yet it produces test statistics identical to those obtained from pooled individual-level data. Consequently, Sum-Share can achieve lossless integration of multiple datasets. Using real EHR data from eMERGE, Sum-Share is able to identify 1734 potential pleiotropic SNPs for five cardiovascular diseases.
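Why summary-level data can be lossless is easiest to see in ordinary least squares, where the cross-product matrices are sufficient statistics. The sketch below is a generic illustration of that principle and is not Sum-Share itself, whose statistics the abstract does not specify; the function names are hypothetical.

```python
import numpy as np

def site_summary(X, y):
    """Computed locally at one site; only these aggregates are shared."""
    return X.T @ X, X.T @ y, len(y)

def combine_summaries(summaries):
    """Recover the pooled least-squares estimate from per-site aggregates."""
    xtx = sum(s[0] for s in summaries)  # sum of X'X over sites
    xty = sum(s[1] for s in summaries)  # sum of X'y over sites
    return np.linalg.solve(xtx, xty)
```

Because the pooled X'X and X'y are exact sums of the per-site matrices, the combined estimate equals the pooled-data estimate up to floating-point error after a single round of communication.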
Affiliation(s)
- Ruowang Li
- Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Rui Duan
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Xinyuan Zhang
- Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Thomas Lumley
- Department of Statistics, University of Auckland, Auckland, New Zealand
- Sarah Pendergrass
- Biomedical and Translational Informatics Institute, Geisinger, Danville, PA, USA
- Christopher Bauer
- Biomedical and Translational Informatics Institute, Geisinger, Danville, PA, USA
- Hakon Hakonarson
- Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- David S Carrell
- Kaiser Permanente Washington Health Research Institute, Seattle, WA, USA
- Jordan W Smoller
- Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Wei-Qi Wei
- Department of Biomedical Informatics, Vanderbilt University Medical Centre, Nashville, TN, USA
- Robert Carroll
- Department of Biomedical Informatics, Vanderbilt University Medical Centre, Nashville, TN, USA
- Digna R Velez Edwards
- Clinical and Translational Hereditary Cancer Program, Division of Genetic Medicine, Department of Medicine, Vanderbilt-Ingram Cancer Center, Vanderbilt University, Nashville, TN, USA
- Georgia Wiesner
- Clinical and Translational Hereditary Cancer Program, Division of Genetic Medicine, Department of Medicine, Vanderbilt-Ingram Cancer Center, Vanderbilt University, Nashville, TN, USA
- Patrick Sleiman
- Center for Applied Genomics, Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Josh C Denny
- Department of Biomedical Informatics, Vanderbilt University Medical Centre, Nashville, TN, USA
- Jonathan D Mosley
- Department of Biomedical Informatics, Vanderbilt University Medical Centre, Nashville, TN, USA
- Marylyn D Ritchie
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Yong Chen
- Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA, USA
- Jason H Moore
- Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA, USA
|