1
Kapitan D, Heddema F, Dekker A, Sieswerda M, Verhoeff BJ, Berg M. Data Interoperability in Context: The Importance of Open-Source Implementations When Choosing Open Standards. J Med Internet Res 2025; 27:e66616. [PMID: 40232773 PMCID: PMC12041819 DOI: 10.2196/66616] [Received: 10/10/2024] [Revised: 02/05/2025] [Accepted: 03/16/2025] [Indexed: 04/16/2025]
Abstract
Following the proposal by Tsafnat et al (2024) to converge on three open health data standards, this viewpoint offers a critical reflection on their proposed alignment of openEHR, Fast Healthcare Interoperability Resources (FHIR), and Observational Medical Outcomes Partnership (OMOP) as default data standards for clinical care and administration, data exchange, and longitudinal analysis, respectively. We argue that open standards are a necessary but not sufficient condition to achieve health data interoperability. The ecosystem of open-source software needs to be considered when choosing an appropriate standard for a given context. We discuss two specific contexts, namely standardization of (1) health data for federated learning, and (2) health data sharing in low- and middle-income countries. Specific design principles, practical considerations, and implementation choices for these two contexts are described, based on ongoing work in both areas. In the case of federated learning, we observe convergence toward OMOP and FHIR, where the two standards can effectively be used side by side given the availability of mediators between the two. In the case of health information exchanges in low- and middle-income countries, we see a strong convergence toward FHIR as the primary standard. We propose practical guidelines for context-specific adaptation of open health data standards.
Affiliation(s)
- Daniel Kapitan
  - Eindhoven AI Systems Institute (EAISI), Eindhoven University of Technology, Eindhoven, The Netherlands
  - PharmAccess Foundation, Amsterdam, The Netherlands
  - Dutch Hospital Data, Utrecht, The Netherlands
- André Dekker
  - MAASTRO Clinic, Maastricht University Medical Centre, Maastricht University, Maastricht, The Netherlands
- Melle Sieswerda
  - Netherlands Comprehensive Cancer Organisation, Utrecht, The Netherlands

2
Obeagu EI, Ezeanya CU, Ogenyi FC, Ifu DD. Big data analytics and machine learning in hematology: Transformative insights, applications and challenges. Medicine (Baltimore) 2025; 104:e41766. [PMID: 40068020 PMCID: PMC11902945 DOI: 10.1097/md.0000000000041766] [Received: 11/20/2023] [Revised: 12/14/2024] [Accepted: 02/17/2025] [Indexed: 03/14/2025]
Abstract
The integration of big data analytics and machine learning (ML) into hematology has ushered in a new era of precision medicine, offering transformative insights into disease management. By leveraging vast and diverse datasets, including genomic profiles, clinical laboratory results, and imaging data, these technologies enhance diagnostic accuracy, enable robust prognostic modeling, and support personalized therapeutic interventions. Advanced ML algorithms, such as neural networks and ensemble learning, facilitate the discovery of novel biomarkers and refine risk stratification for hematological disorders, including leukemias, lymphomas, and coagulopathies. Despite these advancements, significant challenges persist, particularly in the realms of data integration, algorithm validation, and ethical concerns. The heterogeneity of hematological datasets and the lack of standardized frameworks complicate their application, while the "black-box" nature of ML models raises issues of reliability and clinical trust. Moreover, safeguarding patient privacy in an era of data-driven medicine remains paramount, necessitating the development of secure and ethical analytical practices. Addressing these challenges is critical to ensuring equitable and effective implementation of these technologies. Collaborative efforts between hematologists, data scientists, and bioinformaticians are pivotal in translating these innovations into real-world clinical practice. Emphasis on developing explainable artificial intelligence models, integrating real-time analytics, and adopting federated learning approaches will further enhance the utility and adoption of these technologies. As big data analytics and ML continue to evolve, their potential to revolutionize hematology and improve patient outcomes remains immense.
Affiliation(s)
- Fabian Chukwudi Ogenyi
  - Department of Electrical, Telecommunication and Computer Engineering, Kampala International University, Kampala, Uganda
- Deborah Domini Ifu
  - Department of Biomedical and Laboratory Science, Africa University, Mutare, Zimbabwe

3
Lee H, Kim S, Moon HW, Lee HY, Kim K, Jung SY, Yoo S. Hospital Length of Stay Prediction for Planned Admissions Using Observational Medical Outcomes Partnership Common Data Model: Retrospective Study. J Med Internet Res 2024; 26:e59260. [PMID: 39576284 PMCID: PMC11624451 DOI: 10.2196/59260] [Received: 04/07/2024] [Revised: 07/07/2024] [Accepted: 10/29/2024] [Indexed: 11/24/2024]
Abstract
BACKGROUND Accurate hospital length of stay (LoS) prediction enables efficient resource management. Conventional LoS prediction models with limited covariates and nonstandardized data have limited reproducibility when applied to the general population.
OBJECTIVE In this study, we developed and validated a machine learning (ML)-based LoS prediction model for planned admissions using the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM).
METHODS Retrospective patient-level prediction models used electronic health record (EHR) data converted to the OMOP CDM (version 5.3) from Seoul National University Bundang Hospital (SNUBH) in South Korea. The study included 137,437 hospital admission episodes between January 2016 and December 2020. Covariates from the patient, condition occurrence, medication, observation, measurement, procedure, and visit occurrence tables were included in the analysis. To perform feature selection, we applied Lasso regularization in the logistic regression. The primary outcome was an LoS of 7 days or longer, while the secondary outcome was an LoS of 3 days or longer. The prediction models were developed using 6 ML algorithms, with the training and test set split in a 7:3 ratio. The performance of each model was evaluated based on the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). Shapley Additive Explanations (SHAP) analysis measured feature importance, while calibration plots assessed the reliability of the prediction models. External validation of the developed models occurred at an independent institution, the Seoul National University Hospital.
RESULTS The final sample included 129,938 patient entry events in the planned admissions. The Extreme Gradient Boosting (XGB) model achieved the best performance in binary classification for predicting an LoS of 7 days or longer, with an AUROC of 0.891 (95% CI 0.887-0.894) and an AUPRC of 0.819 (95% CI 0.813-0.826) on the internal test set. The Light Gradient Boosting (LGB) model performed the best in the multiclassification for predicting an LoS of 3 days or more, with an AUROC of 0.901 (95% CI 0.898-0.904) and an AUPRC of 0.770 (95% CI 0.762-0.779). The most important features contributing to the models were the operation performed, frequency of previous outpatient visits, patient admission department, age, and day of admission. The random forest (RF) model showed robust performance in the external validation set, achieving an AUROC of 0.804 (95% CI 0.802-0.807).
CONCLUSIONS The use of the OMOP CDM in predicting hospital LoS for planned admissions demonstrates promising predictive capabilities for stays of varying durations. It underscores the advantage of standardized data in achieving reproducible results. This approach should serve as a model for enhancing operational efficiency and patient care coordination across health care settings.
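The AUROC values reported in this abstract have a simple rank interpretation: the AUROC is the probability that a randomly chosen positive case (here, a prolonged stay) receives a higher predicted score than a randomly chosen negative case. The sketch below illustrates that definition in plain Python. The function name `auroc` and the toy labels and scores are assumptions made for illustration; the abstract does not describe the study's evaluation code, and this is not it.

```python
from itertools import product


def auroc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: iterable of 0/1 outcomes (1 = prolonged stay, hypothetical).
    scores: predicted probabilities or scores, same length as labels.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Count positive-negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80]
print(auroc(y_true, y_score))  # 3 of 4 pairs ordered correctly -> 0.75
```

The pairwise loop costs time proportional to the number of positive-negative pairs, which is fine for a demonstration; on cohorts of the size described here a sort-based implementation would be the more practical choice.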
Affiliation(s)
- Haeun Lee
  - Department of Biomedical Informatics and Data Science, Johns Hopkins School of Medicine, Johns Hopkins University, Baltimore, MD, United States
  - Office of eHealth Research and Businesses, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Seok Kim
  - Office of eHealth Research and Businesses, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Hui-Woun Moon
  - Office of eHealth Research and Businesses, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Ho-Young Lee
  - Office of eHealth Research and Businesses, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Kwangsoo Kim
  - Department of Transdisciplinary Medicine, Seoul National University Hospital, Seoul, Republic of Korea
- Se Young Jung
  - Department of Family Medicine, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Sooyoung Yoo
  - Office of eHealth Research and Businesses, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea

4
Seinen TM, Kors JA, van Mulligen EM, Fridgeirsson EA, Verhamme KM, Rijnbeek PR. Using clinical text to refine unspecific condition codes in Dutch general practitioner EHR data. Int J Med Inform 2024; 189:105506. [PMID: 38820647 DOI: 10.1016/j.ijmedinf.2024.105506] [Received: 12/16/2023] [Revised: 05/22/2024] [Accepted: 05/27/2024] [Indexed: 06/02/2024]
Abstract
OBJECTIVE Observational studies using electronic health record (EHR) databases often face challenges due to unspecific clinical codes that can obscure detailed medical information, hindering precise data analysis. In this study, we aimed to assess the feasibility of refining these unspecific condition codes into more specific codes in a Dutch general practitioner (GP) EHR database by leveraging the available clinical free text.
METHODS We utilized three approaches for text classification (search queries, semi-supervised learning, and supervised learning) to improve the specificity of ten unspecific International Classification of Primary Care (ICPC-1) codes. Two text representations and three machine learning algorithms were evaluated for the (semi-)supervised models. Additionally, we measured the improvement achieved by the refinement process on all code occurrences in the database.
RESULTS The classification models performed well for most codes. In general, no single classification approach consistently outperformed the others. However, there were variations in the relative performance of the classification approaches within each code and in the use of different text representations and machine learning algorithms. Class imbalance and limited training data affected the performance of the (semi-)supervised models, yet the simple search queries remained particularly effective. Ultimately, the developed models improved the specificity of over half of all the unspecific code occurrences in the database.
CONCLUSIONS Our findings show the feasibility of using information from clinical text to improve the specificity of unspecific condition codes in observational healthcare databases, even with a limited range of machine-learning techniques and modest annotated training sets. Future work could investigate transfer learning, integration of structured data, alternative semi-supervised methods, and validation of models across healthcare settings. The improved level of detail enriches the interpretation of medical information and can benefit observational research and patient care.
Affiliation(s)
- Tom M Seinen
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Jan A Kors
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Erik M van Mulligen
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Egill A Fridgeirsson
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Katia Mc Verhamme
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Peter R Rijnbeek
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands

5
Seixas-Lopes FA, Lopes C, Marques M, Agostinho C, Jardim-Goncalves R. Musculoskeletal Disorder (MSD) Health Data Collection, Personalized Management and Exchange Using Fast Healthcare Interoperability Resources (FHIR). Sensors (Basel) 2024; 24:5175. [PMID: 39204872 PMCID: PMC11360422 DOI: 10.3390/s24165175] [Received: 05/24/2024] [Revised: 07/21/2024] [Accepted: 08/02/2024] [Indexed: 09/04/2024]
Abstract
With the proliferation and growing complexity of healthcare systems emerges the challenge of implementing scalable and interoperable solutions to seamlessly integrate heterogeneous data from sources such as wearables, electronic health records, and patient reports that can provide a comprehensive and personalized view of the patient's health. Lack of standardization hinders the coordination between systems and stakeholders, impacting continuity of care and patient outcomes. Common musculoskeletal conditions affect people of all ages and can have a significant impact on quality of life. With physical activity and rehabilitation, these conditions can be mitigated, promoting recovery and preventing recurrence. Proper management of patient data allows for clinical decision support, facilitating personalized interventions and a patient-centered approach. Fast Healthcare Interoperability Resources (FHIR) is a widely adopted standard that defines healthcare concepts with the objective of easing information exchange and enabling interoperability throughout the healthcare sector, reducing implementation complexity without losing information integrity. This article explores the literature that reviews the contemporary role of FHIR, covering its functioning, benefits, and challenges, and presents a methodology for structuring several types of health and wellbeing data that can be routinely collected as observations and then encapsulated in FHIR resources to ensure interoperability across systems. These were developed considering health industry standard guidelines and technological specifications, and using the experience gained from implementation in various case studies within European health-related research projects, to assess its effectiveness in the exchange of patient data in existing healthcare systems towards improving the management of musculoskeletal disorders (MSDs).
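To make "encapsulated in FHIR resources" concrete, the sketch below builds a single FHIR R4 Observation for a heart-rate reading, such as a wearable might report. It is an illustration only and not the article's methodology: the function name, the patient identifier, and the timestamp are hypothetical. The LOINC code 8867-4 (heart rate), the UCUM unit `/min`, and the `vital-signs` category come from the published FHIR and terminology specifications.

```python
import json


def heart_rate_observation(patient_id, beats_per_minute, effective_time):
    """Build a FHIR R4 Observation (as a dict) for one heart-rate reading.

    patient_id and effective_time are hypothetical inputs; a real system
    would take them from its own patient index and device timestamps.
    """
    return {
        "resourceType": "Observation",
        "status": "final",
        "category": [{
            "coding": [{
                "system": "http://terminology.hl7.org/CodeSystem/observation-category",
                "code": "vital-signs",
            }]
        }],
        # LOINC 8867-4 is the code for heart rate.
        "code": {
            "coding": [{
                "system": "http://loinc.org",
                "code": "8867-4",
                "display": "Heart rate",
            }]
        },
        "subject": {"reference": f"Patient/{patient_id}"},
        "effectiveDateTime": effective_time,
        # UCUM expresses "per minute" as "/min".
        "valueQuantity": {
            "value": beats_per_minute,
            "unit": "beats/minute",
            "system": "http://unitsofmeasure.org",
            "code": "/min",
        },
    }


# Example values are made up for illustration.
obs = heart_rate_observation("example-123", 72, "2024-05-24T08:30:00Z")
print(json.dumps(obs, indent=2))
```

Because the resource is plain JSON, any system that understands the Observation structure and the LOINC and UCUM codes can interpret the reading without knowing which device or application produced it.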
Affiliation(s)
- Fabio A. Seixas-Lopes
  - Centre of Technology and Systems (UNINOVA-CTS) and Associated Lab of Intelligent Systems (LASI), 2829-516 Caparica, Portugal
  - Department of Electrical and Computer Engineering, NOVA School of Science and Technology, NOVA University Lisbon, 2829-516 Caparica, Portugal
- Carlos Lopes
  - Centre of Technology and Systems (UNINOVA-CTS) and Associated Lab of Intelligent Systems (LASI), 2829-516 Caparica, Portugal
- Maria Marques
  - Centre of Technology and Systems (UNINOVA-CTS) and Associated Lab of Intelligent Systems (LASI), 2829-516 Caparica, Portugal
- Carlos Agostinho
  - Centre of Technology and Systems (UNINOVA-CTS) and Associated Lab of Intelligent Systems (LASI), 2829-516 Caparica, Portugal
- Ricardo Jardim-Goncalves
  - Centre of Technology and Systems (UNINOVA-CTS) and Associated Lab of Intelligent Systems (LASI), 2829-516 Caparica, Portugal
  - Department of Electrical and Computer Engineering, NOVA School of Science and Technology, NOVA University Lisbon, 2829-516 Caparica, Portugal

6
John LH, Fridgeirsson EA, Kors JA, Reps JM, Williams RD, Ryan PB, Rijnbeek PR. Development and validation of a patient-level model to predict dementia across a network of observational databases. BMC Med 2024; 22:308. [PMID: 39075527 PMCID: PMC11288076 DOI: 10.1186/s12916-024-03530-9] [Received: 10/27/2023] [Accepted: 07/15/2024] [Indexed: 07/31/2024]
Abstract
BACKGROUND A prediction model can be a useful tool to quantify the risk of a patient developing dementia in the coming years and to guide risk-factor-targeted interventions. Numerous dementia prediction models have been developed, but few have been externally validated, likely limiting their clinical uptake. In our previous work, we had limited success in externally validating some of these existing models due to inadequate reporting. As a result, we are compelled to develop and externally validate novel models to predict dementia in the general population across a network of observational databases. We assess regularization methods to obtain parsimonious models that are of lower complexity and easier to implement.
METHODS Logistic regression models were developed across a network of five observational databases with electronic health records (EHRs) and claims data to predict 5-year dementia risk in persons aged 55-84. The regularization methods L1 and Broken Adaptive Ridge (BAR) as well as three candidate predictor sets to optimize prediction performance were assessed. The predictor sets include a baseline set using only age and sex, a full set including all available candidate predictors, and a phenotype set which includes a limited number of clinically relevant predictors.
RESULTS BAR can be used for variable selection, outperforming L1 when a parsimonious model is desired. Adding candidate predictors for disease diagnosis and drug exposure generally improves the performance of baseline models using only age and sex. While a model trained on German EHR data saw an increase in AUROC from 0.74 to 0.83 with additional predictors, a model trained on US EHR data showed only minimal improvement from 0.79 to 0.81 AUROC. Nevertheless, the latter model developed using BAR regularization on the clinically relevant predictor set was ultimately chosen as the best-performing model as it demonstrated more consistent external validation performance and improved calibration.
CONCLUSIONS We developed and externally validated patient-level models to predict dementia. Our results show that although dementia prediction is highly driven by demographic age, adding predictors based on condition diagnoses and drug exposures further improves prediction performance. BAR regularization outperforms L1 regularization to yield the most parsimonious yet still well-performing prediction model for dementia.
Affiliation(s)
- Luis H John
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Egill A Fridgeirsson
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Jan A Kors
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Jenna M Reps
  - Janssen Research and Development, Raritan, NJ, USA
- Ross D Williams
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Peter R Rijnbeek
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands

7
Fridgeirsson EA, Williams R, Rijnbeek P, Suchard MA, Reps JM. Comparing penalization methods for linear models on large observational health data. J Am Med Inform Assoc 2024; 31:1514-1521. [PMID: 38767857 PMCID: PMC11187433 DOI: 10.1093/jamia/ocae109] [Received: 01/15/2024] [Revised: 04/19/2024] [Accepted: 05/06/2024] [Indexed: 05/22/2024]
Abstract
OBJECTIVE This study evaluates regularization variants in logistic regression (L1, L2, ElasticNet, Adaptive L1, Adaptive ElasticNet, Broken adaptive ridge [BAR], and Iterative hard thresholding [IHT]) for discrimination and calibration performance, focusing on both internal and external validation.
MATERIALS AND METHODS We use data from 5 US claims and electronic health record databases and develop models for various outcomes in a major depressive disorder patient population. We externally validate all models in the other databases. We use a train-test split of 75%/25% and evaluate performance with discrimination and calibration. Statistical analysis for difference in performance uses Friedman's test and critical difference diagrams.
RESULTS Of the 840 models we develop, L1 and ElasticNet emerge as superior in both internal and external discrimination, with a notable AUC difference. BAR and IHT show the best internal calibration, without a clear external calibration leader. ElasticNet typically has larger model sizes than L1. Methods like IHT and BAR, while slightly less discriminative, significantly reduce model complexity.
CONCLUSION L1 and ElasticNet offer the best discriminative performance in logistic regression for healthcare predictions, maintaining robustness across validations. For simpler, more interpretable models, L0-based methods (IHT and BAR) are advantageous, providing greater parsimony and calibration with fewer features. This study aids in selecting suitable regularization techniques for healthcare prediction models, balancing performance, complexity, and interpretability.
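For orientation, the main penalty families named in this abstract can be written against a common penalized-likelihood objective. This is a sketch in standard textbook notation, not the paper's own: the symbols ($\beta$ for the coefficients, $\ell$ for the logistic log-likelihood, $\lambda$ and $\alpha$ for tuning parameters, $k$ for a sparsity budget) are assumptions introduced here, and the elastic net is shown in one common parameterization among several.

```latex
\begin{align*}
\hat{\beta} &= \arg\min_{\beta} \; \bigl\{ -\ell(\beta) + P(\beta) \bigr\} \\[4pt]
P_{\mathrm{L1}}(\beta) &= \lambda \sum_{j} \lvert \beta_j \rvert \\
P_{\mathrm{L2}}(\beta) &= \lambda \sum_{j} \beta_j^{2} \\
P_{\mathrm{EN}}(\beta) &= \lambda \Bigl( \alpha \sum_{j} \lvert \beta_j \rvert
                          + (1 - \alpha) \sum_{j} \beta_j^{2} \Bigr),
                          \qquad 0 \le \alpha \le 1 \\
\text{L0 constraint:} \quad & \lVert \beta \rVert_{0}
                          = \#\{\, j : \beta_j \neq 0 \,\} \le k
\end{align*}
```

The L1 term can set coefficients exactly to zero, the L2 term only shrinks them, and the elastic net mixes the two. The L0 form limits the number of nonzero coefficients directly, which is the formulation behind the methods the abstract groups as L0-based (IHT and BAR) and is consistent with its finding that those methods yield smaller models.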
Affiliation(s)
- Egill A Fridgeirsson
  - Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands
- Ross Williams
  - Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands
- Peter Rijnbeek
  - Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands
- Marc A Suchard
  - Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095-1772, United States
  - VA Informatics and Computing Infrastructure, United States Department of Veterans Affairs, Salt Lake City, UT 84148, United States
- Jenna M Reps
  - Department of Medical Informatics, Erasmus University Medical Center, 3015 GD Rotterdam, The Netherlands
  - Observational Health Data Analytics, Janssen Research and Development, Titusville, NJ 08560, United States

8
Naderalvojoud B, Curtin CM, Yanover C, El-Hay T, Choi B, Park RW, Tabuenca JG, Reeve MP, Falconer T, Humphreys K, Asch SM, Hernandez-Boussard T. Towards global model generalizability: independent cross-site feature evaluation for patient-level risk prediction models using the OHDSI network. J Am Med Inform Assoc 2024; 31:1051-1061. [PMID: 38412331 PMCID: PMC11031239 DOI: 10.1093/jamia/ocae028] [Received: 09/22/2023] [Revised: 01/26/2024] [Accepted: 02/01/2024] [Indexed: 02/29/2024]
Abstract
BACKGROUND Predictive models show promise in healthcare, but their successful deployment is challenging due to limited generalizability. Current external validation often focuses on model performance with restricted feature use from the original training data, lacking insights into their suitability at external sites. Our study introduces an innovative methodology for evaluating features during both the development phase and the validation, focusing on creating and validating predictive models for post-surgery patient outcomes with improved generalizability.
METHODS Electronic health records (EHRs) from 4 countries (United States, United Kingdom, Finland, and Korea) were mapped to the OMOP Common Data Model (CDM), 2008-2019. Machine learning (ML) models were developed to predict post-surgery prolonged opioid use (POU) risks using data collected 6 months before surgery. Both local and cross-site feature selection methods were applied in the development and external validation datasets. Models were developed using Observational Health Data Sciences and Informatics (OHDSI) tools and validated on separate patient cohorts.
RESULTS Model development included 41 929 patients, 14.6% with POU. The external validation included 31 932 (UK), 23 100 (US), 7295 (Korea), and 3934 (Finland) patients with POU of 44.2%, 22.0%, 15.8%, and 21.8%, respectively. The top-performing model, Lasso logistic regression, achieved an area under the receiver operating characteristic curve (AUROC) of 0.75 during local validation and 0.69 (SD = 0.02) (averaged) in external validation. Models trained with cross-site feature selection significantly outperformed those using only features from the development site through external validation (P < .05).
CONCLUSIONS Using EHRs across four countries mapped to the OMOP CDM, we developed generalizable predictive models for POU. Our approach demonstrates the significant impact of cross-site feature selection in improving model performance, underscoring the importance of incorporating diverse feature sets from various clinical settings to enhance the generalizability and utility of predictive healthcare models.
Affiliation(s)
- Catherine M Curtin
  - Department of Surgery, Veterans Affairs Palo Alto Health Care System, Palo Alto, CA 94304, United States
- Chen Yanover
  - KI Research Institute, Kfar Malal, 4592000, Israel
- Tal El-Hay
  - KI Research Institute, Kfar Malal, 4592000, Israel
- Byungjin Choi
  - Department of Biomedical Informatics, Ajou University Graduate School of Medicine, Suwon, 16499, Korea
- Rae Woong Park
  - Department of Biomedical Informatics, Ajou University Graduate School of Medicine, Suwon, 16499, Korea
- Javier Gracia Tabuenca
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, 00014, Finland
- Mary Pat Reeve
  - Institute for Molecular Medicine Finland (FIMM), HiLIFE, University of Helsinki, Helsinki, 00014, Finland
- Thomas Falconer
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Keith Humphreys
  - Department of Psychiatry and the Behavioral Sciences, Stanford University, Stanford, CA 94305, United States
  - Center for Innovation to Implementation, Veterans Affairs Palo Alto Health Care System, Palo Alto, CA 94304, United States
- Steven M Asch
  - Department of Medicine, Stanford University, Stanford, CA 94305, United States
  - Center for Innovation to Implementation, Veterans Affairs Palo Alto Health Care System, Palo Alto, CA 94304, United States

9
Kim C, Yu DH, Baek H, Cho J, You SC, Park RW. Data Resource Profile: Health Insurance Review and Assessment Service Covid-19 Observational Medical Outcomes Partnership (HIRA Covid-19 OMOP) database in South Korea. Int J Epidemiol 2024; 53:dyae062. [PMID: 38658170 DOI: 10.1093/ije/dyae062] [Received: 09/06/2023] [Accepted: 04/08/2024] [Indexed: 04/26/2024]
Affiliation(s)
- Chungsoo Kim
  - Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
- Dong Han Yu
  - Big Data Department, Health Insurance Assessment and Review Services, Wonju, Republic of Korea
- Hyeran Baek
  - Big Data Department, Health Insurance Assessment and Review Services, Wonju, Republic of Korea
- Jaehyeong Cho
  - Department of Research, Keimyung University Dongsan Medical Center, Daegu, Republic of Korea
- Seng Chan You
  - Department of Biomedical Systems Informatics, Yonsei University College of Medicine, Seoul, Republic of Korea
  - Institute for Innovation in Digital Healthcare, Yonsei University, Seoul, Republic of Korea
- Rae Woong Park
  - Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
  - Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea

10
Choi JY, Yoo S, Song W, Kim S, Baek H, Lee JS, Yoon YS, Yoon S, Lee HY, Kim KI. Development and Validation of a Prognostic Classification Model Predicting Postoperative Adverse Outcomes in Older Surgical Patients Using a Machine Learning Algorithm: Retrospective Observational Network Study. J Med Internet Res 2023; 25:e42259. [PMID: 37955965 PMCID: PMC10682929 DOI: 10.2196/42259] [Received: 08/30/2022] [Revised: 03/08/2023] [Accepted: 10/11/2023] [Indexed: 11/14/2023]
Abstract
BACKGROUND Older adults are at an increased risk of postoperative morbidity. Numerous risk stratification tools exist, but effort and manpower are required. OBJECTIVE This study aimed to develop a predictive model of postoperative adverse outcomes in older patients following general surgery with an open-source, patient-level prediction from the Observational Health Data Sciences and Informatics for internal and external validation. METHODS We used the Observational Medical Outcomes Partnership common data model and machine learning algorithms. The primary outcome was a composite of 90-day postoperative all-cause mortality and emergency department visits. Secondary outcomes were postoperative delirium, prolonged postoperative stay (≥75th percentile), and prolonged hospital stay (≥21 days). An 80% versus 20% split of the data from the Seoul National University Bundang Hospital (SNUBH) and Seoul National University Hospital (SNUH) common data model was used for model training and testing versus external validation. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC) with a 95% CI. RESULTS Data from 27,197 (SNUBH) and 32,857 (SNUH) patients were analyzed. Compared to the random forest, Adaboost, and decision tree models, the least absolute shrinkage and selection operator logistic regression model showed good internal discriminative accuracy (internal AUC 0.723, 95% CI 0.701-0.744) and transportability (external AUC 0.703, 95% CI 0.692-0.714) for the primary outcome. The model also possessed good internal and external AUCs for postoperative delirium (internal AUC 0.754, 95% CI 0.713-0.794; external AUC 0.750, 95% CI 0.727-0.772), prolonged postoperative stay (internal AUC 0.813, 95% CI 0.800-0.825; external AUC 0.747, 95% CI 0.741-0.753), and prolonged hospital stay (internal AUC 0.770, 95% CI 0.749-0.792; external AUC 0.707, 95% CI 0.696-0.718). 
Compared with age or the Charlson comorbidity index, the model showed better prediction performance. CONCLUSIONS The derived model may assist clinicians and patients in understanding the individualized risks and benefits of surgery.
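The evaluation metric reported in this abstract, an AUC with a 95% CI, can be illustrated with a minimal sketch. The abstract does not say how the interval was computed, so the percentile bootstrap used here is an assumption, and the function names (`auc`, `bootstrap_ci`) and the toy labels and scores are hypothetical rather than taken from the study.

```python
import random


def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (an assumed method, not the study's)."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain only one class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical toy data: 1 = adverse outcome, 0 = no adverse outcome.
toy_labels = [0, 0, 1, 0, 1, 1, 0, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.2, 0.6]

point = auc(toy_labels, toy_scores)
low, high = bootstrap_ci(toy_labels, toy_scores)
print(f"AUC {point:.3f}, 95% CI {low:.3f}-{high:.3f}")
```

The pairwise AUC is quadratic in the number of cases, which is fine for a toy example; on cohorts of the size reported here a rank-based implementation would be the practical choice.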
Affiliation(s)
- Jung-Yeon Choi
  - Department of Internal Medicine, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Sooyoung Yoo
  - Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Wongeun Song
  - Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
  - Department of Health Science and Technology, Graduate School of Convergence Science and Technology, Seoul National University, Seongnam-si, Republic of Korea
- Seok Kim
  - Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Hyunyoung Baek
  - Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Jun Suh Lee
  - Department of Surgery, G Sam Hospital, Gunpo, Republic of Korea
- Yoo-Seok Yoon
  - Department of Surgery, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
  - Department of Surgery, Seoul National University College of Medicine, Seoul, Republic of Korea
- Seonghae Yoon
  - Department of Clinical Pharmacology and Therapeutic, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Hae-Young Lee
  - Department of Internal Medicine, Seoul National University Hospital, Seoul, Republic of Korea
  - Department of Internal Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
- Kwang-Il Kim
  - Department of Internal Medicine, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
  - Department of Internal Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
11
Pirmani A, De Brouwer E, Geys L, Parciak T, Moreau Y, Peeters LM. The Journey of Data Within a Global Data Sharing Initiative: A Federated 3-Layer Data Analysis Pipeline to Scale Up Multiple Sclerosis Research. JMIR Med Inform 2023; 11:e48030. [PMID: 37943585 PMCID: PMC10667980 DOI: 10.2196/48030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2023] [Revised: 08/25/2023] [Accepted: 09/30/2023] [Indexed: 11/10/2023] Open
Abstract
BACKGROUND Investigating low-prevalence diseases such as multiple sclerosis is challenging because of the rather small number of individuals affected by this disease and the scattering of real-world data across numerous data sources. These obstacles impair data integration, standardization, and analysis, which negatively impact the generation of significant meaningful clinical evidence. OBJECTIVE This study aims to present a comprehensive, research question-agnostic, multistakeholder-driven end-to-end data analysis pipeline that accommodates 3 prevalent data-sharing streams: individual data sharing, core data set sharing, and federated model sharing. METHODS A demand-driven methodology is employed for standardization, followed by 3 streams of data acquisition, a data quality enhancement process, a data integration procedure, and a concluding analysis stage to fulfill real-world data-sharing requirements. This pipeline's effectiveness was demonstrated through its successful implementation in the COVID-19 and multiple sclerosis global data sharing initiative. RESULTS The global data sharing initiative yielded multiple scientific publications and provided extensive worldwide guidance for the community with multiple sclerosis. The pipeline facilitated gathering pertinent data from various sources, accommodating distinct sharing streams and assimilating them into a unified data set for subsequent statistical analysis or secure data examination. This pipeline contributed to the assembly of the largest data set of people with multiple sclerosis infected with COVID-19. CONCLUSIONS The proposed data analysis pipeline exemplifies the potential of global stakeholder collaboration and underlines the significance of evidence-based decision-making. It serves as a paradigm for how data sharing initiatives can propel advancements in health care, emphasizing its adaptability and capacity to address diverse research inquiries.
Affiliation(s)
- Ashkan Pirmani
  - ESAT, STADIUS, KU Leuven, Leuven, Belgium
  - Biomedical Research Institute, Hasselt University, Diepenbeek, Belgium
  - Data Science Institute, Hasselt University, Diepenbeek, Belgium
  - University Multiple Sclerosis Center, Hasselt University, Diepenbeek, Belgium
- Lotte Geys
  - Biomedical Research Institute, Hasselt University, Diepenbeek, Belgium
  - Data Science Institute, Hasselt University, Diepenbeek, Belgium
  - University Multiple Sclerosis Center, Hasselt University, Diepenbeek, Belgium
- Tina Parciak
  - Biomedical Research Institute, Hasselt University, Diepenbeek, Belgium
  - Data Science Institute, Hasselt University, Diepenbeek, Belgium
  - University Multiple Sclerosis Center, Hasselt University, Diepenbeek, Belgium
- Liesbet M Peeters
  - Biomedical Research Institute, Hasselt University, Diepenbeek, Belgium
  - Data Science Institute, Hasselt University, Diepenbeek, Belgium
  - University Multiple Sclerosis Center, Hasselt University, Diepenbeek, Belgium
12
Liu L, Song W, Patil N, Sainlaire M, Jasuja R, Dykes PC. Predicting COVID-19 severity: Challenges in reproducibility and deployment of machine learning methods. Int J Med Inform 2023; 179:105210. [PMID: 37769368 DOI: 10.1016/j.ijmedinf.2023.105210] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/27/2023] [Revised: 08/29/2023] [Accepted: 08/30/2023] [Indexed: 09/30/2023]
Abstract
The increasing use of electronic health record (EHR)-based computable phenotypes in clinical research is providing new opportunities for the development of data-driven medical applications. Adopted widely in the United States and globally, EHRs facilitate systematic collection of patients' longitudinal information, which serves as one of the important foundations for artificial intelligence (AI) applications in medicine. Harmonization of input variables and outcome definitions is critically important for wider clinical applicability of AI methodologies. In this review, we focused on Coronavirus Disease 2019 (COVID-19) severity machine learning prediction models and explored the pipeline for standardizing future disease severity model development using EHR information. We identified 2,967 studies published between 01/01/2020 and 02/15/2022 and selected for the final review 135 independent studies that had built machine learning models to predict severity-related outcomes of COVID-19 patients based on EHR data. These 135 studies, spanning 27 countries, covered a broad range of severity-related prediction outcomes. We observed substantial inconsistency in COVID-19 severity phenotype definitions among models in these studies. Moreover, there was a gap between the outcomes of these models and clinician-recognized clinical concepts. Accordingly, we recommend robust clinical input metrics, with outcome definitions that eliminate ambiguity in interpretation, to reduce algorithmic bias, mitigate model brittleness, and improve the generalizability of a universal model for COVID-19 severity. This framework can potentially be extended to broader clinical application.
Affiliation(s)
- Luwei Liu
  - Department of Medicine, Brigham & Women's Hospital, Boston, MA, USA
- Wenyu Song
  - Department of Medicine, Brigham & Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
- Namrata Patil
  - Department of Surgery, Brigham & Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
- Ravi Jasuja
  - Department of Medicine, Brigham & Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
- Patricia C Dykes
  - Department of Medicine, Brigham & Women's Hospital, Boston, MA, USA; Harvard Medical School, Boston, MA, USA
13
Junior EPP, Normando P, Flores-Ortiz R, Afzal MU, Jamil MA, Bertolin SF, Oliveira VDA, Martufi V, de Sousa F, Bashir A, Burn E, Ichihara MY, Barreto ML, Salles TD, Prieto-Alhambra D, Hafeez H, Khalid S. Integrating real-world data from Brazil and Pakistan into the OMOP common data model and standardized health analytics framework to characterize COVID-19 in the Global South. J Am Med Inform Assoc 2023; 30:643-655. [PMID: 36264262 PMCID: PMC9619798 DOI: 10.1093/jamia/ocac180] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 06/01/2022] [Revised: 08/16/2022] [Accepted: 09/29/2022] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVES The aim of this work is to demonstrate the use of a standardized health informatics framework to generate reliable and reproducible real-world evidence from Latin America and South Asia towards characterizing coronavirus disease 2019 (COVID-19) in the Global South. MATERIALS AND METHODS Patient-level COVID-19 records collected in a patient self-reported notification system, hospital in-patient and out-patient records, and community diagnostic labs were harmonized to the Observational Medical Outcomes Partnership common data model and analyzed using a federated network analytics framework. Clinical characteristics of individuals tested for, diagnosed with or tested positive for, hospitalized with, admitted to intensive care unit with, or dying with COVID-19 were estimated. RESULTS Two COVID-19 databases covering 8.3 million people from Pakistan and 2.6 million people from Bahia, Brazil were analyzed. 109 504 (Pakistan) and 921 (Brazil) medical concepts were harmonized to the Observational Medical Outcomes Partnership common data model. In total, 341 505 (4.1%) people in the Pakistan dataset and 1 312 832 (49.2%) people in the Brazilian dataset were tested for COVID-19 between January 1, 2020 and April 20, 2022, with a median [IQR] age of 36 [25, 76] and 38 [27, 50]; 40.3% and 56.5% were female in Pakistan and Brazil, respectively. In the Pakistan dataset, 1.2% of individuals had Afghan ethnicity. In Brazil, 52.3% had mixed ethnicity. In agreement with international findings, COVID-19 outcomes were more severe in men, the elderly, and those with underlying health conditions. CONCLUSIONS COVID-19 data from 2 large countries in the Global South were harmonized and analyzed using a standardized health informatics framework developed by an international community of health informaticians. This proof-of-concept study demonstrates a potential open science framework for global knowledge mobilization and clinical translation for timely response to healthcare needs in pandemics and beyond.
Affiliation(s)
- Elzo Pereira Pinto Junior
  - Center of Data and Knowledge Integration for Health (Cidacs), Fiocruz-Brazil, Parque Tecnológico da Edf, Tecnocentro, R. Mundo, Salvador, BA 41745-715, Brazil
- Priscilla Normando
  - Center of Data and Knowledge Integration for Health (Cidacs), Fiocruz-Brazil, Parque Tecnológico da Edf, Tecnocentro, R. Mundo, Salvador, BA 41745-715, Brazil
- Renzo Flores-Ortiz
  - Center of Data and Knowledge Integration for Health (Cidacs), Fiocruz-Brazil, Parque Tecnológico da Edf, Tecnocentro, R. Mundo, Salvador, BA 41745-715, Brazil
- Muhammad Usman Afzal
  - Shaukat Khanum Memorial Cancer Hospital and Research Centre, Johar Town, Lahore, 54840, Pakistan
- Muhammad Asaad Jamil
  - Shaukat Khanum Memorial Cancer Hospital and Research Centre, Johar Town, Lahore, 54840, Pakistan
- Sergio Fernandez Bertolin
  - Fundació Institut, Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, 587 08007, Spain
- Vinícius de Araújo Oliveira
  - Center of Data and Knowledge Integration for Health (Cidacs), Fiocruz-Brazil, Parque Tecnológico da Edf, Tecnocentro, R. Mundo, Salvador, BA 41745-715, Brazil
- Valentina Martufi
  - Center of Data and Knowledge Integration for Health (Cidacs), Fiocruz-Brazil, Parque Tecnológico da Edf, Tecnocentro, R. Mundo, Salvador, BA 41745-715, Brazil
- Fernanda de Sousa
  - Center of Data and Knowledge Integration for Health (Cidacs), Fiocruz-Brazil, Parque Tecnológico da Edf, Tecnocentro, R. Mundo, Salvador, BA 41745-715, Brazil
- Amir Bashir
  - Shaukat Khanum Memorial Cancer Hospital and Research Centre, Johar Town, Lahore, 54840, Pakistan
- Edward Burn
  - Centre for Statistics in Medicine, Botnar Research Centre, University of Oxford, Oxford, OX3 7LD, United Kingdom
- Maria Yury Ichihara
  - Center of Data and Knowledge Integration for Health (Cidacs), Fiocruz-Brazil, Parque Tecnológico da Edf, Tecnocentro, R. Mundo, Salvador, BA 41745-715, Brazil
- Maurício L Barreto
  - Center of Data and Knowledge Integration for Health (Cidacs), Fiocruz-Brazil, Parque Tecnológico da Edf, Tecnocentro, R. Mundo, Salvador, BA 41745-715, Brazil
- Talita Duarte Salles
  - Fundació Institut, Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, 587 08007, Spain
- Daniel Prieto-Alhambra
  - Centre for Statistics in Medicine, Botnar Research Centre, University of Oxford, Oxford, OX3 7LD, United Kingdom
- Haroon Hafeez
  - Shaukat Khanum Memorial Cancer Hospital and Research Centre, Johar Town, Lahore, 54840, Pakistan
- Sara Khalid
  - Centre for Statistics in Medicine, Botnar Research Centre, University of Oxford, Oxford, OX3 7LD, United Kingdom
14
Chandran U, Reps J, Yang R, Vachani A, Maldonado F, Kalsekar I. Machine Learning and Real-World Data to Predict Lung Cancer Risk in Routine Care. Cancer Epidemiol Biomarkers Prev 2023; 32:337-343. [PMID: 36576991 PMCID: PMC9986687 DOI: 10.1158/1055-9965.epi-22-0873] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/12/2022] [Revised: 10/07/2022] [Accepted: 12/19/2022] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND This study used machine learning to develop a 3-year lung cancer risk prediction model with large real-world data in a mostly younger population. METHODS Over 4.7 million individuals, aged 45 to 65 years with no history of any cancer or lung cancer screening, diagnostic, or treatment procedures, with an outpatient visit in 2013 were identified in Optum's de-identified Electronic Health Record (EHR) dataset. A least absolute shrinkage and selection operator model was fit using all available data in the 365 days prior. Temporal validation was assessed with recent data. External validation was assessed with data from Mercy Health Systems EHR and Optum's de-identified Clinformatics Data Mart Database. Racial inequities in model discrimination were assessed with xAUCs. RESULTS The model AUC was 0.76. Top predictors included age, smoking, race, ethnicity, and diagnosis of chronic obstructive pulmonary disease. The model identified a high-risk group with lung cancer incidence 9 times the average cohort incidence, representing 10% of patients with lung cancer. Model performed well temporally and externally, while performance was reduced for Asians and Hispanics. CONCLUSIONS A high-dimensional model trained using big data identified a subset of patients with high lung cancer risk. The model demonstrated transportability to EHR and claims data, while underscoring the need to assess racial disparities when using machine learning methods. IMPACT This internally and externally validated real-world data-based lung cancer prediction model is available on an open-source platform for broad sharing and application. Model integration into an EHR system could minimize physician burden by automating identification of high-risk patients.
Affiliation(s)
- Urmila Chandran
  - Johnson & Johnson Global Epidemiology, Titusville, New Jersey; Lung Cancer Initiative, Johnson & Johnson, New Brunswick, New Jersey
- Jenna Reps
  - Johnson & Johnson Global Epidemiology, Titusville, New Jersey
- Robert Yang
  - Lung Cancer Initiative, Johnson & Johnson, New Brunswick, New Jersey
- Anil Vachani
  - University of Pennsylvania, Perelman School of Medicine, Philadelphia, Pennsylvania
- Iftekhar Kalsekar
  - Lung Cancer Initiative, Johnson & Johnson, New Brunswick, New Jersey
15
Yang C, Williams RD, Swerdel JN, Almeida JR, Brouwer ES, Burn E, Carmona L, Chatzidionysiou K, Duarte-Salles T, Fakhouri W, Hottgenroth A, Jani M, Kolde R, Kors JA, Kullamaa L, Lane J, Marinier K, Michel A, Stewart HM, Prats-Uribe A, Reisberg S, Sena AG, Torre CO, Verhamme K, Vizcaya D, Weaver J, Ryan P, Prieto-Alhambra D, Rijnbeek PR. Development and external validation of prediction models for adverse health outcomes in rheumatoid arthritis: A multinational real-world cohort analysis. Semin Arthritis Rheum 2022; 56:152050. [PMID: 35728447 DOI: 10.1016/j.semarthrit.2022.152050] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/21/2022] [Revised: 05/11/2022] [Accepted: 06/10/2022] [Indexed: 10/18/2022]
Abstract
BACKGROUND Identification of rheumatoid arthritis (RA) patients at high risk of adverse health outcomes remains a major challenge. We aimed to develop and validate prediction models for a variety of adverse health outcomes in RA patients initiating first-line methotrexate (MTX) monotherapy. METHODS Data from 15 claims and electronic health record databases across 9 countries were used. Models were developed and internally validated on Optum® De-identified Clinformatics® Data Mart Database using L1-regularized logistic regression to estimate the risk of adverse health outcomes within 3 months (leukopenia, pancytopenia, infection), 2 years (myocardial infarction (MI) and stroke), and 5 years (cancers [colorectal, breast, uterine]) after treatment initiation. Candidate predictors included demographic variables and past medical history. Models were externally validated on all other databases. Performance was assessed using the area under the receiver operating characteristic curve (AUC) and calibration plots. FINDINGS Models were developed and internally validated on 21,547 RA patients and externally validated on 131,928 RA patients. Models for serious infection (AUC: internal 0.74, external ranging from 0.62 to 0.83), MI (AUC: internal 0.76, external ranging from 0.56 to 0.82), and stroke (AUC: internal 0.77, external ranging from 0.63 to 0.95) showed good discrimination and adequate calibration. Models for the other outcomes showed modest internal discrimination (AUC < 0.65) and were not externally validated. INTERPRETATION We developed and validated prediction models for a variety of adverse health outcomes in RA patients initiating first-line MTX monotherapy. Final models for serious infection, MI, and stroke demonstrated good performance across multiple databases and can be studied for clinical use.
FUNDING This activity under the European Health Data & Evidence Network (EHDEN) has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No 806968. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation programme and EFPIA.
Affiliation(s)
- Cynthia Yang
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Ross D Williams
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Joel N Swerdel
  - Janssen Research and Development, Titusville, NJ, United States
- Emily S Brouwer
  - Janssen Research and Development, Titusville, NJ, United States
- Edward Burn
  - Nuffield Department of Orthopaedics Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, United Kingdom; Fundació Institut Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain
- Talita Duarte-Salles
  - Fundació Institut Universitari per a la recerca a l'Atenció Primària de Salut Jordi Gol i Gurina (IDIAPJGol), Barcelona, Spain
- Walid Fakhouri
  - Eli Lilly and Company, Windlesham, Surrey, United Kingdom
- Meghna Jani
  - Centre for Epidemiology Versus Arthritis, University of Manchester, Manchester, United Kingdom
- Raivo Kolde
  - Institute of Computer Science, University of Tartu, Tartu, Estonia
- Jan A Kors
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Lembe Kullamaa
  - Department of Epidemiology and Biostatistics, National Institute for Health Development, Tallinn, Estonia; Institute of Family Medicine and Public Health, University of Tartu, Tartu, Estonia; European Patients' Forum, Brussels, Belgium
- Jennifer Lane
  - Nuffield Department of Orthopaedics Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, United Kingdom
- Albert Prats-Uribe
  - Nuffield Department of Orthopaedics Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, United Kingdom
- Sulev Reisberg
  - Institute of Computer Science, University of Tartu, Tartu, Estonia; STACC, Tartu, Estonia; Quretec, Tartu, Estonia
- Anthony G Sena
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands; Janssen Research and Development, Titusville, NJ, United States
- Katia Verhamme
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- James Weaver
  - Janssen Research and Development, Titusville, NJ, United States; Observational Health Data Sciences and Informatics, New York, NY, United States
- Patrick Ryan
  - Janssen Research and Development, Titusville, NJ, United States; Observational Health Data Sciences and Informatics, New York, NY, United States
- Daniel Prieto-Alhambra
  - Nuffield Department of Orthopaedics Rheumatology and Musculoskeletal Sciences, University of Oxford, Oxford, United Kingdom
- Peter R Rijnbeek
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
16
Lin V, Tsouchnika A, Allakhverdiiev E, Rosen AW, Gögenur M, Clausen JSR, Bräuner KB, Walbech JS, Rijnbeek P, Drakos I, Gögenur I. Training prediction models for individual risk assessment of postoperative complications after surgery for colorectal cancer. Tech Coloproctol 2022; 26:665-675. [PMID: 35593971 DOI: 10.1007/s10151-022-02624-x] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 01/13/2022] [Accepted: 04/20/2022] [Indexed: 12/01/2022]
Abstract
BACKGROUND The occurrence of postoperative complications and anastomotic leakage are major drivers of mortality in the immediate phase after colorectal cancer surgery. We trained prediction models for calculating patients' individual risk of complications based only on preoperatively available data in a multidisciplinary team setting. Knowing the probability of developing a complication prior to surgery could aid informed decision-making by surgeon and patient and help individualize surgical treatment trajectories. METHODS All patients over 18 years of age undergoing any resection for colorectal cancer between January 1, 2014 and December 31, 2019 from the nationwide Danish Colorectal Cancer Group database were included. Data from the database were converted into the Observational Medical Outcomes Partnership Common Data Model maintained by the Observational Health Data Sciences and Informatics initiative. Multiple machine learning models were trained to predict postoperative complications of Clavien-Dindo grade ≥ 3B and anastomotic leakage within 30 days after surgery. RESULTS Between 2014 and 2019, 23,907 patients underwent resection for colorectal cancer in Denmark. A Clavien-Dindo complication grade ≥ 3B occurred in 2,958 patients (12.4%). Of 17,190 patients that received an anastomosis, 929 experienced anastomotic leakage (5.4%). Among the compared machine learning models, Lasso Logistic Regression performed best. The predictive model for complications had an area under the receiver operating characteristic curve (AUROC) of 0.704 (95%CI 0.683-0.724) and an AUROC of 0.690 (95%CI 0.655-0.724) for anastomotic leakage. CONCLUSIONS The prediction of postoperative complications based only on preoperative variables using a national quality assurance colorectal cancer database shows promise for calculating patients' individual risk. Future work will focus on assessing the value of adding laboratory parameters and drug exposure as candidate predictors. Furthermore, we plan to assess the external validity of our proposed model.
Affiliation(s)
- V Lin
  - Center for Surgical Science, Department of Surgery, Zealand University Hospital Køge, Lykkebækvej 1, 4600, Køge, Denmark
- A Tsouchnika
  - Center for Surgical Science, Department of Surgery, Zealand University Hospital Køge, Lykkebækvej 1, 4600, Køge, Denmark
- E Allakhverdiiev
  - Center for Surgical Science, Department of Surgery, Zealand University Hospital Køge, Lykkebækvej 1, 4600, Køge, Denmark
- A W Rosen
  - Center for Surgical Science, Department of Surgery, Zealand University Hospital Køge, Lykkebækvej 1, 4600, Køge, Denmark
- M Gögenur
  - Center for Surgical Science, Department of Surgery, Zealand University Hospital Køge, Lykkebækvej 1, 4600, Køge, Denmark
- J S R Clausen
  - Center for Surgical Science, Department of Surgery, Zealand University Hospital Køge, Lykkebækvej 1, 4600, Køge, Denmark
- K B Bräuner
  - Center for Surgical Science, Department of Surgery, Zealand University Hospital Køge, Lykkebækvej 1, 4600, Køge, Denmark
- J S Walbech
  - Center for Surgical Science, Department of Surgery, Zealand University Hospital Køge, Lykkebækvej 1, 4600, Køge, Denmark
- P Rijnbeek
  - Department of Medical Informatics, Erasmus University Medical Centre, Rotterdam, The Netherlands
- I Drakos
  - Center for Surgical Science, Department of Surgery, Zealand University Hospital Køge, Lykkebækvej 1, 4600, Køge, Denmark
- I Gögenur
  - Center for Surgical Science, Department of Surgery, Zealand University Hospital Køge, Lykkebækvej 1, 4600, Køge, Denmark
17
Supporting Clinical COVID-19 Diagnosis with Routine Blood Tests Using Tree-Based Entropy Structured Self-Organizing Maps. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12105137] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2023]
Abstract
Data classification is an automatic or semi-automatic process that, utilizing artificial intelligence algorithms, learns the variable and class relationships of a dataset for use a posteriori in situations where the class result is unknown. For many years, work on this topic has been aimed at increasing the hit rates of algorithms. However, when the problem is restricted to applications in healthcare, besides the concern with performance, it is also necessary to design algorithms whose results are understandable by the specialists responsible for making the decisions. Among the problems in the field of medicine, a current focus is related to COVID-19: AI algorithms may contribute to early diagnosis. Among the available COVID-19 data, the blood test is a typical procedure performed when a patient presents to the hospital, and its use in diagnosis can reduce the need for other diagnostic tests that can impact the detection time and add to costs. In this work, we propose using a self-organizing map (SOM) to discover attributes in blood test examinations that are relevant for COVID-19 diagnosis. We applied SOM and an entropy calculation in the definition of a hierarchical, semi-supervised and explainable model named TESSOM (tree-based entropy-structured self-organizing maps), whose main feature is enhancing the investigation of groups of cases with high levels of class overlap, as far as the diagnostic outcome is concerned. Framing the TESSOM algorithm in the context of explainable artificial intelligence (XAI) makes it possible to explain the results to an expert in a simplified way. It is demonstrated in the paper that the use of the TESSOM algorithm to identify attributes of blood tests can help with the identification of COVID-19 cases. It provides a performance increase of 1.489% in multiple scenarios when analyzing 2207 cases from three hospitals in the state of São Paulo, Brazil. This work is a starting point for researchers to identify relevant attributes of blood tests for COVID-19 and to support the diagnosis of other diseases.
18
Jung H, Yoo S, Kim S, Heo E, Kim B, Lee HY, Hwang H. Patient-Level Fall Risk Prediction Using the Observational Medical Outcomes Partnership's Common Data Model: Pilot Feasibility Study. JMIR Med Inform 2022; 10:e35104. [PMID: 35275076 PMCID: PMC8957002 DOI: 10.2196/35104] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/22/2021] [Revised: 01/02/2022] [Accepted: 01/31/2022] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Falls in acute care settings threaten patients' safety. Researchers have been developing fall risk prediction models and exploring risk factors to provide evidence-based fall prevention practices; however, such efforts are hindered by insufficient samples, limited covariates, and a lack of standardized methodologies that aid study replication. OBJECTIVE The objectives of this study were to (1) convert fall-related electronic health record data into the standardized Observational Medical Outcomes Partnership (OMOP) common data model format and (2) develop models that predict fall risk during 2 time periods. METHODS As a pilot feasibility test, we converted fall-related electronic health record data (nursing notes, fall risk assessment sheet, patient acuity assessment sheet, and clinical observation sheet) into the standardized OMOP common data model format using an extraction, transformation, and load process. We developed fall risk prediction models for 2 time periods (within 7 days of admission and during the entire hospital stay) using 2 algorithms (least absolute shrinkage and selection operator logistic regression and random forest). RESULTS In total, 6277 nursing statements, 747,049,486 clinical observation sheet records, 1,554,775 fall risk scores, and 5,685,011 patient acuity scores were converted into the OMOP common data model format. All our models (area under the receiver operating characteristic curve 0.692-0.726) performed better than the Hendrich II Fall Risk Model. Patient acuity score, fall history, age ≥60 years, movement disorder, and central nervous system agents were the most important predictors in the logistic regression models. CONCLUSIONS To enhance model performance further, we are currently converting all nursing records into the OMOP common data model format, which will then be included in the models. Thus, in the near future, the performance of fall risk prediction models could be improved through the application of abundant nursing records and external validation.
Affiliation(s)
- Hyesil Jung
  - Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Sooyoung Yoo
  - Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Seok Kim
  - Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Eunjeong Heo
  - Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Borham Kim
  - Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Ho-Young Lee
  - Office of eHealth Research and Business, Seoul National University Bundang Hospital, Seongnam-si, Republic of Korea
- Hee Hwang
  - Kakao Healthcare Company-In-Company, Seongnam-si, Republic of Korea
|
19
|
Yang C, Kors JA, Ioannou S, John LH, Markus AF, Rekkas A, de Ridder MAJ, Seinen TM, Williams RD, Rijnbeek PR. Trends in the conduct and reporting of clinical prediction model development and validation: a systematic review. J Am Med Inform Assoc 2022; 29:983-989. [PMID: 35045179 PMCID: PMC9006694 DOI: 10.1093/jamia/ocac002] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Received: 10/22/2021] [Revised: 12/01/2021] [Accepted: 01/07/2022] [Indexed: 12/23/2022] Open
Abstract
Objectives This systematic review aims to provide further insights into the conduct and reporting of clinical prediction model development and validation over time. We focus on assessing the reporting of information necessary to enable external validation by other investigators. Materials and Methods We searched Embase, Medline, Web-of-Science, Cochrane Library, and Google Scholar to identify studies that developed 1 or more multivariable prognostic prediction models using electronic health record (EHR) data published in the period 2009–2019. Results We identified 422 studies that developed a total of 579 clinical prediction models using EHR data. We observed a steep increase over the years in the number of developed models. The percentage of models externally validated in the same paper remained at around 10%. Throughout 2009–2019, for both the target population and the outcome definitions, code lists were provided for less than 20% of the models. For about half of the models that were developed using regression analysis, the final model was not completely presented. Discussion Overall, we observed limited improvement over time in the conduct and reporting of clinical prediction model development and validation. In particular, the prediction problem definition was often not clearly reported, and the final model was often not completely presented. Conclusion Improvement in the reporting of information necessary to enable external validation by other investigators is still urgently needed to increase clinical adoption of developed models.
Affiliation(s)
- Cynthia Yang
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Jan A Kors
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Solomon Ioannou
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Luis H John
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Aniek F Markus
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Alexandros Rekkas
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Maria A J de Ridder
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Tom M Seinen
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Ross D Williams
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Peter R Rijnbeek
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
|
20
|
Seinen TM, Fridgeirsson EA, Ioannou S, Jeannetot D, John LH, Kors JA, Markus AF, Pera V, Rekkas A, Williams RD, Yang C, van Mulligen EM, Rijnbeek PR. Use of unstructured text in prognostic clinical prediction models: a systematic review. J Am Med Inform Assoc 2022; 29:1292-1302. [PMID: 35475536 PMCID: PMC9196702 DOI: 10.1093/jamia/ocac058] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 12/24/2021] [Revised: 03/06/2022] [Accepted: 04/11/2022] [Indexed: 11/29/2022] Open
Abstract
Objective This systematic review aims to assess how information from unstructured text is used to develop and validate clinical prognostic prediction models. We summarize the prediction problems and methodological landscape and determine whether using text data in addition to more commonly used structured data improves the prediction performance. Materials and Methods We searched Embase, MEDLINE, Web of Science, and Google Scholar to identify studies that developed prognostic prediction models using information extracted from unstructured text in a data-driven manner, published in the period from January 2005 to March 2021. Data items were extracted and analyzed, and a meta-analysis of the model performance was carried out to assess the added value of text to structured-data models. Results We identified 126 studies that described 145 clinical prediction problems. Combining text and structured data improved model performance, compared with using only text or only structured data. In these studies, a wide variety of dense and sparse numeric text representations were combined with both deep learning and more traditional machine learning methods. External validation, public availability, and attention to the explainability of the developed models were limited. Conclusion The use of unstructured text in the development of prognostic prediction models has been found beneficial in addition to structured data in most studies. The text data are a source of valuable information for prediction model development and should not be neglected. We suggest a future focus on explainability and external validation of the developed models, promoting robust and trustworthy prediction models in clinical practice.
Affiliation(s)
- Tom M Seinen
  - Corresponding Author: Tom M. Seinen, MSc, Department of Medical Informatics, Erasmus University Medical Center, Molewaterplein 40, 3015 GD Rotterdam, The Netherlands
- Egill A Fridgeirsson
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Solomon Ioannou
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Daniel Jeannetot
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Luis H John
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Jan A Kors
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Aniek F Markus
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Victor Pera
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Alexandros Rekkas
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Ross D Williams
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Cynthia Yang
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Erik M van Mulligen
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
- Peter R Rijnbeek
  - Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
|
21
|
Applying Machine Learning in Distributed Data Networks for Pharmacoepidemiologic and Pharmacovigilance Studies: Opportunities, Challenges, and Considerations. Drug Saf 2022; 45:493-510. [PMID: 35579813 PMCID: PMC9112258 DOI: 10.1007/s40264-022-01158-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 02/13/2022] [Indexed: 01/28/2023]
Abstract
Increasing availability of electronic health databases capturing real-world experiences with medical products has garnered much interest in their use for pharmacoepidemiologic and pharmacovigilance studies. The traditional practice of having numerous groups use single databases to accomplish similar tasks and address common questions about medical products can be made more efficient through well-coordinated multi-database studies, greatly facilitated through distributed data network (DDN) architectures. Access to larger amounts of electronic health data within DDNs has created a growing interest in using data-adaptive machine learning (ML) techniques that can automatically model complex associations in high-dimensional data with minimal human guidance. However, the siloed storage and diverse nature of the databases in DDNs create unique challenges for using ML. In this paper, we discuss opportunities, challenges, and considerations for applying ML in DDNs for pharmacoepidemiologic and pharmacovigilance studies. We first discuss major types of activities performed by DDNs and how ML may be used. Next, we discuss practical data-related factors influencing how DDNs work in practice. We then combine these discussions and jointly consider how opportunities for ML are affected by practical data-related factors for DDNs, leading to several challenges. We present different approaches for addressing these challenges and highlight efforts that real-world DDNs have taken or are currently taking to help mitigate them. Despite these challenges, the time is ripe for the emerging interest to use ML in DDNs, and the utility of these data-adaptive modeling techniques in pharmacoepidemiologic and pharmacovigilance studies will likely continue to increase in the coming years.
|