1
Zhang C, Yu X, Zhang B. Assessment of supervised longitudinal learning methods: Insights from predicting low birth weight and very low birth weight using prenatal ultrasound measurements. Comput Biol Med 2024; 182:109084. [PMID: 39250874 DOI: 10.1016/j.compbiomed.2024.109084] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2024] [Revised: 08/17/2024] [Accepted: 08/28/2024] [Indexed: 09/11/2024]
Abstract
BACKGROUND This study aimed to assess the efficacy of various supervised longitudinal learning approaches, comparing traditional statistical models and machine learning algorithms for prediction with longitudinal data. The primary objectives were to evaluate the predictive performance of different supervised longitudinal learning methods for low birth weight (LBW) and very low birth weight (VLBW) based on prenatal ultrasound measurements. Additionally, the study sought to extract interpretable risk features for disease prediction. METHODS The evaluation involved benchmarking the performance of longitudinal models against conventional machine learning methods. Classification accuracy for LBW and VLBW at birth, as well as prediction accuracy for birth weight using prenatal sonographic ultrasound measurements, were assessed. RESULTS Among the learning approaches we investigated in this study, the longitudinal machine learning approach, specifically, the mixed effect random forest (MERF), delivered the overall best performance in predicting birthweights and classifying LBW/VLBW disease status. CONCLUSION The MERF combined the power of advanced machine learning algorithms to accommodate the inherent within-individual dependence in the observed data, delivering satisfactory performance in predicting the birthweight and classifying LBW/VLBW disease status. The study emphasized the importance of incorporating previous ultrasound measurements and considering correlations between repeated measurements for accurate prediction. The interpretable trees algorithm used for risk feature extraction proved reliable and applicable to other learning algorithms. These findings underscored the potential of longitudinal learning methods in improving birth weight prediction and highlighted the relevance of consistent risk features in line with established literature.
Affiliation(s)
- Cancan Zhang
  - Division of General Medicine, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Xiufan Yu
  - Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA
- Bo Zhang
  - Department of Neurology and Biostatistics and Research Design Center, Children's Hospital, Harvard Medical School, Boston, 02115, Massachusetts, USA

2
Velez-Arce A, Huang K, Li MM, Lin X, Gao W, Fu T, Kellis M, Pentelute BL, Zitnik M. TDC-2: Multimodal Foundation for Therapeutic Science. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.12.598655. [PMID: 38948789 PMCID: PMC11212894 DOI: 10.1101/2024.06.12.598655] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/02/2024]
Abstract
Therapeutics Data Commons (tdcommons.ai) is an open science initiative with unified datasets, AI models, and benchmarks to support research across therapeutic modalities and drug discovery and development stages. The Commons 2.0 (TDC-2) is a comprehensive overhaul of Therapeutic Data Commons to catalyze research in multimodal models for drug discovery by unifying single-cell biology of diseases, biochemistry of molecules, and effects of drugs through multimodal datasets, AI-powered API endpoints, new multimodal tasks and model frameworks, and comprehensive benchmarks. TDC-2 introduces over 1,000 multimodal datasets spanning approximately 85 million cells, pre-calculated embeddings from 5 state-of-the-art single-cell models, and a biomedical knowledge graph. TDC-2 drastically expands the coverage of ML tasks across therapeutic pipelines and 10+ new modalities, spanning but not limited to single-cell gene expression data, clinical trial data, peptide sequence data, peptidomimetics protein-peptide interaction data regarding newly discovered ligands derived from AS-MS spectroscopy, novel 3D structural data for proteins, and cell-type-specific protein-protein interaction networks at single-cell resolution. TDC-2 introduces multimodal data access under an API-first design using the model-view-controller paradigm. TDC-2 introduces 7 novel ML tasks with fine-grained biological contexts: contextualized drug-target identification, single-cell chemical/genetic perturbation response prediction, protein-peptide binding affinity prediction task, and clinical trial outcome prediction task, which introduce antigen-processing-pathway-specific, cell-type-specific, peptide-specific, and patient-specific biological contexts. TDC-2 also releases benchmarks evaluating 15+ state-of-the-art models across 5+ new learning tasks evaluating models on diverse biological contexts and sampling approaches. Among these, TDC-2 provides the first benchmark for context-specific learning. 
TDC-2, to our knowledge, is also the first to introduce a protein-peptide binding interaction benchmark.

3
Jiang C, Thompson M, Wallace M. Estimating dynamic treatment regimes for ordinal outcomes with household interference: Application in household smoking cessation. Stat Methods Med Res 2024; 33:981-995. [PMID: 38623615 PMCID: PMC11334379 DOI: 10.1177/09622802241242313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2024]
Abstract
The focus of precision medicine is on decision support, often in the form of dynamic treatment regimes, which are sequences of decision rules. At each decision point, the decision rules determine the next treatment according to the patient's baseline characteristics, the information on treatments and responses accrued by that point, and the patient's current health status, including symptom severity and other measures. However, dynamic treatment regime estimation with ordinal outcomes is rarely studied, and rarer still in the context of interference - where one patient's treatment may affect another's outcome. In this paper, we introduce the weighted proportional odds model: a regression based, approximate doubly-robust approach to single-stage dynamic treatment regime estimation for ordinal outcomes. This method also accounts for the possibility of interference between individuals sharing a household through the use of covariate balancing weights derived from joint propensity scores. Examining different types of balancing weights, we verify the approximate double robustness of weighted proportional odds model with our adjusted weights via simulation studies. We further extend weighted proportional odds model to multi-stage dynamic treatment regime estimation with household interference, namely dynamic weighted proportional odds model. Lastly, we demonstrate our proposed methodology in the analysis of longitudinal survey data from the Population Assessment of Tobacco and Health study, which motivates this work. Furthermore, considering interference, we provide optimal treatment strategies for households to achieve smoking cessation of the pair in the household.
Affiliation(s)
- Cong Jiang
  - Faculty of Pharmacy, Université de Montréal, Montreal, Canada
- Mary Thompson
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada
- Michael Wallace
  - Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Canada

4
Eghbali-Zarch M, Masoud S. Application of machine learning in affordable and accessible insulin management for type 1 and 2 diabetes: A comprehensive review. Artif Intell Med 2024; 151:102868. [PMID: 38632030 DOI: 10.1016/j.artmed.2024.102868] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2023] [Revised: 03/03/2024] [Accepted: 04/03/2024] [Indexed: 04/19/2024]
Abstract
Proper insulin management is vital for maintaining stable blood sugar levels and preventing complications associated with diabetes. However, the soaring costs of insulin present significant challenges to ensuring affordable management. This paper conducts a comprehensive review of current literature on the application of machine learning (ML) in insulin management for diabetes patients, particularly focusing on enhancing affordability and accessibility within the United States. The review encompasses various facets of insulin management, including dosage calculation and response, prediction of blood glucose and insulin sensitivity, initial insulin estimation, resistance prediction, treatment adherence, complications, hypoglycemia prediction, and lifestyle modifications. Additionally, the study identifies key limitations in the utilization of ML within the insulin management literature and suggests future research directions aimed at furthering accessible and affordable insulin treatments. These proposed directions include exploring insurance coverage, optimizing insulin type selection, assessing the impact of biosimilar insulin and market competition, considering mental health factors, evaluating insulin delivery options, addressing cost-related issues affecting insulin usage and adherence, and selecting appropriate patient cost-sharing programs. By examining the potential of ML in addressing insulin management affordability and accessibility, this work aims to envision improved and cost-effective insulin management practices. It not only highlights existing research gaps but also offers insights into future directions, guiding the development of innovative solutions that have the potential to revolutionize insulin management and benefit patients reliant on this life-saving treatment.
Affiliation(s)
- Maryam Eghbali-Zarch
  - Department of Industrial and Systems Engineering, Wayne State University, Detroit, MI 48202, USA
- Sara Masoud
  - Department of Industrial and Systems Engineering, Wayne State University, Detroit, MI 48202, USA

5
Naik K, Goyal RK, Foschini L, Chak CW, Thielscher C, Zhu H, Lu J, Lehár J, Pacanoswki MA, Terranova N, Mehta N, Korsbo N, Fakhouri T, Liu Q, Gobburu J. Current Status and Future Directions: The Application of Artificial Intelligence/Machine Learning for Precision Medicine. Clin Pharmacol Ther 2024; 115:673-686. [PMID: 38103204 DOI: 10.1002/cpt.3152] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 10/16/2023] [Accepted: 11/28/2023] [Indexed: 12/18/2023]
Abstract
Technological innovations, such as artificial intelligence (AI) and machine learning (ML), have the potential to expedite the goal of precision medicine, especially when combined with increased capacity for voluminous data from multiple sources and expanded therapeutic modalities; however, they also present several challenges. In this communication, we first discuss the goals of precision medicine, and contextualize the use of AI in precision medicine by showcasing innovative applications (e.g., prediction of tumor growth and overall survival, biomarker identification using biomedical images, and identification of patient population for clinical practice) which were presented during the February 2023 virtual public workshop entitled "Application of Artificial Intelligence and Machine Learning for Precision Medicine," hosted by the US Food and Drug Administration (FDA) and University of Maryland Center of Excellence in Regulatory Science and Innovation (M-CERSI). Next, we put forward challenges brought about by the multidisciplinary nature of AI, particularly highlighting the need for AI to be trustworthy. To address such challenges, we subsequently note practical approaches, viz., differential privacy, synthetic data generation, and federated learning. The proposed strategies - some of which are highlighted presentations from the workshop - are for the protection of personal information and intellectual property. In addition, methods such as the risk-based management approach and the need for an agile regulatory ecosystem are discussed. Finally, we lay out a call for action that includes sharing of data and algorithms, development of regulatory guidance documents, and pooling of expertise from a broad-spectrum of stakeholders to enhance the application of AI in precision medicine.
Affiliation(s)
- Kunal Naik
  - Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- Rahul K Goyal
  - Center for Translational Medicine, University of Maryland School of Pharmacy, Baltimore, Maryland, USA
- Hao Zhu
  - Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- James Lu
  - Modeling & Simulation/Clinical Pharmacology, Genentech Inc., South San Francisco, California, USA
- Michael A Pacanoswki
  - Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- Nadia Terranova
  - Quantitative Pharmacology, Ares Trading S.A. (an affiliate of Merck KGaA, Darmstadt, Germany), Lausanne, Switzerland
- Neha Mehta
  - Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- Tala Fakhouri
  - Office of Medical Policy, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- Qi Liu
  - Office of Clinical Pharmacology, Office of Translational Sciences, Center for Drug Evaluation and Research, US Food and Drug Administration, Silver Spring, Maryland, USA
- Jogarao Gobburu
  - Center for Translational Medicine, University of Maryland School of Pharmacy, Baltimore, Maryland, USA

6
Somé NH, Noormohammadpour P, Lange S. The use of machine learning on administrative and survey data to predict suicidal thoughts and behaviors: a systematic review. Front Psychiatry 2024; 15:1291362. [PMID: 38501090 PMCID: PMC10944962 DOI: 10.3389/fpsyt.2024.1291362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2023] [Accepted: 02/12/2024] [Indexed: 03/20/2024] Open
Abstract
Background Machine learning is a promising tool in the area of suicide prevention due to its ability to combine the effects of multiple risk factors and complex interactions. The power of machine learning has led to an influx of studies on suicide prediction, as well as a few recent reviews. Our study distinguished between data sources and reported the most important predictors of suicide outcomes identified in the literature. Objective Our study aimed to identify studies that applied machine learning techniques to administrative and survey data, summarize performance metrics reported in those studies, and enumerate the important risk factors of suicidal thoughts and behaviors identified. Methods A systematic literature search of PubMed, Medline, Embase, PsycINFO, Web of Science, Cumulative Index to Nursing and Allied Health Literature (CINAHL), and Allied and Complementary Medicine Database (AMED) to identify all studies that have used machine learning to predict suicidal thoughts and behaviors using administrative and survey data was performed. The search was conducted for articles published between January 1, 2019 and May 11, 2022. In addition, all articles identified in three recently published systematic reviews (the last of which included studies up until January 1, 2019) were retained if they met our inclusion criteria. The predictive power of machine learning methods in predicting suicidal thoughts and behaviors was explored using box plots to summarize the distribution of the area under the receiver operating characteristic curve (AUC) values by machine learning method and suicide outcome (i.e., suicidal thoughts, suicide attempt, and death by suicide). Mean AUCs with 95% confidence intervals (CIs) were computed for each suicide outcome by study design, data source, total sample size, sample size of cases, and machine learning methods employed. The most important risk factors were listed. 
Results The search strategy identified 2,200 unique records, of which 104 articles met the inclusion criteria. Machine learning algorithms achieved good prediction of suicidal thoughts and behaviors (i.e., an AUC between 0.80 and 0.89); however, their predictive power appears to differ across suicide outcomes. The boosting algorithms achieved good prediction of suicidal thoughts, death by suicide, and all suicide outcomes combined, while neural network algorithms achieved good prediction of suicide attempts. The risk factors for suicidal thoughts and behaviors differed depending on the data source and the population under study. Conclusion The predictive utility of machine learning for suicidal thoughts and behaviors largely depends on the approach used. The findings of the current review should prove helpful in preparing future machine learning models using administrative and survey data. Systematic review registration https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42022333454 identifier CRD42022333454.
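The Methods above summarize AUC values as a mean with a 95% confidence interval per suicide outcome. A minimal sketch of that kind of summary follows. It assumes a normal approximation (z = 1.96) on the standard error of the mean, which the abstract does not specify. The function names and the example AUC values are hypothetical and are not taken from the review.

```python
import math
import statistics
from collections import defaultdict


def mean_auc_ci(aucs, z=1.96):
    """Mean AUC with a normal-approximation confidence interval.

    Assumption: the interval is mean +/- z * (sample SD / sqrt(n)).
    The review may have used a different interval construction.
    """
    if len(aucs) < 2:
        raise ValueError("need at least two AUC values to estimate spread")
    mean = statistics.mean(aucs)
    se = statistics.stdev(aucs) / math.sqrt(len(aucs))
    return mean, mean - z * se, mean + z * se


def summarize_by_outcome(records):
    """Group (outcome, auc) pairs and summarize each outcome separately."""
    grouped = defaultdict(list)
    for outcome, auc in records:
        grouped[outcome].append(auc)
    return {outcome: mean_auc_ci(values) for outcome, values in grouped.items()}


# Hypothetical AUC values, for illustration only.
example = [
    ("suicidal thoughts", 0.80),
    ("suicidal thoughts", 0.84),
    ("suicidal thoughts", 0.88),
    ("suicide attempt", 0.78),
    ("suicide attempt", 0.82),
]
for outcome, (mean, low, high) in summarize_by_outcome(example).items():
    print(f"{outcome}: mean AUC {mean:.3f} (95% CI {low:.3f} to {high:.3f})")
```

With only a handful of studies per group, a t-based interval would be wider than this normal approximation, so the choice of interval is worth stating alongside the result.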
Affiliation(s)
- Nibene H. Somé
  - Institute for Mental Health Policy Research, Centre for Addiction and Mental Health, Toronto, ON, Canada
  - Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, Canada
  - Department of Epidemiology and Biostatistics, Schulich School of Medicine & Dentistry, Western University, London, ON, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Pardis Noormohammadpour
  - Institute for Mental Health Policy Research, Centre for Addiction and Mental Health, Toronto, ON, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada
- Shannon Lange
  - Institute for Mental Health Policy Research, Centre for Addiction and Mental Health, Toronto, ON, Canada
  - Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, Canada
  - Department of Psychiatry, University of Toronto, Toronto, ON, Canada

7
McCoy RG, Faust L, Heien HC, Patel S, Caffo B, Ngufor C. Longitudinal trajectories of glycemic control among U.S. Adults with newly diagnosed diabetes. Diabetes Res Clin Pract 2023; 205:110989. [PMID: 37918637 PMCID: PMC10842883 DOI: 10.1016/j.diabres.2023.110989] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2023] [Revised: 09/27/2023] [Accepted: 10/31/2023] [Indexed: 11/04/2023]
Abstract
AIMS To identify longitudinal trajectories of glycemic control among adults with newly diagnosed diabetes, overall and by diabetes type. METHODS We analyzed claims data from OptumLabs® Data Warehouse for 119,952 adults newly diagnosed with diabetes between 2005 and 2018. We applied a novel Mixed Effects Machine Learning model to identify longitudinal trajectories of hemoglobin A1c (HbA1c) over 3 years of follow-up and used multinomial regression to characterize factors associated with each trajectory. RESULTS The study population comprised 119,952 adults with newly diagnosed diabetes, including 696 (0.58%) with type 1 diabetes. Among patients with type 1 diabetes, 52.6% were diagnosed at very high HbA1c, partially improved, but never achieved control; 32.5% were diagnosed at low HbA1c and deteriorated over time; and 14.9% had stable low HbA1c. Among patients with type 2 diabetes, 67.7% had stable low HbA1c; 14.4% were diagnosed at very high HbA1c, partially improved, but never achieved control; 10.0% were diagnosed at moderately high HbA1c and deteriorated over time; and 4.9% were diagnosed at moderately high HbA1c and improved over time. CONCLUSIONS Claims data identified distinct longitudinal trajectories of HbA1c after diabetes diagnosis, which can be used to anticipate challenges and individualize care plans to improve glycemic control.
Affiliation(s)
- Rozalina G McCoy
  - Division of Endocrinology, Diabetes, & Nutrition, Department of Medicine, University of Maryland School of Medicine, Baltimore, MD, United States; University of Maryland Institute for Health Computing, Bethesda, MD, United States; OptumLabs, Eden Prairie, MN, United States; Mayo Clinic Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Rochester, MN, United States
- Louis Faust
  - Mayo Clinic Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Rochester, MN, United States
- Herbert C Heien
  - Mayo Clinic Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Rochester, MN, United States
- Shrinath Patel
  - Mayo Clinic Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Rochester, MN, United States
- Brian Caffo
  - Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, United States
- Che Ngufor
  - Mayo Clinic Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Rochester, MN, United States; Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, United States

8
Salditt M, Nestler S. Parametric and nonparametric propensity score estimation in multilevel observational studies. Stat Med 2023; 42:4147-4176. [PMID: 37532119 DOI: 10.1002/sim.9852] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2023] [Revised: 05/16/2023] [Accepted: 07/10/2023] [Indexed: 08/04/2023]
Abstract
There has been growing interest in using nonparametric machine learning approaches for propensity score estimation in order to foster robustness against misspecification of the propensity score model. However, the vast majority of studies focused on single-level data settings, and research on nonparametric propensity score estimation in clustered data settings is scarce. In this article, we extend existing research by describing a general algorithm for incorporating random effects into a machine learning model, which we implemented for generalized boosted modeling (GBM). In a simulation study, we investigated the performance of logistic regression, GBM, and Bayesian additive regression trees for inverse probability of treatment weighting (IPW) when the data are clustered, the treatment exposure mechanism is nonlinear, and unmeasured cluster-level confounding is present. For each approach, we compared fixed and random effects propensity score models to single-level models and evaluated their use in both marginal and clustered IPW. We additionally investigated the performance of the standard Super Learner and the balance Super Learner. The results showed that when there was no unmeasured confounding, logistic regression resulted in moderate bias in both marginal and clustered IPW, whereas the nonparametric approaches were unbiased. In presence of cluster-level confounding, fixed and random effects models greatly reduced bias compared to single-level models in marginal IPW, with fixed effects GBM and fixed effects logistic regression performing best. Finally, clustered IPW was overall preferable to marginal IPW and the balance Super Learner outperformed the standard Super Learner, though neither worked as well as their best candidate model.
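The contrast between marginal and clustered IPW in the abstract above can be illustrated with a small sketch. This is not the authors' implementation. It assumes a normalized (Hajek-type) weighted difference in means, and, for the clustered version, a cluster-size-weighted average of within-cluster estimates, which is one common convention. All function names and the toy data are hypothetical.

```python
from collections import defaultdict


def weighted_mean_difference(y, t, ps):
    """Normalized IPW contrast: weights 1/ps for treated, 1/(1-ps) for controls."""
    w1 = [ti / p for ti, p in zip(t, ps)]
    w0 = [(1 - ti) / (1 - p) for ti, p in zip(t, ps)]
    mean_treated = sum(w * yi for w, yi in zip(w1, y)) / sum(w1)
    mean_control = sum(w * yi for w, yi in zip(w0, y)) / sum(w0)
    return mean_treated - mean_control


def marginal_ipw(y, t, ps):
    """Pool all units and ignore cluster membership."""
    return weighted_mean_difference(y, t, ps)


def clustered_ipw(y, t, ps, cluster):
    """Estimate within each cluster, then average (assumed: weighted by cluster size)."""
    members = defaultdict(list)
    for i, c in enumerate(cluster):
        members[c].append(i)
    total, n_used = 0.0, 0
    for idx in members.values():
        t_c = [t[i] for i in idx]
        if 0 < sum(t_c) < len(t_c):  # skip clusters lacking one treatment arm
            est = weighted_mean_difference(
                [y[i] for i in idx], t_c, [ps[i] for i in idx]
            )
            total += len(idx) * est
            n_used += len(idx)
    if n_used == 0:
        raise ValueError("no cluster contains both treated and control units")
    return total / n_used


# Toy data: cluster A has high outcomes and mostly treated units, cluster B the
# reverse. Propensity scores of 0.5 mimic a single-level model that missed the
# cluster effect. The within-cluster treatment effect is 2 in both clusters.
y = [12, 12, 12, 10, 2, 0, 0, 0]
t = [1, 1, 1, 0, 1, 0, 0, 0]
ps = [0.5] * 8
cluster = ["A"] * 4 + ["B"] * 4
print(marginal_ipw(y, t, ps))            # 7.0, distorted by the cluster effect
print(clustered_ipw(y, t, ps, cluster))  # 2.0, the within-cluster effect
```

In this toy setting the pooled estimate absorbs the between-cluster difference, while the within-cluster comparison removes it, which is in line with the abstract's finding that clustered IPW was overall preferable.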
Affiliation(s)
- Marie Salditt
  - Institute of Psychology, University of Münster, Münster, Germany
- Steffen Nestler
  - Institute of Psychology, University of Münster, Münster, Germany

9
Giuffrè M, Shung DL. Harnessing the power of synthetic data in healthcare: innovation, application, and privacy. NPJ Digit Med 2023; 6:186. [PMID: 37813960 PMCID: PMC10562365 DOI: 10.1038/s41746-023-00927-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 13.0] [Received: 04/15/2023] [Accepted: 09/14/2023] [Indexed: 10/11/2023] Open
Abstract
Data-driven decision-making in modern healthcare underpins innovation and predictive analytics in public health and clinical research. Synthetic data has shown promise in finance and economics to improve risk assessment, portfolio optimization, and algorithmic trading. However, higher stakes, potential liabilities, and healthcare practitioner distrust make clinical use of synthetic data difficult. This paper explores the potential benefits and limitations of synthetic data in the healthcare analytics context. We begin with real-world healthcare applications of synthetic data that inform government policy, enhance data privacy, and augment datasets for predictive analytics. We then preview future applications of synthetic data in the emergent field of digital twin technology. We explore the issues of data quality and data bias in synthetic data, which can limit applicability across different applications in the clinical context, and privacy concerns stemming from data misuse and risk of re-identification. Finally, we evaluate the role of regulatory agencies in promoting transparency and accountability and propose strategies for risk mitigation such as Differential Privacy (DP) and a dataset chain of custody to maintain data integrity, traceability, and accountability. Synthetic data can improve healthcare, but measures to protect patient well-being and maintain ethical standards are key to promoting responsible use.
Affiliation(s)
- Mauro Giuffrè
  - Department of Medicine (Digestive Diseases), Yale School of Medicine, Yale University, New Haven, CT, USA
  - Department of Medical, Surgical and Health Science, University of Trieste, Trieste, Italy
- Dennis L Shung
  - Department of Medicine (Digestive Diseases), Yale School of Medicine, Yale University, New Haven, CT, USA

10
Salditt M, Humberg S, Nestler S. Gradient Tree Boosting for Hierarchical Data. MULTIVARIATE BEHAVIORAL RESEARCH 2023; 58:911-937. [PMID: 36602080 DOI: 10.1080/00273171.2022.2146638] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]
Abstract
Gradient tree boosting is a powerful machine learning technique that has shown good performance in predicting a variety of outcomes. However, when applied to hierarchical (e.g., longitudinal or clustered) data, the predictive performance of gradient tree boosting may be harmed by ignoring the hierarchical structure, and may be improved by accounting for it. Tree-based methods such as regression trees and random forests have already been extended to hierarchical data settings by combining them with the linear mixed effects model (MEM). In the present article, we add to this literature by proposing two algorithms to estimate a combination of the MEM and gradient tree boosting. We report on two simulation studies that (i) investigate the predictive performance of the two MEM boosting algorithms and (ii) compare them to standard gradient tree boosting, standard random forest, and other existing methods for hierarchical data (MEM, MEM random forests, model-based boosting, Bayesian additive regression trees [BART]). We found substantial improvements in the predictive performance of our MEM boosting algorithms over standard boosting when the random effects were non-negligible. MEM boosting as well as BART showed a predictive performance similar to the correctly specified MEM (i.e., the benchmark model), and overall outperformed the model-based boosting and random forest approaches.
Affiliation(s)
- Marie Salditt
  - Department of Psychology, University of Münster, Münster, Germany
- Sarah Humberg
  - Department of Psychology, University of Münster, Münster, Germany
- Steffen Nestler
  - Department of Psychology, University of Münster, Münster, Germany

11
Mangino AA, Bolin JH, Finch WH. Fixed Effects or Mixed Effects Classifiers? Evidence From Simulated and Archival Data. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 2023; 83:710-739. [PMID: 37398843 PMCID: PMC10311958 DOI: 10.1177/00131644221108180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
This study seeks to compare fixed and mixed effects models for the purposes of predictive classification in the presence of multilevel data. The first part of the study utilizes a Monte Carlo simulation to compare fixed and mixed effects logistic regression and random forests. An applied examination of the prediction of student retention in the public-use U.S. PISA data set was considered to verify the simulation findings. Results of this study indicate fixed effects models performed comparably with mixed effects models across both the simulation and PISA examinations. Results broadly suggest that researchers should be cognizant of the type of predictors and data structure being used, as these factors carried more weight than did the model type.
Affiliation(s)
- Anthony A. Mangino
  - Ball State University, Muncie, IN, USA
  - University of Kentucky, Lexington, USA

12
Abstract
Propensity score matching is commonly used in observational studies to control for confounding and estimate the causal effects of a treatment or exposure. Frequently, in observational studies data are clustered, which adds to the complexity of using propensity score techniques. In this article, we give an overview of propensity score matching methods for clustered data, and highlight how propensity score matching can be used to account for not just measured confounders, but also unmeasured cluster level confounders. We also consider using machine learning methods such as generalized boosted models to estimate the propensity score and show that accounting for clustering when using these methods can greatly reduce the performance, particularly when there are a large number of clusters and a small number of subjects per cluster. In order to get around this we highlight scenarios where it may be possible to control for measured covariates using propensity score matching, while using fixed effects regression in the outcome model to control for cluster level covariates. Using simulation studies we compare the performance of different propensity score matching methods for clustered data across a number of different settings. Finally, as an illustrative example we apply propensity score matching methods for clustered data to study the causal effect of aspirin on hearing deterioration using data from the conservation of hearing study.
Affiliation(s)
- Benjamin Langworthy
- Department of Biostatistics, Harvard T.H. Chan School of Public
Health, Boston, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public
Health, Boston, MA, USA
| | - Yujie Wu
- Department of Biostatistics, Harvard T.H. Chan School of Public
Health, Boston, MA, USA
| | - Molin Wang
- Department of Biostatistics, Harvard T.H. Chan School of Public
Health, Boston, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public
Health, Boston, MA, USA
| |
|
13
|
Pham K, Ray AW, Fernstrum AJ, Alfahmy A, Ray S, Hijaz AK, Ju M, Sheyn D. Development of a machine learning-based predictive model for prediction of success or failure of medical management for benign prostatic hyperplasia. Neurourol Urodyn 2023; 42:707-717. [PMID: 36826466 DOI: 10.1002/nau.25162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Revised: 01/24/2023] [Accepted: 02/11/2023] [Indexed: 02/25/2023]
Abstract
OBJECTIVE To develop a novel predictive model for identifying patients who will and will not respond to the medical management of benign prostatic hyperplasia (BPH). METHODS Using data from the Medical Therapy of Prostatic Symptoms (MTOPS) study, several models were constructed using an initial data set of 2172 patients with BPH who were treated with doxazosin (Group 1), finasteride (Group 2), and combination therapy (Group 3). K-fold stratified cross-validation was performed on each group. Within each group, feature selection and dimensionality reduction using nonnegative matrix factorization (NMF) were performed based on the training data before several machine learning algorithms were tested; the most accurate models, boosted support vector machines (SVMs), were selected for further refinement. The area under the receiver operating characteristic curve (AUC) was calculated and used to determine the optimal operating points. Patients were classified as treatment failures or responders, based on whether they fell below or above the AUC threshold for each group and for the whole data set. RESULTS For the entire cohort, the AUC for the boosted SVM model was 0.698. For patients in Group 1, the AUC was 0.729; for Group 2, the AUC was 0.719; and for Group 3, the AUC was 0.698. CONCLUSION Using MTOPS data, we were able to develop a prediction model with an acceptable rate of discrimination of medical management success for BPH.
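The stratified K-fold step described in the abstract can be illustrated with a small standard-library sketch. The function names, the fold count, and the toy labels are assumptions made for illustration; the study's actual code is not reproduced here. The point is that each fold keeps roughly the same class mix, and that any feature selection or NMF fit should use only the training indices of a split.

```python
from collections import defaultdict
import random


def stratified_kfold(labels, k, seed=0):
    """Return k lists of test indices with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def train_test_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold."""
    folds = stratified_kfold(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# Hypothetical labels: 1 = responder, 0 = treatment failure.
labels = [1] * 30 + [0] * 20
splits = list(train_test_splits(labels, k=5))
```

In a real pipeline, preprocessing would be fitted on `train` and only applied to `test` inside each iteration, so that no information from the held-out fold leaks into model selection.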
Affiliation(s)
- Kyle Pham
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, Ohio, USA
| | - Al W Ray
- Urology Institute, University Hospitals Cleveland Medical Center, Cleveland, Ohio, USA
| | - Austin J Fernstrum
- Urology Institute, University Hospitals Cleveland Medical Center, Cleveland, Ohio, USA
| | - Anood Alfahmy
- Urology Institute, University Hospitals Cleveland Medical Center, Cleveland, Ohio, USA
| | - Soumya Ray
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, Ohio, USA
| | - Adonis K Hijaz
- Urology Institute, University Hospitals Cleveland Medical Center, Cleveland, Ohio, USA
- Division of Female Pelvic Medicine and Reconstructive Surgery, University Hospitals Cleveland Medical Center, Cleveland, Ohio, USA
| | - Mingxuan Ju
- Department of Computer and Data Sciences, Case Western Reserve University, Cleveland, Ohio, USA
| | - David Sheyn
- Urology Institute, University Hospitals Cleveland Medical Center, Cleveland, Ohio, USA
- Division of Female Pelvic Medicine and Reconstructive Surgery, University Hospitals Cleveland Medical Center, Cleveland, Ohio, USA
| |
|
14
|
Hu J, Szymczak S. A review on longitudinal data analysis with random forest. Brief Bioinform 2023; 24:6991123. [PMID: 36653905 PMCID: PMC10025446 DOI: 10.1093/bib/bbad002] [Citation(s) in RCA: 37] [Impact Index Per Article: 37.0] [Received: 08/31/2022] [Revised: 12/12/2022] [Accepted: 12/31/2022] [Indexed: 01/20/2023] Open
Abstract
In longitudinal studies variables are measured repeatedly over time, leading to clustered and correlated observations. If the goal of the study is to develop prediction models, machine learning approaches such as the powerful random forest (RF) are often promising alternatives to standard statistical methods, especially in the context of high-dimensional data. In this paper, we review extensions of the standard RF method for the purpose of longitudinal data analysis. Extension methods are categorized according to the data structures for which they are designed. We consider both univariate and multivariate response longitudinal data and further categorize the repeated measurements according to whether the time effect is relevant. Even though most extensions are proposed for low-dimensional data, some can be applied to high-dimensional data. Information of available software implementations of the reviewed extensions is also given. We conclude with discussions on the limitations of our review and some future research directions.
Affiliation(s)
- Jianchang Hu
- Institute of Medical Biometry and Statistics, University of Lübeck, Ratzeburger Allee 160, 23562, Lübeck, Germany
| | - Silke Szymczak
- Institute of Medical Biometry and Statistics, University of Lübeck, Ratzeburger Allee 160, 23562, Lübeck, Germany
| |
|
15
|
Mosquera-Lopez C, Ramsey KL, Roquemen-Echeverri V, Jacobs PG. Modeling risk of hypoglycemia during and following physical activity in people with type 1 diabetes using explainable mixed-effects machine learning. Comput Biol Med 2023; 155:106670. [PMID: 36803791 DOI: 10.1016/j.compbiomed.2023.106670] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 08/29/2022] [Revised: 01/19/2023] [Accepted: 02/10/2023] [Indexed: 02/13/2023]
Abstract
BACKGROUND Physical activity (PA) can cause increased hypoglycemia (glucose <70 mg/dL) risk in people with type 1 diabetes (T1D). We modeled the probability of hypoglycemia during and up to 24 h following PA and identified key factors associated with hypoglycemia risk. METHODS We leveraged a free-living dataset from Tidepool comprised of glucose measurements, insulin doses, and PA data from 50 individuals with T1D (6448 sessions) for training and validating machine learning models. We also used data from the T1Dexi pilot study that contains glucose management and PA data from 20 individuals with T1D (139 sessions) for assessing the accuracy of the best performing model on an independent test dataset. We used mixed-effects logistic regression (MELR) and mixed-effects random forest (MERF) to model hypoglycemia risk around PA. We identified risk factors associated with hypoglycemia using odds ratio and partial dependence analysis for the MELR and MERF models, respectively. Prediction accuracy was measured using the area under the receiver operating characteristic curve (AUROC). RESULTS The analysis identified risk factors significantly associated with hypoglycemia during and following PA in both MELR and MERF models including glucose and body exposure to insulin at the start of PA, low blood glucose index 24 h prior to PA, and PA intensity and timing. Both models showed overall hypoglycemia risk peaking 1 h after PA and again 5-10 h after PA, which is consistent with the hypoglycemia risk pattern observed in the training dataset. Time following PA impacted hypoglycemia risk differently across different PA types. Accuracy of hypoglycemia prediction using the fixed effects of the MERF model was highest when predicting hypoglycemia during the first hour following the start of PA (validation AUROC = 0.83 and testing AUROC = 0.86) and decreased when predicting hypoglycemia in the 24 h after PA (validation AUROC = 0.66 and testing AUROC = 0.68).
CONCLUSION Hypoglycemia risk after the start of PA can be modeled using mixed-effects machine learning to identify key risk factors that may be used within decision support and insulin delivery systems. We published the population-level MERF model online for others to use.
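The AUROC figures quoted above have a simple interpretation: the probability that a randomly chosen positive case receives a higher predicted risk than a randomly chosen negative case. A minimal sketch of that definition follows; the function name and the toy scores are assumptions for illustration, not the study's code, and the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): 1 = hypoglycemia event, 0 = no event.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auroc(y_true, scores)  # 3 of 4 pairs are ordered correctly
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives some context for the 0.66 to 0.86 range reported.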
Affiliation(s)
- Clara Mosquera-Lopez
- Artificial Intelligence for Medical Systems (AIMS) Lab, Department of Biomedical Engineering, Oregon Health & Science University, Portland, Oregon, USA.
| | - Katrina L Ramsey
- Biostatistics and Design Program, Oregon Health & Science University, Portland, Oregon, USA
| | - Valentina Roquemen-Echeverri
- Artificial Intelligence for Medical Systems (AIMS) Lab, Department of Biomedical Engineering, Oregon Health & Science University, Portland, Oregon, USA
| | - Peter G Jacobs
- Artificial Intelligence for Medical Systems (AIMS) Lab, Department of Biomedical Engineering, Oregon Health & Science University, Portland, Oregon, USA
| |
|
16
|
Cakar S, Yavuz FG. Hybrid statistical and machine learning modeling of cognitive neuroscience data. J Appl Stat 2023; 51:1076-1097. [PMID: 38628450 PMCID: PMC11018039 DOI: 10.1080/02664763.2023.2176834] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2022] [Accepted: 01/31/2023] [Indexed: 02/18/2023]
Abstract
The nested data structure is prevalent in cognitive measure experiments because observations are taken repeatedly from different brain locations within subjects. The analysis methods used for this data type should consider the dependency structure among the repeated measurements. However, this dependency is mostly ignored in the cognitive neuroscience data analysis literature. We consider both statistical and machine learning methods extended to repeated data analysis and compare distinct algorithms in terms of their advantages and disadvantages. Unlike basic algorithm comparison studies, this article analyzes novel neuroscience data considering the dependency structure for the first time with several statistical and machine learning methods and their hybrid forms. In addition, the fitting performances of different algorithms are compared using contaminated data sets and the cross-validation approach. One of our findings suggests that the GLMM tree, including random term indices indicating the location of functional near-infrared spectroscopy optodes nested within experimental units, shows the best predictive performance with the lowest MSE, RMSE, and MAE model performance metrics. However, there is a trade-off between accuracy and speed, since this algorithm requires the highest computational time.
Affiliation(s)
- Serenay Cakar
- Department of Statistics, Middle East Technical University, Ankara, Turkey
| | - Fulya Gokalp Yavuz
- Department of Statistics, Middle East Technical University, Ankara, Turkey
| |
|
17
|
Synthetic data in health care: A narrative review. PLOS DIGITAL HEALTH 2023; 2:e0000082. [PMID: 36812604 PMCID: PMC9931305 DOI: 10.1371/journal.pdig.0000082] [Citation(s) in RCA: 31] [Impact Index Per Article: 31.0] [Received: 06/29/2022] [Accepted: 12/06/2022] [Indexed: 01/09/2023]
Abstract
Data are central to research, public health, and in developing health information technology (IT) systems. Nevertheless, access to most data in health care is tightly controlled, which may limit innovation, development, and efficient implementation of new research, products, services, or systems. Using synthetic data is one of the many innovative ways that can allow organizations to share datasets with broader users. However, only a limited set of literature is available that explores its potentials and applications in health care. In this review paper, we examined existing literature to bridge the gap and highlight the utility of synthetic data in health care. We searched PubMed, Scopus, and Google Scholar to identify peer-reviewed articles, conference papers, reports, and thesis/dissertations articles related to the generation and use of synthetic datasets in health care. The review identified seven use cases of synthetic data in health care: a) simulation and prediction research, b) hypothesis, methods, and algorithm testing, c) epidemiology/public health research, d) health IT development, e) education and training, f) public release of datasets, and g) linking data. The review also identified readily and publicly accessible health care datasets, databases, and sandboxes containing synthetic data with varying degrees of utility for research, education, and software development. The review provided evidence that synthetic data are helpful in different aspects of health care and research. While the original real data remains the preferred choice, synthetic data hold possibilities in bridging data access gaps in research and evidence-based policymaking.
|
18
|
Lou YS, Lin CS, Fang WH, Lee CC, Wang CH, Lin C. Development and validation of a dynamic deep learning algorithm using electrocardiogram to predict dyskalaemias in patients with multiple visits. EUROPEAN HEART JOURNAL. DIGITAL HEALTH 2022; 4:22-32. [PMID: 36743876 PMCID: PMC9890087 DOI: 10.1093/ehjdh/ztac072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/06/2022] [Revised: 10/26/2022] [Indexed: 11/23/2022]
Abstract
Aims Deep learning models (DLMs) have shown superiority in electrocardiogram (ECG) analysis and have been applied to diagnose dyskalaemias. However, no study has explored the performance of DLM-enabled ECG in continuous follow-up scenarios. Therefore, we proposed a dynamic revision of DLM-enabled ECG to use personal pre-annotated ECGs to enhance the accuracy in patients with multiple visits. Methods and results We retrospectively collected 168 450 ECGs with corresponding serum potassium (K+) levels from 103 091 patients as development samples. In the internal/external validation sets, the numbers of ECGs with corresponding K+ were 37 246/47 604 from 13 555/20 058 patients. Our dynamic revision method showed better performance than the traditional direct prediction for diagnosing hypokalaemia [area under the receiver operating characteristic curve (AUC) = 0.730/0.720-0.788/0.778] and hyperkalaemia (AUC = 0.884/0.888-0.915/0.908) in patients with multiple visits. Conclusion Our method has shown a distinguishable improvement in DLMs for diagnosing dyskalaemias in patients with multiple visits, and we also proved its application in ejection fraction prediction, which could further improve daily clinical practice.
Affiliation(s)
- Yu-Sheng Lou
- Graduate Institutes of Life Sciences, National Defense Medical Center, No. 161, Min-Chun E. Rd., Section 6, Neihu, Taipei 114, Taiwan, Republic of China
- School of Public Health, National Defense Medical Center, No. 161, Min-Chun E. Rd., Section 6, Neihu, Taipei 114, Taiwan, Republic of China
| | - Chin-Sheng Lin
- Division of Cardiology, Department of Internal Medicine, Tri-Service General Hospital, National Defense Medical Center, No. 325, Cheng-Kung Rd., Section 2, Neihu, Taipei 114, Taiwan, Republic of China
| | - Wen-Hui Fang
- Department of Family and Community Medicine, Department of Internal Medicine, Tri-Service General Hospital, National Defense Medical Center, No. 325, Cheng-Kung Rd., Section 2, Neihu, Taipei 114, Taiwan, Republic of China
| | - Chia-Cheng Lee
- Medical Informatics Office, Tri-Service General Hospital, National Defense Medical Center, No. 325, Cheng-Kung Rd., Section 2, Neihu, Taipei 114, Taiwan, Republic of China
- Division of Colorectal Surgery, Department of Surgery, Tri-Service General Hospital, National Defense Medical Center, No. 325, Cheng-Kung Rd., Section 2, Neihu, Taipei 114, Taiwan, Republic of China
| | - Chih-Hung Wang
- Department of Otolaryngology-Head and Neck Surgery, Tri-Service General Hospital, National Defense Medical Center, No. 325, Cheng-Kung Rd., Section 2, Neihu, Taipei 114, Taiwan, Republic of China
- Graduate Institute of Medical Sciences, National Defense Medical Center, No. 161, Min-Chun E. Rd., Section 6, Neihu, Taipei 114, Taiwan, Republic of China
| | - Chin Lin
- Corresponding author. Tel: +886 2 87923100 #18574, Fax: +886 2 87923147
| |
|
19
|
Hu L, Ji J, Liu H, Ennis R. A Flexible Approach for Assessing Heterogeneity of Causal Treatment Effects on Patient Survival Using Large Datasets with Clustered Observations. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2022; 19:14903. [PMID: 36429621 PMCID: PMC9690785 DOI: 10.3390/ijerph192214903] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Revised: 11/08/2022] [Accepted: 11/09/2022] [Indexed: 06/16/2023]
Abstract
Personalized medicine requires an understanding of treatment effect heterogeneity. Evolving toward causal evidence for scenarios not studied in randomized trials necessitates a methodology using real-world evidence. Herein, we demonstrate a methodology that generates causal effects, assesses the heterogeneity of the effects and adjusts for the clustered nature of the data. This study uses a state-of-the-art machine learning survival model, riAFT-BART, to draw causal inferences about individual survival treatment effects, while accounting for the variability in institutional effects; further, it proposes a data-driven approach to agnostically (as opposed to a priori hypotheses) ascertain which subgroups exhibit an enhanced treatment effect from which intervention, relative to global evidence-average treatment effects measured at the population level. Comprehensive simulations show the advantages of the proposed method in terms of bias, efficiency and precision in estimating heterogeneous causal effects. The empirically validated method was then used to analyze the National Cancer Database.
Affiliation(s)
- Liangyuan Hu
- Department of Biostatistics and Epidemiology, Rutgers University, New Brunswick, NJ 07102, USA
| | - Jiayi Ji
- Department of Biostatistics and Epidemiology, Rutgers University, New Brunswick, NJ 07102, USA
| | - Hao Liu
- Department of Biostatistics and Epidemiology, Rutgers University, New Brunswick, NJ 07102, USA
- Cancer Institute of New Jersey, Rutgers University, New Brunswick, NJ 07102, USA
| | - Ronald Ennis
- Cancer Institute of New Jersey, Rutgers University, New Brunswick, NJ 07102, USA
- Robert Wood Johnson Medical School, Rutgers University, New Brunswick, NJ 07102, USA
| |
|
20
|
Fuh-Ngwa V, Zhou Y, Melton PE, van der Mei I, Charlesworth JC, Lin X, Zarghami A, Broadley SA, Ponsonby AL, Simpson-Yap S, Lechner-Scott J, Taylor BV. Ensemble machine learning identifies genetic loci associated with future worsening of disability in people with multiple sclerosis. Sci Rep 2022; 12:19291. [PMID: 36369345 PMCID: PMC9652373 DOI: 10.1038/s41598-022-23685-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2022] [Accepted: 11/03/2022] [Indexed: 11/13/2022] Open
Abstract
Limited studies have been conducted to identify and validate multiple sclerosis (MS) genetic loci associated with disability progression. We aimed to identify MS genetic loci associated with worsening of disability over time, and to develop and validate ensemble genetic learning model(s) to identify people with MS (PwMS) at risk of future worsening. We examined associations of 208 previously established MS genetic loci with the risk of worsening of disability; we learned ensemble genetic decision rules and validated the predictions in an external dataset. We found 7 genetic loci (rs7731626: HR 0.92, P = 2.4 × 10⁻⁵; rs12211604: HR 1.16, P = 3.2 × 10⁻⁷; rs55858457: HR 0.93, P = 3.7 × 10⁻⁷; rs10271373: HR 0.90, P = 1.1 × 10⁻⁷; rs11256593: HR 1.13, P = 5.1 × 10⁻⁵⁷; rs12588969: HR 1.10, P = 2.1 × 10⁻¹⁰; rs1465697: HR 1.09, P = 1.7 × 10⁻¹²⁸) associated with the risk of worsening of disability, most of which were located near or tagged to 13 genomic regions enriched in peptide hormones and steroids biosynthesis pathways by positional and eQTL mapping. The derived ensembles produced a set of genetic decision rules that can be translated to provide additional prognostic value to existing clinical predictions, with the additional benefit of incorporating relevant genetic information into clinical decision making for PwMS. The present study extends our knowledge of MS progression genetics and provides the basis of future studies regarding the functional significance of the identified loci.
Affiliation(s)
- Valery Fuh-Ngwa
- Menzies Institute for Medical Research, University of Tasmania, 17 Liverpool St, Hobart, TAS 7000 Australia
| | - Yuan Zhou
- Menzies Institute for Medical Research, University of Tasmania, 17 Liverpool St, Hobart, TAS 7000 Australia
| | - Phillip E. Melton
- Menzies Institute for Medical Research, University of Tasmania, 17 Liverpool St, Hobart, TAS 7000 Australia
| | - Ingrid van der Mei
- Menzies Institute for Medical Research, University of Tasmania, 17 Liverpool St, Hobart, TAS 7000 Australia
| | - Jac C. Charlesworth
- Menzies Institute for Medical Research, University of Tasmania, 17 Liverpool St, Hobart, TAS 7000 Australia
| | - Xin Lin
- Menzies Institute for Medical Research, University of Tasmania, 17 Liverpool St, Hobart, TAS 7000 Australia
| | - Amin Zarghami
- Menzies Institute for Medical Research, University of Tasmania, 17 Liverpool St, Hobart, TAS 7000 Australia
| | - Simon A. Broadley
- Menzies Health Institute Queensland and School of Medicine, Griffith University Gold Coast, G40 Griffith Health Centre, QLD 4222, Australia
| | - Anne-Louise Ponsonby
- Developing Brain Division, The Florey Institute for Neuroscience and Mental Health, Royal Children’s Hospital, University of Melbourne Murdoch Children’s Research Institute, Parkville, VIC 3052 Australia
| | - Steve Simpson-Yap
- Neuroepidemiology Unit, Melbourne School of Population & Global Health, The University of Melbourne, Melbourne, VIC 3053 Australia
| | - Jeannette Lechner-Scott
- Department of Neurology, Hunter Medical Research Institute, Hunter New England Health, University of Newcastle, Callaghan, NSW 2310 Australia
| | - Bruce V. Taylor
- Menzies Institute for Medical Research, University of Tasmania, 17 Liverpool St, Hobart, TAS 7000 Australia
| |
|
21
|
Sokhansanj BA, Rosen GL. Predicting COVID-19 disease severity from SARS-CoV-2 spike protein sequence by mixed effects machine learning. Comput Biol Med 2022; 149:105969. [PMID: 36041271 PMCID: PMC9384346 DOI: 10.1016/j.compbiomed.2022.105969] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/06/2022] [Revised: 07/11/2022] [Accepted: 08/13/2022] [Indexed: 11/17/2022]
Abstract
Epidemiological studies show that COVID-19 variants-of-concern, like Delta and Omicron, pose different risks for severe disease, but they typically lack sequence-level information for the virus. Studies which do obtain viral genome sequences are generally limited in time, location, and population scope. Retrospective meta-analyses require time-consuming data extraction from heterogeneous formats and are limited to publicly available reports. Fortuitously, a subset of GISAID, the global SARS-CoV-2 sequence repository, includes "patient status" metadata that can indicate whether a sequence record is associated with mild or severe disease. While GISAID lacks data on comorbidities relevant to severity, such as obesity and chronic disease, it does include metadata for age and sex to use as additional attributes in modeling. With these caveats, previous efforts have demonstrated that genotype-patient status models can be fit to GISAID data, particularly when country-of-origin is used as an additional feature. But are these models robust and biologically meaningful? This paper shows that, in fact, temporal and geographic biases in sequences submitted to GISAID, as well as the evolving pandemic response, particularly reduction in severe disease due to vaccination, create complex issues for model development and interpretation. This paper poses a potential solution: efficient mixed effects machine learning using GPBoost, treating country as a random effect group. Training and validation using temporally split GISAID data and emerging Omicron variants demonstrates that GPBoost models are more predictive of the impact of spike protein mutations on patient outcomes than fixed effect XGBoost, LightGBM, random forests, and elastic net logistic regression models.
Affiliation(s)
- Bahrad A Sokhansanj
- Ecological and Evolutionary Signal Processing & Informatics Laboratory, Drexel University, 3100 Chestnut St., Philadelphia, PA, 19104, United States of America.
| | - Gail L Rosen
- Ecological and Evolutionary Signal Processing & Informatics Laboratory, Drexel University, 3100 Chestnut St., Philadelphia, PA, 19104, United States of America.
| |
|
22
|
Predictions of machine learning with mixed-effects in analyzing longitudinal data under model misspecification. STAT METHOD APPL-GER 2022. [DOI: 10.1007/s10260-022-00658-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/14/2022]
Abstract
We consider predictions in longitudinal studies and investigate the well-known statistical mixed-effects model, the piecewise linear mixed-effects model, and six popular machine learning approaches: decision trees, bagging, random forest, boosting, support-vector machine, and neural network. To account for correlated data in machine learning, random effects are combined into the traditional tree methods and random forest. Our focus is the performance of statistical modelling and machine learning, especially when the fixed effects and the random effects are misspecified. Extensive simulation studies have been carried out to evaluate the performance using a number of criteria. Two real datasets from longitudinal studies are analysed to demonstrate our findings. The R code and dataset are freely available at https://github.com/shuwen92/MEML.
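As a reading aid, the idea of combining random effects with a tree or forest can be written in one common form. The notation below is an assumption chosen for illustration (it follows the usual mixed-effects random forest formulation) and is not taken from the paper, whose exact specification may differ.

```latex
y_{ij} = f(\mathbf{x}_{ij}) + \mathbf{z}_{ij}^{\top}\mathbf{b}_i + \varepsilon_{ij},
\qquad \mathbf{b}_i \sim \mathcal{N}(\mathbf{0}, \mathbf{D}),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $y_{ij}$ is the $j$-th measurement on subject $i$, $f$ is the fixed-effects part learned by the tree or forest, and $\mathbf{b}_i$ collects the subject-specific random effects. Setting $f(\mathbf{x}_{ij}) = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}$ recovers the linear mixed-effects model, so misspecification of either the fixed part or the random part can be studied within the same framework.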
|
23
|
Shazly SA, Borah BJ, Ngufor CG, Torbenson VE, Theiler RN, Famuyide AO. Impact of labor characteristics on maternal and neonatal outcomes of labor: A machine-learning model. PLoS One 2022; 17:e0273178. [PMID: 35994474 PMCID: PMC9394788 DOI: 10.1371/journal.pone.0273178] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Accepted: 08/01/2022] [Indexed: 11/18/2022] Open
Abstract
Introduction
Since Friedman’s seminal publication on laboring women, numerous publications have sought to define normal labor progress. However, there is a paucity of data on contemporary labor cervicometry incorporating both maternal and neonatal outcomes. The objective of this study is to establish intrapartum prediction models of unfavorable labor outcomes using machine-learning algorithms.
Materials and methods
Consortium on Safe Labor is a large database consisting of pregnancy and labor characteristics from 12 medical centers in the United States. Outcomes, including maternal and neonatal outcomes, were retrospectively collected. We defined primary outcome as the composite of following unfavorable outcomes: cesarean delivery in active labor, postpartum hemorrhage, intra-amniotic infection, shoulder dystocia, neonatal morbidity, and mortality. Clinical and obstetric parameters at admission and during labor progression were used to build machine-learning risk-prediction models based on the gradient boosting algorithm.
Results
Of 228,438 delivery episodes, 66,586 were eligible for this study. Mean maternal age was 26.95 ± 6.48 years, mean parity was 0.92 ± 1.23, and mean gestational age was 39.35 ± 1.13 weeks. Unfavorable labor outcome was reported in 14,439 (21.68%) deliveries. Starting at a cervical dilation of 4 cm, the area under receiver operating characteristics curve (AUC) of prediction models increased from 0.75 (95% confidence interval, 0.75–0.75) to 0.89 (95% confidence interval, 0.89–0.90) at a dilation of 10 cm. Baseline labor risk score was above 35% in patients with unfavorable outcomes compared to women with favorable outcomes, whose score was below 25%.
Conclusion
Labor risk score is a machine-learning–based score that provides individualized and dynamic alternatives to conventional labor charts. It predicts composite of adverse birth, maternal, and neonatal outcomes as labor progresses. Therefore, it can be deployed in clinical practice to monitor labor progress in real time and support clinical decisions.
Affiliation(s)
- Sherif A. Shazly
- Department of Obstetrics and Gynecology, Mayo Clinic, Rochester, Minnesota
| | - Bijan J. Borah
- Department of Obstetrics and Gynecology, Mayo Clinic, Rochester, Minnesota
- Division of Health Care Delivery Research, Mayo Clinic, Rochester, Minnesota
- The Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota
| | - Che G. Ngufor
- The Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, Minnesota
| | | | - Regan N. Theiler
- Department of Obstetrics and Gynecology, Mayo Clinic, Rochester, Minnesota
| | - Abimbola O. Famuyide
- Department of Obstetrics and Gynecology, Mayo Clinic, Rochester, Minnesota
| |
|
24
|
Bi Q, Kuang Z, Haihong E, Song M, Tan L, Tang X, Liu X. Research on early warning of renal damage in hypertensive patients based on the stacking strategy. BMC Med Inform Decis Mak 2022; 22:212. [PMID: 35945608 PMCID: PMC9361646 DOI: 10.1186/s12911-022-01889-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2020] [Accepted: 03/31/2022] [Indexed: 11/26/2022] Open
Abstract
Background Among the problems caused by hypertension, early renal damage is often overlooked. It cannot be diagnosed until the condition is severe and irreversible damage has occurred. We therefore screened and explored risk factors in hypertensive patients with early renal damage and established an early-warning model of renal damage based on data mining, to enable early diagnosis of renal damage in hypertensive patients. Methods Using an electronic information management system for hypertensive outpatients, we collected 513 cases of original, untreated hypertensive patients. We recorded their demographic data, ambulatory blood pressure parameters, routine blood indices, and blood biochemical indices to establish a clinical database. We then screened risk factors for early renal damage through feature engineering and used Random Forest, Extra-Trees, and XGBoost to build separate early-warning models. Finally, we built a new model by model fusion based on the Stacking strategy. We used cross-validation to evaluate the stability and reliability of each model and to determine the best risk assessment model. Results In descending order of importance, the features selected by feature engineering were the nocturnal drop rate of systolic blood pressure, red blood cell distribution width, blood pressure circadian rhythm, average daytime diastolic blood pressure, body surface area, smoking, age, and HDL. The average precision of the two-dimensional fusion model based on the Stacking strategy was 0.89685 with the full feature set and 0.93824 with the selected features, a marked improvement. Conclusions Through feature engineering and risk factor analysis, we selected the nocturnal drop rate of systolic blood pressure, red blood cell distribution width, blood pressure circadian rhythm, and average daytime diastolic blood pressure as early-warning factors of early renal damage in patients with hypertension. On this basis, the two-dimensional fusion model based on the Stacking strategy performs better than the single models and can be used for risk assessment of early renal damage in hypertensive patients.
Affiliation(s)
- Qiubo Bi
  - School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, 100876, China
- Zemin Kuang
  - Department of Hypertension, Beijing Anzhen Hospital of Capital Medical University, Beijing, 100029, China
- E Haihong
  - School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, 100876, China
- Meina Song
  - School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, 100876, China
- Ling Tan
  - School of Computer Science, Beijing University of Posts and Telecommunications, Beijing, 100876, China
- Xinying Tang
  - Department of Cardiology, The First People's Hospital of Chenzhou, The University of South China, Chenzhou, 423000, China
- Xing Liu
  - Department of Anesthesiology, Third Xiangya Hospital, Central South University, Changsha, 410013, China
|
25
|
Baurley JW, Claus ED, Witkiewitz K, McMahan CS. A Bayesian mixed effects support vector machine for learning and predicting daily substance use disorder patterns. THE AMERICAN JOURNAL OF DRUG AND ALCOHOL ABUSE 2022; 48:413-421. [PMID: 35196194 DOI: 10.1080/00952990.2021.2024839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2021] [Revised: 12/27/2021] [Accepted: 12/29/2021] [Indexed: 06/14/2023]
Abstract
Background: Substance use disorder (SUD) is a heterogeneous disorder. Adapting machine learning algorithms to allow for the parsing of intrapersonal and interpersonal heterogeneity in meaningful ways may accelerate the discovery and implementation of clinically actionable interventions in SUD research. Objectives: Inspired by a study of heavy drinkers that collected daily drinking and substance use (ABQ DrinQ), we develop tools to estimate subject-specific risk trajectories of heavy drinking; estimate and perform inference on patient characteristics and time-varying covariates; and present results in easy-to-use Jupyter notebooks. Methods: We recast support vector machines (SVMs) into a Bayesian model extended to handle mixed effects. We then apply these methods to ABQ DrinQ to model alcohol use patterns. ABQ DrinQ consists of 190 heavy drinkers (44% female) with 109,580 daily observations. Results: We identified male gender (point estimate; 95% credible interval: -0.25; -0.29, -0.21), older age (-0.03; -0.03, -0.03), and time-varying usage of nicotine (1.68; 1.62, 1.73), cannabis (0.05; 0.03, 0.07), and other drugs (1.16; 1.01, 1.35) as statistically significant factors of heavy drinking behavior. By adopting random effects to capture the subject-specific longitudinal trajectories, the algorithm outperforms traditional SVM (it classifies 84% of heavy drinking days correctly versus 73%). Conclusions: We developed a mixed effects variant of SVM and compared it to the traditional formulation, with an eye toward elucidating the importance of incorporating random effects to account for underlying heterogeneity in SUD data. These tools and examples are packaged into a repository for researchers to explore. Understanding patterns and risk of substance use could be used for developing individualized interventions.
Affiliation(s)
- Eric D Claus
  - Department of Biobehavioral Health, The Pennsylvania State University, University Park, PA, USA
- Katie Witkiewitz
  - Department of Psychology, University of New Mexico, Albuquerque, NM, USA
- Christopher S McMahan
  - School of Mathematical and Statistical Sciences, Clemson University, Clemson, SC, USA
|
26
|
Marsch LA, Chen CH, Adams SR, Asyyed A, Does MB, Hassanpour S, Hichborn E, Jackson-Morris M, Jacobson NC, Jones HK, Kotz D, Lambert-Harris CA, Li Z, McLeman B, Mishra V, Stanger C, Subramaniam G, Wu W, Campbell CI. The Feasibility and Utility of Harnessing Digital Health to Understand Clinical Trajectories in Medication Treatment for Opioid Use Disorder: D-TECT Study Design and Methodological Considerations. Front Psychiatry 2022; 13:871916. [PMID: 35573377 PMCID: PMC9098973 DOI: 10.3389/fpsyt.2022.871916] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2022] [Accepted: 03/22/2022] [Indexed: 11/13/2022] Open
Abstract
Introduction Across the U.S., the prevalence of opioid use disorder (OUD) and the rates of opioid overdoses have risen precipitously in recent years. Several effective medications for OUD (MOUD) exist and have been shown to be life-saving. A large volume of research has identified a confluence of factors that predict attrition and continued substance use during substance use disorder treatment. However, much of this literature has examined a small set of potential moderators or mediators of outcomes in MOUD treatment and may lead to over-simplified accounts of treatment non-adherence. Digital health methodologies offer great promise for capturing intensive, longitudinal ecologically-valid data from individuals in MOUD treatment to extend our understanding of factors that impact treatment engagement and outcomes. Methods This paper describes the protocol (including the study design and methodological considerations) from a novel study supported by the National Drug Abuse Treatment Clinical Trials Network at the National Institute on Drug Abuse (NIDA). This study (D-TECT) primarily seeks to evaluate the feasibility of collecting ecological momentary assessment (EMA), smartphone and smartwatch sensor data, and social media data among patients in outpatient MOUD treatment. It secondarily seeks to examine the utility of EMA, digital sensing, and social media data (separately and compared to one another) in predicting MOUD treatment retention, opioid use events, and medication adherence [as captured in electronic health records (EHR) and EMA data]. To our knowledge, this is the first project to include all three sources of digitally derived data (EMA, digital sensing, and social media) in understanding the clinical trajectories of patients in MOUD treatment. These multiple data streams will allow us to understand the relative and combined utility of collecting digital data from these diverse data sources. 
The inclusion of EHR data allows us to focus on the utility of digital health data in predicting objectively measured clinical outcomes. Discussion Results may be useful in elucidating novel relations between digital data sources and OUD treatment outcomes. It may also inform approaches to enhancing outcomes measurement in clinical trials by allowing for the assessment of dynamic interactions between individuals' daily lives and their MOUD treatment response. Clinical Trial Registration Identifier: NCT04535583.
Affiliation(s)
- Lisa A. Marsch
  - Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
- Ching-Hua Chen
  - Center for Computational Health, International Business Machines (IBM) Research, Yorktown Heights, NY, United States
- Sara R. Adams
  - Division of Research Kaiser Permanente Northern California, Oakland, CA, United States
- Asma Asyyed
  - The Permanente Medical Group, Northern California, Addiction Medicine and Recovery Services, Oakland, CA, United States
- Monique B. Does
  - Division of Research Kaiser Permanente Northern California, Oakland, CA, United States
- Saeed Hassanpour
  - Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
  - Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
- Emily Hichborn
  - Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
- Nicholas C. Jacobson
  - Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
  - Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
- Heather K. Jones
  - Division of Research Kaiser Permanente Northern California, Oakland, CA, United States
- David Kotz
  - Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
  - Department of Computer Science, Dartmouth College, Hanover, NH, United States
- Chantal A. Lambert-Harris
  - Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
- Zhiguo Li
  - Center for Computational Health, International Business Machines (IBM) Research, Yorktown Heights, NY, United States
- Bethany McLeman
  - Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
- Varun Mishra
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA, United States
  - Department of Health Sciences, Bouvé College of Health Sciences, Northeastern University, Boston, MA, United States
- Catherine Stanger
  - Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
- Geetha Subramaniam
  - Center for Clinical Trials Network, National Institute on Drug Abuse, Bethesda, MD, United States
- Weiyi Wu
  - Center for Technology and Behavioral Health, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
  - Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Lebanon, NH, United States
- Cynthia I. Campbell
  - Division of Research Kaiser Permanente Northern California, Oakland, CA, United States
  - Department of Psychiatry and Behavioral Sciences, University of California, San Francisco, San Francisco, CA, United States
|
27
|
Mousavi A, Zare H, Asadian A, Mohammadzadeh M. Factors Affecting the Product Life Cycle of Generic Medicines. IRANIAN JOURNAL OF PHARMACEUTICAL RESEARCH 2022; 21:e127039. [PMID: 36060917 PMCID: PMC9420220 DOI: 10.5812/ijpr-127039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2021] [Revised: 01/24/2022] [Accepted: 02/13/2022] [Indexed: 11/26/2022]
Abstract
Background Product life cycle (PLC) refers to the time ranging from when a product is introduced into the market to when it is taken off the shelves. The PLC management can guarantee product survival and prevent its decline. Objectives This study investigated generic antibiotic PLCs and detected factors affecting them in the competitive pharmaceutical market of Iran to improve the PLC management of such drugs. Methods To study the PLC of antibiotics, data were collected from 2002 to 2017, and then the PLC curves were analyzed. Accordingly, factors affecting the PLC of antibiotics were illustrated in two sections: all PLC curves and the PLC curves with one sales peak. Using a generalized linear model combined with a machine learning approach, we identified the sales patterns and the effect of the product-related and the competition-related factors on the PLC curves, peak height, and the time to reach peak sales. Results According to the findings, 16, 11.87, 13.03, and 59% of the antibiotics had linear, binomial, one-peak, and oscillating sales patterns, respectively. The most crucial factors affecting the PLC shape were the quality, microbial spectrum, dosage forms, number of competitors, and entry arrangement. Conclusions This study examined factors affecting the PLC patterns of generic pharmaceutical products. The findings would provide more insights into the generic pharmaceutical market as one of the less-studied markets in many countries.
Affiliation(s)
- Atefeh Mousavi
  - Pharmaceutical and Health Economics and Management Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Hossein Zare
  - Department of Health Policy and Management, Johns Hopkins Center for Health Disparities Solutions, the University of Maryland Global Campus, Health Services Management, 624 North Broadway, Baltimore, Maryland, United States
- Aydin Asadian
  - School of Pharmacy, Shahid Beheshti University of Medical Sciences, Tehran, Iran
- Mehdi Mohammadzadeh
  - Pharmaceutical and Health Economics and Management Research Center, Shahid Beheshti University of Medical Sciences, Tehran, Iran
  - Department of Pharmacoeconomy & Administrative Pharmacy, School of Pharmacy, Shahid Beheshti University of Medical Sciences, Tehran, Iran
  - Corresponding Author: Department of Pharmacoeconomy & Administrative Pharmacy, School of Pharmacy, Shahid Beheshti University of Medical Sciences, Tehran, Iran. Tel: +98-2188665692,
|
28
|
Bayesian Nonlinear Models for Repeated Measurement Data: An Overview, Implementation, and Applications. MATHEMATICS 2022. [DOI: 10.3390/math10060898] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2023]
Abstract
Nonlinear mixed effects models have become a standard platform for analysis when data take the form of continuous, repeated measurements of subjects from a population of interest and the temporal profiles of subjects commonly follow a nonlinear trend. While frequentist analysis of nonlinear mixed effects models has a long history, Bayesian analysis of these models received comparatively little attention until the late 1980s, primarily due to the time-consuming nature of Bayesian computation. Since the early 1990s, Bayesian approaches for these models have emerged to leverage rapid developments in computing power, and they have recently received significant attention due to (1) their superiority in quantifying the uncertainty of parameter estimation; (2) their utility in incorporating prior knowledge into the models; and (3) their flexibility in matching the increasing complexity of scientific research arising from diverse industrial and academic fields. This review article presents an overview of modeling strategies to implement Bayesian approaches for nonlinear mixed effects models, ranging from designing a scientific question out of real-life problems to practical computations.
|
29
|
Cao X, Yang G, Jin X, He L, Li X, Zheng Z, Liu Z, Wu C. A Machine Learning-Based Aging Measure Among Middle-Aged and Older Chinese Adults: The China Health and Retirement Longitudinal Study. Front Med (Lausanne) 2021; 8:698851. [PMID: 34926482 PMCID: PMC8671693 DOI: 10.3389/fmed.2021.698851] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/22/2021] [Accepted: 10/28/2021] [Indexed: 11/13/2022] Open
Abstract
Objective: Biological age (BA) has been accepted as a more accurate proxy of aging than chronological age (CA). This study aimed to use machine learning (ML) algorithms to estimate BA in the Chinese population. Materials and methods: We used data from 9,771 middle-aged and older Chinese adults (≥45 years) in the 2011/2012 wave of the China Health and Retirement Longitudinal Study who were followed until 2018. We used several ML algorithms (e.g., Gradient Boosting Regressor, Random Forest, CatBoost Regressor, and Support Vector Machine) to develop new measures of biological aging (ML-BAs) based on physiological biomarkers. R-squared value and mean absolute error (MAE) were used to determine the optimal performance of these ML-BAs. We used logistic regression models to examine the associations of the best ML-BA and of a conventional aging measure, the Klemera and Doubal method BA (KDM-BA) that we previously developed, with physical disability and mortality, respectively. Results: The Gradient Boosting Regression model performed the best, resulting in an ML-BA with an R-squared value of 0.270 and an MAE of 6.519. This ML-BA was significantly associated with disability in basic activities of daily living, instrumental activities of daily living, lower extremity mobility, and upper extremity mobility, and mortality, with odds ratios ranging from 1 to 7% (per 1-year increment in ML-BA, all P < 0.001), independent of CA. These associations were generally comparable to those of KDM-BA. Conclusion: This study provides a valid ML-based measure of biological aging for middle-aged and older Chinese adults. These findings support the application of ML in geroscience research and may help facilitate preventive and geroprotector intervention studies.
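As an illustration of the two evaluation metrics named in this abstract, the following minimal sketch computes the R-squared value and the MAE from scratch. It assumes that model predictions are compared against chronological age, which the abstract does not state explicitly; the function names and the four data points are hypothetical and are not taken from the study.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical ages in years (illustrative values, not study data).
chronological_age = [50.0, 60.0, 70.0, 80.0]
predicted_age = [55.0, 58.0, 66.0, 75.0]

mae = mean_absolute_error(chronological_age, predicted_age)
r2 = r_squared(chronological_age, predicted_age)
print(f"MAE = {mae:.3f} years, R-squared = {r2:.3f}")
```

A lower MAE and a higher R-squared both indicate predictions closer to the target, so the two metrics can be read together when ranking candidate models.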
Affiliation(s)
- Xingqi Cao
  - Department of Big Data in Health Science, School of Public Health and Center for Clinical Big Data and Analytics, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Guanglai Yang
  - Global Health Research Center, Duke Kunshan University, Kunshan, China
- Xurui Jin
  - Global Health Research Center, Duke Kunshan University, Kunshan, China
  - MindRank AI ltd., Hangzhou, China
- Liu He
  - Department of Big Data in Health Science, School of Public Health and Center for Clinical Big Data and Analytics, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Xueqin Li
  - Department of Big Data in Health Science, School of Public Health and Center for Clinical Big Data and Analytics, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Zhoutao Zheng
  - Department of Big Data in Health Science, School of Public Health and Center for Clinical Big Data and Analytics, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Zuyun Liu
  - Department of Big Data in Health Science, School of Public Health and Center for Clinical Big Data and Analytics, Second Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
- Chenkai Wu
  - Global Health Research Center, Duke Kunshan University, Kunshan, China
|
30
|
Shishegar R, Cox T, Rolls D, Bourgeat P, Doré V, Lamb F, Robertson J, Laws SM, Porter T, Fripp J, Tosun D, Maruff P, Savage G, Rowe CC, Masters CL, Weiner MW, Villemagne VL, Burnham SC. Using imputation to provide harmonized longitudinal measures of cognition across AIBL and ADNI. Sci Rep 2021; 11:23788. [PMID: 34893624 PMCID: PMC8664816 DOI: 10.1038/s41598-021-02827-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/23/2021] [Accepted: 11/12/2021] [Indexed: 12/12/2022] Open
Abstract
To improve understanding of Alzheimer’s disease, large observational studies are needed to increase power for more nuanced analyses. Combining data across existing observational studies represents one solution. However, the disparity of such datasets makes this a non-trivial task. Here, a machine learning approach was applied to impute longitudinal neuropsychological test scores across two observational studies, namely the Australian Imaging, Biomarkers and Lifestyle Study (AIBL) and the Alzheimer's Disease Neuroimaging Initiative (ADNI) providing an overall harmonised dataset. MissForest, a machine learning algorithm, capitalises on the underlying structure and relationships of data to impute test scores not measured in one study aligning it to the other study. Results demonstrated that simulated missing values from one dataset could be accurately imputed, and that imputation of actual missing data in one dataset showed comparable discrimination (p < 0.001) for clinical classification to measured data in the other dataset. Further, the increased power of the overall harmonised dataset was demonstrated by observing a significant association between CVLT-II test scores (imputed for ADNI) with PET Amyloid-β in MCI APOE-ε4 homozygotes in the imputed data (N = 65) but not for the original AIBL dataset (N = 11). These results suggest that MissForest can provide a practical solution for data harmonization using imputation across studies to improve power for more nuanced analyses.
Affiliation(s)
- Rosita Shishegar
  - The Australian e-Health Research Centre, CSIRO, Melbourne, Australia
  - School of Psychological Sciences and Turner Institute for Brain and Mental Health, Monash University, Melbourne, Australia
- Timothy Cox
  - The Australian e-Health Research Centre, CSIRO, Melbourne, Australia
- David Rolls
  - The Australian e-Health Research Centre, CSIRO, Melbourne, Australia
- Pierrick Bourgeat
  - The Australian e-Health Research Centre, CSIRO, Melbourne, Australia
- Vincent Doré
  - The Australian e-Health Research Centre, CSIRO, Melbourne, Australia
  - Department of Molecular Imaging and Therapy, Austin Health, Heidelberg, VIC, Australia
- Fiona Lamb
  - Department of Molecular Imaging and Therapy, Austin Health, Heidelberg, VIC, Australia
- Joanne Robertson
  - Florey Institute of Neuroscience and Mental Health, The University of Melbourne, Parkville, VIC, Australia
- Simon M Laws
  - Centre for Precision Health, Edith Cowan University, Joondalup, WA, Australia
  - Collaborative Genomics and Translation Group, School of Medical and Health Sciences, Edith Cowan University, Joondalup, WA, Australia
  - School of Pharmacy and Biomedical Sciences, Faculty of Health Sciences, Curtin Health Innovation Research Institute, Curtin University, Bentley, WA, Australia
- Tenielle Porter
  - Centre for Precision Health, Edith Cowan University, Joondalup, WA, Australia
  - Collaborative Genomics and Translation Group, School of Medical and Health Sciences, Edith Cowan University, Joondalup, WA, Australia
  - School of Pharmacy and Biomedical Sciences, Faculty of Health Sciences, Curtin Health Innovation Research Institute, Curtin University, Bentley, WA, Australia
- Jurgen Fripp
  - The Australian e-Health Research Centre, CSIRO, Melbourne, Australia
- Duygu Tosun
  - Department of Radiology and Biomedical Imaging, University of California-San Francisco, San Francisco, CA, USA
- Greg Savage
  - Department of Psychology, Macquarie University, Sydney, NSW, Australia
- Christopher C Rowe
  - Department of Molecular Imaging and Therapy, Austin Health, Heidelberg, VIC, Australia
  - Department of Medicine, The University of Melbourne, Parkville, VIC, 3052, Australia
- Colin L Masters
  - Florey Institute of Neuroscience and Mental Health, The University of Melbourne, Parkville, VIC, Australia
- Michael W Weiner
  - Department of Radiology and Biomedical Imaging, University of California-San Francisco, San Francisco, CA, USA
- Victor L Villemagne
  - Department of Molecular Imaging and Therapy, Austin Health, Heidelberg, VIC, Australia
  - Department of Psychiatry, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
|
31
|
Mangino AA, Finch WH. Prediction With Mixed Effects Models: A Monte Carlo Simulation Study. EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 2021; 81:1118-1142. [PMID: 34565818 PMCID: PMC8451021 DOI: 10.1177/0013164421992818] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/13/2023]
Abstract
Oftentimes in many fields of the social and natural sciences, data are obtained within a nested structure (e.g., students within schools). To effectively analyze data with such a structure, multilevel models are frequently employed. The present study utilizes a Monte Carlo simulation to compare several novel multilevel classification algorithms across several varied data conditions for the purpose of prediction. Among these models, the panel neural network and Bayesian generalized mixed effects model (multilevel Bayes) consistently yielded the highest prediction accuracy in test data across nearly all data conditions.
Affiliation(s)
- W Holmes Finch
  - Ball State University, Teachers College, Muncie, IN, USA
|
32
|
Birk N, Matsuzaki M, Fung TT, Li Y, Batis C, Stampfer MJ, Deitchler M, Willett WC, Fawzi WW, Bromage S, Kinra S, Bhupathiraju SN, Lake E. Exploration of Machine Learning and Statistical Techniques in Development of a Low-Cost Screening Method Featuring the Global Diet Quality Score for Detecting Prediabetes in Rural India. J Nutr 2021; 151:110S-118S. [PMID: 34689190 PMCID: PMC8542097 DOI: 10.1093/jn/nxab281] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 06/30/2021] [Revised: 07/26/2021] [Accepted: 08/02/2021] [Indexed: 12/03/2022] Open
Abstract
BACKGROUND The prevalence of type 2 diabetes has increased substantially in India over the past 3 decades. Undiagnosed diabetes presents a public health challenge, especially in rural areas, where access to laboratory testing for diagnosis may not be readily available. OBJECTIVES The present work explores the use of several machine learning and statistical methods in the development of a predictive tool to screen for prediabetes using survey data from an FFQ to compute the Global Diet Quality Score (GDQS). METHODS The outcome variable prediabetes status (yes/no) used throughout this study was determined based upon a fasting blood glucose measurement ≥100 mg/dL. The algorithms utilized included the generalized linear model (GLM), random forest, least absolute shrinkage and selection operator (LASSO), elastic net (EN), and generalized linear mixed model (GLMM) with family unit as a (cluster) random (intercept) effect to account for intrafamily correlation. Model performance was assessed on held-out test data, and comparisons made with respect to area under the receiver operating characteristic curve (AUC), sensitivity, and specificity. RESULTS The GLMM, GLM, LASSO, and random forest modeling techniques each performed quite well (AUCs >0.70) and included the GDQS food groups and age, among other predictors. The fully adjusted GLMM, which included a random intercept for family unit, achieved slightly superior results (AUC of 0.72) in classifying the prediabetes outcome in these cluster-correlated data. CONCLUSIONS The models presented in the current work show promise in identifying individuals at risk of developing diabetes, although further studies are necessary to assess other potentially impactful predictors, as well as the consistency and generalizability of model performance. In addition, future studies to examine the utility of the GDQS in screening for other noncommunicable diseases are recommended.
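The three comparison metrics named in this abstract (AUC, sensitivity, and specificity) can be sketched in a few lines. This is only an illustrative, from-scratch computation under stated assumptions: the labels, the predicted scores, and the 0.5 decision threshold are hypothetical, and the abstract does not say how the authors computed these quantities.

```python
def auc(y_true, y_score):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(y_true, y_score, threshold=0.5):
    """Sensitivity (true positive rate) and specificity (true negative
    rate) when scores at or above the threshold are called positive."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s < threshold)
    tn = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s < threshold)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical held-out labels (1 = prediabetes) and predicted risks.
y_true = [0, 0, 1, 1, 0, 1]
y_score = [0.2, 0.6, 0.7, 0.4, 0.3, 0.9]

test_auc = auc(y_true, y_score)
sens, spec = sensitivity_specificity(y_true, y_score)
print(f"AUC = {test_auc:.3f}, sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

AUC summarizes ranking quality across all thresholds, whereas sensitivity and specificity depend on the single threshold chosen, which is one reason all three are usually reported together.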
Affiliation(s)
- Nick Birk
  - Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA
  - Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, University of London, London, United Kingdom
- Mika Matsuzaki
  - Department of Nutrition, Harvard TH Chan School of Public Health, Boston, MA, USA
  - Department of International Health, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Teresa T Fung
  - Nutrition Department, Simmons University, Boston, MA, USA
- Yanping Li
  - Department of Nutrition, Harvard TH Chan School of Public Health, Boston, MA, USA
- Carolina Batis
  - CONACYT—Health and Nutrition Research Center, National Institute of Public Health, Cuernavaca, Mexico
- Meir J Stampfer
  - Department of Nutrition, Harvard TH Chan School of Public Health, Boston, MA, USA
  - Department of Epidemiology, Harvard TH Chan School of Public Health, Boston, MA, USA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Megan Deitchler
  - Intake—Center for Dietary Assessment, FHI Solutions, Washington, DC, USA
- Walter C Willett
  - Department of Nutrition, Harvard TH Chan School of Public Health, Boston, MA, USA
  - Department of Epidemiology, Harvard TH Chan School of Public Health, Boston, MA, USA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Wafaie W Fawzi
  - Department of Global Health and Population, Harvard TH Chan School of Public Health, Boston, MA, USA
- Sabri Bromage
  - Department of Nutrition, Harvard TH Chan School of Public Health, Boston, MA, USA
- Sanjay Kinra
  - Department of Non-Communicable Disease Epidemiology, London School of Hygiene and Tropical Medicine, University of London, London, United Kingdom
- Shilpa N Bhupathiraju
  - Department of Nutrition, Harvard TH Chan School of Public Health, Boston, MA, USA
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, MA, USA
- Erin Lake
  - Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA
|
33
|
Nguyen P, Ohnmacht AJ, Galhoz A, Büttner M, Theis F, Menden MP. Künstliche Intelligenz und maschinelles Lernen in der Diabetesforschung [Artificial intelligence and machine learning in diabetes research]. DIABETOLOGE 2021. [DOI: 10.1007/s11428-021-00817-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
|
34
|
Cai C, Tafti AP, Ngufor C, Zhang P, Xiao P, Dai M, Liu H, Noseworthy P, Chen M, Friedman PA, Cha YM. Using ensemble of ensemble machine learning methods to predict outcomes of cardiac resynchronization. J Cardiovasc Electrophysiol 2021; 32:2504-2514. [PMID: 34260141 DOI: 10.1111/jce.15171] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/24/2021] [Revised: 05/08/2021] [Accepted: 06/14/2021] [Indexed: 11/29/2022]
Abstract
INTRODUCTION The efficacy of cardiac resynchronization therapy (CRT) has been widely studied in the medical literature; however, about 30% of candidates fail to respond to this treatment strategy. Smart computational approaches based on clinical data can help expose hidden patterns useful for identifying CRT responders. METHODS We retrospectively analyzed the electronic health records of 1664 patients who underwent CRT procedures from January 1, 2002 to December 31, 2017. An ensemble of ensemble (EoE) machine learning (ML) system composed of a supervised and an unsupervised ML layer was developed to generate a prediction model for CRT response. RESULTS We compared the performance of EoE against traditional ML methods and the state-of-the-art convolutional neural network (CNN) model trained on raw electrocardiographic (ECG) waveforms. We observed that the models exhibited improvement in performance as more features were incrementally used for training. Using the most comprehensive set of predictors, the performance of the EoE model in terms of the area under the receiver operating characteristic curve and F1-score was 0.76 and 0.73, respectively. Direct application of the CNN model to the raw ECG waveforms did not generate promising results. CONCLUSION The proposed CRT risk calculator discriminates which heart failure (HF) patients are likely to respond to CRT significantly better than clinical guidelines and traditional ML methods do, suggesting that the tool can enhance care management of HF patients by helping to identify high-risk patients.
Affiliation(s)
- Cheng Cai
- Department of Cardiology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China.,Department of Cardiovascular Medicine, Mayo Clinic, Rochester, Minnesota, USA
| | - Ahmad P Tafti
- College of Science, Technology, and Health, University of Southern Maine, Portland, Maine, USA
| | - Che Ngufor
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, Minnesota, USA
| | - Pei Zhang
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, Minnesota, USA.,Department of Cardiology, Sir Run Run Shaw Hospital, School of Medicine Zhejiang University, Hangzhou, China
| | - Peilin Xiao
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, Minnesota, USA.,Department of Cardiology, The Second Affiliated Hospital of Chongqing Medical University, Chongqing, China
| | - Mingyan Dai
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, Minnesota, USA.,Department of Cardiology, Renmin Hospital of Wuhan University; Cardiovascular Research Institute, Wuhan University, Hubei Key Laboratory of Cardiology, Wuhan, China
| | - Hongfang Liu
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, Minnesota, USA
| | - Peter Noseworthy
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, Minnesota, USA
| | - Minglong Chen
- Department of Cardiology, The First Affiliated Hospital of Nanjing Medical University, Nanjing, China
| | - Paul A Friedman
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, Minnesota, USA
| | - Yong-Mei Cha
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, Minnesota, USA
| |
|
35
|
Speiser JL. A random forest method with feature selection for developing medical prediction models with clustered and longitudinal data. J Biomed Inform 2021; 117:103763. [PMID: 33781921 PMCID: PMC8131242 DOI: 10.1016/j.jbi.2021.103763] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 10/15/2020] [Revised: 03/03/2021] [Accepted: 03/23/2021] [Indexed: 12/22/2022]
Abstract
BACKGROUND Machine learning methodologies are gaining popularity for developing medical prediction models for datasets with a large number of predictors, particularly in the setting of clustered and longitudinal data. Binary Mixed Model (BiMM) forest is a promising machine learning algorithm which may be applied to develop prediction models for clustered and longitudinal binary outcomes. Although machine learning methods for clustered and longitudinal data, such as BiMM forest, exist, feature selection has not been analyzed via data simulations. Feature selection improves the practicality and ease of use of prediction models for clinicians by reducing the burden of data collection. Thus, feature selection procedures are not only beneficial, but are often necessary for development of medical prediction models. In this study, we aim to assess feature selection within the BiMM forest setting for modeling clustered and longitudinal binary outcomes. METHODS We conducted a simulation study to compare BiMM forest with feature selection (backward elimination or stepwise selection) to standard generalized linear mixed model feature selection methods (shrinkage and backward elimination). We also evaluated feature selection methods to develop models predicting mobility disability in older adults using the Health, Aging and Body Composition Study dataset as an example utilization of the proposed methodology. RESULTS BiMM forest with backward elimination generally offered higher computational efficiency, similar or higher predictive performance (accuracy and area under the receiver operating curve), and similar or higher ability to identify correct features compared to linear methods for the different simulated scenarios. For predicting mobility disability in older adults, methods generally performed similarly in terms of accuracy, area under the receiver operating curve, and specificity; however, BiMM forest with backward elimination had the highest sensitivity.
CONCLUSIONS This study is novel because it is the first investigation of feature selection for developing random forest prediction models for clustered and longitudinal binary outcomes. Results from the simulation study reveal that BiMM forest with backward elimination has the highest accuracy (performance and identification of correct features) and lowest computation time compared to other feature selection methods in some scenarios and similar performance in other scenarios. Many informatics datasets have clustered and longitudinal outcomes and results from this study suggest that BiMM forest with backward elimination may be beneficial for developing medical prediction models.
Affiliation(s)
- Jaime Lynn Speiser
- Department of Biostatistics and Data Science, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA.
| |
|
36
|
Mofrad SA, Lundervold A, Lundervold AS. A predictive framework based on brain volume trajectories enabling early detection of Alzheimer's disease. Comput Med Imaging Graph 2021; 90:101910. [PMID: 33862355 DOI: 10.1016/j.compmedimag.2021.101910] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 09/17/2020] [Revised: 02/12/2021] [Accepted: 03/26/2021] [Indexed: 10/21/2022]
Abstract
We present a framework for constructing predictive models of cognitive decline from longitudinal MRI examinations, based on mixed effects models and machine learning. We apply the framework to detect conversion from cognitively normal (CN) to mild cognitive impairment (MCI) and from MCI to Alzheimer's disease (AD), using a large collection of subjects sourced from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and the Australian Imaging, Biomarkers and Lifestyle Flagship Study of Aging (AIBL). We extract subcortical segmentation and cortical parcellation from corresponding T1-weighted images using FreeSurfer v.6.0, select bilateral 3D regions of interest relevant to neurodegeneration/dementia, and fit their longitudinal volume trajectories using linear mixed effects models. Features describing these model-based trajectories are then used to train an ensemble of machine learning classifiers to distinguish stable CN from converters to MCI, and stable MCI from converters to AD. On separate test sets the models achieved average accuracy/precision/recall scores of 69/73/60% for conversion to MCI and 75/74/77% for conversion to AD, illustrating the framework's ability to extract predictive imaging-based biomarkers from routine T1-weighted MRI acquisitions.
Affiliation(s)
- Samaneh Abolpour Mofrad
- Department of Computer Science, Electrical Engineering and Mathematical Sciences, Western Norway University of Applied Sciences, Postbox 7030, 5020 Bergen, Norway; The Mohn Medical Imaging and Visualization Centre (MMIV), Department of Radiology, Haukeland University Hospital, Bergen, Norway.
| | - Arvid Lundervold
- The Neural Networks and Microcircuits Research Group, Department of Biomedicine, University of Bergen, Bergen, Norway; The Mohn Medical Imaging and Visualization Centre (MMIV), Department of Radiology, Haukeland University Hospital, Bergen, Norway
| | - Alexander Selvikvåg Lundervold
- Department of Computer Science, Electrical Engineering and Mathematical Sciences, Western Norway University of Applied Sciences, Postbox 7030, 5020 Bergen, Norway; The Mohn Medical Imaging and Visualization Centre (MMIV), Department of Radiology, Haukeland University Hospital, Bergen, Norway
| | -
- Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/how_to_apply/ADNI_Acknowledgement_List.pdf
| | -
- Data used in the preparation of this article was obtained from the Australian Imaging Biomarkers and Lifestyle Flagship Study of Ageing (AIBL) funded by the Commonwealth Scientific and Industrial Research Organisation (CSIRO) which was made available at the ADNI database. The AIBL researchers contributed data but did not participate in analysis or writing of this report. AIBL researchers are listed at www.aibl.csiro.au
| |
|
37
|
Gujral H, Sinha A. Association between exposure to airborne pollutants and COVID-19 in Los Angeles, United States with ensemble-based dynamic emission model. Environmental Research 2021; 194:110704. [PMID: 33417905 PMCID: PMC7836725 DOI: 10.1016/j.envres.2020.110704] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 09/02/2020] [Revised: 12/13/2020] [Accepted: 12/29/2020] [Indexed: 05/09/2023]
Abstract
This study aims to find the association between short-term exposure to air pollutants, such as particulate matter and ground-level ozone, and SARS-CoV-2 confirmed cases. Generalized linear models (GLM), a typical choice for ecological modeling, have well-established limitations. These limitations include a priori assumptions, inability to handle multicollinearity, and treating differential effects as fixed effects. We propose an Ensemble-based Dynamic Emission Model (EDEM) to address these limitations. EDEM is developed at the intersection of network science and ensemble learning, i.e., a specialized approach of machine learning. Generalized Additive Model (GAM), i.e., a variant of GLM, and EDEM are tested in Los Angeles and Ventura counties of California, which is one of the biggest SARS-CoV-2 clusters in the US. GAM indicates that a 1 μg/m3, 1 μg/m3, and 1 ppm increase (lag 0-7) in PM 2.5, PM 10, and O3 is associated with 4.51% (CI: 7.01 to -2.00) decrease, 1.62% (CI: 2.23 to -1.022) decrease, and 4.66% (CI: 0.85 to 8.47) increase in daily SARS-CoV-2 cases, respectively. Subsequent increment in lag resulted in the negative association between pollutants and SARS-CoV-2 cases. EDEM results in an R2 score of 90.96% and 79.16% on training and testing datasets, respectively. EDEM confirmed the negative association between particulates and SARS-CoV-2 cases, whereas O3 shows a positive association; however, the positive association observed through GAM is not statistically significant. In addition, the county-level analysis of pollutant concentration interactions suggests that increased emissions from other counties positively affect SARS-CoV-2 cases in adjoining counties as well. The results reiterate the significance of uniformly adhering to air pollution mitigation strategies, especially related to ground-level ozone.
Affiliation(s)
- Harshit Gujral
- Department of Computer Science Engineering and IT, Jaypee Institute of Information Technology, Noida, India.
| | - Adwitiya Sinha
- Department of Computer Science Engineering and IT, Jaypee Institute of Information Technology, Noida, India.
| |
|
38
|
Affiliation(s)
- Tim C. D. Lucas
- Big Data Institute University of Oxford Old Road Campus Oxford OX3 7LF United Kingdom
| |
|
39
|
Abstract
Introduction To identify phenotypes of type 1 diabetes based on glucose curves from continuous glucose-monitoring (CGM) using functional data (FD) analysis to account for longitudinal glucose patterns. We present a reliable prediction model that can accurately predict glycemic levels, based on past data collected from the CGM sensor, and the real-time risk of hypo-/hyperglycemia for individuals with type 1 diabetes. Methods A longitudinal cohort study of 443 type 1 diabetes patients with CGM data from a completed trial. The FD analysis approach, sparse functional principal components (FPCs) analysis, was used to identify phenotypes of type 1 diabetes glycemic variation. We employed a nonstationary stochastic linear mixed-effects model (LME) that accommodates between-patient and within-patient heterogeneity to predict glycemic levels and real-time risk of hypo-/hyperglycemia by creating specific target functions for these excursions. Results The majority of the variation (73%) in glucose trajectories was explained by the first two FPCs. Higher order variation in the CGM profiles occurred during weeknights, although variation was higher on weekends. The model has low prediction errors and yields accurate predictions for both glucose levels and real-time risk of glycemic excursions. Conclusions By identifying these distinct longitudinal patterns as phenotypes, interventions can be targeted to optimize type 1 diabetes management for subgroups at the highest risk for compromised long-term outcomes such as cardiac disease or stroke. Further, the estimated change/variability in an individual's glucose trajectory can be used to establish clinically meaningful and patient-specific thresholds that, when coupled with probabilistic predictive inference, provide a useful medical-monitoring tool.
|
40
|
Silva KD, Lee WK, Forbes A, Demmer RT, Barton C, Enticott J. Use and performance of machine learning models for type 2 diabetes prediction in community settings: A systematic review and meta-analysis. Int J Med Inform 2020; 143:104268. [PMID: 32950874 DOI: 10.1016/j.ijmedinf.2020.104268] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 07/22/2020] [Revised: 08/30/2020] [Accepted: 09/02/2020] [Indexed: 12/11/2022]
Abstract
OBJECTIVE We aimed to identify machine learning (ML) models for type 2 diabetes (T2DM) prediction in community settings and determine their predictive performance. METHOD Systematic review of ML predictive modelling studies in 13 databases since 2009 was conducted. Primary outcomes included metrics of discrimination, calibration, and classification. Secondary outcomes included important variables, level of validation, and intended use of models. Meta-analysis of c-indices, subgroup analyses, meta-regression, publication bias assessments and sensitivity analyses were conducted. RESULTS Twenty-three studies (40 prediction models) were included. Studies with high-, moderate-, and low- risk of bias were 3, 14, and 6 respectively. All studies conducted internal validation whereas none conducted external validation of their models. Twenty studies provided classification metrics to varying extents whereas only 7 studies performed model calibration. Eighteen studies reported information on both the variables used for model development and the feature importance. Twelve studies highlighted potential applicability of their models for T2DM screening. Meta-analysis produced a good pooled c-index (0.812). Sources of heterogeneity were identified through subgroup analyses and meta-regression. Issues pertaining to methodological quality and reporting were observed. CONCLUSIONS We found evidence of good performance of ML models for T2DM prediction in the community. Improvements to methodology, reporting and validation are needed before they can be used at scale.
Affiliation(s)
- Kushan De Silva
- Monash Centre for Health Research and Implementation, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, Victoria, Australia.
| | - Wai Kit Lee
- Monash Centre for Health Research and Implementation, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, Victoria, Australia
| | - Andrew Forbes
- Biostatistics Unit, Division of Research Methodology, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Melbourne, Victoria, Australia
| | - Ryan T Demmer
- Division of Epidemiology and Community Health, School of Public Health, University of Minnesota, Minneapolis, Minnesota, USA; Mailman School of Public Health, Columbia University, New York, USA
| | - Christopher Barton
- Department of General Practice, School of Primary and Allied Health Care, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Notting Hill, Victoria, Australia
| | - Joanne Enticott
- Monash Centre for Health Research and Implementation, School of Public Health and Preventive Medicine, Faculty of Medicine, Nursing, and Health Sciences, Monash University, Clayton, Victoria, Australia
| |
|
41
|
Ljubic B, Hai AA, Stanojevic M, Diaz W, Polimac D, Pavlovski M, Obradovic Z. Predicting complications of diabetes mellitus using advanced machine learning algorithms. J Am Med Inform Assoc 2020; 27:1343-1351. [PMID: 32869093 PMCID: PMC7647294 DOI: 10.1093/jamia/ocaa120] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 02/21/2020] [Revised: 05/17/2020] [Accepted: 05/28/2020] [Indexed: 12/14/2022] Open
Abstract
OBJECTIVE We sought to predict if patients with type 2 diabetes mellitus (DM2) would develop 10 selected complications. Accurate prediction of complications could help with more targeted measures that would prevent or slow down their development. MATERIALS AND METHODS Experiments were conducted on the Healthcare Cost and Utilization Project State Inpatient Databases of California for the period of 2003 to 2011. Recurrent neural network (RNN) long short-term memory (LSTM) and RNN gated recurrent unit (GRU) deep learning methods were designed and compared with random forest and multilayer perceptron traditional models. Prediction accuracy for selected complications was compared across 3 settings corresponding to the minimum number of hospitalizations between diabetes diagnosis and the diagnosis of complications. RESULTS The diagnosis domain was used for experiments. The best results were achieved with the RNN GRU model, followed by the RNN LSTM model. The prediction accuracy achieved with the RNN GRU model was between 73% (myocardial infarction) and 83% (chronic ischemic heart disease), while the accuracy of traditional models was between 66% and 76%. DISCUSSION The number of hospitalizations was an important factor for the prediction accuracy. Experiments with 4 hospitalizations achieved significantly better accuracy than with 2 hospitalizations. To achieve improved accuracy, deep learning models required training on at least 1000 patients, and accuracy dropped significantly if training datasets contained 500 patients. The prediction accuracy of complications decreases over time. Considering individual complications, the best accuracy was achieved on depressive disorder and chronic ischemic heart disease. CONCLUSIONS The RNN GRU model was the best choice for electronic medical record type of data, based on the achieved results.
Affiliation(s)
- Branimir Ljubic
- Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, Pennsylvania, USA
| | - Ameen Abdel Hai
- Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, Pennsylvania, USA
| | - Marija Stanojevic
- Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, Pennsylvania, USA
| | - Wilson Diaz
- Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, Pennsylvania, USA
| | - Daniel Polimac
- Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, Pennsylvania, USA
| | - Martin Pavlovski
- Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, Pennsylvania, USA
| | - Zoran Obradovic
- Center for Data Analytics and Biomedical Informatics, Temple University, Philadelphia, Pennsylvania, USA
| |
|
42
|
Ngufor C, Caraballo PJ, O’Byrne TJ, Chen D, Shah ND, Pruinelli L, Steinbach M, Simon G. Development and Validation of a Risk Stratification Model Using Disease Severity Hierarchy for Mortality or Major Cardiovascular Event. JAMA Netw Open 2020; 3:e208270. [PMID: 32678448 PMCID: PMC7368174 DOI: 10.1001/jamanetworkopen.2020.8270] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/17/2022] Open
Abstract
IMPORTANCE Clinical domain knowledge about diseases and their comorbidities, severity, treatment pathways, and outcomes can facilitate diagnosis, enhance preventive strategies, and help create smart evidence-based practice guidelines. OBJECTIVE To introduce a new representation of patient data called disease severity hierarchy that leverages domain knowledge in a nested fashion to create subpopulations that share increasing amounts of clinical details suitable for risk prediction. DESIGN, SETTING, AND PARTICIPANTS This retrospective cohort study included 51 969 patients aged 45 to 85 years, with 10 674 patients who received primary care at the Mayo Clinic between January 2004 and December 2015 in the training cohort and 41 295 patients who received primary care at Fairview Health Services from January 2010 to December 2017 in the validation cohort. Data were analyzed from May 2018 to December 2019. MAIN OUTCOMES AND MEASURES Several binary classification measures, including the area under the receiver operating characteristic curve (AUC), Gini score, sensitivity, and positive predictive value, were used to evaluate models predicting all-cause mortality and major cardiovascular events at ages 60, 65, 75, and 80 years. RESULTS The mean (SD) age and proportions of women and white individuals were 59.4 (10.8) years, 6324 (59.3%) and 9804 (91.9%), respectively, in the training cohort and 57.4 (7.9) years, 21 975 (53.1%), and 37 653 (91.2%), respectively, in the validation cohort. During follow-up, 945 patients (8.9%) in the training cohort died, while 787 (7.4%) had major cardiovascular events. 
Models using the new representation achieved AUCs for predicting death in the training cohort at ages 60, 65, 75, and 80 years of 0.96 (95% CI, 0.94-0.97), 0.96 (95% CI, 0.95-0.98), 0.97 (95% CI, 0.96-0.98), and 0.98 (95% CI, 0.98-0.99), respectively, while standard methods achieved modest AUCs of 0.67 (95% CI, 0.55-0.80), 0.66 (95% CI, 0.56-0.79), 0.64 (95% CI, 0.57-0.71), and 0.63 (95% CI, 0.54-0.70), respectively. CONCLUSIONS AND RELEVANCE In this study, the proposed patient data representation accurately predicted the age at which a patient was at risk of dying or developing major cardiovascular events substantially better than standard methods. The representation uses known relationships contained in electronic health records to capture disease severity in a natural and clinically meaningful way. Furthermore, it is expressive and interpretable. This novel patient representation can help to support critical decision-making, develop smart guidelines, and enhance health care and disease management by helping to identify patients with high risk.
Affiliation(s)
- Che Ngufor
- Division of Digital Health Science, Department of Health Science Research, Mayo Clinic, Rochester, Minnesota
- The Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota
| | - Pedro J. Caraballo
- Division of Digital Health Science, Department of Health Science Research, Mayo Clinic, Rochester, Minnesota
- Division of General Internal Medicine, Department of Internal Medicine, Mayo Clinic, Rochester, Minnesota
| | - Thomas J. O’Byrne
- Division of Healthcare Policy and Research, Department of Health Science Research, Mayo Clinic, Rochester, Minnesota
| | - David Chen
- Division of Digital Health Science, Department of Health Science Research, Mayo Clinic, Rochester, Minnesota
| | - Nilay D. Shah
- The Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota
- Division of Healthcare Policy and Research, Department of Health Science Research, Mayo Clinic, Rochester, Minnesota
| | | | - Michael Steinbach
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis
| | - Gyorgy Simon
- Division of General Internal Medicine, Department of Internal Medicine, Mayo Clinic, Rochester, Minnesota
- Institute for Health Informatics, University of Minnesota, Minneapolis
- Department of Medicine, University of Minnesota, Minneapolis
| |
|
43
|
Gubbi S, Hamet P, Tremblay J, Koch CA, Hannah-Shmouni F. Artificial Intelligence and Machine Learning in Endocrinology and Metabolism: The Dawn of a New Era. Front Endocrinol (Lausanne) 2019; 10:185. [PMID: 30984108 PMCID: PMC6448412 DOI: 10.3389/fendo.2019.00185] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 12/05/2018] [Accepted: 03/06/2019] [Indexed: 12/22/2022] Open
Affiliation(s)
- Sriram Gubbi
- Diabetes, Endocrinology, and Obesity Branch, National Institute of Diabetes, Digestive and Kidney Diseases, National Institutes of Health, Bethesda, MD, United States
| | - Pavel Hamet
- Centre de Recherche, Centre Hospitalier de l'Université de Montréal, Montréal, QC, Canada
- Département de Médecine, Université de Montréal, Montréal, QC, Canada
| | - Johanne Tremblay
- Centre de Recherche, Centre Hospitalier de l'Université de Montréal, Montréal, QC, Canada
- Département de Médecine, Université de Montréal, Montréal, QC, Canada
| | - Christian A. Koch
- Medicover GmbH, Berlin, Germany
- Department of Medicine, Carl von Ossietzky University, Oldenburg, Germany
- University of Tennessee Health Science Center, Memphis, TN, United States
| | - Fady Hannah-Shmouni
- Section on Endocrinology and Genetics, National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, United States
| |
|