1
Cho H, She J, De Marchi D, El-Zaatari H, Barnes EL, Kahkoska AR, Kosorok MR, Virkud AV. Machine Learning and Health Science Research: Tutorial. J Med Internet Res 2024; 26:e50890. [PMID: 38289657 PMCID: PMC10865203 DOI: 10.2196/50890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2023] [Revised: 11/30/2023] [Accepted: 12/21/2023] [Indexed: 02/01/2024]
Abstract
Machine learning (ML) has seen impressive growth in health science research due to its capacity for handling complex data to perform a range of tasks, including unsupervised learning, supervised learning, and reinforcement learning. To aid health science researchers in understanding the strengths and limitations of ML and to facilitate its integration into their studies, we present here a guideline for integrating ML into an analysis through a structured framework, covering steps from framing a research question to study design and analysis techniques for specialized data types.
Affiliation(s)
- Hunyong Cho
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Jane She
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Daniel De Marchi
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Helal El-Zaatari
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Edward L Barnes
  - Division of Gastroenterology and Hepatology, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
  - Center for Gastrointestinal Biology and Diseases, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Anna R Kahkoska
  - Department of Nutrition, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
  - Division of Endocrinology and Metabolism, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
  - Center for Aging and Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Michael R Kosorok
  - Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
- Arti V Virkud
  - Kidney Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, United States
2
Ge Y, Li Z, Zhang J. A simulation study on missing data imputation for dichotomous variables using statistical and machine learning methods. Sci Rep 2023; 13:9432. [PMID: 37296269 PMCID: PMC10256703 DOI: 10.1038/s41598-023-36509-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/07/2023] [Accepted: 06/05/2023] [Indexed: 06/12/2023]
Abstract
The problem of missing data, particularly for dichotomous variables, is a common issue in medical research. However, few studies have focused on the imputation methods of dichotomous data and their performance, as well as the applicability of these imputation methods and the factors that may affect their performance. In the arrangement of application scenarios, different missing mechanisms, sample sizes, missing rates, the correlation between variables, value distributions, and the number of missing variables were considered. We used data simulation techniques to establish a variety of different compound scenarios for missing dichotomous variables and conducted real-data validation on two real-world medical datasets. We comprehensively compared the performance of eight imputation methods (mode, logistic regression (LogReg), multiple imputation (MI), decision tree (DT), random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and artificial neural network (ANN)) in each scenario. Accuracy and mean absolute error (MAE) were applied to evaluating their performance. The results showed that missing mechanisms, value distributions and the correlation between variables were the main factors affecting the performance of imputation methods. Machine learning-based methods, especially SVM, ANN, and DT, achieved relatively high accuracy with stable performance and were of potential applicability. Researchers should explore the correlation between variables and their distribution pattern in advance and prioritize machine learning-based methods for practical applications when encountering dichotomous missing data.
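As a rough illustration of how such a comparison can be scored, the sketch below masks values in a simulated dichotomous variable, imputes them with the mode (the simplest of the eight methods listed), and computes accuracy and MAE on the masked positions. The 70% prevalence, 20% missing rate, completely-at-random masking, and all function names are assumptions made for this example; they are not taken from the paper.

```python
import random
from collections import Counter


def mode_impute(values):
    """Replace None entries with the most frequent observed value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def score_imputation(truth, masked, imputed):
    """Return (accuracy, MAE) computed only over the masked positions."""
    idx = [i for i, v in enumerate(masked) if v is None]
    errors = [abs(truth[i] - imputed[i]) for i in idx]
    accuracy = sum(e == 0 for e in errors) / len(errors)
    mae = sum(errors) / len(errors)
    return accuracy, mae


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical dichotomous variable with roughly 70% ones.
truth = [1 if rng.random() < 0.7 else 0 for _ in range(200)]

# Assumed mechanism: each value is hidden independently with probability 0.2.
masked = [None if rng.random() < 0.2 else v for v in truth]

imputed = mode_impute(masked)
accuracy, mae = score_imputation(truth, masked, imputed)
print(f"accuracy={accuracy:.3f}  MAE={mae:.3f}")
```

For a 0/1 variable each imputation error is either 0 or 1, so on the masked positions MAE equals 1 minus accuracy; the two metrics only diverge when imputations are probabilities or the variable has more than two levels.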
Affiliation(s)
- Yingfeng Ge
  - Department of Medical Statistics, School of Public Health, Sun Yat-Sen University, Guangzhou, 510080, People's Republic of China
- Zhiwei Li
  - Department of Medical Statistics, School of Public Health, Sun Yat-Sen University, Guangzhou, 510080, People's Republic of China
- Jinxin Zhang
  - Department of Medical Statistics, School of Public Health, Sun Yat-Sen University, Guangzhou, 510080, People's Republic of China
3
Mohammadi T, D'Ascenzo F, Pepe M, Bonsignore Zanghì S, Bernardi M, Spadafora L, Frati G, Peruzzi M, De Ferrari GM, Biondi-Zoccai G. Unsupervised Machine Learning with Cluster Analysis in Patients Discharged after an Acute Coronary Syndrome: Insights from a 23,270-Patient Study. Am J Cardiol 2023; 193:44-51. [PMID: 36870114 DOI: 10.1016/j.amjcard.2023.01.048] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 09/11/2022] [Revised: 01/06/2023] [Accepted: 01/29/2023] [Indexed: 03/06/2023]
Abstract
Characterization and management of patients admitted for acute coronary syndromes (ACS) remain challenging, and it is unclear whether currently available clinical and procedural features can suffice to inform adequate decision making. We aimed to explore the presence of specific subsets among patients with ACS. The details on patients discharged after ACS were obtained by querying an extensive multicenter registry and detailing patient features, as well as management details. The clinical outcomes included fatal and nonfatal cardiovascular events at 1-year follow-up. After missing data imputation, 2 unsupervised machine learning approaches (k-means and Clustering Large Applications [CLARA]) were used to generate separate clusters with different features. Bivariate- and multivariable-adjusted analyses were performed to compare the different clusters for clinical outcomes. A total of 23,270 patients were included, with 12,930 cases (56%) of ST-elevation myocardial infarction (STEMI). K-means clustering identified 2 main clusters: a first one including 21,998 patients (95%) and a second one including 1,282 subjects (5%), with equal distribution for STEMI. CLARA generated 2 main clusters: a first one including 11,268 patients (48%) and a second one with 12,002 subjects (52%). Notably, the STEMI distribution was significantly different in the CLARA-generated clusters. The clinical outcomes were significantly different across clusters, irrespective of the originating algorithm, including death, reinfarction, and major bleeding, as well as their composite. In conclusion, unsupervised machine learning can be leveraged to explore the patterns in ACS, potentially highlighting specific patient subsets to improve risk stratification and management.
Affiliation(s)
- Tanya Mohammadi
  - School of Mathematics, Statistics, and Computer Science, College of Science, University of Tehran, Tehran, Iran
- Fabrizio D'Ascenzo
  - Division of Cardiology, Cardiovascular and Thoracic Department, Città della Salute e della Scienza, Turin, Italy
- Martino Pepe
  - Division of Cardiology, Department of Emergency and Organ Transplantation, University of Bari, Bari, Italy
- Marco Bernardi
  - Department of Clinical, Internal Medicine, Anesthesiology and Cardiovascular Sciences, Sapienza University of Rome, Italy
- Luigi Spadafora
  - Department of Clinical, Internal Medicine, Anesthesiology and Cardiovascular Sciences, Sapienza University of Rome, Italy
- Giacomo Frati
  - Department of Medical-Surgical Sciences and Biotechnologies, Sapienza University of Rome, Latina, Italy; IRCCS NEUROMED, Pozzilli, Italy
- Mariangela Peruzzi
  - Department of Medical-Surgical Sciences and Biotechnologies, Sapienza University of Rome, Latina, Italy; Mediterranea Cardiocentro, Napoli, Italy
- Gaetano Maria De Ferrari
  - Division of Cardiology, Cardiovascular and Thoracic Department, Città della Salute e della Scienza, Turin, Italy
- Giuseppe Biondi-Zoccai
  - Department of Medical-Surgical Sciences and Biotechnologies, Sapienza University of Rome, Latina, Italy; Mediterranea Cardiocentro, Napoli, Italy
4
Hsu JC, Yang YY, Chuang SL, Lin LY, Chen THH. Prediabetes as a risk factor for new-onset atrial fibrillation: the propensity-score matching cohort analyzed using the Cox regression model coupled with the random survival forest. Cardiovasc Diabetol 2023; 22:35. [PMID: 36804876 PMCID: PMC9940357 DOI: 10.1186/s12933-023-01767-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/03/2022] [Accepted: 02/06/2023] [Indexed: 02/22/2023]
Abstract
BACKGROUND The glycemic continuum often indicates a gradual decline in insulin sensitivity leading to an increase in glucose levels. Although prediabetes is an established risk factor for both macrovascular and microvascular diseases, whether prediabetes is independently associated with the risk of developing atrial fibrillation (AF), particularly the occurrence time, has not been well studied using a high-quality research design in combination with statistical machine-learning algorithms. METHODS Using data available from electronic medical records collected from the National Taiwan University Hospital, a tertiary medical center in Taiwan, we conducted a retrospective cohort study consisting of 174,835 adult patients between 2014 and 2019 to investigate the relationship between prediabetes and AF. To render patients with prediabetes as comparable to those with normal glucose test, a propensity-score matching design was used to select the matched pairs of two groups with a 1:1 ratio. The Kaplan-Meier method was used to compare the cumulative risk of AF between prediabetes and normal glucose test using log-rank test. The multivariable Cox regression model was employed to estimate adjusted hazard ratio (HR) for prediabetes versus normal glucose test by stratifying three levels of glycosylated hemoglobin (HbA1c). The machine-learning algorithm using the random survival forest (RSF) method was further used to identify the importance of clinical factors associated with AF in patients with prediabetes. RESULTS A sample of 14,309 pairs of patients with prediabetes and normal glucose test result were selected. The incidence of AF was 11.6 cases per 1000 person-years during a median follow-up period of 47.1 months. The Kaplan-Meier analysis revealed that the risk of AF was significantly higher in patients with prediabetes (log-rank p < 0.001). The multivariable Cox regression model indicated that prediabetes was independently associated with a significantly increased risk of AF (HR 1.24, 95% confidence interval 1.11-1.39, p < 0.001), particularly for patients with HbA1c above 5.5%. The RSF method identified elevated N-terminal natriuretic peptide and altered left heart structure as the two most important risk factors for AF among patients with prediabetes. CONCLUSIONS Our study found that prediabetes is independently associated with a higher risk of AF. Furthermore, alterations in left heart structure make a significant contribution to this elevated risk, and these structural changes may begin during the prediabetes stage.
Affiliation(s)
- Jung-Chi Hsu
  - Division of Cardiology, Department of Internal Medicine, Fu Jen Catholic University Hospital, Fu Jen Catholic University, New Taipei City, Taiwan; Division of Cardiology, Department of Internal Medicine, National Taiwan University College of Medicine and Hospital, No.7, Chung-Chan South Road, Taipei, 100, Taiwan
- Yen-Yun Yang
  - Department of Medical Research, National Taiwan University Hospital, Taipei, Taiwan
- Shu-Lin Chuang
  - Department of Medical Research, National Taiwan University Hospital, Taipei, Taiwan
- Lian-Yu Lin
  - Division of Cardiology, Department of Internal Medicine, National Taiwan University College of Medicine and Hospital, No.7, Chung-Chan South Road, Taipei, 100, Taiwan; Department of Internal Medicine, College of Medicine, National Taiwan University, Taipei, Taiwan
- Tony Hsiu-Hsi Chen
  - Institute of Epidemiology and Preventive Medicine, College of Public Health, National Taiwan University, Taipei, Taiwan
5
Kong W, Hui HWH, Peng H, Goh WWB. Dealing with missing values in proteomics data. Proteomics 2022; 22:e2200092. [PMID: 36349819 DOI: 10.1002/pmic.202200092] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 07/14/2022] [Revised: 09/15/2022] [Accepted: 10/11/2022] [Indexed: 11/10/2022]
Abstract
Proteomics data are often plagued with missingness issues. These missing values (MVs) threaten the integrity of subsequent statistical analyses by reduction of statistical power, introduction of bias, and failure to represent the true sample. Over the years, several categories of missing value imputation (MVI) methods have been developed and adapted for proteomics data. These MVI methods perform their tasks based on different prior assumptions (e.g., data is normally or independently distributed) and operating principles (e.g., the algorithm is built to address random missingness only), resulting in varying levels of performance even when dealing with the same dataset. Thus, to achieve a satisfactory outcome, a suitable MVI method must be selected. To guide decision making on suitable MVI method, we provide a decision chart which facilitates strategic considerations on datasets presenting different characteristics. We also bring attention to other issues that can impact proper MVI such as the presence of confounders (e.g., batch effects) which can influence MVI performance. Thus, these too, should be considered during or before MVI.
Affiliation(s)
- Weijia Kong
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Harvard Wai Hann Hui
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Hui Peng
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore; School of Biological Sciences, Nanyang Technological University, Singapore, Singapore; Centre for Biomedical Informatics, Nanyang Technological University, Singapore, Singapore
6
Adnan FA, Jamaludin KR, Wan Muhamad WZA, Miskon S. A review of the current publication trends on missing data imputation over three decades: direction and future research. Neural Comput Appl 2022. [DOI: 10.1007/s00521-022-07702-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
7
Wang Y, Su J, Zhao X. Interpretability of SurvivalBoost upon Shapley Additive Explanation value on medical data. COMMUN STAT-SIMUL C 2022. [DOI: 10.1080/03610918.2022.2094962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/03/2022]
Affiliation(s)
- Yating Wang
  - School of Mathematics and Statistics, Center for Data Science, Lanzhou University, Lanzhou, P.R. China
- Jinxia Su
  - School of Mathematics and Statistics, Center for Data Science, Lanzhou University, Lanzhou, P.R. China
- Xuejing Zhao
  - School of Mathematics and Statistics, Center for Data Science, Lanzhou University, Lanzhou, P.R. China
8
A Novel Algorithm to Estimate the Significance Level of a Feature Interaction Using the Extreme Gradient Boosting Machine. International Journal of Environmental Research and Public Health 2022; 19:ijerph19042338. [PMID: 35206527 PMCID: PMC8871671 DOI: 10.3390/ijerph19042338] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 01/04/2022] [Revised: 02/16/2022] [Accepted: 02/17/2022] [Indexed: 02/04/2023]
Abstract
Recent studies have revealed the importance of the interaction effect in cardiac research. An analysis would lead to an erroneous conclusion when the approach failed to tackle a significant interaction. Regression models deal with interaction by adding the product of the two interactive variables. Thus, statistical methods could evaluate the significance and contribution of the interaction term. However, machine learning strategies could not provide the p-value of specific feature interaction. Therefore, we propose a novel machine learning algorithm to assess the p-value of a feature interaction, named the extreme gradient boosting machine for feature interaction (XGB-FI). The first step incorporates the concept of statistical methodology by stratifying the original data into four subgroups according to the two interactive features. The second step builds four XGB machines with cross-validation techniques to avoid overfitting. The third step calculates a newly defined feature interaction ratio (FIR) for all possible combinations of predictors. Finally, we calculate the empirical p-value according to the FIR distribution. Computer simulation studies compared the XGB-FI with the multiple regression model with an interaction term. The results showed that the type I error of XGB-FI is valid under the nominal level of 0.05 when there is no interaction effect. The power of XGB-FI is consistently higher than the multiple regression model in all scenarios we examined. In conclusion, the new machine learning algorithm outperforms the conventional statistical model when searching for an interaction.
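For reference, the regression baseline the abstract mentions (adding the product of the two interacting variables) is usually written as below. The symbols are illustrative and are not taken from the paper.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 \, x_1 x_2 + \varepsilon,
\qquad H_0 : \beta_3 = 0
```

Under this formulation, the p-value for the interaction is the p-value of the test on $\beta_3$, which is the quantity XGB-FI is described as approximating empirically for a boosted-tree model.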