1
Wang X, Yang YQ, Hong XY, Liu SH, Li JC, Chen T, Shi JH. A new risk assessment model of venous thromboembolism by considering fuzzy population. BMC Med Inform Decis Mak 2024; 24:413. [PMID: 39736732 DOI: 10.1186/s12911-024-02834-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2023] [Accepted: 12/19/2024] [Indexed: 01/01/2025]
Abstract
BACKGROUND Inpatients at high risk of venous thromboembolism (VTE) usually face serious threats to their health and economic conditions. Many studies using machine learning (ML) models to predict VTE risk overlook the impact of the class-imbalance problem caused by the low incidence rate of VTE, resulting in inferior and unstable model performance, which hinders their ability to replace the Padua model, a linear weighted model widely used in clinical practice. Our study aims to develop a new VTE risk assessment model suitable for Chinese medical inpatients. METHODS Data were collected from 3284 inpatients in the medical department of Peking Union Medical College Hospital (PUMCH) from January 2014 to June 2016. The training and test sets were divided based on admission time, and inpatients admitted from May 2016 to June 2016 formed the test dataset. We explained the class-imbalance problem from a clinical perspective and defined a new term, "fuzzy population", to elaborate and model this phenomenon. By considering the "fuzzy population", a new ML VTE risk assessment model was built through population splitting. The sensitivity and specificity of our method were compared with those of five ML models (support vector machine (SVM), random forest (RF), gradient boosting decision tree (GBDT), logistic regression (LR), and XGBoost) and the Padua model. RESULTS The "fuzzy population" phenomenon was explained and verified on the VTE dataset. The proposed model achieved higher specificity (64.94% vs. 63.30%) and the same sensitivity (90.24% vs. 90.24%) on test data than the Padua model. The other five ML models could not simultaneously surpass the Padua model's sensitivity and specificity. In addition, our model was more robust than the five ML models, with smaller standard deviations of sensitivity and specificity. Adjusting the distribution of negative samples in the training set based on the "fuzzy population" would exacerbate the performance instability of the five ML models, which limited the application of ML methods in clinical practice. CONCLUSIONS The proposed model achieved higher sensitivity and specificity than the Padua model, and better robustness than traditional ML models. This study built a population-split-based ML model of VTE by modeling the class-imbalance problem, and it can be applied more broadly in risk assessment of other diseases.
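The comparison in this abstract rests on sensitivity and specificity. As a minimal sketch (not the authors' code), both metrics can be computed from binary labels and predictions as shown below; the function name and the toy data are hypothetical and are not taken from the study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = VTE, 0 = no VTE)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed VTE cases
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy, made-up labels and predictions (illustration only).
y_true = [1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

With a rare outcome such as VTE, reporting both values matters because a model can reach high accuracy while missing most positive cases.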
Affiliation(s)
- Xin Wang
- Department of Ultrasound, Peking Union Medical College Hospital, Beijing, China
- Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China
- Yu-Qing Yang
- State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
- Xin-Yu Hong
- Department of Respiration, Peking Union Medical College Hospital, No.1, Shuaifuyuan, Dongcheng District, Beijing, 100730, China
- Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China
- Si-Hua Liu
- Department of Respiration, Peking Union Medical College Hospital, No.1, Shuaifuyuan, Dongcheng District, Beijing, 100730, China
- Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China
- Jian-Chu Li
- Department of Ultrasound, Peking Union Medical College Hospital, Beijing, China
- Ting Chen
- Computer Science and Technology, Tsinghua University, Beijing, China
- , Beijing, China
- Ju-Hong Shi
- Department of Respiration, Peking Union Medical College Hospital, No.1, Shuaifuyuan, Dongcheng District, Beijing, 100730, China.
2
Aversano L, Iammarino M, Mancino I, Montano D. A systematic review on artificial intelligence approaches for smart health devices. PeerJ Comput Sci 2024; 10:e2232. [PMID: 39650514 PMCID: PMC11623213 DOI: 10.7717/peerj-cs.2232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/08/2024] [Accepted: 07/12/2024] [Indexed: 12/11/2024]
Abstract
In the context of smart health, the use of wearable Internet of Things (IoT) devices is becoming increasingly popular to monitor and manage patients' health conditions in a more efficient and personalized way. However, choosing the most suitable artificial intelligence (AI) methodology to analyze the data collected by these devices is crucial to ensure the reliability and effectiveness of smart healthcare applications. Additionally, protecting the privacy and security of health data is an area of growing concern, given the sensitivity and personal nature of such information. In this context, machine learning (ML) and deep learning (DL) are emerging as successful technologies because they are suitable for application to advanced analysis and prediction of healthcare scenarios. Therefore, the objective of this work is to contribute to the current state of the literature by identifying challenges, best practices, and future opportunities in the field of smart health. We aim to provide a comprehensive overview of the AI methodologies used, the neural network architectures adopted, and the algorithms employed, as well as examine the privacy and security issues related to the management of health data collected by wearable IoT devices. Through this systematic review, we aim to offer practical guidelines for the design, development, and implementation of AI solutions in smart health, to improve the quality of care provided and promote patient well-being. To pursue our goal, several articles focusing on ML or DL network architectures were selected and reviewed. The final discussion highlights research gaps yet to be investigated, as well as the drawbacks and vulnerabilities of existing IoT applications in smart healthcare.
Affiliation(s)
- Lerina Aversano
- Department of Agricultural Science, Food, Natural Resources and Engineering, University of Foggia, Foggia, Italy
- Martina Iammarino
- Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Ilaria Mancino
- Department of Engineering, University of Sannio, Benevento, Italy
- Debora Montano
- CeRICT scrl—Regional Center Information Communication Technology, Benevento, Italy
3
Jin J, Lu J, Su X, Xiong Y, Ma S, Kong Y, Xu H. Development and Validation of an ICU-Venous Thromboembolism Prediction Model Using Machine Learning Approaches: A Multicenter Study. Int J Gen Med 2024; 17:3279-3292. [PMID: 39070227 PMCID: PMC11283785 DOI: 10.2147/ijgm.s467374] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Accepted: 07/12/2024] [Indexed: 07/30/2024]
Abstract
Purpose The purpose of this study was to establish and validate machine learning-based models for predicting the risk of venous thromboembolism (VTE) in intensive care unit (ICU) patients. Patients and Methods The clinical data of 1494 ICU patients who underwent Doppler ultrasonography or venography between December 2020 and March 2023 were extracted from three tertiary hospitals. The Boruta algorithm was used to screen the essential variables associated with VTE. Five machine learning algorithms were employed: Random Forest (RF), eXtreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), Gradient Boosting Decision Tree (GBDT), and Logistic Regression (LR). Hyperparameter optimization of the predictive models was conducted on the training dataset. Performance in the validation dataset was measured using indicators including the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, specificity, and F1 score. Finally, the optimal model was interpreted using the SHapley Additive exPlanation (SHAP) package. Results The incidence of VTE among the ICU patients in this study was 26.04%. We screened 19 crucial features for the risk prediction model development. Among the five models, the RF model performed best, with an AUC of 0.788 (95% CI: 0.738-0.838), an accuracy of 0.759 (95% CI: 0.709-0.809), a sensitivity of 0.633, and a Brier score of 0.166. Conclusion A machine learning-based model for the prediction of VTE in ICU patients was successfully developed, which could assist clinical medical staff in identifying high-risk populations for VTE in the early stages so that prevention measures can be implemented to reduce the burden on ICU patients.
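Two of the indicators named in this abstract, the Brier score and the F1 score, have simple closed forms. The sketch below is illustrative only: the helper names, the 0.5 decision threshold, and the toy probabilities are assumptions and do not come from the study.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    if not y_true or len(y_true) != len(y_prob):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy, made-up outcomes and predicted probabilities (illustration only).
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

brier = brier_score(y_true, y_prob)
f1 = f1_score(y_true, y_pred)
print(f"Brier={brier:.4f}, F1={f1:.2f}")
```

A lower Brier score indicates better-calibrated probabilities, whereas F1 depends on the chosen threshold.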
Affiliation(s)
- Jie Jin
- School of Nursing, Binzhou Medical University, Binzhou, People’s Republic of China
- Jie Lu
- School of Nursing, Binzhou Medical University, Binzhou, People’s Republic of China
- Xinyang Su
- Department of Spine Surgery, Binzhou Medical University Hospital, Binzhou, People’s Republic of China
- Yinhuan Xiong
- Department of Nursing, Binzhou People’s Hospital, Binzhou, People’s Republic of China
- Shasha Ma
- Department of Neurosurgery, Binzhou Medical University Hospital, Binzhou, People’s Republic of China
- Yang Kong
- School of Health Management, Binzhou Medical University, Yantai, People’s Republic of China
- Hongmei Xu
- School of Nursing, Binzhou Medical University, Binzhou, People’s Republic of China
4
Chiasakul T, Lam BD, McNichol M, Robertson W, Rosovsky RP, Lake L, Vlachos IS, Adamski A, Reyes N, Abe K, Zwicker JI, Patell R. Artificial intelligence in the prediction of venous thromboembolism: A systematic review and pooled analysis. Eur J Haematol 2023; 111:951-962. [PMID: 37794526 PMCID: PMC10900245 DOI: 10.1111/ejh.14110] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/31/2023] [Revised: 09/16/2023] [Accepted: 09/18/2023] [Indexed: 10/06/2023]
Abstract
BACKGROUND Accurate diagnostic and prognostic predictions of venous thromboembolism (VTE) are crucial for VTE management. Artificial intelligence (AI) enables autonomous identification of the most predictive patterns from large complex data. Although evidence regarding its performance in VTE prediction is emerging, a comprehensive analysis of performance is lacking. AIMS To systematically review the performance of AI in the diagnosis and prediction of VTE and compare it to clinical risk assessment models (RAMs) or logistic regression models. METHODS A systematic literature search was performed using PubMed, MEDLINE, EMBASE, and Web of Science from inception to April 20, 2021. Search terms included "artificial intelligence" and "venous thromboembolism." Eligibility criteria were original studies evaluating AI in the prediction of VTE in adults and reporting one of the following outcomes: sensitivity, specificity, positive predictive value, negative predictive value, or area under the receiver operating characteristic curve (AUC). Risks of bias were assessed using the PROBAST tool. An unpaired t-test was performed to compare the mean AUC from AI versus conventional methods (RAMs or logistic regression models). RESULTS A total of 20 studies were included. The number of participants ranged from 31 to 111 888. The AI-based models included artificial neural networks (six studies), support vector machines (four studies), Bayesian methods (one study), super learner ensemble (one study), genetic programming (one study), unspecified machine learning models (two studies), and multiple machine learning models (five studies). Twelve studies (60%) had both training and testing cohorts. Among the 14 studies (70%) in which AUCs were reported, the mean AUCs for AI versus conventional methods were 0.79 (95% CI: 0.74-0.85) versus 0.61 (95% CI: 0.54-0.68), respectively (p < .001). However, the good to excellent discriminative performance of AI methods is unlikely to be replicated when used in clinical practice, because most studies had a high risk of bias due to missing data handling and outcome determination. CONCLUSION The use of AI appears to improve the accuracy of diagnostic and prognostic prediction of VTE over conventional risk models; however, there was a high risk of bias observed across studies. Future studies should focus on transparent reporting, external validation, and clinical application of these models.
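The pooled comparison in this abstract uses an unpaired t-test on mean AUCs. The abstract does not say which variant was used, so the sketch below assumes the classic pooled-variance (Student) form and computes only the test statistic and degrees of freedom; the function name and the AUC values are made up for illustration and are not data from the review.

```python
import math
import statistics


def unpaired_t_statistic(group_a, group_b):
    """Pooled-variance (Student) t statistic and degrees of freedom for two independent samples."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two values")
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)  # sample variances
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / std_err, df


# Made-up AUC values for two groups of models (illustration only).
ai_aucs = [0.80, 0.75, 0.85]
conventional_aucs = [0.60, 0.65, 0.55]
t_stat, dof = unpaired_t_statistic(ai_aucs, conventional_aucs)
print(f"t={t_stat:.3f}, df={dof}")
```

Turning the statistic into a p-value requires the t distribution's cumulative distribution function, which the Python standard library does not provide, so that step is left out here.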
Affiliation(s)
- Thita Chiasakul
- Division of Hematology, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Division of Hemostasis and Thrombosis, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Division of Hematology, Faculty of Medicine, Department of Medicine, Center of Excellence in Translational Hematology, Chulalongkorn University and King Chulalongkorn Memorial Hospital, Bangkok, Thailand
- Barbara D. Lam
- Division of Hematology, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Division of Hemostasis and Thrombosis, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Megan McNichol
- Division of Knowledge Services, Department of Information Services (M.M.), Beth Israel Deaconess Medical Center, Boston, Massachusetts, USA
- William Robertson
- National Blood Clot Alliance, Philadelphia, Pennsylvania, USA
- Department of Emergency Healthcare, College of Health Professions, Weber State University, Ogden, Utah, USA
- Rachel P. Rosovsky
- Division of Hematology/Oncology, Department of Massachusetts General Hospital, Harvard Medical School, Boston, Massachusetts, USA
- Leslie Lake
- National Blood Clot Alliance, Philadelphia, Pennsylvania, USA
- Ioannis S. Vlachos
- Department of Pathology, Cancer Research Institute, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Alys Adamski
- Division of Blood Disorders, National Center on Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
- Nimia Reyes
- Division of Blood Disorders, National Center on Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
- Karon Abe
- Division of Blood Disorders, National Center on Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention, Atlanta, Georgia, USA
- Jeffrey I. Zwicker
- Division of Hematology, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Division of Hemostasis and Thrombosis, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Department of Medicine, Hematology Service, Memorial Sloan Kettering Cancer Center, New York City, New York, USA
- Rushad Patell
- Division of Hematology, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Division of Hemostasis and Thrombosis, Department of Medicine, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
5
Mao C, Xu J, Rasmussen L, Li Y, Adekkanattu P, Pacheco J, Bonakdarpour B, Vassar R, Shen L, Jiang G, Wang F, Pathak J, Luo Y. AD-BERT: Using pre-trained language model to predict the progression from mild cognitive impairment to Alzheimer's disease. J Biomed Inform 2023; 144:104442. [PMID: 37429512 PMCID: PMC11131134 DOI: 10.1016/j.jbi.2023.104442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Revised: 06/13/2023] [Accepted: 07/07/2023] [Indexed: 07/12/2023]
Abstract
OBJECTIVE We develop a deep learning framework based on the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model using unstructured clinical notes from electronic health records (EHRs) to predict the risk of disease progression from Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD). METHODS We identified 3657 patients diagnosed with MCI together with their progress notes from Northwestern Medicine Enterprise Data Warehouse (NMEDW) between 2000 and 2020. The progress notes no later than the first MCI diagnosis were used for the prediction. We first preprocessed the notes by deidentification, cleaning and splitting into sections, and then pre-trained a BERT model for AD (named AD-BERT) based on the publicly available Bio+Clinical BERT on the preprocessed notes. All sections of a patient were embedded into a vector representation by AD-BERT and then combined by global MaxPooling and a fully connected network to compute the probability of MCI-to-AD progression. For validation, we conducted a similar set of experiments on 2563 MCI patients identified at Weill Cornell Medicine (WCM) during the same timeframe. RESULTS Compared with the 7 baseline models, the AD-BERT model achieved the best performance on both datasets, with Area Under receiver operating characteristic Curve (AUC) of 0.849 and F1 score of 0.440 on NMEDW dataset, and AUC of 0.883 and F1 score of 0.680 on WCM dataset. CONCLUSION The use of EHRs for AD-related research is promising, and AD-BERT shows superior predictive performance in modeling MCI-to-AD progression prediction. Our study demonstrates the utility of pre-trained language models and clinical notes in predicting MCI-to-AD progression, which could have important implications for improving early detection and intervention for AD.
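The pooling step described in this abstract (section embeddings combined by global max pooling, then passed to a fully connected network to yield a probability) can be sketched in a few lines. This is a simplified illustration rather than the AD-BERT implementation: the three-dimensional vectors, the single dense layer, and its weights are all made up, and a real model would use BERT-sized embeddings and learned parameters.

```python
import math


def global_max_pool(section_vectors):
    """Element-wise maximum across all section embeddings of one patient."""
    return [max(dim_values) for dim_values in zip(*section_vectors)]


def dense_sigmoid(features, weights, bias):
    """Single fully connected unit followed by a sigmoid, giving a probability."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical 3-dimensional embeddings for three note sections of one patient.
sections = [
    [0.1, -0.4, 0.3],
    [0.5, 0.2, -0.1],
    [-0.2, 0.0, 0.7],
]
pooled = global_max_pool(sections)  # [0.5, 0.2, 0.7]

# Made-up weights and bias standing in for trained parameters.
probability = dense_sigmoid(pooled, weights=[1.0, -2.0, 0.5], bias=-0.45)
print(pooled, round(probability, 3))
```

Max pooling gives a fixed-length patient vector regardless of how many note sections a patient has, which is why it suits variable-length clinical records.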
Affiliation(s)
- Chengsheng Mao
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, United States
- Jie Xu
- Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, United States; Weill Cornell Medicine, New York, NY, United States
- Luke Rasmussen
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, United States
- Yikuan Li
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, United States
- Jennifer Pacheco
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, United States
- Borna Bonakdarpour
- Department of Neurology, Feinberg School of Medicine, Northwestern University, Chicago, IL, United States
- Robert Vassar
- Department of Neurology, Feinberg School of Medicine, Northwestern University, Chicago, IL, United States
- Li Shen
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania, Philadelphia, PA, United States
- Fei Wang
- Weill Cornell Medicine, New York, NY, United States
- Yuan Luo
- Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, United States.
6
Bean DM, Kraljevic Z, Shek A, Teo J, Dobson RJB. Hospital-wide natural language processing summarising the health data of 1 million patients. PLOS Digit Health 2023; 2:e0000218. [PMID: 37159441 PMCID: PMC10168555 DOI: 10.1371/journal.pdig.0000218] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 09/28/2022] [Accepted: 02/16/2023] [Indexed: 05/11/2023]
Abstract
Electronic health records (EHRs) represent a major repository of real-world clinical trajectories, interventions and outcomes. While modern enterprise EHRs try to capture data in structured standardised formats, a significant bulk of the available information captured in the EHR is still recorded only in unstructured text format and can only be transformed into structured codes by manual processes. Recently, Natural Language Processing (NLP) algorithms have reached a level of performance suitable for large-scale and accurate information extraction from clinical text. Here we describe the application of open-source named-entity-recognition and linkage (NER+L) methods (CogStack, MedCAT) to the entire text content of a large UK hospital trust (King's College Hospital, London). The resulting dataset contains 157M SNOMED concepts generated from 9.5M documents for 1.07M patients over a period of 9 years. We present a summary of prevalence and disease onset as well as a patient embedding that captures major comorbidity patterns at scale. NLP has the potential to transform the health data lifecycle, through large-scale automation of a traditionally manual task.
Affiliation(s)
- Daniel M Bean
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom
- Health Data Research UK London, University College London, London, United Kingdom
- Zeljko Kraljevic
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, United Kingdom
- Anthony Shek
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom
- Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom
- James Teo
- Department of Clinical Neuroscience, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom
- Department of Neuroscience, King's College Hospital NHS Foundation Trust, London, United Kingdom
- Richard J B Dobson
- Department of Biostatistics and Health Informatics, Institute of Psychiatry, Psychology and Neuroscience, King's College London, London, United Kingdom
- Health Data Research UK London, University College London, London, United Kingdom
- NIHR Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London, London, United Kingdom
- Institute for Health Informatics, University College London, London, United Kingdom
- NIHR Biomedical Research Centre, University College London Hospitals NHS Foundation Trust, London, United Kingdom
7
Denecke K, Reichenpfader D. Sentiment analysis of clinical narratives: A scoping review. J Biomed Inform 2023; 140:104336. [PMID: 36958461 DOI: 10.1016/j.jbi.2023.104336] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/06/2023] [Revised: 03/06/2023] [Accepted: 03/10/2023] [Indexed: 03/25/2023]
Abstract
A clinical sentiment is a judgment, thought or attitude promoted by an observation with respect to the health of an individual. Sentiment analysis has drawn attention in the healthcare domain for secondary use of data from clinical narratives, with a variety of applications including predicting the likelihood of emerging mental illnesses or clinical outcomes. The current state of research has not yet been summarized. This study presents results from a scoping review that aims to provide an overview of sentiment analysis of clinical narratives in order to summarize existing research and identify open research gaps. The scoping review was carried out in line with the PRISMA-ScR (Preferred Reporting Items for Systematic reviews and Meta-Analyses extension for Scoping Reviews) guideline. Studies were identified by searching 4 electronic databases (e.g., PubMed, IEEE Xplore) in addition to conducting backward and forward reference list checking of the included studies. We extracted information on use cases, methods and tools applied, datasets used and performance of the sentiment analysis approach. Of 1,200 citations retrieved, 29 unique studies were included in the review, covering a period of 8 years. Most studies apply general domain tools (e.g. TextBlob) and sentiment lexicons (e.g. SentiWordNet) for realizing use cases such as prediction of clinical outcomes; others proposed new domain-specific sentiment analysis approaches based on machine learning. Accuracy values between 71.5% and 88.2% are reported. Data used for evaluation and test are often retrieved from MIMIC databases or i2b2 challenges. Latest developments related to artificial neural networks are not yet fully considered in this domain. We conclude that future research should focus on developing a gold standard sentiment lexicon, adapted to the specific characteristics of clinical narratives. Efforts have to be made to either augment existing or create new high-quality labeled data sets of clinical narratives. Last, the suitability of state-of-the-art machine learning methods for natural language processing, and in particular transformer-based models, should be investigated for sentiment analysis of clinical narratives.
Affiliation(s)
- Kerstin Denecke
- Bern University of Applied Sciences, Institute for Medical Informatics, Quellgasse 21, Biel/Bienne, 2502, Bern, Switzerland.
- Daniel Reichenpfader
- Bern University of Applied Sciences, Institute for Medical Informatics, Quellgasse 21, Biel/Bienne, 2502, Bern, Switzerland
8
Fitzsimmons L, Dewan M, Dexheimer JW. Diversity in Machine Learning: A Systematic Review of Text-Based Diagnostic Applications. Appl Clin Inform 2022; 13:569-582. [PMID: 35613914 DOI: 10.1055/s-0042-1749119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2022]
Abstract
OBJECTIVE As the storage of clinical data has transitioned into electronic formats, medical informatics has become increasingly relevant in providing diagnostic aid. The purpose of this review is to evaluate machine learning models that use text data for diagnosis and to assess the diversity of the included study populations. METHODS We conducted a systematic literature review on three public databases. Two authors reviewed every abstract for inclusion. Articles were included if they used or developed machine learning algorithms to aid in diagnosis. Articles focusing on imaging informatics were excluded. RESULTS From 2,260 identified papers, we included 78. Of the machine learning models used, neural networks were relied upon most frequently (44.9%). Studies had a median population of 661.5 patients, and diseases and disorders of 10 different body systems were studied. Of the 35.9% (N = 28) of papers that included race data, 57.1% (N = 16) of study populations were majority White, 14.3% were majority Asian, and 7.1% were majority Black. In 75% (N = 21) of papers, White was the largest racial group represented. Of the papers included, 43.6% (N = 34) included the sex ratio of the patient population. DISCUSSION With the power to build robust algorithms supported by massive quantities of clinical data, machine learning is shaping the future of diagnostics. Limitations of the underlying data create potential biases, especially if patient demographics are unknown or not included in the training. CONCLUSION As the movement toward clinical reliance on machine learning accelerates, both recording demographic information and using diverse training sets should be emphasized. Extrapolating algorithms to demographics beyond the original study population leaves large gaps for potential biases.
Affiliation(s)
- Lane Fitzsimmons
- College of Agriculture and Life Science, Cornell University, Ithaca, New York, United States
- Maya Dewan
- Division of Critical Care Medicine, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, United States
- Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
- Judith W Dexheimer
- Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, Ohio, United States
- Division of Emergency Medicine; Division of Biomedical Informatics, Cincinnati Children's Hospital Medical Center, Cincinnati, Ohio, United States
9
Serrano-Guerrero J, Bani-Doumi M, Romero FP, Olivas JA. Understanding what patients think about hospitals: A deep learning approach for detecting emotions in patient opinions. Artif Intell Med 2022; 128:102298. [DOI: 10.1016/j.artmed.2022.102298] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/22/2021] [Revised: 03/02/2022] [Accepted: 04/04/2022] [Indexed: 11/02/2022]
10
Lei H, Zhang M, Wu Z, Liu C, Li X, Zhou W, Long B, Ma J, Zhang H, Wang Y, Wang G, Gong M, Hong N, Liu H, Wu Y. Development and Validation of a Risk Prediction Model for Venous Thromboembolism in Lung Cancer Patients Using Machine Learning. Front Cardiovasc Med 2022; 9:845210. [PMID: 35321110 PMCID: PMC8934875 DOI: 10.3389/fcvm.2022.845210] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/20/2022] [Accepted: 02/11/2022] [Indexed: 11/13/2022]
Abstract
BACKGROUND There is currently a lack of models for predicting the occurrence of venous thromboembolism (VTE) in patients with lung cancer. Machine learning (ML) techniques are being increasingly adapted for use in the medical field because of their capabilities of intelligent analysis and scalability. This study aimed to develop and validate ML models to predict the incidence of VTE among lung cancer patients. METHODS Data of lung cancer patients with and without VTE from a Grade 3A cancer hospital in China were included. Patient characteristics and clinical predictors related to VTE were collected. The primary endpoint was the diagnosis of VTE during index hospitalization. We calculated and compared the area under the receiver operating characteristic curve (AUROC) using the best-performing model (Random Forest model) selected through multiple model comparison, and investigated feature contributions during the training process with both permutation importance scores and impurity-based feature importance scores in the random forest model. RESULTS In total, 3,398 patients were included in our study, 125 of whom experienced VTE during their hospital stay. The ROC curve and precision-recall curve (PRC) for the Random Forest model showed an AUROC of 0.91 (95% CI: 0.893-0.926) and an AUPRC of 0.43 (95% CI: 0.363-0.500). For the simplified model, the five most relevant features were selected: Karnofsky Performance Status (KPS), a history of VTE, recombinant human endostatin, EGFR-TKI, and platelet count. We re-trained a random forest classifier, which yielded an AUROC of 0.87 (95% CI: 0.802-0.917) and an AUPRC of 0.30 (95% CI: 0.265-0.358). CONCLUSION According to the study results, there was no conspicuous decrease in the model's performance when fewer features were used for prediction; we therefore concluded that our simplified model would be more applicable in real-life clinical settings. The model developed using ML algorithms in our study has the potential to improve the early detection and prediction of the incidence of VTE in patients with lung cancer.
Affiliation(s)
- Haike Lei
  - Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital, Chongqing, China
- Mengyang Zhang
  - Digital Health China Technologies, Co., Ltd., Beijing, China
- Zeyi Wu
  - Digital Health China Technologies, Co., Ltd., Beijing, China
- Chun Liu
  - Digital Health China Technologies, Co., Ltd., Beijing, China
- Xiaosheng Li
  - Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital, Chongqing, China
- Wei Zhou
  - Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital, Chongqing, China
- Bo Long
  - Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital, Chongqing, China
- Jiayang Ma
  - Digital Health China Technologies, Co., Ltd., Beijing, China
- Huiyi Zhang
  - Digital Health China Technologies, Co., Ltd., Beijing, China
- Ying Wang
  - Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital, Chongqing, China
- Guixue Wang
  - MOE Key Laboratory for Biorheological Science and Technology, State and Local Joint Engineering Laboratory for Vascular Implants, College of Bioengineering, Chongqing University, Chongqing, China
- Mengchun Gong
  - Digital Health China Technologies, Co., Ltd., Beijing, China
- Na Hong
  - Digital Health China Technologies, Co., Ltd., Beijing, China
- Haixia Liu
  - Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital, Chongqing, China
- Yongzhong Wu
  - Chongqing Key Laboratory of Translational Research for Cancer Metastasis and Individualized Treatment, Chongqing University Cancer Hospital, Chongqing, China
|
11
|
He L, Luo L, Hou X, Liao D, Liu R, Ouyang C, Wang G. Predicting venous thromboembolism in hospitalized trauma patients: a combination of the Caprini score and data-driven machine learning model. BMC Emerg Med 2021; 21:60. [PMID: 33971809 PMCID: PMC8111727 DOI: 10.1186/s12873-021-00447-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/18/2020] [Accepted: 04/06/2021] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND Venous thromboembolism (VTE) is a common complication in hospitalized trauma patients and has an adverse impact on patient outcomes. However, there is still a lack of appropriate tools for effectively predicting VTE in trauma patients. We sought to verify the accuracy of the Caprini score for predicting VTE in trauma patients and to further improve the prediction through machine learning algorithms. METHODS We retrospectively reviewed emergency trauma patients who were admitted to a trauma center in a tertiary hospital from September 2019 to March 2020. Data from the patients' electronic health records (EHR) and the Caprini score were extracted and combined with multiple feature screening methods and the random forest (RF) algorithm to construct the VTE prediction model. We compared the prediction performance of (1) using only the Caprini score; (2) using EHR data to build a machine learning model; and (3) using EHR data and the Caprini score to build a machine learning model. True Positive Rate (TPR), False Positive Rate (FPR), Area Under Curve (AUC), accuracy, and precision were reported. RESULTS The Caprini score shows a good VTE prediction effect on the hospitalized trauma population when the cut-off point is 11 (TPR = 0.667, FPR = 0.227, AUC = 0.773). The best prediction model is the LASSO+RF model combining the Caprini score with five other features extracted from EHR data (TPR = 0.757, FPR = 0.290, AUC = 0.799). CONCLUSION The Caprini score has good VTE prediction performance in trauma patients, and the use of machine learning methods can further improve the prediction performance.
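As a reading aid for the TPR and FPR figures quoted at a score cut-off, the following is a minimal sketch of how the two rates can be computed from a risk score. It is not the authors' code: the function name `rates_at_cutoff`, the toy scores and labels, and the convention that a score greater than or equal to the cut-off counts as a positive prediction are all assumptions made for illustration.

```python
def rates_at_cutoff(scores, labels, cutoff):
    """Return (TPR, FPR), assuming score >= cutoff is a positive prediction."""
    tp = fp = fn = tn = 0
    for score, has_event in zip(scores, labels):
        predicted_positive = score >= cutoff  # assumed convention, not from the abstract
        if predicted_positive and has_event:
            tp += 1
        elif predicted_positive and not has_event:
            fp += 1
        elif not predicted_positive and has_event:
            fn += 1
        else:
            tn += 1
    # TPR (sensitivity) = TP / (TP + FN); FPR = FP / (FP + TN)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Hypothetical toy data: eight patients, 1 = VTE, 0 = no VTE.
scores = [13, 9, 11, 7, 12, 5, 10, 14]
labels = [1, 0, 1, 0, 0, 0, 1, 1]

tpr, fpr = rates_at_cutoff(scores, labels, cutoff=11)
print(f"TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```

Sweeping `cutoff` over the observed score range and plotting TPR against FPR would trace the ROC curve whose area is the reported AUC.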
Affiliation(s)
- Lingxiao He
  - Trauma Center of West China Hospital/West China School of Nursing, Sichuan University, Guo Xue Road 37#, Chengdu, 610041, China
- Lei Luo
  - College of Chemical Engineering, Sichuan University, Chengdu, China
- Xiaoling Hou
  - Trauma Center of West China Hospital/West China School of Nursing, Sichuan University, Guo Xue Road 37#, Chengdu, 610041, China
- Dengbin Liao
  - Trauma Center of West China Hospital/West China School of Nursing, Sichuan University, Guo Xue Road 37#, Chengdu, 610041, China
- Ran Liu
  - Engineering Research Center of Medical Information Technology, Ministry of Education, West China Hospital of Sichuan University, Chengdu, China
- Chaowei Ouyang
  - Trauma Center of West China Hospital/West China School of Nursing, Sichuan University, Guo Xue Road 37#, Chengdu, 610041, China
- Guanglin Wang
  - Trauma Center of West China Hospital/West China School of Medicine, Sichuan University, Guo Xue Road 37#, Chengdu, 610041, China
|
12
|
Improving sentiment analysis on clinical narratives by exploiting UMLS semantic types. Artif Intell Med 2021; 113:102033. [PMID: 33685589 DOI: 10.1016/j.artmed.2021.102033] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/03/2020] [Revised: 01/26/2021] [Accepted: 02/09/2021] [Indexed: 11/20/2022]
Abstract
Sentiments associated with assessments and observations recorded in a clinical narrative can often indicate a patient's health status. To perform sentiment analysis on clinical narratives, domain-specific knowledge concerning meanings of medical terms is required. In this study, semantic types in the Unified Medical Language System (UMLS) are exploited to improve lexicon-based sentiment classification methods. For sentiment classification using SentiWordNet, the overall accuracy is improved from 0.582 to 0.710 by using logistic regression to determine appropriate polarity scores for UMLS 'Disorders' semantic types. For sentiment classification using a trained lexicon, when disorder terms in a training set are replaced with their semantic types, classification accuracies are improved on some data segments containing specific semantic types. To select an appropriate classification method for a given data segment, classifier combination is proposed. Using classifier combination, classification accuracies are improved on most data segments, with the overall accuracy of 0.882 being obtained.
|
13
|
Luo YF, Henry S, Wang Y, Shen F, Uzuner O, Rumshisky A. The 2019 National Natural language processing (NLP) Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task on clinical concept normalization for clinical records. J Am Med Inform Assoc 2020; 27:1529-1537. [PMID: 32968800 PMCID: PMC7647359 DOI: 10.1093/jamia/ocaa106] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 02/10/2020] [Revised: 05/01/2020] [Accepted: 05/14/2020] [Indexed: 01/19/2023] Open
Abstract
OBJECTIVE The 2019 National Natural language processing (NLP) Clinical Challenges (n2c2)/Open Health NLP (OHNLP) shared task track 3, focused on medical concept normalization (MCN) in clinical records. This track aimed to assess the state of the art in identifying and matching salient medical concepts to a controlled vocabulary. In this paper, we describe the task, describe the data set used, compare the participating systems, present results, identify the strengths and limitations of the current state of the art, and identify directions for future research. MATERIALS AND METHODS Participating teams were provided with narrative discharge summaries in which text spans corresponding to medical concepts were identified. This paper refers to these text spans as mentions. Teams were tasked with normalizing these mentions to concepts, represented by concept unique identifiers, within the Unified Medical Language System. Submitted systems represented 4 broad categories of approaches: cascading dictionary matching, cosine distance, deep learning, and retrieve-and-rank systems. Disambiguation modules were common across all approaches. RESULTS A total of 33 teams participated in the MCN task. The best-performing team achieved an accuracy of 0.8526. The median and mean performances among all teams were 0.7733 and 0.7426, respectively. CONCLUSIONS Overall performance among the top 10 teams was high. However, several mention types were challenging for all teams. These included mentions requiring disambiguation of misspelled words, acronyms, abbreviations, and mentions with more than 1 possible semantic type. Also challenging were complex mentions of long, multi-word terms that may require new ways of extracting and representing mention meaning, the use of domain knowledge, parse trees, or hand-crafted rules.
Affiliation(s)
- Yen-Fu Luo
  - Department of Computer Science, University of Massachusetts Lowell, Lowell, Massachusetts, USA
- Sam Henry
  - Department of Information Sciences and Technology, George Mason University, Fairfax, Virginia, USA
- Yanshan Wang
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
- Feichen Shen
  - Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
- Ozlem Uzuner
  - Department of Information Sciences and Technology, George Mason University, Fairfax, Virginia, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Anna Rumshisky
  - Department of Computer Science, University of Massachusetts Lowell, Lowell, Massachusetts, USA
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
|
14
|
Tajik F, Wang M, Zhang X, Han J. Evaluation of the impact of body mass index on venous thromboembolism risk factors. PLoS One 2020; 15:e0235007. [PMID: 32645000 PMCID: PMC7347165 DOI: 10.1371/journal.pone.0235007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 01/04/2020] [Accepted: 06/06/2020] [Indexed: 12/23/2022] Open
Abstract
In this paper, we investigate the interaction impacts of body mass index (BMI) on the other important risk factors for venous thromboembolism (VTE), using deep venous thrombosis (DVT) patient data from the International Warfarin Pharmacogenetics Consortium (IWPC). We apply eight machine learning techniques: naive Bayes classifier (NB), support vector machine (SVM), elastic net regression (ENET), logistic regression (LR), lasso regression (LAR), multivariate adaptive regression splines (MARS), boosted regression tree (BRT), and random forest model (RF). The RF method is selected as the best model for classification. Out of 33 features considered in this study, we identify 12 variables as relatively important risk factors for VTE. Finally, we examine the interaction impacts of BMI on these important VTE risk factors. We conclude that the impacts of risk factors on VTE incidence vary across BMI groups, and that the variations differ from one risk factor to another. Therefore, the interaction impacts of BMI on the other risk factors have to be taken into account in order to better understand the incidence of VTE.
Affiliation(s)
- Fatemeh Tajik
  - School of Economics and Management, Dalian University of Technology, Dalian, China
- Mingzheng Wang
  - School of Management, Zhejiang University, Hangzhou, China
- Xiaohui Zhang
  - Business School, University of Exeter, Exeter, England, United Kingdom
- Jie Han
  - The First Affiliated Hospital, Zhejiang University, Hangzhou, China
|
15
|
Spasic I, Nenadic G. Clinical Text Data in Machine Learning: Systematic Review. JMIR Med Inform 2020; 8:e17984. [PMID: 32229465 PMCID: PMC7157505 DOI: 10.2196/17984] [Citation(s) in RCA: 137] [Impact Index Per Article: 27.4] [Received: 01/28/2020] [Revised: 02/24/2020] [Accepted: 02/24/2020] [Indexed: 12/22/2022] Open
Abstract
Background Clinical narratives represent the main form of communication within health care, providing a personalized account of patient history and assessments, and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its feasibility to unlock evidence buried in clinical narratives. Machine learning can facilitate rapid development of NLP tools by leveraging large amounts of text data. Objective The main aim of this study was to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigated the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice. Methods Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multifaceted interface, to perform a literature search against MEDLINE. We identified 110 relevant studies and extracted information about text data used to support machine learning, NLP tasks supported, and their clinical applications. The data properties considered included their size, provenance, collection methods, annotation, and any relevant statistics. Results The majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were utilized for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning was explored to iteratively sample a subset of data for manual annotation as a strategy for minimizing the annotation effort while maximizing the predictive performance of the model. 
Supervised learning was successfully used where clinical codes integrated with free-text notes into electronic health records were utilized as class labels. Similarly, distant supervision was used to utilize an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable because of the sensitive nature of data considered. Besides the small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The majority of studies focused on text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance. Conclusions We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as a way of saving the annotation efforts. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or unsupervised learning, which do not require data annotation.
Affiliation(s)
- Irena Spasic
  - School of Computer Science and Informatics, Cardiff University, Cardiff, United Kingdom
- Goran Nenadic
  - Department of Computer Science, University of Manchester, Manchester, United Kingdom
|
16
|
Wang X, Yang YQ, Liu SH, Hong XY, Sun XF, Shi JH. Comparing different venous thromboembolism risk assessment machine learning models in Chinese patients. J Eval Clin Pract 2020; 26:26-34. [PMID: 31840330 DOI: 10.1111/jep.13324] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 04/11/2019] [Revised: 11/06/2019] [Accepted: 11/14/2019] [Indexed: 12/14/2022]
Abstract
OBJECTIVE Venous thromboembolism (VTE) is a fatal complication and the most common preventable cause of death in hospitals. The risk-to-benefit ratio of thromboprophylaxis depends on the performance of the risk assessment model. A linear model, the Padua model, is recommended for medical inpatients in the United States but is not suitable for Chinese inpatients due to differences in race and disease spectrum. Currently, machine learning (ML) methods show advantages in modeling complex data patterns and have been applied to clinical data analysis. This study aimed to build VTE risk assessment ML models among Chinese inpatients and compare the predictive validity of the ML models with that of the Padua model. METHODS We used 376 patients, including 188 patients with VTE, to build a model and then evaluate the predictive validity of the model in a consecutive clinical dataset from Peking Union Medical College Hospital. Nine widely used ML methods were trained on the model derivation set and then compared with the Padua model. RESULTS Among the nine ML methods, random forest (RF), boosting-based methods, and logistic regression achieved a higher specificity, Youden index, positive predictive value, and area under the receiver operating characteristic curve than the Padua model on both the test and clinical validation sets. However, their sensitivities were inferior to that of the Padua model. Combined with the receiver operating characteristic curve, RF, as the best performing model, maintained high specificity with relatively better sensitivity and captured VTE patients' patterns more precisely. CONCLUSIONS Advances in ML technology provide powerful tools for medical data analysis, and choosing models conforming to the disease pattern would achieve good performance. Popular ML models do not surpass the Padua model on all indicators of validity, and the drawback of low sensitivity should be improved upon in the future.
Affiliation(s)
- Xin Wang
  - Department of Ultrasound, Peking Union Medical College Hospital, Beijing, China
  - Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China
- Yu-Qing Yang
  - Computer Science and Technology, Tsinghua University, Beijing, China
- Si-Hua Liu
  - Department of Respiration, Peking Union Medical College Hospital, Beijing, China
  - Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China
- Xin-Yu Hong
  - Department of Respiration, Peking Union Medical College Hospital, Beijing, China
  - Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China
- Xue-Feng Sun
  - Department of Respiration, Peking Union Medical College Hospital, Beijing, China
  - Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China
- Ju-Hong Shi
  - Department of Respiration, Peking Union Medical College Hospital, Beijing, China
  - Chinese Academy of Medical Sciences, Peking Union Medical College, Beijing, China
|
17
|
|
18
|
Assale M, Dui LG, Cina A, Seveso A, Cabitza F. The Revival of the Notes Field: Leveraging the Unstructured Content in Electronic Health Records. Front Med (Lausanne) 2019; 6:66. [PMID: 31058150 PMCID: PMC6478793 DOI: 10.3389/fmed.2019.00066] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 11/27/2018] [Accepted: 03/18/2019] [Indexed: 01/01/2023] Open
Abstract
Problem: Clinical practice requires producing a great number of notes, which consumes time and resources. They contain relevant information, but their secondary use is almost impossible due to their unstructured nature. Researchers are trying to address these problems with traditional and promising novel techniques. Application in real hospital settings does not yet seem possible, though, both because datasets are relatively small and dirty and because language-specific pre-trained models are lacking. Aim: Our aim is to demonstrate the potential of the above techniques, but also to raise awareness of the still open challenges that the scientific communities of IT and medical practitioners must jointly address to realize the full potential of the unstructured content that is produced and digitized daily in hospital settings, both to improve its data quality and to leverage the insights from data-driven predictive models. Methods: To this end, we present a narrative literature review of the most recent and relevant contributions on the application of Natural Language Processing techniques to the free-text content of electronic patient records. In particular, we focused on four selected application domains, namely: data quality, information extraction, sentiment analysis and predictive models, and automated patient cohort selection. We then present a few empirical studies that we undertook at a major teaching hospital specializing in musculoskeletal diseases. Results: We provide the reader with some simple and affordable pipelines, which demonstrate the feasibility of reaching literature performance levels with a single-institution, non-English dataset. In this way, we bridge the literature and real-world needs, taking a further step toward the revival of notes fields.
Affiliation(s)
- Michela Assale
  - K-tree SRL, Pont-Saint-Martin, Italy
  - University of Milano-Bicocca, Milan, Italy
- Linda Greta Dui
  - Politecnico di Milano, Milan, Italy
  - Link-Up Datareg, Cinisello Balsamo, Italy
- Andrea Cina
  - K-tree SRL, Pont-Saint-Martin, Italy
  - University of Milano-Bicocca, Milan, Italy
- Andrea Seveso
  - University of Milano-Bicocca, Milan, Italy
  - Link-Up Datareg, Cinisello Balsamo, Italy
- Federico Cabitza
  - University of Milano-Bicocca, Milan, Italy
  - IRCCS Istituto Ortopedico Galeazzi, Milan, Italy
|
19
|
Alobaidi M, Malik KM, Hussain M. Automated ontology generation framework powered by linked biomedical ontologies for disease-drug domain. Comput Methods Programs Biomed 2018; 165:117-128. [PMID: 30337066 DOI: 10.1016/j.cmpb.2018.08.010] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 12/31/2017] [Revised: 07/31/2018] [Accepted: 08/14/2018] [Indexed: 06/08/2023]
Abstract
OBJECTIVE AND BACKGROUND The exponential growth of the unstructured data available in biomedical literature and electronic health records (EHRs) requires powerful novel technologies and architectures to unlock the information hidden in the unstructured data. The success of smart healthcare applications such as clinical decision support systems, disease diagnosis systems, and healthcare management systems depends on knowledge that machines can understand, interpret, and use to infer new knowledge. In this regard, ontological data models are expected to play a vital role in organizing, integrating, and making informative inferences from the knowledge implicit in that unstructured data, and in representing the resultant knowledge in a form that machines can understand. However, constructing such models is challenging because they demand intensive labor, domain experts, and ontology engineers. Such requirements impose a limit on the scale or scope of ontological data models. We present a framework that reduces the time needed to build ontologies and achieves machine interoperability. METHODS Empowered by linked biomedical ontologies, our proposed novel Automated Ontology Generation Framework consists of five major modules: a) Text Processing using a compute-on-demand approach, b) Medical Semantic Annotation using N-Gram, ontology linking, and classification algorithms, c) Relation Extraction using a graph method and Syntactic Patterns, d) Semantic Enrichment using RDF mining, and e) a Domain Inference Engine to build the formal ontology. RESULTS Quantitative evaluations show 84.78% recall, 53.35% precision, and 67.70% F-measure in terms of disease-drug concept identification; 85.51% recall, 69.61% precision, and 76.74% F-measure with respect to taxonomic relation extraction; and 77.20% recall, 40.10% precision, and 52.78% F-measure with respect to biomedical non-taxonomic relation extraction. 
CONCLUSION We present an automated ontology generation framework that is empowered by Linked Biomedical Ontologies. This framework integrates various natural language processing, semantic enrichment, syntactic pattern, and graph algorithm based techniques. Moreover, it shows that using Linked Biomedical Ontologies enables a promising solution to the problem of automating the process of disease-drug ontology generation.
Affiliation(s)
- Mazen Alobaidi
  - Department of Computer Science and Engineering, Oakland University, Rochester, MI, USA
- Khalid Mahmood Malik
  - Department of Computer Science and Engineering, Oakland University, Rochester, MI, USA
- Maqbool Hussain
  - Department of Software, College of Electronics and Information Engineering, Sejong University, Seoul, South Korea
|