1
Daniali M, Galer PD, Lewis-Smith D, Parthasarathy S, Kim E, Salvucci DD, Miller JM, Haag S, Helbig I. Enriching representation learning using 53 million patient notes through human phenotype ontology embedding. Artif Intell Med 2023; 139:102523. [PMID: 37100502 PMCID: PMC10782859 DOI: 10.1016/j.artmed.2023.102523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2022] [Revised: 02/17/2023] [Accepted: 02/23/2023] [Indexed: 03/04/2023]
Abstract
The Human Phenotype Ontology (HPO) is a dictionary of >15,000 clinical phenotypic terms with defined semantic relationships, developed to standardize phenotypic analysis. Over the last decade, the HPO has been used to accelerate the implementation of precision medicine into clinical practice. In addition, recent research in representation learning, specifically in graph embedding, has led to notable progress in automated prediction via learned features. Here, we present a novel approach to phenotype representation by incorporating phenotypic frequencies based on 53 million full-text health care notes from >1.5 million individuals. We demonstrate the efficacy of our proposed phenotype embedding technique by comparing our work to existing phenotypic similarity-measuring methods. Using phenotype frequencies in our embedding technique, we are able to identify phenotypic similarities that surpass current computational models. Furthermore, our embedding technique exhibits a high degree of agreement with domain experts' judgment. By transforming complex and multidimensional phenotypes from the HPO format into vectors, our proposed method enables efficient representation of these phenotypes for downstream tasks that require deep phenotyping. This is demonstrated in a patient similarity analysis and can further be applied to disease trajectory and risk prediction.
Affiliation(s)
- Maryam Daniali
- Department of Computer Science, Drexel University, Philadelphia, PA, USA; Department of Biomedical and Health Informatics (DBHi), Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Peter D Galer
- Department of Biomedical and Health Informatics (DBHi), Children's Hospital of Philadelphia, Philadelphia, PA, USA; Division of Neurology, Children's Hospital of Philadelphia, Philadelphia, PA, USA; The Epilepsy Neuro Genetics Initiative (ENGIN), Children's Hospital of Philadelphia, Philadelphia, PA, USA; Center for Neuroengineering and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
- David Lewis-Smith
- Department of Biomedical and Health Informatics (DBHi), Children's Hospital of Philadelphia, Philadelphia, PA, USA; Division of Neurology, Children's Hospital of Philadelphia, Philadelphia, PA, USA; The Epilepsy Neuro Genetics Initiative (ENGIN), Children's Hospital of Philadelphia, Philadelphia, PA, USA; Translational and Clinical Research Institute, Newcastle University, Newcastle-upon-Tyne, UK; Department of Clinical Neurosciences, Royal Victoria Infirmary, Newcastle-upon-Tyne, UK
- Shridhar Parthasarathy
- Department of Biomedical and Health Informatics (DBHi), Children's Hospital of Philadelphia, Philadelphia, PA, USA; Division of Neurology, Children's Hospital of Philadelphia, Philadelphia, PA, USA; The Epilepsy Neuro Genetics Initiative (ENGIN), Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Edward Kim
- Department of Computer Science, Drexel University, Philadelphia, PA, USA
- Dario D Salvucci
- Department of Computer Science, Drexel University, Philadelphia, PA, USA
- Jeffrey M Miller
- Department of Biomedical and Health Informatics (DBHi), Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Scott Haag
- Department of Computer Science, Drexel University, Philadelphia, PA, USA; Department of Biomedical and Health Informatics (DBHi), Children's Hospital of Philadelphia, Philadelphia, PA, USA
- Ingo Helbig
- Department of Biomedical and Health Informatics (DBHi), Children's Hospital of Philadelphia, Philadelphia, PA, USA; Division of Neurology, Children's Hospital of Philadelphia, Philadelphia, PA, USA; The Epilepsy Neuro Genetics Initiative (ENGIN), Children's Hospital of Philadelphia, Philadelphia, PA, USA; Department of Neurology, University of Pennsylvania, Perelman School of Medicine, Philadelphia, PA, USA.
2
Hassan Zada MS, Yuan B, Khan WA, Anjum A, Reiff-Marganiec S, Saleem R. A unified graph model based on molecular data binning for disease subtyping. J Biomed Inform 2022; 134:104187. [PMID: 36055637 DOI: 10.1016/j.jbi.2022.104187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2022] [Revised: 08/05/2022] [Accepted: 08/25/2022] [Indexed: 11/19/2022]
Abstract
Molecular disease subtype discovery from omics data is an important research problem in precision medicine. The biggest challenges are the skewed distribution and data variability in the measurements of omics data. These challenges complicate the efficient identification of molecular disease subtypes defined by clinical differences, such as survival. Existing approaches adopt kernels to construct patient similarity graphs from each view through pairwise matching. However, the distance functions used in kernels are unable to utilize the potentially critical information of extreme values and data variability, which leads to a lack of robustness. In this paper, a novel robust distance metric (ROMDEX) is proposed to construct similarity graphs for molecular disease subtypes from omics data, which is able to address the data variability and extreme values challenges. The proposed approach is validated on multiple TCGA cancer datasets, and the results are compared with multiple baseline disease subtyping methods. The evaluation of results is based on Kaplan-Meier survival time analysis, which is validated using statistical tests, e.g., Cox proportional hazards (Cox p-value). We reject the null hypothesis that the cohorts have the same hazard for P-values less than 0.05. The proposed approach achieved best P-values of 0.00181, 0.00171, and 0.00758 for Gene Expression, DNA Methylation, and MicroRNA data respectively, which shows a significant difference in survival between the cohorts. In the results, the proposed approach outperformed the existing state-of-the-art (MRGC, PINS, SNF, Consensus Clustering and Icluster+) disease subtyping approaches on various individual disease views of multiple TCGA datasets.
Affiliation(s)
- Bo Yuan
- School of Computing and Mathematical Sciences, University of Leicester, United Kingdom.
- Wajahat Ali Khan
- School of Computing and Engineering, University of Derby, United Kingdom.
- Ashiq Anjum
- School of Computing and Mathematical Sciences, University of Leicester, United Kingdom.
- Rabia Saleem
- School of Computing and Engineering, University of Derby, United Kingdom.
3
Kohli M, Kar AK, Bangalore A, AP P. Machine learning-based ABA treatment recommendation and personalization for autism spectrum disorder: an exploratory study. Brain Inform 2022; 9:16. [PMID: 35879626 PMCID: PMC9311349 DOI: 10.1186/s40708-022-00164-6] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/21/2022] [Accepted: 06/25/2022] [Indexed: 12/27/2022] Open
Abstract
Autism spectrum is a brain development condition that impairs an individual's capacity to communicate socially and manifests through strict routines and obsessive-compulsive behavior. Applied behavior analysis (ABA) is the gold-standard treatment for autism spectrum disorder (ASD). However, as the number of ASD cases increases, there is a substantial shortage of licensed ABA practitioners, limiting the timely formulation, revision, and implementation of treatment plans and goals. Additionally, the subjectivity of the clinician and a lack of data-driven decision-making affect treatment quality. We address these obstacles by applying two machine learning algorithms to recommend and personalize ABA treatment goals for 29 study participants with ASD. The patient similarity and collaborative filtering methods predicted ABA treatment with an average accuracy of 81-84% and a normalized discounted cumulative gain (NDCG) of 79-81% compared to clinician-prepared ABA treatment recommendations. Additionally, we assess the two models' treatment efficacy (TE) by measuring the percentage of recommended treatment goals mastered by the study participants. The proposed treatment recommendation and personalization strategy is generalizable to other intervention methods in addition to ABA and to other brain disorders. This study was registered as a clinical trial on November 5, 2020 with trial registration number CTRI/2020/11/028933.
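The abstract above reports normalized discounted cumulative gain (NDCG) without defining it. As a minimal sketch of the standard definition only, and not the study's own evaluation code, NDCG divides the discounted cumulative gain of a ranked list by that of its ideal ordering. The function names and the relevance grades in the usage note are assumptions made for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1),
    with ranks starting at 1 (hence `position + 2` for 0-based positions)."""
    return sum(rel / math.log2(position + 2)
               for position, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal (descending) ranking.
    Returns 0.0 when no item is relevant, to avoid division by zero."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

For example, `ndcg([0, 1])` scores a list whose only relevant goal was ranked second; it is lower than `ndcg([1, 0])`, which equals 1.0 because the relevant goal is ranked first.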
Affiliation(s)
- Manu Kohli
- Indian Institute of Technology-Delhi, Department of Management Studies, IV Floor, Vishwakarma Bhavan, Shaheed Jeet Singh Marg, Hauz Khas, New Delhi, 110016 India
- Arpan Kumar Kar
- Indian Institute of Technology-Delhi, Department of Management Studies, IV Floor, Vishwakarma Bhavan, Shaheed Jeet Singh Marg, Hauz Khas, New Delhi, 110016 India
- Anjali Bangalore
- ICON Centre, K. M. Chavan chawk, Shivajinagar Road, Garkheda, Aurangabad, 431005 India
- Prathosh AP
- Indian Institute of Science, CV Raman Rd, Bengaluru, 560012 Karnataka India
4
Wang N, Huang Y, Liu H, Zhang Z, Wei L, Fei X, Chen H. Study on the semi-supervised learning-based patient similarity from heterogeneous electronic medical records. BMC Med Inform Decis Mak 2021; 21:58. [PMID: 34330261 PMCID: PMC8323210 DOI: 10.1186/s12911-021-01432-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/21/2021] [Accepted: 02/09/2021] [Indexed: 12/24/2022] Open
Abstract
Background A new learning-based patient similarity measurement was proposed to measure patients’ similarity for heterogeneous electronic medical records (EMRs) data. Methods We first calculated feature-level similarities according to the features’ attributes. A domain expert provided patient similarity scores of 30 randomly selected patients. These similarity scores and feature-level similarities for 30 patients comprised the labeled sample set, which was used for the semi-supervised learning algorithm to learn the patient-level similarities for all patients. Then we used the k-nearest neighbor (kNN) classifier to predict four liver conditions. The predictive performances were compared in four different situations. We also compared the performances between personalized kNN models and other machine learning models. We assessed the predictive performances by the area under the receiver operating characteristic curve (AUC), F1-score, and cross-entropy (CE) loss. Results As the size of the random training samples increased, the kNN models using the learned patient similarity to select near neighbors consistently outperformed those using the Euclidean distance to select near neighbors (all P values < 0.001). The kNN models using the learned patient similarity to identify the top k nearest neighbors from the random training samples also had a higher best-performance (AUC: 0.95 vs. 0.89, F1-score: 0.84 vs. 0.67, and CE loss: 1.22 vs. 1.82) than those using the Euclidean distance. As the size of the similar training samples increased, which composed the most similar samples determined by the learned patient similarity, the performance of kNN models using the simple Euclidean distance to select the near neighbors degraded gradually. When exchanging the role of the Euclidean distance, and the learned patient similarity in selecting the near neighbors and similar training samples, the performance of the kNN models gradually increased. 
These two kinds of kNN models had the same best performance of AUC 0.95, F1-score 0.84, and CE loss 1.22. Among the four reference models, the highest AUC and F1-score were 0.94 and 0.80, respectively, which were both lower than those for the simple and similarity-based kNN models. Conclusions This learning-based method opened an opportunity for similarity measurement based on heterogeneous EMR data and supported the secondary use of EMR data.
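The idea of using a learned patient similarity, rather than Euclidean distance, to pick the nearest neighbors can be illustrated with a small sketch. This is not the authors' implementation: the function `knn_predict`, its arguments, and the assumption that similarity scores to each training patient are already available are all hypothetical.

```python
from collections import Counter


def knn_predict(query_similarities, train_labels, k=3):
    """Majority vote over the k training patients most similar to the query.

    query_similarities[i] is an (assumed precomputed) similarity score between
    the query patient and training patient i; higher means more similar.
    """
    # Rank training patients from most to least similar to the query.
    ranked = sorted(range(len(train_labels)),
                    key=lambda i: query_similarities[i],
                    reverse=True)
    # Count the labels of the k most similar patients and return the winner.
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]
```

Swapping in a different similarity function changes only how `query_similarities` is produced; the voting step stays the same, which is why the neighbor-selection metric can be compared in isolation.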
Affiliation(s)
- Ni Wang
- School of Biomedical Engineering, Capital Medical University, No.10, Xitoutiao, You An Men, Fengtai District, Beijing, 100069, People's Republic of China; Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University, Beijing, 100069, People's Republic of China
- Yanqun Huang
- School of Biomedical Engineering, Capital Medical University, No.10, Xitoutiao, You An Men, Fengtai District, Beijing, 100069, People's Republic of China; Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University, Beijing, 100069, People's Republic of China
- Honglei Liu
- School of Biomedical Engineering, Capital Medical University, No.10, Xitoutiao, You An Men, Fengtai District, Beijing, 100069, People's Republic of China; Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University, Beijing, 100069, People's Republic of China
- Zhiqiang Zhang
- School of Biomedical Engineering, Capital Medical University, No.10, Xitoutiao, You An Men, Fengtai District, Beijing, 100069, People's Republic of China; Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University, Beijing, 100069, People's Republic of China
- Lan Wei
- Information Center, Xuanwu Hospital, Capital Medical University, Beijing, 100053, People's Republic of China
- Xiaolu Fei
- Information Center, Xuanwu Hospital, Capital Medical University, Beijing, 100053, People's Republic of China
- Hui Chen
- School of Biomedical Engineering, Capital Medical University, No.10, Xitoutiao, You An Men, Fengtai District, Beijing, 100069, People's Republic of China; Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University, Beijing, 100069, People's Republic of China.
5
Fang HSA, Tan NC, Tan WY, Oei RW, Lee ML, Hsu W. Patient similarity analytics for explainable clinical risk prediction. BMC Med Inform Decis Mak 2021; 21:207. [PMID: 34210320 PMCID: PMC8247104 DOI: 10.1186/s12911-021-01566-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/14/2021] [Accepted: 06/22/2021] [Indexed: 01/27/2023] Open
Abstract
BACKGROUND Clinical risk prediction models (CRPMs) use patient characteristics to estimate the probability of having or developing a particular disease and/or outcome. While CRPMs are gaining in popularity, they have yet to be widely adopted in clinical practice. The lack of explainability and interpretability has limited their utility. Explainability is the extent to which a model's prediction process can be described. Interpretability is the degree to which a user can understand the predictions made by a model. METHODS The study aimed to demonstrate the utility of patient similarity analytics in developing an explainable and interpretable CRPM. Data were extracted from the electronic medical records of patients with type-2 diabetes mellitus, hypertension and dyslipidaemia in a Singapore public primary care clinic. We used a modified K-nearest neighbour method, which incorporated expert input, to develop a patient similarity model on this real-world training dataset (n = 7,041) and validated it on a testing dataset (n = 3,018). The results were compared with logistic regression, random forest (RF) and support vector machine (SVM) models from the same dataset. The patient similarity model was then implemented in a prototype system to demonstrate the identification, explainability and interpretability of similar patients and the prediction process. RESULTS The patient similarity model (AUROC = 0.718) was comparable to the logistic regression (AUROC = 0.695), RF (AUROC = 0.764) and SVM models (AUROC = 0.766). We packaged the patient similarity model in a prototype web application. A proof of concept demonstrated how the application provided both quantitative and qualitative information, in the form of patient narratives. This information was used to better inform and influence clinical decision-making, such as getting a patient to agree to start insulin therapy. CONCLUSIONS Patient similarity analytics is a feasible approach to develop an explainable and interpretable CRPM.
While the approach is generalizable, it can be used to develop locally relevant information, based on the database it searches. Ultimately, such an approach can generate more informative CRPMs, which can be deployed as part of clinical decision support tools to better facilitate shared decision-making in clinical practice.
Affiliation(s)
- Hao Sen Andrew Fang
- SingHealth Polyclinics, SingHealth, 167, Jalan Bukit Merah, Connection One, Tower 5, #15-10, Singapore, P.O. 150167, Singapore.
- Ngiap Chuan Tan
- SingHealth Polyclinics, SingHealth, 167, Jalan Bukit Merah, Connection One, Tower 5, #15-10, Singapore, P.O. 150167, Singapore; Family Medicine Academic Clinical Programme, SingHealth-Duke NUS Academic Medical Centre, Singapore, Singapore
- Wei Ying Tan
- Institute of Data Science, National University of Singapore, Singapore, Singapore
- Ronald Wihal Oei
- Institute of Data Science, National University of Singapore, Singapore, Singapore
- Mong Li Lee
- Institute of Data Science, National University of Singapore, Singapore, Singapore; School of Computing, National University of Singapore, Singapore, Singapore
- Wynne Hsu
- Institute of Data Science, National University of Singapore, Singapore, Singapore; School of Computing, National University of Singapore, Singapore, Singapore
6
Chen X, Faviez C, Vincent M, Garcelon N, Saunier S, Burgun A. Identification of Similar Patients Through Medical Concept Embedding from Electronic Health Records: A Feasibility Study for Rare Disease Diagnosis. Stud Health Technol Inform 2021; 281:600-4. [PMID: 34042646 DOI: 10.3233/SHTI210241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/12/2023]
Abstract
Identifying patients with similar clinical profiles and deriving insights from the records and outcomes of similar patients can support fast and precise diagnosis and other clinical decisions for rare diseases. Similarity methods must take into account the semantic relations between medical concepts as well as the differing relevance of the medical concepts present in patients' medical records. In this paper, we introduce methods developed in the context of rare disease screening/diagnosis from a clinical data warehouse using medical concept embedding and adjusted aggregations. Our methods provided better preliminary results than baseline methods, with a significant improvement in precision among the top-ranked similar patients, which is encouraging for further fine-tuning and application to a large-scale dataset for new/candidate patient identification.
7
Cuadrado D, Riaño D, Gómez J, Rodríguez A, Bodí M. Methods and measures to quantify ICU patient heterogeneity. J Biomed Inform 2021; 117:103768. [PMID: 33839305 DOI: 10.1016/j.jbi.2021.103768] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/09/2020] [Revised: 02/21/2021] [Accepted: 03/29/2021] [Indexed: 11/22/2022]
Abstract
Patients in intensive care units are heterogeneous, and the daily prediction of their days to discharge (DTD) is a complex task that practitioners and computers are not always able to solve satisfactorily. In order to make more precise DTD predictors, it is necessary to have tools for the analysis of the heterogeneity of the patients. Unfortunately, publications in this field are almost non-existent. To alleviate this lack of tools, we propose four methods and their corresponding measures to quantify the heterogeneity of intensive care patients in the process of determining the DTD. These new methods and measures have been tested with patients admitted over four years to a tertiary hospital in Spain. The results deepen the understanding of the intensive care patient and can serve as a basis for the construction of better DTD predictors.
8
Lopez Pineda A, Pourshafeie A, Ioannidis A, Leibold CM, Chan AL, Bustamante CD, Frankovich J, Wojcik GL. Discovering prescription patterns in pediatric acute-onset neuropsychiatric syndrome patients. J Biomed Inform 2020; 113:103664. [PMID: 33359113 DOI: 10.1016/j.jbi.2020.103664] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2020] [Revised: 10/28/2020] [Accepted: 12/10/2020] [Indexed: 11/28/2022]
Abstract
OBJECTIVE Pediatric acute-onset neuropsychiatric syndrome (PANS) is a complex neuropsychiatric syndrome characterized by an abrupt onset of obsessive-compulsive symptoms and/or severe eating restrictions, along with at least two concomitant debilitating cognitive, behavioral, or neurological symptoms. A wide range of pharmacological interventions, along with behavioral and environmental modifications and psychotherapies, have been adopted to treat symptoms and underlying etiologies. Our goal was to develop a data-driven approach to identify treatment patterns in this cohort. MATERIALS AND METHODS In this cohort study, we extracted medical prescription histories from electronic health records. We developed a modified dynamic programming approach to perform global alignment of those medication histories. Our approach is unique since it considers time gaps in prescription patterns as part of the similarity strategy. RESULTS This study included 43 consecutive new-onset pre-pubertal patients who had at least 3 clinic visits. Our algorithm identified six clusters with distinct medication usage histories, which may represent clinicians' practice of treating PANS of different severities and etiologies, i.e., two most severe groups requiring high-dose intravenous steroids; two arthritic or inflammatory groups requiring prolonged nonsteroidal anti-inflammatory drug (NSAID) use; and two mild relapsing/remitting groups treated with a short course of NSAID. The psychometric scores as outcomes in each cluster generally improved within the first two years. DISCUSSION AND CONCLUSION Our algorithm shows potential to improve our knowledge of treatment patterns in the PANS cohort, while helping clinicians understand how patients respond to a combination of drugs.
Affiliation(s)
- Arturo Lopez Pineda
- Department of Biomedical Data Science, Stanford University, CA, USA; Department of Data Science, Amphora Health, Morelia, Mexico
- Armin Pourshafeie
- Department of Biomedical Data Science, Stanford University, CA, USA; Department of Physics, Stanford University, CA, USA
- Collin McCloskey Leibold
- Department of Pediatrics, Division of Allergy, Immunology, and Rheumatology, Stanford University, CA, USA; Department of Medicine, University of Massachusetts Medical School, Worcester, MA, USA
- Avis L Chan
- Department of Pediatrics, Division of Allergy, Immunology, and Rheumatology, Stanford University, CA, USA
- Carlos D Bustamante
- Department of Biomedical Data Science, Stanford University, CA, USA; Department of Genetics, Stanford University, CA, USA; Chan Zuckerberg Biohub, San Francisco, CA, USA.
- Jennifer Frankovich
- Department of Pediatrics, Division of Allergy, Immunology, and Rheumatology, Stanford University, CA, USA.
- Genevieve L Wojcik
- Department of Biomedical Data Science, Stanford University, CA, USA; Department of Epidemiology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA.
9
Faviez C, Chen X, Garcelon N, Neuraz A, Knebelmann B, Salomon R, Lyonnet S, Saunier S, Burgun A. Diagnosis support systems for rare diseases: a scoping review. Orphanet J Rare Dis 2020; 15:94. [PMID: 32299466 PMCID: PMC7164220 DOI: 10.1186/s13023-020-01374-z] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 01/27/2020] [Accepted: 03/31/2020] [Indexed: 12/14/2022] Open
Abstract
INTRODUCTION Rare diseases affect approximately 350 million people worldwide. Delayed diagnosis is frequent due to lack of knowledge of most clinicians and a small number of expert centers. Consequently, computerized diagnosis support systems have been developed to address these issues, with many relying on rare disease expertise and taking advantage of the increasing volume of generated and accessible health-related data. Our objective is to perform a review of all initiatives aiming to support the diagnosis of rare diseases. METHODS A scoping review was conducted based on methods proposed by Arksey and O'Malley. A charting form for relevant study analysis was developed and used to categorize data. RESULTS Sixty-eight studies were retained at the end of the charting process. Diagnosis targets varied from 1 rare disease to all rare diseases. Material used for diagnosis support consisted mostly of phenotype concepts, images or fluids. Fifty-seven percent of the studies used expert knowledge. Two-thirds of the studies relied on machine learning algorithms, and one-third used simple similarities. Manual algorithms were encountered as well. Most of the studies presented satisfying performance of evaluation by comparison with references or with external validation. Fourteen studies provided online tools, most of which aimed to support the diagnosis of all rare diseases by considering queries based on phenotype concepts. CONCLUSION Numerous solutions relying on different materials and use of various methodologies are emerging with satisfying preliminary results. However, the variability of approaches and evaluation processes complicates the comparison of results. Efforts should be made to adequately validate these tools and guarantee reproducibility and explicability.
Affiliation(s)
- Carole Faviez
- Centre de Recherche des Cordeliers, INSERM, Université de Paris, Sorbonne Université, F-75006, Paris, France.
- Xiaoyi Chen
- Centre de Recherche des Cordeliers, INSERM, Université de Paris, Sorbonne Université, F-75006, Paris, France
- Nicolas Garcelon
- Centre de Recherche des Cordeliers, INSERM, Université de Paris, Sorbonne Université, F-75006, Paris, France; Institut Imagine, Université de Paris, F-75015, Paris, France
- Antoine Neuraz
- Centre de Recherche des Cordeliers, INSERM, Université de Paris, Sorbonne Université, F-75006, Paris, France; Département d'informatique médicale, Hôpital Necker-Enfants Malades, Assistance Publique - Hôpitaux de Paris (AP-HP), F-75015, Paris, France
- Bertrand Knebelmann
- Service de Néphrologie Transplantation Adultes, Hôpital Necker-Enfants Malades, F-75015, Paris, France; Université de Paris, F-75006, Paris, France; Institut Necker-Enfants Malades, INSERM, Hôpital Necker-Enfants Malades, F-75015, Paris, France
- Rémi Salomon
- Institut Imagine, Université de Paris, F-75015, Paris, France; Service de Néphrologie Pédiatrique, Hôpital Necker-Enfants Malades, Assistance Publique-Hôpitaux de Paris (AP-HP), Université de Paris, F-75015, Paris, France
- Stanislas Lyonnet
- Université de Paris, F-75006, Paris, France; Laboratory of Embryology and Genetics of Congenital Malformations, INSERM UMR 1163, Université de Paris, Imagine Institute, F-75015, Paris, France; Service de génétique, Hôpital Necker-Enfants Malades, Assistance Publique - Hôpitaux de Paris (AP-HP), F-75015, Paris, France
- Sophie Saunier
- Université de Paris, F-75006, Paris, France; Laboratory of Renal Hereditary Diseases, INSERM UMR 1163, Université de Paris, Imagine Institute, F-75015, Paris, France
- Anita Burgun
- Centre de Recherche des Cordeliers, INSERM, Université de Paris, Sorbonne Université, F-75006, Paris, France; Département d'informatique médicale, Hôpital Necker-Enfants Malades, Assistance Publique - Hôpitaux de Paris (AP-HP), F-75015, Paris, France; Université de Paris, F-75006, Paris, France; PaRis Artificial Intelligence Research InstitutE (PRAIRIE), Paris, France
10
Jia Z, Zeng X, Duan H, Lu X, Li H. A patient-similarity-based model for diagnostic prediction. Int J Med Inform 2020; 135:104073. [PMID: 31923816 DOI: 10.1016/j.ijmedinf.2019.104073] [Citation(s) in RCA: 21] [Impact Index Per Article: 4.2] [Received: 10/04/2019] [Revised: 11/26/2019] [Accepted: 12/30/2019] [Indexed: 12/28/2022]
Abstract
OBJECTIVE To simulate the clinical reasoning of doctors, automatically retrieve patients analogous to an index patient, and predict diagnoses from the similar/dissimilar patients. METHODS We proposed a novel patient-similarity-based framework for diagnostic prediction, which is inspired by the structure-mapping theory of analogical reasoning in psychology. Patient similarity is defined as the similarity between two patients' diagnosis sets rather than as a dichotomous value (absence/presence of just one disease). The multilabel classification problem is converted to a single-value regression problem by integrating the pairwise patients' clinical features into a vector and taking the vector as the input and the patient similarity as the output. In contrast to the common k-NN method, which only considers the nearest neighbors, we not only utilize similar patients (positive analogy) to generate diagnostic hypotheses, but also utilize dissimilar patients (negative analogy) to reject diagnostic hypotheses. RESULTS The patient-similarity-based models perform better than the one-vs-all baseline and traditional k-NN methods. The F1 score of positive-analogy-based prediction is 0.698, significantly higher than the scores of baselines ranging from 0.368 to 0.661. It increases to 0.703 when the negative analogy method is applied to modify the prediction results of positive analogy. The performance of this method is highly promising for larger datasets. CONCLUSION The patient-similarity-based model provides diagnostic decision support that is more accurate, generalizable, and interpretable than those of previous methods and is based on heterogeneous and incomplete data. The model also serves as a new application for the use of clinical big data through artificial intelligence technology.
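The abstract above defines patient similarity over two patients' diagnosis sets but does not name the set measure used. As one plausible illustration only, the sketch below uses the Jaccard index; that choice, the function name, and the example diagnosis codes are assumptions and are not taken from the paper.

```python
def diagnosis_set_similarity(diagnoses_a, diagnoses_b):
    """Jaccard index between two patients' diagnosis sets (assumed measure).

    Returns a value in [0, 1]; by convention here, two empty sets count as
    identical (1.0).
    """
    set_a, set_b = set(diagnoses_a), set(diagnoses_b)
    if not set_a and not set_b:
        return 1.0
    # Shared diagnoses divided by all distinct diagnoses across both patients.
    return len(set_a & set_b) / len(set_a | set_b)
```

A graded target of this kind, rather than a yes/no label for a single disease, is what lets the multilabel problem be posed as regression on a pair of patients: for instance, `diagnosis_set_similarity({"I10", "E11"}, {"E11", "E78"})` yields one shared diagnosis out of three distinct ones.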
11
Abstract
Background Sequence alignment is a way of arranging sequences (e.g., DNA, RNA, protein, natural language, financial data, or medical events) to identify the relatedness between two or more sequences and regions of similarity. For Electronic Health Records (EHR) data, sequence alignment helps to identify patients with similar disease trajectories for more relevant and precise prognosis, diagnosis and treatment. Methods We tested two cutting-edge global sequence alignment methods, namely dynamic time warping (DTW) and the Needleman-Wunsch algorithm (NWA), together with their local modifications, DTW for Local alignment (DTWL) and the Smith-Waterman algorithm (SWA), for aligning patient medical records. We also used 4 sets of synthetic patient medical records generated from a large real-world EHR database as gold standard data to objectively evaluate these sequence alignment algorithms. Results For global sequence alignments, 47 out of 80 DTW alignments and 11 out of 80 NWA alignments had higher similarity scores than the reference alignments, while the remaining 33 DTW alignments and 69 NWA alignments had the same similarity scores as the reference alignments. Forty-six out of 80 DTW alignments had better similarity scores than the NWA alignments, with the remaining 34 cases having equal similarity scores from both algorithms. For local sequence alignments, 70 out of 80 DTWL alignments and 68 out of 80 SWA alignments had larger coverage and higher similarity scores than the reference alignments, while the remaining DTWL and SWA alignments had the same coverage and similarity scores as the reference alignments. Six out of 80 DTWL alignments showed larger coverage and higher similarity scores than the SWA alignments. Thirty DTWL alignments had equal coverage but better similarity scores than SWA. DTWL and SWA had equal coverage and similarity scores for the remaining 44 cases. Conclusions DTW, NWA, DTWL and SWA outperformed the reference alignments. DTW (or DTWL) seems to align better than NWA (or SWA) by inserting new daily events and identifying more similarities between patient medical records. The evaluation results could provide valuable information on the strengths and weaknesses of these sequence alignment methods for future development of sequence alignment methods and patient similarity-based studies.
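To make the contrast between the two global methods named in this abstract concrete, here is a minimal sketch of textbook DTW and Needleman-Wunsch scoring on sequences of event labels. It is not the authors' implementation: the 0/1 event cost, the match/mismatch/gap values, and the function names are assumptions chosen for illustration.

```python
def dtw_distance(a, b, cost=lambda x, y: 0 if x == y else 1):
    """Classic dynamic time warping distance between two event sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            # An element may be matched repeatedly (warping) instead of gapped.
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score with a linear gap penalty."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = s[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            s[i][j] = max(diag, s[i - 1][j] + gap, s[i][j - 1] + gap)
    return s[n][m]


# Two toy daily-event sequences; the first repeats the "lab" event.
record_a = ["visit", "lab", "lab", "rx"]
record_b = ["visit", "lab", "rx"]
dtw = dtw_distance(record_a, record_b)           # repeated event is absorbed by warping
nw = needleman_wunsch_score(record_a, record_b)  # repeated event costs one gap
```

On this toy pair, DTW absorbs the repeated event at no cost, while Needleman-Wunsch has to pay for a gap, which is one intuition for why the two methods can score the same records differently.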
Affiliation(s)
- Ming Huang: Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Nilay D Shah: Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
- Lixia Yao: Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA
12
Wang N, Huang Y, Liu H, Fei X, Wei L, Zhao X, Chen H. Measurement and application of patient similarity in personalized predictive modeling based on electronic medical records. Biomed Eng Online 2019; 18:98. [PMID: 31601207 PMCID: PMC6788002 DOI: 10.1186/s12938-019-0718-2] [Received: 04/22/2019] [Accepted: 10/01/2019] [Indexed: 12/24/2022]
Abstract
Background Conventional risk prediction techniques may not be the most suitable approach for personalized prediction for individual patients. Therefore, individualized predictive modeling based on similar patients has emerged. This study aimed to propose a comprehensive measurement of patient similarity using real-world electronic medical records data, and evaluate the effectiveness of the individualized prediction of a patient’s diabetes status based on the patient similarity. Results When using no more than 30% of the whole training sample, the personalized predictive models outperformed corresponding traditional models built on randomly selected training samples of the same size as the personalized models (P < 0.001 for all). With only the top 1000 (10%), 700 (7%) and 1400 (14%) similar samples, personalized random forest, k-nearest neighbor and logistic regression models reached the globally optimal performance with the area under the receiver-operating characteristic (ROC) curve of 0.90, 0.82 and 0.89, respectively. Conclusions The proposed patient similarity measurement was effective when developing personalized predictive models. The successful application of patient similarity in predicting a patient’s diabetes status provided useful references for diagnostic decision-making support by investigating the evidence on similar patients.
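The idea of training a personalized model only on the most similar patients starts with a selection step like the one sketched below. This is a hedged illustration: the abstract does not specify the similarity measurement, so cosine similarity over numeric feature vectors stands in for it here, and `cosine` and `top_k_similar` are hypothetical helper names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two numeric feature vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(index_features, cohort, k):
    """Return the k (features, label) records most similar to the index patient."""
    ranked = sorted(cohort, key=lambda rec: cosine(index_features, rec[0]), reverse=True)
    return ranked[:k]


# Toy cohort: (feature vector, diabetes label).
cohort = [
    ([1.0, 0.0, 1.0], 1),
    ([0.9, 0.1, 0.8], 1),
    ([0.0, 1.0, 0.0], 0),
    ([0.1, 0.9, 0.2], 0),
]
subset = top_k_similar([1.0, 0.0, 0.9], cohort, k=2)
```

The returned `subset` would then serve as the training sample for whichever classifier is being personalized, in place of a random sample of the same size.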
Affiliation(s)
- Ni Wang: School of Biomedical Engineering, Capital Medical University, No. 10, Xitoutiao, YouAnMen, Fengtai District, Beijing, 100069, China; Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University, No. 10, Xitoutiao, YouAnMen, Fengtai District, Beijing, 100069, China
- Yanqun Huang: School of Biomedical Engineering, Capital Medical University, No. 10, Xitoutiao, YouAnMen, Fengtai District, Beijing, 100069, China; Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University, No. 10, Xitoutiao, YouAnMen, Fengtai District, Beijing, 100069, China
- Honglei Liu: School of Biomedical Engineering, Capital Medical University, No. 10, Xitoutiao, YouAnMen, Fengtai District, Beijing, 100069, China; Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University, No. 10, Xitoutiao, YouAnMen, Fengtai District, Beijing, 100069, China
- Xiaolu Fei: Information Center, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China
- Lan Wei: Information Center, Xuanwu Hospital, Capital Medical University, No. 45 Changchun Street, Xicheng District, Beijing, 100053, China
- Xiangkun Zhao: School of Biomedical Engineering, Capital Medical University, No. 10, Xitoutiao, YouAnMen, Fengtai District, Beijing, 100069, China
- Hui Chen: School of Biomedical Engineering, Capital Medical University, No. 10, Xitoutiao, YouAnMen, Fengtai District, Beijing, 100069, China; Beijing Key Laboratory of Fundamental Research on Biomechanics in Clinical Application, Capital Medical University, No. 10, Xitoutiao, YouAnMen, Fengtai District, Beijing, 100069, China
13
Jia Z, Lu X, Duan H, Li H. Using the distance between sets of hierarchical taxonomic clinical concepts to measure patient similarity. BMC Med Inform Decis Mak 2019; 19:91. [PMID: 31023325 PMCID: PMC6485152 DOI: 10.1186/s12911-019-0807-y] [Received: 12/07/2017] [Accepted: 04/01/2019] [Indexed: 11/10/2022]
Abstract
Background Many clinical concepts are standardized under a categorical and hierarchical taxonomy such as ICD-10 or ATC. These taxonomic clinical concepts provide insight into semantic meaning and similarity among clinical concepts and have been applied to patient similarity measures. However, the effects of diverse set sizes of taxonomic clinical concepts contributing to similarity at the patient level have not been well studied. Methods In this paper, the most widely used taxonomic clinical concept system, ICD-10, was studied as a representative taxonomy. The distance between ICD-10-coded diagnosis sets is an integrated estimation of the information content of each concept, the similarity between each pair of concepts and the similarity between the sets of concepts. We proposed a novel set-level similarity method to calculate the distance between sets of hierarchical taxonomic clinical concepts to measure patient similarity. A real-world clinical dataset with ICD-10-coded diagnoses and hospital length of stay (HLOS) information was used to evaluate the performance of various algorithms and their combinations in predicting whether a patient needs long-term hospitalization. Four subpopulation prototypes, defined based on age and HLOS and with different diagnosis set sizes, were used as the targets for similarity analysis. The F-score was used to evaluate the performance of different algorithms while controlling for other factors. We also evaluated the effect of prototype set size on prediction precision. Results The results identified the strengths and weaknesses of different algorithms for computing information content, code-level similarity and set-level similarity under different contexts, such as set size and concept set background. The minimum weighted bipartite matching approach, which has not been fully recognized previously, showed unique advantages in measuring concept-based patient similarity. Conclusions This study provides a systematic benchmark evaluation of previous and novel algorithms used in taxonomic concept-based patient similarity, and it provides the basis for selecting appropriate methods under different clinical scenarios.
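The minimum weighted bipartite matching idea highlighted in this abstract can be illustrated with a small brute-force sketch. Everything specific here is an assumption made for the example: the shared-prefix code distance, the penalty for unmatched codes, the normalization by the larger set size, and the names `code_distance` and `matching_distance`. The paper's actual information-content and code-level measures are not given in the abstract.

```python
from itertools import permutations


def code_distance(c1, c2):
    """Toy code-level distance: 1 minus the shared-prefix fraction of two codes."""
    shared = 0
    for x, y in zip(c1, c2):
        if x != y:
            break
        shared += 1
    return 1.0 - shared / max(len(c1), len(c2))


def matching_distance(set_a, set_b, unmatched_penalty=1.0):
    """Set-level distance from the cheapest one-to-one matching of codes.

    Brute force over assignments, so only suitable for small sets. Codes in the
    larger set that stay unmatched each add `unmatched_penalty` (an assumption).
    """
    small, large = sorted(set_a), sorted(set_b)
    if len(small) > len(large):
        small, large = large, small
    if not large:
        return 0.0
    best = float("inf")
    for chosen in permutations(large, len(small)):
        total = sum(code_distance(s, c) for s, c in zip(small, chosen))
        best = min(best, total)
    unmatched = len(large) - len(small)
    return (best + unmatched * unmatched_penalty) / len(large)


patient_a = {"I10", "E11"}
patient_b = {"I11", "E11", "J45"}
dist = matching_distance(patient_a, patient_b)
```

For realistic set sizes, the brute-force loop would be replaced by a polynomial-time assignment solver; the sketch only shows what quantity is being minimized.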
Affiliation(s)
- Zheng Jia: College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Xudong Lu: College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Huilong Duan: College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou, China
- Haomin Li: The Children's Hospital, Zhejiang University School of Medicine, Hangzhou, China; The Institute of Translational Medicine, Zhejiang University, Hangzhou, China
14
Dudchenko P, Dudchenko A, Kopanitsa G. Heart Disease Dataset Clusterization. Stud Health Technol Inform 2019; 261:162-167. [PMID: 31156109] [Indexed: 06/09/2023]
Abstract
Clustering is a promising group of methods in the context of patient similarity. However, clustering results are often unclear to physicians, and different clustering methods can produce different results. We examined a well-known dataset and implemented 3 clustering methods (k-means, agglomerative and spectral). We compared and evaluated the clusters and their correlation with data attributes. In contrast to the original dataset's target value, the clusters correlated with only a few attributes. Finally, we trained 2 predictive models based on the k-nearest neighbors (KNN) algorithm and an Artificial Neural Network (ANN). Model evaluation demonstrates that using the results of the clustering algorithms as the predictive attribute gives a higher F-score than using the original target attribute.
15
Parimbelli E, Marini S, Sacchi L, Bellazzi R. Patient similarity for precision medicine: A systematic review. J Biomed Inform 2018; 83:87-96. [PMID: 29864490 DOI: 10.1016/j.jbi.2018.06.001] [Received: 01/31/2018] [Revised: 05/16/2018] [Accepted: 06/01/2018] [Indexed: 12/19/2022]
Abstract
Evidence-based medicine is the most prevalent paradigm adopted by physicians. Clinical practice guidelines typically define a set of recommendations together with eligibility criteria that restrict their applicability to a specific group of patients. The ever-growing size and availability of health-related data is currently challenging the broad definitions of guideline-defined patient groups. Precision medicine leverages genetic, phenotypic, or psychosocial characteristics to provide precise identification of patient subsets for treatment targeting. Defining a patient similarity measure is thus an essential step to allow stratification of patients into clinically meaningful subgroups. The present review investigates the use of patient similarity as a tool to enable precision medicine. A total of 279 articles were analyzed along four dimensions: data types considered, clinical domains of application, data analysis methods, and translational stage of findings. Cancer-related research employing molecular profiling and standard data analysis techniques such as clustering constitutes the majority of the retrieved studies. Chronic and psychiatric diseases follow as the second most represented clinical domains. Interestingly, almost one quarter of the studies analyzed presented a novel methodology, with the most advanced employing data integration strategies and being portable to different clinical domains. Integration of such techniques into decision support systems constitutes an interesting trend for future research.
Affiliation(s)
- E Parimbelli: Telfer School of Management, University of Ottawa, Ottawa, Canada; Interdepartmental Centre for Health Technologies, University of Pavia, Italy
- S Marini: Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, USA; Interdepartmental Centre for Health Technologies, University of Pavia, Italy
- L Sacchi: Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy; Interdepartmental Centre for Health Technologies, University of Pavia, Italy
- R Bellazzi: Department of Electrical, Computer and Biomedical Engineering, University of Pavia, Italy; Interdepartmental Centre for Health Technologies, University of Pavia, Italy; RCCS ICS Maugeri, Pavia, Italy
16
Dudchenko A, Kopanitsa G, Knaup P, Ganzinger M. A Predictive Model for Patient Similarity: Classes Based on Secondary Data and Simple Measurements as Predictors. Stud Health Technol Inform 2018; 249:167-172. [PMID: 29866975] [Indexed: 06/08/2023]
Abstract
Predictive models optimized for average cases might not work well for cases that deviate from the average, because they are based on a cohort of all patients. Models could be more personalized if they were built on a sub-cohort of patients similar to the current one and trained on data collected from those similar patients. In this paper, we consider patient similarity as a classification task. We suppose that data such as diagnoses and treatments obtained by physicians (secondary data) are more relevant for similarity than tests and measurements (primary data). We defined several classes based on diagnoses and outcomes and applied a predictive model for classification. We used five commonly used and easy-to-obtain measurements as predictors for the model. All measurements were collected during the first 24 hours after admission. We have shown that classes of similar patients can be defined on the basis of previous patients' secondary data and that new patients can be classified into these classes.
Affiliation(s)
- Petra Knaup: Institute of Medical Biometry and Informatics, Heidelberg University, Heidelberg, Germany
- Matthias Ganzinger: Institute of Medical Biometry and Informatics, Heidelberg University, Heidelberg, Germany
17
Sha Y, Venugopalan J, Wang MD. A Novel Temporal Similarity Measure for Patients Based on Irregularly Measured Data in Electronic Health Records. ACM BCB 2016; 2016:337-344. [PMID: 32577627 DOI: 10.1145/2975167.2975202] [Indexed: 10/20/2022]
Abstract
Patient similarity measurement is an important tool for cohort identification in clinical decision support applications. A reliable similarity metric can be used for deriving diagnostic or prognostic information about a target patient using other patients with similar trajectories of health-care events. However, measuring the similarity of care trajectories is challenged by the irregularity of measurements inherent in health care. To address this challenge, we propose a novel temporal similarity measure for patients based on irregularly measured laboratory test data from the Multiparameter Intelligent Monitoring in Intensive Care database and the pediatric Intensive Care Unit (ICU) database of Children's Healthcare of Atlanta. This similarity measure, which is modified from the Smith-Waterman algorithm, identifies patients that share sequentially similar laboratory results separated by time intervals of similar length. We demonstrate the predictive power of our method; that is, patients with higher similarity in their previous histories will most likely have higher similarity in their later histories. In addition, compared with other non-temporal measures, our method is stronger at predicting mortality in ICU patients diagnosed with acute kidney injury and sepsis. Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Retrieval models and rankings - similarity measures; J.3 [Applied Computing]: Life and medical sciences - health and medical information systems. General Term: Algorithm.
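For orientation, the unmodified Smith-Waterman local alignment score that this measure builds on can be sketched as below. This is the textbook algorithm only: it ignores the time intervals between measurements, which the abstract names as the key modification, and the scoring values and the discretization of laboratory results into "L", "N", "H" symbols are assumptions for the example.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between two sequences (linear gap penalty)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best


# Toy lab-result sequences discretized as Low / Normal / High.
labs_a = list("NNHHLN")
labs_b = list("LHHLH")
score = smith_waterman_score(labs_a, labs_b)
```

Here the shared run "HHL" drives the score; a temporal variant would additionally compare the spacing between those results before rewarding the match.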
Affiliation(s)
- Ying Sha: School of Biology, Georgia Institute of Technology, Atlanta, GA 30332
- Janani Venugopalan: Dept. of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332
- May D Wang: Dept. of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA 30332