1
Hammoud M, Douglas S, Darmach M, Alawneh S, Sanyal S, Kanbour Y. Evaluating the Diagnostic Performance of Symptom Checkers: Clinical Vignette Study. JMIR AI 2024; 3:e46875. [PMID: 38875676 PMCID: PMC11091811 DOI: 10.2196/46875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 06/15/2023] [Accepted: 03/02/2024] [Indexed: 06/16/2024]
Abstract
BACKGROUND Medical self-diagnostic tools (or symptom checkers) are becoming an integral part of digital health and our daily lives, whereby patients are increasingly using them to identify the underlying causes of their symptoms. As such, it is essential to rigorously investigate and comprehensively report the diagnostic performance of symptom checkers using standard clinical and scientific approaches. OBJECTIVE This study aims to evaluate and report the accuracies of a few known and new symptom checkers using a standard and transparent methodology, which allows the scientific community to cross-validate and reproduce the reported results, a step much needed in health informatics. METHODS We propose a 4-stage experimentation methodology that capitalizes on the standard clinical vignette approach to evaluate 6 symptom checkers. To this end, we developed and peer-reviewed 400 vignettes, each approved by at least 5 out of 7 independent and experienced primary care physicians. To establish a frame of reference and interpret the results of symptom checkers accordingly, we further compared the best-performing symptom checker against 3 primary care physicians with an average experience of 16.6 (SD 9.42) years. To measure accuracy, we used 7 standard metrics, including M1 as a measure of a symptom checker's or a physician's ability to return a vignette's main diagnosis at the top of their differential list, F1-score as a trade-off measure between recall and precision, and Normalized Discounted Cumulative Gain (NDCG) as a measure of a differential list's ranking quality, among others. RESULTS The diagnostic accuracies of the 6 tested symptom checkers vary significantly. For instance, the differences in the M1, F1-score, and NDCG results between the best-performing and worst-performing symptom checkers or ranges were 65.3%, 39.2%, and 74.2%, respectively. 
The same was observed among the participating human physicians, whereby the M1, F1-score, and NDCG ranges were 22.8%, 15.3%, and 21.3%, respectively. When compared against each other, physicians outperformed the best-performing symptom checker by an average of 1.2% using F1-score, whereas the best-performing symptom checker outperformed physicians by averages of 10.2% and 25.1% using M1 and NDCG, respectively. CONCLUSIONS The performance variation between symptom checkers is substantial, suggesting that symptom checkers cannot be treated as a single entity. On a different note, the best-performing symptom checker was an artificial intelligence (AI)-based one, shedding light on the promise of AI in improving the diagnostic capabilities of symptom checkers, especially as AI keeps advancing exponentially.
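The three metrics named in this abstract can be sketched as follows. This is a minimal illustration under common textbook definitions, not the study's own code; the function names, the set-based F1, and the log2 discount in NDCG are assumptions, and the paper's exact formulations may differ.

```python
import math


def m1(ranked, main_dx):
    """Return 1.0 if the main diagnosis is at the top of the differential list."""
    return 1.0 if ranked and ranked[0] == main_dx else 0.0


def f1_score(ranked, gold):
    """Harmonic mean of precision and recall over sets of diagnoses."""
    returned, expected = set(ranked), set(gold)
    hits = len(returned & expected)
    if hits == 0:
        return 0.0
    precision = hits / len(returned)
    recall = hits / len(expected)
    return 2 * precision * recall / (precision + recall)


def ndcg(ranked, relevance):
    """NDCG with a log2 position discount; relevance maps diagnosis -> gain."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked))
    ideal = sorted(relevance.values(), reverse=True)[: len(ranked)]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

A list that ranks the most relevant diagnosis first scores an NDCG of 1.0, while swapping the order lowers it, which is why NDCG captures ranking quality that F1 alone does not.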
2
Woodman RJ, Koczwara B, Mangoni AA. Applying precision medicine principles to the management of multimorbidity: the utility of comorbidity networks, graph machine learning, and knowledge graphs. Front Med (Lausanne) 2024; 10:1302844. [PMID: 38404463 PMCID: PMC10885565 DOI: 10.3389/fmed.2023.1302844] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2023] [Accepted: 12/22/2023] [Indexed: 02/27/2024] Open
Abstract
The current management of patients with multimorbidity is suboptimal, with either a single-disease approach to care or treatment guideline adaptations that result in poor adherence due to their complexity. Although this has resulted in calls for more holistic and personalized approaches to prescribing, progress toward these goals has remained slow. With the rapid advancement of machine learning (ML) methods, promising approaches now also exist to accelerate the advance of precision medicine in multimorbidity. These include analyzing disease comorbidity networks, using knowledge graphs that integrate knowledge from different medical domains, and applying network analysis and graph ML. Multimorbidity disease networks have been used to improve disease diagnosis, treatment recommendations, and patient prognosis. Knowledge graphs that combine different medical entities connected by multiple relationship types integrate data from different sources, allowing for complex interactions and creating a continuous flow of information. Network analysis and graph ML can then extract the topology and structure of networks and reveal hidden properties, including disease phenotypes, network hubs, and pathways; predict drugs for repurposing; and determine safe and more holistic treatments. In this article, we describe the basic concepts of creating bipartite and unipartite disease and patient networks and review the use of knowledge graphs, graph algorithms, graph embedding methods, and graph ML within the context of multimorbidity. Specifically, we provide an overview of the application of graph theory for studying multimorbidity, the methods employed to extract knowledge from graphs, and examples of the application of disease networks for determining the structure and pathways of multimorbidity, identifying disease phenotypes, predicting health outcomes, and selecting safe and effective treatments. 
In today's modern data-hungry, ML-focused world, such network-based techniques are likely to be at the forefront of developing robust clinical decision support tools for safer and more holistic approaches to treating older patients with multimorbidity.
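The step from a bipartite patient-disease network to a unipartite disease network, which the abstract mentions, can be sketched as below. This is an illustrative assumption rather than the article's method: the function name, the toy patients, and the choice of shared-patient counts as edge weights are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def project_diseases(patient_diseases):
    """Project a bipartite patient-disease network onto the disease side.

    The weight of an edge between two diseases is the number of patients
    who have both (a simple co-occurrence count).
    """
    weights = Counter()
    for diseases in patient_diseases.values():
        # Sort so each unordered disease pair always maps to the same key.
        for a, b in combinations(sorted(set(diseases)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical toy data: patient id -> recorded conditions.
patients = {
    "p1": ["diabetes", "hypertension"],
    "p2": ["diabetes", "hypertension", "ckd"],
    "p3": ["hypertension", "ckd"],
}
edges = project_diseases(patients)
```

Raw counts favor common diseases, so published comorbidity networks often replace them with a normalized association measure; the count is used here only to keep the projection step visible.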
Affiliation(s)
- Richard John Woodman
  - Flinders Health and Medical Research Institute, College of Medicine and Public Health, Flinders University, Adelaide, SA, Australia
- Bogda Koczwara
  - Flinders Health and Medical Research Institute, College of Medicine and Public Health, Flinders University, Adelaide, SA, Australia
  - Department of Medical Oncology, Flinders Medical Centre, Southern Adelaide Local Health Network, Adelaide, SA, Australia
- Arduino Aleksander Mangoni
  - Flinders Health and Medical Research Institute, College of Medicine and Public Health, Flinders University, Adelaide, SA, Australia
  - Department of Clinical Pharmacology, Flinders Medical Centre, Southern Adelaide Local Health Network, Adelaide, SA, Australia
3
Wen J, Zhang T, Ye S, Zhang P, Han R, Chen X, Huang R, Chen A, Li Q. Quantitative patient graph analysis for transient ischemic attack risk factor distribution based on electronic medical records. Heliyon 2024; 10:e22766. [PMID: 38163107 PMCID: PMC10755279 DOI: 10.1016/j.heliyon.2023.e22766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/10/2022] [Revised: 10/26/2023] [Accepted: 11/19/2023] [Indexed: 01/03/2024] Open
Abstract
A transient ischemic attack (TIA) affects millions of people worldwide. Although TIA risk factors have been identified individually, a systemic quantitative analysis of all health factors relevant to TIA using electronic medical records (EMR) remains lacking. This study employed a data-driven approach, leveraging hospital EMR data to create a TIA patient health factor graph. This graph consisted of 737 TIA and 737 control patient nodes, 740 health factor nodes, and over 33,000 relations between patients and factors. For all health factors in the graph, the connection delta ratios (CDRs) were determined and ranked, generating a quantitative distribution of TIA health factors. A literature review confirmed 56 risk factors in the distribution and unveiled a potential new risk factor "rhinosinusitis" for future validation. Moreover, the patient graph was visualized together with the TIA knowledge graph in the Unified Medical Language System. This integration enables clinicians to access and visualize patient data and international standard knowledge within a unified graph. In conclusion, graph CDR analysis can effectively quantify the distribution of TIA risk factors. The resulting TIA risk factor distribution might be instrumental in developing new risk prediction machine learning models for screening and early detection of TIA.
Affiliation(s)
- Jian Wen
  - Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
- Tianmei Zhang
  - Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
- Shangrong Ye
  - Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
- Peng Zhang
  - Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
- Ruobing Han
  - Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
- Xiaowang Chen
  - Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
- Ran Huang
  - West China Hospital, Chengdu, Sichuan, China
- Anjun Chen
  - Learning Health Community, Palo Alto, CA, USA
- Qinghua Li
  - Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
4
Liu Z, Cao Q, Du N, Shu H, Zhong E, Jiang N, Chen Q, Shen Y, Chen K. FIT-graph: A multi-grained evolutionary graph based framework for disease diagnosis. Artif Intell Med 2024; 147:102735. [PMID: 38184359 DOI: 10.1016/j.artmed.2023.102735] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2022] [Revised: 10/04/2023] [Accepted: 11/28/2023] [Indexed: 01/08/2024]
Abstract
Early assessment, with the help of machine learning methods, can aid clinicians in optimizing the diagnosis and treatment process, giving patients critical treatment time. Owing to their effective information organization and interpretable reasoning, knowledge graph-based methods have become some of the most widely used machine learning approaches for this task. However, because they do not effectively organize and use multi-granularity and temporal information, current knowledge graph-based approaches struggle to fully exploit the information contained in medical records, which restricts their capacity to make high-quality diagnoses. To address these challenges, we study disease diagnosis applications in depth and propose a novel disease diagnosis framework named FIT-Graph. With novel medical multi-grained evolutionary graphs, FIT-Graph efficiently organizes the information extracted at various granularities and time stages, maximizing the retention of valuable information for disease inference and ensuring the comprehensiveness and validity of the final disease inference. We compare FIT-Graph with the baseline on two real-world clinical datasets from cardiology and respiratory departments. The experimental results show that it outperforms the baseline model, improving on baseline performance by about 5% on multiple metrics.
Affiliation(s)
- Zizhu Liu
  - Department of Cardiovascular Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
- Qing Cao
  - Department of Cardiovascular Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
- Nan Du
  - School of Intelligent Systems Engineering, Sun Yat-Sen University, China.
- Ying Shen
  - School of Intelligent Systems Engineering, Sun Yat-Sen University, China.
- Kang Chen
  - Department of Cardiovascular Medicine, Ruijin Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai, China.
5
Huang L, Chen Q, Lan W. Predicting drug-drug interactions based on multi-view and multichannel attention deep learning. Health Inf Sci Syst 2023; 11:50. [PMID: 37941825 PMCID: PMC10628064 DOI: 10.1007/s13755-023-00250-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Accepted: 09/25/2023] [Indexed: 11/10/2023] Open
Abstract
Predicting drug-drug interactions (DDIs) has become a major concern in the drug research field because it helps explore the pharmacological function of drugs and enables the development of new therapeutic drugs. Existing prediction methods simply integrate multiple drug attributes or perform tasks on a biomedical knowledge graph (KG). Though effective, few methods can fully utilize multi-source drug data information. In this paper, a multi-view and multichannel attention deep learning (MMADL) model is proposed, which not only extracts rich drug features containing both drug attributes and drug-related entity information from multi-source databases, but also considers the consistency and complementarity of different drug feature representation learning approaches to improve the effectiveness and accuracy of DDI prediction. A single-layer perceptron encoder is applied to encode multi-source drug information to obtain multi-view drug representation vectors in the same linear space. Then, the multichannel attention mechanism is introduced to obtain the attention weight by adaptively learning the importance of drug features according to their contributions to DDI prediction. Further, the representation vectors of multi-view drug pairs with attention weights are used as inputs of the deep neural network to predict potential DDI. The accuracy and precision-recall curves of MMADL are 93.05 and 95.94, respectively. The results indicate that the proposed method outperforms other state-of-the-art methods.
Affiliation(s)
- Liyu Huang
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006 China
- Qingfeng Chen
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004 China
  - Department of Computer Science and Information Technology, La Trobe University, Melbourne, 3086 Australia
- Wei Lan
  - School of Computer, Electronics and Information, Guangxi University, Nanning, 530004 China
6
Ma X, Wang M, Lin S, Zhang Y, Zhang Y, Ouyang W, Liu X. Knowledge and data-driven prediction of organ failure in critical care patients. Health Inf Sci Syst 2023; 11:7. [PMID: 36703901 PMCID: PMC9871106 DOI: 10.1007/s13755-023-00210-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2022] [Accepted: 01/02/2023] [Indexed: 01/24/2023] Open
Abstract
Purpose The early detection of organ failure mitigates the risk of post-intensive care syndrome and long-term functional impairment. The aim of this study is to predict organ failure in real-time for critical care patients based on a data-driven and knowledge-driven machine learning method (DKM) and provide explanations for the prediction by incorporating a medical knowledge graph. Methods The cohort of this study was a subset of the 4,386 adult Intensive Care Unit (ICU) patients from the MIMIC-III dataset collected between 2001 and 2012, and the primary outcome was the Delta Sequential Organ Failure Assessment (SOFA) score. A real-time Delta SOFA score prediction model was developed with two key components: an improved deep learning temporal convolutional network (S-TCN) and a graph-embedding feature extraction method based on a medical knowledge graph. Entities and relations related to organ failure were extracted from the Unified Medical Language System to build the medical knowledge graph, and patient data were mapped onto the graph to extract the embeddings. We measured the performance of our DKM approach with cross-validation to avoid the formation of biased assessments. Results An area under the receiver operating characteristic curve (AUC) of 0.973, a precision of 0.923, a NPV of 0.989, and an F1 score of 0.927 were achieved using the DKM approach, which significantly outperformed the baseline methods. Additionally, the performance remained stable following external validation on the eICU dataset, which consists of 2,816 admissions (AUC = 0.981, precision = 0.860, NPV = 0.984). Visualization of feature importance for the Delta SOFA score and their relationships on the basic clinical medical (BCM) knowledge graph provided a model explanation. 
Conclusion The use of an improved TCN model and a medical knowledge graph led to substantial improvement in prediction accuracy, providing generalizability and an independent explanation for organ failure prediction in critical care patients. These findings show the potential of incorporating prior domain knowledge into machine learning models to inform care and service planning. Supplementary Information The online version of this article contains supplementary material available at 10.1007/s13755-023-00210-5.
Affiliation(s)
- Xinyu Ma
  - School of Computer Science and Engineering, Southeast University, Nanjing, 211189 People’s Republic of China
- Meng Wang
  - School of Computer Science and Engineering, Southeast University, Nanjing, 211189 People’s Republic of China
- Sihan Lin
  - Department of Anesthesiology, Third Xiangya Hospital, Central South University, Changsha, 410013 People’s Republic of China
- Yuhao Zhang
  - School of Computer Science and Engineering, The University of Hong Kong, Hong Kong, 999077 People’s Republic of China
- Yanjian Zhang
  - School of Computer Science and Engineering, Southeast University, Nanjing, 211189 People’s Republic of China
- Wen Ouyang
  - Department of Anesthesiology, Third Xiangya Hospital, Central South University, Changsha, 410013 People’s Republic of China
- Xing Liu
  - Department of Anesthesiology, Third Xiangya Hospital, Central South University, Changsha, 410013 People’s Republic of China
7
Feng J, Zhang R, Chen D, Shi L. A Visualization Method of Knowledge Graphs for the Computation and Comprehension of Ultrasound Reports. Biomimetics (Basel) 2023; 8:560. [PMID: 38132500 PMCID: PMC10741754 DOI: 10.3390/biomimetics8080560] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/04/2023] [Revised: 10/30/2023] [Accepted: 11/20/2023] [Indexed: 12/23/2023] Open
Abstract
Knowledge graph visualization in ultrasound reports is essential for enhancing medical decision making and the efficiency and accuracy of computer-aided analysis tools. This study proposes an intelligent method for analyzing ultrasound reports through knowledge graph visualization. First, we provide a novel method for extracting key term networks from the narrative text in ultrasound reports with high accuracy, enabling the identification and annotation of clinical concepts within the report. Second, a knowledge representation framework based on ultrasound reports is proposed, which enables the structured and intuitive visualization of ultrasound report knowledge. Finally, we propose a knowledge graph completion model to address entities that are missing because of physicians' writing habits and to improve the accuracy of visualizing ultrasound knowledge. Compared with traditional methods, our proposed approach extracts knowledge from complex ultrasound reports more effectively, achieving an extraction index (η) of 2.69 versus 2.12 for the general pattern-matching method. Compared with other state-of-the-art methods, our approach achieves the highest P (0.85), R (0.89), and F1 (0.87) across three testing datasets. The proposed method can effectively utilize the knowledge embedded in ultrasound reports to obtain relevant clinical information and improve the accuracy of using ultrasound knowledge.
Affiliation(s)
- Jiayi Feng
  - Department of Information Management, Beijing Jiaotong University, Beijing 100044, China
- Runtong Zhang
  - Department of Information Management, Beijing Jiaotong University, Beijing 100044, China
- Donghua Chen
  - Department of Information Management, University of International Business and Economics, Beijing 100029, China
- Lei Shi
  - School of Computing, Newcastle University, Newcastle upon Tyne NE4 5TG, UK
8
Blaudin de Thé FX, Baudier C, Andrade Pereira R, Lefebvre C, Moingeon P. Transforming drug discovery with a high-throughput AI-powered platform: A 5-year experience with Patrimony. Drug Discov Today 2023; 28:103772. [PMID: 37717933 DOI: 10.1016/j.drudis.2023.103772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 09/01/2023] [Accepted: 09/12/2023] [Indexed: 09/19/2023]
Abstract
High-throughput computational platforms are being established to accelerate drug discovery. Servier launched the Patrimony platform to harness computational sciences and artificial intelligence (AI) to integrate massive multimodal data from internal and external sources. Patrimony has enabled researchers to prioritize therapeutic targets based on a deep understanding of the pathophysiology of immuno-inflammatory diseases. Herein, we share our experience regarding main challenges and critical success factors faced when industrializing the platform and broadening its applications to neurological diseases. We emphasize the importance of integrating such platforms in an end-to-end drug discovery process and engaging human experts early on to ensure a transforming impact.
9
Schiano di Cola V, Chiaro D, Prezioso E, Izzo S, Giampaolo F. Insight Extraction From E-Health Bookings by Means of Hypergraph and Machine Learning. IEEE J Biomed Health Inform 2023; 27:4649-4659. [PMID: 37018305 DOI: 10.1109/jbhi.2022.3233498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/04/2023]
Abstract
New technologies are transforming medicine, and this revolution starts with data. Usually, health services within public healthcare systems are accessed through a booking centre managed by local health authorities and controlled by the regional government. In this perspective, structuring e-health data through a Knowledge Graph (KG) approach can provide a feasible method to quickly and simply organize data and/or retrieve new information. Starting from raw health bookings data from the public healthcare system in Italy, a KG method is presented to support e-health services through the extraction of medical knowledge and novel insights. By exploiting graph embedding which arranges the various attributes of the entities into the same vector space, we are able to apply Machine Learning (ML) techniques to the embedded vectors. The findings suggest that KGs could be used to assess patients' medical booking patterns, either from unsupervised or supervised ML. In particular, the former can determine possible presence of hidden groups of entities that is not immediately available through the original legacy dataset structure. The latter, although the performance of the used algorithms is not very high, shows encouraging results in predicting a patient's likelihood to undergo a particular medical visit within a year. However, many technological advances remain to be made, especially in graph database technologies and graph embedding algorithms.
10
Hou Y, Yeung J, Xu H, Su C, Wang F, Zhang R. From Answers to Insights: Unveiling the Strengths and Limitations of ChatGPT and Biomedical Knowledge Graphs. RESEARCH SQUARE 2023:rs.3.rs-3185632. [PMID: 37577545 PMCID: PMC10418534 DOI: 10.21203/rs.3.rs-3185632/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2023]
Abstract
Purpose Large Language Models (LLMs) have shown exceptional performance in various natural language processing tasks, benefiting from their language generation capabilities and ability to acquire knowledge from unstructured text. However, in the biomedical domain, LLMs face limitations that lead to inaccurate and inconsistent answers. Knowledge Graphs (KGs) have emerged as valuable resources for organizing structured information. Biomedical Knowledge Graphs (BKGs) have gained significant attention for managing diverse and large-scale biomedical knowledge. The objective of this study is to assess and compare the capabilities of ChatGPT and existing BKGs in question-answering, biomedical knowledge discovery, and reasoning tasks within the biomedical domain. Methods We conducted a series of experiments to assess the performance of ChatGPT and the BKGs in various aspects of querying existing biomedical knowledge, knowledge discovery, and knowledge reasoning. Firstly, we tasked ChatGPT with answering questions sourced from the "Alternative Medicine" sub-category of Yahoo! Answers and recorded the responses. Additionally, we queried BKG to retrieve the relevant knowledge records corresponding to the questions and assessed them manually. In another experiment, we formulated a prediction scenario to assess ChatGPT's ability to suggest potential drug/dietary supplement repurposing candidates. Simultaneously, we utilized BKG to perform link prediction for the same task. The outcomes of ChatGPT and BKG were compared and analyzed. Furthermore, we evaluated ChatGPT and BKG's capabilities in establishing associations between pairs of proposed entities. This evaluation aimed to assess their reasoning abilities and the extent to which they can infer connections within the knowledge domain. Results The results indicate that ChatGPT with GPT-4.0 outperforms both GPT-3.5 and BKGs in providing existing information. 
However, BKGs demonstrate higher reliability in terms of information accuracy. ChatGPT exhibits limitations in performing novel discoveries and reasoning, particularly in establishing structured links between entities compared to BKGs. Conclusions To address the limitations observed, future research should focus on integrating LLMs and BKGs to leverage the strengths of both approaches. Such integration would optimize task performance and mitigate potential risks, leading to advancements in knowledge within the biomedical field and contributing to the overall well-being of individuals.
11
Walke D, Micheel D, Schallert K, Muth T, Broneske D, Saake G, Heyer R. The importance of graph databases and graph learning for clinical applications. Database (Oxford) 2023; 2023:baad045. [PMID: 37428679 PMCID: PMC10332447 DOI: 10.1093/database/baad045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/09/2022] [Revised: 05/26/2023] [Accepted: 06/16/2023] [Indexed: 07/12/2023]
Abstract
The increasing amount and complexity of clinical data require an appropriate way of storing and analyzing those data. Traditional approaches use a tabular structure (relational databases) for storing data and thereby complicate storing and retrieving interlinked data from the clinical domain. Graph databases provide a great solution for this by storing data in a graph as nodes (vertices) that are connected by edges (links). The underlying graph structure can be used for the subsequent data analysis (graph learning). Graph learning consists of two parts: graph representation learning and graph analytics. Graph representation learning aims to reduce high-dimensional input graphs to low-dimensional representations. Then, graph analytics uses the obtained representations for analytical tasks like visualization, classification, link prediction and clustering which can be used to solve domain-specific problems. In this survey, we review current state-of-the-art graph database management systems, graph learning algorithms and a variety of graph applications in the clinical domain. Furthermore, we provide a comprehensive use case for a clearer understanding of complex graph learning algorithms.
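The idea of storing clinical data as nodes connected by labeled edges can be illustrated with a tiny in-memory structure. This is only a sketch under assumptions: the class `PropertyGraph`, the node identifiers, and the edge labels `HAS_DIAGNOSIS` and `TREATED_WITH` are hypothetical and are not taken from the survey or from any particular graph database.

```python
class PropertyGraph:
    """Tiny in-memory graph: nodes carry properties, edges carry a label."""

    def __init__(self):
        self.nodes = {}  # node id -> dict of properties
        self.edges = {}  # node id -> list of (label, target id)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props
        self.edges.setdefault(node_id, [])

    def add_edge(self, source, label, target):
        self.edges.setdefault(source, []).append((label, target))

    def neighbors(self, node_id, label):
        """Follow outgoing edges that carry the given label."""
        return [t for (l, t) in self.edges.get(node_id, []) if l == label]


g = PropertyGraph()
g.add_node("patient:1", kind="Patient")
g.add_node("dx:t2d", kind="Diagnosis", name="type 2 diabetes")
g.add_node("drug:metformin", kind="Drug", name="metformin")
g.add_edge("patient:1", "HAS_DIAGNOSIS", "dx:t2d")
g.add_edge("dx:t2d", "TREATED_WITH", "drug:metformin")

# Two-hop traversal: patient -> diagnoses -> drugs.
drugs = [
    drug
    for dx in g.neighbors("patient:1", "HAS_DIAGNOSIS")
    for drug in g.neighbors(dx, "TREATED_WITH")
]
```

In a relational schema the same question would typically need joins across patient, diagnosis, and treatment tables, whereas here it is a direct walk along edges.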
Affiliation(s)
- Daniel Walke
  - Bioprocess Engineering, Otto von Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
  - Database and Software Engineering Group, Otto von Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
- Daniel Micheel
  - Database and Software Engineering Group, Otto von Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
- Kay Schallert
  - Multidimensional Omics Analyses Group, Leibniz-Institut für Analytische Wissenschaften—ISAS—e.V., Bunsen-Kirchhoff-Straße 11, Dortmund 44139, Germany
- Thilo Muth
  - Section eScience (S.3), Federal Institute for Materials Research and Testing (BAM), Unter den Eichen 87, Berlin 12205, Germany
- David Broneske
  - Infrastructure and Methods, German Center for Higher Education Research and Science Studies (DZHW), Lange Laube 12, Hannover 30159, Germany
- Gunter Saake
  - Database and Software Engineering Group, Otto von Guericke University, Universitätsplatz 2, Magdeburg 39106, Germany
- Robert Heyer
  - Multidimensional Omics Analyses Group, Leibniz-Institut für Analytische Wissenschaften—ISAS—e.V., Bunsen-Kirchhoff-Straße 11, Dortmund 44139, Germany
  - Faculty of Technology, Bielefeld University, Universitätsstraße 25, Bielefeld 33615, Germany
12
Caufield JH, Putman T, Schaper K, Unni DR, Hegde H, Callahan TJ, Cappelletti L, Moxon SAT, Ravanmehr V, Carbon S, Chan LE, Cortes K, Shefchek KA, Elsarboukh G, Balhoff J, Fontana T, Matentzoglu N, Bruskiewich RM, Thessen AE, Harris NL, Munoz-Torres MC, Haendel MA, Robinson PN, Joachimiak MP, Mungall CJ, Reese JT. KG-Hub-building and exchanging biological knowledge graphs. Bioinformatics 2023; 39:btad418. [PMID: 37389415 PMCID: PMC10336030 DOI: 10.1093/bioinformatics/btad418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 05/09/2023] [Accepted: 06/29/2023] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Knowledge graphs (KGs) are a powerful approach for integrating heterogeneous data and making inferences in biology and many other domains, but a coherent solution for constructing, exchanging, and facilitating the downstream use of KGs is lacking. RESULTS Here we present KG-Hub, a platform that enables standardized construction, exchange, and reuse of KGs. Features include a simple, modular extract-transform-load pattern for producing graphs compliant with Biolink Model (a high-level data model for standardizing biological data), easy integration of any OBO (Open Biological and Biomedical Ontologies) ontology, cached downloads of upstream data sources, versioned and automatically updated builds with stable URLs, web-browsable storage of KG artifacts on cloud infrastructure, and easy reuse of transformed subgraphs across projects. Current KG-Hub projects span use cases including COVID-19 research, drug repurposing, microbial-environmental interactions, and rare disease research. KG-Hub is equipped with tooling to easily analyze and manipulate KGs. KG-Hub is also tightly integrated with graph machine learning (ML) tools which allow automated graph ML, including node embeddings and training of models for link prediction and node classification. AVAILABILITY AND IMPLEMENTATION https://kghub.org.
Affiliation(s)
- J Harry Caufield
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Tim Putman
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, United States
- Kevin Schaper
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, United States
- Deepak R Unni
- SIB Swiss Institute of Bioinformatics, Basel 1015, Switzerland
- Harshad Hegde
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Tiffany J Callahan
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY 10032, United States
- Luca Cappelletti
- Department of Computer Science, University of Milano, Milan 20126, Italy
- Sierra A T Moxon
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Vida Ravanmehr
- Department of Lymphoma-Myeloma, MD Anderson Cancer Center, Houston, TX 77030, United States
- Seth Carbon
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Lauren E Chan
- College of Public Health and Human Sciences, Oregon State University, Corvallis, OR 97331, United States
- Katherina Cortes
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, United States
- Kent A Shefchek
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, United States
- Glass Elsarboukh
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, United States
- Jim Balhoff
- Renaissance Computing Institute, University of North Carolina, Chapel Hill, NC 27517, United States
- Tommaso Fontana
- Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milan 20133, Italy
- Anne E Thessen
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, United States
- Nomi L Harris
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Melissa A Haendel
- Anschutz Medical Campus, University of Colorado, Aurora, CO 80045, United States
- Peter N Robinson
- The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, United States
- Marcin P Joachimiak
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Christopher J Mungall
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Justin T Reese
- Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
13
Hou Y, Yeung J, Xu H, Su C, Wang F, Zhang R. From Answers to Insights: Unveiling the Strengths and Limitations of ChatGPT and Biomedical Knowledge Graphs. medRxiv 2023:2023.06.09.23291208. [PMID: 37398259 PMCID: PMC10312889 DOI: 10.1101/2023.06.09.23291208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Large Language Models (LLMs) have demonstrated exceptional performance in various natural language processing tasks, utilizing their language generation capabilities and knowledge acquisition potential from unstructured text. However, when applied to the biomedical domain, LLMs encounter limitations, resulting in erroneous and inconsistent answers. Knowledge Graphs (KGs) have emerged as valuable resources for structured information representation and organization. Specifically, Biomedical Knowledge Graphs (BKGs) have attracted significant interest in managing large-scale and heterogeneous biomedical knowledge. This study evaluates the capabilities of ChatGPT and existing BKGs in question answering, knowledge discovery, and reasoning. Results indicate that while ChatGPT with GPT-4.0 surpasses both GPT-3.5 and BKGs in providing existing information, BKGs demonstrate superior information reliability. Additionally, ChatGPT exhibits limitations in performing novel discoveries and reasoning, particularly in establishing structured links between entities compared to BKGs. To overcome these limitations, future research should focus on integrating LLMs and BKGs to leverage their respective strengths. Such an integrated approach would optimize task performance and mitigate potential risks, thereby advancing knowledge in the biomedical field and contributing to overall well-being.
Affiliation(s)
- Yu Hou
- Department of Surgery, University of Minnesota, Minneapolis, MN, USA
- Jeremy Yeung
- Department of Surgery, University of Minnesota, Minneapolis, MN, USA
- Hua Xu
- Section of Biomedical Informatics and Data Science, Yale University, New Haven, Connecticut, USA
- Chang Su
- Department of Health Service Administration and Policy, Temple University, Philadelphia, PA, USA
- Fei Wang
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Rui Zhang
- Department of Surgery, University of Minnesota, Minneapolis, MN, USA
14
Murali L, Gopakumar G, Viswanathan DM, Nedungadi P. Towards electronic health record-based medical knowledge graph construction, completion, and applications: A literature study. J Biomed Inform 2023:104403. [PMID: 37230406 DOI: 10.1016/j.jbi.2023.104403] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/13/2023] [Revised: 05/16/2023] [Accepted: 05/19/2023] [Indexed: 05/27/2023]
Abstract
With the growth of data and intelligent technologies, the healthcare sector has opened up numerous technology-enabled services for patients, clinicians, and researchers. One major hurdle in achieving state-of-the-art results in health informatics is domain-specific terminologies and their semantic complexities. A knowledge graph crafted from medical concepts, events, and relationships acts as a medical semantic network to extract new links and hidden patterns from health data sources. Current medical knowledge graph construction studies are limited to generic techniques and opportunities and focus less on exploiting real-world data sources in knowledge graph construction. A knowledge graph constructed from Electronic Health Records (EHR) data obtains real-world data from healthcare records. It ensures better results in subsequent tasks like knowledge extraction and inference, knowledge graph completion, and medical knowledge graph applications such as diagnosis predictions, clinical recommendations, and clinical decision support. This review critically analyses existing works on medical knowledge graphs that used EHR data as the data source at the (i) representation level, (ii) extraction level, and (iii) completion level. In this investigation, we found that EHR-based knowledge graph construction involves challenges such as high complexity and dimensionality of data, lack of knowledge fusion, and dynamic update of the knowledge graph. In addition, the study presents possible ways to tackle the challenges identified. Our findings conclude that future research should focus on knowledge graph integration and knowledge graph completion challenges.
Affiliation(s)
- Lino Murali
- Center for Research in Analytics and Technologies for Education (CREATE), Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, 690525, Kerala, India; Division of Information technology, School of Engineering, Cochin University of Science and Technology, Kochi, 682022, Kerala, India
- G Gopakumar
- Department of Computer Science and Engineering, School of Computing, Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, 690525, Kerala, India
- Daleesha M Viswanathan
- Division of Information technology, School of Engineering, Cochin University of Science and Technology, Kochi, 682022, Kerala, India
- Prema Nedungadi
- Center for Research in Analytics and Technologies for Education (CREATE), Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, 690525, Kerala, India; Department of Computer Science and Engineering, School of Computing, Amrita Vishwa Vidyapeetham, Amritapuri, Kollam, 690525, Kerala, India.
15
Cenikj G, Strojnik L, Angelski R, Ogrinc N, Koroušić Seljak B, Eftimov T. From language models to large-scale food and biomedical knowledge graphs. Sci Rep 2023; 13:7815. [PMID: 37188766 DOI: 10.1038/s41598-023-34981-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2023] [Accepted: 05/10/2023] [Indexed: 05/17/2023]
Abstract
Knowledge about the interactions between dietary and biomedical factors is scattered throughout uncountable research articles in an unstructured form (e.g., text, images, etc.) and requires automatic structuring so that it can be provided to medical professionals in a suitable format. Various biomedical knowledge graphs exist; however, they require further extension with relations between food and biomedical entities. In this study, we evaluate the performance of three state-of-the-art relation-mining pipelines (FooDis, FoodChem and ChemDis) which extract relations between food, chemical and disease entities from textual data. We perform two case studies, where relations were automatically extracted by the pipelines and validated by domain experts. The results show that the pipelines can extract relations with an average precision around 70%, making new discoveries available to domain experts with reduced human effort, since the domain experts should only evaluate the results, instead of finding and reading all new scientific papers.
Affiliation(s)
- Gjorgjina Cenikj
- Jožef Stefan Institute, Ljubljana, 1000, Slovenia.
- Jožef Stefan International Postgraduate School, Ljubljana, 1000, Slovenia.
- Nives Ogrinc
- Jožef Stefan Institute, Ljubljana, 1000, Slovenia
- Tome Eftimov
- Jožef Stefan Institute, Ljubljana, 1000, Slovenia
16
Su C, Hou Y, Zhou M, Rajendran S, Maasch JRA, Abedi Z, Zhang H, Bai Z, Cuturrufo A, Guo W, Chaudhry FF, Ghahramani G, Tang J, Cheng F, Li Y, Zhang R, DeKosky ST, Bian J, Wang F. Biomedical discovery through the integrative biomedical knowledge hub (iBKH). iScience 2023; 26:106460. [PMID: 37020958 PMCID: PMC10068563 DOI: 10.1016/j.isci.2023.106460] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2022] [Revised: 09/20/2022] [Accepted: 03/16/2023] [Indexed: 04/01/2023]
Abstract
The abundance of biomedical knowledge gained from biological experiments and clinical practices is an invaluable resource for biomedicine. The emerging biomedical knowledge graphs (BKGs) provide an efficient and effective way to manage the abundant knowledge in biomedical and life science. In this study, we created a comprehensive BKG called the integrative Biomedical Knowledge Hub (iBKH) by harmonizing and integrating information from diverse biomedical resources. To make iBKH easily accessible for biomedical research, we developed a web-based, user-friendly graphical portal that allows fast and interactive knowledge retrieval. Additionally, we also implemented an efficient and scalable graph learning pipeline for discovering novel biomedical knowledge in iBKH. As a proof of concept, we performed our iBKH-based method for computational in-silico drug repurposing for Alzheimer's disease. The iBKH is publicly available.
Affiliation(s)
- Chang Su
- Department of Health Service Administration and Policy, College of Public Health, Temple University, Philadelphia, PA 19122, USA
- Yu Hou
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Manqi Zhou
- Department of Computational Biology, Cornell University, Ithaca, NY 14850, USA
- Suraj Rajendran
- Tri-Institutional Computational Biology & Medicine Program, Cornell University, New York, NY 10065, USA
- Zehra Abedi
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Haotan Zhang
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Zilong Bai
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Winston Guo
- Department of Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Fayzan F. Chaudhry
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Gregory Ghahramani
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Jian Tang
- Mila-Quebec AI Institute and HEC Montreal, Montreal, QC H2S 3H1, Canada
- Feixiong Cheng
- Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
- Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA
- Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA
- Yue Li
- School of Computer Science, McGill University, Montreal, QC H3A 0C6, Canada
- Rui Zhang
- Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Steven T. DeKosky
- Department of Neurology, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Jiang Bian
- Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Fei Wang
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
17
Gao J, He S, Hu J, Chen G. A hybrid system to understand the relations between assessments and plans in progress notes. J Biomed Inform 2023; 141:104363. [PMID: 37054961 DOI: 10.1016/j.jbi.2023.104363] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 01/06/2023] [Revised: 04/05/2023] [Accepted: 04/07/2023] [Indexed: 04/15/2023]
Abstract
OBJECTIVE The paper presents a novel solution to the 2022 National NLP Clinical Challenges (n2c2) Track 3, which aims to predict the relations between assessment and plan subsections in progress notes. METHODS Our approach goes beyond standard transformer models and incorporates external information such as medical ontology and order information to comprehend the semantics of progress notes. We fine-tuned transformers to understand the textual data and incorporated medical ontology concepts and their relationships to enhance the model's accuracy. We also captured order information that regular transformers cannot by taking into account the position of the assessment and plan subsections in progress notes. RESULTS Our submission earned third place in the challenge phase with a macro-F1 score of 0.811. After refining our pipeline further, we achieved a macro-F1 of 0.826, outperforming the top-performing system during the challenge phase. CONCLUSION Our approach, which combines fine-tuned transformers, medical ontology, and order information, outperformed other systems in predicting the relationships between assessment and plan subsections in progress notes. This highlights the importance of incorporating external information beyond textual data in natural language processing (NLP) tasks related to medical documentation. Our work could potentially improve the efficiency and accuracy of progress note analysis.
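The macro-F1 score cited in this abstract is the unweighted mean of the per-class F1 scores, so rare relation classes count as much as common ones. A minimal sketch of that computation follows; the relation labels and the toy predictions are invented for illustration and are not data from the challenge or from the authors' system.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gold and predicted relation labels (illustrative only).
y_true = ["direct", "indirect", "neither", "direct", "neither", "indirect"]
y_pred = ["direct", "neither", "neither", "direct", "indirect", "indirect"]
score = macro_f1(y_true, y_pred)
print(f"macro-F1 = {score:.3f}")
```

Because every class contributes equally, a system that ignores a small class is penalised more heavily under macro-F1 than under plain accuracy.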
Affiliation(s)
- Jifan Gao
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 610 Walnut St, Madison, 53726, WI, USA
- Shilu He
- Department of Mathematics, University of Wisconsin-Madison, 480 Lincoln Dr, Madison, 53706, WI, USA
- Junjie Hu
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 610 Walnut St, Madison, 53726, WI, USA.
- Guanhua Chen
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, 610 Walnut St, Madison, 53726, WI, USA.
18
Gani MO, Kethireddy S, Adib R, Hasan U, Griffin P, Adibuzzaman M. Structural causal model with expert augmented knowledge to estimate the effect of oxygen therapy on mortality in the ICU. Artif Intell Med 2023; 137:102493. [PMID: 36868692 PMCID: PMC9992896 DOI: 10.1016/j.artmed.2023.102493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2022] [Revised: 01/17/2023] [Accepted: 01/17/2023] [Indexed: 02/01/2023]
Abstract
Recent advances in causal inference techniques, more specifically, in the theory of structural causal models, provide the framework for identifying causal effects from observational data in cases where the causal graph is identifiable, i.e., the data generation mechanism can be recovered from the joint distribution. However, no such studies have been performed to demonstrate this concept with a clinical example. We present a complete framework to estimate the causal effects from observational data by augmenting expert knowledge in the model development phase and with a practical clinical application. Our clinical application entails a timely and essential research question, the effect of oxygen therapy intervention in the intensive care unit (ICU). The result of this project is helpful in a variety of disease conditions, including severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) patients in the ICU. We used data from the MIMIC-III database, a widely used health care database in the machine learning community with 58,976 admissions from an ICU in Boston, MA, to estimate the oxygen therapy effect on mortality. We also identified the model's covariate-specific effect on oxygen therapy for more personalized intervention.
Affiliation(s)
- Md Osman Gani
- Department of Information Systems, University of Maryland, Baltimore County, Baltimore, MD, USA.
- Riddhiman Adib
- Oregon Clinical and Translational Research Institute, Oregon Health & Science University, Portland, OR, USA.
- Uzma Hasan
- Department of Information Systems, University of Maryland, Baltimore County, Baltimore, MD, USA.
- Paul Griffin
- Department of Industrial and Manufacturing Engineering, Penn State University, University Park, PA, USA.
- Mohammad Adibuzzaman
- Oregon Clinical and Translational Research Institute, Oregon Health & Science University, Portland, OR, USA.
19
Getzen E, Ungar L, Mowery D, Jiang X, Long Q. Mining for equitable health: Assessing the impact of missing data in electronic health records. J Biomed Inform 2023; 139:104269. [PMID: 36621750 PMCID: PMC10391553 DOI: 10.1016/j.jbi.2022.104269] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Received: 05/12/2022] [Revised: 11/25/2022] [Accepted: 12/07/2022] [Indexed: 01/07/2023]
Abstract
Electronic health records (EHR) are collected as a routine part of healthcare delivery, and have great potential to be utilized to improve patient health outcomes. They contain multiple years of health information to be leveraged for risk prediction, disease detection, and treatment evaluation. However, they do not have a consistent, standardized format across institutions, particularly in the United States, and can present significant analytical challenges: they contain multi-scale data from heterogeneous domains and include both structured and unstructured data. Data for individual patients are collected at irregular time intervals and with varying frequencies. In addition to the analytical challenges, EHR can reflect inequity: patients belonging to different groups will have differing amounts of data in their health records. Many of these issues can contribute to biased data collection. The consequence is that the data for under-served groups may be less informative partly due to more fragmented care, which can be viewed as a type of missing data problem. For EHR data in this complex form, there is currently no framework for introducing realistic missing values. There has also been little to no work in assessing the impact of missing data in EHR. In this work, we first introduce a terminology to define three levels of EHR data and then propose a novel framework for simulating realistic missing data scenarios in EHR to adequately assess their impact on predictive modeling. We incorporate the use of a medical knowledge graph to capture dependencies between medical events to create a more realistic missing data framework. In an intensive care unit setting, we found that missing data have greater negative impact on the performance of disease prediction models in groups that tend to have less access to healthcare, or seek less healthcare. We also found that the impact of missing data on disease prediction models is stronger when using the knowledge graph framework to introduce realistic missing values as opposed to random event removal.
Affiliation(s)
- Emily Getzen
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, United States.
- Lyle Ungar
- Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, United States
- Danielle Mowery
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, United States
- Xiaoqian Jiang
- UTHealth School of Biomedical Informatics, Houston, TX, United States
- Qi Long
- Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA, United States.
20
Chen A, Lu R, Han R, Huang R, Qin G, Wen J, Li Q, Zhang Z, Jiang W. Building Practical Risk Prediction Models for Nasopharyngeal Carcinoma Screening with Patient Graph Analysis and Machine Learning. Cancer Epidemiol Biomarkers Prev 2023; 32:274-280. [PMID: 36480263 DOI: 10.1158/1055-9965.epi-22-0792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2022] [Revised: 09/07/2022] [Accepted: 12/06/2022] [Indexed: 12/13/2022]
Abstract
BACKGROUND To expand nasopharyngeal carcinoma (NPC) screening to larger populations, more practical NPC risk prediction models independent of Epstein-Barr virus (EBV) and other lab tests are necessary. METHODS Patient data before diagnosis of NPC were collected from hospital electronic medical records (EMR) and used to develop machine learning (ML) models for NPC risk prediction using XGBoost. NPC risk factor distributions were generated through connection delta ratio (CDR) analysis of patient graphs. By combining EMR-wide ML with patient graph analysis, the number of variables in these risk models was reduced, allowing for more practical NPC risk prediction ML models. RESULTS Using data collected from 1,357 patients with NPC and 1,448 control patients, an optimal set of 100 variables (ov100) was determined for building NPC risk prediction ML models that had the following performance metrics: 0.93-0.96 recall, 0.80-0.92 precision, and 0.83-0.94 AUC. Aided by the analysis of top CDR-ranked risk factors, the models were further refined to contain only 20 practical variables (pv20), excluding EBV. The pv20 NPC risk XGBoost model achieved 0.79 recall, 0.94 precision, 0.96 specificity, and 0.87 AUC. CONCLUSIONS This study demonstrated the feasibility of developing practical NPC risk prediction models using EMR-wide ML and patient graph CDR analysis, without requiring EBV data. These models could enable broader implementation of NPC risk evaluation and screening recommendations for larger populations in urban community health centers and rural clinics. IMPACT These more practical NPC risk models could help increase NPC screening rate and identify more patients with early-stage NPC.
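Recall, precision, and specificity, as reported in this abstract, all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the counts are made up for illustration and are not taken from the study, and the function name is hypothetical.

```python
def screening_metrics(tp, fp, tn, fn):
    """Return recall, precision, and specificity from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Guard against empty denominators (e.g. no positive predictions).
        return numerator / denominator if denominator else 0.0

    return {
        "recall": ratio(tp, tp + fn),        # share of true cases that were flagged
        "precision": ratio(tp, tp + fp),     # share of flagged cases that were true
        "specificity": ratio(tn, tn + fp),   # share of controls correctly cleared
    }


# Hypothetical counts for a screening model (illustrative only).
metrics = screening_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For a screening setting, recall and specificity are usually read together: the first bounds how many cases are missed, the second how many healthy people are referred unnecessarily.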
Affiliation(s)
- Anjun Chen
- Guilin Medical University, Guilin, Guangxi, China; West China Hospital, Chengdu, Sichuan, China
- Roufeng Lu
- Guilin Medical University, Guilin, Guangxi, China
- Ruobing Han
- Guilin Medical University, Guilin, Guangxi, China
- Ran Huang
- West China Hospital, Chengdu, Sichuan, China
- Guanjie Qin
- Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
- Jian Wen
- Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
- Qinghua Li
- Guilin Medical University, Guilin, Guangxi, China; Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
- Wei Jiang
- Guilin Medical University, Guilin, Guangxi, China; Guilin Medical University Affiliated Hospital, Guilin, Guangxi, China
21
Knowledge mining of unstructured information: application to cyber domain. Sci Rep 2023; 13:1714. [PMID: 36720897 PMCID: PMC9889742 DOI: 10.1038/s41598-023-28796-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2022] [Accepted: 01/24/2023] [Indexed: 02/01/2023]
Abstract
Information on cyber-related crimes, incidents, and conflicts is abundantly available in numerous open online sources. However, processing large volumes and streams of data is a challenging task for the analysts and experts, and entails the need for newer methods and techniques. In this article we present and implement a novel knowledge graph and knowledge mining framework for extracting the relevant information from free-form text about incidents in the cyber domain. The computational framework includes a machine learning-based pipeline for generating graphs of organizations, countries, industries, products and attackers with a non-technical cyber-ontology. The extracted knowledge graph is utilized to estimate the incidence of cyberattacks within a given graph configuration. We use publicly available collections of real cyber-incident reports to test the efficacy of our methods. The knowledge extraction is found to be sufficiently accurate, and the graph-based threat estimation demonstrates a level of correlation with the actual records of attacks. In practical use, an analyst utilizing the presented framework can infer additional information from the current cyber-landscape in terms of the risk to various entities and its propagation between industries and countries.
22
Yang Y, Lu Y, Yan W. A comprehensive review on knowledge graphs for complex diseases. Brief Bioinform 2023; 24:6931722. [PMID: 36528805 DOI: 10.1093/bib/bbac543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/22/2022] [Revised: 11/02/2022] [Accepted: 11/10/2022] [Indexed: 12/23/2022]
Abstract
In recent years, knowledge graphs (KGs) have gained a great deal of popularity as a tool for storing relationships between entities and for performing higher level reasoning. KGs in biomedicine and clinical practice aim to provide an elegant solution for diagnosing and treating complex diseases more efficiently and flexibly. Here, we provide a systematic review to characterize the state-of-the-art of KGs in the area of complex disease research. We cover the following topics: (1) knowledge sources, (2) entity extraction methods, (3) relation extraction methods and (4) the application of KGs in complex diseases. As a result, we offer a complete picture of the domain. Finally, we discuss the challenges in the field by identifying gaps and opportunities for further research and propose potential research directions of KGs for complex disease diagnosis and treatment.
Affiliation(s)
- Yang Yang
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Yuwei Lu
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China; Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
- Wenying Yan
- Department of Bioinformatics, School of Biology and Basic Medical Sciences, Medical College of Soochow University, and Center for Systems Biology, Soochow University, Suzhou 215123, China
23
A weighted-link graph neural network for lung cancer knowledge classification. Appl Intell 2023. [DOI: 10.1007/s10489-022-04437-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/10/2023]
24
Griesser A, Bidmon S. A Process Related View on the Usage of Electronic Health Records from the Patients' Perspective: A Systematic Review. J Med Syst 2022; 47:2. [PMID: 36580132 PMCID: PMC9800349 DOI: 10.1007/s10916-022-01886-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2022] [Accepted: 11/02/2022] [Indexed: 12/30/2022]
Abstract
BACKGROUND In recent years, there has been an increasing interest in electronic health record (EHR) systems and various approaches of encouraging acceptance. Multiple methods of EHR acceptance have been proposed. However, a systematic review of patients' perspectives on their role and challenges in processing EHR remains lacking. Moreover, so far, there has been little discussion about barriers and facilitators of EHR system acceptance and usage from the patients' perspective. METHODS The study was reported according to the PRISMA statement. Six databases were systematically searched using keywords for articles from 2002-2020. We reviewed these data and used an inductive approach to analyse findings. RESULTS A total of 36 studies met the inclusion criteria. Our systematic literature review results reveal a wide range of barriers and facilitators assigned to four distinct stages of EHR system usage: awareness, adoption, behaviour and perception, and consequences. Results were described in a narrative synthesis of the included empirical studies. DISCUSSION Results underline the necessity to put a particular emphasis - but not exclusively - on the initial stage of awareness in the future. Further research in the field is therefore strongly recommended in order to develop tailored mediated communication to foster EHR system usage in the long run.
Collapse
Affiliation(s)
- Anna Griesser
- Department of Marketing and International Management, Alpen-Adria-Universität Klagenfurt, Universitätsstraße 65-67, 9020, Klagenfurt am Wörthersee, Austria
| | - Sonja Bidmon
- Department of Marketing and International Management, Alpen-Adria-Universität Klagenfurt, Universitätsstraße 65-67, 9020, Klagenfurt am Wörthersee, Austria.
| |
|
25
|
Cho HN, Ahn I, Gwon H, Kang HJ, Kim Y, Seo H, Choi H, Kim M, Han J, Kee G, Jun TJ, Kim YH. Heterogeneous graph construction and HinSAGE learning from electronic medical records. Sci Rep 2022; 12:21152. [PMID: 36477457 PMCID: PMC9729175 DOI: 10.1038/s41598-022-25693-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2022] [Accepted: 12/02/2022] [Indexed: 12/12/2022] Open
Abstract
Graph representation learning offers a way to effectively construct and learn patient embeddings from electronic medical records. Adopting this integration will support and advance previous methods for predicting patient prognosis in network models. This study aims to address the challenge of implementing a complex and highly heterogeneous dataset by (1) demonstrating how to build a multi-attributed and multi-relational graph model and (2) applying a downstream disease prediction task of a patient's prognosis using the HinSAGE algorithm. We present a bipartite graph schema and a graph database construction in detail. The first constructed graph database illustrates a query of a predictive network that provides analytical insights using a graph representation of a patient's journey. Moreover, we demonstrate an alternative bipartite model, which we apply to HinSAGE to perform link prediction for predicting event occurrence. The performance evaluation indicated that our heterogeneous graph model predicted successfully as a baseline model. Overall, our graph database demonstrated efficient real-time query performance, and the HinSAGE implementation predicted cardiovascular disease event outcomes through supervised link prediction learning.
Affiliation(s)
- Ha Na Cho
- Division of Cardiology, Department of Internal Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Imjin Ahn
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Hansle Gwon
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Hee Jun Kang
- Division of Cardiology, Department of Internal Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Yunha Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Hyeram Seo
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Heejung Choi
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Minkyoung Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Jiye Han
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Gaeun Kee
- Division of Cardiology, Department of Internal Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Tae Joon Jun
- Big Data Research Center, Asan Institute for Life Sciences, Asan Medical Center, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| | - Young-Hak Kim
- Division of Cardiology, Department of Internal Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympicro 43 Gil, Songpagu, 05505 Seoul, Republic of Korea
| |
|
26
|
Li MM, Huang K, Zitnik M. Graph representation learning in biomedicine and healthcare. Nat Biomed Eng 2022; 6:1353-1369. [PMID: 36316368 PMCID: PMC10699434 DOI: 10.1038/s41551-022-00942-x] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2021] [Accepted: 08/09/2022] [Indexed: 11/11/2022]
Abstract
Networks, or graphs, are universal descriptors of systems of interacting elements. In biomedicine and healthcare, they can represent, for example, molecular interactions, signalling pathways, disease co-morbidities or healthcare systems. In this Perspective, we posit that representation learning can realize principles of network medicine, discuss successes and current limitations of the use of representation learning on graphs in biomedicine and healthcare, and outline algorithmic strategies that leverage the topology of graphs to embed them into compact vectorial spaces. We argue that graph representation learning will keep pushing forward machine learning for biomedicine and healthcare applications, including the identification of genetic variants underlying complex traits, the disentanglement of single-cell behaviours and their effects on health, the assistance of patients in diagnosis and treatment, and the development of safe and effective medicines.
Affiliation(s)
- Michelle M Li
- Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Kexin Huang
- Health Data Science Program, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Marinka Zitnik
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Harvard Data Science Initiative, Cambridge, MA, USA.
| |
|
27
|
Wang Y, Han X, Hao X, Zhu T, Shu H. A Curriculum Batching Strategy for Automatic ICD Coding with Deep Multi-Label Classification Models. Healthcare (Basel) 2022; 10:healthcare10122397. [PMID: 36553921 PMCID: PMC9777784 DOI: 10.3390/healthcare10122397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2022] [Revised: 11/21/2022] [Accepted: 11/25/2022] [Indexed: 12/05/2022] Open
Abstract
The International Classification of Diseases (ICD) has an important role in building applications for clinical medicine. In the minibatch gradient descent (MBGD)-based training of deep multi-label classification models for automatic ICD coding, extremely large ICD label sets and an imbalanced label distribution make the local batch data distribution inconsistent with the global training data distribution. This inconsistency in turn leads to overfitting. To improve the performance and generalization ability of deep learning models for automatic ICD coding, we propose a simple and effective curriculum batching strategy for the MBGD-based training procedure. This strategy generates three batch sets offline by applying three predefined sampling algorithms. These batch sets satisfy a uniform data distribution, a shuffling data distribution and the original training data distribution, respectively, and the learning tasks corresponding to these batch sets range from simple to complex. Experiments show that, after replacing the original shuffling algorithm-based batching strategy with the proposed curriculum batching strategy, the performance of all three investigated deep multi-label classification models for automatic ICD coding improves dramatically. At the same time, the models avoid the overfitting issue and all show a better ability to learn the long-tailed label information. The performance is also better than that of a SOTA label set reconstruction model.
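The three-stage batching described in this abstract can be illustrated with a minimal sketch. The abstract does not specify the sampling algorithms, so everything here is an assumption made for illustration: the function names are hypothetical, "uniform" is read as label-balanced sampling (pick a label uniformly at random, then a sample that carries it), "shuffling" as a random permutation, and "original" as the unmodified dataset order. The batch sets are precomputed as lists of sample indices and concatenated in that order.

```python
import random
from collections import defaultdict


def uniform_batches(samples, batch_size, rng):
    """Assumed 'uniform' stage: draw a label uniformly, then a sample carrying it."""
    by_label = defaultdict(list)
    for idx, (_, labels) in enumerate(samples):
        for label in labels:
            by_label[label].append(idx)
    label_names = sorted(by_label)
    if not label_names:  # no labelled samples: nothing to balance
        return []
    n_batches = max(1, len(samples) // batch_size)
    return [
        [rng.choice(by_label[rng.choice(label_names)]) for _ in range(batch_size)]
        for _ in range(n_batches)
    ]


def shuffled_batches(samples, batch_size, rng):
    """Assumed 'shuffling' stage: one random permutation cut into batches."""
    order = list(range(len(samples)))
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def original_batches(samples, batch_size):
    """Assumed 'original' stage: dataset order, cut into batches."""
    order = list(range(len(samples)))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def curriculum_batches(samples, batch_size, seed=0):
    """Concatenate the three offline batch sets: uniform, shuffled, original."""
    rng = random.Random(seed)
    return (
        uniform_batches(samples, batch_size, rng)
        + shuffled_batches(samples, batch_size, rng)
        + original_batches(samples, batch_size)
    )
```

Each element of `samples` is assumed to be a `(text, labels)` pair, and a training loop would iterate over the returned index batches in order. Whether the published method samples with replacement or orders the stages this way is not stated in the abstract.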
Affiliation(s)
- Yaqiang Wang
- College of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
- Sichuan Key Laboratory of Software Automatic Generation and Intelligent Service, Chengdu University of Information Technology, Chengdu 610225, China
- Correspondence: (Y.W.); (X.H.)
| | - Xu Han
- College of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
- Sichuan Key Laboratory of Software Automatic Generation and Intelligent Service, Chengdu University of Information Technology, Chengdu 610225, China
| | - Xuechao Hao
- Department of Anesthesiology, West China Hospital, Sichuan University, Chengdu 621005, China
- The Research Units of West China, Chinese Academy of Medical Sciences, West China Hospital, Sichuan University, Chengdu 621005, China
- Correspondence: (Y.W.); (X.H.)
| | - Tao Zhu
- Department of Anesthesiology, West China Hospital, Sichuan University, Chengdu 621005, China
- The Research Units of West China, Chinese Academy of Medical Sciences, West China Hospital, Sichuan University, Chengdu 621005, China
| | - Hongping Shu
- College of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
- Sichuan Key Laboratory of Software Automatic Generation and Intelligent Service, Chengdu University of Information Technology, Chengdu 610225, China
| |
|
28
|
Chen A, Huang R, Wu E, Han R, Wen J, Li Q, Zhang Z, Shen B. The Generation of a Lung Cancer Health Factor Distribution Using Patient Graphs Constructed From Electronic Medical Records: Retrospective Study. J Med Internet Res 2022; 24:e40361. [PMID: 36427233 PMCID: PMC9736747 DOI: 10.2196/40361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2022] [Revised: 09/09/2022] [Accepted: 10/25/2022] [Indexed: 11/27/2022] Open
Abstract
BACKGROUND Electronic medical records (EMRs) of patients with lung cancer (LC) capture a variety of health factors. Understanding the distribution of these factors will help identify key factors for risk prediction in preventive screening for LC. OBJECTIVE We aimed to generate an integrated biomedical graph from EMR data and Unified Medical Language System (UMLS) ontology for LC, and to generate an LC health factor distribution from a hospital EMR of approximately 1 million patients. METHODS The data were collected from 2 sets of 1397 patients with and those without LC. A patient-centered health factor graph was plotted with 108,000 standardized data, and a graph database was generated to integrate the graphs of patient health factors and the UMLS ontology. With the patient graph, we calculated the connection delta ratio (CDR) for each of the health factors to measure the relative strength of the factor's relationship to LC. RESULTS The patient graph had 93,000 relations between the 2794 patient nodes and 650 factor nodes. An LC graph with 187 related biomedical concepts and 188 horizontal biomedical relations was plotted and linked to the patient graph. Searching the integrated biomedical graph with any number or category of health factors resulted in graphical representations of relationships between patients and factors, while searches using any patient presented the patient's health factors from the EMR and the LC knowledge graph (KG) from the UMLS in the same graph. Sorting the health factors by CDR in descending order generated a distribution of health factors for LC. The top 70 CDR-ranked factors of disease, symptom, medical history, observation, and laboratory test categories were verified to be concordant with those found in the literature. 
CONCLUSIONS By collecting standardized data of thousands of patients with and those without LC from the EMR, it was possible to generate a hospital-wide patient-centered health factor graph for graph search and presentation. The patient graph could be integrated with the UMLS KG for LC and thus enable hospitals to bring continuously updated international standard biomedical KGs from the UMLS for clinical use in hospitals. CDR analysis of the graph of patients with LC generated a CDR-sorted distribution of health factors, in which the top CDR-ranked health factors were concordant with the literature. The resulting distribution of LC health factors can be used to help personalize risk evaluation and preventive screening recommendations.
Affiliation(s)
- Anjun Chen
- Institutes for System Genetics, West China Hospital, Chengdu, China
| | - Ran Huang
- Institutes for System Genetics, West China Hospital, Chengdu, China
| | - Erman Wu
- Institutes for System Genetics, West China Hospital, Chengdu, China
| | | | - Jian Wen
- Guilin Medical University Affiliated Hospital, Guilin, China
| | - Qinghua Li
- Guilin Medical University, Guilin, China
| | | | - Bairong Shen
- Institutes for System Genetics, West China Hospital, Chengdu, China
| |
|
29
|
Xiao G, Pfaff E, Prud'hommeaux E, Booth D, Sharma DK, Huo N, Yu Y, Zong N, Ruddy KJ, Chute CG, Jiang G. FHIR-Ontop-OMOP: Building clinical knowledge graphs in FHIR RDF with the OMOP Common data Model. J Biomed Inform 2022; 134:104201. [PMID: 36089199 PMCID: PMC9561043 DOI: 10.1016/j.jbi.2022.104201] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2022] [Revised: 08/04/2022] [Accepted: 09/04/2022] [Indexed: 11/26/2022]
Abstract
BACKGROUND Knowledge graphs (KGs) play a key role in enabling explainable artificial intelligence (AI) applications in healthcare. Constructing clinical knowledge graphs (CKGs) against heterogeneous electronic health records (EHRs) has been desired by the research and healthcare AI communities. From the standardization perspective, community-based standards such as the Fast Healthcare Interoperability Resources (FHIR) and the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) are increasingly used to represent and standardize EHR data for clinical data analytics; however, the potential of such standards for building CKGs has not been well investigated. OBJECTIVE To develop and evaluate methods and tools that expose OMOP CDM-based clinical data repositories as virtual clinical KGs that are compliant with the FHIR Resource Description Framework (RDF) specification. METHODS We developed a system called FHIR-Ontop-OMOP to generate virtual clinical KGs from the OMOP relational databases. We leveraged an OMOP CDM-based Medical Information Mart for Intensive Care (MIMIC-III) data repository to evaluate the FHIR-Ontop-OMOP system in terms of the faithfulness of data transformation and the conformance of the generated CKGs to the FHIR RDF specification. RESULTS A beta version of the system has been released. A total of more than 100 data element mappings from 11 OMOP CDM clinical data, health system and vocabulary tables were implemented in the system, covering 11 FHIR resources. The generated virtual CKG from MIMIC-III contains 46,520 instances of FHIR Patient, 716,595 instances of Condition, 1,063,525 instances of Procedure, 24,934,751 instances of MedicationStatement, 365,181,104 instances of Observation, and 4,779,672 instances of CodeableConcept. Patient counts identified by five pairs of SQL (over the MIMIC database) and SPARQL (over the virtual CKG) queries were identical, ensuring the faithfulness of the data transformation. 
The CKGs generated as RDF triples for 100 patients were fully conformant with the FHIR RDF specification. CONCLUSION The FHIR-Ontop-OMOP system can expose an OMOP database as a FHIR-compliant RDF graph. It provides a meaningful use case demonstrating the potential that can be enabled by the interoperability between FHIR and OMOP CDM. Generated clinical KGs in FHIR RDF provide a semantic foundation to enable explainable AI applications in healthcare.
Affiliation(s)
- Guohui Xiao
- University of Bergen, Norway; University of Oslo, Norway; Ontopic S.r.l., Italy.
| | - Emily Pfaff
- University of North Carolina, Chapel Hill, NC, USA
| | | | | | | | - Nan Huo
- Mayo Clinic, Rochester, MN, USA
| | - Yue Yu
- Mayo Clinic, Rochester, MN, USA
| | | | | | | | | |
|
30
|
Jing F, Ren H, Cheng W, Wang X, Zhang Q. Knowledge-enhanced attentive learning for answer selection in community question answering systems. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2022.109117] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
31
|
Brakefield WS, Ammar N, Shaban-Nejad A. An Urban Population Health Observatory for Disease Causal Pathway Analysis and Decision Support: Underlying Explainable Artificial Intelligence Model. JMIR Form Res 2022; 6:e36055. [PMID: 35857363 PMCID: PMC9350817 DOI: 10.2196/36055] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2021] [Revised: 05/03/2022] [Accepted: 06/07/2022] [Indexed: 01/16/2023] Open
Abstract
Background Many researchers have aimed to develop chronic health surveillance systems to assist in public health decision-making. Several digital health solutions created lack the ability to explain their decisions and actions to human users. Objective This study sought to (1) expand our existing Urban Population Health Observatory (UPHO) system by incorporating a semantics layer; (2) cohesively employ machine learning and semantic/logical inference to provide measurable evidence and detect pathways leading to undesirable health outcomes; (3) provide clinical use case scenarios and design case studies to identify socioenvironmental determinants of health associated with the prevalence of obesity, and (4) design a dashboard that demonstrates the use of UPHO in the context of obesity surveillance using the provided scenarios. Methods The system design includes a knowledge graph generation component that provides contextual knowledge from relevant domains of interest. This system leverages semantics using concepts, properties, and axioms from existing ontologies. In addition, we used the publicly available US Centers for Disease Control and Prevention 500 Cities data set to perform multivariate analysis. A cohesive approach that employs machine learning and semantic/logical inference reveals pathways leading to diseases. Results In this study, we present 2 clinical case scenarios and a proof-of-concept prototype design of a dashboard that provides warnings, recommendations, and explanations and demonstrates the use of UPHO in the context of obesity surveillance, treatment, and prevention. While exploring the case scenarios using a support vector regression machine learning model, we found that poverty, lack of physical activity, education, and unemployment were the most important predictive variables that contribute to obesity in Memphis, TN. Conclusions The application of UPHO could help reduce health disparities and improve urban population health. 
The expanded UPHO feature incorporates an additional level of interpretable knowledge to enhance physicians, researchers, and health officials' informed decision-making at both patient and community levels. International Registered Report Identifier (IRRID) RR2-10.2196/28269
Affiliation(s)
- Whitney S Brakefield
- Center for Biomedical Informatics, Department of Pediatrics, College of Medicine, University of Tennessee Health Science Center, Memphis, TN, United States; Bredesen Center for Data Science, University of Tennessee, Knoxville, TN, United States
| | - Nariman Ammar
- Center for Biomedical Informatics, Department of Pediatrics, College of Medicine, University of Tennessee Health Science Center, Memphis, TN, United States; Ochsner Xavier Institute for Health Equity and Research, Ochsner Clinic Foundation, New Orleans, LA, United States
| | - Arash Shaban-Nejad
- Center for Biomedical Informatics, Department of Pediatrics, College of Medicine, University of Tennessee Health Science Center, Memphis, TN, United States
| |
|
32
|
Construction of Disease-Symptom Knowledge Graph from Web-Board Documents. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12136615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/27/2023]
Abstract
The research aim is to construct a disease-symptom knowledge graph (DSKG) as a cause-effect knowledge graph containing disease-symptom relations as a cause-effect relation type determined from downloaded documents on medical web-board resources. Each disease-symptom relation connects a disease-name concept node (a causative-concept node) to a corresponding node having a group of correlated symptom-concept/effect-concept features as common symptom-concept/effect-concept features among some disease-name concepts. The DSKG benefits non-professionals in preliminary diagnosis through a recommender web-board. There are three main problems: how to determine symptom concepts from sentences without annotation on the documents having disease-name concepts as the documents’ topic-names; how to determine the disease-symptom relations from the documents with/without complications; and how to construct the DSKG involving high dimensional symptom-concept features after union of the correlated symptom-concept groups. Therefore, we apply a word co-occurrence pattern including medical-symptom expressions from Wikipedia including MeSH and the Lexitron Dictionary to determine the symptom concepts. The Cartesian product is applied for automatic-supervised machine learning to determine the disease-symptom relation. We propose using Principal Component Analysis for constructing the DSKG by dimensionality reduction in the symptom-concept features with minimized information loss. In contrast to previous works, the proposed approach enables the DSKG construction with precise and concise representation scores of 7.8 and 9, respectively.
|
33
|
A lattice LSTM-based framework for knowledge graph construction from power plants maintenance reports. SERVICE ORIENTED COMPUTING AND APPLICATIONS 2022. [DOI: 10.1007/s11761-022-00338-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
|
34
|
Graph neural network approaches for drug-target interactions. Curr Opin Struct Biol 2022; 73:102327. [DOI: 10.1016/j.sbi.2021.102327] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2021] [Revised: 11/22/2021] [Accepted: 12/13/2021] [Indexed: 01/06/2023]
|
35
|
Weng H, Chen J, Ou A, Lao Y. Leveraging Representation Learning for the Construction and Application of Knowledge Graph for Traditional Chinese Medicine (Preprint). JMIR Med Inform 2022; 10:e38414. [PMID: 36053574 PMCID: PMC9482071 DOI: 10.2196/38414] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2022] [Revised: 07/04/2022] [Accepted: 07/27/2022] [Indexed: 11/13/2022] Open
Affiliation(s)
- Heng Weng
- State Key Laboratory of Dampness Syndrome of Chinese Medicine, Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
| | - Jielong Chen
- School of Information Science, Guangdong University of Finance & Economics, Guangzhou, China
| | - Aihua Ou
- State Key Laboratory of Dampness Syndrome of Chinese Medicine, Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
| | - Yingrong Lao
- State Key Laboratory of Dampness Syndrome of Chinese Medicine, Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
| |
|
36
|
Lan G, Liu T, Wang X, Pan X, Huang Z. A semantic web technology index. Sci Rep 2022; 12:3672. [PMID: 35256665 PMCID: PMC8901930 DOI: 10.1038/s41598-022-07615-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/07/2021] [Accepted: 01/13/2022] [Indexed: 01/22/2023] Open
Abstract
Semantic web (SW) technology has been widely applied to many domains such as medicine, health care, finance, geology. At present, researchers mainly rely on their experience and preferences to develop and evaluate the work of SW technology. Although the general architecture (e.g., Tim Berners-Lee’s Semantic Web Layer Cake) of SW technology was proposed many years ago and has been well-known, it still lacks a concrete guideline for standardizing the development of SW technology. In this paper, we propose an SW technology index to standardize the development for ensuring that the work of SW technology is designed well and to quantitatively evaluate the quality of the work in SW technology. This index consists of 10 criteria that quantify the quality as a score of 0-10. We address each criterion in detail for a clear explanation from three aspects: (1) what is the criterion? (2) why do we consider this criterion and (3) how do the current studies meet this criterion? Finally, we present the validation of this index by providing some examples of how to apply the index to the validation cases. We conclude that the index is a useful standard to guide and evaluate the work in SW technology.
|
37
|
Process knowledge graph modeling techniques and application methods for ship heterogeneous models. Sci Rep 2022; 12:2911. [PMID: 35190625 PMCID: PMC8861156 DOI: 10.1038/s41598-022-06940-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2021] [Accepted: 02/09/2022] [Indexed: 12/03/2022] Open
Abstract
In the process design and reuse of marine component products, there are a lot of heterogeneous models, causing the problem that the process knowledge and process design experience contained in them are difficult to express and reuse. Therefore, a process knowledge representation model for ship heterogeneous model is proposed in this paper. Firstly, the multi-element process knowledge graph is constructed, and the heterogeneous ship model is described in a unified way. Then, the multi-strategy ontology mapping method is applied, and the semantic expression between the process knowledge graph and the entity model is realized. Finally, by obtaining implicit semantics based on case-based reasoning and checking the similarity of the matching results, the case knowledge reuse is achieved, to achieve rapid design of the process. This method provides reliable technical support for the design of ship component assembly and welding process, greatly shortens the design cycle, and improves the working efficiency. In addition, taking the double-deck bottom segment of a ship as an example, the process knowledge map of the heterogeneous model is constructed to realize the rapid design of ship process, which shows that the method can effectively acquire the process knowledge in the design case and improve the efficiency and intelligence of knowledge reuse in the process design of the heterogeneous model of a ship.
|
38
|
|
39
|
Risk Assessment of Alpine Skiing Events Based on Knowledge Graph: A Focus on Meteorological Conditions. ISPRS INTERNATIONAL JOURNAL OF GEO-INFORMATION 2021. [DOI: 10.3390/ijgi10120835] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
The alpine skiing event is particularly vulnerable to changes in meteorological conditions as a winter sport held outdoors. The commonly used risk assessment methods are inflexible and cannot be dynamically adjusted to combine multiple risk factors and actual conditions. A knowledge graph can organize data resources in the risk domain as structured knowledge systems. This paper combines a knowledge graph and risk assessment to effectively assess the risk status. First, we review the relevant literature on sports event risk assessment in light of the characteristics of alpine skiing events, and we summarize the risk types of alpine skiing events and related risk knowledge. Second, we propose an event risk assessment model based on the RippleNet framework combined with the characteristics of large-scale sports events. Moreover, the validity of the model is verified. The results show that the RippleNet-based event risk assessment model can be used to assess the risk of alpine skiing events. By dealing effectively with the variety of risks that arise in large-scale sports events, the model provides a strong guarantee for their smooth implementation.
|
40
|
Li Z, Zhong Q, Yang J, Duan Y, Wang W, Wu C, He K. DeepKG: an end-to-end deep learning-based workflow for biomedical knowledge graph extraction, optimization and applications. Bioinformatics 2021; 38:1477-1479. [PMID: 34788369 PMCID: PMC8689937 DOI: 10.1093/bioinformatics/btab767] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2021] [Revised: 10/11/2021] [Accepted: 11/01/2021] [Indexed: 01/05/2023] Open
Abstract
SUMMARY DeepKG is an end-to-end deep learning-based workflow that helps researchers automatically mine valuable knowledge in biomedical literature. Users can utilize it to establish customized knowledge graphs in specified domains, thus facilitating in-depth understanding on disease mechanisms and applications on drug repurposing and clinical research. To improve the performance of DeepKG, a cascaded hybrid information extraction framework is developed for training model of 3-tuple extraction, and a novel AutoML-based knowledge representation algorithm (AutoTransX) is proposed for knowledge representation and inference. The system has been deployed in dozens of hospitals and extensive experiments strongly evidence the effectiveness. In the context of 144 900 COVID-19 scholarly full-text literature, DeepKG generates a high-quality knowledge graph with 7980 entities and 43 760 3-tuples, a candidate drug list, and relevant animal experimental studies are being carried out. To accelerate more studies, we make DeepKG publicly available and provide an online tool including the data of 3-tuples, potential drug list, question answering system, visualization platform. AVAILABILITY AND IMPLEMENTATION All the results are publicly available at the website (http://covidkg.ai/). SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Zongren Li
- Medical Big Data Research Center, Chinese PLA General Hospital, Beijing 100039, China; Medical Artificial Intelligence Research Center, Chinese PLA General Hospital, Beijing 100853, China
| | - Qin Zhong
- The Medical School of Chinese PLA, Chinese PLA General Hospital, Beijing 100039, China
| | - Jing Yang
- The Medical School of Chinese PLA, Chinese PLA General Hospital, Beijing 100039, China
| | - Yongjie Duan
- The Medical School of Chinese PLA, Chinese PLA General Hospital, Beijing 100039, China
| | - Wenjun Wang
- Bio-engineering Research Center, Chinese PLA General Hospital, Beijing 100039, China
| | - Chengkun Wu
- State Key Laboratory of High-Performance Computing, School of Computer Science, National University of Defense Technology, Hunan, Changsha, 410073, China. To whom correspondence should be addressed.
| | - Kunlun He
- Medical Big Data Research Center, Chinese PLA General Hospital, Beijing 100039, China. To whom correspondence should be addressed.
| |
|
41
|
Sentiment Analysis in Twitter Based on Knowledge Graph and Deep Learning Classification. ELECTRONICS 2021. [DOI: 10.3390/electronics10222739] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/17/2022]
Abstract
The traditional way to address the problem of sentiment classification is based on machine learning techniques; however, these models are not able to grasp all the richness of the text that comes from different social media, personal web pages, blogs, etc., ignoring the semantics of the text. Knowledge graphs give a way to extract structured knowledge from images and texts in order to facilitate their semantic analysis. This work proposes a new hybrid approach for Sentiment Analysis based on Knowledge Graphs and Deep Learning techniques to identify the sentiment polarity (positive or negative) in short documents, such as posts on Twitter. In this proposal, tweets are represented as graphs; then, graph similarity metrics and a Deep Learning classification algorithm are applied to produce sentiment predictions. This approach facilitates the traceability and interpretability of the classification results, thanks to the integration of the Local Interpretable Model-agnostic Explanations (LIME) model at the end of the pipeline. LIME helps raise trust in predictive models, since the model is no longer a black box. Opening the black box makes it possible to understand and interpret how the network distinguishes between sentiment polarities. Each phase of the proposed approach is described: pre-processing, graph construction, dimensionality reduction, graph similarity, sentiment prediction, and interpretability. The proposal is compared with character n-gram embeddings-based Deep Learning models for Sentiment Analysis. Results show that the proposal outperforms classical n-gram models, with a recall of up to 89% and an F1-score of 88%.
|
42
|
Lu Z, Sim JA, Wang JX, Forrest CB, Krull KR, Srivastava D, Hudson MM, Robison LL, Baker JN, Huang IC. Natural Language Processing and Machine Learning Methods to Characterize Unstructured Patient-Reported Outcomes: Validation Study. J Med Internet Res 2021; 23:e26777. [PMID: 34730546 PMCID: PMC8600437 DOI: 10.2196/26777] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/25/2020] [Revised: 03/20/2021] [Accepted: 08/12/2021] [Indexed: 01/22/2023]
Abstract
BACKGROUND Assessing patient-reported outcomes (PROs) through interviews or conversations during clinical encounters provides insightful information about survivorship. OBJECTIVE This study aims to test the validity of natural language processing (NLP) and machine learning (ML) algorithms in identifying different attributes of pain interference and fatigue symptoms experienced by child and adolescent survivors of cancer versus the judgment by PRO content experts as the gold standard to validate NLP/ML algorithms. METHODS This cross-sectional study focused on child and adolescent survivors of cancer, aged 8 to 17 years, and caregivers, from whom 391 meaning units in the pain interference domain and 423 in the fatigue domain were generated for analyses. Data were collected from the After Completion of Therapy Clinic at St. Jude Children's Research Hospital. Experienced pain interference and fatigue symptoms were reported through in-depth interviews. After verbatim transcription, analyzable sentences (ie, meaning units) were semantically labeled by 2 content experts for each attribute (physical, cognitive, social, or unclassified). Two NLP/ML methods were used to extract and validate the semantic features: bidirectional encoder representations from transformers (BERT) and Word2vec plus one of the ML methods, the support vector machine or extreme gradient boosting. Receiver operating characteristic and precision-recall curves were used to evaluate the accuracy and validity of the NLP/ML methods. RESULTS Compared with Word2vec/support vector machine and Word2vec/extreme gradient boosting, BERT demonstrated higher accuracy in both symptom domains, with 0.931 (95% CI 0.905-0.957) and 0.916 (95% CI 0.887-0.941) for problems with cognitive and social attributes on pain interference, respectively, and 0.929 (95% CI 0.903-0.953) and 0.917 (95% CI 0.891-0.943) for problems with cognitive and social attributes on fatigue, respectively. 
In addition, BERT yielded superior areas under the receiver operating characteristic curve for cognitive attributes on pain interference and fatigue domains (0.923, 95% CI 0.879-0.997; 0.948, 95% CI 0.922-0.979) and superior areas under the precision-recall curve for cognitive attributes on pain interference and fatigue domains (0.818, 95% CI 0.735-0.917; 0.855, 95% CI 0.791-0.930). CONCLUSIONS The BERT method performed better than the other methods. As an alternative to using standard PRO surveys, collecting unstructured PROs via interviews or conversations during clinical encounters and applying NLP/ML methods can facilitate PRO assessment in child and adolescent cancer survivors.
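The area under the receiver operating characteristic curve reported in this abstract can be read as the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The following is a minimal sketch of that pairwise definition in plain Python; the function name `auroc` and the toy labels and scores are illustrative assumptions and are not taken from the study.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.

    labels: iterable of 0/1 ground-truth labels (1 = attribute present).
    scores: iterable of classifier scores, higher = more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 3 of the 4 positive/negative pairs are ordered correctly.
example = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 0.75
```

This quadratic-time form is chosen for clarity; a rank-based implementation gives the same value more efficiently on large samples.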
Affiliation(s)
- Zhaohua Lu
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN, United States
| | - Jin-Ah Sim
- Department of Epidemiology and Cancer Control, St. Jude Children's Research Hospital, Memphis, TN, United States
- School of AI Convergence, Hallym University, Chuncheon, Republic of Korea
| | - Jade X Wang
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN, United States
| | - Christopher B Forrest
- Roberts Center for Pediatric Research, Children's Hospital of Philadelphia, Philadelphia, PA, United States
| | - Kevin R Krull
- Department of Epidemiology and Cancer Control, St. Jude Children's Research Hospital, Memphis, TN, United States
| | - Deokumar Srivastava
- Department of Biostatistics, St. Jude Children's Research Hospital, Memphis, TN, United States
| | - Melissa M Hudson
- Department of Oncology, St. Jude Children's Research Hospital, Memphis, TN, United States
| | - Leslie L Robison
- Department of Epidemiology and Cancer Control, St. Jude Children's Research Hospital, Memphis, TN, United States
| | - Justin N Baker
- Department of Oncology, St. Jude Children's Research Hospital, Memphis, TN, United States
| | - I-Chan Huang
- Department of Epidemiology and Cancer Control, St. Jude Children's Research Hospital, Memphis, TN, United States
| |
|
43
|
|
44
|
Zanotto BS, Beck da Silva Etges AP, Dal Bosco A, Cortes EG, Ruschel R, De Souza AC, Andrade CMV, Viegas F, Canuto S, Luiz W, Ouriques Martins S, Vieira R, Polanczyk C, André Gonçalves M. Stroke Outcome Measurements From Electronic Medical Records: Cross-sectional Study on the Effectiveness of Neural and Nonneural Classifiers. JMIR Med Inform 2021; 9:e29120. [PMID: 34723829 PMCID: PMC8593798 DOI: 10.2196/29120] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/29/2021] [Revised: 06/27/2021] [Accepted: 08/05/2021] [Indexed: 01/20/2023]
Abstract
BACKGROUND With the rapid adoption of electronic medical records (EMRs), there is an ever-increasing opportunity to collect data and extract knowledge from EMRs to support patient-centered stroke management. OBJECTIVE This study aims to compare the effectiveness of state-of-the-art automatic text classification methods in classifying data to support the prediction of clinical patient outcomes and the extraction of patient characteristics from EMRs. METHODS Our study addressed the computational problems of information extraction and automatic text classification. We identified essential tasks to be considered in an ischemic stroke value-based program. The 30 selected tasks were classified (manually labeled by specialists) according to the following value agenda: tier 1 (achieved health care status), tier 2 (recovery process), care related (clinical management and risk scores), and baseline characteristics. The analyzed data set was retrospectively extracted from the EMRs of patients with stroke from a private Brazilian hospital between 2018 and 2019. A total of 44,206 sentences from free-text medical records in Portuguese were used to train and develop 10 supervised computational machine learning methods, including state-of-the-art neural and nonneural methods, along with ontological rules. As an experimental protocol, we used a 5-fold cross-validation procedure repeated 6 times, along with subject-wise sampling. A heatmap was used to display comparative result analyses according to the best algorithmic effectiveness (F1 score), supported by statistical significance tests. A feature importance analysis was conducted to provide insights into the results. RESULTS The top-performing models were support vector machines trained with lexical and semantic textual features, showing the importance of dealing with noise in EMR textual representations. 
The support vector machine models produced statistically superior results in 71% (17/24) of tasks, with an F1 score >80% regarding care-related tasks (patient treatment location, fall risk, thrombolytic therapy, and pressure ulcer risk), the process of recovery (ability to feed orally or ambulate and communicate), health care status achieved (mortality), and baseline characteristics (diabetes, obesity, dyslipidemia, and smoking status). Neural methods were largely outperformed by more traditional nonneural methods, given the characteristics of the data set. Ontological rules were also effective in tasks such as baseline characteristics (alcoholism, atrial fibrillation, and coronary artery disease) and the Rankin scale. The complementarity in effectiveness among models suggests that a combination of models could enhance the results and cover more tasks in the future. CONCLUSIONS Advances in information technology capacity are essential for scalability and agility in measuring health status outcomes. This study allowed us to measure effectiveness and identify opportunities for automating the classification of outcomes of specific tasks related to clinical conditions of stroke victims, and thus ultimately assess the possibility of proactively using these machine learning techniques in real-world situations.
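The experimental protocol named in this abstract (5-fold cross-validation repeated 6 times with subject-wise sampling) keeps all sentences from one patient on the same side of every split, so a model is never tested on a patient it saw during training. A minimal sketch of such a splitter in plain Python follows; the function name, the seed, and the toy patient identifiers are assumptions made for illustration, not details from the study.

```python
import random
from collections import defaultdict


def repeated_subject_wise_kfold(subject_ids, n_splits=5, n_repeats=6, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) tuples.

    subject_ids[i] is the subject (e.g. patient) that sample i belongs to.
    Whole subjects are assigned to folds, so no subject appears in both
    the training and the test indices of the same split.
    """
    by_subject = defaultdict(list)
    for idx, subject in enumerate(subject_ids):
        by_subject[subject].append(idx)

    subjects = sorted(by_subject)
    if len(subjects) < n_splits:
        raise ValueError("need at least n_splits distinct subjects")

    rng = random.Random(seed)
    for repeat in range(n_repeats):
        shuffled = subjects[:]
        rng.shuffle(shuffled)  # a fresh subject ordering for every repetition
        folds = [shuffled[i::n_splits] for i in range(n_splits)]
        for fold, test_subjects in enumerate(folds):
            held_out = set(test_subjects)
            test_idx = [i for s in test_subjects for i in by_subject[s]]
            train_idx = [i for s in shuffled if s not in held_out
                         for i in by_subject[s]]
            yield repeat, fold, train_idx, test_idx


# Hypothetical toy data: 12 sentences drawn from 6 patients.
subject_ids = ["p1", "p1", "p2", "p2", "p3", "p3",
               "p4", "p4", "p5", "p5", "p6", "p6"]
splits = list(repeated_subject_wise_kfold(subject_ids))
```

With the defaults this produces 30 train/test splits (5 folds times 6 repetitions), over which a metric such as the F1 score can be averaged.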
Affiliation(s)
- Bruna Stella Zanotto
- National Institute of Health Technology Assessment - INCT/IATS (CNPQ 465518/2014-1), Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil.,Graduate Program in Epidemiology, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
| | - Ana Paula Beck da Silva Etges
- National Institute of Health Technology Assessment - INCT/IATS (CNPQ 465518/2014-1), Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil.,School of Technology, Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, Brazil
| | - Avner Dal Bosco
- School of Technology, Pontifícia Universidade Católica do Rio Grande do Sul, Porto Alegre, Brazil
| | - Eduardo Gabriel Cortes
- Graduate Program of Computer Science, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
| | - Renata Ruschel
- National Institute of Health Technology Assessment - INCT/IATS (CNPQ 465518/2014-1), Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
| | | | - Claudio M V Andrade
- Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | - Felipe Viegas
- Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | - Sergio Canuto
- Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | - Washington Luiz
- Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | | | - Renata Vieira
- Centro Interdisciplinar de História, Culturas e Sociedades (CIDEHUS), Universidade de Évora, Évora, Portugal
| | - Carisi Polanczyk
- National Institute of Health Technology Assessment - INCT/IATS (CNPQ 465518/2014-1), Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil.,Graduate Program in Epidemiology, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
| | - Marcos André Gonçalves
- Computer Science Department, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| |
|
45
|
Jiang Y, Gao X, Su W, Li J. Systematic Knowledge Management of Construction Safety Standards Based on Knowledge Graphs: A Case Study in China. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2021; 18:ijerph182010692. [PMID: 34682437 PMCID: PMC8536078 DOI: 10.3390/ijerph182010692] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/26/2021] [Revised: 10/01/2021] [Accepted: 10/03/2021] [Indexed: 12/02/2022]
Abstract
Construction safety standards (CSS) have knowledge characteristics, but few studies have introduced knowledge graphs (KG) as a tool into CSS management. In order to improve CSS knowledge management, this paper first analyzed the knowledge structure of 218 standards and obtained three knowledge levels of CSS. Second, a concept layer was designed which consisted of five levels of concepts and eight types of relationships. Third, an entity layer containing 147 entities was constructed via entity identification, attribute extraction and entity extraction. Finally, 177 nodes and 11 types of attributes were collected and the construction of a knowledge graph of construction safety standard (KGCSS) was completed using knowledge storage. Furthermore, we implemented knowledge inference and obtained CSS planning, i.e., the list of standard work plans used to guide the development and revision of CSS. In addition, we conducted CSS knowledge retrieval; a process which supports interrogative input. The construction of KGCSS thus facilitates the analysis, querying, and sharing of safety standards knowledge.
|
46
|
Jonnagaddala J, Chen A, Batongbacal S, Nekkantti C. The OpenDeID corpus for patient de-identification. Sci Rep 2021; 11:19973. [PMID: 34620985 PMCID: PMC8497517 DOI: 10.1038/s41598-021-99554-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/24/2021] [Accepted: 09/28/2021] [Indexed: 11/18/2022]
Abstract
For research purposes, protected health information is often redacted from unstructured electronic health records to preserve patient privacy and confidentiality. The OpenDeID corpus is designed to assist development of automatic methods to redact sensitive information from unstructured electronic health records. We retrieved 4548 unstructured surgical pathology reports from four urban Australian hospitals. The corpus was developed by two annotators under three different experimental settings. The quality of the annotations was evaluated for each setting. Specifically, we employed serial annotations, parallel annotations, and pre-annotations. Our results suggest that the pre-annotations approach is not reliable in terms of quality when compared to the serial annotations but can drastically reduce annotation time. The OpenDeID corpus comprises 2,100 pathology reports from 1,833 cancer patients with an average of 737.49 tokens and 7.35 protected health information entities annotated per report. The overall inter-annotator agreement and deviation scores are 0.9464 and 0.9726, respectively. Realistic surrogates are also generated to make the corpus suitable for distribution to other researchers.
Affiliation(s)
| | - Aipeng Chen
- School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
| | - Sean Batongbacal
- School of Computer Science and Engineering, UNSW Sydney, Sydney, Australia
| | | |
|
47
|
Yu Y, Huang K, Zhang C, Glass LM, Sun J, Xiao C. SumGNN: multi-typed drug interaction prediction via efficient knowledge graph summarization. Bioinformatics 2021; 37:2988-2995. [PMID: 33769494 DOI: 10.1093/bioinformatics/btab207] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 10/22/2020] [Revised: 02/07/2021] [Accepted: 03/24/2021] [Indexed: 02/02/2023]
Abstract
MOTIVATION Thanks to the increasing availability of drug-drug interaction (DDI) datasets and large biomedical knowledge graphs (KGs), accurate detection of adverse DDIs using machine learning models has become possible. However, how to effectively utilize large and noisy biomedical KGs for DDI detection remains largely an open problem. Because of the sheer size of KGs and the amount of noise in them, it is often less beneficial to directly integrate KGs with other smaller but higher-quality data (e.g. experimental data). Most existing approaches ignore KGs altogether. Some try to directly integrate KGs with other data via graph neural networks, with limited success. Furthermore, most previous works focus on binary DDI prediction, whereas multi-typed DDI pharmacological effect prediction is a more meaningful but harder task. RESULTS To fill these gaps, we propose a new method, SumGNN (knowledge summarization graph neural network), which is enabled by a subgraph extraction module that can efficiently anchor on relevant subgraphs from a KG, a self-attention-based subgraph summarization scheme that generates reasoning paths within the subgraph, and a multi-channel knowledge and data integration module that utilizes massive external biomedical knowledge for significantly improved multi-typed DDI predictions. SumGNN outperforms the best baseline by up to 5.54%, and the performance gain is particularly significant for low-data relation types. In addition, SumGNN provides interpretable predictions via the generated reasoning paths. AVAILABILITY AND IMPLEMENTATION The code is available in Supplementary Material. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
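The idea of anchoring on a relevant subgraph for a drug pair can be illustrated with a generic k-hop neighborhood extraction. The sketch below is not SumGNN's actual module, whose details the abstract does not give; it assumes the KG is a list of (head, relation, tail) triples, treats edges as undirected for the search, and takes the union of the two drugs' k-hop neighborhoods (other designs use the intersection). All entity and relation names are hypothetical.

```python
from collections import defaultdict, deque


def k_hop_nodes(adjacency, start, k):
    """Return the set of nodes within k hops of start (breadth-first search)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue  # do not expand beyond k hops
        for neighbor in adjacency[node]:
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    return set(depth)


def extract_pair_subgraph(triples, drug_a, drug_b, k=1):
    """Keep the triples whose endpoints both lie within k hops of either drug."""
    adjacency = defaultdict(set)
    for head, _relation, tail in triples:
        adjacency[head].add(tail)  # edges are treated as undirected
        adjacency[tail].add(head)
    keep = k_hop_nodes(adjacency, drug_a, k) | k_hop_nodes(adjacency, drug_b, k)
    return [t for t in triples if t[0] in keep and t[2] in keep]


# Hypothetical toy knowledge graph.
triples = [
    ("drugA", "targets", "protein1"),
    ("drugB", "targets", "protein1"),
    ("protein1", "involved_in", "pathway1"),
    ("drugC", "targets", "protein2"),
]
one_hop = extract_pair_subgraph(triples, "drugA", "drugB", k=1)
two_hop = extract_pair_subgraph(triples, "drugA", "drugB", k=2)
```

Restricting a model to such a local subgraph is one way to limit the noise that a full KG would otherwise bring into each prediction.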
Affiliation(s)
- Yue Yu
- College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
| | - Kexin Huang
- Health Data Science, Harvard T.H. Chan School of Public Health, Boston, MA 02115, USA
| | - Chao Zhang
- College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, USA
| | - Lucas M Glass
- Analytic Center of Excellence, IQVIA, Cambridge, MA 02139, USA.,Department of Statistics, Temple University, Philadelphia, PA 19122, USA
| | - Jimeng Sun
- Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
| | - Cao Xiao
- Analytic Center of Excellence, IQVIA, Cambridge, MA 02139, USA
| |
|
48
|
Constructing public health evidence knowledge graph for decision-making support from COVID-19 literature of modelling study. JOURNAL OF SAFETY SCIENCE AND RESILIENCE 2021. [PMCID: PMC8361008 DOI: 10.1016/j.jnlssr.2021.08.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 05/03/2023]
Abstract
The need to mitigate the COVID-19 epidemic prompts policymakers to make public health decisions guided by science. The tremendous volume of unstructured COVID-19 publications makes it challenging for policymakers to obtain relevant evidence. Knowledge graphs (KGs) can formalize unstructured knowledge into a structured form and have recently been used to support decision-making. Here, we introduce a novel framework that can extract the COVID-19 public health evidence knowledge graph (CPHE-KG) from papers relating to modelling studies. We screen out a corpus of 3096 COVID-19 modelling study papers by performing a literature assessment process. We define a novel annotation schema to construct the COVID-19 modelling study-related IE dataset (CPHIE). We also propose a novel multi-task document-level information extraction model, SS-DYGIE++, based on the dataset. Applying the model to the new corpus, we construct CPHE-KG containing 60,967 entities and 51,140 relations. Finally, we apply our KG to support evidence querying and evidence mapping visualization. Our SS-DYGIE++(SpanBERT) model achieved F1 scores of 0.77 and 0.55 in the document-level entity recognition and coreference resolution tasks, respectively. It has also shown high performance in the relation identification task. With evidence querying, our KG can present the transmission dynamics of the COVID-19 pandemic in different countries and regions. The evidence mapping of our KG can show the impacts of various non-pharmacological interventions on the COVID-19 pandemic. Analysis demonstrates the quality of our KG and shows that it has the potential to support COVID-19 policy making in public health.
|
49
|
Yi HC, You ZH, Huang DS, Kwoh CK. Graph representation learning in bioinformatics: trends, methods and applications. Brief Bioinform 2021; 23:6361044. [PMID: 34471921 DOI: 10.1093/bib/bbab340] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 06/24/2021] [Revised: 07/18/2021] [Accepted: 08/02/2021] [Indexed: 12/12/2022]
Abstract
A graph is a natural data structure for describing complex systems; it contains a set of objects and the relationships among them. Ubiquitous real-life biomedical problems can be modeled as graph analytics tasks. Machine learning, especially deep learning, succeeds in vast bioinformatics scenarios with data represented in the Euclidean domain. However, rich relational information between biological elements is retained in non-Euclidean biomedical graphs, which are not amenable to classic machine learning methods. Graph representation learning aims to embed a graph into a low-dimensional space while preserving graph topology and node properties. It bridges biomedical graphs and modern machine learning methods and has recently attracted widespread interest in both the machine learning and bioinformatics communities. In this work, we summarize the advances of graph representation learning and its representative applications in bioinformatics. To provide a comprehensive and structured analysis and perspective, we first categorize and analyze both graph embedding methods (homogeneous graph embedding, heterogeneous graph embedding, attribute graph embedding) and graph neural networks. Furthermore, we summarize their representative applications from the molecular level to the genomics, pharmaceutical and healthcare systems levels. Moreover, we provide open resource platforms and libraries for implementing these graph representation learning methods and discuss the challenges and opportunities of graph representation learning in bioinformatics. This work provides a comprehensive survey of emerging graph representation learning algorithms and their applications in bioinformatics. It is anticipated that it could bring valuable insights for researchers to contribute their knowledge to graph representation learning and future-oriented bioinformatics studies.
Affiliation(s)
- Hai-Cheng Yi
- Chinese Academy of Sciences, Xinjiang Technical Institute of Physics and Chemistry, Urumqi 830011, China.,University of Chinese Academy of Sciences, Beijing 100049, China
| | - Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an 710129, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
| | - Chee Keong Kwoh
- School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore
| |
|
50
|
Prediction of Bladder Cancer Treatment Side Effects Using an Ontology-Based Reasoning for Enhanced Patient Health Safety. INFORMATICS 2021. [DOI: 10.3390/informatics8030055] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/16/2023]
Abstract
Predicting potential cancer treatment side effects at the time of prescription could decrease potential health risks and achieve better patient satisfaction. This paper presents a new approach, founded on evidence-based medical knowledge, that uses as much information and proof as possible to help a computer program predict bladder cancer treatment side effects and support the oncologist’s decision. This will help in deciding treatment options for patients with bladder malignancies. Bladder cancer knowledge is complex and requires simplification before any attempt to represent it in a formal or computerized manner. In this work we rely on the capabilities of OWL ontologies to seamlessly capture and conceptualize the required knowledge about this type of cancer and the underlying patient treatment process. Our ontology allows case-based reasoning to effectively predict treatment side effects for a given set of contextual information related to a specific medical case. The ontology is enriched with proofs and evidence collected from online biomedical research databases using “web crawlers”. We designed the crawler algorithm specifically to search for the required knowledge based on a set of specified keywords. When the approach was applied, it predicted 80.3% of the actually reported bladder cancer treatment side effects, and its predictions were close to the adverse events recorded in the collected test samples. Evidence-based medicine combined with semantic knowledge-based models is prominent in generating predictions related to possible health concerns. The integration of a diversity of knowledge and evidence into a single knowledge base could dramatically enhance the process of predicting treatment risks and side effects in bladder cancer oncotherapy.
|