1. Soysal E, Roberts K. PheNormGPT: a framework for extraction and normalization of key medical findings. Database (Oxford) 2024; 2024:baae103. [PMID: 39444329 PMCID: PMC11498178 DOI: 10.1093/database/baae103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2024] [Revised: 07/31/2024] [Accepted: 08/27/2024] [Indexed: 10/25/2024]
Abstract
This manuscript presents PheNormGPT, a framework for extraction and normalization of key findings in clinical text. PheNormGPT relies on an innovative approach, leveraging large language models to extract key findings and phenotypic data in unstructured clinical text and map them to Human Phenotype Ontology concepts. It utilizes OpenAI's GPT-3.5 Turbo and GPT-4 models with fine-tuning and few-shot learning strategies, including a novel few-shot learning strategy for custom-tailored few-shot example selection per request. PheNormGPT was evaluated in the BioCreative VIII Track 3: Genetic Phenotype Extraction from Dysmorphology Physical Examination Entries shared task. PheNormGPT achieved an F1 score of 0.82 for standard matching and 0.72 for exact matching, securing first place for this shared task.
Affiliation(s)
- Ekin Soysal
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St #600, Houston, TX 77030, United States
- Kirk Roberts
  - McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, 7000 Fannin St #600, Houston, TX 77030, United States
2. Chen ZK, Wang XQ, Xiao LL, Sun JD, Mao MY, Zhang HB, Guan J. Construction and application of nasopharyngeal carcinoma-specific big data platform based on electronic health records. Am J Otolaryngol 2024; 45:104204. [PMID: 38181649 DOI: 10.1016/j.amjoto.2023.104204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2023] [Accepted: 12/13/2023] [Indexed: 01/07/2024]
Abstract
OBJECTIVE To establish a nasopharyngeal carcinoma-specific big data platform based on electronic health records (EHRs) to provide data support for real-world study of nasopharyngeal carcinoma. METHODS A multidisciplinary expert team was established for this project. Based on industry standards and practical feasibility, the team designed the nasopharyngeal carcinoma data element standards, comprising 14 modules and 640 fields. Data from patients diagnosed with nasopharyngeal carcinoma who visited Southern Hospital after 1999 were extracted from 15 EHR systems and were cleaned, structured, and standardized using information technologies such as machine learning and natural language processing. In addition, a series of measures such as quality control and data encryption were taken to ensure data quality and patient privacy. At the platform application level, 10 functional modules were designed according to the needs of nasopharyngeal carcinoma research. RESULTS As of 1 October 2022, the big data platform included 11,617 patients, of whom 8228 (70.83%) were male and 3389 (29.17%) were female, with a median age of 48 years (interquartile range, 40 years). The data in the platform were validated to have a high level of completeness and accuracy, especially for key variables such as social demographics, laboratory tests and vital signs. Currently, six projects involving risk factors, early diagnosis, treatment efficacy and prevention of treatment-related toxic reactions have been conducted on the platform. CONCLUSIONS We have established a high-quality NPC-specific big data platform by integrating heterogeneous data from multiple sources in the EHR. The platform provides an effective tool and strong data support for real-world studies of nasopharyngeal carcinoma, which helps to improve research efficiency, reduce costs, and improve the quality of research results. We expect to promote multicenter nasopharyngeal carcinoma data sharing in the future to facilitate the generation of high-quality real-world evidence in nasopharyngeal carcinoma. This article may provide some reference value for other comprehensive hospitals to establish a big data platform for nasopharyngeal carcinoma.
Affiliation(s)
- Ze-Kai Chen
  - Department of Radiation Oncology, Nanfang Hospital, Southern Medical University, Guangzhou, China
- Xiao-Qing Wang
  - Department of Radiation Oncology, Nanfang Hospital, Southern Medical University, Guangzhou, China
- Lin-Lin Xiao
  - Department of Radiation Oncology, Nanfang Hospital, Southern Medical University, Guangzhou, China
- Jian-Da Sun
  - Department of Radiation Oncology, Nanfang Hospital, Southern Medical University, Guangzhou, China; Department of Radiation Oncology, Meizhou People's Hospital, Meizhou, Guangdong, China
- Meng-Yuan Mao
  - Department of Radiation Oncology, Nanfang Hospital, Southern Medical University, Guangzhou, China
- Han-Bin Zhang
  - Department of Radiation Oncology, Nanfang Hospital, Southern Medical University, Guangzhou, China
- Jian Guan
  - Department of Radiation Oncology, Nanfang Hospital, Southern Medical University, Guangzhou, China; Guangdong Province Key Laboratory of Molecular Tumor Pathology, Guangzhou, China
3. Zheng F, Abeysinghe R, Cui L. Identification of missing concepts in biomedical terminologies using sequence-based formal concept analysis. BMC Med Inform Decis Mak 2021; 21:234. [PMID: 34753458 PMCID: PMC8579614 DOI: 10.1186/s12911-021-01592-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/19/2021] [Accepted: 07/21/2021] [Indexed: 11/15/2022]
Abstract
Background As biomedical knowledge is rapidly evolving, concept enrichment of biomedical terminologies is an active research area involving automatic identification of missing or new concepts. Previously, we prototyped a lexical-based formal concept analysis (FCA) approach in which concepts were derived by intersecting bags of words, to identify potentially missing concepts in the National Cancer Institute (NCI) Thesaurus. However, this prototype did not handle concept naming and positioning. In this paper, we introduce a sequence-based FCA approach to identify potentially missing concepts, supporting concept naming and positioning. Methods We consider the concept name sequences as FCA attributes to construct the formal context. The concept-forming process is performed by computing the longest common substrings of concept name sequences. After new concepts are formalized, we further predict their potential positions in the original hierarchy by identifying their supertypes and subtypes from original concepts. Automated validation via external terminologies in the Unified Medical Language System (UMLS) and biomedical literature in PubMed is performed to evaluate the effectiveness of our approach. Results We applied our sequence-based FCA approach to all the sub-hierarchies under Disease or Disorder in the NCI Thesaurus (19.08d version) and five sub-hierarchies under Clinical Finding and Procedure in the SNOMED CT (US Edition, March 2020 release). In total, 1397 potentially missing concepts were identified in the NCI Thesaurus and 7223 in the SNOMED CT. For NCI Thesaurus, 85 potentially missing concepts were found in external terminologies and 315 of the remaining 1312 appeared in biomedical literature. For SNOMED CT, 576 were found in external terminologies and 1159 out of the remaining 6647 were found in biomedical literature. Conclusion Our sequence-based FCA approach shows promise for identifying potentially missing concepts in biomedical terminologies.
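The concept-forming step described in this abstract (longest common substrings of concept name sequences) can be illustrated with a minimal Python sketch. It assumes that a "sequence" is the lowercased, whitespace-split word list of a concept name; the function name `longest_common_token_run` and the sample names are hypothetical and are not taken from the paper, whose actual tokenization and FCA machinery are not described here.

```python
from typing import List


def longest_common_token_run(name_a: str, name_b: str) -> List[str]:
    """Return the longest contiguous run of words shared by two concept names.

    Assumption: names are compared as lowercased, whitespace-split word
    sequences. This is an illustrative simplification, not the authors' code.
    """
    x, y = name_a.lower().split(), name_b.lower().split()
    best_len, best_end = 0, 0
    # prev[j] holds the length of the common run ending at x[i-2], y[j-1].
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return x[best_end - best_len:best_end]


# Hypothetical concept names used only for illustration.
candidate = longest_common_token_run(
    "Recurrent Breast Carcinoma", "Metastatic Breast Carcinoma"
)
print(" ".join(candidate))
```

In this sketch the shared run becomes the name of a candidate concept; checking whether that name already exists in the terminology, and where it belongs in the hierarchy, would be separate steps.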
Affiliation(s)
- Fengbo Zheng
  - Department of Computer Science, University of Kentucky, Lexington, KY, USA; School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
- Rashmie Abeysinghe
  - Department of Neurology, McGovern School of Medicine, University of Texas Health Science Center at Houston, Houston, TX, USA
- Licong Cui
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, USA
4. Jing X. The Unified Medical Language System at 30 Years and How It Is Used and Published: Systematic Review and Content Analysis. JMIR Med Inform 2021; 9:e20675. [PMID: 34236337 PMCID: PMC8433943 DOI: 10.2196/20675] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/25/2020] [Revised: 11/25/2020] [Accepted: 07/02/2021] [Indexed: 01/22/2023]
Abstract
BACKGROUND The Unified Medical Language System (UMLS) has been a critical tool in biomedical and health informatics, and the year 2021 marks its 30th anniversary. The UMLS brings together many broadly used vocabularies and standards in the biomedical field to facilitate interoperability among different computer systems and applications. OBJECTIVE Despite its longevity, there is no comprehensive publication analysis of the use of the UMLS. Thus, this review and analysis is conducted to provide an overview of the UMLS and its use in English-language peer-reviewed publications, with the objective of providing a comprehensive understanding of how the UMLS has been used in English-language peer-reviewed publications over the last 30 years. METHODS PubMed, ACM Digital Library, and the Nursing & Allied Health Database were used to search for studies. The primary search strategy was as follows: UMLS was used as a Medical Subject Headings term or a keyword or appeared in the title or abstract. Only English-language publications were considered. The publications were screened first, then coded and categorized iteratively, following the grounded theory. The review process followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. RESULTS A total of 943 publications were included in the final analysis. Moreover, 32 publications were categorized into 2 categories; hence the total number of publications before duplicates are removed is 975. 
After analysis and categorization of the publications, UMLS was found to be used in the following emerging themes or areas (the number of publications and their respective percentages are given in parentheses): natural language processing (230/975, 23.6%), information retrieval (125/975, 12.8%), terminology study (90/975, 9.2%), ontology and modeling (80/975, 8.2%), medical subdomains (76/975, 7.8%), other language studies (53/975, 5.4%), artificial intelligence tools and applications (46/975, 4.7%), patient care (35/975, 3.6%), data mining and knowledge discovery (25/975, 2.6%), medical education (20/975, 2.1%), degree-related theses (13/975, 1.3%), digital library (5/975, 0.5%), and the UMLS itself (150/975, 15.4%), as well as the UMLS for other purposes (27/975, 2.8%). CONCLUSIONS The UMLS has been used successfully in patient care, medical education, digital libraries, and software development, as originally planned, as well as in degree-related theses, the building of artificial intelligence tools, data mining and knowledge discovery, foundational work in methodology, and middle layers that may lead to advanced products. Natural language processing, the UMLS itself, and information retrieval are the 3 most common themes that emerged among the included publications. The results, although largely related to academia, demonstrate that UMLS achieves its intended uses successfully, in addition to achieving uses broadly beyond its original intentions.
Affiliation(s)
- Xia Jing
  - Department of Public Health Sciences, College of Behavioral, Social and Health Sciences, Clemson University, Clemson, SC, United States
5. Chandra A, Philips ST, Pandey A, Basit M, Kannan V, Sara EJ, Das SR, Lee SJC, Haley B, Willett DL, Zaha VG. Electronic Health Records-Based Cardio-Oncology Registry for Care Gap Identification and Pragmatic Research: Procedure and Observational Study. JMIR Cardio 2021; 5:e22296. [PMID: 33797396 PMCID: PMC8411429 DOI: 10.2196/22296] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/07/2020] [Revised: 11/23/2020] [Accepted: 03/12/2021] [Indexed: 11/13/2022]
Abstract
Background Professional society guidelines are emerging for cardiovascular care in cancer patients. However, it is not yet clear how effectively the cancer survivor population is screened and treated for cardiomyopathy in contemporary clinical practice. As electronic health records (EHRs) are now widely used in clinical practice, we tested the hypothesis that an EHR-based cardio-oncology registry can address these questions. Objective The aim of this study was to develop an EHR-based pragmatic cardio-oncology registry and, as proof of principle, to investigate care gaps in the cardiovascular care of cancer patients. Methods We generated a programmatically deidentified, real-time EHR-based cardio-oncology registry from all patients in our institutional Cancer Population Registry (N=8275, 2011-2017). We investigated: (1) left ventricular ejection fraction (LVEF) assessment before and after treatment with potentially cardiotoxic agents; and (2) guideline-directed medical therapy (GDMT) for left ventricular dysfunction (LVD), defined as LVEF<50%, and symptomatic heart failure with reduced LVEF (HFrEF), defined as LVEF<50% and Problem List documentation of systolic congestive heart failure or dilated cardiomyopathy. Results Rapid development of an EHR-based cardio-oncology registry was feasible. Identification of tests and outcomes was similar using the EHR-based cardio-oncology registry and manual chart abstraction (100% sensitivity and 83% specificity for LVD). LVEF was documented prior to initiation of cancer therapy in 19.8% of patients. Prevalence of postchemotherapy LVD and HFrEF was relatively low (9.4% and 2.5%, respectively). Among patients with postchemotherapy LVD or HFrEF, those referred to cardiology had a significantly higher prescription rate of a GDMT. Conclusions EHR data can efficiently populate a real-time, pragmatic cardio-oncology registry as a byproduct of clinical care for health care delivery investigations.
Affiliation(s)
- Alvin Chandra
  - Cardiology Division, Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, United States; Harold C Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, United States; Parkland Health & Hospital System, Dallas, TX, United States
- Steven T Philips
  - Cardiology Division, Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, United States; Parkland Health & Hospital System, Dallas, TX, United States
- Ambarish Pandey
  - Cardiology Division, Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, United States; Parkland Health & Hospital System, Dallas, TX, United States
- Mujeeb Basit
  - Cardiology Division, Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, United States; Parkland Health & Hospital System, Dallas, TX, United States; Clinical Informatics Center, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Vaishnavi Kannan
  - Clinical Informatics Center, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Evan J Sara
  - Clinical Informatics Center, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Sandeep R Das
  - Cardiology Division, Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, United States; Parkland Health & Hospital System, Dallas, TX, United States
- Simon J C Lee
  - Harold C Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, United States; Department of Population and Data Sciences, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Barbara Haley
  - Harold C Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, United States; Hematology and Oncology Division, Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, United States
- DuWayne L Willett
  - Cardiology Division, Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, United States; Parkland Health & Hospital System, Dallas, TX, United States; Clinical Informatics Center, University of Texas Southwestern Medical Center, Dallas, TX, United States
- Vlad G Zaha
  - Cardiology Division, Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, TX, United States; Harold C Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, United States; Parkland Health & Hospital System, Dallas, TX, United States
6. Reimer AP, Milinovich A. Using UMLS for electronic health data standardization and database design. J Am Med Inform Assoc 2021; 27:1520-1528. [PMID: 32940707 DOI: 10.1093/jamia/ocaa176] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 02/10/2020] [Revised: 07/08/2020] [Accepted: 07/21/2020] [Indexed: 02/06/2023]
Abstract
OBJECTIVE Patients who undergo medical transfer represent a patient population that remains infrequently studied due to challenges in aggregating data across the multiple domains and sources that are necessary to capture the entire episode of patient care. To facilitate access to and secondary use of transport patient data, we developed the Transport Data Repository, which combines data from 3 separate domains and many sources within our health system. METHODS The repository is a relational database anchored by the Unified Medical Language System unique concept identifiers to integrate, map, and standardize the data into a common data model. Primary data domains included sending and receiving hospital encounters, medical transport record, and custom hospital transport log data. A 4-step mapping process was developed: 1) automatic source code match, 2) exact text match, 3) fuzzy matching, and 4) manual matching. RESULTS 431 090 total mappings were generated in the Transport Data Repository, consisting of 69 010 unique concepts with 77% of the data being mapped automatically. Transport Source Data yielded significantly lower mapping results with only 8% of data entities automatically mapped and a significant amount (43%) remaining unmapped. DISCUSSION The multistep mapping process resulted in a majority of data being automatically mapped. Poor matching of transport medical record data is due to the third-party vendor data being generated and stored in a nonstandardized format. CONCLUSION The multistep mapping process developed and implemented is necessary to normalize electronic health data from multiple domains and sources into a common data model to support secondary use of data.
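The 4-step mapping cascade named in this abstract can be sketched as follows. This is a hedged illustration only: the lookup tables, the placeholder identifiers (`CUI_A`, `CUI_B`), the function name `map_term`, and the 0.85 similarity cutoff are all assumptions made for the example, and Python's standard `difflib` stands in for whatever fuzzy matcher the authors used.

```python
import difflib
from typing import Dict, Optional, Tuple


def map_term(
    source_code: Optional[str],
    text: str,
    code_to_cui: Dict[str, str],
    name_to_cui: Dict[str, str],
    cutoff: float = 0.85,  # assumed threshold, not from the paper
) -> Tuple[Optional[str], str]:
    """Try each mapping step in order and report which one succeeded."""
    # Step 1: automatic source code match.
    if source_code is not None and source_code in code_to_cui:
        return code_to_cui[source_code], "source_code"
    # Step 2: exact text match on a normalized string.
    key = text.strip().lower()
    if key in name_to_cui:
        return name_to_cui[key], "exact_text"
    # Step 3: fuzzy match against known concept names.
    close = difflib.get_close_matches(key, list(name_to_cui), n=1, cutoff=cutoff)
    if close:
        return name_to_cui[close[0]], "fuzzy"
    # Step 4: nothing matched, so queue the item for manual review.
    return None, "manual"


# Placeholder lookup tables; real concept identifiers would come from the UMLS.
codes = {"ICD10:I10": "CUI_A"}
names = {"hypertension": "CUI_A", "atrial fibrillation": "CUI_B"}

print(map_term("ICD10:I10", "HTN", codes, names))
print(map_term(None, "atrial fibrilation", codes, names))
print(map_term(None, "xyzzy", codes, names))
```

Returning the step label alongside the identifier makes it straightforward to tally how much of the data was mapped automatically versus manually, which is the kind of breakdown the abstract reports.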
Affiliation(s)
- Andrew P Reimer
  - Frances Payne Bolton School of Nursing, Case Western Reserve University, Cleveland, Ohio, USA; Critical Care Transport, Cleveland Clinic, Cleveland, Ohio, USA
- Alex Milinovich
  - Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, Ohio, USA
7. Nguyen T, Zhang T, Fox G, Zeng S, Cao N, Pan C, Chen JY. Linking clinotypes to phenotypes and genotypes from laboratory test results in comprehensive physical exams. BMC Med Inform Decis Mak 2021; 21:51. [PMID: 33627109 PMCID: PMC7903607 DOI: 10.1186/s12911-021-01387-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/11/2020] [Accepted: 01/06/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND In this work, we aimed to demonstrate how to utilize lab test results and other clinical information to support precision medicine research and clinical decisions on complex diseases, with the support of electronic medical record facilities. We defined "clinotypes" as clinical information that could be observed and measured objectively using biomedical instruments. From well-known 'omic' problem definitions, we defined problems using clinotype information, including stratifying patients (identifying subcohorts of interest for future studies), mining significant associations between clinotypes and specific phenotypes (diseases), and discovering potential linkages between clinotype and genomic information. We solved these problems by integrating public omic databases and applying advanced machine learning and visual analytic techniques to two-year health exam records from a large population of healthy southern Chinese individuals (size n = 91,354). When developing the solution, we carefully addressed the missing information, imbalance and non-uniform data annotation issues. RESULTS We organized the techniques and solutions to address the problems and issues above into the CPA (Clinotype Prediction and Association-finding) framework. At the data preprocessing step, we handled the missing value issue with a prediction accuracy of 0.760. We curated 12,635 clinotype-gene associations. We found 147 associations between 147 chronic disease phenotypes and clinotypes, which improved the disease predictive performance to an average AUC of 0.967. We mined 182 significant clinotype-clinotype associations among 69 clinotypes. CONCLUSIONS Our results showed strong potential connectivity between the omics information and the clinical lab test information. The results further emphasized the need to utilize and integrate clinical information, especially lab test results, in future PheWAS and omic studies. Furthermore, it showed that the clinotype information could initiate an alternative research direction and serve as an independent field of data to support the well-known 'phenome' and 'genome' research.
Affiliation(s)
- Thanh Nguyen
  - Informatics Institute, School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
- Tongbin Zhang
  - School of First Clinical Medical Sciences - School of Information and Engineering, Wenzhou Medical University, Zhejiang, China
  - Department of Computer Technology and Information Management, The First Affiliated Hospital of Wenzhou Medical University, Zhejiang, China
- Geoffrey Fox
  - School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
- Sisi Zeng
  - School of First Clinical Medical Sciences - School of Information and Engineering, Wenzhou Medical University, Zhejiang, China
- Ni Cao
  - School of First Clinical Medical Sciences - School of Information and Engineering, Wenzhou Medical University, Zhejiang, China
- Chuandi Pan
  - School of First Clinical Medical Sciences - School of Information and Engineering, Wenzhou Medical University, Zhejiang, China
  - Department of Computer Technology and Information Management, The First Affiliated Hospital of Wenzhou Medical University, Zhejiang, China
- Jake Y Chen
  - Informatics Institute, School of Medicine, The University of Alabama at Birmingham, Birmingham, AL, USA
8. Zheng F, Shi J, Yang Y, Zheng WJ, Cui L. A transformation-based method for auditing the IS-A hierarchy of biomedical terminologies in the Unified Medical Language System. J Am Med Inform Assoc 2020; 27:1568-1575. [PMID: 32918476 PMCID: PMC7566369 DOI: 10.1093/jamia/ocaa123] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 02/10/2020] [Revised: 05/09/2020] [Accepted: 05/20/2020] [Indexed: 01/06/2023]
Abstract
OBJECTIVE The Unified Medical Language System (UMLS) integrates various source terminologies to support interoperability between biomedical information systems. In this article, we introduce a novel transformation-based auditing method that leverages the UMLS knowledge to systematically identify missing hierarchical IS-A relations in the source terminologies. MATERIALS AND METHODS Given a concept name in the UMLS, we first identify its base and secondary noun chunks. For each identified noun chunk, we generate replacement candidates that are more general than the noun chunk. Then, we replace the noun chunks with their replacement candidates to generate new potential concept names that may serve as supertypes of the original concept. If a newly generated name is an existing concept name in the same source terminology with the original concept, then a potentially missing IS-A relation between the original and the new concept is identified. RESULTS Applying our transformation-based method to English-language concept names in the UMLS (2019AB release), a total of 39 359 potentially missing IS-A relations were detected in 13 source terminologies. Domain experts evaluated a random sample of 200 potentially missing IS-A relations identified in the SNOMED CT (U.S. edition) and 100 in Gene Ontology. A total of 173 of 200 and 63 of 100 potentially missing IS-A relations were confirmed by domain experts, indicating that our method achieved a precision of 86.5% and 63% for the SNOMED CT and Gene Ontology, respectively. CONCLUSIONS Our results showed that our transformation-based method is effective in identifying missing IS-A relations in the UMLS source terminologies.
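The replace-and-look-up idea in this abstract can be sketched in a few lines of Python. The sketch assumes that noun chunks and their more general replacement candidates are already available as a plain dictionary (the paper derives them from UMLS knowledge and noun chunking, which is not reproduced here), and it uses simple substring replacement. The function name `suggest_missing_isa` and all sample names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def suggest_missing_isa(
    concept_name: str,
    generalizations: Dict[str, List[str]],  # assumed input: chunk -> broader terms
    terminology_names: Set[str],
    existing_isa: Set[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Propose (child, parent) pairs that may be missing IS-A relations."""
    suggestions = []
    for chunk, broader_terms in generalizations.items():
        if chunk not in concept_name:
            continue
        for broader in broader_terms:
            # Swap the chunk for a more general term to form a candidate name.
            candidate = concept_name.replace(chunk, broader)
            if candidate == concept_name:
                continue
            # Keep it only if it names an existing concept in the same
            # terminology and the relation is not already recorded.
            if candidate in terminology_names and (concept_name, candidate) not in existing_isa:
                suggestions.append((concept_name, candidate))
    return suggestions


# Hypothetical toy terminology used only for illustration.
toy_names = {"fracture of femur", "fracture of bone"}
toy_general = {"femur": ["bone of lower limb", "bone"]}

print(suggest_missing_isa("fracture of femur", toy_general, toy_names, set()))
```

Each suggestion is only a candidate; in the study described above, sampled candidates were reviewed by domain experts before being counted as confirmed.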
Affiliation(s)
- Fengbo Zheng
  - Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
- Jay Shi
  - Department of Internal Medicine, University of Kentucky, Lexington, Kentucky, USA
- Yuntao Yang
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- W Jim Zheng
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
- Licong Cui
  - School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
9. Electronic health records for the diagnosis of rare diseases. Kidney Int 2020; 97:676-686. [DOI: 10.1016/j.kint.2019.11.037] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/15/2019] [Revised: 11/15/2019] [Accepted: 11/22/2019] [Indexed: 01/13/2023]
10. Lin L, Liang W, Li CF, Huang XD, Lv JW, Peng H, Wang BY, Zhu BW, Sun Y. Development and implementation of a dynamically updated big data intelligence platform from electronic health records for nasopharyngeal carcinoma research. Br J Radiol 2019; 92:20190255. [PMID: 31430186 DOI: 10.1259/bjr.20190255] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Indexed: 12/25/2022]
Abstract
OBJECTIVE To develop a big data intelligence platform for secondary use of electronic health record (EHR) data to facilitate research on nasopharyngeal cancer (NPC). METHODS This project was launched in 2015 and carried out through a collaboration between an academic cancer centre and a technology company. Patients diagnosed with NPC at Sun Yat-sen University Cancer Centre since January 2008 were included in the platform. Standard data elements were established to define 981 variables for the platform. For each patient, data from 13 EHR systems were extracted, integrated, structured and normalized. Eight functional modules were constructed for the platform to help investigators identify eligible patients, establish research projects, conduct statistical analysis, track follow-up, search literature, etc. RESULTS From January 2008 to December 2018, 54,703 patients diagnosed with NPC were included. Of these patients, 39,058 (71.4%) were male, and 15,645 (28.6%) were female; median age was 47 (interquartile range, 39-55) years. Of 981 variables, 341 were obtained from data structuring and normalization, of which 68 were generated by combining multiple data sources via well-defined logical rules. The average precision rate, recall rate and F-measure for 341 variables were 0.97 ± 0.024, 0.92 ± 0.030, and 0.94 ± 0.027 respectively. The platform is updated every seven days to include new patients and add new data for existing patients. Up to now, eight big data-driven retrospective studies have been published from the platform. CONCLUSION Our big data intelligence platform demonstrates the feasibility of integrating EHR data from routine healthcare, and offers an important perspective on real-world study of NPC. Continued efforts may focus on data sharing among multiple hospitals and public release of data files. ADVANCES IN KNOWLEDGE Our big data intelligence platform is the first disease-specific data platform for NPC research. It incorporates comprehensive EHR data from routine healthcare, which can facilitate real-world study of NPC in risk stratification, decision-making and comorbidities management.
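As a small worked check on the extraction metrics quoted in this abstract, the sketch below computes the balanced F-measure as the harmonic mean of precision and recall. It assumes the reported F-measure is the standard F1; the function name `f_measure` is illustrative. Note also that the abstract reports averages over 341 variables, and an average of per-variable F-measures need not equal the F-measure of the averaged precision and recall, so this is a consistency check rather than a reproduction.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both are zero
    return 2 * precision * recall / (precision + recall)


# Averages quoted in the abstract: precision 0.97, recall 0.92.
print(round(f_measure(0.97, 0.92), 2))
```

The result rounds to the same two-decimal value as the averaged F-measure given in the abstract.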
Affiliation(s)
- Li Lin
  - Department of Radiation Oncology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine; Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangzhou, China
- Wei Liang
  - YiduCloud Technology Ltd, Beijing, China
- Chao-Feng Li
  - Department of Information Technology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine; Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangzhou, China
- Xiao-Dan Huang
  - Department of Radiation Oncology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine; Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangzhou, China
- Jia-Wei Lv
  - Department of Radiation Oncology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine; Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangzhou, China
- Hao Peng
  - Department of Radiation Oncology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine; Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangzhou, China
- Bo-Wei Zhu
  - YiduCloud Technology Ltd, Beijing, China
- Ying Sun
  - Department of Radiation Oncology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine; Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Guangzhou, China
11. Chen D, Zhang R, Feng J, Liu K. Fulfilling information needs of patients in online health communities. Health Info Libr J 2019; 37:48-59. [PMID: 31090185 DOI: 10.1111/hir.12253] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 03/23/2018] [Accepted: 01/28/2019] [Indexed: 12/21/2022]
Abstract
BACKGROUND Online health communities (OHCs) experience difficulties in utilising patient reported posts to fulfil the information needs of online patients concerning health related issues. OBJECTIVES We aim to propose a comprehensive method that leverages medical domain knowledge to extract useful information from posts to fulfil information needs of online patients. METHODS A knowledge representation framework based on authoritative knowledge sources in the medical field for the OHC is proposed. On the basis of the framework, a health related information extraction process for analysing the posts in the OHC is proposed. Then, knowledge support rate (KSR) and effective information rate (EIR) are introduced as metrics to evaluate changes in knowledge extracted from the knowledge sources in terms of fulfilling the information needs of patients in the OHC. RESULTS On the basis of a data set with 372 343 posts in an OHC, experimental results indicate that our method effectively extracts relevant knowledge for online patients. Moreover, KSR and EIR are feasible metrics of changes in knowledge in terms of fulfilling the information needs. CONCLUSIONS The OHCs effectively fulfil the information needs of patients by utilising authoritative domain knowledge in the medical field. Knowledge based services for online patients facilitate an intelligent OHC in the future.
Affiliation(s)
- Donghua Chen
- Department of Information Management, School of Economics and Management, Beijing Jiaotong University, Beijing, China
| | - Runtong Zhang
- Department of Information Management, School of Economics and Management, Beijing Jiaotong University, Beijing, China
| | - Jiayi Feng
- Department of Information Management, School of Economics and Management, Beijing Jiaotong University, Beijing, China
| | - Kecheng Liu
- Informatics Research Centre, Henley Business School, University of Reading, Reading, UK
| |
|
12
|
Beeksma M, Verberne S, van den Bosch A, Das E, Hendrickx I, Groenewoud S. Predicting life expectancy with a long short-term memory recurrent neural network using electronic medical records. BMC Med Inform Decis Mak 2019; 19:36. [PMID: 30819172 PMCID: PMC6394008 DOI: 10.1186/s12911-019-0775-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 02/13/2018] [Accepted: 02/18/2019] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND Life expectancy is one of the most important factors in end-of-life decision making. Good prognostication for example helps to determine the course of treatment and helps to anticipate the procurement of health care services and facilities, or more broadly: facilitates Advance Care Planning. Advance Care Planning improves the quality of the final phase of life by stimulating doctors to explore the preferences for end-of-life care with their patients, and people close to the patients. Physicians, however, tend to overestimate life expectancy, and miss the window of opportunity to initiate Advance Care Planning. This research tests the potential of using machine learning and natural language processing techniques for predicting life expectancy from electronic medical records. METHODS We approached the task of predicting life expectancy as a supervised machine learning task. We trained and tested a long short-term memory recurrent neural network on the medical records of deceased patients. We developed the model with a ten-fold cross-validation procedure, and evaluated its performance on a held-out set of test data. We compared the performance of a model which does not use text features (baseline model) to the performance of a model which uses features extracted from the free texts of the medical records (keyword model), and to doctors' performance on a similar task as described in scientific literature. RESULTS Both doctors and the baseline model were correct in 20% of the cases, taking a margin of 33% around the actual life expectancy as the target. The keyword model, in comparison, attained an accuracy of 29% with its prognoses. While doctors overestimated life expectancy in 63% of the incorrect prognoses, which harms anticipation to appropriate end-of-life care, the keyword model overestimated life expectancy in only 31% of the incorrect prognoses. CONCLUSIONS Prognostication of life expectancy is difficult for humans. Our research shows that machine learning and natural language processing techniques offer a feasible and promising approach to predicting life expectancy. The research has potential for real-life applications, such as supporting timely recognition of the right moment to start Advance Care Planning.
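The accuracy figures in this abstract depend on what counts as a correct prognosis. The following minimal Python sketch assumes that "a margin of 33% around the actual life expectancy" means a symmetric relative tolerance; that reading, the function names, and the sample data are illustrative assumptions, not taken from the paper.

```python
def within_margin(predicted: float, actual: float, margin: float = 0.33) -> bool:
    """True when the prediction lies within +/- margin (relative) of the actual value.

    Assumption: the margin is symmetric and relative to the actual life expectancy.
    """
    return abs(predicted - actual) <= margin * actual


def summarize(pairs, margin: float = 0.33) -> dict:
    """Compute accuracy and the share of incorrect prognoses that overestimate."""
    wrong = [(p, a) for p, a in pairs if not within_margin(p, a, margin)]
    overestimates = sum(p > a for p, a in wrong)
    return {
        "accuracy": (len(pairs) - len(wrong)) / len(pairs),
        "overestimated_share_of_errors": overestimates / len(wrong) if wrong else 0.0,
    }


# Made-up (predicted, actual) life expectancies in months, for illustration only.
pairs = [(10, 12), (30, 12), (5, 12), (12, 12)]
result = summarize(pairs)
print(result)
```

Reporting the overestimation share separately from accuracy mirrors the abstract's two comparisons: how often a prognosis is correct, and in which direction the incorrect ones err.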
Affiliation(s)
- Merijn Beeksma
- Centre for Language Studies, Radboud University, Erasmusplein 1, 6525 HT Nijmegen, The Netherlands
| | - Suzan Verberne
- Leiden Institute for Advanced Computer Sciences, Leiden University, Niels Bohrweg 1, 2333 CA Leiden, The Netherlands
| | - Antal van den Bosch
- KNAW Meertens Institute, Oudezijds Achterburgwal 185, 1012 DK Amsterdam, The Netherlands
| | - Enny Das
- Centre for Language Studies, Radboud University, Erasmusplein 1, 6525 HT Nijmegen, The Netherlands
| | - Iris Hendrickx
- Centre for Language Studies, Radboud University, Erasmusplein 1, 6525 HT Nijmegen, The Netherlands
| | - Stef Groenewoud
- IQ Healthcare, Radboudumc, Mailbox 9101, 6500 HB Nijmegen, The Netherlands
| |
|
13
|
Chu L, Kannan V, Basit MA, Schaeflein DJ, Ortuzar AR, Glorioso JF, Buchanan JR, Willett DL. SNOMED CT Concept Hierarchies for Computable Clinical Phenotypes From Electronic Health Record Data: Comparison of Intensional Versus Extensional Value Sets. JMIR Med Inform 2019; 7:e11487. [PMID: 30664458 PMCID: PMC6351992 DOI: 10.2196/11487] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 07/15/2018] [Revised: 11/23/2018] [Accepted: 12/09/2018] [Indexed: 01/19/2023] Open
Abstract
Background Defining clinical phenotypes from electronic health record (EHR)–derived data proves crucial for clinical decision support, population health endeavors, and translational research. EHR diagnoses now commonly draw from a finely grained clinical terminology, either native SNOMED CT or a vendor-supplied terminology mapped to SNOMED CT concepts as the standard for EHR interoperability. Accordingly, electronic clinical quality measures (eCQMs) increasingly define clinical phenotypes with SNOMED CT value sets. The work of creating and maintaining list-based value sets proves daunting, as does ensuring that their contents accurately represent the clinically intended condition. Objective The goal of the research was to compare an intensional (concept hierarchy-based) versus extensional (list-based) value set approach to defining clinical phenotypes using SNOMED CT–encoded data from EHRs by evaluating value set conciseness, time to create, and completeness. Methods Starting from published Centers for Medicare and Medicaid Services (CMS) high-priority eCQMs, we selected 10 clinical conditions referenced by those eCQMs. For each, the published SNOMED CT list-based (extensional) value set was downloaded from the Value Set Authority Center (VSAC). Ten corresponding SNOMED CT hierarchy-based intensional value sets for the same conditions were identified within our EHR. From each hierarchy-based intensional value set, an exactly equivalent full extensional value set was derived enumerating all included descendant SNOMED CT concepts. Comparisons were then made between (1) VSAC-downloaded list-based (extensional) value sets, (2) corresponding hierarchy-based intensional value sets for the same conditions, and (3) derived list-based (extensional) value sets exactly equivalent to the hierarchy-based intensional value sets. Value set conciseness was assessed by the number of SNOMED CT concepts needed for definition. Time to construct the value sets for local use was measured. Value set completeness was assessed by comparing contents of the downloaded extensional versus intensional value sets. Two measures of content completeness were made: for individual SNOMED CT concepts and for the mapped diagnosis clinical terms available for selection within the EHR by clinicians. Results The 10 hierarchy-based intensional value sets proved far simpler and faster to construct than exactly equivalent derived extensional value set lists, requiring a median 3 versus 78 concepts to define and 5 versus 37 minutes to build. The hierarchy-based intensional value sets also proved more complete: in comparison, the 10 downloaded 2018 extensional value sets contained a median of just 35% of the intensional value sets’ SNOMED CT concepts and 65% of mapped EHR clinical terms. Conclusions In the EHR era, defining conditions preferentially should employ SNOMED CT concept hierarchy-based (intensional) value sets rather than extensional lists. By doing so, clinical guideline and eCQM authors can more readily engage specialists in vetting condition subtypes to include and exclude, and streamline broad EHR implementation of condition-specific decision support promoting guideline adherence for patient benefit.
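The distinction between intensional and extensional value sets can be illustrated with a small Python sketch. It is a sketch under stated assumptions only: the hierarchy, the concept identifiers (which are made up and are not real SNOMED CT codes), and the function name `expand` are hypothetical, and the coverage calculation is one plausible reading of the concept-level completeness measure described above.

```python
from collections import deque

# Toy parent -> children map. Identifiers are invented, not real SNOMED CT codes.
CHILDREN = {
    "diabetes": ["type1", "type2"],
    "type2": ["type2_with_nephropathy", "type2_with_retinopathy"],
}


def expand(roots, children=CHILDREN):
    """Derive the extensional list: every root concept plus all of its descendants."""
    seen = set()
    queue = deque(roots)
    while queue:
        concept = queue.popleft()
        if concept in seen:
            continue
        seen.add(concept)
        queue.extend(children.get(concept, []))
    return seen


intensional = ["diabetes"]            # concise, hierarchy-based definition
extensional = expand(intensional)     # equivalent enumerated list of 5 concepts

# Hypothetical downloaded list-based value set that misses some descendants.
downloaded = {"diabetes", "type1"}
coverage = len(downloaded & extensional) / len(extensional)
print(sorted(extensional), coverage)
```

In this toy case one root concept stands in for five enumerated concepts, which is the kind of conciseness gap the abstract quantifies with its median of 3 versus 78 concepts.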
Affiliation(s)
- Ling Chu
- University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - Vaishnavi Kannan
- University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - Mujeeb A Basit
- University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - Diane J Schaeflein
- University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - Adolfo R Ortuzar
- University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - Jimmie F Glorioso
- University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - Joel R Buchanan
- University of Wisconsin School of Medicine and Public Health, Madison, WI, United States
| | - Duwayne L Willett
- University of Texas Southwestern Medical Center, Dallas, TX, United States
| |
|
14
|
Willett DL, Kannan V, Chu L, Buchanan JR, Velasco FT, Clark JD, Fish JS, Ortuzar AR, Youngblood JE, Bhat DG, Basit MA. SNOMED CT Concept Hierarchies for Sharing Definitions of Clinical Conditions Using Electronic Health Record Data. Appl Clin Inform 2018; 9:667-682. [PMID: 30157499 PMCID: PMC6115233 DOI: 10.1055/s-0038-1668090] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.3] [Indexed: 01/12/2023] Open
Abstract
BACKGROUND Defining clinical conditions from electronic health record (EHR) data underpins population health activities, clinical decision support, and analytics. In an EHR, defining a condition commonly employs a diagnosis value set or "grouper." For constructing value sets, Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) offers high clinical fidelity, a hierarchical ontology, and wide implementation in EHRs as the standard interoperability vocabulary for problems. OBJECTIVE This article demonstrates a practical approach to defining conditions with combinations of SNOMED CT concept hierarchies, and evaluates sharing of definitions for clinical and analytic uses. METHODS We constructed diagnosis value sets for EHR patient registries using SNOMED CT concept hierarchies combined with Boolean logic, and shared them for clinical decision support, reporting, and analytic purposes. RESULTS A total of 125 condition-defining "standard" SNOMED CT diagnosis value sets were created within our EHR. The median number of SNOMED CT concept hierarchies needed was only 2 (25th-75th percentiles: 1-5). Each value set, when compiled as an EHR diagnosis grouper, was associated with a median of 22 International Classification of Diseases (ICD)-9 and ICD-10 codes (25th-75th percentiles: 8-85) and yielded a median of 155 clinical terms available for selection by clinicians in the EHR (25th-75th percentiles: 63-976). Sharing of standard groupers for population health, clinical decision support, and analytic uses was high, including 57 patient registries (with 362 uses of standard groupers), 132 clinical decision support records, 190 rules, 124 EHR reports, 125 diagnosis dimension slicers for self-service analytics, and 111 clinical quality measure calculations. Identical SNOMED CT definitions were created in an EHR-agnostic tool enabling application across disparate organizations and EHRs. 
CONCLUSION SNOMED CT-based diagnosis value sets are simple to develop, concise, understandable to clinicians, useful in the EHR and for analytics, and shareable. Developing curated SNOMED CT hierarchy-based condition definitions for public use could accelerate cross-organizational population health efforts, "smarter" EHR feature configuration, and clinical-translational research employing EHR-derived data.
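Combining concept hierarchies with Boolean logic, as this abstract describes, can be sketched as set operations over expanded hierarchies. The Python below is a hedged illustration only: the include/exclude structure is one simple form of Boolean combination, and the hierarchy, identifiers, and function names are hypothetical rather than the authors' tooling or real SNOMED CT codes.

```python
def descendants_and_self(root, children):
    """Return the root concept together with all of its descendants."""
    result = {root}
    for child in children.get(root, []):
        result |= descendants_and_self(child, children)
    return result


def compile_value_set(include, exclude, children):
    """(union of included hierarchies) AND NOT (union of excluded hierarchies)."""
    included = set().union(*(descendants_and_self(r, children) for r in include))
    excluded = set().union(*(descendants_and_self(r, children) for r in exclude))
    return included - excluded


# Toy hierarchy with invented identifiers, for illustration only.
children = {
    "hypertension": ["essential_htn", "secondary_htn", "pregnancy_htn"],
    "pregnancy_htn": ["preeclampsia"],
}

value_set = compile_value_set(
    include=["hypertension"],
    exclude=["pregnancy_htn"],
    children=children,
)
print(sorted(value_set))
```

Here two hierarchy roots define the whole condition, which is consistent in spirit with the reported median of only 2 concept hierarchies per value set.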
Affiliation(s)
- Duwayne L Willett
- Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, Texas, United States; Health System Information Resources Department, University of Texas Southwestern Medical Center, Dallas, Texas, United States
| | - Vaishnavi Kannan
- Health System Information Resources Department, University of Texas Southwestern Medical Center, Dallas, Texas, United States
| | - Ling Chu
- Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, Texas, United States; Health System Information Resources Department, University of Texas Southwestern Medical Center, Dallas, Texas, United States
| | - Joel R Buchanan
- Department of Medicine, University of Wisconsin School of Medicine and Public Health, University of Wisconsin-Madison, Madison, Wisconsin, United States
| | - Ferdinand T Velasco
- Texas Health Resources, Arlington, Texas, United States; Southwestern Health Resources, Dallas, Texas, United States
| | - John D Clark
- Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, Texas, United States
| | - Jason S Fish
- Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, Texas, United States; Southwestern Health Resources, Dallas, Texas, United States
| | - Adolfo R Ortuzar
- Health System Information Resources Department, University of Texas Southwestern Medical Center, Dallas, Texas, United States
| | - Josh E Youngblood
- Health System Information Resources Department, University of Texas Southwestern Medical Center, Dallas, Texas, United States
| | - Deepa G Bhat
- Southwestern Health Resources, Dallas, Texas, United States
| | - Mujeeb A Basit
- Department of Internal Medicine, University of Texas Southwestern Medical Center, Dallas, Texas, United States; Health System Information Resources Department, University of Texas Southwestern Medical Center, Dallas, Texas, United States
| |
|
15
|
Sendak MP, Balu S, Schulman KA. Barriers to Achieving Economies of Scale in Analysis of EHR Data. A Cautionary Tale. Appl Clin Inform 2017; 8:826-831. [PMID: 28837212 DOI: 10.4338/aci-2017-03-cr-0046] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.4] [Received: 03/22/2017] [Accepted: 06/15/2017] [Indexed: 01/13/2023] Open
Abstract
Signed in 2009, the Health Information Technology for Economic and Clinical Health Act infused $28 billion of federal funds to accelerate adoption of electronic health records (EHRs). Yet, EHRs have produced mixed results and have even raised concern that the current technology ecosystem stifles innovation. We describe the development process and report initial outcomes of a chronic kidney disease analytics application that identifies high-risk patients for nephrology referral. The cost to validate and integrate the analytics application into clinical workflow was $217,138. Despite the success of the program, redundant development and validation efforts will require $38.8 million to scale the application across all multihospital systems in the nation. We address the shortcomings of current technology investments and distill insights from the technology industry. To yield a return on technology investments, we propose policy changes that address the underlying issues now being imposed on the system by an ineffective technology business model.
Affiliation(s)
| | | | - Kevin A Schulman
- Kevin A. Schulman, MD, Duke Clinical Research Institute, PO Box 17969, Durham, NC 27715, Phone: 919-668-8101
| |
|
16
|
Mbagwu M, French DD, Gill M, Mitchell C, Jackson K, Kho A, Bryar PJ. Creation of an Accurate Algorithm to Detect Snellen Best Documented Visual Acuity from Ophthalmology Electronic Health Record Notes. JMIR Med Inform 2016; 4:e14. [PMID: 27146002 PMCID: PMC4871992 DOI: 10.2196/medinform.4732] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 05/20/2015] [Revised: 01/28/2016] [Accepted: 02/20/2016] [Indexed: 11/21/2022] Open
Abstract
Background Visual acuity is the primary measure used in ophthalmology to determine how well a patient can see. Visual acuity for a single eye may be recorded in multiple ways for a single patient visit (eg, Snellen vs. Jäger units vs. font print size), and be recorded for either distance or near vision. Capturing the best documented visual acuity (BDVA) of each eye in an individual patient visit is an important step for making electronic ophthalmology clinical notes useful in research. Objective Currently, there is limited methodology for capturing BDVA in an efficient and accurate manner from electronic health record (EHR) notes. We developed an algorithm to detect BDVA for right and left eyes from defined fields within electronic ophthalmology clinical notes. Methods We designed an algorithm to detect the BDVA from defined fields within 295,218 ophthalmology clinical notes with visual acuity data present. About 5668 unique responses were identified and an algorithm was developed to map all of the unique responses to a structured list of Snellen visual acuities. Results Visual acuity was captured from a total of 295,218 ophthalmology clinical notes during the study dates. The algorithm identified all visual acuities in the defined visual acuity section for each eye and returned a single BDVA for each eye. A clinician chart review of 100 random patient notes showed a 99% accuracy detecting BDVA from these records and 1% observed error. Conclusions Our algorithm successfully captures best documented Snellen distance visual acuity from ophthalmology clinical notes and transforms a variety of inputs into a structured Snellen equivalent list. Our work, to the best of our knowledge, represents the first attempt at capturing visual acuity accurately from large numbers of electronic ophthalmology notes. Use of this algorithm can benefit research groups interested in assessing visual acuity for patient centered outcome. All codes used for this study are currently available, and will be made available online at https://phekb.org.
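The idea of reducing several recorded acuities to a single best documented value per eye can be sketched in Python as follows. This is not the published algorithm, which maps thousands of unique responses to a structured Snellen list; the regular expression, the function names, and the rule of ignoring non-Snellen entries are simplifying assumptions made for illustration.

```python
import re

# Assumption: Snellen distance acuities are written as "20/<denominator>",
# possibly followed by letter-count modifiers such as "-2" or "+1".
SNELLEN = re.compile(r"20/(\d+)")


def to_denominator(response: str):
    """Extract the Snellen denominator from a free-text response, or None."""
    match = SNELLEN.search(response)
    return int(match.group(1)) if match else None


def best_documented(responses):
    """Return the best (smallest-denominator) Snellen acuity among the responses."""
    denominators = [d for d in map(to_denominator, responses) if d is not None]
    return f"20/{min(denominators)}" if denominators else None


# Made-up entries for one eye at one visit; the last one is not a Snellen value.
best = best_documented(["20/40-2", "20/25 +1", "CF at 3 ft"])
print(best)
```

A fuller treatment would convert non-Snellen notations (for example Jäger units or count-fingers entries) to Snellen equivalents instead of dropping them, as the abstract's "structured Snellen equivalent list" suggests.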
Affiliation(s)
- Michael Mbagwu
- Department of Ophthalmology, Northwestern University Feinberg School of Medicine, Chicago, IL, United States.
| | | | | | | | | | | | | |
|
17
|
Pan X, Cimino JJ. Identifying the Clinical Laboratory Tests from Unspecified "Other Lab Test" Data for Secondary Use. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2015; 2015:1018-1023. [PMID: 26958239 PMCID: PMC4765675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Clinical laboratory results are stored in electronic health records (EHRs) as structured data coded with local or standard terms. However, laboratory tests that are performed at outside laboratories are often simply labeled "outside test" or something similar, with the actual test name in a free-text result or comment field. After being aggregated into clinical data repositories, these ambiguous labels impede the retrieval of specific test results. We present a general multi-step solution that can facilitate the identification, standardization, reconciliation, and transformation of such test results. We applied our approach to data in the NIH Biomedical Translational Research Information System (BTRIS) to identify laboratory tests, map comment values to the LOINC codes that will be incorporated into our Research Entities Dictionary (RED), and develop a reference table that can be used in the EHR data extract-transform-load (ETL) process.
Affiliation(s)
- Xuequn Pan
- Lister Hill National Center for Biomedical Communications, National Library of Medicine; Laboratory for Informatics Development, NIH Clinical Center; Bethesda, MD
| | - James J Cimino
- Lister Hill National Center for Biomedical Communications, National Library of Medicine; Laboratory for Informatics Development, NIH Clinical Center; Bethesda, MD
| |
|