1
Keloth VK, Banda JM, Gurley M, Heider PM, Kennedy G, Liu H, Liu F, Miller T, Natarajan K, V Patterson O, Peng Y, Raja K, Reeves RM, Rouhizadeh M, Shi J, Wang X, Wang Y, Wei WQ, Williams AE, Zhang R, Belenkaya R, Reich C, Blacketer C, Ryan P, Hripcsak G, Elhadad N, Xu H. Representing and utilizing clinical textual data for real world studies: An OHDSI approach. J Biomed Inform 2023; 142:104343. [PMID: 36935011 PMCID: PMC10428170 DOI: 10.1016/j.jbi.2023.104343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/25/2022] [Revised: 01/21/2023] [Accepted: 03/13/2023] [Indexed: 03/19/2023]
Abstract
Clinical documentation in electronic health records contains crucial narratives and details about patients and their care. Natural language processing (NLP) can unlock the information conveyed in clinical notes and reports, and thus plays a critical role in real-world studies. The NLP Working Group at the Observational Health Data Sciences and Informatics (OHDSI) consortium was established to develop methods and tools to promote the use of textual data and NLP in real-world observational studies. In this paper, we describe a framework for representing and utilizing textual data in real-world evidence generation, including representations of information from clinical text in the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), the workflow and tools that were developed to extract, transform and load (ETL) data from clinical notes into tables in OMOP CDM, as well as current applications and specific use cases of the proposed OHDSI NLP solution at large consortia and individual institutions with English textual data. Challenges faced and lessons learned during the process are also discussed to provide valuable insights for researchers who are planning to implement NLP solutions in real-world studies.
Affiliation(s)
- Vipina K Keloth
  - Section of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Juan M Banda
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
- Michael Gurley
  - Lurie Cancer Center, Northwestern University, Chicago, Illinois, USA
- Paul M Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
- Georgina Kennedy
  - Ingham Institute for Applied Medical Research, Sydney, Australia
- Hongfang Liu
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA
- Feifan Liu
  - Department of Population and Quantitative Health Sciences, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Timothy Miller
  - Computational Health Informatics Program, Boston Children's Hospital, and Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Karthik Natarajan
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA
- Olga V Patterson
  - VA Informatics and Computing Infrastructure, Department of Veterans Affairs Salt Lake City Health Care System, Salt Lake City, Utah, USA; Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah, Salt Lake City, Utah, USA; Verily Life Sciences, Mountain View, CA, USA
- Yifan Peng
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Kalpana Raja
  - Section of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Ruth M Reeves
  - TN Valley Healthcare System, U.S. Department of Veterans Affairs, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Masoud Rouhizadeh
  - Department of Pharmaceutical Outcomes & Policy, University of Florida, Gainesville, FL, USA; Biomedical Informatics and Data Science, Johns Hopkins University, Baltimore, MD, USA
- Jianlin Shi
  - VA Informatics and Computing Infrastructure, Department of Veterans Affairs Salt Lake City Health Care System, Salt Lake City, Utah, USA; Division of Epidemiology, Department of Internal Medicine, School of Medicine, University of Utah, Salt Lake City, Utah, USA; Department of Biomedical Informatics, University of Utah, Salt Lake City, USA
- Xiaoyan Wang
  - Sema4 Mount Sinai Genomics Incorporation, Stamford, CT, USA
- Yanshan Wang
  - Department of Health Information Management, Department of Biomedical Informatics, and Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
- Wei-Qi Wei
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Rui Zhang
  - Institute for Health Informatics, and Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, USA
- Clair Blacketer
  - Janssen Pharmaceutical Research and Development LLC, Titusville, NJ, USA; Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, the Netherlands
- Patrick Ryan
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA; Janssen Pharmaceutical Research and Development LLC, Titusville, NJ, USA
- George Hripcsak
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA
- Noémie Elhadad
  - Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA.
- Hua Xu
  - Section of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA.
2
Meystre SM, Heider PM, Cates A, Bastian G, Pittman T, Gentilin S, Kelechi TJ. Piloting an automated clinical trial eligibility surveillance and provider alert system based on artificial intelligence and standard data models. BMC Med Res Methodol 2023; 23:88. [PMID: 37041475 PMCID: PMC10088225 DOI: 10.1186/s12874-023-01916-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2022] [Accepted: 04/04/2023] [Indexed: 04/13/2023] Open
Abstract
BACKGROUND To advance new therapies into clinical care, clinical trials must recruit enough participants. Yet, many trials fail to do so, leading to delays, early trial termination, and wasted resources. Under-enrolling trials make it impossible to draw conclusions about the efficacy of new therapies. An oft-cited reason for insufficient enrollment is lack of study team and provider awareness about patient eligibility. Automating clinical trial eligibility surveillance and study team and provider notification could offer a solution. METHODS To address this need for an automated solution, we conducted an observational pilot study of our TAES (TriAl Eligibility Surveillance) system. We tested the hypothesis that an automated system based on natural language processing and machine learning algorithms could detect patients eligible for specific clinical trials by linking the information extracted from trial descriptions to the corresponding clinical information in the electronic health record (EHR). To evaluate the TAES information extraction and matching prototype (i.e., TAES prototype), we selected five open cardiovascular and cancer trials at the Medical University of South Carolina and created a new reference standard of 21,974 clinical text notes from a random selection of 400 patients (including at least 100 enrolled in the selected trials), with a small subset of 20 notes annotated in detail. We also developed a simple web interface for a new database that stores all trial eligibility criteria, corresponding clinical information, and trial-patient match characteristics using the Observational Medical Outcomes Partnership (OMOP) common data model. Finally, we investigated options for integrating an automated clinical trial eligibility system into the EHR and for notifying health care providers promptly of potential patient eligibility without interrupting their clinical workflow. 
RESULTS Although the rapidly implemented TAES prototype achieved only moderate accuracy (recall up to 0.778; precision up to 1.000), it enabled us to assess options for integrating an automated system successfully into the clinical workflow at a healthcare system. CONCLUSIONS Once optimized, the TAES system could exponentially enhance identification of patients potentially eligible for clinical trials, while simultaneously decreasing the burden on research teams of manual EHR review. Through timely notifications, it could also raise physician awareness of patient eligibility for clinical trials.
Affiliation(s)
- Stéphane M Meystre
  - OnePlanet Research Center and imec, Toernooiveld 300, Nijmegen, 6525 EC, The Netherlands.
- Paul M Heider
  - Medical University of South Carolina, Charleston, SC, USA
- Andrew Cates
  - Medical University of South Carolina, Charleston, SC, USA
- Grace Bastian
  - Medical University of South Carolina, Charleston, SC, USA
- Tara Pittman
  - Medical University of South Carolina, Charleston, SC, USA
3
Heider PM, Pipaliya RM, Meystre SM. A Natural Language Processing Tool Offering Data Extraction for COVID-19 Related Information (DECOVRI). Stud Health Technol Inform 2022; 290:1062-1063. [PMID: 35673206 DOI: 10.3233/shti220268] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/15/2023]
Abstract
A new natural language processing (NLP) application for COVID-19 related information extraction from clinical text notes is being developed as part of our pandemic response efforts. This NLP application called DECOVRI (Data Extraction for COVID-19 Related Information) will be released as a free and open source tool to convert unstructured notes into structured data within an OMOP CDM-based ecosystem. The DECOVRI prototype is being continuously improved and will be released early (beta) and in a full version.
Affiliation(s)
- Paul M Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
- Ronak M Pipaliya
  - College of Medicine, Medical University of South Carolina, Charleston, SC, USA
- Stéphane M Meystre
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
4
Pipaliya R, Heider PM, Meystre SM. Comparing Multiple Models for Section Header Classification with Feature Evaluation. Stud Health Technol Inform 2022; 290:1064-1065. [PMID: 35673207 DOI: 10.3233/shti220269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
We present a performance evaluation of machine learning (ML) and natural language processing (NLP) based section header classification. The section header classification task was performed as a two-pass system: the first pass detects a section header and the second pass classifies it. Recall, precision, and F1-measure were reported to identify the best approach for ML-based section header classification for use in downstream NLP tasks.
Affiliation(s)
- Ronak Pipaliya
  - College of Medicine, Medical University of South Carolina, Charleston, SC, USA
- Paul M Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
- Stéphane M Meystre
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
5
Meystre SM, Heider PM, Kim Y, Davis M, Obeid J, Madory J, Alekseyenko AV. Natural language processing enabling COVID-19 predictive analytics to support data-driven patient advising and pooled testing. J Am Med Inform Assoc 2021; 29:12-21. [PMID: 34415311 PMCID: PMC8714262 DOI: 10.1093/jamia/ocab186] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/11/2021] [Revised: 08/04/2021] [Accepted: 08/16/2021] [Indexed: 12/15/2022] Open
Abstract
OBJECTIVE The COVID-19 (coronavirus disease 2019) pandemic response at the Medical University of South Carolina included virtual care visits for patients with suspected severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. The telehealth system used for these visits only exports a text note to integrate with the electronic health record, but structured and coded information about COVID-19 (eg, exposure, risk factors, symptoms) was needed to support clinical care and early research as well as predictive analytics for data-driven patient advising and pooled testing. MATERIALS AND METHODS To capture COVID-19 information from multiple sources, a new data mart and a new natural language processing (NLP) application prototype were developed. The NLP application combined reused components with dictionaries and rules crafted by domain experts. It was deployed as a Web service for hourly processing of new data from patients assessed or treated for COVID-19. The extracted information was then used to develop algorithms predicting SARS-CoV-2 diagnostic test results based on symptoms and exposure information. RESULTS The dedicated data mart and NLP application were developed and deployed in a mere 10-day sprint in March 2020. The NLP application was evaluated with good accuracy (85.8% recall and 81.5% precision). The SARS-CoV-2 testing predictive analytics algorithms were configured to provide patients with data-driven COVID-19 testing advice with a sensitivity of 81% to 92% and to enable pooled testing with a negative predictive value of 90% to 91%, reducing the required tests to about 63%. CONCLUSIONS SARS-CoV-2 testing predictive analytics and NLP successfully enabled data-driven patient advising and pooled testing.
Affiliation(s)
- Stéphane M Meystre
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Paul M Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Youngjun Kim
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Matthew Davis
  - Information Solutions, Medical University of South Carolina, Charleston, South Carolina, USA
- Jihad Obeid
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- James Madory
  - Department of Pathology, Medical University of South Carolina, Charleston, South Carolina, USA
- Alexander V Alekseyenko
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
6
Kim Y, Heider PM, Lally IR, Meystre SM. A Hybrid Model for Family History Information Identification and Relation Extraction: Development and Evaluation of an End-to-End Information Extraction System. JMIR Med Inform 2021; 9:e22797. [PMID: 33885370 PMCID: PMC8103307 DOI: 10.2196/22797] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/30/2020] [Revised: 12/15/2020] [Accepted: 02/19/2021] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND Family history information is important to assess the risk of inherited medical conditions. Natural language processing has the potential to extract this information from unstructured free-text notes to improve patient care and decision making. We describe the end-to-end information extraction system the Medical University of South Carolina team developed when participating in the 2019 National Natural Language Processing Clinical Challenge (n2c2)/Open Health Natural Language Processing (OHNLP) shared task. OBJECTIVE This task involves identifying mentions of family members and observations in electronic health record text notes and recognizing the 2 types of relations (family member-living status relations and family member-observation relations). Our system aims to achieve a high level of performance by integrating heuristics and advanced information extraction methods. Our efforts also include improving the performance of 2 subtasks by exploiting additional labeled data and clinical text-based embedding models. METHODS We present a hybrid method that combines machine learning and rule-based approaches. We implemented an end-to-end system with multiple information extraction and attribute classification components. For entity identification, we trained bidirectional long short-term memory deep learning models. These models incorporated static word embeddings and context-dependent embeddings. We created a voting ensemble that combined the predictions of all individual models. For relation extraction, we trained 2 relation extraction models. The first model determined the living status of each family member. The second model identified observations associated with each family member. We implemented online gradient descent models to extract related entity pairs. As part of postchallenge efforts, we used the BioCreative/OHNLP 2018 corpus and trained new models with the union of these 2 datasets. 
We also pretrained language models using clinical notes from the Medical Information Mart for Intensive Care (MIMIC-III) clinical database. RESULTS The voting ensemble achieved better performance than individual classifiers. In the entity identification task, our top-performing system reached a precision of 78.90% and a recall of 83.84%. Our natural language processing system for entity identification took 3rd place out of 17 teams in the challenge. We ranked 4th out of 9 teams in the relation extraction task. Our system substantially benefited from the combination of the 2 datasets. Compared to our official submission with F1 scores of 81.30% and 64.94% for entity identification and relation extraction, respectively, the revised system yielded significantly better performance (P<.05) with F1 scores of 86.02% and 72.48%, respectively. CONCLUSIONS We demonstrated that a hybrid model could be used to successfully extract family history information recorded in unstructured free-text notes. In this study, our approach to entity identification as a sequence labeling problem produced satisfactory results. Our postchallenge efforts significantly improved performance by leveraging additional labeled data and using word vector representations learned from large collections of clinical notes.
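The F1 scores quoted in this abstract are, by the standard definition, the harmonic mean of precision and recall. As a minimal sketch (the function name `f1_score` is illustrative and not taken from the paper), the reported entity identification precision of 78.90% and recall of 83.84% can be plugged in to check that they are consistent with the reported official F1 of 81.30%:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the top-performing entity identification system.
f1 = f1_score(0.7890, 0.8384)
print(f"F1 = {f1:.4f}")
```

The harmonic mean penalizes an imbalance between precision and recall, which is why it is preferred over a simple average when comparing extraction systems.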
Affiliation(s)
- Youngjun Kim
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, United States
- Paul M Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, United States
- Isabel Rh Lally
  - Department of Computer Science, College of Charleston, Charleston, SC, United States
- Stéphane M Meystre
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, United States
7
Kim Y, Heider PM, Meystre SM. Comparative Study of Various Approaches for Ensemble-based De-identification of Electronic Health Record Narratives. AMIA Annu Symp Proc 2021; 2020:648-657. [PMID: 33936439 PMCID: PMC8075417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
De-identification of electronic health record narratives is a fundamental task applying natural language processing to better protect patient information privacy. We explore different types of ensemble learning methods to improve clinical text de-identification. We present two ensemble-based approaches for combining multiple predictive models. The first method selects an optimal subset of de-identification models by greedy exclusion. This ensemble pruning allows one to save computational time or physical resources while achieving similar or better performance than the ensemble of all members. The second method uses a sequence of words to train a sequential model. For this sequence labelling-based stacked ensemble, we employ search-based structured prediction and bidirectional long short-term memory algorithms. We create ensembles consisting of de-identification models trained on two clinical text corpora. Experimental results show that our ensemble systems can effectively integrate predictions from individual models and offer better generalization across two different corpora.
Affiliation(s)
- Youngjun Kim
  - Medical University of South Carolina, Charleston, South Carolina, USA
- Paul M Heider
  - Medical University of South Carolina, Charleston, South Carolina, USA
- Stéphane M Meystre
  - Medical University of South Carolina, Charleston, South Carolina, USA
  - Clinacuity, Inc., Charleston, South Carolina, USA
8
Obeid JS, Davis M, Turner M, Meystre SM, Heider PM, O'Bryan EC, Lenert LA. An artificial intelligence approach to COVID-19 infection risk assessment in virtual visits: A case report. J Am Med Inform Assoc 2020; 27:1321-1325. [PMID: 32449766 PMCID: PMC7313981 DOI: 10.1093/jamia/ocaa105] [Citation(s) in RCA: 41] [Impact Index Per Article: 10.3] [Received: 04/28/2020] [Revised: 05/07/2020] [Accepted: 05/21/2020] [Indexed: 12/15/2022] Open
Abstract
Objective In an effort to improve the efficiency of computer algorithms applied to screening for coronavirus disease 2019 (COVID-19) testing, we used natural language processing and artificial intelligence–based methods with unstructured patient data collected through telehealth visits. Materials and Methods After segmenting and parsing documents, we conducted analysis of overrepresented words in patient symptoms. We then developed a word embedding–based convolutional neural network for predicting COVID-19 test results based on patients’ self-reported symptoms. Results Text analytics revealed that concepts such as smell and taste were more prevalent than expected in patients testing positive. As a result, screening algorithms were adapted to include these symptoms. The deep learning model yielded an area under the receiver-operating characteristic curve of 0.729 for predicting positive results and was subsequently applied to prioritize testing appointment scheduling. Conclusions Informatics tools such as natural language processing and artificial intelligence methods can have significant clinical impacts when applied to data streams early in the development of clinical systems for outbreak response.
Affiliation(s)
- Jihad S Obeid
  - Department of Public Health Sciences, Medical University of South Carolina, Charleston, South Carolina, USA; Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Matthew Davis
  - Information Solutions, Medical University of South Carolina, Charleston, South Carolina, USA
- Matthew Turner
  - Information Solutions, Medical University of South Carolina, Charleston, South Carolina, USA
- Stephane M Meystre
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA; Department of Psychiatry and Behavioral Sciences, Medical University of South Carolina, Charleston, South Carolina, USA
- Paul M Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Edward C O'Bryan
  - Department of Emergency Medicine, Medical University of South Carolina, Charleston, South Carolina, USA
- Leslie A Lenert
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA; Department of Medicine, Medical University of South Carolina, Charleston, South Carolina, USA
9
Heider PM, Obeid JS, Meystre SM. A Comparative Analysis of Speed and Accuracy for Three Off-the-Shelf De-Identification Tools. AMIA Jt Summits Transl Sci Proc 2020; 2020:241-250. [PMID: 32477643 PMCID: PMC7233098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
A growing quantity of health data is being stored in Electronic Health Records (EHR). The free-text section of these clinical notes contains important patient and treatment information for research but also contains Personally Identifiable Information (PII), which cannot be freely shared within the research community without compromising patient confidentiality and privacy rights. Significant work has been invested in investigating automated approaches to text de-identification, the process of removing or redacting PII. Few studies have examined the performance of existing de-identification pipelines in a controlled comparative analysis. In this study, we use publicly available corpora to analyze speed and accuracy differences between three de-identification systems that can be run off-the-shelf: Amazon Comprehend Medical PHId, Clinacuity's CliniDeID, and the National Library of Medicine's Scrubber. No single system dominated all the compared metrics. NLM Scrubber was the fastest while CliniDeID generally had the highest accuracy.
Affiliation(s)
- Paul M Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC
- Jihad S Obeid
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC
- Stéphane M Meystre
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC
10
Obeid JS, Heider PM, Weeda ER, Matuskowitz AJ, Carr CM, Gagnon K, Crawford T, Meystre SM. Impact of De-Identification on Clinical Text Classification Using Traditional and Deep Learning Classifiers. Stud Health Technol Inform 2019; 264:283-287. [PMID: 31437930 PMCID: PMC6779034 DOI: 10.3233/shti190228] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Indexed: 06/10/2023]
Abstract
Clinical text de-identification enables collaborative research while protecting patient privacy and confidentiality; however, concerns persist about the reduction in the utility of the de-identified text for information extraction and machine learning tasks. In the context of a deep learning experiment to detect altered mental status in emergency department provider notes, we tested several classifiers on clinical notes in their original form and on their automatically de-identified counterpart. We tested both traditional bag-of-words based machine learning models as well as word-embedding based deep learning models. We evaluated the models on 1,113 history of present illness notes. A total of 1,795 protected health information tokens were replaced in the de-identification process across all notes. The deep learning models had the best performance with accuracies of 95% on both original and de-identified notes. However, there was no significant difference in the performance of any of the models on the original vs. the de-identified notes.
Affiliation(s)
- Jihad S. Obeid
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
- Paul M. Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
- Erin R. Weeda
  - Department of Clinical Pharmacy and Outcome Sciences, Medical University of South Carolina, Charleston, SC, USA
- Andrew J. Matuskowitz
  - Department of Emergency Medicine, Medical University of South Carolina, Charleston, SC, USA
- Christine M. Carr
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
  - Department of Emergency Medicine, Medical University of South Carolina, Charleston, SC, USA
- Kevin Gagnon
  - Department of Computer Science, University of South Carolina, Columbia, SC, USA
- Tami Crawford
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
- Stephane M. Meystre
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, USA
11
Heider PM, Meystre SM. Patient-Pivoted Automated Trial Eligibility Pipeline: The First of Three Phases in a Modular Architecture. Stud Health Technol Inform 2019; 264:1476-1477. [PMID: 31438189 DOI: 10.3233/shti190492] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Automated extraction of patient trial eligibility for clinical research studies can increase enrollment at a decreased time and money cost. We have developed a modular trial eligibility pipeline including patient-batched processing and an internal webservice backed by a uimaFIT pipeline as part of a multi-phase approach to include note-batched processing, the ability to query trials matching patients or patients matching trials, and an external alignment engine to connect patients to trials.
Affiliation(s)
- Paul M Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
- Stéphane M Meystre
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, South Carolina, USA
12
Meystre SM, Heider PM, Kim Y, Aruch DB, Britten CD. Automatic trial eligibility surveillance based on unstructured clinical data. Int J Med Inform 2019; 129:13-19. [PMID: 31445247 DOI: 10.1016/j.ijmedinf.2019.05.018] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 09/15/2018] [Revised: 12/06/2018] [Accepted: 05/21/2019] [Indexed: 12/26/2022]
Abstract
INTRODUCTION Insufficient patient enrollment in clinical trials remains a serious and costly problem and is often considered the most critical issue to solve for the clinical trials community. In this project, we assessed the feasibility of automatically detecting a patient's eligibility for a sample of breast cancer clinical trials by mapping coded clinical trial eligibility criteria to the corresponding clinical information automatically extracted from text in the EHR. METHODS Three open breast cancer clinical trials were selected by oncologists. Their eligibility criteria were manually abstracted from trial descriptions using the OHDSI ATLAS web application. Patients enrolled or screened for these trials were selected as 'positive' or 'possible' cases. Other patients diagnosed with breast cancer were selected as 'negative' cases. A selection of the clinical data and all clinical notes of these 229 selected patients was extracted from the MUSC clinical data warehouse and stored in a database implementing the OMOP common data model. Eligibility criteria were extracted from clinical notes using either manually crafted pattern matching (regular expressions) or a new natural language processing (NLP) application. These extracted criteria were then compared with reference criteria from trial descriptions. This comparison was realized with three different versions of a new application: rule-based, cosine similarity-based, and machine learning-based. RESULTS For eligibility criteria extraction from clinical notes, the machine learning-based NLP application allowed for the highest accuracy with a micro-averaged recall of 90.9% and precision of 89.7%. For trial eligibility determination, the highest accuracy was reached by the machine learning-based approach with a per-trial AUC between 75.5% and 89.8%. 
CONCLUSION NLP can be used to extract eligibility criteria from EHR clinical notes and automatically discover patients possibly eligible for a clinical trial with good accuracy, which could be leveraged to reduce the workload of humans screening patients for trials.
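The abstract names a cosine similarity-based version of the criteria comparison but gives no implementation details. The following is only a generic sketch of cosine similarity over bag-of-words vectors, under the assumptions of whitespace tokenization and lowercase normalization; the function name and the example strings are hypothetical and do not come from the study:

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the tokens the two texts share.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical trial criterion and hypothetical text extracted from a note.
criterion = "her2 positive breast cancer"
extracted = "invasive breast cancer her2 positive"
score = cosine_similarity(criterion, extracted)
print(f"similarity = {score:.3f}")
```

A real system would more plausibly compare coded concepts or embeddings than raw tokens; the token-count version is used here only because it keeps the arithmetic visible.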
Affiliation(s)
- Stéphane M Meystre
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, United States; Division of Hematology/Oncology, Medical University of South Carolina, Charleston, SC, United States.
- Paul M Heider
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, United States
- Youngjun Kim
  - Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, United States
- Daniel B Aruch
  - Division of Hematology/Oncology, Medical University of South Carolina, Charleston, SC, United States
- Carolyn D Britten
  - Division of Hematology/Oncology, Medical University of South Carolina, Charleston, SC, United States