1
Sivarajkumar S, Mohammad HA, Oniani D, Roberts K, Hersh W, Liu H, He D, Visweswaran S, Wang Y. Clinical Information Retrieval: A Literature Review. Journal of Healthcare Informatics Research 2024; 8:313-352. [PMID: 38681755 PMCID: PMC11052968 DOI: 10.1007/s41666-024-00159-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2023] [Revised: 12/07/2023] [Accepted: 01/08/2024] [Indexed: 05/01/2024]
Abstract
Clinical information retrieval (IR) plays a vital role in modern healthcare by facilitating efficient access and analysis of medical literature for clinicians and researchers. This scoping review aims to offer a comprehensive overview of the current state of clinical IR research and identify gaps and potential opportunities for future studies in this field. The main objective was to assess and analyze the existing literature on clinical IR, focusing on the methods, techniques, and tools employed for effective retrieval and analysis of medical information. Adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, we conducted an extensive search across databases such as Ovid Embase, Ovid Medline, Scopus, ACM Digital Library, IEEE Xplore, and Web of Science, covering publications from January 1, 2010, to January 4, 2023. The rigorous screening process led to the inclusion of 184 papers in our review. Our findings provide a detailed analysis of the clinical IR research landscape, covering aspects like publication trends, data sources, methodologies, evaluation metrics, and applications. The review identifies key research gaps in clinical IR methods such as indexing, ranking, and query expansion, offering insights and opportunities for future studies in clinical IR, thus serving as a guiding framework for upcoming research efforts in this rapidly evolving field. The study also underscores an imperative for innovative research on advanced clinical IR systems capable of fast semantic vector search and adoption of neural IR techniques for effective retrieval of information from unstructured electronic health records (EHRs). Supplementary Information The online version contains supplementary material available at 10.1007/s41666-024-00159-4.
Affiliation(s)
- David Oniani
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA USA
- Kirk Roberts
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- William Hersh
  - Department of Medical Informatics & Clinical Epidemiology, Oregon Health & Science University, Portland, OR USA
- Hongfang Liu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Daqing He
  - Department of Information Science, University of Pittsburgh, Pittsburgh, PA USA
- Shyam Visweswaran
  - Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA USA
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA USA
  - Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA USA
- Yanshan Wang
  - Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA USA
  - Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA USA
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA USA
  - Clinical and Translational Science Institute, University of Pittsburgh, Pittsburgh, PA USA
2
Barrett AK, Ford J, Zhu Y. Sending and Receiving Safety and Risk Messages in Hospitals: An Exploration into Organizational Communication Channels and Providers' Communication Overload. Health Communication 2021; 36:1697-1708. [PMID: 32633142 DOI: 10.1080/10410236.2020.1788498] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 06/11/2023]
Abstract
This study explores hospital workers' experiences with workplace communication overload and its implications for effective safety and risk messaging in hospital organizations. We use a multi-step thematic analysis of interview (N = 12) and focus group (N = 8, 28 participants) data collected from hospital workers to analyze how they describe specific organizational communication channels influencing their communication overload. We specifically examine how workers' socially constructed channel affordances and constraints for sending/receiving safety information provide meaning to their communicatively overloaded states. Hospital workers explained that asynchronous channels such as e-mail and voicemail aggravated communication overload, while synchronous channels such as team huddles alleviated it. We discuss the implications of these results for the communication overload model by pointing to violations of communication channel preference and literature on the social affordances of communication channels. Study limitations and future directions are offered.
Affiliation(s)
- Ashley K Barrett
  - Health Communication in the Department of Communication at Baylor University
- Jessica Ford
  - Health Communication in the Department of Communication at Baylor University
- Yaguang Zhu
  - Organizational Communication in the Department of Communication, The University of Arkansas
3
Wu S, Roberts K, Datta S, Du J, Ji Z, Si Y, Soni S, Wang Q, Wei Q, Xiang Y, Zhao B, Xu H. Deep learning in clinical natural language processing: a methodical review. J Am Med Inform Assoc 2021; 27:457-470. [PMID: 31794016 DOI: 10.1093/jamia/ocz200] [Citation(s) in RCA: 158] [Impact Index Per Article: 52.7] [Received: 07/03/2019] [Revised: 10/15/2019] [Accepted: 11/09/2019] [Indexed: 02/07/2023] Open
Abstract
OBJECTIVE This article methodically reviews the literature on deep learning (DL) for natural language processing (NLP) in the clinical domain, providing quantitative analysis to answer 3 research questions concerning methods, scope, and context of current research. MATERIALS AND METHODS We searched MEDLINE, EMBASE, Scopus, the Association for Computing Machinery Digital Library, and the Association for Computational Linguistics Anthology for articles using DL-based approaches to NLP problems in electronic health records. After screening 1,737 articles, we collected data on 25 variables across 212 papers. RESULTS DL in clinical NLP publications more than doubled each year, through 2018. Recurrent neural networks (60.8%) and word2vec embeddings (74.1%) were the most popular methods; the information extraction tasks of text classification, named entity recognition, and relation extraction were dominant (89.2%). However, there was a "long tail" of other methods and specific tasks. Most contributions were methodological variants or applications, but 20.8% were new methods of some kind. The earliest adopters were in the NLP community, but the medical informatics community was the most prolific. DISCUSSION Our analysis shows growing acceptance of deep learning as a baseline for NLP research, and of DL-based NLP in the medical community. A number of common associations were substantiated (eg, the preference of recurrent neural networks for sequence-labeling named entity recognition), while others were surprisingly nuanced (eg, the scarcity of French language clinical NLP with deep learning). CONCLUSION Deep learning has not yet fully penetrated clinical NLP and is growing rapidly. This review highlighted both the popular and unique trends in this active field.
Affiliation(s)
- Stephen Wu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Kirk Roberts
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Surabhi Datta
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Jingcheng Du
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Zongcheng Ji
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Yuqi Si
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Sarvesh Soni
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Qiong Wang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Qiang Wei
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Yang Xiang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Bo Zhao
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
- Hua Xu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
4
Xiang Y, Xu J, Si Y, Li Z, Rasmy L, Zhou Y, Tiryaki F, Li F, Zhang Y, Wu Y, Jiang X, Zheng WJ, Zhi D, Tao C, Xu H. Time-sensitive clinical concept embeddings learned from large electronic health records. BMC Med Inform Decis Mak 2019; 19:58. [PMID: 30961579 PMCID: PMC6454598 DOI: 10.1186/s12911-019-0766-3] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Learning distributional representations of clinical concepts (e.g., diseases, drugs, and labs) is an important research area of deep learning in the medical domain. However, many existing relevant methods do not consider temporal dependencies along the longitudinal sequence of a patient's records, which may lead to incorrect selection of contexts. METHODS To address this issue, we extended three popular concept embedding learning methods, word2vec, positive pointwise mutual information (PPMI), and FastText, to consider time-sensitive information. We then trained them on a large electronic health records (EHR) database containing about 50 million patients to generate concept embeddings, and evaluated them with both intrinsic evaluations focusing on a concept similarity measure and an extrinsic evaluation assessing the use of the generated concept embeddings in predicting disease onset. RESULTS Our experiments show that embeddings learned from information within one visit (time window zero) improve performance on the concept similarity measure, and that the FastText algorithm usually performed better than the other two algorithms. For the predictive modeling task, the optimal result was achieved by word2vec embeddings with a 30-day sliding window. CONCLUSIONS Considering time constraints is important in training clinical concept embeddings. We expect they can benefit a series of downstream applications.
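As a rough illustration of what a time-sensitive context can mean for a count-based method such as PPMI, the sketch below counts concept pairs that fall within a chosen number of days of each other and then computes PPMI scores. It is not the authors' implementation: the function names, the `(day, concept)` event format, the toy patients, and the use of "same day" as a stand-in for "same visit" are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_counts(patient_events, window_days):
    """Count concept pairs occurring within window_days of each other.

    patient_events: list of per-patient lists of (day, concept) tuples
    (a hypothetical format chosen for this sketch).
    """
    pair_counts = Counter()
    for events in patient_events:
        for (day_a, a), (day_b, b) in combinations(sorted(events), 2):
            if a != b and abs(day_b - day_a) <= window_days:
                # Store both directions so the count table is symmetric.
                pair_counts[(a, b)] += 1
                pair_counts[(b, a)] += 1
    return pair_counts


def ppmi(pair_counts):
    """Positive pointwise mutual information for each counted pair."""
    total = sum(pair_counts.values())
    marginal = Counter()
    for (a, _), n in pair_counts.items():
        marginal[a] += n
    scores = {}
    for (a, b), n in pair_counts.items():
        pmi = math.log((n * total) / (marginal[a] * marginal[b]))
        scores[(a, b)] = max(pmi, 0.0)  # clip negative PMI to zero
    return scores


# Toy data (fabricated purely for illustration, not from any EHR).
toy_patients = [
    [(0, "diabetes"), (0, "metformin"), (40, "fracture")],
    [(0, "diabetes"), (5, "metformin")],
]

same_day = cooccurrence_counts(toy_patients, window_days=0)
thirty_day = cooccurrence_counts(toy_patients, window_days=30)
scores = ppmi(cooccurrence_counts(toy_patients, window_days=60))
```

Widening `window_days` admits more distant events as context, which is the trade-off the abstract reports between the zero window and the 30-day window.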
Affiliation(s)
- Yang Xiang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Jun Xu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Yuqi Si
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Zhiheng Li
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, China
- Laila Rasmy
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Yujia Zhou
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Firat Tiryaki
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Fang Li
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Yaoyun Zhang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Yonghui Wu
  - Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL USA
- Xiaoqian Jiang
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Wenjin Jim Zheng
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Degui Zhi
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Cui Tao
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
- Hua Xu
  - School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX USA
5
Ning W, Chan S, Beam A, Yu M, Geva A, Liao K, Mullen M, Mandl KD, Kohane I, Cai T, Yu S. Feature extraction for phenotyping from semantic and knowledge resources. J Biomed Inform 2019; 91:103122. [PMID: 30738949 PMCID: PMC6424621 DOI: 10.1016/j.jbi.2019.103122] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 01/10/2023]
Abstract
OBJECTIVE Phenotyping algorithms can efficiently and accurately identify patients with a specific disease phenotype and construct electronic health records (EHR)-based cohorts for subsequent clinical or genomic studies. Previous studies have introduced unsupervised EHR-based feature selection methods that yielded algorithms with high accuracy. However, those selection methods still require expert intervention to tweak the parameter settings according to the EHR data distribution for each phenotype. To further accelerate the development of phenotyping algorithms, we propose a fully automated and robust unsupervised feature selection method that leverages only publicly available medical knowledge sources, instead of EHR data. METHODS SEmantics-Driven Feature Extraction (SEDFE) collects medical concepts from online knowledge sources as candidate features and gives them vector-form distributional semantic representations derived with neural word embedding and the Unified Medical Language System Metathesaurus. A number of features that are semantically closest and that sufficiently characterize the target phenotype are determined by a linear decomposition criterion and are selected for the final classification algorithm. RESULTS SEDFE was compared with the EHR-based SAFE algorithm and domain experts on feature selection for the classification of five phenotypes including coronary artery disease, rheumatoid arthritis, Crohn's disease, ulcerative colitis, and pediatric pulmonary arterial hypertension using both supervised and unsupervised approaches. Algorithms yielded by SEDFE achieved comparable accuracy to those yielded by SAFE and expert-curated features. SEDFE is also robust to the input semantic vectors. CONCLUSION SEDFE attains satisfying performance in unsupervised feature selection for EHR phenotyping. Both fully automated and EHR-independent, this method promises efficiency and accuracy in developing algorithms for high-throughput phenotyping.
Affiliation(s)
- Wenxin Ning
  - Department of Industrial Engineering, Tsinghua University, Beijing, China
- Stephanie Chan
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Andrew Beam
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Ming Yu
  - Department of Industrial Engineering, Tsinghua University, Beijing, China
- Alon Geva
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Anesthesiology, Critical Care, and Pain Medicine, Boston Children's Hospital, Boston, MA, USA; Department of Anesthesia, Harvard Medical School, Boston, MA, USA
- Katherine Liao
  - Department of Medicine, Division of Rheumatology, Immunology and Allergy, Brigham and Women's Hospital, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Mary Mullen
  - Department of Cardiology, Boston Children's Hospital, Boston, MA, USA; Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Kenneth D Mandl
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Isaac Kohane
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Tianxi Cai
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Sheng Yu
  - Center for Statistical Science, Tsinghua University, Beijing, China; Department of Industrial Engineering, Tsinghua University, Beijing, China; Institute for Data Science, Tsinghua University, Beijing, China
6
Bai T, Chanda AK, Egleston BL, Vucetic S. EHR phenotyping via jointly embedding medical concepts and words into a unified vector space. BMC Med Inform Decis Mak 2018; 18:123. [PMID: 30537974 PMCID: PMC6290514 DOI: 10.1186/s12911-018-0672-0] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 11/17/2022] Open
Abstract
Background There has been an increasing interest in learning low-dimensional vector representations of medical concepts from Electronic Health Records (EHRs). Vector representations of medical concepts facilitate exploratory analysis and predictive modeling of EHR data to gain insights about the patterns of care and health outcomes. EHRs contain structured data such as diagnostic codes and laboratory tests, as well as unstructured free text data in the form of clinical notes, which provide more detail about the condition and treatment of patients. Methods In this work, we propose a method that jointly learns vector representations of medical concepts and words. This is achieved by a novel learning scheme based on the word2vec model. Our model learns those relationships by integrating clinical notes and sets of accompanying medical codes and by defining joint contexts for each observed word and medical code. Results In our experiments, we learned joint representations using MIMIC-III data. Using the learned representations of words and medical codes, we evaluated phenotypes for 6 diseases discovered by our method and a baseline method. The experimental results show that for each of the 6 diseases our method finds highly relevant words. We also show that our representations can be very useful when predicting the reason for the next visit. Conclusions The jointly learned representations of medical concepts and words capture not only similarity between codes or words themselves, but also similarity between codes and words. They can be used to extract phenotypes of different diseases. The representations learned by the joint model are also useful for construction of patient features.
Affiliation(s)
- Tian Bai
  - Department of Computer & Information Sciences, Temple University, Philadelphia, PA, USA
- Ashis Kumar Chanda
  - Department of Computer & Information Sciences, Temple University, Philadelphia, PA, USA
- Brian L Egleston
  - Fox Chase Cancer Center, Temple University, Philadelphia, PA, USA
- Slobodan Vucetic
  - Department of Computer & Information Sciences, Temple University, Philadelphia, PA, USA
7
Jackson R, Kartoglu I, Stringer C, Gorrell G, Roberts A, Song X, Wu H, Agrawal A, Lui K, Groza T, Lewsley D, Northwood D, Folarin A, Stewart R, Dobson R. CogStack - experiences of deploying integrated information retrieval and extraction services in a large National Health Service Foundation Trust hospital. BMC Med Inform Decis Mak 2018; 18:47. [PMID: 29941004 PMCID: PMC6020175 DOI: 10.1186/s12911-018-0623-9] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 03/20/2017] [Accepted: 06/01/2018] [Indexed: 03/05/2023] Open
Abstract
BACKGROUND Traditional health information systems are generally devised to support clinical data collection at the point of care. However, as the significance of the modern information economy expands in scope and permeates the healthcare domain, there is an increasing urgency for healthcare organisations to offer information systems that address the expectations of clinicians, researchers and the business intelligence community alike. Amongst other emergent requirements, the principal unmet need might be defined as the 3R principle (right data, right place, right time) to address deficiencies in organisational data flow while retaining the strict information governance policies that apply within the UK National Health Service (NHS). Here, we describe our work on creating and deploying a low cost structured and unstructured information retrieval and extraction architecture within King's College Hospital, the management of governance concerns and the associated use cases and cost saving opportunities that such components present. RESULTS To date, our CogStack architecture has processed over 300 million lines of clinical data, making it available for internal service improvement projects at King's College London. On generated data designed to simulate real world clinical text, our de-identification algorithm achieved up to 94% precision and up to 96% recall. CONCLUSION We describe a toolkit which we feel is of huge value to the UK (and beyond) healthcare community. It is the only open source, easily deployable solution designed for the UK healthcare environment, in a landscape populated by expensive proprietary systems. Solutions such as these provide a crucial foundation for the genomic revolution in medicine.
Affiliation(s)
- Richard Jackson
  - Institute of Psychiatry, Psychology and Neuroscience, King’s College London, 16 De Crespigny Park, London, SE5 8AF UK
  - South London and Maudsley NHS Foundation Trust, Denmark Hill, London, SE5 8AZ UK
- Ismail Kartoglu
  - InterDigital Communications, 64 Great Eastern Street, 1st Floor, London, EC2A 3QR UK
- Clive Stringer
  - King’s College Hospital, Denmark Hill, London, SE5 9RS UK
- Angus Roberts
  - University of Sheffield, Western Bank, Sheffield, S10 2TN UK
- Xingyi Song
  - University of Sheffield, Western Bank, Sheffield, S10 2TN UK
- Honghan Wu
  - Institute of Psychiatry, Psychology and Neuroscience, King’s College London, 16 De Crespigny Park, London, SE5 8AF UK
  - Usher Institute of Population Health Sciences and Informatics, University of Edinburgh, Edinburgh, EH16 4UX UK
- Asha Agrawal
  - King’s College Hospital, Denmark Hill, London, SE5 9RS UK
- Kenneth Lui
  - Farr Institute of Health Informatics Research, UCL Institute of Health Informatics, University College London, London, WC1E 6BT UK
- Tudor Groza
  - Garvan Institute of Medical Research, Sydney, NSW 2010 Australia
- Damian Lewsley
  - King’s College Hospital, Denmark Hill, London, SE5 9RS UK
- Doug Northwood
  - King’s College Hospital, Denmark Hill, London, SE5 9RS UK
- Amos Folarin
  - Institute of Psychiatry, Psychology and Neuroscience, King’s College London, 16 De Crespigny Park, London, SE5 8AF UK
  - Farr Institute of Health Informatics Research, UCL Institute of Health Informatics, University College London, London, WC1E 6BT UK
- Robert Stewart
  - Institute of Psychiatry, Psychology and Neuroscience, King’s College London, 16 De Crespigny Park, London, SE5 8AF UK
  - South London and Maudsley NHS Foundation Trust, Denmark Hill, London, SE5 8AZ UK
- Richard Dobson
  - Institute of Psychiatry, Psychology and Neuroscience, King’s College London, 16 De Crespigny Park, London, SE5 8AF UK
  - Farr Institute of Health Informatics Research, UCL Institute of Health Informatics, University College London, London, WC1E 6BT UK
8
Effective Identification of Similar Patients Through Sequential Matching over ICD Code Embedding. J Med Syst 2018; 42:94. [PMID: 29644446 DOI: 10.1007/s10916-018-0951-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/14/2018] [Accepted: 03/26/2018] [Indexed: 10/17/2022]
Abstract
Evidence-based medicine often involves the identification of patients with similar conditions, which are often captured in ICD (International Classification of Diseases (World Health Organization 2013)) code sequences. With no satisfying prior solutions for matching ICD-10 code sequences, this paper presents a method which effectively captures the clinical similarity among routine patients who have multiple comorbidities and complex care needs. Our method leverages the recent progress in representation learning of individual ICD-10 codes, and it explicitly uses the sequential order of codes for matching. Empirical evaluation on a state-wide cancer data collection shows that our proposed method achieves significantly higher matching performance compared with state-of-the-art methods ignoring the sequential order. Our method better identifies similar patients in a number of clinical outcomes including readmission and mortality outlook. Although this paper focuses on ICD-10 diagnosis code sequences, our method can be adapted to work with other codified sequence data.
9
Névéol A, Dalianis H, Velupillai S, Savova G, Zweigenbaum P. Clinical Natural Language Processing in languages other than English: opportunities and challenges. J Biomed Semantics 2018; 9:12. [PMID: 29602312 PMCID: PMC5877394 DOI: 10.1186/s13326-018-0179-8] [Citation(s) in RCA: 91] [Impact Index Per Article: 15.2] [Received: 05/22/2017] [Accepted: 02/14/2018] [Indexed: 01/22/2023] Open
Abstract
Background Natural language processing applied to clinical text or aimed at a clinical outcome has been thriving in recent years. This paper offers the first broad overview of clinical Natural Language Processing (NLP) for languages other than English. Recent studies are summarized to offer insights and outline opportunities in this area. Main Body We envision three groups of intended readers: (1) NLP researchers leveraging experience gained in other languages, (2) NLP researchers faced with establishing clinical text processing in a language other than English, and (3) clinical informatics researchers and practitioners looking for resources in their languages in order to apply NLP techniques and tools to clinical practice and/or investigation. We review work in clinical NLP in languages other than English. We classify these studies into three groups: (i) studies describing the development of new NLP systems or components de novo, (ii) studies describing the adaptation of NLP architectures developed for English to another language, and (iii) studies focusing on a particular clinical application. Conclusion We show the advantages and drawbacks of each method, and highlight the appropriate application context. Finally, we identify major challenges and opportunities that will affect the impact of NLP on clinical practice and public health studies in a context that encompasses English as well as other languages.
Affiliation(s)
- Aurélie Névéol
  - LIMSI, CNRS, Université Paris Saclay, Rue John von Neumann, Paris, F-91405 Orsay, France
- Sumithra Velupillai
  - School of Computer Science and Communication, KTH, Stockholm, Sweden; Institute of Psychiatry, Psychology and Neuroscience, King's College, London, UK
- Guergana Savova
  - Children's Hospital Boston and Harvard Medical School, Boston, Massachusetts, USA
- Pierre Zweigenbaum
  - LIMSI, CNRS, Université Paris Saclay, Rue John von Neumann, Paris, F-91405 Orsay, France
10
Bai T, Chanda AK, Egleston BL, Vucetic S. Joint Learning of Representations of Medical Concepts and Words from EHR Data. Proceedings. IEEE International Conference on Bioinformatics and Biomedicine 2017; 2017:764-769. [PMID: 29375929 DOI: 10.1109/bibm.2017.8217752] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Indexed: 11/08/2022]
Abstract
There has been an increasing interest in learning low-dimensional vector representations of medical concepts from electronic health records (EHRs). While EHRs contain structured data such as diagnostic codes and laboratory tests, they also contain unstructured clinical notes, which provide more nuanced details on a patient's health status. In this work, we propose a method that jointly learns medical concept and word representations. In particular, we focus on capturing the relationship between medical codes and words by using a novel learning scheme for the word2vec model. Our method exploits relationships between different parts of EHRs in the same visit and embeds both codes and words in the same continuous vector space. In the end, we are able to derive clusters which reflect distinct disease and treatment patterns. In our experiments, we qualitatively show how our method of grouping words for given diagnostic codes compares with a topic modeling approach. We also test how well our representations can be used to predict disease patterns of the next visit. The results show that our approach outperforms several common methods.
Affiliation(s)
- Tian Bai
  - Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA
- Ashis Kumar Chanda
  - Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA
- Slobodan Vucetic
  - Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA
11
Assigning clinical codes with data-driven concept representation on Dutch clinical free text. J Biomed Inform 2017; 69:118-127. [DOI: 10.1016/j.jbi.2017.04.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/26/2016] [Revised: 03/06/2017] [Accepted: 04/07/2017] [Indexed: 11/21/2022]
12
Wang Y, Wu S, Li D, Mehrabi S, Liu H. A Part-Of-Speech term weighting scheme for biomedical information retrieval. J Biomed Inform 2016; 63:379-389. [PMID: 27593166 DOI: 10.1016/j.jbi.2016.08.026] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Received: 04/27/2016] [Revised: 08/30/2016] [Accepted: 08/31/2016] [Indexed: 11/24/2022]
Abstract
In the era of digitalization, information retrieval (IR), which retrieves and ranks documents from large collections according to users' search queries, has been widely applied in the biomedical domain. Building patient cohorts using electronic health records (EHRs) and searching literature for topics of interest are some IR use cases. Meanwhile, natural language processing (NLP), such as tokenization or Part-Of-Speech (POS) tagging, has been developed for processing clinical documents or biomedical literature. We hypothesize that NLP can be incorporated into IR to strengthen the conventional IR models. In this study, we propose two NLP-empowered IR models, POS-BoW and POS-MRF, which incorporate automatic POS-based term weighting schemes into bag-of-words (BoW) and Markov Random Field (MRF) IR models, respectively. In the proposed models, the POS-based term weights are calculated iteratively with a cyclic coordinate method, in which a golden section line search is applied along each coordinate to optimize an objective function defined by mean average precision (MAP). In the empirical experiments, we used the data sets from the Medical Records track in Text REtrieval Conference (TREC) 2011 and 2012 and the Genomics track in TREC 2004. The evaluation on TREC 2011 and 2012 Medical Records tracks shows that, for the POS-BoW models, the mean improvement rates for the IR evaluation metrics MAP, bpref, and P@10 are 10.88%, 4.54%, and 3.82%, compared to the BoW models; and for the POS-MRF models, these rates are 13.59%, 8.20%, and 8.78%, compared to the MRF models. Additionally, we experimentally verify that the proposed weighting approach is superior to the simple heuristic and frequency-based weighting approaches, and validate our POS category selection. Using the optimal weights calculated in this experiment, we tested the proposed models on the TREC 2004 Genomics track and obtained average improvement rates of 8.63% and 10.04% for POS-BoW and POS-MRF, respectively. These significant improvements verify the effectiveness of leveraging POS tagging for biomedical IR tasks.
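The weight optimization described in this abstract (a cyclic coordinate method with a golden section line search along each coordinate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the weight bounds of [0, 1], the three hypothetical POS categories, and the toy objective are all assumptions made for illustration. In the paper the objective is MAP computed from a retrieval run; here a smooth concave stand-in is used so the example runs on its own.

```python
import math


def golden_section_max(f, lo, hi, tol=1e-5):
    """Maximize a unimodal function f on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5) - 1) / 2  # about 0.618
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; the old c becomes the new d.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # The maximum lies in [c, b]; the old d becomes the new c.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2


def cyclic_coordinate_ascent(objective, weights, bounds=(0.0, 1.0),
                             sweeps=20, tol=1e-6):
    """Optimize one weight at a time, cycling over all coordinates."""
    w = list(weights)
    best = objective(w)
    for _ in range(sweeps):
        previous = best
        for i in range(len(w)):
            def along_axis(x, i=i):
                # Objective restricted to coordinate i, others held fixed.
                return objective(w[:i] + [x] + w[i + 1:])

            x_star = golden_section_max(along_axis, *bounds)
            value = along_axis(x_star)
            if value > best:  # accept only improving moves
                w[i], best = x_star, value
        if best - previous < tol:  # stop when a full sweep barely helps
            break
    return w, best


# Hypothetical POS categories and a toy stand-in for MAP (assumption):
# a concave function that peaks at the "ideal" weights below.
ideal = {"noun": 0.8, "verb": 0.3, "adjective": 0.5}


def toy_objective(w):
    return -sum((wi - t) ** 2 for wi, t in zip(w, ideal.values()))


weights, score = cyclic_coordinate_ascent(toy_objective, [0.5, 0.5, 0.5])
print(dict(zip(ideal, (round(x, 3) for x in weights))))
```

Golden section search assumes the objective is unimodal along each coordinate. A real MAP objective is piecewise constant and may have several local optima, so in practice the result would depend on the starting weights and the search bounds.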
Affiliation(s)
- Yanshan Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
- Stephen Wu
- Department of Medical Informatics & Clinical Epidemiology, Oregon Health and Science University, Portland, OR, USA.
- Dingcheng Li
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
- Saeed Mehrabi
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN, USA.
|
13
|
Thompson P, Batista-Navarro RT, Kontonatsios G, Carter J, Toon E, McNaught J, Timmermann C, Worboys M, Ananiadou S. Text Mining the History of Medicine. PLoS One 2016; 11:e0144717. [PMID: 26734936 PMCID: PMC4703377 DOI: 10.1371/journal.pone.0144717] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Received: 06/10/2015] [Accepted: 11/23/2015] [Indexed: 11/19/2022] [Open Access]
Abstract
Historical text archives constitute a rich and diverse source of information, which is becoming increasingly accessible owing to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestions of synonyms of user-entered query terms, exploration of different concepts mentioned within search results, or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and evolutions in vocabulary, terminology, language structure and style, compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible, semantically oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.
Affiliation(s)
- Paul Thompson
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, United Kingdom
- Riza Theresa Batista-Navarro
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, United Kingdom
- Georgios Kontonatsios
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, United Kingdom
- Jacob Carter
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, United Kingdom
- Elizabeth Toon
- Centre for the History of Science, Technology and Medicine, University of Manchester, Manchester, United Kingdom
- John McNaught
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, United Kingdom
- Carsten Timmermann
- Centre for the History of Science, Technology and Medicine, University of Manchester, Manchester, United Kingdom
- Michael Worboys
- Centre for the History of Science, Technology and Medicine, University of Manchester, Manchester, United Kingdom
- Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, United Kingdom
|