1
Egorov M, Funkner A. Automatic Extraction and Decryption of Abbreviations from Domain-Specific Texts. Stud Health Technol Inform 2021; 285:281-284. [PMID: 34734887 DOI: 10.3233/shti210615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
This paper explores the problems of extraction and decryption of abbreviations from domain-specific texts in Russian. The main focus is unstructured electronic medical records, which pose specific preprocessing problems. The major challenge is that there is no uniform way to write medical histories. The aim of the paper is to generalize the way of decrypting abbreviations from any variant of text. A dataset of nearly three million medical records was collected. A classifier model was trained to extract and decrypt abbreviations. After testing the proposed method with 224,307 records, the model showed an F1 score of 93.7% on a valid dataset.
2
Alfattni G, Peek N, Nenadic G. Attention-based bidirectional long short-term memory networks for extracting temporal relationships from clinical discharge summaries. J Biomed Inform 2021; 123:103915. [PMID: 34600144 DOI: 10.1016/j.jbi.2021.103915] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2021] [Revised: 08/05/2021] [Accepted: 09/09/2021] [Indexed: 10/20/2022]
Abstract
Temporal relation extraction between health-related events is a widely studied task in clinical Natural Language Processing (NLP). The current state-of-the-art methods mostly rely on engineered features (i.e., rule-based modelling) and sequence modelling, which often encodes a source sentence into a single fixed-length context. An obvious disadvantage of this fixed-length context design is its inability to model longer sentences, as important temporal information in the clinical text may appear at different positions. To address this issue, we propose an Attention-based Bidirectional Long Short-Term Memory (Att-BiLSTM) model to enable learning the important semantic information in long source text segments and to better determine which parts of the text are most important. We experimented with two embeddings and compared the performance to traditional state-of-the-art methods that require elaborate linguistic pre-processing and hand-engineered features. The experimental results on the i2b2 2012 temporal relation test corpus show that the proposed method achieves a significant improvement with an F-score of 0.811, which is at least 10% better than the state of the art in the field. We show that the model can be remarkably effective at classifying temporal relations when provided with word embeddings trained on corpora in a general domain. Finally, we perform an error analysis to gain insight into the common errors made by the model.
Affiliation(s)
- Ghada Alfattni: Department of Computer Science, University of Manchester, Manchester, UK; Department of Computer Science, Jamoum University College, Umm Al-Qura University, Makkah, Saudi Arabia.
- Niels Peek: Centre for Health Informatics, Division of Informatics, Imaging and Data Sciences, University of Manchester, Manchester, UK; National Institute of Health Research Manchester Biomedical Research Centre, Manchester Academic Health Science Centre, University of Manchester, Manchester, UK; The Alan Turing Institute, UK.
- Goran Nenadic: Department of Computer Science, University of Manchester, Manchester, UK; The Alan Turing Institute, UK.
3
Fu JT, Sholle E, Krichevsky S, Scandura J, Campion TR. Extracting and classifying diagnosis dates from clinical notes: A case study. J Biomed Inform 2020; 110:103569. [PMID: 32949781 DOI: 10.1016/j.jbi.2020.103569] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 02/10/2020] [Revised: 08/24/2020] [Accepted: 09/12/2020] [Indexed: 11/29/2022]
Abstract
Myeloproliferative neoplasms (MPNs) are chronic hematologic malignancies that may progress over long disease courses. The original date of diagnosis is an important piece of information for patient care and research, but is not consistently documented. We describe an attempt to build a pipeline for extracting dates with natural language processing (NLP) tools and techniques and classifying them as relevant diagnoses or not. Inaccurate and incomplete date extraction and interpretation impacted the performance of the overall pipeline. Existing lightweight Python packages tended to have low specificity for identifying and interpreting partial and relative dates in clinical text. A rules-based regular expression (regex) approach achieved recall of 83.0% on dates manually annotated as diagnosis dates, and 77.4% on all annotated dates. With only 3.8% of annotated dates representing initial MPN diagnoses, additional methods of targeting candidate date instances may alleviate noise and class imbalance.
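As a rough illustration of what a rules-based regex approach to date extraction can look like, the following Python sketch matches full numeric dates, month-year partial dates, and bare years. The patterns, the `extract_dates` helper, and the sample note are assumptions made for illustration; they are not the rules used in the study, and relative dates (for example "two years ago") are not handled.

```python
import re

# Hypothetical patterns for illustration only; not the rule set from the paper.
MONTHS = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
DATE_PATTERN = re.compile(
    rf"\b(?:\d{{1,2}}/\d{{1,2}}/\d{{2,4}}"  # full numeric date, e.g. 3/14/2013
    rf"|\d{{1,2}}/\d{{4}}"                   # partial date, e.g. 6/2015
    rf"|{MONTHS}\s+\d{{4}}"                  # month name and year, e.g. March 2012
    rf"|(?:19|20)\d{{2}})\b"                 # bare four-digit year, e.g. 2012
)


def extract_dates(text):
    """Return every date-like substring of text, in order of appearance."""
    return [match.group(0) for match in DATE_PATTERN.finditer(text)]


note = "Dx of PV in March 2012; JAK2 testing on 3/14/2013, follow-up 6/2015."
print(extract_dates(note))
```

A pattern list of this kind only finds candidate strings. Deciding which candidates are diagnosis dates is a separate classification step, which is where the class imbalance reported in the abstract becomes relevant.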
Affiliation(s)
- Julia T Fu: Department of Health Policy and Research, Weill Cornell Medicine, 402 E. 67th St, New York, NY 10065, United States; Division of Health Informatics, Memorial Sloan Kettering Cancer Center, 600 3rd Ave, 8th Fl, New York, NY 10016, United States.
- Evan Sholle: Department of Health Policy and Research, Weill Cornell Medicine, 402 E. 67th St, New York, NY 10065, United States; Information Technologies & Services, Weill Cornell Medicine, 575 Lexington Ave, 3rd Fl, New York, NY 10022, United States.
- Spencer Krichevsky: Joint Clinical Trials Office, Weill Cornell Medicine, 1300 York Ave, Box 305, New York, NY 10065, United States.
- Joseph Scandura: Department of Hematology and Oncology, Weill Cornell Medicine, 428 E 72nd St, Ste 300, New York, NY 10065, United States.
- Thomas R Campion: Department of Health Policy and Research, Weill Cornell Medicine, 402 E. 67th St, New York, NY 10065, United States; Information Technologies & Services, Weill Cornell Medicine, 575 Lexington Ave, 3rd Fl, New York, NY 10022, United States; Clinical and Translational Science Center, Weill Cornell Medicine, 1300 York Ave., Box 149, New York, NY 10065, United States; Department of Pediatrics, Weill Cornell Medicine, 525 E 68th St, Rm M610A, New York, NY 10065, United States.
4
Pan X, Chen B, Weng H, Gong Y, Qu Y. Temporal Expression Classification and Normalization From Chinese Narrative Clinical Texts: Pattern Learning Approach. JMIR Med Inform 2020; 8:e17652. [PMID: 32716307 PMCID: PMC7418025 DOI: 10.2196/17652] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 12/31/2019] [Revised: 02/28/2020] [Accepted: 03/13/2020] [Indexed: 11/13/2022] Open
Abstract
Background: Temporal information frequently exists in the representation of the disease progress, prescription, medication, surgery progress, or discharge summary in narrative clinical text. The accurate extraction and normalization of temporal expressions can positively boost the analysis and understanding of narrative clinical texts to promote clinical research and practice.
Objective: The goal of the study was to propose a novel approach for extracting and normalizing temporal expressions from Chinese narrative clinical text.
Methods: TNorm, a rule-based and pattern learning-based approach, has been developed for automatic temporal expression extraction and normalization from unstructured Chinese clinical text data. TNorm consists of three stages: extraction, classification, and normalization. It applies a set of heuristic rules and automatically generated patterns for temporal expression identification and extraction from clinical texts. Then, it collects the features of extracted temporal expressions for temporal type prediction and classification by using machine learning algorithms. Finally, the features are combined with the rule-based and pattern learning-based approaches to normalize the extracted temporal expressions.
Results: The evaluation dataset is a set of narrative clinical texts in Chinese containing 1459 discharge summaries of a domestic Grade A Class 3 hospital. The results show that TNorm, combined with temporal expression extraction and temporal type prediction, achieves a precision of 0.8491, a recall of 0.8328, and an F1 score of 0.8409 in temporal expression normalization.
Conclusions: This study illustrates an automatic approach, TNorm, that extracts and normalizes temporal expressions from Chinese narrative clinical texts. TNorm was evaluated on the basis of discharge summary data, and results demonstrate its effectiveness on temporal expression normalization.
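The reported F1 score can be checked against the reported precision and recall, since F1 is their harmonic mean. The short Python sketch below does that arithmetic; the function name `f1_score` is chosen here for illustration and does not come from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract above.
print(round(f1_score(0.8491, 0.8328), 4))  # 0.8409
```

The result agrees with the F1 score of 0.8409 stated in the abstract.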
Affiliation(s)
- Xiaoyi Pan: School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China.
- Boyu Chen: School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China.
- Heng Weng: Department of Big Data Research of Medicine, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China.
- Yongyi Gong: School of Information Science and Technology, Guangdong University of Foreign Studies, Guangzhou, China.
- Yingying Qu: School of Business, Guangdong University of Foreign Studies, Guangzhou, China.
5
Workman TE, Shao Y, Divita G, Zeng-Treitler Q. An efficient prototype method to identify and correct misspellings in clinical text. BMC Res Notes 2019; 12:42. [PMID: 30658682 DOI: 10.1186/s13104-019-4073-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/09/2018] [Accepted: 01/11/2019] [Indexed: 11/17/2022] Open
Abstract
Objective: Misspellings in clinical free text present challenges to natural language processing. With an objective to identify misspellings and their corrections, we developed a prototype spelling analysis method that implements Word2Vec, Levenshtein edit distance constraints, a lexical resource, and corpus term frequencies. We used the prototype method to process two different corpora, surgical pathology reports, and emergency department progress and visit notes, extracted from Veterans Health Administration resources. We evaluated performance by measuring positive predictive value and performing an error analysis of false positive output, using four classifications. We also performed an analysis of spelling errors in each corpus, using common error classifications.
Results: In this small-scale study utilizing a total of 76,786 clinical notes, the prototype method achieved positive predictive values of 0.9057 and 0.8979, respectively, for the surgical pathology reports, and emergency department progress and visit notes, in identifying and correcting misspelled words. False positives varied by corpus. Spelling error types were similar between the two corpora; however, the authors of emergency department progress and visit notes made over four times as many errors. Overall, the results of this study suggest that this method could also perform sufficiently in identifying misspellings in other clinical document types.
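To make the combination of an edit distance constraint, a lexical resource, and corpus term frequencies concrete, here is a minimal Python sketch. It is not the authors' prototype: the Word2Vec component is omitted, and the `suggest` function, the distance threshold of 2, and the toy counts are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Levenshtein edit distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word, term_counts, max_distance=2):
    """Suggest a correction for word, or None if it is known or has no candidate.

    term_counts is a hypothetical mapping from lexicon terms to corpus counts.
    """
    if word in term_counts:
        return None  # the word is in the lexicon, so it is not flagged
    candidates = [t for t in term_counts if levenshtein(word, t) <= max_distance]
    # Prefer the most frequent candidate; break ties by smaller edit distance.
    return max(candidates,
               key=lambda t: (term_counts[t], -levenshtein(word, t)),
               default=None)


counts = {"carcinoma": 950, "carcinomas": 120, "sarcoma": 40}  # toy counts
print(suggest("carcinmoa", counts))  # carcinoma
```

Ranking by frequency first is one possible design choice. A real system would also have to guard against mapping a rare but valid term onto a frequent neighbour.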
6
He B, Guan Y, Dai R. Classifying medical relations in clinical text via convolutional neural networks. Artif Intell Med 2019; 93:43-49. [PMID: 29778673 DOI: 10.1016/j.artmed.2018.05.001] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 10/15/2017] [Revised: 02/27/2018] [Accepted: 05/04/2018] [Indexed: 11/15/2022]
Abstract
Deep learning research on relation classification has achieved solid performance in the general domain. This study proposes a convolutional neural network (CNN) architecture with a multi-pooling operation for medical relation classification on clinical records and explores a loss function with a category-level constraint matrix. Experiments using the 2010 i2b2/VA relation corpus demonstrate that these models, which do not depend on any external features, outperform previous single-model methods, and that our best model is competitive with the existing ensemble-based method.
Affiliation(s)
- Bin He: Research Center of Language Technology, Harbin Institute of Technology, Harbin, China.
- Yi Guan: Research Center of Language Technology, Harbin Institute of Technology, Harbin, China.
- Rui Dai: Department of Mathematics, Harbin Institute of Technology, Harbin, China.
7
Abstract
Background: Information extraction from clinical texts helps medical workers identify patients' problems faster and makes intelligent diagnosis possible in the future. There has been a lot of work on disorder mention recognition in clinical narratives, but recognition of more complicated disorder mentions, such as overlapping ones, is still an open issue. This paper proposes a multi-label structured Support Vector Machine (SVM) based method for disorder mention recognition. We present a multi-label scheme which could be used in complicated entity recognition tasks.
Results: We performed three sets of experiments to evaluate our model. Our best F1-Score on the 2013 Conference and Labs of the Evaluation Forum data set is 0.7343. There are six types of labels in our multi-label scheme, all of which are represented by 24-bit binary numbers. The binary digits of each label contain information about different disorder mentions. Our multi-label method can recognize not only disorder mentions in the form of contiguous or discontiguous words but also mentions whose spans overlap with each other. The experiments indicate that our multi-label structured SVM model outperforms the conditional random field (CRF) model for this disorder mention recognition task. The experiments show that our multi-label scheme surpasses the baseline. Especially for overlapping disorder mentions, the F1-Score of our multi-label scheme is 0.1428 higher than the baseline BIOHD1234 scheme.
Conclusions: This multi-label structured SVM based approach is demonstrated to work well with this disorder recognition task. The novel multi-label scheme we presented is superior to the baseline and it can be used in other models to solve various types of complicated entity recognition tasks as well.
Affiliation(s)
- Wutao Lin: School of Electronics Engineering and Computer Science, Peking University, Beijing, 100871, China.
- Donghong Ji: School of Computer, Wuhan University, Wuhan, 430072, China.
- Yanan Lu: School of Computer, Wuhan University, Wuhan, 430072, China.
8
Hughes M, Li I, Kotoulas S, Suzumura T. Medical Text Classification Using Convolutional Neural Networks. Stud Health Technol Inform 2017; 235:246-250. [PMID: 28423791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
We present an approach to automatically classify clinical text at a sentence level. We use deep convolutional neural networks to represent complex features. We train the network on a dataset providing a broad categorization of health information. Through a detailed evaluation, we demonstrate that our method outperforms several approaches widely used in natural language processing tasks by about 15%.
9
Abstract
The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.
Affiliation(s)
- Aleksandar Savkov: Department of Informatics, University of Sussex, Brighton, BN1 9QJ UK.
- John Carroll: Department of Informatics, University of Sussex, Brighton, BN1 9QJ UK.
- Rob Koeling: Department of Informatics, University of Sussex, Brighton, BN1 9QJ UK.
- Jackie Cassell: Division of Primary Care and Public Health, Brighton and Sussex Medical School, Brighton, BN1 9PH UK.
10
Zhu D, Wu S, Carterette B, Liu H. Using large clinical corpora for query expansion in text-based cohort identification. J Biomed Inform 2014; 49:275-81. [PMID: 24680983 DOI: 10.1016/j.jbi.2014.03.010] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.6] [Received: 10/10/2013] [Revised: 02/18/2014] [Accepted: 03/15/2014] [Indexed: 10/25/2022]
Abstract
In light of the heightened problems of polysemy, synonymy, and hyponymy in clinical text, we hypothesize that patient cohort identification can be improved by using a large, in-domain clinical corpus for query expansion. We evaluate the utility of four auxiliary collections for the Text REtrieval Conference task of IR-based cohort retrieval, considering the effects of collection size, the inherent difficulty of a query, and the interaction between the collections. Each collection was applied to aid in cohort retrieval from the Pittsburgh NLP Repository by using a mixture of relevance models. Measured by mean average precision, performance using any auxiliary resource (MAP=0.386 and above) is shown to improve over the baseline query likelihood model (MAP=0.373). Considering subsets of the Mayo Clinic collection, we found that after including 2.5 billion term instances, retrieval is not improved by adding more instances. However, adding the Mayo Clinic collection did improve performance significantly over any existing setup, with a system using all four auxiliary collections obtaining the best results (MAP=0.4223). Because optimal results in the mixture of relevance models would require selective sampling of the collections, the common sense approach of "use all available data" is inappropriate. However, we found that it was still beneficial to add the Mayo corpus to any mixture of relevance models. On the task of IR-based cohort identification, query expansion with the Mayo Clinic corpus resulted in consistent and significant improvements. As such, any IR query expansion with access to a large clinical corpus could benefit from the additional resource. Additionally, we have shown that more data is not necessarily better, implying that there is value in collection curation.
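For readers unfamiliar with the metric, mean average precision (MAP) averages, over all queries, the precision measured at each rank where a relevant document appears. The Python sketch below uses this common textbook definition; the document identifiers and relevance judgments are invented for illustration, and official evaluation tooling may differ in details such as how unjudged documents are handled.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of precision@k over the ranks k at which a relevant item occurs."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids) if relevant_ids else 0.0


def mean_average_precision(runs):
    """Mean of average precision over (ranking, relevant set) pairs, one per query."""
    return sum(average_precision(ranking, relevant) for ranking, relevant in runs) / len(runs)


# Two hypothetical queries with invented rankings and relevance judgments.
runs = [
    (["a", "b", "c", "d"], {"a", "c"}),  # AP = (1/1 + 2/3) / 2
    (["x", "a"], {"a"}),                 # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```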
Affiliation(s)
- Dongqing Zhu: Department of Computer and Information Sciences, University of Delaware, 440 Smith Hall, Newark, DE 19716, USA.
- Stephen Wu: Division of Biomedical Statistics and Informatics, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA.
- Ben Carterette: Department of Computer and Information Sciences, University of Delaware, 440 Smith Hall, Newark, DE 19716, USA.
- Hongfang Liu: Division of Biomedical Statistics and Informatics, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA.