1
Keszthelyi D, Gaudet-Blavignac C, Bjelogrlic M, Lovis C. Patient Information Summarization in Clinical Settings: Scoping Review. JMIR Med Inform 2023; 11:e44639. PMID: 38015588. DOI: 10.2196/44639.
Abstract
BACKGROUND Information overflow, a common problem in today's clinical environment, can be mitigated by summarizing clinical data. Although several solutions for clinical summarization exist, a complete overview of the research in this field is lacking. OBJECTIVE This study aims to identify state-of-the-art solutions for clinical summarization, analyze their capabilities, and identify their properties. METHODS A scoping review of articles published between 2005 and 2022 was conducted. With a clinical focus, PubMed and Web of Science were queried to find an initial set of reports, later extended by articles found through a chain of citations. The included reports were analyzed to answer where, what, and how medical information is summarized; whether summarization conserves temporality, uncertainty, and medical pertinence; and how the propositions are evaluated and deployed. To answer how information is summarized, methods were compared through a new "collect-synthesize-communicate" framework, referring to information gathering from data, its synthesis, and its communication to the end user. RESULTS Overall, 128 articles representing various medical fields were included. Exclusively structured data were used as input in 46.1% (59/128) of the papers, text in 41.4% (53/128), and both in 10.2% (13/128). Using the proposed framework, 42.2% (54/128) of the records contributed to information collection, 27.3% (35/128) to information synthesis, and 46.1% (59/128) presented solutions for summary communication. Numerous summarization approaches were presented, including extractive (n=13) and abstractive summarization (n=19); topic modeling (n=5); summary specification (n=11); concept and relation extraction (n=30); visual design considerations (n=59); and complete pipelines (n=7) using information extraction, synthesis, and communication. Graphical displays (n=53), short texts (n=41), static reports (n=7), and problem-oriented views (n=7) were the most common forms of summary communication. Although most studies did not conserve temporality or uncertainty information (74/128, 57.8% and 113/128, 88.3%, respectively), some presented solutions to handle this information. Overall, 115 (89.8%) articles reported an evaluation; methods involving human participants (median 15, IQR 24 participants) included measurements in experiments (n=31), real situations (n=8), and usability studies (n=28). Methods without human involvement included intrinsic evaluation (n=24) and performance on a proxy (n=10) or on domain-specific tasks (n=11). Overall, 11 (8.6%) reports described a system deployed in clinical settings. CONCLUSIONS The scientific literature contains many propositions for summarizing patient information but reports very few comparisons among them. This work proposes to compare these algorithms by how they conserve essential aspects of clinical information and through the "collect-synthesize-communicate" framework. We found that current propositions usually address these 3 steps only partially. Moreover, they conserve and use temporality, uncertainty, and pertinent medical aspects to varying extents, and solutions are often preliminary.
Affiliation(s)
- Daniel Keszthelyi
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Christophe Gaudet-Blavignac
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Mina Bjelogrlic
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Christian Lovis
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
2
Wood J, Arnold C, Wang W. Knowledge Source Rankings for Semi-Supervised Topic Modeling. Information 2022; 13:57. DOI: 10.3390/info13020057.
Abstract
Recent work suggests that knowledge sources can be added to the topic modeling process to label topics and improve topic discovery. These knowledge sources typically consist of a collection of human-constructed articles, each describing a topic (an "article-topic") for an entire domain. However, these semisupervised topic models assume that a corpus contains topics from only a subset of a domain. Therefore, during inference, the model must consider which article-topics were theoretically used to generate the corpus. Since knowledge sources tend to be quite large, the many article-topics considered slow down inference. The increase in execution time is significant: knowledge sources with more than 10³ article-topics become unfeasible for use in topic modeling. To increase the applicability of semisupervised topic models, approaches are needed to speed up overall execution time. This paper presents a way of ranking knowledge source topics to satisfy that goal. Our approach uses a knowledge source ranking, based on the PageRank algorithm, to determine the importance of each article-topic. By applying this ranking technique, we can eliminate low-scoring article-topics before inference, speeding up the overall process. Remarkably, this ranking technique can also improve perplexity and interpretability. Results show our approach outperforms baseline methods and significantly aids semisupervised topic models. In our evaluation, knowledge source rankings yield a 44% increase in topic retrieval F-score, a 42.6% increase in inter-inference topic elimination, a 64% improvement in perplexity, a 30% increase in token assignment accuracy, a 20% increase in topic composition interpretability, and a 5% increase in document assignment interpretability over baseline methods.
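The ranking step described above can be sketched as plain PageRank over a graph of article-topics, followed by pruning the low-scoring ones before inference. The toy link graph and topic names below are illustrative stand-ins, not the paper's knowledge source.

```python
# Sketch: rank knowledge-source article-topics with PageRank, then keep only
# the top-scoring ones before topic-model inference. The link graph is a toy
# stand-in for real inter-article links (e.g., hyperlinks between articles).

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each node to a list of outbound neighbours."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in links.items():
            if not outs:                      # dangling node: spread evenly
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                for u in outs:
                    new[u] += damping * rank[v] / len(outs)
        rank = new
    return rank

# Toy knowledge source: four article-topics and their cross-references.
links = {
    "Pneumonia": ["Influenza", "Sepsis"],
    "Influenza": ["Pneumonia"],
    "Sepsis": ["Pneumonia"],
    "Gardening": [],                          # unrelated, never linked to
}
scores = pagerank(links)
# Keep the top-k article-topics; low-scoring ones are pruned before inference.
top_topics = sorted(scores, key=scores.get, reverse=True)[:3]
```

Well-linked article-topics accumulate rank, so the weakly connected "Gardening" topic falls out of the top-k and is never considered during inference.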
3
Meng Y, Speier W, Ong MK, Arnold CW. Bidirectional Representation Learning From Transformers Using Multimodal Electronic Health Record Data to Predict Depression. IEEE J Biomed Health Inform 2021; 25:3121-3129. PMID: 33661740. DOI: 10.1109/jbhi.2021.3063721.
Abstract
Advancements in machine learning algorithms have had a beneficial impact on representation learning, classification, and prediction models built using electronic health record (EHR) data. Effort has been put both into increasing models' overall performance and into improving their interpretability, particularly regarding the decision-making process. In this study, we present a temporal deep learning model that performs bidirectional representation learning on EHR sequences with a transformer architecture to predict future diagnosis of depression. This model is able to aggregate five heterogeneous and high-dimensional data sources from the EHR and process them in a temporal manner for chronic disease prediction at various prediction windows. We applied the current trend of pretraining and fine-tuning on EHR data to outperform the current state of the art in chronic disease prediction and to demonstrate the underlying relations between EHR codes in the sequence. The model generated the largest increase in precision-recall area under the curve (PRAUC), from 0.70 to 0.76, in depression prediction compared with the best baseline model. Furthermore, the self-attention weights in each sequence quantitatively demonstrated the inner relationships between various codes, which improved the model's interpretability. These results demonstrate the model's ability to utilize heterogeneous EHR data to predict depression while achieving high accuracy and interpretability, which may facilitate constructing clinical decision support systems in the future for chronic disease screening and early detection.
4
Meng Y, Speier W, Ong M, Arnold CW. HCET: Hierarchical Clinical Embedding With Topic Modeling on Electronic Health Records for Predicting Future Depression. IEEE J Biomed Health Inform 2021; 25:1265-1272. PMID: 32749975. DOI: 10.1109/jbhi.2020.3004072.
Abstract
Recent developments in machine learning algorithms have enabled models to exhibit impressive performance in healthcare tasks using electronic health record (EHR) data. However, the heterogeneous nature and sparsity of EHR data remain challenging. In this work, we present a model that utilizes heterogeneous data and addresses sparsity by representing diagnosis, procedure, and medication codes with temporal Hierarchical Clinical Embeddings combined with Topic modeling (HCET) on clinical notes. HCET aggregates various categories of EHR data and learns inherent structure based on hospital visits for an individual patient. We demonstrate the potential of the approach in the task of predicting depression at various time points prior to a clinical diagnosis. We found that HCET outperformed all baseline methods, with a maximum improvement of 0.07 in precision-recall area under the curve (PRAUC). Furthermore, applying attention weights across EHR data modalities significantly improved both the performance and the model's interpretability by revealing the relative weight of each data modality. Our results demonstrate the model's ability to utilize heterogeneous EHR information to predict depression, which may have future implications for screening and early detection.
5
Spasic I, Button K. Patient Triage by Topic Modeling of Referral Letters: Feasibility Study. JMIR Med Inform 2020; 8:e21252. PMID: 33155985. PMCID: PMC7679210. DOI: 10.2196/21252.
Abstract
Background Musculoskeletal conditions are managed within primary care, but patients can be referred to secondary care if a specialist opinion is required. The ever-increasing demand for health care resources emphasizes the need to streamline care pathways, with the ultimate aim of ensuring that patients receive timely and optimal care. Information contained in referral letters underpins the referral decision-making process but has yet to be explored systematically for the purposes of treatment prioritization for musculoskeletal conditions. Objective This study aims to explore the feasibility of using natural language processing and machine learning to automate the triage of patients with musculoskeletal conditions by analyzing information from referral letters. Specifically, we aim to determine whether referral letters can be automatically grouped into latent topics that are clinically relevant, that is, considered relevant when prescribing treatments. Here, clinical relevance is assessed by posing 2 research questions. Can latent topics be used to automatically predict treatment? Can clinicians interpret latent topics as cohorts of patients who share common characteristics or experiences, such as medical history, demographics, and possible treatments? Methods We used latent Dirichlet allocation to model each referral letter as a finite mixture over an underlying set of topics and to model each topic as an infinite mixture over an underlying set of topic probabilities. The topic model was evaluated in the context of automating patient triage. Given a set of treatment outcomes, a binary classifier was trained for each outcome using the previously extracted topics as the input features of the machine learning algorithm. In addition, a qualitative evaluation was performed to assess the human interpretability of the topics. Results The prediction accuracy of the binary classifiers outperformed the stratified random classifier by a large margin, indicating that topic modeling could be used to predict treatment and thus effectively support patient triage. The qualitative evaluation confirmed the high clinical interpretability of the topic model. Conclusions The results established the feasibility of using natural language processing and machine learning to automate the triage of patients with knee or hip pain by analyzing information from their referral letters.
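The pipeline described in this abstract can be sketched in a few lines with scikit-learn: LDA turns each letter into a topic mixture, and the mixtures feed a per-outcome binary classifier. The corpus and outcome labels below are synthetic placeholders, not the study's referral letters.

```python
# Sketch of the triage pipeline: LDA topic mixtures as classifier features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

letters = [
    "knee pain swelling after fall physiotherapy requested",
    "hip pain stiffness osteoarthritis surgery considered",
    "knee instability sports injury physiotherapy referral",
    "hip replacement candidate severe osteoarthritis surgery",
] * 5  # repeat so the toy model has something to fit
surgery = [0, 1, 0, 1] * 5  # one binary treatment outcome per letter

counts = CountVectorizer().fit_transform(letters)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_mix = lda.fit_transform(counts)   # one topic mixture per letter

# One binary classifier per treatment outcome, topics as input features.
clf = LogisticRegression().fit(topic_mix, surgery)
accuracy = clf.score(topic_mix, surgery)
```

In the study this is repeated for each treatment outcome, and the learned topics are additionally inspected by clinicians for interpretability.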
Affiliation(s)
- Irena Spasic
- School of Computer Science & Informatics, Cardiff University, Cardiff, United Kingdom
- Kate Button
- School of Healthcare Sciences, Cardiff University, Cardiff, United Kingdom
6
Juan L, Wang Y, Jiang J, Yang Q, Wang G, Wang Y. Evaluating individual genome similarity with a topic model. Bioinformatics 2020; 36:4757-4764. PMID: 32573702. DOI: 10.1093/bioinformatics/btaa583.
Abstract
MOTIVATION Evaluating genome similarity among individuals is an essential step in data analysis. Advanced sequencing technology detects more, and rarer, variants across massive numbers of individual genomes, enabling individual-level genome similarity evaluation. However, current methodologies, such as principal component analysis (PCA), lack the capability to fully leverage rare variants and are also difficult to interpret in terms of population genetics. RESULTS Here, we introduce a probabilistic topic model, latent Dirichlet allocation, to evaluate individual genome similarity. A total of 2535 individuals from the 1000 Genomes Project (KGP) were used to demonstrate our method. Various aspects of variant choice and model parameter selection were studied. We found that relatively rare (0.001 < allele frequency < 0.175) and sparse (average interval > 20 000 bp) variants are more efficient for genome similarity evaluation, and that at least 100 000 such variants are necessary. In our results, the populations show a significantly less mixed and more cohesive visualization than the PCA results. The global similarities among the KGP genomes are consistent with known geographical, historical, and cultural factors. AVAILABILITY AND IMPLEMENTATION The source code and data access are available at: https://github.com/lrjuan/LDA_genome. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
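The variant-selection criterion reported above (allele frequency between 0.001 and 0.175, average spacing over 20 000 bp) can be sketched as a simple filter; the variant tuples below are hypothetical, not KGP data.

```python
# Sketch of the reported selection criterion: keep variants whose allele
# frequency lies in (0.001, 0.175) and thin them so consecutive kept variants
# are more than 20 000 bp apart.

def select_variants(variants, af_lo=0.001, af_hi=0.175, min_gap=20_000):
    """variants: list of (position, allele_frequency), sorted by position."""
    kept, last_pos = [], None
    for pos, af in variants:
        if not (af_lo < af < af_hi):
            continue                  # too common or too rare
        if last_pos is not None and pos - last_pos <= min_gap:
            continue                  # too close to the previous kept variant
        kept.append((pos, af))
        last_pos = pos
    return kept

variants = [
    (1_000, 0.05),     # rare enough and first -> kept
    (5_000, 0.05),     # within 20 kb of the previous kept variant -> dropped
    (30_000, 0.40),    # too common -> dropped
    (60_000, 0.01),    # kept
    (90_000, 0.0005),  # too rare -> dropped
]
selected = select_variants(variants)
```

The surviving variants would then be encoded as "words" of per-individual documents for the LDA step.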
Affiliation(s)
- Yongtian Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Qi Yang
- School of Life Science and Technology
- Guohua Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
- Yadong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
7
Afshar M, Joyce C, Dligach D, Sharma B, Kania R, Xie M, Swope K, Salisbury-Afshar E, Karnik NS. Subtypes in patients with opioid misuse: A prognostic enrichment strategy using electronic health record data in hospitalized patients. PLoS One 2019; 14:e0219717. PMID: 31310611. PMCID: PMC6634397. DOI: 10.1371/journal.pone.0219717.
Abstract
BACKGROUND Approaches are needed to better delineate the continuum of opioid misuse that occurs in hospitalized patients. A prognostic enrichment strategy with latent class analysis (LCA) may facilitate treatment strategies in subtypes of opioid misuse. We aim to identify subtypes of patients with opioid misuse and examine the distinctions between the subtypes through patient characteristics, topic models from clinical notes, and clinical outcomes. METHODS This was an observational study of inpatient hospitalizations at a tertiary care center between 2007 and 2017. Patients with opioid misuse were identified using an operational definition applied to all inpatient encounters. LCA with eight class-defining variables from the electronic health record (EHR) was applied to identify subtypes in the cohort of patients with opioid misuse. Comparisons between subtypes were made using the following approaches: (1) descriptive statistics on patient characteristics and healthcare utilization using EHR data and census-level data; (2) topic models with natural language processing (NLP) from clinical notes; (3) association with hospital outcomes. FINDINGS The analysis cohort comprised 6,224 patient encounters with opioid misuse (2.7% of all hospitalizations) with a data corpus of 422,147 clinical notes. LCA identified four subtypes with differing patient characteristics, topics from the clinical notes, and hospital outcomes. Class 1 was characterized by high hospital utilization with known opioid-related conditions (36.5%); Class 2 included patients with illicit use, low socioeconomic status, and psychoses (12.8%); Class 3 contained patients with alcohol use disorders with complications (39.2%); and Class 4 consisted of those with low hospital utilization and incidental opioid misuse (11.5%). The following hospital outcomes were the highest for each subtype when compared against the other subtypes: readmission for Class 1 (13.9% vs. 10.5%, p<0.01); discharge against medical advice for Class 2 (12.3% vs. 5.3%, p<0.01); and in-hospital death for Classes 3 and 4 (3.2% vs. 1.9%, p<0.01). CONCLUSIONS A 4-class latent model was the most parsimonious model that defined clinically interpretable and relevant subtypes for opioid misuse. Distinct subtypes were delineated after examining multiple domains of EHR data and applying methods in artificial intelligence. This approach, with LCA and readily available class-defining substance use variables from the EHR, may be applied as a prognostic enrichment strategy for targeted interventions.
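For intuition, latent class analysis over binary EHR indicator variables can be sketched as a Bernoulli mixture fitted by EM. The data and the two-class setup below are synthetic, not the study's 8-variable, 4-class model.

```python
# Minimal EM sketch of latent class analysis (LCA) over binary EHR indicators.
import numpy as np

def lca_fit(X, n_classes, iters=100, seed=0):
    """Fit a Bernoulli mixture to binary data X (n_patients x n_indicators)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    pi = np.full(n_classes, 1.0 / n_classes)         # class priors
    theta = rng.uniform(0.25, 0.75, (n_classes, m))  # P(indicator=1 | class)
    for _ in range(iters):
        # E-step: responsibility of each class for each patient (log domain).
        log_r = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update priors and item probabilities.
        pi = r.mean(axis=0)
        theta = np.clip((r.T @ X) / r.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, r

# Two obvious synthetic subtypes: heavy utilizers vs. incidental misuse.
rng = np.random.default_rng(1)
heavy = (rng.random((40, 3)) < [0.9, 0.8, 0.9]).astype(float)
light = (rng.random((40, 3)) < [0.1, 0.2, 0.1]).astype(float)
X = np.vstack([heavy, light])
pi, theta, resp = lca_fit(X, n_classes=2)
labels = resp.argmax(axis=1)   # hard subtype assignment per patient
```

In practice, model selection (e.g., the 4-class choice above) is made by comparing fit statistics such as BIC across candidate class counts.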
Affiliation(s)
- Majid Afshar
- Department of Public Health Sciences, Loyola University, Maywood, Illinois, United States of America
- Center for Health Outcomes and Informatics Research, Loyola University, Maywood, Illinois, United States of America
- Stritch School of Medicine, Loyola University, Maywood, Illinois, United States of America
- Cara Joyce
- Department of Public Health Sciences, Loyola University, Maywood, Illinois, United States of America
- Center for Health Outcomes and Informatics Research, Loyola University, Maywood, Illinois, United States of America
- Stritch School of Medicine, Loyola University, Maywood, Illinois, United States of America
- Dmitriy Dligach
- Department of Public Health Sciences, Loyola University, Maywood, Illinois, United States of America
- Center for Health Outcomes and Informatics Research, Loyola University, Maywood, Illinois, United States of America
- Department of Computer Science, Loyola University Medical Center, Maywood, Illinois, United States of America
- Brihat Sharma
- Department of Computer Science, Loyola University Medical Center, Maywood, Illinois, United States of America
- Robert Kania
- Department of Computer Science, Loyola University Medical Center, Maywood, Illinois, United States of America
- Meng Xie
- Department of Mathematics and Statistics, Loyola University, Chicago, Illinois, United States of America
- Kristin Swope
- Department of Public Health Sciences, Loyola University, Maywood, Illinois, United States of America
- Stritch School of Medicine, Loyola University, Maywood, Illinois, United States of America
- Elizabeth Salisbury-Afshar
- Center for Multi-System Solutions to the Opioid Epidemic, American Institute for Research, Chicago, Illinois, United States of America
- Niranjan S. Karnik
- Department of Psychiatry & Behavioral Sciences, Rush University Medical Center, Chicago, Illinois, United States of America
8
Rusanov A, Miotto R, Weng C. Trends in anesthesiology research: a machine learning approach to theme discovery and summarization. JAMIA Open 2018; 1:283-293. PMID: 30474079. PMCID: PMC6241511. DOI: 10.1093/jamiaopen/ooy009.
Abstract
Objectives Traditionally, summarization of research themes and trends within a given discipline was accomplished by manual review of scientific works in the field. However, with the ushering in of the age of "big data," new methods for discovering such information become necessary, as traditional techniques become increasingly difficult to apply given the exponential growth of document repositories. Our objectives are to develop a pipeline for unsupervised theme extraction and summarization of thematic trends in document repositories, and to test it by applying it to a specific domain. Methods To that end, we detail a pipeline that utilizes machine learning and natural language processing for unsupervised theme extraction, a novel method for summarization of thematic trends, and network mapping for visualization of thematic relations. We then apply this pipeline to a collection of anesthesiology abstracts. Results We demonstrate how this pipeline enables discovery of major themes and temporal trends in anesthesiology research and facilitates document classification and corpus exploration. Discussion The relation of prevalent topics and extracted trends to recent events in both anesthesiology and healthcare in general demonstrates the pipeline's utility. Furthermore, the agreement between the unsupervised thematic grouping and human-assigned classification validates the pipeline's accuracy and demonstrates another potential use. Conclusion The described pipeline enables summarization and exploration of large document repositories, facilitates classification, and aids trend identification. A more robust and user-friendly interface will facilitate the expansion of this methodology to other domains; this will be the focus of future work for our group.
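The trend-summarization step can be sketched as each topic's mean weight per publication year; the documents and topic mixtures below are made up, not the anesthesiology corpus.

```python
# Sketch of trend summarization: average per-document topic weights by year.
from collections import defaultdict

docs = [  # (publication year, topic mixture for that abstract)
    (2015, {"airway": 0.7, "monitoring": 0.3}),
    (2015, {"airway": 0.6, "monitoring": 0.4}),
    (2016, {"airway": 0.2, "monitoring": 0.8}),
    (2016, {"airway": 0.4, "monitoring": 0.6}),
]

totals = defaultdict(lambda: defaultdict(float))
counts = defaultdict(int)
for year, mix in docs:
    counts[year] += 1
    for topic, weight in mix.items():
        totals[year][topic] += weight

# Mean topic prevalence per year: rising/falling values sketch the "trend".
trend = {year: {t: s / counts[year] for t, s in topics.items()}
         for year, topics in totals.items()}
```

Plotting these yearly means per topic gives the kind of temporal-trend view the abstract describes.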
Affiliation(s)
- Alexander Rusanov
- Department of Anesthesiology, Columbia University, New York, New York, USA
- Riccardo Miotto
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, New York, USA
9
Sultanum N, Singh D, Brudno M, Chevalier F. Doccurate: A Curation-Based Approach for Clinical Text Visualization. IEEE Trans Vis Comput Graph 2018; 25:142-151. PMID: 30136959. DOI: 10.1109/tvcg.2018.2864905.
Abstract
Before seeing a patient, physicians seek to obtain an overview of the patient's medical history. Text plays a major role in this activity since it represents the bulk of the clinical documentation, but reviewing it quickly becomes onerous when patient charts grow too large. Text visualization methods have been widely explored to manage this large scale through visual summaries that rely on information retrieval algorithms to structure text and make it amenable to visualization. However, the integration with such automated approaches comes with a number of limitations, including significant error rates and the need for healthcare providers to fine-tune algorithms without expert knowledge of their inner mechanics. In addition, several of these approaches obscure or substitute the original clinical text and therefore fail to leverage the qualitative and rhetorical flavours of the clinical notes. These drawbacks have limited the adoption of text visualization and other summarization technologies in clinical practice. In this work, we present Doccurate, a novel system embodying a curation-based approach for the visualization of large clinical text datasets. Our approach offers automation auditing and customizability to physicians while also preserving and extensively linking to the original text. We discuss findings of a formal qualitative evaluation conducted with 6 domain experts, shedding light on physicians' information needs, perceived strengths and limitations of automated tools, and the importance of customization while balancing efficiency. We also present use case scenarios to showcase Doccurate's envisioned usage in practice.
10
Hardjojo A, Gunachandran A, Pang L, Abdullah MRB, Wah W, Chong JWC, Goh EH, Teo SH, Lim G, Lee ML, Hsu W, Lee V, Chen MIC, Wong F, Phang JSK. Validation of a Natural Language Processing Algorithm for Detecting Infectious Disease Symptoms in Primary Care Electronic Medical Records in Singapore. JMIR Med Inform 2018; 6:e36. PMID: 29907560. PMCID: PMC6026305. DOI: 10.2196/medinform.8204.
Abstract
Background Free-text clinical records provide a source of information that complements traditional disease surveillance. To electronically harness these records, they need to be transformed into codified fields by natural language processing algorithms. Objective The aim of this study was to develop, train, and validate Clinical History Extractor for Syndromic Surveillance (CHESS), a natural language processing algorithm to extract clinical information from free-text primary care records. Methods CHESS is a keyword-based natural language processing algorithm that extracts 48 signs and symptoms suggesting respiratory infections, gastrointestinal infections, and constitutional symptoms, as well as other signs and symptoms potentially associated with infectious diseases. The algorithm also captured the assertion status (affirmed, negated, or suspected) and symptom duration. Electronic medical records from the National Healthcare Group Polyclinics, a major public sector primary care provider in Singapore, were randomly extracted and manually reviewed by 2 human reviewers, with a third reviewer as the adjudicator. The algorithm was evaluated on 1680 notes against the human-coded result as the reference standard, with half of the data used for training and the other half for validation. Results The symptoms most commonly present within the 1680 clinical records at the episode level were those typically present in respiratory infections, such as cough (744/7703, 9.66%), sore throat (591/7703, 7.67%), rhinorrhea (552/7703, 7.17%), and fever (928/7703, 12.04%). At the episode level, CHESS had an overall performance of 96.7% precision and 97.6% recall on the training dataset and 96.0% precision and 93.1% recall on the validation dataset. Symptoms suggesting respiratory and gastrointestinal infections were all detected with more than 90% precision and recall. CHESS correctly assigned the assertion status in 97.3%, 97.9%, and 89.8% of affirmed, negated, and suspected signs and symptoms, respectively (97.6% overall accuracy). Symptom episode duration was correctly identified in 81.2% of records with known duration status. Conclusions We have developed a natural language processing algorithm dubbed CHESS that achieves good performance in extracting signs and symptoms from primary care free-text clinical records. In addition to the presence of symptoms, our algorithm can also accurately distinguish affirmed, negated, and suspected assertion statuses and extract symptom durations.
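A minimal sketch of the keyword-plus-assertion idea follows; this is not the actual CHESS algorithm, and the symptom lexicon and cue lists are illustrative only.

```python
# Sketch: keyword symptom extraction with a crude assertion-status check that
# inspects the tokens immediately preceding each matched symptom keyword.
import re

SYMPTOMS = {"cough", "fever", "sore throat", "rhinorrhea"}
NEGATION = {"no", "denies", "without"}
SUSPECTED = {"possible", "suspected", "query"}

def extract(note):
    # Normalize to lowercase word tokens so keyword matching is simple.
    text = " ".join(re.findall(r"[a-z]+", note.lower()))
    found = []
    for symptom in SYMPTOMS:
        m = re.search(r"\b" + symptom + r"\b", text)
        if not m:
            continue
        # Inspect the two tokens before the keyword for assertion cues.
        window = text[:m.start()].split()[-2:]
        if any(tok in NEGATION for tok in window):
            status = "negated"
        elif any(tok in SUSPECTED for tok in window):
            status = "suspected"
        else:
            status = "affirmed"
        found.append((symptom, status))
    return sorted(found)

result = extract("Patient has cough for 3 days, denies fever. Possible sore throat.")
```

A production system like the one described above would also handle multi-token cue scopes, sentence boundaries, and duration expressions, which this sketch deliberately omits.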
Affiliation(s)
- Antony Hardjojo
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Arunan Gunachandran
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Long Pang
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Mohammed Ridzwan Bin Abdullah
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Win Wah
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Joash Wen Chen Chong
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Ee Hui Goh
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Sok Huang Teo
- National Healthcare Group Polyclinics, Singapore, Singapore
- Gilbert Lim
- School of Computing, National University of Singapore, Singapore, Singapore
- Mong Li Lee
- School of Computing, National University of Singapore, Singapore, Singapore
- Wynne Hsu
- School of Computing, National University of Singapore, Singapore, Singapore
- Vernon Lee
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- Mark I-Cheng Chen
- Saw Swee Hock School of Public Health, National University Health System, National University of Singapore, Singapore, Singapore
- National Centre for Infectious Diseases, Singapore, Singapore
- Franco Wong
- National Healthcare Group Polyclinics, Singapore, Singapore
- National University Polyclinics, Singapore, Singapore
- Jonathan Siung King Phang
- National Healthcare Group Polyclinics, Singapore, Singapore
- National University Polyclinics, Singapore, Singapore
11
Abstract
OBJECTIVE Universal HIV screening programs are costly, labor intensive, and often fail to identify high-risk individuals. Automated risk assessment methods that leverage longitudinal electronic health records (EHRs) could catalyze targeted screening programs. Although social and behavioral determinants of health are typically captured in narrative documentation, previous analyses have considered only structured EHR fields. We examined whether natural language processing (NLP) would improve predictive models of HIV diagnosis. METHODS One hundred eighty-one HIV-positive individuals who received care at New York Presbyterian Hospital before a confirmatory HIV diagnosis and 543 HIV-negative controls selected using propensity score matching were included in the study cohort. EHR data including demographics, laboratory tests, diagnosis codes, and unstructured notes before HIV diagnosis were extracted for modeling. Three predictive models were developed using machine-learning algorithms: (1) a baseline model with only structured EHR data, (2) baseline plus NLP topics, and (3) baseline plus NLP clinical keywords. RESULTS The predictive models demonstrated a range of performance, with F measures of 0.59 for the baseline model, 0.63 for the baseline + NLP topic model, and 0.74 for the baseline + NLP keyword model. The baseline + NLP keyword model yielded the highest precision by including keywords such as "msm," "unprotected," "hiv," and "methamphetamine," and structured EHR data indicative of additional HIV risk factors. CONCLUSIONS NLP improved the predictive performance of automated HIV risk assessment by extracting terms in clinical text indicative of high-risk behavior. Future studies should explore more advanced techniques for extracting social and behavioral determinants from clinical text.
Affiliation(s)
- Daniel J Feller
- Department of Biomedical Informatics, Columbia University, New York, NY
- Jason Zucker
- Division of Infectious Diseases, Department of Medicine, Columbia University, New York, NY
- Michael T Yin
- Division of Infectious Diseases, Department of Medicine, Columbia University, New York, NY
- Peter Gordon
- Division of Infectious Diseases, Department of Medicine, Columbia University, New York, NY
- Noémie Elhadad
- Department of Biomedical Informatics, Columbia University, New York, NY
12
Tapi Nzali MD, Bringay S, Lavergne C, Mollevi C, Opitz T. What Patients Can Tell Us: Topic Analysis for Social Media on Breast Cancer. JMIR Med Inform 2017; 5:e23. [PMID: 28760725] [PMCID: PMC5556259] [DOI: 10.2196/medinform.7779]
Abstract
Background Social media dedicated to health are increasingly used by patients and health professionals. They are rich textual resources with content generated through free exchange between patients. We propose a method for retrieving clinically relevant information from such social media in order to analyze the quality of life of patients with breast cancer. Objective Our aim was to detect the different topics discussed by patients on social media and to relate them to functional and symptomatic dimensions assessed in the internationally standardized self-administered questionnaires used in cancer clinical trials (European Organization for Research and Treatment of Cancer [EORTC] Quality of Life Questionnaire Core 30 [QLQ-C30] and breast cancer module [QLQ-BR23]). Methods First, we applied a classic text mining technique, latent Dirichlet allocation (LDA), to detect the different topics discussed on social media dealing with breast cancer. We applied the LDA model to 2 datasets composed of messages extracted from public Facebook groups and from a public health forum (cancerdusein.org, a French breast cancer forum) with relevant preprocessing. Second, we applied a customized Jaccard coefficient to automatically compute similarity distance between the topics detected with LDA and the questions in the self-administered questionnaires used to study quality of life. Results Among the 23 topics present in the self-administered questionnaires, 22 matched with the topics discussed by patients on social media. Interestingly, these topics corresponded to 95% (22/23) of the forum and 86% (20/23) of the Facebook group topics. These figures underline that topics related to quality of life are an important concern for patients. However, 5 social media topics had no corresponding topic in the questionnaires, indicating that the questionnaires do not cover all of the patients' concerns.
Of these 5 topics, 2 could potentially be used in the questionnaires, and these 2 topics corresponded to a total of 3.10% (523/16,868) of topics in the cancerdusein.org corpus and 4.30% (3014/70,092) of the Facebook corpus. Conclusions We found a good correspondence between detected topics on social media and topics covered by the self-administered questionnaires, which substantiates the sound construction of such questionnaires. We detected new emerging topics from social media that can be used to complete current self-administered questionnaires. Moreover, we confirmed that social media mining is an important source of information for complementary analysis of quality of life.
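A minimal sketch of the matching step described above: scoring the overlap between an LDA topic's top words and the words of a questionnaire item. The paper uses a customized Jaccard coefficient; plain Jaccard is shown here, and the word sets are invented, not the study's data.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two word sets (0.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Top words of one LDA topic vs. words of one questionnaire item (illustrative).
topic_words = {"fatigue", "tired", "sleep", "rest"}
qlq_item = {"tired", "rest", "need"}  # e.g. a QLQ-C30 item about needing rest

print(jaccard(topic_words, qlq_item))  # -> 0.4  (2 shared words / 5 total)
```

A topic would be matched to the questionnaire item with the highest such score, with unmatched topics flagged as candidate new questionnaire content.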
Affiliation(s)
- Mike Donald Tapi Nzali
- Institut Montpelliérain Alexander Grothendieck (IMAG), Department of Mathematics, Montpellier University, Montpellier, France; Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), Department of Computer Science, Montpellier University, Montpellier, France
- Sandra Bringay
- Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier (LIRMM), Department of Computer Science, Montpellier University, Montpellier, France; Paul Valery University, Montpellier, France
- Christian Lavergne
- Institut Montpelliérain Alexander Grothendieck (IMAG), Department of Mathematics, Montpellier University, Montpellier, France; Paul Valery University, Montpellier, France
- Caroline Mollevi
- Biometrics Unit, Institut du Cancer Montpellier (ICM), Montpellier, France
- Thomas Opitz
- BioSP Unit, Institut National de la Recherche Agronomique (INRA), Avignon, France
13
Luo YF, Rumshisky A. Interpretable Topic Features for Post-ICU Mortality Prediction. AMIA Annu Symp Proc 2017; 2016:827-836. [PMID: 28269879] [PMCID: PMC5333300]
Abstract
Electronic health records provide valuable resources for understanding the correlation between various diseases and mortality. The analysis of post-discharge mortality is critical for healthcare professionals to follow up on potential causes of death after a patient is discharged from the hospital and to give prompt treatment. Moreover, it may reduce the costs derived from readmissions and improve the quality of healthcare. Our work focused on post-discharge ICU mortality prediction. In addition to features derived from physiological measurements, we incorporated the ICD-9-CM hierarchy into Bayesian topic model learning and extracted topic features from medical notes. We achieved the highest AUCs of 0.835 and 0.829 for 30-day and 6-month post-discharge mortality prediction using baseline features and topic proportions derived from Labeled-LDA. Moreover, our work emphasized the interpretability of topic features derived from the topic model, which may facilitate the understanding and investigation of the complex relationship between mortality and diseases.
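A minimal sketch of how the ICD-9-CM hierarchy can supply document labels for Labeled-LDA, as the abstract describes: each note's diagnosis codes are mapped up to their chapter. Only three of the real chapter ranges are shown, and E/V codes are ignored; the helper name is illustrative.

```python
# A few ICD-9-CM chapters (major-code ranges are the official numeric ones).
ICD9_CHAPTERS = [
    (range(1, 140), "infectious"),     # 001-139 infectious and parasitic
    (range(140, 240), "neoplasms"),    # 140-239 neoplasms
    (range(390, 460), "circulatory"),  # 390-459 circulatory system
]

def chapter_labels(codes: list[str]) -> set[str]:
    """Map numeric ICD-9 codes (e.g. '410.1') to their chapter names."""
    labels = set()
    for code in codes:
        major = int(code.split(".")[0])  # drop the decimal subdivision
        for rng, name in ICD9_CHAPTERS:
            if major in rng:
                labels.add(name)
    return labels

print(sorted(chapter_labels(["410.1", "162.9"])))  # -> ['circulatory', 'neoplasms']
```

In Labeled-LDA, these chapter labels would constrain which topics each note may draw on, which is what keeps the learned topics interpretable.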
Affiliation(s)
- Yen-Fu Luo
- University of Massachusetts Lowell, Lowell, MA
14
Yu Z, Bernstam E, Cohen T, Wallace BC, Johnson TR. Improving the utility of MeSH® terms using the TopicalMeSH representation. J Biomed Inform 2016; 61:77-86. [PMID: 27001195] [PMCID: PMC4893983] [DOI: 10.1016/j.jbi.2016.03.013]
Abstract
OBJECTIVE To evaluate whether vector representations encoding latent topic proportions that capture similarities to MeSH terms can improve performance on biomedical document retrieval and classification tasks, compared to using MeSH terms. MATERIALS AND METHODS We developed the TopicalMeSH representation, which exploits the 'correspondence' between topics generated using latent Dirichlet allocation (LDA) and MeSH terms to create new document representations that combine MeSH terms and latent topic vectors. We used 15 systematic drug review corpora to evaluate performance on information retrieval and classification tasks using this TopicalMeSH representation, compared to using standard encodings that rely on either (1) the original MeSH terms, (2) the text, or (3) their combination. For the document retrieval task, we compared the precision and recall achieved by ranking citations using MeSH and TopicalMeSH representations, respectively. For the classification task, we considered three supervised machine learning approaches, Support Vector Machines (SVMs), logistic regression, and decision trees. We used these to classify documents as relevant or irrelevant using (independently) MeSH, TopicalMeSH, Words (i.e., n-grams extracted from citation titles and abstracts, encoded via bag-of-words representation), a combination of MeSH and Words, and a combination of TopicalMeSH and Words. We also used SVM to compare the classification performance of tf-idf weighted MeSH terms, LDA Topics, a combination of Topics and MeSH, and TopicalMeSH to supervised LDA's classification performance. RESULTS For the document retrieval task, using the TopicalMeSH representation resulted in higher precision than MeSH in 11 of 15 corpora while achieving the same recall. For the classification task, use of TopicalMeSH features realized a higher F1 score in 14 of 15 corpora when used by SVMs, 12 of 15 corpora using logistic regression, and 12 of 15 corpora using decision trees. 
TopicalMeSH also had better document classification performance on 12 of 15 corpora when compared to Topics, tf-idf weighted MeSH terms, and a combination of Topics and MeSH using SVMs. Supervised LDA achieved the worst performance in most of the corpora. CONCLUSION The proposed TopicalMeSH representation (which combines MeSH terms with latent topics) consistently improved performance on document retrieval and classification tasks, compared to standard representations using MeSH terms alone, as well as several alternative approaches.
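A minimal sketch of the correspondence idea behind TopicalMeSH: each MeSH term gets a topic-space "signature", and a document is re-described by its similarity to every term's signature. The vectors below are toy values, not learned LDA output, and the exact scoring in the paper differs.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

doc_topics = [0.7, 0.2, 0.1]          # document's LDA topic proportions
mesh_signatures = {                   # per-MeSH-term topic associations (toy)
    "Neoplasms": [0.8, 0.1, 0.1],
    "Hypertension": [0.1, 0.1, 0.8],
}
topical_mesh = {term: round(cosine(doc_topics, sig), 3)
                for term, sig in mesh_signatures.items()}
print(topical_mesh)
```

The resulting per-term scores replace (or augment) the binary MeSH indicator vector, giving graded rather than all-or-nothing term evidence.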
Affiliation(s)
- Zhiguo Yu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Elmer Bernstam
- School of Biomedical Informatics and Department of Internal Medicine, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Trevor Cohen
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
- Byron C Wallace
- School of Information, University of Texas at Austin, Austin, TX, USA
- Todd R Johnson
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, USA
15
Speier W, Ong MK, Arnold CW. Using phrases and document metadata to improve topic modeling of clinical reports. J Biomed Inform 2016; 61:260-6. [PMID: 27109931] [DOI: 10.1016/j.jbi.2016.04.005]
Abstract
Probabilistic topic models provide an unsupervised method for analyzing unstructured text, which has the potential to be integrated into clinical automatic summarization systems. Clinical documents are accompanied by metadata from a patient's medical history and frequently contain multiword concepts that can be valuable for accurately interpreting the included text. While existing methods have attempted to address these problems individually, we present a unified model for free-text clinical documents that integrates contextual patient- and document-level data and discovers multiword concepts. In the proposed model, phrases are represented by chained n-grams, and a Dirichlet hyperparameter is weighted by both document-level and patient-level context. This method and three other latent Dirichlet allocation models were fit to a large collection of clinical reports. Example topics illustrate the output of the new model, and the quality of the representations is evaluated using empirical log-likelihood. The proposed model was able to create informative prior probabilities based on patient and document information and captured phrases that represented various clinical concepts. The representation from the proposed model had a significantly higher empirical log-likelihood than the compared methods. Integrating document metadata and capturing phrases in clinical text greatly improves the topic representation of clinical documents. The resulting clinically informative topics may effectively serve as the basis for an automatic summarization system for clinical reports.
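A minimal sketch of the phrase-chaining idea: adjacent word pairs that recur in the corpus are merged into single phrase tokens before topic modeling. The count threshold and the tiny corpus are illustrative; the paper's chained n-gram model is probabilistic rather than this greedy merge.

```python
from collections import Counter

def merge_phrases(docs: list[list[str]], min_count: int = 2) -> list[list[str]]:
    """Greedily merge adjacent word pairs seen at least min_count times."""
    pair_counts = Counter(p for doc in docs for p in zip(doc, doc[1:]))
    phrases = {p for p, c in pair_counts.items() if c >= min_count}
    merged = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
                out.append(doc[i] + "_" + doc[i + 1])  # e.g. "chest_pain"
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged.append(out)
    return merged

docs = [["chest", "pain", "on", "exertion"],
        ["acute", "chest", "pain"]]
print(merge_phrases(docs))
```

The merged tokens ("chest_pain") then act as single vocabulary items, so a topic can put mass on the clinical concept rather than its parts.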