1
Yokokawa D, Noda K, Uehara T, Yanagita Y, Ohira Y, Ikusaka M. Do Japanese word-embedded representations obtained in the academic corpus retain the medical concepts of "infarction"? Artif Intell Med 2023; 143:102604. [PMID: 37673573 DOI: 10.1016/j.artmed.2023.102604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2021] [Revised: 05/29/2023] [Accepted: 06/05/2023] [Indexed: 09/08/2023]
Abstract
OBJECTIVE The pathophysiological concepts of diseases are encapsulated in patients' medical histories. Whether information on the pathophysiology or anatomy of "infarction" can be preserved and objectively expressed in the distributed representation obtained from a corpus of scientific Japanese medical texts in the "infarction" domain is currently unknown. Word2Vec was used to obtain distributed representations, meanings, and word analogies of word vectors, and this process was verified mathematically. MATERIALS & METHODS The texts were abstracts that were obtained by searching for "infarction," "abstract," and "case report" in the Japan Medical Journal Association's Ichushi Data Base. The abstracted text was morphologically analyzed to produce word sequences converted into their standard form. MeCab was used for morphological analysis and mecab-ipadic-NEologd and ComeJisyo were used as dictionaries. The accuracy of the known tasks for medical terms was evaluated using a word analogy task specific to the "infarction" domain. RESULTS Only 33 % of the word analogy tasks for medical terminology were correct. However, 52 % of the new original tasks, which were specific to the "infarction" domain, were correct, especially those regarding anatomical differences. DISCUSSION Documents related to "infarction" were collected from a corpus of Japanese medical documents and word-embedded expressions were obtained using Word2Vec. Terminology that had similar meanings to "infarction" included words such as "cavity" and "ischemia," which suggest the pathology of an infarction. CONCLUSION The pathophysiological and anatomical features of an "infarction" may be retained in a distributed representation.
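The word analogy task used for evaluation in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the vectors are hypothetical 3-dimensional toy values rather than trained Word2Vec embeddings, and the helper names `cosine` and `solve_analogy` are assumptions. The additive rule (b - a + c, ranked by cosine similarity) is the common formulation of the task; the abstract does not state which rule the study used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' by ranking words against b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates, as is customary.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Hypothetical toy embeddings: axes are (brain, heart, infarction).
TOY_VECTORS = {
    "brain": [1.0, 0.0, 0.0],
    "cerebral_infarction": [1.0, 0.0, 1.0],
    "heart": [0.0, 1.0, 0.0],
    "myocardial_infarction": [0.0, 1.0, 1.0],
    "ischemia": [0.5, 0.5, 1.0],
}

answer = solve_analogy(TOY_VECTORS, "brain", "cerebral_infarction", "heart")
print(answer)
```

Accuracy on an analogy set, such as the 33 % and 52 % figures reported above, would then be the share of tasks for which the returned word matches the expected one.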
Affiliation(s)
- Daiki Yokokawa: Chiba University Hospital, Department of General Medicine, Chiba, Japan
- Kazutaka Noda: Chiba University Hospital, Department of General Medicine, Chiba, Japan
- Takanori Uehara: Chiba University Hospital, Department of General Medicine, Chiba, Japan
- Yasutaka Yanagita: Chiba University Hospital, Department of General Medicine, Chiba, Japan
- Yoshiyuki Ohira: Chiba University Hospital, Department of General Medicine, Chiba, Japan; International University of Health and Welfare, School of Medicine, Department of General Medicine, Chiba, Japan
- Masatomi Ikusaka: Chiba University Hospital, Department of General Medicine, Chiba, Japan
2
Machine-learning as a validated tool to characterize individual differences in free recall of naturalistic events. Psychon Bull Rev 2023; 30:308-316. [PMID: 36085232 DOI: 10.3758/s13423-022-02171-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Accepted: 08/24/2022] [Indexed: 11/08/2022]
Abstract
The use of naturalistic stimuli, such as narrative movies, is gaining popularity in many fields, characterizing memory, affect, and decision-making. Narrative recall paradigms are often used to capture the complexity and richness of memory for naturalistic events. However, scoring narrative recalls is time-consuming and prone to human biases. Here, we show the validity and reliability of using a natural language processing tool, the Universal Sentence Encoder (USE), to automatically score narrative recalls. We compared the reliability in scoring made between two independent raters (i.e., hand scored) and between our automated algorithm and individual raters (i.e., automated) on trial-unique video clips of magic tricks. Study 1 showed that our automated segmentation approaches yielded high reliability and reflected measures yielded by hand scoring. Study 1 further showed that the results using USE outperformed another popular natural language processing tool, GloVe. In Study 2, we tested whether our automated approach remained valid when testing individuals varying on clinically relevant dimensions that influence episodic memory, age, and anxiety. We found that our automated approach was equally reliable across both age groups and anxiety groups, which shows the efficacy of our approach to assess narrative recall in large-scale individual difference analysis. In sum, these findings suggested that machine learning approach implementing USE is a promising tool for scoring large-scale narrative recalls and perform individual difference analysis for research using naturalistic stimuli.
3
Yokokawa D, Noda K, Yanagita Y, Uehara T, Ohira Y, Shikino K, Tsukamoto T, Ikusaka M. Validating the representation of distance between infarct diseases using word embedding. BMC Med Inform Decis Mak 2022; 22:322. [PMID: 36476486 PMCID: PMC9730570 DOI: 10.1186/s12911-022-02061-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 11/22/2022] [Indexed: 12/12/2022]
Abstract
BACKGROUND The pivot and cluster strategy (PCS) is a diagnostic reasoning strategy that automatically elicits disease clusters similar to a differential diagnosis in a batch. Although physicians know empirically which disease clusters are similar, there has been no quantitative evaluation. This study aimed to determine whether inter-disease distances between word embedding vectors using the PCS are a valid quantitative representation of similar disease groups in a limited domain. METHODS Abstracts were extracted from the Ichushi Web database and subjected to morphological analysis and training using Word2Vec, FastText, and GloVe. Consequently, word embedding vectors were obtained. For words including "infarction," we calculated the cophenetic correlation coefficient (CCC) as an internal validity measure and the adjusted rand index (ARI), normalized mutual information (NMI), and adjusted mutual information (AMI) with ICD-10 codes as the external validity measures. This was performed for each combination of metric and hierarchical clustering method. RESULTS Seventy-one words included "infarction," of which 38 diseases matched the ICD-10 standard with the appearance of 21 unique ICD-10 codes. When using Word2Vec, the CCC was most significant at 0.8690 (metric and method: euclidean and centroid), whereas the AMI was maximal at 0.4109 (metric and method: cosine and correlation, and average and weighted). The NMI and ARI were maximal at 0.8463 and 0.3593, respectively (metric and method: cosine and complete). FastText and GloVe generally resulted in the same trend as Word2Vec, and the metric and method that maximized CCC differed from the ones that maximized the external validity measures. CONCLUSIONS The metric and method that maximized the internal validity measure differed from those that maximized the external validity measures; both produced different results. 
The cosine distance should be used when considering ICD-10, and the Euclidean distance when considering the frequency of word occurrence. The distributed representation, when trained by Word2Vec on the "infarction" domain from a Japanese academic corpus, provides an objective inter-disease distance used in PCS.
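One of the external validity measures named above, the adjusted Rand index (ARI), can be computed from two labelings of the same items. The sketch below is a plain-Python illustration under stated assumptions: the function name `adjusted_rand_index` is hypothetical, the four example diseases and their cluster assignments are invented for demonstration, and the study's own pipeline is not described in enough detail here to reproduce it.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of items that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items that share a cluster within each labeling.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical example: ICD-10 codes as the reference labeling
# (I21 = acute myocardial infarction, I63 = cerebral infarction).
icd10 = ["I21", "I21", "I63", "I63"]
perfect_clusters = [0, 0, 1, 1]
split_clusters = [0, 0, 1, 2]   # the two I63 items were separated

ari_perfect = adjusted_rand_index(icd10, perfect_clusters)
ari_split = adjusted_rand_index(icd10, split_clusters)
print(ari_perfect, ari_split)
```

An ARI of 1 indicates identical groupings and values near 0 indicate chance-level agreement, which gives some context for the maximum of 0.3593 reported above.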
Affiliation(s)
- Daiki Yokokawa: Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
- Kazutaka Noda: Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
- Yasutaka Yanagita: Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
- Takanori Uehara: Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
- Yoshiyuki Ohira: Department of General Internal Medicine, St. Marianna University School of Medicine, 2-16-1 Sugao, Miyamae-Ku, Kawasaki City, Kanagawa Japan
- Kiyoshi Shikino: Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
- Tomoko Tsukamoto: Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
- Masatomi Ikusaka: Department of General Medicine, Chiba University Hospital, 1-8-1 Inohana, Chuo-Ku, Chiba City, Chiba 260-8677 Japan
4
Pan W, Han Y, Li J, Zhang E, He B. The positive energy of netizens: development and application of fine-grained sentiment lexicon and emotional intensity model. CURRENT PSYCHOLOGY 2022; 42:1-18. [PMID: 36345548 PMCID: PMC9630060 DOI: 10.1007/s12144-022-03876-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/10/2022] [Indexed: 11/06/2022]
Abstract
The outbreak of COVID-19 has led to a global health crisis and caused huge emotional swings. However, the positive emotional expressions, like self-confidence, optimism, and praise, that appear in Chinese social networks are rarely explored by researchers. This study aims to analyze the characteristics of netizens' positive energy expressions and the impact of node events on public emotional expression during the COVID-19 pandemic. First, a total of 6,525,249 Chinese texts posted by Sina Weibo users were randomly selected through textual data cleaning and word segmentation for corpus construction. A fine-grained sentiment lexicon that contained POSITIVE ENERGY was built using Word2Vec technology; this lexicon was later used to conduct sentiment category analysis on original posts. Next, through manual labeling and multi-classification machine learning model construction, four mainstream machine learning algorithms were selected to train the emotional intensity model. Finally, the lexicon and optimized emotional intensity model were used to analyze the emotional expressions of Chinese netizens. The results show that POSITIVE ENERGY expression accounted for 40.97% during the COVID-19 pandemic. Over the course of time, POSITIVE ENERGY emotions were displayed at the highest levels and SURPRISES the lowest. The analysis results of the node events showed after the outbreak was confirmed officially, the expressions of POSITIVE ENERGY and FEAR increased simultaneously. After the initial victory in pandemic prevention and control, the expression of POSITIVE ENERGY and SAD reached a peak, while the increase of SAD was the most prominent. The fine-grained sentiment lexicon, which includes a POSITIVE ENERGY category, demonstrated reliable algorithm performance and can be used for sentiment classification of Chinese Internet context. 
We also found many POSITIVE ENERGY expressions on Chinese online social platforms, which were shown to be significantly affected by node events of different natures.
Affiliation(s)
- Wenhao Pan: School of Public Administration, South China University of Technology, Guangzhou, China
- Yingying Han: School of Public Administration, South China University of Technology, Guangzhou, China
- Jinjin Li: School of Psychology, Guizhou Normal University, Guiyang, China
- Bikai He: Department of Intelligent Engineering, Guiyang Institute of Information Science and Technology, Guiyang, China
5
An Ensemble Semantic Textual Similarity Measure Based on Multiple Evidences for Biomedical Documents. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2022; 2022:8238432. [PMID: 36065380 PMCID: PMC9440839 DOI: 10.1155/2022/8238432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2022] [Accepted: 06/15/2022] [Indexed: 11/17/2022]
Abstract
With the increasing volume of the published biomedical literature, the fast and effective retrieval of the literature on the sequence, structure, and function of biological entities is an essential task for the rapid development of biology and medicine. To capture the semantic information in biomedical literature more effectively when biomedical documents are clustered, we propose a new multi-evidence-based semantic text similarity calculation method. Two semantic similarities and one content similarity are used, in which two semantic similarities include MeSH-based semantic similarity and word embedding-based semantic similarity. To fuse three different similarities more effectively, after, respectively, calculating two semantic and one content similarities between biomedical documents, feedforward neural network is applied to integrate the two semantic similarities. Finally, weighted linear combination method is used to integrate the semantic and content similarities. To evaluate the effectiveness, the proposed method is compared with the existing basic methods, and the proposed method outperforms the existing related methods. Based on the proven results of this study, this method can be used not only in actual biological or medical experiments such as protein sequence or function analysis but also in biological and medical research fields, which will help to provide, use, and understand thematically consistent documents.
6
Luo X, Gandhi P, Storey S, Zhang Z, Han Z, Huang K. A Computational Framework to Analyze the Associations Between Symptoms and Cancer Patient Attributes Post Chemotherapy Using EHR Data. IEEE J Biomed Health Inform 2021; 25:4098-4109. [PMID: 34613922 DOI: 10.1109/jbhi.2021.3117238] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/06/2022]
Abstract
Patients with cancer, such as breast and colorectal cancer, often experience different symptoms post-chemotherapy. The symptoms could be fatigue, gastrointestinal (nausea, vomiting, lack of appetite), psychoneurological symptoms (depressive symptoms, anxiety), or other types. Previous research focused on understanding the symptoms using survey data. In this research, we propose to utilize the data within the Electronic Health Record (EHR). A computational framework is developed to use a natural language processing (NLP) pipeline to extract the clinician-documented symptoms from clinical notes. Then, a patient clustering method is based on the symptom severity levels to group the patient in clusters. The association rule mining is used to analyze the associations between symptoms and patient attributes (smoking history, number of comorbidities, diabetes status, age at diagnosis) in the patient clusters. The results show that the various symptom types and severity levels have different associations between breast and colorectal cancers and different timeframes post-chemotherapy. The results also show that patients with breast or colorectal cancers, who smoke and have severe fatigue, likely have severe gastrointestinal symptoms six months after the chemotherapy. Our framework can be generalized to analyze symptoms or symptom clusters of other chronic diseases where symptom management is critical.
7
Geng W, Qin X, Yang T, Cong Z, Wang Z, Kong Q, Tang Z, Jiang L. Model-Based Reasoning of Clinical Diagnosis in Integrative Medicine: Real-World Methodological Study of Electronic Medical Records and Natural Language Processing Methods. JMIR Med Inform 2020; 8:e23082. [PMID: 33346740 PMCID: PMC7781803 DOI: 10.2196/23082] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/31/2020] [Revised: 10/18/2020] [Accepted: 11/07/2020] [Indexed: 01/17/2023]
Abstract
Background Integrative medicine is a form of medicine that combines practices and treatments from alternative medicine with conventional medicine. The diagnosis in integrative medicine involves the clinical diagnosis based on modern medicine and syndrome pattern diagnosis. Electronic medical records (EMRs) are the systematized collection of patients health information stored in a digital format that can be shared across different health care settings. Although syndrome and sign information or relative information can be extracted from the EMR and content texts can be mapped to computability vectors using natural language processing techniques, application of artificial intelligence techniques to support physicians in medical practices remains a major challenge. Objective The purpose of this study was to investigate model-based reasoning (MBR) algorithms for the clinical diagnosis in integrative medicine based on EMRs and natural language processing. We also estimated the associations among the factors of sample size, number of syndrome pattern type, and diagnosis in modern medicine using the MBR algorithms. Methods A total of 14,075 medical records of clinical cases were extracted from the EMRs as the development data set, and an external test data set consisting of 1000 medical records of clinical cases was extracted from independent EMRs. MBR methods based on word embedding, machine learning, and deep learning algorithms were developed for the automatic diagnosis of syndrome pattern in integrative medicine. MBR algorithms combining rule-based reasoning (RBR) were also developed. A standard evaluation metrics consisting of accuracy, precision, recall, and F1 score was used for the performance estimation of the methods. The association analyses were conducted on the sample size, number of syndrome pattern type, and diagnosis of lung diseases with the best algorithms. 
Results The Word2Vec convolutional neural network (CNN) MBR algorithms showed high performance (accuracy of 0.9586 in the test data set) in the syndrome pattern diagnosis of lung diseases. The Word2Vec CNN MBR combined with RBR also showed high performance (accuracy of 0.9229 in the test data set). The diagnosis of lung diseases could enhance the performance of the Word2Vec CNN MBR algorithms. Each group sample size and syndrome pattern type affected the performance of these algorithms. Conclusions The MBR methods based on Word2Vec and CNN showed high performance in the syndrome pattern diagnosis of lung diseases in integrative medicine. The parameters of each group’s sample size, syndrome pattern type, and diagnosis of lung diseases were associated with the performance of the methods. Trial Registration ClinicalTrials.gov NCT03274908; https://clinicaltrials.gov/ct2/show/NCT03274908
Affiliation(s)
- Wenye Geng: Department of Integrative Medicine, Fudan University Huashan Hospital, Shanghai, China
- Xuanfeng Qin: Department of Neurosurgery, Fudan University Huashan Hospital, Shanghai, China
- Tao Yang: Emergency Department, Huashan Hospital of Fudan University, Shanghai, China
- Zhilei Cong: Emergency Department, Huashan Hospital of Fudan University, Shanghai, China
- Zhuo Wang: Shanghai Sunjian Informatics Technology Company Limited, Shanghai, China
- Qing Kong: Department of Integrative Medicine, Fudan University Huashan Hospital, Shanghai, China
- Zihui Tang: Department of Integrative Medicine, Fudan University Huashan Hospital, Shanghai, China
- Lin Jiang: Healthcare Center, Fudan University Huashan Hospital, Shanghai, China
8
Yeganova L, Kim S, Chen Q, Balasanov G, Wilbur WJ, Lu Z. Better synonyms for enriching biomedical search. J Am Med Inform Assoc 2020; 27:1894-1902. [PMID: 33083825 PMCID: PMC7727334 DOI: 10.1093/jamia/ocaa151] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/23/2020] [Revised: 05/20/2020] [Accepted: 08/20/2020] [Indexed: 01/12/2023]
Abstract
OBJECTIVE In a biomedical literature search, the link between a query and a document is often not established, because they use different terms to refer to the same concept. Distributional word embeddings are frequently used for detecting related words by computing the cosine similarity between them. However, previous research has not established either the best embedding methods for detecting synonyms among related word pairs or how effective such methods may be. MATERIALS AND METHODS In this study, we first create the BioSearchSyn set, a manually annotated set of synonyms, to assess and compare 3 widely used word-embedding methods (word2vec, fastText, and GloVe) in their ability to detect synonyms among related pairs of words. We demonstrate the shortcomings of the cosine similarity score between word embeddings for this task: the same scores have very different meanings for the different methods. To address the problem, we propose utilizing pool adjacent violators (PAV), an isotonic regression algorithm, to transform a cosine similarity into a probability of 2 words being synonyms. RESULTS Experimental results using the BioSearchSyn set as a gold standard reveal which embedding methods have the best performance in identifying synonym pairs. The BioSearchSyn set also allows converting cosine similarity scores into probabilities, which provides a uniform interpretation of the synonymy score over different methods. CONCLUSIONS We introduced the BioSearchSyn corpus of 1000 term pairs, which allowed us to identify the best embedding method for detecting synonymy for biomedical search. Using the proposed method, we created PubTermVariants2.0: a large, automatically extracted set of synonym pairs that have augmented PubMed searches since the spring of 2019.
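The pool adjacent violators (PAV) step described in this abstract, which maps cosine similarity scores to synonym probabilities, can be illustrated with a small isotonic-regression sketch. Everything concrete here is an assumption made for illustration: the function name `pav_fit`, the five similarity scores, and the 0/1 synonym labels are invented, and the authors' actual implementation and data are not shown in the abstract.

```python
def pav_fit(scores, labels):
    """Fit a non-decreasing step function to (score, label) pairs with PAV.

    Returns the sorted scores and, for each, the fitted probability.
    Tied scores are not treated specially in this sketch.
    """
    pairs = sorted(zip(scores, labels))
    blocks = []  # each block holds [sum of labels, number of points]
    for _, label in pairs:
        blocks.append([float(label), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [score for score, _ in pairs], fitted

# Hypothetical cosine similarities and manual synonym labels (1 = synonym).
similarities = [0.2, 0.4, 0.5, 0.7, 0.9]
is_synonym = [0, 1, 0, 1, 1]

sorted_scores, probabilities = pav_fit(similarities, is_synonym)
print(list(zip(sorted_scores, probabilities)))
```

Because the fitted values are calibrated against labeled pairs separately for each embedding method, the same probability would carry the same meaning across word2vec, fastText, and GloVe, which raw cosine scores do not according to the abstract.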
Affiliation(s)
- Lana Yeganova: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Sun Kim: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Qingyu Chen: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Grigory Balasanov: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- W John Wilbur: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Zhiyong Lu: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
9
A semantic approach to extractive multi-document summarization: Applying sentence expansion for tuning of conceptual densities. Inf Process Manag 2020. [DOI: 10.1016/j.ipm.2020.102341] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 11/22/2022]
10
Cox S, Dong X, Rai R, Christopherson L, Zheng W, Tropsha A, Schmitt C. A semantic similarity based methodology for predicting protein-protein interactions: Evaluation with P53-interacting kinases. J Biomed Inform 2020; 111:103579. [PMID: 33007449 DOI: 10.1016/j.jbi.2020.103579] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2020] [Revised: 09/14/2020] [Accepted: 09/25/2020] [Indexed: 10/23/2022]
Abstract
Biomedical literature contains unstructured, rich information regarding proteins, ligands, diseases as well as biological pathways in which they are involved. Systematically analyzing such textual corpus has the potential for biomedical discovery of new protein-protein interactions and hidden drug indications. For this purpose, we have investigated a methodology that is based on a well-established text mining tool, Word2Vec, for the analysis of PubMed full text articles to derive word embeddings, and the use of a simple semantic similarity comparison either by itself or in conjunction with k-Nearest Neighbor (kNN) technique for the prediction of new relationships. To test this methodology, three lines of retrospective analyses of a dataset with known P53-interacting proteins have been conducted. First, we demonstrated that Word2Vec semantic similarity can infer functional relatedness among all kinases known to interact with P53. Second, in a series of time-split experiments, we demonstrated that both a simple similarity comparison and kNN models built with papers published up to a certain year were able to discover P53 interactors described in later publications. Third, in a different scenario of time-split experiments, we examined the predictions of P53-interacting proteins based on the kNN models built on data prior to a certain split year for different time ranges past that year, and found that the cumulative number of correct predictions was indeed increasing with time. We conclude that text mining of research papers in the PubMed literature based on Word2Vec analysis followed by a simple similarity comparison or kNN modeling affords excellent predictions of protein-protein interactions between P53 and kinases, and should have wide applications in translational biomedical studies such as repurposing of existing drugs, drug-drug interaction, and elucidation of mechanisms of action for drugs.
Affiliation(s)
- Steven Cox: Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Xialan Dong: The Laboratory for Molecular Informatics and Data Sciences, Department of Pharmaceutical Sciences and the BRITE Institute, College of Health and Sciences, North Carolina Central University, Durham, NC 27707, USA
- Ruhi Rai: Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Laura Christopherson: Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Weifan Zheng: The Laboratory for Molecular Informatics and Data Sciences, Department of Pharmaceutical Sciences and the BRITE Institute, College of Health and Sciences, North Carolina Central University, Durham, NC 27707, USA; UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Alexander Tropsha: Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; UNC Eshelman School of Pharmacy, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Charles Schmitt: Renaissance Computing Institute (RENCI), University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
11
Lee GE, Sun A. Understanding the stability of medical concept embeddings. J Assoc Inf Sci Technol 2020. [DOI: 10.1002/asi.24411] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]
Affiliation(s)
- Grace E. Lee: School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Aixin Sun: School of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
12
Abstract
Rapid increases in data volume and variety pose a challenge to safe drug prescription for health professionals such as doctors and dentists. Our study addresses this challenge by presenting innovative approaches to mining data from a drug corpus and extracting feature vectors, so that this knowledge can be combined with individual patient medical profiles. Within our three-tiered framework (the prediction layer, the knowledge layer, and the presentation layer), we describe multiple approaches to computing similarity ratios from the feature vectors, illustrated with an example of applying the framework in a typical medical clinic. Experimental evaluation shows that the word embedding model performs better than the adverse network model, with an F score of 0.75. The F score is a common metric for evaluating the performance of classification algorithms. Similarity to a drug that the patient is allergic to or is already taking is an important consideration when judging the suitability of a drug for prescription. Hence, such an approach, when integrated within the clinical workflow, will reduce prescription errors and thereby improve patient health outcomes.
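Since the abstract reports its result as an F score, a short sketch of how that metric is computed from a confusion matrix may help. The function name `f_score` and the counts below are hypothetical and chosen only so that the arithmetic is easy to follow; they are not the study's data, and the abstract does not say whether the balanced F1 variant (beta = 1) was the one used.

```python
def f_score(true_pos, false_pos, false_neg, beta=1.0):
    """F-beta score from confusion-matrix counts (beta=1 gives the usual F1)."""
    if true_pos == 0:
        return 0.0  # no correct positive predictions
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Hypothetical counts: 6 correct alerts, 2 false alarms, 2 missed cases.
example_f1 = f_score(true_pos=6, false_pos=2, false_neg=2)
print(example_f1)
```

With these invented counts, precision and recall are both 0.75, so their harmonic mean is 0.75 as well.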
13
Summarization of biomedical articles using domain-specific word embeddings and graph ranking. J Biomed Inform 2020; 107:103452. [DOI: 10.1016/j.jbi.2020.103452] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 01/13/2020] [Revised: 05/06/2020] [Accepted: 05/09/2020] [Indexed: 12/21/2022]
14
Characterization of near death experiences using text mining analyses: A preliminary study. PLoS One 2020; 15:e0227402. [PMID: 31999716 PMCID: PMC6992169 DOI: 10.1371/journal.pone.0227402] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/08/2017] [Accepted: 12/18/2019] [Indexed: 01/04/2023]
Abstract
The notion that death represents a passing to an afterlife, where we are reunited with loved ones and live eternally in a utopian paradise, is common in the reports of people who have encountered a “Near-Death Experience” (NDE). NDEs are thoroughly portrayed by the media but empirical studies are rather recent. The definition of the phenomenon as well as the identification of NDE experiencers is still a matter of debate. To date, NDEs’ identification and description in studies have mostly derived from answered items in questionnaires. However, questionnaires’ content could be restricting and subject to personal interpretation. We believe that in addition to their use, user-independent statistical text examination of freely expressed NDEs narratives is of prior importance to help capture the phenomenology of such a subjective and complex phenomenon. Towards that aim, we included 158 participants with a firsthand retrospective narrative of their self-reported NDE that we analyzed using an automated text-mining method. The output revealed the top words expressed by experiencers. In a second step, a hierarchical clustering analysis was conducted to visualize the relationships between these words. It revealed three main clusters of features: visual perceptions, emotions and spatial components. We believe the user-independent and data-driven text mining approach used in this study is promising by contributing to the building a rigorous description and definition of NDEs.
15
Zhang L, Zhang Y, Cai T, Ahuja Y, He Z, Ho YL, Beam A, Cho K, Carroll R, Denny J, Kohane I, Liao K, Cai T. Automated grouping of medical codes via multiview banded spectral clustering. J Biomed Inform 2019; 100:103322. [PMID: 31672532 PMCID: PMC7261410 DOI: 10.1016/j.jbi.2019.103322] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/01/2019] [Revised: 10/25/2019] [Accepted: 10/27/2019] [Indexed: 01/28/2023]
Abstract
OBJECTIVE With their increasingly widespread adoption, electronic health records (EHRs) have enabled phenotypic information extraction at an unprecedented granularity and scale. However, a medical concept (e.g. diagnosis, prescription, symptom) is often described by various synonyms across different EHR systems, hindering data integration for signal enhancement and complicating dimensionality reduction for knowledge discovery. Despite existing ontologies and hierarchies, tremendous human effort is needed for curation and maintenance - a process that is both unscalable and susceptible to subjective biases. This paper aims to develop a data-driven approach to automate grouping medical terms into clinically relevant concepts by combining multiple up-to-date data sources in an unbiased manner. METHODS We present a novel data-driven grouping approach - multi-view banded spectral clustering (mvBSC) - that combines summary data from multiple healthcare systems. The proposed method consists of a banding step that leverages the prior knowledge from the existing coding hierarchy, and a combining step that performs spectral clustering on an optimally weighted matrix. RESULTS We apply the proposed method to group ICD-9 and ICD-10-CM codes together by integrating data from two healthcare systems. We show grouping results and hierarchies for 13 representative disease categories. Individual grouping qualities were evaluated using normalized mutual information, adjusted Rand index, and F1-measure, and were found to consistently exhibit great similarity to the existing manual grouping counterpart. The resulting ICD groupings also enjoy comparable interpretability and are well aligned with the current ICD hierarchy. CONCLUSION The proposed approach, by systematically leveraging multiple data sources, is able to overcome bias while maximizing consensus to achieve generalizability. It has the advantage of being efficient, scalable, and adaptive to the evolving human knowledge reflected in the data, showing a significant step toward automating medical knowledge integration.
Affiliation(s)
- Luwan Zhang: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Yichi Zhang: Department of Computer Science and Statistics, University of Rhode Island, Kingston, RI, USA
- Tianrun Cai: Division of Rheumatology, Brigham and Women's Hospital, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA
- Yuri Ahuja: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Zeling He: Division of Rheumatology, Brigham and Women's Hospital, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA
- Yuk-Lam Ho: Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA
- Andrew Beam: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Kelly Cho: Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA; Division of Aging, Brigham and Women's Hospital, Boston, MA, USA; Department of Medicine, Harvard Medical School, Boston, MA, USA
- Robert Carroll: Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
- Joshua Denny: Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
- Isaac Kohane: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Katherine Liao: Division of Rheumatology, Brigham and Women's Hospital, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Tianxi Cai: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
|
16
|
|
17
|
Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings. J Biomed Inform 2019; 90:103096. [PMID: 30654030 DOI: 10.1016/j.jbi.2019.103096] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/08/2018] [Revised: 11/27/2018] [Accepted: 12/31/2018] [Indexed: 11/21/2022]
Abstract
Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated (rho = 0.5-0.8 depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html for release under CC-BY-NC license. Several public web query interfaces are also available at the same site, including one which allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
|
18
|
Topaz M, Murga L, Gaddis KM, McDonald MV, Bar-Bachar O, Goldberg Y, Bowles KH. Mining fall-related information in clinical notes: Comparison of rule-based and novel word embedding-based machine learning approaches. J Biomed Inform 2019; 90:103103. [PMID: 30639392 DOI: 10.1016/j.jbi.2019.103103] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 06/04/2018] [Revised: 11/14/2018] [Accepted: 12/31/2018] [Indexed: 10/27/2022]
Abstract
BACKGROUND Natural language processing (NLP) of health-related data is still an expertise-demanding and resource-expensive process. We created a novel, open source rapid clinical text mining system called NimbleMiner. NimbleMiner combines several machine learning techniques (word embedding models and positive only labels learning) to facilitate the process in which a human rapidly performs text mining of clinical narratives, while being aided by the machine learning components. OBJECTIVE This manuscript describes the general system architecture and user interface and presents results of a case study aimed at classifying fall-related information (including fall history, fall prevention interventions, and fall risk) in homecare visit notes. METHODS We extracted a corpus of homecare visit notes (n = 1,149,586) for 89,459 patients from a large US-based homecare agency. We used a gold standard testing dataset of 750 notes annotated by two human reviewers to compare NimbleMiner's ability to classify documents regarding whether they contain fall-related information with that of a previously developed rule-based NLP system. RESULTS NimbleMiner outperformed the rule-based system in almost all domains. The overall F-score was 85.8% compared to 81% by the rule-based system, with the best performance for identifying general fall history (F = 89% vs. F = 85.1% rule-based), followed by fall risk (F = 87% vs. F = 78.7% rule-based), fall prevention interventions (F = 88.1% vs. F = 78.2% rule-based) and fall within 2 days of the note date (F = 83.1% vs. F = 80.6% rule-based). The rule-based system achieved slightly better performance for fall within 2 weeks of the note date (F = 81.9% vs. F = 84% rule-based). DISCUSSION & CONCLUSIONS NimbleMiner outperformed other systems aimed at fall information classification, including our previously developed rule-based approach. These promising results indicate that clinical text mining can be implemented without the need for large labeled datasets necessary for other types of machine learning. This is critical for domains with little NLP development, like nursing or allied health professions.
Affiliation(s)
- Maxim Topaz: School of Nursing & Data Science Institute, Columbia University, New York, NY, USA; The Visiting Nurse Service of New York, New York, NY, USA
- Ludmila Murga: Cheryl Spencer Department of Nursing, University of Haifa, Haifa, Israel
- Ofrit Bar-Bachar: Cheryl Spencer Department of Nursing, University of Haifa, Haifa, Israel
- Yoav Goldberg: Department of Computer Science, Bar Ilan University, Tel Aviv, Israel
- Kathryn H Bowles: The Visiting Nurse Service of New York, New York, NY, USA; School of Nursing, University of Pennsylvania, Philadelphia, PA, USA
|
19
|
A survey of word embeddings for clinical text. J Biomed Inform 2019; 100S:100057. [DOI: 10.1016/j.yjbinx.2019.100057] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 04/17/2019] [Revised: 08/22/2019] [Accepted: 09/28/2019] [Indexed: 11/22/2022]
|
20
|
Khatua A, Khatua A, Cambria E. A tale of two epidemics: Contextual Word2Vec for classifying twitter streams during outbreaks. Inf Process Manag 2019. [DOI: 10.1016/j.ipm.2018.10.010] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Indexed: 10/28/2022]
|
21
|
A Bayesian Failure Prediction Network Based on Text Sequence Mining and Clustering. Entropy 2018; 20:e20120923. [PMID: 33266647 PMCID: PMC7512510 DOI: 10.3390/e20120923] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 10/23/2018] [Revised: 11/27/2018] [Accepted: 11/30/2018] [Indexed: 12/02/2022]
Abstract
The purpose of this paper is to predict failures based on textual sequence data. Current failure prediction is mainly based on structured data. However, there is much unstructured data in aircraft maintenance. The failure mentioned here refers to failure types, such as transmitter failure and signal failure, which are classified by a clustering algorithm based on the failure text. For the failure text, this paper uses natural language processing technology. First, segmentation and the removal of stop words are performed on the Chinese failure text data. The study applies the word2vec moving distance model to obtain the failure occurrence sequence for failure texts collected in a fixed period of time. According to the distance, a clustering algorithm is used to obtain a typical number of fault types. Second, the failure occurrence sequence is mined using sequence mining algorithms, such as PrefixSpan. Finally, the above failure sequence is used to train the Bayesian failure network model. The final experimental results show that the Bayesian failure network has higher accuracy for failure prediction.
|
22
|
Neural networks for mining the associations between diseases and symptoms in clinical notes. Health Inf Sci Syst 2018; 7:1. [PMID: 30588291 DOI: 10.1007/s13755-018-0062-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 05/30/2018] [Accepted: 11/08/2018] [Indexed: 01/20/2023] Open
Abstract
Analyzing the narrative clinical notes in Electronic Health Records (EHRs) is challenging because of their unstructured nature. Mining the associations between the clinical concepts within the clinical notes can support physicians in making decisions, and provide researchers with evidence about disease development and treatment. In this paper, in order to model and analyze disease and symptom relationships in the clinical notes, we present a concept association mining framework that is based on word embeddings learned through neural networks. The approach is tested using 154,738 clinical notes from 500 patients, which are extracted from the Indiana University Health's Electronic Health Records system. All patients are diagnosed with more than one type of disease. The results show that this concept association mining framework can identify related diseases and symptoms. We also propose a method to visualize a patient's diseases and related symptoms in chronological order. This visualization can provide physicians with an overview of the medical history of a patient and support decision making. The presented approach can also be expanded to analyze the associations of other clinical concepts, such as social history, family history, medications, etc.
|
23
|
Chen Z, He Z, Liu X, Bian J. Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases. BMC Med Inform Decis Mak 2018; 18:65. [PMID: 30066651 PMCID: PMC6069806 DOI: 10.1186/s12911-018-0630-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the past few years, neural word embeddings have been widely used in text mining. However, the vector representations of word embeddings mostly act as a black box in downstream applications using them, thereby limiting their interpretability. Even though word embeddings are able to capture semantic regularities in free text documents, it is not clear how different kinds of semantic relations are represented by word embeddings and how semantically-related terms can be retrieved from word embeddings. METHODS To improve the transparency of word embeddings and the interpretability of the applications using them, in this study, we propose a novel approach for evaluating the semantic relations in word embeddings using external knowledge bases: Wikipedia, WordNet and Unified Medical Language System (UMLS). We trained multiple word embeddings using health-related articles in Wikipedia and then evaluated their performance in the analogy and semantic relation term retrieval tasks. We also assessed whether the evaluation results depend on the domain of the textual corpora by comparing the embeddings of health-related Wikipedia articles with those of general Wikipedia articles. RESULTS Regarding the retrieval of semantic relations, we were able to retrieve diverse semantic relations in the nearest neighbors of a given word. Meanwhile, the two popular word embedding approaches, Word2vec and GloVe, obtained comparable results on both the analogy retrieval task and the semantic relation retrieval task, while dependency-based word embeddings had much worse performance in both tasks. We also found that the word embeddings trained with health-related Wikipedia articles obtained better performance in the health-related relation retrieval tasks than those trained with general Wikipedia articles. CONCLUSION It is evident from this study that word embeddings can group terms with diverse semantic relations together. The domain of the training corpus does have an impact on the semantic relations represented by word embeddings. We thus recommend using a domain-specific corpus to train word embeddings for domain-specific text mining tasks.
Affiliation(s)
- Zhiwei Chen: Department of Computer Science, Florida State University, Tallahassee, FL, USA
- Zhe He: School of Information, Florida State University, 142 Collegiate Loop, Tallahassee, FL, 32306 USA
- Xiuwen Liu: Department of Computer Science, Florida State University, Tallahassee, FL, USA
- Jiang Bian: Department of Health Outcomes and Biomedical Informatics, University of Florida, Gainesville, FL, USA
|
24
|
Westergaard D, Stærfeldt HH, Tønsberg C, Jensen LJ, Brunak S. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput Biol 2018; 14:e1005962. [PMID: 29447159 PMCID: PMC5831415 DOI: 10.1371/journal.pcbi.1005962] [Citation(s) in RCA: 87] [Impact Index Per Article: 14.5] [Received: 07/05/2017] [Revised: 02/28/2018] [Accepted: 01/05/2018] [Indexed: 12/21/2022] Open
Abstract
Across academia and industry, text mining has become a popular strategy for keeping up with the rapid growth of the scientific literature. Text mining of the scientific literature has mostly been carried out on collections of abstracts, due to their availability. Here we present an analysis of 15 million English scientific full-text articles published during the period 1823-2016. We describe the development in article length and publication sub-topics during these nearly 250 years. We showcase the potential of text mining by extracting published protein-protein, disease-gene, and protein subcellular associations using a named entity recognition system, and quantitatively report on their accuracy using gold standard benchmark data sets. We subsequently compare the findings to corresponding results obtained on 16.5 million abstracts included in MEDLINE and show that text mining of full-text articles consistently outperforms using abstracts only.
Affiliation(s)
- David Westergaard: Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark; Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Hans-Henrik Stærfeldt: Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
- Christian Tønsberg: Office for Innovation and Sector Services, Technical Information Center of Denmark, Technical University of Denmark, Lyngby, Denmark
- Lars Juhl Jensen: Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Søren Brunak: Center for Biological Sequence Analysis, Department of Bio and Health Informatics, Technical University of Denmark, Lyngby, Denmark
|