Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Zhu Y, Yan E, Wang F. Semantic relatedness and similarity of biomedical terms: examining the effects of recency, size, and section of biomedical publications on the performance of word2vec. BMC Med Inform Decis Mak 2017;17:95. [PMID: 28673289 PMCID: PMC5496182 DOI: 10.1186/s12911-017-0498-1] [Citation(s) in RCA: 44] [Impact Index Per Article: 6.3] [Received: 02/20/2017] [Accepted: 06/28/2017] [Indexed: 11/10/2022] Open

Cited by Other Article(s)

Yokokawa D, Noda K, Uehara T, Yanagita Y, Ohira Y, Ikusaka M. Do Japanese word-embedded representations obtained in the academic corpus retain the medical concepts of "infarction"? Artif Intell Med 2023;143:102604. [PMID: 37673573 DOI: 10.1016/j.artmed.2023.102604] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2021] [Revised: 05/29/2023] [Accepted: 06/05/2023] [Indexed: 09/08/2023]

Machine-learning as a validated tool to characterize individual differences in free recall of naturalistic events. Psychon Bull Rev 2023;30:308-316. [PMID: 36085232 DOI: 10.3758/s13423-022-02171-4] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Accepted: 08/24/2022] [Indexed: 11/08/2022]

Yokokawa D, Noda K, Yanagita Y, Uehara T, Ohira Y, Shikino K, Tsukamoto T, Ikusaka M. Validating the representation of distance between infarct diseases using word embedding. BMC Med Inform Decis Mak 2022;22:322. [PMID: 36476486 PMCID: PMC9730570 DOI: 10.1186/s12911-022-02061-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2022] [Accepted: 11/22/2022] [Indexed: 12/12/2022] Open

Abstract

BACKGROUND

The pivot and cluster strategy (PCS) is a diagnostic reasoning strategy that automatically elicits disease clusters similar to a differential diagnosis in a batch. Although physicians know empirically which disease clusters are similar, there has been no quantitative evaluation. This study aimed to determine whether inter-disease distances between word embedding vectors using the PCS are a valid quantitative representation of similar disease groups in a limited domain.

METHODS

Abstracts were extracted from the Ichushi Web database and subjected to morphological analysis and training using Word2Vec, FastText, and GloVe to obtain word embedding vectors. For words including "infarction," we calculated the cophenetic correlation coefficient (CCC) as an internal validity measure and the adjusted Rand index (ARI), normalized mutual information (NMI), and adjusted mutual information (AMI) with ICD-10 codes as the external validity measures. This was performed for each combination of metric and hierarchical clustering method.

RESULTS

Seventy-one words included "infarction," of which 38 diseases matched the ICD-10 standard with the appearance of 21 unique ICD-10 codes. When using Word2Vec, the CCC was highest at 0.8690 (metric and method: Euclidean and centroid), whereas the AMI was maximal at 0.4109 (metric and method: cosine and correlation, and average and weighted). The NMI and ARI were maximal at 0.8463 and 0.3593, respectively (metric and method: cosine and complete). FastText and GloVe generally resulted in the same trend as Word2Vec, and the metric and method that maximized CCC differed from the ones that maximized the external validity measures.

CONCLUSIONS

The metric and method that maximized the internal validity measure differed from those that maximized the external validity measures; both produced different results. The cosine distance should be used when considering ICD-10, and the Euclidean distance when considering the frequency of word occurrence. The distributed representation, when trained by Word2Vec on the "infarction" domain from a Japanese academic corpus, provides an objective inter-disease distance used in PCS.
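
The external validity measures named in this abstract compare a flat clustering against reference labels such as ICD-10 codes. As an illustration only, the adjusted Rand index can be sketched in plain Python as below. The function name, the six ICD-10 labels, and the cluster assignments are hypothetical toy values, not data from the study, and this is not the authors' implementation.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0  # fewer than two items: nothing to disagree about

    # Count item pairs grouped together in both labelings, in the
    # reference labeling, and in the predicted clustering.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / comb(n, 2)  # chance agreement
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (together_both - expected) / (maximum - expected)


# Toy data (hypothetical): ICD-10 categories for six terms and the flat
# clusters obtained by cutting a dendrogram at some height.
icd10_codes = ["I21", "I21", "I21", "I63", "I63", "I63"]
cluster_ids = [0, 0, 1, 1, 2, 2]
print(round(adjusted_rand_index(icd10_codes, cluster_ids), 4))
```

A value of 1 means the clustering reproduces the reference grouping exactly, and values near 0 mean agreement no better than chance.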

Pan W, Han Y, Li J, Zhang E, He B. The positive energy of netizens: development and application of fine-grained sentiment lexicon and emotional intensity model. Curr Psychol 2022;42:1-18. [PMID: 36345548 PMCID: PMC9630060 DOI: 10.1007/s12144-022-03876-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 10/10/2022] [Indexed: 11/06/2022]

Abstract

The outbreak of COVID-19 has led to a global health crisis and caused huge emotional swings. However, the positive emotional expressions, like self-confidence, optimism, and praise, that appear in Chinese social networks are rarely explored by researchers. This study aims to analyze the characteristics of netizens' positive energy expressions and the impact of node events on public emotional expression during the COVID-19 pandemic. First, a total of 6,525,249 Chinese texts posted by Sina Weibo users were randomly selected through textual data cleaning and word segmentation for corpus construction. A fine-grained sentiment lexicon that contained POSITIVE ENERGY was built using Word2Vec technology; this lexicon was later used to conduct sentiment category analysis on original posts. Next, through manual labeling and multi-classification machine learning model construction, four mainstream machine learning algorithms were selected to train the emotional intensity model. Finally, the lexicon and optimized emotional intensity model were used to analyze the emotional expressions of Chinese netizens. The results show that POSITIVE ENERGY expression accounted for 40.97% during the COVID-19 pandemic. Over the course of time, POSITIVE ENERGY emotions were displayed at the highest levels and SURPRISES the lowest. The analysis results of the node events showed that after the outbreak was officially confirmed, the expressions of POSITIVE ENERGY and FEAR increased simultaneously. After the initial victory in pandemic prevention and control, the expression of POSITIVE ENERGY and SAD reached a peak, while the increase of SAD was the most prominent. The fine-grained sentiment lexicon, which includes a POSITIVE ENERGY category, demonstrated reliable algorithm performance and can be used for sentiment classification of Chinese Internet context. We also found many POSITIVE ENERGY expressions on Chinese online social platforms, which were shown to be significantly affected by node events of different natures.

An Ensemble Semantic Textual Similarity Measure Based on Multiple Evidences for Biomedical Documents. Comput Math Methods Med 2022;2022:8238432. [PMID: 36065380 PMCID: PMC9440839 DOI: 10.1155/2022/8238432] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2022] [Accepted: 06/15/2022] [Indexed: 11/17/2022]

Luo X, Gandhi P, Storey S, Zhang Z, Han Z, Huang K. A Computational Framework to Analyze the Associations Between Symptoms and Cancer Patient Attributes Post Chemotherapy Using EHR Data. IEEE J Biomed Health Inform 2021;25:4098-4109. [PMID: 34613922 DOI: 10.1109/jbhi.2021.3117238] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/06/2022]

Geng W, Qin X, Yang T, Cong Z, Wang Z, Kong Q, Tang Z, Jiang L. Model-Based Reasoning of Clinical Diagnosis in Integrative Medicine: Real-World Methodological Study of Electronic Medical Records and Natural Language Processing Methods. JMIR Med Inform 2020;8:e23082. [PMID: 33346740 PMCID: PMC7781803 DOI: 10.2196/23082] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/31/2020] [Revised: 10/18/2020] [Accepted: 11/07/2020] [Indexed: 01/17/2023] Open

Abstract

Background

Integrative medicine is a form of medicine that combines practices and treatments from alternative medicine with conventional medicine. The diagnosis in integrative medicine involves the clinical diagnosis based on modern medicine and syndrome pattern diagnosis. Electronic medical records (EMRs) are the systematized collection of patients' health information stored in a digital format that can be shared across different health care settings. Although syndrome and sign information or relative information can be extracted from the EMR and content texts can be mapped to computability vectors using natural language processing techniques, application of artificial intelligence techniques to support physicians in medical practices remains a major challenge.

Objective

The purpose of this study was to investigate model-based reasoning (MBR) algorithms for the clinical diagnosis in integrative medicine based on EMRs and natural language processing. We also estimated the associations among the factors of sample size, number of syndrome pattern types, and diagnosis in modern medicine using the MBR algorithms.

Methods

A total of 14,075 medical records of clinical cases were extracted from the EMRs as the development data set, and an external test data set consisting of 1000 medical records of clinical cases was extracted from independent EMRs. MBR methods based on word embedding, machine learning, and deep learning algorithms were developed for the automatic diagnosis of syndrome pattern in integrative medicine. MBR algorithms combining rule-based reasoning (RBR) were also developed. Standard evaluation metrics consisting of accuracy, precision, recall, and F1 score were used for the performance estimation of the methods. The association analyses were conducted on the sample size, number of syndrome pattern types, and diagnosis of lung diseases with the best algorithms.

Results

The Word2Vec convolutional neural network (CNN) MBR algorithms showed high performance (accuracy of 0.9586 in the test data set) in the syndrome pattern diagnosis of lung diseases. The Word2Vec CNN MBR combined with RBR also showed high performance (accuracy of 0.9229 in the test data set). The diagnosis of lung diseases could enhance the performance of the Word2Vec CNN MBR algorithms. Each group's sample size and syndrome pattern type affected the performance of these algorithms.

Conclusions

The MBR methods based on Word2Vec and CNN showed high performance in the syndrome pattern diagnosis of lung diseases in integrative medicine. The parameters of each group’s sample size, syndrome pattern type, and diagnosis of lung diseases were associated with the performance of the methods.

Trial Registration

ClinicalTrials.gov NCT03274908; https://clinicaltrials.gov/ct2/show/NCT03274908
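
The evaluation metrics listed in the Methods above (accuracy, precision, recall, and F1 score) can be illustrated with a small sketch. Macro-averaging over classes is an assumption made here for the multi-class case, since the abstract does not state which averaging was used. The function name and the syndrome pattern labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 score."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1_scores = [], [], []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)

    k = len(precisions)  # number of distinct classes seen
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / k,
        "recall": sum(recalls) / k,
        "f1": sum(f1_scores) / k,
    }


# Toy labels (hypothetical): reference and predicted syndrome patterns.
reference = ["pattern_A", "pattern_A", "pattern_B", "pattern_B"]
predicted = ["pattern_A", "pattern_B", "pattern_B", "pattern_B"]
print(classification_metrics(reference, predicted))
```

Accuracy alone can look high when one class dominates, which is one reason per-class precision and recall are usually reported alongside it.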

Yeganova L, Kim S, Chen Q, Balasanov G, Wilbur WJ, Lu Z. Better synonyms for enriching biomedical search. J Am Med Inform Assoc 2020;27:1894-1902. [PMID: 33083825 PMCID: PMC7727334 DOI: 10.1093/jamia/ocaa151] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/23/2020] [Revised: 05/20/2020] [Accepted: 08/20/2020] [Indexed: 01/12/2023] Open

A semantic approach to extractive multi-document summarization: Applying sentence expansion for tuning of conceptual densities. Inf Process Manag 2020. [DOI: 10.1016/j.ipm.2020.102341] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 11/22/2022]

Cox S, Dong X, Rai R, Christopherson L, Zheng W, Tropsha A, Schmitt C. A semantic similarity based methodology for predicting protein-protein interactions: Evaluation with P53-interacting kinases. J Biomed Inform 2020;111:103579. [PMID: 33007449 DOI: 10.1016/j.jbi.2020.103579] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2020] [Revised: 09/14/2020] [Accepted: 09/25/2020] [Indexed: 10/23/2022]

Abstract

Biomedical literature contains unstructured, rich information regarding proteins, ligands, diseases as well as biological pathways in which they are involved. Systematically analyzing such textual corpus has the potential for biomedical discovery of new protein-protein interactions and hidden drug indications. For this purpose, we have investigated a methodology that is based on a well-established text mining tool, Word2Vec, for the analysis of PubMed full text articles to derive word embeddings, and the use of a simple semantic similarity comparison either by itself or in conjunction with k-Nearest Neighbor (kNN) technique for the prediction of new relationships. To test this methodology, three lines of retrospective analyses of a dataset with known P53-interacting proteins have been conducted. First, we demonstrated that Word2Vec semantic similarity can infer functional relatedness among all kinases known to interact with P53. Second, in a series of time-split experiments, we demonstrated that both a simple similarity comparison and kNN models built with papers published up to a certain year were able to discover P53 interactors described in later publications. Third, in a different scenario of time-split experiments, we examined the predictions of P53-interacting proteins based on the kNN models built on data prior to a certain split year for different time ranges past that year, and found that the cumulative number of correct predictions was indeed increasing with time. We conclude that text mining of research papers in the PubMed literature based on Word2Vec analysis followed by a simple similarity comparison or kNN modeling affords excellent predictions of protein-protein interactions between P53 and kinases, and should have wide applications in translational biomedical studies such as repurposing of existing drugs, drug-drug interaction, and elucidation of mechanisms of action for drugs.
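
The two ingredients described in this abstract, a semantic similarity comparison between word embeddings and a k-Nearest Neighbor vote, can be sketched as follows. This is a minimal illustration under stated assumptions: cosine similarity is assumed as the similarity measure, the 3-dimensional vectors are hand-made stand-ins for trained Word2Vec embeddings, and the boolean labels mark hypothetical "interacts with the target" annotations rather than real P53 data.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_u * norm_v)


def knn_predict(query, labeled_vectors, k=3):
    """Majority label among the k labeled vectors most similar to query."""
    ranked = sorted(
        labeled_vectors,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy embeddings (hypothetical): (vector, known-interactor flag).
labeled_vectors = [
    ([0.9, 0.1, 0.0], True),
    ([0.8, 0.2, 0.1], True),
    ([0.7, 0.3, 0.0], True),
    ([0.1, 0.9, 0.2], False),
    ([0.0, 0.8, 0.3], False),
]
candidate = [0.85, 0.15, 0.05]  # embedding of an unlabeled protein name
print(knn_predict(candidate, labeled_vectors, k=3))
```

Using an odd k with two classes avoids tied votes. In a time-split setting, the labeled vectors would come from embeddings trained only on papers published before the split year.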

Lee GE, Sun A. Understanding the stability of medical concept embeddings. J Assoc Inf Sci Technol 2020. [DOI: 10.1002/asi.24411] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/09/2022]

Feature-Based Learning in Drug Prescription System for Medical Clinics. Neural Process Lett 2020;52:1703-1721. [PMID: 32837244 PMCID: PMC7331919 DOI: 10.1007/s11063-020-10296-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/18/2022]

Summarization of biomedical articles using domain-specific word embeddings and graph ranking. J Biomed Inform 2020;107:103452. [DOI: 10.1016/j.jbi.2020.103452] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 01/13/2020] [Revised: 05/06/2020] [Accepted: 05/09/2020] [Indexed: 12/21/2022]

Characterization of near death experiences using text mining analyses: A preliminary study. PLoS One 2020;15:e0227402. [PMID: 31999716 PMCID: PMC6992169 DOI: 10.1371/journal.pone.0227402] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 08/08/2017] [Accepted: 12/18/2019] [Indexed: 01/04/2023] Open

Zhang L, Zhang Y, Cai T, Ahuja Y, He Z, Ho YL, Beam A, Cho K, Carroll R, Denny J, Kohane I, Liao K, Cai T. Automated grouping of medical codes via multiview banded spectral clustering. J Biomed Inform 2019;100:103322. [PMID: 31672532 PMCID: PMC7261410 DOI: 10.1016/j.jbi.2019.103322] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/01/2019] [Revised: 10/25/2019] [Accepted: 10/27/2019] [Indexed: 01/28/2023]

Abstract

OBJECTIVE

With their increasingly widespread adoption, electronic health records (EHRs) have enabled phenotypic information extraction at an unprecedented granularity and scale. However, often a medical concept (e.g. diagnosis, prescription, symptom) is described in various synonyms across different EHR systems, hindering data integration for signal enhancement and complicating dimensionality reduction for knowledge discovery. Despite existing ontologies and hierarchies, tremendous human effort is needed for curation and maintenance - a process that is both unscalable and susceptible to subjective biases. This paper aims to develop a data-driven approach to automate grouping medical terms into clinically relevant concepts by combining multiple up-to-date data sources in an unbiased manner.

METHODS

We present a novel data-driven grouping approach - multi-view banded spectral clustering (mvBSC) combining summary data from multiple healthcare systems. The proposed method consists of a banding step that leverages the prior knowledge from the existing coding hierarchy, and a combining step that performs spectral clustering on an optimally weighted matrix.

RESULTS

We apply the proposed method to group ICD-9 and ICD-10-CM codes together by integrating data from two healthcare systems. We show grouping results and hierarchies for 13 representative disease categories. Individual grouping qualities were evaluated using normalized mutual information, adjusted Rand index, and F1-measure, and were found to consistently exhibit great similarity to the existing manual grouping counterpart. The resulting ICD groupings also enjoy comparable interpretability and are well aligned with the current ICD hierarchy.

CONCLUSION

The proposed approach, by systematically leveraging multiple data sources, is able to overcome bias while maximizing consensus to achieve generalizability. It has the advantage of being efficient, scalable, and adaptive to the evolving human knowledge reflected in the data, showing a significant step toward automating medical knowledge integration.
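
Grouping quality in this abstract is scored against the existing manual grouping with measures that include normalized mutual information. A minimal sketch of that measure is given below, assuming normalization by the arithmetic mean of the two entropies (other normalizations exist, and the abstract does not say which was used). The function names, the code list, and both groupings are hypothetical toy values.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_information(labels_a, labels_b):
    """Mutual information divided by the mean of the two entropies."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("labelings must be non-empty and of equal length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    mutual_info = sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a, h_b = entropy(labels_a), entropy(labels_b)
    if h_a == 0 or h_b == 0:
        # A single-group labeling carries no information; treat two such
        # labelings as identical and anything else as unrelated.
        return 1.0 if h_a == h_b else 0.0
    return mutual_info / ((h_a + h_b) / 2)


# Toy data (hypothetical): a manual grouping of six codes and an
# automated grouping that splits one manual group in two.
manual_groups = ["g1", "g1", "g1", "g2", "g2", "g2"]
automated_groups = ["a", "a", "b", "c", "c", "c"]
print(round(normalized_mutual_information(manual_groups, automated_groups), 4))
```

Because the score depends only on how items are partitioned, renaming the groups leaves it unchanged.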

Affiliation(s)

Luwan Zhang: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Yichi Zhang: Department of Computer Science and Statistics, University of Rhode Island, Kingston, RI, USA
Tianrun Cai: Division of Rheumatology, Brigham and Women's Hospital, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA
Yuri Ahuja: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Zeling He: Division of Rheumatology, Brigham and Women's Hospital, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA
Yuk-Lam Ho: Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA
Andrew Beam: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Kelly Cho: Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA; Division of Aging, Brigham and Women's Hospital, Boston, MA, USA; Department of Medicine, Harvard Medical School, Boston, MA, USA
Robert Carroll: Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
Joshua Denny: Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
Isaac Kohane: Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Katherine Liao: Division of Rheumatology, Brigham and Women's Hospital, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
Tianxi Cai: Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Division of Population Health and Data Sciences, MAVERIC, VA Boston Healthcare System, Boston, MA, USA; Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA

NimbleMiner. Comput Inform Nurs 2019;37:583-590. [DOI: 10.1097/cin.0000000000000557] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Indexed: 11/27/2022]

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings. J Biomed Inform 2019;90:103096. [PMID: 30654030 DOI: 10.1016/j.jbi.2019.103096] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 01/08/2018] [Revised: 11/27/2018] [Accepted: 12/31/2018] [Indexed: 11/21/2022]

Topaz M, Murga L, Gaddis KM, McDonald MV, Bar-Bachar O, Goldberg Y, Bowles KH. Mining fall-related information in clinical notes: Comparison of rule-based and novel word embedding-based machine learning approaches. J Biomed Inform 2019;90:103103. [PMID: 30639392 DOI: 10.1016/j.jbi.2019.103103] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 06/04/2018] [Revised: 11/14/2018] [Accepted: 12/31/2018] [Indexed: 10/27/2022]

Abstract

BACKGROUND

Natural language processing (NLP) of health-related data is still an expertise-demanding and resource-intensive process. We created a novel, open source rapid clinical text mining system called NimbleMiner. NimbleMiner combines several machine learning techniques (word embedding models and positive only labels learning) to facilitate the process in which a human rapidly performs text mining of clinical narratives, while being aided by the machine learning components.

OBJECTIVE

This manuscript describes the general system architecture and user interface and presents results of a case study aimed at classifying fall-related information (including fall history, fall prevention interventions, and fall risk) in homecare visit notes.

METHODS

We extracted a corpus of homecare visit notes (n = 1,149,586) for 89,459 patients from a large US-based homecare agency. We used a gold standard testing dataset of 750 notes annotated by two human reviewers to compare NimbleMiner's ability to classify documents regarding whether they contain fall-related information with that of a previously developed rule-based NLP system.

RESULTS

NimbleMiner outperformed the rule-based system in almost all domains. The overall F-score was 85.8% compared to 81% by the rule-based system, with the best performance for identifying general fall history (F = 89% vs. F = 85.1% rule-based), followed by fall risk (F = 87% vs. F = 78.7% rule-based), fall prevention interventions (F = 88.1% vs. F = 78.2% rule-based) and fall within 2 days of the note date (F = 83.1% vs. F = 80.6% rule-based). The rule-based system achieved slightly better performance for fall within 2 weeks of the note date (F = 81.9% vs. F = 84% rule-based).

DISCUSSION & CONCLUSIONS

NimbleMiner outperformed other systems aimed at fall information classification, including our previously developed rule-based approach. These promising results indicate that clinical text mining can be implemented without the need for large labeled datasets necessary for other types of machine learning. This is critical for domains with little NLP developments, like nursing or allied health professions.

A survey of word embeddings for clinical text. J Biomed Inform 2019;100S:100057. [DOI: 10.1016/j.yjbinx.2019.100057] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 04/17/2019] [Revised: 08/22/2019] [Accepted: 09/28/2019] [Indexed: 11/22/2022]

Khatua A, Khatua A, Cambria E. A tale of two epidemics: Contextual Word2Vec for classifying twitter streams during outbreaks. Inf Process Manag 2019. [DOI: 10.1016/j.ipm.2018.10.010] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Indexed: 10/28/2022]

A Bayesian Failure Prediction Network Based on Text Sequence Mining and Clustering. Entropy 2018;20:e20120923. [PMID: 33266647 PMCID: PMC7512510 DOI: 10.3390/e20120923] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 10/23/2018] [Revised: 11/27/2018] [Accepted: 11/30/2018] [Indexed: 12/02/2022]

Neural networks for mining the associations between diseases and symptoms in clinical notes. Health Inf Sci Syst 2018;7:1. [PMID: 30588291 DOI: 10.1007/s13755-018-0062-0] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Received: 05/30/2018] [Accepted: 11/08/2018] [Indexed: 01/20/2023] Open

Chen Z, He Z, Liu X, Bian J. Evaluating semantic relations in neural word embeddings with biomedical and general domain knowledge bases. BMC Med Inform Decis Mak 2018;18:65. [PMID: 30066651 PMCID: PMC6069806 DOI: 10.1186/s12911-018-0630-x] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

In the past few years, neural word embeddings have been widely used in text mining. However, the vector representations of word embeddings mostly act as a black box in downstream applications using them, thereby limiting their interpretability. Even though word embeddings are able to capture semantic regularities in free text documents, it is not clear how different kinds of semantic relations are represented by word embeddings and how semantically-related terms can be retrieved from word embeddings.

METHODS

To improve the transparency of word embeddings and the interpretability of the applications using them, in this study, we propose a novel approach for evaluating the semantic relations in word embeddings using external knowledge bases: Wikipedia, WordNet and Unified Medical Language System (UMLS). We trained multiple word embeddings using health-related articles in Wikipedia and then evaluated their performance in the analogy and semantic relation term retrieval tasks. We also assessed if the evaluation results depend on the domain of the textual corpora by comparing the embeddings of health-related Wikipedia articles with those of general Wikipedia articles.

RESULTS

Regarding the retrieval of semantic relations, we were able to retrieve diverse semantic relations in the nearest neighbors of a given word. Meanwhile, the two popular word embedding approaches, Word2vec and GloVe, obtained comparable results on both the analogy retrieval task and the semantic relation retrieval task, while dependency-based word embeddings had much worse performance in both tasks. We also found that the word embeddings trained with health-related Wikipedia articles obtained better performance in the health-related relation retrieval tasks than those trained with general Wikipedia articles.

CONCLUSION

It is evident from this study that word embeddings can group terms with diverse semantic relations together. The domain of the training corpus does have impact on the semantic relations represented by word embeddings. We thus recommend using domain-specific corpus to train word embeddings for domain-specific text mining tasks.
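
The analogy task mentioned in this abstract is commonly posed as "a is to b as c is to ?" and answered with vector arithmetic. The sketch below uses the vector-offset formulation (pick the word closest to b - a + c), which is one common choice and is assumed here rather than taken from the paper. The vocabulary and its 2-dimensional vectors are hand-picked toy values, and the function names are hypothetical.

```python
from math import sqrt


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for zero vectors
    return dot / (norm_u * norm_v)


def solve_analogy(embeddings, a, b, c):
    """Return the word closest to b - a + c, excluding a, b, and c."""
    target = [
        vb - va + vc
        for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [word for word in embeddings if word not in {a, b, c}]
    return max(
        candidates,
        key=lambda word: cosine_similarity(target, embeddings[word]),
    )


# Toy embeddings (hypothetical, hand-picked so the offsets line up).
toy_embeddings = {
    "insulin": [1.0, 1.0],
    "diabetes": [1.0, 3.0],
    "antibiotic": [3.0, 1.0],
    "infection": [3.0, 3.0],
    "fever": [0.5, 3.0],
    "hospital": [2.0, 0.5],
}
# insulin : diabetes :: antibiotic : ?
print(solve_analogy(toy_embeddings, "insulin", "diabetes", "antibiotic"))
```

The three query words are excluded from the candidates because the nearest neighbor of b - a + c is otherwise often one of the inputs themselves.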

Westergaard D, Stærfeldt HH, Tønsberg C, Jensen LJ, Brunak S. A comprehensive and quantitative comparison of text-mining in 15 million full-text articles versus their corresponding abstracts. PLoS Comput Biol 2018;14:e1005962. [PMID: 29447159 PMCID: PMC5831415 DOI: 10.1371/journal.pcbi.1005962] [Citation(s) in RCA: 87] [Impact Index Per Article: 14.5] [Received: 07/05/2017] [Revised: 02/28/2018] [Accepted: 01/05/2018] [Indexed: 12/21/2022] Open